Archive for December, 2008

Government Agencies Told to Interoperate Are Like Children in a Sandbox

An interesting article on Network World, found via Slashdot, about government agencies refusing to cooperate and interoperate.

Like a bunch of children in a sandbox unable and perhaps unwilling to share their toys, multiple key government agencies cannot or will not cooperate to build a collaborative wireless network.

The Government Accountability Office report issued today took aim at the Departments of Justice, Homeland Security, and the Treasury which had intended what’s known as The Integrated Wireless Network (IWN) to be a joint radio communications system to improve communication among law enforcement agencies.

However, IWN, which has already cost millions of dollars, is no longer being pursued as a joint development project, the GAO said. By abandoning collaboration on a joint implementation, the departments risk duplication of effort and inefficient use of resources as they continue to invest significant resources in independent solutions. Further, these efforts will not ensure the interoperability needed to serve day-to-day law enforcement operations or a coordinated response to terrorist or other events, the GAO said.

Sounds like a real problem… someone go work for these guys! Apparently the DOJ has already spent $195 million on it… as Bob says, this is hard stuff!


Auto-clustering of UC Berkeley courses

This may be my last post for the 202 blog.

I took a statistical learning theory course in the EECS department this semester. The course provides an introduction to probabilistic models and requires students to do a final project. For mine, I chose unsupervised clustering of UC Berkeley courses based on their descriptions.

The background for this task is as follows. UC Berkeley provides an online course search system, but it is very basic. It only lets users search by course name, instructor name, department, and so on. We can’t search for keywords in course descriptions.

First of all, I believe it should provide keyword search over course descriptions. It would also be desirable for the system to include a “recommendation system” that gives students lists of courses likely to suit them, based on their course registration histories (perhaps like Amazon’s recommendation system, a kind of “folksonomy” for classifying courses).

To implement a recommendation system based on students’ course registration histories, Berkeley’s courses would need to be clustered by those histories in an unsupervised manner.

I can’t use students’ registration histories. So, as a substitute, I use the online course descriptions in the current system and try to cluster UC Berkeley courses based on them, using a mixture of multivariate Bernoulli distributions fitted with the EM algorithm (for details, see the text “Introduction to Information Retrieval”, pp. 338-340), and test whether I can cluster UC Berkeley’s courses reasonably and explainably in an unsupervised manner.
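For readers who want to see the idea in code, here is a minimal sketch of EM for a mixture of multivariate Bernoulli distributions. It is not the code from my project: the function name `fit_bernoulli_mixture`, the toy documents, the add-one smoothing, and the deterministic starting assignment (document i starts in cluster i % k) are all assumptions made for illustration. Each document is treated as a set of terms, so only the presence or absence of a word matters.

```python
import math

def fit_bernoulli_mixture(docs, k, iters=30, smooth=1.0):
    """Soft-cluster documents (sets of terms) with EM; returns (labels, resp)."""
    vocab = sorted(set().union(*docs))
    # Binary term-presence vector for each document.
    x = [[1 if term in doc else 0 for term in vocab] for doc in docs]
    n = len(x)
    # Illustrative deterministic start: document i begins in cluster i % k.
    resp = [[1.0 if j == i % k else 0.0 for j in range(k)] for i in range(n)]
    for _ in range(iters):
        # M-step: re-estimate cluster priors and per-term Bernoulli parameters.
        sizes = [sum(resp[i][j] for i in range(n)) for j in range(k)]
        prior = [(sizes[j] + smooth) / (n + k * smooth) for j in range(k)]
        q = [[(sum(resp[i][j] * x[i][t] for i in range(n)) + smooth)
              / (sizes[j] + 2 * smooth)
              for t in range(len(vocab))] for j in range(k)]
        # E-step: posterior cluster membership per document, in log space.
        for i in range(n):
            logp = []
            for j in range(k):
                lp = math.log(prior[j])
                for t, present in enumerate(x[i]):
                    lp += math.log(q[j][t] if present else 1.0 - q[j][t])
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            resp[i] = [w / total for w in weights]
    # Hard label = most probable cluster for each document.
    labels = [max(range(k), key=lambda j: resp[i][j]) for i in range(n)]
    return labels, resp

# Hypothetical toy "course descriptions", already reduced to sets of terms.
docs = [
    {"regression", "probability"},
    {"probability", "inference"},
    {"regression", "inference"},
    {"java", "programming"},
    {"programming", "network"},
    {"java", "network"},
]
labels, resp = fit_bernoulli_mixture(docs, k=2)
print(labels)
```

On this toy data the three statistics-flavored documents should end up in one cluster and the three programming-flavored ones in the other. Real course descriptions are much noisier, and EM can settle into different local optima depending on how it is started.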

The result is as follows. I tried clustering the courses of the Math, Information, Statistics, Economics, and Computer Science departments (226 courses in total) into 7 clusters. Several of the clusters generated by the algorithm are explainable, such as a “Statistical/Mathematical methodology-oriented course cluster”, a “Programming related course cluster”, an “Individual study/Seminar related course cluster”, and an “Economics related (but less statistics oriented) course cluster”.

Of course, not all courses are explained by these labels. But although I applied very basic methods, without standard information-retrieval preprocessing such as lemmatization, stemming, and stop-word removal, the results are better than I expected before running the experiment.

1st cluster courses

2nd cluster courses

3rd cluster courses

4th cluster courses

This is a back-of-the-envelope simulation and the result is simple. There are also some problems. But it does produce a reasonable result and, more importantly, for me it is a task that integrates Info202 (information organization and retrieval), Info206 (network programming and Java), and CS281A (statistical learning theory, esp. the EM algorithm). I am satisfied with the result and with the fact that I was able to implement this simulation by myself in a short time.


My last 202 blog post

The Beer Judge Certification Program (no, really) has developed a set of guidelines (read: categories and vocabulary) for judging beer. They even offer it in downloadable XML format. They have developed (what they feel is) an authoritative vocabulary describing the various qualities of beer within their defined categories. Interestingly, they recognize that brewing styles change, so their vocabulary is descriptive rather than prescriptive and will change over time. The organization also states that they use “experts” to choose the commercial examples (of the types of beer listed) instead of online surveys, in order to keep “popularity contests” from overwhelming the list.

So, they have taken a more Svenonian approach to BeerML. However, I’ve never heard of these folks before (though I am intrigued and have already installed their iPhone app). I find it interesting that one of my first questions upon reading their site (aside from “how do I get in on this?”) was “What makes them an authority?” I have searched their site and found no association with any governing body. It seems like a bunch of folks trying to develop an authority on their own. Somewhat self-policing. I’d look more, but I need to get back to writing my CMC paper and studying vector modeling. At least now I have vocabulary to use for describing the glass of awesome that is Belgian Witbier.


Expert organization vs algorithm retrieval

I’d like to add to our discussion about expert-based organization and algorithm-based retrieval. I ran across this article on arstechnica.com about a movie recommendation site in beta called Clerk Dogs. Similar to Pandora’s use of the Music Genome Project, Clerk Dogs makes movie recommendations based on what they call a movie’s DNA. The creators believe that algorithm-based recommendations on sites like Netflix can be improved by using expert-based organization instead. They assembled a group of “movie experts” to create each movie’s DNA based on a number of categories like character depth, suspense, violence, etc.
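To make the “movie DNA” idea concrete, here is a small sketch of how expert-assigned scores could drive recommendations. I don’t know how Clerk Dogs actually computes similarity, so everything here is an assumption: the function name `nearest_titles`, the made-up titles and scores, and the choice of plain Euclidean distance between attribute vectors.

```python
import math

def nearest_titles(target, profiles, top_n=2):
    """Rank other titles by Euclidean distance between attribute scores."""
    base = profiles[target]
    scored = []
    for title, scores in profiles.items():
        if title == target:
            continue
        dist = math.sqrt(sum((scores[a] - base[a]) ** 2 for a in base))
        scored.append((dist, title))
    # Smallest distance first, i.e. most similar "DNA" first.
    return [title for _, title in sorted(scored)[:top_n]]

# Hypothetical expert scores (0-10) for made-up titles.
profiles = {
    "Heist Film": {"character_depth": 4, "suspense": 9, "violence": 6},
    "Slow-Burn Thriller": {"character_depth": 6, "suspense": 8, "violence": 5},
    "Family Comedy": {"character_depth": 5, "suspense": 1, "violence": 0},
    "Courtroom Drama": {"character_depth": 9, "suspense": 5, "violence": 1},
}
print(nearest_titles("Heist Film", profiles))
```

The ranking step is trivial. The expensive part is the human work of filling in `profiles`, which is exactly the trade-off between expert organization and algorithmic retrieval.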

My favorite part of the article was, in pure 202 homage, when the creators explained that one of their tools only works for crime and suspense movies because using humans for this kind of work is a tedious and time-consuming process.

ClerkDogs screenshot

http://arstechnica.com/news.ars/post/20081209-clerk-dogs-brings-video-store-guy-into-your-home.html


Cobwebs

Something we haven’t spent a whole lot of time talking about this semester is information and information systems getting old. I was listening to a speech by Clay Shirky a few days ago on this topic, and he raises a lot of interesting points.

His classic example is of people trying to recover a computer 50 years from now. Everything is going great until they find a wobbly thing with two prongs on the end. “What’s this?” one researcher asks the other. “I think electricity used to go down those from the wall to the ‘computer,’” responds the other researcher. The point, in case you haven’t caught it yet, is that in order to archive data, we have to somehow archive data formats. And to archive data formats, at some point, we have to archive hardware. Once we’re archiving hardware, at some point, we have to archive the support system for it, namely the electrical grid.

It’s an interesting problem that I’ve heard a couple of different theories about. One, from Ray Kurzweil, Singularitarian Extraordinaire, is that data will exist and be readable for as long as we care about it. That’s true to some extent, but it doesn’t really take into account the cost of curating that data. At some point, we could care, but just not enough.

Another problem with keeping data around is digital cobwebs. As an example, I see that Akismet has blocked about 140 spam comments on this very blog. Good for it. I don’t think it has let many through yet, but maybe somebody is dusting up, and I’m not aware of it. Whatever the case, I’m sure that in a year, once this blog has been forsaken, cobwebs will abound. It’s sad, really.


History of the DSM IV

In this course we’ve been thinking about all kinds of catalog and dictionary formats. I read this fascinating article almost 4 years ago and thought it would be good to share with 202-ers as an example of a very important but different genre. It’s the incredible story of psychology’s DSM IV (Diagnostic and Statistical Manual of Mental Disorders), the manual that gives an identifier and a prose description for each psychological problem, along with a checklist of symptoms that should be present in order to justify a diagnosis. These identifiers are used by insurance companies and doctors around the world.

The story is incredible because of the manual’s somewhat arbitrary construction(!) and because the problems its authors faced are similar to those we’ve been challenged with in this class. The DSM IV’s present form was heavily influenced by one person, Robert Spitzer at Columbia University. The vocabulary problem and the semantic gap make their appearances in the effort to standardize definitions and reduce the following variances:

“informational variance”: “different doctors get different information from the same patient.”
“interpretive variance”: “each doctor carries in his mind his own definition of what a specific disease looks like.”

Hope you find it interesting!

http://www.newyorker.com/archive/2005/01/03/050103fa_fact?printable=true


Hierarchically structured logic, thinking and communication

When I was in a business case study club as an undergraduate, the members were quite obsessive about following a method for structuring logic and communicating with each other. The method is called Minto’s Pyramid Principle, and it seems to be widely adopted by major management consulting companies such as McKinsey & Co. (It is related to the “MECE Principle”: Mutually Exclusive, Collectively Exhaustive.) The person who coined the principle was herself a consultant at McKinsey a few decades ago, specializing in business communication.

The reason I am bringing this up is that I realized this principle looks pretty similar to the principles we studied for devising our own vocabulary in assignment 3 or 4. The principle is quite simple to explain. It argues that people should communicate using pyramid-shaped logic, especially in a business environment. Let me walk through the process. See the picture below.

Let’s say you are consulting for a company. You are trying to argue that the reason the company has been suffering recently is that it is losing the customer base of Product A. Without structure, every reason will just be enumerated linearly. The listener will be unsure whether the investigation is comprehensive and whether there are any logical gaps or leaps. You need to structure your logic into a hierarchy, so that the listener can get the point easily and you and the client can be on the same page.

There are two dimensions in the pyramid principle: the horizontal dimension and the vertical dimension of the logic. The horizontal dimension checks whether each layer of your logic is comprehensive without any overlap. The vertical dimension guarantees that your logic does not contain any logical jumps. The lower layer should be able to answer the “Why so?” question raised by the upper layer’s argument. Conversely, the upper layer should be derived by asking “So what?” after collecting the lower-layer arguments. By doing this, you can build comprehensive, efficient, and leap-proof logic for your presentation.
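The horizontal check (no overlap, nothing missing) is easy to express in code. Here is a minimal sketch, assuming each layer can be modeled as sets of items: a parent set and child sets that are supposed to partition it. The function name `check_mece` and the customer IDs are made up for illustration. The vertical “Why so? / So what?” check is a judgment call that code can’t really capture.

```python
def check_mece(parent, children):
    """Return (overlaps, gaps) for child sets that should partition parent."""
    seen = set()
    overlaps = set()
    for child in children:
        overlaps |= seen & child   # items already claimed by an earlier child
        seen |= child
    gaps = parent - seen           # items that no child accounts for
    return overlaps, gaps

# Hypothetical lost customers of Product A, grouped by reason for leaving.
lost_customers = {"c1", "c2", "c3", "c4", "c5", "c6"}
by_reason = [
    {"c1", "c2"},        # switched to a competitor
    {"c2", "c3", "c4"},  # price too high
    {"c5"},              # product no longer needed
]
overlaps, gaps = check_mece(lost_customers, by_reason)
print(overlaps, gaps)
```

Here `c2` appears under two reasons (not mutually exclusive) and `c6` appears under none (not collectively exhaustive), so this layer would need to be regrouped before moving up the pyramid.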

Doesn’t this process sound quite similar to the information organization principles we used this semester for designing vocabularies?

This principle is very useful when you present your ideas to other people and try to persuade them, although the example above was extremely simple. Since they are probably not on the same page as you, you need well-structured logic like this. If you train yourself well in this principle, I believe it will definitely help you with everything from sending a flyer for your party to writing a paper. (I found a link explaining the MECE framework. Here.)


Minnesota Public Radio does broadcast information right

After looking at the YES API in class on Monday and hearing everyone’s complaints that it couldn’t possibly be right, I wanted to share an example of great radio broadcast listings. Minnesota Public Radio has complete listings of music played on both their Classical and Current (modern) stations. The default view is all music played on the current day, broken up by hour, with links to purchase the albums. In addition, you can search for songs played as long ago as 2005 or 2006, in case you want to find something you heard a while ago.


No more Captcha?

This article from Forbes describes a possible end to captchas using “metadata.” The gist is that a company claims to have come up with a way to tell, from your actions on a website (browsing, clicking, mouse movements), whether you are a human or a bot. This actually touches on 3 classes I’m taking this semester: metadata as implicit information we provide to websites (202 and ISSD) and raising the cost of getting around captchas to the point of being too great (290-7). Just in case I’m not being obvious enough, studying at the iSchool is an amazing and eye-opening experience.


Obama’s Forged Birth Certificate

Heather Dolan mentioned today during her demo of Photoshop metadata that it can be used to detect image tampering.  There is a too-compelling story going around that Obama isn’t eligible to be president because he wasn’t a natural-born US citizen, and after re-reading it I’m starting to be convinced that the records of his Hawaii birth that have been proposed as rebuttal evidence are in fact forgeries.

http://www.freerepublic.com/focus/f-bloggers/2136816/posts

I don’t know how to feel about this (we need Obama to become president), but this analysis of the file formats, metadata, and editing patterns of the documents sure looks convincing to me.

