Archive for Uncategorized

Bad IO/IR Case That I Found

I recently developed a Ruby wrapper for the API of one of the largest internet portals in Korea. Most of their RESTful APIs were well organized as RSS or XML, but there was one interestingly bad case.

http://dev.naver.com/openapi/sample/rank.xml

If you open the link above (please ignore the Korean parts), you will see an XML file. The file lists the keywords people are searching for most in real time, ordered by rank. Look at the tags enclosing each keyword: the element names are “R1”, “R2”, “R3” and so on, so the rank is encoded in the element name itself. From the perspective of a 202er, it should be corrected to something like the following.

<result>
  <items>
    <item>
      <rank>1</rank>
      <keyword>ischool</keyword>
      <change>+32</change>
    </item>
    …..
  </items>
</result>

Or, at the very least, they should use an attribute to describe the rank instead of encoding it in the element name.
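
To make the complaint concrete, here is a minimal Python sketch of what a client has to do in each case. Both sample documents are made up for illustration (they are not the real API response, and the child element names are assumptions). With numbered element names the rank has to be dug out of the tag with a pattern match; with an attribute it is ordinary data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the "R1, R2, ..." style; NOT the real API response.
NUMBERED = """
<result>
  <R1><keyword>ischool</keyword><change>+32</change></R1>
  <R2><keyword>berkeley</keyword><change>-1</change></R2>
</result>
""".strip()

# The same made-up data with the rank carried as an attribute.
ATTRIBUTE = """
<result>
  <items>
    <item rank="1"><keyword>ischool</keyword><change>+32</change></item>
    <item rank="2"><keyword>berkeley</keyword><change>-1</change></item>
  </items>
</result>
""".strip()


def parse_numbered(xml_text):
    """Recover (rank, keyword) pairs by pattern-matching on tag names."""
    pairs = []
    for child in ET.fromstring(xml_text):
        match = re.fullmatch(r"R(\d+)", child.tag)
        if match:  # the rank is buried inside the element name
            pairs.append((int(match.group(1)), child.findtext("keyword")))
    return sorted(pairs)


def parse_attribute(xml_text):
    """Read the rank as plain data; one element name covers every item."""
    root = ET.fromstring(xml_text)
    return sorted(
        (int(item.get("rank")), item.findtext("keyword"))
        for item in root.iter("item")
    )
```

Both functions return the same list of pairs, but only the second works with a fixed schema: the first has to accept an open-ended family of element names.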


The Library of Congress releases a report on the success of Flickr Commons

The Library of Congress has released a report discussing the results of their experiment to put a few thousand historical photos on Flickr and allow users to add tags, comments, and notes on the photos. They’ve deemed the project a success, gathering lots of additional information about photos including personal stories from commenters’ family histories. The LOC has employees verify user-contributed information such as details on subject or location before adding it to the official description.

The report does mention some concern with the presence of rudeness or snarkiness that results when you open a project to the public: “Notes (annotations left directly on the photos) have some utility, such as pointing out specific persons in a crowd or deciphering the words on a sign or placard. Notes are also a means of adding graffiti-type messages and smart-aleck humor to the images, which is a cause for some concern among Flickr members and Library staff.”

Link: Library of Congress Blog

On an unrelated note, here’s a comic depicting an alternative to the method we discussed in class for calculating the impact of a researcher’s work based on their citations:


http://www.phdcomics.com/comics/archive.php?comicid=1108


organizing our future | cleaning up art

<body intent:sarcasm>

<Introduction tone:praise>
An amazing presentation on the benefits of putting 202 into art. Using these methods artists can now estimate the amount of color they would need for their future works.
</Introduction>

<illustration> By creating ontologies and structurally separating elements in Art, the interoperability between artists would increase greatly. The BIG RED button would help us generate a 25% Monet + 75% Blake. </illustration>

<Dream> And the best part of all – Computers would be able to make sense of Art, and make decisions about it, just like they should be able to select the best doctor for Lucy (Tim Berners-Lee 2001)</Dream>

ART: I want it organized, just like I want my interface designed by an algorithm, and my Moby Dick in XML. 

amen.

</body intent:sarcasm>

<link: ted.com foolishTag:to-share>

“…Ursus Wehrli shares his vision for a cleaner, more organized, tidier form of art — by deconstructing the paintings of modern masters into their component pieces, sorted by color and size.”

http://www.ted.com/index.php/talks/ursus_wehrli_tidies_up_art.html

</link>


The Silicon Tower

BBC News’s Aleks Krotoski has a thought-provoking op-ed piece about how technophilically skewed the bulk of the internet really is. Her observation is based on spending some time with people who simply don’t use the web. She points out that they are not luddites but people who have simply found that the web doesn’t speak their language, doesn’t share their ways of structuring information. She mentions issues with search facilities like Google, but she also points out that even approaches meant to be more democratic (e.g., the semantic web, or facilities based on the intelligence of the masses) fall short for people who are not technologically oriented because the creators of web sites and the presumed intelligent masses are dominated by technophiles. For us 202ers, of course, the differences in how people organize information are nothing new, but it’s good to remind ourselves now and then that, as aware of the differences as we are, we are ourselves members of a particular community of thought. We at the iSchool are, I think, too focused on serving the needs of society to be considered residents of the traditional ivory tower; we live instead in a silicon one.


Government Agencies Told to Interoperate Are Like Children in a Sandbox

An interesting article on Network World, found via Slashdot, about government agencies refusing to cooperate and interoperate.

Like a bunch of children in a sandbox unable and perhaps unwilling to share their toys, multiple key government agencies cannot or will not cooperate to build a collaborative wireless network.

The Government Accountability Office report issued today took aim at the Departments of Justice, Homeland Security, and the Treasury which had intended what’s known as The Integrated Wireless Network (IWN) to be a joint radio communications system to improve communication among law enforcement agencies.

However IWN, which has already cost millions of dollars, is no longer being pursued as a joint development project, the GAO said. By abandoning collaboration on a joint implementation, the departments risk duplication of effort and inefficient use of resources as they continue to invest significant resources in independent solutions. Further, these efforts will not ensure the interoperability needed to serve day-to-day law enforcement operations or a coordinated response to terrorist or other events, the GAO said.

Sounds like a real problem… someone go work for these guys! Apparently the DOJ has already spent $195 million on it… as Bob says, this is hard stuff!


Auto-clustering of UC Berkeley courses

This may be my last post for the 202 blog.

This semester I took a statistical learning theory course in the EECS department. The course provides an introduction to probabilistic models and requires a final project. For mine, I chose unsupervised clustering of UC Berkeley courses based on their descriptions.

The background for this task is as follows. UC Berkeley provides an online course search system, but it is very basic. It only lets users search by course name, instructor name, department and so on; we can’t search for keywords in course descriptions.

I believe, first of all, that it should provide keyword search over course descriptions. It would also be desirable for the system to include a recommendation system that gives students lists of courses likely to suit them, based on their registration histories (perhaps like Amazon’s recommendation system, a kind of “folksonomy” for classifying courses).
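
As a rough idea of what keyword search over descriptions involves, here is a minimal inverted-index sketch in Python. The course identifiers and descriptions are invented for illustration, and the function names are my own; this is not how the actual Berkeley system works.

```python
import re
from collections import defaultdict

# Invented course descriptions, for illustration only.
COURSES = {
    "info-202": "Organization and retrieval of information",
    "cs-281a": "Statistical learning theory and probabilistic models",
    "example-stat": "Concepts of statistics and statistical inference",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(courses):
    """Map each term to the set of course ids whose description contains it."""
    index = defaultdict(set)
    for course_id, description in courses.items():
        for term in tokenize(description):
            index[term].add(course_id)
    return index


def search(index, query):
    """Return the courses whose description contains every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)
```

For example, `search(build_index(COURSES), "statistical learning")` only matches the description that contains both terms.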

To implement a recommendation system based on students’ registration histories, Berkeley’s courses would need to be clustered by those histories in an unsupervised manner.

I don’t have access to students’ registration histories, so I used the online course descriptions in the current system as a substitute. I tried to cluster UC Berkeley courses based on these descriptions using a mixture of multivariate Bernoulli distributions fitted with the EM algorithm (for details, see “Introduction to Information Retrieval”, pp. 338-340), and tested whether the courses could be clustered in a reasonable and explainable way without supervision.
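
For readers who want to see the shape of the method, here is a minimal pure-Python sketch of EM for a mixture of multivariate Bernoullis over binary term vectors. It is not the code used for the project; the function names, the add-one smoothing, the random initialization and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random


def to_binary_vectors(texts):
    """Turn each text into a 0/1 vector over the sorted vocabulary."""
    token_sets = [set(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*token_sets))
    vectors = [[1 if w in tokens else 0 for w in vocab] for tokens in token_sets]
    return vocab, vectors


def em_bernoulli_mixture(docs, k, iterations=50, seed=0):
    """Soft-cluster binary vectors; returns (priors, term_probs, resp)."""
    rng = random.Random(seed)
    n, m = len(docs), len(docs[0])
    # Start from random soft assignments; each row sums to 1.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iterations):
        # M-step: re-estimate cluster priors and per-term probabilities.
        # Add-one smoothing (an assumption here) keeps them off 0 and 1.
        weights = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        priors = [(w + 1.0) / (n + k) for w in weights]
        term_probs = [
            [
                (sum(resp[i][c] * docs[i][t] for i in range(n)) + 1.0)
                / (weights[c] + 2.0)
                for t in range(m)
            ]
            for c in range(k)
        ]
        # E-step: posterior of each cluster given the document (log space).
        for i in range(n):
            logs = []
            for c in range(k):
                lp = math.log(priors[c])
                for t in range(m):
                    p = term_probs[c][t]
                    lp += math.log(p if docs[i][t] else 1.0 - p)
                logs.append(lp)
            top = max(logs)
            unnorm = [math.exp(lp - top) for lp in logs]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return priors, term_probs, resp
```

Each document ends up with a row of cluster responsibilities; taking the largest entry per row gives a hard cluster label. Only the presence or absence of a term matters, not how often it occurs, which is what distinguishes the Bernoulli model from a multinomial one.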

The result is as follows. I tried grouping the courses of the Math, Information, Statistics, Economics and Computer Science departments (226 courses in total) into 7 clusters. Several of the clusters generated by the algorithm are explainable, such as a “Statistical/Mathematical methodology-oriented course cluster”, a “Programming related course cluster”, an “Individual study/Seminar related course cluster” and an “Economics related (but less statistics oriented) course cluster”.

Of course, not all courses are explainable by these labels. Still, although I applied very basic methods without standard information-retrieval preprocessing such as lemmatization, stemming and stop-word removal, the results are better than I expected before conducting the experiment.

1st cluster courses

2nd cluster courses

3rd cluster courses

4th cluster courses

This is a back-of-the-envelope simulation and the result is simple. There are also some problems. But it does show a result, and, more importantly, for me this task integrated Info202 (information organization and retrieval), Info206 (network programming and Java) and CS281A (statistical learning theory, esp. the EM algorithm). I am satisfied with the result and with the fact that I was able to implement this simulation by myself in a short time.


My last 202 blog post

The Beer Judge Certification Program (no, really) has developed a set of guidelines (read: categories and vocabulary) for judging beer. They even have it in downloadable XML format. They have developed (what they feel is) an authoritative vocabulary describing the various qualities of beer within their defined categories. Interestingly, they recognize that brewing styles change, so their vocabulary is descriptive rather than prescriptive and will change over time. The organization also states that they use “experts” to choose the commercial examples (of the types of beer listed) instead of online surveys in order to keep “popularity contests” from overwhelming the list.

So, they have taken a more Svenonian approach to BeerML. However, I’ve never heard of these folks before (though, I am intrigued and have installed their iPhone app already). I find it interesting that one of my first questions upon reading their site (aside from how do I get in on this) was “What makes them an authority?” I have searched their site and find no association with any governing body. It seems like a bunch of folks trying to develop an authority on their own. Somewhat self-policing. I’d look more, but need to get back to writing my CMC paper and studying vector modeling. At least now I have vocabulary to use for describing the glass of awesome that is Belgian Witbier.


Expert organization vs algorithm retrieval

I’d like to add to our discussion about expert-based organization and algorithm-based retrieval. I ran across this article on arstechnica.com about a movie recommendation site in beta called Clerk Dogs. Similar to Pandora’s use of the Music Genome Project, Clerk Dogs makes movie recommendations based on what they call a movie’s DNA. Its creators believe that algorithm-based recommendations on sites like Netflix can be improved by using expert-based organization instead. They assembled a group of “movie experts” to create a movie’s DNA based on a number of categories like character depth, suspense, violence, etc.

My favorite part of the article was, in pure 202 homage, when creators explained that one of their tools only works for crime and suspense movies because using humans for this kind of work is a tedious and time consuming process.

ClerkDogs screenshot

http://arstechnica.com/news.ars/post/20081209-clerk-dogs-brings-video-store-guy-into-your-home.html


Cobwebs

Something we haven’t spent a whole lot of time talking about this semester is information and information systems getting old. I was listening to a speech by Clay Shirky a few days ago on this subject, and he brings up a lot of interesting thoughts.

His classic example is of people trying to recover a computer 50 years from now. Everything is going great until they find a wobbly thing with two prongs on the end. “What’s this?” one researcher asks the other. “I think electricity used to go down those from the wall to the ‘computer,’” responds the other researcher. The point, in case you haven’t caught it yet, is that in order to archive data, we have to somehow archive data formats. And to archive data formats, at some point, we have to archive hardware. Once we’re archiving hardware, at some point, we have to archive the support system for it, namely the electrical grid.

It’s an interesting problem that I’ve heard a couple of different theories about. One, from Ray Kurzweil, Singularitarian Extraordinaire, is that data will exist and be readable for as long as we care about it. That’s true to some extent, but it doesn’t really take into account the cost of curating that data. At some point, we could care, but just not enough.

Another problem with keeping data around is digital cobwebs. As an example, I see that Akismet has blocked about 140 comment spams from this very blog. Good for it. I don’t think it has let many through yet, but maybe somebody is dusting up, and I’m not aware of it. Whatever the case, I’m sure that in a year, once this blog has been forsaken, cobwebs will abound. It’s sad, really.


History of the DSM IV

In this course we’ve been thinking about all kinds of catalogs and dictionary formats. I read this fascinating article almost 4 years ago, and thought it would be good to share with 202-ers as an example of a very important, but different, genre. It’s the incredible story of Psychology’s DSM IV (Diagnostic and Statistical Manual of Mental Disorders), the manual which gives an identifier and a prose description for each psychological problem, and a checklist of symptoms that should be present in order to justify a diagnosis. These identifiers are used by insurance companies and doctors around the world.

The story is incredible because of its somewhat arbitrary construction(!), and problems faced that are similar to what we’ve been challenged with in this class. The DSM IV’s present form was heavily influenced by one person, Robert Spitzer at Columbia University.  The vocabulary problem and semantic gap make their appearances in the form of standardizing definitions to reduce these variances:

“informational variance” — “different doctors get different information from the same patient.”
“interpretive variance” — “each doctor carries in his mind his own definition of what a specific disease looks like.”

Hope you find it interesting!

http://www.newyorker.com/archive/2005/01/03/050103fa_fact?printable=true

