DSM-V – Somewhat arbitrary?

Earlier in the semester, I posted about how the DSM-IV classification system, which is based on psychological characteristics, was created. Today’s New York Times addresses the latest release, the DSM-V. It’s an ideal case study for the classification themes we’ve reviewed all semester:

“In psychiatry no one knows the causes of anything, so classification can be driven by all sorts of factors”: political, social and financial.

“What you have in the end,” Mr. Shorter said, “is this process of sorting the deck of symptoms into syndromes, and the outcome all depends on how the cards fall.”


N grows over time..


Auto-clustering of UC Berkeley courses

This may be my last post for the 202 blog.

I took a statistical learning theory course in the EECS department this semester. The course is an introduction to probabilistic models and requires a final project. For mine, I chose unsupervised clustering of UC Berkeley courses based on their descriptions.

Here is the background. UC Berkeley provides an online course search system, but it is very basic. It only lets users search by course name, instructor name, department, and so on. We can’t search for keywords in course descriptions.

First of all, I believe it should provide keyword search over course descriptions. It would also be desirable for the system to include a “recommendation system” that gives students a list of courses likely to suit them, based on their registration histories (something like Amazon’s recommendation system, a kind of “folksonomy” for classifying courses).

To implement a recommendation system based on students’ registration histories, Berkeley’s courses would need to be clustered by those histories in an unsupervised manner.

I don’t have access to students’ registration histories. So, as a substitute, I used the online course descriptions from the current system and tried to cluster UC Berkeley courses based on them, using a mixture of multivariate Bernoulli distributions fit with the EM algorithm (for details, see the textbook “Introduction to Information Retrieval”, pp. 338-340). The goal was to test whether I could cluster UC Berkeley’s courses in an unsupervised manner in a way that is reasonable and explainable.
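For readers unfamiliar with the model, here is a minimal sketch of EM for a mixture of multivariate Bernoulli distributions. It is not the project's actual code (that was written for the course and is not shown here). The function name `fit_bernoulli_mixture`, the smoothing constant `eps`, the fixed iteration count, and the deterministic "document i starts in cluster i % k" initialization are all assumptions made for illustration. Each course description is reduced to the set of terms it contains, so only presence or absence of a term matters.

```python
import math


def fit_bernoulli_mixture(docs, k, iters=30, eps=0.01):
    """Cluster documents with a mixture of multivariate Bernoulli models via EM.

    docs: list of sets of terms (only presence/absence of a term is used).
    Returns (vocab, priors, term_probs, resp); resp[i][c] is the estimated
    probability that document i belongs to cluster c.
    """
    if k < 2 or iters < 1:
        raise ValueError("this sketch expects k >= 2 and iters >= 1")
    vocab = sorted(set().union(*docs))
    n, m = len(docs), len(vocab)
    # Binary document-term matrix: x[i][j] = 1 if term j occurs in document i.
    x = [[1 if term in doc else 0 for term in vocab] for doc in docs]

    # Assumed initialization: softly assign document i to cluster i % k.
    resp = [[0.9 if c == i % k else 0.1 / (k - 1) for c in range(k)]
            for i in range(n)]

    for _ in range(iters):
        # M-step: re-estimate cluster priors and per-cluster term probabilities.
        # eps is additive smoothing so no probability is exactly 0 or 1.
        weight = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        priors = [(weight[c] + eps) / (n + k * eps) for c in range(k)]
        term_probs = [
            [(sum(resp[i][c] * x[i][j] for i in range(n)) + eps)
             / (weight[c] + 2 * eps)
             for j in range(m)]
            for c in range(k)
        ]
        # E-step: recompute responsibilities in log space for stability.
        for i in range(n):
            log_p = []
            for c in range(k):
                lp = math.log(priors[c])
                for j in range(m):
                    q = term_probs[c][j]
                    lp += math.log(q if x[i][j] else 1.0 - q)
                log_p.append(lp)
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return vocab, priors, term_probs, resp
```

To get hard clusters, assign each document to the cluster with the largest responsibility. EM only finds a local optimum, so in practice one would probably try several initializations and compare the results.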

The results are as follows. I tried grouping courses from the Math, Information, Statistics, Economics, and Computer Science departments (226 courses in total) into 7 clusters. Several of the clusters generated by the algorithm are explainable, such as a “Statistical/Mathematical methodology-oriented course cluster”, a “Programming-related course cluster”, an “Individual study/Seminar-related course cluster”, and an “Economics-related (but less statistics-oriented) course cluster”.

Of course, not all courses are explained by these labels. Still, although I applied very basic methods, without standard information-retrieval preprocessing such as lemmatization, stemming, or stop-word removal, the results are better than I expected before running the experiment.

1st cluster courses

2nd cluster courses

3rd cluster courses

4th cluster courses

This is a back-of-the-envelope simulation and the result is simple. There are also some problems. But it does show a result, and, more importantly, for me it is a task that integrates Info202 (information organization and retrieval), Info206 (network programming and Java), and CS281A (statistical learning theory, especially the EM algorithm). I am satisfied with the result and with the fact that I was able to implement this simulation by myself in a short time.


Taxonomy of Philosophy

Weinberger links to this intriguing attempt to categorize philosophical papers for a system to “access online work in philosophy.”  

The best part is the discussion that follows David Chalmers’ blog post about the project, which sends me through a microcosm of the 202 course so far. One commenter links to “An Essay towards a Real Character and a Philosophical Language”, in which John Wilkins attempts to create a language where every word defines itself based on a hierarchy of 40 Genuses (each divided into Differences and then Species) of his design. The Wikipedia article points me to Borges’ response, “The Analytical Language of John Wilkins”, where he casts doubt on such universal categorization schemes by comparison to The Celestial Emporium of Benevolent Knowledge. Other commenters on the blog post point out similar problems: a separate set of categories for the history of philosophy seems strange since many of these papers are relevant to the philosophical topics themselves; there seem to be “multiple principles of division”.

One of the authors of the philosophy taxonomy responds with a return to pragmatism:

OK, it’s a pseudo-taxonomy, or maybe just a category scheme. We’re not doing science here, just trying to come up with something useful and convenient.

Excellent.  We all know that classification systems should be judged by their usefulness rather than how essential their representations of the world are.

Finally, the other author of the taxonomy argues for the values of faceted classification:

our system allows massive cross-classification both of papers and categories: any paper or category can be in multiple categories. This allows us to cut the pie in many ways at once, and we hope that people will generally be able to find what they are looking for following their intuitive way of cutting the pie (along periods, figures, views, points of disagreement, etc).

Though if he is attempting to cut the pie in many different ways at once, I would think he would want explicitly orthogonal classifications, rather than one enormous tree.


Search Flickr by Color

Searching for all the photos on Flickr that are tagged “red” is old hat. Besides, searching for colors in tags is fraught with problems: people don’t have the patience to tag their photos exhaustively with all the colors in them, people may not be able to distinguish all the colors in a photo, and worse, they may be “wrong” about the colors. After all, your red is my pink. (If you want to get philosophical, check out the inverted spectrum problem, though this doesn’t pose a problem for Flickr tagging.)

An obvious approach is to tag photos with all their colors algorithmically. We can scan photos for colors and tag any picture with lots of #ff0000 “red”. Users who search for red will retrieve these results. This approach would be consistent, but it is still open to the problem of disagreement about colors–someone still has to define red in the computation. In terms from a recent 202 lecture, a semantic gap remains between the photo and the metadata used to describe (and consequently retrieve) it.
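As a rough illustration of the algorithmic tagging idea, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the function names `color_fraction` and `tag_colors`, the tiny `palette`, the Euclidean RGB distance, and the `tolerance` and `min_fraction` thresholds. Those thresholds are exactly where "someone still has to define red."

```python
import math


def color_fraction(pixels, target, tolerance=60):
    """Fraction of pixels whose RGB distance to `target` is within `tolerance`."""
    if not pixels:
        return 0.0
    close = sum(1 for p in pixels if math.dist(p, target) <= tolerance)
    return close / len(pixels)


def tag_colors(pixels, palette, min_fraction=0.25, tolerance=60):
    """Return the names of palette colors that cover enough of the image.

    pixels: list of (r, g, b) tuples with values 0-255.
    palette: dict mapping a tag name to a reference (r, g, b) value.
    """
    return [
        name
        for name, rgb in palette.items()
        if color_fraction(pixels, rgb, tolerance) >= min_fraction
    ]


# Hypothetical palette: the reference values are one person's idea of each color.
palette = {"red": (255, 0, 0), "blue": (0, 0, 255)}
```

Changing `tolerance` or the reference value for "red" changes which photos a search for red returns, which is the disagreement problem in code form.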

A solution to this problem is to search using criteria at the same semantic level that you require in your results. Idée has implemented this idea with its Multicolr interface for searching Flickr. You select a color and see pictures that contain that color. Using Multicolr is mesmerizing because you can adjust your search criteria to encompass multiple colors and see results matching your search. Selecting the same color multiple times (i.e., the equivalent of “redred”) increases its intensity in your search.

Textual search is likely to remain our primary means of retrieval for the foreseeable future–so much of our discourse is word-dominated–but this is an example of the frontiers of IR.


Project Bamboo


Not sure if you have heard of Project Bamboo, but it is an effort to find ways to utilize and incorporate technology into humanities research to advance the field(s). Sponsored by the Mellon Foundation, the end goal is a proposal for an implementation strategy, including standards and the like. My husband has been attending the most recent workshop on behalf of Blackboard (because they want a seat at the table as the standards are being set, of course!!!) and it’s basically been a 202 extravaganza. At the table? Librarians, philosophers, artists, lit profs, computer scientists, even a few iSchool professors (Larson and Kansa), etc. This led to lengthy debates about the meaning of what they were actually trying to do, how explicitly they should define it, how to carve up their worlds, why the sky is blue, etc. One of the main things that they apparently kept coming back to was, of course, The Tradeoff. Who does the work and who reaps the benefits?

Pretty cool stuff though, and hearing his recap (“classification”, “ontology”, “schemas”, “data interoperability”, “buzz”, “buzz”, “buzz”) was essentially like a mini-study session for the midterm.

If anyone is interested in contributing – especially those philosophers among us – there are links to join on their site.



Tagging with pictures | Tagging the physical world.

At the risk of fanning political flames, this jpg was just sent to me via email. If you move past the humor and politics of the photo, it seems salient to today’s topic of tagging. Specifically, using the characteristics we collectively/culturally ascribe to trains of varying types to tag each of the presidential/vice-presidential candidates. It was done visually instead of with words (modern, green, fast, powerful, coal powered, archaic, plastic, child’s toy). Are these “good” tags? I think guys named Nick who went to Amherst (the h is silent) would say yes.

Election Trains

After I stopped laughing, this made me wonder if there were already a system out there for tagging things with pictures. I did not find any with a quick Google search. Just a number of whitepapers.

However, I did find Tonchidot.

While not specifically related to using images to tag other images or ideas, they are developing an iPhone app that adds tags to the images the camera sees in real time. They take community tags and make them mobile in a very compelling way. Want to know what type of flower that is? Tree? Year a building you are looking at was made, who designed it? Which store at the mall has the thing you want to buy? How many stars the restaurant you are looking at has on Yelp? When the next BART is arriving at your station? Find a lower price for something in a different store. Purchase something via the phone. Leave a message for a friend to pick up by walking by a specific place.

Tagging a specific location is also possible. This reminds me of William Gibson’s book Spook Country. One aspect of the storyline was the development of location-based digital art installations. In order to see a specific digitally created piece you needed specially made hardware (eyeglass digital display) and a computer. You also needed to be in a specific geo-spatial location. Now, you’ll just need your iPhone.

One of the things an artist in the book said reminds me of the potential of Tonchidot’s technology. Imagine traveling across the country and seeing a whole second landscape that covers, interacts, and integrates with the physical world, offering different things to see, information about what you’re seeing, directions to get there, prices for goods/services (who would not love to know the cheapest place to get gas?). And of course a whole new opportunity for advertising and spam.

Maybe that’s the problem with spam. No ontological control.

The video is about 18 minutes long and worth watching. There is a particularly interesting practical question around the 14:15 mark.


the vocabulary problem strikes again

Found this funny article about a police officer who was called in to shoo off a “big cat” only to find out that it was actually a male mountain lion weighing 80 – 90 pounds.  In the article itself, the mountain lion was called 3 names: “kitty cat”, “big cat” and “house cat” — none of which I would probably use to describe a lion. The title comes closer with “cougar”.   I find it amazing that a 200-word article can call something 5 different names!

Complete article here.


Where does your food come from?


The Agriculture Department has given American retailers six months to comply with a new rule requiring that meats, produce, and certain nuts be labeled with the country of origin. The idea is that consumers have the right to have access to this metadata when making decisions about which food to purchase. Interestingly enough, certain foods, such as roasted nuts and mixed vegetables, are exempt from this rule. The article refers to these exempted foods as processed foods (mixed vegetables – processed?). I don’t see why a food’s classification as “processed” grants it special status to be exempt from rules that are meant to inform consumers about the potential safety (or dangers) of food.

Regardless, I’m glad to see this step being taken to include more classes of foods in country-of-origin labeling (seafood is already labeled). Necessity arising out of food recalls seems to be driving these efforts.


Our Digital Lives, Monitored By A Hidden ‘Numerati’

I listened to an interview, broadcast on NPR’s Fresh Air, with Stephen Baker, where he discusses his new book, “Numerati”, which examines the “mathematical modeling” of humanity, and what he believes are some of the potential consequences of this activity. He talks about how information is collected about each of us from cell phone use, credit cards, supermarket scanners, Internet shopping and many other sources. This information about our choices is now being examined to build customized profiles, and based on the results consumers will be targeted for particular services and goods. On a more serious note, this same type of data mining is being used in attempts to understand people more deeply and predict behavior, such as someone’s potential to be involved in terrorism. If you have 20 minutes to listen, this is an interesting topic in light of our current discussions on data mining and classifications of information.



Dewey or Don’t We?

This article from May 07 is about a library that decided to move away from the Dewey Decimal system and towards a subject-based organization. They used 50 subject headings created by the Book Industry Study Group Inc. The library intentionally mimicked certain aspects of bookstores, not only in how the books are organized by subject, but also in physical layout. It appears they are trying to accommodate their customers’ habits and expectations.

To me, this sounds interesting. While reading Weinberger, I recalled that I like bookstores, and that as long as the subject areas are clearly labeled I have little trouble finding the specific book I am seeking. At the very least it is no more difficult than in a library, and usually easier. It also seems more “natural” to me to search for a subject than for a number. Of course, this is a small library (24,000 books, DVDs, etc.). If you are dealing with a larger set of works this may become too difficult to manage.

However, one of the comments on the article is key (in my opinion) to the bookstore/Dewey decision. “That’s OK for leisure reading, but if you need to do research on a specific topic, you are going to have a hard time finding the particular information that you need.” The additional structure in the Dewey system makes it easier (once you know how to use the system) to find ever more granular information. Most bookstores just lump it all together.

I’ve not been able to find any follow-up information as to whether it worked or not. Their page shows they now have over 30,000 items in the library, but nothing about its current layout/organization or popularity. I wish I’d found this article when we read Weinberger’s piece.

PS: I wish I could claim the title as original, but I borrowed it.

