Auto-clustering of UC Berkeley courses

This may be my last post for the 202 blog.

I took the statistical learning theory course in the EECS department this semester. The course introduces probabilistic models and requires a final project. For mine, I chose unsupervised clustering of UC Berkeley courses based on their descriptions.

The background to the problem is as follows. UC Berkeley provides an online course search system, but it is very basic. It only lets users search by course name, instructor name, department, and so on. We can’t search for keywords in course descriptions.

First of all, I believe it should provide keyword search over course descriptions. It would also be desirable for the system to include a “recommendation system” that offers students lists of courses likely to suit them, based on their course registration histories (perhaps like Amazon’s recommendation system, a kind of “folksonomy” for classifying courses).

To implement a recommendation system based on students’ course registration histories, Berkeley’s courses would need to be clustered by those histories in an unsupervised manner.

I don’t have access to students’ registration histories. So, as a substitute, I used the online course descriptions in the current system and tried to cluster UC Berkeley courses based on them, using a mixture of multivariate Bernoulli distributions fitted with the EM algorithm (for details, see the textbook “Introduction to Information Retrieval”, pp. 338-340). The goal was to test whether I could cluster UC Berkeley’s courses in a reasonable and explainable way without supervision.
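To make the model concrete, here is a minimal Python sketch of EM for a mixture of multivariate Bernoulli distributions. It is not the code I used for the project; the function name `em_bernoulli_mixture`, the random soft initialization, and the smoothing constant `eps` are all assumptions chosen for illustration. Each course is represented as a 0/1 vector saying which vocabulary terms appear in its description.

```python
import math
import random

def em_bernoulli_mixture(docs, k, iters=50, seed=0, eps=1e-4):
    """Soft-cluster binary vectors with a mixture of multivariate Bernoullis.

    docs: list of equal-length lists of 0/1 (term absent or present).
    Returns (priors, term_probs, responsibilities).
    Hypothetical helper for illustration; iters must be at least 1.
    """
    rng = random.Random(seed)
    n, m = len(docs), len(docs[0])
    # Assumed initialization: random soft assignments, each row sums to 1.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iters):
        # M-step: re-estimate cluster priors and per-term probabilities.
        # eps is a small smoothing constant that keeps every value in (0, 1).
        sizes = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        priors = [(s + eps) / (n + k * eps) for s in sizes]
        probs = [
            [(sum(resp[i][c] * docs[i][t] for i in range(n)) + eps)
             / (sizes[c] + 2 * eps)
             for t in range(m)]
            for c in range(k)
        ]
        # E-step: posterior of each cluster given the document (log space).
        for i in range(n):
            logp = []
            for c in range(k):
                lp = math.log(priors[c])
                for t in range(m):
                    q = probs[c][t]
                    lp += math.log(q) if docs[i][t] else math.log(1.0 - q)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            resp[i] = [w / total for w in weights]
    return priors, probs, resp
```

Taking the cluster with the largest responsibility for each course gives a hard assignment. The terms with the highest probability in each cluster are a natural starting point for labeling it.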

The results are as follows. I tried grouping the courses of the Math, Information, Statistics, Economics, and Computer Science departments (226 courses in total) into 7 clusters. Several of the clusters generated by the algorithm are explainable, such as a “Statistical/Mathematical methodology-oriented course cluster”, a “Programming related course cluster”, an “Individual study/Seminar related course cluster”, and an “Economics related (but less statistics oriented) course cluster”.

Of course, not all courses are explainable by these labels. But although I applied very basic methods, without standard information-retrieval techniques such as lemmatization, stemming, and stop-word removal, the results are better than I expected before running the experiment.
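As a rough illustration of what “very basic methods” means here, the following Python sketch turns raw descriptions into binary term vectors with nothing more than lowercasing and splitting on non-alphanumeric characters. The function name `binary_term_vectors` and the tokenization rule are assumptions for illustration, not my actual preprocessing code, and the sample descriptions in the usage lines are made up.

```python
import re

def binary_term_vectors(descriptions):
    """Map each description to a 0/1 vector over the shared vocabulary.

    No stemming, lemmatization, or stop-word removal is applied.
    Hypothetical helper for illustration.
    """
    # Lowercase and keep runs of letters and digits as terms.
    tokenized = [set(re.findall(r"[a-z0-9]+", d.lower())) for d in descriptions]
    vocab = sorted(set().union(*tokenized))
    vectors = [[1 if term in terms else 0 for term in vocab]
               for terms in tokenized]
    return vocab, vectors

# Made-up descriptions, only to show the shape of the output.
vocab, vectors = binary_term_vectors(
    ["Intro to statistics", "Statistics and programming"]
)
```

Because stop words such as “to” and “and” stay in the vocabulary, every cluster has to model them too, which is one reason results from such a simple setup could have been worse than they were.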

1st cluster courses

2nd cluster courses

3rd cluster courses

4th cluster courses

This is a back-of-the-envelope simulation and the result is simple. There are also some problems. But it shows a definite result and, more importantly, it was for me a task that integrated Info202 (information organization and retrieval), Info206 (network programming and Java), and CS281A (statistical learning theory, especially the EM algorithm). I am satisfied with the result, and with the fact that I gained the ability to implement this simulation by myself in a short time.


“Genius” Feature Makes Music Miscellaneous

As many of you probably already know, last week Apple released iTunes 8. One of the most interesting features announced in this update is Genius playlist creation. Select any song in your library and the Genius will create a playlist of songs in your own library that go well with it. So, if you’re in the mood for jazz, just select your favorite Ella Fitzgerald song, press the Genius button and you’ll have a playlist of songs like it.

In my use of this feature, I’d say it works really well. It saves a lot of time and reintroduces me to music I already have but may not have listened to in a while.

Where this feature gets interesting is in how it relates to the material we’ve discussed in 202. The Genius works by first collecting and submitting (anonymously) all of your music’s metadata to Apple’s servers. There, these data are analyzed and compared to other users’ music metadata as well as the buying habits of iTunes music store customers, of which there are about 70 million. The algorithm that Apple uses to determine music matches has in effect made music miscellaneous. People buy music from the iTunes store, rip CDs, and tag their own music files anyway. This feature taps into these disparate cataloging systems collected from millions of users and creates something new from them. It ameliorates the problem of having to recall all the music you have in your library that might fit a particular mood. No music professionals required.

To me, this is a clear win for Weinberger.
