What’s a Small Farm?

This year the USDA released the much-anticipated 2007 agricultural census. The census showed a rise in the number of small farms, and this statistic was celebrated in many farm and food articles and blogs.

Gristmill points out that a former USDA Economic Research Service researcher, Michael Roberts, argues that there may not actually be more small farms; there may simply be a difference in what “counts” as a small farm.

The important revelation here is that the USDA uses statistical weighting to arrive at the numbers for these micro-farms since many of these people don’t even self-identify as farmers — and so their precision is entirely a question of their methodology, i.e. how they decide to model the presence/frequency of these small operations. Census weighting is, of course, both controversial and necessary. Counting everything by hand can have a larger margin for error than rigorous statistical modeling. Indeed, this “controversy” is right now at the heart of a monumental battle between Democrats and Republicans over the U.S. Census (just ask Sen. Judd Gregg).

That said, there is nothing inherently wrong with the practice. However, even if your overall approach is solid, if you then change your weighting techniques from year to year, comparing annual changes is all but impossible. And that appears to be exactly what the USDA is doing.
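To see why a change in weighting alone can look like growth, here is a minimal sketch. The strata, response counts and weights are all made up for illustration and are not USDA figures; `estimated_farms` is a hypothetical helper, not the USDA's method.

```python
def estimated_farms(responses_by_stratum, weights_by_stratum):
    """Weighted estimate: each response stands in for `weight` farms in its stratum."""
    return sum(
        responses_by_stratum[stratum] * weights_by_stratum[stratum]
        for stratum in responses_by_stratum
    )

# Hypothetical: the same number of responses in both census years.
responses = {"micro": 1000, "small": 5000}

# Hypothetical weights: only the micro-farm weight changes between years.
weights_earlier = {"micro": 2.0, "small": 1.5}
weights_later = {"micro": 3.0, "small": 1.5}

earlier = estimated_farms(responses, weights_earlier)
later = estimated_farms(responses, weights_later)
print(earlier, later)
```

Nothing about the underlying responses changed here, yet the later estimate is higher, which is why year-to-year comparisons need a constant weighting method.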

Needless to say, this is a pretty big deal. Is the number of small farms actually growing? Or is the current political climate in this realm simply pushing the USDA to fudge their methods a little, causing a shift in their categorization schemes?


Auto-clustering of UC Berkeley courses

This may be my last post for the 202 blog.

This semester I took a statistical learning theory course in the EECS department. The course provides an introduction to probabilistic models and requires students to do a final project. For mine, I chose unsupervised clustering of UC Berkeley courses based on their descriptions.

The background for this task is as follows. UC Berkeley provides an online course search system, but it is very basic. It only lets users search by course name, instructor name, department and so on. We can’t search for keywords in course descriptions.

I believe, first of all, that it should provide keyword search over course descriptions. It would also be desirable for the system to include a recommendation feature that gives students lists of courses likely to suit them, based on their registration histories (something like Amazon’s recommendation system, a kind of “folksonomy” for classifying courses).

To implement a recommendation system based on students’ registration histories, Berkeley courses would need to be clustered by those histories in an unsupervised manner.

I don’t have access to students’ registration histories. So, as a substitute, I use the online course descriptions in the current system and try to cluster UC Berkeley courses based on these descriptions, using a mixture of multivariate Bernoulli distributions fitted with the EM algorithm (for details, please refer to the text “Introduction to Information Retrieval”, pp. 338-340), and I test whether the courses can be clustered reasonably and explainably in an unsupervised manner.
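For readers who want to see the shape of the model, here is a minimal pure-Python sketch of EM for a mixture of multivariate Bernoulli distributions. It is not the code used for the project: the function name `fit_bernoulli_mixture`, the smoothing constant `eps`, the random initialization and the toy term-presence vectors are all assumptions made for illustration.

```python
import math
import random

def fit_bernoulli_mixture(docs, n_clusters, n_iter=50, seed=0, eps=1e-2):
    """docs: equal-length 0/1 lists (term present or absent in a description).
    Returns (priors, term_probs, responsibilities)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(docs), len(docs[0])

    # Assumed initialization: random soft assignments of documents to clusters.
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: re-estimate cluster priors and per-term probabilities,
        # smoothed by eps so no probability is exactly 0 or 1.
        priors, term_probs = [], []
        for k in range(n_clusters):
            weight = sum(resp[n][k] for n in range(n_docs))
            priors.append((weight + eps) / (n_docs + n_clusters * eps))
            term_probs.append([
                (sum(resp[n][k] * docs[n][m] for n in range(n_docs)) + eps)
                / (weight + 2 * eps)
                for m in range(n_terms)
            ])

        # E-step: posterior probability of each cluster for each document,
        # computed in log space to avoid underflow.
        for n, doc in enumerate(docs):
            log_post = []
            for k in range(n_clusters):
                lp = math.log(priors[k])
                for m, present in enumerate(doc):
                    q = term_probs[k][m]
                    lp += math.log(q) if present else math.log(1.0 - q)
                log_post.append(lp)
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp[n] = [u / total for u in unnorm]

    return priors, term_probs, resp

# Toy data (hypothetical): columns are regression, probability, programming, java.
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
priors, term_probs, resp = fit_bernoulli_mixture(docs, n_clusters=2)

# Hard cluster label per course: the cluster with the highest responsibility.
hard = [max(range(2), key=lambda k: r[k]) for r in resp]
print(hard)
```

Each course description is reduced to which vocabulary terms it contains, and the terms with the highest probability in a cluster are what make that cluster explainable afterwards.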

The results are as follows. I tried clustering the courses of the Math, Information, Statistics, Economics and Computer Science departments (226 courses in total) into 7 clusters. Several of the clusters generated by the algorithm are explainable, such as a “Statistical/Mathematical methodology-oriented course cluster”, a “Programming related course cluster”, an “Individual study/Seminar related course cluster” and an “Economics related (but less statistics oriented) course cluster”.

Of course, not all courses are explainable by these labels. But although I applied very basic methods, without special information-retrieval techniques such as lemmatization, stemming and stop-word removal, the results are better than I expected before running the experiment.

1st cluster courses

2nd cluster courses

3rd cluster courses

4th cluster courses

This is a back-of-the-envelope simulation and the result is simple. There are also some problems. But it does show a certain result and, more importantly, for me it integrates Info202 (information organization and retrieval), Info206 (network programming and Java) and CS281A (statistical learning theory, especially the EM algorithm). I am satisfied with the result and with the fact that I was able to implement this simulation by myself in a short time.


My last 202 blog post

The Beer Judge Certification Program (no, really) has developed a set of guidelines (read categories and vocabulary) for judging beer. They even have it in downloadable XML format. They have developed (what they feel is) an authoritative vocabulary describing the various qualities of beer within their defined categories. Interestingly, they recognize that brewing styles change, so their vocabulary is descriptive rather than prescriptive and it will change over time. The organization also states that they use “experts” to choose the commercial examples (of the types of beer listed) instead of online surveys in order to remove the issue of “popularity contests” overwhelming the list.

So, they have taken a more Svenonian approach to BeerML. However, I’ve never heard of these folks before (though I am intrigued and have installed their iPhone app already). I find it interesting that one of my first questions upon reading their site (aside from how do I get in on this) was “What makes them an authority?” I have searched their site and find no association with any governing body. It seems like a bunch of folks trying to develop an authority on their own. Somewhat self-policing. I’d look more, but need to get back to writing my CMC paper and studying vector modeling. At least now I have vocabulary to use for describing the glass of awesome that is Belgian Witbier.


Taxonomy of Philosophy

Weinberger links to this intriguing attempt to categorize philosophical papers for a system to “access online work in philosophy.”  

The best part is the discussion that follows David Chalmers’ blog post about the project, which sends me through a microcosm of the 202 course so far. One commenter links to “An Essay towards a Real Character and a Philosophical Language” in which John Wilkins attempts to create a language where every word defines itself based on a hierarchy of 40 Genuses (each divided into Differences and then Species) of his design. The Wikipedia article points me to Borges’ response, “The Analytical Language of John Wilkins”, where he casts doubt on such universal categorization schemes by comparison to The Celestial Emporium of Benevolent Knowledge. Other commenters on the blog post point out similar problems: a separate set of categories for the history of philosophy seems strange since many of these papers are relevant to the philosophical topics themselves; there seem to be “multiple principles of division”.

One of the authors of the philosophy taxonomy responds with a return to pragmatism:

OK, it’s a pseudo-taxonomy, or maybe just a category scheme. We’re not doing science here, just trying to come up with something useful and convenient.

Excellent.  We all know that classification systems should be judged by their usefulness rather than how essential their representations of the world are.

Finally, the other author of the taxonomy argues for the values of faceted classification:

our system allows massive cross-classification both of papers and categories: any paper or category can be in multiple categories. This allows us to cut the pie in many ways at once, and we hope that people will generally be able to find what they are looking for following their intuitive way of cutting the pie (along periods, figures, views, points of disagreement, etc).

Though if he is attempting to cut the pie in many different ways at once, I would think he would want explicitly orthogonal classifications, rather than one enormous tree.


On Political Voicemail

The other day I got voice mail from Bill Clinton. Yes, Bill himself apparently took the time to call me and leave me a message reminding me to vote against Prop. 8. It must have been the real Bill, because my phone number is on the national do-not-call list, so I’m protected from annoying phone calls sent out by machinery. I’m only sorry I wasn’t home to talk to him myself, assure him that I will vote against Prop. 8, and ask how Hillary is feeling these days.

But seriously, it’s funny how phone calls from political campaigns get free rein under the rules around the national do-not-call list. Somehow it was decided that sales calls from for-profit businesses are in a different category from calls trying to sell you on a political agenda. Surveys by for-profit companies seem also to have escaped being categorized as sales calls. The argument for keeping things this way is that some people want to receive calls from nonprofits and some presumably would like to be included in surveys. Assuming that’s true, what we need is more granularity in the do-not-call list. Wouldn’t it be nice if we could all decide for ourselves whether sales calls, surveys, and political calls can be categorized as annoying? As I’m sure there are a few people out there who would not want to miss out on their yearly call from Bill Clinton, they would be able to set their political-call option to “useful” rather than “annoying” and have it still come through. The problem here is not so much that things have been classified wrong but that someone else is calling the shots on everyone else’s personal space of information.


Tagging with pictures | Tagging the physical world.

At the risk of fanning political flames, this jpg was just sent to me via email. If you move past the humor and politics of the photo, it seems salient to today’s topic of tagging. Specifically, using the characteristics we collectively/culturally ascribe to trains of varying types to tag each of the presidential/vice-presidential candidates. It was done visually instead of with words (modern, green, fast, powerful, coal powered, archaic, plastic, child’s toy). Are these “good” tags? I think guys named Nick who went to Amherst (the h is silent) would say yes.

Election Trains

After I stopped laughing, this made me wonder if there was already a system out there for tagging things with pictures. I did not find any with a quick Google search. Just a number of whitepapers.

However, I did find Tonchidot.

While not specifically related to using images to tag other images or ideas, they are developing an iPhone app that adds tags to the images the camera sees in real time. They take community tags and make them mobile in a very compelling way. Want to know what type of flower that is? Tree? The year a building you are looking at was made, or who designed it? Which store at the mall has the thing you want to buy? How many stars the restaurant you are looking at has on Yelp? When the next BART train is arriving at your station? Find a lower price for something in a different store. Purchase something via the phone. Leave a message for a friend to pick up by walking by a specific place.

Tagging a specific location is also possible. This reminds me of William Gibson’s book Spook Country. One aspect of the storyline was the development of location based digital art installations. In order to see a specific digitally created piece you needed specially made hardware (eyeglass digital display) and a computer. You also needed to be in a specific geo-spatial location. Now, you’ll just need your iPhone.

One of the things an artist in the book said reminds me of the potential of Tonchidot’s technology. Imagine traveling across the country and seeing a whole second landscape that covers, interacts, and integrates with the physical world. Offering different things to see, information about what you’re seeing, directions to get there, prices for goods/services (who would not love to know the cheapest place to get gas?). And of course a whole new opportunity for advertising and spam.

Maybe that’s the problem with spam. No ontological control.

The video is about 18 minutes long and worth watching. There is a particularly interesting practical question around the 14:15 mark.


A Dogma of Categorization

In determining facets or categories for a set of objects, we might tend to think that some facets are better than others because they are more inherently essential to a particular set of objects.  I believe this is a dogma we should be careful to avoid and as a result I argue that we can only be pragmatic in evaluating ontologies.



The intelligent cloud

After our discussion on Monday on automation and Svenonius’s attitude toward the high cost of categorization, I was interested to read a recent Google blog post about the future of Google’s search technology.

We discussed that current technology would not allow automation to recognize / understand things such as metaphors, fuzzy words, and multi-word terms.

Google is predicting that by 2019 their technology will be able to do much more than fully automate categorization and language comprehension: it will also solve complex problems and learn from its research. They expect the impact of their technology to go well beyond Google’s offerings and to generate many significant benefits for mankind.

“Thus, computer systems will have greater opportunity to learn from the collective behavior of billions of humans. They will get smarter, gleaning relationships between objects, nuances, intentions, meanings, and other deep conceptual information.”

The intelligent cloud


Dewey Decimal

Based on section discussion of Cory Doctorow’s point “schemas aren’t neutral” and on a librarian friend complaining that Korea got shafted when it came to the folktale section of the Dewey Decimal system, I decided to look at the complete list of Dewey Decimal classes.

As Nick mentioned in section, the religion section is overwhelmingly dominated by Christianity. Also, any time languages are mentioned, European languages get multiple categories (English, Other Germanic Languages, French, Spanish, Italian, Slavic, Scandinavian) while the rest of the world is stuck in the “other” category. Wikipedia, font of all knowledge, mentions that the Library of Congress system is even more US-centric than the Dewey Decimal system.

Makes you wonder what sort of systems of categorization information scientists in other countries create.


On the Subject of Important Definitions – OR – Why Politics and Categorization Don’t Mix

In Erin Knight’s post, she talks about how the Capitol is trying to figure out how to redefine homelessness. This reminds me of a similar issue that I have encountered year after year while working for Contra Costa County.

Back in about 1970, some research was done to determine what the poverty level should be. After a bunch of research, they eventually just decided that the thing to do was to take the cost of food for a given family size and multiply it by three. Out of this math, we have the poverty level.

From this number, the government has adjusted every year for inflation, and with that, we arrive at the federal poverty levels for 2008.
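The arithmetic described above is short enough to sketch. The dollar amount and price-index values below are made-up placeholders, not actual federal figures, and the function names are hypothetical.

```python
def poverty_threshold(annual_food_cost, multiplier=3.0):
    """Original rule of thumb: food cost for a family size, times three."""
    return annual_food_cost * multiplier

def adjust_for_inflation(base_threshold, base_price_index, current_price_index):
    """Scale the base-year threshold by the change in a price index."""
    return base_threshold * current_price_index / base_price_index

# Hypothetical inputs for illustration only.
base = poverty_threshold(annual_food_cost=1000.0)
current = adjust_for_inflation(base, base_price_index=100.0, current_price_index=250.0)
print(base, current)
```

Note that only the price level is updated each year; the food basket and the multiplier of three are never revisited.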

Now, this would be pretty bad research, and were I the professor overseeing the high schoolers responsible for these measures, I would probably scold them for committing every bad research method ever. The federal government, however, has taken these measures and based pretty much every aid program on them... for the past 30-40 years.


In class, we have talked about how important it is to have specific and precise ways of categorizing things. Unfortunately, this thing happens to be humans, and unfortunately nobody wants to raise the poverty level while in office because that will mean that X number of people fell into poverty during their time.

When politics meets categorization, problems ensue.

