Google’s SearchWiki

Hey, anybody notice the new goodies in your Google search results? You can manipulate the rankings of returned hits by using the quiet little arrow and x buttons that now follow each hit’s title. I’m guessing that, besides allowing customization, this is a new way to train relevance calculations. Using click-throughs alone to test relevance would include a certain percentage of pages that appeared from the results page to be useful but turned out not to be what the user was after. It would also necessarily include side trips initiated by curiosity rather than search refinement. Now, with a feature that explicitly allows users to refine how pages rank in their own results, Google is getting information much more clearly tied to what users consider to be relevant to a given search. Since human relevance rankings are the standard for training automated relevance rankings, this seems to be a win for Google as well as its users.
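I’m only speculating about how these votes might feed back into ranking, but the natural move would be to treat them as stronger training labels than bare click-throughs. Here’s a minimal sketch of that idea in Python (all names and weights are hypothetical; Google hasn’t said how, or whether, this data trains anything):

```python
# Hypothetical sketch: turning user feedback on a search result into a
# graded relevance label for training a ranker. Explicit votes (the arrow
# and the x) are strong signals; a bare click-through is a weak one, since
# the click may have been curiosity or a misleading snippet.

def relevance_label(clicked: bool, promoted: bool, removed: bool) -> float:
    """Map observed behavior on one result to a label in [-1, 1]."""
    if removed:
        return -1.0  # explicit "x": the user marked this result irrelevant
    if promoted:
        return 1.0   # explicit arrow: the user marked this result relevant
    if clicked:
        return 0.3   # weak positive: clicked, but maybe out of curiosity
    return 0.0       # no signal at all

print(relevance_label(clicked=True, promoted=False, removed=False))  # 0.3
print(relevance_label(clicked=True, promoted=True, removed=False))   # 1.0
```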

Comments (5)

What to do with the Nasties

BBC News is reporting that YouTube has removed some videos from its site that it judged to glorify the Columbine school shooters, which left me wondering what one does when one expunges “undesirable” data from a collection. Assuming the expunging is justified, do you keep the reference information so you have a record of having had the thing around (and thereby make yourself better able to detect its reappearance)? Do you expunge the thing from the entire database? It seems good general practice to have a place where one can keep old records that no longer point to something retrievable. Would it be wise to let people search and find that an item had been intentionally removed, to save them the trouble of searching and searching for it? Or would it be ethically questionable to have even the record available, since it could give people the idea to seek the item elsewhere or create copycat works?

I’m guessing the videos will appear elsewhere on the net, and there is little anyone can do to keep them out of public view, but keeping them off popular sites could effectively marginalize them. I’m thinking the benefit of keeping something truly nasty beyond the view of the “tell me something about …” searcher outweighs the benefit of explaining the removal to the “I want this exact document” searcher.
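For what it’s worth, the “keep the reference information” option has a standard shape in database practice: a tombstone record that retains identifying metadata, say a content hash, after the content itself is gone, so a re-upload of the same thing can be recognized. A rough sketch, with every name my own invention rather than any real site’s schema:

```python
import hashlib
from datetime import datetime, timezone

# Sketch of a "tombstone" for a removed item: the content is deleted, but a
# fingerprint remains so a re-uploaded copy can be detected. Field names
# are illustrative only.

tombstones: dict[str, dict] = {}

def remove_item(item_id: str, content: bytes, reason: str) -> None:
    """Delete the content but keep a hash-keyed record of the removal."""
    fingerprint = hashlib.sha256(content).hexdigest()
    tombstones[fingerprint] = {
        "item_id": item_id,
        "removed_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }

def is_known_removed(content: bytes) -> bool:
    """Check whether an incoming upload matches previously removed content."""
    return hashlib.sha256(content).hexdigest() in tombstones

remove_item("vid123", b"...video bytes...", "glorifies violence")
print(is_known_removed(b"...video bytes..."))  # True: same bytes reappear
```

An exact hash only catches byte-identical copies, of course; recognizing a re-encoded or trimmed version is a much harder near-duplicate problem.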

Comments (1)

Stopword analysis in the blogosphere

Jeff Atwood has a post today about stopwords (as we discussed in class yesterday).

He shows the default stopword lists that ship with Microsoft SQL Server and Oracle, which are worth a look, and posts some numbers on word frequency on the web. He finds, as we might expect, that many of the most frequent words aren’t normally considered stopwords (information, website, download, internet, home, email). He also links to an interesting Google patent on deciding when to ignore stopwords and when not to.

Once again, the commenters add value: it looks like Tim Bray did a similar analysis in 2003. Both note that Google handles searches for “to be or not to be” correctly (though it sounds like the behavior today is better than it was in 2003).
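I haven’t read the patent closely, but the behavior both of them describe is consistent with a simple fallback rule: drop stopwords from a query unless doing so would leave nothing behind. A toy version (the stopword list here is illustrative, not Google’s):

```python
STOPWORDS = {"to", "be", "or", "not", "the", "a", "of"}  # tiny illustrative list

def query_terms(query: str) -> list[str]:
    """Drop stopwords from a query, unless that would gut the query entirely."""
    terms = query.lower().split()
    kept = [t for t in terms if t not in STOPWORDS]
    # Fallback: if every term is a stopword ("to be or not to be"),
    # the stopwords *are* the query, so keep them all.
    return kept if kept else terms

print(query_terms("the history of rome"))  # ['history', 'rome']
print(query_terms("to be or not to be"))   # ['to', 'be', 'or', 'not', 'to', 'be']
```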

I think it bears repeating that the commonness of these words doesn’t seem like a good reason to drop them from indices or search queries.  A word that appears in every document might be useless, but if I can halve the result set with a single word (I’m looking for an email address, say), then the relative frequency of the word “email” doesn’t seem to hurt me much.  Removing stopwords that are unlikely to have semantic value (articles and conjunctions, say) makes more sense to me.
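To put a number on the selectivity point: what matters is how much a term shrinks the candidate set, not its raw frequency. With made-up document frequencies:

```python
# Illustrative numbers only. A term's usefulness for retrieval tracks its
# selectivity: how much it narrows the result set, not how common it is
# in absolute terms.

total_docs = 1_000_000
doc_freq = {"the": 990_000, "email": 500_000, "pixolu": 40}

for term, df in doc_freq.items():
    print(f"{term!r}: keeps {df / total_docs:.1%} of the collection")
# 'the' keeps 99.0%    -> nearly useless as a filter
# 'email' keeps 50.0%  -> halves the result set, clearly worth indexing
# 'pixolu' keeps 0.0%  -> highly selective
```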

Comments (2)

Semantic image retrieval

This may already be old news for regular readers of lifehacker.com, but in case you missed it, here’s another search engine.

http://www.pixolu.de/

Pixolu is a semantic image search engine that lets users refine a search by selecting the images that best represent their query. I tried it on a few queries, and it seems to do a good job, factoring in color, object shape, size, and density.

The two-step search-and-refine process is very interesting and represents a more natural way of gathering information. Pixolu, a more 202-ish search engine, pays attention to recent (and older) research on information gathering and foraging.
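I have no idea what Pixolu actually runs under the hood, but the refine step sounds like classic relevance feedback: re-rank the whole collection by similarity to the centroid of the images the user selected. A bare-bones Rocchio-style sketch over fake feature vectors:

```python
import numpy as np

# Sketch of Rocchio-style relevance feedback for image search (my guess at
# the general technique, not Pixolu's actual algorithm). Each image is a
# feature vector (color histograms, shape descriptors, etc.); the refine
# step re-ranks everything by similarity to the mean of the user's picks.

rng = np.random.default_rng(0)
features = rng.random((100, 64))  # 100 images, 64-dim features (fake data)

def refine(selected_ids: list[int]) -> list[int]:
    """Re-rank all images by cosine similarity to the selected images' centroid."""
    centroid = features[selected_ids].mean(axis=0)
    sims = features @ centroid / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(centroid)
    )
    return list(np.argsort(-sims))  # best matches first

print(refine([3, 17, 42])[:10])     # top ten after one round of feedback
```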

Comments off

Dewey or Don’t We?

This article from May 2007 is about a library that decided to move away from the Dewey Decimal system and toward a subject-based organization, using 50 subject headings created by the Book Industry Study Group Inc. The library intentionally mimicked bookstores, not only in how the books are organized by subject but also in physical layout. It appears they are trying to accommodate their customers’ habits and expectations.

To me, this sounds promising. I recall thinking while reading Weinberger that I like bookstores: as long as the subject areas are clearly labeled, I have little trouble finding the specific book I’m seeking. At the very least it’s no more difficult than in a library, and usually easier. Of course, this is a small library (24,000 books, DVDs, etc.); with a larger collection, this approach might become too difficult to manage. And it seems more “natural” to me to search by subject than by number.

However, one of the comments on the article is, in my opinion, key to the bookstore-versus-Dewey decision: “That’s OK for leisure reading, but if you need to do research on a specific topic, you are going to have a hard time finding the particular information that you need.” The additional structure in the Dewey system makes it easier (once you know how to use the system) to drill down to ever-more-granular information. Most bookstores just lump it all together.
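The structural difference is easy to see in miniature: Dewey numbers act as prefixes in a hierarchy, so each added digit narrows the shelf, while a flat subject heading stops at one level. A toy illustration (the class numbers are standard Dewey assignments; the code is just mine):

```python
# Toy contrast between a Dewey-style hierarchy and flat bookstore subjects.
# Dewey class numbers work as prefixes, so every extra digit narrows the
# shelf; a flat heading like "Science" is as specific as it gets.

books = {
    "510":   "Mathematics (general)",
    "516":   "Geometry",
    "516.3": "Analytic geometry",
}

def narrow(prefix: str) -> list[str]:
    """Everything filed under a Dewey prefix, at any depth below it."""
    return [title for number, title in books.items() if number.startswith(prefix)]

print(narrow("5"))    # the whole sciences-and-math shelf
print(narrow("516"))  # just geometry and its subdivisions

# The flat scheme has no equivalent of narrow("516.3"): once you're at
# "Science", finding analytic geometry means scanning the whole section.
```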

I’ve not been able to find any follow-up information as to whether it worked or not. Their page shows they now have over 30,000 items in the library, but nothing about its current layout/organization or popularity. I wish I’d found this article when we read Weinberger’s piece.

PS: I wish I could claim the title as original, but I borrowed it.

Comments (1)