Stopwords analysis in the blogosphere

Jeff Atwood has a post today about stopwords (as we discussed in class yesterday).

He shows the default stopword lists that ship with Microsoft SQL Server and Oracle, which are worth a look, and posts some numbers on the frequency of words on the web.  He finds, as we might expect, that many of the most frequent words aren’t normally considered stopwords (information, website, download, internet, home, email).  He also links to a Google patent on analyzing when to ignore stopwords and when not to.

Again, the commenters add to the post: it looks like Tim Bray did a similar analysis in 2003.  Both Atwood and Bray note that Google handles searches for “to be or not to be” correctly (though it sounds like the behavior today is better than the behavior in 2003).

I think it bears repeating that commonness alone doesn’t seem like a good reason to drop these words from indices or search queries.  A word that appears in every document might be useless, but if a single word can halve the result set (I’m looking for an email address, say), then the high overall frequency of the word “email” doesn’t seem to hurt me much.  Removing stopwords that are unlikely to have semantic value (articles and conjunctions, say) makes more sense to me.
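To make that concrete, here is a toy sketch in Python. The four-document corpus and the helper names (`build_index`, `search`, `document_frequency`) are made up for illustration; they are not taken from Atwood's post or from any real search engine. The sketch builds a small inverted index and compares a word that appears in every document with a word that appears in half of them.

```python
# Toy corpus, invented for illustration only.
DOCS = {
    1: "email the team for more information",
    2: "download the information pack from the website",
    3: "email the website team about the download",
    4: "the home page has information about the internet",
}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query, all_ids):
    """AND query: ids of the documents that contain every query word."""
    result = set(all_ids)
    for word in query.split():
        result &= index.get(word, set())
    return result


INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)


def document_frequency(word):
    """Fraction of documents in the toy corpus that contain the word."""
    return len(INDEX.get(word, set())) / len(DOCS)


if __name__ == "__main__":
    for word in ("the", "email"):
        print(word, document_frequency(word))
    print(search(INDEX, "information", ALL_IDS))
    print(search(INDEX, "information the", ALL_IDS))
    print(search(INDEX, "information email", ALL_IDS))
```

In this corpus "the" appears in every document, so adding it to a query never narrows the results. "email" appears in half of the documents, which makes it frequent by any raw count, yet adding it to a query still filters the result set.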
