Stopwords analysis in the blogosphere

Jeff Atwood has a post today about stopwords (as we discussed in class yesterday).

He shows the default stopword lists that ship with Microsoft SQL Server and Oracle, which are worth a look, and posts some numbers on word frequency on the web.  He finds, as we might expect, that many of the most frequent words aren’t normally considered stopwords (information, website, download, internet, home, email).  He also links to a Google patent on deciding when to ignore stopwords and when not to.

Again, commenters add to the blog: it looks like Tim Bray did a similar analysis in 2003.  Both note that Google handles searches for “to be or not to be” correctly (though it sounds like the behavior today is better than the behavior in 2003).

I think it bears repeating that the commonness of these words doesn’t seem like a good reason to drop them from indices or search queries.  A word that appears in every document might be useless, but if I can halve the result set with a single word (I’m looking for an email address, say), then the relative frequency of the word “email” doesn’t seem to hurt me much.  Removing stopwords that are unlikely to have semantic value (articles and conjunctions, say) makes more sense to me.
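
To make the point concrete, here is a minimal sketch in Python. The four-document corpus and the helper names (document_frequency, search) are made up for illustration; they are not taken from Jeff's post or from any search engine. It computes the fraction of documents each word appears in, then shows how much a query on that word narrows the result set.

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration.
DOCS = [
    "the email address of the author",
    "send an email to the list",
    "the history of the internet",
    "a guide to the home page",
]


def document_frequency(docs):
    """Return {word: fraction of documents that contain the word}."""
    counts = Counter()
    for doc in docs:
        # Use a set so each document counts a word at most once.
        counts.update(set(doc.split()))
    return {word: n / len(docs) for word, n in counts.items()}


def search(docs, word):
    """Return the documents that contain the given word."""
    return [doc for doc in docs if word in doc.split()]


df = document_frequency(DOCS)

# "the" appears in every document, so it filters nothing out.
print(df["the"], len(search(DOCS, "the")))

# "email" is also common, but it still cuts the result set in half.
print(df["email"], len(search(DOCS, "email")))
```

In this toy corpus both words are frequent, but only "the" is useless as a filter. A cutoff based purely on frequency would risk discarding "email" along with it.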

2 Comments

  1. Ryan Greenberg Said,

    November 13, 2008 @ 4:19 pm

    A word that appears in every document might be useless, but if I can halve the result set with a single word (I’m looking for an email address, say), then the relative frequency of the word “email” doesn’t seem to hurt me much.

    On a practical note, the default settings for full-text search in MySQL ignore any search term that appears in more than 50% of a table’s rows [citation]. This is especially helpful to know when you are testing a database with a small set of text rows; in these cases it’s likely that your search terms will appear in more than half of your rows. Under these circumstances, it’s possible that a search for a specific word like “hydrochloride” returns no results.

  2. Matt Gedigian Said,

    November 17, 2008 @ 9:27 pm

    When Jeff was discussing alternatives on Stack Overflow, Lucene was recommended. What I found surprising was that Fog Creek Software’s Linux version of FogBugz uses Lucene.NET on Mono. Even more surprising was that Wikipedia uses the same setup. No love for the original/primary Java version of Lucene.
