Mining quotations from digital libraries

Last month the CS department hosted an excellent talk by Bill Schilit, a researcher at Google, about how Google has analyzed every word in every one of its digitized books (yes, it’s N^2; that’s how much processing power Google has) and found every instance of one book being quoted in another.

This seems particularly relevant to Gruber’s “Collective Knowledge Systems” piece, which recommends pulling semantic information out of the participatory architecture of the social web. Schilit and Google are extracting the most important passages from each book, and so in effect learning what the book is about, by looking through the massive body of text that users (here, the authors of other books) have written.

This clearly meets Gruber’s criteria for collective knowledge systems: it takes advantage of user-generated content (the books themselves) and human-machine synergy (human-written books and computerized analysis of them), and it gets all the more powerful at scale. But I think the Google quotations system even reaches the point of Gruber’s “emergent knowledge”: the system can draw non-trivial conclusions about the core subject matter of a book that even skilled cataloguers might find difficult, and it can link books in ways we might otherwise have missed. It could also, in principle, reason over the topics and connections between books to reach completely new conclusions, though I’m not sure we see this in practice yet.

I don’t believe that EECS recorded the talk, but it looks like Schilit gave a very similar talk at PARC which was recorded (there’s a good abstract there too).
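
For a rough sense of how quote-finding across a whole library can work without literally comparing every pair of books, here is a minimal sketch. It is not Google’s method (the talk is the place to look for that); it is a common textbook approach, assuming that a “quotation” is any run of n consecutive words that appears in two books. The function names, the choice of n = 5, and the toy corpus in the usage note are all my own illustrations.

```python
from collections import defaultdict


def shingles(text, n=5):
    """Yield every run of n consecutive lowercased words in text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def shared_passages(books, n=5):
    """Map each pair of book ids to the set of n-word runs they share.

    books: dict of book id -> full text (a toy stand-in for a digitized corpus).
    """
    # Inverted index: word run -> ids of the books that contain it.
    index = defaultdict(set)
    for book_id, text in books.items():
        for s in shingles(text, n):
            index[s].add(book_id)

    # Only books that share at least one run are ever paired up.
    pairs = defaultdict(set)
    for s, ids in index.items():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)].add(s)
    return dict(pairs)
```

Calling `shared_passages` on a dict holding a short Hamlet line, an essay that quotes that line, and an unrelated cookbook sentence returns a single pair (the essay and Hamlet), with no entry for the cookbook. The sketch ignores punctuation, spelling variants, and OCR errors, all of which a real system would have to handle, and a run that appears in a great many books still produces a great many pairs.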

2 Comments

  1. Ryan Shaw Said,

    October 15, 2008 @ 8:56 am

    I’ve been singularly unimpressed with the “most important passages” feature of Google Books, mostly because it seems to rely purely on uninformed statistical processing rather than a model of intellectual production and citation. Case in point: the “most important passages” of a great many books seem to be quotations from the Bible, Shakespeare, or Aristotle. Presumably this is because those quotations show up in a lot of other books. So maybe they are “important” in some sense, but are they important to the content of the books they are quoted in? I would argue that the more widespread a quotation is, the less it contributes to the content of whatever it is being cited in. The scholarly equivalent is the paper that everyone cites because they feel like they have to: sure it’s important, but we didn’t need sophisticated bibliometrics to tell us that, and the fact that paper X cited it doesn’t tell us much about paper X. So I’m skeptical that Google is actually discovering anything “non-trivial” here.

  2. Matt Gedigian Said,

    October 17, 2008 @ 5:06 am

    I haven’t used the feature, but I did attend the talk.

    @Ryan This sounds like a bug in the user interface. As you point out, Google has (in a purely statistical manner) identified important passages from books. I don’t think Schilit was claiming that passages which are quoted time and time again are contributing something vital to every work in which they appear. He was claiming that these quotings represent a consensus that the quoted material is important. It’s a pity if this is spoiled by a confusing presentation in the tool.

    The reason this is different from the scholarly equivalent you mention is that this does not require explicit metadata to be manually added by the author. Also, the information is more specific than just a citation link – this resolution is especially important when you’re dealing with books that are hundreds of pages rather than short articles.
