Mining quotations from digital libraries
Last month the CS department hosted an excellent talk by Bill Schilit, a researcher at Google, about how Google has analyzed every word in every one of its digitized books (yes, it’s N^2; that’s how much processing power Google has) and found every instance where one book is quoted in another.
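To make the idea concrete, here is a minimal sketch of how quoted passages could be found across a collection of texts. It is not Google's method, which isn't described in any detail here; it assumes a simple approach based on overlapping five-word sequences ("shingles"), and the names `shingles`, `shared_passages`, and the three toy "books" are all made up for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def shingles(text, n=5):
    """Return the set of n-word sequences in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(books, n=5):
    """Map each pair of book ids to the n-word shingles both books contain."""
    # Inverted index: shingle -> ids of the books that contain it.
    index = defaultdict(set)
    for book_id, text in books.items():
        for s in shingles(text, n):
            index[s].add(book_id)

    # Any shingle found in two or more books is a candidate quotation.
    shared = defaultdict(set)
    for s, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)].add(s)
    return dict(shared)


# Hypothetical toy data, purely for illustration.
books = {
    "A": "It was the best of times, it was the worst of times.",
    "B": "As Dickens wrote, it was the best of times, and so it proved.",
    "C": "An unrelated book about gardening and soil.",
}
result = shared_passages(books)
print(result)
```

In this sketch, indexing the shingles first means that only books which actually share a passage ever get paired up, rather than every book being compared directly against every other. A real system would also need to handle OCR errors, common phrases, and multiple editions of the same work, none of which is attempted here.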
This seems particularly relevant to Gruber’s “Collective Knowledge Systems” piece, which recommends pulling semantic information out of the participatory architecture of the social web. Schilit and Google are effectively extracting the most important passages from each book, and so learning what the book is about, by looking through the massive body of text that users (the authors of various other books) have written.
This clearly meets Gruber’s criteria for collective knowledge systems: it takes advantage of user-generated content (the books themselves) and human-machine synergy (human-written books and computerized analysis of them), and it gets all the more powerful at scale. But I think the Google quotations system even reaches the point of Gruber’s “emergent knowledge”: the system can draw non-trivial conclusions about the core subject matter of a book that even skilled cataloguers might find difficult, can link books in ways we might otherwise have missed, and (though I’m not sure we see this in practice yet, it certainly seems possible) can reason over the topics and connections between books to reach completely new conclusions.
I don’t believe that EECS recorded the talk, but it looks like Schilit gave a very similar talk at PARC, which was recorded (there’s a good abstract there too).