I feel lucky re: autotagging
I tried out an experimental automatic-tagging engine tonight. The designer cleverly used Wikipedia’s URIs as a source for a controlled concept/tagging vocabulary! Pretty cool. He full-text indexed Wikipedia (yes, he downloaded it all) with Lucene (an open source search engine) and used its “I feel lucky” feature to make tag/concept associations.
As input, I used the text from Bob’s first slide from the last lecture:
Overview of the Semantic Web RDF OWL A Critical Evaluation of the Semantic Web Semantically-aware systems
Try copying it and pasting it here
There’s too much output to post here, but you get a tiered list of potential tags. I found the third-tier results pretty interesting. I could imagine this being useful (though imperfect) for automatically adding keyword/tag metadata to a document.
So someone could develop a blog plugin that automatically generates tags based on this.
Matt Gedigian Said,
October 15, 2008 @ 12:23 am
Is there a description of what it’s doing? I don’t understand how these pieces are being put together.
Nathaniel Wharton Said,
October 15, 2008 @ 8:18 pm
Hi Matt, (apologies if this gets posted twice — there seems to be some trouble with this showing up in the comments):
I think he’s downloaded all of Wikipedia and then full-text indexed it with Lucene (an open source search engine from Apache). Lucene is built with Java and, just like Google, it has a “get similar sites” feature, called “More Like This”.
I found some of the calls in the API here:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThisQuery.html
or maybe he’s using:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/SimilarityQueries.html
So you could imagine an equivalent search being done in Google with a large number of search terms. Perhaps he passes the whole text in, or perhaps he parses the terms out first (I don’t know). In any event, just like in Google, he gets search results back as URLs. In the case of Wikipedia, the URI has meaning and can potentially be a tag!
Now, since it looks like the site is running Ruby on Rails, he could actually be using a port of Lucene to Ruby, like Ferret:
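To make that guess a bit more concrete, here is a minimal Python sketch of the general idea: score a handful of articles against the input text and turn the best-matching URIs into tags. This is not the site's actual code and it does not use Lucene. The `ARTICLES` corpus, the `STOPWORDS` list, the `suggest_tags` function, and the simple TF-IDF scoring are all assumptions made up for illustration; a real version would query a full-text index of Wikipedia instead.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a full-text index of Wikipedia: URI -> article text.
ARTICLES = {
    "http://en.wikipedia.org/wiki/Semantic_Web":
        "the semantic web extends the web with machine readable data using rdf and owl",
    "http://en.wikipedia.org/wiki/Resource_Description_Framework":
        "rdf is a framework for describing resources as triples on the semantic web",
    "http://en.wikipedia.org/wiki/Web_Ontology_Language":
        "owl is a language for authoring ontologies built on rdf",
    "http://en.wikipedia.org/wiki/Baseball":
        "baseball is a bat and ball game played between two teams",
}

# Assumed stopword list; a real search engine would use a much longer one.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def suggest_tags(text, articles, limit=3):
    """Return (tag, uri) pairs for the articles most similar to the text."""
    doc_terms = {uri: Counter(tokenize(body)) for uri, body in articles.items()}
    doc_freq = Counter()
    for terms in doc_terms.values():
        doc_freq.update(terms.keys())

    query = Counter(tokenize(text))
    scores = {}
    for uri, terms in doc_terms.items():
        score = 0.0
        for term, query_count in query.items():
            if term in terms:
                # Terms found in fewer articles count for more (simple TF-IDF).
                idf = math.log(1 + len(articles) / doc_freq[term])
                score += query_count * terms[term] * idf
        if score > 0:
            scores[uri] = score

    ranked = sorted(scores, key=scores.get, reverse=True)[:limit]
    # The last path segment of a Wikipedia URI doubles as a readable tag.
    return [(uri.rsplit("/", 1)[-1].replace("_", " "), uri) for uri in ranked]


if __name__ == "__main__":
    slide = ("Overview of the Semantic Web RDF OWL A Critical Evaluation "
             "of the Semantic Web Semantically-aware systems")
    for tag, uri in suggest_tags(slide, ARTICLES):
        print(tag, uri)
```

The point of the sketch is only the last step: because each result is a Wikipedia URI, the search results themselves already form a controlled vocabulary, so no separate tag list is needed.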
http://hublog.hubmed.org/archives/001376.html
but I’m just guessing here. If someone else can take other guesses on what’s going on, that would be cool.
nat
Shawna Hein Said,
October 18, 2008 @ 9:08 pm
tagthe.net uses a REST API to return a set of tags based on the provided textual content. It’s pretty cool.