Is the “elusive goal of machine translation” being achieved?

Google Reader recently launched a feature called “Translate in your language,” and it works great. The paper “Elusive goal of machine translation” presented a success story for the statistical approach to language processing. I think Google’s new feature shows further progress in natural language processing.

Since many of you may not have had a chance to try this feature because your language is English, let me demonstrate and assess it as a foreign-language speaker. To demonstrate, I will translate the headlines that Google translated back into English. Then I will compare them with the original headlines.

The four headlines are (from top to bottom):
– The American consumer price and housing depression will begin.
– An American diplomat will be finally buried by Mao. Criticism.
– Argentina opportunity dust had people move outside.
– Obama rapidly oscillates about the pledge for climate change.

Then, let’s see what the original headlines were.

I think that this works pretty well. (The mismatch between translated sentences is partly due to my poor translation ability.)

I never imagined that natural language processing would improve this quickly.


Stopwords analysis in the blogosphere

Jeff Atwood has a post today about stopwords (as we discussed in class yesterday).

He shows the default lists of stopwords that ship with Microsoft SQL Server and Oracle, which are interesting to see, and posts some interesting numbers on frequency of words on the web.  He finds, as we might expect, that many of the most frequent words aren’t normally considered stop words (information, website, download, internet, home, email).  He also links to an interesting Google patent on analyzing when to ignore stop words and when not to.

Again, commenters add to the blog: it looks like Tim Bray did a similar analysis in 2003.  Both note that Google handles searches for “to be or not to be” correctly (though it sounds like the behavior today is better than the behavior in 2003).

I think it bears repeating that the commonness of these words doesn’t seem like a good reason to drop them from indices or search queries.  A word that appears in every document might be useless, but if I can halve the result set with a single word (I’m looking for an email address, say), then the relative frequency of the word “email” doesn’t seem to hurt me much.  Removing stopwords that are unlikely to have semantic value (articles and conjunctions, say) makes more sense to me.
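To make the contrast concrete, here is a minimal Python sketch comparing the two policies: dropping the most frequent words versus dropping only a short list of function words. The word list, the whitespace tokenizer, and the function names are all illustrative assumptions, not the lists that ship with SQL Server or Oracle and not how any search engine actually works.

```python
from collections import Counter

# Hypothetical function-word list (articles, conjunctions, a few prepositions).
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in"}


def tokenize(text):
    """Lowercase and split on whitespace (no punctuation handling)."""
    return text.lower().split()


def frequency_stopwords(docs, top_n):
    """Treat the top_n most frequent words in the collection as stopwords."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {word for word, _ in counts.most_common(top_n)}


def filter_query(query, stopwords):
    """Drop stopwords from a query, but keep the whole query if nothing
    would survive (as with "to be or not to be")."""
    tokens = tokenize(query)
    kept = [tok for tok in tokens if tok not in stopwords]
    return kept or tokens


docs = [
    "the email the download",
    "the email and the home",
    "email the website",
]
by_frequency = frequency_stopwords(docs, top_n=2)  # "the" and "email"
query = "email address of the author"
print(filter_query(query, by_frequency))    # loses the useful word "email"
print(filter_query(query, FUNCTION_WORDS))  # keeps "email"
```

In this toy collection the frequency-based list throws away “email,” which is exactly the word that narrows the search, while the function-word list leaves it alone.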


Rats: pests or pets?

Bob’s mention of the ads for “glushko” in class today reminded me of my own experience with Google ads and disambiguation.

I have pet rats, and when Google was first putting ads in the right column of Gmail, any emails I sent about my pet rats ended up with Google ads for rat exterminators or rat poison. Needless to say, the ads were misplaced and also rather traumatic, as I didn’t want to poison my pets.

But after a few months, I noticed that the mix of ads began to change from exterminators and poison to pet rat food or litter or rescue societies for pet rats. Now they rarely misplace an ad. What I’m wondering is exactly how they changed their formula. I don’t think it was as simple as finding the word “pet” in the same document as the word “rat,” since I generally only refer to my rats as the rats, not as “my pet rats.” But maybe their algorithm was picking up on words like “ratties” or “cute” in my emails to further refine their ad targeting.
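One way to picture the kind of refinement guessed at above is a simple cue-word count per sense. This is only a sketch of that guess: the cue lists, the sense names, and the functions are made up for illustration, and nothing here is known to resemble Google’s actual ad-targeting formula.

```python
import re

# Hypothetical cue words for each sense of "rat"; chosen for illustration only.
SENSE_CUES = {
    "pet": {"pet", "ratties", "cute", "cage", "treats", "litter"},
    "pest": {"infestation", "exterminator", "poison", "trap", "droppings"},
}


def score_senses(text, cues=SENSE_CUES):
    """Count how many tokens in the text are cue words for each sense."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {sense: sum(tok in words for tok in tokens)
            for sense, words in cues.items()}


def pick_sense(text, cues=SENSE_CUES, default="unknown"):
    """Return the sense with the most cue words, or default on a tie or zero."""
    scores = score_senses(text, cues)
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == 0 or list(scores.values()).count(top) > 1:
        return default
    return best


print(pick_sense("The rats were so cute today, the ratties loved their treats"))
print(pick_sense("We found droppings and need a trap for the rats"))
print(pick_sense("I saw rats"))
```

Note that the first email never uses the word “pet” and is still scored as being about pets, which matches the observation that the ads improved even without that word.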

It’s still not on the same level as Svenonius-esque categorization, but at least it’s getting a little closer…


Mining quotations from digital libraries

Last month the CS department hosted an excellent talk by Bill Schilit, a researcher at Google, about how Google has done analysis of every word in every one of its digitized books (yes, it’s N^2, that’s how much processing power Google has) and found every time one book is quoted in another book.

This seems particularly relevant to Gruber’s “Collective Knowledge Systems” piece, which recommends pulling semantic information out of the participatory architecture of the social web.  Schilit and Google are extracting the most important passages from each book, and in effect learning what the book is about, by looking through the massive body of text that users (the authors of other books) have written.

This clearly meets Gruber’s criteria for collective knowledge systems: it takes advantage of user-generated content (the books themselves) and human-machine synergy (human-written books and computerized analysis of them), and it grows more powerful at scale.  But I think the Google quotations system even reaches the point of Gruber’s “emergent knowledge”: the system can draw non-trivial conclusions about the core subject matter of a book that even skilled cataloguers might find difficult, can link books in ways we might otherwise have missed, and (though I’m not sure we see this in practice yet, it certainly seems possible) can reason over the topics and connections between books to reach completely new conclusions.
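For a rough feel of how quotations between books could be found at all, here is a toy Python sketch that looks for word sequences shared by more than one book. It is an assumption-laden illustration, not the method from Schilit’s talk: the five-word window, the whitespace tokenizer, and the function names are invented here, and it uses an index of word sequences rather than comparing every pair of books directly.

```python
from collections import defaultdict


def shingles(text, n=5):
    """Return the set of all n-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(books, n=5):
    """Map each n-word sequence to the titles containing it, keeping only
    sequences that appear in more than one book."""
    index = defaultdict(set)
    for title, text in books.items():
        for shingle in shingles(text, n):
            index[shingle].add(title)
    return {sh: titles for sh, titles in index.items() if len(titles) > 1}


books = {
    "A": "it was the best of times it was the worst of times",
    "B": "as Dickens wrote it was the best of times indeed",
    "C": "nothing in common here at all today",
}
for passage, titles in sorted(shared_passages(books).items()):
    print(passage, sorted(titles))
```

A passage that many other books repeat would show up here with a long list of titles, which is the signal for “most quoted passage” in this sketch.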

I don’t believe that EECS recorded the talk, but it looks like Schilit gave a very similar talk at PARC which was recorded (there’s a good abstract there too).


Good luck if you get locked out of your Gmail account!

http://www.nytimes.com/2008/10/05/business/05digi.html?em

During lecture today Bob mentioned that businesses can’t satisfy all of their customers’ needs.  Some companies, such as Walmart, devote most of their resources to efficiency; Google devotes its resources to organizing information well, and Nordstrom to building a reputation for exceptional customer service.  You just can’t do it all.  But we still want them to.

Knock on wood, it hasn’t happened to me yet.  A lockout can happen not only when you enter an incorrect password several times but also for security reasons, since setting up an account requires minimal personal information.  Either way, it seems that restoring a Gmail account commonly means putting up with delays and frustration.  Google, like competitors Microsoft and Yahoo with their respective premium email services, offers phone support only to customers who subscribe to Google Apps Premier Edition, which costs $50 annually.  With a purported user base in the tens of millions, Google has made a deliberate decision to minimize its costs by not spending resources on customer service.

On the other hand, it seems Netflix has recently (just last year) implemented phone support with enthusiasm and as a result has received top ratings in online retail customer satisfaction. 

Will this sway me away from Google?  Sigh… but no.

 
