Archive for November, 2008

Game to identify image descriptors

Image semantic data entry is now a game!
Kind of a “Captcha or del.icio.us Prisoner’s Dilemma” : ).

http://freakonomics.blogs.nytimes.com/2008/11/05/theres-free-labor-in-video-games/


A Search Engine Tries to Deal with Google’s Sociopolitical Bias

Newsweek reported on a search engine called “Rushmoredrive,” which is trying to cope with the sociopolitical bias of Google. While Google relies on the behavior of the majority, Rushmoredrive “tailors its results to the proclivities” of the African-American community.

Rushmoredrive relies on a technique called “geo-biasing.” With data about the concentration of African-Americans in various ZIP codes, and the IP address of the request, the website is able to present different results for requests from different regions. For instance, a search request for “Whitney” that comes from Atlanta, a region with a large black population, would result in “Whitney Houston” and “Whitney M. Young”. A search request for the same word that comes from a largely white residential area would result in “Whitney Museum”.
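To make the idea concrete, here is a minimal sketch of how a geo-biased ranking could work. This is only my guess at the general shape, not Rushmoredrive's actual method: the ZIP codes, demographic shares, scores, and the function name `geo_biased_ranking` are all made up for illustration, and the IP-to-ZIP lookup step is left out entirely.

```python
# Hypothetical data: share of African-American residents per ZIP code.
# These numbers are invented for illustration only.
ZIP_DEMOGRAPHICS = {"30310": 0.85, "10021": 0.05}

# Hypothetical candidates for the query "whitney". "base" is a generic
# relevance score; "affinity" is an assumed interest score for the
# community the engine is tailoring toward.
CANDIDATES = [
    {"title": "Whitney Museum", "base": 0.90, "affinity": 0.10},
    {"title": "Whitney Houston", "base": 0.80, "affinity": 0.90},
    {"title": "Whitney M. Young", "base": 0.60, "affinity": 0.95},
]


def geo_biased_ranking(candidates, zip_code, weight=1.0):
    """Rank titles by base relevance plus a boost scaled by local demographics."""
    # Unknown ZIP codes get no boost, so the base ranking is used as-is.
    share = ZIP_DEMOGRAPHICS.get(zip_code, 0.0)

    def score(candidate):
        return candidate["base"] + weight * share * candidate["affinity"]

    ranked = sorted(candidates, key=score, reverse=True)
    return [c["title"] for c in ranked]


print(geo_biased_ranking(CANDIDATES, "30310"))
print(geo_biased_ranking(CANDIDATES, "10021"))
```

With these invented numbers, the first ZIP code puts “Whitney Houston” on top and the second puts “Whitney Museum” on top, which mirrors the example from the article.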

Even though the Newsweek article exalted this idea, in my opinion it introduces another bias: what about a white man living in a black region? Is he less worthy of a “tailored result” than a white man in a white region?

Another problem is a flaw common to “personalized” or “tailored” search results: users want to see stable and consistent results from search engines. Mainstream search engines achieve this by keeping the first few results constant and only personalizing the lower-ranked results. This search engine seems to overlook that wisdom and over-tailors its results.
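The “keep the top stable, personalize the rest” approach described above can be sketched in a few lines. This is an illustration under my own assumptions, not any search engine's real code: the function name `personalize`, the `pinned` parameter, and the per-user score function are hypothetical.

```python
def personalize(results, user_score, pinned=3):
    """Keep the first `pinned` results fixed; reorder the rest by user preference.

    `results` is a list already ordered by general relevance, and
    `user_score` maps a result to a per-user preference score.
    """
    head = results[:pinned]
    # sorted() is stable, so results with equal user scores keep their
    # original relative order.
    tail = sorted(results[pinned:], key=user_score, reverse=True)
    return head + tail


# Hypothetical usage: the user has shown interest in result "e" only.
prefs = {"e": 1.0}
print(personalize(["a", "b", "c", "d", "e"], lambda r: prefs.get(r, 0.0)))
```

Every user sees the same first three hits, so the results feel consistent, while the tailoring only shows up further down the page.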

Newsweek story: http://www.newsweek.com/id/136339


Alphabet Economics: The link between names and reputation

In lecture today, Bob mentioned that those with last names that start towards the end of the alphabet have been, and continue to be, put at a disadvantage in various circumstances, a big one being when coauthoring papers.

I originally set out to look for an article I read about a year ago, I think, that studied this general phenomenon and examined differing character traits among people whose last names start at the beginning of the alphabet and those whose names start towards the end. I didn’t end up finding that specific article, but I did come across this one, which specifically studies the link between names and reputation when coauthoring academic publications. The read is a bit dense, but it highlights some interesting points that have real-world implications, such as tenure and salary, and examines other aspects of the name, such as last names with prefixes (e.g. Van der, Von, De La, etc.), two last names, and non-English names. Below is their conclusion:

Authorship on the overwhelming majority of economic coauthored papers is ordered lexicographically, on the basis of the alphabetical ordering of author’s names (Engers et al., 1999). Some people argue that this is beneficial for authors whose names come early in the alphabet, since being the first author implies certain advantages such as greater attention and prestige. Some others cite that having a name that comes early in the alphabet is actually harmful because an A-author can never signal a higher than proportional contribution to a paper. The first objective of this paper was to investigate whether the alphabetical ranking of names affects someone’s reputation. It was found that being a professor of economics and having a last name initial “A” instead of “Z” increases the probability of getting employment at high standing research institutions in the United States. This effect seems to hold when considering economic departments in the United Kingdom and when controlling for nationality and name origin. Furthermore, it was found that having a name ranked earlier in the alphabet increases the probability of being amongst the authors whose work has been downloaded or read the most. One could speculate that the reported relationship is driven by the effect of last name initials on life outcomes and has nothing to do with the convention in economics to order coauthors alphabetically. However, this potential explanation is ruled out since controlling for output and productivity yields very similar results. The second objective was to explore whether the established alphabetical effect creates differential incentives for coauthoring. It turned out that authors are aware of it and in some cases respond by manipulating their names. 
More precisely, it was found that authors whose name has a prefix beginning with “D” tend to use it for the determination of alphabetic name orderings whereas authors whose name has a prefix beginning with “V” tend to omit it when the alphabetical placement is defined. Furthermore, it was found that authors with two last names decide how to record these names based on lexicographical criteria: the higher the distance in the alphabet between the two names the higher the likelihood of using as a first name the one closest to “A”. Finally, we presented some evidence that the alphabetical effect influences Greek authors’ transliterating decisions, though to a much lesser extent than the two author groups previously mentioned—probably due to some common trends among Greek users.


Lib O’Congress on flickr

THE Library of Congress is posting images to flickr.
http://www.flickr.com/people/library_of_congress/

The LOC is uploading images to flickr and inviting viewers to add tags. The goal is to share images, to experiment with socially constructed taxonomies, and to start wading among the people of the tubes.

The LOC is following these general guidelines with respect to annotation of the images they post on flickr:
We placed only one tag (“Library of Congress”) and two machine tags on each photo when we loaded them. Any other tags you see were added by the community; we are generally not controlling the content of Flickr tags, notes and comments, but we reserve the right to remove added content for any reason.

The project has been a success, according to the LOC: many people have participated in annotating the images with comments, tags, and notes. Here’s an example of an image that has been viewed more than 85,000 times and has many annotations from viewers: http://www.flickr.com/photos/library_of_congress/2179930812/in/set-72157603671370361/


Search, Facet, and Filtering Examples

Konigi is a User Experience Design site that features interesting interfaces, with a handful of features on searches, filtering, and faceted navigation.
http://konigi.com/

A couple of sites they’ve featured:

Kayak.com, my favorite travel search interface
http://konigi.com/interface/kayak-filtering

FanSnap, an event ticket site
http://konigi.com/interface/fansnap-search-results-filtering

Also, Cookstr is a recipe site that has a ton of interesting facets once you search or click a category: cuisine, cost, dietary considerations, kid friendly, holiday… Much cleaner and easier to use than other recipe sites I’ve played with.
http://www.cookstr.com/recipes
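For anyone curious what faceted filtering boils down to, here is a rough sketch. It is not how Cookstr or Kayak actually implement it; the recipe data, facet names, and both function names are invented for illustration.

```python
from collections import Counter

# Hypothetical recipe records with a few facets attached.
RECIPES = [
    {"name": "Pad Thai", "cuisine": "Thai", "cost": "low", "kid_friendly": True},
    {"name": "Beef Wellington", "cuisine": "British", "cost": "high", "kid_friendly": False},
    {"name": "Green Curry", "cuisine": "Thai", "cost": "low", "kid_friendly": False},
]


def filter_by_facets(items, **selected):
    """Return the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of a facet.

    These are the numbers a faceted interface shows next to each filter link.
    """
    return Counter(item[facet] for item in items)


thai = filter_by_facets(RECIPES, cuisine="Thai")
print([r["name"] for r in thai])
print(facet_counts(thai, "kid_friendly"))
```

Each click on a facet narrows the result set, and the counts are recomputed on whatever is left, which is what lets you see where a filter leads before you click it.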

Doesn’t it just make you happy when a company gets search right?


Google’s Search Wiki

Hey, anybody notice the new goodies in your Google search results? You can manipulate the rankings of returned hits by using the quiet little arrow and x buttons that now follow each hit’s title. I’m guessing that, besides allowing customization, this is a new way to train relevance calculations. Using click-throughs alone to test relevance would include a certain percentage of pages that appeared from the results page to be useful but turned out not to be what the user was after. It would also necessarily include side trips initiated by curiosity rather than search refinement. Now, with a feature that explicitly allows users to refine how pages rank in their own results, Google is getting information much more clearly tied to what users consider to be relevant to a given search. Since human relevance rankings are the standard for training automated relevance rankings, this seems to be a win for Google as well as its users.
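Here is a small sketch of the two uses I am guessing at: reordering one user's results from their explicit promote/remove actions, and turning those same actions into labeled relevance data. This is purely my own illustration, not Google's implementation; the function names and the label format are assumptions.

```python
def apply_feedback(results, promoted=(), removed=()):
    """Reorder a result list using one user's explicit feedback.

    Promoted URLs move to the top (in the order they were promoted),
    removed URLs are dropped, and everything else keeps its original order.
    """
    removed = set(removed)
    top = [url for url in promoted if url in results and url not in removed]
    rest = [url for url in results if url not in removed and url not in top]
    return top + rest


def to_training_labels(query, promoted=(), removed=()):
    """Turn the same feedback into (query, url, label) relevance judgments.

    Label 1 means the user marked the hit as relevant, 0 as not relevant.
    """
    positives = [(query, url, 1) for url in promoted]
    negatives = [(query, url, 0) for url in removed]
    return positives + negatives


print(apply_feedback(["a", "b", "c", "d"], promoted=["c"], removed=["a"]))
print(to_training_labels("whitney", promoted=["c"], removed=["a"]))
```

Unlike a click-through, each of these labels reflects a deliberate judgment, which is why I suspect it is cleaner training data.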


Google, IR, and Preference Data

As I remarked in my comment to Annette’s post (above) on the Google Search Wiki, in order to personalize a search and persist it across time and location, an individual’s preference data must be stored by the search engine provider.  This data is used in ways we might not think about.  There was an article published on Friday (11/21/08) by MIT’s Technology Review site (http://www.technologyreview.com/blog/editors/22202/) that discusses how Google will soon leverage its search interface features and IR data to offer targeted advertising services in the television arena (similar to its web-based counterpart, AdWords).  The new service will allow advertisers to search for shows based on audience demographics, spending habits, and potentially a number of other facets (who knows what data Google’s gathering about our search preferences).  It is important to recognize that our web search activities are often recorded and re-purposed to varying degrees.  I wonder how Search Wiki will later be used to further profile individuals and populations in profitable ways.  Is Google perhaps the world’s most powerful data broker?  The article concludes by summarizing the words of Google’s TV Ads product manager, Keval Desai, saying:

“A satellite-TV company called Echostar, working with credit-reporting company Equifax, will cross-reference shows watched (using its own data from set-top boxes) with income and buying habits (using Equifax’s data). This will let Google offer shows to advertisers that will reach, for example, people with household incomes greater than $100,000. Desai stresses that all this data is made anonymous, so it certainly won’t be possible to target specific households with ads.  I wonder how long we’ll have to wait for that.”


Metaphors and the Future

I found this blog post, BookWeb vs. GeoWeb, to be a nice, simple reminder (passed on via the ever-interesting Mano) of the guiding metaphor of the web and how that is being problematized in certain domains. For me as a web developer, the days of creating single, static pages are largely gone, but the experience of interacting with pages via the browser is not, regardless of whether they were generated dynamically. It got me thinking, as our class dives into the more technical IR section focusing on the analysis and retrieval of documents, about how we may engage with information that doesn’t conform so easily to that metaphor.

In August, the design firm Adaptive Path released a concept video series portraying “…a plausible vision of how technology, the browser, and the Web might evolve in the future by depicting that experience in a variety of real-world contexts.” (Aurora, pt. 1) Whatever your reaction, it is interesting to see echoes of our discussions about the semantic web perpetuated here, as well as to count how many elements of the scenario are already observable in our present-day web, although perhaps in nascent form.

Also, there was recently a demo from the University of Washington showing a not-yet-released project aimed at exploring the temporal dimensions of the web, capturing and visualizing what are often ephemeral flows of information. (Zeotrope: Web crawler archives historical data for easy searching) Watch the video for the full effect.

The last thing I wanted to add is that after recently hearing a couple of fantastic talks on the work being done in non-desktop environments (and the challenges it presents), one by our own Kimiko Ryokai, and another by Mirjana Spasojevic, a Nokia Researcher, both hosted by the Berkeley Center for New Media, I wonder how a shift in the devices we use to interact with information will spur the reformulation of the primary metaphors we use to frame our experiences.


Plagiarism Detection

My brother works for an Oakland-based software company called iParadigms, which does plagiarism detection. According to their website, they use “document source analysis,” which uses computer algorithms to create “fingerprints” of documents and then compares them against each other. It’s really effective and is in use by many universities as well as by publishing and legal companies. I’m not sure how their technology compares to the latent semantic analysis that we talked about yesterday, but I just wanted to point out that plagiarism is a big problem, and the success of this company demonstrates the usefulness and relevance of plagiarism detection. The company continues to do well even during the economic downturn, as more people are choosing to go back to school, which can only mean that there will be continued need for plagiarism detection services.
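I don’t know what iParadigms actually does under the hood, but one common, generic way to “fingerprint” documents is to hash overlapping word sequences (shingles) and measure how many hashes two documents share. The sketch below uses that swapped-in technique with Jaccard similarity; the function names, the shingle size `k`, and the choice of SHA-1 are all my own assumptions.

```python
import hashlib
import re


def fingerprint(text, k=3):
    """Return a set of short hashes, one per run of k consecutive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Keep only the first 8 hex digits of each hash to make a compact fingerprint.
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:8] for s in shingles}


def similarity(doc_a, doc_b, k=3):
    """Jaccard similarity of two fingerprints: shared hashes over all hashes."""
    fa, fb = fingerprint(doc_a, k), fingerprint(doc_b, k)
    if not fa or not fb:
        return 0.0  # Too short to fingerprint.
    return len(fa & fb) / len(fa | fb)


original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
print(similarity(original, suspect))
```

Changing one word only disturbs the shingles that contain it, so a lightly edited copy still shares a large part of its fingerprint with the source, while unrelated documents share almost nothing.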


Is “elusive goal of machine translation” being achieved?

Google Reader recently launched a feature called “Translate in your language,” and it works great. In the paper “Elusive goal of machine translation,” the author presented a success story for the statistical approach to language processing. I think this new Google feature shows another improvement in natural language processing.

Since many of you might not have had a chance to try this feature because your language is English, let me demonstrate and assess it as a foreign-language speaker. To demonstrate, I will translate Google’s translated headlines back into English. Then I will compare them with the original headlines.

The four headlines are (from top to bottom):
- The American consumer price and housing depression will begin.
- An American diplomat will be finally buried by Mao. Criticism.
- Argentina opportunity dust had people move outside.
- Obama rapidly oscillates about the pledge for climate change.

Then, let’s see what the original headlines were.

I think that this works pretty well. (The mismatch between translated sentences is partly due to my poor translation ability.)

I never imagined that natural language processing could improve this quickly.

