Search Flickr by Color

Searching for all the photos on Flickr that are tagged “red” is old hat. Besides, searching for colors in tags is fraught with problems: people don’t have the patience to tag their photos exhaustively with all the colors in them, people may not be able to distinguish all the colors in a photo, and worse, they may be “wrong” about the colors. After all, your red is my pink. (If you want to get philosophical, check out the inverted spectrum problem, though this doesn’t pose a problem for Flickr tagging.)

An obvious approach is to tag photos with all their colors algorithmically. We can scan photos for colors and tag any picture with lots of #ff0000 as “red”. Users who search for red will retrieve these results. This approach would be consistent, but it is still open to the problem of disagreement about colors: someone still has to define red in the computation. In terms from a recent 202 lecture, a semantic gap remains between the photo and the metadata used to describe (and consequently retrieve) it.
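
The tagging approach described above can be sketched in a few lines. This is a minimal illustration, not any real Flickr tool: the five-color palette and the 30% coverage threshold are made-up choices (and exactly where my choices draw the line is the “someone still has to define red” problem), and a photo is represented as a flat list of RGB tuples.

```python
from collections import Counter

# An illustrative palette; the exact RGB values chosen here ARE the
# contestable definition of each color name.
PALETTE = {
    "red":   (255, 0, 0),
    "green": (0, 128, 0),
    "blue":  (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}

def nearest_color(pixel):
    """Return the palette name closest (in squared RGB distance) to a pixel."""
    return min(
        PALETTE,
        key=lambda name: sum((p - c) ** 2 for p, c in zip(pixel, PALETTE[name])),
    )

def color_tags(pixels, threshold=0.3):
    """Tag a photo (a flat list of RGB pixels) with every color that
    covers at least `threshold` of its pixels."""
    counts = Counter(nearest_color(p) for p in pixels)
    return {name for name, n in counts.items() if n / len(pixels) >= threshold}
```

A mostly red photo with a blue patch would come back tagged with both colors, consistently, every time; consistency is the one thing this approach does buy you.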

A solution to this problem is to search using criteria at the same semantic level that you require in your results. Idée has implemented this idea with its Multicolr interface for searching Flickr. You select a color and see pictures that contain that color. Using Multicolr is mesmerizing because you can adjust your search criteria to encompass multiple colors and see results matching your search. Selecting the same color multiple times (i.e., the equivalent of “redred”) increases its intensity in your search.
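
One way the “redred” behavior might work is to treat the query as a weighted list of colors, so that repeating a color doubles its weight. This is a hypothetical scoring scheme, not Idée’s actual algorithm; photos here are represented by precomputed color-proportion histograms.

```python
from collections import Counter

def query_weights(colors):
    """Turn ["red", "red", "blue"] into {"red": 2/3, "blue": 1/3}."""
    counts = Counter(colors)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def score(photo_histogram, query):
    """Higher is better: overlap between a photo's color proportions
    and the query's weights."""
    weights = query_weights(query)
    return sum(min(photo_histogram.get(c, 0.0), w) for c, w in weights.items())

def search(photos, query, top=3):
    """Rank photos (a dict of name -> color histogram) against the query."""
    return sorted(photos, key=lambda name: score(photos[name], query), reverse=True)[:top]
```

With a query of ["red", "red", "blue"], a mostly red sunset outranks a mostly blue sky shot, which matches the “more red, more intensity” behavior the interface exhibits.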

Textual search is likely to remain our primary means of retrieval for the foreseeable future, since so much of our discourse is word-dominated, but this is an example of the frontiers of IR.

Comments off

Mining quotations from digital libraries

Last month the CS department hosted an excellent talk by Bill Schilit, a researcher at Google, about how Google has analyzed every word in every one of its digitized books (yes, it’s N², that’s how much processing power Google has) and found every time one book is quoted in another book.
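
A toy version of quote mining (Google’s real system is far more sophisticated about near-matches and noise, and this sketch is just my illustration): index every run of k consecutive words in each book, then report runs that appear in more than one book as candidate quotations. Note that building an index like this avoids literally comparing all N² pairs of books.

```python
from collections import defaultdict

def shingles(text, k=5):
    """All runs of k consecutive (lowercased) words in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def find_quotations(books, k=5):
    """books: dict of title -> text. Returns each shared word-run mapped
    to the set of titles that contain it."""
    index = defaultdict(set)
    for title, text in books.items():
        for s in shingles(text, k):
            index[s].add(title)
    return {s: titles for s, titles in index.items() if len(titles) > 1}
```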

This seems particularly relevant to Gruber’s “Collective Knowledge Systems” piece, which recommends pulling semantic information out of the participatory architecture of the social web. Schilit and Google are effectively extracting the most important passages from each book, in effect learning what each book is about, by mining the massive amount of text that users (the authors of other books) have written.

This obviously meets Gruber’s criteria for collective knowledge systems: it takes advantage of user-generated content (the books themselves) and human-machine synergy (human-written books and computerized analysis of them), and it gets all the more powerful at scale. But I think the Google quotations system even reaches Gruber’s “emergent knowledge”: the system can draw non-trivial conclusions about the core subject matter of a book that even skilled cataloguers might find difficult, can link books in ways we might otherwise have missed, and (though I’m not sure we see this in practice yet, it certainly seems possible) can reason over the topics and connections between books to reach completely new conclusions.

I don’t believe that EECS recorded the talk, but it looks like Schilit gave a very similar talk at PARC which was recorded (there’s a good abstract there too).

Comments (2)

Weinberger Needs Statisticians

I’ve always wondered how Weinberger could get meaningful information out of his “huge pile”, and in his interview with Doctorow[1], Weinberger mentioned a way to make use of it: statistical analysis. This is what he said:

“Tags are chaos, and as you get more and more of them, it will get more and more chaotic.  It turns out that when you have a lot of them, the statistical analysis becomes really pretty precise.”

This reminds me of a paper I’ve read, “Towards Extracting Flickr Tag Semantics,” written by researchers at Yahoo! Research Berkeley and published at WWW 2007 [2]. The method described in the paper can identify “place tags” and “event tags” among the tags stored in Flickr. For instance, the authors could “detect that the tag Bay Bridge describes a place, and that the tag WWW2007 is an event.” (WWW2007 is a conference held in Canada in 2007.)

How did they do that? The main idea is that a “place tag” like Bay Bridge has significant spatial patterns, tending to concentrate within a certain geographic range, while an “event tag” like a conference has significant temporal patterns, tending to appear around a certain time period. So by using preexisting spatial and temporal statistical methods, computer scientists are able to discover the “semantics” of Flickr tags.
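
To make the idea concrete, here is a deliberately naive sketch of it (the paper’s actual methods are more sophisticated statistical tests, and the thresholds here are invented): a tag whose photos cluster tightly in space looks like a place, and one whose photos cluster tightly in time looks like an event.

```python
from statistics import pstdev

def spread(values):
    """Population standard deviation, or 0 for a single observation."""
    return pstdev(values) if len(values) > 1 else 0.0

def classify_tag(photos, space_thresh=0.1, time_thresh=7.0):
    """photos: list of (lat, lon, day_number) for one tag.
    Returns a set containing "place" and/or "event" labels."""
    lats = [p[0] for p in photos]
    lons = [p[1] for p in photos]
    days = [p[2] for p in photos]
    labels = set()
    if spread(lats) < space_thresh and spread(lons) < space_thresh:
        labels.add("place")     # geographically concentrated
    if spread(days) < time_thresh:
        labels.add("event")     # temporally concentrated
    return labels
```

Photos tagged Bay Bridge are taken year-round at the same coordinates, so only the spatial test fires; photos tagged WWW2007 were taken in the same week at attendees’ varied home-and-travel locations, so only the temporal test fires.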

All in all, statistical analysis can help Weinberger make use of the huge amount of information, and it may also serve as a “filter” to deal with information overload problems.



[1] Metacrap and Flickr Tags: An Interview with Cory Doctorow,

[2] Towards Extracting Flickr Tag Semantics,

Comments off

New Method for Building Multilingual Ontologies

Researchers from the Validation and Business Applications Group based at the Universidad Politécnica de Madrid’s School of Computing (FIUPM) have developed a new method for building multilingual ontologies that can be applied to the Semantic Web.

So ontologies are the cool thing to be developing these days, given the promise of the Semantic Web looming over us. But up until yesterday, a big limitation of ontologies was that they were relatively single-minded when it came to language. “The application of ontologies to the Internet comes up against serious problems triggered by linguistic breadth and diversity. This diversity stands in the way of users making intelligent use of the web.”

People have tried to bridge the gap, but strategies like expert-based terminology (ahem, Svenonius, ahem) and using one language as the “pivot” have failed miserably. But these researchers claim to have created a method for building ontologies IRRESPECTIVE of language. And their secret weapons appear to be universal words and the assumption that “any text has implicit ontological relations that can be extracted by analysing certain grammatical structures of the sentences making up the text”. (I mean, I could’ve told them that, but whatever.)
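
To give a flavor of what “extracting relations from grammatical structures” can mean in the simplest case (the Madrid group’s method is far richer and works across languages; this toy only handles one English pattern): the phrase “X is a Y” implicitly asserts an is-a relation between X and Y.

```python
import re

# One Hearst-style lexical pattern: "<word> is a/an <word>".
IS_A = re.compile(r"\b(\w+) is an? (\w+)")

def extract_relations(text):
    """Return (subclass, superclass) pairs found via the 'X is a Y' pattern."""
    return [(m.group(1).lower(), m.group(2).lower()) for m in IS_A.finditer(text)]
```

A real system would need parsing rather than regexes, many more patterns, and a way to map the extracted terms onto language-independent “universal words”, which is where the hard work is.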

Interesting stuff, and will probably be even more interesting when I finally grasp what an ontology actually is. 😛 (just kidding) (sort of)

Comments off

Capitol Strives to Define “Homeless”

NYTimes, 15 September, 2008

So the heated discussion of choice a few days ago in our nation’s capital was apparently how to define ‘homeless’. For the last 20+ years, ‘homeless’ meant “only people living on the streets or in shelters”. But given the high-and-getting-higher foreclosure and unemployment rates, the Hill is debating whether or not to expand that definition.

New expansions of the existing definition under consideration are:

1) to include the ‘precariously housed’ (living with friends, couch-to-couch, day-to-day hotels, etc)

2) just to include the smaller number of people who have fled due to domestic violence

3) to include “only those forced to move three times in one year or twice in 21 days”

(Obviously we have some variance in specificity here.)

The definition is important because whoever qualifies as ‘homeless’ is eligible for aid, shelter and housing assistance from the Department of Housing and Urban Development.

That said, in a typical DC move, none of the bills says anything about increasing funding.  The current budget ($1.7MM) can’t come close to providing enough/adequate resources for the people falling under the current definition of ‘homeless’.  So while expanding the definition seemingly demonstrates concern and goodwill, instead of a semantic debate, they should be talking about actions/solutions to actually care for these people.

(And of course it is turning into a Democrat/Republican flame war.  I would paraphrase but you know the drill…)

Two additional thoughts:

  • I think I may have lived couch to couch at some point in my younger years.  That definition might need some fine-tuning to avoid sweeping in every 22-year-old in the country.
  • I don’t miss DC at all.

Comments (4)

RDFa: friend or pita?

RDFa has just become a W3C proposed recommendation.  It’s potentially cool/useful, because it may surmount some semantic barriers to automation that concerned Svenonius.

Similar to microformats, RDFa describes a syntax for embedding semantic meaning inside XHTML. It could be useful for us because most web pages today include (X)HTML (syntax) but don’t have a mechanism for embedding clear meanings (semantics) for elements within the page, which might be picked up by search engines or browsers to return more precise and/or relevant results. RDFa doesn’t seem overly complicated, either, which is something that frequently gets screwed up.

How might one of us use this? When you author an XHTML page, point to a very specific vocabulary document on the web (you can borrow someone else’s that has already been created) and then add as many “triple” statements as you’d like to describe fields within <span></span> tags in terms of that vocabulary.  An RDFa triple is composed of a subject, a predicate, and an object.  Examples of RDFa triples are “Nat [subject] is a [predicate] Person [object]” and “Nat [subject] hates [predicate] homework [object]”. ; )  Will RDFa be widely accepted?  I have no idea.  I kind of like the idea of microformats for encapsulating semantic meaning, too.
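
Those two example triples might look something like this in RDFa markup. This is just a sketch: foaf: is the real FOAF vocabulary, but ex:hates is a made-up property standing in for whatever vocabulary document you point at.

```html
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     xmlns:ex="http://example.org/vocab#"
     about="#nat" typeof="foaf:Person">   <!-- triple: Nat is-a Person -->
  <span property="foaf:name">Nat</span>
  hates
  <span rel="ex:hates" resource="#homework">homework</span>.
  <!-- triple: Nat hates homework -->
</div>
```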


Comments (2)

Which “Class” of “Middle Class” Are You Part Of?

This Pew Research Center article, entitled “America’s Four Middle Classes,” has little explicitly to do with IR technology. However, it does feature a new model for categorizing the “American Middle Class,” which I think is a useful example for a discussion of the ways in which redefining data categories can provide new insights and sweep away widely held myths. The report describes how researchers used social survey data to segment people who self-identify as “Middle Class” into four new categories that describe financial stability: Top of the Class, Satisfied Middle, Struggling Middle, and Anxious Middle. It demonstrates that within the self-identified category of “Middle Class” there is great variation in financial status, from relative economic comfort to the potential for financial hardship. I was personally drawn to this report because it demonstrates that a simple recategorization of data can lead to sweeping changes in social perceptions.

Here is an excerpt from the article: Life is considerably tougher for the Struggling Middle, a group disproportionately composed of women and minorities. In fact, many members of the Struggling Middle have more in common with the lower class than they do with those in the other three groups and actually have a lower median family income than Americans who put themselves on the lowest rungs of the social ladder. About one-in-six self-identified middle class Americans fall into the Struggling Middle.

Rest of article here. Full, 19-page PDF report of this project available here.

Relevant lectures: 5. CONCEPTS & CATEGORIES (9/15)

Comments (2)

Open Secrets

“Open Secrets: Enron, intelligence, and the perils of too much information” by Malcolm Gladwell in the January 8, 2007 New Yorker.

There are puzzles and there are mysteries. You solve puzzles by finding the missing information; with mysteries, the problem is that you have too much information, and solving them requires analysis. In this article, Gladwell applies this paradigm from Gregory Treverton to Enron’s collapse (and also to the hunt for Osama bin Laden, Watergate, Nazi propaganda in WWII, and cancer). He disputes the widely accepted premise that Enron withheld information on its dubious practices. In reality, Enron disclosed nearly everything, and analysts failed to understand the sea of data.

Today’s problems force us to re-examine the human element of information processing. Though the amount of information is paralyzing and noise is high, “the complex, uncertain issues that the modern world throws at us require the mystery paradigm.”

This article fits best with the 9/3 lecture on issues and contexts.

Comments off

Fixing Broken Ballots

How Design Can Save Democracy
The New York Times, August 25, 2008.

Interactive Feature: Problems/Solutions in Ballot Design

In recent years, there has been controversy over election ballot designs that confuse both voters and vote-counters. (Remember butterfly ballots?) Unfortunately, voting technology and ballot design are not standardized or consistent, and they vary wildly across the country. Setting aside the whole other issue of electronic voting security, there are still many problems with ballots that use confusing language and layout, as well as small, hard-to-read print. These are especially problematic for people with visual impairments or those whose first language is not English.

Fortunately, the United States Election Assistance Commission created ballot design guidelines earlier this year. Following a guide that improves clarity in both language and design should reduce voter confusion, and will hopefully reduce problems with vote accuracy.

Local governments often have very limited funding, and it’s challenging to design forms that are clear to the hugely diverse population of “Americans 18 and older.” However, it seems to me that this is a case where budgeting for some extra thought and effort in the initial design can prevent many problems and their related costs later.

Relates to lectures:
3. Organization {and, or, vs} Retrieval
7. Controlled names and vocabularies
12. Enterprise/institutional categorization & standards
19. Information organization in user interfaces

Comments (1)

Dead Sea Scrolls Now Available Online

Dead Sea Scrolls Go from Parchment to the Internet, Ben Wedeman, CNN, 27 August 2008

Israel’s Antiquities Authority’s decision to put the 2,000-year-old Dead Sea Scrolls online reflects several aspects of information organization and retrieval. The Dead Sea Scrolls are the oldest known Hebrew manuscripts of the Bible, dating back to the 2nd century BC. Their significance is immense in more than one way. When first found in 1947, the text introduced a previously unknown Jewish sect that took refuge in the Dead Sea caves. Moreover, some researchers claim that the scrolls include parts of the earliest New Testament.


These academic controversies carry political and religious tension in a region where archaeological findings enhance national and ideological narratives and provide ammunition in the fight over political legitimacy. As the CNN anchor notes in the video segment, taking the scrolls online is an attempt to put an end to the persistent belief that the Authority, or Israel, has been using its hold on the scrolls to hide information from the world. The head of conservation for the Antiquities Authority hopes that enabling people around the world to view them online will bring the monotheistic religions together; however, extremists may find this an opportunity to create additional feuds.


It would be interesting to see how the project tackles the following issues: Who controls the information? How is it shared? One of the main challenges would be deciding what type of information to add to the images once they are posted. Numerous scholars and PhD students have been working on translating and interpreting the text (which is mostly in Hebrew, plus some Aramaic and Greek). Would the Authority provide access to ALL related research, even works that contradict the Authority’s version? The obstacles that word usage presents to information organization would be compounded by the gaps between Modern Hebrew and the ancient Hebrew used in the text. Technical issues are just as important: What would the user experience of accessing the information be? For example, would the languages used require users to install fonts?


Related lectures:

7: Controlled Names and Vocabularies

8: Classification

13: The Semantic Web

19: Information Organization in User Interfaces


Comments off
