XML was not always a silver bullet.

I wrote a simulator for another class this semester and tried to use an XML format for its input file. The simulator first reads graph topology information and then has to parse it. Compare the two formats below, which describe the same graph.

:: GraphML (Standard XML format for describing graph data structure) ::

<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd" >
<graph edgedefault="undirected" parse.nodes="10000" parse.edges="20000">
<node id="0" />
<node id="1" />
<node id="2" />
<node id="9997" />
<node id="9998" />
<node id="9999" />
<edge source="2" target="1" />
<edge source="2" target="0" />
<edge source="3" target="1" />
<edge source="0" target="8068" />
<edge source="1" target="9731" />
<edge source="1" target="5549" />

:: Normal text format ::

Topology: ( 10000 Nodes, 20000 Edges )
Model (1 – RTWaxman)

Nodes: ( 10000 )

Edges: ( 20000 )
0    2    1
1    2    0
2    3    1
19997    0    8068
19998    1    9731
19999    1    5549

The second format was much better in terms of both file size and parsing speed. The XML format spends too much on wrapping the data in structured metadata. When data will only be used in a limited domain, the cost of structuring and standardizing it can outweigh the benefit.
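To make the comparison concrete, here is a minimal Python sketch of reading the edges from each format. It is not the simulator's actual parser. The function names and the tiny sample inputs are hypothetical, and I am assuming the three columns in the plain format are edge id, source node, and target node, which is what the two listings above suggest.

```python
import xml.etree.ElementTree as ET

# Namespace that GraphML elements live in (taken from the xmlns attribute above).
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hypothetical four-node, three-edge samples in the two formats.
PLAIN_SAMPLE = """\
Topology: ( 4 Nodes, 3 Edges )
Model (1 - RTWaxman)

Nodes: ( 4 )

Edges: ( 3 )
0    2    1
1    2    0
2    3    1
"""

GRAPHML_SAMPLE = """\
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
<graph edgedefault="undirected">
<node id="0" />
<node id="1" />
<node id="2" />
<node id="3" />
<edge source="2" target="1" />
<edge source="2" target="0" />
<edge source="3" target="1" />
</graph>
</graphml>
"""


def parse_plain_edges(text):
    """Return (source, target) pairs from the 'Edges:' section of the plain format."""
    edges = []
    in_edges = False
    for line in text.splitlines():
        if line.startswith("Edges:"):
            in_edges = True
            continue
        fields = line.split()
        if in_edges and len(fields) == 3:
            # Assumed column order: edge id, source node, target node.
            _edge_id, source, target = map(int, fields)
            edges.append((source, target))
    return edges


def parse_graphml_edges(text):
    """Return (source, target) pairs from every <edge> element in a GraphML document."""
    root = ET.fromstring(text)
    return [
        (int(edge.get("source")), int(edge.get("target")))
        for edge in root.iter(GRAPHML_NS + "edge")
    ]
```

Both functions return the same list of edges for these samples, but the plain sample is well under half the size of the GraphML one, and the plain parser only needs to split each line on whitespace.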

This case reminded me of Svenonius’s warning that “putting infinite number of metadata to data is economically impossible,” although my case did not involve “infinite” amounts of metadata. Anyway, I experienced the tradeoff between IO and IR again.

Annotating Video at NFL Films

Related to today’s lecture on multimedia information retrieval, last year Wired magazine featured a short article on NFL Films’ efforts to annotate its video collection. Started in 1962, NFL Films has 10,000 canisters of 16mm film – 110 terabytes of data. NFL Films uses a structured, manual tagging system – team, date, yardage, etc. It also uses a free-form tagging system, where loggers can enter a highlight’s nickname, such as “The Catch” with Joe Montana and Dwight Clark. It’s a quick read.


Online news standardization


I came across this somewhat old story about a project undertaken by Tim Berners-Lee. The project aims to build a system that attaches important metadata to news stories, such as the journalist’s profile and the way the story was created. This richer metadata set would lend more credibility to the news story and also provide more retrieval options.

It also aims to standardize the way this information can be embedded in the news stories to help news aggregators provide us with more accurate and meaningful news. 

“They can be buried anywhere – the first or second paragraph, the beginning of the story, or even the end,” he said. “It just seemed incredible that of all the basic information you might want to know about a story, even such basic things as who wrote it and for who, is extremely hard to get at the moment.”

Search Flickr by Color

Searching for all the photos on Flickr that are tagged “red” is old hat. Besides, searching for colors in tags is fraught with problems: people don’t have the patience to tag their photos exhaustively with all the colors in them, people may not be able to distinguish all the colors in a photo, and worse, they may be “wrong” about the colors. After all, your red is my pink. (If you want to get philosophical, check out the inverted spectrum problem, though this doesn’t pose a problem for Flickr tagging.)

An obvious approach is to tag photos with all their colors algorithmically. We can scan photos for colors and tag any picture with lots of #ff0000 “red”. Users who search for red will retrieve these results. This approach would be consistent, but it is still open to the problem of disagreement about colors–someone still has to define red in the computation. In terms from a recent 202 lecture, a semantic gap remains between the photo and the metadata used to describe (and consequently retrieve) it.
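Here is a rough Python sketch of what such algorithmic tagging might look like, just to show where the disagreement creeps back in. Everything in it is hypothetical: the function names, the hue, saturation, and brightness cutoffs that define “red”, and the 30% threshold for “lots of” red are my own assumptions, not anything Flickr does. Pixels are plain (r, g, b) tuples so no image library is needed.

```python
import colorsys


def is_red(rgb, hue_tolerance_deg=15, min_saturation=0.5, min_value=0.3):
    """Decide whether one (r, g, b) pixel counts as "red".

    The cutoffs are arbitrary assumptions; choosing them is exactly the
    step where someone has to define red.
    """
    r, g, b = (channel / 255 for channel in rgb)
    hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    hue_deg = hue * 360
    # Red sits at 0 degrees on the hue circle, so measure distance around the circle.
    distance_from_red = min(hue_deg, 360 - hue_deg)
    return (
        distance_from_red <= hue_tolerance_deg
        and saturation >= min_saturation
        and value >= min_value
    )


def tag_if_red(pixels, min_fraction=0.3, red_test=is_red):
    """Return ["red"] if at least min_fraction of the pixels pass red_test."""
    if not pixels:
        return []
    red_count = sum(1 for pixel in pixels if red_test(pixel))
    return ["red"] if red_count / len(pixels) >= min_fraction else []
```

With these defaults, pure #ff0000 counts as red but a pink like (255, 192, 203) does not, because it is not saturated enough. Lower min_saturation to 0.2 and the same pink suddenly is red, so the “your red is my pink” problem has only moved from the tagger to whoever picks the numbers.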

A solution to this problem is to search using criteria at the same semantic level that you require in your results. Idée has implemented this idea with its Multicolr interface for searching Flickr. You select a color and see pictures that contain that color. Using Multicolr is mesmerizing because you can adjust your search criteria to encompass multiple colors and see results matching your search. Selecting the same color multiple times (i.e. the equivalent of “redred”) increases its intensity in your search.

Textual search is likely to remain our primary means of retrieval for the foreseeable future–so much of our discourse is word-dominated–but this is an example of the frontiers of IR.

In continuation with Nick’s very valuable info on ‘NY Times tags API’


Jacob Harris highlights the importance of metadata in the news industry. And they have been using it since 1851, phew!

On a different note, the following excerpt (from this article) touches upon the “automation vs. manual” tradeoff discussed in today’s class.

“Still my snarky aside has truth to it: people are ultimately controlling the process. In the beginning, rules for the automatic extraction and tagging are set by an Information Architect. In the end, final approval and correction of suggested metadata is done by various Web producers before publication. Web producers also do the important job of accurately summarizing the story. So, while we have machines to help out the process, it’s still ultimately a human endeavor, largely because automated summarization and classification has its problems.”

NYTimes TimesTags API

The New York Times has created an API against their “taxonomy and controlled vocabulary used by Times indexers since 1851”.  Send their API a word and the NYTimes will send back a list of the most common relevant tags (and whether it’s a Person, Description, Organization or Location).  

Why create our own structured vocabulary when highly trained people have been doing it since 1851 and we can borrow theirs?

Tools and services for PIM

This is more of a link dump, but PIM is one of my favourite areas, and a lot of questions that I have explored came up in class today. I thought my fellow students would find these applications, technologies, and concepts interesting. These are all things that I use or have used at one point.

Rescue Time — for passive recording of on-computer activity (active application, with tagging/productivity scoring)
Cluztr — tracking (and publishing) all web pages visited
Attention Recorder — tracking all web pages visited
IPTC tagging — I’d call this one of the most underused technologies for PIM. Various apps available, add-ons for iPhoto, ACDSee, etc. Keep your descriptions, captions, photographer, tags, etc WITH your photos, so they are on your local copy and also added when you upload to flickr (only caveat is that they’re lost on edit).
Google Desktop — index/search of email, chat logs, web visits, etc across multiple computers
ScheduleWorld — a SyncML/Funambol service to synchronize calendar/to-do/contacts across multiple devices/people/apps
Wakoopa — tracking software usage
PhoneTag — voicemail-to-text transcription
EarthClassMail — have all your snail mail go to a central location and get it scanned online for you
RingCentral — virtual PBX to centralize and easily access phone numbers/voice mail anywhere

Also, it’s helpful to use a network attached storage drive, IMAP for email, SVN for file versioning, a scanner that does good one-touch OCR scanning of documents…

Tagging with pictures | Tagging the physical world.

At the risk of fanning political flames, this jpg was just sent to me via email. If you move past the humor and politics of the photo, it seems salient to today’s topic of tagging. Specifically, using the characteristics we collectively/culturally ascribe to trains of varying types to tag each of the presidential/vice-presidential candidates. It was done visually instead of with words (modern, green, fast, powerful, coal powered, archaic, plastic, child’s toy). Are these “good” tags? I think guys named Nick who went to Amherst (the h is silent) would say yes.

Election Trains

After I stopped laughing, this made me wonder whether a system for tagging things with pictures already exists. I did not find one with a quick Google search, just a number of whitepapers.

However, I did find Tonchidot.

While not specifically related to using images to tag other images or ideas, they are developing an iPhone app that adds tags to the images the camera sees in real time. They take community tags and make them mobile in a very compelling way. Want to know what type of flower that is? Or tree? The year a building you are looking at was built, and who designed it? Which store at the mall has the thing you want to buy? How many stars the restaurant you are looking at has on Yelp? When the next BART train is arriving at your station? You could also find a lower price for something in a different store, purchase something via the phone, or leave a message for a friend to pick up by walking by a specific place.

Tagging a specific location is also possible. This reminds me of William Gibson’s book Spook Country. One aspect of the storyline was the development of location based digital art installations. In order to see a specific digitally created piece you needed specially made hardware (eyeglass digital display) and a computer. You also needed to be in a specific geo-spatial location. Now, you’ll just need your iPhone.

One of the things an artist in the book said reminds me of the potential of Tonchidot’s technology. Imagine traveling across the country and seeing a whole second landscape that covers, interacts with, and integrates with the physical world, offering different things to see, information about what you’re seeing, directions to get there, and prices for goods and services (who would not love to know the cheapest place to get gas?). And of course a whole new opportunity for advertising and spam.

Maybe that’s the problem with spam. No ontological control.

The video is about 18mins long and worth watching. There is a particularly interesting practical question around the 14:15 min mark.

Visualization of Google News Data

This site provides a nicely visualized representation of the news, broken down by country and topic. Color shows the type of news (world, business, technology, sports, etc.) and the time passed since the story was published (darker is older). Size represents the number of links to that particular story. It’s a rather fun way to see what the online news media considers important within a country as well as internationally.

This makes me think of the Dashboard mentioned at the end of lecture today. It provides a quick, easily understood view of information that is collected on-the-fly. It’s not great for keeping track of the news on a granular level. You cannot search from the site at all. I’ll stick with my personalized news feeds, but I check this daily to get a sense for what the rest of the world (at least a portion of it) is concerned with in comparison to the US (my own personal context).

A note from the author of the site: “Its objective is to simply demonstrate visually the relationships between data and the unseen patterns in news media. It is not thought to display an unbiased view of the news; on the contrary, it is thought to ironically accentuate the bias of it.”

For example, as I look right now, almost every country has something about Citibank fighting for Wachovia (after Wells Fargo’s bid), though Germany, France, and Spain care less than half as much as the rest. It can also show the context (bias?) each country uses to view a topic. Each country has an article about the Russian/Georgian ceasefire. The US article states Russia is accusing Georgia of harming the ceasefire. Everyone else states Russia is trying to “mend fence-posts”.

Comments off

Where does your food come from?


The Agriculture Department has given American retailers six months to comply with a new rule requiring that meats, produce, and certain nuts be labeled with the country of origin. The idea is that consumers have the right to have access to this metadata when making decisions about which food to purchase. Interestingly enough, certain foods, such as roasted nuts and mixed vegetables, are exempt from this rule. The article refers to these exempted foods as processed foods (mixed vegetables – processed?). I don’t see why a food’s classification as “processed” grants it special status to be exempt from rules that are meant to inform consumers about the potential safety (or dangers) of food.

Regardless, I’m glad to see this step being taken to include more classes of foods in country-of-origin labeling (seafood is already labeled). Necessity arising out of food recalls seems to be driving these efforts.
