Archiving the web before Inauguration Day

I learned a fascinating tidbit this morning when I met my co-worker’s husband. He works at the California Digital Library and one of the library’s projects is archiving the web in a structured fashion. When we think of a bibliographic entity we nearly always think of publication date as a piece of metadata, but what about de-publication date? He mentioned that they’re preparing for a giant switchover in government websites. They have to archive content on government websites before the new McCain or Obama administration takes over because much of the content on these sites can change overnight when administrations switch. Recommendations on health matters, policy statements, and more are updated or revised. They missed some of the changes when the Clinton administration changed to the Bush administration and there’s no publicly accessible method for rewinding to see the changes.

We’re used to the idea of revision tracking on a Wikipedia article, but what about revision tracking across all bibliographic resources? How does this added temporal component affect the way we browse, search, and categorize instances?
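One way to picture the added temporal component is a record that carries both a publication date and a de-publication date, so a reader can "rewind" to what was live on a given day. This is only a minimal sketch: the `Revision` class, the `as_of` function, and the sample statements are hypothetical illustrations, not how any real web archive stores its data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Revision:
    """One version of a resource (hypothetical structure for illustration)."""
    text: str
    published: date
    depublished: Optional[date] = None  # None means the version is still live


def as_of(revisions, when):
    """Return the revision that was live on `when`, or None if there was none."""
    for rev in revisions:
        still_live = rev.depublished is None or when < rev.depublished
        if rev.published <= when and still_live:
            return rev
    return None


# Placeholder content: two versions of an imagined policy page.
history = [
    Revision("Old policy statement", date(2001, 1, 20), date(2009, 1, 20)),
    Revision("New policy statement", date(2009, 1, 20)),
]

print(as_of(history, date(2005, 6, 1)).text)
```

With both dates stored, browsing and searching can take a date as an extra parameter instead of always assuming "now".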


Visualization of Google News Data

This site provides a very nicely visualized representation of news topics broken down by country and topic. Color is used to show types of news (world, business, technology, sports, etc.) and time passed since the story was published (darker is older). Size is used to represent the number of links to that particular story. It’s a rather fun way to see what news media (online) considers important within a country as well as internationally.

This makes me think of the Dashboard mentioned at the end of lecture today. It provides a quick, easily understood view of information that is collected on-the-fly. It’s not great for keeping track of the news on a granular level. You cannot search from the site at all. I’ll stick with my personalized news feeds, but I check this daily to get a sense for what the rest of the world (at least a portion of it) is concerned with in comparison to the US (my own personal context).

A note from the author of the site: “Its objective is to simply demonstrate visually the relationships between data and the unseen patterns in news media. It is not thought to display an unbiased view of the news; on the contrary, it is thought to ironically accentuate the bias of it.”

For example, as I look right now almost every country has something about Citibank fighting for Wachovia (after Wells Fargo’s bid), though Germany, France, and Spain care less than half as much as the rest. It can also show the context (bias?) each country uses to view a topic. Each country has an article about the Russian/Georgian ceasefire. The US article states Russia is accusing Georgia of harming the ceasefire. Everyone else states Russia is trying to “mend fence-posts”.
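The visual encoding described above (color for news type, darkness for age, size for number of links) can be sketched in a few lines. This is an assumption-laden illustration, not the site's actual code: the `Story` class, the `BASE_COLORS` palette, the 24-hour fade window, and both functions are made up for the example.

```python
from dataclasses import dataclass

# Hypothetical base colors (RGB) for each news type.
BASE_COLORS = {
    "world": (200, 60, 60),
    "business": (60, 120, 200),
    "technology": (60, 180, 90),
    "sports": (220, 160, 40),
}


@dataclass
class Story:
    title: str
    category: str
    age_hours: float
    link_count: int


def tile_color(story, max_age_hours=24.0):
    """Darken the category color as the story ages (darker is older)."""
    freshness = 1.0 - min(story.age_hours, max_age_hours) / max_age_hours
    scale = 0.3 + 0.7 * freshness  # assumed floor so old tiles never go fully black
    return tuple(round(c * scale) for c in BASE_COLORS[story.category])


def tile_areas(stories, total_area=1.0):
    """Give each story a share of the area proportional to its link count.

    Assumes at least one story has a nonzero link count.
    """
    total_links = sum(s.link_count for s in stories)
    return {s.title: total_area * s.link_count / total_links for s in stories}


stories = [
    Story("Bank bid", "business", age_hours=0.0, link_count=30),
    Story("Ceasefire", "world", age_hours=24.0, link_count=10),
]
print(tile_areas(stories))
print([tile_color(s) for s in stories])
```

Three attributes of each story map onto three independent visual channels, which is why the page can be read at a glance but is poor for granular tracking.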


Dewey or Don’t We?

This article from May 07 is about a library that decided to move away from the Dewey Decimal system and towards a subject-based organization. They used 50 subject headings created by the Book Industry Study Group Inc. The library intentionally mimicked certain aspects of bookstores, not only in how the books are organized by subject, but also in physical layout. It appears they are trying to accommodate their customers’ habits and expectations.

To me, this sounds interesting. I recall thinking while reading Weinberger that I like bookstores: as long as the subject areas are clearly labeled, I have little trouble finding the specific book I am seeking. At the very least it is no more difficult than in a library, and usually easier. It also seems more “natural” to me to search by subject than by number. Of course, this is a small library (24,000 books, DVDs, etc.). With a larger set of works, this approach may become too difficult to manage.

However, one of the comments on the article is key (in my opinion) to the bookstore/Dewey decision: “That’s OK for leisure reading, but if you need to do research on a specific topic, you are going to have a hard time finding the particular information that you need.” The additional structure in the Dewey system makes it easier (once you know how to use the system) to find increasingly granular information. Most bookstores just lump it all together.

I’ve not been able to find any follow-up information as to whether it worked or not. Their page shows they now have over 30,000 items in the library, but nothing about its current layout/organization or popularity. I wish I’d found this article when we read Weinberger’s piece.

PS: I wish I could claim the title as original, but I borrowed it.


The intelligent cloud

After our discussion on Monday on automation and Svenonius’s attitude toward the high cost of categorization, I was interested to read a recent Google blog post about the future of Google’s search technology.

We discussed that current technology would not allow automation to recognize / understand things such as metaphors, fuzzy words, and multi-word terms.

Google is predicting that by 2019 their technology will be able to do much more than fully automate categorization and language comprehension: it will also solve complex problems and learn from its research. The impact of their technology will go well beyond Google’s offerings and will generate many significant benefits for mankind.

“Thus, computer systems will have greater opportunity to learn from the collective behavior of billions of humans. They will get smarter, gleaning relationships between objects, nuances, intentions, meanings, and other deep conceptual information.”



Categorizing music

When I started collecting MP3s many years ago I was obsessive about filling in blank ID3 tags. No more “Track 09 — Unknown Artist” for me. There was one problem: I didn’t know how to fill out the genre tag. I actually remember posting a message to a newsgroup asking how to know what counted as rock, pop, hip-hop, R&B, and so forth. Someone had created these categories and I wanted to use them, but I didn’t have a clue how to do it (Doctorow’s “People are stupid,” I suppose).

I was having a conversation about this with Michael Manoochehri, who interjected that to some extent those are just commercial categories for music, which I hadn’t considered before. Nonetheless, there does seem to be at least some useful aspect of these categories: sometimes I feel like listening to music from one of them and not others. Using genres as categories is painting in broad strokes: different songs from the same album might properly belong to different genres, and an artist might move between genres during her career. A system like Pandora’s use of the Music Genome Project might more accurately select what I want to hear.

While I tried to conform to what I imagined were the norms for genre categorization, I had a friend who created an entire set of unique genres for her music. Instead of pop and rock, she changed everything to “Coffee Shop Grooves” or “Rocking the Suburbs” or something similarly unusual, effectively using the genre namespace to sort music into her own categories.

Categorizing music is an issue across borders as well. I saw these two signs in a record store in South America:

Anglo rock and pop

Black music

Note that Eminem has a couple of albums in the “Black Music” section. Now there’s a funny categorization scheme for you.


Physicist != Alchemist

This comic from User Friendly does a very good (and funny) job of representing two groups clashing in the process of creating vocabulary and defining words. Each group has its own definition for a common word that is a misrepresentation (or partial representation) of what the word means to the other group.

This shows one facet of dealing with word definition and categories: agreement of meaning. Physicist and alchemist are more specifically defined in that they are not really interchangeable. Here they are used to make a point. Hacker and cracker have differing definitions, but cracker in this context is used in a way that most outside the hacker community would not differentiate. Within the community it is a very important distinction. This also shows the ‘living’ aspect of language in the changing definitions of pre-existing words (hacker and cracker).
