Archive for September, 2008

Facets in Enterprise Search

An article in Forbes this week is about Enterprise Search and why “Google Isn’t Enough.”

The basic point is that many consumers love Google and want that kind of keyword search at work, but keyword search is usually not the best way to get at a business’s corpus of information. One of the article’s points concerns the ordering of results: “popularity” might not be the best ordering when it comes to finding documents in a business.

This article is interesting to look at in light of our conversation about faceted classification, as it discusses faceted search as a way to get around these problems:

Compared to Web search, enterprise search queries often return a large number of results that are less amenable to objective ranking. Hence, enterprise search systems often augment the list of top-ranked results with an overview that summarizes all of the results of the query. An increasingly popular approach is faceted search, in which results are grouped based on various means of classification.

It continues with various examples. Overall, it’s good to think about classification and search in different domains, and about the benefits and drawbacks each mode might have in each domain.
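To make the idea of grouping results by classification concrete, here is a minimal sketch of facet counting in Python. The sample documents, the facet names (`department`, `doc_type`), and the helpers `facet_counts` and `narrow` are all hypothetical, invented for illustration; none of it comes from the Forbes article.

```python
from collections import Counter, defaultdict

# Hypothetical search results; each result carries classification metadata.
RESULTS = [
    {"title": "Q3 budget", "department": "Finance", "doc_type": "spreadsheet"},
    {"title": "Travel policy", "department": "HR", "doc_type": "policy"},
    {"title": "Q3 forecast", "department": "Finance", "doc_type": "report"},
    {"title": "Expense policy", "department": "Finance", "doc_type": "policy"},
]

def facet_counts(results, facets):
    """Count how many results fall under each value of each facet."""
    counts = defaultdict(Counter)
    for doc in results:
        for facet in facets:
            if facet in doc:
                counts[facet][doc[facet]] += 1
    return counts

def narrow(results, facet, value):
    """Keep only the results whose facet matches the chosen value."""
    return [doc for doc in results if doc.get(facet) == value]

# The overview that would sit beside the ranked list of results.
overview = facet_counts(RESULTS, ["department", "doc_type"])
# What the user sees after clicking the "Finance" facet value.
finance_only = narrow(RESULTS, "department", "Finance")
```

The counts give the searcher a summary of the whole result set, and picking a facet value filters the list instead of relying on one "best" ranking.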

Comments off

The intelligent cloud

After our discussion on Monday about automation and Svenonius’s attitude toward the high cost of categorization, I was interested to read a recent Google blog post about the future of Google’s search technology.

We discussed how current technology cannot automatically recognize or understand things such as metaphors, fuzzy words, and multi-word terms.

Google is predicting that by 2019 its technology will be able to do much more than fully automate categorization and language comprehension: it will also solve complex problems and learn from its research. Google expects the impact to go well beyond its own offerings and to generate many significant benefits for mankind.

“Thus, computer systems will have greater opportunity to learn from the collective behavior of billions of humans. They will get smarter, gleaning relationships between objects, nuances, intentions, meanings, and other deep conceptual information.”

The intelligent cloud

Comments (3)

Catalogue Fail

My boyfriend enjoys the music of one Kennedy, a rather gleefully noisy and profane disco/electronica artist. Recently, he’d been looking forward to the release of Kennedy’s new album.

So imagine his surprise when his iTunes download turned out to be sad, uninspired post-rock junk.

We were confused at first, since the album was definitely by Kennedy and had the same name and cover that Rolling Stone and various advance news stories had led him to expect. Perhaps Kennedy had suddenly gotten really sad, cleaned up his lyrics, and lost all writing talent?

But no, or else I wouldn’t be writing about it on the 202 blog: it turns out there are two musicians called Kennedy, and iTunes has them utterly, utterly confused, as does Rolling Stone.

If you search for Kennedy on iTunes, three hits come up for Artists named simply Kennedy, in the jazz, pop, and alternative genres. Jazz is easily discounted as the target Kennedy. But looking at the pop and alternative hits presents a puzzle which the album covers begin to solve: colorful, tacky pop-arty covers indicate the ‘real’ Kennedy, while gray and/or sadness designates the other. Both the pop Kennedy and the alternative Kennedy hits contain albums from both artists!

Even worse, Rolling Stone (who originally got my boyfriend excited about the new release) lists every real Kennedy album in a trustworthy way, then inserts the impostor album at the top of the list!

Truly, patronising little-known artists is fraught with IO peril.

Comments off

Dewey Decimal

Based on section discussion of Cory Doctorow’s point “schemas aren’t neutral” and on a librarian friend complaining that Korea got shafted when it came to the folktale section of the Dewey Decimal system, I decided to look at the complete list of Dewey Decimal classes.

As Nick mentioned in section, the religion section is overwhelmingly dominated by Christianity. Also, any time languages are mentioned, European languages get multiple categories (English, Other Germanic Languages, French, Spanish, Italian, Slavic, Scandinavian) while the rest of the world is stuck in the “other” category. Wikipedia, font of all knowledge, mentions that the Library of Congress system is even more US-centric than the Dewey Decimal system.

Makes you wonder what sort of systems of categorization information scientists in other countries create.

Comments (1)

Categorizing music

When I started collecting MP3s many years ago I was obsessive about filling in blank ID3 tags. No more “Track 09 — Unknown Artist” for me. There was one problem: I didn’t know how to fill out the genre tag. I actually remember posting a message to a newsgroup asking how to know what counted as rock, pop, hip-hop, R&B, and so forth. Someone had created these categories and I wanted to use them, but I didn’t have a clue how to do it (Doctorow’s “People are stupid,” I suppose).

I was having a conversation about this with Michael Manoochehri, who interjected that to some extent those are just commercial categories for music, which I hadn’t considered before. Nonetheless, there does seem to be at least some useful aspect of these categories: sometimes I feel like listening to music from one of them and not others. Still, using genres as categories is painting in broad strokes: different songs from the same album might properly belong to different genres, and an artist might move between genres during her career. A system like Pandora’s use of the Music Genome Project might more accurately select what I want to hear.

While I tried to conform to what I imagined were norms for genre categorization, I had a friend who created an entire set of unique genres for her music. Instead of pop and rock, she changed everything to “Coffee Shop Grooves” or “Rocking the Suburbs” or something similarly unusual, effectively using the genre namespace to sort music into her own categories.

Categorizing music is an issue across borders as well. I saw these two signs in a record store in South America:

Anglo rock and pop

Black music

Note that Eminem has a couple albums in the “Black Music” section. Now there’s a funny categorization scheme for you.

Comments (2)

Aliasing system commands in a GUI

The article that we read for Tuesday, “The Vocabulary Problem in Human-System Communication,” showed that you need at least 10 aliased terms for a referent before untrained people can reliably select it.

It’s easy to imagine how this might work in the command-line environment: a system designer picks a command and gives it the arbitrary authoritative term “delete” (for example). People who type “delete” at the command line will access this command. But we can easily toss in some aliases so that people who type “remove,” “trash,” “eliminate,” “wipe”, etc. will be referred to the delete command. This same idea can be applied to a GUI, but what would it look like?

One possibility might be something similar to what you see in OS X 10.5’s help menu. Starting in Leopard you can search for menu names in the help box and the system will visually point to where they are in the menu hierarchy. This way if you know a command is called “Crop” but you can’t remember where it is, the system will show you. Here’s a screenshot:

GUI implementation of aliases for system commands

Although the menu search currently only matches literal strings, it’s not hard to imagine it working by matching your search against aliases for commands. You search for “trim” or “cut edges” and the system suggests the crop menu. (Ignoring for the moment that trim happens to be a separate command in, e.g., Photoshop.) Application designers would have to do some simple research to see what aliases would best serve users.
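The alias matching imagined above could be sketched roughly as follows in Python. This is only an illustration: the menu paths, the alias lists, and the names `COMMANDS`, `build_index`, and `find_command` are made up for the example and do not describe how Leopard's help search actually works.

```python
# Hypothetical menu commands: canonical name -> (menu path, aliases).
COMMANDS = {
    "Crop": ("Image > Crop", ["trim", "cut edges", "clip"]),
    "Delete": ("Edit > Delete", ["remove", "trash", "eliminate", "wipe"]),
}

def build_index(commands):
    """Map every canonical name and alias (lowercased) to its canonical command."""
    index = {}
    for name, (_path, aliases) in commands.items():
        for term in [name, *aliases]:
            index[term.lower()] = name
    return index

INDEX = build_index(COMMANDS)

def find_command(query):
    """Return (canonical name, menu path) for a query, or None if nothing matches."""
    name = INDEX.get(query.strip().lower())
    if name is None:
        return None
    return name, COMMANDS[name][0]

# A user who types "trim" is pointed at the Crop menu item.
suggestion = find_command("trim")
```

A real application would also have to decide what happens when an alias collides with another command's actual name, as "trim" would in Photoshop; this sketch simply lets the last definition win.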

There are definitely other ways to implement this idea, but this seems like one simple way to put research into practice.

Comments (2)

Dupe Detection – Not Just for Preventing Too Many Mailings

After today’s discussion of duplicate detection I was interested in areas where duplicates can cause a problem, and not a “we sent this person an extra catalog” kind of problem, but a serious financial problem.

From the pages of Bank Technology News I learned about the problems banks encounter when processing check deposits. Checks these days come into a bank’s clearinghouse via traditional paper, ACH, scanned images, and remote deposits, and sometimes duplicates get into the system. The article quotes a figure of 120 in 1 million, which doesn’t really seem like a lot. But imagine that rather than one $500 check being deposited into your account, two or more are deposited. Great for you, bad for the bank, as they have to find and undo the error, and then explain it to you.

Banks use duplicate-detection technology to identify duplicates before the transaction posts to your account, which saves them considerable time, money, and, perhaps most importantly, embarrassment.
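The article does not describe how the detection actually works. One simple approach, sketched here purely as an assumption, is to key each incoming deposit on the fields that should identify a single check and flag any key seen twice, regardless of the channel it arrived through. The field names, the sample data, and `find_duplicates` are all hypothetical.

```python
def find_duplicates(deposits):
    """Return deposits whose identifying fields were already seen earlier."""
    seen = set()
    duplicates = []
    for dep in deposits:
        # Assumed identity of a check. The channel is deliberately left out so
        # that a paper deposit and a scanned image of the same check collide.
        key = (dep["routing"], dep["account"], dep["check_no"], dep["amount_cents"])
        if key in seen:
            duplicates.append(dep)
        else:
            seen.add(key)
    return duplicates

# Made-up deposits: check 101 arrives once on paper and once as an image.
DEPOSITS = [
    {"routing": "000000001", "account": "12345", "check_no": 101,
     "amount_cents": 50000, "channel": "paper"},
    {"routing": "000000001", "account": "12345", "check_no": 102,
     "amount_cents": 7500, "channel": "ACH"},
    {"routing": "000000001", "account": "12345", "check_no": 101,
     "amount_cents": 50000, "channel": "image"},
]

flagged = find_duplicates(DEPOSITS)
```

Exact matching like this only catches clean duplicates; a misread amount or check number on a scanned image would slip through, which is presumably where the harder part of the problem lies.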

Comments (1)

on duplicates

I was reminded in class today of how Microsoft keeps track of duplicate bugs. If two bugs describe issues with a common root cause, the Microsoft bug-tracking application will label the later bug ID as a duplicate of the former bug ID. If you fix one bug, the other is automatically considered fixed. Of course, each bug in a set of duplicates contains a reference to all other bugs in that set. The system appears to work fairly smoothly and unambiguously, to the best of my knowledge.
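The behavior described here can be sketched as a toy Python class. This is not Microsoft's system or its API; the `BugTracker` class, its method names, and the bug IDs are hypothetical and only model the three properties mentioned: a later bug is marked a duplicate of an earlier one, every bug in a set references all the others, and fixing one fixes them all.

```python
class BugTracker:
    """Toy tracker: duplicate bugs share one set, and a fix applies to the whole set."""

    def __init__(self):
        self.groups = {}    # bug ID -> set of all bug IDs sharing its root cause
        self.fixed = set()  # bug IDs currently considered fixed

    def file(self, bug_id):
        """Register a new bug in a duplicate set of its own."""
        self.groups[bug_id] = {bug_id}

    def mark_duplicate(self, later_id, earlier_id):
        """Merge the later bug's set into the earlier bug's set."""
        merged = self.groups[earlier_id] | self.groups[later_id]
        for bug_id in merged:
            self.groups[bug_id] = merged
        # If any bug in the merged set is already fixed, all of them are.
        if merged & self.fixed:
            self.fixed |= merged

    def duplicates_of(self, bug_id):
        """Every other bug in the same duplicate set."""
        return self.groups[bug_id] - {bug_id}

    def fix(self, bug_id):
        """Fixing one bug marks every bug in its set as fixed."""
        self.fixed |= self.groups[bug_id]

    def is_fixed(self, bug_id):
        return bug_id in self.fixed

tracker = BugTracker()
for bug in (101, 205, 317):
    tracker.file(bug)
tracker.mark_duplicate(205, 101)  # 205 was filed later, same root cause as 101
tracker.fix(205)                  # 101 is now considered fixed as well
```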

Comments off

Automated: Intrinsic Music Metadata

I discovered an article closely matching our topic in today’s section.

After being allowed to “listen” to music selections, algorithms have had some success identifying cover songs by different artists, as well as a song’s genre, mood, composer, and title.

Here’s one example:
“Given one rendition of Led Zeppelin’s “Stairway to Heaven,” the electronic listeners had to sift through 1,000 songs and pick out 10 performances of “Stairway” by other artists, one of them on the banjo. A team from Barcelona, Spain, won that challenge, with a 75 percent success rate.”

In another example, upon hearing a few hummed notes, a computer attempts to “Name that tune.”
The algorithms seem to have some success with popular music, but I haven’t seen much in the way of extrapolation from that data towards identifying newly-discovered music. The evidence hints at conditions where extrinsic metadata isn’t required.

http://www.philly.com/inquirer/entertainment/20080922_Computers_have_exquisite_ears.html
potentially relevant to: Metadata & Metadata Standards, Classification, Metadata for Multimedia, Multimedia IR

Comments (2)

On the Subject of Important Definitions – OR – Why Politics and Categorization Don’t Mix

In Erin Knight’s post, she talks about how the Capitol is trying to figure out how to redefine homelessness. This reminds me of a similar issue that I have encountered year after year while working for Contra Costa County.

Back in about 1970, a bit of research was done to determine what the poverty level should be. The researchers did a bunch of work, but eventually just decided that the thing to do was to take the cost of food for a given family size and multiply it by three. Out of this math, we have the poverty level.

The government has adjusted this number every year for inflation, and with that, we arrive at the federal poverty levels for 2008.
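The arithmetic as described in this post is simple enough to write down. The sketch below uses Python with made-up numbers (a 1,000 annual food budget and two years of 10 percent inflation) purely for illustration; they are not the real federal figures, and the function names are my own.

```python
def poverty_threshold(annual_food_cost, multiplier=3):
    """Base-year threshold: the family's food budget times a fixed multiplier."""
    return annual_food_cost * multiplier

def adjust_for_inflation(base_threshold, annual_rates):
    """Compound a base-year threshold through a list of yearly inflation rates."""
    threshold = base_threshold
    for rate in annual_rates:
        threshold *= 1 + rate
    return threshold

# Hypothetical inputs, not actual government data.
base = poverty_threshold(1_000)                   # food cost times three
later = adjust_for_inflation(base, [0.10, 0.10])  # two years at 10 percent
```

Notice that nothing except food prices in the base year and general inflation ever enters the calculation, which is the complaint.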

Now, this would be pretty bad research, and were I the professor overseeing the high schoolers responsible for these measures, I would probably scold them for committing every bad research method ever. The federal government, however, has taken these measures and based pretty much every aid program on them... for the past 30-40 years.

Brilliant.

In class, we have talked about how important it is to have specific and precise ways of categorizing things. Unfortunately, this thing happens to be humans, and unfortunately nobody wants to raise the poverty level while in office because that will mean that X number of people fell into poverty during their time.

When politics meets categorization, problems ensue.

Comments (5)
