Divyakumar Menghani | I202 | Information Organization and Retrieval

OVERVIEW

Scientific research has been published in periodical journals since at least the beginning of the 19th century. Publication enables the sharing of collective human knowledge and wisdom. These articles and papers are the result of extensive research done by researchers in specific domains.

Google Scholar [1] is a digital bibliographic database and search engine for published scholarly literature across varied disciplines. It indexes research papers, books, articles, and other publications, and it is very popular in academia. It is interesting for the way it applies organizing principles to a complex problem: distributing collective human wisdom in the form of scholarly literature.

1.1 WHAT RESOURCES ARE BEING ORGANIZED?

Google Scholar is a digital organizing system. Documents of various types and formats are being organized. It indexes “full-text journal articles, technical reports, preprints, theses, books, and other documents, including selected web pages that are deemed to be ‘scholarly.’”[2] In other words, it organizes knowledge that is represented in a variety of formats. It was launched in late 2004 and organizes articles dating back as far as 1901. All articles have been converted to digital formats that can be viewed on the Internet. The unit of organization is a single document. The system is designed to locate resources, and users can form collections in their own personal libraries. Resources enter the system when crawlers find papers on the Internet. Older articles were digitized by Google in collaboration with libraries and organizations around the world.

In this case, the resources are digital copies of the articles. The system does not organize the original paper itself, only a copy of it.

Some challenges and design decisions reflected in the system:

The system designers could have decided to include only one document format, but they chose to support multiple formats.
They created crawlers and algorithms that parse each page and fill in resource descriptions such as author, category, and keywords. Some attributes are mandatory and some are not.
All resources are searched together. The user sees no distinction between a resource from 2013 and one from 1905, or between a PDF document and an HTML page. The system creators could have restricted the domain of resources, but they kept it vast.
Granularity is another challenge. They could have treated one journal as one document, but they chose the individual document as the unit of organization.
The resource selection principles applied during retrieval are “relevance” and “purpose”.
Retention policy: resources appear to reside in the system indefinitely.
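
The idea of a resource description with mandatory and optional attributes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the field names, the choice of which attributes are mandatory, and the `describe` helper are hypothetical and do not reflect how Google Scholar stores its data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceDescription:
    # Mandatory attributes (assumed for this sketch).
    title: str
    authors: List[str]
    # Optional attributes (also assumed).
    year: Optional[int] = None
    keywords: List[str] = field(default_factory=list)
    file_format: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject descriptions that are missing a mandatory attribute.
        if not self.title.strip():
            raise ValueError("title is mandatory")
        if not self.authors:
            raise ValueError("at least one author is mandatory")


def describe(parsed: dict) -> ResourceDescription:
    """Build a description from the fields a parser extracted from one page."""
    return ResourceDescription(
        title=parsed.get("title", ""),
        authors=list(parsed.get("authors", [])),
        year=parsed.get("year"),
        keywords=list(parsed.get("keywords", [])),
        file_format=parsed.get("format"),
    )
```

A page for which the parser finds no title or no author would be rejected, while a missing year or keyword list is tolerated.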

1.2 WHY ARE THE RESOURCES BEING ORGANIZED?

Resources are organized so that any user can easily look for relevant information aligned with their interests.
Resources are indexed so that information retrieval can be efficient.
The system helps users find rare resources to which they may not have physical access.
Attributes of a resource, such as its citations, help establish the credibility of and trust in the author's research or writing. The resource description therefore plays an important role in the organizing system.
Resources have attributes such as “cited by”. These attributes can be tracked, and interesting analyses can be performed, such as “Most cited<attribute> paper<resource> in Information retrieval<category>”.
The system allows users to create alerts for topics they are interested in.
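
A query such as “most cited paper in Information retrieval” combines an attribute, a resource type, and a category. The following is a minimal sketch of that idea; the records, the field names, and the citation counts are invented for illustration and are not real Google Scholar data.

```python
from typing import List, Optional

# Made-up sample records; titles and counts are placeholders.
papers = [
    {"title": "Paper A", "category": "Information Retrieval", "cited_by": 120},
    {"title": "Paper B", "category": "Information Retrieval", "cited_by": 340},
    {"title": "Paper C", "category": "Databases", "cited_by": 900},
]


def most_cited(records: List[dict], category: str) -> Optional[dict]:
    """Return the record with the highest 'cited_by' value in a category."""
    in_category = [r for r in records if r["category"] == category]
    # default=None covers the case where the category has no records.
    return max(in_category, key=lambda r: r["cited_by"], default=None)


best = most_cited(papers, "Information Retrieval")
```

Here `best` is the record for "Paper B", since it has the highest count within its category even though "Paper C" has more citations overall.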

1.3 HOW MUCH ARE THE RESOURCES BEING ORGANIZED?

Google Scholar leverages crawlers and parsers similar to those of the Google search engine. It crawls documents to identify authors, topics, keywords, citations, and publication information, including metadata such as the publication year and month. The documents are parsed, and statistical metrics[3] such as the h-index and h-median are calculated to determine how a resource relates to other resources. These metrics help establish the credibility of the document and hence of the author. To support efficient search, resources are classified by discipline and sub-discipline and by the keywords used to define them. There seems to be a combination of hierarchical and faceted classification in multiple places on the website. For example, it maintains a hierarchical classification of categories and subcategories. Some categories also overlap, such as ICTD. One can reach the same resource via multiple keywords or tags.
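
As a worked illustration of the two metrics named above: the h-index is the largest number h such that at least h articles have at least h citations each, and the h-median is the median citation count of those top h articles (the h-core). The sketch below follows these definitions; it is not Google's implementation, and the function names are chosen only for this example.

```python
from statistics import median
from typing import List


def h_index(citations: List[int]) -> int:
    """Largest h such that at least h articles have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_median(citations: List[int]) -> float:
    """Median citation count of the h-core (the top h cited articles)."""
    h = h_index(citations)
    if h == 0:
        return 0
    h_core = sorted(citations, reverse=True)[:h]
    return median(h_core)


counts = [17, 9, 6, 3, 2]
print(h_index(counts))   # 3
print(h_median(counts))  # 9
```

For five articles cited 17, 9, 6, 3, and 2 times, three articles have at least three citations each but there are not four with at least four, so the h-index is 3. The h-core is 17, 9, 6, and its median is 9.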

It also has a three-tier architecture that separates storage, logic, and interface in the application.

If I think of the document type spectrum, Google Scholar's resources lie at the lower-left end, because they are narrative. On the Information IQ plot, they fall in the lower-left quadrant, because a typical resource is a scanned document or a PDF document without any standard structure. No single standard governs how papers should be written. IEEE formatting conventions are sometimes followed, but other popular conventions exist too.

1.4 WHEN ARE THE RESOURCES ORGANIZED?

Research papers and articles are crawled periodically, and their resource descriptions are captured so that Google Scholar's algorithms and computational processes can calculate metrics. Google Scholar keeps track of the citations for each document and periodically generates Google Scholar Metrics, which may help researchers further.

Maintenance and curation of resources are ongoing as well. Scores and citation counts are updated periodically by algorithms.
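
One way to picture a periodic citation update is as a recount over the reference lists found in the latest crawl. The sketch below assumes each crawled document yields a list of the document identifiers it references; the identifiers and the `cited_by_counts` function are hypothetical, and the real pipeline is not public.

```python
from collections import Counter
from typing import Dict, List


def cited_by_counts(references: Dict[str, List[str]]) -> Counter:
    """Count, for each document, how many other documents cite it."""
    counts: Counter = Counter()
    for citing, cited_list in references.items():
        # set() ensures a document that lists the same reference twice
        # only counts once; self-citations of the same document are skipped.
        for cited in set(cited_list):
            if cited != citing:
                counts[cited] += 1
    return counts


# Made-up snapshot of one crawl: document id -> ids it references.
snapshot = {"A": ["B", "C"], "B": ["C", "C"], "C": []}
counts = cited_by_counts(snapshot)
```

Running the same function on each new snapshot gives refreshed "cited by" values; in this snapshot, "C" is cited by two documents and "B" by one.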

1.5 WHO DOES THE ORGANIZING?

Algorithms and computational processes organize the information within and between resources. Additionally, publishers are given guidelines on what they could do better so that Google Scholar can crawl their documents. Human intervention is also involved for moderation.

For older documents, Google digitized them and created their resource descriptions in conjunction with experts.

1.6 OTHER CONSIDERATIONS

1. Google Scholar looks for patterns in a document on the Internet to decide whether or not it is a paper. This process produces a high number of false positives, so the indexed documents are not always good-quality research papers.

2. Google Scholar doesn’t provide the actual data from which a paper's results were derived. In terms of reproducible research, it relies on the research paper itself to describe that data.

3. The quality of resources could be a challenge given the number of papers being generated every day. If the organizing system doesn’t maintain high-quality resources, it may defeat its purpose and become just another general search engine.
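
The pattern matching mentioned in the first consideration, and why it yields false positives, can be sketched with a simple heuristic. The three patterns below are assumptions chosen for illustration; they are not Google Scholar's actual rules.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE


def looks_scholarly(text: str) -> bool:
    """Guess whether a document is a paper from surface patterns only."""
    has_abstract = re.search(r"^\s*abstract\b", text, FLAGS) is not None
    has_references = (
        re.search(r"^\s*(references|bibliography)\b", text, FLAGS) is not None
    )
    has_citation_marker = re.search(r"\[\d+\]", text) is not None
    return has_abstract and has_references and has_citation_marker


# A blog post laid out like a paper passes the check: a false positive.
blog_post = """My Weekend Notes
Abstract
Some thoughts on gardening [1].
References
[1] My neighbour, personal communication.
"""
print(looks_scholarly(blog_post))  # True
```

Because the check looks only at layout and never at the quality of the content, any document that imitates the form of a paper is accepted.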

[1] Google Scholar http://scholar.google.com/

[2] Vine, Rita (January 2006). “Google Scholar”. Journal of the Medical Library Association 94 (1): 97–9. PMC 1324783.

[3] Google Scholar Metrics help http://scholar.google.com/intl/en/scholar/metrics.html#metrics


UC Berkeley Fall 2013 INFO 202
