Tag Archives: Lisa

Google Scholar

OVERVIEW

Scientific research has been published in periodical journals since at least the beginning of the 19th century. This enables the sharing of collective human knowledge and wisdom. These articles and papers are the result of extensive work by researchers in specific domains.

Google Scholar[1] is a digital bibliographic database and search engine for published scholarly literature across varied disciplines. It indexes research papers, books, articles, and other publications, and is very popular in academia. It is interesting in the way it applies organizing principles to a complex problem: distributing collective human wisdom in the form of scholarly literature.

1.1 WHAT RESOURCES ARE BEING ORGANIZED?

Google Scholar is a digital organizing system that organizes documents of various types and formats. It indexes “full-text journal articles, technical reports, preprints, theses, books, and other documents, including selected web pages that are deemed to be ‘scholarly.’”[2] In other words, it organizes knowledge that is represented in a variety of formats. It was launched in late 2004 and organizes articles created as early as 1901. All articles have been converted to digital formats that can be viewed on the Internet. The unit of organization is a single document. The system is designed to locate resources, and users can form a collection in their own personal library. Resources are added by crawling papers on the Internet. Older articles were digitized by Google in collaboration with libraries and organizations around the world.

In this case, the resources are the digital copies of the articles. It is not organizing the actual paper but a copy of the resource.

Some challenges and design decisions reflected in the system:

  1. The system designers could have decided to include only one document format, but they chose to support multiple formats.
  2. They created crawlers and algorithms that parse each page and fill in resource descriptions such as author, category, and keywords. Some attributes are mandatory and some are not.
  3. All resources are searched. No distinction is visible to the user between a resource from 2013 and one from 1905, or between a PDF document and an HTML page. The system creators could have chosen to restrict the domain of resources, but they kept it vast.
  4. Granularity is another challenge. They could have treated a whole journal as one document, but they chose the individual document as the unit of organization.
  5. The resource selection principles applied during retrieval are “relevance” and “purpose”.
  6. Retention policies: it seems that resources reside in the system indefinitely.

 

1.2 WHY ARE THE RESOURCES BEING ORGANIZED?

  • Resources are organized so that any user can easily look for relevant information aligned with their interests.
  • Resources are indexed so that information retrieval can be efficient.
  • It helps users find rare resources to which they may not have physical access.
  • By looking at certain attributes of a resource, such as citations, users can establish credibility and trust in the author’s research or writing. Resource descriptions play an important role in the organizing system.
  • Resources have attributes such as “cited by”. These attributes can be tracked, and interesting analyses can be performed, such as “Most cited<attribute> paper<resource> in Information retrieval<category>”.
  • It allows users to create alerts for topics they are interested in.

 

1.3 HOW MUCH ARE THE RESOURCES BEING ORGANIZED?

Google Scholar leverages crawlers and parsers similar to those of the Google search engine. It crawls documents to identify authors, topics, keywords, citations, publication information, and metadata such as publication year and month. The documents are parsed, and statistical metrics[3] such as the h-index and h-median are calculated to determine how a resource relates to other resources. These metrics help establish the credibility of the document and hence of the author. To support efficient search, resources are classified by discipline and sub-discipline and by the keywords used to describe them. There seems to be a combination of hierarchical and faceted classification in multiple places on the website. For example, it maintains a hierarchical classification of categories and subcategories. Some categories, such as ICTD, also overlap, and one can reach the same resource via multiple keywords or tags.
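To make the metrics concrete: the h-index of a set of papers is the largest number h such that h of the papers have at least h citations each, and the h-median is the median citation count among those h papers. The following Python sketch computes both from a list of citation counts. It illustrates the definitions only and is not Google Scholar’s implementation; the function names and the sample counts are assumptions.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def h_median(citations):
    """Median citation count among the h most-cited papers (the h-core)."""
    h = h_index(citations)
    if h == 0:
        return 0
    core = sorted(citations, reverse=True)[:h]
    middle = h // 2
    if h % 2 == 1:
        return core[middle]
    return (core[middle - 1] + core[middle]) / 2


# Illustrative citation counts for five papers (made-up numbers).
counts = [17, 9, 6, 3, 2]
print(h_index(counts), h_median(counts))
```

With these sample counts, three papers have at least three citations each, so the h-index is 3 and the h-median is the middle value of 17, 9, and 6.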

It also has a three-tier architecture that separates storage, logic, and interface in the application.

On the document type spectrum, it lies at the lower left because its content is narrative. On the Information IQ plot, it lies in the lower-left quadrant because a resource is a scanned document or a PDF document without any standard structure. No standard governs how papers should be written. IEEE standards are sometimes followed for formatting, but other popular conventions exist too.

1.4 WHEN ARE THE RESOURCES ORGANIZED?

Research papers and articles are crawled periodically, and their resource descriptions are captured so that Google Scholar’s algorithms and computational processes can calculate metrics. Google Scholar keeps track of citations for each document and periodically generates Google Scholar Metrics, which may further help researchers.

Maintenance and curation of resources are continuous as well. Scores and citations are updated periodically by algorithms.

1.5 WHO DOES THE ORGANIZING?

Algorithms and computational processes organize the information within and between resources. Additionally, publishers are given guidelines on what they could do better so that Google Scholar can crawl their documents. Human intervention is also involved for moderation.

For older documents, Google digitized them and created their resource descriptions in conjunction with experts.

1.6 OTHER CONSIDERATIONS

1. Google Scholar looks for patterns in a document on the Internet to decide whether it is a paper. This process produces a high number of false positives, so the indexed documents are not always good-quality research papers.

2. Google Scholar doesn’t provide the actual data from which a paper’s results were derived. For reproducible research, it relies on the research paper itself to describe that data.

3. Quality of resources could be a challenge given the number of papers being generated every day. If the organizing system doesn’t maintain high quality resources, it may defeat its purpose and become a search engine.


[2] Vine, Rita (January 2006). “Google Scholar”. Journal of the Medical Library Association 94 (1): 97–9. PMC 1324783.

PeerLibrary: Organizing the world’s scientific knowledge

OVERVIEW

Since the dawn of science, humans are estimated to have published approximately 50 million scientific articles. In the biomedical and social sciences alone, a new publication is added approximately every minute of every day. Discussed here is one novel organizing system currently under development, a web application called PeerLibrary, that strives to create an enriched, collaborative experience around navigating this corpus of knowledge.

WHAT RESOURCES ARE BEING USED?

Scientific articles are the primary resources being organized by PeerLibrary. The entire service is centered around access to these articles and collaborative, human-generated description of them in the form of digital document annotations. Resource descriptions are extracted from each article, including author names, the abstract text, publication year, and journal information. Because many articles are not available in HTML/XML formats, the system must unfortunately rely on descriptions provided by journals, by users, and through Optical Character Recognition. Additionally, it is notable that users in this system not only interact with the articles but are also considered resources themselves, as potential authors of both articles and annotations.

One major design consideration in building this system was determining exactly how articles would be selected and added to the PeerLibrary resource collection. While the overall goal is to make all of the world’s scientific knowledge open for discussion, dynamically adding every article as it is published is impractical. APIs to journal and archive databases help fill in many of the resources, but complex issues persist, such as user authentication to access articles behind paywalls and timely discovery of newly published work. An Import feature was added to the current application version to let users quickly access all of their initial resources of interest in the cloud. Of course, this useful feature poses a challenge for the organizing system because it creates opportunities for duplicate articles to be added to the resource collection. Resource descriptions such as article title, year of publication, and author names must be used consistently to handle duplicate cases properly.
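One way to use title, year, and author names consistently for duplicate detection is to reduce them to a normalized comparison key. The Python sketch below is a hypothetical illustration, not PeerLibrary’s actual approach: the function names are invented, and it assumes author names are written with the surname last.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def dedup_key(title, year, authors):
    """Build a comparison key from title, year, and sorted author surnames.

    Assumption: each author name ends with the surname ("Given Surname").
    """
    surnames = []
    for name in authors:
        tokens = normalize(name).split()
        if tokens:
            surnames.append(tokens[-1])
    return (normalize(title), int(year), tuple(sorted(surnames)))


# Two imports of the same (made-up) article with different formatting.
first = dedup_key("Annotating Open Science: A Case Study", "2014",
                  ["Jane Q. Smith", "Ravi Patel"])
second = dedup_key("annotating open science - a case study", 2014,
                   ["R. Patel", "J. Smith"])
print(first == second)
```

Two records with equal keys are only candidate duplicates; a real system would likely confirm with a stronger identifier where one is available.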

The matter of scoping which kind of resources would be allowed into the collection is also an important, non-trivial conversation that occurred early in the development process. Lots of different kinds of articles could benefit from collaborative annotation, including fiction writing, news stories, and humanities research. However, building a tool that specifically supports interactions around scientific inquiry narrows the types of resource descriptions that are possible while maximizing their utility. For example, entities like abstracts can be consistently used as comprehensive previews of articles, journals usually provide information about the field of study that the article is primarily relevant to, and authors can be connected to articles, fields of study, and to other authors in a meaningful social network.

WHY ARE THE RESOURCES ORGANIZED?

Historically, publishing of scholarly literature has been a practice that exploits the research community while creating lucrative profits for publishing companies. To drive science forward, researchers need access to the highest quality and most relevant past work that can inform context and decisions for current and future studies. Furthermore, it is not enough to allow articles to be openly and freely accessed. There is increasingly a need for a space to openly exchange knowledge, feedback, and insights about the conducted research. PeerLibrary recognizes that researchers need access to intuitive collaboration tools in order to get used to being in this open science mindset. Somewhere down the road, this approach might help build a more open, sustainable, and high quality peer review system.

HOW MUCH ARE THE RESOURCES ORGANIZED?

When scientific articles are added to PeerLibrary, they are parsed for all of the resource descriptions discussed earlier, and then added to a NoSQL MongoDB database right away. The documents are added to the collection in a very unstructured way and saved as independent Document objects, each containing resource descriptions in structured fields. Unlike in systems with more heterogeneous resource description formats, PeerLibrary worries minimally about how documents are organized in the collection and instead relies on the structure of these resource descriptions to facilitate user interactions such as searching for articles.

WHEN ARE THE RESOURCES ORGANIZED?

Article resources have the potential to be organized by any resource description properties as soon as they are added. Users dynamically organize articles into sub-collections of interest by narrowing their search with keywords, specific authors, publication date ranges, and other descriptors. Users can also share pointers to articles and specific annotations with others, creating the possibility of group collections.

WHO DOES THE ORGANIZING?

The organization of articles in PeerLibrary is done largely by the users. The crowdsourced approach of this tool allows users to select articles relevant to personal or group research interests and then read, share, discuss, and export citations for them. While journals impose structured descriptions of the articles, the collaborative layer of user-generated knowledge that PeerLibrary creates enables each user to create custom collections.

OTHER CONSIDERATIONS

A notable complication in this system is that many users who would like to sign up for an account to read and annotate documents might be authors of articles contained in the resource collection. To avoid representing one person twice, as both a user and an author, a decision was made to create a Person identity for each unique author in the current corpus. When a new user creates an account at PeerLibrary, the system first checks whether the information they provide is a potential match for a Person already in the system. If so, the User and Person identities are linked together to avoid the vocabulary problem of synonymy and to provide a richer user experience. When a group of authors publishes a new article, it is also important that this article be listed under the correct individuals’ profile pages. Multiple researchers having the same name, individuals changing their names, and inconsistent formatting of name strings across an individual’s publications (e.g., including a middle initial or not) all pose significant obstacles here.
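The “potential match” check can be sketched as a tolerant name comparison. The Python code below is a hypothetical illustration only; the function names are invented, and it assumes names are written with the given name first and the surname last. It treats two names as a potential match when the surnames agree and the given names are equal or one is an initial of the other, ignoring middle names.

```python
def split_name(name):
    """Return (given, surname) in lowercase, or None if fewer than two parts."""
    tokens = name.replace(".", " ").lower().split()
    if len(tokens) < 2:
        return None
    return tokens[0], tokens[-1]


def given_names_compatible(a, b):
    """True if the given names are equal or one is the initial of the other."""
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) == 1 and longer.startswith(shorter)


def is_potential_match(user_name, person_name):
    """Flag a new user as a possible match for an existing Person record."""
    user, person = split_name(user_name), split_name(person_name)
    if user is None or person is None:
        return False
    return user[1] == person[1] and given_names_compatible(user[0], person[0])


print(is_potential_match("Jane Q. Smith", "J. Smith"))
print(is_potential_match("Jane Smith", "John Smith"))
```

Because different researchers can share a name, a check like this can only suggest candidates; the link itself would still need confirmation, for example by the user.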

The Organizing System for Cricket

Overview
From the old-school charms of Calcutta’s grounds to the big-city vibe of Bombay, and popular from the foothills of the Himalayas to the beaches of Chennai, cricket is not just a sport in India but a religion that unites the whole nation. Tons of venues, hundreds of players, thousands of matches, and millions of spectators together generate a great deal of information, which, when systematically organized, becomes a powerful tool for post-match analysis. The domain of my case study is “Organizing system for Cricket matches”, and its scope is “to organize and provide Cricket match details on an online platform during and after a cricket match”.
What resources are being used?
This organizing system systematically organizes information about all cricket matches, in-depth statistics for every player and team, match scorecards, and match commentary for each game. It also organizes in-depth post-match analysis of every aspect of the game by experts, whose judgments are based not merely on a keen understanding of the game but on a wider understanding of society, history, and human behavior and their impact on the game.
Why are the resources organized?
In this system, resources are organized so that cricket fans can retrieve data about a match that is in progress or has already been played. Organizing match details on an online platform lets users interact with the resources at different levels of granularity, for example through live commentary and live scorecard updates during the match. It also lets a fan interact not just with a single match but with the sport as a whole (through the resource descriptions provided for each stroke played and the post-match analysis), with the players (through access and updates to a player’s profile), and with the collection of players in a team.
This organizing system assesses information about the agents involved in the game. The goal is to synthesize insights from this information, to give the teams participating in any tournament a richer understanding, and to help them select players. It is also used to analyze and predict players’ performances, to predict their likelihood of injury, and to predict the number of spectators for a particular match at a venue. The system makes it easier for cricket teams to trace the connection between their requirements and players’ capabilities, and hence helps teams decide on trades of players among themselves.
How much are the resources organized?
Cricket has a predefined controlled vocabulary, which supports semantic comprehension for the users of this organizing system. The controlled vocabulary is imperative because it is the means to communicate the statistics of a match, analyze the match, and assess the performance of the players.
The expected lifetime of the organizing system is not the same as the expected lifetime of a cricket match. Although the match is a short-lived entity, resources such as the match scorecard, match commentary, player statistics, and post-match analysis report remain for the lifetime of the organizing system.
When are the resources organized?
In this organizing system, all the resources are digital; some are generated after every match, while others are created by automated processes. They therefore exhibit a high degree of organization and structure, because they are generated automatically in conformance with data or document schemas. These schemas implement the rules of the game and the information models for updating scorecards and generating match statistics.
All the entities involved in the system become part of the organization as soon as they participate in the game in any form, and they remain part of the system for the lifetime of the organization. Since the sport is governed by rules set by the International Cricket Council, and those rules and regulations change from time to time, this ultimately affects the nature and extent of the organizing system. The classification of individual resources also changes over time. For example, based on his performance, a player can be promoted from the domestic level to the international level, which shifts him to another class in the system.
Who does the organizing?
Organization is performed by professional indexers and information feeders with the help of computer algorithms. They create and maintain the organizing system and ensure the accuracy of its data. They are also responsible for implementing the same logical organizing system in different classes by separating the organizing principles into the middle tier.
Other considerations
Currently, the game has fixed formats that depend on the duration of the match: one-day matches last 50 overs per side, while Test matches have no limit on overs but last up to 5 days. With the introduction of new formats such as Twenty20 matches, which are fast paced, last only 20 overs per side (40 overs in total), and focus mainly on public entertainment, many factors in the analysis of player statistics will be affected. This will also change the schema or structure of the organizing system.
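One way to keep the schema open to new formats is to describe each format as data rather than hard-coding it. The Python sketch below is a hypothetical illustration; the class and field names are assumptions, and only the over and day limits mentioned above are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchFormat:
    name: str
    overs_per_side: Optional[int]  # None means no limit on overs
    max_days: int

    def total_overs(self):
        """Scheduled overs for both sides together, or None if unlimited."""
        if self.overs_per_side is None:
            return None
        return 2 * self.overs_per_side


# Adding a new format means adding an entry here, not changing the code.
FORMATS = {
    "Test": MatchFormat("Test", None, 5),
    "ODI": MatchFormat("ODI", 50, 1),
    "T20": MatchFormat("T20", 20, 1),
}

for fmt in FORMATS.values():
    print(fmt.name, fmt.total_overs())
```

Player statistics could then be grouped by format name, so that figures from a 20-over match are not mixed with those from a five-day match.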

A Digital Collection of Mountains

 

  • Overview (1 pt)

 

This is a hypothetical website dedicated to the organization of mountains for outdoor enthusiasts. The website allows users to discover mountains and find out information that will aid them in their mountain activities. Information includes: elevation, location, route descriptions / trails, climbing difficulty, user reviews, photos, and maps. These properties allow users to search for mountains based on what properties they value most, as well as perform research for future trips into the wilderness.

 

  • What resources are being used? (2 pts)

 

What is a mountain? At first thought, this seems like a relatively trivial question. But different entities have differing requirements for what is considered a mountain, based on different properties. The most widely used property for defining a mountain is elevation. The United States considers any point higher than 1,000 feet to be a mountain, but the United Kingdom requires a peak to be at least 2,000 feet. In addition to elevation requirements, some entities include a topographical prominence requirement, which basically means that a peak must be a specified elevation above its surrounding geography. Similar to elevation requirements, there are many conventions used, but no universal standard. For the purposes of this organizing system, we will define a mountain as any geographic peak 1,000 feet higher than its surrounding geography.

Even after we have settled on a definition based on elevation and prominence, we still have to answer the question: what is a single mountain? For example, Mount Whitney, which is the tallest peak in California, actually consists of five separate peaks but is still identified as one entity. Mountaineers typically use a less-than-perfect specification based on perceived prominence. If a particular peak dominates an area, it is called the “parent” peak. A hierarchy is created, with all associated peaks on the landmass classified as “subpeaks” of the parent.
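The definition and the parent/subpeak hierarchy can be sketched as a small data structure. The Python code below is a hypothetical illustration: the class and field names are assumptions, the peaks and their numbers are made up, and the 1,000-foot rule is read as “at least 1,000 feet” of prominence.

```python
from dataclasses import dataclass, field

MIN_PROMINENCE_FT = 1000  # threshold chosen for this organizing system


@dataclass
class Peak:
    name: str
    elevation_ft: int
    prominence_ft: int  # height above the surrounding geography
    subpeaks: list = field(default_factory=list)

    def is_mountain(self):
        """Apply the prominence-based definition of a mountain."""
        return self.prominence_ft >= MIN_PROMINENCE_FT

    def all_peaks(self):
        """Yield this peak and every subpeak beneath it, depth first."""
        yield self
        for sub in self.subpeaks:
            yield from sub.all_peaks()


# Made-up example: one parent peak with two subpeaks.
parent = Peak("Parent Peak", 14000, 10000, [
    Peak("Subpeak A", 13900, 300),
    Peak("Subpeak B", 13800, 1200),
])

print([p.name for p in parent.all_peaks() if p.is_mountain()])
```

Keeping subpeaks nested under a parent lets the system present the whole landmass as one entity while still describing each peak separately.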

Taking into consideration the four distinctions about resources in an organizing system, we can use the above requirements of a mountain to describe the resource domain. The primary format of these resources is physical. The agency view on resources is passive, becoming valuable only if interacted with. And the focus is the digital description of these physical resources.

 

  • Why are the resources organized? (2 pts)

 

Most users of the organizing system will have a set of predefined requirements for what resource properties make up their ideal mountain experience, and seek to discover a mountain that fits these requirements. Other users will already know which mountain they are looking for, and seek out particular descriptions associated with this resource. With this in mind, resources are being organized to support access and discovery of mountains. Additionally, a selection interaction supports a choice/comparison of resources from the collection.

 

  • How much are the resources organized? (2 pts)

 

The resource scope is narrow and the scale is fairly large. There are many mountains to be organized, but the resource type is homogeneous. Because the resources are so similar, a larger number of descriptions is needed to differentiate them. Also, there is a large and relatively diverse set of users for this organizing system. Different types of users include day hikers, backpackers, mountaineers, skiers and snowboarders, and ice climbers. This heterogeneity in users means that the system must support a variety of interactions, with different users placing more importance on particular resource properties.

Because of the diversity in users with different selection principles, a faceted classification system will allow users to search for resources by filtering on multiple resource descriptions.
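A faceted search of this kind can be sketched as a filter that keeps only the resources matching every facet a user has chosen. The Python code below is a hypothetical illustration; the property names and the sample mountains are made up. A facet may be an exact value, a value that must appear in a multi-valued property, or a predicate such as an elevation range.

```python
def matches(resource, facets):
    """True if the resource satisfies every selected facet."""
    for prop, wanted in facets.items():
        value = resource.get(prop)
        if callable(wanted):
            ok = value is not None and wanted(value)
        elif isinstance(value, (set, frozenset, list, tuple)):
            ok = wanted in value
        else:
            ok = value == wanted
        if not ok:
            return False
    return True


def faceted_search(resources, **facets):
    """Filter resources by any combination of facets."""
    return [r for r in resources if matches(r, facets)]


# Made-up sample data.
MOUNTAINS = [
    {"name": "Peak A", "elevation_ft": 9200, "difficulty": "easy",
     "activities": {"hiking", "skiing"}},
    {"name": "Peak B", "elevation_ft": 12400, "difficulty": "hard",
     "activities": {"mountaineering", "ice climbing"}},
    {"name": "Peak C", "elevation_ft": 7100, "difficulty": "easy",
     "activities": {"hiking"}},
]

easy_hikes = faceted_search(MOUNTAINS, difficulty="easy", activities="hiking")
print([m["name"] for m in easy_hikes])
```

A skier and a day hiker can run the same search function with different facets, which is the point of the faceted approach.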

 

  • When are the resources organized? (1 pt)

 

Organizing occurs when an interaction takes place. Resources are organized by selection of desired resource properties. Different users differentiate resources by different properties. Although a resource is created by users, the predefined schema ensures each resource is highly structured, with the same set of resource descriptions.

  • Who does the organizing? (1 pt)

 

The schema of the system determines which resource properties are used to organize the resources. However, each individual user determines which properties to organize the resources by. Similar to Wikipedia, curation of the resources is handled by the user community. Explicit dynamic properties that users may modify, like climbing routes and activities, could affect the organization of these resources.

 

  • Other considerations (1 pt)

Because resources are created by users, maintenance is of critical importance to ensure the accuracy of information. Misinformation could put other hikers in danger. Another maintenance consideration is dynamic resource properties and the vocabulary problem that they can create. Names of mountaineering routes and hiking trails may change with time, creating a synonymy problem. People searching for a particular route by name may be unsuccessful if they are using the “wrong” name. Additionally, even properties that appear to be static may be dynamic. The elevation of Mount Everest was recorded to be 29,028 feet in 1954. Recent measurements indicate that Everest has grown 7 feet to 29,035 feet. Mountain names can also be dynamic properties. There has been a longstanding debate about what to call North America’s highest mountain: Mount McKinley or Denali.

Managing a Fantasy Soccer Team

Fantasy Soccer Recommendation System

Overview

A nerdy, competitive soccer fan has recently been introduced to the world of Fantasy Football. As a new manager of a virtual soccer team, he faces an immediate goal of building a team that competes and wins in his ‘league’. A league here is his group of friends or peers who are competing with him. Each week, his basic objective is to ensure his team gets the most Fantasy points by making good player selection decisions. There are several constraints to making a good team which include the budget that is allocated to each team in the league, and the limited pool of available players for selection.

An organizing system, therefore, is required to support the goal of building a winning team by choosing a mix of players that will perform well in their respective matches and potentially generate maximum Fantasy points. While there are several different ways to achieve this, I am going to tackle this problem by organizing the information about the players in such a way that makes the selection process cognitively less demanding and more intuitive for the new Fantasy Soccer manager.

What resources are being organized?

A Fantasy soccer team as an entity can be represented at two different levels of abstraction. The first and more abstract level is the ‘squad’ of players available for selection. The squad typically consists of 17 players. The second level of abstraction is the ‘playing eleven’ for a particular week. The squad is selected from a common ‘universe’ of players that is available to everyone in the league. Both levels have specific rules and place different constraints that determine the inclusion of a player. For example, you cannot choose more than 3 players from the same club or team to be part of your squad.
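The two squad rules mentioned above, a squad of 17 players and at most 3 players from the same club, can be checked mechanically. The Python sketch below is a hypothetical illustration; the function name and the representation of a squad as (player, club) pairs are assumptions, and other constraints such as the budget are left out.

```python
from collections import Counter

SQUAD_SIZE = 17    # typical squad size stated in the text
MAX_PER_CLUB = 3   # club limit stated in the text


def squad_problems(squad):
    """Return a list of rule violations for a squad of (player, club) pairs."""
    problems = []
    if len(squad) != SQUAD_SIZE:
        problems.append(f"squad has {len(squad)} players, expected {SQUAD_SIZE}")
    per_club = Counter(club for _, club in squad)
    for club, count in sorted(per_club.items()):
        if count > MAX_PER_CLUB:
            problems.append(f"{count} players from {club}, limit is {MAX_PER_CLUB}")
    return problems


# Made-up squad: 17 players spread over six clubs, at most 3 per club.
valid_squad = [(f"Player {i}", f"Club {i // 3}") for i in range(17)]
print(squad_problems(valid_squad))
```

Returning a list of problems rather than a single yes or no lets the system tell a new manager exactly which rule a draft squad breaks.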

We are interested in organizing information about the universe of players so that a manager can easily and intuitively select a squad and a playing eleven for an upcoming match. In any case, the resources being organized are two kinds of digital resources, i.e., soccer players (virtual versions) and information about those players.

This set of information (or description resources) about players basically consists of performance statistics that can be organized to analyse trends and generate predictive, data-driven models to assist the selection process.

Why are the resources organized?

Both kinds of resources are organized to assist in the singular goal of decision support. In particular the decisions that a virtual manager makes each week will include ‘keeping’, ‘benching’, and ‘trading’ players. So the primary resources must be organized into these three categories. The only way the primary resources can be organized is by first selecting and organizing their description resources.

For anyone new to the world of Fantasy sports, it can be an overwhelming initial experience and furthermore, it can be years before people learn to recognise and analyse the data available to them. This organizing system is meant to somewhat smoothen the learning curve.

How much are the resources organized?

The primary resources in this organizing system are players. Ontologies in soccer management are typically well defined. But to be more explicit, at the coarsest level of granularity, players are classified by the teams they play for, and then by the positions they play in.

For the purpose of Fantasy football, any type of organizing system will include these two basic classifications. One way of further increasing granularity is further classification. So, if a player X belongs to team Y and plays as a “Midfielder”, the position could be further classified into “Central” and “Wing”, and then “Central Midfielders” could further be classified into “Attacking” or “Defensive”. This would present a hierarchical organization scheme.

For our system, several deeper levels of granularity would be required for each description resource. “Central Attacking Midfielders” would have their own description resources like “number of passes”, “number of goals”, “number of assists”, “work rate (distance run)”, etc., along with their averages. Depending on some logic applied to these description resources, a “central attacking midfielder” could then theoretically be classified into the “Keep”, “Bench”, or “Trade” categories.
Alternatively, the same description resources could classify the midfielder into “playmaker”, “scorer”, and other intermediate categories before finally classifying them into “Keep”, “Bench”, or “Trade”.
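The “logic applied to these description resources” is left open in the text. As one hypothetical possibility, the Python sketch below combines a player’s per-match averages into a single weighted score and maps it to “Keep”, “Bench”, or “Trade”. The weights, thresholds, and statistic names are invented for illustration and would need tuning against real Fantasy scoring rules.

```python
# Made-up weights per description resource (per-match averages).
WEIGHTS = {"goals": 5.0, "assists": 3.0, "passes": 0.02, "distance_km": 0.1}


def score(averages):
    """Weighted sum of a player's per-match averages; missing stats count as 0."""
    return sum(weight * averages.get(stat, 0.0) for stat, weight in WEIGHTS.items())


def recommend(averages, keep_at=4.0, bench_at=2.0):
    """Map a score to a category using hypothetical thresholds."""
    total = score(averages)
    if total >= keep_at:
        return "Keep"
    if total >= bench_at:
        return "Bench"
    return "Trade"


# Made-up averages for a central attacking midfielder.
midfielder = {"goals": 0.6, "assists": 0.4, "passes": 50, "distance_km": 10}
print(recommend(midfielder))
```

Intermediate categories such as “playmaker” or “scorer” could be added in the same way, by applying separate weight sets before the final classification.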

When are the resources organized?

Organizing begins when a Fantasy manager joins a league, so that he is able to pick a squad initially. The activity of organizing then continues throughout the season, and happens after every match. A player might not have been recommended initially, but may have improved many-fold over time to warrant selection, or a player might perform badly over time, or a player just might not be the best option in the position for the upcoming match. In any given case, the players must be organized continuously to accurately recommend players.

Who does the organizing?

Since the goal of the system is to assist decision making, it is only natural that the system does the organizing. However, the system does not enforce its recommended scheme. The Fantasy manager has the flexibility to select the players and ignore the system – in which case, the selection principles might be arbitrary.

Other considerations

An important consideration in such an organizing system is maintenance. What happens when another parameter becomes available that can be measured to increase the accuracy of the predictions?

Another consideration in organizing players in Fantasy soccer is the selection of ‘formations’. The formation is crucial in influencing the number of points a Fantasy team gets each week. As of now, this system only tells the manager which players are likely to generate the maximum points for their positions. The user should also be able to select the 11 players that will potentially generate the maximum points and arrange them into an allowed formation.
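Choosing the best eleven for an allowed formation can be sketched as follows: for each formation, take the top predicted scorers in each position and keep the formation with the highest total. This Python code is a hypothetical illustration; the list of allowed formations, the position codes, and the predicted points are assumptions, and real Fantasy games may allow a different set of formations.

```python
# Assumed allowed formations as (defenders, midfielders, forwards).
FORMATIONS = [(4, 4, 2), (4, 3, 3), (3, 5, 2), (3, 4, 3), (5, 4, 1)]


def best_eleven(squad, formations=FORMATIONS):
    """Pick the formation and players with the highest predicted total.

    squad: list of (name, position, predicted_points) tuples, where
    position is one of "GK", "DEF", "MID", "FWD".
    Returns (total, formation, team), or None if no formation can be filled.
    """
    by_pos = {"GK": [], "DEF": [], "MID": [], "FWD": []}
    for player in squad:
        by_pos[player[1]].append(player)
    for players in by_pos.values():
        players.sort(key=lambda p: p[2], reverse=True)

    best = None
    for defenders, midfielders, forwards in formations:
        need = {"GK": 1, "DEF": defenders, "MID": midfielders, "FWD": forwards}
        if any(len(by_pos[pos]) < n for pos, n in need.items()):
            continue  # not enough players in the squad for this formation
        team = [p for pos, n in need.items() for p in by_pos[pos][:n]]
        total = sum(p[2] for p in team)
        if best is None or total > best[0]:
            best = (total, (defenders, midfielders, forwards), team)
    return best
```

Because each position is filled independently once the formation is fixed, taking the top scorers per position gives the best team for that formation under these assumptions.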

Case Study: Flickr

Overview

My organizing system is a digital photo library.  Specifically, I will be examining my own personal Flickr library.

What resources are being used?

Flickr began as a website for sharing photographs online.  In 2008, Flickr expanded the scope of the system to allow for short videos as well.  This was initially limited to 90 seconds (a limit that was eventually expanded to 180 seconds).

Flickr has a 1TB limit on accounts, so there is certainly enough size to fit all photos and videos I have ever taken, but I have intentionally limited the scope in my Flickr library to photos that I subjectively consider to be of a high enough quality to warrant sharing with the public.  For example, photos of my brunch are not within the scope of my Flickr organizing system (it may be fair game for Instagram, in contrast).

Comments on my photos from other users are another type of resource organized within my Flickr library.

Why are the resources organized?

The primary reason for the Flickr library organizing system is to share photography with the rest of the world.  Allowing online users to view my photographs is the primary interaction the system is designed for.  Because of this priority, the photos are organized in various ways, such as sets and collections to allow users to view groups of related photos.

A secondary goal of the system is to allow these users to interact with the photographs in various ways (e.g., to comment on them, or to mark them as their “favorites”).

Another goal of the online Flickr library is preservation – it is important to have an off-site backup of digital files in case of burglary, data loss, or disasters.  However, this is only a secondary goal of the organizing system – if it were a primary goal, the scope would need to be expanded to be less selective.

How much are the resources organized?

Many people find photos through search queries (both on Flickr and other search engines like Google Image Search), so it is important to have helpful resource descriptions.  These take the form of photo descriptions, as well as tags.  I geotag my photos with latitude and longitude markers to enable searching of my photos by location (as a special case, photos taken near my home are not geotagged for privacy reasons).

Flickr allows users to use arbitrary tags, so enforcing a controlled vocabulary at a Flickr-wide level is impossible.  However, within my own personal library I attempt to be consistent about the tags used to enable the retrieval of related photos.  For example, all photos taken with an iPhone 5S are consistently tagged “iphone5s” to enable users to find all photos taken with a particular camera.

The categories delineated by the tags used are primarily oriented around the subject of the photo (“goldengatebridge”, “sunset”), but may also be oriented around the location of the photo, the camera used to take the photo, or the “genre” of the photo (“sanfrancisco”, “iphone4s”, “portrait”).

The photos are also organized into sets of related photos.  The sets are then arranged into collections, forming a system that resembles a hierarchical classification system (except that photos may be found in multiple sets).  Similar to the way I use tags, I have sets for genres (e.g., a “Black & White” set), locations (“New Zealand”, “Taiwan”), as well as sets with more subjective selection criteria like “My Favorites”.

When are the resources organized?

Flickr photos are all organized much later than the time of creation.  Because the purpose of my organizing system is to share photos I like (or think are good photos), I tend to edit photos in a third-party program (e.g., Apple Aperture, Adobe Photoshop) before uploading them to Flickr.  This process sometimes takes a long time, and I might not edit a photo for months (or even years) after the photo was taken.

Who does the organizing?

The organizing of the photos is done entirely by me.  By design, Flickr does not allow other users to add photos to sets or collections.  However, Flickr does give the option of allowing other users to add tags to photos.  Because I use tags as a sort of personal categorization, I do not allow other users to tag my photos.

Other considerations

Different users on Flickr tend to use tags differently – the lack of a controlled vocabulary sometimes becomes a problem when searching for photos.  For instance, if you want to look for all photos relating to autumn, do you look for photos tagged “fall” or “autumn”?
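Within a personal library, one way to soften this problem is to map synonymous tags to a single canonical tag before searching.  The Python sketch below is a hypothetical illustration and does not use the Flickr API; the synonym table, function names, and sample photos are made up.

```python
# Hypothetical synonym table: variant tag -> canonical tag.
SYNONYMS = {"fall": "autumn"}


def canonical(tag):
    """Lowercase a tag, remove spaces, and map known synonyms to one form."""
    tag = tag.lower().replace(" ", "")
    return SYNONYMS.get(tag, tag)


def search(photos, tag):
    """Return titles of photos whose tags match the query after normalization."""
    wanted = canonical(tag)
    return [title for title, tags in photos
            if wanted in {canonical(t) for t in tags}]


# Made-up photo library as (title, tags) pairs.
PHOTOS = [
    ("Maple leaves", ["Fall", "leaves"]),
    ("Park bench", ["autumn", "park"]),
    ("Beach day", ["summer"]),
]

print(search(PHOTOS, "fall"))
print(search(PHOTOS, "autumn"))
```

Both queries return the same two photos, because “fall” and “autumn” are reduced to the same canonical tag.  A table like this only helps where I control the tags, so it does not solve the problem across Flickr as a whole.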