Anna Swigart | I202 | Information Organization and Retrieval

OVERVIEW

It is estimated that humans have published approximately 50 million scientific articles since the dawn of science. In the biomedical and social sciences alone, a new publication is added approximately every minute of every day. Discussed here is one novel organizing system currently under development, a web application called PeerLibrary, which strives to create an enriched, collaborative experience around navigating this corpus of knowledge.

WHAT RESOURCES ARE BEING USED?

Scientific articles are the primary resources being organized by PeerLibrary. The entire service is centered on access to these articles and on collaborative, human-generated description of them in the form of digital document annotations. Resource descriptions are extracted from each article, including author names, the abstract text, publication year, and journal information. Because many articles are not available in HTML/XML formats, the system must rely on descriptions provided by journals, by users, and through Optical Character Recognition. It is also notable that users in this system not only interact with the articles but are also considered resources themselves, as potential authors of both articles and annotations.

One major design consideration in building this system was determining exactly how articles would be selected and added to the PeerLibrary resource collection. While the overall goal is to make all of the world’s scientific knowledge open for discussion, dynamically adding every article as it is published is impractical. APIs to journal and archive databases supply many of the resources, but complex issues persist, such as user authentication for articles behind paywalls and timely discovery of newly published work. An Import feature was added to the current application version so that users can quickly access all of their initial resources of interest in the cloud. This useful feature poses a challenge for the organizing system, however, because it creates opportunities for duplicate articles to be added to the resource collection. Resource descriptions such as article title, year of publication, and author names must be used consistently to properly handle duplicate cases.
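To illustrate how consistent resource descriptions could support duplicate handling, here is a minimal Python sketch. It is not PeerLibrary's actual implementation: the record layout (`id`, `title`, `year`, `authors`), the function names, and the assumption that each author string ends with the surname are all hypothetical choices made for the example.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def duplicate_key(title, year, authors):
    """Build a comparison key from title, year, and author surnames.

    Assumption: each author string is written "Given ... Surname".
    """
    surnames = sorted(normalize(a).split()[-1] for a in authors if normalize(a))
    return (normalize(title), int(year), tuple(surnames))


def find_duplicates(articles):
    """Group the ids of articles that share the same comparison key."""
    seen = {}
    for article in articles:
        key = duplicate_key(article["title"], article["year"], article["authors"])
        seen.setdefault(key, []).append(article["id"])
    return [ids for ids in seen.values() if len(ids) > 1]
```

Two imports of the same paper that differ only in capitalization, punctuation, accents, or middle initials map to the same key under this scheme. Papers with OCR errors in the title would not, so a production system would likely need fuzzier matching.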

Scoping which kinds of resources would be allowed into the collection was also an important, non-trivial conversation that occurred early in the development process. Many different kinds of texts could benefit from collaborative annotation, including fiction writing, news stories, and humanities research. However, building a tool that specifically supports interactions around scientific inquiry narrows the types of resource descriptions that are possible while maximizing their utility. For example, abstracts can be used consistently as comprehensive previews of articles, journals usually provide information about the field of study to which an article is primarily relevant, and authors can be connected to articles, to fields of study, and to other authors in a meaningful social network.

WHY ARE THE RESOURCES ORGANIZED?

Historically, publishing of scholarly literature has been a practice that exploits the research community while creating lucrative profits for publishing companies. To drive science forward, researchers need access to the highest quality and most relevant past work that can inform context and decisions for current and future studies. Furthermore, it is not enough to allow articles to be openly and freely accessed. There is an increasing need for a space to openly exchange knowledge, feedback, and insights about the conducted research. PeerLibrary recognizes that researchers need access to intuitive collaboration tools in order to grow accustomed to this open science mindset. In the long run, this approach might help build a more open, sustainable, and high-quality peer review system.

HOW MUCH ARE THE RESOURCES ORGANIZED?

When scientific articles are added to PeerLibrary, they are parsed for all of the resource descriptions discussed earlier and then immediately added to a MongoDB (NoSQL) database. The documents are added to the collection with no imposed arrangement and saved as independent Document objects, each containing resource descriptions in structured fields. Unlike systems with more heterogeneous resource description formats, PeerLibrary puts little effort into how documents are arranged in the collection and instead relies on the structure of these resource descriptions to facilitate user interactions such as searching for articles.
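As a rough picture of what "independent Document objects with structured fields" might look like, consider the following Python sketch. The field names and the in-memory `collection` list are assumptions made for illustration; the actual PeerLibrary schema is not described here, and a real deployment would write to MongoDB rather than to a list.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Document:
    """One article record; the field names are illustrative assumptions."""
    title: str
    authors: List[str]
    year: int
    journal: str = ""
    abstract: str = ""
    annotations: List[dict] = field(default_factory=list)


# The collection is a flat, unordered list: no folders and no hierarchy.
collection = []


def add_document(doc):
    """Store the record as a plain dict, roughly as a document store would."""
    record = asdict(doc)
    collection.append(record)
    return record
```

The point of the sketch is that all of the organization lives inside each record's fields. The collection itself carries no ordering or grouping.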

WHEN ARE THE RESOURCES ORGANIZED?

Article resources have the potential to be organized by any resource description properties as soon as they are added. Users dynamically organize articles into sub-collections of interest by narrowing their search with keywords, specific authors, publication date ranges, and other descriptors. Users can also share pointers to articles and specific annotations with others, creating the possibility of group collections.
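The idea of a sub-collection created on demand can be sketched as a filter over article records. This is a simplified Python illustration, not the application's search code: the `select` function, its parameters, and the record keys (`title`, `abstract`, `authors`, `year`) are hypothetical.

```python
def select(articles, keyword=None, author=None, year_from=None, year_to=None):
    """Return the articles that match every criterion that was supplied."""
    results = []
    for art in articles:
        text = (art.get("title", "") + " " + art.get("abstract", "")).lower()
        if keyword and keyword.lower() not in text:
            continue
        authors = art.get("authors", [])
        if author and not any(author.lower() in a.lower() for a in authors):
            continue
        year = art.get("year")
        if year_from is not None and (year is None or year < year_from):
            continue
        if year_to is not None and (year is None or year > year_to):
            continue
        results.append(art)
    return results
```

A call such as `select(articles, keyword="annotation", year_from=2010)` yields a sub-collection that exists only for as long as the user needs it. Nothing about the stored collection changes.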

WHO DOES THE ORGANIZING?

The organization of articles in PeerLibrary is done largely by the users. The crowdsourced approach of this tool allows users to select articles relevant to personal or group research interests and then read, share, and discuss them and export their citations. While journals impose structured descriptions of the articles, the collaborative layer of user-generated knowledge that PeerLibrary creates enables each user to create custom collections.

OTHER CONSIDERATIONS

A notable complication in this system is that many users who would like to sign up for an account to read and annotate documents might be authors of articles contained in the resource collection. To protect against duplicating a person’s identity as both a user and an author, a decision was made to create a Person identity for each unique author in the current corpus. When a new user goes to create an account at PeerLibrary, the system first checks whether the information they provide is a potential match for a Person that is already in the system. If so, the User and Person identities are linked together to avoid the vocabulary problem of synonymy and to provide a richer user experience. When a group of authors publishes a new article, it is also important that this article be listed under the correct individuals’ profile pages. Occurrences such as multiple researchers having the same name, individuals changing their names, and inconsistent formatting of name strings across an individual’s publications (e.g. including a middle initial or not) pose significant obstacles here.
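The matching step can be illustrated with a small Python sketch. It is an assumption-laden simplification rather than PeerLibrary's actual logic: the function names, the `id` and `name` keys, and the choice to compare only first initial and surname are all hypothetical, and the approach deliberately refuses to link when several Person records are plausible.

```python
import re


def name_key(full_name):
    """Reduce a name to (first initial, surname), ignoring middle names.

    Assumption: names are written "Given [Middle] Surname".
    """
    parts = re.sub(r"[^\w\s-]", " ", full_name.lower()).split()
    if not parts:
        return None
    return (parts[0][0], parts[-1])


def candidate_persons(user_name, persons):
    """Return Person records whose name could plausibly match the new user."""
    key = name_key(user_name)
    if key is None:
        return []
    return [p for p in persons if name_key(p["name"]) == key]


def link_user(user_name, persons):
    """Link only when exactly one candidate exists; otherwise defer."""
    matches = candidate_persons(user_name, persons)
    if len(matches) == 1:
        return matches[0]["id"]
    return None  # zero or several candidates: ask the user to confirm
```

This handles the middle-initial inconsistency mentioned above, but not name changes, and two researchers with the same name still produce an ambiguous result that a person has to resolve.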

UC Berkeley Fall 2013 INFO 202

PeerLibrary: Organizing the world’s scientific knowledge