China Data Initiative | I202 | Information Organization and Retrieval

OVERVIEW

China’s rapid development has an enormous impact on the rest of the planet. Yet despite the large amount of information collected by government and research agencies, data on trends in China remains hard to find and interpret, because organizations keep it in isolated silos and database and visualization technology is poorly implemented.

The China Data Initiative (CDI) is designed to support existing information platforms focused on China by connecting their databases with one another, reducing development cost and improving interdisciplinary understanding of environmental, health, and economic policy. Working with organizational partners, CDI also creates and publishes new mashups of existing datasets. Finally, CDI is a forum where users can review and annotate datasets for quality.

CDI will work with universities, government agencies, and research centers to aggregate data about China into a unified dataset. The project involves creating a robust but easy-to-use database system that can flexibly handle large numbers of files, ranging from high-resolution satellite photos to journal articles and large databases. Negotiating with organizations that hold different views on how data is shared and interpreted is a critical aspect of this initiative.

WHAT RESOURCES ARE BEING USED?

The resources being used are statistical databases, spatial/photographic databases, and journal/article databases; the big challenge is organizing them cohesively so that they can be overlaid on one another. Given the scale of data already available and the different needs of other nations, the CDI platform focuses only on China, as a global effort would be too complex.

All files on CDI should be properly categorized so that search and data comparison on the visualization layer are seamless. For example, statistics should be easy to show on a map of power plants alongside the increase in pollution over time seen in satellite imagery, followed by details such as research from a medical journal on the effects of pollution on human health. This example uses three media to cover the issues of energy, pollution, and health in a single interdisciplinary platform.

CDI is essentially a large open source management system. It might seem easier to use separate but related organizing systems for each media type, but these do not offer the benefits of an integrated data management system built to handle all three kinds of datasets. The opportunity cost of finding the ideal app for each type of dataset is also too high. While some apps might offer a high degree of functionality and usability, CDI’s long expected lifetime requires a flexible and elegant open source solution that lets users upload or export resources and resource descriptions with ease.

WHY ARE THE RESOURCES ORGANIZED?

The goal of organizing large amounts of statistical, spatial, and explanatory data is to minimize search cost, improve interdisciplinary understanding of complex interdependent issues, and help maintain data integrity and transparency.

The three key target audiences for this venture are the general public, research institutions, and government agencies, which lack the resources to aggregate all this data or the technology to extract the most meaning from it.

Datasets in the system will be rich in metadata in order to increase accuracy. That accuracy is currently lacking because data collection standards differ widely across the many disciplines that will populate the platform. Each dataset will carry this metadata layer to ensure accuracy, which is a major problem for research in China.
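As a rough illustration of what such a metadata layer could look like, the sketch below defines a dataset record and a check for incomplete descriptions. The field names, the three media type labels, and the sample record are assumptions made for this example; CDI does not specify a schema.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three kinds of resources CDI plans to hold.
MEDIA_TYPES = {"statistical", "spatial", "report"}


@dataclass
class DatasetRecord:
    """Illustrative metadata for one dataset (all field names are assumptions)."""
    title: str
    media_type: str
    source_organization: str
    collection_method: str
    period_covered: str
    tags: list[str] = field(default_factory=list)


def metadata_problems(record: DatasetRecord) -> list[str]:
    """Return human-readable problems found in a record's metadata."""
    problems = []
    if record.media_type not in MEDIA_TYPES:
        problems.append(f"unknown media type: {record.media_type}")
    required = ("title", "source_organization", "collection_method", "period_covered")
    for name in required:
        if not getattr(record, name).strip():
            problems.append(f"missing {name}")
    return problems


# Made-up record with the source organization left blank.
record = DatasetRecord(
    title="Provincial coal consumption",
    media_type="statistical",
    source_organization="",
    collection_method="annual survey",
    period_covered="2005-2010",
)
print(metadata_problems(record))
```

A check of this kind could run when a dataset is submitted, so that records with incomplete provenance are returned to the contributor before they reach the shared collection.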

These decisions determine the requirements for the interactions the organizing system must support, but the repertoire of interactions is mostly determined by the choice of storage, visualization, and sharing applications. The platform will be based entirely in the cloud, but significant work must be done to identify platforms that work well behind the Great Firewall. Functionality must win out over complexity, which would overwhelm the less technology-savvy users who make up the majority.

HOW MUCH ARE THE RESOURCES ORGANIZED?

China Data Initiative will organize data in three major functional areas: raw datasets from verified sources, spatial/photographic data, and research reports. These are organized because they represent the fundamental forms of data representation. Each topic will have a keyword tagging vocabulary that lets it be surfaced and combined with other relevant visualizations or datasets. Other users will be able to add tags to improve the interdisciplinary understanding of the topic.
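One way to picture how shared tags could combine resources across the three areas is a simple tag index. The sketch below is only illustrative: the resource identifiers, media type labels, and tags are made up, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up resources: (identifier, media type, tags).
RESOURCES = [
    ("coal-plant-output", "statistical", {"energy", "pollution"}),
    ("satellite-haze-series", "spatial", {"pollution"}),
    ("respiratory-health-study", "report", {"pollution", "health"}),
    ("rural-clinic-survey", "statistical", {"health"}),
]


def build_tag_index(resources):
    """Map each tag to the (identifier, media type) pairs that carry it."""
    index = defaultdict(list)
    for identifier, media_type, tags in resources:
        for tag in tags:
            index[tag].append((identifier, media_type))
    return index


def related_by_media(index, tag):
    """Group the resources sharing a tag by media type."""
    grouped = defaultdict(list)
    for identifier, media_type in index.get(tag, []):
        grouped[media_type].append(identifier)
    return dict(grouped)


index = build_tag_index(RESOURCES)
print(related_by_media(index, "pollution"))
```

Looking up a single tag here returns one statistical dataset, one spatial series, and one report, which is the kind of cross-media combination the tagging vocabulary is meant to support.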

A carefully designed set of categories and a controlled tagging vocabulary will enable precise browsing, search, and analysis. The user community supports the grouping and tagging of datasets, but not everyone should be allowed to group or tag them: unaccredited members can view a dataset but not add to it. Annotation will be allowed so that verified professionals can add input or address mistakes. All data will be stored in the cloud, so that small files do not get lost on a random server somewhere.
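The two rules in this paragraph, a controlled vocabulary and tagging restricted to accredited members, can be sketched as a single guard function. The vocabulary terms and the boolean accreditation flag below are assumptions for illustration; a real platform would draw both from its own user and vocabulary management.

```python
# Hypothetical controlled vocabulary; the actual terms would be set by the curators.
CONTROLLED_VOCABULARY = {"energy", "pollution", "health", "economy"}


def add_tag(dataset_tags, tag, user_is_accredited):
    """Return a new tag set with `tag` added, enforcing both rules."""
    if not user_is_accredited:
        # Unaccredited members may view datasets but not add to them.
        raise PermissionError("only accredited members may tag datasets")
    normalized = tag.strip().lower()
    if normalized not in CONTROLLED_VOCABULARY:
        raise ValueError(f"'{normalized}' is not in the controlled vocabulary")
    # Return a new set so the caller's original tags are left unchanged.
    return set(dataset_tags) | {normalized}


tags = add_tag({"energy"}, " Pollution ", user_is_accredited=True)
print(sorted(tags))
```

Normalizing case and whitespace before the vocabulary check is a design choice made here so that trivially different spellings of the same term do not fragment the tag set.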

WHEN ARE THE RESOURCES ORGANIZED?

Organizations and individual experts are required to apply categories or tags to resource descriptions whenever they publish them on the platform. In a quarterly review cycle, an expert committee drawn from leading organizations will review anomalies, working in conjunction with algorithms that help detect bad or fraudulent data.
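The text does not say which algorithms would support the review. As one possible example, the sketch below flags values that sit far from the median of a series using the modified z-score based on the median absolute deviation (MAD). The function name, the 3.5 threshold, and the sample readings are assumptions for illustration, and a flag only marks a value for committee review rather than proving it is wrong.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [i for i, d in enumerate(deviations) if d > 0]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Made-up quarterly readings with one value far from the rest.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 25.0]
print(flag_anomalies(readings))
```

A median-based score is used in this sketch because a single extreme value shifts the median far less than it shifts the mean, so the outlier does not mask itself.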

WHO DOES THE ORGANIZING?

A consortium of non-profit and academic organizations focused on China takes on the role of editor and curator. No single organization has expertise in all these matters, but a group of organizations together with an independent community can help reinforce standards of quality.

OTHER CONSIDERATIONS

Maintaining this collection indefinitely is achievable if it becomes the de facto standard for aggregating data on China. Alternatively, a major funder could back the platform, but the budget must be spent conservatively: it is not feasible to build internal teams to work on these data challenges alone, though a government and partner affairs department will be crucial.