Data Obfuscation and the U.S. Census
By Zain Khan | May 28, 2021
The decision by the U.S. Census Bureau to introduce deliberate errors into its data, and now potentially to bring in “synthetic data,” has researchers up in arms. The use of synthetic data involves the deliberate manipulation of census data in an effort to protect the identities of the people counted. Some researchers oppose this action on the grounds that the loss of accuracy will harm the data’s research potential.
Introducing noise and synthetic data is not a new practice; it is known as data obfuscation. Data obfuscation is the “process of replacing sensitive information with data that looks like real production information, making it useless to malicious actors” (Imperva). The need for obfuscation comes from the fact that personal data can be tied to individual identities by malevolent third parties. Data obfuscation aims to reduce the dangers posed by the mass collection of personal data in a survey like the U.S. census. In fact, data obfuscation is often governed by compliance standards such as the EU’s General Data Protection Regulation (GDPR). It is especially important when dealing with population census data.
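To make the idea of introducing noise concrete, here is a minimal sketch in Python. It is not the Census Bureau's actual procedure. The area names, the counts, and the add_noise function with its scale parameter are all hypothetical, chosen only to illustrate how published counts can be perturbed so that they no longer match the true values exactly.

```python
import random

def add_noise(counts, scale, seed=None):
    """Return a copy of `counts` with random Laplace noise added to each value.

    `scale` (an assumed, illustrative parameter) controls how large the
    perturbation tends to be: larger values mean more protection, less accuracy.
    """
    rng = random.Random(seed)
    noisy = {}
    for area, count in counts.items():
        # The difference of two exponential draws with mean `scale`
        # follows a Laplace distribution centered on zero.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        # Round and clamp at zero so published values still look like counts.
        noisy[area] = max(0, round(count + noise))
    return noisy

# Hypothetical counts for three small areas (not real census data).
true_counts = {"block_a": 12, "block_b": 3, "block_c": 157}
published = add_noise(true_counts, scale=2.0, seed=0)
print(published)
```

The trade-off in the debate shows up directly in the scale parameter: a small area such as block_b can be distorted noticeably in relative terms, while a large count such as block_c is barely affected.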
In The Dark Side of Numbers, William Seltzer and Margo Anderson delve into some of the historical atrocities that have been enabled by population census data. In focusing on the dangers of this kind of data, the duo establishes a three-way classification system for data that can be used to target vulnerable individuals or groups (Seltzer and Anderson). The three groups are identified as Macro, Meso, and Micro data.
The table above defines the three categories. Macro data are the census figures we are most familiar with, which describe large geographic areas. Meso data are statistical results for small geographic areas. Micro data are individual-level records, the lowest level of data.
For the 2020 U.S. Census, the Census Bureau seeks to synthesize micro data to protect individuals. Critics of the decision claim that adding such inaccuracies will undermine the credibility of the census. University of Minnesota demographer Steven Ruggles goes so far as to say that the resulting data “will not be suitable for research” (AP News).
Despite the claims made by Ruggles, Seltzer and Anderson’s work highlights the need for data obfuscation and safeguards in census data. The duo outlines a list of safeguards that should be used in tandem to protect against the kinds of misuse of census data seen in the past. The use of data obfuscation in this context falls under what Seltzer and Anderson define as “Methodological and Technological Safeguards”.
While the researchers themselves are held to ethical standards, their use of micro data results in the publication of meso data. As seen in Seltzer and Anderson’s work, meso data is where the danger of census data lies, as the data can be used to target vulnerable population subgroups. It only takes a brief skim of Seltzer and Anderson’s work to see the dire consequences that failing to uphold these basic safeguards has had.
Data obfuscation is necessary for any census that deals with micro data; without it, the census itself is inherently dangerous. Researchers such as Ruggles who claim that the Bureau is “inventing imaginary threats to confidentiality” (AP News) fail to recognize the historical impact that improper population data collection practices have had. While the data may lose a small degree of accuracy, no amount of research is worth risking the well-being of American citizens. Safeguards such as the introduction of synthetic data are the bare minimum for data collection practices moving forward.