Synthetic Data: Silver Bullet? – Data Science W231 | Behind the Data: Humans and Values

Synthetic Data: Silver Bullet?
By Vinod Viswanathan | October 20, 2022

One of the biggest harms that organizations and government agencies can cause to customers and citizens is the exposure of personal information through security breaches exploited by bad actors, both internal and external. Much of this vulnerability stems from a conflict between two goals that are fundamentally at odds with each other: securing data and sharing it safely.

Synthetic data is data that is artificially generated, typically through machine learning techniques that model the real world. To qualify as synthetic data, artificial data must have two properties: it must retain all the statistical properties of the real-world data, and it must not be possible to reconstruct the real-world data from it. The technique was first developed in 1993 at Harvard University by Prof. Donald Rubin, who wanted to anonymize census data for his studies and was unable to do so. He instead used statistical methods to create an artificial dataset that mirrored the population statistics of the census data, allowing him and his colleagues to analyze it and draw inferences without compromising the privacy of citizens. In addition to protecting privacy, synthetic data allows large datasets to be generated, which addresses the data scarcity problem as well.
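To make the first property concrete, here is a minimal sketch of the general idea: fit a simple statistical model to a dataset, then sample new rows from the model instead of releasing the original rows. It assumes a toy two-column dataset of (age, income) pairs and a bivariate normal model. The function names and the toy data are hypothetical, and this is not a reconstruction of Rubin's method. It also only illustrates the "retains the statistics" half of the definition; sampling from a fitted model does not by itself guarantee that real records cannot be reconstructed.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate the means, standard deviations and correlation of two columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted model; no real row is copied."""
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mix z1 into y so that the generated pair has correlation rho.
        y = params["my"] + params["sy"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


def make_toy_real_data(n, rng):
    """Hypothetical stand-in for sensitive records: (age, income) pairs."""
    rows = []
    for _ in range(n):
        age = rng.gauss(40, 10)
        income = 20000 + 1000 * age + rng.gauss(0, 8000)
        rows.append((age, income))
    return rows


rng = random.Random(0)
real = make_toy_real_data(5000, rng)
synthetic = sample_synthetic(fit_bivariate_normal(real), 5000, rng)

# Compare the summary statistics of the two datasets side by side.
for name, rows in (("real", real), ("synthetic", synthetic)):
    p = fit_bivariate_normal(rows)
    print(f"{name:9s} mean_age={p['mx']:.1f} mean_income={p['my']:.0f} rho={p['rho']:.2f}")
```

The two printed lines should be close but not identical, since the synthetic rows are a fresh random sample. A model this simple preserves only means, variances and one correlation; any structure it does not capture is lost in the synthetic copy.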

As privacy legislation progressed and large-scale compute became more efficient, synthetic data started to play a bigger role in machine learning and artificial intelligence by providing anonymous, safe, accurate, large-scale, flexible training data. The anonymity guarantees enabled collaboration across teams, organizations and industries, making research more cost-effective.

Synthetic data mirrors the real world, including its biases. One way bias shows up is through the underrepresentation of certain groups (classifications) in the dataset. Because the technique can generate new data, it can be used to boost the representation of such a group in the dataset while remaining representative of that group.
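As a rough illustration of boosting representation, the sketch below adds synthetic records for an underrepresented group by interpolating between pairs of real records from that group, similar in spirit to SMOTE-style oversampling. The record layout, the `boost_group` helper and the toy scores are assumptions made for this example, not something prescribed by the sources. Note that the new records can only be as representative as the handful of real records they are built from.

```python
import random
from collections import Counter


def boost_group(rows, group_key, group, target, rng):
    """Return rows plus interpolated records so that `group` has `target` rows."""
    members = [r for r in rows if r[group_key] == group]
    if len(members) < 2:
        raise ValueError("need at least two real records to interpolate between")
    numeric = [k for k, v in members[0].items()
               if k != group_key and isinstance(v, (int, float))]
    synthetic = []
    while len(members) + len(synthetic) < target:
        a, b = rng.sample(members, 2)  # two distinct real records
        t = rng.random()               # position on the segment from a to b
        new = dict(a)                  # non-numeric fields are copied from a
        for k in numeric:
            new[k] = a[k] + t * (b[k] - a[k])
        synthetic.append(new)
    return rows + synthetic


rng = random.Random(0)
# Toy dataset: group B is heavily underrepresented (10 rows versus 90).
rows = [{"group": "A", "score": rng.gauss(70, 5)} for _ in range(90)]
rows += [{"group": "B", "score": rng.gauss(60, 5)} for _ in range(10)]

balanced = boost_group(rows, "group", "B", target=90, rng=rng)
print(Counter(r["group"] for r in balanced))
```

Interpolation keeps every generated value inside the range already seen for the group, so it cannot invent variation that the real records lack. Which group to boost is still a human decision.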

A Gartner report, released in June 2022, estimates that by 2030 synthetic data will completely replace real data in training models.

So, have we solved the data problem? Is synthetic data the silver bullet that is going to allow R&D with personal data without all of the privacy harms?

Definitely not.

Synthetic data can improve representation only if a human involved in the research is able to identify the bias in the data. Bias, by nature, is implicit in humans. We have it, and typically we do not know or realize it. Therefore, it is hard for us to pick it up in a dataset, real or synthetic. Recognizing bias continues to be a problem, even though safe sharing and collaboration with a diverse group of researchers increases the odds of removing the blinders and addressing the inherent bias in the data.

The real world is hardly constant, and the phrase “the only constant in life is change” is unfortunately true. A safe, large, accurate and anonymous dataset that supports open access can lull researchers into continuing to use it even after the real world has changed. Depending on the application, even a small change in the real world can introduce large deviations in the inferences and predictions of models that rely on the outdated dataset.
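One way to guard against this kind of staleness is to periodically compare the synthetic training data with a fresh sample of real measurements. The sketch below uses the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions, on a single numeric feature. The toy distributions and the 0.1 alert threshold are assumptions chosen for illustration; a real check would need a threshold suited to the application and would cover every feature that matters.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(0)
synthetic_training = [rng.gauss(50, 10) for _ in range(2000)]
fresh_unchanged = [rng.gauss(50, 10) for _ in range(2000)]  # world has not moved
fresh_shifted = [rng.gauss(58, 10) for _ in range(2000)]    # world has moved

THRESHOLD = 0.1  # hypothetical alert level for this toy example
for name, fresh in (("unchanged", fresh_unchanged), ("shifted", fresh_shifted)):
    d = ks_statistic(synthetic_training, fresh)
    status = "drift suspected" if d > THRESHOLD else "ok"
    print(f"{name}: KS={d:.3f} ({status})")
```

A check like this still depends on someone continuing to collect real data, so it complements the synthetic dataset and does not replace fresh collection.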

Today, the computing power needed to generate synthetic datasets is expensive, and not all organizations can afford it. The cost is far higher if the datasets involve rich media assets such as images and video, which are very common in the healthcare and transportation automation industries. It is also extremely hard to validate that a synthetic dataset and its real-world source data produce identical results in all research experiments.

The ease and advantages of synthetic data can incentivize laziness: researchers may simply stop doing the hard work of collecting real-world data and default to synthetic data. In a worst-case scenario, deepfakes for example make it extremely difficult to distinguish real data from synthetic data, allowing misinformation to propagate into the real world and, through real-world events and data, back into synthetic data, creating a vicious cycle with devastating consequences.

In summary, don’t drop your guard if you are working with synthetic data.

Sources:

What Is Synthetic Data? Gerard Andrews, Nvidia, June 2021

https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/

The Real Deal About Synthetic Data, MIT Sloan Management Review, Winter 2022

https://sloanreview.mit.edu/article/the-real-deal-about-synthetic-data/

How Synthetic Data is Accelerating Computer Vision

https://hackernoon.com/how-synthetic-data-is-accelerating-computer-vision-xp153w6q