Synthetic data generation (SDG) is rapidly emerging as a practical privacy-enhancing technology (PET) for sharing data for secondary purposes. It does so by generating non-identifiable datasets that can be used and disclosed without the need for additional consent, because these datasets would not be considered personal information.
Having worked in the privacy and data anonymization space for over 15 years, I have watched the limitations of traditional de-identification methods become increasingly evident. This creates room for modern PETs that can enable the responsible processing of data for secondary purposes. There's a growing appetite among chief privacy officers (CPOs) to understand where SDG fits as a PET, how synthetic data is generated, what problems it can solve, and how laws and regulations apply.
What is synthetic data?
In a nutshell, synthetic data is generated from real data. We first build a machine learning model that captures the patterns in the real data, and then generate new data from that model. The generated data closely mirrors the statistical properties and patterns of the original dataset. Because there is no one-to-one mapping from synthetic records back to individuals, synthetic data is considered non-identifiable and can be shared more freely, with fewer administrative and technical controls. The process is also highly automated, and therefore requires fewer specialized skills than traditional de-identification methods, which require an expert in statistical disclosure control to perform well.
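The fit-then-sample workflow described above can be illustrated with a deliberately simple sketch. Here a Gaussian mixture (via scikit-learn) stands in for the generative model; real SDG tools use far richer models, and the two "columns" and all values below are simulated purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for a real dataset: two correlated numeric columns
# (think age and income; all values here are simulated for illustration)
real = rng.multivariate_normal(
    mean=[40.0, 60000.0],
    cov=[[100.0, 20000.0], [20000.0, 1e8]],
    size=1000,
)

# Step 1: fit a model that captures the joint patterns in the real data
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# Step 2: sample entirely new records from the fitted model
synthetic, _ = model.sample(1000)

# The synthetic data mirrors statistics of the original (e.g. the
# correlation between the two columns), without copying any real record
print(np.corrcoef(real.T)[0, 1], np.corrcoef(synthetic.T)[0, 1])
```

The key point the sketch makes is that the real records never leave the fitting step: everything downstream is drawn from the model, not from the data itself.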
You have probably seen synthetic data in the form of deep fakes: realistic images of people or animals that are entirely machine generated (see here). In this article we are focused on generating structured data that can be used for analytics.
How can synthetic data be used?
SDG has been around for many years but has begun to really pick up steam because of advances in machine learning and deep learning, which have improved the quality of the generated datasets. Along with that, there's an ever-increasing demand for data.
Synthetic data can mitigate a number of challenges and offer tremendous opportunities. It's currently being applied most heavily in the health and financial sectors, but we're also seeing it percolate into telecommunications, retail, automotive, and beyond. An important use case for SDG is enabling access to data for AI and machine learning projects. For instance, in a recent survey by Deloitte, data access and privacy issues were ranked among the top challenges to the successful implementation of AI projects.
We see that play out in practice often, whereby data science teams are unable to reach their full potential because of the friction in getting access to internal and external datasets. Synthetic versions of these datasets can give those teams rapid access to the data and accelerate their ability to generate insights from them.
Another common use case is software testing. Obtaining realistic test data is particularly problematic for statistical and analytics software, which needs such datasets for testing to be meaningful. Using production data raises many privacy challenges and is no longer deemed an acceptable risk by many organizations. This is where synthetic data comes in, providing large amounts of testing data generated from production data.
SDG methods have recently become powerful enough that the generated datasets are good proxies for the original data, capturing both strong and subtle signals. This means it is not necessary to know in advance how the data will be used in order to build useful synthetic datasets.
The field is moving so rapidly that Gartner predicts that by 2024, 60% of the data used for the development of AI and analytics solutions will be synthetically generated and that the use of synthetic data will halve the volume of real data needed for machine learning.
But how reliable is it… how close is it to the real thing?
Empirical evaluations from a number of studies (for example, here and here) show that synthetic data models the patterns in real data very well (in general, prediction accuracy tends to be within 2% to 5% of the original data). There is always a balancing act between how accurate the model needs to be and how close the synthetic data is to the original data: the classic trade-off between data utility and data privacy. The synthetic data is not going to look exactly like the real data, because if it did, we would essentially be replicating the original data, and that would raise privacy concerns. What matters is that the synthetic data offers very similar results and leads you to the same conclusions.
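One common way to test the "same conclusions" claim is to train the same model once on real data and once on synthetic data, then compare accuracy on held-out real data (often called "train on synthetic, test on real"). The sketch below uses a toy per-class Gaussian mixture as the synthetic generator and simulated data throughout; it illustrates the evaluation pattern, not any particular vendor's method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Simulated "real" classification data (stand-in for a sensitive dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Toy synthetic generator: one Gaussian mixture per class fitted to the
# real training data (real SDG tools use far more sophisticated models)
X_parts, y_parts = [], []
for label in np.unique(y_tr):
    rows = X_tr[y_tr == label]
    gm = GaussianMixture(n_components=2, random_state=0).fit(rows)
    samples, _ = gm.sample(len(rows))
    X_parts.append(samples)
    y_parts.append(np.full(len(samples), label))
X_syn, y_syn = np.vstack(X_parts), np.concatenate(y_parts)

# Compare held-out accuracy: trained on real vs. trained only on synthetic
acc_real = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
acc_syn = LogisticRegression(max_iter=1000).fit(X_syn, y_syn).score(X_te, y_te)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_syn:.3f}")
```

When the synthetic data is a good proxy, the two accuracies land close together, which is exactly the 2% to 5% gap the studies above report for stronger generators.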
An important factor is that you don't have to take any of this on faith. Both the utility and the privacy risks of any specific dataset can be evaluated. It is all testable against commonly established standards, and what we are seeing on the privacy side is that re-identification risks for synthetic data in practice fall well below widely established thresholds for what is deemed non-identifiable data. In other words, there is a very low risk of matching the information back to any individual.
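One simple empirical privacy check, among several used in practice, is distance to closest record: compare how close each synthetic record sits to its nearest real record against how close real records sit to each other. The sketch below uses simulated data for both sides (in practice the synthetic side would come from a fitted generator), and is only one illustrative metric, not a complete risk assessment:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy data: "real" records and "synthetic" records drawn from the same
# distribution (in practice the synthetic data comes from a fitted model)
real = rng.normal(size=(1000, 5))
synthetic = rng.normal(size=(1000, 5))

nn = NearestNeighbors(n_neighbors=2).fit(real)

# Distance from each synthetic record to its closest real record
dcr_syn = nn.kneighbors(synthetic, n_neighbors=1)[0].ravel()

# Baseline: distance from each real record to its closest *other* real
# record (column 0 is the record itself, at distance zero)
dcr_real = nn.kneighbors(real, n_neighbors=2)[0][:, 1]

# Synthetic records that sit no closer to the real data than real records
# sit to each other are unlikely to be near-copies of actual individuals
print(np.median(dcr_syn), np.median(dcr_real))
```

If the synthetic distances were systematically much smaller than the real-to-real baseline, that would flag records that are near-copies of real individuals and warrant regenerating the data.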
What about privacy regulation?
Evidence to date supports the conclusion that synthetic data is non-identifiable information, suggesting it would in principle fall outside of privacy regulation. Generating it is considered further processing of personal data, which is deemed permissible: explicitly so in some jurisdictions, and implicitly or in practice in others.
Even with this, just like for other types of data uses, transparency and oversight measures are important. Some form of ethics review on data uses would be strongly advised to ensure they are not discriminatory or potentially harmful to data subjects.
Indications from privacy regulators on the question of synthetic data are very positive, as they begin to recognize the many economic and societal opportunities. The Commission nationale de l'informatique et des libertés (CNIL) in France has, for example, recently approved an SDG method as an acceptable form of anonymization. This is quite telling, and suggests the CNIL sees data synthesis as more reliable than other methods, given that such a designation has not been given to other anonymization techniques. We anticipate seeing more like this from regulators around the world as the technology matures, awareness grows, and evidence accumulates.
The road ahead
CPOs know all too well that regulation, cost and other factors can, at times, impede the great many societal and organizational benefits of sharing data.
Synthetic data is a PET that avoids many of the privacy challenges that exist under current regulations when it comes to sharing data, both within and across borders. It is also in many ways easier to apply than traditional de-identification methods, and its high level of automation makes it more accessible, since it does not depend on difficult-to-find skill sets. It has the potential to democratize access to data, allowing data custodians to generate more value from their datasets than they do today.
Another advantage is that in the situation of a breach, the impact is lessened if synthetic data is involved instead of personal or identifiable information. It’s also a great example of how, when it’s done well, AI and machine learning can help to improve privacy.
While there will always be multiple PETs available for organizations to select from, SDG is expected to be an important component of that PET toolbox moving forward. So it's a good time for CPOs to become more familiar with what it is and to consider how it might help them in their work.