Companies are stuck between opposing forces. Data privacy pressures are growing, while the adoption of data-driven technologies is accelerating. White hat privacy officers often find themselves in the way of innovation, denying data access requests or demanding the use of anonymization tools that destroy data utility. Others turn a blind eye and let massive AI models be built on biased, radioactive customer data with little or no oversight. Time and time again, we see production data in test environments at companies where privacy is shunned in favor of growth. We can only hope that all those Revolut accounts won’t end up in the wrong hands, except we can be almost certain that they will.
Privacy-enhancing technologies (PETs), such as homomorphic encryption, AI-generated synthetic data, and federated learning, are here to save the day. Unlike traditional data anonymization techniques, PETs approach data differently: they extract the value while leaving the sensitive information behind. Instead of masking fields in an attempt to escape the privacy pressure, PETs typically transform the data into something entirely different that is nevertheless statistically or computationally equivalent for the task at hand.
For example, homomorphic encryption allows data users to run computations on encrypted data without ever decrypting the original. The idea is nothing new, having first been proposed in the late 1970s, but it is the recent growth in computing power that has made it a viable option to consider.
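To make "computing on encrypted data" concrete, here is a minimal sketch of the Paillier scheme, which is additively homomorphic only (it is not fully homomorphic encryption). The tiny hard-coded primes and the helper names `encrypt`, `decrypt`, `add_encrypted` and `scale_encrypted` are assumptions chosen for illustration. Nothing here is secure or production-ready, and a real project would rely on a vetted library and far larger keys.

```python
import math
import secrets

# Toy Paillier key material. The primes are tiny and hard-coded for readability
# only (an assumption for this sketch); real keys are 2048 bits or more.
P, Q = 1789, 1861
N = P * Q                      # public modulus
N_SQ = N * N
G = N + 1                      # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAM, -1, N)           # private: inverse of LAM mod N (Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N with the public key (N, G)."""
    if not 0 <= m < N:
        raise ValueError("message out of range")
    while True:
        r = secrets.randbelow(N - 1) + 1  # random value in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    return (c1 * c2) % N_SQ


def scale_encrypted(c: int, k: int) -> int:
    """Raising a ciphertext to the power k multiplies the plaintext by k (mod N)."""
    return pow(c, k, N_SQ)


if __name__ == "__main__":
    total = add_encrypted(encrypt(1200), encrypt(345))
    print(decrypt(total))                            # 1545
    print(decrypt(scale_encrypted(encrypt(7), 6)))   # 42
```

The party doing the addition or scaling only ever handles ciphertexts; only the holder of the private values can read the result. Encryption is randomized, so the same number encrypts to a different ciphertext each time.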
Synthetic data, too, is a product of recent progress, in this case the advances made in deep generative AI over the past five years. Synthetic images were the first breakthrough, and by now they are widely used in training computer vision software. Tabular synthetic data is the next frontier, and it is set to change how time-series and behavioral data, which is extremely hard to anonymize, gets used. Datasets of this type, like patient journeys or credit card transactions, are almost impossible to anonymize because the chronological order of the datapoints is itself identifying. Taxi rides are like fingerprints: a handful of trips can be enough to single out one person, and new types of linkage attacks connecting publicly available data with leaked time-series datasets are becoming more and more prevalent. The advantage of synthetic behavioral data is that all datapoints are artificially generated, while the statistical properties, which carry the value for analytics and AI, are preserved.
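To see why pseudonymized time-series data behaves like a fingerprint, consider a deliberately simplified sketch of a linkage attack. Every record, name, zone and threshold below is made up for illustration, and `link` and `reidentify` are hypothetical helpers, not a real attack tool. The point is only that matching a few public observations against "anonymized" trips by time and place can be enough to recover who is behind a pseudonym.

```python
from datetime import datetime, timedelta

# Hypothetical leaked ride log: names replaced by pseudonyms, trips left intact.
leaked_rides = [
    {"rider": "u_17", "pickup": "2023-03-04 23:41", "zone": "Airport"},
    {"rider": "u_17", "pickup": "2023-03-09 08:02", "zone": "Midtown"},
    {"rider": "u_52", "pickup": "2023-03-04 23:55", "zone": "Harbor"},
    {"rider": "u_52", "pickup": "2023-03-10 18:30", "zone": "Midtown"},
]

# Hypothetical public observations, e.g. geotagged social media posts.
public_sightings = [
    {"name": "Alice Example", "seen": "2023-03-04 23:38", "zone": "Airport"},
    {"name": "Alice Example", "seen": "2023-03-09 08:00", "zone": "Midtown"},
]


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def link(rides, sightings, window=timedelta(minutes=10)):
    """Count, per (name, pseudonym) pair, sightings that match a ride
    in the same zone within the given time window."""
    scores = {}
    for s in sightings:
        for r in rides:
            same_zone = s["zone"] == r["zone"]
            close = abs(parse(s["seen"]) - parse(r["pickup"])) <= window
            if same_zone and close:
                key = (s["name"], r["rider"])
                scores[key] = scores.get(key, 0) + 1
    return scores


def reidentify(rides, sightings, min_matches=2):
    """Return name -> pseudonym for pairs with at least min_matches matches."""
    scores = link(rides, sightings)
    return {name: rider for (name, rider), n in scores.items() if n >= min_matches}


if __name__ == "__main__":
    print(reidentify(leaked_rides, public_sightings))
    # {'Alice Example': 'u_17'}
```

Once a pseudonym is linked, every other trip under it is exposed as well. With fully synthetic records there is no real individual behind any row for such a join to recover, which is the property the paragraph above relies on.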
The options are there, and the more data-savvy companies, typically in insurance, finance and healthcare, are already well on their way to operationalizing them. Training AI and machine learning models on data destroyed by traditional anonymization tools doesn’t work. Training them on customer data straight out of production means excluding every customer who has not given explicit consent, and those customers account for the majority of records. Right now, the European Union is busy working on the new AI Act, which will specifically regulate when and how data can be used for training AI systems. We don’t yet know the details, but we can be sure that the way forward is paved with more prohibitions and less room for unchecked data flows.
A recent report from the European Commission’s Joint Research Centre examined the usefulness of synthetic populations in policy development. Compared with aggregation, a data protection method widely used by governments, synthetic data offers a safer and more useful way to prepare data for sharing across institutions, countries and research teams. Accurate and reliable data intelligence is especially important when it drives population-level healthcare policies.
So why is it that these technologies are not (yet) a part of every data pipeline and tech stack? The issue is that privacy officers know little about privacy-enhancing data science, and data scientists know or care little about privacy. Privacy-enhancing processes within companies, whether directed by internal or external policy, are mission-critical for meeting data protection challenges. Instead of managing data access requests and assessing anonymization tools case by case, companies should set up sound processes and data protection pipelines. We’ve seen companies successfully accelerate the adoption of privacy-enhancing technologies by setting up synthetic data sandboxes internally and data exchange platforms for external data sharing. Data protection decisions shouldn’t be left to the individual data scientist or the lone privacy officer. The goal should be to automate data protection by setting up sound processes, giving the right tools to the right people and connecting the entire operation to company-level KPIs. It’s not enough to convince the legal team and the management. The board has to be convinced too, and citizen data scientists have to be educated to treat data privacy and data ethics as high priorities. But what it really comes down to is making the right tools readily available to the right people at the right time.