Innovations in data anonymization
The issues above may make it sound like anonymization is pointless. Take, for example, the infamous “anonymous” taxi data set from New York City, which was quickly re-identified after its release in 2014. CNIL, the French Data Protection Authority, wanted to demonstrate that a true anonymization of this data set is indeed possible. The result, however, is very inaccurate and therefore, unfortunately, unsuitable for most applications.
However, there is a way of anonymizing this data without losing its utility. Using Diffix, an approach we at Aircloak have developed together with the Max Planck Institute for Software Systems, we can create a data set that is just as secure and yet significantly more useful.
In the figure below, the left-hand image shows the NYC taxi data set anonymized by classic approaches. The data is clearly of little use, as it now only indicates departure points by postal code area. The right-hand image shows the same data set anonymized using Diffix. As can be seen, the data is far more detailed and hence has greater utility.
Comparing classic anonymization with Diffix. These figures come from the article “Can Anonymous Data Still Be Useful? Part Deux” by Paul Francis.
In contrast to classic anonymization, Diffix does not anonymize the entire data set before the analysis. Instead, it anonymizes each database query dynamically and adds carefully tailored noise to the answer. The approach is already being used successfully in the banking industry to evaluate transaction data, and is also very helpful in the healthcare industry for anonymizing patient data. Analyzing such anonymized data sets offers significant potential for improved product development, better marketing and more focused business intelligence. Using descriptive statistics, it is possible to safely determine the median income of a particular group of users, even broken down by bank. Correlations and regressions are also possible: Do people with higher incomes spend more on insurance? When are people most likely to apply for a loan?
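To make the idea of anonymizing at query time more concrete, here is a minimal Python sketch. It is not Diffix's actual algorithm: it only illustrates the general pattern of answering an aggregate query with added noise and refusing to answer when too few users are involved. The table contents, the threshold MIN_USERS, the noise level NOISE_SD and the function name noisy_count are all assumptions made for this illustration.

```python
import random

# Hypothetical in-memory "database": one row per user (illustrative values only).
ROWS = [
    {"user_id": i, "bank": "A" if i % 2 == 0 else "B", "income": 30000 + 500 * i}
    for i in range(200)
]

MIN_USERS = 5   # assumed threshold: answers covering fewer users are suppressed
NOISE_SD = 2.0  # assumed standard deviation of the noise added to each count


def noisy_count(rows, predicate, rng=random):
    """Answer a count query with noise, leaving the stored rows untouched."""
    true_count = sum(1 for row in rows if predicate(row))
    if true_count < MIN_USERS:
        # Too few users match: return no answer rather than risk
        # singling out an individual.
        return None
    # Perturb the exact count with Gaussian noise and round to an integer.
    return max(0, round(true_count + rng.gauss(0, NOISE_SD)))


if __name__ == "__main__":
    # A query over many users yields a value close to the true count of 100.
    print(noisy_count(ROWS, lambda row: row["bank"] == "A"))
    # A query that targets a single user is suppressed.
    print(noisy_count(ROWS, lambda row: row["user_id"] == 3))
```

The analyst still sees detailed, query-specific results, while no exact figure about a small group ever leaves the system. A production system has to do considerably more than this, for instance to stop the noise from being averaged away by repeating the same query; the sketch does not attempt that.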
Data treasure despite data protection
Whether the comparison between data and oil is adequate or not, a competitive company must develop new data-driven business models and processes. Using modern data protection approaches is not only a legal requirement; it can also help improve your whole data strategy, allowing you to monetize your data without risking data security. While naïve and classic anonymization is difficult to achieve and often does not provide good results, modern technologies like Diffix offer the best of both worlds. The GDPR has shaken up the data anonymization industry, and over the coming years we can expect to see more advanced tools being developed by many suppliers.
The Who’s Who of Data Protection Technologies
In the anonymization process, data is changed in such a way that individuals can no longer be identified, or can be identified only with disproportionately high effort. This is achieved by grouping data points together or by adding noise (for example, incorrect data) to the data set. Anonymized data is not subject to data protection laws.
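The grouping approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample records and the helper functions generalize and smallest_group are hypothetical, and coarsening values in this way does not by itself guarantee that a data set counts as anonymous.

```python
from collections import Counter


def generalize(record):
    """Coarsen identifying attributes so that many people share the same values."""
    decade = (record["age"] // 10) * 10
    return {
        "postal_area": record["postal_code"][:2] + "***",  # keep only the first two digits
        "age_range": f"{decade}-{decade + 9}",              # replace exact age with a range
        "diagnosis": record["diagnosis"],
    }


def smallest_group(records):
    """Size of the smallest group of records sharing the same coarsened attributes."""
    groups = Counter((r["postal_area"], r["age_range"]) for r in records)
    return min(groups.values())


# Made-up sample records for illustration.
PATIENTS = [
    {"postal_code": "67655", "age": 34, "diagnosis": "flu"},
    {"postal_code": "67663", "age": 38, "diagnosis": "asthma"},
    {"postal_code": "67657", "age": 31, "diagnosis": "flu"},
]

generalized = [generalize(p) for p in PATIENTS]

if __name__ == "__main__":
    print(generalized[0])
    print(smallest_group(generalized))
```

The larger the smallest group, the harder it is to single out one person, but the less detail remains for analysis. This is the trade-off between security and utility described above.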
In the case of pseudonymization, direct identifiers of a data record are deleted and replaced with pseudonyms (for example, a telephone number could be replaced with random digits, or a user ID could be stored instead of a real name). This type of processing preserves much of the value of the data, but is not nearly as secure as anonymization. Therefore, pseudonymous data continues to be considered personal data under the GDPR.
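As a rough illustration of pseudonymization, the following Python sketch replaces the direct identifiers of a record with a keyed hash. The field names, the key SECRET_KEY and the function pseudonymize are assumptions made for this example; it shows one common way of generating pseudonyms, not a prescribed method.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(record, key=SECRET_KEY):
    """Drop direct identifiers and attach a stable pseudonym derived from the name."""
    digest = hmac.new(key, record["name"].encode("utf-8"), hashlib.sha256).hexdigest()
    # Copy every field except the direct identifiers.
    result = {k: v for k, v in record.items() if k not in ("name", "phone")}
    result["pseudonym"] = digest[:16]  # shortened for readability
    return result


if __name__ == "__main__":
    row = {"name": "Jane Doe", "phone": "555-0100", "income": 52000}
    print(pseudonymize(row))
```

The same person always receives the same pseudonym, so records can still be linked and analyzed per user. That is also the weakness: anyone who holds the key, or who can match the remaining attributes against other sources, may be able to re-identify the person.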
1 This is a simplified representation of anonymization. Functional algorithms are able to produce anonymized data with significantly less alteration. But the basic problem remains the same.