Anonymizing data is hard.
Data that is properly anonymized, and thereby anonymous, does not fall under the GDPR (recital 26 states: “This Regulation does not therefore concern the processing of such anonymous information”). This alone should be reason enough to want to familiarize ourselves with data analysis and anonymization!
If you already know the difference between data masking, pseudonymization, and anonymization, then you are ahead of the curve and know more than most. Just in case: the first two differ from anonymization in that the data still contains observations that can be tied to an individual. In a properly anonymized dataset this is no longer the case. Unfortunately, this knowledge alone won’t get you far if you want to anonymize and release a dataset.
Off-the-shelf solutions exist for anonymization. They do, however, require that you are able to categorize the dimensions (columns) in your dataset as identifying or not identifying, so that the algorithm knows what information to redact. For some classes of data this is straightforward. A first and last name taken together are clearly identifying, and so is an individual’s social security number or bank account number. In other cases, it is not quite so straightforward. Is the time at which someone left their home in the morning, or the color of someone’s favorite sweater, identifying? You might say it is not, or you might say it is. Either answer could be correct. It depends entirely on what additional knowledge the person who is given access to the dataset has. There are real-life examples of this. In a famous one from the U.S. in the 1990s, an “anonymized” medical dataset was released in Massachusetts. By joining it with a publicly available voter registration list on seemingly non-identifying attributes, a researcher was able to find the medical records of the governor of Massachusetts!
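To make the idea of combining datasets concrete, here is a minimal sketch of such a linkage in Python. Everything in it is an assumption made for illustration: the records, the names, and the choice of ZIP code, birth date, and sex as the shared columns are invented and are not taken from any real release.

```python
# Hypothetical records; every name and value here is invented for illustration.
medical = [
    {"zip": "02138", "birth_date": "1950-03-14", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1982-11-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1982-11-02", "sex": "F", "diagnosis": "migraine"},
]
voters = [
    {"name": "Alex Example", "zip": "02138", "birth_date": "1950-03-14", "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "birth_date": "1982-11-02", "sex": "F"},
]

# Columns that look harmless on their own but appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link(released, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for people who match exactly one released row."""
    by_key = {}
    for row in released:
        by_key.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = {}
    for person in auxiliary:
        candidates = by_key.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the person is re-identified
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(medical, voters)
print(reidentified)  # {'Alex Example': 'hypertension'}
```

The second voter matches two medical rows, so the link stays ambiguous for them. Only the person whose combination of values is unique gets exposed, which is exactly why uniqueness across columns matters.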
Since we can never fully predict what additional knowledge someone might have, the only seemingly safe option we are left with for data analysis and anonymization is the rather pessimistic assumption that every dimension could be, and therefore has to be treated as, identifying. If we take this route with one of the standard anonymization techniques, such as k-anonymity or l-diversity, data quality suffers dramatically. The richer the dataset is (the more dimensions it contains), the worse the result becomes in terms of its predictive capabilities.
Why does an anonymized dataset get worse the more columns it contains?
Let’s make the simplifying assumption that a dataset is anonymous as soon as every individual in it shares the same values across all columns with at least K-1 others, so that each combination of values occurs at least K times (in real life this alone is not sufficient). In this example we will use K=2. Suppose the dataset contains the dimensions age and gender, as in the following table:
Then one could anonymize it by generalizing the age. The result could then look something like this:
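As a rough illustration of this step, the sketch below checks the simplified K=2 rule before and after generalizing age into ten-year ranges. The four rows are invented for this example and are not the values from the tables in this article. The helper names `is_k_anonymous` and `generalize_age` are likewise made up.

```python
from collections import Counter

# Hypothetical rows (age, gender); the values are invented for illustration.
rows = [(23, "F"), (27, "F"), (34, "M"), (38, "M")]


def is_k_anonymous(records, k):
    """True if every distinct combination of values occurs at least k times."""
    return all(count >= k for count in Counter(records).values())


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


generalized = [(generalize_age(age), gender) for age, gender in rows]

print(is_k_anonymous(rows, k=2))         # False: every exact age is unique
print(is_k_anonymous(generalized, k=2))  # True: each (range, gender) pair occurs twice
```

The price is visible right away: after generalization we only know the decade of someone's age, not the age itself.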
Now let’s add in another category, namely the car brand someone owns:
To anonymize this richer dataset the data needs to be generalized even further. One could do it in multiple ways, but one example would be:
The more dimensions you add, the more information you have to redact to keep it anonymous, and the worse the analytical capabilities the dataset offers become.
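The effect can be sketched in a few lines of Python. The rows are again hypothetical, with age already generalized, and `rows_violating_k` is a made-up helper that counts how many rows fall into a group smaller than K once a given set of columns is taken into account.

```python
from collections import Counter

# Hypothetical, already age-generalized rows: (age range, gender, car brand).
rows = [
    ("20-29", "F", "Volvo"),
    ("20-29", "F", "Fiat"),
    ("30-39", "M", "Volvo"),
    ("30-39", "M", "Volvo"),
]


def rows_violating_k(records, columns, k):
    """Count rows whose value combination over `columns` is shared by fewer than k rows."""
    projected = [tuple(record[c] for c in columns) for record in records]
    counts = Counter(projected)
    return sum(1 for combination in projected if counts[combination] < k)


for columns in [(0,), (0, 1), (0, 1, 2)]:
    print(columns, rows_violating_k(rows, columns, k=2))
# (0,) 0
# (0, 1) 0
# (0, 1, 2) 2
```

With age range and gender alone every row has a twin. As soon as the car brand is included, two rows become unique and have to be generalized or suppressed, and every further column makes this more likely.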
At this point some people may decide to shy away from anonymization as a solution for data analysis. If the alternatives are an insufficiently anonymized dataset and the accompanying risk, or data without predictive capabilities, the better option might be to do nothing at all.
Thankfully, modern research offers a solution for data analysis and anonymization. Newer approaches to anonymization make it possible to treat all dimensions as potentially identifying while retaining sufficient data quality. This class of approaches is called dynamic, or query-by-query, anonymization. The leap these approaches take is that the anonymization does not happen ahead of time, but rather at the moment an analyst asks a particular query of the data. As a result, only the dimensions that are part of the query need to be considered for the anonymization, and the rest can be ignored. While they have the potential to offer strong anonymity and high predictive capabilities, these solutions come at a price. Where traditional anonymization results in a dataset of the kind an analyst is used to working with, dynamic anonymization requires that analysts change the way they operate and query a dynamic anonymization engine instead. Given the alternatives, most analysts deem this a worthwhile trade-off.
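The following toy sketch shows only the query-by-query idea, namely that the protection is applied to the columns a query actually touches. It is an assumption-laden illustration: the table, the `count_by` function, and the simple "drop small groups" rule are invented here, and the rule on its own is not sufficient for real anonymity. It is not how any particular product works.

```python
from collections import Counter

# Hypothetical raw table kept behind the engine; analysts never see these rows.
TABLE = [
    {"age": 23, "gender": "F", "brand": "Volvo"},
    {"age": 27, "gender": "F", "brand": "Fiat"},
    {"age": 34, "gender": "M", "brand": "Volvo"},
    {"age": 38, "gender": "M", "brand": "Volvo"},
]


def count_by(columns, min_count=2):
    """Count rows grouped by `columns`, dropping groups smaller than min_count."""
    counts = Counter(tuple(row[c] for c in columns) for row in TABLE)
    return {group: n for group, n in counts.items() if n >= min_count}


by_gender = count_by(["gender"])
by_gender_brand = count_by(["gender", "brand"])

print(by_gender)        # {('F',): 2, ('M',): 2}
print(by_gender_brand)  # {('M', 'Volvo'): 2}
```

A query on gender alone is answered in full even though the table also holds exact ages, because age is not part of that query. Only when more columns are combined do groups become small enough to be withheld.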
Two well-known approaches to dynamic anonymization are Differential Privacy and Diffix. Differential Privacy originated at Microsoft Research and has become quite popular in the research community. Diffix is a newer approach. It originated at the Max Planck Institute for Software Systems and was developed as an alternative to Differential Privacy that would be viable for use in a commercial setting. Today there are offerings available for both approaches. I highly recommend you evaluate whether they can help you on your road to anonymization.
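For a feel of what such an engine does to an answer, here is a minimal sketch of the textbook Laplace mechanism often used to introduce Differential Privacy, applied to a simple count. It is a classroom illustration rather than a description of any commercial offering, and it says nothing about how Diffix works. The function name `noisy_count` and the parameter values are assumptions.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Return the count plus Laplace noise with scale 1/epsilon.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only to make the sketch reproducible
answers = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(5)]
print(answers)  # five values scattered around 100
```

A smaller epsilon means more noise and stronger protection, while a larger epsilon gives more accurate but less protected answers. Choosing that value is part of the trade-off an analyst accepts with this kind of system.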
Personally, I think the introduction of the GDPR is a great thing. Our personal data has never needed more protection than it does today. The introduction of the new laws does, however, require that we adapt the way we work with data. Anonymization is a tool that, when applicable, can make this work a lot easier.