Privacy and Anonymity in the Age of Big Data

Under most privacy or data protection laws, if you’re not dealing with personal data, you’re outside the scope of the legislation. However, that line is no longer as clear as we might once have thought. Privacy and anonymity get complicated in the age of big data.

We always have to keep an eye on the state of the art in anonymisation, or de-identification. We also have to look at the techniques and technologies that can be used to re-identify or victimise individuals whose information might have been compromised.

Ten years ago, scientists or engineers said, “Yes. This information is de-identified. We don’t have to worry about it.” Now, ten years later, that exact same data sitting on a server is in fact identifiable, because the line has continued to move.

In Canada, privacy laws hinge entirely on personal information. In other places, it’s personal data.

The law is also intended to be technologically neutral. The Canadian statute was written in 1996 and it was significantly influenced by the European directive at that time. We have 10 principles that are mapped almost directly from the 8 principles of the OECD guidelines, but the law is rooted in principles rather than strict rules.

In Canada there’s no such thing as legitimate purpose. Organisations have to provide notice and obtain consent for all collection, use, and disclosure of personal information.

Privacy and anonymity – Where to draw the line?

In Canada, our definition is as simple as ‘information about an identifiable individual’. If it’s information about an identifiable individual, it is personal information. The regulators in Canada have taken the view that if there is a risk that you can connect an IP address with an individual, then you need to treat it as personal information. The question is: where do we draw the line?

The EU GDPR is going to be incredibly influential in the way that people think about these sorts of things. Under the EU GDPR, whether data is personal data depends on whether the individual can be identified, directly or indirectly. Does it include a national social insurance number, a social security number, or some other sort of number? Does it include a driver’s license number, a passport number, or something else like that? What are the implications for privacy and anonymity, especially in the age of big data?

Importance of pseudonymisation for privacy and anonymity

The EU GDPR also introduces the concept of pseudonymisation for addressing privacy and anonymity. Pseudonymisation is a process that alters personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information. It is important to differentiate pseudonymisation from anonymisation, which alters data to the extent that the individual can no longer be identified at all.
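One way to picture pseudonymisation is a keyed hash that replaces a direct identifier with a stable token. The sketch below is only an illustration under stated assumptions: the key, the field names and the sample records are hypothetical, and HMAC-SHA256 is just one possible technique, not something the GDPR prescribes. The secret key plays the role of the "additional information" that has to be kept separately.

```python
import hashlib
import hmac

# Hypothetical secret key: the "additional information" that must be stored
# separately from the pseudonymised data set.
SECRET_KEY = b"example-key-kept-in-a-separate-system"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identifier using a keyed hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records containing a direct identifier (an email address).
records = [
    {"email": "alice@example.com", "rating": 5},
    {"email": "bob@example.com", "rating": 3},
    {"email": "alice@example.com", "rating": 4},
]

# Replace the identifier with its pseudonym; the same person always maps to
# the same token, so analysis per subject is still possible.
pseudonymised = [
    {"subject": pseudonymise(r["email"]), "rating": r["rating"]} for r in records
]
```

Anyone holding the key can recompute the mapping and re-attribute the records, which is why this counts as pseudonymised rather than anonymised data.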

The EU GDPR states that when it comes to the obligations for safeguarding information and protecting the privacy and anonymity of individuals, pseudonymisation is important. Organisations need to think about this identifiability continuum and think about how risk maps onto that and the risk to the individual.

In the age of big data, data scientists are using mathematical techniques to blur information and data sets, so that it becomes less likely that you could in fact identify individuals.

From the regulator’s point of view, the ability to positively identify one person is in fact the threshold that they’re concerned about. This is where the issue of consent comes in. What is interesting about the Canadian principles-based system is that the form of the consent needs to be based on the sensitivity of the information, and the less identifiable it is, the lower down on the sensitivity continuum you are.
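As a rough illustration of what blurring can look like, the sketch below uses generalisation: exact values are replaced with coarser ones. This is one assumed technique among several (adding statistical noise is another), and the field names, band width and postal code handling are hypothetical choices made for the example.

```python
def generalise_age(age: int, band: int = 10) -> str:
    """Replace an exact age with a band such as '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


def generalise_postal_code(code: str, keep: int = 3) -> str:
    """Keep only the first few characters of a postal code."""
    return code.replace(" ", "")[:keep].upper()


def blur(record: dict) -> dict:
    """Return a coarser copy of a record; the input is not modified."""
    return {
        "age_band": generalise_age(record["age"]),
        "area": generalise_postal_code(record["postal_code"]),
        "purchase": record["purchase"],
    }


# Hypothetical customer record.
customer = {"age": 37, "postal_code": "b3h 4r2", "purchase": "headphones"}
blurred = blur(customer)
```

The coarser the bands, the more people share the same values and the harder it is to single anyone out, at the cost of less precise analysis.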

The minimisation principle

Under the minimisation principle, an organisation should only collect the personal information that’s reasonably necessary in the circumstances. Will de-identified or fuzzily modified information be sufficient? This is a challenge to the marketing people. If you’re going to be using the data for one-on-one marketing based on the identity of the individual, then perhaps fuzzily identifiable or de-identified information is not going to work. But using identifiable information calls for a stronger form of consent, and that consent needs to be informed consent.

One big problem is that consumers don’t necessarily understand the nuances of data science, of data mining, all these different uses that information can be put to. In the age of big data, companies need to think carefully about how they are going to explain to consumers how they are using their data.

At the very least organisations will always need to be transparent. If they fit anywhere along that continuum of re-identifiability, they’re going to need to be transparent with their customers and also with the regulators.

In the age of big data, organisations also need to think carefully about the value of even de-identified data – just how determined are hackers? The example of Netflix is a good one.

Netflix, which was looking to improve its recommendation engine, took its data set and stripped off what it thought were all the identifiers. Researchers got their hands on this data and were able to match it up against users of the IMDB, the Internet Movie Database. If you liked Dances With Wolves, and you gave it five stars on Netflix, you’re likely the guy who gave it five stars on IMDB. The company thought that they were dealing with de-identified information, but in fact that was far from the case.
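The matching idea can be sketched in a few lines. The example below is a toy illustration only: the data is invented, the function name and threshold are hypothetical, and the actual research relied on a more sophisticated statistical approach than this exact-match count.

```python
from collections import Counter

# Hypothetical "de-identified" release: names replaced by numeric IDs.
released = [
    (101, "Dances With Wolves", 5),
    (101, "Heat", 4),
    (101, "Fargo", 5),
    (102, "Dances With Wolves", 5),
    (102, "Titanic", 2),
]

# Hypothetical public reviews posted under recognisable user names.
public = [
    ("j.smith", "Dances With Wolves", 5),
    ("j.smith", "Heat", 4),
    ("j.smith", "Fargo", 5),
    ("m.jones", "Titanic", 4),
]


def link(released, public, min_matches=3):
    """Pair anonymous IDs with public names that share enough identical ratings."""
    # Index public reviews by (title, stars) for quick lookup.
    public_index = {}
    for name, title, stars in public:
        public_index.setdefault((title, stars), set()).add(name)

    # Count how many ratings each (anonymous ID, public name) pair has in common.
    overlap = Counter()
    for anon_id, title, stars in released:
        for name in public_index.get((title, stars), ()):
            overlap[(anon_id, name)] += 1

    return {anon_id: name for (anon_id, name), n in overlap.items() if n >= min_matches}


matches = link(released, public)
```

In this toy data, ID 101 shares three identical ratings with one public profile and is linked to it, while ID 102 shares only one and stays unlinked. A single matching rating proves little; a handful of them together can act like a fingerprint.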

In a world where individuals are sharing a huge amount of information, yet increasingly concerned about their privacy and anonymity, we need to think about how it all fits together.

Big data challenges for privacy and anonymity

Privacy and anonymity are being challenged by new data analysis techniques in the age of big data. Organisations need to understand that what is anonymous today may not be anonymous tomorrow. Personal information means information about an identifiable individual. What does that actually mean? Can you identify the particular individual from the data? Can you identify the particular individual from the data combined with other data that is reasonably available? Where is the line between personal information and non-personal information? Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.
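One rough way to ask "can a particular individual be picked out?" is to count how many records share the same combination of indirectly identifying attributes. The sketch below assumes that approach, which is commonly known as a k-anonymity check and is not named in the text above; the records and attribute names are hypothetical.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    values for the given attributes (0 if there are no records)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)


# Hypothetical data set with direct identifiers already removed.
records = [
    {"age_band": "30-39", "area": "B3H", "diagnosis": "A"},
    {"age_band": "30-39", "area": "B3H", "diagnosis": "B"},
    {"age_band": "40-49", "area": "B3H", "diagnosis": "A"},
]

k_with_age = smallest_group(records, ["age_band", "area"])
k_area_only = smallest_group(records, ["area"])
```

A result of 1 means at least one person is unique on those attributes and could be singled out by anyone who knows them from another source. The answer also depends on which attributes an outsider can reasonably obtain, which is exactly where the line keeps moving.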

Collection, use and disclosure are all important. Can the organisation clearly articulate why it wants the data and what it will be using it for? Does the organisation have the right framework in place to secure and govern this data? Has the organisation looked over the horizon to consider future use? Get consent now. The same techniques that make big data so useful increase the risk of re-identification.

Is the notion of consent quaintly out of date if consumers can’t understand what we’re doing with data? Privacy ethics? Organisations need to look beyond the current privacy and anonymity horizon.