Overcoming Challenges in Privacy Engineering

Privacy engineering has become a top architecture challenge, one that I have personally experienced as an architect in a number of medium and large companies. Privacy legislation has been multiplying around the world in the last few years, becoming both more extensive and more complex.

Compliance with local, regional, and international data privacy laws is now a vital business concern, and there’s immense pressure on R&D to effectively implement privacy engineering in ways that bake in compliance.

However, I’ve repeatedly found that privacy engineering is easier said than done. R&D teams, particularly at medium and large enterprises, face serious obstacles that they have to overcome in order to meet business compliance requirements and privacy standards.

The never-ending task of data mapping

In an ideal world, every data-based digital service, program, or algorithm would be developed using privacy by design principles, so that the entire system simplifies data management and makes it easy to erase or restrict access to specific datasets.

But in the real world, it’s all too common for developers to build solutions according to the needs of the product, often under time pressure. Privacy by design protocols frequently fall by the wayside.

As a result, privacy engineering typically begins with data mapping, a short phrase that conceals an enormous amount of work. Data mapping involves discovering and assessing all your data flows. It includes surveying and listing every service, database, storage, type of storage, and type of database that make up your data processes. You need to understand all the relationships between all the moving parts in your possession, and chart them in a way that enables you to refer back to them with ease.
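To make the idea of charting these relationships concrete, here is a minimal sketch of a data map kept as a small in-memory graph. The class name `DataMap`, the service and store names, and the field labels are all hypothetical, chosen only for illustration; a real inventory would cover far more detail (storage types, owners, third parties).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataMap:
    """A tiny in-memory inventory of data stores and the services that use them."""

    # store name -> personal-data fields it holds (hypothetical labels)
    stores: dict[str, set[str]] = field(default_factory=dict)
    # service name -> names of the stores it reads from or writes to
    flows: dict[str, set[str]] = field(default_factory=lambda: defaultdict(set))

    def add_store(self, name: str, personal_fields: set[str]) -> None:
        self.stores[name] = set(personal_fields)

    def add_flow(self, service: str, store: str) -> None:
        # Refuse flows to stores that were never inventoried.
        if store not in self.stores:
            raise KeyError(f"unknown store: {store}")
        self.flows[service].add(store)

    def services_touching(self, personal_field: str) -> list[str]:
        """Return the services connected to any store that holds the field."""
        return sorted(
            service
            for service, used in self.flows.items()
            if any(personal_field in self.stores[s] for s in used)
        )


# Hypothetical example inventory.
data_map = DataMap()
data_map.add_store("users_db", {"email", "name"})
data_map.add_store("events_bucket", {"ip_address"})
data_map.add_store("metrics_db", set())
data_map.add_flow("signup-service", "users_db")
data_map.add_flow("mailer", "users_db")
data_map.add_flow("analytics-job", "events_bucket")
data_map.add_flow("analytics-job", "metrics_db")

print(data_map.services_touching("email"))  # ['mailer', 'signup-service']
```

Keeping the map as data rather than as a spreadsheet means questions such as "which services can reach email addresses?" become a lookup instead of another round of interviews.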

From my work with many companies, I have seen situations where an entire development team had to be allocated to interview developers from other departments. They would generate and then present enormous Excel files listing entities that could potentially be personal or sensitive data. The developers being interviewed would need to review the file, and typically responded from memory rather than going back and checking the code. This process can easily drift out of alignment with the actual state of the software. In addition, even if the result is 100% correct at that moment (which is unlikely), the findings become obsolete the day after the alignment is done.

Data mapping and inventorying is vital to understanding what systems, tools, and storage units touch your data. Without a comprehensive and clear inventory, you won’t be able to move on to any further privacy engineering tasks. But as you can see, it is also tedious and time-consuming, occupying a great deal of energy and focus.

The proliferation of digital services

One issue that makes data mapping such a challenge in medium to large companies is the sheer volume of code that needs to be checked. Big corporations are constantly rolling out new digital services or updated versions.

Today’s cloud-based, user-friendly machine learning tools make it easy for developers to deploy new data-based algorithms, systems, and workflows. In the organizations where I worked, we had dozens of teams with hundreds of developers, each of them busy with different projects. All these data-touching projects are built in highly complex environments, using infrastructure like Kubernetes or Docker, which makes deployment easy but is tough to explore.

As a result, it’s extremely difficult to monitor, validate, and enforce data privacy best practices, or even to track the path of data through the system.

The hidden trap of legacy code

The bigger the company, the greater the likelihood that there’ll be considerable amounts of legacy code lurking in the depths of the organization’s systems. Very few developers properly understand legacy code, so it’s usually highly opaque.

Some employees might know the connections for some of the lines of code, and some sections might have been replaced more recently, but in general there’s very poor visibility into which services are related to which database, which services are sharing data with which other services, and other aspects of legacy code.

The moving target of live projects

On top of all this, data mapping projects are caught in a tech version of Zeno’s paradox. Most of the projects that are being mapped are live projects, which means that more data, more tables, and more connections are being added on a continual basis. But most data mapping is currently carried out manually.

The map is out of date as soon as it’s completed, because of the speed at which live projects expand. There’s no way that human employees can keep up with the pace at which new data and relationships are added to the project. Many of today’s data maps may be “good enough,” but they are not comprehensive.
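One way to picture this drift is to compare the tables recorded in a data map against the tables that actually exist right now. The sketch below assumes, purely for illustration, that the live system is a SQLite database; the function names and table names are hypothetical, and a real environment would need an equivalent query for each storage technology.

```python
import sqlite3


def live_tables(conn: sqlite3.Connection) -> set[str]:
    """List user tables that currently exist in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return {name for (name,) in rows}


def map_drift(recorded: set[str], conn: sqlite3.Connection) -> dict[str, set[str]]:
    """Compare the recorded map with the live schema."""
    live = live_tables(conn)
    return {
        "unmapped": live - recorded,  # exists in production, missing from the map
        "stale": recorded - live,     # still on the map, no longer in production
    }


# Hypothetical live database with one table added since the last mapping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")

recorded_tables = {"users", "legacy_export"}
drift = map_drift(recorded_tables, conn)
print(drift["unmapped"])  # {'orders'}
print(drift["stale"])     # {'legacy_export'}
```

Run on a schedule, a check like this at least flags when the map has fallen behind, even if classifying the new tables still needs human judgment.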

The enormity of gap analysis

Data mapping can take months or even years to complete, but it’s only the first in a series of privacy engineering challenges. Once we had an inventory of all data flows, we would need to move on to the next tedious and time-consuming job: gap analysis.

Gap analysis involves uncovering holes in the data flow and identifying data privacy issues which we then need to resolve. It requires data engineers to look for details like tables and/or storage that are duplicated, redundant, or no longer in use. It also means seeking out third parties who receive information from your developers and checking that they are only sent information that they are entitled to.
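Two of the checks described above can be sketched as simple set comparisons over the inventory: stores that no service uses any more, and third parties that receive fields they are not entitled to. All names and data below are hypothetical, and the structure of the inventory is an assumption made for the example.

```python
def unused_stores(stores: set[str], flows: dict[str, set[str]]) -> set[str]:
    """Stores that appear in the inventory but that no service reads or writes."""
    used = set().union(*flows.values())
    return stores - used


def overshared_fields(
    sent: dict[str, set[str]], allowed: dict[str, set[str]]
) -> dict[str, set[str]]:
    """For each third party, the fields it receives beyond what it is entitled to."""
    gaps = {}
    for party, fields in sent.items():
        extra = fields - allowed.get(party, set())
        if extra:
            gaps[party] = extra
    return gaps


# Hypothetical inventory produced by data mapping.
stores = {"users_db", "events_bucket", "old_campaign_db"}
flows = {
    "signup-service": {"users_db"},
    "analytics-job": {"events_bucket"},
}
sent_to_third_parties = {
    "email-vendor": {"email", "name"},
    "ads-partner": {"email", "ip_address"},
}
entitlements = {
    "email-vendor": {"email", "name"},
    "ads-partner": {"ip_address"},
}

print(unused_stores(stores, flows))  # {'old_campaign_db'}
print(overshared_fields(sent_to_third_parties, entitlements))  # {'ads-partner': {'email'}}
```

A third party that is missing from the entitlements is treated as entitled to nothing, so every field it receives is reported as a gap.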

We found that it wasn’t unusual for the gap analysis to take just as long as data mapping, and it could be every bit as frustrating to undertake.

The complexity of reviewing and enforcing data regulations

After succeeding at gap analysis, we would need to remove all unused third parties, tables, storage, etc. However, that’s the relatively easy part. The next challenge is to review and enforce all the relevant data regulations, including access controls, encryption for all sensitive data, and some method of de-identification whenever relevant.

Access controls apply to internal users who are authorized to work with some customer data, but are not authorized to view all the data that you store. We would need to provide access to permitted data while restricting it for sensitive data, often within the same table. The process includes building in auditing to check who has access, when they have access, and very often why they have access.
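As a rough illustration of restricting columns within the same table while recording who accessed what and why, consider the sketch below. The roles, column names, and the `read_row` helper are hypothetical; in practice this kind of control is usually enforced in the database or a data access layer rather than in application code.

```python
from datetime import datetime, timezone

# Hypothetical policy: which columns each internal role may see.
COLUMN_POLICY = {
    "support": {"name", "email"},
    "billing": {"name", "card_last4"},
}

audit_log: list[dict] = []


def read_row(row: dict, role: str, user: str, reason: str) -> dict:
    """Return only the columns the role may see, and record the access."""
    allowed = COLUMN_POLICY.get(role, set())  # unknown roles see nothing
    visible = {col: value for col, value in row.items() if col in allowed}
    audit_log.append(
        {
            "user": user,                  # who
            "role": role,
            "columns": sorted(visible),    # what
            "reason": reason,              # why
            "at": datetime.now(timezone.utc).isoformat(),  # when
        }
    )
    return visible


customer = {
    "name": "Ada",
    "email": "ada@example.com",
    "card_last4": "4242",
    "health_notes": "restricted",
}

print(read_row(customer, role="support", user="agent-7", reason="support ticket"))
# {'name': 'Ada', 'email': 'ada@example.com'}
```

The audit entry is written even when the role sees nothing, so attempted access by an unauthorized role is still visible afterwards.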

Encryption goes hand in hand with access controls, because it’s the primary way to protect the data that you are permitted to hold. Our encryption needed to be strong enough to ensure that these boundaries are respected and that nobody can view or use data without authorization.

Some datasets also need de-identification or anonymization. For example, a hospital holds sensitive patient health data. Doctors need to view full patient details for diagnoses, prescriptions, and effective follow-up. Hospital administrators might need access to personal identifying data to help patients schedule appointments, but not to their health data. And researchers developing cures for health conditions may be permitted access to anonymized health data that doesn’t link reactions to drugs, the onset of disease, or the progression of disorders to any patient-identifying data.
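The researcher view in the hospital example could be sketched roughly as follows: direct identifiers are dropped and the birth year is coarsened into a decade. The field names and the choice of generalization are assumptions for illustration only. Removing direct identifiers does not by itself guarantee anonymity, because combinations of the remaining fields can still single out a person.

```python
# Hypothetical set of fields that directly identify a patient.
DIRECT_IDENTIFIERS = {"patient_id", "name", "phone"}


def deidentify(record: dict) -> dict:
    """Drop direct identifiers and generalize the birth year to a decade."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "birth_year" in out:
        decade = out.pop("birth_year") // 10 * 10
        out["birth_decade"] = f"{decade}s"
    return out


patient = {
    "patient_id": "P-1001",
    "name": "Jane Doe",
    "phone": "555-0100",
    "birth_year": 1984,
    "diagnosis": "asthma",
    "drug_reaction": "none",
}

print(deidentify(patient))
# {'diagnosis': 'asthma', 'drug_reaction': 'none', 'birth_decade': '1980s'}
```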

Bear in mind that these review and enforcement activities need to be carried out continuously. It’s not enough to enforce access controls, encryption, and de-identification once on a database. It’s crucial to build an automated service that takes the raw data, sends it to the right database, and applies the right levels of access, encryption, and anonymization as new data arrives.

The obstacle of data retention and minimization in production

Finally, these challenges are exacerbated because they need to be overcome for programs that are in production. A live program has to respond to a user’s request to opt out of data collection and respect their “right to be forgotten.” It’s also necessary to verify processes that store data for limited periods of time, such as validation data that can be kept for a few hours.
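Both requirements, time-limited storage and erasure on request, can be sketched as two small functions over a list of records. The record layout, the `RETENTION` table, and the three-hour limit for validation data are assumptions for the example; the right limits depend on the applicable regulations and on the purpose of each dataset.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy; None means "no automatic expiry".
RETENTION = {
    "validation": timedelta(hours=3),
    "profile": None,
}


def purge_expired(records: list[dict], now: datetime) -> list[dict]:
    """Keep only records that are still within their retention period."""
    kept = []
    for record in records:
        ttl = RETENTION.get(record["kind"])
        # Note: kinds missing from the policy are kept, which a real system
        # would probably want to flag rather than silently allow.
        if ttl is None or now - record["created_at"] <= ttl:
            kept.append(record)
    return kept


def erase_user(records: list[dict], user_id: str) -> list[dict]:
    """Remove every record belonging to a user who asked to be forgotten."""
    return [record for record in records if record["user_id"] != user_id]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"user_id": "u1", "kind": "validation", "created_at": now - timedelta(hours=5)},
    {"user_id": "u1", "kind": "profile", "created_at": now - timedelta(days=400)},
    {"user_id": "u2", "kind": "validation", "created_at": now - timedelta(hours=1)},
]

after_purge = purge_expired(records, now)    # drops u1's expired validation data
after_erasure = erase_user(after_purge, "u1")  # drops u1's profile on request

print([(r["user_id"], r["kind"]) for r in after_erasure])  # [('u2', 'validation')]
```

The hard part in production is not this logic but knowing every place where such records live, which is why retention depends so heavily on the data map.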

Again, this should be built into the solution during development, but it often isn’t, for any number of reasons. With so many live projects and so many developers, it’s hard to identify all the places where user data is collected and to verify whether there’s a built-in opt-out for each table and process. This makes it even harder to police the implementation of data privacy protocols.

Depending on the way a program was built, deleting data can be a difficult process. For example, many developers I have worked with prefer columnar storage because it’s excellent for analytics, but it’s very awkward to delete specific datasets from this type of storage. If a digital service uses columnar storage, enforcing adequate data retention and minimization policies will be a long and complex process.

Privacy engineering can be an obstacle course

Privacy engineering is a growing concern for virtually all companies globally. To meet this dynamic set of challenges, business leaders must conduct a gap analysis focused specifically on privacy engineering in their organizations and grade themselves accordingly. Based on those findings, they will be able to understand and lay out the precise steps needed to protect the privacy of their employees and customers, and to ensure compliance with ever-changing privacy regulations. Every company is different, and so too are the actions and policies that must be implemented to ensure privacy engineering success.