
How the Modern Data Ecosystem Broke Data Governance

Most companies today understand the immense opportunity the “Age of Data” offers them, and an ecosystem of modern technologies has sprouted up to help. But assembling a comprehensive modern data ecosystem from the available offerings to actually deliver data value is confusing and difficult. Ironically, some of the technologies that have made certain segments easier and faster have made data governance and protection appear more difficult.

Big data and the multiverse of madness

“Use data to make decisions? That’s crazy talk!” was a common thought in IT back in the 2000s. Information Technology groups didn’t really understand the value of data – they treated it like money in the bank. They thought that if they stored it in a database and kept it perfect, it would gain value, so they resisted letting people use it (especially in its most granular form). But there is no compound interest on locked-up data. A better analogy is the food in your freezer. You need to cycle through it. You need to pull things out and eat (eh, I mean use) them, or they just go bad. Data is the same – it needs to be used, updated, and refreshed or else it loses value.

Over the past several years we’ve developed a better understanding of how data should be utilized and managed to maximize value. With this new understanding have come disruptive technologies that enable and speed the process along, simplifying difficult tasks and minimizing the cost and complexity of working with data.

But when you look at the entire ecosystem, it is difficult to make sense of it all. If you try to organize companies into a technology stack, it’s more like “52-card pickup” – no two cards fall exactly on top of each other, because very few companies present exactly the same offering, and very few cards line up side by side to offer perfectly complementary technologies. That’s one of the challenges of trying to integrate best-of-breed offerings: the integration is hard, and the interstitial spots are difficult to deal with.

We can look at Matt Turck’s data ecosystem diagrams from 2012 to 2020 and see a clear trend of increasing complexity – both in the number of companies and in the categorization. It’s extraordinarily confusing even for those of us in the industry, and while he did a good job of organizing it, I would argue that pursuing a taxonomy of the analytics industry is not productive. Some technologies are miscategorized or misrepresented, and some companies should be listed in two or more spots. It’s no surprise that companies attempting to build their own modern stack might be at a loss. No one really knows or understands the entire ecosystem because it’s just too massive. Diagrams like these have value as a loosely organized catalog but should be taken with a grain of salt.

A saner, but still legacy approach

Andreessen Horowitz (a16z) provides a different way to look at the ecosystem – one based more on the data lifecycle – that they call a “unified data infrastructure architecture.” It starts with data sources on the left, moves through ingestion/transformation, storage, historical processing, and predictive processing, and ends with output on the right. Along the bottom are data quality, performance, and governance functions, which are pervasive throughout the stack. This model should look familiar because it is very similar to the linear pipeline architectures of legacy systems.

Just like in the previous model, many of today’s modern data companies don’t fit neatly into a single section. Most span two adjacent spaces; others surround “storage” – offering, for example, both ETL and visualization capabilities – which gives them an apparently discontinuous value proposition.

Sources

Starting on the left, sources are obvious but worth going into in some detail. They are the transactional databases, applications and application data, and other data sources that have been discussed in Big Data infographics and presentations over the past decade. The key takeaway is the three V’s of Big Data: Volume, Velocity, and Variety. Those factors had a meaningful impact on the Modern Data Ecosystem simply because traditional platforms could not handle at least one of the V’s. Within a given enterprise, data sources are always evolving.

Ingestion and transformation

The next section is more convoluted – ingestion and transformation. You can break it into traditional ETL and newer ELT platforms, programming languages that promise ultimate flexibility, and lastly event-based and real- or near-real-time data streaming. The ETL/ELT area has seen innovation driven by the need to handle semi-structured and JSON data without losing information in transformation. The reason there are so many solutions in this space today is not only the variety of data but also the variety of use cases. Solutions optimize for ease of use, efficiency, or flexibility, and I would argue you cannot get all three in a single tool. And since data sources are dynamic, ingestion and transformation strategies and technologies must follow suit.
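To make the ELT idea concrete, here is a minimal sketch in Python: raw JSON records are landed first and only transformed afterward, inside the database. The table and field names are illustrative, sqlite3 merely stands in for a cloud warehouse, and json_extract assumes a SQLite build with JSON support (the default in recent Python distributions).

```python
import json
import sqlite3

# Minimal ELT sketch: land raw JSON first, transform inside the database later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")

events = [
    {"user": "u1", "type": "click", "props": {"page": "/pricing"}},
    {"user": "u2", "type": "purchase", "props": {"amount": 42.0}},
]

# "EL": load the semi-structured records as-is, with no upfront schema.
conn.executemany(
    "INSERT INTO raw_events (payload) VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# "T": transform with SQL after loading, extracting only the fields the analysis needs.
rows = conn.execute(
    """
    SELECT json_extract(payload, '$.user') AS user,
           json_extract(payload, '$.type') AS event_type
    FROM raw_events
    """
).fetchall()
print(rows)  # [('u1', 'click'), ('u2', 'purchase')]
```

The point is the ordering: nothing about the nested “props” object had to be modeled before loading, which is exactly why ELT copes better with variety.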

Storage

Recently, storage has also been a center of innovation in the modern data ecosystem, driven by the need to meet capacity requirements. Traditionally, databases were designed with compute and storage tightly coupled. Any upgrade required the entire system to come down, and it was difficult and expensive to manage capacity. Today, innovations are coming quickly from new cloud-based data warehouses like Snowflake, which has separated compute from storage to allow elastic capacity scaling. Snowflake is an interesting, difficult-to-categorize case. It is a data warehouse, but through its Data Marketplace it can also be a data source. Furthermore, as ELT gains traction and Snowpark gains capabilities, Snowflake is becoming a transformation engine. While there are many solutions in the EDW, data lake, data lakehouse, etc. space, the key disruptions we are experiencing are cheap, effectively infinite storage and elastic, flexible compute.
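As a rough illustration of that last point, here is a hedged sketch using Snowpark’s Python DataFrame API. The connection parameters, table, and column names are placeholders I’ve made up; the transformation is expressed in Python but executed inside Snowflake’s elastic compute rather than on a separate ETL cluster.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder connection details -- supply real values for your account.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Hypothetical table: the filter/aggregate below is pushed down and run
# by the warehouse itself, which is what makes it a transformation engine.
raw = session.table("RAW_ORDERS")
summary = (
    raw.filter(col("STATUS") == "COMPLETE")
       .group_by("CUSTOMER_ID")
       .count()
)
summary.write.save_as_table("ORDER_COUNTS", mode="overwrite")
```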

BI and data science

The a16z model breaks down in the Historical, Predictive, and Output categories. In my opinion, most software companies in this space occupy multiple categories, if not all three, making these groupings merely academic. Challenged to come up with a better way to make sense of an incredibly dynamic industry, I gave up and oversimplified: I reduced this to database clients and focused on just two types, BI and Data Science. You can consider BI the historical category, Data Science the predictive category, and pretend that each has built-in “Output.” Both have created challenges for the governance space with their ease of use and pervasiveness.

Business Intelligence has also come a long way in the past 15 years. Legacy BI platforms required extensive data modeling and semantic layers to harmonize how the data was viewed and to overcome the performance issues of slower OLAP databases. Since those old platforms were centrally managed by a few people, the data was easier to control. Users only had access to aggregated data that was updated infrequently, and the analyses provided in those days were far less sensitive than today’s. BI in the modern data ecosystem brought a sea change: the average office worker can create their own analyses and reports, the data is more granular (when was the last time you hit an OLAP cube?), and the data is approaching real time. It is now commonplace for a data-savvy enterprise to have reports that are updated every 15 minutes. Teams across the enterprise can see their performance metrics on current data, enabling fast changes in behavior and effectiveness.

While Data Science has been around as a technology for a long time, the idea of democratizing it has only started to gain traction over the past few years. I am using the term in the very general sense of statistical and mathematical methods focused on complex prediction and classification that go beyond basic rules-based calculations. These new platforms make it possible to analyze data in more sophisticated ways without worrying about standing up compute infrastructure or the complexity of coding. “Citizen data scientists” (a term also used in the most general sense possible) are people who know their domain and have a foundational understanding of what data science algorithms can do, but lack the time, skill, or inclination to deal with the coding and the infrastructure. Unfortunately, this movement has also increased the risk of exposing sensitive data. Analysis of PII may be necessary to predict consumer churn, lifetime value, or detailed customer segmentation, but I argue the data doesn’t have to be analyzed in its raw or plain-text form.

Data tokenization – which allows for modeling while keeping data secure – can reduce that risk. Users don’t need to know who the people are, just how to group them, so they can run cluster analysis without exposure to sensitive, granular-level data. Furthermore, with a deterministic tokenization technology, the tokens are predictable yet undecipherable, which enables database joins when the sensitive fields are used as keys.
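As a rough sketch (not any particular vendor’s implementation), deterministic tokenization can be as simple as a keyed hash: the same input always produces the same token, so tokenized columns still line up as join keys across tables, yet the original values are never exposed. The key and sample values below are made up for illustration.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key, kept away from analysts

def tokenize(value: str) -> str:
    # Deterministic: identical inputs always yield identical tokens,
    # so joins and group-bys on the tokenized column still work.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Two datasets tokenized independently still join on the same token.
customers = {tokenize("alice@example.com"): {"segment": "high-value"}}
orders = [{"customer": tokenize("alice@example.com"), "amount": 42.0}]

for order in orders:
    segment = customers[order["customer"]]["segment"]
    print(segment, order["amount"])  # high-value 42.0
```

Real tokenization platforms add key management, vaulting, or format preservation on top, but the analytic property described here – groupable, joinable, yet undecipherable values – is the same.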

Call it digital transformation, democratization of data, or self-service analytics: the trend across Historical, Predictive, and Output – or BI and Data Science – is making the modern data ecosystem more approachable for the domain experts in the business. It also dramatically reduces reliance on IT outside of the storage tier. The dynamics of what data can do require users to iterate, and iterating is painful when multiple teams, processes, and technologies get in the way.