Plumbing the Depths: Data Mapping in Unpredictable Waters

Nathan Siegel
Feb 12, 2024

If you’re a biologist trying to catalog all the flora and fauna in a tidepool, you’d be smart to begin from the shallow edges and work your way into the deep. The first steps into the pool have shallow, transparent water revealing the seabed below and making it easy to eyeball everything between the sand and the surface.

Wade deeper, and the work gets slower as the water gets cloudier and corals and fish multiply and swarm. You strap on your goggles and stick your head under to peer around. The deepest portions of the tide pool require you to use a scuba suit and vacuum up samples, a time-consuming task.

The process described above captures not just the work of a marine biologist, but also the tough task of mapping data for an enterprise-level company.

The Many-Layered Data Environment

As of 2022, the average enterprise company uses over 130 data sources to run its business and serve customers, and some reports put that number as high as nearly 1,000 apps. This figure doesn’t count internal tools, and may not include cloud storage, buckets, and other software either.

When a company wants to audit all customer data running through its stack of tools - whether for privacy, legal, security, or other purposes - the smart first step is to create a single source of truth for that data.

The best and truest source of what data is where, and who has access to it (the baseline requirements of any audit) is a data map. And just like the biologist in the tide pool, there are shallow, predictable levels of any technology stack that are simple to map data for. SaaS tools such as Mailchimp, for example, are simple and structured in terms of what data types to expect.

After determining that the company does indeed use Mailchimp, classifying the customer and personal data inside the company’s Mailchimp account isn’t difficult. There will never be customer credit card or health information inside this app, as its function is one-dimensional, and the types of data it holds can be predicted at a glance. One data source down, 129 to go.

Thankfully, this is the easy part of data mapping, and a solid chunk of the technology stack will be this effortless. But what about all those less-predictable, unstructured sources of data, like Amazon S3 or MongoDB? 

Comprehending and controlling unstructured data is an issue for up to 95% of businesses, as it’s extremely difficult to guess what types of data are in these multifaceted and multi-purpose tools.

The old way of mapping what data is in Amazon S3 was simply to identify the “data owner” - the employee in charge of S3 usage - and ask them to go in and write down the data types manually. In an ever-changing data environment, it’s easy to see why this is a waste of time, and not a small one either. Experts agree that the best way to determine data types in unstructured tools is to integrate the tool with whatever solution you’re using to map data, and then scan the tool in its entirety.
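To make the “full scan” idea concrete, here is a minimal, illustrative sketch in Python. The pattern names, regexes, and sample records are all hypothetical stand-ins; a real scanner uses far more robust detectors and reads from the actual data store rather than an in-memory list.

```python
import re

# Hypothetical PII detectors; a production scanner would use many more,
# far more rigorous patterns plus validation logic.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def classify_record(record: str) -> set[str]:
    """Return the set of data types detected in a single record."""
    return {name for name, pat in PATTERNS.items() if pat.search(record)}

def full_scan(records) -> set[str]:
    """Exhaustively classify every record -- accurate, but O(n) over the store."""
    found = set()
    for rec in records:
        found |= classify_record(rec)
    return found

records = [
    "order #1001 shipped to jane@example.com",
    "card on file: 4111 1111 1111 1111",
    "support note: no sensitive data here",
]
print(full_scan(records))  # reports every data type present in the store
```

The accuracy comes from visiting every record, which is exactly why the method gets slow and expensive as the store grows.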

The “full scan” method is ruthlessly accurate, but also extremely time intensive. Hooking up an integration and working with IT to ensure the job is done takes days if you’re lucky and weeks for a particularly deep system. If there are 20 or more unstructured, custom or internal tools that all need integrations… it’s easy to see why the average data mapping project takes 6 months to complete. 

But what if there was an easier way to predict what types of data reside in unstructured data sources?

A Smarter Way to Sample

If the biologist, rather than toting vacuums and tanks to the seabed, could simply take a vial of water and scan its contents to get a 100% accurate reading of all life in the vicinity, it would save an astronomical amount of time and money. In parallel, a data mapping project manager can imagine taking a small sample of a data source and getting the complete picture of data classification there, without slogging through integration after integration. Wouldn’t it be nice!

It is. MineOS is the world’s premier data mapping solution for data privacy and governance, and in our mission to create a true single source of truth - without dragging our customers through half a year of onboarding calls - we’ve developed Smart Data Sampling. 

Smart Data Sampling takes a slice of an unstructured data source (anywhere from 0.5% to 50%) and applies a unique scan technique to it in order to identify 100% of the core data types present.
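The internals of Smart Data Sampling aren’t described here, but the general intuition behind sampling-based classification can be sketched in a few lines of Python. Everything below is illustrative: the detector, the uniform random sample, and the record set are assumptions, not MineOS’s actual method (real sampling strategies weight by file type, schema, recency, and so on).

```python
import random
import re

# Illustrative detector; stands in for a real classification engine.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def classify(record: str) -> set[str]:
    return {"email"} if EMAIL.search(record) else set()

def sample_scan(records: list[str], fraction: float = 0.05, seed: int = 0) -> set[str]:
    """Classify a random slice of the store instead of every record.

    With a representative sample, the data types found in the slice
    approximate the data types present in the whole source.
    """
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    found: set[str] = set()
    for rec in rng.sample(records, k):
        found |= classify(rec)
    return found

# 200 log lines, each containing an email address
records = [f"user{i}@example.com signed up" for i in range(200)]
print(sample_scan(records, fraction=0.05))  # classifies ~10 records, not 200
```

The win is in the work skipped: a 0.5–5% sample touches a tiny fraction of the store, yet for data types spread throughout the source it surfaces the same classifications a full scan would.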

This is why having an enormous library of integrations is not always the answer for data privacy software, and it will become less and less important as the field progresses. Integrations drive up the cost of compliance tech, which is decidedly unfriendly to customers: a smarter method of determining data types exists, and many privacy teams are already working with limited budgets.

Putting Our Clams Where Our Craw Is

For our customers, utilizing Smart Data Sampling means both money saved on integrations and the typical onboarding time to implement a data map compressed down to a single month. The cost savings cannot be overstated.

When data mapping, a company using MineOS automatically discovers its structured data sources and then (also automatically) identifies the data types there using our AI. Suggestions of the data types we know are present within a system are chosen in one click, right in our Data Inventory.

Then, we work with the customer using our risk and business impact assessment tools to understand which of the unstructured data sources that we’ve discovered require Smart Data Sampling, and which require a full scan. 

For data sources where the customer wants to enact policy rules and alerts, a full scan is recommended. Its exhaustive nature makes it a powerful solution for notifying system administrators when noncompliant data types are found. Customers usually choose Smart Data Sampling for all other circumstances, and as a result end up saving months of their time.

As the only data privacy and governance solution with Smart Data Sampling, MineOS is uniquely capable of detecting and cataloging the vast environment full of data hidden under the water’s surface.