Apr 21, 2024

Why Sensitive Data Discovery Tools are Wrong More Often Than You'd Like

False alarm!

Did that inspire relief - or frustration?

When it comes to sensitive data discovery tools, false alarms (otherwise known as false positives) are the bane of an InfoSec team’s existence. And false positives persist in occurring more often than you would like (which is not at all). 

Why does this happen? 

Let’s take a look at the methods sensitive data discovery tools use to identify sensitive information, and see where they can fall short. We’ll also discuss what to look for in a sensitive data discovery tool that will give you the best shot at accurate detections.

The Methods (and the Madness) of Sensitive Data Discovery Tools

Regular expressions (regex)

Regular expressions (regex) are patterns used to match and manipulate specific sequences of characters within strings of text. 

For instance, 

\d{6} means “six consecutive digits”

\d{3}-\d{2}-\d{4} means a string of numbers that looks like NNN-NN-NNNN (the format of a US Social Security number)

(?i)exp means the string of letters ‘exp’, case-insensitive 

Regex are used in sensitive data discovery to search for sensitive text that matches defined rules.
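The three patterns above can be tried directly in Python’s `re` module. This is just an illustration of how the patterns behave; the sample text is made up, and the SSN shown is the article’s deliberately invalid example.

```python
import re

# The regex patterns from the examples above.
six_digits = re.compile(r"\d{6}")              # six consecutive digits
ssn_like = re.compile(r"\d{3}-\d{2}-\d{4}")    # NNN-NN-NNNN (SSN format)
exp_word = re.compile(r"(?i)exp")              # 'exp', case-insensitive

text = "Order 123456 shipped. SSN on file: 964-76-3092. Card exp 04/27."

print(six_digits.findall(text))  # ['123456']
print(ssn_like.findall(text))    # ['964-76-3092']
print(exp_word.findall(text))    # ['exp']
```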

Where regex can go wrong:

Regex often returns false positives because it matches patterns strictly based on form, not content or context. Without understanding meanings or variations, regex might incorrectly identify similar but irrelevant data as matches, especially when patterns are broad or ambiguously defined, leading to high rates of misidentification.

For example, if you’re a retail company that has product ID numbers in the form NNN-NN-NNNN, a sensitive data discovery tool that you’ve asked to search your data for social security numbers (via regex) will return all data assets that contain a product ID number.

Or if you want your sensitive data discovery tool to find credit card expiration dates, and you use the regex (?i)exp, your tool will flag every instance of words like express and inexpensive… you get the (bad) idea. 

Improving the specificity of your regex will give you fewer false positives. But they’ll never disappear altogether, because regex does not look for context. It’s designed to be a horse with blinders, narrowly walking the path you set out for it. What you ask for is what you get.
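Here is a sketch of what "improving specificity" can look like for the SSN pattern. The tighter variant adds word boundaries and encodes real SSA formatting rules (no area 000, 666, or 900-999; no group 00; no serial 0000) via negative lookaheads. It still can’t see context, so a product ID that satisfies those rules would still match.

```python
import re

# Broad pattern: matches anything shaped NNN-NN-NNNN,
# including a retailer's product IDs in the same format.
broad = re.compile(r"\d{3}-\d{2}-\d{4}")

# Tighter pattern: word boundaries plus SSA formatting rules -
# area can't be 000, 666, or 9xx; group can't be 00; serial can't be 0000.
tight = re.compile(r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

samples = [
    "123-45-6789",  # plausible SSN
    "964-76-3092",  # area starts with 9: invalid as an SSN
    "000-12-3456",  # area 000: also invalid
]

print([s for s in samples if broad.fullmatch(s)])  # all three match
print([s for s in samples if tight.fullmatch(s)])  # only the first
```

Fewer false positives, but the horse still has its blinders on.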

Data fingerprinting and exact data matching techniques

Exact data matching is exactly what it sounds like: comparing fields or records against a predefined list to see if an exact match exists. 

If you want your sensitive data discovery tool to find a specific instance of PII, like Jane Doe’s social security number, which is 964-76-3092 (yes, we know; social security numbers can’t start with 9), then exact match is the best choice. 

Data fingerprinting is like exact data matching via hashing algorithms. Instead of searching for the exact data itself, you first use a hashing function on it to generate a unique identifier (the “fingerprint”). Then you hash the data you want to search, and search that for the fingerprint.
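A minimal sketch of the fingerprinting idea, using SHA-256 from Python’s standard library. Real tools fingerprint at much larger scale and may salt or chunk the data, but the core mechanic is the same: the sensitive plaintext never has to be stored in, or shipped with, the scanner.

```python
import hashlib

def fingerprint(value: str) -> str:
    """Hash a value so only the fingerprint, not the plaintext,
    needs to travel with the scanning tool."""
    return hashlib.sha256(value.encode()).hexdigest()

# Fingerprints of the values we want to find (e.g. Jane Doe's SSN).
watchlist = {fingerprint("964-76-3092")}

# Scan: hash each field of the data under inspection and compare.
records = ["555-12-3456", "964-76-3092", "hello world"]
hits = [r for r in records if fingerprint(r) in watchlist]
print(hits)  # ['964-76-3092']
```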

Where exact data matching can go wrong:

Exact data matching is unlikely to have false positives. It is, however, more likely to cause false negatives due to its lack of flexibility. 

If you asked your sensitive data discovery tool to find instances of Jane Doe’s social security number with an exact data match of 964-76-3092, and due to data variation it appears somewhere in your data environment looking like this: 964763092, your discovery tool will not discover it. 
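One common mitigation (an illustration, not a claim about any particular tool) is to normalize formatting on both sides before comparing, so hyphenated and unhyphenated variants match:

```python
import re

def normalize(value: str) -> str:
    """Strip common separators (hyphens, dots, whitespace) before
    matching, so '964-76-3092' and '964763092' compare equal."""
    return re.sub(r"[-.\s]", "", value)

target = normalize("964-76-3092")
candidates = ["964763092", "964-76-3092", "964.76.3092", "123456789"]
matches = [c for c in candidates if normalize(c) == target]
print(matches)  # ['964763092', '964-76-3092', '964.76.3092']
```

The trade-off: the looser the normalization, the more you reintroduce the false-positive risk that exact matching was chosen to avoid.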

Natural Language Processing (NLP) techniques

Natural Language Processing (NLP) combines computational linguistics - rule-based modeling of human language - with statistical, machine learning, and deep learning models. NLP can identify entities and extract other features from unstructured text, as well as understand context and sentiment, making it very useful for sensitive data detection. 

Where NLP can go wrong:

False positives can happen when the NLP model is simplistic (like earlier versions of Named Entity Recognition - NER) or when the training data was too narrow.

More advanced NLP models which better understand the context and nuances of human language (such as BERT - Bidirectional Encoder Representations from Transformers - or GPT - Generative Pre-trained Transformer) will be more likely to deliver accurate classification of sensitive data. 
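To make the idea of context concrete, here is a toy sketch (emphatically not a real NLP model) of how surrounding words can disambiguate an NNN-NN-NNNN match, loosely the way an NER model weighs context features. The cue words and the product ID are invented for illustration.

```python
import re

# Toy context-based classifier: is an NNN-NN-NNNN match an SSN,
# a product ID, or ambiguous? Cue words are illustrative only.
SSN_SHAPE = re.compile(r"\d{3}-\d{2}-\d{4}")
SSN_CUES = {"ssn", "social", "security", "taxpayer"}
PRODUCT_CUES = {"sku", "product", "item", "catalog"}

def classify(text: str) -> list[tuple[str, str]]:
    words = set(re.findall(r"[a-z]+", text.lower()))
    results = []
    for m in SSN_SHAPE.finditer(text):
        if words & SSN_CUES:
            results.append((m.group(), "likely SSN"))
        elif words & PRODUCT_CUES:
            results.append((m.group(), "likely product ID"))
        else:
            results.append((m.group(), "ambiguous"))
    return results

print(classify("Employee SSN: 964-76-3092"))
# [('964-76-3092', 'likely SSN')]
print(classify("Restock SKU 310-44-1275 in aisle 9"))
# [('310-44-1275', 'likely product ID')]
```

A transformer-based model does something far more sophisticated than keyword lookup, of course - but this is the kind of signal regex alone can never use.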

Asset metadata scans

The purpose of metadata is to give you information about your data assets. Common metadata types include technical, descriptive, usage and administrative metadata. Sensitive data discovery tools can leverage that information to draw actionable conclusions about data assets.  

For example, if the asset’s file name (technical metadata) contains “financial projections” or “budget”, that is often a good indication that the content is sensitive financial data. If your tool checks that against the asset’s “last modified date” (also technical metadata) and finds that it was last modified eight years ago, the likelihood that it needs to be treated as confidential information goes down. On the other hand, if the asset’s “last modified date” was two weeks ago, that’s a strong indicator that confidentiality precautions are necessary.
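The file-name-plus-recency logic above can be sketched in a few lines. The hint list, the two-year staleness threshold, and the risk labels are all assumptions chosen for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumed name hints and staleness threshold - tune for your environment.
SENSITIVE_NAME_HINTS = ("financial projections", "budget")
STALE_AFTER = timedelta(days=365 * 2)

def metadata_risk(filename: str, last_modified: datetime) -> str:
    """Cheap metadata-only triage: name hints suggest sensitivity,
    and recency of modification raises or lowers the urgency."""
    name_hit = any(h in filename.lower() for h in SENSITIVE_NAME_HINTS)
    if not name_hit:
        return "low"
    age = datetime.now(timezone.utc) - last_modified
    return "high" if age < STALE_AFTER else "review"

now = datetime.now(timezone.utc)
print(metadata_risk("Q3 Budget.xlsx", now - timedelta(days=14)))      # high
print(metadata_risk("Budget 2016.xlsx", now - timedelta(days=2900)))  # review
print(metadata_risk("holiday photos.zip", now - timedelta(days=14)))  # low
```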

The benefit of asset metadata scans is their speed. Scanning the metadata of a spreadsheet takes milliseconds - or less. Scanning the entire contents of a spreadsheet with 10,000 rows of data takes a lot longer.

Where asset metadata can go wrong:

Any time you’re judging a book by its cover, you can come to erroneous conclusions - both false positives and false negatives. 

The more types of asset metadata that a sensitive data discovery tool can use and correlate in its analysis (as in the above use of “last modified date” to shed light on the initial conclusions of “file name”), the higher the likelihood of accurate conclusions. 

In general, asset metadata scans shouldn’t be used in a vacuum. They’re best used by sensitive data discovery tools for a SaaS environment, where, after data is shared, it’s a matter of seconds before it might be accessed, copied and moved elsewhere. An asset metadata scan can give at least an indication of whether that share should be stopped immediately and a more comprehensive method used to evaluate the full content of the asset.

The Costs of Inaccurate Sensitive Data Discovery Tools

False alarms aren’t just annoying. They cost your organization in:

  • Business productivity
  • IT overload
  • Alert fatigue/blindness

Business productivity costs

False positives block your users from legitimately accessing or sharing data. This is especially problematic in a SaaS ecosystem, because streamlined workflows and enhanced collaboration are the very reason why organizations implement SaaS.

IT overload

Your IT and InfoSec teams are swamped by the volume of alerts that your sensitive data discovery solution sends their way. Investigating all these alerts takes excessive time and subtracts from the time they have available to do the real IT work your organization needs. 

Alert fatigue/blindness

Eventually, your IT and InfoSec teams decide - consciously or unconsciously - not to investigate every single alarm. In this situation, the chances are high that a bona fide problem will pass through undetected. And as so wisely said by Enrique Saggese, a Microsoft Purview Principal PM: “A false positive can ruin my lunch, but a false negative can ruin my career.”

What to Look for in a Sensitive Data Discovery Tool

To reduce sensitive data false positives, look for a sensitive data discovery tool that:

  • Uses multiple methods of analysis
  • Has contextual awareness

The more angles that a tool brings to bear when it analyzes data for sensitivity, the fuller a picture it will get. And the better it is able to understand the different elements of that picture and how they interact with each other, the more accurate its conclusions will be.

You will also want to see how well a given sensitive data discovery tool works with your data. This usually isn’t necessary for straightforward pattern recognition methods like exact data match or regex - but we wouldn’t recommend a tool that uses only those methods, anyway. Once a tool uses more complex, fine-tuned models (which you certainly want!), its efficacy is connected to how well those models and their training data encompass the data you want it to work on. So make sure you investigate how it works with your data assets - not just with demo data. 

Don’t Sound the (False) Alarm

No sensitive data discovery method delivers perfect results. But you can get closer to that goal by understanding each method’s strengths and weaknesses, then choosing a sensitive data discovery tool that uses the right combination of methods for your data. 

May your alerts be few - but accurate!
