Integrating Policy as Code with DSPM
Classify, then Protect
Your organization has sensitive data scattered across databases, cloud storage, data lakes, and SaaS applications. Some contains PII, some falls under GDPR, some is intellectual property. The question: how do you even know where it all is, let alone control access to it?
This is where Data Security Posture Management (DSPM) and Policy as Code intersect. DSPM discovers and classifies your sensitive data. Policy as Code enforces who can access it based on those classifications.
DSPM (Data Security Posture Management) is a mechanism by which data sources and data stores are analyzed and their individual elements (for example, database columns or folders in cloud storage) are examined. When the DSPM system decides that an object qualifies for one or more tags, it is tagged. If a parent data store is tagged, its child objects (schemas, tables, fields) inherit the tag. There are a number of fairly standard tags (PII, GDPR, PCI), but there are no official standards for classification, and security administration teams can create their own categories (e.g., "Confidential-IP" or "M&A Docs") to align with internal policies.
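To illustrate the inheritance, here's a hypothetical (and deliberately simplified) view of what a DSPM might record for a tagged cloud storage bucket; the structure and names here are invented for this example:

{
  "datastore": "hr-bucket",
  "tags": ["GDPR"],
  "children": [
    { "folder": "payroll", "tags": ["GDPR", "PII"] },
    { "folder": "onboarding-docs", "tags": ["GDPR"] }
  ]
}

Because the parent bucket is tagged GDPR, both child folders inherit that tag; payroll carries an additional PII tag of its own.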
Controlling access to tagged data is policy
Imagine a policy designed to protect private data: “Only logged-in employees can access folders that are tagged ‘Confidential-IP’”.
You could imagine a (deliberately simplified) Rego policy for this:
package authz

import rego.v1

deny contains message if {
    input.folder.tag == "Confidential-IP"
    not input.user.is_employee
    message := "Only employees are allowed to access confidential information"
}

And that's the simplest example of an integration between DSPM and Policy as Code.
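For instance, an input document shaped like the following (the folder.tag and user.is_employee fields are simply what this sketch assumes the enforcement point sends; they're not part of any standard schema) would trigger the deny:

{
  "user": {
    "name": "visitor",
    "is_employee": false
  },
  "folder": {
    "name": "/research/designs",
    "tag": "Confidential-IP"
  }
}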
Types of Integrations
In the real world, finding the tags for a piece of data isn't that simple. There are a couple of common patterns that PaC can use to discover the tags:
Classification Manifest
Real-Time Query
And then one could write hybrid policies that attempt to use the manifest to discover the classification for a data element, and fall back to a real-time query if no classification exists.
Classification Manifest Cache
DSPM tools are designed to constantly scan the network infrastructure, looking for new data sources (or data stores) to classify. When a tool finds one, it carefully examines the data store to identify the component data types that warrant classification. Often, this can be a tedious, manual process. A database is probably the simplest example to understand.
A database consists of many tables
Tables consist of many columns
Individual columns can represent information worthy of classification
Imagine you have a database containing taxpayer information. For the sake of simplicity, it has two tables - TAXPAYER and TX_LOG.
The database is named TAXDB
The TAXPAYER table has the following columns: NAME, SSN, DATE_OF_BIRTH, TX_ID
The TX_LOG table has the following columns: ID, PAYMENT, DATE
You can look at the TAXPAYER table and easily observe that many fields deserve special tags:
NAME might be tagged 'PII'
SSN might have multiple tags: 'PII-SENSITIVE'
DATE_OF_BIRTH could be 'PII', but let's say for the sake of this example that it doesn't have any tags
On the other hand, nothing in the TX_LOG is necessarily deserving of special tags, since there’s no direct way of associating a payment with a particular human.
Example Classification Schema
The DSPM creates a manifest of the tags associated with the database/table/column relationship. That manifest could be pushed into the OPA data section for caching. Here’s a simplified example:
{
  "data_classification": {
    "databases": {
      "TAXDB": {
        "tables": {
          "TAXPAYER": {
            "columns": {
              "SSN": {
                "classification": "PII-SENSITIVE"
              },
              "NAME": {
                "classification": "PII"
              },
              "DATE_OF_BIRTH": {
                "classification": null
              }
            }
          }
        }
      }
    }
  },
  "data_access": [
    null,
    "PII",
    "PII-SENSITIVE"
  ]
}

Policy Logic
An input document arrives at the OPA. The query: “Is the specified user allowed to access the specified data element?”
{
  "user": {
    "name": "joe",
    "access": 0
  },
  "query": {
    "fetch_db": "TAXDB",
    "fetch_table": "TAXPAYER",
    "fetch_columns": [
      "SSN"
    ]
  }
}

And you can probably imagine what the Rego might look like to match that input document against the classification data above:
package policy

import rego.v1

deny contains msg if {
    some column_name in input.query.fetch_columns
    classification := data.data_classification.databases[input.query.fetch_db].tables[input.query.fetch_table].columns[column_name].classification
    classification == "PII-SENSITIVE"
    not user_has_clearance("PII-SENSITIVE")
    msg := sprintf("Access denied: %v contains sensitive PII", [column_name])
}

user_has_clearance(level) if {
    level == data.data_access[input.user.access]
}
Here's this code implemented in the Rego Playground. Feel free to play with it - set the classifications and the access level for the user to see the changes in behavior.
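If you'd rather verify the behavior locally, here's a minimal unit test sketch for OPA's built-in test runner. It assumes the deny policy above lives in a package named policy and that the classification manifest JSON is loaded as data alongside it (e.g., opa test policy.rego manifest.json policy_test.rego):

package policy_test

import rego.v1

import data.policy

test_ssn_denied_for_access_level_zero if {
    # joe's access level 0 maps to data.data_access[0], which is null and
    # therefore never matches "PII-SENSITIVE", so the deny rule must fire
    "Access denied: SSN contains sensitive PII" in policy.deny with input as {
        "user": {"name": "joe", "access": 0},
        "query": {
            "fetch_db": "TAXDB",
            "fetch_table": "TAXPAYER",
            "fetch_columns": ["SSN"]
        }
    }
}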
As new data stores are discovered, it’s trivial to add them to the OPA data section, which now has even more classification data available locally.
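One way to do that push is OPA's Data API, which accepts a PUT at the path where the document should live. The NEWDB content below is invented for illustration (and in production deployments, classification updates are more commonly shipped via OPA's bundle mechanism):

PUT /v1/data/data_classification/databases/NEWDB

{
  "tables": {
    "CUSTOMER": {
      "columns": {
        "EMAIL": {
          "classification": "PII"
        }
      }
    }
  }
}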
Pros and Cons
The benefit of caching the classification schema data is speed - Rego can look up these classification relationships very quickly.
The downside - if a new data store is added to the system, and the DSPM hasn’t had a chance to analyze it yet, the cache will not have any classification data for that data store, and the user may be given access to data that should be inaccessible.
Real-Time Query
Another way to determine the classification of a data element is to perform a real-time API call against a DSPM ‘system’. Here’s a conceptual snippet of Rego code that could be used to call a DSPM for information:
get_classification(db, table, column) := classification if {
    resp := http.send({
        "method": "GET",
        "url": sprintf("https://dspm.example.com/classify/%s/%s/%s", [db, table, column]),
        "cache": true
    })
    classification := resp.body.classification
}

Note: in the real world, there would probably be security tokens, and the API call would certainly be more sophisticated.
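As a rough sketch of what that hardening might look like - the DSPM_TOKEN environment variable, the timeout value, and the endpoint itself are all illustrative assumptions here, not any particular vendor's API:

# Hypothetical hardened variant: bearer token from an environment variable,
# a request timeout, and explicit handling of non-200 responses.
get_classification_hardened(db, table, column) := classification if {
    resp := http.send({
        "method": "GET",
        "url": sprintf("https://dspm.example.com/classify/%s/%s/%s", [db, table, column]),
        "headers": {"Authorization": sprintf("Bearer %s", [opa.runtime().env.DSPM_TOKEN])},
        "cache": true,
        "timeout": "500ms",
        "raise_error": false
    })
    resp.status_code == 200
    classification := resp.body.classification
}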
In this case, we can discover the same classification results, but it will work even if the DSPM has not yet had a chance to push the classification schema to the OPA data.
Pros and Cons
The benefit of this approach is better classification accuracy - the DSPM might have more recent data than the cache.
The downsides of this approach are latency (OPA now has to go out to a third-party system for classification data) and an additional single point of failure (if the DSPM is unavailable).
Hybrid Approach
And of course, you can take a hybrid approach, where you check to see if the data is in the cache and only make the API call when the cache is incomplete.
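A minimal sketch of that fallback logic, reusing the pieces from the earlier examples (classification_for is a name invented here; get_classification is the real-time lookup from the previous section):

classification_for(db, table, column) := c if {
    # Cache hit: the manifest has an entry (even a null one) for this column
    c := data.data_classification.databases[db].tables[table].columns[column].classification
}

classification_for(db, table, column) := c if {
    # Cache miss: the column is unknown to the manifest, so ask the DSPM directly
    not data.data_classification.databases[db].tables[table].columns[column]
    c := get_classification(db, table, column)
}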
Use Cases
There are several ways that policy logic might be used in this type of configuration:
Masking
the actual data in the column might be replaced with ***** or spaces or null (see the sketch after this list)
Compliance Logging
each time data of a certain security level is accessed, a log message could be created (even if the user was allowed to see it)
Denial
as in the above example, the user could be prevented from accessing the data at all
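Masking, for example, is commonly implemented by having the policy return the set of columns to redact, letting the enforcement point (a database proxy, say) perform the actual replacement. Here's a sketch against the manifest above - the mask_columns rule name and the proxy behavior are assumptions of this sketch, and user_has_clearance is the helper from the earlier policy:

mask_columns contains column_name if {
    some column_name in input.query.fetch_columns
    classification := data.data_classification.databases[input.query.fetch_db].tables[input.query.fetch_table].columns[column_name].classification
    classification == "PII-SENSITIVE"
    not user_has_clearance("PII-SENSITIVE")
}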
In addition, Policy as Code allows for more sophisticated analysis. For example, policies can be defined on attributes of the data that are not directly related to the classification:
The physical location of the data, relative to the physical location of the user
Whether the user is on the VPN or not
The age of the data
Whether the data is properly encrypted
And so forth. You can probably see how these things could be combined:
Employees can only access PII-SENSITIVE material if they're on the VPN.
Health care workers can only access a patient's data if they reside in the same US state as the patient.
If any field in a table is marked PII-SENSITIVE, the disk holding that data must be encrypted.
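The first of those combinations might look something like the following sketch, where input.user.on_vpn is a hypothetical attribute that the enforcement point would need to supply:

deny contains msg if {
    some column_name in input.query.fetch_columns
    classification := data.data_classification.databases[input.query.fetch_db].tables[input.query.fetch_table].columns[column_name].classification
    classification == "PII-SENSITIVE"
    not input.user.on_vpn
    msg := sprintf("Access denied: %v requires a VPN connection", [column_name])
}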
Conclusion
Thank you for taking the time to read this. You should now know a little more about how DSPM can integrate with Policy as Code, and perhaps see some use cases where this information could be useful to your organization.
If you have any questions, comments, or feedback, please feel free to reach out at: johnbr@paclabs.io

