Architectural Concerns with Policy-as-Code
I recently wrote a document introducing the concepts of Policy-as-Code. When I got to the interesting opportunities for implementation, I realized that as a software architect, I'd have a lot of questions about the practical aspects of implementing a PaC solution. So this essay discusses how the implementations of Policy-as-Code solutions mitigate most of the architectural concerns you might have.
The summary of the problem, if you didn't read the previous article, is this:
For policy-as-code to be most effective, it has to be used in a lot of different services and applications. For example, every request to every service or API you produce should (in principle) be evaluated against the organization’s policies.
This raises a number of architectural concerns:
Is the PaC system a single point of failure?
How much additional latency is involved in processing these requests?
Does PaC reduce the chance of ‘breaking a pre-existing workflow’?
How much does it cost to implement this?
How does this system manage security?
The good news is that the majority of the pioneering work in this space has been done by one open-source project: the Open Policy Agent. There’s a lot of information on the design of Open Policy Agent already out there, so for the purposes of this essay:
The Open Policy Agent (OPA) can be deployed as a sidecar or side-process within your existing infrastructure, so it is not a single point of failure, and it adds minimal latency.
There are professional tools (such as Styra DAS) available to improve the rigor of the policy-as-code solution. For example, using Styra DAS, you can maintain a historical record of requests and policy decisions, so you can update your policies and test the changes against historical requests — so you don’t have to ‘hope’ that the changes aren’t going to break anything.
Styra DAS is also extremely helpful for managing large fleets of OPAs.
DAS integrates well with Git, so your policy rules can be properly managed as a version-controlled codebase and then deployed to the OPAs as required.
The Open Policy Agent is open-source, freely available, and has a small memory and CPU footprint, so the additional cost of integrating it into your infrastructure is negligible.
Knowing this, let's return to the list of architectural concerns from earlier.
Is OPA a single point of failure?
OPA is a small program, written in Go. It can be deployed in a container as a sidecar. It can be deployed as an executable on a VM. It can be deployed as a library embedded in your application or service. It is designed to be deployed extremely closely to the applications or services it supports.
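As a concrete illustration, the sidecar pattern in Kubernetes might look like the following fragment. The application name, image tags, and port are assumptions for this sketch; pin specific versions in practice.

```yaml
# Fragment of a Pod/Deployment spec: the app and its OPA sidecar
# share a network namespace, so the app reaches OPA over localhost.
containers:
  - name: my-app                          # hypothetical application container
    image: example/my-app:1.0
  - name: opa
    image: openpolicyagent/opa:latest     # pin a real version in production
    args:
      - "run"
      - "--server"
      - "--addr=localhost:8181"           # listen only on the loopback interface
```

Binding OPA to localhost keeps the decision API reachable only from containers in the same pod, which is exactly the "deployed extremely closely" property described above.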
OPA loads the policy rules from a central server and caches them, so even in the event of a network partition or central-server outage, the OPA instances keep running and keep making policy decisions. The one weakness here is that if the policies change while the network is partitioned, the OPA instances on the far side of that partition will not know about it. But given the infrequent cadence of policy changes, this seems like a non-issue.
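That central-server relationship is typically set up through OPA's bundle feature: each OPA polls a bundle server for policy updates and keeps the last good bundle if the server becomes unreachable. A sketch of the relevant configuration, where the service name, URL, and bundle path are assumptions:

```yaml
# opa-config.yaml -- bundle polling; cached policies survive server outages
services:
  - name: policy-server
    url: https://bundles.example.com      # hypothetical central bundle server
bundles:
  authz:
    service: policy-server
    resource: bundles/authz.tar.gz        # hypothetical bundle path
    polling:
      min_delay_seconds: 60               # check for policy updates ~every minute
      max_delay_seconds: 120
```

If a poll fails, OPA simply keeps evaluating against the bundle it already has, which is the caching behavior described above.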
Additional Latency?
If the application or service needs to reach out to OPA to make a policy decision, there is certainly going to be an additional delay. The good news: since the OPAs are typically installed as sidecars or on the same VM as their 'customers', this additional latency is extremely small — typically on the order of microseconds to a few milliseconds.
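To make that call concrete, here is a minimal Python sketch of querying a sidecar OPA through its Data API. The policy path `httpapi/authz/allow` and the input fields are assumptions for illustration; substitute the path of your own policy.

```python
import json
from urllib import request

# Assumed sidecar address and policy path -- adjust for your deployment.
OPA_URL = "http://localhost:8181/v1/data/httpapi/authz/allow"

def build_decision_request(user: str, method: str, path: str):
    """Wrap the query attributes in the {"input": ...} envelope OPA expects."""
    payload = {"input": {"user": user, "method": method, "path": path}}
    return OPA_URL, json.dumps(payload).encode("utf-8")

def is_allowed(user: str, method: str, path: str) -> bool:
    """POST the input document to the local OPA and read the boolean result.

    Because OPA runs as a sidecar on loopback, a very tight timeout is reasonable.
    """
    url, body = build_decision_request(user, method, path)
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=0.05) as resp:
        # A missing "result" key means the policy path is undefined: deny by default.
        return json.loads(resp.read()).get("result", False)
```

The deny-by-default fallback is a deliberate design choice: if the policy path is undefined, the safest answer is "no".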
Processing the request and comparing it to the policy rules also takes time. OPA's policy language — Rego — is designed so that values are immutable and evaluation is guaranteed to terminate (no unbounded loops), which keeps decision times fast and predictable. There are also tools and techniques for improving Rego policy performance if a naïve implementation is too slow.
Breaking Workflows
The concern here is that just because we moved the policy code out of the application and into OPA / Rego doesn't mean a poorly-considered policy change can't break the application's expected behavior. This concern is valid: OPA / Rego is not a silver bullet. Having said that, OPA provides some techniques to help:
OPA logs the requests it receives and the decisions it makes, and provides mechanisms for that historical decision log to be collected by a central authority.
Either by hand or by using a tool like Styra DAS, you can update the policy code in a sandbox, replay the same set of historical decisions against the updated policy code, and compare the new results to the historical results.
OPA also supports automated policy testing, so you can test esoteric and edge-case requests against the policy logic.
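Those automated tests are written in Rego itself and run with OPA's built-in test runner (`opa test`). A minimal sketch, assuming a hypothetical `httpapi.authz` package that defines an `allow` rule:

```rego
package httpapi.authz_test

import data.httpapi.authz

# Users may read their own salary record.
test_user_can_read_own_salary {
    authz.allow with input as {"user": "alice", "method": "GET", "path": ["salary", "alice"]}
}

# Anyone else is denied by default -- the edge case we want pinned down.
test_other_user_cannot_read_salary {
    not authz.allow with input as {"user": "bob", "method": "GET", "path": ["salary", "alice"]}
}
```

Running `opa test .` in the policy directory executes every rule whose name starts with `test_`, so these checks can run in CI alongside the rest of your test suite.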
Implementation Cost
OPA is open source, and freely available as a container, an executable, or a Go library. So the costs to implement it are exclusively the labor required to integrate it into your system.
Styra DAS is a service, which can be provided either via SaaS or as a self-hosted (i.e. Enterprise) variant. There is a licensing cost associated with DAS, but you are not required to use DAS to use OPA. In this author’s opinion, DAS does provide some excellent operational benefits that make the additional cost more palatable.
Security Concerns
OPA is flexible in how it interacts with the world. At one end, it can be used in a completely unsecured fashion (plain HTTP, no SSL). At the other end, you can associate certificates with OPA such that decision requests, policy code repositories, and logging destinations are all encrypted and guarded with certificates that prove the identity of each connector. The highest level of security requires some configuration, but at that point, as long as your certificates are secure, you can be confident that:
No-one can eavesdrop on the decision requests/responses
No-one can ‘inject’ policies into your system
No-one can eavesdrop on the logging output
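On the management side, that mutual-authentication setup is expressed in OPA's configuration file. A sketch, where the server URL and certificate paths are assumptions:

```yaml
# opa-config.yaml -- encrypted, mutually authenticated bundle/log traffic
services:
  - name: policy-server
    url: https://bundles.example.com       # HTTPS encrypts the channel
    tls:
      ca_cert: /certs/ca.pem               # verify the server's identity
    credentials:
      client_tls:
        cert: /certs/client.pem            # prove this OPA's identity
        private_key: /certs/client-key.pem
```

The decision API itself can be locked down similarly by starting OPA with its TLS flags (`--tls-cert-file`, `--tls-private-key-file`) so that clients of the sidecar also get an encrypted, authenticated channel.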
Closing Thoughts
Are there other architectural and security concerns that I haven’t addressed? Did I miss something important in my assessment? I would love to hear your thoughts: johnbr@paclabs.io