Use Case: OPA Scalability in Kubernetes
You are building a new financial transaction management system and you want to use Policy as Code (PaC) to implement your Zero Trust solution. You need to scale to thousands of transactions per second, and you expect the PaC agents to keep up. You’ve chosen Open Policy Agent (OPA) as your PaC solution, and you’re using Envoy to forward all requests to an OPA sidecar for authorization, so the applications do not need to be aware of the policy regime at all.
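The Envoy-to-OPA wiring described above is typically done with Envoy's external authorization (ext_authz) HTTP filter pointed at the OPA sidecar's gRPC listener. A minimal sketch of that filter chain entry follows; the port (9191 is the opa-envoy-plugin default) and the request-body limits are assumptions, not the customer's actual configuration:

```yaml
# Sketch: Envoy HTTP filter chain delegating authorization to an OPA sidecar
# running the opa-envoy-plugin (ext_authz gRPC API, default port 9191).
http_filters:
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    transport_api_version: V3
    with_request_body:            # forward (part of) the body so policies can inspect it
      max_request_bytes: 8192
      allow_partial_message: true
    failure_mode_allow: false     # deny requests if OPA is unreachable
    grpc_service:
      google_grpc:
        target_uri: 127.0.0.1:9191
        stat_prefix: ext_authz
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Every request Envoy accepts is checked against OPA before it reaches the application, which is what puts the OPA sidecar directly on the throughput-critical path.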
The Problem
You are struggling to get sufficient transactions per second (TPS) out of your solution. Your applications can process far more TPS than the OPA sidecar can authorize, and you can’t consistently get OPA above 1,500 TPS. How can you improve the throughput of OPA in this environment?
Evaluation
In isolation, OPA has been measured supporting roughly 16,000 TPS on a dedicated, high-resource server. Using Docker containers to simulate the K8S architecture on local PCs, you can typically get 3,000 - 5,000 TPS out of OPA.
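Throughput figures like these come from closed-loop load generation against OPA's HTTP Data API (POST /v1/data/&lt;policy path&gt; with a JSON `{"input": ...}` body). A minimal sketch of such a measurement harness follows; the URL, policy path, worker count, and duration are all illustrative assumptions:

```python
# Minimal closed-loop load generator for estimating OPA authorization TPS.
# The URL and policy path below are placeholders for a local OPA instance.
import json
import threading
import time
import urllib.request

OPA_URL = "http://127.0.0.1:8181/v1/data/authz/allow"  # assumed endpoint

def query_opa(payload: dict):
    """POST one authorization query to OPA's Data API and return the decision."""
    req = urllib.request.Request(
        OPA_URL,
        data=json.dumps({"input": payload}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp).get("result")

def measure_tps(send, duration_s=5.0, workers=32) -> float:
    """Run `send` in a closed loop across `workers` threads; return requests/sec."""
    counts = [0] * workers
    deadline = time.monotonic() + duration_s

    def loop(i):
        while time.monotonic() < deadline:
            send()
            counts[i] += 1

    threads = [threading.Thread(target=loop, args=(i,)) for i in range(workers)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / (time.monotonic() - start)
```

With a running OPA, `measure_tps(lambda: query_opa({"user": "alice"}))` gives a rough TPS ceiling for that sidecar; a closed loop like this measures the whole request/decision round trip, which is the number that matters for the Envoy architecture.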
In our evaluation, we experimented with several options:
Varying the CPUs and memory allocation(s) available to OPA.
Simplifying the policy rules to the bare minimum.
Optimizing the policy bundles.
Using Enterprise OPA instead of the open-source agent.
Using different communication protocols between OPA and Envoy.
Results
Our finding was that in the K8S Envoy-OPA architecture, performance could not be kept consistently above 1,500 TPS. No matter how much memory or CPU we allocated to the OPA sidecar, performance did not improve. Simplifying and optimizing the rules likewise showed no measurable benefit. Research into other organizations running similar experiments revealed similar, if not worse, transactional throughput.
We provided several recommendations for improvements, including:
Moving OPA so that it was attached to the application, rather than to Envoy.
Creating a centralized cluster of OPAs using dedicated hardware.
Limiting the use of OPA to creating the claims in JWTs, and modifying the applications to make access decisions based on those claims.
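The last recommendation moves OPA off the per-request path: OPA (or another issuer) embeds entitlements into a JWT once, and each application authorizes locally from those claims. A minimal sketch, assuming a hypothetical "roles" claim and an in-code grant table (a real service must also verify the JWT signature, which this sketch deliberately omits):

```python
# Sketch: local access decisions from JWT claims, instead of calling OPA
# per request. The "roles" claim name and the grant table are assumptions.
# NOTE: this parses but does NOT verify the token; production code must
# check the signature with a proper JWT library before trusting claims.
import base64
import json

def jwt_claims(token: str) -> dict:
    """Extract the (unverified) claims object from a compact JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def can_transact(claims: dict, action: str) -> bool:
    """Local decision: does any role carried in the token grant this action?"""
    grants = {"trader": {"create", "read"}, "auditor": {"read"}}
    return any(action in grants.get(role, set()) for role in claims.get("roles", []))
```

The trade-off is the usual one for claims-based designs: per-request latency drops to an in-process lookup, but policy changes only take effect when tokens are reissued.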
Ultimately, the customer decided to keep the solution as-is for the moment, and scale up the number of K8S nodes as required to meet their transactional needs.
We (PACLabs) continue to research this to determine what causes such disparate performance.