Open Policy Agent in Skipper Ingress

Zalando has integrated Open Policy Agent (OPA) into Skipper, our open-source Ingress controller, to provide Authorization as a Service. It aims to simplify the developer experience and provides observability out of the box.

Magnus Jungsbluth

Senior Principal Software Engineer

Sandor Szücs

Principal Software Engineer

Posted on Dec 06, 2024

Introduction

At Zalando, we continuously strive to enhance our platform capabilities to provide robust, scalable, and developer-friendly solutions. One such initiative is the integration of Open Policy Agent (OPA) into Skipper, our open-source ingress controller and reverse proxy, to deliver Authorization as a Service. This integration allows authorization policies to be externalised and aligns with our goal of solving security concerns at the infrastructure level with efficiency and developer experience in mind. It simplifies the developer experience by embedding OPA as a library within Skipper and allows multiple virtual OPA instances to coexist within a single Skipper process. Enabling OPA for a specific application is as easy as stating “application X should be protected”, without touching multiple YAML files, adding monitoring, or inheriting many more responsibilities to be compliant.

Goals

Our primary goals for integrating OPA into Skipper include:

Externalised Authorization: Embedding OPA into Skipper provides a powerful and flexible policy engine as a platform feature. This enables our engineering teams to leverage externalised authorization policies without additional overhead.
Clear Responsibility Split: The integration allows a clear delineation of responsibilities: platform teams manage the core authorization infrastructure while application teams focus on application-specific policies, ensuring efficiency and security.
Scalability: The implementation is designed to handle millions of policy decisions per second, scaling with the demands of our extensive application landscape.
Enhanced Developer Experience: We prioritise making it straightforward for developers to enable authorization in their applications, reducing complexity and time required to implement secure access controls.

Developer Experience

To use the OPA integration in Skipper on Kubernetes, engineers add the opaAuthorizeRequest filter to their Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    zalando.org/skipper-filter: |
      opaAuthorizeRequest("my-application")
  labels:
    application: my-application
  name: my-application
spec:
  rules:
  - host: zalando.example
    http:
      paths:
      - backend:
          service:
            name: my-application
            port:
              number: 8080
        pathType: ImplementationSpecific

Explanation

The zalando.org/skipper-filter annotation specifies the Skipper filter that is applied to all routes in this Ingress manifest. In this example, the opaAuthorizeRequest filter is configured with one parameter: "my-application" (the name of the OPA policy bundle and also the registered ID of the application to be protected).

This is the only infrastructure setup required from engineers to authorise requests for their application. Specifics like which paths to protect and authoring rules using Rego, the policy language of Open Policy Agent, are decentrally managed in the application's Git repositories.

Skipper for Kubernetes Ingress

We use Skipper, our HTTP reverse proxy for service composition, to implement the control plane and data plane of Kubernetes ingress and routegroups. Creating an ingress results in an AWS NLB with TLS termination that targets Skipper (set up via kube-ingress-aws-controller), HTTP routes in Skipper, and a DNS name pointing to the NLB (set up via external-dns). To put the deployment in context, this is the scale we operate at:

15,000 Ingresses and 5,000 routegroups
traffic of up to 2,000,000 requests per second
80-90% of our traffic consists of authenticated service-to-service calls, with daily numbers between 500,000 and 1,000,000 rps across our service fleet in total

Technical Design

To achieve these goals, several key technical decisions were made:

Alignment with OPA Envoy Plugin's Input Structures: We chose to align closely with the OPA Envoy plugin's input structures to leverage existing documentation, examples, and training resources. This minimises the learning curve for our developers and keeps Zalando-isms at bay.
OPA Embedded as a Library in Skipper: Embedding OPA directly within Skipper as a library ensures minimal latency in policy enforcement by keeping policy decisions local to the ingress data plane. It is also cost-efficient compared to running an OPA deployment per application or as sidecars.
Hide OPA Configuration from Engineers: To separate platform concerns from application concerns, we only expose the bundle name and additional context data as configuration to application engineers. How to run OPA and how it communicates with its control plane is configured and owned by platform engineers.
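
As a sketch of what "additional context data" can look like from an application engineer's side: Skipper's filter reference describes an optional second argument to opaAuthorizeRequest that is parsed as flat YAML with string values and made available to the policy under input.attributes.contextExtensions. The key com.example.tier and its value are hypothetical and only serve as an illustration.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # Second argument: flat YAML of string key/value pairs (hypothetical key and value).
    zalando.org/skipper-filter: |
      opaAuthorizeRequest("my-application", "com.example.tier: internal")
  labels:
    application: my-application
  name: my-application
spec:
  rules:
  - host: zalando.example
    http:
      paths:
      - backend:
          service:
            name: my-application
            port:
              number: 8080
        pathType: ImplementationSpecific
```

A Rego rule in the my-application bundle could then compare input.attributes.contextExtensions["com.example.tier"] with the expected value.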

Skipper can configure multiple routes that target different backend applications inside its surrounding Kubernetes cluster. OPA-enabled filters can be used in multiple routes or even multiple times in the same route.
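
For illustration, a RouteGroup along the following lines would attach OPA-enabled filters to two routes that target different backends. This is a minimal sketch: the application names app-a and app-b, the paths, and the ports are hypothetical, and each bundle name is assumed to match the application ID as described in this post.

```yaml
apiVersion: zalando.org/v1
kind: RouteGroup
metadata:
  name: example-routes
spec:
  hosts:
  - zalando.example
  backends:
  # Two hypothetical applications behind the same host.
  - name: app-a
    type: service
    serviceName: app-a
    servicePort: 8080
  - name: app-b
    type: service
    serviceName: app-b
    servicePort: 8080
  routes:
  # Each route is authorised against the policy bundle of the application it targets.
  - pathSubtree: /a
    filters:
    - opaAuthorizeRequest("app-a")
    backends:
    - backendName: app-a
  - pathSubtree: /b
    filters:
    - opaAuthorizeRequest("app-b")
    backends:
    - backendName: app-b
```

Routes that reference the same bundle name would share one virtual OPA instance inside the Skipper process.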

At Zalando, every application that is deployed to production must be registered first in our application registry. For structuring policies, we piggyback on this governance structure and expect application teams to have an OPA policy bundle per application which uses the application id in its name.

Inside Skipper, we create one virtual OPA instance per application that is referenced in at least one of the routes. This allows us to re-use memory and also provides a buffer against high-frequency route changes by having a grace period for garbage collection.

Figure: OPA instances within a Skipper process

To reduce the likelihood of outages due to an authorization infrastructure failure, we use AWS S3 and its availability promises as the source for policy bundles. Styra DAS, a commercial control plane for Open Policy Agent, is used to build the bundles and publish them to S3.

To capture observability metrics, we send spans for both authorization decisions and control plane traffic to Lightstep via OpenTelemetry. To complete the picture, Styra DAS also receives regular updates via the OPA status and decision log plugins.

This approach allows us to scale and fail-over despite failures of our OPA control plane and only depends on S3 being available.
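
To make the S3 dependency concrete, an OPA configuration for one virtual instance might look roughly like the sketch below. It uses standard OPA configuration keys (services, bundles, status, decision_logs). The bucket URL, the control plane URL, the resource path, and the polling and buffer values are placeholders and not our actual setup. With environment_credentials, OPA expects AWS credentials and the region to be provided through environment variables.

```yaml
services:
  # Bundle source: an S3 bucket (placeholder URL), requests signed with AWS credentials.
  bundle-store:
    url: https://example-opa-bundles.s3.eu-central-1.amazonaws.com
    credentials:
      s3_signing:
        environment_credentials: {}
  # Control plane that receives status updates and decision logs (placeholder URL).
  control-plane:
    url: https://opa-control-plane.example

bundles:
  my-application:
    service: bundle-store
    resource: bundles/my-application/bundle.tar.gz
    polling:
      min_delay_seconds: 60
      max_delay_seconds: 120

status:
  service: control-plane

decision_logs:
  service: control-plane
  reporting:
    # Upper bound for buffered decision logs, so that telemetry cannot exhaust memory.
    buffer_size_limit_bytes: 1048576
```

In this layout, authorization only needs the bundle-store service to be reachable. If the control-plane service is unavailable, status and decision log uploads fail, but bundles can still be fetched and policies evaluated.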

Figure: Architecture of the OPA control plane

Trade-Offs

The integration involved several trade-offs:

Latency vs. Memory Consumption: Embedding OPA reduces latency but increases memory consumption, raising the risk of out-of-memory (OOM) issues. We mitigated this by enforcing strict limits on bundle size and by capping the memory used by advanced features like request body parsing. Telemetry such as decision streaming and status reports also uses bounded data structures to avoid memory exhaustion.
Flexibility vs. Cost: While OPA offers great flexibility in defining policies, it can be more resource-intensive compared to simpler token validation methods that are implemented without a general purpose policy engine. However, we expect the benefits of fine-grained access control and externalised policy management to outweigh the additional computational costs.
OPA by default vs. on demand: OPA is enabled and bootstrapped only if the Kubernetes cluster is enabled to support OPA and at least one application in it uses OPA. Skipper instances which have OPA-enabled routes are generally scaled up to compensate for higher CPU consumption due to policy execution.
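
On the platform side, switching OPA support on for a cluster could look like the following fragment of a Skipper container spec. This is a sketch under assumptions: the flag names follow the Skipper documentation as we understand it and should be checked against the Skipper version in use, and the container name and the template path /etc/opa/opaconfig.yaml are hypothetical.

```yaml
containers:
- name: skipper-ingress
  args:
  - skipper
  - -kubernetes
  # Registers the OPA filters; instances are still only created for bundles that routes reference.
  - -enable-open-policy-agent
  # Platform-owned OPA configuration template (hypothetical path).
  - -open-policy-agent-config-template=/etc/opa/opaconfig.yaml
```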

Observability

Running any service in production requires solid observability to pinpoint issues quickly. If Skipper is configured to send OpenTelemetry Spans, the OPA filters in Skipper automatically send Spans for two paths:

Policy Decisions: Whenever the OPA filter is executed as part of a Skipper route, a Span is injected that captures relevant metadata like the decision ID, the decision outcome, the bundle name (in our case the application ID) and the labels of the running OPA instance. This allows linking directly to the full decision as stored in Styra DAS, and also allows deriving metrics in Lightstep based solely on the traces.
Control Plane Traffic: Whenever OPA calls out to the control plane to fetch bundles or when it reports status / decisions back to the control plane, a separate Trace is generated. This allows monitoring for errors in the basic setup or general problems with fetching bundles or control plane communication.

Figure: OPA Observability: Sample Trace

Figure: OPA Observability: Sample Span

Differences Between Envoy OPA Plugin and Skipper OPA Integration

Our OPA integration in Skipper introduces several unique features:

Multiple Virtual OPA Instances in one Deployment: Multiple virtual OPA instances can coexist within a single Skipper process, which provides low latency without a network hop and requires no extra OPA deployment. In a vanilla OPA deployment, you typically run one OPA process per application.
Serving HTTP Requests: OPA can serve authorization responses independently of the target application, useful for migrating existing legacy IAM services and supporting single-page applications (SPAs) that require precomputed authorization decisions or lists of permissions for the current users.
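
As a hedged illustration of serving responses directly from a policy, the RouteGroup below uses Skipper's opaServeResponse filter with a shunt backend, so that no application backend is involved. The path /permissions and the names are hypothetical. The policy in the bundle is assumed to produce the response (for example status, headers, and body) as part of its decision.

```yaml
apiVersion: zalando.org/v1
kind: RouteGroup
metadata:
  name: my-application-permissions
spec:
  hosts:
  - zalando.example
  backends:
  # A shunt backend: the response is produced inside Skipper and nothing is forwarded.
  - name: shunt
    type: shunt
  routes:
  - path: /permissions
    filters:
    - opaServeResponse("my-application")
    backends:
    - backendName: shunt
```

A single-page application could call such a route to retrieve the permissions of the current user before rendering.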

Conclusion

Integrating Open Policy Agent into Skipper marks a significant advancement in Zalando's platform capabilities. This integration not only enhances security and scalability but also empowers our developers with a robust, easy-to-use authorization service. By focusing on developer experience and maintaining a high-performance standard, we ensure that our platform remains at the forefront of technological innovation. On our journey so far, OPA has mostly been used in employee- or partner-facing applications and APIs, where access models and authorization rules are generally more complex.