Zalando Postgres Operator: One Year Later

The Postgres operator provides a managed Postgres service for Kubernetes. It extends the Kubernetes API with a custom “postgresql” resource that describes desired characteristics of a Postgres cluster, monitors updates of this resource and adjusts Postgres clusters accordingly. Zalando successfully uses the operator to manage more than 450 Postgres clusters across a large number of Kubernetes installations.

Sergey Dudoladov

Software Engineer

Posted on Nov 26, 2018

Moving to production

More than a year and a half ago, Zalando prepared to run stateless and stateful applications alike on Kubernetes. With tens of teams working with hundreds of databases across multiple Kubernetes clusters, manual operations of any kind were out of the question. To keep the workload manageable, Zalando’s database team decided to automate its operational procedures. The operator pattern, well known in the Kubernetes universe, turned out to be a perfect fit for the job.

At present the operator manages more than 400 Postgres clusters at Zalando: it watches for additions, deletions and updates of Postgres manifests and automatically carries out all necessary actions on the clusters. This saves time for engineers and admins alike: instead of manually configuring numerous Kubernetes objects, they just submit a single YAML file describing the desired Postgres cluster setup, and the operator takes care of the rest.
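The watch-and-act loop described above can be sketched in a few lines of Go. This is a simplified illustration of the operator pattern and not the operator's actual code: the `Manifest`, `Event` and `Operator` types and the `Handle` method are hypothetical names made up for this example, the manifest is reduced to a name and an instance count, and "actions" are returned as plain strings instead of being applied to Kubernetes objects.

```go
package main

import "fmt"

// EventType is the kind of change observed on a manifest.
type EventType int

const (
	Added EventType = iota
	Updated
	Deleted
)

// Manifest is a hypothetical, heavily reduced stand-in for the
// "postgresql" custom resource: only a name and a replica count.
type Manifest struct {
	Name              string
	NumberOfInstances int
}

// Event pairs a change type with the manifest it concerns.
type Event struct {
	Type     EventType
	Manifest Manifest
}

// Operator remembers the last manifest it applied for each cluster.
type Operator struct {
	clusters map[string]Manifest
}

// NewOperator returns an operator that knows no clusters yet.
func NewOperator() *Operator {
	return &Operator{clusters: make(map[string]Manifest)}
}

// Handle reconciles a single event against the known state and
// returns a description of the action a real operator would take.
func (o *Operator) Handle(ev Event) string {
	m := ev.Manifest
	current, known := o.clusters[m.Name]
	switch ev.Type {
	case Added:
		if known {
			return fmt.Sprintf("skip %s: already exists", m.Name)
		}
		o.clusters[m.Name] = m
		return fmt.Sprintf("create %s with %d instances", m.Name, m.NumberOfInstances)
	case Updated:
		if !known {
			return fmt.Sprintf("skip %s: unknown cluster", m.Name)
		}
		if current == m {
			return fmt.Sprintf("no-op for %s", m.Name)
		}
		o.clusters[m.Name] = m
		return fmt.Sprintf("scale %s from %d to %d instances",
			m.Name, current.NumberOfInstances, m.NumberOfInstances)
	case Deleted:
		if !known {
			return fmt.Sprintf("skip %s: unknown cluster", m.Name)
		}
		delete(o.clusters, m.Name)
		return fmt.Sprintf("delete %s", m.Name)
	}
	return "ignore unknown event type"
}
```

Feeding `NewOperator().Handle` a sequence of `Added`, `Updated` and `Deleted` events walks a cluster through its lifecycle. The point of the design is that users only ever describe the desired state, while the comparison against the current state and the choice of action stay inside the operator.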

A year ago, the operator had just left the prototype stage and was still in its infancy. Since then we have extended it into a production-ready managed Postgres-on-Kubernetes service with numerous features, such as:

  1. Role-based access control: By its very nature, the operator requires broad permissions to operate databases in the Kubernetes environment. Given the importance of security, we factored out a separate operator-specific service account and used the RBAC capabilities of Kubernetes to define precisely the rights the operator requires, adhering to the principle of least privilege.
  2. Integration with external services: Postgres databases do not run in isolation but as part of a complex tech infrastructure, so seamless integration with existing tools matters a great deal for our customers' experience. Our generic sidecar container support enables running third-party applications side by side with the database in the same pods. One example is a Scalyr sidecar that ships the Postgres container logs to the Scalyr service transparently to the user, so that employees can use standard log processing tools.
  3. Shipping of Postgres logs to cloud storage: While Postgres normally rotates its log files within one week, the operator and Spilo, the Postgres container image it deploys, can join forces to continuously archive the database log history in the cloud for as long as necessary.
  4. Support for multiple namespaces: Namespaces enable us to better structure the applications of different teams within a single Kubernetes cluster. A typical use case involves running experiments in a dedicated namespace and then cleaning up the results that are no longer needed by simply dropping the namespace. To take full advantage of multiple namespaces, we designed and built into the operator the ability to manage databases running in namespaces other than the default one.
  5. API versioning: We keep an eye on the ongoing evolution of Kubernetes and adopt its most useful features promptly for the benefit of operator users. We recently started to use Kubernetes-standard code generation to implement the API of the “postgresql” custom resource. By doing so we introduced API versioning to the operator and greatly reduced the manual effort needed to support new Kubernetes versions within the operator codebase.
  6. Last but not least, we recognized the ever-increasing adoption of our software and for that reason contributed documentation that makes it easier to run this service in environments other than ours.

Our efforts culminated in the release of the operator’s first stable version in August 2018. As the software we have built proved to be such a success within Zalando, we reached out to the broader cloud computing community to share our experience of developing and operating a managed stateful service on top of Kubernetes. We are pleased to share our achievements with the community at top-tier industry conferences such as FOSDEM 2018 and KubeCon North America 2018.

Want to delve in?

If you want to know more, check out our talks for a deeper technical perspective on what we are doing. For those of you who want hands-on experience with technologies such as Postgres, Kubernetes, or Go in a thriving open-source environment, we have prepared a list of good first issues. Finally, we are always looking for new team members who are eager to work with us full-time on the Zalando database infrastructure.
