Data Integration in a World of Microservices

Read about Saiki: our open-source, cloud-based, microservices-friendly data integration infrastructure.

Posted on Sep 21, 2015

Tags:

There’s not much one can say in favor of big, monolithic enterprise applications. Not only does the typical monolith become an inextricable mess of interdependencies — it also hinders development teams from moving quickly when trying out new ideas. Yet the monolith does present one advantage: Its consolidation of data in a single place makes business intelligence fairly simple. For example, if you want to analyze yesterday’s sales you can — at least in principle — simply query your operational database.

The more modular your enterprise application becomes, the more work you have to do to bring all of your data together. At Zalando, our technology infrastructure has been fairly modular for quite some time. With our adoption of team autonomy and microservices enables usi to facilitate large-scale growth of our massive business while reinforcing the agility of a startup.

For those of us on Zalando’s Business Intelligence team, microservices have brought about some interesting challenges in terms of how we manage our data. As part of our learning process, we recently designed and built Saiki: a scalable, cloud-based data integration infrastructure that makes data from our many microservices readily available for analytical teams. Named after the Javanese word for “queue,” Saiki is built mostly in Python and includes components that provide a scalable Change Data Capture infrastructure, consume PostgreSQL replication logs, and perform other relevant tasks.

Why Saiki

Even before we adopted microservices at scale, questions like “how many shoes did we sell yesterday?” presumed the prior integration of data distributed over a significant number of sources. Article as well as order data is horizontally sharded over eight PostgreSQL databases, so there is no way to simply fire up some ad hoc SQL to do a quick analysis. Before analyzing the data, we have to move it to a single place. Our core, Oracle-based data warehouse, where information from numerous source systems is integrated, has always been a critical component of Zalando’s data infrastructure. Without it, all but the simplest analytical tasks are futile.

With each service owning its data, data is spread over a significantly larger number of cloud-based microservices — each of which can use individual persistence mechanisms and storage technologies. No small challenge. Adding to the complexity is that Zalando is fiercely data-driven: At any given moment, several of our teams are working on large-scale data analysis projects using a vast number of different systems and tools. Meanwhile, other teams are busy exploring ways to better distribute this data across multiple applications. Finally, we need to make the right data available to our various customers at the right times, and in the right formats.

Enter Saiki, which manages all of this data integration and distribution with Amazon Web Services (AWS). It makes use of STUPS, our open-source Platform as a Service (PaaS), which allows for a secure and audit-compliant handling of the data involved.

How Saiki Works

We no longer live in a world of static data sets, but are instead confronted with an endless stream of events that constantly inform us about relevant happenings from all over the enterprise. Whenever someone creates a sales order, or packs an item in one of our warehouses, a dedicated event notice will be created by the respective microservice. Handling this stream of events involves roughly four main tasks: Accessing these events, storing them in an appropriate structure, processing them, and finally distributing them to various targets. We built Saiki and its components to do all these tasks.

Accessing: Typically a microservice or application has to push event notifications to one of our APIs. When programmatically pushing messages to our API is not an option, we can use Saiki Banyu — a tool that listens to PostgreSQL's logical replication stream and converts it into a series of JSON objects. These objects are then pushed to our central API. This approach allows for a non-intrusive and reliable Change Data Capture of PostgreSQL databases.
Storing: As the backbone for queueing the stream of event data, we chose Apache Kafka, a system with impressive capabilities. We deployed Kafka and Apache Zookeeper in AWS to make our event data available for downstream systems.
Having access to an event stream opens up a lot of new options for data processing. We plan to move our near-realtime business process monitoring application to this new system, and hope to become less dependent on huge, nightly ETL batch runs— moving closer to near-real-time data processing.
Distributing: We’re also investigating possible data sinks for data distribution. Nganggo, a project now underway, is a working prototype of a fast materialization engine that writes event data to a central Operational Data Store powered by PostgreSQL. We’re also working on a service that makes change data available for batch imports into other systems both inside and outside of AWS (for instance, our core data warehouse). Finally, we plan to use our S3 data lake to provide standardized data sets for further large-scale analyses.

Our team is polishing the first parts of Saiki for production use, but our data integration adventure has only just begun. We will update you as our work progresses!

We're hiring! Do you like working in an ever evolving organization such as Zalando? Consider joining our teams as a Software Engineer!