Simplicity by Distributing Complexity

Building an aggregated view of data in the event-driven microservice architecture

Big Data Software Engineer

Posted on Jan 23, 2018

Tags:

Building an aggregated view of data in the event-driven microservice architecture

In the world of microservices, where a domain model gets decomposed into related, but independently handled entities, we often face the challenge of building an aggregate view of the data that brings together different parts of that model. While this can already be interesting with “traditional” designs, the move to event-driven architectures can magnify these difficulties, especially with simplistic event streams.

In this post, I'll describe how we tackled this when building Zalando’s Smart Product Platform (SPP); how we initially got it wrong by trying to solve all problems in one place, and how we fixed it by "distributing" pieces of that complexity. I'll show how stepping back and taking a fresh look at the problem can lead to a much cleaner, simpler, more maintainable and less error-prone solution.

The Challenge

The SPP is the IT backbone of Zalando’s business. It consists of many smaller components focused on the ultimate goal of making articles sellable in Zalando’s online stores. One of the “earliest” stages in the pipeline is the Article ingestion, built as a part of Smart Product Platform. That’s the part that I, together with my colleagues, am responsible for.

Long story short, we built a system that allows for a very flexible Article data model based on a few “core” entities representing the various pieces of data we need in the system. For simplicity, in this blog post, I’m going to limit the core model to:

Product - core entity representing products that Zalando sells, which can form a hierarchy itself (Product being a child of another Product); all the entities need to be associated – either directly or indirectly – with a single Product that’s the “root” (topmost entity) of the hierarchy,
Media - photos, videos, etc., associated with a Product,
Enrichments - additional pieces of data associated with either Product or Media entity.

Sample hierarchy constructed from the building blocks described above may look like this:

null

Since we decided to adopt a “ third generation Microservices” architecture focusing on “event-first development”, we ended up with a bunch of CRUD-like (i) microservices (service per entity) producing ordered streams of events describing the current “state of the world” to their own Kafka topic. Clients would then build the Article (which is an aggregate of Product, Media and Enrichment information) by making subsequent calls to all the services, starting from a Product and then adding other pieces of data as required.

What’s important is that all the services were ensuring the correct ordering of operations. For instance, it’s not possible to create a Media entity associated with a non-existent Product, or a “child” Product whose “parent” Product wasn’t yet created (implementation detail: we make HEAD requests to the referenced entity’s service REST API to ensure it).

Initially, we thought that this was enough; we would expose these streams to the consumers, and they would do The Right Thing™, which is merging the Product, Media and Enrichment data into the aggregated Article view.

i) CRU, rather than CRUD - the “D” part was intentionally left out.

null

This approach, however, had some significant drawbacks; one being that consumers of our data needed the aggregated Article view. They rarely cared only about the bits of information we exposed. This would mean that the non-trivial logic responsible for combining different streams of data would have to be re-implemented by many teams across Zalando.

We decided to bite the bullet and solve this problem for everyone.

Gozer - the reason to cross the streams

Gozer, as we called it, is an application whose only purpose was “merging” the inbound streams of the data – initially only Products and Media – and use them to build an aggregated Article view that’s exposed to the consumers who need it.

null

As simple as it sounds, that each service was publishing its stream and there were no ordering guarantees across different streams was making Gozer’s implementation non-trivial. We knew that entities were created and published in the correct order, but it didn’t guarantee the ordering of consumption. To account for that, whenever we consume an entity that’s associated with another entity not yet received by Gozer (e.g. Media event for a Product), we fetch the missing data using the service’s REST API.

Once we have the data, we have to sequence all the events correctly; make the “root” Product go first and the rest of the hierarchy follow in the way that a node is followed by its children. We use Kafka, so it’s to make sure that for all the entities in the hierarchy, the ID of the “root” Product is used as Partition Key. This will become important later. To do it in a performant way, we need to keep track of the whole hierarchy and its “root” (and some other information, but I’m going to ignore it for simplicity), which added more complexity and a significant performance penalty to the processing.

Then sequenced entities are published in the correct order to an “intermediate” Kafka topic, so in the next step they can be processed and merged.

This whole logic, extra service calls, local hierarchy tracking and event sequencing, added some complexity to the code, but at the time we were happy with the outcome. We had reasonably simple REST APIs and a single place handling the merge complexity. This looked reasonable and quite clean at the time.

Unfortunately, it didn’t stay like this for too long. Soon we added handling of the Enrichments inbound stream and some other features. This added complexity to the sequencing and merging logic and resulted in even more dependencies on REST APIs for fetching the missing data. Code was becoming more and more convoluted and processing was becoming slower. Making changes to the Gozer codebase was becoming a pain.

null

To visualise this complexity, below you can see the interaction diagram that my colleague, Conor Clifford drew, which describes the creation of a 2-level Product hierarchy with a single Media and a single Enrichment for that Media. Don’t worry if you can’t see the details; it’s the number of interactions and services involved that matters, showing the scale of the problem:

null

Note that we’re dealing with millions of Products with tens of Media and Enrichments. What you see above is the unrealistically simplified version. The real issue was much, much bigger.

But it wasn’t the end. As our data model grew, not only more inbound streams were about to arise, but we also started discussing the need for adding outbound streams for other entities in the future.

At this time, a significant amount of Gozer’s code and processing power was dedicated to dealing with issues caused by the lack of ordering across all the inbound streams, which was guaranteed at the entity creation time (remember the HEAD checks I mentioned earlier?), but lost later. The fact that we were not maintaining this ordering all the way down to Gozer because of having a stream per entity was causing us significant pain when we had to deal with out-of-order events.

We realised that we were giving up a very important property of our system (ordering) because it was “convenient” and looked “clean”, only to introduce a lot of complexity and put significant effort into reclaiming it back later.

This was something that we needed to change.

Vigo to the rescue

Following the established naming convention, we decided to rework Gozer by creating Vigo; an application whose purpose was the same as Gozer’s, but the approach we took this time was substantially different.

The main difference was that this time we wanted to ensure that the order of events received by Vigo was guaranteed to be correct. This way Vigo wouldn’t be responsible for “merging streams” and fetching missing data as before. It would consume entity-related events as they come and it’s only purpose would be to produce the “aggregate” event correctly. This design would have two main benefits:

Ordered events mean no “missing” data when events are delivered in an incorrect order, so the application architecture (sequencing step, additional Kafka topic) and logic (handling of the “missing entity” case, fetching it via REST API) are simplified,
No external calls are required to fetch missing entities; a massive performance gain.

As much as we cared for performance, since we were about to add even more functionality to Gozer, the simplicity argument was the main driving force to make the Vigo project happen.

We knew what we want Vigo to be, but we had to figure out how to get there; how to create that single, ordered stream of all the entity-change events.

One could ask, “Why not just make all the services publish to one Kafka topic? This would ensure the ordering and it is simple to do”:

null

Unfortunately, it’s not that simple in our case. I mentioned earlier that all the entities in our system build a hierarchy and need to be processed in the context of a Product. More precisely, to know what partition key to use for an entity, we need to know its “root” Product entity ID, which is the very top of the whole hierarchy. That’s where this approach gets a bit tricky…

Let's consider a Product that has a Media associated with it. That Media has an Enrichment. This Enrichment only 'knows' which Media it's defined for, but has no 'knowledge' on the Product (and its ID) that the Media is for. From the Enrichment's perspective, to get the Product ID we need, we must either:

Make the Enrichment service query the Media service for information about the Product that given Media is assigned to (meaning that Media would be a “proxy” for Product API),
Make the Enrichment service “understand” this kind of relationship and make it query the Product service directly, asking for a Product ID for a Media that the Enrichment is assigned to.

Both of these solutions sound bad to us: they break encapsulation and leak the details of the “global” design all over the system. Services would become responsible for things that, we believe, they shouldn’t be. They should only “understand” the concept of the “parent entity” and they should only interact directly with the services that are responsible for their parents.

This leads us to the third option; a bit more complex than the simplistic, ideal approach described above, but still significantly cleaner than what we had before in Gozer.

This complexity wouldn’t completely disappear, of course. We still have to:

enforce the ordering of events across different streams,
ensure entities are processed in the context of their parents.

To achieve the above, our services needed to become a bit smarter. They would need to first consume the outbound streams of the services responsible for entities that depend on them (e.g. Product stream would consume Media and Enrichment streams) and then publish the received entities into a single, ordered Kafka topic (partitioned by Product ID, because it’s the “root” entity) after the entity they’re associated with.

This approach can be “decomposed” and presented in a more abstract form as a service X consuming N inbound streams (containing “dependent” entities), and multiplexing the events received with the entities it’s responsible for (X) into a single, ordered outbound topic.

null

This service’s outbound topic may then become an inbound topic for another service and so on, which means that these small blocks can be composed into a more complex structure maintaining the ordering across all the entities, but still allow them to process all the entities in a context of their parents.

Putting all the building blocks together, the final design looked like this:

null

with Product service’s outbound queue containing something like:

null

This queue contains all the entities in the correct order, so they can be consumed and processed by Vigo “as is”, without considering the case of missing data and making any external calls.

While the “big picture” looks more complex right now it’s important to remember that a single engineer will rarely (almost never) deal with all that complexity at once in their daily development work as it will usually be done within the boundaries of a single service. Before Vigo, Gozer was a single place where all this complexity (and more!) was accumulated, sitting and waiting for an unsuspecting engineer to come and get swallowed.

Also, do you remember the interaction diagram I showed you earlier? This is the same interaction after making the discussed changes:

null

Again, don’t worry about the details - it’s about the number of interactions and services involved. The difference should be apparent.

Was it worth it?

As I was hopefully able to show you, we removed a lot of complexity from the “aggregating” service, but it came at a price. We had to add some complexity to other services in our Platform; this is how we coined the term “simplicity by distributing complexity”. While we think there’s no “silver bullet” solution here, the benefits of the second design (Vigo) make it superior to the original solution (Gozer) for at least few reasons:

It’s easier to understand; fewer cases to handle, less special cases, less external dependencies make it easier to reason about the service and create a mental model of it.
It’s easier to test - there’s less code, less possible scenarios to test overall, no REST services to take into account.
It’s easier to reason about and debug - monitoring a couple of consumers with cross-dependencies and making external calls is much more challenging than doing the same for a single one.
More extensible and composable - adding new data flows and streams becomes a much smaller undertaking.
It’s more resilient - again, no external REST services to call means that it’s less likely that a problem with other services will stop us from aggregating the data that’s waiting to be aggregated.

What’s worth noting is that the last point (resiliency) is true for the system as a whole as well. These REST calls weren’t moved anywhere, they simply disappeared: they’re now handled by moving data through Kafka (which we already have a hard dependency on).

**What we noticed is that while complexity grouped in a single place tends to “multiply” (ii), the similar amount of complexity spread across many parts of the system is easier to handle and only “adds up”.

**This only applies for instances when complexity is spread by design, put where it “logically” belongs; not just randomly (or even worse: accidentally) thrown all over the place!

“Distributing the complexity” is not a free lunch. In the same way that Microservice architecture distributes the monolith’s complexity into smaller, self-contained services at the price of the general operational overhead, our approach massively reduced the pain related to the complexity of a single service, and yet, resulted in adding a few small moving pieces to a couple of other places in the system.

Overall: yes, we think it was worth it.

ii) Of course this mathematical / numerical interpretation assumes that complexity has to be greater than 1

We're hiring! Do you like working in an ever evolving organization such as Zalando? Consider joining our teams as a Software Engineer!