In our microservices architecture, services communicate both asynchronously via events and synchronously via REST calls. Frequently, a synchronous REST call modifies data in a data store and emits an event based on the changes made. Publishing data change events can be decoupled from performing the changes in the data store in order to increase the resilience of the application.
We will show how this is achieved with the Transactional Outbox pattern, presenting a cloud-native approach utilizing Amazon DynamoDB, Amazon DynamoDB Streams and AWS Lambda.
In Zalando Payments we have a service, called Order Store, that stores payment-related data for a given order in a DynamoDB table. Updating this data happens via a synchronous REST call. Changes to the stored payment information need to be propagated to other services too, which is realized by sending events to Nakadi, Zalando's message bus.
Initially, the service created/updated data in DynamoDB and then sent events to Nakadi to inform other services about the change in payment information. This meant the service had two downstream dependencies to complete the request, namely the database and the message bus. As the availability of a service is the product of the availabilities of its dependencies, the more dependencies a service has, the lower its own availability. Let's assume DynamoDB and the message bus have availabilities of 99.9% each. Thus, the maximum availability for the service is
99.9% × 99.9% ≈ 99.8%.
Aiming for the highest possible availability, reducing the dependencies to DynamoDB alone results in higher availability for the service. After explaining the Transactional Outbox pattern, we will present a concrete solution, the technologies it comprises, and how we decoupled the process.
Let us look at the underlying concept of how to decouple the data update and the event publication. The pattern we are describing here is known as Transactional Outbox. Our goal is that a service, synchronously called via a REST API, creates, updates or deletes a data store entry and also propagates the change to other services via messaging. However, publishing the message is decoupled from updating the data store.
In this drawing we provide the setup of the environment. Our flow consists of four steps, where the starting point is a synchronous call that triggers further actions.
Change Entry and Populate Outbox
After the call is received, the service triggers a change to an entry in the data store. This is denoted with step 1. The actions that trigger a change are Create, Update or Delete, as a Read operation would not alter any data. Modifying data in the data store is transactional, and once the transaction is successfully completed, the service already returns a success response code to its caller.
As part of the transaction in the data store, the actual data change is also written to an outbox. This is depicted in step 1.5. The outbox can be thought of as an append-only log: each data change operation in the data store produces an entry in the outbox.
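To make steps 1 and 1.5 atomic, both writes can go into a single transaction. Below is a minimal sketch using DynamoDB's `TransactWriteItems` request shape; the table names, attribute layout and helper function are invented for illustration and are not the actual Order Store schema:

```python
import json
import uuid
from datetime import datetime, timezone

def build_outbox_transaction(order_id, payment_data,
                             data_table="order-store",
                             outbox_table="order-store-outbox"):
    """Build a DynamoDB TransactWriteItems request that updates the data
    entry and appends the change to the outbox in one atomic transaction.
    Table names and attributes are hypothetical."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "TransactItems": [
            {   # step 1: the actual data change
                "Put": {
                    "TableName": data_table,
                    "Item": {
                        "order_id": {"S": order_id},
                        "payment": {"S": json.dumps(payment_data)},
                        "updated_at": {"S": now},
                    },
                }
            },
            {   # step 1.5: the outbox entry recording the change
                "Put": {
                    "TableName": outbox_table,
                    "Item": {
                        "event_id": {"S": str(uuid.uuid4())},
                        "order_id": {"S": order_id},
                        "change": {"S": json.dumps(payment_data)},
                        "created_at": {"S": now},
                        "consumed": {"BOOL": False},
                    },
                }
            },
        ]
    }

request = build_outbox_transaction(
    "order-123", {"method": "credit_card", "state": "AUTHORIZED"})
```

A real service would hand this request to boto3, e.g. `boto3.client("dynamodb").transact_write_items(**request)`; if either write fails, neither is applied.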
Consume Outbox and Publish Event
The transaction in the data store was successful and the data entry got updated or created, so a new entry now exists in the outbox. A so-called message relay reads that entry from the outbox: the outbox notifies the message relay of the new entry, and upon notification the relay consumes it. This is depicted with number 2. Upon consumption, the message relay extracts the data, transforms it into an event and publishes it, marked in the diagram with 2.5. Only after successful publication is the entry marked as consumed.
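One pass of the relay can be sketched as follows; `relay_once`, the entry layout and the injected `publish` callable are all hypothetical, chosen only to show that an entry is marked consumed strictly after successful publication:

```python
def relay_once(outbox_entries, publish):
    """One pass of a hypothetical message relay: read unconsumed outbox
    entries, publish each as an event, and mark an entry consumed only
    after publication succeeded."""
    for entry in outbox_entries:
        if entry.get("consumed"):
            continue  # already published in an earlier pass
        event = {"order_id": entry["order_id"], "data": entry["change"]}
        try:
            publish(event)            # step 2.5: publish to the message bus
            entry["consumed"] = True  # marked consumed only after success
        except Exception:
            pass  # left unconsumed, retried on the next pass
```

If publishing fails, the entry simply stays unconsumed and is picked up again, which gives at-least-once delivery.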
After describing the pattern, we now want to present the concrete solution. In order to decouple the asynchronous event emission from the synchronous process, we take advantage of various cloud services AWS has to offer.
The following diagram shows the complete flow from a synchronous REST API call to the publication of the Nakadi event following the new approach:
Recently, DynamoDB was extended with a Change Data Capture implementation: DynamoDB Streams. Once activated, whenever an item in the DynamoDB table is changed (added, updated or deleted), a corresponding dataset is sent to the stream. In our case this dataset contains the old image (the table item before the change) and the new image (the table item after the change); which images AWS exposes to the DynamoDB stream is configurable. With both these images we are now able to assemble a corresponding Nakadi event using AWS Lambda.
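The stream record handed to a consumer carries the images in DynamoDB's typed attribute format ({"S": …}, {"N": …}, …), which has to be unwrapped before use. A minimal sketch, with made-up item attributes:

```python
def deserialize(image):
    """Convert a DynamoDB-typed image into a plain dict.
    Covers only the attribute types used in this illustration."""
    out = {}
    for key, typed in image.items():
        (t, v), = typed.items()  # each attribute has exactly one type tag
        if t == "N":
            out[key] = float(v) if "." in v else int(v)
        else:  # "S", "BOOL" and anything else kept as-is
            out[key] = v
    return out

# Shape of a stream record when the stream exposes both images
# (attribute values here are made up):
record = {
    "eventName": "MODIFY",
    "dynamodb": {
        "OldImage": {"order_id": {"S": "order-123"}, "state": {"S": "PENDING"}},
        "NewImage": {"order_id": {"S": "order-123"}, "state": {"S": "AUTHORIZED"}},
    },
}
old = deserialize(record["dynamodb"]["OldImage"])
new = deserialize(record["dynamodb"]["NewImage"])
```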
The trigger for our AWS Lambda is a DynamoDB Stream item. We chose Python for our implementation as it is more lightweight than Java. The Lambda function receives the item containing the old and new image, assembles the data change event, which contains the complete item after its change as well as a patch node containing the diff, and finally publishes the assembled event to Nakadi.
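The assembly step can be sketched like this, assuming the images were already deserialized into plain dicts; the field names `data_op`, `data` and `patch` are illustrative, not the exact Nakadi event schema:

```python
def assemble_change_event(record):
    """Assemble a data change event from an already-deserialized stream
    record: the full item after the change plus a patch node holding only
    the attributes that changed. Record layout is hypothetical."""
    old = record.get("old_image") or {}
    new = record.get("new_image") or {}
    # patch node: attributes whose value changed or was newly added
    patch = {k: v for k, v in new.items() if old.get(k) != v}
    return {
        "data_op": record["event_name"],  # INSERT / MODIFY / REMOVE
        "data": new,                      # complete item after the change
        "patch": patch,                   # the diff between both images
    }

event = assemble_change_event({
    "event_name": "MODIFY",
    "old_image": {"order_id": "order-123", "state": "PENDING"},
    "new_image": {"order_id": "order-123", "state": "AUTHORIZED"},
})
```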
In case publication to Nakadi fails, e.g. due to timeouts, the request is retried. If all retries fail, we make use of an AWS SQS queue as fallback storage, which is explained further in the next chapter. This also means that we do not guarantee that the events are published in the correct order.
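The retry-then-fallback behaviour can be sketched as follows; `publish` and `send_to_queue` stand in for the Nakadi client and the SQS client, and the attempt count and delays are made up:

```python
import time

def publish_with_retries(event, publish, send_to_queue, max_attempts=3):
    """Try to publish an event with exponential backoff; after the final
    failure, fall back to a queue so the event is not lost. The injected
    callables and parameters are hypothetical."""
    delay = 0.1
    for attempt in range(max_attempts):
        try:
            publish(event)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(delay)
                delay *= 2  # exponential backoff between attempts
    send_to_queue(event)  # fallback storage: retried later from the queue
    return False
```

Note that falling back to the queue is exactly what breaks strict ordering: a failed event is republished later, possibly after newer events already went out.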
AWS SQS & Kubernetes CronJob
AWS SQS is a message queue service. When creating a new AWS Lambda function, it already comes with an AWS SQS queue attached as a dead letter queue. This queue ensures that no events are lost in case of a failed publication or, even worse, a temporary outage. Whenever Nakadi event publishing fails, the event is sent to the dead letter queue. Retries with exponential backoff are in place to minimize the number of unpublishable events that end up in the dead letter queue. In order to retry sending the queued events in intervals, we created a Kubernetes CronJob. The CronJob simply runs the same Python code as the AWS Lambda and tries to publish the events to Nakadi again. Once publication eventually succeeds, the event is removed from the SQS queue.
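One run of such a CronJob might look like this sketch; `receive`, `publish` and `delete` stand in for the SQS and Nakadi client calls, and the key property shown is that an event is deleted from the queue only after successful publication:

```python
def redrive_once(receive, publish, delete):
    """One scheduled run of a hypothetical redrive job: read queued events,
    try publishing each to the message bus again, and delete an event from
    the queue only when publishing succeeded."""
    republished = 0
    for handle, event in receive():
        try:
            publish(event)
            delete(handle)   # remove from the queue only after success
            republished += 1
        except Exception:
            pass  # stays in the queue for the next scheduled run
    return republished
```

With real AWS clients, `receive` would map to `sqs.receive_message` and `delete` to `sqs.delete_message` using the message's receipt handle.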
We successfully decoupled synchronous data changes from eventually consistent event publishing. By reducing dependencies, we increased the resilience of our service. Besides improving the architecture, the team also got to work with DynamoDB Streams and AWS Lambda for the first time, a great opportunity to learn about AWS technologies. Having implemented this pattern, we are working with our infrastructure teams to offer an implementation of it to all teams at Zalando. We already have an implementation of the Transactional Outbox for PostgreSQL, managed centrally via a Kubernetes operator.
If you would like to work on similar challenges, consider joining our engineering teams at Zalando Payments.