Parallel Run Pattern - A Migration Technique in Microservices Architecture

Learn how we leveraged the parallel run pattern to decompose a high-traffic monolith into smaller microservices

Ali Sabzevari

Software Engineer

Francesco Sarracco

Software Engineer

Posted on Nov 04, 2021

The business landscape in Zalando is growing every day. This continuous growth means we need to be able to cope with an ever-changing environment. Everyone with experience in software development knows that dealing with change is challenging, especially when the software is already running in production. Changing software in production is like changing the tires on a car while it is still moving.

In large organisations such as Zalando, where microservices architecture is the standard, changes are even more frequent. Technologies become obsolete, organization structures change, teams split or merge, monoliths are being rewritten, and yesterday's microservices become today's monoliths. All those examples impose dramatic changes in codebases.

Naturally, testing is the first solution that comes to mind when trying to minimize regressions caused by a change. But in scenarios like decomposing a monolith or replacing a legacy component with a newer one, testing might not be enough. Furthermore, there are always dark corners in our systems that we have never tested or whose behavior we no longer know. Sometimes, as you may well know from your own experience, legacy systems don't even have tests one can use as a reference.

In this article, we will explore a design pattern called the Parallel Run1, a strategy to make sure such dramatic changes will not break the system. We will walk you through a real-world example, describe how we replaced a service by taking advantage of this pattern, and show you the challenges and surprises we dealt with. At the end, we summarize the upsides and downsides of this pattern to help you decide when to implement it and when not to.

Decomposing the monolith, a case study

Zalando is aiming to unify the user experience across platforms2. As part of this effort we, the Returns team, were required to extract the returns logic out of a soon-to-be legacy monolithic application. Returns logic, as the name might imply, deals with everything to do with customers returning articles they've bought on the Zalando Fashion Store. This article will explore how our team used the Parallel Run pattern to transparently and safely extract the returns logic from the monolith to the new Returns microservice.

Decomposing the monolith

This new service had to behave exactly like the respective part of the monolith, and customers should not notice any difference after the migration. In order to achieve this, the following complications needed to be overcome:

  • While reading the old code is possible, we might miss some parts of the logic or misunderstand the code.
  • Some parts of the code are not tested, so running the tests against the new code (where possible) would not guarantee identical behavior.
  • The criticality of the application precludes downtime.

Parallel Run Pattern

In order to solve these problems, wouldn't it be nice if we could verify that each request handled by the new system is handled in exactly the same way as by the system currently running in production? The parallel run pattern does exactly that.

When using a parallel run, rather than calling either the old or the new implementation, instead we call both, allowing us to compare the results to ensure they are equivalent. Despite calling both implementations, only one is considered the source of truth at any given time. Typically, the old implementation is considered the source of truth until the ongoing verification reveals that we can trust our new implementation.

-- Sam Newman, Monolith to microservices

Implementation

There are several ways of implementing this pattern. Hereafter we present how we solved it for the above use case.

The following diagram shows the flow for each incoming request:

Parallel run sequence diagram

  • (1-2) The Client makes a request that is immediately processed and answered by the monolith, to avoid any degradation in performance.
  • (3-4) After responding, the monolith POSTs a request to the /consistency-checks endpoint of the new Returns microservice, which immediately answers with 202 (Accepted), indicating that the request will be handled asynchronously. This way the monolith does not have to wait, and its resources are freed.
  • (5-6-7) The Returns microservice starts processing the request in the background by first re-issuing the same request to itself, but against the actual endpoint.
  • (8) The response from the Returns microservice is then collected and compared with the one from the monolith.
  • (9) Finally, metrics and logs about the consistency are produced, to later verify that the expected consistency is reached and to investigate cases of inconsistency.
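The flow above can be sketched roughly as follows. This is a simplified, illustrative model, not the actual service code: `accept_consistency_check`, `run_pending_checks`, and the injected `replay_request`/`record_metric` callables are hypothetical names, and the real service used a proper async executor rather than an in-process queue.

```python
from queue import Queue

# Background work queue; in the real service this was an async executor.
checks = Queue()

def accept_consistency_check(payload):
    """Steps 3-4: enqueue the recorded request/response pair and
    immediately answer 202 (Accepted), freeing the monolith."""
    checks.put(payload)
    return 202

def run_pending_checks(replay_request, record_metric):
    """Steps 5-9: re-issue each recorded request against the new
    service, compare the answers, and emit a metric per endpoint.

    replay_request: re-issues the original request against the new
      service and returns a {"status", "headers", "body"} dict.
    record_metric: callable(operation, outcome) feeding monitoring.
    """
    while not checks.empty():
        payload = checks.get()
        expected = payload["response"]               # monolith's answer
        actual = replay_request(payload["request"])  # steps 5-7
        matched = (actual["status"] == expected["status"]
                   and actual["body"] == expected["body"])  # step 8
        record_metric(payload["request"]["url"]["path"],
                      "matched" if matched else "unmatched")  # step 9
```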

The async request sent to the ConsistencyChecker component of the Returns microservice contains information about the original request: the URL with its query params, the method, the headers and, when present, the body. This information represents the new request to be sent to the Returns microservice. It also includes the HttpStatus, the headers, and the body of the response returned by the monolith, to be checked against the response from the Returns microservice.

The following is an example of the structure that we used:

{
  "request": {
    "url": {
      "path": "api/example?param=something"
    },
    "headers": {
      "Content-Type": "application/json;charset=UTF-8",
      "Accept-Language": "de-DE"
    },
    "method": "GET",
    "body": null
  },
  "response": {
    "status": 200,
    "headers": {
      "Content-Type": "application/json;charset=UTF-8",
      "transfer-encoding": "chunked"
    },
    "body": "json-response-body"
  }
}
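Given two payloads of this shape, the comparison itself can be sketched along these lines. This is a minimal sketch under our own assumptions: `responses_match` and the `IGNORED_HEADERS` set are illustrative names, and the real set of ignored headers depended on the endpoint (see "Refine the comparisons" below).

```python
# Headers that differ for irrelevant reasons (hypothetical list; the
# actual set of ignored headers was tuned per endpoint).
IGNORED_HEADERS = {"date", "transfer-encoding", "x-request-id"}

def responses_match(expected, actual):
    """Compare the monolith's recorded response ("expected") with the
    one produced by the new service ("actual"): status code, relevant
    headers, and body must all agree."""
    if expected["status"] != actual["status"]:
        return False

    def relevant(headers):
        # Normalize header names to lower case and drop noise headers.
        return {name.lower(): value
                for name, value in headers.items()
                if name.lower() not in IGNORED_HEADERS}

    if relevant(expected["headers"]) != relevant(actual["headers"]):
        return False
    return expected["body"] == actual["body"]
```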

Each endpoint of the monolith has its own expected consistency threshold to reach before the migration can be declared successful. Once that threshold has been achieved, the migration can be considered safe, and we can switch from the monolith to the new Returns microservice for that endpoint.

Monitoring and Reporting

In order to consider an endpoint ready, it had to reach a satisfying consistency percentage. For each request we produced result metrics using Prometheus and displayed them with Grafana. Each endpoint, identified by an operation_id, had its own metric and its own tolerance. This was done because, as usual, fixing the last few percentage points costs more than the value it brings; and since the endpoints are completely independent of one another, each had its own target percentage to be considered consistent (enough).

Monitoring example

Matched: counter for all the requests that matched between the monolith and the Returns microservice.

Unmatched: counter for all the requests that did not match between the two services. Possible examples could be:

  • Different HttpStatuses: such as 2xx and 4xx or even 201 and 200
  • Different Headers set: a missing header in one of the two responses or different values for the same header
  • Different Body responses: missing fields/attributes in the responses or different values for the same field/attribute

Failed: counter for all the requests where the response failed due to temporary issues, such as any 5xx. In these cases, even if the responses matched, that would not be valuable information, given that the request couldn't be properly fulfilled due to a transient server-side issue. On the other hand, if the responses did not match in a 5xx case, the unmatched counter should be increased, because it means the overall behavior of the Returns microservice doesn't match that of the monolith and requires deeper investigation.
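The bucketing into the three counters above can be expressed with logic along these lines (a sketch with names of our own choosing; in production the outcome fed a labelled Prometheus counter per operation_id):

```python
def classify(expected_status, actual_status, matched):
    """Bucket one consistency check into the matched / unmatched /
    failed counters.

    A transient 5xx on either side makes an agreeing comparison
    worthless ("failed"), while any disagreement, 5xx or not, counts
    as "unmatched" and needs investigation.
    """
    if not matched:
        # A mismatch always counts, even when a 5xx is involved.
        return "unmatched"
    if expected_status >= 500 or actual_status >= 500:
        # Both sides agreed, but the request itself failed
        # transiently, so the match carries no information.
        return "failed"
    return "matched"
```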

Rollout

The switch was done gradually, endpoint by endpoint, to allow the system to be tested in a fully functional way. This was achieved by using a proxy to redirect requests to the Returns microservice one endpoint at a time, as each became ready. In our case we used Skipper, an open-source proxy developed by Zalando.

Endpoints rollout

In this way, by limiting each switch to a single endpoint, we avoided introducing a massive set of changes in one go, and we were able to collect additional feedback from every single switch while still working on finalizing the other endpoints.
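For illustration, a Skipper route table mid-rollout might look roughly like the following eskip sketch. The route names, paths, and hostnames are made up for this example and do not reflect our actual configuration:

```
// Endpoint already verified: route it to the new microservice.
returns_create: Path("/api/returns") && Method("POST")
  -> "https://returns.example.org";

// Everything else still goes to the monolith.
monolith: * -> "https://monolith.example.org";
```

Rolling an endpoint out (or back, see below) then amounts to adding or removing one such route, with no redeployment of either service.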

Clean-up

Once the migration was successfully finalized, all the code related to the parallel run logic needed to be cleaned up. The three main parts to remove were the handler performing the consistency check (use cases layer), the gateway calling localhost (gateway layer) and the domain model related to the consistency logic (entities layer). Additional clean-ups were done for configuration files, such as the feature toggle to enable/disable the consistency checker and the config for the localhost gateway, the dependency injection in the Main file, the consistency-checker API in the routes and, of course, all the tests validating the consistency check logic. Code-wise we removed ~700 lines of code and ~1.3k lines of unit and component tests.

Advantages of this approach

  • Live data for testing: We can leverage the real production data as test cases. Therefore, given enough time, the system will be tested potentially under all the "real-life" use cases.

  • Gradual rollout: The rollout is done per endpoint minimizing the amount of changes per switch.

  • Incremental development: The gradual rollout also makes it possible to approach the implementation endpoint by endpoint.

  • Easy rollback: By using a proxy to do the traffic switch, rolling back just requires a change in the proxy to route the endpoint back to the previous host instead of the microservice; this avoids the need to redeploy, making the whole process faster.

  • Finding bugs: Since the new microservice will be tested with real data, there might be cases where even the monolith was behaving incorrectly. This approach can make those edge cases visible.

  • Load testing: When the new service uses a different technology, the parallel run pattern helps to understand its performance characteristics. As a result, the development team can set more realistic performance goals or SLOs before going live.

Considerations and Limitations

While this approach makes the migration safer and smoother, it also has some concerns and limitations to take into account.

  • Increased load: Given that requests received by the monolith are forwarded to the microservice, the load across all components increases, potentially doubling.

  • Refine the comparisons: In the comparison check not everything needs to match 100%. For example, in our case we ignored some headers that were not relevant for the outcome of the request.

  • GDPR: While collecting the data for the comparison, we need to take into account that sensitive information should either not be stored or be cleaned up afterwards. In the former case, analyzing inconsistencies in fields containing personal data might not be easy.

  • Non-trivial comparisons: Comparing the results is not always straightforward. For example, comparing PDFs might be complicated due to different but negligible metadata, a change of HTTP framework might result in different default response headers, or collections could have different orderings.

  • Non-idempotent endpoints: Idempotency should always be taken into account. For example, this approach can be used for POSTs that are idempotent, but not when the idempotency of the endpoint cannot be guaranteed. During this investigation, always consider the idempotency of each operation and its possible side effects (for example calling another POST API, updating a database, or publishing an event).

  • Not a quick win: Even if this approach leads to a smooth and safe migration, it requires quite some time and effort to be properly set up and tuned.

Verdict

Implementing a parallel run is rarely a trivial affair, and is typically reserved for those cases where the functionality being changed is considered to be high risk. (...) the work to implement this needs to be traded off against the benefits you gain.

-- Sam Newman, Monolith to microservices

The parallel run pattern is a powerful technique to overcome the complexities and stress of migration projects, but not every migration project is a good match for it. Increased traffic, complexity in comparing the results, and the amount of effort required are the risks that should be weighed before implementing this pattern.

In the end, this pattern is just a tool that should be used wisely, considering constraints, use cases, and team capacity when planning for it. When done properly, it saves you a lot of headaches.


  1. Newman S. (2020). Monolith to Microservices. 2nd ed. O’Reilly Media, Inc. 

  2. You can learn more about this effort in a series of articles about GraphQL in this blog. 


