End-to-end load testing Zalando’s production website

How we made sure we stayed online for Black Friday 2018

Brian Maher

Software Engineer

Thibaut Le Page

Software Engineer

Oliwia Zaremba

Software Engineer
Posted on Apr 11, 2019

Black Friday is the busiest day of the year for us, with over 4,200 orders per minute during the event in 2018. We need to make sure we’re technically able to handle the huge influx of customers.

As part of our preparations, we ask all of our teams to perform load tests to ensure their individual components will handle the expected load. In addition, due to the distributed nature of our system’s architecture, we also need to ensure it will handle the expected load once all components work together. To do this, we simulate real user behaviour at large scale on the production system, using different scenarios that contain the most common user actions (e.g. visiting the homepage, browsing the catalogue, adding an item to the cart, checking out).

In preparation for Black Friday 2018 our Testing & Quality Strategy team, in cooperation with our SRE (Site Reliability Engineering) team, took on the challenge of providing the tooling required to perform these simulations.

A new set of tools

Our starting point was to look at what was done to prepare for Black Friday 2017. We reviewed a tool that had been created internally to perform end-to-end load testing. It used scenarios written in JavaScript and ran them on a distributed set of Puppeteer nodes, each interacting with an instance of a Chrome browser. Unfortunately, due to the browser instances’ heavy resource usage at such a large scale, it was prohibitively expensive to run and so couldn’t be used again.

We went back to the drawing board and, along with feedback gathered from stakeholders that were involved in the previous year’s efforts, started to design a new solution.

We first looked at existing load testing tools such as JMeter, Locust, and Vegeta; all of which we had previous experience with. We quickly realised that, whilst they all individually had their merits, none of them alone completely solved the problem.

We needed a way of recording scenarios representing a user interacting with our website in order to simulate traffic from real users. What's more, we needed a method of translating the scenarios into load test scripts that could be replayed in a lightweight manner and reused. Finally, we needed a mechanism for cost-effective scaling of the load.

After a few design rounds we came up with the following multi-tool solution:

Locust

From the previous year’s learnings, we knew that creating our own load test runner from scratch would be neither feasible nor desirable in the time we had. Therefore, we decided the core of our solution would be one of the existing load testing tools we had already investigated. We settled on Locust due to its built-in ability to run in a distributed mode and its support for scripting (it uses Python files as inputs).
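For anyone unfamiliar with Locust, a "locustfile" is just a Python module describing what a simulated user does. The sketch below is not one of our Black Friday scenarios; it is a minimal, hand-written example using the Locust 0.x API that was current at the time, with made-up paths and timings.

    from locust import HttpLocust, TaskSet, task

    class BrowseCatalog(TaskSet):
        @task
        def homepage(self):
            self.client.get("/")

        @task
        def category_page(self):
            # Hypothetical path, purely for illustration.
            self.client.get("/womens-clothing/")

    class WebsiteUser(HttpLocust):
        task_set = BrowseCatalog
        host = "https://www.example.com"  # placeholder target host
        min_wait = 1000  # think time between tasks, in milliseconds
        max_wait = 5000

    # Distributed mode (Locust 0.x flags): one master, many workers, e.g.
    #   locust -f locustfile.py --master
    #   locust -f locustfile.py --slave --master-host=<master address>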

HAR files

In order to record the scenarios easily, we realised we could again reuse existing technology: modern browsers can export a browsing session as a HAR (HTTP Archive) file. This, however, presented us with a new challenge: how do we convert these HAR files into something Locust can run?
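Part of what makes HAR convenient as a recording format is that it is plain JSON, so the recorded requests are trivial to inspect programmatically. The file name below is made up; the structure (log.entries[].request) comes from the HAR specification.

    import json

    # Load a HAR file exported from the browser's developer tools
    # ("Save all as HAR with content" in Chrome, for example).
    with open("browse-and-checkout.har") as f:
        har = json.load(f)

    # Each entry corresponds to one recorded HTTP request.
    for entry in har["log"]["entries"]:
        request = entry["request"]
        print(request["method"], request["url"])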

Transformer

We built Transformer to convert the scenarios recorded as HAR files into Locust’s input format, a Python file (the "locustfile").

Transformer treats each HAR file as a single scenario. It takes every HTTP request recorded there and expresses it in Locust’s terms; the result is a locustfile that replays exactly those requests. Transformer can combine multiple HAR files (i.e. multiple scenarios) into a single locustfile, allowing many scenarios to be replayed in the same load test, each with its own customizable amount of load (more users visit the catalog than the Help page). And because there are always exceptions, a plugin mechanism allows each request to be arbitrarily modified and enriched by injecting pre- and post-processing code into the locustfile. This allowed us to, amongst other things, replay dynamic requests requiring temporary, JavaScript-generated tokens without actually executing any JavaScript.
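The snippet below is not Transformer’s literal output (which is generated and considerably more involved), but a hand-written sketch of the shape such a locustfile takes when two scenarios are combined with different weights, again using the Locust 0.x API and invented paths and ratios.

    from locust import HttpLocust, TaskSet, task

    class BrowseCatalog(TaskSet):
        @task
        def category_page(self):
            self.client.get("/womens-clothing/")

    class ReadHelpPage(TaskSet):
        @task
        def help_page(self):
            self.client.get("/faq/")

    class CatalogUser(HttpLocust):
        task_set = BrowseCatalog
        weight = 9  # nine out of ten simulated users browse the catalog

    class HelpUser(HttpLocust):
        task_set = ReadHelpPage
        weight = 1  # one out of ten reads the Help page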

Zelt

The final piece of the puzzle, cost-effectively generating the required load at large scale, was solved by our in-house Kubernetes infrastructure. We built Zelt to orchestrate the execution of Transformer, the distribution of the generated locustfile, and the deployment of the Locust controller and worker nodes into one of our Kubernetes clusters. It allowed us to easily provision, scale up or down, and execute our load tests.
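Zelt’s actual implementation is not reproduced here. As a rough sketch of what deploying Locust workers to Kubernetes involves, the example below uses the official kubernetes Python client to create a worker Deployment; the namespace, image tag, flags, and replica count are all illustrative, and the distribution of the locustfile itself (e.g. via a ConfigMap mount) is omitted.

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()

    workers = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="locust-worker"),
        spec=client.V1DeploymentSpec(
            replicas=300,  # scale the load up or down by changing this
            selector=client.V1LabelSelector(match_labels={"app": "locust-worker"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "locust-worker"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="locust",
                            image="locustio/locust:0.11.0",  # illustrative tag
                            args=["-f", "/mnt/locustfile.py",
                                  "--slave", "--master-host=locust-master"],
                        )
                    ]
                ),
            ),
        ),
    )

    apps.create_namespaced_deployment(namespace="load-test", body=workers)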

One more for the road

Another tool, a Node.js library called PuppetHAR, was created to allow us to programmatically generate HAR files from Puppeteer scripts rather than manually in the browser; ultimately this was never used.

In practice

We built these tools in close collaboration with our SRE team. They provided us with the scenarios, crafted using data from our analytics teams to represent real user journeys through the Zalando website. They also provided us with the inputs to the equations required to translate our internal target metrics, in requests per second (RPS), into Locust’s input format of a number of concurrent users.
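Those exact equations are not reproduced in this post. As a rough illustration of the idea, though: if each simulated user loops through a scenario of known length, the user count needed to hit a given request rate follows directly, as in the sketch below (all numbers are made up).

    import math

    def users_for_target_rps(target_rps, requests_per_scenario, scenario_duration_s):
        """Roughly convert a target request rate into a concurrent user count.

        Assumes each simulated user repeatedly runs one scenario that issues
        `requests_per_scenario` requests over `scenario_duration_s` seconds,
        including think time.
        """
        return math.ceil(target_rps * scenario_duration_s / requests_per_scenario)

    # e.g. a 40-request scenario taking ~60 s, and a 5,000 RPS target:
    print(users_for_target_rps(5_000, 40, 60))  # -> 7500 concurrent users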

To run the load tests, virtual situation rooms were created that included us, SRE, and members of the component teams. With the previously generated locustfile, we used Zelt to deploy the load testing infrastructure in Kubernetes, and the Locust dashboard to initiate and control the tests.

As the tests were running, the teams that owned the various components receiving the load monitored their production systems using our in-house monitoring tools and let us know if, and how, things were showing signs of strain. We used the same monitoring tools to observe our progress towards our load targets and concluded the tests once they had been reached and sustained for a period of time (or earlier, if a component team asked us to stop because of a bottleneck they had found).

In our final configuration, we ended up running four Locust stacks of 300 nodes each, and observed a total of 130,000 RPS.

Learnings

Overall, the project was a success. We were able to execute end-to-end load tests against the production website at a scale larger than the actual traffic received during the peak of the Cyber Week campaign. Thanks to this, the teams were able to act upon the information gathered, discover their optimal scaling configuration, and fix the bottlenecks that were discovered, all before Black Friday.

Throughout the process, however, we faced some challenges that we needed to overcome.

Reverse engineering

With all record-and-playback methods, there is no guarantee that what you record will be replayable without error, as state tends to change over time.

Our tooling was no different and we faced this issue frequently. Session identifiers would expire, articles would go out of stock, rate limiting would kick in, and security measures would catch us out.

For each instance we essentially had to reverse-engineer our own website, work out which piece was tripping us up, and find a way around it. This was not only a technical challenge but also one of communication and coordination, as we needed to find the teams responsible for the components we were falling foul of and work with them on solutions.

Often we could only verify our solutions during a load test, as the symptoms would only appear under high load; this was obviously costly and slow. To alleviate this, we started working even more closely with the component teams, bringing them to sit with us and pair on developing solutions whilst they monitored their systems for us.

Locust

We were happy with Locust initially, but as our solution grew more specific and the scale of the load increased, the tool’s disadvantages started to show.

Two of the Locust features we relied on most were the distributed mode of the test runners and the weight system for the scenarios. As we learned the hard way, the two features combined unfortunately do not work as expected at large scale. We also started to realize that the health of the Locust project was far from what we had hoped: some very old issues had not been fixed, new issues were not being addressed, and the maintainers were unresponsive. By this time it was already too late to change the tool, so eventually we forked the project and made the changes necessary to address the most painful issue immediately.

Next steps

At the time of writing, we’ve already open-sourced Transformer and Zelt, and plan to open-source PuppetHAR in the future; so keep an eye on our Zalando Incubator homepage!

Internally, we’re already preparing for Black Friday 2019 and continue to improve our tools and processes for ensuring a smooth customer experience during any and all high-load situations.


