Cyber Week has become an increasingly important time of the year in e-commerce. In 2019, we attracted 840,000 new customers and our sales (Gross Merchandise Volume) increased by 32% compared to the previous year. During the event we grew faster as a business than during the rest of the year, when we grow at a 20-25% rate. Our peak orders per minute reached 7,200 compared to 4,200 the year before (+71% YoY).
From an engineering point of view, Cyber Week is a very exciting time, during which all systems are exposed to load far beyond any peak seen during the rest of the year. Supporting the event itself has been extremely rewarding for everyone involved, thanks to close collaboration between teams and a strong focus on operational excellence and reliability. While preparing for past Cyber Weeks, we created new capabilities in our teams and platform that serve us throughout the whole year. Looking back at the past years, we would like to share our experience and how our capabilities evolved over time around three key themes: Site Reliability Engineering, Load Testing in Production, and the Preparation approach itself.
Site Reliability Engineering
Phase 1: Building up knowledge about reliability engineering
Six years ago, when our e-commerce platform was still hosted in on-premise data centers, we had a handful of on-call teams. Two of these teams owned the backend and frontend systems of our e-commerce platform and were primarily responsible for Cyber Week preparations and support during the event. When we started moving more and more critical systems into the AWS cloud as part of our micro-frontend architecture, we adopted the "you build it - you run it" mindset, and the number of on-call teams increased dramatically to around 100 teams today. This also meant that we needed to educate many teams about designing for reliability. To achieve that, we formed a team of 10 colleagues who were passionate about SRE and who signed up to perform production readiness reviews of our applications ahead of Cyber Week. In preparation for that, we ran a series of workshops with teams to share knowledge about reliability patterns and identified clusters of applications that required adjustments, so that the platform remains stable under various failure types (e.g. failures of dependencies, overload, timeouts).
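To make "reliability patterns" more concrete, here is a minimal sketch of one common pattern for handling failing dependencies: a circuit breaker with a fallback value. The text above does not name the specific patterns covered in the workshops, so the choice of pattern, the class name CircuitBreaker and its parameters are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are hypothetical."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # how long to stay open before a trial call
        self.clock = clock                  # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the breaker is open, skip the dependency and answer with the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback
            # Otherwise the wait has elapsed: let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            return fallback
        # A successful call closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of the pattern is that a failing dependency stops receiving traffic for a while, which protects both the caller (no pile-up of slow calls) and the dependency (room to recover).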
Phase 2: Distributed tracing
We use distributed tracing following the OpenTracing standard across our platform. This allows us to inspect the performance of our distributed system and quickly find contributing factors for increased latency or error rates across our applications. After instrumenting a first set of applications and confirming the expected benefits, we used the Cyber Week preparations to scale this effort. In year one, we focused on critical, tier-1 systems involved in the hot path of the browse journey in our shop. The following year, we expanded the coverage to tier-2 systems among the applications in scope of Cyber Week. During the instrumentation, we adopted additional conventions that help us identify the traffic sources: App, Web, push notifications, load tests. This allows us to better understand traffic patterns and perform capacity planning based on the request ratios between incoming traffic and the respective parts of our platform.
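As a rough illustration of how tagged tracing data can feed capacity planning, the sketch below computes request ratios (requests seen by each service per request at the entry point) and a breakdown of entry traffic by source. Spans are modelled as plain dictionaries; the keys "service" and "traffic_source" and the service names are hypothetical and do not reflect the actual tag names used on the platform.

```python
from collections import Counter


def request_ratios(spans, entry_service):
    """Requests observed per service, relative to requests at the entry service."""
    counts = Counter(span["service"] for span in spans)
    entry_count = counts[entry_service]
    if entry_count == 0:
        raise ValueError(f"no spans found for entry service {entry_service!r}")
    return {service: n / entry_count for service, n in counts.items()}


def traffic_by_source(spans, entry_service):
    """Count entry-point requests per traffic source (e.g. app, web, load-test)."""
    return Counter(
        span["traffic_source"] for span in spans if span["service"] == entry_service
    )


# Hypothetical sample: two incoming requests that fan out to downstream services.
spans = [
    {"service": "shop-frontend", "traffic_source": "app"},
    {"service": "shop-frontend", "traffic_source": "load-test"},
    {"service": "catalog", "traffic_source": "app"},
    {"service": "catalog", "traffic_source": "app"},
    {"service": "catalog", "traffic_source": "load-test"},
    {"service": "cart", "traffic_source": "load-test"},
]
ratios = request_ratios(spans, "shop-frontend")
sources = traffic_by_source(spans, "shop-frontend")
```

With ratios like these, a projected number of incoming requests can be translated into the expected load on each downstream service.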
Phase 3: Dedicated team for SRE enablement
What started as a grass-roots movement around SRE practices in Phase 1 has evolved into an SRE department within Zalando, which focuses on reliability engineering, observability, and providing the necessary infrastructure for monitoring, logging and distributed tracing. The SRE team also organizes trainings and knowledge exchange within the SRE guild, where teams share lessons learned and pitfalls of operating systems in production and collaborate on formulating best practices.
Distributed tracing has been a game-changer for us. We have used tracing data to reduce alert fatigue in our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner. See our SRECon talk "Are We All on the Same Page? Let's Fix That", which explains our approach in detail.
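The idea behind adaptive paging can be sketched in a few lines. The heuristic below is a deliberate simplification and an assumption, not the set of heuristics described in the talk: starting from the span that triggered the alert, it follows child spans carrying the OpenTracing error tag until no failing child is left, and pages the team owning that deepest failing service. The span layout, the owners mapping and the function name team_to_page are hypothetical.

```python
def team_to_page(spans, alerting_span_id, owners):
    """Return the team owning the deepest failing span below the alerting span."""
    by_id = {span["id"]: span for span in spans}
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    alerting = by_id[alerting_span_id]
    current = alerting
    while True:
        # OpenTracing's semantic conventions mark failed spans with error=true.
        failing = [c for c in children.get(current["id"], []) if c["tags"].get("error")]
        if not failing:
            break
        current = failing[0]  # simplification: follow the first failing child

    # Fall back to the alert owner if the suspected service has no known owner.
    return owners.get(current["service"], owners[alerting["service"]])


# Hypothetical trace: checkout calls cart and payment; cart calls inventory.
trace = [
    {"id": 1, "parent_id": None, "service": "checkout", "tags": {"error": True}},
    {"id": 2, "parent_id": 1, "service": "payment", "tags": {}},
    {"id": 3, "parent_id": 1, "service": "cart", "tags": {"error": True}},
    {"id": 4, "parent_id": 3, "service": "inventory", "tags": {"error": True}},
]
owners = {"checkout": "team-checkout", "cart": "team-cart", "inventory": "team-inventory"}
paged = team_to_page(trace, 1, owners)
```

In this example the alert fires on checkout, but the page goes to the team owning inventory, because that is where the chain of failing spans ends.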
Load Testing in Production
Phase 1: Feeling lucky
Over the years of operating our shop in the Data Center, we learned how to scale our shop's frontend. We kept adding servers and scaling our Solr fleet responsible for Product Data and Search until this became impractical due to the multi-month lead time needed to get new physical servers. The Solr fleet was the system that benefited most from auto-scaling in the cloud and thus the first one that we moved to the cloud six years ago. Our backend services (e.g. product information management, inventory management, order management, customer accounts and data), however, formed an over-provisioned system with a fixed number of instances in the Data Center. At its heart were PostgreSQL instances, heavily optimized by our Database infrastructure team, that we scaled through sharding and switching from spinning disks to SSDs.
This was sufficient for Cyber Week in 2015, when commercial campaigns were just about the right size for our capacity. With no past knowledge about what type of traffic to expect, we were amazed by how much headroom our backend systems really had. Never before had we seen load throughout the day that surpassed every evening peak we had seen in the past. There were of course some challenges with scaling, but we could overcome these with small adjustments to the system configuration during the event. This was achieved mostly through pausing some asynchronous processing that was not essential for accepting and processing orders.
Phase 2: Load Tests in Production
In a cloud-based system that relies heavily on auto-scaling for cost-optimization, proper testing and capacity planning are a must. To achieve that, we set ourselves the target of better understanding our scalability limits. We tried many approaches and, in our experience, the only effective one for a large-scale system like ours is to run live load tests in production. Testing in production is an established practice, but difficult to execute well. Mistakes become really costly because the customer experience is degraded, so this approach requires the ability to quickly notice customer impact and react by aborting the test or otherwise mitigating the incident.
To achieve our goal, we wrote simulators that place sales orders for test products. These orders can be clearly differentiated from real customer orders, are processed to a certain degree, and are then skipped at the stage of fulfillment. This gives us an understanding of the limitations of our order processing system and all its dependencies, including inventory management and payment processing. Further, as shared before in "End-to-end load testing Zalando's production website", we wrote a simulator that traverses the user journey across key customer touch-points in our shop. We run this simulation in production for all countries and mimic the traffic patterns we observe for sales events. Through that we uncover scalability bottlenecks and verify whether certain resilience patterns work properly. Running the simulation is a fun and thrilling exercise, especially when the whole team suddenly starts hearing pagers fire as we continue to increase the test traffic.
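The handling of test orders described above can be sketched as follows. This is an illustrative model only: the Order fields, the is_load_test flag and the stage names are assumptions, and the real order pipeline is certainly more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    sku: str
    is_load_test: bool = False  # hypothetical marker distinguishing simulated orders
    completed_stages: list = field(default_factory=list)


def process(order):
    """Run an order through hypothetical stages, skipping fulfillment for test orders."""
    # Test orders exercise the same upstream stages as real orders, so the load
    # reaches inventory management and payment processing as well.
    for stage in ("reserve_inventory", "authorize_payment"):
        order.completed_stages.append(stage)

    if order.is_load_test:
        # Stop before anything is physically picked, packed or shipped.
        order.completed_stages.append("skipped_fulfillment")
        return order

    order.completed_stages.append("fulfill")
    return order


real = process(Order("o-1", "sku-123"))
simulated = process(Order("lt-1", "test-sku-1", is_load_test=True))
```

Keeping the marker on the order itself means every downstream stage can decide independently how far a simulated order should travel.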
Phase 3: Load Tests inform capacity planning
Having written and evolved the user journey simulator for two years, we were not fully satisfied with its ability to generate load at scale. There were too many rough edges, and tuning the simulator to generate the required load profiles consumed too much of our development time. We decided that it would be better to use an existing product that does the job better. This paid off heavily: last year we were able to run the tests on the App and Web platforms simultaneously.
The different types of load tests that we ran in production last year helped inform capacity planning based on commercial goals and the projected sales. The final, clean run of tests also gave us sufficient confidence that the platform was scaled to sustain a certain amount of incoming traffic and sales in the peak minute and thus contributed to a smooth event for our teams.
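To show how load test results and projected sales can combine into a capacity estimate, here is a small worked sketch. The formula and all inputs except the peak of 7,200 orders per minute (taken from the 2019 figures above) are assumptions chosen for illustration: requests_per_order stands for a request ratio derived from tracing, max_rps_per_instance for the sustainable throughput of one instance measured in a load test, and headroom for a safety margin.

```python
import math


def required_instances(peak_orders_per_minute, requests_per_order,
                       max_rps_per_instance, headroom=0.25):
    """Estimate the instances a service needs to sustain the projected peak minute."""
    # Convert the projected order peak into requests per second for this service.
    peak_rps = peak_orders_per_minute / 60 * requests_per_order
    # Add a safety margin on top of the projection, then round up to whole instances.
    return math.ceil(peak_rps * (1 + headroom) / max_rps_per_instance)


# 7,200 orders/min = 120 orders/s; at a hypothetical 50 requests per order that is
# 6,000 requests/s, or 7,500 with 25% headroom; at 200 requests/s per instance
# this needs 37.5, rounded up to 38 instances.
instances = required_instances(7200, requests_per_order=50, max_rps_per_instance=200)
```

A final, clean load test run at the planned capacity then serves as the check that an estimate like this actually holds in production.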
Preparation as a project
The Cyber Week project is always at the top of our project lists and we dedicate the highest attention to the preparation work. Over the past years, we have progressively increased collaboration between the engineering and commercial teams and have dedicated Program Managers responsible for the delivery of the project. Every year we tune the structure and reporting within this project.
Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before - be it resilience engineering know-how, load testing in production, capacity planning, production readiness reviews, or collaboration across the company. On top of that, we also run dedicated projects aimed at increasing scalability of our platform and deliver changes to the customer experience for sales events.
During the event
After months of preparation, the event itself is a cherry on top - it's the time where we see how the time invested has paid off. If we are well prepared, we expect a rather uneventful time in terms of the number of production incidents. For the key period where we expect the highest load on our systems, we organize a Situation Room to ensure rapid incident response. In the room, we gather representatives from key engineering teams, SRE team, and dedicated Incident Commanders to closely watch the operational performance of our platform. It's basically a control center with dozens of screens and graphs, that looked like this in 2019:
We've explored three key themes in Zalando's Cyber Week preparation journey. We are constantly tuning our approach based on insights from each year and adapting the areas we invest in to the business growth and commercial campaign requirements. This year has the added twist of remote working, which will likely require us to rethink how to organize the Situation Room efficiently. With seven weeks until Cyber Week, our preparations for this year's event are well underway and we are looking forward to sharing results and lessons learned in follow-up posts. With our growing application landscape, there are plenty of challenges ahead, as we have more than 1122 applications (out of 4000+) in scope of the Cyber Week preparations.