2016 - First attempt at rolling out SRE
Welcome to the first installment of our three-part series following Zalando’s SRE journey. Be sure to come back for the other two; the next one will be published in a week.
Site Reliability Engineering (SRE) is a recent discipline in the Software Engineering field that is growing in popularity, with many companies turning to this new way of working to solve their operational issues or to support their growing scale. But because the discipline is so recent, it is not yet well established how organizations should adopt SRE, or even what the role of a Site Reliability Engineer is (although demand for the role keeps increasing). At Zalando we also took a stab at implementing SRE within our organization. We looked at it as a way to help us scale our engineering efforts, improve efficiency and make life easier for our developers. Today, Zalando has a Site Reliability Engineering department, but the journey to reach this point was filled with challenges and learnings that we are now sharing with everyone.
In this series of blog posts we will take our readers through the road so far. We’ll describe what worked well for us, and what didn’t. Where we failed, and where we succeeded. We’ll also look into how we defined the role of an SRE within the company, and how SRE is growing in Zalando.
Before we get to the ‘How’, let’s start with the ‘Why’. Why would we want to have SRE in Zalando? To answer that, we need to understand where we were as a company before this journey began. That takes us back to 2016, when we were well into our move to the cloud, migrating our monoliths to a microservices architecture (you can find more details about this and what came after in the blog post from our colleague Henning Jacobs).
The move to the cloud came with disruptive changes to the way we were working. Teams were now responsible end-to-end for the software they built. That meant designing, developing, testing, deploying and operating the applications the teams owned. I’ll skip the gruesome details, but to put it simply, before this time, developers developed, and operators operated [1]. This meant that the vast majority of our engineers were not experienced in a good chunk of their newfound responsibilities. This lack of experience coupled with the hypergrowth that we were going through resulted in a lot of different and complex issues. These issues were mostly around the operational aspect of software development (monitoring, automated testing, deploying, incident handling, managing the cloud runtime).
One of the more obvious pain points was the on-call support. Before we started the microservice migration, our service landscape was small enough that 5 on-call teams could cover the whole stack. Each team had a large enough rotation, and the domain was well understood by each team member. The monoliths were also quite similar in terms of monitoring and operations, making it easier to tackle issues even in services that a given engineer would not be so familiar with. That gradually changed as new teams were created, and more and more services were deployed in the cloud. And there was little standardization across those services. The on-call teams did not grow to meet the new demands, and were increasingly overwhelmed by the new services that they were responsible for.
But 2016 was also the year that Google published their book Site Reliability Engineering. The practices and mindset described in that book seemed to provide some answers to the growing pains we were experiencing. For that reason, it became the main inspiration for implementing the SRE mindset, role and practices in Zalando. How it all started, though, was through a grassroots initiative to promote and pitch for an investment in SRE. After convincing enough managers, mostly by explaining the pain points felt by the engineering teams and how SRE could be a solution for those pains, a group of engineers teamed up under a project scope to drive this implementation. One of their main goals was to solve the on-call situation and make it sustainable. A quick side note: if it feels like the ‘convincing’ of management is grossly summarized, or that it was just too easy, it’s important to bring up that Zalando is a company that does not shy away from change. It’s a core part of the company’s DNA and culture. And the culture of an organization always plays a key role in enabling (or resisting) such changes.
Now that there was initial buy-in from management, there were oh so many things to discuss. But the one that had the most influence on the following steps was “How do we structure SRE?”. Again, remember that this had to be done in a way that would solve the on-call problem. Should we go for a central team? That would make staffing easier, because we’d need fewer SREs, but we were already too big for it (our headcount had grown to 1,000+), so odds were that a central team wouldn’t be effective. Should we distribute one SRE per team? The scope would be too large for the lone SREs. Not to mention that, over time, they’d likely become the Ops engineer for the team they were in. It was agreed that we would need several SRE teams. But that still raised the question: at what granularity would we create SRE teams? In the end we went with one SRE team per Product Cluster. This would give SREs end-to-end responsibility over a domain, without having too wide a scope.
There was another concern around the reporting chain. This was an easy discussion, as we quickly converged on following the guidance in the SRE book: treat reliability work as a specialized role, and keep the SREs separate from the product delivery teams.
To further gauge the interest in the SRE role and mindset, we sent out a survey to our engineering organization. In that survey we included a description of the desired profile for an SRE. That profile included: Software engineering, Operational mindset, Systems engineering, Software architecture skills, Troubleshooting skills.
The survey results also gave us an idea of the talent pool that might be interested in a move to an SRE role. To further promote the role and the initiative, several talks were given across the company and its different hubs, which, at the time, already included Helsinki, Dublin, and Dortmund.
With few engineers able to fit that profile, we had to be smart about where to start rolling out SRE. Ideally, we would start with the area with the most need for SRE practices. But to know which area that would be, we first had to measure the health of the different products at Zalando, so that we could then prioritize. Fortunately, at the core of SRE we have Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Since we lacked a standardized way of measuring availability, the first thing the team working on the SRE initiative decided to do was to roll out SLOs and SLIs. Workshops were conducted across the company for Engineers and Product Managers, and the first SLO reporting tool (SLR) was developed.
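For readers new to the terms, here is a minimal sketch of how an availability SLI can be checked against an SLO. It is only an illustration: the function name, the 99.9% target and the request counts are assumptions made up for this example, and it does not describe how our SLR tool worked.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # measured availability in the window, 0.0 to 1.0
    slo: float             # target availability, 0.0 to 1.0
    error_budget: int      # failed requests the SLO tolerates in the window
    budget_remaining: int  # tolerated failures not yet used (negative if overspent)
    met: bool              # True when the measured SLI reaches the SLO


def availability_report(total_requests: int, failed_requests: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI with its SLO (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")

    # SLI: share of requests that succeeded in the reporting window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many requests may fail before the SLO is missed.
    error_budget = round(total_requests * (1 - slo))
    return SloReport(
        sli=sli,
        slo=slo,
        error_budget=error_budget,
        budget_remaining=error_budget - failed_requests,
        met=sli >= slo,
    )


# Illustrative numbers only: one million requests, 400 failures, 99.9% target.
report = availability_report(total_requests=1_000_000, failed_requests=400, slo=0.999)
print(report)
```

The useful part of this framing is the error budget: it turns "how reliable are we?" into a number that both Engineers and Product Managers can track over a reporting window.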
To further demonstrate the educational benefit of SRE, the SRE program team ran Reliability Workshops as part of Cyber Week preparations to discuss and review Reliability Patterns for the more critical services. In those Reliability Workshops we covered Retry Strategies, Circuit Breakers and Fallbacks.
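As a rough idea of what these three patterns look like in code, here is a small sketch that combines them. It is not the material from the workshops: the class and function names, the thresholds and the failing `fetch_recommendations` dependency are all hypothetical, and a production service would more likely rely on an established resilience library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure of that trial re-opens the circuit.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry_with_backoff(func: Callable[[], T], attempts: int = 3,
                       base_delay_s: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a failing call, waiting longer (with jitter) between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
    raise AssertionError("unreachable")


def call_with_fallback(func: Callable[[], T], fallback: T) -> T:
    """Return a degraded default instead of propagating the failure."""
    try:
        return func()
    except Exception:
        return fallback


def fetch_recommendations() -> list:
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


breaker = CircuitBreaker(failure_threshold=2)
result = call_with_fallback(
    lambda: retry_with_backoff(lambda: breaker.call(fetch_recommendations),
                               attempts=3, base_delay_s=0.01),
    fallback=[],
)
print(result)
```

The ordering matters: the breaker wraps the raw call, retries wrap the breaker, and the fallback sits outermost, so the caller gets a degraded answer once retries are exhausted or the circuit has opened.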
Many services did have SLOs defined and collected, but those SLOs still did not end up influencing the software development process. The vast majority of SLOs were defined through initiatives from Engineers. But in a microservice architecture, a product is implemented by multiple services, and Product Managers had a hard time establishing a link between the different SLOs and their own expectations for the products they were responsible for. Management was kept in the loop, but not directly involved, so there was no real motivation for management to uphold the SLOs.
Senior Management agreed that SRE concepts like SLOs and reliability patterns were much-needed practices, and that teams should continue applying them. However, there was a clear preference to keep building the missing operational capabilities in the Delivery Teams. The way chosen to kickstart that capability building was to put each delivery team on-call for the critical services they owned. This decision was fundamental to properly establishing the “you build it, you run it” mentality we still have today.
With teams now responsible 24/7 for their own services, the plans for Zalando SRE would necessarily have to change. Join us for the next chapter of our series to learn more about the next steps of this journey.
EDIT 1: No reason to stop reading here. The second part of our series is already available here.
Already curious enough to want to be part of this story and its future chapters? Then come join us at SRE. We're always looking for talented engineers to deliver our strategy.
[1] We did have some engineers with end-to-end responsibility. They would deploy, monitor and even be on-call for the services of their respective area. This was not standardized in the company, and it depended greatly on the leadership of their respective teams.