Technology Choices at Zalando - Updating our Tech Radar Process

We have revisited the process of technology selection at Zalando, adjusted the Tech Radar ring semantics, and moved towards principle-based decision making. In this post, we would like to share the process and its outcomes so far.

Bartosz Ocytko

Executive Principal Engineer

Posted on Jul 15, 2020

Tags:

Zalando Tech Radar

Challenges with our Tech Radar

The Zalando Tech Radar is modelled after the Thoughtworks Technology Radar. It includes a ring-based scoring for each technology or framework, along with supplementary information about pros, cons, restrictions, usage, and lessons learned at Zalando, which serves as a knowledge base for our teams. Since its publication, the approach and the visualization engine have been used by others and showcased at conferences as an example of how tech companies manage their technology choices.

Our initial concept of the Tech Radar suffered from a series of problems, which we have observed in the Engineering Community while maintaining the Tech Radar:

The ring change criteria were too high level: they were not specific to technology types (e.g. programming languages, data stores) or context (e.g. backend, data science, mobile), and they did not account for a technology's support by our infrastructure or its impact on engineering usage. They allowed neither transparent, objective, and recurring rescoring of the Tech Radar nor clear guidance for our engineers on how to select or suggest technologies to evaluate.
The Tech Radar was easy to ignore due to the lack of a formal process, and delivery teams often made key technology choices in isolation without consulting the guild maintaining the Tech Radar. Radar entries and ring changes were frequently proposed only after technologies were already in production, rather than following the Tech Radar cycle. This led to a disconnect between the ring assignments and actual usage across teams.
The Tech Radar relied on voluntary contributions, which became less frequent because they were neither clearly incentivized nor part of the job expectations for higher grades. Contributions were usually driven by a small group of engineers forming an informal guild, who collected lessons learned material and encouraged teams across the organization to contribute. The guild lacked a formal mandate to make company-wide technology decisions and did not sufficiently represent the departments across the company.

Confirming the problem statements

To address these problems, we started by confirming them with our Engineering Managers and gathering more insight into how they manage technology choices in their teams. We also explored potential effects on delivery over the past years. We found that Engineering Managers felt insufficiently supported by the company in managing expectations and technology choices in their teams, and missed the ability to lean on stricter guidance. Further, an overly broad choice of technologies has affected the growth rate of their teams and created challenges with cross-team code contributions.

Technology choices in Tech companies

Having confirmed the problems, we collected ideas on how to approach them. We began by researching how other tech companies manage technology selection. Unlike Zalando, other established tech companies (Google, Spotify, Tencent, Foursquare, and other CNCF End User companies) use a much stricter technology selection process, limit programming language choices, and invest in changing the way applications are built to leverage centralized control planes, which increases development velocity. They limit tech stack choices because of the investment required for infrastructure support and the high cost of removing technologies that did not prove useful.

Too many technologies adopted company-wide make it challenging and expensive for Infrastructure teams to provide high-quality and well-integrated tooling, e.g. CI/CD, observability, profiling, vulnerability scanning, compliance, and governance. It also makes the teams that provide infrastructure solutions strongly dependent on coordinated and continuous community contributions for technologies that are not supported centrally. Broad freedom of choice makes it harder to support software long-term once the original authors have left the company, which is bound to happen sooner or later. There are further problems related to development collaboration: (1) the cost of cross-language communication becomes significant, as teams repeatedly implement the same functional components in different ways, (2) the code duplication rate increases and it becomes costly to address non-functional requirements of services such as performance, high availability, and scalability, and (3) cross-team collaboration across different code bases is hindered.

Generally, aside from specialized use cases, flexibility around technology choices provides especially high value when organizations are able to identify technologies that bring a paradigm shift (e.g. Kubernetes), paired with business value and use case fit. This proves to be a difficult task, and companies rarely get the timing right.

Data collection

We sourced information from the Engineering Community through a Programming Language survey among our developers. The survey indicated how many engineers currently use each language, which languages they feel comfortable working with and to what degree, and which languages they would be willing to support others with through guidelines or ad-hoc help. We cross-checked this data against our 4,000+ applications and derived how the different programming languages have gained traction and popularity over time.
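
The cross-check described above can be approximated with a small aggregation. The following is only a sketch under stated assumptions: the record fields (`language`, `created_year`), the shape of the survey result, and the placeholder language names are hypothetical and do not reflect Zalando's actual tooling or data.

```python
from collections import Counter, defaultdict

# Hypothetical application inventory; field names and values are illustrative only.
APPLICATIONS = [
    {"name": "app-1", "language": "lang-a", "created_year": 2016},
    {"name": "app-2", "language": "lang-a", "created_year": 2018},
    {"name": "app-3", "language": "lang-b", "created_year": 2018},
    {"name": "app-4", "language": "lang-a", "created_year": 2019},
]

# Hypothetical survey result: number of engineers who report using each language.
SURVEY_USERS = {"lang-a": 3, "lang-b": 5, "lang-c": 1}


def new_apps_per_year(applications):
    """Count newly created applications per language and year."""
    counts = defaultdict(Counter)
    for app in applications:
        counts[app["language"]][app["created_year"]] += 1
    return {language: dict(years) for language, years in counts.items()}


def cross_check(survey_users, applications):
    """Pair reported engineer counts with application counts per language."""
    app_totals = Counter(app["language"] for app in applications)
    languages = sorted(set(survey_users) | set(app_totals))
    return {
        language: {
            "engineers": survey_users.get(language, 0),
            "applications": app_totals.get(language, 0),
        }
        for language in languages
    }
```

A language with many reported users but few applications, or the reverse, is the kind of mismatch such a comparison would surface for further discussion.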

Setting the bar for ADOPT languages

We have collected expectations around the level of support that we would like to see for ADOPT languages. These range from clear guidelines on VM lifecycles, integration into CI/CD systems, observability, the size and health of the community within and outside of the company, and the ability to hire engineers to grow our teams using those languages, up to best practices for common tasks like performance analysis and tuning through inspection of heap dumps or flame graphs. We then collected data on how all languages we use in production measure up against those criteria to see how big the gap between our expectations and reality is.
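
One way to picture such a benchmark is a simple checklist per language. The sketch below is an assumption about how this could be modelled, not a description of the actual assessment: the criteria are paraphrased from the paragraph above, and the identifier names, the equal weighting, and the `example` assessment are hypothetical.

```python
# Criteria paraphrased from the text above; identifier names are this sketch's own.
ADOPT_CRITERIA = (
    "vm_lifecycle_guidelines",
    "ci_cd_integration",
    "observability",
    "community_health",
    "hiring_ability",
    "performance_tuning_practices",
)


def adopt_gap(assessment):
    """Return the criteria a language does not yet meet, in checklist order.

    Criteria missing from the assessment are treated as unmet.
    """
    return [c for c in ADOPT_CRITERIA if not assessment.get(c, False)]


def coverage(assessment):
    """Fraction of ADOPT criteria met (equal weighting is an assumption)."""
    return 1 - len(adopt_gap(assessment)) / len(ADOPT_CRITERIA)


# Hypothetical assessment of a placeholder language.
example = {"ci_cd_integration": True, "observability": True, "hiring_ability": True}
```

Here `adopt_gap(example)` lists the three unmet criteria, which makes the remaining investment needed to reach ADOPT explicit rather than a matter of opinion.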

Defining new ring semantics

We have redefined the ring semantics as follows:

ADOPT: technologies with broad adoption, in which Zalando is willing to invest long-term
TRIAL: captures all current experiments in production
ASSESS: active, non-production assessments of promising technologies and trends
HOLD: discouraged from broad adoption where the company is not willing to invest further; no new applications may use this technology
NIL: no ring assignment, captures previous assessments and findings for long-term documentation purposes (we periodically archive HOLD entries as NIL)

We optionally limit the ring assignments through a clear scope recommendation: Backend, Mobile, Web, Data, Machine Learning, and Infrastructure. This allows us to better differentiate between the specifics of those use cases. The updated semantics allow us to be broad in assessing the value of emerging technologies, but be selective in terms of their deployments to production and level of investment into adoption and promotion within the company. For TRIAL, we also involve explicit sponsorship from our Engineering Heads, who will support production trials and commit to being accountable for divesting from non-promising technologies and the removal of failed experiments from our technology landscape.
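
The ring semantics and scopes above map naturally onto a small data model. This is a minimal sketch and not the Tech Radar's actual implementation: the class and field names are hypothetical, and the rule that only ADOPT and sponsored TRIAL entries may be used for new production applications is one interpretation of the ring definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Ring(Enum):
    ADOPT = "ADOPT"
    TRIAL = "TRIAL"
    ASSESS = "ASSESS"
    HOLD = "HOLD"
    NIL = "NIL"


class Scope(Enum):
    BACKEND = "Backend"
    MOBILE = "Mobile"
    WEB = "Web"
    DATA = "Data"
    MACHINE_LEARNING = "Machine Learning"
    INFRASTRUCTURE = "Infrastructure"


@dataclass(frozen=True)
class RadarEntry:
    name: str
    ring: Ring
    scopes: FrozenSet[Scope] = frozenset()  # empty set: no scope restriction
    sponsor: Optional[str] = None  # sponsoring Engineering Head for TRIAL

    def __post_init__(self):
        # TRIAL involves explicit sponsorship, so reject entries without one.
        if self.ring is Ring.TRIAL and not self.sponsor:
            raise ValueError(f"{self.name}: TRIAL entries need a sponsor")


def may_start_new_production_app(entry, scope):
    """Assumed policy: only in-scope ADOPT or TRIAL entries qualify."""
    if entry.scopes and scope not in entry.scopes:
        return False
    return entry.ring in (Ring.ADOPT, Ring.TRIAL)
```

For example, `may_start_new_production_app(RadarEntry("old-tech", Ring.HOLD), Scope.BACKEND)` is false, matching the rule that no new applications may use a HOLD technology.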

Technology Selection Principles and Principal Engineering Community

The timing for making changes to the Tech Radar was fortunate for two reasons. First, we had started an update of our role expectations for Software Engineers and Engineering Managers, and the new expectations include responsibility and accountability for technology selection as well as incentives for contributing to the process. Second, we created a community of Principal Engineers, made up of the most senior engineers across the company, who have been empowered to make decisions on technology selection and thus maintain the Tech Radar. We kicked off the community with a day-long remote off-site where we captured the engineering challenges we face at Zalando, brainstormed principles for technology selection, and had an initial exchange about the implications of the new ring assignments and our learnings about the programming languages we use in production. In departments that were not represented by Principal Engineers, we included our Senior Engineers to contribute instead. Following the off-site, we formalized Technology Selection Principles that provide guidance on technology choices in terms of breadth and depth, focus on company-wide instead of local decision making, etc. Principle-based decision making enables healthy discussions and differs enormously from preference-based decision making, which easily becomes personal and leads to conflicts.

Parting ways with Clojure, Haskell, and Rust

Having reviewed the use cases where our teams have used the languages that are not on ADOPT, their current adoption within Zalando since 2016, the available set of languages, and the level of investment required to bring them to ADOPT, we have decided to part ways with Clojure, Haskell, and Rust and not create new applications in those languages moving forward. Although our teams have built many services using these languages and learned how to operate these at scale with many successes, following our technology selection principles, we decided to not further invest in these languages as their unique capabilities are not giving us any further leverage at this point in time. Instead, we are focusing our community efforts on Kotlin and TypeScript and expect our language communities to help us move these to ADOPT later this year.

Please note that this decision is specific to the context of Zalando (1,200+ developers, 4,000+ applications) and our current technology landscape and engineering practices. As such, this decision is not transferable to other organizations, nor should it be understood as a statement about the technical capabilities of the languages themselves. We encourage readers to go through a similar exercise to derive decisions for their own context.

Next steps

So far, we have reviewed the area of programming languages as the one having the biggest long-term impact on our engineers and system architecture as well as being the one sparking many debates on which language is better and why (when arguing based on preferences). As the next step, we are proceeding with reviewing the remaining categories of the Tech Radar, so stay tuned for further updates on our journey. (Update: check out our follow-up post on Scaling Contributions to the Tech Radar)

If you found the post relevant to your career ambitions, we'd be happy to get to know you! Join us at Zalando as a Principal Engineer and help us shape the role.