Zalando Tech Radar - Scaling Contributions to Technology Selection
Learn how we scaled contributions to Zalando Tech Radar
In our previous post about Technology Choices at Zalando we spoke about a few problems with scaling technology selection in Tech companies. Since then, we have focused on the remaining categories of the Tech Radar beyond languages and the Tech Radar contribution process. Now, we'd like to reflect on our lessons learned, which you can use when designing technology selection processes.
One of the challenges for us to solve was scaling contributions to the Tech Radar across our 250+ delivery teams. Technologists are often more excited about promoting a new, promising technology than about working on guidelines or sharing knowledge about already well-known tech. Such individuals are also essential for continued innovation. On the other hand, companies look for organizational efficiency by ensuring talent mobility across teams supported by a more or less standardized tech stack. This makes it easier to address cross-team dependencies in product delivery by allowing teams to contribute to code bases beyond their area of responsibility. Further, it creates career opportunities for Engineers, who can quickly switch teams and work on a challenging, high-impact project. Thus, for technology selection, there is a natural tension between early adopters' vested interest and the needs of the organization they work for. At Zalando, we have created a two-sided contribution model to the Tech Radar:
- Anyone at Zalando is encouraged to contribute knowledge about technologies already on the Tech Radar or to suggest promising ones to evaluate. These contributors play a key role in the process.
- Our Principal Engineers are the maintainers of the Tech Radar. They moderate information collection on incoming suggestions, drive the creation of good practices for technologies being evaluated or used, and promote technologies to increase their adoption.
Ring change suggestions are supported by issue templates in our internal Tech Radar GitHub repository. These templates provide guidance on common questions around use case fit, key differences from alternatives already on the Tech Radar, conformance to our Technology Selection Principles, and support within the Engineering Community.
We encourage and expect our Engineers to contribute information about usage, lessons learned from production incidents, or challenges they face at scale. Voluntary contributions alone are insufficient to keep an updated view of the technologies we use. Thus, to support usage information collection, we gather usage data from our AWS accounts, source code repositories, and infrastructure platform offerings. This information is compiled in a documentation page with a common structure across all entries.
Finally, we leverage Principal Engineers to moderate and drive discussions around technology adoption at Zalando. These colleagues have a sufficiently broad view of technology usage and performance in production across multiple teams and serve as a multiplying factor. They're responsible for encouraging the teams they work with to share knowledge and for highlighting technology usage based on the software systems in their areas - either themselves or by enabling others to do so. Additionally, they moderate discussions within technology guilds or initiate working groups to create specific artifacts for the technologies, like collections of good practices or guidelines tailored to our environment, use cases, and scale. Such working groups are also excellent opportunities to develop or identify talent within the company.
Re-scoring - how did we decide on changes?
After a longer period of time with no regular changes to the Tech Radar, we had a re-scoring exercise to complete. A similar approach was used originally at ThoughtWorks and can be used to create a Tech Radar from the ground up.
Within our Principal Engineering Community, we formed a working group per dimension: Datastores, Data processing, Infrastructure, and Queues. Our Tech Radar visualization merges Data processing and Queues in a single Data Management dimension for simplicity. Each working group was responsible for the data collection and analysis. One person from each group compiled the information in a structured format that, for each technology, made the case for or against a ring change. The reasoning was supported by data points on usage, incidents, and the expertise we had gained since the technology was added to the Tech Radar (a few years in some cases), as well as conformance with our Technology Selection Principles. Where necessary to build a solid case, we reached out to teams to understand more about their use cases or experience, if this was not sufficiently documented through recent information in our Tech Radar.
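To illustrate what such a structured, per-technology case could look like, here is a minimal Python sketch. It is not the format we actually used: the `RingChangeCase` class, its field names, and the `ExampleDB` entry are hypothetical, and the `ASSESS` ring is assumed alongside the ADOPT, TRIAL, and HOLD rings named in this post.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Ring(Enum):
    # ADOPT, TRIAL and HOLD appear in the post; ASSESS is an assumption.
    ADOPT = "ADOPT"
    TRIAL = "TRIAL"
    ASSESS = "ASSESS"
    HOLD = "HOLD"


@dataclass
class RingChangeCase:
    """Hypothetical record holding one technology's re-scoring case."""
    technology: str
    dimension: str
    current_ring: Ring
    proposed_ring: Ring
    usage: List[str] = field(default_factory=list)
    incidents: List[str] = field(default_factory=list)
    expertise: List[str] = field(default_factory=list)
    conforms_to_principles: Optional[bool] = None  # None = not yet assessed

    @property
    def is_change(self) -> bool:
        """True when the case argues for moving to a different ring."""
        return self.proposed_ring != self.current_ring

    def gaps(self) -> List[str]:
        """Names of the fields that still have no data recorded."""
        missing = [name for name in ("usage", "incidents", "expertise")
                   if not getattr(self, name)]
        if self.conforms_to_principles is None:
            missing.append("conforms_to_principles")
        return missing


# Placeholder entry, not a real Tech Radar technology.
case = RingChangeCase("ExampleDB", "Datastores", Ring.TRIAL, Ring.ADOPT,
                      usage=["placeholder usage data point"])
print(case.is_change)  # True
print(case.gaps())     # ['incidents', 'expertise', 'conforms_to_principles']
```

A `gaps()` helper of this kind would flag the cases where reaching out to teams is needed before the case can be reviewed.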
Based on the collected data, Principal Engineers participated in a review and re-scoring exercise. We collected votes in a spreadsheet. Every 'nay' vote required a short rationale, which we later discussed in the group to ensure we had not missed any usage or use cases. We also found inconsistencies in the way we handle technologies with multiple deployment options (self-hosted vs. managed or vendor offerings), for which we have not found a good solution yet.
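The voting rule described above (every 'nay' needs a rationale, and the rationales feed the group discussion) can be sketched in a few lines of Python. This is only an illustration under assumptions: we used a spreadsheet, not code, and the `Vote` class, the function names, and the sample voters and technology are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Vote:
    voter: str
    technology: str
    approve: bool        # True = 'yea', False = 'nay'
    rationale: str = ""  # required for 'nay' votes


def nays_missing_rationale(votes: List[Vote]) -> List[Vote]:
    """Return the 'nay' votes that lack the required rationale."""
    return [v for v in votes if not v.approve and not v.rationale.strip()]


def tally(votes: List[Vote]) -> Dict[str, dict]:
    """Count yea/nay per technology and collect nay rationales to discuss."""
    result: Dict[str, dict] = {}
    for v in votes:
        entry = result.setdefault(
            v.technology, {"yea": 0, "nay": 0, "to_discuss": []})
        if v.approve:
            entry["yea"] += 1
        else:
            entry["nay"] += 1
            if v.rationale.strip():
                entry["to_discuss"].append(v.rationale.strip())
    return result


# Hypothetical sample votes on a placeholder technology.
votes = [
    Vote("alice", "ExampleDB", True),
    Vote("bob", "ExampleDB", False, "Unclear operational ownership"),
    Vote("carol", "ExampleDB", False),  # invalid: nay without rationale
]
print([v.voter for v in nays_missing_rationale(votes)])  # ['carol']
print(tally(votes)["ExampleDB"])
# {'yea': 1, 'nay': 2, 'to_discuss': ['Unclear operational ownership']}
```

Keeping the rationales next to the counts mirrors the intent of the exercise: the discussion of objections mattered more than the raw tally.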
After the voting, the collected ring changes were discussed with our Senior Leadership Team. The main focus was on ensuring long-term support for the technologies we promote to ADOPT and that technologies on lower rings are in line with long-term strategies (e.g. Data Strategy).
Finally, we shared the changes with our Engineers, along with a detailed rationale per ring change and further information on the re-scoring process and on contributions moving forward.
With the re-scoring, we moved a few technologies to ADOPT, confirming our investment in these. To scale adoption, in some cases, we formed dedicated teams that operate service offerings available to all Zalando Engineers and Data Scientists.
Apache Airflow is a Workflow Orchestration tool used by data teams in Zalando. We have a central infrastructure team responsible for managing Airflow as a Service for our data teams.
We've been using Apache Spark for various analytical and Machine Learning use cases and talked about our usage before (see Data Warehousing with Spark Streaming at Zalando). Databricks is also the core element of our Machine Learning Platform, available to all Engineers. More recently, we went from a centralized Data Lake approach towards a distributed Data Mesh architecture backed by Spark and built on Delta Lake powered by Databricks. See our talk Data Mesh in Practice: How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake for more information.
We've blogged about our GraphQL usage before. Over the past 2.5 years, 200+ developers have contributed to the GraphQL API layer powering the Zalando shop. We also have other use cases in production, for example in back-office applications for our Buying department.
Kotlin & TypeScript
Having seen continued and growing usage of Kotlin and TypeScript, we have initiated workstreams within our language guilds to define guidelines, coding standards, reference projects, and service templates. These artifacts help teams adopt the languages moving forward. Further, they help build a shared understanding of what we consider production-proven frameworks and libraries, along with recommended configuration options. We've shared our TypeScript best practices in the past, as well as more details about promoting Kotlin at Zalando.
We have blogged before about our usage of Amazon SageMaker for ML Pipelines with Real-Time Inference and for distributed training. See also our talk on using SageMaker for training ML models from the AWS Summit 2019.
Tech Radar changes moving forward and future focus
The re-scoring exercise described in this post was a house-keeping exercise supported by clarifying the purpose of the Tech Radar, long-term ownership, and the contribution model. The number of upcoming changes will of course depend on contributions from our Engineering Community and our appetite for trying out new technologies. While changes to ADOPT/HOLD are going to be evaluated on a quarterly basis, we have a steady stream of ongoing assessments and trials.
The Principal Engineering Community focuses on:
- supporting and guiding contributions from the Engineering Community,
- identifying promising technologies to invest in,
- collecting best practices and expertise around technologies on TRIAL and ADOPT.
With the last point, we aim to define paved roads for Engineers for the key and most common technologies: for example, battle-tested configurations for typical use cases, or standardized monitoring dashboards with explanations. While this is already the case today for our PostgreSQL as a Service offering built on top of Patroni and Postgres Operator, given a dedicated team responsible for this infrastructure, we don't have such guidance collected across all our ADOPT technologies yet.
Challenges we have not solved yet
There are a few challenges that the Tech Radar does not solve for today, mostly related to consistency and completeness of the technology landscape. If we resolve any of these challenges, we will surely share our insights and lessons learned.
Some technologies (e.g. etcd) have been successfully used in our infrastructure teams, but we would not want any delivery team to use these (e.g. for configuration management counting as "infrastructure") as we have more suitable building blocks in our platform.
In other cases, we have invested in service offerings built around open-source software (e.g. Airflow), and we would rather have teams extend this platform offering than deploy their own infrastructure.
We also have solutions built in-house (e.g. our request router - Skipper) which are an essential part of our cloud infrastructure. Teams cannot easily opt out of these. These technologies will most likely be moved to a different place that will represent the maturity of the development infrastructure at Zalando from a Product perspective.
For technologies where we chose vendor offerings built on top of a technology (e.g. Databricks for Spark), the question arises whether to include one or both, and with which ring assignment (setting Spark to HOLD while keeping Databricks on ADOPT may sound confusing). Here, we are considering listing the underlying technology and outlining the recommended deployment options.
Finally, there are 3rd party products that allow us to deliver solutions faster, without the need to reinvent the wheel. One example is Content Management Systems - we've built a few over the past years and strive not to do this again. The open question is how to make these products sufficiently visible to our Engineers, so that they're considered while building future products for our customers.
If you would like to work on similar challenges and help scale our approach to technology selection, consider joining our engineering teams at Zalando.