<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Zalando Engineering Blog</title><link href="https://engineering.zalando.com/" rel="alternate"/><link href="https://engineering.zalando.com/atom.xml" rel="self"/><id>https://engineering.zalando.com/</id><updated>2026-06-08T00:00:00+02:00</updated><entry><title>Introducing Lightstep UQL to PromQL Translator</title><link href="https://engineering.zalando.com/posts/2026/06/introducing-lightstep-uql-to-promql-translator.html" rel="alternate"/><published>2026-06-08T00:00:00+02:00</published><updated>2026-06-08T00:00:00+02:00</updated><author><name>Maksim Pershin</name></author><id>tag:engineering.zalando.com,2026-06-08:/posts/2026/06/introducing-lightstep-uql-to-promql-translator.html</id><summary type="html">&lt;p&gt;Automate telemetry query migration from Lightstep UQL to PromQL with this open-source Go SDK and Web UI.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;The Sunset of Lightstep and the Migration Challenge&lt;/h2&gt;
&lt;p&gt;Lightstep is sunsetting. For teams that built their observability infrastructure around it,
this means migrating thousands of telemetry queries, dashboards, and alerts to a new backend.
At Zalando, we faced exactly this problem when transitioning to Dash0, our new PromQL-based
telemetry vendor.&lt;/p&gt;
&lt;p&gt;The task is simple to state but hard to do well: rewriting UQL (Unified Query Language) queries
into PromQL. Doing this manually is slow and error-prone. Even with LLMs, the translations are
unpredictable and often produce subtle bugs that break production alerts. When you're dealing
with thousands of queries that monitor critical systems, you need automation that produces
correct and consistent results.&lt;/p&gt;
&lt;h2&gt;Introducing the Lightstep UQL to PromQL Translator&lt;/h2&gt;
&lt;p&gt;To solve this, I built a production-ready translator that converts Lightstep UQL queries to
PromQL. The tool is written in Go and designed to be used in two ways: as a library (SDK) for
translation in your own code, and as an HTTP server with a Web UI for interactive ad-hoc translation.&lt;/p&gt;
&lt;p&gt;The translator implements a full parsing pipeline: a lexer tokenizes UQL queries, a parser
builds an abstract syntax tree (AST), an optimizer applies transformations to simplify the query
structure, and finally the translator generates clean PromQL output.&lt;/p&gt;
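&lt;p&gt;To make the pipeline idea concrete, here is a deliberately tiny Go sketch. It is not the project's actual code. It assumes only two UQL stages, &lt;code&gt;metric&lt;/code&gt; and &lt;code&gt;rate&lt;/code&gt;, folds the lexer and parser into a single splitting step, and skips the optimizer entirely. The names &lt;code&gt;stage&lt;/code&gt;, &lt;code&gt;parsePipeline&lt;/code&gt;, and &lt;code&gt;generate&lt;/code&gt; are hypothetical, and the PromQL it emits is only a plausible shape that the real translator may render differently.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// stage is one step of a UQL pipeline, for example "rate 5m".
type stage struct {
	op   string
	args []string
}

// parsePipeline splits the query on "|" and each stage on whitespace.
// A real lexer and parser would also handle quoting, nesting, and positions.
func parsePipeline(query string) []stage {
	var stages []stage
	for _, part := range strings.Split(query, "|") {
		fields := strings.Fields(part)
		if len(fields) == 0 {
			continue
		}
		stages = append(stages, stage{op: fields[0], args: fields[1:]})
	}
	return stages
}

// generate folds the stages, left to right, into one PromQL expression.
func generate(stages []stage) (string, error) {
	expr := ""
	for _, s := range stages {
		if len(s.args) != 1 {
			return "", fmt.Errorf("stage %q expects exactly one argument", s.op)
		}
		switch s.op {
		case "metric":
			expr = s.args[0]
		case "rate":
			expr = fmt.Sprintf("rate(%s[%s])", expr, s.args[0])
		default:
			return "", fmt.Errorf("unsupported stage %q", s.op)
		}
	}
	return expr, nil
}

func main() {
	out, err := generate(parsePipeline("metric http_requests_total | rate 5m"))
	if err != nil {
		fmt.Println("translation failed:", err)
		return
	}
	fmt.Println(out)
}
```

&lt;p&gt;Even at this size, the benefit of a structured pipeline shows: an unsupported stage fails loudly with a named error instead of silently producing a wrong query.&lt;/p&gt;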
&lt;p&gt;The project is available at &lt;a href="https://github.com/zalando/lightstep-uql-to-promql-translator"&gt;github.com/zalando/lightstep-uql-to-promql-translator&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;How to Use the Translator&lt;/h2&gt;
&lt;h3&gt;Web UI and REST API&lt;/h3&gt;
&lt;p&gt;The simplest way to start is running the HTTP server. It provides both a browser-based UI for
interactive translation and a REST API for automation.&lt;/p&gt;
&lt;p&gt;First, configure your metric types in &lt;code&gt;cmd/main.go&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;log&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/zalando/lightstep-uql-to-promql-translator/pkg/model&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/zalando/lightstep-uql-to-promql-translator/pkg/promql&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/zalando/lightstep-uql-to-promql-translator/pkg/server&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;TranslateUQLToPromQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;metricTypes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MetricType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http_requests_total&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;METRIC_TYPE_SUM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;cpu_usage&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;METRIC_TYPE_GAUGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;request_duration&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;METRIC_TYPE_HISTOGRAM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;metricConfig&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SpecialMetricConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;SpansCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spans.count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;SpansLatency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spans.latency&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;SpansCountUnadjusted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spans.count_unadjusted&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;LogsCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;logs.count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metricTypes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metricConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;srv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;:8080&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;TranslateUQLToPromQL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;srv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then start the server:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;go&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;cmd/main.go
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Web UI will be available at &lt;code&gt;http://localhost:8080&lt;/code&gt;. You can paste UQL queries and see the
PromQL translation instantly, with detailed error messages if something goes wrong.&lt;/p&gt;
&lt;p&gt;For programmatic access, use the REST API:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://localhost:8080/api/translate&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{&amp;quot;query&amp;quot;: &amp;quot;metric http_requests_total | rate 5m&amp;quot;}&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
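&lt;p&gt;The same call can be made from Go. The sketch below reuses the endpoint and request body from the curl example. The helper names &lt;code&gt;buildPayload&lt;/code&gt; and &lt;code&gt;translateRemote&lt;/code&gt; are hypothetical, sending a JSON content type is an assumption, and the response is printed as raw text because its schema is not described here.&lt;/p&gt;

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// buildPayload encodes the request body used by the curl example.
func buildPayload(uql string) ([]byte, error) {
	return json.Marshal(map[string]string{"query": uql})
}

// translateRemote posts one UQL query and returns the HTTP status code
// together with the raw response body.
func translateRemote(baseURL, uql string) (int, string, error) {
	payload, err := buildPayload(uql)
	if err != nil {
		return 0, "", err
	}
	// Assumption: the server accepts a JSON content type.
	resp, err := http.Post(baseURL+"/api/translate", "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	status, body, err := translateRemote("http://localhost:8080", "metric http_requests_total | rate 5m")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(status, body)
}
```

&lt;p&gt;Checking the status code before using the body is a sensible habit in batch migrations, so that one failed translation does not go unnoticed among thousands.&lt;/p&gt;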

&lt;h3&gt;Go SDK&lt;/h3&gt;
&lt;p&gt;For embedding translation logic directly into your tools or CI pipelines, use the SDK:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;fmt&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;log&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/zalando/lightstep-uql-to-promql-translator/pkg/promql&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;metric http_requests_total | rate 5m | group_by [method], sum&amp;quot;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;metricTypes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MetricType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http_requests_total&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;METRIC_TYPE_SUM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;metricConfig&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SpecialMetricConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;SpansCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spans.count&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;SpansLatency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spans.latency&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;promqlQuery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metricTypes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metricConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Translation error: %s at position %d&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SourceIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;PromQL:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;promqlQuery&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For more details on configuration options and advanced usage, see the
&lt;a href="https://github.com/zalando/lightstep-uql-to-promql-translator"&gt;project repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Internal Impact: Powering the Move to Dash0&lt;/h2&gt;
&lt;p&gt;At Zalando, this translator was one of the core tools driving our migration
from Lightstep to Dash0. We deployed an internal instance of the Web UI that engineering
teams used for self-service query translation. The SDK was integrated into automation tools
that converted existing alerts and dashboard queries. The translator handled the full range
of UQL queries we had in production: from simple metric fetches to complex joins
with multiple subqueries.&lt;/p&gt;
&lt;h2&gt;The Business Value&lt;/h2&gt;
&lt;p&gt;The main value is time saved. Manually translating thousands of queries would take weeks
or months of engineering time. More importantly, it eliminates translation errors.
Production alerts need to work correctly - a broken alert means incidents go unnoticed.
Manual translation introduces bugs. LLM-based translation is unpredictable and produces
subtly wrong queries that only fail under specific conditions. A deterministic translator
that parses the query structure and applies robust transformation rules was the only
reliable approach.&lt;/p&gt;
&lt;h2&gt;Helping the Community Transition&lt;/h2&gt;
&lt;p&gt;Since Lightstep is sunsetting, many organizations face the same migration challenge.
Releasing this tool as open source helps the engineering community transition to PromQL-based
backends - whether that's Prometheus, Dash0, or any other vendor supporting PromQL.
The project is production-ready and actively used at Zalando.&lt;/p&gt;
&lt;p&gt;If you're migrating from Lightstep, you can start using the translator today:
&lt;a href="https://github.com/zalando/lightstep-uql-to-promql-translator"&gt;project repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Community contributions are welcome!&lt;/p&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Golang"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>Rejecting Invalid Ingress Routes at Apply Time</title><link href="https://engineering.zalando.com/posts/2026/04/skipper-validating-admission-webhook.html" rel="alternate"/><published>2026-04-09T00:00:00+02:00</published><updated>2026-04-09T00:00:00+02:00</updated><author><name>Veronika Volokitina</name></author><id>tag:engineering.zalando.com,2026-04-09:/posts/2026/04/skipper-validating-admission-webhook.html</id><summary type="html">&lt;p&gt;How Zalando used Skipper as a validating admission webhook to reject invalid filters and predicates at apply time, and what it took to make that safe on the Kubernetes control-plane path.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt; is an open-source HTTP router and reverse proxy that can also run as a
Kubernetes ingress controller. At Zalando, it is the component that turns &lt;code&gt;Ingress&lt;/code&gt; and &lt;code&gt;RouteGroup&lt;/code&gt; configuration into
live routing behavior. Its routing model is powerful because requests can be matched by predicates, transformed by
filters, and then forwarded to backends.&lt;/p&gt;
&lt;p&gt;The downside is that Kubernetes has no understanding of Skipper-specific filters and predicates, and therefore cannot
validate them through
standard &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/"&gt;Admission Control&lt;/a&gt;. For
example, a route might reference a non-existing predicate, use a filter with invalid parameters, or define a backend
that cannot be parsed. Kubernetes accepts this configuration because it is syntactically valid, but from Skipper’s
perspective, the route is broken.&lt;/p&gt;
&lt;p&gt;At Zalando scale, invalid routes like these are a critical problem. We run Skipper across &lt;strong&gt;250+ Kubernetes clusters&lt;/strong&gt;, with &lt;strong&gt;15k+
ingresses&lt;/strong&gt;, &lt;strong&gt;~200k
routes&lt;/strong&gt;, and &lt;strong&gt;500k-2M RPS&lt;/strong&gt;. At that size, even &lt;strong&gt;1% invalid routes&lt;/strong&gt; is not background noise. It is real production
risk.&lt;/p&gt;
&lt;p&gt;The goal was simple: reject invalid Skipper routing configuration during &lt;code&gt;kubectl apply&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;How Skipper sees a route&lt;/h2&gt;
&lt;p&gt;A Skipper route is essentially:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;routeId: Predicates -&amp;gt; filters -&amp;gt; backend
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The route ID names the route. Predicates decide whether a request matches it. Filters modify the request or response.
The backend defines where the traffic goes next.&lt;/p&gt;
&lt;p&gt;A simplified example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;canary&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Host&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;^edge[.]internal$&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;X-Canary&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;setPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/v2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;http://checkout:8080&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This model is one of the reasons Skipper works well as an ingress controller. It translates &lt;code&gt;Ingress&lt;/code&gt; and &lt;code&gt;RouteGroup&lt;/code&gt;
resources into &lt;a href="https://pkg.go.dev/github.com/zalando/skipper/eskip"&gt;eskip&lt;/a&gt; routes and gives teams a rich routing
language without requiring a separate proxy configuration format.&lt;/p&gt;
&lt;p&gt;At the same time, it means many routing mistakes are invisible to Kubernetes itself. A typo like
&lt;code&gt;Headr("X-Canary", "true")&lt;/code&gt; instead of &lt;code&gt;Header(...)&lt;/code&gt; is still just a string from the API server's point of view. The
manifest can be structurally valid while the resulting Skipper route is not.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Skipper ingress controller overview" src="https://engineering.zalando.com/posts/2026/04/images/01-skipper-ingress-controller-overview.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Skipper in the ingress-controller setup. The validating webhook sits on the Kubernetes write path, while Skipper serves live traffic on the data plane.&lt;/figcaption&gt;

&lt;p&gt;Before this work, the validating webhook already provided basic semantic checks, which was useful, but it did not validate filters and predicates against Skipper's runtime registry.&lt;/p&gt;
&lt;p&gt;For developers, the feedback loop was poor. The &lt;code&gt;apply&lt;/code&gt; looked successful, while the actual routing problem surfaced
later and in a different place.&lt;/p&gt;
&lt;h2&gt;Letting Skipper validate Skipper&lt;/h2&gt;
&lt;p&gt;The key design choice was to stop treating route validation as a generic string-parsing problem and instead reuse
Skipper's own validation logic on the admission path.&lt;/p&gt;
&lt;p&gt;For matching &lt;code&gt;CREATE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; requests on the resources we care about, the API server sends an &lt;code&gt;AdmissionReview&lt;/code&gt; to
the validating webhook. The webhook extracts the Skipper-specific configuration from the object and validates it using
Skipper's own filter registry, predicate specifications, route validation, and backend checks.&lt;/p&gt;
&lt;p&gt;This changes the contract in an important way. The webhook no longer answers "Does this string parse?" Instead, it
answers "Would Skipper accept this route?"&lt;/p&gt;
&lt;p&gt;That is the difference between basic admission checks and admission-time route validation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Skipper in webhook mode" src="https://engineering.zalando.com/posts/2026/04/images/02-skipper-in-webhook-mode.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Using Skipper on the admission path adds Skipper-specific validation before the object is applied.&lt;/figcaption&gt;

&lt;h2&gt;What happens during &lt;code&gt;kubectl apply&lt;/code&gt; now&lt;/h2&gt;
&lt;p&gt;From the user's perspective, the flow is straightforward.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;An engineer or a deployment system submits an &lt;code&gt;Ingress&lt;/code&gt; or &lt;code&gt;RouteGroup&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Kubernetes performs the usual authentication, authorization, and built-in object validation.&lt;/li&gt;
&lt;li&gt;For matching resources and operations, the API server calls the validating webhook over HTTPS.&lt;/li&gt;
&lt;li&gt;The webhook runs Skipper validation on the relevant route configuration and returns either &lt;strong&gt;allow&lt;/strong&gt; or &lt;strong&gt;deny&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Only valid objects are persisted.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If the route configuration is invalid, the request is rejected immediately and the error is returned to the caller. The
failure stays on the deployment path, which is where it is most useful.&lt;/p&gt;
&lt;h2&gt;Useful errors at apply time&lt;/h2&gt;
&lt;p&gt;Fast rejection is only useful if the error message is actionable.&lt;/p&gt;
&lt;p&gt;For example, consider this &lt;code&gt;Ingress&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Ingress&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;demo-invalid-unknown-predicate&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;default&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;zalando.org/skipper-predicate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;NonExistingPredicate()&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;demo.example&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;pathType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ImplementationSpecific&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;demo-app&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;                  &lt;/span&gt;&lt;span class="nt"&gt;number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When this object is applied, the rejection can be explicit:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;➜&lt;span class="w"&gt; &lt;/span&gt;kubectl&lt;span class="w"&gt; &lt;/span&gt;apply&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;simple-app.yaml

Error&lt;span class="w"&gt; &lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;server:&lt;span class="w"&gt; &lt;/span&gt;error&lt;span class="w"&gt; &lt;/span&gt;when&lt;span class="w"&gt; &lt;/span&gt;creating&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;simple-app.yaml&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;admission&lt;span class="w"&gt; &lt;/span&gt;webhook&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ingress-admitter.teapot.zalan.do&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;denied&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;request:&lt;span class="w"&gt; &lt;/span&gt;invalid&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;zalando.org/skipper-predicate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;annotation:&lt;span class="w"&gt; &lt;/span&gt;unknown_predicate:&lt;span class="w"&gt; &lt;/span&gt;unknown_predicate:&lt;span class="w"&gt; &lt;/span&gt;predicate&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NonExistingPredicate&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;found
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same applies to invalid filter parameters or backend problems. The important part is not just that the request is
rejected, but that the engineer who made the change can fix it without going hunting through runtime logs.&lt;/p&gt;
&lt;h2&gt;Rollout strategy&lt;/h2&gt;
&lt;p&gt;This was the most challenging part.&lt;/p&gt;
&lt;p&gt;The validation logic itself was relatively direct. The harder part was making the rollout seamless for users. Even if a
problem here does not affect existing customer traffic, it could still block Kubernetes writes. In practice, that means
blocked CI/CD pipelines, delayed service updates, and engineers unable to ship changes. Existing traffic keeps flowing,
but change stops.&lt;/p&gt;
&lt;p&gt;We approached the rollout as a control-plane change. First, I added metrics that made invalid routes visible during
rollout. The most useful signal was &lt;code&gt;skipper_route_invalid{route_id, reason}&lt;/code&gt;, which told me exactly which route failed
validation and why. That made it much easier to distinguish real configuration mistakes from false positives in the
validator.&lt;/p&gt;
&lt;p&gt;Then I rolled the feature tier by tier across clusters. After each rollout step, we watched webhook health, latency,
denials, and invalid-route metrics before moving to the next tier. The goal was not only to prove that validation
worked, but also that it stayed predictable under normal production usage.&lt;/p&gt;
&lt;p&gt;We also kept advanced validation behind the &lt;code&gt;-enable-advanced-validation&lt;/code&gt; feature flag. That gave us a fast rollback path
without removing the webhook itself. During the rollout, we did encounter cases where some routes were rejected even
though they should have been accepted. In those cases, we turned advanced validation off, fixed the issues, and continued
the rollout once the behavior was correct again.&lt;/p&gt;
&lt;p&gt;I later presented this solution at an internal Zalando conference, and one of the first questions was how teams could
enable it in their clusters. The satisfying part was answering that they did not need to do anything, because it was
already enabled. That is probably the best possible result for this kind of rollout.&lt;/p&gt;
&lt;h2&gt;Operational outcome&lt;/h2&gt;
&lt;p&gt;The biggest improvement was not a benchmark number. It was moving failure to the right place.&lt;/p&gt;
&lt;p&gt;Before, invalid route configuration could be accepted by Kubernetes and only discovered later in the routing layer. In
practice, that meant people showed up in support channels asking why their requests did not work.&lt;/p&gt;
&lt;p&gt;After the webhook change, invalid route configuration is rejected during deployment, before it becomes part of the
cluster state. That made the feedback loop much faster and kept the error attached to the change that caused it.&lt;/p&gt;
&lt;p&gt;The implementation is now part of open-source &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;, available from
&lt;a href="https://github.com/zalando/skipper/releases/tag/v0.24.18"&gt;v0.24.18&lt;/a&gt;, so the same pattern can be reused outside Zalando
as well.&lt;/p&gt;</content><category term="Zalando"/><category term="Platform Engineering"/><category term="Kubernetes"/><category term="Skipper"/><category term="SRE"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>Search Quality Assurance with AI as a Judge</title><link href="https://engineering.zalando.com/posts/2026/03/search-quality-assurance-with-llm-judge.html" rel="alternate"/><published>2026-03-17T00:00:00+01:00</published><updated>2026-03-17T00:00:00+01:00</updated><author><name>Tao Ruangyam</name></author><id>tag:engineering.zalando.com,2026-03-17:/posts/2026/03/search-quality-assurance-with-llm-judge.html</id><summary type="html">&lt;p&gt;Deep dive into how Zalando builds a search quality assurance framework with LLM-as-a-judge to evaluate the search quality at scale with high coverage and multi-language support.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In 2024, Zalando research published a paper on &lt;a href="https://engineering.zalando.com/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html"&gt;LLM-as-a-judge for search quality assurance&lt;/a&gt; at scale. The framework allows scientists and developers to effectively evaluate the semantic relevance of search results of the given search queries at large scale with multi-language support. This capability has strong potential to help the Search engineering team to quickly identify and fix search issues which we will walk through in this post.&lt;/p&gt;
&lt;h2&gt;Real-world use case: Launching a new country&lt;/h2&gt;
&lt;p&gt;In 2025 Zalando expanded its fashion store business into 3 new countries: &lt;a href="https://www.zalando.lu/"&gt;Luxembourg&lt;/a&gt;, &lt;a href="https://www.zalando.pt/"&gt;Portugal&lt;/a&gt; and &lt;a href="https://www.zalando.gr/"&gt;Greece&lt;/a&gt;. Ensuring these markets have a good search experience is critical for the success of the launch, but how can we do that without any prior search data from real users?&lt;/p&gt;
&lt;p&gt;Before using LLM-as-a-judge, the search quality assurance process was heavily reliant on human experts and a manual process as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Because the new markets are not live yet, we do not know which search queries will work well in them. We therefore have to draw sample search queries from the existing markets, translate them if the new market operates in a different language, and test the search system manually. Human experts have to annotate error cases and identify cases where search returns poor-quality results.&lt;/li&gt;
&lt;li&gt;Root cause diagnosis in both scenarios (errors / poor results) is also performed by the same experts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not only is this process not scalable, but it is also &lt;em&gt;reactive&lt;/em&gt; by nature, meaning that issues are only identified after features are launched and users have already experienced them, since we rely on signals coming from real users such as low CTR. For an entirely new country, these signals are by definition not there yet. We need a more  &lt;em&gt;proactive&lt;/em&gt; approach that ensures quality before launch.&lt;/p&gt;
&lt;h2&gt;Data-Driven Approach with LLM-as-a-judge&lt;/h2&gt;
&lt;p&gt;With the recent advances in LLM capabilities, it is now possible to use &lt;strong&gt;LLM-as-a-judge&lt;/strong&gt; techniques to evaluate search quality at scale for new markets before launch. Our aforementioned &lt;a href="https://engineering.zalando.com/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html"&gt;study&lt;/a&gt; shows that it could achieve a high correlation with human judgement, so we decided to build a framework to automate the evaluation process at scale by focusing on the following key principles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High test coverage&lt;/strong&gt;: To gain confidence that the new market will have a high-quality search feature at launch, we need a wide range of tests covering different search scenarios, such as different product categories, different brands and product attributes, popular searches, seasonal or trending products, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid the need for handcrafted test cases&lt;/strong&gt;: Manual tests are not very scalable and could be biased. Also it is hard to transfer written test cases from one market to another. To reduce this effort, we want to automate the test generation, while still being able to add or customise the test cases if we need to.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-language support&lt;/strong&gt;: All new markets operate in different languages with different linguistic characteristics and we want to cover them well.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: By nature of the pre-launch testing, we want to be able to re-evaluate the search feature after we applied the fixes to verify that the improvements are effective.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Selection of Test Queries&lt;/h2&gt;
&lt;p&gt;Using real search queries from existing markets is a good starting point. We need to draw sample test queries from search data that cover as wide a range of search scenarios as possible. We should not simply take the N most frequent query terms, as different forms of a query may mean the same thing, e.g. "Winter boot" and "Boot for winter" express essentially the same search intent. They should belong together and be counted as the same search scenario. Therefore we need a good clustering approach.&lt;/p&gt;
&lt;p&gt;Our search operates on a mix of lexical search (text token-based) and &lt;a href="https://arxiv.org/abs/2103.00020"&gt;semantic vector search&lt;/a&gt;. For this, we have a &lt;a href="https://arxiv.org/html/2411.05057v1"&gt;Named entity recognition (NER)&lt;/a&gt; engine that extracts a wide range of attributes from the search queries, such as product name, brand, colour, size, season, occasion, material, etc. If we group search queries tagged with the same set of attributes together, we get a good clustering with meaningful intent.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;NER tags&lt;/th&gt;
&lt;th&gt;search queries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;category: kids, type: jacket, season: winter&lt;/td&gt;
&lt;td&gt;Kids Winter Jacket, Winter Jackets for Kids, Kids Jackets Winter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: shoes, brand: nike, type: sneakers&lt;/td&gt;
&lt;td&gt;Nike Sneakers, Nike Shoes, Nike Sneaker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: dress, occasion: party, color: black&lt;/td&gt;
&lt;td&gt;Black Party Dress, Party Dresses Black, Black Dress for Party&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: jeans, fit: slim, color: blue&lt;/td&gt;
&lt;td&gt;Blue Slim Jeans, Slim Fit Blue Jeans, Blue Jeans Slim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: boots, material: leather, season: winter&lt;/td&gt;
&lt;td&gt;Leather Winter Boots, Winter Boots Leather, Leather Boots Winter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can translate these queries to other languages using an LLM. This enables us to reuse search scenarios from existing markets for new markets with different languages, while the translated scenarios keep the same search intent. The example table below shows translated queries in Portuguese (a newly supported language) for a few NER tag groups.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;NER tags&lt;/th&gt;
&lt;th&gt;original search queries in EN&lt;/th&gt;
&lt;th&gt;translated search queries in PT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;category: kids, type: jacket, season: winter&lt;/td&gt;
&lt;td&gt;Kids Winter Jacket, Winter Jackets for Kids, Kids Jackets Winter&lt;/td&gt;
&lt;td&gt;Jaqueta de Inverno Infantil, Jaquetas de Inverno para Crianças, Jaquetas Infantis de Inverno&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: shoes, brand: nike, type: sneakers&lt;/td&gt;
&lt;td&gt;Nike Sneakers, Nike Shoes, Nike Sneaker&lt;/td&gt;
&lt;td&gt;Nike Sapatilhas, Nike Sapatos, Nike Sapatilha&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: dress, occasion: party, color: black&lt;/td&gt;
&lt;td&gt;Black Party Dress, Party Dresses Black, Black Dress for Party&lt;/td&gt;
&lt;td&gt;Vestido de Festa Preto, Vestidos de Festa Pretos, Vestido Preto para Festa&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: jeans, fit: slim, color: blue&lt;/td&gt;
&lt;td&gt;Blue Slim Jeans, Slim Fit Blue Jeans, Blue Jeans Slim&lt;/td&gt;
&lt;td&gt;Calças de Ganga Azuis Slim, Calças Slim Azuis, Calças de Ganga Slim Azuis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;category: boots, material: leather, season: winter&lt;/td&gt;
&lt;td&gt;Leather Winter Boots, Winter Boots Leather, Leather Boots Winter&lt;/td&gt;
&lt;td&gt;Botas de Couro de Inverno, Botas de Inverno em Couro, Botas de Couro Invernais&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To ensure broad test coverage, we sample search queries from existing markets with similar characteristics. Since search intent distribution differs across markets, we select the top N search groups by NER tags (representing search intent/topic) ranked by traffic share. This approach provides a representative test set for the target market and helps identify potential issues in popular search scenarios.&lt;/p&gt;
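&lt;p&gt;The grouping and top-N selection described above can be sketched in a few lines of Python. This is a minimal illustration, not our pipeline code: the sample rows, the search counts and the &lt;code&gt;top_segments&lt;/code&gt; helper are all made up for the example.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical sample: (query, NER tags, search count) rows from an existing market.
rows = [
    ("Kids Winter Jacket", {"category": "kids", "type": "jacket", "season": "winter"}, 120),
    ("Winter Jackets for Kids", {"category": "kids", "type": "jacket", "season": "winter"}, 80),
    ("Nike Sneakers", {"category": "shoes", "brand": "nike", "type": "sneakers"}, 300),
    ("Nike Sneaker", {"category": "shoes", "brand": "nike", "type": "sneakers"}, 100),
    ("Black Party Dress", {"category": "dress", "occasion": "party", "color": "black"}, 50),
]


def top_segments(rows, n):
    """Group queries by their NER tag set and return the n segments with the highest traffic share."""
    total = sum(count for _, _, count in rows)
    segments = defaultdict(lambda: {"queries": [], "count": 0})
    for query, tags, count in rows:
        key = tuple(sorted(tags.items()))  # order-independent, hashable segment key
        segments[key]["queries"].append(query)
        segments[key]["count"] += count
    ranked = sorted(segments.items(), key=lambda item: item[1]["count"], reverse=True)
    return [
        {"tags": dict(key), "queries": seg["queries"], "traffic_share": seg["count"] / total}
        for key, seg in ranked[:n]
    ]


top = top_segments(rows, 2)
for segment in top:
    print(segment["tags"], round(segment["traffic_share"], 3), segment["queries"])
```

&lt;p&gt;Using the sorted tag pairs as the key means that queries with the same tags end up in the same segment regardless of word order or tag order.&lt;/p&gt;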
&lt;h2&gt;How Does Search Quality Evaluation Work?&lt;/h2&gt;
&lt;p&gt;Having defined &lt;em&gt;how&lt;/em&gt; we select queries above, we can now break down our search quality evaluation process into 2 steps: &lt;strong&gt;1)&lt;/strong&gt; Generate the test queries with NER clustering and LLM translation, &lt;strong&gt;2)&lt;/strong&gt; Evaluate the search results with LLM-as-a-judge.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image showing generation flow" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-generator-flow.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 1: Test case generation flow.
&lt;/figcaption&gt;

&lt;p&gt;In our search infrastructure, every search request on the site is processed by our NER engine, served by our Search API, and then published to an asynchronous event stream (powered by &lt;a href="https://nakadi.io/"&gt;Nakadi&lt;/a&gt;, our RESTful event bus built on top of Kafka queues). At Zalando, we have a large-scale data processing platform with continuously running data pipelines that (1) consume these event streams, (2) process the data, and (3) persist the data into a Data Lake for analysis, reporting, and archival.&lt;/p&gt;
&lt;p&gt;We have built test case generation data pipelines that consume search request data from the Data Lake, cluster the requests by their NER tags, and transform them into formats ready for the tester pipeline to read. The entire flow is built on the OLAP (Online Analytical Processing) paradigm.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image showing LLM judge flow" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-eval-flow.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 2: LLM judge evaluation flow.
&lt;/figcaption&gt;

&lt;p&gt;Our LLM-as-a-judge (&lt;a href="https://arxiv.org/abs/2409.11860"&gt;paper&lt;/a&gt;) uses product data and product images for its evaluation context (visual-text). It generalises well across different languages and different search contexts. For example, for the query &lt;em&gt;"Kids Winter Jacket"&lt;/em&gt;, the model should give high relevance scores to jacket products of any brand or colour from kids categories, judged by their product attributes or their images. Search results that are just long-sleeve shirts, or adult items, should score lower. The reasoning is generalised and does not require specific prompts to instruct the LLM to look for specific attributes or specific parts of images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image showing LLM judge evaluation" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-scores.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 3: LLM judge evaluation results.
&lt;/figcaption&gt;

&lt;p&gt;The LLM judge is instructed to give a relevance score to each result item using clear scale criteria. For example, a score of 4 means a perfect match, and 0 means completely wrong or irrelevant to the search term. The scores in between represent varying degrees of relevance. This way we can evaluate the overall quality of the result set of each search query segment (represented by the NER tags). We can also identify wrong articles with very low relevance scores, which gives engineers a good hint at the root cause of the issue, e.g. wrong product data, misbehaving NER, underperforming ranker, etc.&lt;/p&gt;
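&lt;p&gt;As a rough sketch of how per-item scores on the 0 to 4 scale can be rolled up per segment, consider the snippet below. The judgement rows, the two thresholds and the &lt;code&gt;summarise&lt;/code&gt; helper are assumptions made for illustration; they are not the values or the code used in our pipelines.&lt;/p&gt;

```python
from statistics import mean

# Hypothetical judge output: one relevance score (0 to 4) per (segment, product).
judgements = [
    {"segment": "CATEGORY=fato de treino", "product": "sku-1", "score": 1},
    {"segment": "CATEGORY=fato de treino", "product": "sku-2", "score": 0},
    {"segment": "CATEGORY=fato de treino", "product": "sku-3", "score": 3},
    {"segment": "BRAND=nike CATEGORY=sapatilhas", "product": "sku-4", "score": 4},
    {"segment": "BRAND=nike CATEGORY=sapatilhas", "product": "sku-5", "score": 3},
]

REVIEW_BELOW_AVG = 2.5  # assumed threshold for segments worth investigating
SUSPECT_AT_MOST = 1     # assumed threshold for clearly irrelevant items


def summarise(judgements):
    """Average the scores per segment and list the items that scored very low."""
    by_segment = {}
    for row in judgements:
        by_segment.setdefault(row["segment"], []).append(row)
    report = {}
    for segment, rows in by_segment.items():
        avg = mean(row["score"] for row in rows)
        report[segment] = {
            "avg_score": avg,
            "needs_review": REVIEW_BELOW_AVG > avg,
            "suspect_products": [r["product"] for r in rows if SUSPECT_AT_MOST >= r["score"]],
        }
    return report


report = summarise(judgements)
for segment, summary in report.items():
    print(segment, summary)
```

&lt;p&gt;The segment average points to where search underperforms, while the list of suspect products gives a starting point for root cause analysis.&lt;/p&gt;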
&lt;h2&gt;Production time: The Evaluation Pipelines&lt;/h2&gt;
&lt;p&gt;Putting the concept together, we have built evaluation pipelines in &lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; to automate the whole process. The pipelines involve the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test query generation:&lt;/strong&gt; Past queries from existing markets are clustered with NER tags. Clusters with the highest traffic share are selected, and their queries are then submitted to the translation process using LLM if needed. For some markets like Luxembourg, we can directly use the English and French queries from our existing markets without a translation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search result retrieval:&lt;/strong&gt; This pipeline step spawns a task in our Kubernetes cluster, runs through the test queries and submits them to the search microservice to retrieve the search results. The results are then kept in an in-memory cache for the evaluation step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM evaluation:&lt;/strong&gt; The search results are then evaluated by the LLM-as-a-judge, for which we used GPT-4o during the pre-launch process. The pipeline submits the search results, their product data and images to the LLM and gets the relevance scores for each result. The scores are stored in the final evaluation report data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="image showing airflow ui" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-airflow-ui.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 4: Airflow UI showing the evaluation pipelines.
&lt;/figcaption&gt;

&lt;p&gt;&lt;img alt="image showing airflow dag" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-airflow-dag.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 5: Airflow DAG.
&lt;/figcaption&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/2.8.4/_api/airflow/decorators/task_group/index.html"&gt;Taskgroup&lt;/a&gt;&lt;/strong&gt;: We want to be able to evaluate multiple markets in parallel, where each market shares the same flow but with different test queries. Therefore we can implement each evaluation lineage as a task group and put all of them together in the same DAG. This way each task group can run independently in parallel and, once they are all finished, a final task consolidates all evaluation results together.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html#kubernetespodoperator"&gt;PodOperator&lt;/a&gt;&lt;/strong&gt;: The evaluation code is shipped in a Docker image and we can run it in our Kubernetes cluster via Airflow using PodOperator. This keeps the DAG code clean and simple, as all the complex evaluation logic and its dependencies are encapsulated in the image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache&lt;/strong&gt;: The search results of different search queries may share the same products, and letting the LLM judge retrieve the same product data and images multiple times would be inefficient and slow. Therefore we use a shared cache (ElastiCache), accessible only to the evaluation tasks, to store and re-use the product data. This saves significant evaluation time and cost. Instead of calling the Product API (5000 x 25) times for 5000 search queries with 25 results each, we only need to call it N times, where N is the number of unique products across all search results. This N grows more slowly than the number of search queries. We also store the evaluation result of each (query, product) pair in the cache, so that a previously evaluated result is reused if the same (query, product) pair appears again, which further saves time and LLM cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NER engine&lt;/strong&gt;: During execution, each search query is processed by the NER engine to extract its NER tag attributes. This allows us to compare the NER tags of the original search query and the translated search query, and identify inconsistencies that can lead to search issues, such as missing tags or incorrectly tagged attributes in the new language.&lt;/li&gt;
&lt;/ul&gt;
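&lt;p&gt;The caching idea from the list above can be illustrated with a small sketch. Plain dictionaries stand in for the shared cache, and the &lt;code&gt;CachedJudge&lt;/code&gt; class, the &lt;code&gt;fetch_product&lt;/code&gt; and &lt;code&gt;score_pair&lt;/code&gt; callables and the sample data are hypothetical names introduced only for this example.&lt;/p&gt;

```python
class CachedJudge:
    """Wraps product lookups and LLM scoring with caches keyed by product and by (query, product)."""

    def __init__(self, fetch_product, score_pair):
        self.fetch_product = fetch_product  # stand-in for a Product API call
        self.score_pair = score_pair        # stand-in for an LLM judge call
        self.products = {}                  # dict used here in place of a shared cache
        self.scores = {}
        self.product_calls = 0
        self.llm_calls = 0

    def product(self, product_id):
        if product_id not in self.products:
            self.product_calls += 1
            self.products[product_id] = self.fetch_product(product_id)
        return self.products[product_id]

    def score(self, query, product_id):
        key = (query, product_id)
        if key not in self.scores:
            self.llm_calls += 1
            self.scores[key] = self.score_pair(query, self.product(product_id))
        return self.scores[key]


# Toy run: 3 result lists with 2 products each, 3 unique products, one repeated query.
search_results = [
    ("nike sneakers", ["p1", "p2"]),
    ("nike shoes", ["p1", "p3"]),
    ("nike sneakers", ["p1", "p2"]),
]

judge = CachedJudge(
    fetch_product=lambda product_id: {"id": product_id},
    score_pair=lambda query, product: 4,
)
for query, product_ids in search_results:
    for product_id in product_ids:
        judge.score(query, product_id)

print(judge.product_calls, judge.llm_calls)  # 3 product fetches and 4 judge calls instead of 6 each
```

&lt;p&gt;In this toy run the number of product fetches equals the number of unique products, and the repeated query costs no additional judge calls.&lt;/p&gt;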
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;We tested 3 markets with the 1,500 most searched segments in each of them (represented by their NER tags), using the most frequently used search queries in each segment. We could easily identify segments that did not perform well. The example below shows some segments in the Portuguese market for which we identified low relevance scores before launch.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Segment (NER tag group)&lt;/th&gt;
&lt;th&gt;Avg relevance score&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CATEGORY=desporto&lt;/td&gt;
&lt;td&gt;1.5 / 4.0&lt;/td&gt;
&lt;td&gt;Queries with "desporto", "desportivo", "desportiva" did not have consistent term filters due to word lemmatization issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CATEGORY=zapatilhas&lt;/td&gt;
&lt;td&gt;2.4 / 4.0&lt;/td&gt;
&lt;td&gt;Terms "tenis", "ténis" (sneaker in Portuguese) could not be recognized and did not surface sport shoes in general, due to an ambiguity with the sport "tennis".&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GENDER=mulher CATEGORY=menina&lt;/td&gt;
&lt;td&gt;2.0 / 4.0&lt;/td&gt;
&lt;td&gt;Terms "menina", "meninas" could not be recognized, so searching for girls' articles returned mixed results from all genders and age groups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CATEGORY=fato de treino&lt;/td&gt;
&lt;td&gt;1.2 / 4.0&lt;/td&gt;
&lt;td&gt;Searching for tracksuit "fato de treino" did not show any sport or tracksuit results.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using the NER tags as segments helps us group semantically similar search queries together and confirm with confidence when specific categories are not performing well. In the example above for Portuguese results, engineers could investigate further and identify that the NER engine may have a lemmatization issue for the terms "desporto", "desportivo" and "desportiva", which all relate to the sport category. The terms "tenis" and "ténis" were not recognized because they did not exist in our product data, which our NER engine relies on for Lucene-based entity recognition. With one run, we can conveniently identify multiple issues at once.&lt;/p&gt;
&lt;p&gt;The same query should have the same NER tags by meaning across different languages. The following example shows results for the Greek market, where we can identify NER issues because some terms cannot be recognized. This leads to incorrect search filtering, and the LLM judge observes results with low relevance. We can easily confirm missing NER tags by comparing with the original tags from the source language.&lt;/p&gt;
&lt;p&gt;&lt;img alt="image showing NER untagged" src="https://engineering.zalando.com/posts/2026/03/images/llm-judge-undefined-ner.png"&gt;&lt;/p&gt;
&lt;p&gt;By comparing the original search queries and their NER tags with the translated queries and their tags, we could easily identify which NER tags were not recognized in the new languages. Since our search infrastructure uses these NER tags for catalog filtering, search is likely to return low-relevance results, which the LLM judge confirms via low relevance scores. This mechanism allowed us to quickly identify NER issues and fix them before the go-live in the Greek market.&lt;/p&gt;
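&lt;p&gt;A minimal sketch of such a tag comparison is shown below. It compares tag types rather than tag values, since values naturally differ between languages. The sample NER output and the &lt;code&gt;missing_tag_types&lt;/code&gt; helper are hypothetical and only illustrate the idea.&lt;/p&gt;

```python
# Hypothetical NER output for source queries and their translations.
pairs = [
    {
        "source_query": "Kids Winter Jacket",
        "source_tags": {"category": "kids", "type": "jacket", "season": "winter"},
        "translated_query": "Jaqueta de Inverno Infantil",
        "translated_tags": {"type": "jaqueta", "season": "inverno"},
    },
    {
        "source_query": "Nike Sneakers",
        "source_tags": {"brand": "nike", "type": "sneakers"},
        "translated_query": "Nike Sapatilhas",
        "translated_tags": {"brand": "nike", "type": "sapatilhas"},
    },
]


def missing_tag_types(source_tags, translated_tags):
    """Return the tag types present for the source query but absent after translation."""
    return sorted(set(source_tags) - set(translated_tags))


issues = {}
for pair in pairs:
    lost = missing_tag_types(pair["source_tags"], pair["translated_tags"])
    if lost:
        issues[pair["translated_query"]] = lost

print(issues)
```

&lt;p&gt;Here the first translated query loses its category tag, which is the kind of gap that would then show up as low relevance scores for that segment.&lt;/p&gt;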
&lt;p&gt;In conclusion, our evaluation system can either identify or hint at various types of issues. Notable ones include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incorrect product attributes or data&lt;/strong&gt;: Product categories with incorrect attributes have difficulty surfacing in search results despite different query variations. Multiple NER tag segments with similar meaning but consistently low relevance scores indicate this issue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unrecognized terms or attributes by NER&lt;/strong&gt;: The evaluation pipeline processes NER tagging (NER analyzer task in the Airflow DAG) to identify unrecognized terms. This helps validate spell correction and lemmatization in new languages, and determines whether to index missing terms for searchability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Undiscoverable products or categories&lt;/strong&gt;: This helps us identify if a brand, a product family, or a category is not discoverable by analyzing multiple search segments that share the same product tags from NER (example in the table below).&lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Segment (NER tag group)&lt;/th&gt;
&lt;th&gt;Avg relevance score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BRAND=foo CATEGORY=yoga&lt;/td&gt;
&lt;td&gt;1.8 / 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BRAND=foo CATEGORY=leggings&lt;/td&gt;
&lt;td&gt;1.6 / 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BRAND=foo GENDER=mulher CATEGORY=tops&lt;/td&gt;
&lt;td&gt;1.9 / 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BRAND=foo CATEGORY=fato de treino&lt;/td&gt;
&lt;td&gt;1.7 / 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BRAND=foo CATEGORY=jackets MATERIAL=nylon&lt;/td&gt;
&lt;td&gt;1.5 / 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The example in the table above shows that multiple segments with brand &lt;em&gt;Foo&lt;/em&gt; have low search relevance scores. This suggests that the product data may have quality issues, e.g. missing or wrong attributes, which make these products less discoverable by search. This helps us quickly identify which products or categories may need a closer check and a fix before launch.&lt;/p&gt;
&lt;h2&gt;Cost of evaluation&lt;/h2&gt;
&lt;p&gt;One full run costs around 250 USD, most of which is GPT-4o completion API cost. This is very cost-efficient for a scale of 1,500 search segments with 25 results each, especially when compared with human evaluation, which would also take days. One run with our framework takes around 3-5 hours on average. Please refer to the &lt;a href="https://arxiv.org/abs/2409.11860"&gt;paper&lt;/a&gt; for more details on the LLM vs. human cost for annotation.&lt;/p&gt;
&lt;h2&gt;Bottom Line&lt;/h2&gt;
&lt;p&gt;Using an LLM as a judge helps us evaluate the search quality at scale with high coverage and multi-language support. The investment to set up the infrastructure was a one-time cost, and no handcrafted test cases were necessary. With this setup, we can re-evaluate our search quality as many times as we want. It gave us confidence that our search engine was ready for launching new markets. Finally, we can now also perform automated in-depth validation of existing markets, which enables us to proactively identify regressions and otherwise uncaught issues.&lt;/p&gt;</content><category term="Zalando"/><category term="Search"/><category term="Machine Learning"/><category term="Artificial Intelligence"/><category term="Airflow"/><category term="Backend"/></entry><entry><title>Migrating jackson-datatype-money to FasterXML: A Case Study in Open Source Consolidation</title><link href="https://engineering.zalando.com/posts/2026/03/jackson-money-migration-zalando-style.html" rel="alternate"/><published>2026-03-09T00:00:00+01:00</published><updated>2026-03-09T00:00:00+01:00</updated><author><name>Sri Adarsh Kumar</name></author><id>tag:engineering.zalando.com,2026-03-09:/posts/2026/03/jackson-money-migration-zalando-style.html</id><summary type="html">&lt;p&gt;How we integrated jackson-datatype-money into the official FasterXML Jackson ecosystem.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Zalando's &lt;a href="https://github.com/zalando/jackson-datatype-money"&gt;jackson-datatype-money&lt;/a&gt; library was archived in
November 2025. &lt;a href="https://jcp.org/en/jsr/detail?id=354"&gt;JSR 354&lt;/a&gt; support is now part of the official FasterXML
&lt;a href="https://github.com/FasterXML/jackson-datatypes-misc"&gt;jackson-datatypes-misc&lt;/a&gt; repository.&lt;/p&gt;
&lt;p&gt;If you're looking to migrate, the
&lt;a href="https://github.com/zalando/jackson-datatype-money/blob/main/MIGRATION.md"&gt;migration guide&lt;/a&gt; covers everything you need.
The integrated modules are available starting from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Jackson 2.x&lt;/strong&gt;: &lt;code&gt;com.fasterxml.jackson.datatype:jackson-datatype-javax-money&lt;/code&gt; from version &lt;strong&gt;2.19.0&lt;/strong&gt;
  (&lt;a href="https://search.maven.org/artifact/com.fasterxml.jackson.datatype/jackson-datatype-javax-money/2.19.0/jar"&gt;Maven Central 2.19.0&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jackson 3.x&lt;/strong&gt;: &lt;code&gt;tools.jackson.datatype:jackson-datatype-javax-money&lt;/code&gt; from version &lt;strong&gt;3.0.0&lt;/strong&gt;
  (&lt;a href="https://search.maven.org/artifact/tools.jackson.datatype/jackson-datatype-javax-money/3.0.0/jar"&gt;Maven Central 3.0.0&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rest of this post covers how the migration came about.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Jackson doesn't natively support JSR 354 monetary types. While FasterXML's
&lt;a href="https://github.com/FasterXML/jackson-datatypes-misc"&gt;jackson-datatypes-misc&lt;/a&gt; repository included official support for
Joda-Money, JSR 354 support lived separately in Zalando's
&lt;a href="https://github.com/zalando/jackson-datatype-money"&gt;jackson-datatype-money&lt;/a&gt; repository. This created a fragmented
ecosystem where developers had to choose between two competing money libraries with inconsistent support levels.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://jcp.org/en/jsr/detail?id=354"&gt;JSR 354&lt;/a&gt; (JavaMoney) is the Java standard for handling monetary amounts and
currencies. Led by Werner Keil, Anatole Tresch, and Otávio Santana, it provides a robust API for representing money
without floating-point precision errors. Zalando's jackson-datatype-money module has been battle-tested in production
and adopted by multiple organizations.&lt;/p&gt;
&lt;p&gt;When Werner Keil himself opened &lt;a href="https://github.com/zalando/jackson-datatype-money/issues/224"&gt;issue #224&lt;/a&gt; asking
about contributing the module to FasterXML, it was clear that consolidation would benefit the entire Jackson
ecosystem.&lt;/p&gt;
&lt;h2&gt;The Problem&lt;/h2&gt;
&lt;p&gt;The fragmentation caused real issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New developers couldn't easily find JSR 354 Jackson support&lt;/li&gt;
&lt;li&gt;The Zalando module had separate release cycles, unsynchronized with core Jackson&lt;/li&gt;
&lt;li&gt;No clear "official" path for JSR 354 users&lt;/li&gt;
&lt;li&gt;Maintenance effort split across organizations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After discussing with the maintainers, I took on the migration.
&lt;a href="https://github.com/FasterXML/jackson-datatypes-misc/pull/48"&gt;Pull Request #48&lt;/a&gt; moved the battle-tested Zalando module
into &lt;code&gt;jackson-datatypes-misc&lt;/code&gt; while preserving its functionality and aligning it with FasterXML conventions.&lt;/p&gt;
&lt;h2&gt;Technical Challenges&lt;/h2&gt;
&lt;p&gt;The migration wasn't a simple copy-paste. Several architectural decisions emerged during the process.&lt;/p&gt;
&lt;h3&gt;Module Separation&lt;/h3&gt;
&lt;p&gt;I refactored the codebase to split Moneta-specific logic into a separate &lt;code&gt;moneta&lt;/code&gt; module. The original Zalando
repository supported all monetary types in a single module, but splitting it into two focused modules made more sense:
users could depend on generic JSR 354 support via &lt;code&gt;javax-money&lt;/code&gt;, or use specialized Moneta support (&lt;code&gt;FastMoney&lt;/code&gt;,
&lt;code&gt;RoundedMoney&lt;/code&gt;, etc.) which includes the generic functionality as a transitive dependency.&lt;/p&gt;
&lt;h3&gt;Package Naming&lt;/h3&gt;
&lt;p&gt;After review feedback, we changed the module names to &lt;code&gt;JavaxMoneyModule&lt;/code&gt; and &lt;code&gt;MonetaMoneyModule&lt;/code&gt; instead of keeping the
original structure. This made each module's purpose immediately clear and improved discoverability.&lt;/p&gt;
&lt;h3&gt;Dependencies&lt;/h3&gt;
&lt;p&gt;Moving from an independent repository to a multi-module project required reviewing and streamlining dependencies. We
eliminated unnecessary ones while keeping JSR 354 support robust and feature-complete.&lt;/p&gt;
&lt;h2&gt;Review Process&lt;/h2&gt;
&lt;p&gt;The pull request involved four reviewers from Zalando and FasterXML. Led by Tatu Saloranta (cowtowncoder),
the FasterXML Jackson maintainer, the review covered code quality, licensing, architectural consistency, and alignment
with Jackson conventions.&lt;/p&gt;
&lt;h3&gt;Licensing and Attribution&lt;/h3&gt;
&lt;p&gt;We ensured proper license compatibility and preserved attribution for all Zalando contributors. The history and
acknowledgments for everyone who made the module production-ready remained intact.&lt;/p&gt;
&lt;h3&gt;Corporate Contributor License Agreement&lt;/h3&gt;
&lt;p&gt;Obtaining a Corporate Contributor License Agreement (CCLA) from Zalando was necessary for legal clarity. While this
added time, it ensured the intellectual property transfer was properly documented.&lt;/p&gt;
&lt;h3&gt;Code Consistency&lt;/h3&gt;
&lt;p&gt;The FasterXML maintainers suggested improvements to align with other &lt;code&gt;jackson-datatypes-misc&lt;/code&gt; modules. They
recommended changing the module name from &lt;code&gt;jackson-datatype-money&lt;/code&gt; to &lt;code&gt;jackson-datatype-javax-money&lt;/code&gt; to distinguish it
from the existing Joda-Money module. These alignment challenges were more complex than expected, but the result feels
native to the Jackson ecosystem while preserving Zalando's battle-tested functionality.&lt;/p&gt;
&lt;h2&gt;What Users Gain&lt;/h2&gt;
&lt;p&gt;The migration to FasterXML &lt;code&gt;jackson-datatypes-misc&lt;/code&gt; brings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Official support&lt;/strong&gt;: Better long-term maintenance, consistent release cycles, and alignment with core Jackson
  development&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easier discovery&lt;/strong&gt;: Developers can find JSR 354 support directly in the official Jackson ecosystem&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: The module follows FasterXML conventions familiar to Jackson users&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timely updates&lt;/strong&gt;: The JSR 354 module evolves with core Jackson releases&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified community&lt;/strong&gt;: Bug fixes and features benefit everyone instead of being split across implementations&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What I Learned&lt;/h2&gt;
&lt;p&gt;Clear communication between Zalando, FasterXML maintainers, and community users kept everyone aligned throughout the
process. Transparent discussions about goals and trade-offs mattered more than technical decisions.&lt;/p&gt;
&lt;p&gt;The formal processes around licensing, code review, and corporate agreements ensured long-term health and legal
clarity. While they added time, they were necessary.&lt;/p&gt;
&lt;p&gt;I learned that migration isn't about replacing work; it's about preserving it while making it more accessible. The
Zalando module represented years of production testing. Adapting it to its new home while honoring that work was the
real challenge.&lt;/p&gt;
&lt;p&gt;The review process also identified opportunities to improve naming conventions, package structure, and test
organization without changing core functionality.&lt;/p&gt;
&lt;h2&gt;Migration Path&lt;/h2&gt;
&lt;p&gt;For full steps and edge cases, see the
&lt;a href="https://github.com/zalando/jackson-datatype-money/blob/main/MIGRATION.md"&gt;official migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Generic JSR 354 Support&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Jackson 2.x&lt;/strong&gt;: Use &lt;code&gt;com.fasterxml.jackson.datatype:jackson-datatype-javax-money&lt;/code&gt;
  (&lt;a href="https://search.maven.org/artifact/com.fasterxml.jackson.datatype/jackson-datatype-javax-money"&gt;Maven Central&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jackson 3.x&lt;/strong&gt;: Use &lt;code&gt;tools.jackson.datatype:jackson-datatype-javax-money&lt;/code&gt;
  (&lt;a href="https://search.maven.org/artifact/tools.jackson.datatype/jackson-datatype-javax-money"&gt;Maven Central&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Moneta-Specific Features&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;jackson-datatype-moneta&lt;/code&gt; for specialized support for Moneta types (&lt;code&gt;FastMoney&lt;/code&gt;, &lt;code&gt;RoundedMoney&lt;/code&gt;, etc.). This
automatically includes &lt;code&gt;javax-money&lt;/code&gt; as a transitive dependency.&lt;/p&gt;
&lt;h3&gt;Existing Users of Zalando's jackson-datatype-money&lt;/h3&gt;
&lt;p&gt;The migration is a drop-in replacement. Switch to &lt;code&gt;jackson-datatype-moneta&lt;/code&gt; and update dependency coordinates. No code
changes required.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Moving jackson-datatype-money to the official FasterXML repository unified JSR 354 support in the Jackson ecosystem.
What started as a simple user request became a collaboration across organizations, involving proper
licensing, code reviews, and architectural decisions.&lt;/p&gt;
&lt;p&gt;The module is now part of &lt;code&gt;jackson-datatypes-misc&lt;/code&gt;, giving developers a clear path for Jackson money handling support.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Thanks to the Zalando team, FasterXML maintainers, and the Jackson community for making this consolidation possible.
Special thanks to Tatu Saloranta for his guidance throughout the process.&lt;/em&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/></entry><entry><title>Why We Ditched Flink Table API Joins: Cutting State by 75% with DataStream Unions</title><link href="https://engineering.zalando.com/posts/2026/03/why-we-ditched-flink-table-api-joins-cutting-state.html" rel="alternate"/><published>2026-03-04T00:00:00+01:00</published><updated>2026-03-04T00:00:00+01:00</updated><author><name>Maryna Kryvko</name></author><id>tag:engineering.zalando.com,2026-03-04:/posts/2026/03/why-we-ditched-flink-table-api-joins-cutting-state.html</id><summary type="html">&lt;p&gt;The beauty of a high-level abstraction is that it lets you focus on the "what" rather than the "how." In the world of Apache Flink, the Table API is a powerful tool that abstracts away the complexities of stream processing, allowing developers to write SQL-like queries on streaming data. However, as we discovered in our journey with Flink, there are scenarios where the Table API's abstraction can be too heavy.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Photo by Włodzimierz Jaworski on Unsplash" src="https://engineering.zalando.com/posts/2026/03/images/wlodzimierz-jaworski-squirrel-unsplash.jpg#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;The beauty of a high-level abstraction is that it lets you focus on the "what" rather than the "how." In the world of &lt;a href="https://flink.apache.org"&gt;Apache Flink&lt;/a&gt;, the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/tableapi/"&gt;Table API and SQL&lt;/a&gt; represent this convenience very well. You write a simple join statement, and the query optimizer handles the heavy lifting. It feels like magic - until that magic starts costing you thousands of dollars in AWS bills and crashing your clusters every time a snapshot is triggered.&lt;/p&gt;
&lt;p&gt;This is exactly what we faced with our Product Offer Enrichment applications at Zalando. What began as an elegant, declarative solution eventually started crumbling under the weight of its own state. By moving from the "magic" of SQL to the manual control of the &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream/overview/"&gt;DataStream API&lt;/a&gt; and a custom &lt;code&gt;MultiStreamJoinProcessor&lt;/code&gt;, we decreased our state size from roughly 240GB to 56GB, a reduction of about 75%.&lt;/p&gt;
&lt;p&gt;Here is the deep dive into why Flink SQL state accumulates, how joins actually work under the hood, and how we rewrote our job to save the applications.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Disclaimer: this article is about Flink 1.20, which is the only version of Flink currently (Feb 2026) available on AWS Managed Flink.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Initial Architecture: The Attraction of SQL&lt;/h2&gt;
&lt;p&gt;Our Product Offer Enrichment pipeline is a critical piece of the Zalando Search and Browse ecosystem. It joins multiple streams of differing speed and "weight" (pricing and stock offers from partners, sorting metadata we call Boost, sponsored products metadata, and product data) to create a unified, enriched view of what a customer sees on the site when browsing a catalog of articles.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Catalog of articles" src="https://engineering.zalando.com/posts/2026/03/images/articles-catalog.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 1: Catalog of articles on the Zalando website, fed by the Offer Enrichment pipeline.
&lt;/figcaption&gt;

&lt;p&gt;Initially, we used the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/tableapi/"&gt;Table API &amp;amp; SQL&lt;/a&gt;. It allowed us to express complex joins in a few lines of SQL code. However, to understand why it failed, we have to look at how Flink SQL manages stateful joins.&lt;/p&gt;
&lt;h2&gt;Why State Amplifies&lt;/h2&gt;
&lt;p&gt;In Flink 1.20, each join operator is a strictly independent unit. Because Flink must account for late-arriving data and potential updates, it keeps every record in its internal state (RocksDB) to maintain data integrity.
When you chain joins across four streams, you aren't just adding state; you are multiplying it. Each join operator in the chain maintains its own copy of the keys and values it needs.&lt;/p&gt;
&lt;h3&gt;The State Math&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Join Operator 1 (offer + boost)&lt;/strong&gt;: Flink stores all records from offer and boost in RocksDB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Join Operator 2 (operator 1 + sponsored)&lt;/strong&gt;: To this operator, the incoming joined record is just a new stream. It has no access to the previous operator's memory. It must store its own copy of the (offer+boost) data to join it with the sponsored metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Join Operator 3 (result of 2 + product event)&lt;/strong&gt;: It clones the previous results again.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The relational model treats these as isolated operations. This state amplification led us to a staggering 235–245GB of state per application.&lt;/p&gt;
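&lt;p&gt;The arithmetic behind this can be sketched with made-up numbers. The snippet below is a rough model, not a measurement: the per-stream sizes are hypothetical units, and it assumes that every binary join stores both of its inputs in full and that a joined row is as large as the sum of its parts.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;object StateAmplification {
  // Hypothetical per-stream state sizes in arbitrary units (assumptions, not real data).
  val offer = 40L
  val boost = 5L
  val sponsored = 5L
  val product = 10L

  // Simplifying assumption: a binary join keeps both of its inputs in state.
  def joinState(left: Long, right: Long): Long = left + right

  // State held by three chained joins: each one re-stores the previous result.
  def chained: Long = {
    val join1Out = offer + boost        // rows produced by join 1
    val join2Out = join1Out + sponsored // rows produced by join 2
    joinState(offer, boost) + joinState(join1Out, sponsored) + joinState(join2Out, product)
  }

  // State held when every stream is stored exactly once per key.
  def single: Long = offer + boost + sponsored + product

  def main(args: Array[String]): Unit =
    println(s"chained joins: $chained units, single state: $single units")
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this toy model the earliest streams in the chain are stored three times, which is why the total grows much faster than the raw input.&lt;/p&gt;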
&lt;h2&gt;The State Nightmare&lt;/h2&gt;
&lt;p&gt;When your state reaches 235GB of data, your application stops being a data pipeline and starts being an unstable nightmare.&lt;/p&gt;
&lt;p&gt;Every hour, a cronjob would trigger a snapshot (savepoint). For us, it was a catastrophe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU exhaustion&lt;/strong&gt;: To snapshot 235GB, Flink must iterate over the RocksDB state, serialize it, and move it to S3. This would keep the cluster's CPU at 100% for nearly 12 minutes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="CPU spikes" src="https://engineering.zalando.com/posts/2026/03/images/cpu-table-api.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 2: CPU spikes during snapshot creation
&lt;/figcaption&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Backpressure&lt;/strong&gt;: Because the application was running close to the CPU limit, it couldn't process records. The lag would start getting higher and higher.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crash-Restart Loop&lt;/strong&gt;: Often, the Flink application would simply give up and restart. Because a Flink restart involves reloading the state from S3, we would sometimes fall behind our 1-hour SLO. By the time the app was back up, it would be almost time for the next snapshot. We tried lengthening the snapshot interval to several hours, but that only set us up for an SLO breach whenever an application failure required a restore from an older snapshot.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Flink restarts due to snapshots" src="https://engineering.zalando.com/posts/2026/03/images/restarts-table-api.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 3: Flink restarts due to snapshots
&lt;/figcaption&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Failures&lt;/strong&gt;: Due to forced restarts, many snapshots just couldn't be taken. This again left us vulnerable, because our state backups were unreliable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Overscaling&lt;/strong&gt;: Every scaling operation on a Flink application involves a full job restart, because the job parallelism needs to be reconfigured, and that can't happen on a live application. Since the STOP operation involves creating a snapshot (this is a configurable setting in AWS Managed Flink that we had enabled), every scaling operation took time proportional to snapshot creation: 11–12 minutes, sometimes up to 20. Because of that, we constantly kept the application's parallelism 10–20% higher than normally required, to provide some margin for intake spikes and make sure restarts didn't happen too often. They were just costing us too much lag!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Since the AWS Managed Flink costs are proportional to the number of KPUs, that is, the job parallelism, we were paying for the huge state with very real, physical money.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Managed Service for Apache Flink provisions capacity as KPUs. A single KPU provides you with 1 vCPU and 4GB of memory. For every KPU allocated, 50GB of running application storage is also provided. This means that application resources are always configured in terms of KPUs: there's no way to allocate more storage without also allocating more CPU and memory, or more memory without also allocating more CPU and storage.&lt;/p&gt;
&lt;/blockquote&gt;
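&lt;p&gt;As a rough illustration of that coupling, the sketch below computes how many KPUs would be needed if storage were the only constraint. The per-KPU figures come from the note above; the helper &lt;code&gt;kpusForStorage&lt;/code&gt; is hypothetical, and real sizing also depends on CPU and memory demand.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;object KpuSizing {
  // Resources bundled into one KPU, as described in the note above.
  val VCpuPerKpu = 1
  val MemoryGbPerKpu = 4
  val StorageGbPerKpu = 50

  // Hypothetical helper: smallest KPU count whose bundled storage covers the state.
  // Integer ceiling division; ignores CPU and memory requirements.
  def kpusForStorage(stateGb: Int): Int =
    (stateGb + StorageGbPerKpu - 1) / StorageGbPerKpu

  def main(args: Array[String]): Unit =
    Seq(235, 56).foreach { stateGb =&amp;gt;
      val kpus = kpusForStorage(stateGb)
      println(s"$stateGb GB of state: at least $kpus KPUs, which also brings " +
        s"${kpus * VCpuPerKpu} vCPU and ${kpus * MemoryGbPerKpu} GB of memory")
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under this storage-only assumption, 235GB of state needs at least 5 KPUs and 56GB needs at least 2, whether or not the extra CPU and memory are useful.&lt;/p&gt;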
&lt;p&gt;&lt;img alt="Failed snapshots" src="https://engineering.zalando.com/posts/2026/03/images/snapshots-table-api.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 4: Failed automated snapshots
&lt;/figcaption&gt;

&lt;h2&gt;From Table API to DataStream API&lt;/h2&gt;
&lt;p&gt;In the end, we decided to move from the Table API (declarative) to the DataStream API (imperative). This approach is very different from a simple SQL statement, but it gives far more control over what's happening.&lt;/p&gt;
&lt;h3&gt;The MultiStreamJoinProcessor&lt;/h3&gt;
&lt;p&gt;We moved to a Stream Union approach. Instead of chaining joins, we unified all incoming streams into a single &lt;code&gt;DataStream[BaseEvent]&lt;/code&gt;. This allowed us to replace the chain of joins, each with its own growing state, with a single &lt;code&gt;KeyedProcessFunction&lt;/code&gt;, specifically our custom &lt;code&gt;MultiStreamJoinProcessor&lt;/code&gt;. The key in this case is the SKU, the product identifier that is common to all streams.&lt;/p&gt;
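&lt;p&gt;The idea can be modelled with plain Scala collections before looking at any Flink code. The sketch below is an analogy only: it uses no Flink API, and the event classes, the &lt;code&gt;Merged&lt;/code&gt; record, and all field names are assumptions made for illustration. It shows the two steps that matter: putting different event types into one sequence of a common supertype, and folding them into exactly one record per SKU.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical event model; names and fields are assumptions for illustration.
sealed trait BaseEvent { def sku: String }
final case class OfferEvent(sku: String, price: Double) extends BaseEvent
final case class BoostEvent(sku: String, sortingScore: Double) extends BaseEvent

// One merged record per SKU instead of one copy per join operator.
final case class Merged(price: Option[Double] = None, sortingScore: Option[Double] = None)

object UnionSketch {
  // Fold a unified event sequence into a single state entry per key.
  def merge(events: Seq[BaseEvent]): Map[String, Merged] =
    events.foldLeft(Map.empty[String, Merged]) { (state, event) =&amp;gt;
      val current = state.getOrElse(event.sku, Merged())
      val updated = event match {
        case o: OfferEvent =&amp;gt; current.copy(price = Some(o.price))
        case b: BoostEvent =&amp;gt; current.copy(sortingScore = Some(b.sortingScore))
      }
      state.updated(event.sku, updated)
    }

  def main(args: Array[String]): Unit = {
    val offers: Seq[BaseEvent] = Seq(OfferEvent("sku-1", 19.99))
    val boosts: Seq[BaseEvent] = Seq(BoostEvent("sku-1", 0.7))
    // The "union": both inputs become one sequence of the common supertype.
    println(merge(offers ++ boosts))
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There is no left or right side here: each event only touches its own fields in the single record for its key.&lt;/p&gt;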
&lt;h3&gt;Exact State Management&lt;/h3&gt;
&lt;p&gt;In this model, there is only one instance of the SKU data in RocksDB. We use a single &lt;code&gt;ValueState&lt;/code&gt; for a custom Scala POJO called &lt;code&gt;EnrichmentState&lt;/code&gt;. This state management approach has several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No multiplication&lt;/strong&gt;: There is no left or right part of the join. When an event arrives, it simply updates the specific field(s) in the existing ValueState object.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No TTLs&lt;/strong&gt;: We keep the state "forever" to always have the last known value for an SKU. However, because we only store it once, the state is significantly smaller.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filtering/deduplication built-in&lt;/strong&gt;: We implemented manual stream filtering. If an incoming update has a timestamp that is earlier than what we already have, we drop it immediately, avoiding a state write altogether. In some cases, it is also an option to compare the incoming event with the stored state on a relevant subset of fields, so that events without relevant changes don't trigger an update.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;DataStream code example&lt;/h3&gt;
&lt;p&gt;This is obviously not the real code; just a much simplified example to illustrate the options. The code is written in Scala.&lt;/p&gt;
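&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;org.apache.flink.configuration.Configuration&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;org.apache.flink.streaming.api.functions.KeyedProcessFunction&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;org.apache.flink.util.Collector&lt;/span&gt;

&lt;span class="c1"&gt;// 1. Define a unified, dense POJO, containing only the fields we need for the enrichment, and a single state object to keep it in.&lt;/span&gt;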

&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;EnrichmentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
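&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sortingScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;boostTimestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// timestamp of the last applied boost (type assumed to be epoch millis)&lt;/span&gt;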
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;productState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// 2. The Processor Logic&lt;/span&gt;
&lt;span class="c1"&gt;// Type parameters are: key type, input type, output type&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;MultiStreamJoinProcessor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;KeyedProcessFunction&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BaseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;EnrichedOffer&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ValueState&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;EnrichmentState&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;override&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Unit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ValueStateDescriptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;enriched-state&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;classOf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;EnrichmentState&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;override&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;processElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BaseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Collector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;EnrichedOffer&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Unit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;getOrElse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;EnrichmentState&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;match&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;OfferEvent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Deduplication possible: Only update if price or stock changed&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ProductEvent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Deduplication possible: Only update if product state changed&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;productState&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;productState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;productState&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;productState&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BoostEvent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Built-in filtering: Only update if boost value is newer than the one we have&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boostTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;
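&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sortingScore&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sortingScore&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Remember the timestamp so that older boosts are filtered out next time&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boostTimestamp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;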
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;EnrichedOffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCurrentKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Funnily enough, the "more manual" approach turned out to be even less verbose than the SQL version, because our SQL was quite complex, with aggregations for calculating the maximal timestamps between several parts of the join and with ranking functions for making sure the last record from the same part of the join always wins.&lt;/p&gt;
&lt;h2&gt;Result: 75% State Decrease&lt;/h2&gt;
&lt;p&gt;The results were immediate and impressive. We decreased the number of operators and cut the state to about a quarter of its previous size. The snapshot size and duration dropped proportionally. The restart time also dropped from 12–20 minutes to 4–5 minutes, so the lag, while still accumulating during restarts, no longer threatened our SLO.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Table API (SQL)&lt;/th&gt;
&lt;th&gt;DataStream API&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State Size&lt;/td&gt;
&lt;td&gt;235 GB&lt;/td&gt;
&lt;td&gt;56 GB&lt;/td&gt;
&lt;td&gt;-76%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot Duration&lt;/td&gt;
&lt;td&gt;11 minutes&lt;/td&gt;
&lt;td&gt;2.5 minutes&lt;/td&gt;
&lt;td&gt;-77%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU Usage&lt;/td&gt;
&lt;td&gt;100% (Spikes)&lt;/td&gt;
&lt;td&gt;~30% (Stable)&lt;/td&gt;
&lt;td&gt;Stability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Costs&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;13% Reduction&lt;/td&gt;
&lt;td&gt;Savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;So how come we didn't cut the cost by 75% as well?&lt;/strong&gt; Because AWS costs depend less on the state size than on CPU and memory resources. We did save some CPU capacity, but the memory usage didn't change much, because we still needed to keep the same amount of data in memory for processing. The CPU usage became more stable, but it wasn't reduced by 75% because the processing logic still had to do the same amount of work at the same throughput. It just didn't have to deal with the overhead of managing multiple states.&lt;/p&gt;
&lt;p&gt;Because of that, what we saved was mostly the cost of the previous over-provisioning. The applications had been running at 10–20% higher parallelism than needed, so we could reduce that and save some money, but the cost reduction was not proportional to the state reduction. Still, as Flink optimizations go, a 13% cost reduction is a very good result, especially considering that it also made the applications more stable and reliable.&lt;/p&gt;
&lt;h3&gt;What Comes in Flink 2.x&lt;/h3&gt;
&lt;p&gt;As already mentioned above, we are currently on Flink 1.20 because that is the only option on AWS Managed Flink. So one might say, oh, but Flink 2.x has this and that, and you probably wouldn't have to do all this work, and maybe it was all for nothing.&lt;/p&gt;
&lt;p&gt;There is some truth to that: the Flink community was very much aware of the issue, and there was an improvement proposal dated &lt;a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-516%3A+Multi-Way+Join+Operator"&gt;May 19, 2025&lt;/a&gt;, called &lt;code&gt;Multi-Way Join Operator&lt;/code&gt;. It was then &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/tuning/#the-multijoin-operator"&gt;introduced in Flink 2.1 as an experimental feature&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer in the Flink documentation&lt;/strong&gt;: &lt;em&gt;"This is currently in an experimental state - there are open optimizations and breaking changes might be implemented in this version. We currently support only streaming INNER/LEFT joins. Support for RIGHT joins will be added soon."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;MultiJoin&lt;/code&gt; improvement still wouldn't save us because it ships in a much later version of Flink than the one we're on, but it is interesting to see that we reimplemented the same idea: keyed state, one operator for all streams, no intermediate state. The &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/tuning/#multijoin-operator-example---benchmark"&gt;benchmark for the Flink 2.1 implementation shows impressive results&lt;/a&gt;: a 2x to over 100x increase in processed records and a 3x to over 1000x smaller state. I guess this feature alone will be worth the wait when we get there, but until then, we're covered by our home-baked solution.&lt;/p&gt;
&lt;h3&gt;When to Get Down&lt;/h3&gt;
&lt;p&gt;By making the state management explicit, we turned an overloaded, barely coping system into a reliable, high-performance machine.&lt;/p&gt;
&lt;p&gt;Flink SQL is perfect for 90% of use cases - it's fast, elegant, and maintainable. But a software engineer's value is in recognizing the remaining 10%: the use cases where the abstraction starts costing too much.&lt;/p&gt;
&lt;p&gt;And this was definitely one of those.&lt;/p&gt;</content><category term="Zalando"/><category term="Apache Flink"/><category term="Search"/><category term="Backend"/></entry><entry><title>Running an Engineering Papers Reading Guild at Zalando</title><link href="https://engineering.zalando.com/posts/2026/01/running-an-engineering-papers-reading-guild-at-zalando.html" rel="alternate"/><published>2026-01-29T00:00:00+01:00</published><updated>2026-01-29T00:00:00+01:00</updated><author><name>Danilo Veljovic</name></author><id>tag:engineering.zalando.com,2026-01-29:/posts/2026/01/running-an-engineering-papers-reading-guild-at-zalando.html</id><summary type="html">&lt;p&gt;In September 2024, we started an Engineering Papers guild at Zalando to read and discuss research papers together. A year later, we reflect on our journey and share insights on organising and evolving the guild.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In September 2024 a friendly message in our internal Java guild chat led to the formation of the Engineering Papers guild born out of sheer curiosity and excitement about reading and discussing papers together.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chat screenshot of the initial message that led to the formation of the guild" src="https://engineering.zalando.com/posts/2026/01/images/conversation.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The Engineering Papers guild at Zalando recently celebrated its first anniversary as we completed one year of monthly in-person meetups - a big feat for the group! Throughout the year we have evolved and improved gradually to make the meetups more engaging and valuable for everyone who attends while learning a lot ourselves on the way. Today, we want to share how the journey has been and celebrate one year of papers, discussions and valuable insights which may come in handy for you if you are deciding to start such a group within your organisation.&lt;/p&gt;
&lt;h2&gt;Why we wanted to read academic papers&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://stackoverflow.blog/2022/12/30/you-should-be-reading-academic-computer-science-papers/"&gt;This StackOverflow article&lt;/a&gt; groups reasons to read papers into three categories:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“surveying history, the future of programming and the map of giants’ shoulders”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For us, the primary motivation was to peek under the hood of abstraction in software. We use several tools at work - runtimes like Node.js and the JVM, databases like Postgres and DynamoDB, platforms like Kubernetes - all of which abstract a set of features implemented using lower-level primitives and in turn enable developers to build higher levels of abstraction.&lt;/p&gt;
&lt;p&gt;Papers from the early days of the Internet are special in a way. They often describe technologies and tools that have become the foundations of modern-day computing. Take &lt;a href="https://lamport.azurewebsites.net/pubs/time-clocks.pdf"&gt;Time, Clocks and the Ordering of Events in Distributed Systems&lt;/a&gt; - a seminal paper in which Lamport showed how we can make sense of time and ordering in distributed systems. Ideas from this paper have influenced, for example, how multiple replicas in a database system communicate with each other to maintain data consistency.&lt;/p&gt;
&lt;h2&gt;Organising an internal paper-reading meetup&lt;/h2&gt;
&lt;p&gt;At Zalando, &lt;em&gt;guilds&lt;/em&gt; are communities of interest around a certain topic. We have guilds for many interests ranging from technology-focused topics like Java, Web, Data Engineering, LLMs, SRE to several non-tech groups like pet owners, photography, and music. The obvious next step for us was to create a guild of our own: #guild-engineering-papers.&lt;/p&gt;
&lt;p&gt;We wanted to run this guild in a sustainable way that keeps people interested and engaged and lets them derive the most value from attending the meetings. We had a genuine interest in reading papers, so we would have kept reading them even if the guild had never existed. The reason to have a guild was to tap into the pool of high-quality engineers across Zalando and discuss the papers with them, so that we could learn new techniques and potentially apply them to our daily work.&lt;/p&gt;
&lt;p&gt;Of course, running a niche guild in a large company is a challenge and it doesn’t come without its fair share of mistakes. Our plan was simple: take the best papers in the realm of distributed systems, databases and compilers, and discuss them within the group. The discussions would have a driver who would prepare and present (and essentially drive the conversation), and to set the momentum, we as co-organisers would be the drivers as well.&lt;/p&gt;
&lt;p&gt;One of the initial papers we selected to cover was the &lt;a href="https://www.usenix.org/system/files/atc22-elhemali.pdf"&gt;DynamoDB paper&lt;/a&gt;. It was relatively recent and relevant within Zalando where a lot of teams use DynamoDB, and we were confident that we would have high attendance. We created posters, wrote in the global tech channel and announced the topic in the guild channel - all excited to host the meetup. The day arrived and to our surprise we only had one attendee.&lt;/p&gt;
&lt;p&gt;Well, we realised the importance of marketing for a meetup. If no one knows what you are organising, no one will show up. For the following meetups, we used every viable channel to market the event, from eye-catching colorful posters put up across the office building to the company-wide tech newsletter, and this worked very well!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Guild meetup in action" src="https://engineering.zalando.com/posts/2026/01/images/session-in-action.jpeg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center;margin-top:-1em;"&gt;Guild meetup in action&lt;/figcaption&gt;

&lt;h2&gt;Evolving the format&lt;/h2&gt;
&lt;p&gt;As the guild was created out of our sheer interest in reading papers, we were quite eager to discuss as many interesting topics as possible in our meetups. We started with two papers in a 90-minute session, one per organiser, on a monthly cadence. We soon realised that this was not scalable: we as organisers would have to prepare the presentations in about three weeks for each meetup, and a session would be jam-packed with topics related to (usually) two completely different papers.&lt;/p&gt;
&lt;p&gt;In the session where we discussed the DynamoDB paper, we also covered the &lt;a href="https://raft.github.io/raft.pdf"&gt;Raft paper&lt;/a&gt;. It is an extremely interesting read for anyone who wants to understand distributed consensus but cannot make sense of algorithms like Paxos. In a session with two such information-dense papers, it was hard for the attendees to keep up with the discussion. We soon decided to move to a one-paper-per-session format to ensure that the discussions are rich and focused.&lt;/p&gt;
&lt;h3&gt;Selecting papers&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Content sweet spot = Relevance + Interest&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is usually tricky: you should aim to balance “bringing in value” (a.k.a. &lt;em&gt;relevance&lt;/em&gt;) with the attendees’ “interests”. In one of the sessions we discussed &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36500.pdf"&gt;Overlapping Experiment Infrastructure: More, Better, Faster Experimentation&lt;/a&gt;, as it is the foundation of an internal experimentation solution and contains valuable insights for anyone who wants to look under the hood of that product. Beyond the particular solution, the paper also describes how experimentation systems work in general, which is a great lesson in itself.&lt;/p&gt;
&lt;p&gt;Similarly, &lt;a href="https://s3.amazonaws.com/systemsandpapers/papers/hamilton.pdf"&gt;On Designing and Deploying Internet-Scale Services&lt;/a&gt; is a goldmine for running and maintaining software systems at scale. Ideas shared in this paper were super relevant for everyone who attended and for Zalando.&lt;/p&gt;
&lt;p&gt;On the other hand, we also discussed papers like &lt;a href="https://arxiv.org/pdf/2312.10172"&gt;Prequal&lt;/a&gt; and &lt;a href="https://people.csail.mit.edu/kraska/pub/sigmod08-s3.pdf"&gt;Building a Database on S3&lt;/a&gt; - topics which are purely interesting and can potentially inspire builders at Zalando to apply learnings to their systems.&lt;/p&gt;
&lt;p&gt;More recently, as the momentum has set in, we were happy to receive our first community proposal for a paper presentation! One of our regular attendees presented &lt;a href="https://arxiv.org/pdf/2002.11054"&gt;MLIR: A Compiler Infrastructure for the End of Moore’s Law&lt;/a&gt; - the paper introducing MLIR, a superset of LLVM-like tooling to work with multiple levels of intermediate representations in compilers. We want to encourage more such voluntary submissions so that the guild becomes community-driven and more diverse with respect to topics.&lt;/p&gt;
&lt;h2&gt;Quantifying impact&lt;/h2&gt;
&lt;p&gt;We see the impact of a meetup like ours materialising in two ways. While investing time into reading academic research papers has a long-term impact, we do want to share the tangible outcomes we realised through our regular meetups.&lt;/p&gt;
&lt;h3&gt;System internals and application at work&lt;/h3&gt;
&lt;p&gt;Our regular attendees shared that with papers like &lt;a href="https://www.usenix.org/system/files/atc22-elhemali.pdf"&gt;DynamoDB&lt;/a&gt;, they could better understand the underlying implementation and workings of systems they use every day in production. Understanding DynamoDB's internal architecture through the meetup helped a team at Zalando understand why DynamoDB writes kept failing despite adequate provisioned capacity. They realised their data was concentrated on specific partitions that had exhausted their burst capacity, leading them to redesign their partition key strategy for better load distribution.&lt;/p&gt;
&lt;p&gt;Similarly, &lt;a href="https://lamport.azurewebsites.net/pubs/chandy.pdf"&gt;Distributed Snapshots: Determining Global States of Distributed Systems&lt;/a&gt; helped a team understand &lt;a href="https://flink.apache.org/"&gt;Apache Flink&lt;/a&gt;'s checkpointing mechanisms.&lt;/p&gt;
&lt;h3&gt;Engineering culture and exchange of ideas&lt;/h3&gt;
&lt;p&gt;The guild meetups often see rich technical discussions that bring perspectives from different parts of the organisation as we see participation from departments all around the company. Attendees bring in varying experiences and the guild provides a platform for everyone looking to learn and share.&lt;/p&gt;
&lt;p&gt;While discussing &lt;a href="https://bigdata.uni-saarland.de/publications/Haffner,%20Dittrich%20-%20A%20Simplified%20Architecture%20for%20Fast,%20Adaptive%20Compilation%20and%20Execution%20of%20SQL%20Queries%20@EDBT2023.pdf"&gt;A Simplified Architecture for Fast, Adaptive Compilation and Execution of SQL Queries&lt;/a&gt;, which describes an LLVM and WebAssembly-based compilation architecture for SQL queries, attendees shared how some engineering teams at Zalando were experimenting with WebAssembly as a cross-platform target. This discussion gave attendees insight into how technologies they did not primarily use in their teams could benefit their goals.&lt;/p&gt;
&lt;h2&gt;Blueprint for organising this yourself&lt;/h2&gt;
&lt;p&gt;With our learnings from 2025, we put together the following &lt;em&gt;blueprint&lt;/em&gt; for anyone interested in running such a group in their organisation or institution. We believe such a group helps build a strong engineering culture and benefits both the attendees and the organisation.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Fundamentals&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Two or more organisers&lt;/strong&gt; - doing this solo will definitely wear you out and affect the quality of the meetup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interest in reading papers and continuous learning&lt;/strong&gt; - for a significant period of time initially, you as organisers will have to drive the group; your interest and motivation will push the group through its initial days.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Choosing what to read&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Combine relevance and interests&lt;/strong&gt; - bring in topics that interest you and may also be relevant to the organisation or the attendees. For us, distributed systems, databases and compilers were a good fit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mix classics with cutting edge research&lt;/strong&gt; - a historical perspective enriches ideas about present developments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Cadence and logistics&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Start easy, once a month&lt;/strong&gt; - With at least two organisers, one session with one paper discussion per month would give enough time for you to prepare if you alternate responsibilities. A paper per session also allows focused discussion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Market your meetup&lt;/strong&gt; - We cannot stress this enough. Let people know that you are organising this; you will be pleasantly surprised by how many are really interested. Use various channels (chat groups, internal social media, physical posters in the office, common newsletters) to announce the next meetup. Having a significant audience creates a nice feedback loop and motivates you to put in the effort.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Engaging the community&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use a lot of examples&lt;/strong&gt; - breaking down concepts in a paper via examples was a great way for us to share what we learned. E.g., in the Raft paper we went as far as implementing the algorithm in Node.js to see it in action.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ask for feedback&lt;/strong&gt; - The group is community-driven and feedback is very important. We tried both formal and informal ways to gather feedback and both were helpful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create an inclusive environment&lt;/strong&gt; - All of us are learning in the group and creating an environment where everyone can discuss and ask questions is crucial. This also encourages community participation and gradually having presenters other than the organisers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-person meetups make a difference&lt;/strong&gt; - Gathering in person helps in engaging with all the attendees better and makes the session more like a discussion than a &lt;em&gt;webinar&lt;/em&gt;. We value seeing each other in person once a month and you may want to try this too if your team setup permits it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What’s next&lt;/h2&gt;
&lt;p&gt;In total, we discussed 13 papers throughout the year in the guild. That’s a significant number, considering this was a self-organised effort! We celebrated the completion of 1 year of the guild with a special anniversary edition which was a more relaxed meetup. We had donuts, discussed some interesting events at scale (Cyber Week flavored) and some popular Internet bugs of the past.&lt;/p&gt;
&lt;p&gt;We want to continue organising guild meetups and explore interesting topics in systems and how they can be applied at Zalando. If this post encourages you to pick up a paper to read or organise a similar group, hurray! Happy reading!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Other than the papers mentioned in the post, we discussed the following:&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://martin.kleppmann.com/papers/local-first.pdf"&gt;Local-First Software: You Own Your Data, in spite of the Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf"&gt;Spanner: Google’s Globally-Distributed Database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sites.cc.gatech.edu/projects/up/publications/iotdi19-AdamHall.pdf"&gt;An Execution Model for Serverless Functions at the Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf"&gt;Google's Chubby: lock service for loosely-coupled distributed systems&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><category term="Zalando"/><category term="Culture"/></entry><entry><title>Paper Announcement: A Practical Approach to Replenishment Optimization with Extended (R, s, Q) Policy and Probabilistic Models</title><link href="https://engineering.zalando.com/posts/2026/01/publication-replenishment-engine.html" rel="alternate"/><published>2026-01-15T00:00:00+01:00</published><updated>2026-01-15T00:00:00+01:00</updated><author><name>Alva Presbitero</name></author><id>tag:engineering.zalando.com,2026-01-15:/posts/2026/01/publication-replenishment-engine.html</id><summary type="html">&lt;p&gt;Learn how the ZEOS replenishment optimization system achieves up to 22.1% GMV uplift by unifying probabilistic demand forecasting with risk-aware discrete event simulation.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In the world of e-commerce, inventory management is a high-stakes balancing act often described as the &lt;strong&gt;Inventory Paradox&lt;/strong&gt;. Carry too much stock, and your capital is locked in storage and liquidation; carry too little, and you face the "silent killer" of retail—stock-outs, where customer intent meets an empty shelf.&lt;/p&gt;
&lt;p&gt;Following our &lt;a href="https://engineering.zalando.com/posts/2025/06/inventory-optimisation-system.html"&gt;previous discussion on the high-level architecture of our inventory optimization system&lt;/a&gt;, we are excited to dive into the applied science that powers the engine.&lt;/p&gt;
&lt;p&gt;In our recent publication in &lt;em&gt;Nature Scientific Reports&lt;/em&gt;, &lt;a href="https://www.nature.com/articles/s41598-025-32537-2"&gt;A practical approach to replenishment optimization with extended (R, s, Q) policy and probabilistic models&lt;/a&gt;, we describe how Zalando moved beyond traditional "point" forecasts to build a &lt;strong&gt;simulation-driven replenishment engine that explicitly optimizes under uncertainty.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The Design: A Unified Optimization Architecture&lt;/h2&gt;
&lt;p&gt;The ZEOS Inventory Optimization Tool isn't just a prediction model; it’s a central replenishment engine supported by a suite of probabilistic forecasting components. We combined &lt;strong&gt;Discrete Event Simulation (DES)&lt;/strong&gt; with &lt;strong&gt;stochastic optimization&lt;/strong&gt; to determine replenishment policies that maximize value across an article’s entire lifecycle.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Component view of the replenishment engine" src="https://engineering.zalando.com/posts/2026/01/images/component_view_figure.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 1: Component view of the replenishment engine. The ZEOS Inventory Optimization Tool consists of the core optimization engine and supporting probabilistic components.
&lt;/figcaption&gt;

&lt;p&gt;The system is built around three core pillars:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Forecaster (LightGBM)&lt;/strong&gt;: The future is rarely a single number. Instead of predicting a single demand value, we model &lt;strong&gt;full probability distributions&lt;/strong&gt;. By using quantile forecasts from our LightGBM-based demand service, the system accounts for tail risks—those rare but financially significant demand spikes that a simple average would miss.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Engine (Extended (R, s, Q))&lt;/strong&gt;: Classic reorder-point policies are often too rigid for the fast-paced world of fashion. We extended the classical &lt;span class="math"&gt;\((R, s, Q)\)&lt;/span&gt; policy by introducing an &lt;strong&gt;initial kick-start quantity&lt;/strong&gt; (&lt;span class="math"&gt;\(Q_0\)&lt;/span&gt;) and a &lt;strong&gt;time-based lifecycle cutoff&lt;/strong&gt; (&lt;span class="math"&gt;\(t_{\text{limit}}\)&lt;/span&gt;). This allows the policy to be aggressive during a product's launch and conservative as it reaches its decay phase.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The Optimizer (Monte Carlo Simulation)&lt;/strong&gt;: We don't just optimize for the "best-case" scenario. We simulate thousands of plausible futures for each candidate policy. The optimizer then selects the policy that performs most robustly &lt;em&gt;across that uncertainty&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Modeling Returns and Lead Times&lt;/strong&gt;: Demand is our primary variable, but we also model returns, using lead-time distributions derived empirically from historical data. Replenishment lead times are sampled from Gamma distributions during simulation.&lt;/p&gt;
&lt;h2&gt;The Simulation Method: Discrete Event Modeling&lt;/h2&gt;
&lt;p&gt;How do we "test" a policy before it hits production? We run a &lt;strong&gt;Discrete Event Simulation (DES)&lt;/strong&gt; over a 12-week horizon. Each Monte Carlo run represents one "alternate timeline" where demand, returns, and lead times evolve stochastically.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Simplified illustration of the DES state evolution" src="https://engineering.zalando.com/posts/2026/01/images/des_state_evolution_figure.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 2: Evolution of inventory states within a simulated week, from inbound processing to replenishment decisions.
&lt;/figcaption&gt;

&lt;p&gt;Within each simulated week, inventory follows a precise sequence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Intra-week processing&lt;/strong&gt;: Expected inbounds and returns are added half before and half after demand fulfillment to approximate a continuous flow of goods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Demand realization&lt;/strong&gt;: We sample weekly demand from the probabilistic forecast and fulfill it based on current on-hand inventory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replenishment decisions&lt;/strong&gt;: At review points, we check inventory against the reorder point (&lt;span class="math"&gt;\(s\)&lt;/span&gt;). If it’s breached, a replenishment of size &lt;span class="math"&gt;\(Q\)&lt;/span&gt; is triggered and enters transit with a sampled lead time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost accumulation&lt;/strong&gt;: We track storage, inbound, outbound, return, and lost-sales costs across the entire 12-week horizon.&lt;/li&gt;
&lt;/ul&gt;
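&lt;p&gt;The intra-week sequence above can be written as a simple inventory balance. This is a simplified sketch, not the exact formulation from the paper; the symbols are notation introduced here for illustration. Let &lt;span class="math"&gt;\(I_t\)&lt;/span&gt; be the on-hand inventory at the start of week &lt;span class="math"&gt;\(t\)&lt;/span&gt;, &lt;span class="math"&gt;\(A_t\)&lt;/span&gt; the inbounds and returns arriving during that week, and &lt;span class="math"&gt;\(D_t\)&lt;/span&gt; the sampled demand. With half of the arrivals booked before demand fulfillment and half after:&lt;/p&gt;
&lt;div class="math"&gt;$$\begin{gathered} S_t = \min\left(D_t,\; I_t + \tfrac{1}{2} A_t\right), \qquad L_t = D_t - S_t, \\ I_{t+1} = I_t + A_t - S_t \end{gathered}$$&lt;/div&gt;
&lt;p&gt;Here &lt;span class="math"&gt;\(S_t\)&lt;/span&gt; is the fulfilled demand and &lt;span class="math"&gt;\(L_t\)&lt;/span&gt; is the unmet demand that feeds the lost-sales cost. For example, with &lt;span class="math"&gt;\(I_t = 10\)&lt;/span&gt;, &lt;span class="math"&gt;\(A_t = 8\)&lt;/span&gt; and &lt;span class="math"&gt;\(D_t = 16\)&lt;/span&gt;, only 14 units are available when demand is served, so &lt;span class="math"&gt;\(S_t = 14\)&lt;/span&gt;, &lt;span class="math"&gt;\(L_t = 2\)&lt;/span&gt; and &lt;span class="math"&gt;\(I_{t+1} = 4\)&lt;/span&gt;.&lt;/p&gt;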
&lt;h2&gt;The Math: Optimization Objective&lt;/h2&gt;
&lt;p&gt;The optimizer’s goal is to find the specific policy parameters &lt;span class="math"&gt;\(\theta = (t_0, Q_0, s, Q)\)&lt;/span&gt; that minimize the total cost over the simulated horizon:&lt;/p&gt;
&lt;div class="math"&gt;$$\theta^* = \arg\min_{\theta \in \Theta} \left[ C_{\text{holding}}(\theta) + C_{\text{inbound}}(\theta) + C_{\text{outbound}}(\theta) + C_{\text{returns}}(\theta) + C_{\text{lost sales}}(\theta) \right]$$&lt;/div&gt;
&lt;p&gt;To find the optimal balance, the engine weighs five cost components, grouped here into four categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Storage (&lt;span class="math"&gt;\(C_{\text{holding}}\)&lt;/span&gt;)&lt;/strong&gt;: Fees accumulated weekly based on physical stock levels.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logistics (&lt;span class="math"&gt;\(C_{\text{inbound}}\)&lt;/span&gt; &amp;amp; &lt;span class="math"&gt;\(C_{\text{outbound}}\)&lt;/span&gt;)&lt;/strong&gt;: Operational costs of moving goods to and from the fulfillment centers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Returns (&lt;span class="math"&gt;\(C_{\text{returns}}\)&lt;/span&gt;)&lt;/strong&gt;: Specific processing fees for returned customer items.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Opportunity (&lt;span class="math"&gt;\(C_{\text{lost sales}}\)&lt;/span&gt;)&lt;/strong&gt;: The margin lost when demand is unmet, adjusted by return rates to reflect "realized" lost sales.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While the formula looks straightforward, two specific design choices make it powerful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Counterfactual Modeling&lt;/strong&gt;: We handle the "unseen"—like demand that &lt;em&gt;would&lt;/em&gt; have happened during a stock-out—using probabilistic distributions rather than rough guesses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Risk-Aware Optimization&lt;/strong&gt;: Instead of minimizing the average cost, we minimize the &lt;strong&gt;75th percentile&lt;/strong&gt; of the cost distribution. This biases our decisions toward protecting against costly scenarios, such as unexpected demand spikes, rather than optimizing only for the typical case.&lt;/li&gt;
&lt;/ul&gt;
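&lt;p&gt;The risk-aware variant of the objective can be written compactly. This is a sketch under assumed notation: the scenario index &lt;span class="math"&gt;\(i\)&lt;/span&gt;, the number of Monte Carlo runs &lt;span class="math"&gt;\(N\)&lt;/span&gt;, and the empirical quantile &lt;span class="math"&gt;\(\hat{q}_{0.75}\)&lt;/span&gt; are introduced here for illustration and are not taken from the paper. Let &lt;span class="math"&gt;\(C^{(i)}(\theta)\)&lt;/span&gt; be the total cost of policy &lt;span class="math"&gt;\(\theta\)&lt;/span&gt; in simulated scenario &lt;span class="math"&gt;\(i\)&lt;/span&gt;, that is, the sum of the five cost terms above evaluated on that scenario. Then:&lt;/p&gt;
&lt;div class="math"&gt;$$\theta^* = \arg\min_{\theta \in \Theta} \; \hat{q}_{0.75}\left( C^{(1)}(\theta), \dots, C^{(N)}(\theta) \right)$$&lt;/div&gt;
&lt;p&gt;Replacing &lt;span class="math"&gt;\(\hat{q}_{0.75}\)&lt;/span&gt; with the sample mean would recover a plain expected-cost objective. With the quantile, the selected policy is the one whose cost stays below the lowest achievable threshold in three out of four simulated futures.&lt;/p&gt;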
&lt;h2&gt;Results: Computational Backtesting&lt;/h2&gt;
&lt;p&gt;To evaluate the efficacy of the extended &lt;span class="math"&gt;\((R, s, Q)\)&lt;/span&gt; policy, we conducted an extensive &lt;strong&gt;computational backtest&lt;/strong&gt; designed as a series of numerical experiments. This study spanned a full year (October 2023 – September 2024), utilizing ~2 million articles from approximately 800 merchants to capture a wide spectrum of demand profiles and seasonal dynamics.&lt;/p&gt;
&lt;p&gt;By benchmarking the engine against professional human replenishment decisions, we observed a direct and stable translation of mathematical optimization into business value:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Engine vs. Human Baseline Uplift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gross Merchandise Value (GMV)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+22.11%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gross Margin (GMV after fulfillment costs, FC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+21.95%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted Weekly Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+33.63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted Demand Fill Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23.63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The computational experiments highlight several critical performance characteristics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Consistent Seasonal Performance&lt;/strong&gt;: The positive uplifts in GMV and GMV after fulfillment costs remained remarkably stable throughout the 12-month period, demonstrating the engine's ability to navigate high-variance seasonal peaks and troughs without performance degradation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stable High Service Levels&lt;/strong&gt;: Financial gains were not achieved through aggressive overstocking. Instead, the engine maintained a consistent &lt;strong&gt;weighted demand fill rate of 91.14%&lt;/strong&gt; and an &lt;strong&gt;availability rate of 86.40%&lt;/strong&gt;, significantly outperforming human benchmarks across the entire temporal horizon.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad Generalization&lt;/strong&gt;: The numerical experiments confirmed that the benefits are not restricted to specific article types. Approximately &lt;strong&gt;70–80% of merchants&lt;/strong&gt; in the study saw positive financial uplifts, indicating that the probabilistic approach effectively balances holding costs and lost sales across diverse merchant assortments.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="Distribution of GMV and GMV after fulfillment cost uplifts" src="https://engineering.zalando.com/posts/2026/01/images/profit_uplift_figure.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 3: Distribution of GMV and GMV after fulfillment cost uplifts for representative execution dates.
&lt;/figcaption&gt;

&lt;p&gt;&lt;img alt="Temporal stability of positive uplifts" src="https://engineering.zalando.com/posts/2026/01/images/temporal_stability.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 4: Percentage of merchants with positive GMV uplifts across the 12-month backtest period.
&lt;/figcaption&gt;

&lt;p&gt;&lt;img alt="Weekly availability and demand fill rate comparison" src="https://engineering.zalando.com/posts/2026/01/images/operational_performance.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;
Figure 5: Weekly availability and demand fill rate comparison between the optimization engine and human benchmarks.
&lt;/figcaption&gt;

&lt;p&gt;&lt;strong&gt;Note on backtest implications:&lt;/strong&gt; It is important to clarify that the uplifts cited above represent a theoretical scenario of 100% user adoption. Because the tool serves as an AI decision-support assistant, the final authority remains with the merchants. Actual results will vary depending on how consistently merchants choose to implement the system's suggestions.&lt;/p&gt;
&lt;h2&gt;Comparative Analysis &amp;amp; Ablation: Why it Works&lt;/h2&gt;
&lt;p&gt;While the backtest against human decisions quantifies end-to-end business impact, our baseline and ablation comparisons isolate exactly where the value is created.&lt;/p&gt;
&lt;h3&gt;Baseline Comparison: Algorithm vs. Tradition&lt;/h3&gt;
&lt;p&gt;We compared our &lt;strong&gt;Extended (R, s, Q)&lt;/strong&gt; approach against standard industry policies under identical data and stochastic simulation settings.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;GMV Uplift&lt;/th&gt;
&lt;th&gt;GMV after FC Uplift&lt;/th&gt;
&lt;th&gt;Availability Rate Uplift vs Human&lt;/th&gt;
&lt;th&gt;Demand Fill Rate Uplift vs Human&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extended (R, s, Q) (ours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.11%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21.95%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+33.63% (+21.75pp)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+23.63% (+17.42pp)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tuned (s, S)&lt;/td&gt;
&lt;td&gt;13.39%&lt;/td&gt;
&lt;td&gt;14.80%&lt;/td&gt;
&lt;td&gt;+18.65% (+12.23pp)&lt;/td&gt;
&lt;td&gt;+14.35% (+10.71pp)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Periodic base-stock&lt;/td&gt;
&lt;td&gt;12.50%&lt;/td&gt;
&lt;td&gt;13.89%&lt;/td&gt;
&lt;td&gt;+17.99% (+11.79pp)&lt;/td&gt;
&lt;td&gt;+14.19% (+10.57pp)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Myopic Newsvendor&lt;/td&gt;
&lt;td&gt;5.07%&lt;/td&gt;
&lt;td&gt;5.60%&lt;/td&gt;
&lt;td&gt;+11.61% (+7.60pp)&lt;/td&gt;
&lt;td&gt;+8.10% (+6.03pp)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The results show a clear hierarchy. Traditional policies like the &lt;em&gt;Myopic Newsvendor&lt;/em&gt; or &lt;em&gt;Periodic base-stock&lt;/em&gt; underperform because they lack the foresight to handle lead-time and return uncertainty. Even the &lt;strong&gt;Tuned (s, S)&lt;/strong&gt; policy, which is a common industry standard, falls short because its static thresholds cannot match the responsiveness of our extended (R, s, Q) variables (&lt;span class="math"&gt;\(Q_0\)&lt;/span&gt; and &lt;span class="math"&gt;\(t_{limit}\)&lt;/span&gt;) in a high-variance environment.&lt;/p&gt;
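&lt;p&gt;The table reports each service-level uplift twice: as a relative change (%) and as an absolute change in percentage points (pp). The sketch below shows how the two relate, assuming the pp figures are the absolute difference between the engine's reported rates (86.40% availability, 91.14% demand fill rate) and the human benchmark. The helper names are illustrative and not part of the engine.&lt;/p&gt;

```python
# Relate the two uplift notations used in the table: relative (%) and absolute (pp).
# Assumption: pp uplift = engine rate - human benchmark rate.

def baseline_rate(engine_rate, pp_uplift):
    """Human-benchmark rate implied by the engine's rate and its pp uplift."""
    return engine_rate - pp_uplift


def relative_uplift(engine_rate, pp_uplift):
    """Relative uplift in percent, measured against the implied baseline."""
    return 100.0 * pp_uplift / baseline_rate(engine_rate, pp_uplift)


# Engine rates and pp uplifts as reported in this article.
availability = relative_uplift(86.40, 21.75)
fill_rate = relative_uplift(91.14, 17.42)

print(f"implied human availability: {baseline_rate(86.40, 21.75):.2f}%")
print(f"implied human fill rate:    {baseline_rate(91.14, 17.42):.2f}%")
print(f"availability uplift: {availability:.2f}%")
print(f"fill-rate uplift:    {fill_rate:.2f}%")
```

&lt;p&gt;Under that assumption the relative uplifts come out at roughly 33.6% and 23.6%, in line with the table up to rounding of the published inputs.&lt;/p&gt;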
&lt;h3&gt;Ablation Study: The "Secret Sauce"&lt;/h3&gt;
&lt;p&gt;Is the success driven by the better forecast or the better optimization? We stripped the model down to find out.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;GMV Uplift&lt;/th&gt;
&lt;th&gt;GMV after FC Uplift&lt;/th&gt;
&lt;th&gt;Availability Rate&lt;/th&gt;
&lt;th&gt;Demand Fill Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Probabilistic Forecast + Percentile Objective (ours)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.11%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21.95%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.40%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91.14%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Probabilistic Forecast + Mean Objective&lt;/td&gt;
&lt;td&gt;19.02%&lt;/td&gt;
&lt;td&gt;20.16%&lt;/td&gt;
&lt;td&gt;81.27%&lt;/td&gt;
&lt;td&gt;87.98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Point Forecast + Percentile Objective&lt;/td&gt;
&lt;td&gt;6.37%&lt;/td&gt;
&lt;td&gt;5.98%&lt;/td&gt;
&lt;td&gt;77.76%&lt;/td&gt;
&lt;td&gt;84.95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The takeaway is definitive: &lt;strong&gt;You need both.&lt;/strong&gt; Switching from point forecasts to probabilistic ones provides the single largest gain. However, optimizing for the 75th percentile rather than the average provides that final, critical layer of stability, particularly in protecting the merchant against high-impact "tail" events.&lt;/p&gt;
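&lt;p&gt;To make the comparison concrete, the short sketch below subtracts each ablated configuration from the full one, using only the figures in the table above. The variable and function names are ours for illustration.&lt;/p&gt;

```python
# GMV uplift, GMV-after-FC uplift, availability, demand fill rate (all in %),
# copied from the ablation table.
FULL = (22.11, 21.95, 86.40, 91.14)            # probabilistic forecast + percentile objective
MEAN_OBJECTIVE = (19.02, 20.16, 81.27, 87.98)  # probabilistic forecast + mean objective
POINT_FORECAST = (6.37, 5.98, 77.76, 84.95)    # point forecast + percentile objective

METRICS = ("GMV uplift", "GMV after FC uplift", "availability", "demand fill rate")


def drop_in_pp(full, ablated):
    """Percentage points lost per metric when one component is removed."""
    return [round(f - a, 2) for f, a in zip(full, ablated)]


loss_without_probabilistic = drop_in_pp(FULL, POINT_FORECAST)
loss_without_percentile = drop_in_pp(FULL, MEAN_OBJECTIVE)

for name, forecast_loss, objective_loss in zip(
    METRICS, loss_without_probabilistic, loss_without_percentile
):
    print(f"{name}: -{forecast_loss}pp without the probabilistic forecast, "
          f"-{objective_loss}pp without the percentile objective")
```

&lt;p&gt;On every metric, removing the probabilistic forecast costs more than removing the percentile objective, which is the pattern the paragraph above describes.&lt;/p&gt;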
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This work proves that meaningful improvements come from &lt;strong&gt;explicitly embracing uncertainty.&lt;/strong&gt; By combining probabilistic forecasting and discrete event simulation, we’ve bridged the gap between inventory theory and the massive operational scale of Zalando. Our optimization engine can deliver substantial value across key metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Up to 22% increase in Gross Merchandise Value (GMV) and Gross Margin compared to human replenishment decisions&lt;/li&gt;
&lt;li&gt;34% relative improvement in availability rate (+21.75pp) and 24% relative improvement in demand fill rate (+17.42pp)&lt;/li&gt;
&lt;li&gt;Stable performance throughout seasonal peaks and troughs over a 12-month period&lt;/li&gt;
&lt;li&gt;Positive financial uplift for 70-80% of merchants across diverse inventory profiles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These results demonstrate that inventory optimization isn't just a theoretical exercise—it's a practical solution that drives real financial growth while improving customer experience through better product availability.&lt;/p&gt;
&lt;p&gt;For the full methodology and mathematical formulation, read our paper in &lt;a href="https://www.nature.com/articles/s41598-025-32537-2"&gt;Nature Scientific Reports&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Presbitero, A., et al. (2025). &lt;em&gt;A practical approach to replenishment optimization with extended (R, s, Q) policy and probabilistic models&lt;/em&gt;. Nature Scientific Reports.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Zalando"/><category term="Artificial Intelligence"/><category term="Machine Learning"/><category term="Logistics"/><category term="Operations Research"/><category term="Zalando Science"/><category term="Backend"/></entry><entry><title>Contributing to Debezium: Fixing Logical Replication at Scale</title><link href="https://engineering.zalando.com/posts/2025/12/contributing-to-debezium.html" rel="alternate"/><published>2025-12-19T00:00:00+01:00</published><updated>2025-12-19T00:00:00+01:00</updated><author><name>Conor Gallagher</name></author><id>tag:engineering.zalando.com,2025-12-19:/posts/2025/12/contributing-to-debezium.html</id><summary type="html">&lt;p&gt;How we contributed two features to Debezium to solve WAL growth and enable safer logical replication for our event streaming infrastructure&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Logical replication and Debezium" src="https://engineering.zalando.com/posts/2025/12/images/logical-replication.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;At Zalando, we run hundreds of event streams powered by PostgreSQL logical replication through our Fabric Event Streams platform, a Kubernetes-based approach that allows teams to declare event streams sourcing directly from their Postgres databases. Each stream declaration provisions a micro application that uses Debezium in embedded mode to publish row-level change events as they occur. At peak traffic, these combined connectors process hundreds of thousands of events per second across our 100+ Kubernetes clusters.&lt;/p&gt;
&lt;p&gt;This infrastructure has been in operation since late 2018, processing billions of events over the years, but getting here required solving some hard problems with logical replication. This is the story of how we contributed two features to Debezium that we hope will help everyone using logical replication at scale.&lt;/p&gt;
&lt;h2&gt;The WAL Growth Problem Returns&lt;/h2&gt;
&lt;p&gt;A couple of years ago, our colleague &lt;a href="https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html"&gt;Declan Murphy wrote about a critical issue&lt;/a&gt; with PostgreSQL logical replication where low-activity databases experienced runaway Write-Ahead Log (WAL) growth. The problem was simple: replication slots wouldn't advance without table activity, causing WAL to pile up until disk space ran out. Our single biggest operational issue when rolling out this event infrastructure at scale was uncontrolled WAL growth on low-activity databases, even with heartbeat configured.&lt;/p&gt;
&lt;p&gt;As detailed in Declan's blog post, we fixed this upstream in the PostgreSQL JDBC driver by having the driver respond to keepalive messages from Postgres, advancing the replication slot when no relevant changes are pending. For a deeper dive into the hazards of logical decoding in PostgreSQL, our colleague Polina Bungina gave an excellent talk at the &lt;a href="https://www.youtube.com/watch?v=OtHu92S20Ro"&gt;Posette conference in 2024&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We deployed this fix in production with Debezium 2.7.4, pinning to pgjdbc version 42.7.2, and ran it for nearly two years, processing billions of events with zero detected data loss from this mechanism. The fix completely eliminated our WAL growth issues.&lt;/p&gt;
&lt;p&gt;Then came the problem.&lt;/p&gt;
&lt;h2&gt;Debezium Disables the Fix&lt;/h2&gt;
&lt;p&gt;As we prepared to upgrade Debezium, we discovered that &lt;a href="https://github.com/debezium/debezium/pull/6472"&gt;a recent PR&lt;/a&gt; had hard-coded the pgjdbc keepalive flush feature to disabled by setting &lt;code&gt;withAutomaticFlush(false)&lt;/code&gt; in the replication stream builder. The Debezium team had good reasons for this change: the feature conflicted with Debezium's own LSN management logic, users reported issues online, and the safest path forward was to disable it entirely.&lt;/p&gt;
&lt;p&gt;For us, this was a blocker because we couldn't upgrade Debezium without losing the fix that kept our production systems stable. We needed a way forward that would work for both the broader Debezium community and teams like us who had verified this behavior at scale.&lt;/p&gt;
&lt;h2&gt;First Contribution: Make It Opt-In&lt;/h2&gt;
&lt;p&gt;We opened &lt;a href="https://issues.redhat.com/browse/DBZ-9641"&gt;DBZ-9641&lt;/a&gt; proposing a simple solution: expose the underlying pgjdbc setting as a connector configuration option, allowing users to opt-in to this proven-safe feature while defaulting to the safer disabled behavior. The Debezium team was receptive and engaged with our use case.&lt;/p&gt;
&lt;p&gt;We submitted &lt;a href="https://github.com/debezium/debezium/pull/6881"&gt;PR #6881&lt;/a&gt;, introducing a new &lt;code&gt;lsn.flush.mode&lt;/code&gt; configuration property to replace and deprecate the existing &lt;code&gt;flush.lsn.source&lt;/code&gt; boolean with three explicit modes:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;manual&lt;/code&gt;&lt;/strong&gt; - LSN flushing is managed externally by your application or another mechanism, with the connector not flushing the LSN at all.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;connector&lt;/code&gt;&lt;/strong&gt; (default) - Debezium flushes the LSN after processing each logical replication change event, with the PostgreSQL JDBC driver's keep-alive thread not flushing LSNs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;connector_and_driver&lt;/code&gt;&lt;/strong&gt; - Both Debezium and the PostgreSQL JDBC driver's keep-alive thread can flush LSNs, preventing WAL growth on low-activity databases where monitored tables have infrequent changes. When the connector has no pending LSN to flush, the JDBC driver's keep-alive mechanism can flush the server-reported keep-alive LSN, which reflects all WAL activity including unmonitored activity such as &lt;code&gt;CHECKPOINT&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, or &lt;code&gt;pg_switch_wal()&lt;/code&gt; that doesn't produce logical replication change events.&lt;/p&gt;
&lt;p&gt;To ensure a smooth transition, we implemented full backward compatibility where the deprecated &lt;code&gt;flush.lsn.source&lt;/code&gt; boolean automatically maps to the new enum values: &lt;code&gt;true&lt;/code&gt; maps to &lt;code&gt;connector&lt;/code&gt; and &lt;code&gt;false&lt;/code&gt; maps to &lt;code&gt;manual&lt;/code&gt;. This gives users time to migrate to the new configuration during the deprecation period without breaking existing deployments.&lt;/p&gt;
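&lt;p&gt;The mapping can be pictured as a small lookup. The following is a minimal sketch of the behavior described above, not Debezium's actual code: the function name is hypothetical, and the choice to let &lt;code&gt;lsn.flush.mode&lt;/code&gt; win when both properties are set is our assumption for illustration.&lt;/p&gt;

```python
# Illustrative model of the documented backward-compatibility mapping.
# Not Debezium code; resolve_lsn_flush_mode is a hypothetical helper.
LEGACY_FLUSH_LSN_SOURCE = {"true": "connector", "false": "manual"}
VALID_MODES = {"manual", "connector", "connector_and_driver"}


def resolve_lsn_flush_mode(config):
    """Return the effective lsn.flush.mode for a connector config dict."""
    if "lsn.flush.mode" in config:
        # Assumption: the new property takes precedence over the deprecated one.
        mode = str(config["lsn.flush.mode"]).lower()
        if mode not in VALID_MODES:
            raise ValueError(f"unknown lsn.flush.mode: {mode}")
        return mode
    legacy = config.get("flush.lsn.source")
    if legacy is not None:
        # Deprecated boolean: true maps to connector, false maps to manual.
        return LEGACY_FLUSH_LSN_SOURCE[str(legacy).lower()]
    return "connector"  # default mode named in the article


print(resolve_lsn_flush_mode({"flush.lsn.source": "false"}))
print(resolve_lsn_flush_mode({"lsn.flush.mode": "connector_and_driver"}))
```

&lt;p&gt;Note that the deprecated boolean can never express &lt;code&gt;connector_and_driver&lt;/code&gt;, so opting in to the driver's keepalive flush requires setting the new property explicitly.&lt;/p&gt;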
&lt;p&gt;This unblocked our upgrade path and gave us the opt-in mechanism we needed. But it left an open question: why did the feature cause problems for other users when it had worked so well for us? We were apparently the only ones who liked it, and we wanted to understand why.&lt;/p&gt;
&lt;h2&gt;Understanding Why We're Different&lt;/h2&gt;
&lt;p&gt;We started digging through GitHub issues from other Debezium users to understand their concerns, and found issues like &lt;a href="https://github.com/airbytehq/airbyte/issues/49802"&gt;this one from Airbyte&lt;/a&gt; that illustrated the problem clearly. Users were reporting that after upgrading to pgjdbc 42.7.0+, their connectors would fail on restart with "Saved offset is before replication slot's confirmed lsn," forcing them to perform full re-syncs of their databases or downgrade their pgjdbc version.&lt;/p&gt;
&lt;p&gt;That's when we realized we might be an unusual case at Zalando. Since we launched Fabric Event Streams in late 2018, we've always treated the PostgreSQL replication slot as the authoritative source of truth for stream position. Because we didn't care what offset Debezium tracked internally, we ran with the ephemeral &lt;code&gt;MemoryOffsetBackingStore&lt;/code&gt;, meaning our connectors always deferred to the slot position on startup and the keepalive flush advancing the slot ahead of the stored offset was never a problem for us.&lt;/p&gt;
&lt;p&gt;But why did we trust our replication slots so completely? The answer lies in our PostgreSQL infrastructure. At Zalando, we've been running PostgreSQL at scale using &lt;a href="https://github.com/patroni/patroni"&gt;Patroni&lt;/a&gt; (&lt;a href="https://engineering.zalando.com/posts/2016/02/zalandos-patroni-a-template-for-high-availability-postgresql.html"&gt;our open source solution&lt;/a&gt;) for automatic failover since the mid-2010s, and later built the &lt;a href="https://engineering.zalando.com/posts/2018/11/postgres-operator.html"&gt;Postgres Operator&lt;/a&gt; to manage &lt;a href="https://engineering.zalando.com/posts/2017/06/postgresql-in-a-time-of-kubernetes.html"&gt;PostgreSQL on Kubernetes&lt;/a&gt;. From day one of our logical replication rollout in late 2018, we implemented replication slot management that ensured slots survived failovers, so we could confidently trust the slot position as durable and correct.&lt;/p&gt;
&lt;p&gt;Most other Debezium users, however, were using persistent offset stores like Kafka Connect's offset topics to track their position, treating the offset as the authoritative source of truth and the replication slot as just a PostgreSQL implementation detail. For them, having the slot advance ahead of the stored offset due to keepalive flushes created an irreconcilable conflict that forced full re-syncs.&lt;/p&gt;
&lt;h2&gt;The Real Problem: When the Slot and Offset Disagree&lt;/h2&gt;
&lt;p&gt;The root issue became clear when we examined the conflict between the keepalive flush mechanism and Debezium's offset management. When using logical replication, position is tracked in two places: Debezium tracks its position in an offset store (Kafka, memory, or another backing store), while Postgres tracks the replication slot position on the database server.&lt;/p&gt;
&lt;p&gt;When Debezium starts, it compares these positions. By default, if the offset is behind the slot (&lt;code&gt;offset_lsn &amp;lt; slot_lsn&lt;/code&gt;), Debezium attempts to stream from the stored offset without validation, and if the requested LSN is no longer available in the WAL, PostgreSQL returns an error. Users could optionally enable the &lt;code&gt;internal.slot.seek.to.known.offset.on.start=true&lt;/code&gt; configuration for a stricter "fail fast" policy that would immediately fail with "Saved offset is before replication slot's confirmed lsn" to detect slot recreation scenarios.&lt;/p&gt;
&lt;p&gt;Neither approach handled the keepalive flush scenario well. The default behavior would fail with a cryptic WAL error when trying to stream from an LSN that had been cleaned up, while the strict validation would immediately fail even though no actual data had been lost.&lt;/p&gt;
&lt;p&gt;Here's why this conflicted with the pgjdbc keepalive flush:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The JDBC driver advances the slot LSN to skip unmonitored WAL activity like vacuums and checkpoints&lt;/li&gt;
&lt;li&gt;The connector hasn't flushed its offset yet because it's waiting for the next change event&lt;/li&gt;
&lt;li&gt;The connector crashes or restarts before flushing its offset&lt;/li&gt;
&lt;li&gt;On restart, &lt;code&gt;offset_lsn &amp;lt; slot_lsn&lt;/code&gt; because the slot advanced past the stored offset&lt;/li&gt;
&lt;li&gt;The connector refuses to start, even though no actual data has been lost&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Users with durable offset stores would hit this constantly in &lt;code&gt;connector_and_driver&lt;/code&gt; mode: the keepalive flush was doing its job of preventing WAL growth, but Debezium's strict validation treated the resulting gap as an error, which made the mode operationally unsafe. We needed a way to let users trust the slot position when they knew it was reliable.&lt;/p&gt;
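&lt;p&gt;The comparison at the heart of this scenario is easy to reproduce. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a byte position in the WAL), so it has to be parsed before it can be compared. The sketch below uses hypothetical helper names and made-up LSN values.&lt;/p&gt;

```python
def parse_lsn(text):
    """Convert PostgreSQL's textual LSN (e.g. '16/B374D848') to an integer."""
    high, low = text.split("/")
    return int(high, 16) * 2**32 + int(low, 16)


def offset_behind_slot(offset_lsn, slot_lsn):
    """True when the stored offset is behind the slot's confirmed flush LSN."""
    return parse_lsn(slot_lsn) > parse_lsn(offset_lsn)


# Illustrative values only: the connector stored its offset at 0/3000060,
# then keepalive flushes moved the slot past unmonitored WAL activity.
stored_offset = "0/3000060"
slot_confirmed_flush = "0/5000000"

print(offset_behind_slot(stored_offset, slot_confirmed_flush))
```

&lt;p&gt;Comparing the raw strings instead would mis-order values such as &lt;code&gt;0/9&lt;/code&gt; and &lt;code&gt;0/10&lt;/code&gt;, which is why the sketch converts to integers first.&lt;/p&gt;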
&lt;h2&gt;Second Contribution: Trust the Slot&lt;/h2&gt;
&lt;p&gt;We opened &lt;a href="https://issues.redhat.com/browse/DBZ-9688"&gt;DBZ-9688&lt;/a&gt; proposing a new way to handle offset/slot mismatches by introducing an &lt;code&gt;offset.mismatch.strategy&lt;/code&gt; configuration property. Taking inspiration from Kafka's &lt;code&gt;auto.offset.reset&lt;/code&gt; configuration, which allows consumers to opt-in to trusting the broker's position when their local state is invalid, we proposed allowing Debezium users to opt-in to trusting the PostgreSQL replication slot's position.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/debezium/debezium/pull/6948"&gt;PR #6948&lt;/a&gt; introduced the &lt;code&gt;offset.mismatch.strategy&lt;/code&gt; enum to control connector behavior when the stored offset LSN differs from the replication slot's confirmed flush LSN. This property replaces and deprecates the existing &lt;code&gt;internal.slot.seek.to.known.offset.on.start&lt;/code&gt; boolean with four explicit strategies:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;no_validation&lt;/code&gt;&lt;/strong&gt; (default) - The connector attempts to stream from the stored offset without validating the replication slot state. If the slot is ahead of the offset and the requested LSN is no longer available in the WAL, PostgreSQL will return an error. This maintains existing default behavior and provides backward compatibility.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;trust_offset&lt;/code&gt;&lt;/strong&gt; - The connector validates that the stored offset is not behind the replication slot's confirmed flush LSN, failing with an error if the offset is behind the slot to indicate potential data loss. If the offset is ahead of or equal to the slot, the connector advances the slot to the offset position using &lt;code&gt;pg_replication_slot_advance()&lt;/code&gt; when possible. This strategy replaces the &lt;code&gt;internal.slot.seek.to.known.offset.on.start=true&lt;/code&gt; configuration and is useful when you want to detect and be alerted to unexpected slot state changes that could indicate data loss.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;trust_slot&lt;/code&gt;&lt;/strong&gt; - The connector treats the PostgreSQL replication slot as the authoritative source of truth. If the stored offset is behind the slot's confirmed flush LSN, the connector automatically advances the offset to match the slot position, skipping replay of events between the stored offset and the slot position. This is appropriate when using &lt;code&gt;lsn.flush.mode=connector_and_driver&lt;/code&gt;, which requires trusting the slot position.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;trust_greater_lsn&lt;/code&gt;&lt;/strong&gt; - The connector synchronizes to the maximum LSN between the stored offset and the slot's confirmed flush LSN, providing bidirectional synchronization. If the offset is behind the slot, the connector advances the offset to the slot position. If the offset is ahead of the slot, the connector advances the slot to the offset position when possible.&lt;/p&gt;
&lt;p&gt;Similar to the first contribution, we ensured backward compatibility by automatically mapping the deprecated &lt;code&gt;internal.slot.seek.to.known.offset.on.start&lt;/code&gt; boolean to the new strategy enum: &lt;code&gt;false&lt;/code&gt; maps to &lt;code&gt;no_validation&lt;/code&gt; and &lt;code&gt;true&lt;/code&gt; maps to &lt;code&gt;trust_offset&lt;/code&gt;, preserving existing behavior while giving users time to adopt the new configuration.&lt;/p&gt;
&lt;p&gt;This gives Debezium operators flexibility to match the connector's behavior to their operational reality. Users configuring &lt;code&gt;lsn.flush.mode=connector_and_driver&lt;/code&gt; can pair it with &lt;code&gt;offset.mismatch.strategy=trust_slot&lt;/code&gt; for safe, production-ready operation with durable offset stores. It also helps in manual recovery scenarios where an operator needs to advance a slot past corrupted WAL using &lt;code&gt;pg_replication_slot_advance()&lt;/code&gt;, allowing them to configure Debezium to respect that change instead of refusing to start.&lt;/p&gt;
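&lt;p&gt;As a rough mental model, the four strategies can be written as one decision function over two integer LSNs. This is a simplified sketch of the behavior described above, not Debezium's implementation: the function and action names are hypothetical, and because the description of &lt;code&gt;trust_slot&lt;/code&gt; does not cover the case where the offset is ahead of the slot, the sketch assumes it simply streams from the offset.&lt;/p&gt;

```python
STRATEGIES = ("no_validation", "trust_offset", "trust_slot", "trust_greater_lsn")


def resolve_start(strategy, offset_lsn, slot_lsn):
    """Return (start_lsn, action) for integer LSNs under the given strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown offset.mismatch.strategy: {strategy}")

    if strategy == "no_validation":
        # No check at all: PostgreSQL may still reject the LSN later.
        return offset_lsn, "stream_from_offset"

    if slot_lsn > offset_lsn:
        # The stored offset is behind the slot's confirmed flush LSN.
        if strategy == "trust_offset":
            raise RuntimeError(
                "Saved offset is before replication slot's confirmed lsn"
            )
        # trust_slot and trust_greater_lsn both move the offset forward.
        return slot_lsn, "advance_offset_to_slot"

    if strategy == "trust_slot":
        # Assumption: not specified in the article for this case.
        return offset_lsn, "stream_from_offset"

    # trust_offset and trust_greater_lsn advance the slot when possible
    # (a no-op when both positions are already equal).
    return offset_lsn, "advance_slot_to_offset"


# The keepalive scenario: the slot (200) has moved past the stored offset (100).
print(resolve_start("trust_slot", 100, 200))
print(resolve_start("trust_greater_lsn", 100, 200))
```

&lt;p&gt;In the keepalive scenario, only &lt;code&gt;trust_slot&lt;/code&gt; and &lt;code&gt;trust_greater_lsn&lt;/code&gt; let the connector start cleanly, which is why they are the ones paired with &lt;code&gt;lsn.flush.mode=connector_and_driver&lt;/code&gt;.&lt;/p&gt;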
&lt;p&gt;One of the most important parts of contributing to open source is ensuring that users can actually discover and understand your features. We worked closely with the Debezium team to document both &lt;code&gt;lsn.flush.mode&lt;/code&gt; and &lt;code&gt;offset.mismatch.strategy&lt;/code&gt; in the PostgreSQL connector documentation, explaining the relationship between the two properties and providing guidance on when to use each mode.&lt;/p&gt;
&lt;h2&gt;A Note of Gratitude&lt;/h2&gt;
&lt;p&gt;Both features were released and are &lt;a href="https://debezium.io/blog/2025/12/16/debezium-3-4-final-released/"&gt;available as of Debezium 3.4.0.Final&lt;/a&gt;. The changes unblocked our upgrade path and will enable safer logical replication for the broader community. Throughout the process, the Debezium engineers were incredibly responsive and helpful, engaging thoughtfully with our use cases, providing detailed feedback on our PRs, and helping us design solutions that worked for everyone.&lt;/p&gt;
&lt;p&gt;This experience reminded us why we love working in open source: we had a problem, we proposed solutions, and the maintainers worked with us to make those solutions better. Now everyone benefits from these features, whether they're running hundreds of connectors like we are at Zalando or just getting started with logical replication.&lt;/p&gt;
&lt;p&gt;A big thank you to the Debezium team, not just for building such a critical piece of infrastructure that powers event streaming at Zalando and countless other organizations, but for being so open to contributions and discussions from the community. We're grateful to give back.&lt;/p&gt;
&lt;h2&gt;What This Means for You&lt;/h2&gt;
&lt;p&gt;If you're using PostgreSQL logical replication with Debezium, these features might help you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Experiencing WAL growth on low-activity databases? Configure &lt;code&gt;lsn.flush.mode=connector_and_driver&lt;/code&gt; paired with &lt;code&gt;offset.mismatch.strategy=trust_greater_lsn&lt;/code&gt; (for persistent offset stores) to prevent WAL accumulation without requiring dummy writes. The &lt;code&gt;trust_greater_lsn&lt;/code&gt; strategy provides bidirectional synchronization and self-healing recovery.&lt;/li&gt;
&lt;li&gt;Need to manually advance replication slots past corrupted WAL segments? Use &lt;code&gt;offset.mismatch.strategy=trust_slot&lt;/code&gt; or &lt;code&gt;trust_greater_lsn&lt;/code&gt; to recover without re-snapshotting your entire database.&lt;/li&gt;
&lt;li&gt;Want maximum safety and to detect unexpected slot changes? Use &lt;code&gt;offset.mismatch.strategy=trust_offset&lt;/code&gt; to validate that your stored offset is never behind the slot, catching potential data loss scenarios early.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At Zalando, these features keep our event streaming infrastructure running smoothly across hundreds of Postgres databases, and we hope they help you build reliable logical replication systems too.&lt;/p&gt;</content><category term="Zalando"/><category term="PostgreSQL"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>The Day Our Own Queries DoS’ed Us: Inside Zalando Search</title><link href="https://engineering.zalando.com/posts/2025/12/we-hacked-ourselves-so-you-dont-have-to.html" rel="alternate"/><published>2025-12-17T00:00:00+01:00</published><updated>2025-12-17T00:00:00+01:00</updated><author><name>Maryna Kryvko</name></author><id>tag:engineering.zalando.com,2025-12-17:/posts/2025/12/we-hacked-ourselves-so-you-dont-have-to.html</id><summary type="html">&lt;p&gt;Once upon a time, during a normal Sunday, our team ran into an unexpected challenge: an Elasticsearch cluster that suddenly became sluggish and unresponsive due to a self-inflicted Denial of Service (DoS) attack (of course we didn't know it at the time). This is the story of how we identified, mitigated, and learned from this incident.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Who We Are&lt;/h2&gt;
&lt;p&gt;We are part of Zalando’s &lt;strong&gt;Search &amp;amp; Browse team&lt;/strong&gt;, responsible for maintaining and optimizing the catalog and full-text search backends that power millions of user requests every single day. Our systems serve multiple catalog domains and experiences – from the core catalog to the Designer experience and our full-text search – and they also feed newer interfaces like &lt;a href="https://corporate.zalando.com/en/technology/more-personal-and-smarter-zalando-assistant-enhanced-capabilities-inspire-customers"&gt;Zalando Assistant&lt;/a&gt;, which depends on us to fetch and recommend products in real time.&lt;/p&gt;
&lt;p&gt;When everything is healthy, the experience feels effortless. Customers can search, filter, and explore the catalog, ask Zalando Assistant for ideas and instantly see relevant products. At the same time, our brand partners’ promotions, campaigns, and sponsored placements are delivered as planned, reaching the right users at the right moment.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Catalog page with dresses" src="https://engineering.zalando.com/posts/2025/12/images/catalog-dresses.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Catalog page with dresses&lt;/figcaption&gt;

&lt;p&gt;But when catalog search is slow or down, the impact is immediate and far-reaching. Customers suddenly can’t find what they want, filters and discovery flows break, and Zalando Assistant simply can’t fetch products to show. Partner campaigns underperform or effectively go dark, meaning money, planning, and trust are burned in real time. Negative reviews, customer complaints, and internal escalations start popping up across channels as frustration grows.&lt;/p&gt;
&lt;p&gt;In short: when catalog search is degraded, it’s not “just” a tech issue.
It hits customers, partners, campaigns, and Zalando’s reputation all at once.
&lt;strong&gt;It’s a big deal. A very, very big deal.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Anatomy of the System Under High Load&lt;/h2&gt;
&lt;p&gt;From the outside, “search is slow” looks like a single symptom. In reality, it flows through several layers, each doing its own work and adding its own pressure under load.&lt;/p&gt;
&lt;p&gt;At the bottom, we have &lt;strong&gt;Base Search&lt;/strong&gt;, an Elasticsearch cluster that provides initial candidates using both classic lexical matching and vector search. On top of that sits our full‑text search &lt;strong&gt;query builder&lt;/strong&gt;, a Named Entity Recognition (NER) system that takes the raw user intent (query text and user filters) and, based on the entities it recognizes, promotes implicit filters that can shrink or expand the result set. The NER system then builds an Elasticsearch query and attaches metadata indicating whether the result set looks sparse and might require expansion using our newer neural matching system.&lt;/p&gt;
&lt;p&gt;The NER system also queries Base Search for product counts, to understand how many products match different filter combinations and to decide whether we can safely narrow the search with extra filters without risking “zero results”.&lt;/p&gt;
&lt;p&gt;Above that, the &lt;strong&gt;Catalog API&lt;/strong&gt;, one of our presentation layers, coordinates everything that has to happen for a single “search” in the app:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It fans out a request into several queries to Base Search.&lt;/li&gt;
&lt;li&gt;It is integrated with our A/B testing framework, so different users or cohorts may trigger slightly different query shapes.&lt;/li&gt;
&lt;li&gt;It owns the “final redirect” decisions – for example, whether a query should land on a generic search result page or on dedicated landing pages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each layer includes its own caching:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The query builder and Catalog API cache popular queries and filter combinations.&lt;/li&gt;
&lt;li&gt;In Elasticsearch, coordinator nodes are placed on separate machines and provide another caching layer for search results and aggregations.&lt;/li&gt;
&lt;li&gt;The Base Search Elasticsearch cluster itself is wrapped by a lightweight Search API component - another presentation layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;strong&gt;Search API&lt;/strong&gt; in turn integrates with other components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Algorithm Gateway&lt;/strong&gt; enriches results with user action data and re‑ranks them using the rules engine and our personalization and relevance ML models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Promotions bidding service&lt;/strong&gt; blends sponsored content with organic results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For every search, the Catalog API issues a separate call to fetch the filters (facets) for that query: brand, size, color, price bucket, etc. These are aggregation‑heavy queries by design and stress Elasticsearch differently than “plain” document retrieval. Under normal conditions, they are well‑behaved and benefit from caching at multiple layers.&lt;/p&gt;
&lt;p&gt;Under high load, however, a pathological pattern in just this one type of query – facets – can put disproportionate pressure on Elasticsearch and its coordinator nodes, while everything above simply sees “search is slow” and “filters are broken”.&lt;/p&gt;
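&lt;p&gt;To make the facet call more concrete, here is a minimal sketch of what an aggregation-only request body could look like. The field names (&lt;code&gt;brand&lt;/code&gt;, &lt;code&gt;size&lt;/code&gt;, &lt;code&gt;color&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;) and the helper &lt;code&gt;buildFacetRequest&lt;/code&gt; are assumptions made for illustration, not our actual mapping or code; only the &lt;code&gt;size&lt;/code&gt;, &lt;code&gt;query&lt;/code&gt;, and &lt;code&gt;aggs&lt;/code&gt; with &lt;code&gt;terms&lt;/code&gt; keys are standard Elasticsearch query DSL.&lt;/p&gt;

```typescript
// Hypothetical facet fields; a real catalog mapping will differ.
const FACET_FIELDS: string[] = ["brand", "size", "color"];

type TermsAgg = { terms: { field: string; size: number } };
type FacetRequest = {
  size: number;
  query: { match: { [field: string]: string } };
  aggs: { [name: string]: TermsAgg };
};

// Builds an aggregation-only search body: no hits, one terms aggregation per facet.
function buildFacetRequest(text: string, bucketsPerFacet: number): FacetRequest {
  const aggs: { [name: string]: TermsAgg } = {};
  for (const field of FACET_FIELDS) {
    aggs[field] = { terms: { field, size: bucketsPerFacet } };
  }
  return {
    size: 0, // facets only need counts, not documents
    query: { match: { title: text } }, // "title" is an assumed field name
    aggs,
  };
}

console.log(JSON.stringify(buildFacetRequest("summer dress", 50), null, 2));
```

&lt;p&gt;Every entry under &lt;code&gt;aggs&lt;/code&gt; makes each shard build buckets for the matching documents, which is why this request shape stresses the cluster differently from plain document retrieval.&lt;/p&gt;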
&lt;h2&gt;The Incident&lt;/h2&gt;
&lt;p&gt;On a seemingly ordinary Sunday, our Elasticsearch cluster began to exhibit signs of distress. Queries that usually took milliseconds were now dragging on for seconds, and some requests were timing out altogether. Users started seeing empty result pages, or pages with just a few items. Here are some examples of the immediate feedback in one-star app reviews from users affected by the incident:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"App barely functional. Search and filter function not usable. App therefore unusable."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Each filter shows 0 results found."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Filters are buggy, constantly show that no articles were found or are suddenly no longer displayed."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The initial alerts came from our monitoring systems, which indicated a spike in response times and error rates. The Search on-call responder was paged.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Catalog error page" src="https://engineering.zalando.com/posts/2025/12/images/catalog-err-page.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Catalog error page&lt;/figcaption&gt;

&lt;p&gt;The responder quickly jumped into action, diving into the logs and metrics to identify the root cause. After initial investigation, it became clear that the Elasticsearch cluster CPU was spiking. No recent deployments or configuration changes had been made. No sudden increase in client traffic was observed; the same was true for the write load. The cluster was simply overwhelmed for no apparent reason.&lt;/p&gt;
&lt;p&gt;The issue was only affecting one of our Elasticsearch clusters, the one responsible for serving two of the largest markets. Other clusters were functioning normally, proving that our market grouping isolation worked as designed. That was a small relief, but it was still unclear what was different about this particular cluster, since they were all configured similarly.&lt;/p&gt;
&lt;h2&gt;Immediate Actions Taken&lt;/h2&gt;
&lt;p&gt;The on-call responder, leveraging our &lt;a href="https://engineering.zalando.com/posts/2023/01/how-we-manage-our-1200-incident-playbooks.html"&gt;incident playbooks&lt;/a&gt;, initiated a series of immediate actions to stabilize the situation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Applying longer cache expiration times to reduce load on the cluster;&lt;/li&gt;
&lt;li&gt;Disabling non-critical requests that could be consuming resources;&lt;/li&gt;
&lt;li&gt;Applying lower cluster-wide query termination thresholds to prevent long-running queries from hogging resources;&lt;/li&gt;
&lt;li&gt;Scaling out the coordinator nodes to distribute the query load more effectively;&lt;/li&gt;
&lt;li&gt;Scaling out data nodes to increase the cluster's capacity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These actions, however, did not provide even temporary relief. The cluster remained under significant strain, and the root cause of the issue was still unclear. Applying and verifying all these actions took time, and by then several other team members had joined the on-call responder to help. But the situation was still critical.&lt;/p&gt;
&lt;h2&gt;The Markets&lt;/h2&gt;
&lt;p&gt;As mentioned above, the issue was predominantly affecting two markets with larger product catalogs. As an experiment, we decided to split the markets into two separate clusters, to see if that would help alleviate the load or isolate the problem. If the relocated market had no issues, it would indicate that the problem was related to the specific queries or infrastructure associated with the remaining market.&lt;/p&gt;
&lt;p&gt;The market split would be done by using the node allocation settings in Elasticsearch, which allow for controlling which nodes hold which shards. By specifying different node groups for the two markets, the data could be effectively split.&lt;/p&gt;
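&lt;p&gt;As a rough illustration of the node allocation approach, the sketch below builds update-settings requests that use Elasticsearch's index-level shard allocation filtering (&lt;code&gt;index.routing.allocation.require.*&lt;/code&gt;). The node attribute &lt;code&gt;market_group&lt;/code&gt;, the index names, and the group names are hypothetical, and the helper only constructs the requests; it does not send them.&lt;/p&gt;

```typescript
// Assumed custom node attribute, set per node as `node.attr.market_group` in elasticsearch.yml.
const NODE_ATTRIBUTE = "market_group";

type SettingsRequest = {
  method: string;
  path: string;
  body: { [setting: string]: string };
};

// Builds an update-settings request that pins all shards of an index to nodes
// whose attribute value matches the given group.
function pinIndexToNodeGroup(index: string, group: string): SettingsRequest {
  return {
    method: "PUT",
    path: "/" + encodeURIComponent(index) + "/_settings",
    body: { ["index.routing.allocation.require." + NODE_ATTRIBUTE]: group },
  };
}

// Hypothetical index and group names for the two markets.
const plan: SettingsRequest[] = [
  pinIndexToNodeGroup("catalog-market-a", "group-a"),
  pinIndexToNodeGroup("catalog-market-b", "group-b"),
];
for (const request of plan) {
  console.log(request.method, request.path, JSON.stringify(request.body));
}
```

&lt;p&gt;Once such a setting is applied, Elasticsearch relocates the affected shards on its own, which is why the number of shards to move matters while the cluster is already under pressure.&lt;/p&gt;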
&lt;h2&gt;Additional Load Shedding: Making the Cluster Breathe Again&lt;/h2&gt;
&lt;p&gt;To give the cluster some air, we rolled out additional load shedding measures in parallel. On the Elasticsearch side, we first reduced the number of shard replicas, so there would be fewer shards to relocate once we started splitting the markets. We then throttled ingestion all the way down to a full stop, ensuring no new data was being written while the cluster was already struggling and while shards were being moved. Finally, we split the markets: the smaller of the two markets was relocated to a new Elasticsearch cluster, while the larger one stayed on the original one.&lt;/p&gt;
&lt;p&gt;At the same time, on the application side, the presentation layers were used as a control plane to reduce load on downstream systems: we turned off non‑critical calls, reduced the number of parallel queries per request, and increased cache effectiveness for hot queries and filter combinations. Our search steering configuration also played a key role here: we lowered the load generated by search by sampling fewer requests into some heavier ML model integrations and promotion‑enrichment flows, falling back to simpler ranking where needed.&lt;/p&gt;
&lt;h2&gt;New Investigation and Finally, Root Cause&lt;/h2&gt;
&lt;p&gt;Having split the two markets into dedicated clusters, we proved that the issue originated from queries targeting a single market. The team began a more in-depth investigation into the queries being executed on the affected cluster. It turned out that Elasticsearch was under heavy load because of a specific type of query: faceting queries that were performing aggregations. We tried to pull a sample of slow queries, but the cluster was too overloaded to respond to the request. With the cluster in distress, all queries became slow and the tasks index was overflowing with long-running queries. Many tasks were also rejected before they could be completed or even accepted, because the queue was full.&lt;/p&gt;
&lt;h2&gt;Before the Dawn: Cluster Recovery&lt;/h2&gt;
&lt;p&gt;At some point in the evening, the cluster started to recover. The CPU usage began to drop, and the query response times improved. The cluster returned to a stable state. However, the root cause of the issue was still not understood, so the team continued to investigate. No one was satisfied with just having the cluster back up; they needed to know what had caused the problem in the first place. The incident could resurface at any time if the underlying issue was not addressed.&lt;/p&gt;
&lt;p&gt;After some more digging, an exploratory analysis of traces in a Lightstep notebook detected an unusual traffic pattern from one of our internal applications. Further investigation revealed that the application was sending 50 times more queries than usual, and it matched the incident timeline exactly.&lt;/p&gt;
&lt;h2&gt;The Revelation&lt;/h2&gt;
&lt;p&gt;These queries were not typical user queries. They were faceting queries that were requesting huge aggregations on very high cardinality fields, specifically on the SKU, which is a unique product ID. These types of queries are extremely resource-intensive, as they require Elasticsearch to process and aggregate a vast amount of data. They also make no sense from a business perspective, as faceting on unique identifiers does not provide any meaningful insights.&lt;/p&gt;
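&lt;p&gt;A small simulation shows why cardinality matters. The sketch below uses a synthetic catalog (the sizes and the &lt;code&gt;makeCatalog&lt;/code&gt; and &lt;code&gt;bucketCount&lt;/code&gt; helpers are made up for illustration) and counts how many buckets a terms aggregation would have to track for a repeating field versus a unique one.&lt;/p&gt;

```typescript
type Product = { sku: string; brand: string };

// Generates a synthetic catalog: every SKU is unique, brands repeat.
function makeCatalog(productCount: number, brandCount: number): Product[] {
  const products: Product[] = [];
  for (let i = 0; i !== productCount; i++) {
    products.push({ sku: "SKU-" + i, brand: "brand-" + (i % brandCount) });
  }
  return products;
}

// Counts the buckets a terms aggregation would have to track for a field:
// one bucket per distinct value.
function bucketCount(products: Product[], field: "sku" | "brand"): number {
  return new Set(products.map((p) => p[field])).size;
}

const catalog = makeCatalog(100000, 500);
console.log("brand buckets:", bucketCount(catalog, "brand")); // 500
console.log("sku buckets:", bucketCount(catalog, "sku")); // 100000
```

&lt;p&gt;For a unique identifier, the bucket count grows with the number of matching documents, so the aggregation degenerates into counting every product once.&lt;/p&gt;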
&lt;p&gt;&lt;img alt="DoS attack" src="https://engineering.zalando.com/posts/2025/12/images/ddos-attack.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;DoS attack: image generated by AI&lt;/figcaption&gt;

&lt;p&gt;It was later discovered that the root cause of the issue was a self-inflicted Denial of Service (DoS) attack. As a result of a maintenance workload coupled with a bug in the processing logic of the application, the internal client application was sending a small but sufficient number of extremely heavy faceting queries in parallel to the Elasticsearch cluster.&lt;/p&gt;
&lt;h2&gt;Why wasn't this detected earlier?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Because the queries were legitimate in terms of syntax and structure. They were valid Elasticsearch queries, but they were being used in a way that was not intended or expected. A workload meant to be executed seldom, triggered by business users, was getting triggered by the maintenance procedure in an automated fashion.&lt;/li&gt;
&lt;li&gt;Because the service sending the queries was an internal application, a legit one, and not new.&lt;/li&gt;
&lt;li&gt;Because the load was very low in terms of volume. Elasticsearch usually handles thousands of requests per second. This service was only sending 20-100 requests per second, which in terms of normal Elasticsearch load was peanuts. We did have per-client traffic monitoring, but the load from this service was just too low to attract any attention; it was simply flying under the radar, dwarfed by the traffic from other services triggering thousands of requests per second.&lt;/li&gt;
&lt;li&gt;Because the slow queries, while being monitored, were not being analyzed in depth. The team was focused on the overall cluster health and performance metrics, and the slow queries were just a symptom of the larger issue.&lt;/li&gt;
&lt;li&gt;Because the slow queries didn't have any specific tags or identifiers that would link them to the client application. They were just faceting queries, indistinguishable from any other faceting queries that might be executed by legitimate users.&lt;/li&gt;
&lt;/ul&gt;
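&lt;p&gt;The volume argument can be illustrated with a small worked example. The numbers below are invented for illustration and are not measurements from the incident; the client names are hypothetical too. Ranking clients by request rate puts the low-volume client last, while ranking by consumed search time (rate multiplied by average duration) puts it first.&lt;/p&gt;

```typescript
type ClientStats = { client: string; requestsPerSecond: number; avgTookMs: number };

// Illustrative numbers only, not measurements from the incident.
const stats: ClientStats[] = [
  { client: "catalog-api", requestsPerSecond: 4000, avgTookMs: 15 },
  { client: "query-builder", requestsPerSecond: 2500, avgTookMs: 10 },
  { client: "internal-tool", requestsPerSecond: 50, avgTookMs: 9000 },
];

// Search time consumed per wall-clock second, in milliseconds.
function costPerSecond(s: ClientStats): number {
  return s.requestsPerSecond * s.avgTookMs;
}

// Returns client names ordered by the given metric, highest first.
function rankBy(metric: (s: ClientStats) => number): string[] {
  return stats
    .slice()
    .sort((a, b) => metric(b) - metric(a))
    .map((s) => s.client);
}

console.log("by volume:", rankBy((s) => s.requestsPerSecond));
console.log("by cost:", rankBy(costPerSecond));
```

&lt;p&gt;With these made-up figures, the 50 requests per second from the internal tool consume several times more search time than the 4000 requests per second from the busiest client, yet a volume-based dashboard shows it at the bottom.&lt;/p&gt;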
&lt;p&gt;The key question here is why the faceting queries on high cardinality fields caused an overload of the cluster.&lt;/p&gt;
&lt;h2&gt;Some theory on Elasticsearch DoS via Faceting Queries on High Cardinality Fields&lt;/h2&gt;
&lt;p&gt;When you send a faceting query to Elasticsearch, you’re not just hitting “one big index”. Internally, the request follows a scatter/gather path. A coordinating node (in our case, we had dedicated coordinator nodes) takes the incoming search request, scatters it to all relevant shard copies, and then gathers the partial results back, reducing them into a single response. Shard selection is influenced by Adaptive Replica Selection, which tries to pick the “best” shard copy based on past response times and the node’s search thread‑pool queue size. For aggregations, the coordinator also performs partial reductions in batches instead of waiting for all shards to finish at once, and Elasticsearch enforces soft guardrails like &lt;code&gt;search.max_buckets&lt;/code&gt; to prevent a single request from creating an unbounded number of aggregation buckets. On top of that, we also use index‑level &lt;code&gt;max_result_window&lt;/code&gt; settings to make sure no single request can ask for a “scroll the universe”‑sized result set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ES index structure" src="https://engineering.zalando.com/posts/2025/12/images/es-docs-structure.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;ES index structure: components&lt;/figcaption&gt;
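&lt;p&gt;The gather side of this path can be sketched as a toy model. The code below is not how Elasticsearch implements its reduce phase; it only mimics the idea of a coordinator merging per-shard terms buckets in batches instead of waiting for every shard (in Elasticsearch, the batch size is controlled by the &lt;code&gt;batched_reduce_size&lt;/code&gt; search parameter). The bucket data and the helper names are invented.&lt;/p&gt;

```typescript
type Buckets = { [term: string]: number };

// Merges one partial result into an accumulator by summing doc counts per term.
function mergeInto(acc: Buckets, partial: Buckets): Buckets {
  for (const term of Object.keys(partial)) {
    acc[term] = (acc[term] || 0) + (partial[term] || 0);
  }
  return acc;
}

// Coordinator-side reduce: buffers shard results and folds them into the
// running total whenever `batchSize` results have arrived.
function batchedReduce(shardResults: Buckets[], batchSize: number): Buckets {
  let total: Buckets = {};
  let buffer: Buckets[] = [];
  for (const result of shardResults) {
    buffer.push(result);
    if (buffer.length === batchSize) {
      total = buffer.reduce(mergeInto, total);
      buffer = [];
    }
  }
  return buffer.reduce(mergeInto, total); // fold whatever is left
}

const perShard: Buckets[] = [
  { nike: 3, adidas: 1 },
  { nike: 2, puma: 4 },
  { adidas: 5 },
];
console.log(batchedReduce(perShard, 2)); // { nike: 5, adidas: 6, puma: 4 }
```

&lt;p&gt;The merged map can never be smaller than the number of distinct terms the shards returned, so with a high-cardinality field both the shards and the coordinator end up holding very large bucket sets in memory.&lt;/p&gt;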

&lt;p&gt;All of this work runs on several dedicated search thread pools. The search pool handles per‑shard query and aggregation execution; if too many shard‑level operations run at once, this queue fills up and Elasticsearch starts rejecting requests. The search_coordination pool takes care of the lighter orchestration work on the coordinating node: merging partial results, running reductions, and preparing the final response. Starting with 8.12, Elasticsearch also introduces a search_worker pool used by parallel collectors for some aggregation and query types, where work inside a shard can be fanned out across segments (“slices”) to reduce latency. Our incident, however, was driven by high‑cardinality terms aggregations, which are not executed with those parallel collectors; they simply ran as very heavy work on the search pool, consuming a lot of CPU and memory. A small number of such pathological facet queries was enough to keep the cluster “hot” and to starve normal traffic, which is exactly what a DoS looks like in practice.&lt;/p&gt;
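&lt;p&gt;The starvation effect can be reproduced with a deliberately simplified model of a fixed-size thread pool with a bounded queue. This is a toy simulation under made-up numbers (two workers, a queue of two, tasks measured in abstract ticks), not a description of Elasticsearch's actual scheduler.&lt;/p&gt;

```typescript
// Simulates a fixed-size pool with a bounded queue. Each entry of `arrivals`
// lists the durations (in ticks) of tasks submitted during that tick.
// Returns how many tasks were rejected.
function simulate(workers: number, queueLimit: number, arrivals: number[][]): number {
  const running: number[] = []; // remaining ticks per busy worker
  const queue: number[] = [];
  let rejected = 0;
  for (const batch of arrivals) {
    // Advance running tasks by one tick and free workers that are done.
    for (let i = running.length - 1; i >= 0; i--) {
      running[i] -= 1;
      if (running[i] === 0) running.splice(i, 1);
    }
    // Move queued tasks onto free workers.
    while (workers > running.length) {
      const next = queue.shift();
      if (next === undefined) break;
      running.push(next);
    }
    // Submit new tasks: run, enqueue, or reject.
    for (const ticks of batch) {
      if (workers > running.length) running.push(ticks);
      else if (queueLimit > queue.length) queue.push(ticks);
      else rejected += 1;
    }
  }
  return rejected;
}

const light = [1, 1]; // two short tasks per tick
const normal: number[][] = [light, light, light, light, light, light];
const withHeavy: number[][] = [[100, 100, 1, 1], light, light, light, light, light];

console.log(simulate(2, 2, normal)); // 0
console.log(simulate(2, 2, withHeavy)); // 10
```

&lt;p&gt;Two long-running tasks are enough to occupy every worker, after which the queue fills and all subsequent short tasks are rejected, even though the request rate never changed.&lt;/p&gt;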
&lt;h2&gt;Follow-up Actions and Lessons Learned&lt;/h2&gt;
&lt;p&gt;This incident was a wake-up call for our team. It highlighted the difficulty of investigating performance issues in Elasticsearch, especially when the root cause is not immediately apparent. It also underscored the importance of understanding the behavior of internal applications and their potential impact on shared infrastructure.&lt;/p&gt;
&lt;p&gt;From this incident, we learned several valuable lessons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We need to think about how we can split and isolate workloads better, applying rate limiting based on the type of client traffic. Not all clients should be treated equally, and we might need more granular access control.&lt;/li&gt;
&lt;li&gt;The importance of thorough monitoring and logging. We extended the slow query logging to capture more details about the queries being executed, including client identifiers &lt;a href="https://andreibaptista.medium.com/debugging-slow-queries-in-elasticsearch-using-the-slow-queries-feature-with-x-opaque-id-8a81a894333"&gt;via the X-Opaque-Id request header&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Based on that, we also extended the dashboards to monitor per-client slow query rates, with a specific focus on aggregation queries and their aggregation sizes.&lt;/li&gt;
&lt;li&gt;We introduced application-side query limiting with dynamically adjustable thresholds, to prevent queries that would try to scan or aggregate too much data.&lt;/li&gt;
&lt;li&gt;We improved our playbooks and runbooks for Elasticsearch incidents, providing detailed steps for investigation and mitigation, for distinguishing between high read load vs. write load, and for rate limiting or blocking misbehaving clients.&lt;/li&gt;
&lt;li&gt;We introduced new runbooks on applying cluster-wide settings like &lt;a href="https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/search-settings#search-settings-max-buckets"&gt;search.max_buckets&lt;/a&gt; to limit the size of aggregations on the whole cluster at once.&lt;/li&gt;
&lt;/ul&gt;
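&lt;p&gt;Two of these measures, application-side query limiting and client tagging, can be sketched together. The code below is a minimal illustration and not our production implementation: the limits, the blocked field list, and the helper names &lt;code&gt;checkQuery&lt;/code&gt; and &lt;code&gt;searchHeaders&lt;/code&gt; are assumptions, and the check only inspects top-level terms aggregations. The &lt;code&gt;X-Opaque-Id&lt;/code&gt; header itself is a standard Elasticsearch request header.&lt;/p&gt;

```typescript
type TermsAgg = { terms: { field: string; size?: number } };
type SearchBody = { aggs?: { [name: string]: TermsAgg } };

// Hypothetical limits; in practice these would come from dynamic configuration.
const limits = {
  maxBucketsPerAgg: 1000,
  blockedFields: ["sku"], // unique identifiers make no sense as facets
};

// Returns the reasons a request should be refused; an empty list means it may proceed.
function checkQuery(body: SearchBody): string[] {
  const problems: string[] = [];
  const aggs = body.aggs || {};
  for (const name of Object.keys(aggs)) {
    const { field, size = 10 } = aggs[name].terms; // 10 is the Elasticsearch default size
    if (limits.blockedFields.indexOf(field) !== -1) {
      problems.push(name + ": terms aggregation on '" + field + "' is not allowed");
    }
    if (size > limits.maxBucketsPerAgg) {
      problems.push(name + ": size " + size + " exceeds " + limits.maxBucketsPerAgg);
    }
  }
  return problems;
}

// Tags a request with the calling client so slow logs and the tasks API can attribute it.
function searchHeaders(clientId: string): { [header: string]: string } {
  return { "Content-Type": "application/json", "X-Opaque-Id": clientId };
}

const risky: SearchBody = { aggs: { all_skus: { terms: { field: "sku", size: 50000 } } } };
console.log(checkQuery(risky));
console.log(searchHeaders("internal-tool"));
```

&lt;p&gt;Rejecting such a request in the application keeps it from ever reaching the cluster, while the header makes sure that anything that does get through can be traced back to its sender.&lt;/p&gt;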
&lt;p&gt;But one of the most important lessons learned requires asking the same question again.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Why wasn’t this detected earlier?
Because we were looking for a horse.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You know that old saying about the horse and the zebra? When you hear hoofbeats, think of horses, not zebras. Because horses are common, and zebras are rare.&lt;/p&gt;
&lt;p&gt;But in our case, it happened to be a zebra.&lt;/p&gt;
&lt;p&gt;We were looking for common causes of Elasticsearch performance issues: high read load, high write load, misconfigurations, infrastructure issues. We were not expecting a self-inflicted DoS attack from an internal application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;So keep in mind: sometimes, when you hear hoofbeats, it might just be a zebra.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You don't see them often, but when you do, they can be quite a spectacle.&lt;/p&gt;
&lt;h2&gt;Useful Links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/search-settings#search-settings-max-buckets"&gt;max_buckets setting documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/slow-log"&gt;Elasticsearch Slow Log documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-tasks-list"&gt;Documentation about tasks API mentioning X-Opaque-Id request header&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://andreibaptista.medium.com/debugging-slow-queries-in-elasticsearch-using-the-slow-queries-feature-with-x-opaque-id-8a81a894333"&gt;Medium article by Andrei Baptista on debugging slow queries using X-Opaque-Id&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.ivan.digital/apache-lucene-on-steroids-part-1-inverted-index-search-replication-8243038adde"&gt;Apache Lucene deep dive: Inverted Index, Search, Replication&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="Elasticsearch"/><category term="Search"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Accelerating Mobile App development at Zalando with Rendering Engine and React Native</title><link href="https://engineering.zalando.com/posts/2025/10/accelerating-mobile-app-development-at-zalando-with-rendering-engine-and-react-native.html" rel="alternate"/><published>2025-10-03T00:00:00+02:00</published><updated>2025-10-03T00:00:00+02:00</updated><author><name>Rene Eichhorn</name></author><id>tag:engineering.zalando.com,2025-10-03:/posts/2025/10/accelerating-mobile-app-development-at-zalando-with-rendering-engine-and-react-native.html</id><summary type="html">&lt;p&gt;We present how we combined our internal React-based UI composition framework Rendering Engine with React Native in a brownfield integration approach that enables us to gradually modernise our mobile app technology stack while at the same time introducing cross-platform technologies to build shared experiences.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Recently, Zalando decided to start a large-scale migration of the Zalando mobile app, which is currently built in two different architectures and codebases, one for iOS and one for Android. In September, I had the opportunity to &lt;a href="https://youtu.be/U76fQ_9A89Q"&gt;speak at React Universe Conf&lt;/a&gt; where I provided a high-level view of our approach. This article provides more context and an in-depth look into our decision-making, integration approach, and outlook on how we think about cross-platform customer experience development.&lt;/p&gt;
&lt;p&gt;Before going into the technical details, let’s first clarify how we reached the decision to use React Native in the first place. The core requirements we have can be summarized into the following three pillars:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build &amp;amp; Ship faster&lt;/strong&gt;: Major architectural changes are often driven by the goal to increase speed and efficiency long-term. We operate in a fast-paced environment where we want to experiment with new customer experiences for our 52M+ fashion, beauty, and lifestyle customers. Therefore, to enable our teams to continuously iterate, we need to ensure features can be built quickly on all platforms with as little effort as possible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Progressive adoption:&lt;/strong&gt; Rebuilding our entire app at once is out of scope due to complexity. Migrating more than 90 screens at once is not an option. Our technology choice needs to be adopted in a safe and iterative way so that we can evaluate it on a subset of screens and traffic, before rolling it out to millions of customers and putting in the effort to migrate all screens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Including Web:&lt;/strong&gt; So far, the Zalando website and the Zalando apps have been heading in almost completely different technical directions. While each platform does indeed have its own challenges, there are also a lot of shared concerns. Many capabilities that have been built over the years, such as backend-steered UI and composable, modular components, should not be lost during this transition but rather built upon.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Evaluating React Native&lt;/h2&gt;
&lt;p&gt;Arriving at the decision to use React Native happened in several steps for us. Before integrating React Native into our codebase, we began with a simple but very expressive proof of concept that replicated our current app experience, including all typical needs for an application such as navigation with &lt;a href="https://reactnavigation.org/"&gt;react-navigation&lt;/a&gt;, simple to complex animations with &lt;a href="https://github.com/software-mansion/react-native-reanimated"&gt;react-native-reanimated&lt;/a&gt;, video playback with &lt;a href="https://github.com/TheWidlarzGroup/react-native-video"&gt;react-native-video&lt;/a&gt; and a custom turbo module to showcase native interoperability.
Having access to these strong, community-maintained packages helped a lot in building a prototype with a lot of content. Yet, ultimately proving a new technology to be production-ready comes with additional requirements like observability, analytical tracking events, data fetching and caching, state management, deeplinking, and other capabilities we mentioned earlier.&lt;/p&gt;
&lt;p&gt;Building a scalable React Native application typically demands a complete architectural design. However, we found ourselves in a unique position: the Zalando website already possessed a &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;well-established, scalable framework built on top of React&lt;/a&gt;. By integrating this internal framework into React Native for our proof of concept, we achieved a production-ready setup with live data access in just a few weeks.&lt;/p&gt;
&lt;p&gt;This internal framework is what we call &lt;strong&gt;Rendering Engine&lt;/strong&gt; and its essence is a concept of &lt;em&gt;renderers&lt;/em&gt;, which are supercharged React components that add common application requirements by default. Imagine you write a component and it automatically becomes observable with metrics and traces, handles data fetching and caching, state management and provides an easy way to trigger analytical events, and much more, while at the same time enforcing these components to be independent and as context-insensitive as possible. A detailed write-up about how this works can be found in &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;one of our previous posts&lt;/a&gt; that goes into detail of Rendering Engine.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@if/rendering-engine/api&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;React&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;react&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./query.graphql&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withQueries&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;carousel&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withProcessDependencies&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;No collection data found.&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;render&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;getCollectionEntities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withRender&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Carousel&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{...&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/Carousel&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Enabling development &amp;amp; production readiness&lt;/h2&gt;
&lt;p&gt;While building web applications with Rendering Engine was an established process within Zalando, integrating React Native into an existing large codebase proved to be a new challenge. Having attempted such integration, we ran into several problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Native dependency conflicts&lt;/strong&gt; - React Native and community packages depended on native packages in versions different from the ones our apps already used.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No clear separation&lt;/strong&gt; - We asked ourselves where to put the React Native code and how to embed it properly in our apps’ codebases. Git submodules were one option, but they come with a lot of other issues and don't enforce strict separation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bad developer experience&lt;/strong&gt; - Building a large native app can be slow, despite build caches. Having to build the entire app to get started with React Native, especially for engineers coming from a web background (and unfamiliar with tools like Android Studio and Xcode), posed a major problem, affecting productivity and causing friction in onboarding web engineers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Getting around these problems turned out to be more complicated than initially expected, but in the end, we solved the challenges and arrived at what we call the &lt;strong&gt;React Native as a package&lt;/strong&gt; architecture.&lt;/p&gt;
&lt;p&gt;&lt;img alt="React Native as a Library dependency graph" src="https://engineering.zalando.com/posts/2025/10/images/react-native-as-a-library.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 1: React Native as a Library&lt;/figcaption&gt;

&lt;p&gt;The essence of this approach is to build the React Native part of the Zalando app just like any other React Native application, with one little tweak: we put our React Root Component and initialisation logic into an npm package called the “Entry Point”. This entry point is consumed by our standalone Developer App with a &lt;a href="https://reactnative.dev/docs/next/environment-setup"&gt;standard React Native environment&lt;/a&gt;. Hence, in the developer app, we get all the benefits of React Native, as in any &lt;a href="https://en.wikipedia.org/wiki/Greenfield_project"&gt;greenfield&lt;/a&gt; app, with full isolation from the legacy architecture.
We’ve added our own developer menu on top of React Native's default developer menu (the one you see when you shake the device). It lets developers quickly switch between JavaScript bundles (released versions, pull request builds, and local builds) and provides many other developer experience utilities.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Developer app with custom menu for switching between pages and other utilities" src="https://engineering.zalando.com/posts/2025/10/images/developer-app.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 2: Developer App&lt;/figcaption&gt;

&lt;p&gt;The other consumer is the Framework (SDK), which is a native library/package that contains the entire React Native stack hidden behind a simple-to-use interface.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ReactNativeViewFactory&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;loadView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="kc"&gt;_&lt;/span&gt; &lt;span class="n"&gt;deepLinkProps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DeepLinkProperties&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;launchOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AnyHashable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;UIView&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;interface&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ReactNativeViewFactory&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;createViewHostedInActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FragmentActivity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;screenParameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ReactNativeScreenParameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;View&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;createViewHostedInFragment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Fragment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;screenParameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ReactNativeScreenParameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;View&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A lot of this work has been &lt;a href="https://github.com/callstack/react-native-brownfield"&gt;open sourced&lt;/a&gt; by our friends at Callstack in a simple package that you can use yourself!&lt;/p&gt;
&lt;h3&gt;Interoperability with existing native app&lt;/h3&gt;
&lt;p&gt;As much as we prefer to keep our new architecture isolated from the existing native app, it’s unfortunately not always feasible to do so. Sometimes these two quite different systems still need to communicate. For example, when you add a product to the wishlist in the Zalando app, a little badge increments to show the number of products in your wishlist. If this happens from the React Native side, we need to tell the native app to update the counter accordingly.&lt;/p&gt;
&lt;p&gt;To bridge this functionality, we adopted a standard dependency injection flow that allows communication between the systems while keeping them as isolated as possible. Every kind of communication follows the same flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a new Turbo Module and define its interface with TypeScript types.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="n"&gt;Spec&lt;/span&gt; &lt;span class="n"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;TurboModule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;addProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;shouldShowNotification&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;;&lt;/span&gt;
  &lt;span class="n"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;onProductChange&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EventEmitter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ProductChangeEvent&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Define a compatible interface (or protocol) that will define the API to be injected on the native side, as well as a place to inject it.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kr"&gt;@objc&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;protocol&lt;/span&gt; &lt;span class="nc"&gt;WishlistProtocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;AnyObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;onProductChange&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="nb"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Void&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="kr"&gt;get&lt;/span&gt; &lt;span class="kr"&gt;set&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;addProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;_&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shouldShowNotification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="n"&gt;escaping&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Void&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;@objc&lt;/span&gt; &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WishlistConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;NSObject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kr"&gt;@objc&lt;/span&gt; &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;delegate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WishlistProtocol&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Lastly, in the native app, implement a class that conforms to the interface and injects itself into our Framework (SDK).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives the two systems an easy way to communicate while keeping clear boundaries and contracts between them. A neat side effect is that our standalone developer app can implement those same interfaces with mocked versions, allowing us to keep using the developer app even when testing features like the wishlist.&lt;/p&gt;
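&lt;p&gt;To make the injection flow more concrete, here is a minimal TypeScript sketch of the same idea on a single platform. It is not our production code: &lt;code&gt;WishlistDelegate&lt;/code&gt;, &lt;code&gt;ProductChangeListener&lt;/code&gt;, &lt;code&gt;MockWishlist&lt;/code&gt; and &lt;code&gt;addToWishlist&lt;/code&gt; are hypothetical names chosen for illustration, and the contract is synchronous to keep the example short. It shows a contract, a single injection point, and an in-memory mock of the kind a standalone developer app could inject.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```typescript
// Hypothetical contract shared by every implementation (names are illustrative).
interface ProductChangeListener {
  (sku: string, count: number): void;
}

interface WishlistDelegate {
  onProductChange: ProductChangeListener | null;
  addProduct(sku: string): void;
}

// Injection point: the host app or the developer app sets this once at startup.
class WishlistConfig {
  static delegate: WishlistDelegate | null = null;
}

// In-memory mock, as a standalone developer app might provide.
class MockWishlist implements WishlistDelegate {
  onProductChange: ProductChangeListener | null = null;
  private skus: string[] = [];

  addProduct(sku: string): void {
    // Ignore duplicates so the count reflects distinct products.
    if (this.skus.indexOf(sku) === -1) {
      this.skus.push(sku);
    }
    // Notify the subscriber, e.g. the code that renders the badge.
    if (this.onProductChange !== null) {
      this.onProductChange(sku, this.skus.length);
    }
  }
}

// Feature code depends only on the contract, never on a concrete class.
function addToWishlist(sku: string): boolean {
  const delegate = WishlistConfig.delegate;
  if (delegate === null) {
    return false; // nothing has been injected yet
  }
  delegate.addProduct(sku);
  return true;
}

// Usage: inject the mock, subscribe to changes, add a product.
const mockWishlist = new MockWishlist();
let badgeCount = 0;
mockWishlist.onProductChange = function (_sku: string, count: number): void {
  badgeCount = count;
};
WishlistConfig.delegate = mockWishlist;
addToWishlist("SKU-123");
console.log(badgeCount); // prints 1
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the feature code only sees the contract, swapping the mock for a real implementation would, under these assumptions, require no changes on the calling side.&lt;/p&gt;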
&lt;h2&gt;Cross-platform including Web&lt;/h2&gt;
&lt;p&gt;Relying on the framework initially developed for the Zalando website enables us to share core functionalities and code across both app and web platforms. Furthermore, it unifies our approach to building customer-facing applications at Zalando by standardizing underlying concepts like Renderers. This is great, but we want to explore going even further with cross-platform development. With Rendering Engine, we can share central and foundational logic like data fetching, analytical tracking, caching, etc., but what if we could share UI as well?&lt;/p&gt;
&lt;p&gt;With React Native, there are two main ways to write cross-platform UI components, each coming from a different perspective:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build components the normal React Native way&lt;/strong&gt; using built-in components like &lt;code&gt;&amp;lt;View /&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;ScrollView /&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;Text /&amp;gt;&lt;/code&gt; and so on, and then let &lt;a href="https://github.com/necolas/react-native-web"&gt;react-native-web&lt;/a&gt; take care of translating these components to HTML elements for the browser.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build components with a subset of HTML&lt;/strong&gt; and map the HTML elements to their respective React Native components with &lt;a href="https://facebook.github.io/react-strict-dom/"&gt;react-strict-dom&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So we either write HTML and map that to React Native or we write React Native and map that to HTML. Although both options were feasible, we ultimately opted for react-strict-dom. Our decision was driven by the desire to select the most future-proof solution, which react-strict-dom appeared to be. Furthermore, we believe that HTML and CSS are incredibly expressive, have evolved over many years, and are likely to remain relevant for years to come. In contrast, any other form of UI representation could potentially become obsolete at any point. React-strict-dom also has no additional runtime cost on the web because a build step removes all unnecessary abstraction layers.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;css&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;react-strict-dom&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;styles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;css&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;backgroundColor&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;white&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;:hover&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lightgray&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;10&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MyButton&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;styles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cross&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;platform&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/html.button&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Building a cross-platform component library&lt;/h3&gt;
&lt;p&gt;With react-strict-dom as our cross-platform UI layer, we built a component library for Zalando’s own design system, which includes components and styling for typography, buttons, cards, dialogs, etc. However, building cross-platform components can be quite restrictive: no matter which UI layer you choose, you are limited to the subset of features that works identically across all platforms, and anything not universally supported is stripped away. For us, this is unacceptable, as we want to benefit from cross-platform code without limiting ourselves. Luckily, react-strict-dom and the Metro bundler offer a few utilities that help in that respect.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Platform-specific imports&lt;/strong&gt;: If you create a &lt;code&gt;Foo.native.ts&lt;/code&gt; alongside a &lt;code&gt;Foo.ts&lt;/code&gt; file, whenever you import &lt;code&gt;"./Foo"&lt;/code&gt; Metro will automatically choose between those two files depending on the target platform; &lt;code&gt;.ios.ts&lt;/code&gt; and &lt;code&gt;.android.ts&lt;/code&gt; are available to make it even more specific if needed. Especially in a component library, this is great because even if you have completely different implementations for different platforms, as long as the component's props are the same the consumer doesn't really care about the underlying implementation and is fully abstracted away from platform-specific code. We started using a simple pattern where types would live in a separate file so that we can have safe type checking between multiple implementations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Component Library showing platform specific imports" src="https://engineering.zalando.com/posts/2025/10/images/component-library-example.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 3: Component Library&lt;/figcaption&gt;
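&lt;p&gt;The lookup order behind platform-specific imports can be sketched in a few lines of TypeScript. This is a simplified model rather than Metro's actual resolver: &lt;code&gt;candidateFiles&lt;/code&gt; and &lt;code&gt;pickFile&lt;/code&gt; are hypothetical helpers, only the &lt;code&gt;.ts&lt;/code&gt; extension is considered, and we assume the web target does not fall back to &lt;code&gt;.native&lt;/code&gt; files.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;
```typescript
type Platform = "ios" | "android" | "web";

// Candidate file names, most specific first (simplified model of the lookup).
function candidateFiles(moduleName: string, platform: Platform): string[] {
  const suffixes: string[] =
    platform === "web" ? ["web", ""] : [platform, "native", ""];
  const candidates: string[] = [];
  for (const suffix of suffixes) {
    const infix = suffix === "" ? "" : "." + suffix;
    candidates.push(moduleName + infix + ".ts");
  }
  return candidates;
}

// Return the first candidate that exists in the given file list, or null.
function pickFile(
  moduleName: string,
  platform: Platform,
  existing: string[],
): string | null {
  for (const candidate of candidateFiles(moduleName, platform)) {
    if (existing.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return null;
}

// A component with a shared implementation and a native-only override.
const existingFiles = ["Foo.ts", "Foo.native.ts"];
console.log(pickFile("Foo", "ios", existingFiles)); // prints Foo.native.ts
console.log(pickFile("Foo", "web", existingFiles)); // prints Foo.ts
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this model, adding a &lt;code&gt;Foo.ios.ts&lt;/code&gt; file would take precedence over &lt;code&gt;Foo.native.ts&lt;/code&gt; on iOS only, while Android and web would keep resolving to the same files as before.&lt;/p&gt;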

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;React Strict DOM’s compat&lt;/strong&gt;: While react-strict-dom’s mapping works great, sometimes we want to extend or adjust props passed to the real underlying native component to gain more control over it. React Strict DOM provides a simple-to-use API that allows exactly that.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;CustomSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;...props&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;FooProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;compat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;native&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{...&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;aria&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;label&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kr"&gt;as&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;span&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;nativeProps&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;React.PropsOf&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{...&lt;/span&gt;&lt;span class="nx"&gt;nativeProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/compat.native&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One last piece of a cross-platform component library that we haven’t talked about yet is styling. For our library, we enhanced our styling capabilities with &lt;a href="https://stylexjs.com/"&gt;StyleX&lt;/a&gt;, which works hand in hand with react-strict-dom. We use it to support theming as well as &lt;a href="https://stylexjs.com/docs/learn/styling-ui/defining-styles/#pseudo-classes"&gt;polyfilling a subset of CSS capabilities&lt;/a&gt; like pseudo-classes and media queries. This means that we can use styling variables such as font sizes, colors, and borders, which we call tokens, on all platforms, just like you’d use CSS variables. For the web, all styling and variables are transformed into a regular CSS file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;css&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;react-strict-dom&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@zds/tokens/tokens.stylex&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;DefaultMessage&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MessageProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;defaultStyle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;styles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primaryStyle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;BaseMessage&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{...&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;defaultStyle&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;styles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;css&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;primaryStyle&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;backgroundColor&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;tokens.colorBackgroundDefault&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;borderWidth&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;tokens.borderWidthS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;borderColor&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;tokens.colorBorderSecondary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;borderStyle&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;solid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Where we are now&lt;/h2&gt;
&lt;p&gt;For us, the migration is still ongoing but we have successfully migrated a few screens, ranging from major to minor, including Zalando’s new front screen &lt;strong&gt;&lt;a href="https://corporate.zalando.com/en/financials/zalando-q2-2025-results"&gt;Discovery Feed&lt;/a&gt;&lt;/strong&gt;, which has a strong focus on media, proving that media-heavy content can also be delivered with React Native.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zalando's front screen - Discovery feed" src="https://engineering.zalando.com/posts/2025/10/images/zalando-app-feed.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 4: Discovery Feed&lt;/figcaption&gt;

&lt;p&gt;Making mistakes and learning from them is a normal part of software engineering, and our first releases brought plenty of discoveries. A few highlights include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Launching early&lt;/strong&gt; turned out to be crucial. The first screen we migrated was a low-traffic and very simple screen; however, even in this simplest scenario, we learned a lot. It provided opportunities not just to test the technology early without breaking a major feature, but also to build proper observability based on real customer experience.&lt;/li&gt;
&lt;li&gt;Writing &lt;strong&gt;cross-platform code is a balancing act&lt;/strong&gt; between saving development time and limiting yourself to cross-platform constraints. It’s important to accept that sharing 100% of the code across all platforms, or even just between iOS and Android, is not the goal, just as writing everything in JavaScript and avoiding native code is not the goal, and that’s totally fine.&lt;/li&gt;
&lt;li&gt;Earlier, we mentioned our approach to interoperability between React Native and the existing native apps; getting there was not easy and required a proper process. Especially when combining three environments into one (TypeScript, Swift and Kotlin), it’s crucial to &lt;strong&gt;first properly define these API contracts&lt;/strong&gt; and ensure that all involved environments are compatible with them as early as possible. Otherwise, you run into situations where the API design is not feasible on all platforms, requiring you to undo work that has already been done.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With our foundation in place, we're now focused on accelerating migration velocity while maintaining the quality bar our customers expect. This is an exciting time for mobile development at Zalando, and we're grateful for the strong internal support and the robust open-source ecosystem that made this possible. We look forward to collaborating with the community and contributing our learnings back to the ecosystem.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="React"/><category term="Zalando App"/><category term="Open Source"/><category term="Mobile"/></entry><entry><title>Dead Ends or Data Goldmines? Investment Insights from Two Years of AI-Powered Postmortem Analysis</title><link href="https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html" rel="alternate"/><published>2025-09-25T00:00:00+02:00</published><updated>2025-09-25T00:00:00+02:00</updated><author><name>Dmitry Kolesnikov</name></author><id>tag:engineering.zalando.com,2025-09-25:/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html</id><summary type="html">&lt;p&gt;Your incidents hold the blueprint to your most strategic infrastructure wins — if you're listening correctly.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: We adopted LLMs as an intelligent SRE assistant to analyze thousands of postmortems, transforming them from "dead ends" into "data goldmines." This solution automates the identification of recurring incident patterns, particularly in our datastores: Postgres, AWS DynamoDB, AWS ElastiCache, AWS S3 and Elasticsearch. While AI effectively speeds up analysis, uncovers hidden hotspots and investment opportunities, human curation remains crucial for accuracy, fostering trust, and addressing limitations like hallucinations and surface attribution errors. 
Even so, we see significant potential in pairing AI with SRE: it gives engineering teams a capability that facilitates rapid decision making.&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;At Zalando, a group of colleagues who look after the datastores in our &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Tech Radar&lt;/a&gt; wanted to explore:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“What if every system outage could make our entire infrastructure smarter?”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We took a Site Reliability Engineering (SRE) perspective to extract valuable learnings from failures and postmortems. For us, a critical aspect of SRE is the feedback loop through which systems, teams, and investments evolve. So far, our traditional approach to this feedback loop has been human-centric analysis of incident effects, the root cause analysis (RCA), and the corrective measures implemented to prevent future occurrences. This is a solid technique for immediate reactive learning, but it does not work well for retrospective analysis of years of past incident reports at company scale.&lt;/p&gt;
&lt;p&gt;With the rise of Large Language Models (LLMs), we saw an opportunity. Could LLMs detect patterns, surface systemic issues, and even suggest preventive actions all by analyzing our postmortems at scale? Is it possible to transform past learnings into dynamically evolving datasets? We decided to validate this hypothesis specifically for datastore technologies, prior to scaling this approach further.&lt;/p&gt;
&lt;p&gt;We adopted LLMs as intelligent postmortem review assistants. What began as a time-saving experiment quickly evolved into a valuable source of strategic insights. This post shares what we learned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to turn postmortems into predictive signals for a more reliable future;&lt;/li&gt;
&lt;li&gt;How to tweak AI to read between the lines, supporting decision makers;&lt;/li&gt;
&lt;li&gt;Practical tips about the automation of postmortem analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our experience suggests that the discussed automation is more than a productivity hack, even as we continue to fully adopt and leverage its benefits. It is a strategic lever for engineering teams.&lt;/p&gt;
&lt;h2&gt;The Traditional Postmortem Problem&lt;/h2&gt;
&lt;p&gt;Many companies have inherited the postmortem culture from Google’s Site Reliability Engineering book, described in the chapter “&lt;a href="https://sre.google/sre-book/postmortem-culture/"&gt;Postmortem Culture: Learning from Failure&lt;/a&gt;”. The postmortem culture at Zalando is highly similar. Once the factors negatively affecting business operations have been mitigated, the team in charge of the incident starts the root cause analysis and the implementation of preventive actions. The review involves not just the teams directly responsible for the affected applications, but also stakeholders and adjacent teams. The incident is closed only when engineering leadership (up to VP Engineering depending on impact and severity) agrees on sufficient progress in implementation of preventive actions and signs off on the postmortem. Insights from these incidents are shared bottom-up through &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/devops-guidance/o.cm.8-hold-operational-review-meetings-for-data-transparency.html"&gt;weekly operational reviews&lt;/a&gt;, and horizontally through engineering communities. This transforms each incident into a company-wide learning opportunity. Over time, we’ve accumulated a rich internal dataset: &lt;strong&gt;thousands of archived postmortem documents&lt;/strong&gt; – a gold mine of technical and organizational knowledge.&lt;/p&gt;
&lt;p&gt;Even with this culture of learning, there are limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Postmortems vary widely in depth and clarity. Comparing them and extracting patterns is often perceived as apples-to-oranges;&lt;/li&gt;
&lt;li&gt;Root cause analyses reflect team assumptions, and subtle contributing factors often go unspoken;&lt;/li&gt;
&lt;li&gt;Making connections between incidents across teams requires immense cognitive load and informal networking. Taking an overarching company-level perspective still requires the goodwill of individuals and effective networking.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When your learning about &lt;strong&gt;site reliability depends exclusively on human effort, scale becomes the enemy&lt;/strong&gt;. It takes about 15-20 minutes to thoughtfully read a single postmortem, so a dedicated reviewer can process three to four postmortems per hour at best, assuming continuous focus. Now multiply that by &lt;strong&gt;thousands of postmortems&lt;/strong&gt;. Suddenly, strategic questions like “Why do datastores fail most frequently at scale?” become impossible to answer quickly, or without excessive cognitive load. Even for the bounded datastore area, it was a substantial time investment for us.&lt;/p&gt;
&lt;p&gt;As a result, we risk:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Missing systemic signals that could inform infrastructure investments;&lt;/li&gt;
&lt;li&gt;Reacting to symptoms instead of addressing root causes;&lt;/li&gt;
&lt;li&gt;Delaying decisions due to insufficient insights across domains.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This bottleneck in capacity led us to a clear conclusion: to get strategic value from our postmortem corpus, we needed speed and effectiveness. Specifically, we needed &lt;strong&gt;AI tools capable of reading, interpreting, and synthesizing text at scale&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Our hypothesis was simple: &lt;strong&gt;LLMs could turn a mountain of human-authored documents into a dynamic, decision-making dataset&lt;/strong&gt;. The results, as we’ll explore next, proved even more promising than we expected. The approach eased our cognitive load by reducing the amount of context a reviewer has to absorb, and it detected patterns across a large postmortem corpus quickly.&lt;/p&gt;
&lt;h2&gt;Deploying AI: Automating Postmortem Analysis&lt;/h2&gt;
&lt;p&gt;Our focus was exclusively on our datastores: Postgres, AWS DynamoDB, AWS S3, AWS ElastiCache and Elasticsearch. For each of them, we asked &lt;em&gt;“Why does the datastore fail repeatedly at scale?”&lt;/em&gt; and wanted an instant answer. Google's NotebookLM was a natural first choice as a toolbox for postmortem analysis. It was very effective at producing short summaries from thousands of documents. &lt;strong&gt;Notebooks boosted productivity roughly threefold&lt;/strong&gt;: reading a summary and drawing a conclusion about root causes takes about 5 minutes instead of 15-20. That is still slow at our scale: sifting through summaries takes a dedicated team of experts weeks, so we still could not answer questions quickly. We also &lt;strong&gt;observed severe hallucinations and loss of the incident context&lt;/strong&gt; in the summaries the LLM produced. This demanded extra attention during the analysis, so the cognitive load on reviewers was not reduced and effective productivity suffered. All these factors led us to the decision that &lt;strong&gt;a sophisticated postmortem processing pipeline is required&lt;/strong&gt;. We set out to build an AI-powered system to scale this cognitive task, not just automate it.&lt;/p&gt;
&lt;p&gt;To solve this, &lt;strong&gt;we designed a multi-stage LLM pipeline&lt;/strong&gt; instead of using high-end LLMs with large context windows. It is a deliberate design trade-off aimed at simplicity and reliability. While large context windows allow models to process more information, we observed &lt;strong&gt;a "lost in the middle" effect&lt;/strong&gt;, where details in the middle of long inputs are often overlooked or distorted. In addition, large contexts do not guarantee perfect recall and can increase latency, memory usage, and cost. Our pipeline is a chain of a few models, where each stage specialises strictly in a single objective.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Stage&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Goals&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Input&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Summarization&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Reduces reviewer load by condensing postmortem narratives into a few data points.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Postmortem corpus&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Summary corpus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Classification&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Enables technology-specific clustering across incidents.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Identity of technology buckets; Summary corpus&lt;/td&gt;
&lt;td style="text-align: left;"&gt;N-buckets, each containing postmortem summaries relevant to the technology&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Analyzer&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Converts summaries into thematic failure fingerprints.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;The bucket of summaries&lt;/td&gt;
&lt;td style="text-align: left;"&gt;The bucket of digests, each describing the role of technology in the incident, max 5 sentences.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Patterns&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Detects systemic issues over time.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;The bucket of digests&lt;/td&gt;
&lt;td style="text-align: left;"&gt;The one pager report about the role of technology in all incidents over the time frame, patterns of technology incidents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Opportunity&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Patterns of technology incidents; Postmortem corpus&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Investment opportunity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Ultimately, the pipeline sifts through high-entropy information and distills it into concise reasons for failure. &lt;strong&gt;The functional “map-fold” pattern is a key building block of the pipeline&lt;/strong&gt;. A large set of documents is processed independently by a language model to extract relevant information (the "map" phase). These outputs are then aggregated, either by another LLM invocation or by a deterministic function, into a higher-level summary (the "reduce" or "fold" phase). This modular design supports composable tasks like summarization, classification, or knowledge extraction. The pipeline’s input is thousands of postmortem documents; its output is a one-pager describing the trends and patterns of the incidents in focus. We &lt;strong&gt;leveraged human expertise at each stage, involving examination, labelling and quality control&lt;/strong&gt;, to meet accuracy requirements.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pipeline architecture" src="https://engineering.zalando.com/posts/2025/09/images/pipeline-architecture.png#center"&gt;&lt;/p&gt;
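&lt;p&gt;The “map-fold” pattern can be sketched in a few lines. This is a minimal illustration, not the production pipeline: &lt;code&gt;map_fold&lt;/code&gt;, &lt;code&gt;fake_summarize&lt;/code&gt; and &lt;code&gt;append_line&lt;/code&gt; are hypothetical names, and the "map" step uses a plain string function as a stand-in for an LLM call.&lt;/p&gt;

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

Doc = TypeVar("Doc")
Out = TypeVar("Out")
Acc = TypeVar("Acc")


def map_fold(
    docs: Iterable[Doc],
    map_fn: Callable[[Doc], Out],
    fold_fn: Callable[[Acc, Out], Acc],
    initial: Acc,
) -> Acc:
    """Apply map_fn to every document independently, then fold the results."""
    mapped = (map_fn(doc) for doc in docs)  # "map": one call per document
    return reduce(fold_fn, mapped, initial)  # "fold": aggregate the outputs


# Hypothetical stand-in for an LLM call: keeps only the first sentence.
def fake_summarize(postmortem: str) -> str:
    return postmortem.split(".")[0].strip() + "."


# Deterministic fold: collect the summaries into a numbered report.
def append_line(report: list[str], summary: str) -> list[str]:
    return report + [f"{len(report) + 1}. {summary}"]


corpus = [
    "Cache CPU saturated at peak. Latency rose for checkout.",
    "Bucket policy was misconfigured. Uploads failed for an hour.",
]
report = map_fold(corpus, fake_summarize, append_line, [])
print("\n".join(report))
```

&lt;p&gt;Because each "map" call sees exactly one document, stages built this way can be swapped or chained without touching the others; the fold could equally be another model invocation.&lt;/p&gt;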
&lt;h3&gt;Summarization&lt;/h3&gt;
&lt;p&gt;The stage is designed to distill “complex” incident reports into clear summaries. This step, designed for both humans and machines, ensures that stakeholders can quickly and accurately understand the critical aspects of each incident without sifting through large contexts.&lt;/p&gt;
&lt;p&gt;Using a tightly scoped prompt, built with the &lt;a href="https://aclanthology.org/2023.findings-emnlp.946.pdf"&gt;Turn, Expression, Level of Details, Role&lt;/a&gt; (TELeR) techniques for prompt engineering, the LLM processes each postmortem document and extracts only the most essential information across five core dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Issue Summary - A brief overview of what happened;&lt;/li&gt;
&lt;li&gt;Root Causes - Clear identification of the underlying technical or procedural factors;&lt;/li&gt;
&lt;li&gt;Impact - A factual description of what systems, services, or users were affected and how;&lt;/li&gt;
&lt;li&gt;Resolution - The steps taken to resolve the incident;&lt;/li&gt;
&lt;li&gt;Preventive Actions - Planned or implemented measures to prevent recurrence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The entire process is governed by strict constraints: no guessing, no assumptions, and no speculative content. If something in the original postmortem is unclear or missing, the summary explicitly states that. This keeps the final output accurate and focused, so it can be trusted with a high level of confidence. Additionally, noise such as speculation, redundant phrasing, or tangential commentary is deliberately removed. What remains are the key technical and operational insights, delivered in a readable, structured format. This makes the output especially valuable for engineering leadership, reliability teams, and cross-functional reviews.&lt;/p&gt;
&lt;p&gt;Below is a redacted example of a summary produced by the LLM:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Issue Summary:
On [DATE], between [TIME] and [TIME], a library update deployment
caused a [DURATION] SEV2 incident affecting multiple services.
The deployment upgraded AWS SDK from version 2.20.162 to 2.30.20,
which led to a 5xx error spike and degraded functionality across
[PAGE A], [PAGE B], [PAGE C], and [PAGE D].

Root Causes:
The primary root cause was a missing [CLASS] dependency resulting
from version mismatch between the upgraded AWS SDK (2.30.20)
and the commons [LIBRARY]... Secondary causes included insufficient
integration testing that would have caught the DynamoDB connection
issues and incomplete deployment practices where PRs accumulated
before being fully rolled out.

Impact:
- Customers: [NUM_CUSTOMERS] received inaccurate [PAGE A];
  [NUM_CUSTOMERS] customers unable to view [PAGE B];
  customers experienced non-personalized [PAGE C];
  and unavailable [PAGE D]
- Business: Approximately [GMV] loss
- Markets: All [PAGE B] [MARKETS]
- Partners: [PAGE D] unavailable during incident

Resolution:
The incident was resolved by reverting the faulty deployment.
Detection occurred through P5 alert at [TIME] followed by
P3 alert at [TIME] (high 5xx errors). Root cause was identified
at [TIME], revert initiated at [TIME], and full recovery by [TIME].

Preventive Actions:
- Immediate: Reverted deployment, reduced alert delay,
  pinned AWS SDK version to 2.20
- Follow-up: Implement automated e2e tests for DynamoDB,
  upgrade commons lib AWS SDK version, ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
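&lt;p&gt;Because the summary follows a fixed five-section layout, downstream stages can treat it as structured data. The sketch below shows one way to split such a summary into its sections and to mark absent ones explicitly. The section names come from the list above; the parser, the &lt;code&gt;MISSING&lt;/code&gt; placeholder text and the sample input are assumptions made for illustration.&lt;/p&gt;

```python
SECTIONS = (
    "Issue Summary",
    "Root Causes",
    "Impact",
    "Resolution",
    "Preventive Actions",
)
# Assumed placeholder; the real pipeline's wording is not published.
MISSING = "Not stated in the postmortem."


def parse_summary(text: str) -> dict[str, str]:
    """Split a summary into its five sections, keyed by heading."""
    parsed = {name: [] for name in SECTIONS}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        heading = stripped.rstrip(":")
        if stripped.endswith(":") and heading in parsed:
            current = heading  # a known heading starts a new section
        elif current is not None and stripped:
            parsed[current].append(stripped)
    # Sections that never appeared are reported as missing, not guessed.
    return {name: " ".join(lines) or MISSING for name, lines in parsed.items()}


example = """Issue Summary:
A library update caused a SEV2 incident.

Root Causes:
A missing dependency after the SDK upgrade.
"""
summary = parse_summary(example)
print(summary["Issue Summary"])
print(summary["Impact"])
```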

&lt;h3&gt;Classification&lt;/h3&gt;
&lt;p&gt;This stage systematically identifies whether specific datastore technologies directly contributed to the incident. The model receives a postmortem summary along with a list of the technologies in question. The LLM is prompted to perform the two steps below and to return only the names of technologies with a confirmed direct connection, or “None” if there is no such link:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify any mentions of these technologies within the document;&lt;/li&gt;
&lt;li&gt;Verify whether the mention is explicitly connected to the root cause or impact of the incident.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Surface Attribution Error was an obstacle here. We had to strictly prohibit inference and assumption in the prompt, ensuring that only explicitly stated connections are flagged. The prompt also provides negative examples.&lt;/p&gt;
&lt;p&gt;The resulting classifier works reliably, which lets us scale the analysis to all technologies in the &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Zalando Tech Radar&lt;/a&gt;.&lt;/p&gt;
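&lt;p&gt;Turning the classifier's replies into the per-technology buckets from the table can be done deterministically. The following is a sketch under stated assumptions: the reply is taken to be a comma-separated list of names or “None”, and the function names, technology spellings and postmortem ids are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

# Assumed spellings of the technologies passed to the classifier.
TECHNOLOGIES = ("Postgres", "DynamoDB", "ElastiCache", "S3", "Elasticsearch")


def parse_labels(reply: str) -> list[str]:
    """Keep only known technology names; "None" or unknown text yields []."""
    known = {tech.lower(): tech for tech in TECHNOLOGIES}
    labels = []
    for part in reply.split(","):
        tech = known.get(part.strip().lower())
        if tech is not None and tech not in labels:
            labels.append(tech)
    return labels


def bucketize(replies: dict[str, str]) -> dict[str, list[str]]:
    """Group postmortem ids by the technologies the classifier named."""
    buckets = defaultdict(list)
    for postmortem_id, reply in replies.items():
        for tech in parse_labels(reply):
            buckets[tech].append(postmortem_id)
    return dict(buckets)


replies = {"pm-1": "DynamoDB", "pm-2": "None", "pm-3": "S3, DynamoDB"}
buckets = bucketize(replies)
print(buckets)
```

&lt;p&gt;Validating the reply against the known list means an unexpected free-text answer falls out of every bucket instead of creating a new one.&lt;/p&gt;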
&lt;h3&gt;Analyzer&lt;/h3&gt;
&lt;p&gt;The most crucial part of the incident analysis is the extraction of a short digest of 3 to 5 sentences that highlights (a) the root cause or fault condition involving the technology; (b) the role it played in the overall failure scenario; (c) any contributing factors or interactions that amplified the issue. The output is written for a technical audience. It aims to be precise and readable without access to the full postmortem, so that the critical aspects of each incident can be understood in 30 to 60 seconds.&lt;/p&gt;
&lt;p&gt;Below is a redacted example of a digest produced by the LLM:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;DynamoDB contributed to this incident as the affected data store,
but was not the root cause of the failure. The root cause was a
version incompatibility between an upgraded AWS SDK (2.30.20)
and an older DynamoDB support module (2.17.279) that still
depended on a class removed in the newer SDK version.
This dependency mismatch caused all DynamoDB write operations
to fail with a NoClassDefFoundError, which cascaded to affect
multiple [SERVICES] that relied on DynamoDB for storing [DATA].
DynamoDB itself functioned normally—the issue was entirely due
to the application&amp;#39;s inability to properly connect to and
interact with DynamoDB after the SDK upgrade.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This stage adds critical interpretive value by turning raw incident data into a derivative dataset about technological failures usable for further processing by humans, LLMs or other techniques. For example, it has enabled us to discover common patterns of datastore incidents over these years.&lt;/p&gt;
&lt;h3&gt;Patterns&lt;/h3&gt;
&lt;p&gt;The real value emerges from a single-page description of cross-incident analysis, enabling engineering leadership to grasp recurring patterns, failure modes, and contributing factors comprehensively.&lt;/p&gt;
&lt;p&gt;We feed the entire set of incident digests into the LLM within a single prompt. The prompt explicitly prohibits inference, redundancy, and the inclusion of any information not grounded in the source data. This ensures the resulting output is both precise and actionable. The output is a concise list of common failure themes across the incidents.&lt;/p&gt;
&lt;p&gt;Below is a redacted example of the failure patterns reported by the LLM:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;DynamoDB Capacity and Throttling: Multiple incidents
involved DynamoDB capacity issues, leading to throttling,
latency, and service failures.

Insufficient Testing and Scaling: Lack of adequate
pre-deployment performance testing and insufficient
automated scaling contributed to incidents.

Application Logic Errors: Bugs in application logic,
such as duplicate data creation or inefficient
algorithms, led to database overload and service degradation.

Monitoring and Alerting Gaps: Insufficient monitoring
and overly sensitive or insensitive alerting
thresholds were factors in some incidents.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The resulting patterns serve as a foundation for human analysis, initiating reviews and facilitating the identification of reliability risks, architectural vulnerabilities, or process gaps. This approach helps us stay focused and keep the discussion narrow. Rather than sifting through an extensive volume of raw data, we get a clear direction towards the most critical areas for in-depth investigation.&lt;/p&gt;
&lt;h3&gt;Human curation&lt;/h3&gt;
&lt;p&gt;While the goal of our solution is to reduce human involvement, &lt;strong&gt;human curation remains essential&lt;/strong&gt;. During pipeline development, we conducted 100% human curation of output batches. This involved analyzing the generated postmortem digests and comparing them to the original postmortems. The curation process was purely a labelling task: colleagues upvoted or downvoted the results. This human feedback loop helped us refine prompts and select the best model for each stage. As the system matured, we relaxed human curation to 10-20% of randomly sampled summaries from each output batch. We still rely on human expertise to proofread the final report, applying editorial changes to the summary and incident patterns.&lt;/p&gt;
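&lt;p&gt;The sampling step of the curation loop is easy to make reproducible. The sketch below assumes a batch is a list of item ids and that each vote is a boolean (upvote or downvote); &lt;code&gt;sample_for_review&lt;/code&gt; and &lt;code&gt;downvote_rate&lt;/code&gt; are hypothetical helpers, not part of the described system.&lt;/p&gt;

```python
import random


def sample_for_review(items: list[str], fraction: float, seed: int) -> list[str]:
    """Pick a random share of a batch for human review (at least one item)."""
    if not items:
        return []
    count = max(1, round(len(items) * fraction))
    # A seeded generator makes the sample repeatable for audits.
    return random.Random(seed).sample(items, count)


def downvote_rate(votes: list[bool]) -> float:
    """Share of reviewed items that a curator downvoted (False = downvote)."""
    return votes.count(False) / len(votes) if votes else 0.0


batch = [f"digest-{i}" for i in range(50)]
to_review = sample_for_review(batch, fraction=0.1, seed=7)
rate = downvote_rate([True, True, False, True])
print(len(to_review), rate)
```

&lt;p&gt;Tracking the downvote rate per batch gives a simple signal for when a prompt or model change has degraded output quality and full curation should be reinstated.&lt;/p&gt;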
&lt;h2&gt;Two Years of Data: Key Findings&lt;/h2&gt;
&lt;p&gt;Two years of data analysis reveal that recurring patterns are primarily related to how these datastore technologies are being used. &lt;strong&gt;Configuration &amp;amp; deployment, as well as capacity &amp;amp; scaling, are the primary reasons for datastore incidents&lt;/strong&gt;. Below, we highlight two case studies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AWS S3 incidents&lt;/strong&gt;: consistently tied to misconfigurations in the deployment artifacts preventing applications from accessing S3 buckets, often due to manual errors or untested changes. This insight directly led to automated change validation for infrastructure as code, which is able to shield us from 25% of subsequent datastore incidents, demonstrating a clear return on investment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AWS ElastiCache incidents&lt;/strong&gt;: a consistent trend of 80% CPU utilization causing elevated latency at peak traffic. This AI-driven insight led us to develop a strategic direction for capacity planning, instance type selection and traffic management for AWS ElastiCache.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We have established &lt;strong&gt;a comprehensive understanding of failure patterns within our datastores&lt;/strong&gt; through two years of incident analysis. So far the most recurring incident patterns are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;absence of automated change validation for configuration and infrastructure as code, and poor visibility into changes and their effects;&lt;/li&gt;
&lt;li&gt;inconsistent or ad-hoc change management practices including manual intervention;&lt;/li&gt;
&lt;li&gt;absence of progressive delivery with datastores (e.g., canary or blue-green);&lt;/li&gt;
&lt;li&gt;underestimating the traffic pattern;&lt;/li&gt;
&lt;li&gt;failing to scale ahead of demand or delayed auto-scale responses;&lt;/li&gt;
&lt;li&gt;bottlenecks due to memory, CPU, or IOPS constraints.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our datastore portfolio is mature and resilient, with incidents very rarely directly attributed to technological flaws. In the past 5 years, we encountered problems with &lt;a href="https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html"&gt;JDBC drivers&lt;/a&gt; and had incidents related to two known PostgreSQL bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The incident was caused by a crash in the AUTOVACUUM LAUNCHER process due to a race condition, which in turn terminated all connections in the PostgreSQL database pool. This crash was attributed to &lt;a href="https://www.postgresql.org/message-id/flat/15640-58e01e10b362cc7f%40postgresql.org"&gt;a known bug in PostgreSQL 12&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A major version upgrade of the Postgres database from version 16 to 17, which triggered &lt;a href="https://www.postgresql.org/message-id/flat/680bdaf6-f7d1-4536-b580-05c2760c67c6%40deepbluecap.com"&gt;a bug in Postgres' logical replication&lt;/a&gt;. It occurs when DDL commands are executed in parallel with a large number of transactions, leading to a memory leak.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The AI analysis significantly &lt;strong&gt;reduced the time for analysis from days to hours&lt;/strong&gt; and made the solution &lt;strong&gt;scalable across multiple technological areas&lt;/strong&gt;. It also &lt;strong&gt;surfaced 'hidden hotspots'&lt;/strong&gt; in areas previously considered stable, such as improper configuration of connection pools or circuit breakers leading to cascading failures.&lt;/p&gt;
&lt;h2&gt;Dead Ends: Where AI Fell Short&lt;/h2&gt;
&lt;p&gt;The incident analysis pipeline has gone through a few evolutions, utilizing various models and hosting solutions. Initially, we employed open source models hosted within LM Studio. Subsequently, we evaluated different models, and the current iteration is powered by Claude Sonnet 4 on AWS Bedrock. This evolution was primarily driven by compliance topics rather than technical necessity. Postmortem documents contain PII of on-call responders, company business metrics, GMV losses, etc. Legal alignment was a precondition for using cloud-hosted LLMs (e.g. AWS Bedrock). Within each of these environments, &lt;strong&gt;Hallucination, Surface Attribution Error and Latency are three key obstacles&lt;/strong&gt; affecting the pipeline and the quality of the analysis.&lt;/p&gt;
&lt;p&gt;The earlier prototypes were built with small models of 3B to 12B parameters. We observed &lt;strong&gt;up to 40% probability of hallucination&lt;/strong&gt; in the summary and analysis phases: the model produced text that sounded plausible but was factually incorrect. In one case, a small model fabricated a plausible summary of a non-existent DynamoDB incident, solely because DynamoDB was mentioned in the title of a playbook linked to the postmortem. To address this, we experimented with various prompting strategies, emphasizing strict requirements and clearly articulating expectations with examples. We then conducted human-led curation until the hallucination rate fell below 15%. The effort spent hardening prompts paid off when we moved to a larger model, where hallucinations became negligible. This was crucial for enabling the strategic insights discussed earlier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surface Attribution Error is dominant in almost every stage of the pipeline&lt;/strong&gt;. The model makes decisions based on surface-level clues rather than deeper meaning or causality. It is biased toward prominent keywords instead of reasoning through context to identify the actual causal factor. For instance, it could offer a well-structured and authoritative explanation of how AWS S3 contributed to an incident, even if "S3" is merely mentioned without being causally linked. Although negative prompting was employed to mitigate the issue, it has not been entirely resolved; we still observe &lt;strong&gt;approximately 10% attribution errors, even with advanced models&lt;/strong&gt; such as Claude Sonnet 4.&lt;/p&gt;
&lt;p&gt;These issues were the primary reasons for skepticism about, and reluctance to accept, the results when we saw the first version of the report. By ensuring each stage's input and output were human-readable and subject to curation, we fostered trust and demonstrated the AI's role as an assistant able to produce high-quality output. The pivotal role of digests allowed humans to observe all incidents as a whole and to precisely validate and curate the reports produced by LLMs.&lt;/p&gt;
&lt;p&gt;Surface Attribution Error often accompanies overfitting, since both involve relying on superficial patterns from past data rather than deeper, more reliable signals. General purpose LLMs are trained on publicly available data, and struggle to identify emerging failure patterns that haven't been seen before or to deal properly with Zalando proprietary technology. Given that the datastore analysis focused exclusively on public technologies, the overfitting effect was negligible. Currently, we rely on &lt;strong&gt;human editorial work for the final report to address any novel failure modes that AI may have overlooked&lt;/strong&gt;. An observable instance of this issue is the unacceptable quality of analysis for incidents concerning Zalando internal technologies (e.g. Skipper). Remediation of this and similar issues requires model fine-tuning.&lt;/p&gt;
&lt;p&gt;Failing fast and iterating rapidly were essential for us during the pipeline development. Given the volume of our documents, we concluded that the overall processing time per document should not exceed 120 seconds; otherwise, processing a year of data becomes impractically long. Initial releases used an open source model with 27B parameters, which was the most time-consuming phase in the pipeline, typically requiring 90 to 120 seconds to complete and leaving no room to chain multiple stages. The “map-fold” architecture described earlier was released with multiple models (3B, 12B and 27B), requiring about 20 seconds per document to classify and 60 seconds per incident to analyze. This enabled the &lt;strong&gt;processing of a year of data in under 24 hours&lt;/strong&gt;. The most recent release, based on Claude Sonnet 4, processes each postmortem in approximately 30 seconds, offering immediate analytical opportunities.&lt;/p&gt;
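&lt;p&gt;To see why the per-document budget matters, it helps to convert latency into daily throughput. The calculation below is a back-of-the-envelope sketch: it assumes strictly sequential processing on a single worker and that classification and analysis run back to back on every document, neither of which is stated above. The latencies are the ones quoted in the text; &lt;code&gt;docs_per_day&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_day(seconds_per_doc: int, workers: int = 1) -> int:
    """Documents one day can absorb at a fixed per-document latency."""
    return (SECONDS_PER_DAY * workers) // seconds_per_doc


budget = docs_per_day(120)       # the 120-second upper bound
chained = docs_per_day(20 + 60)  # assumed: classify, then analyze, per document
latest = docs_per_day(30)        # most recent release
print(budget, chained, latest)
```

&lt;p&gt;Under these assumptions, the 120-second ceiling corresponds to 720 documents per day, the chained small models to 1,080, and the 30-second release to 2,880, which is consistent with a year of postmortems fitting into a single day.&lt;/p&gt;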
&lt;p&gt;The initial concept of a no-code agentic solution was quickly deemed unfeasible due to performance limitations, inaccuracies, and hallucinations encountered during prototype development. We have opted for a hybrid solution where the input and output of each stage are amenable to human evaluation, thereby enhancing confidence in accuracy.&lt;/p&gt;
&lt;p&gt;Reliable accuracy in extracting numerical data, such as GMV or EBIT loss, affected customers, and repair time, from postmortems was not achieved. Consequently, we depend on our internal incident dataset that serves as a trustworthy source of truth for opportunity analysis.&lt;/p&gt;
&lt;h2&gt;Takeaways and Recommendations&lt;/h2&gt;
&lt;p&gt;The discussed solution has addressed our core problem – the &lt;strong&gt;inability of manual postmortem review to keep pace with the large volume of incidents&lt;/strong&gt;, to identify systemic issues, and to support data-driven investments that prevent recurring failures. Our experience is in line with industry insights about AI:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AI's transformative potential&lt;/strong&gt;: LLMs can effectively turn a vast corpus of human-authored postmortems into a dynamic, decision-making dataset, surfacing patterns and systemic issues that are impossible to identify manually at scale. Hallucination and Surface Attribution Error were significant obstacles initially, but could be largely mitigated through strict prompting strategies, negative prompting, and human curation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-stage pipeline effectiveness&lt;/strong&gt;: A multi-stage LLM pipeline, where each stage specializes in a single objective (summarization, classification, analysis, patterns), proved more effective and reliable than using single high-end LLMs with large context windows, mitigating issues like "lost in the middle" and improving accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human-in-the-loop is crucial&lt;/strong&gt;: Despite automation, human curation, examination, labeling, and quality control at each stage, especially the "digests," are essential for refining prompts, ensuring accuracy, fostering trust, and addressing novel failure modes that AI might overlook.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the SRE-AI partnership evolves, our takeaways and recommendations are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Start small and iterate&lt;/strong&gt;: Begin with focused use cases and embrace rapid iterations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prioritize prompt engineering&lt;/strong&gt;: Invest time in crafting precise and constrained prompts to minimize hallucinations and surface attribution errors. Design your solution with evolvability in mind and ship your pipelines along with golden datasets for testing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design for human interpretability&lt;/strong&gt;: Ensure intermediate outputs are human-readable to facilitate trust and validation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, Zalando's experience demonstrates that AI, when implemented thoughtfully with a human-in-the-loop approach, can transform postmortems from mere "dead ends" into invaluable "data goldmines," providing strategic insights to drive targeted reliability investments and cultivate a more intelligent infrastructure.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Dead ends or goldmines? By transforming thousands of incident reports into a dynamic, decision-making dataset, we've shown that every system outage can indeed make our infrastructure smarter. &lt;strong&gt;AI-powered pipelines speed up turning postmortems into predictive signals for a reliable future&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We trust that this discussion has provided valuable insights into fine-tuning AI for nuanced interpretation, supporting decision-makers, and offering practical advice on automating postmortem analysis to enhance system reliability for the benefit of your customers.&lt;/p&gt;
&lt;p&gt;Your incidents hold the blueprint to your most strategic infrastructure wins - if you are listening correctly.&lt;/p&gt;</content><category term="Zalando"/><category term="Artificial Intelligence"/><category term="SRE"/><category term="Backend"/><category term="Machine Learning"/></entry><entry><title>Direct Data Sharing using Delta Sharing - Introduction: Our Journey to Empower Partners at Zalando</title><link href="https://engineering.zalando.com/posts/2025/07/direct-data-sharing-using-delta-sharing.html" rel="alternate"/><published>2025-07-08T00:00:00+02:00</published><updated>2025-07-08T00:00:00+02:00</updated><author><name>Lokeshbabu Radhakrishnan</name></author><id>tag:engineering.zalando.com,2025-07-08:/posts/2025/07/direct-data-sharing-using-delta-sharing.html</id><summary type="html">&lt;p&gt;In this post, we explain how we transformed fragmented partner data sharing at Zalando by implementing Delta Sharing, evolving from a pilot solution to an organization-wide platform that enables real-time, secure data access across our partner ecosystem.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;The Challenge That Started It All&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Picture this:&lt;/strong&gt; You're a partner working with Zalando, trying to understand how your products are performing on one of Europe's largest fashion platforms. You need insights to make strategic decisions about inventory, pricing, and assortment planning. But instead of getting seamless access to the data you need, you find yourself juggling multiple systems, formats, and manual processes just to piece together a coherent view of your business performance.
This was the reality our partners faced, and it was a problem we couldn't ignore.&lt;/p&gt;
&lt;p&gt;At Zalando's Partner Tech division within our Data Foundation pillar, we share data &amp;amp; insights with partners across three distinct business models to help them steer their business:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Traditional &lt;strong&gt;wholesale&lt;/strong&gt; relationships where Zalando purchases and resells products,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partner Program&lt;/strong&gt; enabling direct-to-consumer sales, and&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connected Retail&lt;/strong&gt; linking brick-and-mortar stores to our platform.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these partnerships generates valuable data, but getting that data into partners' hands in a useful format had become a significant challenge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This introduction article will cover the following parts&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Brief overview of the problem statement&lt;/li&gt;
&lt;li&gt;Brief overview of existing solutions and partner needs&lt;/li&gt;
&lt;li&gt;Journey of identifying a potential solution&lt;/li&gt;
&lt;li&gt;Why we chose Delta Sharing&lt;/li&gt;
&lt;li&gt;From pilot to platform&lt;/li&gt;
&lt;li&gt;Lessons learned&lt;/li&gt;
&lt;li&gt;Looking ahead&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;This article will not cover&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An in-depth explanation of Delta Sharing's technical architecture&lt;/li&gt;
&lt;li&gt;Databricks and Unity Catalog capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Wake-Up Call: Understanding the Real Impact&lt;/h2&gt;
&lt;p&gt;Our journey began with what seemed like routine partner interviews, but the conversations quickly revealed a sobering reality. Through months of discussions, we identified critical pain points undermining our partner relationships:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Fragmented Data Landscape&lt;/strong&gt; forced partners to juggle SFTP transfers, CSV downloads, self-service reports, and API calls. Each method served a purpose, but together they created a complex web requiring expertise across multiple systems just to get a complete business view.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manual Data Processing&lt;/strong&gt; had become a hidden tax: partners were allocating 1.5 FTE per month solely for data extraction and consolidation. Strategic talent was stuck wrestling with data formats instead of analyzing trends and making business decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limited Data Accessibility&lt;/strong&gt; meant our UIs weren't designed for heavy data downloads that sophisticated partners needed. Time restrictions on data availability often blocked access to historical information during critical planning cycles.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partners sought programmatic access to analytical-ready data&lt;/strong&gt; to integrate seamlessly with their existing analytics infrastructure. While we provided APIs for operational data, partners with sophisticated analytics capabilities needed direct access to processed, analytical-ready datasets for strategic analysis.&lt;/p&gt;
&lt;p&gt;Beyond addressing these pain points, we recognized a significant opportunity. As the owner of a vast volume of commercial data across Europe's fashion ecosystem, Zalando is uniquely positioned to unlock our partners' full potential. Rather than simply fixing data access issues, we could transform how partners leverage insights to grow their businesses and strengthen our collaborative relationships.&lt;/p&gt;
&lt;h2&gt;Mapping the Partner Landscape&lt;/h2&gt;
&lt;p&gt;As we dug deeper, we realized that our "one-size-fits-all" approach wasn't serving anyone well. Our partner ecosystem spans &lt;strong&gt;thousands of active partners&lt;/strong&gt;, from small retailers managing a few hundred SKUs to major brands with catalogs exceeding tens of thousands of products. Data volumes vary dramatically: some partners work with megabytes of weekly sales data while others require terabyte-scale historical datasets for strategic planning. In total, we manage &lt;strong&gt;200+ datasets&lt;/strong&gt; with sizes ranging up to &lt;strong&gt;200TB&lt;/strong&gt;, and the usage of these data assets helps steer our &lt;strong&gt;&amp;gt;€5 billion GMV&lt;/strong&gt; commercial partner platform business.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mapping partner landscape" src="https://engineering.zalando.com/posts/2025/07/images/partner-landscape-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Large partners&lt;/strong&gt; operated like well-oiled machines, seeking programmatic access through secure, automated pipelines. They had the technical sophistication to handle complex integrations but needed the data to flow seamlessly into their existing analytics infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Medium-sized partners&lt;/strong&gt; lived in a hybrid world, comfortable with dashboards and periodic data pulls but not necessarily equipped for real-time streaming solutions. They needed flexibility without overwhelming complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Small partners&lt;/strong&gt; often relied on familiar tools like spreadsheets and required ad-hoc access to specific datasets. For them, simplicity and accessibility trumped technical sophistication.&lt;/p&gt;
&lt;p&gt;Meanwhile, the data requirements were equally diverse. Some partners craved &lt;strong&gt;real-time insights&lt;/strong&gt; to react quickly to market changes, while others needed comprehensive &lt;strong&gt;historical datasets&lt;/strong&gt; for long-term trend analysis. Some required &lt;strong&gt;incremental updates&lt;/strong&gt; to keep their systems synchronized, while others preferred &lt;strong&gt;batch processing&lt;/strong&gt; aligned with their internal workflows.
Our existing solutions (APIs, SFTP, S3 buckets, and email) each addressed some of these needs, but none provided a comprehensive answer. We were solving point problems while missing the bigger picture.&lt;/p&gt;
&lt;h2&gt;The Quest for a Better Solution&lt;/h2&gt;
&lt;p&gt;Armed with this understanding, we embarked on a systematic search for a solution that could address our partners' diverse analytical needs without creating yet another siloed system. We knew we needed something that would stand the test of time and scale with our growing partner ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Solution Criteria" src="https://engineering.zalando.com/posts/2025/07/images/solution-criteria-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Our evaluation criteria were ambitious but necessary. The solution needed to align with &lt;strong&gt;Zalando's broader data strategy&lt;/strong&gt; while being &lt;strong&gt;cloud-agnostic&lt;/strong&gt; enough to work with partners' varied infrastructure. It had to support the full spectrum of &lt;strong&gt;partner ecosystems&lt;/strong&gt;, from small businesses running on spreadsheets to enterprise operations with sophisticated data pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance and scalability&lt;/strong&gt; were non-negotiable: we needed to handle &lt;strong&gt;terabyte-scale&lt;/strong&gt; datasets efficiently. &lt;strong&gt;Security&lt;/strong&gt; couldn't be an afterthought; we required granular access controls, data encryption, and comprehensive auditing capabilities. The solution also needed to support the full range of &lt;strong&gt;data access patterns&lt;/strong&gt; our partners required: &lt;strong&gt;real-time streaming&lt;/strong&gt;, &lt;strong&gt;batch updates&lt;/strong&gt;, &lt;strong&gt;incremental and delta changes&lt;/strong&gt;, and &lt;strong&gt;historical analysis&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Perhaps most importantly, we needed something that wouldn't lock us into a corner. The solution had to be extensible and &lt;strong&gt;compatible with open tools&lt;/strong&gt;, ensuring our partners could integrate it with their existing workflows rather than forcing them to adopt entirely new processes.&lt;/p&gt;
&lt;h2&gt;Discovering Delta Sharing: The Game Changer&lt;/h2&gt;
&lt;p&gt;Our research led us to &lt;a href="https://github.com/delta-io/delta-sharing"&gt;Delta Sharing&lt;/a&gt;, and the more we learned, the more excited we became. Here was an open protocol specifically designed for secure data sharing across organizations, exactly what we needed. But it wasn't just the technical capabilities that caught our attention; it was the philosophy behind it.&lt;/p&gt;
&lt;p&gt;Delta Sharing promised zero-copy access to data, meaning partners could work with live datasets without the overhead of constant data transfers. It supported access through programmatic interfaces, business intelligence tools, and yes, even spreadsheets, covering all our partner segments in one solution. The protocol could handle massive datasets efficiently while maintaining security through design, not as an add-on feature.&lt;/p&gt;
&lt;p&gt;When we discovered Databricks' managed Delta Sharing service, the decision became clear. While we appreciated the open-source nature of the protocol, the managed service offered something invaluable: the operational excellence we needed for a production system serving critical partner relationships.&lt;/p&gt;
&lt;p&gt;The managed solution provided robust governance through Unity Catalog integration, built-in security features, comprehensive audit logging, and most importantly, it freed our team from the operational overhead of maintaining yet another infrastructure component. We could focus on delivering value to partners rather than troubleshooting servers.&lt;/p&gt;
&lt;p&gt;The architecture was elegantly simple yet powerful. Partners could access shared data through &lt;a href="https://docs.databricks.com/aws/en/delta-sharing/create-recipient-token"&gt;token-based&lt;/a&gt; authentication (the method we support in the initial phases) combined with credential files, providing security without complexity. The system supported both open sharing for all partners and Databricks-to-Databricks sharing for partners who already use Databricks in their data landscape, giving us flexibility as our needs evolved.&lt;/p&gt;
&lt;h2&gt;Taking the First Steps: Our Proof of Concept&lt;/h2&gt;
&lt;p&gt;Being the first team at Zalando to implement Delta Sharing meant we were venturing into uncharted territory. We approached this with the methodical mindset that had served us well in identifying the problem: careful testing, thorough evaluation, and honest assessment of limitations.&lt;/p&gt;
&lt;p&gt;However, we didn't tackle this challenge alone. Success required close collaboration with key stakeholders across Zalando's technical organization. Our central &lt;strong&gt;Data Foundation&lt;/strong&gt; team provided crucial guidance on Unity Catalog integration and governance frameworks, helping us understand how Delta Sharing would fit into Zalando's broader data architecture. Their expertise proved invaluable in navigating the complexities of our existing data infrastructure.&lt;/p&gt;
&lt;p&gt;Equally important was our partnership with the &lt;strong&gt;AppSec&lt;/strong&gt; and &lt;strong&gt;IAM&lt;/strong&gt; teams. Given that we were essentially creating new pathways for external data access, security considerations were paramount. The teams helped us evaluate authentication mechanisms, assess potential attack vectors, and ensure our implementation met Zalando's stringent security, authentication, and identity standards from the ground up.&lt;/p&gt;
&lt;p&gt;We conducted a comprehensive proof of concept to understand both the capabilities and constraints of Delta Sharing in our specific environment. This collaborative approach allowed us to identify critical limitations early and develop mitigation strategies.&lt;/p&gt;
&lt;p&gt;Our POC revealed both the promise and the practical challenges of implementation. The integration with Unity Catalog, while powerful, introduced operational complexities around permissions and access management that required careful coordination with our Data Foundation colleagues. The lack of self-service APIs for token management meant we initially had to handle partner onboarding manually, which is not ideal for scale but was manageable for our pilot phase with AppSec's guidance on secure token distribution.&lt;/p&gt;
&lt;p&gt;These discoveries didn't discourage us; they informed our implementation strategy and strengthened our cross-team relationships. Every pioneering effort encounters obstacles, and having the right collaborative framework allowed us to turn these challenges into learning opportunities that would benefit future implementations across Zalando.&lt;/p&gt;
&lt;h2&gt;Simplified Architecture: How It Works at Zalando&lt;/h2&gt;
&lt;p&gt;With our proof of concept validated, we moved forward with a streamlined architecture that demonstrates the core principles of &lt;strong&gt;Delta Sharing&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Simplified Architecture" src="https://engineering.zalando.com/posts/2025/07/images/simplified-architecture-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Understanding Delta Sharing Terminology:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delta Share&lt;/strong&gt;: A logical container that groups related tables for secure distribution to external recipients&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recipient&lt;/strong&gt;: A digital identity representing each partner in our Delta Sharing system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Activation Link&lt;/strong&gt;: A secure URL that allows partners to download their authentication credentials&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Data Preparation and Centralization&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We prepare datasets based on partner needs and store them in a scalable storage system. These are then cataloged in a central metadata and governance layer, which ensures consistency, control, and acts as a single source of truth.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Access Configuration&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We create access points (recipients) for each partner and assign the appropriate permissions. These access points act as logical groupings for related data, allowing for secure and organized distribution. Each access point generates a unique activation link, which is then securely provided to the respective partner.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Direct Data Access&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When partners receive their activation link, they use it to establish a secure connection to the data distribution system. Once authenticated, partners can make direct requests to access the underlying data.&lt;/p&gt;
&lt;p&gt;This approach delivers several key benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partners get direct access to live data without the overhead of data copying.&lt;/li&gt;
&lt;li&gt;The authentication mechanism ensures security through time-limited, partner-specific access credentials.&lt;/li&gt;
&lt;li&gt;And because the data remains in its original location, we avoid the storage duplication and ongoing synchronization challenges.&lt;/li&gt;
&lt;/ul&gt;
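&lt;p&gt;Because the credentials are time-limited, a partner-side job may want to verify that its credential file is still valid before requesting data. The following is a minimal sketch, assuming the credential file follows the profile format of the open Delta Sharing protocol (a JSON document with &lt;code&gt;endpoint&lt;/code&gt;, &lt;code&gt;bearerToken&lt;/code&gt;, and an optional &lt;code&gt;expirationTime&lt;/code&gt;). The helper name and all sample values are hypothetical.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone
from typing import Optional


def is_profile_valid(profile_json: str, now: Optional[datetime] = None) -> bool:
    """Return True if the profile has no expiry or expires after `now`."""
    profile = json.loads(profile_json)
    expiration = profile.get("expirationTime")
    if expiration is None:
        return True  # no expiry recorded in the profile
    # Normalise a trailing "Z" so datetime.fromisoformat accepts it on older
    # Python versions. Other timestamp variants may need a more tolerant parser.
    expires_at = datetime.fromisoformat(expiration.replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    return expires_at > current


# Sample profile with placeholder values (not real credentials).
SAMPLE_PROFILE = json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "REDACTED",
    "expirationTime": "2030-01-01T00:00:00Z",
})

if __name__ == "__main__":
    print(is_profile_valid(SAMPLE_PROFILE))
```

&lt;p&gt;A check like this only inspects the local file; the sharing server remains the authority on whether a token is accepted.&lt;/p&gt;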
&lt;p&gt;&lt;strong&gt;Implementation steps: Simplified&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Implementation View" src="https://engineering.zalando.com/posts/2025/07/images/implementation-view-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;The typical steps involved in sharing datasets externally:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;1. Prepare the final datasets (delta tables) via Unity Catalog.
2. Create a &amp;#39;Share&amp;#39; (logical container).
3. Add the &amp;#39;delta tables&amp;#39; to the &amp;#39;Share&amp;#39;.
4. Create a recipient for each partner.
5. Grant permissions to the recipient for accessing the share.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
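&lt;p&gt;On Databricks, steps 2 to 5 map onto Unity Catalog SQL statements. The sketch below only assembles those statements as strings so that the order is explicit. The share, table, and recipient names are hypothetical, and the statements would still need to be executed in a Databricks SQL session by a user with the required privileges.&lt;/p&gt;

```python
from typing import List


def share_setup_statements(share: str, tables: List[str], recipient: str) -> List[str]:
    """Build the Databricks SQL statements that share tables with one recipient.

    Identifiers are inserted as-is, so only pass trusted, validated names.
    """
    # Step 2: create the share (logical container).
    statements = [f"CREATE SHARE IF NOT EXISTS {share}"]
    # Step 3: add each Delta table to the share.
    statements += [f"ALTER SHARE {share} ADD TABLE {table}" for table in tables]
    # Step 4: create the recipient that represents the partner.
    statements.append(f"CREATE RECIPIENT IF NOT EXISTS {recipient}")
    # Step 5: grant the recipient read access to the share.
    statements.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}")
    return statements


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for statement in share_setup_statements(
        "partner_sales", ["main.partner.weekly_sales"], "partner_acme"
    ):
        print(statement + ";")
```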

&lt;p&gt;&lt;a href="https://www.databricks.com/product/delta-sharing"&gt;Databricks&lt;/a&gt; provides extensive documentation covering the technical details of Delta Sharing and the APIs for building solutions on top of it.&lt;/p&gt;
&lt;h2&gt;Bridging the Gap: Making Partner Adoption Seamless&lt;/h2&gt;
&lt;p&gt;Building an elegant technical solution was only half the challenge; the other half was ensuring our partners could actually use it effectively. We developed &lt;strong&gt;comprehensive user guides with step-by-step instructions&lt;/strong&gt; for accessing shared data through familiar tools like Pandas and Apache Spark.&lt;/p&gt;
&lt;p&gt;The guides included practical examples and troubleshooting scenarios, enabling partners to go from receiving their activation link to pulling their first dataset in minutes. By providing clear documentation for Delta Sharing connector APIs, partners could integrate our data directly into their existing analytics pipelines without disrupting established workflows.&lt;/p&gt;
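&lt;p&gt;As an illustration of what such a first data pull can look like, here is a minimal sketch using the open-source &lt;code&gt;delta-sharing&lt;/code&gt; Python connector. The profile path and the share, schema, and table names are hypothetical placeholders, and the connector must be installed separately (&lt;code&gt;pip install delta-sharing&lt;/code&gt;).&lt;/p&gt;

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the 'profile#share.schema.table' address used by the connector."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_table(profile_path: str, share: str, schema: str, table: str):
    """Load a shared table into a pandas DataFrame."""
    # Imported lazily so table_url() works even without the connector installed.
    import delta_sharing

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


if __name__ == "__main__":
    # Hypothetical names; replace them with the values from your own share.
    print(table_url("config.share", "partner_sales", "reporting", "weekly_sales"))
```

&lt;p&gt;Spark users can pass the same address to the connector's &lt;code&gt;load_as_spark&lt;/code&gt; function instead.&lt;/p&gt;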
&lt;h2&gt;From Pilot to Platform: The Ripple Effect&lt;/h2&gt;
&lt;p&gt;Word of our Delta Sharing pilot began spreading through Zalando's internal networks, generating inquiries from teams across the organization. Other departments working with partners started reaching out, recognizing the potential for their own data sharing challenges.&lt;/p&gt;
&lt;p&gt;This interest validated our approach and presented an opportunity to avoid fragmentation. Rather than having each team build their own implementation, we collaborated to evolve our solution into a comprehensive platform for recipient management across Zalando.&lt;/p&gt;
&lt;h2&gt;Building the Platform: From Solution to Service&lt;/h2&gt;
&lt;p&gt;This realization sparked our next evolution: transforming our pilot into a comprehensive platform for recipient management across Zalando. Instead of being a single-use solution for Partner Tech, we're building the infrastructure that will enable any team at Zalando to implement secure, scalable data sharing through Delta Sharing.&lt;/p&gt;
&lt;p&gt;We're not just building technology; we're building expertise. Our platform includes comprehensive guidance for teams preparing their datasets, ensuring they align with platform expectations and can scale effectively. We're codifying the lessons we learned during our proof of concept and pilot phases, transforming our hard-won knowledge into reusable best practices.&lt;/p&gt;
&lt;p&gt;As we scale beyond our initial partner use case, we're looking into making data access for partners more efficient by exploring Databricks &lt;a href="https://docs.databricks.com/aws/en/delta-sharing/create-recipient-oidc-fed"&gt;OIDC federation&lt;/a&gt; capabilities. This would allow some partners to directly access their data, protected by their own identity infrastructure and without generating an intermediate token.&lt;/p&gt;
&lt;h2&gt;The Challenges of Scale&lt;/h2&gt;
&lt;p&gt;Scaling from a single-team pilot to an organization-wide platform brings its own set of challenges. We're not just multiplying our current solution; we're reimagining it for diverse use cases we haven't fully explored yet. Different teams have different data governance requirements, varying security constraints, and unique integration needs.&lt;/p&gt;
&lt;p&gt;The technical architecture that worked for our Partner Tech use case needs to be flexible enough to accommodate everything from real-time operational data sharing to periodic analytical exports. We're essentially building a data-sharing platform that can serve as the foundation for multiple teams while maintaining the performance, security, and reliability standards each team requires.&lt;/p&gt;
&lt;p&gt;This expansion also means deeper collaboration with Zalando's data governance frameworks. As more teams adopt Delta Sharing through our platform, we need to ensure consistent policies around data access, audit trails, and compliance reporting. The platform needs to be sophisticated enough to handle complex multi-tenant scenarios while remaining simple enough that teams can adopt it without extensive training.&lt;/p&gt;
&lt;h2&gt;Lessons Learned: Key Insights of the Delta Sharing Journey&lt;/h2&gt;
&lt;p&gt;Our transformation from fragmented data sharing to a unified Delta Sharing platform taught us valuable lessons that extend beyond technical implementation.&lt;/p&gt;
&lt;h4&gt;Start with Deep Partner Understanding, Not Technology&lt;/h4&gt;
&lt;p&gt;Our biggest revelation was that the technology choice wasn't the starting point; it was the outcome of truly understanding our partners' pain points. The months we spent in partner interviews weren't just research; they were the foundation of everything that followed. The 1.5 FTE per month that partners were spending on manual data processing represented strategic talent being wasted on operational tasks.&lt;/p&gt;
&lt;h4&gt;One Size Doesn't Fit All, and That's Okay&lt;/h4&gt;
&lt;p&gt;We needed one platform that could serve different partner segments in different ways. Large partners needed programmatic access, medium partners wanted flexibility, and small partners required simplicity.&lt;/p&gt;
&lt;h4&gt;Cross-Team Collaboration Is Non-Negotiable&lt;/h4&gt;
&lt;p&gt;Being the first team at Zalando to implement Delta Sharing taught us that pioneering new technology requires strong partnerships across the organization. Our success depended entirely on collaboration with the central Data Foundation team for Unity Catalog expertise, the AppSec team for security guidance, and the IAM team for identity and authentication guidance.&lt;/p&gt;
&lt;h4&gt;Manual Processes Are Okay for Pilots, But Plan for Scale&lt;/h4&gt;
&lt;p&gt;Our initial manual token management approach worked fine for our pilot phase, but we quickly realized it would become a bottleneck as we scaled. We treated this as a learning opportunity that informed our platform development priorities: every manual step in our pilot became a feature requirement for our platform.&lt;/p&gt;
&lt;h4&gt;Internal Demand Validates External Value&lt;/h4&gt;
&lt;p&gt;The unexpected internal interest in our Delta Sharing platform was one of our most important validation signals. When teams across Zalando started asking how they could leverage similar capabilities, we knew we had built something with broader applicability than our original scope.&lt;/p&gt;
&lt;h4&gt;Security and Governance Can't Be Afterthoughts&lt;/h4&gt;
&lt;p&gt;Working with the AppSec and IAM teams from the beginning taught us that security considerations need to be embedded in the architecture from day one. The time we invested in understanding authentication mechanisms and access controls upfront saved us from significant refactoring later.&lt;/p&gt;
&lt;h4&gt;Documentation Is a Product Feature&lt;/h4&gt;
&lt;p&gt;Our comprehensive user guides weren't just nice-to-have documentation; they were critical product features that determined adoption success. Partners needed to go from activation link to first data pull in minutes, not hours.&lt;/p&gt;
&lt;h4&gt;Operational Excellence Matters More Than Perfect Technology&lt;/h4&gt;
&lt;p&gt;Our decision to use Databricks' managed Delta Sharing service rather than building our own implementation reflected a crucial lesson: operational excellence often trumps technical purity. The managed service freed us to focus on partner value rather than infrastructure maintenance.&lt;/p&gt;
&lt;h2&gt;Looking Ahead: The Future of Partner Data at Zalando&lt;/h2&gt;
&lt;p&gt;Our journey from solving a specific partner data problem to building an organization-wide data-sharing platform illustrates something important about innovation: the best solutions often have applications far beyond their original scope. What began as a focused effort to reduce partner frustration with fragmented data access has evolved into a cornerstone of Zalando's data-sharing strategy.&lt;/p&gt;
&lt;p&gt;As we continue building this platform, we're guided by the same principle that led us to Delta Sharing in the first place: deep understanding of user needs. Whether those users are external partners trying to optimize their product performance or internal teams seeking to collaborate more effectively, the fundamental challenge remains the same: getting the right data to the right people at the right time with the right level of security.&lt;/p&gt;
&lt;p&gt;The shift from fragmented, manual data processes to seamless, real-time data sharing represents more than a technical upgrade; it's a fundamental change in how we enable data-driven decision making across our entire ecosystem. By reducing friction in data access, we're not just improving operational efficiency; we're creating new possibilities for insight and collaboration that didn't exist before.&lt;/p&gt;
&lt;p&gt;Our commitment to continuous evolution means this story is far from over. As we gather feedback from the growing community of internal users and external partners, we'll continue iterating on both the technology and the processes around it. The future of data sharing at Zalando isn't just about better technology, it's about better partnerships, more informed decisions, and ultimately, better experiences for the millions of customers who rely on our platform every day.&lt;/p&gt;</content><category term="Zalando"/><category term="Data"/><category term="Big Data"/></entry><entry><title>Building a dynamic inventory optimisation system: A deep dive</title><link href="https://engineering.zalando.com/posts/2025/06/inventory-optimisation-system.html" rel="alternate"/><published>2025-06-30T00:00:00+02:00</published><updated>2025-06-30T00:00:00+02:00</updated><author><name>Alva Presbitero</name></author><id>tag:engineering.zalando.com,2025-06-30:/posts/2025/06/inventory-optimisation-system.html</id><summary type="html">&lt;p&gt;This technical blog outlines how we built a scalable inventory optimization system to help partners maintain a profitable inventory.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In e-commerce, optimising replenishments is a crucial inventory problem. This involves solving three sub-tasks: &lt;strong&gt;&lt;em&gt;What&lt;/em&gt;&lt;/strong&gt; articles should be in stock? &lt;strong&gt;&lt;em&gt;When&lt;/em&gt;&lt;/strong&gt; should they be replenished? &lt;strong&gt;&lt;em&gt;Where&lt;/em&gt;&lt;/strong&gt; should the inventory be optimally allocated in the network of warehouses?&lt;/p&gt;
&lt;p&gt;Moreover, most e-commerce supply chains involve complex environments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vast catalogue&lt;/strong&gt;: up to millions of articles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-echelon network&lt;/strong&gt;: dozens of warehouses spread across several countries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diverse and rotating catalogue&lt;/strong&gt;: seasonal goods rotating on pre-defined and specific windows of sale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High uncertainty&lt;/strong&gt; &lt;strong&gt;on key decision factors&lt;/strong&gt;: fluctuating demand patterns and fluctuating shipment or supplier lead times&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At &lt;a href="https://zeos.eu/"&gt;ZEOS&lt;/a&gt;, we recognise that our partners share these challenges. To empower them, we're developing AI-driven replenishment recommendations.&lt;/p&gt;
&lt;p&gt;The scale and complexity of this inventory problem make it a unique combined Applied Science and machine learning engineering (MLE) problem to solve. How can a system that continuously updates decisions consider these constantly changing and uncertain factors? The answer lies in building a dynamic inventory optimisation system.&lt;/p&gt;
&lt;p&gt;The article will cover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Brief overview of the inventory optimisation framework&lt;/li&gt;
&lt;li&gt;Deep dive into how we scale demand forecasting and accelerate research in our demand forecasting pipelines&lt;/li&gt;
&lt;li&gt;Deep dive into how we run optimisation at scale in our policy optimisation pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Optimisation framework&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We frame replenishment decisions as a cost-optimisation exercise, with the end goal of minimising inventory costs:&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\(Min\ Costs(\theta) = C_{storage}(\theta) + C_{lost\ sales}(\theta) + C_{overstock}(\theta) + C_{operations}(\theta) + C_{inbound}(\theta)\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In simpler words, we want to find optimal decisions &lt;span class="math"&gt;\(\theta^*\)&lt;/span&gt; that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce stockouts to avoid the cost of lost sales&lt;/li&gt;
&lt;li&gt;Limit inventory in warehouses at any point in time to reduce stock-holding costs&lt;/li&gt;
&lt;li&gt;Balance the long-term cost of overstock with the short-term cost of lost sales&lt;/li&gt;
&lt;li&gt;Satisfy the operational constraints/logistics setup (lead times, desired review frequency, …)&lt;/li&gt;
&lt;li&gt;Capture the stochastic nature of the decision-making process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To do so, we rely on a 2-step flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Step 1: We generate or gather the required inputs, such as probabilistic demand forecasts, returns lead-time forecasts, shipment lead times, user/item economics, the latest known stock state, and stock in transit.&lt;/li&gt;
&lt;li&gt;Step 2: All inputs are fed into a recommendation engine that leverages Monte Carlo simulations and black-box gradient-free optimisers for optimisation under uncertainty.&lt;/li&gt;
&lt;/ul&gt;
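&lt;p&gt;To make Step 2 more concrete, here is a minimal Python sketch of optimisation under uncertainty. It is an illustration rather than the production engine: the single-period cost model, the cost values, the Gaussian demand scenarios, and the names &lt;code&gt;simulate_cost&lt;/code&gt; and &lt;code&gt;optimise_stock_level&lt;/code&gt; are all assumptions, and an exhaustive search over integer stock levels stands in for a black-box gradient-free optimiser.&lt;/p&gt;

```python
import random
import statistics


def simulate_cost(order_up_to, demand_samples, holding_cost, lost_sale_cost):
    """Monte Carlo estimate: average cost of one period across demand scenarios."""
    costs = []
    for demand in demand_samples:
        leftover = max(order_up_to - demand, 0)   # units still in stock
        shortfall = max(demand - order_up_to, 0)  # demand we could not serve
        costs.append(holding_cost * leftover + lost_sale_cost * shortfall)
    return statistics.fmean(costs)


def optimise_stock_level(demand_samples, holding_cost, lost_sale_cost):
    """Gradient-free search: score every integer candidate and keep the cheapest."""
    candidates = range(0, max(demand_samples) + 1)
    return min(
        candidates,
        key=lambda level: simulate_cost(
            level, demand_samples, holding_cost, lost_sale_cost
        ),
    )


rng = random.Random(42)
# Hypothetical scenarios; a real run would sample from the probabilistic forecast.
demand_samples = [max(0, round(rng.gauss(100, 20))) for _ in range(2000)]
best_level = optimise_stock_level(demand_samples, holding_cost=1.0, lost_sale_cost=4.0)
print(best_level)
```

&lt;p&gt;With a lost sale costing four times as much as holding a unit, the search settles above mean demand, which mirrors the balance between overstock and lost sales listed above.&lt;/p&gt;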
&lt;p&gt;The &lt;strong&gt;demand forecasts&lt;/strong&gt; and &lt;strong&gt;replenishment optimisation system&lt;/strong&gt; are the core components, both in terms of impact and engineering complexity, and each gets a deep dive later in the article.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Overview" src="https://engineering.zalando.com/posts/2025/06/images/overview.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 1: Describing the 2-step flow deployed to generate user-facing inventory recommendations&lt;/figcaption&gt;

&lt;h2&gt;Overarching building blocks and design philosophy&lt;/h2&gt;
&lt;p&gt;We break the inventory optimisation problem into two isolated but connected building blocks: Demand Forecast and Inventory Optimisation. The Demand Forecast pipeline is a batch prediction pipeline that produces probabilistic forecasts for articles at a weekly cadence. The Inventory Optimisation pipeline offers daily batch predictions as well as real-time inference endpoints that let our B2B partners plan inventory settings interactively. Partners access this service through the partner portal, which provides a holistic picture of inventory health alongside other metrics and KPIs.&lt;/p&gt;
&lt;p&gt;Both pipelines are implemented using &lt;a href="https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html"&gt;zFlow&lt;/a&gt;, an internal machine learning ecosystem that offers seamless integration and abstractions for AWS and Databricks infrastructure. This enables us to focus on the machine learning application code without the overhead of building and maintaining complex infrastructure code. zFlow provides out-of-the-box security through in-transit and at-rest encryption for all artefacts, and enables orchestration via AWS Step Functions.&lt;/p&gt;
&lt;h2&gt;Scalable demand forecasts for millions of articles&lt;/h2&gt;
&lt;p&gt;To effectively manage our supply chain, we must accurately forecast demand for a vast number of products (SKUs) on a weekly basis. This requires a scalable and efficient forecasting system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Demand Forecaster" src="https://engineering.zalando.com/posts/2025/06/images/demand-forecaster.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 2: End-to-end flow of the demand forecasting pipeline&lt;/figcaption&gt;

&lt;p&gt;The following steps describe the flow of orchestration from right to left.&lt;/p&gt;
&lt;h3&gt;1. Feature Engineering: Data Pre-processing and Data Transformation Layers&lt;/h3&gt;
&lt;p&gt;We start by extracting features from curated data products including sales and availability information for all articles across warehouses and sales channels. Numerous (data) engineering teams across Zalando build and maintain these curated data products on a centrally governed data lakehouse, ensuring compliance with relevant access control protocols. In view of scalability, efficiency and interpretability, we recognize two complementary stages for feature engineering: data pre-processing and data transformation. The following table summarizes the design rationale for these stages.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Criteria&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Data Pre-Processing&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Data Transformation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Primary Objective&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Model upstream data products to represent the business problem in a human-understandable structure, enabling easier validation and analysis.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Engineer features from pre-processed data to maximize predictive signals for model training.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Example Transformations&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Joins, Filters, Aggregations, etc.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Encoding, Normalization, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Libraries and Frameworks&lt;/td&gt;
&lt;td style="text-align: left;"&gt;PySpark, Spark-SQL&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Pandas, Scikit-learn, Numpy, Numba&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Architectural Advantage&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Distributed processing using PySpark enables efficiently transforming large volumes of upstream data.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Significantly improved efficiency due to feature extraction on pre-processed data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;Scalability&lt;/td&gt;
&lt;td style="text-align: left;"&gt;PySpark enables horizontal scalability in the number of worker nodes as data volume grows.&lt;/td&gt;
&lt;td style="text-align: left;"&gt;Dependent libraries lack native distribution support, so we rely on vertical scalability to handle increasing data volumes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;1.1 Data Pre-processing Layer&lt;/h4&gt;
&lt;p&gt;The goal of this stage is to construct a time-series representation for all articles’ sales and availability over a configurable timeline. In our case, we use a 2.5-year timeframe to enable the model to capture seasonal patterns without overemphasising older historical performance. Although this process involves processing large data volumes, it avoids complex statistical or vectorised feature engineering. Leveraging this condition, we implement a fast and distributed processing pipeline using PySpark and &lt;a href="https://docs.databricks.com/en/delta/index.html"&gt;Delta Lake&lt;/a&gt; running on &lt;a href="https://docs.databricks.com/en/jobs/compute.html"&gt;transient job clusters&lt;/a&gt; in Databricks.&lt;/p&gt;
&lt;h4&gt;1.2 Data Transformation Layer&lt;/h4&gt;
&lt;p&gt;The transformation layer in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html"&gt;SageMaker processing job&lt;/a&gt; handles all feature engineering tasks on the time-series dataset generated in the previous step. Key transformations include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;deriving historical demand from sales and stock/availability data&lt;/li&gt;
&lt;li&gt;pricing information: initial and discounted prices on weekly levels&lt;/li&gt;
&lt;li&gt;article metadata (category, colour, material, etc.)&lt;/li&gt;
&lt;li&gt;unique identifier per time-series: we treat each combination of (article_id, merchant_id) as a unique entity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Forecasting-specific features, such as target lags/transformations, exogenous feature lags/transformations, and other temporal features, are handled later on by Nixtla’s MLForecast. This allows us to leverage optimised transformations from Nixtla (with Numba under the hood).&lt;/p&gt;
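&lt;p&gt;For readers unfamiliar with target lags, the following plain-Python sketch shows the kind of feature this refers to. It does not use MLForecast and is not how the library computes them; the function &lt;code&gt;add_lag_features&lt;/code&gt;, the row layout, and the choice of lags and window are hypothetical, and the sketch assumes every series has one row per consecutive week.&lt;/p&gt;

```python
from collections import defaultdict


def add_lag_features(rows, lags=(1, 2), window=3):
    """Attach lagged demand and a trailing mean to each weekly row.

    rows: dicts with article_id, merchant_id, week (int) and demand.
    Features only look at earlier weeks, so nothing leaks from the future.
    """
    # Group rows into one time series per (article_id, merchant_id).
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["week"]):
        series[(row["article_id"], row["merchant_id"])].append(row)

    enriched = []
    for history in series.values():
        demands = [r["demand"] for r in history]
        for i, row in enumerate(history):
            features = dict(row)
            for lag in lags:
                # None marks weeks that have no observation that far back.
                features[f"lag_{lag}"] = demands[i - lag] if i >= lag else None
            past = demands[max(0, i - window):i]
            features[f"rolling_mean_{window}"] = sum(past) / len(past) if past else None
            enriched.append(features)
    return enriched


rows = [
    {"article_id": "A1", "merchant_id": "M1", "week": week, "demand": demand}
    for week, demand in enumerate([10, 12, 9, 15], start=1)
]
for features in add_lag_features(rows):
    print(features)
```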
&lt;h3&gt;2. Model Training and Predictions&lt;/h3&gt;
&lt;p&gt;After extensive experimentation with deep learning models like &lt;a href="https://arxiv.org/abs/1912.09363"&gt;TFT&lt;/a&gt; and other machine learning approaches, we selected the &lt;a href="https://lightgbm.readthedocs.io/en/stable/"&gt;LightGBM&lt;/a&gt; model integrated with &lt;a href="https://nixtlaverse.nixtla.io/mlforecast/index.html"&gt;Nixtla’s MLForecast&lt;/a&gt; interface as the foundation of our demand forecasting pipeline. This stack offers significant advantages, including high-level abstractions for time series-specific feature generation with optimised performance, rapid prototyping through shorter feedback loops, and access to a robust, well-maintained open-source ecosystem.
Because the model is lightweight to train, we avoid extra complexity: we need neither checkpointing nor separate infrastructure for inference. Instead, model training and model inference are executed in a single pipeline using &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html"&gt;AWS SageMaker Training Jobs&lt;/a&gt;. This approach lowers infrastructure costs and accelerates the pipeline. The final output of this stage is a 12-week probabilistic demand forecast for each (article_id, merchant_id, week) combination.&lt;/p&gt;
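&lt;p&gt;The article later mentions conformal inference as the route to probabilistic forecasts. As a rough intuition for that idea, here is a sketch of split conformal prediction in plain Python. It is not MLForecast's implementation: the function &lt;code&gt;conformal_interval&lt;/code&gt;, the calibration data, and the decision to floor the lower bound at zero are assumptions for illustration.&lt;/p&gt;

```python
import math


def conformal_interval(calibration_actuals, calibration_predictions,
                       new_prediction, coverage=0.9):
    """Split conformal prediction: widen a point forecast by a quantile
    of the absolute errors seen on held-out calibration data."""
    scores = sorted(
        abs(actual - predicted)
        for actual, predicted in zip(calibration_actuals, calibration_predictions)
    )
    n = len(scores)
    # Finite-sample corrected rank. Simplification: when the rank exceeds n
    # the exact method returns an unbounded interval; here we clamp instead.
    rank = min(n, math.ceil((n + 1) * coverage))
    margin = scores[rank - 1]
    # Demand cannot be negative, so the lower bound is floored at zero.
    lower = max(0.0, float(new_prediction - margin))
    upper = float(new_prediction + margin)
    return lower, upper


# Hypothetical held-out weeks: what was sold versus what the model predicted.
actuals = [10, 12, 8, 15, 11, 9, 14, 10, 13, 12]
predictions = [11, 11, 9, 13, 11, 10, 12, 10, 12, 13]
print(conformal_interval(actuals, predictions, new_prediction=20, coverage=0.8))
```

&lt;p&gt;The appeal of this family of methods is that the point model stays untouched: the interval width comes entirely from past errors, so any regressor can be given an uncertainty band.&lt;/p&gt;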
&lt;h3&gt;3. Post-Processing&lt;/h3&gt;
&lt;p&gt;Finally, we process the demand predictions to ensure a time series representation suitable for downstream optimisation algorithms. This stage also includes a statistical analysis of model performance and the computation of key business metrics. These metrics are seamlessly integrated into our monitoring and alerting ecosystem, facilitating proactive detection of model drift. The post-processing is implemented using AWS SageMaker Processing Jobs, while the monitoring and alerting system utilises AWS CloudWatch alarms and AWS Lambda functions to deliver alerts to relevant channels.&lt;/p&gt;
&lt;p&gt;Our weekly forecasting pipeline processes 3 years of historical data for 5 million SKUs (size and colour) using a sliding window approach, and takes less than 2 hours. This high-performance pipeline is enabled by a deliberate focus on data model design and I/O efficiency. By leveraging zFlow and AWS-native services in our pipeline, we maintain a low total cost of ownership while ensuring reliability and scalability.&lt;/p&gt;
&lt;h2&gt;Translating Demand Forecasts into Actionable Inventory Strategies&lt;/h2&gt;
&lt;p&gt;With demand forecasts in hand, the next crucial step is determining how to effectively utilise this information. How can we extract value from these stochastic predictions and apply them to real-world inventory management decisions?&lt;/p&gt;
&lt;p&gt;Our inventory optimisation service provides both real-time and batch recommendations for optimal stock levels across all article SKUs for each partner. The real-time system allows partners to interactively adjust recommendations based on their specific inventory and stock parameters. Once these settings are established, we proactively cache both the settings and the resulting recommendations on a daily basis. This ensures that our offline batch process consistently delivers up-to-date, dynamic recommendations, taking into account the latest inputs, forecasts, and stock states.&lt;/p&gt;
&lt;p&gt;The following diagram illustrates both the real-time and batch prediction processes, flowing from right to left.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Replenishment Recommender" src="https://engineering.zalando.com/posts/2025/06/images/recommender.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 3: End-to-end flow of the inventory optimisation pipeline&lt;/figcaption&gt;

&lt;h3&gt;1. Feature Generation&lt;/h3&gt;
&lt;p&gt;Similar to the demand forecaster approach, feature generation is divided into two components. Transformations that can be fully implemented in PySpark are handled within Databricks, while operations that require the SciPy or NumPy ecosystem are performed in the SageMaker processing job. The final output of the feature generation process is a detailed feature vector for each SKU. This vector includes historical outbound data, inventory states, inbound volumes, pricing information, article metadata, cost factors, return lead time weights, and probabilistic demand forecasts for the next 12 weeks.&lt;/p&gt;
&lt;h3&gt;2. Feature Store&lt;/h3&gt;
&lt;p&gt;The input feature vector generated in the previous step is persisted in the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html"&gt;SageMaker Feature Store&lt;/a&gt;, in both its online and offline stores. The offline store, backed by Amazon S3, is designed for cold storage use cases such as batch pipelines, archiving, and debugging, and operates in append mode. It stores daily datapoints and the updated feature vectors that result from inventory settings changes, ensuring long-term data retention.&lt;/p&gt;
&lt;p&gt;While the offline feature store optimises for cost-efficient, high-throughput data I/O with latency in the order of minutes, the online store is optimised for low-latency, low-throughput applications, providing lookup access to only the latest valid feature vectors: either the daily generated vectors or the most recent user-triggered updates.
It guarantees a latency of 10–20 ms per SKU for both read and write operations, enabling fast interaction for both batch input generation pipelines and online serving systems.&lt;/p&gt;
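&lt;p&gt;The difference between the two stores can be pictured with a toy in-memory model. This is a sketch of the access pattern only, not the SageMaker Feature Store API: the class &lt;code&gt;FeatureStoreSketch&lt;/code&gt;, its method names, and the integer event times are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FeatureStoreSketch:
    """Toy model of the two stores; not the SageMaker Feature Store API."""

    offline_log: list = field(default_factory=list)    # append-only history
    online_latest: dict = field(default_factory=dict)  # one record per SKU

    def put(self, sku, event_time, features):
        record = {"sku": sku, "event_time": event_time, "features": features}
        self.offline_log.append(record)  # every write is retained
        current = self.online_latest.get(sku)
        # The online view only moves forward in time.
        if current is None or event_time >= current["event_time"]:
            self.online_latest[sku] = record

    def get_online(self, sku):
        """Key lookup of the latest feature vector for one SKU."""
        return self.online_latest[sku]["features"]

    def get_history(self, sku):
        """Batch-style scan of every version ever written for one SKU."""
        return [r["features"] for r in self.offline_log if r["sku"] == sku]


store = FeatureStoreSketch()
store.put("sku-1", 1, {"safety_stock": 10})  # daily batch write
store.put("sku-1", 2, {"safety_stock": 25})  # user-triggered update
print(store.get_online("sku-1"))
print(len(store.get_history("sku-1")))
```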
&lt;h3&gt;3. Optimisation&lt;/h3&gt;
&lt;p&gt;Optimisation here refers to optimising the stock replenishment predictions based on predicted demand and other user inputs about inventory settings. As discussed above, we provide online as well as offline optimisation recommendations for our partners. It’s important to note that the inventory optimisation algorithm and input features are synchronised between the two subsystems (online and offline), ensuring consistency across both engines. The following subsections describe the offline and online delivery mechanisms for the algorithm.&lt;/p&gt;
&lt;h4&gt;3.1 Offline Delivery Mechanism&lt;/h4&gt;
&lt;p&gt;The offline (batch) engine generates daily recommendation reports using finalised inputs from the offline feature store.
We execute the optimisation algorithm for the latest inventory settings of all merchants and articles using &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html"&gt;SageMaker batch transform jobs&lt;/a&gt;, followed by a post-processing layer implemented as a &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html"&gt;SageMaker Processing job&lt;/a&gt;. Similar to the demand forecaster, the post-processing job here evaluates our optimisation performance, enabling proactive model performance and drift monitoring. Once recommendations are computed, they are stored in S3, and a "report generated" notification is published to the respective event stream.&lt;/p&gt;
&lt;h4&gt;3.2 Online Delivery Mechanism&lt;/h4&gt;
&lt;p&gt;The online optimisation engine enables partners to interactively optimise predictions based on inventory settings. When partners update their inventory settings, we trigger an orchestrated workflow that queues each update request on AWS SQS. We then use AWS Lambda to poll the queue for updates and serve each update request asynchronously. For each inventory update, we fetch the feature vectors for the relevant SKUs from the online feature store and execute the optimisation algorithm using multi-threaded parallelism. Once optimal predictions have been calculated, we store the results in S3 and alert the backend systems via a notification in the event stream. Lastly, in addition to serving the online request, we also persist the inventory setting update to the offline feature store, keeping future offline predictions consistent.&lt;/p&gt;
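&lt;p&gt;The shape of this flow can be sketched with the Python standard library. This is a local stand-in under stated assumptions, not the AWS setup: &lt;code&gt;queue.Queue&lt;/code&gt; plays the role of SQS, &lt;code&gt;drain&lt;/code&gt; plays the polling Lambda, plain dictionaries and lists replace the online feature store, S3, and the event stream, and &lt;code&gt;optimise_sku&lt;/code&gt; is a placeholder for the real optimisation. Persisting the update to the offline store is left out for brevity.&lt;/p&gt;

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def optimise_sku(feature_vector, settings):
    """Placeholder for the real optimisation; returns a stock level."""
    return feature_vector["forecast"] + settings["safety_stock"]


def handle_update(update, online_features, results, notifications):
    """Serve one settings update: fetch, optimise in parallel, store, notify."""
    skus = update["skus"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so levels line up with skus.
        levels = list(pool.map(
            lambda sku: optimise_sku(online_features[sku], update["settings"]),
            skus,
        ))
    results[update["partner_id"]] = dict(zip(skus, levels))
    notifications.append(("recommendations_ready", update["partner_id"]))


def drain(update_queue, online_features, results, notifications):
    """Poll the queue until it is empty, handling each update in turn."""
    while True:
        try:
            update = update_queue.get_nowait()
        except queue.Empty:
            return
        handle_update(update, online_features, results, notifications)


update_queue = queue.Queue()
update_queue.put(
    {"partner_id": "p1", "skus": ["a", "b"], "settings": {"safety_stock": 5}}
)
online_features = {"a": {"forecast": 20}, "b": {"forecast": 7}}
results, notifications = {}, []
drain(update_queue, online_features, results, notifications)
print(results)
```

&lt;p&gt;Queuing the request decouples the partner-facing call from the optimisation run, which is why the result is announced through a notification rather than returned directly.&lt;/p&gt;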
&lt;h3&gt;Key scalability takeaways&lt;/h3&gt;
&lt;p&gt;Our approach prioritises scalability in three key areas:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Robust Pipelines:&lt;/strong&gt; We combine Databricks and AWS SageMaker for data transformations/processing and model training/inference. Every run triggers dedicated Databricks job clusters and SageMaker processing/training jobs. This keeps runs and their resources independent: a failure of one execution in a Databricks job cluster does not impact a parallel execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast data and vectorised transformations:&lt;/strong&gt; For data and vector transformations, we rely on PySpark, &lt;a href="https://numba.pydata.org/numba-doc/dev/index.html"&gt;Numba&lt;/a&gt; and Joblib multi-core parallelisation. Whenever possible, we vectorise operations and rely on Numba, which can often offer a speedup by a factor of 2 or 3 compared to NumPy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Light models&lt;/strong&gt;: We leverage Nixtla’s MLForecast with conformal inference, and LightGBM for probabilistic forecasts. Beyond the speed and scalability of LightGBM, we want to emphasise the benefits of using a library like Nixtla’s, which can automate many of the time series features and processes required just before training.&lt;/li&gt;
&lt;/ol&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Zalando"/><category term="Zalando Science"/><category term="Machine Learning"/><category term="Artificial Intelligence"/><category term="AWS"/><category term="Data Science"/><category term="Operations Research"/><category term="Backend"/></entry><entry><title>Adapting to Change: Returning to Work in a Fast-Moving Tech World</title><link href="https://engineering.zalando.com/posts/2025/05/adapting-to-change.html" rel="alternate"/><published>2025-05-19T00:00:00+02:00</published><updated>2025-05-19T00:00:00+02:00</updated><author><name>Kanupriya Gupta</name></author><id>tag:engineering.zalando.com,2025-05-19:/posts/2025/05/adapting-to-change.html</id><summary type="html">&lt;p&gt;This article discusses the challenges of returning to work in a fast-paced tech environment and how to navigate them.&lt;/p&gt;</summary><content type="html">&lt;p&gt;I took a break from work for the first time in my 12-year career - a full four months away. I expected to return naturally with a bit of catching up to do. What I didn’t expect was to come back and feel like I had walked into an entirely new world.&lt;/p&gt;
&lt;p&gt;The structure of my team had changed. The tech stack had evolved. The priorities were different now. And urgent tasks were already waiting for me. It felt a bit like returning to a city after years away: it is familiar, and yet everything is different.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is Zalando. This is what a fast-moving, innovation-driven tech company looks like.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Change is the Only Constant&lt;/h2&gt;
&lt;p&gt;In just four months, the landscape of my team had transformed. Some of the changes were surprising, even emotional:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;senior engineer&lt;/strong&gt; from whom I still had so much to learn had moved on.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;new senior engineer&lt;/strong&gt; had joined, bringing in fresh ideas and perspectives.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;close friend on the team&lt;/strong&gt; was transitioning into a broader role.&lt;/li&gt;
&lt;li&gt;And in a full-circle moment, an &lt;strong&gt;ex-coworker&lt;/strong&gt; I really enjoyed working with was rejoining the company and our team.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The shift wasn’t just technical, it was deeply human. The dynamics, the energy, and even the way we communicated had all evolved. At first, it felt like I was starting from scratch. &lt;strong&gt;It was strange to be introduced as if I was a new joiner&lt;/strong&gt;, like I had to earn my place back. The familiarity was still there, but it felt different — like I had to prove myself all over again.&lt;/p&gt;
&lt;p&gt;One of the strangest moments was realizing I needed to &lt;strong&gt;ask the new team member for help&lt;/strong&gt; with the development-environment setup. Normally, I would have turned to my old friend, who was now evolving into a bigger role. But she was busy with other priorities. Instead, I had to lean on someone I was supposed to be guiding — a role reversal that definitely caught me off guard.&lt;/p&gt;
&lt;p&gt;But then I remembered: &lt;strong&gt;that’s part of the Zalando rhythm&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We grow, people move, new faces come in, and somehow, the team keeps flowing forward - often better than before.&lt;/p&gt;
&lt;p&gt;Change at Zalando isn’t something that happens occasionally — it’s constant. It’s intentional. And it’s embraced.&lt;/p&gt;
&lt;h2&gt;Hitting the Ground Running&lt;/h2&gt;
&lt;p&gt;No soft landing here — I had a &lt;strong&gt;presentation to prepare and deliver within just two days&lt;/strong&gt; of being back.&lt;/p&gt;
&lt;p&gt;It wasn’t your typical tech talk either. This one was part of &lt;a href="https://www.linkedin.com/posts/zalando_insidezalando-zalandotech-womenintech-activity-7317450524440551424-d__B"&gt;Future Day – Code Like a Girl!&lt;/a&gt;, an amazing Zalando initiative aimed at encouraging young girls to explore STEM fields. We were hosting &lt;strong&gt;15 bright young minds&lt;/strong&gt;, and my session needed to be &lt;strong&gt;interactive, engaging, and inspiring&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Talk about a reentry challenge!&lt;/p&gt;
&lt;p&gt;To make things more intense, I’d missed an entire month of German lessons, which meant a hell of a lot of pending homework and catching up to do.&lt;/p&gt;
&lt;p&gt;Despite the time crunch, it was energizing. Being part of something that promotes diversity in tech reminded me why I love working here. It wasn’t just about catching up on code or new tools — it was about reconnecting with purpose.&lt;/p&gt;
&lt;p&gt;What truly struck me was how natural it felt to step into that space. I felt confident being myself again. I could connect with these young minds, encourage them, and share my journey authentically. It reminded me that I can have an &lt;em&gt;impact&lt;/em&gt; — that I have the ability to inspire, to mentor, and to help shape someone else’s path.&lt;/p&gt;
&lt;p&gt;That presentation helped me shake off the dust and reminded me that meaningful impact can happen even in high-pressure moments.&lt;/p&gt;
&lt;h2&gt;Tech Stack Challenges: The Learning Curve&lt;/h2&gt;
&lt;p&gt;One of the biggest changes I encountered was the &lt;strong&gt;shift in tech stack&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Before my break, I had been working with &lt;a href="https://engineering.zalando.com/posts/2024/05/appcraft.html"&gt;Appcraft&lt;/a&gt;, and we were working on its backend, the &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;Rendering Engine&lt;/a&gt;, to bring consistent theming to the framework. Now I was diving into a whole new world: the team was validating whether the Rendering Engine could power apps in a React Native environment, potentially replacing Appcraft altogether. In fact, Appcraft might be retired soon.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;setup&lt;/strong&gt; of the development environment? Oh, it was a ride.&lt;/p&gt;
&lt;p&gt;First, I had to upgrade my macOS and install the latest Xcode — simple enough, or so I thought.&lt;/p&gt;
&lt;p&gt;The real fun started when I tried to build the project and had my oops moment:
Dependency management had completely changed. Gone were the days of Carthage — now we were using the &lt;a href="https://www.swift.org/documentation/package-manager"&gt;Swift Package Manager&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But wait, there was more — the &lt;strong&gt;new React Native framework repository&lt;/strong&gt; needed its own setup. I was in deep.&lt;/p&gt;
&lt;p&gt;And once everything was finally up and running? That’s when the next mountain appeared: getting back into React Native. I had used it before — about seven years ago — but so much had changed, it felt like a whole new framework. I’ve always liked learning new languages and tools, but this wasn’t just brushing off some rust. It was more like starting from scratch.&lt;/p&gt;
&lt;h2&gt;A Crash Course in Re-Onboarding&lt;/h2&gt;
&lt;p&gt;The first few days back were a crash course — not just in the latest codebases, but in how to &lt;em&gt;relearn&lt;/em&gt; and &lt;em&gt;reconnect&lt;/em&gt; quickly. What helped?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open documentation and transparent communication&lt;/strong&gt;:
  Most of what I needed was already there, easily accessible and well-maintained. But it wasn’t just about finding information, it was about &lt;strong&gt;getting the context&lt;/strong&gt;. I spent a lot of time gathering links from coworkers, reading up on the &lt;strong&gt;strategy&lt;/strong&gt;, the &lt;strong&gt;roadmap&lt;/strong&gt;, and the &lt;strong&gt;execution&lt;/strong&gt; of the project. It wasn’t enough to just understand the code; I needed to understand the bigger picture. Only then could I get onboarded quickly enough not just to write code but to &lt;strong&gt;believe in the vision&lt;/strong&gt; of the project itself. Documentation wasn’t just my lifeline—it became the key to connecting with the purpose of the work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Supportive teammates&lt;/strong&gt;:
   From pairing sessions to async catch-ups, everyone made space for me to land smoothly. For the first two days, I was given the space to &lt;strong&gt;focus entirely on my presentation&lt;/strong&gt; for the "Future Day – Code Like a Girl!" initiative. It allowed me to jump back in without feeling overwhelmed by the technical aspects right away. My manager was incredibly supportive, assigning me &lt;strong&gt;tasks that helped me contribute to the project&lt;/strong&gt; but also allowed me to &lt;strong&gt;start slow and ease back into the flow&lt;/strong&gt;. This gave me the time to reorient myself without the pressure to dive into heavy technical work too quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Breaking the FOMO Myth&lt;/h2&gt;
&lt;p&gt;While I was in the second half of my break, I worried: &lt;em&gt;Will I fall behind? Will I be able to catch up?&lt;/em&gt;
Coming back has shown me that falling behind isn’t the real concern — it’s the fear of not being able to &lt;em&gt;adapt&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What I learned: &lt;strong&gt;you can take a break and come back stronger&lt;/strong&gt;. The fear of missing out fades quickly when you're returning to a company that’s built to support growth and reinvention.&lt;/p&gt;
&lt;p&gt;If anything, the experience gave me fresh eyes and a new kind of energy.&lt;/p&gt;
&lt;h2&gt;Staying Oriented and Focused&lt;/h2&gt;
&lt;p&gt;Reorienting yourself after a break can be overwhelming, especially when you’ve been away from work for months. The pressure to catch up quickly can build up, but here’s how I kept myself oriented and focused:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Staying updated&lt;/strong&gt;:
I’ve been following newsletters, tech videos, and blogs — aiming to engage with at least one resource a day, both within and beyond our organization. It helped me stay on top of company all-hands updates and departmental priorities. It’s been key in reconnecting not just with the &lt;em&gt;what&lt;/em&gt;, but also the &lt;em&gt;why&lt;/em&gt; behind our work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leverage your calendar&lt;/strong&gt;:
  I rely heavily on my &lt;strong&gt;calendar&lt;/strong&gt;—even the tiniest details of my &lt;strong&gt;to-dos&lt;/strong&gt; go in there. It’s my way of keeping track of everything, making sure nothing slips through the cracks. I landed back in Berlin on &lt;strong&gt;31st March&lt;/strong&gt; and made sure I had a list of &lt;strong&gt;important things&lt;/strong&gt; to look at already scheduled for 9 am the next morning. It helped me get off to a quick start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Morning Workout to stay focused&lt;/strong&gt;:
  I’ve learned that a good start to the day is crucial. That’s why I &lt;strong&gt;scheduled my morning yoga lessons&lt;/strong&gt; from the moment I returned. They’re not just a physical reset; they help clear my mind, giving me a focused and calm start to the day.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Meeting the team&lt;/strong&gt;:
  One of the first things I did was to &lt;strong&gt;show up at the office&lt;/strong&gt; to meet my team in person in the very first week. Those who know me know I am not a fan of coming to the office. But after being away for four months, I was craving some real human connection to ease back in. Meeting face-to-face helped me feel reconnected and grounded. Not everyone showed up (perks of work-from-home! 😄), but those who did made it totally worth it. It was comforting to see familiar faces and share a few laughs, something that helped me feel part of the team again almost instantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prioritize and tackle things by weight&lt;/strong&gt;:
  Coming back to a mountain of tasks can be overwhelming, and the pressure to dive into everything immediately can feel intense. But instead of forcing myself to handle it all at once, I took a step back to &lt;strong&gt;regather myself&lt;/strong&gt; and then tackled things &lt;strong&gt;one step at a time&lt;/strong&gt;. I &lt;strong&gt;prioritized&lt;/strong&gt; tasks based on their importance and urgency, giving myself the space to focus on what mattered most first. Taking a break to breathe and collect my thoughts before diving in made all the difference.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Zalando, the Evolving Ecosystem&lt;/h2&gt;
&lt;p&gt;From the outside, Zalando may look like a fashion-store platform. But inside, it’s an ecosystem of continuous change — products, people, processes, and technologies in a constant state of evolution.&lt;/p&gt;
&lt;p&gt;It’s not always easy. But it’s never boring.&lt;/p&gt;
&lt;p&gt;This culture doesn’t just accept change — it &lt;em&gt;thrives&lt;/em&gt; on it. That’s what makes it exciting to work here. And that’s what helped me re-integrate quickly, even after a significant break.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;If you’re thinking about taking a break but are worried about falling behind—don’t be. With the right environment, it’s not a setback. It’s a setup for rediscovery.&lt;/p&gt;
&lt;p&gt;If you’re considering Zalando as a place to grow your tech career — know that you’re signing up for change. Not just in what you work on, but in how you grow.&lt;/p&gt;
&lt;p&gt;And if you’re already here, maybe you’ll see yourself in this story too. And remember that adaptability isn’t just a survival skill here. It’s a superpower.&lt;/p&gt;</content><category term="Zalando"/><category term="Culture"/></entry><entry><title>From Event-Driven Chaos to a Blazingly Fast Serving API</title><link href="https://engineering.zalando.com/posts/2025/03/event-driven-to-api.html" rel="alternate"/><published>2025-03-07T00:00:00+01:00</published><updated>2025-03-07T00:00:00+01:00</updated><author><name>Conor Gallagher</name></author><id>tag:engineering.zalando.com,2025-03-07:/posts/2025/03/event-driven-to-api.html</id><summary type="html">&lt;p&gt;In this post, we explain how we replaced an event-driven system with a high performance API capable of serving millions of requests per second with single-digit-millisecond latency at P99&lt;/p&gt;</summary><content type="html">&lt;p&gt;Real-time data access is critical in e-commerce, ensuring accurate pricing and availability. At Zalando, our event-driven architecture for Price and Stock updates became a bottleneck, introducing delays and scaling challenges.&lt;/p&gt;
&lt;p&gt;This post covers how we redesigned our approach and built a blazingly fast API capable of serving millions of requests per second with single-digit-millisecond latency. You'll learn about the caching strategies, low-latency optimizations, and architectural decisions that enabled us to deliver this performance.&lt;/p&gt;
&lt;h2&gt;The Product Platform with No Read API&lt;/h2&gt;
&lt;p&gt;In 2016, Zalando built a microservices architecture where independent CRUD APIs onboarded different parts of product data. Once complete, each product was materialised as an event, requiring teams to consume the event stream to serve product data via their own APIs.&lt;/p&gt;
&lt;p&gt;In practice, this approach distributed the challenges of API serving across the company. A simple request—"I’m building a new feature and need access to product data. Where do I get it?"—had an unreasonable answer: "Subscribe to our event stream, replay events from the dawn of time, and build your own local store."&lt;/p&gt;
&lt;p&gt;Teams with engineering capacity consumed events, modified data to fit their needs, and exposed their own APIs or event streams. Those without the capacity relied on an existing unified data source, such as our Presentation API, inheriting its version of product data. This led to competing sources of truth.&lt;/p&gt;
&lt;p&gt;A good analogy is the children's whispering game—product data was altered at each step, and by the end, it no longer resembled the original. With no data lineage, there was no way to trace attributes back to their intended meaning.&lt;/p&gt;
&lt;h2&gt;The Offer Composition Problem&lt;/h2&gt;
&lt;p&gt;At Zalando, an Offer represents a merchant selling a Product at a specific price with a certain stock level. To serve the presentation view of a Product Offer, a multi-stage event-driven system merged Product, Price, and Stock events into a single structure. This structure underwent multiple stages of transformation, including aggregation and enrichment, before being stored in the datastore of our Presentation API, which is called by our &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;Fashion Store's GraphQL aggregator&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Legacy Architecture" src="https://engineering.zalando.com/posts/2025/03/images/pop-pre-pods-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;This architecture made Offer processing slow, expensive, and fragile. Frequent stock and price updates were processed alongside mostly static Product data, with over 90% of each payload unchanged—wasting network, memory, and processing resources. During &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;Cyber Week&lt;/a&gt;, stock and price events could be delayed by up to 30 minutes, resulting in a poor customer experience.&lt;/p&gt;
&lt;p&gt;The three Product Offer formats (Alpha, Beta, Gamma, in the diagram above) deviated significantly from their base formats. Since other teams could access events from intermediary stages, they developed dependencies on these formats.&lt;/p&gt;
&lt;h2&gt;The Mission: Decoupling Product and Offer Data&lt;/h2&gt;
&lt;p&gt;By 2022, it was clear that the Offer composition problem would become a barrier to business growth if left unresolved. A global project, Product Offer Data Split (PODS), was launched to tackle the issue.&lt;/p&gt;
&lt;p&gt;The goal was to remove large, unchanged Product data from event streams, eliminating the Offer pipeline bottleneck. A new serving layer would serve Product and Offer data independently or as a combined format for Presentation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/products/{product-id}&lt;/code&gt; - Core Product details&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/products/{product-id}/offers&lt;/code&gt; - Offers our Merchants have available for a Product&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/product-offers/{product-id}&lt;/code&gt; - Combined Product-Offer for our Presentation API&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Target Architecture" src="https://engineering.zalando.com/posts/2025/03/images/pods-target-arch-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;To succeed, our new serving layer, the Product Read API (PRAPI), needed to match or exceed the performance of the datastore it was replacing.&lt;/p&gt;
&lt;p&gt;As the team dug deeper into the problem, one question emerged: Could PRAPI outperform all locally stored copies of Product data?&lt;/p&gt;
&lt;p&gt;If so, a simple request—"Where do I get Product data?"—could finally have a simple answer: "Call the Product Read API."&lt;/p&gt;
&lt;h2&gt;PRAPI Architecture&lt;/h2&gt;
&lt;p&gt;PRAPI had the following high-level requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Low-latency retrieval – P99 latency of 50ms for single-item requests and 100ms for batch retrieval&lt;/li&gt;
&lt;li&gt;Resilience to extreme traffic spikes on individual products&lt;/li&gt;
&lt;li&gt;Country-level isolation – Prevent failures from cascading across our EU markets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="PRAPI Architecture" src="https://engineering.zalando.com/posts/2025/03/images/prapi-architecture-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;To meet these requirements, PRAPI was designed with four main components, each an independent Deployment on Kubernetes with tailored scaling rules. Each component incorporates end-to-end non-blocking I/O, leveraging &lt;a href="https://netty.io/"&gt;Netty’s&lt;/a&gt; EventLoop with Linux-native Epoll transport. &lt;a href="https://aws.amazon.com/dynamodb/"&gt;DynamoDB&lt;/a&gt; ensures high availability and fast lookups when cache misses occur.&lt;/p&gt;
&lt;h3&gt;Country-Level Isolation / Getting Data In&lt;/h3&gt;
&lt;p&gt;To achieve country-level isolation, multiple instances of PRAPI are deployed, known as Market Groups, each serving a subset of our countries. Routing configuration lets us dynamically shift traffic between Market Groups, so we can isolate internal or canary test traffic from high-value country traffic.&lt;/p&gt;
&lt;p&gt;Each Market Group's Updaters scale horizontally based on lag, up to the number of partitions in the source stream. To ensure rapid processing of millions of products, each pod:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads batches of 250 products&lt;/li&gt;
&lt;li&gt;Subpartitions events by Product ID&lt;/li&gt;
&lt;li&gt;Issues 10 concurrent batch writes of 25 items to DynamoDB&lt;/li&gt;
&lt;/ul&gt;
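&lt;p&gt;The three steps above can be sketched roughly as follows. This is an illustrative sketch, not PRAPI's implementation: the batch sizes come from the list above, while &lt;code&gt;write_batch&lt;/code&gt; is a hypothetical stand-in for a DynamoDB batch write, and keeping only the latest event per product within a read batch is an assumption about what subpartitioning by Product ID achieves.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

READ_BATCH_SIZE = 250   # products read from the stream per batch (from the post)
WRITE_BATCH_SIZE = 25   # items per batch write (from the post)
CONCURRENT_WRITES = 10  # parallel batch writes per pod (from the post)


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def latest_per_product(events):
    """Assumption: keep only the last event seen for each product ID."""
    latest = {}
    for event in events:
        latest[event["product_id"]] = event
    return list(latest.values())


def write_batches(events, write_batch):
    """Write one read batch using up to CONCURRENT_WRITES parallel batch writes."""
    batches = chunk(latest_per_product(events), WRITE_BATCH_SIZE)
    with ThreadPoolExecutor(max_workers=CONCURRENT_WRITES) as pool:
        # list() waits for every write and re-raises the first failure, if any.
        return list(pool.map(write_batch, batches))


if __name__ == "__main__":
    events = [{"product_id": f"p{i}", "price": i} for i in range(READ_BATCH_SIZE)]
    # Hypothetical writer: len() simply reports how many items each batch held.
    sizes = write_batches(events, write_batch=len)
    print(len(sizes), set(sizes))
```

&lt;p&gt;With 250 distinct products per read, the batch splits into exactly ten writes of 25 items, so one read keeps all ten writers busy.&lt;/p&gt;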
&lt;p&gt;&lt;img alt="Getting Data In" src="https://engineering.zalando.com/posts/2025/03/images/prapi-getting-data-in-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Scaling to hundreds of concurrent batch writes placed the bottleneck at DynamoDB’s write capacity units, which we could increase to populate a new Market Group in mere minutes if needed.&lt;/p&gt;
&lt;h3&gt;Outperforming DynamoDB&lt;/h3&gt;
&lt;p&gt;PRAPI was designed to be a fast-serving caching layer on top of DynamoDB. Here, we leaned heavily on the high-performance &lt;a href="https://github.com/ben-manes/caffeine"&gt;Caffeine&lt;/a&gt; cache. Using its async loading cache, we configured a 60-second cache time with the final 15 seconds as the stale window. In the last 15 seconds, retrieving a cache entry triggers a background refresh from DynamoDB.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Async Loading Cache" src="https://engineering.zalando.com/posts/2025/03/images/caffeine-lazy-loading2.png#center"&gt;&lt;/p&gt;
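&lt;p&gt;To make the timing concrete, here is a small language-neutral sketch of the stale-window behaviour in Python. It is not Caffeine and not PRAPI code: &lt;code&gt;RefreshingCache&lt;/code&gt;, &lt;code&gt;pending&lt;/code&gt; and &lt;code&gt;run_pending_refreshes&lt;/code&gt; are hypothetical names, and the background refresh is modelled as an explicit call so the example stays deterministic. In Caffeine terms, the setup described above would presumably combine &lt;code&gt;expireAfterWrite&lt;/code&gt; at 60 seconds with &lt;code&gt;refreshAfterWrite&lt;/code&gt; at 45 seconds, though the post does not show the actual configuration.&lt;/p&gt;

```python
import time


class RefreshingCache:
    """Serve entries for `ttl` seconds; reads in the final `stale_window` queue a refresh."""

    def __init__(self, loader, ttl=60.0, stale_window=15.0, clock=time.monotonic):
        self.loader = loader
        self.ttl = ttl
        self.refresh_after = ttl - stale_window
        self.clock = clock
        self.entries = {}  # key -> (value, written_at)
        self.pending = []  # keys waiting for a background refresh

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            # Miss or expired: the caller has to wait for the load.
            value = self.loader(key)
            self.entries[key] = (value, now)
            return value
        value, written_at = entry
        if now - written_at >= self.refresh_after and key not in self.pending:
            # Stale window: answer from cache now, refresh out of band.
            self.pending.append(key)
        return value

    def run_pending_refreshes(self):
        """Stand-in for the asynchronous refresh a real cache runs on its own."""
        now = self.clock()
        for key in self.pending:
            self.entries[key] = (self.loader(key), now)
        self.pending.clear()
```

&lt;p&gt;The effect is that a product with steady traffic is reloaded in the background before it expires, so its readers rarely pay for a synchronous load. Only entries that go unread for the full 60 seconds fall out of the cache.&lt;/p&gt;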
&lt;h3&gt;Optimizing Cache Hits&lt;/h3&gt;
&lt;p&gt;Our customer-driven traffic divides our catalogue into small hot and large cold sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cold: Niche items that receive infrequent traffic but must remain highly available&lt;/li&gt;
&lt;li&gt;Hot: Everyday items, such as white socks and t-shirts, that are accessed frequently&lt;/li&gt;
&lt;li&gt;Extremely Hot: Limited-edition releases, such as Nike sneakers, that generate sudden massive traffic spikes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But even if just 10% of our 10 million products are hot, caching 1 million large (~1000-line JSON) product payloads per pod is simply not feasible.&lt;/p&gt;
&lt;p&gt;To solve this, we leveraged a powerful load balancing algorithm for our products component, &lt;a href="https://en.wikipedia.org/wiki/Consistent_hashing"&gt;Consistent Hash&lt;/a&gt; Load Balancing (CHLB).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Consistent Hash Load Balancing" src="https://engineering.zalando.com/posts/2025/03/images/consistent-hash-load-balancing-scaled.png#center"&gt;&lt;/p&gt;
&lt;p&gt;In CHLB, each backend pod is assigned to multiple random positions on a hash ring. When a request comes in, the product-id is hashed to locate its position on the ring. The nearest pod clockwise on the ring then consistently serves that request. This partitions our catalogue between the available pods, allowing small local caches to effectively cache hot products. The wider we scale, the higher the portion of our catalogue that is cached.&lt;/p&gt;
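&lt;p&gt;A minimal hash ring along these lines might look like the sketch below. It is an illustration of the general technique rather than Skipper's implementation: the SHA-256 based &lt;code&gt;ring_position&lt;/code&gt;, the &lt;code&gt;HashRing&lt;/code&gt; class and the default of 100 positions per pod are all assumptions chosen for the example.&lt;/p&gt;

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a stable integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, pods, positions_per_pod=100):
        # Each pod sits at several scattered but deterministic positions.
        self.ring = sorted(
            (ring_position(f"{pod}#{i}"), pod)
            for pod in pods
            for i in range(positions_per_pod)
        )
        self.positions = [pos for pos, _ in self.ring]

    def pod_for(self, product_id):
        # First ring entry clockwise from the product's position, wrapping around.
        index = bisect.bisect_right(self.positions, ring_position(product_id))
        return self.ring[index % len(self.ring)][1]


if __name__ == "__main__":
    ring = HashRing(["pod-a", "pod-b", "pod-c"])
    # The same product ID always resolves to the same pod.
    print(ring.pod_for("sku-123") == ring.pod_for("sku-123"))
```

&lt;p&gt;Because existing pods keep their positions when a pod is added, the only products that change owner are those now served by the new pod. Every other pod's local cache stays valid.&lt;/p&gt;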
&lt;p&gt;Our batch-component unpacks batch requests, issues concurrent single-item lookups, and aggregates responses. It uses the &lt;a href="https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf"&gt;Power of Two Random Choices&lt;/a&gt; algorithm, routing requests to the less-loaded of two randomly selected pods.&lt;/p&gt;
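&lt;p&gt;The selection rule itself fits in a few lines. The sketch below assumes load is measured as in-flight requests per pod, which the post does not specify; &lt;code&gt;pick_pod&lt;/code&gt; and &lt;code&gt;simulate&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
import random


def pick_pod(in_flight, rng=random):
    """Sample two distinct pods at random and return the less-loaded one."""
    first, second = rng.sample(sorted(in_flight), 2)
    return min(first, second, key=lambda pod: in_flight[pod])


def simulate(pods, requests, seed=0):
    """Assign requests one by one and return the resulting load per pod."""
    rng = random.Random(seed)
    in_flight = {pod: 0 for pod in pods}
    for _ in range(requests):
        in_flight[pick_pod(in_flight, rng)] += 1
    return in_flight


if __name__ == "__main__":
    print(simulate(["pod-a", "pod-b", "pod-c", "pod-d"], 1000))
```

&lt;p&gt;Comparing just two candidates avoids scanning every pod on each request, yet keeps the load far more even than picking a single pod at random.&lt;/p&gt;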
&lt;h2&gt;Solving the Competing Sources of Truth Problem&lt;/h2&gt;
&lt;p&gt;Delivering Product data centrally via API solved the Offer composition problem and laid the foundation for future applications. But success hinged on adoption—teams migrating off old formats, standardizing on the new, and decommissioning legacy applications.&lt;/p&gt;
&lt;p&gt;&lt;img alt="xkcd comic on standards" src="https://imgs.xkcd.com/comics/standards.png#center"&gt;&lt;/p&gt;
&lt;p&gt;With ~350 engineering teams and thousands of deployed applications, many relying directly or indirectly on Product data, migration was always going to be complex. Without a clear transition path, legacy systems persist, and much like the XKCD comic above, it's easy to end up in a worse state than before.&lt;/p&gt;
&lt;p&gt;To ensure adoption, PRAPI took ownership of all legacy representations of Product and Offer data. Engineers meticulously analyzed and replicated existing transformations within PRAPI, allowing client teams to request data in their required format via the Accept header:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;application/json&lt;/code&gt; — New standard format for all teams&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application/x.alpha-format+json&lt;/code&gt; — Legacy (previously on event stream)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application/x.beta-format+json&lt;/code&gt; — Legacy (previously on event stream)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;application/x.gamma-format+json&lt;/code&gt; — Legacy (from Presentation API)&lt;/li&gt;
&lt;/ul&gt;
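&lt;p&gt;Conceptually, this is content negotiation backed by a table of transformations. The sketch below uses two of the media types listed above, but the transformation bodies are hypothetical placeholders, since the real legacy mappings are not described here.&lt;/p&gt;

```python
import json


def to_standard(product):
    """The new standard format is returned unchanged."""
    return product


def to_alpha(product):
    # Hypothetical legacy shape: every key carries an "alpha_" prefix.
    return {f"alpha_{key}": value for key, value in product.items()}


# Media types are taken from the post; the functions are illustrative only.
TRANSFORMERS = {
    "application/json": to_standard,
    "application/x.alpha-format+json": to_alpha,
}


def render(product, accept="application/json"):
    """Serialise a product in the representation named by the Accept header."""
    transform = TRANSFORMERS.get(accept)
    if transform is None:
        raise LookupError(f"unsupported media type: {accept}")
    return json.dumps(transform(product), sort_keys=True)
```

&lt;p&gt;Keeping every representation behind one lookup means a legacy format can be retired by deleting its entry once its sunset period ends.&lt;/p&gt;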
&lt;p&gt;Additionally, temporary components within PRAPI emitted alpha and beta formats back onto the legacy event streams. This enabled legacy applications to be decommissioned immediately, while teams gradually migrated off the legacy formats within a fixed sunset period.&lt;/p&gt;
&lt;h2&gt;Performance Results&lt;/h2&gt;
&lt;p&gt;To accurately measure PRAPI's performance from a client perspective, we use metrics from our ingress load-balancer, &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Single GET Performance" src="https://engineering.zalando.com/posts/2025/03/images/prapi-single-get-perf.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Single GET requests return large (~1000-line JSON) payloads with content-type transformations but still achieve sub-10ms P99 latency.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Batch GET Performance" src="https://engineering.zalando.com/posts/2025/03/images/prapi-batch-get-perf.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Batch GET requests, handling up to 100 items, scale predictably with an expected increase in response time, closely aligning with the P999 of single GETs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cluster Performance" src="https://engineering.zalando.com/posts/2025/03/images/prapi-cluster-perf.png#center"&gt;&lt;/p&gt;
&lt;p&gt;PRAPI performs better under load: as traffic increases, more of the product catalogue remains cached, reducing latency. This is visible in the cluster-wide latency graphs above, captured while we load-tested PRAPI.&lt;/p&gt;
&lt;h2&gt;Advanced Tuning&lt;/h2&gt;
&lt;p&gt;This section covers the advanced tuning techniques we applied to reduce tail latency in PRAPI.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/about.htm"&gt;Java Flight Recorder (JFR)&lt;/a&gt; was invaluable in fine-tuning the JVM. By capturing telemetry from underperforming pods and visualising it in JDK Mission Control, we identified Garbage Collection (GC) pauses and ensured no blocking tasks ran on NIO thread pools.&lt;/p&gt;
&lt;h3&gt;Open-Source Load Balancer Contributions&lt;/h3&gt;
&lt;p&gt;We contributed the following improvements to the CHLB algorithm in &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;, our Kubernetes Ingress load balancer:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/zalando/skipper/issues/1712"&gt;Minimising Cache Loss During Scaling&lt;/a&gt; – Previously, pod rebalancing on scale-up/down caused mass cache invalidations, routing traffic to cold caches. We fixed this by assigning each pod to 100 fixed locations on the ring, reducing cache misses to 1/N, where N is the previous number of pods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/zalando/skipper/issues/1769"&gt;Preventing Overload from Hyped Products&lt;/a&gt; – We added the Bounded Load algorithm, capping per-pod traffic at 2× the average. Once exceeded, requests spill over clockwise to the next non-overloaded pod, keeping hyped products cached and distributed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
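&lt;p&gt;The bounded-load idea can be sketched as a small extension of a hash ring. This is a simplified illustration, not Skipper's code: the hashing scheme, the &lt;code&gt;BoundedLoadRing&lt;/code&gt; class and the use of in-flight request counts as the load measure are assumptions. Only the figures of 100 positions per pod and a 2× cap come from the text above.&lt;/p&gt;

```python
import bisect
import hashlib
import math


def ring_position(key):
    """Map a string to a stable integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class BoundedLoadRing:
    def __init__(self, pods, positions_per_pod=100, load_factor=2.0):
        self.ring = sorted(
            (ring_position(f"{pod}#{i}"), pod)
            for pod in pods
            for i in range(positions_per_pod)
        )
        self.positions = [pos for pos, _ in self.ring]
        self.in_flight = {pod: 0 for pod in pods}
        self.load_factor = load_factor

    def acquire(self, product_id):
        """Pick a pod for one request and count that request as in flight."""
        total = sum(self.in_flight.values()) + 1
        # Per-pod cap: load_factor times the average load, rounded up.
        limit = math.ceil(self.load_factor * total / len(self.in_flight))
        start = bisect.bisect_right(self.positions, ring_position(product_id))
        for step in range(len(self.ring)):
            pod = self.ring[(start + step) % len(self.ring)][1]
            if limit > self.in_flight[pod]:
                self.in_flight[pod] += 1
                return pod
        raise RuntimeError("no pod below the load limit")

    def release(self, pod):
        """Mark one request on `pod` as finished."""
        self.in_flight[pod] -= 1
```

&lt;p&gt;When a single hyped product receives all the traffic, its home pod fills up to the cap and the overflow lands on the next pods clockwise, so a small, stable set of pods ends up caching that product.&lt;/p&gt;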
&lt;h3&gt;Eliminating Garbage Collection Pauses&lt;/h3&gt;
&lt;p&gt;Key learning: The best way to eliminate GC pauses is to avoid object allocation altogether.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Products – Cache Product data as a single &lt;code&gt;ByteArray&lt;/code&gt; instead of an &lt;code&gt;ObjectNode&lt;/code&gt; graph, reducing heap pressure.&lt;/li&gt;
&lt;li&gt;Product-Sets – Avoid reading individual gzipped responses into memory. Instead, store them in &lt;a href="https://square.github.io/okio/3.x/okio/okio/okio/-buffer/index.html"&gt;Okio buffers&lt;/a&gt; and &lt;a href="https://www.gnu.org/software/gzip/manual/html_node/Advanced-usage.html"&gt;concatenate&lt;/a&gt; them directly in the response object, eliminating unnecessary gunzip/re-gzip operations.&lt;/li&gt;
&lt;/ul&gt;
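&lt;p&gt;The concatenation trick relies on a property of the gzip format: several gzip members placed back to back decompress to the concatenation of their contents. The sketch below shows this with Python's standard library. Wrapping the items in a JSON array, with separately compressed brackets and commas, is an assumption made for the example; how PRAPI frames its batch payload is not described here.&lt;/p&gt;

```python
import gzip
import json


def gzip_json_array(gzipped_items):
    """Build a gzipped JSON array from already-gzipped documents without inflating them."""
    parts = [gzip.compress(b"[")]
    for index, item in enumerate(gzipped_items):
        if index:
            parts.append(gzip.compress(b","))
        parts.append(item)  # reused as-is: no gunzip, no re-gzip
    parts.append(gzip.compress(b"]"))
    return b"".join(parts)


if __name__ == "__main__":
    items = [gzip.compress(json.dumps({"id": i}).encode()) for i in range(3)]
    body = gzip_json_array(items)
    print(json.loads(gzip.decompress(body)))
```

&lt;p&gt;The cached bytes of each product pass through untouched, so the only new compression work per batch is for a handful of one-byte separators.&lt;/p&gt;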
&lt;h3&gt;LIFO vs FIFO&lt;/h3&gt;
&lt;p&gt;Key learning: In latency-sensitive applications, FIFO queuing can create long-tail latency spikes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load Balancer – While we aim to avoid request queuing, switching to LIFO reduced long-tail latency spikes when queuing occurred.&lt;/li&gt;
&lt;li&gt;DynamoDB Clients – We configured a primary DynamoDB client with a 10ms timeout and a fallback client with a 100ms timeout for retries. This prevented FIFO queuing on the primary client during DynamoDB latency spikes.&lt;/li&gt;
&lt;/ul&gt;
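&lt;p&gt;The two-client arrangement can be pictured as a fast path with a patient retry path. In the sketch below the timeouts are the ones quoted above, while &lt;code&gt;fake_client&lt;/code&gt;, &lt;code&gt;LookupTimeout&lt;/code&gt; and &lt;code&gt;get_product&lt;/code&gt; are hypothetical stand-ins that simulate latency instead of calling DynamoDB.&lt;/p&gt;

```python
PRIMARY_TIMEOUT_MS = 10    # from the post
FALLBACK_TIMEOUT_MS = 100  # from the post


class LookupTimeout(Exception):
    """Raised by a client when a call exceeds its timeout."""


def fake_client(latency_ms, store):
    """Hypothetical stand-in for a database client with a simulated latency."""
    def get(key, timeout_ms):
        if latency_ms > timeout_ms:
            raise LookupTimeout(key)
        return store[key]
    return get


def get_product(key, primary, fallback):
    """Try the fast client first; retry once on the slower fallback client."""
    try:
        return primary(key, PRIMARY_TIMEOUT_MS)
    except LookupTimeout:
        return fallback(key, FALLBACK_TIMEOUT_MS)
```

&lt;p&gt;Because slow calls leave the primary client after 10ms, they cannot pile up in front of fast ones there. The retries wait on a separate client, where a longer queue does less harm.&lt;/p&gt;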
&lt;h2&gt;Scaling Teams Alongside Architecture&lt;/h2&gt;
&lt;p&gt;The success of the PODS project required more than just technical changes—it also required a reorganization of teams to match the new architecture. Following &lt;a href="https://martinfowler.com/bliki/ConwaysLaw.html"&gt;Conway’s Law&lt;/a&gt; and &lt;a href="https://martinfowler.com/bliki/CQRS.html"&gt;CQRS&lt;/a&gt; principles, the Product department was restructured into two stream-aligned teams:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partners &amp;amp; Supply – Manages data ingestion (Command side)&lt;/li&gt;
&lt;li&gt;Product Data Serving – Focuses on aggregation and retrieval (Query side)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This shift reduced dependencies, improved scalability, and accelerated product updates. With &lt;a href="https://engineering.zalando.com/posts/2022/02/principal-engineering-at-zalando.html"&gt;principal engineers&lt;/a&gt; driving architectural simplifications, the new structure ensures resilience for peak events like Cyber Week and lays the foundation for future innovations, including unified product data models and multi-tenant solutions.&lt;/p&gt;
&lt;h2&gt;What’s Next?&lt;/h2&gt;
&lt;p&gt;With the core architecture in place, future projects will focus on unified product data models, multi-tenant solutions, and advanced analytics capabilities. The foundation is set—PODS has redefined how Zalando scales product data.&lt;/p&gt;
&lt;p&gt;A special thank you to the SPP and POP engineers, as well as all the teams across Zalando who contributed to this large migration effort—it would not have been possible without you.&lt;/p&gt;</content><category term="Zalando"/><category term="Scalability"/><category term="APIs"/><category term="Backend"/></entry><entry><title>LLM powered migration of UI component libraries</title><link href="https://engineering.zalando.com/posts/2025/02/llm-migration-ui-component-libraries.html" rel="alternate"/><published>2025-02-20T00:00:00+01:00</published><updated>2025-02-20T00:00:00+01:00</updated><author><name>Naval Singh</name></author><id>tag:engineering.zalando.com,2025-02-20:/posts/2025/02/llm-migration-ui-component-libraries.html</id><summary type="html">&lt;p&gt;Sharing our approach and insights employing LLMs to migrate in-house UI component libraries at scale.&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Zalando, we continuously seek ways to improve our processes and focus on finding efficient solutions to complex challenges by using suitable new technologies. The in-house UI library project is one practical example of how we tackled technical debt efficiently by leveraging LLMs.&lt;/p&gt;
&lt;h2&gt;Overview of migration project&lt;/h2&gt;
&lt;p&gt;At Zalando, I work in Partner Tech, where we focus on empowering our partners to offer their products for sale on our platform (or supply to Zalando as a retailer). As the main interface between Zalando and our partners, we develop a range of user interfaces to facilitate their day-to-day operations.&lt;/p&gt;
&lt;p&gt;Over time, our department had developed two distinct in-house UI component libraries, each being used in different types of partner-facing applications. This fragmentation led to several challenges impacting our internal efficiency and partner experience:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Inconsistent user experience across different partner facing applications&lt;/li&gt;
&lt;li&gt;Duplicated design and development efforts&lt;/li&gt;
&lt;li&gt;Design side complexity in maintaining two design languages&lt;/li&gt;
&lt;li&gt;Increased maintenance complexity for the engineering teams&lt;/li&gt;
&lt;li&gt;Higher onboarding time for new developers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To resolve the above challenges, we initiated a project to migrate our partner facing applications from one of the UI component libraries to the other one. The project's scope encompassed 15 sophisticated B2B applications, and due to significant differences between the source and target UI component libraries, this migration required substantial resources and time.&lt;/p&gt;
&lt;p&gt;Given the scale and complexity of this migration, we explored various automation approaches to reduce the effort and time required. We investigated traditional approaches like &lt;a href="https://github.com/facebook/jscodeshift"&gt;JavaScript codemods&lt;/a&gt; and also wanted to explore AI technologies like Large Language Models (LLMs) given the recent advances in their capabilities.&lt;/p&gt;
&lt;h2&gt;Migration using LLMs: Getting Started&lt;/h2&gt;
&lt;p&gt;When we first considered using LLMs for our migration, we had several questions and concerns that needed to be resolved before committing to an LLM-based migration: Would the models understand our custom components? Could they carry out the migration with high accuracy? Was there a risk of subtle, hard-to-detect bugs being introduced? These concerns were not just theoretical – any inaccuracies in the process could have a direct impact on our partners' experience.&lt;/p&gt;
&lt;h3&gt;LLM Hackathon - Investigating Feasibility&lt;/h3&gt;
&lt;p&gt;To validate the feasibility of LLMs for our use case, we participated in an internal LLM hackathon organised by the Zalando research team and Tech Academy. In this hackathon, developers from different teams explored various ways AI could solve engineering challenges.&lt;/p&gt;
&lt;p&gt;Our team focused specifically on validating LLMs' potential for automated code migration and carried out multiple experiments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Experiment Setup&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To keep our experiments focused and measurable, we chose a set of sample UI components of varying complexity, from simple buttons to more complex Select components. We also set up a simple test application to test the accuracy of migration under realistic conditions. We adopted an iterative approach, with each experiment building on insights from previous ones.&lt;/p&gt;
&lt;h3&gt;Iterative Experiments&lt;/h3&gt;
&lt;h4&gt;Iteration 1: Transform using Source code&lt;/h4&gt;
&lt;p&gt;We initially attempted direct migration by providing the source code of the components to the LLM.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## Source files&lt;/span&gt;
[Source files of the component in the source library]
[Source files of the component in the target library]

&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
Using the source files, migrate the components  in the below file to the target library:
[...file content...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: This produced inconsistent results with numerous errors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it failed&lt;/strong&gt;: We assumed the migration failed because it involved multiple complex intermediary steps. The LLM needed to understand the source code, define an interface, create a mapping between the libraries, and then migrate the test application. It struggled to handle all these steps reliably in a single pass.&lt;/p&gt;
&lt;h4&gt;Iteration 2: Transform using interface&lt;/h4&gt;
&lt;p&gt;We divided the process into two steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Generated detailed component interfaces by providing the source code to the LLM&lt;/li&gt;
&lt;li&gt;Passed the component interface as context to the LLM for carrying out the migration&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Prompt 1 - interface generation
&lt;span class="gu"&gt;## Source files&lt;/span&gt;
[Source files of the component in the source library]
[Source files of the component in the target library]

&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
Using the source files, generate a detailed interface of the components.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Prompt 2 - transformation using interface
&lt;span class="gu"&gt;## Interface&lt;/span&gt;
Here&amp;#39;s a detailed list of attributes for the Button component:
&lt;span class="k"&gt;1.&lt;/span&gt; type: &amp;quot;filled&amp;quot; | &amp;quot;outlined&amp;quot; | &amp;quot;link&amp;quot; Default: &amp;quot;filled&amp;quot;
Defines the type of the button. &amp;quot;filled&amp;quot; is ...
&lt;span class="k"&gt;2.&lt;/span&gt; size: &amp;quot;small&amp;quot; | &amp;quot;medium&amp;quot;
//... rest of the interface

&lt;span class="gu"&gt;## Transformation instructions&lt;/span&gt;
Migrate the usages of button components in the below file:
[...file content...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: This approach still yielded low accuracy, with the LLM failing to transform several component attributes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it failed&lt;/strong&gt;: We noticed that even though the interface was detailed, it lacked essential information present in the original source code that was necessary for complete component transformation.&lt;/p&gt;
&lt;h4&gt;Iteration 3: Transform using interface and transformation instructions&lt;/h4&gt;
&lt;p&gt;Building on previous iterations, we combined the interfaces generated above with explicit instructions on how to transform a component and all of its attributes from the source library to the target library.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## Interface&lt;/span&gt;
[As above]

&lt;span class="gu"&gt;## Mapping&lt;/span&gt;
Instruction to migrate button
&lt;span class="k"&gt;1.&lt;/span&gt; convert variant=primary or variant=default to type=&amp;quot;filled&amp;quot;
 ...
&lt;span class="k"&gt;2.&lt;/span&gt; convert size=&amp;quot;small&amp;quot; to size= &amp;quot;small&amp;quot; and size=&amp;quot;medium&amp;quot; to size=&amp;quot;medium&amp;quot;

&lt;span class="gu"&gt;## Transformation instructions&lt;/span&gt;
[As above]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; The code was transformed with medium accuracy, but the results revealed flaws in the automatically generated mapping instructions. For example, for the button component, the LLM created direct size mappings (converting a "medium" button to "medium"), when in reality a "medium" button in the original library was visually equivalent to a "large" button in the new library.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it failed&lt;/strong&gt;: There were a few reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Source code cannot reveal all information, e.g. design intent or visual relationships&lt;/li&gt;
&lt;li&gt;The LLM couldn't visualize how components are rendered&lt;/li&gt;
&lt;li&gt;Different libraries implement similar concepts (like "medium" size) differently&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Iteration 4: Manual verification of interface and transformation instructions&lt;/h4&gt;
&lt;p&gt;To address the issues from the previous iteration, we added manual verification of the prompts, for example, fixing the size mappings where they were inaccurate:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## Interface&lt;/span&gt;
[As above]

&lt;span class="gu"&gt;## Mapping&lt;/span&gt;
Instruction to migrate button
&lt;span class="k"&gt;1.&lt;/span&gt; convert variant=primary or variant=default to type=&amp;quot;filled&amp;quot;
 ...
// Fixed after manual verification
&lt;span class="k"&gt;2.&lt;/span&gt; convert size=&amp;quot;small&amp;quot; to size= &amp;quot;medium&amp;quot; and size=&amp;quot;medium&amp;quot; to size=&amp;quot;large&amp;quot;

&lt;span class="gu"&gt;## Transformation instructions&lt;/span&gt;
[As above]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; This improved accuracy even further for transforming basic components, but for complex components requiring substantial code restructuring it still had issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What could be missing&lt;/strong&gt;: While the LLM had the information needed for transformation, most of it was theoretical. We felt that providing transformation examples with explanations would help the LLM learn from these patterns and enhance accuracy.&lt;/p&gt;
&lt;h4&gt;Iteration 5: Passing examples to the LLM&lt;/h4&gt;
&lt;p&gt;Our final iteration supplemented the instructions in the previous iteration with examples of increasing complexity. The examples were generated by the LLM but verified manually.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## Interface&lt;/span&gt;
[As above]

&lt;span class="gu"&gt;## Mapping&lt;/span&gt;
[As above]

&lt;span class="gu"&gt;## Examples&lt;/span&gt;
example 1: Simple transformation
// Source
&amp;lt;button size=&amp;quot;medium&amp;quot; /&amp;gt;
// Target
&amp;lt;button size=&amp;quot;large&amp;quot; /&amp;gt;
Migration Notes:
&lt;span class="k"&gt;1.&lt;/span&gt; size=&amp;quot;medium&amp;quot; maps to size=&amp;quot;large&amp;quot; due to visual equivalence
... other examples...

&lt;span class="gu"&gt;## Transformation instructions&lt;/span&gt;
[As above]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; The code was transformed with a high degree of accuracy for all the components.&lt;/p&gt;
&lt;p&gt;Through this series of iterative experiments, we were able to finalize our approach.&lt;/p&gt;
&lt;h2&gt;Building Our Migration Toolkit&lt;/h2&gt;
&lt;p&gt;After establishing our methodology through iterative experiments, the next challenge was to scale our approach while maintaining accuracy across the UI components.&lt;/p&gt;
&lt;h3&gt;Crafting component prompts&lt;/h3&gt;
&lt;p&gt;As we did in our hackathon, we crafted the transformation prompts for migration by providing the source code of our components to the LLM. These initial instructions included component interfaces, transformation rules, and example migrations. We used &lt;a href="https://www.continue.dev/"&gt;continue.dev&lt;/a&gt; to streamline this process, making the workflow of attaching source code and generating prompt context more efficient.&lt;/p&gt;
&lt;h3&gt;System Prompts&lt;/h3&gt;
&lt;p&gt;We discovered that using system prompts enhanced the accuracy of the transformations. By instructing the LLM to operate as an experienced developer and clearly defining the task objectives, we achieved more consistent results. The system prompts also specified detailed requirements for code style, best practices, and error handling conventions. This proved instrumental in generating accurate code transformations that adhered to the instructed output format.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## System prompt&lt;/span&gt;
You are an expert frontend software developer with deep knowledge of frontend development,
component libraries, and design systems.
You MUST follow the instructions provided exactly as they are given.
Your task is to help migrate UI components from one library to another
while maintaining visual and functional equivalence.
// .. other instructions
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Creating the tool&lt;/h3&gt;
&lt;p&gt;We developed a Python-based migration tool using the &lt;a href="https://llm.datasette.io/"&gt;llm&lt;/a&gt; library's &lt;a href="https://llm.datasette.io/en/stable/python-api.html#conversations"&gt;conversation API&lt;/a&gt;. The tool processed each file in the given source directories and applied LLM-powered migrations for the components present in the file. We chose Python for its extensive support for working with LLMs and its rich ecosystem of libraries. Based on our hackathon results and subsequent testing, we opted for GPT-4o, which consistently delivered the most accurate transformations. It's worth noting that this tool was developed in September 2024 and put to use shortly afterwards, so our findings reflect the model's capabilities during that timeframe.&lt;/p&gt;
&lt;p&gt;While the core implementation was straightforward, we encountered several technical challenges that required specific solutions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Handling large files&lt;/strong&gt;: When files exceeded the 4K &lt;a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them"&gt;token&lt;/a&gt; limit, the output would get truncated mid-transformation. We resolved this by utilizing the &lt;a href="https://llm.datasette.io/en/stable/python-api.html#conversations"&gt;conversation API&lt;/a&gt; and passing "continue" as a prompt whenever the content was cut off. This allowed the LLM to pick up where it left off and complete the transformation. As per our tests, a simple "continue" prompt proved more reliable than more complex prompts to continue the transformation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output consistency&lt;/strong&gt;: Initially, we noticed varying outputs for the same input, making testing and validation challenging. Adjusting the LLM settings, such as setting the temperature parameter to 0, made the output more deterministic and reproducible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixing output format&lt;/strong&gt;: The LLM would sometimes include explanatory text or markdown formatting along with the transformed code. We resolved this by giving context to the tool and incorporating specific output formatting instructions in the system prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;You MUST return just the transformed file inside the &amp;lt;updatedContent&amp;gt; tag like:
&amp;lt;updatedContent&amp;gt;transformed-file&amp;lt;/updatedContent&amp;gt; without any additional data.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
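&lt;p&gt;The "continue" technique for large files can be sketched as a small loop. This is a minimal illustration rather than the tool's actual code: &lt;code&gt;send&lt;/code&gt;, &lt;code&gt;end_marker&lt;/code&gt; and &lt;code&gt;max_rounds&lt;/code&gt; are hypothetical names, and it assumes a reply counts as complete once the closing output tag (here passed in as &lt;code&gt;end_marker&lt;/code&gt;) has appeared.&lt;/p&gt;

```python
from typing import Callable


def transform_with_continue(
    send: Callable[[str], str],
    prompt: str,
    end_marker: str,
    max_rounds: int = 5,
) -> str:
    """Collect a possibly truncated reply by sending "continue" until it is complete.

    send is a hypothetical stand-in for one turn of a chat conversation:
    it takes a prompt and returns the text of the model's reply.
    """
    output = send(prompt)
    rounds = 0
    # A reply without the closing marker was cut off by the output limit.
    while end_marker not in output and max_rounds > rounds:
        output += send("continue")
        rounds += 1
    if end_marker not in output:
        raise RuntimeError("output still incomplete after max_rounds continuations")
    return output
```

&lt;p&gt;With the llm library, &lt;code&gt;send&lt;/code&gt; could plausibly wrap a conversation object, for example a function that returns &lt;code&gt;conversation.prompt(text).text()&lt;/code&gt;, so that each "continue" turn sees the earlier turns as context.&lt;/p&gt;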

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limiting input context&lt;/strong&gt;: We observed that as the input prompt size grew, the transformation accuracy declined. To maintain high quality, we organized components into logical groups (like 'form', 'core', etc.), keeping context tokens between 40K and 50K per group of components. This grouping strategy helped maintain the LLM's focus and improved transformation accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated tests&lt;/strong&gt;: During development, we discovered that small adjustments to transformation instructions could lead to substantial changes in results. This highlighted a need to have prompt validation tests in place and led us to implement automated testing using LLM-generated examples. These examples served as validation tools and regression tests, helping us catch unexpected changes during the migration process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caching and prompt structure:&lt;/strong&gt; LLM APIs can cache identical prompt prefixes, potentially reducing API costs and improving response times by reusing previously processed input (e.g. &lt;a href="https://platform.openai.com/docs/guides/prompt-caching"&gt;Prompt caching - OpenAI API&lt;/a&gt;). To leverage this capability effectively, we set up a structured prompt format that maximized cache hits. The prompt was organized with the static part, such as transformation examples, at the top and the dynamic part (the file content) at the end, so that caching could be leveraged while transforming different files.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Example prompt structure
&lt;span class="gu"&gt;## Transformation prompt  (static)&lt;/span&gt;
{transformation_context}
{For each component in group}
&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;{interface_details}
&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;{mapping_instruction}
&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;{examples}

&lt;span class="gu"&gt;## Content to be transformed (dynamic)&lt;/span&gt;
&amp;lt;file&amp;gt;
 {file_content}
&amp;lt;/file&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
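&lt;p&gt;A prompt layout like the one above could be assembled as in the following sketch. The class and function names are hypothetical and the file wrapper tags are omitted for brevity. The point is only that everything shared by a component group is joined once and the per-file content is appended last, so every request for that group starts with the same prefix.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentPrompt:
    """Prompt material for one component (all fields are illustrative)."""
    interface_details: str
    mapping_instruction: str
    examples: str


def build_static_prefix(context: str, components: list[ComponentPrompt]) -> str:
    """Join everything that is identical for every file in a component group."""
    parts = ["## Transformation prompt (static)", context]
    for component in components:
        parts += [
            "* " + component.interface_details,
            "* " + component.mapping_instruction,
            "* " + component.examples,
        ]
    return "\n".join(parts)


def build_prompt(static_prefix: str, file_content: str) -> str:
    """Append the per-file content last so the shared prefix stays identical."""
    return static_prefix + "\n\n## Content to be transformed (dynamic)\n" + file_content
```

&lt;p&gt;Building the prefix once per group and reusing it for each file keeps the cacheable part of the prompt unchanged from one request to the next.&lt;/p&gt;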

&lt;h2&gt;Experience with LLM-Powered migration&lt;/h2&gt;
&lt;p&gt;The results of our LLM-powered migration project exceeded our initial expectations and cemented LLMs as one of our tools for similar complex migrations in the future. While using LLMs for this migration, we gained valuable insights into their power and limitations. Here's what we learned when putting LLMs to work on a real-world engineering challenge.&lt;/p&gt;
&lt;h3&gt;Cost effectiveness&lt;/h3&gt;
&lt;p&gt;When evaluating LLMs for large-scale code migrations, cost can become a critical factor. While an exact figure is hard to give because codebases vary, we can provide a rough estimate based on average metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Average prompt size per component group: ~45K tokens&lt;/li&gt;
&lt;li&gt;Average output file size: ~2K tokens&lt;/li&gt;
&lt;li&gt;Total component groups: ~10 (each containing 3 components on average)&lt;/li&gt;
&lt;li&gt;Average number of files transformed per component group: ~30&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Based on the &lt;a href="https://openai.com/api/pricing/"&gt;GPT-4o pricing&lt;/a&gt;, this comes to less than $40 for each code repository. The actual cost could be lower still if prompt caching applies.&lt;/p&gt;
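&lt;p&gt;The estimate can be reproduced with a few lines of arithmetic. The token counts are the averages listed above. The per-token prices are an assumption (USD 2.50 per million input tokens and USD 10.00 per million output tokens) and should be checked against the pricing page. The sketch also assumes one request per file and ignores caching discounts and extra "continue" turns.&lt;/p&gt;

```python
# Averages quoted in the list above.
PROMPT_TOKENS_PER_FILE = 45_000   # prompt size for one component group
OUTPUT_TOKENS_PER_FILE = 2_000    # size of one transformed file
COMPONENT_GROUPS = 10
FILES_PER_GROUP = 30

# Assumed list prices in USD per 1M tokens; verify against the pricing page.
INPUT_PRICE_PER_MILLION = 2.50
OUTPUT_PRICE_PER_MILLION = 10.00


def estimate_cost_usd() -> float:
    """Estimate the API cost for one repository, assuming one request per file."""
    requests = COMPONENT_GROUPS * FILES_PER_GROUP
    input_cost = requests * PROMPT_TOKENS_PER_FILE * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = requests * OUTPUT_TOKENS_PER_FILE * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


print(f"{estimate_cost_usd():.2f}")  # 39.75
```

&lt;p&gt;Under these assumed prices, 13.5 million input tokens and 0.6 million output tokens come to roughly $39.75, which is consistent with the figure of less than $40.&lt;/p&gt;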
&lt;p&gt;While precisely quantifying the saved development effort is complex, the LLM-based approach achieved an accuracy of about 90% when migrating components across the volume of files reflected in the metrics above. This suggests that the approach delivered significant time and resource savings and was highly cost-effective.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example transformation by LLM" src="https://engineering.zalando.com/posts/2025/02/images/example-change-by-llm.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;LLM Migration in action: sample code transformation&lt;/figcaption&gt;

&lt;h3&gt;What worked well&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High Accuracy:&lt;/strong&gt; We achieved an overall accuracy of more than 90% for the component migration, with even higher accuracy for components of low to medium complexity. This reduced the number of manual fixes needed after the LLM-powered migration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code comprehension:&lt;/strong&gt; LLMs have a good understanding of the different elements of code and their relationships. This was very useful in handling different edge cases encountered during the migration. This is a powerful capability, compared to traditional alternatives like codemods, where we need to explicitly code every edge case.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Typography&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Header&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Headline&lt;/span&gt;
&lt;span class="c1"&gt;// The LLM can infer that Header is an alias for Typography.Headline and replace it as per the instructions&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/Header&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Contextual intelligence:&lt;/strong&gt; LLMs demonstrated contextual awareness during the migration process and were able to fill in the gaps in instructions based on provided examples and context. For example, the LLM was able to provide correct default values during transformation even when explicit instructions were missing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accelerated development:&lt;/strong&gt; Through LLMs we were able to generate the migration prompts and develop the tool faster than using traditional alternatives like codemods, which typically require more extensive development time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Challenges and Limitations&lt;/h3&gt;
&lt;p&gt;Despite the strengths of LLMs, we encountered limitations, both LLM-specific and project-specific, that prevented full automation.&lt;/p&gt;
&lt;p&gt;Some of the LLM limitations that we encountered:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reliability issues&lt;/strong&gt;: Even with carefully crafted prompts, LLMs occasionally deviated from the provided instructions or made unexpected changes. Similarly, LLMs sometimes generated plausible-looking but incorrect code, for example, adding a property that does not exist on the component.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;“Moody” behaviour&lt;/strong&gt;: We observed that the LLM tools occasionally produced inconsistent outputs. These issues appeared without any clear reason, sometimes simply by rerunning the same prompt on the same file at a different time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time consumption&lt;/strong&gt;: Processing times ranged between 30 and 200 seconds per file, making large-scale migrations time-intensive. While not a major issue, as the tool could transform files in the background, it made conducting quick, small-scale experiments more challenging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No visual understanding&lt;/strong&gt;: LLMs are unable to verify visual implications of the changes when migrating between design systems with different fundamental units. In our case, the source and target libraries had differences like different spacing scales and grid systems (12 vs 24 columns). This limitation meant that while a page could be syntactically migrated correctly, the layout may appear broken upon deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We also encountered several project-specific challenges in our migration. These included differences in design philosophies of the two UI component libraries, difficulties in migrating test suites due to inconsistent practices, gaps in feature availability between the libraries, and variations in codebases and styling practices across applications. These challenges often required significant manual work and refactoring, as LLMs could not handle such complex transformations accurately. While these obstacles highlight the challenges with automated migration, they also demonstrate the importance of proper planning and setting realistic expectations when undertaking similar projects.&lt;/p&gt;
&lt;h3&gt;Lessons learned&lt;/h3&gt;
&lt;p&gt;While our experience confirmed LLMs as valuable tools for complex migrations, we learned a few lessons on how to use them effectively for such use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Embrace an iterative approach&lt;/strong&gt;: We found that there is no universal formula for increasing the effectiveness of an LLM. Our work required continuous experimentation and refinement. We arrived at our prompts by testing different prompt variations, analyzing results, and incorporating feedback.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide code examples&lt;/strong&gt;: Including specific code examples enhanced migration accuracy. When we supplemented transformation instructions with examples, the LLM's ability to handle similar patterns improved. This was particularly visible in complex component migrations where abstract instructions alone proved insufficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human oversight is crucial&lt;/strong&gt;: While LLMs demonstrated impressive capabilities, human oversight and verification are crucial at every stage of working with them. Code reviews and thorough visual testing are needed to catch subtle issues that LLMs might introduce. Consider the following case, where a visual review is needed:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;// Transform 24 to 12 grid&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;15&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;// Two possible options&lt;/span&gt;
&lt;span class="c1"&gt;// Option 1 - rounded up - page breaks into two lines&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c1"&gt;// Option 2 - rounded down, extra whitespace&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;Column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mf"&gt;7&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;Grid&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
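&lt;p&gt;The arithmetic behind the two options is easy to check. The sketch below uses a hypothetical &lt;code&gt;convert_spans&lt;/code&gt; helper to scale the spans from the 24-column grid to the 12-column grid and shows why neither rounding direction fills the row exactly.&lt;/p&gt;

```python
import math


def convert_spans(spans: list[int], source_columns: int = 24, target_columns: int = 12):
    """Scale column spans between grids and return the round-up and round-down options."""
    scale = target_columns / source_columns
    rounded_up = [math.ceil(span * scale) for span in spans]
    rounded_down = [math.floor(span * scale) for span in spans]
    return rounded_up, rounded_down


rounded_up, rounded_down = convert_spans([9, 15])
print(rounded_up, sum(rounded_up))      # [5, 8] 13: one column too many, so the row wraps
print(rounded_down, sum(rounded_down))  # [4, 7] 11: one column is left empty
```

&lt;p&gt;Because 9 and 15 both map to half-column values (4.5 and 7.5), no purely numeric rule yields a total of 12. Someone has to look at the page and decide which column should absorb the difference.&lt;/p&gt;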

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tool evaluation&lt;/strong&gt;: It is important to evaluate available LLM tools before embarking on similar migration projects. Our initial approach of manually copying and pasting source code into LLM prompts proved time-consuming and error-prone. The adoption of continue.dev improved our workflow by automating source code handling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Effective prompt engineering:&lt;/strong&gt; Our success relied on evaluating different prompt engineering strategies and adapting them to our use case. For example, breaking down complex transformations into discrete steps, or pairing instructions with practical examples, increased migration accuracy and enhanced the LLM's reasoning capabilities.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt best practices&lt;/strong&gt;: Follow established prompt engineering best practices (e.g., &lt;a href="https://platform.openai.com/docs/guides/prompt-engineering"&gt;Prompt engineering - OpenAI&lt;/a&gt;, &lt;a href="https://docs.github.com/en/copilot/using-github-copilot/copilot-chat/prompt-engineering-for-copilot-chat"&gt;Prompt engineering for Copilot Chat - GitHub Docs&lt;/a&gt;) to ensure consistent and accurate results. For example, prompts should be clear and concise. Consider the following prompts:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gu"&gt;## Example 1&lt;/span&gt;
// ❌ prompt  not clear on how to handle when no button components are present
// transformed unrelated components if no button was found in the file
Migrate button components in the file as per the below instruction
//...

// ✅ clear instructions on transforming only when button components are present
// worked as expected
Migrate button components, if it exists, in the file as per the below instruction

&lt;span class="gu"&gt;## Example 2&lt;/span&gt;
// ❌ no clear mapping between sizes
&amp;quot;Map the sizes(small, medium) to (small, medium, large) appropriately&amp;quot;

// ✅ clear mapping between sizes
&amp;quot;Map size variants as follows:
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;small -&amp;gt; size=&amp;#39;small&amp;#39;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;medium -&amp;gt; size=&amp;#39;large&amp;#39; (for visual equivalence)
 Note: Handle undefined size with &amp;#39;medium&amp;#39; default&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Looking Forward&lt;/h2&gt;
&lt;p&gt;The use of LLMs in this project wasn't just about solving our immediate migration needs, but also to evaluate the feasibility of LLMs for tackling large-scale code transformation challenges with high degree of accuracy. While LLMs have limitations they've proven to be powerful tools. As we wrap up this phase of our UI migration, we're already identifying other areas where this approach could provide value and this time we would have a better idea how to approach such challenges.&lt;/p&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="Frontend"/></entry><entry><title>Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster</title><link href="https://engineering.zalando.com/posts/2025/02/scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster.html" rel="alternate"/><published>2025-02-17T00:00:00+01:00</published><updated>2025-02-17T00:00:00+01:00</updated><author><name>Mustafa Abdelrahman</name></author><id>tag:engineering.zalando.com,2025-02-17:/posts/2025/02/scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster.html</id><summary type="html">&lt;p&gt;A proxy server contributing to a stable Kubernetes cluster and scaling ingress controller.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;At Zalando, we faced a critical challenge: our ingress controller was threatening to overload our Kubernetes cluster. We needed a solution that could handle the increasing traffic and scale efficiently. This is the story of how we implemented a Route Server to manage control plane traffic more effectively and ensure a stable cluster.&lt;/p&gt;
&lt;h3&gt;Skipper: Our Ingress Controller&lt;/h3&gt;
&lt;p&gt;We use &lt;a href="https://opensource.zalando.com/skipper"&gt;Skipper&lt;/a&gt;, our HTTP reverse proxy for service composition, to implement the control plane and data plane of &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes ingress&lt;/a&gt; and &lt;a href="https://opensource.zalando.com/skipper/kubernetes/routegroup-crd/"&gt;RouteGroups&lt;/a&gt;. Creating an &lt;code&gt;Ingress&lt;/code&gt; or &lt;code&gt;RouteGroup&lt;/code&gt; results in an AWS LB &lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; with TLS termination targeting Skipper via &lt;a href="https://github.com/zalando-incubator/kube-ingress-aws-controller"&gt;kube-ingress-aws-controller&lt;/a&gt;, HTTP routes in Skipper, and a DNS name pointing to the LB via &lt;a href="https://github.com/kubernetes-sigs/external-dns"&gt;external-dns&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ingress Stack" src="https://engineering.zalando.com/posts/2025/02/images/ingress-stack-components.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Ingress Stack&lt;/figcaption&gt;

&lt;p&gt;To understand the deployment context, this is the scale we operate at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;15,000 Ingresses and 5,000 RouteGroups.&lt;/li&gt;
&lt;li&gt;Traffic of up to 2,000,000 requests per second.&lt;/li&gt;
&lt;li&gt;80-90% of our traffic consists of authenticated service-to-service calls, with daily numbers between 500,000 and 1,000,000 rps across our service fleet in total.&lt;/li&gt;
&lt;li&gt;200 Kubernetes clusters.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Challenge&lt;/h2&gt;
&lt;h3&gt;Scaling Pain Points&lt;/h3&gt;
&lt;p&gt;Skipper instances were fetching Ingresses and RouteGroups from the Kubernetes API, which worked well initially. But the rapid growth in the number of Skipper instances, reaching approximately 180 per cluster, began to overwhelm our etcd infrastructure.&lt;/p&gt;
&lt;p&gt;This overload cascaded into severe Kubernetes API CPU throttling issues. These performance bottlenecks led to critical control plane stability risks, manifesting in two primary ways: our clusters lost the ability to schedule new pods effectively, and existing pod management operations began to fail. This combination of issues threatened the overall stability and reliability of our Kubernetes infrastructure.&lt;/p&gt;
&lt;h3&gt;Implementing Route Server&lt;/h3&gt;
&lt;p&gt;Before Route Server, Skipper was responsible for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Polling Kubernetes API for Ingresses and RouteGroups.&lt;/li&gt;
&lt;li&gt;Parsing and processing the resources to Eskip &lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt; format.&lt;/li&gt;
&lt;li&gt;Validating generated Eskip format.&lt;/li&gt;
&lt;li&gt;Updating the routing table.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We introduced &lt;a href="https://pkg.go.dev/github.com/zalando/skipper/routesrv"&gt;Route Server&lt;/a&gt;, a custom proxy layer that handles the control plane traffic more efficiently and
acts as a proxy with an &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag"&gt;HTTP ETag cache layer&lt;/a&gt; between Skipper and the Kubernetes API server.&lt;/p&gt;
&lt;p&gt;Now Route Server handles the polling and parsing operations, reducing Skipper's computational overhead while implementing a clear separation of concerns.&lt;/p&gt;
&lt;h3&gt;Cache Layer&lt;/h3&gt;
&lt;p&gt;Route Server polls the Kubernetes API at &lt;em&gt;3-second&lt;/em&gt; intervals to fetch the latest &lt;code&gt;Ingresses&lt;/code&gt; and &lt;code&gt;RouteGroups&lt;/code&gt;. It then generates both a routing table and a corresponding ETag value. When Skipper requests updates from Route Server, it includes its current ETag. If this matches Route Server's current ETag, indicating no changes, Route Server responds with an &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/304"&gt;HTTP 304 (Not Modified) status&lt;/a&gt;. However, if the ETags differ, Route Server sends the updated routing table to Skipper, which then updates its local configuration and stored ETag.&lt;/p&gt;
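&lt;p&gt;The ETag exchange described above can be sketched in a few lines. This is a minimal in-memory illustration, not Route Server's actual implementation: the class names &lt;code&gt;RouteServerSketch&lt;/code&gt; and &lt;code&gt;SkipperSketch&lt;/code&gt; are hypothetical, and deriving the ETag from a SHA-256 hash of the routing table is an assumption made for the example.&lt;/p&gt;

```python
import hashlib


class RouteServerSketch:
    """Hypothetical in-memory stand-in for Route Server's ETag cache."""

    def __init__(self):
        self.routes = ""
        self.etag = ""

    def refresh(self, routes):
        # Called after each poll of the Kubernetes API: store the generated
        # routing table and derive an ETag from its content.
        # Assumption: the ETag is a SHA-256 hash of the table.
        self.routes = routes
        self.etag = hashlib.sha256(routes.encode("utf-8")).hexdigest()

    def handle(self, client_etag):
        # Mimic a conditional GET: return (status, body, etag).
        if client_etag == self.etag:
            return 304, None, self.etag  # Not Modified, no body sent
        return 200, self.routes, self.etag


class SkipperSketch:
    """Hypothetical client that keeps the last known table and ETag."""

    def __init__(self):
        self.routes = ""
        self.etag = None

    def poll(self, server):
        status, body, etag = server.handle(self.etag)
        if status == 200:
            # ETags differ: replace the local routing table and stored ETag.
            self.routes = body
            self.etag = etag
        return status


if __name__ == "__main__":
    server = RouteServerSketch()
    server.refresh('r1: Path("/a") -> "http://a.example";')
    skipper = SkipperSketch()
    print(skipper.poll(server))  # first poll transfers the full table
    print(skipper.poll(server))  # unchanged table, only a status is returned
```

&lt;p&gt;The point of the design is that an unchanged routing table costs only a small status response per Skipper poll, rather than a full transfer of the table.&lt;/p&gt;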
&lt;h3&gt;Route Server Not Available&lt;/h3&gt;
&lt;p&gt;While the Route Server significantly improved our system's efficiency, we also had to consider potential failure scenarios.
There are 2 possibilities when Route Server is not available:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Skipper doesn't have an initial routing table.&lt;/li&gt;
&lt;li&gt;Skipper has an initial routing table but Route Server is not available to update it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the first case, the Skipper container won't start if the &lt;a href="https://github.com/zalando-incubator/kubernetes-on-aws/blob/4f2e04e3e056ba6c647d85e64fd842ada44deff3/cluster/manifests/skipper/deployment.yaml#L171"&gt;&lt;code&gt;-wait-first-route-load&lt;/code&gt; flag&lt;/a&gt; is enabled. In the second case, Skipper will continue to work with the last known routing table. This is a trade-off between availability and consistency.&lt;/p&gt;
&lt;p&gt;In both cases, we get an alert and decide to either fix the Route Server or disable it and let Skipper work without it. Currently, we don't have an automatic fallback mechanism to the old approach.&lt;/p&gt;
&lt;p&gt;The final flow with Route Server integrated is as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Final Flow" src="https://engineering.zalando.com/posts/2025/02/images/final-flow.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Final Flow&lt;/figcaption&gt;

&lt;h3&gt;Roll Out Strategy&lt;/h3&gt;
&lt;p&gt;Rolling out Route Server wasn't a simple task. A single mistake could break the connection between Kubernetes API and Skipper, potentially impacting our sales and gross merchandise volume (GMV). We needed to be extremely cautious and follow a well-structured rollout strategy.&lt;/p&gt;
&lt;p&gt;We planned to roll out Route Server in a controlled manner, starting with test clusters. Production clusters were categorized into tiers, with Route Server deployed tier by tier, each monitored before proceeding to the next.&lt;/p&gt;
&lt;p&gt;To do this, we defined different setup modes for rolling out the Route Server:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mode: False&lt;/strong&gt; - Disabled mode&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mode: Pre&lt;/strong&gt; - Pre-processing mode&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mode: Exec&lt;/strong&gt; - Execution mode&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These modes are controlled via a &lt;a href="https://github.com/zalando-incubator/kubernetes-on-aws/blob/32658df25ee29049d5495cf422f85ab536b599bb/cluster/config-defaults.yaml#L223"&gt;configuration item&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Default mode is &lt;code&gt;false&lt;/code&gt; which means Route Server is disabled, and we use the regular Control Plane traffic.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Regular Flow" src="https://engineering.zalando.com/posts/2025/02/images/regular-flow.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Regular Flow&lt;/figcaption&gt;

&lt;h4&gt;Pre-Processing Mode&lt;/h4&gt;
&lt;p&gt;In this mode, Route Server works alongside Skipper, fetches &lt;code&gt;Ingresses&lt;/code&gt; and &lt;code&gt;RouteGroups&lt;/code&gt; resources from the Kubernetes API and preprocesses them. This mode is useful for testing and debugging, which was a key factor in our rollout strategy.&lt;/p&gt;
&lt;p&gt;We were able to get the routing table from both Skipper and Route Server and compare them to ensure that Route Server is working as expected. Remember, if our routing table is broken for some reason, we will have downtime.
That's why we had to be extra cautious and check every small difference in the routing table across all clusters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# very big limit to get all routes for skipper&lt;/span&gt;
➜&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;http://127.0.0.1:9911/routes&lt;span class="se"&gt;\?&lt;/span&gt;limit&lt;span class="se"&gt;\=&lt;/span&gt;&lt;span class="m"&gt;10000000000000&lt;/span&gt;&lt;span class="se"&gt;\&amp;amp;&lt;/span&gt;nopretty&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;skipper_routes.eskip
&lt;span class="c1"&gt;# get all routes for Route Server, we decided not to use pagination to reduce number of requests and Skipper is currently the only consumer&lt;/span&gt;
➜&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;http://127.0.0.1:9090/routes&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;routesrv_routes.eskip
➜&lt;span class="w"&gt; &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff&lt;span class="w"&gt; &lt;/span&gt;--no-index&lt;span class="w"&gt; &lt;/span&gt;--&lt;span class="w"&gt; &lt;/span&gt;skipper_routes.eskip&lt;span class="w"&gt; &lt;/span&gt;routesrv_routes.eskip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Pre-Processing Mode" src="https://engineering.zalando.com/posts/2025/02/images/pre-mode.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Pre-Processing Mode&lt;/figcaption&gt;

&lt;h4&gt;Execution Mode&lt;/h4&gt;
&lt;p&gt;In this mode, Route Server acts as a proxy between Skipper and the Kubernetes API. Skipper sends requests to the Route Server, which then forwards them to the Kubernetes API. The Route Server caches the responses and sends them back to Skipper. This mode is the final setup for production.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Execution Mode" src="https://engineering.zalando.com/posts/2025/02/images/add-routesrv.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Execution Mode&lt;/figcaption&gt;

&lt;h2&gt;Production Rollout&lt;/h2&gt;
&lt;p&gt;After thorough (load)testing, we rolled out the Route Server to production in a controlled manner:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Rolled out to all test clusters and monitored for 2 weeks.&lt;/li&gt;
&lt;li&gt;Deployed to production clusters tier by tier, monitoring each tier before proceeding.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Alternative Solutions&lt;/h2&gt;
&lt;p&gt;We considered using &lt;a href="https://pkg.go.dev/k8s.io/client-go/informers"&gt;Kubernetes Informers&lt;/a&gt; to watch for changes in the Kubernetes API.
However, this approach would still require the Kubernetes API to send information to all Skipper instances, which may lead to the same issues we faced: the result is a sudden increase in traffic,
and HPA won't be able to catch up and scale the Kubernetes API and etcd.&lt;/p&gt;
&lt;h2&gt;Future Improvements&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automatic Fallback&lt;/strong&gt;: Implement a fallback mechanism to ensure Skipper can continue to operate if Route Server is unavailable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Achieved zero downtime and no gross merchandise volume (GMV) loss during rollout.&lt;/li&gt;
&lt;li&gt;Extended Skipper HPA to 300 pods.&lt;/li&gt;
&lt;li&gt;One RouteSRV deployment can handle up to 100 RPS, equivalent to ~300 Skipper pods, with no issues.&lt;/li&gt;
&lt;li&gt;Route Server is now a core component of our platform.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;AWS LoadBalancer can be ALB or NLB depending on &lt;a href="https://github.com/zalando-incubator/kube-ingress-aws-controller?tab=readme-ov-file#ingress-annotations"&gt;kube-ingress-aws-controller annotation&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a href="https://pkg.go.dev/github.com/zalando/skipper/eskip"&gt;Eskip&lt;/a&gt; implements an in-memory representation of Skipper routes and a DSL for describing Skipper route expressions, route definitions and complete routing tables.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="Platform Engineering"/><category term="Kubernetes"/><category term="Golang"/><category term="Skipper"/><category term="Backend"/></entry><entry><title>JSON Web Keys (JWK): Rotating Cryptographic Keys at Zalando</title><link href="https://engineering.zalando.com/posts/2025/01/automated-json-web-key-rotation.html" rel="alternate"/><published>2025-01-21T00:00:00+01:00</published><updated>2025-01-21T00:00:00+01:00</updated><author><name>Jan Brennenstuhl</name></author><id>tag:engineering.zalando.com,2025-01-21:/posts/2025/01/automated-json-web-key-rotation.html</id><summary type="html">&lt;p&gt;Secret rotation is a vital security measure in many contexts. Learn how we automate this process using JSON Web Keys (JWKs) to enhance the security of our customer identity provider.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Enhancing the Security of Our Customer Identity Platform Through Automated Key Rotation&lt;/h2&gt;
&lt;p&gt;Static secrets are evil. Whether secret keys hard-coded in source code, tokens without expiry or plaintext API keys referenced in configuration files, static secrets are ticking time bombs. The same is true for cryptographic key material in the context of &lt;a href="https://engineering.zalando.com/posts/2017/07/the-purpose-of-jwt-stateless-authentication.html"&gt;JSON Web Tokens (JWTs)&lt;/a&gt; and OpenID Connect (OIDC).&lt;/p&gt;
&lt;p&gt;At Zalando, our customer authentication experience team takes protecting our customers' data and their digital identities seriously. Part of our toolbox is an OpenID Connect (OIDC)-based identity provider (IdP). A key aspect of this system's security is the regular rotation of cryptographic keys, which we've automated to ensure the ongoing safety of our platform.&lt;/p&gt;
&lt;p&gt;This article aims to shed light on why we rotate cryptographic keys, how the periodical JWK rotation process works, and what it means for customers.&lt;/p&gt;
&lt;h2&gt;What are JSON Web Keys (JWKs)?&lt;/h2&gt;
&lt;p&gt;JSON Web Keys (JWKs) are an essential part of the &lt;a href="https://www.iana.org/assignments/jose/jose.xhtml" title="IANA Assignment: JSON Object Signing and Encryption"&gt;JSON Object Signing and Encryption (JOSE)&lt;/a&gt; standards family and the backbone of token-based authentication and authorization frameworks like OIDC. JWK standardises the representation and management of cryptographic keys (&lt;a href="https://datatracker.ietf.org/doc/html/rfc7517" title="RFC 7517: JSON Web Key (JWK)"&gt;RFC 7517&lt;/a&gt;). Its JSON data structure allows the &lt;a href="https://www.janbrennenstuhl.eu/jwks-json-web-key-set/" title="Mastering JWKS: JSON Web Key Sets Explained"&gt;exchange of public keys&lt;/a&gt; in a web-native format.&lt;/p&gt;
&lt;p&gt;Identity providers (IdPs) like ours commonly use JWKs to distribute public key material via well-known and specified URIs. Clients can use the key material to e.g. verify digitally signed JSON Web Tokens (JWTs) issued by the IdP. These tokens contain information about users and their access rights, and their integrity is crucial for preventing unauthorized access.&lt;/p&gt;
&lt;h2&gt;Why is Key Rotation Important?&lt;/h2&gt;
&lt;p&gt;Rotation, in the context of secrets, passwords or cryptographic key material, describes the periodic replacement of old with new. This process is one of the &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#27-secret-lifecycle" title="OWASP: Secrets Management"&gt;four phases of the secret lifecycle&lt;/a&gt;. It addresses the threat of undetected key compromise and reduces the window of vulnerability for potential exploits.&lt;/p&gt;
&lt;p&gt;If a signing key’s private part is compromised, anyone could forge fake tokens. These tokens could then be used to impersonate users and access sensitive data. Essentially, all tokens signed with the leaked key would become untrustworthy. Regularly rotating cryptographic keys is hence a fundamental security practice. &lt;strong&gt;It is paramount that identity providers store long-lived key material securely and rotate it regularly.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Our Approach to JWK Rotation&lt;/h2&gt;
&lt;p&gt;The key rotation process for our identity provider here at Zalando is built around four major principles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automation:&lt;/strong&gt; New keys are generated and old keys are retired automatically, eliminating manual intervention and ensuring consistency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scheduled Rotation:&lt;/strong&gt; Keys are rotated on a regular basis to minimize the window of vulnerability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure Key Management:&lt;/strong&gt; Our keys are securely stored and managed using industry best practices to protect them from unauthorized access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless Rotation:&lt;/strong&gt; Planned rotations are transparent to clients and do not result in any kind of access revocation or token invalidation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The robust and automated key rotation process we’ve implemented follows a careful, phased approach to ensure a smooth transition and minimize disruption for our clients and customers:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Diagram of our JWK rotation process" src="https://engineering.zalando.com/posts/2025/01/images/json-web-key-rotation.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Diagram of our JWK rotation process&lt;/figcaption&gt;

&lt;p&gt;First, a new key pair is generated. We then publish the public key portion of this new pair on &lt;a href="https://accounts.zalando.com/.well-known/jwk_uris"&gt;our JWK endpoint&lt;/a&gt;,
making it available to our clients. To avoid any immediate disruptions, we incorporate a grace period, allowing clients ample time to fetch the latest set of JWKs – cache control headers matter!
After this period, the new key is elected as the new active signing key. The previous active key is retired, meaning it's no longer used for signing new tokens, but its public key remains available on the JWK endpoint to ensure that previously issued tokens can still be verified.&lt;/p&gt;
&lt;p&gt;Finally, once a retired key surpasses the maximum lifetime of any token it might have signed, we remove its public key from the JWK endpoint.
To determine when it's safe to remove a key, we need to know which key signed which token and how long those tokens are valid.
Our JWTs include a key ID that tells us exactly which key was used to create them. We also control how long each token lasts before it expires.&lt;/p&gt;
&lt;p&gt;With this information, we can easily calculate when a key can be safely deleted. We simply take the time the key was retired,
add the maximum token lifespan, and add a little extra time just to be safe. At that point, any token signed with that
key will have expired, so it's safe to remove the key from our public list.&lt;/p&gt;
&lt;p&gt;Here's the simple formula:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Time of key retirement + Maximum token lifespan + Extra safety time = Time to drop the public key
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This ensures that we don't accidentally remove a key that's still needed to verify valid tokens.&lt;/p&gt;
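&lt;p&gt;The formula translates directly into date arithmetic. The following is a minimal sketch; the function names and the concrete durations (a 30-day maximum token lifespan and a one-day safety margin) are illustrative assumptions, not our actual configuration.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def public_key_removal_time(retired_at, max_token_lifespan, safety_margin):
    # Time of key retirement + maximum token lifespan + extra safety time.
    return retired_at + max_token_lifespan + safety_margin


def can_remove_public_key(now, retired_at, max_token_lifespan, safety_margin):
    # True once every token the retired key may have signed has expired.
    removal_time = public_key_removal_time(
        retired_at, max_token_lifespan, safety_margin
    )
    return now >= removal_time


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    retired_at = datetime(2025, 1, 1, tzinfo=timezone.utc)
    max_token_lifespan = timedelta(days=30)
    safety_margin = timedelta(days=1)
    print(public_key_removal_time(retired_at, max_token_lifespan, safety_margin))
```

&lt;p&gt;With these example values, a key retired on 1 January could be dropped from the JWK endpoint on 1 February.&lt;/p&gt;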
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Protecting our customers' data is a top priority at Zalando. Automated key rotation using JWKs is just one of the many ways we demonstrate our commitment to security.
We believe that this approach balances security with operational stability, allowing us to rotate keys effectively while minimizing any impact on our clients.
By regularly rotating our cryptographic keys, we ensure that our customer identity platform remains resilient against potential threats without compromising on experience.&lt;/p&gt;</content><category term="Zalando"/><category term="Security"/><category term="Backend"/></entry><entry><title>Introducing Lightstep Receiver for OpenTelemetry Collector</title><link href="https://engineering.zalando.com/posts/2025/01/otelcollector-lightstep-receiver-oss.html" rel="alternate"/><published>2025-01-21T00:00:00+01:00</published><updated>2025-01-21T00:00:00+01:00</updated><author><name>Konstantin Zhukov</name></author><id>tag:engineering.zalando.com,2025-01-21:/posts/2025/01/otelcollector-lightstep-receiver-oss.html</id><summary type="html">&lt;p&gt;OpenTelemetry Lightstep Receiver helps you ingest traces generated by legacy Lightstep tracers in a simple way.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; is a vendor-neutral, flexible standard that supports traces, metrics, and logs all in one place.
Organizations that adopted older tracing solutions like OpenTracing or custom legacy tracer libraries to instrument their applications face a migration task.&lt;/p&gt;
&lt;p&gt;Today, we’re excited to announce the &lt;a href="https://github.com/zalando/otelcol-lightstep-receiver"&gt;Lightstep Receiver&lt;/a&gt; for OpenTelemetry Collector, a component capable of receiving tracing traffic from legacy Lightstep tracers, converting it to OpenTelemetry and propagating it via the OpenTelemetry Collector's traces pipeline.&lt;/p&gt;
&lt;p&gt;If your application is instrumented with a legacy Lightstep tracer library that supports only Lightstep satellites as trace collectors, the OpenTelemetry Collector together with the Lightstep Receiver provides a straightforward way to export your trace data to any backend system that the OpenTelemetry Collector supports. For Zalando, this makes it possible to bring our application fleet into OpenTelemetry without switching the applications themselves from Lightstep tracers to OpenTelemetry.&lt;/p&gt;
&lt;h2&gt;Supported tracers and protocols&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/lightstep/lightstep-tracer-python"&gt;lightstep-tracer-python&lt;/a&gt;: both Thrift binary and Protobuf&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lightstep/lightstep-tracer-javascript"&gt;lightstep-tracer-javascript&lt;/a&gt;: Thrift JSON over http&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lightstep/lightstep-tracer-go"&gt;lightstep-tracer-go&lt;/a&gt;: Protobuf over grpc&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;code&gt;lightstepreceiver&lt;/code&gt; receives OpenTracing traces from Lightstep tracers, converts them into OpenTelemetry traces, and propagates them further through the pipelines:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Trace Ingestion Pipeline" src="https://engineering.zalando.com/posts/2025/01/images/pipeline.png"&gt;&lt;/p&gt;
&lt;h2&gt;Configuration steps&lt;/h2&gt;
&lt;p&gt;Here's a simple guide to set up your OpenTelemetry Collector and send your traces to any backend of your choice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Build the custom OpenTelemetry Collector&lt;/strong&gt;: You need to make a custom build of the OpenTelemetry Collector (this step will no longer be required if the Lightstep Receiver is included in the standard OpenTelemetry Collector Contrib pack in the future). To do so, follow the standard OCB build &lt;a href="https://opentelemetry.io/docs/collector/custom-collector/"&gt;routine&lt;/a&gt;, adding the Lightstep Receiver definition to the &lt;code&gt;receivers&lt;/code&gt; section:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;receivers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;gomod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;github.com/zalando/otelcol-lightstep-receiver &amp;lt;put version here&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;lightstepreceiver&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set Up the OpenTelemetry Collector&lt;/strong&gt;: Next, install the OpenTelemetry Collector. You can deploy the Collector as a standalone service, within a Kubernetes cluster, or as an agent on each service node. It will act as the central point for receiving, processing, and exporting your trace data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the Collector to Receive Lightstep Traces&lt;/strong&gt;: In your &lt;code&gt;otelcol-config.yaml&lt;/code&gt; configuration file, define the Lightstep Receiver:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;receivers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;lightstepreceiver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;protocols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;pbgrpc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.0.0.0:4317&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;pbhttp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.0.0.0:4327&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;thrift&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;0.0.0.0:4417&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configure the Exporter to Your Desired Backend&lt;/strong&gt;: After setting up the receiver, configure the exporters section to route your trace data to your desired backend. The example below uses only the &lt;code&gt;debug&lt;/code&gt; exporter, which prints data to the console log. The key benefit is that you can use any exporter that the OpenTelemetry Collector supports.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;exporters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;verbosity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;detailed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Set up the pipeline&lt;/strong&gt; with the Lightstep Receiver as the entry point of the traces pipeline:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;pipelines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;receivers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;lightstepreceiver&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;processors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;exporters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;debug&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Ingesting trace data originating from OpenTracing and legacy tracer libraries to OpenTelemetry doesn’t need to be a painful, all-at-once effort. With the OpenTelemetry Collector and Lightstep Receiver, you can easily route traces from your legacy tracing solutions to any backend, whether it’s Lightstep, AWS X-Ray, or any other system that supports OpenTelemetry.&lt;/p&gt;
&lt;p&gt;By using the OpenTelemetry Collector, you gain the flexibility to access raw data before ingestion, enabling you to apply any special processing you need: from custom sampling and enriching data with additional attributes to routing data into different target systems depending on certain criteria.&lt;/p&gt;
&lt;p&gt;Get started with the OpenTelemetry Collector and begin your migration to OpenTelemetry today. You can find more details and the Lightstep Receiver &lt;a href="https://github.com/zalando/otelcol-lightstep-receiver"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy tracing, and may your trace data flow seamlessly to any observability backend you choose! 🚀&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The Zalando Observability Team&lt;/em&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Exploring the Potential of Graph Neural Networks to Transform Recommendations at Zalando</title><link href="https://engineering.zalando.com/posts/2024/12/gnn-recommendations-zalando.html" rel="alternate"/><published>2024-12-19T00:00:00+01:00</published><updated>2024-12-19T00:00:00+01:00</updated><author><name>Mariia Bulycheva</name></author><id>tag:engineering.zalando.com,2024-12-19:/posts/2024/12/gnn-recommendations-zalando.html</id><summary type="html">&lt;p&gt;Delivering personalized recommendations is key to engaging Zalando’s customers, but traditional models can miss the complexity of user-content interactions. By integrating graph neural networks (GNNs), we’re exploring a cutting-edge approach to better predict clicks and enhance the shopping experience.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Recommender systems are vital for personalizing user experiences across various platforms. At Zalando, these systems play a crucial role in tailoring content to individual users, thereby enhancing engagement and satisfaction. This is particularly important for Zalando Homepage, which serves as the customers' first impression of the company. Our current recommendation system employed on the Home page excels by leveraging user-content interactions and optimizing for predicted click through rate (CTR). The research introduced in this post focuses primarily on the approach and design of integrating GNN into the existing recommender system. We aim to validate the feasibility and effectiveness of this integration before transitioning to a fully production-ready implementation.&lt;/p&gt;
&lt;h2&gt;The Problem Statement&lt;/h2&gt;
&lt;p&gt;Given a preselected pool of content that can potentially be shown to a user on the Zalando Homepage, we need to predict the CTR for each piece of content so that, later in the system, the content with the highest expected value (of which the predicted CTR is a part) can be shown to the user.&lt;/p&gt;
&lt;p&gt;Our production model relies on traditional tabular data capturing user-content interactions such as views and clicks. In contrast, GNNs have emerged as a powerful tool for modeling relational data, offering a way to represent and learn from complex interaction patterns more effectively. GNNs operate on data represented as graphs, and a recommender system can be naturally modeled as a bipartite graph with two node types, users and items, whose links connect users to items and indicate user-item interactions (e.g., click, view, order).&lt;/p&gt;
&lt;p&gt;Our task can then be formulated as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Given&lt;/strong&gt;: Past user-item interactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task&lt;/strong&gt;:&lt;ul&gt;
&lt;li&gt;Predict user-item interactions in the future&lt;/li&gt;
&lt;li&gt;Can be cast as link prediction problem: predict new user-item interaction links given the past links&lt;/li&gt;
&lt;li&gt;For 𝑢 ∈ 𝑼, 𝑣 ∈ 𝑽, we need to get a real-valued score 𝑓(𝑢, 𝑣)&lt;/li&gt;
&lt;li&gt;𝑲 items with the largest scores for a given user 𝑢 are then recommended&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
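&lt;p&gt;The scoring and top-𝑲 selection in this formulation can be illustrated with a small sketch. It assumes, purely for illustration, that 𝑓(𝑢, 𝑣) is the dot product of a user embedding and an item embedding; the function names and toy vectors are hypothetical.&lt;/p&gt;

```python
def dot(a, b):
    # Dot product of two equally sized vectors.
    return sum(x * y for x, y in zip(a, b))


def top_k_items(user_embedding, item_embeddings, k):
    # Score every candidate item for one user, then keep the k best.
    # Assumption: f(u, v) is a plain dot product of the two embeddings.
    scored = [
        (dot(user_embedding, embedding), item_id)
        for item_id, embedding in item_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    user = [1.0, 0.0]
    items = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.5]}
    print(top_k_items(user, items, 2))
```

&lt;p&gt;Any other real-valued scoring function, such as a learned classifier over both embeddings, fits the same selection step.&lt;/p&gt;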
&lt;p&gt;&lt;img alt="Example of a user-content bipartite graph" src="https://engineering.zalando.com/posts/2024/12/images/bipartite_user_item_graph.png#center"&gt;&lt;/p&gt;
&lt;h2&gt;Solution and Methods&lt;/h2&gt;
&lt;p&gt;While we can train a GNN directly to predict clicks (user-content links), in this experiment we propose to employ a graph neural network specifically to train embeddings for Zalando users and content on a click prediction task, and to use these embeddings as additional inputs to our production model. Node embeddings are inherently learned as part of the process when running a link prediction task, as the GNN generates these embeddings to capture the relational structure and features of nodes in the graph, which are then used to predict the presence or absence of links.&lt;/p&gt;
&lt;p&gt;We represent users and content on the Homepage as two types of nodes in a graph, and their interactions (views and clicks) as two types of links. We then design an architecture based on a GraphSAGE neural network and train it to predict the “clicked” link given a “viewed” link.&lt;/p&gt;
&lt;p&gt;&lt;img alt="GNN architecture to predict probability of a click" src="https://engineering.zalando.com/posts/2024/12/images/gnn_architecture.png#center"&gt;&lt;/p&gt;
&lt;h3&gt;Dataset and data sources for the GNN embeddings training&lt;/h3&gt;
&lt;p&gt;Training and evaluation datasets are prepared using the PyTorch Geometric library, which provides a rich set of functionalities, including efficient graph data loading, manipulation, and batching. The training and evaluation datasets are based on user-content activity data at a per-request level, labeled as clicked or not clicked.&lt;/p&gt;
&lt;p&gt;Graph data structure allows GNNs to capture higher-order interactions and dependencies that traditional methods might miss. For example, in a recommender system, a GNN can model not just the direct interactions between a user and an item but also how similar users have interacted with similar items or how users following the same brand might be interested in the same article.&lt;/p&gt;
&lt;h3&gt;GNN Architecture&lt;/h3&gt;
&lt;p&gt;The GNN propagates and aggregates features from nodes along the links of a graph to capture interactions between nodes. Initially, each node has a specific set of features. In our case, these are the most recently ordered articles for user nodes, and article representations for content nodes (each piece of content on the Zalando Homepage is associated with the specific articles presented in it). As the GNN operates, nodes send their features to adjacent nodes through a process called message passing, during which features may be transformed by neural network layers such as convolutions. Each node then combines the incoming features from its neighbors using an aggregation operation such as summing or averaging, and updates its own features. As the network depth increases, allowing more rounds of message passing, the GNN can consider more distant relationships. The GNN thus generates embeddings for all nodes, which are then passed through a classifier to predict the existence of the “clicked” link between a user and a content node. The model is trained with a binary cross-entropy loss.&lt;/p&gt;
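&lt;p&gt;To make message passing concrete, here is a deliberately simplified sketch of mean aggregation on a tiny bipartite graph. It is an assumption-laden toy: a real GraphSAGE layer also applies learned weight matrices and non-linearities, and the node names and feature values below are invented.&lt;/p&gt;

```python
def mean_aggregate(features: dict, neighbors: dict) -> dict:
    """One round of message passing: each node averages its own features with its neighbors'."""
    updated = {}
    for node, own in features.items():
        messages = [features[n] for n in neighbors.get(node, [])] + [own]
        updated[node] = [
            sum(message[i] for message in messages) / len(messages)
            for i in range(len(own))
        ]
    return updated


# Two users who both viewed the same piece of content (hypothetical data).
features = {"user_a": [1.0, 0.0], "user_b": [0.0, 1.0], "content_x": [0.0, 0.0]}
neighbors = {
    "user_a": ["content_x"],
    "user_b": ["content_x"],
    "content_x": ["user_a", "user_b"],
}

one_hop = mean_aggregate(features, neighbors)   # users only see content_x
two_hop = mean_aggregate(one_hop, neighbors)    # user_a now also reflects user_b
```

&lt;p&gt;After one round, &lt;code&gt;user_a&lt;/code&gt; has no signal in its second feature dimension. After two rounds it does, because information from &lt;code&gt;user_b&lt;/code&gt; has travelled through the shared content node, which is the effect of increasing network depth.&lt;/p&gt;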
&lt;p&gt;&lt;img alt="Graph neighborhood sampling and aggregation" src="https://engineering.zalando.com/posts/2024/12/images/graph_sampling_and_aggregation.png#center"&gt;&lt;/p&gt;
&lt;h3&gt;Graph Mini-batching&lt;/h3&gt;
&lt;p&gt;To handle large-scale data, we employ mini-batch training, sampling subgraphs and computing embeddings in parallel. We sample links together with neighborhoods of both of their adjacent nodes. The depth of the sampled neighborhood is equal to the depth of the GNN. This approach ensures scalability and efficient use of computational resources, allowing GNNs to handle real-world large-scale graph datasets. For each mini-batch we sample disjoint subgraphs. We also use disjoint sets of links for message passing and for the supervision signal to prevent information leakage.&lt;/p&gt;
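&lt;p&gt;The separation between message-passing links and supervision links can be illustrated with a small sketch. The helper name, the split ratio, and the toy links are assumptions for the example; it shows the idea of keeping the two sets disjoint rather than our actual data loading code.&lt;/p&gt;

```python
import random


def split_links(links, supervision_ratio: float, seed: int = 0):
    """Shuffle links and split them into disjoint message-passing and supervision sets."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    n_supervision = int(len(shuffled) * supervision_ratio)
    supervision = shuffled[:n_supervision]       # links the loss is computed on
    message_passing = shuffled[n_supervision:]   # links the GNN may aggregate over
    return message_passing, supervision


# Hypothetical (user, content) links.
links = [(f"user_{i % 4}", f"content_{i}") for i in range(10)]
message_passing, supervision = split_links(links, supervision_ratio=0.3)
```

&lt;p&gt;Because a supervised link never appears among the links used for aggregation, the model cannot simply read the label off the graph structure it is given.&lt;/p&gt;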
&lt;h3&gt;Integrating GNN trained features into our production model&lt;/h3&gt;
&lt;p&gt;While we evaluated offline that it is possible to directly predict clicks using a GNN model, integrating such a model into our current production system presents several challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Graph data generation&lt;/strong&gt;: generating and maintaining the graph data structure creates operational overhead because raw user activity data is logged in a tabular format and takes time to convert into a graph. This graph also needs to be updated in real time (within the user session) with new user interactions, which requires developing a new approach to data logging and training dataset generation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inference challenges&lt;/strong&gt;: inference on a graph is fundamentally different from inference on tabular data, as you need not only the information about a particular user-content pair, but rather all (or part of) the neighboring pairs as well. Aggregating information from a node’s neighbors can be computationally intensive and require specialized infrastructure to handle the graph operations efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: running GNN inference at scale, especially for a large number of users and pieces of content, can pose significant scalability challenges and may require distributed computing environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given these complexities, as an initial step we decided to use the embeddings generated by the GNN model for users and content as additional features in our existing production model. This approach leverages the strengths of GNNs while integrating more seamlessly with our current infrastructure: unlike running click predictions on GNNs end to end, it does not require significant operational changes.&lt;/p&gt;
&lt;p&gt;The GNN model can be retrained daily, ensuring that its features are regularly updated to reflect the latest user-content interactions. A key advantage of using a GNN is its ability to address the cold-start problem for nodes (e.g., newly introduced content) that were not part of the initial training. Even if a new node has no clicks yet, GNN inference can still be performed using the node's initial features and existing 'view' links formed during the content exploration phase. These initial features are dynamically updated as the node gains more connections and interactions within the graph.&lt;/p&gt;
&lt;p&gt;What makes GNN features particularly valuable, compared to static features of individual articles, is their ability to capture and adapt to the relational context in the graph. Unlike static features that rely solely on precomputed attributes, GNN-generated embeddings are task-specific and are trained directly for the click prediction objective. This allows the model to encode not only the intrinsic properties of the content but also its evolving relationships with users and other content, leading to more accurate and context-aware predictions.&lt;/p&gt;
&lt;h2&gt;Experiments and Results&lt;/h2&gt;
&lt;h3&gt;Evaluation approach and metric&lt;/h3&gt;
&lt;p&gt;We evaluate our new modeling approach in two stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We evaluate our GNN model on the user-content click binary classification task using the ROC-AUC metric. We conduct several experiments, varying the number of layers (or hops on the graph) and the neighborhood size for graph mini-batching, ultimately selecting the best-performing configuration. To support the offline evaluation of our main production model, we run GNN inference on both the training and evaluation datasets, generating and saving user and content embeddings for all nodes in the respective graphs.&lt;/li&gt;
&lt;li&gt;We feed the generated GNN embeddings for users and content, together with other features, to the main production model and evaluate it on the CTR prediction task, again using the ROC-AUC metric.&lt;/li&gt;
&lt;/ol&gt;
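&lt;p&gt;For readers less familiar with the metric: ROC-AUC can be read as the probability that a randomly chosen clicked example receives a higher score than a randomly chosen non-clicked one. The sketch below computes it by brute force over all pairs. The labels and scores are invented, and in practice a library implementation would be used instead.&lt;/p&gt;

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical click labels (1 = clicked) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

&lt;p&gt;Here five of the six clicked/non-clicked pairs are ordered correctly, so the value is 5/6.&lt;/p&gt;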
&lt;p&gt;Adding GNN features into the production model has improved our main offline evaluation metric, ROC-AUC, by 0.6 percentage points. While this improvement might seem modest, it's important to note that this was an initial experimentation round focused primarily on validating the feasibility of integrating GNNs into our system, rather than fully optimizing the GNN configuration or the broader model pipeline. The improvements achieved thus far suggest significant untapped potential, as further tuning of hyperparameters, node feature engineering, and experimentation with different graph structures could unlock more substantial performance gains.&lt;/p&gt;
&lt;p&gt;On top of that, GNNs offer capabilities that extend beyond traditional deep learning algorithms. They allow us to model complex aspects such as the novelty and diversity of content recommendations, and even the inspirational value of the content. These advanced capabilities can enable us to better align recommendations with higher-level business goals, such as enhancing user engagement through diverse and inspiring content.&lt;/p&gt;
&lt;h2&gt;Conclusion and Next Steps&lt;/h2&gt;
&lt;p&gt;We demonstrated the feasibility of using graph neural networks to model user interactions on Zalando’s Homepage. By leveraging GNN embeddings, we have improved the ROC-AUC performance of our recommender system. However, there is still a lot of room for improvement on both sides: fine-tuning the hyperparameters of the production model with GNN features, as well as testing architectural enhancements for training GNN embeddings. Future work involves experimenting with such improvements and validating the impact of the approach in the production setting. Additionally, solid customer representations built with GNNs have strong potential to enable a variety of ML tasks within Zalando, enhancing applications like our recommender model to improve CTR prediction accuracy and enrich the overall user experience.&lt;/p&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="Recommender Systems"/><category term="Deep Learning"/><category term="Research"/><category term="Zalando Science"/></entry><entry><title>Open Policy Agent in Skipper Ingress</title><link href="https://engineering.zalando.com/posts/2024/12/open-policy-agent-in-skipper-ingress.html" rel="alternate"/><published>2024-12-06T00:00:00+01:00</published><updated>2024-12-06T00:00:00+01:00</updated><author><name>Magnus Jungsbluth</name></author><id>tag:engineering.zalando.com,2024-12-06:/posts/2024/12/open-policy-agent-in-skipper-ingress.html</id><summary type="html">&lt;p&gt;Zalando has integrated Open Policy Agent (OPA) into Skipper, our open-source Ingress controller, to provide Authorization as a Service. It aims to simplify the developer experience and provides observability out of the box.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;At Zalando, we continuously strive to enhance our platform capabilities to provide robust, scalable, and developer-friendly solutions. One such initiative is the integration of &lt;a href="https://www.openpolicyagent.org/"&gt;Open Policy Agent&lt;/a&gt; (OPA) into &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;, our open-source ingress controller and reverse proxy, to deliver Authorization as a Service. This integration not only allows externalising authorization policies but also aligns with our goals of solving security concerns on the infrastructure with efficiency and developer experience in mind. It simplifies developer experience by embedding OPA as a library within Skipper and allows multiple virtual OPA instances to coexist within a single Skipper process. Enabling OPA for a specific application is as easy as just stating “application X should be protected” without touching multiple YAML files, adding monitoring, and inheriting many more responsibilities to be compliant.&lt;/p&gt;
&lt;h2&gt;Goals&lt;/h2&gt;
&lt;p&gt;Our primary goals for integrating OPA into Skipper include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Externalised Authorization:&lt;/strong&gt; Embedding OPA into Skipper provides a powerful and flexible policy engine as a platform feature. This enables our engineering teams to leverage externalised authorization policies without additional overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear Responsibility Split:&lt;/strong&gt; The integration allows a clear delineation of responsibilities: platform teams manage the core authorization infrastructure while application teams focus on application-specific policies, ensuring efficiency and security.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; The implementation is designed to handle millions of policy decisions per second, scaling with the demands of our extensive application landscape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Developer Experience:&lt;/strong&gt; We prioritise making it straightforward for developers to enable authorization in their applications, reducing complexity and time required to implement secure access controls.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Developer Experience&lt;/h2&gt;
&lt;p&gt;To illustrate the OPA integration in Skipper on Kubernetes, here is how engineers can enable OPA through the &lt;code&gt;opaAuthorizeRequest&lt;/code&gt; filter in an Ingress manifest:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;apiVersion&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;networking&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;k8s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;io&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;
&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Ingress&lt;/span&gt;
&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;zalando&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;org&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;skipper&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;opaAuthorizeRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;my-application&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;
&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;zalando&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;example&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;pathType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ImplementationSpecific&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Explanation&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;zalando.org/skipper-filter&lt;/code&gt; annotation specifies the Skipper filter that is applied to all routes in this Ingress manifest. In this example, the &lt;a href="https://opensource.zalando.com/skipper/reference/filters/#opaauthorizerequest"&gt;&lt;code&gt;opaAuthorizeRequest&lt;/code&gt; filter&lt;/a&gt; is configured with one parameter: &lt;code&gt;"my-application"&lt;/code&gt; (the name of the OPA policy bundle and also the registered ID of the application to be protected).&lt;/p&gt;
&lt;p&gt;This is the only infrastructure setup required from engineers to authorise requests for their application. Specifics like which paths to protect and authoring rules using &lt;a href="https://www.openpolicyagent.org/docs/latest/policy-language/"&gt;Rego&lt;/a&gt;, the policy language of Open Policy Agent, are decentrally managed in the application's Git repositories.&lt;/p&gt;
&lt;h2&gt;Skipper for Kubernetes Ingress&lt;/h2&gt;
&lt;p&gt;We use &lt;a href="https://opensource.zalando.com/skipper"&gt;Skipper&lt;/a&gt;, our HTTP reverse proxy for service composition, to implement the control plane and data plane of &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes ingress&lt;/a&gt; and &lt;a href="https://opensource.zalando.com/skipper/kubernetes/routegroup-crd/"&gt;routegroups&lt;/a&gt;. Creating an Ingress results in an AWS NLB with TLS termination targeting Skipper (via &lt;a href="https://github.com/zalando-incubator/kube-ingress-aws-controller"&gt;kube-ingress-aws-controller&lt;/a&gt;), HTTP routes in Skipper, and a DNS name pointing to the NLB (via &lt;a href="https://github.com/kubernetes-sigs/external-dns"&gt;external-dns&lt;/a&gt;).
To understand the deployment context, this is the scale we operate at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;15,000 Ingresses and 5,000 routegroups&lt;/li&gt;
&lt;li&gt;traffic of up to 2,000,000 requests per second&lt;/li&gt;
&lt;li&gt;80-90% of our traffic consists of authenticated service-to-service calls, with daily numbers between 500,000 and 1,000,000 rps across our service fleet in total&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Technical Design&lt;/h2&gt;
&lt;p&gt;To achieve these goals, several key technical decisions were made:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Alignment with OPA Envoy Plugin's Input Structures:&lt;/strong&gt; We chose to align closely with the OPA Envoy plugin's input structures to leverage existing documentation, examples, and training resources. This minimises the learning curve for our developers and keeps Zalando-isms at bay.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OPA Embedded as a Library in Skipper:&lt;/strong&gt; Embedding OPA directly within Skipper as a library ensures minimal latency in policy enforcement by keeping policy decisions local to the ingress data plane. It also is cost efficient compared to running an OPA deployment per application or as sidecars.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hide OPA Configuration from Engineers:&lt;/strong&gt; To separate platform concerns from application concerns, we only expose the bundle name and additional context data as configuration to application engineers. How to run OPA and how it communicates with its control plane is configured and owned by platform engineers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Skipper can configure multiple routes that can target different backend applications inside its surrounding Kubernetes cluster. OPA enabled filters can be used in multiple routes or even multiple times in the same route.&lt;/p&gt;
&lt;p&gt;At Zalando, every application that is deployed to production must first be registered in our application registry. For structuring policies, we piggyback on this governance structure and expect application teams to have one OPA policy bundle per application, with the application ID in its name.&lt;/p&gt;
&lt;p&gt;Inside Skipper, we create one virtual OPA instance per application that is referenced in at least one of the routes. This allows us to re-use memory and also provides a buffer against high-frequency route changes by having a grace period for garbage collection.&lt;/p&gt;
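&lt;p&gt;The idea of sharing one instance per application and delaying its cleanup can be sketched as below. This is a hypothetical Python illustration, not Skipper's Go implementation: the class, its methods, and the 30-second grace period are assumptions made to show the reference-counting and grace-period logic.&lt;/p&gt;

```python
class InstanceRegistry:
    """Keeps one shared instance per bundle name and delays cleanup by a grace period."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.instances = {}      # bundle name -> instance object
        self.ref_counts = {}     # bundle name -> number of routes using it
        self.unused_since = {}   # bundle name -> time the last route released it

    def acquire(self, bundle: str):
        if bundle not in self.instances:
            self.instances[bundle] = {"bundle": bundle}  # stand-in for a real OPA instance
        self.ref_counts[bundle] = self.ref_counts.get(bundle, 0) + 1
        self.unused_since.pop(bundle, None)  # in use again, so cancel pending cleanup
        return self.instances[bundle]

    def release(self, bundle: str, now: float) -> None:
        self.ref_counts[bundle] -= 1
        if self.ref_counts[bundle] == 0:
            self.unused_since[bundle] = now

    def collect(self, now: float) -> list:
        """Remove instances that have been unused for at least the grace period."""
        expired = [b for b, t in self.unused_since.items() if now - t >= self.grace_seconds]
        for bundle in expired:
            del self.instances[bundle], self.ref_counts[bundle], self.unused_since[bundle]
        return expired


registry = InstanceRegistry(grace_seconds=30.0)
first = registry.acquire("my-application")
second = registry.acquire("my-application")   # a second route re-uses the same instance
registry.release("my-application", now=0.0)
registry.release("my-application", now=1.0)   # last route removed at t=1
early = registry.collect(now=10.0)            # still inside the grace period: nothing removed
reused = registry.acquire("my-application")   # route re-added: the instance survived
```

&lt;p&gt;With a grace period, a route that disappears and reappears shortly afterwards does not force the bundle to be downloaded and compiled again.&lt;/p&gt;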
&lt;p&gt;&lt;img alt="Skipper Process" src="https://engineering.zalando.com/posts/2024/12/images/skipper-process.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;OPA instances within a skipper process&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;To reduce the likelihood of outages due to an authorization infrastructure failure, we use AWS S3 and its availability promises as the source for policy bundles. Styra DAS, a commercial control plane for Open Policy Agent, is used to source the bundles and publish them to S3.&lt;/p&gt;
&lt;p&gt;To capture observability metrics, we send spans for both authorization decisions and control plane traffic to Lightstep via OpenTelemetry. To complete the picture, Styra DAS also receives regular updates via the OPA status and decision log plugins.&lt;/p&gt;
&lt;p&gt;This approach allows us to scale and fail over despite failures of our OPA control plane; it depends only on S3 being available.&lt;/p&gt;
&lt;p&gt;&lt;img alt="OPA control plane" src="https://engineering.zalando.com/posts/2024/12/images/opa-system-architecture.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Architecture of the OPA control plane&lt;/figcaption&gt;

&lt;h2&gt;Trade-Offs&lt;/h2&gt;
&lt;p&gt;The integration involved several trade-offs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency vs. Memory Consumption:&lt;/strong&gt; Embedding OPA reduces latency but increases memory consumption, raising the risk of out-of-memory (OOM) issues. We mitigated this by enforcing strict limits on bundle size and by bounding memory consumption for advanced features like request body parsing. Telemetry such as decision streaming and status reports also uses bounded data structures to avoid memory exhaustion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility vs. Cost:&lt;/strong&gt; While OPA offers great flexibility in defining policies, it can be more resource-intensive compared to simpler token validation methods that are implemented without a general purpose policy engine. However, we expect the benefits of fine-grained access control and externalised policy management to outweigh the additional computational costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OPA by default vs. on demand:&lt;/strong&gt; OPA is enabled and bootstrapped only if the Kubernetes cluster is enabled to support OPA and at least one application in it uses OPA. Skipper instances that have OPA-enabled routes are generally scaled up to compensate for higher CPU consumption due to policy execution.&lt;/li&gt;
&lt;/ul&gt;
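&lt;p&gt;As an illustration of what a bounded data structure for telemetry can look like, the sketch below keeps a fixed number of decision records and counts what it had to drop. It is a hypothetical Python example; the class name, the drop-oldest policy, and the record shape are assumptions, not a description of the actual Skipper or OPA code.&lt;/p&gt;

```python
from collections import deque


class BoundedDecisionBuffer:
    """Keeps at most max_entries decisions; the oldest entries are dropped first."""

    def __init__(self, max_entries: int):
        self.entries = deque(maxlen=max_entries)
        self.dropped = 0

    def add(self, decision: dict) -> None:
        if len(self.entries) == self.entries.maxlen:
            self.dropped += 1  # appending to a full deque evicts the oldest entry
        self.entries.append(decision)

    def drain(self) -> list:
        """Return buffered decisions for upload and empty the buffer."""
        batch = list(self.entries)
        self.entries.clear()
        return batch


buffer = BoundedDecisionBuffer(max_entries=3)
for decision_id in range(5):
    buffer.add({"decision_id": decision_id, "allowed": True})
batch = buffer.drain()
```

&lt;p&gt;Memory use stays constant under load spikes, at the price of losing some telemetry, and the &lt;code&gt;dropped&lt;/code&gt; counter makes that loss visible.&lt;/p&gt;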
&lt;h2&gt;Observability&lt;/h2&gt;
&lt;p&gt;Running any service in production requires solid observability to pinpoint issues quickly. If Skipper is configured to send OpenTelemetry Spans, the OPA filters in Skipper automatically send Spans for two paths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Policy Decisions&lt;/strong&gt;: Whenever the OPA filter is executed as part of a Skipper route, a Span is injected that captures relevant metadata like the decision ID, the decision outcome, the bundle name (in our case the application ID) and the labels of the running OPA instance. This allows linking directly to the full decision as stored in Styra DAS, and also allows capturing metrics directly in Lightstep based solely on the traces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Control Plane Traffic&lt;/strong&gt;: Whenever OPA calls out to the control plane to fetch bundles or when it reports status / decisions back to the control plane, a separate Trace is generated. This allows monitoring for errors in the basic setup or general problems with fetching bundles or control plane communication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br/&gt;
&lt;img alt="Sample Trace" src="https://engineering.zalando.com/posts/2024/12/images/sample-trace.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;OPA Observability: Sample Trace&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;
&lt;img alt="Sample Span" src="https://engineering.zalando.com/posts/2024/12/images/sample-span.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;OPA Observability: Sample Span&lt;/figcaption&gt;

&lt;h2&gt;Differences Between Envoy OPA Plugin and Skipper OPA Integration&lt;/h2&gt;
&lt;p&gt;Our OPA integration in Skipper introduces several unique features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multiple Virtual OPA Instances in one Deployment:&lt;/strong&gt; Multiple virtual OPA instances can coexist within a single Skipper process, providing low latency without a network hop and requiring no extra OPA deployment. In a vanilla OPA deployment, you typically run one OPA process per application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serving HTTP Requests:&lt;/strong&gt; OPA can serve authorization responses independently of the target application, useful for migrating existing legacy IAM services and supporting single-page applications (SPAs) that require precomputed authorization decisions or lists of permissions for the current users.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Integrating Open Policy Agent into Skipper marks a significant advancement in Zalando's platform capabilities. This integration not only enhances security and scalability but also empowers our developers with a robust, easy-to-use authorization service. By focusing on developer experience and maintaining a high-performance standard, we ensure that our platform remains at the forefront of technological innovation. On our journey, OPA has so far mostly been used in employee- or partner-facing applications and APIs, where access models and authorization rules are generally more complex.&lt;/p&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opensource.zalando.com/skipper/reference/filters/#open-policy-agent"&gt;Skipper Open Policy Agent Filters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opensource.zalando.com/skipper/tutorials/auth/#open-policy-agent"&gt;Skipper Authorization Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="Platform Engineering"/><category term="Security"/><category term="Skipper"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>Paper Announcement: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation</title><link href="https://engineering.zalando.com/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html" rel="alternate"/><published>2024-11-15T00:00:00+01:00</published><updated>2024-11-15T00:00:00+01:00</updated><author><name>Kasra Hosseini</name></author><id>tag:engineering.zalando.com,2024-11-15:/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html</id><summary type="html">&lt;p&gt;Sharing our latest research paper on leveraging LLM-as-a-Judge for scalable, multimodal relevance assessment in e-commerce product search.&lt;/p&gt;</summary><content type="html">&lt;p&gt;We are excited to share our latest research paper &lt;a href="https://arxiv.org/abs/2409.11860"&gt;Retrieve, Annotate, Evaluate, Repeat — Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation&lt;/a&gt;. We introduce a novel approach to large-scale product retrieval evaluation using Multimodal Large Language Models (MLLMs). Evaluated on 20,000 examples, our method shows how MLLMs can help automate the relevance assessment of retrieved products, achieving levels of accuracy comparable to human annotators and enabling scalable evaluation for high-traffic e-commerce platforms.&lt;/p&gt;
&lt;p&gt;In summary, our contributions are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We introduce a multimodal LLM-based evaluation framework for large-scale product retrieval systems. This framework utilizes LLMs (i) to generate context-specific annotation guidelines and (ii) to conduct relevance assessments.&lt;/li&gt;
&lt;li&gt;We evaluate the performance of our framework against human annotations on real-world production search queries in a multilingual setting and analyse the different types of errors that humans and LLMs tend to make.&lt;/li&gt;
&lt;li&gt;We demonstrate the cost-effectiveness and efficiency of our approach for conducting large-scale evaluations. We also compare the performance of different types of LLMs for relevance assessment, including GPT-4o, GPT-4 Turbo and GPT-3.5 Turbo.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We assess the performance of different types of LLMs in relevance assessment, including GPT-4o, GPT-4 Turbo and GPT-3.5 Turbo. By leveraging Multimodal LLMs (MLLMs) that analyze both text and images, our framework enables a high level of semantic accuracy in evaluating query-product relevance at scale, minimizing the need for extensive human input. We compare agreements based on (i) matching either A1 or A2 and (ii) inter-annotator agreement, both between the human annotators (A1 vs. A2) and between LLMs and the human majority vote. Results are reported separately for English and German. For human annotations, we report the total time and cost.&lt;/p&gt;
&lt;p&gt;&lt;img alt="agreements between LLM and the human annotator groups" src="https://engineering.zalando.com/posts/2024/11/images/human_llm_table_comparison.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Table 1: Agreements between our LLM-based evaluation framework and the human annotator groups (A1 and A2).&lt;/figcaption&gt;

&lt;h2&gt;Why Evaluate Product Retrieval at Scale?&lt;/h2&gt;
&lt;p&gt;Search functionality is a fundamental component of e-commerce platforms, with the objective of finding the most relevant products in a dynamic product database. Customers using search often exhibit a higher intent to find specific products, leading to greater engagement and conversion rates. However, they may struggle to articulate their needs in a search query. Even if they do express their intent clearly, information retrieval systems and search engines might fail to interpret it correctly, resulting in irrelevant search results.&lt;/p&gt;
&lt;p&gt;Evaluating product retrieval systems on a large scale in a multilingual setting and for a diverse set of customer queries is an intricate but essential task for maintaining a high-quality user experience and driving business success. Traditionally, the quality of these results is measured through human relevance assessments, which require substantial time and resources.&lt;/p&gt;
&lt;p&gt;Our &lt;a href="https://arxiv.org/abs/2409.11860"&gt;paper&lt;/a&gt; proposes a scalable solution: a framework that integrates Multimodal LLMs (i) to generate context-specific annotation guidelines and (ii) to conduct relevance assessments. By using MLLMs that analyze both text and images, we enable a high level of semantic accuracy in evaluating query-product relevance without the need for extensive human input.&lt;/p&gt;
&lt;h2&gt;Retrieve, Annotate, Evaluate, Repeat&lt;/h2&gt;
&lt;p&gt;Our framework, built for efficiency and scalability, is structured as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query extraction&lt;/strong&gt;: query-product pairs are extracted from search logs for evaluation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guideline generation&lt;/strong&gt;: for each query, an LLM generates custom annotation guidelines, setting detailed criteria for relevance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multimodal annotation&lt;/strong&gt;: MLLMs assign relevancy scores to the search results based on both textual and visual descriptions, classifying each result as "highly relevant", "acceptable substitute", or "irrelevant".&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation and storage&lt;/strong&gt;: each labeled pair is stored for continuous retrieval system evaluation and comparison across different configurations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The framework's modular design allows for caching and parallel processing, enabling evaluations to scale efficiently to support multiple search engines and to accommodate updates to retrieval algorithms.&lt;/p&gt;
&lt;p&gt;Our proposed approach works by extracting a query-product pair from our search query-click logs (1). The query (e.g. black sneakers) is then passed on to the "LLM generator" (2). The LLM generator creates specific annotation instructions for the given query. The query-specific annotation guidelines and the query-product pair (e.g. black sneakers and the retrieved product) are provided as input to the "LLM annotator" (3). Lastly, the annotated query-product pair is forwarded to the search engine evaluation module (4).&lt;/p&gt;
&lt;p&gt;&lt;img alt="proposed framework" src="https://engineering.zalando.com/posts/2024/11/images/llm_annotation_overview.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig. 1: Design of the LLM-based annotation framework.&lt;/figcaption&gt;
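&lt;p&gt;The four steps above can be sketched in code as follows. This is a minimal TypeScript illustration under stated assumptions, not the implementation from the paper: the type names, the generateGuidelines and annotate callbacks, and the per-query guideline cache are hypothetical and stand in for the real LLM calls.&lt;/p&gt;

```typescript
// Hypothetical types for illustration; the names are not taken from the paper.
type Label = "highly relevant" | "acceptable substitute" | "irrelevant";

interface QueryProductPair {
  query: string;
  productId: string;
  productText: string;
  imageDescription: string;
}

interface AnnotatedPair {
  pair: QueryProductPair;
  label: Label;
}

// The two model calls are injected so the pipeline itself stays model-agnostic.
interface Models {
  generateGuidelines: (query: string) => string; // step 2: "LLM generator"
  annotate: (guidelines: string, pair: QueryProductPair) => Label; // step 3: "LLM annotator"
}

// Runs steps 2 and 3 for a batch of pairs extracted from the search logs (step 1)
// and returns the labelled pairs that the evaluation module would store (step 4).
function annotatePairs(pairs: QueryProductPair[], models: Models): AnnotatedPair[] {
  // Guidelines depend only on the query, so they are generated once per query.
  const guidelineCache: { [query: string]: string } = Object.create(null);
  return pairs.map((pair) => {
    let guidelines = guidelineCache[pair.query];
    if (guidelines === undefined) {
      guidelines = models.generateGuidelines(pair.query);
      guidelineCache[pair.query] = guidelines;
    }
    return { pair, label: models.annotate(guidelines, pair) };
  });
}
```

&lt;p&gt;Because the model calls are passed in as plain functions, the same loop can be pointed at different retrieval configurations or annotator models without changes.&lt;/p&gt;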

&lt;h3&gt;Multimodal LLM-powered relevance assessment: evaluation steps for an example query&lt;/h3&gt;
&lt;p&gt;Fig. 2 demonstrates the structured process of our evaluation framework with the example query &lt;em&gt;"women's long sleeve t-shirt with green stripes"&lt;/em&gt; in panel (a). The LLM identifies and prioritizes four query requirements: "assortment category"; "sleeve length"; "product type" and "pattern", with assigned importance levels, as shown in panel (b). Using this information, &lt;em&gt;query-specific annotation guidelines&lt;/em&gt; are generated in panel (c) to provide tailored descriptions for three relevance labels: "irrelevant", "acceptable substitute" and "highly relevant".&lt;br/&gt;
Panel (d) illustrates an example product and its attributes, with the relevance label "highly relevant" assigned in panel (e) based on LLM-guided reasoning. The entire content displayed in this figure is generated by Multimodal LLMs, except for panel (a), the packshot in panel (d), and the black dashed rectangle also in panel (d). However, within the attributes shown in panel (d), the "visual description of packshot", highlighted by a red rectangle, is generated by a vision model (here: GPT-4o).
&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Evaluation steps for an example query" src="https://engineering.zalando.com/posts/2024/11/images/evaluation_steps.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig. 2: Evaluation steps for the example query "women's long sleeve t-shirt with green stripes".&lt;/figcaption&gt;

&lt;h2&gt;Key Findings and Benefits&lt;/h2&gt;
&lt;p&gt;Our method, validated through deployment on Zalando's large e-commerce platform, demonstrates comparable quality to human annotations. This approach offers several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost efficiency&lt;/strong&gt;: MLLM-based assessments are up to 1,000 times cheaper than human labor, resulting in substantial resource savings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt;: The pipeline can assess 20,000 query-product pairs in around 20 minutes, compared to the weeks needed for human annotators.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multilingual adaptability&lt;/strong&gt;: The framework supports multiple languages, essential for e-commerce platforms operating in diverse markets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error analysis&lt;/strong&gt;: MLLMs demonstrated lower rates of common human errors, such as brand mismatches, which are often caused by annotation fatigue, making MLLMs a reliable solution for repetitive assessment tasks. See the next section "Human versus LLM Error Analysis" for more details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Human versus LLM Error Analysis&lt;/h3&gt;
&lt;p&gt;We compare the LLM-based annotations to human annotations on a dataset collected from live traffic and find that, while humans and LLMs make approximately the same number of errors, their error distributions are very different.&lt;/p&gt;
&lt;p&gt;For example, the majority of human errors are on (i) brands (e.g. query asks for an Adidas sneaker, retrieved product is from another brand, yet humans would annotate it as relevant), (ii) products and (iii) categories.&lt;/p&gt;
&lt;p&gt;On the other hand, LLMs barely made these errors but were often too strict in their judgement (e.g. query asks for black Levi's jeans, retrieved product is a dark grey pair of Levi's jeans, yet the LLM would judge it as irrelevant) or suffered from an "understanding" error (e.g. query asks for the brand "On Vacation", the LLM interprets it in its literal sense, i.e. going on holiday, instead of the brand).&lt;/p&gt;
&lt;p&gt;Given these observations, we found that LLMs are a very good choice for handling the bulk of the annotation work for our use case (most bread-and-butter queries), freeing up human expertise to focus on the tricky cases (e.g. queries asking for styles and trends).&lt;/p&gt;
&lt;p&gt;Fig. 3 shows the distribution of errors between LLMs and humans on hard disagreements (50% were due to human errors, 31% LLM errors and in 19% both made an error). The upper part ("Both errors") focuses on errors that either the LLM or humans could make. It highlights that LLMs and humans make very different types of errors. In addition, the lower part ("LLM errors") shows the distribution of errors that only an LLM would make. Predominantly these are misunderstandings of a part of the search query.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Error analysis, Human vs LLM" src="https://engineering.zalando.com/posts/2024/11/images/llm_vs_human_errs-paper.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig. 3: Distribution of errors between LLMs and humans on hard disagreements&lt;/figcaption&gt;

&lt;h2&gt;Real-World Impact at Zalando&lt;/h2&gt;
&lt;p&gt;This framework has been deployed in production at Zalando, enabling regular monitoring of high-frequency search queries and identifying low-performing queries for targeted improvements. This continual assessment allows us to quickly pinpoint areas where the retrieval system needs adjustments, helping to ensure a high-quality customer experience.&lt;/p&gt;
&lt;p&gt;We note that high relevance is a necessary but not sufficient condition for high customer engagement, which is also determined by other factors such as personal preferences, product availability, and price expectations. In this paper, we focus on semantic relevance, but in production we rank the retrieved documents based on various features to take into account both relevance to the query and customers' personal preferences.&lt;/p&gt;
&lt;p&gt;This approach enables us to significantly reduce costs and to enhance customer experience faster by prioritising the queries that need the most attention and optimising our resources accordingly.&lt;/p&gt;
&lt;h2&gt;Future Directions&lt;/h2&gt;
&lt;p&gt;While the MLLM-powered framework has shown considerable promise, we are exploring several enhancements to extend its capabilities and broaden its impact:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deeper Human-LLM collaboration&lt;/strong&gt;: Integrating human expertise for ambiguous or complex cases could optimize relevance assessments, with humans addressing nuanced judgments where domain knowledge is essential.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Broadening relevance dimensions&lt;/strong&gt;: In this paper, we focused on semantic relevance, excluding factors like personal preferences, seasonal trends, or emerging product categories. A future direction could involve adapting the framework to assess multiple relevance dimensions, providing a more holistic view of query relevance in response to shifting customer interests.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptability across market segments&lt;/strong&gt;: To further enhance the robustness of our framework, we aim to test its adaptability across various market segments and product categories, refining its ability to interpret domain-specific language and visual cues that vary between, for example, fashion and beauty products.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automated detection of new trends&lt;/strong&gt;: By leveraging real-time data and LLMs' growing ability to capture new terminology and styles, we hope to improve the framework's responsiveness to evolving trends, allowing for quick adjustments in annotation criteria to align with emerging patterns.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By advancing these areas, we're setting new standards for automated evaluation in large-scale e-commerce, providing practical solutions for scalable, accurate, and context-aware relevance assessments.&lt;/p&gt;
&lt;h2&gt;Want to Know More?&lt;/h2&gt;
&lt;p&gt;For a detailed exploration of our framework, experimental results, and insights on the use of Multimodal LLMs in product retrieval evaluation, refer to our full paper &lt;a href="https://arxiv.org/abs/2409.11860"&gt;Retrieve, Annotate, Evaluate, Repeat — Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation&lt;/a&gt; on arXiv. The paper includes comprehensive discussions on our methodology, error analysis, and a comparison of human and LLM annotations. We invite you to dive into the technical details and explore how this approach is shaping the future of scalable, automated relevance assessment in e-commerce.&lt;/p&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="Research"/><category term="Zalando Science"/><category term="Search"/><category term="Backend"/></entry><entry><title>Building a Modular Portal with Webpack Module Federation</title><link href="https://engineering.zalando.com/posts/2024/10/building-modular-portal-with-webpack-module-federation.html" rel="alternate"/><published>2024-10-17T00:00:00+02:00</published><updated>2024-10-17T00:00:00+02:00</updated><author><name>Kadir Caner Erguen</name></author><id>tag:engineering.zalando.com,2024-10-17:/posts/2024/10/building-modular-portal-with-webpack-module-federation.html</id><summary type="html">&lt;p&gt;In this post, we explore how Webpack Module Federation helped us build a scalable, modular portal. Learn how dynamic code sharing enabled independent development, how we managed shared dependencies, centralized services like authentication, and maintained design consistency with a UI-kit. We also cover performance optimizations that ensured a seamless user experience.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;h3&gt;Context and Purpose&lt;/h3&gt;
&lt;p&gt;Our team is part of the Transport teams within the Logistics department, where we build and manage software for internal users, including finance teams, warehouses, and, in the future, our third-party partners. The portal software is designed to streamline operations across these teams, providing tools and features that improve workflow efficiency and collaboration.&lt;/p&gt;
&lt;p&gt;We decided to use Webpack Module Federation while building a modular portal to address the challenges of scalability and development autonomy. Modularity was key to enabling independent feature development and deployment across teams. In this blog post, we’ll share the reasoning behind this choice, the process, and the lessons we learned along the way.&lt;/p&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://webpack.js.org/concepts/module-federation/"&gt;Webpack Module Federation&lt;/a&gt; allows for sharing code dynamically between applications at runtime without needing to rebuild the host app or statically link all modules, enabling micro frontends. This flexibility played a crucial role in the architecture of our portal project.&lt;/p&gt;
&lt;h2&gt;1. The Challenge&lt;/h2&gt;
&lt;h3&gt;Project Scope&lt;/h3&gt;
&lt;p&gt;Our portal project involved collaboration between five different teams, with each team responsible for different applications. Some of these applications were brand new, while others had legacy code that required continued support. For example, certain applications were still being used within iframes in other portals, adding complexity to the integration. The critical aspect of modularity was to ensure that each team could independently develop and deploy their applications without affecting the main portal. This independence was key to maintaining flexibility, especially as teams needed to update their applications without touching the portal itself.&lt;/p&gt;
&lt;h3&gt;Limitations of Traditional Approaches&lt;/h3&gt;
&lt;p&gt;In previous approaches, such as monolithic applications or using static builds, deployments were cumbersome and tightly coupled. Every time an update was required, teams would need to coordinate to release changes together, which often led to delays and bottlenecks. These traditional models also made it difficult to integrate new applications alongside legacy systems, as maintaining compatibility across different codebases was a significant challenge. Additionally, relying on methods like iframes was not scalable, and it lacked the modern functionality we needed to future-proof the system.&lt;/p&gt;
&lt;h2&gt;2. Why Webpack Module Federation?&lt;/h2&gt;
&lt;h3&gt;Dynamic Code Sharing&lt;/h3&gt;
&lt;p&gt;Webpack Module Federation stands out because it allows us to dynamically share modules between applications without the need for static linking or managing shared npm packages across teams. This ability to share code at runtime enabled us to avoid the traditional pitfalls of managing dependencies between different teams’ applications. Each team could expose specific components or utilities from their micro frontend, and other teams could consume them directly, without rebuilding or redeploying shared libraries. For example, in our case the host and remote applications dynamically share React and React-DOM. This not only reduced overhead but also ensured that each application could be updated independently.&lt;/p&gt;
&lt;h3&gt;Autonomy and Scalability&lt;/h3&gt;
&lt;p&gt;One of the biggest advantages of Webpack Module Federation is that it provides teams with the ability to develop and release features in isolation. Each team could work on their micro frontend independently, making their own technology decisions and deploying updates on their schedule. This autonomy allowed for much quicker development cycles and reduced the complexity of managing interdependencies between applications.&lt;/p&gt;
&lt;h2&gt;3. What about backend services? How did we handle authentication and authorisation?&lt;/h2&gt;
&lt;h3&gt;Handling Authentication and Authorisation&lt;/h3&gt;
&lt;p&gt;On the backend side, we implemented a centralised backend proxy within the portal to handle both authentication and authorisation. This proxy acts as a gatekeeper for all requests coming from the frontend, ensuring that no direct access to backend services occurs without proper authentication and authorisation checks.&lt;/p&gt;
&lt;h3&gt;Centralised Authentication&lt;/h3&gt;
&lt;p&gt;The backend proxy is responsible for authenticating users. It uses a single authentication mechanism, so frontend applications integrated into the portal do not have to manage their own authentication flows. Once the user is authenticated, their credentials and permissions are handled by the proxy.&lt;/p&gt;
&lt;h3&gt;Authorisation and Request Forwarding&lt;/h3&gt;
&lt;p&gt;Once authentication is complete, the proxy verifies the user’s permissions for the requested service and forwards the request to the appropriate microservice. This approach allows each microservice to focus on its core functionality without worrying about authorisation logic. The proxy ensures that only authorised requests reach the relevant service.&lt;/p&gt;
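&lt;p&gt;The authenticate, authorise, forward sequence can be sketched as a single function. This is an illustrative TypeScript sketch only: the request and user shapes, the status codes chosen, and the injected authenticate and forward callbacks are assumptions, since the proxy's actual data model is not shown in this post.&lt;/p&gt;

```typescript
// Hypothetical shapes; the real proxy's data model is not described in the post.
type Action = "read" | "write";

interface ScopePermissions {
  read: boolean;
  write: boolean;
}

interface User {
  id: string;
  scopes: { [scope: string]: ScopePermissions };
}

interface ProxyRequest {
  sessionToken: string;
  scope: string;
  action: Action;
  path: string;
}

interface ProxyResponse {
  status: number;
  body: string;
}

// Authenticate first, then authorise, and only then forward to the microservice.
function handleRequest(
  request: ProxyRequest,
  authenticate: (token: string) => User | undefined,
  forward: (request: ProxyRequest, user: User) => ProxyResponse
): ProxyResponse {
  const user = authenticate(request.sessionToken);
  if (user === undefined) {
    return { status: 401, body: "not authenticated" };
  }
  const permissions = user.scopes[request.scope];
  const allowed = permissions !== undefined ? permissions[request.action] : false;
  if (!allowed) {
    return { status: 403, body: "not authorised" };
  }
  // The microservice behind `forward` never sees unauthorised requests.
  return forward(request, user);
}
```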
&lt;h3&gt;Unified Entry Point&lt;/h3&gt;
&lt;p&gt;By creating this backend proxy, we established a single entry point for both the frontend and backend, streamlining security and reducing the complexity of managing multiple microservices. This architecture ensures consistent handling of security concerns while giving us the flexibility to scale backend services independently.&lt;/p&gt;
&lt;h2&gt;4. Flow from the user’s perspective&lt;/h2&gt;
&lt;p&gt;From the user’s perspective, interacting with the portal is seamless and intuitive. Here’s how the flow works:
&lt;img alt="User Flow Diagram" src="https://engineering.zalando.com/posts/2024/10/images/design.jpg#center"&gt;&lt;/p&gt;
&lt;h3&gt;Initial Request&lt;/h3&gt;
&lt;p&gt;When a user opens the portal, the first action is to call the portal proxy’s applications endpoint. This endpoint serves as a gatekeeper, checking the user’s permissions and determining which applications they have access to. The response from this endpoint contains a list of applications the user is authorised to view. Each application comes with its ID, name, configuration path, and the specific URL (activePath) where the application will start loading into the portal. Additionally, it contains detailed permission scopes (e.g., read and write permissions) for specific actions within the application.&lt;/p&gt;
&lt;p&gt;Here’s an example of the response the user’s browser would receive:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;applications&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;appId&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;example-application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Example Application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;configPath&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;application.manifest.json path&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;activePath&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/example-application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;opaScope&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;scope1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;read&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;write&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;scope2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;read&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;write&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Fetching Application Configurations&lt;/h3&gt;
&lt;p&gt;Once the portal receives the list of applications, it proceeds to load each application’s manifest.json file. This file contains essential details about the application, such as its menu items (more on these below), required permissions, and the path to the bundled application. The manifest allows the portal to integrate each application seamlessly, displaying menu options and enabling users to navigate to the correct sections based on their permissions.&lt;/p&gt;
&lt;h3&gt;Example Manifest Data&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;menuItems&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;label&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Last Mile&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;path&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/last-mile&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;requiredPermissions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;groupId&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;invoice-verification&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;bundlePath&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Bundle path of application. (RemoteEntry.js)&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Loading the Application&lt;/h3&gt;
&lt;p&gt;When a user navigates to one of the defined activePath routes in the configuration (for example, “/last-mile”), the portal dynamically loads the corresponding application bundle from the specified bundlePath. This allows the application to be loaded directly into the portal’s interface without refreshing or reloading the entire page. The user is presented with the appropriate UI and functionality based on the permissions defined in the manifest, ensuring they only see what they are authorised to access.&lt;/p&gt;
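&lt;p&gt;One way to decide which application owns the current route is a prefix match on activePath. The sketch below is a TypeScript illustration: the field names follow the example response shown earlier, while the longest-prefix rule and the function name are assumptions rather than the portal's actual routing logic.&lt;/p&gt;

```typescript
// Field names follow the example response; the matching rule is an assumption.
interface Application {
  appId: string;
  name: string;
  configPath: string;
  activePath: string;
}

// Returns the application whose activePath is the longest prefix of the URL path,
// or undefined when no authorised application claims the route.
function findActiveApplication(path: string, applications: Application[]): Application | undefined {
  let best: Application | undefined;
  for (const app of applications) {
    // Match the path itself or any sub-route, but not "/last-mileage" for "/last-mile".
    const matches = path === app.activePath || path.indexOf(app.activePath + "/") === 0;
    if (!matches) {
      continue;
    }
    if (best === undefined || app.activePath.length > best.activePath.length) {
      best = app;
    }
  }
  return best;
}
```

&lt;p&gt;The portal would then load the bundle referenced by the matching application's manifest; an undefined result corresponds to a route the user has no application for.&lt;/p&gt;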
&lt;h3&gt;Dynamic Menu and Permissions&lt;/h3&gt;
&lt;p&gt;The manifest file also defines the menu items for each application, ensuring that the user can navigate the different features within the application. These menu items are dynamically displayed in the portal’s UI based on the user’s permissions, making the experience personalised and secure.&lt;/p&gt;
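&lt;p&gt;Filtering menu items by permission could look like the following TypeScript sketch. The MenuItem fields mirror the example manifest, and the scope object mirrors opaScope from the applications response. How a requiredPermissions entry maps onto a scope is not specified in this post, so the "scope:action" string format used here is purely an assumption.&lt;/p&gt;

```typescript
interface MenuItem {
  label: string;
  path: string;
  requiredPermissions: string[];
  groupId: string[];
}

interface ScopePermissions {
  read: boolean;
  write: boolean;
}

type OpaScope = { [scope: string]: ScopePermissions };

// Assumption: permissions are written as "scope:action", for example "scope1:read".
function grantedPermissions(opaScope: OpaScope): string[] {
  const granted: string[] = [];
  for (const scope of Object.keys(opaScope)) {
    const permissions = opaScope[scope];
    if (permissions === undefined) {
      continue;
    }
    if (permissions.read) {
      granted.push(scope + ":read");
    }
    if (permissions.write) {
      granted.push(scope + ":write");
    }
  }
  return granted;
}

// An item is shown only if every permission it requires has been granted.
// Items with an empty requiredPermissions list are always shown.
function visibleMenuItems(items: MenuItem[], opaScope: OpaScope): MenuItem[] {
  const granted = grantedPermissions(opaScope);
  return items.filter((item) =>
    item.requiredPermissions.every((required) => granted.indexOf(required) !== -1)
  );
}
```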
&lt;p&gt;This entire flow is smooth from the user’s perspective, allowing them to access only the applications and features they have permission to use while the portal handles the complex backend interactions in the background.&lt;/p&gt;
&lt;h2&gt;5. Challenges and Solutions&lt;/h2&gt;
&lt;h3&gt;Shared Dependencies&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: One of the significant challenges we encountered was managing shared dependencies across multiple federated modules. With several teams working independently, version conflicts were a real concern. For instance, two different micro frontends might rely on different versions of a shared library, which could lead to runtime errors or unexpected behaviour.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: We used Webpack’s shared dependencies feature to specify the common libraries used by multiple micro frontends, ensuring that only a single version of the dependency would be loaded at runtime. By marking key libraries (e.g., React, lodash) as shared, we were able to reduce version conflicts and avoid loading multiple versions of the same package. Additionally, we worked on aligning versions across teams during development to maintain consistency and minimise potential issues.&lt;/p&gt;
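&lt;p&gt;As a rough illustration, the options below have the shape accepted by webpack's ModuleFederationPlugin, with shared libraries marked as singletons so that only one copy is loaded at runtime. The application name, the exposed module path, the version range, and the choice to mark lodash as a singleton are placeholders chosen for this sketch, not our actual configuration.&lt;/p&gt;

```typescript
// Options in the shape accepted by webpack's ModuleFederationPlugin, i.e. what a
// remote would pass to `new webpack.container.ModuleFederationPlugin(options)`.
// All names and versions below are placeholders.
const remoteFederationOptions = {
  name: "exampleApplication",
  filename: "remoteEntry.js", // the bundle the portal loads for this remote
  exposes: {
    "./App": "./src/App", // module made available to the host
  },
  shared: {
    // singleton: true asks webpack to load a single copy of the library at runtime;
    // requiredVersion declares the version range this application expects.
    react: { singleton: true, requiredVersion: "^18.0.0" },
    "react-dom": { singleton: true, requiredVersion: "^18.0.0" },
    lodash: { singleton: true },
  },
};
```

&lt;p&gt;With singleton sharing, aligning version ranges across teams still matters, because webpack warns when the loaded copy does not satisfy a declared requiredVersion.&lt;/p&gt;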
&lt;h3&gt;Communication Between Apps&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Another challenge was enabling smooth communication between different federated modules, especially since they were independently developed and deployed. Some applications needed to share state or data, and managing this across independently running modules posed a challenge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: We are passing a prop to each application, which allows interaction with the portal and other applications. This prop serves as an interface for each micro frontend to communicate with the rest of the portal. Through this prop, modules can access shared data, trigger specific actions, or exchange necessary information between the federated applications. This method allows us to maintain the independence of each application while providing the necessary communication channels to ensure smooth functionality across the portal.&lt;/p&gt;
&lt;p&gt;You can find examples of shared data, specific actions, and necessary information below:&lt;/p&gt;
&lt;h3&gt;Shared Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;User Session Info&lt;/strong&gt;: Each micro frontend might need to access the currently logged-in user’s session details, such as their role or permissions, to ensure the correct data is displayed or actions are allowed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Global App Settings&lt;/strong&gt;: Applications could share configuration settings, such as theme preferences (light/dark mode) or language localization, to ensure a consistent experience across the portal.&lt;/p&gt;
&lt;h3&gt;Specific Actions&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Logging Out the User&lt;/strong&gt;: If a user triggers a logout action from one micro frontend, this action can be communicated to other applications through the shared prop, ensuring the entire session is closed and the user is logged out portal-wide.&lt;/p&gt;
&lt;h3&gt;Necessary Information&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Navigation Requests&lt;/strong&gt;: If one application needs to trigger navigation to another part of the portal, it can use the shared prop to request the portal to navigate the user to the appropriate page.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Error Handling&lt;/strong&gt;: Federated applications can pass errors (e.g., failed API calls) to the portal via the shared prop, and the portal can handle displaying a global error message or logging errors centrally.&lt;/p&gt;
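&lt;p&gt;Taken together, the examples above suggest an interface along the following lines. This TypeScript sketch is hypothetical: the PortalApi name, its fields, and the in-memory PortalState are assumptions made for illustration and do not reflect the exact prop our portal passes.&lt;/p&gt;

```typescript
interface Session {
  userId: string;
  roles: string[];
}

interface Settings {
  theme: "light" | "dark";
  language: string;
}

// Hypothetical shape of the prop handed to every micro frontend.
interface PortalApi {
  session: Session; // shared data: user session info
  settings: Settings; // shared data: global app settings
  navigate: (path: string) => void; // navigation request
  logout: () => void; // portal-wide logout
  reportError: (error: Error) => void; // central error handling
}

// Minimal stand-in for the state the portal owns.
interface PortalState {
  currentPath: string;
  loggedIn: boolean;
  errors: string[];
}

// The portal builds one API object and passes it to each application it mounts.
function createPortalApi(state: PortalState, session: Session, settings: Settings): PortalApi {
  return {
    session,
    settings,
    navigate: (path) => {
      state.currentPath = path;
    },
    logout: () => {
      state.loggedIn = false;
    },
    reportError: (error) => {
      state.errors.push(error.message);
    },
  };
}
```

&lt;p&gt;Keeping the prop small and explicitly typed gives each team a stable contract, while the portal remains free to change how navigation or error reporting is implemented.&lt;/p&gt;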
&lt;h3&gt;Performance Considerations&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Challenge&lt;/strong&gt;: Loading multiple remote modules into the portal can introduce performance issues, particularly in terms of loading time and bundle size. We had to ensure that the portal loaded efficiently, even as more micro frontends were added.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: To tackle this, we implemented lazy loading for federated modules, ensuring that only the necessary modules were loaded when the user navigated to a particular section of the portal. This minimised initial load times and kept the bundle size in check. Additionally, we optimised our Webpack builds by enabling code splitting, caching, and compression techniques, which further improved performance. Preloading critical assets for the user’s next possible interactions also helped speed up the perceived load time.&lt;/p&gt;
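&lt;p&gt;The lazy-loading idea can be illustrated with a small cache around the bundle loader. In this TypeScript sketch, load is a hypothetical stand-in for whatever fetches and initialises a remote bundle from its bundlePath. The wrapper only ensures that loading starts on first use and that later navigations reuse the same result.&lt;/p&gt;

```typescript
// `load` stands in for the code that fetches and initialises a remote bundle;
// it is injected here because the real loader is not shown in the post.
function createLazyLoader(load: (bundlePath: string) => unknown) {
  const cache: { [bundlePath: string]: unknown } = Object.create(null);
  return (bundlePath: string): unknown => {
    if (!(bundlePath in cache)) {
      // First navigation to this application: start loading its bundle.
      cache[bundlePath] = load(bundlePath);
    }
    // Later navigations reuse the cached result (typically a promise).
    return cache[bundlePath];
  };
}
```

&lt;p&gt;Nothing is fetched until a route is first visited, and calling the same function ahead of time is one simple way to preload a bundle the user is likely to need next.&lt;/p&gt;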
&lt;h2&gt;6. Lessons Learned and Best Practices&lt;/h2&gt;
&lt;h3&gt;Plan for Integration&lt;/h3&gt;
&lt;p&gt;While modularity is great for flexibility, we realised that early planning for how modules would communicate and interact is crucial. Defining clear interfaces and communication methods helped avoid complexity later on.&lt;/p&gt;
&lt;h3&gt;Centralise Common Services&lt;/h3&gt;
&lt;p&gt;Some services, like authentication and user state management, were best centralised. This helped maintain consistency across applications while allowing teams to remain autonomous.&lt;/p&gt;
&lt;h3&gt;Optimise for Performance&lt;/h3&gt;
&lt;p&gt;We prioritised lazy loading and code splitting to ensure the portal remained fast, even as more modules were added. Early optimisation paid off in maintaining a smooth user experience.&lt;/p&gt;
&lt;h2&gt;7. Bonus: UI-Kit for Consistent Design&lt;/h2&gt;
&lt;p&gt;To ensure a unified and consistent design across all applications in the portal, we created a shared UI-kit library. This UI-kit was packaged as an internal npm module and distributed across all teams. It provides a set of reusable components, such as buttons, modals, input fields, and typography, all following the same design language and style guidelines.&lt;/p&gt;
&lt;p&gt;By using this shared UI-kit, we maintained design consistency across different micro frontends, regardless of the team or application. It also helped speed up development, as teams didn’t need to recreate common UI elements. Additionally, any design updates could be applied centrally within the UI-kit and propagated across all applications, ensuring the portal always maintained a cohesive and up-to-date look and feel.&lt;/p&gt;
&lt;h2&gt;8. Conclusion&lt;/h2&gt;
&lt;p&gt;Building our portal with Webpack Module Federation allowed us to create a highly modular and scalable system where different teams could work independently without compromising on integration or performance. By centralising key services like authentication, managing shared dependencies, and optimising loading strategies, we are able to deliver a smooth user experience while maintaining flexibility for future growth. Though there were challenges in managing communication between modules and handling version conflicts, careful planning and adherence to best practices helped us overcome these hurdles.&lt;/p&gt;
&lt;p&gt;Currently, our portal consists of 11 different applications being developed by 4 different teams. This structure allows each team to work autonomously while maintaining consistency and integration across the platform. In the end, the result is a robust, efficient portal that meets the needs of multiple teams and applications, offering a strong foundation for future development.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/></entry><entry><title>Content Creation Copilot - AI-assisted product onboarding</title><link href="https://engineering.zalando.com/posts/2024/09/content-creation-copilot-ai-assited-product-onboarding.html" rel="alternate"/><published>2024-09-18T00:00:00+02:00</published><updated>2024-09-18T00:00:00+02:00</updated><author><name>Michal Kubacki</name></author><id>tag:engineering.zalando.com,2024-09-18:/posts/2024/09/content-creation-copilot-ai-assited-product-onboarding.html</id><summary type="html">&lt;p&gt;Explores how to improve the efficiency and effectiveness of the content creation process, the data quality and time-to-market using AI-based product attribute extraction.&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;At Zalando, we strive to discover valuable use cases that benefit our customers and stakeholders by using AI-based approaches. Our team's primary mission is to enable content creation teams to produce and integrate best-in-class content for our customers in the most efficient way. We are building tools that streamline the content creation journey, from photo shoots and copywriting to submitting articles to the Zalando shop in a compliant way.&lt;/p&gt;
&lt;h3&gt;Current Process&lt;/h3&gt;
&lt;p&gt;Our colleagues responsible for &lt;a href="https://engineering.zalando.com/posts/2018/11/exploring-fashion-catalog.html"&gt;Product understanding&lt;/a&gt;, &lt;a href="https://engineering.zalando.com/posts/2018/09/shop-look-deep-learning.html"&gt;Product Search&lt;/a&gt;, or our &lt;a href="https://corporate.zalando.com/en/technology/how-zalando-co-creating-its-new-ai-powered-assistant-together-customers"&gt;Zalando Assistant&lt;/a&gt; are extensively using Machine Learning approaches for feature extraction or similarity searches for products that are already onboarded to the Zalando platform. Yet, the content creation stage of the product onboarding is largely a manual process. Copywriters enrich attributes using a Content Creation Tool and perform Quality Assurance (QA) themselves to guarantee the four-eyes principle.&lt;/p&gt;
&lt;p&gt;After QA is completed, the article is published in the shop. The enriched attributes are then available to Zalando customers across Europe, making it easier to make informed purchasing decisions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enriched content visible on Zalando page" src="https://engineering.zalando.com/posts/2024/09/images/zalando-shop-page-selected-attributes.png#center"&gt;&lt;/p&gt;
&lt;h3&gt;The Problem Statement&lt;/h3&gt;
&lt;p&gt;After analyzing the outcomes of our quality assurance processes, we've been consistently identifying opportunities to reduce error rates. As the manual process contributed to approximately 25% of the overall content production timeline, we've prioritized the development of assistive functions to support the QA process. These aim to streamline the detection and correction of defects in the earliest possible stage of content production in accordance with the Zalando content creation guides.&lt;/p&gt;
&lt;p&gt;As a technology team, we believed leveraging Machine Learning in the content enrichment workflow could benefit the content creation teams by helping them create high-quality content while increasing the coverage of attributes across our product data catalog. This in turn would help Zalando customers access more new products every day, experience better search and discovery of the product catalogue, and consume richer product information (completeness and correctness) in the Product Detail Pages (PDPs).&lt;/p&gt;
&lt;p&gt;There are multiple parts of the workflow that could be improved, but we chose the part of the highest impact on the customer experience: generating attributes based on provided images. However, this presented our first challenge: with so many solutions on the market, which model provider should we choose? How could we ensure that our users would receive the highest possible data quality? How could we ensure in the future an easy way to compare different sources and seamlessly change the used source for attribute suggestion?&lt;/p&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;p&gt;We're building on the idea of a copilot, like the ones used in IDEs for developers, to make life easier for users by automating parts of the article enrichment process. By leveraging Machine Learning, we streamline the task of adding attributes to articles, reducing errors and ensuring consistency across similar content. Our system is designed to combine AI input with other sources, while the user interface clearly shows which suggestions come from which source, leaving the final decision in the hands of the human. This approach not only improves quality but also speeds up Time to Online (TTO), allowing Zalando customers to gain access to more new products daily and enjoy an enhanced search and discovery experience. In the Content Creation Tool, suggestions coming from the prompt generator are now pre-selected and marked with a purple indicator (dot).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot from content creation tooling highlighting automated suggestions" src="https://engineering.zalando.com/posts/2024/09/images/content-creation-tool-preselected-ai-suggestions.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Screenshot from content creation tooling highlighting automated suggestions&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;
As you can see, the attributes are already pre-filled and marked with a purple dot to make users aware that these attributes were auto-suggested. This visual cue helps streamline the workflow, allowing users to concentrate more on QA rather than the time-consuming task of enriching content.&lt;/p&gt;
&lt;h3&gt;Our Approach&lt;/h3&gt;
&lt;p&gt;Before we even began with the technical design, we built a small POC. We evaluated the results of various models on a large sample of articles from our catalog assortment, measuring accuracy by having the predictions reviewed by domain experts. After a thorough analysis and multiple tests, we decided to use the OpenAI GPT-4 Turbo model, as it provided the right balance between accuracy and information coverage. We then started crafting the prompt to ensure the best accuracy of suggested attributes.&lt;/p&gt;
&lt;p&gt;As GPT-4o was announced relatively early in the copilot's development, we initially performed a human inspection, comparing the accuracy of different sources for sample articles. The new model not only provided better results but also delivered faster response times and proved to be more cost-effective. While this was a clear improvement, our goal is to automate this process. We are now able to easily integrate different suggestion sources/models within the copilot, which is a key step toward achieving this automation across the platform.&lt;/p&gt;
&lt;h3&gt;Design and Implementation&lt;/h3&gt;
&lt;p&gt;We designed and implemented a system leveraging multiple AI services. To keep things simple, we will describe just one of our use cases.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Simplified workflow for generation of attribute suggestions" src="https://engineering.zalando.com/posts/2024/09/images/simplified-current-workflow.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Simplified workflow for generation of attribute suggestions&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;
This diagram illustrates the current workflow involving the interaction between four components: Content Creation Tool, Prompt Generator, Article Masterdata and OpenAI - GPT.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Content Creation Tool&lt;/strong&gt;: Internal content creation tool used by photographers to upload images, whose URLs are sent to the Prompt Generator. It receives the generated attribute suggestions from OpenAI-GPT and auto-selects them in the copywriting workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Article Masterdata&lt;/strong&gt;: Holds metadata about articles, such as attributes and attribute sets (definition of the types and attributes that are optional and mandatory for the article type) of the article.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt Generator&lt;/strong&gt;: Generates prompts based on the attributes and attribute sets coming from Article Masterdata. The prompts and image URLs are sent to OpenAI-GPT for further processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI-GPT&lt;/strong&gt;: Processes the prompts received from the Prompt Generator and provides suggestions based on the prompts. The suggestions or content are sent back to the Content Creation Tool.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Challenges and Solutions&lt;/h3&gt;
&lt;p&gt;As Zalando operates in 25 markets with different languages, we store the attributes of an article as attribute codes. One of the biggest challenges was translating the Zalando-specific attribute codes provided by Master Data (e.g. for the attribute &lt;code&gt;assortment_type&lt;/code&gt;, Master Data provides values such as &lt;code&gt;assortment_type_7312&lt;/code&gt; and &lt;code&gt;assortment_type_7841&lt;/code&gt;) into human-readable language that the GPT model can understand, and then translating the suggestions back into the Master Data-specific codes. The solution was to get the English translation of the possible attribute values (in this case &lt;code&gt;Petite&lt;/code&gt; and &lt;code&gt;Tall&lt;/code&gt;), wait for the GPT response, and then translate it back into the &lt;code&gt;attribute_code&lt;/code&gt;. As the suggestions directly impact customer experience, it was imperative for us to ensure the output of OpenAI was compatible with our APIs. We built a translation layer that converts OpenAI output into information directly usable by Zalando and discards the part that is not relevant.&lt;/p&gt;
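&lt;p&gt;The round trip described above can be sketched roughly as follows. This is a minimal illustration, not the production translation layer: the type and function names are hypothetical, and which code corresponds to &lt;code&gt;Petite&lt;/code&gt; and which to &lt;code&gt;Tall&lt;/code&gt; is assumed here purely for the example.&lt;/p&gt;

```typescript
// Hypothetical lookup for one attribute. The code-to-label pairing is assumed for illustration.
type LabelByCode = { [code: string]: string };

const assortmentType: LabelByCode = {
  assortment_type_7312: "Petite",
  assortment_type_7841: "Tall",
};

// English labels offered to the model as the allowed answers.
function allowedLabels(labels: LabelByCode): string[] {
  return Object.values(labels);
}

// Map a model answer back to an attribute code.
// Returns undefined when the answer matches no known label, so the caller can discard it.
function toAttributeCode(labels: LabelByCode, suggestion: string): string | undefined {
  const wanted = suggestion.trim().toLowerCase();
  const match = Object.entries(labels).find(([, label]) => label.toLowerCase() === wanted);
  return match ? match[0] : undefined;
}
```

&lt;p&gt;With this sketch, a model answer that matches a known label (ignoring case and surrounding whitespace) resolves to its code, while anything outside the allowed labels resolves to &lt;code&gt;undefined&lt;/code&gt; and is dropped rather than passed on to downstream APIs.&lt;/p&gt;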
&lt;p&gt;Another challenge was that some attributes shouldn't be filled for certain types of articles according to the internal guidelines, and the accuracy of predicted suggestions for these attributes was often poor. To address this, we introduced a mapping layer between product categories and the relevant information that should be shown to the customer. Furthermore, for complex product attributes we added custom guidelines to the prompt that give additional hints (e.g. differentiating between &lt;code&gt;V-neck&lt;/code&gt; and &lt;code&gt;Low cut V-neck collar&lt;/code&gt; types).&lt;/p&gt;
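&lt;p&gt;As a rough sketch of how such a mapping layer and per-attribute guidelines could feed into a prompt: the category names, attribute names, hint wording and prompt format below are all hypothetical and only illustrate the filtering idea.&lt;/p&gt;

```typescript
// Hypothetical category-to-attribute mapping. Real categories and attributes will differ.
const relevantAttributes: { [category: string]: string[] } = {
  dress: ["neckline", "length"],
  sneaker: ["upper_material"],
};

// Hypothetical per-attribute hints that get appended to the prompt.
const guidelines: { [attribute: string]: string } = {
  neckline: "Distinguish 'V-neck' from 'Low cut V-neck collar'.",
};

// Keep only the attributes that are relevant for the category and attach hints where available.
function buildPrompt(category: string, candidates: string[]): string {
  const allowed = relevantAttributes[category] ?? [];
  const lines = candidates
    .filter((attribute) => allowed.includes(attribute))
    .map((attribute) => {
      const hint = guidelines[attribute];
      return hint ? `- ${attribute} (${hint})` : `- ${attribute}`;
    });
  return ["Suggest values for these attributes:", ...lines].join("\n");
}
```

&lt;p&gt;Attributes that are not mapped to the article's category never reach the model in this sketch, which avoids requesting suggestions that the guidelines say should stay empty.&lt;/p&gt;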
&lt;p&gt;The GPT-4o model tends to suggest general attributes like &lt;code&gt;V-necks&lt;/code&gt; or &lt;code&gt;round necks&lt;/code&gt; for &lt;code&gt;necklines&lt;/code&gt; correctly, but can be less precise when it comes to more fashion-specific ones, like &lt;code&gt;deep scoop necks&lt;/code&gt;. This issue is more noticeable when using balanced datasets (where there’s an equal number of samples per attribute) compared to unbalanced ones (where the sample proportions reflect real-world trends). The risk is that less common or more specific fashion terms may be handled inaccurately or incompletely. That's one of the reasons why we created an aggregator service: to integrate multiple AI services, leveraging a wider variety of data sources, such as brand data dumps, partner contributions, and images, to improve the accuracy and completeness of the results.&lt;/p&gt;
&lt;p&gt;One of the challenges we encountered was reducing the infrastructure costs of suggestion generation, which were higher than expected. First, we stopped generating suggestions for some unsupported attribute sets. Second, we migrated to the GPT-4o model, which significantly lowered costs.&lt;/p&gt;
&lt;p&gt;A further challenge involved identifying the optimal set of images to enhance input quality while balancing cost efficiency. We found that some image types performed better than others: product-only front images delivered the best results, followed closely by front images featuring the products being worn by a model.&lt;/p&gt;
&lt;h3&gt;Results and Impact&lt;/h3&gt;
&lt;p&gt;The early results are very encouraging as we see an improvement in both data quality and coverage of attributes. The way we built our architecture helped us do a controlled rollout where we could easily include/exclude products or attributes with minimal effort. Involving our users early in product development brought great benefits, as the adoption was very smooth, and the content creation experts are now actively contributing to the prompts. We've achieved an accuracy rate of approximately 75%, and we're enriching around 50,000 attributes on average per week. As a next step, we will focus on improving accuracy for niche categories and expanding the coverage of the product information beyond the regular product attributes.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The architecture built around the Content Creation Copilot has proven to be a strong baseline for future use cases by providing an easy way of integrating future model sources and enhancing data accuracy. The next use case involves describing images with the most informative tags, which unblocks multiple applications, including content performance analytics and delivering better-targeted ads. Additionally, we will assist in generating suggestions for free text attributes and their translations.&lt;/p&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="UI"/><category term="Frontend"/></entry><entry><title>Ensuring Even Ad Spend on the Zalando Homepage: How Our New Bidding Algorithm Maximizes Value for Advertisers and Shoppers</title><link href="https://engineering.zalando.com/posts/2024/09/even-ad-spend-on-zalando-homepage.html" rel="alternate"/><published>2024-09-17T00:00:00+02:00</published><updated>2024-09-17T00:00:00+02:00</updated><author><name>Rui Gonçalves</name></author><id>tag:engineering.zalando.com,2024-09-17:/posts/2024/09/even-ad-spend-on-zalando-homepage.html</id><summary type="html">&lt;p&gt;Learn how Zalando improved ad exposure while maintaining a seamless shopping experience for Zalando users through a new bidding algorithm.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Zalando Marketing Services (ZMS) is Zalando's advertising platform. It helps brands create and manage campaigns on Zalando, increasing their visibility and improving performance at every stage of the marketing funnel, from awareness to purchase, within the Zalando marketplace.&lt;/p&gt;
&lt;p&gt;At ZMS, we're constantly innovating to optimize the advertising experience on the Zalando homepage. A key element of this is ensuring sponsored ads receive optimal exposure while maintaining a seamless shopping experience for Zalando users. This article dives into the challenge of achieving even ad spend and introduces our new bidding strategy designed to address it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Homepage content selection flow with real-time bidding" src="https://engineering.zalando.com/posts/2024/09/images/content-selection-flow-real-time-bidding.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Homepage content selection flow with real-time bidding&lt;/figcaption&gt;

&lt;h3&gt;The Challenge of Uneven Ad Spend on the Homepage&lt;/h3&gt;
&lt;p&gt;Imagine you're an advertiser running a campaign on the Zalando homepage.  Your goal is to maximize brand awareness by getting as many user views as possible for your ad.  You allocate a specific advertising budget for your campaign within a defined timeframe.&lt;/p&gt;
&lt;p&gt;However, a hidden hurdle exists: uneven ad spend.  Currently, ad placements on the homepage are determined by a real-time bidding system.  This system can lead to situations where your ad budget is exhausted early in the campaign period, limiting your potential reach.&lt;/p&gt;
&lt;p&gt;The Consequence?  Lower-than-desired ad views and potentially a missed opportunity to connect with your target audience.&lt;/p&gt;
&lt;h2&gt;The ZMS Solution: Introducing the Adjustment Factor Bidding Strategy&lt;/h2&gt;
&lt;p&gt;Our ZMS product team understands the importance of efficient ad spend for both advertisers and Zalando.  That's why we've developed a new bidding strategy, based on closed feedback loops.&lt;/p&gt;
&lt;p&gt;Imagine you're on a road trip in an electric car. You have a set amount of battery power to cover a specific distance. To reach your destination efficiently, you can't just use all your power at the beginning and speed down the highway.  Just like with uneven ad spend, this would leave you stranded before reaching your goal.  Instead, an electric car on a long trip with varying terrain needs to adjust its speed throughout the journey.  It might go faster on flat stretches to maintain an average speed and conserve battery for steeper hills. Similarly, our new bidding strategy avoids the "all-or-nothing" approach, ensuring advertising budget is used efficiently throughout advertisers’ campaigns to maximize reach.&lt;/p&gt;
&lt;p&gt;Here's how it works:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Budget allocation: advertising budget is allocated based on the traffic forecast of customers on the Zalando platform and distributed in hourly buckets.&lt;/li&gt;
&lt;li&gt;Monitoring budget allocation: The system continuously tracks the remaining budget for the campaign relative to the expected amount (the amount that would remain at any given time if the budget were spent evenly over the hour).&lt;/li&gt;
&lt;li&gt;Dynamic bid adjustments: Based on this comparison, the bidding strategy automatically adjusts the bid price of the advertiser’s ad. If the advertiser’s campaign is overspending, the bid is lowered. Conversely, if it's underspending, the bid is increased.&lt;/li&gt;
&lt;li&gt;Equilibrium through feedback control: This dynamic adjustment process ensures the ad budget is spent evenly, maximizing the number of potential viewers throughout the campaign duration.&lt;/li&gt;
&lt;/ul&gt;
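&lt;p&gt;The first two steps can be sketched as follows. The article does not spell out the exact allocation rule, so the proportional split and the function names below are assumptions for illustration only.&lt;/p&gt;

```typescript
// Split a budget into hourly buckets in proportion to forecast traffic (assumed rule).
function hourlyBuckets(budget: number, forecast: number[]): number[] {
  const total = forecast.reduce((sum, visits) => sum + visits, 0);
  if (total === 0) {
    // No forecast traffic at all: nothing to allocate.
    return forecast.map(() => 0);
  }
  return forecast.map((visits) => (budget * visits) / total);
}

// Budget that would remain after a fraction t of the hour if spending were perfectly even.
function expectedRemaining(bucket: number, t: number): number {
  return bucket * (1 - t);
}

// Positive when the campaign has more budget left than expected (underspending),
// negative when it has less left than expected (overspending).
function remainingGap(bucket: number, actualRemaining: number, t: number): number {
  return actualRemaining - expectedRemaining(bucket, t);
}
```

&lt;p&gt;The sign of &lt;code&gt;remainingGap&lt;/code&gt; is what the dynamic bid adjustment reacts to: a positive gap calls for a higher bid, a negative gap for a lower one.&lt;/p&gt;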
&lt;p&gt;&lt;img alt="Converging to even spending" src="https://engineering.zalando.com/posts/2024/09/images/converging-to-even-spending.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Converging to even spending&lt;/figcaption&gt;

&lt;h3&gt;Technical Deep Dive&lt;/h3&gt;
&lt;p&gt;Let's now break down the math behind our new bidding strategy.
Let &lt;span class="math"&gt;\(t\)&lt;/span&gt; be the fraction of the hour passed at a given point in time.
Let &lt;span class="math"&gt;\(spent_t\)&lt;/span&gt; be the fraction of budget spent at time &lt;span class="math"&gt;\(t\)&lt;/span&gt;, and &lt;span class="math"&gt;\(spent_t^{even}\)&lt;/span&gt; the ideal fraction of budget spent at time &lt;span class="math"&gt;\(t\)&lt;/span&gt; with even spending.
Note that &lt;span class="math"&gt;\(spent_t^{even}\)&lt;/span&gt; is equal to &lt;span class="math"&gt;\(t\)&lt;/span&gt; due to even spending.
The ratio between these values, &lt;span class="math"&gt;\(r_t = spent_t / spent_t^{even}\)&lt;/span&gt;, captures how close we are to even spending, and we want to achieve a value close to &lt;span class="math"&gt;\(1\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;In our previous bidding strategy, bid was directly proportional to the following factor:&lt;/p&gt;
&lt;div class="math"&gt;$$
1 - spent_t \cdot (1 - t) = 1 - r_t \cdot t \cdot (1 - t)
$$&lt;/div&gt;
&lt;p&gt;Taking the derivative of the factor w.r.t. &lt;span class="math"&gt;\(t\)&lt;/span&gt;, we obtain &lt;span class="math"&gt;\(r_t \cdot (2t - 1)\)&lt;/span&gt;, which is negative for &lt;span class="math"&gt;\(t &amp;lt; 1/2\)&lt;/span&gt; and positive for &lt;span class="math"&gt;\(t &amp;gt; 1/2\)&lt;/span&gt;.
In other words, regardless of the value of &lt;span class="math"&gt;\(r_t\)&lt;/span&gt; (i.e. over- or under-spending), the bid would decrease in the first half of the hour (negative slope w.r.t. &lt;span class="math"&gt;\(t\)&lt;/span&gt;) and increase in the second half of the hour.&lt;/p&gt;
&lt;p&gt;In the new bidding formula, bid is directly proportional to the following factor:&lt;/p&gt;
&lt;div class="math"&gt;$$
\frac{1 - spent_t}{1 - spent_t^{even}} = \frac{1 - r_t \cdot t}{1 - t}
$$&lt;/div&gt;
&lt;p&gt;Taking the derivative of this factor w.r.t. &lt;span class="math"&gt;\(t\)&lt;/span&gt; (treating &lt;span class="math"&gt;\(r_t\)&lt;/span&gt; as fixed), we obtain &lt;span class="math"&gt;\((1 - r_t) / (1 - t)^2\)&lt;/span&gt;, which is positive for &lt;span class="math"&gt;\(r_t &amp;lt; 1\)&lt;/span&gt; and negative for &lt;span class="math"&gt;\(r_t &amp;gt; 1\)&lt;/span&gt;.
That is, this formula ensures that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The bid increases when underspending.&lt;/li&gt;
&lt;li&gt;The bid decreases when overspending.&lt;/li&gt;
&lt;li&gt;The bid remains constant when spending is even.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a result, this bidding strategy converges to an even spending over time and achieves an equilibrium price under a given market condition (supply, demand, competition).&lt;/p&gt;
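&lt;p&gt;To make the difference between the two factors concrete, here is a small sketch that evaluates both formulas as written above. The function names are hypothetical, and the sketch only computes the multiplicative factor, not the full bid.&lt;/p&gt;

```typescript
// Previous factor: 1 - spent_t * (1 - t).
function oldFactor(spent: number, t: number): number {
  return 1 - spent * (1 - t);
}

// New factor: (1 - spent_t) / (1 - spent_t^even), with spent_t^even = t.
// Only defined before the hour is over, i.e. for t in [0, 1).
function newFactor(spent: number, t: number): number {
  return (1 - spent) / (1 - t);
}

// Halfway through the hour (t = 0.5):
const underspending = newFactor(0.25, 0.5); // 25% spent, r_t = 0.5
const even = newFactor(0.5, 0.5); // 50% spent, r_t = 1
const overspending = newFactor(0.75, 0.5); // 75% spent, r_t = 1.5
```

&lt;p&gt;Halfway through the hour, the new factor evaluates to 1.5 for the underspending campaign, 1 for the evenly spending one and 0.5 for the overspending one. The previous factor gives 0.875 and 0.625 for the under- and overspending cases: for a spent fraction between 0 and 1 it never exceeds 1, so it cannot scale the bid above its baseline even when the campaign is behind.&lt;/p&gt;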
&lt;h3&gt;Benefits for Advertisers and Shoppers&lt;/h3&gt;
&lt;p&gt;By leveraging the new bidding strategy, advertisers gain several key advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maximized reach: Achieve a more even distribution of ad views throughout your campaign, leading to a higher likelihood of reaching your target audience.&lt;/li&gt;
&lt;li&gt;Cost efficiency: Reduce your cost per view (CPV) by ensuring we use an optimal bid (not the max bid), while meeting your goals.&lt;/li&gt;
&lt;li&gt;Greater value: Get more value from your advertising budget, leading to a potentially higher return on investment (ROI).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Shoppers also benefit from this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved relevance: By allowing for a wider range of ads to compete for display, shoppers are more likely to see ads relevant to their interests.&lt;/li&gt;
&lt;li&gt;Seamless experience: The strategy maintains a balanced ad-to-content ratio, ensuring a smooth shopping experience on the homepage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Validation Through A/B Testing&lt;/h3&gt;
&lt;p&gt;To validate the effectiveness of this bidding strategy, we conducted a comprehensive A/B test with budget-split.  The results were clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased advertising views: Ads using the new strategy achieved a 10% increase in views compared to the previous approach, with CPV dropping accordingly by around 10%.&lt;/li&gt;
&lt;li&gt;Increased clicks on advertising content: the absolute number of clicks on ads increased by 23%.&lt;/li&gt;
&lt;li&gt;Enhanced click-through rate (CTR): The ratio between clicks and views improved by 11%, suggesting greater relevance of advertising content for Zalando customers.&lt;/li&gt;
&lt;li&gt;No significant impact on metrics of the overall homepage customer experience: a great indication of success, since we are delivering more sponsored content without harming that experience.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Uneven ad spend can hinder advertiser efforts and limit the value proposition for Zalando.  The new ZMS bidding strategy effectively addresses this challenge by ensuring a balanced distribution of ad spend.  With this approach, inspired by the principles of closed feedback loops, ZMS empowers advertisers to maximize the effectiveness of their campaigns while maintaining a positive shopping experience for our valued users.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ad Spend: old vs. new algorithm" src="https://engineering.zalando.com/posts/2024/09/images/ad-spend-old-vs-new-algorithm.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Ad spend: old vs. new algorithm&lt;/figcaption&gt;

&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Zalando"/><category term="AdTech"/><category term="Backend"/></entry><entry><title>OpenTelemetry for JavaScript Observability at Zalando</title><link href="https://engineering.zalando.com/posts/2024/07/opentelemetry-for-javascript-observability-at-zalando.html" rel="alternate"/><published>2024-07-29T00:00:00+02:00</published><updated>2024-07-29T00:00:00+02:00</updated><author><name>Mohit Karekar</name></author><id>tag:engineering.zalando.com,2024-07-29:/posts/2024/07/opentelemetry-for-javascript-observability-at-zalando.html</id><summary type="html">&lt;p&gt;How Zalando improved observability for Node.js and web applications using OpenTelemetry&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Cover - OpenTelemetry &amp;amp; Zalando" src="https://engineering.zalando.com/posts/2024/07/images/obssdk-cover.jpg#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;"What’s happening inside my application?" - an age-old question bothering anyone who deploys a software service. Packaging source code for an application makes it a black box for its users who can only interact with it through explicitly available APIs. Fortunately, we’ve had several developments in the field of observability in recent years that help us peek into this black box and react to anomalies.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; has become the widely-accepted open standard for application observability across the software engineering community. It evolved from the previous OpenTracing project which introduced standards for distributed tracing and brought all observability signals under one umbrella, introducing specifications and implementations. At Zalando as well, OpenTelemetry is the adopted standard for observability and our platform teams provide SDKs in several languages for engineers to instrument their applications.&lt;/p&gt;
&lt;p&gt;For applications running in a JavaScript environment, the story was quite different though. We have a significant number of Node.js applications, and before 2023 the observability state of these applications was quite poor. During an incident, on-call responders would try to locate the root cause of the issue only to find some applications in the request flow having no instrumentation at all. In one specific, &lt;a href="https://engineering.zalando.com/posts/2024/07/nodejs-tale-worker-threads.html"&gt;very interesting example&lt;/a&gt;, we had almost zero visibility into what the affected application was doing, which made understanding the root cause more difficult than it should be.&lt;/p&gt;
&lt;p&gt;Often, the reason for the missing visibility was not the complexity of implementing it, but rather the "mundane" effort engineers would have to put in. The true impact of good observability is often intangible and hence can lead to some complacency on the part of service owners. We wanted to solve this problem without adding an operational burden to the already busy engineering teams.&lt;/p&gt;
&lt;h2&gt;Standardised Node.js Observability&lt;/h2&gt;
&lt;p&gt;At the end of 2022, the SRE Enablement and the Web Platform teams at Zalando collaborated to build a Node.js observability SDK based on OpenTelemetry. Observability SDKs had already proven successful at Zalando, providing several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic configuration including out-of-the-box environment variable parsing.&lt;/li&gt;
&lt;li&gt;Standard semantic conventions and APIs across languages.&lt;/li&gt;
&lt;li&gt;Built-in auto-instrumentations and platform-specific metrics.&lt;/li&gt;
&lt;li&gt;Central control over use/restriction of features, e.g. security and compliance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our Node.js Observability SDK is a small wrapper on top of open-source core OpenTelemetry packages that adds Zalando-specific configuration and acts as a proxy for all underlying dependencies. We also decided to provide a set of Node.js critical metrics by default: CPU and memory usage, garbage collection metrics and event loop lag. SDK users can use a boolean flag in the initialization configuration to enable HTTP instrumentation, and optionally Express.js instrumentation. Moreover, the SDK can be initialised in a single statement.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SDK&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@zalando/observability-sdk-node&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SDK&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The SDK constructor takes an optional configuration argument, but thanks to the platform environment variables made available to any application deployed in Kubernetes at Zalando, the SDK is autoconfigured from these values. Calling the &lt;code&gt;start()&lt;/code&gt; function enables several features in the background:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Auto-instrumentations are registered, e.g. HTTP functions are monkey-patched to record span data during various network calls.&lt;/li&gt;
&lt;li&gt;In-built metric collection is enabled at a configured interval.&lt;/li&gt;
&lt;li&gt;Span and metric exporters are enabled to export telemetry data at a specified interval to the telemetry backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Providing these fundamental capabilities out-of-the-box made it easy to instrument Node.js applications, and the SDK saw a good rate of adoption.&lt;/p&gt;
&lt;h2&gt;Still blind on the Client&lt;/h2&gt;
&lt;p&gt;While we were improving server-side observability overall as a company, observability on the client side was still a distant concept for us. Before 2023, we had only baseline operational visibility into how our web applications were performing in our customers’ browsers, with Sentry error logging being the only tool in our arsenal. While console error logging helps, it does not provide much detail about why an issue occurred.&lt;/p&gt;
&lt;p&gt;One case where we needed this kind of visibility was our web checkout experience. There were known instances of a small portion of incoming requests being blocked by our web application firewall (WAF) during checkout as it flagged them as coming from bots. At times, these requests were sent by genuine customers, and there was no way to detect this as our tracing spans began on the server, specifically at the proxy level (&lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;). We could only have known how many customers were facing this issue if we had a way to connect a user interaction (e.g. a button click) to an incoming or missing request at our proxy.&lt;/p&gt;
&lt;p&gt;Taking inspiration from the server-side efforts in improving Node.js observability, we decided to start developing a web observability SDK, using corresponding OpenTelemetry packages.&lt;/p&gt;
&lt;h2&gt;Things are tricky on the Client side&lt;/h2&gt;
&lt;p&gt;We bootstrapped a minimal SDK to be used in web applications at Zalando which exposed tracing and metric collection APIs. Thanks to one of the early contributors to the Node.js SDK, we had already separated types and APIs into an independent package. This API package was then used to implement Node.js and web SDKs. This structure became especially useful while instrumenting isomorphic applications – those which run both on the server and client side.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;@zalando&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;observability&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;
&lt;span class="nv"&gt;@zalando&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;observability&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;
&lt;span class="nv"&gt;@zalando&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;observability&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;While developing the SDK, we realised that there are more operational challenges than technical effort in instrumenting on the client-side. These are the peculiarities of instrumenting on the web versus the server:&lt;/p&gt;
&lt;h3&gt;Performance Implications&lt;/h3&gt;
&lt;p&gt;On the web, every byte counts, so adding instrumentation packages can increase the page payload and hurt your website's performance. In the past, we tried to integrate some telemetry packages only to realise they added about 400 KB to the page size! There are ways to asynchronously load these packages, but some features are easiest to implement when run in the critical page load path (e.g. tracing page load, generating propagation context for API requests).&lt;/p&gt;
&lt;p&gt;We found OpenTelemetry packages to be very customizable, and in the end we could cherry-pick the packages that we considered crucial for the initial load and delay loading everything else. Overall, we added about 30 KB to our page size. While developing the SDK, we also came across &lt;a href="https://github.com/grafana/faro-web-sdk"&gt;Grafana Faro&lt;/a&gt;, which is a similar implementation for frontend observability by Grafana. If you are starting from scratch, it’s a great package to check out.&lt;/p&gt;
&lt;p&gt;Additionally, we kept the telemetry export requests off the critical path by sending them with &lt;code&gt;sendBeacon()&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Sending Telemetry Data and User Consent&lt;/h3&gt;
&lt;p&gt;The next challenge is where the browser should send the data, and whether you are even allowed to send it at all. On the server side, it’s easy, since the services receiving telemetry data are usually deployed in the same cluster and no special configuration is required for the host application. On the client side, though, you need to go through the public internet and hence need a publicly accessible endpoint for sending telemetry data. We used our edge proxy (Skipper) to route frontend telemetry to an internal collector. This also allowed us to implement certain endpoint protection measures like rate limits. To support adoption of the SDK in other applications, we also provided a custom template to deploy a proxy that would act as a telemetry backend.&lt;/p&gt;
&lt;p&gt;Collecting data from customers’ browsers needs their explicit consent as per GDPR. We had to be mindful while exporting telemetry data – sending the export request only if the user consented.&lt;/p&gt;
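&lt;p&gt;As a rough illustration of the two points above, the sketch below gates every export on consent and hands the payload to a beacon-style sender. The names &lt;code&gt;exportTelemetry&lt;/code&gt;, &lt;code&gt;BeaconSender&lt;/code&gt;, &lt;code&gt;TelemetryBatch&lt;/code&gt; and the &lt;code&gt;/telemetry&lt;/code&gt; path are hypothetical and not part of our SDK; in a browser, the sender would typically wrap &lt;code&gt;navigator.sendBeacon()&lt;/code&gt;.&lt;/p&gt;

```typescript
// Hypothetical sender signature. In a browser this could be
// (url, body) => navigator.sendBeacon(url, body), which queues the request
// in the background and returns false if it could not be queued.
type BeaconSender = (url: string, body: string) => boolean;

// Hypothetical batch shape, for illustration only.
interface TelemetryBatch {
  spans: object[];
  metrics: object[];
}

// Sends a batch only when the user has consented. Returns true if the batch
// was handed to the sender, false if it was dropped or could not be queued.
function exportTelemetry(
  batch: TelemetryBatch,
  hasConsent: boolean,
  send: BeaconSender,
  url: string = "/telemetry" // assumed public endpoint behind the edge proxy
): boolean {
  if (!hasConsent) {
    return false; // no consent: nothing leaves the browser
  }
  return send(url, JSON.stringify(batch));
}

// Usage with an in-memory sender standing in for navigator.sendBeacon.
const sent: string[] = [];
const fakeSender: BeaconSender = (url, body) => {
  sent.push(url + " " + body);
  return true;
};

const batch: TelemetryBatch = { spans: [{ name: "page_load" }], metrics: [] };
const droppedResult = exportTelemetry(batch, false, fakeSender);
const sentResult = exportTelemetry(batch, true, fakeSender);
```

&lt;p&gt;Injecting the sender keeps the consent check testable outside a browser; only the second call in the usage above reaches the sender.&lt;/p&gt;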
&lt;h2&gt;Unprecedented Visibility&lt;/h2&gt;
&lt;p&gt;Early this year, we rolled out the integration of the web SDK in Zalando’s web framework – &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;Rendering Engine&lt;/a&gt;. To start with, we traced runtime operations of the framework, e.g. page load, entity resolution and AJAX requests. We started receiving telemetry and gained unprecedented visibility into client-side operations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Page load ops" src="https://engineering.zalando.com/posts/2024/07/images/page-load-ops.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Client trace" src="https://engineering.zalando.com/posts/2024/07/images/client-trace.png"&gt;&lt;/p&gt;
&lt;h3&gt;Leveraging the Framework&lt;/h3&gt;
&lt;p&gt;Rendering Engine is an expressive web framework and has a concept of “renderers” as independent units that declare their data dependencies and UI. We decided to expose the capabilities of the SDK through platform APIs inside renderers to allow frontend developers to trace custom operations inside renderers. At a high level, this is how the API looks for a filter update operation on the client:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cm"&gt;/*&lt;/span&gt;
&lt;span class="cm"&gt; * This is a renderer in Rendering Engine.&lt;/span&gt;
&lt;span class="cm"&gt; */&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;view&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withQueries&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withProcessDependencies&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withRender&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;traceAs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;observability&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;traceAs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="cm"&gt;/*&lt;/span&gt;
&lt;span class="cm"&gt;     * withRender() is where the React component for the renderer is declared&lt;/span&gt;
&lt;span class="cm"&gt;     *&lt;/span&gt;
&lt;span class="cm"&gt;     * props.tools.observability has tools related to client-side observability&lt;/span&gt;
&lt;span class="cm"&gt;    */&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fetchFilteredProducts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;traceAs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fetch_filtered_products&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addTags&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;serviceClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`/search?q=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="c1"&gt;// process response&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addTags&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;finish&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;onClick&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fetchFilteredProducts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;shoes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)}&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Fetch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Shoes&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;traceAs&lt;/code&gt; function allows renderer developers to create a new span for a specific operation. The span can then be tagged with attributes, passed around functions and used to create new spans.&lt;/p&gt;
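&lt;p&gt;To make the "passed around functions and used to create new spans" part concrete, here is a minimal in-memory sketch. It is not the Rendering Engine implementation: &lt;code&gt;SketchSpan&lt;/code&gt;, &lt;code&gt;createSpan&lt;/code&gt; and the child-span method are assumptions for illustration, and nothing is exported to a backend.&lt;/p&gt;

```typescript
// Hypothetical shapes; the real platform API is not shown in this post.
interface Tags {
  [key: string]: string | number | boolean;
}

interface SketchSpan {
  name: string;
  parent: SketchSpan | null;
  tags: Tags;
  finished: boolean;
  addTags(tags: Tags): void;
  traceAs(childName: string): SketchSpan; // creates a child span
  finish(): void;
}

function createSpan(name: string, parent: SketchSpan | null): SketchSpan {
  const span: SketchSpan = {
    name,
    parent,
    tags: {},
    finished: false,
    addTags(tags) {
      Object.assign(span.tags, tags);
    },
    traceAs(childName) {
      return createSpan(childName, span);
    },
    finish() {
      span.finished = true;
    },
  };
  return span;
}

// Entry point comparable in spirit to props.tools.observability.traceAs.
function traceAs(name: string): SketchSpan {
  return createSpan(name, null);
}

// The parent span is passed explicitly to the helper, which derives a child.
function parseResponse(parentSpan: SketchSpan, body: string): SketchSpan {
  const child = parentSpan.traceAs("parse_response");
  child.addTags({ bytes: body.length });
  child.finish();
  return child;
}

const root = traceAs("fetch_filtered_products");
root.addTags({ filter: "shoes" });
const child = parseResponse(root, "[]");
root.finish();
```

&lt;p&gt;Because the parent is an ordinary argument, the parent-child relationship is explicit in the call site and does not depend on any ambient context.&lt;/p&gt;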
&lt;p&gt;This API allowed us to trace crucial client-side operations that were asynchronous in nature. We were previously depending on the status of incoming HTTP requests as a result of user interactions, which is an indirect, “pseudo” way of determining service health. Instrumenting user interactions directly made the visibility much more “real”.&lt;/p&gt;
&lt;h3&gt;Web Performance Metrics&lt;/h3&gt;
&lt;p&gt;For the web shop, we already had real user monitoring (RUM) in place to collect various web performance metrics including the web vitals. These metrics form a crucial part of our experimentation strategy when we release features that might impact page performance. Our existing infrastructure was custom, with a service for collecting and aggregating metrics and storing them in a database. While this worked great over the years, we missed flexibility in adding custom attributes to the collected metrics and thus correlating regressions with features was difficult.&lt;/p&gt;
&lt;p&gt;With the SDK already in the frontend application, we decided to enable OpenTelemetry metrics on the client-side. Since most of the implementation for recording metrics was already present, we only had to create a new exporter for OpenTelemetry. The feature was quickly rolled out and we started receiving core web vitals (FCP, LCP, INP, CLS) tagged with numerous attributes.&lt;/p&gt;
&lt;p&gt;One immediate application of these metrics was to measure performance impact of the newly created &lt;a href="https://engineering.zalando.com/posts/2024/05/theming-the-zalando-design-system.html"&gt;“designer” experience&lt;/a&gt;. These pages consisted of some complex &lt;a href="https://engineering.zalando.com/posts/2024/07/custom-navigational-transitions-ios.html"&gt;client-side animations&lt;/a&gt; and visualisations and the owning team wanted to measure how these affected overall web performance. We added a new attribute that denoted the current experience on the page and soon we could group and filter metrics on the basis of this attribute, thanks to all the existing tooling from ServiceNow Cloud Observability (previously Lightstep), built according to OpenTelemetry specifications.&lt;/p&gt;
&lt;p&gt;&lt;img alt="LCP per experience" src="https://engineering.zalando.com/posts/2024/07/images/lcp-designer.png"&gt;&lt;/p&gt;
&lt;p&gt;A side effect of this is that we no longer need our custom setup to collect metrics and can happily decommission it soon.&lt;/p&gt;
&lt;h2&gt;Challenges&lt;/h2&gt;
&lt;p&gt;Our adoption of OpenTelemetry had its own set of challenges.&lt;/p&gt;
&lt;h3&gt;Migration from OpenTracing&lt;/h3&gt;
&lt;p&gt;While most of the concepts in OpenTelemetry are similar to OpenTracing, the language SDK implementations have a different API compared to the corresponding OpenTracing implementations. The new APIs make it difficult to migrate existing instrumentation code, especially in a large codebase like ours. For example, the JavaScript OpenTelemetry SDK uses &lt;code&gt;context&lt;/code&gt; to track the currently active span, whereas in OpenTracing you have to pass the span object around manually between functions. The context approach is really useful, but we found that for an application already instrumented the OpenTracing way, it is more of a frustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1. Starting a span in OpenTracing&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;callOtherFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Starting an active span in OpenTelemetry&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;startActiveSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;callOtherFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Starting a span with custom context in OpenTelemetry&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getContextFromSomewhere&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;startSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;callOtherFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We ended up not relying on the active context and passing span objects explicitly instead, as this made it easier to migrate from OpenTracing (approach number 3 above).&lt;/p&gt;
&lt;p&gt;OpenTelemetry Node.js packages use &lt;a href="https://nodejs.org/api/async_context.html"&gt;AsyncLocalStorage&lt;/a&gt; to propagate context values through asynchronous function calls. On the web though, there is no such runtime API (&lt;a href="https://github.com/tc39/proposal-async-context"&gt;yet&lt;/a&gt;), and the same has to be achieved with &lt;a href="https://www.npmjs.com/package/zone.js?activeTab=readme"&gt;Zone.js&lt;/a&gt;, which monkey-patches global functions. We are not big fans of this, especially when done in the customer's browser, and hence opted out of context on the client side as well, falling back to manually passing span objects.&lt;/p&gt;
&lt;h3&gt;Metrics &amp;amp; Bucketing&lt;/h3&gt;
&lt;p&gt;Collecting metrics on the client side is tricky as they are usually standalone values sent only once per page load (e.g. core web vitals). To obtain a percentile distribution of, for example, the Largest Contentful Paint (LCP), a histogram instrument has to be used to record LCP values. These instruments are primarily designed to record values over time, e.g. memory usage in servers, and use value "buckets" to count the number of values that fall into each bucket. By default, the OpenTelemetry JavaScript histogram declared the following &lt;a href="https://github.com/open-telemetry/opentelemetry-js/blob/a6020fb113a60ae6abc1aa925fa6744880e7fa15/packages/sdk-metrics/src/view/Aggregation.ts#L109"&gt;buckets&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Using these bucket values would provide a skewed view of the recorded values, as the histogram assigns each value to the closest range and the typical range varies substantially between metrics. For example, LCP is usually in the range of 600 to 2000 milliseconds, while cumulative layout shift (CLS) ranges between 0 and 1.&lt;/p&gt;
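&lt;p&gt;The effect can be reproduced with a few lines of plain TypeScript. This is a sketch, not OpenTelemetry code: &lt;code&gt;bucketIndex&lt;/code&gt; is a hypothetical helper that assumes each boundary is an inclusive upper bound, and &lt;code&gt;clsBoundaries&lt;/code&gt; is a shortened, illustrative list rather than our production configuration.&lt;/p&gt;

```typescript
// Default boundaries quoted above.
const defaultBoundaries = [
  0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000,
];

// Shortened, illustrative CLS-sized boundaries (an assumption for this sketch).
const clsBoundaries = [0, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.25, 0.3];

// Returns the index of the bucket a value falls into, treating each boundary
// as an inclusive upper bound. Values above the last boundary land in the
// overflow bucket at index boundaries.length.
function bucketIndex(value: number, boundaries: number[]): number {
  const index = boundaries.findIndex((upperBound) => upperBound >= value);
  return index === -1 ? boundaries.length : index;
}

// Three clearly different CLS readings: good, needs improvement, poor.
const clsSamples = [0.02, 0.12, 0.28];

const withDefaults = clsSamples.map((v) => bucketIndex(v, defaultBoundaries));
const withCustom = clsSamples.map((v) => bucketIndex(v, clsBoundaries));

console.log(withDefaults); // every sample lands in bucket 1, the (0, 5] range
console.log(withCustom); // the samples land in three different buckets
```

&lt;p&gt;With the default boundaries, all three readings collapse into one bucket, so any percentile derived from the histogram cannot tell them apart. With CLS-sized boundaries, each reading keeps its own bucket.&lt;/p&gt;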
&lt;p&gt;Our solution was to define custom buckets through a &lt;a href="https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view"&gt;&lt;code&gt;view&lt;/code&gt;&lt;/a&gt;, which OpenTelemetry lets you declare as a custom aggregation. The API has been &lt;a href="https://open-telemetry.github.io/opentelemetry-js/interfaces/_opentelemetry_api.MetricAdvice.html#explicitBucketBoundaries"&gt;recently improved&lt;/a&gt; to make it easy to do this while creating a histogram. Our buckets for client-side metrics looked as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;metricBuckets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;fcp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;650&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;850&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;950&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;lcp&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;650&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;750&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;850&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;950&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1050&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1650&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;cumulativeLayoutShift&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.075&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.225&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.275&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I got a chance to discuss this limitation with OpenTelemetry contributors at the recent KubeCon in Paris and we also talked about the &lt;a href="https://opentelemetry.io/docs/specs/otel/logs/event-api/"&gt;events API&lt;/a&gt; which potentially could be a solution for browser-based metrics.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;p&gt;With the newly obtained visibility on the client side, we plan to introduce new &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;critical business operations (CBOs)&lt;/a&gt; and modify existing ones to reflect the operation more realistically. CBOs are operational markers for the health of a certain important user feature. Moving them to the client side helps us track their health at a finer level.&lt;/p&gt;
&lt;p&gt;Taking an example of the catalog page (a.k.a the product listing page): applying a filter is a critical user feature that allows customers to narrow down their search in their shopping journey. If they are not able to filter, they might drop out and this would negatively affect the business. Client-side tracing provides visibility into this segment of the user journey as it's mainly happening in customers’ browsers. Measuring the health of this operation and using alerts to get notified of anomalies can help detect issues in customer journeys faster, allowing us to deliver on a high quality of service for our customers.&lt;/p&gt;</content><category term="Zalando"/><category term="JavaScript"/><category term="Frontend"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Node.js and the tale of worker threads</title><link href="https://engineering.zalando.com/posts/2024/07/nodejs-tale-worker-threads.html" rel="alternate"/><published>2024-07-25T00:00:00+02:00</published><updated>2024-07-25T00:00:00+02:00</updated><author><name>Jeremy Colin</name></author><id>tag:engineering.zalando.com,2024-07-25:/posts/2024/07/nodejs-tale-worker-threads.html</id><summary type="html">&lt;p&gt;Join me on a Friday night on-call investigation into a rogue Node.js service.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;A disrupted gaming night&lt;/h2&gt;
&lt;p&gt;I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did.&lt;/p&gt;
&lt;p&gt;I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it". I joined a call and was then told that the observed impact was due to one of the dependencies: the translation service, for which my on-call rotation was responsible. The translation service was indeed very slow to respond (its p99 latency had increased from 100ms to 500ms) and its error rate had gone from 0 to 4%. This did not really explain why the service calling us (the campaign service) was on a cloud resource consumption spree.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Increased latency on translation service" src="https://engineering.zalando.com/posts/2024/07/images/translation-service-latencies.png#center"&gt;&lt;/p&gt;
&lt;p&gt;We started with distributed tracing; however, the campaign service was not instrumented, so we could not get much out of our tracing tooling. We did see some &lt;code&gt;context cancelled&lt;/code&gt; error messages on our request spans, which usually means that the connection was unexpectedly closed from the client side. We quickly moved on to logging and sure enough, we found the same evidence in the translation service logs: &lt;code&gt;java.lang.IllegalStateException: Response is closed&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We are relatively well instrumented at Zalando in terms of &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;operations&lt;/a&gt;, especially with built-in Kubernetes dashboards. Using our Kubernetes API Monitoring Clients dashboard we confirmed that the calling service (the campaign service) was misbehaving and instead of its usual 1 000 requests per minute to the translation service, it was making over 20 000 requests per minute.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Increased requests on translation service" src="https://engineering.zalando.com/posts/2024/07/images/translation-service-requests.png#center"&gt;&lt;/p&gt;
&lt;p&gt;It looked like the campaign service was effectively increasing the pressure on our translation service. This meant that our translation service was then slower to respond and sometimes not responding at all, which in turn somehow increased the amount of requests that the campaign service was making and the cloud resources it was consuming.&lt;/p&gt;
&lt;p&gt;We were looking at a positive feedback loop that was destabilising both systems. Fortunately for us, the loop eventually stopped when both systems reached their allocated cloud limits: memory for the campaign service and the maximum number of replicas for our translation service. This had been going on for several hours as the campaign service is not on the critical path of the customer journey, so 4% was a slow burn error and we were only paged because the team that owns the service started investigating this anomaly and found this interaction with our translation service.&lt;/p&gt;
&lt;p&gt;In an attempt to resolve the situation, we reduced the number of pods for the campaign service and allowed our translation service to scale up, and sure enough the situation improved by itself in a matter of minutes. As I was about to pick up my game controller again, I took one last look at the graphs and, lo and behold, the error rate was back up and the positive feedback loop had resumed, as if in defiance of my gaming night.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Increased latency on translation service again" src="https://engineering.zalando.com/posts/2024/07/images/translation-service-latencies-down-and-up.png#center"&gt;&lt;/p&gt;
&lt;h2&gt;Not so fast Tarnished&lt;/h2&gt;
&lt;p&gt;In Elden Ring, you have to retry boss fights quite a lot so I rolled up my sleeves and started investigating again. This time, resolved to understand the systems' "patterns".&lt;/p&gt;
&lt;p&gt;Taking another look at the campaign service logs was quite interesting to say the least. Yes, it did start with a bunch of &lt;code&gt;request failed&lt;/code&gt;, &lt;code&gt;read timeout&lt;/code&gt; but then it was followed by a lot of logs like &lt;code&gt;Worker fragment (pid: 51) died&lt;/code&gt; and &lt;code&gt;Worker 549 started&lt;/code&gt;. When I say a lot, I mean A LOT, more than 20 per second in total.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Campaign service logs" src="https://engineering.zalando.com/posts/2024/07/images/campaign-service-logs.png#center"&gt;&lt;/p&gt;
&lt;p&gt;At this point, we needed to understand where they were coming from and yes, I started reading the code on GitHub. We were dealing with a simple Node.js application. The entry point was a file called &lt;code&gt;cluster.js&lt;/code&gt; and the first thing it did was get the number of CPUs from the OS and spawn a worker for each CPU core.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cluster&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;os&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// master wrapper&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMaster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`Master &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt; is running`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`CPU Total &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// fork workers&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;exit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// when worker exits&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`Worker fragment (pid: &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;) died`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pretty smart right? Node.js is single-threaded and you don't want to leave those precious CPU cores idle. Well that depends on where your code is running!&lt;/p&gt;
&lt;p&gt;Following a migration, this service was now running on Kubernetes with pods requesting the equivalent of 1 CPU unit, so far so good. However, when called inside a Kubernetes container, the &lt;code&gt;os.cpus().length&lt;/code&gt; method returned the number of cores available on the host machine, instead of the amount of CPU allocated to the container by Kubernetes. At this point, the campaign service was running on machines with 48 cores so it was spawning a whopping 48 processes (yes, for Node.js, this is a whopping number, I can see you Golang people judging us). In fact, using cluster mode for Node.js in a Kubernetes environment is discouraged because Kubernetes can help you do this in a simple way out of the box, for example by setting cpu request to 1000m to allocate one CPU core per pod.&lt;/p&gt;
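&lt;p&gt;As an illustration, a worker count can be derived from the container's CPU limit rather than from the host. The sketch below is not code from the campaign service: it assumes cgroup v2, where a CPU limit (not just a request) is exposed in &lt;code&gt;/sys/fs/cgroup/cpu.max&lt;/code&gt;, and the helper names &lt;code&gt;parseCpuMax&lt;/code&gt;, &lt;code&gt;workerCount&lt;/code&gt; and &lt;code&gt;readCpuMax&lt;/code&gt; are hypothetical.&lt;/p&gt;

```javascript
const fs = require("fs");
const os = require("os");

// Parse the content of a cgroup v2 cpu.max file: "QUOTA PERIOD", or
// "max PERIOD" when no limit is set. Returns the limit in cores, or null.
function parseCpuMax(content) {
  const [quota, period] = content.trim().split(/\s+/);
  if (quota === "max") return null;
  const limit = Number(quota) / Number(period);
  return Number.isFinite(limit) ? limit : null;
}

// Hypothetical helper: never fork more workers than the container may use.
function workerCount(hostCpus, cpuMaxContent) {
  const limit = cpuMaxContent === null ? null : parseCpuMax(cpuMaxContent);
  if (limit === null) return hostCpus;
  return Math.max(1, Math.min(hostCpus, Math.floor(limit)));
}

// Assumed path for cgroup v2; returns null when the file cannot be read.
function readCpuMax(path = "/sys/fs/cgroup/cpu.max") {
  try {
    return fs.readFileSync(path, "utf8");
  } catch {
    return null;
  }
}

console.log(workerCount(os.cpus().length, readCpuMax()));
```

&lt;p&gt;Under these assumptions, a pod limited to one CPU on a 48-core node would get a single worker instead of 48. Without a limit, the sketch falls back to the host core count, so it would not have helped a pod that only sets a request.&lt;/p&gt;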
&lt;p&gt;Another interesting thing we could read in &lt;code&gt;cluster.js&lt;/code&gt; was that when a worker thread exited, it immediately spawned another one, and we could quickly sense how this could lead to a dangerous situation. Well, while that explained the high number of workers and the logs we saw above, it still didn't explain why they kept exiting and spawning.&lt;/p&gt;
&lt;p&gt;Enter &lt;code&gt;translation-fetcher.js&lt;/code&gt;, a file that exposes a method to fetch translations from a remote API (our translation service, for which I was paged). Interestingly, when the fetch call fails, the catch clause calls &lt;code&gt;process.exit(1)&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;TranslationManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fetchAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fallbackFilename&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./fallback-translations.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fallbackFilename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So there we had it! We had 48 forked worker processes, most of which were exiting, respawning and trying to fetch translations again on startup. We felt pretty confident that we understood what was happening as this was the only place in the whole codebase where a worker thread could exit. We also concluded that the fact that our translation service was slower to respond and sometimes not at all was what fed the positive feedback loop I described above. Indeed, if the call to the translation service failed, the worker thread was killed and a new one spawned, triggering a new call to the translation service, and so on.&lt;/p&gt;
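&lt;p&gt;To illustrate why an immediate respawn is risky, here is a minimal sketch of a respawn loop throttled with exponential backoff. It is not code from the campaign service and not the fix we shipped: &lt;code&gt;respawnDelayMs&lt;/code&gt; and &lt;code&gt;superviseWorkers&lt;/code&gt; are hypothetical names, and the base delay and cap are arbitrary values.&lt;/p&gt;

```javascript
// Hypothetical helper: delay before the n-th consecutive respawn (n starts at 0).
function respawnDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wiring for a primary process like the one in cluster.js.
// `cluster` is the Node.js cluster module (or any emitter with fork()).
function superviseWorkers(cluster, schedule = setTimeout) {
  let consecutiveExits = 0;

  cluster.on("exit", (worker) => {
    const delay = respawnDelayMs(consecutiveExits);
    consecutiveExits += 1;
    console.warn(`Worker (pid: ${worker.process.pid}) died, respawning in ${delay}ms`);
    schedule(() => cluster.fork(), delay);
  });

  // A worker that starts listening is treated as healthy and resets the counter.
  cluster.on("listening", () => {
    consecutiveExits = 0;
  });
}

// Usage in the primary process: superviseWorkers(require("cluster"));
```

&lt;p&gt;With these values, a crash loop would slow down to one respawn every 30 seconds instead of more than 20 per second, which limits how much pressure dying workers can put on a dependency.&lt;/p&gt;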
&lt;p&gt;Now it was time to patch the issue, so I could get back to being slain by monsters in my video game. We updated the service to no longer use cluster mode as in fact a few pods would be more than able to handle the load even at peak traffic. We struggled to deploy the service to production as it hadn't been deployed for a while and we were missing some permissions, but that's too boring a story to go into. Once the service was deployed, the number of requests to our translation service dropped from 20 000 requests per minute to 100 requests per minute and the health of our translation service quickly recovered and the service even scaled down. What happened next in Elden Ring will stay in Elden Ring.&lt;/p&gt;
&lt;h2&gt;Digging deeper&lt;/h2&gt;
&lt;p&gt;Fast forward to Monday, we start working on a detailed post-mortem analysis describing what I wrote above and I decide to write this up as our Site Reliability Engineering (SRE) team loves to hate on Node.js. When I get to the part where I talk about translation-fetcher.js, I get perplexed. It does not really make sense to call &lt;code&gt;process.exit()&lt;/code&gt; in a live environment.&lt;/p&gt;
&lt;p&gt;Also, what about the response closed and context cancelled errors we were seeing? They did not match our current understanding of the worker being killed after the call itself failed. As I hate to share something I do not have a very good understanding of, I dove once more into the campaign service code and, lo and behold, I found a huge oversight we had made on that Friday night. The &lt;code&gt;translation-fetcher.js&lt;/code&gt; code was not being called in the live environment; it was another file, obviously called &lt;code&gt;translation.js&lt;/code&gt;, that was being called on application startup, still calling our translation service but returning fallbacks if the call failed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;initTranslations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;TranslationManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fetchAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;catch error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fallbackTranslations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;initialData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;initialData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;initAndPrefetchOAuth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;initTranslations&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Server started&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So there never was a positive feedback loop with the translation service. It was all in our heads and I felt a bit stupid about it.&lt;/p&gt;
&lt;p&gt;What was happening then? We still didn't understand why workers were being killed and respawning, which led to a very high amount of requests to our translation service. That did, however, put the focus back on the campaign service: what happened at 2am on that Friday that had never happened before, destabilising the service? Could the logs tell us more?&lt;/p&gt;
&lt;p&gt;Luckily for me, someone was curious about the number of CPUs allocated and as the main application started, the code was logging the number of CPUs in the machine. So I scanned the last 30 days of logs and got the history of the number of CPUs for the allocated machines: it was always 4, 8 or 16. Well, except for last Friday but also on the 6th of April 2022 at 10:49 when the AWS gods had gifted us a 48-core machine, interesting... What was the state of the application at that point in time? Well it wasn't great: one pod, unsurprisingly scheduled on the node (machine) with the 48 cores, was over-utilising both its CPU and memory allocations and was repeatedly being killed. At the exact same time, our beloved translation service had also begun to consume more resources, scaling massively from 4 to 20 pods despite only receiving twice as many requests.&lt;/p&gt;
&lt;p&gt;Why did it not escalate at this point? Because once the pod was killed, despite being replaced twice by a pod on the same 48-core node, the third time it was replaced by a pod on a different node with only 16 cores. The campaign service generously requested 2GB of memory for each of its pods so with 4, 8 or 16 cores, it was only spawning 4, 8 or 16 extra workers on top of the main process. It turns out that the campaign service application process needs around 120 MB of memory to run properly so it was painfully able to accommodate up to 16 cores, but 48 cores meant that each process only had around 40 MB of memory (which is still 10 000 times more than the Apollo guidance computer that got us to the moon by the way) and around 20m CPU ("twenty millicpu"), which is really not that much for a single thread.&lt;/p&gt;
&lt;p&gt;At this point, I still did not understand why the Node.js worker threads kept dying. I had an intuition that it was due to the low amount of resources available, but I could not see any garbage collection or memory issues in the stack traces. I decided to run the service locally, updated the cluster file to spawn 50 worker threads regardless of the number of CPUs, built it and started it in a Docker container. At first, I gave the container a single core from my 5-year-old MacBook and despite being excruciatingly slow, every worker thread spawned and triggered its initial request to get the translations. I repeated the operation, this time giving the container only 1000MB of memory and sure enough, after spawning around half of the workers, I saw the same logs as in production: &lt;code&gt;Worker fragment (pid: 1) died&lt;/code&gt;, &lt;code&gt;Worker 31 started&lt;/code&gt;.&lt;/p&gt;
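&lt;p&gt;The arithmetic behind those numbers can be sketched as follows. It assumes that the pod's 2GB (taken here as 2048 MB) and 1 CPU (1000m) are split evenly between the main process and one worker per core, which is a simplification; &lt;code&gt;perProcessBudget&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```javascript
// Hypothetical helper: split a pod's resources evenly across the main
// process plus one worker per core reported by the OS.
function perProcessBudget(cores, podMemoryMb = 2048, podCpuMillis = 1000) {
  const processes = cores + 1; // workers + main process
  return {
    processes,
    memoryMb: Math.floor(podMemoryMb / processes),
    cpuMillis: Math.floor(podCpuMillis / processes),
  };
}

for (const cores of [4, 8, 16, 48]) {
  console.log(cores, perProcessBudget(cores));
}
```

&lt;p&gt;For 16 cores this gives 17 processes with about 120 MB each, right at what the application needs. For 48 cores it gives 49 processes with about 41 MB and 20m CPU each.&lt;/p&gt;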
&lt;p&gt;That was the aha moment. Up to that point, I was expecting a clue as to why a worker would be killed by Node.js, but it never came and it turns out that Node.js simply starts killing worker threads when it needs to reclaim memory. And if you remember the code in &lt;code&gt;cluster.js&lt;/code&gt;, immediately after the worker thread exited, the application spawned another one, so we end up with lots of worker threads spawning and dying in quick succession, living just long enough to say hello to our translation service. This also explains very well the context cancelled errors we saw in the translation service, because when the worker thread dies, the socket it created unexpectedly hangs up. It also explains well the read timeout errors in the campaign service as the processes did not have enough time (due to their very low CPU resource allocation) to read the translation service response. Unfortunately, this information was not readily available to us because the campaign service did not instrument its event loop lag, the degradation of which is a common root cause of API call read timeouts.&lt;/p&gt;
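&lt;p&gt;For readers unfamiliar with event loop lag, here is a minimal sketch of how it can be sampled with the built-in &lt;code&gt;perf_hooks&lt;/code&gt; module. It is not the Zalando SDK; &lt;code&gt;startEventLoopLagMonitor&lt;/code&gt; and &lt;code&gt;nsToMs&lt;/code&gt; are hypothetical helpers, and the 20ms resolution is an arbitrary choice.&lt;/p&gt;

```javascript
const { monitorEventLoopDelay } = require("perf_hooks");

// The histogram reports nanoseconds; convert to milliseconds for readability.
function nsToMs(ns) {
  return ns / 1e6;
}

// Hypothetical helper: start sampling the event loop delay and return a
// function that reports the values seen since the previous call.
function startEventLoopLagMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();

  return function snapshot() {
    const result = {
      p99Ms: nsToMs(histogram.percentile(99)),
      maxMs: nsToMs(histogram.max),
    };
    histogram.reset();
    return result;
  };
}

// Usage: call once at startup, then export snapshot() values as metrics.
// const snapshot = startEventLoopLagMonitor();
// setInterval(() => console.log(snapshot()), 10000).unref();
```

&lt;p&gt;A process starved of CPU, like the workers described above, would likely show up as a sharp rise in these values well before requests start timing out.&lt;/p&gt;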
&lt;h2&gt;Building better observability&lt;/h2&gt;
&lt;p&gt;This story happened back in April 2022 and was one of the motivations for developing a Zalando Observability SDK for Node.js. Two years later, we have 53 Node.js applications instrumented with the SDK, which means that investigating incidents involving Node.js is now easier with common signals readily available. This will be the topic of a subsequent blog post. Stay tuned!&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="SRE"/><category term="Backend"/></entry><entry><title>End-to-end test probes with Playwright</title><link href="https://engineering.zalando.com/posts/2024/07/end-to-end-test-probes-with-playwright.html" rel="alternate"/><published>2024-07-19T00:00:00+02:00</published><updated>2024-07-19T00:00:00+02:00</updated><author><name>Jeremy Colin</name></author><id>tag:engineering.zalando.com,2024-07-19:/posts/2024/07/end-to-end-test-probes-with-playwright.html</id><summary type="html">&lt;p&gt;Learn how we set up reliable automated end-to-end test probes for our Zalando website using Playwright&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Why automated end-to-end tests?&lt;/h2&gt;
&lt;p&gt;What are automated end-to-end tests? Do you need them at all? In this blog post we dive into the ugly side of automated end-to-end testing, what we struggled with at Zalando, what worked well for us and our latest solution with end-to-end test probes.&lt;/p&gt;
&lt;p&gt;Automated end-to-end tests continue to polarise the industry, with some leaders advocating for them and others rightfully questioning their return on investment and recommending investing in monitoring and alerting systems instead.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Twitter post on value of e2e" src="https://engineering.zalando.com/posts/2024/07/images/twitter-screenshot.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center; font-size: 1em; margin-top:-1.4em; margin-bottom: 1.4em"&gt;
    &lt;a href="https://twitter.com/GergelyOrosz/status/1792173032653332543"&gt;Tweet on end-to-end testing from @GergerlyOrosz on May 19th, 2024&lt;/a&gt;
&lt;/figcaption&gt;

&lt;p&gt;Of course, the right approach always depends on your product and the impact of your application being unavailable for even a short period of time. At Zalando, the disruption of a critical customer journey can quickly add up to millions in lost revenue so there is an obvious value for us in ensuring the high quality of our releases and automated end-to-end tests are one of the best tools for the job. So when we release new versions of our &lt;a href="https://en.zalando.de/"&gt;Zalando website&lt;/a&gt; multiple times a day in a completely autonomous manner, each release goes through an automated quality assurance pipeline that includes end-to-end tests written with &lt;a href="https://www.cypress.io/"&gt;Cypress&lt;/a&gt;.&lt;/p&gt;
&lt;div style="background: #e2e8f0; padding: 1rem"&gt;
&lt;b&gt;What are automated end-to-end tests?&lt;/b&gt;
&lt;br /&gt;
&lt;br /&gt;
Automated end-to-end tests simulate real user interactions with an application to ensure that the entire application stack works correctly from the user interface to the backend. These tests typically run in a headless browser environment and are thus easily integrated into continuous integration and delivery (CI/CD) pipelines. By automating these tests, teams can efficiently detect and address issues early, ensure regression testing, and maintain application quality as the code base evolves.
&lt;/div&gt;

&lt;h2&gt;Investing in automated end-to-end tests&lt;/h2&gt;
&lt;p&gt;It really paid off for Zalando and helped us find bugs early on that would otherwise have caused major incidents. It has not been all nice and shiny though as we experienced what Gergely was complaining about: the tests were taxing to maintain and the most frustrating part of it all was that they were still a bit flaky. They had a success rate of around 80%, but with around 120 builds a day, that still meant an average of 24 builds a day which were failing as false positives, causing unnecessary friction.&lt;/p&gt;
&lt;p&gt;We doubled down on our investment in these tests. This included creating better test setup context, as we have highly dynamic content on Zalando and our product pages are highly contextual, sometimes showing products that are not yet released to build anticipation and for which we obviously could not trigger the add to cart flow. We also improved our selectors and added a mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail when eagerly executing test scripts on a non-interactive UI. Our efforts increased the tests' reliability to the 95% range and we felt pretty good about it.&lt;/p&gt;
&lt;h2&gt;A new class of issues&lt;/h2&gt;
&lt;p&gt;You can imagine our disappointment when we had a major incident due to front-end interactivity issues where React hydration crashed on a large number of our product detail pages, preventing users from selecting product sizes and adding products to their shopping carts. The issue was large enough to have a business impact, but just not large enough to trigger an automated alert. How did this regression sneak in? It turned out that the incident was triggered by new and incomplete content published to our headless CMS which broke the front-end API contract with our API gateway and ultimately led to broken interactivity. We had React error boundaries in place, however it turned out that these weren't working for the eagerly-hydrated part of our product pages.&lt;/p&gt;
&lt;p&gt;So we were almost back to square one: no matter how much we had invested in our end-to-end test automation, external factors could still lead to broken pages. Obviously, we will tighten up our monitoring and alerting as part of the incident process which seeks to systematically address contributing factors, but we also wanted to catch such interactivity issues more consistently. An idea came to mind: why not run our automated end-to-end tests periodically and alert when they fail? However, remember we had only achieved a 95% success rate with our end-to-end tests. If we were to run them every 30 minutes to ensure that our website was working as expected and page our on-call team upon failures, alerts would trigger several times a day and possibly at night, leading to incident fatigue for the on-call team – a state we did not want to be in. So we needed to further increase the reliability of our end-to-end tests if this was to become a viable solution.&lt;/p&gt;
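&lt;p&gt;To make the paging concern concrete, here is a rough back-of-the-envelope sketch. It assumes that runs are independent and that every failure at a 95% success rate is a false alarm; the function name is hypothetical and not part of our tooling.&lt;/p&gt;

```typescript
// Hypothetical helper: expected number of false alerts per day for a probe
// that runs on a fixed schedule and pages on every failure.
function expectedFalseAlertsPerDay(successRate: number, intervalMinutes: number): number {
  const runsPerDay = (24 * 60) / intervalMinutes;
  return runsPerDay * (1 - successRate);
}

// 95% success rate, one run every 30 minutes: 48 runs a day.
const alertsPerDay = expectedFalseAlertsPerDay(0.95, 30);
console.log(alertsPerDay.toFixed(1)); // prints "2.4"
```

&lt;p&gt;Under these assumptions, the on-call team would be paged for nothing more than twice a day on average.&lt;/p&gt;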
&lt;h2&gt;A simpler and better approach&lt;/h2&gt;
&lt;p&gt;We went back to the drawing board: what we needed was higher resiliency and one of the ways to achieve this is often through simplification. We decided that for the end-to-end test probes we would run a cron job with scenarios covering critical customer journeys. We started with a few scenarios: one test would cover landing on our home page, browsing to a gender page and clicking on a product; another would cover landing on our catalog page, applying a filter and clicking on a product; and a final one would cover landing on a product page, selecting a size, adding the product to the cart and starting the checkout process. By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives.&lt;/p&gt;
&lt;p&gt;Around the same time, we also held our internal &lt;a href="https://engineering.zalando.com/posts/2024/06/hosting-an-internal-engineering-conference.html"&gt;Zalando Engineering Conference&lt;/a&gt; and one of the talks was about scaling automated end-to-end testing. Playwright, an end-to-end testing solution developed by Microsoft, was presented as a great fit for this thanks to its strong focus on resilient testing. Indeed, Playwright features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"auto-wait" (no artificial timeouts)&lt;/li&gt;
&lt;li&gt;"auto-retry" (web assertions), eliminating key causes for flaky tests&lt;/li&gt;
&lt;li&gt;rich tooling options (tracing, time-travel) to debug and fix issues if failures occur&lt;/li&gt;
&lt;li&gt;a unified API which works across all modern browsers&lt;/li&gt;
&lt;li&gt;TypeScript out of the box&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This was very compelling so we decided to use Playwright for these end-to-end test probes.&lt;/p&gt;
&lt;p&gt;It was easy to get up and running with Playwright, especially for our now simple scenarios. We used fixtures to set up independent test contexts for scenarios such as getting a good product candidate for the product page landing test and disabling our cookie consent banner. Playwright's API was simple to pick up, making use of promises natively and augmenting standard CSS selectors which made us hit the ground running super quickly. Here is the final code for our catalog landing test which is only a few lines of code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Test catalog landing journey for zalando&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//  navigate to catalog page&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;catalogNav&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;catalogLink&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;catalogNav&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nx"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;toHaveURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// we only wait to simulate a &amp;quot;real user behavior&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// with playwright this is not necessary&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;waitForTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;button&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sr"&gt;/farbe/i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;label[for=colors-BLACK]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/speichern/i&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getByTestId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is-loading&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nx"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getByTestId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is-loading&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;article[role=link]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a[href$=&amp;quot;.html&amp;quot;]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;waitForLoadState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;domcontentloaded&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;toHaveURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/\.html/i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We set up the tests to run on a 30-minute cron job and instead of paging immediately when they failed, we created a low-priority alert that emailed the team to validate their reliability using a "shadow" mode. And it did trigger a couple of times, especially over the weekend. Each time we captured HTML reports as logs so that we could understand the issue, improve our selectors, implement local retry loops with &lt;a href="https://playwright.dev/docs/test-assertions#expecttopass"&gt;expect.toPass&lt;/a&gt; and even cover tricky edges with selectors targeting non-visible content thanks to Playwright's &lt;a href="https://playwright.dev/docs/other-locators#css-locator"&gt;automatic augmentation of pseudo-classes&lt;/a&gt; like :visible. After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed. So far they have only paged us once, and that was during an incident where the page was actually not working.&lt;/p&gt;
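&lt;p&gt;The idea behind a local retry loop can be sketched without any dependency. This is a conceptual illustration only and not how Playwright implements &lt;code&gt;expect.toPass&lt;/code&gt;; the helper &lt;code&gt;retryUntilPasses&lt;/code&gt; and its parameters are hypothetical names.&lt;/p&gt;

```typescript
// Conceptual sketch of a local retry loop; NOT Playwright's implementation.
// `retryUntilPasses` and its parameters are hypothetical names.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryUntilPasses(
  block: () => unknown, // may be sync or async; a throw counts as a failed attempt
  maxAttempts: number,
  delayMs: number,
) {
  let attempt = 0;
  while (true) {
    attempt += 1;
    try {
      await block();
      return attempt; // number of attempts that were needed
    } catch (error) {
      if (attempt >= maxAttempts) throw error; // give up and surface the last failure
      await sleep(delayMs);
    }
  }
}

// Usage: a block that fails twice before passing.
let calls = 0;
retryUntilPasses(() => {
  calls += 1;
  if (calls !== 3) throw new Error("not ready yet");
}, 5, 10).then((attempts) => console.log(attempts)); // prints 3
```

&lt;p&gt;Retrying only the flaky block, rather than the whole scenario, keeps a probe fast while still absorbing short-lived glitches.&lt;/p&gt;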
&lt;h2&gt;Outlook&lt;/h2&gt;
&lt;p&gt;It has been quite a journey to get to where we are now, but we feel pretty good about our setup, which we could not have achieved without focusing on simplicity and betting on Playwright's reliability. If, as for us, production downtime is damaging to your business, we believe that implementing end-to-end test probes could be a useful addition to your toolkit. Our main advice would be to keep these tests focused on your critical customer journeys, write good selectors and iterate in a shadow mode before alerting in production.&lt;/p&gt;
&lt;p&gt;We are planning to increase the number of scenarios for the end-to-end probes to include more of our &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;Critical Business Operations&lt;/a&gt; (CBOs) and we are also looking at extending this idea to our mobile apps.&lt;/p&gt;</content><category term="Zalando"/><category term="Testing"/><category term="Frontend"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Custom Navigational Transitions in iOS</title><link href="https://engineering.zalando.com/posts/2024/07/custom-navigational-transitions-ios.html" rel="alternate"/><published>2024-07-04T00:00:00+02:00</published><updated>2024-07-04T00:00:00+02:00</updated><author><name>Kanupriya Gupta</name></author><id>tag:engineering.zalando.com,2024-07-04:/posts/2024/07/custom-navigational-transitions-ios.html</id><summary type="html">&lt;p&gt;Explores how our iOS App incorporates custom navigation in a backend-driven UI&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;In present-day mobile development, the emphasis lies on achieving both speed and personalization. As the demand for rapid delivery intensifies, continuously improving the user experience for customers is essential.&lt;/p&gt;
&lt;p&gt;One avenue through which this aspiration materializes is via screen transitions. These transitions serve a dual purpose: they facilitate seamless navigation while striving to establish a sense of continuity in user interactions, transcending the mere act of moving from one screen to another.&lt;/p&gt;
&lt;p&gt;In this article, we will focus on screen transitions for iOS apps. Rather than implementing a custom transition for a basic scenario, which many resources already cover, we will explore a real example from Zalando's iOS App showcasing navigation between two screens that are entirely backend-driven.&lt;/p&gt;
&lt;h3&gt;Navigation Transition&lt;/h3&gt;
&lt;p&gt;In our prior article &lt;a href="https://engineering.zalando.com/posts/2024/05/appcraft.html"&gt;Backend-driven UI for mobile apps&lt;/a&gt;, we explained how the screen functions as a composed structure of a limited number of primitive components within the framework. So our problem space is: &lt;em&gt;How to enhance navigational experience in a Backend-driven UI system?&lt;/em&gt; To understand that challenge, we will break down what is needed to implement one. But first, let's have a look at the status quo of a transition from an outfit-card to outfit-details screen.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Current Outfits Transition" src="https://engineering.zalando.com/posts/2024/07/images/transition-current.gif#center"&gt;&lt;/p&gt;
&lt;p&gt;Here, one of the outfits from the carousel is tapped and an outfit-details screen is &lt;code&gt;pushed&lt;/code&gt; on the navigation stack with the default transition. Notice that the image in the carousel and the image on the detail screen are the same, so the interaction could be enhanced in many ways here. One way is to build a custom navigational experience, where the image that is interacted with grows into the detailed view (similar transitions can be noticed on the iOS App Store for reference).&lt;/p&gt;
&lt;p&gt;For static content, implementing the &lt;code&gt;UIViewControllerAnimatedTransitioning&lt;/code&gt; protocol provided by UIKit's &lt;a href="https://developer.apple.com/documentation/uikit/animation_and_haptics/view_controller_transitions"&gt;View Controller Transitions API&lt;/a&gt; and using a custom navigation delegate would be enough. In our scenario, however, the process isn't straightforward for the following reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Backend-driven UI:&lt;/strong&gt; Given that the UI of the initial screen is determined by the backend, identifying the user's interaction—whether it's with an image or a layout—poses a challenge. We require precise information about the tapped view, including its position and size (i.e., its frame within the screen).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generic deep-link navigation:&lt;/strong&gt; With a generic deep-link navigation approach, the URL is passed to the router, which handles the navigation independently in a separate module. This means that the router lacks the context of the next screen, complicating the transition process further.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When an outfit-card is tapped (event), it triggers a deep-link navigation (action). This action is propagated from the &lt;a href="https://engineering.zalando.com/posts/2024/05/appcraft.html"&gt;Appcraft&lt;/a&gt; iOS framework to the Zalando App to be handled by a common router. We can intercept this flow and identify the location of the tap event. Once we do that, we can take a snapshot of the tapped view, which in this case is an outfit-card. This solves the first problem stated above.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Code caption: Method initially used to capture the tapped view and convert into an image&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;extension&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;asImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIImage&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIGraphicsImageRenderer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rendererContext&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;drawHierarchy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;afterScreenUpdates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Code caption: Once we have a snapshot to work with, we propagate the UIImage and its frame to the framework's navigation service, enabling us to pass this information to the router for handling the transition. We then implement a navigation controller delegate and &lt;code&gt;UIViewControllerAnimatedTransitioning&lt;/code&gt;, giving a transition process similar to the following:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;// At the call site&lt;/span&gt;
&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;navigationController&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UINavigationController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;rootViewController&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;initialViewController&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;navigationController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CustomNavigationDelegate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;navigationController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pushViewController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextViewController&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                        &lt;/span&gt;&lt;span class="nl"&gt;animated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Custom Navigation Delegate&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CustomNavigationDelegate&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;NSObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="bp"&gt;UINavigationControllerDelegate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;navigationController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;navigationController&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UINavigationController&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;animationControllerFor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UINavigationController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fromVC&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UIViewController&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;toVC&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UIViewController&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UIViewControllerAnimatedTransitioning&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SourceScaleTransition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;nil&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// SourceScaleTransition class&lt;/span&gt;
&lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SourceScaleTransition&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;NSObject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="bp"&gt;UIViewControllerAnimatedTransitioning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionInfo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// contains the image and it&amp;#39;s frame&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionDuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UIViewControllerContextTransitioning&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TimeInterval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;animationDuration&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animateTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;UIViewControllerContextTransitioning&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;viewController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;forKey&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;viewController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;forKey&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;SnapshotTransitionPushedController&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;containerView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containerView&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animatingView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sourceView&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;containerView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contentMode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaleAspectFill&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;containerView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;addSubview&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;containerView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;addSubview&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;animatingView&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layoutIfNeeded&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finalFrame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculatedFrame&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// calculate final frame based on the destination and app safe areas&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;snapshotFromSourceView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animatingView&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;animatingView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitionInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sourceRect&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isHidden&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="bp"&gt;UIView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;animate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;withDuration&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animationDuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="nl"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;weak&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;animatingView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finalFrame&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;toViewController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isHidden&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;transitionContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completeTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In addition to the above, we also created a protocol for destination controllers so that the transition concludes smoothly:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Destination&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ViewController&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;conform&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;
&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;SnapshotTransitionPushedController&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;so&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;could&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seamlessly&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;added&lt;/span&gt;
&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;removed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;transitional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SnapshotTransitionPushedController&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIViewController&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;snapshotFromSourceView&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tapped&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;It&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;was&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;propagated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;deeplink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;information&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;will&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scaled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;animating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Custom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Transition&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshotFromSourceView&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIView&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Call&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="n"&gt;removeTransitionalView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loaded&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;rendered&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;removeTransitionalView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Although initially promising, this approach proved insufficient for production use. We observed issues such as image pixelation and awkward text scaling, which led to abrupt disappearances. We identified two key problems that needed addressing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Selective rendering&lt;/strong&gt;: Components that are not necessary for the transition should be omitted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quality of Scaling view&lt;/strong&gt;: The transition should occur smoothly without pixelation, ensuring high-quality visuals throughout.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our solution was to traverse the tapped layout recursively and re-render it to produce a high-quality snapshot. The recursion has the added advantage of letting us select only the components essential to the transition. Each component manages the rendering of its own snapshot, which keeps the process efficient and precise.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Below is a simplified version of selective rendering where Label &amp;amp; Button Components are ignored while rendering a snapshot view of a Composed component. The Image Component has a dedicated implementation of the &lt;code&gt;snapshot(renderer:)&lt;/code&gt; method, shown further below.&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;extension&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ComponentRenderer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Renderer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Selective&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Rendering&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;LabelComponent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ButtonComponent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;EmptyView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Implement&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;components&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dedicated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;handling&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Render an actual view, not just a snapshot, to get a good-quality transitional view:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ComponentRenderer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Renderer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UIView&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;UIImageView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let's look at the resulting outfits-card transition:&lt;/p&gt;
&lt;p&gt;&lt;img alt="New Outfits Transition" src="https://engineering.zalando.com/posts/2024/07/images/transition-new.gif#center"&gt;&lt;/p&gt;
&lt;p&gt;Isn't it much better than the vanilla transition? It definitely is! &lt;em&gt;Bonus -&lt;/em&gt; The same transition can now be enabled for other screens since it lives in a generic, backend-driven screen framework.&lt;/p&gt;
&lt;p&gt;To conclude, each interaction is unique, and there's no one-size-fits-all solution, but this is a solid starting point. By collaborating with designers, engineers can create smooth, visually appealing animations. While these enhancements are not must-haves, they contribute significantly to a more enjoyable user experience. By focusing on advanced aspects of UIKit's View Controller Transitions API, you can improve your app's aesthetics and functionality, making it more engaging for users.&lt;/p&gt;</content><category term="Zalando"/><category term="Mobile"/><category term="iOS"/><category term="UI"/><category term="UX"/><category term="User Interaction"/><category term="Zalando App"/><category term="Frontend"/></entry><entry><title>Failing to Auto Scale Elasticsearch in Kubernetes</title><link href="https://engineering.zalando.com/posts/2024/06/failing-to-auto-scale-elasticsearch-in-kubernetes.html" rel="alternate"/><published>2024-06-21T00:00:00+02:00</published><updated>2024-06-21T00:00:00+02:00</updated><author><name>Juho Vuori</name></author><id>tag:engineering.zalando.com,2024-06-21:/posts/2024/06/failing-to-auto-scale-elasticsearch-in-kubernetes.html</id><summary type="html">&lt;p&gt;A story of operational failure in a large-scale Elasticsearch installation, including the root cause analysis and mitigations that followed&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="k8s logo, fire, and elasticsearch logo" src="https://engineering.zalando.com/posts/2024/06/images/k8s-fire-es.png#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;At Lounge by Zalando, we run an Elasticsearch cluster in Kubernetes to store
user-facing article descriptions. Our business model is such that we receive
about three times the normal load during the busy hour in the morning, and
therefore we use schedules to automatically scale applications in and out to
handle that peak. If scaling out in the morning fails, we face a potential
catastrophe. This is a story of one such case.&lt;/p&gt;
&lt;h2&gt;First anomaly&lt;/h2&gt;
&lt;p&gt;Early Tuesday morning, our on-call engineer received an alert about too few
running Elasticsearch nodes. We started executing the playbook to handle such a
case, but before we had time to go through all the steps, the missing nodes
popped up and the alert closed on its own. Catastrophe avoided for now, but
after a cup of coffee, it was time for the root cause analysis.&lt;/p&gt;
&lt;p&gt;The logs showed that the cluster had failed to fully scale
down for the night. The cluster was configured to run 6 nodes during the night,
but it got stuck running 7 nodes.&lt;/p&gt;
&lt;p&gt;To understand why that happened and why it is interesting, a little bit of
context is required. We run Elasticsearch in Kubernetes using
&lt;a href="https://github.com/zalando-incubator/es-operator"&gt;es-operator&lt;/a&gt;. Es-operator
defines a Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"&gt;custom
resource&lt;/a&gt;,
ElasticsearchDataSet (EDS), that describes the Elasticsearch cluster. The operator
monitors changes to the EDS and maintains a
&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/"&gt;StatefulSet&lt;/a&gt;
that consists of pods and volumes that implement the Elasticsearch nodes. We’ve
configured our cluster so that the pods running it are spread across all AWS
availability zones, and Elasticsearch is configured to &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#shard-allocation-awareness"&gt;spread the shards across
the
zones&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For us, the schedule-based scaling is implemented by a fairly complex set of
cronjobs that change the number of nodes by manipulating the EDS for our
cluster. There are separate cronjobs for scaling up at various times of day and
scaling down at other times of day.&lt;/p&gt;
&lt;p&gt;The pods in a StatefulSet are numbered and the one with the highest number is
always chosen for removal when scaling in. Just before the nightly scale was
reached, we were running the following pods in the shown availability zones:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The pod to be scaled in next is &lt;code&gt;es-data-production-v2-6&lt;/code&gt;. The first step is
for es-operator to drain the node, i.e. request Elasticsearch to relocate any
shards out of it. Here though, the node to be drained is the only one located in
eu-central-1a. Due to our zone awareness configuration, Elasticsearch refused to
relocate the shards in it. Es-operator has quite simple logic here: it requests
the shards to be relocated, checks whether that happened, and keeps retrying up
to 999 times before giving up. This kept happening throughout the night and, quite
unbelievably, the retries ran out just two minutes after we got the alert. Then,
es-operator carried on with scaling out and the problem resolved itself. The
timing here is quite surprising, but occasionally such things occur.&lt;/p&gt;
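&lt;p&gt;The condition that blocked the drain can be checked mechanically from the pod
listing. The following is a minimal Go sketch, not es-operator code: the
&lt;code&gt;pod&lt;/code&gt; type and the &lt;code&gt;lastInZone&lt;/code&gt; helper are hypothetical names
introduced only for illustration, and the sketch assumes the slice is ordered by
StatefulSet ordinal.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import "fmt"

// pod pairs a StatefulSet pod name with the availability zone it runs in.
// Hypothetical type, introduced for this illustration only.
type pod struct {
	name string
	zone string
}

// lastInZone reports whether the highest-ordinal pod (the final element,
// which is the one chosen for removal when scaling in) is the only pod
// in its availability zone.
func lastInZone(pods []pod) bool {
	if len(pods) == 0 {
		return false
	}
	victim := pods[len(pods)-1]
	for _, p := range pods[:len(pods)-1] {
		if p.zone == victim.zone {
			return false
		}
	}
	return true
}

func main() {
	// Pod placement copied from the listing above.
	pods := []pod{
		{"es-data-production-v2-0", "eu-central-1b"},
		{"es-data-production-v2-1", "eu-central-1c"},
		{"es-data-production-v2-2", "eu-central-1b"},
		{"es-data-production-v2-3", "eu-central-1c"},
		{"es-data-production-v2-4", "eu-central-1c"},
		{"es-data-production-v2-5", "eu-central-1c"},
		{"es-data-production-v2-6", "eu-central-1a"},
	}
	fmt.Println(lastInZone(pods)) // prints: true
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A check of this kind only flags the situation; whether Elasticsearch actually
refuses to relocate the shards depends on the zone awareness settings in use.&lt;/p&gt;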
&lt;h2&gt;Initial root cause analysis&lt;/h2&gt;
&lt;p&gt;Something in the above is not quite right though. The intended behaviour of
es-operator is as follows: it constantly monitors updates to EDS resources and,
if a change is observed, it compares the state of the cluster to the description
and starts to modify the cluster to match its description. If, during that
process, the EDS gets changed one more time, es-operator should abort the process
and start modifying the cluster to match the new desired state.&lt;/p&gt;
&lt;p&gt;This was exactly our situation. Es-operator was still processing the EDS update
for the nightly scale-in when it received another EDS update to start
scaling out for the morning. We spent much of the next day tracing through
es-operator source code and finally realised there was a bug regarding retrying
on draining nodes for scaling in: in this one specific retry loop, context
cancellations were not handled. The bug is specific to draining a node and
doesn’t apply to other processes. It’s &lt;a href="https://github.com/zalando-incubator/es-operator/pull/405"&gt;fixed now&lt;/a&gt;,
so remember to upgrade if you are running es-operator yourself.&lt;/p&gt;
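&lt;p&gt;To illustrate the class of bug, here is a minimal Go sketch of a bounded retry
loop that does honour context cancellation. It is not the actual es-operator
patch: &lt;code&gt;retryDrain&lt;/code&gt;, &lt;code&gt;errNotDrained&lt;/code&gt; and the &lt;code&gt;drain&lt;/code&gt;
callback are hypothetical names, and only the 999 retry budget is taken from the
description above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotDrained is a hypothetical sentinel: shards are still on the node.
var errNotDrained = errors.New("node still holds shards")

// retryDrain calls drain until it succeeds, the retry budget is used up,
// or ctx is cancelled. Checking ctx.Err() on every iteration is what lets
// a newer desired state interrupt a drain that cannot make progress.
func retryDrain(ctx context.Context, maxRetries int, wait time.Duration, drain func() error) error {
	var lastErr error
	for attempt := 0; attempt != maxRetries; attempt++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("drain aborted after %d attempts: %w", attempt, err)
		}
		if lastErr = drain(); lastErr == nil {
			return nil
		}
		time.Sleep(wait)
	}
	return fmt.Errorf("drain gave up after %d attempts: %w", maxRetries, lastErr)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	calls := 0
	err := retryDrain(ctx, 999, time.Millisecond, func() error {
		calls++
		if calls == 3 {
			cancel() // simulates a new EDS update arriving mid-drain
		}
		return errNotDrained
	})
	fmt.Println(calls, errors.Is(err, context.Canceled)) // prints: 3 true
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The sketch checks for cancellation once per iteration to stay short. Waiting on
the context’s done channel instead of sleeping would also interrupt the wait
between attempts.&lt;/p&gt;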
&lt;p&gt;Still something is not quite right. Why did this happen on Tuesday and never
before? We never scale in to fewer than 6 pods and, as explained above, the pod to
scale in is always the one with the greatest number. Therefore, the pods
numbered 0 to 5 should remain untouched. The pods running Elasticsearch are
run as a StatefulSet by es-operator. If that StatefulSet was using an EBS-backed
volume, Kubernetes would guarantee not to move the pods between zones. We, however,
don’t store unrecoverable data in our Elasticsearch, thus we can afford to run
it on top of ephemeral storage. Nothing is strictly guaranteed for us then.
Normally, pods remain quite stable in a zone nevertheless, but on Monday, the day
before the first anomaly, our Kubernetes cluster was upgraded to version 1.28.
This process has likely rescheduled some pods onto nodes
in a different availability zone, though we have not done a full deep dive
into the upgrade process to confirm this.&lt;/p&gt;
&lt;h2&gt;The first fix that didn’t work&lt;/h2&gt;
&lt;p&gt;As a quick fix, we just increased the number of nodes running during the night.
This way, the nightly scale-in job wouldn’t try to drain
&lt;code&gt;es-data-production-v2-6&lt;/code&gt;, the last node in eu-central-1a, and it wouldn’t get
stuck the way it did the previous night. We might want to consider something
else for the longer term, but this should stop us from failing to scale out the
next morning.&lt;/p&gt;
&lt;p&gt;Still, the next morning, we received the exact same alert once again. And after
a few minutes, the alert closed on its own the same way as the day before.&lt;/p&gt;
&lt;p&gt;This time we were unable to scale in from 8 to 7 nodes, which had worked fine the
day before. Looking at the node distribution:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a
es-data-production-v2-7 eu-central-1a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Why was es-operator not able to drain &lt;code&gt;es-data-production-v2-7&lt;/code&gt;? This time it’s
not the last node in eu-central-1a.&lt;/p&gt;
&lt;p&gt;Digging into this revealed another bug in es-operator. The process for scaling
in a node, in a bit more depth, looks like the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Mark the node excluded (&lt;code&gt;cluster.routing.allocation.exclude._ip&lt;/code&gt;) in
   Elasticsearch. This instructs Elasticsearch to start relocating shards from
   it.&lt;/li&gt;
&lt;li&gt;Check with Elasticsearch whether any shards are still located on the given
   node. If yes, repeat from the beginning.&lt;/li&gt;
&lt;li&gt;Remove the corresponding pod from the StatefulSet.&lt;/li&gt;
&lt;li&gt;Clean up the node exclusion list (&lt;code&gt;cluster.routing.allocation.exclude._ip&lt;/code&gt;) in
   Elasticsearch.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Pondering the above, you are likely to guess what was wrong this time. If
the scale-in process gets interrupted, the clean-up phase is never executed
and the node stays in the exclusion list forever. So, &lt;code&gt;es-data-production-v2-6&lt;/code&gt;,
which failed to scale in the day before, was still marked as excluded and
Elasticsearch was unwilling to store any data on it. In effect,
&lt;code&gt;es-data-production-v2-7&lt;/code&gt; was the only usable node in eu-central-1a.&lt;/p&gt;
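&lt;p&gt;The four steps and the resulting “zombie” exclusion can be reproduced with a small in-memory model. This is a sketch for illustration only, not es-operator code: &lt;code&gt;scale_in&lt;/code&gt;, &lt;code&gt;Interrupted&lt;/code&gt; and the callback parameters are hypothetical names, and the exclusion list is modelled as a plain Python set.&lt;/p&gt;

```python
class Interrupted(Exception):
    """Raised when a newer desired state cancels the scale-in."""


def scale_in(node_ip, excluded_ips, shards_on, remove_pod, cancelled):
    # Step 1: exclude the node so that shards start relocating away from it.
    excluded_ips.add(node_ip)
    # Step 2: keep checking until no shards remain on the node.
    while shards_on(node_ip) > 0:
        if cancelled():
            # The flaw being illustrated: leaving here skips step 4,
            # so the node stays excluded.
            raise Interrupted(node_ip)
    # Step 3: remove the pod from the StatefulSet.
    remove_pod(node_ip)
    # Step 4: clean up the exclusion list.
    excluded_ips.discard(node_ip)
```

&lt;p&gt;Calling &lt;code&gt;scale_in&lt;/code&gt; with a &lt;code&gt;cancelled&lt;/code&gt; callback that returns true while shards are still present leaves the node's address in &lt;code&gt;excluded_ips&lt;/code&gt;, which mirrors what we observed.&lt;/p&gt;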
&lt;h2&gt;The second fix&lt;/h2&gt;
&lt;p&gt;Manually removing the “zombie” node from the exclusion list is simple, so we did
exactly that to mitigate the immediate problem.&lt;/p&gt;
&lt;p&gt;Fixing the underlying bug in a reliable and safe way is much more involved. Just
adding a special if clause for cleaning up in case of cancellation would solve
the simple instance of this problem. But we are potentially dealing with partial
failure here. No number of if clauses would solve the problem when
es-operator crashes in the middle of the draining process. There’s a &lt;a href="https://github.com/zalando-incubator/es-operator/pull/423"&gt;PR in
progress&lt;/a&gt; to handle
this, but at the time of writing the bug still remains and we currently accept
the need to deal with these types of exceptional situations manually.&lt;/p&gt;
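&lt;p&gt;One crash-tolerant pattern is to recompute the exclusion list from observed state on every reconciliation pass instead of relying on a clean-up step having run. The sketch below only illustrates that idea and is not necessarily what the linked PR does; &lt;code&gt;reconcile_exclusions&lt;/code&gt; and &lt;code&gt;draining_ips&lt;/code&gt; are hypothetical names, and the setting is assumed to be a comma-separated list of IP addresses.&lt;/p&gt;

```python
def reconcile_exclusions(current_setting, draining_ips):
    """Return the exclusion setting with every entry that is not being drained removed.

    current_setting: comma-separated IPs, as stored in
        cluster.routing.allocation.exclude._ip
    draining_ips: set of IPs the operator intends to drain right now
    """
    excluded = [ip.strip() for ip in current_setting.split(",") if ip.strip()]
    # Keep only the entries that belong to a drain in progress; anything else
    # is a leftover from an interrupted or crashed scale-in.
    kept = [ip for ip in excluded if ip in draining_ips]
    return ",".join(kept)
```

&lt;p&gt;Because the result depends only on the current setting and the current intent, running it again after a crash yields the same clean state.&lt;/p&gt;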
&lt;h2&gt;Finally&lt;/h2&gt;
&lt;p&gt;As an embarrassing postlude to this story, we received the same alert one more
time the next day. The quick fix we did the day before only touched the major
nightly scale-down job, but ignored another one related to a recent experimental
project. It was a trivial mistake, but enough to cause a bit of organisational
hassle.&lt;/p&gt;
&lt;p&gt;Well, we fixed the remaining cronjob and that was finally it. Since then we’ve
been running hassle-free.&lt;/p&gt;
&lt;p&gt;What did we learn from all this? Well, &lt;strong&gt;Read the code&lt;/strong&gt;. For solving difficult
problems, understanding the related processes in abstract terms might not be
enough. The details matter, and the code is the final documentation for those.
It also mercilessly reveals any bugs that lurk around.&lt;/p&gt;</content><category term="Zalando"/><category term="Kubernetes"/><category term="Elasticsearch"/><category term="Zalando Helsinki"/><category term="SRE"/><category term="Backend"/><category term="Culture"/></entry><entry><title>Next level customer experience with HTTP/3 traffic engineering</title><link href="https://engineering.zalando.com/posts/2024/06/next-level-customer-experience-with-http3-traffic-engineering.html" rel="alternate"/><published>2024-06-18T00:00:00+02:00</published><updated>2024-06-18T00:00:00+02:00</updated><author><name>Dmitry Kolesnikov</name></author><id>tag:engineering.zalando.com,2024-06-18:/posts/2024/06/next-level-customer-experience-with-http3-traffic-engineering.html</id><summary type="html">&lt;p&gt;HTTP/3 addresses key challenges such as latency reduction, concurrent access, and low-latency content delivery.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Next level customer experience with HTTP/3 traffic engineering" src="https://engineering.zalando.com/posts/2024/06/images/http3.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: HTTP/3 has gained industry consensus as the best technical solution for improving the Web protocol stack. &lt;a href="https://w3techs.com/technologies/details/ce-http3"&gt;Usage statistics&lt;/a&gt; indicate that 29.8% of websites worldwide have already embraced HTTP/3 to cater to their users, with Zalando being among them. The architecture of HTTP/3, coupled with the underlying QUIC transport, introduces concurrent access and low-latency capabilities, facilitated by user-space flow and congestion controls operating over the User Datagram Protocol (UDP). QUIC is used by &lt;a href="https://w3techs.com/technologies/details/ce-quic"&gt;8.0% of all the websites&lt;/a&gt;. The result is an enhanced customer experience that fundamentally transforms content consumption, promising visually stunning displays on customers' mobile screens. This post delves into the intricacies of HTTP/3 traffic engineering, Zalando's experience with it, and our vision for next steps.&lt;/p&gt;
&lt;h2&gt;The significance of HTTP/3 adoption&lt;/h2&gt;
&lt;p&gt;Nowadays, &lt;a href="https://www.researchgate.net/publication/375562820_Design_Modeling_and_Implementation_of_Robust_Migration_of_Stateful_Edge_Microservices"&gt;85%&lt;/a&gt; of total Internet traffic is TCP traffic. HTTP traffic accounts for about &lt;a href="https://radar.cloudflare.com/"&gt;54.6%&lt;/a&gt;, and &lt;a href="https://www.statista.com/statistics/277125/share-of-website-traffic-coming-from-mobile-devices/"&gt;54.4%&lt;/a&gt; of that is traffic to mobile devices. TCP was developed in the 1970s to build reliable client/server communication. The TCP-based family of Web protocols, specifically HTTP/1.0, HTTP/1.1 and HTTP/2, inherits the legacy TCP inefficiencies when building concurrent and low-latency Web applications on wireless networks. Looking in depth at the protocol stack involved in end-to-end communication, there are issues in (1) network infrastructure utilisation and (2) protocol design:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Issues with Utilisation of IP Network" src="https://engineering.zalando.com/posts/2024/06/images/issues-utilization-ip-network.svg#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;(1) Issues with Utilisation of IP Network&lt;/strong&gt;: The Internet comprises a heterogeneous mix of packet-switched networks, including ISP Access Networks, ISP Core Networks, and numerous Tier 1/2/3 telecom carriers. For European customers connecting to load balancers deployed in the eu-central-1 region, packets traverse about 15 hops. Each hop introduces a blend of processing, waiting times, and the inherent risks of packet loss or network congestion, particularly when nodes or links are strained beyond capacity. Additionally, the architecture of the access network, encompassing its physical medium and the transmission delays it incurs, further compounds these challenges. Furthermore, the saturated capacity of the radio spectrum utilised for communication within the access network adds another layer of complexity to contend with.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Issues with Protocol Design" src="https://engineering.zalando.com/posts/2024/06/images/issues-protocol-design.svg#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;(2) Issues with Protocol design&lt;/strong&gt;: Recent development of the Web protocol stack has brought several notable improvements, foremost among them a reduction in the excessive signalling and handshakes required by the upper protocol to negotiate communication parameters prior to payload transfer. Despite this, each "cold" HTTP/2 request necessitates approximately 5 to 6 round-trips, including 1xDNS, 1xTCP, 3xTLS, and 1xHTTP handshakes, contributing to significant network signalling overhead. Moreover, TCP, functioning as a single ordered stream of bytes, lacks concurrent multiplexing capabilities for application traffic over the transport layer. Consequently, any networking failure, such as packet loss or congestion, results in the blocking of the entire byte stream, hindering performance and responsiveness. Existing transport congestion control algorithms often fail to optimise network bandwidth utilisation, leading to suboptimal performance and efficiency. Additionally, poorly designed protocols contribute to fragmentation and reassembly, which are necessary for packets to traverse links with a smaller Maximum Transmission Unit (MTU) than the original packet size. This fragmentation process increases the likelihood of excessive retransmissions in the event of packet loss, further impeding network efficiency and reliability.&lt;/p&gt;
&lt;p&gt;The industry has shown that customers love fast experiences, in both applications and websites. About &lt;a href="https://www.amity.co/blog/mobile-app-user-acquisition-statistics"&gt;70%&lt;/a&gt; of mobile app users will stop using an app if it takes too long to load. Slow pages have a higher bounce rate, and site speed is considered a ranking signal for search. Having a fast site makes for a good user experience, which helps improve rankings and brings in visitors, keeps them on your site and ultimately leads to more conversions.&lt;/p&gt;
&lt;p&gt;Knowing these issues, we assume that the first group of factors, related to network infrastructure, will remain unchanged in the near future (3 to 5 years). Infrastructure improvements are driven by economics. Only remediation of the second group of factors, related to protocol design, can bring about a significant improvement in customer experience. We also assume that mobile device replacement is seasonal, with longer or shorter cycles depending on country and economic situation, but certain to happen.&lt;/p&gt;
&lt;p&gt;HTTP/3 has gathered consensus as the best technical solution to the second group of problems related to protocol design at this time.&lt;/p&gt;
&lt;h2&gt;What enhancements does HTTP/3 bring?&lt;/h2&gt;
&lt;p&gt;In the past, the industry has made multiple attempts at improving protocol design through Structured Streams Transport (&lt;a href="https://pdos.csail.mit.edu/papers/sst:sigcomm07/"&gt;SST&lt;/a&gt;), Stream Control Transmission Protocol (&lt;a href="https://datatracker.ietf.org/doc/html/rfc9260"&gt;SCTP&lt;/a&gt;), Multipath TCP (&lt;a href="https://datatracker.ietf.org/doc/html/rfc8684"&gt;MP-TCP&lt;/a&gt;) and kernel-less TCP/IP implementations (e.g. &lt;a href="https://github.com/adamdunkels/uip"&gt;uIP&lt;/a&gt; and &lt;a href="https://savannah.nongnu.org/projects/lwip/"&gt;lwIP&lt;/a&gt;). None of these became widely adopted because they focused on the transport layer only and lacked an end-to-end Web perspective. In June 2022, the IETF published &lt;a href="https://datatracker.ietf.org/doc/html/rfc9114"&gt;HTTP/3&lt;/a&gt; as a Proposed Standard, which is built over a new protocol called &lt;a href="https://datatracker.ietf.org/doc/rfc9000/"&gt;QUIC&lt;/a&gt; (standardised in May 2021).&lt;/p&gt;
&lt;p&gt;QUIC is a transport layer network protocol. In contrast to TCP, it implements &lt;strong&gt;user-space flow and congestion controls&lt;/strong&gt; over the User Datagram Protocol (UDP). Its new architecture is built on the principle of cooperation between protocols rather than strict OSI layering. The protocol addresses:&lt;/p&gt;
&lt;p&gt;&lt;img alt="HTTP/3 Improvements" src="https://engineering.zalando.com/posts/2024/06/images/http3-improvments.svg#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Multiplexing&lt;/em&gt;: TCP is a single stream that guarantees strict ordering of bytes. Any concurrency requires multiplexing over a single stream. Network conditions (e.g. packet losses, congestion) cause the TCP stream to become a bottleneck that blocks all senders / receivers on this stream. QUIC multiplexes streams over UDP datagrams; each stream is independent and has its own flow control, while congestion control is applied to the connection as a whole. QUIC also controls the fragmentation and packetisation of payload, producing optimal network datagrams.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Handshake&lt;/em&gt;: Each “cold” HTTP/2 request demands about 5 to 6 round-trips (1xDNS, 1xTCP, 3xTLS, 1xHTTP). HTTP/3 requires 3 round-trips (1xDNS, 1xQUIC, 1xHTTP). The QUIC handshake combines negotiation of cryptographic and transport parameters. The handshake is structured to permit the exchange of application data as soon as possible, reducing the actual waiting time to a single round-trip. Peers establish a single QUIC connection that multiplexes a large number of parallel streams. The handshake is only required once; setting up a stream is an instant operation and does not require any additional handshake.&lt;/p&gt;
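&lt;p&gt;As a back-of-the-envelope illustration of the round-trip counts above, the following sketch adds them up and converts them into a delay. The 50 ms round-trip time used in the usage note is a hypothetical figure, and treating every round-trip (including DNS) as equally expensive is a simplifying assumption.&lt;/p&gt;

```python
# Round-trips per "cold" request, as listed in the paragraph above.
HTTP2_COLD = {"DNS": 1, "TCP": 1, "TLS": 3, "HTTP": 1}
HTTP3_COLD = {"DNS": 1, "QUIC": 1, "HTTP": 1}


def cold_start_delay_ms(round_trips, rtt_ms):
    """Total delay before the response arrives, if every round-trip costs rtt_ms."""
    return sum(round_trips.values()) * rtt_ms
```

&lt;p&gt;With an assumed round-trip time of 50 ms, &lt;code&gt;cold_start_delay_ms(HTTP2_COLD, 50)&lt;/code&gt; gives 300 ms and &lt;code&gt;cold_start_delay_ms(HTTP3_COLD, 50)&lt;/code&gt; gives 150 ms, so the saving scales linearly with the round-trip time of the access network.&lt;/p&gt;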
&lt;p&gt;&lt;em&gt;TLS&lt;/em&gt;: The traditional layered architecture has isolated security and transport layers, causing significant overhead to negotiate encryption keys and transmit encrypted data. Customers perceive bad experiences when the chain of TLS certificates exceeds 4KB and TLS records are fragmented across multiple packets. QUIC adopts TLS 1.3 as the default and encapsulates the security protocol (it encrypts each individual packet).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Congestion&lt;/em&gt;: QUIC provides an open architecture for congestion control, whereas TCP implements it on the kernel side of the operating system. QUIC does not aim to standardise the congestion control algorithms; it provides generic signals for congestion control, and the sender is free to implement its own congestion control mechanisms. As a benefit, the sender can align the payload to the actual size of the congestion window. The downside is performance inefficiency, as a user-space implementation involves copying extra packet data from kernel memory to user memory, so research on improving that efficiency is key.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Handover&lt;/em&gt;: QUIC connections are not strictly bound to a single network path. The protocol supports transferring the connection to a new network path, ensuring a low-latency experience when consumers switch from mobile to WiFi. With TCP-based HTTP, such a switch always requires a “cold” start.&lt;/p&gt;
&lt;h2&gt;Outstanding HTTP/3 protocol challenges&lt;/h2&gt;
&lt;p&gt;QUIC has emerged as a serious alternative to TCP in the Web domain. Unfortunately, QUIC and HTTP/3 are not a “silver bullet” for concurrency and low latency. Open issues remain that engineers need to consider during application development:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Multiplexing&lt;/em&gt;: Stream frames are multiplexed into QUIC packets, which can be coalesced into a single UDP datagram. The congestion or loss of datagrams therefore causes an effect similar to that on TCP. The application needs to implement its own traffic prioritisation scheme(s) to mitigate this effect if necessary.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Memory management&lt;/em&gt;: HTTP/3 and QUIC demand a greater commitment of memory resources than the traditional Web protocol stack. HTTP/3 mitigates the protocol overhead with various compression techniques, but stream-oriented ordering of bytes requires extensive buffering of any data that is received out of order. Additionally, a user-space implementation leads to performance inefficiencies as it involves copying extra packet data from kernel memory to user memory.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Traffic shaping and security&lt;/em&gt;: Networking infrastructure was monopolised by TCP for so long that it developed indirect dependencies on TCP. ISPs enforce different traffic routing policies for TCP vs UDP traffic, and there are various in-the-network optimisation techniques such as Quality of Service and Active Queue Management that &lt;a href="https://lwn.net/Articles/752956/"&gt;impact UDP&lt;/a&gt;. The massive adoption of QUIC would require reconfiguration of networking gear. For example, &lt;a href="https://engineering.fb.com/2020/10/21/networking-traffic/how-facebook-is-bringing-quic-to-billions/"&gt;Facebook reported&lt;/a&gt;: client-side heuristics about TCP, heuristics for estimating the available download bandwidth, bottlenecks in the Linux kernel on UDP packet processing, and new load balancing and firewall policies.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Congestion control&lt;/em&gt;: There is no ultimate solution in this problem domain. QUIC inherits its algorithms from TCP. Historically, congestion control was owned by “hardware” companies - those who developed networking equipment and operating systems. Because of its user-space implementation, QUIC shifts the ownership towards “software” companies - those who own Web browsers. Nowadays, &lt;a href="https://datatracker.ietf.org/doc/html/rfc6582"&gt;NewReno&lt;/a&gt; (1999), &lt;a href="https://datatracker.ietf.org/doc/html/rfc8312"&gt;CUBIC&lt;/a&gt; (2008) and &lt;a href="https://queue.acm.org/detail.cfm?id=3022184"&gt;Bottleneck Bandwidth and Round-trip&lt;/a&gt; (BBR, 2016) are the heuristic congestion control algorithms in use. The QUIC standard is confusing here: it proposes NewReno as the default algorithm, although CUBIC is the dominant algorithm for broad internet traffic today. Also, the BBR algorithm has increased its share of practical implementations and can be expected to become the dominant algorithm in the future. A positive side effect of shifting congestion control to user space is that it unblocks innovation (e.g. there is research into adopting Deep Reinforcement Learning to boost customer experience).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;MTU&lt;/em&gt;: The QUIC protocol, as standardised by the IETF, does not support network MTUs smaller than 1280 bytes. This makes the protocol compatible with IPv6 networks (1280 bytes is the minimum IPv6 MTU). However, it poses challenges for networks operating on "non-standard" IPv4 configurations, potentially leading to packet fragmentation, especially on radio channels. Presently, the industry predominantly adheres to Ethernet standards, assuming a physical link MTU of 1500. While larger datagrams are feasible, they necessitate the use of Path Maximum Transmission Unit Discovery to ensure optimal performance and compatibility across diverse network environments.&lt;/p&gt;
&lt;h2&gt;Viewing HTTP/3 from the Radio Access Network (Physical Link) angle&lt;/h2&gt;
&lt;p&gt;The architecture of the HTTP/3 protocol assumes low latency and high reliability within access networks. While the QUIC protocol brings notable enhancements for "interactive" communication over 3G/4G/LTE wireless networks, it does not specifically address the unique attributes of 5G networks. It's crucial to note that 5G networks are poised to solve latency issues effectively. Engineers need to be aware of the limitations within Radio Access Networks and carefully weigh the adoption of 5G technology, particularly in &lt;a href="https://5gobservatory.eu/observatory-overview/interactive-5g-scoreboard/#EU-scoreboard"&gt;the European context&lt;/a&gt;. 5G stands out for its remarkable speed capabilities, boasting peak data rates of up to 20 Gigabits-per-second (Gbps) and average data rates exceeding 100 Megabits-per-second (Mbps). Compared with its predecessor, 4G, 5G offers significantly enhanced capacity, designed to accommodate a 100-fold increase in traffic capacity and network efficiency. Theoretical estimates suggest that 5G can support up to 1 million devices per square kilometer, showcasing its immense potential for accommodating the burgeoning demands of modern connectivity.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HTTP/3 Radio Access Network Perspective" src="https://engineering.zalando.com/posts/2024/06/images/http3-ran.svg#center"&gt;&lt;/p&gt;
&lt;p&gt;Advertisements about 5G talk about millimeter-wave (mmWave), but 5G technology is built over three &lt;a href="https://en.wikipedia.org/wiki/5G_NR_frequency_bands"&gt;frequency bands&lt;/a&gt;: (a) low-bands (sub-1 GHz) support wide-area coverage; (b) mid-bands (1 - 6 GHz) offer a trade-off between coverage and capacity, and most commercial 5G networks will use the 3.3 GHz to 4.2 GHz range in the mid-band spectrum; and (c) high-bands (24–52 GHz) are required to achieve ultra-high data rates and ultra-low latencies. High-bands (mmWave) are highly susceptible to blockages caused by various objects (e.g., buildings, vehicles, trees) and even the human body. Operating in the mmWave spectrum at mass scale presents a demanding challenge in terms of practical implementation and cost. The physical link in the Radio Access Network emerges as the primary bottleneck on low- and mid-bands, primarily due to the constrained capacity of the radio spectrum. Frequency bands below 6 GHz, traditionally utilised by pre-5G technologies, are progressively saturating and unable to meet escalating consumer demands. We assume massive adoption of mid-bands across Europe: 5G mid-bands still outperform 3G/4G/LTE in terms of latency and packet loss probability but require less investment in network infrastructure. For example, serving multiple real-time video streams over 5G is not magic anymore. We are able to build a customer experience with about &lt;a href="https://www.utupub.fi/bitstream/handle/10024/154600/Ritola_Ville_Thesis.pdf"&gt;13 ms latency for 99.9% of downlink packets and 28 ms for 99.9% of uplink packets&lt;/a&gt; even with “bad” signal strength from -100 dBm to -113 dBm.&lt;/p&gt;
&lt;p&gt;On the mid-bands, 5G still outperforms 3G/4G/LTE in terms of latency and packet loss probability. However, this high reliability works against the congestion control algorithms used by QUIC. Conventional algorithms are not able to differentiate between the potential causes of packet loss or congestion on the radio channel, such as noise, interference, blockage or handover. NewReno and CUBIC have resulted in &lt;a href="https://www.mdpi.com/1424-8220/21/13/4510"&gt;very poor throughput and latency performance&lt;/a&gt;. Only BBR exhibited the lowest round-trip time values across all possible physical failure scenarios and can satisfy the typical 5G requirements. Advancing the adoption of HTTP/3 for low-latency communication scenarios necessitates research and development into congestion control algorithms that are sensitive to bandwidth variations across different frequency bands.&lt;/p&gt;
&lt;h2&gt;Adoption of HTTP/3 by Zalando&lt;/h2&gt;
&lt;p&gt;Despite the limitations discussed, we have adopted the HTTP/3 protocol at Zalando for distributing all media content. We have successfully brought our vision to life: delivering a premium customer experience atop the foundation laid by industry enablers. Akamai Technologies has been supporting QUIC since July 2016. Amazon supports QUIC (UDP) at &lt;a href="https://aws.amazon.com/blogs/aws/new-udp-load-balancing-for-network-load-balancer/"&gt;Network Load Balancer&lt;/a&gt;. Most importantly, HTTP/3 is available at &lt;a href="https://aws.amazon.com/blogs/aws/new-http-3-support-for-amazon-cloudfront/"&gt;CloudFront&lt;/a&gt;, giving us the ability to serve European customers through Edge Locations. &lt;a href="https://developer.apple.com/videos/play/wwdc2021/10094/"&gt;Apple&lt;/a&gt; has maintained a proprietary closed-source implementation of the QUIC and HTTP/3 protocols since iOS 15. On Android, the open-source &lt;a href="https://developer.android.com/codelabs/cronet#0"&gt;Cronet library&lt;/a&gt; exists. Google Chrome has supported the protocol since 2012. Apple added official support in Safari 14. Support in Firefox arrived in May 2021.&lt;/p&gt;
&lt;p&gt;Since HTTP/3 was enabled in our production environment, we have observed that 36.6% of our users seamlessly migrated to consuming content over HTTP/3. The average latency for these customers has improved from a double-digit to a single-digit value, an improvement of about 94%. The p99 latency has improved from a four-digit to a double-digit value, a 96% gain in comparison with HTTP/2. About 61.6% of our users continue to use HTTP/2 and the remaining 1.8% of users fall back to HTTP/1. We have observed no incidents or severe anomalies caused by HTTP/3.&lt;/p&gt;
&lt;h2&gt;Exploring further directions on traffic engineering opportunities with HTTP/3&lt;/h2&gt;
&lt;p&gt;Before concluding, we outline two significant directions for further enhancing HTTP/3, aimed at crafting next-level customer experiences.&lt;/p&gt;
&lt;h3&gt;Congestion Control with Deep Reinforcement Learning&lt;/h3&gt;
&lt;p&gt;Conventional congestion control (CC) algorithms base their decisions on pre-defined criteria (heuristics) such as packet loss or delay, and they lack the ability to learn and adapt their behaviour in complex dynamic environments such as 5G cellular networks. Some heuristic algorithms use statistics to bring previous experience into the decision-making process, but they are still not able to achieve the full potential of modern networks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HTTP/3 Traffic Engineering Opportunities" src="https://engineering.zalando.com/posts/2024/06/images/http3-drl.svg#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.mdpi.com/1099-4300/25/2/294"&gt;Machine Learning techniques outperform conventional CC algorithms by dynamically adapting the parameters&lt;/a&gt;. Deep Reinforcement Learning (DRL) is a prominent technique that has been assessed with QUIC. The Reinforcement Learning agent makes decisions about the size of the congestion window or the sending rate while interacting with the environment. The reward is based on either throughput or network delay, with a penalty for packet losses, and is tuned for a particular application. In the lab, analysis of DRL algorithms has shown higher throughput and better round-trip performance under various network settings compared with competing solutions (e.g. BBR or Remy). It is worth mentioning Aurora, Eagle, Orca and PQB as known DRL algorithms. We expect this to become the main concept exploited in research dedicated to protocol improvements in 5G networks.&lt;/p&gt;
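&lt;p&gt;To make the idea of a reward signal concrete, here is a minimal sketch of a reward that grows with throughput and shrinks with delay and packet loss. The weights, the function names and the greedy &lt;code&gt;pick_window&lt;/code&gt; helper are hypothetical and only stand in for the agent; a real DRL algorithm learns a policy from many interactions rather than scoring a handful of candidates.&lt;/p&gt;

```python
def congestion_reward(throughput_mbps, delay_ms, loss_rate,
                      delay_weight=0.1, loss_weight=100.0):
    """Higher throughput raises the reward; delay and packet loss lower it.

    The weights are illustrative assumptions and would be tuned per application.
    """
    return throughput_mbps - delay_weight * delay_ms - loss_weight * loss_rate


def pick_window(candidates, measure):
    """Greedy stand-in for the agent: choose the window whose outcome scores best.

    measure(window) must return (throughput_mbps, delay_ms, loss_rate).
    """
    return max(candidates, key=lambda w: congestion_reward(*measure(w)))
```

&lt;p&gt;With these weights, a window that yields slightly more throughput at the cost of much higher delay and some loss scores worse than a moderate window, which is the trade-off the reward is meant to encode.&lt;/p&gt;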
&lt;h3&gt;Streaming of 4K Ultra High Definition videos&lt;/h3&gt;
&lt;p&gt;Streaming 4K Ultra High Definition 3840x2160 video at 60 fps requires the use of H.265 (High Efficiency Video Coding) and demands 30 - 50 Mbps of network bandwidth, 6 - 11 ms packet latency and 99.999% reliability for packet delivery. This is a tough requirement for 5G mid-bands and practically achievable in urban areas only.&lt;/p&gt;
&lt;p&gt;HTTP/3 introduces concurrent access and low-latency capabilities to video streaming solutions. Our initial investigations have revealed that only Video on Demand applications utilise Dynamic Adaptive Streaming over HTTP/3, assuming 5.6 MB of HEVC-compressed video per second. QUIC stream concurrency enables parallel fetching of video chunks, leading to an improved user experience compared to HTTP/2. Real-time video streaming with QUIC over less than ideal network conditions faces an issue due to the reliable nature of the protocol. Retransmissions of lost packets in a video stream inadvertently lead to stalls in the video stream. QUIC also performs poorly when it encounters packet losses that are not due to congestion. This is another improvement opportunity: if QUIC offered a selectively reliable transport, in which not all video frames are delivered reliably, we could optimise video streaming and improve end-user experiences. We believe this improvement would impact content consumption by supporting up to 4096 × 2160 at 60fps (True 4K).&lt;/p&gt;
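&lt;p&gt;The assumed video data rate can be checked against the bandwidth requirement quoted earlier with simple arithmetic. The sketch below converts megabytes per second into megabits per second and estimates the size of one streaming segment; the 4-second segment duration in the usage note is a hypothetical value, not a figure from our setup.&lt;/p&gt;

```python
def required_mbps(megabytes_per_second):
    """Convert a video data rate from megabytes/s to megabits/s (1 byte = 8 bits)."""
    return megabytes_per_second * 8


def segment_size_mb(megabytes_per_second, segment_seconds):
    """Size in megabytes of one streaming segment of the given duration."""
    return megabytes_per_second * segment_seconds
```

&lt;p&gt;For 5.6 MB per second, &lt;code&gt;required_mbps(5.6)&lt;/code&gt; is roughly 44.8 Mbps, which sits inside the 30 - 50 Mbps range mentioned above. A hypothetical 4-second segment would be about 22.4 MB, which is the unit of work that parallel QUIC streams would fetch.&lt;/p&gt;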
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://w3techs.com/technologies/details/ce-http3"&gt;Usage statistics&lt;/a&gt; indicate that 29.8% of websites worldwide have already embraced HTTP/3 to cater to their users, with Zalando being among them. Through its adoption, significant strides have been made towards improving the efficiency and responsiveness of web communications, ultimately enhancing the end-user experience.&lt;/p&gt;
&lt;p&gt;We've explored how HTTP/3 addresses key challenges such as latency reduction, concurrent access, and low-latency content delivery. We’ve also highlighted the remaining issues engineers should be aware of, specifically in the context of radio access networks, and discussed exciting opportunities for further advancements in traffic engineering and network optimization, especially as technologies like Deep Reinforcement Learning continue to mature.&lt;/p&gt;
&lt;p&gt;Overall, the insights shared in this post underscore the pivotal role of HTTP/3 in shaping the future of web communication, paving the way for richer, more immersive online experiences. Our observations tell us that 36.6% of our users seamlessly migrated to content consumption using HTTP/3 protocol. The average latency for these customers has improved from double digit to single digit value giving about 94% improvements.&lt;/p&gt;</content><category term="Zalando"/><category term="Platform Engineering"/><category term="User Experience"/><category term="Backend"/><category term="Frontend"/></entry><entry><title>Hosting an internal Engineering Conference</title><link href="https://engineering.zalando.com/posts/2024/06/hosting-an-internal-engineering-conference.html" rel="alternate"/><published>2024-06-03T00:00:00+02:00</published><updated>2024-06-03T00:00:00+02:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2024-06-03:/posts/2024/06/hosting-an-internal-engineering-conference.html</id><summary type="html">&lt;p&gt;In August 2023 we hosted our first internal Engineering Conference. This post summarizes the experience and provides some tips for those who want to organize a similar event.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Our &lt;a href="https://engineering.zalando.com/tags/data-science.html"&gt;Data Science&lt;/a&gt; colleagues had been hosting an internal Data Science Days event for a few years. For our 2,000+ Engineers, we had been missing a similar community event. For several years we wanted to organize one, but got distracted by other priorities and external factors. Finally, in 2022 we decided to commit to hosting an internal Engineering Conference every year and included this commitment in our Engineering Strategy.&lt;/p&gt;
&lt;p&gt;Last year, in August 2023, we hosted our first internal Engineering Conference. In this post, we summarize how we organized this event and provide tips for those who want to organize a similar event in their company. If you have never hosted an event like this before, fear not - when we embarked on the journey we also had no experience in doing so. The event turned out to be a success nonetheless.&lt;/p&gt;
&lt;h2&gt;Conference format&lt;/h2&gt;
&lt;p&gt;As this was our first event, we had no reference for the level of interest from potential speakers or attendees.
Without a reference point from prior years, it was a big ask to request that Engineering Managers allow their teams dedicated time to attend, especially given the summer holiday timing (which could work for or against attendance). On top of that, conference talks are expected to be of higher quality than typical internal presentations, so we needed a format that would ensure the quality of talks.&lt;/p&gt;
&lt;p&gt;Given these circumstances and following our value &lt;em&gt;think big, act fast&lt;/em&gt;, we defined the conference format as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 day event, all online (we're 2,000+ Engineers with sites in Berlin, Dublin, Dortmund, Helsinki, Stockholm, Zürich)&lt;/li&gt;
&lt;li&gt;call for papers to collect submissions across 8 tracks&lt;/li&gt;
&lt;li&gt;track host per track who would moderate the track and act as subject-matter expert during the preparation of the talks&lt;/li&gt;
&lt;li&gt;program committee to review submissions and select talks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Initially, we thought that 8 tracks would be too many, but we wanted to encourage submissions across a variety of topics and see where this would take us, adjusting the tracks as needed. Our tracks covered Building Platforms, Cloud Native, Developer Experience, Data Engineering, and App/Web Development. We also had a dedicated track for Engineering Leadership and (of course) for the hot topic of the year: AI.&lt;/p&gt;
&lt;p&gt;The call for papers was open for 3 weeks. Up until the very end, we were not sure if we would get enough submissions to fill all tracks.
Only in the last two days before the deadline did we receive a significant number of submissions. We ended up with enough submissions to fill all defined tracks and struck gold. Now the organizing team had a challenge - to deliver an event with 8 tracks happening in parallel and 54 talks in total.&lt;/p&gt;
&lt;p&gt;When we reached out to our broadcasting team, which typically assists in hosting internal events, we learned that they had never hosted an event that big, with 3 tracks being their technical limit. So we ended up hosting the event on our own, using Google Meet streaming, a slide-based presentation catalogue with talks and descriptions, and 54 calendar events to make it easy to build up one's own schedule.&lt;/p&gt;
&lt;h2&gt;Conference content&lt;/h2&gt;
&lt;p&gt;2023 was the year of Large Language Models (LLMs), so they could not be missing from our event. As LLMs were new for many of our Engineers, we invited our Data Scientists to share their know-how on this topic. We had a talk about the fundamentals of LLMs, followed by a summary of the challenges of using LLMs based on two use cases: code generation and &lt;a href="https://corporate.zalando.com/en/technology/how-zalando-co-creating-its-new-ai-powered-assistant-together-customers"&gt;building our Zalando Assistant&lt;/a&gt;. As expected, these presentations attracted a lot of interest from our community.&lt;/p&gt;
&lt;p&gt;Our Engineering Leadership track focused on talks related to managing teams in challenging times, building trust with the team, and sustaining empathy when the team or oneself is affected by the current situation. Other talks focused on driving innovation and continuing to learn as leaders.&lt;/p&gt;
&lt;p&gt;The Cloud Native and Developer Experience tracks turned out to be great platforms for sharing new developments in our infrastructure services and promoting their use. Colleagues learned both about proven features that they may be missing out on and about improvements in our Kubernetes platform. Our SRE-minded speakers shared tips on building easy-to-understand Grafana dashboards using data visualization techniques and demonstrated reference dashboards for applications.&lt;/p&gt;
&lt;p&gt;The Data Engineering track was focused on sharing best practices in data processing and data quality. Speakers shared how they monitor data quality in their pipelines, how to simplify data aggregation queries, or how architectural decisions around data design affect data quality and technical debt.&lt;/p&gt;
&lt;p&gt;Two teams particularly stood out with multiple presentations across the tracks. The team behind our &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Web platform&lt;/a&gt; shared their journey of evolving their platform into a standalone framework that now also powers parts of the &lt;a href="https://zalando-lounge.com"&gt;Zalando Lounge&lt;/a&gt; experience, covered the &lt;a href="https://engineering.zalando.com/posts/2023/07/rendering-engine-tales-road-to-concurrent-react.html"&gt;journey to concurrent React&lt;/a&gt;, and explained how we continuously measure and report on web performance. Our Size &amp;amp; Fit team, on the other hand, explained how the &lt;a href="https://corporate.zalando.com/en/technology/zalando-launches-size-recommendations-based-customers-own-body-measurements"&gt;Size Recommendations based on Body Measurements&lt;/a&gt; feature works behind the scenes, starting with the on-device computation and ending with the compliance requirements for processing sensitive data. The team also shared how the data acquisition pipelines for the &lt;a href="https://corporate.zalando.com/en/technology/zalando-brings-virtual-fitting-room-pilot-millions-customers"&gt;Virtual Fitting Room&lt;/a&gt; work.&lt;/p&gt;
&lt;h2&gt;Lessons learned&lt;/h2&gt;
&lt;h3&gt;Conference format&lt;/h3&gt;
&lt;p&gt;With the 8 tracks in a single day, we triggered massive FOMO (fear of missing out) across Zalandos, as it was difficult to decide which talk to attend. We knew from the get go that this would be a challenge, but decided that the trade-off was worth it. Now that we gained credibility for running the event, in future we will reduce the number of tracks and spread the event out over at least two full conference days. When hosting yearly events, the amount of net new project content is expected to stabilize when compared to the first event.&lt;/p&gt;
&lt;p&gt;For first-time speakers, the online format was a great opportunity to practice as stage anxiety is smaller than when speaking to a full room. It's challenging for an online-only event to deliver a full conference feeling, though. While on the following day we had an on-site event with two keynotes and a get together, participants were missing the buzz and networking opportunities known from on-site conferences. Nothing replaces the chatter in the hallway and missing talks due to engaging in conversations with colleagues in a prolonged coffee break ;-)&lt;/p&gt;
&lt;p&gt;We had two conference talk formats: full talks (with Q&amp;amp;A) and short lightning talks (without Q&amp;amp;A). The feedback we received from speakers for the lightning talks is that they missed out on the Q&amp;amp;A part and the resulting feedback loop telling them whether the audience was interested in the talk (or not).&lt;/p&gt;
&lt;h3&gt;CFP&lt;/h3&gt;
&lt;p&gt;We ran the CFP (call for papers) using Google Forms and scored the submissions in Google Sheets. Each Program Committee member reviewed and scored the submissions based on the topic relevance for the target audience, the abstract quality, and the expected takeaways. We provided a scoring guidance document and removed speaker information to ensure an unbiased selection process focused solely on content. To balance the workload, we assigned each committee member up to 50% of the tracks to score. We then normalized the ranking results and selected the top submissions for the conference. In some cases, we reclassified talks across tracks to ensure balanced content distribution.&lt;/p&gt;
&lt;h3&gt;Track Hosts&lt;/h3&gt;
&lt;p&gt;Assigning a track host per track worked well (and is well known from other conferences). The track hosts helped speakers prepare and were an early sounding board for the presentation content. They had freedom to select the order of the talks to ensure a good flow of topics and to support their storytelling when introducing the speakers throughout the day. Hosts also prepared backup questions to use in the Q&amp;amp;A part while the audience was busy typing their questions into the Q&amp;amp;A tool.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;The event turned out to be a success and we received a lot of &lt;a href="https://www.linkedin.com/posts/zalando_insidezalando-zalandotech-zalandohelsinki-activity-7103029512493248512-3bcv?trk=public_profile"&gt;positive feedback&lt;/a&gt; from our colleagues who after the closing event were asking when we will host the next one. The event was a great opportunity to learn about projects across the organization and to promote platform solutions to a wide and focused audience. The recordings from the talks serve as onboarding material for colleagues willing to learn about specific projects or just joining the team of the speakers. The on-site event on the following day was a great opportunity to meet colleagues in person and to get their first hand feedback on what they liked from the conference and what they would like to see improved.&lt;/p&gt;
&lt;h3&gt;Tips for organizing similar events&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Sponsorship&lt;/em&gt;: get a sponsor from the leadership team to provide budget and high-level guidance for the event.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Organizing team&lt;/em&gt;: form a small team to organize the event (at Zalando we have a Tech Academy team experienced in organizing events for the Engineering Community).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Program Committee&lt;/em&gt; and &lt;em&gt;Track Hosts&lt;/em&gt; are great mechanisms to give visibility to role models and to &lt;a href="https://engineering.zalando.com/tags/women-in-tech.html"&gt;promote diversity&lt;/a&gt; across the organization.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Program Committee&lt;/em&gt;: use a principle-based approach for program committee composition.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;CFP scoring&lt;/em&gt;: provide guidance for the program committee on how to score submissions; ensure that the selection is based solely on the content of the submission (via conference software or just plain old spreadsheets).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;CFP scoring&lt;/em&gt;: submissions that made it to the shortlist, but did not make it to the conference, should be considered for other internal talks formats or blog posts.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Track Hosts&lt;/em&gt;: consider assigning track hosts, if only to moderate the track and introduce speakers during the day; they can also help speakers prepare the talks, though you can also assign a group of subject-matter experts to review the talks early on.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Communication&lt;/em&gt;: meet the target audience where they are; use all possible communication channels to reach them: chats, email, intranet, posters in office, ask leads to promote the event in their team meetings and townhalls.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Presentations&lt;/em&gt;: provide a slide template to ensure a consistent look and feel across all presentations (at least for the first slide). Provide guidance on the font sizes and how to pick accessible color combinations with high contrast.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What changes are we making this year?&lt;/h3&gt;
&lt;p&gt;This year, we're running the conference in June already and hosting it as an on-site event with the aim of creating a real conference feeling.
We spread the conference over two days with three tracks per day, merging some tracks from last year and adding new ones. The event will be streamed to all sites, this time with support of our broadcasting team. The streams will also make it possible for our colleagues to join the event from home.&lt;/p&gt;
&lt;p&gt;We ran the CFP for 4 weeks to give potential speakers more time, but the impact on the number of submissions over time was negligible. The due date for submissions is what matters and, as in 2023, we received most submissions in the last two days before the end of the CFP. We invited past speakers and track hosts to become part of the Program Committee.&lt;/p&gt;
&lt;p&gt;We're excited to host the event again and look forward to learning how the conference format for this year will be received by Zalandos.
More on that another time!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Engineering Conference 2023" src="https://engineering.zalando.com/posts/2024/06/images/secc23-event.jpg#center"&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Culture"/></entry><entry><title>Transitioning to Appcraft: Evolution of Zalando’s server-driven UI framework</title><link href="https://engineering.zalando.com/posts/2024/05/appcraft.html" rel="alternate"/><published>2024-05-16T00:00:00+02:00</published><updated>2024-05-16T00:00:00+02:00</updated><author><name>Kanupriya Gupta</name></author><id>tag:engineering.zalando.com,2024-05-16:/posts/2024/05/appcraft.html</id><summary type="html">&lt;p&gt;This article outlines our server-driven UI framework utilised for mobile applications and delves into the reasons for retiring its predecessor.&lt;/p&gt;</summary><content type="html">&lt;p&gt;At the heart of Zalando's mobile content strategy lies the Appcraft platform, fueling 13 dynamic pages within the app. This framework is instrumental in delivering top-tier content formats, including the popular &lt;a href="https://en.zalando.de/stories/"&gt;Zalando Stories&lt;/a&gt;. In this post we explain the origins and inner workings of the platform.&lt;/p&gt;
&lt;h2&gt;The TNA Dilemma&lt;/h2&gt;
&lt;p&gt;The Flexible Layout Kit (formerly known as &lt;a href="https://engineering.zalando.com/posts/2016/07/an-introduction-to-truly-native-apps.html"&gt;Truly Native Apps, TNA&lt;/a&gt;) was a framework used in the Zalando App to render content dynamically. This framework processed JSON input, which defined the slots and elements of a screen. These elements were characterised by their types and a set of attributes. The primary container of the screen was a vertical list type, which encapsulated a series of Composed Tiles within client-side Apps.
While this system initially provided simplicity and a robust foundation for dynamic landing pages within our Apps, its fixed UI structure imposed constraints. Notably, maintaining the high-level composed UI components across both iOS and Android clients proved challenging, mainly due to versioning but also due to constant UI design changes and the introduction of multiple variants for a single Tile in order to support our different business logic and content formats. These limitations inhibited innovation and hindered the seamless integration of dynamic content.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Example of a component in TNA: These were the Showstopper Tile variants (C and D shown below) in TNA framework&lt;/em&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: center;"&gt;&lt;img alt="Version C" src="https://engineering.zalando.com/posts/2024/05/images/teaser-version-c.png#center"&gt;&lt;/th&gt;
&lt;th style="text-align: center;"&gt;&lt;img alt="Version D" src="https://engineering.zalando.com/posts/2024/05/images/teaser-version-d.png#center"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;Version C&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Version D&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;The JSON for Version D looked like this:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;&amp;quot;element-type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;teaser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;&amp;quot;attributes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;trackingParameters&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;saleBoxColor&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;#FF0000&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;teaserVersion&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;VERSION_D&amp;quot;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="s2"&gt;&amp;quot;subelements&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;attributes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;element-type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;image&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;attributes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;element-type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;text&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;attributes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;element-type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;use-voucher&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;attributes&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s2"&gt;&amp;quot;element-type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;show-info&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To summarise, these were the pain points with the TNA framework:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Small UI changes within a Tile, such as moving a button to the right or left, or stakeholders requiring two UI presentation variants, would prompt a new version and necessitate a client-side change and release to the App Stores.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For other cases involving changes to business logic, such as a price format change, the contract or schema for the price component on both clients and the server had to be modified.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Maintaining backward compatibility and versioning was challenging and led to a few incidents. It also necessitated coordination between clients, especially when the app release versions between iOS and Android were not synchronised.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Moreover, several backend services, including TNA, needed to be migrated at the time, and the team had to decide whether to maintain or decommission the TNA backend.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These shortcomings encouraged us to replace TNA with a new Framework in which we aimed at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A common and more flexible design layout system.&lt;/li&gt;
&lt;li&gt;Simplified Versioning capabilities.&lt;/li&gt;
&lt;li&gt;Same-day delivery for new Screens and Layouts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Enter Appcraft&lt;/h2&gt;
&lt;h3&gt;A common design layout system&lt;/h3&gt;
&lt;p&gt;In 2018, after experimenting with web-like architectures and several layout systems provided by native and third-party frameworks, we decided to implement a mobile version of the &lt;strong&gt;Elm architecture&lt;/strong&gt;, together with &lt;strong&gt;Flex&lt;/strong&gt;, as a unifying principle that could bridge the design paradigms of Android and iOS. Here's how:&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://guide.elm-lang.org/architecture/"&gt;Elm architecture&lt;/a&gt;, inspired by the Elm programming language, follows a unidirectional data flow pattern consisting of three main components: Model, View, and Update. The Model represents the application state, the View displays this state to the user, and the Update modifies the state based on user interactions. This clear separation of concerns simplifies code maintenance and enhances predictability, making the Elm architecture popular for building scalable and maintainable web applications.&lt;/p&gt;
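&lt;p&gt;To make the Model, View, and Update roles concrete, here is a minimal sketch of the pattern. It is written in Python purely for brevity and is not Appcraft code: the &lt;code&gt;Model&lt;/code&gt; fields, the message names, and the shape of the view output are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Model:
    """Application state: here, just a counter of taps (hypothetical)."""
    taps: int = 0


def update(model: Model, message: str) -> Model:
    """Return a new Model for a message; the old one is never mutated."""
    if message == "tap":
        return replace(model, taps=model.taps + 1)
    if message == "reset":
        return Model()
    return model  # unknown messages leave the state unchanged


def view(model: Model) -> dict:
    """Describe the UI for the current state as plain data."""
    return {"type": "label", "props": {"text": f"Taps: {model.taps}"}}


def run(messages: list) -> dict:
    """Fold messages through update, then render the final state."""
    model = Model()
    for message in messages:
        model = update(model, message)
    return view(model)
```

&lt;p&gt;Because state only changes inside &lt;code&gt;update&lt;/code&gt; and the view is a pure function of the state, the same sequence of messages always produces the same screen description.&lt;/p&gt;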
&lt;p&gt;Flex was key in helping to build a common understanding of layout concepts for mobile clients, which web developers could also grasp without the need to learn the individual mechanisms each platform uses to lay out views on a screen. It offers flexibility for dynamic and responsive designs across platforms, streamlines development, fosters cross-platform compatibility, and benefits from a large community of developers.&lt;/p&gt;
&lt;h4&gt;The challenges&lt;/h4&gt;
&lt;p&gt;While the decision to use Flex was agreed within the cross-platform team, the challenge lay in adding Flex support to iOS and Android, each of which internally uses its own native layout framework. Based on this, we experimented with a few third-party layout libraries already available, each with a fair reputation, comparing their performance and integration efforts. Once these libraries were chosen for iOS and Android, most of the effort went into translating the Flex definitions from the server into the Flex library APIs for each platform and comparing them to ensure consistent results between both.
One important consideration while choosing the library was finding one that sits on top of the native UI frameworks to assist with positioning and sizing, without replacing or altering the behaviour of the native UI framework. This means that, for example, a scrollable layout with Flex specifications on the server will be transformed by Appcraft into a native UICollectionView for iOS and into a RecyclerView for Android. This approach ensures that we still have access to new APIs and improvements available on the native UI frameworks for newer OS versions. We decided to move further with &lt;a href="https://texturegroup.org/"&gt;Texture&lt;/a&gt; on iOS and &lt;a href="https://fblitho.com/"&gt;Litho&lt;/a&gt; on Android.&lt;/p&gt;
&lt;h4&gt;Primitives&lt;/h4&gt;
&lt;p&gt;We've established a set of Primitive Components to serve as the foundation for constructing High-level UI Components. Starting with essentials such as &lt;code&gt;Label&lt;/code&gt;, &lt;code&gt;Button&lt;/code&gt;, &lt;code&gt;Image&lt;/code&gt;, &lt;code&gt;Video&lt;/code&gt;, and a &lt;code&gt;Layout&lt;/code&gt; container, these primitives form the building blocks for crafting intricate UI components. With these foundational elements in place, developers possess the flexibility to combine and customise them according to their application's unique requirements, unlocking a plethora of possibilities for UI design and interaction.&lt;/p&gt;
&lt;h4&gt;Behaviour&lt;/h4&gt;
&lt;p&gt;Users engage with apps through various events such as scrolling, tapping, long-pressing, and more. Each of these triggers a specific action in response, which in most cases results in a UI update or a side effect. We've devised a comprehensive set of actions to ensure the system responds effectively to these user-triggered and component life-cycle events. For example, &lt;code&gt;tap&lt;/code&gt; is an event and &lt;code&gt;navigate&lt;/code&gt; is an action. Additionally, there are implicit events designed to track user interactions, ranging from detailed events like &lt;code&gt;scroll-forward&lt;/code&gt; to simpler ones like &lt;code&gt;dismiss&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is what a component looks like in Appcraft:&lt;/em&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;layout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;root-container-layout-id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;flex&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;props&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;children&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;image&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;id1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;flex&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;props&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;events&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;tap&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;id2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;props&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;track&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;id3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;props&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;navigate&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;events&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
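&lt;p&gt;The mapping from events to actions in the JSON above can be read as a small dispatch table. The following Python sketch illustrates the idea only; the handler functions, the logging behaviour, and the decision to skip unknown action types are assumptions, not the behaviour of the actual iOS or Android clients.&lt;/p&gt;

```python
def make_handlers(log: list) -> dict:
    """Build hypothetical handlers keyed by the action "type" in the JSON."""
    return {
        "track": lambda action: log.append(("track", action["id"])),
        "navigate": lambda action: log.append(("navigate", action["id"])),
    }


def dispatch(component: dict, event: str, handlers: dict) -> int:
    """Run every action registered for the event, in order.

    Returns the number of actions that were handled.
    """
    handled = 0
    for action in component.get("events", {}).get(event, []):
        handler = handlers.get(action["type"])
        if handler is None:
            continue  # assumption: unknown action types are skipped
        handler(action)
        handled += 1
    return handled
```

&lt;p&gt;With the image component above, a &lt;code&gt;tap&lt;/code&gt; would first run the &lt;code&gt;track&lt;/code&gt; action and then the &lt;code&gt;navigate&lt;/code&gt; action, in the order in which they are listed.&lt;/p&gt;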

&lt;h3&gt;Simplified Versioning capabilities&lt;/h3&gt;
&lt;p&gt;With the previous TNA system, both server and clients had to exchange information about the schema version, adding complexity. We sought alternatives to reduce errors and simplify maintenance.
With a more flexible layout structure and by keeping the logic of binding data and layout in the server, we reduced complexity in the clients by leaving the sole responsibility of rendering to the app. The schema versioning remained on the server, making it easier to resolve issues such as retrieving the right component version for each client and allowing us the flexibility of customising UI and behaviours for each platform independently.
While it was not immediately apparent, maintaining this flexibility on the server allowed us to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enable or disable components and their behaviour targeting specific app versions, platforms, premises and A/B testing.&lt;/li&gt;
&lt;li&gt;Resolve incidents more quickly without a hotfix, for example by removing, on the server, components that are faulty on specific app versions or OS versions due to bugs or performance issues.&lt;/li&gt;
&lt;li&gt;Retain backward compatibility logic on the server, as we can specify a minimum version for a component.&lt;/li&gt;
&lt;li&gt;Add new Appcraft pages in the App without the need for client changes, by just configuring the new page route and the minimum app version supported.&lt;/li&gt;
&lt;/ul&gt;
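&lt;p&gt;As an illustration of the server-side gating described in the list above, the sketch below filters components by platform, minimum app version, and a per-version block list. It is a hypothetical Python example: the rule fields (&lt;code&gt;platforms&lt;/code&gt;, &lt;code&gt;min_version&lt;/code&gt;, &lt;code&gt;blocked_versions&lt;/code&gt;) and the component names are assumptions and do not reflect the actual Appcraft schema.&lt;/p&gt;

```python
def parse_version(text: str) -> tuple:
    """Turn "24.5" or "24.5.0" into a comparable 3-part tuple of ints."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions with zeros
    return tuple(parts)


def is_enabled(rule: dict, platform: str, app_version: str) -> bool:
    """Decide on the server whether a component is sent to this client."""
    if platform not in rule.get("platforms", ["ios", "android"]):
        return False
    if app_version in rule.get("blocked_versions", []):
        return False  # e.g. switched off during an incident, no hotfix needed
    minimum = rule.get("min_version", "0")
    return parse_version(app_version) >= parse_version(minimum)


def select_components(rules: list, platform: str, app_version: str) -> list:
    """Return the names of all components enabled for the given client."""
    return [
        rule["component"]
        for rule in rules
        if is_enabled(rule, platform, app_version)
    ]
```

&lt;p&gt;Keeping these rules on the server means that disabling a faulty component for one app version is a configuration change rather than a client release.&lt;/p&gt;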
&lt;h3&gt;Same-day delivery&lt;/h3&gt;
&lt;p&gt;In Zalando mobile engineering we operate in sprints, with each sprint culminating in an app release. In this model, even simple UI adjustments may require waiting for a new app version, and even longer for full adoption, which can be a significant bottleneck in a fast-paced organisation like Zalando.
Setting hotfixes aside, waiting for a complete release cycle just to move a label from left to right seems counterproductive. Appcraft is designed to be agile and responsive to user needs, and such delays can hinder our ability to deliver a dynamic user experience.
With the introduction of the Appcraft framework, delivery is no longer tied to app releases or sprint duration; changes can be made at any point during a sprint. The presentation layer can now be defined directly on the server using pre-defined primitives that are packaged within the app.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What does it look like when a new screen is required?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When a new screen is required, the process is streamlined and dynamic in our mobile applications. We heavily rely on deep-link navigation, allowing seamless transitions between different screens. In a truly dynamic system, the creation of deep-links should happen on the fly without the need to manually add routes in the clients every time.&lt;/p&gt;
&lt;p&gt;To achieve this, we've introduced a middle-man component that takes a deep-link and converts it into an API request that our framework can understand. This way, every time a new screen is needed, our stakeholders simply align on the deep-link structure and update the configuration according to the agreed-upon contract. With these adjustments in place, the setup is complete. The next step involves the renderer, which will then interpret the updated configuration and render the new screen accordingly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Appcraft deeplink resolution" src="https://engineering.zalando.com/posts/2024/05/images/deeplink-resolution-sequence-diagram.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;So when is a client-release needed?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A client-release is only required when there's a need to introduce a new primitive or extend the contract of an existing one to support additional behaviour.&lt;/p&gt;
&lt;p&gt;For example, when a simple label was not enough, we introduced a Composite Label that supports subtexts with their own font styling, decoration, and sizing. Thanks to its flexibility, it is currently the primitive used to render prices, among other things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How is a newly created screen tested?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We developed a demo app named the Appcraft Browser, featuring an address bar where any URL emitting Appcraft screen JSON can be provided as input. The screen definition is then rendered in an isolated environment with only the bare minimum dependencies, facilitating faster development without the need to build the entire app. This tool allows web developers to enter a localhost URL and test their changes seamlessly while working on the renderer.&lt;/p&gt;
&lt;p&gt;After the development stage, web developers open a PR, which allows them to deploy the rendering changes to a staging environment. The changes are then validated in a debug version of the Zalando app by entering the deployed PR number in the app debug settings. This allows testing against production screens in the actual app environment.&lt;/p&gt;
&lt;h2&gt;Appcraft's Business Impact&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dynamic content&lt;/strong&gt; - Currently, the Appcraft platform serves 13 different dynamic pages in the mobile app, which contribute to Zalando’s effort to consistently deliver quality content formats to mobile users for inspiration and personalisation around brands, recommendations, outfits, creators, collections and campaigns. Check out the most recently shipped feature powered by Appcraft, &lt;a href="https://en.zalando.de/stories/"&gt;Zalando Stories&lt;/a&gt;, and its &lt;a href="https://corporate.zalando.com/en/fashion/zalando-introduces-stories-setting-tone-fashion-and-culture"&gt;press release&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;App Theming/Redesign&lt;/strong&gt; - Since the inception of the Zalando App, the company has undergone several app redesigns, each demanding significant engineering effort and collaboration across multiple teams. However, when it comes to pages served by Appcraft, there has been a notable reduction in engineering effort compared to non-backend-driven UI. This is because the majority of changes are implemented on the server, benefiting both mobile platforms and all supported premises through common rules.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tracking Migrations over time&lt;/strong&gt; - Similar to UI redesigns, since the introduction of the Appcraft platform, the mobile apps have gone through two different tracking migrations, first in 2021 and now in 2024. For Appcraft screens, akin to UI changes, all tracking events and their schema are defined on the server. The mobile client's only task was to adopt a new SDK or in-house backend solution to pass the events on to the new analytics framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quick Prototyping&lt;/strong&gt; - We use Appcraft for fast prototyping. By creating new renderers in the backend, the engineers and designers were able to quickly iterate on different UI designs over the course of a week.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resilience&lt;/strong&gt; - Appcraft’s resilience has matured over time, with past incidents triggering some of the improvements. By deploying changes on the server within the same day, the MTTR for incidents is notably reduced. Moreover, the platform has been used successfully during &lt;a href="https://engineering.zalando.com/tags/cyber-week.html"&gt;Cyber Week&lt;/a&gt;, Zalando's biggest sales event, for the last couple of years.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User experience&lt;/strong&gt; - When a concept is added in Appcraft, it scales immediately to all screens via the backend. We are actively working on enhancing the user experience to be more delightful. We're currently exploring screen transitions, fluidity concepts, and micro-animations on the Appcraft platform.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Current challenges and evolution&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;While thoroughly enjoying the flexibility of adding screens without the involvement of any app engineers throughout the content experiences, we, as the platform team, find it challenging to keep track of the launched screens due to gaps in monitoring. Sometimes issues arise and reach us only when they become urgent fixes.&lt;/li&gt;
&lt;li&gt;Striking the right balance between generality and restrictiveness when creating a new feature in a backend-driven mobile framework is essential. It involves carefully considering factors such as usability, flexibility, consistency, performance, and compatibility to ensure that the feature meets the needs of both developers and end-users effectively.&lt;/li&gt;
&lt;li&gt;Testing has also become easier, but only over time. We enhanced the developer experience by enabling local testing for web developers, providing screen context injection for A/B testing, and eventually facilitating testing for pending changes to renderers (open PRs).&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We are currently addressing another significant challenge, known as &lt;strong&gt;Interoperability&lt;/strong&gt;, which refers to the reuse of existing non-Appcraft components in Appcraft and vice versa. To tackle this, we've introduced the capability to embed non-Appcraft components in Appcraft screens and to embed entire Appcraft screens within larger features. An example of this is the tabbed structure of the Home screen, where each tab is an Appcraft screen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dependency on third-party UI technology could pose a challenge because iOS and Android libraries may behave differently, requiring additional customization or default code to achieve consistent functionality and user experience across both platforms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;Due to organisational changes over the years – such as transitioning from a strong web engineering team with limited mobile resources to having equally strong web and mobile teams – the allocation of effort has become a topic of debate. Consequently, we've observed that feature ownership (mobile vs. web) can sometimes become unclear.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Appcraft has been serving as a stalwart in the realm of backend driven screen frameworks. Read all about the &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;backend system&lt;/a&gt; that empowers this platform.&lt;/em&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Mobile"/><category term="iOS"/><category term="Android"/><category term="Frameworks"/><category term="Elm"/><category term="Zalando App"/><category term="Backend"/><category term="Frontend"/></entry><entry><title>Theming the Zalando Design System</title><link href="https://engineering.zalando.com/posts/2024/05/theming-the-zalando-design-system.html" rel="alternate"/><published>2024-05-14T00:00:00+02:00</published><updated>2024-05-14T00:00:00+02:00</updated><author><name>Andrea Moretti</name></author><id>tag:engineering.zalando.com,2024-05-14:/posts/2024/05/theming-the-zalando-design-system.html</id><summary type="html">&lt;p&gt;The journey of introducing theming to the Zalando Design System, and the challenges faced along the way.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="mock webpage design showcasing the usage of design tokens" src="https://engineering.zalando.com/posts/2024/05/images/zds-theming.jpg#center"&gt;&lt;/p&gt;
&lt;h2&gt;Why theming?&lt;/h2&gt;
&lt;p&gt;As a design system evolves alongside the brand it represents, there are
often multiple occasions when a need to introduce variations arises. On the
business side of things there may be use cases for part of the customer journey
to have a distinct look and feel, or there may be sub-brands being part of a
larger platform. The &lt;a href="https://engineering.zalando.com/posts/2022/07/an-introduction-to-the-zalando-design-system.html"&gt;previous
article&lt;/a&gt;
on this blog gives a wider overview of the Zalando Design System. This article
will focus instead on the challenges encountered in the development of theming
capabilities.&lt;/p&gt;
&lt;p&gt;Introducing variations into the system, without compromising the baseline brand
identity and the benefits of reusing existing client components, is one of the
main reasons to explore the concept of theming.&lt;/p&gt;
&lt;p&gt;In the absence of a proper theming architecture, early attempts and
explorations of "theming" led to a number of hacky solutions that quickly
became hard to maintain and posed risks to the overall system stability. In the
past we encountered numerous challenges, including hidden CSS overrides, local
conditional logic, debatable API additions, and duplicated implementations. A
comprehensive theming solution quickly evolved from a "nice to have" into a
clear "must have".&lt;/p&gt;
&lt;p&gt;On a very high level, a theming architecture is just another instance of the
generic problem of balancing flexibility and usability. A very strict and
consistent design system makes development extremely fast, but as a company
evolves and business requirements start to deviate from the initially
identified rules we observe an increase in development and maintenance efforts.
In order to keep the system healthy, it quickly becomes a requirement to handle
the newly introduced flexibility as part of the design system itself.&lt;/p&gt;
&lt;p&gt;Coming up with a theming concept that is tailored to the company's strategy
and anticipates long-term goals beyond immediate business needs is one of the
most challenging steps in this process. Too much or too little flexibility can
lead to a system that is hard to use, becomes increasingly difficult and costly
to maintain, keeps growing over time, and impacts the performance and
maintenance costs of the systems involved.&lt;/p&gt;
&lt;p&gt;To give an idea of how theming is currently used at Zalando, the &lt;a href="https://en.zalando.de/premium-home-women"&gt;Designer
Home&lt;/a&gt; is a good example. You might
notice the use of monochromatic texts, larger and uppercase headings, and the
usage of rounded icon buttons. Those changes are all implemented via a theme
and can be easily enabled or disabled on any given page.&lt;/p&gt;
&lt;h2&gt;Defining boundaries&lt;/h2&gt;
&lt;p&gt;Imagine a design system as a list of properties that define how UIs of a
particular product should look and function. Now, consider theming as a
mechanism to allow changing the values of a subset of those properties. Using
this perspective there are two main areas of influence to shape a theming
architecture: defining properties, and defining their allowed values.&lt;/p&gt;
&lt;p&gt;For example, we could have a highly constrained theming concept, where
different themes are allowed to choose a text colour to be either black or red,
and buttons to be either rectangular or with rounded corners. In order to
implement those theming specifications, we will need to have two properties in
the system to represent the text colour and the border radius of buttons, as
well as a defined set of possible values for both (e.g. "black/red", and
"0px/32px").&lt;/p&gt;
&lt;p&gt;In reality, though, things are never this simple. Identifying a relatively
stable set of properties and values requires both a comprehensive understanding
of how the design system is currently used and a fair amount of
abstract thinking and product vision on how it may evolve in the future. The
balance between static (implicit or hardcoded) properties and dynamic ones
defines what a theme can or cannot do. When there is a discrepancy between
those capabilities and the product requirements, the expected advantages
quickly dissipate and new iterations on the concept are required.&lt;/p&gt;
&lt;p&gt;An important aspect to be discussed is the scope and area of influence of those
"themable" properties. While it may not be immediately obvious, there is a
clear distinction between defining a UI component in isolation, as opposed to
in a specific composition. Should a theme be able to change how a button looks
inside a product card, but not anywhere else? These kinds of questions are
inherently connected to the wider topic of ownership. Where can we draw the
line between generic UI components and business specific compositions? What
part of a visual change in the end user experience can be expressed as a global
theme change and which one as a localised business logic?&lt;/p&gt;
&lt;p&gt;It’s very easy to confuse the concept of "theming" as a capability of a
component library, with "theming" as a feature of a design system. Component
libraries do not encompass the entire design system, but are merely a tool that
follows its specifications and "implements" it for a specific purpose, for
example building web pages.&lt;/p&gt;
&lt;p&gt;Many of the popular open source design systems are showcased, documented, and
advertised via their implementation; usually one or more component libraries
for different platforms. One famous exception is Material Design, which from
the beginning only described the design system in the form of a series of specs
and guidelines.&lt;/p&gt;
&lt;p&gt;This confusion between design specification and implementation is mirrored in
the misunderstanding of what "theming" means for those two different concepts.
Most open source component libraries allow some level of theming via a number
of different technical approaches, usually using config files, shared contexts,
and some form of shared variables (design tokens). On the other hand, what
"theming" means on the design layer is often overlooked.&lt;/p&gt;
&lt;p&gt;Typically, a default theme that aligns with the brand's character and identity
is used. Theming is then offered as a way to adapt it to different
organisations, companies, or design systems. It’s very rare for theming
capabilities to be showcased as a way to express variations of the same design
system. A common exception, though, is the usage of colours. Material Design is
again a good example here: it was intended to be used by many different, not
necessarily related products and apps, keeping the interactions and the
tactile "material" metaphor consistent while allowing each of them to play with
a very large colour palette in order to introduce a level of identity and ownership.
Other libraries often showcase theming capabilities with custom colour
palettes, or defining dark mode themes.&lt;/p&gt;
&lt;p&gt;At Zalando, being one company with a well defined visual identity, introducing
the concept of theming raised a lot of questions around the related governance
rules and processes. How many themes may we need? How different can they look?
Who can/should own and create them? How to ensure a baseline visual identity?
Those and many other questions can be very hard to answer, and we will have to
address them as we iterate through the initial use cases.&lt;/p&gt;
&lt;h2&gt;Semantic design tokens&lt;/h2&gt;
&lt;p&gt;One of the very first challenges in making a design system themable is the
process of "tokenization". There are a number of repeated values scattered
across design specifications and source code that need to be extracted into
variables, known as design tokens, which can then be dynamically changed by
themes. For example, the same shade of orange might be used as the background
colour for a button, as well as the colour of the wishlist icon. A simple
initial approach would be to create a variable called &lt;code&gt;orange&lt;/code&gt; holding the
exact hex colour value and then consume it in the two different components.&lt;/p&gt;
&lt;p&gt;What will happen if a new theme now wants the button to be green? Surely, we
cannot simply reassign our &lt;code&gt;orange&lt;/code&gt; variable to a green value; that’s a recipe
for disaster. This leads us to an important second step: identifying the
semantic roles of different tokens and naming them accordingly. If, instead of
&lt;code&gt;orange&lt;/code&gt;, we call it &lt;code&gt;accent&lt;/code&gt;, there is no longer any confusion when
its value is changed to green, or any other colour.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[color.background]&lt;/span&gt;
&lt;span class="n"&gt;accent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;orange&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;[theme.foo]&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;While this may sound simple on paper, the reality can be extremely complex.
While trying to identify a reasonable set of semantic tokens out of our
existing design system, we had to go through many design iterations, often
leading to significant changes to the existing specifications. This process
reflects our dedication to evolving a system that wasn't originally designed
with semantic tokens in mind. We faced several common
challenges, including managing a large number of tokens, inconsistencies in
their usage, and a lack of clarity regarding which values should change
together or not.&lt;/p&gt;
&lt;p&gt;Among all the sweat and tears, though, this has been a great opportunity to
assess the quality of the design system itself. It has resulted in substantial
simplification, removal of unnecessary subtle variations, and increased
parity and consistency across the library implementations for other
platforms (Android and iOS).&lt;/p&gt;
&lt;p&gt;Once we got a stable set of global tokens, the next challenge we faced was how
to express variations that do not apply to everything, but only to specific
components. For example we could have a &lt;code&gt;padding.small&lt;/code&gt; token and use it across
many components, but what happens if we want the &lt;code&gt;button&lt;/code&gt; component to use
&lt;code&gt;padding.small&lt;/code&gt; in one theme and &lt;code&gt;padding.large&lt;/code&gt; in another one? We cannot
change the meaning of &lt;code&gt;padding.small&lt;/code&gt; globally as it would have repercussions
way beyond that specific button.&lt;/p&gt;
&lt;p&gt;This led to what we call "component-level theming", which is ultimately nothing
more than an additional level of indirection between a token name and its final
value. We can create a token &lt;code&gt;button.padding&lt;/code&gt; with a value of
&lt;code&gt;{padding.small}&lt;/code&gt;, which refers to another token rather than to a raw value. This
way a theme gains the flexibility to change the padding value used in the
button, as well as to define which global padding values are allowed.&lt;/p&gt;
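&lt;p&gt;As a minimal sketch of this indirection, using the same TOML notation as the
example above: the concrete values and the &lt;code&gt;theme.foo&lt;/code&gt; override are
illustrative assumptions, not the actual Zalando tokens.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Global tokens: the padding values a theme may choose from (hypothetical values)
[padding]
small.value = "0.5rem"
large.value = "1rem"

# Component-level token: refers to a global token instead of a raw value
[button]
padding.value = "{padding.small}"

# A theme repoints only the button token; padding.small itself is untouched
[theme.foo]
button.padding.value = "{padding.large}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this sketch, every other component that consumes &lt;code&gt;padding.small&lt;/code&gt;
keeps its spacing, while buttons rendered with &lt;code&gt;theme.foo&lt;/code&gt; switch to the
larger value.&lt;/p&gt;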
&lt;h2&gt;Colour schemes&lt;/h2&gt;
&lt;p&gt;At Zalando, we encountered various situations where we needed to alter the usage
of colours based on the background in use, in order to satisfy accessibility
colour contrast requirements and to achieve visually pleasing colour combinations.
Many banners on the website dynamically pick a background colour based on the
content of an image.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Banner with background colour based on the image
content" src="https://engineering.zalando.com/posts/2024/05/images/color-snap.jpg#center"&gt;&lt;/p&gt;
&lt;p&gt;To satisfy those needs, we introduced the concept of colour schemes, namely a
&lt;code&gt;monochrome-dark&lt;/code&gt; colour scheme to be used on dark (but not black) backgrounds,
and a &lt;code&gt;monochrome-light&lt;/code&gt; for the opposite use case. Counting the "default" look
and feel, it means that we need to support three different colour combinations.&lt;/p&gt;
&lt;p&gt;This solution, for us, predates the concept of themes, and we used to override
the values of palette colours directly, without semantic tokens in the picture
yet. When shaping the new theming architecture we had to take colour schemes
into account and make them first class citizens of themes.&lt;/p&gt;
&lt;p&gt;What "monochrome-dark" looks like in a given theme can be different from
another one. This means that each individual theme needs to support three
different colour schemes. With those requirements in mind, the logic to
determine the value of colour related design tokens becomes more complex, and
requires knowledge of the current active theme as well as the current colour
scheme.&lt;/p&gt;
&lt;p&gt;A constant source of confusion has been the relationship between colour schemes
and native dark mode that the user could potentially want to enable from the
operating system settings. While we always had full dark mode support in mind
when implementing colour schemes, and their current architecture can simplify
the creation of a native dark mode for Zalando, it would not necessarily be as
simple as enabling the "monochrome-dark" colour scheme on the entire page.&lt;/p&gt;
&lt;p&gt;Additional considerations will have to be made in order to proceed towards
native dark mode. For example, there would be a need to express the default
background colour through its own semantic token. Additionally, we would need to
clarify the relationship with themes and colour schemes. Would "dark" be
treated as a new colour scheme to be supported by each theme? Would "dark" and
"monochrome-dark" be the same thing? Can a colour scheme change depending on
native dark mode?&lt;/p&gt;
&lt;p&gt;All those questions lead to complex conversations about how themes are used,
their purpose, and the impact they have on the user experience. In order to
answer all of them, we may have to gradually iterate on those concepts in order
to find out what works and what doesn’t.&lt;/p&gt;
&lt;h2&gt;Style Dictionary&lt;/h2&gt;
&lt;p&gt;The core of our theming infrastructure is our design tokens repository. We use
&lt;a href="https://amzn.github.io/style-dictionary/#/"&gt;Style Dictionary&lt;/a&gt; as a framework,
and we define tokens in a single source of truth that can be consumed by
libraries implemented for different platforms. Style Dictionary allows us to use a
shared data format that can then be transformed to adapt to the needs of all
the consuming component libraries. For example, it takes care of converting and
using the right units and colour formats for web, Android, and iOS. Additionally,
it can generate platform-specific artefacts that can be bundled, published, and
consumed independently.&lt;/p&gt;
&lt;p&gt;Style Dictionary is also easy to customise to our specific needs. In particular,
with our own "transforms" and "formats", we can handle custom requirements in a
well-tested and reusable way. Some interesting examples are a transform that
handles a boolean "display" token type, mapping it to CSS properties on web while
keeping it as a boolean for app consumption, and another transform that
applies transparency to colours in a cross-platform way.&lt;/p&gt;
&lt;p&gt;Formats, on the other hand, can be used to customise the files generated for
each platform. We can run a single build, generating different artefacts, and
then have independent pipelines to publish them. This allows the web,
Android, and iOS teams to independently adapt the format of tokens to their
platform without affecting the others.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[color.text]&lt;/span&gt;
&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;black&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;primary-dark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;white&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;[spacing]&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1rem&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;s-desktop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2rem&amp;quot;&lt;/span&gt;

&lt;span class="k"&gt;[theme.foo]&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;spacing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.5rem&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The TOML format allows us to express the nested structure of tokens in a
human-friendly way. Within the &lt;code&gt;tokens&lt;/code&gt; folder, we have distinct files for different
categories, like spacing, colours, typography, etc. Each one creates a
namespace for the tokens defined inside it. By concatenating all the files
inside the &lt;code&gt;tokens&lt;/code&gt; folder, we obtain a single dictionary object that represents
the "base" theme. Colour schemes and responsive variants for each token,
instead, are expressed using extra tokens with predefined suffixes (e.g.
&lt;code&gt;-dark&lt;/code&gt;, &lt;code&gt;-tablet&lt;/code&gt;, etc.).&lt;/p&gt;
&lt;p&gt;A theme is created with a file located in a separate folder, which defines a
dictionary mirroring the structure of the base theme, but includes only tokens
that are changed. The final theme dictionary is then computed by deep merging
the base theme object with the theme one. This approach establishes a direct
inheritance of each theme from the base theme, and is particularly convenient
when it is expected for a base visual identity to be maintained across multiple
themes.&lt;/p&gt;
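&lt;p&gt;To illustrate the deep merge with the tokens from the example above, the
computed dictionary for &lt;code&gt;theme.foo&lt;/code&gt; would look roughly like this. It is a
hand-written sketch of the expected result, not actual build output.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Base theme deep merged with theme.foo
[color.text]
primary.value = "blue"        # overridden by theme.foo
primary-dark.value = "white"  # inherited from the base theme

[spacing]
s.value = "1.5rem"            # overridden by theme.foo
s-desktop.value = "2rem"      # inherited from the base theme
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Tokens that the theme file does not mention, including suffixed variants such as
&lt;code&gt;primary-dark&lt;/code&gt;, keep their base values unless the theme overrides them
explicitly.&lt;/p&gt;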
&lt;h2&gt;CSS Variables (WEB)&lt;/h2&gt;
&lt;p&gt;The main output format consumed by the web component library is a custom CSS
file containing all the tokens encoded as CSS variables. The variables are then
consumed by our CSS framework, which in turn exposes a library of classes for
our React components. Ultimately, when working on a component and consuming
some classes to set the primary text colour, there's no need for any knowledge
about themes, colour schemes, or screen sizes; we can assume the value will
be changed automatically based on the defined overrides. This effectively
decouples the implementation of components from the context in which they may
be used by providing a stable and reliable interface to get dynamic values from
a list of available semantic tokens.&lt;/p&gt;
&lt;p&gt;For this behaviour to happen automatically, themes, colour schemes, and
responsive variants for each token are implemented using classes to scope the
set of required overrides.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--spacing-s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;black&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;media&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;min-width&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;64rem&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;--spacing-s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;dark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;white&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;theme-foo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--spacing-s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This way, setting a theme or colour scheme class on a container ensures that
all its children resolve the tokens with the correct values. By relying on
classes, we are less dependent on more complex JavaScript-based tooling, and we
can add or remove the required classes in whichever way suits the use
case.&lt;/p&gt;
&lt;p&gt;Another advantage of using variable overrides is that we can express a whole
theme solely by its difference from the base one. This keeps the CSS size
overhead small and makes it possible to load a separate, small CSS file for a
theme only when needed. On the other hand, a drawback of this approach is that
nesting multiple themes inside each other on the same page is not possible
without duplicating all the existing tokens; otherwise we would get
unpredictable combinations depending on what each theme does or does not
override. Thus far, this hasn’t been a problem, as we do not anticipate multiple
themes appearing on the same page given our priority of maintaining visual
coherence for our users.&lt;/p&gt;
&lt;p&gt;Even without nested themes, the possibility of nested colour schemes
poses similar challenges, and we had to handle colours less efficiently by
duplicating all colour tokens for every colour scheme, even if they were
unchanged. Additionally, given that CSS rules with the same specificity are
applied based on their order of definition, the only way to guarantee that the
class of the closest themed parent wins would be to have additional
selectors for every possible nesting combination.&lt;/p&gt;
&lt;p&gt;While the recently introduced &lt;code&gt;:is()&lt;/code&gt; pseudo-class helps keep the code
readable, there is still no way to support arbitrary nesting, which requires us
to impose a hard limit on the nesting depth. In the near future, once it is
supported in all major browsers, the CSS &lt;code&gt;@scope&lt;/code&gt; at-rule should help
solve most of these issues and enable more complex nested theming
capabilities.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;b&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;c&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;b&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="c"&gt;/* etc... */&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;red&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;/* can be simplified to */&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;is&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;b&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;c&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;is&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;b&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;c&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;is&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;b&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;c&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;red&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;/* in the future, once @scope is supported */&lt;/span&gt;
&lt;span class="c"&gt;/* this also allows for arbitrary nesting levels */&lt;/span&gt;
&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;scope&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;a&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;red&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;One interesting caveat of using a class to override the value of
variables is related to how the value of CSS variables is resolved. A
&lt;code&gt;var()&lt;/code&gt; reference inside a variable's value is substituted on the element
where that declaration applies, and descendants inherit the already resolved
value rather than the reference. This becomes a bit complicated when the value
of a variable is a reference to another variable.&lt;/p&gt;
&lt;p&gt;For example, given this CSS and HTML:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;black&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;box&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;background-color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--color&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;box&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We do not get the intuitively expected behaviour: the box is black instead
of blue. This happens because &lt;code&gt;--color&lt;/code&gt; is resolved on &lt;code&gt;:root&lt;/code&gt;,
where &lt;code&gt;--primary&lt;/code&gt; is &lt;code&gt;black&lt;/code&gt;, and the box inherits that resolved
value. Counterintuitively, &lt;code&gt;--color&lt;/code&gt; is not re-evaluated when &lt;code&gt;--primary&lt;/code&gt;
changes on a descendant, unless a rule that declares &lt;code&gt;--color&lt;/code&gt; also matches
that descendant or an element below it.&lt;/p&gt;
&lt;p&gt;To address this, we can introduce an additional &lt;code&gt;scope&lt;/code&gt; class that
declares &lt;code&gt;--color&lt;/code&gt; again on the themed container, so that the reference is
resolved there:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;scope blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;box&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;black&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;scope&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;box&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;background-color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--color&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, the behaviour is in line with our expectations. As components are always
children of a possibly themed container, we can add a class to their root
container to enable this scoped resolution whenever we want a token to refer to
another token rather than a static value. This is especially beneficial in
scenarios involving component-level theming.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;black&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;dark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;white&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;scope&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;--button-color-text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--color-text-primary&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
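
&lt;p&gt;To make the resolution order easier to reason about, here is a deliberately
simplified model of it in TypeScript. It is a sketch and not a real CSS engine:
&lt;code&gt;ElementNode&lt;/code&gt;, &lt;code&gt;resolve&lt;/code&gt; and &lt;code&gt;computedValues&lt;/code&gt; are names made
up for this illustration, and a value is assumed to be either a literal or a
single &lt;code&gt;var(--name)&lt;/code&gt; reference. It reproduces the two cases above: without
the &lt;code&gt;scope&lt;/code&gt; class the box inherits a value that was already resolved on the
root, and with it the reference is resolved again on the themed container.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Simplified model of custom property resolution (illustrative only).
type Declarations = { [name: string]: string };

interface ElementNode {
  declarations: Declarations; // declarations that match this element
  parent?: ElementNode;
}

// Resolve one property against the values that apply to the current element.
function resolve(name: string, cascaded: Declarations, seen: string[] = []): string {
  const value = cascaded[name];
  // Missing or cyclic references are treated as invalid in this model.
  if (value === undefined || seen.indexOf(name) !== -1) return "(invalid)";
  const reference = /^var\((--[\w-]+)\)$/.exec(value.trim());
  return reference ? resolve(reference[1], cascaded, seen.concat(name)) : value;
}

// Descendants inherit resolved values; own declarations override them.
function computedValues(element: ElementNode): Declarations {
  const inherited = element.parent ? computedValues(element.parent) : {};
  const cascaded: Declarations = { ...inherited, ...element.declarations };
  const computed: Declarations = {};
  for (const name of Object.keys(cascaded)) {
    computed[name] = resolve(name, cascaded);
  }
  return computed;
}

// Case 1: --color is only declared on the root.
const root: ElementNode = {
  declarations: { "--primary": "black", "--color": "var(--primary)" },
};
const blue: ElementNode = { parent: root, declarations: { "--primary": "blue" } };
const box: ElementNode = { parent: blue, declarations: {} };

// Case 2: the themed container also matches the rule that declares --color.
const scopedBlue: ElementNode = {
  parent: root,
  declarations: { "--primary": "blue", "--color": "var(--primary)" },
};
const scopedBox: ElementNode = { parent: scopedBlue, declarations: {} };

console.log(computedValues(box)["--color"]); // black
console.log(computedValues(scopedBox)["--color"]); // blue
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;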

&lt;h2&gt;iOS&lt;/h2&gt;
&lt;p&gt;With the power of code generation at hand, it was tempting to convert the
design tokens stored in TOML files directly into source code that could be
consumed in any iOS project. At first, we attempted to map TOML directly to
Swift, but encountered some challenges. Firstly, this approach would have
allowed any engineer to extend the existing theme with new attributes.
Additionally, we had to figure out how to automate the publishing of new
versions of the library. Carthage, the dependency manager we use for iOS,
assumes that a dependency lives in a git repository and is referenced by a URL
from which it is downloaded and built. This means that all the generated files
would have to be committed and pushed to the GitHub repository, and we consider
committing generated source code a bad practice.&lt;/p&gt;
&lt;p&gt;With this in mind, we manually added some base Swift files that describe the
structure of a Theme, and switched our scripts to generate JSON files instead.
These are far less harmful to commit automatically, as they're just resources
and cannot contain any business logic. Using JSON to populate themes with
actual values should also give us flexibility if we ever consider downloading
themes from a server API.&lt;/p&gt;
&lt;p&gt;The architecture of the iOS library for consuming design tokens is
simple: there is a Theme structure that defines all the agreed attributes, and
an entity called &lt;code&gt;ThemeManager&lt;/code&gt; that loads the stored JSON files and
populates itself with all the known variations of a Theme. Any theme can then
be accessed from this &lt;code&gt;ThemeManager&lt;/code&gt; by its name.&lt;/p&gt;
&lt;p&gt;Applying a theme is a recursive process: a theme applied at a higher level,
let's say a screen, is automatically applied to all its subviews, then to the
subviews of these subviews, and so on. A view that doesn't change its
appearance depending on the theme doesn't interrupt the propagation, while
views that support theming show the result at once.&lt;/p&gt;
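&lt;p&gt;The loading and propagation described above can be sketched roughly as
follows. The actual library is written in Swift; this sketch uses TypeScript
only to keep the examples in this post in one language, and the
&lt;code&gt;Theme&lt;/code&gt; shape, the &lt;code&gt;themeDidChange&lt;/code&gt; hook, the &lt;code&gt;applyTheme&lt;/code&gt;
function and the theme names are assumptions made for illustration, not the
library's real API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical shapes; the real Swift types may differ.
interface Theme {
  name: string;
  attributes: { [key: string]: string };
}

class ThemeManager {
  private themes: { [name: string]: Theme } = {};

  // Parse one generated JSON document and register the theme under its name.
  load(json: string): void {
    const theme = JSON.parse(json) as Theme;
    this.themes[theme.name] = theme;
  }

  theme(named: string): Theme | undefined {
    return this.themes[named];
  }
}

interface View {
  subviews: View[];
  // Only views that support theming implement this hook.
  themeDidChange?(theme: Theme): void;
}

// Apply the theme to a view, then recurse into its subviews.
function applyTheme(view: View, theme: Theme): void {
  if (view.themeDidChange) view.themeDidChange(theme);
  for (const subview of view.subviews) applyTheme(subview, theme);
}

const manager = new ThemeManager();
manager.load('{"name":"base","attributes":{"colorTextPrimary":"black"}}');
manager.load('{"name":"designer","attributes":{"colorTextPrimary":"navy"}}');

const applied: string[] = [];
const label: View = {
  subviews: [],
  themeDidChange(theme) {
    applied.push(theme.attributes["colorTextPrimary"]);
  },
};
const spacer: View = { subviews: [label] }; // not themable, still forwards the theme
const screen: View = { subviews: [spacer] };

const designer = manager.theme("designer");
if (designer) applyTheme(screen, designer);
console.log(applied.join(",")); // navy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;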
&lt;p&gt;While adding theming support to different ZDS components, we faced a
problem. The appearance of a component is described by a Style object, which is
just a static structure encapsulating all the necessary attributes, such as
background colour, padding values, font size, etc. Every component has
multiple presets for this Style structure. For example, a Flag component can be
&lt;code&gt;default&lt;/code&gt;, &lt;code&gt;positive&lt;/code&gt; or &lt;code&gt;sale&lt;/code&gt;, and every such preset stores its own values
for the same attributes. Changing a theme would mean recreating the same Style
structure with different values. At that point, it seemed that we would have to store
&lt;code&gt;default&lt;/code&gt;, &lt;code&gt;positive&lt;/code&gt; and &lt;code&gt;sale&lt;/code&gt; Flag values for every theme separately, and
that adding a new theme would mean adding a new variant of the same presets
for every component. Not very scalable, is it?&lt;/p&gt;
&lt;p&gt;So we introduced &lt;code&gt;StyleTokens&lt;/code&gt;. For every component that supports theming,
a style token is just an enum that tells us which preset should be applied,
regardless of the actual values that come from a theme. Based on this
&lt;code&gt;StyleToken&lt;/code&gt; value, the actual &lt;code&gt;Style&lt;/code&gt; structure is generated every time
the appearance of the component should change.&lt;/p&gt;
&lt;p&gt;This means that the final look of every themable component depends on three
inputs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;style token&lt;/li&gt;
&lt;li&gt;theme name&lt;/li&gt;
&lt;li&gt;colour scheme&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every time any of these three changes, the theming engine creates a
new instance of the &lt;code&gt;Style&lt;/code&gt; object, which is used to redraw the view. Now we
can switch themes and add as many of them as we like without having to modify
existing components each time.&lt;/p&gt;
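&lt;p&gt;As a rough illustration of this idea, the sketch below derives a
&lt;code&gt;Style&lt;/code&gt; from the three inputs. It is written in TypeScript rather than
Swift, and everything in it is hypothetical: the type names, the palette
attributes, the colour values and the &lt;code&gt;makeFlagStyle&lt;/code&gt; function are
assumptions for this example, and the three colour scheme names are borrowed
from the Android section below. The point is only that the preset mapping is
defined once per component, while the actual values come from the theme.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical names and values; the real Swift types may differ.
type FlagStyleToken = "default" | "positive" | "sale";
type ColorScheme = "default" | "monoLight" | "monoDark";

interface Palette { background: string; text: string; positive: string; sale: string }
interface Theme { name: string; palettes: { [scheme in ColorScheme]: Palette } }
interface Style { backgroundColor: string; textColor: string }

// Defined once per component: which palette entries each preset uses.
const flagPresets: { [token in FlagStyleToken]: { background: keyof Palette; text: keyof Palette } } = {
  default: { background: "background", text: "text" },
  positive: { background: "positive", text: "background" },
  sale: { background: "sale", text: "background" },
};

// Called whenever the style token, the theme or the colour scheme changes.
function makeFlagStyle(token: FlagStyleToken, theme: Theme, scheme: ColorScheme): Style {
  const palette = theme.palettes[scheme];
  const preset = flagPresets[token];
  return { backgroundColor: palette[preset.background], textColor: palette[preset.text] };
}

const base: Theme = {
  name: "base",
  palettes: {
    default: { background: "white", text: "black", positive: "green", sale: "red" },
    monoLight: { background: "white", text: "black", positive: "black", sale: "black" },
    monoDark: { background: "black", text: "white", positive: "white", sale: "white" },
  },
};

// Adding a theme only adds values; flagPresets and makeFlagStyle stay untouched.
const designer: Theme = {
  name: "designer",
  palettes: {
    default: { background: "ivory", text: "navy", positive: "teal", sale: "crimson" },
    monoLight: { background: "ivory", text: "navy", positive: "navy", sale: "navy" },
    monoDark: { background: "navy", text: "ivory", positive: "ivory", sale: "ivory" },
  },
};

console.log(makeFlagStyle("sale", base, "default").backgroundColor); // red
console.log(makeFlagStyle("sale", designer, "monoDark").backgroundColor); // ivory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;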
&lt;h2&gt;Android&lt;/h2&gt;
&lt;p&gt;Theme resources (for the BaseTheme and its child themes) are generated in
Android-consumable resource formats, of which we have two:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;XML resources for Android ViewSystem&lt;/li&gt;
&lt;li&gt;JSON files for Compose&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These resources are then packaged and published as a library in our internal
Maven repository, ready to be consumed by the Android component library as well
as directly in the Zalando App codebase.&lt;/p&gt;
&lt;h3&gt;XML&lt;/h3&gt;
&lt;p&gt;This is the most used format as of now, given that most of our components are
still built in XML. In this format, theming is generated as
tokens/attributes that are then assembled into ready-to-consume XML themes,
e.g. &lt;code&gt;BaseTheme&lt;/code&gt;, &lt;code&gt;Designer&lt;/code&gt;, etc. These themes can be easily
applied using their IDs, e.g. &lt;code&gt;R.style.BaseTheme&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;JSON for Compose&lt;/h3&gt;
&lt;p&gt;Here, we generate theme tokens in the form of JSON files that are also packaged
and shipped in the same library. The ZDS library parses these files and
extracts the theming data from them, and a theming architecture is then built
on top of this data. This theming solution is also exposed as simple
semantic tokens that are ready to be consumed in all Composables (components
written in Compose).&lt;/p&gt;
&lt;h3&gt;ColorSchemes&lt;/h3&gt;
&lt;p&gt;We support three colour schemes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Default&lt;/li&gt;
&lt;li&gt;Mono-Light&lt;/li&gt;
&lt;li&gt;Mono-Dark&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each theme is generated with support for these three colour schemes, and it
gets to decide the actual colours for each one. The client/user of a theme
can choose when and where (on which part of their screen) to apply a given
colour scheme.&lt;/p&gt;
&lt;p&gt;In XML, we offer two colour scheme templates that, when applied to certain
sections of the screen, swap each colour to its Monochrome (Light or
Dark) variant. They work on all themes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;style&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;MonoLightScheme&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ColorSchemeType&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;MONO_LIGHT&lt;span class="nt"&gt;&amp;lt;/item&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;colorBackgroundSecondary&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;?colorBackgroundSecondaryMono
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/item&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In Compose, both &lt;code&gt;Theme&lt;/code&gt; and &lt;code&gt;ColorScheme&lt;/code&gt; are chosen at the root of the
&lt;code&gt;ZdsTheme&lt;/code&gt; selector. Because theming is simple to use in Compose, a new
&lt;code&gt;ZdsTheme&lt;/code&gt; &lt;code&gt;Composable&lt;/code&gt; can be used at any part of the experience to choose and
apply any combination of a ZDS theme and a colour scheme that fits the
requirements of that section of the screen.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ZdsTheme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;zdsThemeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ZdsThemeType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ZdsThemeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BaseTheme&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;zdsColorScheme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ZdsColorScheme&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ZdsColorScheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Unit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Component-level Theming&lt;/h3&gt;
&lt;p&gt;Component-level theming is an additional layer of tokens intended to alter
the visuals of a specific component without affecting the other components.
The Flag component, for example, can be modified without touching the rest of
the visual language/theme, so all other components are safe when the
default flag changes colour from primary to secondary or something else.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Theming a design system is a way to introduce variations in a controlled
manner. Depending on the business use case, careful consideration should be
given to how the theming architecture is designed. One of the most challenging
parts is to identify the properties that can be altered by a theme, as well as
the possible values they may have.&lt;/p&gt;
&lt;p&gt;Governance becomes a key aspect when introducing theming to a design system.
Like any other source of variations, themes should be managed and maintained in
a way that ensures the baseline visual identity is preserved. This includes
defining the number of themes, how different they can look, who can create
them, and how to ensure that the visual identity is maintained.&lt;/p&gt;
&lt;p&gt;By leveraging a single source of truth for design tokens, it becomes possible
to share the specifications of each theme across different platforms. This
allows for a predictable styling of all components, and decouples the
implementation of components from the themed context in which they are used.&lt;/p&gt;</content><category term="Zalando"/><category term="Design"/><category term="Frontend"/><category term="UX"/></entry><entry><title>Enhancing the Mock Server: A User Interface Approach</title><link href="https://engineering.zalando.com/posts/2024/04/enhancing-the-mock-server-a-ui-approach.html" rel="alternate"/><published>2024-04-25T00:00:00+02:00</published><updated>2024-04-25T00:00:00+02:00</updated><author><name>Carlos Tan</name></author><id>tag:engineering.zalando.com,2024-04-25:/posts/2024/04/enhancing-the-mock-server-a-ui-approach.html</id><summary type="html">&lt;p&gt;A summary of our experience using a mock server library in our frontend React application and how we leveraged a custom UI implementation to further enhance its flexibility during initial feature development phases.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Agile approach including the Mock Server" src="https://engineering.zalando.com/posts/2024/04/images/feature-dev-life-cycle-with-mock-server.png#previewimage"&gt;&lt;/p&gt;
&lt;h1&gt;Enhancing the Mock Server: A User Interface Approach&lt;/h1&gt;
&lt;p&gt;As far as feature life cycles go, we as a team follow certain agile practices in delivering features. We first discover and surface potential features or enhancements through data-driven approaches, which culminate in a proposal in the form of an intake document. Following its sign-off, we narrow the scope and define deliverables, taking an iterative approach that delivers the feature incrementally in more manageable milestones. Lastly, once we have fleshed out the technical documentation, initial design mockups, and API schemas, and have created the tickets, we begin the actual implementation.&lt;/p&gt;
&lt;p&gt;At this point, however, it is common that the API endpoints have not yet been developed, so frontend developers have to postpone fetching from live endpoints and continue developing the UI by mocking the API responses statically. Popular tools such as &lt;a href="https://miragejs.com/docs/getting-started/introduction/"&gt;mirage.js&lt;/a&gt; and &lt;a href="https://mswjs.io/docs/"&gt;MSW&lt;/a&gt; have arisen to tackle this issue. They facilitate the mocking of servers, typically by intercepting requests to the desired endpoints and returning predefined responses. This enables frontend developers to work independently from the backend while reducing the time needed to finish the milestone.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Agile approach including the Mock Server" src="https://engineering.zalando.com/posts/2024/04/images/feature-dev-life-cycle-with-mock-server.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 1. Agile approach including the Mock Server&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;
While this solved the issue of frontend independence, another issue arose during the review phases with our product manager. In a typical review cycle, developers first publish the current state of the feature on the staging environment so that it is easily accessible to authorized users but still hidden from the public. Those internal users can then inspect the feature, though only in the states that the mocked values allow it to display. Naturally, requests came along to see how the feature would react if the API returned certain edge-case responses. Each such request required an update to the code base, another pull request, and finally a deployment to the staging environment. We saw an opportunity to cut down these steps and make our colleagues more independent from developers when reviewing such feature behaviors.&lt;/p&gt;
&lt;h2&gt;Solution Summary&lt;/h2&gt;
&lt;p&gt;While the foundation for our solution is based on mirage.js, using similar libraries that allow server mocking should also be feasible. In our case, there was little reason to try a different library after having used it and having done initial research on its applicability. The bottleneck, however, was that these libraries were only able to mock each endpoint with a single response, requiring a change in code to load different mocked responses if desired.&lt;/p&gt;
&lt;p&gt;To overcome this, a UI was built on top of mirage.js so that users themselves could choose what specific endpoints should return as a response in order to make the application behave in a certain way. An example of this was our &lt;em&gt;Data Freshness&lt;/em&gt; feature, which rendered differently depending on how recent KPIs or other similar data were updated. If a product manager would like to check how that specific feature would change in appearance if the responsible endpoint either returned freshly added, late or no data at all, then they would only need to select the provided options on the mock server UI to have the changes take effect.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mock Server UI in action: mocking the /branding-campaigns-summary endpoint" src="https://engineering.zalando.com/posts/2024/04/images/example-mock-server-ui.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 2. Mock Server UI in action: mocking the /branding-campaigns-summary endpoint&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;
In this case, neither a developer nor a new staging deployment is needed in order for users to inspect specific UI edge cases and scenarios while also having the option to shut down the mock server on the fly once our backend has finished implementing live endpoints. The only additional step required is the setup of these edge cases that features could potentially exhibit in the form of multiple mocked data sets for the mock server to consume.&lt;/p&gt;
&lt;h2&gt;Deep Dive&lt;/h2&gt;
&lt;p&gt;The actual implementation of the mock server follows similar suggestions from the official docs of &lt;a href="https://miragejs.com/docs/getting-started/overview/"&gt;mirage.js&lt;/a&gt; in that we have to define three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the mocked data responses in JSON format&lt;/li&gt;
&lt;li&gt;a controller to define the endpoints we wish the mock server to intercept&lt;/li&gt;
&lt;li&gt;the instantiation of the mock server itself&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Provider Component&lt;/strong&gt;: To ensure the mock server intercepts all relevant endpoints, it should be instantiated before key parts of the application are mounted. Once instantiated, however, the mock server returns only a single response per endpoint. To overcome this limitation, the UI lets users control when the mock server is re-instantiated so that it can load different mocked responses based on their preferences. This is achieved with a wrapper component built on &lt;a href="https://react.dev/learn/passing-data-deeply-with-context"&gt;React’s Context API&lt;/a&gt;, which not only houses the logic for the re-instantiation but also simplifies setting up the mock server. By wrapping the main component with this &lt;em&gt;Provider&lt;/em&gt; component, developers can configure the mock server simply by passing the necessary props to it. The UI component (&lt;code&gt;&amp;lt;MockServer /&amp;gt;&lt;/code&gt;) can then gather all required information from the context without any additional props.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;isMockServerEnabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;production&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;App&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;isMockServerEnabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;MockServerProvider&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;apiNamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;makeServer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;makeServer&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;mockServerOptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/MockServerProvider&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;// In any nested component&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MockServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@dna-zdirect-ui/mock-server&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;MockServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Session Storage&lt;/strong&gt;: The other issue to overcome is passing different mocked responses to the endpoints. Since the user can change the returned responses of endpoints at any point of the app's lifecycle via UI options, a page refresh is necessary for the mock server to load a different set of mock data. Carrying over the chosen option through application state management is not possible, however, because the app is fully re-mounted after a page reload. The browser's session storage is used instead: it persists state outside of the app’s lifecycle, and its entries are cleaned up once the session has ended. A unique key is used in case multiple apps use this mock server implementation in the same session.&lt;/p&gt;
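&lt;p&gt;The session storage approach could be sketched roughly as follows. This is a minimal illustration and not the code of the internal module: &lt;code&gt;STORAGE_KEY&lt;/code&gt;, &lt;code&gt;saveSelection&lt;/code&gt;, &lt;code&gt;loadSelection&lt;/code&gt;, &lt;code&gt;resolveResponse&lt;/code&gt; and the response shapes are hypothetical names and data. The functions accept any object with &lt;code&gt;getItem&lt;/code&gt; and &lt;code&gt;setItem&lt;/code&gt;, so &lt;code&gt;window.sessionStorage&lt;/code&gt; can be passed in the browser, while an in-memory stand-in keeps the sketch runnable elsewhere.&lt;/p&gt;

```javascript
// Hypothetical sketch: persist the selected mock scenario per endpoint in a
// Storage-like object (window.sessionStorage in the browser).
// All names and response shapes below are illustrative assumptions.
const STORAGE_KEY = "my-app:mock-server"; // unique per app sharing the session

// Read the stored selection; an empty object means "nothing chosen yet".
function loadSelection(storage) {
  const raw = storage.getItem(STORAGE_KEY);
  return raw === null ? {} : JSON.parse(raw);
}

// Remember which scenario the user picked for an endpoint.
function saveSelection(storage, endpoint, scenario) {
  const current = loadSelection(storage);
  current[endpoint] = scenario;
  storage.setItem(STORAGE_KEY, JSON.stringify(current));
}

// Pick the response to serve: the chosen scenario if it exists, else "default".
function resolveResponse(scenarios, selection, endpoint) {
  const forEndpoint = scenarios[endpoint] || {};
  const chosen = selection[endpoint];
  return chosen in forEndpoint ? forEndpoint[chosen] : forEndpoint.default;
}

// In-memory stand-in so the sketch also runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: function (key) { return data.has(key) ? data.get(key) : null; },
    setItem: function (key, value) { data.set(key, String(value)); },
  };
}

// Several mocked data sets for one endpoint (made-up payloads).
const scenarios = {
  "/branding-campaigns-summary": {
    default: { freshness: "fresh" },
    late: { freshness: "late" },
    empty: { freshness: null },
  },
};

const storage = createMemoryStorage();
saveSelection(storage, "/branding-campaigns-summary", "late");

// After a page reload, the mock server setup would read the selection back.
const response = resolveResponse(
  scenarios,
  loadSelection(storage),
  "/branding-campaigns-summary"
);
console.log(response);
```

&lt;p&gt;Keeping the selection under a single namespaced key means that a second app using the same mechanism in the same session only needs a different key to avoid clashes.&lt;/p&gt;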
&lt;p&gt;&lt;img alt="Screenshot of the inspection window: console tab" src="https://engineering.zalando.com/posts/2024/04/images/introspection-window-mirage-response.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the inspection window: application tabs" src="https://engineering.zalando.com/posts/2024/04/images/introspection-window-session-storage.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 3 + Fig 4. Screenshots of the inspection window: Console and application tabs&lt;/figcaption&gt;
&lt;p&gt;&lt;br /&gt;
The UI itself is composed of components provided by a &lt;em&gt;UI-Kit&lt;/em&gt; library, which allowed quick development and a consistent design. Its main requirements are to let the user easily select the desired mocked responses, trigger a page reload, and disable or re-enable the mock server.&lt;/p&gt;
&lt;h2&gt;Limitations and Alternatives&lt;/h2&gt;
&lt;p&gt;Building on top of mirage.js not only retains the inherent advantage of developing an app's API and UI in parallel but also makes mocking more flexible and accessible. The solution:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;allows visual documentation and a showcase of edge-case scenarios&lt;/li&gt;
&lt;li&gt;enables the mocking of endpoints on the fly&lt;/li&gt;
&lt;li&gt;provides ease of use by means of a customized and non-intrusive UI&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This solution is by no means an alternative to writing proper unit tests for edge-case scenarios. In fact, unit tests take precedence, while this mock server acts as an enhancement during an app’s development by providing an easier way to showcase such scenarios, e.g. during demos. Similarly, contract testing, which verifies that services such as an API provider and a client correctly understand requests and correctly generate responses, also takes precedence. Mocks shine most in the development phases in which the API services are still being built, where they can act as an interim solution until these services are available.&lt;/p&gt;
&lt;p&gt;While this specific implementation targets REST APIs, the approach should also be compatible with a GraphQL architecture, like the one provided by the &lt;a href="https://www.apollographql.com/docs/react/"&gt;Apollo&lt;/a&gt; framework, which already comes bundled with its own mocking solution. Whichever technology is used, however, the mocks are defined entirely on the frontend side, meaning that they do not exercise the API validation and error handling of any backend service. Special attention therefore has to be paid to keeping the mocks in sync with the schema of the backend service they stand in for.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;All in all, the feedback has been positive, especially from our product managers and designers. Including this mock server in our apps improves the collaboration between them and engineers by making it easier to present features in various development phases. It also eases the setup of a mock server for engineers by encapsulating non-business-related logic and providing intuitive components. After a couple of implementations, a more generalized version of this mock server has been developed, which is internally available as a separate NPM module.&lt;/p&gt;
&lt;p&gt;Lastly, while this is a niche solution that might not fit with many setups, we'd like to stress the importance of giving developers space, resources, and support within their team to explore and experiment in a variety of ways, so that ideas may have enough time to bear fruit.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="APIs"/><category term="Testing"/><category term="Debugging"/><category term="Backend"/></entry><entry><title>Enhancing Distributed System Load Shedding with TCP Congestion Control Algorithm</title><link href="https://engineering.zalando.com/posts/2024/04/enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm.html" rel="alternate"/><published>2024-04-23T00:00:00+02:00</published><updated>2024-04-23T00:00:00+02:00</updated><author><name>Andrew Meleka</name></author><id>tag:engineering.zalando.com,2024-04-23:/posts/2024/04/enhancing-distributed-system-load-shedding-with-tcp-congestion-control-algorithm.html</id><summary type="html">&lt;p&gt;Load shedding is a common problem in event-driven systems, and it becomes even harder when that load needs to be prioritized. Here we present how we solved this problem using a well-known algorithm that is used in TCP congestion control.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="A busy port where hundreds of containers wait to be loaded to ships or trailers, photo by CHUTTERSNAP on Unsplash" src="https://engineering.zalando.com/posts/2024/04/images/busy_port.jpg#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Our team is responsible for sending out communications to all our customers at Zalando, e.g. confirming a placed order,
informing about new content from a favourite brand, or announcing sales campaigns. Both while preparing those messages
and while sending them out via different service providers, we have to deal with limited resources. We cannot always
process all requested communication immediately, which occasionally leads to a backlog of requests.&lt;/p&gt;
&lt;p&gt;But not all communication is equally important. Our business stakeholders asked us to ensure that we process the
communication which supports critical business operations within the given &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;service level objectives&lt;/a&gt; (SLOs).&lt;/p&gt;
&lt;p&gt;This led us to investigate the solution space for load shedding. Load shedding has already been
addressed in &lt;a href="https://github.com/zalando/skipper/issues/2004"&gt;Skipper&lt;/a&gt;. Our system, however, is event driven: all
requests we process are delivered as events via &lt;a href="https://nakadi.io/"&gt;Nakadi&lt;/a&gt;, so Skipper's feature does not help here. But
why not use the same underlying idea?&lt;/p&gt;
&lt;p&gt;We know that we meet our SLOs as long as our system runs within its normal limits. If we controlled the ingestion of message
requests into our system, we would be able to process them in a timely manner. In addition, we would need to combine
this ingestion control with a prioritization of those requests which support critical business operations.&lt;/p&gt;
&lt;h2&gt;Overview of the System&lt;/h2&gt;
&lt;p&gt;First, let me introduce you to the system under load.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Communication Platform Overview" src="https://engineering.zalando.com/posts/2024/04/images/communication_platform_overview.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Communication Platform Overview&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nakadi.io/"&gt;Nakadi&lt;/a&gt; is a distributed event bus that offers a RESTful API on top of Kafka-like queues. This
component serves a couple of thousands of event types published by different teams Zalando wide for different purposes.
Out of those more than 1000 different event types trigger customer communication.&lt;/p&gt;
&lt;p&gt;The Stream Consumer is the microservice that acts as the entry point for the events into the entire platform. It is
responsible for consuming the events from Nakadi, applying few processing, and pushing them to the RabbitMQ broker.
Every Nakadi event type is processed by an instance of the Event Listener.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt; is a message broker and should be considered as the backbone of our platform. It
is responsible for receiving the events from stream consumer and making them available for the downstream services.&lt;/p&gt;
&lt;p&gt;Our Platform consists of many services. These microservices are responsible for processing the events. This includes but
is not limited to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rendering messages (both push notification &amp;amp; email)&lt;/li&gt;
&lt;li&gt;Checking for the customers' consent, preference and blocklist&lt;/li&gt;
&lt;li&gt;Checking for the customers' eligibility&lt;/li&gt;
&lt;li&gt;Storing templates and the configurations of Zalando's different tenants&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Inside the platform, many components interact with each other, and most of the communication between
those components goes through RabbitMQ.&lt;/p&gt;
&lt;p&gt;Each service publishes to one or more exchanges and consumes from one or more queues. With so much
communication going on between the services, RabbitMQ is the middleman for all
of it.&lt;/p&gt;
&lt;h3&gt;High Level Design&lt;/h3&gt;
&lt;p&gt;&lt;img alt="High Level Architecture" src="https://engineering.zalando.com/posts/2024/04/images/high_level_design.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;&lt;/figcaption&gt;

&lt;p&gt;We know that a suitably sized backlog behind each application lets it scale out and reach its best
throughput, which in turn lets us achieve our SLOs. The system is able to adjust the resources acquired from Kubernetes
based on demand (using a scaling mechanism based on CPU/memory/endpoint-calls/backlogs).&lt;/p&gt;
&lt;p&gt;We consider the whole platform as a system with an interface, and we strive to protect it at the interface level, by
avoiding overwhelming that system with messages that it can't handle in proper time. This means we can steer the
ingestion based on the priority and the available capacity of the system.&lt;/p&gt;
&lt;p&gt;The Stream Consumer implements adaptive concurrency management using
&lt;a href="https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease"&gt;Additive Increase Multiplicative Decrease&lt;/a&gt;
(AIMD). This algorithm reacts to reduced service capacity: as long as no congestion is detected, the request rate is
increased by a fixed amount, and whenever congestion is detected, it is reduced by a multiplicative factor.&lt;/p&gt;
&lt;p&gt;We needed to find proper indicators for reduced service capacity. The Stream Consumer publishes the messages to
RabbitMQ, so we looked for indicators available from RabbitMQ. As the first indicator, we decided to use
errors: whenever we can’t publish, we should reduce the consumption rate. The second is more subtle. RabbitMQ is able to
apply back-pressure when slow consumers are detected and system resources are consumed too fast. In this case,
RabbitMQ slows down the publish rate, which the publisher experiences as an increase in publish time. The Stream
Consumer observes both metrics and adjusts the consumption rate.&lt;/p&gt;
&lt;p&gt;Reducing the consumption for all event types would help to run the system within its limits, but it does not yet prioritize
the critical ones. The component must be able to selectively adjust how fast the Stream Consumer consumes events from
Nakadi. Therefore, every event type is assigned a rate based on its priority and the system load. This
ensures that every reader gets its dedicated capacity. If more capacity is available, the system
adjusts accordingly and provides a higher rate to event types with a higher demand (backlog).&lt;/p&gt;
&lt;p&gt;Thus there is no need to determine the tipping-point throughput of a single service. The AIMD algorithm also adapts
to increased capacity after the system scales. Most importantly, the algorithm requires only a local variable, which
avoids central coordination such as a shared database.&lt;/p&gt;
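&lt;p&gt;To make the two indicators concrete, here is a minimal sketch of how a congestion decision could be derived from one sampling window. The threshold values, the field names and the functions &lt;code&gt;p50&lt;/code&gt; and &lt;code&gt;isCongested&lt;/code&gt; are assumptions for illustration; the configured limits of the real service are not stated in this post.&lt;/p&gt;

```javascript
// Hypothetical thresholds; the real configured limits are not given here.
const LIMITS = { maxP50PublishMs: 50, maxPublishErrors: 0 };

// Median (P50) of the publish latencies observed in one sampling window.
// For an even number of samples this returns the lower of the two middle values.
function p50(latenciesMs) {
  if (latenciesMs.length === 0) return 0;
  const sorted = latenciesMs.slice().sort(function (a, b) { return a - b; });
  return sorted[Math.floor((sorted.length - 1) / 2)];
}

// Congestion is signalled by any publish error or by a P50 above the limit.
function isCongested(sample, limits) {
  return sample.errors > limits.maxPublishErrors
    || p50(sample.latenciesMs) > limits.maxP50PublishMs;
}

// Three made-up sampling windows.
const calm = { latenciesMs: [8, 12, 9, 15, 11], errors: 0 };
const backPressure = { latenciesMs: [40, 95, 120, 80, 60], errors: 0 };
const failing = { latenciesMs: [10, 9], errors: 2 };

console.log(isCongested(calm, LIMITS));         // false
console.log(isCongested(backPressure, LIMITS)); // true
console.log(isCongested(failing, LIMITS));      // true
```

&lt;p&gt;Using the median rather than the maximum keeps a single slow publish from triggering a slow-down, at the price of reacting a little later to genuine back-pressure.&lt;/p&gt;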
&lt;p&gt;By following this approach we&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Avoid multiple changes in all the microservices by scoping it to one component.&lt;/li&gt;
&lt;li&gt;Achieve prioritization on the service consumption level, hence avoid the need to prioritize messages inside the
  platform.&lt;/li&gt;
&lt;li&gt;Get a scalable solution with no single point of failure.&lt;/li&gt;
&lt;li&gt;Use Nakadi to persist the backlog, hence reducing risk to overload RabbitMQ.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will need to tune the actual value (latency of publishing to RabbitMQ) used as an indicator for reducing ingestion.
It should keep enough load on the system to trigger scaling of services in the platform while also reducing the number of
messages stored in RabbitMQ.&lt;/p&gt;
&lt;h3&gt;Low Level Design&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Changes in Stream Consumer" src="https://engineering.zalando.com/posts/2024/04/images/low_level_design.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Changes in Stream Consumer&lt;/figcaption&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Statistics Collector&lt;/strong&gt;: Collects statistics about the latency (e.g. P50) of publishing to RabbitMQ as well as any
  exceptions thrown while publishing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Congestion Detector&lt;/strong&gt;: Decides whether there is congestion in the system by comparing the data it
  receives from the statistics collector (publish latency and thrown exceptions) with the limits configured
  in the service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Throttle&lt;/strong&gt;: Provided as one instance per consumer. This is the class that implements the AIMD algorithm. The
  consumer instantiates it with the priority of its event type; that priority then determines
  the increase/decrease of the permitted events/sec that can be consumed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How the Design Works&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;When the Stream Consumer starts, all the event listeners start with an initial consumption batch size. They will also
   instantiate a throttle instance.&lt;/li&gt;
&lt;li&gt;The statistics collector cron job kicks in, collecting some statistics about latency (P50) and exceptions, and then
   calls the congestion detector to provide the results.&lt;/li&gt;
&lt;li&gt;The congestion detector checks the data it receives, and makes a decision whether there is congestion or not by
   comparing the data received with the limits set in the configurations. Congestion detector passes its decision to all
   the throttles associated with each event listener through an observer pattern.&lt;/li&gt;
&lt;li&gt;The throttle, once called, and depending on the decision from the congestion detector as well as the priority it was
   given when the consumer started, will decide the new batch size using the AIMD. (Note: there is no coordination
   between different throttles!).&lt;/li&gt;
&lt;li&gt;As modifying the batch size is currently not supported natively by Nakadi, the application will slow down/speed up
   the consumption accordingly.&lt;/li&gt;
&lt;/ol&gt;
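&lt;p&gt;Steps 3 and 5 above can be sketched as follows. This is an illustration under stated assumptions rather than the actual Stream Consumer code: &lt;code&gt;createCongestionDetector&lt;/code&gt; and &lt;code&gt;pauseBetweenBatchesMs&lt;/code&gt; are hypothetical names, and pacing by waiting between fixed-size batches is only one possible way to slow down or speed up consumption.&lt;/p&gt;

```javascript
// Minimal observer: the detector pushes each decision to every subscriber.
// In the design above, each subscriber would be the throttle of one listener.
function createCongestionDetector() {
  const listeners = [];
  return {
    subscribe: function (listener) { listeners.push(listener); },
    publish: function (congested) {
      listeners.forEach(function (listener) { listener(congested); });
    },
  };
}

// Assumption: consumption is paced by waiting between batches so that a
// fixed-size batch matches the permitted events/sec.
function pauseBetweenBatchesMs(batchSize, permittedEventsPerSec) {
  return Math.ceil((batchSize / permittedEventsPerSec) * 1000);
}

const detector = createCongestionDetector();
const seen = { orders: [], newsletters: [] };

// Two independent subscribers; there is no coordination between them.
detector.subscribe(function (congested) { seen.orders.push(congested); });
detector.subscribe(function (congested) { seen.newsletters.push(congested); });

detector.publish(false); // one collector run without congestion
detector.publish(true);  // the next run detects congestion

console.log(seen.orders);
console.log(pauseBetweenBatchesMs(100, 50)); // 100 events at 50 events/sec
```

&lt;p&gt;Because every subscriber only receives the shared decision and keeps its own state, adding another event listener does not require any change to the detector.&lt;/p&gt;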
&lt;h3&gt;How Priorities Affect Speeding Up and Slowing Down Consumption&lt;/h3&gt;
&lt;p&gt;Let’s suppose that we have 3 priorities in our system, from P1 to P3, where P1 is the highest and P3 is the lowest. The Stream
Consumer configuration defines a speed-up and a slow-down value for each priority.&lt;/p&gt;
&lt;p&gt;First scenario, signal for consumption speeding up (relieved RabbitMQ cluster)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For each priority, there will be a defined value for the speeding up, let’s assume some numbers here:&lt;ul&gt;
&lt;li&gt;P1: 15&lt;/li&gt;
&lt;li&gt;P2: 10&lt;/li&gt;
&lt;li&gt;P3: 5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;So the new consumption rates (batch sizes) will be:&lt;ul&gt;
&lt;li&gt;P1: Previous value + 15&lt;/li&gt;
&lt;li&gt;P2: Previous value + 10&lt;/li&gt;
&lt;li&gt;P3: Previous value + 5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Additive Increase" src="https://engineering.zalando.com/posts/2024/04/images/additive_increase.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Additive Increase&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Second scenario, signal for consumption slowing down (RabbitMQ cluster under load)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Here too, a different slow-down value is set per priority. Let’s assume the following numbers:&lt;ul&gt;
&lt;li&gt;P1: 20% decrease&lt;/li&gt;
&lt;li&gt;P2: 40% decrease&lt;/li&gt;
&lt;li&gt;P3: 60% decrease&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;So the new consumption rates will be:&lt;ul&gt;
&lt;li&gt;P1: Previous value * (100% - 20%) =&amp;gt; 20% decrease&lt;/li&gt;
&lt;li&gt;P2: Previous value * (100% - 40%) =&amp;gt; 40% decrease&lt;/li&gt;
&lt;li&gt;P3: Previous value * (100% - 60%) =&amp;gt; 60% decrease&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Multiplicative Decrease" src="https://engineering.zalando.com/posts/2024/04/images/multiplicative_decrease.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Multiplicative Decrease&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;So, the rule of thumb here is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whenever the RabbitMQ cluster is not under load, we speed up the consumption rate for all consumers, but we give more
  additional capacity to higher-priority event types than to lower-priority ones.&lt;/li&gt;
&lt;li&gt;Whenever the RabbitMQ cluster is under load, we slow down the consumption rate by a percentage for all consumers,
  but those with a high priority are reduced by a much smaller percentage than those with a lower priority.&lt;/li&gt;
&lt;/ul&gt;
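&lt;p&gt;The two scenarios can be combined into a small worked example. The step sizes and percentages are the example numbers from above; the initial rate of 100, the floor &lt;code&gt;minRate&lt;/code&gt; and the names &lt;code&gt;PRIORITY_CONFIG&lt;/code&gt; and &lt;code&gt;createThrottle&lt;/code&gt; are assumptions made for this sketch, not the actual implementation.&lt;/p&gt;

```javascript
// Step sizes and decrease percentages are the example numbers from the text.
const PRIORITY_CONFIG = {
  P1: { increase: 15, decreasePercent: 20 },
  P2: { increase: 10, decreasePercent: 40 },
  P3: { increase: 5, decreasePercent: 60 },
};

// One throttle per consumer. minRate is an assumed floor so that a consumer
// never stalls completely; it is not mentioned in the article.
function createThrottle(priority, initialRate, minRate) {
  const config = PRIORITY_CONFIG[priority];
  let rate = initialRate;
  return {
    // Additive increase when relieved, multiplicative decrease when congested.
    onDecision: function (congested) {
      rate = congested
        ? Math.max(minRate, Math.floor((rate * (100 - config.decreasePercent)) / 100))
        : rate + config.increase;
      return rate;
    },
    currentRate: function () { return rate; },
  };
}

// Every priority starts from the same assumed rate of 100.
const throttles = ["P1", "P2", "P3"].map(function (priority) {
  return createThrottle(priority, 100, 1);
});

// Two relieved intervals followed by one congested interval.
[false, false, true].forEach(function (congested) {
  throttles.forEach(function (throttle) { throttle.onDecision(congested); });
});

console.log(throttles.map(function (t) { return t.currentRate(); }));
```

&lt;p&gt;Starting from the same rate, P1 climbs to 130 and falls back to 104, P2 climbs to 120 and falls back to 72, and P3 climbs to 110 and falls back to 44. After a single congestion signal, the high-priority consumer still runs above its starting rate while the low-priority one has dropped to less than half of it.&lt;/p&gt;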
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;So far, we have been running the solution in production for around 6 months, and we have seen a lot of improvements in
the platform, including:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Less stress on RabbitMQ cluster, as the messages are not pushed to it unless there is enough capacity to handle them.
    &lt;img alt="RabbitMQ Messages" src="https://engineering.zalando.com/posts/2024/04/images/rabbitmq_messages.png#center"&gt;
    &lt;figcaption style="text-align:center"&gt;RabbitMQ Messages&lt;/figcaption&gt;
    &lt;br/&gt;&lt;/p&gt;
&lt;p&gt;There are around 300k messages in the queue backlog of one application, while the other applications are not under load, as is
evident from the small number of messages in their queues. The reduced stress on the RabbitMQ cluster is also visible
when comparing the number of messages in the queues with the number of messages in the Nakadi backlog (point 3 below).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Prioritization of messages: higher-priority messages are sent first, and lower-priority messages are sent later.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Order Confirmation Processing Time" src="https://engineering.zalando.com/posts/2024/04/images/order_confirmation_processing_time.png#center"&gt;
&lt;figcaption style="text-align:center"&gt;Order Confirmation Processing Time&lt;/figcaption&gt;
&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Commercial Messages Processing Time" src="https://engineering.zalando.com/posts/2024/04/images/commercial_messages_processing_time.png#center"&gt;
&lt;figcaption style="text-align:center"&gt;Commercial Messages Processing Time&lt;/figcaption&gt;
&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;In the above diagrams, you see that the processing time for order confirmation is relatively stable. This is important
as it’s a high priority use case. In contrast, commercial messages experience an increase in the processing time. This
is acceptable as this is a low priority use case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Events that can't be processed at the moment are still in Nakadi, so they can be processed later or easily discarded
   in case of emergency.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nakadi Backlog" src="https://engineering.zalando.com/posts/2024/04/images/large_nakadi_backlog.png#center"&gt;
&lt;figcaption style="text-align:center"&gt;Nakadi Backlog&lt;/figcaption&gt;
&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the backlog is being consumed without putting pressure on the platform. Messages of lower priority can
be discarded in case of emergency.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nakadi Order Confirmation Backlog" src="https://engineering.zalando.com/posts/2024/04/images/nakadi_backlog_order_confirmation.png#center"&gt;
&lt;figcaption style="text-align:center"&gt;Nakadi Order Confirmation Backlog&lt;/figcaption&gt;
&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The order confirmation is a P1 priority message, so it's being consumed first (during the same period, the backlog of
lower-priority messages was growing).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Using a TCP congestion control algorithm to control traffic proved to be effective in event-driven systems. In
general, it's much better to control how much traffic is ingested into your system at the source, rather than letting
it flood the system and then trying to deal with it.&lt;/p&gt;
&lt;p&gt;In our case, it helped us solve the problem of message prioritization: messages are only allowed to enter the
system based on their priority and the capacity the system can handle. It also helped us avoid using the RabbitMQ
cluster as storage for millions of messages; keeping queues small in RabbitMQ follows best practices. In case
of emergency, we can easily discard messages, as most of them will still be in the source.&lt;/p&gt;
&lt;h2&gt;Resources&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=m64SWl9bfvk"&gt;Stop Rate Limiting! Capacity Management Done Right | Strange Loop Conference | 2017&lt;/a&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Event Driven"/><category term="Microservices"/><category term="Scalability"/><category term="Backend"/></entry><entry><title>12 Golden Signals To Discover Anomalies And Performance Issues on Your AWS RDS Fleet</title><link href="https://engineering.zalando.com/posts/2024/02/twelve-golden-signals-to-discover-anomalies-and-performance-issues-on-aws-rds.html" rel="alternate"/><published>2024-02-20T00:00:00+01:00</published><updated>2024-02-20T00:00:00+01:00</updated><author><name>Dmitry Kolesnikov</name></author><id>tag:engineering.zalando.com,2024-02-20:/posts/2024/02/twelve-golden-signals-to-discover-anomalies-and-performance-issues-on-aws-rds.html</id><summary type="html">&lt;p&gt;Automate anomaly detection for AWS RDS at scale.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Logo rds-health utility" src="https://engineering.zalando.com/posts/2024/02/images/rds-health-v2.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The database-per-service pattern in the microservices world brings an overhead of operating database instances and observing their health status and anomalies. Standardisation of methodology and tooling is a key factor for success at scale. We have incorporated learnings from past incidents, anomalies and empirical observations into a methodology for observing the health status using 12 golden signals. The simplest way to adopt this methodology within your engineering environment is the open-source utility &lt;a href="https://github.com/zalando/rds-health"&gt;rds-health&lt;/a&gt;, which we recently released.&lt;/p&gt;
&lt;h3&gt;The problem of maintaining robustness at scale&lt;/h3&gt;
&lt;p&gt;Since Zalando &lt;a href="https://engineering.zalando.com/posts/2016/10/jimmy-to-microservices-the-journey-one-year-later.html"&gt;scaled its organisation using the microservice pattern&lt;/a&gt;, the company has experienced steady growth across multiple dimensions: in the number of users, in the technology landscape and in the number of teams involved in building and running systems. Today, Zalando is a leading European online fashion retailer. It is critical that our architecture is robust enough to withstand challenges and uncertainties while teams innovate and experiment with new ideas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overhead by microworld.&lt;/strong&gt; &lt;a href="https://engineering.zalando.com/tags/microservices2.html"&gt;Microservices&lt;/a&gt; became a design style for us to define system architectures, purify core business concepts, evolve solutions in parallel, make things look uniform, and implement stable and consistent interfaces across systems. Our engineering teams independently design, build and operate multiple microservices. Often, microservices are implemented with a datastore following &lt;a href="https://microservices.io/patterns/data/database-per-service.html"&gt;the design pattern – database per service&lt;/a&gt;, where each service deploys its own database instances. The &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Zalando TechRadar&lt;/a&gt; guides teams about the database selection and their deployment options – AWS RDS with Postgres as one of the available options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hidden costs by toil.&lt;/strong&gt; Operating a swarm of small databases at company scale quickly gets tough. Complex anomaly detection tasks, such as byzantine failures or issues with SQL statements, take a noticeable investment all over the place. A combination of manual processes and ad-hoc scripts to manage the health of database instances is not an option at scale. It became increasingly time-consuming and error-prone, and some teams had to allocate engineers for sprints or even months to such activities.&lt;/p&gt;
&lt;p&gt;Standardisation is one of the factors that reduces this complexity. It is well known that if teams use the same frameworks or design patterns, then making changes at scale becomes easier. The same concept extends to the operations domain. We have limited the fragmentation by providing stronger guidelines to our engineers on what metrics to observe from datastore components.&lt;/p&gt;
&lt;p&gt;We have developed a methodology on how to detect anomalies with AWS RDS workload through 12 “golden signals”. We also decided to release an open-source command line utility (https://github.com/zalando/rds-health) to help automate and streamline detection of anomalies and performance issues. The utility provides a consistent and repeatable way to automatically analyse database metrics, reducing the risk of errors and improving overall efficiency.&lt;/p&gt;
&lt;h3&gt;12 Golden Signals&lt;/h3&gt;
&lt;p&gt;Setting up and operating high-performing databases requires observability of a large variety of signals across multiple buckets: CPU, Memory, Disk and Workload. Thanks to past incidents and empirical observations, we have reduced complexity so that only a few signals from each of these buckets need to be analysed to reach a reliable conclusion about the health status of database instances. This is how we got twelve golden signals.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;C1: CPU Utilisation&lt;/strong&gt; &lt;code&gt;os.cpuUtilization.total&lt;/code&gt; - typical database workloads are bound to memory or storage, high CPU is an anomaly that requires further investigation. Our past experience advises us that CPU utilisation over 40% - 60% on database instances eventually leads to incidents.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;C2: CPU Await&lt;/strong&gt; &lt;code&gt;os.cpuUtilization.await&lt;/code&gt; - the Linux kernel uses the await metric to report the time spent waiting for IO requests, measured from the start of a request to its completion. A high value indicates that a database instance is bound to the IO bandwidth of storage. Similar to the previous metric, we have concluded that any value above 5 - 10% eventually leads to an incident.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;M1: Swapped In from disk&lt;/strong&gt; &lt;code&gt;os.swap.in&lt;/code&gt; - Swap is an extension of RAM onto the disk. The operating system swaps RAM pages to disk and back when there is not enough memory to run the workload. Any intensive swap activity indicates that the database instance is running low on memory. Considering that disk performance is an order of magnitude slower, any swap activity slows down the operating system and its applications.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;M2: Swapped Out to disk&lt;/strong&gt; &lt;code&gt;os.swap.out&lt;/code&gt; - See explanation above.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;D1: Storage Read IO&lt;/strong&gt; &lt;code&gt;os.diskIO.rdsdev.readIOsPS&lt;/code&gt; - Storage IO bandwidth is an essential resource for high-performing databases. It is required to align the IO bandwidth with the overall database workload so that there is enough bandwidth to handle workload. In the case of AWS RDS, the metric value shall be aligned with the storage configuration deployed for database instance. With the GP2 volume type, IOPS are provisioned by volume size, 3 IOPS per GB of storage with a minimum of 100 IOPS. The IO volume type has an explicit value defined at deployment time. Note that a very low value shows that the entire dataset is served from memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;D2: Storage Write IO&lt;/strong&gt; &lt;code&gt;os.diskIO.rdsdev.writeIOsPS&lt;/code&gt; - See explanation above. Also note that a high number shows that the workload is write-mostly and potentially bound to the IO capacity of storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;D3: Storage IO Latency&lt;/strong&gt; &lt;code&gt;os.diskIO.rdsdev.await&lt;/code&gt; - Overall performance of storage is a function of its IO bandwidth and its latency. The latency metric reflects the time spent by the storage to load data blocks into memory. High storage latency implies higher latency for the application workload on the database. Our empirical observations show that storage latency above 10 ms eventually leads to an incident, and latency above 5 ms impacts application SLOs. A typical storage latency for database systems should be less than 4 - 5 ms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P1: Cache Hit Ratio&lt;/strong&gt; &lt;code&gt;db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read)&lt;/code&gt; - Databases read and write application data in blocks. The number of blocks read by the database from physical storage has to be aligned with the storage IO bandwidth provisioned to the database instance. The database caches these blocks in memory to optimise application performance. When clients request data, the database checks the cache, and if the relevant data is not there it has to read it from disk, so queries become slower. Any value below 80% shows that the database has an insufficient amount of shared buffers or physical RAM: the data required for the most frequently called queries does not fit into memory, and the database has to read it from disk.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P2: Blocks Read Latency&lt;/strong&gt; &lt;code&gt;db.IO.blk_read_time&lt;/code&gt; - The metric reflects the time used by the database to read blocks from the storage. High latency on the storage implies a high latency of application workload. We have observed an impact on SLOs when the latency has grown above 10 ms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P3: Database Deadlocks&lt;/strong&gt; &lt;code&gt;db.Concurrency.deadlocks&lt;/code&gt; - Number of deadlocks detected in this database. Ideally, it should be 0. The application schema and IO logic require evaluation if the number is high.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P4: Database Transactions&lt;/strong&gt; &lt;code&gt;db.Transactions.xact_commit&lt;/code&gt; - Number of transactions executed by the database. A low number indicates that the database instance is on standby.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P5: SQL Efficiency&lt;/strong&gt; &lt;code&gt;db.SQL.tup_fetched / db.SQL.tup_returned&lt;/code&gt; - SQL efficiency shows the percentage of rows fetched by the client vs rows returned from the storage. The metric does not necessarily show any performance issue with the database, but a high ratio of returned vs fetched rows should raise the question of optimising SQL queries, schema or indexes. For example, if you run &lt;code&gt;select count(*) from million_row_table&lt;/code&gt;, one million rows will be returned, but only one row will be fetched.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
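&lt;p&gt;To make a few of the signals above concrete, here is a minimal Go sketch that turns them into checks. It is not how &lt;code&gt;rds-health&lt;/code&gt; is implemented: the &lt;code&gt;Sample&lt;/code&gt; type, its field names and the function names are assumptions made for illustration, and the limits simply take the upper bounds quoted in the list (60% CPU, 10% await, 10 ms storage latency, 80% cache hit ratio). The GP2 helper only encodes the "3 IOPS per GB, minimum 100" rule and ignores any other volume limits.&lt;/p&gt;

```go
package main

import "fmt"

// Sample holds averaged values for a few of the golden signals.
// The field names are illustrative and are not the rds-health data model.
type Sample struct {
	CPUTotalPct      float64 // C1: os.cpuUtilization.total, percent
	CPUAwaitPct      float64 // C2: os.cpuUtilization.await, percent
	StorageLatencyMs float64 // D3: os.diskIO.rdsdev.await, milliseconds
	BlksHit          float64 // db.Cache.blks_hit
	BlksRead         float64 // db.IO.blk_read
	Deadlocks        float64 // P3: db.Concurrency.deadlocks
}

// CacheHitRatio implements P1: blks_hit / (blks_hit + blk_read), in percent.
// It returns 100 when no blocks were requested at all.
func CacheHitRatio(hit, read float64) float64 {
	total := hit + read
	if total == 0 {
		return 100
	}
	return 100 * hit / total
}

// GP2BaselineIOPS returns 3 IOPS per GB with a floor of 100 IOPS.
func GP2BaselineIOPS(sizeGB int) int {
	iops := 3 * sizeGB
	if 100 > iops {
		return 100
	}
	return iops
}

// Check returns one finding per violated signal, using the upper bounds
// quoted in the article as hard limits.
func Check(s Sample) []string {
	var findings []string
	if s.CPUTotalPct > 60 {
		findings = append(findings, "C1: CPU utilisation above 60%")
	}
	if s.CPUAwaitPct > 10 {
		findings = append(findings, "C2: CPU await above 10%")
	}
	if s.StorageLatencyMs > 10 {
		findings = append(findings, "D3: storage latency above 10 ms")
	}
	if 80 > CacheHitRatio(s.BlksHit, s.BlksRead) {
		findings = append(findings, "P1: cache hit ratio below 80%")
	}
	if s.Deadlocks > 0 {
		findings = append(findings, "P3: deadlocks detected")
	}
	return findings
}

func main() {
	s := Sample{CPUTotalPct: 72, CPUAwaitPct: 3, StorageLatencyMs: 4, BlksHit: 700, BlksRead: 300}
	for _, f := range Check(s) {
		fmt.Println(f)
	}
	fmt.Println("gp2 baseline for 20 GB:", GP2BaselineIOPS(20))
}
```

&lt;p&gt;In practice the thresholds would likely be tuned per workload; the point of the sketch is only that each signal reduces to a small, repeatable comparison.&lt;/p&gt;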
&lt;h3&gt;Open Source Command Line Utility&lt;/h3&gt;
&lt;p&gt;AWS offers a wide range of observability solutions for AWS RDS such as AWS CloudWatch, AWS Performance Insights and others. These off-the-shelf solutions help anyone with setting up alerts and debugging anomalies when one of twelve golden signals is violated. We are only missing an efficient utility to holistically observe the status of the entire AWS RDS fleet in your account with “a single click of the button”.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot rds-health utility" src="https://engineering.zalando.com/posts/2024/02/images/rds-health-screenshot.png#center"&gt;&lt;/p&gt;
&lt;p&gt;This is how the &lt;a href="https://github.com/zalando/rds-health"&gt;&lt;code&gt;rds-health&lt;/code&gt;&lt;/a&gt; utility was born. It conducts analysis of AWS RDS instances using time-series metrics collected by AWS Performance Insights. Actually, the utility is a frontend for AWS APIs that simply automates analysis of discussed golden signals across your accounts and regions. The utility can be easily customised to meet specific use cases, allowing users to tailor their workflows to their unique needs. Some of the key features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Show configuration of all AWS RDS instances and clusters;&lt;/li&gt;
&lt;li&gt;Check health of all AWS RDS deployments;&lt;/li&gt;
&lt;li&gt;Conduct capacity planning for your AWS RDS deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Check out our open source project at https://github.com/zalando/rds-health. It guides you through simple installation and configuration steps together with tutorials about its features. We are looking forward to hearing your feedback and suggestions for improvement. Please raise &lt;a href="https://github.com/zalando/rds-health"&gt;an issue on the project&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Our objective is to reduce complexity by limiting fragmentation within our engineering ecosystem and by equipping teams with engineering and operational guidelines. The methodology discussed here, detecting anomalies in AWS RDS workloads through 12 “golden signals”, is one example of how we tackle complexity at Zalando.&lt;/p&gt;
&lt;p&gt;Standardisation is not only about guidelines but also about automating repetitive tasks, freeing up time for more creative and strategic work. We are happy to share our learnings and approach to observing AWS RDS instances at scale with the open source community through this utility. Apply these learnings within your teams.&lt;/p&gt;
&lt;p&gt;If you have any questions about our methodology or open source utility &lt;code&gt;rds-health&lt;/code&gt; itself, please raise &lt;a href="https://github.com/zalando/rds-health"&gt;an issue on the project&lt;/a&gt;. Contributions are welcomed and encouraged!&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="SRE"/><category term="AWS"/><category term="PostgreSQL"/><category term="Backend"/></entry><entry><title>Paper Announcement: Joint Order Selection, Allocation, Batching and Picking for Large Scale Warehouses</title><link href="https://engineering.zalando.com/posts/2024/01/paper-warehouse-order-batching.html" rel="alternate"/><published>2024-01-29T00:00:00+01:00</published><updated>2024-01-29T00:00:00+01:00</updated><author><name>Julius Pätzold</name></author><id>tag:engineering.zalando.com,2024-01-29:/posts/2024/01/paper-warehouse-order-batching.html</id><summary type="html">&lt;p&gt;Sharing our latest research paper on warehouse order batching.&lt;/p&gt;</summary><content type="html">&lt;p&gt;We, as the Zalando team BART, are excited to share our latest research paper, describing the optimization problem of order batching and picking in Zalando's warehouses. In this paper (preprint available on &lt;a href="https://arxiv.org/abs/2401.04563"&gt;arxiv&lt;/a&gt;), we formally introduce our proposed order batching problem and provide benchmark instances, two baseline algorithms, and a solution validation tool, all made publicly available on &lt;a href="https://github.com/zalandoresearch/batching-benchmarks/"&gt;GitHub&lt;/a&gt;. Our goal is to provide insights to the research community on planning and optimizing the warehouse order picking process in large-scale warehouses, such as Zalando's.&lt;/p&gt;
&lt;h3&gt;The Underlying Optimization Problem&lt;/h3&gt;
&lt;p&gt;Zalando Tech Logistics is responsible for creating the software that manages all Zalando warehouses and their processes. Team BART, part of Zalando's Logistics Algorithms department, provides the decision-making algorithms for order batching and picking. These decisions can be broken down into four parts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Order Selection: Which customer orders are processed next?&lt;/li&gt;
&lt;li&gt;Item Allocation: Which warehouse items are used to fulfill a selected order?&lt;/li&gt;
&lt;li&gt;Batching: Which selected orders are picked together?&lt;/li&gt;
&lt;li&gt;Picking: How are batches split up into pick tours?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Traditionally, these decision problems are considered individually and solved using simplified rules. For example, order selection could be done using a first-in-first-out approach. However, our experience and analysis of batching algorithms have shown that a purely sequential approach is far from optimal. While there has been some research on these problems in the literature, there is no closed formulation, to the best of our knowledge, that encapsulates all four problems into one. And this is exactly what we aim to achieve with our paper: We combine all of the four problems into one, named Joint Order Selection, Allocation, Batching and Picking.&lt;/p&gt;
&lt;h3&gt;Benchmark Instances&lt;/h3&gt;
&lt;p&gt;To ensure a clear understanding of the problem statement, we provide benchmark instances for the Joint Order Selection, Allocation, Batching, and Picking Problem. These instances allow anybody interested to immediately try out their ideas for solving this problem. Additionally, we share the implementation of two baseline algorithms described in the paper.&lt;/p&gt;
&lt;h3&gt;Outlook&lt;/h3&gt;
&lt;p&gt;We aim to stimulate academic discussion around the Joint Order Selection, Allocation, Batching, and Picking Problem. We believe there are practitioners and researchers interested in this type of optimization problem. By providing benchmark instances, we hope to establish a standard definition that can be easily adapted for further research.&lt;/p&gt;
&lt;p&gt;Publishing this problem formulation also allows us to share insights on how we are solving this problem at Zalando. We look forward to sharing more in our next publication. In the meantime, we welcome any feedback and collaboration from the community: Feel free to share your feedback via &lt;a href="https://github.com/zalandoresearch/batching-benchmarks/discussions"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Operations Research"/><category term="Logistics"/><category term="Zalando Science"/><category term="Backend"/><category term="Machine Learning"/></entry><entry><title>Tale of 'metadpata': the revenge of the supertools</title><link href="https://engineering.zalando.com/posts/2024/01/tale-of-metadpata-the-revenge-of-the-supertools.html" rel="alternate"/><published>2024-01-23T00:00:00+01:00</published><updated>2024-01-23T00:00:00+01:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2024-01-23:/posts/2024/01/tale-of-metadpata-the-revenge-of-the-supertools.html</id><summary type="html">&lt;p&gt;One day in November 2022, we brought down our shop with a single character. This post recaps on the lessons we learned from this incident.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="this is fine meme" src="https://engineering.zalando.com/posts/2024/01/images/this-is-fine.jpg#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;The perfect storm&lt;/h2&gt;
&lt;p&gt;In the midst of Cyber Week preparation in November 2022, a colleague DMed me with a request to quickly join a call. To my surprise, as I was anticipating a 1:1 call, I was greeted by a message indicating that 60+ others were in the call as well. It turned out that I was just about to join an incident response call for what later became known internally as the "metadpata" incident.&lt;/p&gt;
&lt;p&gt;In the call, a group of colleagues was trying to put the jigsaw pieces together, analyzing why a large number of DNS entries across our AWS accounts had suddenly been removed, causing our shop to effectively go offline for our customers. Additionally, all of us except for the cloud infrastructure team were locked out of AWS accounts and internal tools due to missing DNS entries, making the incident response difficult. In short – the classic DNS incident that you may be familiar with from other write-ups. Some helpful and lucky souls hastily started to copy their cached DNS entries before they expired. It was an all-hands-on-deck situation with everyone focused on the single goal of restoring service for our customers ASAP. What followed in the incident call was a controlled disaster recovery with colleagues manually restoring DNS entries starting with essential tooling, followed by core infrastructure, and the services powering our on-site experiences to restore service for our customers.&lt;/p&gt;
&lt;p&gt;How was it possible that the DNS entries across multiple accounts suddenly disappeared? The Pull Request that triggered the event was aimed at adjusting YAML configuration for our infrastructure. However, apart from changing configuration for a test account, it also contained a "p" character in one of the configuration fields called "metadata" transforming it into "metadpata". Yet, why was this single character so powerful and destructive?&lt;/p&gt;
&lt;h2&gt;Enter supertools&lt;/h2&gt;
&lt;p&gt;We coined the term &lt;em&gt;supertools&lt;/em&gt; when working on the Post Mortem for the incident. These are applications or scripts that have the ability to execute large-scale changes across the infrastructure. Initially well intentioned as daemons automating creation of resources and implementing various stages of their lifecycle, they also perform cleanup operations that result in removal of resources. The latter operation, typically used for cleanup of resources that are to be decommissioned, easily becomes subject to cost optimization. As part of cost-saving measures, the pacing of deletion operations had been sped up.&lt;/p&gt;
&lt;p&gt;The tool processing the configuration with the unfortunate typo is responsible for setting up AWS accounts. It is a background job that parses the configuration and computes the operations that are to be executed on each affected account. It uses the &lt;code&gt;metadata&lt;/code&gt; object to calculate the accounts to work on. The typo caused the configuration to be interpreted as "no accounts", which in turn was treated as equivalent to the situation where all accounts are to be decommissioned. The deletion process was triggered and it managed to delete hosted zones containing DNS entries, which triggered the incident. Luckily, the deletion process ran into an error while performing the deletion operations, reducing the scope of the incident and the disaster recovery required.&lt;/p&gt;
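&lt;p&gt;The failure mode described above, where an empty account set is read as "decommission everything", can be guarded against explicitly. The following Go sketch is not the code of our tool; &lt;code&gt;PlanDeletions&lt;/code&gt;, &lt;code&gt;ErrNoAccounts&lt;/code&gt; and the account names are hypothetical and only illustrate the idea of refusing to plan deletions when the desired state is empty.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoAccounts signals that the desired configuration lists no accounts.
var ErrNoAccounts = errors.New("configuration lists no accounts; refusing to plan deletions")

// PlanDeletions returns the accounts that exist but are no longer desired.
// An empty desired set is rejected instead of being read as "delete everything".
func PlanDeletions(desired, existing []string) ([]string, error) {
	if len(desired) == 0 {
		return nil, ErrNoAccounts
	}
	keep := make(map[string]bool, len(desired))
	for _, id := range desired {
		keep[id] = true
	}
	var toDelete []string
	for _, id := range existing {
		if !keep[id] {
			toDelete = append(toDelete, id)
		}
	}
	return toDelete, nil
}

func main() {
	existing := []string{"shop", "infra", "test"}

	// A mistyped key leaves the desired list empty after parsing.
	if _, err := PlanDeletions(nil, existing); err != nil {
		fmt.Println("aborted:", err)
	}

	del, _ := PlanDeletions([]string{"shop", "infra"}, existing)
	fmt.Println("to delete:", del)
}
```

&lt;p&gt;The design choice is to make "nothing configured" an error state of its own, so that a parsing problem can never be mistaken for an intentional request to remove resources.&lt;/p&gt;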
&lt;h2&gt;Incident response&lt;/h2&gt;
&lt;p&gt;While our incident response culture is well established, this incident tested it to its full extent. In an all hands on deck situation, the cloud infrastructure team was focused on disaster recovery, organized via an incident call. Through an incident chat room, our colleagues were reporting the impact they still observed and reported on the progress of recovery in their clusters. The Incident Commanders focused on determining the approach and priority of the recovery efforts as well as on facilitating the communication between the chatroom and the incident call. Throughout the incident response we switched the Incident Commanders according to their areas of expertise which kept the incident response focused and efficient.&lt;/p&gt;
&lt;h2&gt;Post Mortem&lt;/h2&gt;
&lt;p&gt;Through great collaboration across teams to recover the needed DNS entries and restore service for our customers, we were back online in a few hours. As the first incident of its kind and with a large-scale impact for our customers, it got high attention across the organization. Predictably, this overloaded Google Docs, which limits the number of concurrent editors on a document, and got in the way of those working on the Post Mortem. To reduce the likelihood of this happening again, we've changed all links to Post Mortem documents shared with big audiences to use the &lt;code&gt;/preview&lt;/code&gt; URL by default.&lt;/p&gt;
&lt;p&gt;Being close to the start of Cyber Week the focus for the team was to complete the Post Mortem analysis work and decide upon immediate actions to prevent a similar incident from happening. This included pausing changes to the configuration, a review of all supertools in place, and temporary deactivation of the relevant deletion processes. We also wrote a 1-pager summary of the incident and shared it proactively with the whole organization to keep everyone informed about the types of action items scheduled short- and mid-term as agreed during an Incident Review.&lt;/p&gt;
&lt;h2&gt;Infrastructure changes&lt;/h2&gt;
&lt;p&gt;An important and often vigorously discussed part of Post Mortems are the action items aimed at preventing recurrence of the incident. In our case, we analyzed how infrastructure changes are reviewed and rolled out a number of improvements with the aim of improving the validation and reducing the blast radius of infrastructure changes that go wrong. We will focus on the most impactful changes that were implemented.&lt;/p&gt;
&lt;h3&gt;Account lifecycle management changes&lt;/h3&gt;
&lt;p&gt;We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning. This acts as a final "scream test" to make sure there are no more dependencies on this account.&lt;/p&gt;
&lt;p&gt;Having assessed the trade-offs and risks for deletion of resources, we have additionally decided to be more careful with deletion of resources that have low cost savings potential compared to the impact a wrong deletion could have. These changes are now done manually and take a longer time to complete, an acceptable trade-off to reduce the risk. To mitigate the potential cost increase, we monitor the account costs for the previous 7 days. If they exceed a certain threshold, we look at deleting the resources manually.&lt;/p&gt;
&lt;h3&gt;Change validation&lt;/h3&gt;
&lt;p&gt;We've introduced a series of validation steps, for example stringent checks for the presence of mandatory keys and the preview of all stack templates using &lt;a href="https://github.com/aws-cloudformation/cfn-lint"&gt;AWS CloudFormation Linter&lt;/a&gt; before they get deployed.&lt;/p&gt;
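&lt;p&gt;As an illustration of what a stringent check for mandatory keys can look like, here is a minimal Go sketch. It is an assumption-laden stand-in rather than our actual validation: it uses JSON and the standard library instead of YAML, and the &lt;code&gt;Config&lt;/code&gt; type and &lt;code&gt;ParseStrict&lt;/code&gt; function are hypothetical. The relevant technique is strict decoding, where unknown keys are rejected rather than silently ignored, combined with an explicit check that mandatory keys are present.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Config is a hypothetical, much simplified configuration document.
type Config struct {
	Metadata map[string]string `json:"metadata"`
	Accounts []string          `json:"accounts"`
}

// ParseStrict decodes raw and fails on unknown keys or missing mandatory keys.
func ParseStrict(raw string) (*Config, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // a typo such as "metadpata" becomes an error
	cfg := new(Config)
	if err := dec.Decode(cfg); err != nil {
		return nil, err
	}
	if cfg.Metadata == nil {
		return nil, errors.New("mandatory key \"metadata\" is missing")
	}
	return cfg, nil
}

func main() {
	_, err := ParseStrict(`{"metadpata": {"owner": "team-a"}, "accounts": ["test"]}`)
	fmt.Println("typo:", err)

	cfg, err := ParseStrict(`{"metadata": {"owner": "team-a"}, "accounts": ["test"]}`)
	fmt.Println("valid:", cfg.Accounts, err)
}
```

&lt;p&gt;With a lenient decoder, the first document would parse without complaint and simply leave the metadata empty, which is the kind of silent degradation that strict checks are meant to surface.&lt;/p&gt;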
&lt;p&gt;Also, we have set up jsonschema validation for all our configuration files. All these checks run both locally (thanks to pre-commit hooks) and in the CI/CD pipelines. We also made some small quality-of-life improvements to enable autocompletion and schema validation in our local IDEs, which reduces the possibility of typos and errors and is &lt;a href="https://developers.redhat.com/blog/2020/11/25/how-to-configure-yaml-schema-to-make-editing-files-easier#yaml_schema"&gt;simple to set up&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# yaml-language-server: $schema=schema/config_schema.json&lt;/span&gt;
&lt;span class="l l-Scalar l-Scalar-Plain"&gt;(your config)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Additionally, for creation/decommissioning of critical resources, we have introduced several automated quality checks which ensure that the change corresponds to the user request and the pull request description. These checks also require additional approval from the respective account or cost center owners and validation from the respective managers. The checks are implemented as a GitHub bot that comments on the Pull Request and blocks the merge until all the checks are validated.&lt;/p&gt;
&lt;h3&gt;Change previews&lt;/h3&gt;
&lt;p&gt;We have implemented automated previews in the Pull Request comments. This feature leverages the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html"&gt;AWS CloudFormation "ChangeSet" feature&lt;/a&gt;. When an updated CF stack template is provided to the CloudFormation "CreateChangeSet" endpoint, CloudFormation generates a json preview of the changes, which then can be executed or rejected. We read this ChangeSet from each account in our AWS Organization and merge them to create a human readable preview of changes in a PR comment. After the preview is created, the ChangeSet is dropped.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Preview of changes in Pull Requests" src="https://engineering.zalando.com/posts/2024/01/images/cf-preview-gh.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Preview of changes in Pull Requests&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
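&lt;p&gt;The merge step behind such a preview can be sketched in a few lines of Go. This is a simplified illustration, not our bot: the &lt;code&gt;Change&lt;/code&gt; type is a reduced, hypothetical view of a change set entry, the account names are made up, and fetching the change sets from CloudFormation is left out entirely. The sketch only shows how identical changes from many accounts can be grouped into one readable comment.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Change is a reduced, hypothetical view of one entry in a change set.
type Change struct {
	Action     string // e.g. "Add", "Modify", "Remove"
	ResourceID string // logical resource ID in the stack template
}

// RenderPreview groups identical changes across accounts and renders one
// Markdown list, so reviewers see each change once with the affected accounts.
func RenderPreview(perAccount map[string][]Change) string {
	accountsByChange := map[Change][]string{}
	for account, changes := range perAccount {
		for _, c := range changes {
			accountsByChange[c] = append(accountsByChange[c], account)
		}
	}
	lines := make([]string, 0, len(accountsByChange))
	for c, accounts := range accountsByChange {
		sort.Strings(accounts)
		lines = append(lines, fmt.Sprintf("- %s `%s` in %s", c.Action, c.ResourceID, strings.Join(accounts, ", ")))
	}
	sort.Strings(lines) // map iteration order is random; sort for stable output
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(RenderPreview(map[string][]Change{
		"test-1": {{Action: "Modify", ResourceID: "HostedZone"}},
		"test-2": {{Action: "Modify", ResourceID: "HostedZone"}, {Action: "Remove", ResourceID: "AuditRole"}},
	}))
}
```

&lt;p&gt;Grouping by change rather than by account keeps the comment short when the same template update touches many accounts, and makes an unexpected removal stand out.&lt;/p&gt;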
&lt;h3&gt;Phased rollout&lt;/h3&gt;
&lt;p&gt;Our Kubernetes cluster rollout already included a phased rollout to different groups of clusters. This idea was extended to our AWS infrastructure. The rollout process adopted by our tooling now includes gradual rollout to different release channels, each associated with a few AWS account categories (e.g. playground, test, infra). All changes must go through all release channels before getting to production. This approach allows us to gradually deploy changes to different accounts, ensuring a more controlled propagation that catches errors early on with a limited blast radius. The trade-off here is of course that the rollout takes a longer time.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Supertools never sleep (unless you program them otherwise!). They're powerful yet often misjudged in review processes as they're expected to only trigger action in the scope of expected changes. As our story shows, this is highly dependent on the implementation and it's highly important to implement additional safety nets in the processes and tooling. We hope that the examples of changes we've implemented in our infrastructure will help you reflect and improve mechanisms in your own context.&lt;/p&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Culture"/><category term="Backend"/></entry><entry><title>Using modules for Testcontainers with Golang</title><link href="https://engineering.zalando.com/posts/2023/12/using-modules-for-testcontainers-with-golang.html" rel="alternate"/><published>2023-12-19T00:00:00+01:00</published><updated>2023-12-19T00:00:00+01:00</updated><author><name>Fabien Pozzobon</name></author><id>tag:engineering.zalando.com,2023-12-19:/posts/2023/12/using-modules-for-testcontainers-with-golang.html</id><summary type="html">&lt;p&gt;In this post, we explain how to use modules for Testcontainers with Golang and how to fix common issues.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Testcontainers with Go" src="https://engineering.zalando.com/posts/2023/12/images/go-test-containers.jpg#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/testcontainers/testcontainers-go"&gt;Testcontainers for Go&lt;/a&gt; enables developers to easily run tests against containerized dependencies. In our previous articles, you can find &lt;a href="https://engineering.zalando.com/posts/2021/02/integration-tests-with-testcontainers.html"&gt;an introduction to Integration tests with Testcontainers&lt;/a&gt;
and &lt;a href="https://engineering.zalando.com/posts/2022/04/functional-tests-with-testcontainers.html"&gt;explore how to write Functional tests with Testcontainers&lt;/a&gt; (in Java).&lt;/p&gt;
&lt;p&gt;This blog post dives deep into how to use modules and how to fix a common issue with Testcontainers for Go.&lt;/p&gt;
&lt;h3&gt;What do we use it for?&lt;/h3&gt;
&lt;p&gt;Services often use external dependencies like datastores or queues.
It is possible to mock these dependencies, but if you want to run integration tests, for example, it is better to verify against the real dependency (or something close enough).&lt;/p&gt;
&lt;p&gt;Starting a container with the image of the dependency is a convenient way to verify that the application works as expected.
With Testcontainers, starting the container is done programmatically so that you can define it as part of your tests. The machine running the tests (developer machine, CI/CD) needs a container runtime (e.g. Docker, Podman).&lt;/p&gt;
&lt;h2&gt;Basic implementation&lt;/h2&gt;
&lt;p&gt;Testcontainers for Go is very easy to use. &lt;a href="https://golang.testcontainers.org/quickstart/#3-spin-up-redis"&gt;The quick start example&lt;/a&gt; is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;testcontainers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ContainerRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;redis:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ExposedPorts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;6379/tcp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;WaitingFor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ForLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Ready to accept connections&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;redisC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;testcontainers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GenericContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;testcontainers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GenericContainerRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ContainerRequest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Started&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;redisC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Terminate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we dive into the code above, we notice that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;testcontainers.ContainerRequest&lt;/code&gt; initialises a struct with the container image, exposed port and waiting strategy&lt;/li&gt;
&lt;li&gt;&lt;code&gt;testcontainers.GenericContainer&lt;/code&gt; starts the container and returns it together with an error value&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redisC.Terminate&lt;/code&gt; is deferred so that the container is terminated once the test is done&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Implementing our own internal library&lt;/h2&gt;
&lt;p&gt;The example in the previous section has some minor inconveniences:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;wait.ForLog("Ready to accept connections")&lt;/code&gt; relies on a log message to detect that the container has started, which can break easily&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExposedPorts: []string{"6379/tcp"}&lt;/code&gt; requires knowledge of the exposed port for Redis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Running a Redis container may also call for additional environment variables and other parameters, which require deeper knowledge.
As such, we decided to create an internal library that initialises a container with the default parameters required, to ease test implementation.
To remain flexible, we used the &lt;a href="https://golang.cafe/blog/golang-functional-options-pattern.html"&gt;Functional Options Pattern&lt;/a&gt; so that consumers can still customise the container as needed.&lt;/p&gt;
&lt;p&gt;Example of implementation for Redis:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;defaultPreset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Option&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Option&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;6379/tcp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithGetURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;localhost:&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;redis&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithWaitingStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Strategy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ForAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NewHostPortStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nx"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ForLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Ready to accept connections&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// New - create a new container able to run redis&lt;/span&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Option&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Container&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;defaultPreset&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Start - start a Redis container and return a container.CreatedContainer&lt;/span&gt;
&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Option&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CreatedContainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CreatedContainer&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
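&lt;p&gt;The &lt;code&gt;container&lt;/code&gt; package used above is internal and not shown in this post. As a rough illustration of the Functional Options Pattern behind it, here is a minimal, self-contained sketch. The &lt;code&gt;Container&lt;/code&gt; fields and the &lt;code&gt;ImageRef&lt;/code&gt; helper are assumptions made for this example, not the actual internal API:&lt;/p&gt;

```go
package main

import "fmt"

// Container holds the settings used to start a test container.
// The field names are assumptions for illustration only.
type Container struct {
	Image   string
	Version string
	Port    string
}

// Option mutates a Container; this is the core of the pattern.
type Option func(*Container)

// WithImage sets the image name.
func WithImage(image string) Option { return func(c *Container) { c.Image = image } }

// WithVersion sets the image tag.
func WithVersion(version string) Option { return func(c *Container) { c.Version = version } }

// WithPort sets the exposed port.
func WithPort(port string) Option { return func(c *Container) { c.Port = port } }

// defaultPreset returns the defaults a Redis test usually needs.
func defaultPreset() []Option {
	return []Option{WithImage("redis"), WithVersion("latest"), WithPort("6379/tcp")}
}

// New applies the defaults first and the caller's options last,
// so that callers can override any default.
func New(options ...Option) *Container {
	c := new(Container)
	for _, o := range append(defaultPreset(), options...) {
		o(c)
	}
	return c
}

// ImageRef returns the image reference in "image:version" form.
func (c *Container) ImageRef() string { return c.Image + ":" + c.Version }

func main() {
	c := New(WithVersion("7.2"))
	fmt.Println(c.ImageRef(), c.Port)
}
```

&lt;p&gt;Because the defaults are applied first and the caller's options last, &lt;code&gt;New(WithVersion("7.2"))&lt;/code&gt; keeps the default image and port and only overrides the version.&lt;/p&gt;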

&lt;p&gt;Usage of the library for Redis:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;cc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithVersion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With this internal library, developers could easily add tests for Redis without the need to figure out the waiting strategy, exposed port, etc.
In case of incompatibility, the internal library could be updated to centrally fix the issue.&lt;/p&gt;
&lt;h2&gt;Common issue - Garbage collector (Ryuk / Reaper)&lt;/h2&gt;
&lt;p&gt;Testcontainers goes the extra mile to ensure that the container is removed once the test is done by using a &lt;a href="https://golang.testcontainers.org/features/garbage_collector/#garbage-collector"&gt;Garbage Collector&lt;/a&gt;, an additional container started as a "sidecar".
This container is responsible for stopping the container being tested even if your test crashes (which would prevent &lt;code&gt;defer&lt;/code&gt; from running).&lt;/p&gt;
&lt;p&gt;With Docker, this works without problems, but with other container runtimes (like Podman) you will often get an error like this: &lt;code&gt;Error response from daemon: container create: statfs /var/run/docker.sock: permission denied: creating reaper failed: failed to create container&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One way to "fix this" is to deactivate the Garbage Collector with the environment variable &lt;code&gt;TESTCONTAINERS_RYUK_DISABLED=true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Another way is to run the Podman machine in rootful mode and set:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;TESTCONTAINERS_RYUK_CONTAINER_PRIVILEGED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;true&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# needed to run Reaper (alternative disable it TESTCONTAINERS_RYUK_DISABLED=true)&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/run/docker.sock&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# needed to apply the bind with statfs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In our internal library we took the approach of disabling it by default as developers had issues running it locally.&lt;/p&gt;
&lt;h2&gt;Moving to modules&lt;/h2&gt;
&lt;p&gt;Once our internal library was stable enough, we decided that it was time to give back to the community by contributing to Testcontainers.
But surprise... &lt;a href="https://golang.testcontainers.org/modules/"&gt;modules&lt;/a&gt; had just been introduced in Testcontainers.
Modules do exactly what our internal library was built for, so we migrated all our services to modules and discontinued the internal library.
From the migration, we learned that, now that modules have been introduced, the standard Testcontainers library can be used out of the box, which reduces the maintenance cost of our services.
The main challenge was to fine-tune the environment variables, using a Makefile, so that tests run on developer machines (that is, to make the Garbage Collector work).&lt;/p&gt;
&lt;p&gt;Example adapted from the &lt;a href="https://golang.testcontainers.org/modules/redis/#usage-example"&gt;Testcontainers documentation&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;redisContainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RunContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;testcontainers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WithImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;docker.io/redis:latest&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;redisContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Terminate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Testcontainers for Golang is a great library to support testing, and it is even better now that modules have been introduced. Some small impediments with the Garbage Collector exist, but they can be fixed easily as described in this post.&lt;/p&gt;
&lt;p&gt;I hope that this post will encourage you to adopt Testcontainers if you haven't already. It is highly recommended for improving the testability of your applications.&lt;/p&gt;</content><category term="Zalando"/><category term="Golang"/><category term="Docker"/><category term="Testing"/><category term="Backend"/></entry><entry><title>Migrating From Elasticsearch 7.17 to Elasticsearch 8.x: Pitfalls and Learnings</title><link href="https://engineering.zalando.com/posts/2023/11/migrating-from-elasticsearch-7-to-8-learnings.html" rel="alternate"/><published>2023-11-20T00:00:00+01:00</published><updated>2023-11-20T00:00:00+01:00</updated><author><name>Maryna Cherniavska</name></author><id>tag:engineering.zalando.com,2023-11-20:/posts/2023/11/migrating-from-elasticsearch-7-to-8-learnings.html</id><summary type="html">&lt;p&gt;With Elasticsearch, moving from one major version to another is a big jump. Usually, it is updated in gradual increments, minor to minor version. It is difficult to make a big move. There's no official step-by-step and usually, it just doesn't happen. So, how did we approach it? Read on to find out.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;What this article is about&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;What kind of changes we had to make to the codebase&lt;/li&gt;
&lt;li&gt;How we did the actual upgrade&lt;/li&gt;
&lt;li&gt;What challenges we faced&lt;/li&gt;
&lt;li&gt;How we did the data transfer&lt;/li&gt;
&lt;li&gt;How the data was kept in sync&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What this article is not&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;A step-by-step guide on how to upgrade Elasticsearch (read on to find out why).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Who we are&lt;/h2&gt;
&lt;p&gt;We are a team from the Search &amp;amp; Browse department, the department in Zalando that is responsible for all things search (read: relevance, personalisation, sorting, filters, full text search, ... in short, everything that forms the search experience). The search applications are using Elasticsearch as the main datastore, so we are also the ones responsible for its well-being.&lt;/p&gt;
&lt;h2&gt;Why upgrade&lt;/h2&gt;
&lt;p&gt;We have been using Elasticsearch for a long time. It was upgraded more or less on a regular basis, but we were always a bit behind the latest version (Elastic has a regular release schedule; the releases are all scheduled well in advance). We were on version 7.17 for a while, and while we were pretty happy with it, we still had a few reasons to upgrade to 8.x.&lt;/p&gt;
&lt;p&gt;First, we wanted to use the new features that were introduced in 8.0, namely &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#approximate-knn"&gt;approximate kNN (k nearest neighbors) search, or ANN search&lt;/a&gt;. Vector search was already used in Search &amp;amp; Browse, but it was &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#exact-knn"&gt;exact kNN search&lt;/a&gt;, the brute-force and less performant variant. Here is what Elastic says about approximate vs exact kNN search:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In most cases, you’ll want to use approximate kNN. Approximate kNN offers lower latency at the cost of slower indexing and imperfect accuracy.&lt;/p&gt;
&lt;p&gt;Exact, brute-force kNN guarantees accurate results but doesn’t scale well with large datasets. With this approach, a &lt;code&gt;script_score&lt;/code&gt; query must scan each matching document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html"&gt;query&lt;/a&gt; to limit the number of matching documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There is also a &lt;a href="https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0"&gt;great article about ANN on the Elastic blog
by Julie Tibshirani&lt;/a&gt; - read it, you won't regret it.&lt;/p&gt;
&lt;p&gt;Second, we also wanted to be on the latest version for performance and security reasons, since new releases typically bring security fixes and performance improvements.&lt;/p&gt;
&lt;h2&gt;Why it's difficult to upgrade&lt;/h2&gt;
&lt;p&gt;&lt;img alt="You don't just upgrade Elasticsearch" src="https://engineering.zalando.com/posts/2023/11/images/youcantjustupdate.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Boromir telling you that you don't just upgrade Elasticsearch&lt;/figcaption&gt;
&lt;p&gt;Usually, Elasticsearch is updated in gradual increments, minor to minor version, and it's difficult, not to mention dangerous, to make such a big move as going from one major version to another. Also, the documentation on the official website, while ample, is pretty disorganized, and there's no complete step-by-step guide for such an endeavor. And even if you were to gather all the information from the docs, it still would not be enough. You need to know what to do with your data, how to keep it in sync, and how to make sure that the new version is working as expected.&lt;/p&gt;
&lt;p&gt;At Zalando, the size of data is pretty massive. We have millions of articles in each country, and while the &lt;a href="https://en.zalando.de/women/"&gt;gender root page for women&lt;/a&gt; in Germany will show you 450k items, it's simply not the full picture. This number is just how many items at most get scanned to show you the first page. The actual number of items is much higher. And we currently have 28 domains (country + language combos), each with its own catalog. So in short, we have a lot of data, and we need to make sure that it's not lost or corrupted during the upgrade.&lt;/p&gt;
&lt;h2&gt;How we approached the upgrade&lt;/h2&gt;
&lt;p&gt;Another reason why one can't just go and upgrade Elasticsearch is because, well, it's not an island.&lt;/p&gt;
&lt;p&gt;What I mean is, it's not some independent entity that has a value all by itself. It's our datastore, and it's used by a lot of our services. So before one goes and upgrades this massive thing, one should think of possible breaking changes in the product. And also, one should think about how it changes the actual usage of Elasticsearch.&lt;/p&gt;
&lt;p&gt;The main search application in Zalando, the one that deals directly with Elasticsearch queries, is called Origami.
From the description on its (internal) repository page:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Origami is the Zalando Core Search API. It provides a powerful information retrieval language and engine that integrates several microservice components built by the Search Department. In the landscape of Zalando Search and Browse platform, Origami is the connector - coordinating all search intelligence to serve correct search results to customers.&lt;/p&gt;
&lt;p&gt;Origami builds on top of Elasticsearch and our internal/Zalando-specific suite of APIs. These APIs will facilitate composing/serving search and discovery, navigation, and analytics functionalities.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The application is written in Scala and was using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high.html"&gt;Java High Level REST Client, which was deprecated in Elasticsearch 7.15.0&lt;/a&gt; and replaced by the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/7.17/introduction.html"&gt;Elasticsearch Java API Client&lt;/a&gt;, so first of all, we had to update the codebase to use the new client.&lt;/p&gt;
&lt;h3&gt;Updating the codebase&lt;/h3&gt;
&lt;p&gt;However, updating the codebase was also not a one-step task. (This just goes deeper into the rabbit hole, doesn't it?)&lt;/p&gt;
&lt;p&gt;Origami has 443k lines of code in 846 files. Of course, a lot of these files are the configs and tests and test resources, so the actual number of Scala files is much lower. But still, it's a lot of code, and a lot of it is dealing with Elasticsearch.&lt;/p&gt;
&lt;p&gt;Upgrading the Elasticsearch API to be able to work with version 8.x also presented a choice. We could either use the official &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/8.6/migrate-hlrc.html"&gt;Elasticsearch Java API Client&lt;/a&gt;, or we could use the &lt;a href="https://github.com/sksamuel/elastic4s"&gt;Elasticsearch Scala client&lt;/a&gt;, which seemed to be quite popular and had a lot of contributors (and stars) on GitHub. Both options were available and viable. Both had their pros and cons.&lt;/p&gt;
&lt;p&gt;With the Elasticsearch Java API, the advantages would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The library is officially supported and its versions match the Elasticsearch releases;&lt;/li&gt;
&lt;li&gt;There is a ready-made DSL for all the REST APIs;&lt;/li&gt;
&lt;li&gt;It’s open source and the code is available on GitHub. The license is Apache License 2.0.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It’s in Java. This means that all the lambda types, collection types, etc. are not directly interoperable and special transformations should be done within our code;&lt;/li&gt;
&lt;li&gt;We’d miss out on other Scala advantages such as built-in immutability, null safety and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The unofficial Scala client is advertised as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Providing a type-safe, concise DSL;&lt;/li&gt;
&lt;li&gt;Integrating with standard Scala futures or other effects libraries;&lt;/li&gt;
&lt;li&gt;Using Scala collections library over Java collections;&lt;/li&gt;
&lt;li&gt;Returning &lt;code&gt;Option&lt;/code&gt; where the Java methods would return &lt;code&gt;null&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;Using Scala &lt;code&gt;Durations&lt;/code&gt; instead of strings/longs for time values;&lt;/li&gt;
&lt;li&gt;Supporting typeclasses for indexing, updating, and search backed by Jackson, Circe, Json4s, PlayJson and Spray Json implementations;&lt;/li&gt;
&lt;li&gt;Supporting Java and Scala HTTP clients such as Akka-Http;&lt;/li&gt;
&lt;li&gt;Providing reactive-streams implementation;&lt;/li&gt;
&lt;li&gt;Providing a testkit subproject ideal for tests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The disadvantages, however, could not be ignored:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It’s not official and the releases are not closely following Elastic’s release schedule. At the time we were looking at it, Elasticsearch was already at v8.7 and this library’s last version was 8.5.4. (It could work with Elasticsearch up to version 8.6 though);&lt;/li&gt;
&lt;li&gt;Because it did not implement all the new features, there was no DSL for kNN search. kNN search was still available by sending a raw JSON query, but it was not a pretty option.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the end, we decided to go with the Elasticsearch Java API client. The main reason was that it was officially supported and its releases closely followed the Elasticsearch releases, so it wouldn't just disappear into thin air in the unlikely event that its creator decided to quit. Also, it had a DSL for all the REST APIs. The absence of the kNN search DSL in the Scala library was really disappointing, because approximate kNN search was one of the main reasons why we wanted to upgrade in the first place.&lt;/p&gt;
&lt;p&gt;So, the choice was made.&lt;/p&gt;
&lt;p&gt;But.&lt;/p&gt;
&lt;p&gt;As I said before, this was a large application.&lt;/p&gt;
&lt;p&gt;How does one make sure that no existing functionality is going to break when upgrading the API? How does one make sure that all the existing queries are still going to work?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Obviously, you write a test.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Writing a test&lt;/h3&gt;
&lt;p&gt;There was one more decision that we made while selecting a migration strategy, and that was to start with &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/migrate-hlrc.html#_compatibility_mode_using_a_7_17_client_with_elasticsearch_8_x"&gt;compatibility mode&lt;/a&gt;. This meant that we would use the Elasticsearch High Level REST Client from version 7.x, but in compatibility mode, so that it would instruct Elasticsearch 8.x to behave like a 7.x server. This way we would be able to upgrade the Elasticsearch cluster first, and then upgrade the client gradually. With this approach, we would avoid rewriting too much code at once. And afterward, we would be able to use one of the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/client/java-api-client/current/migrate-hlrc.html#_transition_strategies"&gt;transition strategies, recommended by Elasticsearch, to gradually upgrade the client&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This approach was also a good fit because we expected a period during the transition when the application would have to deal with both Elasticsearch 7.x and Elasticsearch 8.x. Because our Elasticsearch was a multi-cluster deployment, it would be practically impossible to upgrade in one go. We would have to start with less mission-critical clusters, and then gradually move to the more important ones. So, we would definitely have to deal with both versions of Elasticsearch for some time.&lt;/p&gt;
&lt;p&gt;So how to write such a test?&lt;/p&gt;
&lt;p&gt;This is where &lt;a href="https://testcontainers.com/"&gt;Testcontainers&lt;/a&gt; shines. Basically, we had a helper object looking like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ESContainers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Version7179&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;7.17.9&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Version86&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;8.6.2&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Version88&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;8.8.2&amp;quot;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;VersionDefault&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Version7179&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;initAndStartESContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;VersionDefault&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ElasticsearchEndPoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ElasticsearchContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;s&amp;quot;docker.elastic.co/elasticsearch/elasticsearch:&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withReuse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withCreateContainerCmdModifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getHostConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withCapAdd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Capability&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SYS_CHROOT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hostAndPort&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getHttpHostAddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nc"&gt;ElasticsearchEndPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hostAndPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hostAndPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And then, in the test, we would just do this to start Elasticsearch with the version we needed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;lazy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ESContainers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;initAndStartESContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Version88&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Since at some point we'd have to deal with both versions of the API, we had to test three combinations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Elasticsearch 7.x with Elasticsearch 8.x API;&lt;/li&gt;
&lt;li&gt;Elasticsearch 8.x with Elasticsearch 8.x API;&lt;/li&gt;
&lt;li&gt;Elasticsearch 8.x with Elasticsearch 7.x API.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And with each, we needed to make sure that the common types of actions performed by the application continued to work as expected.&lt;/p&gt;
&lt;p&gt;So this is exactly what we did. We wrote three test classes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NewClientWithOldElasticTest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OldClientWithNewElasticTest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NewClientWithNewElasticTest&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Why is there no &lt;code&gt;OldClientWithOldElasticTest&lt;/code&gt;? Because we already knew that it was working. It was the application we already had.&lt;/strong&gt;&lt;/p&gt;
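&lt;p&gt;The three combinations can also be derived mechanically. The following is a minimal Scala sketch, not code from Origami: the &lt;code&gt;TestMatrix&lt;/code&gt; object, the &lt;code&gt;Combination&lt;/code&gt; case class and the version labels are illustrative assumptions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;object TestMatrix {
  // Illustrative labels only; they stand for the client API and cluster major versions.
  val versions: Seq[String] = Seq(&amp;quot;7.x&amp;quot;, &amp;quot;8.x&amp;quot;)

  final case class Combination(clientApi: String, cluster: String)

  // Every client/cluster pairing ...
  val all: Seq[Combination] =
    for {
      client &amp;lt;- versions
      cluster &amp;lt;- versions
    } yield Combination(client, cluster)

  // ... minus the old/old pairing, which the existing application already covered.
  val toTest: Seq[Combination] =
    all.filterNot(_ == Combination(&amp;quot;7.x&amp;quot;, &amp;quot;7.x&amp;quot;))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;TestMatrix.toTest&lt;/code&gt; contains exactly three entries, one per test class listed above.&lt;/p&gt;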
&lt;p&gt;Each class was checking that the application was able to do the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create an index;&lt;/li&gt;
&lt;li&gt;Create a document;&lt;/li&gt;
&lt;li&gt;Create kNN vector mappings;&lt;/li&gt;
&lt;li&gt;Index kNN vector data;&lt;/li&gt;
&lt;li&gt;Search for a document with a kNN query;&lt;/li&gt;
&lt;li&gt;Delete an index;&lt;/li&gt;
&lt;li&gt;Close the client.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tests were not covering all the queries that we ran - only the common types. But even with this simplified approach we were able to discover a few issues, for which we had to make changes to the codebase.&lt;/p&gt;
&lt;h3&gt;Issues discovered and fixes applied&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Elasticsearch 8 removed the &lt;code&gt;_type&lt;/code&gt; field from search responses, so we had to remove it from all the test case resources that represented example JSONs for the expected response.&lt;/li&gt;
&lt;li&gt;Elasticsearch 8 didn't allow null in the &lt;code&gt;is_write_index&lt;/code&gt; parameter when creating an alias for the index. Therefore, code was added to set this flag explicitly.&lt;/li&gt;
&lt;li&gt;Range queries on date/epoch_second fields &lt;a href="https://discuss.elastic.co/t/date-range-not-working-as-expected-between-elasticsearch-7-17-and-elasticsearch-8-6/328825"&gt;didn't work with upper/lower bounds specified as numbers&lt;/a&gt;. (According to the Elastic team, it was a feature and would not be fixed.) Because of that, the range boundaries had to be stringified before being passed to Elasticsearch.&lt;/li&gt;
&lt;li&gt;In Elasticsearch 8, a cluster setting called &lt;code&gt;action.destructive_requires_name&lt;/code&gt; now defaults to &lt;code&gt;true&lt;/code&gt; instead of &lt;code&gt;false&lt;/code&gt;. Since our e2e tests were dropping all test indexes by wildcard before starting, they all started crashing. So, a change was introduced to update this setting on the cluster and allow the test suites to run this action. The method that did it was only used in test suites, because on a real production cluster it would be pretty unsafe.&lt;/li&gt;
&lt;/ul&gt;
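&lt;p&gt;As an illustration of the range query fix, here is a minimal sketch of stringifying the bounds before they reach Elasticsearch. The &lt;code&gt;RangeBounds&lt;/code&gt; object and the &lt;code&gt;epochSecondRange&lt;/code&gt; helper are hypothetical names, and the JSON is assembled by hand only to keep the example free of dependencies; a real application would more likely build the query through its client library.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;object RangeBounds {
  // Hypothetical helper: renders a range clause whose epoch-second bounds
  // are sent as JSON strings rather than JSON numbers.
  def epochSecondRange(field: String, from: Long, to: Long): String =
    s&amp;quot;&amp;quot;&amp;quot;{&amp;quot;range&amp;quot;:{&amp;quot;$field&amp;quot;:{&amp;quot;gte&amp;quot;:&amp;quot;$from&amp;quot;,&amp;quot;lte&amp;quot;:&amp;quot;$to&amp;quot;}}}&amp;quot;&amp;quot;&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a made-up field name, &lt;code&gt;RangeBounds.epochSecondRange(&amp;quot;created_at&amp;quot;, 1700000000L, 1700003600L)&lt;/code&gt; produces &lt;code&gt;{&amp;quot;range&amp;quot;:{&amp;quot;created_at&amp;quot;:{&amp;quot;gte&amp;quot;:&amp;quot;1700000000&amp;quot;,&amp;quot;lte&amp;quot;:&amp;quot;1700003600&amp;quot;}}}&lt;/code&gt;.&lt;/p&gt;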
&lt;p&gt;Moreover, when we started to switch the other, more detailed integration tests to Elasticsearch 8, we found an issue that was a little more involved. Some of those tests started to fail with the following error:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;query_shard_exception&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;reason&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;it is mandatory to set the [nested] context on the nested sort field: [trace.origami.timestamp].&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;index_uuid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;_xvEa8gNSFyCDm0aFXqYhg&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;index&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;article_1&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That seemed to refer to the sort clause that we had in the e2e test suite:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;sort&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;trace.origami.timestamp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;desc&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The page about sorting on a nested field for ES 8.8 (current at that time) &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/sort-search-results.html#nested-sorting"&gt;says that there should be a path specified in a "nested.path" clause of the sort&lt;/a&gt;. However, the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/sort-search-results.html#nested-sorting"&gt;same page for ES 7.17 states exactly the same&lt;/a&gt;, yet on 7.17 the query still ran fine without that clause.&lt;/p&gt;
&lt;p&gt;So something changed between the versions in such a way that it started erroring out in ES8, whereas in ES7 it was working fine, despite the docs stating that the parameter is non-optional (&lt;a href="https://discuss.elastic.co/t/nested-sorting-differs-between-es7-and-es8/337904/2"&gt;the thread I created on ES discussion board suggests there was a bug and it was fixed&lt;/a&gt;). So, we had to add the &lt;code&gt;nested.path&lt;/code&gt; clause to the sort clauses in the queries that were sorting on nested fields, meaning that the sort clause from the example above would now look like this.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;quot;sort&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;trace.origami.timestamp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;order&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;desc&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;nested&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;path&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;trace&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Replacing deprecated Elasticsearch settings in preparation for the 8.x migration&lt;/h2&gt;
&lt;p&gt;Summary of changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-threadpool.html#fixed-auto-queue-size"&gt;Remove fixed_auto_queue_size thread pool&lt;/a&gt;. It’s replaced with the normal fixed thread pool configuration.&lt;/li&gt;
&lt;li&gt;Replace deprecated &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/breaking-changes-7.1.html#_deprecation_of_old_transport_settings"&gt;transport.tcp.compress&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Replace node role settings with new &lt;code&gt;node.roles&lt;/code&gt; settings (see &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/breaking-changes-7.9.html#breaking_79_settings_changes"&gt;one&lt;/a&gt; and &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-node.html#coordinating-only-node"&gt;two&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Due to &lt;a href="https://github.com/elastic/elasticsearch/issues/65577"&gt;an existing bug&lt;/a&gt;, the coordinating role needs to be set as a default which can in turn be overridden by setting the &lt;code&gt;node.roles&lt;/code&gt; environment variables with specific values.&lt;/li&gt;
&lt;li&gt;Remove deprecated &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/breaking-changes-7.7.html#deprecate-defer-cluster-recovery-settings"&gt;gateway.recover_after_master_nodes setting&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Add human approval to prevent upgrading master nodes before data nodes.&lt;/li&gt;
&lt;li&gt;Explicitly disable the serial GC using &lt;code&gt;-XX:-UseSerialGC&lt;/code&gt; to avoid the following error message during startup, which appeared even though &lt;code&gt;-XX:+UseZGC&lt;/code&gt; or &lt;code&gt;-XX:+UseG1GC&lt;/code&gt; was explicitly enabled:
&lt;pre&gt;&lt;code&gt;Error occurred during initialization of VM
Multiple garbage collectors selected&lt;/code&gt;&lt;/pre&gt;
Most likely an intermediate script was logging this message. In ES 8.x, the container can exit with a failure because of this error.&lt;/li&gt;
&lt;li&gt;Coordinating nodes are &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-node.html#coordinating-only-node"&gt;enabled by default by specifying an empty value&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Data nodes will only have &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-node.html#data-node"&gt;the “data” role defined&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Monitoring checks had to be updated because the role abbreviations changed and became stricter than before.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How we did the actual upgrade&lt;/h2&gt;
&lt;p&gt;Finally, it seemed that the application was prepared to work with a mix of Elasticsearch versions. At last, it was time to upgrade the Elasticsearch cluster itself.&lt;/p&gt;
&lt;p&gt;There is a &lt;a href="https://www.elastic.co/guide/en/elastic-stack/8.11/upgrading-elastic-stack.html#prepare-to-upgrade"&gt;documentation page&lt;/a&gt; with some advice about going from 7.x to 8.x, and it states that first, one should move to 7.17. From there, it is recommended to use the &lt;a href="https://www.elastic.co/guide/en/kibana/7.17/upgrade-assistant.html"&gt;Upgrade Assistant&lt;/a&gt; tool to help prepare for the upgrade. As an alternative, it is also recommended to use the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html"&gt;Reindex API&lt;/a&gt; to reindex the data from the old version to the new one.&lt;/p&gt;
&lt;p&gt;So in short, Elasticsearch provides two ways to upgrade:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html"&gt;rolling upgrade&lt;/a&gt; approach;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elastic-stack/8.11/upgrade-elastic-stack-for-elastic-cloud.html#upgrading-reindex"&gt;Upgrading via reindex&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is a live upgrade: you upgrade the cluster node by node, and the cluster stays available during the upgrade. The second one is upgrading via reindex: you create a new cluster and reindex the data from the old cluster to the new one. Then you switch the traffic to the new cluster and shut down the old one.&lt;/p&gt;
&lt;p&gt;In general, &lt;a href="https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elasticsearch.html"&gt;Elastic recommends&lt;/a&gt; doing a rolling upgrade in the following order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Upgrade the data nodes first;&lt;/li&gt;
&lt;li&gt;Upgrade other non-master nodes (ML-dedicated, coordinating, etc.);&lt;/li&gt;
&lt;li&gt;Upgrade the master nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is because upgraded data nodes can join a cluster whose master nodes still run a lower version, but older data nodes can't always join a cluster whose master nodes have already been upgraded. So, if you upgrade the master nodes first, the data nodes might fail to join, and the cluster will be unavailable.&lt;/p&gt;
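&lt;p&gt;The ordering rule above can be sketched as a small sorting function. This is only an illustration: the &lt;code&gt;Node&lt;/code&gt; case class and the role strings are assumptions made for the example, and treating any master-eligible node as "last" is a simplification.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;object UpgradeOrder {
  // Hypothetical model of a cluster node and its roles.
  final case class Node(name: String, roles: Set[String])

  // Lower rank upgrades earlier: data nodes, then other non-master nodes,
  // then master nodes. Any node with the master role goes last.
  def rank(node: Node): Int =
    if (node.roles.contains(&amp;quot;master&amp;quot;)) 2
    else if (node.roles.contains(&amp;quot;data&amp;quot;)) 0
    else 1

  // sortBy is stable, so nodes of equal rank keep their input order.
  def plan(nodes: Seq[Node]): Seq[Node] = nodes.sortBy(rank)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Given a master node, a coordinating-only node (empty role set) and a data node in that order, &lt;code&gt;UpgradeOrder.plan&lt;/code&gt; returns them as data, coordinating, master.&lt;/p&gt;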
&lt;p&gt;In general, the rolling upgrade is the recommended way to upgrade, because it's less disruptive. However, in our case, it presented too many risks. First of all, we have a multi-cluster deployment, and the clusters are pretty large, so we're talking about terabytes of data. It would take a lot of time to upgrade the cluster node by node, and during this time, the cluster would be in a mixed state, with some nodes upgraded and some not, with shards relocating, and in general in a degraded state.&lt;/p&gt;
&lt;p&gt;That, in itself, wouldn't be so scary. What would indeed be bad is if something were to go wrong. If we faced data loss, we'd have no choice but to go with restoring the data from snapshots and then resetting the input streams to bring the data up to date. This would take quite some time, because we'd have to do it for all the indices in the cluster, and during all this time, the catalog of products would either be unavailable or would have stale or partial data.&lt;/p&gt;
&lt;p&gt;So, we decided to go with the second option, the reindexing. It meant that we'd have to create a new cluster, reindex the data from the old one, and then gradually switch the traffic to the new cluster. It would take more time, but it would be way less risky and less disruptive, because once the data was in sync, going to the new cluster would be just a matter of switching the routing. If something went wrong, the rollback procedure would be almost instantaneous, as it would again be just a matter of switching the routing back.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;And last but not least, having both clusters running side by side would give us time to test the new cluster and make sure that it was working as expected and performed at the same level. We could first test it with shadow traffic, and then gradually increase the traffic to the new cluster and decrease it on the old one.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Procedure per cluster&lt;/h3&gt;
&lt;p&gt;The procedure for each of our clusters would be similar and would include the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy ES8 cluster.&lt;/li&gt;
&lt;li&gt;Set up monitoring.&lt;/li&gt;
&lt;li&gt;Create index templates (because if we were to index the data from the old cluster, we'd have to make sure that the new cluster has the same index templates as the old one).&lt;/li&gt;
&lt;li&gt;Restore data from the latest snapshot.&lt;/li&gt;
&lt;li&gt;Set up the shadow &lt;strong&gt;intake&lt;/strong&gt; traffic. This meant that the data would gradually converge with the old cluster, but the queries would still be served by the old cluster. If we were to consider the moment the snapshot was taken as point A and the moment shadow intake was enabled on the new cluster as point B, then it would mean that we have full data from beginning to A, and then from B to the end.&lt;/li&gt;
&lt;li&gt;That left us with the gap between points A and B, so the next step would be to perform the data update by resetting the data streams to the point of just before the snapshot was taken.&lt;/li&gt;
&lt;li&gt;Shadow query traffic. This would be performed gradually, with monitoring for errors.&lt;/li&gt;
&lt;li&gt;Verify that the new cluster works as expected and compare the cluster performance with the old one.&lt;/li&gt;
&lt;li&gt;Switch the live traffic to ES8 cluster (again, gradually shifting the percentages).&lt;/li&gt;
&lt;li&gt;Remove old traffic and clean up old cluster resources.&lt;/li&gt;
&lt;/ul&gt;
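&lt;p&gt;The gap between points A and B is closed by replaying the input streams from slightly before the snapshot. A minimal sketch of picking that replay point, assuming the streams can be rewound to a timestamp and that re-applying an already indexed event is harmless; &lt;code&gt;ReplayPoint&lt;/code&gt;, &lt;code&gt;replayFrom&lt;/code&gt; and &lt;code&gt;safetyMargin&lt;/code&gt; are hypothetical names, not part of the tooling described here.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import java.time.{Duration, Instant}

object ReplayPoint {
  // Point A is the moment the snapshot was taken. Rewinding a little before it
  // means some events are applied twice, which is safe only if indexing is
  // idempotent (an assumption of this sketch).
  def replayFrom(snapshotTakenAt: Instant, safetyMargin: Duration): Instant =
    snapshotTakenAt.minus(safetyMargin)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For a snapshot taken at &lt;code&gt;2023-06-01T12:00:00Z&lt;/code&gt; and a five-minute margin, the replay would start at &lt;code&gt;2023-06-01T11:55:00Z&lt;/code&gt;. Everything from that point up to point B then arrives through the replay, while newer events keep arriving through the shadow intake.&lt;/p&gt;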
&lt;p&gt;If these steps sound familiar, it is because they are. It is basically the Blue/Green procedure that is usually used for disaster recovery (failover cluster), or for testing something new. The only difference is that we used it for a one-time Elasticsearch cluster upgrade and did not keep the second cluster around. (We are also looking into applying the same approach for the failover cluster, but since our deployments are very large and complicated, we're still getting there.) This Blue/Green approach was also used by the team behind &lt;a href="https://www.zalando-lounge.de"&gt;Zalando Lounge&lt;/a&gt;, which has a separate catalog of products, also backed by Elasticsearch, so we had some in-house experience to compare with.&lt;/p&gt;
&lt;h4&gt;Routing and shadowing&lt;/h4&gt;
&lt;p&gt;The whole mechanism is based on a delicate balance of routing and shadowing. We use an open-sourced solution called &lt;a href="https://opensource.zalando.com/skipper/"&gt;Skipper&lt;/a&gt; as an ingress controller, which gives us access to &lt;a href="https://opensource.zalando.com/skipper/reference/filters/"&gt;filters&lt;/a&gt;. For the routing, we're using a custom resource type called &lt;a href="https://opensource.zalando.com/skipper/kubernetes/routegroups/"&gt;RouteGroup&lt;/a&gt;. For example, to ensure that the intake pipeline ingests data into the new cluster, the route group configuration needs to be modified to shadow the &lt;strong&gt;intake&lt;/strong&gt; traffic for the &lt;code&gt;/bulk&lt;/code&gt; and &lt;code&gt;/_alias/{index}_write&lt;/code&gt; endpoints. Here is a somewhat simplified example configuration for shadowing the specified endpoints:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;zalando.org/v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;RouteGroup&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;hosts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;cluster-name-{{{CLIENT}}}.ingress.cluster.local&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;backend-old&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;network&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;address&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://backend-old.ingress.cluster.local&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;backend-new&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;network&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;address&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;http://backend-new.ingress.cluster.local&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;## match to shadow /_bulk, /_alias/{index}_ad*_write to new backend with ES8&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;pathSubtree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;pathRegexp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;^/(_bulk|_alias/(index-name-template)_[\d]+_write)$&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;predicates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HeaderRegexp(&amp;quot;elasticsearch-index-name&amp;quot;, &amp;quot;^(index-name-template)_[\d]+($|_.*)&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;teeLoopback(&amp;quot;intake_shadow&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;preserveHost(&amp;quot;false&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;backendName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;backend-old&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;## shadow &amp;quot;intake_shadow&amp;quot; matched requests to new backend with ES8&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;pathSubtree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;pathRegexp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;^/(_bulk|_alias/(index-name-template)_[\d]+_write)$&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;predicates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HeaderRegexp(&amp;quot;elasticsearch-index-name&amp;quot;, &amp;quot;^(index-name-template)_[\d]+($|_.*)&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Tee(&amp;quot;intake_shadow&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Weight(2)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;## hack required to not match route with Traffic() and teeLoopback()&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;preserveHost(&amp;quot;false&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;backendName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;backend-new&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
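&lt;p&gt;To sanity-check which request paths such a route would pick up, the &lt;code&gt;pathRegexp&lt;/code&gt; can be exercised outside of Skipper. The sketch below is a minimal illustration in Python and not part of our actual setup: &lt;code&gt;products&lt;/code&gt; is a hypothetical stand-in for the &lt;code&gt;index-name-template&lt;/code&gt; placeholder, and &lt;code&gt;is_shadowed&lt;/code&gt; is a helper name made up for this example. Skipper evaluates the expression with Go's regexp engine, which should behave the same way for a pattern this simple.&lt;/p&gt;

```python
import re

# Hypothetical stand-in for the "index-name-template" placeholder in the route config.
INDEX_TEMPLATE = "products"

# Same shape as the pathRegexp used by both routes in the RouteGroup above.
INTAKE_PATH = re.compile(r"^/(_bulk|_alias/(" + INDEX_TEMPLATE + r")_[\d]+_write)$")


def is_shadowed(path):
    """Return True if the request path matches the intake shadowing pattern."""
    return INTAKE_PATH.match(path) is not None


if __name__ == "__main__":
    # Print the verdict for a few sample paths.
    for path in ["/_bulk", "/_alias/products_42_write", "/_search"]:
        print(path, is_shadowed(path))
```

&lt;p&gt;Only the bulk endpoint and the versioned write alias match; anything else, such as search traffic, is left to the other routes.&lt;/p&gt;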

&lt;p&gt;But that's not all. Before shadowing the intake, the mapping templates have to be created. One way to do it would be to just grab them and recreate them on the new cluster. But that would mean doing it manually, and we might also miss updates to them if they happened while the clusters were still running side by side. Since the templates are stored in our code repos and updated (based on the version) on application restart, the traffic related to template creation had to be shadowed as well, so we had to capture this specific traffic too. Snippet of code (shortened):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/:index/_mapping&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;predicates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HeaderRegexp(&amp;quot;elasticsearch-index-name&amp;quot;, &amp;quot;^(index-name-template)_[\d]+($|_.*)&amp;quot;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;## &amp;lt;...&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/_template/*&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;predicates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HeaderRegexp(&amp;quot;elasticsearch-index-name&amp;quot;, &amp;quot;^(index-name-template)_[\d]+($|_.*)&amp;quot;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Monitoring&lt;/h4&gt;
&lt;p&gt;The whole process would make no sense if we were going in blind. Since it was a multistep procedure, we needed to see how each step changed the data, how it affected the cluster, and how the new cluster performed compared to the old one. So we needed to set up monitoring. It was based on creating &lt;a href="https://docs.lightstep.com/docs/welcome-to-lightstep"&gt;Lightstep&lt;/a&gt; streams and setting up dashboards in &lt;a href="https://grafana.com"&gt;Grafana&lt;/a&gt;. The dashboards showed the traffic from both clusters side by side per endpoint, along with key metrics like latency and error rate. We also monitored CPU and memory consumption via Kubernetes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One of the most important things was keeping the data in sync, so the boards also showed index sizes for the old and new cluster and the difference between them. This way, we could see whether, say, restoring from the snapshot had indeed succeeded, and whether the follow-up shadow intake and stream resetting made the data converge in the end.&lt;/strong&gt;&lt;/p&gt;
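&lt;p&gt;The convergence check boils down to comparing per-index sizes between the two clusters. Here is a minimal sketch in Python, assuming the sizes (document counts in this example) have already been collected from both clusters, for instance via the &lt;code&gt;_cat/indices&lt;/code&gt; API. The function names, index names, and numbers are hypothetical and only illustrate the idea; our real dashboards were built in Grafana.&lt;/p&gt;

```python
def index_drift(old_counts, new_counts):
    """Return the per-index difference (new minus old) for indices seen on either cluster."""
    names = sorted(set(old_counts) | set(new_counts))
    return {name: new_counts.get(name, 0) - old_counts.get(name, 0) for name in names}


def out_of_sync(drift, tolerance=0):
    """List the indices whose absolute difference exceeds the tolerance."""
    return [name for name, delta in drift.items() if abs(delta) > tolerance]


if __name__ == "__main__":
    # Hypothetical document counts per index on each cluster.
    old_cluster = {"products_de_1": 1000, "products_fr_1": 800}
    new_cluster = {"products_de_1": 1000, "products_fr_1": 790}
    drift = index_drift(old_cluster, new_cluster)
    print(drift)
    print(out_of_sync(drift))
```

&lt;p&gt;A small tolerance is useful while shadow intake is still catching up, since the two clusters are rarely identical at any given instant.&lt;/p&gt;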
&lt;h4&gt;Alerting&lt;/h4&gt;
&lt;p&gt;And last but not least, before each new cluster went live, we had to update alerts and checks that were set up on the corresponding old cluster. We had to make sure that the alerts were pointing to the new cluster and that the checks were still working as expected. We also had to make sure that the alerts were not firing during the upgrade.&lt;/p&gt;
&lt;h4&gt;Backing up the data&lt;/h4&gt;
&lt;p&gt;And of course, as soon as the new cluster went live serving queries and the data on the old cluster stopped being updated (or preferably before that), we set up the snapshotting. We had to make sure that the data was backed up, using the same policies that the previous cluster was using.&lt;/p&gt;
&lt;h2&gt;Challenges we faced&lt;/h2&gt;
&lt;p&gt;The process of upgrading the cluster was not without challenges. Some of them were expected, some were not, and some were purely based on people never having performed some procedures before, or on something slipping one's attention.&lt;/p&gt;
&lt;p&gt;One such thing resulted in duplicates being shown in the product catalog country-wide. A routing error while switching the country index from the old cluster to the new one caused one extra index to be created automatically (and erroneously), so for some time two different indices with duplicate content existed behind the same alias. But that was quickly fixed, and the duplicates were removed by just dropping the mistakenly created index. (And hey, it's better to show the product twice than not to show it at all, right?)&lt;/p&gt;
&lt;p&gt;In general, the whole process was an amazing learning experience, and the whole team is now better prepared for the next upgrade and feels more confident tackling Elasticsearch in general. So, while assuredly sh*t still can and will happen, what matters is how you deal with it and what you learn from it.&lt;/p&gt;
&lt;p&gt;For example, the difficulty experienced by team members while restoring the data was a good indicator that our existing procedure of restoring from snapshot was extremely fussy and error-prone, which resulted in looking for alternative solutions, like Kibana-based workflows, to make the process more straightforward and more obvious. Historically, we were using custom scripts and our CI pipeline for that, but now we're aiming to get our engineers better acquainted with Kibana. The scripts are still the default way, but we're getting there.&lt;/p&gt;
&lt;h2&gt;Success!&lt;/h2&gt;
&lt;p&gt;As always after a big project, we had a retrospective, and the team was pretty happy with the results. The upgrade was successful, and the new cluster was performing at the same level as the old one. The new features were working as expected, and the new cluster was stable. The monitoring was set up, and the dashboards were showing the data in sync. The alerts were firing as expected, and the checks were working. So all in all, it was a success.&lt;/p&gt;
&lt;p&gt;But you know what?&lt;/p&gt;
&lt;p&gt;Products keep upgrading. Progress is the only constant thing in the world. So, we're already looking into the next upgrade, and we're already thinking about how to make it even better.&lt;/p&gt;
&lt;p&gt;And we will keep evolving, because that's what we do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We're Zalando. We dress code.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;(See what I did here? Even though I can't take any credit for this. This is a slogan that we once had on our company hoodies!)&lt;/p&gt;
&lt;h2&gt;Helpful links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/kibana/7.17/upgrade-assistant.html"&gt;Elasticsearch upgrade assistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html"&gt;Elasticsearch reindex API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html"&gt;Elasticsearch rolling upgrades&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elasticsearch.html"&gt;Elasticsearch upgrade guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.elastic.co/guide/en/cloud/current/ec-snapshot-restore.html"&gt;Restoring the Elasticsearch data from snapshot&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Elasticsearch"/><category term="Scala"/><category term="Java"/><category term="Backend"/></entry><entry><title>Mastering Testing Efficiency in Spring Boot: Optimization Strategies and Best Practices</title><link href="https://engineering.zalando.com/posts/2023/11/mastering-testing-efficiency-in-spring-boot-optimization-strategies-and-best-practices.html" rel="alternate"/><published>2023-11-14T00:00:00+01:00</published><updated>2023-11-14T00:00:00+01:00</updated><author><name>Hassan Elseoudy</name></author><id>tag:engineering.zalando.com,2023-11-14:/posts/2023/11/mastering-testing-efficiency-in-spring-boot-optimization-strategies-and-best-practices.html</id><summary type="html">&lt;p&gt;Unlock the secrets to supercharging your Spring Boot tests! Explore how we utilized specific techniques, resulting in a 60% reduction in test runtime!&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction 🚀&lt;/h2&gt;
&lt;p&gt;Hey there, fellow engineers! Let's dive into the exciting world of Spring Boot testing with JUnit. It is incredibly powerful, providing a realistic environment for testing our code. However, if we don't optimize our tests, they can be slow and negatively affect lead time to changes for our teams.&lt;/p&gt;
&lt;p&gt;This blog post will teach you how to optimize your Spring Boot tests, making them faster, more efficient, and more reliable.&lt;/p&gt;
&lt;p&gt;Imagine an application whose tests take 10 minutes to execute. That's a lot of time! Let's roll up our sleeves and see how we can whiz through those tests in no time! 🕒✨&lt;/p&gt;
&lt;h2&gt;Understanding Test Slicing in Spring&lt;/h2&gt;
&lt;p&gt;Test slicing in Spring allows testing specific parts of an application, focusing only on relevant components, rather than loading the entire context. It is achieved by annotations like &lt;code&gt;@WebMvcTest&lt;/code&gt;, &lt;code&gt;@DataJpaTest&lt;/code&gt;, or &lt;code&gt;@JsonTest&lt;/code&gt;. These annotations are a targeted approach to limit the context loading to a specific layer or technology. For instance, &lt;code&gt;@WebMvcTest&lt;/code&gt; primarily loads the Web layer, while &lt;code&gt;@DataJpaTest&lt;/code&gt; initializes the Data JPA layer for more concise and efficient testing. This selective loading approach is a cornerstone in optimizing test efficiency.&lt;/p&gt;
&lt;p&gt;There are more annotations that can be used to slice the context. See official Spring &lt;a href="https://docs.spring.io/spring-boot/docs/current/reference/html/test-auto-configuration.html#appendix.test-auto-configuration.slices"&gt;documentation on Test Slices&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Test Slicing: Using @DataJpaTest as a replacement for @SpringBootTest 🧩&lt;/h2&gt;
&lt;p&gt;Let's take a look at an example (code below). The test first deletes all the data (shipments and containers, each shipment can have multiple containers) from the target tables, and then saves a new shipment. Next, it creates a thread pool with 50 threads, where each thread calls the &lt;code&gt;svc.createOrUpdateContainer&lt;/code&gt; method.&lt;/p&gt;
&lt;p&gt;The test will wait until all the threads are finished, then it will check that the database has only one container.&lt;/p&gt;
&lt;p&gt;It's all about checking concurrency issues and involves a swarm of threads, clocking in at about 16 seconds on my machine – a massive chunk of time for a single service check, right?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;
&lt;span class="kd"&gt;abstract&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BaseIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;protected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ShipmentRepository&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;protected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerRepository&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseIT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerService&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@BeforeEach&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deleteAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deleteAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;testConcurrentUpdatesForContainer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kd"&gt;val&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;executor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Executors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrUpdateContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="n"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}${&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;DEFAULT_CONTAINER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;NEW_LABEL&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TimeUnit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MILLISECONDS&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;// busy waiting for executor to terminate&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;hasSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first problem we have is the class declaration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseIT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The issue starts with the &lt;code&gt;BaseIT&lt;/code&gt; class using &lt;code&gt;@SpringBootTest&lt;/code&gt;. This causes the Spring context for the entire application to be loaded (and reloaded every time we defeat the context caching mechanism; we'll get to that later!). When the application is large enough, a huge number of beans are loaded, which is a costly operation for tests with specific objectives.&lt;/p&gt;
&lt;p&gt;But no, we don't want to load everything. All we need to load is the &lt;code&gt;ContainerService&lt;/code&gt; bean and JPA repositories. We can switch to &lt;code&gt;@DataJpaTest&lt;/code&gt;. This annotation only loads the JPA part of the application, which is what we need for this test. Let's try it out!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@DataJpaTest&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerService&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ShipmentRepository&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerRepository&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Upon execution, an exception is thrown:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;springframework&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nl"&gt;BeanCreationException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Failed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DataSource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;embedded&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;want&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;embedded&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;please&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="n"&gt;supported&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;classpath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;@AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;@DataJpaTest&lt;/code&gt; is meta-annotated with &lt;code&gt;@AutoConfigureTestDatabase&lt;/code&gt;, which by default replaces the configured &lt;code&gt;DataSource&lt;/code&gt; with an embedded in-memory database such as H2 for the tests. In this case, however, no supported embedded database is on the classpath.&lt;/p&gt;
&lt;p&gt;We don't want to use H2 for our tests anyway, so we tell &lt;code&gt;@AutoConfigureTestDatabase&lt;/code&gt; not to replace our configured database. We also have to configure and load our own database, which we do here by importing a &lt;code&gt;@Configuration&lt;/code&gt; class called &lt;code&gt;EmbeddedDataSourceConfig&lt;/code&gt; (it simply creates a &lt;code&gt;@Bean&lt;/code&gt; of type &lt;code&gt;DataSource&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@DataJpaTest&lt;/span&gt;
&lt;span class="nd"&gt;@AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Replace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NONE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EmbeddedDataSourceConfig&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Import the embedded database configuration if needed.&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Use the test profile to load a different configuration for tests.&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// test code&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let's try to run the test again. Now, it fails with this error:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name &amp;#39;ContainerServiceTest&amp;#39;: Unsatisfied dependency expressed through field &amp;#39;containerService&amp;#39;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You already know the trick: you need to load the &lt;code&gt;ContainerService&lt;/code&gt; bean into the Spring context!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@DataJpaTest&lt;/span&gt;
&lt;span class="nd"&gt;@AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Replace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NONE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContainerService&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;EmbeddedDataSourceConfig&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// test code&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Uh-oh! The Spring context loads successfully, but the test fails with the following error:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;java.lang.AssertionError:
Expected size:&amp;lt;1&amp;gt; but was:&amp;lt;0&amp;gt; in:
&amp;lt;[]&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you look at &lt;code&gt;@DataJpaTest&lt;/code&gt;, you will notice that it is annotated with &lt;code&gt;@Transactional&lt;/code&gt;. By default, each test method therefore runs inside a test-managed transaction that is rolled back when the method finishes. Deleting data from the target tables and creating a new container is never committed while the test runs, so those changes are not visible to the transactions created by the threads.&lt;/p&gt;
&lt;p&gt;Since we want the setup data committed independently of the test-managed transaction (which &lt;code&gt;@DataJpaTest&lt;/code&gt; opens), we run it in its own transaction using &lt;code&gt;Propagation.REQUIRES_NEW&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@DataJpaTest&lt;/span&gt;
&lt;span class="nd"&gt;@AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AutoConfigureTestDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Replace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NONE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContainerService&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;EmbeddedDataSourceConfig&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ContainerServiceTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;transactionTemplate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TransactionTemplate&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerService&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ShipmentRepository&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;lateinit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ContainerRepository&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@BeforeEach&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;transactionTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;propagationBehavior&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TransactionTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PROPAGATION_REQUIRES_NEW&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;transactionTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deleteAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;containerRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deleteAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;shipmentRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shipment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;🎉 The test passes, completing in just 8 seconds (load context + run) - twice as fast as before!&lt;/p&gt;
&lt;h2&gt;Test Slicing: @JsonTest Precision in Validating JSON Serialization/Deserialization 💡&lt;/h2&gt;
&lt;p&gt;Consider this test snippet:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EventDeserializationIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RESOURCE_PATH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;event-example.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ObjectMapper&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;testDeserialization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;throws&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RESOURCE_PATH&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UTF_8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;forType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;readValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getData&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getNewTour&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getFromLocation&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getData&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getNewTour&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getToLocation&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The objective of this test is to ensure proper deserialization. We can use the &lt;code&gt;@JsonTest&lt;/code&gt; annotation to load only the beans that the test needs. Here that is just the &lt;code&gt;ObjectMapper&lt;/code&gt;, so there is no need to extend any other class! Using this annotation applies only the configuration relevant to JSON tests (i.e. &lt;code&gt;@JsonComponent&lt;/code&gt;, Jackson Module).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@JsonTest&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EventDeserializationTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ObjectMapper&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;objectMapper&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Test Slicing: @WebMvcTest for REST APIs 🌐&lt;/h2&gt;
&lt;p&gt;Using &lt;code&gt;@WebMvcTest&lt;/code&gt;, we can test REST APIs without firing up the server (e.g., the embedded Tomcat), or loading the whole application context. It’s all about targeting specific controllers. Fast and efficient, just like that!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@WebMvcTest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ShipmentServiceController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShipmentServiceControllerTests&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MockMvc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mvc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ShipmentService&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;getShipmentShouldReturnShipmentDetails&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;throws&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;given&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;willReturn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;LocalDate&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;mvc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/shipments/12345&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MediaType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;andExpect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isOk&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;andExpect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;$.number&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;12345&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Taming Mock/Spy Beans and Context Caching Dilemmas 🔍&lt;/h2&gt;
&lt;p&gt;Let's delve into the intricacies of the Spring Test context caching mechanism!&lt;/p&gt;
&lt;p&gt;Tests that use Spring Test features (e.g., &lt;code&gt;@SpringBootTest&lt;/code&gt;, &lt;code&gt;@WebMvcTest&lt;/code&gt;, &lt;code&gt;@DataJpaTest&lt;/code&gt;) require a running Spring context. Starting one takes a considerable amount of time, especially when &lt;code&gt;@SpringBootTest&lt;/code&gt; populates the entire context. If each test started its own context, test execution overhead and build times would grow accordingly.&lt;/p&gt;
&lt;p&gt;Fortunately, Spring Test provides a mechanism to cache a started application context and reuse it for subsequent tests with similar context requirements.&lt;/p&gt;
&lt;p&gt;The cache works like a map with a bounded capacity (32 contexts by default). The map key is computed from the configuration used to load the context, which in turn determines the beans loaded into it.&lt;/p&gt;
&lt;p&gt;The cache key consists of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;locations (from &lt;code&gt;@ContextConfiguration&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;classes (from &lt;code&gt;@ContextConfiguration&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;contextInitializerClasses (from &lt;code&gt;@ContextConfiguration&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;contextCustomizers (from &lt;code&gt;ContextCustomizerFactory&lt;/code&gt;) – this includes &lt;code&gt;@DynamicPropertySource&lt;/code&gt; methods as well as various features from Spring Boot’s testing support such as &lt;code&gt;@MockBean&lt;/code&gt; and &lt;code&gt;@SpyBean&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;contextLoader (from &lt;code&gt;@ContextConfiguration&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;parent (from &lt;code&gt;@ContextHierarchy&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;activeProfiles (from &lt;code&gt;@ActiveProfiles&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;propertySourceLocations (from &lt;code&gt;@TestPropertySource&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;propertySourceProperties (from &lt;code&gt;@TestPropertySource&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;resourceBasePath (from &lt;code&gt;@WebAppConfiguration&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, if &lt;code&gt;TestClassA&lt;/code&gt; specifies &lt;code&gt;{"app-config.xml", "test-config.xml"}&lt;/code&gt; for the locations (or value) attribute of &lt;code&gt;@ContextConfiguration&lt;/code&gt;, the TestContext framework loads the corresponding ApplicationContext and stores it in a static context cache under a key that is based solely on those locations. So, if &lt;code&gt;TestClassB&lt;/code&gt; also defines &lt;code&gt;{"app-config.xml", "test-config.xml"}&lt;/code&gt; for its locations (either explicitly or implicitly through inheritance) and does not define different values for any of the other attributes listed above, then the same ApplicationContext is shared by both test classes. This means that the setup cost for loading an application context is incurred only once (per test suite), and subsequent test execution is much faster.&lt;/p&gt;
&lt;p&gt;If test classes differ in any of these attributes, for example by using a different &lt;code&gt;@ContextConfiguration&lt;/code&gt;, &lt;code&gt;@TestPropertySource&lt;/code&gt;, &lt;code&gt;@MockBean&lt;/code&gt; or &lt;code&gt;@SpyBean&lt;/code&gt;, the cache key changes. Every key that is not yet in the cache requires a new context to be loaded from scratch.&lt;/p&gt;
&lt;p&gt;If a test suite produces many different contexts, the cache fills up and the least recently used contexts are evicted, so later tests that could have reused them have to load them again.
This adds extra test time.&lt;/p&gt;
&lt;p&gt;One way to optimize this is to consolidate mock beans in a shared parent class. Every test class that extends it then produces the same cache key, so the context is loaded once and reused instead of being reloaded for each test class.&lt;/p&gt;
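&lt;p&gt;To make the effect on the cache key concrete, here is a rough sketch in plain Java. It is not Spring's implementation: &lt;code&gt;ContextKeyDemo&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;contextsToLoad&lt;/code&gt; are hypothetical names, the configuration class &lt;code&gt;App&lt;/code&gt; and the profile &lt;code&gt;test&lt;/code&gt; are assumed values, and only three of the key inputs listed above are modelled, each as a plain string.&lt;/p&gt;

```java
import java.util.stream.Stream;

public class ContextKeyDemo {

    // Hypothetical, simplified cache key: the real key has more inputs
    // (see the list above); only three are modelled here, as plain strings.
    public static String key(String configClasses, String activeProfiles, String mockedBeans) {
        return String.join("|", configClasses, activeProfiles, mockedBeans);
    }

    // Each distinct key stands for one application context that must be loaded.
    public static long contextsToLoad(String... keys) {
        return Stream.of(keys).distinct().count();
    }

    public static void main(String[] args) {
        // Before: every test class mocks a different bean, so all keys differ.
        long before = contextsToLoad(
                key("App", "test", "DependencyA"),
                key("App", "test", "DependencyB"),
                key("App", "test", "DependencyC"));

        // After: the mocks are declared once in a shared parent class,
        // so all three test classes produce the same key.
        String shared = key("App", "test", "DependencyA,DependencyB,DependencyC");
        long after = contextsToLoad(shared, shared, shared);

        System.out.println("contexts loaded before: " + before);
        System.out.println("contexts loaded after: " + after);
    }
}
```

&lt;p&gt;In this simplified model, three test classes with different mocks need three contexts, while three classes sharing one set of mocks need a single context.&lt;/p&gt;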
&lt;p&gt;Example before and after:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyB&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyB&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we run the above example, the context is loaded three times, once per test class, which is not efficient at all.
Let's try to optimize it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;abstract&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseTestClass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// protected, so that subclasses can stub and verify the mocks&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;protected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;protected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyB&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyB&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;protected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Extend the BaseTestClass for each test class&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseTestClass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;testSomething1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseTestClass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;testSomething2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestClass3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;BaseTestClass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;testSomething3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Test implementation&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, the context will be reloaded only once, which is more efficient!&lt;/p&gt;
&lt;p&gt;Or even better: you can avoid class inheritance by using the &lt;code&gt;@Import&lt;/code&gt; annotation to import configuration classes that contain the mock beans.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@TestConfiguration&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyB&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyB&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@MockBean&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DependencyC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dependencyC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@Import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;TestClass1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// Test code&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Think twice before using @DirtiesContext ❗&lt;/h2&gt;
&lt;p&gt;Applying &lt;code&gt;@DirtiesContext&lt;/code&gt; to a test class marks the Spring application context as dirty: the context is closed and removed from the cache after the tests in that class have run, so Spring Test cannot reuse it for later tests. Think carefully before using this annotation.&lt;/p&gt;
&lt;p&gt;Although some use it to reset IDs in the database, better alternatives exist. For instance, the &lt;code&gt;@Transactional&lt;/code&gt; annotation can be used to roll back the transaction after the test is executed.&lt;/p&gt;
&lt;h2&gt;Parallel Execution of Tests 🏎️&lt;/h2&gt;
&lt;p&gt;By default, JUnit Jupiter tests run sequentially in a single thread. Running tests in parallel for faster execution is an opt-in feature, available since JUnit 5.3. 🚀&lt;/p&gt;
&lt;p&gt;To initiate parallel test execution, follow these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a &lt;code&gt;junit-platform.properties&lt;/code&gt; file in &lt;code&gt;src/test/resources&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the following configuration to the file:
&lt;code&gt;junit.jupiter.execution.parallel.enabled = true&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add &lt;code&gt;@Execution(CONCURRENT)&lt;/code&gt; to every test class you want to run in parallel.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Keep in mind that certain tests might not be compatible with parallel execution due to their nature. For such cases, you should not add &lt;code&gt;@Execution(CONCURRENT)&lt;/code&gt;. See &lt;a href="https://junit.org/junit5/docs/snapshot/user-guide/#writing-tests-parallel-execution"&gt;JUnit: writing tests – parallel execution&lt;/a&gt; for more explanation on the different execution modes.&lt;/p&gt;
&lt;h2&gt;Results 📊&lt;/h2&gt;
&lt;p&gt;Applying all the optimizations mentioned above made a big difference in our CI/CD pipeline. Our tests now take only &lt;strong&gt;4 minutes and 15 seconds&lt;/strong&gt;, down from &lt;strong&gt;10 minutes and 7 seconds&lt;/strong&gt;, a reduction of roughly &lt;strong&gt;58&lt;/strong&gt;%! 🌟&lt;/p&gt;
&lt;h2&gt;Conclusion 🎬&lt;/h2&gt;
&lt;p&gt;In this adventure of optimizing Spring Boot tests, we've harnessed a collection of strategies to bolster test efficiency and speed. Let's summarize the tactics we've implemented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test Slicing:&lt;/strong&gt; Leveraging &lt;code&gt;@WebMvcTest&lt;/code&gt;, &lt;code&gt;@DataJpaTest&lt;/code&gt;, and &lt;code&gt;@JsonTest&lt;/code&gt; to focus tests on specific layers or components. See &lt;a href="https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.testing.spring-boot-applications"&gt;Testing Spring Boot Applications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context Caching Dilemmas:&lt;/strong&gt; Overcoming challenges related to dirty ApplicationContext caches by optimizing the use of mock and spy beans. See &lt;a href="https://docs.spring.io/spring-framework/reference/testing/testcontext-framework/ctx-management/caching.html"&gt;Spring Test Context Caching&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallel Test Execution:&lt;/strong&gt; Enabling parallel test execution to significantly reduce test suite execution time. See &lt;a href="https://junit.org/junit5/docs/current/user-guide/#writing-tests-parallel-execution"&gt;JUnit 5 User Guide on Parallel Execution&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These strategies collectively transform testing into a faster, more reliable, and efficient process. Each tactic, used alone or combined, contributes significantly to optimized testing practices, empowering engineers to deliver higher-quality software with enhanced efficiency.&lt;/p&gt;</content><category term="Zalando"/><category term="Java"/><category term="Testing"/><category term="Frameworks"/><category term="Kotlin"/><category term="Backend"/></entry><entry><title>Patching the PostgreSQL JDBC Driver</title><link href="https://engineering.zalando.com/posts/2023/11/patching-pgjdbc.html" rel="alternate"/><published>2023-11-09T00:00:00+01:00</published><updated>2023-11-09T00:00:00+01:00</updated><author><name>Declan Murphy</name></author><id>tag:engineering.zalando.com,2023-11-09:/posts/2023/11/patching-pgjdbc.html</id><summary type="html">&lt;p&gt;Contributing to the PostgreSQL JDBC Driver to address the issue of runaway WAL growth in Logical Replication&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This blog post describes a recent contribution from Zalando to the Postgres JDBC driver to address &lt;a href="https://github.com/pgjdbc/pgjdbc/issues/1490"&gt;a long-standing issue&lt;/a&gt; with the driver’s integration with Postgres’ logical replication that resulted in runaway Write-Ahead Log (WAL) growth. We will describe the issue, how it affected us at Zalando, and detail the fix made upstream in the JDBC driver that fixes the issue for Debezium and all other clients of the Postgres JDBC driver.&lt;/p&gt;
&lt;h2&gt;Postgres Logical Replication at Zalando&lt;/h2&gt;
&lt;p&gt;Builders at Zalando have access to a low-code solution that allows them to declare event streams that source from Postgres databases. Each event stream declaration provisions a micro application, powered by &lt;a href="https://debezium.io/"&gt;Debezium Engine&lt;/a&gt;, that uses Postgres Logical Replication to publish table-level change events as they occur. Capable of publishing events to a variety of different technologies, with arbitrary event transformations via AWS Lambda, these event streams form a core part of the Zalando infrastructure offering. At the time of writing, there are hundreds of these Postgres-sourced event streams out in the wild at Zalando.&lt;/p&gt;
&lt;p&gt;One common problem that occurs with Logical Replication is excessive growth of the Postgres Write-Ahead Log (WAL). At times, the WAL could grow to the point where it consumed all of the available disk space on the database node, resulting in demotion of the node to read-only - an undesirable outcome in a production setting indeed! This issue is prevalent in cases where a table being streamed receives little to no write traffic - but once a write is made, any excessive WAL growth disappears instantly. In recent years, as the popularity of Postgres-sourced event streams has grown at Zalando, we have seen this issue occur more and more often.&lt;/p&gt;
&lt;p&gt;So what is happening at a low level during this event-streaming process? How does Postgres reliably ensure that all data change events are emitted and captured by an interested client? The answers to these questions were crucial to understanding the problem and finding its solution.&lt;/p&gt;
&lt;p&gt;To explain the issue and how we solved it, we first must explain a little bit about the internals of Postgres replication. In Postgres, the Write Ahead Log (WAL) is a strictly ordered sequence of events that have occurred in the database. These WAL events are the source of truth for the database, and streaming and replaying WAL events is how both Physical and Logical Replication work. Physical replication is used for database replication. Logical Replication, which is the subject of this blog, allows clients to subscribe to data change WAL events. In both cases, replication clients track their progress through the WAL by checkpointing their location, known as the Log Sequence Number (LSN), directly on the primary database. WAL events stored on the primary database can only be discarded after all replication clients, both physical and logical, confirm that they have been processed. If one client fails to confirm that it has processed a WAL event, then the primary node will retain that WAL event and all subsequent WAL events until confirmation occurs.&lt;/p&gt;
&lt;p&gt;Simple, right?&lt;/p&gt;
&lt;p&gt;Well, the happy path is quite simple, yes. However as you may imagine, this blog post concerns a path that is anything but happy.&lt;/p&gt;
&lt;h2&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Before we go on, allow me to paint a simplified picture of our architecture which was experiencing issues with this process:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Postgres database with logical replication set up on two of its three tables" src="https://engineering.zalando.com/posts/2023/11/images/logical-replication.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;A Postgres database with logical replication set up on two of its three tables&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;We have a database with multiple tables, denoted here by their different colors: blue (1), pink (2), purple (3), etc. Additionally, we are listening to changes made to the blue and pink tables specifically. The changes are being streamed via Logical Replication to a blue client and a pink client respectively. In our case, these clients are our Postgres-sourced event streaming applications which use &lt;a href="https://github.com/debezium/debezium"&gt;Debezium&lt;/a&gt; and &lt;a href="https://github.com/pgjdbc/pgjdbc"&gt;PgJDBC&lt;/a&gt; under the hood to bridge the gap between Postgres byte-array messages and Java by providing a user-friendly API to interact with.&lt;/p&gt;
&lt;p&gt;The key thing to note here is that changes from all tables go into the same WAL. The WAL exists at the server level and we cannot break it down into a table-level or schema-level concept. All changes for all tables in all schemas in all databases on that server go into the same WAL.&lt;/p&gt;
&lt;p&gt;In order to track the individual progress of the blue and pink replication, the database server uses a construct called a replication slot. A replication slot should be created for each client - so in this case we have blue (upper, denoted &lt;code&gt;1&lt;/code&gt;) and pink (lower, denoted &lt;code&gt;2&lt;/code&gt;) replication slots - and each slot will contain information about the progress of its client through the WAL. It does this by storing the LSN of the last flushed WAL, among some other pieces of information but let’s keep it simple.&lt;/p&gt;
&lt;p&gt;If we zoom into the WAL, we could illustrate it simplistically as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Each client has a replication slot, tracking its progress through the WAL." src="https://engineering.zalando.com/posts/2023/11/images/replication-slots-1.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Each client has a replication slot, tracking its progress through the WAL.&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Here, I have illustrated LSNs as decimal numbers for clarity. In reality, they are expressed as hexadecimal combinations of page numbers and positions.&lt;/p&gt;
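&lt;p&gt;As a rough illustration of that textual format, the sketch below converts an LSN such as &lt;code&gt;16/B374D848&lt;/code&gt; (two hexadecimal numbers separated by a slash, holding the high and low 32 bits of the WAL position) into a single number that can be compared. The class and method names are made up for this example; PgJDBC ships its own &lt;code&gt;LogSequenceNumber&lt;/code&gt; type for real use.&lt;/p&gt;

```java
// Hypothetical helper for illustration only; not part of PgJDBC or Postgres.
final class LsnText {
    // 2^32: the high part of an LSN counts units of this size.
    private static final long HIGH_UNIT = 0x1_0000_0000L;

    /** Parses the textual form "HIGH/LOW" (both hexadecimal) into one long. */
    static long parse(String text) {
        String[] parts = text.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected HIGH/LOW but got: " + text);
        }
        long high = Long.parseLong(parts[0], 16);
        long low = Long.parseLong(parts[1], 16);
        return high * HIGH_UNIT + low;
    }

    public static void main(String[] args) {
        long earlier = parse("0/7");
        long later = parse("16/B374D848");
        // A later WAL position maps to a larger number.
        System.out.println(later > earlier); // prints true
    }
}
```

&lt;p&gt;The sketch assumes positions small enough to fit a signed &lt;code&gt;long&lt;/code&gt;; a production implementation would compare the values as unsigned.&lt;/p&gt;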
&lt;p&gt;As write operations occur on any of the tables in the database, those write operations are written to the WAL - the next available log position being &lt;code&gt;#7&lt;/code&gt;. If a write occurs on e.g. the blue table, a message will be sent to the blue client with this information and once the client confirms receipt of change &lt;code&gt;#7&lt;/code&gt;, the blue replication slot will be advanced to &lt;code&gt;#7&lt;/code&gt;. However WAL with LSN &lt;code&gt;#7&lt;/code&gt; can’t be recycled and its disk space freed up just yet, since the pink replication slot is still only on &lt;code&gt;#6&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="As changes occur in the blue table, the blue client's replication slot advances, but the pink slot has no reason to move" src="https://engineering.zalando.com/posts/2023/11/images/replication-slots-2.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;As changes occur in the blue table, the blue client's replication slot advances, but the pink slot has no reason to move&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;If the blue table were to continue receiving writes, but without a write operation occurring on the pink table, the pink replication slot would never have a chance to advance, and all of the blue WAL events would be left sitting around, taking up space.&lt;/p&gt;
&lt;p&gt;&lt;img alt="This will continue with WAL growing dangerously large, risking using all of the disk space of the entire server" src="https://engineering.zalando.com/posts/2023/11/images/replication-slots-3.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;This will continue with WAL growing dangerously large, risking using all of the disk space of the entire server&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;However once a write occurs in the pink table, this change will be written to the next available WAL position, say &lt;code&gt;#14&lt;/code&gt;, the pink client will confirm receipt and the pink replication slot will advance to position &lt;code&gt;#14&lt;/code&gt;. Now we have the below state:&lt;/p&gt;
&lt;p&gt;&lt;img alt="As soon as a write occurs in the pink table, the pink replication slot will advance and the WAL events can be deleted up to position #13, as they are no longer needed by any slot" src="https://engineering.zalando.com/posts/2023/11/images/replication-slots-4.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;As soon as a write occurs in the pink table, the pink replication slot will advance and the WAL events can be deleted up to position #13, as they are no longer needed by any slot&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;This was the heart of the issue. The pink client is not interested in these WAL events, however until the pink client confirms a later LSN in its replication slot, Postgres cannot delete these WAL events. This will continue ad infinitum until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table.&lt;/p&gt;
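&lt;p&gt;The retention rule behind this can be sketched in a few lines. This is a simplified model that uses the decimal positions from the figures, not Postgres internals, and the class and method names are hypothetical. The server can only recycle WAL that every slot has moved past, so the slowest slot decides how much WAL is kept.&lt;/p&gt;

```java
// Simplified model for illustration; names are hypothetical, not Postgres internals.
final class WalRetention {
    /** Returns the position of the slowest slot; WAL cannot be recycled past it. */
    static long slowestSlot(long[] slotPositions) {
        if (slotPositions.length == 0) {
            throw new IllegalArgumentException("at least one slot is required");
        }
        long slowest = slotPositions[0];
        for (long position : slotPositions) {
            slowest = Math.min(slowest, position);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Blue keeps confirming writes up to #13 while pink is stuck at #6.
        System.out.println(slowestSlot(new long[] {13, 6})); // prints 6
        // After a write to the pink table, pink confirms #14.
        System.out.println(slowestSlot(new long[] {13, 14})); // prints 13
    }
}
```

&lt;p&gt;However many writes the blue table receives, the result stays at the pink slot's position until the pink client confirms something newer.&lt;/p&gt;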
&lt;h2&gt;Mitigation Strategies&lt;/h2&gt;
&lt;p&gt;Many blog posts have been written about this bug, phenomenon, behavior, call it what you will. Hacky solutions abound. The most popular by far was creating scheduled jobs writing dummy data to the pink table in order to force it to advance. This solution had been used in Zalando in the past but it’s a kludge that doesn’t address the real issue at the heart of the problem and mandates a constant extra workload overhead from now and forever more when setting up Postgres logical replication.&lt;/p&gt;
&lt;p&gt;Even Gunnar Morling, the ex-Debezium Lead, has &lt;a href="https://www.morling.dev/blog/insatiable-postgres-replication-slot/"&gt;written&lt;/a&gt; about the topic.&lt;/p&gt;
&lt;p&gt;Byron Wolfman, in a blog post, alludes to the pure solution before abandoning the prospect in favour of the same kludge. The following quote is an extract from his &lt;a href="https://wolfman.dev/posts/pg-logical-heartbeats/"&gt;post&lt;/a&gt; on the topic:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Excerpt from a blog post which details both the pure solution of advancing the cursor as well as the “fake writes” hack" src="https://engineering.zalando.com/posts/2023/11/images/wolfman-blog-post-excerpt.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Excerpt from a blog post which details both the pure solution of advancing the cursor as well as the “fake writes” hack&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;This was indeed the solution in its purest form. In our case with a Java application as the end-consumer, the first port-of-call for messages from Postgres was PgJDBC, the Java Driver for Postgres. If we could solve the issue at this level, then it would be abstracted away from - and solved for - all Java applications, Debezium included.&lt;/p&gt;
&lt;h2&gt;Our Solution&lt;/h2&gt;
&lt;p&gt;The key was to note that while Postgres only sends Replication messages in case of a write operation, it is sending KeepAlive messages on a regular basis in order to maintain the connection between it and, in this case, PgJDBC. This KeepAlive message contains very little data: some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server. Historically, PgJDBC would not respond to a KeepAlive message and nothing would change on the server-side as a result of a KeepAlive message being sent. This needed to change.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The original flow of messages between the database server and the PGJDBC driver. Only replication messages received confirmations from the driver" src="https://engineering.zalando.com/posts/2023/11/images/message-flow-original.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;The original flow of messages between the database server and the PgJDBC driver. Only replication messages received confirmations from the driver.&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The fix involved updating the client to keep track of the LSN of the last Replication message received from the server and the LSN of the latest message confirmed by the client. If these two LSNs are the same, and the client then receives a KeepAlive message with a higher LSN, the client can infer that it has flushed all relevant changes and that the changes happening on the database are ones it doesn't care about. The client can safely confirm receipt of this LSN back to the server, thus advancing its replication slot position and allowing the Postgres server to delete those irrelevant WAL events. This approach is conservative enough to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The updated flow of messages now includes confirmation responses for each KeepAlive message as well, allowing all replicas to constantly confirm receipt of WAL changes" src="https://engineering.zalando.com/posts/2023/11/images/message-flow-updated.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;The updated flow of messages now includes confirmation responses for each KeepAlive message as well, allowing all replicas to constantly confirm receipt of WAL changes&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
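&lt;p&gt;The decision described above can be sketched as follows. This is a simplified illustration with hypothetical names and plain &lt;code&gt;long&lt;/code&gt; values standing in for LSNs; it is not the code from the pull request.&lt;/p&gt;

```java
// Illustrative sketch only; names are hypothetical and this is not the PgJDBC implementation.
final class KeepAliveTracker {
    // LSN of the last replication message received from the server.
    private long lastReceivedLsn;
    // LSN of the latest message the client has confirmed.
    private long lastConfirmedLsn;

    void onReplicationMessage(long lsn) {
        lastReceivedLsn = lsn;
    }

    void onConfirmed(long lsn) {
        lastConfirmedLsn = lsn;
    }

    /** Handles a KeepAlive and returns the LSN the client may report back to the server. */
    long onKeepAlive(long serverLsn) {
        boolean nothingPending = lastReceivedLsn == lastConfirmedLsn;
        if (nothingPending) {
            if (serverLsn > lastConfirmedLsn) {
                // Everything relevant is confirmed, so the newer WAL is irrelevant to this client.
                lastReceivedLsn = serverLsn;
                lastConfirmedLsn = serverLsn;
            }
        }
        return lastConfirmedLsn;
    }

    public static void main(String[] args) {
        KeepAliveTracker pink = new KeepAliveTracker();
        pink.onReplicationMessage(6);
        pink.onConfirmed(6);
        // The blue table has pushed the server to #13; pink has nothing pending.
        System.out.println(pink.onKeepAlive(13)); // prints 13
    }
}
```

&lt;p&gt;While a received message is still unconfirmed, the KeepAlive position is ignored and the previously confirmed LSN is reported, which is what keeps relevant events from being skipped.&lt;/p&gt;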
&lt;p&gt;The fix was implemented, tested, and submitted to PgJDBC in &lt;a href="https://github.com/pgjdbc/pgjdbc/pull/2941"&gt;a pull request&lt;/a&gt;. Merged on August 31st 2023, this fix is scheduled to be released in the 42.7.0 version of PgJDBC.&lt;/p&gt;
&lt;h2&gt;Rollout&lt;/h2&gt;
&lt;p&gt;Our Debezium-powered streaming applications support backwards compatibility with functionality that has been removed from newer versions of Debezium. In order to maintain this backwards compatibility, our applications do not use the latest version of Debezium and, by extension, do not use the latest version of PgJDBC which is pulled in as a transitive dependency by Debezium. In order to take advantage of the fix while still maintaining this backwards compatibility, we modified our build scripts to optionally override the latest version of the transitive PgJDBC dependency and we took advantage of this option to build not one, but two Docker images for our applications: one unchanged and another with a locally built version, 42.6.1-patched, of PgJDBC that contained our fix. We rolled this modified Docker image out to our test environment while still using the unchanged image in our production environment. This way we could safely verify that our event-streaming applications continued to behave as intended and monitor the behaviour in order to verify the issue of WAL growth had been addressed.&lt;/p&gt;
&lt;p&gt;To verify the issue had indeed disappeared, we monitored a graph of the total WAL Size over the course of a few days on a low-activity database. Before the implementation of the fix, it would be common to see the following graph of total WAL size, indicating the presence of the issue over 36 hours:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Graph of WAL before the fix" src="https://engineering.zalando.com/posts/2023/11/images/wal-growth-before.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Runaway WAL growth before the fix&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;That same database after the fix now has a WAL Size graph that looks like the below, over the same time range and with no other changes to the persistence layer, service layer or activity:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Graph of WAL after the fix" src="https://engineering.zalando.com/posts/2023/11/images/wal-growth-after.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;WAL growth (or lack thereof!) after the fix&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;As the fix itself was designed to be sufficiently conservative when confirming LSNs so that we could guarantee that an event would never be skipped or missed, this evidence was sufficient for us to confidently roll out the newer Docker images to our production clusters, solving the issue of runaway WAL growth for 100s of Postgres-sourced event streams across Zalando. No more hacks required :)&lt;/p&gt;</content><category term="Zalando"/><category term="PostgreSQL"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>Understanding GraphQL Directives: Practical Use-Cases at Zalando</title><link href="https://engineering.zalando.com/posts/2023/10/understanding-graphql-directives-practical-use-cases-zalando.html" rel="alternate"/><published>2023-10-19T00:00:00+02:00</published><updated>2023-10-19T00:00:00+02:00</updated><author><name>Boopathi Rajaa Nedunchezhiyan</name></author><id>tag:engineering.zalando.com,2023-10-19:/posts/2023/10/understanding-graphql-directives-practical-use-cases-zalando.html</id><summary type="html">&lt;p&gt;In this blog post, we dive into the practical applications of GraphQL directives at Zalando. With simple examples, we aim to highlight how they enhance our use cases. From defining precise authorization requirements to efficiently handling metadata, GraphQL directives offer flexibility and control in our API development process.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;GraphQL directives&lt;/h2&gt;
&lt;p&gt;In GraphQL, if you've used the syntax that starts with an &lt;code&gt;@&lt;/code&gt;, for example, &lt;code&gt;@foo&lt;/code&gt;, then you've used GraphQL directives. Directives provide a way to extend the language features of GraphQL using a supported syntax. Certain directives are built into GraphQL, like &lt;code&gt;@skip&lt;/code&gt;, &lt;code&gt;@include&lt;/code&gt;, &lt;code&gt;@deprecated&lt;/code&gt;, and &lt;code&gt;@specifiedBy&lt;/code&gt;, and are supported by all GraphQL engines.&lt;/p&gt;
&lt;p&gt;If we look closer, we can see that two of these directives (&lt;code&gt;@skip&lt;/code&gt; and &lt;code&gt;@include&lt;/code&gt;) are used only in the queries, and the other two (&lt;code&gt;@deprecated&lt;/code&gt; and &lt;code&gt;@specifiedBy&lt;/code&gt;) are used only in the schema. This is because GraphQL directives are defined for two different categories of locations - &lt;code&gt;TypeSystem&lt;/code&gt; and &lt;code&gt;ExecutableDefinition&lt;/code&gt;. The &lt;code&gt;TypeSystem&lt;/code&gt; directives are defined for the schema, and the &lt;code&gt;ExecutableDefinition&lt;/code&gt; directives are defined for the queries. We will discuss this in detail in the next section.&lt;/p&gt;
&lt;p&gt;The query directives are generally useful for clients to express certain types of metadata for the query. The schema directives are generally useful for declaratively specifying common server-side behaviors, for example, authorization requirements, marking sensitive data, etc.&lt;/p&gt;
&lt;h2&gt;Part 1: Schema directives at Zalando&lt;/h2&gt;
&lt;p&gt;The schema directives refer to the directives defined for the &lt;code&gt;TypeSystem&lt;/code&gt; locations. The type system directives are available for the locations listed below. Consider &lt;code&gt;@foo&lt;/code&gt; a directive defined for the location mentioned in the first column.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@foo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;LOCATION_IN_FIRST_COLUMN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;!--
Because the line containing
union X @foo = A | B
treats `|` as table separator and messes up the table formatting
--&gt;
&lt;!-- prettier-ignore --&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directive Location&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SCHEMA&lt;/td&gt;
&lt;td&gt;&lt;code&gt;schema @foo { query: Query }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SCALAR&lt;/td&gt;
&lt;td&gt;&lt;code&gt;scalar x @foo&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OBJECT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;type Product @foo { }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FIELD_DEFINITION&lt;/td&gt;
&lt;td&gt;&lt;code&gt;type X { field: String @foo }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARGUMENT_DEFINITION&lt;/td&gt;
&lt;td&gt;&lt;code&gt;type X { field(arg: Int @foo): String }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INTERFACE&lt;/td&gt;
&lt;td&gt;&lt;code&gt;interface X @foo {}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNION&lt;/td&gt;
&lt;td&gt;&lt;code&gt;union X @foo = A | B&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ENUM&lt;/td&gt;
&lt;td&gt;&lt;code&gt;enum X @foo { A B }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ENUM_VALUE&lt;/td&gt;
&lt;td&gt;&lt;code&gt;enum X { A @foo B }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INPUT_OBJECT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;input X @foo { }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INPUT_FIELD_DEFINITION&lt;/td&gt;
&lt;td&gt;&lt;code&gt;input X { field: String @foo }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href="https://the-guild.dev/about-us"&gt;The guild - https://the-guild.dev&lt;/a&gt; has a great &lt;a href="https://the-guild.dev/graphql/tools/docs/schema-directives"&gt;article&lt;/a&gt; and a mechanism for implementing schema directives via their &lt;a href="https://the-guild.dev/graphql/tools"&gt;graphql-tools&lt;/a&gt; packages. I highly recommend reading it and using graphql-tools for implementing schema directives.&lt;/p&gt;
&lt;p&gt;The gist is that you can define a directive in the schema and implement the directive in the resolver layer. The directive is implemented as a function that takes the resolver function as an argument and returns a new resolver function. The new resolver function can be used to implement the directive logic.&lt;/p&gt;
&lt;p&gt;You can think of a schema directive as a function call injected into your resolver function in a declarative way. Consider the following illustration to understand where the directive function can be invoked in the context of a resolver.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="cm"&gt;/**&lt;/span&gt;
&lt;span class="cm"&gt; * Illustration of schema directives execution in&lt;/span&gt;
&lt;span class="cm"&gt; * the query execution pipeline&lt;/span&gt;
&lt;span class="cm"&gt; */&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resolvers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;Query&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// schema directives&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;schemaDirectivesExecutions&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// resolver logic&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// schema directives&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;schemaDirectivesExecutions&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
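&lt;p&gt;The "function that takes a resolver and returns a new resolver" idea can be sketched in plain JavaScript, without any GraphQL library. The helper name &lt;code&gt;wrapResolver&lt;/code&gt; and the &lt;code&gt;before&lt;/code&gt;/&lt;code&gt;after&lt;/code&gt; hooks are hypothetical and only mirror the two call sites in the illustration above; graphql-tools provides its own mechanism for this.&lt;/p&gt;

```javascript
// Hypothetical helper: wraps a resolver with hooks that run before and after it.
function wrapResolver(resolve, { before, after }) {
  return async function wrappedResolver(parent, args, context, info) {
    // Directive logic that runs before the resolver (e.g. access checks).
    if (before) await before(parent, args, context, info);

    // The original resolver logic.
    const result = await resolve(parent, args, context, info);

    // Directive logic that runs after the resolver (e.g. transforming the result).
    return after ? after(result, context, info) : result;
  };
}

// Usage with a stand-in resolver; `calls` records the order of execution.
const calls = [];
const product = wrapResolver(
  async (_, { id }) => ({ id, name: "Sneaker" }),
  {
    before: () => { calls.push("before"); },
    after: (result) => { calls.push("after"); return result; },
  }
);
```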

&lt;h3&gt;&lt;code&gt;@isAuthenticated&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;At Zalando, we use SSO for customer authentication and &lt;a href="https://auth0.com/blog/what-is-step-up-authentication-when-to-use-it/"&gt;step-up authentication&lt;/a&gt;. Our GraphQL server handles publicly available data like the product data, and also handles confidential data like customer-related data.&lt;/p&gt;
&lt;p&gt;The queries can contain customer fields along with product fields and other non-customer data. Here, we need to ensure that the customer is authenticated and has the correct authenticity levels (&lt;a href="https://developer.okta.com/docs/guides/step-up-authentication/main/"&gt;ACR Value&lt;/a&gt;) whenever a field or mutation containing customer information is used in the query. So, we need a way to control this granularly for different data points in the schema. The directive &lt;code&gt;@isAuthenticated&lt;/code&gt; is used for this purpose.&lt;/p&gt;
&lt;p&gt;The directive is defined in the schema as follows -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;scalar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ACRValue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@specifiedBy(url:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;https://example.com/zalando-acr-value&amp;quot;)&lt;/span&gt;

&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@isAuthenticated(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ACR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;which&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;indicates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;level&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;authenticity&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;perform&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;operation.&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;Optional.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;not&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;provided&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;behavior&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;simply&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;validate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;authenticated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ACR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;requirements.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;acrValue:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ACRValue&lt;/span&gt;
&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, it is used in query and mutation definitions as follows -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@isAuthenticated&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;updateCustomerInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;phoneNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UpdateCustomerInfoResult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@isAuthenticated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acrValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
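&lt;p&gt;A minimal sketch of how such a check could wrap a resolver is shown below. It is not our production implementation: the ordering of ACR levels in &lt;code&gt;ACR_LEVELS&lt;/code&gt;, the &lt;code&gt;context.user&lt;/code&gt; shape, and the error messages are all assumptions made for illustration.&lt;/p&gt;

```javascript
// Assumption: ACR values are ordered from weakest to strongest.
const ACR_LEVELS = ["LOW", "MEDIUM", "HIGH"];

// Returns a new resolver that enforces authentication before delegating.
function isAuthenticated(resolve, acrValue) {
  return function guardedResolver(parent, args, context, info) {
    // Hypothetical: the HTTP layer puts the authenticated user on the context.
    const user = context.user;
    if (!user) {
      throw new Error("UNAUTHENTICATED");
    }
    // Without an acrValue, being authenticated is enough.
    if (acrValue !== undefined) {
      const required = ACR_LEVELS.indexOf(acrValue);
      if (required === -1) {
        throw new Error("Unknown ACR value in schema: " + acrValue);
      }
      // An unknown user level yields -1 and therefore never satisfies the check.
      const actual = ACR_LEVELS.indexOf(user.acr);
      if (required > actual) {
        throw new Error("STEP_UP_REQUIRED");
      }
    }
    return resolve(parent, args, context, info);
  };
}
```

&lt;p&gt;Keeping the check in a wrapper means the resolver itself stays free of authentication logic, and the schema remains the single place where the requirement is declared.&lt;/p&gt;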

&lt;h3&gt;&lt;code&gt;@sensitive&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;We expose sensitive customer data via our GraphQL API, such as the email address, customer name, phone number, and address, to render the customer profile page. We also use observability and monitoring tools for logging and tracing. We do not want such sensitive customer data in the logs and traces. So, we need a way to control logging so that the logs contain enough information to debug issues but not sensitive customer data. The directive &lt;code&gt;@sensitive&lt;/code&gt; is used for this purpose.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@sensitive(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;An&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;optional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;reason&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;why&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;marked&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sensitive&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;reason:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String&lt;/span&gt;
&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ARGUMENT_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, it is used in a mutation definition as follows -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;updateCustomerInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="n"&gt;sensitive&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Customer email address&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;phoneNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@sensitive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Customer phone number&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UpdateCustomerInfoResult&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Adding &lt;code&gt;@sensitive&lt;/code&gt; to the correct arguments is a manual step that is easy to forget. So, we also rely on a schema linter that automatically fails when a field/argument name contains sensitive keywords like &lt;code&gt;password&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;bank&lt;/code&gt;, &lt;code&gt;bic&lt;/code&gt;, &lt;code&gt;account&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, &lt;code&gt;order&lt;/code&gt;, &lt;code&gt;token&lt;/code&gt;, &lt;code&gt;voucher&lt;/code&gt;, &lt;code&gt;customer&lt;/code&gt;, etc. but is not marked with the directive. This way, we can ensure we do not forget to add &lt;code&gt;@sensitive&lt;/code&gt; to the correct fields/arguments.&lt;/p&gt;
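&lt;p&gt;The core of such a lint rule could look roughly like the sketch below. The input shape (a list of arguments with their directive names) and the function name are hypothetical simplifications; a real linter would walk the schema AST instead.&lt;/p&gt;

```javascript
// Keywords from the list above; a real linter may use a longer list.
const SENSITIVE_KEYWORDS = [
  "password", "email", "phone", "bank", "bic", "account",
  "owner", "order", "token", "voucher", "customer",
];

// args: [{ name, directives: [directive names] }] (hypothetical, simplified shape).
// Returns the names of arguments that look sensitive but lack @sensitive.
function findUnmarkedSensitiveArguments(args) {
  return args
    .filter((arg) => !arg.directives.includes("sensitive"))
    .filter((arg) => {
      const lower = arg.name.toLowerCase();
      return SENSITIVE_KEYWORDS.some((keyword) => lower.includes(keyword));
    })
    .map((arg) => arg.name);
}
```

&lt;p&gt;Substring matching is deliberately broad here: a false positive costs one extra annotation, while a false negative can leak customer data into logs.&lt;/p&gt;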
&lt;p&gt;Implementing this directive is also quite simple and does not require any resolver logic. It can be implemented in NodeJS as follows (the implementation is shortened to fit into a post) -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getSensitiveVariables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;sensitiveVariables&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;graphql&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;isSensitive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getArgument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;astNode&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;directives&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;directive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;directive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sensitive&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isSensitive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;sensitiveVariables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;sensitiveVariables&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
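&lt;p&gt;Once the sensitive variable names are known, a logging layer can mask their values before writing anything out. The helper below is a hypothetical sketch of that step; the function name and the &lt;code&gt;[REDACTED]&lt;/code&gt; placeholder are assumptions.&lt;/p&gt;

```javascript
// Returns a copy of the request variables that is safe to log.
// sensitiveNames is the array returned by getSensitiveVariables.
function redactVariables(variables, sensitiveNames) {
  const redacted = {};
  for (const [name, value] of Object.entries(variables)) {
    redacted[name] = sensitiveNames.includes(name) ? "[REDACTED]" : value;
  }
  return redacted;
}
```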

&lt;h3&gt;&lt;code&gt;@requireExplicitEndpoint&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;With GraphQL, all varieties of HTTP requests fit into one single pattern - &lt;code&gt;POST /graphql&lt;/code&gt;. As a result, techniques and tools available for REST APIs, like rate limiting, bot protection, caching, and other security practices, do not work out of the box. So, we need a way to expose different schema sections via different HTTP endpoints. The directive &lt;code&gt;@requireExplicitEndpoint&lt;/code&gt; is used for this purpose.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@requireExplicitEndpoint(endpoints:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[String!]!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To implement this directive, we override the resolver of the field where it is used. Because GraphQL runs over HTTP, the resolver can access the request parameters (like the pathname). We then match the pathname against the list of endpoints provided in the directive and return an error if there is no match.&lt;/p&gt;
&lt;p&gt;This directive allows us to define custom routes for different schema sections and prevents the client from accessing the entire schema via a single HTTP endpoint, &lt;code&gt;POST /graphql&lt;/code&gt;. For example, let's see how we can apply this directive to the &lt;code&gt;updateDeliveryAddress&lt;/code&gt; mutation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;updateDeliveryAddress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;ID&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;newAddress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;CustomerAddress&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;UpdateDeliveryAddressResult&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@requireExplicitEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/customer-addresses&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;So, a mutation like the following will fail with an error when executed via the &lt;code&gt;/graphql&lt;/code&gt; endpoint -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;# POST /graphql&lt;/span&gt;
&lt;span class="k"&gt;mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;updateDeliveryAddress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;1234&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newAddress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Boopathi&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
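&lt;p&gt;The resolver override described earlier can be sketched as follows. This is a simplified illustration: it assumes the HTTP layer exposes the request pathname as &lt;code&gt;context.request.pathname&lt;/code&gt;, which is a hypothetical shape, and the error message is made up.&lt;/p&gt;

```javascript
// Returns a new resolver that only runs when the request arrived
// on one of the endpoints listed in the directive.
function requireExplicitEndpoint(resolve, endpoints) {
  return function endpointGuard(parent, args, context, info) {
    // Hypothetical: the HTTP layer stores the request pathname on the context.
    const pathname = context.request.pathname;
    if (!endpoints.includes(pathname)) {
      throw new Error(`This field is not available via ${pathname}`);
    }
    return resolve(parent, args, context, info);
  };
}

// Usage with a stand-in resolver for updateDeliveryAddress.
const updateDeliveryAddress = requireExplicitEndpoint(
  (_, { id }) => ({ id }),
  ["/customer-addresses"]
);
```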

&lt;h3&gt;&lt;code&gt;@draft&lt;/code&gt;, &lt;code&gt;@allowedFor&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;We use persisted queries and define different schema stability levels for different sections of the schema. We have a separate blog post explaining the details of &lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;how Zalando uses persisted queries&lt;/a&gt; and how we think about schema stability and granular control.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;@draft&lt;/code&gt; and &lt;code&gt;@allowedFor&lt;/code&gt; directives are used for this purpose. They prevent clients from persisting a query that uses parts of the schema that are not stable yet.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c"&gt;# Draft&lt;/span&gt;
&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@draft&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;

&lt;span class="c"&gt;# Restricted usage: Only for the specified components&lt;/span&gt;
&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@component(name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;QUERY&lt;/span&gt;
&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@allowedFor(componentNames:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[String!]!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
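&lt;p&gt;As a rough sketch, a check at persist time could look like the code below. It assumes that a query names its component via &lt;code&gt;@component&lt;/code&gt;, that draft fields can never be persisted, and that restricted fields can only be persisted by the listed components. The function name and the input shape are hypothetical; the linked post describes the actual rules.&lt;/p&gt;

```javascript
// componentName: the value of @component(name: ...) on the query.
// fields: the schema fields the query selects, in a simplified hypothetical shape:
//   { name, draft: boolean, allowedFor: string[] | undefined }
// Returns a list of reasons why the query cannot be persisted (empty if it can).
function checkPersistable(componentName, fields) {
  const errors = [];
  for (const field of fields) {
    if (field.draft) {
      errors.push(`${field.name} is a draft and cannot be persisted`);
    } else if (field.allowedFor !== undefined) {
      if (!field.allowedFor.includes(componentName)) {
        errors.push(`${field.name} is not allowed for component ${componentName}`);
      }
    }
  }
  return errors;
}
```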

&lt;h3&gt;&lt;code&gt;@final&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Enums in GraphQL are tricky to evolve. Adding a new value to an enum is not considered a breaking change, but it is still a "dangerous" change. It is "dangerous" because the client might not have a handler for the new value. It is easy to update the client code for web applications, but for native mobile apps already shipped to the app store, it is impossible to update the client code. Though we practice defensive coding to handle unknown values, we still need a way to control the evolution of enums in a safe manner. The directive &lt;code&gt;@final&lt;/code&gt; is used for this purpose.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ENUM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This directive has no implementation at all - i.e., it does not need any runtime behavior. It is only used by our GraphQL linter, which runs at build time and prevents the addition of new values to enums marked as final. When we want to make a dangerous change, we remove the &lt;code&gt;@final&lt;/code&gt; directive in a first pull request and reason about whether old apps would break because of this "dangerous" change. After extending the enum, we add the directive back in a separate pull request. This process is cumbersome on purpose: making dangerous changes must be more complicated, and it is a trade-off we are willing to make.&lt;/p&gt;
&lt;p&gt;The ideal situation would be that all enums are treated as final by default, and this directive is never required in the first place. Until then, your use case might warrant such directives to keep schema evolution smooth.&lt;/p&gt;
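&lt;p&gt;The build-time check could be sketched as a comparison between the enum on the base branch and the enum in the pull request. The input shape and the function name below are hypothetical; the sketch also assumes that the linter looks at whether the enum was final before the change, which is what forces the directive to be removed in a separate pull request first.&lt;/p&gt;

```javascript
// oldEnum / newEnum: { name, values: string[], isFinal: boolean } (hypothetical shape).
// Returns one error per value added to an enum that was marked @final.
function checkFinalEnum(oldEnum, newEnum) {
  if (!oldEnum.isFinal) return [];
  const added = newEnum.values.filter((value) => !oldEnum.values.includes(value));
  return added.map((value) => `${oldEnum.name} is @final; cannot add value ${value}`);
}
```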
&lt;h3&gt;&lt;code&gt;@extensibleEnum&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;While we are discussing enums, here is another use case for directives: enums that start out serving a single use case and are commonly extended later. Creating a native enum is tricky in these cases, because extending it has dangerous consequences. At Zalando, we have RESTful API guidelines, and one of the recommendations is to use &lt;a href="https://opensource.zalando.com/restful-api-guidelines/#112"&gt;x-extensible-enum&lt;/a&gt; to represent all enums. This recommendation is so that the enums can evolve, and the client is aware, right from the name, that it is extensible. We use the directive &lt;code&gt;@extensibleEnum&lt;/code&gt; for this purpose. The type in GraphQL for the field would be &lt;code&gt;String&lt;/code&gt;, and the directive is used to provide the list of allowed values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@extensibleEnum(values:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[String!]!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, it is used in a query definition as follows -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;CustomerConsent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@extensibleEnum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;GRANTED&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;REJECTED&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;@extensibleEnum&lt;/code&gt;, we found that contributors to the schema are more likely to think about the evolution of schema. We also noticed that contributors are more likely to use this directive for defining enums than the GraphQL native enum, as this directive is more explicit about the extensibility of the enum.&lt;/p&gt;
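&lt;p&gt;On the client side, an extensible enum calls for a fallback branch, since the list of values may grow after the client has shipped. The function below is a hypothetical sketch of such defensive handling for the &lt;code&gt;CustomerConsent.status&lt;/code&gt; field; the returned labels are made up for illustration.&lt;/p&gt;

```javascript
// Maps a consent status to a display label and tolerates values
// that were added to the extensible enum after this client shipped.
function describeConsentStatus(status) {
  switch (status) {
    case "GRANTED":
      return "Consent granted";
    case "REJECTED":
      return "Consent rejected";
    default:
      // Unknown value: degrade gracefully instead of failing.
      return "Consent status unknown";
  }
}
```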
&lt;h3&gt;&lt;code&gt;@resolveEntityId&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Our GraphQL schema defines certain types as Entities related to the &lt;a href="https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model"&gt;Entity-Relationship model&lt;/a&gt;. We define entities abstractly as the basic building blocks for designing customer experience. For example, product, customer, brand, etc. are some entities. The entity definition has some properties -&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it follows a specific template/pattern of resolvers that is mostly the same for all entities&lt;/li&gt;
&lt;li&gt;it is of a specific type name as defined in the schema&lt;/li&gt;
&lt;li&gt;it has a unique ID of a specific pattern (for example, &lt;code&gt;entity:product:1234&lt;/code&gt; for &lt;code&gt;type Product&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;it has a set of fields that are common to all entities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To solve these cases holistically, we use the directive &lt;code&gt;@resolveEntityId&lt;/code&gt; defined against each entity definition in the schema.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@resolveEntityId(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;An&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;optional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;override&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;entity&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;its&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ID&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;override:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String&lt;/span&gt;
&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;OBJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The usage is as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;implements&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Entity&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@resolveEntityId&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The implementation of this directive is two-fold. First, we generate TypeScript code based on the &lt;code&gt;resolveEntityId&lt;/code&gt; directive. This code generation produces the boilerplate for the entity ID type definitions and resolvers, for example the &lt;code&gt;__typename&lt;/code&gt; resolvers. Second, at runtime, an &lt;code&gt;id&lt;/code&gt; resolver is added to wrap the entity IDs. For a product, for example, &lt;code&gt;entity:product:1234&lt;/code&gt; is the full entity ID, and &lt;code&gt;1234&lt;/code&gt; is the SKU of the product.&lt;/p&gt;
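&lt;p&gt;The runtime half can be pictured with a minimal TypeScript sketch. The helper names &lt;code&gt;toEntityId&lt;/code&gt; and &lt;code&gt;fromEntityId&lt;/code&gt;, the lowercasing of the type name, and the way &lt;code&gt;override&lt;/code&gt; is applied are assumptions made for illustration, not necessarily how our server implements it:&lt;/p&gt;

```typescript
// Hypothetical helpers for illustration; the names and the lowercasing rule are assumptions.

// Wrap a raw backend ID (for a product, the SKU) into a full entity ID.
function toEntityId(typeName: string, rawId: string, override?: string): string {
  // Prefer the directive's optional `override` argument over the type name.
  const entityName = override ?? typeName.toLowerCase();
  return `entity:${entityName}:${rawId}`;
}

// Unwrap a full entity ID back into its entity name and raw ID.
function fromEntityId(entityId: string): { entityName: string; rawId: string } {
  const parts = entityId.split(":");
  // This sketch assumes that raw IDs never contain a colon.
  if (parts.length !== 3 || parts[0] !== "entity") {
    throw new Error(`Not an entity ID: ${entityId}`);
  }
  return { entityName: parts[1], rawId: parts[2] };
}

console.log(toEntityId("Product", "1234")); // prints "entity:product:1234"
```

&lt;p&gt;Keeping wrapping and unwrapping in one place means individual resolvers only ever deal with raw IDs such as the SKU.&lt;/p&gt;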
&lt;h2&gt;Part 2: Query directives at Zalando&lt;/h2&gt;
&lt;p&gt;Query directives are directives that are defined for the &lt;code&gt;ExecutableDirectiveLocation&lt;/code&gt; locations, which are listed in the table below. In each row, consider &lt;code&gt;@foo&lt;/code&gt; a directive declared for the location named in the first column:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@foo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;LOCATION_IN_FIRST_COLUMN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: left;"&gt;Directive Location&lt;/th&gt;
&lt;th style="text-align: left;"&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;QUERY&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;query name @foo {}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;MUTATION&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;mutation name @foo {}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;SUBSCRIPTION&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;subscription name @foo {}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;FIELD&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;query { product @foo {} }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;FRAGMENT_DEFINITION&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;fragment x on Query @foo { }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;FRAGMENT_SPREAD&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;query { ...x @foo }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;INLINE_FRAGMENT&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;query { ... @foo { } }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: left;"&gt;VARIABLE_DEFINITION&lt;/td&gt;
&lt;td style="text-align: left;"&gt;&lt;code&gt;query ($id: ID @foo) { }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Unlike for schema directives, &lt;a href="https://the-guild.dev/graphql/tools"&gt;graphql-tools&lt;/a&gt; does not support attaching functions to resolvers for query directives. Its maintainers make an excellent point: query directives are good for annotating the query with metadata, not for resolver logic. Likewise, most of our use cases attach metadata at the query level; one case, for observability and monitoring, adds behavior to a field.&lt;/p&gt;
&lt;p&gt;For query metadata, the implementation is as simple as going through the parsed GraphQL document (&lt;a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree"&gt;AST - Abstract Syntax Tree&lt;/a&gt;) and extracting the metadata from the query directives. We use a two-step approach for the use case that adds behavior to a field - specifically the &lt;code&gt;@omitErrorTag&lt;/code&gt; directive (discussed below). In the first step before execution, we extract the field paths of the fields that have this directive. In the second step, after execution, we match the error paths and omit the error tag for those extracted paths.&lt;/p&gt;
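&lt;p&gt;The two steps can be sketched in TypeScript as follows. To stay self-contained, the sketch uses a simplified &lt;code&gt;SelectedField&lt;/code&gt; structure instead of a real parsed document, and the function names are hypothetical; treat it as an illustration of the idea under those assumptions rather than our production code:&lt;/p&gt;

```typescript
// Minimal stand-in for a parsed selection set; a real server would walk the
// AST produced by its GraphQL parser instead. All names here are hypothetical.
interface SelectedField {
  name: string; // response key of the field
  directives: string[]; // names of the directives attached to the field
  selections: SelectedField[];
}

// Step 1 (before execution): collect dotted paths of fields carrying @omitErrorTag.
function collectOmittedPaths(fields: SelectedField[], parent: string[] = []): string[] {
  let paths: string[] = [];
  for (const field of fields) {
    const path = parent.concat(field.name);
    if (field.directives.indexOf("omitErrorTag") !== -1) {
      paths.push(path.join("."));
    }
    paths = paths.concat(collectOmittedPaths(field.selections, path));
  }
  return paths;
}

// Step 2 (after execution): mark the span as an error only if at least one
// error path lies outside every omitted path.
function shouldTagSpanAsError(errorPaths: (string | number)[][], omitted: string[]): boolean {
  return errorPaths.some((errorPath) => {
    // Drop list indices so ["products", 0, "inWishlist"] becomes "products.inWishlist".
    const key = errorPath.filter((segment) => typeof segment === "string").join(".");
    const isOmitted = omitted.some(
      (prefix) => key === prefix || key.indexOf(prefix + ".") === 0
    );
    return !isOmitted;
  });
}
```

&lt;p&gt;In this sketch, an error at &lt;code&gt;product.inWishlist&lt;/code&gt; leaves the span untouched when that field carries the directive, while an error on any other field still marks the span as an error.&lt;/p&gt;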
&lt;h3&gt;&lt;code&gt;@component&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;@component&lt;/code&gt; directive defines a component name by the client for the query. This directive is used in our observability and monitoring tools and for schema stability - restricted usage in production. See our blog post &lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;GraphQL persisted queries and schema stability&lt;/a&gt; for more details.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@component(name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;QUERY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;&lt;code&gt;@tracingTag&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;@tracingTag&lt;/code&gt; directive defines an &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; tracing tag for the query. Using this directive on a query adds a specific client-defined tag to our tracing spans. Clients can then filter traces by this tag to find the traces for a particular query. This directive is useful for debugging, troubleshooting, and monitoring a specific set of queries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@tracingTag(value:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;QUERY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;MUTATION&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;SUBSCRIPTION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;&lt;code&gt;@omitErrorTag&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;@omitErrorTag&lt;/code&gt; directive is used to omit marking the tracing span as an error. This directive can be used on a particular field in the query. This directive lets the client define that some field errors are noncritical and should not be reported for alerting. The 24x7 on-call team can then focus on the critical errors and not be distracted by the noise.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@omitErrorTag&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;&lt;code&gt;@maxCountInBatch&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;@maxCountInBatch&lt;/code&gt; directive is used at the query level to declare the maximum number of queries that can be batched together in a single request. This directive is client-controlled, i.e., it is only available during &lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;build/persist time&lt;/a&gt;. At runtime, the directive is used to prevent overfetching of data and bot abuse of the GraphQL API.&lt;/p&gt;
&lt;p&gt;Our GraphQL server allows batching of multiple queries in a single batch. With persisted queries, we only send the id of the query, and the client cannot send a raw query in production. So, the system design allows the safe usage of &lt;code&gt;maxCountInBatch&lt;/code&gt; controlled by the clients.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@maxCountInBatch(value:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Int!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;QUERY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
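&lt;p&gt;A runtime check along these lines could look like the TypeScript sketch below. The &lt;code&gt;PersistedQuery&lt;/code&gt; shape, the function name, and the rule that a query without the directive imposes no limit are assumptions made for illustration only:&lt;/p&gt;

```typescript
// Hypothetical registry entry created at persist time; field names are assumptions.
interface PersistedQuery {
  id: string;
  maxCountInBatch?: number; // taken from @maxCountInBatch(value: ...), if declared
}

// A batch is accepted only if its size does not exceed the limit declared by
// any query in it. Assumption: a query without the directive imposes no limit.
function isBatchAllowed(batch: PersistedQuery[]): boolean {
  return !batch.some(
    (query) => batch.length > (query.maxCountInBatch ?? Infinity)
  );
}
```

&lt;p&gt;Because the limit is read from the persisted query rather than from the incoming request, a caller cannot raise it at runtime.&lt;/p&gt;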

&lt;h3&gt;Example usage of all of the above query directives&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;product_card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;!)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# component directive&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nc"&gt;web&lt;/span&gt;&lt;span class="err"&gt;-product-card&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# tracing tag directive to add a tag to the tracing span&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;tracingTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nc"&gt;slo&lt;/span&gt;&lt;span class="err"&gt;-1s&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# maxCountInBatch directive to limit the number of queries in a batch request&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;maxCountInBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;50)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nc"&gt;product&lt;/span&gt;&lt;span class="err"&gt;(id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="nc"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;brand&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c"&gt;# omitErrorTag directive to omit marking the tracing&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c"&gt;# span as an error if inWishlist field errors&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;inWishlist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@omitErrorTag&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Query directives allow clients to define metadata and, on rare occasions, behavior. Schema directives, on the other hand, allow the server to define behavior, validation, and resolution logic in a declarative manner. Schema directives carry the added advantage that the servers can make breaking changes to these directives, as these directives are not consumed by the client - they only experience the resulting behavior. It's important when designing a directive to consider its properties, use cases, trade-offs, and where the control should lie.&lt;/p&gt;
&lt;p&gt;The use cases outlined in this blog post represent some of the ways we use GraphQL directives at Zalando. There are numerous other cases that we'll cover in future blog posts. I hope this piece provides a good starting point for you to explore GraphQL directives and their practical applications.&lt;/p&gt;
&lt;h2&gt;Further reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://the-guild.dev/graphql/tools/docs/schema-directives"&gt;Schema Directives - GraphQL Tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;GraphQL persisted queries and Schema stability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/04/modeling-errors-in-graphql.html"&gt;Modeling Errors in GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/03/optimize-graphql-server-with-lookaheads.html"&gt;Optimize GraphQL Server with Lookaheads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="GraphQL"/><category term="APIs"/><category term="Backend"/></entry><entry><title>My First Year as an Engineering Manager at Zalando</title><link href="https://engineering.zalando.com/posts/2023/09/my-first-year-as-an-engineering-manager-at-zalando.html" rel="alternate"/><published>2023-09-26T00:00:00+02:00</published><updated>2023-09-26T00:00:00+02:00</updated><author><name>Kaan Bobac</name></author><id>tag:engineering.zalando.com,2023-09-26:/posts/2023/09/my-first-year-as-an-engineering-manager-at-zalando.html</id><summary type="html">&lt;p&gt;Reflecting on my first year as an Engineering Manager at Zalando.&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Starting a New Journey&lt;/h3&gt;
&lt;p&gt;Taking the next step in your career is always an exciting adventure, even if it comes with challenges. For me, the biggest challenge was becoming an engineering manager in a foreign country. Stepping into a new country as an expat, with a culture I wasn't all that familiar with, was a completely fresh start.
When I said yes to my new journey, I started researching Zalando to learn more.&lt;/p&gt;
&lt;p&gt;My first stop was the Zalando Engineering Blog - a real treasure for someone like me who was curious about the engineering culture and practices at what would be my new company. Reading post after post, I was amazed by everything - the interesting engineering topics, challenges, solutions, and approaches.
Since I love reading and writing blog posts, I even dreamt of contributing here someday. Now, looking at today and thinking about my first year, I see that I've gained lots of experiences and learnings that I can put into words. While one post won't cover all the details, I believe I can create a short but nice summary of my journey so far. So, let's begin.&lt;/p&gt;
&lt;h3&gt;First Impressions&lt;/h3&gt;
&lt;p&gt;On my first day, as I stepped into the office, one thing truly resonated with me. A phrase was inscribed on the floor: &lt;em&gt;"Always put yourself in the customer’s shoes"&lt;/em&gt;. This is one of Zalando's founding mindsets, as I would learn over the next few days. It was also the first of many reminders that would constantly keep me aware of how important customers are for Zalando.&lt;/p&gt;
&lt;p&gt;As I walked around and met with various people, I realised the impressive international working environment with a rich multicultural and diverse setup.
From day one and with each passing day, I've come to believe that this is Zalando's greatest wealth. On a personal note, having colleagues from all corners of the world, sharing lunches and coffee breaks, and learning from their diverse experiences are great benefits that cannot be found in any contract.&lt;/p&gt;
&lt;h3&gt;Onboarding&lt;/h3&gt;
&lt;p&gt;As I settled in, my onboarding journey kicked off right away. Zalando provides an excellent &lt;a href="https://engineering.zalando.com/posts/2021/04/making-the-remote-onboarding-a-success.html"&gt;onboarding program&lt;/a&gt; for newbies. It covers not only technical topics but also goes into Zalando's culture, with a lot of inspiring meetings. This also creates an opportunity to connect with colleagues from different departments that you may not have had a chance to interact with otherwise.&lt;/p&gt;
&lt;p&gt;Besides Zalando's onboarding, it was important for me to really understand how my department and team contribute to the company. So, I focused on what we do and how our work helps Zalando’s success. My department is Pricing Platform, and our main scope is pricing and discounting tools and algorithms.
The more I learnt, the more I was amazed by how much data science, engineering, and analytics are involved in something as simple as a 20% discount on the website.
For me, the real test is whether I can explain the project details to my dad, who doesn't know much about tech beyond using a smartphone. If he gets it, then I'm pretty sure I truly understand what we do in our department. When I told my dad about my department's job, I started with, &lt;em&gt;"Dad, you will not believe how that simple discount you see on the webpage is calculated"&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;Cyber Week&lt;/h3&gt;
&lt;p&gt;My first big challenge was Cyber Week. Since I joined Zalando just a month before Cyber Week, everyone was talking about it. Coming from a country that doesn't have Cyber Week, I initially thought (I'll admit it shamelessly) that Zalando was having a week of cyber security tests, which actually sounded pretty cool. But then, when I understood what Cyber Week was really about, I realised how important it was for Zalando.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;readiness for Cyber Week&lt;/a&gt; and all the preparations that go into it completely impressed me. The structured game plan, &lt;a href="https://engineering.zalando.com/posts/2023/01/how-we-manage-our-1200-incident-playbooks.html"&gt;playbooks&lt;/a&gt;, situation rooms, and incident processes were all new concepts to me, and I was amazed by what operational excellence can look like.
There’s no way I can cover all the details of Cyber Week in this post, but there's one thing I have to mention. During the final minutes of Black Friday, there's this tradition of virtually gathering with the shift crew and watching the order monitoring spike up like a hockey stick, marking the peak order rate during Black Friday. That moment made a strong impact on me, showing how our little contributions as software engineers play a role in those big successes.&lt;/p&gt;
&lt;h3&gt;Growing Together&lt;/h3&gt;
&lt;p&gt;While I've mostly focused on the technical and operational aspects of Zalando, I can't skip the people part, of course. Zalando has an amazing culture when it comes to managing and developing people. They provide different ways to grow with clear expectations. One thing that really surprised me was that Zalando offers both management and technical expert paths for software engineers. For example, after becoming a Senior Software Engineer, you can choose to either become an &lt;a href="https://engineering.zalando.com/posts/2023/01/how-you-can-have-impact-as-an-engineering-manager.html"&gt;Engineering Manager&lt;/a&gt; or a &lt;a href="https://engineering.zalando.com/posts/2022/02/principal-engineering-at-zalando.html"&gt;Principal Engineer&lt;/a&gt;. This is quite unique, something I hadn't encountered before in my past experiences. It’s not about getting pushed into management; instead you have the opportunity to advance based on your skills and aspirations at the same level as management roles.&lt;/p&gt;
&lt;h3&gt;Feedback Culture&lt;/h3&gt;
&lt;p&gt;Talking about &lt;a href="https://engineering.zalando.com/posts/2022/07/growth-engineering-at-zalando.html"&gt;career growth&lt;/a&gt;, I shouldn't forget to mention performance evaluation. This is a vital aspect of any organization's success. Zalando recognizes this importance and has implemented effective practices to ensure that performance management is done right. Performance evaluation at Zalando starts with collecting feedback, which is the most important part of the process, in my opinion. The company provides an ideal environment for sharing and receiving feedback. You can receive feedback from your peers, team members, and stakeholders, essentially from the people you interact with daily. This culture of openness to feedback has been invaluable in helping me understand where we can improve as a team and how I can grow as a leader beyond my current capabilities.&lt;/p&gt;
&lt;p&gt;Moreover, in my role as a leader, I know the importance of giving constructive feedback and facilitating performance evaluations for my team members. Zalando has several effective practices in place to support leaders in this regard. We receive support from experienced leaders, seek guidance from our peers in different departments, and collaborate with P&amp;amp;O (People and Operations) business partners. Throughout the year, we also have access to various training sessions, coaching sessions, and leaders' enablement programs. This comprehensive support to leaders makes sharing constructive feedback, which ultimately helps everyone reach their full potential, a seamless and rewarding part of the job.&lt;/p&gt;
&lt;h3&gt;It Is Not All About Work&lt;/h3&gt;
&lt;p&gt;While I've mostly shared the business aspect of Zalando, I must acknowledge that Zalando also knows how to have a good time. There are a lot of communities built around various interests: running, fishing, beach volleyball, board games, and more technical topics like the Python or Linux guilds.&lt;/p&gt;
&lt;p&gt;The company also places great importance on continuous improvement, which is, of course, a crucial aspect of a software engineer's work. Departments organize hack weeks; for instance, our department had an Innovation Sprint where individuals pitched initiatives using cutting-edge technologies like generative AI. Every month, Tech Academy hosts a Coffee Bytes event, a casual coffee meet-up with no set agenda, allowing members of the tech community to connect and make friends. As these examples show, despite the importance of business and customers, having fun is equally important at Zalando. I realized this right from the beginning when I saw one of the t-shirts with the slogan &lt;em&gt;"Zalando, we dress code"&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;What's Next?&lt;/h3&gt;
&lt;p&gt;Finishing up this look back, my first year as an Engineering Manager at Zalando has been a really good journey with lots of learning, growing, and experiencing new things. The diverse and dynamic environment, along with the focus on people and having fun, has been like magic. Thinking about what is next, I'm looking forward to continuing to add my small touch to Zalando's great work, enjoying the mix of tough challenges, teamwork, and moments that make us laugh. Here's to more times of growing, trying new things, and maybe getting a few more awesome sneakers along the way!&lt;/p&gt;</content><category term="Zalando"/><category term="Management"/><category term="Onboarding"/><category term="Leadership"/></entry><entry><title>Sunrise: Zalando's developer platform based on Backstage</title><link href="https://engineering.zalando.com/posts/2023/08/sunrise-zalandos-developer-platform-based-on-backstage.html" rel="alternate"/><published>2023-08-03T00:00:00+02:00</published><updated>2023-08-03T00:00:00+02:00</updated><author><name>Lacey Nagel</name></author><id>tag:engineering.zalando.com,2023-08-03:/posts/2023/08/sunrise-zalandos-developer-platform-based-on-backstage.html</id><summary type="html">&lt;p&gt;Lessons learned from adopting Backstage as Developer Platform at Zalando.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Since 2021, Zalando has invested in building a developer portal called Sunrise, which aims to become the starting point for Builders at Zalando. The portal is based on Spotify's &lt;a href="https://github.com/backstage/backstage"&gt;Backstage platform&lt;/a&gt; with additional extensions built internally. Sunrise enables everyone at Zalando to view and discover information about teams, applications, APIs, events, CI/CD pipelines, Infrastructure accounts and costs, and much more. In this post, we explore how adopting Backstage impacted the daily life of Software Engineers at Zalando and get insights from Lacey and Arthur, who led the efforts on the Product and Engineering side.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sunrise: application view" src="https://engineering.zalando.com/posts/2023/08/img/sunrise-application-view.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 1. Sunrise: detailed information about applications&lt;/figcaption&gt;

&lt;h3&gt;Lacey, what's your role in creating Sunrise?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; Funny story, I actually ran a vision workshop with the team responsible for the Developer Portal at Zalando before I became a member of the department! As the official product manager, I helped solidify the vision with a platform mindset and an experience strategically focused on interoperability and usability. I worked with engineering stakeholders and the engineering manager to devise a strategy and roadmap to give us the best chance at efficient implementation, good adoption, and improved satisfaction from users so that more platform and infra teams would want to contribute. And of course, I'm probably the loudest promoter of our platform's &amp;amp; contributors' solutions 😅&lt;/p&gt;
&lt;h3&gt;Arthur, how about you? How are you involved here?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Arthur:&lt;/em&gt; Hello! I actually first got involved with Sunrise as an early adopter and active user, before moving internally to the team in May 2022. Since then, I've been leading the engineering team, driving the delivery of new features, coordinating support and maintenance on the platform, contributing to the product vision and ensuring our alignment with the organizational strategy, all the while managing our amazing team of 4 software engineers.&lt;/p&gt;
&lt;h3&gt;Why did Zalando choose Backstage for its developer portal? Was any similar solution in place before?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; Before Sunrise, we had over 100 disconnected interfaces &amp;amp; resources, plus "the Developer Console" which centralized links to resources mostly for the &lt;em&gt;Code&lt;/em&gt; through &lt;em&gt;Deploy&lt;/em&gt; steps of the Developer Journey. After recognizing that we'd need to evolve into a platform to achieve our vision, we considered several options (including building everything ourselves), and Spotify happened to reach out while we were still in the discovery &amp;amp; design phase. What made it a great fit at the time was that we had extremely limited resources and skills (both engineering &amp;amp; design) on the team, so we recognized that having an out-of-the-box solution for a design system and plugins like the basic Software Catalog would be necessary for us to deliver something fast enough to justify the strategic investment &amp;amp; potential risk of failure.&lt;/p&gt;
&lt;h3&gt;I hear that our Engineers are really excited about Sunrise. Why and what features are they most excited about?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; From pretty much the beginning, the topic of interoperability has been prevalent as it's what enables us to eliminate friction from the day-to-day tasks Builders need to perform. Users really celebrated a deeper integration that two contributing teams collaborated on to make the experience of deploying data pipelines more seamless, and features that make org structure and reporting lines more transparent have also had very quick and wide adoption. We also have some very popular Platform features that enable all our users (regardless of whether they actually own services or not) to see personalized content by default and further customize personalization settings. The day-to-day features that people actually use the most are the action-oriented, easy-access links on the homepage, the CI/CD interface, Search, and the Application catalog, which includes integrations to tooling and resources across the SDLC.&lt;/p&gt;
&lt;h3&gt;How do you measure adoption of the platform and along each part of the SDLC?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; Since our vision for Sunrise was to make it the "daily" starting point for Builders, we monitor the share of Builders using the platform on weekdays, and weekly as our primary success metrics for adoption. Since not all features actually need to be used daily (for example, every single person won't be registering a new application every working day), we let contributors determine what makes sense for their integrations and we provide them with a centralized dashboard and support with Analytics to make it easier to understand usage. In the future, we hope to map adoption of features to more tangible improvements in operational performance.&lt;/p&gt;
&lt;h3&gt;What features were added on top of Backstage's open-source project?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; That's actually a pretty big question. For our earliest release, we added a personalized homepage with easy-access links to things engineers use often like open PRs and recently deployed pipelines, and added a support overview that they were used to from previous tooling, and our CI/CD platform that is internally built. Since then, we've integrated 27 other tools &amp;amp; services through 30 front-end plugins ranging from our internal &lt;a href="https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html"&gt;machine learning platform&lt;/a&gt;, through widgets that make users aware of base image vulnerabilities or delivery performance insights, to a personalized dashboard covering all aspects of critical business events, like Cyber Week. Some of those plugins were contributed back to the open source community, such as the interface for our &lt;a href="https://github.com/zalando/backstage-plugin-api-linter"&gt;API Linter, Zally&lt;/a&gt;. Our platform features personalization – especially for users who don't own components themselves, but who have some accountability for them – increased adoption amongst &lt;a href="https://engineering.zalando.com/posts/2022/02/principal-engineering-at-zalando.html"&gt;principal engineers&lt;/a&gt; and leadership, and has helped contributors to Sunrise provide similar reporting-like features that they never had before with very little effort that in turn drive more regular use within engineering teams.&lt;/p&gt;
&lt;h3&gt;Which team operates the platform? Any challenges that you had to overcome to support Zalando's user base?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Arthur:&lt;/em&gt; Our team is called Builder Portal, and has been operating and evolving the platform since its inception. Our biggest technical challenge at Zalando's scale has been managing the various pre-existing sources of data and determining how to sync them with &lt;a href="https://backstage.io/docs/features/software-catalog/"&gt;Backstage's Catalog&lt;/a&gt; system. We currently have over 40k registered entities (between applications, teams, and users) which we sync daily with the respective source of truth services. In terms of adoption, the biggest challenge from the get-go was to make sure that the experience is approachable and consistent for all users, regardless of which part of the development journey they are working on. Builders can be very opinionated in their ways of working, so making sure that our decisions are well thought out and will ultimately support them in working productively and happily can be challenging sometimes, but it's also very rewarding. And hey – we're Builders ourselves too, so we also enjoy using Sunrise while maintaining it.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; A lot of what we see impacting adoption of new features is that people have built habits – and incredibly long bookmark lists – to make up for deficiencies of the fragmented tooling. What turned out to be most impactful for solving this problem is ensuring that we redirect users from old features to the new ones in Sunrise shortly after making them generally available and then &lt;em&gt;completely shut down&lt;/em&gt; the old tooling.&lt;/p&gt;
&lt;h3&gt;Backstage is open-source. How does Zalando and your team approach upstream contributions? Can you name some notable examples?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Arthur:&lt;/em&gt; Whenever we find some limitation in Backstage in comparison to what we want a feature to look like, we reflect on whether this is something that could impact other adopters of the platform or whether it's a Sunrise-specific problem. If it's the former, we reach out via a GitHub issue (e.g. &lt;a href="https://github.com/backstage/backstage/issues/17481"&gt;bug report&lt;/a&gt; and &lt;a href="https://github.com/backstage/backstage/issues/9805"&gt;feature request&lt;/a&gt;). If we know how to solve it, we also contribute a pull request (e.g. respective &lt;a href="https://github.com/backstage/backstage/pull/17485"&gt;bugfix&lt;/a&gt; and &lt;a href="https://github.com/backstage/backstage/pull/10041"&gt;new feature&lt;/a&gt;). We also keep an eye out for opportunities to share in-house plugins with the community. As mentioned by Lacey earlier, last year we open-sourced our &lt;a href="https://github.com/zalando/backstage-plugin-api-linter"&gt;API Linter plugin&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Backstage Plugin: API linter using Zally under the hood" src="https://engineering.zalando.com/posts/2023/08/img/backstage-plugin-api-linter.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 2. Sunrise: open-sourced API linter plugin&lt;/figcaption&gt;

&lt;h3&gt;How about the internal features? How easy has it been to get contributions from outside of your team?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Arthur:&lt;/em&gt; We have at least ten other plugins (the number grows sporadically) owned and maintained by other teams in Zalando, including our own Continuous Delivery and Machine Learning Platform teams. There's always an initial barrier of entry (as with any other application and framework) for contributors to understand the domain-specific language of Backstage, as well as the standards we have implemented on the platform, especially since many platform teams don't have a lot of front-end engineers available to work on the user interface of their plugins. We invest a lot in creating standard components and documenting our patterns so contributors can spend less time figuring out which button to use and more time improving the overall experience for their users.&lt;/p&gt;
&lt;h3&gt;You recently reached a major milestone – 2,000 PRs merged to the repository and Sunrise replacing multiple internal tools and the prior generation of the developer portal. What's the next big milestone that you look forward to?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; Creating comprehensive visibility into &lt;em&gt;everything&lt;/em&gt; running in production and mapping the relationships between entities – automatically where possible – so that we can centrally support global improvements to the operational health of systems and teams. The &lt;a href="https://engineering.zalando.com/posts/2023/04/how-sboms-change-the-dependency-game.html"&gt;SBOM work&lt;/a&gt; you mentioned in your recent post is a big part of that, but we are also working on surfacing the relationships between entities like data pipelines and applications, as well as the relationships of applications and their components to business problems through a standardized and semi-automated documentation of Domains. Having that oversight will enable us to shift left not only security and compliance, but also productivity, reliability, and cost efficiency by providing insights about the current balance of operational health in relationship to business metrics relevant to our high-level Domains. It will give Builders easier access to the information they need to involve the right stakeholders and make decisions about what kind of work to invest in and when. To put it shortly: we're all a bit happier, more secure, and more efficient when working with transparency and less uncertainty.&lt;/p&gt;
&lt;h3&gt;Any tips that you'd give to teams who are also adopting Backstage as the foundation for their developer portal?&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Lacey:&lt;/em&gt; Haha, the list is long because I've learned a lot over the life of this initiative. I'd sum it up as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Having a &lt;strong&gt;clear, inspirational vision&lt;/strong&gt; that includes (and delineates) the needs of both users and contributors – and that you &lt;em&gt;constantly&lt;/em&gt; communicate – will be key for motivating contributors and for reaching the critical mass of user journeys needed for users to feel the benefit of your platform.&lt;/li&gt;
&lt;li&gt;To drive adoption and impact, look for opportunities to &lt;strong&gt;personalize content&lt;/strong&gt; to make it easier to recognise and understand, and invest in &lt;strong&gt;increasing the interoperability&lt;/strong&gt; along the journeys your users take to complete tasks between both fully integrated interfaces and features, as well as external tooling – and don't forget to shut down old tooling!&lt;/li&gt;
&lt;li&gt;Whether you're using an open source plugin or building something yourself from scratch, &lt;strong&gt;investing in great UX research and design is &lt;em&gt;critical&lt;/em&gt;&lt;/strong&gt; for building an experience that will remain cohesive as it grows – that's important so that your users are enabled to actually find the things you build, and are happy to use them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Arthur:&lt;/em&gt; My tip is to &lt;strong&gt;leverage the power of open source&lt;/strong&gt;! The Backstage Community is ever-growing and provides a lot of interesting, well-maintained plugins for you to make use of, so don't shy away from engaging with it. The framework itself is also constantly evolving and growing its scope, and with some big adopters already leveraging it (including us!), you're sure to see a lot of examples of interesting use cases that will support your teams to be more productive.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Bartosz:&lt;/em&gt; Thanks for the conversation and for walking us through our approach to building a Developer Platform!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;If you would like to know more about Sunrise, check out Henning's talk &lt;a href="https://youtu.be/4EGTa8u-7Ws?t=479"&gt;Cloud native developer experience at Zalando&lt;/a&gt; or the &lt;a href="https://platformengineering.org/talks-library/sunrise-zalandos-internal-developer-platform"&gt;related post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="Platform Engineering"/><category term="Productivity"/><category term="Backend"/><category term="Leadership"/></entry><entry><title>All you need to know about timeouts</title><link href="https://engineering.zalando.com/posts/2023/07/all-you-need-to-know-about-timeouts.html" rel="alternate"/><published>2023-07-26T00:00:00+02:00</published><updated>2023-07-26T00:00:00+02:00</updated><author><name>Anton Ilinchik</name></author><id>tag:engineering.zalando.com,2023-07-26:/posts/2023/07/all-you-need-to-know-about-timeouts.html</id><summary type="html">&lt;p&gt;How to set a reasonable timeout for your microservices to achieve maximum performance and resilience.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Nobody likes to wait. We at Zalando are not an exception. We don't like our customers to wait too long for delivery, we don't like them to wait during checkout, and we don't like microservices that take too long to respond.
In this post, we'll discuss how to set reasonable timeouts for your microservices to achieve maximum performance and resilience.&lt;/p&gt;
&lt;h2&gt;Why set timeout&lt;/h2&gt;
&lt;p&gt;Before we start, let’s answer the simple question: "Why timeout?". A successful response, even if it takes time, is better than a timeout error. Hmm… not always, it depends!&lt;/p&gt;
&lt;p&gt;First of all, if your server does not respond or takes too long to respond, nobody will wait for it. Instead of challenging the patience of your users, follow the fail-fast principle. Let your clients retry or handle an error on their side. When possible return a fallback value.&lt;/p&gt;
&lt;p&gt;Another important aspect is resource utilisation. While a client is waiting for a response, various resources are being utilised: threads, HTTP connections, database connections, etc.
Even if the client has closed the connection, without a proper timeout configuration the request is still being processed on your side, which means that those resources stay busy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Client closed connection" src="https://engineering.zalando.com/posts/2023/07/images/client_closed_connection.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Remember, &lt;strong&gt;when you increase timeouts you potentially decrease the throughput of your application!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Using an infinite or very high timeout is a bad strategy. You won't see the problem for a while, until one of your downstream services gets stuck and your thread pool is exhausted.
Unfortunately, many libraries set default timeouts too high or make them infinite. They aim to attract as many users as possible and try to make their library work in most situations. For production services, this is not acceptable. It can even be dangerous.
For example, for Java's native &lt;code&gt;HttpClient&lt;/code&gt; the default connection and request timeouts are infinite, which is unlikely to fit within your SLA :)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The default timeout is your enemy, always set timeouts explicitly!&lt;/strong&gt;&lt;/p&gt;
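&lt;p&gt;As a minimal sketch of what "explicitly" can look like with the JDK's &lt;code&gt;java.net.http&lt;/code&gt; client (Java 11+): the class name and the 100ms / 700ms values below are illustrative assumptions, not recommendations. The builder methods &lt;code&gt;connectTimeout&lt;/code&gt; and &lt;code&gt;timeout&lt;/code&gt; are the standard ones.&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class ExplicitTimeouts {
    // Illustrative values only; derive real ones from your RTT and latency metrics.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(100);
    static final Duration REQUEST_TIMEOUT = Duration.ofMillis(700);

    // Connection timeout: bounds how long establishing the connection may take.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
            .connectTimeout(CONNECT_TIMEOUT)
            .build();
    }

    // Request timeout: bounds how long we wait for the response to this request.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(REQUEST_TIMEOUT)
            .GET()
            .build();
    }
}
```

&lt;p&gt;Without these two calls, both values stay unset and the client waits indefinitely.&lt;/p&gt;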
&lt;h2&gt;Connection timeout vs. request timeout&lt;/h2&gt;
&lt;p&gt;The distinction between connection timeout and request timeout can cause confusion.
First, let's have a look at what Connection timeout is.&lt;/p&gt;
&lt;p&gt;If you google or ask ChatGPT you’ll get something like this:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;A connection timeout refers to the maximum amount of time a client is willing to wait while attempting to establish a connection with a server. It measures the time it takes for a client to successfully establish a network connection with a server. If the connection is not established within the specified timeout period, the connection attempt is considered unsuccessful, and an error is typically returned to the client.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What does it mean to establish a connection?
TCP uses a three-way handshake to establish a reliable connection. The connection is full duplex, and both sides synchronize (SYN) and acknowledge (ACK) each other. The exchange of these four flags is performed in three steps—SYN, SYN-ACK, and ACK.&lt;/p&gt;
&lt;p&gt;&lt;img alt="tcp three-way handshake" src="https://engineering.zalando.com/posts/2023/07/images/handshake.png#center"&gt;&lt;/p&gt;
&lt;p&gt;A connection timeout should be sufficient to complete this process and the actual transmission of packets is gated by the quality of the connection.&lt;/p&gt;
&lt;p&gt;In simple words, the value for the connection timeout should be derived from the quality of the network between services.
If a remote service is running in the same datacenter or the same cloud region, connection time should be low. Conversely, if you’re working on a mobile application, connection time to a remote service might be quite high.&lt;/p&gt;
&lt;p&gt;To give you some insight: round-trip time (RTT) in fiber is ~42ms from New York to San Francisco and ~160ms from New York to Sydney.
You can also look at &lt;a href="https://clients.amazonworkspaces.com/Health.html"&gt;Connection Health Check by Amazon&lt;/a&gt;. This is what I get from my local machine: an RTT of 28ms to the recommended AWS Region.&lt;/p&gt;
&lt;p&gt;&lt;img alt="connection health check" src="https://engineering.zalando.com/posts/2023/07/images/connection_health_check.png#center"&gt;&lt;/p&gt;
&lt;h3&gt;When does connection timeout occur&lt;/h3&gt;
&lt;p&gt;A connection timeout occurs only upon starting the TCP connection. This usually happens if the remote machine does not answer. This means that the server has been shut down, you used the wrong IP/DNS name, the wrong port or the network connection to the server is down. Another frequent condition is when a given endpoint simply drops packets without a response. The remote endpoint's firewall or security settings may be configured to drop certain types of packets or traffic from specific sources.&lt;/p&gt;
&lt;h3&gt;Connection timeout best practices&lt;/h3&gt;
&lt;p&gt;A common practice for microservices is to set a connection timeout equal to or slightly lower than the timeout for the operation. This approach may not be ideal since the two processes are different.
Whereas establishing a connection is a relatively quick process, an operation can take hundreds or thousands of ms!&lt;/p&gt;
&lt;p&gt;You can set up a connection timeout which is some multiple of your expected RTT. &lt;strong&gt;Connection timeout = RTT * 3 is commonly used as a conservative approach&lt;/strong&gt;, but you can adjust it based on your specific needs.&lt;/p&gt;
&lt;p&gt;In general, the connection timeout for a microservice should be set low enough so that it can quickly detect an unreachable service, but high enough to allow the service to start up or recover from a short-lived problem.&lt;/p&gt;
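&lt;p&gt;The rule of thumb above can be written down as a tiny helper. This is only a sketch: the class and method names are hypothetical, and the multiplier of 3 is the conservative default mentioned above, which you may want to tune.&lt;/p&gt;

```java
import java.time.Duration;

final class ConnectTimeoutRule {
    // Multiplier from the rule of thumb above; adjust it for your network.
    static final int RTT_MULTIPLIER = 3;

    // Connection timeout = expected RTT * 3
    static Duration fromRtt(Duration expectedRtt) {
        return expectedRtt.multipliedBy(RTT_MULTIPLIER);
    }
}
```

&lt;p&gt;With the numbers quoted earlier, an RTT of 28ms gives a connection timeout of 84ms, and the ~160ms RTT from New York to Sydney gives 480ms.&lt;/p&gt;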
&lt;h3&gt;Request Timeout&lt;/h3&gt;
&lt;p&gt;A request timeout, on the other hand, pertains to the maximum duration a client is willing to wait for a response from the server after a successful connection has been established. It measures the time it takes for the server to process the client's request and provide a response.&lt;/p&gt;
&lt;h2&gt;Setting optimal request timeout&lt;/h2&gt;
&lt;p&gt;Imagine you are going to integrate your microservice with a new API.&lt;/p&gt;
&lt;p&gt;The first step would be to look at the SLAs provided by the microservice or API you are calling.
Unfortunately, not all services provide SLAs, and even if they do, you should not trust them blindly.
The SLA value is only a starting point for testing the real latency.&lt;/p&gt;
&lt;p&gt;If possible, run an integration with the new API in shadow mode and collect metrics. This code should run parallel to the existing production integration, but without affecting the production system (run it in a separate thread-pool, mirror traffic, etc).&lt;/p&gt;
&lt;p&gt;After collecting latency metrics such as p50, p99 and p99.9, you can define the so-called acceptable rate of false timeouts. Let's say you go with a false timeout rate of 0.1%: the max timeout you can set is then the p99.9 latency of the downstream service.&lt;/p&gt;
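&lt;p&gt;To make the percentile step concrete, here is a small sketch that derives the timeout from raw latency samples using the nearest-rank method. In practice your metrics system will usually report percentiles for you; the class and method names are hypothetical, and the percentile is passed in hundredths of a percent (9990 means p99.9) purely to keep the arithmetic in integers.&lt;/p&gt;

```java
import java.util.Arrays;

final class LatencyPercentiles {
    // Nearest-rank percentile over a non-empty array of samples.
    // hundredthsOfPercent: 5000 means p50, 9900 means p99, 9990 means p99.9.
    static long percentile(long[] latenciesMillis, int hundredthsOfPercent) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        // rank = ceil(percentile * n), computed with integer arithmetic
        long rank = ((long) hundredthsOfPercent * sorted.length + 9_999) / 10_000;
        int index = (int) Math.max(rank, 1) - 1;
        return sorted[index];
    }

    // A false-timeout rate of 0.1% corresponds to a timeout at p99.9.
    static long maxTimeoutMillis(long[] latenciesMillis) {
        return percentile(latenciesMillis, 9_990);
    }
}
```

&lt;p&gt;For 1000 samples, p99.9 is the 999th smallest value, so only the single slowest sample would have been cut off by that timeout.&lt;/p&gt;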
&lt;p&gt;At this step you have a max timeout value you can set but you have a trade-off:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;set timeout to the max value&lt;/li&gt;
&lt;li&gt;decrease timeout and enable retry&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Based on the test results you need to choose the timeout strategy. We'll cover retries a little bit later.&lt;/p&gt;
&lt;p&gt;The next challenge you will face is a chain of calls.
Imagine your service has an SLA of 1000ms and it sequentially calls Order Service with p99.9 = 700ms and then Payment Service with p99.9 = 700ms. How do you configure the timeouts without breaching the SLA?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chain of calls" src="https://engineering.zalando.com/posts/2023/07/images/chain_of_calls.png#center"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 1: Share your time budget&lt;/strong&gt;
One option would be to share your time budget (your SLA) between services and set timeouts accordingly: 500ms for Order Service and 500ms for Payment Service.
In this case, you have a guarantee that you will not breach your SLA, but you might have some false positive timeouts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 2: Introduce a TimeLimiter for your API&lt;/strong&gt;
Since different services will not simultaneously respond with the maximum delay, you can wrap the chained calls in a time limiter and set the maximum acceptable timeout for both services. In this case, you could create a time limiter of 1 second and set a timeout of 700ms for each downstream service.&lt;/p&gt;
&lt;p&gt;In Java, you could use &lt;code&gt;CompletableFuture&lt;/code&gt;, whose methods &lt;code&gt;orTimeout&lt;/code&gt; and &lt;code&gt;completeOnTimeout&lt;/code&gt; provide built-in support for dealing with timeouts.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;CompletableFuture&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orderService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;placeOrder&lt;/span&gt;&lt;span class="p"&gt;(...))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;paymentService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;updateBalance&lt;/span&gt;&lt;span class="p"&gt;(...))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// fail the whole chain if it has not completed within 1 second&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TimeUnit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SECONDS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is also a nice TimeLimiter module provided by the &lt;a href="https://resilience4j.readme.io/docs/timeout"&gt;Resilience4j library&lt;/a&gt;.&lt;/p&gt;
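&lt;p&gt;For readers who want to try the time limiter idea in isolation, here is a self-contained sketch using only &lt;code&gt;CompletableFuture&lt;/code&gt; (Java 10+). The downstream calls are simulated with &lt;code&gt;Thread.sleep&lt;/code&gt;; &lt;code&gt;slowCall&lt;/code&gt;, &lt;code&gt;placeOrderAndPay&lt;/code&gt; and the &lt;code&gt;"fallback"&lt;/code&gt; value are hypothetical stand-ins, not part of any real service.&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeLimiterSketch {
    // Hypothetical stand-in for a downstream call that needs delayMillis to answer.
    static String slowCall(String name, long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return name;
    }

    // Runs both calls sequentially, bounded by one overall time limit.
    static String placeOrderAndPay(long orderDelayMillis, long paymentDelayMillis, long limitMillis) {
        var chain = CompletableFuture
            .supplyAsync(() -> slowCall("order", orderDelayMillis))
            .thenApply(order -> order + "+" + slowCall("payment", paymentDelayMillis))
            .orTimeout(limitMillis, TimeUnit.MILLISECONDS);
        try {
            return chain.join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return "fallback"; // fail fast instead of breaching the SLA
            }
            throw e;
        }
    }
}
```

&lt;p&gt;Note that &lt;code&gt;orTimeout&lt;/code&gt; only completes the future exceptionally; it does not interrupt work that is already running. That is why the per-call timeouts on the downstream clients are still needed alongside the overall limit.&lt;/p&gt;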
&lt;h2&gt;Retry or not retry&lt;/h2&gt;
&lt;p&gt;The idea is simple - consider enabling retry when there is a chance of success.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Temporary failures:&lt;/strong&gt; Retry is suitable for temporary failures that are expected to be resolved after a short period, such as network glitches, server timeouts, or database connection issues. Retry can also avoid a bad node. Given a large enough deployment (e.g. 100 pods), a single pod might have a substantial performance regression, but if requests are load balanced in a sufficiently random way, retrying is faster than awaiting a response from the bad node.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Retry on timeout errors and 5xx errors&lt;/li&gt;
&lt;li&gt;Do not retry on 4xx errors&lt;/li&gt;
&lt;/ul&gt;
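&lt;p&gt;The two rules above can be sketched as a pair of predicates. The class and method names are hypothetical, and which exception types count as a timeout depends on the HTTP client you use; the ones listed here are the JDK's own.&lt;/p&gt;

```java
import java.net.SocketTimeoutException;
import java.net.http.HttpTimeoutException;
import java.util.concurrent.TimeoutException;

final class RetryPolicy {
    // Retry server-side failures (5xx); never retry client errors (4xx).
    static boolean isRetryableStatus(int statusCode) {
        return statusCode / 100 == 5;
    }

    // Timeouts are treated as temporary failures and may be retried.
    static boolean isRetryableFailure(Throwable failure) {
        return failure instanceof HttpTimeoutException
            || failure instanceof SocketTimeoutException
            || failure instanceof TimeoutException;
    }
}
```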
&lt;p&gt;&lt;strong&gt;Idempotent operations:&lt;/strong&gt; If the operation being performed is idempotent, meaning that executing it multiple times has the same result as executing it once, retries are generally safe.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Non-idempotent operations&lt;/strong&gt; can cause unintended side effects if retried multiple times. Examples include operations that modify data, perform financial transactions, or have irreversible consequences. Retrying such operations can lead to data inconsistency or duplicate actions.&lt;/p&gt;
&lt;p&gt;Even if you think an operation is idempotent, if possible, ask the service owner whether it is a good idea to enable retries.&lt;/p&gt;
&lt;p&gt;For safely retrying requests without accidentally performing the same operation twice, consider supporting additional &lt;em&gt;Idempotency-Key&lt;/em&gt; header in your API. When creating or updating an object, use an idempotency key. Then, if a connection error occurs, you can safely repeat the request without the risk of creating a second object or performing the update twice. You can read more about this idempotency pattern here &lt;a href="https://stripe.com/docs/api/idempotent_requests"&gt;Idempotent Requests by Stripe&lt;/a&gt; and &lt;a href="https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/"&gt;Making retries safe with idempotent APIs by Amazon&lt;/a&gt;.&lt;/p&gt;
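&lt;p&gt;On the client side, the important detail is that the key is generated once per logical operation and reused for every retry of it. A minimal sketch with the JDK HTTP client follows; the class and method names, the JSON content type and the 700ms timeout are illustrative assumptions, and the server must of course implement the deduplication.&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

final class IdempotentRequests {
    // Generate one key per logical operation, not per attempt.
    static String newIdempotencyKey() {
        return UUID.randomUUID().toString();
    }

    // Build the request; call again with the SAME key when retrying.
    static HttpRequest createOrder(String url, String jsonBody, String idempotencyKey) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(Duration.ofMillis(700))
            .header("Idempotency-Key", idempotencyKey)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }
}
```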
&lt;p&gt;&lt;strong&gt;Circuit breaker:&lt;/strong&gt; always consider implementing circuit breakers when enabling retry. When failures are rare, retries are not a problem. But when a downstream service is already struggling, retries increase its load and can make matters significantly worse.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Exponential backoff:&lt;/strong&gt; Implementing exponential backoff can be an effective retry strategy. It involves increasing the delay between each retry attempt exponentially, reducing the load on the failing service and preventing overwhelming it with repeated requests. Here is a fantastic blog on how &lt;a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/"&gt;AWS SDKs support exponential backoff and jitter&lt;/a&gt; as a part of their retry behaviour.&lt;/p&gt;
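&lt;p&gt;As a rough sketch of the delay calculation, the helper below doubles a base delay per attempt, caps it, and then applies the "full jitter" variant described in the linked AWS post (a random wait between zero and the capped delay). The class and method names and the example values are assumptions for illustration.&lt;/p&gt;

```java
import java.util.concurrent.ThreadLocalRandom;

final class BackoffSketch {
    // Upper bound for the wait before retry number `attempt` (0-based):
    // baseMillis doubled `attempt` times, capped at maxMillis.
    static long cappedDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i != attempt; i++) {
            if (delay >= maxMillis) {
                break; // already at the cap; stop doubling to avoid overflow
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Full jitter: wait a random time between 0 and the capped delay (inclusive),
    // so that many clients retrying at once do not hit the service in lockstep.
    static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = cappedDelayMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }
}
```

&lt;p&gt;With a base of 100ms and a cap of 2000ms, the upper bounds for attempts 0 to 5 are 100, 200, 400, 800, 1600 and 2000ms.&lt;/p&gt;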
&lt;p&gt;&lt;strong&gt;Time-sensitive operations:&lt;/strong&gt; Retries may not be appropriate for time-critical operations. The trade-off here is to decrease a timeout and enable retries or keep the max acceptable timeout value. Retries might not work well where p99.9 is close to p50.&lt;/p&gt;
&lt;p&gt;Look at the graphs. On the first one, timeouts happen occasionally and there is a big difference between p99 and p50: a good case for enabling retries.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Retry is applicable" src="https://engineering.zalando.com/posts/2023/07/images/retry_applicable.png#center"&gt;&lt;/p&gt;
&lt;p&gt;On the second graph, timeouts happen periodically and &lt;strong&gt;p99 is close to p50, so do not enable retries&lt;/strong&gt;.
&lt;img alt="Retry is not applicable" src="https://engineering.zalando.com/posts/2023/07/images/retry_is_not_applicable.png#center"&gt;&lt;/p&gt;
&lt;h2&gt;Recap&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;set timeout explicitly on any remote calls&lt;/li&gt;
&lt;li&gt;set connection timeout = expected RTT * 3&lt;/li&gt;
&lt;li&gt;set request timeout based on collected metrics and SLA&lt;/li&gt;
&lt;li&gt;fail-fast or return a fallback value&lt;/li&gt;
&lt;li&gt;consider wrapping chained calls into time limiter&lt;/li&gt;
&lt;li&gt;retry on 5xx error and do not retry on 4xx&lt;/li&gt;
&lt;li&gt;think about implementing a circuit breaker when retrying&lt;/li&gt;
&lt;li&gt;be polite and ask the API owner for permission to enable retries&lt;/li&gt;
&lt;li&gt;support &lt;em&gt;Idempotency-Key&lt;/em&gt; header in your API&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Resources&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://hpbn.co/primer-on-latency-and-bandwidth/#speed-of-light-and-propagation-latency"&gt;Speed of Light and Propagation Latency&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter"&gt;Timeouts, retries, and backoff with jitter by AWS&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://cseweb.ucsd.edu/classes/sp18/cse291-c/post/schedule/p74-dean.pdf"&gt;The Tail at Scale - Dean and Barroso 2013&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://blog.acolyer.org/2015/01/15/the-tail-at-scale/"&gt;The Tail at Scale - Adrian Colyer 2015&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/"&gt;The complete guide to Go net/http timeouts by Cloudflare&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://www.linkedin.com/pulse/handling-timeouts-microservice-architecture-arpit-bhayani/"&gt;Handling timeouts in a microservice architecture&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/"&gt;Making retries safe with idempotent APIs by AWS&lt;/a&gt;&lt;br/&gt;
&lt;a href="https://stripe.com/docs/api/idempotent_requests"&gt;Idempotent Requests by Stripe&lt;/a&gt;&lt;br/&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Microservices"/><category term="Java"/><category term="SRE"/><category term="REST"/><category term="Backend"/></entry><entry><title>Rendering Engine Tales: Road to Concurrent React</title><link href="https://engineering.zalando.com/posts/2023/07/rendering-engine-tales-road-to-concurrent-react.html" rel="alternate"/><published>2023-07-11T00:00:00+02:00</published><updated>2023-07-11T00:00:00+02:00</updated><author><name>Rene Eichhorn</name></author><id>tag:engineering.zalando.com,2023-07-11:/posts/2023/07/rendering-engine-tales-road-to-concurrent-react.html</id><summary type="html">&lt;p&gt;Integrating React's Concurrent features into Zalando's web framework. In this post we go over our solution design, early benchmarks, and some useful tips about common hydration mismatch errors.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Outfit Page" src="https://engineering.zalando.com/posts/2023/07/images/rengine-outfit-page.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;Welcome back to our web platform blog series! It's been a while since we &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;last talked about&lt;/a&gt; our approach to large-scale front-end development at Zalando. We are excited now to reconnect and share with you some substantial enhancements we've made to the streaming and rendering architecture of our Rendering Engine framework.&lt;/p&gt;
&lt;p&gt;The first post of this new series will recap how Rendering Engine works, its relationship with Concurrent React, and our journey with it including design and implementation challenges as well as successes gained so far. &lt;br/&gt;
Additionally, it covers the main hydration mismatch errors we faced during this upgrade, our solutions and recommendations for avoiding them, and some extra tips and tricks for debugging this type of issue.&lt;/p&gt;
&lt;h2&gt;Intro&lt;/h2&gt;
&lt;p&gt;"Rendering Engine" is the web framework that is maintained and currently used by Zalando to render the &lt;a href="https://en.zalando.de/"&gt;Fashion Store website&lt;/a&gt;, and is designed for building any web application with similar needs.&lt;/p&gt;
&lt;p&gt;You might know Rendering Engine (&lt;strong&gt;RE&lt;/strong&gt;) from our previous blog posts about Micro Frontends at Zalando and our journey through them from Project Mosaic with its &lt;a href="https://engineering.zalando.com/posts/2018/12/front-end-micro-services.html"&gt;fragments&lt;/a&gt; and &lt;a href="https://github.com/zalando/tailor"&gt;Tailor&lt;/a&gt;, to &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Interface Framework&lt;/a&gt; (&lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;part 2&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In a nutshell, &lt;strong&gt;RE&lt;/strong&gt; is a web framework best suited for creating a website that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uses React to render the UI&lt;/li&gt;
&lt;li&gt;Inherently implements universal rendering (server side / client side) with high emphasis on server rendering and page load performance&lt;/li&gt;
&lt;li&gt;Has page content, layout and UI steering that are largely driven by the backend in a nestable approach&lt;/li&gt;
&lt;li&gt;The backend can be a recommendation engine, a CMS-like system able to define the shape and content of pages, or any other similar system.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The building blocks of RE's language for defining what to render are &lt;strong&gt;Entities&lt;/strong&gt;.
Each &lt;strong&gt;Entity&lt;/strong&gt; is a block of content that from a business-logic perspective has a specific identity, and can have other Entities nested inside. For example, in the context of a fashion store, an Entity could be a Product, a Collection of products, an Outfit, etc. When organized in tree-like structures, Entities can be used to define the full layout and contents of pages.
Each Entity is defined by the backend by specifying a &lt;strong&gt;&lt;em&gt;type&lt;/em&gt;&lt;/strong&gt;, an &lt;strong&gt;&lt;em&gt;id&lt;/em&gt;&lt;/strong&gt;, and optional extra data in the form of &lt;strong&gt;&lt;em&gt;hints&lt;/em&gt;&lt;/strong&gt;. We'll skip how RE handles defining layouts from the backend for the time being.&lt;/p&gt;
&lt;p&gt;So if Entities are responsible for describing "&lt;em&gt;what to render&lt;/em&gt;" (by the backend), then specifying "&lt;em&gt;how to render&lt;/em&gt;" is the responsibility of what we call a &lt;strong&gt;Renderer&lt;/strong&gt; (by the client). &lt;br/&gt;
Each &lt;strong&gt;Renderer&lt;/strong&gt; is a self-contained TypeScript module powered by multiple RE features provided during server- and client-side rendering.
Each Renderer is responsible for rendering a specific type of Entity, while each Entity type can be represented by multiple Renderers depending on the extra hints data.&lt;/p&gt;
&lt;p&gt;This mapping is defined via &lt;strong&gt;Rendering Rules&lt;/strong&gt;. These configurations are passed to RE; they include "selectors" for matching the incoming Entity definitions from the backend, and support nested and per-page rules.&lt;/p&gt;
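&lt;p&gt;To make the idea more concrete, here is a minimal sketch of how selector-based matching could work. The shapes below (&lt;code&gt;Entity&lt;/code&gt;, &lt;code&gt;RenderingRule&lt;/code&gt;, &lt;code&gt;selectRenderer&lt;/code&gt;), the renderer names and the "first matching rule wins" behaviour are all assumptions made for illustration, not RE's actual API. The sketch only shows a selector matching on the Entity type and, optionally, on hints.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration only; not Rendering Engine's real API.
type Hints = { [key: string]: string };

interface Entity {
  type: string;
  id: string;
  hints?: Hints;
}

interface RenderingRule {
  selector: { type: string; hints?: Hints };
  renderer: string; // name of the Renderer module to use
}

// Returns the renderer of the first rule whose selector matches the entity.
// "First match wins" is an assumption made for this sketch.
function selectRenderer(entity: Entity, rules: RenderingRule[]): string | undefined {
  const match = rules.find((rule) => {
    if (rule.selector.type !== entity.type) return false;
    // Every hint required by the selector must be present on the entity.
    const required = Object.entries(rule.selector.hints ?? {});
    return required.every(([key, value]) => entity.hints?.[key] === value);
  });
  return match?.renderer;
}

const rules: RenderingRule[] = [
  { selector: { type: "product", hints: { variant: "compact" } }, renderer: "compact-product-card" },
  { selector: { type: "product" }, renderer: "product-card" },
];

const compact = selectRenderer({ type: "product", id: "p1", hints: { variant: "compact" } }, rules);
const regular = selectRenderer({ type: "product", id: "p2" }, rules);
console.log(compact, regular);
```

&lt;p&gt;In this sketch the more specific rule is listed first, so an Entity carrying the matching hint gets the specialised Renderer while every other Entity of the same type falls through to the generic one.&lt;/p&gt;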
&lt;p&gt;There are a handful of other features built into this framework including monitoring, experimentation, tracking, a different rendering output for server driven mobile apps, etc. but for now this introduction should do.&lt;/p&gt;
&lt;h2&gt;React 18's Concurrent Rendering&lt;/h2&gt;
&lt;h4 style="opacity: 0.7"&gt;(and how it fits Rendering Engine like a glove)&lt;/h4&gt;

&lt;p&gt;Performance has always been one of the key focus areas of Rendering Engine from its beginnings. Aside from being built with performance in mind and going through many micro improvements over the years, it also comes with some performance features built inside, including but not limited to streaming, lazy-loading, partial streaming and partial hydration (yes, almost the same concept as in Concurrent React!).&lt;/p&gt;
&lt;p&gt;Although these performance related features have proven to be very important in the success of the Fashion Store website, their code's maintenance, improvements and required education as well as knowledge sharing come with a cost.&lt;/p&gt;
&lt;p&gt;But more importantly, we anticipated that React's built-in support for these features would most probably bring even more performance gains.&lt;/p&gt;
&lt;p&gt;Additionally, React's concurrent rendering APIs seamlessly integrate with the architecture of RE because its Renderers serve as ideal candidates for being encapsulated within a Suspense boundary. This enables them to function as individual blocks that can be server-rendered, streamed, hydrated, and client-rendered "concurrently". Especially since many of them have already been using Rendering Engine's own partial hydration/streaming features!&lt;/p&gt;
&lt;p&gt;As a result, we have been very excited about the concurrent React 18 for quite a while and as soon as the opportunity arrived, we started the migration and refactoring of Rendering Engine's core functionalities to use the concurrent features.&lt;/p&gt;
&lt;p&gt;Needless to say, this migration task has also had its challenges and costs! So now that we have finished some important milestones and are close to completion, we thought it was a good time to start sharing our challenges, successes and learnings with you.&lt;/p&gt;
&lt;h2&gt;Design challenges with Concurrent Rendering&lt;/h2&gt;
&lt;p&gt;Rendering Engine at its core includes logic for handling the resolution of server's specified Entity definitions or layout into the corresponding Renderers, fetching their data as well as handling all the other aforementioned features like experimentation, tracking, etc. And only after that, it hands over the UI rendering responsibilities to React. &lt;br/&gt;
These happen gradually (and if needed, recursively) in a way that makes sure that Renderers remain independent while getting their data and rendering/streaming their final html, which makes way for performance gains.&lt;/p&gt;
&lt;p&gt;So initially, with React 18 we thought of moving as much of this logic as possible (from data fetching to experimentation, tracking, etc.) to the React concurrent APIs such as Suspense and &lt;code&gt;useTransition&lt;/code&gt;, through custom hooks, which is often referred to as the "Render-As-You-Fetch" pattern. The aim was to reduce complexity and required effort, among other things.&lt;/p&gt;
&lt;p&gt;But after a trial phase and implementing a proof of concept, we faced some issues, the main ones being:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In cases where keeping the correct order of the content during streaming/hydration is important, the closest available solution would be to use the &lt;code&gt;SuspenseList&lt;/code&gt; API. But it still seems to be &lt;a href="https://github.com/facebook/react/issues/22771#issuecomment-969451702"&gt;experimental, with some limitations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://github.com/facebook/react/issues/25082"&gt;&lt;code&gt;useTransition&lt;/code&gt; API does not take nested Suspense boundaries into account&lt;/a&gt;, which causes bad UX in some scenarios.&lt;/li&gt;
&lt;li&gt;By utilizing hooks to initiate requests or other async operations, the timing of fetch operations becomes coupled with the order of rendering, which may not be optimal for performance.&lt;/li&gt;
&lt;li&gt;Progressive hydration and streaming require all the data needed for client-side rendering to be available as early as possible. This implies that, in addition to the HTML generated by components, it is crucial to stream their data to prevent redundant requests from being made by the client.&lt;ul&gt;
&lt;li&gt;During the trial phase, the streaming and caching layer needed for this wasn't yet handled by React. And as of now, the &lt;a href="https://github.com/facebook/react/pull/25502"&gt;latest supporting feature&lt;/a&gt; is still not final.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Chosen technical design&lt;/h3&gt;
&lt;p&gt;Due to the limitations mentioned above, we finally decided to go with a mixed solution.&lt;/p&gt;
&lt;p&gt;In this approach, the concurrent streaming, hydration, rendering and basically all the Concurrent benefits are still achieved via fully utilizing React: by wrapping every Renderer in a Suspense boundary, and handling changes through concurrent APIs. &lt;br/&gt;
But at the same time, we created an "Application State" layer which encapsulates the main logic and the Renderers' data outside of React components/hooks in a central place, and which dictates the state of each Suspense boundary.&lt;/p&gt;
&lt;p&gt;This way, the full power of orchestrating when to suspend a component (Renderer) depending on its place in the tree, handling the order of the suspended components, and deciding how to manage a transition considering the nested Suspense boundaries, would all be available and customizable in this Application State layer. &lt;br/&gt;
&lt;em&gt;We will share the details of the technical solution for ordered streaming/hydration in another post&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In other words, every time RE finds the matching Renderer and resolves all its corresponding data for an Entity definition (through the "resolveEntity" step), the output is written to the Application State layer. In the meantime, React is rendering the Renderer components, which are wrapped with Suspense. &lt;br/&gt;
To access data from the Application State, the suspendable Renderers use the "Connector hook". &lt;br/&gt;
The Connector hook reads from the application state which either returns the data that was asked for, or creates a promise that will be resolved once the data has been written. The promise is then used to suspend the component and React will automatically re-render once the Promise has been resolved. &lt;br/&gt;
&lt;em&gt;Imagine Redux's &lt;code&gt;useSelector&lt;/code&gt; hook, but instead of immediately returning selected data you get a Promise that only resolves once a reducer has made the data available.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Rendering Engine architecture using Concurrent React" src="https://engineering.zalando.com/posts/2023/07/images/rengine-concurrent-react.jpg"&gt;&lt;/p&gt;
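&lt;p&gt;The read-or-wait behaviour described above can be sketched without React at all. The class and method names below (&lt;code&gt;ApplicationState&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;) and the entity key are hypothetical and are not RE's real implementation. The sketch only shows a store that either returns data immediately or hands back a promise that settles once the data has been written. In a React 18 component, a hook built on top of this would typically throw the pending promise so that the nearest Suspense boundary shows its fallback.&lt;/p&gt;

```typescript
// Hypothetical sketch; names are illustrative and not RE's real API.
class ApplicationState {
  private data = new Map();    // entity key -> resolved data
  private waiters = new Map(); // entity key -> { promise, resolve }

  // Returns the data if it is already there, otherwise a promise to wait on.
  read(key: string) {
    if (this.data.has(key)) {
      return { status: "ready" as const, data: this.data.get(key) };
    }
    let waiter = this.waiters.get(key);
    if (!waiter) {
      let resolve = () => {};
      // The executor runs synchronously, so `resolve` is set before use.
      const promise = new Promise((done) => {
        resolve = () => done(undefined);
      });
      waiter = { promise, resolve };
      this.waiters.set(key, waiter);
    }
    return { status: "pending" as const, promise: waiter.promise };
  }

  // Called once an Entity has been resolved; wakes up anyone waiting on it.
  write(key: string, value: unknown) {
    this.data.set(key, value);
    const waiter = this.waiters.get(key);
    if (waiter) {
      this.waiters.delete(key);
      waiter.resolve();
    }
  }
}

const state = new ApplicationState();
const first = state.read("product:42"); // nothing written yet
state.write("product:42", { name: "Sneaker" });
const second = state.read("product:42"); // data is now available
console.log(first.status, second.status);
```

&lt;p&gt;Keeping the promise bookkeeping outside of components is what lets a central layer decide when and in which order the boundaries are released.&lt;/p&gt;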
&lt;h2&gt;Benefits gained from Concurrent Rendering&lt;/h2&gt;
&lt;p&gt;As we are still going through the changes and final steps of the full-fledged concurrent mode described above, the full benefits of it are yet to be observed.&lt;/p&gt;
&lt;p&gt;To date, we have achieved some performance improvements, mainly by using the new streaming and hydration root APIs.&lt;/p&gt;
&lt;h3&gt;Performance improvements from &lt;code&gt;renderToPipeableStream&lt;/code&gt; and &lt;code&gt;hydrateRoot&lt;/code&gt; APIs&lt;/h3&gt;
&lt;p&gt;As one of the milestones, after the plain version upgrade and handling of breaking changes, we changed only RE's internal streaming and hydration code to use the new React 18 APIs, i.e. &lt;code&gt;renderToPipeableStream&lt;/code&gt; instead of &lt;code&gt;renderToNodeStream&lt;/code&gt;, and &lt;code&gt;hydrateRoot&lt;/code&gt; instead of &lt;code&gt;hydrate&lt;/code&gt;. &lt;br/&gt;
We rolled out this change through an A/B test covering all pages of our e-commerce website, and in the end we observed these mild performance (and business metric) improvements:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overall&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://web.dev/inp/"&gt;INP&lt;/a&gt;: &lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-5.69%&lt;/strong&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/fid/"&gt;FID&lt;/a&gt;: &lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-8.81%&lt;/strong&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/lcp/"&gt;LCP&lt;/a&gt;: &lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-2.43%&lt;/strong&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://web.dev/fcp/"&gt;FCP&lt;/a&gt;: &lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-0.23%&lt;/strong&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bounce rate&lt;/strong&gt;: &lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-0.24%&lt;/strong&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Per page:&lt;/strong&gt;
(some of the frequently visited pages)&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: center;"&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: center;"&gt;&lt;strong&gt;Home page&lt;/strong&gt;&lt;/th&gt;
&lt;th style="text-align: center;"&gt;&lt;strong&gt;Catalog page&lt;/strong&gt;&lt;br/&gt;&lt;em&gt;(list of products and search)&lt;/em&gt;&lt;/th&gt;
&lt;th style="text-align: center;"&gt;&lt;strong&gt;Product Details page&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;&lt;strong&gt;INP&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-2.92%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-6.76%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-6.09%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;&lt;strong&gt;FID&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-2.98%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-17.11%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-6.06%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;&lt;strong&gt;Exit Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-0.43%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-0.06%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td style="text-align: center;"&gt;&lt;span style="color: #61bd6d"&gt;&lt;strong&gt;-0.06%&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Needless to say, this shows great promise, and we are now even more excited about the results of the next steps.&lt;/p&gt;
&lt;h2&gt;Technical challenges: Rise of the Hydration Mismatch errors!&lt;/h2&gt;
&lt;p&gt;As also stated in &lt;a href="https://github.com/reactjs/rfcs/blob/ba9bd5744cb922184ec9390515910cd104a30c6e/text/0215-server-errors-in-react-18.md#hydration-mismatches"&gt;some of the documentation around React 18&lt;/a&gt;, the new React APIs are far more sensitive to existing hydration mismatch issues. As a result, after the migration to the new streaming and hydration APIs, we started receiving many more hydration error logs (via Sentry) for Zalando Fashion Store. &lt;br/&gt;
So during this migration, we've been finding and fixing these issues to prevent negative user impact as much as possible. After fixing dozens of different types of issues deep inside hundreds of Renderers, we were able to considerably reduce the number of hydration mismatch errors occurring in the wild. That being said, there are still some errors left to fix, which are harder to reproduce and find due to the dynamic nature of the page content in Fashion Store. &lt;br/&gt;
Nevertheless, below you can find the most common issues we found so far, and how we were able to fix them.&lt;/p&gt;
&lt;p&gt;After that, we also briefly share some tips and tricks about the debugging process. Because - as you may also know if you have faced these errors in your projects - debugging them is not always a straightforward task, and to be honest, React's error logs (especially coming from the production environment) aren't very helpful!&lt;/p&gt;
&lt;h3&gt;Main types of issues we faced, and suggested solutions&lt;/h3&gt;
&lt;p&gt;Before going through details of each type, in some cases we realized that based on product requirements, one might actually not need to render some content on SSR (Server Side Rendering) and only the CSR (Client Side Rendering) would be enough. &lt;br/&gt;
Hence the obvious fix might be to just skip rendering on SSR and only show the content once the app is mounted on the user's browser.&lt;/p&gt;
&lt;p&gt;To do that, we can rely on React hooks and lifecycle methods to ensure the app/component has been mounted on the browser. For example:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dataThatDiffersBetweenClientAndServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;dataThatDiffersBetweenClientAndServer&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;isMounted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;setIsMounted&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;React&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;React&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;setIsMounted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dataThatDiffersBetweenClientAndServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;isMounted&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dataThatDiffersBetweenClientAndServer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;some fallback&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There are similar cases where due to the basic differences between the SSR and the CSR, like some data only being available on client side, one might need to render different content or elements on the two. For example, based on the exact specifications of the user's device, you want to display an app download banner.&lt;/p&gt;
&lt;p&gt;For these scenarios, the suggestion would again be to simply wait until the initial hydration phase is finished on the client side, and then render the different content.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: in such cases, be mindful of layout shifts that can happen as a result of some element popping into the view.&lt;/p&gt;
&lt;p&gt;With that out of the way, let's dive into the list of issues.&lt;/p&gt;
&lt;h4&gt;1. Timers&lt;/h4&gt;
&lt;p&gt;This is a common and somewhat expected source of hydration mismatch issues simply because if you're calculating and rendering the distance between two specific points in time (usually from past/future to now), it will result in slightly different values when calculated on SSR compared to a few moments later on CSR.&lt;/p&gt;
&lt;p&gt;As also mentioned in &lt;a href="https://react.dev/reference/react-dom/client/hydrateRoot#suppressing-unavoidable-hydration-mismatch-errors"&gt;React docs&lt;/a&gt;, in such cases where the mismatch is unavoidable, the suggestion is to simply tell React that the difference is expected and that React should ignore the mismatch during hydration. The way to do this is by passing the prop &lt;code&gt;suppressHydrationWarning={true}&lt;/code&gt; to the element that contains such a mismatch. Keep in mind that this prop only works one level deep, so you have to pass it to the closest element wrapping the mismatching text. For example:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;timeDistance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;targetDate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;timeDistance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;timeDistance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;targetDate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getTime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;suppressHydrationWarning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;timeDistance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;2. Localization of dates and different time-zones&lt;/h4&gt;
&lt;p&gt;Converting date values from raw formats (e.g. ISO 8601 &lt;code&gt;2023-01-01T20:00:00.000Z&lt;/code&gt;) to human-readable strings can be a tricky cause of hydration mismatch errors: &lt;br/&gt;
if the timezone used for conversion is different between the server and client, the resulting values can be different as well.&lt;/p&gt;
&lt;p&gt;So for example if the timezone is not specified while using the localization APIs (e.g. &lt;code&gt;Intl.DateTimeFormat&lt;/code&gt; or &lt;code&gt;Date.prototype.toLocaleString&lt;/code&gt;), then the host timezone will be used and if the SSR server has a different timezone than the user, it will lead to different localized date values in the end.&lt;/p&gt;
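&lt;p&gt;A small sketch makes the effect visible. The two timezones below are assumptions chosen for illustration: &lt;code&gt;UTC&lt;/code&gt; stands in for the SSR host and &lt;code&gt;Europe/Berlin&lt;/code&gt; for the user's browser. They are passed explicitly here only to simulate what each host would pick by default when no &lt;code&gt;timeZone&lt;/code&gt; is given.&lt;/p&gt;

```typescript
// The same instant, formatted as it would be on two hosts with different timezones.
const instant = new Date("2023-01-01T23:30:00.000Z");

function formatDay(timeZone: string): string {
  return new Intl.DateTimeFormat("en-GB", {
    timeZone,
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
  }).format(instant);
}

const onServer = formatDay("UTC");           // stands in for the SSR host
const onClient = formatDay("Europe/Berlin"); // stands in for the user's browser

// Late in the evening UTC is already the next calendar day in Berlin,
// so the server-rendered text and the client-rendered text do not match.
console.log(onServer, onClient);
```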
&lt;p&gt;It's hard to decide what the best solution is in these cases, especially because as of now it is not possible to know the exact local timezone of the user on SSR based on HTTP headers (in the initial request). &lt;br/&gt;
On top of that, the question of which timezone to use for displaying dates is ultimately a product decision.&lt;/p&gt;
&lt;p&gt;But if a specific universal timezone is approved and provided (for example the website's domain's matching timezone), then specifying that universal timezone to the conversion APIs on both the client and server code can fix this issue. Meaning:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;someDate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DateTimeFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someDate&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;someDate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;timeZone&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;universalTimezone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DateTimeFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;locale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;timeZone&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;universalTimezone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someDate&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That being said, depending on the situation and product requirements, an alternative approach would be to just move the conversion to the backend so that the client simply receives dates in the localized format - which has passed through timezone transformation (and localisation).&lt;/p&gt;
&lt;h4&gt;3. Localization of numbers&lt;/h4&gt;
&lt;h5 style="opacity: 0.7"&gt;(and a Safari bug for "de-AT" locale!)&lt;/h5&gt;

&lt;p&gt;As with dates and timezones, when converting raw numbers to localized human-readable strings (e.g. &lt;code&gt;12345&lt;/code&gt; to &lt;code&gt;"12,345"&lt;/code&gt;), if the locale is not specified then the host's locale will be used, which can lead to different results. So it's important to always pass a universal locale to these APIs, one that is consistent between server and client rendering:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;someNumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NumberFormat&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someNumber&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;someNumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toLocaleString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;universalLocale&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Intl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DateTimeFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;universalLocale&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someNumber&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;But in very specific cases, we observed that the localisation APIs behave differently between SSR and CSR, which again leads to different generated values and thus hydration mismatches!&lt;/p&gt;
&lt;p&gt;We particularly encountered this issue with the Safari browser where, for the de-AT locale, the localisation APIs (like &lt;code&gt;Intl.NumberFormat&lt;/code&gt; or &lt;code&gt;toLocaleString&lt;/code&gt;) generate values like &lt;code&gt;"2.345"&lt;/code&gt;, while other browsers including Chrome and Firefox, as well as Node.js, generate values like &lt;code&gt;"2 345"&lt;/code&gt; for the same locale!&lt;/p&gt;
&lt;p&gt;So an alternative approach in these cases would be to receive the final localized values from the backend and show that to the user without needing any more modifications, thus eliminating the mismatches.&lt;/p&gt;
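&lt;p&gt;A minimal sketch of the "always pass a locale" rule: the helper below pins the locale in one place, so neither the server nor the client can silently fall back to the host's default. The names &lt;code&gt;UNIVERSAL_LOCALE&lt;/code&gt; and &lt;code&gt;formatCount&lt;/code&gt;, and the choice of &lt;code&gt;"en-US"&lt;/code&gt;, are assumptions made purely for illustration:&lt;/p&gt;

```javascript
// Hypothetical constant: one locale shared by server and client rendering.
const UNIVERSAL_LOCALE = "en-US";

// Hypothetical helper: formats a number with the pinned locale instead of
// relying on the host's default locale.
function formatCount(value) {
  return new Intl.NumberFormat(UNIVERSAL_LOCALE).format(value);
}

console.log(formatCount(12345));
```

&lt;p&gt;Pinning the locale only removes the dependency on the host's default. It does not help when two engines format the same locale differently, as in the Safari case described above.&lt;/p&gt;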
&lt;h4&gt;4. Invalid HTML nesting&lt;/h4&gt;
&lt;p&gt;This issue might be a new cause of hydration mismatch in React 18, which happens as a result of incorrect HTML like nesting a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; inside a &lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt; inside &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;. We couldn't find clear documentation from React explaining why HTML validity issues lead to hydration mismatch errors (aside from community discussions &lt;a href="https://github.com/facebook/react/issues/24519"&gt;like here&lt;/a&gt;). But regardless, to avoid them, adding markup validation steps (like &lt;a href="https://github.com/MananTank/eslint-plugin-validate-jsx-nesting"&gt;this eslint plugin&lt;/a&gt;) could be helpful.&lt;/p&gt;
&lt;p&gt;Either way, in such cases the obvious goal is to use semantically correct HTML elements while nesting. For example:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instead of&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Some&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Button&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/button&amp;gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Some&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/span&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;Button&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/span&amp;gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/div&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Some debugging tips &amp;amp; tricks&lt;/h3&gt;
&lt;p&gt;Soon after receiving the new hydration mismatch logs in our error tracking system (Sentry), it became clear that the most important first step in debugging them is to check whether we can reproduce them! &lt;br/&gt;
Due to the nature of React hydration errors in the production bundle, there is not much detail you can get from the error messages in Sentry. Including the &lt;a href="https://github.com/facebook/react/blob/v18.2.0/packages/react-reconciler/src/ReactInternalTypes.js#L254"&gt;&lt;code&gt;componentStack&lt;/code&gt;&lt;/a&gt; from &lt;code&gt;hydrateRoot&lt;/code&gt;'s &lt;code&gt;onRecoverableError&lt;/code&gt; callback in the logs comes in quite handy (especially after cleaning the stack a bit to make it more readable). However, because the production bundle of your application is minified and uglified, you will still have to use the provided line/column numbers together with sourcemaps to find the closest components.&lt;/p&gt;
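&lt;p&gt;As a rough illustration of what "cleaning the stack" can mean, the sketch below reduces a component stack to a plain list of component names. It assumes each frame looks like &lt;code&gt;at Name (location)&lt;/code&gt; or &lt;code&gt;at name&lt;/code&gt;, which may differ between React versions and bundles, and &lt;code&gt;cleanComponentStack&lt;/code&gt; is a hypothetical helper rather than a React API:&lt;/p&gt;

```javascript
// Hypothetical helper: turns a multi-line component stack into an array of
// component names, innermost first. Assumes frames of the form
// "at Name (location)" or "at name"; lines that do not match are skipped.
function cleanComponentStack(componentStack) {
  return componentStack
    .split("\n")
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.startsWith("at "); })
    .map(function (line) {
      // Drop the "at " prefix and any trailing "(url:line:column)" part.
      return line.slice(3).replace(/\s*\(.*\)$/, "");
    });
}

const exampleStack =
  "\n    at span\n    at Price (https://example.com/app.js:1:2345)\n    at App";
console.log(cleanComponentStack(exampleStack).join(" / "));
```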
&lt;p&gt;On top of that, if a website has dynamic content served to each user like Zalando Fashion Store, it may be even harder to reproduce the exact page (with the same content) that was receiving a specific error.&lt;/p&gt;
&lt;p&gt;Another issue we encountered was that the &lt;code&gt;onRecoverableError&lt;/code&gt; callback is usually called multiple times by React for a single hydration mismatch problem, which both pollutes our Sentry logs and makes the debugging process harder. &lt;br/&gt;
This seems to be due to &lt;a href="https://github.com/facebook/react/blob/fc929cf4ead35f99c4e9612a95e8a0bb8f5df25d/packages/react-reconciler/src/ReactFiberHydrationContext.js#L447"&gt;the way the hydration phase works&lt;/a&gt;: React compares a list of available server-rendered DOM nodes with a list of client-rendered React elements ("fibers") and tries to match them and hydrate the nodes. When matching and hydration fail for a specific node instance and errors are logged, it &lt;a href="https://github.com/facebook/react/blob/fc929cf4ead35f99c4e9612a95e8a0bb8f5df25d/packages/react-reconciler/src/ReactFiberHydrationContext.js#L474"&gt;tries to hydrate the next one&lt;/a&gt;. What we observed was that (at least in some cases) the previous mismatching node/fiber throws the two lists out of order, which makes all subsequent ones fail as well. The result is a lot of additional hydration mismatch error logs which aren't necessarily correct. &lt;br/&gt;
To mitigate this in the production environment, we modified our error tracking code to only send the first hydration error log to Sentry. We also found this to be very helpful to keep in mind during development debugging.&lt;/p&gt;
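&lt;p&gt;The "only send the first error" mitigation can be sketched as a small wrapper around the callback. This is an illustration under assumptions, not our production code: &lt;code&gt;createFirstErrorReporter&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; are hypothetical names, and &lt;code&gt;report&lt;/code&gt; stands in for whatever forwards the event to your error tracker:&lt;/p&gt;

```javascript
// Hypothetical factory: returns a callback with the same shape as
// onRecoverableError(error, errorInfo) that forwards only the first call.
function createFirstErrorReporter(report) {
  let alreadyReported = false;

  return function onRecoverableError(error, errorInfo) {
    if (alreadyReported) {
      // Later calls are likely follow-up noise from the first mismatch.
      return;
    }
    alreadyReported = true;
    const componentStack = errorInfo ? errorInfo.componentStack : undefined;
    report(error, componentStack);
  };
}

// Usage with a stand-in reporter that just collects what it receives.
const sent = [];
const onRecoverableError = createFirstErrorReporter(function (error, stack) {
  sent.push({ message: error.message, stack: stack });
});
onRecoverableError(new Error("first mismatch"), { componentStack: "\n    at App" });
onRecoverableError(new Error("follow-up mismatch"), { componentStack: "\n    at App" });
console.log(sent.length);
```

&lt;p&gt;The returned function would be passed as the &lt;code&gt;onRecoverableError&lt;/code&gt; option of &lt;code&gt;hydrateRoot&lt;/code&gt;. Note that this simplified version drops every later recoverable error, not only hydration-related ones, so a real implementation may want to filter more selectively.&lt;/p&gt;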
&lt;p&gt;If the error can be reproduced locally, we found these steps to be helpful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Work on the first error log, and after it's fixed, check if any other one remains.&lt;/li&gt;
&lt;li&gt;Based on the log and the &lt;code&gt;componentStack&lt;/code&gt;, find the closest component(s) causing the issue.&lt;/li&gt;
&lt;li&gt;In some cases the cause of the issue is obvious in the specified component's source code - for example the issue number 4 mentioned above (Invalid HTML nesting).&lt;ul&gt;
&lt;li&gt;With HTML nesting issues, the log usually contains the text &lt;code&gt;validateDOMNesting(...)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In other cases where the cause is not very obvious, what we found helpful was to check the React dev bundle (&lt;code&gt;react-dom/umd/react-dom.development.js&lt;/code&gt;) and put debuggers on places which log the hydration errors (usually the &lt;code&gt;checkForUnmatchedText&lt;/code&gt; or &lt;code&gt;throwOnHydrationMismatch&lt;/code&gt; functions).&lt;ul&gt;
&lt;li&gt;Then by loading the page, try to find out what is the exact React fiber that causes the issue, and based on that find the component/element. Don't be afraid to go higher in the stack and use more debuggers!&lt;/li&gt;
&lt;li&gt;In some cases we realized that the fiber is the same element that caused the issue, but in others, it's more confusing as the fiber is something that was rendered &lt;strong&gt;after&lt;/strong&gt; a mismatching (usually missing) node instance that was the actual cause of the issue.&lt;/li&gt;
&lt;li&gt;Here it also helps to check different variables like &lt;code&gt;fiber&lt;/code&gt;, &lt;code&gt;nextInstance&lt;/code&gt;, &lt;code&gt;current&lt;/code&gt;, etc. including their received props.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The migration to React 18 and its concurrent features was of extra importance for our Rendering Engine framework due to its unique architecture. And despite the challenges, the results have been promising so far, especially since we observed improvements in the Fashion Store website’s Core Web Vitals and bounce rate.&lt;/p&gt;
&lt;p&gt;Additionally, the upgrade shone a light on the hidden hydration mismatch issues scattered across different components, which led us to not only fix many of them, but also collect and internally document them along with recommendations and debugging tips for further reference.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;p&gt;We are planning to share more detailed posts in the future about the architecture and technical specs of Rendering Engine - especially in light of the Concurrent features. &lt;br/&gt;
Additionally, we aim to share the effects of the new features and the final architecture on Zalando Fashion Store's performance.&lt;/p&gt;
&lt;p&gt;Next up, we're excited to start using React Server Components which have shown great promise so far. Stay tuned!&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="React"/><category term="Concurrent React"/><category term="Frameworks"/><category term="Debugging"/><category term="JavaScript"/><category term="TypeScript"/><category term="Backend"/></entry><entry><title>Riptide HTTP Client tutorial</title><link href="https://engineering.zalando.com/posts/2023/06/riptide-http-client-tutorial.html" rel="alternate"/><published>2023-06-29T00:00:00+02:00</published><updated>2023-06-29T00:00:00+02:00</updated><author><name>Olga Semernitskaia</name></author><id>tag:engineering.zalando.com,2023-06-29:/posts/2023/06/riptide-http-client-tutorial.html</id><summary type="html">&lt;p&gt;Riptide: learning the fundamentals of the open source Zalando HTTP client&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Riptide logo - big ocean wave" src="https://engineering.zalando.com/posts/2023/06/images/wave.jpg#center"&gt;&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/zalando/riptide"&gt;Riptide&lt;/a&gt; is a Zalando open source Java HTTP client
that implements declarative client-side response routing.
It allows dispatching HTTP responses very easily to different handler methods based on various characteristics of the response,
including status code, status family, and content type.
The way this works is similar to server-side request routing, where any request that reaches a web application
is usually routed to the correct handler based on the combination of URI (including query and path parameters), method,
Accept and Content-Type header.
With Riptide, you can define handler methods on the client side based on the response characteristics.
See &lt;a href="https://github.com/zalando/riptide/blob/main/docs/concepts.md"&gt;the concept document&lt;/a&gt; for more details. Riptide is part of the core Java/Kotlin stack and is used in production by hundreds of applications at Zalando.&lt;/p&gt;
&lt;p&gt;In this tutorial, we'll explore the fundamentals of Riptide HTTP client. We'll learn how to initialize it and examine various use cases:
sending simple GET and POST requests, and processing different responses.&lt;/p&gt;
&lt;h2&gt;Maven Dependencies&lt;/h2&gt;
&lt;p&gt;First, we need to add the library as a dependency into the &lt;code&gt;pom.xml&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.zalando&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;riptide-core&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${riptide.version}&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Check &lt;a href="https://mvnrepository.com/artifact/org.zalando/riptide"&gt;Maven Central page&lt;/a&gt;
to see the latest version of the library.&lt;/p&gt;
&lt;h2&gt;Client Initialization&lt;/h2&gt;
&lt;p&gt;To send HTTP requests, we need to build an &lt;code&gt;Http&lt;/code&gt; object, which we can then use for all our HTTP requests to
the specified base URL:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requestFactory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SimpleClientHttpRequestFactory&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;getBaseUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Sending Requests&lt;/h2&gt;
&lt;p&gt;Sending requests using Riptide is pretty straightforward:
you need to use an appropriate method from the created &lt;code&gt;Http&lt;/code&gt; object depending on the HTTP request method.
Additionally, you can provide a request body, query params, content type, and request headers.&lt;/p&gt;
&lt;h3&gt;GET Request&lt;/h3&gt;
&lt;p&gt;Here is an example of sending a simple GET request:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;X-Foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;POST Request&lt;/h3&gt;
&lt;p&gt;POST requests can be sent just as easily:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;X-Foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MediaType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;str_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the next sections, we will explain the meanings of the &lt;code&gt;call&lt;/code&gt;, &lt;code&gt;pass&lt;/code&gt;, and &lt;code&gt;join&lt;/code&gt; methods from the code snippets above.&lt;/p&gt;
&lt;h2&gt;Response Routing&lt;/h2&gt;
&lt;p&gt;One of the main features of the Riptide HTTP client is declarative response routing.
We can use the &lt;code&gt;dispatch&lt;/code&gt; method to specify processing logic (routes) for different response types.
The &lt;code&gt;dispatch&lt;/code&gt; method accepts a &lt;code&gt;Navigator&lt;/code&gt; object as its first parameter, which specifies the response attribute
used for the routing logic.&lt;/p&gt;
&lt;p&gt;Riptide has several default &lt;code&gt;Navigator&lt;/code&gt;-s:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Navigator&lt;/th&gt;
&lt;th&gt;Response characteristic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Navigators.series()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Class of status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Navigators.status()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Navigators.statusCode()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Navigators.reasonPhrase()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reason Phrase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Navigators.contentType()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Content-Type header&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Simple Routing&lt;/h3&gt;
&lt;p&gt;Let's see how we can use response routing:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Product: &amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOT_FOUND&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Product not found&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;anyStatus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this example, we demonstrate retrieving a product by its ID and handling the responses.
We use the &lt;code&gt;Navigators.status()&lt;/code&gt; static method to route our responses based on their statuses.
We then describe processing logic for different statuses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;OK&lt;/code&gt; - we use a version of the &lt;code&gt;call&lt;/code&gt; method that deserializes the response body
into the specified type (&lt;code&gt;Product&lt;/code&gt; in our case). This deserialized object is then used as a parameter
for a consumer, which is passed as a second argument to the &lt;code&gt;call&lt;/code&gt; method.
In our example, the consumer simply logs the &lt;code&gt;Product&lt;/code&gt; object.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NOT_FOUND&lt;/code&gt; - we assume that we won't receive a &lt;code&gt;Product&lt;/code&gt; response, so we use
another version of the &lt;code&gt;call&lt;/code&gt; method with a single argument: a consumer accepting &lt;code&gt;org.springframework.http.client.ClientHttpResponse&lt;/code&gt;.
In this scenario, we decide to log a warning message.&lt;/li&gt;
&lt;li&gt;All other statuses we intend to process in the same way. To achieve this, we use the &lt;code&gt;Bindings.anyStatus()&lt;/code&gt; static function,
which lets us describe the processing logic for all remaining statuses. In our case, we have decided that no action
is required for such statuses, so we use the &lt;code&gt;PassRoute.pass()&lt;/code&gt; static method, which returns a do-nothing handler.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Riptide all requests are sent using an &lt;code&gt;Executor&lt;/code&gt; (configured in the &lt;code&gt;executor&lt;/code&gt; method in the &lt;strong&gt;Client initialization&lt;/strong&gt; section).
Because of this, responses are always processed in separate threads and the
&lt;code&gt;dispatch&lt;/code&gt; method returns &lt;code&gt;CompletableFuture&amp;lt;ClientHttpResponse&amp;gt;&lt;/code&gt;. To make the invoking thread wait
for the response to be processed, we use the &lt;code&gt;join()&lt;/code&gt; method in our example.&lt;/p&gt;
&lt;h3&gt;Nested Routing&lt;/h3&gt;
&lt;p&gt;We can have nested (multi-level) routing for our responses. For example, the first level of routing can be based
on the response &lt;code&gt;series&lt;/code&gt;, and the second level - on specific status codes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SUCCESSFUL&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Product: &amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ERROR&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOT_FOUND&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Product not found&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOO_MANY_REQUESTS&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RuntimeException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Too many reservation requests&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);}),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;anyStatus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;())),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SERVER_ERROR&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RuntimeException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Server error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);}),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;anySeries&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pass&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the example above, we implement nested routing. First, we dispatch our responses based on the &lt;code&gt;series&lt;/code&gt; using the
static method &lt;code&gt;Navigators.series()&lt;/code&gt;, and then we dispatch &lt;code&gt;CLIENT_ERROR&lt;/code&gt; responses based on their specific statuses.
For other series such as &lt;code&gt;SUCCESSFUL&lt;/code&gt;, we utilize a single handler per series without any nested routing.&lt;/p&gt;
&lt;p&gt;Similar to the previous example, we use the &lt;code&gt;PassRoute.pass()&lt;/code&gt; static method to skip actions for certain cases.
Additionally, we use the &lt;code&gt;Bindings.anyStatus()&lt;/code&gt; and &lt;code&gt;Bindings.anySeries()&lt;/code&gt; methods to define default behavior
for all series or statuses that are not explicitly described. Furthermore, in this example we've chosen to throw
exceptions for specific cases (see the &lt;code&gt;TOO_MANY_REQUESTS&lt;/code&gt; status and &lt;code&gt;SERVER_ERROR&lt;/code&gt; series routes).
These exceptions can then be caught and processed in the invoking code.&lt;/p&gt;
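&lt;p&gt;To illustrate how an exception thrown in a route might reach the invoking code, here is a minimal sketch that uses only the JDK. It assumes the future returned by &lt;code&gt;dispatch&lt;/code&gt; behaves like a standard &lt;code&gt;CompletableFuture&lt;/code&gt;; the class &lt;code&gt;JoinFailureDemo&lt;/code&gt; and its method are hypothetical names, and the failing stage merely stands in for a throwing route.&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical demo class: it uses only the JDK, not Riptide itself.
class JoinFailureDemo {

    // Returns the message of the exception thrown inside the asynchronous stage.
    static String describeFailure() {
        // Stands in for a route that throws, like the TOO_MANY_REQUESTS route.
        var future = CompletableFuture.runAsync(() -> {
            throw new RuntimeException("Too many reservation requests");
        });
        try {
            future.join();
            return "no failure";
        } catch (CompletionException e) {
            // join() wraps the original exception; the cause is what the stage threw.
            return e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure());
    }
}
```

&lt;p&gt;With a plain &lt;code&gt;CompletableFuture&lt;/code&gt;, &lt;code&gt;join()&lt;/code&gt; reports the failure as a &lt;code&gt;CompletionException&lt;/code&gt;, so the invoking code usually inspects &lt;code&gt;getCause()&lt;/code&gt; to find the exception that was originally thrown.&lt;/p&gt;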
&lt;h2&gt;Returning Response Objects&lt;/h2&gt;
&lt;p&gt;In some cases we need to return a response object from a REST endpoint invocation. The &lt;code&gt;riptide-capture&lt;/code&gt; module lets us do so.&lt;/p&gt;
&lt;p&gt;Let's take a look at a simple example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;ClientHttpResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;clientHttpResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Product: {}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;anyStatus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RuntimeException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Invalid status&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);}))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As mentioned earlier, when we invoke the &lt;code&gt;dispatch&lt;/code&gt; method, it returns a &lt;code&gt;CompletableFuture&amp;lt;ClientHttpResponse&amp;gt;&lt;/code&gt;.
If we then invoke the &lt;code&gt;join()&lt;/code&gt; method and wait for the result of the invocation, we get an object of type &lt;code&gt;ClientHttpResponse&lt;/code&gt;.
However, with the help of the &lt;code&gt;riptide-capture&lt;/code&gt; module, we can return a deserialized object from
the response body instead. In our example, the deserialized object is of type &lt;code&gt;Product&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;First, we need to add a dependency for the &lt;code&gt;riptide-capture&lt;/code&gt; module:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.zalando&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;riptide-capture&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${riptide.version}&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now let's rewrite the previous example using the &lt;code&gt;Capture&lt;/code&gt; class. This class allows us to extract a value of
a specified type from the response body:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Capture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Capture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/products/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;anyStatus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RuntimeException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Invalid status&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);}))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;thenApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this example, we pass the &lt;code&gt;capture&lt;/code&gt; object to the route for the &lt;code&gt;OK&lt;/code&gt; status. The purpose of the &lt;code&gt;capture&lt;/code&gt; object
is to deserialize the response body into a &lt;code&gt;Product&lt;/code&gt; object and store it for future use.
Then we invoke the &lt;code&gt;thenApply(capture)&lt;/code&gt; method to retrieve the stored &lt;code&gt;Product&lt;/code&gt; object. The &lt;code&gt;thenApply(capture)&lt;/code&gt; method
returns a &lt;code&gt;CompletableFuture&amp;lt;Product&amp;gt;&lt;/code&gt;, so we can again use the &lt;code&gt;join()&lt;/code&gt; method
to get a &lt;code&gt;Product&lt;/code&gt; object, as we did in the previous examples.
See also &lt;a href="https://github.com/zalando/riptide/tree/main/riptide-capture"&gt;the riptide-capture module page&lt;/a&gt; for more details.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this article, we've demonstrated the fundamental use cases of the Riptide HTTP client.
You can find the code snippets with complete imports on &lt;a href="https://github.com/zalando-incubator/riptide-demo/tree/main/src/test/java/org/zalando/fundamentals"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In future articles, we'll explore Riptide plugins, which provide additional logic for your REST client,
such as retries, authorization, and metrics publishing. Additionally, we'll look at the Riptide Spring Boot starter,
that simplifies an &lt;code&gt;Http&lt;/code&gt; object initialization.&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="Java"/><category term="REST"/><category term="Riptide"/><category term="Backend"/></entry><entry><title>Context Based Experience in Zalando</title><link href="https://engineering.zalando.com/posts/2023/06/context-based-experience-in-zalando.html" rel="alternate"/><published>2023-06-26T00:00:00+02:00</published><updated>2023-06-26T00:00:00+02:00</updated><author><name>Shlomi Israel</name></author><id>tag:engineering.zalando.com,2023-06-26:/posts/2023/06/context-based-experience-in-zalando.html</id><summary type="html">&lt;p&gt;Using context-aware decisions to provide partner-tailored experiences, and how we achieved this for our selective distribution brands&lt;/p&gt;</summary><content type="html">&lt;p&gt;In 2022 we developed a unique partner experience that speaks to dedicated requirements from selective distribution brands and retailers around visual representation, brand storytelling and protecting brand equity. Our solution provides dedicated brand exposure across the experience and at the same time respects special requirements to secure brand equity. In order to achieve consistency with other articles, a general context-aware mechanism needed to be implemented.&lt;/p&gt;
&lt;p&gt;We derived a plan to create distinction and elevation in the experience. The criteria for enabling an experience are based on explicit customer intent. For instance, searching for the retailer name or one of its brands will enable the elevated experience. Viewing their product details page will also enable it. These intentions are identified by our backend systems with specific business domain rules, i.e. the Search backend will have different rules from the Product backend.&lt;/p&gt;
&lt;p&gt;Until then, the Fashion Store was based solely on domain-specific data. These new rules, defined by customer intent and context, introduced new challenges in Zalando and required a new solution. For instance, the same product can behave differently depending on that context. While viewing the catalog without any intent for a brand-distinctive experience, all products, including ones belonging to other distribution brands, have a gray background for the sake of consistency, even though the brand's elevated experience may dictate, for example, a white background.&lt;/p&gt;
&lt;p&gt;To achieve this, we needed to identify what to apply for each use case (the brand's requirements) and when to apply it (which rules to check in order to understand the customer's context or intent).&lt;/p&gt;
&lt;p&gt;Brand requirements can be a complicated matter. We identified some which were global on the merchant level; for instance, let's say one of the distribution brands is required to have different packshot images, with white backgrounds, whilst we typically use gray backgrounds in Zalando. Other requirements are brand-specific. Some brands are only to be shown in the product catalog when the brand or its products are explicitly requested by specific search queries or via catalog filters.&lt;/p&gt;
&lt;p&gt;In order to support different kinds of requirements, we use the concept of &lt;em&gt;experiences&lt;/em&gt;. Experiences are simply a collection of policies that we need to apply, and a list of selection rules.&lt;/p&gt;
&lt;p&gt;For example, a policy may be the theme configuration that needs to be applied, or whether we are allowed to show the product under certain conditions. The selection rules define the criteria that enable the experience, e.g. selection by brand codes. This means that selecting a specific brand in the brand filter will change the experience to the one that has been configured for that brand.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;XP_ID&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;XP_NAME&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;policies&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;THEME&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;THEME_NAME&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;theme_config1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;PRODUCT__FLAGS__HIDE_SALE&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;selection_metadata&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;experience_brands&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;brand_code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;value&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;BRANDNAME&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Selection rules can be another complicated matter. For instance, how do we decide which experience to choose when two brands belong to different experiences? Supporting the right use cases for the business needs, whilst keeping things simple, is key. Our approach to solving some of these cases is to define &lt;em&gt;Fallback&lt;/em&gt; experiences that catch them.&lt;/p&gt;
&lt;p&gt;As mentioned in other &lt;a href="/tags/microservices.html"&gt;posts&lt;/a&gt; on the Zalando Engineering Blog, Zalando has many microservices, and even our &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Frontend’s architecture&lt;/a&gt; is based on micro frontends. We defined the general data structure to describe the experience, but how can we orchestrate it across Zalando's ecosystem?&lt;/p&gt;
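&lt;p&gt;One possible selection rule can be sketched as follows. This is an illustration under stated assumptions, not the actual Zalando implementation: the types &lt;code&gt;Experience&lt;/code&gt; and &lt;code&gt;ExperienceResolver&lt;/code&gt;, the brand codes, and the rule itself (use an experience only when every requested brand maps to it, otherwise use the fallback) are all hypothetical.&lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical model: an experience and the brand codes from its selection rules.
record Experience(String id, String... brandCodes) {
    boolean matches(String brandCode) {
        return Arrays.asList(brandCodes).contains(brandCode);
    }
}

class ExperienceResolver {

    // Returns the single experience shared by all requested brands, or the fallback.
    static Experience resolve(Experience[] experiences, Experience fallback, String... requestedBrands) {
        Experience resolved = null;
        for (String brand : requestedBrands) {
            // A brand without a configured experience maps to the fallback.
            Experience match = fallback;
            for (Experience candidate : experiences) {
                if (candidate.matches(brand)) {
                    match = candidate;
                    break;
                }
            }
            if (resolved == null) {
                resolved = match;
            } else if (resolved != match) {
                // The brands point at different experiences: use the fallback.
                return fallback;
            }
        }
        return resolved == null ? fallback : resolved;
    }

    public static void main(String[] args) {
        Experience fallback = new Experience("DEFAULT");
        Experience[] all = {
            new Experience("XP_A", "BRAND_A1", "BRAND_A2"),
            new Experience("XP_B", "BRAND_B")
        };
        // Two brands from different experiences resolve to the fallback.
        System.out.println(resolve(all, fallback, "BRAND_A1", "BRAND_B").id());
    }
}
```

&lt;p&gt;Keeping the conflict case in a single fallback keeps the rule easy to reason about; a real system may well need finer-grained priorities.&lt;/p&gt;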
&lt;p&gt;In order to get into that, we need to break down the flow into two steps. The first one is the &lt;em&gt;Experience Resolution&lt;/em&gt; step. This starts very early &lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;during the root entity resolution&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let's say that a customer browses a catalog page. This will send a request to the Rendering Engine, which will resolve the root entity by sending a request to the Fashion Store API (GraphQL), which will then query the Catalog backend system. The catalog has its own business logic to understand the customer’s intent and it will find the best matching experience, using its &lt;code&gt;selection_metadata&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The resolved experience name is then stored in the Rendering Engine request state.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Root Entity Experience Resolution" src="https://engineering.zalando.com/posts/2023/06/images/root-entity-resolution.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 1. Root Entity Experience Resolution&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;At this point we have only resolved the root entity. We don’t yet know which renderers (micro-frontends) are required. During this process, we start the second step, where each of them queries the Fashion Store API independently, and this time the query uses the previously resolved experience. In the catalog, we have product cards, whose data is populated by a different backend, the Product backend. As we have already resolved the experience, the Product backend can now understand which policies are required. For Zalando’s experience it will select the gray background images with the watermark, instead of the white ones.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Child Renderers with stored experience" src="https://engineering.zalando.com/posts/2023/06/images/child-renderer-resolution.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 2. Child Renderers are reusing the previously resolved experience&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;Using this new mechanism, we successfully managed to introduce new concepts to Zalando. It has opened a door for so many new possibilities that we can leverage to further enhance the customer experience.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/></entry><entry><title>How Software Bill of Materials change the dependency game</title><link href="https://engineering.zalando.com/posts/2023/04/how-sboms-change-the-dependency-game.html" rel="alternate"/><published>2023-04-13T00:00:00+02:00</published><updated>2023-04-13T00:00:00+02:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2023-04-13:/posts/2023/04/how-sboms-change-the-dependency-game.html</id><summary type="html">&lt;p&gt;In this post, we explain what questions and insights Software Bill of Materials (SBOMs) provide across thousands of microservices&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Dependency hygiene&lt;/h2&gt;
&lt;p&gt;Dependency updates are a tedious task when maintaining thousands of microservices.
Some teams use tools like &lt;a href="https://github.com/dependabot"&gt;dependabot&lt;/a&gt; or &lt;a href="https://github.com/scala-steward-org/scala-steward"&gt;scala-steward&lt;/a&gt; that create pull requests in repositories when new library versions are available. Other teams update dependencies regularly in bulk, supported by build system plugins (e.g. &lt;a href="https://www.mojohaus.org/versions-maven-plugin/"&gt;versions-maven-plugin&lt;/a&gt;, &lt;a href="https://github.com/ben-manes/gradle-versions-plugin"&gt;gradle-versions-plugin&lt;/a&gt;). Playing the catch-up game and getting some visibility through incoming pull requests or changes is far from great, though, and we can do better here.&lt;/p&gt;
&lt;h2&gt;On the importance of dependency data and hygiene&lt;/h2&gt;
&lt;p&gt;What's needed for dependency management is the ability to get a complete picture of the dependencies in use and to analyze trends over time. This granular data allows teams to step up their game.&lt;/p&gt;
&lt;p&gt;Critical vulnerabilities in commonly used libraries (e.g. log4j, spring, commons-text) require an ability to find all affected applications in minutes. Only this way can the impact of a vulnerability be assessed and mitigated quickly. Some projects, like openssl, preannounce security updates allowing for more preparation time.&lt;/p&gt;
&lt;p&gt;Similarly, upgrades to major versions of libraries, changes in licensing of open-source libraries (for example Akka) create the need to understand the library footprint to assess the need for action or migration costs. Bugs in libraries tend to eventually trigger production incidents and it's necessary to have a way to find all affected teams, track progress of patches across all applications, and identify reasons why teams struggle to keep up.&lt;/p&gt;
&lt;p&gt;At Zalando, we use &lt;strong&gt;Software Bill of Materials&lt;/strong&gt; (SBOMs) to help answer various questions about application dependencies. We publish a curated data set containing dependency data from the SBOM for every application we deploy, based on its container image. The data set is available in our data lake and thus can be easily queried and visualized by any engineer.&lt;/p&gt;
&lt;h2&gt;What are SBOMs?&lt;/h2&gt;
&lt;p&gt;The Software Bill of Materials contains information about the packages and libraries used by an application. It can be generated for an application based on its source code or extracted from a Docker container. The SBOM includes packages used by the operating system as well as the application and its dependencies. For each entry, the name, version, and license are tracked. Common formats like &lt;a href="https://cyclonedx.org/specification/overview/"&gt;CycloneDX&lt;/a&gt; or &lt;a href="https://github.com/spdx/spdx-spec/blob/v2.2/schemas/spdx-schema.json"&gt;SPDX&lt;/a&gt; help with portability and integration into various tooling. For example, &lt;a href="https://github.com/anchore/syft"&gt;syft&lt;/a&gt; can generate an SBOM file that can be further parsed with &lt;a href="https://github.com/anchore/grype"&gt;grype&lt;/a&gt; to periodically scan the application's SBOMs for vulnerabilities. On top of that, GitHub recently introduced an &lt;a href="https://github.blog/2023-03-28-introducing-self-service-sboms/"&gt;on-demand SBOM generation&lt;/a&gt; feature.&lt;/p&gt;
&lt;p&gt;The SBOM needs to be generated with every software change, for example as part of the CI/CD pipeline. Some countries recommend or even mandate the use of SBOMs in certain scenarios in order to manage cyber security and software supply chain risks (see &lt;a href="https://media.defense.gov/2022/Sep/01/2003068942/-1/-1/0/ESF_SECURING_THE_SOFTWARE_SUPPLY_CHAIN_DEVELOPERS.PDF"&gt;Securing the Software Supply Chain: Recommended Practices Guide for Developers&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;What questions can the SBOM help to answer?&lt;/h2&gt;
&lt;p&gt;In the context of dependency management, SBOMs collected for all applications help us answer a variety of questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which applications use dependency X (in version Y)?&lt;/li&gt;
&lt;li&gt;How many distinct versions of dependency X do we use across all applications?&lt;/li&gt;
&lt;li&gt;Does the dependency hygiene differ per language?&lt;/li&gt;
&lt;li&gt;How quickly after release are new versions of libraries adopted? Does adoption differ for versions that have known security vulnerabilities?&lt;/li&gt;
&lt;li&gt;When adopting a new Docker base image, what are its contents?&lt;/li&gt;
&lt;li&gt;Which applications have dependencies licensed under license X?&lt;/li&gt;
&lt;li&gt;Which distinct licenses are being used by application dependencies?&lt;/li&gt;
&lt;/ul&gt;
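&lt;p&gt;The first two questions can be sketched in a few lines of Java. This is only an in-memory stand-in for a query against the data set: the record &lt;code&gt;DependencyRow&lt;/code&gt;, its fields, and the class &lt;code&gt;SbomQueries&lt;/code&gt; are hypothetical and do not reflect the actual schema in our data lake.&lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical row of the data set: one dependency of one application.
record DependencyRow(String application, String name, String version) {}

class SbomQueries {

    // Which applications use dependency X in version Y?
    static String[] applicationsUsing(DependencyRow[] rows, String name, String version) {
        return Arrays.stream(rows)
                .filter(row -> row.name().equals(name))
                .filter(row -> row.version().equals(version))
                .map(DependencyRow::application)
                .distinct()
                .sorted()
                .toArray(String[]::new);
    }

    // How many distinct versions of dependency X are in use across all applications?
    static long distinctVersions(DependencyRow[] rows, String name) {
        return Arrays.stream(rows)
                .filter(row -> row.name().equals(name))
                .map(DependencyRow::version)
                .distinct()
                .count();
    }
}
```

&lt;p&gt;The same filters translate directly into a &lt;code&gt;WHERE&lt;/code&gt; clause and a &lt;code&gt;COUNT(DISTINCT ...)&lt;/code&gt; when the data is queried with SQL instead.&lt;/p&gt;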
&lt;p&gt;From Docker image metadata, we can infer the owning team and thus target communication when reaching out to teams. For large-scale patch actions (like the famous log4j upgrade), we prepare change sets for different types of build files and automate the Pull Request creation across all repositories. This allows for central tracking of the patch progress and requires minimal support from the team for the deployment.&lt;/p&gt;
&lt;p&gt;Another insight from analyzing the SBOM data was our usage of the AWS SDK. We noticed that some applications were using the full SDK (200MB+ in Java) instead of its individual modules. Addressing this finding helped reduce build times and lower the resulting Docker image size significantly.&lt;/p&gt;
&lt;h2&gt;Show me real data!&lt;/h2&gt;
&lt;p&gt;Our diverse application footprint across languages allows us to compare the number of libraries that typical applications have.
Looking at the data, the number of dependencies grows exponentially. Here is an example for Python:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Number of dependencies in Python applications" src="https://engineering.zalando.com/posts/2023/04/images/sbom-python-dependencies-per-application.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 1. Number of dependencies in Python applications&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;Looking across languages, we have two outliers with the largest number of dependencies.
For Python it's jupyter (2.5x the next biggest app) and for Java it's tableau (3.14x the next biggest app).&lt;/p&gt;
&lt;p&gt;To compare how hungry each language ecosystem is for dependencies, we can plot the percentiles for the number of dependencies per application. Python wins the race with the lowest number of dependencies, followed by golang (ca. 1.4-2x when compared to Python). Next in line is Java (covers Java, Kotlin, Scala as the SBOM scanner detects java-archives) with 2-3x more dependencies than golang and lastly JavaScript (incl. TypeScript) with 5-10x more dependencies than Java.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Number of dependencies per language" src="https://engineering.zalando.com/posts/2023/04/images/dependencies-per-language.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 2. Number of dependencies per language&lt;/figcaption&gt;
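&lt;p&gt;For readers who want to reproduce such a comparison on their own data, here is a minimal sketch of a percentile calculation over per-application dependency counts. It assumes the nearest-rank definition of a percentile, which is not necessarily the method used for the plot above, and &lt;code&gt;DependencyPercentiles&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```java
import java.util.Arrays;

class DependencyPercentiles {

    // Nearest-rank percentile: the smallest value with at least p percent
    // of all values at or below it. p must be between 0 and 100.
    static int percentile(int[] dependencyCounts, int p) {
        if (dependencyCounts.length == 0) {
            throw new IllegalArgumentException("no applications");
        }
        if (p > 100 || 0 > p) {
            throw new IllegalArgumentException("p must be between 0 and 100");
        }
        int[] sorted = dependencyCounts.clone();
        Arrays.sort(sorted);
        // Integer ceiling of p * n / 100, clamped to at least rank 1.
        int rank = Math.max(1, (p * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```

&lt;p&gt;Computing the same percentiles (for example the 50th and the 90th) per language gives values that can be compared across ecosystems, while a single outlier only affects the highest percentiles.&lt;/p&gt;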

&lt;h3&gt;Another popular library used across Java and Kotlin projects&lt;/h3&gt;
&lt;p&gt;This example highlights the challenge with long-term maintenance of a large application footprint. As the frequency of changes to an application reduces, it's more difficult for teams to plan dependency updates for those applications, unless there are security issues to address. The following graph looks at the usage of an internal library with three data snapshots.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Usage of an internal library plotted over time" src="https://engineering.zalando.com/posts/2023/04/images/internal-library-usage.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Fig 3. Usage of an internal library&lt;/figcaption&gt;

&lt;p&gt;&lt;br/&gt;We can see that versions 0.22.0+ exhibit expected behavior by being replaced with the next available version. On the other hand, usage of version 0.21.0 constantly increases, even though three newer versions are available in Q4. This situation requires further inspection. It is likely that new applications are created by using the same application template, which misses the dependency update.&lt;/p&gt;
&lt;h2&gt;SBOM Data quality&lt;/h2&gt;
&lt;p&gt;The SBOM data quality varies. For the JVM languages, we observed differing package names and group ids being detected. This increases the complexity of correlating library use across languages. Further, some SBOMs did not show any java-archive entries, because the team's build process flattened all dependencies into an uber-jar and the metadata needed for library detection was lost. Hence, we recommend caution when using SBOM tools and double-checking that the SBOM generation works correctly for all applications.&lt;/p&gt;
&lt;h2&gt;Summary and future outlook&lt;/h2&gt;
&lt;p&gt;In addition to smaller findings like the one with AWS SDK, the value of SBOMs has already been proven by the very short time it takes us to analyze the impact of the Akka license change or CVEs.&lt;/p&gt;
&lt;p&gt;We look to dive deeper into our SBOM data as we collect more historical data. Aside from observing trends on library usage and adoption, we hope to be able to correlate dependency data with dependency hygiene practices, deployment frequency, change failure rates, and lead times for each application. For our shared libraries, we aim to understand how to help reduce the burden of dependency updates, acknowledging that plugin adoption is insufficient to maintain a healthy dependency posture.&lt;/p&gt;
&lt;p&gt;If you're not using SBOMs for dependency analysis yet, you're missing out on a great tool helping you to create more transparency. We're curious to read your stories and insights on SBOMs.&lt;/p&gt;</content><category term="Zalando"/><category term="Open Source"/><category term="Microservices"/><category term="Java"/><category term="Kotlin"/><category term="Scala"/><category term="Golang"/><category term="JavaScript"/><category term="TypeScript"/><category term="Python"/><category term="Backend"/><category term="Frontend"/></entry><entry><title>Gender Equity in IT Panel by Zalando Women in Tech Employee Resource Group</title><link href="https://engineering.zalando.com/posts/2023/04/gender-equity-in-it-panel-women-in-tech.html" rel="alternate"/><published>2023-04-12T00:00:00+02:00</published><updated>2023-04-12T00:00:00+02:00</updated><author><name>Anja Bergner</name></author><id>tag:engineering.zalando.com,2023-04-12:/posts/2023/04/gender-equity-in-it-panel-women-in-tech.html</id><summary type="html">&lt;p&gt;Three Women in Tech leaders discuss Gender Equity in IT on a discussion panel organized by our Women in Tech Employee Resource Group.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Our panelists on stage, from left to right: Ana Peleteiro Ramallo (host), Tian Su, Joyce Chen" src="https://engineering.zalando.com/posts/2023/04/images/gender-equity-1.jpeg#center"&gt;&lt;/p&gt;
&lt;p&gt;As part of their week-long International Women's Day event series, the Zalando Women's Network and the Zalando Women in Tech Employee Resource Groups recently held an event to discuss the challenges that women in tech face in the workplace and to share ideas about how to overcome them. We welcomed women in tech leadership to the panel, who shared their experiences and insights into the world of work: Joyce Chen, VP Engineering Beauty; Tian Su, VP Customers, and host Ana Peleteiro Ramallo, Director of Applied Science.&lt;/p&gt;
&lt;p&gt;Joyce Chen shared her past experience of being the first woman engineer in an all-men engineering group. She acknowledged that unconscious bias education has made progress over the last 10 years, and that she now has the language to describe what she went through. However, she also noted that the ratio of women to men in engineering, particularly in leadership positions, is still not good enough. To overcome this, Joyce shared the importance of mentoring, sponsorship, and reskilling.&lt;/p&gt;
&lt;p&gt;Joyce also acknowledged that she often feels like she needs to work harder to prove her worth in a field dominated by men. She highlighted that this is a common feeling among women, and it stems from historic biases that still exist today.
&lt;em&gt;"To overcome this feeling: network, seek mentorship, believe in yourself, and empower yourself to achieve greatness."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Tian Su highlighted, &lt;em&gt;"Men have historically been in leadership positions and therefore shaped society's perception of what good leadership looks like. This is why leadership is often seen through masculine traits. By bringing diversity into leadership, we can get different leadership styles, which can be beneficial for everyone."&lt;/em&gt; Tian also discussed the challenges in a former company of being the only mother on her team, which meant that she was not always able to attend social and training events after work. However, when she shared this with her former team, they realised that they hadn't considered this at all! They took the time and care to understand her situation, and they improved.&lt;/p&gt;
&lt;p&gt;Ana Peleteiro Ramallo explained, &lt;em&gt;"The way we think we need to behave at work is shaped by the leadership styles we see around us. It's important to bring clarity and your own perspective to your manager in order to help them understand your point of view"&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Our panelists on stage, from left to right: Ana Peleteiro Ramallo (host), Tian Su, Joyce Chen" src="https://engineering.zalando.com/posts/2023/04/images/gender-equity-2.jpeg#center"&gt;&lt;/p&gt;
&lt;p&gt;The panelists also discussed the importance of role models, allies, and mentoring in helping women to succeed in the workplace. Joyce stressed the need for sponsorship and support, and encouraged allies to speak up and amplify women's voices. Tian noted that her husband is her biggest ally, and that intentional outreach from colleagues who are men can also make a difference. Ana emphasized the importance of finding allies who understand you and are willing to listen.&lt;/p&gt;
&lt;p&gt;The event then opened to a Q&amp;amp;A session, and the panel was asked how to build resilience and overcome unconscious bias. Ana stressed the importance of communicating your perspectives and raising your voice when necessary, while Tian suggested taking conversations to a 1:1 setting to create a safe and open environment. Joyce emphasized the need for transparency and training, starting from the interview stage.&lt;/p&gt;
&lt;p&gt;Overall, the event was a great opportunity to share ideas and support women in the workplace. By continuing to have these conversations and advocating for change, we can work towards a more equitable and inclusive future for all. Thanks to the Zalando Women's Network and the Women in Tech Employee Resource Groups for organizing this session, and the panelists for sharing their experiences and thoughts with us!&lt;/p&gt;</content><category term="Zalando"/><category term="Culture"/><category term="Women in Tech"/></entry><entry><title>Applied Methods from Mathematical Optimization and Machine Learning in E-commerce</title><link href="https://engineering.zalando.com/posts/2023/02/gor-workshop.html" rel="alternate"/><published>2023-02-21T00:00:00+01:00</published><updated>2023-02-21T00:00:00+01:00</updated><author><name>Amin Jorati</name></author><id>tag:engineering.zalando.com,2023-02-21:/posts/2023/02/gor-workshop.html</id><summary type="html">&lt;p&gt;Report from a workshop hosted by Zalando in October 2022&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last year, Zalando hosted the 106th meeting of the &lt;a href="https://www.gor-ev.de/"&gt;Gesellschaft für Operations Research e.V. (German Society of Operations Research)&lt;/a&gt; working group on &lt;a href="https://www.gor-ev.de/arbeitsgruppen/praxis-der-mathematischen-optimierung/praxis-der-mathematischen-optimierung-meetings"&gt;Practice of Mathematical Optimization&lt;/a&gt;. The workshop took place October 6-7, 2022 at the Zalando Headquarters in Berlin.&lt;/p&gt;
&lt;h2&gt;Applied Methods from Mathematical Optimization and Machine Learning&lt;/h2&gt;
&lt;p&gt;Techniques from the field of mathematical optimization on the one hand and from machine learning on the other hand have been crucial components in delivering solutions to customers in the e-commerce industry. Serving over 50 million customers and delivering a quarter of a billion orders last year, Zalando is one of the largest online retail stores in Europe.
 Operating at such a large scale gives rise to a plethora of technical problems within these two fields that our applied scientists tackle across various teams. Thus, Zalando was uniquely positioned to host this workshop at the confluence of these two scientific fields, titled "Applied Methods from Mathematical Optimization and Machine Learning in E-commerce".
The workshop included a number of talks by representatives from industry and academia from all over Germany. The presentations included applications ranging from forecasting to network design, pricing, logistics, scheduling, and vehicle routing, among others. See &lt;a href="http://www.gor-ev.de/wp-content/uploads/2022/10/PMO106-invitation.pdf"&gt;the full program&lt;/a&gt; of the workshop for more details.&lt;/p&gt;
&lt;p&gt;The event took place in hybrid mode with streaming available for virtual attendees and presenters.
The majority of participants, around sixty, attended the event in person.
They took advantage of the various networking opportunities during coffee breaks, the conference dinner and a tour of the historic East Side Gallery, the largest remaining section of the Berlin Wall, right across from the workshop venue at Zalando headquarters in Berlin.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Group Picture" src="https://engineering.zalando.com/posts/2023/02/images/gor-workshop.jpg#center"&gt;&lt;/p&gt;
&lt;p&gt;Applied Scientists from Zalando presented two different use cases at the confluence of optimization and ML in the workshop. The pricing team gave a talk about challenges in large-scale article discounting, while the logistics team made a presentation about stock distribution and its challenges.&lt;/p&gt;
&lt;h2&gt;Pricing&lt;/h2&gt;
&lt;p&gt;The pricing team is responsible for the science behind offering attractive prices to customers.
Their talk about &lt;a href="https://github.com/zalando/public-presentations/blob/master/files/2022-10-06-GOR-PRT_presentation.pdf"&gt;Challenges in Large Scale Article Discounting&lt;/a&gt; gave a glimpse into the
multitude of challenges connected to discounting across the entirety of Zalando's assortment.&lt;/p&gt;
&lt;p&gt;Even with proven machinery that manages to recommend millions of discounts under given business targets,
many pitfalls have to be avoided.
We discussed the following complications and mentioned potential treatments.&lt;/p&gt;
&lt;h3&gt;Forecasting Challenges&lt;/h3&gt;
&lt;p&gt;The demand for niche articles, typically with just a few sales per month, is hard to predict accurately.
Moreover, articles with many sizes, e.g. jeans with many length and width combinations, can behave like multiple separate articles: different customers consider purely their own size, which creates a demand only on certain sizes.
On top of that, some costs like shipping and returns are a mixed calculation based on the collection of articles handled together.&lt;/p&gt;
&lt;h3&gt;Optimization Challenges&lt;/h3&gt;
&lt;p&gt;An optimization model has to respect the business setup in its decisions.
Several constraints were created so that the model follows business decisions, e.g. the model has to sell to customers in a sales period even if it would be more profitable to keep items now for sales in the future.
Without them, the model could propose taking an article offline for a certain period, or prefer to sell more strongly in countries where shipment costs are lower.
On the technical side, some optimization problems can become infeasible due to incompatible business targets and require adjustment recommendations.&lt;/p&gt;
&lt;h3&gt;Processes and Measuring&lt;/h3&gt;
&lt;p&gt;Further considerations stem from the connected processes around pricing.
Matching competitors' prices, incorporating sales events and warehouse capacities
are crucial in order to recommend profitable discounts.
Ultimately, the impact has to be measured via A/B testing.
When it comes to pricing, we have to carefully set it up to rule out customer discrimination by different prices and to enable gathering valuable insights.&lt;/p&gt;
&lt;h2&gt;Logistics&lt;/h2&gt;
&lt;p&gt;The logistics team delivered a talk titled &lt;a href="https://github.com/zalando/public-presentations/blob/master/files/2022-10-07-GOR-Alea-Kea-Waffle_presentation.pdf"&gt;Mathematical Optimization Meets Machine Learning to Optimize Stock Distribution&lt;/a&gt;.
Zalando operates a network of interconnected warehouses and return centers serving its customer base across Europe. In order to best serve our customers, we need to make our stock available where and when they desire it. This requires listening to our customers' demands and distributing stock across our network and within each facility accordingly. In this talk, we outlined the challenges at the core of this stock distribution problem and dived deep into some technical aspects.&lt;/p&gt;
&lt;h3&gt;Demand Forecasting&lt;/h3&gt;
&lt;p&gt;We model demand prediction as a time series forecasting problem at the individual article level for each of the markets we are active in for any given day. We produce probabilistic forecasts for each such problem using a deep recurrent neural network. Challenges abound in demand forecasting for the fashion industry, where articles have fast turnover due to seasonality, the fast-moving nature of fashion, and the diversity of trends in our vast customer base. This probabilistic demand forecast is used as input to solve two major optimization problems: (i) Item Network Distribution Problem: how best to distribute our stock across our facilities, and (ii) In-warehouse Item Relocation Problem: how best to position our articles within each facility.&lt;/p&gt;
&lt;h3&gt;Item Network Distribution&lt;/h3&gt;
&lt;p&gt;In the item network distribution problem, items are moved between warehouses: We need to ensure that for each country, the warehouses serving that country have the article assortment and stock quantities that best fulfill the country's expected demand. Our objectives are to maximize sales and minimize delivery times and costs. We discussed the algorithm currently used to make distribution decisions and presented some results.&lt;/p&gt;
&lt;h3&gt;In-warehouse Item Relocation&lt;/h3&gt;
&lt;p&gt;The in-warehouse item relocation problem is defined at the warehouse level. A warehouse contains various storage areas with different capacities and speed for collecting one item. Given a constant stream of incoming and outgoing items, we can relocate items between storage areas to achieve a distribution that is optimal for the demand reduced to a warehouse. We presented a formalization of the problem and prospective approaches to solve it.&lt;/p&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="Research"/><category term="Data Science"/><category term="Zalando Science"/></entry><entry><title>How we manage our 1200 incident playbooks</title><link href="https://engineering.zalando.com/posts/2023/01/how-we-manage-our-1200-incident-playbooks.html" rel="alternate"/><published>2023-01-31T00:00:00+01:00</published><updated>2023-01-31T00:00:00+01:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2023-01-31:/posts/2023/01/how-we-manage-our-1200-incident-playbooks.html</id><summary type="html">&lt;p&gt;We consolidated our incident playbooks in September 2019. 1200 playbooks later...&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Zalando, we use Incident Playbooks to support our on-call teams with emergency procedures that can be used to mitigate incidents.
In this post, we describe how we structured incident playbooks, and how we manage these across 100+ on-call teams.&lt;/p&gt;
&lt;h3&gt;Incident Playbooks - where are we now?&lt;/h3&gt;
&lt;p&gt;We consolidated our incident playbooks as part of preparation for &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;Cyber Week&lt;/a&gt; in 2019. Fast forward to 2023 and we have over 1200 playbooks that our teams have authored.
Given the 850+ applications in scope for on-call coverage across 100+ on-call teams, that's 1.41 playbooks per application and ca. 12 playbooks per on-call team. The diagram below shows how our playbook collection has increased over the years. It's easy to see how Cyber Week preparations in Q3 of each year result in significant increases in the playbook collection.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Count of incident playbooks over time" src="https://engineering.zalando.com/posts/2023/01/images/incident-playbooks.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Count of incident playbooks over time&lt;/figcaption&gt;

&lt;p&gt;As expected, most applications have just a few playbooks. Below, you can see the number of applications per playbook count.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Number of applications per playbook count" src="https://engineering.zalando.com/posts/2023/01/images/playbook-count-distribution.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Number of applications per playbook count&lt;/figcaption&gt;

&lt;h3&gt;What are incident playbooks?&lt;/h3&gt;
&lt;p&gt;Our Incident Playbooks cover emergency procedures to initiate in case a certain set of conditions is met, for example when one of our systems is overloaded and the existing resiliency measures (e.g. circuit breakers) are insufficient to mitigate the observed customer impact. In such cases there are often measures we can take, though they will degrade the customer experience.
These emergency procedures are pre-approved by the respective Business Owner of the underlying functionality, allowing for quicker incident response without the need for explicit decision making while critical issues are ongoing.&lt;/p&gt;
&lt;p&gt;Further, playbooks make incident response less stressful for colleagues on on-call rotations. Each on-call member takes the time to become familiar with the procedures and understands the toolbox they have available during incidents. New playbooks are reviewed by the on-call team, shared as part of on-call handover or operational reviews, and practiced in game days, or as part of preparation for big events.&lt;/p&gt;
&lt;p&gt;The procedures document the &lt;em&gt;conditions&lt;/em&gt; (e.g. increased error rates), &lt;em&gt;business impact&lt;/em&gt; (e.g. conversion rate decrease), &lt;em&gt;operational impact&lt;/em&gt; (e.g. reduction of DB load), &lt;em&gt;mean time to recover&lt;/em&gt;, and the &lt;em&gt;steps&lt;/em&gt; to execute. This structure allows all stakeholders involved in incident response to clearly understand the executed actions and target state of the system to expect. Lastly, by having playbooks in a single location, our Incident Responders and Incident Commanders have easy access to all available emergency procedures in a consistent format. This simplifies collaboration across teams during outages.&lt;/p&gt;
&lt;p&gt;More often than not, our playbooks cover the whole system (a few microservices) instead of its individual components being covered through separate procedures. When the bigger system context is considered, there are more options available to mitigate issues.&lt;/p&gt;
&lt;p&gt;When we started in 2019, we first focused on a collection of procedures that were already known, but not consistently documented.
Next, as part of the Cyber Week preparations we wanted to explore and strengthen the mechanisms we have in place to mitigate overload or capacity issues across the different touchpoints of the customer (e.g. product listing pages) and partner journeys (e.g. processing of price updates).&lt;/p&gt;
&lt;p&gt;Let's consider two examples:&lt;/p&gt;
&lt;h4&gt;1) Product Listing Pages (aka. catalog)&lt;/h4&gt;
&lt;p&gt;Our &lt;a href="https://en.zalando.de/womens-clothing/"&gt;catalog pages&lt;/a&gt; integrate multiple data sources, such as teasers, sponsored products, and outfits.
Fetching data from all sources comes at increased costs compared to a simple article grid. Therefore, we have a set of playbooks that disable the different data sources in order to reduce the load on the backends providing the APIs and the underlying Elasticsearch cluster. The playbooks are sorted in such a way that we apply the playbooks with the least business impact first. In one of our evening Cyber Week shifts, we encountered performance degradation resulting in increased latencies, which was hard to diagnose. While one part of the team was busy troubleshooting the issue, another part of the team executed several of the prepared playbooks in sequence in order to mitigate the customer impact.&lt;/p&gt;
&lt;p&gt;Example playbook for catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Title&lt;/strong&gt;: Disable calls for outfits in the Catalog’s article grid&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigger&lt;/strong&gt;: High latency for fetching outfits for the article grid or High CPU usage for Elasticsearch's outfit queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mean time to recover:&lt;/strong&gt; 3 minutes after updating configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Health Impact&lt;/strong&gt;: No more outfit calls from Catalog, reduced request rates to Elasticsearch by x%.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Impact&lt;/strong&gt;: Outfits won't be shown as part of the catalog pages.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;2) Monitoring system&lt;/h4&gt;
&lt;p&gt;Our monitoring system &lt;a href="https://opensource.zalando.com/zmon/"&gt;ZMON&lt;/a&gt; had a component ingesting metrics data and storing these in KairosDB TSDB, backed by Cassandra. Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload. To mitigate similar incidents, we developed a tiering system with three criticality tiers for the metrics, so that in case of overload of the TSDB, we could still ingest the most important metrics necessary to plot essential dashboards required to monitor the Cyber Week event. This playbook is still in place today, even though we changed our metrics storage.&lt;/p&gt;
&lt;p&gt;Example playbook for ZMON:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Title&lt;/strong&gt;: Drop non-critical metrics due to TSDB overload&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigger&lt;/strong&gt;: Metrics Ingestion SLO is at risk of being breached (link to alert/dashboard)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mean time to recover:&lt;/strong&gt; 2 minutes after updating configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Health Impact&lt;/strong&gt;: Loss of tier-3 and tier-2 metrics. Only tier-1 metrics are processed, leading to 40% load reduction on the metrics TSDB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Impact&lt;/strong&gt;: None&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How do we author playbooks?&lt;/h3&gt;
&lt;p&gt;We use a documentation site built using &lt;a href="https://www.mkdocs.org/"&gt;mkdocs&lt;/a&gt; to host a description of the incident process and all playbooks. We generate the playbook directory structure based on our OpsGenie on-call teams. This way there is always a skeleton available for every team to contribute their playbooks to. When we started in 2019, we had a team of 3 reviewers who, as part of the playbook reviews, were committed throughout the year to explaining the purpose of and guidance for the playbooks and aligning these to a common standard. With sufficient examples and knowledge spread across the organization, we switched to using &lt;a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners"&gt;CODEOWNERS&lt;/a&gt; to delegate the reviews to department representatives skilled in operational excellence.&lt;/p&gt;
&lt;p&gt;To remind new contributors about our playbook guidelines, we use a pull request template with a few check boxes as a means of self-verifying playbook completeness. The 1st line of the template contains a TODO with a nudge for a 1-line summary of the changes. This proved to be an easy way of providing reviewers with more context about the performed changes.&lt;/p&gt;
&lt;h3&gt;Integrating playbook data with application reviews&lt;/h3&gt;
&lt;p&gt;Aside from the information about triggers and impact for playbooks, we also collect additional metadata allowing us to integrate playbooks with our application review process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Application – links playbooks to the involved applications&lt;/li&gt;
&lt;li&gt;Expiry date – allows us to nudge teams to re-review playbooks that will expire soon&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To keep integration simple, along with the documentation, we also generate a JSON file with playbook metadata.
During the application review process, it's indicated per application (from a certain criticality tier onward) whether there are any playbooks defined for it and whether any of these are expired.&lt;/p&gt;
&lt;p&gt;With time, we made it mandatory for applications of a certain criticality to have an assigned playbook.
This partially increased the scope of the playbooks beyond the key emergency procedures while at the same time providing training to our engineers in the authoring of playbooks and thinking about the overload and failure scenarios that can occur.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;When we initially created the incident playbooks site, maintaining playbooks as markdown files was considered a good means of ensuring consistency, but only a temporary one. To be consistent with our UI-driven application review workflow, we intended to manage playbooks in the same way.
Managing structured data in markdown is not ideal, despite the ability to use front matter for metadata.
However, managing playbooks in a code repository provides us with an easy means for cross-team reviews using pull requests.
This key advantage keeps us from moving to a UI-driven workflow where such collaboration would be limited.&lt;/p&gt;
&lt;p&gt;We certainly recommend that every team think about the failure scenarios their systems can experience, for example as part of production readiness reviews or game days. Without our playbooks, several key incidents would have had a markedly larger impact on our customer experience.&lt;/p&gt;
&lt;p&gt;Imagining how to react to such scenarios by putting the system into a degraded state, prioritizing availability over customer experience, can spark interesting conversations about resilience mechanisms that can be built into the software. These conversations drive engineers to make changes to their design to fundamentally improve availability, or at least, to ensure their software facilitates easier intervention.&lt;/p&gt;
&lt;p&gt;If used often enough, playbooks should be ideally automated.&lt;/p&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Backend"/></entry><entry><title>How You Can Have Impact As An Engineering Manager</title><link href="https://engineering.zalando.com/posts/2023/01/how-you-can-have-impact-as-an-engineering-manager.html" rel="alternate"/><published>2023-01-26T00:00:00+01:00</published><updated>2023-01-26T00:00:00+01:00</updated><author><name>Gary Rafferty</name></author><id>tag:engineering.zalando.com,2023-01-26:/posts/2023/01/how-you-can-have-impact-as-an-engineering-manager.html</id><summary type="html">&lt;p&gt;How Engineering Managers create impact and shape organisational culture&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;If you are a good leader,&lt;/em&gt;&lt;br /&gt;
&lt;em&gt;Who talks little,&lt;/em&gt;&lt;br /&gt;
&lt;em&gt;They will say.&lt;/em&gt;&lt;br /&gt;
&lt;em&gt;When your work is done,&lt;/em&gt;&lt;br /&gt;
&lt;em&gt;And your aim fulfilled,&lt;/em&gt;&lt;br /&gt;
&lt;em&gt;“We did it ourselves”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;- Lao-Tse&lt;/p&gt;
&lt;p&gt;Last year, I &lt;a href="https://engineering.zalando.com/posts/2022/07/growth-engineering-at-zalando.html"&gt;shared&lt;/a&gt; how Zalando enables and supports the continued growth of our Software Engineers. The piece was written from a leadership perspective. A natural sequel to that would describe how our leaders are empowered. Specifically, I would like to provide my own perspective on how Engineering Managers can create impact and shape organisational culture.&lt;/p&gt;
&lt;h1&gt;Team Structures&lt;/h1&gt;
&lt;p&gt;To provide some context, Engineering Managers use the distinction between the &lt;strong&gt;“Team You Lead”&lt;/strong&gt; and the &lt;strong&gt;“Team You Are On”&lt;/strong&gt;. For the former, an Engineering Manager is responsible for a single delivery team of Software Engineers or Applied Scientists. This is the team that they are leading. The latter refers to the Engineering Manager’s own team (their peer group that forms a department, and is led by a Head of Engineering).&lt;/p&gt;
&lt;h2&gt;The Team You Lead&lt;/h2&gt;
&lt;p&gt;I use the team you lead as the starting point to describe Engineering Management, because this, in my opinion, is the bread and butter of the role. Forming and leading a high-performing delivery team is no small feat. The team of individuals must collectively progress through the four stages of forming (purpose and raison d’etre), storming (sharing feedback, ideation, and defining roles within the group), norming (establishing ways of working and responsibilities), and performing (peak delivery and problem-solving). Take a look at Patrick Lencioni’s &lt;a href="https://www.amazon.co.uk/Five-Dysfunctions-Team-Leadership-Lencioni/dp/0787960756"&gt;Five Dysfunctions of a Team&lt;/a&gt; (or read the &lt;a href="https://www.amazon.co.uk/Five-Dysfunctions-Team-Illustrated-Leadership/dp/0470823380/"&gt;Manga Edition&lt;/a&gt; for a more illustrated journey) to peek into the complex problems that leaders need to resolve in order to keep their team healthy.&lt;/p&gt;
&lt;p&gt;Engineering Managers are accountable for driving the delivery of projects from start to finish - encompassing the entire lifecycle of what the team builds, how they structure step-changes to systems, how they can monitor and measure the performance of said systems for operational excellence, and all the other ingredients that go into delivering effective software.&lt;/p&gt;
&lt;h2&gt;The Team You Are On&lt;/h2&gt;
&lt;p&gt;Beyond the team that they lead, I mentioned that Engineering Managers have another team, and this is their peer group. No two organisations are identical, but typically, multiple teams are grouped to form a department, which fulfils a part of the larger group strategy. This, for me, is where the magic happens for Engineering Management, and it is where I encourage my direct reports to make the biggest impact.&lt;/p&gt;
&lt;p&gt;Andy Grove &lt;a href="https://www.amazon.co.uk/High-Output-Management-Andrew-Grove/dp/0679762884"&gt;defined&lt;/a&gt; a Manager’s output as the output of her/his organisation, plus the output of neighbouring organisations under her/his influence. To put that in context, this is the output of the Team You Lead, plus the output of the teams of your peer group. For the sake of this post, I make the assumption that these teams are interacting, and I do this because “&lt;em&gt;A system is never the sum of its parts; it’s the product of their interaction&lt;/em&gt;”.&lt;/p&gt;
&lt;h1&gt;Interaction is Culture&lt;/h1&gt;
&lt;p&gt;So, if the yield of a system is the product of how the parts interact, you might be wondering how Managers influence this.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Culture has entered the chat...&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Culture is how work happens between people and between teams, which sounds simple, but culture is complex, and takes considerable time and effort to instil.&lt;/p&gt;
&lt;p&gt;I recently read a great description of culture, which hypothesised that culture is composed of behaviour, processes, and practices. Let’s take a look at each, and home in on the Manager’s role within.&lt;/p&gt;
&lt;h2&gt;Behaviour&lt;/h2&gt;
&lt;p&gt;A well-known &lt;a href="https://rework.withgoogle.com/print/guides/5721312655835136/"&gt;study&lt;/a&gt; of engineering team effectiveness from Google, named Project Aristotle, identified the common elements of their best teams, and at the top of that list was Psychological Safety. Psychological Safety “...refers to an individual’s perception of the consequences of taking an interpersonal risk”. If we strip this down to bare metal, it is referring to how comfortable, and encouraged, team members are to speak up, to give their opinions, and to support one another.&lt;/p&gt;
&lt;p&gt;Engineering Management is not about dictating what our engineers do, nor is it about having all the answers to the hard questions. Similarly, engineers are not blindly following instructions, nor are they viewed as code labourers. Instead, Engineering Management is about creating an environment that sets clear expectations and goals, encourages voices and opinions, destigmatizes failure, encourages diverse thinking, and supports the individual growth of each team member.&lt;/p&gt;
&lt;p&gt;To accomplish this, Engineering Managers are provided with the autonomy to support their teams and to enable success as they know best. They should be guided by Our Founding Mindset (OFM), but be led by their own experience and know-how.&lt;/p&gt;
&lt;p&gt;Achieving this within the Team You Lead is one thing, but the key is achieving this across the wider scope of the teams within your influence. This requires customer-first thinking, working backwards from the organisational goals, and ensuring that all teams have enough information and support to achieve their target. In other words, putting purpose over ego, and doing what’s right for the organisation and the customer.&lt;/p&gt;
&lt;h2&gt;Processes&lt;/h2&gt;
&lt;p&gt;A successful organisation is driven by autonomous and empowered teams. Peek inside each of these teams and you will find a diverse collective of talented, ambitious, and driven individuals. We are actively shaping the Zalando of the future by hiring great people with high potential. Our Engineering Managers are responsible for contributing to, and defining, the processes that will enable these teams of individuals to succeed.&lt;/p&gt;
&lt;p&gt;Processes at Zalando are constantly evolving; responding to the ever-changing landscape in which we operate. In order to successfully equip an organisation with the necessary processes for momentum, decision making, and enablement, our Engineering Managers are required to collaborate with other leaders across multiple disciplines and job families, such as Principal Engineering, Product Management, Technical Program Management, and Design.&lt;/p&gt;
&lt;p&gt;Perhaps they might be collaborating with Talent Acquisition Partners to refine the candidate experience during the hiring process or creating a Mentorship program. In other cases, they might be contributing to a cross-functional working group to define KPIs to measure progress relative to the Group Strategy. Perhaps they might be supporting the &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;Cyber Week preparations&lt;/a&gt;. You get the idea. These are just four examples that my cohort of Managers have been working on recently; however, they all share the running theme of intrapreneurial spirit - embodying our “Act Like an Owner” founding mindset, and making things happen throughout the organisation that ultimately become a tailwind for impact.&lt;/p&gt;
&lt;h2&gt;Practices&lt;/h2&gt;
&lt;p&gt;If the purpose of processes is to shape the environment such that group thinking and empowered decision making are supported, then practices are the more granular day-to-day activities that sit atop the processes. These practices help Engineers to get things done.&lt;/p&gt;
&lt;p&gt;As before, if we take the team you lead as the base, the Engineering Manager is responsible for working with their team to define fruitful ways of working that embrace best practices and foster collaboration. This will take time, especially for a newer team, but through trial and error, you will find that sweet spot.&lt;/p&gt;
&lt;p&gt;When we home in on practices beyond the team, we see wider collaboration across disciplines to get things done across the department.&lt;/p&gt;
&lt;p&gt;Practices, in my opinion, are the catalyst for helping Engineering Managers to understand how to scale themselves, by delegating and supporting the individuals on their team to step up and take on more responsibility. If we take a look at Communities of Practice, Operational Review Meetings, or Guilds, we typically see Engineers taking more of a leading role in establishing these practices, but in order to do this, our Engineering Managers are playing more of a supporting role. We are identifying opportunities and matching those to individual goals and aspirations. We are setting those individuals up for success by coaching, providing feedback, utilising training and development budgets, and stepping back to let them drive.&lt;/p&gt;
&lt;p&gt;As individuals are growing into these responsibilities, it is important to nurture experimentation, to celebrate successes and failures, and most importantly, to provide the context (the why) of how these practices are related to the bigger picture.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Engineering Managers are responsible for steering and enabling a high-performing team of engineers, but their scope of influence and impact extends far beyond the realms of the team. Managers help to shape the behaviours, the processes, and the practices of the organisation to yield, and foster, a culture of innovation, delivery, empowerment and drive. This culture is what enables organisations to succeed in our non-linear world.&lt;/p&gt;
&lt;p&gt;The Harvard Business Review recently published a &lt;a href="https://hbr.org/2022/12/to-retain-your-best-employees-invest-in-your-best-managers"&gt;terrific article&lt;/a&gt;, stating that in order to retain your best employees, you need to invest in your best managers. This article resonates with my own view that the success of an Engineering Organisation is greatly supported by our Engineering Managers - the ones who are close enough to the metal to implement culture, yet elevated enough to encompass a broad scope of influence, and provided with enough autonomy to innovate for the organisation.&lt;/p&gt;
&lt;p&gt;I would like to finish this article off with an extract from our Role Expectations for the Management track:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“Great managers come in all shapes and sizes.  There is no ‘checklist’ for leadership …  No leader can do everything - some will exceed in certain capabilities while others will exceed in a different combination - this is OK and intended”.&lt;/em&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Tech Culture"/><category term="Leadership"/><category term="Culture"/></entry><entry><title>More Editorial Content, please.</title><link href="https://engineering.zalando.com/posts/2022/09/editorial-content.html" rel="alternate"/><published>2022-09-29T00:00:00+02:00</published><updated>2022-09-29T00:00:00+02:00</updated><author><name>George Evans</name></author><id>tag:engineering.zalando.com,2022-09-29:/posts/2022/09/editorial-content.html</id><summary type="html">&lt;p&gt;Building a CMS for the Zalando Fashion Store&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Zalando and Editorial Content Logo" src="https://engineering.zalando.com/posts/2022/09/images/editorial-content-logo.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;At Zalando, serving engaging content across the user journey has become increasingly important for multiple teams within the company. This required a scalable, feature-rich and easy-to-use solution, that was flexible enough to adapt to the ever-changing requirements for rich content.&lt;/p&gt;
&lt;p&gt;In this post, George and Daniel describe the product that was built to serve this purpose - its problem space, the solution design process, the technological context and how the product evolved to include new use-cases, such as the Zalando Sustainability topic.&lt;/p&gt;
&lt;h2&gt;Problem Space: The need for a flexible content solution&lt;/h2&gt;
&lt;p&gt;The Zalando Fashion Store is first and foremost a platform to help our customers find the products they want, and it employs various strategies to personalise the experience for each customer. Zalando also aims to inform and inspire, and many of our internal teams and brand partners sought to do this by telling stories, via &lt;em&gt;editorial&lt;/em&gt; content.&lt;/p&gt;
&lt;p&gt;This is where "editorial landing pages" come in as static, self-contained web pages on the Zalando site containing a range of content. Landing pages are often tied in to products and brands, but not always with conversion as the primary focus. They include &lt;a href="https://en.zalando.de/campaigns/nike-my-kinda-play-w/"&gt;awareness campaigns from key brands&lt;/a&gt;, inspiration for a clothing category like &lt;a href="https://en.zalando.de/campaigns/outdoor-w/"&gt;outdoor&lt;/a&gt;, or informative pages for key Zalando initiatives like &lt;a href="https://en.zalando.de/pre-owned-w"&gt;Pre-owned&lt;/a&gt;, or &lt;a href="https://en.zalando.de/about-sustainability/"&gt;sustainability in fashion&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When George's team first started working on the topic of landing pages for Zalando Marketing Services (ZMS) campaigns, legacy tooling for the creation &amp;amp; management of such pages was already in place. However, it had many limitations affecting scalability. Also, it was based on Zalando's "Mosaic" system architecture, which was being phased out in favour of the newer &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Interface Framework&lt;/a&gt;. So the team decided to build a new tool on top of this new architecture, to replace the old one and overcome its feature and scalability related shortcomings.&lt;/p&gt;
&lt;h3&gt;Core Requirements&lt;/h3&gt;
&lt;p&gt;The shortcomings and pain-points of the previous tool became the basis of the requirements for what the team would build:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ease of Use / Scalability&lt;/strong&gt; - The previous solution required significant engineering effort to set up each page, before Content Managers could upload the content. This was inefficient and a clear bottleneck to scalability. Therefore, the new tool should allow Content Managers to create pages, upload and publish content with no engineering involvement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Content Flexibility&lt;/strong&gt; - With the previous tooling, once a page was set up, the layout could not be changed without resetting it, which would cause any content uploaded to be lost, creating a lot of repeated work. The new tool should allow the flexibility to change the layout, add and remove content, whilst preserving existing content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parity with the Zalando App&lt;/strong&gt; - In the previous tooling, web and Zalando app pages were entirely separate - they had different content formats that looked quite different with content for each being uploaded separately. This created a lot of duplicate work, both in asset creation and upload. The new tooling should allow for a single source of content, and mirror its appearance across web &amp;amp; app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Localisation&lt;/strong&gt; - Zalando operates in 25 different markets, requiring content for a given page to be localised into several languages. The previous process for this was cumbersome and confusing, effectively repeating the content upload for each language. Our goal was to streamline this into an efficient, user-friendly process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensibility&lt;/strong&gt; - Creating new, engaging experiences was a key part of the ZMS use case, so we wanted a setup that would facilitate the development of new content formats. After the initial rollout, other teams also showed an interest in this capability, so creating a streamlined contribution model became a priority.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interface Framework&lt;/strong&gt; - The tool should integrate with Zalando's new architecture and design system, to leverage its capabilities and scale with it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Solution Design&lt;/h2&gt;
&lt;p&gt;The first decision to make was whether to build a new CMS from scratch, or use an existing, third-party solution. We needed something flexible enough to adapt to our precise requirements, but we were also conscious that trying to reinvent the wheel by building our own CMS could grow into a project with limitless scope that we would never finish.&lt;/p&gt;
&lt;p&gt;After researching many third-party CMS solutions we decided to go with &lt;a href="https://www.contentful.com/"&gt;Contentful&lt;/a&gt;, a headless CMS - 'headless' since it is agnostic about the 'how' of presenting content to the end user. Instead, it focuses on making the content management process as easy and intuitive as possible. The content is delivered via an API to the presentational layer, e.g. directly to an app, a static site generator such as Next.js, or any other consumer channel, such as Zalando's micro-service-based architecture in our case. What won us over was how flexible and scalable it is in terms of what content could be served, as well as the ease with which the CMS UI could be extended with custom apps. It also had strong multi-language support out of the box, and enabled collaboration in bigger teams.&lt;/p&gt;
&lt;h3&gt;System Architecture Context&lt;/h3&gt;
&lt;p&gt;Let's have a closer look at the technology context into which our solution needed to fit and how a request to a landing page would be processed, finding its way from the content consumer all the way to Contentful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;There are two main consumer platforms: web and app. Our &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt; routing service takes care of matching the request URL with the correct internal service endpoints and HTTP header enrichment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Both platforms serve a requested landing page via our &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Rendering Engine&lt;/a&gt;, which fetches data for each UI element via a GraphQL query using our GraphQL aggregator, the &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;Fashion Store API (FSA)&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;To enable data fetching for our landing pages, George's team built a data proxy service. This sits between FSA and Contentful's API, and handles content mapping &amp;amp; caching. This approach also improves resilience and ensures that the aggregation layer directly calls only Zalando-operated APIs.&lt;/li&gt;
&lt;li&gt;To integrate additional content from Zalando services into the Contentful CMS, a simple content aggregator was built.&lt;/li&gt;
&lt;/ul&gt;
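&lt;p&gt;As a rough illustration of the proxy's two jobs, content mapping and caching, here is a minimal TypeScript sketch. It is not the team's implementation: the &lt;code&gt;CmsEntry&lt;/code&gt; and &lt;code&gt;LandingPage&lt;/code&gt; shapes, the &lt;code&gt;fetchEntry&lt;/code&gt; callback and the fixed time-to-live are all assumptions made for the example.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical shapes: neither type is taken from the real proxy service.
type CmsEntry = { sys: { id: string }; fields: { title: string; urlPath: string } };
type LandingPage = { id: string; title: string; path: string };
type Clock = () =&amp;gt; number;

// Content mapping: translate the CMS entry into the shape the aggregation layer expects.
function mapEntry(entry: CmsEntry): LandingPage {
  return { id: entry.sys.id, title: entry.fields.title, path: entry.fields.urlPath };
}

// Caching: keep mapped pages for ttlMs so repeated requests skip the CMS call.
function createCachedLoader(
  fetchEntry: (path: string) =&amp;gt; Promise&amp;lt;CmsEntry&amp;gt;,
  ttlMs: number,
  now: Clock = Date.now,
) {
  const cache = new Map&amp;lt;string, { page: LandingPage; expiresAt: number }&amp;gt;();
  return async function load(path: string): Promise&amp;lt;LandingPage&amp;gt; {
    const hit = cache.get(path);
    if (hit !== undefined) {
      if (hit.expiresAt &amp;gt; now()) {
        return hit.page;
      }
    }
    const page = mapEntry(await fetchEntry(path));
    cache.set(path, { page, expiresAt: now() + ttlMs });
    return page;
  };
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Injecting the clock keeps the expiry logic easy to test. A production cache would also need eviction and error handling, which this sketch leaves out.&lt;/p&gt;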
&lt;p&gt;&lt;img alt="System architecture context relevant for the Landing Page stack" src="https://engineering.zalando.com/posts/2022/09/images/system-architecture.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;System architecture context relevant for the Landing Page stack&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;h3&gt;Content Data Model&lt;/h3&gt;
&lt;p&gt;The actual content of a landing page is managed within Contentful as "entries"; each entry-type having its own data schema definition, validation rules and a content-upload UI for the content editors.&lt;/p&gt;
&lt;p&gt;The main entry is the landing page itself. It has basic fields like the page title, the URL path and SEO related metadata. It also has a reference list to sub-entries or "modules" - preset content formats such as banners, text blocks, product carousels etc., or more bespoke formats such as a list of sustainability certifications with background information. These can be composed using a drag-and-drop UI to build a landing page layout, and then the necessary content can be uploaded for each one. They can be rearranged or edited at any time, without the need to re-upload existing content.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Contentful modules screenshot" src="https://engineering.zalando.com/posts/2022/09/images/contentful-lp-modules-screenshot.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Landing page modules as arranged in Contentful&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;When a landing page request reaches FSA, it in turn calls the Contentful proxy service, which returns the data for the page and each of its modules. These are mapped to corresponding 'renderers' in the Rendering Engine, which render the UI components.&lt;/p&gt;
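&lt;p&gt;The mapping from modules to renderers can be pictured as a lookup table keyed by module type. The TypeScript sketch below is illustrative only: the module types, the field names and the string-returning renderers are made up for the example, whereas real renderers produce UI components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical module shape and renderer registry; all names are illustrative.
type Fields = Record&amp;lt;string, string&amp;gt;;
type PageModule = { type: string; fields: Fields };
type Renderer = (fields: Fields) =&amp;gt; string;

const renderers: Record&amp;lt;string, Renderer&amp;gt; = {
  banner: (fields) =&amp;gt; 'banner: ' + fields.headline,
  textBlock: (fields) =&amp;gt; 'text: ' + fields.body,
};

// Render modules in the order the editor arranged them. Unknown types are
// skipped, so a new CMS module cannot break a client that has no renderer yet.
function renderPage(modules: PageModule[]): string[] {
  const output: string[] = [];
  for (const mod of modules) {
    const render = renderers[mod.type];
    if (render !== undefined) {
      output.push(render(mod.fields));
    }
  }
  return output;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;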
&lt;h3&gt;A sustainable solution: extensibility and contributions from other teams&lt;/h3&gt;
&lt;p&gt;The Sustainability Team was one of the first interested parties to reach out to George's team early on in the implementation phase. They were seeking a way to display information on the various aspects of Sustainability in fashion in an engaging way. Although this content typically exists more permanently than the short-lived marketing campaigns for which the landing pages system was primarily intended, the overlap of the problem space and requirements was significant enough to make for a beneficial collaboration.&lt;/p&gt;
&lt;p&gt;Extensions and adaptations were needed, however, both for orthogonal aspects (like SEO support or a content review and approval workflow) and for specific presentational features.
In particular, adding the latter in the form of self-contained new modules demonstrated that the new system is flexible enough to enable contributions from other teams.
Among the modules added by the Sustainability team was one showing details of the sustainability certificates Zalando supports on the product level.&lt;/p&gt;
&lt;p&gt;Let's use this module to make the stack as described in the previous section a bit more tangible.&lt;/p&gt;
&lt;h4&gt;The Sustainability Certificate module&lt;/h4&gt;
&lt;p&gt;The purpose of the certificate module is to present a list of sustainability related certificates to our customers.
A sustainability certificate acts as the proof for sustainability related claims about a product. It can be either a third-party certificate like Fairtrade or GOTS, or one of the criteria Zalando provides, e.g. 'Made with 70-100% recycled materials'.
On a landing page, each certificate needs to be shown with three content pieces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;logo&lt;/li&gt;
&lt;li&gt;title&lt;/li&gt;
&lt;li&gt;description text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, the whole certificate module has two headlines and an introduction text block.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sustainability Landing Page - Certificate Module" src="https://engineering.zalando.com/posts/2022/09/images/lp-cert-module-screenshot.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;The Certificate Module on a Sustainability related Landing Page&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;One interesting aspect of the module is that it gets its content not solely from Contentful, but partially from another Zalando service already delivering data for another customer touch point: the Sustainability accordion of the Product Detail Page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Certificate on Product Landing Page" src="https://engineering.zalando.com/posts/2022/09/images/pdp-cert-screenshot.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;A Sustainability Certificate on a Product Detail Page&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Using a single source for sustainability information is valuable not only for making the life of our Content Editors easier (especially when considering the number of supported languages), but also because it's important to show accurate and up-to-date information about Sustainability claims across the whole customer journey.&lt;/p&gt;
&lt;p&gt;For that reason, the Contentful data model of the module looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Subtitle&lt;/li&gt;
&lt;li&gt;Overall intro description text block&lt;/li&gt;
&lt;li&gt;list of certificate IDs (the list and order of certificates to show can vary from landing page to landing page)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These fields are delivered via the Contentful proxy to the Fashion Store API (FSA), where the certificate IDs are enriched with the values for the logo URL, title, and description in the same way as is done for requests from the Product Detail Page. The certificates are delivered to the clients by FSA in the field &lt;code&gt;entities&lt;/code&gt;, which is part of the &lt;code&gt;Collection&lt;/code&gt; type in the GraphQL schema.&lt;/p&gt;
&lt;p&gt;This ensures that the certificate detail information on Landing Pages and on the Product Detail Pages is always in sync.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;query&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;collection_certificates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;!,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$first&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;!)&lt;/span&gt;
&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nc"&gt;re&lt;/span&gt;&lt;span class="err"&gt;-collection_certificates&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$first&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nl"&gt;certificates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;__typename&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;SustainabilityCertificate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;logo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;response&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;collection&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ern:collection:fwd:component:xyz&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Background check&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;subtitle&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Sustainability criteria you can trust&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;description&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Look for certificates like these to see...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;entities&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;certificates&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;__typename&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SustainabilityCertificate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ern:sustcertificate::xyz&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GOTS - organic&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;description&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;The Global Organic Textile Standard (GOTS) is...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;logo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;uri&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;[...]/sustainability/logos/gots-2.png&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
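&lt;p&gt;The enrichment step behind this response can be sketched as a lookup of the page-specific ID list against a shared certificate catalogue. This TypeScript sketch is an assumption about the general shape, not FSA code: &lt;code&gt;enrichCertificates&lt;/code&gt; and the in-memory &lt;code&gt;catalogue&lt;/code&gt; are hypothetical, and only the field names mirror the response above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Field names follow the response above; the type itself is hypothetical.
type Certificate = { id: string; title: string; description: string; logo: { uri: string } };

// Resolve the ordered ID list from the CMS against a shared certificate catalogue.
// The page-specific order is preserved; IDs missing from the catalogue are dropped.
function enrichCertificates(
  ids: string[],
  catalogue: Map&amp;lt;string, Certificate&amp;gt;,
): Certificate[] {
  const result: Certificate[] = [];
  for (const id of ids) {
    const certificate = catalogue.get(id);
    if (certificate !== undefined) {
      result.push(certificate);
    }
  }
  return result;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the landing page stores only IDs, any change to a certificate's title, description or logo in the catalogue shows up on every page that references it.&lt;/p&gt;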

&lt;p&gt;When implementing this module, we had to touch the following components of the Landing Page stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Contentful&lt;/strong&gt;, to add the new data model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contentful proxy&lt;/strong&gt;, to map the new Contentful model to the &lt;code&gt;Collection&lt;/code&gt; type of the GraphQL schema in the Fashion Store API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UI components&lt;/strong&gt; for app and web platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall, the implementation of the additional modules by the Sustainability team was a successful example of &lt;a href="https://www.oreilly.com/library/view/adopting-innersource/9781492041863/ch01.html"&gt;inner sourcing&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Impact of the Content Management Tool&lt;/h3&gt;
&lt;p&gt;Once the new tool was rolled out, it had a substantial impact on the efficiency of landing page content management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The average landing page time-to-go-live, from page creation through content upload to publishing, was reduced from 2 days to 4 hours.&lt;/li&gt;
&lt;li&gt;In the previous set-up, we had to impose a 2-week lead time from page briefing to go-live, to allow for content upload issues &amp;amp; QA etc. With the new solution, this lead time has been removed entirely.&lt;/li&gt;
&lt;li&gt;The new tool requires no engineering involvement in the creation &amp;amp; publishing of landing pages - non-technical stakeholders can complete the process end-to-end themselves.&lt;/li&gt;
&lt;li&gt;The same is true for changes to the layout of existing landing pages. Previously, these required engineering involvement and a re-upload of &lt;em&gt;all&lt;/em&gt; the content; now they can be achieved by simply reordering the modules, or adding and removing modules as needed.&lt;/li&gt;
&lt;li&gt;The landing pages and all modules on them are mirrored across web and app, from a single point of upload, rather than two distinct pages, cutting the briefing and upload workload in half.&lt;/li&gt;
&lt;li&gt;Since rollout, we've seen an 82% increase in the number of landing pages published YoY.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion and Next Steps&lt;/h3&gt;
&lt;p&gt;In conclusion, if we assess the impact of the new tool against the original requirements, we think it’s fair to call the project a success. We implemented a tool that allows non-technical stakeholders to create and manage landing pages end-to-end, with greatly reduced effort, and that takes advantage of Zalando’s new Interface Framework.&lt;/p&gt;
&lt;p&gt;Perhaps the most promising achievement concerns one of the key aims of the tooling: to facilitate the addition of new features and iterations that continuously improve the landing pages offering. We feel this was achieved, as many such features have been added since the rollout. These include new content formats like the aforementioned Sustainability Certificates module, as well as process improvements like an adaptive streaming video solution, which allows us to deliver longer video content with seamless playback, and image editing capabilities within the CMS to streamline content upload.&lt;/p&gt;
&lt;p&gt;The ability to add these improvements gives us confidence that the tooling will remain adaptable enough to serve our ever-changing needs in the long term.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/></entry><entry><title>Growth Engineering at Zalando</title><link href="https://engineering.zalando.com/posts/2022/07/growth-engineering-at-zalando.html" rel="alternate"/><published>2022-07-26T00:00:00+02:00</published><updated>2022-07-26T00:00:00+02:00</updated><author><name>Gary Rafferty</name></author><id>tag:engineering.zalando.com,2022-07-26:/posts/2022/07/growth-engineering-at-zalando.html</id><summary type="html">&lt;p&gt;How we enable growth for engineers at Zalando&lt;/p&gt;</summary><content type="html">&lt;p&gt;We recently closed out our annual performance review for employees. Naturally, this
period is for us to focus on how we are performing, what we aspire to achieve, and
how we can progress towards those goals, with the support of our leads.&lt;/p&gt;
&lt;p&gt;As a leader, I’ve spent a great deal of time working with Software Engineers on their
development, and helping them to drive their career progression. These conversations
and discussions are usually driven by the engineer, with managers playing a guiding and
supporting role, and typically consist of self-reflection, ideation, motivation, and
the culmination of a development plan.&lt;/p&gt;
&lt;p&gt;I thought that it might be helpful to share some notes on a few of the ways that we enable
growth for Engineers at Zalando.&lt;/p&gt;
&lt;h2&gt;Role Expectations&lt;/h2&gt;
&lt;p&gt;A standard progression for an engineer is from Junior to Mid to Senior. Unfortunately,
aside from the title, we (and I include myself from my own engineering days) are not always
completely clear on what the differences are between the levels. In order to progress as a
Software Engineer, it is imperative that we understand the expectations at each level.&lt;/p&gt;
&lt;p&gt;At Zalando, all of our engineers are provided with a copy of our Software Engineering
Role Expectations. This document very clearly defines the expectations per grade across a
wide range of functional areas, such as &lt;strong&gt;Scope&lt;/strong&gt;, &lt;strong&gt;Delivery &amp;amp; Impact&lt;/strong&gt;, and &lt;strong&gt;Community Contributions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Moreover, the expectations very clearly describe the requirements for advancing to the next grade.
A common activity for engineers reviewing their performance is to look at the functional areas on
their current grade, and the grade above, and with the help of their lead, to perform a RAG
(red, amber, green) assessment on their performance. This will usually shine a spotlight on areas
for growth, and also highlight strengths that should be doubled down upon.&lt;/p&gt;
&lt;p&gt;A concrete role expectations document is something that I would have greatly benefited from
whilst coming up as an engineer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alice&lt;/strong&gt;: &lt;em&gt;"Would you tell me, please, which way I ought to go from here?"&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Cheshire Cat&lt;/strong&gt;: &lt;em&gt;"That depends a good deal on where you want to get to."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alice&lt;/strong&gt;: &lt;em&gt;"I don’t much care where."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Cheshire Cat&lt;/strong&gt;: &lt;em&gt;"Then it doesn’t much matter which way you go."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alice&lt;/strong&gt;: &lt;em&gt;"...so long as I get somewhere."&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Cheshire Cat&lt;/strong&gt;: &lt;em&gt;"Oh, you’re sure to do that, if only you walk long enough."&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Performance Reviews&lt;/h2&gt;
&lt;p&gt;I mentioned in the introduction that we have recently concluded our latest performance review.
Performance reviews of some shape and form are relatively standard practice across the industry,
but no two systems are the same.&lt;/p&gt;
&lt;p&gt;Our reviews are held annually, with a half-yearly check-in*. The reviews provide an opportunity for
employees to receive rounded feedback, which incorporates inputs from their peers, stakeholders, and lead.
In addition, they require self-assessment, which is particularly important.
We are all responsible for owning our careers.&lt;/p&gt;
&lt;p&gt;The performance reviews serve to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Recognise and celebrate their contributions over the last period.&lt;/li&gt;
&lt;li&gt;Identify their strengths and the areas that they shine in.&lt;/li&gt;
&lt;li&gt;Highlight any development areas or blindspots.&lt;/li&gt;
&lt;li&gt;Calibrate these elements relative to the aforementioned role expectations.&lt;/li&gt;
&lt;li&gt;Develop a goal and milestones to work towards over the course of the next review period.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I personally cherish the development areas, and love to hear where I can push myself more,
and course correct any bad habits or issues (we all have them).&lt;/p&gt;
&lt;p&gt;*Growth and progression is a constant and ongoing collaboration between you and your lead, but the
actual timelines for the official review periods are annually and half-yearly.&lt;/p&gt;
&lt;h2&gt;Continuous Feedback&lt;/h2&gt;
&lt;p&gt;When I started out in my engineering career, one of the exciting aspects was the tight feedback loop.
Using the REPL or compiler, I could quickly validate my solution. Tight feedback loops allow us to
quickly course correct when something is wrong, but also provide a nourishing hit of endorphins when things
go well.
This supercharged-catalyst approach is something that we use for the delivery of continuous feedback at Zalando.&lt;/p&gt;
&lt;p&gt;One of our values is &lt;a href="https://jobs.zalando.com/en/our-founding-mindset/"&gt;High challenge, high support&lt;/a&gt;, which states that&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Feedback is a gift. We give and receive honest and timely feedback. At the same time, we provide each
other with support, and we care about the person beyond their role.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The use of the word timely is critical here. The best time to provide feedback, especially critical feedback,
is when the action is fresh in the mind. This is when context is plentiful and crystal clear.
My lead never waited until our next 1:1 to provide me with feedback, and this is something that I have continued.&lt;/p&gt;
&lt;h2&gt;Mentoring &lt;em&gt;(noun)&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;the practice of helping and advising a less experienced person over a period of time,
especially as part of a formal programme in a company, university, etc.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Mentoring is everywhere in Zalando. We have many official mentoring programmes (some are company wide,
others are nurtured within departments), and we also have many unofficial mentoring relationships.
During my tenure, I have benefitted from being both a mentor and a mentee.&lt;/p&gt;
&lt;p&gt;Typically, for early stage engineers, seeking out an experienced mentor is a great way to broaden
their network, to gain experience, and to accelerate their growth. Your mentor will likely be from a
different team or business unit, so they can offer a more diverse approach to problem solving and development.&lt;/p&gt;
&lt;p&gt;For our more tenured engineers, and especially those who are progressing towards Senior Engineering,
mentoring a less experienced engineer* helps to prepare you for the seniority expectations such as
coaching, guiding, providing feedback, and paving the way for a new generation.&lt;/p&gt;
&lt;p&gt;*I have witnessed some success stories where engineers have mentored non-engineers and helped
them to secure their first engineering role.&lt;/p&gt;
&lt;h2&gt;Personal Development Budget&lt;/h2&gt;
&lt;p&gt;We provide our engineers with a healthy personal development budget, which can be used for learning materials,
educational resources, training and certifications, and the like.
Every person is unique, and whilst you might prefer to upskill using sites like Coursera, I might prefer to
read a book on a particular topic, or to join a local study group.&lt;/p&gt;
&lt;p&gt;Personal development is certainly not limited to technical skills, and should also include soft-skills, and
other attributes that shape a well-rounded career. As a personal example, I recently sought to improve my public
speaking skills and took an eight-week online course on Presentation Skills. The course was aimed at individuals
who often need to speak to groups, and who find it uncomfortable. To my surprise, the cohort consisted of quite
a few engineering leaders.&lt;/p&gt;
&lt;p&gt;Courses and activities like these can be cost-prohibitive to some, and having the investment of your company
to support you is a huge boost to your development.&lt;/p&gt;
&lt;h2&gt;Missing it? Make it Happen!&lt;/h2&gt;
&lt;p&gt;Another one of our values is &lt;a href="https://jobs.zalando.com/en/our-founding-mindset/"&gt;Act like an owner&lt;/a&gt;, which states that&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“Ownership” is about being responsible to our customers, partners and colleagues, not about being entitled.
We own our destiny and are not stopped by circumstances: Zalando is what you make of it.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We are all encouraged to take ownership of our careers and development. One such example of this is the large
number of communities and groups that were founded and run by engineers. In my particular department,
I have seen people create and run React meetups, Book Clubs, Podcasts, Show &amp;amp; Tells, Hackathons, etc. At one point
in time, these forums did not exist - an engineer wanted to attend one, and so they took ownership and created it.&lt;/p&gt;
&lt;p&gt;Founding and organising such initiatives is no small feat, and you can be sure that the creators developed many
skills along the way.&lt;/p&gt;
&lt;p&gt;Organisations are ever evolving, and don’t come equipped with everything that you would like. If there’s something
that you want, then go and make it happen.&lt;/p&gt;
&lt;h2&gt;Support, Support, Support.&lt;/h2&gt;
&lt;p&gt;I have been incredibly fortunate to work with leaders and peers who support my growth and development. They have
provided me with open and honest feedback on what I am doing well, and of course, what I am doing not so well.&lt;/p&gt;
&lt;p&gt;Growing within an organisation with such a deeply woven culture of supporting one another is surprisingly easy.
Our engineers’ growth and engagement is a top priority for our leadership cohort, and they have our full support
for unlocking their potential.
Support isn’t sugar-coated, and sometimes that means having difficult conversations, but we do this to set you up for success.&lt;/p&gt;</content><category term="Zalando"/><category term="Tech Culture"/><category term="Leadership"/><category term="Culture"/></entry><entry><title>An Introduction to the Zalando Design System</title><link href="https://engineering.zalando.com/posts/2022/07/an-introduction-to-the-zalando-design-system.html" rel="alternate"/><published>2022-07-21T00:00:00+02:00</published><updated>2022-07-21T00:00:00+02:00</updated><author><name>Andrea Moretti</name></author><id>tag:engineering.zalando.com,2022-07-21:/posts/2022/07/an-introduction-to-the-zalando-design-system.html</id><summary type="html">&lt;p&gt;A high level overview of the elements composing our Design System and a brief history of how we got from an idea to full adoption.&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Yet Another "What is a Design System?"&lt;/h1&gt;
&lt;p&gt;There is a lot of literature and countless blog posts around the very definition of the concept of design systems. In this post, we'd like to look at it from an engineering perspective and describe the journey from the initial idea to the complete adoption here at Zalando.&lt;/p&gt;
&lt;p&gt;You can also find more information about the creation process from a design point of view in &lt;a href="https://medium.com/zalando-design/the-label-part-1-redesigning-our-visual-identity-a468cad9d6f2"&gt;this blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At its core, a Design System is a collection of specifications describing a set of design primitives, reusable components, and arbitrary guidelines to ensure consistency and visual identity.
Given such a broad definition, there are no fixed rules when it comes to technical implementation, but some patterns have started to emerge in the industry.&lt;/p&gt;
&lt;h2&gt;Implementation-less Design System&lt;/h2&gt;
&lt;p&gt;How a Design System is implemented as a reusable library is highly influenced by the specific business use case, the technologies and frameworks used, the platforms to support, as well as team and company-wide processes and structure.
In a very large company with many different products and a diverse panorama of tech stacks, providing a single solution that suits every context may become extremely difficult, if not impossible. On the other hand, visual consistency and brand identity are likely to still be a requirement.&lt;/p&gt;
&lt;p&gt;A radical, but common, approach in these use cases is not providing an implementation at all.
The Design System is defined via a strict set of platform and technology agnostic definitions. Different teams/products/departments can implement their own library using the best tool for the job as long as the specifications are respected.&lt;/p&gt;
&lt;h2&gt;Design Tokens&lt;/h2&gt;
&lt;p&gt;Relying exclusively on a set of specifications offers more flexibility. However, as more and more implementations are developed, the problem of guaranteeing that they are in sync with the latest specs becomes increasingly hard.&lt;/p&gt;
&lt;p&gt;A step toward increasing consistency without sacrificing flexibility is to provide a set of core variables and assets to be used across implementations.
Those variables, called tokens, represent all the shared values that will help us maintain consistency across our system.
Some practical examples are color palettes, spacing, typography, and assets like logos, icons, etc.&lt;/p&gt;
&lt;p&gt;Design Tokens are usually maintained in a centralised place and converted via tooling into different formats to be consumed by a vast array of different platforms.
Every independent implementation will use the latest version of those tokens as the only source of truth for the core variables and assets used.
With such a setup, we can quickly roll out changes to Design System core elements across an arbitrary number of implementations.&lt;/p&gt;
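&lt;p&gt;As a minimal sketch of such tooling, the Python snippet below converts one token definition into CSS custom properties and a flat JSON map. The token names, values, and output formats are illustrative assumptions, not the actual Zalando tokens or build pipeline.&lt;/p&gt;

```python
import json

# Hypothetical token source; names and values are invented for illustration.
TOKENS = {
    "color": {"primary": "#1a1a1a", "accent": "#ff6900"},
    "spacing": {"s": "8px", "m": "16px"},
}


def flatten(tokens, prefix=""):
    """Flatten nested token groups into {'color-primary': '#1a1a1a', ...}."""
    flat = {}
    for name, value in tokens.items():
        key = f"{prefix}-{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, key))
        else:
            flat[key] = value
    return flat


def to_css(tokens):
    """Render tokens as CSS custom properties for web consumers."""
    lines = [f"  --{key}: {value};" for key, value in flatten(tokens).items()]
    return ":root {\n" + "\n".join(lines) + "\n}"


def to_json(tokens):
    """Render tokens as a flat JSON map, e.g. for a mobile app build step."""
    return json.dumps(flatten(tokens), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_css(TOKENS))
    print(to_json(TOKENS))
```

&lt;p&gt;Because every output format is generated from the same source, a value changed once reaches each consumer the next time it picks up the generated files.&lt;/p&gt;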
&lt;h2&gt;The Single Component Library&lt;/h2&gt;
&lt;p&gt;The term "Design System" is often used as a synonym for a component library.
While it is true that a component library is one of the practical implementations of a Design System, overloading the term is a practice that may turn out to be counter-productive.
A lot of emphasis is given to the technicalities of how the different components are developed in a specific architecture, glossing over the Design System's core goals, which are to enforce a visual consistency and identity while reducing the maintenance costs.
These fundamental aspects are instead often relegated to vague concepts of default styles or custom themes.&lt;/p&gt;
&lt;p&gt;The confusion of those terms is easy to understand: in many cases the one single component library is the main contact point between the Design System as a concept and its practical consumers.
Referring to this contact point as the “design system” is an understandable shortcut.
Regardless of the terminology, we are dealing with very different concepts. For example, a Design System can exist without a component library, the same way a component library can be abstract enough to not enforce any visual identity.&lt;/p&gt;
&lt;h1&gt;The Zalando Implementation for the Web Platform&lt;/h1&gt;
&lt;p&gt;Our design system was initially conceived and developed at roughly the same time as its web platform implementation. This gave us the opportunity to gradually adopt certain technical decisions with a very tight feedback loop during a &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;major visual and architecture redesign&lt;/a&gt;.
In retrospect, that was both an advantage and a disadvantage: starting from scratch gave us the freedom to make the choices based on suitable use cases without the constraints of a legacy live system.
On the other hand, the lack of a complete set of specifications led to many changing requirements that naturally caused a certain amount of refactoring and duplicated work.&lt;/p&gt;
&lt;p&gt;Overall it was an extremely interesting challenge and I would like to share some of the learnings and decisions we encountered on the way.
As a first step, we identified some of the functional requirements we could foresee based on past experience and current business needs.&lt;/p&gt;
&lt;h3&gt;Team Autonomy&lt;/h3&gt;
&lt;p&gt;A high level of autonomy has consistently been reinforced by Zalando, even after years of change and growth.
Different teams, especially on the customer-facing side, own specific parts of the experience and expect to independently develop new features without being blocked by overly centralised teams and architectures.&lt;/p&gt;
&lt;h3&gt;Speed&lt;/h3&gt;
&lt;p&gt;In every meaning of the word, we knew that speed would be a requirement.
This ranged from the performance of the components to the ability to quickly iterate over existing implementations and provide new features, while avoiding, as much as possible, becoming a bottleneck for other teams.&lt;/p&gt;
&lt;h3&gt;Consistency&lt;/h3&gt;
&lt;p&gt;One of the key metrics to evaluate the success of a Design System is the consistency and identity of the final customer-facing product.
From a technical perspective, there are always some trade-offs between consistency, speed, and flexibility.
While it can be complex, if not impossible, to maximize all of them, we tried to incentivize the "consistent way" by making it the easiest and fastest option whenever possible.
We still had to consider possible escape hatches for certain edge cases, but we wanted the most obvious and simple option to be the one providing the highest level of consistency.&lt;/p&gt;
&lt;h3&gt;Consider Other Platforms&lt;/h3&gt;
&lt;p&gt;While our main focus was to support the web platform, we decided from the beginning to identify opportunities to maintain a certain level of code sharing across platforms.
Some variables could be shared across all platforms, part of the CSS used on the website may be used for emails, and some teams may want to use a different JS framework.
Those are some of the possible use cases we thought could arise at some point. While we didn’t want to over-engineer our solution based on these uncertain requirements, we tried to keep a loosely coupled architecture that would allow some of these scenarios to be addressed more easily in the future.&lt;/p&gt;
&lt;h2&gt;Extended Atomic Metaphor&lt;/h2&gt;
&lt;p&gt;Our web component library follows an approach loosely based on the concept of &lt;a href="https://bradfrost.com/blog/post/atomic-web-design/"&gt;Atomic Design&lt;/a&gt;.
The basic idea is to have different abstractions that build on each other, from the simplest to the most complex, in the same way that complex living organisms are composed of simpler molecules, which in turn are composed of simpler atoms, and so on.
A layered approach is a natural fit for many complex and continuously evolving systems. In particular, the different speeds at which layers of different complexity change in nature tend to be mirrored in artificial constructs like a Design System and many other complex systems.
A very interesting read that I strongly recommend on the topic is &lt;a href="https://jods.mitpress.mit.edu/pub/issue3-brand/release/2"&gt;Pace Layering: How Complex Systems Learn and Keep Learning&lt;/a&gt;.
For our web architecture we ended up with these different layers:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ownership" src="https://engineering.zalando.com/posts/2022/07/images/atomic.png#center"&gt;&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;strong&gt;Design Tokens&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;A centralised source of truth for variables and assets that define the core of the Design System. Some examples are: colour palette, spacing, typography, fonts, icons, etc.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Electrons&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;A subset of the CSS grammar that only allows properties and values that are consistent with the specifications of the Design System. e.g. &lt;code&gt;paddingTop_m&lt;/code&gt;, &lt;code&gt;fontFamily_sansSerif&lt;/code&gt;, etc.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Atoms&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;A composition of electrons and/or other atoms that serves a single generic purpose and cannot be divided further without losing its functionality. For example, the collection of electrons needed to create a button. In our implementation this is the last layer that directly uses CSS.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Molecules&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;A composition of atoms and/or other molecules forming a single generic component. An example could be the React implementation of different button types ready to be used as a package. At this level there should not be any business logic and emphasis is given to reusability and consistency with design specifications. For example, most of these molecules will also be available as components in the shared designer library.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Organisms&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;A composition of molecules, atoms, and/or other organisms to fulfill a specific business use case. They are not part of the core component library and are owned by the teams responsible for the specific feature they enable.&lt;/dd&gt;
&lt;/dl&gt;
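
&lt;p&gt;To make the relationship between the lower layers more concrete, here is a minimal Python sketch, assuming that electrons are exposed as class names generated from the tokens and that an atom is a validated composition of those names. The token values and the helper functions are hypothetical; only the naming pattern of &lt;code&gt;paddingTop_m&lt;/code&gt; and &lt;code&gt;fontFamily_sansSerif&lt;/code&gt; comes from the description above.&lt;/p&gt;

```python
# Hypothetical tokens; the real scales and names are not given in this post.
SPACING = {"s": "8px", "m": "16px"}
FONT_FAMILY = {"sansSerif": "Helvetica, Arial, sans-serif"}


def build_electrons():
    """Map each allowed class name to exactly one CSS declaration."""
    electrons = {}
    for size, value in SPACING.items():
        electrons[f"paddingTop_{size}"] = f"padding-top: {value};"
    for name, value in FONT_FAMILY.items():
        electrons[f"fontFamily_{name}"] = f"font-family: {value};"
    return electrons


ELECTRONS = build_electrons()


def atom(*electron_names):
    """Compose an atom's class string, rejecting anything outside the subset."""
    unknown = [name for name in electron_names if name not in ELECTRONS]
    if unknown:
        raise ValueError(f"not part of the design system: {unknown}")
    return " ".join(electron_names)


def stylesheet():
    """Emit the CSS bundle; its size depends only on the electrons."""
    return "\n".join(f".{name} {{ {rule} }}" for name, rule in ELECTRONS.items())


if __name__ == "__main__":
    print(atom("paddingTop_m", "fontFamily_sansSerif"))
    print(stylesheet())
```

&lt;p&gt;In this sketch, a value outside the specification such as &lt;code&gt;paddingTop_13px&lt;/code&gt; is rejected when the atom is composed, and adding more atoms does not change the generated stylesheet.&lt;/p&gt;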

&lt;p&gt;Consistent with the natural world analogy, elements belonging to the simpler layers, like electrons and atoms, tend to be stable and only very rarely receive updates, for example during a major redesign every few years.
On the other hand as the complexity of the layer increases, changes happen more and more frequently.
Based on this expected behaviour, we shaped our architecture in order to optimise for:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;very frequent changes in organisms&lt;/li&gt;
&lt;li&gt;occasional changes in molecules&lt;/li&gt;
&lt;li&gt;very rare changes in atoms and electrons.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We were also able to use these assumptions as technical leverage to optimise other dimensions like bundle size, enforced visual consistency, testing, and documentation.&lt;/p&gt;
&lt;h2&gt;Contributions and Ownership&lt;/h2&gt;
&lt;p&gt;In terms of tangible entities, the Zalando Design System is composed of different parts with different ownership and contribution processes in place. &lt;a href="https://medium.com/zalando-design/zalandos-design-system-contribution-model-73ab36f8591e"&gt;This article&lt;/a&gt; covers the details of our "contribution model" in more depth.
Here, we will focus on the parts affecting the web platform, but a similar structure can be encountered for mobile app development as well.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ownership" src="https://engineering.zalando.com/posts/2022/07/images/ZDS.png#center"&gt;&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;strong&gt;Design Tokens repository&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;Owned by the larger Design System team, including designers as well as web and app engineers.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Figma component library&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;Includes a visual representation of the Design System specifications as well as a centralised component library that can be used by designers in many different teams to create screens and requirements for arbitrary features.&lt;/dd&gt;

&lt;dt&gt;&lt;strong&gt;Web component library&lt;/strong&gt;&lt;/dt&gt;
&lt;dd&gt;Structured as a monorepo, it exports a single npm package for each atom, molecule and organism as well as a single highly optimised CSS bundle. The central Design System team has the ownership of the CSS layer, the atoms, the molecules, and some generic organisms.&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;Using GitHub code owners, different teams own specific organisms and are responsible for maintaining any business logic required.
Pull requests on code-owned folders are usually faster to approve and merge, as we ensure that changes to a code-owned component will not affect other exported packages.&lt;/p&gt;
&lt;p&gt;The only way to use CSS on organisms and molecules is via atoms. This ensures a certain amount of consistency and makes it easy to spot possible deviations from the Design System specifications.
Using a single, predictable CSS bundle and a set of React hooks and patterns, we encourage consistency and composability over one-off implementations. In return we get a very scalable library where an unlimited number of organisms will always result in the same CSS bundle size and will not affect each other's JS bundle size.&lt;/p&gt;
&lt;h2&gt;Challenges and Pain Points&lt;/h2&gt;
&lt;p&gt;Creating a Design System from scratch and driving its adoption in a large company was definitely challenging. From gathering the requirements of many hidden use cases to getting enough traction to refactor complex applications, it has been a journey where communication and coordination have played a major role.
Finding a technical solution able to grow and scale as fast as our requirements was also a challenge.&lt;/p&gt;
&lt;p&gt;While the system has been running relatively smoothly for more than 2 years and the adoption rate is close to 100%, there are some long-lasting pain points and possible areas for future improvements.&lt;/p&gt;
&lt;h3&gt;Fragmented Ownerships&lt;/h3&gt;
&lt;p&gt;Finding the right owner for specific common components like product cards, carousels, banners, etc. is extremely difficult from an organisational point of view.
Even when an owner is found, it is hard to prevent some conflicts and overlaps of responsibilities.
For example, multiple variations of the same components start to appear with different ownerships, features that require coordination across certain premises need the involvement of different teams, and the discoverability of what is currently available becomes a crucial requirement.&lt;/p&gt;
&lt;h3&gt;Coupling with Deployments&lt;/h3&gt;
&lt;p&gt;In software engineering, it is usually considered a best practice to group things that change together.
Currently, a new version of the component library and a new version of the live customer-facing application are handled by different pipelines and the codebases live in different repositories.
Although having independent releases and a platform-agnostic pipeline may be convenient, we cannot ignore the reality of having one main consumer. In this case, a solution involving a larger monorepo may help with the bottleneck problem created by the need to keep versions in sync.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;A Design System tends to behave like most complex systems.
Different layers evolve and stabilize at different paces, with a slow-changing core and fast iterations on the edges.
The biology metaphor fits these behaviors quite well and was popularized by Atomic Design.
Porting those complexity layers into a technical implementation was not always straightforward, but overall it proved to be a good decision.&lt;/p&gt;
&lt;p&gt;Code can be observed with the same curiosity we have when looking at nature.
Identifying the boundaries between different layers and their relationships is the key to controlling the complexity involved.
While, to some extent, exceptions will always exist, knowing what parts of the system are stable, which ones are changing fast, and how they affect each other is a powerful tool.
The architecture and processes around a Design System can be shaped around these characteristics in order to optimize for fast iterations on the edge layers and stability on the core ones.
Embracing the chaotic nature of changes while learning and observing the larger patterns at play is the key to achieving long-term stability and a healthy evolution process.&lt;/p&gt;</content><category term="Zalando"/><category term="Design"/><category term="Frontend"/><category term="UX"/></entry><entry><title>International Women in Engineering Day (June 23rd)</title><link href="https://engineering.zalando.com/posts/2022/06/international-women-in-engineering-day-23-june.html" rel="alternate"/><published>2022-06-23T00:00:00+02:00</published><updated>2022-06-23T00:00:00+02:00</updated><author><name>Anja Bergner</name></author><id>tag:engineering.zalando.com,2022-06-23:/posts/2022/06/international-women-in-engineering-day-23-june.html</id><summary type="html">&lt;p&gt;We’re celebrating International Women in Engineering Day by talking to three senior Zalando Women in Tech.&lt;/p&gt;</summary><content type="html">&lt;p&gt;What were the biggest learnings in your career so far? And what advice would you give your younger self today? How do you get ahead in your career? We’re celebrating &lt;strong&gt;&lt;a href="https://www.inwed.org.uk/about/"&gt;International Women in Engineering Day&lt;/a&gt;&lt;/strong&gt; by talking to three senior Zalando Women in Tech: &lt;a href="https://www.linkedin.com/in/mahak-swami-5a404029/"&gt;Mahak Swami&lt;/a&gt;, Engineering Manager; &lt;a href="https://www.linkedin.com/in/florianegramlich/"&gt;Floriane Gramlich&lt;/a&gt;, Director of Product Payments; and &lt;a href="https://www.linkedin.com/in/anapeleteiro/"&gt;Ana Peleteiro Ramallo&lt;/a&gt;, Head of Applied Science. We caught up with them during the Women in Tech Global Conference 2022 — let’s find out their advice!&lt;/p&gt;
&lt;h4&gt;What’s the best thing about your job?&lt;/h4&gt;
&lt;p&gt;&lt;img alt="Photo of Mahak" src="https://engineering.zalando.com/posts/2022/06/images/mahak.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mahak:&lt;/strong&gt; In my team, we build products for the Zalando mobile app. The best thing is the technical challenges: working on them and solving them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floriane:&lt;/strong&gt; I have an incredible team who I love to work with – it’s fun, but it’s also inspiring. Also, I work in payments, which is all about customer convenience: Ultimately, if I don’t do my job right, then people can’t pay, so I love that I’m making a difference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ana:&lt;/strong&gt; The best thing about my job is that I get to work on super-interesting topics, and with really amazing and interesting colleagues.&lt;/p&gt;
&lt;h4&gt;Looking back at your career, what’s your tip for fostering a more inclusive environment?&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mahak:&lt;/strong&gt; It’s really important that everyone’s opinions are considered when you’re solving a problem. An engineer could bring equally important input to the design, and vice versa. Everybody needs to bring their own values to the table, so that we can find the best solutions to the problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floriane:&lt;/strong&gt; Being yourself is super-important. That means accepting who you are, and not trying to imitate somebody else. Because, if you can’t be true to yourself, how can you be true to others?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ana:&lt;/strong&gt; The first thing is to make people aware when there is not an inclusive environment. Many times people want to be inclusive, and don’t realise there’s a problem.&lt;/p&gt;
&lt;h4&gt;What’s the best professional advice you’ve ever received?&lt;/h4&gt;
&lt;p&gt;&lt;img alt="Photo of Floriane" src="https://engineering.zalando.com/posts/2022/06/images/floriane.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mahak:&lt;/strong&gt; The best advice I’ve had was around executive presence: To speak about my work and represent it just as well as I was doing it. A lot of women don’t advocate for the work they’re doing. That’s one thing I’d definitely push for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floriane:&lt;/strong&gt; So, the worst advice I ever received was, ‘Don’t be too ambitious’. I was told that a LOT in previous companies, in almost every performance talk. It’s terrible advice and I wonder if a man would be told the same thing. Now, it’s really important to me to be that multiplier for my teams, I say: Be ambitious!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ana:&lt;/strong&gt; The best professional advice I ever got was, ‘If you want something, just go and get it’. Because many times we doubt ourselves, but it’s about wanting to get something and having a plan for how to get it.&lt;/p&gt;
&lt;h4&gt;What advice would you give your younger self?&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Mahak:&lt;/strong&gt; Try out as many things as you can in your career. It’s very important to figure yourself out. Don’t be afraid to find out what clicks for you as a professional.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floriane:&lt;/strong&gt; Know what you want. Say what you want. Do what you want. And stand true to that. It’s super-important to invest in self-reflection quite early. You need to really understand who you are.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ana:&lt;/strong&gt; What I learned is to be really proactive and never stop learning. Continuous learning will help you to grow.&lt;/p&gt;
&lt;h4&gt;What other tips would you give to women starting their career in STEM?&lt;/h4&gt;
&lt;p&gt;&lt;img alt="Photo of Ana" src="https://engineering.zalando.com/posts/2022/06/images/ana.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mahak:&lt;/strong&gt; In general, women have this perception of tech: that it isn’t a place for them, and perhaps it’s difficult to get into. But that’s not the case. Tech is very logical, a lot of fun and now very inclusive too. When I started my career, I was often the first and only woman on the team. But now that’s not the case. You will have company and you will have fun – try it!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Floriane:&lt;/strong&gt; Be curious and don’t let other people tell you what you can or can’t do. On a more practical level, look for role models (there are lots out there), find yourself a mentor, build your network, and really learn from others. Getting advice from outside your usual zone is very powerful.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ana:&lt;/strong&gt; Never allow anyone to tell you what you can or can’t do. You’re the only one who knows your goals and what you want to achieve. Also, there are no things for girls or things for boys – there’s only things you like. So, if there’s something you like, go ahead and enjoy it.&lt;/p&gt;
&lt;p&gt;Learn more about &lt;a href="https://www.inwed.org.uk/about/"&gt;International Women in Engineering Day&lt;/a&gt; and for more inspiration, check out our three Zalando speakers at the recent &lt;a href="https://www.youtube.com/results?search_query=womentech+network+zalando"&gt;Women in Tech Global Conference&lt;/a&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Women in Tech"/><category term="Tech Jobs"/><category term="Diversity in Tech"/><category term="Culture"/><category term="Leadership"/></entry><entry><title>Accelerate testing in Apache Airflow through DAG versioning</title><link href="https://engineering.zalando.com/posts/2022/06/accelerate-apache-airflow-testing-through-dag-versioning.html" rel="alternate"/><published>2022-06-10T00:00:00+02:00</published><updated>2022-06-10T00:00:00+02:00</updated><author><name>Hilmi Yildirim</name></author><id>tag:engineering.zalando.com,2022-06-10:/posts/2022/06/accelerate-apache-airflow-testing-through-dag-versioning.html</id><summary type="html">&lt;p&gt;In this blog post we present a way to version your Airflow DAGs on a single server through isolated pipeline and data environments to enable more convenient simulation and testing.&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the Performance Marketing department, we run paid advertisement campaigns for Zalando. To do so,
we build services that allow us to manage campaigns, optimize and distribute content,
and measure the performance of the campaigns at scale.&lt;/p&gt;
&lt;p&gt;Talking about measurement, one of the core systems we’ve built and continuously extended over
the years is our so-called marketing ROI (return on investment) pipeline. The ROI pipeline is
a batch-based data and machine learning pipeline powered by Databricks Spark and orchestrated
by Apache Airflow. It consists of various sub-pipelines (components), some of which are built
using our Python SDK &lt;a href="https://www.linkedin.com/pulse/building-ml-workflows-zalando-zflow-s%C3%A1nchez-fern%C3%A1ndez/"&gt;zFlow&lt;/a&gt;. Examples of these components are our input data preparation,
our marketing attribution model, and an incremental profit forecast for our campaigns.
These components are owned and developed by different cross-functional
teams (applied science, engineering, product) within Performance Marketing.
You can read more about the way we measure campaign effectiveness from a functional perspective in our previous &lt;a href="https://engineering.zalando.com/posts/2019/02/effectiveness-online-marketing.html"&gt;blog post&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Problem Statement&lt;/h1&gt;
&lt;p&gt;A recurring problem we faced during the development relates to the nature of the marketing
ROI, which lacks a ground truth&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;. This means that while we often have assumptions about the
impact that a change in input data or in our components has on the ROI, we require the new
version of the ROI pipeline to be run end-to-end to confirm our assumptions. Since different
teams are working on different components of the ROI pipeline in parallel, evaluating the
impact of a change on the final ROI in isolation is required to work effectively
(i.e. teams not blocking each other). The following section explains the problem in more depth.&lt;/p&gt;
&lt;p&gt;As mentioned earlier, we are using Airflow to orchestrate the overall pipeline. The Airflow
code is stored in a GitHub repository. We have two servers, production and test. When a pull
request is opened, the Airflow pipeline is deployed to the test server. On merge to the main
branch, we deploy to the production server. In this setup, we have two so-called pipeline
environments, a production (live) and a test environment. The live pipeline uses the live
data environment while the test pipeline uses the test data environment. As our data layer,
we’re mainly using AWS S3 with data organized as Spark tables. A set of Spark
tables represents a data environment. Only one version of an Airflow DAG such as our marketing
ROI pipeline can exist in each environment. When multiple features are developed at the same time,
they have to share the test environment which oftentimes leads to conflicts since testing in
isolation is not possible. Alternatively, the features can be tested sequentially which leads
to delays. To solve the problem, we implemented a mechanism to enable a flexible number of
Airflow environments. Moreover, we also developed a script to spin up new data environments.&lt;/p&gt;
&lt;p&gt;Figure 1 depicts the relationship between pipeline and data environments.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Environments" src="https://engineering.zalando.com/posts/2022/06/images/overview_environments.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 1: Environments&lt;/figcaption&gt;

&lt;h2&gt;Pipeline Environment&lt;/h2&gt;
&lt;p&gt;A pipeline environment is a version of a pipeline (set of Airflow DAGs) deployed to an Airflow server on which it can
run end-to-end. Each environment contains all DAGs necessary to produce the required output
(e.g. marketing ROI in our case), so multiple environments can co-exist on one server and can be used independently.&lt;/p&gt;
&lt;h2&gt;Data Environment&lt;/h2&gt;
&lt;p&gt;A data environment is a set of Spark/Hive databases, tables and views. A pipeline environment uses a single
data environment for reading and writing data.&lt;/p&gt;
&lt;h1&gt;Airflow Environments&lt;/h1&gt;
&lt;p&gt;Our main objective was to create a new Airflow environment once a pull request is
opened on which the developed version of the pipeline can be tested in isolation.
The most straightforward way is to create a new Airflow server for every pull request, which
would be time-consuming and costly. For example, Amazon Managed Workflows for Apache Airflow (MWAA)
needs up to 30 minutes to create a new Airflow server and you have to pay for additional resources.
With our solution, a new environment is created on the existing test server once a pull request is
opened, resulting in multiple environments on the same Airflow server. The creation of a
new environment takes less than one minute.&lt;/p&gt;
&lt;p&gt;Figure 2 shows what this could look like on the test server. We have two Airflow DAGs,
&lt;code&gt;qu.test_dag&lt;/code&gt; and &lt;code&gt;qu.test_dag_2&lt;/code&gt;, with three different environments: &lt;code&gt;feature1&lt;/code&gt;, &lt;code&gt;feature2&lt;/code&gt;
and &lt;code&gt;feature3&lt;/code&gt;. "qu" is the name of an internal team at Zalando; the DAGs always have the team name as a prefix.
This means that the same DAGs are adapted and deployed through three separate pull requests.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Airflow Environments" src="https://engineering.zalando.com/posts/2022/06/images/airflow_environments.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 2: Airflow Environments&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;When the corresponding pull request is closed, the environment is deleted automatically.
How did we implement this, given that the concept of environments does not exist in Airflow?
We adjusted the source code of the Airflow library and developed a cron job
which deletes environments after their pull requests are closed. The following sections explain the modifications we made.&lt;/p&gt;
&lt;h2&gt;Deploying Airflow code as a zip file&lt;/h2&gt;
&lt;p&gt;The Airflow code is deployed as a single zip archive using the
&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#packaging-dags"&gt;Packaging DAGs&lt;/a&gt;
feature. This feature prevents dependency conflicts because every deployment only uses
the dependencies which are defined in the same zip file.
The zip file has the name of the branch from which we are deploying. For example, when
we deploy the Airflow code from branch feature1, the zip file is called &lt;code&gt;feature1.zip&lt;/code&gt;.&lt;/p&gt;
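&lt;p&gt;As a rough illustration of this packaging step, the sketch below zips a checkout into an archive named after the branch. The function name &lt;code&gt;package_dags&lt;/code&gt; and the directory layout are assumptions made for the example, not our actual deployment script.&lt;/p&gt;

```python
import tempfile
import zipfile
from pathlib import Path


def package_dags(source_dir: Path, target_dir: Path, branch: str) -> Path:
    """Zip every file under source_dir into target_dir/{branch}.zip."""
    archive_path = target_dir / f"{branch}.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, e.g. qu/main/file.py.
                archive.write(path, path.relative_to(source_dir).as_posix())
    return archive_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical checkout of branch "feature1" with a single DAG file.
        repo = Path(tmp) / "repo"
        (repo / "qu" / "main").mkdir(parents=True)
        (repo / "qu" / "main" / "file.py").write_text("# DAG definition\n")
        dags_dir = Path(tmp) / "dags"
        dags_dir.mkdir()

        archive = package_dags(repo, dags_dir, "feature1")
        with zipfile.ZipFile(archive) as packaged:
            print(archive.name, packaged.namelist())
```

&lt;p&gt;Branch names that contain characters such as a slash would need to be normalised before they are used as a file name; the sketch does not handle that case.&lt;/p&gt;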
&lt;h2&gt;Use correct Jinja Paths&lt;/h2&gt;
&lt;p&gt;A problem that occurs when using a zip file is that Jinja templates for files no longer
work. Jinja detects the absolute path of the file correctly, but the file cannot
be read because it’s inside a zip file. For this reason, we also deploy the unpacked contents of the zip archive
to a different location. Inside the &lt;code&gt;dag.py&lt;/code&gt; file (see Figure 3, lines 13-19) we add the
location of the unpacked files to the template search path. As a result, Jinja now
searches for templates inside the unpacked folder.&lt;/p&gt;
&lt;h2&gt;Renaming DAG IDs&lt;/h2&gt;
&lt;p&gt;On one Airflow server, it’s not possible to create multiple DAGs with the same ID.
Therefore, we have to rename the DAG IDs for every deployment. For that reason, we adapted
the &lt;code&gt;dag.py&lt;/code&gt; file (see Figure 3) of the Airflow library, which contains the DAG class. Inside
the &lt;code&gt;__init__&lt;/code&gt; method, we check the file path of the Python file which initializes the DAG.
The path contains the name of the zip file, e.g. &lt;code&gt;feature1.zip&lt;/code&gt;. This way we can differentiate
the environments. We modify the original DAG ID and inject the environment name
(see Figure 3, lines 3-11). Furthermore, we add the environment name as a tag to enable
filtering on environments.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;():
…
    &lt;span class="n"&gt;def&lt;/span&gt; &lt;span class="n"&gt;__init__&lt;/span&gt;(...):
         &lt;span class="c1"&gt;# /usr/local/airflow/dags/feature1.zip/qu/main/file.py&lt;/span&gt;
         &lt;span class="n"&gt;file_path&lt;/span&gt; = &lt;span class="n"&gt;get_path_of_file_which_initialized_dag&lt;/span&gt;()

         &lt;span class="c1"&gt;#feature1&lt;/span&gt;
         &lt;span class="n"&gt;feature_name&lt;/span&gt; = &lt;span class="n"&gt;get_zip_file_name&lt;/span&gt;(&lt;span class="n"&gt;file_path&lt;/span&gt;)

         &lt;span class="n"&gt;dag_id&lt;/span&gt; = {&lt;span class="n"&gt;team_name&lt;/span&gt;}.&lt;span class="n"&gt;feature_name&lt;/span&gt;.{&lt;span class="n"&gt;dag_id&lt;/span&gt;.&lt;span class="nb"&gt;split&lt;/span&gt;({&lt;span class="n"&gt;team_name&lt;/span&gt;}.&amp;#39;)[&lt;span class="mi"&gt;1&lt;/span&gt;]}
         &lt;span class="n"&gt;tags&lt;/span&gt;.&lt;span class="nb"&gt;append&lt;/span&gt;(&lt;span class="n"&gt;feature_name&lt;/span&gt;)

         &lt;span class="c1"&gt;# /usr/local/airflow/features/feature1/&lt;/span&gt;
         &lt;span class="n"&gt;feature_dir_path&lt;/span&gt; = &lt;span class="n"&gt;get_feature_dir_path&lt;/span&gt;(&lt;span class="n"&gt;file_path&lt;/span&gt;)
         &lt;span class="n"&gt;template_searchpath&lt;/span&gt;.&lt;span class="nb"&gt;add&lt;/span&gt;(&lt;span class="n"&gt;feature_dir_path&lt;/span&gt;)

         &lt;span class="c1"&gt;# /usr/local/airflow/features/feature1/qu/main/&lt;/span&gt;
         &lt;span class="n"&gt;feature_file_path&lt;/span&gt; = &lt;span class="n"&gt;get_feature_file_dir_path&lt;/span&gt;(&lt;span class="n"&gt;file_path&lt;/span&gt;)
         &lt;span class="n"&gt;template_searchpath&lt;/span&gt;.&lt;span class="nb"&gt;add&lt;/span&gt;(&lt;span class="n"&gt;feature_file_path&lt;/span&gt;)
…
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;figcaption style="text-align:center"&gt;Figure 3: Pseudo Code of adapted dag.py&lt;/figcaption&gt;

&lt;h2&gt;Environment Cleanup&lt;/h2&gt;
&lt;p&gt;We have developed a cron job that checks the status of pull requests. Once a pull request is
closed, the corresponding environment is deleted on the Airflow server. The job deletes the
zip file and the folder which contains the unpackaged files. Then, it queries the Airflow
metastore for all associated DAGs and deletes them via the Airflow CLI.&lt;/p&gt;
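&lt;p&gt;A minimal sketch of such a cleanup step is shown below. It assumes the Airflow 2 command &lt;code&gt;airflow dags delete&lt;/code&gt; and receives the list of DAG ids as an argument instead of querying the metastore. The function name and directory arguments are hypothetical, and the commands are returned rather than executed so the example runs without an Airflow server.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def cleanup_environment(
    environment: str,
    dags_dir: Path,
    features_dir: Path,
    dag_ids: Iterable[str],
    team_name: str,
) -> List[List[str]]:
    """Remove an environment's files and return the CLI calls that delete its DAGs."""
    # Delete the packaged DAGs, e.g. dags/feature1.zip.
    zip_path = dags_dir / f"{environment}.zip"
    if zip_path.exists():
        zip_path.unlink()
    # Delete the unpacked copy that Jinja reads templates from.
    shutil.rmtree(features_dir / environment, ignore_errors=True)
    # Select the DAG ids that carry this environment's name after the team prefix.
    prefix = f"{team_name}.{environment}."
    return [
        ["airflow", "dags", "delete", "--yes", dag_id]
        for dag_id in dag_ids
        if dag_id.startswith(prefix)
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dags = Path(tmp) / "dags"
        features = Path(tmp) / "features"
        dags.mkdir()
        (features / "feature1").mkdir(parents=True)
        (dags / "feature1.zip").write_bytes(b"")

        known = ["qu.feature1.test_dag", "qu.feature2.test_dag"]
        for command in cleanup_environment("feature1", dags, features, known, "qu"):
            print(" ".join(command))
```

&lt;p&gt;In a real job, each returned command could be passed to &lt;code&gt;subprocess.run&lt;/code&gt; once the pull request is confirmed as closed.&lt;/p&gt;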
&lt;h1&gt;Data Environments&lt;/h1&gt;
&lt;p&gt;Every Airflow environment also requires a data environment, otherwise conflicts on the data
layer could occur during parallel feature development. Our data is mainly organized as Spark
databases stored on S3. A data environment is a set of Spark databases with a corresponding
suffix, e.g. all databases of the live environment have the suffix &lt;code&gt;_live&lt;/code&gt;. The DDLs of our
databases and tables are stored in a Git repository. We developed a script which uses the
DDLs to create a new data environment (see Figure 4). The databases have the environment
name as a suffix, e.g. &lt;code&gt;db_attribution_feature1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Environments" src="https://engineering.zalando.com/posts/2022/06/images/data_environments.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Figure 4: Create new Data Environment&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;A new data environment is initially empty, i.e. the databases do not contain any data.
We could copy the data, but this costs time and money. A more elegant way is the table
environment feature which we implemented with the data environment script. Instead of copying
data, the script creates a view pointing to the respective test data (see Figure 5).
Table environments are defined in a configuration file which is automatically created
via the table environment script. The script uses information about input and output
tables of all tasks which are predefined as yaml files. An example table environment
configuration is &lt;code&gt;db_attribution.m_events:TEST&lt;/code&gt;, resulting in the creation of the following
example view.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;CREATE VIEW db_attribution_feature1.m_events AS
SELECT * FROM db_attribution_test.m_events
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;figcaption style="text-align:center"&gt;Figure 5: Creating a view instead of copying data&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
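&lt;p&gt;To make the mapping from a table environment entry to the view in Figure 5 concrete, here is a small sketch. The function name &lt;code&gt;view_statement&lt;/code&gt; is hypothetical, and the rule that &lt;code&gt;TEST&lt;/code&gt; maps to the database suffix &lt;code&gt;_test&lt;/code&gt; is an assumption inferred from the example above.&lt;/p&gt;

```python
def view_statement(entry: str, environment: str) -> str:
    """Turn an entry such as 'db_attribution.m_events:TEST' into a CREATE VIEW statement."""
    table, source_environment = entry.split(":")
    database, table_name = table.split(".")
    # Assumption: database names carry the lower-cased environment name as a suffix.
    target = f"{database}_{environment}.{table_name}"
    source = f"{database}_{source_environment.lower()}.{table_name}"
    return f"CREATE VIEW {target} AS\nSELECT * FROM {source}"


if __name__ == "__main__":
    print(view_statement("db_attribution.m_events:TEST", "feature1"))
```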
&lt;p&gt;A view is only created if the table is not used as output by one of the respective tasks.
In some cases you need initial data for tables which are used as output. Therefore, the
table environment script creates a configuration stub for these tables like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;db_attribution.m_events:
    partitions:
        &lt;span class="k"&gt;-&lt;/span&gt; date BETWEEN &amp;quot;x&amp;quot; AND &amp;quot;y&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you define the partition ranges and execute the data environment script, it creates the
table and copies the data for you.&lt;/p&gt;
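&lt;p&gt;The decision between a view and a data copy can be sketched as follows. This is a hedged illustration only: the function name &lt;code&gt;plan_table&lt;/code&gt;, the &lt;code&gt;INSERT INTO ... SELECT&lt;/code&gt; form of the copy, and the example dates are assumptions, and the creation of the table from its DDL is left out.&lt;/p&gt;

```python
from typing import List, Optional, Set


def plan_table(
    table: str,
    environment: str,
    output_tables: Set[str],
    partitions: Optional[List[str]] = None,
    source_environment: str = "test",
) -> List[str]:
    """Return the SQL statements needed to provide data for one table."""
    database, table_name = table.split(".")
    target = f"{database}_{environment}.{table_name}"
    source = f"{database}_{source_environment}.{table_name}"
    if table not in output_tables:
        # Input-only table: a view onto the test data is enough.
        return [f"CREATE VIEW {target} AS SELECT * FROM {source}"]
    if not partitions:
        # Output table without configured partitions: it starts empty.
        return []
    # Output table with configured partitions: copy only those partitions.
    predicate = " OR ".join(f"({p})" for p in partitions)
    return [f"INSERT INTO {target} SELECT * FROM {source} WHERE {predicate}"]


if __name__ == "__main__":
    outputs = {"db_attribution.m_events"}
    stub = ['date BETWEEN "2022-01-01" AND "2022-01-31"']  # hypothetical range
    print(plan_table("db_attribution.m_events", "feature1", outputs, stub))
    print(plan_table("db_attribution.m_clicks", "feature1", outputs))
```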
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;In this blog post we presented how we enabled versioning of our performance marketing pipeline
which is based on Apache Airflow. Versioning is necessary to enable more convenient simulation
and testing. We modified the Airflow DAG class and used the Packaging DAGs feature of Apache Airflow
to make it possible to have multiple versions of the same DAGs on a single server. This allows us
to deploy a git branch consisting of Airflow DAGs directly to a single Airflow server where they
can run isolated from other versions. The deployment takes less than 1 minute compared to up to
30 minutes when you create a new Airflow server for the deployment. To enable isolation on data
level we implemented a script which spins up a new Data Environment consisting of Spark/Hive
tables on S3. As a result, every Pipeline version can use a dedicated Data Environment.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;This is simplified; ultimately, we consider the results of our A/B tests as ground truth.
Yet, A/B tests are only run in certain periods of the year and are used to correct our marketing
attribution results also in-between A/B test periods. Here, due to internal and external factors
such as spend changes or campaign efficiency changes, the ground truth could in fact have changed
as well.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Big Data"/><category term="Data Science"/><category term="Data"/><category term="Machine Learning"/></entry><entry><title>Operation-Based SLOs</title><link href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html" rel="alternate"/><published>2022-04-28T00:00:00+02:00</published><updated>2022-04-28T00:00:00+02:00</updated><author><name>Pedro Alves</name></author><id>tag:engineering.zalando.com,2022-04-28:/posts/2022/04/operation-based-slos.html</id><summary type="html">&lt;p&gt;Zalando developed a new type of SLOs to monitor the critical aspects of its business which is based on Operations. This blog post describes how that framework works, and how it contributes to healthier on-call rotations.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Zalando's 2019 Cyber Week Situation Room" src="https://engineering.zalando.com/posts/2022/04/images/preview-image.jpg#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;Anyone who has been following the topic of Site Reliability Engineering (SRE)
has likely heard of &lt;a href="https://sre.google/sre-book/service-level-objectives/"&gt;Service Level Objectives (SLOs)&lt;/a&gt;,
and Service Level Indicators (SLIs). SLIs and SLOs are at the core of the SRE
practices. They are fundamental to establish the balance between building new
features on a product, shipping fast, or working on the reliability of that
product. But they are not easy to get right. Zalando has gone through different
iterations of defining SLOs, and we’re now in the process of maturing our latest
iteration of SLO tooling. With this iteration, we are addressing fragmentation
problems that are inherent to service based SLOs in highly distributed
applications. Instead of defining reliability goals for each microservice, we
are working with SLOs on Critical Business Operations that are directly related
to the user experience (e.g. &lt;em&gt;"View Catalog"&lt;/em&gt;, &lt;em&gt;"Add Item to Cart"&lt;/em&gt;), rather
than a specific application (Catalog Service, Cart Service). In this blog post
we’re going to present our Operation Based SLOs, how we define them, the tooling
around them, how they are part of our development process, and also how they
contributed to a healthier on-call.&lt;/p&gt;
&lt;h2&gt;The first iterations of defining SLOs&lt;/h2&gt;
&lt;p&gt;To understand where we are right now, it’s important to understand how we got
here. When &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html"&gt;we introduced SRE in Zalando back in 2016&lt;/a&gt;
we also introduced SLOs. At the time, we went with service based SLOs. Each
microservice would have SLOs on whatever SLIs service owners defined (usually
availability and latency), and they would get a weekly report of those SLOs,
through a &lt;a href="https://github.com/zalando-zmon/service-level-reporting"&gt;custom tool&lt;/a&gt;
that was tightly coupled with our homebrew monitoring system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Service Level Reporting tool" src="https://engineering.zalando.com/posts/2022/04/images/slr-report.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Service Level Reporting tool&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;As these were new concepts in the company, we ran multiple workshops across the
company for Engineers and Product Managers to train them on the basics and to
kick-start the definition of SLOs across all engineering teams.
Product Managers and Product Owners started to get unexpected questions from
other peers and engineers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;"What is the desired level of service you wish to provide to your customers?"&lt;/li&gt;
&lt;li&gt;"How fast should your product be?"&lt;/li&gt;
&lt;li&gt;"When is the customer experience degraded to an unacceptable level?"&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last one was particularly relevant for services that have different levels
of graceful degradation. Say the service cannot respond in the ideal way; it
uses its first fallback strategy that is still "good enough" so we consider it a
success. But what if that first fallback also fails? We can use a second
fallback just so we don’t return an error, but maybe that is no longer a
response of acceptable quality. Even though the response was successful from the
client’s perspective, we still count it as an error.
What was particularly interesting about this thought process was that it created
a break from defining availability exclusively based on HTTP status codes (where
failure is anything in the 5xx range). It’s good to keep this reasoning in mind,
as it will be useful further down.&lt;/p&gt;
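&lt;p&gt;A small sketch can make this reasoning concrete. It assumes each response records which fallback produced it; the &lt;code&gt;Response&lt;/code&gt; type, the &lt;code&gt;fallback_level&lt;/code&gt; field and the cut-off of one acceptable fallback are illustrative choices, not part of our tooling.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Response:
    status_code: int
    fallback_level: int = 0  # 0 = ideal response, 1 = first fallback, 2 = second fallback


def is_good(response: Response, max_acceptable_fallback: int = 1) -> bool:
    """Count a response as successful only if its quality is still acceptable."""
    if response.status_code >= 500:
        return False
    # A non-error status is not enough: deeper fallbacks count as failures.
    return max_acceptable_fallback >= response.fallback_level


def availability(responses: Iterable[Response]) -> float:
    """Share of responses that count as good."""
    responses = list(responses)
    if not responses:
        return 1.0
    good = sum(1 for response in responses if is_good(response))
    return good / len(responses)


if __name__ == "__main__":
    sample = [
        Response(200),                    # ideal response
        Response(200, fallback_level=1),  # first fallback, still good enough
        Response(200, fallback_level=2),  # second fallback, counted as an error
        Response(503),                    # server error
    ]
    print(availability(sample))
```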
&lt;p&gt;SLOs saw an increasing adoption across the company, with many services having
SLOs defined and collected. This, however, did not mean that they were living up
to their full potential, as they were still not used to help balance feature
development and improving reliability. In a microservice architecture, a product
is implemented by multiple services. Some of those services contribute to
multiple products. As such, Product Managers had a hard time connecting the
myriad of SLOs and their own expectations for the products they are responsible
for. Because SLOs are on a microservice level, the closest manager would be on
the team level. Taking into consideration the previous point that a product is
implemented by multiple services, aligning the individual SLOs for a single
product would mean costly cross-team alignment. Raising the SLO discussion to a
higher management level would also be challenging, as microservices are too fine
grained for a Head or Director to be reviewing. &lt;strong&gt;The learning at this stage
was that the boundaries of Products did not match individual microservices.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Service landscape and products" src="https://engineering.zalando.com/posts/2022/04/images/service-landscape-and-products.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;In this service landscape we see that products can share individual services&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;We later tried to add additional structure to the existing SLOs. One of the
challenges we had with service based SLOs was the sheer amount of services that
had to be measured and monitored for their SLOs. Realistically speaking, they
could not all have the same level of importance. To ensure teams focused on what
mattered the most, a system of Tier classifications was developed - Tier 1 being
most critical and Tier 3 being least critical. With each service properly
classified, teams knew what they should be keeping a close eye on. Having the
Tier definition also allowed us to set canonical SLOs according to an
application's tier classification. Our tooling evolved to keep up with these
changes.&lt;/p&gt;
&lt;p&gt;To summarise, with service based SLOs we struggled to overcome the
following challenges:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High number of microservices.&lt;/strong&gt; The more there are, the more SLOs teams have to monitor, review, and fine tune.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mapping microservice SLOs to products and their expectations.&lt;/strong&gt; When products use different services to provide the end-user functionality and with some services supporting several products, SLOs easily conflict with each other.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SLOs on a fine grained level made it challenging for management to align on them.&lt;/strong&gt; When dealing with SLOs on such a granular level as microservices, management support beyond the team level is difficult to get. And within the team level, it requires costly cross-team alignment.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Symptom Based Alerting&lt;/h2&gt;
&lt;p&gt;In our role as SREs we were in frequent contact with different teams, helping
them with PostMortem investigation, or reviewing their monitoring (what metrics
were collected and paging alerts that were set up).
While teams were quick to collect many different metrics, figuring out what to
alert on was a more challenging task.  The default was to alert on signals that
&lt;em&gt;could&lt;/em&gt; indicate a wider system failure ("Load average is high", "Cassandra node
is down"). Knowing the right thresholds to alert on was another challenge. Too
strict, and you’re being paged all the time with false positives. Too relaxed,
and you’re missing out on potential customer impacting incidents. Even figuring
out whether the alert always translates to customer impact was also tricky at
times. All of this led us to push for a different alerting strategy: &lt;strong&gt;Symptom
Based Alerting&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You can find more details about Symptom Based Alerting in the &lt;a href="https://github.com/zalando/public-presentations/blob/master/files/2019-05-16_alerting_monitoring_and_all_that_jazz.pdf"&gt;slides of one of
the talks&lt;/a&gt;
we did on this topic. But the main message of that talk is that there are some
parallels between SLOs and Symptom Based Alerts. Namely, about &lt;em&gt;what&lt;/em&gt; makes a
good SLO, or a symptom worth alerting, and &lt;em&gt;how&lt;/em&gt; many SLOs and alerts you should
have.
Both SLOs and Symptom based alerts should be focused on key customer experiences&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;
by defining alerts and SLOs on signals that represent those experiences. Those
signals are stronger when they are measured closer to the customer, so we should
measure them on the edge services.
There are benefits to keeping both alerts and SLOs at a low number&lt;sup id="fnref2:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;sup id="fnref2:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;.
Focusing on the customer experience, rather than all the services and other
components that make up that experience helps ensure that. By alerting on
symptoms, rather than potential causes for issues, we can also identify issues
in a more comprehensive way&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;, as anything that may negatively affect the
customer experience will be noticed by the signal at the edge.&lt;/p&gt;
&lt;p&gt;Let's see how this works in practice by taking the following SLO as an example:
&lt;em&gt;"Catalog Service has 99.9% availability"&lt;/em&gt;. Let's assume Catalog Service is an
edge service responsible for providing to our customers the catalog information,
its categories, and the articles included in each category. If that service is
not available, customers cannot browse the Catalog. Because it is an edge
service it can fail due to issues in any of the downstream services. That, in
turn, would negatively affect the availability SLO. Any breach of the SLO means
that the customer experience is also affected.
Due to the connection between the SLO's performance and the customer experience
we come to the conclusion that the degradation of the SLI &lt;em&gt;"Catalog Service
availability"&lt;/em&gt; is a symptom of a degraded customer experience. The SLO sets a
threshold after which that degradation is no longer acceptable, and immediate
action is required. Or in other words, &lt;strong&gt;we should page when our SLO is missed,
or in danger of being missed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From this we derived the following formula:&lt;/p&gt;
&lt;center&gt;
&lt;em&gt;Service Level Objective = Symptom + Target&lt;/em&gt;
&lt;/center&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Essentially, we wanted to capture high level signals (or symptoms) that
represented customer interactions. These signals could be captured at the edge
services that communicate with our customers. If those signals degraded, then
the customer experience degraded. Regardless of whatever it was that caused that
degradation. If we couple that with an SLO, then, following the formula above,
we get our alert threshold implicitly.&lt;/p&gt;
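&lt;p&gt;A minimal sketch of that formula, assuming a hypothetical &lt;code&gt;OperationSLO&lt;/code&gt; type (the class, field, and method names are illustrative and not part of our tooling): once a symptom is paired with a target, the alert threshold follows from the target without any separate tuning.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass(frozen=True)
class OperationSLO:
    """Hypothetical pairing of a symptom (an edge signal) with a target."""

    symptom: str   # the signal measured at the edge
    target: float  # e.g. 0.999 for 99.9% availability

    @property
    def error_budget(self) -> float:
        # Share of requests that may fail while still meeting the SLO.
        return 1.0 - self.target

    def is_missed(self, failed: int, total: int) -> bool:
        # The threshold is implied by the target, nothing else is configured.
        if total == 0:
            return False
        return failed / total > self.error_budget


slo = OperationSLO(symptom="Catalog Service availability", target=0.999)
print(slo.is_missed(failed=5, total=10_000))   # False: 0.05% errors, within budget
print(slo.is_missed(failed=50, total=10_000))  # True: 0.5% errors, SLO missed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;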
&lt;p&gt;There is an additional feedback loop between SLOs and symptom based alerts when
you couple them like that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you get too many pages, then the respective SLO should be reviewed, even if temporarily.&lt;/li&gt;
&lt;li&gt;If you get too few pages, then maybe you can raise the SLO, as you are overdelivering.&lt;/li&gt;
&lt;li&gt;If you have a customer experience that is not covered by an alert, then you have likely also identified a new SLO.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The problem with setting up alerts at those edge services, however, was that it
would always fall down to the team owning those services to receive the paging
alerts and perform the initial triage to figure out what was going on.&lt;/p&gt;
&lt;p&gt;While the concept seemed solid, and made a lot of sense, we were still missing
one key ingredient: &lt;strong&gt;how could we measure and page based on these symptoms,
without burning out the team at the edge given they'd be paged all the time?&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Introducing Operation Based SLOs&lt;/h2&gt;
&lt;p&gt;When rolling out &lt;strong&gt;Distributed Tracing&lt;/strong&gt; in the company, one of the challenges
we faced was where to begin with the service instrumentation work to showcase
its value early on.
Our first instinct was to instrument the Tier 1 services (the most critical
ones). We decided against this approach because we wanted to observe requests
end-to-end, and instrumenting services by their criticality would not give us
the coverage across system boundaries we were aiming for. Also, it is relevant
to highlight that Tracing is an observability mechanism that is &lt;strong&gt;operation
based&lt;/strong&gt;, so we thought that going with a service based approach would be
counter-intuitive. We then decided to instrument a complete customer operation
from start to finish. But the question then became: "Which operation(s)?".&lt;/p&gt;
&lt;p&gt;Earlier, for our &lt;a href="https://engineering.zalando.com/tags/cyber-week.html"&gt;Cyber Week&lt;/a&gt;
load testing efforts, SREs and other experienced engineers compiled a list of
"User Functions". These were customer interactions that were critical to the
customer-facing side of our business. Zalando is an e-commerce fashion store, so
operations like &lt;em&gt;"Place Order"&lt;/em&gt; or &lt;em&gt;"Add to Cart"&lt;/em&gt; are key to the success of the
customer experience, and to the success of the business. The criticality
argument was also valid to guide our instrumentation efforts, so that is what we
used to decide which operations to instrument. This list became a major
influence on the work we did from then on.&lt;/p&gt;
&lt;p&gt;One of the key benefits we quickly got from Distributed Tracing was that it
allowed us to get a comprehensive look at any given operation. From looking at a
trace we could easily understand what the key latency contributors were, or
where an error originated in the call chain. As these quick insights started
becoming commonplace during incident handling, we started wondering if we could
automate this triage step.&lt;/p&gt;
&lt;p&gt;That train of thought led us to the development of an alert handler called
&lt;strong&gt;Adaptive Paging&lt;/strong&gt; (you can see the &lt;a href="https://www.usenix.org/conference/srecon19emea/presentation/mineiro"&gt;SRECon talk&lt;/a&gt;
to learn more details about Adaptive Paging). When this alert handler is
triggered, it reads the tracing data to determine where the error comes from
across the entire distributed system, and pages the team that is closest to the
problem. &lt;strong&gt;Essentially, by taking Adaptive Paging, and having it monitor an edge
operation, we achieved a viable and sustainable implementation of Symptom Based
Alerting&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Adaptive Paging" src="https://engineering.zalando.com/posts/2022/04/images/adaptive-paging.jpg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Adaptive Paging will traverse the Trace and identify the team to be paged&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
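&lt;p&gt;The traversal idea can be illustrated with a deliberately simplified sketch. It assumes a hypothetical &lt;code&gt;Span&lt;/code&gt; structure that only carries a service name, an owning team, and an error flag, and it follows errored spans down the call chain to the deepest one. The service and team names are made up, and the actual Adaptive Paging algorithm is described in the talk linked above rather than here.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span; real tracing data carries much more."""

    service: str
    owner_team: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def team_to_page(root: Span) -> Optional[str]:
    """Follow errored spans downwards and return the owner of the deepest one."""
    if not root.error:
        return None
    current = root
    while True:
        failing = [child for child in current.children if child.error]
        if not failing:
            return current.owner_team
        current = failing[0]  # assumption: follow the first failing child


trace = Span("checkout-frontend", "team-edge", error=True, children=[
    Span("cart", "team-cart"),
    Span("order", "team-order", error=True, children=[
        Span("payment", "team-payment", error=True),
    ]),
])
print(team_to_page(trace))  # team-payment
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In this example the edge operation fails, but the page goes to the team owning the deepest failing span instead of the team at the edge.&lt;/p&gt;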
&lt;p&gt;But rather than going around promoting Adaptive Paging as another tool that
engineers could use to be alerted, we were a bit more selective. A single
Adaptive Paging alert, monitoring an edge operation can cover all the services
in the call chain, which span multiple teams. No need to have every individual
team monitoring their own operations, when a single alert would serve the same
purpose (while being less noisy, and easier to manage). And figuring out what to
alert on was rather straightforward thanks to our list of "User Functions". We
renamed it to &lt;strong&gt;Critical Business Operations (CBO)&lt;/strong&gt;, to be able to encompass
more than strictly user operations, and once again followed that list to
identify the signals we wanted to monitor. Alerts need a threshold to work, though.
Picking alert thresholds was always a challenging task. If we are talking about
an alert handler that can page any number of teams across several departments,
this becomes an even more sensitive topic that requires stronger governance.&lt;/p&gt;
&lt;p&gt;Our list of CBOs was a customer centric list of symptoms that could "capture
more problems more comprehensively and robustly". And SLOs should represent the
"most critical aspects of the user experience". Basically, all we needed was a
target (which would be our alert threshold) and we would also have SLOs. &lt;strong&gt;CBOs
then became an implementation of Operation Based SLOs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Let’s take as an example &lt;em&gt;"Place Order"&lt;/em&gt;. This operation is clearly critical to
our business, which is why it was one of the first to make the Critical Business
Operations list. As there are many teams and departments owning services that
are contributing to this operation, the ownership for the SLO is critical. We
chose the senior manager owning the customer experience of the Checkout and
Sales Order systems to define and be accountable for the SLO of the &lt;em&gt;"Place
Order"&lt;/em&gt; operation. This also ensured that the SLO had management support.
We repeated this process for the remaining CBOs. We identified the senior
managers responsible for each of the CBOs (Directors, VPs and above) and
discussed the SLOs for those operations. With each discussion we would end up
with: a CBO with an SLO signed off by senior management; and a new alert on that
same CBO that would be sure to page only on situations where customers were
truly affected.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Our Operation Based SLOs tackled the issues we had with the service based
approach:&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: center;"&gt;Service Based SLOs&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Operation Based SLOs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;High number of SLOs.&lt;/td&gt;
&lt;td style="text-align: center;"&gt;A short list of SLOs, easier to maintain as changes in service landscape have no implications on the SLO definition.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;Difficult mapping from services to products.&lt;/td&gt;
&lt;td style="text-align: center;"&gt;SLOs are now agnostic of the services implementing the Critical Business Operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: center;"&gt;SLOs on a fine grained level made it challenging for management to align on them.&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Products have owners. We also changed the approach from bottom-up, to top-down to bring additional transparency to that ownership.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There were additional benefits that came with this new strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Longevity of the SLOs&lt;/strong&gt; → "View Product Details" is something that has always existed in the company’s history, but as a feature it has gone through different services and architectures implementing it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Using SLOs to balance feature development with reliability&lt;/strong&gt; → Before, the lack of ownership meant that teams were not clear when to stop feature development work to improve reliability should the availability decline. Now they had a clear message from the VP or Director that the SLO was a target that had to be met.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Out-of-the-box alerts&lt;/strong&gt; → Our Adaptive Paging alert handler was designed to cover CBOs. As soon as a CBO has an SLO, it can have an alert with its thresholds derived from the SLO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transport agnostic measurements&lt;/strong&gt; → Availability SLOs no longer need to be about 5xx rate, or using additional elaborate metrics. OpenTracing’s error tag makes it a lot easier for engineers to signal an operation as conceptually failed. This enables the graceful degradation scenario mentioned earlier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Understanding impact during an incident&lt;/strong&gt; → A 50% error rate in Service Foo is not easily translatable to customer or business impact without a deep understanding of the service landscape. A 50% error rate on “Add to cart” is much clearer to communicate, and makes the urgency of addressing it immediately obvious.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;SRE continued the rollout of CBOs by working closely with the senior management
of several departments agreeing on SLOs that would be guarded by our Adaptive
Paging alert handler. With this we also continued the adoption of Symptom Based
Alerting. As more and more CBOs were defined, we needed to improve the reporting
capabilities of our tooling, and developed a &lt;strong&gt;new Service Level Management&lt;/strong&gt;
tool that catered to this operation based approach.&lt;/p&gt;
&lt;p&gt;&lt;img alt="New SLO tool" src="https://engineering.zalando.com/posts/2022/04/images/slo-tool.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Our Service Level Management Tool (operation based - not actual data)&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;As the coverage of CBOs and their respective alerts took off, we started getting
reports that the alerts were too sensitive. Particularly, there were multiple
occasions of short lived error spikes that resulted in pages to on-call
responders. To prevent these situations, engineers started adding complex rules
to the alerts on a trial and error basis (usually using time of day, throughput,
duration of the error condition).
SRE was aiming at creating alerts that required little effort from engineers to
set up, needed no fine tuning, and would not have to change as components and
architecture evolved. We were not there yet, but we soon
evolved our Adaptive Paging alert handler to use the &lt;a href="https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts"&gt;Multi Window Multi Burn Rate&lt;/a&gt;
strategy which uses burn rates to define alert thresholds. &lt;strong&gt;The Error Budget
became much more relevant with this change.&lt;/strong&gt; The alerts went from being
triggered whenever the error rate breached the SLO, to being triggered based on
the rate at which we are burning the error budget for an operation. This not only
prevented on-call responders from being
paged by short lived error spikes, but also meant we could pick up on slowly
burning error conditions.
Because the Error Budget is derived from the SLO, it is still the SLO that made
it possible to derive the alert threshold automatically. Together with the
adaptability of Multi Window Multi Burn Rate which made it unnecessary to fine
tune alerts, this meant engineering teams required no effort to set up and manage
these alerts.
We also made sure that the Error Budget was visible in our new Service Level
Management tool.&lt;/p&gt;
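&lt;p&gt;To make the burn rate idea concrete, here is a small sketch under stated assumptions: the function names are hypothetical, and the window pairs and thresholds (14.4 and 6) are the paging values from the Google SRE Workbook example linked above, not necessarily the ones we run in production. A burn rate of 1.0 means the error budget would be used up exactly at the end of the SLO period.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is consumed; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


# (long window, short window, burn rate threshold), as in the Workbook example.
PAGING_RULES = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def should_page(rates: dict) -> bool:
    """Page only if both windows of some rule burn faster than its threshold."""
    return any(
        rates[long_w] > limit and rates[short_w] > limit
        for long_w, short_w, limit in PAGING_RULES
    )


# 0.2% errors against a 99.9% SLO burn the budget twice as fast as allowed.
print(round(burn_rate(failed=20, total=10_000, slo_target=0.999), 1))  # 2.0

# Made-up burn rates per window, for illustration only.
short_spike = {"5m": 30.0, "1h": 2.0, "30m": 5.0, "6h": 0.5}
slow_burn = {"5m": 7.0, "1h": 7.0, "30m": 7.0, "6h": 6.5}
print(should_page(short_spike))  # False: the long windows stay calm
print(should_page(slow_burn))    # True: sustained burn over 6h and 30m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Requiring both windows to exceed the threshold is what filters out short lived spikes, while the longer, lower threshold rule is what catches slow burns.&lt;/p&gt;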
&lt;p&gt;&lt;img alt="Error budget view" src="https://engineering.zalando.com/posts/2022/04/images/error-budget.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Error Budget over three 28 day periods&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;h2&gt;Putting this model to the test&lt;/h2&gt;
&lt;p&gt;Everything we described so far seems to make perfect sense. And as we explained
it to several teams, no one seemed to make any argument against it. But still,
we were not seeing the initiative gaining the momentum we expected. Even teams
that did adopt CBOs, weren’t disabling their cause based alerts. Something was
missing. We needed the data to support our claims of a better process that would
reduce false positive alerts, while ensuring wide coverage of critical systems.
That’s what we set out to do, by &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food"&gt;&lt;em&gt;dogfooding&lt;/em&gt;&lt;/a&gt;
the process within the department.&lt;/p&gt;
&lt;p&gt;For 3 months we put the whole flow to the test within the SRE department. We
defined and measured CBOs for our department, with their SLO targets (at the
same time demonstrating that this approach wasn’t exclusively for the use of
end-user or external customer systems). Because SRE owns the Observability
Platform our CBOs included operations like &lt;em&gt;"Ingest Metrics"&lt;/em&gt;, or &lt;em&gt;"Query Traces"&lt;/em&gt;.
Those CBOs were monitored by Adaptive Paging alerts. Within our weekly
operational review meeting we would look at the alerts and incidents created in
the previous week, and gradually identify which cause based alerts could be
safely disabled. All of this had the support of senior management,
granting engineers the confidence to take these steps.&lt;/p&gt;
&lt;p&gt;By the end of that quarter we reduced the False Positive Rate for alerts within
the department from 56% to 0%. We also reduced the alert workload from 2 to 0.14
alerts per day. And we did this without missing any relevant user-facing
incidents. In the process we disabled over 30 alerts from all the teams in the
department. Those alerts were either prone to False Positives, or already
covered by the symptom based alerts.&lt;/p&gt;
&lt;p&gt;One thing the on-call team did bring up was that shifts had become too calm.
They risked losing their on-call ‘muscle’. We tackled this with regular
&lt;a href="https://sre.google/sre-book/accelerating-sre-on-call/#xref_training_disaster-rpg"&gt;"Wheel of Misfortune"&lt;/a&gt;
sessions, to keep knowledge fresh, and review incident documentation and tooling.&lt;/p&gt;
&lt;h2&gt;What's next?&lt;/h2&gt;
&lt;p&gt;We are not done yet with our goal of rolling out Operation Based SLOs. There are
still more Critical Business Operations that we can onboard, for one. And as we
onboard those operations, teams can start turning off their cause based alerts
that lead to false positives.&lt;/p&gt;
&lt;p&gt;And there are additional evolutions we can add to our product.&lt;/p&gt;
&lt;h3&gt;Alerting on latency targets&lt;/h3&gt;
&lt;p&gt;Right now, CBOs only set Availability targets. We also want CBO owners to define
latency targets. After all, our customers not only care that the experience
works, but also that it is fast.
While we already have the latency measurements, and could, technically, trigger
alerts when that latency breaches the SLO, it is challenging to use our current
Adaptive Paging algorithm to track the source of the latency increase. We don’t
want to burden the team owning the edge component with every latency alert, so
we are holding off on those alerts until a proper solution is found.&lt;/p&gt;
&lt;h3&gt;Event based systems&lt;/h3&gt;
&lt;p&gt;So far we’ve been focusing on direct end-customer experiences, which are served
mostly by RPC systems. There is a good chunk of our business that relies on
event based systems, and that we also want to cater for with our CBO framework.
This is quite the undertaking, as monitoring of event based systems is not as
well established as traditional HTTP APIs. Also, Distributed Tracing, the
telemetry pillar behind our current monitoring and alerting of CBOs, was not
designed with an event based architecture in mind. And the loss of the causality
property reduces the usefulness of our Adaptive Paging algorithm.&lt;/p&gt;
&lt;h3&gt;Non-edge customer operations&lt;/h3&gt;
&lt;p&gt;We always tried to measure customer experience as close to the edge as possible.
There are, however, some operations that are deeper in the call chain, but would
still benefit from closer monitoring. To prevent an uncontrolled growth of CBOs,
well defined criteria need to be in place to properly identify and onboard
these operations.&lt;/p&gt;
&lt;h2&gt;Closing notes&lt;/h2&gt;
&lt;p&gt;Operation Based SLOs granted us quite a few advantages over Service Based SLOs.
Through this type of SLOs we were also able to implement Symptom Based Alerting,
with clear benefits for the on-call health of our engineers. And we were even
able to demonstrate the effectiveness of this new approach with numbers, after
trialing it within the SRE department.&lt;/p&gt;
&lt;p&gt;But the purpose of this post is not to present a new and better type of SLOs. We
see operation based SLOs and service based SLOs as different implementations of
SLOs. Depending on your organization, and/or architecture, one implementation or
the other may work better for you. Or maybe a combination of the two.&lt;/p&gt;
&lt;p&gt;Here at Zalando we are still learning as the adoption of this framework grows in
the organization. We'll keep sharing our experience when there are significant
changes through future blog posts. Until then we hope this inspired you to give
operation based SLOs a try, or that it inspires the development of a different
implementation of SLOs.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/gcp/building-good-slos-cre-life-lessons"&gt;Google Cloud Platform Blog, Building good SLOs - CRE life lessons&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a href="https://prometheus.io/docs/practices/alerting/"&gt;Prometheus Best Practices&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://sre.google/sre-book/service-level-objectives/"&gt;SRE Book, Chapter 4 - Service Level Objectives&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;&lt;a href="https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit"&gt;Rob Ewaschuk, "My Philosophy on Alerting"&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Zalando's Machine Learning Platform</title><link href="https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html" rel="alternate"/><published>2022-04-19T00:00:00+02:00</published><updated>2022-04-19T00:00:00+02:00</updated><author><name>Krzysztof Szafranek</name></author><id>tag:engineering.zalando.com,2022-04-19:/posts/2022/04/zalando-machine-learning-platform.html</id><summary type="html">&lt;p&gt;Architecture and tooling behind machine learning at Zalando&lt;/p&gt;</summary><content type="html">&lt;p&gt;To optimize the fashion experience for 46 million of our customers, Zalando embraces the opportunities provided by machine learning (ML). For example, we use recommender systems so you can easily find your favorite shoes or that great new shirt. We want these items to fit you perfectly, so a different set of algorithms is at work to give you the best size recommendations. Our demand forecasts will ensure that everything is in stock, even when you decide to make a purchase in the middle of a Black Friday shopping spree.&lt;/p&gt;
&lt;p&gt;As we grow our business, we look for innovative ideas to improve user experience, become more sustainable, and optimize existing processes. What does it take to develop such an idea into a mature piece of software operating at Zalando's scale? Let's look at it from the point of view of a machine learning practitioner, such as an applied scientist or a software engineer.&lt;/p&gt;
&lt;h2&gt;Experimenting with Ideas&lt;/h2&gt;
&lt;p&gt;Jupyter notebooks are a frequently used tool for creative exploration of data. Zalando provides its ML practitioners with access to a hosted version of JupyterHub, an experimentation platform where they can use Jupyter notebooks, R Studio, and other tools they may need to query available data, visualize results, and validate hypotheses. Internally we call this environment Datalab. It is available via a web browser, comes with web-based shell access and common data science libraries.&lt;/p&gt;
&lt;p&gt;Because Datalab provides pre-configured access to various data sources within Zalando, such as S3, BigQuery, MicroStrategy, and others, its users don't have to worry about setting up the necessary tools and clients on their own laptops. Instead, they're ready to start experimenting in less than a minute.&lt;/p&gt;
&lt;p&gt;While Datalab is well suited for prototyping and getting quick feedback, it's not always enough, especially when big data is involved. Apache Spark is much better suited for that purpose, and Zalando users can access it via Databricks. It's a well-known tool within the data science community, suitable for both experimentation via notebooks and for running large-scale data processing jobs in Spark clusters.&lt;/p&gt;
&lt;p&gt;Some experiments require extra processing power, e.g. when they involve computer vision or training of large models. For these purposes, our applied scientists have access to a high-performance computing cluster (HPC) equipped with powerful GPU nodes. Using the HPC is as easy as connecting to it via SSH.&lt;/p&gt;
&lt;h2&gt;ML Pipelines in Production&lt;/h2&gt;
&lt;p&gt;One of the most frequently discussed problems in machine learning is crossing the gap between experimentation and production, or in more crude terms: between a notebook and a machine learning pipeline.&lt;/p&gt;
&lt;p&gt;Jupyter notebooks don't scale well to requirements typical for running ML in a large-scale production environment. These requirements include secure and privacy-respecting access to large datasets, reproducibility, high performance, scalability, documentation, and observability (logging, monitoring, debugging). A machine learning pipeline is a sequence of steps that can meet these additional requirements, and describes how the data will be extracted and processed, what hardware infrastructure is required, and how to train and deploy the model. Additionally, ML pipelines at Zalando should follow best practices of software engineering: the code needs to be stored in git, clean, readable, and reviewed by at least two people.
An ML pipeline can be visualized as a graph, like the one shown below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example ML pipeline" src="https://engineering.zalando.com/posts/2022/04/images/pipeline.png#center"&gt;&lt;/p&gt;
&lt;p&gt;But how does one implement such a pipeline? In early 2019 we at Zalando decided to use AWS Step Functions for orchestrating machine learning pipelines. Step Functions is a platform for building and executing workflows consisting of multiple steps that may call various other services, such as AWS Lambda, S3 and Amazon SageMaker. These calls can be used to perform all steps comprising an ML pipeline, from data processing (e.g. by invoking Databricks API), to running training and batch processing jobs in Amazon SageMaker and creating SageMaker endpoints for real-time inference. The fact that Zalando already used AWS as its main cloud provider, and the flexibility provided by integrations with other services made Step Functions a good fit for our machine learning needs.&lt;/p&gt;
&lt;p&gt;A Step Functions workflow is a state machine that can either be created visually using an editor provided by AWS or deployed as a JSON or YAML file known as a CloudFormation (CF) template. CloudFormation is another AWS service that implements the concept of infrastructure as code, and allows developers to specify needed AWS resources in a text file. We can thus use a CF template to describe Lambda functions and security policies used by the Step Functions workflow that is our ML pipeline. After the template is deployed to AWS, CloudFormation will create all resources listed in the file.&lt;/p&gt;
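&lt;p&gt;For readers who have not seen a Step Functions state machine definition, the sketch below builds a minimal one in the Amazon States Language and prints it as JSON. It is only an illustration: the ARN prefix, account number, region, and function names are placeholders, and a real ML pipeline would call services such as Databricks or SageMaker with far more configuration than three bare Task states.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json

# Placeholder ARN prefix; account, region, and function names are made up.
LAMBDA = "arn:aws:lambda:eu-central-1:123456789012:function:"

stages = ["data-processing", "training", "batch-inference"]

# Chain the stages: each Task points to the next one, the last one ends the run.
states = {}
for index, name in enumerate(stages):
    state = {"Type": "Task", "Resource": LAMBDA + name}
    if index + 1 == len(stages):
        state["End"] = True
    else:
        state["Next"] = stages[index + 1]
    states[name] = state

definition = {"StartAt": stages[0], "States": states}
print(json.dumps(definition, indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Even this toy definition hints at how quickly hand-written templates grow once retries, permissions, and service-specific parameters are added.&lt;/p&gt;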
&lt;p&gt;CloudFormation templates are highly expressive and allow developers to describe even minute details. Unfortunately, CF files can become verbose and are tedious to edit manually. We addressed this problem by creating zflow, a Python tool for building machine learning pipelines. Since its creation, zflow has been used to create hundreds of pipelines at Zalando.&lt;/p&gt;
&lt;p&gt;A pipeline in a zflow script is a Python object with a series of stages attached to it. zflow provides a number of custom functions for configuring ML tasks, for example training, batch transform, and hyperparameter tuning. It also offers flow control so stages can be run conditionally or in parallel. Together these functions form a Domain Specific Language (DSL) for describing pipelines in a concise and readable form. Because zflow code is annotated with type hints, users can spot mistakes early on, and the available warnings go beyond simple syntax checks available for JSON and YAML templates.&lt;/p&gt;
&lt;p&gt;The code listing below demonstrates an example zflow pipeline, with some configuration options omitted for brevity. It shows how three stages are created and added to a pipeline in the desired order. The pipeline is then added to a stack (a group of CloudFormation resources). The last line specifies where the resulting template should be saved.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_processing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;databricks_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;data_processing_job&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;training&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;training_job&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;batch_inference&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch_transform_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;batch_transform_job&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PipelineBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;example-pipeline&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_processing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_stage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_inference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StackBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;example-stack&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;zflow_pipeline.yaml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When a pipeline script is executed, zflow uses AWS CDK to generate a CloudFormation template file. The file contains all the information needed to create the necessary AWS resources. All that is needed now is to commit and push the generated template to the git repository and let Zalando Continuous Delivery Platform (CDP) deploy it to AWS. When that is done, our pipeline will appear in the AWS Console as a Step Functions state machine. It can then be executed, either via a scheduler (like in our example), manually in the Console, or programmatically via an API call.&lt;/p&gt;
&lt;p&gt;With zflow, a pipeline can be coded in a concise way, tested, then versioned in a git repository, deployed, run, and scaled as needed. To ensure that it works as expected, we can track its executions using a custom web interface. Pipeline tracking is a part of the internal Zalando developer portal running on top of &lt;a href="https://backstage.io/"&gt;Backstage&lt;/a&gt;, an open-source platform for building such portals. Here is a screenshot of a series of pipeline executions in the ML portal.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ML portal in Backstage" src="https://engineering.zalando.com/posts/2022/04/images/backstage.jpg#center"&gt;&lt;/p&gt;
&lt;p&gt;This ML web interface provides a detailed, real-time view of pipeline execution. Pipeline authors can monitor how metrics evolve across multiple runs of training pipelines and can view these changes on a graph. They can also view model cards for models created by the pipelines. These are just a few features of the ML portal, and the tool is actively developed to improve the process of experimenting with notebooks and deploying the pipelines in production.&lt;/p&gt;
&lt;p&gt;The detailed journey of a pipeline is shown in the diagram below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lifecycle of an ML pipeline at Zalando" src="https://engineering.zalando.com/posts/2022/04/images/zflow-diagram.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Admittedly, that's a lot to take in! Let's summarize the steps and tools we discussed so far:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We use JupyterHub, Databricks, and a high-performance computing cluster for ML experimentation.&lt;/li&gt;
&lt;li&gt;We describe our ML pipelines in Python scripts with zflow DSL. Pipelines can use various resources, such as Databricks jobs for big data processing and Amazon SageMaker endpoints for real-time inference.&lt;/li&gt;
&lt;li&gt;When we run the pipeline script, zflow will internally call AWS CDK to generate a CloudFormation template.&lt;/li&gt;
&lt;li&gt;We commit and push the template to a git repository, and Zalando Continuous Delivery Platform will then upload it to AWS CloudFormation.&lt;/li&gt;
&lt;li&gt;CloudFormation will create all the resources specified in the template, most notably: a Step Functions workflow. Our pipeline is now ready to run.&lt;/li&gt;
&lt;li&gt;A web portal built with Backstage provides a visual overview of running pipelines, together with additional information relevant to ML practitioners.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;zflow and the dedicated web UI abstract away most of the complexity of building production pipelines with AWS tooling, such as CDK and CloudFormation, so ML practitioners can focus on their domain rather than the infrastructure. While zflow takes full advantage of AWS, it also allows us to integrate other tools used within the company and to quickly respond to our specific needs.&lt;/p&gt;
&lt;h2&gt;The Organization&lt;/h2&gt;
&lt;p&gt;Tooling is just one side of using any technology. Another aspect is the organizational structure that allows experts to work and collaborate effectively. While applying ML within the company, Zalando uses a distributed setup with additional resources in place to support reusing tools and practices across the organization. Most expertise is spread across over a hundred product teams working in their specific business domains. These teams have dedicated software engineers and applied scientists who in their daily work use both 3rd party products (e.g. AWS, Databricks) and internal tools (zflow, ML web portal).&lt;/p&gt;
&lt;p&gt;Our experts are assisted by a few central teams which operate and develop some of the aforementioned tools. For example, a dedicated team provides support and improvements to our JupyterHub installation and the HPC cluster. Two teams actively develop zflow and monitoring tools for pipelines. Another group consisting of ML consultants works closely with product teams, offering trainings, architectural advice, and pair programming. A separate research team actively explores and disseminates the state-of-the-art in algorithmics, deep learning, and other branches of AI.&lt;/p&gt;
&lt;p&gt;On top of that, our data science community provides platforms to exchange best practices from internal teams, academia, and the rest of the industry through expert talks, workshops, reading groups, and an annual internal conference.&lt;/p&gt;
&lt;h2&gt;Exciting Times&lt;/h2&gt;
&lt;p&gt;Teams at Zalando tackle many of the difficult problems in the space of &lt;a href="https://engineering.zalando.com/tags/machine-learning.html"&gt;machine learning and MLOps&lt;/a&gt;, such as reducing the time needed to validate and implement new ideas at scale and improving model observability. We constantly look for new ways to use technology to be faster, more efficient, and innovative in meeting all fashion-related needs of our customers. Best news: we would like to work with you on these exciting ML challenges!&lt;/p&gt;</content><category term="Zalando"/><category term="AWS"/><category term="Machine Learning"/><category term="Experimentation"/><category term="Zalando Science"/><category term="Backend"/><category term="Data"/></entry><entry><title>Functional tests with Testcontainers</title><link href="https://engineering.zalando.com/posts/2022/04/functional-tests-with-testcontainers.html" rel="alternate"/><published>2022-04-12T00:00:00+02:00</published><updated>2022-04-12T00:00:00+02:00</updated><author><name>Marek Hudyma</name></author><id>tag:engineering.zalando.com,2022-04-12:/posts/2022/04/functional-tests-with-testcontainers.html</id><summary type="html">&lt;p&gt;We explore how to write functional tests using Testcontainers.org library in Java-based backend applications.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article, I will show how teams at &lt;a href="https://zms.zalando.com/"&gt;Zalando Marketing Services&lt;/a&gt; are using functional tests. We will follow the idea of functional tests: the main concept and the attributes of a good functional test. Then, we will discuss an example based on the TestContainers library used in the Spring environment.&lt;/p&gt;
&lt;p&gt;An introduction to the &lt;a href="https://www.testcontainers.org/"&gt;TestContainers library&lt;/a&gt; is outside the scope of this article; you can find one in my previous article, &lt;a href="https://engineering.zalando.com/posts/2021/02/integration-tests-with-testcontainers.html"&gt;Integration tests with Testcontainers&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Definition of functional test&lt;/h2&gt;
&lt;p&gt;There are many definitions of functional testing. For example, the definition found on &lt;a href="https://en.wikipedia.org/wiki/Functional_testing"&gt;Wikipedia&lt;/a&gt; is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Functional testing is a quality assurance (QA) process and a type of black-box testing that bases its test cases on the specifications of the software component under test. Functions are tested by feeding them input and examining the output, and internal program structure is rarely considered (unlike white-box testing). Functional testing is conducted to evaluate the compliance of a system or component with specified functional requirements. Functional testing usually describes what the system does.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Functional tests answer the fundamental question: &lt;code&gt;Do the features work as intended&lt;/code&gt;?
Functional tests do not answer the question of &lt;code&gt;HOW&lt;/code&gt; the system works internally, but rather &lt;code&gt;WHAT&lt;/code&gt; the result should be.&lt;/p&gt;
&lt;h2&gt;Non-functional vs. functional testing&lt;/h2&gt;
&lt;p&gt;What is the key difference between &lt;strong&gt;non-functional software testing&lt;/strong&gt; and &lt;strong&gt;functional testing&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;The answer is relatively simple: non-functional testing is concerned with &lt;strong&gt;how&lt;/strong&gt;, and functional testing is concerned with &lt;strong&gt;what&lt;/strong&gt;.
Functional testing verifies what the system should do, and non-functional testing tests how well the system works. The intention of functional testing is to verify software actions, and non-functional testing validates the behavior of the application.&lt;/p&gt;
&lt;p&gt;Another comparison you might see when discussing this is black-box testing vs white-box testing. Black-box testing looks at the functionality of the software &lt;strong&gt;without&lt;/strong&gt; looking at the &lt;strong&gt;internal structures&lt;/strong&gt;. White-box testing is aware of the internal structures.&lt;/p&gt;
&lt;h2&gt;Concept&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.testcontainers.org/"&gt;Testcontainers.org&lt;/a&gt; is a JVM library that allows users to run and manage docker images and control them from Java code.
Zalando uses it mainly for integration and functional tests.&lt;/p&gt;
&lt;p&gt;The main purpose of functional tests with the Testcontainers library is to set up a black-box test, by using an environment closest to the production one.
To achieve this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;package and run your service in a docker container&lt;/strong&gt;;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;run all its dependencies&lt;/strong&gt;, like: database, queues, streams, &lt;strong&gt;as separate docker containers&lt;/strong&gt;;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;make your service connect to locally run dependencies&lt;/strong&gt;;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;make your testing code independent of implementation&lt;/strong&gt;;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The invocation structure can look like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Functional tests communicate with your service run as Docker images." src="https://engineering.zalando.com/posts/2022/04/images/concept.png"&gt;&lt;/p&gt;
&lt;p&gt;Your entire production code needs to be packaged and run as a docker image.
If your service needs to communicate with a database, you need to run the database as a docker image as well.
Your functional tests will test your code run as a docker image, so your testing code does not have any connection to production code.&lt;/p&gt;
&lt;p&gt;You also need to remember that a proper pyramid of tests is (when sorted from the highest to the lowest number of tests):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;component tests&lt;/li&gt;
&lt;li&gt;integration tests&lt;/li&gt;
&lt;li&gt;functional tests&lt;/li&gt;
&lt;li&gt;system tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Functional tests are valuable, but they should not dominate your test suite.&lt;/p&gt;
&lt;h2&gt;Packaging your application into a docker container&lt;/h2&gt;
&lt;p&gt;Packaging your application into a docker image is pretty simple. In the root of your repository, just define a Dockerfile like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;openjdk:17-alpine&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;service/target/application-exec.jar&lt;span class="w"&gt; &lt;/span&gt;application.jar
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;java&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ADDITIONAL_JAVA_OPTIONS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-jar&lt;span class="w"&gt; &lt;/span&gt;application.jar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As an alternative solution, I would suggest using &lt;a href="https://github.com/GoogleContainerTools/jib"&gt;Jib&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Code separation&lt;/h2&gt;
&lt;p&gt;I recommend organizing code into a multi-module maven project with two modules: service and functional-tests.
The functional-tests module cannot have any dependency on the service module.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;├── service&lt;/span&gt;
&lt;span class="n"&gt;│   └── pom.xml&lt;/span&gt;
&lt;span class="n"&gt;├── functional-tests&lt;/span&gt;
&lt;span class="n"&gt;│   └── pom.xml&lt;/span&gt;
&lt;span class="n"&gt;├── Dockerfile&lt;/span&gt;
&lt;span class="n"&gt;└── pom.xml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because we don’t have access to the service code, we cannot use any DTO objects, database repositories, etc.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We should operate on the simplest possible interfaces. For example, if we call a REST endpoint, send plain JSON and read JSON. Don’t create any internal DTOs. This places you in the position of a real client of your service.&lt;/li&gt;
&lt;li&gt;I recommend using only official interfaces to create resources, e.g. create entities via the REST interface. We could insert the entity directly into the database and have the test only retrieve it, but then it would no longer be a black-box test. If the storage of the service changes in the future, we would need to change our tests.&lt;/li&gt;
&lt;/ul&gt;
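&lt;p&gt;As a sketch of the first point, a test can build its request from a plain JSON string using only the JDK's &lt;code&gt;java.net.http&lt;/code&gt; types, without any DTO classes. The &lt;code&gt;/accounts&lt;/code&gt; path, the payload, and the class and method names below are hypothetical and only illustrate the idea:&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpRequest;

/**
 * Sketch only: builds the HTTP request a functional test could send.
 * The /accounts path and the JSON payload are hypothetical.
 */
final class PlainJsonRequests {
    private PlainJsonRequests() {}

    /** Builds a POST that carries the JSON document as an untyped string. */
    static HttpRequest createAccount(String baseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/accounts"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // No network call is made here; the request is only constructed.
        var request = createAccount("http://localhost:8080", "{\"name\":\"example\"}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

&lt;p&gt;Sending the request with &lt;code&gt;HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())&lt;/code&gt; returns the response body as a string, which the test can compare with an expected JSON file.&lt;/p&gt;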
&lt;h2&gt;AbstractFunctionalTest&lt;/h2&gt;
&lt;p&gt;All functional tests extend the &lt;code&gt;AbstractFunctionalTest&lt;/code&gt; class, which starts all required docker images.
In this example, it runs my microservice and the database it connects to.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AbstractFunctionalTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HTTP_PORT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DEBUG_PORT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5005&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;LOGGER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;LoggerFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Docker-Container&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newNetwork&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQLContainer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;postgres:14.2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUsername&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;username&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;password&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabaseName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;databaseName&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withNetwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withNetworkAliases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;postgres&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GenericContainer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;?&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;backendContainer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;postgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;backendContainer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ofNullable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;CONTAINER_VERSION&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ServiceContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;docker-repository/application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseGet&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ServiceContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Paths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;../&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withExposedPorts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HTTP_PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DEBUG_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withFixedExposedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DEBUG_PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DEBUG_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;SPRING_PROFILES_ACTIVE&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;functional&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withEnv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ADDITIONAL_JAVA_OPTIONS&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;-agentlib:jdwp=transport=dt_socket,&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;server=y,suspend=n,address=0.0.0.0:&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DEBUG_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withNetwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCreateContainerCmdModifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withLogConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Slf4jLogConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOGGER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Service&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;waitingFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forHttp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/actuator/health&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;forPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HTTP_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withStartupTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMinutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;backendContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;addShutdownHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;backendContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;postgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
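&lt;p&gt;Because both containers join the same &lt;code&gt;Network&lt;/code&gt;, the service reaches the database through the network alias (&lt;code&gt;postgres&lt;/code&gt;) and PostgreSQL's container-internal port, not through the random port mapped on the host. Below is a minimal sketch of the JDBC URL the service would use; the class and method names are hypothetical, and the port assumes the PostgreSQL default of 5432:&lt;/p&gt;

```java
/**
 * Sketch only: composes the JDBC URL a service container would use to reach
 * a PostgreSQL container on the same Docker network. Names are hypothetical.
 */
final class FunctionalDatasource {
    private FunctionalDatasource() {}

    /** PostgreSQL default port inside the container, not the host-mapped port. */
    static final int POSTGRES_INTERNAL_PORT = 5432;

    /** The network alias acts as the host name inside the shared network. */
    static String jdbcUrl(String networkAlias, String databaseName) {
        return "jdbc:postgresql://" + networkAlias + ":" + POSTGRES_INTERNAL_PORT + "/" + databaseName;
    }

    public static void main(String[] args) {
        // Values taken from the container definition in the article.
        System.out.println(jdbcUrl("postgres", "databaseName"));
    }
}
```

&lt;p&gt;One way to hand this value to the service, assuming it reads its datasource URL from the environment (Spring Boot maps &lt;code&gt;SPRING_DATASOURCE_URL&lt;/code&gt; to &lt;code&gt;spring.datasource.url&lt;/code&gt;), is an additional &lt;code&gt;withEnv&lt;/code&gt; call on the service container. Alternatively, the same URL can live in the configuration of the &lt;code&gt;functional&lt;/code&gt; profile.&lt;/p&gt;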

&lt;p&gt;As an alternative solution, I would suggest creating a &lt;a href="https://junit.org/junit5/docs/current/user-guide/#extensions"&gt;JUnit 5 extension&lt;/a&gt;.
In this case, we would use an annotation instead of inheritance, with the same logic.&lt;/p&gt;
&lt;h2&gt;Logging&lt;/h2&gt;
&lt;p&gt;When running the docker image with our service, it is critical to add logging. Without it, you lose visibility into errors. Don't forget to add a logger to the container code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withLogConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Slf4jLogConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOGGER&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;withPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Service&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Stopping images&lt;/h2&gt;
&lt;p&gt;One of the biggest advantages of the TestContainers library is the fact that there is a &lt;strong&gt;Ryuk&lt;/strong&gt; container that stops all other containers when the initial JVM process terminates.
It protects us from unwanted zombie containers (and networks, volumes) in the system. But if you run docker images from multiple maven modules, the Ryuk image can be too slow and the build can crash. That’s why I additionally specify &lt;code&gt;shutdownHook&lt;/code&gt;, which stops all docker images when test execution finishes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;addShutdownHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;backendContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;postgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Example of a functional test&lt;/h2&gt;
&lt;p&gt;An example functional test can look like this. The test method uses several helper methods to keep the test simple.
Helper methods are key to making the code readable.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AccountFunctionalTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AbstractFunctionalTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;shouldUpdateAccount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;throws&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;JSONException&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// given&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;createAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// when&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;updateAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// then&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatusCodeValue&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEqualTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HttpStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NO_CONTENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getAccount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;00000000-0000-0000-0000-000000000001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;readFromResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;get_account_dto.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;JSONAssert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;assertEquals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;JSONCompareMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LENIENT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;createAccount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;readFromResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;create_account_dto.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getTestRestTemplate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/accounts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;HttpMethod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getPostHeaders&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getStatusCodeValue&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEqualTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HttpStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CREATED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;updateAccount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getTestRestTemplate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/accounts/00000000-0000-0000-0000-000000000001&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;HttpMethod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PATCH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readFromResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;patch_account_dto.json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;getPatchHeaders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;getEtag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getTestRestTemplate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getForEntity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/accounts/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHeaders&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getETag&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;getAccount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;getTestRestTemplate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getForEntity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/accounts/{id}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBody&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;getPostHeaders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContentType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MediaType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;getPatchHeaders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContentType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MediaType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;application&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;merge-patch+json&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ETAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Advantages of functional tests&lt;/h2&gt;
&lt;p&gt;The biggest advantages of functional tests are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We force engineers to think about the &lt;a href="https://opensource.zalando.com/restful-api-guidelines/#api-first"&gt;API first principle&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We can test the service as a black box: with good functional test coverage, you can refactor deeply without changing the functional tests.&lt;/li&gt;
&lt;li&gt;They give developers a lot of confidence that the application works as expected, which I find very useful during code refactoring.&lt;/li&gt;
&lt;li&gt;You can be sure that your application is correctly packaged as a Docker image, so one more layer of the application is tested.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Disadvantages of functional tests&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Writing functional tests can be time-consuming, and when something doesn’t work as expected, debugging is much harder. On the other hand, well-written helper classes can speed this process up.&lt;/li&gt;
&lt;li&gt;Because functional tests run the service and its dependencies (such as databases and queues) as Docker images, the containers have to be started at least once, and this is usually slow. For example, PostgreSQL as a Docker image needs around 4 seconds to start on my machine, while Localstack, which emulates AWS components, can take much longer, even 20 seconds.&lt;/li&gt;
&lt;li&gt;In an ideal world, we would start new containers for each test, but that would be far too slow, so we start them once for all tests. Badly written functional tests can then interfere with each other. It is critical that tests use different object identifiers and leave a clean state behind.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Unit tests force developers to think about methods. Functional tests do the same for applications/components.&lt;/p&gt;
&lt;p&gt;I find functional tests to be an interesting concept. The TestContainers library makes it possible to use this concept inside the Java world.
It can be pretty expensive to implement, but it also gives you great confidence that a system still works during deep refactoring.&lt;/p&gt;
&lt;p&gt;Functional tests implemented in this way are not for everybody. I would suggest using them in systems where microservice contracts do not change very fast.
Despite the high cost of development, they give us a very high level of confidence that the delivered applications work as intended.&lt;/p&gt;
&lt;h2&gt;Code example&lt;/h2&gt;
&lt;p&gt;You can find examples of usages in my &lt;a href="https://gitlab.com/marek_hudyma/application-style"&gt;GitLab project&lt;/a&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Java"/><category term="Testing"/><category term="Docker"/><category term="Backend"/></entry><entry><title>GraphQL persisted queries and Schema stability</title><link href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html" rel="alternate"/><published>2022-02-17T00:00:00+01:00</published><updated>2022-02-17T00:00:00+01:00</updated><author><name>Boopathi Rajaa Nedunchezhiyan</name></author><id>tag:engineering.zalando.com,2022-02-17:/posts/2022/02/graphql-persisted-queries-and-schema-stability.html</id><summary type="html">&lt;p&gt;Learn how Zalando uses persisted queries, and how we define and think about different levels of stability of our GraphQL schema.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Persisted Queries&lt;/h2&gt;
&lt;p&gt;Persisted queries in GraphQL are like stored procedures in databases. To learn about Apollo's approach, automatic persisted queries, see &lt;a href="https://www.apollographql.com/docs/apollo-server/performance/apq/"&gt;their documentation&lt;/a&gt;. At Zalando, we took a different approach - &lt;strong&gt;to disable GraphQL in production&lt;/strong&gt;. It might sound counterintuitive at first - we have a GraphQL service, but we disable GraphQL in production - why?&lt;/p&gt;
&lt;p&gt;Let us go over how the system works and how it helps us maintain a stable schema.&lt;/p&gt;
&lt;h3&gt;Part 1: Build time persistence&lt;/h3&gt;
&lt;p&gt;During development of the web and apps, developers enjoy the power of GraphQL - automatic code and type generation, composing queries from multiple parts of the application, aggregating those queries into one optimized batched request, and so on.&lt;/p&gt;
&lt;p&gt;When the code in the UI layers (web and app) is merged to the main deployment branch, the build has one extra step - persisting the queries to the GraphQL service. The GraphQL service generates an ID for each query (the ID is just a hash of the query, normalized in terms of formatting and operation selection) and returns it to the UI layers to bundle with the built files.&lt;/p&gt;
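&lt;p&gt;As a rough illustration of ID generation, here is a minimal sketch. It assumes SHA-256 as the hash and treats "normalization" as whitespace collapsing only; &lt;code&gt;normalizeQuery&lt;/code&gt; and &lt;code&gt;queryId&lt;/code&gt; are hypothetical names, and a real service would more likely normalize the parsed AST and also handle operation selection.&lt;/p&gt;

```javascript
const { createHash } = require("node:crypto");

// Hypothetical normalization: collapse whitespace so that formatting
// differences do not change the ID.
function normalizeQuery(query) {
  return query.replace(/\s+/g, " ").trim();
}

// Hypothetical ID scheme: SHA-256 of the normalized query text, hex-encoded.
function queryId(query) {
  return createHash("sha256").update(normalizeQuery(query)).digest("hex");
}

const compact = "query productCard($id: ID!) { product(id: $id) { name } }";
const formatted = `
  query productCard($id: ID!) {
    product(id: $id) {
      name
    }
  }
`;

// Both spellings of the same query map to the same ID.
console.log(queryId(compact) === queryId(formatted)); // true
```

&lt;p&gt;Deriving the ID from the query content means that persisting the same query twice is idempotent, which suits a build step that runs on every merge.&lt;/p&gt;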
&lt;p&gt;When the actual query is used in production, the GraphQL service does not allow GraphQL queries, but rather only allows the query IDs that are persisted. So, instead of the request looking like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;POST /graphql&lt;/span&gt;

&lt;span class="err"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &amp;quot;query&amp;quot;: &amp;quot;query productCard($id: ID!) { product(id: $id) { name } }&amp;quot;,&lt;/span&gt;
&lt;span class="err"&gt;  &amp;quot;variables&amp;quot;: {&lt;/span&gt;
&lt;span class="err"&gt;    &amp;quot;id&amp;quot;: &amp;quot;12345&amp;quot;&lt;/span&gt;
&lt;span class="err"&gt;  }&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;it would look like this - with &lt;code&gt;id&lt;/code&gt; instead of &lt;code&gt;query&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;POST /graphql&lt;/span&gt;

&lt;span class="err"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &amp;quot;id&amp;quot;: &amp;quot;a1b2c3&amp;quot;,&lt;/span&gt;
&lt;span class="err"&gt;  &amp;quot;variables&amp;quot;: {&lt;/span&gt;
&lt;span class="err"&gt;    &amp;quot;id&amp;quot;: &amp;quot;12345&amp;quot;&lt;/span&gt;
&lt;span class="err"&gt;  }&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
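&lt;p&gt;The production-side behaviour can be sketched as a lookup that refuses raw queries. This is an assumption-laden illustration, not the actual service: the in-memory &lt;code&gt;Map&lt;/code&gt;, the &lt;code&gt;resolveRequest&lt;/code&gt; function and the error messages are all hypothetical.&lt;/p&gt;

```javascript
// Hypothetical in-memory store; the real storage is not described here.
const persistedQueries = new Map([
  ["a1b2c3", "query productCard($id: ID!) { product(id: $id) { name } }"],
]);

// Turns a production request body into an executable query, or rejects it.
function resolveRequest(body) {
  if (body.query !== undefined) {
    return { ok: false, error: "Raw GraphQL queries are disabled in production" };
  }
  const query = persistedQueries.get(body.id);
  if (query === undefined) {
    return { ok: false, error: "Unknown persisted query ID" };
  }
  return { ok: true, query, variables: body.variables };
}

console.log(resolveRequest({ id: "a1b2c3", variables: { id: "12345" } }).ok); // true
console.log(resolveRequest({ query: "{ product { name } }" }).ok); // false
```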

&lt;h3&gt;Part 2: Inspecting the persisted queries database&lt;/h3&gt;
&lt;p&gt;Now that we have a database of queries, we can inspect these persisted queries. Because we do not allow non-persisted queries in production, we know at any time which parts of the schema are used in production and which are not.&lt;/p&gt;
&lt;p&gt;We leverage these persisted queries to monitor and alert on each query individually. We can also tell whether a field can undergo a breaking change because it is no longer used, or was never used, in production.&lt;/p&gt;
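&lt;p&gt;One way to picture the "is this field still used?" check is the sketch below. The data is made up for illustration: it assumes each persisted query has already been reduced to a list of &lt;code&gt;Type.field&lt;/code&gt; names (in practice this would come from parsing the stored queries against the schema), and &lt;code&gt;unusedFields&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```javascript
// Hypothetical data: all schema fields, and the fields each persisted query selects.
const schemaFields = ["Product.name", "Product.price", "Product.legacyLabel"];
const persisted = [
  { id: "a1b2c3", fields: ["Product.name"] },
  { id: "d4e5f6", fields: ["Product.name", "Product.price"] },
];

// Fields that no persisted query selects are candidates for a breaking change.
function unusedFields(allFields, queries) {
  const used = new Set(queries.flatMap((q) => q.fields));
  return allFields.filter((field) => !used.has(field));
}

console.log(unusedFields(schemaFields, persisted)); // [ 'Product.legacyLabel' ]
```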
&lt;h2&gt;Schema Stability&lt;/h2&gt;
&lt;p&gt;As mentioned previously, our GraphQL schema covers a wide variety of use cases, and different parts of the schema can have different levels of stability as new product features are added.&lt;/p&gt;
&lt;p&gt;Every API's dream is a model that evolves well without breaking changes. In most cases, it is impossible to design everything that well up front in a changing product landscape. In other cases, the time we would spend deliberating over a model to get the best possible design is not justified by the time available to implement it end-to-end.&lt;/p&gt;
&lt;p&gt;The schema is a collaboration of the UI engineers and the GraphQL server maintainers. It should be possible for the UI engineers to prototype something fast and break it later. But once the schema is merged to the main deployment branch, the GraphQL server maintainers do not wish to have breaking changes. How do we solve this conflict in a neat way?&lt;/p&gt;
&lt;p&gt;Let's use branch deployments to satisfy this constraint, so the main branch stays clean. Though it looks simple and easy enough to understand, the mixing of branches across various projects soon becomes a nightmare in reality. At Zalando, we have microservices and the &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;GraphQL layer is an aggregator&lt;/a&gt; from multiple other services. So, maintaining multiple feature branches across 3-5 projects for 1 or 2 product features isn't going to help any developer or team move smoothly. The complexity increases non-linearly as we mix different features that must work together.&lt;/p&gt;
&lt;h3&gt;Draft status&lt;/h3&gt;
&lt;p&gt;In the previous section, we learned about the power of persisted queries controlled by the GraphQL layer - we know exactly which part of the schema is used in production. So, our solution to schema stability starts by leveraging how we handle persisted queries - by marking certain parts of the schema as &lt;strong&gt;not ready for production&lt;/strong&gt;, and preventing them from getting into the persisted queries database.&lt;/p&gt;
&lt;p&gt;For this we use &lt;a href="https://graphql.org/learn/queries/#directives"&gt;GraphQL directives&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@draft&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

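&lt;p&gt;The above directive lets us annotate certain fields in the schema as draft. At persistence time, we check whether the query contains a field marked as draft and, if so, refuse to persist it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GraphQLError&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;graphql&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;draftRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;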
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;parentType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getParentType&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// The lookup can come back empty (e.g. __typename, or a union parent type),&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// and astNode is only set for schemas built from SDL, so chain optionally.&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;parentType&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getFields&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.()[&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;isDraft&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;astNode&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;directives&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;directive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;directive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;draft&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isDraft&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reportError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GraphQLError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`Cannot persist draft field`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is an example implementation of the rule which you can pass to the &lt;a href="https://graphql.org/learn/validation/"&gt;GraphQL validation&lt;/a&gt;. The usage in the schema would look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;fancyNewField&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FancyNewType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@draft&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;FancyNewType&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;testField:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the above definition of a Product, when we add the new field &lt;code&gt;fancyNewField&lt;/code&gt;, we start by marking it as draft. Any attempt to persist a query that uses it fails.&lt;/p&gt;
&lt;p&gt;This brings us new opportunities and guarantees:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The field cannot be used in production&lt;/li&gt;
&lt;li&gt;We can break it at will, since we allow ONLY persisted queries in production&lt;/li&gt;
&lt;li&gt;We can merge it to the main branch (and even deploy it)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The draft status, combined with how our persisted queries work, improves the workflow. We can develop multiple features faster, experiment with them across different codebases, and still allow production usage only after we have stabilized the schema (removed the draft status) by testing it end-to-end.&lt;/p&gt;
&lt;h3&gt;Experimenting in Production&lt;/h3&gt;
&lt;p&gt;The draft status allows us to refuse to persist queries that we know are not ready for production usage. Once they are ready, we want to carry certain experiments forward to production, even though we may still be unsure about the stability of the schema. This is tricky, but it is a common and valid use case: certain product features go into production as an experiment and may then change slightly in form or structure.&lt;/p&gt;
&lt;p&gt;One obvious option is to remove the draft status. But then nothing restricts who can persist the field. For example, some other parts of the UI may start persisting those experimental fields, and we might not notice it until we inspect the queries. We certainly cannot break the schema once it is in production. So, how do we ensure that this experimental field is used only by the components that are part of the experiment?&lt;/p&gt;
&lt;p&gt;Here, we introduce two new directives which act as access control for fields in production. The &lt;code&gt;@component&lt;/code&gt; directive, and &lt;code&gt;@allowedFor&lt;/code&gt; directive:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@component(name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;String!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;QUERY&lt;/span&gt;
&lt;span class="k"&gt;directive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@allowedFor(componentNames:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[String!]!)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FIELD_DEFINITION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These two directives complement each other: one is used in the query and the other in the schema (here, on a &lt;code&gt;Field&lt;/code&gt; definition). We ask query authors to tag their queries with a component name, and at persist time we match that name against the names listed in the &lt;code&gt;allowedFor&lt;/code&gt; directive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Instead of component name, you can also use the operation name of the query itself.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;fancyProp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nd"&gt;@allowedFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;componentNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;web-product-card&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and a query for the product card:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;productCard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nc"&gt;web&lt;/span&gt;&lt;span class="err"&gt;-product-card&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;fancyProp&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This query would be allowed, while any other query that uses the field &lt;code&gt;fancyProp&lt;/code&gt; would fail to persist.&lt;/p&gt;
&lt;p&gt;The component and allowed-for directives / annotations allow us to take an experimental feature to production by restricting the usage to one component of the UI. This allows us to handle breaking changes more easily as we have a guarantee that only that part of the UI needs to update when we have a minor breaking change.&lt;/p&gt;
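&lt;p&gt;The persist-time check described above can be sketched in a few lines of Python. This is only an illustration: the function name &lt;code&gt;can_persist&lt;/code&gt;, the &lt;code&gt;ALLOWED_FOR&lt;/code&gt; mapping and the flat list of used fields are assumptions, and a real implementation would derive them from the parsed schema and query.&lt;/p&gt;

```python
# Hypothetical data: which components may use which restricted fields,
# as declared via @allowedFor in the schema.
ALLOWED_FOR = {
    "Product.fancyProp": {"web-product-card"},
}


def can_persist(component_name, used_fields, allowed_for=ALLOWED_FOR):
    """Return True if every restricted field is allowed for this component.

    component_name: value of the query's @component(name: ...) directive,
                    or None when the query is not tagged.
    used_fields:    fields selected by the query, e.g. ["Product.fancyProp"].
    """
    for field in used_fields:
        allowed = allowed_for.get(field)
        if allowed is None:
            continue  # unrestricted field, any query may persist it
        if component_name not in allowed:
            return False  # restricted field used by a component not on the list
    return True
```

&lt;p&gt;With this sketch, a query tagged &lt;code&gt;web-product-card&lt;/code&gt; that selects &lt;code&gt;fancyProp&lt;/code&gt; passes the check, while an untagged query or one tagged with another component name is rejected.&lt;/p&gt;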
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;When we first extend the GraphQL schema, we start with the &lt;code&gt;draft&lt;/code&gt; annotation. Then we promote new fields to restricted usage in production using the &lt;code&gt;allowedFor&lt;/code&gt; annotation. Once the schema has stabilized, we remove all of these annotations and have a non-breaking contract in the form of persisted queries.&lt;/p&gt;
&lt;p&gt;This is just the starting point of our exploration into saving developer time while keeping the GraphQL schema stable. It helps us evolve the schema rather than having to re-model it every single time.&lt;/p&gt;
&lt;p&gt;Depending on how you want to evolve the schema, and how you prefer to handle breaking changes, you can use these concepts and save precious time - by thinking about schema evolution in a non-destructive manner.&lt;/p&gt;
&lt;h2&gt;Related posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2023/10/understanding-graphql-directives-practical-use-cases-zalando.html"&gt;Understanding GraphQL Directives: Practical Use-Cases at Zalando&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/04/modeling-errors-in-graphql.html"&gt;Modeling Errors in GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/03/optimize-graphql-server-with-lookaheads.html"&gt;Optimize GraphQL Server with Lookaheads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="GraphQL"/><category term="APIs"/><category term="Backend"/></entry><entry><title>Principal Engineering at Zalando</title><link href="https://engineering.zalando.com/posts/2022/02/principal-engineering-at-zalando.html" rel="alternate"/><published>2022-02-10T00:00:00+01:00</published><updated>2022-02-10T00:00:00+01:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2022-02-10:/posts/2022/02/principal-engineering-at-zalando.html</id><summary type="html">&lt;p&gt;Learn how we leverage Principal Engineers to solve engineering challenges across Zalando.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Photo by Ian Schneider on Unsplash" src="https://engineering.zalando.com/posts/2022/02/images/career-path.jpg#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;In many companies, Senior Engineers who do not pursue Engineering Management end up in a dead end in terms of their career progression. At Zalando, we have had a career path for individual contributors since 2016. Senior Software Engineers can choose one of three possible career paths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Engineering Management&lt;/li&gt;
&lt;li&gt;Principal Engineering&lt;/li&gt;
&lt;li&gt;Technical Program Management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we detail how we leverage our senior individual contributors (Principal Engineers) throughout the company. In the last two years, we have observed an increasing number of companies emphasizing the value of career development for individual contributors. At this level, roles vary widely across companies, which makes it important to exchange different approaches to structuring this role.&lt;/p&gt;
&lt;h2&gt;Principal Engineering&lt;/h2&gt;
&lt;p&gt;Beyond the Senior Software Engineer level, Engineers have increasingly varying profiles depending on their career journey and unique expertise. &lt;em&gt;Depth-focused&lt;/em&gt; Principal Engineers are experts in their unique field (or more than one) whereas &lt;em&gt;breadth-focused&lt;/em&gt; Principal Engineers have an expert view across many domains and aspects of the software development life cycle with an ability to leverage unique expertise of others or when needed dive deep themselves.&lt;/p&gt;
&lt;p&gt;Up until 2021, we knew of no literature that spoke in detail about individual contributors above the senior level in tech companies. While software companies traditionally defined the role of an (Enterprise) Architect, the industry moved away from centralized architecture teams with hands-off individuals, as these were detached from the software development process and the necessary feedback loops to continuously adjust their approaches.
More often than not, delivery teams are empowered with technical decision making and conduct architectural design adhering to guardrails set by the department and the company (in our case, the &lt;a href="https://engineering.zalando.com/tags/tech-radar.html"&gt;Tech Radar&lt;/a&gt;). Principal Engineers support the team in the architectural design and help to maintain architectural integrity in the scope of the department and beyond.&lt;/p&gt;
&lt;p&gt;In March 2021, the book &lt;a href="https://staffeng.com/book"&gt;Staff Engineer: Leadership beyond the management track&lt;/a&gt; was published and added some common vocabulary about technical leadership and strategies for leading without formal authority. In addition, &lt;a href="https://staffeng.com/guides/staff-archetypes"&gt;four archetypes&lt;/a&gt; are listed and provide classification for the types of tasks Principal Engineers are most commonly working on. It is important to note that individuals may transition between these archetypes throughout their career depending on their strengths or the organizational needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tech Lead:&lt;/strong&gt; leads critical technical initiatives across the department and beyond. Partners with more than one team to support teams and individuals with delivery and coaching. Usually, Principal Engineers transitioning from a Senior Engineer role in a single team to a Principal Engineer acting across teams will go through this path. Initially, delivery includes high focus on coding alongside the team for high-impact and critical projects.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architect:&lt;/strong&gt; manages technical direction, quality, and approach within an area or project. Navigates different levels of leadership to address mid to long-term challenges.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solver:&lt;/strong&gt; digs deep into an area or problem, captures findings, aligns a set of recommendations. May apply both to short-term and long-term engagements and include driving the implementation of the recommended solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Right Hand:&lt;/strong&gt; extends an executive's attention and borrows their scope and authority to address certain problem areas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Principal Engineers at Zalando&lt;/h2&gt;
&lt;p&gt;Principal Engineers&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; at Zalando are senior individual contributors and role models for our Engineers. While they have no people management responsibilities, they are part of the leadership team. Principal Engineers report to a Manager of Managers (e.g. Head of Engineering) and assume the scope of the person they report to. Typically, this means they have 2-5 engineering teams that they support. Overall, Principal Engineers constitute around 4% of our total Engineering population.&lt;/p&gt;
&lt;p&gt;At Zalando, Principal Engineers are responsible for the architecture of the systems built within the department they're part of. They enable others and facilitate the design process across teams. They proactively initiate and execute process and technical improvements (e.g. scaling, technical debt reduction) across the department and beyond. Principal Engineers play a leading role in the full product development lifecycle. They consult Product and Engineering Management on projects early on, ensuring that technical considerations are factored into the project's scoping and planning processes.&lt;/p&gt;
&lt;p&gt;Our (usually &lt;em&gt;breadth-focused&lt;/em&gt;) Principal Engineers are leading the technical design for mid to large scale projects that their department is part of. This involves trade-off discussion, scope definition and negotiation with Product Designers and Product Managers as well as advice on structuring the projects into iterations optimized for reducing delivery risk, dependencies on teams, or ensuring quick time to market. Principal Engineers facilitate design discussions with the involved teams, delegate design or experimentation of well-defined parts of the design to other Engineers. They outline key design decisions and trade-offs and seek feedback through peer-reviews. To understand how their designs perform in production, they guide teams throughout the execution time of the project and support launch readiness through production readiness reviews and project launch coordination.&lt;/p&gt;
&lt;p&gt;At Zalando, we peer-review technical designs on different organizational levels, depending on their scope and complexity. During peer-reviews for Zalando group-wide projects requiring contributions from multiple business units, Principal Engineers support the project teams in finding the best solution for realizing the project's goals. Additionally, they provide teams with a different perspective on the suggested solutions and discuss trade-offs related to dependencies, relation to other pending or ongoing projects, and risks and challenges anticipated during project delivery. In this way, we ensure consistency of technical solutions, promote standardized solutions and practices, connect teams who solved similar problems with one another, and seek to incorporate learnings from other projects into future designs.&lt;/p&gt;
&lt;p&gt;Focus on operational excellence is key to delivering high-value customer experiences. Principal Engineers play a crucial role in scaling knowledge and raising the bar. They coach teams on resilience patterns, observability and facilitate weekly operational meetings where the operational performance of the system and past incidents are reviewed. They peer-review post-mortem documents and runbooks that the teams prepare as part of the incident response. Finally, they collaborate on alignment and implementation of cross-team action items.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Depth-focused&lt;/em&gt; Principal Engineers are most frequently part of platform or infrastructure teams. When compared with their peers, these individuals are also spending the highest share of their time writing code. They are thought-leaders influencing the long-term product roadmap. Through their network and collaborations with other Engineers across the company (e.g. via language guilds), they look for opportunities to scale the adoption of existing infrastructure solutions or initiate new ones, with the focus on making our teams or systems more efficient (e.g. shared libraries, application templates, operational guidance or patterns). Lastly, they contribute to setting Engineering Standards and support others in technology selection, evaluation, and adoption as part of our &lt;a href="https://engineering.zalando.com/tags/tech-radar.html"&gt;Tech Radar process&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Principal Engineers also make important contributions that go beyond core engineering tasks. They are bar raisers during the interview process, mentor other Engineers, and play a key role in our engineering communities. This way, they have opportunities to coach other engineers, role model our culture, and help identify and develop promising talent.&lt;/p&gt;
&lt;h2&gt;Principal Engineering Community&lt;/h2&gt;
&lt;p&gt;Principal Engineers form a company-wide community of experts, who support one another in their challenges and journey at Zalando. They self-organize both company-wide and per business unit in order to discuss and drive technical topics that they or their leadership consider as important to meet the business growth and operational excellence of Zalando's technical systems. The Community provides expertise around know-how, patterns, solutions, and the approach to rollout of these in teams. Further, Principal Engineers support one another in order to continuously upskill themselves and others, through mentorship, coaching, or pairing up on tasks.&lt;/p&gt;
&lt;p&gt;Engineering-wide initiatives driven by the community are documented in a task list, which in addition to providing transparency on the community efforts, serves as an opportunity to (i) highlight tasks that any Engineer at Zalando can contribute to, or (ii) for anyone to request support on an engineering topic. Similar task lists exist in a smaller scope and provide ways to involve the Engineering talent from these organizations.&lt;/p&gt;
&lt;h2&gt;Helping Principal Engineers with their new role&lt;/h2&gt;
&lt;p&gt;The majority of our Principal Engineers have been promoted from within Zalando. Some of our senior individual contributors have switched career tracks from Engineering Management back to individual contributors. As the principal engineering role is tailored to our specific needs and organizational structure, it was important for us to set up newcomers to the role for success.&lt;/p&gt;
&lt;p&gt;A few Principal Engineers teamed up and compiled a guide to beginning the journey of a Principal Engineer and how to structure the first 100 days in this role. This guide has proven to be helpful for our Principal Engineers, their Managers, and for colleagues who are planning their own career development towards the individual contributor track. In addition to the guide, our more seasoned Principal Engineers provide mentorship to other Principal Engineers.&lt;/p&gt;
&lt;p&gt;We also realize that the role of a Principal Engineer may not be a fitting career opportunity for every Senior Engineer. Principal Engineering is not just a label for the best Senior Engineers. In the end, it's a technical leadership role with strong emphasis on cross-team coordination and communication skills, and it requires the ability to lead without authority. The initiatives that an individual is driving tend to have a much longer time horizon for the impact to become visible and are often realized through the hands of others. This delayed gratification can negatively affect motivation, especially for problem-solvers with deep expertise who draw their energy from solving large-scale problems in fast iteration cycles (e.g. as part of incident response). At Zalando, we leverage stretch assignments as development opportunities to allow our colleagues to try out aspects of the Principal Engineer role and verify whether it's a good fit for them while allowing them to easily step back to their prior activities otherwise.&lt;/p&gt;
&lt;h2&gt;Managing Principal Engineers&lt;/h2&gt;
&lt;p&gt;Some of our Engineering Managers have neither worked with nor managed Principal Engineers before.
This can lead to situations where the potential of these individuals is under-leveraged. Individual contributors at this level need a degree of flexibility and a share of their time to explore ways of addressing the problem areas they have identified. They also need the necessary sponsorship and support in change management for solutions that are introduced within the department and beyond.&lt;/p&gt;
&lt;p&gt;To address this challenge at scale, we compiled guidance for our managers on how to support and effectively work with Principal Engineers. This guide includes a short checklist allowing organizational leaders to easily verify whether they have structured the ways of working and expectations towards the Principal Engineers in the right way. This includes ensuring that the Principal Engineer is part of leadership rounds providing the right context about the department's priorities and upcoming projects, creating the necessary connections between key stakeholders and Heads of Product, and also includes examples of initiatives that Principal Engineers have driven at Zalando.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;In this post we have provided insights into the key aspects of the role of a Principal Engineer at Zalando. While this is not an extensive description of the challenges and intricacies of the role, we hope that the information shared in this post will shed some light on the opportunities that the individual contributor path provides. Likewise, we will be happy if it serves as an inspiration for you to consider putting stronger focus on the individual contributor career path in your company.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;There is no consistency in the industry for naming Senior+ roles. Some companies use (i) Senior, Staff, Senior Staff, Principal (e.g. Spotify), whereas others go for (ii) Senior, Principal, Senior Principal, ..., Distinguished Engineer (e.g. Amazon). We chose a naming scheme based on the second model.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Leadership"/><category term="Management"/><category term="Tech Culture"/><category term="Culture"/></entry><entry><title>Releasing Connexion to the Community</title><link href="https://engineering.zalando.com/posts/2022/02/releasing-connexion-python-framework-to-the-oss-community.html" rel="alternate"/><published>2022-02-07T00:00:00+01:00</published><updated>2022-02-07T00:00:00+01:00</updated><author><name>Henning Jacobs</name></author><id>tag:engineering.zalando.com,2022-02-07:/posts/2022/02/releasing-connexion-python-framework-to-the-oss-community.html</id><summary type="html">&lt;p&gt;After 6 years and 3.9k GitHub stars, we are releasing Connexion, our API-first Python framework, to the Open Source community.&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/zalando/connexion/"&gt;Connexion&lt;/a&gt; is a Python framework that automagically handles HTTP requests based on &lt;a href="https://www.openapis.org/"&gt;OpenAPI specification&lt;/a&gt; (formerly known as Swagger Spec) of your API described in &lt;a href="https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md#format"&gt;YAML format&lt;/a&gt;. Connexion allows you to write an OpenAPI specification, then maps the endpoints to your Python functions; this makes it unique, as many tools generate the specification based on your Python code. You can describe your REST API in as much detail as you want; then Connexion guarantees that it will work as you specified.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After 6 years and 3.9k GitHub stars, Zalando is now releasing Connexion to the community. What does this mean? Connexion's repository will move from Zalando's GitHub organization to the &lt;a href="https://github.com/spec-first"&gt;new community-owned "spec-first" organization&lt;/a&gt;. This repository transfer highlights changes in Connexion's maintainer structure. Connexion's license (Apache 2.0) and &lt;a href="https://pypi.org/project/connexion/"&gt;release package on PyPI&lt;/a&gt; will not change.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Connexion on GitHub.com" src="https://engineering.zalando.com/posts/2022/02/images/connexion-github.png#center"&gt;&lt;/p&gt;
&lt;p&gt;Connexion was a huge enabler for Zalando to move towards &lt;a href="https://opensource.zalando.com/restful-api-guidelines/#api-first"&gt;API-first&lt;/a&gt; in 2015, i.e. to write the API specification before implementing the backend code. While Python is a first class citizen in Zalando's tech landscape (see our &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Tech Radar&lt;/a&gt;), Zalando's customer-facing production software is usually implemented in modern JVM languages such as &lt;a href="https://engineering.zalando.com/tags/kotlin.html"&gt;Kotlin&lt;/a&gt;, &lt;a href="https://engineering.zalando.com/tags/java.html"&gt;Java&lt;/a&gt;, or &lt;a href="https://engineering.zalando.com/tags/scala.html"&gt;Scala&lt;/a&gt;. Maintenance of Connexion stalled with core developers changing focus and nobody new stepping up within Zalando. Thankfully, &lt;a href="https://blog.ml6.eu/why-we-decided-to-help-maintain-connexion-c9f449877083"&gt;ML6 took over&lt;/a&gt; most of the regular maintenance from Zalando. We are very glad to have found new active maintainers. Special thanks go to my colleague &lt;a href="https://github.com/jmcs"&gt;João&lt;/a&gt; as the original author, &lt;a href="https://github.com/rafaelcaricio"&gt;Rafael&lt;/a&gt; for his significant contributions, &lt;a href="https://github.com/RobbeSneyders"&gt;Robbe&lt;/a&gt; and &lt;a href="https://github.com/Ruwann"&gt;Ruwan&lt;/a&gt; from ML6 for taking over, and to &lt;a href="https://github.com/dtkav"&gt;Daniel&lt;/a&gt; for donating the "spec-first" organization. The "spec-first" organization will serve as a company-neutral new home for this awesome open source project. The project is what it is today because of its community. Big thanks to all &lt;a href="https://github.com/zalando/connexion/graphs/contributors"&gt;165 contributors&lt;/a&gt; and to the numerous users of Connexion out there!&lt;/p&gt;
&lt;p&gt;Moving Connexion out of Zalando's GitHub organization won't affect how the project is used within Zalando. With JVM-based languages powering most of Zalando's Fashion Store, Connexion is used for low-traffic services and tools in various departments. For example, Connexion powers parts of our internal Continuous Delivery Platform, serves metadata for our internal realtime business monitoring platform, exposes APIs for our inhouse machine learning platform, and is used in our pricing department. Connexion has gained some popularity among Zalando's data science community as &lt;a href="https://engineering.zalando.com/tags/python.html"&gt;Python&lt;/a&gt; is the most commonly used language for data scientists.&lt;/p&gt;
&lt;p&gt;Personally, I'm very happy to see Connexion graduate and have it released to a new community-owned home. I will follow its path into the future and try to be helpful when time allows.&lt;/p&gt;
&lt;p&gt;If you are interested in learning more about Connexion, check out &lt;a href="https://connexion.readthedocs.io/"&gt;the documentation&lt;/a&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Python"/><category term="Open Source"/><category term="Backend"/></entry><entry><title>Utilizing Amazon DynamoDB and AWS Lambda for Asynchronous Event Publication</title><link href="https://engineering.zalando.com/posts/2022/02/transactional-outbox-with-aws-lambda-and-dynamodb.html" rel="alternate"/><published>2022-02-03T00:00:00+01:00</published><updated>2022-02-03T00:00:00+01:00</updated><author><name>Matthias Michael Döpmann</name></author><id>tag:engineering.zalando.com,2022-02-03:/posts/2022/02/transactional-outbox-with-aws-lambda-and-dynamodb.html</id><summary type="html">&lt;p&gt;We demonstrate an implementation of the Transactional Outbox pattern put into practice on AWS with Amazon DynamoDB, AWS DynamoDB Streams and AWS Lambda.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In our Microservices Architecture, services communicate both asynchronously via events and synchronously via REST calls.
Frequently, a synchronous REST call modifies data in a data store and emits an event based on the changes made.
Publishing data change events can be decoupled from performing the changes in the data store in order to increase the resilience of the application.&lt;/p&gt;
&lt;p&gt;We will show how this is achieved with the &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html"&gt;Transactional Outbox&lt;/a&gt; pattern, presenting a cloud native approach utilizing Amazon DynamoDB, AWS DynamoDB Streams and AWS Lambda.&lt;/p&gt;
&lt;h2&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;In Zalando Payments we have a service, called Order Store, that stores payment related data for a given order in a DynamoDB table.
Updating this data happens via a synchronous REST call.
Changes to the stored payment information need to be propagated to other services too, which is realized by sending events to &lt;a href="https://github.com/zalando/nakadi"&gt;Nakadi&lt;/a&gt;, Zalando's message bus.&lt;/p&gt;
&lt;p&gt;&lt;img alt="coupled" src="https://engineering.zalando.com/posts/2022/02/images/coupled_diagram.png"&gt;&lt;/p&gt;
&lt;p&gt;Initially, the service created/updated data in DynamoDB and then sent events to Nakadi to inform other services about the change in payment information.
This meant the service had two downstream dependencies to complete the request, namely the database and the message bus.
As the availability of a service is at most the product of the availabilities of its dependencies, the more dependencies a service has, the lower its own availability.
Let's assume DynamoDB and the message bus have availabilities of 99.9% each.
Thus, the maximum availability for the service is &lt;code&gt;99.9% * 99.9% = 99.8%&lt;/code&gt;.&lt;/p&gt;
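&lt;p&gt;The arithmetic above can be reproduced with a short Python sketch. The function name &lt;code&gt;serial_availability&lt;/code&gt; is made up for this illustration, and the 99.9% figures are the assumed values from the text, not measurements.&lt;/p&gt;

```python
from math import prod


def serial_availability(availabilities):
    """Upper bound for a service that needs all of its dependencies."""
    return prod(availabilities)


both = serial_availability([0.999, 0.999])  # DynamoDB and the message bus
only_db = serial_availability([0.999])      # DynamoDB only

print(f"{both:.4%}")     # prints 99.8001%
print(f"{only_db:.4%}")  # prints 99.9000%
```

&lt;p&gt;Dropping the message bus from the synchronous path therefore raises the upper bound from roughly 99.8% back to 99.9%.&lt;/p&gt;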
&lt;p&gt;Since we aim for the highest availability possible, we reduce the dependencies to DynamoDB alone, which raises the availability of the service.
After explaining the transactional outbox pattern, we will present a concrete solution, the technologies it comprises, and how we decoupled the process.&lt;/p&gt;
&lt;h2&gt;Transactional Outbox&lt;/h2&gt;
&lt;p&gt;Let us look at the underlying concept of how to decouple data update and event publication.
The pattern we are describing here is known as Transactional Outbox.
Our goal is a service that, when called synchronously via a REST API, creates, deletes or updates a data store entry and also propagates the change to other services via messaging.
However, publishing the message is decoupled from updating the data store.&lt;/p&gt;
&lt;p&gt;&lt;img alt="transactional-outbox-drawing" src="https://engineering.zalando.com/posts/2022/02/images/outbox_diagram.png"&gt;&lt;/p&gt;
&lt;p&gt;In this drawing we provide the setup of the environment.
Our flow consists of 4 steps, where the starting point is a synchronous call that triggers further actions.&lt;/p&gt;
&lt;h3&gt;Change Entry and Populate Outbox&lt;/h3&gt;
&lt;p&gt;After the call is received, the service triggers a change for an entry in the data store.
This is denoted with &lt;code&gt;1&lt;/code&gt;.
The actions that trigger a change consist of Create, Update or Delete, as a Read operation would not alter any data.
Modifying data in the data store is transactional, and as soon as the transaction completes successfully, the service returns a success response code to its caller.&lt;/p&gt;
&lt;p&gt;As part of the transaction in the data store, the actual data change is written to an outbox.
This is depicted in step &lt;code&gt;1.5&lt;/code&gt;.
The outbox can be thought of as an append-only log.
Each data change operation in the data store will produce an entry in the outbox.&lt;/p&gt;
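&lt;p&gt;A minimal sketch of steps &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;1.5&lt;/code&gt;, using an in-memory SQLite database purely as a stand-in for the data store. The table names, columns and the &lt;code&gt;update_order&lt;/code&gt; function are assumptions made for illustration; the point is only that the data change and the outbox entry are written in one transaction.&lt;/p&gt;

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, payment_status TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, consumed INTEGER DEFAULT 0)"
)


def update_order(conn, order_id, payment_status):
    """Change the entry and append to the outbox atomically."""
    change = {"id": order_id, "payment_status": payment_status}
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, payment_status) VALUES (?, ?)",
            (order_id, payment_status),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps(change),),
        )


update_order(conn, "order-1", "PAID")
```

&lt;p&gt;Because both statements share one transaction, there is never a committed data change without a matching outbox entry, or the other way round.&lt;/p&gt;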
&lt;h3&gt;Consume Outbox and Publish Event&lt;/h3&gt;
&lt;p&gt;The transaction in the data store was successful and the data entry got updated or created.
Thus, a new entry in the outbox exists.
A so-called message relay reads that entry from the outbox.
To become aware of the new entry, the message relay is notified by the outbox and, upon notification, consumes the entry.
This is depicted with number &lt;code&gt;2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Upon consumption, the message relay extracts the data, transforms it to an event and publishes it, marked in the diagram with &lt;code&gt;2.5&lt;/code&gt;.
Only after successful publication is the entry marked as consumed.&lt;/p&gt;
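&lt;p&gt;The relay side, steps &lt;code&gt;2&lt;/code&gt; and &lt;code&gt;2.5&lt;/code&gt;, can be sketched as follows. The outbox is modelled as a plain Python list and the event shape is hypothetical; what the sketch shows is that an entry is marked as consumed only after it has been published.&lt;/p&gt;

```python
def relay_outbox(outbox, publish):
    """Publish unconsumed outbox entries, oldest first.

    outbox:  list of dicts with "payload" and "consumed" keys (hypothetical shape).
    publish: callable that sends one event and raises on failure.
    Returns the number of entries published in this run.
    """
    published = 0
    for entry in outbox:
        if entry["consumed"]:
            continue
        # Transform the outbox entry into an event (illustrative event shape).
        event = {"type": "data-change", "data": entry["payload"]}
        try:
            publish(event)
        except Exception:
            break  # leave this and later entries unconsumed for the next run
        entry["consumed"] = True  # mark only after successful publication
        published += 1
    return published
```

&lt;p&gt;If publishing fails, the entry stays in the outbox and is picked up again on the next run, so consumers may see an event more than once but do not miss one.&lt;/p&gt;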
&lt;h2&gt;Concrete Solution&lt;/h2&gt;
&lt;p&gt;After describing the pattern we now want to present the concrete solution.
In order to decouple the asynchronous event emission from the synchronous process we take advantage of various cloud services AWS has to offer.&lt;/p&gt;
&lt;p&gt;The following diagram shows the complete flow from a synchronous REST API call to the publication of the Nakadi event following the new approach:&lt;/p&gt;
&lt;p&gt;&lt;img alt="concrete-solution-drawing" src="https://engineering.zalando.com/posts/2022/02/images/concrete.png"&gt;&lt;/p&gt;
&lt;h3&gt;DynamoDB Streams&lt;/h3&gt;
&lt;p&gt;Recently, DynamoDB was extended with a Change Data Capture implementation – DynamoDB Streams.
Once &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html"&gt;activated&lt;/a&gt;, as soon as an item in the DynamoDB table is changed (added, updated or deleted) a corresponding &lt;em&gt;dataset&lt;/em&gt; is sent to the stream.
In our case this dataset contains the &lt;em&gt;old image&lt;/em&gt;, containing the table item before the change, and the &lt;em&gt;new image&lt;/em&gt;, containing the table item after the change.
It can be configured which &lt;em&gt;images&lt;/em&gt; AWS exposes to the DynamoDB stream.
With both these images we are now able to assemble a corresponding Nakadi event using AWS Lambda.&lt;/p&gt;
&lt;h3&gt;AWS Lambda&lt;/h3&gt;
&lt;p&gt;The trigger for our AWS Lambda is a DynamoDB Stream item.
We chose Python for our implementation as it is more lightweight compared to Java.
The lambda function will receive the item containing the &lt;em&gt;old&lt;/em&gt; and &lt;em&gt;new image&lt;/em&gt;.
Then it will assemble the data change event, which contains the complete item after its change as well as a patch node containing the diff.
As a last step the assembled event is published to Nakadi.&lt;/p&gt;
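&lt;p&gt;A minimal sketch of the assembly step could look as follows. It assumes the &lt;em&gt;old&lt;/em&gt; and &lt;em&gt;new image&lt;/em&gt; have already been converted into plain Python dictionaries, and the function names and event field names (&lt;code&gt;data_op&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;patch&lt;/code&gt;) are illustrative rather than the exact shape of our events.&lt;/p&gt;

```python
def diff_images(old_image, new_image):
    """Return a flat diff: keys added, removed, or changed between the images."""
    patch = {}
    for key in sorted(set(old_image) | set(new_image)):
        before = old_image.get(key)
        after = new_image.get(key)
        if before != after:
            patch[key] = {"old": before, "new": after}
    return patch


def assemble_event(old_image, new_image):
    """Build a data change event: full item after the change plus a patch node."""
    if not old_image:
        op = "C"  # item was created
    elif not new_image:
        op = "D"  # item was deleted
    else:
        op = "U"  # item was updated
    return {
        "data_op": op,
        "data": new_image or old_image,
        "patch": diff_images(old_image or {}, new_image or {}),
    }


event = assemble_event(
    {"id": "42", "status": "CREATED"},
    {"id": "42", "status": "SHIPPED", "carrier": "example-carrier"},
)
```

&lt;p&gt;Note that this simple diff treats a missing attribute and an attribute set to &lt;code&gt;None&lt;/code&gt; as the same thing, which may or may not be acceptable for a given consumer.&lt;/p&gt;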
&lt;p&gt;In case the publication to Nakadi fails, e.g. due to timeouts, the request is retried.
If all the retries fail, we use an AWS SQS queue as fallback storage, which is explained in the next section.
This also means that we do not guarantee that the events are published in the correct order.&lt;/p&gt;
&lt;h3&gt;AWS SQS &amp;amp; Kubernetes CronJob&lt;/h3&gt;
&lt;p&gt;AWS SQS is a message queue service.
When creating a new AWS Lambda function it already comes with an AWS SQS queue attached as a dead letter queue.
This queue ensures that no events are lost in case of a failed publication or, even worse, a temporary outage.
Whenever Nakadi event publishing fails, the event is sent to the dead letter queue.
Event publishing is retried with exponential backoff to minimize the number of unpublished events that end up in the dead letter queue.
To retry sending the queued events at regular intervals, we created a Kubernetes CronJob.
The CronJob simply runs the same Python code as the AWS Lambda and tries to publish the events to Nakadi again.
Once publication succeeds, the event is removed from the SQS queue.&lt;/p&gt;
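&lt;p&gt;The retry and fallback logic can be sketched roughly like this. The function names, the number of attempts and the delays are assumptions for illustration; &lt;code&gt;publish&lt;/code&gt; and &lt;code&gt;fallback&lt;/code&gt; stand in for the Nakadi client and the SQS client, and a plain Python list plays the role of the queue.&lt;/p&gt;

```python
import time


def publish_with_retry(event, publish, fallback, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try to publish; back off exponentially; hand over to fallback when exhausted."""
    for attempt in range(attempts):
        try:
            publish(event)
            return True
        except Exception:
            if attempt + 1 == attempts:
                break
            sleep(base_delay * 2 ** attempt)  # delay doubles after every failure
    fallback(event)  # e.g. send the event to the dead letter queue
    return False


def drain(queue, publish):
    """What the periodic job does: re-publish queued events, keep the failures."""
    remaining = []
    for event in queue:
        try:
            publish(event)
        except Exception:
            remaining.append(event)
    return remaining
```

&lt;p&gt;Since an event that went through the fallback queue is published later than events that succeeded right away, this sketch shows the same ordering caveat mentioned above.&lt;/p&gt;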
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We successfully decoupled synchronous data changes from eventually consistent event publishing.
By reducing dependencies, we increased the resiliency of our service.
Besides improving the architecture, the team also got to work with DynamoDB Streams and AWS Lambda for the first time, which was a great opportunity to learn about AWS technologies.
Having implemented this pattern, we are working with our infrastructure teams to offer an implementation of this pattern to all teams at Zalando.
We already have an implementation of the Transactional Outbox for &lt;a href="https://engineering.zalando.com/tags/postgresql.html"&gt;PostgreSQL&lt;/a&gt;, managed centrally via a Kubernetes operator.&lt;/p&gt;</content><category term="Zalando"/><category term="AWS"/><category term="Microservices"/><category term="Backend"/></entry><entry><title>Maps with PostgreSQL and PostGIS</title><link href="https://engineering.zalando.com/posts/2021/12/maps-with-postgresql-and-postgis.html" rel="alternate"/><published>2021-12-02T00:00:00+01:00</published><updated>2021-12-02T00:00:00+01:00</updated><author><name>Felix Kunde</name></author><id>tag:engineering.zalando.com,2021-12-02:/posts/2021/12/maps-with-postgresql-and-postgis.html</id><summary type="html">&lt;p&gt;Learn how to stream geodata from PostGIS to your browser&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Maps with PostgreSQL and PostGIS - map" src="https://engineering.zalando.com/posts/2021/12/images/postgis-maps-preview.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;This blog post explains to you which tools to use to serve geospatial data from a database system (PostgreSQL) to your web browser. All you need is a database server for the data, a web map application for the frontend and a small service in between to transfer user requests. I will also show you how these components can run on top of Kubernetes in a highly available cloud native fashion.&lt;/p&gt;
&lt;h2&gt;PostGIS - a spatial database&lt;/h2&gt;
&lt;p&gt;As a first step the dataset in your database you want to put on a map must include a geospatial representation: Two coordinates or an address. For Zalando it might be interesting to know the demand hotspots across Europe e.g. by joining the zip codes of shipments with administrative boundaries which are often &lt;a href="https://ec.europa.eu/eurostat/web/gisco/"&gt;available&lt;/a&gt; as Open Data. The database must support geo data types and indexes to answer spatial queries. At Zalando, the open source database system PostgreSQL is used by many teams and it offers a geospatial component called &lt;a href="https://postgis.net/"&gt;PostGIS&lt;/a&gt;. It is used for example to allow our customers to select the nearest pickup and return points. Over the years, PostGIS has grown a strong community and is widely accepted in the industry as the de facto standard to manage geospatial data. There are many different tools and interfaces available to import data in various formats into PostGIS and access it from your favorite data science environment - be it &lt;a href="https://jupyter-tutorial.readthedocs.io/de/latest/data-processing/postgresql/postgis/index.html"&gt;Jupyter&lt;/a&gt;, &lt;a href="https://www.r-bloggers.com/2019/04/interact-with-postgis-from-r/"&gt;R&lt;/a&gt; or &lt;a href="https://help.tableau.com/current/pro/desktop/en-gb/maps_spatial_sql.htm"&gt;Tableau&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Bring the map to your browser&lt;/h2&gt;
&lt;p&gt;Creating a web mapping app is simple with tools like &lt;a href="https://leafletjs.com/examples.html"&gt;Leaflet.js&lt;/a&gt;. For the basemap we can use &lt;a href="https://www.openstreetmap.org/"&gt;OpenStreetMap&lt;/a&gt;, the wiki-style free alternative to commercial map providers. Adding extra layers with e.g. over 100,000 polygons on top of it would slow down map navigation a lot. Splitting the data into a grid of tiles and loading only the ones of the area you are currently looking at on your screen is what makes a browser map fast and responsive. Until recently, a middleware was usually required to produce these tile structures. That middleware had to consider not only the grid creation, but also take care of different zoom levels. When you zoom out the geometry of streets, rivers, forests etc. should be coarser and styled differently - some details should be even left out at a smaller scale for the sake of readability.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Loading data as vector tiles" src="https://engineering.zalando.com/posts/2021/12/images/nuts3_tiles.gif#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Streaming spatial data from PostGIS as vector tiles into the browser map&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The good news is, these days PostGIS can take over most of the middleware’s job and produce map tiles for you. You only need a lightweight server between the frontend and the database that takes in requests from the map and sends queries to your spatial database to produce the tiles you want. &lt;a href="https://github.com/CrunchyData/pg_tileserv"&gt;pg_tileserv&lt;/a&gt; is such a solution. You configure the table name that contains the spatial data and that’s it. If you want to learn more about vector tiles I can recommend &lt;a href="https://www.youtube.com/watch?v=t8eVmNwqh7M"&gt;this talk&lt;/a&gt; by Paul Ramsey, one of the PostGIS authors.&lt;/p&gt;
&lt;h2&gt;Running it on Kubernetes&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://github.com/zalando/postgres-operator"&gt;Postgres Operator&lt;/a&gt;, created by my team at Zalando, provides you with an easy creation and update path for PostgreSQL servers running on top of Kubernetes. Engineers only have to write a short YAML manifest which can look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid.zalan.do/v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Postgresql&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;numberOfInstances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;14&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;10Gi&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;preparedDatabases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;map_db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;defaultUsers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;extensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;postgis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;geo&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;schemas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;geo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The operator will notice the new manifest and create all the necessary resources in Kubernetes - a stateful set with 2 database pods, services to connect to the database, secrets for authentication etc. By specifying &lt;code&gt;preparedDatabases&lt;/code&gt;, you tell the operator to create a new database with schemas as well as a set of database roles (reader, writer, owner) with default access privileges assigned. Plus, you can list extensions to be created in a certain schema. The Postgres cluster is based on the &lt;a href="https://github.com/zalando/spilo"&gt;Spilo&lt;/a&gt; docker image which includes the PostGIS extension.&lt;/p&gt;
&lt;p&gt;To import arbitrary geodata formats I can recommend &lt;a href="https://subscription.packtpub.com/book/application_development/9781788299329/1/ch01lvl1sec14/importing-and-exporting-data-with-the-ogr2ogr-gdal-command"&gt;GDAL’s ogr2ogr&lt;/a&gt; command-line tool. In my case I’ve imported the latest &lt;a href="https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/nuts"&gt;European NUTS polygons&lt;/a&gt; of 2021 and the &lt;a href="https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat"&gt;1km² population grid&lt;/a&gt; of 2018 by Geostat.&lt;/p&gt;
&lt;p&gt;To roll out &lt;code&gt;pg_tileserv&lt;/code&gt; on Kubernetes I’m using a deployment resource. To run it within the Zalando infrastructure I had to put the tileserver behind our oauth2 proxy under a dedicated &lt;code&gt;/tileserver&lt;/code&gt; base path, which required me to override &lt;code&gt;pg_tileserv&lt;/code&gt;’s default configuration. &lt;code&gt;pg_tileserv&lt;/code&gt; is configured via &lt;a href="https://github.com/CrunchyData/pg_tileserv/blob/master/config/pg_tileserv.toml.example"&gt;toml&lt;/a&gt; files, so I’ve put the settings into a config map and mounted it into the container. Here you can see the manifest (leaving out the resources section in this example):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ConfigMap&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver-config&lt;/span&gt;
&lt;span class="nt"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;pg_tileserv.toml&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="no"&gt;BasePath = &amp;quot;/tileserver/&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="no"&gt;Debug = true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;apps/v1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Deployment&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;matchLabels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;application&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;application&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;pramsey/pg_tileserv:latest&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;containerPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;7800&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;TCP&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;volumeMounts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;configs&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;mountPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/config&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DATABASE_URL&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;postgresql://map_db_reader_user@acid-geo:5432/map_db&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;PGPASSWORD&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;valueFrom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="nt"&gt;secretKeyRef&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;map_db_reader_user.acid-geo.credentials&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;password&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;configs&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;configMap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;acid-geo-tileserver-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Another deployment is needed to serve our Leaflet application, e.g. using a simple Ubuntu docker image with nginx running.&lt;/p&gt;
&lt;h2&gt;Dynamic mapping layers&lt;/h2&gt;
&lt;p&gt;The web map requests tiles from &lt;code&gt;pg_tileserv&lt;/code&gt;, which sends back protobuf files. In our case, a request looks like this - with &lt;code&gt;geo.boundaries_europe&lt;/code&gt; being the schema-qualified table name:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;/tileserver/geo.boundaries_europe/&lt;span class="o"&gt;{&lt;/span&gt;z&lt;span class="o"&gt;}&lt;/span&gt;/&lt;span class="o"&gt;{&lt;/span&gt;x&lt;span class="o"&gt;}&lt;/span&gt;/&lt;span class="o"&gt;{&lt;/span&gt;y&lt;span class="o"&gt;}&lt;/span&gt;.pbf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Z is the zoom level, and X and Y are the column and row of the requested tile in the tile grid at that zoom level. Leaflet’s &lt;a href="https://leaflet.github.io/Leaflet.VectorGrid/vectorgrid-api-docs.html#vectorgrid"&gt;VectorGrid&lt;/a&gt; class can be used to &lt;a href="https://blog.crunchydata.com/blog/crunchy-spatial-tile-serving"&gt;display the vector tiles&lt;/a&gt; returned from PostGIS. For the boundaries, the result can look like the first picture above. A vector tile does not have to consist solely of geometry. Multiple thematic attributes can be included, making it possible to change the style on the fly without sending another request to the database. &lt;code&gt;pg_tileserv&lt;/code&gt; will take information from all columns it finds in a spatial table.&lt;/p&gt;
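&lt;p&gt;To make the tile addressing more tangible, here is a small sketch of how a longitude/latitude pair maps to tile indices in the common XYZ (Web Mercator) tiling scheme, where x and y count tile columns and rows at zoom level z. The function names and the &lt;code&gt;example.org&lt;/code&gt; base URL are made up for this example; Leaflet does this calculation for you.&lt;/p&gt;

```python
import math


def lonlat_to_tile(lon, lat, z):
    """Return the (x, y) tile indices containing a WGS84 point at zoom level z."""
    n = 2 ** z  # number of tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # clamp so that lon = 180 or extreme latitudes still map to a valid tile
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_url(base_url, layer, z, x, y):
    """Fill the z/x/y placeholders of the request pattern shown above."""
    return f"{base_url}/tileserver/{layer}/{z}/{x}/{y}.pbf"


x, y = lonlat_to_tile(13.405, 52.52, 6)  # roughly Berlin
url = tile_url("https://example.org", "geo.boundaries_europe", 6, x, y)
```

&lt;p&gt;Each additional zoom level doubles the number of tiles per axis, which is why the same location ends up in a different x/y pair at every zoom level.&lt;/p&gt;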
&lt;p&gt;Alternatively, &lt;code&gt;pg_tileserv&lt;/code&gt; can serve vector tiles not only from a table but also from an SQL function using a query with PostGIS’ vector tile creator function &lt;a href="https://postgis.net/docs/ST_AsMVT.html"&gt;ST_AsMVT&lt;/a&gt;. &lt;code&gt;pg_tileserv&lt;/code&gt;’s &lt;a href="https://github.com/CrunchyData/pg_tileserv#readme"&gt;README&lt;/a&gt; on GitHub provides some cool examples for such function layers. For example, PostGIS allows you to create a &lt;a href="https://postgis.net/docs/ST_HexagonGrid.html"&gt;grid&lt;/a&gt; of squares or hexagons within a defined extent, e.g. the envelope of a single tile. The grid can be intersected with another spatial data set to produce a heatmap. The following example is inspired by &lt;code&gt;pg_tileserv&lt;/code&gt;'s example of &lt;a href="https://access.crunchydata.com/documentation/pg_tileserv/1.0.3/usage/function-layers-advanced/"&gt;Advanced Function Layers&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;OR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;REPLACE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FUNCTION&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geodata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;population_hexagons&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bytea&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- get web mercator tile bounds to given coordinate&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_TileEnvelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hexes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- generate hexgrid within bounds and join with population grid&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;grid_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;popcount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;popcount&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- oversimplified, of course&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;LATERAL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_HexagonGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- 1. hex size, 2. boundary&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ST_XMax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_XMin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;-- do spatial join between our artificial grid and the Geostat grid&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;-- the hex grid is in web mercator coordinate reference system (CRS)&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;-- it must be tranformed into the same CRS of the population grid (WGS84 - 4326)&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geodata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_Transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4326&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;GROUP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mvt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- processing geometry for vector tiles&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_AsMVTGeom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;grid_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;popcount&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hexes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- baking mvt geom, grid_id and popcount into MVT encoding&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ST_AsMVT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mvt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;geodata.population_hexagons&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mvt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sql&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;STABLE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;STRICT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PARALLEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SAFE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Your function must take the Z, X and Y parameters as arguments and return Postgres' &lt;code&gt;bytea&lt;/code&gt; type, which is just a BLOB for the PBFs returned from ST_AsMVT. In the first part of the query we need to get the tile envelope for the given input. Within this square we generate the grid and join it against the Geostat population grid. For each hexagon we sum up the population of every intersecting Geostat grid cell. This is quite coarse, indeed. It would be more precise to join the generated grid against a point data set, e.g. one could generate &lt;a href="https://postgis.net/docs/ST_Centroid.html"&gt;centroids&lt;/a&gt; for each data polygon.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dynamic spatial layers based on zoom level" src="https://engineering.zalando.com/posts/2021/12/images/pophex_tiles.gif#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Dynamic hexagon grid joined against Geostat population data using an SQL function&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Because this is all based on database queries triggered from user interactions with the map, such a heatmap can be dynamic and change while zooming in and out. As the vector tile grid gets smaller on a larger scale the heatmap becomes more fine-grained. In the map legend you can see that values adapt to the zoom level and hexagon size. This aids perception: the observer is not overwhelmed when the full picture is shown and gets better guidance to points of interest.&lt;/p&gt;</content><category term="Zalando"/><category term="Data Analytics"/><category term="PostgreSQL"/><category term="Kubernetes"/><category term="Backend"/><category term="Data"/></entry><entry><title>A Systematic Approach to Reducing Technical Debt</title><link href="https://engineering.zalando.com/posts/2021/11/technical-debt.html" rel="alternate"/><published>2021-11-30T00:00:00+01:00</published><updated>2021-11-30T00:00:00+01:00</updated><author><name>Gregor Ulm</name></author><id>tag:engineering.zalando.com,2021-11-30:/posts/2021/11/technical-debt.html</id><summary type="html">&lt;p&gt;This article describes a systematic approach to reducing technical debt from the perspective of engineering management. It thoroughly describes the process that was set up in one of our core engineering teams and also addresses how such work can be effectively capitalized.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;While technical debt is a recurring issue in software engineering, the case of the Merchant Orders team within Zalando Direct was an outlier because, due to the lack of a clearly defined process, technical debt more or less only ever accumulated. When I joined this team in autumn 2020 as its new engineering lead, the technical debt backlog had entries dating back to 2018. In this article, I describe the process we set up in Q1/2021 in order to regain control of our technical debt. While the situation in your own team may not be quite as dire, you may nonetheless find some aspects of this blog post useful to adopt. Our backlog of technical debt tickets used to be in excess of 70, with no end in sight. With the adoption of the methodology described in this article, we have already shipped more than ten features or improvements over the course of eight weeks, i.e. four sprints. For the first time in three years, i.e. ever since my team started tracking technical debt, we are reducing it.&lt;/p&gt;
&lt;p&gt;This article is written from a managerial perspective and has Engineers and Engineering Managers as its target audience, though I hope that engineers of all levels find value in this article. Furthermore, I can only encourage any software engineer reading this article to approach their lead if ever-growing technical debt is an issue in their team. There is a non-zero chance that they will appreciate you raising the issue, considering that all of us are aware that technical debt is a serious problem. If you do not pay it down, you will get more technical debt on top for free, until your only option is a complete rewrite. This is quite similar to compound interest driving debtors into bankruptcy in the real world. Obviously, we would like to avoid such an outcome.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Excerpt from team's technical-debt backlog" src="https://engineering.zalando.com/posts/2021/11/images/techdebt_tracker_anon.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;An excerpt from my team’s technical-debt backlog as of April 2021. As you can see, there are items from 2018 and 2019 on it.&lt;/figcaption&gt;

&lt;h2&gt;Technical debt, Known and Unknown&lt;/h2&gt;
&lt;p&gt;Using the vocabulary of the &lt;a href="https://en.wikipedia.org/wiki/Johari_window"&gt;Johari window&lt;/a&gt;, you can probably identify plenty of “known known” technical debt in your codebase. However, some technical debt constitutes an “unknown unknown”, i.e. technical debt we do not know that we have. In our case, we had a long backlog of known technical debt, with many dozens of entries. Given that we have over a dozen services to maintain, this is probably not even a particularly frightening number. However, there is also technical debt that you are completely unaware of. This may seem counter-intuitive, in particular if you subscribe to the notion of being able to perfectly design services in advance, as well as once and for all eternity. Yet, this is not a caricature, considering that you can encounter non-technical leads who hold rather similar beliefs. In some circumstances, this could even be a perfectly valid position to hold, for instance in static environments.&lt;/p&gt;
&lt;p&gt;There are at least two sources of unknown technical debt. First, there are problems with your services that you simply have not yet identified. This can happen easily because once you agree on a design and subsequently carry out its implementation, you may not question any decisions the team has agreed on. This can of course mean that there are drawbacks in your design or implementation that someone with a fresh pair of eyes, for instance a new joiner, may be able to spot. Second, technology is a fast-moving field. This means that today’s cutting-edge design-patterns, development processes, testing strategies, or even programming languages and paradigms may get superseded. Your current best practices replaced your previous set of best practices one by one, and there are new developments that will one day make you wonder why anybody ever thought that a hitherto valid approach was ever a good idea. Of course, there is also the problem that we sometimes need to deliver features quickly to seize a business opportunity, which may lead to sub-optimal design and implementation decisions.&lt;/p&gt;
&lt;p&gt;Not all change is positive, however. As much as we engineers may pride ourselves on our objectivity, our industry is also driven by fads. This is such a big issue that a company like Gartner makes money by selling their analyses about where on the “&lt;a href="https://www.gartner.com/en/research/methodologies/gartner-hype-cycle"&gt;hype cycle&lt;/a&gt;” certain technologies are. Sometimes, we also regress as an industry, for instance by adopting technologies that are popular but less powerful. Yet, if they are being pushed by corporations with an annual marketing budget of many hundreds of millions of dollars, they can get a lot of traction in industry. Any of your services might look much different if it were rewritten today. As a practical consequence, I think you should take the time to re-review your existing services and look for improvements, but, if possible, with a very critical view toward buzzwords du jour. Even &lt;a href="https://en.wikipedia.org/wiki/TeX"&gt;TeX&lt;/a&gt;, arguably one of the most mature software products in the world, receives fixes to this very day. Its first version was released about four decades ago. Taking this into account, it is probably not an entirely implausible assumption that your services could be improved as well. On a related note, Zalando has &lt;a href="https://engineering.zalando.com/posts/2020/07/technology-choices-at-zalando-tech-radar-update.html"&gt;formal processes in place for selecting technologies as well as adopting new technologies&lt;/a&gt;. This is certainly helpful for engineering leaders, yet it cannot address the problem that some technologies fall out of favor over time due to shortcomings.&lt;/p&gt;
&lt;p&gt;As we create software solutions in a highly dynamic environment where both customer requirements and technologies can change, a semi-regular review of any of your services may uncover areas of improvement. All of that should be categorized as (hitherto unknown) technical debt. A very welcome consequence of such an exercise is that your engineers will gain greater familiarity with their services. This is particularly valuable if your services need to be reliable anytime. Preferably, each engineer on your on-call rotation should have very detailed knowledge of your services, so thoroughly studying the source code of your existing service will be very helpful to them.&lt;/p&gt;
&lt;h2&gt;Motivating your Engineers&lt;/h2&gt;
&lt;p&gt;In management theory, a popular concept is &lt;a href="https://en.wikipedia.org/wiki/Theory_X_and_Theory_Y"&gt;Theory X/Theory Y&lt;/a&gt;. The two theories are presented as a contrasting pair. According to Theory X, people only work because they need money and, if they could get away with it, they would prefer to not work at all. In contrast, Theory Y posits that people are intrinsically motivated, care about their work, and want to advance in their careers. Reality is probably somewhere in-between. However, as a leader, the problem is how to get people to want to work on technical debt. In our case, the problem was that the backlog had tickets on it that were three years old, which seems to imply a lack of motivation to work on such tickets.&lt;/p&gt;
&lt;p&gt;As leaders we can of course simply tell people what to work on (Theory X). The problem, however, is that people tend to be more productive if they work on tickets they really do want to work on (Theory Y). Furthermore, my experience as an engineer was that work on technical debt can be both fulfilling, as well as open up new opportunities. Consequently, I use a Theory Y approach with my team, stressing the benefits of this kind of work. Please note that this is not in any way a cynical approach. A good part of my growth as an engineer was due to resolving hairy technical problems, oftentimes with a focus on performance improvements. In one of my internships I was given the task of increasing the performance of an artificial neural network, and this work led to me later on getting hired in a very competitive field. I also highlighted to my team that work on technical debt can sometimes be easily quantified. An engineer’s CV certainly looks better with hard data on percentages of performance increases or space reductions. Examples are: “Reduced weekly AWS hosting fees by $500 by evaluating resource requirements” (this is an actual result of our work) or “reduced space requirements of one of our databases by 12% by optimizing data types and removing redundant information.”&lt;/p&gt;
&lt;h2&gt;The Technical-Debt Rotation&lt;/h2&gt;
&lt;p&gt;My team already has several rotations in place. Thus, I set up technical debt as another rotation. I aim to give my team autonomy in their work, so my proposal was the following: all engineers take turns in the technical-debt rotation, and one iteration lasts for one week. In practice, this means that on every Monday an engineer should spend some time on identifying technical debt they want to work on. This can either be known technical debt, i.e. one or more tickets from the technical-debt tracker, or unknown technical debt. For the latter, my suggestion is to pick one of our many services, study the source code, and look for improvements. This should lead to a number of additional tickets. Preferably, an engineer identifying possible improvements of an existing service should also do the corresponding work. This is particularly the case when we only have a hypothesis that requires some work to test it.&lt;/p&gt;
&lt;p&gt;I want the engineers on the technical-debt rotation to work on tickets related to technical debt before taking on any tickets from our regular backlog, which is of course considered during the planning meeting. In terms of the time commitment, I am rather flexible. I would like the engineer on the rotation to spend at least one day working on technical debt. However, there are situations where a bigger commitment may be warranted. This is particularly the case with larger subprojects, as detailed in the next section.&lt;/p&gt;
&lt;p&gt;You may have noticed that I have not yet addressed the issue of urgency; clearly, not all technical debt is created equal. We tend to address pressing issues as soon as possible. We commonly do not even classify them as technical debt but instead as necessary bug fixes or “operations” issues. Nonetheless, some of our accumulated technical debt is merely nice-to-resolve. My advice to fellow leaders would be to keep an eye on what your team is working on by tracking the technical-debt tickets your team closes. There should be a healthy mix of relative importance. If not, you will have to address this, perhaps in a separate session for backlog refinement. I would not advise you to rank all technical-debt tickets by urgency and simply assign them, however, for reasons specified in the previous section.&lt;/p&gt;
&lt;p&gt;We also have a simple system in place for categorizing technical debt where we use the two metrics "complexity" and "impact", and rank both on a scale from one to five. In our case, these estimations are initially done by the engineer who adds entries to the tech-debt backlog, but they are reviewed intermittently. I think a good starting point is picking a few items that could be considered low-hanging fruit, i.e. work that pairs relatively low complexity with moderate to high impact. You may want to encourage your engineers to also tackle more complex work with a medium to high impact. You may also find that some of the technical debt is not worth resolving at the current point in time as the impact would be low to non-existent. Those you may want to save for a less busy time, for instance the code freeze before Cyber Week.&lt;/p&gt;
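&lt;p&gt;As a rough illustration of how these two scores could be used to surface low-hanging fruit, consider the following Python sketch. The &lt;code&gt;Ticket&lt;/code&gt; class and the "impact minus complexity" heuristic are assumptions made for this example and are not part of our tracker.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    title: str
    complexity: int  # 1 (trivial) to 5 (very complex)
    impact: int  # 1 (negligible) to 5 (very high)


def rank_tickets(tickets):
    """Order tickets so that low-complexity, high-impact work comes first."""
    # Hypothetical heuristic: impact minus complexity, highest score first.
    # Ties are broken in favor of the less complex ticket.
    return sorted(tickets, key=lambda t: (t.complexity - t.impact, t.complexity))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Such a ranking is only meant as a starting point for discussion. It should not replace the engineers' own choice of what to work on.&lt;/p&gt;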
&lt;h2&gt;Capitalizing Technical Debt&lt;/h2&gt;
&lt;p&gt;One of the duties of software engineering leads is to ensure that the work their team performs is properly capitalized. This means that any software we create that increases our digital assets should also be added to our financial assets. In turn, this reduces our tax liabilities. Maintenance work, however, cannot be capitalized as it is instead considered an expense. A collection of technical debt tickets could constitute a mini-project that can be capitalized, however. One example would be a migration to new infrastructure or a significant rewrite that leads to performance improvements. Admittedly, packaging technical-debt tickets into a project may be an overly idealistic scenario. Yet, it is a possible outcome. In our team’s case, we have recently identified a number of issues with our Scala code base, due to an over-reliance on object-oriented programming constructs. If we resolved them, we would have a more maintainable system; we also predict an improvement in performance as there are many instances where objects are used instead of primitive types. Similarly, you may be able to identify a group of technical-debt tickets, provided your backlog is long enough, that could constitute a small project.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The team has been following the technical-debt rotation as described in this article for about six months. Feedback from the team has been positive. Among other things, the engineers remarked that it adds variety to their work and that they appreciate the increased autonomy. Of course, the latter will only be the case for as long as there is a large enough backlog of technical-debt tickets to choose from. At some point, hopefully, we will have reduced our backlog significantly, and then we will have to rely on the intrinsic motivation of wanting to better understand an existing system by diving deeper into implementation details or the satisfaction of improving the performance or design of a service.
From the perspective of an engineering leader, my end goal is to pay down as much technical debt as possible. In fact, the ideal size of our technical-debt backlog would be zero. This is a distant goal, but we have taken successful steps towards it. First, I wanted to reduce the rate of increase of the backlog. We achieved this within the first two weeks. If you preside over a technical-debt backlog that has only been growing for three years, it is already satisfying to see that it is no longer growing as quickly. The next step was to keep the number of tickets on the backlog steady, which we reached soon afterwards. Now we are at the point where the total number of tickets on our technical-debt backlog is, possibly for the first time ever, declining. The team is very happy about it. One year from now, I expect us to have drastically reduced our technical-debt backlog.&lt;/p&gt;</content><category term="Zalando"/><category term="Management"/><category term="Leadership"/></entry><entry><title>Parallel Run Pattern - A Migration Technique in Microservices Architecture</title><link href="https://engineering.zalando.com/posts/2021/11/parallel-run.html" rel="alternate"/><published>2021-11-04T00:00:00+01:00</published><updated>2021-11-04T00:00:00+01:00</updated><author><name>Ali Sabzevari</name></author><id>tag:engineering.zalando.com,2021-11-04:/posts/2021/11/parallel-run.html</id><summary type="html">&lt;p&gt;Learn how we leveraged the parallel run pattern to decompose a high traffic monolith to smaller microservices&lt;/p&gt;</summary><content type="html">&lt;p&gt;The business landscape in Zalando is growing every day. This continuous growth implies that we need to be able to cope with an ever-changing environment. Everyone with experience in software development knows that dealing with changes is a challenging problem. Especially, when the software is already working in production. 
Changing the software in production is like changing the tires on a car while it is still moving.&lt;/p&gt;
&lt;p&gt;In large organisations such as Zalando, where microservices architecture is the standard, changes are even more frequent. Technologies become obsolete, organization structures change, teams split or merge, monoliths are being rewritten, and yesterday's microservices become today's monoliths. All those examples impose dramatic changes in codebases.&lt;/p&gt;
&lt;p&gt;Naturally, testing is the first solution that comes to mind when trying to minimize the risk of regressions from a change. But in scenarios like decomposing a monolith or replacing a legacy component with a newer one, testing might not be enough. Furthermore, there are always dark corners in our systems that we have never tested or whose behavior we no longer know. Sometimes, as you may well know from your own experience, legacy systems don't even have tests one can use as a reference.&lt;/p&gt;
&lt;p&gt;In this article, we will explore a design pattern called the &lt;em&gt;Parallel Run&lt;/em&gt;&lt;sup id="fnref:fn1"&gt;&lt;a class="footnote-ref" href="#fn:fn1"&gt;1&lt;/a&gt;&lt;/sup&gt; which is a strategy to make sure those dramatic changes will not break the system. We will walk you through a real-world example and describe how we managed to replace a service by taking advantage of this pattern and show you the challenges and surprises we dealt with. In the end, we summarize the upsides and downsides of this pattern to better help you choose when to implement it and when not.&lt;/p&gt;
&lt;h2&gt;Decomposing the monolith, a case study&lt;/h2&gt;
&lt;p&gt;Zalando is aiming to unify the user experience across platforms&lt;sup id="fnref:fn2"&gt;&lt;a class="footnote-ref" href="#fn:fn2"&gt;2&lt;/a&gt;&lt;/sup&gt;. As part of this effort we, the Returns team, were required to extract the returns logic out of a soon-to-be legacy monolithic application. Returns logic, as the name might imply, deals with everything to do with customers returning articles they've bought on the Zalando Fashion Store. This article will explore how our team used the Parallel Run pattern to transparently and safely extract the returns logic from the monolith to the new Returns microservice.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Decomposing the monolith" src="https://engineering.zalando.com/posts/2021/11/images/decomposing-monolith.png"&gt;&lt;/p&gt;
&lt;p&gt;This new service should behave exactly like the respective part in the monolith and the customers should not notice any difference after the migration. In order to achieve this, the following complications needed to be overcome:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;While reading the old code is possible, we might miss some parts of the logic or misunderstand the code.&lt;/li&gt;
&lt;li&gt;Some parts of the code are not tested, so running the tests over the new code (if possible) would not guarantee the exact behavior.&lt;/li&gt;
&lt;li&gt;The criticality of the application precludes downtime.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Parallel Run Pattern&lt;/h2&gt;
&lt;p&gt;In order to solve these problems, wouldn't it be nice if we could verify that each request handled by the new system would be handled exactly in the same way as for the system currently running in production? The parallel run pattern does exactly that.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When using a parallel run, rather than calling either the old or the new implementation, instead we call both, allowing us to compare the results to ensure they are equivalent. Despite calling both implementations, only one is considered the source of truth at any given time. Typically, the old implementation is considered the source of truth until the ongoing verification reveals that we can trust our new implementation.&lt;/p&gt;
&lt;p&gt;-- Sam Newman, Monolith to Microservices&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Implementation&lt;/h3&gt;
&lt;p&gt;There are several ways of implementing this pattern. Hereafter we present how we solved it for the above use case.&lt;/p&gt;
&lt;p&gt;The following diagram shows the flow for each incoming request:&lt;/p&gt;
&lt;!-- http://www.plantuml.com/plantuml/uml/dPB1RXGn38RlynJM7X18h40zSa0jXPweG6sFI2XDl3iMIHniXx9lJxAZ7GR1x85ZBCV_vo-vL7DYDSN1LUDSqoFAS1q9iy7sBMokIb5uTtC3psyvSoGRNspUm1r-hwZsrGt_REWtfnczLGjdnTQRsH24zcChFum8fmiWnvwWO0ms8lWftnM2BrccB70APF34DGR8BCd5U830mph2vWwjIjPM9o-iA3_8O-V__Ed-0LxvnaLgcFrXwqVqttH2ZBXhXFVOYJhEJB0pb2DHYGVA-nFkjEhwUZfFQeajphGDuLsld4o2ow6VPrr0-NX-v70OrXQ1xVeJNNcFnJ0iL-fKomb0AM4WPzXKJdiHAZnrw8lN5yEPu7Mxoz-nLFBXfudpzeVHAl4b9BIHGyll3aPq0KMFFhet8EkQoHJZxYpFUoojlp_cJ33yhfqbdgt_xyRNd8eJKivBtLCLGKvlsl_Bt_zUyNngQoTZeRnlIRTeGbxX6NpalGwNc4DDyHS0 --&gt;

&lt;p&gt;&lt;img alt="Parallel run sequence diagram" src="https://engineering.zalando.com/posts/2021/11/images/parallel-run-sequence-diagram.png"&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(1-2) The Client makes a request that is immediately processed and answered by the monolith to avoid any degradation in performance.&lt;/li&gt;
&lt;li&gt;(3-4) After responding, the monolith POSTs a request to the &lt;code&gt;/consistency-checks&lt;/code&gt; endpoint of the new Returns microservice, which immediately answers with 202 (Accepted), indicating that the request will be handled asynchronously. In this way the monolith does not have to wait, and its resources are freed.&lt;/li&gt;
&lt;li&gt;(5-6-7) The Returns microservice starts processing the request in the background by first re-issuing the same request to itself, this time calling the actual endpoint.&lt;/li&gt;
&lt;li&gt;(8) Then the response from the Returns microservice gets collected and compared with the one from the monolith.&lt;/li&gt;
&lt;li&gt;(9) Finally, Metrics and Logs about the consistency are produced to later on verify that the expected consistency is reached and to investigate cases of inconsistencies.&lt;/li&gt;
&lt;/ul&gt;
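&lt;p&gt;The comparison in step (8) can be sketched roughly as follows. This is a minimal Python illustration and not our actual implementation: the function names, the dictionary layout with &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;headers&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt;, and the set of ignored headers are all assumptions made for the example.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import json

# Assumption: headers that may legitimately differ between the two systems.
IGNORED_HEADERS = frozenset({"date", "transfer-encoding", "content-length"})


def normalize_headers(headers):
    """Lower-case header names and drop the ones we chose to ignore."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() not in IGNORED_HEADERS
    }


def parse_body(body):
    """Parse a JSON body so that key order and whitespace do not matter."""
    if body is None:
        return None
    try:
        return json.loads(body)
    except ValueError:
        return body  # not JSON: fall back to comparing the raw text


def find_inconsistencies(expected, actual):
    """Return the response parts that differ (an empty list means consistent)."""
    differences = []
    if expected["status"] != actual["status"]:
        differences.append("status")
    if normalize_headers(expected["headers"]) != normalize_headers(actual["headers"]):
        differences.append("headers")
    if parse_body(expected["body"]) != parse_body(actual["body"]):
        differences.append("body")
    return differences
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here &lt;code&gt;expected&lt;/code&gt; would be the response recorded by the monolith and &lt;code&gt;actual&lt;/code&gt; the one produced by the new service. Returning the names of the differing parts, rather than a single boolean, makes it easier to label the metrics and logs of step (9).&lt;/p&gt;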
&lt;p&gt;The async request sent to the ConsistencyChecker part of the Returns microservice contains information about the original request: the URL with its query parameters, the method, the headers and, when present, the body. This information represents the new request to be sent to the Returns microservice. It also includes the HttpStatus, the headers, and the body of the response returned by the monolith so that they can be checked against the response from the Returns microservice.&lt;/p&gt;
&lt;p&gt;The following is an example of the structure that we used:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;request&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;path&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;api/example?param=something&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;headers&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Content-Type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;application/json;charset=UTF-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Accept-Language&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;de-DE&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;GET&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;body&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;response&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;status&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;headers&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Content-Type&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;application/json;charset=UTF-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;transfer-encoding&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;chunked&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;body&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;json-response-body&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each endpoint of the monolith has its own consistency threshold that must be reached before its migration is declared successful. Once that threshold has been reached, the migration can be considered safe, and we can switch that endpoint from the monolith to the new Returns microservice.&lt;/p&gt;
&lt;h3&gt;Monitoring and Reporting&lt;/h3&gt;
&lt;p&gt;To be considered ready, an endpoint had to reach a satisfactory consistency percentage. For each request we recorded the result as Prometheus metrics and displayed them in Grafana. Each endpoint, identified by an &lt;code&gt;operation_id&lt;/code&gt;, had its own metric and its own tolerance. We did this because, as usual, closing the last few percentage points costs more than the value it brings; and since the endpoints are completely independent of one another, each one had its own target percentage at which we considered it consistent (enough).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Monitoring_example" src="https://engineering.zalando.com/posts/2021/11/images/monitoring-and-reporting-grafana.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Matched&lt;/strong&gt;: counter for all the requests that matched between the monolith and the Returns microservice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unmatched&lt;/strong&gt;: counter for all the requests that did not match between the two services. Examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Different HTTP statuses&lt;/em&gt;: such as 2xx and 4xx or even 201 and 200&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Different header sets&lt;/em&gt;: a missing header in one of the two responses or different values for the same header&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Different response bodies&lt;/em&gt;: missing fields/attributes in the responses or different values for the same field/attribute&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Failed&lt;/strong&gt;: counter for all the requests whose response was disrupted by a temporary issue, for example any 5xx. In these cases, even a match is not valuable information, because the request could not be properly fulfilled due to a transient server-side issue. If, on the other hand, the responses did not match in a 5xx case, the &lt;em&gt;unmatched&lt;/em&gt; counter should be increased, because it means the overall behavior of the Returns microservice doesn't match that of the monolith and requires deeper investigation.&lt;/p&gt;
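&lt;p&gt;As a rough illustration of how the three counters relate, the classification of a single comparison could be sketched in Python as below. This is not our production code: the &lt;code&gt;Response&lt;/code&gt; type, the &lt;code&gt;classify&lt;/code&gt; function and the strict equality check are assumptions made for the example, and the rule order is only one reading of the definitions above.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MATCHED = "matched"
    UNMATCHED = "unmatched"
    FAILED = "failed"


@dataclass(frozen=True)
class Response:
    """Hypothetical container for the parts of a response that are compared."""
    status: int
    headers: dict
    body: object


def classify(monolith: Response, microservice: Response) -> Outcome:
    """Decide which counter to increment for one parallel-run comparison."""
    same = (
        monolith.status == microservice.status
        and monolith.headers == microservice.headers
        and monolith.body == microservice.body
    )
    # A mismatch always counts as unmatched, including the 5xx cases.
    if not same:
        return Outcome.UNMATCHED
    # Both sides agree, but on a server error: the match carries no signal.
    if monolith.status >= 500:
        return Outcome.FAILED
    return Outcome.MATCHED
```

&lt;p&gt;In a real setup the returned outcome would select which Prometheus counter to increment for the endpoint's &lt;code&gt;operation_id&lt;/code&gt;.&lt;/p&gt;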
&lt;h3&gt;Rollout&lt;/h3&gt;
&lt;p&gt;The switch was done gradually, one endpoint at a time, so that the system could be tested in a fully functional way. We achieved this by using a proxy to route requests to the Returns microservice, endpoint by endpoint, as each one became ready. In our case we used &lt;a href="https://opensource.zalando.com/skipper/"&gt;Skipper&lt;/a&gt;, an open-source proxy developed by Zalando.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Endpoints rollout" src="https://engineering.zalando.com/posts/2021/11/images/endpoints-rollout.png"&gt;&lt;/p&gt;
&lt;p&gt;In this way, by limiting the number of endpoints rolled out to one per switch, we avoided introducing a massive set of changes in one go, and we were able to collect additional feedback from every single switch while still working on finalizing the other ones.&lt;/p&gt;
&lt;h3&gt;Clean-up&lt;/h3&gt;
&lt;p&gt;Once the migration was successfully finalized, all the code related to the parallel run logic needed to be cleaned up. The three main parts to remove were the handler performing the consistency check (use cases layer), the gateway to call the localhost (gateway layer) and the domain model related to the consistency logic (entities layer). Additional clean-ups were done for configuration files such as the feature toggle to enable/disable the consistency checker and the config for the localhost gateway, the dependency injection in the Main file, the consistency-checker API in the route and, of course, all the tests to validate the consistency check logic. Code-wise we removed ~700 lines of code and ~1.3k lines between unit and component tests.&lt;/p&gt;
&lt;h3&gt;Advantages of this approach&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Live data for testing:&lt;/strong&gt; We can leverage the real production data as test cases. Therefore, given enough time, the system will be tested potentially under all the "real-life" use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gradual rollout:&lt;/strong&gt; The rollout is done per endpoint, minimizing the number of changes per switch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Incremental development:&lt;/strong&gt; The gradual rollout also makes it possible to implement the new service one endpoint at a time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Easy rollback:&lt;/strong&gt; Because a proxy does the traffic switch, rolling back only requires a change to the proxy so that the endpoint points at the previous host again instead of the microservice; this avoids the need to redeploy, making the whole process faster.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Finding bugs:&lt;/strong&gt; Since the new microservice will be tested with real data, there might be cases where even the monolith was behaving incorrectly. This approach can make those edge cases visible.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Load testing:&lt;/strong&gt; When the new service uses a different technology, the parallel run pattern helps to understand its performance characteristics. As a result, the development team can target more realistic performance goals or SLOs before going live.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Considerations and Limitations&lt;/h3&gt;
&lt;p&gt;While this approach makes the migration safer and smoother, it also comes with some concerns and issues that need to be taken into account.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Increased load:&lt;/strong&gt; Given that requests received by the monolith are forwarded to the microservice, the load across all components increases, potentially doubling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Refine the comparisons:&lt;/strong&gt; In the comparison check not everything needs to match 100%. For example, in our case we ignored some headers that were not relevant for the outcome of the request.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GDPR:&lt;/strong&gt; While collecting the data for the comparison we need to take into account that sensitive information should either not be stored or be cleaned afterwards. In the former case, analyzing some inconsistencies for the fields containing personal data might not be easy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-trivial comparisons:&lt;/strong&gt; Comparing the results is not always a straightforward task. For example, comparing PDFs might be complicated due to different but negligible metadata, a change in the HTTP framework might result in different default response headers, and collections could have different orderings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Non-Idempotent endpoints:&lt;/strong&gt; Idempotency should always be taken into account. For example, this approach can be used for POSTs that are idempotent but not when the idempotency of the endpoint cannot be guaranteed. When doing this investigation, always consider the idempotency of each operation and its possible side effects (for example calling another POST API, updating a database, or publishing an event).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not a quick-win:&lt;/strong&gt; Even if this approach leads to a smooth and safe migration, it requires quite some time and effort to be properly set up and tuned.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
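&lt;p&gt;To make the points about refining comparisons and non-trivial comparisons more concrete, here is a minimal Python sketch of normalizing both responses before comparing them. The set of ignored headers and the function names are hypothetical, and sorting lists is only valid for collections whose order carries no meaning; it also assumes JSON-serializable bodies.&lt;/p&gt;

```python
import json

# Hypothetical examples of headers that do not affect the outcome of a request.
IGNORED_HEADERS = frozenset({"date", "transfer-encoding", "x-request-id"})


def normalize_headers(headers: dict) -> dict:
    """Lower-case header names and drop the ones we chose to ignore."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() not in IGNORED_HEADERS
    }


def normalize_body(value):
    """Recursively sort lists so that element order does not affect equality."""
    if isinstance(value, dict):
        return {key: normalize_body(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [normalize_body(item) for item in value]
        # A canonical JSON string gives a stable sort key for nested values.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value
```

&lt;p&gt;Two responses would then be compared on &lt;code&gt;normalize_headers(...)&lt;/code&gt; and &lt;code&gt;normalize_body(...)&lt;/code&gt; rather than on the raw values, which keeps harmless differences from inflating the unmatched counter.&lt;/p&gt;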
&lt;h2&gt;Verdict&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Implementing a parallel run is rarely a trivial affair, and is typically reserved for those cases where the functionality being changed is considered to be high risk. (...) the work to implement this needs to be traded off against the benefits you gain.&lt;/p&gt;
&lt;p&gt;-- Sam Newman, Monolith to microservices&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The parallel run pattern is a powerful technique to overcome the complexities and stress of migration projects, but not every migration project is a good fit for it. Increased traffic, complexities in comparing the results, and the amount of effort are the risks that should be considered before implementing this pattern.&lt;/p&gt;
&lt;p&gt;In the end, this pattern is just a tool that should be used wisely considering constraints, use cases, and team capacity when planning for it. When it is done properly, it saves you a lot of headaches.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:fn1"&gt;
&lt;p&gt;Newman S. (2020). &lt;em&gt;Monolith to Microservices&lt;/em&gt;. 2nd ed. O’Reilly Media, Inc.&amp;#160;&lt;a class="footnote-backref" href="#fnref:fn1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:fn2"&gt;
&lt;p&gt;You can learn more about this effort in a series of &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;articles about GraphQL&lt;/a&gt; in this blog.&amp;#160;&lt;a class="footnote-backref" href="#fnref:fn2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Microservices"/><category term="Monolith"/><category term="Design Patterns"/><category term="Backend"/><category term="Frontend"/></entry><entry><title>Tracing SRE’s journey in Zalando - Part III</title><link href="https://engineering.zalando.com/posts/2021/10/sre-journey-part3.html" rel="alternate"/><published>2021-10-15T00:00:00+02:00</published><updated>2021-10-15T00:00:00+02:00</updated><author><name>Pedro Alves</name></author><id>tag:engineering.zalando.com,2021-10-15:/posts/2021/10/sre-journey-part3.html</id><summary type="html">&lt;p&gt;Follow Zalando's journey to adopt SRE in its tech organization.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;This is the third and last part of our journey to roll out SRE in Zalando. You’ll find the previous chapters &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html"&gt;here&lt;/a&gt; and &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html"&gt;here&lt;/a&gt;. Thanks for following our story.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;2020 - From team to department&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;The road so far:&lt;/em&gt; 2016 saw an attempt at the rollout of a Site Reliability Engineering (SRE) organization that did not quite materialize but still left the seed of SRE in the company; in 2018 and 2019 we had a single SRE team working on strategic projects that improved the reliability of Zalando’s platform. The success of that last team brought with it many requests for collaboration, which had to be balanced with SRE’s own roadmap. In this chapter we’ll learn how SRE adapted in order to achieve sustainable growth.&lt;/p&gt;
&lt;p&gt;In late 2019 there was a reorg in our Central Functions unit. This reorg was centered around a set of principles, chief among them “Customer Focus”, “Purpose” and “Vision”. Through that reorg &lt;strong&gt;SRE became a department&lt;/strong&gt; that encompasses the original SRE Enablement team, the teams building monitoring services and infrastructure, and incident management. This is a clear investment from the company into the value SRE repeatedly demonstrated.
The close collaboration those teams had had in the previous years already hinted at a common purpose between them. Through the Incident Commander role and the support to Postmortems, SRE was always in close contact with Incident Management. Distributed Tracing, where SRE invested much of its efforts, was actually owned by one of the monitoring teams. Now that everyone was under the same ‘roof’ we could further strengthen the synergies that were already in place.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zalando’s SRE Logo" src="https://engineering.zalando.com/posts/2021/10/images/sre-logo.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Zalando’s SRE &lt;s&gt;team&lt;/s&gt; department logo&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;In 2019 SRE had already started to dedicate time to its own products, but the creation of a department further endorsed SRE’s long term plans. With an entire department under the SRE label, we had to be smart about our next steps, particularly in the long term. We also had to adjust to what it meant to operate as a department. Before, with a single team we could be (and occasionally had to be) more flexible, picking ad hoc projects. But now we had teams with a better defined purpose, and we wanted all of them working together towards a common goal. It was time to come up with a plan for how we could implement our new purpose: &lt;strong&gt;to reduce the impact of incidents while supporting all builders at Zalando to deliver innovation to their users reliably and confidently&lt;/strong&gt;. That plan materialized as the &lt;strong&gt;SRE Strategy&lt;/strong&gt;, which was published in 2020 and set the path for the years to come.&lt;/p&gt;
&lt;p&gt;Following the same set of principles that influenced the creation of the SRE department (“Customer Focus”, “Purpose” and “Vision”), the SRE Strategy had at its core &lt;strong&gt;Observability&lt;/strong&gt;.
How did Observability fit with those principles and bind the three teams? For the teams developing our monitoring products it’s quite obvious. But Observability is also key for SRE: we drive our work through SLOs, and it is at the base of the &lt;a href="https://sre.google/sre-book/part-III-practices/"&gt;Service Reliability Hierarchy&lt;/a&gt;. Finally, Incident Management is made that much more efficient with the right Observability into our systems, which helps identify issues in our platform and makes it easier to understand what is affecting the customer experience.&lt;/p&gt;
&lt;p&gt;Our strategy set a target: &lt;strong&gt;standardizing Observability across Zalando&lt;/strong&gt;. Through that standardization we could achieve a common understanding of Observability within the company, reduce the overhead of operating multiple services and make it easier to build on top of well defined signals (like we did before with OpenTracing). The concrete step for making this possible was to develop SDKs for the major &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;programming languages in use at Zalando&lt;/a&gt;.
&lt;strong&gt;Standardization&lt;/strong&gt; was something we grew quite fond of in the previous years. While operating as a single team, doing several projects with different teams, we were uniquely positioned to identify common pain points or inefficiencies across the company. But eventually we also realised one thing: &lt;strong&gt;as a single team it would be challenging to scale our enablement efforts to cover hundreds of teams in the company&lt;/strong&gt;. Waiting for the practices we tried to establish to spread organically would also take too long. The only way we could properly scale our efforts and reach our goals was to develop the tools and practices that every other team would use in their day to day work. We couldn’t do everything at once, but our new strategy gave us the starting point: Observability.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Service Reliability Hierarchy" src="https://engineering.zalando.com/posts/2021/10/images/service-reliability-hierarchy.jpeg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Observability is also at the base of &lt;a href="https://sre.google/sre-book/part-III-practices/"&gt;Service Reliability Hierarchy&lt;/a&gt;&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;We started collecting metrics on our performance regarding Incident Response. How many incidents were we getting? What was the Mean Time To Repair? How many were false positives? What was the impact of those incidents? Now that incident management was part of SRE, it was important to understand how the incident process was working, and how it could be improved. We were already rolling out Symptom Based Alerting, so that alone would already help with reducing the False Positive Rate. But we took it a step further and devised a new incident process that &lt;strong&gt;separated Anomalies and Incidents&lt;/strong&gt;.
It’s easy to map these improvements to benefits for the business and to our customers, but there’s also something to be said about the &lt;strong&gt;health of our on-call engineers&lt;/strong&gt;. Having an efficient incident process (and the right Observability into a team’s systems) goes a long way to making the lives of on-call engineers better. Pager fatigue should not be dismissed: it can hurt a team through lower productivity and employee attrition.
Something important to highlight in this whole process is that &lt;strong&gt;we started by collecting the numbers&lt;/strong&gt; to see if they would match what our observations had already been pointing to. This is a common practice that guides our initiatives. That is also why one of the first things we did after creating the department was to define the KPIs that would guide our work, make sure they were being measured, and facilitate the reporting of those KPIs.&lt;/p&gt;
&lt;p&gt;SRE continued the rollout of &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;Operation Based SLOs&lt;/a&gt; by working closely with the senior management of several departments and agreeing on their respective SLOs. Those SLOs would be guarded by our &lt;a href="https://www.usenix.org/conference/srecon19emea/presentation/mineiro"&gt;&lt;strong&gt;Adaptive Paging&lt;/strong&gt;&lt;/a&gt; alert handler. With this we also continued the adoption of &lt;a href="https://github.com/zalando/public-presentations/blob/master/files/2019-05-16_alerting_monitoring_and_all_that_jazz.pdf"&gt;&lt;strong&gt;Symptom Based Alerting&lt;/strong&gt;&lt;/a&gt;.
With Adaptive Paging we had an interesting development. Our initial approach was to make the SLO the threshold upon which we would page the on-call responder. What we soon discovered is that it made our alerts too sensitive to occasional short-lived spikes, similar to any other non-Adaptive Paging alert. We mitigated this by providing additional criteria that engineers could use to more granularly control the alert itself (time of day, throughput, how long the error rate persisted). What initially was supposed to be a hands-off task for engineers (defining alerts and thresholds) quickly led us down a path we were already familiar with. Engineers were back at defining alerting rules because the target set by the SLO was not enough. After some experiments, we improved Adaptive Paging by having it use &lt;a href="https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts"&gt;Multi Window Multi Burn Rate&lt;/a&gt; alert threshold calculation. This change resulted in two relevant outcomes. First, it brought &lt;strong&gt;Error Budgets&lt;/strong&gt; to the forefront. Deciding whether to page someone was no longer a question of whether the SLO was breached, but rather of whether the Error Budget was at risk of being depleted. The second outcome, and arguably the more important one, is that we made it possible for the operations guarded by our alert handler to have their respective rules (length of the sliding windows and the alarm threshold) &lt;strong&gt;derived automatically from the SLO without any effort from the engineering teams&lt;/strong&gt;, where previously those rules were usually tuned through trial and error.&lt;/p&gt;
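&lt;p&gt;To illustrate the idea of Multi Window Multi Burn Rate alerting, here is a minimal Python sketch. It is not the Adaptive Paging implementation: the window pairs and thresholds are the example values from the Google SRE Workbook for a 30-day SLO period, and the function names and the shape of the &lt;code&gt;error_rates&lt;/code&gt; input are assumptions made for the example.&lt;/p&gt;

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_rate / (1.0 - slo)


# (long window, short window, burn-rate threshold), example values taken
# from the SRE Workbook; an assumption here, not Zalando's configuration.
PAGING_RULES = (("1h", "5m", 14.4), ("6h", "30m", 6.0))


def should_page(error_rates: dict, slo: float) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The long window shows that a meaningful share of budget was burned;
    the short window shows that the problem is still ongoing.
    """
    for long_window, short_window, threshold in PAGING_RULES:
        if (burn_rate(error_rates[long_window], slo) >= threshold
                and burn_rate(error_rates[short_window], slo) >= threshold):
            return True
    return False
```

&lt;p&gt;Note that the only per-operation input is the SLO itself: the windows and thresholds are fixed, which is what removes the need for engineers to hand-tune alerting rules. A short spike that raises only the 5-minute error rate does not page, because the 1-hour window stays below the threshold.&lt;/p&gt;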
&lt;p&gt;The challenge with rolling out Operation Based SLOs was that reporting and getting an overview of those SLOs was not easy, with the data fragmented in different tools. To address this issue, a &lt;strong&gt;new Service Level Management tool&lt;/strong&gt; was developed. As we evolved the concept of SLOs, so too did we evolve the tooling that supported it. Other than reporting SLOs for the different operations, we also gave a view on the Error Budget. Knowing how much Error Budget is left makes it easier to use it to steer prioritization of development work.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SLO Tool" src="https://engineering.zalando.com/posts/2021/10/images/slo-tool.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Our operation based Service Level Management Tool (not actual data)&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Late in 2020 we began developing what we called the &lt;strong&gt;SRE Curriculum&lt;/strong&gt;. This was an initiative that aimed at scaling the &lt;strong&gt;educational benefits of SRE&lt;/strong&gt;. Specifically, this meant sharing the wealth of knowledge that SREs have accumulated over time about the sharp edges of production. We were looking not only at raising the bar on the company’s operational capabilities, but also at facilitating interactions with other teams by providing a common understanding of the topics covered by the curriculum. In the previous years we did several training sessions for incident response, distributed tracing, and alerting strategies. These were ad hoc engagements when teams requested our support. With the advent of the pandemic, many things changed and we had to adapt. Those training sessions were one of those things. The format for those sessions was based on having them in person. We did try to do some via video conference, but it did not have quite the same result. At the same time, the company’s Tech Academy was facing the same challenges. We grouped together to develop a new series of training sessions in a new format. The deliverables of this new format were a video and a quiz for each topic, with the content of each training being created and reviewed by subject matter experts to ensure a common understanding and a high quality training. This way we captured knowledge that could be consumed by anyone in the company at any time and at their own pace. Also, by making those training sessions part of the onboarding process, any engineer joining Zalando would get an introduction to some of the SRE practices we were rolling out.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Curriculum recording" src="https://engineering.zalando.com/posts/2021/10/images/studio-picture.jpeg#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;The studio where we recorded some of the training sessions&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The support of the SRE Enablement team is still in high demand for ad hoc projects. After another collaboration between SRE and the Checkout teams, the senior management of that department officially pitched for the creation of an &lt;strong&gt;Embedded SRE team&lt;/strong&gt;. This is something we had in the back of our minds for further down the road. But to have it requested by another department was an interesting development. In any case, here we were. This development presented quite a few new challenges (and opportunities):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What will the team work on? What will its responsibilities be?&lt;/li&gt;
&lt;li&gt;Who will the team report to?&lt;/li&gt;
&lt;li&gt;Is this time bound? Or is it a permanent setup?&lt;/li&gt;
&lt;li&gt;If they report to separate departments, how will they review the collaboration? Or how do we do performance evaluation effectively for SREs working in a different department?&lt;/li&gt;
&lt;li&gt;How will the embedded SRE team collaborate with the product development team?&lt;/li&gt;
&lt;li&gt;How will the embedded team keep in sync with the central team?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Embedded team will report to the SRE department, and &lt;strong&gt;both SRE and product area management have aligned on a set of KPIs&lt;/strong&gt; like Availability and On-call Health. The former will be dictated by the SLOs defined for that product area, while the latter aims at making sure the operational load is not taking its toll on the product development team. On-call Health will be measured taking into account paging alerts and how often an individual is on-call.&lt;/p&gt;
&lt;p&gt;We’re still figuring out most things as we go along, but this is an exciting development. This team will be different from the Enablement team, in the sense that it will have a much more concrete scope. This team will be able to be more hands-on on the code and tooling used within the product development team. It will be a voice for reliability within that product area, able to influence the prioritization of topics which ensure a reliable customer experience in our Fashion Store. The &lt;strong&gt;SRE department will also benefit from having a source providing precious feedback&lt;/strong&gt; on whatever the department is trying to roll out to the wider engineering community.&lt;/p&gt;
&lt;p&gt;You may remember that in our &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html"&gt;last article&lt;/a&gt; we mentioned that hiring was always a challenge (a topic you can also read about in the experience of other companies that rolled out SRE). Now we’re planning to bootstrap another team, so that cannot be making things any easier. But the truth is that having a department with teams which were different in nature also had an &lt;strong&gt;unexpected benefit in our hiring&lt;/strong&gt;. Before, our capacity constraints prevented us from hiring anyone who wasn't a good fit for the original position with the plan to further develop those people and establish the SRE mindset. Now a candidate with potential can join one of the teams in the department, and from there grow into the SRE role. Whether later they join the SRE Enablement team or not is not that important (although team rotation is something that is quite active in Zalando). Any team can benefit from having someone with the SRE mindset. Also, &lt;strong&gt;we strive for close collaboration within the department&lt;/strong&gt;, so it’s not like engineers are isolated in their respective teams.&lt;/p&gt;
&lt;p&gt;And this is it, mostly. You are all caught up with how SRE has been adopted in Zalando, and what we’ve been up to. And what a ride it has been! Attempting to create a full SRE organization, later starting with a single central team, reaching the limits of that team, creating a department, further growing that department with an embedded SRE team… Were we 100% successful? No (also, SREs don’t believe in 100%). But we’ve done the Postmortem where we failed, and the learnings we got from there turned into action items in our strategy. This has been working really well for us, but there’s still so much to do. There are many interesting ways that SRE can develop into, so we’re really excited to see what challenges we’ll get next. Until we reach our next stage of evolution, we’ll keep doing what we do best: dealing with ambiguity and uncertainty. And help Zalando ensure customers can buy fashion reliably!&lt;/p&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Tuning Image Classifiers using Human-In-The-Loop</title><link href="https://engineering.zalando.com/posts/2021/10/tuning-image-classifiers-using-human-in-the-loop.html" rel="alternate"/><published>2021-10-13T00:00:00+02:00</published><updated>2021-10-13T00:00:00+02:00</updated><author><name>Paul O'Grady</name></author><id>tag:engineering.zalando.com,2021-10-13:/posts/2021/10/tuning-image-classifiers-using-human-in-the-loop.html</id><summary type="html">&lt;p&gt;We present an Expectation–Maximization (EM) algorithm for iteratively estimating the optimal class-confidence threshold for an image classifier using human annotators. 
The algorithm is developed for classifiers that are applied to out-of-distribution images, and efficiently constructs a validation data set to estimate an optimal threshold for this use case.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this blog post we describe an algorithm we developed when building our product image analysis infrastructure, where we use human-in-the-loop to tune the thresholds of our image classifiers. We discuss the algorithm in the following, and present some mathematical details and a simple code example in the appendices.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;When a customer browses for a product on the Zalando website they may use descriptive terms to search for what they want. For example, a customer may use a specific term such as &lt;a href="https://en.zalando.de/women/?q=leopard+print+dress"&gt;&lt;em&gt;leopard print dress&lt;/em&gt;&lt;/a&gt; instead of a more generic term such as casual dress. One approach we use to support product search using descriptive terms is to automatically generate additional product information from product images using computer vision techniques. In particular, we train image classifiers to identify products that have a particular fashion attribute such as a specific pattern or style, e.g. leopard print, which corresponds to a descriptive search term.&lt;/p&gt;
&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;A typical image classifier generates a &lt;em&gt;class-confidence score&lt;/em&gt; (a value between 0 &amp;amp; 1) at its output to indicate that a given input image belongs to one of the specified output classes, i.e., the image shows a particular fashion attribute. To generate a binary decision from the classifier output a &lt;em&gt;class-confidence threshold&lt;/em&gt; parameter is selected based on a classifier performance metric such as &lt;a href="https://en.wikipedia.org/wiki/Precision_and_recall"&gt;&lt;em&gt;precision&lt;/em&gt; &amp;amp; &lt;em&gt;recall&lt;/em&gt;&lt;/a&gt;. Once the threshold has been selected the model can be deployed and used to generate class labels for an input image, which can be used in product search.&lt;/p&gt;
&lt;p&gt;Over time the characteristics of the input product images may change, leading to a drift in the input &lt;em&gt;data distribution&lt;/em&gt;. For image classifiers that are used to generate predictions for &lt;em&gt;out-of-distribution&lt;/em&gt; input images the performance of the classifier may degrade. For example this may occur when an image classifier is trained on Zalando product images before the introduction of a new photography style on a revamped Zalando website, for which there are no annotated image examples in the new style available to retrain the model.&lt;/p&gt;
&lt;p&gt;To solve this problem we modify the class-confidence threshold of the classifier to compensate for data distribution drift, and have developed an &lt;a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm"&gt;Expectation-Maximization (EM) algorithm&lt;/a&gt; that we call &lt;em&gt;AutoThreshold&lt;/em&gt; for this purpose. AutoThreshold estimates an &lt;em&gt;optimal class-confidence threshold&lt;/em&gt; for an image classifier using manual annotations from a selection of the classifier's predictions on the out-of-distribution data. Additionally, the process of creating annotations for the out-of-distribution data helps in the generation of a new data set that can be used to train a new version of the image classifier.&lt;/p&gt;
&lt;h3&gt;Selecting Classifier Thresholds&lt;/h3&gt;
&lt;p&gt;The optimal threshold value for an image classifier is the class-confidence score, a value between 0 &amp;amp; 1, for which the set of predictions above that score leads to optimal classifier performance. Ideally this value would be 0.5, i.e., the center of the range. In practice, however, this is rarely the case, and the threshold is usually estimated after training to achieve the best results.&lt;/p&gt;
&lt;p&gt;The estimated optimal threshold for each output class of an image classifier is evaluated using an annotated image data set, i.e. validation set, where each image in the set is manually assigned a class label. The image classifier is tested by using the validation data set as input and comparing the classifier's predictions to the manually assigned labels. We can measure classifier performance using metrics such as precision &amp;amp; recall, which indicate the quality and quantity of the results. Optimizing the threshold is usually a tradeoff between precision &amp;amp; recall, where we want to find a threshold value that results in an &lt;em&gt;acceptable&lt;/em&gt; score for both. Typically, a performance metric that combines both precision and recall, such as the &lt;a href="https://en.wikipedia.org/wiki/F-score"&gt;&lt;span class="math"&gt;\(f_\beta\)&lt;/span&gt;-measure&lt;/a&gt;, is used, and the class-confidence score that maximizes the metric is chosen as the threshold value.&lt;/p&gt;
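&lt;p&gt;The selection procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than our production code: the function names &lt;code&gt;f_beta&lt;/code&gt; and &lt;code&gt;best_threshold&lt;/code&gt; and the toy scores and labels are assumptions made for this example, and the sketch assumes distinct scores and a prediction of &lt;em&gt;true&lt;/em&gt; for every score at or above the threshold.&lt;/p&gt;

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; 0.0 when the score is undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores, labels, beta=1.0):
    """Return (threshold, metric value) maximizing F-beta on a validation set.

    Every observed score is tried as a candidate threshold. Items at or above
    the candidate are predicted true. Assumes the scores are distinct.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    total_pos = sum(label for _, label in pairs)
    candidates = []
    pos_below = 0  # positive labels that fall below the current candidate
    for k, (score, label) in enumerate(pairs):
        tp = total_pos - pos_below
        fp = (len(pairs) - k) - tp
        candidates.append((f_beta(tp, fp, pos_below, beta), score))
        pos_below += label
    value, threshold = max(candidates)  # ties resolved toward the higher score
    return threshold, value


# Toy validation set (illustrative values only).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
print(best_threshold(scores, labels, beta=1.0))
print(best_threshold(scores, labels, beta=0.5))
```

&lt;p&gt;On this toy data, weighting precision and recall equally (&lt;code&gt;beta=1.0&lt;/code&gt;) selects a threshold of 0.3, whereas favoring precision (&lt;code&gt;beta=0.5&lt;/code&gt;) moves the threshold up to 0.6, which illustrates the tradeoff controlled by the metric.&lt;/p&gt;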
&lt;h3&gt;Estimating Thresholds in the Absence of Data&lt;/h3&gt;
&lt;p&gt;For our use case there exists no training or validation data set for the out-of-distribution input image set. Furthermore, we do not annotate all images in advance, as this would be a costly, and time consuming, exercise for the scale of the data at Zalando (currently around &lt;a href="https://en.zalando.de/catalogue/"&gt;600k products&lt;/a&gt;). To overcome these issues we make use of the simple fact that when classifier predictions are &lt;em&gt;ordered&lt;/em&gt; by class-confidence score&amp;mdash;for a well trained image classifier&amp;mdash;high-confidence class predictions exhibit greater correspondence with the image annotations than low-confidence predictions, which indicates model performance, and allows us to search for an optimal threshold between both extremes (demonstrated in the &lt;a href="#annotation_plot"&gt;plot below&lt;/a&gt;). With this in mind, we frame threshold selection as an optimization problem using manual annotators, who generate annotations to be used in the metric calculations required to estimate a threshold.&lt;/p&gt;
&lt;p&gt;Specifically, we take an iterative approach, where images to be annotated are conditioned on the image classifier, and annotators annotate a subset of the classifier's most confident predictions first. The generated annotations are used to estimate a threshold using our selected performance metric, and the process is repeated until our estimated threshold converges. This process can be implemented as an Expectation-Maximization algorithm, and describes a &lt;em&gt;human-in-the-loop&lt;/em&gt; procedure, which generates a validation data set for the out-of-distribution data over a number of iterations. Furthermore, the data set is generated in an efficient way, both in terms of the number of annotations required, and the selection of image examples which contribute most to discovery of an optimal threshold.&lt;/p&gt;
&lt;h2&gt;Problem Definition&lt;/h2&gt;
&lt;p&gt;Taking a binary image classifier as our motivating example, which typically has a sigmoid output layer, the value generated at the output for each of the &lt;span class="math"&gt;\(n\)&lt;/span&gt; input images can be interpreted as a class-confidence score, or probability &lt;span class="math"&gt;\(p_{i}\)&lt;/span&gt;, that an input image, &lt;span class="math"&gt;\(\mathbf{x}_i\)&lt;/span&gt;, belongs to the output class, &lt;span class="math"&gt;\(c\)&lt;/span&gt;. For the purposes of image attribute identification, the predictions at the output, &lt;span class="math"&gt;\(\mathbf{p} =[p_{1},\dots,p_{n}]\)&lt;/span&gt;, undergo a thresholding operation to replace the class-confidence scores with a binary class label, which indicates a transform from a continuous to categorical probability distribution. Since the output layer is a sigmoid function, where output values are thresholded by the parameter &lt;span class="math"&gt;\(t\)&lt;/span&gt; into two binary categories, &lt;em&gt;true&lt;/em&gt; &amp;amp; &lt;em&gt;false&lt;/em&gt;, we can model the classifier's output distribution using a &lt;a href="https://en.wikipedia.org/wiki/Bernoulli_distribution"&gt;&lt;em&gt;Bernoulli distribution&lt;/em&gt;&lt;/a&gt;, i.e., &lt;span class="math"&gt;\(P(\mathbf{x}_i=c | p_{i})\)&lt;/span&gt;. Furthermore, the distribution of annotations also follows a Bernoulli distribution. Using these details, we frame the problem of threshold estimation within the framework of the Expectation-Maximization algorithm, where we present algorithm details below, and present a more detailed mathematical explanation in Appendix A.&lt;/p&gt;
&lt;h3&gt;Threshold Estimation Using the EM Algorithm&lt;/h3&gt;
&lt;p&gt;The Expectation-Maximization algorithm is an iterative method to find &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation"&gt;&lt;em&gt;maximum likelihood&lt;/em&gt;&lt;/a&gt; estimates of parameters (such as our classifier threshold) in the presence of &lt;em&gt;unobserved latent variables.&lt;/em&gt; In our problem setting, the predictions made by the classifier are observed by our annotators to generate image annotations. However, the order of the images presented to the annotators is conditioned on the classifier's class-confidence score, which is unknown to our annotators. As mentioned, the estimated optimal threshold corresponds to a class-confidence score, and thus our latent variable allows us to estimate an optimal threshold for our classifier. Each iteration of the EM algorithm alternates between performing an Expectation step (E-step), which constructs a likelihood function to estimate the latent variable, and a Maximization step (M-step), which computes parameters that maximize the function constructed in the E-step. For our algorithm, the E-step generates annotations for the classifier's most confident predictions and the M-step estimates the optimal class-confidence threshold using the new set of annotations. Both steps are repeated at each iteration until the estimated threshold converges.&lt;/p&gt;
&lt;h2&gt;Algorithm Details - Binary Classifier&lt;/h2&gt;
&lt;p&gt;For a set of images, &lt;span class="math"&gt;\(\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]\)&lt;/span&gt;, and their class-confidence scores, &lt;span class="math"&gt;\(\mathbf{p}\)&lt;/span&gt;, we construct a set of images ordered by their scores, &lt;span class="math"&gt;\(\mathbf{X}_{\tt asc} = {\tt sort}(\mathbf{X},\mathbf{p})\)&lt;/span&gt;, to estimate the optimal threshold, &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt;, for the output class. We use &lt;span class="math"&gt;\(\mathbf{X}_{\tt asc}\)&lt;/span&gt; as input to the AutoThreshold algorithm, and specify a number of hyperparameters including the subset window size &lt;span class="math"&gt;\(m\)&lt;/span&gt;, and classifier performance metric &lt;span class="math"&gt;\({\tt metric}(.)\)&lt;/span&gt; (e.g., &lt;span class="math"&gt;\(f_{\beta}\)&lt;/span&gt;-measure). We define a data windowing function that selects images to be annotated by centering a window of size &lt;span class="math"&gt;\(m\)&lt;/span&gt; on &lt;span class="math"&gt;\(\mathbf{X}_{\tt asc}\)&lt;/span&gt; at a position that corresponds to the current threshold estimate (class-confidence score), i.e., &lt;span class="math"&gt;\(\mathbf{X}_{\tt subset} = {\tt window}(\mathbf{X}_{\tt asc}, \hat{t}, m)\)&lt;/span&gt;. We denote the associated predictions for the windowed subset as &lt;span class="math"&gt;\(\mathbf{p}_{\tt subset}\)&lt;/span&gt;, and denote the annotations generated for this set as &lt;span class="math"&gt;\(\mathbf{a}_{\tt subset}\)&lt;/span&gt;.
Furthermore, we define a thresholding function &lt;span class="math"&gt;\({\tt threshold}(\mathbf{p}_{\tt subset}, t)\)&lt;/span&gt;, which generates true and false class labels from model predictions to be used as input to the performance metric.&lt;/p&gt;
&lt;p&gt;The EM algorithm is outlined below:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Specify hyperparameters &lt;span class="math"&gt;\(m\)&lt;/span&gt; &amp;amp; &lt;span class="math"&gt;\({\tt metric}\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Initialize the current threshold estimate &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt; to the maximum class-confidence score, i.e. 1&lt;/li&gt;
&lt;li&gt;&lt;em&gt;E-step:&lt;/em&gt; Generate a new subset of manual annotations, &lt;span class="math"&gt;\(\mathbf{a}_{\tt subset}\)&lt;/span&gt;, for the selected images, &lt;span class="math"&gt;\(\mathbf{X}_{\tt subset} = {\tt window}(\mathbf{X}_{\tt asc}, \hat{t}, m)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;M-step:&lt;/em&gt; Compute a new threshold estimate which corresponds to the maximum metric value for the new set of annotations, &lt;span class="math"&gt;\(\hat{t} = {\underset {t} {\operatorname {argmax} }} \ \, {\tt metric}(\mathbf{a}_{\tt subset}, {\tt threshold}(\mathbf{p}_{\tt subset}, t))\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Return to step 3 until convergence&lt;/li&gt;
&lt;/ol&gt;
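&lt;p&gt;The steps above can be sketched as a short loop in plain Python. This is a hedged illustration under several assumptions, not the implementation from Appendix B: the &lt;code&gt;annotate&lt;/code&gt; callback is a hypothetical stand-in for the human annotators, the metric is fixed to the F1 score, annotations are accumulated across iterations (as described in Appendix A), and convergence is taken to mean that the estimate no longer changes.&lt;/p&gt;

```python
from bisect import bisect_left


def f1(tp, fp, fn):
    """F1 score from confusion counts; 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auto_threshold(scores, annotate, m, max_iter=100):
    """Sketch of the E-step / M-step loop.

    scores   -- class-confidence scores in ascending order
    annotate -- hypothetical callback returning the 0/1 label of item i
    m        -- subset window size
    Returns the threshold estimate and the number of annotations used.
    """
    annotations = {}  # item index mapped to its 0/1 annotation
    t_hat = 1.0       # step 2: start from the maximum class-confidence score
    for _ in range(max_iter):
        # E-step: annotate the window of up to m items centered on t_hat.
        center = bisect_left(scores, t_hat)
        start = max(0, center - m // 2)
        stop = min(len(scores), center + m // 2)
        for i in range(start, stop):
            if i not in annotations:
                annotations[i] = annotate(i)
        if not annotations:
            break
        # M-step: pick the annotated score that maximizes the metric.
        indices = sorted(annotations)
        total_pos = sum(annotations.values())
        candidates = []
        pos_below = 0
        for rank, i in enumerate(indices):
            tp = total_pos - pos_below
            fp = (len(indices) - rank) - tp
            candidates.append((f1(tp, fp, pos_below), scores[i]))
            pos_below += annotations[i]
        _, new_t = max(candidates)
        if new_t == t_hat:  # converged: the estimate no longer moves
            break
        t_hat = new_t
    return t_hat, len(annotations)


# Noise-free toy data: items with a score of 0.6 or more are true positives.
scores = [i / 100 for i in range(100)]
labels = [0] * 60 + [1] * 40
t_hat, n_annotated = auto_threshold(scores, lambda i: labels[i], m=20)
print(t_hat, n_annotated)
```

&lt;p&gt;On this noise-free toy data the loop settles on a threshold of 0.6 after annotating 50 of the 100 items, since the window only moves down through the scores until it has straddled the boundary between positive and negative examples.&lt;/p&gt;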
&lt;h3&gt;Practical Details&lt;/h3&gt;
&lt;p&gt;Below are some practical details on the operation of the algorithm:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Note that &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt; can be initialized to any value between 0 &amp;amp; 1. If a good initial estimate is available it can be used to initialize the algorithm; if not, initializing to 1 is a good choice. Also note that when initializing to the maximum, due to edge effects, the windowing function will only capture the &lt;span class="math"&gt;\(m/2\)&lt;/span&gt; examples beneath &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The EM algorithm typically converges to a local optimum. For our use case there is a global optimum, and we have observed (for a suitably selected subset size) very good convergence and results with this approach.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Note that the algorithm operates on subsets of the unannotated data. As such, the number of available unannotated images, &lt;span class="math"&gt;\(n\)&lt;/span&gt;, could grow as the algorithm runs, so &lt;span class="math"&gt;\(n\)&lt;/span&gt; is not required to be fixed. Furthermore, the number of required annotations (and hence algorithm iterations) will depend on the metric and subset size chosen.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, for a multilabel classifier, where the output classes, &lt;span class="math"&gt;\(\mathbf{c}= [c_1,\dots,c_k]\)&lt;/span&gt;, are independent but not mutually exclusive of each other, the above algorithm can be performed for each class separately, where the task is to estimate &lt;span class="math"&gt;\(\hat{t}_j\)&lt;/span&gt; for each of the &lt;span class="math"&gt;\(j=1,\ldots,k\)&lt;/span&gt; classes.&lt;/p&gt;
&lt;h2&gt;Threshold Estimation Example&lt;/h2&gt;
&lt;p&gt;Below we present an annotation plot for a run of our EM algorithm for a &lt;a href="https://en.zalando.de/women/?q=leopard+print+dress"&gt;&lt;em&gt;leopard print&lt;/em&gt;&lt;/a&gt; image classifier, which is a binary classifier and has a single class output. The middle subplot presents the annotations for the images sent to a crowdsourcing platform, ordered in ascending class-confidence score (as illustrated by the orange curve), where positively labeled images are indicated at the top of the subplot by blue dashes and negatively labeled images are indicated at the bottom of the subplot by purple dashes. We can see that for high confidence predictions there are many positive annotations with few negative annotations, illustrating that the classifier is performing well. However there is a point at which the occurrence of positive labels is frequently punctuated by negative annotations, illustrating that the classifier performs poorly beyond this point. We can see from the subplot that the threshold estimated by the EM algorithm (as indicated by the black dot) is positioned just before the classifier begins to perform poorly, which demonstrates the algorithm's usefulness in estimating an optimal class-confidence threshold. Furthermore, the annotation density subplot indicates a natural separation between the cluster of positive and negative annotations, and the estimated threshold corresponds to this also.&lt;/p&gt;
&lt;div id="annotation_plot"&gt;&lt;/div&gt;
&lt;p&gt;&lt;img alt="leopard print annotations analysis" src="https://engineering.zalando.com/posts/2021/10/images/leopard_print_annotations.png"&gt;&lt;/p&gt;
&lt;p&gt;To illustrate further we present a &lt;em&gt;slope plot&lt;/em&gt; below, where we generate a cumulative sum of annotations and examine the slope of the curve, where annotations are assigned values 1 &amp;amp; 0 for positive and negative labels respectively, and are ordered by the class-confidence scores generated by the classifier (as was the case in the previous plot). The resultant plot is piecewise linear, where flat-line segments in the curve above the threshold represent consecutive False Positives, whereas those beneath the threshold represent consecutive True Negatives. Conversely, sloped-line segments in the curve above the threshold represent consecutive True Positives, whereas those beneath the threshold represent consecutive False Negatives. For our purposes we would like the curve above the threshold to have a slope as close to 1 as possible, and on average to have a steeper slope above the threshold than beneath it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="leopard print slope plot" src="https://engineering.zalando.com/posts/2021/10/images/leopard_print_slope_plot.png"&gt;&lt;/p&gt;
&lt;p&gt;In the slope plot we observe the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There are many long sloped-line segments above the threshold, whereas there are few beneath the threshold&lt;/li&gt;
&lt;li&gt;There are many long flat-line segments beneath the threshold, whereas there are few above the threshold&lt;/li&gt;
&lt;li&gt;The slope on average above the threshold is steeper than beneath it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, for the leopard print image classifier predictions, we see that the threshold estimated by the AutoThreshold algorithm successfully identifies an appropriate class-confidence threshold.&lt;/p&gt;
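&lt;p&gt;The quantities read off the slope plot can also be computed directly. The sketch below is an illustration only: the function name &lt;code&gt;slope_summary&lt;/code&gt; and the toy annotations are assumptions made for this example. It relies on the fact that the average slope of a cumulative sum over a segment is the fraction of positive annotations in that segment, so the average slope above the threshold equals the precision.&lt;/p&gt;

```python
from itertools import accumulate


def mean(values):
    """Arithmetic mean; 0.0 for an empty segment."""
    return sum(values) / len(values) if values else 0.0


def slope_summary(annotations, threshold_index):
    """Cumulative sum of 0/1 annotations and the average slope on each side.

    annotations     -- labels ordered by ascending class-confidence score
    threshold_index -- position of the threshold in that ordering
    """
    cumulative = list(accumulate(annotations))
    slope_below = mean(annotations[:threshold_index])
    slope_above = mean(annotations[threshold_index:])
    return cumulative, slope_below, slope_above


# Toy annotations (illustrative values only).
annotations = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
cumulative, below, above = slope_summary(annotations, threshold_index=5)
print(cumulative, below, above)
```

&lt;p&gt;For these toy annotations the average slope is 0.2 beneath the threshold and 0.8 above it, mirroring the pattern observed in the plot.&lt;/p&gt;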
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We have presented a novel algorithm for the task of optimal threshold estimation for an image classifier that is applied to out-of-distribution data, where an EM algorithm and human-in-the-loop is used to generate annotations for the out-of-distribution data, which are used to calculate a threshold to compensate for the difference in distributions. The algorithm is simple to implement, and is efficient in terms of the number of annotated image examples required to estimate an optimal threshold.&lt;/p&gt;
&lt;p&gt;In future work, we will explore using the EM algorithm and human-in-the-loop to train a classifier in the context of &lt;a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)"&gt;&lt;em&gt;active learning&lt;/em&gt;&lt;/a&gt;, i.e., the case where there is no annotated data set to train a classifier.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;If you would like to work on similar problems, consider joining our &lt;a href="https://jobs.zalando.com/en/tech/jobs/?filters%5Bcategories%5D%5B0%5D=Product%20Design%20%26%20User%20Research&amp;amp;filters%5Bcategories%5D%5B1%5D=Applied%20Science&amp;amp;filters%5Bcategories%5D%5B2%5D=Software%20Engineering&amp;amp;filters%5Bcategories%5D%5B3%5D=Product%20Management%20%28Technology%29&amp;amp;search=machine%20learning"&gt;Data Science teams!&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Appendix A: Mathematical Details&lt;/h2&gt;
&lt;p&gt;Below we provide further details on the presented algorithm's interpretation as an &lt;a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm"&gt;Expectation-Maximization (EM) algorithm&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;EM Algorithm Description&lt;/h3&gt;
&lt;p&gt;Using standard notation, the EM algorithm can be described as follows: For a set of observed data &lt;span class="math"&gt;\(\mathbf{X}\)&lt;/span&gt; generated from a statistical model with unknown parameters &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt;, and a set of latent variables &lt;span class="math"&gt;\(\mathbf{Z}\)&lt;/span&gt;, which are unobserved but affect the distribution of the data nonetheless, we estimate the values for &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt; by maximizing the marginal likelihood of the observed data,&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\({\displaystyle L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} \mid {\boldsymbol {\theta }})=\int p(\mathbf {Z} \mid \mathbf {X} ,{\boldsymbol {\theta }})p(\mathbf {X} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} }\)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;i.e., we generate a &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation"&gt;maximum likelihood estimate (MLE)&lt;/a&gt; for &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt;. However, this quantity is often intractable since &lt;span class="math"&gt;\(\mathbf {Z}\)&lt;/span&gt; is unobserved and its distribution is unknown before obtaining &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The EM algorithm seeks to overcome this issue, and finds the MLE of the marginal likelihood by iteratively maximizing a specified &lt;span class="math"&gt;\(Q\)&lt;/span&gt; function, which is defined as the expected value of the log likelihood function of &lt;span class="math"&gt;\({\boldsymbol {\theta }}\)&lt;/span&gt;, i.e., &lt;span class="math"&gt;\(Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})=\operatorname {E} _{\mathbf {Z} \mid \mathbf {X} ,{\boldsymbol {\theta }}^{(t)}}\left[\log L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )\right]\,\)&lt;/span&gt;, where the superscript &lt;span class="math"&gt;\((t)\)&lt;/span&gt; denotes the iteration index (not the classifier threshold). The &lt;span class="math"&gt;\(Q\)&lt;/span&gt; function is maximized over two steps: In the first step&amp;mdash;the E-step&amp;mdash;the data-dependent parameters of the &lt;span class="math"&gt;\(Q\)&lt;/span&gt; function are calculated, while in the second step&amp;mdash;the M-step&amp;mdash;we seek to maximize the function constructed in the E-step over the parameters &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt;, where the value that achieves the maximum is our new estimate, &lt;span class="math"&gt;\(\boldsymbol {\theta }^{(t+1)}\)&lt;/span&gt;.&lt;/p&gt;
&lt;h3&gt;AutoThreshold as an EM Algorithm&lt;/h3&gt;
&lt;p&gt;Using the above notation and translating to our algorithm description, our observations, &lt;span class="math"&gt;\(\mathbf{X}\)&lt;/span&gt;, are the vector of annotations generated by human-in-the-loop, &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;; our unobserved latent variables, &lt;span class="math"&gt;\(\mathbf{Z}\)&lt;/span&gt;, are the ordered classifier predictions used to generate &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;, i.e. &lt;span class="math"&gt;\(\mathbf{p}\)&lt;/span&gt;; and the unknown model parameters, &lt;span class="math"&gt;\({\boldsymbol {\theta }}\)&lt;/span&gt;, are defined by the statistical model used to generate &lt;span class="math"&gt;\(\mathbf{X}\)&lt;/span&gt;, which in our case is the &lt;em&gt;Bernoulli distribution&lt;/em&gt;, as the annotators answer a yes-no question when generating annotations for our image data set. For the Bernoulli distribution, there is a single model parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt;, which is simply the probability that an observation will be true.&lt;/p&gt;
&lt;p&gt;For our use case, where we estimate a class-confidence threshold, &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt;, for an image classifier in order to generate binary predictions, the parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt; has a direct correspondence, which can be explained as follows: For an ideal image classifier with perfect accuracy applied to a balanced data set (i.e., a data set with an equal number of true and false examples) the output distribution of the class labels will be uniform and the parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt; will be 0.5, as all predictions will be correct, and a true or false outcome will have equal probability as the observations are balanced. Similarly, in the ideal case the sigmoid units at the output will be perfectly normalized and the class-confidence threshold used to assign predictions to categories will also be 0.5 (as is the standard assumption with logistic regression analysis etc.). Also, 0.5 corresponds to the sample mean of the observed predictions (where true &amp;amp; false are represented numerically by 1 &amp;amp; 0) which is the &lt;em&gt;MLE&lt;/em&gt; for the parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt;.&lt;/p&gt;
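&lt;p&gt;As a brief reminder of why the sample mean is the MLE here (a standard result, written in the notation of this appendix and assuming &lt;span class="math"&gt;\(n\)&lt;/span&gt; independent annotations &lt;span class="math"&gt;\(a_i \in \{0,1\}\)&lt;/span&gt;), the Bernoulli log likelihood is&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\({\displaystyle \log L(p;\mathbf{a})=\sum_{i=1}^{n}\left[a_i\log p+(1-a_i)\log(1-p)\right]}\)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;and setting its derivative with respect to &lt;span class="math"&gt;\(p\)&lt;/span&gt; to zero, i.e., &lt;span class="math"&gt;\({\displaystyle \frac{1}{p}\sum_{i=1}^{n}a_i-\frac{1}{1-p}\sum_{i=1}^{n}(1-a_i)=0}\)&lt;/span&gt;, gives &lt;span class="math"&gt;\({\displaystyle \hat{p}=\frac{1}{n}\sum_{i=1}^{n}a_i}\)&lt;/span&gt;, the fraction of observations that are true.&lt;/p&gt;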
&lt;h3&gt;Known Unknowns&lt;/h3&gt;
&lt;p&gt;As we move away from the ideal case where the data may not be balanced or the image classifier may exhibit errors, the parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt; and threshold &lt;span class="math"&gt;\(t\)&lt;/span&gt; deviate from 0.5 and both become unknown (but still remain in the range from 0 to 1), since the classifier's output class distribution, &lt;span class="math"&gt;\(\mathbf{y}\)&lt;/span&gt;, becomes unknown. However, a direct correspondence between the two parameters remains. To overcome this issue, and estimate an appropriate value for &lt;span class="math"&gt;\(t\)&lt;/span&gt; using a known distribution, i.e., &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt;, we generate a validation data set, i.e., a set of manually annotated images, and test the image classifier by generating class predictions for the images and then comparing them against the image annotations. The goal is to estimate a value for &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt; that will generate a class label output distribution, &lt;span class="math"&gt;\(\mathbf{y}\)&lt;/span&gt;, as close as possible to &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;However, as already discussed in this article, there are additional practical considerations when evaluating the performance of an image classifier such as precision &amp;amp; recall, and simply comparing annotations to class predictions to determine performance may not lead to the selection of a useful classifier. To choose a suitable image classifier, the effect of the class-confidence threshold itself must be considered, which leads to a meta-labeling of the model's class predictions using the annotations in the validation data set. In particular, all positively annotated images that are correctly classified are known as True Positives (TP), whereas those that are incorrectly classified are known as False Negatives (FN). Conversely, all negatively annotated images that are correctly classified are known as True Negatives (TN), whereas those that are incorrectly classified are known as False Positives (FP).&lt;/p&gt;
&lt;p&gt;Using these four categories of class prediction, a performance metric can indicate how close an image classifier's class output distribution is to the validation data set, while also giving an indication of the classifier's performance when it comes to precision &amp;amp; recall.&lt;/p&gt;
&lt;h3&gt;Averages Over Categories&lt;/h3&gt;
&lt;p&gt;As mentioned above, an important component of the EM algorithm is how to calculate the maximum likelihood estimate for the unknown parameter &lt;span class="math"&gt;\(\boldsymbol{\theta}\)&lt;/span&gt;. For our use case where the observations are generated by a Bernoulli distribution, the MLE for the parameter &lt;span class="math"&gt;\(p\)&lt;/span&gt; is the sample mean. However, as discussed above, for our use case we must also consider precision &amp;amp; recall, which necessitates the use of a performance metric to determine a class-confidence threshold that optimizes &lt;span class="math"&gt;\(p\)&lt;/span&gt; with respect to the validation data set. Fortunately, performance metrics such as precision &amp;amp; recall can be interpreted as &lt;em&gt;averages over categories&lt;/em&gt;, which provides a direct connection to the MLE for &lt;span class="math"&gt;\(p\)&lt;/span&gt;. For example, recall can be considered an average over the meta-labeled positive annotations TP &amp;amp; FN, i.e., recall = TP/(TP+FN); while precision can be considered an average over the meta-labeled annotations above the threshold, i.e., precision = TP/(TP+FP). Furthermore, as discussed, precision and recall may be combined to create a performance metric such as the &lt;span class="math"&gt;\(f_\beta\)&lt;/span&gt;-measure; such derived performance metrics also perform averaging over the values for precision &amp;amp; recall. In summary, for a chosen performance metric, the optimal value for &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt; has the effect of generating a Bernoulli distribution &lt;span class="math"&gt;\(\mathbf{y}\)&lt;/span&gt; which is as close as possible to &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;, and also specifies a level of control over precision and recall.&lt;/p&gt;
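&lt;p&gt;For reference, the standard definition of the &lt;span class="math"&gt;\(f_\beta\)&lt;/span&gt;-measure makes this averaging explicit: it is a weighted harmonic mean of precision and recall, which can equally be written in terms of the meta-labeled counts,&lt;/p&gt;
&lt;p&gt;&lt;span class="math"&gt;\({\displaystyle f_\beta=(1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision}+\text{recall}}=\frac{(1+\beta^2)\,\text{TP}}{(1+\beta^2)\,\text{TP}+\beta^2\,\text{FN}+\text{FP}}}\)&lt;/span&gt;,&lt;/p&gt;
&lt;p&gt;where &lt;span class="math"&gt;\(\beta=1\)&lt;/span&gt; weights precision and recall equally, and smaller values of &lt;span class="math"&gt;\(\beta\)&lt;/span&gt; place more weight on precision.&lt;/p&gt;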
&lt;h3&gt;Optimization Loop&lt;/h3&gt;
&lt;p&gt;Now that we have described how the AutoThreshold algorithm fits within the framework of the EM algorithm, we will provide further detail on the algorithm's optimization loop.&lt;/p&gt;
&lt;p&gt;At each iteration, the number of items in &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;, and their corresponding &lt;span class="math"&gt;\(\mathbf{p}\)&lt;/span&gt;, increases by our specified window size, &lt;span class="math"&gt;\(m\)&lt;/span&gt;, which increases the amount of data available to calculate our specified performance metric, &lt;span class="math"&gt;\({\tt metric}(.)\)&lt;/span&gt;, and also increases the number of candidate values for &lt;span class="math"&gt;\(\hat{t}\)&lt;/span&gt;. We increase the available observations in the E-step (by generating new annotations from our most confident predictions) and maximize the performance metric in the M-step to estimate the optimal threshold. Here the E-step is arguably the most important, since it generates the required validation data set; the original problem, after all, is to generate a sufficient number of annotations for an unannotated data set to estimate a threshold. Furthermore, in the E-step, we increase the available observations using a suitably large subset size until the algorithm converges, which allows us to minimize the overall number of annotations needed to estimate an optimal threshold, which is what we wish to achieve with this algorithm.&lt;/p&gt;
&lt;h3&gt;Finally&lt;/h3&gt;
&lt;p&gt;To conclude we present some other interesting points to consider about this algorithm:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For this use case we apply the EM algorithm to a discrete probability distribution using categorical observations, i.e., annotations. Typically EM is applied to problems where observations are drawn from a continuous probability distribution, such as the Gaussian distribution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For this use case we have our latent variables, &lt;span class="math"&gt;\(\mathbf{p}\)&lt;/span&gt;, before we obtain our observations, &lt;span class="math"&gt;\(\mathbf{a}\)&lt;/span&gt;. This is the reverse of the standard implementation of EM, and illustrates the flexibility of the EM algorithm's two-step learning iteration when applied to human-in-the-loop.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For this use case we have human-generated observations, where usually the EM algorithm is applied to &lt;a href="https://asp-eurasipjournals.springeropen.com/articles/10.1155/2008/784296"&gt;sensor observations&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Appendix B: AutoThreshold Python Implementation&lt;/h2&gt;
&lt;p&gt;Below we present a simple code implementation of the AutoThreshold algorithm applied to a binary classification task using synthetic data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env python3.8&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="n"&gt;SyntheticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SyntheticData&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;predictions&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;annotations&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_predictions_and_annotations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Returns synthetic predictions and annotations for a step classifier response,&lt;/span&gt;
&lt;span class="sd"&gt;    ordered by prediction score.&lt;/span&gt;

&lt;span class="sd"&gt;    Note: The returned synthetic data has an optimal threshold at 0.5&lt;/span&gt;

&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;annotations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;SyntheticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predictions_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Returns predictions for the current subset window as specified by `thresh_ind`.&lt;/span&gt;

&lt;span class="sd"&gt;    Note: In the normal operation of AutoThreshold this step would generate predictions&lt;/span&gt;
&lt;span class="sd"&gt;    for our out-of-distribution images from our image classifier. Here, our toy example&lt;/span&gt;
&lt;span class="sd"&gt;    is run on synthetic data and our precomputed predictions are simply returned.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;annotations_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Returns annotations for the current subset window as specified by `thresh_ind`.&lt;/span&gt;

&lt;span class="sd"&gt;    Note: In the normal operation of AutoThreshold this step would source annotations&lt;/span&gt;
&lt;span class="sd"&gt;    from a crowdsourcing platform. Here, our toy example is run on synthetic data and&lt;/span&gt;
&lt;span class="sd"&gt;    our precomputed annotations are simply returned.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_optimal_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Returns the index of the optimal threshold using the F1 score.&lt;/span&gt;

&lt;span class="sd"&gt;    **Example:**&lt;/span&gt;

&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; predictions = [0, 0.2, 0.4, 0.6, 0.8, 1.0]&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; annotations = [0, 0, 0, 1, 1, 1]&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; thresh_ind = calculate_optimal_threshold(annotations, predictions)&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; threshold = predictions[thresh_ind]&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; threshold&lt;/span&gt;
&lt;span class="sd"&gt;    0.6&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;auto_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annotation_generator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Main loop of the AutoThreshold algorithm.&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# Specify initial estimate; here we start from the highest confidence which is&lt;/span&gt;
    &lt;span class="c1"&gt;# the n-th ordered prediction&lt;/span&gt;
    &lt;span class="n"&gt;thresh_ind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
    &lt;span class="n"&gt;thresh_est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_ITERS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# E-Step: Generate annotations for the subset of ordered predictions&lt;/span&gt;
        &lt;span class="n"&gt;predictions_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictions_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;annotations_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;annotations_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# M-Step: Estimate local threshold index for the newly annotated subset&lt;/span&gt;
        &lt;span class="n"&gt;thresh_ind_subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_optimal_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotations_subset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions_subset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Estimate new threshold&lt;/span&gt;
        &lt;span class="n"&gt;thresh_ind_old&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt;
        &lt;span class="n"&gt;thresh_ind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thresh_ind_old&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;thresh_ind_subset&lt;/span&gt;
        &lt;span class="n"&gt;thresh_est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;thresh_ind&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Iter: &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;, Est: &lt;/span&gt;&lt;span class="si"&gt;{:.3f}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thresh_est&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Check convergence&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;thresh_ind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;thresh_ind_old&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Converged&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;thresh_est&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;__main__&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;AutoThreshold Toy Example.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Specify arguments: Max algorithm iterations, number of synthetic predictions &amp;amp; subset size&lt;/span&gt;
    &lt;span class="n"&gt;MAX_ITERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;

    &lt;span class="c1"&gt;# Synthetically generate ordered classifier predictions and annotations&lt;/span&gt;
    &lt;span class="n"&gt;synthetic_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_predictions_and_annotations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run AutoThreshold to estimate optimal classifier threshold&lt;/span&gt;
    &lt;span class="n"&gt;thresh_est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auto_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synthetic_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annotations_generator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Estimated threshold value: &lt;/span&gt;&lt;span class="si"&gt;{:.3f}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thresh_est&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s2"&gt;Fin.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Code output will look like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$ ./autothreshold.py
AutoThreshold Toy Example.

Iter: 0, Est: 0.975
Iter: 1, Est: 0.950
Iter: 2, Est: 0.925
Iter: 3, Est: 0.900
Iter: 4, Est: 0.875
Iter: 5, Est: 0.850
Iter: 6, Est: 0.825
Iter: 7, Est: 0.800
Iter: 8, Est: 0.775
Iter: 9, Est: 0.750
Iter: 10, Est: 0.725
Iter: 11, Est: 0.700
Iter: 12, Est: 0.675
Iter: 13, Est: 0.650
Iter: 14, Est: 0.625
Iter: 15, Est: 0.600
Iter: 16, Est: 0.575
Iter: 17, Est: 0.550
Iter: 18, Est: 0.525
Iter: 19, Est: 0.500
Iter: 20, Est: 0.500
Converged

Estimated threshold value: 0.500

        Fin.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        availableFonts: ['STIX', 'TeX']," +
        "        preferredFont: 'STIX'," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="Zalando"/><category term="Computer Vision"/><category term="Machine Learning"/><category term="Python"/><category term="Zalando Science"/><category term="Backend"/></entry><entry><title>Space efficient machine learning feature stores using probabilistic data structures - a benchmark</title><link href="https://engineering.zalando.com/posts/2021/10/space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures.html" rel="alternate"/><published>2021-10-05T00:00:00+02:00</published><updated>2021-10-05T00:00:00+02:00</updated><author><name>Enno Shioji</name></author><id>tag:engineering.zalando.com,2021-10-05:/posts/2021/10/space-efficient-machine-learning-feature-stores-using-probabilistic-data-structures.html</id><summary type="html">&lt;p&gt;In this post, we describe a technique for providing machine learning feature stores with sublinear space requirements, and perform a benchmark that uses bloom filter as the backing data store. Such feature stores can be an effective alternative to the commonly used key-value-store-based feature stores in certain situations.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Bloom Filter" src="https://engineering.zalando.com/posts/2021/10/images/bloom_filter.png#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;When building Machine Learning (ML) applications - such as recommender systems - there is often a need to provide a "feature store" which can enrich the request to the system with additional ML features.&lt;/p&gt;
&lt;p&gt;For example: whether a user has looked at an article before is often very informative about whether the user will click or buy that article this time. So, companies keep a record of which articles their users have recently clicked or bought, and use this data in their recommender systems. Other commonly used data include: past browsing history, purchase history, user information like demographics, explicit preferences they shared etc.&lt;/p&gt;
&lt;p&gt;These data are usually stored in key-value stores like Redis, using the user ID as the key, and the features as the value.&lt;/p&gt;
&lt;p&gt;When a request is made to the recommender system, a query is made to this key-value store using the user ID, and the retrieved features are fed to the recommendation algorithm together with the data contained in the original request. When there are many users, these feature stores can easily get very large.&lt;/p&gt;
&lt;p&gt;This creates significant challenges in terms of the development and operation of ML applications.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;They add to the processing time&lt;/strong&gt;: Adding a network call commonly adds 2-10ms to your response time. To make matters worse, it also adds a lot of variance to the response time due to the variation of message sizes across users&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Additional hosting costs/maintenance cost&lt;/strong&gt;: Distributed databases with strict performance requirements can be expensive to host&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Additional operational complexities&lt;/strong&gt;: Operations like backfill can become very expensive to set up/execute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Development complexities&lt;/strong&gt;: An external database adds a dependency to the application code, which adds some complexity to the development/testing process (like having to pre-populate this DB for tests). Intrusive performance optimizations like size limits, aggregations, prioritization of users are often necessary, which adds development time and increases the coupling between model design and infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple lookups can be prohibitively expensive&lt;/strong&gt;: For example: imagine you want to rank a thousand products, and want to retrieve features for each product - this would be extremely difficult with an external database under a strict latency budget. Another hypothetical example is retrieving features for composite keys (interactions), e.g. "How many times were product X and Y bought together?". If the feature state is small enough to live in the same process's memory, multiple look-ups are far cheaper and thus feasible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The solution&lt;/h2&gt;
&lt;p&gt;What if, instead of having a big, unwieldy database, we could read a much smaller dataset into memory, and query that as a feature store from within the process? This is essentially what we can do with "sketching" data structures, a type of probabilistic data structure.&lt;/p&gt;
&lt;p&gt;Sketching data structures can store large amounts of data in a compact (sublinear) space at the expense of accuracy. In other words, they store a "summary" of the original data. They are essentially a lossy compression algorithm for your features. Just like JPEG compression for your images, they can compress input data at varying "compression levels" - a low compression level means better quality but larger sizes, and a high compression level means lower quality but smaller sizes.&lt;/p&gt;
&lt;p&gt;This allows us to trade accuracy for lower space requirements. As we will see below, the trade-off is highly favorable - a very small sacrifice in accuracy can save a lot of space.&lt;/p&gt;
&lt;p&gt;In this article we will only describe and benchmark bloom-filter-backed feature stores in detail, but theoretically, other sketching data structures like &lt;a href="https://en.wikipedia.org/wiki/HyperLogLog"&gt;HyperLogLog&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch"&gt;Count-Min Sketch&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Quotient_filter"&gt;Quotient Filters&lt;/a&gt; etc. could be used, too.&lt;/p&gt;
&lt;h2&gt;Benchmark of a sketching-data-structure-based feature store backed by a Bloom-Filter&lt;/h2&gt;
&lt;p&gt;Below is a benchmark based on a real-life click prediction dataset. It shows that prediction models that use a bloom-filter-based feature store can achieve the same level of prediction accuracy &amp;amp; prediction throughput with a vastly smaller feature state that can easily fit into memory.&lt;/p&gt;
&lt;h3&gt;Benchmark setting&lt;/h3&gt;
&lt;p&gt;We used a real-life click prediction dataset which has two types of features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Request features&lt;/strong&gt;: Features that are immediately available in the request, like country, article id, device type, context URL and so on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Historical features&lt;/strong&gt;: Features that are based on accumulated historical data, like browsing history, purchase history, preferences that were saved in the past etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The historical features were aggregated using count, max etc. (e.g. how many times did a user browse an item, when was the last time they looked at it etc.) and were then discretized to yield categorical features. They were then stored in feature stores.&lt;/p&gt;
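&lt;p&gt;As a rough illustration of the aggregation and discretization step, the sketch below counts how often a user browsed each article and maps the counts to bucket labels. The bucket boundaries, the &lt;code&gt;browse_count[...]&lt;/code&gt; feature naming and both helper functions are assumptions made for this example; the post does not state the ones used in the benchmark.&lt;/p&gt;

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries and labels (not taken from the benchmark).
COUNT_BOUNDARIES = [1, 2, 5, 10]
BUCKET_LABELS = ["0", "1", "2-4", "5-9", "10+"]


def discretize_count(count: int) -> str:
    """Map a raw count to a categorical bucket label."""
    return BUCKET_LABELS[bisect_right(COUNT_BOUNDARIES, count)]


def browse_count_features(browsed_article_ids):
    """Aggregate browse events per article, then discretize the counts."""
    counts = Counter(browsed_article_ids)
    return {
        f"browse_count[{article_id}]={discretize_count(n)}"
        for article_id, n in counts.items()
    }


# One user's browse events, given as article IDs (made-up data).
events = [42, 42, 7, 42, 42, 42, 7]
print(sorted(browse_count_features(events)))
```

&lt;p&gt;Each resulting string is a binary categorical feature: it is either present for a user or not, which is the form a set-membership structure can answer.&lt;/p&gt;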
&lt;p&gt;The training data had about 5.7 million examples. Out of these, 2.8 million had historical data (the rest had only request features). Combined, the data had 1.762 billion data points after feature extraction.&lt;/p&gt;
&lt;p&gt;Finally, a logistic regression classifier was used to predict clicks. Our variants were as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No history&lt;/strong&gt;: A model without a feature store (so that it could only use request features)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uncompressed history&lt;/strong&gt;: A model that simulated use of a conventional feature store (the features were pre-fetched)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compressed history&lt;/strong&gt;: A model that used a bloom filter based compressed feature store&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Implementation of the bloom-filter-based-compressed-feature-store&lt;/h3&gt;
&lt;p&gt;Below is a simplified implementation in Python that illustrates how the feature store was implemented. It returns what articles a user had looked at before, given their user_id. This is not the actual implementation that was used in the benchmark. The benchmark used a JVM-based implementation, and was more general in nature (it stored arbitrary categorical features).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bloom_filter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BloomFilter&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeatureStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BloomFilter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possible_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;article_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possible_articles&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;composite_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;^&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;composite_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;ret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;possible_articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;composite_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;^&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;might_contain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;composite_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;ret&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The most important element to point out is the additional state &lt;code&gt;self.possible_articles&lt;/code&gt;. This would hold the set of all possible features (in this case, all article IDs), and the code is brute forcing all of them in order to reconstruct the set of articles viewed by the user. This may appear to be a very expensive thing to do, but in practice it is very cheap in relation to the total processing. In my simple benchmark, the difference was undetectable. It is also worth noting that this process could be optimized, for example through the use of binary search, and/or by only querying for important features.&lt;/p&gt;
&lt;p&gt;The compressed history variant had a parameter that determined the level of compression - i.e. higher compression level meant lower quality and size, lower compression level meant higher quality and size. What do we mean by "quality" here? In a nutshell, the bloom filter tells us if a binary categorical feature is present (1) or not (0). When the bloom filter says a feature is NOT present, it is always correct - i.e. there are no false negatives. However, when the bloom filter says a feature is present, it can be an error. In other words, at some probability, we will mistakenly set the feature value to 1, when in fact it should have been 0 (i.e. false positive). This adds noise to our model's input. This probability can be tuned via a parameter, and the higher the false positive rate, the smaller the state size.&lt;/p&gt;
&lt;p&gt;For more details on how this compression level parameter works, and generally how bloom filters work and their characteristics, see e.g. &lt;a href="https://llimllib.github.io/bloomfilter-tutorial/"&gt;here&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Bloom_filter"&gt;here&lt;/a&gt; and &lt;a href="https://freecontent.manning.com/all-about-bloom-filters/"&gt;here&lt;/a&gt;.&lt;/p&gt;
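&lt;p&gt;To make the trade-off between false positive rate and state size more concrete, here is a minimal sketch based on the standard textbook sizing formulas for a classic bloom filter. The function names and the workload of 100 million (user, article) pairs are assumptions for illustration; they are not taken from the benchmark above.&lt;/p&gt;

```python
import math


def bloom_filter_size_bits(num_items: int, false_positive_rate: float) -> int:
    """Optimal bit count for a classic bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def optimal_num_hashes(size_bits: int, num_items: int) -> int:
    """Hash function count that minimises the false positive rate: k = (m / n) * ln 2."""
    return max(1, round(size_bits / num_items * math.log(2)))


# Hypothetical workload (an assumption, not a figure from the article).
num_pairs = 100_000_000
for fpr in (0.001, 0.01, 0.1):
    bits = bloom_filter_size_bits(num_pairs, fpr)
    hashes = optimal_num_hashes(bits, num_pairs)
    print(f"fpr={fpr}: {bits / 8 / 1e6:.0f} MB, {hashes} hash functions")
```

&lt;p&gt;The size grows only with the logarithm of the inverse false positive rate, which is why accepting a little more noise buys a noticeably smaller state.&lt;/p&gt;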
&lt;p&gt;As an evaluation metric, we used click &lt;a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic"&gt;ROC-AUC (Area Under the Curve of the Receiver Operating Characteristic curve)&lt;/a&gt;, a common metric for recommender systems.&lt;/p&gt;
&lt;h3&gt;Result&lt;/h3&gt;
&lt;p&gt;The scatter plot below shows the AUC (y axis) of the classifier at varying compression levels (x axis = size of the feature store in bytes in logarithmic scale). The dotted green line is the AUC with a key-value-store-based feature store equivalent (i.e. Uncompressed). The dotted red line is the AUC without any history features (i.e. No history).&lt;/p&gt;
&lt;p&gt;&lt;img alt="AUC plotted against feature state size" src="https://engineering.zalando.com/posts/2021/10/images/feature_store-to-roc_auc-plot.png"&gt;&lt;/p&gt;
&lt;p&gt;As expected, our bloom-filter-backed feature store achieves performance between the two lines (uncompressed ~= 0.80 and no history ~= 0.70).&lt;/p&gt;
&lt;p&gt;The estimated size of the key-value-store-based feature store was about 15GB. Hence, the results show that our compressed feature store achieves the same level of classification performance (AUC~=0.7997) using just 3% of memory (470MB vs 15GB). The state size can be further reduced at the expense of classification performance. For example, 90% of the uplift provided by the feature store can be retained by using merely ca. 40MB of state (AUC~=0.79). This would be just 0.3% of the size of an uncompressed feature store. Note that this "saving" grows as the data volume increases due to the sublinear space complexity.&lt;/p&gt;
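&lt;p&gt;The percentages above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the helper names and the conversion of 15GB to 15,000MB are assumptions.&lt;/p&gt;

```python
# Figures quoted in the text; treating 1 GB as 1000 MB is an assumption.
uncompressed_mb = 15 * 1000
auc_uncompressed, auc_no_history = 0.80, 0.70


def size_ratio(compressed_mb: float) -> float:
    """State size as a fraction of the uncompressed feature store."""
    return compressed_mb / uncompressed_mb


def uplift_retained(auc: float) -> float:
    """Share of the AUC gain over the no-history baseline that is kept."""
    return (auc - auc_no_history) / (auc_uncompressed - auc_no_history)


for size_mb, auc in ((470, 0.7997), (40, 0.79)):
    print(f"{size_mb} MB: {size_ratio(size_mb):.1%} of the size, "
          f"{uplift_retained(auc):.0%} of the uplift")
```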
&lt;p&gt;When it comes to throughput (computational efficiency), all of the variants achieved similar throughput (20-22k predictions per second per core on my 2018 Mac). I.e. the additional overhead was undetectable with my performance tests.&lt;/p&gt;
&lt;h2&gt;The Limitation&lt;/h2&gt;
&lt;p&gt;So the benchmark results look very good - why would anyone use a conventional key-value-store-based feature store at all? Alas, the new feature stores come with severe limitations and are thus not a drop-in replacement for conventional feature stores.&lt;/p&gt;
&lt;h3&gt;You have to know what to ask&lt;/h3&gt;
&lt;p&gt;As described above, we need to keep the set of possible features in order to get the desired output. In a lot of use cases this is not an issue, but in some situations it may be prohibitively expensive (e.g. imagine reconstructing a bag-of-words encoding of past user reviews).&lt;/p&gt;
&lt;h3&gt;They are difficult to update (and thus keep them "fresh")&lt;/h3&gt;
&lt;p&gt;The second, and probably by far the more important, weakness is the difficulty of updating them.&lt;/p&gt;
&lt;p&gt;Feature "freshness", i.e. how quickly recent events can be reflected in the feature store, is very important, as recent events tend to have high informational value. Many distributed key-value stores have good write performance, and thus it's very feasible to keep them very "fresh" even under high load. The situation is very different with sketching-data-structure-based feature stores.&lt;/p&gt;
&lt;p&gt;First, let's consider the appending of new information to our new feature store.&lt;/p&gt;
&lt;p&gt;Most sketching data structures (including bloom filters) allow incremental appends (so far, so good). However, since the complete state is loaded into each node's RAM, every write must be applied on every node, so each node (process) must be able to handle 100% of the event traffic. This is usually impossible: common event streams like views and clicks are usually very high volume, and processing that volume of writes on a single node is not a practical option. One could consider batching, but in many key-value-based feature stores, the target update latency is shorter than a few seconds, which makes this option extremely difficult.&lt;/p&gt;
&lt;p&gt;Theoretically bloom filters could be distributed so that each node only needs to process a shard of the traffic - but at this point one would have converted one's real-time transaction server into a distributed database.&lt;/p&gt;
&lt;p&gt;Second, let's consider deletion (expiry) of information.&lt;/p&gt;
&lt;p&gt;The situation is even worse here because, due to their nature, sketching data structures don't allow deletes of individual records. Thus, to delete a record from our new feature state, one has to completely regenerate it by re-processing the entire source dataset (sans the information we want to delete). This is extremely expensive and thus can only be done on a low-frequency batch basis. There are some sketching-data-structure variants that allow some degree of expiry (see e.g. &lt;a href="https://arxiv.org/pdf/2001.03147.pdf"&gt;Age-partitioned Bloom Filters&lt;/a&gt;), but there are no mature implementations available.&lt;/p&gt;
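&lt;p&gt;Why individual deletes are not possible can be shown with a toy example. The following is a minimal sketch, not the implementation used in the benchmark: &lt;code&gt;TinyBloomFilter&lt;/code&gt;, its SHA-256-based hashing and the deliberately wrong &lt;code&gt;naive_delete&lt;/code&gt; method are all hypothetical. Because several keys can share bit positions, clearing the bits of one key can silently erase another.&lt;/p&gt;

```python
import hashlib


class TinyBloomFilter:
    """Minimal bloom filter for illustration only (hypothetical, not the article's code)."""

    def __init__(self, size_bits: int = 64, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def positions(self, key: str) -> set:
        # Derive the bit positions by salting the key with the hash index.
        return {
            int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.size_bits
            for i in range(self.num_hashes)
        }

    def add(self, key: str) -> None:
        for pos in self.positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self.positions(key))

    def naive_delete(self, key: str) -> None:
        # Deliberately wrong: the cleared bits may also belong to other keys.
        for pos in self.positions(key):
            self.bits[pos] = False


bf = TinyBloomFilter()
first = "1^100"
# Search for a second composite key sharing at least one bit position with the first.
second = next(
    f"2^{article_id}"
    for article_id in range(10_000)
    if not bf.positions(f"2^{article_id}").isdisjoint(bf.positions(first))
)
bf.add(first)
bf.add(second)
bf.naive_delete(first)
print(bf.might_contain(second))  # False: deleting one key produced a false negative
```

&lt;p&gt;This breaks the "no false negatives" guarantee described earlier, which is why the safe way to remove a record is to rebuild the whole filter.&lt;/p&gt;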
&lt;h3&gt;They cannot support complex queries and updates&lt;/h3&gt;
&lt;p&gt;Finally, sketching-data-structure-based feature stores don't support complex queries or updates like "remove all events that happened on day X". With key-value-store-based feature stores, the additional cost of storing some metadata (like event timestamps) is relatively minor. But this can be a major undertaking for sketching-data-structure-based feature stores.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Sketching-data-structure-based feature stores cannot replace conventional feature stores in all use cases, but they can be an attractive option when using an external feature store is prohibitively expensive. For example, if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One can't afford the additional network call to an external feature store&lt;/li&gt;
&lt;li&gt;Many feature lookups need to be performed per request&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="Machine Learning"/><category term="Recommender Systems"/><category term="Scalability"/><category term="Zalando Science"/><category term="Backend"/></entry><entry><title>Tracing SRE’s journey in Zalando - Part II</title><link href="https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html" rel="alternate"/><published>2021-09-21T00:00:00+02:00</published><updated>2021-09-21T00:00:00+02:00</updated><author><name>Pedro Alves</name></author><id>tag:engineering.zalando.com,2021-09-21:/posts/2021/09/sre-journey-part2.html</id><summary type="html">&lt;p&gt;Follow Zalando's journey to adopt SRE in its tech organization.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;Welcome to the second part of our journey establishing SRE in Zalando. You’ll find the &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html"&gt;first part here&lt;/a&gt;. Don’t miss out on the third and final post in one week.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;2018 - The Return of SRE&lt;/h2&gt;
&lt;p&gt;In our previous blog post we left it with the plans for Site Reliability Engineering (SRE) in Zalando having to change. So, what were those changes and what were the challenges we faced in this new iteration?
In this blog post we’ll go straight to the first quarter of 2018, when two sister SRE teams were bootstrapped around the same time in different departments. One of them was the &lt;strong&gt;SRE Enablement&lt;/strong&gt; team in Digital Foundation (DF - a central functions department). The other was the &lt;strong&gt;Digital Experience SRE&lt;/strong&gt; team (DX - the department responsible for the customer-facing part of our Fashion Store). The latter was created from a grassroots initiative, while the DF one was reimagined by the management of that department.&lt;/p&gt;
&lt;p&gt;Since the decision made back in 2017 to grow the number of teams on call, the issue with overwhelmed on-call teams was gone. As expected, the side effect of that decision was that teams were now much more aware of the operational burden of their services and would take steps to reduce that burden. Post-Mortems started becoming a regular practice in 2017, which also helped (although the practice was not yet well established). But while teams were slowly becoming more ‘operationally capable’, the complexity of our platform was growing at a much faster pace, with no one to keep a holistic view on the service landscape.
You’ll notice from the name of the DF team that there is already something implied: SRE &lt;strong&gt;Enablement&lt;/strong&gt;. This is where the new team differentiates itself from the 2016 initiative. The challenge that gave purpose to the Enablement team was &lt;strong&gt;raising the bar on our operational practices&lt;/strong&gt;. This covered monitoring, incident response, chaos engineering, and resilience engineering.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Service Landscape" src="https://engineering.zalando.com/posts/2021/09/images/service-landscape.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Service Landscape&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Both SRE teams had very limited resources (only 2 engineers each), and they obviously shared the same goals. To better align the efforts of both teams, an &lt;strong&gt;SRE Program&lt;/strong&gt; was kicked off to unite them around common goals. As before, the practices and mindset described in Google’s original &lt;a href="https://sre.google/sre-book/table-of-contents/"&gt;SRE book&lt;/a&gt; were used as the main inspiration for our own SRE teams.
The teams were composed of experienced engineers, with a strong background in software development, knowledge of systems engineering, and incident response (very much aligned with the profile that was outlined back in 2016). These engineers also enjoyed a fair amount of social capital across the organization, which greatly facilitated the collaboration with other teams.&lt;/p&gt;
&lt;p&gt;Compared to the previous iteration, the SRE Program was not aiming at significant organizational changes. This gave some degree of freedom regarding the projects the Program would tackle. At the beginning of the Program, the 2 teams got together and made a list of all the topics that were SRE relevant and that we wanted to work on. When we were done, the size of the list was considerable (there are so many interesting, relevant and challenging topics in SRE). With our limited capacity, however (6 team members between the two teams - 1 Lead, 1 Program Manager, 4 Engineers), we had to be careful when picking our initiatives. Although this meant that we had to drop many of the topics we wanted to work on, &lt;strong&gt;that careful selection contributed significantly to the success of the Program&lt;/strong&gt;, and the reputation we built for the SRE name within the company.&lt;/p&gt;
&lt;p&gt;The SRE Program took on the &lt;strong&gt;rollout of Distributed Tracing&lt;/strong&gt; across the engineering organization, helped &lt;a href="https://engineering.zalando.com/posts/2018/06/loading-time-matters.html"&gt;&lt;strong&gt;improve the Page Load Time&lt;/strong&gt;&lt;/a&gt; for some of Zalando’s pages, staffed the newly created &lt;a href="https://sre.google/workbook/incident-response/"&gt;&lt;strong&gt;Incident Commander role&lt;/strong&gt;&lt;/a&gt;, and helped with Cyber Week preparations, namely &lt;a href="https://engineering.zalando.com/posts/2019/04/end-to-end-load-testing-zalandos-production-website.html"&gt;&lt;strong&gt;Load Tests&lt;/strong&gt;&lt;/a&gt;. SREs, in the role of Incident Commanders, provided on-site support during Black Friday in a dedicated Situation Room. SREs also worked with other teams on &lt;strong&gt;efficiency topics&lt;/strong&gt; that led to significant cost savings with cloud infrastructure while preserving reliability targets.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Distributed Tracing Workshop" src="https://engineering.zalando.com/posts/2021/09/images/ot-workshop.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Distributed Tracing Workshop&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;SLOs, as introduced back in 2016, were still in place, with hundreds of new services specifying SLOs. Despite the growing number of SLOs, they were still not used to help the teams strike a balance between feature development and operational improvements. One of the things that made this more challenging was the fact that Zalando runs many thousands of services in production. We figured that not all of them had the same relevance. To try to put some structure into the SLOs we had, &lt;strong&gt;Service Tier definitions&lt;/strong&gt; were published. To help with the Service Tiers, a &lt;strong&gt;new SLO reporting tool&lt;/strong&gt; was developed. The new tool defined canonical SLIs and used the tier classification. However, this work was limited in scope: it targeted a single department, Digital Experience, home to one of the SRE teams. Services in other departments were not included in this effort and there was no mandate for them to adopt the new Service Tier definitions. Attempting to roll this out for the entire company (&amp;gt;4000 services) would not have been feasible.&lt;/p&gt;
&lt;p&gt;On the cultural level, the SRE Program took ownership of the SRE Guild. Guilds in Zalando are self-organized groups of colleagues, sharing a common interest, that meet regularly to exchange knowledge. The SRE Guild was actually a remnant from the 2016 initiative, but was left dormant. &lt;strong&gt;We saw the SRE Guild as an agent of cultural change&lt;/strong&gt; to help us spread the SRE mindset. We then devoted efforts to develop a format that would be engaging and sustainable. Guild sessions provided a regular event with talks around all things SRE, whether it’s presenting the work of the SRE Program, or giving the floor for other teams or engineers to share knowledge. Postmortems became a regular topic in these sessions. This format is still in place today.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Black Friday 2018 Situation Room" src="https://engineering.zalando.com/posts/2021/09/images/black-friday-2018.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Black Friday 2018 Situation Room&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Despite the success of the SRE Program, the fact that the individual teams were part of different organizations with different reporting chains led to some challenges related to the priorities of those different departments. Those different priorities and guidelines posed another problem when they would be at odds with each other. Teams in Zalando would seek out guidance from SRE, not knowing which team to reach out to, or even that there were 2 separate teams. To understand how two SRE teams that were working together could offer inconsistent guidance, it’s important to remember that they belonged to different departments. The SRE DX team could focus on the problem space of the DX department and offer customized solutions for those teams. The SRE DF team had the entire company in scope, so whatever that team did, it had to be applicable on a different scale. The SRE Program was planned for the year of 2018, culminating with the end of Cyber Week. Following that plan, after Cyber Week was over the program ended and each team went back to work on projects relevant to their respective departments.&lt;/p&gt;
&lt;h2&gt;2019 - Combining forces as a single SRE team&lt;/h2&gt;
&lt;p&gt;In early 2019 both SRE teams were officially united into a single team in the DF department (the department of one of the original teams). With this merger, SRE now had a single voice in the company.&lt;/p&gt;
&lt;p&gt;The experience with Distributed Tracing in the previous year was quite positive - Do you get the pun in the blog post’s name, now? 🤓. For one, it became a fundamental tool for incident response because it allowed for quicker insights, saving time during incidents. The coverage across Zalando’s services kept growing. The standardized data model, the development of Zalando-specific Semantic Conventions, and an API to consume the tracing data allowed the SRE team to build additional value from it.&lt;/p&gt;
&lt;p&gt;One of the tools we developed based on Distributed Tracing is an Alert Handler called &lt;strong&gt;Adaptive Paging&lt;/strong&gt; (which we &lt;a href="https://www.usenix.org/conference/srecon19emea/presentation/mineiro"&gt;talked about at SRECon’19&lt;/a&gt;). This alert handler monitors the error rate of what we call &lt;strong&gt;Critical Business Operations&lt;/strong&gt;&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; (CBO). When it is triggered, it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem. This alert handler was also a game changer in our push for a different alerting strategy: &lt;strong&gt;Symptom Based Alerting&lt;/strong&gt;. You can learn more about it in the &lt;a href="https://github.com/zalando/public-presentations/blob/master/files/2019-05-16_alerting_monitoring_and_all_that_jazz.pdf"&gt;slides of one of the talks&lt;/a&gt; we did on this topic.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Adaptive Paging Diagram" src="https://engineering.zalando.com/posts/2021/09/images/adaptive-paging.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Adaptive Paging will traverse the Trace and identify the team to be paged&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
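&lt;p&gt;The idea of walking a trace to find the team closest to the problem can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the actual Adaptive Paging implementation: the &lt;code&gt;Span&lt;/code&gt; class, its fields, the operation and team names, and the rule "follow the first failing child" are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical, simplified span; real tracing data carries far more fields."""
    operation: str
    team: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def team_to_page(span: Span) -> str:
    """Follow failing child spans downwards and return the team owning the deepest one."""
    for child in span.children:
        if child.error:
            return team_to_page(child)
    # No failing child: this span is the closest we can get to the origin of the error.
    return span.team


# Made-up trace for a 'Place Order' operation with an error deep in the call chain.
trace = Span("place_order", "checkout", error=True, children=[
    Span("reserve_stock", "logistics"),
    Span("authorize_payment", "payments", error=True, children=[
        Span("fraud_check", "risk", error=True),
    ]),
])
print(team_to_page(trace))  # risk
```

&lt;p&gt;In this toy trace the alert fires on the top-level operation, but the page goes to the team that owns the deepest failing span rather than to the team that owns the entry point.&lt;/p&gt;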
&lt;p&gt;A &lt;strong&gt;throughput calculator&lt;/strong&gt; based on Tracing data was also developed to support the Load Test efforts for Cyber Week preparations. By applying the expected throughput for a CBO, we could estimate the impact on all the components that are part of the same journey, usually through cascading remote procedure calls.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Throughput Calculator" src="https://engineering.zalando.com/posts/2021/09/images/throughput-calculator.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Throughput Calculator&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
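&lt;p&gt;A rough sketch of how such an estimate could be computed: given a call graph with the average number of downstream calls per request (which, we assume here, could be derived from tracing data), the expected throughput of an entry point is multiplied along each edge. The component names, ratios and the &lt;code&gt;estimate_load&lt;/code&gt; helper are hypothetical and do not describe the actual tool.&lt;/p&gt;

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical call graph: caller -> [(callee, average calls per caller request)].
CALL_GRAPH: Dict[str, List[Tuple[str, float]]] = {
    "place_order": [("cart", 1.0), ("payment", 1.0)],
    "cart": [("stock", 2.0)],
    "payment": [("fraud_check", 0.5)],
}


def estimate_load(entry_point: str, requests_per_second: float) -> Dict[str, float]:
    """Propagate the expected throughput of an entry point through cascading calls.

    Assumes the call graph is acyclic; a cycle would make this loop run forever.
    """
    load: Dict[str, float] = defaultdict(float)
    stack = [(entry_point, requests_per_second)]
    while stack:
        component, rps = stack.pop()
        load[component] += rps
        for callee, calls_per_request in CALL_GRAPH.get(component, []):
            stack.append((callee, rps * calls_per_request))
    return dict(load)


# Expected requests per second for every component reached from the entry point.
print(estimate_load("place_order", 1000.0))
```

&lt;p&gt;With these made-up ratios, 1000 requests per second on the entry point translate into 2000 calls per second on the stock component and 500 on the fraud check, which is the kind of number a load test would then need to target.&lt;/p&gt;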
&lt;p&gt;Finally, through our use of Distributed Tracing, and Adaptive Paging, we made a significant change in our SLO strategy. We moved away from service based SLOs, and started rolling out &lt;a href="https://engineering.zalando.com/posts/2022/04/operation-based-slos.html"&gt;Operation based SLOs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Through internal and external hiring we grew the team up to 7 SREs. But that team size notwithstanding, &lt;strong&gt;hiring was always a challenge&lt;/strong&gt;. Then, and today. The combination of the required skill set for an SRE at Zalando and the different definitions of the SRE role across the industry means many candidates do not meet the bar, or simply have a different skill set. Nevertheless, it was agreed that we would not compromise on our hiring. While growing engineers and teaching the SRE mindset was seen as something positive (and definitely a way to scale the team further), with our reduced size we could not provide effective mentorship. Any engineers we hired who needed that mentorship would not be set up for success.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SRECon DT Workshop" src="https://engineering.zalando.com/posts/2021/09/images/sre-con-workshop.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;We took the previous year’s Distributed Tracing Workshop to SRECon’19&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Both 2018 and 2019 were successful years for SRE, but there are quite a few differences between the two. In 2018 we worked exclusively on topics that SRE did not own. We were a mix of a &lt;a href="https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started"&gt;consulting team and a kitchen sink team&lt;/a&gt;. We either volunteered for some of the projects we worked on, or were asked to help due to capacity reasons or because the projects required a specific skill set. Our main challenge was &lt;strong&gt;how to decide what to work on&lt;/strong&gt;. There was no mathematical formula to determine this. It was always a matter of balancing the following dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Likelihood of success (Would we be in way over our head? Could we actually influence the outcome?)&lt;/li&gt;
&lt;li&gt;Company’s priorities&lt;/li&gt;
&lt;li&gt;Enablement (If we’re working with a team, will that team learn something from the engagement, or were we expected to do everything ourselves?)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In 2019 we still operated partially in the same kitchen sink/consulting mode, but the big difference is that in 2019 we started working on our own products, which also means we started taking some control of our roadmap.&lt;/p&gt;
&lt;p&gt;Overall, 2019 was the year we started reaping the benefits of the achievements from the previous years. We had given a clear signal that a single (small) team of engineers dedicated to Reliability could bring significant benefits to an organization the size of Zalando. But, to an extent, we were also a victim of our success. Despite having our own backlog and a list of topics we wanted to work on, &lt;strong&gt;the team became increasingly in demand&lt;/strong&gt; from different parts of the organization. Our help was requested to improve Operational Excellence in departments, to assist in the rollout of major launches, to review Technical Design Documents, to help in PostMortem investigations, &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;Cyber Week preparations, Production Readiness Reviews&lt;/a&gt;… As before, we had to pick our battles carefully. Accepting every challenge with our reduced capacity meant that we would likely do a poor job in all of them. And anything in our backlog that we had promised and didn’t deliver would also affect our reputation.&lt;/p&gt;
&lt;p&gt;Things are starting to get interesting. After a few successful projects, SRE’s reputation in the company grew. We merged the two SRE teams into a single team, making sure that SRE could continue to grow unaffected by fragmentation. The SRE Guild kept on going, further spreading the SRE mindset. We grew the team, and even started to focus on our own backlog. But SRE is still a single, small team in a very large organization. How far can we stretch this model? Well, that's what we're going to talk about in the last blog post of this series in one week's time.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;EDIT 1: Don't stop now. The third and last part of our series is already available &lt;a href="https://engineering.zalando.com/posts/2021/10/sre-journey-part3.html"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Grossly summarizing it, Zalando is an e-commerce platform, so a Critical Business Operation is anything that affects our Business, like ‘Add To Cart’, ‘Place Order’ or ‘View Catalog’&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Tracing SRE’s journey in Zalando - Part I</title><link href="https://engineering.zalando.com/posts/2021/09/sre-journey-part1.html" rel="alternate"/><published>2021-09-13T00:00:00+02:00</published><updated>2021-09-13T00:00:00+02:00</updated><author><name>Pedro Alves</name></author><id>tag:engineering.zalando.com,2021-09-13:/posts/2021/09/sre-journey-part1.html</id><summary type="html">&lt;p&gt;Follow Zalando's journey to adopt SRE in its tech organization.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;2016 - First attempt at rolling out SRE&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Welcome to the first installment of our three part series following Zalando’s SRE journey. Be sure to come back for the other two, with the next one being published in a week.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Site Reliability Engineering (SRE) is a recent discipline in the Software Engineering field that is growing in popularity, with many companies turning to this new way of working to solve their operational issues, or to support their growing scale.
But being a recent discipline, it’s not yet well established how organizations should adopt SRE, or even what the role of a Site Reliability Engineer is (although the role enjoys an increasing demand).
At Zalando we also took a stab at implementing SRE within our organization. We looked at it as a way to help us scale our engineering efforts, improving efficiency and making life for our developers easier. Today, Zalando includes in its organization a Site Reliability Engineering department, but the journey to reach this point was filled with challenges and learnings that we are now sharing with everyone.&lt;/p&gt;
&lt;p&gt;In this series of blog posts we will take our readers through the road so far. We’ll describe what worked well for us, and what didn’t. Where we failed, and where we succeeded. We’ll also look into how we defined the role of an SRE within the company, and how SRE is growing in Zalando.&lt;/p&gt;
&lt;p&gt;Before we get to the ‘How’, let’s start with the ‘Why’. Why would we want to have SRE in Zalando? Well, for that we need to understand the point that we were at as a company before this journey began. That takes us back to 2016 when we were well into our move to the cloud, migrating our monoliths to a microservices architecture (you can find more details about this and what came after in the &lt;a href="https://srcco.de/posts/one-decade-in-zalando-tech.html"&gt;blog post&lt;/a&gt; from our colleague Henning Jacobs).&lt;/p&gt;
&lt;p&gt;&lt;img alt="A view of Zalando Tech pre-cloud" src="https://engineering.zalando.com/posts/2021/09/images/zalando-pre-cloud.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;A view of Zalando Tech pre-cloud&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The move to the cloud came with disruptive changes to the way we were working. Teams were now responsible end-to-end for the software they built. That meant designing, developing, testing, deploying and operating the applications the teams owned. I’ll skip the gruesome details, but to put it simply, before this time, developers developed, and operators operated&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;. This meant that the vast majority of our engineers were not experienced in a good chunk of their newfound responsibilities. This lack of experience coupled with the hypergrowth that we were going through resulted in a lot of different and complex issues. These issues were mostly around the operational aspect of software development (monitoring, automated testing, deploying, incident handling, managing the cloud runtime).&lt;/p&gt;
&lt;p&gt;One of the more obvious pain points was the on-call support. Before we started the microservice migration, our service landscape was small enough that 5 on-call teams could cover the whole stack. Each team had a large enough rotation, and the domain was well understood by each team member. The monoliths were also quite similar in terms of monitoring and operations, making it easier to tackle issues even in services that a given engineer would not be so familiar with. That gradually changed as new teams were created, and more and more services were deployed in the cloud. And there was little standardization across those services. The on-call teams did not grow to meet the new demands, and were increasingly overwhelmed by the new services that they were responsible for.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Our deploy tool for our data center services" src="https://engineering.zalando.com/posts/2021/09/images/dc-deploy-tool.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Our deploy tool for our data center services&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;But 2016 is also the year that Google published their book &lt;a href="https://sre.google/sre-book/table-of-contents/"&gt;Site Reliability Engineering&lt;/a&gt;. The practices and mindset described in that book seemed to provide some answers to the growth pains we were experiencing. For that reason, it became the main inspiration for implementing the SRE mindset, role and practices in Zalando. &lt;strong&gt;How it all started, though, was through a grassroots initiative to promote and pitch for an investment in SRE.&lt;/strong&gt; After convincing enough managers, mostly through explaining the pain points being felt by the engineering teams, and how SRE could be a solution for those pains, a group of engineers teamed up under a project scope to drive this implementation. One of their main goals was to solve the on-call situation, and make it sustainable.
A quick side note: If it feels like the ‘convincing’ management is grossly summarized, or feels like it was just too easy, it’s important to bring up that &lt;strong&gt;Zalando is a company that does not shy away from change.&lt;/strong&gt; It’s a core part of the company’s DNA and culture. And the &lt;a href="https://hbr.org/2011/03/culture-trumps-strategy-every"&gt;culture of an organization&lt;/a&gt; always plays a key role in enabling (or resisting) such changes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SRE Brainstorming session" src="https://engineering.zalando.com/posts/2021/09/images/sre-brainstorm.jpg"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;SRE Brainstorming session&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;Now that there was initial buy-in from management, there were oh so many things to discuss. But the one that had the most influence on the following steps was &lt;strong&gt;“How do we structure SRE?”&lt;/strong&gt;. Again, remember that this had to be done in a way that would solve the on-call problem.
Should we go for a central team? It would make staffing easier, because we’d need fewer SREs. But we were already too big for that (our headcount had grown to 1,000+), so odds were that we wouldn’t be effective.
Should we distribute one SRE per team? The scope would be too large for the lone SREs. Not to mention that, over time, they’d likely become the Ops engineer for the team they were in.
It was agreed that we would need several SRE teams. But that still raised the question: at what granularity would we create SRE teams? In the end we went with &lt;strong&gt;one SRE team per Product Cluster&lt;/strong&gt;. This would give SREs end-to-end responsibility over a domain, without having too wide a scope.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SRE team structures" src="https://engineering.zalando.com/posts/2021/09/images/sre-team-structures.jpg"&gt;&lt;/p&gt;
&lt;p&gt;There was another concern around the reporting chain. This was an easy discussion, as we quickly converged on following the &lt;a href="https://sre.google/workbook/how-sre-relates/#consider-reliability-work-as-a-specialized-role"&gt;guidance in the SRE Workbook&lt;/a&gt;: consider reliability work a specialized role, and keep SREs separate from the product delivery teams.&lt;/p&gt;
&lt;p&gt;To further gauge the interest in the SRE role and mindset, we sent out a survey to our engineering Org. In that survey we included a description of the desired profile for an SRE. That profile included: &lt;strong&gt;Software engineering, Operational mindset, Systems engineering, Software architecture skills, Troubleshooting skills&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Survey to gauge SRE interest" src="https://engineering.zalando.com/posts/2021/09/images/sre-survey.png"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Survey to gauge SRE interest&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;The survey results also gave us an idea of the talent pool that might be interested in a move to an SRE role. To further promote the role and the initiative within the company, several talks were given across the company and its different hubs, which, at the time, already included Helsinki, Dublin, and Dortmund.&lt;/p&gt;
&lt;p&gt;With few engineers able to fit that profile, we had to be smart about where to start rolling out SRE. Ideally, we would start with the area with the most need for SRE practices. But to know which area that would be, we first had to measure the health of the different products at Zalando, so that we could then prioritize.
Fortunately, at the core of SRE we have &lt;a href="https://sre.google/sre-book/service-level-objectives/"&gt;Service Level Objectives&lt;/a&gt; (SLOs) and Service Level Indicators (SLIs). Since there was no standardized way of measuring availability, the first thing the team working on the SRE initiative decided to do was to roll out SLOs and SLIs. Workshops were conducted across the company for Engineers and Product Managers, and the first SLO reporting tool (SLR) was developed.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zalando’s SRE Logo" src="https://engineering.zalando.com/posts/2021/09/images/sre-logo.png#center"&gt;&lt;/p&gt;
&lt;figcaption style="text-align:center"&gt;Zalando’s SRE Logo&lt;/figcaption&gt;
&lt;p&gt;&lt;br/&gt;&lt;/p&gt;
&lt;p&gt;To further demonstrate the &lt;strong&gt;educational benefit of SRE&lt;/strong&gt;, the SRE program team ran Reliability Workshops as part of &lt;a href="https://engineering.zalando.com/tags/cyber-week.html"&gt;Cyber Week&lt;/a&gt; preparations to discuss and review Reliability Patterns for the more critical services. In those Reliability Workshops we covered Retry Strategies, Circuit Breakers and Fallbacks.&lt;/p&gt;
&lt;p&gt;Many services did have SLOs defined and collected, but these still did not end up influencing the software development process. The vast majority of SLOs were defined through initiatives from Engineers. But in a microservice architecture, a product is implemented by multiple services. Product Managers had a hard time establishing a link between the different SLOs and their own expectations for the products they were responsible for. Management was kept in the loop, but not directly involved, so there was no real motivation for management to uphold the SLOs.&lt;/p&gt;
&lt;p&gt;Senior Management agreed that SRE concepts like SLOs and reliability patterns were much-needed practices, and that teams should continue adopting them. However, there was a clear preference to keep building the missing operational capabilities in the Delivery Teams. &lt;strong&gt;The way chosen to kickstart that capability building was to put each delivery team on-call for the critical services they owned. This decision was fundamental to properly establish the “you build it, you run it” mentality we still have today.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With teams now responsible 24/7 for their own services, the plans for Zalando SRE would necessarily have to change. Join us for the next chapter of our series to learn more about the next steps of this journey.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;EDIT 1: No reason to stop reading here. The second part of our series is already available &lt;a href="https://engineering.zalando.com/posts/2021/09/sre-journey-part2.html"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;We did have some engineers with end to end responsibility. They would deploy, monitor and even be on-call for the services of their respective area. This was not standardized in the company, and it would depend greatly on the leadership of their respective teams.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="SRE"/><category term="Backend"/></entry><entry><title>Micro Frontends: Deep Dive into Rendering Engine (Part 2)</title><link href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html" rel="alternate"/><published>2021-09-09T00:00:00+02:00</published><updated>2021-09-09T00:00:00+02:00</updated><author><name>Jan Brockmeyer</name></author><id>tag:engineering.zalando.com,2021-09-09:/posts/2021/09/micro-frontends-part2.html</id><summary type="html">&lt;p&gt;Learn all the details about Rendering Engine - from data fetching to layout composition.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Zalando's &lt;a href="https://en.zalando.de/"&gt;Fashion Store&lt;/a&gt; has been running on top of microservices for quite some time already. This architecture has proven to be very flexible, and project &lt;a href="https://www.mosaic9.org/"&gt;Mosaic&lt;/a&gt; has extended it – although partially – to the frontend, allowing HTML fragments from multiple services to be stitched together, and served as a single page.&lt;/p&gt;
&lt;p&gt;Fragments in Mosaic can be seen as the first step towards a Micro Frontends architecture. With the ambitions of the Interface Framework as presented in &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;the first blog post&lt;/a&gt;, we did not want to stop at serving multiple HTML pieces. We wanted more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Implemented once, works anywhere&lt;/em&gt; - UI blocks should work in different contexts and be context-aware, not context-bound.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Declarative data dependencies&lt;/em&gt; - Components get the data they need but do not re-implement data fetching over and over.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Simplified A/B Testing&lt;/em&gt; - Zalando's decisions are data driven, so experimentation is at the core of our decision making. Running an A/B test that spans multiple pages and user flows should be possible with minimal alignment and zero delivery interruption.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Feels like Zalando&lt;/em&gt; - We want a consistent and accessible look and feel for all user journeys, and the ability to experiment with design quickly across multiple user flows.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Power to the engineers&lt;/em&gt; - Any developer should be able to contribute to the entire Fashion Store experience. This means universal tooling and setup, first-class React integration, easy testing (also for work-in-progress code), and continuous integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's how Renderers came to be.&lt;/p&gt;
&lt;h2&gt;Introducing Renderers&lt;/h2&gt;
&lt;p&gt;A Renderer is a self-contained JavaScript module that runs inside the Rendering Engine framework. It fully relies on the framework to encapsulate all the implementation details like data fetching and layout composition.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Architecture of Interface Framework" src="https://engineering.zalando.com/posts/2021/09/images/rengine-architecture_if.png"&gt;&lt;/p&gt;
&lt;p&gt;A Renderer declares its data dependencies using GraphQL queries and, based on that data, provides a visual representation of a single Entity type (check &lt;a href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html"&gt;Part 1&lt;/a&gt; for a detailed explanation on Entities).&lt;/p&gt;
&lt;p&gt;This visual representation is a React component, but data management and layout composition are handled solely by the Rendering Engine framework.&lt;/p&gt;
&lt;p&gt;So, Renderers are visualisation components for Entities.&lt;/p&gt;
&lt;p&gt;The mapping of Entities to Renderers is one-to-many, since different visual representations may exist for a given entity type. A Product Entity, for example, can be represented as a detailed product page, or as a compact card component in collection view. Each Renderer, on the other hand, corresponds to one specific entity type only.&lt;/p&gt;
&lt;p&gt;All Renderers share some important properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Renderers are &lt;em&gt;composable&lt;/em&gt;. A Renderer is able to embed other Renderers as children, or be embedded by other Renderers.&lt;/li&gt;
&lt;li&gt;Renderers are &lt;em&gt;declarative&lt;/em&gt;. They specify their dependencies and behaviour but delegate all implementation to the Rendering Engine, the framework that runs them.&lt;/li&gt;
&lt;li&gt;Renderers are &lt;em&gt;self-sufficient&lt;/em&gt;. A Renderer can visualise its Entity no matter on which page or in which context it appears. This ensures that the choice and arrangement of Renderers remains as flexible as possible.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Enabling dynamic content for Zalando’s mobile apps&lt;/h2&gt;
&lt;p&gt;Project Mosaic was solely focused on the web. However, Zalando offers its Fashion Store as two experiences: the Web and the Native Apps. Since they share most parts of the user journey, it was natural to explore if the Apps could benefit from a system based on Entities and Renderers, too.&lt;/p&gt;
&lt;p&gt;We knew it would be too much of a stretch for Mosaic fragments. But there's literally nothing that binds Renderers specifically to the Web!&lt;/p&gt;
&lt;p&gt;In the Zalando app, we had already implemented server-side layout steering for some parts of the application experience such as the main App landing page. Instead of relying on hardcoded views, the app would receive layouts from a remote Zalando server over the network. The preferred format here would be JSON, but otherwise the same challenges were present: we wanted dynamic, personalizable UIs with declarative data dependencies.&lt;/p&gt;
&lt;p&gt;If Renderers were able to output JSON instead of HTML, we could reuse the same rendering core as for the web with the same benefits.
Our Renderers relied on React for their output. To cover the app-specific use case, we added a custom React reconciler that consumes custom React elements and outputs app-compatible JSON instead of HTML. Now, web developers can contribute Native App features by reusing the same set of APIs they use to deliver web experiences, which brings the web and native app experiences closer together. All the existing tools, infrastructure support, and the constantly evolving platform APIs are now shared.&lt;/p&gt;
&lt;h2&gt;The life of a Renderer&lt;/h2&gt;
&lt;p&gt;So, how does it look under the hood?&lt;/p&gt;
&lt;p&gt;We decided to organise the Renderers API as a set of so-called lifecycle methods, each accepting a function that declares the Renderer's behaviour for a given context or case. All Renderers are implemented using TypeScript.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of collection carousel Renderer" src="https://engineering.zalando.com/posts/2021/09/images/rengine-carousel.png"&gt;&lt;/p&gt;
&lt;p&gt;Let’s have a look at a simplified version of a collection carousel Renderer:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MOVE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@tracking/event-names&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SimpleCarousel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@dx/react-carousel-tile&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ViewTracker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;@if/rendering-engine/api&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;React&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;react&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./query.graphql&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withQueries&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;carousel&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withProcessDependencies&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;===&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;No collection data found.&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;render&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getCollectionEntities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;withRender&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ViewTracker&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;SimpleCarousel&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{...&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;onNextClickCarousel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;track&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;MOVE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/SimpleCarousel&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;/ViewTracker&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Renderers are implemented using the &lt;a href="https://en.wikipedia.org/wiki/Fluent_interface"&gt;fluent interface&lt;/a&gt; approach. By calling the &lt;code&gt;tile()&lt;/code&gt; function of the Rendering Engine API, we are setting up a Renderer that defines various &lt;em&gt;lifecycle methods&lt;/em&gt;. Each method receives a function that encapsulates the associated behaviour and has fully typed interfaces. Since renderers are declarative, they do not execute any of the lifecycle methods themselves. Instead, the Rendering Engine framework runs all of them, in due order and context, fetches data and dependencies, and passes the output down to other methods when necessary.&lt;/p&gt;
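&lt;p&gt;To make the fluent, declarative style more concrete, here is a minimal sketch of how such a builder could work. It is not the actual Rendering Engine API: &lt;code&gt;miniTile&lt;/code&gt;, &lt;code&gt;TileBuilder&lt;/code&gt;, &lt;code&gt;runRenderer&lt;/code&gt; and all type shapes are hypothetical, and rendering is reduced to returning a string. The point is only that each method stores a callback and returns the builder, while a separate framework function decides when the callbacks run.&lt;/p&gt;

```typescript
// Hypothetical, simplified types: the real Rendering Engine API is not shown in this post.
type Entity = { id: string; type: string };
type QueryMap = {
  [name: string]: { query: string; variables: { [key: string]: string } };
};
type RendererDefinition = {
  queries?: (args: { entity: Entity }) => QueryMap;
  render?: (props: { data: unknown }) => string;
};

class TileBuilder {
  // The builder only records the callbacks; it never calls them itself.
  readonly definition: RendererDefinition = {};

  withQueries(fn: (args: { entity: Entity }) => QueryMap): this {
    this.definition.queries = fn;
    return this; // returning `this` is what makes the calls chainable
  }

  withRender(fn: (props: { data: unknown }) => string): this {
    this.definition.render = fn;
    return this;
  }
}

function miniTile(): TileBuilder {
  return new TileBuilder();
}

// Stand-in for the framework: it decides when each lifecycle method runs.
function runRenderer(
  builder: TileBuilder,
  entity: Entity,
  fetchData: (queries: QueryMap) => unknown
): string {
  const { queries, render } = builder.definition;
  const data = queries ? fetchData(queries({ entity })) : null;
  return render ? render({ data }) : "";
}

const carousel = miniTile()
  .withQueries(({ entity }) => ({
    carousel: {
      query: "query Carousel { collection { title } }", // made-up query text
      variables: { id: entity.id },
    },
  }))
  .withRender(({ data }) => "carousel:" + JSON.stringify(data));

// The fake fetcher just echoes which queries were requested.
const output = runRenderer(
  carousel,
  { id: "c1", type: "collection" },
  (queries) => ({ requested: Object.keys(queries) })
);
```

&lt;p&gt;Because the Renderer never invokes its own callbacks, the framework stays free to reorder, batch or skip steps without any change to the Renderer code.&lt;/p&gt;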
&lt;p&gt;The most important lifecycle methods are:&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;withQueries&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Declares a data dependency via a GraphQL query. Data is fetched automatically by the framework and is available when the other lifecycle methods are called.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;withProcessDependencies&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Based on data delivered by &lt;code&gt;withQueries&lt;/code&gt;, defines the further action (render, error etc.) and allows data pre-processing; the result is then passed to the &lt;code&gt;withRender&lt;/code&gt; method. The chosen action tells the Rendering Engine whether the Renderer should be rendered, redirect, or be displayed in an error state.&lt;/p&gt;
&lt;p&gt;This lifecycle method is also responsible for specifying child entities of the current Renderer. In this example we want to display the collection entities as outfit or product cards based on their entity type. It is important to note that a given renderer does not know which renderers will be used for its child entities.&lt;/p&gt;
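&lt;p&gt;One way to picture how a framework could act on the returned action is a discriminated union. The sketch below is an assumption, not the Rendering Engine's real types: the &lt;code&gt;ProcessResult&lt;/code&gt; shape, the &lt;code&gt;location&lt;/code&gt; field and &lt;code&gt;handleResult&lt;/code&gt; are hypothetical, and the handler only returns a description of what would happen.&lt;/p&gt;

```typescript
// Hypothetical result shapes, modelled on the actions named in the text.
type ProcessResult =
  | { action: "render"; data: unknown }
  | { action: "redirect"; location: string }
  | { action: "error"; message: string };

// The compiler narrows `result` in each branch, so only the fields that
// belong to the chosen action are accessible there.
function handleResult(result: ProcessResult): string {
  switch (result.action) {
    case "render":
      return "render with " + JSON.stringify(result.data);
    case "redirect":
      return "redirect to " + result.location;
    case "error":
      return "error state: " + result.message;
  }
}

const summary = handleResult({
  action: "error",
  message: "No collection data found.",
});
```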
&lt;h3&gt;&lt;code&gt;withRender&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Returns the root React component to be used as the Renderer output.&lt;/p&gt;
&lt;p&gt;For the Web, this is transformed into HTML and rendered on the server (SSR). Later on, the markup is hydrated on the client side with the data. For the Apps, we use a custom React reconciler and custom (non-Web) components to output JSON instead of HTML. However, most of the data flow, dev tooling and infrastructure remain the same for both use cases.&lt;/p&gt;
&lt;p&gt;Renderers also support more advanced features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Progressive Hydration:&lt;/strong&gt; we can mark specific renderers to be hydrated early, i.e. kick off their React hydration as fast as possible on the client side, making their content interactive before that of their parent renderer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Splitting:&lt;/strong&gt; we only load and parse the Renderers needed on a given, personalised page, which gives us good performance out of the box.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Renderer State:&lt;/strong&gt; Renderers have access to a local Renderer State. The concept is similar to &lt;a href="https://reactjs.org/docs/react-component.html#setstate"&gt;React’s setState&lt;/a&gt;. It enables you to re-run renderer lifecycle methods, for example to fetch additional data and re-render the updated child entities. The "classical" React state can still be used via React Hooks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Data sharing&lt;/h3&gt;
&lt;p&gt;Renderers are not intended to share data based on client-side state with each other. We want to avoid unwanted data coupling and allow Renderers to be reused in other contexts with minimal risk.&lt;/p&gt;
&lt;p&gt;Renderers have access to Zalando’s GraphQL Mutation APIs, which allow remote data to be modified. Since all Renderers use the same data schema for their data dependencies, they can subscribe to changes in the schema to limit the need for cross-renderer communication.&lt;/p&gt;
&lt;h2&gt;Rendering Engine&lt;/h2&gt;
&lt;p&gt;Rendering Engine is the framework powering the Renderers. It is a backend service written in TypeScript and running in Node.js, coupled to a client-side JavaScript module that runs in the browser.&lt;/p&gt;
&lt;p&gt;Rendering Engine encapsulates all the complexity and implementation details for the declarative Renderers. It processes incoming customer requests, matches Entities to Renderers, fetches data and other dependencies such as A/B testing assignments, asynchronously renders the response and delivers it back to the Web and Native App clients.&lt;/p&gt;
&lt;p&gt;The following sections describe the main responsibilities of Rendering Engine.&lt;/p&gt;
&lt;h3&gt;UI Composition&lt;/h3&gt;
&lt;p&gt;All layouts in Interface Framework are represented as trees of nested entities that are visualized using the matching Renderers. The mapping of Entities to Renderers is fully described by a set of rendering rules.&lt;/p&gt;
&lt;p&gt;In computer science terms, Rendering Engine recursively and asynchronously transforms a tree of entities into a tree of UI elements. On each step, it takes an entity node and its metadata as input, outputs a UI node plus zero or more child entity nodes, and then recurs over children.&lt;/p&gt;
&lt;p&gt;The page rendering always starts with an Entity. We call it the Root Entity since it typically defines what the page is about. After the Rendering Engine receives a request, it extracts the root Entity from the request headers and looks up a matching Renderer. Once a Renderer is found, the Rendering Engine runs the Renderer lifecycle methods to fetch data. In case there are any child entities associated with this Renderer, the same resolution process happens recursively. Thus, each Renderer may "suggest" which entities should be rendered as its children, but has no control over the actual renderer choice. That choice is based exclusively on the Rendering Rules.&lt;/p&gt;
&lt;p&gt;The important part here is that we do not block the resolution process. As soon as the entity is matched to a Renderer and the data resolved, the Rendering Engine kicks off the rendering process and starts streaming the HTML content to the client.&lt;/p&gt;
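&lt;p&gt;The recursive resolution described above can be sketched roughly as follows. This is a deliberately simplified, synchronous illustration: the real engine resolves data asynchronously and streams output, and the rule format, renderer names and the &lt;code&gt;suggestChildren&lt;/code&gt; helper are all hypothetical. What it does show is the division of labour: a Renderer suggests child entities, but only the rules table decides which Renderer visualises them.&lt;/p&gt;

```typescript
// Hypothetical shapes: entity types, rule format and renderer names are made up.
type Entity = { id: string; type: string };
type Renderer = {
  name: string;
  // A renderer may suggest child entities, but never picks their renderers.
  suggestChildren: (entity: Entity) => Entity[];
};
type RenderingRules = { [entityType: string]: Renderer };
type UiNode = { renderer: string; entityId: string; children: UiNode[] };

const noChildren = (): Entity[] => [];

const rules: RenderingRules = {
  collection: {
    name: "collection-carousel",
    suggestChildren: () => [
      { id: "p1", type: "product" },
      { id: "o1", type: "outfit" },
    ],
  },
  product: { name: "product-card", suggestChildren: noChildren },
  outfit: { name: "outfit-card", suggestChildren: noChildren },
};

// Turn one entity into one UI node, then recurse over the suggested children.
function compose(entity: Entity, table: RenderingRules): UiNode {
  const renderer = table[entity.type];
  if (renderer === undefined) {
    throw new Error("No rendering rule for entity type: " + entity.type);
  }
  const children = renderer
    .suggestChildren(entity)
    .map((child) => compose(child, table));
  return { renderer: renderer.name, entityId: entity.id, children };
}

// The root entity defines what the page is about.
const page = compose({ id: "c1", type: "collection" }, rules);
```

&lt;p&gt;Swapping the entry for &lt;code&gt;product&lt;/code&gt; in the rules table would change how product children appear everywhere, without touching the carousel Renderer.&lt;/p&gt;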
&lt;h3&gt;Data Fetching&lt;/h3&gt;
&lt;p&gt;The Rendering Engine takes care of executing the GraphQL queries against the Fashion Store API (FSA). It uses an implementation of &lt;a href="https://github.com/zalando-incubator/perron"&gt;Perron&lt;/a&gt;, a data client with built-in support for circuit breakers, error handling and retries.&lt;/p&gt;
&lt;p&gt;All queries to the FSA are batched and cached based on a &lt;a href="https://github.com/graphql/dataloader"&gt;DataLoader&lt;/a&gt; implementation. This prevents duplicate calls to backends during the same request.&lt;/p&gt;
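&lt;p&gt;The idea behind request-scoped batching and caching can be illustrated with a small sketch. It does not use the DataLoader API and is synchronous for brevity (DataLoader itself collects keys asynchronously); &lt;code&gt;RequestScopedLoader&lt;/code&gt;, &lt;code&gt;request&lt;/code&gt;, &lt;code&gt;flush&lt;/code&gt; and the key format are hypothetical names chosen for this example.&lt;/p&gt;

```typescript
// Hypothetical names throughout; this is not the DataLoader API.
type BatchFn = (keys: string[]) => { [key: string]: string };

class RequestScopedLoader {
  private cache: { [key: string]: string } = {};
  private pending: string[] = [];
  private batchFn: BatchFn;
  batchCalls = 0;

  constructor(batchFn: BatchFn) {
    this.batchFn = batchFn;
  }

  private isCached(key: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.cache, key);
  }

  // Register interest in a key; cached and already pending keys are ignored.
  request(key: string): void {
    if (this.isCached(key) || this.pending.indexOf(key) !== -1) {
      return;
    }
    this.pending.push(key);
  }

  // Fetch all pending keys with a single backend call and cache the results.
  flush(): void {
    if (this.pending.length === 0) {
      return;
    }
    const keys = this.pending;
    this.pending = [];
    const results = this.batchFn(keys);
    this.batchCalls += 1;
    keys.forEach((key) => {
      this.cache[key] = results[key];
    });
  }

  get(key: string): string | undefined {
    return this.isCached(key) ? this.cache[key] : undefined;
  }
}

// A fake backend that records every batch it receives.
const seenBatches: string[][] = [];
const loader = new RequestScopedLoader((keys) => {
  seenBatches.push(keys);
  const out: { [key: string]: string } = {};
  keys.forEach((key) => {
    out[key] = "data:" + key;
  });
  return out;
});

loader.request("product:1");
loader.request("product:2");
loader.request("product:1"); // duplicate, ignored
loader.flush(); // one backend call for the two unique keys
loader.request("product:1"); // already cached, so nothing becomes pending
loader.flush(); // no second backend call
```

&lt;p&gt;Creating one loader per incoming request keeps the cache short-lived, so deduplication never leaks data between customers.&lt;/p&gt;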
&lt;h3&gt;Universal Rendering&lt;/h3&gt;
&lt;p&gt;Since Zalando is an e-commerce platform, a typical web page consists mostly of static content with islands of interactivity, and we aim to serve content as fast as possible. This is why Rendering Engine was built from the ground up with full Server-Side Rendering (SSR) support. Each Renderer first generates its markup on the server; the Rendering Engine stitches it all together and streams the HTML to the client, which then hydrates the components using our runtime module.&lt;/p&gt;
&lt;p&gt;For the Web use case, we provide additional Zalando-specific APIs which add interactivity, mutate data if necessary, lazy-load extra content, etc. For the Native app, the Rendering Engine only serves the JSON markup and the actual rendering happens in App clients for iOS and Android.&lt;/p&gt;
&lt;h3&gt;Mosaic backward compatibility&lt;/h3&gt;
&lt;p&gt;We knew that the migration from Mosaic to Interface Framework would not happen in a day. Our Mosaic codebase was extensive and actively maintained. Therefore, the Rendering Engine allowed Mosaic fragments to be used directly inside Renderers.&lt;/p&gt;
&lt;p&gt;This made our migration path very smooth. In fact, we now view Mosaic fragments as a powerful API our framework supports, and we still use them sometimes. In addition, this opened up extra integration and observability benefits for the legacy implementations.&lt;/p&gt;
&lt;h3&gt;Monitoring and Tracing&lt;/h3&gt;
&lt;p&gt;Improved observability is yet another benefit of the integrated platform. The Rendering Engine automatically collects and reports &lt;a href="https://web.dev/vitals/"&gt;Web Vitals&lt;/a&gt; so that we can correlate performance variations with code changes. A number of custom client-side metrics are also collected. All this happens automatically, so developers who contribute to Renderers can focus on the customer experience.
We also integrate a variety of common enterprise tools for log aggregation, OpenTracing and client-side error monitoring, with zero integration time for the Renderer developers.&lt;/p&gt;
&lt;h3&gt;Developer Experience&lt;/h3&gt;
&lt;p&gt;Rendering Engine focuses on providing a great developer experience with the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Local Development Environment:&lt;/strong&gt; the framework provides an integrated development server and an on-demand compilation of Renderers. It only builds the Renderers that are shown on the current page. This ensures fast build times even when more and more Renderers are added to the application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple version support:&lt;/strong&gt; Rendering Engine uses the Zalando Design System as a UI component library. The UI components are defined as dependencies for each particular Renderer. To allow greater flexibility, it supports using multiple versions side by side and includes tools and hooks that simplify version maintenance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Integration &amp;amp; Deployment:&lt;/strong&gt; New code changes get tested and built automatically with specific performance reports for every page. These reports include bundle sizes and Lighthouse metrics. The deployments to Kubernetes happen continuously in preview and production environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic Persisted Queries:&lt;/strong&gt; all GraphQL queries to the Fashion Store API are persisted on the server side together with a unique identifier. It helps reduce the request size, since the Rendering Engine client runtime sends the identifier instead of the whole query string.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Localization:&lt;/strong&gt; Rendering Engine supports localized bits of text inside Renderers.&lt;/li&gt;
&lt;/ul&gt;
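&lt;p&gt;The persisted-query mechanism from the list above can be sketched as follows. Everything here is an assumption made for illustration: the function names, the in-memory store, and the FNV-1a hash, which is used only to keep the example dependency-free. Persisted-query setups commonly derive the identifier from a cryptographic hash such as SHA-256 instead.&lt;/p&gt;

```typescript
// Hypothetical sketch of persisted queries; not the Rendering Engine's code.
// Derive a short, deterministic identifier from the query text (32-bit FNV-1a).
function queryId(query: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  query.split("").forEach((ch) => {
    hash ^= ch.charCodeAt(0);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  });
  return (hash >>> 0).toString(16); // unsigned hex string
}

// Build step: store each query on the server side under its identifier.
const persisted: { [id: string]: string } = Object.create(null);
function persist(query: string): string {
  const id = queryId(query);
  persisted[id] = query;
  return id;
}

// Server side: resolve an identifier sent by the client back to the query.
function lookup(id: string): string | undefined {
  return persisted[id];
}

// Usage: the client sends only the identifier, not the whole query string.
const outfitQuery = '{ outfit(id: "4NXOAez0Qti") { id creator { variant { name } } } }';
const outfitQueryId = persist(outfitQuery);
const resolvedQuery = lookup(outfitQueryId);
```

&lt;p&gt;The identifier is at most eight hexadecimal characters here, so the saving in request size grows with the length of the query.&lt;/p&gt;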
&lt;h2&gt;Page Rendering Explained&lt;/h2&gt;
&lt;p&gt;Let’s have a look at what happens in Interface Framework at a high level when you visit a page on the Zalando website. In this example, the user visits an outfit view by choosing one from Zalando’s &lt;a href="https://en.zalando.de/get-the-look-women/"&gt;Get the Look&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data flow for page rendering" src="https://engineering.zalando.com/posts/2021/09/images/rengine-data-flow.png"&gt;&lt;/p&gt;
&lt;p&gt;The request gets picked up by &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;, which is an HTTP router and reverse proxy for service composition. Skipper identifies the matching route and forwards the request to the Rendering Engine along with the entity parameters:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;outfit&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;ern:outfit::4NXOAez0Qti&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Rendering Engine receives the request with the entity above, which is called the root entity. The root entity defines the main content of the page. Based on the Rendering Rules, a matching Renderer is selected for this root entity.&lt;/p&gt;
&lt;p&gt;For the outfit page, the set of Rendering Rules looks like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;outfitViewRule&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RenderingRule&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit_view&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit_highlight-b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;product_horizontal-highlight-product-card&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;collection&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;collection_simple-carousel&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nx"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outfit_outfit-card&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Renderer for the root entity is the Outfit View Renderer. We can refer to it as the top-level or root Renderer for the request. The Renderer has a data dependency in the form of the following GraphQL query.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;outfit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;4NXOAez0Qti&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nx"&gt;creator&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nx"&gt;relevantEntities&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nx"&gt;edges&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The query is executed in the &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;Fashion Store API&lt;/a&gt; and various parts of the query go through different resolvers depending on the fields that are present. Each of the resolvers then calls one or many microservices that provide data.&lt;/p&gt;
&lt;p&gt;In our example, we ask for the creator’s name of the outfit together with two relevant entities. One resolver will call the Recommendation System to get the relevant entities for this outfit. Here, our relevant entities are a collection with other outfits from the same creator and a collection with outfits that look similar.&lt;/p&gt;
&lt;p&gt;Each Renderer decides which relevant entities appear as its children and adds placeholders for them. This is achieved via the &lt;code&gt;withProcessDependencies&lt;/code&gt; lifecycle method.
The Rendering Engine picks up all relevant entities and determines matching Renderers. For each of these nested Renderers, the process repeats recursively until no more nested entities must be rendered.&lt;/p&gt;
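&lt;p&gt;The recursive resolution described above can be sketched roughly as follows. The types are modelled on the &lt;code&gt;outfitViewRule&lt;/code&gt; example, but &lt;code&gt;SketchRule&lt;/code&gt;, &lt;code&gt;SketchEntity&lt;/code&gt;, &lt;code&gt;RenderNode&lt;/code&gt; and &lt;code&gt;resolveTree&lt;/code&gt; are hypothetical names, and matching children only against the child rules of the matched rule is an assumption about how the Rendering Rules are evaluated.&lt;/p&gt;

```typescript
// Hypothetical sketch of recursive Renderer resolution; not the real engine.
type SketchRule = {
  selector: { entity: string };
  renderer: string;
  children?: SketchRule[];
};

// An entity plus the child entities its Renderer suggested.
type SketchEntity = { type: string; children?: SketchEntity[] };

type RenderNode = { renderer: string; children: RenderNode[] };

// Match one entity against the candidate rules, then recurse into its
// suggested children using only the matched rule's child rules.
function resolveTree(entity: SketchEntity, rules: SketchRule[]): RenderNode | undefined {
  const rule = rules.find((r) => r.selector.entity === entity.type);
  if (rule === undefined) return undefined; // no Renderer for this entity
  const children: RenderNode[] = [];
  for (const child of entity.children || []) {
    const node = resolveTree(child, rule.children || []);
    if (node !== undefined) children.push(node);
  }
  return { renderer: rule.renderer, children };
}

// Usage with a trimmed version of the outfit rules from the article.
const rules: SketchRule[] = [
  {
    selector: { entity: "outfit" },
    renderer: "outfit_view",
    children: [
      {
        selector: { entity: "collection" },
        renderer: "collection_simple-carousel",
        children: [{ selector: { entity: "outfit" }, renderer: "outfit_outfit-card" }],
      },
    ],
  },
];

const page = resolveTree(
  { type: "outfit", children: [{ type: "collection", children: [{ type: "outfit" }] }] },
  rules
);
```

&lt;p&gt;Note that the same entity type, &lt;code&gt;outfit&lt;/code&gt;, resolves to &lt;code&gt;outfit_view&lt;/code&gt; at the root and to &lt;code&gt;outfit_outfit-card&lt;/code&gt; inside the carousel: the Renderer is chosen by the position in the rules, not by the parent Renderer.&lt;/p&gt;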
&lt;p&gt;After all the Renderers and their data dependencies are collected, the Rendering Engine renders the React components of each Renderer and streams the content to the client.
The next picture shows a sketch of the outfit page that is divided into the corresponding Renderers. Each Renderer is responsible for one part of the page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hierarchy of Renderers for the outfit page" src="https://engineering.zalando.com/posts/2021/09/images/rengine-outfit-page.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We have presented a deep dive into Rendering Engine with all its key functionalities. The final part of this blog series will cover a comparison between Mosaic and Interface Framework and what we have learned during the migration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2023/07&lt;/strong&gt;: See &lt;a href="https://engineering.zalando.com/posts/2023/07/rendering-engine-tales-road-to-concurrent-react.html"&gt;Rendering Engine Tales: Road to Concurrent React&lt;/a&gt; for an update on Rendering Engine and how we integrated React Concurrent features as part of our upgrade to React 18.&lt;/p&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="Microservices"/><category term="GraphQL"/><category term="TypeScript"/><category term="Backend"/></entry><entry><title>Using Internal Mobility For Growth</title><link href="https://engineering.zalando.com/posts/2021/09/internal-mobility.html" rel="alternate"/><published>2021-09-02T00:00:00+02:00</published><updated>2021-09-02T00:00:00+02:00</updated><author><name>Gary Rafferty</name></author><id>tag:engineering.zalando.com,2021-09-02:/posts/2021/09/internal-mobility.html</id><summary type="html">&lt;p&gt;A look at how Zalando Direct, our B2B marketplace, is using Internal Mobility as a catalyst for growth.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Long time readers of this blog will remember that back in 2019, we &lt;a href="https://engineering.zalando.com/posts/2019/03/rotating-engineers-at-zalando.html"&gt;published a feature&lt;/a&gt; on the benefits of rotating engineers between teams. For those of you who have not seen it, the article described an initiative that aimed to establish cross-functional knowledge sharing, encourage cross team collaboration, and bring greater product awareness, by providing engineers with an opportunity to work on different teams within our Developer Productivity department.&lt;/p&gt;
&lt;p&gt;Within Zalando, we are incredibly passionate about enabling our engineers to progress and to develop. This empowerment and growth mindset is deeply woven into our fabric. Take a peek at &lt;a href="https://jobs.zalando.com/en/our-founding-mindset/"&gt;Our Founding Mindset&lt;/a&gt;. Four of them are focused on empowerment.
I myself am particularly drawn to &lt;strong&gt;#makeUsBetterNotBigger&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Let’s take a look at how another of our business units, Zalando Direct, our B2B marketplace, is using Internal Mobility as a catalyst for development.
Within the unit, the leadership team maintains a directory of opportunities that are used to foster growth within engineers. This repository covers community driven initiatives such as our architecture review groups and our weekly hacking sessions, in addition to our department driven topics and task forces such as improving observability of systems. One development opportunity is Internal Mobility.&lt;/p&gt;
&lt;p&gt;Internal Mobility is described as an exciting avenue for growth that enables engineers to join a different team on either a fixed-length assignment, or on a permanent basis. In this article, I would like to focus on the former, which was our most recent success story.
This story involved a Frontend Engineer who had been with Zalando Direct for over one year, and was joining my team on a short-term assignment for one month.&lt;/p&gt;
&lt;p&gt;The goals of the team swap were to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Provide a solid opportunity to expand knowledge and expertise by contributing to a new domain.&lt;/li&gt;
&lt;li&gt;Provide the destination team with an experienced extra engineer to contribute to their large and growing backlog.&lt;/li&gt;
&lt;li&gt;Further highlight that Internal Mobility should be used to successfully provide a development opportunity for our engineers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Kicking Things Off&lt;/h2&gt;
&lt;p&gt;The engineer’s lead initiated the assignment, so let’s understand what that entails. First and foremost, it is imperative that our engineer is comfortable with, and excited about, the opportunity. Taking ownership of one’s own career progression and personal development is something that I look for when an engineer is on a seniority trajectory. I am always more than happy to double down my investment in them if I know that it will be maximised.&lt;/p&gt;
&lt;p&gt;Thereafter, it is important to agree on scope and duration. Engineers know that diving into an unscoped project is a fool’s errand, and this is no different. Up front, it is important to be clear on what is expected from all parties, and what are the boundaries. In this case, it was agreed that the duration would be one month, and that the scope was to work on a particular area of partner-facing functionality within our platform, zDirect. For some additional context, zDirect is a web application that enables our partners to grow and steer their business on Zalando.&lt;/p&gt;
&lt;h2&gt;Onboarding&lt;/h2&gt;
&lt;p&gt;Onboarding a new joiner to our team is always a great opportunity to critically assess how well our process works. One factor that can accelerate onboarding is the new joiner’s familiarity with the languages and tools. We were able to keep the tech stack unified, which is a subset of the technologies sponsored by Zalando as part of the &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;tech radar&lt;/a&gt;. This, coupled with the engineer’s understanding of the ecosystem, meant that we were able to get up and running in no time at all. Additionally, we got some incredibly helpful feedback that enabled us to improve our onboarding documentation. Given that we are growing at an incredible pace, streamlining the onboarding process for new hires pays dividends on productivity and experience. &lt;strong&gt;Always be squeezing your Time-To-Ship!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From this point onwards, we had a new team member. They joined all of our ceremonies, paired with their colleagues, and got to grips with the team’s ways of working. Similarly, they attended social settings such as team lunches and activities.
They immediately started shipping value, and right away boosted our team’s throughput. This required collaboration with our engineers, our product manager, and our designer.
We do not work in isolation, and this is an important aspect of the assignment. Please don’t extract somebody from their team environment and have them work alone. A &lt;a href="https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/"&gt;well known study&lt;/a&gt; on team dynamics stated that &lt;em&gt;“Who is on a team matters less than how the team members interact, structure their work, and view their contributions”&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Use this opportunity to solidify your team and to hone the dynamics of collaboration.&lt;/p&gt;
&lt;h2&gt;So How Did This Experiment Go?&lt;/h2&gt;
&lt;p&gt;Ultimately, this assignment enabled our team to deliver increased value for our stakeholders. Throughput aside, however, the assignment yielded much more.
As a leader, I thrive from helping my team to succeed. One of the most rewarding stages of this assignment was doing a final retrospective with our new team member. Throughout the process I could see a continuous stream of high quality deliveries, but I wanted to drill down further into the personal experience. To hear that they&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“developed technically, acquired a better understanding of how the business operates, and identified different processes and ideas to bring back to their own team”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;was of course music to my ears. Moreover, they were inspired to go out and enroll in a TypeScript course (we provide every engineer with a healthy training budget to use for their own growth) and incorporate it into their development plan. I like to think of this as the flywheel effect on growth.&lt;/p&gt;
&lt;p&gt;My last question to them was “Would you do it again?”, which was answered with an enthusiastic “Yes”.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Internal mobility assignments are a really effective way to provide engineers with an opportunity to learn new skills, to work in a new domain, and to push themselves out of their comfort zone.&lt;/p&gt;
&lt;p&gt;All experiments come with learning opportunities, and the goal of trying something new is to broaden our understanding and experiences. Two important learnings for us (as receiving team) were that&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We needed to improve our onboarding documentation.&lt;/li&gt;
&lt;li&gt;Engineers should not have to switch back and forth during such an assignment.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the former, our new member was able to pinpoint some gaps in the process, and we have since created an internal ways-of-working document to alleviate this for the next person. For the latter, there was an instance when our new member needed to respond to a topic for his original team, which broke the productivity flow, and led to some context switching. This is something that we will avoid next time.&lt;/p&gt;
&lt;p&gt;Sidenote: Context-switching is a productivity killer. I remember reading &lt;em&gt;Quality Software Management: Systems Thinking&lt;/em&gt;, by Gerald Weinberg, and being horrified by the impact that switching has on delivery.&lt;/p&gt;
&lt;p&gt;That being said, I believe that any endeavour that yields learnings is a successful endeavour. The benefits and learnings that come from internal rotation are in abundance, and I would highly recommend that you try this in your organisation. Presently, we have a number of engineers on different assignments, ranging from weeks to months.&lt;/p&gt;
&lt;p&gt;I opened up this article by referring back to an experiment conducted back in 2019. One of the goals that the authors hoped for was that rotations would become more of a regular thing in Zalando, and it’s awesome to be able to write this piece two years later, and say that, yes it is something that we are doing regularly, and continuously learning from.&lt;/p&gt;</content><category term="Zalando"/><category term="Tech Culture"/><category term="Culture"/></entry><entry><title>Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data</title><link href="https://engineering.zalando.com/posts/2021/07/knowledge-graph-master-data-mdm.html" rel="alternate"/><published>2021-07-29T00:00:00+02:00</published><updated>2021-07-29T00:00:00+02:00</updated><author><name>Katariina Kari</name></author><id>tag:engineering.zalando.com,2021-07-29:/posts/2021/07/knowledge-graph-master-data-mdm.html</id><summary type="html">&lt;p&gt;In the ongoing master data management project the challenge is to create a consolidated golden record of particular business information scattered across multiple systems of different business units. Applying knowledge graph technologies has proven to be an effective means to automatically derive a logical data model for this golden record and improve stakeholder communication.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;The Master Data Management Challenge&lt;/h2&gt;
&lt;p&gt;Master data management (MDM) is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets.&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style.&lt;/p&gt;
&lt;p&gt;Typically, MDM projects are started because an organisation does not have a central view of a specific subject matter. Instead, that information, such as the contact details of a business partner, is scattered across systems, each maintaining its own record of these details, which may or may not agree with the others. In our practical approach, MDM is a set of practices to create a common, shared, and trusted view on data, also called a golden record, for a particular domain. In our MDM project, source systems are identified, their data is consumed, processed through a match and merge process, cleansed and quality assured, and then stored centrally according to a canonical data model. This centrally stored golden record is then published back to the source systems for consideration and possible correction in their respective systems.&lt;/p&gt;
&lt;p&gt;We are currently designing a central MDM component that harmonises the different records into the central and trusted golden record. Its form needs to be defined in a logical data model. This is a set of definitions of tables and columns in which the consolidated record pulled, matched, and merged from the different sources is stored. Deriving this model is usually done manually, which has the following drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The amount of manual work to create the logical data model increases in proportion to the number of system tables.&lt;/li&gt;
&lt;li&gt;Usually, the data models are read and created by colleagues from engineering with limited business know-how.&lt;/li&gt;
&lt;li&gt;The data models of the source records and of the golden record are communicated as technical, textual definition files (an SQL schema or a spreadsheet).&lt;/li&gt;
&lt;li&gt;Business stakeholders who are domain experts find it hard to grasp the contents, and how they relate to each other, from these technical definition files.&lt;/li&gt;
&lt;li&gt;Domain experts are therefore limited in how accurately they can convey their knowledge to the engineers creating the data model, which leads to errors and misunderstandings.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because of these drawbacks, the risk is that an MDM tool is released with a faulty model that needs iterations of rework. As the logical data model is a main driver for the effort of creating an MDM tool, affecting user interface, processes, business rules, and data storage, this risk might have a large impact and delay the delivery of business value.&lt;/p&gt;
&lt;p&gt;Because business and engineering discuss the logical data model through textual technical specification files, effective and efficient data governance decision-making is hindered too, and that decision-making is what makes the golden record trustworthy.&lt;/p&gt;
&lt;p&gt;The logical data model is not the only deliverable in such an MDM project. We also have to deliver the mapping from each system's data model to the golden record's one and define whether mapping can be done directly 1-to-1, or whether it needs to go through some kind of transformation. For example, system A may define an address differently from system B.&lt;/p&gt;
&lt;p&gt;System A:
&lt;em&gt;Address&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;address_line_1&lt;/li&gt;
&lt;li&gt;address_line_2&lt;/li&gt;
&lt;li&gt;address_line_3&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;System B:
&lt;em&gt;Address&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;street&lt;/li&gt;
&lt;li&gt;zip_code&lt;/li&gt;
&lt;li&gt;city&lt;/li&gt;
&lt;li&gt;country_code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The golden record data model needs to define the optimal and correct way to store an address object as well as define how the differing systems' data models map to it. If done manually, this work also increases with the number of system tables.&lt;/p&gt;
&lt;h2&gt;Using Knowledge Graph Technologies&lt;/h2&gt;
&lt;p&gt;In order to improve this manual definition, we made use of knowledge graph technologies by describing all systems' data models in a named directed graph. We then mapped each column of a system to a set of business concepts, such as "address", "contact person", or "business partner". These business concepts have attributes as well as relationships with other concepts. For example, the business partner concept is connected to the address concept as in the image below.&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Business Partner has Address" src="https://engineering.zalando.com/posts/2021/07/images/address-business-partner.png"&gt;&lt;/p&gt;
&lt;p&gt;We are using Neo4j to create these human-readable images about the mappings, since it has, in our opinion, the best look-and-feel in the current landscape of knowledge graph technologies. Most domain experts can read these images much better than the above-mentioned data model definition files. Currently, we are mapping tens of tables and hundreds of columns, so creating images manually would be laborious and error-prone; generating them from the knowledge graph is far more efficient. The number in brackets in the colour legend is the total number of nodes of this type in the knowledge graph.&lt;/p&gt;
&lt;p&gt;For the above mentioned example of system A and B storing address information differently, we can model this in the knowledge graph in the following way. Columns from system A, such as address line 1, 2, and 3, map &lt;em&gt;indirectly&lt;/em&gt; (one-to-many) to the address concept. This means that these columns need to be processed into the MDM system with a transformation algorithm. Columns from system B, however, map &lt;em&gt;directly&lt;/em&gt; (one-to-one) to respective attributes of the address concept. See the image below for an illustration.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Column Mappings for Address" src="https://engineering.zalando.com/posts/2021/07/images/column-mapping.png"&gt;&lt;/p&gt;
&lt;h2&gt;Focusing Manual Work Where it Should Be&lt;/h2&gt;
&lt;p&gt;The only manual work is recording the mapping from systems' tables and columns to business concepts, their attributes, and their relationships. For example, systems A and B are mapped in the following way:&lt;/p&gt;
&lt;p&gt;System A:
&lt;em&gt;Address&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;address id&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, relationship: &lt;strong&gt;has contact&lt;/strong&gt; (target)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;business partner id&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Business Partner&lt;/strong&gt;, relationship: &lt;strong&gt;has contact&lt;/strong&gt; (source)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;address_line_1&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;address_line_2&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;address_line_3&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;System B:
&lt;em&gt;Address&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;id&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, relationship: &lt;strong&gt;has contact&lt;/strong&gt; (target)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;business partner id&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Business Partner&lt;/strong&gt;, relationship: &lt;strong&gt;has contact&lt;/strong&gt; (source)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;street&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, attribute: &lt;strong&gt;street name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;zip_code&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, attribute: &lt;strong&gt;postal code&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;city&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, attribute: &lt;strong&gt;city name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;country_code&lt;/strong&gt; -&amp;gt; concept: &lt;strong&gt;Address&lt;/strong&gt;, attribute: &lt;strong&gt;country code&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that is all that needs to be done manually. A domain expert can provide us with these definitions; the only coordination required is that the exact same names are used for concepts, attributes, and relationships across systems. This is done by cross-referencing the systems' business concepts and unifying their wording.&lt;/p&gt;
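&lt;p&gt;To make the shape of this manual input concrete, here is a minimal Python sketch of how the mappings listed above could be recorded as plain data. It is an illustration only, not our actual tooling: the &lt;code&gt;ColumnMapping&lt;/code&gt; record, its field names, and the classification rule in &lt;code&gt;mapping_kind&lt;/code&gt; are assumptions made for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnMapping:
    """One manually recorded mapping (hypothetical record layout)."""
    system: str
    table: str
    column: str
    concept: str
    attribute: Optional[str] = None     # set when the column maps one-to-one
    relationship: Optional[str] = None  # set for key columns of a relationship
    role: Optional[str] = None          # "source" or "target" of the relationship


def mapping_kind(mapping):
    """Classify a mapping; this rule is an assumption based on the text."""
    if mapping.relationship is not None:
        return "relationship"
    if mapping.attribute is not None:
        return "direct"    # column feeds exactly one attribute
    return "indirect"      # column needs a transformation algorithm


MAPPINGS = [
    # System A: address lines map to the concept only (indirect).
    ColumnMapping("A", "Address", "address id", "Address",
                  relationship="has contact", role="target"),
    ColumnMapping("A", "Address", "business partner id", "Business Partner",
                  relationship="has contact", role="source"),
    ColumnMapping("A", "Address", "address_line_1", "Address"),
    ColumnMapping("A", "Address", "address_line_2", "Address"),
    ColumnMapping("A", "Address", "address_line_3", "Address"),
    # System B: every column names the attribute it feeds (direct).
    ColumnMapping("B", "Address", "id", "Address",
                  relationship="has contact", role="target"),
    ColumnMapping("B", "Address", "business partner id", "Business Partner",
                  relationship="has contact", role="source"),
    ColumnMapping("B", "Address", "street", "Address", attribute="street name"),
    ColumnMapping("B", "Address", "zip_code", "Address", attribute="postal code"),
    ColumnMapping("B", "Address", "city", "Address", attribute="city name"),
    ColumnMapping("B", "Address", "country_code", "Address", attribute="country code"),
]

for mapping in MAPPINGS:
    print(mapping.system, mapping.column, mapping_kind(mapping))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Keeping the input this small is the point: everything else, including which columns need a transformation, can be derived from these records.&lt;/p&gt;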
&lt;h2&gt;Generating the Logical Data Model&lt;/h2&gt;
&lt;p&gt;The mapping from systems' tables and columns to business concepts is processed and written into the knowledge graph, which then holds the following types of nodes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;System&lt;/strong&gt;, the name of one system owning tables and columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table&lt;/strong&gt;, the name of a table from a particular system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column&lt;/strong&gt;, one column in one system with respective schema definitions, such as data type.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Concept&lt;/strong&gt;, a business concept such as Address.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attribute&lt;/strong&gt;, a single data field describing a concept, such as street name for the address concept.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relationship&lt;/strong&gt;, a directed connection between two concepts, pointing from the source concept to the target concept. For example, business partner "has contact" address.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The logical data model is then systematically created (via a Python script) from the concepts, attributes, and relationships. Each concept gets a table of its own, whose columns are all of its attributes plus an internal identifier for the concept. Each relationship also becomes a table of its own, with the internal identifiers of the source and target concepts as foreign key columns.&lt;/p&gt;
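&lt;p&gt;The generation rule just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the script we run: it reads concepts and relationships from in-memory structures instead of the knowledge graph, and the function names, the &lt;code&gt;_id&lt;/code&gt; suffix, and the empty attribute list for Business Partner are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;def to_identifier(name):
    """Turn a display name such as "Business Partner" into "business_partner"."""
    return name.strip().lower().replace(" ", "_")


def generate_logical_model(concepts, relationships):
    """Return a dict mapping each table name to its ordered list of columns.

    concepts: dict of concept name to list of attribute names
    relationships: list of (source concept, relationship name, target concept)
    """
    tables = {}
    # One table per concept: internal identifier first, then its attributes.
    for concept, attributes in concepts.items():
        table = to_identifier(concept)
        tables[table] = [table + "_id"] + [to_identifier(a) for a in attributes]
    # One table per relationship: source and target identifiers as foreign keys.
    for source, name, target in relationships:
        tables[to_identifier(name)] = [
            to_identifier(source) + "_id",
            to_identifier(target) + "_id",
        ]
    return tables


concepts = {
    "Address": ["street name", "postal code", "city name", "country code"],
    "Business Partner": [],  # attributes omitted in this example
}
relationships = [("Business Partner", "has contact", "Address")]

model = generate_logical_model(concepts, relationships)
for table, columns in model.items():
    print(table, columns)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the address example this yields three tables: one for each concept and one for the "has contact" relationship, which holds only the two foreign keys.&lt;/p&gt;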
&lt;p&gt;Since the graph records which system's tables and columns contribute to each concept, we can also generate the so-called transformation data model, which shows how each system's columns map (directly or indirectly) to the logical data model of the golden record.&lt;/p&gt;
&lt;p&gt;By using knowledge graphs for a live-data representation of all systems' logical data models and how they map to a semantic layer of business concepts, we are able to automatically generate the logical data model of the golden record inside the knowledge graph with additional information on how it connects to the systems' data models. This enables us to keep a record of data lineage from each system to the golden record and, additionally, to use contemporary knowledge graph visualisation tools to give domain experts an intuitive and understandable representation of how each system is connected to the golden record. We see two main advantages here:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The dialogue between business and technology in designing the golden record logical data model has improved and accelerated the process of creating a correct model.&lt;/li&gt;
&lt;li&gt;All deliverables, such as the logical data model and the transformation data model, can be queried directly from the knowledge graph and do not need to be created manually, which makes them less error-prone.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We estimate that during the development of the MDM component this approach will continue to save us time by avoiding misunderstandings and improving stakeholder communication.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Wikipedia on &lt;a href="https://en.wikipedia.org/wiki/Master_data_management"&gt;Master Data Management&lt;/a&gt; 23.7.2021&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;For knowledge graph experts, it is worthwhile to note that because this is a schema for the logical data model, relationships between concepts are also modeled as nodes. This is a deliberate design choice. It enables us to map data model information to relationships.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Knowledge Graph"/><category term="Data"/></entry><entry><title>How we use Kotlin for backend services at Zalando</title><link href="https://engineering.zalando.com/posts/2021/07/kotlin-for-backend-services.html" rel="alternate"/><published>2021-07-01T00:00:00+02:00</published><updated>2021-07-01T00:00:00+02:00</updated><author><name>Ole Sasse</name></author><id>tag:engineering.zalando.com,2021-07-01:/posts/2021/07/kotlin-for-backend-services.html</id><summary type="html">&lt;p&gt;In the latest update to the Tech Radar, Kotlin has moved to ADOPT. As part of this effort, Zalando's Kotlin Guild has created a set of recommended tools and libraries for backend development, which this blog post takes a closer look at.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Kotlin Logo" src="https://engineering.zalando.com/posts/2021/07/images/kotlin-logo.png#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;The adoption of Kotlin at Zalando&lt;/h2&gt;
&lt;p&gt;As outlined in &lt;a href="https://engineering.zalando.com/tags/tech-radar.html"&gt;prior posts&lt;/a&gt;, Zalando uses a &lt;a href="https://opensource.zalando.com/tech-radar"&gt;Tech Radar&lt;/a&gt; to provide guidance on technology selection.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://engineering.zalando.com/posts/2021/06/zalando-tech-radar-scaling-contributions.html"&gt;Recently&lt;/a&gt;, we moved &lt;a href="https://kotlinlang.org"&gt;Kotlin&lt;/a&gt; from TRIAL to ADOPT. With this change we are doubling down on the support of Kotlin as the 3rd JVM language next to &lt;a href="https://www.java.com"&gt;Java&lt;/a&gt; and &lt;a href="https://scala-lang.org"&gt;Scala&lt;/a&gt;. This is the result of increased adoption within the company (100+ new applications were written in Kotlin in a year), positive feedback from engineers starting to use it, as well as creation of guidelines, coding standards, reference projects, and service templates by the Zalando Kotlin Guild.&lt;/p&gt;
&lt;p&gt;The experience that our Engineering Community gained over recent years with Kotlin matches the developer stories of other companies. A nice collection of success stories can be found on the Android &lt;a href="https://developer.android.com/kotlin/stories"&gt;blog&lt;/a&gt;. Kotlin allows writing more succinct code with fewer pitfalls compared to Java and comes with a lot of useful features and libraries (e.g. &lt;a href="https://kotlinlang.org/docs/data-classes.html"&gt;data classes&lt;/a&gt;, &lt;a href="https://kotlinlang.org/docs/null-safety.html"&gt;null safety&lt;/a&gt;) that Java does not (yet) have as part of its standard library. This is probably also a reason why it is more &lt;a href="https://insights.stackoverflow.com/survey/2020#technology-most-loved-dreaded-and-wanted-languages-wanted"&gt;wanted&lt;/a&gt; and less &lt;a href="https://insights.stackoverflow.com/survey/2020#technology-most-loved-dreaded-and-wanted-languages-dreaded"&gt;dreaded&lt;/a&gt; than Java and Scala in the 2020 Stack Overflow insights. Additionally, &lt;a href="https://kotlinlang.org/spec/type-inference.html"&gt;type inference&lt;/a&gt;, &lt;a href="https://kotlinlang.org/docs/reference/collections-overview.html#collection-types"&gt;read-only collections&lt;/a&gt;, and the rich support for functional programming in the standard libraries are among the things our developers see as benefits compared to Java.&lt;/p&gt;
&lt;h2&gt;The Kotlin Guild&lt;/h2&gt;
&lt;p&gt;The Kotlin &lt;a href="https://engineering.zalando.com/tags/guild.html"&gt;Guild&lt;/a&gt; was founded with around 10 core members who want to help the language grow in Zalando. Moving the language to ADOPT in the latest &lt;a href="https://engineering.zalando.com/posts/2021/06/zalando-tech-radar-scaling-contributions.html"&gt;Tech Radar Update&lt;/a&gt; was a central milestone in that effort, as the ADOPT status comes with support from central infrastructure teams and the created documentation as well as templates, which help to promote a standardized tech stack and make bootstrapping new services easier. Due to being driven by our language guild, the whole process was kept transparent and open for contributions from the Engineering Community.&lt;/p&gt;
&lt;p&gt;As a preparation for wider adoption of Kotlin, we collected internal good practices as well as the definition of tools and libraries for the development of RESTful backend services and Android apps with Kotlin that are recommended as default choices. For additional input we looked at how frequently things are used within the company, sat together with experts on specific topics, consulted external sources, and asked the whole Engineering Community to review final recommendations via a survey. Overall, we made sure that our recommendations support a positive developer experience and fit the need of most services, which are not directly serving customer traffic.&lt;/p&gt;
&lt;p&gt;Looking forward, the Kotlin Guild will continue to foster knowledge exchange as well as community building for its 250+ members. We also plan to cover more use cases with our documentation, like pure functional services using &lt;a href="https://arrow-kt.io"&gt;Arrow&lt;/a&gt;, and will make sure we stay up to date with new developments in the Kotlin space. Next to that, the members support each other with technical issues, and regular talks are hosted.&lt;/p&gt;
&lt;h2&gt;How we build Backend Services at Zalando&lt;/h2&gt;
&lt;p&gt;Our internal developer tooling allows teams to initialize a repository from a template project. These templates come with out-of-the-box configuration and integrations, which teams can then adapt to their needs. As an added benefit, they nudge teams towards higher consistency across different services and departments.&lt;/p&gt;
&lt;p&gt;All APIs are defined in the OpenAPI format using &lt;a href="https://swagger.io"&gt;Swagger&lt;/a&gt;. This allows our API portal to list all available APIs in one place along with their API linting results via &lt;a href="https://github.com/zalando/zally"&gt;Zally&lt;/a&gt;. API linting can also be required to pass for MUST validations on every build. Many of our teams follow the &lt;a href="https://engineering.zalando.com/posts/2019/04/developing-zalando-apis.html"&gt;API first&lt;/a&gt; principle throughout service development.&lt;/p&gt;
&lt;p&gt;Given that most services are deployed in &lt;a href="https://kubernetes.io"&gt;Kubernetes&lt;/a&gt;, we consider &lt;a href="https://opensource.zalando.com/skipper"&gt;Skipper&lt;/a&gt; filters the best way to handle Authentication and Authorization. This can be achieved in Skipper &lt;a href="https://opensource.zalando.com/skipper/reference/filters/#oauthtokeninfoallscope"&gt;directly&lt;/a&gt;, via &lt;a href="https://opensource.zalando.com/skipper/kubernetes/routegroups/#routegroups"&gt;Route Groups&lt;/a&gt;, or via &lt;a href="https://zalando-incubator.github.io/fabric-gateway/fabric-gateway-features/#authentication"&gt;Fabric Gateway&lt;/a&gt;. Skipper is designed to handle a large number of requests and is less likely to be misconfigured than, for example, &lt;a href="https://spring.io/projects/spring-security"&gt;Spring Security&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Many JVM-based web services in Zalando are built using &lt;a href="https://spring.io/projects/spring-boot"&gt;Spring Boot&lt;/a&gt; and we believe that this is also a good option when using Kotlin. This choice is mainly driven by its wide adoption, but also because Spring &lt;a href="https://spring.io/guides/tutorials/spring-boot-kotlin"&gt;integrates&lt;/a&gt; really well with Kotlin, is compatible with multiple application servers, and supports reactive programming via &lt;a href="https://docs.spring.io/spring-integration/docs/5.1.2.RELEASE/reference/html/webflux.html"&gt;WebFlux&lt;/a&gt;. We also see growing adoption of &lt;a href="https://ktor.io"&gt;Ktor&lt;/a&gt; and expect it to gain popularity within Zalando in the future, possibly even in conjunction with &lt;a href="https://www.graalvm.org"&gt;GraalVM&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Libraries we use for Backend Services&lt;/h2&gt;
&lt;p&gt;As a build system, we prefer &lt;a href="https://gradle.org"&gt;Gradle&lt;/a&gt; over &lt;a href="https://maven.apache.org"&gt;Maven&lt;/a&gt; because of its great customizability and build performance. Gradle is also used to compile the language itself and is used by many major framework projects like &lt;a href="https://github.com/spring-projects/spring-boot"&gt;Spring Boot&lt;/a&gt;. On top of that, the build configuration scripts can be &lt;a href="https://docs.gradle.org/current/userguide/tutorial_using_tasks.html"&gt;written&lt;/a&gt; in Kotlin.&lt;/p&gt;
&lt;p&gt;Linting is a very good practice to keep the style consistent in a codebase and to settle disputes over correct indentation. &lt;a href="https://github.com/pinterest/ktlint"&gt;Ktlint&lt;/a&gt; is our tool of choice as it follows the official &lt;a href="https://kotlinlang.org/docs/coding-conventions.html"&gt;coding&lt;/a&gt; conventions, is &lt;a href="https://plugins.gradle.org/plugin/org.jlleitschuh.gradle.ktlint"&gt;easy to run&lt;/a&gt; in Gradle, and does not enforce too many rules such that it seamlessly integrates into the software development process.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/MicroUtils/kotlin-logging"&gt;Kotlin-logging&lt;/a&gt; is recommended for logging as it automatically adds class names to the log, lazily evaluates messages, and is built on top of &lt;a href="http://www.slf4j.org"&gt;slf4j&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For &lt;a href="https://redis.io"&gt;Redis&lt;/a&gt; access, we recommend using &lt;a href="https://github.com/lettuce-io/lettuce-core"&gt;Lettuce&lt;/a&gt;, which is part of &lt;a href="https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-data-redis"&gt;spring-boot-starter-data-redis&lt;/a&gt;, as it is a thread-safe client with nice support for reactive programming.&lt;/p&gt;
&lt;p&gt;To access relational databases, we see &lt;a href="https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-data-jpa"&gt;spring-boot-starter-data-jpa&lt;/a&gt; as a solid choice if you want to use an ORM, but advise considering &lt;a href="https://www.jooq.org"&gt;jOOQ&lt;/a&gt; in cases where database transactions become more complex. It is also worth mentioning that jOOQ can be used together with other clients, as it can be used on top of JPA. jOOQ also has the added benefit that it supports database specifics like Postgres &lt;a href="https://www.postgresql.org/docs/current/datatype-json.html"&gt;JSON types&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Zalando is investing in traceability with &lt;a href="https://opentracing.io"&gt;OpenTracing&lt;/a&gt; and we recommend &lt;a href="https://github.com/zalando/opentracing-toolbox"&gt;opentracing-toolbox&lt;/a&gt;, which eases &lt;a href="https://github.com/zalando/opentracing-toolbox/tree/main/opentracing-kotlin"&gt;integration&lt;/a&gt; of tracers, particularly in Spring Boot projects. Tracing allows linking requests across services and is also great for setting up &lt;a href="https://www.usenix.org/conference/srecon19emea/presentation/mineiro"&gt;automated alerting&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We hope this gives you some idea why Kotlin is gaining popularity for backend development within Zalando.&lt;/p&gt;</content><category term="Zalando"/><category term="Kotlin"/><category term="Tech Radar"/><category term="Frameworks"/><category term="Guild"/><category term="Backend"/><category term="Culture"/></entry><entry><title>Zalando Tech Radar - Scaling Contributions to Technology Selection</title><link href="https://engineering.zalando.com/posts/2021/06/zalando-tech-radar-scaling-contributions.html" rel="alternate"/><published>2021-06-24T00:00:00+02:00</published><updated>2021-06-24T00:00:00+02:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2021-06-24:/posts/2021/06/zalando-tech-radar-scaling-contributions.html</id><summary type="html">&lt;p&gt;Learn how we scaled contributions to Zalando Tech Radar&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Zalando Tech Radar" src="https://engineering.zalando.com/posts/2021/06/images/zalando-tech-radar.jpg#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;In our previous post about &lt;a href="/posts/2020/07/technology-choices-at-zalando-tech-radar-update.html"&gt;Technology Choices at Zalando&lt;/a&gt; we spoke about a few problems with scaling technology selection in Tech companies. Since then, we have focused on the remaining categories of the &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Tech Radar&lt;/a&gt; beyond languages and the Tech Radar contribution process. Now, we'd like to reflect on our lessons learned, which you can use when designing technology selection processes.&lt;/p&gt;
&lt;h2&gt;Scaling contributions&lt;/h2&gt;
&lt;p&gt;One of the challenges for us to solve was scaling contributions to the Tech Radar across our 250+ delivery teams. Technologists are often more excited about promoting a new, promising technology than about working on guidelines or sharing knowledge about already well-known tech. Such individuals are also essential for continued innovation. On the other hand, companies look for organizational efficiency by ensuring talent mobility across teams supported by a more or less standardized tech stack. This makes it easier to address cross-team dependencies in product delivery by allowing teams to contribute to code bases beyond their area of responsibility. Further, it creates career opportunities for Engineers, who can quickly switch teams and work on a challenging, high impact project. Thus, for technology selection, there is a natural tension between early adopters' vested interest and the needs of the organization they work for. At Zalando, we have created a two-sided contribution model to the Tech Radar:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anyone in Zalando is encouraged to contribute knowledge about technologies we have on the Tech Radar or to suggest promising ones to evaluate, and plays a key role in this process.&lt;/li&gt;
&lt;li&gt;Our Principal Engineers are maintainers of the Tech Radar and are responsible for moderating information collection on incoming suggestions, driving the creation of good practices for technologies being evaluated or used, and promoting technologies to increase their adoption.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ring change suggestions are supported by issue templates in our internal Tech Radar GitHub repository. These templates provide guidance on common questions around use case fit, key differences from alternatives already on the Tech Radar, conformance to our Technology Selection Principles, and support within the Engineering Community.&lt;/p&gt;
&lt;p&gt;We encourage and expect our Engineers to contribute information about usage, lessons learned from production incidents, or challenges they face at scale. Voluntary contributions alone are insufficient to keep an updated view of the technologies we use. Thus, to support usage information collection, we gather usage data from our AWS accounts, source code repositories, and our infrastructure platform offerings. The collected information is compiled in a documentation page with a common structure across all entries:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zalando Tech Radar: example documentation entry" src="https://engineering.zalando.com/posts/2021/06/images/tech-radar-docs-entry.png"&gt;&lt;/p&gt;
&lt;p&gt;Finally, we leverage Principal Engineers to moderate and drive discussions around technology adoption at Zalando. These colleagues have a sufficiently broad view on technology usage and performance in production across multiple teams and serve as a multiplying factor. They're responsible for encouraging teams they work with to share knowledge and highlight technology usage based on the software systems in their areas - either themselves or by enabling others to do so. Additionally, they moderate discussions within technology guilds or initiate working groups to create specific artifacts for the technologies, like collections of good practices or guidelines tailored to our environment, use cases, and scale. Such working groups are also excellent opportunities to develop or identify talent within the company.&lt;/p&gt;
&lt;h2&gt;Re-scoring - how have we decided upon changes?&lt;/h2&gt;
&lt;p&gt;After a longer period of time with no regular changes to the Tech Radar, we had a re-scoring exercise to complete. A similar approach was used originally at ThoughtWorks and can be used to create a Tech Radar from the ground up.&lt;/p&gt;
&lt;p&gt;Within our Principal Engineering Community, we formed a working group per dimension: Datastores, Data processing, Infrastructure, and Queues. Our &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Tech Radar visualization&lt;/a&gt; merges Data processing and Queues in a single Data Management dimension for simplicity. Each working group was responsible for the data collection and analysis. One person from each group compiled the information in a structured format where per technology there was a case made for a ring change (or not). The change reasoning was supported by data points on usage, incidents, and expertise we gained since the technology was added to the Tech Radar (a few years in some cases) as well as conformance with our Technology Selection Principles. Where necessary to build a solid case, we reached out to teams in order to understand more details about their use cases or experience, if this was not sufficiently documented through recent information in our Tech Radar.&lt;/p&gt;
&lt;p&gt;Based on the collected data, Principal Engineers participated in a review and re-scoring exercise. In a spreadsheet, we collected votes. Every 'nay' vote required a short rationale, which we later discussed in the group to ensure we did not miss out on usage or use cases. We also found inconsistencies in the way we handle technologies with multiple deployment options (self-hosted vs. managed or vendor offerings), for which we have not found a good solution yet.&lt;/p&gt;
&lt;p&gt;After the voting, the collected ring changes were discussed with our Senior Leadership Team. The main focus was on ensuring long-term support for the technologies we promote to ADOPT and that technologies on lower rings are in line with long-term strategies (e.g. Data Strategy).&lt;/p&gt;
&lt;p&gt;Finally, we shared the changes with our Engineers, along with a detailed rationale per ring change and further information on the re-scoring process and contributions moving forward.&lt;/p&gt;
&lt;h2&gt;Notable changes&lt;/h2&gt;
&lt;p&gt;With the re-scoring, we moved a few technologies to ADOPT, confirming our investment in these. To scale adoption, in some cases, we formed dedicated teams that operate service offerings available to all Zalando Engineers and Data Scientists.&lt;/p&gt;
&lt;h3&gt;Airflow&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; is a Workflow Orchestration tool used by data teams in Zalando. We have a central infrastructure team responsible for managing Airflow as a Service for our data teams.&lt;/p&gt;
&lt;h3&gt;Databricks&lt;/h3&gt;
&lt;p&gt;We've been using &lt;a href="https://spark.apache.org/"&gt;Apache Spark&lt;/a&gt; for various analytical and Machine Learning use cases and talked about our usage before (see &lt;a href="https://youtu.be/Fy_KnCxp1lo"&gt;Data Warehousing with Spark Streaming at Zalando&lt;/a&gt;). Databricks is also the core element of our Machine Learning Platform, available to all Engineers.
More recently, we went from a centralized Data Lake approach towards a distributed Data Mesh architecture backed by Spark and built on Delta Lake powered by Databricks. See our talk &lt;a href="https://www.youtube.com/watch?v=eiUhV56uVUc"&gt;Data Mesh in Practice: How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h3&gt;GraphQL&lt;/h3&gt;
&lt;p&gt;We've blogged about our &lt;a href="/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;GraphQL usage&lt;/a&gt; before. We have 200+ developers that contributed to the GraphQL API layer powering the &lt;a href="https://en.zalando.de/"&gt;Zalando shop&lt;/a&gt; over the past 2.5 years. We also have other use cases in production, for example in back-office applications for our Buying department.&lt;/p&gt;
&lt;h3&gt;Kotlin &amp;amp; TypeScript&lt;/h3&gt;
&lt;p&gt;Having seen continued and growing usage of &lt;a href="https://kotlinlang.org/"&gt;Kotlin&lt;/a&gt; and &lt;a href="https://www.typescriptlang.org/"&gt;TypeScript&lt;/a&gt;, we have initiated workstreams within our language guilds to define guidelines, coding standards, reference projects, and service templates. These artifacts are helping teams in adopting the languages moving forward. Further, they help build a shared understanding of what we consider production-proven frameworks and libraries along with recommended configuration options.
We've shared our &lt;a href="/posts/2019/02/typescript-best-practices.html"&gt;TypeScript best practices&lt;/a&gt; in the past and more details about &lt;a href="/posts/2021/07/kotlin-for-backend-services.html"&gt;promoting Kotlin at Zalando&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;SageMaker&lt;/h3&gt;
&lt;p&gt;We have blogged before about our usage of &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt; for &lt;a href="/posts/2021/02/machine-learning-pipeline-with-real-time-inference.html"&gt;ML Pipelines with Real-Time Inference&lt;/a&gt; and &lt;a href="/posts/2020/06/distributed-xgb-sagemaker.html"&gt;distributed training&lt;/a&gt;. See also our talk on &lt;a href="https://www.youtube.com/watch?v=6UVdMtNUpDE"&gt;using SageMaker for training ML models&lt;/a&gt; from the AWS Summit 2019.&lt;/p&gt;
&lt;h2&gt;Tech Radar changes moving forward and future focus&lt;/h2&gt;
&lt;p&gt;The re-scoring exercise described in this post was a house-keeping exercise supported by clarifying the purpose of the Tech Radar, long-term ownership, and the contribution model. The number of upcoming changes will of course depend on contributions from our Engineering Community and our appetite for trying out new technologies. While changes to ADOPT/HOLD are going to be evaluated on a quarterly basis, we have a steady stream of ongoing assessments and trials.&lt;/p&gt;
&lt;p&gt;The Principal Engineering Community focuses on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;supporting and guiding contributions from the Engineering Community,&lt;/li&gt;
&lt;li&gt;identifying promising technologies to invest in,&lt;/li&gt;
&lt;li&gt;collecting best practices and expertise around technologies on TRIAL and ADOPT.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the last point we aim to define paved roads for Engineers describing for example battle-tested configurations for typical use cases or standardized monitoring dashboards with their explanation for the key and most common technologies. While this is today already the case for our &lt;a href="https://www.youtube.com/watch?v=G8MnpkbhClc"&gt;PostgreSQL as a Service offering&lt;/a&gt; built on top of &lt;a href="https://github.com/zalando/patroni"&gt;Patroni&lt;/a&gt; and &lt;a href="https://github.com/zalando/postgres-operator"&gt;Postgres Operator&lt;/a&gt;, given a dedicated team responsible for this infrastructure, we don't have such guidance collected across all our ADOPT technologies yet.&lt;/p&gt;
&lt;h2&gt;Challenges we have not solved yet&lt;/h2&gt;
&lt;p&gt;There are a few challenges that the Tech Radar does not solve today, mostly related to consistency and completeness of the technology landscape. If we resolve any of these challenges, we will surely share our insights and lessons learned.&lt;/p&gt;
&lt;p&gt;Some technologies (e.g. etcd) have been successfully used in our infrastructure teams, but we would not want any delivery team to use these (e.g. for configuration management counting as "infrastructure") as we have more suitable building blocks in our platform.&lt;/p&gt;
&lt;p&gt;In other cases, we have invested in service offerings built around open-source software (e.g. Airflow) and we would prefer teams to extend this platform offering rather than deploy their own infrastructure.&lt;/p&gt;
&lt;p&gt;We also have solutions built in-house (e.g. our request router - &lt;a href="https://github.com/zalando/skipper"&gt;Skipper&lt;/a&gt;) which are an essential part of our cloud infrastructure. Teams don't really have a choice to easily opt out of these. These technologies will most likely be moved to a different place that will represent the maturity of the development infrastructure at Zalando from a Product perspective.&lt;/p&gt;
&lt;p&gt;For technologies where we chose vendor offerings built on top of a technology (e.g. Databricks for Spark), the question arises whether to include one or both and with which ring assignment (setting Spark to HOLD while keeping Databricks on ADOPT may sound confusing). Here, we consider using the underlying technology and outlining the recommended deployment options.&lt;/p&gt;
&lt;p&gt;Finally, there are 3rd party products, which allow us to deliver solutions faster, without the need to reinvent the wheel. One example is Content Management Systems: we've built a few over the past years and strive not to do this again. The question is how to make these sufficiently visible to our Engineers, so that they're considered while building future products for our customers.&lt;/p&gt;</content><category term="Zalando"/><category term="Tech Culture"/><category term="Tech Radar"/><category term="Culture"/></entry><entry><title>Making the Remote Onboarding a Success</title><link href="https://engineering.zalando.com/posts/2021/04/making-the-remote-onboarding-a-success.html" rel="alternate"/><published>2021-04-22T00:00:00+02:00</published><updated>2021-04-22T00:00:00+02:00</updated><author><name>Martin Schwitalla</name></author><id>tag:engineering.zalando.com,2021-04-22:/posts/2021/04/making-the-remote-onboarding-a-success.html</id><summary type="html">&lt;p&gt;Onboarding new people to the team is always a big challenge and got even more complicated due to the pandemic when most people work from home. This post describes a couple of steps we took to make the remote onboarding of three new team members a success.&lt;/p&gt;</summary><content type="html">&lt;p&gt;When the pandemic started in 2020, many Zalando employees started working from home. It changed our working habits and many other things, and Zalando published &lt;a href="https://engineering.zalando.com/posts/2020/03/how-to-work-remotely-at-zalando.html"&gt;remote working guidelines&lt;/a&gt; to support its employees. Those guidelines focus only on remote working, but what happens if you change companies during the pandemic?&lt;/p&gt;
&lt;p&gt;Joining a new company and getting onboarded can already be pretty tough during normal times. Starting a new job requires you to learn new skills and build new relationships within the company. Working from home amplifies those problems by introducing virtual barriers. It's not possible to walk up to somebody and ask a question, or to introduce yourself to people you meet by chance.&lt;/p&gt;
&lt;p&gt;We were recently confronted with the challenge of growing our engineering team from two people to five within two months. In this article, I describe how we tackled this challenge to make sure that the new team members were onboarded quickly and felt welcome in this new setup.&lt;/p&gt;
&lt;h2&gt;Onboarding Buddy&lt;/h2&gt;
&lt;p&gt;One of the first decisions we made was to assign an onboarding buddy to each new team member. The onboarding buddy is the go-to person for the new team member in case of questions or problems where support is needed, e.g. setting up the notebook. As some people might feel uncomfortable asking unfamiliar colleagues for help, especially remotely, we set up daily 1:1 sessions to discuss the current state of the onboarding, answer open questions, and provide regular feedback. As time went on, the frequency of the 1:1s decreased, because people got used to working in the team.&lt;/p&gt;
&lt;h2&gt;Feedback&lt;/h2&gt;
&lt;p&gt;Providing regular feedback is the key to success during the onboarding. It creates a continuous feedback loop that informs the new team members about how their contribution is viewed, gets them used to Zalando's feedback culture, and lets us reflect on how the onboarding is working out and whether it needs to be tweaked. To make sure we don’t forget to provide feedback, we set up monthly feedback sessions between the team and each new team member. While doing this, we experimented with three different formats.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An open round where everybody shares the feedback freely.&lt;/li&gt;
&lt;li&gt;The feedback is given in short 1:1 sessions between the new team member and each other team member.&lt;/li&gt;
&lt;li&gt;The team collects the feedback and then presents one summarized view to the new team member.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overall it’s impossible to say which format is the best. It could be intimidating in the beginning to receive feedback from the whole team in an open round, but fine at a later point in time when the team knows each other better. It depends on the situation and the people and we gave our new team members the possibility to choose. As those feedback sessions were also meant for the new member to provide feedback to the team, we prepared some questions to collect the feedback.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What do you think about the onboarding so far?&lt;/li&gt;
&lt;li&gt;Is there any information that you missed or would have liked to receive earlier?&lt;/li&gt;
&lt;li&gt;Is your workload manageable for you? Are the tasks too easy/too difficult?&lt;/li&gt;
&lt;li&gt;Would you like to receive more/less support?&lt;/li&gt;
&lt;li&gt;Is there anything you would like to work more on?&lt;/li&gt;
&lt;li&gt;How comfortable would you feel if all other team members fell sick and you had to work on tasks and support requests alone?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The last question is probably the most important one. It asks the new team members to reflect on themselves and check how confident they already are in their skills. This is an important indicator for the team to put some focus on areas that the onboarding has missed so far. This way we found out that we needed to get better at introducing the on-call and incident process in our team, as this had been missed completely.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Picture of Sokoban Standup" src="https://engineering.zalando.com/posts/2021/04/images/team_pic.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Technical Onboarding&lt;/h2&gt;
&lt;p&gt;The onboarding of course includes some technical onboarding as well. We did the obligatory domain introduction and some introductions to our ways of working, like the sprint ceremonies. It’s important not to overwhelm the new team members at the start. Much, if not most, of the information can also be shared down the line when it’s needed. It’s better to focus on the basics in the beginning and give them time to sink in. But at some point the new team members need to get their hands dirty and work on some real tasks. To make the start easier, we defaulted to pair programming or even mob programming in the beginning. The rule was that tasks had to be done by at least two people unless other circumstances prevented it. Pair programming is even more important than usual when working remotely, not only because it allows for easy, “on the job” knowledge sharing, but also because it allows the participants to bond and get to know each other. The pair programming was done with simple tools. The person programming used the IDE of their choice and shared their screen via the call so that others could follow the coding. Of course, other tools and IDE plugins exist that try to make the whole setup even better, but in our experience it worked pretty well without them.&lt;/p&gt;
&lt;p&gt;In our team we have a role that rotates each day, and the person holding it takes care of incoming support requests from internal clients. Usually this requires a certain level of domain and system knowledge. We decided to onboard the new team members to this role pretty quickly. On the one hand, it frees up some time for the more experienced engineers, and on the other hand, it provides another learning opportunity for the new team members. As long as this was transparently communicated to clients, they didn’t mind that some support requests took longer than usual, and the new team members made huge progress on domain knowledge in a relatively short time.&lt;/p&gt;
&lt;h2&gt;Relationships&lt;/h2&gt;
&lt;p&gt;The last part of the onboarding relates to the relationships inside the team. We are not just robots coming into work, but humans with emotions, goals and sometimes also problems. I believe that trust is an essential ingredient for efficient teams. It allows you to speak up freely and to make mistakes, and it turns conflicts into constructive discussions. During the pandemic you miss out on a lot of opportunities to get to know your new team-mates, as there are no team lunches, no short discussions at the coffee machine and no rounds of table tennis during the breaks. This can quickly start to feel like you are being left alone with your problems. Therefore we introduced a weekly “Team Bonding” session, which was moderated by our producer. The producer is responsible for team processes in our team; in case you don't have such a role, any person, be it a team member, team lead or somebody outside the team, could facilitate this meeting.&lt;/p&gt;
&lt;p&gt;Every week she came up with new ideas for the session. Sometimes we just presented personal objects from our homes to each other; another time we did PowerPoint karaoke or played a game like Tabu. Some of those exercises had specific goals, like improving your presentation skills, but in the end it was always about the people and getting to know them. What drives your team-mates? What kind of humour do they have? What keeps them up at night right now? Opening up really helps to create this bond and increase trust within the team. Such exercises are not only valuable when you are working remotely; they can of course also be done when everybody is back at the office to continue the bonding between team-mates.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Summing up, this article boils down to some simple points. Take your time to do a proper onboarding and be transparent with clients and leads about possible delays for support requests or roadmaps. Remind yourself constantly to provide feedback, to give guidance and prevent unpleasant surprises. And don’t forget about the personal relationships that need to be created, because they will allow you to trust each other and feel safe while making mistakes. Following those rules is very time-intensive, but it pays off in the long run: in just about three months we were able to build an awesome team that is already more productive than before. Of course there is no one-size-fits-all solution for onboarding and different teams might have different needs, but this setup worked very well for us.&lt;/p&gt;
&lt;h2&gt;Other Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://miro.com/guides/remote-work/onboarding"&gt;Miro: Remote Onboarding Checklist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://resources.owllabs.com/blog/remote-employee-onboarding"&gt;OwlLabs: 7 Remote Employee Onboarding Tips and Checklist for Your Next New Hire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://about.gitlab.com/company/culture/all-remote/onboarding/"&gt;GitLab: The guide to remote onboarding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.hive.hr/blog/improve-your-remote-onboarding-experience/"&gt;Hive: 16 Ways to Improve Your Remote Onboarding Experience&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/articles/on-pair-programming.html"&gt;Martin Fowler: On Pair Programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.agilealliance.org/glossary/pairing/"&gt;Agile Alliance: Pair Programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.agilealliance.org/glossary/mob-programming/"&gt;Agile Alliance: Mob Programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.remotemobprogramming.org/"&gt;Remote Mob Programming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.powerpointkaraoke.com/"&gt;PowerPoint Karaoke&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://slidelizard.com/en/blog/powerpoint-karaoke-rules-and-free-download"&gt;A Guide to PowerPoint Karaoke&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="Remote Working"/><category term="Onboarding"/><category term="Culture"/><category term="Leadership"/></entry><entry><title>Modeling Errors in GraphQL</title><link href="https://engineering.zalando.com/posts/2021/04/modeling-errors-in-graphql.html" rel="alternate"/><published>2021-04-13T00:00:00+02:00</published><updated>2021-04-13T00:00:00+02:00</updated><author><name>Boopathi Rajaa Nedunchezhiyan</name></author><id>tag:engineering.zalando.com,2021-04-13:/posts/2021/04/modeling-errors-in-graphql.html</id><summary type="html">&lt;p&gt;GraphQL excels in modeling data requirements. Modeling errors as schema types in GraphQL is required for certain kinds of errors. In this post, let's analyze some cases where errors contain structured data apart from the message and the location information.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Use case to distinguish different errors" src="https://engineering.zalando.com/posts/2021/04/images/use-case.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;GraphQL Errors&lt;/h2&gt;
&lt;p&gt;GraphQL is an excellent language for writing data requirements in a declarative fashion. It gives us a clear and well-defined concept of nullability constraints and error propagation. In this post, let's discuss where GraphQL falls short regarding errors and how we can model those errors to fit some of our use-cases.&lt;/p&gt;
&lt;p&gt;Before we dive into the topic, let's understand how GraphQL currently treats and handles errors. The response of a GraphQL query is of the following structure -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;errors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;message&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Something happened&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;path&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Error extensions&lt;/h2&gt;
&lt;p&gt;The Schema we define for GraphQL is used only in the &lt;code&gt;data&lt;/code&gt; field of the response. The &lt;code&gt;errors&lt;/code&gt; field is a well-defined structure - &lt;code&gt;Array&amp;lt;{ message: string, path: Array&amp;lt;string | number&amp;gt; }&amp;gt;&lt;/code&gt; in its simplest form, where numeric path segments are list indices. The Schema we define does not affect this Error.&lt;/p&gt;
&lt;p&gt;Let's say the client queries a field using an ID. How can the client know from the above error object whether the Error is due to an internal server error or because the ID was not found? Parsing the message is a no-go because it is not reliable.&lt;/p&gt;
&lt;p&gt;Luckily, in GraphQL, there is a way to provide extensions to the error structure - using &lt;code&gt;extensions&lt;/code&gt;. The &lt;code&gt;error.extensions&lt;/code&gt; can convey other information related to the Error - properties, metadata, or other clues from which the client can benefit. As for the above example, we can model the response to be -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Not Found&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;extensions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NOT_FOUND&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Errors for Customers&lt;/h2&gt;
&lt;p&gt;When we have a GraphQL API that delivers content to end-users (the customers), we have two levels of users -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Developer&lt;/strong&gt; or &lt;strong&gt;user&lt;/strong&gt; of the API - UI/UX/front-end developer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Customer&lt;/strong&gt; or &lt;strong&gt;end-user&lt;/strong&gt; - The one who does not see any technical layers but gets the product's experience in its most presentable format. The Front-end developer builds this experience using data from the GraphQL API.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Since using the word &lt;strong&gt;user&lt;/strong&gt; might be confusing, from now on, &lt;strong&gt;Developer&lt;/strong&gt; will refer to the front-end developer, and &lt;strong&gt;Customer&lt;/strong&gt; will refer to the end-user.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Customer vs Developer" src="https://engineering.zalando.com/posts/2021/04/images/customer-developer.jpg"&gt;&lt;/p&gt;
&lt;p&gt;When we have an API whose data is consumed by both levels of users - Developer and Customer - each may have different error data requirements. For example, let's take &lt;code&gt;mutations&lt;/code&gt; - when the Customer enters an invalid email address,&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Developer&lt;/strong&gt; who uses the GraphQL API needs to know that the Customer has entered an invalid email address via a &lt;strong&gt;parseable format&lt;/strong&gt; - a boolean, an enum, or any other data structure you choose will work, as long as it does not require parsing the error message.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Customer&lt;/strong&gt; needs to see the error message in a nicely styled format close to the text box. Also, for &lt;strong&gt;different languages&lt;/strong&gt; or locales, the error message needs to be shown as the corresponding &lt;strong&gt;translated&lt;/strong&gt; text.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let's try to model this using the error extensions discussed above -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;errors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;message&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Die E-Mail-Adresse ist ungültig&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;extensions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INVALID_EMAIL&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;While this would work, we soon end up in a case where multiple input fields in a mutation can be invalid. What can we do here? Do we model them as different errors or fit everything into the same Error?&lt;/p&gt;
&lt;p&gt;The Customer errors still need to be usable by the Developers, who propagate them. The front-end developers are the ones ultimately transforming our data structures into UI elements, so they need to understand the Error in order to highlight the right input text-box with a &lt;strong&gt;red&lt;/strong&gt; border. To make it easy, let's try modeling these as a single error with multiple validation messages -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;errors&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;message&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Multiple inputs are invalid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;extensions&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;invalidInputs&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INVALID_EMAIL&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;message&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Die E-Mail-Adresse ist ungültig&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;code&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INVALID_PASSWORD&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;message&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Das Passwort erfüllt nicht die Sicherheitsstandards&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The codes &lt;code&gt;INVALID_EMAIL&lt;/code&gt; and &lt;code&gt;INVALID_PASSWORD&lt;/code&gt; will help the front-end dev or &lt;strong&gt;Developer&lt;/strong&gt; highlight the field in the UI, and the message will be displayed to the &lt;strong&gt;Customer&lt;/strong&gt; right under that text-box.&lt;/p&gt;
&lt;p&gt;All this quickly leads to a complicated structure that is not as friendly as data modeled with a GraphQL schema.&lt;/p&gt;
&lt;h2&gt;Why you no Schema?&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Errors don't have type definitions" src="https://engineering.zalando.com/posts/2021/04/images/error-schema.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The biggest problem we face in modeling these in the extensions object is that it's not discoverable. We use a language as powerful as GraphQL to define each field in our data structure using Schemas, but when designing the errors, we went back to a &lt;strong&gt;loose mode&lt;/strong&gt; of not using any of the ideas GraphQL brought us.&lt;/p&gt;
&lt;p&gt;Maybe, in future extensions of the language, we can write schemas for Errors as we write for Queries and Mutations. The developers using the Schema would then get all the benefits of GraphQL even when handling errors. For now, let's concentrate on modeling this using the existing language specification.&lt;/p&gt;
&lt;h2&gt;Errors in Schema&lt;/h2&gt;
&lt;p&gt;We want to enjoy the power of GraphQL - the discoverability of fields of data, the tooling, and other aspects for errors. Why don't we put some of these errors in the Schema instead of capturing them in extensions?&lt;/p&gt;
&lt;p&gt;For example, the mutation discussed previously can be modeled like this -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;mutation returns a &lt;code&gt;Result&lt;/code&gt; type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Result&lt;/code&gt; type is a &lt;code&gt;union&lt;/code&gt; of &lt;code&gt;Success&lt;/code&gt; and &lt;code&gt;Error&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Error schema contains necessary error info - like translated messages, etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RegisterResult&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;union&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterResult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterError&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterError&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;invalidInputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RegisterInvalidInput&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterInvalidInput&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RegisterInvalidInputField&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterInvalidInputField&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;EMAIL&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;PASSWORD&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This structure closely mirrors the one we designed above inside error extensions. The advantage of modeling it like this is that we get the benefits of GraphQL for errors, too.&lt;/p&gt;
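&lt;p&gt;As a rough sketch of how a client might consume such a schema, the mutation below selects fields per union member using inline fragments. The operation name &lt;code&gt;Register&lt;/code&gt; and the variable names are illustrative assumptions; &lt;code&gt;__typename&lt;/code&gt; is the standard GraphQL meta field that reports which member of the union was returned -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# Hypothetical client operation against the schema above
mutation Register($email: String!, $password: String!) {
  register(email: $email, password: $password) {
    # Tells the client which union member came back
    __typename
    ... on RegisterSuccess {
      id
      email
    }
    ... on RegisterError {
      invalidInputs {
        field
        message
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a selection like this, the &lt;strong&gt;Developer&lt;/strong&gt; can branch on &lt;code&gt;__typename&lt;/code&gt;: the &lt;code&gt;field&lt;/code&gt; enum identifies which input to highlight, and &lt;code&gt;message&lt;/code&gt; carries the translated text for the &lt;strong&gt;Customer&lt;/strong&gt;.&lt;/p&gt;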
&lt;h2&gt;When you have a hammer,&lt;/h2&gt;
&lt;p&gt;Now, with the idea of modeling errors as Schema types, we are left with more questions than answers -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Should I model all errors as GraphQL types?&lt;/li&gt;
&lt;li&gt;How should I decide when to use error extensions and when to use GraphQL types for modeling errors?&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="The Problem hammer" src="https://engineering.zalando.com/posts/2021/04/images/problem-nails.jpg"&gt;&lt;/p&gt;
&lt;p&gt;When we have multiple teams maintaining the platform, many people contribute and think about modeling different parts of the Schema. There should be clear definitions for the different aspects of the existing data structures and the idea behind how we reached such solutions. The design and the Schema are changed far less often than they are read and used.&lt;/p&gt;
&lt;p&gt;GraphQL gave us the mindset of &lt;a href="https://graphql.org/learn/thinking-in-graphs/"&gt;"Thinking in Graphs"&lt;/a&gt;. If we suggest a new way of modeling errors, we need to talk about this mindset and its ideas. Not all errors fit into this modeling (error types in Schema), and it will make the GraphQL API less usable if we approach it by looking at all the errors as nails.&lt;/p&gt;
&lt;h2&gt;Classification&lt;/h2&gt;
&lt;p&gt;To model errors, let's try to find some analogies. I want to think about modeling these errors in terms of programming language errors. For example,&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go: &lt;code&gt;error&lt;/code&gt; vs. &lt;code&gt;panic&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Java: Exception vs. Error&lt;/li&gt;
&lt;li&gt;Rust: &lt;code&gt;Result&lt;/code&gt; vs. &lt;code&gt;panic!&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These programming languages also model errors as two variants. In one variant (an &lt;code&gt;error&lt;/code&gt; value in Go), we inform the Developer who uses the function. The Developer decides either to handle it or to pass it through. In the other variant (a &lt;code&gt;panic&lt;/code&gt; in Go), we skip everything and bring the program to a halt. We inform the end-user of the program that something has happened. Capturing this small variation as two different things helps us understand the intention of the data in errors.&lt;/p&gt;
&lt;h3&gt;Part 1. Action-ables&lt;/h3&gt;
&lt;p&gt;What is an error? It tells us that something is wrong and gives us some information on what action can be taken. We can think of errors as containers of &lt;strong&gt;action&lt;/strong&gt;-ables. When modeling them, we classify them into different groups depending on &lt;strong&gt;who&lt;/strong&gt; can take that action.&lt;/p&gt;
&lt;p&gt;In the GraphQL context, the front-end can take care of some errors - either by a fallback or a retry. For other errors, like invalid inputs, the front-end cannot take action; only the Customer who entered the invalid input can fix it.&lt;/p&gt;
&lt;p&gt;Instead of modeling the errors loosely, we now have a concrete use-case - model it for whoever can take action.&lt;/p&gt;
&lt;h3&gt;Part 2. Bugs in the system&lt;/h3&gt;
&lt;p&gt;Errors convey information - either to the &lt;strong&gt;Developer&lt;/strong&gt; or to the &lt;strong&gt;Customer&lt;/strong&gt;. If the Error conveys a bug in the system, it should &lt;strong&gt;not&lt;/strong&gt; be modeled as a schema error type. Here, the system means all the services and software involved in our entire product and not just the GraphQL service. This distinction is essential because it separates the end-user / Customer from the Developer who uses the API - the end-user looks at our product as one thing, not many individual services.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;404 Not Found&lt;/code&gt; case, if we had modeled the errors as schema types, it would make the Schema less usable. Let's take a product look-up use-case -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ProductSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ProductError&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;CollectionSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ProductSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;CollectionError&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This way of handling errors at every level is not friendly for front-end developers. It's too much to type in a query and too many branches to handle in the code.&lt;/p&gt;
&lt;h3&gt;Part 3. Error propagation&lt;/h3&gt;
&lt;p&gt;We also have to remember not to disrupt GraphQL semantics of error propagation. If an error occurs in one place in the query, it propagates upwards in the tree till the first nullable field occurs. This propagation does not happen with error types in Schema. It is essential to model these schema error types for only specific use-cases. We go back to Part 1: Action-ables - we design these types for actions that the end-user or Customer can take.&lt;/p&gt;
&lt;h2&gt;The Problem type&lt;/h2&gt;
&lt;p&gt;Naming is half the battle in GraphQL. Since the name &lt;code&gt;error&lt;/code&gt; is already taken by the GraphQL language (&lt;code&gt;response.errors&lt;/code&gt;), it would be confusing to name our error types in Schema as &lt;code&gt;Error&lt;/code&gt;. Looking for inspiration as we did before, we find a well-defined concept in &lt;a href="https://tools.ietf.org/html/rfc7807"&gt;RFC 7807 - Problem Details for HTTP APIs&lt;/a&gt;. So, we will call all our errors in Schema Problems and, as it has always been, all other errors errors.&lt;/p&gt;
&lt;p&gt;The above register schema with the &lt;code&gt;Problem&lt;/code&gt; type would look like this -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mutation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RegisterResult&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;union&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterResult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterProblem&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterSuccess&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterProblem&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="n"&gt;translated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;encompassing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;invalid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="err"&gt;.&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;invalidInputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RegisterInvalidInput&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;InvalidInput&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RegisterInvalidInputField&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;&amp;quot;&lt;/span&gt;&lt;span class="n"&gt;translated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="err"&gt;.&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;enum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RegisterInvalidInputField&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;EMAIL&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;PASSWORD&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
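&lt;p&gt;On the front-end, a result of this union type can be handled by branching on &lt;code&gt;__typename&lt;/code&gt;. The following is a minimal sketch, not our production code: the function name &lt;code&gt;handleRegisterResult&lt;/code&gt; and the shape of its return value are assumptions made for illustration, and it assumes the mutation selected &lt;code&gt;__typename&lt;/code&gt; along with the fields of both union members.&lt;/p&gt;

```javascript
// Sketch: turn a RegisterResult (RegisterSuccess | RegisterProblem) into
// something a form component can render. Names here are hypothetical.
function handleRegisterResult(result) {
  if (result.__typename === "RegisterProblem") {
    // Only the Customer can fix invalid inputs, so key the translated
    // messages by the input field they belong to (EMAIL, PASSWORD).
    const fieldMessages = {};
    for (const input of result.invalidInputs || []) {
      if (input === null) continue; // list items are nullable in the schema
      fieldMessages[input.field] = input.message;
    }
    return { ok: false, title: result.title, fieldMessages };
  }
  // RegisterSuccess: the registration went through.
  return { ok: true, userId: result.id };
}
```

&lt;p&gt;Because a Problem is regular data, the form can show &lt;code&gt;title&lt;/code&gt; as a summary and each &lt;code&gt;message&lt;/code&gt; next to its input without inspecting &lt;code&gt;response.errors&lt;/code&gt; at all.&lt;/p&gt;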

&lt;h2&gt;Problem or Error&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Errors vs Problems" src="https://engineering.zalando.com/posts/2021/04/images/problem-vs-error-2.jpg"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt; refers to the Error as a Schema type. &lt;strong&gt;Error&lt;/strong&gt; refers to the Error that appears in the &lt;code&gt;response.errors&lt;/code&gt; array with an error code at &lt;code&gt;error.extensions.code&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Case 1: Resource Not Found&lt;/h3&gt;
&lt;p&gt;404s are bugs in the system in case of navigation. If the user navigates from the home page to a product page and ends up on a 404 page, some service selected an id that leads to a 404 when resolved, and the id was most likely already invalid when it was selected. It did not happen because the user entered some input. Also, these errors need to be propagated. So, this becomes an Error with the error code &lt;code&gt;NOT_FOUND&lt;/code&gt; and not a Problem.&lt;/p&gt;
&lt;h3&gt;Case 2: Authorization&lt;/h3&gt;
&lt;p&gt;Authorization errors are of the Error type and do not fit a problem type. Here, the action taker looks like it's the Customer who needs to log in. But, the UI can take action here and show a login dialog box to the Customer. In apps, the app decides to take the Customer to the login view. The action belongs to the Front-end and only then the Customer. So, we model it for the developer/front-end as an Error with error code &lt;code&gt;NOT_AUTHORIZED&lt;/code&gt; and not a Problem.&lt;/p&gt;
&lt;h3&gt;Case 3: Mutation Inputs&lt;/h3&gt;
&lt;p&gt;Mutation Inputs is the only case where it is crucial to construct Problem types. It contains inputs directly from the Customer, and only the Customer can take action for this. So, we model these errors as Problems and not Errors.&lt;/p&gt;
&lt;h3&gt;Case 4: All other bugs / errors&lt;/h3&gt;
&lt;p&gt;Any runtime exception in the code or Internal Server Errors from any backends that the GraphQL layer connects to should be modeled as an Error and need not contain an error code. This way, it is easy for the front-end to treat all errors without an error code as Internal Server Errors and take action accordingly - to retry or show the Customer an error page.&lt;/p&gt;
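&lt;p&gt;Cases 1, 2, and 4 can be summarized as a small front-end helper that decides what to do with an entry of &lt;code&gt;response.errors&lt;/code&gt;. This is a sketch under assumptions: the function name and the returned action names are hypothetical, only the error codes &lt;code&gt;NOT_FOUND&lt;/code&gt; and &lt;code&gt;NOT_AUTHORIZED&lt;/code&gt; come from the cases above, and treating unknown codes like a missing code is our simplification.&lt;/p&gt;

```javascript
// Sketch: map one entry of response.errors to a front-end action.
// The action names are hypothetical; only the codes come from the text.
function actionForError(error) {
  const code = error.extensions ? error.extensions.code : undefined;
  switch (code) {
    case "NOT_FOUND":
      return "SHOW_NOT_FOUND_PAGE";
    case "NOT_AUTHORIZED":
      // The front-end acts first: it opens the login dialog or view.
      return "SHOW_LOGIN";
    default:
      // No error code (or, by assumption, an unknown one): treat it as an
      // Internal Server Error and retry or show an error page.
      return "RETRY_OR_SHOW_ERROR_PAGE";
  }
}
```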
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We have discussed the Problem type as a possible solution where the error object in the GraphQL response does not suffice for the use-case. But we have to be careful not to overuse it for the many use-cases where the error extensions already provide enough value.&lt;/p&gt;
&lt;p&gt;We have to understand that the Problem type in &lt;strong&gt;unnecessary&lt;/strong&gt; places does make the query and front-end code complicated. Our GraphQL Schema should try to simplify and provide a friendly interface.&lt;/p&gt;
&lt;h2&gt;Related posts&lt;/h2&gt;
&lt;p&gt;In case you are interested, here are further posts in the GraphQL series -&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;Introduction to how we use GraphQL at Zalando&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2023/10/understanding-graphql-directives-practical-use-cases-zalando.html"&gt;Understanding GraphQL Directives: Practical Use-Cases at Zalando&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;GraphQL persisted queries and Schema stability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/03/optimize-graphql-server-with-lookaheads.html"&gt;Optimize GraphQL Server with Lookaheads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="GraphQL"/><category term="Backend"/></entry><entry><title>Optimize GraphQL Server with Lookaheads</title><link href="https://engineering.zalando.com/posts/2021/03/optimize-graphql-server-with-lookaheads.html" rel="alternate"/><published>2021-03-18T00:00:00+01:00</published><updated>2021-03-18T00:00:00+01:00</updated><author><name>Boopathi Rajaa Nedunchezhiyan</name></author><id>tag:engineering.zalando.com,2021-03-18:/posts/2021/03/optimize-graphql-server-with-lookaheads.html</id><summary type="html">&lt;p&gt;GraphQL offers a way to optimize the data between a client and a server. We can use the declarative nature of a GraphQL query to perform lookaheads. Lookaheads provide us a way to optimize the data between the GraphQL server and a backend data provider - like a database or another server that can return partial responses.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In our first post about &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;How we use GraphQL at Zalando&lt;/a&gt;, we briefly shared about performance optimizations using &lt;a href="https://github.com/zalando-incubator/graphql-jit"&gt;GraphQL-JIT&lt;/a&gt;. GraphQL-JIT allowed us to scale our implementation without performance degradations. In this post, we share another optimization we use - &lt;strong&gt;Lookaheads&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lookaheads" src="https://engineering.zalando.com/posts/2021/03/images/lookaheads.png"&gt;&lt;/p&gt;
&lt;h2&gt;Same Model; Different Views&lt;/h2&gt;
&lt;p&gt;In our GraphQL service, we do not have resolvers for every single field in the schema. Instead, we have certain groups of fields resolved together as a single request to a backend service that provides the data. For example, let's take a look at the &lt;code&gt;product&lt;/code&gt; resolver,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;resolvers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;Query&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ProductBackend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This resolver will be responsible for getting multiple properties of the &lt;code&gt;Product&lt;/code&gt; - name, price, stock, images, material, sizes, brand, color, other colors, and further details. The same &lt;strong&gt;Product&lt;/strong&gt; type in the schema can render as a Product Card in a grid or the entire Product Page. The amount of data required for a Product card is less than the complete product details of a product page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Different views of the same model" src="https://engineering.zalando.com/posts/2021/03/images/same-model-different-views.png"&gt;&lt;/p&gt;
&lt;p&gt;Every time the product resolver is called, the GraphQL service requests the entire response from the product backend. Though GraphQL allows us to specify the data requirements to fetch optimally, this benefits only the communication between the client and the GraphQL server. The data transfers between the GraphQL server and the Backend server remain unoptimized.&lt;/p&gt;
&lt;h2&gt;Partial Responses&lt;/h2&gt;
&lt;p&gt;Most of the backend services in Zalando support &lt;a href="https://cloud.google.com/blog/products/api-management/restful-api-design-can-your-api-give-developers-just-information-they-need"&gt;Partial responses&lt;/a&gt;. In the request, one can specify a list of fields. The backend service treats this list as a filter and returns only those fields, trimming every field that was not specified in the request. It is similar to what GraphQL offers us, and the request looks somewhat like this -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;GET /product?id=product-id&amp;amp;fields=name,stock,price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, the &lt;code&gt;fields&lt;/code&gt; query parameter is used to declare the required response fields. The backend can use this to compute only those response fields. Likewise, the backend can pass it further down the pipeline to another service or database. The response for the above request would look like the following -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Fancy T-Shirt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;stock&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AVAILABLE&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;price&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;EUR 35.50&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Partial responses help in reducing the amount of data over the wire and give a good performance boost. A GraphQL query is also precisely the same thing - it provides a well-defined language for the fields parameter in the above request.&lt;/p&gt;
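&lt;p&gt;As a rough illustration, the request above could be built as follows. The function name &lt;code&gt;productUrl&lt;/code&gt; and the base URL are assumptions for this sketch; the &lt;code&gt;/product&lt;/code&gt; path and the &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;fields&lt;/code&gt; parameters are taken from the example request.&lt;/p&gt;

```javascript
// Sketch: build the URL of a partial-response request.
// baseUrl is a hypothetical backend address.
function productUrl(baseUrl, id, fields) {
  const url = new URL("/product", baseUrl);
  url.searchParams.set("id", id);
  // The backend expects a comma-separated list of field names.
  url.searchParams.set("fields", fields.join(","));
  return url.toString();
}
```

&lt;p&gt;Note that &lt;code&gt;searchParams&lt;/code&gt; percent-encodes the commas as &lt;code&gt;%2C&lt;/code&gt;; this sketch assumes the backend decodes query parameters before splitting the list.&lt;/p&gt;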
&lt;h2&gt;Lookahead&lt;/h2&gt;
&lt;p&gt;Let's leverage these partial responses and use them in the GraphQL server. When resolving the product, we must know which fields are selected within this product; that is, we need to &lt;strong&gt;look ahead&lt;/strong&gt; in the query to get the sub-fields of the product.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A thing to note - name, stock, and price do not have explicitly declared resolvers. When resolving &lt;strong&gt;product&lt;/strong&gt;, how can we know what its sub-selections are? Here, navigating the query &lt;a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree"&gt;AST (Abstract Syntax Tree)&lt;/a&gt; helps. During execution, the resolver function will receive the AST of the current field. The structure of the AST depends on the language and implementation. For &lt;a href="https://github.com/graphql/graphql-js"&gt;GraphQL-JS&lt;/a&gt;, or &lt;a href="https://github.com/zalando-incubator/graphql-jit"&gt;GraphQL-JIT&lt;/a&gt; executors, it is available in the last parameter (of the resolver function) which is called a &lt;strong&gt;Resolve Info&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;resolvers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;Query&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ProductBackend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use the query AST in the resolve info to compute the list of fields under product, pass this list of fields to the product backend, which supports partial responses, and then send the backend response as the resolved result.&lt;/p&gt;
&lt;h2&gt;Field Nodes&lt;/h2&gt;
&lt;p&gt;The resolve info is useful for doing a lot of optimizations. Here, for this case, we are interested in the &lt;strong&gt;fieldNodes&lt;/strong&gt;. It is an array of objects, each representing the same field - in this case - &lt;strong&gt;product&lt;/strong&gt;. Why is it an array? A single field may appear in more than one place in a query - for instance, fragments, inline fragments, aliasing, etc. For simplicity, we will not consider fragments and aliasing in this post.&lt;/p&gt;
&lt;p&gt;The entire query is a tree of field nodes where the children at each level are available as selection sets.&lt;/p&gt;
&lt;p&gt;Each fieldNode has a &lt;strong&gt;Selection Set&lt;/strong&gt;, a list of &lt;strong&gt;subfield nodes&lt;/strong&gt; - here - the selection set will be the field nodes of name, stock, and price. So the &lt;code&gt;getFields&lt;/code&gt; implementation (without considering fragments and aliasing) will look like the following -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// TODO: handle all field nodes in other fragments&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fieldNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;selectionSet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// TODO: handle fragments&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;selection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When we pass the product resolver's info, the &lt;code&gt;getFields&lt;/code&gt; function returns &lt;code&gt;[name, price, stock]&lt;/code&gt;, in the order in which the fields appear in the query. We can take this list and pass it to the backend as the &lt;code&gt;fields&lt;/code&gt; query parameter.&lt;/p&gt;
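&lt;p&gt;The two TODOs in &lt;code&gt;getFields&lt;/code&gt; could be addressed along the following lines. This is a sketch rather than the implementation we run: it relies on the &lt;code&gt;kind&lt;/code&gt; of each selection node and on &lt;code&gt;info.fragments&lt;/code&gt; from the GraphQL-JS resolve info, and it deliberately ignores type conditions, aliases, and &lt;code&gt;@skip&lt;/code&gt; / &lt;code&gt;@include&lt;/code&gt; directives.&lt;/p&gt;

```javascript
// Sketch: collect the sub-field names of the field being resolved,
// following inline fragments and named fragment spreads.
function getFields(info) {
  const names = new Set();

  function visit(selectionSet) {
    for (const selection of selectionSet.selections) {
      if (selection.kind === "Field") {
        names.add(selection.name.value);
      } else if (selection.kind === "InlineFragment") {
        visit(selection.selectionSet);
      } else if (selection.kind === "FragmentSpread") {
        // info.fragments maps fragment names to their definitions.
        visit(info.fragments[selection.name.value].selectionSet);
      }
    }
  }

  // Every field node represents the same field at a different place
  // in the query, so all of them contribute sub-fields.
  for (const fieldNode of info.fieldNodes) {
    if (fieldNode.selectionSet) visit(fieldNode.selectionSet);
  }
  return [...names];
}
```

&lt;p&gt;Using a &lt;code&gt;Set&lt;/code&gt; removes duplicates when the same sub-field is selected both directly and through a fragment, while keeping the order of first appearance.&lt;/p&gt;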
&lt;p&gt;For simple use-cases like these, where the backend data structure and the GraphQL schema are the same, it's possible to use GraphQL fields as the backend fields. When it's a bit different, we need to map the schema fields to backend fields for the request. Also, we need to map the backend fields back to schema fields for the response.&lt;/p&gt;
&lt;h2&gt;Different schemas&lt;/h2&gt;
&lt;p&gt;If the backend fields are different from the GraphQL schema fields, then there exists a mapping from schema fields to backend fields. A simple mapping may be a difference in the names of the fields. For example, &lt;code&gt;name&lt;/code&gt; in the schema might be &lt;code&gt;title&lt;/code&gt; in the backend. The mapping gets more complex when a single schema field derives from multiple backend fields. For example, &lt;code&gt;price&lt;/code&gt; in the schema might be a concatenation of &lt;em&gt;currency&lt;/em&gt; and &lt;em&gt;amount&lt;/em&gt; from the backend. It gets interesting when we have nested structures - for example, &lt;code&gt;price&lt;/code&gt; in the schema might be a concatenation of &lt;code&gt;price.currency&lt;/code&gt; and &lt;code&gt;price.amount&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;The response is partial&lt;/h2&gt;
&lt;p&gt;Another aspect of this mapping is that it's not enough to think about it in one direction - from schema fields to backend fields. That direction only covers the request from the GraphQL server to the backend server. The response that the backend sends must also be transformed to match the schema, and that transformation isn't free when the field mapping is this complicated.&lt;/p&gt;
&lt;p&gt;When we have a single transform function that converts backend response to match the schema, we have to understand that it is built from a &lt;a href="https://cloud.google.com/blog/products/api-management/restful-api-design-can-your-api-give-developers-just-information-they-need"&gt;partial response&lt;/a&gt; and not the complete response -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendProductToSchemaProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;backendProduct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendProduct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="c1"&gt;// we have a problem here -&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;backendProduct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;backendProduct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendProduct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stock_availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the above implementation, when the query is &lt;code&gt;{ product(id) { name } }&lt;/code&gt;, the transformer converts every field, assuming the complete response is available. Since the backend responded with partial data (only the field backing &lt;code&gt;name&lt;/code&gt; was requested), accessing a property nested under a missing object (for example, &lt;code&gt;currency&lt;/code&gt; under an absent &lt;code&gt;price&lt;/code&gt;) throws an error - &lt;code&gt;Cannot read property 'currency' of undefined&lt;/code&gt;. We could add a &lt;code&gt;null&lt;/code&gt; check at every access, but the code quickly becomes unmaintainable. So we need a way to model the mapping both ways -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Map schema fields to backend fields during the request to the backend&lt;/li&gt;
&lt;li&gt;Map backend fields to schema fields with the response from the backend&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Dependency Maps&lt;/h2&gt;
&lt;p&gt;The mapping we talked about in our scribbling phase is what a dependency map is. Every schema field depends on one or many nested fields in the backend. A way to represent this can be as simple as an object whose keys are schema fields, and the values are a list of &lt;a href="https://github.com/mariocasciaro/object-path#usage"&gt;object paths&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.currency&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.amount&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stock_availability&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Dependency Map" src="https://engineering.zalando.com/posts/2021/03/images/dependencies.png"&gt;&lt;/p&gt;
&lt;p&gt;From this dependency map, we can create our request to the backend. Let's say the backend takes a query parameter &lt;code&gt;fields&lt;/code&gt; in the following form - a comma-separated list of object path strings. Depending on the implementation, there can be a wide variety of formats for this. Here, we will take a simple one.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getBackendFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Set helps in deduping&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

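&lt;p&gt;For the schema fields &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt;, the computed backend fields are the string &lt;code&gt;title,price.currency,price.amount&lt;/code&gt;, and we can construct the request to the backend -&lt;/p&gt;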
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;GET /product?id=foo&amp;amp;fields=title,price.currency,price.amount&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
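&lt;p&gt;As a quick check of this mapping, here is a minimal, self-contained sketch that restates the dependency map and computes the value of the &lt;code&gt;fields&lt;/code&gt; parameter. It uses &lt;code&gt;flatMap&lt;/code&gt; instead of map and reduce, and spreads the Set into an array before joining. The &lt;code&gt;currencyCode&lt;/code&gt; schema field is a hypothetical addition, included only to show the deduplication.&lt;/p&gt;

```javascript
// Dependency map restated from the article: schema field -> backend object paths.
// currencyCode is a hypothetical extra field that shares a path with price.
const dependencyMap = {
  name: ["title"],
  price: ["price.currency", "price.amount"],
  stock: ["stock_availability"],
  currencyCode: ["price.currency"],
};

function getBackendFields(schemaFields, dependencyMap) {
  // flatMap collects every path; the Set removes duplicates.
  // Unknown schema fields contribute no paths.
  const backendFields = new Set(
    schemaFields.flatMap((field) => dependencyMap[field] ?? [])
  );
  // A Set has no join method, so spread it into an array first.
  return [...backendFields].join(",");
}

console.log(getBackendFields(["name", "price"], dependencyMap));
// title,price.currency,price.amount

console.log(getBackendFields(["price", "currencyCode"], dependencyMap));
// price.currency,price.amount
```

&lt;p&gt;Because a Set preserves insertion order, the backend paths appear in the order in which the schema fields first introduce them.&lt;/p&gt;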

&lt;h2&gt;Transformation Maps&lt;/h2&gt;
&lt;p&gt;After the request, we know that the backend returns a partial response instead of the complete response. We also saw above that a single function that transforms the entire backend response to schema fields is not enough. Here, we use a &lt;strong&gt;transformation map&lt;/strong&gt;. It's a map of schema fields to transformation logic. Like the dependency map, the keys are schema fields, but the values are transform functions that use only specific fields from the backend.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stock_availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you see here, each value is a function where the only properties used inside this function are from the &lt;strong&gt;dependency map&lt;/strong&gt;. To construct the result object from the partial response of the backend, we use the same computed sub-fields (from the &lt;code&gt;getFields&lt;/code&gt; function) and use them on the transformer map. For example -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getSchemaResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
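&lt;p&gt;To see why per-field transformers cope with partial responses, the following sketch runs &lt;code&gt;getSchemaResponse&lt;/code&gt; against a partial backend response. It assumes the backend nests &lt;code&gt;currency&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; under &lt;code&gt;price&lt;/code&gt;, as in the dependency map, and the sample product data is made up for illustration.&lt;/p&gt;

```javascript
// Each transformer reads only the backend paths listed for its schema field.
// Assumes currency and amount are nested under price in the backend.
const transformerMap = {
  name: (resp) => resp.title,
  price: (resp) => `${resp.price.currency} ${resp.price.amount}`,
  stock: (resp) => resp.stock_availability,
};

function getSchemaResponse(backendResponse, transformerMap, schemaFields) {
  const schemaResponse = {};
  for (const field of schemaFields) {
    schemaResponse[field] = transformerMap[field](backendResponse);
  }
  return schemaResponse;
}

// Made-up partial backend response for a request with fields=title.
const partialResponse = { title: "Sneakers" };

// Only the requested schema field is transformed, so price is never touched.
const result = getSchemaResponse(partialResponse, transformerMap, ["name"]);
console.log(result); // { name: 'Sneakers' }

// Asking for a field whose backend paths were not fetched still fails,
// because resp.price is undefined in the partial response.
let failure = null;
try {
  getSchemaResponse(partialResponse, transformerMap, ["name", "price"]);
} catch (error) {
  failure = error;
}
console.log(failure instanceof TypeError); // true
```

&lt;p&gt;The second call shows that the list of schema fields passed to the transformer step has to be the same list that was used to compute the backend fields.&lt;/p&gt;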

&lt;h2&gt;So far,&lt;/h2&gt;
&lt;p&gt;Let's recap the concepts we have covered so far -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;getFields&lt;/code&gt;: compute sub-fields by looking ahead in AST&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getBackendFields&lt;/code&gt;: compute backend fields from sub-fields and dependency map&lt;/li&gt;
&lt;li&gt;request the backend with the computed backend fields&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getSchemaResponse&lt;/code&gt;: compute schema response from partial backend response, sub-fields, and the transformer map&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Batching&lt;/h2&gt;
&lt;p&gt;At Zalando, in addition to &lt;a href="https://cloud.google.com/blog/products/api-management/restful-api-design-can-your-api-give-developers-just-information-they-need"&gt;partial responses&lt;/a&gt;, most of our backends support batching multiple requests into a single request. Instead of only getting a resource by its &lt;code&gt;id&lt;/code&gt;, most backends offer a way to get several resources by their &lt;code&gt;ids&lt;/code&gt;. For example,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;GET /products?ids=a,b,c&amp;amp;fields=name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;will return the response,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;a&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;b&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;c&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We should take advantage of such features. One of the popular libraries that aid us in batching is the &lt;a href="https://github.com/graphql/dataloader"&gt;DataLoader&lt;/a&gt; by Facebook.&lt;/p&gt;
&lt;p&gt;We provide the dataloader with a batch function - an implementation that takes an array of inputs and returns an array of outputs/responses in the same order. The dataloader takes care of combining and batching requests from multiple places in the code in an optimal fashion. You can read more about it in the DataLoader &lt;a href="https://github.com/graphql/dataloader"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
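&lt;p&gt;The same-order requirement matters because a backend may return products in a different order than requested, or leave some out. A minimal sketch of aligning a batch response with the requested ids could look like this. The &lt;code&gt;alignToIds&lt;/code&gt; helper is hypothetical, and it assumes every backend product carries its own &lt;code&gt;id&lt;/code&gt; property, which the example responses in this post do not show.&lt;/p&gt;

```javascript
// Hypothetical helper: reorders a batch response to match the requested ids.
// Assumes every backend product carries its own id property.
function alignToIds(ids, products) {
  const byId = new Map(products.map((product) => [product.id, product]));
  // A missing id becomes an Error entry at that position, so the result
  // keeps the same length and order as the requested ids.
  return ids.map(
    (id) => byId.get(id) ?? new Error(`Product not found: ${id}`)
  );
}

// Made-up backend response: out of order, and product "b" is missing.
const aligned = alignToIds(
  ["a", "b", "c"],
  [
    { id: "c", name: "C" },
    { id: "a", name: "A" },
  ]
);

console.log(
  aligned.map((entry) => (entry instanceof Error ? "error" : entry.name))
);
// [ 'A', 'error', 'C' ]
```

&lt;p&gt;DataLoader accepts an &lt;code&gt;Error&lt;/code&gt; instance in place of a value, which rejects only the corresponding &lt;code&gt;load&lt;/code&gt; call instead of the whole batch.&lt;/p&gt;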
&lt;h2&gt;Dataloader for Product resolver&lt;/h2&gt;
&lt;p&gt;When a Product appears in multiple parts of the same GraphQL query, each will create separate requests to the backend. For example, let's consider this simple GraphQL query -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="n"&gt;productCardFields&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="n"&gt;productCardFields&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The products &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;bar&lt;/code&gt; are batched together into a single query using aliasing. If we implement a resolver for a product that calls the ProductBackend directly, we will end up with &lt;strong&gt;two&lt;/strong&gt; separate requests. Our goal is to fetch both with a single request. We can implement this with a dataloader -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getProductsByIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`/products?ids=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Dataloader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getProductsByIds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can use this &lt;code&gt;productLoader&lt;/code&gt; in our &lt;code&gt;product&lt;/code&gt; resolver -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;resolvers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Dataloader takes care of the magic of combining multiple calls to the load method into a single call to our implementation - &lt;code&gt;getProductsByIds&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Complexities&lt;/h2&gt;
&lt;p&gt;The DataLoader deduplicates inputs, optionally caches the outputs, and also provides a way to customize this behavior. In the &lt;code&gt;productLoader&lt;/code&gt; defined above, our input is the product &lt;strong&gt;id&lt;/strong&gt; - a &lt;strong&gt;string&lt;/strong&gt;. When we introduce the concept of &lt;a href="https://cloud.google.com/blog/products/api-management/restful-api-design-can-your-api-give-developers-just-information-they-need"&gt;partial responses&lt;/a&gt;, the backend expects more than just the &lt;code&gt;id&lt;/code&gt; - it also expects the &lt;code&gt;fields&lt;/code&gt; parameter used to select the fields for the response. So our input to the loader is not just a string - let's say it's an object with the keys &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;fields&lt;/code&gt;. The dataloader implementation now becomes -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getProductsByIds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// We have a problem here&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;//                    v&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sb"&gt;`/products?ids=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;&amp;amp;fields=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, in the above code-block, the problem is highlighted with a comment - each of the &lt;code&gt;productLoader.load&lt;/code&gt; calls can have a different set of fields. What is our strategy for merging all of these fields? Why do we need to merge?&lt;/p&gt;
&lt;p&gt;Let's go back to an example and understand why we should handle this -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The product &lt;code&gt;foo&lt;/code&gt; requires &lt;strong&gt;name&lt;/strong&gt; and product &lt;code&gt;bar&lt;/code&gt; requires &lt;strong&gt;price&lt;/strong&gt;. Recall that the dependency map translates schema fields to backend fields (&lt;code&gt;name&lt;/code&gt; depends on the backend field &lt;code&gt;title&lt;/code&gt;), so we end up with the following calls -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;foo&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.currency&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.amount&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If these two calls end up in a single batch, we need to merge the fields so that both calls still work when the backend fields are transformed to schema fields. Unfortunately, in most cases the backend cannot select different fields for different ids. If yours can, you probably do not need merging. For our use case, and probably many others, merging is necessary, so let's continue on that assumption.&lt;/p&gt;
&lt;h2&gt;Merging fields&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Merge fields and IDs" src="https://engineering.zalando.com/posts/2021/03/images/merge.png"&gt;&lt;/p&gt;
&lt;p&gt;In the above example, the correct request to the backend would be -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;GET /products&lt;/span&gt;
&lt;span class="err"&gt;  ? ids = foo , bar&lt;/span&gt;
&lt;span class="err"&gt;  &amp;amp; fields = title , price.currency , price.amount&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The merge strategy is quite simple; it's a union of all the fields. Structurally we need the following transformation - &lt;code&gt;[ { id, fields } ]&lt;/code&gt; to &lt;code&gt;{ ids, mergedFields }&lt;/code&gt;. The following implementation merges the inputs -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mergeInputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;mergedFields&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
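&lt;p&gt;As a quick sanity check, the sketch below applies the same union strategy to two hypothetical inputs whose fields overlap. It is a self-contained illustration under stated assumptions, not the production code: &lt;code&gt;unionInputs&lt;/code&gt; is a name made up for this sketch, the ids and fields are sample values, and each input is assumed to carry the singular &lt;code&gt;id&lt;/code&gt; key that is passed to &lt;code&gt;productLoader.load&lt;/code&gt; -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical helper for this sketch: union of fields across batched inputs.
function unionInputs(inputs) {
  const ids = [];
  const fields = new Set(); // a Set drops duplicate field names
  for (const input of inputs) {
    ids.push(input.id);
    for (const field of input.fields) {
      fields.add(field);
    }
  }
  return { ids, mergedFields: [...fields].join(',') };
}

// Sample inputs (assumed values): both ask for price.amount.
const merged = unionInputs([
  { id: 'bar', fields: ['price.currency', 'price.amount'] },
  { id: 'baz', fields: ['price.amount', 'stock_availability'] },
]);

console.log(merged.ids.join(',')); // bar,baz
console.log(merged.mergedFields); // price.currency,price.amount,stock_availability
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because a &lt;code&gt;Set&lt;/code&gt; keeps insertion order and ignores repeated values, the shared field appears only once in the merged list.&lt;/p&gt;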

&lt;h2&gt;Putting it all together&lt;/h2&gt;
&lt;p&gt;Combining all the little things we handled so far, the flow for the &lt;code&gt;product&lt;/code&gt; field resolution would be -&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;getFields&lt;/code&gt;: compute the sub-fields by looking ahead in the AST.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getBackendFields&lt;/code&gt;: compute the list of backend fields from the sub-fields and the dependency map.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;productLoader.load({ id, fields: backendFields })&lt;/code&gt;: schedule the product fetch with the dataloader.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mergeInputs&lt;/code&gt;: merge the different inputs to the dataloader into a list of ids and the union of the backend fields from all inputs.&lt;/li&gt;
&lt;li&gt;Send the batched input as a request to the backend and get the partial response.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getSchemaResponse&lt;/code&gt;: compute the schema fields from the partial backend response, the sub-fields computed in the first step, and the transformer map.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getBackendProducts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resolvers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;Query&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getBackendFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;productLoader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getSchemaResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;title&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.currency&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;price.amount&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;stock_availability&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt; &lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stock_availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fieldNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;selectionSet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selections&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// TODO: handle all field nodes in other fragments&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// TODO: handle fragments&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;selection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getBackendFields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Set helps in deduping&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dependencyMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;backendFields&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getBackendProducts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mergedFields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mergeInputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
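&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sb"&gt;`/products?ids=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;&amp;amp;fields=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;mergedFields&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// fetch resolves to a Response, so parse the JSON body to get the products.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Assumption: the backend returns one product per id, in the requested order.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;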
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;mergeInputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;mergedFields&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;getSchemaResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaFields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;transformerMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;field&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;backendResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;schemaResponse&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
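&lt;p&gt;One detail worth spelling out: DataLoader requires the batch function to return an array with exactly one entry per input, in the same order as the inputs. The sketch below shows one way to satisfy that contract. It assumes, and this is an assumption rather than something stated in this post, that the backend returns an array of products that each carry an &lt;code&gt;id&lt;/code&gt;, possibly in a different order or with some ids missing. &lt;code&gt;alignWithInputs&lt;/code&gt; is a hypothetical helper name and the sample data is invented -&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Hypothetical helper: align backend products with the batched inputs.
function alignWithInputs(inputs, products) {
  // Index the products by id for constant-time lookup.
  const byId = new Map();
  for (const product of products) {
    byId.set(product.id, product);
  }
  // One entry per input, in input order. DataLoader rejects the matching
  // load() call when the entry is an Error instance.
  return inputs.map(function (input) {
    if (byId.has(input.id)) {
      return byId.get(input.id);
    }
    return new Error('Product not found: ' + input.id);
  });
}

// Sample data (assumed shape): the backend answers out of order and omits one id.
const aligned = alignWithInputs(
  [{ id: 'foo' }, { id: 'bar' }, { id: 'missing' }],
  [{ id: 'bar', title: 'Bar' }, { id: 'foo', title: 'Foo' }]
);

console.log(aligned[0].title); // Foo
console.log(aligned[1].title); // Bar
console.log(aligned[2] instanceof Error); // true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If your backend already guarantees one product per id in request order, this extra step is unnecessary.&lt;/p&gt;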

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;All of the code, patterns, and nuances we have seen until now may differ for different applications or different languages. The critical aspect is to leverage the declarative nature of GraphQL and optimize for better user experience at all points throughout the lifecycle of a request.&lt;/p&gt;
&lt;p&gt;Field filtering using Dependency Maps and Transformer Maps enables us to handle the complexities of optimizing GraphQL servers for performance. Though this looks like a lot of work, at runtime it outperforms the unoptimized handling of huge backend responses, which pays for JSON parsing, the transfer of bytes, and the time the backend takes to construct the response.&lt;/p&gt;
&lt;p&gt;You also have to consider the trade-off of whether such optimizations work for every backend. As the GraphQL schema grows, these solutions scale well. At Zalando's scale, it has proved to be better than transferring a giant unoptimized blob of data.&lt;/p&gt;
&lt;h2&gt;Related posts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;Introduction to how we use GraphQL at Zalando&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2023/10/understanding-graphql-directives-practical-use-cases-zalando.html"&gt;Understanding GraphQL Directives: Practical Use-Cases at Zalando&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2022/02/graphql-persisted-queries-and-schema-stability.html"&gt;GraphQL persisted queries and Schema stability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/04/modeling-errors-in-graphql.html"&gt;Modeling Errors in GraphQL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="GraphQL"/><category term="NodeJS"/><category term="Backend"/><category term="Frontend"/></entry><entry><title>Flexbox Layout Behavior in Jetpack Compose</title><link href="https://engineering.zalando.com/posts/2021/03/flexbox-layout-behavior-in-jetpack-compose.html" rel="alternate"/><published>2021-03-16T00:00:00+01:00</published><updated>2021-03-16T00:00:00+01:00</updated><author><name>Andy Dyer</name></author><id>tag:engineering.zalando.com,2021-03-16:/posts/2021/03/flexbox-layout-behavior-in-jetpack-compose.html</id><summary type="html">&lt;p&gt;Much of the layout behavior defined in the flexbox spec has a direct analog in Jetpack Compose.&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;The &lt;a href="https://drafts.csswg.org/css-flexbox-1/"&gt;CSS Flexible Box Layout specification&lt;/a&gt; (AKA flexbox) is a useful abstraction for describing layouts in a platform-agnostic way. For this reason, it is widely used on the web and even &lt;a href="https://github.com/google/flexbox-layout"&gt;on mobile&lt;/a&gt;. Readers familiar with &lt;a href="https://developer.android.com/reference/androidx/constraintlayout/widget/ConstraintLayout"&gt;&lt;code&gt;ConstraintLayout&lt;/code&gt;&lt;/a&gt; can think of flexbox as conceptually similar to the &lt;a href="https://developer.android.com/reference/androidx/constraintlayout/helper/widget/Flow"&gt;&lt;code&gt;Flow&lt;/code&gt;&lt;/a&gt; virtual layout it supports. This type of layout is ideal for grids or other groups of views with varying sizes.&lt;/p&gt;
&lt;p&gt;In the &lt;a href="https://play.google.com/store/apps/details?id=de.zalando.mobile"&gt;Zalando Fashion&lt;/a&gt;&amp;nbsp;&lt;a href="https://apps.apple.com/de/app/zalando-fashion-and-shopping/id585629514"&gt;Store apps&lt;/a&gt;, we are using flexbox to define the layout of our backend-driven screens, which I &lt;a href="http://andydyer.org/blog/2019/12/22/appcraft-faster-than-a-speeding-release-train/"&gt;spoke about previously&lt;/a&gt;. Thus far, we have been using &lt;a href="https://github.com/facebook/litho"&gt;Litho&lt;/a&gt; on Android and &lt;a href="https://github.com/TextureGroup/Texture"&gt;Texture&lt;/a&gt; on iOS (both of which use the flexbox-based &lt;a href="https://github.com/facebook/yoga"&gt;Yoga layout engine&lt;/a&gt;) for rendering backend-driven screens because they support features that are essential when building fully dynamic UI at runtime, such as async layout, efficient diffing of changes, and view flattening.&lt;/p&gt;
&lt;p&gt;As Google prepares &lt;a href="https://developer.android.com/jetpack/compose"&gt;Jetpack Compose&lt;/a&gt; (now in beta) for production release, we have started evaluating it as a successor to Litho. Compose offers numerous &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#top-level-functions"&gt;layout composables&lt;/a&gt;, many with bits of flexbox-like behavior. However, there is no &lt;code&gt;Flexbox&lt;/code&gt; composable that does it all and no blog post explaining how flexbox concepts map to Compose, so I wrote this one. I also built &lt;a href="https://github.com/abdyer/flexbox-compose"&gt;this sample app&lt;/a&gt;, parts of which I will reference in code examples below.&lt;/p&gt;
&lt;p&gt;Before we continue, yes, I know technically it's called &lt;em&gt;Compose UI&lt;/em&gt; and not simply &lt;em&gt;Compose&lt;/em&gt;, but &lt;a href="https://jakewharton.com/a-jetpack-compose-by-any-other-name/"&gt;as Jake said&lt;/a&gt;, most of us are already thinking of it this way. Insert a "UI" where necessary while reading if you'd like.&lt;/p&gt;
&lt;h3&gt;Flex&lt;/h3&gt;
&lt;p&gt;Let's start with the flex attributes, which describe the direction, size, and horizontal/vertical alignment of a layout's children.&lt;/p&gt;
&lt;h4&gt;Flex Direction&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://drafts.csswg.org/css-flexbox-1/#flex-direction-property"&gt;Flex direction&lt;/a&gt; specifies whether items are arranged vertically or horizontally. Compose has &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#row"&gt;&lt;code&gt;Row&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#column"&gt;&lt;code&gt;Column&lt;/code&gt;&lt;/a&gt; composables that work for simple horizontal and vertical layouts.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;RowExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;primaryVariant&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If &lt;a href="https://drafts.csswg.org/css-flexbox-1/#flex-wrap-property"&gt;flex wrap&lt;/a&gt; behavior is needed to control how items wrap across multiple rows, the &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#flowrow"&gt;&lt;code&gt;FlowRow&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#flowcolumn"&gt;&lt;code&gt;FlowColumn&lt;/code&gt;&lt;/a&gt; composables will do this. However, &lt;a href="https://android-review.googlesource.com/c/platform/frameworks/support/+/1521704"&gt;these were deprecated&lt;/a&gt; before I even finished writing this article, so the best we can do is use the old implementation as a reference for our own.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Deprecated&lt;/span&gt;
&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;FlowRowExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;FlowRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;mainAxisSpacing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;crossAxisSpacing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;48.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;24.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Flex wrap example" src="https://engineering.zalando.com/posts/2021/03/images/flex-wrap.jpg"&gt;&lt;/p&gt;
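&lt;p&gt;As a starting point for such a reimplementation, the wrapping logic can be sketched with the &lt;code&gt;Layout&lt;/code&gt; composable. This is a minimal sketch and not the deprecated implementation: &lt;code&gt;SimpleFlowRow&lt;/code&gt; is a hypothetical name, it always aligns items to the start of each row, and it assumes the parent provides a bounded width.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.layout.Layout
import androidx.compose.ui.unit.Dp
import androidx.compose.ui.unit.dp
import kotlin.math.max

// Hypothetical helper, not part of Compose: places children left to right
// and starts a new row whenever the next child would overflow the width.
@Composable
fun SimpleFlowRow(
    modifier: Modifier = Modifier,
    mainAxisSpacing: Dp = 0.dp,
    crossAxisSpacing: Dp = 0.dp,
    content: @Composable () -&amp;gt; Unit
) {
    Layout(content = content, modifier = modifier) { measurables, constraints -&amp;gt;
        val hGap = mainAxisSpacing.roundToPx()
        val vGap = crossAxisSpacing.roundToPx()
        // Let every child pick its own size, up to the parent's maximum.
        val placeables = measurables.map {
            it.measure(constraints.copy(minWidth = 0, minHeight = 0))
        }

        var x = 0
        var y = 0
        var rowHeight = 0
        var contentWidth = 0
        val positions = placeables.map { placeable -&amp;gt;
            // Wrap when the child does not fit in the current row.
            if (x &amp;gt; 0 &amp;amp;&amp;amp; x + placeable.width &amp;gt; constraints.maxWidth) {
                x = 0
                y += rowHeight + vGap
                rowHeight = 0
            }
            val position = Pair(x, y)
            contentWidth = max(contentWidth, x + placeable.width)
            rowHeight = max(rowHeight, placeable.height)
            x += placeable.width + hGap
            position
        }

        // Report a size that respects the incoming constraints.
        val width = contentWidth.coerceIn(constraints.minWidth, constraints.maxWidth)
        val height = (y + rowHeight).coerceIn(constraints.minHeight, constraints.maxHeight)
        layout(width, height) {
            placeables.forEachIndexed { index, placeable -&amp;gt;
                val (px, py) = positions[index]
                placeable.placeRelative(px, py)
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Called with the same spacing values and children as &lt;code&gt;FlowRowExample&lt;/code&gt;, this sketch should wrap items in a similar way, though it does not attempt the alignment options of the original composable.&lt;/p&gt;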
&lt;h4&gt;Flex Grow &amp;amp; Shrink&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/flex-grow"&gt;Flex grow&lt;/a&gt; controls how children will expand to fill available space in their parent layout. &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/flex-shrink"&gt;Flex shrink&lt;/a&gt; is its opposite, controlling how children will shrink relative to siblings if their parent layout does not have room for all of them.&lt;/p&gt;
&lt;p&gt;Use the &lt;code&gt;weight()&lt;/code&gt; modifier for flex grow behavior. Compose does not have a direct flex shrink analog, but with its variety of layout composables, this can usually be worked around with a different approach. Depending on your specific needs, one option is &lt;code&gt;Modifier.preferredWidth(IntrinsicSize.Min)&lt;/code&gt; (spelled &lt;code&gt;Modifier.width(IntrinsicSize.Min)&lt;/code&gt; in Compose 1.0 and later), which specifies that a composable should not take up any more space than its children require. You can read more about it in &lt;a href="https://jetc.dev/slack/2021-01-17-matching-parent-size.html"&gt;this question&lt;/a&gt;, reposted from the &lt;a href="https://slack.kotlinlang.org/"&gt;kotlinlang Slack&lt;/a&gt; in Mr. Mark Murphy's excellent &lt;a href="https://jetc.dev"&gt;jetc.dev&lt;/a&gt; newsletter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;FlexGrowExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;primaryVariant&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;FlexChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1F&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;FlexChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2F&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;FlexChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1F&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Flex grow example" src="https://engineering.zalando.com/posts/2021/03/images/flex-grow.jpg"&gt;&lt;/p&gt;
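&lt;p&gt;To make the proportions concrete, the arithmetic behind &lt;code&gt;weight()&lt;/code&gt; can be modeled in plain Kotlin. This is a simplified sketch, not Compose's actual implementation: &lt;code&gt;weightedWidths&lt;/code&gt; is a hypothetical helper, the 400dp row width is an assumed value, and pixel rounding is ignored. The space left after unweighted children are measured is split in proportion to each child's weight.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Simplified model of how a Row divides space among weighted children:
// whatever is left after unweighted children is split in proportion to weight.
fun weightedWidths(available: Float, fixed: Float, weights: List&amp;lt;Float&amp;gt;): List&amp;lt;Float&amp;gt; {
    val remaining = (available - fixed).coerceAtLeast(0f)
    val totalWeight = weights.sum()
    return weights.map { remaining * it / totalWeight }
}

fun main() {
    // An assumed 400dp-wide row with weights 1, 2, 1 and no unweighted children.
    println(weightedWidths(available = 400f, fixed = 0f, weights = listOf(1f, 2f, 1f)))
    // prints [100.0, 200.0, 100.0]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
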
&lt;p&gt;When the utmost flexibility is needed, there's always &lt;a href="https://developer.android.com/codelabs/jetpack-compose-layouts#5"&gt;implementing your own&lt;/a&gt; &lt;code&gt;Layout&lt;/code&gt; composable or the raw power of the &lt;a href="https://developer.android.com/jetpack/compose/layout#contraintlayout"&gt;ConstraintLayout composable&lt;/a&gt;, which can be used directly from Compose. If you don't mind reading Java instead of Kotlin, the implementation in Google's &lt;a href="https://github.com/google/flexbox-layout/blob/master/flexbox/src/main/java/com/google/android/flexbox/FlexboxHelper.java"&gt;&lt;code&gt;flexbox-layout&lt;/code&gt; library&lt;/a&gt; is a good starting point for understanding the algorithm.&lt;/p&gt;
&lt;h3&gt;Alignment&lt;/h3&gt;
&lt;p&gt;Alignment controls how items are arranged on their vertical and horizontal axes. This can be done on a parent layout with the &lt;code&gt;*-content&lt;/code&gt; properties or on the children themselves using the &lt;code&gt;*-self&lt;/code&gt; properties.&lt;/p&gt;
&lt;h4&gt;Main Axis&lt;/h4&gt;
&lt;p&gt;Main axis alignment refers to how children are aligned on the main axis of their parent: horizontal for rows and vertical for columns. In the flexbox spec, this is known as &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/justify-content"&gt;&lt;code&gt;justify-content&lt;/code&gt;&lt;/a&gt;. In Compose, main axis alignment is controlled by the &lt;code&gt;horizontalArrangement&lt;/code&gt; parameter passed to &lt;code&gt;Row&lt;/code&gt; and the &lt;code&gt;verticalArrangement&lt;/code&gt; parameter passed to &lt;code&gt;Column&lt;/code&gt;. Both accept values such as start/end, center, and space around/between/evenly.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ArrangementExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;primaryVariant&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;horizontalArrangement&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Arrangement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SpaceBetween&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Arrangement example" src="https://engineering.zalando.com/posts/2021/03/images/space-between.jpg"&gt;&lt;/p&gt;
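&lt;p&gt;The same idea applies on the vertical main axis of a &lt;code&gt;Column&lt;/code&gt;. The following is a minimal sketch and not taken from the sample app: the function name is hypothetical, and plain &lt;code&gt;Box&lt;/code&gt; composables stand in for the sample app's &lt;code&gt;Child&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import androidx.compose.foundation.background
import androidx.compose.foundation.layout.Arrangement
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.layout.height
import androidx.compose.foundation.layout.size
import androidx.compose.material.MaterialTheme
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun ColumnArrangementExample() {
    Column(
        modifier = Modifier.height(240.dp)
            .background(color = MaterialTheme.colors.primaryVariant),
        // The main axis of a Column is vertical, so this mirrors
        // justify-content: space-evenly in a column flex container.
        verticalArrangement = Arrangement.SpaceEvenly,
    ) {
        repeat(3) {
            // Plain boxes stand in for the sample app's Child composable.
            Box(
                modifier = Modifier.size(48.dp)
                    .background(color = MaterialTheme.colors.secondary)
            )
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
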
&lt;h4&gt;Cross Axis&lt;/h4&gt;
&lt;p&gt;Cross axis alignment refers to how children are aligned on the non-main axis of their parent: vertical for rows and horizontal for columns. In the flexbox spec, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/align-items"&gt;&lt;code&gt;align-items&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/align-content"&gt;&lt;code&gt;align-content&lt;/code&gt;&lt;/a&gt; are set on the parent to align its children, while &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/align-self"&gt;&lt;code&gt;align-self&lt;/code&gt;&lt;/a&gt; lets an individual child override that alignment. In Compose, cross axis alignment is controlled by the &lt;code&gt;verticalAlignment&lt;/code&gt; parameter passed to &lt;code&gt;Row&lt;/code&gt;, the &lt;code&gt;horizontalAlignment&lt;/code&gt; parameter passed to &lt;code&gt;Column&lt;/code&gt;, and the &lt;code&gt;align&lt;/code&gt; modifier on their child composables. All three accept start, end, and center values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;AlignmentExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;150.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;primaryVariant&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;verticalAlignment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CenterVertically&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Alignment example" src="https://engineering.zalando.com/posts/2021/03/images/alignment.jpg"&gt;&lt;/p&gt;
&lt;p&gt;You may have noticed that the space around/between/evenly options from &lt;code&gt;justify-content&lt;/code&gt; are not listed for the cross axis. This is because there is no cross axis space around/between alignment in Compose. However, the resulting layout could be achieved via other composable combinations.&lt;/p&gt;
&lt;p&gt;Flexbox also specifies a &lt;code&gt;stretch&lt;/code&gt; option for cross axis alignment. In Compose, the &lt;code&gt;stretch&lt;/code&gt; equivalent would be individual children using the &lt;code&gt;fillMaxSize()&lt;/code&gt;/&lt;code&gt;fillMaxWidth()&lt;/code&gt;/&lt;code&gt;fillMaxHeight()&lt;/code&gt; modifiers.&lt;/p&gt;
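&lt;p&gt;The per-child options can be combined in a single &lt;code&gt;Row&lt;/code&gt;. The following is a minimal sketch with a hypothetical function name, using plain &lt;code&gt;Box&lt;/code&gt; composables in place of the sample app's &lt;code&gt;Child&lt;/code&gt;: the first child follows the parent's alignment, the second overrides it with &lt;code&gt;align()&lt;/code&gt;, and the third approximates &lt;code&gt;stretch&lt;/code&gt; with &lt;code&gt;fillMaxHeight()&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import androidx.compose.foundation.background
import androidx.compose.foundation.layout.Arrangement
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.Row
import androidx.compose.foundation.layout.fillMaxHeight
import androidx.compose.foundation.layout.fillMaxWidth
import androidx.compose.foundation.layout.height
import androidx.compose.foundation.layout.size
import androidx.compose.foundation.layout.width
import androidx.compose.material.MaterialTheme
import androidx.compose.runtime.Composable
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun AlignSelfExample() {
    Row(
        modifier = Modifier.fillMaxWidth().height(150.dp),
        horizontalArrangement = Arrangement.spacedBy(8.dp),
        // Parent-level default, comparable to align-items: flex-start.
        verticalAlignment = Alignment.Top,
    ) {
        val color = MaterialTheme.colors.secondary
        // Follows the Row's verticalAlignment.
        Box(modifier = Modifier.size(48.dp).background(color))
        // Overrides it for this child only, comparable to align-self: flex-end.
        Box(modifier = Modifier.size(48.dp).align(Alignment.Bottom).background(color))
        // Fills the cross axis, comparable to align-self: stretch.
        Box(modifier = Modifier.width(48.dp).fillMaxHeight().background(color))
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
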
&lt;h3&gt;Layout&lt;/h3&gt;
&lt;p&gt;Finally, let's look at a few other attributes that affect a view's size and position.&lt;/p&gt;
&lt;h4&gt;Aspect Ratio&lt;/h4&gt;
&lt;p&gt;Compose's &lt;code&gt;aspectRatio()&lt;/code&gt; modifier works exactly as you'd expect. It takes a float representing the desired ratio and uses that value to determine the size in the unspecified layout direction (width or height).&lt;/p&gt;
&lt;p&gt;For example, specifying &lt;code&gt;fillMaxWidth()&lt;/code&gt; and &lt;code&gt;aspectRatio(16F / 9F)&lt;/code&gt; results in a rectangle that fills the width of the screen with a height corresponding to 9/16 of that width.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;AspectRatioExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bottom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;secondary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aspectRatio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;16F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9F&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;border&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;secondaryVariant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Aspect ratio example" src="https://engineering.zalando.com/posts/2021/03/images/aspect-ratio.jpg"&gt;&lt;/p&gt;
&lt;h4&gt;Padding &amp;amp; Margins&lt;/h4&gt;
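&lt;p&gt;Compose has a &lt;code&gt;padding()&lt;/code&gt; modifier, but none for margins. Because modifiers are applied in order, padding placed before a modifier such as &lt;code&gt;background()&lt;/code&gt; behaves like a margin, while padding placed after it behaves like conventional padding. When nothing is drawn between the two, a margin can simply be treated as extra padding and both can be combined into a single value.&lt;/p&gt;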
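&lt;p&gt;One way to picture the difference between margin and padding in Compose is through modifier order. The following is a minimal sketch with a hypothetical function name and assumed spacing values: the first &lt;code&gt;padding()&lt;/code&gt; sits outside the background and acts like a margin, and the second sits inside it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import androidx.compose.foundation.background
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.padding
import androidx.compose.material.MaterialTheme
import androidx.compose.material.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun MarginAndPaddingExample() {
    Box(
        modifier = Modifier
            // Applied before the background, so it is not colored: acts like a margin.
            .padding(16.dp)
            .background(color = MaterialTheme.colors.primaryVariant)
            // Applied after the background, so it is colored: conventional padding.
            .padding(8.dp)
    ) {
        Text(text = "Content")
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
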
&lt;h4&gt;Absolute Position&lt;/h4&gt;
&lt;p&gt;When absolute positioning is needed to place one composable on top of another, the &lt;a href="https://developer.android.com/reference/kotlin/androidx/compose/foundation/layout/package-summary#box"&gt;&lt;code&gt;Box&lt;/code&gt;&lt;/a&gt; composable can be used. &lt;code&gt;Box&lt;/code&gt; children can use the &lt;code&gt;align()&lt;/code&gt; modifier to specify where they are aligned within the box, including top start/center/end, center start/center/end, and bottom start/center/end.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="kd"&gt;fun&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;AbsolutePositionExample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;240.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;primaryVariant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;align&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TopStart&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;align&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;TopEnd&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;align&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BottomStart&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;align&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BottomEnd&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;align&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Center&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code results in the following UI:
&lt;img alt="Absolute position example" src="https://engineering.zalando.com/posts/2021/03/images/absolute-position.jpg"&gt;&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;In this article, we have seen how much of the layout behavior defined in the flexbox spec has a direct analog in Compose and a few places where we have to do a bit more work to approximate certain concepts. Please see &lt;a href="https://github.com/abdyer/flexbox-compose"&gt;the sample app repo&lt;/a&gt; for the code as well as my first attempt at working with the &lt;a href="https://developer.android.com/jetpack/compose/navigation"&gt;Compose Navigation library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;During our recent Hack Week, we had a chance to spend more time with Compose. We were impressed with how easy it was to get started and managed to build a fairly performant Compose-powered implementation of our home screen. For a beta, it's quite promising!&lt;/p&gt;
&lt;p&gt;Thanks for reading!&lt;/p&gt;</content><category term="Zalando"/><category term="Android"/><category term="Kotlin"/><category term="UI"/><category term="Backend"/><category term="Frontend"/><category term="Mobile"/></entry><entry><title>Micro Frontends: from Fragments to Renderers (Part 1)</title><link href="https://engineering.zalando.com/posts/2021/03/micro-frontends-part1.html" rel="alternate"/><published>2021-03-11T00:00:00+01:00</published><updated>2021-03-11T00:00:00+01:00</updated><author><name>Jan Brockmeyer</name></author><id>tag:engineering.zalando.com,2021-03-11:/posts/2021/03/micro-frontends-part1.html</id><summary type="html">&lt;p&gt;Moving beyond Project Mosaic. Get an insight into the declarative view composition framework that powers new features for Zalando's website.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In 2015, we wanted to improve how we delivered features to customers and move away from a monolithic shop system. &lt;a href="https://www.mosaic9.org/"&gt;Project Mosaic&lt;/a&gt; and its microservices approach for the frontend were vital to support this transition. Mosaic enabled a relatively large number of teams to work on the main Zalando website &lt;a href="https://engineering.zalando.com/posts/2015/10/from-jimmy-to-microservices-rebuilding-zalandos-fashion-store.html"&gt;independently and without performance compromises&lt;/a&gt;. At its core, Mosaic architecture relies on page Fragments, which are owned by different teams.&lt;/p&gt;
&lt;p&gt;Mosaic helped us deliver features quickly and experiment at scale, contributing to Zalando’s growth, but we &lt;a href="https://engineering.zalando.com/posts/2018/12/front-end-micro-services.html"&gt;identified limitations to the Fragments approach&lt;/a&gt;. The main pain points for Zalando at that time were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Differences in tech stacks, bundling, and deployment practices across fragments led to inconsistent user experience and cross-team collaboration difficulties&lt;/li&gt;
&lt;li&gt;A high entry barrier for teams contributing to the customer experience. To be able to add new features to the website, engineers had to&lt;ul&gt;
&lt;li&gt;build and operate their fragments (usually frontend and backend services)&lt;/li&gt;
&lt;li&gt;discover and integrate with all the data sources&lt;/li&gt;
&lt;li&gt;re-implement or adapt the UI&lt;/li&gt;
&lt;li&gt;re-implement or adjust tracking &amp;amp; A/B testing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In 2018, we started designing Interface Framework (IF) to overcome these issues. The key goal of this transition was to build a platform that unifies the tech stack and centralizes the deployment and operation process for various parts of the Zalando website. It would enable a fully personalized customer experience and guarantee overall UX consistency based on a new design language.&lt;/p&gt;
&lt;p&gt;Now, we'd like to give you an update on our approach to frontend development in the form of a blog series. The first part covers the key features of the new framework and provides an overview of its architecture.&lt;/p&gt;
&lt;h2&gt;Why Interface Framework&lt;/h2&gt;
&lt;h3&gt;Consistent Entity Data&lt;/h3&gt;
&lt;p&gt;We identified a reasonably small number of content pieces in use at Zalando that can be visualized or used for personalization purposes, for example a Product, a Collection, or an Outfit. When organized in tree-like structures, they can be used to define the layout and content of the Zalando core user journey pages. When used individually, they can serve as the common language for exchanging data across microservices.&lt;/p&gt;
&lt;p&gt;We call them Entities. Each Entity has a type and a unique id.&lt;/p&gt;
&lt;h3&gt;Dynamic View &amp;amp; Content Composition&lt;/h3&gt;
&lt;p&gt;Interface Framework supports dynamic composition of the user interface. It composes a page by forming a tree of nested Entities and transforming it into a tree of matching Renderers. The mapping of Entities to Renderers is specified in a declarative set of layout rules, which we call rendering rules. A Renderer is responsible for visualizing data related to an Entity.&lt;/p&gt;
&lt;p&gt;Let's assume we are presenting a product page with some slots below the article to show additional content. Our personalization service chooses to provide three pieces of content: a collection, an outfit, and another collection. It determines what content the customers see on the page.&lt;/p&gt;
&lt;p&gt;The Rendering Engine then decides to visualize the first collection as a carousel, the outfit as a card component, and the second collection as another carousel. It is responsible for how the content gets rendered to the customers.&lt;/p&gt;
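&lt;p&gt;To make the relationship between Entities, rendering rules, and Renderers more concrete, here is a minimal TypeScript sketch of the product page example. It is not the Interface Framework implementation: the type names, the rule format, and the Renderer names (&lt;code&gt;ProductView&lt;/code&gt;, &lt;code&gt;CollectionCarousel&lt;/code&gt;, &lt;code&gt;OutfitCard&lt;/code&gt;) are assumptions made for illustration.&lt;/p&gt;

```typescript
// Hypothetical shapes: the post does not show the real Entity or rule format.
interface Entity {
  type: string;
  id: string;
  children?: Entity[];
}

interface RenderNode {
  renderer: string;
  entityId: string;
  children: RenderNode[];
}

// A rendering rule maps an Entity type, optionally within a parent type, to a Renderer.
interface RenderingRule {
  entityType: string;
  parentType?: string;
  renderer: string;
}

const renderingRules: RenderingRule[] = [
  { entityType: "product", renderer: "ProductView" },
  { entityType: "collection", parentType: "product", renderer: "CollectionCarousel" },
  { entityType: "outfit", parentType: "product", renderer: "OutfitCard" },
];

function selectRenderer(entity: Entity, parentType?: string): string {
  const candidates = renderingRules.filter((rule) => rule.entityType === entity.type);
  // Prefer a rule written for this parent, then fall back to a context-free rule.
  const match =
    candidates.find((rule) => rule.parentType === parentType) ??
    candidates.find((rule) => rule.parentType === undefined);
  if (match === undefined) {
    throw new Error("No rendering rule for entity type " + entity.type);
  }
  return match.renderer;
}

// Transform the tree of Entities into a tree of matching Renderers.
function toRenderTree(entity: Entity, parentType?: string): RenderNode {
  return {
    renderer: selectRenderer(entity, parentType),
    entityId: entity.id,
    children: (entity.children ?? []).map((child) => toRenderTree(child, entity.type)),
  };
}

// The product page from the example: a collection, an outfit, and another collection.
const productPage: Entity = {
  type: "product",
  id: "p1",
  children: [
    { type: "collection", id: "c1" },
    { type: "outfit", id: "o1" },
    { type: "collection", id: "c2" },
  ],
};

console.log(JSON.stringify(toRenderTree(productPage), null, 2));
```

&lt;p&gt;In this sketch the personalization side only decides which Entities appear in the tree, while the rules alone decide which Renderer visualizes each of them, which mirrors the split between "what" and "how" described above.&lt;/p&gt;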
&lt;h3&gt;Integrated Monitoring&lt;/h3&gt;
&lt;p&gt;Interface Framework automatically connects all views to the internal monitoring tools, ensuring that only the unified, user consent compliant, and thoroughly tested implementation is used. It helps to prevent incidents and disruptions in business reporting and personalization.&lt;/p&gt;
&lt;h3&gt;Orchestrated A/B Testing&lt;/h3&gt;
&lt;p&gt;A/B tests can now run in an orchestrated way to compare the results and make informed choices. This ensures features are tested with a representative user base, using standardized A/B testing scenarios and KPIs to ease comparison between features. Defining and setting up global A/B tests also means reducing the overhead of doing it for every page.&lt;/p&gt;
&lt;p&gt;The integration of &lt;a href="https://engineering.zalando.com/posts/2021/01/experimentation-platform-part1.html"&gt;Zalando’s A/B testing platform&lt;/a&gt; in IF allows us to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement experiments with only a few lines of code, and get the implementation automatically validated&lt;/li&gt;
&lt;li&gt;Track experiments automatically without additional effort to analyze the results&lt;/li&gt;
&lt;li&gt;Continue managing experiments via the intuitive A/B testing platform UI&lt;/li&gt;
&lt;li&gt;Keep experiment latency overhead low by batching all requests to the A/B testing platform for all Renderers&lt;/li&gt;
&lt;/ul&gt;
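&lt;p&gt;The batching idea from the last point can be sketched as follows. This is an illustrative assumption rather than the actual integration: the function names, the experiment names, and the shape of the variant response are hypothetical, and a real implementation would call the A/B testing platform asynchronously over the network.&lt;/p&gt;

```typescript
// Hypothetical shapes; the real A/B testing platform API is not described in the post.
type VariantMap = { [experiment: string]: string };
type FetchVariants = (experiments: string[]) => VariantMap;

function createExperimentBatcher(fetchVariants: FetchVariants) {
  let pending: string[] = [];
  return {
    // Called by each Renderer while the page tree is being resolved.
    register(experiment: string): void {
      if (pending.indexOf(experiment) === -1) {
        pending.push(experiment);
      }
    },
    // One request for the whole page instead of one request per Renderer.
    flush(): VariantMap {
      const variants = fetchVariants(pending);
      pending = [];
      return variants;
    },
  };
}

// Usage with a stand-in for the platform call that assigns everyone to "control".
let platformCalls = 0;
const batcher = createExperimentBatcher((experiments) => {
  platformCalls += 1;
  const assigned: VariantMap = {};
  for (const name of experiments) {
    assigned[name] = "control";
  }
  return assigned;
});

batcher.register("new-carousel");
batcher.register("outfit-card-v2");
batcher.register("new-carousel"); // a second Renderer asking for the same experiment

const pageVariants = batcher.flush();
console.log(platformCalls, Object.keys(pageVariants));
```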
&lt;h3&gt;Integrated Testing for Developers&lt;/h3&gt;
&lt;p&gt;As Interface Framework provides a single integration point where all code is developed and deployed, we give developers access to deployment previews, which allow any open pull request to be previewed in an environment very close to production. This setup is different from the traditional staging approach. The preview deployment is connected to production endpoints and follows 100% production routing while ensuring that only authenticated developers can access it.&lt;/p&gt;
&lt;h3&gt;Consistent UX Design&lt;/h3&gt;
&lt;p&gt;For all pages running on Interface Framework, the look &amp; feel, accessibility features, and actual components used are defined by a design system. Our server-side rendering framework, which we call Rendering Engine, takes over the complexity of component version management and optimizes client code bundle size.&lt;/p&gt;
&lt;h3&gt;Page Performance Quality Gates&lt;/h3&gt;
&lt;p&gt;We evaluated best practices from CI/CD pipelines for Fragments from various teams and combined them to measure the performance of pages served by Interface Framework. We support the following tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/GoogleChrome/lighthouse-ci"&gt;Lighthouse CI&lt;/a&gt;:&lt;/strong&gt; a tool to automatically run performance and accessibility tests for specific pages. It validates the results against assertions and decides whether the current score is good enough for production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bundle Size Limits:&lt;/strong&gt; we have a tool to automatically compute and check bundle sizes for Renderers on every pull request. It shows the results for all Renderers that have changed with the current version being released.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client Metrics:&lt;/strong&gt; we provide a built-in layer to report &lt;a href="https://web.dev/vitals/"&gt;Web Vitals&lt;/a&gt; and custom metrics to capture all Zalando pages’ user experience.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Increased Organizational Speed and Efficiency&lt;/h3&gt;
&lt;p&gt;We are still organized around feature teams with embedded frontend engineers. The main difference is that they now work in a monolithic repository that provides a unified and automated environment and offers new joiners quick onboarding. The teams develop features and UI elements within Renderers. These Renderers are associated with Entities that make up our new page semantics.&lt;/p&gt;
&lt;p&gt;There is quite a cultural shift as some ownership lines are now blurred in Renderers, with multiple teams contributing to most of them. As a result, we now have a much more collaborative development environment where teams benefit from each other's best practices. A centralized repository also means it is easier to ship large project changes and contribute to other teams' code, supported by a set of contribution guidelines.&lt;/p&gt;
&lt;p&gt;We now have an aligned set of modern frontend technologies (React, TypeScript, GraphQL), a centralized server infrastructure, a release process, and a robust set of monitoring capabilities with dashboards and alerts. We are more efficient in terms of operations, and new reliability patterns immediately impact the whole website.&lt;/p&gt;
&lt;h2&gt;Architecture Overview&lt;/h2&gt;
&lt;p&gt;The following chart gives an overview of the underlying architecture. It contains all the core components of Interface Framework.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Interface Framework's Architecture" src="https://engineering.zalando.com/posts/2021/03/images/architecture_if.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html"&gt;&lt;strong&gt;GraphQL API&lt;/strong&gt;&lt;/a&gt; is a data aggregation layer. It is to become the primary data source for all web pages at Zalando and reduce data integration costs across many teams. It provides a unified way for accessing content as an output of personalization services like the Recommendation System.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Rendering Engine&lt;/strong&gt; is a backend service and client-side runtime running in Node.js and the browser. Its primary purpose is to resolve and render a tree of Entities for a given request. The Recommendation System controls the structure of this tree.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Renderer&lt;/strong&gt; is a self-contained, reusable piece of code that runs inside the Rendering Engine. It declaratively specifies all of its data dependencies via GraphQL and uses the Zalando Design System to represent a single Entity visually.&lt;/p&gt;
&lt;p&gt;The mapping of Entities to Renderers is one-to-many since different visual representations are possible for an Entity. An outfit Entity, for example, can be represented as a main view or a card component within a collection. Each Renderer, on the other hand, corresponds to one specific Entity type.&lt;/p&gt;
&lt;p&gt;We support a hybrid approach with Interface Framework. The Rendering Engine can serve views in different configurations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;View is a Mosaic Template and only uses Fragments.&lt;/li&gt;
&lt;li&gt;View contains both Renderers and Fragments.&lt;/li&gt;
&lt;li&gt;View only consists of Renderers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This support for both rendering modes was and is still very beneficial for teams migrating their page from Mosaic to IF. Currently, we serve around 90% of traffic via Interface Framework.&lt;/p&gt;
&lt;h2&gt;Future Posts&lt;/h2&gt;
&lt;p&gt;In upcoming posts, we will dive deeper into the framework’s core internals and provide an overview of the latest upgrades.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2021/09/micro-frontends-part2.html"&gt;Part 2: Deep Dive into Rendering Engine&lt;/a&gt; (2021/09)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.zalando.com/posts/2023/07/rendering-engine-tales-road-to-concurrent-react.html"&gt;Rendering Engine Tales: Road to Concurrent React&lt;/a&gt; (2023/07)&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="Frontend"/><category term="Microservices"/><category term="GraphQL"/><category term="Backend"/></entry><entry><title>How we use GraphQL at Europe's largest fashion e-commerce company</title><link href="https://engineering.zalando.com/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html" rel="alternate"/><published>2021-03-04T00:00:00+01:00</published><updated>2021-03-04T00:00:00+01:00</updated><author><name>Aditya Pratap Singh</name></author><id>tag:engineering.zalando.com,2021-03-04:/posts/2021/03/how-we-use-graphql-at-europes-largest-fashion-e-commerce-company.html</id><summary type="html">&lt;p&gt;Managing consistent and backwards-compatible APIs for Web and mobile App frontends is always a complex task in the long-term. At Zalando, we have used GraphQL to solve some of the common problems of frontend data requirements while gaining speed of delivery in a large and quickly growing organisation. This article is about &lt;strong&gt;GraphQL as Unified-Backend-For-Frontend (UBFF)&lt;/strong&gt; application and first in a series of posts about problems we solved with our use of GraphQL at Zalando.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="GraphQL logo" src="https://engineering.zalando.com/posts/2021/03/images/graphql.png#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Today's large-scale organizations leveraging microservice architecture face a plethora of problems at the data aggregation and presentation layers. Managing consistent and backwards-compatible APIs for Web and Mobile App frontends is definitely one of the more complex ones. Balancing frontend developers' need for a consistent data source against product managers' need to deliver new features quickly in a fast-paced, large organization is a tough nut to crack. It is very common for frontend developers to struggle to find the right backend service to deliver a given feature.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://samnewman.io/patterns/architectural/bff/"&gt;Backend-for-frontend (BFF)&lt;/a&gt; concept is a pattern pioneered by SoundCloud wherein a backend application is created for every business and frontend specific use case. With our adoption of microservices at Zalando in 2015, we used this pattern to create a large number of BFFs for the Web product details page, Web wishlist page, Mobile app wishlist view, Mobile app home view and so on. The BFF is very similar to Netflix’s approach of &lt;a href="https://netflixtechblog.com/embracing-the-differences-inside-the-netflix-api-redesign-15fd8b3dc49d"&gt;Embracing the Differences&lt;/a&gt;, which pointed out 4 key characteristics for APIs serving frontend applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Embrace the differences of the devices&lt;/li&gt;
&lt;li&gt;Separate content gathering from content formatting/delivery&lt;/li&gt;
&lt;li&gt;Redefine the border between “Client” and “Server”&lt;/li&gt;
&lt;li&gt;Distribute innovation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these two approaches addressed most of these concerns of frontend development, they also introduced other issues for a large organisation like Zalando:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Lack of optimal balance between fast feature delivery and developer experience&lt;/li&gt;
&lt;li&gt;Duplication of efforts due to the large number of Backend-for-Frontend microservices&lt;/li&gt;
&lt;li&gt;Inconsistent experience for Zalando customers across platforms&lt;/li&gt;
&lt;li&gt;Fragmented handling of &lt;code&gt;Security&lt;/code&gt; and &lt;code&gt;Authentication&lt;/code&gt; concerns&lt;/li&gt;
&lt;li&gt;Fragmented &lt;code&gt;Observability&lt;/code&gt; implementations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Of the above problems, &lt;em&gt;Inconsistent experience for Zalando customers across platforms&lt;/em&gt; is a subtle one. It becomes evident when the same business logic and aggregation are implemented in different ways in multiple backends, leading to broken customer experiences. This is a classic example of &lt;a href="https://en.wikipedia.org/wiki/Conway's_law"&gt;Conway's law&lt;/a&gt;: the system mirrors the organization's structure and, in doing so, can ignore the user's point of view, so that users get different experiences from different frontend applications of the same organization.&lt;/p&gt;
&lt;p&gt;The diagram below shows the inconsistency issue that is not uncommon across different user interfaces for the same application if served via multiple backends. In the mobile app the delivery date range for an article on Zalando is &lt;strong&gt;5-9 Feb&lt;/strong&gt; whereas in the desktop version it’s &lt;strong&gt;1-3 Feb&lt;/strong&gt;. Even though this particular example is hypothetical, we have seen such inconsistent data bugs at Zalando in the past due to the different BFFs having fragmented logic across different services.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Inconsistent data across desktop and mobile" src="https://engineering.zalando.com/posts/2021/03/images/inconsistent-data.png"&gt;&lt;/p&gt;
&lt;p&gt;All of the above problems become much harder at large scale. We observed this at Zalando as well and used our &lt;em&gt;Unified Backend-For-Frontend&lt;/em&gt; graph of &lt;code&gt;Entities&lt;/code&gt; approach to address most of these concerns.&lt;/p&gt;
&lt;h2&gt;Our setup&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;GraphQL&lt;/code&gt; is a query language developed by Facebook to enable declarative data fetching. The users of the API declaratively specify the shape of the data requirement via the query and response structure they expect.&lt;/p&gt;
&lt;p&gt;For example, in order to fetch the name of the example product mentioned above you can query it as:&lt;/p&gt;
&lt;p&gt;&lt;img alt="graphql query" src="https://engineering.zalando.com/posts/2021/03/images/graphql-query.png"&gt;&lt;/p&gt;
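&lt;p&gt;For readers who prefer text over a screenshot, a query of this kind could look roughly like the TypeScript sketch below. The field names &lt;code&gt;product&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, and &lt;code&gt;name&lt;/code&gt;, as well as the helper functions, are assumptions for illustration and are not taken from the Zalando schema. The point is that the response mirrors the shape of the query.&lt;/p&gt;

```typescript
// Hypothetical schema: the fields `product`, `id` and `name` are assumptions.
const productNameQuery = `
  query ProductName($id: ID!) {
    product(id: $id) {
      name
    }
  }
`;

// Standard GraphQL-over-HTTP request body: a query string plus its variables.
interface GraphQLRequest {
  query: string;
  variables: { [name: string]: unknown };
}

function buildProductNameRequest(id: string): GraphQLRequest {
  return { query: productNameQuery, variables: { id } };
}

// The response has the same hierarchy as the query: data, then product, then name.
interface ProductNameResponse {
  data: { product: { name: string } };
}

function readProductName(response: ProductNameResponse): string {
  return response.data.product.name;
}

// Usage with a hand-written response standing in for the server.
const exampleRequest = buildProductNameRequest("example-product-id");
const exampleResponse: ProductNameResponse = {
  data: { product: { name: "Example sneaker" } },
};
console.log(JSON.stringify(exampleRequest.variables), readProductName(exampleResponse));
```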
&lt;p&gt;From the &lt;a href="https://spec.graphql.org/June2018/#sec-Overview"&gt;GraphQL specification design principles&lt;/a&gt;, GraphQL was created with business requirements and hierarchical views in modern applications in mind:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hierarchical&lt;/strong&gt;: GraphQL specification recommends the language to be structured in hierarchy to be well suited for Hierarchical Views in modern frontend applications&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Product-centric&lt;/strong&gt;: The evolution of a GraphQL schema is directly influenced by the product/business features being developed by frontend engineers&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;These are the two main principles we have kept in mind at Zalando while building a single &lt;strong&gt;GraphQL API&lt;/strong&gt; as a &lt;strong&gt;Unified Backend-For-Frontend (UBFF)&lt;/strong&gt; for all Web and mobile App frontend feature teams. We use a monorepo with shared ownership across 12+ domain teams, guided by a set of contribution principles. This is similar to the &lt;strong&gt;one unified graph&lt;/strong&gt; concept highlighted in &lt;a href="https://principledgraphql.com/integrity#1-one-graph"&gt;Principled GraphQL&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We use an &lt;code&gt;Entity&lt;/code&gt; system where entities are the first-class citizens in the graph with our custom implementation of GraphQL specification (&lt;a href="https://github.com/zalando-incubator/graphql-jit"&gt;graphql-jit&lt;/a&gt;) for performance optimization. The Entities themselves represent content and domain models spread across the Zalando shop e.g. &lt;code&gt;Product&lt;/code&gt;, &lt;code&gt;Campaign&lt;/code&gt; (elaborating the Entity model will be its own post in the series). The overall application data flow looks like this.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Architecture and data flow across desktop and mobile" src="https://engineering.zalando.com/posts/2021/03/images/architecture.png"&gt;&lt;/p&gt;
&lt;p&gt;We started with the GraphQL solution at Zalando in the first half of 2018 and have had the service in production since the end of 2018. The unified GraphQL schema has grown significantly in the last 2 years to a dense graph now with more than 12 domains and serves more than 80% of Web and 50% of the App use cases (as of February 2021).&lt;/p&gt;
&lt;h2&gt;Advantages&lt;/h2&gt;
&lt;p&gt;With our implementation of GraphQL running in production for the last 2 years at Zalando, we addressed most of the aforementioned concerns and observed multiple advantages including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved efficiency for developers to find and access data in one place as opposed to finding and integrating with the individual APIs.&lt;/li&gt;
&lt;li&gt;Improved developer experience via GraphQL tools such as explorer with live assortment data.&lt;/li&gt;
&lt;li&gt;Faster deployments leading to shipping features faster, leading to happy product managers.&lt;/li&gt;
&lt;li&gt;Consistent customer experience across platforms with a single consistent data source for frontends.&lt;/li&gt;
&lt;li&gt;Reduced duplication of effort to develop the same feature across platforms.&lt;/li&gt;
&lt;li&gt;Easy to enforce governance and organisational best practices.&lt;/li&gt;
&lt;li&gt;The GraphQL layer has a "No Business Logic" principle, which allows domain specific backend APIs to steer domain or platform (Web vs. App) specific content on their own.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Known concerns and challenges&lt;/h2&gt;
&lt;h3&gt;&lt;a href="https://samnewman.io/patterns/architectural/bff/#reuse"&gt;Code reuse leading to bloated code base&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Our approach with GraphQL has been to avoid any platform or domain specific logic in the GraphQL layer and instead let the domain specific teams drive this via presentation layer backend services. This allows us to keep a business logic agnostic data-aggregation layer which serves frontend developers and also helps in operational maintenance.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Presentation layer ensuring business logic agnostic graph" src="https://engineering.zalando.com/posts/2021/03/images/responsibilities-architecture.png"&gt;&lt;/p&gt;
&lt;h3&gt;Adoption and learning curve&lt;/h3&gt;
&lt;p&gt;Given GraphQL was a new technology for our teams, it involved investment in terms of learning curve and adoption. We addressed the adoption using some common mechanisms:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;One-stop-shop Documentation&lt;/strong&gt;: We use a single &lt;a href="https://documentation.divio.com/"&gt;structured documentation&lt;/a&gt; with an embedded GraphQL editor, schema documentation, Voyager for schema exploration, and practice exercises to help our new users adopt GraphQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support chat&lt;/strong&gt;: Just like any platform team, we also provide a support channel for any queries from users and contributors of the GraphQL service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trainings&lt;/strong&gt;: Given that GraphQL is new at Zalando, we conducted GraphQL adoption training with 150+ developers participating to learn about using GraphQL at Zalando. The training had a broad impact on a large population of developers intending to switch to GraphQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consultation&lt;/strong&gt;: GraphQL schema design is always a tricky topic even for frontend developers who can use GraphQL. In order to ensure a single, dense, unified graph, our team also provided consultation for all new domains being integrated into the Unified graph.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These four measures have resulted in increasing the number of contributors to our monorepo from 50 to 150+ in 2020 and developers using GraphQL for feature development from 70 to 200.&lt;/p&gt;
&lt;h3&gt;&lt;a href="http://www.designsmells.com/articles/does-your-architecture-smell/"&gt;God Component&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A God Component is a design smell in which a component is excessively large, either in terms of LOC or number of classes. We have a monorepo for the unified GraphQL service, which makes it a potential architectural and operational risk. We address the architectural risk through a shared ownership mechanism at Zalando, guided by a set of contribution principles. For the operational risk, we observe and address most issues with Reliability Patterns such as &lt;code&gt;Circuit breakers&lt;/code&gt;, &lt;code&gt;Timeouts&lt;/code&gt; and &lt;code&gt;Retry&lt;/code&gt; patterns. We also introduced the &lt;a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead"&gt;Bulkhead pattern&lt;/a&gt; to provide more fault tolerance and isolation by deploying the application to serve traffic per platform (separate deployments for Web and mobile Apps).&lt;/p&gt;
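&lt;p&gt;As a rough illustration of the circuit breaker pattern mentioned above, here is a minimal TypeScript sketch. It is not the implementation used in our service: the function name, the thresholds, and the synchronous signature are simplifying assumptions, and a production version would wrap asynchronous backend calls and usually combine the breaker with timeouts and retries.&lt;/p&gt;

```typescript
// Minimal circuit breaker sketch; names and thresholds are illustrative assumptions.
type Clock = () => number;

function createCircuitBreaker(
  failureThreshold: number,
  resetAfterMs: number,
  now: Clock = Date.now,
) {
  let failures = 0;
  let openedAt: number | null = null;

  return function call(operation: () => string, fallback: string): string {
    if (openedAt !== null) {
      // Open: skip the backend entirely until the reset window has elapsed.
      if (resetAfterMs > now() - openedAt) {
        return fallback;
      }
      // Half-open: allow one trial call; a single failure re-opens the circuit.
      openedAt = null;
      failures = failureThreshold - 1;
    }
    try {
      const result = operation();
      failures = 0; // a success closes the circuit again
      return result;
    } catch {
      failures += 1;
      if (failures >= failureThreshold) {
        openedAt = now();
      }
      return fallback;
    }
  };
}

// Usage with a fake clock and a backend that is down.
let fakeTime = 0;
let backendAttempts = 0;
const guardedCall = createCircuitBreaker(2, 1000, () => fakeTime);
const failingBackend = (): string => {
  backendAttempts += 1;
  throw new Error("backend unavailable");
};

guardedCall(failingBackend, "cached value"); // first failure
guardedCall(failingBackend, "cached value"); // second failure opens the circuit
guardedCall(failingBackend, "cached value"); // short-circuited, backend not called

fakeTime = 1000; // the reset window has elapsed
const recovered = guardedCall(() => "fresh value", "cached value");
console.log(backendAttempts, recovered);
```

&lt;p&gt;While the circuit is open, callers get the fallback immediately instead of waiting on a failing dependency, which keeps one unhealthy backend from slowing down the whole graph.&lt;/p&gt;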
&lt;h2&gt;Related work on Unified GraphQL&lt;/h2&gt;
&lt;p&gt;Unified Graph is a known concept which is being adopted by a lot of large organisations. Below is a list of some of the large organisations using unified GraphQL in production:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;GitHub has a &lt;a href="https://docs.github.com/en/graphql"&gt;GraphQL implementation with a single graph&lt;/a&gt; of all the domains including repos, users, marketplace etc. in it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Shopify has a single GraphQL implementation for each of its &lt;a href="https://shopify.dev/docs/storefront-api/reference"&gt;StoreFront&lt;/a&gt; (customer facing) and &lt;a href="https://shopify.dev/docs/admin-api/graphql/reference"&gt;Admin&lt;/a&gt; (merchant facing) APIs, allowing customers and partners to build experiences using the unified graph of each.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Airbnb has been working on a unified GraphQL schema, which they presented in their &lt;a href="https://www.youtube.com/embed/pywcFELoU8E"&gt;GraphQL Summit 2019 talk&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Expedia moved from a REST specific service to a &lt;a href="https://www.apollographql.com/customers/expediagroup/"&gt;Central data graph using GraphQL&lt;/a&gt; to solve their problems of using REST endpoints where developers were spending more time to figure out which service to call than to develop features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://www.apollographql.com/docs/apollo-server/federation/introduction/"&gt;Apollo Federation&lt;/a&gt; is Apollo's solution for providing single data Graph over multiple Graphs across an organization. The difference between the Unified Graph we have at Zalando and Apollo's federation is that instead of having multiple Graphs connected via a library and gateway we have a single service at Zalando which connects all the domains in a single schema Graph. This has tradeoffs which we have addressed as mentioned &lt;a href="#god-component"&gt;here&lt;/a&gt;, since we gain by keeping a single Graph in terms of tooling, deployment and governance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Netflix also has its own version of one-graph that they use in the Netflix Studio ecosystem and elaborated the setup in &lt;a href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2"&gt;this blog post series&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion and next steps&lt;/h2&gt;
&lt;p&gt;The Unified Backend-For-Frontend (UBFF) GraphQL is not a silver bullet, but it is a tradeoff which has worked well for our frontend data fetching problems at Zalando. In the next few articles in this series we will cover other aspects of our usage of GraphQL at Zalando in the context of &lt;em&gt;Observability&lt;/em&gt;, &lt;em&gt;Performance Optimization&lt;/em&gt;, &lt;em&gt;Security&lt;/em&gt;, &lt;em&gt;Tooling&lt;/em&gt;, &lt;em&gt;Errors&lt;/em&gt; etc., which allowed us to scale the adoption of the service to 200+ Web and App developers and serve the use cases of 25-30 feature teams.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://samnewman.io/patterns/architectural/bff/"&gt;Backend For Frontend Pattern by Sam Newman&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/embracing-the-differences-inside-the-netflix-api-redesign-15fd8b3dc49d"&gt;Netflix API redesign&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://spec.graphql.org/draft"&gt;GraphQL spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/bliki/CircuitBreaker.html"&gt;Circuit Breaker pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead"&gt;Bulkhead pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2"&gt;Netflix GraphQL Federation approach&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Zalando"/><category term="GraphQL"/><category term="APIs"/><category term="Backend"/></entry><entry><title>Building an End to End load test automation system on top of Kubernetes</title><link href="https://engineering.zalando.com/posts/2021/03/building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes.html" rel="alternate"/><published>2021-03-02T00:00:00+01:00</published><updated>2021-03-02T00:00:00+01:00</updated><author><name>Amila Kumaranayaka</name></author><id>tag:engineering.zalando.com,2021-03-02:/posts/2021/03/building-an-end-to-end-load-test-automation-system-on-top-of-kubernetes.html</id><summary type="html">&lt;p&gt;Learn how we built an end-to-end load test automation system to make load tests a routine task.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;At Zalando we continuously invent new ways for customers to interact with fashion. In order to provide an excellent customer experience, we must ensure our systems can technically handle high traffic events such as Cyber Week or other sales campaigns. We have published a &lt;a href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html"&gt;detailed article&lt;/a&gt; on how Zalando prepares for Cyber Week. Checkout- and payments-related systems are particularly important during sales events. As we continuously evolve our systems and add new features to optimize the customer experience, it is cumbersome and expensive to manually test our systems' capability to handle high traffic.&lt;/p&gt;
&lt;p&gt;Our department is responsible for the payment processing systems of Zalando, which must maintain high availability and reliability. To achieve high stability in these systems, we set out to build an automated end-to-end load testing system capable of simulating real user behaviour across the whole system of microservices. This testing system automatically steers generated traffic based on a dynamically adjusted orders-per-minute configuration. In order to really push our services to the edge, we wanted to run the load testing system in our test cluster, as this enables us to break things when necessary without causing customer impact. These tests can then be conveniently managed and triggered by our team and serve as the first quality gate of the Payment system.
As part of the Cyber Week preparation, we formed a dedicated project team tasked with making our vision come to life.&lt;/p&gt;
&lt;p&gt;To summarize, we wanted to build a load testing tool with the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic load test execution based on a schedule.&lt;/li&gt;
&lt;li&gt;Simple API through which developers can manually trigger a load test.&lt;/li&gt;
&lt;li&gt;A load test tool that runs in our test environment, scales our Kubernetes services and Amazon ECS&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; (Elastic Container Service) environment up to our production configuration, and then executes load tests.&lt;/li&gt;
&lt;li&gt;Automated alarms if a load test causes SLO (&lt;a href="https://sre.google/sre-book/service-level-objectives/"&gt;Service Level Objective&lt;/a&gt;) breaches.&lt;/li&gt;
&lt;li&gt;The generated load test traffic must imitate our customers' checkout flow.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The diagram below illustrates how the testing system (NodePool A) and our Payment platform (NodePool B and ECS) are deployed:
&lt;img alt="Load Test Flow" src="https://engineering.zalando.com/posts/2021/03/images/loadtestconductor-flow.png"&gt;&lt;/p&gt;
&lt;h2&gt;Traffic generation&lt;/h2&gt;
&lt;p&gt;Our first step was to select a load testing framework. We considered multiple options such as Locust, Vegeta and JMeter. This was filtered down to &lt;a href="https://locust.io/"&gt;Locust&lt;/a&gt; and &lt;a href="https://github.com/tsenart/vegeta"&gt;Vegeta&lt;/a&gt; due to &lt;a href="https://jmeter.apache.org/"&gt;JMeter&lt;/a&gt; not being popular internally. We chose Locust as it was more popular within our development teams, thus the test suite would be easier to maintain. We have also &lt;a href="https://engineering.zalando.com/posts/2019/04/end-to-end-load-testing-zalandos-production-website.html"&gt;blogged before&lt;/a&gt; on how we leveraged Locust in prior preparations for sales events.&lt;/p&gt;
&lt;p&gt;Locust works in both standalone and distributed mode. In distributed mode, it operates a controller with multiple workers. To generate higher loads, a distributed setup is required to overcome the resource limitations of a single machine. We created Locust scripts covering multiple business processes, mimicking real-world traffic patterns to our services. These scripts were then packaged as a Docker container and deployed as a distributed Locust system.&lt;/p&gt;
&lt;h2&gt;Mock External Dependencies&lt;/h2&gt;
&lt;p&gt;When we defined the scope of the load tests we all agreed we would only focus on testing internal service components and did not want to involve external dependencies for routine tests. Therefore we decided to mock these dependencies.&lt;/p&gt;
&lt;p&gt;The table below compares a variety of tools that can be used to implement mocks.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Mobtest&lt;/th&gt;
&lt;th style="text-align: center;"&gt;WireMock&lt;/th&gt;
&lt;th style="text-align: center;"&gt;MockServer&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Mockoon&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Hoverfly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Javascript&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Java&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Java&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Javascript&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Golang&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub stars/forks&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1289/173&lt;/td&gt;
&lt;td style="text-align: center;"&gt;3453/934&lt;/td&gt;
&lt;td style="text-align: center;"&gt;2280/616&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1402/63&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1468/131&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config (API, route, ...)&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Json config&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Json&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Js config&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Js config&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency simulation&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Fixed&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Fixed / Random&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Fixed&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Fixed&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Fixed / Random&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fault simulation&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateful behaviour&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;State machine&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;key-value map&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Easy to extend&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxying&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response templating&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request matching&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Record &amp;amp; Replay&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;td style="text-align: center;"&gt;No&lt;/td&gt;
&lt;td style="text-align: center;"&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;After evaluating multiple options we settled on using &lt;a href="https://github.com/SpectoLabs/hoverfly"&gt;Hoverfly&lt;/a&gt; as the mocking solution. Hoverfly provides the ability to easily set up mocks with static or dynamic responses. Mocks were created and deployed for multiple external dependencies. Furthermore, we wanted to run the load tests against services that could at the same time be used for other tests. This meant that a service needed to switch dynamically between the real dependency and its mock. For this, we leveraged header-based routing using &lt;a href="https://opensource.zalando.com/skipper/"&gt;Skipper&lt;/a&gt;, so a service can decide whether to use the mock or the actual dependency by examining whether the request belongs to a load test.&lt;/p&gt;
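&lt;p&gt;The routing decision described above can be illustrated with a minimal Python sketch. It is not our Skipper configuration: the header name, both URLs and the function name are assumptions chosen for illustration only.&lt;/p&gt;

```python
# Hypothetical values: header name and URLs are illustrative assumptions,
# not the real routing configuration.
LOAD_TEST_HEADER = "Load-Test"
REAL_DEPENDENCY_URL = "https://payment-provider.example.org"
MOCK_DEPENDENCY_URL = "http://hoverfly-mock.example.internal:8500"


def select_upstream(headers: dict[str, str]) -> str:
    """Return the mock URL for load test requests, the real URL otherwise."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    flag = normalised.get(LOAD_TEST_HEADER.lower(), "")
    is_load_test = flag.strip().lower() == "true"
    return MOCK_DEPENDENCY_URL if is_load_test else REAL_DEPENDENCY_URL
```

&lt;p&gt;Keeping the decision per request is what allows regular test traffic and load test traffic to share the same deployed service.&lt;/p&gt;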
&lt;p&gt;Hoverfly example mocking a service with a PATCH endpoint:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="x"&gt;{&lt;/span&gt;
&lt;span class="x"&gt;    &amp;quot;data&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;        &amp;quot;pairs&amp;quot;: [&lt;/span&gt;
&lt;span class="x"&gt;            {&lt;/span&gt;
&lt;span class="x"&gt;                &amp;quot;request&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;path&amp;quot;: [{&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;matcher&amp;quot;: &amp;quot;exact&amp;quot;,&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;value&amp;quot;: &amp;quot;/test&amp;quot;&lt;/span&gt;
&lt;span class="x"&gt;                    }],&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;method&amp;quot;: [{&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;matcher&amp;quot;: &amp;quot;exact&amp;quot;,&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;value&amp;quot;: &amp;quot;PATCH&amp;quot;&lt;/span&gt;
&lt;span class="x"&gt;                    }]&lt;/span&gt;
&lt;span class="x"&gt;                },&lt;/span&gt;
&lt;span class="x"&gt;                &amp;quot;response&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;status&amp;quot;: 204,&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;body&amp;quot;: &amp;quot;&amp;quot;,&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;encodedBody&amp;quot;: false,&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;headers&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;Date&amp;quot;: [&lt;/span&gt;
&lt;span class="x"&gt;                            &amp;quot;&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;currentDateTime&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mon, 02 Jan 2006 15:04:05 GMT&amp;#39;&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="x"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="x"&gt;                        ],&lt;/span&gt;
&lt;span class="x"&gt;                        &amp;quot;Load-Test&amp;quot;: [&lt;/span&gt;
&lt;span class="x"&gt;                            &amp;quot;true&amp;quot;&lt;/span&gt;
&lt;span class="x"&gt;                        ]&lt;/span&gt;
&lt;span class="x"&gt;                    },&lt;/span&gt;
&lt;span class="x"&gt;                    &amp;quot;templated&amp;quot;: true&lt;/span&gt;
&lt;span class="x"&gt;                }&lt;/span&gt;
&lt;span class="x"&gt;            }&lt;/span&gt;
&lt;span class="x"&gt;        ],&lt;/span&gt;
&lt;span class="x"&gt;        &amp;quot;globalActions&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;            &amp;quot;delays&amp;quot;: []&lt;/span&gt;
&lt;span class="x"&gt;        }&lt;/span&gt;
&lt;span class="x"&gt;    },&lt;/span&gt;
&lt;span class="x"&gt;    &amp;quot;meta&amp;quot;: {&lt;/span&gt;
&lt;span class="x"&gt;        &amp;quot;schemaVersion&amp;quot;: &amp;quot;v5&amp;quot;,&lt;/span&gt;
&lt;span class="x"&gt;        &amp;quot;hoverflyVersion&amp;quot;: &amp;quot;v1.1.2&amp;quot;,&lt;/span&gt;
&lt;span class="x"&gt;        &amp;quot;timeExported&amp;quot;: &amp;quot;2020-01-07T13:21:02+02:00&amp;quot;&lt;/span&gt;
&lt;span class="x"&gt;    }&lt;/span&gt;
&lt;span class="x"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To start Hoverfly using this configuration, one can simply run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;hoverfly&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;webserver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;simulation.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Load Test Conductor&lt;/h2&gt;
&lt;p&gt;In order to meet our goal of running automated load tests in the test cluster, we needed to design a system that could manage the full lifecycle of a load test and ensure the cluster and deployed applications match our production configuration. To that end, applications in the load test environment are updated to match the resource allocation, number of instances and application version of the production environment.&lt;/p&gt;
&lt;h3&gt;Load test lifecycle&lt;/h3&gt;
&lt;p&gt;We defined the lifecycle of one load test as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Deploy all applications in the test environment to be the same version as production.&lt;/li&gt;
&lt;li&gt;Scale up the applications in the test environment to meet the resource configuration of the production environment.&lt;/li&gt;
&lt;li&gt;Generate load test traffic that replicates real customer behaviour.&lt;/li&gt;
&lt;li&gt;Scale down applications in the test environment after the test as a cost saving measure.&lt;/li&gt;
&lt;li&gt;Clean up databases and remove unnecessary test data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For this purpose, we built a microservice in Golang called the load-test-conductor that executes and manages these load test phases and transitions. Our service design was heavily influenced by what Kubernetes popularized for infrastructure management: we wanted our system to be declarative. Therefore, the service provides a simple API that engineers can use to run load tests by defining the desired state of a load test. Executing a load test is now just one API call away!&lt;/p&gt;
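&lt;p&gt;As a rough illustration of what "defining the desired state" could look like, here is a small Python sketch of a request body. The class, its field names and the JSON shape are hypothetical and do not describe the actual load-test-conductor API; only the three parameters themselves (target orders per minute, ramp-up time, plateau time) come from this article.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoadTestSpec:
    """Desired state of one load test (all field names are hypothetical)."""

    target_orders_per_minute: int
    ramp_up_minutes: int
    plateau_minutes: int

    def validate(self) -> None:
        # Reject values that cannot describe a meaningful test run.
        for name, value in asdict(self).items():
            if not value > 0:
                raise ValueError(f"{name} must be positive, got {value}")


def to_request_body(spec: LoadTestSpec) -> str:
    """Serialise the desired state as the JSON body of a single API call."""
    spec.validate()
    return json.dumps(asdict(spec), sort_keys=True)
```

&lt;p&gt;The caller only states the outcome it wants; how to deploy, scale and ramp up is left to the service.&lt;/p&gt;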
&lt;p&gt;The diagram below shows the system components of the Load Test Conductor:
&lt;img alt="Conductor Components" src="https://engineering.zalando.com/posts/2021/03/images/conductor_components_1.png"&gt;&lt;/p&gt;
&lt;h2&gt;Deployment and Scaling&lt;/h2&gt;
&lt;p&gt;To ensure that the exact version of the service running in production is deployed and services are pre-scaled, we automated deployment and scaling of the applications within the Load Test Conductor. We use our Continuous Delivery Platform (CDP) to find the version deployed in production using the Kubernetes client and trigger a new deployment of this exact version in our staging environment. Applications which need to be included in a load test can be provided as an environment-specific configuration. The &lt;strong&gt;Deployer&lt;/strong&gt; component triggers a deployment and waits until all deployments are completed. Afterwards, the &lt;strong&gt;Scaler&lt;/strong&gt; component triggers scaling based on the target configuration. Our Load Test Conductor currently supports scaling resources in Kubernetes and AWS ECS environments. It also handles scaling back down to the previous state once the load test has completed or failed.&lt;/p&gt;
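&lt;p&gt;The guarantee that scaling down happens whether a test completes or fails can be sketched as follows. This is a simplified Python illustration, not the Golang implementation: the function and parameter names are hypothetical, and running the clean-up after a failed test is an assumption of the sketch.&lt;/p&gt;

```python
from typing import Callable

# Each phase is modelled as a callable without arguments (an assumption
# made to keep the sketch independent of any real client library).
Phase = Callable[[], None]


def run_load_test(deploy: Phase, scale_up: Phase, generate_load: Phase,
                  scale_down: Phase, clean_up: Phase) -> None:
    """Run the lifecycle phases in order and always restore the environment."""
    deploy()
    try:
        scale_up()
        generate_load()
    finally:
        # Executed on success and on failure, so the test environment
        # does not keep running at production size.
        scale_down()
        clean_up()
```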
&lt;h2&gt;Load generation&lt;/h2&gt;
&lt;p&gt;We chose to run Locust in distributed mode to mimic customer traffic. Each Locust worker executes our test scripts and interacts with our microservices in order to simulate the customer journey through our systems. We wanted to be able to test different load scenarios, so we decided to implement an algorithm in the load-test-conductor that can steer the Locust workers through the API provided by Locust. The Locust API provides the functionality to change the count and the rate at which Locust workers are spawned. We designed an algorithm that ramps up Locust workers based on a business KPI (orders placed per minute). Users of the test system can define a ramp-up time, a plateau time and the target orders per minute that the test should reach. Our algorithm then hatches the Locust workers based on the configured parameters and dynamically recalculates the hatch rate and Locust worker count needed to reach the defined orders-per-minute target.&lt;/p&gt;
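&lt;p&gt;For illustration, the snippet below builds (but does not send) a request to the swarm endpoint of the Locust web interface using only the Python standard library. The host name is hypothetical, and the form field names are an assumption: they depend on the Locust version (newer releases use a spawn rate field instead of a hatch rate field), so check the documentation of the version in use.&lt;/p&gt;

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_swarm_request(base_url: str, user_count: int, hatch_rate: float) -> Request:
    """Build a POST request that asks Locust to run `user_count` users."""
    # Field names are assumed; they differ between Locust versions.
    form = urlencode({"user_count": user_count, "hatch_rate": hatch_rate})
    return Request(
        url=base_url.rstrip("/") + "/swarm",
        data=form.encode("utf-8"),
        method="POST",
    )
```

&lt;p&gt;Passing the returned object to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; would submit it to a running Locust controller.&lt;/p&gt;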
&lt;h4&gt;Load generation pseudo code&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;initial&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exceeded&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;locust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;defined&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reported&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;locust&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;locust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;zero&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;load test is being initialized.&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;initial&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;equal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;zero&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;load test stalled due to no orders getting generated.&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;needed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;achieve&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;using&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;locust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;need&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;be&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;spawn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatchrate&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatchrate&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loadtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;locust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;calculation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
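&lt;p&gt;As a rough illustration, the per-iteration calculation in the loop above could look like the Java sketch below. The class name &lt;code&gt;RampUpStep&lt;/code&gt;, its parameters, and the linear ramp-up rule (spreading the remaining users evenly over the remaining iterations) are assumptions made for this example, not the conductor's actual implementation.&lt;/p&gt;

```java
// Hypothetical sketch: class name, parameters, and the linear ramp-up rule are assumptions.
final class RampUpStep {
    final int usersToSpawn;
    final double hatchRate; // users per second

    private RampUpStep(int usersToSpawn, double hatchRate) {
        this.usersToSpawn = usersToSpawn;
        this.hatchRate = hatchRate;
    }

    /** Spreads the users still missing evenly over the iterations that are left. */
    static RampUpStep next(int currentUsers, int targetUsers, int iterationsLeft, int intervalSeconds) {
        if (1 > iterationsLeft || 1 > intervalSeconds) {
            throw new IllegalArgumentException("iterationsLeft and intervalSeconds must be positive");
        }
        int remaining = Math.max(0, targetUsers - currentUsers);
        // Ceiling division, so the last iteration always reaches the target.
        int usersToSpawn = (remaining + iterationsLeft - 1) / iterationsLeft;
        // Spawn the users of this iteration evenly across the interval.
        double hatchRate = (double) usersToSpawn / intervalSeconds;
        return new RampUpStep(usersToSpawn, hatchRate);
    }
}
```

&lt;p&gt;For example, with 100 of 500 target users running, 4 iterations left, and a 60-second interval, this sketch spawns 100 users at roughly 1.67 users per second.&lt;/p&gt;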

&lt;h2&gt;Test Execution &amp;amp; Test Evaluation&lt;/h2&gt;
&lt;p&gt;To trigger the load test, we used a Kubernetes CronJob that calls the API of the load test conductor. For our Payment system, load tests take about 2 hours to complete.&lt;/p&gt;
&lt;p&gt;To monitor the system during test execution, we use Grafana dashboards that show the most important metrics, such as latency, throughput, and response code rates. We also inspect these graphs manually to decide whether a load test was successful. Additionally, alerts fire when a service does not meet its SLO during a test.&lt;/p&gt;
&lt;p&gt;Test results have to be manually evaluated to decide if the outcome is successful or not, which is sufficient for us for the time being.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Overall, the solution fulfilled the goal of successfully preparing and scaling our applications. However, running load tests on the test cluster posed several challenges. Sometimes, new deployments were rolled out during tests, which caused the service to point to pods with minimal resources instead of the scaled-up ones. Several infrastructure components, such as the cluster node type, databases, and centrally managed event queues (&lt;a href="https://github.com/zalando/nakadi"&gt;Nakadi&lt;/a&gt;), had to be adjusted to resemble the production environment. This required a lot of communication and alignment with the teams managing these services.&lt;/p&gt;
&lt;p&gt;We made the deployment of the production versions of the applications an optional feature, so that developers can test their feature branch code. The load test tool has become our standard way to verify for every developed change that the applications can handle peak production traffic.&lt;/p&gt;
&lt;p&gt;Giving developers the possibility to run load tests by a simple API call encourages and enables them to thoroughly load test applications.&lt;/p&gt;
&lt;p&gt;Since these load tests are conducted in a non-production environment, we could stress the services until they failed. In combination with load tests in production, this was essential for preparing our production services for higher load.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;ECS is only used by a small set of isolated services, all other services run on &lt;a href="https://engineering.zalando.com/tags/kubernetes.html"&gt;Kubernetes&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Cyber Week"/><category term="Testing"/><category term="Backend"/></entry><entry><title>Integration tests with Testcontainers</title><link href="https://engineering.zalando.com/posts/2021/02/integration-tests-with-testcontainers.html" rel="alternate"/><published>2021-02-25T00:00:00+01:00</published><updated>2021-02-25T00:00:00+01:00</updated><author><name>Marek Hudyma</name></author><id>tag:engineering.zalando.com,2021-02-25:/posts/2021/02/integration-tests-with-testcontainers.html</id><summary type="html">&lt;p&gt;We explore how to write integration tests using Testcontainers.org library in Java-based backend applications.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this article, I will show how teams at &lt;a href="https://zms.zalando.com/"&gt;Zalando Marketing Services&lt;/a&gt; are using integration tests in Java-based backend applications. We will follow the idea of integration tests: the main concept and the attributes of a good integration test. Then, we will discuss an example based on the TestContainers library used in the Spring environment.&lt;/p&gt;
&lt;h2&gt;Integration tests&lt;/h2&gt;
&lt;p&gt;There are many definitions of integration testing. For example, the definition found on &lt;a href="https://en.wikipedia.org/wiki/Integration_testing"&gt;Wikipedia&lt;/a&gt; is: &lt;code&gt;Integration testing is the phase in software testing in which individual software modules are combined and tested as a group&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For this article, we define integration tests as tests of the communication between our code and external components, e.g. a database, an AWS service (such as S3, Kinesis, DynamoDB, or SQS), or an external system that we communicate with over HTTP.&lt;/p&gt;
&lt;p&gt;The purpose of integration tests is to assess how our code behaves when communicating with external services, not only in happy-path scenarios but especially in corner cases: an external service responds with an unexpected HTTP code, an HTTP response arrives only after the defined timeout, or AWS S3 responds with an internal error.&lt;/p&gt;
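&lt;p&gt;To make such corner cases concrete, here is a minimal sketch that exercises a client against a local stub. It uses the JDK's built-in &lt;code&gt;HttpServer&lt;/code&gt; as a stand-in for a containerized mock server so that it runs without Docker. The &lt;code&gt;PriceClient&lt;/code&gt; class, the &lt;code&gt;/price&lt;/code&gt; path, and the fallback rule are hypothetical and exist only for this example.&lt;/p&gt;

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical client under test: the name and the fallback rule are assumptions for this sketch.
final class PriceClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    PriceClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** Returns the response body, or "fallback" on a non-200 status or a timeout. */
    String fetch(Duration timeout) throws IOException, InterruptedException {
        var request = HttpRequest.newBuilder(endpoint).timeout(timeout).GET().build();
        try {
            var response = http.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : "fallback";
        } catch (HttpTimeoutException e) {
            return "fallback";
        }
    }
}

final class CornerCaseDemo {
    /** Starts a local stub that answers with the given status after a delay, then calls it. */
    static String callStub(int status, long delayMillis, Duration clientTimeout) {
        try {
            var server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/price", exchange -> {
                try {
                    Thread.sleep(delayMillis); // simulate a slow dependency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "42".getBytes();
                exchange.sendResponseHeaders(status, body.length);
                try (var out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try {
                var uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/price");
                return new PriceClient(uri).fetch(clientTimeout);
            } finally {
                server.stop(0);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

&lt;p&gt;Calling &lt;code&gt;callStub&lt;/code&gt; with status 503, or with a delay longer than the client timeout, drives the client into its fallback path, which is exactly the kind of behavior an integration test should pin down.&lt;/p&gt;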
&lt;h2&gt;Amount of integration tests&lt;/h2&gt;
&lt;p&gt;While implementing tests, we need to remember to maintain the proper balance between different test types. Integration tests cannot be the core of the testing codebase.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://martinfowler.com/articles/practical-test-pyramid.html"&gt;A pyramid of testing&lt;/a&gt; shows us the proportions of different types of tests. For backend applications, the foundations are unit tests and component tests. Integration tests complement unit tests and the other test types, such as component, system, and manual tests.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pyramid of testing" src="https://engineering.zalando.com/posts/2021/02/images/pyramid-of-testing.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/System_testing"&gt;System tests&lt;/a&gt; and manual tests should ideally be the rarest type of tests.
From our experience, we estimate the number of integration tests to be around 25% of unit tests, but it varies from application to application.&lt;/p&gt;
&lt;h2&gt;Integration tests with Testcontainers library&lt;/h2&gt;
&lt;p&gt;Let's see how to organize an integration test with the Testcontainers library, and how to manage a startup/teardown of Docker containers.
&lt;a href="https://www.testcontainers.org/"&gt;Testcontainers.org&lt;/a&gt; is a JVM library that allows users to run and manage Docker images and control them from Java code. &lt;a href="https://www.testcontainers.org/#who-is-using-testcontainers"&gt;Zalando uses it&lt;/a&gt; mainly for integration tests.
To implement an integration test, you need to run your application similarly to a unit test (method annotated by &lt;code&gt;@Test&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The integration test additionally runs external components as real Docker containers. External components can be one of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;database storage&lt;/strong&gt; - for example, run real PostgreSQL as a Docker image,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mocked HTTP server&lt;/strong&gt; - you can mimic the behavior of other HTTP services by using Docker images from MockServer or WireMock,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; - run real Redis as a Docker image,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;streams or queues&lt;/strong&gt; (like RabbitMQ and others),&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS components&lt;/strong&gt; like S3, Kinesis, DynamoDB, and others, which you can emulate with LocalStack,&lt;/li&gt;
&lt;li&gt;any other &lt;strong&gt;application&lt;/strong&gt; that can be run as a Docker image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is very easy to run Docker images from Java code. Every Docker image can be run with &lt;code&gt;GenericContainer&lt;/code&gt;. For the most popular Docker images, there are prepared wrapper classes for convenient usage.&lt;/p&gt;
&lt;p&gt;To make sure that every Docker container is stopped after use and its resources are released, the library uses JVM shutdown hooks and a special Docker container, &lt;code&gt;Ryuk&lt;/code&gt;. The shutdown hooks stop the containers when the tests finish. If the Java process is no longer available (for example, because it was killed), the &lt;code&gt;Ryuk&lt;/code&gt; container removes the leftover containers. It is also possible to disable &lt;code&gt;Ryuk&lt;/code&gt;.&lt;/p&gt;
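&lt;p&gt;The shutdown-hook mechanism mentioned above is plain JVM functionality. The sketch below only illustrates that mechanism; &lt;code&gt;ContainerCleanup&lt;/code&gt; and &lt;code&gt;registerStopHook&lt;/code&gt; are hypothetical names, and the library's own cleanup code is more involved.&lt;/p&gt;

```java
// Sketch of the JVM mechanism only; names are hypothetical and not part of Testcontainers.
final class ContainerCleanup {
    /** Registers a hook that the JVM runs on normal exit or SIGTERM, but not on a hard kill. */
    static Thread registerStopHook(Runnable stopContainers) {
        Thread hook = new Thread(stopContainers, "stop-containers");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerStopHook(() -> System.out.println("stopping containers"));
        System.out.println("tests finished");
    }
}
```

&lt;p&gt;Because a hook like this never runs when the process is killed abruptly, a second safety net outside the JVM is needed, which is the role the article attributes to &lt;code&gt;Ryuk&lt;/code&gt;.&lt;/p&gt;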
&lt;p&gt;&lt;img alt="Your service communicates with external components run as Docker images." src="https://engineering.zalando.com/posts/2021/02/images/concept.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Maven configuration&lt;/h2&gt;
&lt;p&gt;To use Testcontainers, add a Maven dependency with a current library version.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.testcontainers&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;testcontainers&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${testcontainers.version}&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;test&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's important to have control over test execution. Unit tests should be executed before integration tests. It is a consequence of the pyramid of testing and helps to ensure that feedback loops are short.
In some cases, you may want to skip integration tests, for example when your local machine is slow and you want to run them only on CI/CD.&lt;/p&gt;
&lt;p&gt;To run the integration tests after your unit tests, add &lt;code&gt;maven-failsafe-plugin&lt;/code&gt; to your project. The Failsafe and Surefire plugins work in different build phases.
By default, the Maven Surefire plugin executes unit tests during the &lt;code&gt;test&lt;/code&gt; phase. It includes all classes whose name starts with &lt;code&gt;Test&lt;/code&gt; or ends with &lt;code&gt;Test&lt;/code&gt;, &lt;code&gt;Tests&lt;/code&gt;, or &lt;code&gt;TestCase&lt;/code&gt;.
The Failsafe plugin runs integration tests in the &lt;code&gt;integration-test&lt;/code&gt; phase. To separate execution, we configure the Failsafe plugin to run classes with the suffix &lt;code&gt;IntegrationTest&lt;/code&gt;.
We also create a dedicated profile, here &lt;code&gt;with-integration-tests&lt;/code&gt;, to control whether the integration tests run.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;profiles&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;profile&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;id&amp;gt;&lt;/span&gt;with-integration-tests&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;build&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;pluginManagement&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;plugins&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;plugin&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.maven.plugins&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;maven-failsafe-plugin&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;executions&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;execution&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;goals&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;integration-test&lt;span class="nt"&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;goal&amp;gt;&lt;/span&gt;verify&lt;span class="nt"&gt;&amp;lt;/goal&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/goals&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/execution&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/executions&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;includes&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;include&amp;gt;&lt;/span&gt;**/*IntegrationTest.java&lt;span class="nt"&gt;&amp;lt;/include&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/includes&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/plugin&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/plugins&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/pluginManagement&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/build&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/profile&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/profiles&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A Maven invocation would look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mvn clean verify -P with-integration-tests
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
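&lt;p&gt;The naming rules described above can be sketched in a few lines of Java. This is an approximation for illustration: the real plugins match file paths against patterns, whereas the hypothetical &lt;code&gt;TestNaming&lt;/code&gt; helper below only checks prefixes and suffixes of a simple class name.&lt;/p&gt;

```java
// Sketch only: approximates the plugins' include patterns with prefix and suffix checks.
final class TestNaming {
    /** Surefire's default includes: Test*, *Test, *Tests, *TestCase. */
    static boolean matchesSurefireDefaults(String simpleName) {
        return simpleName.startsWith("Test")
            || simpleName.endsWith("Test")
            || simpleName.endsWith("Tests")
            || simpleName.endsWith("TestCase");
    }

    /** The Failsafe include configured in the profile above: *IntegrationTest. */
    static boolean matchesFailsafeInclude(String simpleName) {
        return simpleName.endsWith("IntegrationTest");
    }
}
```

&lt;p&gt;Note that a name such as &lt;code&gt;AccountRepositoryIntegrationTest&lt;/code&gt; satisfies both checks. If Surefire runs with its default includes, you may therefore need to exclude &lt;code&gt;**/*IntegrationTest.java&lt;/code&gt; in the Surefire configuration so that these classes run only in the &lt;code&gt;integration-test&lt;/code&gt; phase.&lt;/p&gt;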

&lt;h2&gt;Basic integration test with Testcontainers&lt;/h2&gt;
&lt;p&gt;Let’s set up a basic integration test with JUnit 5 and Spring Boot.&lt;/p&gt;
&lt;p&gt;An integration test class can look like the example below. The test class inherits from &lt;code&gt;AbstractIntegrationTest&lt;/code&gt;. The test method creates an entity in a database that runs as a Docker container. We then read the entity back from the database and check whether it has been written correctly.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AccountRepositoryIntegrationTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;extends&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AbstractIntegrationTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Autowired&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;private&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AccountRepository&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dao&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;shouldCreateAccount&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// given&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Account&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;createAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// when&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;dao&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// then&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Account&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;actualOptional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dao&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;findById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Account&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;createAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actualOptional&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="na"&gt;isPresent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;assertThat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actualOptional&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;isEqualTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The class below is an abstract class that all integration tests inherit from. It holds static references to the Docker containers, following the &lt;a href="https://www.testcontainers.org/test_framework_integration/manual_lifecycle_control/#singleton-containers"&gt;singleton container&lt;/a&gt; pattern.
In the static block, we start all containers. We do not need to stop them; this is done automatically. In the example below, the &lt;code&gt;PostgreSQLContainer&lt;/code&gt; listens on a random port. To register properties with dynamic values, we use the &lt;code&gt;@DynamicPropertySource&lt;/code&gt; annotation, which was introduced in Spring Framework 5.2.5 (it has a more compact syntax than &lt;code&gt;ApplicationContextInitializer&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;webEnvironment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;WebEnvironment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RANDOM_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;abstract&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AbstractIntegrationTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;postgres:13.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUsername&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testUsername&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testPassword&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabaseName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testDatabase&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@DynamicPropertySource&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;postgresqlProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DynamicPropertyRegistry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;db_url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getJdbcUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;db_username&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getUsername&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;db_password&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getPassword&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
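&lt;p&gt;Why does the static block start the container only once, even when many test classes extend the base class? The JVM initializes a class a single time, so the static initializer runs once per JVM. The sketch below demonstrates this with hypothetical stand-ins (&lt;code&gt;FakeContainer&lt;/code&gt; instead of &lt;code&gt;PostgreSQLContainer&lt;/code&gt;), so it runs without Docker or Spring.&lt;/p&gt;

```java
// Hypothetical stand-ins: FakeContainer replaces PostgreSQLContainer so the sketch runs without Docker.
final class FakeContainer {
    static int startCount = 0;

    void start() {
        startCount++;
    }
}

abstract class AbstractFakeIntegrationTest {
    static final FakeContainer CONTAINER = new FakeContainer();

    static {
        // Runs once, when the JVM initializes this class, no matter how many subclasses exist.
        CONTAINER.start();
    }
}

final class FirstFakeIntegrationTest extends AbstractFakeIntegrationTest { }

final class SecondFakeIntegrationTest extends AbstractFakeIntegrationTest { }
```

&lt;p&gt;Instantiating both subclasses leaves &lt;code&gt;FakeContainer.startCount&lt;/code&gt; at 1, which mirrors how a single container instance is shared by all integration test classes in the same JVM.&lt;/p&gt;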

&lt;h2&gt;@Testcontainers annotation&lt;/h2&gt;
&lt;p&gt;There are other ways of running your containers. You can use the annotations provided by the JUnit Jupiter Maven module:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.testcontainers&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;junit-jupiter&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;${testcontainers.version}&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;test&lt;span class="nt"&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A test class annotated with &lt;code&gt;@Testcontainers&lt;/code&gt; starts all containers declared in fields annotated with &lt;code&gt;@Container&lt;/code&gt;. When such a field is static, the container is shared between test methods. You can control the startup order of containers with the &lt;code&gt;dependsOn&lt;/code&gt; method of &lt;code&gt;GenericContainer&lt;/code&gt;. The main limitation is that containers &lt;strong&gt;cannot be reused between test classes&lt;/strong&gt;. Moreover, this extension has only been tested with sequential test execution. Using it with parallel test execution is unsupported and may have unintended side effects.
The test class would look like the example below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nd"&gt;@Testcontainers&lt;/span&gt;
&lt;span class="nd"&gt;@SpringBootTest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;webEnvironment&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;WebEnvironment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;RANDOM_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@ActiveProfiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApplicationIntegrationTest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nd"&gt;@Container&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PostgreSQLContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;postgres:13.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUsername&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testUsername&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testPassword&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withDatabaseName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;testDatabase&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nd"&gt;@DynamicPropertySource&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;postgresqlProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DynamicPropertyRegistry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spring.datasource.url&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getJdbcUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spring.datasource.password&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getPassword&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;spring.datasource.username&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;postgreSQL&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;getUsername&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nd"&gt;@Test&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kd"&gt;public&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;contextLoads&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Lifecycle of an integration test&lt;/h2&gt;
&lt;p&gt;All tests (including integration tests) should follow principles defined as FIRST. The acronym FIRST was defined in the book &lt;a href="https://www.oreilly.com/library/view/clean-code-a/9780136083238/"&gt;Clean Code&lt;/a&gt; written by Robert C. Martin.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;[F]&lt;/strong&gt;ast - A test should not take more than a second to finish the execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;[I]&lt;/strong&gt;solated - No order-of-run dependency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;[R]&lt;/strong&gt;epeatable - A test method should NOT depend on any data in the environment/instance in which it is running.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;[S]&lt;/strong&gt;elf-Validating - No manual inspection required to check whether the test has passed or failed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;[T]&lt;/strong&gt;horough - Should cover every use case scenario and NOT just aim for 100% coverage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Starting a Docker container for every test method can take an enormous amount of time. To keep the test suite fast, we need to make a pragmatic compromise: start a container once per test class, or even once for the whole integration test run. The code above uses the second approach.
If we decide to share containers between tests, the tests have to be written with that in mind. There are several ways to achieve this, e.g.:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tests should operate on unique IDs, names, etc. That way, we avoid collisions on database constraints, and there is no need to clean up after the test execution. Some problems can still occur, for example when a test counts the rows in a database table: the count may include rows created by other tests.&lt;/li&gt;
&lt;li&gt;Tests should clean up the state after execution. This approach consumes much more development time and is error-prone.&lt;/li&gt;
&lt;/ul&gt;
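&lt;p&gt;The first option can be sketched as follows. This is a minimal illustration, not code from the project: the class &lt;code&gt;UniqueTestData&lt;/code&gt; and the helper &lt;code&gt;uniqueName&lt;/code&gt; are hypothetical names, and the sketch uses only &lt;code&gt;java.util.UUID&lt;/code&gt; from the JDK.&lt;/p&gt;

```java
import java.util.UUID;

public class UniqueTestData {

    // Hypothetical helper: returns a value that is unique per call, so tests
    // sharing one database container do not collide on unique constraints.
    static String uniqueName(String prefix) {
        return prefix + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String first = uniqueName("customer");
        String second = uniqueName("customer");
        System.out.println(first);
        System.out.println(second);
        // Each call yields a fresh random UUID suffix, so the two names differ.
        System.out.println(first.equals(second));
    }
}
```

&lt;p&gt;A test written this way would insert a row named with &lt;code&gt;uniqueName("customer")&lt;/code&gt; and assert only on that row, for example by looking it up by name, rather than on the total row count of the table.&lt;/p&gt;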
&lt;p&gt;If we would like to run tests concurrently, it would require even more discipline from developers.&lt;/p&gt;
&lt;h2&gt;Advantages of using the TestContainers library&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;You run tests against real components, for example, the PostgreSQL database instead of the H2 database, which doesn’t support the Postgres-specific functionality (e.g. partitioning or JSON operations).&lt;/li&gt;
&lt;li&gt;You can mock AWS services with Localstack or Docker images provided by AWS. This simplifies administration, cuts costs, and lets your build run offline.&lt;/li&gt;
&lt;li&gt;You can run your tests offline - no Internet connection is needed, as long as you have already run them once and the container version has not changed. This is an advantage when you are traveling or have a slow Internet connection.&lt;/li&gt;
&lt;li&gt;You can test corner cases in HTTP communication like:&lt;ul&gt;
&lt;li&gt;programmatically simulate timeout from external services (e.g. by configuring MockServer to respond with a delay that is bigger than the timeout set in your HTTP client),&lt;/li&gt;
&lt;li&gt;simulate HTTP codes that are not explicitly supported by our application.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Backend developers can write the implementation and its tests together and submit them in the same pull request.&lt;/li&gt;
&lt;li&gt;Even one integration test can verify if your application context starts properly and your database migration scripts (e.g. Flyway) are executing correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Disadvantages of using the TestContainers library&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;You bring another dependency into your system that you need to maintain.&lt;/li&gt;
&lt;li&gt;You need to run containers at least once - it consumes time and resources. For example, PostgreSQL as a Docker image needs around 4 seconds to start on my machine, whereas the H2 in-memory database needs only 0.4 seconds. In my experience, Localstack, which emulates AWS components, can take much longer to start: as much as 20 seconds on my machine.&lt;/li&gt;
&lt;li&gt;A continuous integration (e.g. Jenkins) machine needs to be bigger (build uses more RAM and CPU).&lt;/li&gt;
&lt;li&gt;Your local computer should be pretty powerful. If you run many Docker images, it can consume a lot of resources.&lt;/li&gt;
&lt;li&gt;Sometimes, integration tests with TestContainers are still not sufficient. For example, when you test REST responses against a MockServer container, you can miss changes in the real API. If the integration test does not reflect such a change, your code can still crash in production. To minimize the risk, you may consider leveraging Contract Testing via &lt;a href="https://spring.io/projects/spring-cloud-contract"&gt;Spring Cloud Contract&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Code example&lt;/h2&gt;
&lt;p&gt;You can find examples of usages in my &lt;a href="https://github.com/marekhudyma/application-style"&gt;GitHub project&lt;/a&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Java"/><category term="Testing"/><category term="Docker"/><category term="Backend"/></entry><entry><title>A Machine Learning Pipeline with Real-Time Inference</title><link href="https://engineering.zalando.com/posts/2021/02/machine-learning-pipeline-with-real-time-inference.html" rel="alternate"/><published>2021-02-16T00:00:00+01:00</published><updated>2021-02-16T00:00:00+01:00</updated><author><name>Henning-Ulrich Esser</name></author><id>tag:engineering.zalando.com,2021-02-16:/posts/2021/02/machine-learning-pipeline-with-real-time-inference.html</id><summary type="html">&lt;p&gt;How we improved an ML legacy system using Amazon SageMaker&lt;/p&gt;</summary><content type="html">&lt;p&gt;Customers love the freedom to try the clothes first and pay later. We’d love to offer everyone the convenience of deferred payment. However, fraudsters exploit this to acquire goods they never pay for. The better we know the probability of an order defaulting, the better we can steer the risk and offer the convenience of deferred payment to more customers.&lt;/p&gt;
&lt;p&gt;That’s where our Machine Learning models come into play.&lt;/p&gt;
&lt;p&gt;&lt;img alt="payments" src="https://engineering.zalando.com/posts/2021/02/images/payments.png#center"&gt;&lt;/p&gt;
&lt;p&gt;We have been tackling this problem for a while now.
Everything started with a simple Python and scikit-learn setup.
In 2015 we decided to migrate to Scala and Spark in order to scale better. You can read about this transition &lt;a href="https://engineering.zalando.com/posts/2016/05/scalable-fraud-detection-fashion-platform.html"&gt;on our engineering blog&lt;/a&gt;.
Last year we started to explore the potential value of tooling provided by Zalando's Machine Learning Platform (ML Platform) team as part of our strategic investment.&lt;/p&gt;
&lt;h3&gt;Pain Points with the existing solution&lt;/h3&gt;
&lt;p&gt;Our current solution serves us well. However, it has a few pain points, namely:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It’s highly coupled to Scala and Spark, which makes it difficult to use state-of-the-art libraries (mostly Python).&lt;/li&gt;
&lt;li&gt;It contains custom-tailored code for functionality that can nowadays be replaced by managed services. This adds a layer of complexity, making the system difficult to maintain and making it harder to onboard new team members.&lt;/li&gt;
&lt;li&gt;It is problematic in production: it uses a lot of memory, suffers from latency spikes, and new instances start slowly, which affects scalability.&lt;/li&gt;
&lt;li&gt;It has a monolithic design, meaning that feature preprocessing and model training are highly coupled. There is no pipeline with clear steps and everything runs on the same cluster during training.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Requirements for the New System&lt;/h3&gt;
&lt;p&gt;We started the project by writing down requirements for the new solution. The requirements fulfilled by our current system still stand:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;API&lt;/strong&gt;: the new system needs to conform to the existing API. We receive a JSON request with order data and return a response in JSON format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt;: the deployed service must respond to requests quickly. 99.9% of responses must be returned under a threshold in the order of milliseconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load&lt;/strong&gt;: the busiest model must be able to handle hundreds of requests per second (RPS) on a regular basis. During sales events, the request rate for a model may be an order of magnitude higher.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for multiple models in production&lt;/strong&gt;: several models, divided per assortment type, market, etc., must be available in the production service at any given time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified feature implementation&lt;/strong&gt;: our model features require preprocessing (extraction from the request JSON) both in production and in our training data (which comes in the same JSON format). The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance metrics&lt;/strong&gt;: we must be able to compare the performance between the new and the old version of a model (using the same data) to improve our tagging capabilities.&lt;/li&gt;
&lt;/ul&gt;
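&lt;p&gt;One way to meet the unified feature implementation requirement is to keep feature extraction in a single function that both the training path and the serving path call. The sketch below only illustrates that idea in plain Java: the order fields, the chosen features and all names are hypothetical, and the real request schema is not shown in this post. In the system described here, this role is played by a scikit-learn container.&lt;/p&gt;

```java
import java.util.Arrays;

public class SharedPreprocessing {

    // Hypothetical order fields; the real request schema is not part of the post.
    static final class Order {
        final double basketValue;
        final int itemCount;
        final boolean returningCustomer;

        Order(double basketValue, int itemCount, boolean returningCustomer) {
            this.basketValue = basketValue;
            this.itemCount = itemCount;
            this.returningCustomer = returningCustomer;
        }
    }

    // Single source of truth for turning an order into model features.
    static double[] extractFeatures(Order order) {
        return new double[] {
            Math.log1p(order.basketValue),
            order.itemCount,
            order.returningCustomer ? 1.0 : 0.0
        };
    }

    // Training path: applies the shared function to every historical order.
    static double[][] buildTrainingMatrix(Order[] history) {
        double[][] rows = new double[history.length][];
        int i = 0;
        for (Order order : history) {
            rows[i++] = extractFeatures(order);
        }
        return rows;
    }

    // Serving path: applies the very same function to one incoming request.
    static double[] featuresForRequest(Order incoming) {
        return extractFeatures(incoming);
    }

    public static void main(String[] args) {
        Order order = new Order(0.0, 3, true);
        double[] served = featuresForRequest(order);
        double[][] trained = buildTrainingMatrix(new Order[] {order});
        // Both paths produce identical features for the same order.
        System.out.println(Arrays.equals(served, trained[0])); // prints "true"
    }
}
```

&lt;p&gt;Because both paths delegate to &lt;code&gt;extractFeatures&lt;/code&gt;, a change to a feature definition cannot reach training without also reaching serving.&lt;/p&gt;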
&lt;p&gt;To alleviate the current pains, we require our new system to meet the following criteria in addition to those above:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Independence from a specific model framework&lt;/strong&gt;: our research team develops improved models with different frameworks, such as PyTorch, Tensorflow, &lt;a href="https://engineering.zalando.com/posts/2020/06/distributed-xgb-sagemaker.html"&gt;XGBoost&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast scale-up&lt;/strong&gt;: the production system should adjust to growing traffic and accept requests in a matter of minutes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear pipeline&lt;/strong&gt;: the pipeline should have clear steps; in particular, the separation between data preprocessing and model training should be easy to understand.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use existing services&lt;/strong&gt;: ML tooling has made quite a leap in recent years, and where possible we should take advantage of what’s available instead of building custom solutions.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Architecture of the New System&lt;/h3&gt;
&lt;p&gt;The system is a machine learning workflow built primarily from services provided by AWS. To build it, we use &lt;a href="https://www.linkedin.com/pulse/building-ml-workflows-zalando-zflow-s%25C3%25A1nchez-fern%25C3%25A1ndez/"&gt;zflow&lt;/a&gt;, a tool provided by Zalando’s ML Platform team. It is essentially a Python library built on top of &lt;a href="https://aws.amazon.com/step-functions/"&gt;AWS Step Functions&lt;/a&gt;, &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt;, &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;, and &lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt; Spark that allows users to easily orchestrate and schedule ML workflows.&lt;/p&gt;
&lt;p&gt;With this approach we steer away from implementing the whole system from scratch, hopefully making it easier to understand, which was one of the pain points (#2) of our prior system.&lt;/p&gt;
&lt;p&gt;In this new system, a single workflow orchestrates the following tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Training data preprocessing, using a Databricks cluster and a scikit-learn batch transform job on SageMaker&lt;/li&gt;
&lt;li&gt;Training a model using a SageMaker training job&lt;/li&gt;
&lt;li&gt;Generating predictions with another batch transform job&lt;/li&gt;
&lt;li&gt;Generating a report to demonstrate the model’s performance, done with a Databricks job&lt;/li&gt;
&lt;li&gt;Deploying a SageMaker endpoint to serve the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="statemachine" src="https://engineering.zalando.com/posts/2021/02/images/statemachine.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The platform solution allowed us to create a clean workflow with a lot of flexibility when it comes to technology selection for all the steps. We consider this a big improvement with regard to our pain point #4.&lt;/p&gt;
&lt;p&gt;Using a SageMaker training job allows us to substitute the model training step with any model available as a SageMaker container. In rare cases, when the algorithm is not already provided, we still have the possibility to implement the container on our own. This gives us much more flexibility and deals with pain point #1 discussed before.&lt;/p&gt;
&lt;h5&gt;Model Evaluation&lt;/h5&gt;
&lt;p&gt;After the training is finished, a SageMaker model is generated. To evaluate the performance of the model candidate we perform inference on a dedicated test dataset. Because we needed metrics beyond those provided out of the box by SageMaker, we added a custom Databricks job to calculate them and plot them in a PDF report (example below, where we see a model performing poorly).&lt;/p&gt;
&lt;p&gt;&lt;img alt="PR_AUC" src="https://engineering.zalando.com/posts/2021/02/images/PR_AUC_ROC.png"&gt;&lt;/p&gt;
&lt;h5&gt;Model Serving&lt;/h5&gt;
&lt;p&gt;At inference time, a SageMaker endpoint serves the model. Requests include a payload which requires preprocessing before it is delivered to the model. This can be accomplished using a so-called “inference pipeline model” in SageMaker.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model serving" src="https://engineering.zalando.com/posts/2021/02/images/model_serving.png#center"&gt;&lt;/p&gt;
&lt;p&gt;The inference pipeline here consists of two Docker containers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A scikit-learn container for processing the incoming requests, i.e. extracting features from the input JSON or basic data transformations&lt;/li&gt;
&lt;li&gt;A main model container (e.g. XGBoost or PyTorch) for model predictions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The containers are lightweight and optimized for serving. They are able to scale up sufficiently fast. This solved our pain point #3.&lt;/p&gt;
&lt;h3&gt;Performance Metrics&lt;/h3&gt;
&lt;h5&gt;Latency and Success Rate&lt;/h5&gt;
&lt;p&gt;We then performed a series of load tests. During every load test the endpoint was hit continuously for 4 minutes. We varied:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The EC2 instance type&lt;/li&gt;
&lt;li&gt;Number of instances&lt;/li&gt;
&lt;li&gt;The request rate. Different rates were applied to different AWS instance types. For example, it does not make sense to use ml.t2.medium instances to serve a model at the highest request rate, as they are not meant for such a load.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We reported the following metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Success&lt;/strong&gt;: the percentage of all requests which returned an HTTP 200 OK status. 100% is optimal. Although there is no hard threshold here, the success rate should be high enough to serve endpoint requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;99th&lt;/strong&gt;: the 99th percentile of the response times of all requests, in milliseconds. To be usable, an endpoint must be able to respond to requests within the agreed sub-second threshold.&lt;/li&gt;
&lt;/ul&gt;
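&lt;p&gt;Both metrics could be computed from raw load-test results roughly as follows. This is a sketch under assumptions: it uses the nearest-rank definition of a percentile, which the post does not specify, it expects at least one recorded sample, and the class name, method names and sample data are purely illustrative.&lt;/p&gt;

```java
import java.util.Arrays;

public class LoadTestMetrics {

    // Share of requests that returned HTTP 200, as a percentage.
    static double successRate(int[] statusCodes) {
        int ok = 0;
        for (int code : statusCodes) {
            if (code == 200) {
                ok++;
            }
        }
        return 100.0 * ok / statusCodes.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static double percentile(double[] latenciesMs, double percentile) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        int[] codes = {200, 200, 200, 500};
        double[] latencies = {12.0, 48.0, 15.0, 30.0, 22.0};
        System.out.println(successRate(codes));          // prints 75.0
        System.out.println(percentile(latencies, 99.0)); // prints 48.0
    }
}
```

&lt;p&gt;With only a handful of samples the 99th percentile is simply the slowest request; over a 4-minute load test it discards the slowest 1% of requests.&lt;/p&gt;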
&lt;p&gt;Sample results for the ml.m5.large instance type:&lt;/p&gt;
&lt;p&gt;&lt;img alt="load1" src="https://engineering.zalando.com/posts/2021/02/images/load3.png"&gt;&lt;/p&gt;
&lt;p&gt;Some of our findings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For a rate of 200 requests/s, a single ml.m5.large instance can handle the load with a p99 of under 80ms.&lt;/li&gt;
&lt;li&gt;For a rate of 400 requests/s, the success rate is not near 100% until 4 or more ml.m5.large instances are used. The response times are under 50ms.&lt;/li&gt;
&lt;li&gt;For the 1000 requests/s rate, 2 or more ml.m5.4xlarge or ml.m5.12xlarge instances can keep the success rate near 100% with response times below 200ms.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Cost&lt;/h5&gt;
&lt;p&gt;Based on our estimates, the cost of serving our models will increase significantly after the migration. We anticipate an increase of up to 200%. The main reason is the cost efficiency of the legacy system, where all models are served from one big instance (replicated for scaling). In the new system, every model gets its own instance or instances.&lt;/p&gt;
&lt;p&gt;Still, this is a cost increase that we are willing to accept for the following reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model flexibility. Having a separate instance per model means every model can use a different technology stack or framework for serving.&lt;/li&gt;
&lt;li&gt;Isolation. Every model’s traffic is separated, meaning we can scale each model individually, and flooding one model with requests doesn’t affect other models.&lt;/li&gt;
&lt;li&gt;Use of managed services instead of maintaining a custom solution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Scale-up Time&lt;/h5&gt;
&lt;p&gt;We would like to be able to adjust our infrastructure to traffic as fast as possible. This is why we verified how much time it takes to scale the system up. Based on our experiments, adding an instance to a SageMaker endpoint with our current configuration takes 50% less time than scaling up our old system. However, we wish to explore options for reducing this time further.&lt;/p&gt;
&lt;h3&gt;Cross Team Collaboration&lt;/h3&gt;
&lt;p&gt;Development of this system was a collaborative effort of two different teams: Zalando Payments and Zalando Machine Learning Platform, with each contributing members to a dedicated virtual team. This inter-team collaborative workstyle is typical for the ML Platform team, which offers the services of data scientists and software engineers to accelerate onboarding to the platform. To define the scope of the collaboration, the two teams created a Statement of Work (or SoW) to specify what services and resources the ML Platform will provide, and for what length of time. The entire collaboration lasted 9 months.&lt;/p&gt;
&lt;p&gt;The two teams collaborated in a traditional Kanban development style: we developed user stories, broke them into tasks, and completed each task. We met weekly for replanning and had daily standups to catch up.&lt;/p&gt;
&lt;p&gt;Our collaboration was not without friction. Having developers from two different teams means overhead from two different teams. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We had periods where the ML Platform team members had to deliver training programs for other parts of the company, and could not devote much time to this project. Similarly, members of the Payments team would occasionally need to attend to unrelated firefighting duties and miss a week of the collaborative project. Clearly communicating these external influences was very important, as the Payments team members are not aware of what is happening in the ML Platform team, and vice versa.&lt;/li&gt;
&lt;li&gt;Sharing knowledge between the two teams was crucial, especially in the early stages of the project. While the Payments team members are experts in the underlying business domain, the ML Platform team members were not. Similarly, while the ML Platform team members are experienced with the tools used for the project, the Payments team members did not have this expertise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion and Outlook&lt;/h3&gt;
&lt;p&gt;Our new system fulfills the requirements of the old system, while addressing its pain points:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Because we use Amazon SageMaker for the model actions (i.e. training, endpoints, etc.), the system is guaranteed to be independent from the modeling framework.&lt;/li&gt;
&lt;li&gt;Each model served behind a SageMaker endpoint scales more quickly than in the old system, and we can easily increase the number of instances used for model training to speed up our pipeline execution.&lt;/li&gt;
&lt;li&gt;Each stage of the pipeline has a clear purpose and, thanks to SageMaker Inference Pipelines, data processing and model inference can take place within a single endpoint.&lt;/li&gt;
&lt;li&gt;Because we are using Zalando ML Platform tooling, our new system takes advantage of technology from AWS, in particular Amazon SageMaker.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We plan to use a similar architecture in other data science products.&lt;/p&gt;
&lt;p&gt;The project was a successful test of a team collaboration across departments, and proved that such collaboration can bring great results.&lt;/p&gt;</content><category term="Zalando"/><category term="Artificial Intelligence"/><category term="AWS"/><category term="Data Science"/><category term="Machine Learning"/><category term="Backend"/></entry><entry><title>Find out what challenges Customer Conversion solves at Zalando</title><link href="https://engineering.zalando.com/posts/2021/02/customer-conversion-at-zalando.html" rel="alternate"/><published>2021-02-11T00:00:00+01:00</published><updated>2021-02-11T00:00:00+01:00</updated><author><name>Kerstin Schartner</name></author><id>tag:engineering.zalando.com,2021-02-11:/posts/2021/02/customer-conversion-at-zalando.html</id><summary type="html">&lt;p&gt;We have spoken with our Director Customer Conversion, Pascal Hahn to find out more about their Product and to understand what the teams are looking for in the upcoming Hiring Sprint Event&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Pascal Hahn" src="https://engineering.zalando.com/posts/2021/02/images/pascal-hahn.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;When our &lt;a href="https://pages.beamery.com/zalando/page/hiring-sprint-event?utm_source=beamery&amp;amp;utm_medium=landingpage-p-paid&amp;amp;utm_campaign=2018-dim&amp;amp;utm_term=LinkedInTRM&amp;amp;utm_content=hiringsprintFeb"&gt;Hiring Sprint&lt;/a&gt; kicks off next month, we will be looking for great professionals to join some of our stellar teams – Shopping Cart, Checkout, Sales Orders and Returns. All meaningful segments of our &lt;strong&gt;Customer Conversion&lt;/strong&gt; organization, these teams are responsible for forging and shaping some of the most relevant experiences in Zalando customer journey. Skilled in innovating and versed in perfection, our Customer Conversion organization might become your next career step if you ace our Hiring Sprint.&lt;/p&gt;
&lt;p&gt;To give you a better idea of what awaits you here, I spoke with our Director Customer Conversion, Pascal Hahn, who talked me through the priorities of his teams and shared some advice for those who are keen to join them ;)&lt;/p&gt;
&lt;h3&gt;Pascal, could you introduce the major functions and priorities of your teams?&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Customer Conversion&lt;/strong&gt; is the organization that enables our &lt;strong&gt;35M&lt;/strong&gt; customers to shop on Zalando. We are split into two departments: the Purchase department that delivers experiences from Shopping cart to Order confirmation, and the Post Purchase department that is responsible for processing orders, sorting out order details, order history as well as return experiences. Each department delivers experiences end-to-end, from ideation, product inception and development to operating and scaling them. Our mission is to let customers buy their beloved pieces easily and effortlessly by providing seamless, convenient and reliable experiences throughout. The work we do is a broad mix of designing and building new capabilities, experimenting, expanding and extending existing experiences or improving scalability and operational posture overall.&lt;/p&gt;
&lt;h3&gt;"Solving something that matters" - what does it mean for the team? What does it mean for you personally?&lt;/h3&gt;
&lt;p&gt;There’s no e-commerce without people shopping; and to work on the experiences that Zalando customers across all 17 markets use when they shop for their next favorite piece is a great mission. Being part of delivering excellent shopping experiences is what makes working at Zalando very special for me.&lt;/p&gt;
&lt;h3&gt;What do you appreciate the most about the challenges you face in your job?&lt;/h3&gt;
&lt;p&gt;To have a shot at solving problems that affect millions of users, together with some of the industry’s brightest minds is a privilege. When I started here about a year ago, I didn't know much about the inner workings of retail, and ever since I haven’t had a single day at Zalando without learning something new. Going forward, I still feel like there’s so much to learn.&lt;/p&gt;
&lt;h3&gt;Pascal, could you give some advice to people who'd like to work in the Customer Conversion organization?&lt;/h3&gt;
&lt;p&gt;If you’re excited about innovating at the intersection of the physical and the digital; if you take pride in building and operating systems that “just work”; if you enjoy using state-of-the-art tech at scale – this is the right place for you to work at. Whether you choose to work on product innovations with our product management team, or join us as an engineer or engineering leader that owns, delivers and operates our experiences, or as a data scientist who works on detecting transactional risks that affect our overall business – we offer a number of roles and challenges.&lt;/p&gt;
&lt;h3&gt;What do you think is the main achievement of the teams in Customer Conversion of the past few years?&lt;/h3&gt;
&lt;p&gt;The COVID pandemic has posed many challenges to our customers, team members, teams and business. When some markets introduced severe lockdowns, we had to react quickly building new features with very short timelines. Keeping the Zalando Store open and coping with the increased scale while delivering new features to our customers continually has been no easy feat. In addition, all the while we were working from home and had to cope with our own personal difficulties brought on by the virus and the imposed restrictions.&lt;/p&gt;
&lt;p&gt;For more details on how to participate in our 1st Hiring Sprint follow &lt;a href="http://zln.do/3nNskEV"&gt;this Link&lt;/a&gt;!&lt;/p&gt;</content><category term="Zalando"/><category term="Recruiting"/><category term="Inside Zalando"/><category term="Culture"/><category term="Leadership"/></entry><entry><title>It's Never Too Late For a Career Change</title><link href="https://engineering.zalando.com/posts/2021/02/its-never-too-late-for-a-career-change.html" rel="alternate"/><published>2021-02-04T00:00:00+01:00</published><updated>2021-02-04T00:00:00+01:00</updated><author><name>Julia Miller</name></author><id>tag:engineering.zalando.com,2021-02-04:/posts/2021/02/its-never-too-late-for-a-career-change.html</id><summary type="html">&lt;p&gt;A story of a Business Analyst and Product Manager turning into a Software Engineer.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Is it ever too late to follow your dream and start a new career? Well, I was 30 and had been working for Zalando for more than 4 years when I decided to change my career path for the second time. I made the decision a year ago, joined my new team in April 2020, and I didn't regret it for a single day.&lt;/p&gt;
&lt;p&gt;Since that transition, a lot of people approached me with questions and asked me for advice. I started to realize that my experience could be valuable to others out there. Some people may want to change their career too but are afraid of failure or do not have enough support from their friends or colleagues, or maybe haven’t even shared their thoughts with anyone yet.&lt;/p&gt;
&lt;p&gt;This article contains answers to the questions I was frequently asked. I hope it might support you with the decision whether a career in software engineering is what you always wanted, provide you with arguments to convince people around you that switching careers is a great idea if you do it for the right reasons, or just help you go through a difficult time of uncertainty.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Julia after the Coding Camp" src="https://engineering.zalando.com/posts/2021/02/images/inside-zalando.jpg"&gt;&lt;/p&gt;
&lt;h3&gt;What did you do before you became an engineer?&lt;/h3&gt;
&lt;p&gt;I studied business mathematics and joined Zalando as a Business Analyst after completing my master's degree. In my first job, I was helping out one of the Product Managers (PM) in my department. One year later I was offered the opportunity to become a PM myself. By that time, product duties had already taken more than 50% of my working time, so it was an easy decision. I continued to work as a PM for another 3 years.&lt;/p&gt;
&lt;h3&gt;How did you become interested in coding?&lt;/h3&gt;
&lt;p&gt;I was always working quite closely with engineers in my team. At some point, they realized that I enjoy thinking about technical stuff too, and started to involve me in their discussions. I still remembered a bit of coding that I did during my bachelor years, and I started spending some of my free time attending online courses and re-learning how to code.&lt;/p&gt;
&lt;h3&gt;How did you learn to code?&lt;/h3&gt;
&lt;p&gt;My interest was growing, but at the same time, I had to admit that I couldn't spend enough time coding outside my work. You should know that I'm a very social person, so almost every evening in my normal week is blocked for some kind of social activity. I love to travel, so the weekends didn't help either. I decided to give it a proper try: take a sabbatical and do a full-time course at the &lt;a href="https://www.ironhack.com/en/berlin"&gt;Ironhack&lt;/a&gt; coding camp for 9 weeks. With the help of this course I built the foundation for my current programming skill set.&lt;/p&gt;
&lt;h3&gt;Why did you decide to switch to engineering?&lt;/h3&gt;
&lt;p&gt;After 9 weeks of coding every day&lt;sup id="fnref:*"&gt;&lt;a class="footnote-ref" href="#fn:*"&gt;1&lt;/a&gt;&lt;/sup&gt;, I still enjoyed it. So I said to myself, this is what I'd like to be paid for! It felt right to pursue something that is so much fun even while it's sometimes frustrating.&lt;/p&gt;
&lt;h3&gt;How did you know it was the right decision?&lt;/h3&gt;
&lt;p&gt;This was the key question for me. It was a life-changing decision, so I wanted to be fully aware of my motivations and confident that I really want it. My key takeaways were:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Make sure not to trade one trouble for another.&lt;/strong&gt; It's absolutely crucial to know that you want to become an engineer rather than just escape your current job. To verify that it's not about my current product or team, I first switched to another department, still as a PM but working on a completely different topic. Only after spending half a year on the new project could I say with certainty that my wish was not about the circumstances but about the engineering job itself.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make sure you want to become an engineer for the right reasons.&lt;/strong&gt; I made a list of pros and cons for both my current job and software development and then talked to engineers I knew to ensure it's not just how I &lt;em&gt;imagine&lt;/em&gt; this job to be. If some aspects of your current role make you unhappy, make sure it's not going to be a major part of your future role. If you are happy with your job, but the main reason is that you think you could earn more money as an engineer – please, think twice. However, if you can see how becoming a software engineer would fit your interests, character, and life goals much better than your current job – go for it!&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;What do you like most about engineering?&lt;/h3&gt;
&lt;p&gt;My favorite topic! There are so many things! Here are just a few highlights:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Power of creativity&lt;/strong&gt;: when you write code, you create something that wasn't there before. Sometimes it's really touchable, like a new button, sometimes it's a new behavior you introduce, sometimes a performance gain. Whatever it is, the act of creation makes you feel almost like a god ^^.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joy of focus&lt;/strong&gt;: I love that engineering goals are usually very tangible. I also love that, at least at the beginning of your engineering career, you can focus on one task at a time. In my previous roles, I would often end up juggling a lot of balls at the same time, which can be very exhausting. It’s an extremely satisfying experience to really complete something end to end, even if it’s just a little button that does exactly one thing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solving puzzles&lt;/strong&gt;: you often have to solve what feels like real mysteries. When you investigate failures or look for root causes of a bug, you are the Sherlock Holmes in this story. If you are into this kind of puzzle, it's going to be amazing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Constant learning&lt;/strong&gt;: no matter how long you have been in this job, there is always more to learn - new frameworks, programming languages, tools, principles, concepts, entire new areas of technology. This feeling is shared by every engineer I know, regardless of how many years of experience they have. Your brain is always working, and it's beautiful.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Weren't you afraid to start on a new path after 4 years of a professional career?&lt;/h3&gt;
&lt;p&gt;Of course I was! Every new start is terrifying. But if you know why you are doing it and you have the support of your colleagues, friends and family, it's less scary. Even if you don't have that, the engineering community is a lovely place – there are always people who will point you in the right direction when you ask for help. Also, what's the worst thing that could happen? If a year down the line I should realize that it's not the right thing for me, I can always return to my previous job with even more valuable experience in my mental backpack.&lt;/p&gt;
&lt;h3&gt;How did you feel about throwing away years of professional experience?&lt;/h3&gt;
&lt;p&gt;The answer is simple: I didn't throw them away. Whatever you were doing before, whatever you learned and practiced, stays with you and you can most certainly use it in your new role. In my case, it was easy to justify: I brought with me the knowledge about the software development lifecycle, soft skills and business acumen. If you worked in a different role before, you still learned useful things there: maybe you were part of a team, a problem solver or a great communicator, or maybe you are amazing at structuring things. Whatever it is, you are going to need it and it's going to help you.&lt;/p&gt;
&lt;h3&gt;How did your friends and family react?&lt;/h3&gt;
&lt;p&gt;I was a bit afraid to tell them. "I'm 30, and I finally figured out what I want to become when I grow up" sounded weird even in my own head. But almost everyone I shared my idea with was so supportive and excited once I explained my motivation, that soon I started to gain a lot of energy from telling people about my goal and sharing my plans.&lt;/p&gt;
&lt;h3&gt;Is it better to do the change inside your current company or join a new one?&lt;/h3&gt;
&lt;p&gt;Well, it really depends on your current situation. On the one hand, I would highly recommend doing the first steps in your current company because it makes things &lt;em&gt;easier&lt;/em&gt;. You already know the company, you know some people, you are not a complete newbie. I’m not sure if Zalando is special that way, but I received unimaginable amounts of support from my leads, colleagues and the company itself. Zalando invests in its people, so I was financially supported from the very first milestone on this way. My wonderful company paid for my coding camp, and the only thing I had to do in return was to sign that I won’t leave within the next year (which I didn’t intend to do anyway). Every next step would have also been way harder in a new environment.
On the other hand, if you are not happy with your current employer, staying there only to make the transition easier is probably not the best idea. In short: if you like your company, make your transition there; if not, don't be afraid to leave.&lt;/p&gt;
&lt;h3&gt;What concrete steps can I take towards switching to engineering?&lt;/h3&gt;
&lt;p&gt;The way to engineering can be very different. Here is how I would go about it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Try online programming courses to see if you like it. While doing that myself, I collected a &lt;a href="https://docs.google.com/document/d/1pWs9v7ecaksEYonProyTuGimee5Y8zgY0ZqAiQ5lR3E/edit?usp=sharing"&gt;list of resources&lt;/a&gt; that I found helpful, feel free to check it out and add new ones using the comments.&lt;/li&gt;
&lt;li&gt;If you are still not quite sure, take a vacation or a sabbatical and give it a full-time test-drive.&lt;/li&gt;
&lt;li&gt;Write a list of things that you love about your current job and that you think you might love about being an engineer. Talk to someone about it and verify that you have the right motivation.&lt;/li&gt;
&lt;li&gt;Talk to your manager about your goal. Together you can figure out what would be the right way: a slow transition with a part-time involvement, or a full switch at a time frame that is satisfactory for both of you.&lt;/li&gt;
&lt;li&gt;Do it :)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt="Trying online courses" src="https://engineering.zalando.com/posts/2021/02/images/coding-with-a-cat.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I have met a lot of wonderful people who would like to change their careers and try something new. Many of them have always dreamed of becoming an engineer but were told not to. Actually, my own sister once said that I shouldn’t study Computer Science because I’m not smart enough for that, so I didn’t. It can be scary, you might feel like people are going to be judgmental about it, you might be afraid to lose your stability - and it’s all justified. My goal here is to let you know that you are not alone with your fear, that the change is not as crazy as it might sound to you, and that there are more people like you who have already successfully made the transition and can support you. Give it a try!&lt;/p&gt;
&lt;p&gt;If you have any questions that I haven’t covered here, don't hesitate to &lt;a href="https://www.linkedin.com/in/julia-miller-ber/"&gt;reach out&lt;/a&gt; to me, and I'll gladly share everything I know.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:*"&gt;
&lt;p&gt;I'd like to point out that this was a very special situation for a limited amount of time. In normal times and especially during quarantine I pay a lot of attention to my work-life-balance and strongly recommend everyone to do the same.&amp;#160;&lt;a class="footnote-backref" href="#fnref:*" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Diversity in Tech"/><category term="Education"/><category term="Tech Culture"/><category term="Tech Jobs"/><category term="Tour of Mastery"/><category term="Women in Tech"/><category term="Culture"/><category term="Leadership"/></entry><entry><title>Stop using constants. Feed randomized input to test cases.</title><link href="https://engineering.zalando.com/posts/2021/02/randomized-input-testing-ios.html" rel="alternate"/><published>2021-02-02T00:00:00+01:00</published><updated>2021-02-02T00:00:00+01:00</updated><author><name>Vijaya Prakash Kandel</name></author><id>tag:engineering.zalando.com,2021-02-02:/posts/2021/02/randomized-input-testing-ios.html</id><summary type="html">&lt;p&gt;Most test cases assert using hand typed constants. Leveraging randomized input is a much better approach.&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Testing is a widely accepted practice in the software industry. I am an iOS Engineer and, like most of us, have been writing tests for years. The way I approach testing changed radically a few years back, and I have since used and shared this new technique within Zalando and outside. In this post, I will explain what is wrong with most test cases and how to apply randomized input to improve tests.&lt;/p&gt;
&lt;p&gt;This is our sample code under test:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;DomainStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;internalStorage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserDefaults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;standard&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;internalStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;forKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;internalStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;forKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;The usual testing approach&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;test_setValueCanBeRetrieved&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;storage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DomainStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;quot;Zalando&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;quot;companyName&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;obtained&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;quot;companyName&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
    &lt;span class="n"&gt;XCTAssertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Zalando&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obtained&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Imagine someone opens your code a few months down the road and modifies the code under test ever so slightly.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;DomainStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;internalStorage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserDefaults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;standard&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;internalStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;forKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;&amp;quot;Zalando&amp;quot;&lt;/span&gt;        &lt;span class="c1"&gt;// Note&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This diligent test runs on your machine or on CI and it passes. Does that mean the production code works fine? Of course not. Most Test Driven Development (TDD) practitioners would move past this DomainStore, but should you? How can we reveal similar quality issues and address them?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fundamentally, we are testing with a constant String while the production method suggests it can take any String.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Consider this function signature:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It tells us that the function can take any &lt;code&gt;String&lt;/code&gt; instance, not just &lt;code&gt;"Zalando"&lt;/code&gt;. However, our previous test asserted on only one instance of the &lt;code&gt;String&lt;/code&gt; type.&lt;/p&gt;
&lt;h2&gt;Better approach: Feed Randomized Input to test cases&lt;/h2&gt;
&lt;p&gt;The fundamental idea of this technique is &lt;strong&gt;never to feed test cases hand-typed constants.&lt;/strong&gt; What do we feed in then? Welcome &lt;code&gt;randomness&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is our fixed test case.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;test_setValueCanBeRetrieved&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;storage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DomainStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;  &lt;span class="c1"&gt;// Note&lt;/span&gt;
      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;

      &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;obtained&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
      &lt;span class="n"&gt;XCTAssertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obtained&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;String.random&lt;/code&gt; produces a random instance of &lt;code&gt;String&lt;/code&gt;. At Zalando, we use this &lt;a href="https://github.com/kandelvijaya/Randomizer"&gt;Randomizer&lt;/a&gt; library for generating random inputs. It covers most of the commonly used types in the Standard Library.&lt;/li&gt;
&lt;li&gt;If &lt;strong&gt;Randomizer&lt;/strong&gt; doesn’t fit your needs, feel free to extend it or add your own conformance to the &lt;code&gt;Random&lt;/code&gt; protocol.&lt;/li&gt;
&lt;/ul&gt;
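&lt;p&gt;To make the idea concrete, here is a minimal sketch of what a &lt;code&gt;Random&lt;/code&gt; protocol and a couple of conformances could look like. The protocol shape, the alphabet and the length range are assumptions made for illustration; the actual Randomizer library may be implemented differently.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import Foundation

// Hypothetical protocol shape; the real Randomizer library may declare it differently.
protocol Random {
    static var random: Self { get }
}

extension String: Random {
    // A random-length string built from a fixed alphabet (an illustrative choice).
    static var random: String {
        let alphabet = Array(&amp;quot;abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789&amp;quot;)
        let length = Int.random(in: 1...32)
        return String((0..&amp;lt;length).map { _ in alphabet.randomElement()! })
    }
}

extension Optional: Random where Wrapped: Random {
    // Randomly nil or a wrapped random value, so both branches get exercised over time.
    static var random: Wrapped? {
        Bool.random() ? Wrapped.random : nil
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With conformances like these in the test target, &lt;code&gt;String.random&lt;/code&gt; yields a fresh value on every access, and an optional &lt;code&gt;String&lt;/code&gt; argument can be filled with &lt;code&gt;.random&lt;/code&gt; as well.&lt;/p&gt;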
&lt;p&gt;Now the tampered code above will not pass this test case. Until we run the test, we don’t know which values it will use, and these values differ across runs, so over time our production code is exercised with many permutations of possible values. This is the essence of randomized input tests (sometimes referred to as permutation tests).&lt;/p&gt;
&lt;h3&gt;Going beyond a simple case&lt;/h3&gt;
&lt;p&gt;Here’s one example test case from our module. The code below creates a random label component and sets random accessibility options on the model layer, then asserts that the rendered view has the correct accessibility information.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;test_whenAccessibilityProvided_andComponentHasTapAction_thenAccessibilityIsSet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;props&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LabelProps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;accessibilityModel&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;APIAccessibility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;component&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LabelComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;flex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;accessibility&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Accessibility&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;accessibilityModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;componentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
          &lt;span class="n"&gt;debugProps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DebugProps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;node&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MockNode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ComponentAction&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;))]]&lt;/span&gt;

        &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updateAccessibility&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;XCTAssertTrue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isAccessibilityElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;XCTAssertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accessibilityLabel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accessibilityModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;XCTAssertEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accessibilityHint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accessibilityModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;XCTAssertTrue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accessibilityTraits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="bp"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="n"&gt;staticText&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;XCTAssertTrue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accessibilityTraits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="bp"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User defined types (usually Structs) are composed of standard library types and predefined custom types. We can extend user defined types in our test target to conform to &lt;code&gt;Random&lt;/code&gt;. An example conformance of LabelProps is as below:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;LabelProps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Codable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Hashable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;backgroundColor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;font&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FontProps&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;extension&lt;/span&gt; &lt;span class="nc"&gt;LabelProps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;random&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LabelProps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;LabelProps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backgroundColor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;We could use code generation in a build phase to synthesize the &lt;code&gt;Random&lt;/code&gt; conformance. Although this is out of scope for this post, it is similar in spirit to how &lt;code&gt;Equatable&lt;/code&gt; conformance is synthesized.&lt;/li&gt;
&lt;li&gt;Due to Swift’s type inference, &lt;code&gt;.random&lt;/code&gt; will use the exact type’s random conformance.&lt;/li&gt;
&lt;li&gt;For cases where we need to compare against the input value, we can store the generated model in a local constant, as we did for &lt;code&gt;accessibilityModel&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;There are times when the function under test expects an &lt;code&gt;Email&lt;/code&gt;, &lt;code&gt;URL&lt;/code&gt;, &lt;code&gt;Deeplink&lt;/code&gt; or &lt;code&gt;PhoneNumber&lt;/code&gt;. These data types are often represented by &lt;code&gt;String&lt;/code&gt;. However, &lt;code&gt;String.random&lt;/code&gt; is not good enough in this case. There are two ways of tackling this. One is to extend &lt;code&gt;String&lt;/code&gt; with something like &lt;code&gt;String.randomEmail&lt;/code&gt;. Another is to create a concrete type which conforms to &lt;code&gt;Random&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
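&lt;p&gt;The two options for format-constrained strings could be sketched as follows. Everything here is hypothetical and only for illustration: the &lt;code&gt;randomLowercase&lt;/code&gt; helper, the &lt;code&gt;Email&lt;/code&gt; type, the simplified address format, and the &lt;code&gt;Random&lt;/code&gt; protocol shape are assumptions, not the Randomizer library's API.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import Foundation

// Hypothetical protocol shape, mirroring the Type.random convention used in this post.
protocol Random {
    static var random: Self { get }
}

// Illustrative helper: a random lowercase string whose length falls in the given range.
func randomLowercase(_ range: ClosedRange&amp;lt;Int&amp;gt;) -&amp;gt; String {
    let letters = Array(&amp;quot;abcdefghijklmnopqrstuvwxyz&amp;quot;)
    return String((0..&amp;lt;Int.random(in: range)).map { _ in letters.randomElement()! })
}

// Option 1: a dedicated generator on String (simplified local@domain.tld format).
extension String {
    static var randomEmail: String {
        &amp;quot;\(randomLowercase(1...12))@\(randomLowercase(1...12)).\(randomLowercase(2...4))&amp;quot;
    }
}

// Option 2: a concrete type that conforms to Random.
struct Email: Equatable {
    let rawValue: String
}

extension Email: Random {
    static var random: Email {
        Email(rawValue: .randomEmail)
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The concrete type has the advantage that a function taking an &lt;code&gt;Email&lt;/code&gt; can be fed &lt;code&gt;.random&lt;/code&gt; directly, while the &lt;code&gt;String&lt;/code&gt; extension is the lighter option when the production code still accepts plain strings.&lt;/p&gt;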
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This technique was not my own idea. I picked up the phrase &lt;strong&gt;“Don’t use constants in tests”&lt;/strong&gt; from &lt;a href="https://twitter.com/jdortiz"&gt;Jorge Ortiz&lt;/a&gt; during his workshop on Clean Architecture at &lt;a href="https://www.swiftaveiro.xyz/"&gt;Swift Aveiro&lt;/a&gt;, 2017. It then changed the way I write tests. I hope this technique will help you too.&lt;/p&gt;
&lt;p&gt;The technique of permutation testing by using random input applies to all software testing; not just iOS development. The only requirement is &lt;code&gt;Type.random&lt;/code&gt;.&lt;/p&gt;</content><category term="Zalando"/><category term="Testing"/><category term="iOS"/><category term="Mobile"/><category term="Backend"/></entry><entry><title>Creating a uniform landscape for macOS Software</title><link href="https://engineering.zalando.com/posts/2021/01/creating-a-uniform-landscape-for-mac-software.html" rel="alternate"/><published>2021-01-21T00:00:00+01:00</published><updated>2021-01-21T00:00:00+01:00</updated><author><name>Bernardo Prieto Curiel</name></author><id>tag:engineering.zalando.com,2021-01-21:/posts/2021/01/creating-a-uniform-landscape-for-mac-software.html</id><summary type="html">&lt;p&gt;Here's how we managed to automate the patch management process through the use of JAMF Pro, open source tools and a set of in-house developments to tie these tools together.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="macOS installed software packages" src="https://engineering.zalando.com/posts/2021/01/images/preview.png#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;At the time of this writing, we have a universe of Mac applications, identified and version-inventoried, within a fleet of a little over 3,000 Mac devices at Zalando. A subset of these applications, selected by importance, frequency of updates or size of the install base, is part of a so-called &lt;strong&gt;software lifecycle&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;However, in July 2019, when a &lt;a href="https://support.zoom.us/hc/en-us/articles/360031244812-Security-CVE-2019-13449"&gt;vulnerability was discovered in &lt;strong&gt;Zoom&lt;/strong&gt;&lt;/a&gt; (long before becoming the mainstream video conference app during the COVID-19 pandemic), Information Security requested the immediate deployment of the latest patch to every device that had the app installed and a report of the progress of this task.&lt;/p&gt;
&lt;p&gt;The report and the patch were not a challenge in themselves — this was already part of what we were doing with core applications such as Google Chrome, or Chat — but the process was nothing more than a set of manual and repetitive chores that could be streamlined.&lt;/p&gt;
&lt;p&gt;So this defined a set of goals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Procure patches and updates in a proactive way&lt;/li&gt;
&lt;li&gt;Test them and then deploy to our users as soon as possible after their release&lt;/li&gt;
&lt;li&gt;Keep detailed information about the patch levels of key applications&lt;/li&gt;
&lt;li&gt;Automate, as much as possible, all these tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Our tools&lt;/h1&gt;
&lt;h2&gt;JAMF Patch Management&lt;/h2&gt;
&lt;p&gt;The Mac Management Platform in use in &lt;strong&gt;Zalando&lt;/strong&gt;, called &lt;a href="https://www.jamf.com"&gt;&lt;strong&gt;JAMF Pro&lt;/strong&gt;&lt;/a&gt;, provides Patch Management functionalities that are great at detecting the patch level of devices and deploying the appropriate versions; however, getting this functionality to work properly has the following requirements.&lt;/p&gt;
&lt;h3&gt;A source of patch definitions&lt;/h3&gt;
&lt;p&gt;The first thing the system needs is the so-called &lt;em&gt;definition of the title&lt;/em&gt;&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; including dates, versions, OS requirements, etc. in a JSON format. &lt;strong&gt;JAMF&lt;/strong&gt; (the company behind JAMF Pro) offers a web service with a basic set of titles, but of course, that doesn’t cover all our core applications. Fortunately, it’s also possible to configure additional sources of patch definitions, either local or from third parties.&lt;/p&gt;
&lt;h3&gt;Installation packages&lt;/h3&gt;
&lt;p&gt;Each vendor has different locations to provide their installers; additionally, for the management platform to be able to install applications (or its updates), they need to be uploaded to distribution points in a &lt;em&gt;PKG format&lt;/em&gt;, which is not always what the vendor provides.&lt;/p&gt;
&lt;h2&gt;AutoPkg&lt;/h2&gt;
&lt;p&gt;An open source tool developed by the community of Mac admins around the world, called &lt;a href="http://autopkg.github.io/autopkg/"&gt;&lt;strong&gt;AutoPkg&lt;/strong&gt;&lt;/a&gt;, provides a framework to automate many of the tasks surrounding patch management. The steps taken through the process are defined on plist-format files called &lt;em&gt;recipes&lt;/em&gt;, which AutoPkg follows.&lt;/p&gt;
&lt;h3&gt;Recipes&lt;/h3&gt;
&lt;p&gt;The community of AutoPkg users has generated recipes that cover a broad range of applications and that are updated regularly; nevertheless, for security reasons, AutoPkg requires manual inspection of downloaded recipes or the creation of local copies, before allowing an automated execution. AutoPkg recipes have a parent-child relationship which brings modularity and also the chance of having different results depending on the child recipe that was executed.&lt;/p&gt;
&lt;h3&gt;Processors&lt;/h3&gt;
&lt;p&gt;Each step of a recipe is executed by a &lt;strong&gt;Python&lt;/strong&gt; piece of code called &lt;em&gt;processor&lt;/em&gt;. AutoPkg includes dozens of these processors — each of them with a specific functionality — but also has the ability to run custom processors, coded by users, to provide functionality not covered by the standard ones.&lt;/p&gt;
&lt;h1&gt;Our solution&lt;/h1&gt;
&lt;p&gt;The combination of JAMF Patch Management and AutoPkg was the right one to accomplish our goals, but it did not meet our needs out of the box, so the work evolved into three different projects.&lt;/p&gt;
&lt;h2&gt;Cookbook&lt;/h2&gt;
&lt;p&gt;The name was obvious for the project aiming to standardize and manage our AutoPkg recipes.&lt;/p&gt;
&lt;p&gt;For improved modularity of the process, each application that we have introduced into the software lifecycle has its own set of recipes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download from the vendor&lt;/li&gt;
&lt;li&gt;Create a package&lt;/li&gt;
&lt;li&gt;Sign the package&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Upload to the distribution points&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to the recipes, we created three custom processors to:
&lt;img alt="Chat message about a new version of Postman available." src="https://engineering.zalando.com/posts/2021/01/images/chatbot.jpg#right"&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Announce in a Google Chat group the availability of a new version, packaged and uploaded to our system&lt;/li&gt;
&lt;li&gt;Generate the JSON patch definition and upload it to our own definition server, for titles not covered by JAMF&lt;/li&gt;
&lt;li&gt;Update information in our reporting tool, LineUp&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, for better organization of the workload, &lt;em&gt;Cookbook&lt;/em&gt; is a git repository. We work locally, push our changes to the repository and then after merging, we pull on a server called &lt;em&gt;Apple Packaging Station&lt;/em&gt; that runs AutoPkg on a regular schedule with help from a third party tool called &lt;a href="https://www.lindegroup.com/autopkgr"&gt;&lt;strong&gt;AutoPkgR&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;LineUp&lt;/h2&gt;
&lt;p&gt;When we first created a report about the deployment of the patch of &lt;strong&gt;Zoom&lt;/strong&gt;, we pulled the information from our platform directly into a &lt;strong&gt;Google Spreadsheet&lt;/strong&gt; and then used &lt;strong&gt;Google Data Studio&lt;/strong&gt; to generate a chart.&lt;/p&gt;
&lt;p&gt;This may seem okay for a one-shot requirement, but in reality this happens often throughout the year and becomes hard to maintain or scale. So then we opted for a custom database (hosted in Zalando’s shared &lt;a href="https://engineering.zalando.com/tags/postgresql.html"&gt;&lt;strong&gt;Postgres&lt;/strong&gt;&lt;/a&gt; cluster) queried with &lt;strong&gt;Grafana&lt;/strong&gt;, which offers great visualization capabilities.&lt;/p&gt;
&lt;p&gt;But then, with a proper database structure already holding the data, the next logical step was to add a custom visualization tool and provide it with its own API to update the information. This is when &lt;strong&gt;LineUp&lt;/strong&gt; was born.&lt;/p&gt;
&lt;p&gt;&lt;img alt="LineUp Example" src="https://engineering.zalando.com/posts/2021/01/images/lineup.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;At the beginning, we were just looking for a simple mechanism to show information from the database without requiring a client application or the user to run SQL queries, and even the simplest web development frameworks, once connected to a database, have power to do much more than this. We selected &lt;strong&gt;Django&lt;/strong&gt; as our framework and after developing these simple views, we decided to leverage its capabilities and come up with &lt;strong&gt;detailed views for each Mac application&lt;/strong&gt;, creating a module to use JAMF’s API to get up-to-date information about them.&lt;/p&gt;
&lt;p&gt;Then, while working on this, it was natural to expand the scope and include the inventory of applications running in the &lt;strong&gt;Windows&lt;/strong&gt; and &lt;strong&gt;Ubuntu&lt;/strong&gt; platforms and to do so, we developed a module to query Zalando’s &lt;strong&gt;asset management platform&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;PackageChanger&lt;/h2&gt;
&lt;p&gt;After each scheduled execution of our &lt;strong&gt;AutoPkg&lt;/strong&gt; recipes we end up with a set of packages uploaded to the distribution points, notifications about them in our Chat group, and the JAMF server aware of these new versions of applications. Now it’s time to test the updates and release them if they are working properly.&lt;/p&gt;
&lt;p&gt;This became a new tedious process which is done in JAMF’s web UI. Each update implies going to a set of screens to associate the new version with a package, assign that version to a group of testers and later, release the version to the rest of the users as well as setting this version as the baseline installer for new devices.&lt;/p&gt;
&lt;p&gt;To simplify these steps, we created &lt;strong&gt;PackageChanger&lt;/strong&gt;, a command line tool that, through JAMF’s API, lets us work with packages and versions in a faster and simpler way than using a web UI.&lt;/p&gt;
&lt;p&gt;&lt;img alt="PackageChanger Example" src="https://engineering.zalando.com/posts/2021/01/images/package-changer.jpg#left"&gt;&lt;/p&gt;
&lt;p&gt;To work with the API we selected &lt;a href="https://github.com/PixarAnimationStudios/ruby-jss"&gt;&lt;strong&gt;Ruby-JSS&lt;/strong&gt;&lt;/a&gt; — a Ruby library developed by the Mac admins at &lt;strong&gt;Pixar Animation Studios&lt;/strong&gt; — which to this day is the most comprehensive and well documented library to interact with it.&lt;/p&gt;
&lt;h1&gt;Our next steps&lt;/h1&gt;
&lt;p&gt;The work done so far has significantly improved the way we make updates available, especially for key applications, and has provided us with ways to have &lt;strong&gt;real-time information&lt;/strong&gt; during the first few hours after a software vulnerability is disclosed. Nevertheless, we are still missing some refinements to have a completely streamlined software lifecycle.&lt;/p&gt;
&lt;h2&gt;User interaction&lt;/h2&gt;
&lt;p&gt;Patch management from JAMF offers us two ways to deploy patches: &lt;strong&gt;automatic push&lt;/strong&gt; or through the &lt;strong&gt;Self Service application&lt;/strong&gt; notifying the user when updates are available. The latter would be optimal, but the notification mechanism &lt;a href="https://www.jamf.com/jamf-nation/discussions/30475/broken-notification-center-notifications"&gt;does not work&lt;/a&gt; and leaves us with our user base unaware of patches. On the other hand, pushing updates has proven to be a source of discomfort for users, especially because updated applications need to be closed and reopened and it’s really difficult to find a convenient moment to do this.&lt;/p&gt;
&lt;p&gt;As a response, we are working on &lt;strong&gt;an alternative notification mechanism&lt;/strong&gt;, so we can continue to offer updates through Self Service, but making users aware of them with enough frequency and convenience so that they install them in a comfortable and timely manner.&lt;/p&gt;
&lt;p&gt;&lt;img alt="UpdateBuddy Example" src="https://engineering.zalando.com/posts/2021/01/images/update-buddy.jpg#left"&gt;&lt;/p&gt;
&lt;h2&gt;Quality gate&lt;/h2&gt;
&lt;p&gt;Before generally releasing a patch we deploy it to a small subset of devices whose owners are considered &lt;strong&gt;testers&lt;/strong&gt;. This allows us to know if the installer works and if the application runs as expected after the update.&lt;/p&gt;
&lt;p&gt;These tests may be enough for simple applications — such as Google Chat — but fall short for specialized or complex ones — such as &lt;strong&gt;Tableau Desktop&lt;/strong&gt; — where only a trained user would be able to tell if the new version is ready to be deployed to the user base.&lt;/p&gt;
&lt;p&gt;The next improvement in this direction would be a &lt;strong&gt;quality gate&lt;/strong&gt;, in which additional tests for releases are described and a bigger set of testers can go through them, decide if they are passed successfully, and then approve collectively the deployment of a patch.&lt;/p&gt;
&lt;h2&gt;Increased selection of titles&lt;/h2&gt;
&lt;p&gt;The initial set of applications covered by patch management was selected because of the obvious level of use they get within Zalando: Google Chrome, Chat, Backup and Sync, etc.&lt;/p&gt;
&lt;p&gt;Afterwards, when &lt;strong&gt;LineUp&lt;/strong&gt; provided us with information about the number of installations of each application, we had a roadmap of sorts to know which applications should be covered next. For example, we discovered that over one third of the Mac fleet has &lt;strong&gt;Docker&lt;/strong&gt; installed on them, so we decided to start offering it in Self Service and provide patch management so that we can be sure our user base has easy access to this tool.&lt;/p&gt;
&lt;p&gt;Here, the next step is part of a continuous improvement cycle, in which we will keep adding applications to the automated lifecycle.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;Within patch management, the word &lt;em&gt;title&lt;/em&gt; is used to refer to pieces of software that can be inventoried and have versioning, and range from internal tools to applications from the App Store.&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;At the time of this writing &lt;strong&gt;macOS Catalina&lt;/strong&gt; and &lt;strong&gt;macOS Big Sur&lt;/strong&gt; allow the installation, through an MDM&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;, of unsigned packages. This may change with future releases of macOS and make it crucial to include an automated signing step, which we already have.&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;MDM stands for &lt;em&gt;Mobile Device Management&lt;/em&gt;, which consists of a platform and a set of tools for the administration of mobile devices such as smartphones, tablets and laptops.&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Apple"/><category term="Python"/><category term="Backend"/><category term="Mobile"/></entry><entry><title>Experimentation Platform at Zalando: Part 1 - Evolution</title><link href="https://engineering.zalando.com/posts/2021/01/experimentation-platform-part1.html" rel="alternate"/><published>2021-01-12T00:00:00+01:00</published><updated>2021-01-12T00:00:00+01:00</updated><author><name>Shan Huang</name></author><id>tag:engineering.zalando.com,2021-01-12:/posts/2021/01/experimentation-platform-part1.html</id><summary type="html">&lt;p&gt;Challenges and solutions of our experimentation platform at Zalando&lt;/p&gt;</summary><content type="html">&lt;p&gt;Online controlled experimentation, aka A/B test, has been a golden standard for evaluating improvements in software systems. By changing one factor at a time, A/B test causally measures, from real users, whether one product variant is better than the other.&lt;/p&gt;
&lt;p&gt;As an increasingly important area in tech companies, experimentation platforms face -- apart from their scientific challenges -- many unique engineering problems. In this blog series, we will share what we’ve learned at Zalando. During this journey, we have presented our work at well-known conferences including &lt;a href="https://pydata.org/berlin2018/schedule/presentation/69/"&gt;PyData 2018&lt;/a&gt;, &lt;a href="http://ide.mit.edu/sites/default/files/agendas/CODE%202018%20Agenda.pdf"&gt;Conference on Digital Experimentation 2018&lt;/a&gt;, and &lt;a href="https://causalscience.org/programme/day-1/"&gt;Causal Data Science Meeting 2020&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this first post, we’ll introduce the evolution of the experimentation platform at Zalando. The technical challenges and solutions for the experimentation engine, analysis system, data quality issues, and data visualization will follow in upcoming posts.&lt;/p&gt;
&lt;p&gt;The next sections are structured using the Experimentation Evolution Model in &lt;a href="https://exp-platform.com/Documents/2017-05%20ICSE2017_EvolutionOfExP.pdf"&gt;Fabijan et al., 2017&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Phase one: crawl (before 2016)&lt;/h2&gt;
&lt;p&gt;As natural as data-driven decisions sound today, they were not the focus in the early stages of Zalando. In those days, each team set up its A/B tests, and analyzed them, individually and manually.&lt;/p&gt;
&lt;p&gt;We soon discovered that such a setup could not ensure A/B test quality, nor could we know whether product teams actually ran A/B tests before making decisions. There was very little A/B testing knowledge in most product teams at the time, so we realized the need for a centralized experimentation service. To take full control of the data infrastructure as well as the analysis features, we needed an in-house experimentation platform at Zalando instead of off-the-shelf A/B testing tools.&lt;/p&gt;
&lt;p&gt;In 2015, the first version of Zalando's experimentation platform, &lt;em&gt;Octopus&lt;/em&gt;, was released. It is named after &lt;a href="https://en.wikipedia.org/wiki/Paul_the_Octopus"&gt;Paul the Octopus&lt;/a&gt;, who predicted the winning teams of matches at the 2010 FIFA World Cup with a small error rate. That’s the essence of an experimentation platform, except that our metrics are based on trustworthy statistics rather than Paul’s mood of the day.&lt;/p&gt;
&lt;p&gt;At that time, our biggest challenge was a &lt;strong&gt;lack of cross-functional knowledge&lt;/strong&gt;. The initial platform was built by a virtual team with members from various parts of Zalando. The platform had three parts: experiment management, experiment execution, and experiment analysis. In the early days, the team focused on execution because the service had few customers, and analyses could be performed manually in the worst case. This initial virtual team consisted of engineers and data scientists who had little knowledge of each other's domains at that time. For example, data scientists didn't have production software experience and didn't know Scala, whereas software engineers didn't know statistical concepts. To decouple the development processes of the two subgroups, we ended up building an open-source statistics library wrapped by the backend production system.&lt;/p&gt;
&lt;h2&gt;Phase two: walk (2016-2020)&lt;/h2&gt;
&lt;p&gt;Even though wrapping analysis scripts into a production software system is not a scalable solution, it worked for the load at that time. Through hard groundwork, we achieved a platform where teams can configure and manage their A/B tests in one place. Another major benefit of platformization is that the randomization process and analysis methods are now standardized. Octopus uses a two-sided t-test with a 5% significance level to analyze results.&lt;/p&gt;
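&lt;p&gt;To make the standardized analysis concrete, here is a minimal Python sketch of a two-sided test at a 5% significance level. It is not the Octopus implementation: the function name &lt;code&gt;welch_t_test&lt;/code&gt; is hypothetical, the unequal-variance (Welch) form is our choice for illustration, and the p-value uses a normal approximation that is only reasonable for large samples.&lt;/p&gt;

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(control, treatment, alpha=0.05):
    """Two-sided Welch t-test using a large-sample normal approximation.

    Returns (t_statistic, p_value, is_significant).
    """
    n_c, n_t = len(control), len(treatment)
    # Standard error of the difference in means (variances may differ).
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    t_stat = (mean(treatment) - mean(control)) / se
    # Two-sided p-value: probability mass in both tails beyond |t|.
    # Assumption: samples are large enough for t to be roughly normal.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, alpha > p_value
```

&lt;p&gt;With small samples, the exact t distribution with Welch's degrees of freedom should replace the normal approximation.&lt;/p&gt;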
&lt;p&gt;During these years, we have boosted the number of running A/B tests at Zalando.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Number of experiments" src="https://engineering.zalando.com/posts/2021/01/images/num_exp.png"&gt;&lt;/p&gt;
&lt;p&gt;There was a decrease in the number of A/B tests in early 2020. This decrease could have been due to teams focusing on large-scale coordinated product initiatives, which were not A/B testable during this period. Another possible cause is that we suggested pausing A/B tests due to abnormal user behaviour at the beginning of COVID-19 in Europe.&lt;/p&gt;
&lt;p&gt;On the other hand, we also faced a few big challenges. The keywords of improvements in this period are &lt;em&gt;scalability&lt;/em&gt; and &lt;em&gt;trustworthiness&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Establishing experimentation culture&lt;/strong&gt;. Many teams started to make product decisions through A/B testing. However, Zalando is a big company and the experimentation culture didn’t reach every corner. We started to look at use cases from various departments and integrated them into Octopus. We also provided in-person A/B testing training in the company at regular intervals. In addition, there is a company-wide initiative to ensure each team has embedded A/B test owners (product analysts or data scientists) who have sufficient knowledge of experimentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source data tracking&lt;/strong&gt;. The experimental data were collected from each product team through tracking events (we track only users who provided appropriate consent). A dedicated tracking team ingested these events, unified the data schema, and stored them in a big data database. However, data tracking concepts were not holistically understood across the company -- some teams defined their own version of the tracking event schema. This inconsistency resulted in corrupted and missing data. As consumers of this data, our A/B test analyses suffered from poor data quality. This situation started to improve after a period of extensive cross-team communication and reorganization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A/B test design quality&lt;/strong&gt;. Since we found that A/B tests from different teams had varying levels of quality, we introduced an A/B test design audit process as well as weekly consultation hours. Aspects of quality include a testable hypothesis, a clear problem statement, a clear outcome KPI, A/B test runtime, and finishing based on planned stopping criteria. We also wrote internal blogs regularly to share our tips for effective A/B testing in Octopus.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A/B test analysis method quality&lt;/strong&gt;. To make our services trustworthy, we revisited our analysis methods rigorously in peer reviews with applied scientists from other teams. We documented analysis steps transparently. Through scientific peer reviews, we have identified potential improvement areas such as non-inferiority tests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The right analysis tool&lt;/strong&gt;. A/B tests are not feasible for every use case, for example when comparing performance between two countries. In such cases, quasi-experimental methods are better suited. We provided guidelines and software packages to help analysts choose the right causal inference tool.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Randomization engine latency&lt;/strong&gt;. Some applications have strict requirements for latency. For example, slightly higher loading times of product detail pages may cause customers to churn. We reduced the latency of our services through a few engineering optimizations. Technical details will be discussed in later posts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controlled rollout&lt;/strong&gt;. In some cases, teams want to gradually increase the traffic into the tests, so that they don’t accidentally show a buggy variant to a lot of users. In other cases, several teams are working on a complex feature release and want to release the product at the same time. In general, such staged rollouts are called controlled rollouts. To support these use cases, Octopus created new features such as traffic ramp-up in experimentation and &lt;a href="https://martinfowler.com/articles/feature-toggles.html"&gt;feature toggles&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analysis system scalability&lt;/strong&gt;. The biggest challenge we had in this period was that our initial analysis system could no longer handle the load of concurrent A/B tests due to constraints in its architecture. As the maintenance cost of the analysis system became too high, we didn't have capacity to work on improving the analysis methods. We concluded that the need for a new analysis system was pressing. In the end, we spent two years building a new analysis system in Spark. Our lessons learned will be shared in a separate post.&lt;/li&gt;
&lt;/ul&gt;
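&lt;p&gt;The traffic ramp-up mentioned under controlled rollout can be illustrated with a short Python sketch. It shows one common technique, deterministic hash-based bucketing, and is not a description of how Octopus works internally. The function names, the salting scheme and the bucket count are all assumptions made for this example.&lt;/p&gt;

```python
import hashlib


def bucket(unit_id, salt, buckets):
    """Map a unit to a stable bucket in [0, buckets) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign(unit_id, experiment, traffic_share, variants=("control", "treatment")):
    """Return a variant name, or None if the unit is outside the ramped-up traffic."""
    buckets = 10000  # hypothetical resolution: traffic_share in steps of 0.01%
    # One hash decides whether the unit takes part in the experiment at all.
    exposure = bucket(unit_id, experiment + ":exposure", buckets)
    if exposure >= round(traffic_share * buckets):
        return None
    # A second, independently salted hash picks the variant, so raising
    # traffic_share never moves an already exposed unit to another variant.
    return variants[bucket(unit_id, experiment + ":variant", len(variants))]
```

&lt;p&gt;Because the assignment depends only on the unit identifier and the experiment name, it needs no stored state, and increasing &lt;code&gt;traffic_share&lt;/code&gt; from 0.1 to 0.5 only adds new units to the test.&lt;/p&gt;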
&lt;p&gt;&lt;img alt="Causal inference tool usage" src="https://engineering.zalando.com/posts/2021/01/images/ci_tool_usage.png"&gt;&lt;/p&gt;
&lt;h2&gt;Phase three: run (2020-)&lt;/h2&gt;
&lt;p&gt;At this point, experimentation culture is established in most parts of the company. With the scalable infrastructure ready, the team can now work on more advanced statistical methods.&lt;/p&gt;
&lt;p&gt;We are looking forward to bringing experimentation at Zalando to a new stage by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scaling out experimentation expertise&lt;/strong&gt;. We have designed a new company-wide training curriculum that offers a smoother learning experience. It covers causality, statistical inference, and analysis tools at Zalando. We have also increased the scope of causal inference research peer reviews to the whole company.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automating data quality indicators&lt;/strong&gt;. A/B testing results are highly sensitive to data quality. The most important data quality indicator is &lt;a href="https://exp-platform.com/Documents/2019_KDDFabijanGupchupFuptaOmhoverVermeerDmitriev.pdf"&gt;sample ratio mismatch&lt;/a&gt; -- the actual sample size split is significantly different from the expected sample size split. Companies similar to Zalando have found that 6-10% of their A/B tests have a sample ratio mismatch. A similar analysis of our historical data shows that at least 20% of A/B tests at Zalando are affected. Our platform automatically raises alerts to the affected team when a sample ratio mismatch is detected. Further data investigation is then needed before analysis results are shown to users in the platform's dashboard. Another major data quality issue is the data tracking consent imposed by GDPR. As we process data only for visitors who provided their consent, we have been researching the resulting selection bias in A/B tests and how to address it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Overall evaluation criteria&lt;/strong&gt;. In the last few years, we have learned from our users that selecting an outcome KPI for A/B tests is a big pain point. We now provide teams with qualitative guidelines: a) KPIs should be team-specific. KPIs should be sensitive to the product that each team controls, i.e. each team can drive their KPIs by changing product features; b) KPIs should be proxies for long-term customer lifetime value, instead of short-term revenue. We plan to incorporate these guidelines into Octopus with scientifically proven methods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster experimentation&lt;/strong&gt;. We found that the median runtime of an A/B test at Zalando is about three weeks. This is longer than at similar companies in the tech industry. Many users might claim their test has time constraints based on business requirements. We plan to support trustworthy analysis for faster experimentation with more advanced analysis methods, such as variance reduction, Bayesian analysis, and multi-armed bandits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stable unit assumption&lt;/strong&gt;. In practice, each unit in an A/B test may not represent a unique person. For example, we are currently not able to recognize the same person on the Zalando website and in the Zalando App and assign them the same variant. Solving this problem creates new engineering challenges due to the latency requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data visualization&lt;/strong&gt;. Smart data visualization provides answers to questions you didn’t know you had. With complex and hierarchical data from A/B tests, there is quite some potential for data visualization designs.&lt;/li&gt;
&lt;/ul&gt;
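&lt;p&gt;As an illustration of the sample ratio mismatch check described above, the following Python sketch applies a chi-square goodness-of-fit test to a two-variant split. It is a simplified example, not the platform's alerting code. The function names and the alert threshold of 0.001 are assumptions made for this example.&lt;/p&gt;

```python
import math


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-variant split."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_sample_ratio_mismatch(n_control, n_treatment,
                              expected_control_share=0.5, threshold=0.001):
    """Flag the test when the observed split is very unlikely under the expected one."""
    # threshold is a hypothetical, deliberately strict alerting level.
    return threshold > srm_p_value(n_control, n_treatment, expected_control_share)
```

&lt;p&gt;A strict threshold keeps false alarms rare when many A/B tests are checked automatically. A test with more than two variants would need the general chi-square distribution instead of the one-degree-of-freedom shortcut used here.&lt;/p&gt;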
&lt;p&gt;&lt;img alt="Number of sample ratio mismatch" src="https://engineering.zalando.com/posts/2021/01/images/share_srm.png"&gt;&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;To sum up, the experimentation platform at Zalando has evolved a lot since 2015. Nevertheless, we are and will always be focusing on bringing more &lt;em&gt;scalable&lt;/em&gt; and more &lt;em&gt;trustworthy&lt;/em&gt; experimentation to Zalando. We thank all team members, contributors and leadership who made it happen during this incredible journey.&lt;/p&gt;
&lt;h2&gt;Future posts&lt;/h2&gt;
&lt;p&gt;In the upcoming posts, we will provide more details about the technical challenges and solutions of the experimentation engine, analysis system, data quality issues, and data visualization. Stay tuned!&lt;/p&gt;</content><category term="Zalando"/><category term="Experimentation"/><category term="Platform Engineering"/><category term="Backend"/><category term="Data"/></entry><entry><title>How Zalando prepares for Cyber Week</title><link href="https://engineering.zalando.com/posts/2020/10/how-zalando-prepares-for-cyber-week.html" rel="alternate"/><published>2020-10-08T00:00:00+02:00</published><updated>2020-10-08T00:00:00+02:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2020-10-08:/posts/2020/10/how-zalando-prepares-for-cyber-week.html</id><summary type="html">&lt;p&gt;Learn how we prepare our platform for Cyber Week - the highest traffic period in the year.&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Cyber Week has become an increasingly important time of the year in e-commerce. &lt;a href="https://corporate.zalando.com/en/newsroom/en/news-stories/zalando-achieves-record-breaking-cyber-week-results"&gt;In 2019&lt;/a&gt;, we attracted 840,000 new customers and our sales (Gross Merchandise Volume) increased by 32% compared to the previous year. During the event, we grew faster as a business than during the rest of the year, when we grow at a rate of 20-25%. Our peak orders per minute reached 7,200 compared to 4,200 the year before (+71% YoY).&lt;/p&gt;
&lt;p&gt;From an engineering point of view, Cyber Week is a very exciting time, during which all systems are exposed to load that is far beyond any peak seen throughout the year. The experience of supporting the event itself has been extremely rewarding for everyone involved due to close collaboration between teams and strong focus on operational excellence and reliability. During the preparation time for the Cyber Weeks we created new capabilities in our teams and platform that serve us throughout the whole year. Looking back at the past years, we would like to share our experience and how our capabilities evolved over time around key themes of: &lt;em&gt;Site Reliability Engineering&lt;/em&gt;, &lt;em&gt;Load Testing in Production&lt;/em&gt;, and the &lt;em&gt;Preparation&lt;/em&gt; approach itself.&lt;/p&gt;
&lt;h2&gt;Site Reliability Engineering&lt;/h2&gt;
&lt;h3&gt;Phase 1: Building up knowledge about reliability engineering&lt;/h3&gt;
&lt;p&gt;Six years ago, when our e-commerce platform was still within on-premise data centers, we had a handful of on-call teams. Two of these teams were responsible for the backend and frontend systems of our e-commerce platform and were primarily responsible for Cyber Week preparations and support during the event. When we started moving more and more critical systems into the AWS cloud as part of our &lt;a href="https://engineering.zalando.com/posts/2018/12/front-end-micro-services.html"&gt;micro-frontend architecture&lt;/a&gt;, we adopted the "you build it - you run it" mindset and the number of on-call teams has increased dramatically to around 100 teams today. This also meant that we needed to educate many teams about designing for reliability. To achieve that, we formed a team of 10 colleagues, who were passionate about SRE and who signed up to perform &lt;a href="https://landing.google.com/sre/sre-book/chapters/evolving-sre-engagement-model/#:~:text=The%20most%20typical%20initial%20step,a%20service%20operating%20in%20production."&gt;production readiness reviews&lt;/a&gt; of our applications ahead of Cyber Week. In preparation for that, we ran a series of workshops with teams to share knowledge about reliability patterns and identified clusters of applications that required adjustments, so that the platform is stable in case of various failure types (e.g. failures of dependencies, overload, timeouts).&lt;/p&gt;
&lt;h3&gt;Phase 2: Distributed tracing&lt;/h3&gt;
&lt;p&gt;We use distributed tracing following the OpenTracing standard across our platform. This allows us to inspect the performance of our distributed system and quickly find contributing factors for increased latency or error rates across our applications. After instrumenting a set of applications and proving the intended wins resulting from it, we leveraged Cyber Week preparations to scale this effort. In year one, we focused on critical, tier-1 systems involved in the hot path of the browse journey in &lt;a href="https://en.zalando.de"&gt;our shop&lt;/a&gt;. The following year, we expanded the coverage to tier-2 systems for applications in the scope of Cyber Week. During the instrumentation, we adopted additional conventions that help us identify the traffic sources: App, Web, push notifications, load tests. This allows us to better understand traffic patterns and perform capacity planning based on the request ratios between incoming traffic and the respective parts of our platform.&lt;/p&gt;
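&lt;p&gt;To illustrate capacity planning from request ratios, here is a minimal sketch. The service names, the ratios, and the target traffic are made-up values for illustration, not figures from our platform; a ratio is read here as "requests reaching a service per incoming request at the edge".&lt;/p&gt;

```python
def projected_load(incoming_rps: float, ratios: dict[str, float]) -> dict[str, float]:
    """Scale the expected incoming traffic by each service's observed ratio."""
    return {service: incoming_rps * ratio for service, ratio in ratios.items()}


# Hypothetical ratios, as they might be derived from tracing data.
observed_ratios = {"product-api": 4.0, "cart-api": 0.5}

# Expected requests per second per service for 10,000 incoming requests per second.
print(projected_load(10_000, observed_ratios))
# {'product-api': 40000.0, 'cart-api': 5000.0}
```

&lt;p&gt;In practice the ratios would differ per traffic source (App, Web, push notifications), which is why tagging the source during instrumentation matters.&lt;/p&gt;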
&lt;h3&gt;Phase 3: Dedicated team for SRE enablement&lt;/h3&gt;
&lt;p&gt;What started as a grass-roots movement around SRE practices in Phase 1 has evolved into an SRE department within Zalando, which is focused on reliability engineering, observability, and providing necessary infrastructure around monitoring, logging and distributed tracing. The SRE team also organizes trainings and knowledge exchange within the SRE guild where teams share lessons learned and pitfalls about operating systems in production and collaborate on formulating best practices.&lt;/p&gt;
&lt;p&gt;Distributed tracing has been a game-changer for us. We have leveraged tracing data to reduce alert fatigue of our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner. See our talk from the SRECon &lt;a href="https://www.usenix.org/conference/srecon19emea/presentation/mineiro"&gt;Are We All on the Same Page? Let's Fix That&lt;/a&gt; which explains our approach in detail.&lt;/p&gt;
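&lt;p&gt;The idea behind adaptive paging can be sketched in a few lines. This is a simplified illustration and not our actual alert handler: the &lt;code&gt;Span&lt;/code&gt; structure, its field names, and the single "follow the first erroring child" heuristic are assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation in a trace; all field names are illustrative."""
    operation: str
    owner_team: str
    error: bool = False
    children: list["Span"] = field(default_factory=list)


def team_to_page(alerting_span: Span) -> str:
    """Follow erroring child spans downwards and return the owner of the
    deepest erroring span, falling back to the alert owner."""
    current = alerting_span
    while True:
        failing = [child for child in current.children if child.error]
        if not failing:
            return current.owner_team
        # Assumption: with several failing children, follow the first one.
        current = failing[0]


trace = Span("checkout", "checkout-team", error=True, children=[
    Span("get-cart", "cart-team"),
    Span("place-order", "order-team", error=True, children=[
        Span("reserve-stock", "inventory-team", error=True),
    ]),
])
print(team_to_page(trace))  # inventory-team
```

&lt;p&gt;Here the alert fires on the checkout operation, but the page goes to the team owning the deepest failing operation. When no child span is failing, the alert owner is paged as before.&lt;/p&gt;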
&lt;h2&gt;Load testing in Production&lt;/h2&gt;
&lt;h3&gt;Phase 1: Feeling lucky&lt;/h3&gt;
&lt;p&gt;Over the years of operating our shop in the Data Center, we learned how to scale our shop's frontend. We kept adding servers and scaling our Solr fleet responsible for Product Data and Search until this became impractical due to the multi-month lead time needed to get new, physical servers. The Solr fleet was the one benefiting most from auto-scaling in the cloud and thus the first system that we moved to the cloud six years ago. Our backend services (e.g. product information management, inventory management, order management, customer accounts and data), however, formed an over-provisioned system with a fixed number of instances in the Data Center. At its heart were PostgreSQL instances heavily optimized by our Database infrastructure team that we scaled through sharding and switching from spinning disks to SSDs.&lt;/p&gt;
&lt;p&gt;This was sufficient for Cyber Week in 2015 where commercial campaigns were just about the right size for our capacity. With no past knowledge about what type of traffic to expect we were amazed how much more headroom our backend systems really had. Never before had we seen load throughout the day that surpassed every past evening peak we saw. There were of course some challenges with scaling, but we could overcome these with small tuning of the system configuration during the event. This was achieved mostly through pausing some asynchronous processing that was not essential for accepting and processing orders.&lt;/p&gt;
&lt;h3&gt;Phase 2: Load Tests in Production&lt;/h3&gt;
&lt;p&gt;In a cloud-based system that relies heavily on auto-scaling for cost-optimization, proper testing and capacity planning is a must. To achieve that, we set the target to better understand our scalability limits. We tried many approaches and, in our experience, the only one that proved effective for a large-scale system like ours is live load testing in production. Testing in production is an established practice, but difficult to execute well. Mistakes become really costly as the customer experience is degraded, and thus this approach requires the ability to quickly notice customer impact and react by aborting the test or mitigating the incident otherwise.&lt;/p&gt;
&lt;p&gt;To achieve our goal, we wrote simulators that place sales orders for test products that can be clearly differentiated from real customer orders, processed to a certain degree, and then skipped at the stage of fulfillment. This gives us an understanding of the limitations of our order processing system and all its dependencies, including inventory management and payment processing. Further, as shared before in &lt;a href="https://engineering.zalando.com/posts/2019/04/end-to-end-load-testing-zalandos-production-website.html"&gt;end-to-end load testing Zalando’s production website&lt;/a&gt;, we wrote a simulator that traverses the user journey across key customer touch-points in our shop. We run this simulation in production for all countries and mimic the traffic patterns we observe for sales events. Through that we uncover scalability bottlenecks and verify if certain resilience patterns work properly. Running the simulation is a fun and thrilling exercise, especially if the whole team starts suddenly hearing pagers fire as we continue to increase the test traffic.&lt;/p&gt;
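&lt;p&gt;As a rough sketch of the test order handling described above: a load test order carries a marker, passes through the same stages as a real order, and stops before fulfillment. The stage names and the &lt;code&gt;is_load_test&lt;/code&gt; flag are hypothetical and only serve the illustration.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical processing stages, in order.
STAGES = ["validate", "reserve_stock", "authorize_payment", "fulfill"]


@dataclass
class Order:
    order_id: str
    is_load_test: bool = False
    completed_stages: list[str] = field(default_factory=list)


def process(order: Order) -> Order:
    """Run an order through all stages; load test orders stop before fulfillment."""
    for stage in STAGES:
        if stage == "fulfill" and order.is_load_test:
            break
        order.completed_stages.append(stage)
    return order


print(process(Order("real-1")).completed_stages)
print(process(Order("test-1", is_load_test=True)).completed_stages)
```

&lt;p&gt;Because test orders share every stage up to fulfillment with real orders, they put load on the same dependencies, such as inventory management and payment processing.&lt;/p&gt;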
&lt;h3&gt;Phase 3: Load Tests inform capacity planning&lt;/h3&gt;
&lt;p&gt;Having written and evolved the user journey simulator for two years, we were not fully satisfied with its ability to generate load at scale. There were too many rough edges, and tuning the simulator to generate the required load profiles consumed a lot of our development time. We decided that it's better to leverage an existing product that will do the job better. This paid off heavily, as last year we were able to run the tests on both App and Web platforms simultaneously.&lt;/p&gt;
&lt;p&gt;The different types of load tests that we ran in production last year helped inform capacity planning based on commercial goals and the projected sales. The final, clean run of tests also gave us sufficient confidence that the platform was scaled to sustain a certain amount of incoming traffic and sales in the peak minute and thus contributed to a smooth event for our teams.&lt;/p&gt;
&lt;h2&gt;Preparation as a project&lt;/h2&gt;
&lt;p&gt;The Cyber Week project is always at the top of our project lists and we dedicate highest attention to the preparation work. Over the past years, we have progressively increased collaboration between the engineering and commercial teams and have dedicated Program Managers responsible for the delivery of the project. With every year we tune the structure and reporting within this project.&lt;/p&gt;
&lt;p&gt;Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before - be it resilience engineering know-how, load testing in production, capacity planning, production readiness reviews, or collaboration across the company. On top of that, we also run dedicated projects aimed at increasing scalability of our platform and deliver changes to the customer experience for sales events.&lt;/p&gt;
&lt;h2&gt;During the event&lt;/h2&gt;
&lt;p&gt;After months of preparation, the event itself is a cherry on top - it's the time where we see how the time invested has paid off. If we are well prepared, we expect a rather uneventful time in terms of the number of production incidents. For the key period where we expect the highest load on our systems, we organize a Situation Room to ensure rapid incident response. In the room, we gather representatives from key engineering teams, SRE team, and dedicated Incident Commanders to closely watch the operational performance of our platform. It's basically a control center with dozens of screens and graphs, that looked like this in 2019:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Zalando's Cyber Week Situation Room" src="https://engineering.zalando.com/posts/2020/10/images/cw-situation-room.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;We've explored three key themes in Zalando's Cyber Week preparation journey. We are constantly tuning our approach based on insights from each year and adapting the areas we invest in to the business growth and commercial campaign requirements. This year has an added twist of remote working, which likely will require us to rethink how to organize the Situation Room efficiently. With seven weeks until Cyber Week, our preparations for this year's event are well underway and we are looking forward to sharing results and lessons learned in follow-up posts. With our growing application landscape, there are sufficient challenges ahead as we have more than 1122 applications (out of 4000+) in scope of the Cyber Week preparations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Applications in scope for Cyber Week" src="https://engineering.zalando.com/posts/2020/10/images/applications-in-scope.png"&gt;&lt;/p&gt;</content><category term="Zalando"/><category term="Cyber Week"/><category term="SRE"/><category term="Testing"/><category term="Backend"/></entry><entry><title>Meet Boris Malensek, Our Head Of Engineering In Merchant Operations</title><link href="https://engineering.zalando.com/posts/2020/09/meet-boris-malensek-head-of-engineering-merchant-operations.html" rel="alternate"/><published>2020-09-08T00:00:00+02:00</published><updated>2020-09-08T00:00:00+02:00</updated><author><name>Kerstin Schartner</name></author><id>tag:engineering.zalando.com,2020-09-08:/posts/2020/09/meet-boris-malensek-head-of-engineering-merchant-operations.html</id><summary type="html">&lt;p&gt;We have talked with Boris about his career journey within Zalando, the evolution of Merchant Operations, and the engineering culture within the company.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Boris Malensek" src="https://engineering.zalando.com/posts/2020/09/images/boris-malensek.jpg#right"&gt;&lt;/p&gt;
&lt;p&gt;We spoke about his professional journey within Zalando, the evolution of Merchant Operations, and the engineering culture within the company.&lt;/p&gt;
&lt;p&gt;The interview was initially conducted for Zalando’s External Talent Community.&lt;/p&gt;
&lt;h3&gt;Boris, let’s go back to the start. What attracted you to Zalando in the first place?&lt;/h3&gt;
&lt;p&gt;The main reason for my attraction to Zalando was how quickly the company was able to adapt to change. I liked that they were constantly trying out new things, even if at that given moment they didn’t seem like the best solutions. At Zalando, there have always been believers in the change, and for me that is important. I think of the process as a journey, and who you share this journey with has always been important to me.&lt;/p&gt;
&lt;h3&gt;Do you think that’s the main incentive for people to join Zalando – the constant change?&lt;/h3&gt;
&lt;p&gt;I don’t think there is just one formula, one reason, why people choose to join the company. But what candidates should understand is that Zalando will always change. We will probably become a more stable organisation over time, but there will always be changes. We will continue to try out new things, and people should not be afraid of that. Some things turn out to be a great success, others don't, but we will always try to innovate and be better than before.&lt;/p&gt;
&lt;h3&gt;What is special and particular about Software Engineering at Zalando?&lt;/h3&gt;
&lt;p&gt;The engineering culture. Since the day I joined it remains the most impressive engineering culture I’ve experienced. What I refer to by the engineering culture is the support you receive on various levels: from a single line of code up to global challenges. There is always someone ready to help you, someone to learn from, and that’s really powerful. Our feedback culture is getting stronger with people having healthy attitudes towards sharing feedback. In general, we strive to build a community based on trust.
Zalando has invested a lot in technology and our solutions and tooling are state-of-the-art. The way we enable our engineering teams to deploy their software – fast, autonomously, at scale and still compliant – is impressive. That sets us apart from many other companies.
Our approach to solving problems is unique. We always try to put the customer first, we try to understand why we do what we do, what the purpose is, and this is important. We always aim to explain our strategy in the clearest way possible.&lt;/p&gt;
&lt;h3&gt;As the Head of Engineering in Merchant Operations, what do you do and what are your responsibilities?&lt;/h3&gt;
&lt;p&gt;Firstly, on a daily basis I enable the team to tackle complex challenges by providing guidance when they are unsure of how to come to an optimal solution. However, my main goal is to make myself “obsolete”: I aim to develop the team in such a way that they feel empowered to solve problems independently.
An important part of my role as a leader is to hire the best talent for our business unit and the broader organisation. I am also responsible for planning and outlining strategies for upcoming technological, architectural or organisational changes that support the longer term Zalando Group Strategy. I work on building a network within and outside Zalando, so that I can turn to like-minded engineers and leaders for help with problems. Finally, I am accountable for the software that we deliver: it needs to be scalable and resilient, and when we fail, we need to fail fast, learn from it, and move forward to continuously improve on what we have done before.&lt;/p&gt;
&lt;h3&gt;Boris, you have just had your 5-year anniversary at Zalando and have gone through several stages of career growth from a Senior Software Engineer to an Engineering Manager, to a Head of Engineering. When the time came to pursue the next steps in your development, what motivated you to choose a management path? What does being an engineering leader entail?&lt;/h3&gt;
&lt;p&gt;Most of us want to grow by simply stepping out of our comfort zone. That’s definitely something that still drives me today, and at Zalando I have opportunities to do that. I came to Zalando as an experienced Senior Software Engineer, and leading people and projects was not new to me. When I joined Zalando, there was a reorganisation within the company and with perseverance and self-driven efforts, I enthusiastically grabbed the opportunity to become an Engineering Manager.
Being a leader has taught me the importance of creating opportunities for career growth within an organisation. I aim to provide opportunities for growth both within my team and beyond - I believe that it's important to support employees' growth first and foremost, no matter where it may take them.&lt;/p&gt;
&lt;h3&gt;Merchant Operations is often referred to as a great success story within Zalando, could you tell us about how this business unit evolved?&lt;/h3&gt;
&lt;p&gt;Merchant Operations has a rich history. I have been involved with the department from the very start, but when I joined it five years ago it was called Brand Solutions. Brand Solutions was building a prototype for a marketplace. It had a small tech team, and I was the third software engineer to be hired for the team. We had a great commercial team working alongside us, developing the idea of the marketplace and managing important partner relationships. Over time, we grew into a fully-fledged organisation. Three years ago, David Roberts joined us as the VP of Merchant Operations, and around the same time our objective became clear: build a B2B marketplace model, to bring Zalando closer to being the Starting Point for Fashion by increasing our assortment to include external partners. Currently, we have around 80 people in the engineering organisation, compared to just 10 in the early days. We have engineers in &lt;a href="https://jobs.zalando.com/en/tech/jobs/?filters%5Boffices%5D%5B0%5D=Berlin&amp;amp;filters%5Bcategories%5D%5B0%5D=Technology&amp;amp;filters%5Bcategories%5D%5B1%5D=Product%20Design"&gt;Berlin&lt;/a&gt; and &lt;a href="https://jobs.zalando.com/en/tech/jobs/?filters[offices][0]=Dublin%20%28Ireland%29"&gt;Dublin&lt;/a&gt;. Our Dublin team has been a great success story, having ramped up really quickly after the beginning of our expansion in October 2019 to a team of 15 today.
What makes Merchant Operations unique is that it started as a pure operations team. However, if you want to reach the scale required to become a giant in the fashion e-commerce industry, you need to focus on innovating through technology - and that is how we began to transform. Our biggest initiative currently is Zalando Direct (zDirect) which steers the business of external partners to Zalando's platform and extensive customer base, which increases our offering and convenience proposition exponentially.&lt;/p&gt;
&lt;h3&gt;Lastly, could you give a piece of advice for a Senior Software Engineer who would like to join Zalando?&lt;/h3&gt;
&lt;p&gt;Patience is very important. I think it is always important to give yourself some time to learn, grow and focus on what you believe to be your ultimate goal. If you are a Senior Software Engineer and still in doubt about the direction you would like to take with your development, you have to think about this first and foremost. Your goal may be ambitious. But it’s really important that you think of constructive steps you can take to move towards it. Be disciplined. Stay determined, don't be afraid to ask for what you want, and remember to remain open to a path of continuous learning. It's only when you step outside of your comfort zone, that you realise what you are capable of.&lt;/p&gt;</content><category term="Zalando"/><category term="Inside Zalando"/><category term="Culture"/></entry><entry><title>Inbox Zero is not a Lifestyle</title><link href="https://engineering.zalando.com/posts/2020/07/leading-self.html" rel="alternate"/><published>2020-07-17T00:00:00+02:00</published><updated>2020-07-17T00:00:00+02:00</updated><author><name>Tim Kroeger</name></author><id>tag:engineering.zalando.com,2020-07-17:/posts/2020/07/leading-self.html</id><summary type="html">&lt;p&gt;Personal productivity is subject of frequent debate and optimization. Learn how to stay organized as a leader and feel accomplished every day.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Photo of a laptop on a desk showing the author on a video call on the screen and a Google calendar screenshot partially obscuring the author on the screen" src="https://engineering.zalando.com/posts/2020/07/images/tim-laptop-calendar.jpg#previewimage"&gt;&lt;/p&gt;
&lt;p&gt;The following guidelines and tricks help me with task management, time management, planning &amp;amp; prioritization, reacting to ad-hoc situations, and the sense of not having accomplished anything during the day. There is some overlap with our Remote Work Guidelines&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt;. My meta-advice for applying anything from this article: start with one improvement, don’t try it all at once. Start with tools you have at hand. It’s an ongoing improvement process, and it’s ok to fail and start over. I've been iterating over this on and off for roughly three years now.&lt;/p&gt;
&lt;p&gt;Having worked as a software developer in my early career, I've been a manager for roughly 10 years now, with one year back in an individual contributor role in between. An aspect to consider when reading about my experience and the suggestions provided is that a &lt;em&gt;manager's schedule&lt;/em&gt; is somewhat different from a &lt;em&gt;maker's schedule&lt;/em&gt;. Depending on your organization's challenges, a manager still needs to be able to create, e.g. to provide structure and strategy. This needs an environment comparable to that of a maker. On the other hand, makers will benefit from applying some of the solutions outlined in this article when they need to adapt to a challenging environment themselves. "Different types of work need different types of schedules"&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;, and while this article is primarily aimed at managers, I believe that makers can take away some learnings, too, especially when they are planning to transition from an individual contributor role to a manager's career path.&lt;/p&gt;
&lt;p&gt;To limit the scope of this article and the suggested solutions, a nice concept to introduce is the concept of &lt;em&gt;constants&lt;/em&gt;. I'm going to refer to constants as constraints that are considered to be true, and can’t be ignored, at least not for too long: I have eight hours per day and 40 hours per week for work. I need to eat and take a break. I will need to process email and other requests. I need time to plan, and some plans I made will need to be changed.&lt;/p&gt;
&lt;p&gt;In order to address all this, I need transparency on what kind of time and energy I have available, and what work needs to be done by when. I will need to understand how flexibly I can change what I have planned to adapt to a new situation. For all this, I use Google Calendar and a task management tool.&lt;/p&gt;
&lt;h2&gt;Configure work time&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://support.google.com/calendar/answer/7638168?hl=en"&gt;Setting up your working hours in Google Calendar&lt;/a&gt; is a good reminder for you and your colleagues of when you are available and when you should not be working. If you need to work outside of your working hours, make that a conscious decision. When your colleagues see they're inviting you to an event outside of your work hours, they will reconsider, or at least reach out to you first. That way you assert a certain control over your calendar and the invites you are getting.&lt;/p&gt;
&lt;h2&gt;Make a decision for every event&lt;/h2&gt;
&lt;p&gt;Events without a decision clutter your calendar and make the organizers’ lives harder. Make a decision on the same day, or the next day at the latest, for every incoming event, and move on. If you decline an event, state a clear reason in the comment.&lt;/p&gt;
&lt;h2&gt;Hide declined events&lt;/h2&gt;
&lt;p&gt;You’ve already made a decision on those events, and you don’t need declined events to clutter your calendar. If you ever need to revisit that decision, you can enable showing declined events for that purpose in your calendar's settings, and disable it again afterwards.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Google Calendar's view options configuration with 'show declined events' deselected" src="https://engineering.zalando.com/posts/2020/07/images/calendar-view-options.png" title="Google Calendar view options"&gt;&lt;/p&gt;
&lt;h2&gt;Defragment your calendar&lt;/h2&gt;
&lt;p&gt;If you have many short appointments like 1:1's, group them together. If short appointments come in, try to fill gaps or place them next to other meetings. That way you optimize for continuous free space which helps with blocking time for focused work that takes more than just 30 minutes. You can also use Google Calendar's &lt;em&gt;reschedule event&lt;/em&gt; functionality to ask the organizer to reschedule, if you prefer a different time, and the other participants are available.&lt;/p&gt;
&lt;h2&gt;Block recurring events&lt;/h2&gt;
&lt;p&gt;Take back control over how and when you are working on what. Some things need to be done every day (processing email, responding to calendar invites and chats, having lunch, or planning and prioritizing work) and you need to make room for that. You can always cut back if you’re running out of overhead tasks. My work time as you can see in the following screenshot is from 10:00 to 19:00. I usually do not exceed my 40 hours work week with this setup.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a typical week in my Google Calendar. There are daily recurring blocks for standups, lunch, processing things, and ending the day. Other weekly or monthly recurring events are grouped mostly on Wednesday afternoon and are a different color." src="https://engineering.zalando.com/posts/2020/07/images/calendar-typical-week.png" title="A typical work week"&gt;&lt;/p&gt;
&lt;p&gt;For all tasks that need doing, I follow a Getting Things Done (GTD) approach&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt;. I process my inbox after lunch because I like to get started with work I planned instead of new input from my inbox. When processing, I make prioritization decisions mostly on importance and urgency&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt;. Processing means that I try to organize all tasks into my task management system, which makes it easier for me to discover these tasks at the right time in the right context. A task management system can be anything from a formatted text file or a Google document to a more sophisticated, dedicated task management app. Setting this up is a topic on its own. I suggest starting with whatever you have at hand. I try to follow a strict agenda for task processing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Review perspectives&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;ul&gt;
&lt;li&gt;What is happening today and the next few days?&lt;/li&gt;
&lt;li&gt;What input am I waiting for that will be provided by someone else?&lt;/li&gt;
&lt;li&gt;What is stalled (i.e. it’s not clear what the next step would be)?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Process email inbox&lt;/li&gt;
&lt;li&gt;Process assigned &lt;a href="https://drive.google.com/drive/search?q=followup:actionitems"&gt;Google Followup Action Items&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Process our internal communication platform&lt;/li&gt;
&lt;li&gt;Process Google Chat (pull mode)&lt;/li&gt;
&lt;li&gt;Process other inboxes (e.g. task management tool inbox). Categorize and compartmentalize tasks &amp;amp; projects.&lt;/li&gt;
&lt;li&gt;Plan and schedule events in the calendar for important or full focus tasks&lt;/li&gt;
&lt;li&gt;Flag tasks I plan to complete today&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tasks that are &lt;em&gt;flagged&lt;/em&gt; are the focus for today and are highlighted in my task management system (e.g. listed on top of the text file). That way I can always go to one spot after some inevitable context switching to get back on track fast. In the evening I try to clear out my inbox, and process and schedule all tasks that came in after lunch for the next few days, so I can start the next morning without having to look into my email inbox. That way I might reach &lt;em&gt;Inbox Zero&lt;/em&gt; from time to time, which feels extremely good. A much more important aspect than trying to achieve Inbox Zero all the time, is measuring how much you have on your plate and if your inbox is constantly filling up, or if you're able to keep a healthy balance. &lt;strong&gt;Inbox Zero is a signal, not a lifestyle.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Categorize calendar entries&lt;/h2&gt;
&lt;p&gt;When you categorize your calendar entries, you can see immediately what can be easily rescheduled or canceled in case of emergencies and urgent and important ad-hoc requests. You see how much time you have available, and you can reflect much better on what you did at the end of the day or week. It’s good to feel accomplished about your “focus week”, or “hiring week”, the “catch-up week” or an “off-the-charts week” if you made those choices deliberately. I use the following colors to categorize events.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Red: Lunch (to remind myself of the importance)&lt;/li&gt;
&lt;li&gt;Bright blue: Inbox processing / quick topics / Getting Things Done (GTD)&lt;/li&gt;
&lt;li&gt;Light purple: 1:1's / Jour Fixes with directs and skip-level directs&lt;/li&gt;
&lt;li&gt;Dark grey: Recurring department or team meetings&lt;/li&gt;
&lt;li&gt;Yellow: Everything hiring related like interviews, preparation and briefings&lt;/li&gt;
&lt;li&gt;Orange: Focus time&lt;/li&gt;
&lt;li&gt;Dark blue: Mentoring, Career Development, Performance Management&lt;/li&gt;
&lt;li&gt;Light orange: Trainings&lt;/li&gt;
&lt;li&gt;Green: Everything else (default for incoming events, because green is hope)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also use emojis to make your calendar look nicer. I’m a visual person and I used this trick to cheat myself into caring more about my calendar and getting into the habit of maintaining good calendar hygiene. If emojis don’t work for you, maybe you’ll find something else. My colleague Lacey Nagel uses an elaborate emoji mapping for events she owns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;🌊 blockers for time to focus on specific tasks&lt;/li&gt;
&lt;li&gt;📌 user research/interviews&lt;/li&gt;
&lt;li&gt;🥙 planned breaks / lunch by myself&lt;/li&gt;
&lt;li&gt;🍱 lunch with other Zalando's&lt;/li&gt;
&lt;li&gt;🙌 1:1's&lt;/li&gt;
&lt;li&gt;🐩 backlog refinement&lt;/li&gt;
&lt;li&gt;🗺 planning&lt;/li&gt;
&lt;li&gt;🔬 retro&lt;/li&gt;
&lt;li&gt;🎂 reminders for colleagues’ birthdays&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use some of those and use the following additional emojis for my calendar:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;📥 processing my inbox / mail&lt;/li&gt;
&lt;li&gt;🧹 finishing up for the day&lt;/li&gt;
&lt;li&gt;🎓 career development&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;A Hiring Week&lt;/h2&gt;
&lt;p&gt;Looking at my calendar, I know at one glance I don’t have to try and reschedule something yellow, but I can delay focus time, or make a conscious decision to cut back on inbox processing, or move a 1:1. Even if you didn’t work on what you planned to (e.g. product review), because you had to jump in and interview a candidate, you can feel good about it looking at the yellow accomplishments at the end of your week.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a week in my Google Calendar that was focused more on hiring. Two days consist almost completely of yellow events." src="https://engineering.zalando.com/posts/2020/07/images/calendar-hiring-week.png" title="A 'hiring' week"&gt;&lt;/p&gt;
&lt;h2&gt;Plan and schedule your focus work&lt;/h2&gt;
&lt;p&gt;If you don’t block those time slots in your calendar, someone else will do it. Understand your energy levels&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt;. You might just want to get a few small things done and out of the way, to get the energy to work on the product strategy next. Maybe you don’t have a lot of energy left, so you can read a document that was shared, or watch an all-hands that was recorded earlier. Different kinds of tasks need different levels of energy. I adopted the energy levels “Short Dashes”, “Full Focus”, “Hanging around”, and “Depleted”. These can be contexts, tags, categories, or different To-Do lists in your task management system, to allow easy access to these tasks.&lt;/p&gt;
&lt;h2&gt;A Focus week&lt;/h2&gt;
&lt;p&gt;In the example below I had to get the Performance &amp;amp; Development statements for my directs ready before the due date, so I put blockers in the calendar and focused on it. I also finalized a quarterly product review. Another thing you can see is I felt in the mood to go through a few emails and process my inbox earlier on Tuesday so instead of cutting my lunch short, I switched the inbox processing event and the lunch event around.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a week in my Google Calendar where I could focus on preparing documents for product review, performance assessment, and career development sessions. Roughly 40% of the 40 work hours are orange 'focus' blockers across the week." src="https://engineering.zalando.com/posts/2020/07/images/calendar-focus-week.png" title="A 'focus' week"&gt;&lt;/p&gt;
&lt;h2&gt;A Management week&lt;/h2&gt;
&lt;p&gt;In the next example you can see that preparing material for performance management is a diligent effort and takes a lot of time, same as participating in the corresponding alignment meetings (PRCs). I cut back heavily on inbox processing and lunch, and did some overtime to make it work. At the same time I did not want to cancel the training sessions I had scheduled a long time ago, and had been looking forward to, or miss out on a project closing dinner on Thursday to celebrate success. That was a conscious decision again, so I can’t complain about it afterwards. Cutting back on a routine can be a slippery slope to breaking an established good habit, so be mindful to get back to a normal setup as soon as possible, and compensate for the overtime by taking some time off the following week.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a week in my Google Calendar where I spent most of the time preparing performance assessment material and participated in the corresponding alignment meetings, including some overtime. Most days are orange because of the focused preparation, and blue because of the career development &amp;amp; performance management character of the events." src="https://engineering.zalando.com/posts/2020/07/images/calendar-management-week.png" title="A 'management' week"&gt;&lt;/p&gt;
&lt;h2&gt;Feel accomplished working asynchronously&lt;/h2&gt;
&lt;p&gt;Transitioning from the office to working remotely, especially when using asynchronous communication, can further reduce the feeling of being appreciated and accomplished. The lack of face-to-face communication means less exposure to this type of appreciation. As someone giving feedback, or when reading something that someone else created or contributed to, you can compensate by explicitly expressing your appreciation. A thank you here and there goes a long way, even if it’s not actionable feedback. It doesn’t have to be. As someone who misses this kind of appreciation, I try to find other signals that potentially correlate with doing a good job, and being appreciated for it, e.g. the number of readers of a document, or the number of comments, discussions, and other contributions on topics I'm driving.&lt;/p&gt;
&lt;h2&gt;What has changed since going full-remote in March 2020?&lt;/h2&gt;
&lt;p&gt;One thing that has changed is that because of the lack of commute, I had more time in the morning, and I started to eat breakfast. Not doing that before meant that I would need to have lunch at noon because I hadn't eaten properly in the morning and would be hungry already. Now with a proper breakfast to start the day, I have shifted lunch to 1pm and process my inbox right before at 12. I essentially switched those events around. You also see that we introduced recurring executive sync meetings at the end of the day to stay connected while working in a remote-first setup.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of what a typical week in my Google Calendar looks like after going remote. The daily inbox processing events, previously at 1pm, have switched places with the lunch event, which was at 12 beforehand." src="https://engineering.zalando.com/posts/2020/07/images/calendar-remote-week.png" title="The new remote week setup"&gt;&lt;/p&gt;
&lt;h2&gt;Closing comment&lt;/h2&gt;
&lt;p&gt;I hope this blog post helps you in &lt;em&gt;leading yourself&lt;/em&gt;. Reflecting on how I feel today compared to when I started out on this journey a few years ago, it is a night and day difference. When you learn concepts like the Eisenhower matrix, or Getting Things Done (GTD), most of the time you don't get specific tips and details of how to apply them on a day-to-day basis. I'm sharing my concrete experience as a template for you to start out with, customize, and iterate on.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://engineering.zalando.com/posts/2020/03/how-to-work-remotely-at-zalando.html"&gt;Guidelines for remote work at Zalando&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a href="https://fs.blog/2017/12/maker-vs-manager/"&gt;Maker vs. Manager&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://hamberg.no/gtd/"&gt;GTD in 15 minutes – A Pragmatic Guide to Getting Things Done&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;&lt;a href="https://www.eisenhower.me/eisenhower-matrix/"&gt;Eisenhower Matrix&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;The term 'perspective' is task management tool specific: &lt;a href="https://medium.com/smarter-productivity/a-modern-approach-to-gtd-contexts-and-perspectives-in-omnifocus-32a5256f1a0e"&gt;A modern approach to GTD contexts and perspectives in OmniFocus&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;&lt;a href="https://medium.com/smarter-productivity/a-modern-approach-to-gtd-contexts-and-perspectives-in-omnifocus-32a5256f1a0e"&gt;A modern approach to GTD contexts and perspectives in OmniFocus&lt;/a&gt;&amp;#160;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Zalando"/><category term="Productivity"/><category term="Leadership"/><category term="Remote Working"/><category term="Culture"/></entry><entry><title>Technology Choices at Zalando - Updating our Tech Radar Process</title><link href="https://engineering.zalando.com/posts/2020/07/technology-choices-at-zalando-tech-radar-update.html" rel="alternate"/><published>2020-07-15T00:00:00+02:00</published><updated>2020-07-15T00:00:00+02:00</updated><author><name>Bartosz Ocytko</name></author><id>tag:engineering.zalando.com,2020-07-15:/posts/2020/07/technology-choices-at-zalando-tech-radar-update.html</id><summary type="html">&lt;p&gt;We have revisited the process of technology selection at Zalando, adjusted the Tech Radar ring semantics, and moved towards principle-based decision making. In this post, we would like to share the process and its outcomes so far.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Zalando Tech Radar" src="https://engineering.zalando.com/posts/2020/07/images/zalando-tech-radar.jpg#previewimage"&gt;&lt;/p&gt;
&lt;h2&gt;Challenges with our Tech Radar&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://opensource.zalando.com/tech-radar/"&gt;Zalando Tech Radar&lt;/a&gt; is modelled after the &lt;a href="https://www.thoughtworks.com/radar"&gt;Thoughtworks Technology Radar&lt;/a&gt; and includes a ring-based scoring for a certain technology/framework along with supplementary information about pros, cons, restrictions, usage, and lessons learned at Zalando available as a knowledge base for our teams. Since publishing, the approach and &lt;a href="https://engineering.zalando.com/posts/2018/01/building-tech-radar.html"&gt;visualization engine&lt;/a&gt; have been used by others and also showcased at conferences &lt;a href="https://twitter.com/arungupta/status/1194653758275256320"&gt;as an example&lt;/a&gt; of how tech companies manage their technology choices.&lt;/p&gt;
&lt;p&gt;Our initial concept of the Tech Radar suffered from a series of problems, which we have observed in the Engineering Community while maintaining the Tech Radar:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The ring change criteria were too high-level: they were not specific to technology types (e.g. programming languages, data stores) or context (e.g. backend, data science, mobile), and did not consider a technology's support by our infrastructure or its impact on engineering usage. They didn’t allow for transparent, objective, and recurring rescoring of the Tech Radar nor for clear guidance for our engineers on how to select or suggest technologies to evaluate.&lt;/li&gt;
&lt;li&gt;The Tech Radar has been easy to ignore due to the lack of a formal process, and oftentimes delivery teams have been making key technology choices in isolation without consulting the guild maintaining the Tech Radar. Radar entries and ring changes were proposed only after technologies were already in production, instead of following the Tech Radar cycle. This led to a disconnect between the ring assignments and actual usage across teams.&lt;/li&gt;
&lt;li&gt;The Tech Radar relied on voluntary contributions, which degraded in frequency because they were neither clearly incentivized nor part of the job expectations for higher grades. Contributions were usually driven by a small group of engineers forming an informal guild, who collected lessons learned material and encouraged teams across the organization to contribute. The guild lacked a formal mandate to make company-wide technology decisions and did not sufficiently represent the departments across the company.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Confirming the problem statements&lt;/h2&gt;
&lt;p&gt;To address these problems we have embarked on a journey starting with confirming the observed problems with our Engineering Managers and getting more insights on how they manage technology choices in their teams. We also explored potential effects on delivery in the past years. We found that Engineering Managers have felt insufficiently supported by the company to manage expectations and technology choices in their teams and missed the ability to lean on stricter guidance. Further, an overly broad choice of technologies has had an effect on the growth rate of their teams and created challenges with cross-team code contributions.&lt;/p&gt;
&lt;h2&gt;Technology choices in Tech companies&lt;/h2&gt;
&lt;p&gt;Having confirmed the problem, we’ve been collecting ideas on how the problems can be approached. We began with researching how other tech companies are managing technology selection. Unlike Zalando, other established tech companies (Google, Spotify, Tencent, &lt;a href="https://github.com/foursquare/fsqio/blob/master/src/docs/fsqio/policies/new_technology_proposal.md"&gt;Foursquare&lt;/a&gt;, and other &lt;a href="https://www.cncf.io/people/end-user-community/"&gt;CNCF End User companies&lt;/a&gt;) use a much stricter technology selection process, limit programming language choices, and invest into changing the &lt;a href="https://cloudblogs.microsoft.com/opensource/2019/10/16/announcing-dapr-open-source-project-build-microservice-applications/"&gt;way applications are built&lt;/a&gt; to leverage centralized control planes, which increases development velocity. They limit the tech stack choices due to the amount of investment into infrastructure support and the high cost of removing technologies that did not prove to be useful.&lt;/p&gt;
&lt;p&gt;Too many technologies adopted company-wide make it challenging and expensive for Infrastructure teams to provide high-quality and well-integrated tooling, e.g. CI/CD, observability, profiling, vulnerability scanning, compliance, governance, etc. It also causes the teams that provide infrastructure solutions to strongly depend on coordinated and continuous community contribution for technologies that are not supported centrally. A broad freedom of choice leads to increased difficulties in supporting software long-term when the original authors have left the company, which is guaranteed to happen sooner or later. There are also other problems related to development collaboration: (1) adjusting to cross-language communication becomes significant as teams will repeatedly implement the same functional components in different ways, (2) the code duplication rate is increased and it's costly to address non-functional requirements of services in terms of performance, high availability, and scalability, and (3) cross-team collaboration across different code bases is hindered.&lt;/p&gt;
&lt;p&gt;Generally, aside from specialized use cases, flexibility around technology choices provides especially high value when organizations are able to identify technologies that bring a paradigm shift (e.g. Kubernetes), paired with business value and use case fit. This proves to be a difficult task and companies rarely get the timing right.&lt;/p&gt;
&lt;h2&gt;Data collection&lt;/h2&gt;
&lt;p&gt;We sourced information from the Engineering Community through a Programming Language survey among our developers. The survey indicated how many engineers are currently using a certain language, which they feel comfortable working with and to which degree, as well as which language they would like to support others with in terms of guidelines or ad-hoc help. We cross-checked this data with our 4,000+ applications and derived how the different programming languages have gained traction and popularity over time.&lt;/p&gt;
&lt;h2&gt;Setting the bar for ADOPT languages&lt;/h2&gt;
&lt;p&gt;We have collected expectations around the level of support that we would like to see for ADOPT languages, ranging from clear guidelines on the VM lifecycles, integration into CI/CD systems, observability, size and health of the community within and outside of the company, and the ability to hire engineers to grow our teams using those languages, up to best practices for common tasks like performance analysis and tuning through inspection of heap dumps or flame graphs. We then collected data on how all our languages used in production measure up against those criteria, to see how big the gap between our expectations and reality is.&lt;/p&gt;
&lt;h2&gt;Defining new ring semantics&lt;/h2&gt;
&lt;p&gt;We have redefined the ring semantics as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ADOPT&lt;/strong&gt;: technologies with broad adoption, in which Zalando is willing to invest long-term&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TRIAL&lt;/strong&gt;: captures all current experiments in production&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ASSESS&lt;/strong&gt;: active, non-production assessments of promising technologies and trends&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HOLD&lt;/strong&gt;: discouraged from broad adoption where the company is not willing to invest further; no new applications may use this technology&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NIL&lt;/strong&gt;: no ring assignment, captures previous assessments and findings for long-term documentation purposes (we periodically archive HOLD entries as NIL)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We optionally limit the ring assignments through a clear scope recommendation: Backend, Mobile, Web, Data, Machine Learning, and Infrastructure. This allows us to better differentiate between the specifics of those use cases. The updated semantics allow us to be broad in assessing the value of emerging technologies, but be selective in terms of their deployments to production and level of investment into adoption and promotion within the company. For TRIAL, we also involve explicit sponsorship from our Engineering Heads, who will support production trials and commit to being accountable for divesting from non-promising technologies and the removal of failed experiments from our technology landscape.&lt;/p&gt;
&lt;h2&gt;Technology Selection Principles and Principal Engineering Community&lt;/h2&gt;
&lt;p&gt;The timing for making changes to the Tech Radar was fortunate for two reasons. First, we have started an update of our role expectations for Software Engineers and Engineering Managers and included the responsibility and accountability for technology selection along with incentivizing contributions to the process in the new expectations. Second, we created a community of Principal Engineers with the most senior engineers across the company as members, who have been empowered to make decisions on technology selection and thus maintain the Tech Radar. We kicked off the community with a day-long remote off-site where we captured engineering challenges we face at Zalando, brainstormed on principles for technology selection, and had an initial exchange about the implications of new ring assignments and learnings about the programming languages we use in production. In departments that were not represented by Principal Engineers, we have included our Senior Engineers to contribute instead. Following the off-site, we have formalized Technology Selection Principles that provide guidance on technology choices in terms of breadth and depth, focus on company instead of local decision making, etc. &lt;a href="https://www.meeteor.com/post/principle-based-decision-making"&gt;Principle-based decision making&lt;/a&gt; enables healthy discussions and differs enormously from preference-based decision making, which easily becomes personal and leads to conflicts.&lt;/p&gt;
&lt;h2&gt;Parting ways with Clojure, Haskell, and Rust&lt;/h2&gt;
&lt;p&gt;Having reviewed the use cases where our teams have used the languages that are not on ADOPT, their current adoption within Zalando since 2016, the available set of languages, and the level of investment required to bring them to ADOPT, we have decided to part ways with Clojure, Haskell, and Rust and not create new applications in those languages moving forward. Although our teams have built many services using these languages and learned how to operate these at scale with many successes, following our technology selection principles, we decided to not further invest in these languages as their unique capabilities are not giving us any further &lt;a href="https://dehora.net/journal/leverage-in-engineering-organisations"&gt;leverage&lt;/a&gt; at this point in time. Instead, we are focusing our community efforts on Kotlin and TypeScript and expect our language communities to help us move these to ADOPT later this year.&lt;/p&gt;
&lt;p&gt;Please note that this decision is specific to the context of Zalando (1,200+ developers, 4,000+ applications) and our current technology landscape and engineering practices. As such, this decision is not transferable to other organizations nor to be understood as a statement about the technical capabilities of the languages themselves. We encourage readers to follow a similar exercise as ours to derive decisions for their context.&lt;/p&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;So far, we have reviewed the area of programming languages as the one having the biggest long-term impact on our engineers and system architecture as well as being the one sparking many debates on which language is better and why (when arguing based on preferences). As the next step, we are proceeding with reviewing the remaining categories of the Tech Radar, so stay tuned for further updates on our journey. (Update: check out our follow-up post on &lt;a href="/posts/2021/06/zalando-tech-radar-scaling-contributions.html"&gt;Scaling Contributions to the Tech Radar&lt;/a&gt;)&lt;/p&gt;</content><category term="Zalando"/><category term="Leadership"/><category term="Tech Culture"/><category term="Tech Radar"/><category term="Culture"/></entry><entry><title>Launching the Engineering Blog</title><link href="https://engineering.zalando.com/posts/2020/07/launching-the-engineering-blog.html" rel="alternate"/><published>2020-07-01T00:00:00+02:00</published><updated>2020-07-01T00:00:00+02:00</updated><author><name>Henning Jacobs</name></author><id>tag:engineering.zalando.com,2020-07-01:/posts/2020/07/launching-the-engineering-blog.html</id><summary type="html">&lt;p&gt;We recently re-launched Zalando's Engineering Blog. Learn how we have set up a blog with a Lighthouse score of 100.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Our Engineering Blog was launched in June 2020 after a long break of the previous tech blog.
This post describes the technical setup behind &lt;code&gt;engineering.zalando.com&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You will learn:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which static site generator we selected and why.&lt;/li&gt;
&lt;li&gt;What customizations we applied to design the blog and the publishing process.&lt;/li&gt;
&lt;li&gt;How we serve static HTML using Skipper and S3.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Static Site Generator&lt;/h2&gt;
&lt;p&gt;Our previous tech blog used a CMS which only a limited number of people had access to.
The CMS also lacked a workflow to propose and review drafts.
As authors of the Engineering Blog will (mostly) be software engineers, we decided to switch to a git-based workflow
and a static site generator.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.staticgen.com/"&gt;StaticGen&lt;/a&gt; provides a nice overview of many different static site generators.
Nearly all of them provide the necessary features to generate a static HTML site from blog posts written in Markdown.
So which static site generator to choose?&lt;/p&gt;
&lt;p&gt;With the need to customize the blog engine, e.g. with custom templates and features like author titles,
the main criterion for the static site generator is to use a familiar programming language for templating and for plugins.
The static site generator should generate plain HTML and not contain unnecessary features we won't use.
The winner was &lt;a href="https://getpelican.com/"&gt;Pelican&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="StaticGen: Pelican stats" src="https://engineering.zalando.com/posts/2020/07/images/staticgen-pelican.png#center"&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pelican is written in Python. Python is the language most people at Zalando are familiar with, so it's a safe bet.&lt;/li&gt;
&lt;li&gt;Templates are written in &lt;a href="https://palletsprojects.com/p/jinja/"&gt;Jinja&lt;/a&gt;. Jinja is a popular templating system; it's &lt;a href="https://github.com/search?l=Python&amp;amp;q=org%3Azalando+org%3Azalando-incubator+jinja2&amp;amp;type=Code"&gt;used in Zalando Open Source&lt;/a&gt; and &lt;a href="https://github.com/search?l=Python&amp;amp;q=user%3Ahjacobs+jinja2&amp;amp;type=Code"&gt;I use it in my own OSS projects&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Atom/RSS feeds are supported out-of-the-box.&lt;/li&gt;
&lt;li&gt;There are &lt;a href="https://github.com/getpelican/pelican-plugins"&gt;many existing plugins&lt;/a&gt; and it's easy to write your own in Python.&lt;/li&gt;
&lt;li&gt;It's actively developed. The &lt;a href="https://github.com/getpelican/pelican/"&gt;last git commit&lt;/a&gt; was 16 days ago at the time of writing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Customization&lt;/h2&gt;
&lt;p&gt;We implemented the blog's design with plain HTML/CSS. The CSS is generated via &lt;a href="https://postcss.org/"&gt;PostCSS&lt;/a&gt; and &lt;a href="https://tailwindcss.com/"&gt;Tailwind CSS&lt;/a&gt;.
Customizing Pelican's Jinja templates was straightforward.&lt;/p&gt;
&lt;p&gt;Other customizations we did:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enable &lt;a href="https://engineering.zalando.com/atom.xml"&gt;the Atom feed&lt;/a&gt; via the &lt;code&gt;FEED_ATOM&lt;/code&gt; setting in &lt;code&gt;pelicanconf.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Generate &lt;a href="https://engineering.zalando.com/sitemap.xml"&gt;the sitemap XML&lt;/a&gt; with the &lt;a href="https://github.com/pelican-plugins/sitemap"&gt;sitemap plugin&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Add author titles with the &lt;a href="https://pypi.org/project/pelican-metadataparsing/"&gt;pelican-metadataparsing plugin&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Minify generated HTML with the &lt;a href="https://pypi.org/project/pelican-htmlmin/"&gt;pelican-htmlmin plugin&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
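&lt;p&gt;As a rough illustration of the first two items, a &lt;code&gt;pelicanconf.py&lt;/code&gt; excerpt might look like the sketch below. The feed path is inferred from the feed URL linked above; the plugin name and the &lt;code&gt;SITEMAP&lt;/code&gt; value are assumptions about a typical setup, not the blog's actual configuration.&lt;/p&gt;

```python
# pelicanconf.py (excerpt): an illustrative sketch, not the blog's real settings.

# Write the Atom feed to the site root, matching the /atom.xml URL linked above.
FEED_ATOM = "atom.xml"

# Assumption: the sitemap plugin is installed and loadable under this name.
# The other plugins from the list would be registered here in the same way.
PLUGINS = ["sitemap"]

# Settings read by the sitemap plugin; "xml" makes it emit sitemap.xml.
SITEMAP = {"format": "xml"}
```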
&lt;p&gt;In addition to the above, we want to make sure that automatic linting is in place for blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required meta keys must be present, e.g. title, summary, and author names.&lt;/li&gt;
&lt;li&gt;The blog post Markdown file must be in the right year/month folder.&lt;/li&gt;
&lt;li&gt;Article tags should be curated via an explicit allowlist. We want to avoid introducing many unnecessary tags and different tags for the same concept, e.g. "Postgres" vs. "PostgreSQL".&lt;/li&gt;
&lt;/ul&gt;
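&lt;p&gt;The three rules above could be checked along the lines of the following minimal sketch. It is not the actual &lt;code&gt;validate-content.py&lt;/code&gt;: the required keys, the tag allowlist, the function names, and the &lt;code&gt;content/posts/YYYY/MM/&lt;/code&gt; folder layout are all assumptions made for illustration.&lt;/p&gt;

```python
import re
from pathlib import PurePosixPath

# Hypothetical values: the post does not list the real required keys or tags.
REQUIRED_KEYS = ("title", "summary", "authors")
ALLOWED_TAGS = {"Zalando", "PostgreSQL", "Python"}


def parse_metadata(text):
    """Read 'Key: value' header lines up to the first blank line."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip().lower()] = value.strip()
    return meta


def validate_post(path, text):
    """Return a list of problems found in one blog post (empty if none)."""
    meta = parse_metadata(text)
    errors = []

    # Rule 1: required meta keys must be present and non-empty.
    for key in REQUIRED_KEYS:
        if not meta.get(key):
            errors.append(f"missing required meta key: {key}")

    # Rule 2: the file must sit in the year/month folder matching its date.
    match = re.match(r"(\d{4})-(\d{2})-\d{2}", meta.get("date", ""))
    if match:
        expected = (match.group(1), match.group(2))
        if PurePosixPath(path).parts[-3:-1] != expected:
            errors.append("file is not in folder " + "/".join(expected))
    else:
        errors.append("missing or malformed date")

    # Rule 3: every tag must come from the curated allowlist.
    tags = [t.strip() for t in meta.get("tags", "").split(",") if t.strip()]
    for tag in tags:
        if tag not in ALLOWED_TAGS:
            errors.append(f"tag not in allowlist: {tag}")
    return errors
```

&lt;p&gt;A pre-commit hook passes the changed file names as arguments, so a real script would presumably loop over them, call something like &lt;code&gt;validate_post&lt;/code&gt; for each file, and exit with a non-zero status if any errors were collected.&lt;/p&gt;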
&lt;p&gt;Linting is done via &lt;a href="https://pre-commit.com/"&gt;pre-commit&lt;/a&gt; which calls a custom Python script to validate blog post Markdown files.
The &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; looks something like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;minimum_pre_commit_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1.21.0&lt;/span&gt;
&lt;span class="nt"&gt;repos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;meta&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;hooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;check-hooks-apply&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;check-useless-excludes&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;local&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;hooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;validate-content&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Validate blog content&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;system&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;# run with poetry to get dependencies (Pelican)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;poetry run ./validate-content.py&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;markdown&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;exclude&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;^content/pages/.*\.md$&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://github.com/pre-commit/pre-commit-hooks&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;rev&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v3.1.0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;hooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;check-added-large-files&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;end-of-file-fixer&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;trailing-whitespace&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;mixed-line-ending&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Zalando's CI/CD system automatically lints all files by executing &lt;code&gt;make lint&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Writing a blog post&lt;/h2&gt;
&lt;p&gt;Anybody in Zalando can pitch a blog post idea by creating an issue in the git repo:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blog post pitch: new issue" src="https://engineering.zalando.com/posts/2020/07/images/blog-post-pitch-new-issue.png"&gt;&lt;/p&gt;
&lt;p&gt;Bootstrapping a new blog post looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;hjacobs@ZALANDO-123:~/workspace/engineering-blog$&lt;span class="w"&gt; &lt;/span&gt;make&lt;span class="w"&gt; &lt;/span&gt;new
poetry&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;./scripts/new-post.py
This&lt;span class="w"&gt; &lt;/span&gt;will&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;new&lt;span class="w"&gt; &lt;/span&gt;blog&lt;span class="w"&gt; &lt;/span&gt;post,&lt;span class="w"&gt; &lt;/span&gt;please&lt;span class="w"&gt; &lt;/span&gt;answer&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;few&lt;span class="w"&gt; &lt;/span&gt;questions..
Title&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;blog&lt;span class="w"&gt; &lt;/span&gt;post:&lt;span class="w"&gt; &lt;/span&gt;Launching&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;Engineering&lt;span class="w"&gt; &lt;/span&gt;Blog
Slug&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;launching-the-engineering-blog&lt;span class="o"&gt;]&lt;/span&gt;:
Date&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;estimated&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;publishing&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2020&lt;/span&gt;-07-04&lt;span class="o"&gt;]&lt;/span&gt;:
Author&lt;span class="w"&gt; &lt;/span&gt;names&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;separate&lt;span class="w"&gt; &lt;/span&gt;with&lt;span class="w"&gt; &lt;/span&gt;semicolon&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;Henning&lt;span class="w"&gt; &lt;/span&gt;Jacobs&lt;span class="o"&gt;]&lt;/span&gt;:
Author&lt;span class="w"&gt; &lt;/span&gt;titles&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;separate&lt;span class="w"&gt; &lt;/span&gt;with&lt;span class="w"&gt; &lt;/span&gt;semicolon&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;Senior&lt;span class="w"&gt; &lt;/span&gt;Principal&lt;span class="w"&gt; &lt;/span&gt;Engineer&lt;span class="o"&gt;]&lt;/span&gt;:
&lt;span class="o"&gt;========================================&lt;/span&gt;
Title:&lt;span class="w"&gt;         &lt;/span&gt;Launching&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;Engineering&lt;span class="w"&gt; &lt;/span&gt;Blog
Slug:&lt;span class="w"&gt;          &lt;/span&gt;launching-the-engineering-blog
Authors:&lt;span class="w"&gt;       &lt;/span&gt;Henning&lt;span class="w"&gt; &lt;/span&gt;Jacobs
Author&lt;span class="w"&gt; &lt;/span&gt;Titles:&lt;span class="w"&gt; &lt;/span&gt;Senior&lt;span class="w"&gt; &lt;/span&gt;Principal&lt;span class="w"&gt; &lt;/span&gt;Engineer
Date:&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="m"&gt;2020&lt;/span&gt;-07-04
URL:&lt;span class="w"&gt;           &lt;/span&gt;/posts/2020/07/launching-the-engineering-blog.html
&lt;span class="o"&gt;========================================&lt;/span&gt;
Does&lt;span class="w"&gt; &lt;/span&gt;this&lt;span class="w"&gt; &lt;/span&gt;look&lt;span class="w"&gt; &lt;/span&gt;correct?&lt;span class="w"&gt; &lt;/span&gt;Answer&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;y&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;or&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;n&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;y
Creating&lt;span class="w"&gt; &lt;/span&gt;content/2020/07/launching-the-engineering-blog/2020-07-04-launching-the-engineering-blog.md&lt;span class="w"&gt; &lt;/span&gt;..

Useful&lt;span class="w"&gt; &lt;/span&gt;commands:
-&lt;span class="w"&gt; &lt;/span&gt;make&lt;span class="w"&gt; &lt;/span&gt;devserver&lt;span class="w"&gt;    &lt;/span&gt;Start&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;webserver,&lt;span class="w"&gt; &lt;/span&gt;find&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;draft&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;http://localhost:8000/drafts/
-&lt;span class="w"&gt; &lt;/span&gt;make&lt;span class="w"&gt; &lt;/span&gt;lint&lt;span class="w"&gt;         &lt;/span&gt;Validate&lt;span class="w"&gt; &lt;/span&gt;content&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;formatting.

Please&lt;span class="w"&gt; &lt;/span&gt;edit&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;article&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;content/2020/07/launching-the-engineering-blog/2020-07-04-launching-the-engineering-blog.md
and&lt;span class="w"&gt; &lt;/span&gt;don&lt;span class="err"&gt;&amp;#39;&lt;/span&gt;t&lt;span class="w"&gt; &lt;/span&gt;forget&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;open&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;PR&lt;span class="w"&gt; &lt;/span&gt;:-&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
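
&lt;p&gt;The slug, URL, and file path in that output follow a simple pattern. As a rough illustration, here is a minimal sketch of how they could be derived from the title and the publishing date. This is not the actual &lt;code&gt;scripts/new-post.py&lt;/code&gt;: the function names &lt;code&gt;slugify&lt;/code&gt; and &lt;code&gt;post_locations&lt;/code&gt; are hypothetical, and the real script may treat special characters differently.&lt;/p&gt;

```python
import re
from datetime import date


def slugify(title):
    """Lowercase the title and join its alphanumeric words with hyphens.

    Assumption: the real script may handle special characters differently.
    """
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def post_locations(title, published):
    """Return (slug, url, source_path) following the pattern shown in the output above."""
    slug = slugify(title)
    year_month = f"{published:%Y/%m}"  # e.g. "2020/07"
    url = f"/posts/{year_month}/{slug}.html"
    source_path = f"content/{year_month}/{slug}/{published.isoformat()}-{slug}.md"
    return slug, url, source_path


slug, url, source_path = post_locations("Launching the Engineering Blog", date(2020, 7, 4))
print(slug)
print(url)
print(source_path)
```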

&lt;p&gt;Opening a PR to the Engineering Blog repository will trigger a build (&lt;code&gt;make html&lt;/code&gt;) on our Zalando Continuous Delivery Platform.
The PR build will publish a preview of the blog under a private (authenticated) URL.&lt;/p&gt;
&lt;p&gt;After merging the blog post PR, it will automatically be published on the live site &lt;code&gt;engineering.zalando.com&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Serving static HTML&lt;/h2&gt;
&lt;p&gt;Zalando's Continuous Delivery Platform has a built-in feature to upload files to a given S3 bucket. This feature is used to upload all files from the &lt;code&gt;output&lt;/code&gt; directory (generated by Pelican) to the blog's S3 bucket.
The S3 bucket is created via CloudFormation which also configures the S3 website:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;2010-09-09&lt;/span&gt;
&lt;span class="nt"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;StackName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering-blog&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;Tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;application&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering-blog&amp;quot;&lt;/span&gt;
&lt;span class="nt"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;S3Bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;AWS::S3::Bucket&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;BucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;lt;BUCKET-NAME&amp;gt;&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;AccessControl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;PublicRead&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;WebsiteConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;IndexDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;index.html&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;ErrorDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;error.html&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;DeletionPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Retain&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;BucketPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;AWS::S3::BucketPolicy&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;PolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-s3-websiteconfiguration.html"&gt;WebsiteConfiguration property&lt;/a&gt; will make the bucket contents available on &lt;code&gt;http://&amp;lt;BUCKET-NAME&amp;gt;.s3-website.&amp;lt;REGION&amp;gt;.amazonaws.com&lt;/code&gt;.
The S3 website only provides an HTTP endpoint (no SSL) and not a domain we would want to use publicly.&lt;/p&gt;
&lt;p&gt;One way to serve the contents with a custom domain and SSL is to &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/cloudfront-serve-static-website/"&gt;create a CloudFront web distribution&lt;/a&gt;.
I decided to not use CloudFront as all the required infrastructure for domain+SSL is already in place.&lt;/p&gt;
&lt;p&gt;We have &lt;a href="https://github.com/zalando/skipper/"&gt;Skipper&lt;/a&gt; as the Kubernetes Ingress proxy running for all our 140+ Kubernetes clusters.
&lt;a href="https://github.com/kubernetes-sigs/external-dns"&gt;External DNS&lt;/a&gt; automatically configures the DNS name and the &lt;a href="https://github.com/zalando-incubator/kube-ingress-aws-controller"&gt;Kubernetes Ingress Controller for AWS&lt;/a&gt; configures the AWS ALB with the right ACM SSL certificate. So let's reuse this infrastructure and let Skipper proxy all requests to the S3 website bucket endpoint.
This can be achieved by adding a default Skipper route as an Ingress annotation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;apiVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;networking.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="nt"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Ingress&lt;/span&gt;
&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering-blog&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;application&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering-blog&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;zalando.org/skipper-routes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;redirect_app_default: * -&amp;gt; compress() -&amp;gt; setDynamicBackendUrl(&amp;quot;http://&amp;lt;BUCKET-NAME&amp;gt;.s3-website.&amp;lt;REGION&amp;gt;.amazonaws.com&amp;quot;) -&amp;gt; &amp;lt;dynamic&amp;gt;;&lt;/span&gt;
&lt;span class="nt"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering.zalando.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;engineering-blog&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;servicePort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Skipper's &lt;code&gt;compress()&lt;/code&gt; filter enables response compression (such as &lt;code&gt;gzip&lt;/code&gt; or &lt;code&gt;deflate&lt;/code&gt;), as the S3 website endpoint does not compress responses out of the box.
The ACM certificate, HTTP/2 support, the S3 website response, and the enabled compression are visible when doing a curl request (output shortened):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-v&lt;span class="w"&gt; &lt;/span&gt;--compressed&lt;span class="w"&gt; &lt;/span&gt;https://engineering.zalando.com&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;/dev/null
*&lt;span class="w"&gt; &lt;/span&gt;SSL&lt;span class="w"&gt; &lt;/span&gt;connection&lt;span class="w"&gt; &lt;/span&gt;using&lt;span class="w"&gt; &lt;/span&gt;TLSv1.2&lt;span class="w"&gt; &lt;/span&gt;/&lt;span class="w"&gt; &lt;/span&gt;ECDHE-RSA-AES128-GCM-SHA256
*&lt;span class="w"&gt; &lt;/span&gt;Server&lt;span class="w"&gt; &lt;/span&gt;certificate:
*&lt;span class="w"&gt;  &lt;/span&gt;subject:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;CN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;engineering.zalando.com
*&lt;span class="w"&gt;  &lt;/span&gt;subjectAltName:&lt;span class="w"&gt; &lt;/span&gt;host&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;engineering.zalando.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;matched&lt;span class="w"&gt; &lt;/span&gt;cert&lt;span class="err"&gt;&amp;#39;&lt;/span&gt;s&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;engineering.zalando.com&amp;quot;&lt;/span&gt;
*&lt;span class="w"&gt;  &lt;/span&gt;issuer:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;US&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;O&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Amazon&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;OU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Server&lt;span class="w"&gt; &lt;/span&gt;CA&lt;span class="w"&gt; &lt;/span&gt;1B&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;CN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Amazon
*&lt;span class="w"&gt;  &lt;/span&gt;SSL&lt;span class="w"&gt; &lt;/span&gt;certificate&lt;span class="w"&gt; &lt;/span&gt;verify&lt;span class="w"&gt; &lt;/span&gt;ok.
&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;GET&lt;span class="w"&gt; &lt;/span&gt;/&lt;span class="w"&gt; &lt;/span&gt;HTTP/2
&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;Host:&lt;span class="w"&gt; &lt;/span&gt;engineering.zalando.com
&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;user-agent:&lt;span class="w"&gt; &lt;/span&gt;curl/7.68.0
&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;accept:&lt;span class="w"&gt; &lt;/span&gt;*/*
&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;accept-encoding:&lt;span class="w"&gt; &lt;/span&gt;deflate,&lt;span class="w"&gt; &lt;/span&gt;gzip,&lt;span class="w"&gt; &lt;/span&gt;br
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;HTTP/2&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;content-type:&lt;span class="w"&gt; &lt;/span&gt;text/html
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;content-encoding:&lt;span class="w"&gt; &lt;/span&gt;deflate
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;etag:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;304fcc9c31aac19255bf1d84669059df&amp;quot;&lt;/span&gt;
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;last-modified:&lt;span class="w"&gt; &lt;/span&gt;Sat,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;27&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Jun&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2020&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;07&lt;/span&gt;:23:19&lt;span class="w"&gt; &lt;/span&gt;GMT
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;server:&lt;span class="w"&gt; &lt;/span&gt;AmazonS3
&amp;lt;&lt;span class="w"&gt; &lt;/span&gt;vary:&lt;span class="w"&gt; &lt;/span&gt;Accept-Encoding
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Performance&lt;/h2&gt;
&lt;p&gt;The static website should be fast. So let's test. We can use &lt;a href="https://github.com/tsenart/vegeta"&gt;Vegeta&lt;/a&gt; for some basic HTTP load testing.
60ms as p99 latency looks good:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;&amp;quot;GET https://engineering.zalando.com/&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vegeta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attack&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vegeta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;
&lt;span class="n"&gt;Requests&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;total, rate, throughput&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50.00&lt;/span&gt;
&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;total, attack, wait&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="mf"&gt;59.995&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;59.98&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;15.246&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Latencies&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;min, mean, 50, 90, 95, 99, max&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mf"&gt;12.418&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;19.751&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;17.049&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;25.05&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;38.382&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;59.958&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;244.094&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;In&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;total, mean&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="mi"&gt;51441000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;17147.00&lt;/span&gt;
&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;Out&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;total, mean&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.00&lt;/span&gt;
&lt;span class="n"&gt;Success&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;                           &lt;/span&gt;&lt;span class="mf"&gt;100.00&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;
&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Codes&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code:count&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt;                      &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;
&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;Set&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The user experience with a real browser is much more interesting. &lt;a href="https://developers.google.com/web/tools/lighthouse/"&gt;Chrome Lighthouse&lt;/a&gt; can be used to assess the page performance.
Google's PageSpeed Insights uses Lighthouse for its score calculation.
Running &lt;a href="https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fengineering.zalando.com"&gt;PageSpeed Insights for the blog&lt;/a&gt; reports a nice score of 100 out of 100 (desktop):&lt;/p&gt;
&lt;p&gt;&lt;img alt="PageSpeed Insights for https://engineering.zalando.com/" src="https://engineering.zalando.com/posts/2020/07/images/page-speed-insights-engineering-zalando-com.png"&gt;&lt;/p&gt;
&lt;p&gt;Thanks go out to our Employer Branding colleagues who created the design and implemented the responsive HTML/CSS layout!&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;I hope this blog post gives you some inspiration for setting up your own blog with Pelican or some other static site generator.
After re-launching our Engineering Blog, our main focus will be providing regular, high-quality content.
We still have to figure out the best way to source, review, and schedule blog posts.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://twitter.com/ZalandoTech"&gt;Follow ZalandoTech on Twitter&lt;/a&gt; and subscribe to &lt;a href="https://engineering.zalando.com/atom.xml"&gt;the Atom/RSS feed&lt;/a&gt; to get the latest articles.&lt;/p&gt;</content><category term="Zalando"/><category term="Engineering Blog"/><category term="Python"/><category term="AWS"/><category term="Kubernetes"/><category term="Skipper"/><category term="Backend"/><category term="Open Source"/><category term="Frontend"/></entry><entry><title>PgBouncer on Kubernetes and how to achieve minimal latency</title><link href="https://engineering.zalando.com/posts/2020/06/postgresql-connection-poolers.html" rel="alternate"/><published>2020-06-24T00:00:00+02:00</published><updated>2020-06-24T00:00:00+02:00</updated><author><name>Dmitrii Dolgov</name></author><id>tag:engineering.zalando.com,2020-06-24:/posts/2020/06/postgresql-connection-poolers.html</id><summary type="html">&lt;p&gt;Experiments with connection poolers on Kubernetes for Postgres Operator&lt;/p&gt;</summary><content type="html">&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;In the new Postgres Operator release 1.5 we have implemented a couple of
interesting new &lt;a href="https://github.com/zalando/postgres-operator/releases/tag/v1.5.0"&gt;features&lt;/a&gt;, including connection pooling support. &lt;a href="https://sanctum.geek.nz/arabesque/vim-koans/"&gt;Master Wq&lt;/a&gt;
says there is "No greatest tool": to run something successfully in production,
one needs to understand its pros and cons. Let's dig into the topic and
take a look at the performance aspect of connection pooler support, mostly from
a scaling perspective.&lt;/p&gt;
&lt;p&gt;But first, some background. Why do we quite often need a connection
pooler for PostgreSQL (and in fact for many other &lt;a href="https://www.cockroachlabs.com/docs/stable/recommended-production-settings.html#connection-pooling"&gt;databases&lt;/a&gt; too)? Having too
many open connections to a database has several performance implications. They
result from how a connection is &lt;a href="https://www.postgresql.org/docs/12/connect-estab.html"&gt;opened&lt;/a&gt; (PostgreSQL
uses a "process per user" client/server model, in which too many connections mean too
many processes fighting for resources and drowning in context switches and
&lt;a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c#n1736"&gt;CPU migrations&lt;/a&gt;) and from how &lt;a href="https://www.postgresql.org/message-id/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de"&gt;certain aspects&lt;/a&gt; of transaction handling are
implemented (e.g. &lt;code&gt;GetSnapshotData&lt;/code&gt; has &lt;code&gt;O(connections)&lt;/code&gt; complexity). Having
said that, there are three options for where to implement a connection pooler:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;on the database side, as proposed in this &lt;a href="https://www.postgresql.org/message-id/flat/KL1PR0601MB380006383DE897E2026ACEC6B6D40%40KL1PR0601MB3800.apcprd06.prod.outlook.com#329a9ba21d8f634eebade5d1d62fa3c0"&gt;patch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;as a separate component between the database and the application&lt;/li&gt;
&lt;li&gt;on the application side&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For Postgres Operator we have chosen the second approach. Although there are
pros and cons for all of those options, either of the others would require a lot
of effort (an application-side connection pooler is not under the
operator's control, and an internal connection pooler for PostgreSQL is a major
feature that has yet to be developed). Another interesting choice to make in this
case is which connection pooling solution to use. At the moment there are
a couple of options available for PostgreSQL (listed in no particular
order):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.pgbouncer.org"&gt;PgBouncer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pgpool.net/mediawiki/index.php/Main_Page"&gt;Pgpool-II&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/yandex/odyssey"&gt;Odyssey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://agroal.github.io/pgagroal/"&gt;pgagroal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;PgBouncer is probably the most popular and the oldest solution. Pgpool-II can
actually do much more than just connection pooling (e.g. it can do load
balancing), but it means it's a bit more heavyweight than others. Odyssey and
pgagroal are much newer and try to be more performance optimized and scalable
than the alternatives.&lt;/p&gt;
&lt;p&gt;Eventually we went for PgBouncer, but the current implementation allows us to switch to
any other solution, as long as it conforms to a basic common standard. Now let's
take a look at how PgBouncer performs in tests.&lt;/p&gt;
&lt;h1&gt;Setup&lt;/h1&gt;
&lt;p&gt;In fact, we ran a significant number of benchmarks with PgBouncer for different
workloads on our Kubernetes clusters and learned a few interesting details. For
example, I didn't know that a Kubernetes &lt;code&gt;Service&lt;/code&gt; can distribute workload in
a not exactly uniform way, so that one can see something like this, where the
third pod is only roughly half utilized and in fact gets roughly half as many queries as the
others:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;NAME                         CPU(cores)   MEMORY(bytes)
pool-test-7d8bfbc47f-6bbhr   977m         5Mi
pool-test-7d8bfbc47f-8jtnp   995m         6Mi
pool-test-7d8bfbc47f-ghvpn   585m         6Mi
pool-test-7d8bfbc47f-s945p   993m         6Mi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This could happen if &lt;code&gt;kube-proxy&lt;/code&gt; works in &lt;code&gt;iptables&lt;/code&gt; &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables"&gt;mode&lt;/a&gt; and calculates
probabilities to land on a pod instead of strict round-robin.&lt;/p&gt;
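
&lt;p&gt;To see why probability-based selection can leave one pod underutilized, here is a small simulation sketch. It assumes the commonly described &lt;code&gt;iptables&lt;/code&gt; setup in which a chain of rules is tried in order and rule &lt;code&gt;i&lt;/code&gt; out of &lt;code&gt;n&lt;/code&gt; matches with probability &lt;code&gt;1/(n - i)&lt;/code&gt;. The function names, the connection count, and the seed are made up for illustration and were not part of our benchmarks. Each pod gets the same share on average, but with a limited number of long-lived connections the actual split can be noticeably uneven.&lt;/p&gt;

```python
import random
from collections import Counter
from fractions import Fraction


def chain_probabilities(n_pods):
    """Per-rule match probabilities for a chain of n_pods rules.

    Rule i is only reached if the earlier rules did not match, so it uses
    probability 1/(n_pods - i); the last rule always matches.
    """
    return [Fraction(1, n_pods - i) for i in range(n_pods)]


def overall_shares(n_pods):
    """Overall probability that a new connection lands on each pod."""
    shares = []
    not_matched_yet = Fraction(1)
    for p in chain_probabilities(n_pods):
        shares.append(not_matched_yet * p)
        not_matched_yet *= 1 - p
    return shares


def assign_connections(n_connections, n_pods, seed):
    """Pick a pod for each connection by walking the rule chain in order."""
    rng = random.Random(seed)
    probs = [float(p) for p in chain_probabilities(n_pods)]
    counts = Counter({pod: 0 for pod in range(n_pods)})
    for _ in range(n_connections):
        for pod, p in enumerate(probs):
            # Random match with probability p (hypothetical stand-in for the rule).
            matched = rng.choices((True, False), weights=(p, 1.0 - p))[0]
            if matched:
                counts[pod] += 1
                break
    return counts


# Every pod has the same expected share ...
print(overall_shares(4))
# ... but a concrete assignment of a few connections is rarely an exact even split.
print(assign_connections(n_connections=40, n_pods=4, seed=1))
```

&lt;p&gt;Strict round-robin would give every pod exactly the same number of connections. Random selection only guarantees that on average, which matters when each connection carries a lot of queries.&lt;/p&gt;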
&lt;p&gt;But in this article I want to offer one example, produced in the more artificial
environment of my laptop. That's mostly because there we can collect
metrics that are interesting for this particular case, but do not make sense to
collect for all workloads. My original idea was to play around with CPU management
policies and &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy"&gt;exclusive CPUs&lt;/a&gt;, to show what happens if PgBouncer runs
with a fixed cpuset. But interestingly enough, another effect introduced an even
bigger difference, so the following experiment will be more about scaling of
PgBouncer instances.&lt;/p&gt;
&lt;p&gt;To simulate the networking part of our experiment, let's set up a separate network
namespace, where we will run PostgreSQL and PgBouncer, and connect it via a veth
link to the root namespace.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# setup veth link with veth0/veth1 at the ends&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;veth0&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth&lt;span class="w"&gt; &lt;/span&gt;peer&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;veth1

&lt;span class="c1"&gt;# check that they&amp;#39;re present&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;show&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth

&lt;span class="c1"&gt;# add a new network namespace&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;db

&lt;span class="c1"&gt;# move one end into the new namespace&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth1&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;db

&lt;span class="c1"&gt;# check that now only veth0 is visible&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;show&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth

&lt;span class="c1"&gt;# check that veth1 is visible from the other namespace&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;show&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth

&lt;span class="c1"&gt;# add corresponding addresses and bring everything up&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;addr&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.0.0.10/24&lt;span class="w"&gt; &lt;/span&gt;dev&lt;span class="w"&gt; &lt;/span&gt;veth0
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;addr&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.0.0.1/24&lt;span class="w"&gt; &lt;/span&gt;dev&lt;span class="w"&gt; &lt;/span&gt;veth1
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth0&lt;span class="w"&gt; &lt;/span&gt;up
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;veth1&lt;span class="w"&gt; &lt;/span&gt;up
$&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;netns&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db&lt;span class="w"&gt; &lt;/span&gt;ip&lt;span class="w"&gt; &lt;/span&gt;link&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;lo&lt;span class="w"&gt; &lt;/span&gt;up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This link is going to be blazingly fast, so let's add a small delay to the veth
interface, which corresponds to the empirical network latency we observe in
our Kubernetes clusters. The distribution parameter is here mostly to emphasize
its presence, since it's normal by default anyway.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;tc&lt;span class="w"&gt; &lt;/span&gt;qdisc&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;dev&lt;span class="w"&gt; &lt;/span&gt;veth0&lt;span class="w"&gt; &lt;/span&gt;root&lt;span class="w"&gt; &lt;/span&gt;netem&lt;span class="w"&gt; &lt;/span&gt;delay&lt;span class="w"&gt; &lt;/span&gt;1ms&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1ms&lt;span class="w"&gt; &lt;/span&gt;distribution&lt;span class="w"&gt; &lt;/span&gt;normal
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In our experiment we will run a pgbench test with the query &lt;code&gt;;&lt;/code&gt;, which is the
&lt;a href="https://jakewheat.github.io/sql-overview/sql-2008-foundation-grammar.html#direct-SQL-statement"&gt;smallest SQL query&lt;/a&gt; one can come up with. The idea is to not load the
database itself too much and see how a PgBouncer instance handles many
connections, in this case 1000 connections dispatched via 8 threads. A word of
warning: use pgbench carefully, since in some cases it can itself be the bottleneck
and produce confusing results. In our case we will try to limit this by pinning
all the components to separate cores, collecting performance counters to see
where the time is spent, and staying alert for strange results. But for a
more diverse workload and a more holistic approach you can use &lt;a href="https://github.com/oltpbenchmark/oltpbench/"&gt;oltpbench&lt;/a&gt; or
&lt;a href="https://github.com/petergeoghegan/benchmarksql"&gt;benchmarksql&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The result will be a per-transaction execution &lt;a href="https://www.postgresql.org/docs/current/pgbench.html#id-1.9.4.10.8.6"&gt;log&lt;/a&gt;. Every component, namely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PostgreSQL instance&lt;/li&gt;
&lt;li&gt;Two PgBouncer instances&lt;/li&gt;
&lt;li&gt;PgBench workload generator&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;is bound to a single CPU core, with Intel turbo disabled and the CPU
scaling governor for all the cores set to &lt;code&gt;performance&lt;/code&gt;. Two instances of
PgBouncer will run with the &lt;code&gt;so_reuseport&lt;/code&gt; option, which is essentially a way to
get PgBouncer to use &lt;a href="http://www.pgbouncer.org/config.html#so_reuseport"&gt;more CPU cores&lt;/a&gt;. The only degree of freedom we will
investigate is their placement: whether each instance gets a real
separate core, or just a separate hyperthread.&lt;/p&gt;
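&lt;p&gt;Before pinning anything, it helps to know which logical CPUs are hyperthreads of the same physical core. The following Python sketch reads that from the Linux sysfs files &lt;code&gt;topology/thread_siblings_list&lt;/code&gt;; it is an illustration rather than the tooling used for this experiment, and the function names are hypothetical.&lt;/p&gt;

```python
from pathlib import Path


def parse_cpu_list(text):
    # Expand a kernel CPU list such as "0-1,4" into [0, 1, 4].
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def physical_cores(sysfs_root="/sys/devices/system/cpu"):
    # Group logical CPUs by the set of hyperthread siblings they belong to.
    # On systems without this sysfs layout the result is simply empty.
    cores = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        cores.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(cores)


if __name__ == "__main__":
    for siblings in physical_cores():
        print("physical core with logical CPUs:", siblings)
```

&lt;p&gt;Two PgBouncer processes pinned to CPUs from the same tuple share a physical core, while CPUs from different tuples are real separate cores. The pinning itself can then be done with a tool such as &lt;code&gt;taskset&lt;/code&gt;.&lt;/p&gt;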
&lt;h1&gt;Benchmark&lt;/h1&gt;
&lt;p&gt;Here are the benchmark results, presenting rolling mean, 99th percentile latency and
standard deviation values, executed on a rather modest setup with 2 physical
cores, each with 2 hyperthreads, for three cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Only one instance of PgBouncer on an isolated real core.&lt;/li&gt;
&lt;li&gt;Two PgBouncers on isolated hyperthreads, but on the same physical core.&lt;/li&gt;
&lt;li&gt;Two PgBouncers on isolated cores (with potential noise from other components
  on the other hyperthread).&lt;/li&gt;
&lt;/ul&gt;
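&lt;p&gt;The statistics named above can be derived from the pgbench per-transaction log, whose third field is the elapsed transaction time in microseconds. Here is a minimal Python sketch of such post-processing; the window size, the nearest-rank percentile and the function names are assumptions made for illustration, not the exact script behind the graphs.&lt;/p&gt;

```python
import statistics
from collections import deque


def parse_latency_us(log_line):
    # Third field of a pgbench per-transaction log line is the elapsed
    # transaction time in microseconds.
    return int(log_line.split()[2])


def percentile(values, pct):
    # Nearest-rank percentile with integer arithmetic: rank = ceil(pct * n / 100).
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def rolling_stats(latencies_us, window):
    # Yield (mean, p99, stdev) for every full sliding window of transactions.
    recent = deque(maxlen=window)
    for latency in latencies_us:
        recent.append(latency)
        if len(recent) == window:
            yield (statistics.mean(recent),
                   percentile(recent, 99),
                   statistics.pstdev(recent))


# Illustrative log lines in the pgbench format, not measurements from this post.
sample_log = [
    "0 199 2241 0 1175850568 995598",
    "0 200 2465 0 1175850568 998079",
    "0 201 2513 0 1175850569 608",
    "0 202 2038 0 1175850569 2663",
]
latencies = [parse_latency_us(line) for line in sample_log]
for mean, p99, stdev in rolling_stats(latencies, window=2):
    print(mean, p99, stdev)
```

&lt;p&gt;For a real run with 1000 connections the window would be much larger, so that the 99th percentile is computed over enough transactions to be meaningful.&lt;/p&gt;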
&lt;p&gt;&lt;a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf#page=311"&gt;Hyper-Threading&lt;/a&gt; means that two components are still fighting for CPU time,
but share some execution state and cache. Usually this results in more
latency variation, which we will keep in mind.&lt;/p&gt;
&lt;p&gt;&lt;img alt="separate_cores tar gz" src="https://engineering.zalando.com/posts/2020/06/images/pgbouncer-cpu-cores.png"&gt;&lt;/p&gt;
&lt;p&gt;One nice feature we can immediately see is that the results are relatively stable,
which is good. Another interesting note is that, even though we were
only changing the core location of each component, we can see a significant
difference in latency. For a single PgBouncer instance we've got the lowest
latency, while for two PgBouncers on the same physical core it's almost two
times higher (with only a minimal increase in throughput). In the case of two
PgBouncers on different physical cores, even with potential competition for
resources with another component (and a different resource consumption
pattern), the latency is somewhere in between (with the best throughput of the
three). Why is that?&lt;/p&gt;
&lt;p&gt;In the course of the investigation, more and more puzzling measurements were
collected, showing no significant difference in &lt;code&gt;perf&lt;/code&gt; sampling of
PostgreSQL activity or of both PgBouncer instances. Let's take a closer look at
what PgBouncer is actually doing:&lt;/p&gt;
&lt;p&gt;&lt;img alt="pgbouncer" src="https://engineering.zalando.com/posts/2020/06/images/pgbouncer-flamegraph.png"&gt;&lt;/p&gt;
&lt;p&gt;As expected, it spends a lot of its time doing networking. The kernel &lt;a href="https://www.kernel.org/doc/html/latest/networking/scaling.html#suggested-configuration"&gt;docs&lt;/a&gt;
say that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For interrupt handling, HT has shown no benefit in initial tests, so limit
the number of queues to the number of CPU cores in the system.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This could be our working assumption. Network interrupts probably do not scale very
well between hyperthreads, so one needs to use a real core to scale them
out. To get a bit more evidence, let's take a look at interrupt latencies in
both cases, different cores and different hyperthreads. For that we can use the
&lt;code&gt;irq:softirq_entry&lt;/code&gt; and &lt;code&gt;irq:softirq_exit&lt;/code&gt; tracepoints and a &lt;a href="http://www.brendangregg.com/perf.html"&gt;script from Brendan Gregg&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# one PgBouncer instance is running on CPU2 with no other PgBouncer on the&lt;/span&gt;
&lt;span class="c1"&gt;# same physical core. We&amp;#39;re interested only in NET_RX,NET_TX vectors.&lt;/span&gt;

$&lt;span class="w"&gt; &lt;/span&gt;perf&lt;span class="w"&gt; &lt;/span&gt;record&lt;span class="w"&gt; &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;irq:softirq_entry,irq:softirq_exit&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;-a&lt;span class="w"&gt; &lt;/span&gt;-C&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--filter&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;vec == 2 || vec == 3&amp;#39;&lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;perf&lt;span class="w"&gt; &lt;/span&gt;script&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;awk&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;{ gsub(/:/, &amp;quot;&amp;quot;) } $5 ~ /entry/ { ts[$6, $10] = $4 }&lt;/span&gt;
&lt;span class="s1"&gt;    $5 ~ /exit/ { if (l = ts[$6, $9]) { printf &amp;quot;%.f %.f\n&amp;quot;, $4 * 1000000,&lt;/span&gt;
&lt;span class="s1"&gt;    ($4 - l) * 1000000; ts[$6, $10] = 0 } }&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;latencies.out
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
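
&lt;p&gt;The resulting &lt;code&gt;latencies.out&lt;/code&gt; holds two columns per line: the timestamp of the softirq exit and the softirq latency, both in microseconds. A 99th percentile over time can then be computed, for instance, with a small Python sketch like the one below. The one-second bucket size and the function name are arbitrary choices for illustration.&lt;/p&gt;

```python
from collections import defaultdict


def p99_per_second(lines):
    # Each line holds "timestamp_us latency_us", as printed by the awk script.
    buckets = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        timestamp_us, latency_us = int(fields[0]), int(fields[1])
        buckets[timestamp_us // 1_000_000].append(latency_us)
    result = {}
    for second, latencies in sorted(buckets.items()):
        latencies.sort()
        # Nearest-rank 99th percentile: rank = ceil(0.99 * n).
        rank = (99 * len(latencies) + 99) // 100
        result[second] = latencies[rank - 1]
    return result


# Made-up sample lines, only to show the expected input shape.
sample = ["1000000 12", "1500000 30", "2000001 7"]
print(p99_per_second(sample))
```

&lt;p&gt;For the real data, the lines of &lt;code&gt;latencies.out&lt;/code&gt; can be passed in directly, since iterating over an open file yields one line at a time.&lt;/p&gt;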

&lt;p&gt;The same recording is repeated for the other case, where a PgBouncer shares
a physical core with another one. Here is the 99th percentile of the resulting
latencies:&lt;/p&gt;
&lt;p&gt;&lt;img alt="softirq_net_rx_net_tx_latencies" src="https://engineering.zalando.com/posts/2020/06/images/pgbouncer-softirq.png"&gt;&lt;/p&gt;
&lt;p&gt;This indeed points in the direction of network interrupts being a bit slower
when both PgBouncers share the same physical CPU. In theory,
it means that we can get surprising performance results after adding more pods
to a connection pool deployment, depending on where those new pods land: on
an isolated CPU or on a CPU with another hyperthread already busy. In view
of these results it could be beneficial to configure the &lt;a href="https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/"&gt;CPU manager&lt;/a&gt; in the
cluster, so that this would not be an issue.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Having said all of the above, I must admit it's just the tip of the iceberg. If there
can be such interesting complications in how to run a connection pooler within
a single node, you can imagine what happens on a higher architecture level.
We've spent a lot of time discussing different design possibilities for
Postgres Operator, e.g. whether it should be a single "big" PgBouncer instance
(with many processes reusing the same port) with an affinity to be close to the
database, or multiple "small" instances equidistant from the database. Every
design has its own trade-offs regarding network round trips and availability
implications, but since we value simplicity (especially in view of such a
complicated topic) we went for a rather straightforward approach relying on
standard Kubernetes functionality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Postgres Operator creates a single connection pooler deployment and exposes
  it via a new service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Connection pooler pods are distributed between availability zones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Due to the nature of connection pooling, pods are doing CPU-intensive work with
  a minimal amount of memory (less than a hundred megabytes in a simple case)
  and it makes sense to create as many as needed to prevent resource
  saturation. Those pods could be scattered across multiple nodes and availability zones, which
  means latency variability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For those cases when this variability cannot be tolerated, we would consider
  manually creating a single "big" pooler instance with an affinity that puts it
  on the same node as the database, and adjusting the CPU manager to squeeze everything
  we can from this setup. This new instance would be the primary one for
  connections, with another one providing HA.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This simplicity should not be confused with ignorance: it's based on an
understanding of the limitations of the proposed solution and of what could be adjusted
beyond them. As in my other blog posts and talks, I would love to emphasize the
importance of the described methodology: even if you have such a complicated
system in your hands as Kubernetes, it's important to understand what happens
underneath!&lt;/p&gt;</content><category term="Zalando"/><category term="Kubernetes"/><category term="Open Source"/><category term="PostgreSQL"/><category term="Postgres Operator"/><category term="Backend"/></entry></feed>