Zalando Engineering Blog

Node.js and the tale of worker threads

2024-07-25T00:00:00+02:00

A disrupted gaming night

I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did.

I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it". I joined a call and was then told that the observed impact was due to one of the dependencies: the translation service, for which my on-call rotation was responsible. The translation service was indeed very slow to respond (its p99 latency had increased from 100ms to 500ms) and its error rate had gone from 0 to 4%. This did not really explain why the service calling us (the campaign service) was on a cloud resource consumption spree.

We started with distributed tracing, but the campaign service was not instrumented, so we could not get much out of our tracing tooling. We did see some context cancelled error messages on our request spans, which usually means that the connection was unexpectedly closed from the client side. We quickly moved on to logging and sure enough, we found the same evidence in the translation service logs: java.lang.IllegalStateException: Response is closed

We are relatively well instrumented at Zalando in terms of operations, especially with built-in Kubernetes dashboards. Using our Kubernetes API Monitoring Clients dashboard we confirmed that the calling service (the campaign service) was misbehaving and instead of its usual 1 000 requests per minute to the translation service, it was making over 20 000 requests per minute.

It looked like the campaign service was effectively increasing the pressure on our translation service. This meant that our translation service was then slower to respond and sometimes not responding at all, which in turn somehow increased the amount of requests that the campaign service was making and the cloud resources it was consuming.

We were looking at a positive feedback loop that was destabilising both systems. Fortunately for us, the loop eventually stopped when both systems reached their allocated cloud limits: memory for the campaign service and the maximum number of replicas for our translation service. This had been going on for several hours as the campaign service is not on the critical path of the customer journey, so 4% was a slow burn error and we were only paged because the team that owns the service started investigating this anomaly and found this interaction with our translation service.

In an attempt to resolve the situation, we reduced the number of pods for the campaign service and allowed our translation service to scale up, and sure enough the situation improved by itself in a matter of minutes. As I was about to pick up my game controller again, I took one last look at the graphs and, lo and behold, the error rate was back up and the positive feedback loop had resumed, as if in defiance of my gaming night.

Not so fast Tarnished

In Elden Ring, you have to retry boss fights quite a lot so I rolled up my sleeves and started investigating again. This time, resolved to understand the systems' "patterns".

Taking another look at the campaign service logs was quite interesting to say the least. Yes, it did start with a bunch of request failed, read timeout but then it was followed by a lot of logs like Worker fragment (pid: 51) died and Worker 549 started. When I say a lot, I mean A LOT, more than 20 per second in total.

At this point, we needed to understand where they were coming from and yes, I started reading the code on GitHub. We were dealing with a simple Node.js application. The entry point was a file called cluster.js and the first thing it did was get the number of CPUs from the OS and spawn a worker for each CPU core.

const cluster = require("cluster");
const numCPUs = require("os").cpus().length;

// master wrapper
if (cluster.isMaster) {
  console.log(`Master ${process.pid} is running`);
  console.log(`CPU Total ${numCPUs}`);

  // fork workers
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on("exit", (worker) => {
    // when worker exits
    console.warn(`Worker fragment (pid: ${worker.process.pid}) died`);
    cluster.fork();
  });
}

Pretty smart right? Node.js is single-threaded and you don't want to leave those precious CPU cores idle. Well that depends on where your code is running!

Following a migration, this service was now running on Kubernetes with pods requesting the equivalent of 1 CPU unit, so far so good. However, when evaluated inside a Kubernetes container, os.cpus().length returned the number of cores available on the host machine, instead of the amount of CPU allocated to the container by Kubernetes. At this point, the campaign service was running on machines with 48 cores so it was spawning a whopping 48 processes (yes, for Node.js, this is a whopping number, I can see you Golang people judging us). In fact, using cluster mode for Node.js in a Kubernetes environment is discouraged because Kubernetes can help you do this in a simple way out of the box, for example by setting the CPU request to 1000m to allocate one CPU core per pod.
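If cluster mode is kept at all, one way to avoid this mismatch is to derive the worker count from the container's CPU limit rather than from the host. The following is a minimal sketch, not what the campaign service ended up doing (we removed cluster mode instead). It assumes a Linux container using cgroup v2, where the CPU limit is exposed in /sys/fs/cgroup/cpu.max, and it assumes the pod sets a CPU limit, since a CPU request alone does not write a quota to that file. parseCpuMax and workerCount are hypothetical helper names.

```javascript
const fs = require("fs");
const os = require("os");

// Parse the contents of a cgroup v2 `cpu.max` file, which holds
// "<quota> <period>" (for example "100000 100000") or "max <period>".
// Returns the CPU limit in cores, or null when no limit is set.
function parseCpuMax(text) {
  const [quota, period] = text.trim().split(/\s+/);
  if (quota === "max") return null;
  const cores = Number(quota) / Number(period);
  return Number.isFinite(cores) && cores > 0 ? cores : null;
}

// Number of workers to fork: the container limit when one is readable,
// otherwise the host core count (the original behaviour).
function workerCount(cpuMaxPath = "/sys/fs/cgroup/cpu.max") {
  let limit = null;
  try {
    limit = parseCpuMax(fs.readFileSync(cpuMaxPath, "utf8"));
  } catch {
    // File missing (not Linux, or cgroup v1): fall back to the host count.
  }
  const hostCores = os.cpus().length;
  if (limit === null) return hostCores;
  return Math.max(1, Math.min(hostCores, Math.floor(limit)));
}
```

With a limit of one CPU, workerCount() would return 1 regardless of whether the node has 4 or 48 cores.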

Another interesting thing we could read in cluster.js was that when a worker thread exited, it immediately spawned another one, and we could quickly sense how this could lead to a dangerous situation. Well, while that explained the high number of workers and the logs we saw above, it still didn't explain why they kept exiting and spawning.
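To illustrate why the immediate respawn is risky, here is a hedged sketch of an alternative exit handler that backs off when workers keep dying shortly after they start. It is not the code the campaign service used; respawnDelayMs, superviseWorkers and the 30 second "healthy" threshold are assumptions made for the example.

```javascript
// Delay before respawning a worker: 0 ms after a healthy run, then
// 1 s, 2 s, 4 s, ... capped at maxMs for consecutive quick crashes.
function respawnDelayMs(consecutiveCrashes, baseMs = 1000, maxMs = 60000) {
  if (consecutiveCrashes <= 0) return 0;
  return Math.min(maxMs, baseMs * 2 ** (consecutiveCrashes - 1));
}

// `cluster` is the Node.js cluster module (or anything with the same
// fork()/on("exit") shape, which keeps the function easy to test).
function superviseWorkers(cluster, workerCount, minHealthyMs = 30000) {
  const startedAt = new Map();
  let consecutiveCrashes = 0;

  const spawn = () => {
    const worker = cluster.fork();
    startedAt.set(worker.id, Date.now());
  };

  for (let i = 0; i < workerCount; i++) spawn();

  cluster.on("exit", (worker) => {
    const lifetimeMs = Date.now() - (startedAt.get(worker.id) ?? Date.now());
    startedAt.delete(worker.id);
    // A worker that lived long enough resets the crash counter.
    consecutiveCrashes = lifetimeMs >= minHealthyMs ? 0 : consecutiveCrashes + 1;
    setTimeout(spawn, respawnDelayMs(consecutiveCrashes));
  });
}
```

In cluster.js this would be called from the primary process, for example superviseWorkers(cluster, numCPUs). A crash loop then slows down to at most one respawn per minute instead of hammering downstream services on every restart.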

Enter translation-fetcher.js, a file that exposes a method to fetch translations from a remote API (our translation service, for which I was paged). Interestingly, when the fetch call fails, the catch clause calls process.exit(1).

TranslationManager.fetchAll()
  .then((data) => {
    const fallbackFilename = "./fallback-translations.json";
    fs.writeFileSync(fallbackFilename, JSON.stringify(data, null, 4));
  })
  .catch((e) => {
    console.error(e);
    process.exit(1);
  });

So there we had it! We had 48 forked worker processes, most of which were exiting, respawning and trying to fetch translations again on startup. We felt pretty confident that we understood what was happening as this was the only place in the whole codebase where a worker thread could exit. We also concluded that the fact that our translation service was slower to respond and sometimes not at all was what fed the positive feedback loop I described above. Indeed, if the call to the translation service failed, the worker thread was killed and a new one spawned, triggering a new call to the translation service, and so on.

Now it was time to patch the issue, so I could get back to being slain by monsters in my video game. We updated the service to no longer use cluster mode as in fact a few pods would be more than able to handle the load even at peak traffic. We struggled to deploy the service to production as it hadn't been deployed for a while and we were missing some permissions, but that's too boring a story to go into. Once the service was deployed, the number of requests to our translation service dropped from 20 000 requests per minute to 100 requests per minute and the health of our translation service quickly recovered and the service even scaled down. What happened next in Elden Ring will stay in Elden Ring.

Digging deeper

Fast forward to Monday, we start working on a detailed post-mortem analysis describing what I wrote above and I decide to write this up as our Site Reliability Engineering (SRE) team loves to hate on Node.js. When I get to the part where I talk about translation-fetcher.js, I get perplexed. It does not really make sense to call process.exit() in a live environment.

Also, what about the response closed and context cancelled errors we were seeing? They did not match our current understanding of the worker being killed after the call itself failed. As I hate to share something I do not have a very good understanding of, I dove once more into the campaign service code and, lo and behold, I found a huge oversight we had made on that Friday night. The translation-fetcher.js code was not being called in the live environment. It was another file, obviously called translation.js, that was being called on application startup, still calling our translation service but returning fallbacks if the call failed.

function initTranslations() {
  return TranslationManager.fetchAll()
    .catch((error) => {
      console.error("catch error", error);
      return fallbackTranslations;
    })
    .then((initialData) => {
      return manager.watch(initialData);
    });
}

const server = initAndPrefetchOAuth()
  .then(() => initTranslations())
  .then(() => {
    app.listen(PORT);
    log.info("Server started");
  });

So there never was a positive feedback loop with the translation service, it was all up in our heads and I felt a bit stupid about it.

What was happening then? We still didn't understand why workers were being killed and respawning, which led to a very high amount of requests to our translation service. That did, however, put the focus back on the campaign service: what happened at 2am on that Friday that had never happened before, destabilising the service? Could the logs tell us more?

Luckily for me, someone was curious about the number of CPUs allocated and as the main application started, the code was logging the number of CPUs in the machine. So I scanned the last 30 days of logs and got the history of the number of CPUs for the allocated machines: it was always 4, 8 or 16. Well, except for last Friday but also on the 6th of April 2022 at 10:49 when the AWS gods had gifted us a 48-core machine, interesting... What was the state of the application at that point in time? Well it wasn't great, one pod, unsurprisingly allocated in the node (machine) with the 48 cores, was over-utilising both its CPU and memory allocations and was repeatedly being killed. At the exact same time, our beloved translation service had also begun to consume more resources, scaling massively from 4 to 20 pods despite only receiving twice as many requests.

Why did it not escalate at this point? Because once the pod was killed, despite being replaced twice by a pod on the same 48-core node, the third time it was replaced by a pod on a different node with only 16 cores. The campaign service generously requested 2GB of memory for each of its pods so with 4, 8 or 16 cores, it was only spawning 4, 8 or 16 extra workers on top of the main process. It turns out that the campaign service application process needs around 120 MB of memory to run properly so it was painfully able to accommodate up to 16 cores, but 48 cores meant that each process only had around 40 MB of memory (which is still 10 000 times more than the Apollo guidance computer that got us to the moon by the way) and around 20m CPU ("twenty millicpu"), which is really not that much for a single thread.
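The numbers above can be checked with a few lines of arithmetic. This is a simplified sketch that assumes 2GB means 2048 MB, that the pod has 1000m of CPU (the "1 CPU unit" mentioned earlier), and that resources are split evenly between the workers and the main process, which is of course not how an operating system actually shares them. perProcessBudget is a hypothetical helper name.

```javascript
// Even split of a pod's resources across `workers` forked processes
// plus the primary process that forked them.
function perProcessBudget(podMemoryMb, podCpuMillis, workers) {
  const processes = workers + 1;
  return {
    memoryMb: podMemoryMb / processes,
    cpuMillis: podCpuMillis / processes,
  };
}

for (const cores of [4, 8, 16, 48]) {
  const { memoryMb, cpuMillis } = perProcessBudget(2048, 1000, cores);
  console.log(
    `${cores} workers: ~${Math.round(memoryMb)} MB and ~${Math.round(cpuMillis)}m CPU per process`
  );
}
```

With 16 workers each process gets roughly 120 MB, right at what the application needs, while with 48 workers the share drops to roughly 42 MB and 20m CPU.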

At this point, I still did not understand why the Node.js worker threads kept dying, although I had an intuition that it was due to the low amount of resources available, but I could not see any garbage collection or memory issues in the stack traces. I decided to run the service locally, updated the cluster file to spawn 50 worker threads regardless of the number of CPUs, built it and started it in a Docker container. At first, I gave the container a single core from my 5-year-old MacBook and despite being excruciatingly slow, every worker thread spawned and triggered its initial request to get the translations. I repeated the operation, this time giving the container only 1000MB of memory and sure enough, after spawning around half of the workers, I saw the same logs as in production: Worker fragment (pid: 1) died, Worker 31 started.

That was the aha moment. Up to that point, I was expecting a clue as to why a worker would be killed by Node.js, but it never came: it turns out that Node.js simply starts killing worker threads when it needs to reclaim memory. And if you remember the code in cluster.js, immediately after the worker thread exited, the application spawned another one, so we end up with lots of worker threads spawning and dying in quick succession, living just long enough to say hello to our translation service. This also explains very well the context cancelled errors we saw in the translation service, because when the worker thread dies, the socket it created unexpectedly hangs up. It also explains well the read timeout errors in the campaign service as the processes did not have enough time (due to their very low CPU resource allocation) to read the translation service response. Unfortunately, this information was not readily available to us because the campaign service did not instrument its event loop lag, the degradation of which is a common root cause of API call read timeouts.
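For readers wondering what instrumenting event loop lag can look like, here is a minimal sketch built on Node.js's perf_hooks.monitorEventLoopDelay, which samples the delay into a histogram with values in nanoseconds. It is not the Zalando Observability SDK; startEventLoopLagReporter and the report callback are hypothetical names, and in a real service the callback would forward the values to a metrics backend.

```javascript
const { monitorEventLoopDelay } = require("perf_hooks");

// Periodically reports event loop delay statistics (in milliseconds)
// to `report`, then resets the histogram for the next window.
// Returns a function that stops the reporter.
function startEventLoopLagReporter(report, intervalMs = 10000) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    report({
      meanMs: histogram.mean / 1e6,
      p99Ms: histogram.percentile(99) / 1e6,
      maxMs: histogram.max / 1e6,
    });
    histogram.reset();
  }, intervalMs);
  // Do not keep the process alive just for metrics reporting.
  timer.unref();

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

A worker starved of CPU, like the ones in this incident, would show up as a p99 delay that climbs far above the few milliseconds seen on a healthy process.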

Building better observability

This story happened back in April 2022 and was one of the motivations for developing a Zalando Observability SDK for Node.js. Two years later, we have 53 Node.js applications instrumented with the SDK, which means that investigating incidents involving Node.js is now easier with common signals readily available. This will be the topic of a subsequent blog post, stay tuned!

End-to-end test probes with Playwright

2024-07-19T00:00:00+02:00

Why automated end-to-end tests?

What are automated end-to-end tests? Do you need them at all? In this blog post we dive into the ugly behind automated end-to-end testing, what we struggled with at Zalando, what worked well for us and our latest solution with end-to-end test probes.

Automated end-to-end tests continue to polarise the industry, with some leaders advocating for them and others rightfully questioning their return on investments and recommending to invest in monitoring and alerting systems instead.

Tweet on end-to-end testing from @GergelyOrosz on May 19th, 2024

Of course, the right approach always depends on your product and the impact of your application being unavailable for even a short period of time. At Zalando, the disruption of a critical customer journey can quickly add up to millions in lost revenue so there is an obvious value for us in ensuring the high quality of our releases and automated end-to-end tests are one of the best tools for the job. So when we release new versions of our Zalando website multiple times a day in a completely autonomous manner, each release goes through an automated quality assurance pipeline that includes end-to-end tests written with Cypress.

What are automated end-to-end tests?

Automated end-to-end tests simulate real user interactions with an application to ensure that the entire application stack works correctly from the user interface to the backend. These tests typically run in a headless browser environment and are thus easily integrated into continuous integration and delivery (CI/CD) pipelines. By automating these tests, teams can efficiently detect and address issues early, ensure regression testing, and maintain application quality as the code base evolves.

Investing in automated end-to-end tests

It really paid off for Zalando and helped us find bugs early on that would otherwise have caused major incidents. It has not been all nice and shiny though as we experienced what Gergely was complaining about: the tests were taxing to maintain and the most frustrating part of it all was that they were still a bit flaky. They had a success rate of around 80%, but with around 120 builds a day, that still meant an average of 24 builds a day which were failing as false positives, causing unnecessary friction.

We doubled down on our investment in these tests, which included creating better test setup context as we have highly dynamic content on Zalando and our product pages are highly contextual, sometimes with products not yet released to build anticipation and for which we obviously could not trigger the add to cart flow. We also improved our selectors and added a mechanism to detect when our pages are hydrated with React after server-side rendering, as Cypress would fail eagerly executing test scripts on a non-interactive UI. Our efforts increased the tests' reliability to the 95% range and we felt pretty good about it.

A new class of issues

You can imagine our disappointment when we had a major incident due to front-end interactivity issues where React hydration crashed on a large number of our product detail pages, preventing users from selecting product sizes and adding products to their shopping carts. The issue was large enough to have a business impact, but just not large enough to trigger an automated alert. How did this regression sneak in? It turned out that the incident was triggered by new and incomplete content published to our headless CMS which broke the front-end API contract with our API gateway and ultimately led to broken interactivity. We had React error boundaries in place, but it turned out that these weren't working for the eagerly-hydrated part of our product pages.

So we were almost back to square one: no matter how much we had invested in our end-to-end test automation, external factors could still lead to broken pages. Obviously, we will tighten up our monitoring and alerting as part of the incident process which seeks to systematically address contributing factors, but we also wanted to catch such interactivity issues more consistently. An idea came to mind: why not run our automated end-to-end tests periodically and alert when they fail? However, remember we had only achieved a 95% success rate with our end-to-end tests. If we were to run them every 30 minutes to ensure that our website was working as expected and page our on-call team upon failures, alerts would trigger several times a day and possibly at night, leading to incident fatigue for the on-call team, a state we did not want to be in. So we needed to further increase the reliability of our end-to-end tests if this was to become a viable solution.
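The "several times a day" estimate follows from simple arithmetic, sketched below. The calculation assumes that runs fail independently and that every failure at a 95% success rate is a false positive; falseAlertStats is a hypothetical helper name.

```javascript
// Expected number of spurious failures per day for a probe that runs
// every `intervalMinutes` and passes with probability `successRate`.
function falseAlertStats(successRate, intervalMinutes) {
  const runsPerDay = (24 * 60) / intervalMinutes;
  const expectedPerDay = runsPerDay * (1 - successRate);
  // Probability that at least one run fails during the day.
  const atLeastOnePerDay = 1 - successRate ** runsPerDay;
  return { runsPerDay, expectedPerDay, atLeastOnePerDay };
}

console.log(falseAlertStats(0.95, 30));
```

Running every 30 minutes means 48 runs a day, so a 95% success rate gives about 2.4 expected false pages per day, and the chance of getting through a whole day without one is below 10%.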

A simpler and better approach

We went back to the drawing board: what we needed was higher resiliency and one of the ways to achieve this is often through simplification. We decided that for the end-to-end test probes we would run a cron job with scenarios covering critical customer journeys. We started with a few scenarios: one test would cover landing on our home page, browsing to a gender page and clicking on a product, another would cover landing on our catalog page, applying a filter, clicking on a product and a final one would cover landing on a product page, selecting a size, adding the product to the cart and starting the checkout process. By focusing on a smaller number of features and interactions, we were able to reduce the likelihood of false positives.

Around the same time, we also held our internal Zalando Engineering Conference and one of the talks was about scaling automated end-to-end testing. Playwright, an end-to-end testing solution developed by Microsoft, was presented as a great solution for this thanks to its strong focus on resilient testing. Indeed, Playwright features:

"auto-wait" (no artificial timeouts)
"auto-retry" (web assertions), eliminating key causes for flaky tests
rich tooling options (tracing, time-travel) to debug and fix issues if failures occur
a unified API which works across all modern browsers
TypeScript out of the box

This was very compelling so we decided to use Playwright for these end-to-end test probes.

It was easy to get up and running with Playwright, especially for our now simple scenarios. We used fixtures to set up independent test contexts for scenarios such as getting a good product candidate for the product page landing test and disabling our cookie consent banner. Playwright's API was simple to pick up, making use of promises natively and augmenting standard CSS selectors which made us hit the ground running super quickly. Here is the final code for our catalog landing test which is only a few lines of code:

test("Test catalog landing journey for zalando", async ({ page }) => {
  //  navigate to catalog page
  const catalogNav = await page.goto(catalogLink);
  expect(catalogNav?.status()).toBe(200);
  await expect(page).toHaveURL(title);

  // we only wait to simulate a "real user behavior"
  // with playwright this is not necessary
  await page.waitForTimeout(1000);

  await page.getByRole("button", { name: /farbe/i }).click();
  await page.locator("label[for=colors-BLACK]").click();
  await page.getByText(/speichern/i).click();

  await expect(page.getByTestId("is-loading")).toBeVisible();
  await expect(page.getByTestId("is-loading")).not.toBeVisible();

  await page
    .locator("article[role=link]")
    .locator('a[href$=".html"]')
    .first()
    .click();

  await page.waitForLoadState("domcontentloaded");
  await expect(page).toHaveURL(/\.html/i);
});

We set up the tests to run as a cron job every 30 minutes and, instead of paging immediately when they failed, we first ran them in a "shadow" mode to validate their reliability: failures raised a low-priority alert that emailed the team. And it did trigger a couple of times, especially over the weekend. Each time we captured HTML reports as logs so that we could understand the issue, improve our selectors, implement local retry loops with expect.toPass and even cover tricky edges with selectors targeting non-visible content thanks to Playwright's automatic augmentation of pseudo-classes like :visible. After a few weeks, we stopped getting alerts in shadow mode and enabled paging when those tests failed. So far they have only paged us once, and that was during an incident where the page was actually not working.
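To make the idea of a local retry loop concrete, here is a small library-independent sketch of the pattern: re-run a block of steps and assertions until it passes or a deadline is reached. This is not Playwright's implementation of expect.toPass, only an illustration of the concept; retryUntilPass and its options are hypothetical names.

```javascript
// Re-runs `assertion` until it stops throwing or `timeoutMs` elapses.
// Resolves with the assertion's return value, or rejects with the last
// error once the deadline has passed.
async function retryUntilPass(assertion, { timeoutMs = 5000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let lastError;
  do {
    try {
      return await assertion();
    } catch (error) {
      lastError = error;
      // Wait a little before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  } while (Date.now() < deadline);
  throw lastError;
}
```

Wrapping only the few steps that are known to be timing-sensitive keeps the retry local, so a genuine failure elsewhere in the journey still surfaces quickly.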

Outlook

It has been quite a journey to get to where we are now, but we feel pretty good about our setup, which we could not have achieved without focusing on simplicity and betting on Playwright's reliability. If, like us, having production downtime is damaging to your business, we believe that implementing end-to-end test probes could be a useful addition to your toolkit. Our main advice would be to keep these tests focused on your critical customer journeys, write good selectors and iterate in a shadow mode before alerting in production.

We are planning to increase the number of scenarios for the end-to-end probes to include more of our Critical Business Operations (CBOs) and we are also looking at extending this idea to our mobile apps.

Custom Navigational Transitions in iOS

2024-07-04T00:00:00+02:00

Introduction

In present mobile development, the emphasis lies on achieving both speed and personalization. As the demand for rapid delivery intensifies, continuously improving the user experience for customers is essential.

One avenue through which this aspiration materializes is via screen transitions. These transitions serve a dual purpose: they facilitate seamless navigation while striving to establish a sense of continuity in user interactions, transcending the mere act of moving from one screen to another.

In this article, we will focus on screen transitions for iOS apps. Rather than implementing a custom transition for a basic scenario, which many resources already cover, we will explore a real example from Zalando's iOS App showcasing navigation between two screens that are entirely backend-driven.

Navigation Transition

In our prior article Backend-driven UI for mobile apps, we explained how the screen functions as a composed structure of a limited number of primitive components within the framework. So our problem space is: How to enhance navigational experience in a Backend-driven UI system? To understand that challenge, we will break down what is needed to implement one. But first, let's have a look at the status quo of a transition from an outfit-card to outfit-details screen.

Here, one of the outfits from the carousel is tapped and an outfit-details screen is pushed on the navigation stack with the default transition. Notice that the image in the carousel and the image on the detail screen are the same, so the interaction could be enhanced in many ways here. One way is to build a custom navigational experience, where the image that is interacted with grows into the detailed view (similar transitions can be noticed on the iOS App Store for reference).

For static content, implementing the UIViewControllerAnimatedTransitioning protocol provided by UIKit's View Controller Transitions API and using a custom navigation delegate would be enough. In our scenario, however, the process isn't straightforward due to the following facts:

Backend-driven UI: Given that the UI of the initial screen is determined by the backend, identifying the user's interaction—whether it's with an image or a layout—poses a challenge. We require precise information about the tapped view, including its position and size (i.e., its frame within the screen).
Generic deep-link navigation: With a generic deep-link navigation approach, the URL is passed to the router, which handles the navigation independently in a separate module. This means that the router lacks the context of the next screen, complicating the transition process further.

When an outfit-card is tapped (event), it triggers a deep link navigation (action). This action is propagated from the Appcraft iOS framework to the Zalando App to be handled by a common router. We can intercept this flow and identify the location of the tap event. Once we do that, we can take a snapshot of the tapped view, which in this case is an outfit-card. This solves the first problem stated above.

Code caption: Method initially used to capture the tapped view and convert into an image

extension UIView {
    func asImage() -> UIImage {
        let renderer = UIGraphicsImageRenderer(bounds: bounds)
        return renderer.image { rendererContext in
            drawHierarchy(in: bounds, afterScreenUpdates: true)
        }
    }
}

Code caption: Once we have a snapshot to work with, we propagate the UIImage and its frame to the framework's navigation service, enabling us to pass this information to the router for handling the transition. Implementing the navigation controller delegate and UIViewControllerAnimatedTransitioning then gives a transition process similar to the following:

// At the call site
let navigationController = UINavigationController(
    rootViewController: initialViewController
)
navigationController.delegate = CustomNavigationDelegate()
navigationController.pushViewController(nextViewController,
                                        animated: true)

// Custom Navigation Delegate
class CustomNavigationDelegate: NSObject,
                                UINavigationControllerDelegate {
    func navigationController(
        _ navigationController: UINavigationController,
        animationControllerFor operation: UINavigationController.Operation,
        from fromVC: UIViewController,
        to toVC: UIViewController
    ) -> UIViewControllerAnimatedTransitioning? {
        if operation == .push {
            return SourceScaleTransition()
        }
        return nil
    }
}

// SourceScaleTransition class
final class SourceScaleTransition: NSObject,
                                   UIViewControllerAnimatedTransitioning {
    let transitionInfo; // contains the image and its frame

    public func transitionDuration(
        using transitionContext: UIViewControllerContextTransitioning?
    ) -> TimeInterval {
        animationDuration
    }

    func animateTransition(
        using transitionContext: UIViewControllerContextTransitioning
    ) {
        guard let _ = transitionContext.viewController(forKey: .from),
              let toViewController = transitionContext.viewController(forKey: .to) as?
                SnapshotTransitionPushedController else { return }

        let containerView = transitionContext.containerView

        let animatingView = transitionInfo.sourceView
        containerView.contentMode = .scaleAspectFill
        containerView.addSubview(toViewController.view)
        containerView.addSubview(animatingView)

        toViewController.view.layoutIfNeeded()

        let finalFrame = calculatedFrame;
        // calculate final frame based on the destination and app safe areas
        toViewController.snapshotFromSourceView = animatingView
        animatingView.frame = transitionInfo.sourceRect

        toViewController.view.isHidden = true
        UIView.animate(withDuration: animationDuration,
                       delay: 0.0, animations: { [weak self] in
            animatingView.frame = finalFrame
        }) { finished in
            toViewController.view.isHidden = false
            transitionContext.completeTransition(true)
        }
    }
}

In addition to the above, we also created a protocol for destination controllers so that the transition concluded in a smooth way:

/// Destination ViewController must conform to
/// `SnapshotTransitionPushedController`
/// so that the snapshot could be seamlessly added
/// & removed from transitional view
public protocol SnapshotTransitionPushedController: UIViewController {

    /// `snapshotFromSourceView` is the snapshot of the view tapped.
    ///  It was propagated with deeplink information &
    ///  will be scaled in an animating view in a Custom Transition
    var snapshotFromSourceView: UIView? { get set }

    /// Call `removeTransitionalView()` to remove the snapshot.
    /// Example, when view has loaded/rendered.
    func removeTransitionalView()
}

Although initially promising, this approach proved insufficient for production use. Issues such as image pixelation and awkward text scaling, leading to abrupt disappearances, were observed. We identified two key problems that needed addressing:

Selective rendering: Not all components are necessary for the transition, and those that are not should be omitted.
Quality of Scaling view: The transition should occur smoothly without pixelation, ensuring high-quality visuals throughout.

Our solution involved devising an approach where the tapped layout undergoes recursive traversal and re-rendering to produce a high-quality snapshot. This recursive methodology offers the added advantage of enabling us to selectively choose the components essential to the transition. Each component autonomously manages the rendering of its snapshot, enhancing the efficiency and precision of the process.

Below is a simplified version of selective rendering where Label & Button Components are ignored while rendering a snapshot view of a Composed component. The Image Component has dedicated handling of the snapshot(renderer:) method, shown further below.

extension ComponentRenderer {
    func snapshot(renderer: Renderer) -> UIView {
        // Selective Rendering
        if self is LabelComponent || self is ButtonComponent {
            return EmptyView()
        }
        // Implement this method in relevant components
        // for dedicated handling
        return snapshot(renderer: renderer)
    }
}

Render an actual view, and not just a snapshot to get a good quality transitional view

struct Image: ComponentRenderer {
    ...
    func snapshot(renderer: Renderer) -> UIView {
        UIImageView(image: props.image)
    }
}

Let's look at the resulting outfits-card transition:

Isn't it much better than the vanilla transition? It definitely is! Bonus - The same transition can now be enabled for other screens since it is in a generic screen framework and backend driven.

To conclude, each interaction is unique, and there's no one-size-fits-all solution, but this is a solid starting point. By collaborating with designers, engineers can create smooth, visually appealing animations. While these enhancements are not must-haves, they contribute significantly to a more enjoyable user experience. By focusing on advanced aspects of UIKit's View Controller Transitions API, you can improve your app's aesthetics and functionality, making it more engaging for users.

Failing to Auto Scale Elasticsearch in Kubernetes

2024-06-21T00:00:00+02:00

Introduction

In Lounge by Zalando, we run an Elasticsearch cluster in Kubernetes to store user-facing article descriptions. Our business model is such that we receive about three times the normal load during the busy hour in the morning, and therefore we use schedules to automatically scale applications in and out to handle that peak. If scaling out in the morning fails, we face a potential catastrophe. This is a story of one such case.

First anomaly

Early Tuesday morning, our on-call engineer received an alert about too few running Elasticsearch nodes. We started executing the playbook to handle such a case, but before we had time to go through all the steps, the missing nodes popped up and the alert closed on its own. Catastrophe was avoided for now, but after a cup of coffee, the root cause analysis followed.

Investigating the logs, we found that the cluster had failed to fully scale down for the night. The cluster was configured to run 6 nodes during the night, but it got stuck running 7 nodes.

To understand why that happened and why it is interesting, a little bit of context is required. We run Elasticsearch in Kubernetes using es-operator. Es-operator defines a Kubernetes custom resource, ElasticsearchDataSet (EDS), that describes the Elasticsearch cluster. It monitors changes to it and maintains a StatefulSet that consists of pods and volumes that implement the Elasticsearch nodes. We’ve configured our cluster so that the pods running it are spread across all AWS availability zones, and Elasticsearch is configured to spread the shards across the zones.

For us, schedule-based scaling is implemented by a fairly complex set of cronjobs that change the number of nodes by manipulating the EDS for our cluster. There are separate cronjobs for scaling up at various times of day and scaling down at other times of day.

The pods in a StatefulSet are numbered and the one with the highest number is always chosen for removal when scaling in. Just before the nightly scale-in target was reached, we were running the following pods in the shown availability zones:

es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a

The pod to be scaled in next is es-data-production-v2-6. The first step is for es-operator to drain the node, i.e. to request Elasticsearch to relocate any shards away from it. Here though, the node to be drained is the only one located in eu-central-1a. Due to our zone awareness configuration, Elasticsearch refused to relocate the shards on it. Es-operator has quite simple logic here: it requests that the shards be relocated, checks whether that happened, and keeps retrying up to 999 times before giving up. This kept happening throughout the night and, quite unbelievably, the retries ran out just two minutes after we got the alert. Then es-operator carried on with scaling out and the problem resolved itself. The timing here is quite surprising, but occasionally such things occur.
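
The situation above can be reproduced with a few lines of Python. This is only an illustrative sketch, not es-operator code: the helper names (ordinal, next_pod_to_remove, is_last_in_zone) are hypothetical, and the placement table is copied from the listing above.

```python
from collections import Counter

# Pod-to-zone placement copied from the listing above.
PODS = {
    "es-data-production-v2-0": "eu-central-1b",
    "es-data-production-v2-1": "eu-central-1c",
    "es-data-production-v2-2": "eu-central-1b",
    "es-data-production-v2-3": "eu-central-1c",
    "es-data-production-v2-4": "eu-central-1c",
    "es-data-production-v2-5": "eu-central-1c",
    "es-data-production-v2-6": "eu-central-1a",
}


def ordinal(pod_name: str) -> int:
    """Return the StatefulSet ordinal, i.e. the number after the last dash."""
    return int(pod_name.rsplit("-", 1)[1])


def next_pod_to_remove(pods: dict) -> str:
    """Scaling in always removes the pod with the highest ordinal."""
    return max(pods, key=ordinal)


def is_last_in_zone(pods: dict, pod_name: str) -> bool:
    """True if no other pod runs in the same availability zone."""
    zone_counts = Counter(pods.values())
    return zone_counts[pods[pod_name]] == 1


victim = next_pod_to_remove(PODS)
print(victim, is_last_in_zone(PODS, victim))
```

A check like this, run before a scale-in, would flag that draining the chosen pod leaves a zone without any node.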

Initial root cause analysis

Something in the above is not quite right though. The intended behaviour of es-operator is as follows: it constantly monitors updates to EDS resources and, if a change is observed, it compares the state of the cluster to the description and starts to modify the cluster to match its description. If the EDS gets changed one more time during that process, es-operator should abort the process and start modifying the cluster to match the new desired state.

This was exactly our situation. Es-operator was still processing the EDS update for the nightly scale-in when it received another EDS update to start scaling out for the morning. We spent much of the next day tracing through the es-operator source code and finally realised there was a bug in the retry logic for draining nodes during scale-in: in this one specific retry loop, context cancellations are not reacted to. The bug is specific to draining a node and doesn’t apply to other processes. It’s fixed now, so remember to upgrade if you are running es-operator yourself.
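
To illustrate the class of bug, here is a minimal Python sketch of a retry loop that does react to cancellation. It is not the es-operator implementation (which is written in Go): threading.Event stands in for a cancellable context, and drain_with_retries, shards_remaining and the return values are hypothetical names chosen for this example.

```python
import threading


def drain_with_retries(shards_remaining, cancelled, max_retries=999, wait_seconds=0.0):
    """Poll until the node is empty, the retry budget is spent, or we are cancelled.

    shards_remaining: callable returning the number of shards still on the node.
    cancelled: threading.Event that is set when a newer desired state arrives.
    Returns a (status, attempts) tuple.
    """
    for attempt in range(1, max_retries + 1):
        # Check for cancellation before doing any work in this round.
        if cancelled.is_set():
            return "cancelled", attempt - 1
        if shards_remaining() == 0:
            return "drained", attempt
        # Event.wait returns True as soon as the event is set, so the
        # pause between retries is interrupted by a cancellation as well.
        if cancelled.wait(wait_seconds):
            return "cancelled", attempt
    return "gave_up", max_retries


# Simulate a node whose shards never move; a new update arrives on the third poll.
polls = {"count": 0}
cancel = threading.Event()


def stuck_node():
    polls["count"] += 1
    if polls["count"] == 3:
        cancel.set()
    return 5


result = drain_with_retries(stuck_node, cancel)
print(result)
```

Without the two cancellation checks, the same loop would keep polling until all 999 retries were used up, which matches the behaviour we observed during the night.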

Still, something is not quite right. Why did this happen on Tuesday and never before? We never scale in to fewer than 6 pods and, as explained above, the pod to scale in is always the one with the greatest number. Therefore, the pods numbered 0 to 5 should remain untouched. The pods running Elasticsearch are run as a StatefulSet by es-operator. If that StatefulSet was using an EBS-backed volume, Kubernetes would guarantee not to move the pods between zones. We, however, don’t store unrecoverable data in our Elasticsearch, so we can afford to run it on top of ephemeral storage. Nothing is strictly guaranteed for us then. Normally, pods remain quite stable in a zone nevertheless, but on Monday, the day before the first anomaly, our Kubernetes cluster was upgraded to version 1.28. This process has likely caused pods to be rescheduled onto nodes in different availability zones, though we have not done a full deep dive into the upgrade process to confirm this.

The first fix that didn’t work

As a quick fix, we just increased the number of nodes running during the night. This way, the nightly scale-in job wouldn’t try to drain es-data-production-v2-6, the last node in eu-central-1a and it wouldn’t get stuck the way it did the previous night. We might want to consider something else for a longer term, but this should stop us from failing to scale out the next morning.

Still, the next morning, we received the exact same alert once again. And after a few minutes, the alert closed on its own the same way as the day before.

This time we were unable to scale in from 8 to 7 nodes, which did work fine the day before. Looking at the node distribution:

es-data-production-v2-0 eu-central-1b
es-data-production-v2-1 eu-central-1c
es-data-production-v2-2 eu-central-1b
es-data-production-v2-3 eu-central-1c
es-data-production-v2-4 eu-central-1c
es-data-production-v2-5 eu-central-1c
es-data-production-v2-6 eu-central-1a
es-data-production-v2-7 eu-central-1a

Why was es-operator not able to drain es-data-production-v2-7? This time it’s not the last node in eu-central-1a.

Digging into this revealed another bug in es-operator. The process for scaling in a node, in a bit more depth, looks like the following:

Mark the node excluded (cluster.routing.allocation.exclude._ip) in Elasticsearch. This instructs Elasticsearch to start relocating shards from it.
Check from Elasticsearch whether any shards are still located in the given node. If yes, repeat from the beginning.
Remove the corresponding pod from the StatefulSet.
Clean up node exclusion list (cluster.routing.allocation.exclude._ip) in Elasticsearch.

Pondering about the above, you are likely to guess what was wrong this time. If the scaling down process gets interrupted, the clean up phase is never executed and the node stays in the exclusion list forever. So, es-data-production-v2-6, which failed to scale in the day before, was still marked as excluded and Elasticsearch was unwilling to store any data in it. In effect, es-data-production-v2-7 was the only usable node in eu-central-1a.
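
The leak can be modelled in a few lines. The sketch below is a simplified simulation under stated assumptions, not es-operator code: scale_in, Interrupted and usable_in_zone are hypothetical names, and nodes are identified by short names here, whereas the real exclusion setting holds IP addresses.

```python
class Interrupted(Exception):
    """Raised when a scale-in is aborted part-way (hypothetical)."""


def scale_in(node, excluded, shards_left, interrupted):
    excluded.add(node)                  # 1. mark the node as excluded
    while shards_left(node) > 0:        # 2. wait for the shards to leave
        if interrupted():
            raise Interrupted(node)     # steps 3 and 4 are never reached
    # 3. the pod would be removed from the StatefulSet here
    excluded.discard(node)              # 4. clean up the exclusion list


def usable_in_zone(zones, excluded, zone):
    """Nodes in the zone that Elasticsearch may still allocate shards to."""
    return sorted(n for n, z in zones.items() if z == zone and n not in excluded)


zones = {"v2-6": "eu-central-1a", "v2-7": "eu-central-1a"}
excluded = set()

# Night one: draining v2-6 is interrupted before the clean-up step runs.
try:
    scale_in("v2-6", excluded, shards_left=lambda n: 3, interrupted=lambda: True)
except Interrupted:
    pass

print(excluded)
print(usable_in_zone(zones, excluded, "eu-central-1a"))
```

After the interrupted run, v2-6 remains excluded, so v2-7 is the only node in eu-central-1a that can hold shards and cannot be drained the next night.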

The second fix

Manually removing the “zombie” node from the exclusion list is simple, so we did exactly that to mitigate the immediate problem.
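
As a rough sketch of what such a manual clean-up amounts to, the snippet below builds the request body for the Elasticsearch cluster settings API (PUT _cluster/settings). The IP addresses are made up, the helper names are hypothetical, and the choice of the "transient" scope is an assumption; check which scope your cluster actually uses before applying anything like this.

```python
import json

SETTING = "cluster.routing.allocation.exclude._ip"


def remove_from_exclusion(current_value: str, zombie_ip: str) -> str:
    """Drop one IP from the comma-separated exclusion list."""
    remaining = [
        ip.strip()
        for ip in current_value.split(",")
        if ip.strip() and ip.strip() != zombie_ip
    ]
    return ",".join(remaining)


def settings_body(new_value: str, scope: str = "transient") -> dict:
    """Build the body for PUT _cluster/settings.

    An empty list is sent as null, which resets the setting.
    The scope ("transient" or "persistent") is an assumption.
    """
    return {scope: {SETTING: new_value or None}}


# Hypothetical addresses: 10.2.0.6 is the zombie, 10.2.0.9 is a node still draining.
new_value = remove_from_exclusion("10.2.0.6, 10.2.0.9", "10.2.0.6")
print(json.dumps(settings_body(new_value)))
print(json.dumps(settings_body(remove_from_exclusion("10.2.0.6", "10.2.0.6"))))
```

Rebuilding the list rather than blindly clearing it keeps any node that is legitimately being drained at that moment excluded.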

Fixing the underlying bug in a reliable and safe way is much more involved. Just adding a special if clause for cleaning up in case of cancellation would solve the simple instance of this problem. But we are potentially dealing with partial failure here. No number of if clauses would solve the problem when es-operator crashes in the middle of the draining process. There’s a PR in progress to handle this, but at the time of writing the bug still remains and we currently accept the need to deal with these types of exceptional situations manually.

Finally

As an embarrassing postlude to this story, we received the same alert one more time the next day. The quick fix we did the day before only touched the major nightly scale down job, but ignored another one related to a recent experimental project. It was a trivial mistake, but enough to cause a bit of organisational hassle.

Well, we fixed the remaining cronjob and that was finally it. Since then we’ve been running hassle free.

What did we learn from all this? Well: read the code. For solving difficult problems, understanding the related processes in abstract terms might not be enough. The details matter, and the code is the final documentation for those. It also mercilessly reveals any bugs that lurk around.

Next level customer experience with HTTP/3 traffic engineering

2024-06-18T00:00:00+02:00

TL;DR: HTTP/3 has gathered consensus in the industry as the best technical solution for improving the Web protocol stack. Usage statistics indicate that 29.8% of websites worldwide have already embraced HTTP/3 to cater to their users, with Zalando being among them. The architecture of HTTP/3, coupled with the underlying QUIC transport, introduces concurrent access and low-latency capabilities to solutions, facilitated by user-space flow and congestion controls operating over the User Datagram Protocol (UDP). QUIC is used by 8.0% of all websites. The result is an enhanced customer experience that fundamentally transforms content consumption, promising visually stunning displays on customers' mobile screens. This post will delve into the intricacies of HTTP/3 traffic engineering, Zalando's experience with it and our vision for next steps.

The significance of HTTP/3 adoption

Nowadays, 85% of total Internet traffic is TCP traffic. HTTP traffic takes about 54.6% and 54.4% of it is the traffic to mobile devices. TCP was developed in the 70s of last century to build reliable client/server communication. The TCP-based family of Web protocols, specifically HTTP/1.0, HTTP/1.1 and HTTP/2, inherits the legacy TCP inefficiencies for building concurrent and low-latency Web applications on wireless networks. Looking in-depth on the protocol stack involved for end-to-end communication, there are issues in (1) network infrastructure utilisation and (2) protocol design:

(1) Issues with Utilisation of IP Network: The Internet comprises a heterogeneous mix of packet-switched networks, including ISP Access Networks, ISP Core Networks, and numerous Tier 1/2/3 telecom carriers. For European customers connecting to load balancers deployed in the eu-central-1 region, packets traverse about 15 hops. Each hop introduces a blend of processing, waiting times, and the inherent risks of packet loss or network congestion, particularly when nodes or links are strained beyond capacity. Additionally, the architecture of the access network, encompassing its physical medium and the transmission delays it incurs, further compounds these challenges. Furthermore, the saturated capacity of the radio spectrum utilised for communication within the access network adds another layer of complexity to contend with.

(2) Issues with Protocol design: Recent development of the Web protocol stack has brought several notable improvements, foremost among them a reduction of the excessive signalling and handshakes required by the upper protocols to negotiate communication parameters prior to payload transfer. Despite this, each "cold" HTTP/2 request necessitates approximately 5 to 6 round-trips, including 1xDNS, 1xTCP, 3xTLS, and 1xHTTP handshakes, contributing to significant network signalling overhead. Moreover, TCP, functioning as a single ordered stream of bytes, lacks concurrent multiplexing capabilities for application traffic over the transport layer. Consequently, any networking failure, such as packet loss or congestion, results in the blocking of the entire byte stream, hindering performance and responsiveness. Existing Transport Congestion Control algorithms often fail to optimise network bandwidth utilisation, leading to suboptimal performance and efficiency. Additionally, poorly designed protocols contribute to fragmentation and reassembly, necessary for packets to traverse links with smaller Maximum Transmission Units (MTUs) than the original packet size. This fragmentation process increases the likelihood of excessive retransmissions in the event of packet loss, further impeding network efficiency and reliability.

It has been proven by the industry that customers love fast experiences in both applications and web sites. About 70% of mobile app users will stop using an app if it is taking too long to load. Slow “pages” have a higher bounce rate, and the “speed” of a site is considered a ranking signal for search. Having a fast site makes for a good user experience, which helps improve rankings and brings in visitors, keeps them on your site and ultimately leads to more conversions.

Knowing these issues, we assume that the first group of factors, related to network infrastructure, will remain unchanged in the near future (3 to 5 years). The infrastructure improvements are driven by economics. Only remediation of the second group of factors, related to protocol design, can bring about a significant improvement of the customer experience. We also assume mobile device replacement is seasonal, with longer or shorter cycles depending on country and economic situation, but certain.

HTTP/3 has gathered consensus as the best technical solution to the second group of problems related to protocol design at this time.

What enhancements does HTTP/3 bring?

In the past, the industry has made multiple attempts to improve protocol design through Structured Streams Transport (SST), Stream Control Transmission Protocol (SCTP), Multipath TCP (MP-TCP) and kernel-less TCP/IP implementations (e.g. uIP and lwIP). None of these became widely adopted because they focused on the transport layer only and lacked an end-to-end Web perspective. In June 2022, the IETF published HTTP/3 as a Proposed Standard, which is built over a new protocol called QUIC (standardised in May 2021).

QUIC is a transport layer network protocol. In contrast to TCP, it implements flow and congestion control in user space over the User Datagram Protocol (UDP). Its new architecture is built on the principle of cooperation between protocols rather than strict OSI layering. The protocol addresses:

Multiplexing: TCP is a single stream that guarantees strict ordering of bytes. Any concurrency requires multiplexing over a single stream. Network conditions (e.g. packet losses, congestion) cause the TCP stream to become a bottleneck that blocks all senders / receivers on this stream. QUIC multiplexes streams over UDP datagrams; each stream is independent and has its own flow control, while congestion control is applied to the connection as a whole. QUIC also controls the fragmentation and packetisation of payload, producing optimal network datagrams.

Handshake: Each “cold” HTTP/2 request demands about 5 to 6 round-trips (1xDNS, 1xTCP, 3xTLS, 1xHTTP). HTTP/3 requires 3 round-trips (1xDNS, 1xQUIC, 1xHTTP). The QUIC handshake combines negotiation of cryptographic and transport parameters. The handshake is structured to permit the exchange of application data as soon as possible, reducing the actual waiting time to a single round-trip. Peers establish a single QUIC connection that multiplexes a large number of parallel streams. The handshake is only required once; setting up a stream is an instant operation and does not require any additional handshake.

TLS: Traditional layered architecture has an isolated security and transport layer, causing significant overhead to negotiate encryption keys and transmit encrypted data. Customers perceive bad experiences when the chain of TLS certificates exceeds 4KB and TLS records are fragmented into multiple packets. QUIC adopts TLS 1.3 by default and integrates the security protocol into the transport (each individual packet is encrypted).

Congestion: QUIC provides an open architecture for congestion control, whereas TCP implements it on the kernel side of the operating system. QUIC does not aim to standardise the congestion control algorithms; it provides generic signals for congestion control, and the sender is free to implement its own congestion control mechanisms. As a benefit, the sender can align the payload to the actual size of the congestion window. The user-space design also leads to performance inefficiencies, as it involves copying extra packet data from kernel memory to user memory, so research on improving that efficiency is key.

Handover: QUIC connections are not strictly bound to a single network path. The protocol supports the connection transfer to a new network path, ensuring a low-latency experience when consumers switch from mobile to WiFi. With HTTP over TCP, such a switch always requires a “cold” start.
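
To make the handshake numbers above concrete, here is a small back-of-the-envelope calculation. It is a sketch under simplifying assumptions: every round-trip, including the DNS lookup, is charged the same hypothetical 50 ms, and the upper bound of 6 round-trips is used for HTTP/2.

```python
# Round-trips per "cold" request, as listed in the Handshake item above.
HTTP2_ROUND_TRIPS = {"DNS": 1, "TCP": 1, "TLS": 3, "HTTP": 1}
HTTP3_ROUND_TRIPS = {"DNS": 1, "QUIC": 1, "HTTP": 1}


def cold_start_ms(round_trips: dict, rtt_ms: float) -> float:
    """Time spent waiting on the network before the first response arrives."""
    return sum(round_trips.values()) * rtt_ms


RTT_MS = 50  # hypothetical mobile round-trip time, not a measurement

http2_ms = cold_start_ms(HTTP2_ROUND_TRIPS, RTT_MS)
http3_ms = cold_start_ms(HTTP3_ROUND_TRIPS, RTT_MS)
print(http2_ms, http3_ms, http2_ms - http3_ms)
```

Under these assumptions a cold HTTP/2 request waits 300 ms and a cold HTTP/3 request 150 ms, and the saving grows linearly with the round-trip time.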

Outstanding HTTP/3 protocol challenges

QUIC has emerged as a serious alternative to TCP in the Web domain. Unfortunately, QUIC and HTTP/3 are not a “silver bullet” for concurrency and low latency. Open issues remain for engineers to consider during application development:

Multiplexing: Stream frames are multiplexed into a single QUIC packet, and QUIC packets are coalesced into a single UDP datagram. The congestion or loss of datagrams causes a similar effect as on TCP. The application needs to implement its own traffic prioritisation schema(s) to mitigate the effect if necessary.

Memory management: HTTP/3 and QUIC demand a greater commitment of memory resources than the traditional Web protocol stack. HTTP/3 mitigates the protocol overhead with various compression techniques, but stream-oriented ordering of bytes requires excessive buffering of any data that is received out of order. Additionally, a user-space implementation leads to performance inefficiencies as it involves copying extra packet data from kernel memory to user memory.

Traffic shaping and security: networking infrastructure was dominated by TCP for so long that the network itself has developed indirect dependencies on it. ISPs enforce different traffic routing policies for TCP vs UDP traffic, and various in-network optimisation techniques such as Quality of Service and Active Queue Management affect UDP. The massive adoption of QUIC would require reconfiguration of networking gear. For example, Facebook reported issues with client-side heuristics about TCP, heuristics for estimating the available download bandwidth, bottlenecks in the Linux kernel on UDP packet processing, and new load balancing and firewall policies.

Congestion control: There is no ultimate solution in this problem domain. QUIC inherits its algorithms from TCP. Historically, congestion control was owned by “hardware” companies - those who developed networking equipment and operating systems. QUIC shifts the ownership, because of its user-space implementation, towards “software” companies - those who own Web browsers. Nowadays, NewReno (1999), CUBIC (2008) and BBR (Bottleneck Bandwidth and Round-trip propagation time, 2016) are the common heuristic congestion control algorithms. The QUIC standard is confusing here: it proposes NewReno as the default algorithm, although CUBIC is the dominant algorithm for broad internet traffic today. Also, the BBR algorithm has increased its share in practical implementations and can be expected to become the dominant algorithm in the future. A positive side effect of shifting congestion control to user space is that it unblocks innovation (e.g. there is research into adopting Deep Reinforcement Learning to boost customer experience).

MTU: The QUIC protocol, as it is being standardised by the IETF, does not support network MTUs smaller than 1280 bytes. It makes the protocol compatible with IPv6 networks (1280 bytes is the minimum IPv6 MTU). However, this poses challenges for networks operating on "non-standard" IPv4 configurations, potentially leading to packet fragmentation, especially on radio channels. Presently, the industry predominantly adheres to Ethernet standards, assuming a physical link MTU of 1500. While larger datagrams are feasible, they necessitate the utilisation of the Path Maximum Transmission Unit Discovery protocol to ensure optimal performance and compatibility across diverse network environments.
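
As an illustration of what a heuristic congestion control algorithm from the Congestion control item looks like, the sketch below evaluates the CUBIC window growth function W(t) = C * (t - K)^3 + W_max with K = cbrt(W_max * (1 - beta) / C). The constants C = 0.4 and beta = 0.7 are the defaults from the CUBIC specification; the window size of 100 segments is a made-up example, and the function names are ours.

```python
C = 0.4      # scaling constant from the CUBIC specification
BETA = 0.7   # multiplicative decrease factor applied after a loss


def cubic_k(w_max: float) -> float:
    """Seconds needed to grow back to w_max after a loss event."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in segments) t seconds after the last loss."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


W_MAX = 100.0  # hypothetical window size at the moment of the loss

for t in (0.0, 2.0, cubic_k(W_MAX), 6.0):
    print(round(t, 2), round(cubic_window(t, W_MAX), 1))
```

Right after the loss the window drops to beta * W_max, then grows quickly, flattens out around the previous maximum and only afterwards probes for more bandwidth. The decision is driven purely by packet loss and elapsed time, which is why such algorithms cannot tell radio interference from real congestion.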

Viewing HTTP/3 from the Radio Access Network (Physical Link) angle

The architecture of the HTTP/3 protocol assumes low latency and high reliability within access networks. While the QUIC protocol brings notable enhancements for "interactive" communication over 3G/4G/LTE wireless networks, it has not specifically addressed the unique attributes of 5G networks. It's crucial to note that 5G networks are poised to solve latency issues effectively. Engineers need to be aware of the limitations within Radio Access Networks and carefully weigh the adoption of 5G technology, particularly in the European context. 5G stands out for its remarkable speed capabilities, boasting peak data rates of up to 20 Gigabits-per-second (Gbps) and average data rates exceeding 100 Megabits-per-second (Mbps). Unlike its predecessor, 4G, 5G exhibits significantly enhanced capacity, designed to accommodate a 100-fold surge in traffic capacity and network efficiency. Theoretical estimates suggest that 5G can support up to 1 million devices per square kilometer, showcasing its immense potential for accommodating the burgeoning demands of modern connectivity.

Advertisements about 5G talk about millimeter-wave (mmWave), but the 5G technology is built over three frequency bands: (a) low-bands (sub-1 GHz) support wide-area coverage, (b) mid-bands (1 - 6 GHz) offer a trade-off between coverage and capacity, and most commercial 5G networks will use the 3.3 GHz to 4.2 GHz range in the mid-band spectrum, and (c) high-bands (24–52 GHz) are required to achieve ultra-high data rates and ultra-low latencies. High-bands (mmWave) are highly susceptible to blockages caused by various objects (e.g., buildings, vehicles, trees) and even the human body. Operating in the mmWave spectrum at mass scale presents a demanding challenge in terms of practical implementation and costs. The physical link in the Radio Access Network emerges as the primary bottleneck on low- and mid-bands, primarily due to the constrained capacity of the radio spectrum. Frequency bands below 6 GHz, traditionally utilised by pre-5G technologies, are progressively saturating, unable to meet escalating consumer demands. We assume massive adoption of mid-bands across Europe: 5G mid-bands still outperform 3G/4G/LTE in terms of latency and packet loss probability, but require less investment in network infrastructure. For example, serving multiple real-time video streams over 5G is not magic anymore. We are able to build customer experience with about 13 ms latency for 99.9% of downlink packets and 28 ms for 99.9% of uplink packets even with “bad” signal strength from -100 dBm to -113 dBm.

On the mid-bands, 5G still outperforms 3G/4G/LTE in terms of latency and packet loss probability. High-reliability plays against the congestion control algorithms used by QUIC. Conventional algorithms are not able to differentiate between the potential causes of packet loss or congestion on the radio channel due to noise, interference, blockage or handover. NewReno and CUBIC have resulted in very poor throughput and latency performance. Only BBR exhibited the lowest round trip time values among all possible physical failure scenarios and can satisfy the typical 5G requirements. Advancing the adoption of HTTP/3 for low-latency communication scenarios necessitates research and development into congestion control algorithms that are sensitive to bandwidth variations across different frequency bands.

Adoption of HTTP/3 by Zalando

Despite the discussed limitations, we have adopted the HTTP/3 protocol at Zalando for distributing all media content. We have successfully brought our vision to life: delivering a premium customer experience atop the foundation laid by industry enablers. Akamai Technologies has been supporting QUIC since July 2016. Amazon supports QUIC (UDP) at Network Load Balancer. Most importantly, HTTP/3 is available at CloudFront, giving the ability to serve European customers through Edge Locations. Apple has maintained a proprietary closed-source implementation of the QUIC and HTTP/3 protocols since iOS 15. On Android, the open-source Cronet library exists. Google Chrome has supported the protocol since 2012. Apple added official support in Safari 14. Support in Firefox arrived in May 2021.

Since HTTP/3 was enabled in our production environment, we have observed that 36.6% of our users seamlessly migrated to consuming content over the HTTP/3 protocol. The average latency for these customers has improved from a double-digit to a single-digit value, an improvement of about 94%. The p99 latency has improved from a four-digit to a double-digit value, a 96% gain in comparison with HTTP/2. About 61.6% of our users continue to use the HTTP/2 protocol and the remaining 1.8% of users fall back to HTTP/1. We have observed no incidents or severe anomalies caused by HTTP/3.

Exploring further directions on traffic engineering opportunities with HTTP/3

Before concluding, the author would like to outline two significant pathways for further enhancing HTTP/3, aimed at crafting next-level customer experiences.

Congestion Control with Deep Reinforcement Learning

Conventional CC algorithms base their decisions on pre-defined criteria (heuristic) such as packet loss or delay and they lack the ability to learn and adapt their behaviour in complex dynamic environments such as 5G cellular networks. Some heuristic algorithms use statistics to accommodate previous experience into the decision making process, still they are not able to achieve the full potential of modern networks.

Machine Learning techniques outperform conventional CC algorithms by dynamically adapting the parameters. Deep Reinforcement Learning (DRL) is a prominent technique that has been assessed with QUIC. The Reinforcement Learning agent makes decisions about the size of the congestion window or sending rate while interacting with the environment. The reward metric is either throughput or network delay, with a penalty for packet losses, and is optimised for a particular application. In the lab, analysis of DRL algorithms has shown higher throughput and better round-trip performance under various network settings compared with competing solutions (e.g. BBR or Remy). It is worth mentioning Aurora, Eagle, Orca and PQB as known DRL algorithms. We expect this to become the main concept explored in research dedicated to protocol improvements in 5G networks.
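
A reward signal of the kind described above can be sketched as a weighted sum. This is a generic illustration, not the reward used by any of the named algorithms: the function name, the linear form and all three weights are assumptions picked for readability.

```python
def reward(throughput_mbps: float, delay_ms: float, loss_rate: float,
           w_throughput: float = 1.0, w_delay: float = 0.1, w_loss: float = 100.0) -> float:
    """Higher is better: reward throughput, penalise delay and packet loss.

    All weights are hypothetical and would be tuned per application.
    """
    return w_throughput * throughput_mbps - w_delay * delay_ms - w_loss * loss_rate


# Same throughput and delay, with and without 1% packet loss.
clean = reward(throughput_mbps=50, delay_ms=20, loss_rate=0.0)
lossy = reward(throughput_mbps=50, delay_ms=20, loss_rate=0.01)
print(clean, lossy)
```

The agent observes the network, picks a congestion window or sending rate, and is trained to maximise this value over time, so changing the weights changes whether it favours throughput, latency or reliability.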

Streaming of 4K Ultra High Definition videos

Streaming of 4K Ultra High Definition 3840x2160 video at 60 fps requires usage of H.265 (High Efficiency Video Coding) and demands 30 - 50 Mbps network bandwidth, 6 - 11 ms packet latency and 99.999% reliability for packet delivery. This is a tough requirement for 5G mid-bands and practically achievable in urban areas only.

HTTP/3 introduces concurrent access and low-latency capabilities to video streaming solutions. Our initial investigations have revealed that only Video on Demand applications utilise Dynamic Adaptive Streaming over HTTP/3, with an assumption of 5.6 MB of HEVC-compressed video per second. QUIC stream concurrency enables parallel fetching of video chunks, leading to an improved user experience compared to HTTP/2. Real-time video streaming with QUIC over less-than-ideal network conditions faces an issue due to the reliable nature of the protocol: retransmissions of lost packets in a video stream inadvertently lead to stalls in the video stream. QUIC also performs poorly when it encounters packet losses that are not due to congestion. This is another improvement opportunity for QUIC: by offering a selectively reliable transport, wherein not all video frames are delivered reliably, we can optimise video streaming and improve end-user experiences. We believe this improvement impacts content consumption by supporting up to 4096 × 2160 at 60fps (True 4K).
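
The figures in this section can be cross-checked with simple arithmetic. The sketch below assumes 1 MB = 1,000,000 bytes and takes the 5.6 MB per second and 60 fps values from the text; the helper names are ours.

```python
VIDEO_MB_PER_SECOND = 5.6   # HEVC-compressed video per second, from the text
FRAMES_PER_SECOND = 60
REQUIRED_MBPS = (30, 50)    # bandwidth range quoted for 4K UHD at 60 fps


def to_mbps(megabytes_per_second: float) -> float:
    """Convert megabytes per second to megabits per second."""
    return megabytes_per_second * 8


def kilobytes_per_frame(megabytes_per_second: float, fps: int) -> float:
    """Average compressed size of a single frame in kilobytes."""
    return megabytes_per_second * 1000 / fps


bitrate = to_mbps(VIDEO_MB_PER_SECOND)
frame_kb = kilobytes_per_frame(VIDEO_MB_PER_SECOND, FRAMES_PER_SECOND)
print(round(bitrate, 1), round(frame_kb, 1))
print(REQUIRED_MBPS[0] <= bitrate <= REQUIRED_MBPS[1])
```

5.6 MB per second corresponds to 44.8 Mbps, which sits inside the 30 - 50 Mbps range, and to roughly 93 KB per frame, so a single stalled retransmission can easily delay several frames.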

Conclusion

Usage statistics indicate that 29.8% of websites worldwide have already embraced HTTP/3 to cater to their users, with Zalando being among them. Through its adoption, significant strides have been made towards improving the efficiency and responsiveness of web communications, ultimately enhancing the end-user experience.

We've explored how HTTP/3 addresses key challenges such as latency reduction, concurrent access, and low-latency content delivery. We’ve also emphasised the remaining issues engineers should be aware of, specifically in the context of radio access networks, and discussed exciting opportunities for further advancements in traffic engineering and network optimization, especially as technologies like Deep Reinforcement Learning continue to mature.

Overall, the insights shared in this post underscore the pivotal role of HTTP/3 in shaping the future of web communication, paving the way for richer, more immersive online experiences. Our observations tell us that 36.6% of our users seamlessly migrated to content consumption using HTTP/3 protocol. The average latency for these customers has improved from double digit to single digit value giving about 94% improvements.

Hosting an internal Engineering Conference

2024-06-03T00:00:00+02:00

Introduction

Our Data Science colleagues had been hosting an internal Data Science Days event for a few years. For our 2,000+ Engineers, we had been missing a similar community event. For several years we wanted to organize one, but got distracted by other priorities and external factors. Finally, in 2022 we decided to commit to hosting an internal Engineering Conference every year and included this commitment in our Engineering Strategy.

Last year, in August 2023, we hosted our first internal Engineering Conference. In this post, we are summarizing how we organized this event and provide tips for those who want to organize a similar event in their company. If you never hosted an event like this before, fear not - when we embarked on the journey we also had no experience in doing so. The event turned out to be a success nonetheless.

Conference format

As this was our first event, we had no reference on the level of interest from potential speakers nor attendees. Without a reference point from prior years, it was a big ask to request that Engineering Managers allow their teams dedicated time to attend, especially given the summer holiday timing (which could work for or against attendance). On top, conference talks are expected to be of higher quality than typical internal presentations, so we needed a format that would ensure quality of talks.

Given these circumstances and following our value think big, act fast, we defined the conference format as follows:

1 day event, all online (we're 2,000+ Engineers with sites in Berlin, Dublin, Dortmund, Helsinki, Stockholm, Zürich)
call for papers to collect submissions across 8 tracks
track host per track who would moderate the track and act as subject-matter expert during the preparation of the talks
program committee to review submissions and select talks

Initially, we were thinking that 8 tracks would be too many, but we wanted to encourage submissions across a variety of topics and see where this takes us, adjusting the track as needed. Our tracks covered Building Platforms, Cloud Native, Developer Experience, Data Engineering, and App/Web Development. We also had a dedicated track for Engineering Leadership and (of course) for the hot topic of the year: AI.

The call for papers was open for 3 weeks. Up until the very end, we were not sure if we would get enough submissions to fill all tracks. Only in the last two days before the deadline did we receive a significant number of submissions. We struck gold and ended up with enough submissions to fill all defined tracks. Now the organizing team had a challenge - to deliver an event with 8 tracks happening in parallel and 54 talks in total.

When we reached out to our broadcasting team, who typically assist in hosting internal events, we learned that they had never hosted an event that big, with 3 tracks being their technical limit. So we ended up hosting the event on our own, using Google Meet streaming, a slide-based presentation catalogue with talks and descriptions, and 54 calendar events to make it easy to build one's own schedule.

Conference content

2023 was the year of Large Language Models (LLMs), so they could not be missing from our event. As LLMs were new for many of our Engineers, we invited our Data Scientists to share their know-how on this topic. We had a talk about the fundamentals of LLMs, followed by a summary of the challenges of using LLMs based on two use cases: code generation and building our Zalando Assistant. As expected, these presentations attracted a lot of interest from our community.

Our Engineering Leadership track was focused on talks related to managing teams in challenging times, building trust with the team, and sustaining empathy when the team or oneself is affected by the current situation. Other talks focused on driving innovation and continuing to learn as leaders.

The Cloud Native and Developer Experience tracks turned out to be great platforms for sharing new developments in our infrastructure services and promoting their use. Colleagues learned about proven features that they may be missing out on and got a peek at improvements in our Kubernetes platform. Our SRE-minded speakers shared tips on building easy-to-understand Grafana dashboards using data visualization techniques and demonstrated reference dashboards for applications.

The Data Engineering track was focused on sharing best practices in data processing and data quality. Speakers shared how they monitor data quality in their pipelines, how to simplify data aggregation queries, or how architectural decisions around data design affect data quality and technical debt.

Two teams particularly stood out with multiple presentations across the tracks. The team behind our Web platform shared their journey of evolving their platform into a standalone framework that now also powers parts of the Zalando Lounge experience, covered the journey to concurrent React, and explained how we continuously measure and report on web performance. Our Size & Fit team, on the other hand, explained how the Size Recommendations based on Body Measurements features work behind the scenes, starting with the on-device computation and ending with the compliance requirements for processing sensitive data. The team also shared how the data acquisition pipelines for the Virtual Fitting Room work.

Lessons learned

Conference format

With the 8 tracks in a single day, we triggered massive FOMO (fear of missing out) across Zalandos, as it was difficult to decide which talk to attend. We knew from the get-go that this would be a challenge, but decided that the trade-off was worth it. Now that we have gained credibility for running the event, in the future we will reduce the number of tracks and spread the event out over at least two full conference days. When hosting yearly events, the amount of net new project content is expected to stabilize when compared to the first event.

For first-time speakers, the online format was a great opportunity to practice as stage anxiety is smaller than when speaking to a full room. It's challenging for an online-only event to deliver a full conference feeling, though. While on the following day we had an on-site event with two keynotes and a get together, participants were missing the buzz and networking opportunities known from on-site conferences. Nothing replaces the chatter in the hallway and missing talks due to engaging in conversations with colleagues in a prolonged coffee break ;-)

We had two conference talk formats: full talks (with Q&A) and short lightning talks (without Q&A). The feedback we received from speakers for the lightning talks is that they missed out on the Q&A part and the resulting feedback loop telling them whether the audience was interested in the talk (or not).

CFP

We ran the CFP (call for papers) using Google Forms and scored the submissions in Google Sheets. Each Program Committee member reviewed and scored the submissions based on the topic relevance for the target audience, the abstract quality, and the expected takeaways. We provided a scoring guidance document and removed speaker information to ensure an unbiased selection process focused solely on content. To balance the workload, we assigned each committee member up to 50% of the tracks to score. We then normalized the ranking results and selected the top submissions for the conference. In some cases, we reclassified talks across tracks to ensure balanced content distribution.
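
The post does not say how the ranking results were normalized. As a rough sketch only, the example below assumes per-reviewer z-score normalization followed by averaging per submission, so that strict and generous reviewers become comparable. The `Score` shape, the `rankSubmissions` function, and the sample data are hypothetical.

```typescript
// Hypothetical shape of one row exported from the scoring sheet.
interface Score {
  reviewer: string;
  submission: string;
  value: number; // raw score given by the reviewer
}

interface RankedSubmission {
  submission: string;
  mean: number; // average z-score across reviewers
}

// Assumption: convert each reviewer's scores to z-scores so that strict and
// generous reviewers become comparable, then average them per submission.
function rankSubmissions(scores: Score[]): RankedSubmission[] {
  const rawByReviewer: Record<string, number[]> = {};
  for (const s of scores) {
    const list = rawByReviewer[s.reviewer] ?? [];
    list.push(s.value);
    rawByReviewer[s.reviewer] = list;
  }

  const zBySubmission: Record<string, number[]> = {};
  for (const s of scores) {
    const values = rawByReviewer[s.reviewer] ?? [];
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance =
      values.reduce((a, b) => a + (b - mean) * (b - mean), 0) / values.length;
    const std = Math.sqrt(variance);
    // A reviewer who scored everything identically carries no ranking signal.
    const z = std === 0 ? 0 : (s.value - mean) / std;
    const list = zBySubmission[s.submission] ?? [];
    list.push(z);
    zBySubmission[s.submission] = list;
  }

  // Highest average z-score first.
  return Object.keys(zBySubmission)
    .map((submission) => {
      const zs = zBySubmission[submission] ?? [];
      return { submission, mean: zs.reduce((a, b) => a + b, 0) / zs.length };
    })
    .sort((a, b) => b.mean - a.mean);
}

// Reviewer "ann" scores generously, "bob" strictly; both prefer talk-a.
const ranking = rankSubmissions([
  { reviewer: "ann", submission: "talk-a", value: 9 },
  { reviewer: "ann", submission: "talk-b", value: 7 },
  { reviewer: "bob", submission: "talk-a", value: 4 },
  { reviewer: "bob", submission: "talk-b", value: 2 },
]);
console.log(ranking.map((r) => r.submission).join(", ")); // prints "talk-a, talk-b"
```

With raw averages, a generous reviewer would dominate the ranking of the tracks they happened to score; normalizing per reviewer is one common way to reduce that effect.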

Track Hosts

Assigning a track host per track worked well (and is well known from other conferences). The track hosts helped speakers prepare and were an early sounding board for the presentation content. They had the freedom to select the order of the talks to ensure a good flow of topics and to support their storytelling when introducing the speakers throughout the day. Hosts also prepared backup questions to use in the Q&A part while the audience was busy typing their questions into the Q&A tool.

Summary

The event turned out to be a success and we received a lot of positive feedback from our colleagues, who after the closing event were asking when we would host the next one. The event was a great opportunity to learn about projects across the organization and to promote platform solutions to a wide and focused audience. The recordings of the talks serve as onboarding material for colleagues who want to learn about specific projects or are just joining the speakers' teams. The on-site event on the following day was a great opportunity to meet colleagues in person and to get their first-hand feedback on what they liked about the conference and what they would like to see improved.

Tips for organizing similar events

Sponsorship: get a sponsor from the leadership team to provide budget and high-level guidance for the event.
Organizing team: form a small team to organize the event (at Zalando we have a Tech Academy team experienced in organizing events for the Engineering Community).
Program Committee and Track Hosts are great mechanisms to give visibility to role models and to promote diversity across the organization.
Program Committee: use a principle-based approach for program committee composition.
CFP scoring: provide guidance for the program committee on how to score submissions; ensure that the selection is based solely on the content of the submission (via conference software or just plain old spreadsheets).
CFP scoring: submissions that made it to the shortlist, but did not make it to the conference, should be considered for other internal talk formats or blog posts.
Track Hosts: consider assigning track hosts, if only to moderate the track and introduce speakers during the day; they can also help speakers prepare the talks, though you can also assign a group of subject matter experts to review the talks early on.
Communication: meet the target audience where they are; use all possible communication channels to reach them: chats, email, intranet, posters in office, ask leads to promote the event in their team meetings and townhalls.
Presentations: provide a slide template to ensure a consistent look and feel across all presentations (at least for the first slide). Provide guidance on the font sizes and how to pick accessible color combinations with high contrast.

What changes are we making this year?

This year, we're running the conference earlier, in June, and hosting it as an on-site event with the aim of creating a real conference feeling. We are spreading the conference over two days with three tracks per day, merging some tracks from last year and adding new ones. The event will be streamed to all sites, this time with the support of our broadcasting team. The streams will also make it possible for our colleagues to join the event from home.

We ran the CFP for 4 weeks to give potential speakers more time, but the impact on the number of submissions over time was negligible. The due date for submissions is what matters: as in 2023, we received most submissions in the last two days before the end of the CFP. We invited past speakers and track hosts to become part of the Program Committee.

We're excited to host the event again and look forward to learning how the conference format for this year will be received by Zalandos. More on that another time!

Transitioning to Appcraft: Evolution of Zalando’s server-driven UI framework

2024-05-16T00:00:00+02:00

At the heart of Zalando's mobile content strategy lies the Appcraft platform, fueling 13 dynamic pages within the app. This framework is instrumental in delivering top-tier content formats, including the popular Zalando Stories. In this post we explain the origins and inner workings of the platform.

The TNA Dilemma

The Flexible Layout Kit (formerly known as Truly Native Apps, TNA) was a framework used in the Zalando App to render content dynamically. This framework processed JSON input, which defined the slots and elements of a screen. These elements were characterised by their types and a set of attributes. The primary container of the screen was a vertical list type, which encapsulated a series of Composed Tiles within client-side Apps. While this system initially provided simplicity and a robust foundation for dynamic landing pages within our Apps, its fixed UI structure imposed constraints. Notably, maintaining the high-level composed UI components across both iOS and Android clients proved challenging, mainly due to versioning but also due to constant UI design changes and the introduction of multiple variants for a single Tile in order to support our different business logic and content formats. These limitations inhibited innovation and hindered the seamless integration of dynamic content.

Example of a component in TNA: these were the Showstopper Tile variants (C and D shown below) in the TNA framework.


Version C	Version D

The JSON for Version D looked like this:

{
  "element-type": "teaser",
  "attributes": {
    "trackingParameters": {},
    "saleBoxColor": "#FF0000",
    "teaserVersion": "VERSION_D"
  },
  "subelements": [
    {
      "attributes": { },
      "element-type": "image"
    },
    {
      "attributes": { },
      "element-type": "text"
    },
...
    {
      "attributes": { },
      "element-type": "use-voucher"
    },
    {
      "attributes": { },
      "element-type": "show-info"
    }
  ]
}

To summarise, these were the pain points with the TNA framework:

Small UI changes within a Tile, such as moving a button to the right or left, or stakeholders requiring two UI presentation variants, would prompt a new version and necessitate a client-side change and release to the App Stores.
For other cases involving changes to business logic, such as a price format change, the contract or schema for the price component on both clients and the server had to be modified.
Maintaining backward compatibility and versioning was challenging and led to a few incidents. It also necessitated coordination between clients, especially when the app release versions between iOS and Android were not synchronised.
Moreover, several backend services, including TNA, needed to be migrated at the time, and the team had to decide whether to maintain or decommission the TNA backend.

These shortcomings encouraged us to replace TNA with a new framework, with the following aims:

A common and more flexible design layout system.
Simplified Versioning capabilities.
Same-day delivery for new Screens and Layouts.

Enter Appcraft

A common design layout system

In 2018, after experimenting with web-like architectures and several layout systems provided by native and third-party frameworks, we decided to implement a mobile version of the Elm architecture, together with Flex, as a unifying principle that could bridge the design paradigms of Android and iOS. Here's how:

The Elm architecture, inspired by the Elm programming language, follows a unidirectional data flow pattern consisting of three main components: Model, View, and Update. The Model represents the application state, the View displays this state to the user, and the Update modifies the state based on user interactions. This clear separation of concerns simplifies code maintenance and enhances predictability, making the Elm architecture popular for building scalable and maintainable web applications.
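
To make the Model, View, and Update roles concrete, here is a minimal sketch of the pattern in TypeScript. It is not Appcraft code: the wishlist counter, the `Msg` variants, and the string-based `view` are invented purely for illustration.

```typescript
// Model: the whole state of a (hypothetical) wishlist counter.
interface Model {
  readonly itemCount: number;
}

// Messages describe what happened; they are the only way to change the Model.
type Msg = { kind: "itemAdded" } | { kind: "itemRemoved" } | { kind: "cleared" };

// Update: a pure function from the current state and a message to the next state.
function update(model: Model, msg: Msg): Model {
  switch (msg.kind) {
    case "itemAdded":
      return { itemCount: model.itemCount + 1 };
    case "itemRemoved":
      return { itemCount: Math.max(0, model.itemCount - 1) };
    case "cleared":
      return { itemCount: 0 };
  }
}

// View: a pure function from state to a description of the UI (a string here).
function view(model: Model): string {
  return model.itemCount === 0 ? "Wishlist is empty" : `Wishlist (${model.itemCount})`;
}

// Data flows one way: messages -> update -> model -> view.
const messages: Msg[] = [
  { kind: "itemAdded" },
  { kind: "itemAdded" },
  { kind: "itemRemoved" },
];
const initial: Model = { itemCount: 0 };
const finalModel = messages.reduce<Model>((model, msg) => update(model, msg), initial);
console.log(view(finalModel)); // prints "Wishlist (1)"
```

Because `update` and `view` are pure functions, the same sequence of messages always produces the same screen, which is what makes the pattern predictable and easy to test.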

Flex was key in helping to build a common understanding of layout concepts for mobile clients, which web developers could also grasp without the need to learn the individual mechanisms each platform uses to lay out views on a screen. It offers flexibility for dynamic and responsive designs across platforms, streamlines development, fosters cross-platform compatibility, and benefits from a large community of developers.

The challenges

While the decision to use Flex was agreed within the cross-platform team, the challenge lay in adding Flex support to iOS and Android, each of which internally uses its own native layout framework. Based on this, we experimented with a few third-party layout libraries already available, each with a fair reputation, comparing their performance and integration efforts. Once these libraries were chosen for iOS and Android, most of the effort went into translating the Flex definitions from the server into the Flex library APIs for each platform and comparing them to ensure consistent results between both. One important consideration while choosing the library was finding one that sits on top of the native UI frameworks to assist with positioning and sizing, without replacing or altering the behaviour of the native UI framework. This means that, for example, a scrollable layout with Flex specifications on the server will be transformed by Appcraft into a native UICollectionView for iOS and into a RecyclerView for Android. This approach ensures that we still have access to new APIs and improvements available on the native UI frameworks for newer OS versions. We decided to move further with Texture on iOS and Litho on Android.

Primitives

We've established a set of Primitive Components to serve as the foundation for constructing High-level UI Components. Starting with essentials such as Label, Button, Image, Video, and a Layout container, these primitives form the building blocks for crafting intricate UI components. With these foundational elements in place, developers possess the flexibility to combine and customise them according to their application's unique requirements, unlocking a plethora of possibilities for UI design and interaction.

Behaviour

Users engage with apps through various events such as scrolling, tapping, long-pressing, and more. Each of these triggers a specific action as a response, which in most cases results in a UI update or a side effect. We've devised a comprehensive set of actions to ensure the system effectively responds to these user-triggered and component life-cycle events. For example, tap is an event and navigate is an action. Additionally, there are implicit events designed to track user interactions, ranging from detailed events like scroll-forward to simpler ones like dismiss.

This is what a component looks like in Appcraft:

{
  "type": "layout",
  "id": "root-container-layout-id",
  "flex": {},
  "props": {},
  "children": [
    {
      "type": "image",
      "id": "id1",
      "flex": {},
      "props": {},
      "events": {
        "tap": [
          {
            "id": "id2",
            "props": {},
            "type": "track"
          },
          {
            "id": "id3",
            "props": {},
            "type": "navigate"
          }
        ]
      }
    }
  ],
  "events": {}
}
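
As an illustration of how a client could react to the `events` block above, the sketch below runs the actions registered for an event in order. The types mirror the JSON shape shown here, but the `dispatch` function and the handler registry are assumptions made for this example, not the actual Appcraft client API.

```typescript
// Hypothetical types mirroring the JSON above; the real schema may differ.
interface Action {
  id: string;
  type: string; // e.g. "track" or "navigate"
  props: Record<string, unknown>;
}

interface Component {
  id: string;
  type: string;
  events?: Record<string, Action[]>; // event name -> actions to run in order
  children?: Component[];
}

type ActionHandler = (action: Action) => void;

// Runs every action registered for `eventName` on the given component.
// Returns the ids of the actions that had a handler, in execution order.
function dispatch(
  component: Component,
  eventName: string,
  handlers: Record<string, ActionHandler>
): string[] {
  const executed: string[] = [];
  const actions = component.events?.[eventName] ?? [];
  for (const action of actions) {
    const handler = handlers[action.type];
    if (handler === undefined) continue; // unknown action types are skipped
    handler(action);
    executed.push(action.id);
  }
  return executed;
}

const image: Component = {
  id: "id1",
  type: "image",
  events: {
    tap: [
      { id: "id2", type: "track", props: {} },
      { id: "id3", type: "navigate", props: {} },
    ],
  },
};

const log: string[] = [];
const executedIds = dispatch(image, "tap", {
  track: (a) => log.push(`track:${a.id}`),
  navigate: (a) => log.push(`navigate:${a.id}`),
});
console.log(executedIds.join(", ")); // prints "id2, id3"
```

Skipping unknown action types, rather than failing, is one way an older client could tolerate actions introduced later on the server; whether Appcraft does this is not stated in the post.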

Simplified Versioning capabilities

With the previous TNA system, both server and clients had to exchange information about the schema version, adding complexity. We sought alternatives to reduce errors and simplify maintenance. With a more flexible layout structure and by keeping the logic of binding data and layout in the server, we achieved reduced complexity in the clients by leaving the sole responsibility of rendering to the app. The schema versioning remained on the server, making it easier to resolve issues such as retrieving the right component version for each client and allowing us the flexibility of customising UI and behaviours for each platform independently. While it was not immediately apparent, maintaining this flexibility on the server allowed us to:

Enable or disable components and their behaviour targeting specific app versions, platforms, premises and A/B testing.
Resolve incidents more quickly without the need for a hotfix, for example by removing faulty components from the server for specific app or OS versions due to bugs or performance reasons.
Retain backward compatibility logic on the server, as we can specify a minimum version for a component.
Add new Appcraft pages to the App without the need for client changes, by just configuring the new page route and the minimum supported app version.
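
The kind of server-side targeting listed above can be sketched as a filter over component definitions. Everything in this example is hypothetical: the `ComponentDef` fields, the version format, and the component ids are assumptions, since the post does not show the real configuration.

```typescript
// Hypothetical server-side component definition with targeting rules.
interface ComponentDef {
  id: string;
  minAppVersion?: string; // e.g. "24.3.0"; omitted = any version
  platforms?: Array<"ios" | "android">; // omitted = all platforms
  disabled?: boolean; // kill switch, e.g. during an incident
}

interface ClientContext {
  platform: "ios" | "android";
  appVersion: string;
}

// Compares dotted numeric versions; returns a negative, zero, or positive number.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const length = Math.max(pa.length, pb.length);
  for (let i = 0; i < length; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Keeps only the components this client is allowed to render.
function selectComponents(defs: ComponentDef[], client: ClientContext): ComponentDef[] {
  return defs.filter((def) => {
    if (def.disabled === true) return false;
    if (def.platforms !== undefined && def.platforms.indexOf(client.platform) === -1) {
      return false;
    }
    if (
      def.minAppVersion !== undefined &&
      compareVersions(client.appVersion, def.minAppVersion) < 0
    ) {
      return false;
    }
    return true;
  });
}

const defs: ComponentDef[] = [
  { id: "hero-banner" },
  { id: "stories-carousel", minAppVersion: "24.3.0" },
  { id: "video-teaser", platforms: ["ios"] },
  { id: "faulty-widget", disabled: true },
];

const visible = selectComponents(defs, { platform: "android", appVersion: "24.2.1" });
console.log(visible.map((d) => d.id).join(", ")); // prints "hero-banner"
```

Because the filter runs on the server, flipping `disabled` or raising `minAppVersion` takes effect for all clients without an app release.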

Same-day delivery

In Zalando mobile engineering we operate in sprints, with each sprint culminating in an app release. In this model, even simple UI adjustments may require waiting for a new app version and even longer for full adoption, which can be a significant bottleneck in a fast-paced organisation like Zalando. In an ideal scenario, without the need for hotfixes, waiting for a complete release cycle to move a label from left to right seems counterproductive. Appcraft is designed to be agile and responsive to user needs, and such delays can hinder our ability to deliver a dynamic user experience. With the introduction of the Appcraft framework, delivery is not tied to app releases or sprint duration: changes can be made at any point during a sprint. Now, the presentation layer can be defined directly on the server using pre-defined primitives that are packaged within the app.

What does it look like when a new screen is required?

When a new screen is required, the process is streamlined and dynamic in our mobile applications. We heavily rely on deep-link navigation, allowing seamless transitions between different screens. In a truly dynamic system, the creation of deep-links should happen on the fly without the need to manually add routes in the clients every time.

To achieve this, we've introduced a middle-man component that takes a deep-link and converts it into an API request that our framework can understand. This way, every time a new screen is needed, our stakeholders simply align on the deep-link structure and update the configuration according to the agreed-upon contract. With these adjustments in place, the setup is complete. The next step involves the renderer, which will then interpret the updated configuration and render the new screen accordingly.
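
A minimal sketch of such a middle-man is shown below. It assumes a route configuration that maps deep-link path patterns to API endpoints; the `RouteConfig` shape, the `app://` scheme, the `:param` syntax, and the endpoint paths are all hypothetical and only illustrate the idea of resolving links from configuration rather than from hard-coded client routes.

```typescript
// Hypothetical route configuration agreed between stakeholders: each entry
// maps a deep-link path pattern to the API endpoint that serves the screen.
interface RouteConfig {
  pattern: string; // e.g. "/stories/:storyId"
  endpoint: string; // e.g. "/api/screens/stories"
}

interface ScreenRequest {
  endpoint: string;
  params: Record<string, string>;
}

// Converts a deep-link into a request for the screen API, or returns null
// when no configured route matches. The link format is an assumption.
function resolveDeepLink(link: string, routes: RouteConfig[]): ScreenRequest | null {
  const withoutScheme = link.replace(/^[a-z][a-z0-9+.-]*:\/\//i, "");
  const [pathPart = "", queryPart = ""] = withoutScheme.split("?");
  const segments = pathPart.split("/").filter((s) => s.length > 0);

  for (const route of routes) {
    const patternSegments = route.pattern.split("/").filter((s) => s.length > 0);
    if (patternSegments.length !== segments.length) continue;

    const params: Record<string, string> = {};
    let matched = true;
    for (let i = 0; i < patternSegments.length; i++) {
      const expected = patternSegments[i] ?? "";
      const actual = segments[i] ?? "";
      if (expected.charAt(0) === ":") {
        params[expected.slice(1)] = decodeURIComponent(actual); // path parameter
      } else if (expected !== actual) {
        matched = false;
        break;
      }
    }
    if (!matched) continue;

    // Query parameters of the deep-link are forwarded as-is.
    for (const pair of queryPart.split("&")) {
      if (pair.length === 0) continue;
      const [key = "", value = ""] = pair.split("=");
      params[decodeURIComponent(key)] = decodeURIComponent(value);
    }
    return { endpoint: route.endpoint, params };
  }
  return null;
}

const routes: RouteConfig[] = [
  { pattern: "/stories/:storyId", endpoint: "/api/screens/stories" },
  { pattern: "/collections/:collectionId", endpoint: "/api/screens/collections" },
];

const resolved = resolveDeepLink("app://stories/42?source=push", routes);
console.log(JSON.stringify(resolved));
```

Adding a new screen in this sketch means adding one entry to `routes`; the resolver itself stays unchanged.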

So when is a client-release needed?

A client-release is only required when there's a need to introduce a new primitive or extend the contract of an existing one to support additional behaviour.

For example, when a simple label was not enough, we decided to introduce a Composite Label that can hold subtexts, each with its own font styling, decoration, and sizing. Thanks to its flexibility, this is currently the primitive used to render prices, for example.

How is a newly created screen tested?

We developed a demo app named the Appcraft Browser, featuring an address bar where any URL emitting appcraft screen JSON can be provided as input. The screen definition is then rendered in an isolated environment with only the bare minimum dependencies, facilitating faster development without the need to build the entire app. This tool allows web developers to insert a local host URL and test their development seamlessly while working on the renderer.

After the development stage, web developers open a PR, which allows them to deploy the rendering changes to a staging environment. The changes are then validated in a debug version of the Zalando app by entering the deployed PR number in the app debug settings. This allows testing on production screens in the actual app environment.

Appcraft's Business Impact

Dynamic content - Currently, the Appcraft platform serves 13 different dynamic pages in the mobile app, which contribute to Zalando’s effort of consistently delivering quality content formats to mobile users for inspiration and personalisation around brands, recommendations, outfits, creators, collections and campaigns. Check out the most recently shipped feature powered by Appcraft, called Zalando Stories, and its press release.
App Theming/Redesign - Since the inception of the Zalando App, the company has undergone several app redesigns, each demanding significant engineering effort and collaboration across multiple teams. However, when it comes to pages served by Appcraft, there has been a notable reduction in engineering effort compared to non-backend-driven UI. This is because the majority of changes are implemented on the server, benefiting both mobile platforms and all supported premises through common rules.
Tracking Migrations over time - Similar to UI redesigns, since the introduction of the Appcraft platform, the mobile apps have gone through two tracking migrations, the first in 2021 and the second now in 2024. For Appcraft screens, as with UI changes, all tracking events and their schemas are defined on the server. The mobile client's only task was adopting a new SDK or in-house backend solution to pass the events on to the new analytics framework.
Quick Prototyping - We use Appcraft for fast prototyping. By creating new renderers in the backend, the engineers and designers were able to quickly iterate on different UI designs over the course of a week.
Resilience - Appcraft’s resilience has matured over time, with past incidents triggering some of the improvements. By deploying changes on the server within the same day, the MTTR for incidents is notably reduced. Moreover, the platform has been used successfully during Cyber Week, Zalando's biggest sales event, for the last couple of years.
User experience - When a concept is added in Appcraft, it scales immediately to all screens via the backend. We are actively working on enhancing the user experience to be more delightful. We're currently exploring screen transitions, fluidity concepts, and micro-animations on the Appcraft platform.

Current challenges and evolution

While thoroughly enjoying the flexibility of adding screens without the involvement of any app engineers throughout the content experiences, we, as the platform team, find it challenging to keep track of the launched screens due to gaps in monitoring. Sometimes issues arise and reach us only when they become urgent fixes.
Striking the right balance between generality and restrictiveness when creating a new feature in a backend-driven mobile framework is essential. It involves carefully considering factors such as usability, flexibility, consistency, performance, and compatibility to ensure that the feature meets the needs of both developers and end-users effectively.
Testing has become easier, but only over time. We enhanced the developer experience by enabling local testing for web developers, providing screen context injection for A/B testing, and eventually facilitating testing for pending changes to renderers (open PRs).
We are currently addressing another significant challenge, known as Interoperability, which refers to the reuse of existing non-Appcraft components in Appcraft and vice versa. To tackle this, we've introduced the capability of embedding non-Appcraft components in Appcraft screens and of embedding entire Appcraft screens within larger features. An example of this can be seen in the tabbed structure on the Home Screen, where each tab is an Appcraft screen.
Dependency on third-party UI technology could pose a challenge because iOS and Android libraries may behave differently, requiring additional customization or default code to achieve consistent functionality and user experience across both platforms.
Due to organisational changes over the years – such as transitioning from a strong web engineering team with limited mobile resources to having equally strong web and mobile teams – the allocation of effort has become a topic of debate. Consequently, we've observed that feature ownership (mobile vs. web) can sometimes become unclear.

Appcraft has been serving as a stalwart in the realm of backend driven screen frameworks. Read all about the backend system that empowers this platform.

Theming the Zalando Design System

2024-05-14T00:00:00+02:00

Why theming?

As a design system evolves alongside the brand it represents, the need to introduce variations arises on multiple occasions. On the business side of things, there may be use cases for part of the customer journey to have a distinct look and feel, or there may be sub-brands that are part of a larger platform. The previous article on this blog gives a wider overview of the Zalando Design System. This article will focus instead on the challenges encountered in the development of theming capabilities.

Introducing variations into the system, without compromising the baseline brand identity and the benefits of reusing existing client components, is one of the main reasons to explore the concept of theming.

In the absence of a proper theming architecture, early attempts and explorations of "theming" had led to a number of hacky solutions that quickly became hard to maintain and posed risks to the overall system stability. In the past we encountered numerous challenges, including hidden CSS overrides, local conditional logic, debatable API additions, and duplicated implementations. A comprehensive theming solution quickly evolved from a "nice to have" into a clear "must have".

On a very high level, a theming architecture is just another instance of the generic problem of balancing flexibility and usability. A very strict and consistent design system makes development extremely fast, but as a company evolves and business requirements start to deviate from the initially identified rules we observe an increase in development and maintenance efforts. In order to keep the system healthy, it quickly becomes a requirement to handle the newly introduced flexibility as part of the design system itself.

Coming up with a theming concept tailored to the company's strategy, and envisioning long-term goals beyond immediate business needs, is one of the most challenging steps in this process. Too much or too little flexibility can lead to a system that is hard to use, grows over time, becomes increasingly difficult and costly to maintain, and impacts the performance of the systems involved.

To give an idea of how theming is currently used at Zalando, the Designer Home is a good example. You might notice the use of monochromatic texts, larger and uppercase headings, and the usage of rounded icon buttons. Those changes are all implemented via a theme and can be easily enabled or disabled on any given page.

Defining boundaries

Imagine a design system as a list of properties that define how UIs of a particular product should look and function. Now, consider theming as a mechanism to allow changing the values of a subset of those properties. Using this perspective there are two main areas of influence to shape a theming architecture: defining properties, and defining their allowed values.

For example, we could have a highly constrained theming concept, where different themes are allowed to choose a text colour to be either black or red, and buttons to be either rectangular or with rounded corners. In order to implement those theming specifications, we will need to have two properties in the system to represent the text colour and the border radius of buttons, as well as a defined set of possible values for both (e.g. "black/red", and "0px/32px").
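
This constrained example can be expressed directly as types. The sketch below is only an illustration of the idea, not the Zalando Design System API: the `Theme` shape, the `createTheme` helper, and the CSS variable names are made up.

```typescript
// The two themable properties from the example, each with a closed set of values.
type TextColour = "black" | "red";
type ButtonRadius = "0px" | "32px";

interface Theme {
  textColour: TextColour;
  buttonRadius: ButtonRadius;
}

const defaultTheme: Theme = { textColour: "black", buttonRadius: "0px" };

// A theme only overrides what it needs; everything else falls back to the default.
function createTheme(overrides: Partial<Theme>): Theme {
  return { ...defaultTheme, ...overrides };
}

// Emits the theme as CSS custom properties (the variable names are made up).
function toCssVariables(theme: Theme): string {
  return [
    `--text-colour: ${theme.textColour};`,
    `--button-radius: ${theme.buttonRadius};`,
  ].join("\n");
}

const roundedTheme = createTheme({ buttonRadius: "32px" });
console.log(toCssVariables(roundedTheme));
// createTheme({ buttonRadius: "8px" }) would be rejected by the compiler,
// because "8px" is not one of the allowed values.
```

The set of keys in `Theme` is the list of themable properties, and the union types are their allowed values; anything not listed stays hardcoded and out of reach of a theme.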

In reality, things are never this simple though, and identifying a relatively stable set of properties and values requires both a comprehensive understanding of how the design system is currently used, as well as a fair amount of abstract thinking and product vision on how it may evolve in the future. The balance between static (or implicit and hardcoded) properties and dynamic ones, defines what a theme can or cannot do, and when there is a discrepancy between those capabilities and the product requirements, the expected advantages quickly dissipate and new iterations on the concept will be required.

An important aspect to be discussed is the scope and area of influence of those "themable" properties. While it may not be immediately obvious, there is a clear distinction between defining a UI component in isolation, as opposed to in a specific composition. Should a theme be able to change how a button looks inside a product card, but not anywhere else? These kinds of questions are inherently connected to the wider topic of ownership. Where can we draw the line between generic UI components and business specific compositions? What part of a visual change in the end user experience can be expressed as a global theme change and which one as a localised business logic?

It’s very easy to confuse the concept of "theming" as a capability of a component library, with "theming" as a feature of a design system. Component libraries do not encompass the entire design system, but are merely a tool that follows its specifications and "implements" it for a specific purpose, for example building web pages.

Many of the popular open source design systems are showcased, documented, and advertised via their implementation; usually one or more component libraries for different platforms. One famous exception is Material Design, which from the beginning only described the design system in the form of a series of specs and guidelines.

This confusion between design specification and implementation gets mirrored in the misunderstanding on what "theming" means for those two different concepts. Most open source component libraries allow some level of theming via a number of different technical approaches, usually using config files, shared contexts, and some form of shared variables (design tokens). On the other hand, what "theming" means on the design layer, is often overlooked.

Typically, a default theme that aligns with the brand's character and identity is used. Theming is then offered as a way to adapt it to different organisations, companies, and design systems. It’s very rare for theming capabilities to be showcased as a way to express variations of the same design system. A common exception, though, is the usage of colours. Material Design is again a good example here because it was intended to be used by many different products and apps that are not necessarily related, keeping the interactions and the tactile "material" metaphor consistent, while allowing products to play with a very large colour palette in order to introduce a level of identity and ownership. Other libraries often showcase theming capabilities with custom colour palettes or by defining dark mode themes.

At Zalando, being one company with a well defined visual identity, introducing the concept of theming raised a lot of questions around the related governance rules and processes. How many themes may we need? How different can they look? Who can/should own and create them? How to ensure a baseline visual identity? Those and many other questions can be very hard to answer, and we will have to address them as we iterate through the initial use cases.

Semantic design tokens

One of the very first challenges in making a design system themable is the process of "tokenization". There are a number of repeated values scattered across design specifications and source code that need to be extracted into variables, known as design tokens, which can then be dynamically changed by themes. For example, the same shade of orange might be used as the background colour for a button, as well as the colour of the wishlist icon. A simple initial approach would be to create a variable called orange holding the exact hex colour value and then consume it in the two different components.

What will happen if a new theme now wants the button to be green? Surely, we cannot simply reassign our orange variable to a green value; that’s a recipe for disaster. This leads us to an important second step: identifying the semantic roles of different tokens and naming them accordingly. If we call the token accent instead of orange, there is no longer any confusion when its value is changed to green, or any other colour.

[color.background]
accent.value = "orange"

[theme.foo]
color.background.accent.value = "green"

While this may sound simple on paper, the reality can be extremely complex. While trying to identify a reasonable set of semantic tokens out of our existing design system, we had to go through many design iterations, often leading to significant changes to the existing specifications. This process reflects our dedication to evolving a system that wasn't originally designed with semantic tokens in mind. We faced several common challenges, including managing a large number of tokens, inconsistencies in their usage, and a lack of clarity regarding which values should change together.

Among all the sweat and tears, though, this has been a great opportunity to assess the quality of the design system itself. It has resulted in substantial simplification, the removal of unnecessary subtle variations, and a higher level of parity and consistency across the library implementations for other platforms (Android and iOS).

Once we got a stable set of global tokens, the next challenge we faced was how to express variations that do not apply to everything, but only to specific components. For example we could have a padding.small token and use it across many components, but what happens if we want the button component to use padding.small in one theme and padding.large in another one? We cannot change the meaning of padding.small globally as it would have repercussions way beyond that specific button.

This led to what we call "component-level theming", which ultimately is nothing more than an additional level of indirection between a token name and its final value. We can create a token button.padding with a value of {padding.small}, where we refer to another token rather than a value. This way a theme gains the flexibility to change the padding value used in the button, as well as to define which global padding values are allowed.
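The indirection can be illustrated with a small TypeScript sketch. It is not our actual tooling: the flat token map, the resolveToken helper, and the pixel values are assumptions made up for the example, and only the {padding.small} reference syntax comes from the text above.

```typescript
// Hypothetical flat token map: a value is either a literal or a "{token.name}" reference.
type Tokens = Record<string, string>;

const REFERENCE = /^\{(.+)\}$/;

// Follow references until a literal value is reached, guarding against cycles.
function resolveToken(tokens: Tokens, name: string, seen: string[] = []): string {
  if (seen.includes(name)) {
    throw new Error(`Circular token reference: ${[...seen, name].join(" -> ")}`);
  }
  const value = tokens[name];
  if (value === undefined) {
    throw new Error(`Unknown token: ${name}`);
  }
  const target = REFERENCE.exec(value)?.[1];
  return target === undefined ? value : resolveToken(tokens, target, [...seen, name]);
}

// Illustrative values only.
const base: Tokens = {
  "padding.small": "4px",
  "padding.large": "16px",
  "button.padding": "{padding.small}",
};

// The theme re-points the component token; padding.small keeps its global meaning.
const themeFoo: Tokens = { ...base, "button.padding": "{padding.large}" };

console.log(resolveToken(base, "button.padding"));     // "4px"
console.log(resolveToken(themeFoo, "button.padding")); // "16px"
```

Because the theme only changes which global token the button points to, every other consumer of padding.small is unaffected.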

Colour schemes

At Zalando, we encountered various situations where we need to alter the usage of colours based on what background is used in order to satisfy accessibility colour contrast requirements as well as visually pleasant colour combinations. Many banners on the website dynamically pick a background colour based on the content of an image.

To satisfy those needs, we introduced the concepts of colour schemes, namely a monochrome-dark colour scheme to be used on dark (but not black) backgrounds, and a monochrome-light for the opposite use case. Counting the "default" look and feel, it means that we need to support three different colour combinations.

This solution, for us, predates the concept of themes, and we used to override the values of palette colours directly, without semantic tokens in the picture yet. When shaping the new theming architecture we had to take colour schemes into account and make them first class citizens of themes.

What "monochrome-dark" looks like in a given theme can be different from another one. This means that each individual theme needs to support three different colour schemes. With those requirements in mind, the logic to determine the value of colour related design tokens becomes more complex, and requires knowledge of the current active theme as well as the current colour scheme.
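As a rough sketch of that lookup, the TypeScript below resolves a colour token from a theme name and a colour scheme. The data shape, the resolveColour helper, and in particular the fallback order (scheme-specific values before default ones, the theme before the base) are assumptions chosen for illustration, not a description of our implementation.

```typescript
type ColourScheme = "default" | "monochrome-dark" | "monochrome-light";
type SchemeColours = Record<string, string>;
// Each theme may define colours for any of the three schemes.
type ThemeColours = Partial<Record<ColourScheme, SchemeColours>>;

function resolveColour(
  themes: Record<string, ThemeColours>,
  themeName: string,
  scheme: ColourScheme,
  token: string,
): string {
  const theme = themes[themeName] ?? {};
  const base = themes["base"] ?? {};
  // Assumed lookup order: most specific definition first.
  const candidates = [
    theme[scheme]?.[token],
    base[scheme]?.[token],
    theme["default"]?.[token],
    base["default"]?.[token],
  ];
  const value = candidates.find((candidate) => candidate !== undefined);
  if (value === undefined) {
    throw new Error(`Unresolved colour token: ${token}`);
  }
  return value;
}

// Illustrative data only.
const themes: Record<string, ThemeColours> = {
  base: {
    default: { "color.text.primary": "black" },
    "monochrome-dark": { "color.text.primary": "white" },
  },
  foo: {
    default: { "color.text.primary": "blue" },
  },
};

console.log(resolveColour(themes, "foo", "default", "color.text.primary"));         // "blue"
console.log(resolveColour(themes, "foo", "monochrome-dark", "color.text.primary")); // "white"
```

The point is only that both inputs are needed: the same token yields a different value when either the theme or the colour scheme changes.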

A constant source of confusion has been the relationship between colour schemes and native dark mode that the user could potentially want to enable from the operating system settings. While we always had full dark mode support in mind when implementing colour schemes, and their current architecture can simplify the creation of a native dark mode for Zalando, it would not necessarily be as simple as enabling the "monochrome-dark" colour scheme on the entire page.

Additional considerations will have to be made in order to proceed towards native dark mode. For example, there would be a need to express the default background colour through its own semantic token. We would also need to clarify the relationship with themes and colour schemes. Would "dark" be treated as a new colour scheme to be supported by each theme? Would "dark" and "monochrome-dark" be the same thing? Can a colour scheme change depending on native dark mode?

All those questions lead to complex conversations about how themes are used, their purpose, and the impact they have on the user experience. In order to answer all of them, we may have to gradually iterate on those concepts in order to find out what works and what doesn’t.

Style dictionary

The core of our theming infrastructure is our design tokens repository. We use Style Dictionary as a framework, and we define tokens in a single source of truth that can be consumed by libraries implemented for different platforms. Style Dictionary lets us use a shared data format that can then be transformed to adapt to the needs of all the consuming component libraries. For example, it takes care of converting and using the right units and colour formats for web, Android and iOS. Additionally, it can generate platform-specific artefacts that can be bundled, published, and consumed independently.

Style Dictionary is also easy to customise to our specific needs. Particularly with our own "transforms" and "formats", we can handle custom requirements in a well-tested and reusable way. Some interesting examples are a transform that handles a boolean "display" token type and maps it to CSS properties on web while keeping it as a boolean for app consumption, and another transform that applies transparency to colours in a cross-platform way.

Formats, on the other hand, can be used to customise the files generated for each platform. We can run a single build, generating different artefacts, and then have independent pipelines to publish them. This allows teams from web, Android, and iOS to independently adapt the format of tokens to their platform without affecting the others.

[color.text]
primary.value = "black"
primary-dark.value = "white"

[spacing]
s.value = "1rem"
s-desktop.value = "2rem"

[theme.foo]
color.text.primary.value = "blue"
spacing.s.value = "1.5rem"

The TOML format allows us to express the nested structure of tokens in a human-friendly way. Within the tokens folder, we have distinct files for different categories, like spacing, colours, typography, etc. Each file creates a namespace for the tokens defined inside it. By concatenating all the files inside the tokens folder, we obtain a single dictionary object that represents the "base" theme. Colour schemes and responsive variants for each token, instead, are expressed using extra tokens with predefined suffixes (e.g. -dark, -tablet, etc.).

A theme is created with a file located in a separate folder, which defines a dictionary mirroring the structure of the base theme, but includes only tokens that are changed. The final theme dictionary is then computed by deep merging the base theme object with the theme one. This approach establishes a direct inheritance of each theme from the base theme, and is particularly convenient when it is expected for a base visual identity to be maintained across multiple themes.
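The deep merge can be sketched in a few lines of TypeScript. This is a simplified illustration rather than our build code: the TokenTree type and the deepMerge helper are names invented for the example, and the token values mirror the TOML snippet above.

```typescript
type TokenTree = { [key: string]: string | TokenTree };

function isTree(value: string | TokenTree | undefined): value is TokenTree {
  return typeof value === "object" && value !== null;
}

// Recursively merge overrides into a copy of base; the inputs are not mutated.
function deepMerge(base: TokenTree, overrides: TokenTree): TokenTree {
  const result: TokenTree = { ...base };
  for (const [key, value] of Object.entries(overrides)) {
    const existing = result[key];
    result[key] = isTree(existing) && isTree(value) ? deepMerge(existing, value) : value;
  }
  return result;
}

const baseTheme: TokenTree = {
  color: { text: { primary: { value: "black" }, "primary-dark": { value: "white" } } },
  spacing: { s: { value: "1rem" }, "s-desktop": { value: "2rem" } },
};

// The theme file lists only the tokens it changes.
const fooOverrides: TokenTree = {
  color: { text: { primary: { value: "blue" } } },
  spacing: { s: { value: "1.5rem" } },
};

const fooTheme = deepMerge(baseTheme, fooOverrides);
console.log(JSON.stringify(fooTheme, null, 2));
```

In the merged result, color.text.primary and spacing.s carry the theme values, while primary-dark and s-desktop are inherited from the base theme untouched.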

CSS Variables (WEB)

The main output format consumed by the web component library is a custom CSS file containing all the tokens encoded as CSS variables. The variables are then consumed by our CSS framework, which in turn exposes a library of classes for our React components. Ultimately, when working on a component and consuming some classes to set the primary text colour, there's no need for any knowledge about themes, colour schemes, or screen sizes; we can assume the value will change automatically based on the defined overrides. This effectively decouples the implementation of components from the context in which they may be used by providing a stable and reliable interface to get dynamic values from a list of available semantic tokens.

For this behaviour to happen automatically, themes, colour schemes, and responsive variants for each token are implemented using classes to scope the set of required overrides.

:root {
  --spacing-s: 1rem;
  --color-text-primary: black;
}

@media (min-width: 64rem) {
  :root {
    --spacing-s: 2rem;
  }
}

.dark {
  --color-text-primary: white;
}

.theme-foo {
  --color-text-primary: blue;
  --spacing-s: 1.5rem;
}

This way, setting a theme or colour scheme class on a container ensures that all its children resolve the tokens with the correct value. By relying on classes, we are less dependent on more complex JavaScript-based tooling, and we can use different ways to add or remove the required classes based on the use case.
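To make the mapping from tokens to scoped variables concrete, here is a minimal TypeScript sketch. In our setup this job is done by a custom Style Dictionary format; the code below does not use Style Dictionary's API, and the toCssVariables and toCssRule helpers are hypothetical names for a standalone illustration.

```typescript
type TokenTree = { [key: string]: string | TokenTree };

// Walk the tree: a node holding a string "value" is a token, anything else is a namespace.
function toCssVariables(tree: TokenTree, path: string[] = []): string[] {
  const lines: string[] = [];
  for (const [key, node] of Object.entries(tree)) {
    if (typeof node === "string") continue; // ignore stray string entries
    const leaf = node["value"];
    if (typeof leaf === "string") {
      lines.push(`--${[...path, key].join("-")}: ${leaf};`);
    } else {
      lines.push(...toCssVariables(node, [...path, key]));
    }
  }
  return lines;
}

// Wrap the variables in the selector that scopes the override.
function toCssRule(selector: string, tree: TokenTree): string {
  const body = toCssVariables(tree)
    .map((line) => `  ${line}`)
    .join("\n");
  return `${selector} {\n${body}\n}`;
}

const fooOverrides: TokenTree = {
  color: { text: { primary: { value: "blue" } } },
  spacing: { s: { value: "1.5rem" } },
};

console.log(toCssRule(".theme-foo", fooOverrides));
```

Passing ":root" with the base tokens, or a colour scheme class with the suffixed tokens, would produce the other rules in the same way.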

Another advantage of using variable overrides is that we can express a whole theme solely by the difference from the base one, allowing for smaller CSS size overhead and, possibly, to load a separate small CSS file for the theme only when needed. On the other hand a drawback of this approach is that multiple themes nested inside each other on the same page would not be possible without duplicating all the existing tokens, otherwise we would get unpredictable combinations depending on what each theme overrides or not. Thus far, this hasn’t been a problem as we do not anticipate multiple themes appearing on the same page given our priority of maintaining visual coherency for our users.

Even without nested themes, the possibility of having nested colour schemes poses similar challenges, and we had to handle colours less efficiently by duplicating all colour tokens for every colour scheme, even if they were unchanged. Additionally, given that CSS selectors with the same specificity are applied based on their order of definition, the only way to guarantee that the class of the closest themed parent wins would be to have additional selectors for every possible nesting combination.

While the recently introduced :is selectors help in keeping the code readable, there is still no way to support arbitrary nesting, requiring us to impose a hard limit. In the near future, once supported in all major browsers, the CSS @scope at-rule should help solve most of those issues, and enable more complex nested theming capabilities.

.a,
.b .a,
.c .a,
.a .a,
.a .a .a,
.a .b .a,
/* etc... */ {
  --primary: red;
}

/* can be simplified to */
.a,
:is(.a, .b, .c) .a,
:is(.a, .b, .c) :is(.a, .b, .c) .a {
  --primary: red;
}

/* in the future, once @scope is supported */
/* this also allows for arbitrary nesting levels */
@scope (.a) {
  & {
    --primary: red;
  }
}

One interesting caveat of using a class scope to override the value of variables is related to how the value of CSS variables is resolved. A var() reference is substituted on the element where the custom property is declared, and descendants inherit the already resolved value. This becomes a bit complicated when the value of a variable is a reference to another variable.

For example given this CSS and HTML:

:root {
  --primary: black;
  --color: var(--primary);
}

.blue {
  --primary: blue;
}

.box {
  width: 100px;
  height: 100px;
  background-color: var(--color);
}

<div class="blue">
  <div class="box"></div>
</div>

We do not get the intuitively expected behaviour, and the box appears to be black instead of blue. This happens because --color is declared on :root, so its var(--primary) reference is resolved there, where --primary is black, and the box inherits the already resolved value. Counterintuitively, --color won’t be re-evaluated when --primary changes further down the tree, unless --color is declared again on an element that sees the new value.

To address this, we can introduce an additional scope class that declares --color again on the themed container:

<div class="scope blue">
  <div class="box"></div>
</div>

:root {
  --primary: black;
}

:root,
.scope {
  --color: var(--primary);
}

.blue {
  --primary: blue;
}

.box {
  width: 100px;
  height: 100px;
  background-color: var(--color);
}

Now, the behaviour is in line with our expectations. As components are always children of a possibly themed container, we can add a class to their root container to enable this scoped resolution whenever we want a token to refer to another token rather than a static value. This is especially beneficial in scenarios involving component-level theming.

:root {
  --color-text-primary: black;
}

.dark {
  --color-text-primary: white;
}

:root,
.scope {
  --button-color-text: var(--color-text-primary);
}

iOS

Having the great power of code generation, it was tempting to convert design tokens placed in TOML files into the final source code that could be consumed directly in any iOS project. At first, we attempted to map TOML directly to Swift, but encountered certain challenges. Firstly, this approach would have allowed any engineer to extend the existing theme with new attributes. Additionally, we also had to figure out how to automate publishing of new versions of the library. Carthage, the dependency manager for iOS we use, assumes that the dependency is placed in a git repository and one should provide a url to download and build it. This means that all the generated files should be committed and pushed to the GitHub repository, and pushing the generated source code was considered a bad practice.

With this in mind, we quickly added some base Swift files that describe the structure of a Theme manually, and switched our scripts to generate JSON files, which, in turn, are not that harmful when committed automatically, as they're just resources and don't include any business logic. Having JSON files as a way to populate themes with actual values should also give us flexibility in case we consider downloading themes from some kind of server API.

The system architecture of the iOS library for consuming design tokens is simple: there is a Theme structure, that defines all the agreed attributes, and there is an entity called ThemeManager, that loads the stored JSON files and populates itself with all the known variations of a Theme. Now any theme can be accessed from this ThemeManager just by its name.

Applying a theme is a recursive process: a theme applied on a higher level, let's say a screen, will be automatically applied to all its subviews, then to the subviews of these subviews, and so on. It doesn't matter if a view doesn't change its appearance depending on a theme; this doesn't affect the theme propagation process. For the views that do support theming, the result will be visible at once.

While supporting the theming capability in different ZDS components, we faced a problem. The appearance of a component is described by a Style object, which is just a static structure encapsulating all the necessary attributes, such as a background colour, padding values, font size, etc. Every component has multiple presets for this Style structure. For example, a Flag component can be default, positive or sale, and every such preset stores its own values for the same attributes. Changing a theme would mean recreating the same Style structure with different values. At this point it seemed that we should store default, positive and sale Flag values for every theme separately, and adding a new theme would mean that a new variant of the same presets should be added for every component. Not very scalable, is it?

So we introduced StyleTokens. For every component that supports theming, a StyleToken is just an enum that tells us which preset should be applied, regardless of the actual values that come from a theme. Based on this StyleToken value, the actual Style structure is generated every time the appearance of the component should change.

Now that meant that the final look of every themable component depends on 3 inputs:

style token
theme name
colour scheme

Every time any of these three changes, the theming engine creates a new instance of the Style object, which is used to redraw the view. Now we can switch themes and add as many of them as we like without having to modify existing components every time it happens.

Android

Theme resources for the base theme and its child themes are generated in two Android-consumable resource formats:

XML resources for Android ViewSystem
JSON files for Compose

These resources are then packaged and published as a library in our internal maven repository, ready to be consumed by the Android component library as well as directly in the Zalando App codebase.

XML

This is the most used format as of now, given that most of our components are still built in XML. In this format, theming is generated in the form of tokens/attributes that are then made into theme XML classes/objects ready to be consumed, e.g. BaseTheme, Designer, etc. These themes can be easily applied using their ids, e.g. R.style.BaseTheme.

JSON for Compose

Here, we generate theme tokens in the form of JSON files that are also packaged and shipped in the same library. These files are then parsed and theming data is extracted from them in the ZDS library. A theming architecture is then built on top of this data. This theming solution is also represented as simple semantic tokens that are ready to be consumed in all Composables (components written in compose).

ColorSchemes

We support three colour schemes:

Default
Mono-Light
Mono-Dark

Each theme is generated with these three colour schemes supported, and it gets to decide the actual colours for each one. The client/user of a certain theme can choose when and where (on which part of their screen) to apply a certain colour scheme.

In XML, we offer two colour scheme templates that, when applied to certain sections of the screen, handle the colour swapping to the Monochrome (Light or Dark) variants for each colour. They work on all themes.

<style name="MonoLightScheme">
  <item name="ColorSchemeType">MONO_LIGHT</item>
  <item name="colorBackgroundSecondary">
    ?colorBackgroundSecondaryMono
  </item>
</style>

In Compose, both the theme and the colour scheme are chosen at the root through the ZdsTheme selector. Because theming in Compose is simple to use, a new ZdsTheme Composable can be used at any part of the experience to choose and apply any combination of a ZDS theme and a colour scheme that fits the requirements of that section of the screen.

@Composable
fun ZdsTheme(
   zdsThemeType: ZdsThemeType = ZdsThemeType.BaseTheme,
   zdsColorScheme: ZdsColorScheme = ZdsColorScheme.Default,
   content: @Composable () -> Unit,
) {
 ...
}

Component-level Theming

Component-level theming is an additional layer of tokens intended to alter the visuals of a specific component without affecting the rest of the components. For example, the Flag component can be modified without affecting the rest of the visual language or theme, so all other components are safe when the default flag changes colour from primary to secondary or something else.

Conclusion

Theming a design system is a way to introduce variations in a controlled manner. Depending on the business use case, careful consideration should be taken on how the theming architecture is designed. One of the most challenging parts is to identify the properties that can be altered by a theme, as well as the possible values they may have.

Governance becomes a key aspect when introducing theming to a design system. Like any other source of variations, themes should be managed and maintained in a way that ensures the baseline visual identity is preserved. This includes defining the number of themes, how different they can look, who can create them, and how to ensure that the visual identity is maintained.

By leveraging a single source of truth for design tokens, it becomes possible to share the specifications of each theme across different platforms. This allows for a predictable styling of all components, and decouples the implementation of components from the themed context in which they are used.

Enhancing the Mock Server: A User Interface Approach

2024-04-25T00:00:00+02:00

As far as feature life cycles go, we as a team follow certain agile practices in pursuing the delivery of a feature. We first discover and surface potential features or enhancements through data-driven approaches, which then culminate in a proposal in the form of an intake document. Following its signoff, we narrow the scope and define deliverables, focusing on an iterative approach to incrementally accomplish the feature in more manageable milestones. Lastly, once we have fleshed out the technical documentation, initial design mockups, API schemas, and tickets, we begin with the actual implementation.

At this point, however, a common scenario takes place in which the API endpoints have not yet been developed, forcing frontend developers to postpone fetching from live endpoints and to continue developing the UI by mocking the API response statically. Popular tools have arisen to tackle this issue, such as mirage.js, MSW, etc., which facilitate the mocking of servers, typically by intercepting the desired endpoints and returning predefined responses. This enables frontend developers to work independently from the backend while reducing the time needed to finish the milestone.

Fig 1. Agile approach including the Mock Server

While this solved the issue of frontend independence, another issue arose during the review phases with our product manager. A typical review cycle could take the form of developers first publishing the current state of the feature on the staging environment, so that it is easily accessible to authorized users but still hidden from the public. Those internal users would then be able to inspect the feature, though only in the state the mocked values allowed it to display. Naturally, requests came along to see how the feature would react if the API returned certain edge case responses. Each request required an update in the code base, another pull request to publish it, and finally its deployment on the staging environment. These steps could be reduced even further, possibly making our colleagues more independent from developers when reviewing such feature behaviors.

Solution Summary

While the foundation for our solution is based on mirage.js, using similar libraries that allow server mocking should also be feasible. In our case, there was little reason to try a different library after having used it and having done initial research on its applicability. The bottleneck, however, was that these libraries were only able to mock each endpoint with a single response, requiring a change in code to load different mocked responses if desired.

To overcome this, a UI was built on top of mirage.js so that users themselves could choose what specific endpoints should return as a response in order to make the application behave in a certain way. An example of this was our Data Freshness feature, which rendered differently depending on how recent KPIs or other similar data were updated. If a product manager would like to check how that specific feature would change in appearance if the responsible endpoint either returned freshly added, late or no data at all, then they would only need to select the provided options on the mock server UI to have the changes take effect.

Fig 2. Mock Server UI in action: mocking the /branding-campaigns-summary endpoint

In this case, neither a developer nor a new staging deployment is needed in order for users to inspect specific UI edge cases and scenarios while also having the option to shut down the mock server on the fly once our backend has finished implementing live endpoints. The only additional step required is the setup of these edge cases that features could potentially exhibit in the form of multiple mocked data sets for the mock server to consume.

Deep Dive

The actual implementation of the mock server follows similar suggestions from the official docs of mirage.js in that we have to define three parts:

the mocked data responses in JSON format
a controller to define the endpoints we wish the mock server to intercept
the instantiation of the mock server itself
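The three parts can be sketched without any library as follows. This is a simplified, library-free TypeScript illustration and not mirage.js code: in the real setup the third part is the mirage.js server instance, while the buildRoutes and handle names, the scenario names, and the response bodies here are made up for the example.

```typescript
type MockResponse = { status: number; body: unknown };
type Scenarios = Record<string, MockResponse>;
type Handler = () => MockResponse;

// 1. Mocked data: several named scenarios per endpoint (hypothetical values).
const mocks: Record<string, Scenarios> = {
  "/branding-campaigns-summary": {
    fresh: { status: 200, body: { dataFreshness: "fresh" } },
    late: { status: 200, body: { dataFreshness: "late" } },
    empty: { status: 200, body: null },
  },
};

// 2. Controller: defines which endpoints are intercepted and which scenario each one returns.
function buildRoutes(selection: Record<string, string>): Map<string, Handler> {
  const routes = new Map<string, Handler>();
  for (const [endpoint, scenarios] of Object.entries(mocks)) {
    // Fall back to the first scenario when the user has not chosen one.
    const chosen = selection[endpoint] ?? Object.keys(scenarios)[0];
    const response = chosen !== undefined ? scenarios[chosen] : undefined;
    if (response !== undefined) routes.set(endpoint, () => response);
  }
  return routes;
}

// 3. Instantiation: a stand-in for the mock server, created with the current selection.
function makeServer(selection: Record<string, string> = {}) {
  const routes = buildRoutes(selection);
  return {
    handle(endpoint: string): MockResponse {
      const handler = routes.get(endpoint);
      return handler ? handler() : { status: 404, body: null };
    },
  };
}

const server = makeServer({ "/branding-campaigns-summary": "late" });
console.log(server.handle("/branding-campaigns-summary").body); // { dataFreshness: "late" }
```

Each endpoint still returns exactly one response per server instance; choosing a different scenario means creating the server again with a different selection.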

Provider Component: To ensure the mock server intercepts all relevant endpoints effectively, it should be instantiated before key parts of the application are mounted. Once instantiated, the mock server can only return a single response per endpoint. To overcome this limitation, the UI enables users to control when the mock server instantiates in order to load different mocked responses based on user preferences. This is achieved with a wrapper component built on React’s Context API, which not only houses the logic for re-instantiation but also simplifies setting up the mock server. By wrapping the main component with the Provider component, developers can easily configure the mock server by passing it the necessary props. This approach also streamlines the UI component (<MockServer />), which can automatically gather all required information from the context without the need for additional props.

const isMockServerEnabled = config.env !== "production";
const App = isMockServerEnabled ? (
  <MockServerProvider
    apiNamespace={config.namespace}
    makeServer={makeServer}
    mockServerOptions={OPTIONS}
  >
    {children}
  </MockServerProvider>
) : (
  children
);

...

// In any nested component
import { MockServer } from "@dna-zdirect-ui/mock-server";
...
<MockServer />

Session Storage: The other issue to overcome is passing different mocked responses to the endpoints. Since we allow the user to change the returned responses of endpoints at any point of the app's lifecycle via UI options, a page refresh is necessary in order for the mock server to load a different set of mock data. Carrying over the chosen option, however, was not possible through application state management because the app fully re-mounts after a page reload. The browser's session storage is used instead in order to persist state outside of the app’s lifecycle, with the added benefit that the entries are cleaned up once the session has ended. A unique key is also used here in case multiple apps are using this mock server implementation in the same session.
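A minimal TypeScript sketch of this persistence could look as follows. The key format, the MockSelection shape, and the createSelectionStore helper are assumptions made for the example; the in-memory storage only exists so that the sketch runs outside a browser, where window.sessionStorage would be passed instead.

```typescript
// Minimal subset of the Web Storage API used by the sketch.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// In-memory stand-in for sessionStorage.
class MemoryStorage implements StorageLike {
  private items = new Map<string, string>();
  getItem(key: string): string | null {
    return this.items.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.items.set(key, value);
  }
  removeItem(key: string): void {
    this.items.delete(key);
  }
}

type MockSelection = { enabled: boolean; scenarios: Record<string, string> };

const DEFAULT_SELECTION: MockSelection = { enabled: true, scenarios: {} };

function createSelectionStore(storage: StorageLike, appId: string) {
  // Hypothetical key format: unique per app so several apps can share one session.
  const key = `mock-server:${appId}`;
  return {
    load(): MockSelection {
      const raw = storage.getItem(key);
      if (raw === null) return DEFAULT_SELECTION;
      try {
        return JSON.parse(raw) as MockSelection;
      } catch {
        return DEFAULT_SELECTION; // ignore corrupted entries
      }
    },
    save(selection: MockSelection): void {
      storage.setItem(key, JSON.stringify(selection));
    },
    clear(): void {
      storage.removeItem(key);
    },
  };
}

const storage = new MemoryStorage();
const store = createSelectionStore(storage, "app-a");
store.save({ enabled: true, scenarios: { "/branding-campaigns-summary": "late" } });
// After a page reload, the provider would read the selection back before instantiating the mock server.
console.log(store.load().scenarios);
```

In the browser, saving the selection would be followed by a page reload so that the mock server is instantiated again with the new choice.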

Fig 3 + Fig 4. Screenshots of the inspection window: Console and application tabs

The UI itself is a constellation of components provided by a UI-Kit library, chosen for quick development and consistent design. Its main requirements are to let the user easily select their desired mocked responses, trigger a page reload, and disable or re-enable the mock server.

Limitations and Alternatives

By building on top of the mock server library mirage.js, we implemented a solution that not only retains the inherent advantage of enabling parallel development of an app's API and UI but also makes mocking more flexible and accessible. The solution:

allows visual documentation and a showcase of edge-case scenarios
enables the mocking of endpoints on the fly
provides ease of use by means of a customized and non-intrusive UI

This solution is by no means an alternative to writing proper unit tests for edge-case scenarios. In fact, unit tests take precedence while this mock server rather acts as an enhancement during an app’s development by enabling an easier way to showcase such scenarios, e.g. during demos. Similarly, contract testing, in which services, such as an API provider and a client are tested if requests are correctly understood and responses are correctly generated, also takes precedence. Where mocks do shine more are the development phases in which the API services are still being developed and can act as an interim solution until these services are available.

While this specific implementation targets REST APIs, the approach should also be compatible with a GraphQL architecture, like the one provided by the Apollo framework, which already comes bundled with its own mocking solution. Whichever technology is used, however, the definitions of mocks live entirely on the frontend side, meaning that they are separate from the conventional API validations and error handling of any backend service. Special attention therefore has to be paid to continuously match the schema of the backend service that was originally intended to be mocked.

Conclusion

All in all, through positive feedback, especially from our product managers and designers, the inclusion of this mock server in our apps not only improves the collaboration between them and engineers by facilitating the presentation of features in various development phases but also eases the setup of a mock server solution for engineers by encapsulating non-business related logic and providing intuitive components. After a couple of implementations, a more generalized version of this mock server has been developed, which is internally available as a separate NPM module.

Lastly, while this is a niche solution that might not fit with many setups, we'd like to stress the importance of giving developers space, resources, and support within their team to explore and experiment in a variety of ways, so that ideas may have enough time to bear fruit.

Enhancing Distributed System Load Shedding with TCP Congestion Control Algorithm

2024-04-23T00:00:00+02:00

Introduction

Our team is responsible for sending out communications to all our customers at Zalando, e.g. confirming a placed order, informing about new content from a favourite brand or announcing sales campaigns. During the preparation of those messages, as well as while sending them out via different service providers, we have to deal with limited resources. We cannot process all requested communication as fast as possible. This occasionally leads to a backlog of requests.

But not all communication is equally important. The business stakeholders have requested to ensure that we process the communication which supports critical business operations within the given service level objectives (SLOs).

This has led us to investigate the space of solutions for load shedding. Load shedding has been addressed in Skipper already. But our system is event driven, all requests we process are delivered as events via Nakadi. Skipper's feature does not help here. But why not use the same underlying idea?

We know that if our system runs within its normal limits, we meet our SLOs. If we controlled the ingestion of message requests into our system, we would be able to process them in a timely manner. Additionally, we would need to combine this control of ingestion with prioritization of those requests which support critical business operations.

Overview of the System

First, let me introduce you to the system under load.

Communication Platform Overview

Nakadi is a distributed event bus that offers a RESTful API on top of Kafka-like queues. This component serves a couple of thousand event types published by different teams across Zalando for different purposes. Of those, more than 1000 event types trigger customer communication.

The Stream Consumer is the microservice that acts as the entry point for the events into the entire platform. It is responsible for consuming the events from Nakadi, applying some light processing, and pushing them to the RabbitMQ broker. Every Nakadi event type is processed by an instance of the Event Listener.

RabbitMQ is a message broker and should be considered as the backbone of our platform. It is responsible for receiving the events from stream consumer and making them available for the downstream services.

Our Platform consists of many services. These microservices are responsible for processing the events. This includes but is not limited to:

Rendering messages (both push notification & email)
Checking for the customers' consent, preference and blocklist
Checking for the customers' eligibility
Storing templates and the configurations of Zalando's different tenants

Inside the platform, we have a lot of components that are interacting with each other, and the communication between those components is done mostly via RabbitMQ.

Each service publishes to one or more exchanges and consumes from one or more queues. With so much communication going on between the services, RabbitMQ is the middleman for all of it.

High Level Design

We know that a suitable backlog size behind each application lets it scale out and reach its best throughput, which in turn lets us achieve our SLOs. The system is able to adjust the resources acquired from Kubernetes based on demand (using a scaling mechanism based on CPU, memory, endpoint calls and backlogs).

We consider the whole platform as a system with an interface, and we strive to protect it at the interface level, by avoiding overwhelming that system with messages that it can't handle in proper time. This means we can steer the ingestion based on the priority and the available capacity of the system.

The Stream Consumer implements adaptive concurrency management using Additive Increase Multiplicative Decrease (AIMD). This algorithm reacts to reduced service capacity: as long as no congestion is detected, the request rate is increased by a fixed amount, and whenever congestion is detected, the request rate is reduced by a multiplier.

We needed to find proper indicators for reduced service capacity. The Stream Consumer publishes the messages to RabbitMQ, so we looked for indicators available from RabbitMQ. As the first indicator we decided to use errors: whenever we can’t publish, we should reduce the consumption rate. The second is more subtle. RabbitMQ is able to apply back-pressure when slow consumers are detected and system resources are consumed too fast. In this case RabbitMQ slows down the publish rate, which the publisher experiences as an increase in publish time. The Stream Consumer observes those metrics and adjusts the consumption rate.
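
To make the two indicators concrete, here is a minimal sketch of how such a congestion check could look. It is not the Stream Consumer's actual code: the names `PublishStats` and `is_congested`, the `max_errors` parameter and the threshold value are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PublishStats:
    """Publish observations gathered during one collection window (hypothetical type)."""
    latencies_ms: list[float]  # time taken by each successful publish
    errors: int                # number of failed publish attempts


def is_congested(stats: PublishStats, p50_limit_ms: float, max_errors: int = 0) -> bool:
    """Return True when either of the two congestion indicators fires."""
    # Indicator 1: publishing failed more often than tolerated.
    if stats.errors > max_errors:
        return True
    # Indicator 2: the median (P50) publish latency exceeds the configured limit,
    # which is how back-pressure shows up on the publisher side.
    if stats.latencies_ms and median(stats.latencies_ms) > p50_limit_ms:
        return True
    return False
```

With an assumed limit of 50 ms, a window with latencies of 40, 80 and 120 ms counts as congested (median 80 ms), while a window without any observations does not.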

Reducing the consumption for all event types would help to run the system within its limits, but it does not prioritize the critical ones yet. The component shall be able to selectively adjust how fast the Stream Consumer consumes events from Nakadi. Therefore every event type is assigned a rate based on its priority and the system load. This ensures that every reader gets its dedicated capacity. If more capacity is available, the system adjusts accordingly and provides a higher rate to event types with a higher demand (backlog).

Thus there is no need to determine the tipping-point throughput of any single service. The AIMD algorithm also adapts to increased capacity after the system scales out. Most importantly, the algorithm requires only a local variable, which avoids central coordination such as a shared database.

By following this approach we

Avoid multiple changes in all the microservices by scoping it to one component.
Achieve prioritization on the service consumption level, hence avoid the need to prioritize messages inside the platform.
Get a scalable solution with no single point of failure.
Use Nakadi to persist the backlog, hence reducing the risk of overloading RabbitMQ.

We will need to tune the actual value (latency of publishing to RabbitMQ) used as an indicator for reducing ingestion. It should allow enough load on the system to trigger scaling of the services in the platform, while still reducing the number of messages stored in RabbitMQ.

Low Level Design

Changes in Stream Consumer

Statistics Collector: collects statistics about the latency (e.g. P50) of publishing to RabbitMQ as well as any exceptions thrown while publishing.
Congestion Detector: decides whether there is congestion in the system by comparing the data it receives from the statistics collector (publish latency and exceptions thrown) with the limits configured in the service.
Throttle: provided as one instance per consumer. This is the class that implements the AIMD algorithm. The consumer instantiates it with the priority of its event type, and that priority then affects the increase/decrease of the permitted events/sec that can be consumed.

How the Design Works

When the Stream Consumer starts, all the event listeners start with an initial consumption batch size. They will also instantiate a throttle instance.
The statistics collector cron job kicks in, collecting some statistics about latency (P50) and exceptions, and then calls the congestion detector to provide the results.
The congestion detector checks the data it receives and decides whether there is congestion by comparing the data with the limits set in the configuration. The congestion detector then passes its decision to the throttle of each event listener through an observer pattern.
The throttle, once called, and depending on the decision from the congestion detector as well as the priority it was given when the consumer started, will decide the new batch size using the AIMD. (Note: there is no coordination between different throttles!).
As modifying the batch size is currently not supported natively by Nakadi, the application will slow down/speed up the consumption accordingly.
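
The last step can be pictured as pacing: since the batch size itself cannot be changed, the listener waits between batches so that its effective rate matches what the throttle permits. The sketch below shows one possible way to compute that wait; the function name and the rate-based formulation are assumptions, not the Stream Consumer's actual implementation.

```python
def pause_after_batch(batch_events: int, permitted_rate: float, elapsed_s: float) -> float:
    """Seconds to wait before reading the next batch so that the effective
    consumption rate does not exceed permitted_rate (events per second)."""
    if permitted_rate <= 0:
        raise ValueError("permitted_rate must be positive")
    # Time the batch is allowed to take at the permitted rate.
    budget_s = batch_events / permitted_rate
    # Only wait for the part of the budget that processing has not used up.
    return max(0.0, budget_s - elapsed_s)
```

For example, a batch of 100 events at a permitted rate of 50 events/sec has a budget of 2 seconds; if processing took 0.5 seconds, the listener would wait another 1.5 seconds, and if processing took longer than the budget it would not wait at all.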

How priorities affect speeding up and slowing down event consumption

Let’s suppose that we have 3 priorities in our system, from P1 to P3, where P1 is the highest and P3 is the lowest. The Stream Consumer has a speed-up value and a slow-down value defined in its configuration for each priority.

First scenario, signal for consumption speeding up (relieved RabbitMQ cluster)

For each priority, there is a defined value for speeding up. Let’s assume some numbers here:
- P1: 15
- P2: 10
- P3: 5
So the new consumption rates (batch sizes) will be:
- P1: Previous value + 15
- P2: Previous value + 10
- P3: Previous value + 5

Additive Increase

Second scenario, signal for consumption slowing down (RabbitMQ cluster under load)

For slowing down, a different value is also set per priority. Let’s assume these numbers:
- P1: 20% decrease
- P2: 40% decrease
- P3: 60% decrease
So the new consumption rates will be:
- P1: Previous value * (1 - 20%) => 20% decrease
- P2: Previous value * (1 - 40%) => 40% decrease
- P3: Previous value * (1 - 60%) => 60% decrease

Multiplicative Decrease

So, the rule of thumb here is:

Whenever the RabbitMQ cluster is not under load, we speed up the consumption rate for all consumers, but we give more capacity to higher-priority event types than to lower-priority ones.
Whenever the RabbitMQ cluster is under load, we slow down the consumption rate by a percentage for all consumers, but those with high priority decrease by a much smaller percentage than those with lower priority.
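
The two scenarios can be put together in a small sketch of a priority-aware AIMD throttle. It uses the example numbers from above; the class and field names, as well as the `min_rate` and `max_rate` bounds, are assumptions added for illustration and are not taken from the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityConfig:
    increase_step: float      # added to the rate when there is no congestion
    decrease_fraction: float  # share of the rate removed on congestion (0.2 = 20%)


# Example values from the two scenarios above.
PRIORITIES = {
    "P1": PriorityConfig(increase_step=15, decrease_fraction=0.20),
    "P2": PriorityConfig(increase_step=10, decrease_fraction=0.40),
    "P3": PriorityConfig(increase_step=5, decrease_fraction=0.60),
}


class Throttle:
    """AIMD throttle for one event listener; it keeps only local state."""

    def __init__(self, priority: str, initial_rate: float,
                 min_rate: float = 1.0, max_rate: float = 1000.0) -> None:
        self.config = PRIORITIES[priority]
        self.min_rate = min_rate  # hypothetical floor so a listener never stalls
        self.max_rate = max_rate  # hypothetical ceiling
        self.rate = initial_rate

    def on_signal(self, congested: bool) -> float:
        """Update and return the permitted rate after a congestion signal."""
        if congested:
            # Multiplicative decrease: keep (1 - fraction) of the previous rate.
            self.rate *= 1.0 - self.config.decrease_fraction
        else:
            # Additive increase: add a fixed, priority-specific step.
            self.rate += self.config.increase_step
        self.rate = min(self.max_rate, max(self.min_rate, self.rate))
        return self.rate
```

Starting both a P1 and a P3 throttle at 100, one "no congestion" signal moves them to 115 and 105, and a following "congestion" signal moves them to roughly 92 and 42. The gap between priorities widens with every congestion signal, without the throttles ever talking to each other.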

Results

So far, we have been running the solution in production for around 6 months, and we have seen a lot of improvements in the platform, including:

Less stress on RabbitMQ cluster, as the messages are not pushed to it unless there is enough capacity to handle them.
RabbitMQ Messages

There are around 300k messages in the queue backlog of one application, while the other applications are not under load, as is obvious from the small number of messages in their queues. The reduced stress on the RabbitMQ cluster is also visible when comparing the number of messages in the queues with the number of messages in the backlog in Nakadi (point 3 below).
Prioritization of messages: higher-priority messages are sent first, and lower-priority messages are sent later.

Order Confirmation Processing Time

Commercial Messages Processing Time

In the above diagrams, you see that the processing time for order confirmation is relatively stable. This is important as it’s a high priority use case. In contrast, commercial messages experience an increase in the processing time. This is acceptable as this is a low priority use case.
Events that can't be processed at the moment are still in Nakadi, so they can be processed later or easily discarded in case of emergency.

Nakadi Backlog

As we can see, the backlog is being consumed without putting pressure on the platform. Messages of lower priority can be discarded in case of emergency.

Nakadi Order Confirmation Backlog

The order confirmation is a P1 priority message, so it's being consumed first (during the same period, lower-priority messages were growing in the backlog).

Conclusion

Utilizing AIMD, the algorithm known from TCP congestion control, to control traffic proved to be effective in event driven systems. In general, it's much better to control how much traffic is ingested into your system at the source, rather than letting it flood the system and then trying to deal with it.

In our case, it helped us to solve the problem of prioritizing messages: messages are only allowed to enter the system based on their priority and the capacity the system can handle. It also helped us to avoid using the RabbitMQ cluster as storage for millions of messages. With a smaller queue size in RabbitMQ we follow best practices. In case of emergency, we can easily discard messages, as most of them will still be in the source.

Resources

Stop Rate Limiting! Capacity Management Done Right | Strange Loop Conference | 2017

12 Golden Signals To Discover Anomalies And Performance Issues on Your AWS RDS Fleet

2024-02-20T00:00:00+01:00

TL;DR: The database-per-service pattern in the microservices world brings an overhead in operating database instances and observing their health status and anomalies. Standardisation of methodology and tooling is a key factor for success at scale. We have incorporated learnings from past incidents, anomalies and empirical observations into a methodology for observing health status using 12 golden signals. The simplest way to adopt this methodology within your engineering environment is the open source utility rds-health, which we recently released.

The problem of maintaining robustness at scale

Since Zalando scaled its organisation using the microservice pattern, the company has experienced steady growth across multiple dimensions: in the number of users, in the technology landscape and in the number of teams involved in building and running systems. Today, Zalando is a leading European online fashion retailer. It is critical that our architecture is robust enough to withstand challenges and uncertainties while teams innovate and experiment with new ideas.

Overhead by microworld. Microservices became a design style for us to define system architectures, purify core business concepts, evolve solutions in parallel, make things look uniform, and implement stable and consistent interfaces across systems. Our engineering teams independently design, build and operate multiple microservices. Often, microservices are implemented with a datastore following the design pattern – database per service, where each service deploys its own database instances. The Zalando TechRadar guides teams about the database selection and their deployment options – AWS RDS with Postgres as one of the available options.

Hidden costs by toil. Operating a swarm of small databases at company scale quickly gets tough. Complex anomaly detection tasks, such as detecting byzantine failures or issues with SQL statements, take a noticeable investment across teams. A combination of manual processes and ad-hoc scripts to manage the health of database instances is not an option at scale. It became increasingly time-consuming and error-prone, and some teams had to allocate engineers for a sprint or even months to such activities.

Standardisation is one of the factors that reduces this complexity. It is well known that if teams use the same frameworks or design patterns, making changes at scale becomes easier. The same concept extends to the operations domain. We have limited the fragmentation by providing stronger guidelines to our engineers on which metrics to observe from datastore components.

We have developed a methodology on how to detect anomalies with AWS RDS workload through 12 “golden signals”. We also decided to release an open-source command line utility (https://github.com/zalando/rds-health) to help automate and streamline detection of anomalies and performance issues. The utility provides a consistent and repeatable way to automatically analyse database metrics, reducing the risk of errors and improving overall efficiency.

12 Golden Signals

Setting up and operating high-performing databases requires observability of a large variety of signals across multiple buckets: CPU, Memory, Disk and Workload. Thanks to past incidents and empirical observations, we have reduced the complexity so that only a few signals from each of these buckets need to be analysed to reach a reliable conclusion about the health status of database instances. This is how we arrived at twelve golden signals.

C1: CPU Utilisation os.cpuUtilization.total - typical database workloads are bound to memory or storage, high CPU is an anomaly that requires further investigation. Our past experience advises us that CPU utilisation over 40% - 60% on database instances eventually leads to incidents.
C2: CPU Await os.cpuUtilization.await - the Linux kernel uses the await metric to report the time spent waiting for IO requests, measured from the start of a request to its completion. A high value indicates that a database instance is bound to the IO bandwidth of storage. Similar to the previous metric, we have concluded that any value above 5 - 10% eventually leads to an incident.
M1: Swapped In from disk os.swap.in - Swap is an extension of RAM onto the disk. The operating system swaps RAM pages to the disk and back when there is not enough memory to run the workload. Any intensive swap activity indicates that the database instance is running low on memory. Since disk is an order of magnitude slower than RAM, any swap activity slows down the operating system and its applications.
M2: Swapped Out to disk os.swap.out - See explanation above.
D1: Storage Read IO os.diskIO.rdsdev.readIOsPS - Storage IO bandwidth is an essential resource for high-performing databases. The IO bandwidth has to be aligned with the overall database workload so that there is enough bandwidth to handle it. In the case of AWS RDS, the metric value shall be aligned with the storage configuration deployed for the database instance. With the GP2 volume type, IOPS are provisioned by volume size: 3 IOPS per GB of storage with a minimum of 100 IOPS. The IO volume type has an explicit value defined at deployment time. Note that a very low value shows that the entire dataset is served from memory.
D2: Storage Write IO os.diskIO.rdsdev.writeIOsPS - See explanation above. Also note that a high number shows that the workload is write-mostly and potentially bound to the IO capacity of storage.
D3: Storage IO Latency os.diskIO.rdsdev.await - Overall performance of storage is a function of its IO bandwidth and its latency. The latency metric reflects the time spent by the storage to load data blocks into memory. High storage latency implies higher latency for the application workload on the database. Our empirical observations show that storage latency above 10 ms eventually leads to an incident, and latency above 5 ms impacts application SLOs. A typical storage latency for database systems should be less than 4 - 5 ms.
P1: Cache Hit Ratio db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read) - Databases read and write application data in blocks. The number of blocks read by the database from physical storage has to be aligned with the storage IO bandwidth provisioned for the database instance. The database caches these blocks in memory to optimise application performance. When clients request data, the database checks the cache, and if the relevant data is not there it has to read it from disk, so queries become slower. Any value below 80 % shows that the database has an insufficient amount of shared buffers or physical RAM: the data required for the most frequently called queries doesn't fit into memory, and the database has to read it from disk.
P2: Blocks Read Latency db.IO.blk_read_time - The metric reflects the time used by the database to read blocks from the storage. High latency on the storage implies a high latency of application workload. We have observed an impact on SLOs when the latency has grown above 10 ms.
P3: Database Deadlocks db.Concurrency.deadlocks - Number of deadlocks detected in this database. Ideally, it shall be 0. The application schema and IO logic requires evaluation if the number is high.
P4: Database Transactions db.Transactions.xact_commit - Number of transactions executed by the database. A low number indicates that the database instance is a standby.
P5: SQL efficiency [db.SQL.tup_fetched / db.SQL.tup_returned] - SQL efficiency shows the ratio of rows fetched by the client to rows returned from the storage. The metric does not necessarily show any performance issue with the database, but a high ratio of returned vs fetched rows should raise the question of optimising SQL queries, schema or indexes. For example, if you do select count(*) from million_row_table, one million rows will be returned, but only one row will be fetched.
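
Three of the signals above involve a small calculation rather than a raw metric. The sketch below spells them out using the formulas given in the list. The function names are made up for illustration, they are not part of rds-health, and treating an idle database as a ratio of 1.0 is an assumption.

```python
def cache_hit_ratio(blks_hit: float, blks_read: float) -> float:
    """P1: share of block requests served from memory (0.0 to 1.0)."""
    total = blks_hit + blks_read
    # Assumption: with no block activity at all there is nothing to miss.
    return 1.0 if total == 0 else blks_hit / total


def sql_efficiency(tup_fetched: float, tup_returned: float) -> float:
    """P5: rows fetched by the client divided by rows returned from storage."""
    # Assumption: no returned rows is treated as fully efficient.
    return 1.0 if tup_returned == 0 else tup_fetched / tup_returned


def gp2_baseline_iops(volume_size_gb: int) -> int:
    """D1/D2: GP2 baseline of 3 IOPS per GB with a minimum of 100 IOPS.

    Any upper limit of the volume type is deliberately not modelled here."""
    return max(100, 3 * volume_size_gb)
```

For the select count(*) example, `sql_efficiency(1, 1_000_000)` gives one millionth, which is the kind of value that should prompt a look at the query or its indexes. A 20 GB GP2 volume stays at the 100 IOPS minimum, while a 500 GB volume gets a baseline of 1500 IOPS to compare the D1 and D2 readings against.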

Open Source Command Line Utility

AWS offers a wide range of observability solutions for AWS RDS such as AWS CloudWatch, AWS Performance Insights and others. These off-the-shelf solutions help anyone with setting up alerts and debugging anomalies when one of the twelve golden signals is violated. What was missing was an efficient utility to holistically observe the status of the entire AWS RDS fleet in your account with “a single click of the button”.

This is how the rds-health utility was born. It conducts analysis of AWS RDS instances using time-series metrics collected by AWS Performance Insights. Actually, the utility is a frontend for AWS APIs that simply automates analysis of discussed golden signals across your accounts and regions. The utility can be easily customised to meet specific use cases, allowing users to tailor their workflows to their unique needs. Some of the key features include:

Show configuration of all AWS RDS instances and clusters;
Check health of all AWS RDS deployments;
Conduct capacity planning for your AWS RDS deployments.

Check out our open source project at https://github.com/zalando/rds-health. It guides you through simple installation and configuration steps together with tutorials about its features. We are looking forward to hearing your feedback and suggestions for improvement. Please raise an issue on the project.

Conclusion

Our objective is to reduce complexity by limiting the fragmentation within our engineering ecosystem and by providing teams with engineering and operational guidelines. The discussed methodology for detecting anomalies in AWS RDS workloads through 12 “golden signals” is one example of how we address complexity at Zalando.

Standardisation is not only about guidelines but also about automating repetitive tasks, freeing up time for more creative and strategic work. We are happy to share our learnings and approaches for observing AWS RDS instances at scale with the Open Source Community through an open source utility. Apply these learnings within your teams.

If you have any questions about our methodology or open source utility rds-health itself, please raise an issue on the project. Contributions are welcomed and encouraged!

Paper Announcement: Joint Order Selection, Allocation, Batching and Picking for Large Scale Warehouses

2024-01-29T00:00:00+01:00

We, as the Zalando team BART, are excited to share our latest research paper, describing the optimization problem of order batching and picking in Zalando's warehouses. In this paper (preprint available on arxiv), we formally introduce our proposed order batching problem and provide benchmark instances, two baseline algorithms, and a solution validation tool, all made publicly available on GitHub. Our goal is to provide insights to the research community on planning and optimizing the warehouse order picking process in large-scale warehouses, such as Zalando's.

The Underlying Optimization Problem

Zalando Tech Logistics is responsible for creating the software that manages all Zalando warehouses and their processes. Team BART, part of Zalando's Logistics Algorithms department, provides the decision-making algorithms for order batching and picking. These decisions can be broken down into four parts:

Order Selection: Which customer orders are processed next?
Item Allocation: Which warehouse items are used to fulfill a selected order?
Batching: Which selected orders are picked together?
Picking: How are batches split up into pick tours?

Traditionally, these decision problems are considered individually and solved using simplified rules. For example, order selection could be done using a first-in-first-out approach. However, our experience and analysis of batching algorithms have shown that a purely sequential approach is far from optimal. While there has been some research on these problems in the literature, there is no closed formulation, to the best of our knowledge, that encapsulates all four problems into one. And this is exactly what we aim to achieve with our paper: We combine all of the four problems into one, named Joint Order Selection, Allocation, Batching and Picking.

Benchmark Instances

To ensure a clear understanding of the problem statement, we provide benchmark instances for the Joint Order Selection, Allocation, Batching, and Picking Problem. These instances allow anybody interested to immediately try out their ideas for solving this problem. Additionally, we share the implementation of two baseline algorithms described in the paper.

Outlook

We aim to stimulate academic discussion around the Joint Order Selection, Allocation, Batching, and Picking Problem. We believe there are practitioners and researchers interested in this type of optimization problem. By providing benchmark instances, we hope to establish a standard definition that can be easily adapted for further research.

Publishing this problem formulation also allows us to share insights on how we are solving this problem at Zalando. We look forward to sharing more in our next publication. In the meantime, we welcome any feedback and collaboration from the community: Feel free to share your feedback via GitHub.

Tale of 'metadpata': the revenge of the supertools

2024-01-23T00:00:00+01:00

The perfect storm

In the midst of Cyber Week preparation in November 2022, I was DMd by a colleague with a request to quickly join a call. To my surprise, as I was anticipating a 1:1 call, I was greeted by a message indicating that 60+ others were in the call as well. It turned out that I was just about to join an incident response call for what later came to be known internally as the "metadpata" incident.

In the call, a group of colleagues was trying to put the jigsaw pieces together analyzing why suddenly a large amount of DNS entries across our AWS accounts were removed, causing our shop to effectively go offline for our customers. Additionally, all of us except for the cloud infrastructure team were locked out of accessing AWS accounts and internal tools due to missing DNS entries, rendering the incident response difficult. In short – the classic DNS incident that you may be familiar with from other write-ups. Some helpful and lucky souls hastily started to copy their cached DNS entries before they expired. It was an all hands on deck situation with everyone focused on the single goal of restoring service for our customers ASAP. What followed in the incident call was a controlled disaster recovery with colleagues manually restoring DNS entries starting with essential tooling, followed by core infrastructure, and the services powering our on-site experiences to restore service for our customers.

How was it possible that the DNS entries across multiple accounts suddenly disappeared? The Pull Request that triggered the event was aimed at adjusting YAML configuration for our infrastructure. However, apart from changing configuration for a test account, it also contained a "p" character in one of the configuration fields called "metadata" transforming it into "metadpata". Yet, why was this single character so powerful and destructive?

Enter supertools

We coined the term supertools when working on the Post Mortem for the incident. These are applications or scripts that have the ability to execute large-scale changes across the infrastructure. Initially well intentioned as daemons automating creation of resources and implementing various stages of their lifecycle, they also perform cleanup operations that result in removal of resources. The latter operation, typically used for cleanup of resources that are to be decommissioned, easily becomes subject to cost optimization. As part of cost-saving measures, the pace of executing deletion operations was sped up.

The tool processing the configuration with the unfortunate typo is responsible for setting up AWS accounts. It is a background job that parses the configuration and computes the operations that are to be executed on each affected account. It uses the metadata object to calculate the accounts to work on. The typo caused the configuration to be interpreted as "no accounts", which in turn was treated the same as the situation where all accounts are to be decommissioned. The deletion process was triggered and it managed to delete hosted zones containing DNS entries, which triggered the incident. Luckily, the deletion process ran into an error when performing the deletion operations, reducing the scope of the incident and the disaster recovery required.

Incident response

While our incident response culture is well established, this incident tested it to its full extent. In an all hands on deck situation, the cloud infrastructure team was focused on disaster recovery, organized via an incident call. Through an incident chat room, our colleagues were reporting the impact they still observed and reported on the progress of recovery in their clusters. The Incident Commanders focused on determining the approach and priority of the recovery efforts as well as on facilitating the communication between the chatroom and the incident call. Throughout the incident response we switched the Incident Commanders according to their areas of expertise which kept the incident response focused and efficient.

Post Mortem

Through great collaboration across teams to recover the needed DNS entries and restore service for our customers, we were back online in a few hours. As the first incident of its kind and with a large scale impact for our customers, it got high attention across the organization. Predictably, this overloaded the Post Mortem document in Google Docs, which limits the number of concurrent editors, and got in the way of those who were working on it. To reduce the likelihood of this happening again, we've changed all links to Post Mortem documents shared with big audiences to use the /preview URL by default.

Being close to the start of Cyber Week the focus for the team was to complete the Post Mortem analysis work and decide upon immediate actions to prevent a similar incident from happening. This included pausing changes to the configuration, a review of all supertools in place, and temporary deactivation of the relevant deletion processes. We also wrote a 1-pager summary of the incident and shared it proactively with the whole organization to keep everyone informed about the types of action items scheduled short- and mid-term as agreed during an Incident Review.

Infrastructure changes

An important and often vigorously discussed part of Post Mortems are the action items aimed at preventing recurrence of the incident. In our case, we analyzed how infrastructure changes are reviewed and rolled out a number of improvements with the aim of improving the validation and reducing the blast radius of infrastructure changes that go wrong. We will focus on the most impactful changes that were implemented.

Account lifecycle management changes

We have introduced a new step in the account decommissioning process that simulates deletion using Network ACLs. We also remove the delegation for the DNS zone assigned to the account to ensure that related CNAMEs will not resolve anymore. The account is left in this state for one week before proceeding further with the real decommissioning. This acts as a final "scream test" to make sure there are no more dependencies on this account.

Having assessed the trade-offs and risks for deletion of resources, we have additionally decided to be more careful with deletion of resources that have low cost savings potential compared to the impact a wrong deletion could have. These changes are now done manually and take a longer time to complete, an acceptable trade-off we're willing to take to reduce the risk. To mitigate the potential cost increase, we are monitoring the account costs for the previous 7 days. In case it is over a certain threshold, we look at deleting the resources manually.

Change validation

We've introduced a series of validation steps, for example stringent checks for the presence of mandatory keys and the preview of all stack templates using AWS CloudFormation Linter before they get deployed.

Also, we have set up jsonschema validation for all our configuration files. All these checks run both locally (thanks to pre-commit hooks) and in the CI/CD pipelines. We also did some small quality of life improvements to enable autocompletion and schema validation in our local IDEs, which mitigates the possibility of typos and errors and is simple to set up:

# yaml-language-server: $schema=schema/config_schema.json
(your config)
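
To illustrate what a check for mandatory keys buys in this incident, here is a deliberately simplified, plain-Python stand-in for the schema validation described above. It is not our jsonschema setup: the key sets and the function name are hypothetical, and a real schema would describe the nested structure as well.

```python
REQUIRED_KEYS = {"metadata"}           # hypothetical mandatory top-level keys
ALLOWED_KEYS = {"metadata", "stacks"}  # hypothetical full set of known keys


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    # A mandatory key that is absent must fail the check instead of being
    # silently interpreted as "nothing configured".
    for key in sorted(REQUIRED_KEYS - set(config)):
        problems.append(f"missing mandatory key: {key}")
    # Unknown keys are almost always typos, so they are rejected as well.
    for key in sorted(set(config) - ALLOWED_KEYS):
        problems.append(f"unknown key: {key}")
    return problems
```

A configuration containing "metadpata" instead of "metadata" is reported twice by this check, once as a missing mandatory key and once as an unknown key, so the change would be blocked by the pre-commit hook or the CI/CD pipeline before any tool acts on it.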

Additionally, for creation/decommissioning of critical resources, we have introduced several automated quality checks which ensure that every change corresponds to the user request and the pull request description. These checks also introduce additional approval from the respective account or cost center owners and validation from the respective managers. The checks are implemented as a GitHub bot that comments on the Pull Request and blocks the merge until all the checks are validated.

Change previews

We have implemented automated previews in the Pull Request comments. This feature leverages the AWS CloudFormation "ChangeSet" feature. When an updated CF stack template is provided to the CloudFormation "CreateChangeSet" endpoint, CloudFormation generates a json preview of the changes, which then can be executed or rejected. We read this ChangeSet from each account in our AWS Organization and merge them to create a human readable preview of changes in a PR comment. After the preview is created, the ChangeSet is dropped.

Preview of changes in Pull Requests
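
The merging step can be sketched as follows. The input mirrors the "Changes" list that CloudFormation returns when a ChangeSet is described (entries with a "ResourceChange" carrying "Action", "ResourceType" and "LogicalResourceId"). The function name, the account names and the output layout are assumptions for illustration, and fetching the ChangeSets from AWS is left out.

```python
def render_preview(change_sets: dict[str, list[dict]]) -> str:
    """Merge per-account ChangeSet changes into one text preview for a PR comment."""
    lines = []
    for account in sorted(change_sets):
        changes = change_sets[account]
        lines.append(f"Account {account}: {len(changes)} change(s)")
        for change in changes:
            rc = change["ResourceChange"]
            # One line per resource: action, resource type, logical id.
            lines.append(f"  {rc['Action']:<6} {rc['ResourceType']} {rc['LogicalResourceId']}")
    return "\n".join(lines)
```

A reviewer who sees a line such as "Remove AWS::Route53::HostedZone" under an account that the Pull Request was never meant to touch has a clear signal to stop the merge.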

Phased rollout

Our Kubernetes cluster rollout already included a phased rollout to different groups of clusters. This idea was extended to our AWS infrastructure. The rollout process adopted by our tooling now includes gradual rollout to different release channels, each associated with a few AWS account categories (e.g. playground, test, infra). All changes must go through all release channels before getting to production. This approach allows us to gradually deploy changes to different accounts, ensuring a more controlled propagation that catches errors early on with a limited blast radius. The trade-off here is of course that the rollout takes a longer time.
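
As a rough illustration of the idea, a phased rollout boils down to walking through the release channels in a fixed order and stopping at the first failure. The channel order, the `deploy` callback and the function name below are assumptions; our tooling is more involved than this.

```python
from typing import Callable, Sequence

# Hypothetical channel order; playground, test and infra are the account
# categories named above, with production always last.
CHANNELS = ("playground", "test", "infra", "production")


def rollout(deploy: Callable[[str], bool], channels: Sequence[str] = CHANNELS) -> list[str]:
    """Deploy channel by channel and stop at the first failure.

    Returns the channels that were deployed successfully."""
    done = []
    for channel in channels:
        if not deploy(channel):
            break  # later channels, including production, are never touched
        done.append(channel)
    return done
```

If the deployment to the test channel fails, only playground has received the change, which is the limited blast radius that the phased approach is after.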

Summary

Supertools never sleep (unless you program them otherwise!). They're powerful yet often misjudged in review processes as they're expected to only trigger action in the scope of expected changes. As our story shows, this is highly dependent on the implementation and it's highly important to implement additional safety nets in the processes and tooling. We hope that the examples of changes we've implemented in our infrastructure will help you reflect and improve mechanisms in your own context.

Using modules for Testcontainers with Golang

2023-12-19T00:00:00+01:00

Introduction

Testcontainers for Go enables developers to easily run tests against containerized dependencies. In our previous articles, you can find an introduction to Integration tests with Testcontainers and explore how to write Functional tests with Testcontainers (in Java).

This blog post dives deep into how to use modules and into a common issue with Testcontainers for Go.

What do we use it for?

Services often use external dependencies such as datastores or queues. It is possible to mock these dependencies, but if you want to run an integration test, for example, it is better to verify against the real dependency (or something close enough).

Starting a container with the image of the dependency is a convenient way to verify that the application works as expected. With Testcontainers, starting the container is done programmatically so that you can define it as part of your tests. The machine running the tests (developer machine, CI/CD) needs a container runtime (e.g. Docker, Podman).

Basic implementation

Testcontainers for Go is very easy to use. The quick start example is:

ctx := context.TODO()
req := testcontainers.ContainerRequest{
    Image:        "redis:latest",
    ExposedPorts: []string{"6379/tcp"},
    WaitingFor:   wait.ForLog("Ready to accept connections"),
}
redisC, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
    ContainerRequest: req,
    Started:          true,
})
if err != nil {
    panic(err)
}
defer func() {
    if err := redisC.Terminate(ctx); err != nil {
        panic(err)
    }
}()

If we dive into the code above, we notice that:

testcontainers.ContainerRequest initialises a struct with the container image, exposed port and waiting strategy parameters
testcontainers.GenericContainer starts the container and returns the container and an error
redisC.Terminate, called via defer, terminates the container once the test is done

Implementing our own internal library

From the example in the previous section, there are some minor inconveniences:

wait.ForLog("Ready to accept connections") uses logs to wait for start of the container which can break easily
ExposedPorts: []string{"6379/tcp"} requires knowledge of the exposed port for Redis

There might also be additional environment variables and other parameters that are useful when running a Redis container and that require deeper knowledge. As such, we decided to create an internal library that initialises a container with the default parameters, to ease test implementation. To remain flexible, we used the Functional Options Pattern so that consumers can still customize the container depending on their needs.
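
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of functional options. The Container struct and the With... helpers below are hypothetical stand-ins and not the API of our internal container package.

```go
package main

import "fmt"

// Container is a hypothetical stand-in for the library's container type.
type Container struct {
	Image   string
	Version string
	Port    string
}

// Option mutates a Container; each With... helper returns one.
type Option func(*Container)

func WithImage(image string) Option {
	return func(c *Container) { c.Image = image }
}

func WithVersion(version string) Option {
	return func(c *Container) { c.Version = version }
}

func WithPort(port string) Option {
	return func(c *Container) { c.Port = port }
}

// NewRedis applies the defaults first, then the caller's options,
// so a later option overrides an earlier default.
func NewRedis(options ...Option) *Container {
	c := &Container{}
	defaults := []Option{WithImage("redis"), WithVersion("latest"), WithPort("6379/tcp")}
	for _, o := range append(defaults, options...) {
		o(c)
	}
	return c
}

func main() {
	c := NewRedis(WithVersion("7.2"))
	fmt.Printf("%s:%s on %s\n", c.Image, c.Version, c.Port)
}
```

The key design choice is the order of application: because caller options run after the defaults, consumers only have to specify what differs from the preset.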

Example of implementation for Redis:

func defaultPreset() []container.Option {
    return []container.Option{
        container.WithPort("6379/tcp"),
        container.WithGetURL(func(port nat.Port) string {
            return "localhost:" + port.Port()
        }),
        container.WithImage("redis"),
        container.WithWaitingStrategy(func(c *container.Container) wait.Strategy {
            return wait.ForAll(
                wait.NewHostPortStrategy(c.Port),
                wait.ForLog("Ready to accept connections"))
        }),
    }
}

// New - create a new container able to run redis
func New(options ...container.Option) (*container.Container, error) {
    c := container.Container{}
    options = append(defaultPreset(), options...)
    for _, o := range options {
        o(&c)
    }

    return &c, nil
}

// Start - start a Redis container and return a container.CreatedContainer
func Start(ctx context.Context, options ...container.Option) (container.CreatedContainer, error) {
    p, err := New(options...)
    if err != nil {
        return container.CreatedContainer{}, err
    }
    return p.Start(ctx)
}

Usage of the library for Redis:

ctx := context.TODO()
cc, err := redis.Start(ctx, container.WithVersion("latest"))
if err != nil {
    panic(err)
}
defer func() {
    if err := cc.Stop(ctx, nil); err != nil {
        panic(err)
    }
}()

With this internal library, developers could easily add tests for Redis without the need to figure out the waiting strategy, exposed port, etc. In case of incompatibility, the internal library could be updated to centrally fix the issue.

Common issue - Garbage collector (Ryuk / Reaper)

Testcontainers goes the extra mile to ensure that the container is removed once the test is done, using a garbage collector (Ryuk, also called the Reaper), which is an additional container started as a "sidecar". This container is responsible for stopping the container being tested even if your test crashes (which would prevent defer from running).

When using Docker, it works without problems, but with other container runtimes (like Podman) you will often get this kind of error: Error response from daemon: container create: statfs /var/run/docker.sock: permission denied: creating reaper failed: failed to create container.

One way to "fix this" is to deactivate it with the environment variable TESTCONTAINERS_RYUK_DISABLED=true.

Another way is to set the Podman machine rootful and add:

export TESTCONTAINERS_RYUK_CONTAINER_PRIVILEGED=true; # needed to run Reaper (alternative disable it TESTCONTAINERS_RYUK_DISABLED=true)
export TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock; # needed to apply the bind with statfs

In our internal library we took the approach of disabling it by default as developers had issues running it locally.
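
One possible way to apply such a default is sketched below. It is an assumption about how this could be done, not a copy of our library code: applyRyukDefault is a hypothetical helper, and it only has an effect if it runs before Testcontainers reads its configuration (for example at the start of TestMain).

```go
package main

import (
	"fmt"
	"os"
)

const ryukDisabledVar = "TESTCONTAINERS_RYUK_DISABLED"

// applyRyukDefault disables Ryuk unless the variable is already set, so an
// explicit choice by the developer or the CI system always wins.
// lookup and set are injected to keep the function easy to test.
func applyRyukDefault(
	lookup func(string) (string, bool),
	set func(string, string) error,
) (string, error) {
	if v, ok := lookup(ryukDisabledVar); ok {
		return v, nil
	}
	if err := set(ryukDisabledVar, "true"); err != nil {
		return "", err
	}
	return "true", nil
}

func main() {
	v, err := applyRyukDefault(os.LookupEnv, os.Setenv)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s=%s\n", ryukDisabledVar, v)
}
```

Keeping an existing value untouched means CI environments where Ryuk works can still opt in by exporting TESTCONTAINERS_RYUK_DISABLED=false.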

Moving to modules

Once our internal library was stable enough, we decided that it was time to give back to the community by contributing to Testcontainers. But surprise... modules had just been introduced in Testcontainers. Modules do exactly what our internal library was built for, so we migrated all our services to modules and discontinued the internal library. From the migration, we learned that it was possible to use the upstream library out of the box now that modules have been introduced, which reduces the maintenance cost of our services. The main challenge was to fine-tune the environment variables on developer machines (to make the garbage collector work), which we did using a Makefile.

Adapted example from the Testcontainers documentation:

ctx := context.TODO()
redisContainer, err := redis.RunContainer(ctx,
    testcontainers.WithImage("docker.io/redis:latest"),
)
if err != nil {
    panic(err)
}
defer func() {
    if err := redisContainer.Terminate(ctx); err != nil {
        panic(err)
    }
}()

Conclusion

Testcontainers for Go is a great library to support testing, and it is even better now that modules have been introduced. Some small impediments with the garbage collector exist, but they can be worked around easily as described in this post.

If you haven't already, I hope this post encourages you to adopt Testcontainers. It is highly recommended for improving the testability of your applications.

Migrating From Elasticsearch 7.17 to Elasticsearch 8.x: Pitfalls and Learnings

2023-11-20T00:00:00+01:00

What this article is about

What kind of changes we had to make to the codebase
How we did the actual upgrade
What challenges we faced
How we did the data transfer
How the data was kept in sync

What this article is not

A step-by-step guide on how to upgrade Elasticsearch (read on to find out why).

Who we are

We are a team from the Search & Browse department, the department in Zalando that is responsible for all things search (read: relevance, personalisation, sorting, filters, full text search, ... in short, everything that forms the search experience). The search applications are using Elasticsearch as the main datastore, so we are also the ones responsible for its well-being.

Why upgrade

We have been using Elasticsearch for a long time. It was upgraded more or less on a regular basis, but we were always a bit behind the latest version (Elastic has a regular release schedule; the releases are all scheduled well in advance). We were on version 7.17 for a while, and while we were pretty happy with it, we still had a few reasons to upgrade to 8.x.

First, we wanted to use the new features that were introduced in 8.0, namely approximate kNN (k-nearest neighbors) search, also known as ANN search. Vector search was already used in Search & Browse, but it was exact kNN search, the brute-force and less performant variant. This is what Elastic says about approximate vs. exact kNN search:

In most cases, you’ll want to use approximate kNN. Approximate kNN offers lower latency at the cost of slower indexing and imperfect accuracy.

Exact, brute-force kNN guarantees accurate results but doesn’t scale well with large datasets. With this approach, a script_score query must scan each matching document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using a query to limit the number of matching documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.

There is also a great article about ANN on the Elastic blog by Julie Tibshirani. Read it, you won't regret it.

Second, we also wanted to be on the latest version for performance and security reasons, because obviously, every new release has a lot of security fixes and performance improvements.

Why it's difficult to upgrade

Boromir telling you that you don't just upgrade Elasticsearch

Usually, Elasticsearch is updated in gradual increments, minor to minor version, and it's difficult, not to mention dangerous, to make such a big move as going from one major version to another. Also, the documentation on the official website, while ample, is pretty disorganized, and there's no complete step-by-step for such an endeavor. And even if you were to gather all the information from the docs, it's still not enough. You need to know what to do with your data, how to keep it in sync, and how to make sure that the new version is working as expected.

In Zalando, the size of data is pretty massive. We have millions of articles in each country, and while the gender root page for women in Germany will show you 450k items, it's simply not the full picture. This number is just how many items at most get scanned to show you the first page. The actual number of items is much higher. And we currently have 28 domains (country + language combos), each with its own catalog. So in short, we have a lot of data, and we need to make sure that it's not lost or corrupted during the upgrade.

How we approached the upgrade

Another reason why one can't just go and upgrade Elasticsearch is because, well, it's not an island.

What I mean is, it's not some independent entity that has a value all by itself. It's our datastore, and it's used by a lot of our services. So before one goes and upgrades this massive thing, one should think of possible breaking changes in the product. And also, one should think about how it changes the actual usage of Elasticsearch.

The main search application in Zalando, the one that deals directly with Elasticsearch queries, is called Origami. From the description on its (internal) repository page:

Origami is the Zalando Core Search API. It provides a powerful information retrieval language and engine that integrates several microservice components built by the Search Department. In the landscape of Zalando Search and Browse platform, Origami is the connector - coordinating all search intelligence to serve correct search results to customers.

Origami builds on top of Elasticsearch and our internal/Zalando-specific suite of APIs. These APIs will facilitate composing/serving search and discovery, navigation, and analytics functionalities.

The application is written in Scala and used the Java High Level REST Client, which was deprecated in Elasticsearch 7.15.0 and replaced by the Elasticsearch Java API Client, so first of all, we had to update the codebase to use the new client.

Updating the codebase

However, updating the codebase was also not a one-step task. (This just goes deeper into the rabbit hole, doesn't it?)

Origami has 443k lines of code in 846 files. Of course, a lot of these files are the configs and tests and test resources, so the actual number of Scala files is much lower. But still, it's a lot of code, and a lot of it is dealing with Elasticsearch.

Upgrading the Elasticsearch API to be able to work with version 8.x also represented a choice. We could either use the official Elasticsearch Java API Client, or we could use the Elasticsearch Scala client which seemed to be quite popular and had a lot of contributors (and stars) on GitHub. Both options were available and viable. Both had their pros and cons.

With the Elasticsearch Java API, the advantages would be:

The library is officially supported and its versions match the Elasticsearch releases;
There is a ready-made DSL for all the REST APIs;
It’s open source and the code is available on GitHub. The license is Apache License 2.0.

However:

It’s in Java. This means that lambda types, collection types, etc. are not directly interoperable, and special conversions have to be done within our code;
We’d miss out on other Scala advantages like built-in immutability, null safety and so on.

The unofficial Scala client is advertised as:

Providing a type-safe, concise DSL;
Integrating with standard Scala futures or other effects libraries;
Using Scala collections library over Java collections;
Returning Option where the Java methods would return null;
Using Scala Durations instead of strings/longs for time values;
Supporting typeclasses for indexing, updating, and search backed by Jackson, Circe, Json4s, PlayJson and Spray Json implementations;
Supporting Java and Scala HTTP clients such as Akka-Http;
Providing reactive-streams implementation;
Providing a testkit subproject ideal for tests.

The disadvantages, however, could not be ignored:

It’s not official and the releases are not closely following Elastic’s release schedule. At the time we were looking at it, Elasticsearch was already at v8.7 and this library’s last version was 8.5.4. (It could work with Elasticsearch up to version 8.6 though);
Because it did not implement all the new features, there was no DSL for kNN search. KNN search was still available via sending a pure JSON query, but it was not a pretty option.

In the end, we decided to go with the Elasticsearch Java API Client. The main reason was that it was officially supported, its releases closely followed the Elasticsearch releases, and it wouldn't just disappear into thin air in the unlikely event that its creator suddenly decided to quit. Also, it had a DSL for all the REST APIs. The absence of a kNN search DSL in the Scala library was really disappointing, because approximate kNN search was one of the main reasons why we wanted to upgrade in the first place.

So, the choice was made.

But.

As I said before, this was a large application.

How does one make sure that no existing functionality is going to break when upgrading the API? How does one make sure that all the existing queries are still going to work?

Obviously, you write a test.

Writing a test

There was one more decision that we made while selecting a migration strategy, and that was to start with compatibility mode. This meant that we would keep using the Elasticsearch High Level REST Client from version 7.x, but in compatibility mode, which instructs Elasticsearch 8.x to accept requests and return responses in the 7.x format. This way we would be able to upgrade the Elasticsearch cluster first, and then upgrade the client gradually. With this approach, we would avoid rewriting too much code at once. Afterward, we would be able to use one of the transition strategies recommended by Elastic to gradually upgrade the client.

This approach was also a good fit, since we assumed that we might have a time during the transition phase when the application would have to deal with both Elasticsearch 7.x and Elasticsearch 8.x. Because our Elasticsearch was a multi-cluster deployment, it would be practically impossible to upgrade in one go. We would have to start with less mission-critical clusters, and then gradually move to the more important ones. So, we would definitely have to deal with both versions of Elasticsearch for some time.
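
On the wire, REST API compatibility is requested through versioned media types in the Accept and Content-Type headers. The sketch below only illustrates those headers with Go's standard HTTP client; it is not how the Java client is configured, and the URL and the newCompatRequest helper are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Media type that asks Elasticsearch 8.x for 7.x-compatible behaviour.
const compatMediaType = "application/vnd.elasticsearch+json;compatible-with=7"

// newCompatRequest builds (but does not send) a request carrying the
// compatibility headers. Content-Type is only set when there is a body.
func newCompatRequest(method, url, body string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", compatMediaType)
	if body != "" {
		req.Header.Set("Content-Type", compatMediaType)
	}
	return req, nil
}

func main() {
	// Placeholder host and index; nothing is sent over the network here.
	req, err := newCompatRequest("POST", "http://localhost:9200/article_1/_search",
		`{"query":{"match_all":{}}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println("Accept:", req.Header.Get("Accept"))
	fmt.Println("Content-Type:", req.Header.Get("Content-Type"))
}
```

Since the switch is purely header-based, the same client code can talk to a 7.x or an 8.x cluster, which is what made a mixed-version transition phase feasible.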

So how to write such a test?

This is where Testcontainers shines. Basically, we had a helper class looking like this:

object ESContainers {
  val Version7179 = "7.17.9"
  val Version86 = "8.6.2"
  val Version88 = "8.8.2"

  val VersionDefault = Version7179

  def initAndStartESContainer(version: String = VersionDefault): ElasticsearchEndPoint = {
    val container =
      new ElasticsearchContainer(s"docker.elastic.co/elasticsearch/elasticsearch:$version")
        .withReuse(true)
        .withCreateContainerCmdModifier(cmd => cmd.getHostConfig.withCapAdd(Capability.SYS_CHROOT))
    container.start()
    val hostAndPort = container.getHttpHostAddress.split(":")
    ElasticsearchEndPoint(hostAndPort(0), hostAndPort(1).toInt, container)
  }
}

And then, in the test, we would just do this to start Elasticsearch with the version we needed.

private lazy val endpoint = ESContainers.initAndStartESContainer(Version88)

Since at some point we'd have to deal with both versions of the API, we had to test three combinations:

Elasticsearch 7.x with Elasticsearch 8.x API;
Elasticsearch 8.x with Elasticsearch 8.x API;
Elasticsearch 8.x with Elasticsearch 7.x API.

And with each, we needed to make sure that the common types of actions, done by the application, continue to work as expected.

So this is exactly what we did. We wrote three test classes:

NewClientWithOldElasticTest
OldClientWithNewElasticTest
NewClientWithNewElasticTest

Why is there no OldClientWithOldElasticTest? Because we already knew that it was working. It was the application we already had.

Each class was checking that the application was able to do the following:

Create an index;
Create a document;
Create kNN vector mappings;
Index kNN vector data;
Search for a document with a kNN query;
Delete an index;
Close the client.

The tests were not covering all the queries that we ran - only the common types. But even with this simplified approach we were able to discover a few issues, for which we had to make changes to the codebase.

Issues discovered and fixes applied

Elasticsearch 8 no longer returns the _type field in search responses, so we had to remove it from all the test case resources that represented example JSONs for the expected response.
Elasticsearch 8 didn't allow null in the is_write_index parameter when creating an alias for the index. Therefore, code was added to set this flag explicitly.
Range queries based on date/epoch_second didn't work with upper/lower bounds specified as numbers. (According to the Elastic team, this was a feature and would not be fixed.) Because of that, the range boundaries had to be stringified before being passed to Elasticsearch.
In Elasticsearch 8, a cluster setting called action.destructive_requires_name now defaults to true instead of false. Since our e2e tests were dropping all test indexes by wildcard before starting, they all started crashing. So, a change was introduced to update this setting on a cluster to allow the test suites to run this action. The method that does this is only used in test suites, because for a real production cluster, it's pretty unsafe.
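
To illustrate the third fix, here is a small sketch of building a range query whose bounds are sent as strings. The field name created_at and the epochRangeQuery helper are made up for the example; our actual code is written in Scala and looks different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// epochRangeQuery builds a range query whose bounds are serialized as JSON
// strings rather than JSON numbers. The field name is supplied by the caller.
func epochRangeQuery(field string, fromEpoch, toEpoch int64) ([]byte, error) {
	query := map[string]any{
		"range": map[string]any{
			field: map[string]any{
				"gte": strconv.FormatInt(fromEpoch, 10),
				"lte": strconv.FormatInt(toEpoch, 10),
			},
		},
	}
	return json.Marshal(query)
}

func main() {
	// "created_at" is a hypothetical date field using the epoch_second format.
	b, err := epochRangeQuery("created_at", 1700000000, 1700003600)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The only difference from the numeric version is the quoting of the two bounds, which is why this kind of change is easy to miss without a test running against a real Elasticsearch 8 container.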

Moreover, when we started to switch the other, more detailed integration tests to Elasticsearch 8, we found an issue that was a little more involved. Some of those tests started to fail with the following error:

{
  "type": "query_shard_exception",
  "reason": "it is mandatory to set the [nested] context on the nested sort field: [trace.origami.timestamp].",
  "index_uuid": "_xvEa8gNSFyCDm0aFXqYhg",
  "index": "article_1"
}

That seemed to refer to the sort clause that we had in the e2e test suite:

"sort": [
  {
    "trace.origami.timestamp": {
      "order": "desc"
    }
  }
]

The page about sorting on a nested field for ES 8.8 (current at that time) says that there should be a path specified in a "nested.path" clause of the sort. However, the same page for ES 7.17 states exactly the same, but the query still runs fine without that clause.

So something changed between the versions in such a way that the query started erroring out in ES8, whereas in ES7 it was working fine, despite the docs stating that the parameter is non-optional (the thread I created on the ES discussion board suggests there was a bug and it was fixed). So, we had to add the nested.path clause to the sort clauses in the queries that were sorting on nested fields, meaning that the sort clause from the example above would now look like this:

"sort": [
  {
    "trace.origami.timestamp": {
      "order": "desc",
      "nested": {
        "path": "trace"
      }
    }
  }
]
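
If sort clauses are generated in code, the nested context can be added in one place. The sketch below assumes the list of nested paths is known up front (for example, derived from the index mapping); nestedSort is a hypothetical helper, not part of Origami.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// nestedSort builds one sort entry as JSON. If the field sits under one of
// the known nested paths, the nested.path context is added.
func nestedSort(field, order string, nestedPaths []string) (string, error) {
	opts := map[string]any{"order": order}
	for _, p := range nestedPaths {
		if strings.HasPrefix(field, p+".") {
			opts["nested"] = map[string]any{"path": p}
			break
		}
	}
	b, err := json.Marshal(map[string]any{field: opts})
	return string(b), err
}

func main() {
	// "trace" is the nested path from the example above.
	entry, err := nestedSort("trace.origami.timestamp", "desc", []string{"trace"})
	if err != nil {
		panic(err)
	}
	fmt.Println(entry)
}
```

Fields outside any nested path are left untouched, so the helper can be applied to every sort clause without special-casing.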

Deprecating Elasticsearch settings in preparation for 8.x migration

Summary of changes:

Remove fixed_auto_queue_size thread pool. It’s replaced with the normal fixed thread pool configuration.
Replace deprecated transport.tcp.compress.
Replace node role settings with new node.roles settings (see one and two).
Due to an existing bug, the coordinating role needs to be set as a default which can in turn be overridden by setting the node.roles environment variables with specific values.
Remove deprecated gateway.recover_after_master_nodes setting.
Add human approval to prevent upgrading master nodes before data nodes.
Explicitly disable the serial GC using -XX:-UseSerialGC to avoid the following error message during startup, which appears even though -XX:+UseZGC or -XX:+UseG1GC is explicitly enabled: "Error occurred during initialization of VM" followed by "Multiple garbage collectors selected". Most likely an intermediate script was logging this message. In ES 8.x the container can exit unsuccessfully because of this error.
Coordinating nodes are enabled by default by specifying an empty value.
Data nodes will only have the “data” role defined.
Monitoring checks had to be updated because the role abbreviations changed and became stricter than before.

How we did the actual upgrade

Finally, it seemed that the application was prepared to work with non-homogenous Elasticsearch versions. At last, it was time to upgrade the Elasticsearch cluster itself.

There is a documentation page with some advice about going from 7.x to 8.x, and it states that first, one should move to 7.17. From there, it is recommended to use the Upgrade Assistant tool to help prepare for the upgrade. As an alternative, it is also recommended to use the Reindex API to reindex the data from the old version to the new one.

So in short, Elasticsearch provides two ways to upgrade:

The rolling upgrade approach;
Upgrading via reindex.

The first one is upgrading live: you upgrade the cluster node by node, and the cluster remains available during the upgrade. The second one is upgrading via reindex: you create a new cluster and reindex the data from the old cluster into the new one. Then you switch the traffic to the new cluster and shut down the old one.

In general, Elastic recommends doing a rolling upgrade in the following order:

Upgrade the data nodes first;
Upgrade other non-master nodes (ML-dedicated, coordinating, etc.);
Upgrade the master nodes.

This is because the data nodes can join the cluster with the master nodes of a lower version, but older data nodes can't always join the newer cluster. So, if you upgrade the master nodes first, the data nodes might fail to join it, and the cluster will be unavailable.
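
The ordering rule can be expressed as a simple sort. The Node type, the role names used for ranking and the node names below are illustrative assumptions; a real rollout would read the roles from the cluster.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a minimal, hypothetical description of a cluster node.
type Node struct {
	Name  string
	Roles []string // empty means coordinating-only
}

// upgradeRank: 0 for data nodes, 1 for other non-master nodes,
// 2 for master-eligible nodes (always last).
func upgradeRank(n Node) int {
	isData := false
	for _, r := range n.Roles {
		if r == "master" {
			return 2
		}
		if r == "data" {
			isData = true
		}
	}
	if isData {
		return 0
	}
	return 1
}

// upgradeOrder returns a sorted copy; the input slice is left untouched.
func upgradeOrder(nodes []Node) []Node {
	ordered := append([]Node(nil), nodes...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return upgradeRank(ordered[i]) < upgradeRank(ordered[j])
	})
	return ordered
}

func main() {
	nodes := []Node{
		{Name: "master-0", Roles: []string{"master"}},
		{Name: "coord-0"},
		{Name: "data-0", Roles: []string{"data"}},
	}
	for _, n := range upgradeOrder(nodes) {
		fmt.Println(n.Name)
	}
}
```

A node that is both master-eligible and a data node is ranked with the masters here, which errs on the side of upgrading it late.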

In general, the rolling upgrade is the recommended way to upgrade, because it's less disruptive. However, in our case, it posed too many risks. First of all, we have a multi-cluster deployment, and the clusters are pretty large, so we're talking about some terabytes of data. It would take a lot of time to upgrade the cluster node by node, and during this time, the cluster would be in a mixed state, with some nodes being upgraded and some not, with relocating shards, and in general in a degraded state.

That, in itself, wouldn't be so scary. What would indeed be bad is if something were to go wrong. If we faced data loss, we'd have no choice but to go with restoring the data from snapshots and then resetting the input streams to bring the data up to date. This would take quite some time, because we'd have to do it for all the indices in the cluster, and during all this time, the catalog of products would either be unavailable or would have stale or partial data.

So, we decided to go with the second option, the reindexing. It meant that we'd have to create a new cluster, reindex the data from the old one, and then gradually switch the traffic to the new cluster. It would take more time, but it would be way less risky and less disruptive, because once the data was in sync, going to the new cluster would be just a matter of switching the routing. If something went wrong, the rollback procedure would be almost instantaneous, as it would again be just the routing switched back.

And last but not least, having both clusters running side by side would give us time to test the new cluster and make sure that it was working as expected and performed at the same level. We could first test it with shadow traffic, and then gradually increase the traffic to the new cluster and decrease it on the old one.

Procedure per cluster

The procedure for each of our clusters would be similar and would include the following steps:

Deploy ES8 cluster.
Setup monitoring.
Create index templates (because if we were to index the data from the old cluster, we'd have to make sure that the new cluster has the same index templates as the old one).
Restore data from the latest snapshot.
Set up the shadow intake traffic. This meant that the data would gradually converge with the old cluster, but the queries would still be served by the old cluster. If we were to consider the moment the snapshot was taken as point A and the moment shadow intake was enabled on the new cluster as point B, then it would mean that we have full data from beginning to A, and then from B to the end.
That left us with the gap between points A and B, so the next step would be to perform the data update by resetting the data streams to the point of just before the snapshot was taken.
Shadow query traffic. This would be performed gradually, with monitoring for errors.
Verify that the new cluster works as expected and compare the cluster performance with the old one.
Switch the live traffic to ES8 cluster (again, gradually shifting the percentages).
Remove old traffic and clean up old cluster resources.
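
The reasoning about points A and B in the steps above can be made concrete with a tiny model. Times are plain Unix seconds, and both the Window type and the safety margin are assumptions made for illustration.

```go
package main

import "fmt"

// Window holds points A and B as Unix seconds.
type Window struct {
	SnapshotAt      int64 // point A: when the snapshot was taken
	ShadowEnabledAt int64 // point B: when shadow intake started
}

// Source reports which mechanism delivers an update made at time t
// to the new cluster.
func (w Window) Source(t int64) string {
	switch {
	case t < w.SnapshotAt:
		return "snapshot"
	case t >= w.ShadowEnabledAt:
		return "shadow intake"
	default:
		return "stream replay"
	}
}

// ResetPoint returns where to rewind the input streams: slightly before A.
// The margin is an assumed safety buffer; replaying a few extra updates is
// only harmless if indexing the same update twice yields the same document.
func (w Window) ResetPoint(margin int64) int64 {
	return w.SnapshotAt - margin
}

func main() {
	w := Window{SnapshotAt: 1000, ShadowEnabledAt: 4000}
	for _, t := range []int64{500, 2500, 4500} {
		fmt.Println(t, w.Source(t))
	}
	fmt.Println("reset streams to", w.ResetPoint(60))
}
```

Everything between A and B is only covered by the replay, which is why the streams are reset to a point just before the snapshot rather than to B.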

If these steps sound familiar, it is because they are. It is basically the Blue/Green procedure that is usually used for disaster recovery (failover cluster), or for testing something new. The only difference is that we were using it for a one-time Elasticsearch cluster upgrade and did not keep the second cluster around. (We are also looking into applying the same approach for the failover cluster, but since our deployments are very large and complicated, we're still getting there.) This Blue/Green approach was also used by the team behind Zalando Lounge, which has a separate catalog of products, also backed by Elasticsearch, so we had some in-house experience to compare with.

Routing and shadowing

The whole mechanism is based on a delicate balance of routing and shadowing. We use an open-source solution called Skipper as an ingress controller, which gives us access to filters. For the routing, we're using a custom resource type called RouteGroup. For example, to ensure that the intake pipeline ingests data into the new cluster, the route group configuration needs to be modified to shadow the intake traffic for the /_bulk and /_alias/{index}_write endpoints. Here is a somewhat simplified example configuration for shadowing the specified endpoints:

apiVersion: zalando.org/v1
kind: RouteGroup
spec:
  hosts:
    - cluster-name-{{{CLIENT}}}.ingress.cluster.local
  backends:
    - name: backend-old
      type: network
      address: "http://backend-old.ingress.cluster.local"
    - name: backend-new
      type: network
      address: "http://backend-new.ingress.cluster.local"
  routes:
    ## match to shadow /_bulk, /_alias/{index}_ad*_write to new backend with ES8
    - pathSubtree: /
      pathRegexp: ^/(_bulk|_alias/(index-name-template)_[\d]+_write)$
      predicates:
       - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
      filters:
       - teeLoopback("intake_shadow")
       - preserveHost("false")
      backends:
       - backendName: backend-old

    ## shadow "intake_shadow" matched requests to new backend with ES8
    - pathSubtree: /
      pathRegexp: ^/(_bulk|_alias/(index-name-template)_[\d]+_write)$
      predicates:
       - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
       - Tee("intake_shadow")
       - Weight(2) ## hack required to not match route with Traffic() and teeLoopback()
      filters:
       - preserveHost("false")
      backends:
       - backendName: backend-new
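
Regular expressions in routes are easy to get subtly wrong, so it can help to check them against sample paths before deploying. The sketch below reuses the pathRegexp from the configuration above with Go's regexp package; index-name-template is kept as the literal placeholder, and the sample paths are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as pathRegexp in the RouteGroup above, with the
// "index-name-template" placeholder kept as-is.
var intakePath = regexp.MustCompile(`^/(_bulk|_alias/(index-name-template)_[\d]+_write)$`)

// isShadowedIntake reports whether a request path would match the
// shadowing routes.
func isShadowedIntake(path string) bool {
	return intakePath.MatchString(path)
}

func main() {
	samples := []string{
		"/_bulk",
		"/_alias/index-name-template_42_write",
		"/_search",
		"/index-name-template_42/_bulk",
	}
	for _, p := range samples {
		fmt.Printf("%-40s %v\n", p, isShadowedIntake(p))
	}
}
```

Only the first two sample paths match. Note that an index-scoped bulk request such as /index-name-template_42/_bulk would not be shadowed by this pattern.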

But that's not all. Before shadowing the intake, the mapping templates have to be created. One way to do it would be to just grab them and recreate them on the new cluster. But that would mean doing it manually, and we might also miss updates to them if they happened while the clusters were still running side by side. Since the templates are stored in our code repos and updated (based on the version) on application restart, the traffic related to template creation also had to be shadowed, so we had to capture this specific traffic too. Snippet of the configuration (shortened):

spec:
  routes:
    - path: /:index/_mapping
      predicates:
        - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")
    ## <...>
    - path: /_template/*
      predicates:
        - HeaderRegexp("elasticsearch-index-name", "^(index-name-template)_[\d]+($|_.*)")

Monitoring

The whole process would make no sense if we were going blind. Since it was a multistep procedure, we needed to see how each step is changing the data, affecting the cluster, performing compared to the old cluster, etc. So we needed to set up monitoring. It was based on creating Lightstep streams and setting up the dashboards in Grafana. The dashboards were showing the traffic from both clusters side by side per endpoint, and the key metrics like latency and error rate. We also monitored CPU and memory consumption via Kubernetes.

One of the most important things was that the data would be in sync, so the boards also had index sizes and the difference between them for the old and new cluster. This way, we could see whether, say, restoring from the snapshot was indeed successful and whether the follow-up of shadow intake and stream resetting resulted in the data converging in the end.
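
The comparison behind such a board can be sketched as follows. This is not our Grafana setup, just the underlying arithmetic; the Drift type, the compareSizes function and the numbers are illustrative, and "size" can be a document count or bytes depending on what is collected.

```go
package main

import (
	"fmt"
	"sort"
)

// Drift describes how far an index on the new cluster is from the old one.
type Drift struct {
	Index    string
	Old, New int64
	Diff     int64   // absolute difference
	Relative float64 // Diff / Old, or 1 when the index exists only on the new cluster
}

// compareSizes returns one Drift per index seen on either cluster,
// sorted by index name.
func compareSizes(oldSizes, newSizes map[string]int64) []Drift {
	names := map[string]bool{}
	for n := range oldSizes {
		names[n] = true
	}
	for n := range newSizes {
		names[n] = true
	}
	drifts := make([]Drift, 0, len(names))
	for n := range names {
		o, nw := oldSizes[n], newSizes[n]
		diff := o - nw
		if diff < 0 {
			diff = -diff
		}
		rel := 0.0
		switch {
		case o > 0:
			rel = float64(diff) / float64(o)
		case nw > 0:
			rel = 1
		}
		drifts = append(drifts, Drift{Index: n, Old: o, New: nw, Diff: diff, Relative: rel})
	}
	sort.Slice(drifts, func(i, j int) bool { return drifts[i].Index < drifts[j].Index })
	return drifts
}

func main() {
	oldSizes := map[string]int64{"article_1": 1000, "article_2": 500}
	newSizes := map[string]int64{"article_1": 990}
	for _, d := range compareSizes(oldSizes, newSizes) {
		fmt.Printf("%s old=%d new=%d diff=%d (%.2f%%)\n",
			d.Index, d.Old, d.New, d.Diff, d.Relative*100)
	}
}
```

An index missing on the new cluster shows up with a relative drift of 100%, which makes a failed or partial restore stand out immediately.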

Alerting

And last but not least, before each new cluster went live, we had to update alerts and checks that were set up on the corresponding old cluster. We had to make sure that the alerts were pointing to the new cluster and that the checks were still working as expected. We also had to make sure that the alerts were not firing during the upgrade.

Backing up the data

And of course, as soon as the new cluster went live serving queries and the data on the old cluster stopped being updated (or preferably before that), we set up the snapshotting. We had to make sure that the data was backed up, using the same policies that the previous cluster was using.

Challenges we faced

The process of upgrading the cluster was not without challenges. Some of them were expected, some were not, and some were purely based on people never having performed some procedures before, or on something slipping one's attention.

One such thing resulted in duplicates being shown in the product catalog country-wide. A routing error while switching the country index from the old cluster to the new one caused an extra index to be created automatically (and erroneously), so for some time two different indices with duplicate content existed behind the same alias. But that was quickly fixed, and the duplicates were removed by just dropping the mistakenly created index. (And hey, it's better to show the product twice than not to show it at all, right?)

In general, the whole process was an amazing learning experience, and the whole team is now better prepared for the next upgrade and feels more confident tackling Elasticsearch in general. So, while assuredly sh*t still can and will happen, what matters is how you deal with it and what you learn from it.

For example, the difficulty experienced by team members while restoring the data was a good indicator that our existing procedure of restoring from snapshot was extremely fussy and error-prone, which resulted in looking for alternative solutions, like Kibana-based workflows, to make the process more straightforward and more obvious. Historically, we were using custom scripts and our CI pipeline for that, but now we're aiming to get our engineers better acquainted with Kibana. The scripts are still the default way, but we're getting there.

Success!

As always after a big project, we had a retrospective, and the team was pretty happy with the results. The upgrade was successful, and the new cluster was performing at the same level as the old one. The new features were working as expected, and the new cluster was stable. The monitoring was set up, and the dashboards were showing the data in sync. The alerts were firing as expected, and the checks were working. So all in all, it was a success.

But you know what?

Products keep upgrading. Progress is the only constant thing in the world. So, we're already looking into the next upgrade, and we're already thinking about how to make it even better.

And we will keep evolving, because that's what we do.

We're Zalando. We dress code.

(See what I did here? Even though I can't take any credit for this. This is a slogan that we once had on our company hoodies!)


Mastering Testing Efficiency in Spring Boot: Optimization Strategies and Best Practices

2023-11-14T00:00:00+01:00

Introduction 🚀

Hey there, fellow engineers! Let's dive into the exciting world of Spring Boot testing with JUnit. It is incredibly powerful, providing a realistic environment for testing our code. However, if we don't optimize our tests, they can be slow and negatively affect the lead time for changes for our teams.

This blog post will teach you how to optimize your Spring Boot tests, making them faster, more efficient, and more reliable.

Imagine an application whose tests take 10 minutes to execute. That's a lot of time! Let's roll up our sleeves and see how we can whiz through those tests in no time! 🕒✨

Understanding Test Slicing in Spring

Test slicing in Spring allows testing specific parts of an application, focusing only on relevant components, rather than loading the entire context. It is achieved by annotations like @WebMvcTest, @DataJpaTest, or @JsonTest. These annotations are a targeted approach to limit the context loading to a specific layer or technology. For instance, @WebMvcTest primarily loads the Web layer, while @DataJpaTest initializes the Data JPA layer for more concise and efficient testing. This selective loading approach is a cornerstone in optimizing test efficiency.

There are more annotations that can be used to slice the context. See official Spring documentation on Test Slices.

Test Slicing: Using @DataJpaTest as a replacement for @SpringBootTest 🧩

Let's take a look at an example (code below). The test first deletes all the data (shipments and containers, each shipment can have multiple containers) from the target tables, and then saves a new shipment. Next, it creates a thread pool with 50 threads, where each thread calls the svc.createOrUpdateContainer method.

The test will wait until all the threads are finished, then it will check that the database has only one container.

It's all about checking concurrency issues and involves a swarm of threads, clocking in at about 16 seconds on my machine – a massive chunk of time for a single service check, right?

@ActiveProfiles("test")
@SpringBootTest
abstract class BaseIT {
    // protected (not private) so that subclasses can use the repositories
    @Autowired
    protected lateinit var shipmentRepo: ShipmentRepository

    @Autowired
    protected lateinit var containerRepo: ContainerRepository
}

class ContainerServiceTest : BaseIT() {
    @Autowired
    private lateinit var svc: ContainerService

    @BeforeEach
    fun setup() {
        shipmentRepo.deleteAll()
        containerRepo.deleteAll()
        shipmentRepo.save(shipment)
    }
    @Test
    fun testConcurrentUpdatesForContainer() {

        val executor = Executors.newFixedThreadPool(50)
        repeat(50) {
            executor.execute {
                svc.createOrUpdateContainer("${shipment.id}${svc.DEFAULT_CONTAINER}", Patch("NEW_LABEL"))
            }
        }
        executor.shutdown()
        while (!executor.awaitTermination(100, TimeUnit.MILLISECONDS)) {
            // busy waiting for executor to terminate
        }
        assertThat(containerRepo.find(shipment)).hasSize(1)
    }

}

The first problem we have is the class declaration:

class ContainerServiceTest : BaseIT()

The issue starts with the BaseIT class using @SpringBootTest. This causes the Spring context for the entire application to be loaded (every time we mess with context caching mechanisms, we'll get to that later!). When the application is large enough, a huge number of beans are loaded - a costly operation for tests with specific objectives.

But no, we don't want to load everything. All we need to load is the ContainerService bean and JPA repositories. We can switch to @DataJpaTest. This annotation only loads the JPA part of the application, which is what we need for this test. Let's try it out!

@DataJpaTest
class ContainerServiceTest {
    @Autowired
    private lateinit var svc: ContainerService

    @Autowired
    private lateinit var shipmentRepo: ShipmentRepository

    @Autowired
    private lateinit var containerRepo: ContainerRepository
}

Upon execution, an exception is thrown:

org.springframework.beans.factory.BeanCreationException: Failed to replace DataSource with an embedded database for tests. If you want an embedded database please put a supported one on the classpath or tune the replace attribute of @AutoConfigureTestDatabase.

@DataJpaTest has an annotation @AutoConfigureTestDatabase, which by default, sets up an H2 in-memory database for the tests, and configures DataSource to use it. However, in this case, the H2 dependency is not found in the classpath.

And actually, we don't want to use H2 for our tests, so we can tell @AutoConfigureTestDatabase not to replace our configured database with an H2. Plus, we have to configure and load our own database, which is performed here by importing a @Configuration class called EmbeddedDataSourceConfig (It simply creates a @Bean of type DataSource).

@DataJpaTest
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
@Import(EmbeddedDataSourceConfig::class) // Import the embedded database configuration if needed.
@ActiveProfiles("test") // Use the test profile to load a different configuration for tests.
class ContainerServiceTest {
    // test code
}

Let's try to run the test again. Now, it fails with this error:

org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'ContainerServiceTest': Unsatisfied dependency expressed through field 'containerService'

You already know the trick, you need to load the ContainerService bean in the Spring context!

@DataJpaTest
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
@Import(ContainerService::class, EmbeddedDataSourceConfig::class)
@ActiveProfiles("test")
class ContainerServiceTest {
    // test code
}

Uh-oh! The Spring context loads successfully, but the test fails with the following error:

java.lang.AssertionError:
Expected size:<1> but was:<0> in:
<[]>

If you look at @DataJpaTest, you will notice that it is annotated with @Transactional. By default, each test method then runs inside a test-managed transaction that is rolled back when the method finishes. Deleting the data from the target tables and saving the new shipment happen inside that transaction and are never committed, so these changes are not visible to the transactions created by the worker threads.

Since we want the setup to be committed independently of the test-managed transaction, we run it in its own transaction using Propagation.REQUIRES_NEW:

@DataJpaTest
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
@Import(ContainerService::class, EmbeddedDataSourceConfig::class)
@ActiveProfiles("test")
class ContainerServiceTest {
    @Autowired
    private lateinit var transactionTemplate: TransactionTemplate

    @Autowired
    private lateinit var svc: ContainerService

    @Autowired
    private lateinit var shipmentRepo: ShipmentRepository

    @Autowired
    private lateinit var containerRepo: ContainerRepository

    @BeforeEach
    fun setup() {
        transactionTemplate.propagationBehavior = TransactionTemplate.PROPAGATION_REQUIRES_NEW
        transactionTemplate.execute {
            shipmentRepo.deleteAll()
            containerRepo.deleteAll()
            shipmentRepo.save(shipment)
        }
    }

}

🎉 The test passes, completing in just 8 seconds (load context + run) - twice as fast as before!

Test Slicing: @JsonTest Precision in Validating JSON Serialization/Deserialization 💡

Consider this test snippet:

public class EventDeserializationIT extends BaseIT {

    private static final String RESOURCE_PATH = "event-example.json";

    @Autowired
    private ObjectMapper objectMapper;

    private Event dto;

    @Test
    public void testDeserialization() throws Exception {
        String json = Resources.toString(Resources.getResource(RESOURCE_PATH), UTF_8);
        dto = objectMapper.reader().forType(Event.class).readValue(json);

        assertThat(dto.getData().getNewTour().getFromLocation()).isNotNull();
        assertThat(dto.getData().getNewTour().getToLocation()).isNotNull();
    }
}

The objective of this test is to ensure proper deserialization. We can use @JsonTest annotation to import the beans that we need in the test. We only need object mapper, no need to extend any other classes! Using this annotation will only apply the configuration relevant to JSON tests (i.e. @JsonComponent, Jackson Module).

@JsonTest
public class EventDeserializationTest {

    @Autowired
    private ObjectMapper objectMapper;

    // Test implementation
}

Test Slicing: @WebMvcTest for REST APIs 🌐

Using @WebMvcTest, we can test REST APIs without firing up the server (e.g., the embedded Tomcat), or loading the whole application context. It’s all about targeting specific controllers. Fast and efficient, just like that!

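// Static imports used below:
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.BDDMockito.given;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

@WebMvcTest(ShipmentServiceController.class)
public class ShipmentServiceControllerTests {

    @Autowired
    private MockMvc mvc;

    @MockBean
    private ShipmentService service;

    @Test
    public void getShipmentShouldReturnShipmentDetails() throws Exception {
        given(this.service.schedule(any())).willReturn(new LocalDate());
        // Expectations are chained on the result of perform(), not on the request builder.
        this.mvc.perform(
                get("/shipments/12345")
                        .accept(MediaType.APPLICATION_JSON))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.number").value("12345"));
                // ...
    }
}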

Taming Mock/Spy Beans and Context Caching Dilemmas 🔍

Let's delve into the intricacies of the Spring Test context caching mechanism!

When your tests involve Spring Test features (e.g., @SpringBootTest, @WebMvcTest, @DataJpaTest), they require a running Spring Context. Starting a Spring Context for your test requires a considerable amount of time, especially if the entire context is populated using @SpringBootTest, resulting in increased test execution overhead and longer build times if each test starts its own context.

Fortunately, Spring Test provides a mechanism to cache a started application context and reuse it for subsequent tests with similar context requirements.

The cache is like a map, with a certain capacity. The map key is computed from a few parameters, including the beans loaded into the context.

The cache key consists of:

locations (from @ContextConfiguration)
classes (from @ContextConfiguration)
contextInitializerClasses (from @ContextConfiguration)
contextCustomizers (from ContextCustomizerFactory) – this includes @DynamicPropertySource methods as well as various features from Spring Boot’s testing support such as @MockBean and @SpyBean.
contextLoader (from @ContextConfiguration)
parent (from @ContextHierarchy)
activeProfiles (from @ActiveProfiles)
propertySourceLocations (from @TestPropertySource)
propertySourceProperties (from @TestPropertySource)
resourceBasePath (from @WebAppConfiguration)

For example, if TestClassA specifies {"app-config.xml", "test-config.xml"} for the locations (or value) attribute of @ContextConfiguration, the TestContext framework loads the corresponding ApplicationContext and stores it in a static context cache under a key that is based solely on those locations. So, if TestClassB also defines {"app-config.xml", "test-config.xml"} for its locations (either explicitly or implicitly through inheritance) and does not define different attributes for any of the other attributes listed above, then the same ApplicationContext is shared by both test classes. This means that the setup cost for loading an application context is incurred only once (per test suite), and subsequent test execution is much faster.

If you use different attributes in different tests, for example a different @ContextConfiguration, @TestPropertySource, @MockBean or @SpyBean, the cache key changes. And for each new context (one that does not exist in the cache), the context must be loaded from scratch.

The cache is also bounded (32 contexts by default), and when it is full the least recently used context is evicted. So if there are many different contexts, old entries are removed from the cache, and later tests that could have reused them need to load them again, which adds to the overall test time.
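
To make the effect of the cache key more tangible, here is a toy model of such a cache. It is only a sketch and not Spring's implementation: the class `ToyContextCache` is hypothetical, the key is reduced to the active profiles plus the set of mocked bean types, and a "context" is just a placeholder object. It shows why three test classes with three different mock sets cause three context starts, and how a small cache starts evicting entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of a bounded context cache; not Spring's actual implementation.
class ToyContextCache {
    private final LinkedHashMap<String, Object> contexts;
    private int loads = 0;

    ToyContextCache(int maxSize) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.contexts = new LinkedHashMap<String, Object>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > maxSize; // evict the least recently used entry
            }
        };
    }

    // Simplified key: only active profiles and mocked bean types take part in it.
    Object contextFor(Set<String> activeProfiles, Set<String> mockedBeans) {
        String key = new TreeSet<>(activeProfiles) + "|" + new TreeSet<>(mockedBeans);
        Object cached = contexts.get(key);
        if (cached != null) {
            return cached; // cache hit: no startup cost
        }
        loads++; // cache miss: an expensive context start would happen here
        Object fresh = new Object();
        contexts.put(key, fresh);
        return fresh;
    }

    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        ToyContextCache cache = new ToyContextCache(2);
        cache.contextFor(Set.of("test"), Set.of("DependencyA"));
        cache.contextFor(Set.of("test"), Set.of("DependencyB"));
        cache.contextFor(Set.of("test"), Set.of("DependencyC")); // pushes out the A context
        cache.contextFor(Set.of("test"), Set.of("DependencyA")); // has to be loaded again
        System.out.println("context loads: " + cache.loads());
    }
}
```

If all callers passed the same set of mocked beans instead, every call after the first would be a cache hit, which is the idea behind consolidating mocks in one place.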

One efficiency optimization method is consolidating mock beans in a parent class. This ensures that the context remains unchanged, enhancing efficiency and avoiding context reloading multiple times.

Example before and after:

@SpringBootTest
public class TestClass1 {
    @MockBean
    private DependencyA dependencyA;
    // Test implementation
}

@SpringBootTest
public class TestClass2 {
    @MockBean
    private DependencyB dependencyB;
    // Test implementation
}

@SpringBootTest
public class TestClass3 {
    @MockBean
    private DependencyC dependencyC;
    // Test implementation
}

If we run the above example, the context is loaded three times, once per test class, because each class declares a different set of mock beans. That is not efficient at all. Let's try to optimize it.

@SpringBootTest
public abstract class BaseTestClass {

    @MockBean
    private DependencyA dependencyA;

    @MockBean
    private DependencyB dependencyB;

    @MockBean
    private DependencyC dependencyC;
}

// Extend the BaseTestClass for each test class
public class TestClass1 extends BaseTestClass {

    @Test
    public void testSomething1() {
        // Test implementation
    }
}

public class TestClass2 extends BaseTestClass {

    @Test
    public void testSomething2() {
        // Test implementation
    }
}

public class TestClass3 extends BaseTestClass {

    @Test
    public void testSomething3() {
        // Test implementation
    }
}

Now all three test classes share the same cache key, so the context is loaded only once, which is more efficient!

Or even better: you can avoid class inheritance by using the @Import annotation to import a configuration class that contains the mock beans.

@TestConfiguration
class Config {
    @MockBean
    private DependencyA dependencyA;

    @MockBean
    private DependencyB dependencyB;

    @MockBean
    private DependencyC dependencyC;
}

@SpringBootTest // still needed so that a Spring context is started for the test
@Import(Config.class)
@ActiveProfiles("test")
class TestClass1 {
    // Test code
}

Think twice before using @DirtiesContext ❗

Applying @DirtiesContext to a test class removes the application context after tests are executed. This marks the Spring context as dirty, preventing Spring Test from reusing it. It's important to carefully consider using this annotation.

Although some use it to reset IDs in the database, better alternatives exist. For instance, the @Transactional annotation can be used to roll back the transaction after the test is executed.

Parallel Execution of Tests 🏎️

By default, JUnit Jupiter tests run sequentially in a single thread. However, enabling tests to run in parallel, for faster execution, is an opt-in feature introduced in JUnit 5.3. 🚀

To initiate parallel test execution, follow these steps:

Create a junit-platform.properties file in src/test/resources.
Add the following configuration to the file: junit.jupiter.execution.parallel.enabled = true
Annotate every class you want to run in parallel with @Execution(CONCURRENT).

Keep in mind that certain tests might not be compatible with parallel execution due to their nature. For such cases, you should not add @Execution(CONCURRENT). See JUnit: writing tests – parallel execution for more explanation on the different execution modes.

Results 📊

Applying all the optimizations mentioned above made a big difference in our CI/CD pipeline. Our tests are much faster, taking only 4 minutes and 15 seconds now, compared to 10 minutes and 7 seconds before: a reduction of roughly 58%! 🌟

Conclusion 🎬

In this adventure of optimizing Spring Boot tests, we've harnessed a collection of strategies to bolster test efficiency and speed. Let's summarize the tactics we've implemented:

Test Slicing: Leveraging @WebMvcTest, @DataJpaTest, and @JsonTest to focus tests on specific layers or components. You can check more about (Testing Spring Boot Applications).
Context Caching Dilemmas: Overcoming challenges related to dirty ApplicationContext caches by optimizing the use of mock and spy beans. See Spring Test Context Caching.
Parallel Test Execution: Enabling parallel test execution to significantly reduce test suite execution time. See JUnit 5 User Guide on Parallel Execution.

These strategies collectively transform testing into a faster, more reliable, and efficient process. Each tactic, used alone or combined, contributes significantly to optimized testing practices, empowering engineers to deliver higher-quality software with enhanced efficiency.

Patching the PostgreSQL JDBC Driver

2023-11-09T00:00:00+01:00

Introduction

This blog post describes a recent contribution from Zalando to the Postgres JDBC driver to address a long-standing issue with the driver’s integration with Postgres’ logical replication that resulted in runaway Write-Ahead Log (WAL) growth. We will describe the issue, how it affected us at Zalando, and detail the fix made upstream in the JDBC driver that resolves the issue for Debezium and all other clients of the Postgres JDBC driver.

Postgres Logical Replication at Zalando

Builders at Zalando have access to a low-code solution that allows them to declare event streams that source from Postgres databases. Each event stream declaration provisions a micro application, powered by Debezium Engine, that uses Postgres Logical Replication to publish table-level change events as they occur. Capable of publishing events to a variety of different technologies, with arbitrary event transformations via AWS Lambda, these event streams form a core part of the Zalando infrastructure offering. At the time of writing, there are hundreds of these Postgres-sourced event streams out in the wild at Zalando.

One common problem that occurs with Logical Replication is excessive growth of the Postgres Write-Ahead Log (WAL). At times, the WAL could grow to the point where it consumed all of the available disk space on the database node, resulting in demotion of the node to read-only - an undesirable outcome in a production setting indeed! This issue is prevalent in cases where a table being streamed receives very little to no write traffic - but once a write is made, any excessive WAL growth disappears instantly. In recent years, as the popularity of Postgres-sourced event streams has grown in Zalando, we have seen this issue occur more and more often.

So what is happening at a low level during this event-streaming process? How does Postgres reliably ensure that all data change events are emitted and captured by an interested client? The answers to these questions were crucial to understanding the problem and finding its solution.

To explain the issue and how we solved it, we first must explain a little bit about the internals of Postgres replication. In Postgres, the Write Ahead Log (WAL) is a strictly ordered sequence of events that have occurred in the database. These WAL events are the source of truth for the database, and streaming and replaying WAL events is how both Physical and Logical Replication work. Physical replication is used for database replication. Logical Replication, which is the subject of this blog, allows clients to subscribe to data change WAL events. In both cases, replication clients track their progress through the WAL by checkpointing their location, known as the Log Sequence Number (LSN), directly on the primary database. WAL events stored on the primary database can only be discarded after all replication clients, both physical and logical, confirm that they have been processed. If one client fails to confirm that it has processed a WAL event, then the primary node will retain that WAL event and all subsequent WAL events until confirmation occurs.

Simple, right?

Well, the happy path is quite simple, yes. However as you may imagine, this blog post concerns a path that is anything but happy.

The Problem

Before we go on, allow me to paint a simplified picture of our architecture which was experiencing issues with this process:

A Postgres database with logical replication set up on two of its three tables

We have a database with multiple tables, denoted here by their different colors: blue (1), pink (2), purple (3), etc. Additionally, we are listening to changes made to the blue and pink tables specifically. The changes are being streamed via Logical Replication to a blue client and a pink client respectively. In our case, these clients are our Postgres-sourced event streaming applications which use Debezium and PgJDBC under the hood to bridge the gap between Postgres byte-array messages and Java by providing a user-friendly API to interact with.

The key thing to note here is that changes from all tables go into the same WAL. The WAL exists at the server level and we cannot break it down into a table-level or schema-level concept. All changes for all tables in all schemas in all databases on that server go into the same WAL.

In order to track the individual progress of the blue and pink replication, the database server uses a construct called a replication slot. A replication slot should be created for each client - so in this case we have blue (upper, denoted 1) and pink (lower, denoted 2) replication slots - and each slot will contain information about the progress of its client through the WAL. It does this by storing the LSN of the last flushed WAL, among some other pieces of information but let’s keep it simple.

If we zoom into the WAL, we could illustrate it simplistically as follows:

Each client has a replication slot, tracking its progress through the WAL.

Here, I have illustrated LSNs as decimal numbers for clarity. In reality, they are expressed as hexadecimal combinations of page numbers and positions.
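
For reference, in its textual form an LSN is written as two hexadecimal numbers separated by a slash, which are the high and low 32 bits of a 64-bit position in the WAL. The sketch below shows the conversion in both directions. The class `LsnText` is a hypothetical helper written for this illustration; PgJDBC ships its own `LogSequenceNumber` class for real use.

```java
// Hypothetical helper for illustration; PgJDBC provides LogSequenceNumber for real use.
class LsnText {
    // Parses "HIGH/LOW" (two hexadecimal numbers) into a single 64-bit position.
    static long parse(String text) {
        int slash = text.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("not an LSN: " + text);
        }
        long high = Long.parseLong(text.substring(0, slash), 16);
        long low = Long.parseLong(text.substring(slash + 1), 16);
        return (high << 32) | low;
    }

    // Formats a 64-bit position back into the "HIGH/LOW" hexadecimal form.
    static String format(long lsn) {
        return String.format("%X/%X", lsn >>> 32, lsn & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long position = parse("16/B374D848"); // sample value, chosen arbitrarily
        System.out.println(position + " <-> " + format(position));
    }
}
```

Because the result is a plain number, comparing two LSNs to see which is "later" is an ordinary numeric comparison, which is what makes the decimal simplification in the illustrations reasonable.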

As write operations occur on any of the tables in the database, those write operations are written to the WAL - the next available log position being #7. If a write occurs on e.g. the blue table, a message will be sent to the blue client with this information and once the client confirms receipt of change #7, the blue replication slot will be advanced to #7. However WAL with LSN #7 can’t be recycled and its disk space freed up just yet, since the pink replication slot is still only on #6.

As changes occur in the blue table, the blue client's replication slot advances, but the pink slot has no reason to move

If the blue table were to continue receiving writes, but without a write operation occurring on the pink table, the pink replication slot would never have a chance to advance, and all of the blue WAL events would be left sitting around, taking up space.

This will continue with WAL growing dangerously large, risking using all of the disk space of the entire server

However once a write occurs in the pink table, this change will be written to the next available WAL position, say #14, the pink client will confirm receipt and the pink replication slot will advance to position #14. Now we have the below state:

As soon as a write occurs in the pink table, the pink replication slot will advance and the WAL events can be deleted up to position #13, as they are no longer needed by any slot

This was the heart of the issue. The pink client is not interested in these WAL events, but until it confirms a later LSN in its replication slot, Postgres cannot delete them. This will continue ad infinitum until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table.
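
The retention rule behind this behaviour can be captured in a few lines: the WAL can only be recycled up to the smallest position confirmed by any slot. The following toy model uses the simplified decimal positions from the illustrations; the class and method names are hypothetical, and real Postgres works on WAL segments rather than individual entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of WAL retention; names and numbers are illustrative only.
class WalRetentionModel {
    // WAL up to the smallest confirmed position is no longer needed by any slot.
    static long oldestRequiredPosition(Map<String, Long> confirmedBySlot) {
        return confirmedBySlot.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("no replication slots"));
    }

    // Number of WAL entries that must stay on disk, given the current write position.
    static long retainedEntries(long currentPosition, Map<String, Long> confirmedBySlot) {
        return currentPosition - oldestRequiredPosition(confirmedBySlot);
    }

    public static void main(String[] args) {
        Map<String, Long> slots = new LinkedHashMap<>();
        slots.put("blue", 13L); // keeps advancing with every write to the blue table
        slots.put("pink", 6L);  // idle, because nothing is written to the pink table
        System.out.println("retained while pink is idle: " + retainedEntries(13, slots));

        slots.put("pink", 14L); // a single write to the pink table is confirmed
        System.out.println("recyclable up to: " + oldestRequiredPosition(slots));
    }
}
```

In this model a single idle slot is enough to pin the whole log, no matter how diligently every other client confirms its progress.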

Mitigation Strategies

Many blog posts have been written about this bug, phenomenon, behavior, call it what you will. Hacky solutions abound. The most popular by far was creating scheduled jobs writing dummy data to the pink table in order to force it to advance. This solution had been used in Zalando in the past but it’s a kludge that doesn’t address the real issue at the heart of the problem and mandates a constant extra workload overhead from now and forever more when setting up Postgres logical replication.

Even Gunnar Morling, the ex-Debezium Lead, has written about the topic.

Byron Wolfman, in a blog post, alludes to the pure solution before abandoning the prospect in favour of the same kludge. The following quote is an extract from his post on the topic:

Excerpt from a blog post which details both the pure solution of advancing the cursor as well as the “fake writes” hack

This was indeed the solution in its purest form. In our case with a Java application as the end-consumer, the first port-of-call for messages from Postgres was PgJDBC, the Java Driver for Postgres. If we could solve the issue at this level, then it would be abstracted away from - and solved for - all Java applications, Debezium included.

Our Solution

The key was to note that while Postgres only sends Replication messages in case of a write operation, it is sending KeepAlive messages on a regular basis in order to maintain the connection between it and, in this case, PgJDBC. This KeepAlive message contains very little data: some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server. Historically, PgJDBC would not respond to a KeepAlive message and nothing would change on the server-side as a result of a KeepAlive message being sent. This needed to change.

The original flow of messages between the database server and the PgJDBC driver. Only replication messages received confirmations from the driver.

The fix involved updating the client to keep track of the LSN of the last Replication message received from the server and the LSN of the latest message confirmed by the client. If these two LSNs are the same, and the client then receives a KeepAlive message with a higher LSN, the client can infer that it has flushed all relevant changes and that some irrelevant changes are happening on the database that the client doesn't care about. The client can safely confirm receipt of this change back to the server, thus advancing its replication slot position and allowing the Postgres server to delete those irrelevant WAL events. This approach is sufficiently conservative to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped.

The updated flow of messages now includes confirmation responses for each KeepAlive message as well, allowing all replicas to constantly confirm receipt of WAL changes
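
The decision rule described above can be sketched as a small state machine. This is a simplified model written for this post, not the code that was merged into PgJDBC: the class `KeepAliveTracker` and its method names are hypothetical, and LSNs are plain numbers.

```java
// Simplified model of the rule described above; not the actual PgJDBC code.
class KeepAliveTracker {
    private long lastReceived; // LSN of the last replication message received
    private long lastFlushed;  // LSN most recently confirmed back to the server

    // A replication message (a relevant change) arrived from the server.
    void onReplicationMessage(long lsn) {
        lastReceived = lsn;
    }

    // The application has fully processed everything up to and including lsn.
    void onFlushed(long lsn) {
        lastFlushed = Math.max(lastFlushed, lsn);
    }

    // Returns the LSN to confirm to the server in response to a KeepAlive.
    long onKeepAlive(long serverLsn) {
        boolean caughtUp = lastReceived == lastFlushed;
        if (caughtUp && serverLsn > lastFlushed) {
            // Nothing relevant is pending, so the gap up to serverLsn can only
            // consist of changes this client was never going to receive.
            lastReceived = serverLsn;
            lastFlushed = serverLsn;
        }
        return lastFlushed;
    }
}
```

The conservative part is the `caughtUp` check: as long as a received change has not been flushed yet, a KeepAlive never moves the confirmed position forward, so a relevant event cannot be skipped in this model.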

The fix was implemented, tested, and submitted to PgJDBC in a pull request. Merged on August 31st 2023, this fix is scheduled to be released in the 42.7.0 version of PgJDBC.

Rollout

Our Debezium-powered streaming applications support backwards compatibility with functionality that has been removed from newer versions of Debezium. In order to maintain this backwards compatibility, our applications do not use the latest version of Debezium and, by extension, do not use the latest version of PgJDBC which is pulled in as a transitive dependency by Debezium. In order to take advantage of the fix while still maintaining this backwards compatibility, we modified our build scripts to optionally override the latest version of the transitive PgJDBC dependency and we took advantage of this option to build not one, but two Docker images for our applications: one unchanged and another with a locally built version, 42.6.1-patched, of PgJDBC that contained our fix. We rolled this modified Docker image out to our test environment while still using the unchanged image in our production environment. This way we could safely verify that our event-streaming applications continued to behave as intended and monitor the behaviour in order to verify the issue of WAL growth had been addressed.

To verify the issue had indeed disappeared, we monitored a graph of the total WAL Size over the course of a few days on a low-activity database. Before the implementation of the fix, it would be common to see the following graph of total WAL size, indicating the presence of the issue over 36 hours:

Runaway WAL growth before the fix

That same database after the fix now has a WAL Size graph that looks like the below, over the same time range and with no other changes to the persistence layer, service layer or activity:

WAL growth (or lack thereof!) after the fix

As the fix itself was designed to be sufficiently conservative when confirming LSNs so that we could guarantee that an event would never be skipped or missed, this evidence was sufficient for us to confidently roll out the newer Docker images to our production clusters, solving the issue of runaway WAL growth for hundreds of Postgres-sourced event streams across Zalando. No more hacks required :)

Understanding GraphQL Directives: Practical Use-Cases at Zalando

2023-10-19T00:00:00+02:00

GraphQL directives

In GraphQL, if you've used the syntax that starts with an @, for example, @foo, then you've used GraphQL directives. Directives provide a way to extend the language features of GraphQL using a supported syntax. Certain directives are built into GraphQL, like @skip, @include, @deprecated, and @specifiedBy, and are supported by all GraphQL engines.

If we look closer, we can see that two of these directives (@skip and @include) are used only in the queries, and the other two (@deprecated and @specifiedBy) are used only in the schema. This is because GraphQL directives are defined for two different categories of locations - TypeSystem and ExecutableDefinition. The TypeSystem directives are defined for the schema, and the ExecutableDefinition directives are defined for the queries. We will discuss this in detail in the next section.

The query directives are generally useful for clients to express certain types of metadata for the query. The schema directives are generally useful for declaratively specifying common server-side behaviors, for example, authorization requirements, marking sensitive data, etc.

Part 1: Schema directives at Zalando

Schema directives are the directives defined for the TypeSystem locations. They are available for the locations listed below. Consider @foo a directive for the location mentioned in the first column.

directive @foo on LOCATION_IN_FIRST_COLUMN

Directive Location	Example
SCHEMA	`schema @foo { query: Query }`
SCALAR	`scalar x @foo`
OBJECT	`type Product @foo { }`
FIELD_DEFINITION	`type X { field: String @foo }`
ARGUMENT_DEFINITION	`type X { field(arg: Int @foo): String }`
INTERFACE	`interface X @foo {}`
UNION	`union X @foo = A | B`
ENUM	`enum X @foo { A B }`
ENUM_VALUE	`enum X { A @foo B }`
INPUT_OBJECT	`input X @foo { }`
INPUT_FIELD_DEFINITION	`input X { field: String @foo }`

The Guild (https://the-guild.dev) has a great article on implementing schema directives, and a mechanism for doing so via its graphql-tools packages. I highly recommend reading it and using graphql-tools for implementing schema directives.

The gist is that you can define a directive in the schema and implement the directive in the resolver layer. The directive is implemented as a function that takes the resolver function as an argument and returns a new resolver function. The new resolver function can be used to implement the directive logic.

You can think of schema directives as function calls injected into your resolver function in a declarative way. Consider the following illustration to understand where the directive function can be invoked in the context of a resolver.

/**
 * Illustration of schema directives execution in
 * the query execution pipeline
 */
const resolvers = {
  Query: {
    async product(_, { id }) {
      // schema directives
      schemaDirectivesExecutions();

      // resolver logic
      const product = await getProduct(id);

      // schema directives
      schemaDirectivesExecutions();

      return product;
    },
  },
};
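The wrapping idea can also be sketched in plain JavaScript, without any GraphQL library. The names withDirective, before, and after are hypothetical and only illustrate the higher-order-function shape; graphql-tools has its own mechanism for attaching such wrappers to a schema.

```javascript
// Hypothetical helper (not a graphql-tools API): wraps a resolver with
// directive logic that runs before and after the original resolver.
function withDirective(
  resolve,
  { before = () => {}, after = (result) => result } = {}
) {
  return async function wrappedResolver(parent, args, context, info) {
    // directive logic before the resolver, e.g. an access check
    await before(parent, args, context, info);
    const result = await resolve(parent, args, context, info);
    // directive logic after the resolver, e.g. transforming the result
    return after(result, parent, args, context, info);
  };
}

// Usage with a stand-in resolver; the data is made up for illustration.
const calls = [];
const product = withDirective(async (_, { id }) => ({ id, name: "Sneaker" }), {
  before: () => calls.push("before"),
  after: (result) => {
    calls.push("after");
    return result;
  },
});
```

The wrapped function has the same signature as the original resolver, so several directives can be stacked by wrapping repeatedly.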

`@isAuthenticated`

At Zalando, we use SSO for customer authentication and step-up authentication. Our GraphQL server handles publicly available data like the product data, and also handles confidential data like customer-related data.

The queries can contain customer fields along with product fields and other non-customer data. Here, we need to ensure that the customer is authenticated and has the required authentication level (ACR value) whenever a field or mutation containing customer information is used in the query. So, we need a way to control this granularly for different data points in the schema. The directive @isAuthenticated is used for this purpose.

The directive is defined in the schema as follows -

scalar ACRValue @specifiedBy(url: "https://example.com/zalando-acr-value")

directive @isAuthenticated(
  """
  The ACR value, which indicates the level of authenticity
  expected to perform the operation.

  Optional. If not provided, the default behavior is to simply
  validate a user is authenticated and has no ACR requirements.
  """
  acrValue: ACRValue
) on FIELD_DEFINITION

For example, it is used in query and mutation definitions as follows -

type Query {
  customer: Customer @isAuthenticated
}
type Mutation {
  updateCustomerInfo(
    email: String
    phoneNumber: String
  ): UpdateCustomerInfoResult @isAuthenticated(acrValue: HIGH)
}
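The post does not show the resolver-side implementation, so the following is only a sketch of what such a check could look like. The ordering of ACR levels, the context.customer shape, and the error messages are all assumptions made for illustration.

```javascript
// Hypothetical ordering of ACR levels; the post only shows the value HIGH.
const ACR_ORDER = ["LOW", "MEDIUM", "HIGH"];

// Throws unless the request context carries an authenticated customer
// whose ACR level meets the required one.
function assertAuthenticated(context, requiredAcr) {
  const customer = context.customer; // assumed to be set by the HTTP layer
  if (!customer) {
    throw new Error("UNAUTHENTICATED");
  }
  if (requiredAcr === undefined) {
    return; // no ACR requirement: being authenticated is enough
  }
  const required = ACR_ORDER.indexOf(requiredAcr);
  if (required === -1) {
    throw new Error(`Unknown ACR value: ${requiredAcr}`);
  }
  // An unknown customer ACR yields -1 and is therefore always rejected.
  if (ACR_ORDER.indexOf(customer.acr) < required) {
    throw new Error("INSUFFICIENT_ACR");
  }
}

// Wraps a field resolver so the check runs before the resolver logic.
function isAuthenticated(resolve, requiredAcr) {
  return (parent, args, context, info) => {
    assertAuthenticated(context, requiredAcr);
    return resolve(parent, args, context, info);
  };
}
```

Keeping the check in a wrapper means the field resolvers themselves stay free of authentication logic, which matches the declarative intent of the directive.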

`@sensitive`

We expose customer-sensitive data via our GraphQL API - like the email address, customer name, phone number, address, etc, to render the customer profile page. We also use observability tools and monitoring tools like logging and tracing. We do not want such sensitive customer data in the logs and traces. So, we need a way to control logging so that the logs contain enough information to debug issues but not sensitive customer data. The directive @sensitive is used for this purpose.

directive @sensitive(
  "An optional reason why the field is marked as sensitive"
  reason: String
) on ARGUMENT_DEFINITION

For example, it is used in a mutation definition as follows -

type Mutation {
  updateCustomerInfo(
    email: String @sensitive(reason: "Customer email address")
    phoneNumber: String @sensitive(reason: "Customer phone number")
  ): UpdateCustomerInfoResult
}

Proactively adding @sensitive to the correct arguments in the schema is a manual step that is easy to forget. So, we also rely on a schema linter that automatically fails when a field/argument name contains sensitive keywords like password, email, phone, bank, bic, account, owner, order, token, voucher, customer, etc. This way, we can ensure we do not forget to add @sensitive to the correct fields/arguments.
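A lint rule of this kind can be sketched as follows. The input shape is hypothetical (a real linter would read argument names and directives from the schema AST), and only a subset of the keywords is used.

```javascript
// A subset of the keywords listed above.
const SENSITIVE_KEYWORDS = ["password", "email", "phone", "bank", "token"];

// Hypothetical input shape: [{ name, directives: ["sensitive", ...] }].
// Returns the names that look sensitive but are not marked @sensitive.
function findUnmarkedSensitiveArguments(args) {
  return args
    .filter(({ name }) =>
      SENSITIVE_KEYWORDS.some((keyword) => name.toLowerCase().includes(keyword))
    )
    .filter(({ directives }) => !directives.includes("sensitive"))
    .map(({ name }) => name);
}

const violations = findUnmarkedSensitiveArguments([
  { name: "email", directives: ["sensitive"] },
  { name: "phoneNumber", directives: [] },
  { name: "id", directives: [] },
]);
// violations -> ["phoneNumber"]
```

A build step could then fail whenever the returned list is non-empty.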

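Implementing this directive is also quite simple and does not require any resolver logic. It can be implemented in Node.js as follows (the implementation is shortened to fit into a post) -

const { validate } = require("graphql");

// Collects the names of the query variables that are passed to
// arguments marked with @sensitive in the schema.
function getSensitiveVariables(schema, document) {
  const sensitiveVariables = [];
  // Run validation with a single custom rule that visits every Variable node.
  validate(schema, document, [
    (context) => ({
      Variable(node) {
        // getArgument() returns the schema definition of the argument
        // that is currently being visited, if there is one.
        const isSensitive = context
          .getArgument()
          ?.astNode?.directives?.some(
            (directive) => directive.name.value === "sensitive"
          );
        if (isSensitive) {
          sensitiveVariables.push(node.name.value);
        }
      },
    }),
  ]);
  return sensitiveVariables;
}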
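The post does not show how the resulting list is consumed. One plausible use, sketched here with a hypothetical redactVariables helper and placeholder string, is to mask the values before the request variables reach logs or traces.

```javascript
// Hypothetical helper: returns a copy of the request variables that is
// safe to log. The original object is left untouched.
function redactVariables(variables, sensitiveVariables) {
  const sensitive = new Set(sensitiveVariables);
  return Object.fromEntries(
    Object.entries(variables).map(([name, value]) => [
      name,
      sensitive.has(name) ? "[REDACTED]" : value,
    ])
  );
}

const loggable = redactVariables(
  { id: "1234", email: "jane@example.com" },
  ["email"]
);
// loggable -> { id: "1234", email: "[REDACTED]" }
```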

`@requireExplicitEndpoint`

With GraphQL, all varieties of HTTP requests fit into one single pattern - POST /graphql. As a result, techniques and tools available for REST APIs, such as rate limiting, bot protection, caching, and other security practices, fail to work out of the box. So, we need a way to expose different schema sections via different HTTP endpoints. The directive @requireExplicitEndpoint is used for this purpose.

directive @requireExplicitEndpoint(endpoints: [String!]!) on FIELD_DEFINITION

To implement this directive, we override the resolver for the field where it is used. Because we run GraphQL over HTTP, the resolver can access request parameters such as the pathname. We then match the pathname against the list of endpoints provided in the directive and return an error if there is no match.

This directive allows us to define custom routes for different schema sections and prevents the client from accessing the entire schema via a single HTTP endpoint, POST /graphql. For example, let's see how we can define this directive for the updateDeliveryAddress mutation.

type Mutation {
  updateDeliveryAddress(
    id: ID!
    newAddress: CustomerAddress!
  ): UpdateDeliveryAddressResult
    @requireExplicitEndpoint(endpoints: ["/customer-addresses"])
}

So, a mutation query like the following will fail with an error when executed via the /graphql endpoint -

# POST /graphql
mutation {
  updateDeliveryAddress(id: "1234", newAddress: { name: "Boopathi" }) {
    id
  }
}
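The resolver override can be sketched as a wrapper like the one below. It assumes the HTTP layer places the request pathname on the resolver context as context.pathname; that property name and the error message are hypothetical.

```javascript
// Hypothetical wrapper: rejects the field unless the request arrived
// on one of the endpoints listed in the directive.
function requireExplicitEndpoint(resolve, endpoints) {
  return (parent, args, context, info) => {
    if (!endpoints.includes(context.pathname)) {
      throw new Error(`This field is not available via ${context.pathname}`);
    }
    return resolve(parent, args, context, info);
  };
}

// Stand-in resolver for the mutation shown above.
const updateDeliveryAddress = requireExplicitEndpoint(
  (_, { id }) => ({ id }),
  ["/customer-addresses"]
);
```

With this wrapper, the same mutation succeeds when sent to /customer-addresses and throws when sent to /graphql.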

`@draft`, `@allowedFor`

We use persisted queries and define different schema stability levels for different sections of the schema. We have a separate blog post explaining the details of how Zalando uses persisted queries and how we think about schema stability and granular control.

The @draft and @allowedFor directives are used for this purpose. They prevent clients from persisting a query that uses parts of the schema that are not stable yet, or that are restricted to specific components.

# Draft
directive @draft on FIELD_DEFINITION

# Restricted usage: Only for the specified components
directive @component(name: String!) on QUERY
directive @allowedFor(componentNames: [String!]!) on FIELD_DEFINITION

`@final`

Enums in GraphQL are tricky to evolve. Adding a new value to an enum is not considered a breaking change, but it is still a "dangerous" change. It is "dangerous" because the client might not have a handler for the new value. It is easy to update the client code for web applications, but for native mobile apps already shipped to the app store, it is impossible to update the client code. Though we practice defensive coding to handle unknown values, we still need a way to control the evolution of enums in a safe manner. The directive @final is used for this purpose.

directive @final on ENUM

This directive has no implementation at all, i.e., it does not need any runtime behavior. It is only used in our GraphQL linter, which runs at build time and prevents the addition of new values to enums that are marked as final. When we want to make a dangerous change, we remove the @final directive in a first pull request and work out whether old apps would break because of this "dangerous" change. After extending the enum, we add the @final directive back in a separate pull request. This process is cumbersome, but it is on purpose. It must be more complicated to make dangerous changes, and it is a trade-off we are willing to make.
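A lint rule like this can be sketched as a comparison between the previous and the proposed schema. The input shape is hypothetical, and the sketch assumes that the rule looks at whether the enum was final in the previous schema, which is what forces the removal of @final into its own pull request.

```javascript
// Hypothetical input shape: { EnumName: { values: [...], isFinal: boolean } }.
// Returns "Enum.VALUE" entries for values added to enums that were final
// in the previous schema.
function findFinalEnumAdditions(oldEnums, newEnums) {
  const violations = [];
  for (const [name, oldEnum] of Object.entries(oldEnums)) {
    const newEnum = newEnums[name];
    if (!oldEnum.isFinal || !newEnum) continue;
    for (const value of newEnum.values) {
      if (!oldEnum.values.includes(value)) {
        violations.push(`${name}.${value}`);
      }
    }
  }
  return violations;
}

const additions = findFinalEnumAdditions(
  {
    Size: { values: ["S", "M"], isFinal: true },
    Color: { values: ["RED"], isFinal: false },
  },
  {
    Size: { values: ["S", "M", "L"], isFinal: true },
    Color: { values: ["RED", "BLUE"], isFinal: false },
  }
);
// additions -> ["Size.L"]
```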

The ideal situation would be that all enums are treated as final by default, and this directive is never required in the first place. During schema evolution, your use case might warrant such directives to control a smooth schema evolution.

`@extensibleEnum`

While we are discussing enums, there is another use case for directives: enums that start out as one-off use cases, where extending them later is the common case. Creating a native enum is tricky in these cases, because extending it has dangerous consequences. At Zalando, we have RESTful API guidelines, and one of the recommendations is to use x-extensible-enum to represent all enums. This recommendation exists so that the enums can evolve, and the client is aware, right from the name, that it is extensible. We use the directive @extensibleEnum for this purpose. The type in GraphQL for the field would be String, and the directive is used to provide the list of allowed values.

directive @extensibleEnum(values: [String!]!) on FIELD_DEFINITION

For example, it is used in a type definition as follows -

type CustomerConsent {
  status: String! @extensibleEnum(values: ["GRANTED", "REJECTED"])
}

With @extensibleEnum, we found that contributors to the schema are more likely to think about the evolution of schema. We also noticed that contributors are more likely to use this directive for defining enums than the GraphQL native enum, as this directive is more explicit about the extensibility of the enum.

`@resolveEntityId`

Our GraphQL schema defines certain types as Entities related to the Entity-Relationship model. We define entities abstractly as the basic building blocks for designing customer experience. For example, product, customer, brand, etc. are some entities. The entity definition has some properties -

it follows a specific template/pattern of resolvers that is mostly the same for all entities
it is of a specific type name as defined in the schema
it has a unique ID of a specific pattern (for example, entity:product:1234 for type Product)
it has a set of fields that are common to all entities

To solve these cases holistically, we use the directive @resolveEntityId defined against each entity definition in the schema.

directive @resolveEntityId(
  "An optional override name for the entity name in its ID"
  override: String
) on OBJECT

The usage is as follows -

type Product implements Entity @resolveEntityId {
  id: ID!
}

The implementation of this directive is two-fold. First, we generate TypeScript code based on the @resolveEntityId directive. This code generation produces the boilerplate code for the entity ID type definitions and resolvers, for example the __typename resolvers. The second part is the runtime, where an id resolver is added to wrap the entity IDs. Consider the product: entity:product:1234 is the full entity ID, and 1234 is called the SKU of the product.
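The wrapping of entity IDs can be illustrated with two small helpers. This is a sketch based only on the entity:product:1234 example; the function names are hypothetical, and deriving the entity name by lowercasing the type name is an assumption.

```javascript
// Builds a full entity ID such as "entity:product:1234".
// `override` mirrors the optional directive argument of the same name.
function toEntityId(typeName, rawId, override) {
  const entityName = override ?? typeName.toLowerCase();
  return `entity:${entityName}:${rawId}`;
}

// Splits a full entity ID back into its parts; returns null if malformed.
function parseEntityId(entityId) {
  const match = /^entity:([^:]+):(.+)$/.exec(entityId);
  return match ? { entityName: match[1], rawId: match[2] } : null;
}

const productId = toEntityId("Product", "1234");
// productId -> "entity:product:1234"
```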

Part 2: Query directives at Zalando

Query directives are the directives defined for the ExecutableDefinition locations. They are available for the locations listed below. Consider @foo a directive for the location mentioned in the first column.

directive @foo on LOCATION_IN_FIRST_COLUMN

Directive Location	Example
QUERY	`query name @foo {}`
MUTATION	`mutation name @foo {}`
SUBSCRIPTION	`subscription name @foo {}`
FIELD	`query { product @foo {} }`
FRAGMENT_DEFINITION	`fragment x on Query @foo { }`
FRAGMENT_SPREAD	`query { ...x @foo }`
INLINE_FRAGMENT	`query { ... @foo { } }`
VARIABLE_DEFINITION	`query ($id: ID @foo) { }`

Unlike for schema directives, graphql-tools does not support attaching functions to resolvers for query directives. Its maintainers also make an excellent point: query directives are good for annotating the query with metadata and not for resolver logic. Likewise, most of our use cases attach metadata at the query level, with one case that adds behavior for observability and monitoring.

For query metadata, the implementation is as simple as going through the parsed GraphQL document (AST - Abstract Syntax Tree) and extracting the metadata from the query directives. For the one use case that adds behavior to a field, the @omitErrorTag directive (discussed below), we use a two-step approach. In the first step, before execution, we extract the field paths of the fields that have this directive. In the second step, after execution, we match the error paths and omit the error tag for those extracted paths.
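The second step of this approach can be sketched as follows. The sketch assumes that the first step has already produced the annotated field paths as dot-separated strings, and it matches paths exactly; the function name and both of these choices are assumptions, not the production implementation.

```javascript
// Decides whether the tracing span should be marked as an error.
// `errors` are GraphQL errors with an optional `path` array;
// `omittedPaths` are the paths of fields annotated with @omitErrorTag.
function shouldTagSpanAsError(errors, omittedPaths) {
  const omitted = new Set(omittedPaths);
  return errors.some((error) => {
    // Error paths may contain list indices; drop them so that
    // ["products", 0, "inWishlist"] matches "products.inWishlist".
    const fieldPath = (error.path ?? [])
      .filter((segment) => typeof segment === "string")
      .join(".");
    // Errors without a path never match and are treated as critical.
    return !omitted.has(fieldPath);
  });
}
```

The span is tagged as an error as soon as one error falls outside the omitted paths, so a noncritical field error cannot hide a critical one.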

`@component`

The @component directive lets the client declare a component name for the query. This directive is used in our observability and monitoring tools and for schema stability - restricted usage in production. See our blog post GraphQL persisted queries and schema stability for more details.

directive @component(name: String!) on QUERY

`@tracingTag`

The @tracingTag directive defines an OpenTelemetry tracing tag for the query. Using this directive on a query adds a specific client-defined tag to our tracing spans. The clients can then follow the traces and filter by this tag to find the traces for a particular query. This directive is useful for debugging, troubleshooting, monitoring a specific set of queries, etc.

directive @tracingTag(value: String!) on QUERY | MUTATION | SUBSCRIPTION

`@omitErrorTag`

The @omitErrorTag directive is used to omit marking the tracing span as an error. This directive can be used on a particular field in the query. This directive lets the client define that some field errors are noncritical and should not be reported for alerting. The 24x7 on-call team can then focus on the critical errors and not be distracted by the noise.

directive @omitErrorTag on FIELD

`@maxCountInBatch`

The @maxCountInBatch directive is used at the Query level to declare the maximum number of queries that can be batched together in a single request. This directive is client-controlled, i.e. clients can only set it at build/persist time. At runtime, the value is used to prevent overfetching of data and bot abuse of the GraphQL API.

Our GraphQL server allows batching of multiple queries in a single request. With persisted queries, we only send the id of the query, and the client cannot send a raw query in production. So, the system design allows the safe usage of @maxCountInBatch controlled by the clients.

directive @maxCountInBatch(value: Int!) on QUERY
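A runtime check could look like the sketch below. It assumes that the limit applies to how often the same persisted query may appear in one batch and that the limit was stored alongside the query at persist time; the store, the query ids other than product_card, and the function name are hypothetical.

```javascript
// Hypothetical store of persisted queries: id -> metadata extracted
// from the query directives at persist time.
const persistedQueries = new Map([
  ["product_card", { maxCountInBatch: 50 }],
  ["customer_profile", { maxCountInBatch: 1 }],
]);

// Returns the ids of queries that appear more often in the batch
// than their declared limit allows.
function findBatchLimitViolations(batch) {
  const counts = new Map();
  for (const { id } of batch) {
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  return [...counts]
    .filter(([id, count]) => {
      // Queries without a declared limit are not restricted in this sketch.
      const limit = persistedQueries.get(id)?.maxCountInBatch ?? Infinity;
      return count > limit;
    })
    .map(([id]) => id);
}

const rejected = findBatchLimitViolations([
  { id: "customer_profile" },
  { id: "customer_profile" },
  { id: "product_card" },
]);
// rejected -> ["customer_profile"]
```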

Example usage of all of the above query directives

query product_card($id: ID!)
#
# component directive
@component(name: "web-product-card")
#
# tracing tag directive to add a tag to the tracing span
@tracingTag(value: "slo-1s")
#
# maxCountInBatch directive to limit the number of queries in a batch request
@maxCountInBatch(value: 50) {
  product(id: $id) {
    id
    name
    brand {
      id
      name
    }
    # omitErrorTag directive to omit marking the tracing
    # span as an error if inWishlist field errors
    inWishlist @omitErrorTag
  }
}

Conclusion

Query directives allow clients to define metadata and, on rare occasions, behavior. Schema directives, on the other hand, allow the server to define behavior, validation, and resolution logic in a declarative manner. Schema directives carry the added advantage that the servers can make breaking changes to these directives, as these directives are not consumed by the client - they only experience the resulting behavior. It's important when designing a directive to consider its properties, use cases, trade-offs, and where the control should lie.

The use cases outlined in this blog post represent some of the ways we use GraphQL directives at Zalando. There are numerous other cases that we'll cover in future blog posts. I hope this piece provides a good starting point for you to explore GraphQL directives and their practical applications.

My First Year as an Engineering Manager at Zalando

2023-09-26T00:00:00+02:00

Starting a New Journey

Taking the next step in a career is always an exciting adventure, even if it comes with challenges. For me, the biggest challenge was becoming an engineering manager in a foreign country. Stepping into a new country as an expat, with a culture I wasn't all that familiar with, was a completely fresh start. When I said yes to my new journey, I started researching Zalando to learn more.

My first stop was the Zalando Engineering Blog - a real treasure for someone like me who was curious about the engineering culture and practices at what would be my new company. Reading post after post, I was amazed by everything - the interesting engineering topics, challenges, solutions, and approaches. Since I love reading and writing blog posts, I even dreamt of contributing here someday. Now, looking at today and thinking about my first year, I see that I've gained lots of experiences and learnings that I can put into words. While one post won't cover all the details, I believe I can create a short but nice summary of my journey so far. So, let's begin.

First Impressions

On my first day, as I stepped into the office, one thing truly resonated with me. A phrase was inscribed on the floor: "Always put yourself in the customer’s shoes". This is one of Zalando's founding mindsets, as I would learn over the next few days. It also marked the first of many reminders that would constantly keep me aware of how important customers are for Zalando.

As I walked around and met various people, I noticed how impressively international the working environment was, with a rich multicultural and diverse setup. From day one and with each passing day, I've come to believe that this is Zalando's greatest wealth. And on a personal note, having colleagues from all corners of the world, sharing lunches and coffee breaks, and learning from their diverse experiences are great benefits that cannot simply be found in contracts.

Onboarding

As I settled in, my onboarding journey kicked off right away. Zalando provides an excellent onboarding program for newbies. It covers not only technical topics but also goes into Zalando's culture, with a lot of inspiring meetings. This also creates an opportunity to connect with colleagues from different departments that you may not have had a chance to interact with otherwise.

Besides Zalando's onboarding, it was important for me to really understand how my department and team contribute to the company. So, I focused on what we do and how our work helps Zalando’s success. My department is Pricing Platform, and our main scope is pricing and discounting tools and algorithms. The more I learnt, the more I was amazed by how much data science, engineering, and analytics are involved in something as simple as a 20% discount on the website. For me, the real test is whether I can successfully explain the project details to my dad, who doesn't know much about tech beyond using a smartphone. If he gets it, then I'm pretty sure I truly understand what we do in our department. When I told my dad about my department's job, I started with, "Dad, you will not believe how that simple discount you see on the webpage is calculated".

Cyber Week

My first big challenge was Cyber Week. Since I joined Zalando just a month before Cyber Week, everyone was talking about it. Coming from a country that doesn't have Cyber Week, I initially thought (I'll admit it shamelessly) that Zalando was having a week of cyber security tests, which actually sounded pretty cool. But then, when I understood what Cyber Week was really about, I realised how important it was for Zalando.

The readiness for Cyber Week and all the preparations that go into it completely impressed me. The structured game plan, playbooks, situation rooms, incident processes: they were all new concepts to me, and I was amazed by what operational excellence can look like. There’s no way I can cover all the details of Cyber Week in this post, but there's one thing I have to mention. During the final minutes of Black Friday, there's a tradition of virtually gathering with the shift crew and watching the order monitoring spike up like a hockey stick, marking the peak order rate during Black Friday. That moment made a strong impact on me, showing how our little contributions as software engineers play a role in those big successes.

Growing Together

While I've mostly focused on the technical and operational aspects of Zalando, I can't skip the people part, of course. Zalando has an amazing culture when it comes to managing and developing people. They provide different ways to grow with clear expectations. One thing that really surprised me was that Zalando offers both management and technical expert paths for software engineers. For example, after becoming a Senior Software Engineer, you can choose to either become an Engineering Manager or a Principal Engineer. This is quite unique, something I hadn't encountered before in my past experiences. It’s not about getting pushed into management; instead you have the opportunity to advance based on your skills and aspirations at the same level as management roles.

Feedback Culture

Talking about career growth, I shouldn't forget to mention performance evaluation. This is a vital aspect of any organization's success. Zalando recognizes this importance and has implemented effective practices to ensure that performance management is done right. Performance evaluation at Zalando starts with collecting feedback, which is the most important part of the process, in my opinion. The company provides an ideal environment for sharing and receiving feedback. You can receive feedback from your peers, team members, and stakeholders, essentially from the people you interact with daily. This culture of openness to feedback has been invaluable in helping me understand where we can improve as a team and how I can grow as a leader beyond my current capabilities.

Moreover, in my role as a leader, I know the importance of giving constructive feedback and facilitating performance evaluations for my team members. Zalando has several effective practices in place to support leaders in this regard. We receive support from experienced leaders, seek guidance from our peers in different departments, and collaborate with P&O (People and Operations) business partners. Throughout the year, we also have access to various training sessions, coaching sessions, and leaders' enablement programs. This comprehensive support to leaders makes sharing constructive feedback, which ultimately helps everyone reach their full potential, a seamless and rewarding part of the job.

It Is Not All About Work

While I've mostly shared the business aspect of Zalando, I must acknowledge that Zalando also knows how to have a good time. There are a lot of communities with various interests: running, fishing, beach volleyball, board games, or more technical topics like the Python or Linux guilds.

The company also places great importance on continuous improvement, which is, of course, a crucial aspect of a software engineer's work. Departments organize hack weeks; for instance, our department had an Innovation Sprint where individuals pitched initiatives using cutting-edge technologies like generative AI. Every month, Tech Academy hosts a Coffee Bytes event, a casual coffee meet-up with no set agenda, allowing members of the tech community to connect and make friends. Considering all these examples, despite the importance of business and customers, having fun is equally important at Zalando. I realized this right from the beginning when I saw one of the t-shirts with the slogan "Zalando, we dress code".

What's Next?

Finishing up this look back, my first year as an Engineering Manager at Zalando has been a really good journey with lots of learning, growing, and experiencing new things. The diverse and dynamic environment, along with the focus on people and having fun, has been like magic. Thinking about what is next, I'm looking forward to continuing to add my small touch to Zalando's great work, enjoying the mix of tough challenges, teamwork, and moments that make us laugh. Here's to more times of growing, trying new things, and maybe getting a few more awesome sneakers along the way!

Sunrise: Zalando's developer platform based on Backstage

2023-08-03T00:00:00+02:00

Introduction

Since 2021, Zalando has invested in building a developer portal called Sunrise, which aims to become the starting point for Builders at Zalando. The portal is based on Spotify's Backstage platform with additional extensions built internally. Sunrise enables everyone at Zalando to view and discover information about teams, applications, APIs, events, CI/CD pipelines, infrastructure accounts and costs, and much more. In this post, we explore how adopting Backstage impacted the daily life of Software Engineers at Zalando and get insights from Lacey and Arthur, who led the efforts on the Product and Engineering side.

Fig 1. Sunrise: detailed information about applications

Lacey, what's your role in creating Sunrise?

Lacey: Funny story, I actually ran a vision workshop with the team responsible for the Developer Portal at Zalando before I became a member of the department! As the official product manager, I helped solidify the vision with a platform mindset and an experience strategically focused on interoperability and usability. I worked with engineering stakeholders and the engineering manager to devise a strategy and roadmap to give us the best chance at efficient implementation, good adoption, and improved satisfaction from users so that more platform and infra teams would want to contribute. And of course, I'm probably the loudest promoter of our platform's & contributors' solutions 😅

Arthur, how about you? How are you involved here?

Arthur: Hello! I actually first got involved with Sunrise as an early adopter and active user, before moving internally to the team in May 2022. Since then, I've been leading the engineering team, driving the delivery of new features, coordinating support and maintenance on the platform, contributing to the product vision and ensuring our alignment with the organizational strategy, all the while managing our amazing team of 4 software engineers.

Why did Zalando choose Backstage for its developer portal? Was any similar solution in place before?

Lacey: Before Sunrise, we had over 100 disconnected interfaces & resources, plus "the Developer Console" which centralized links to resources mostly for the Code through Deploy steps of the Developer Journey. After recognizing that we'd need to evolve into a platform to achieve our vision, we considered several options (including building everything ourselves), and Spotify happened to reach out while we were still in the discovery & design phase. What made it a great fit then, was that we had extremely limited resources and skills (both engineering & design) on the team at the time, so we recognized that having an out-of-the-box solution for a design-system and plugins like the basic Software Catalog would be necessary for us to deliver something fast enough to justify the strategic investment & potential risk of failure.

I hear that our Engineers are really excited about Sunrise. Why and what features are they most excited about?

Lacey: From pretty much the beginning, the topic of interoperability has been prevalent as it's what enables us to eliminate friction from the day to day tasks Builders need to perform. Users really celebrated a deeper integration that two contributing teams collaborated on to make the experience of deploying data pipelines more seamless, and features that make org structure and reporting lines more transparent have also had very quick and wide adoption. We also have some very popular Platform features that enable all our users (regardless of whether they actually own services or not) to see personalized content by default and further customize personalization settings. The day to day features that people actually use the most are the action-oriented-easy-access links on the homepage, the CI/CD interface, Search, and the Application catalog, which includes integrations to tooling and resources across the SDLC.

How do you measure adoption of the platform and along each part of the SDLC?

Lacey: Since our vision for Sunrise was to make it the "daily" starting point for Builders, we monitor the share of Builders using the platform daily (on weekdays) and weekly as our primary success metrics for adoption. Since not all features actually need to be used daily (for example, not every person will be registering a new application every working day), we let contributors determine what makes sense for their integrations, and we provide them with a centralized dashboard and support with Analytics to make it easier to understand usage. In the future, we hope to map adoption of features to more tangible improvements in operational performance.

What features were added on top of Backstage's open-source project?

Lacey: That's actually a pretty big question. For our earliest release, we added a personalized homepage with easy-access links to things engineers use often, like open PRs and recently deployed pipelines, a support overview that they were used to from previous tooling, and our internally built CI/CD platform. Since then, we've integrated 27 other tools & services through 30 front-end plugins, ranging from our internal machine learning platform, through widgets that make users aware of base image vulnerabilities or delivery performance insights, to a personalized dashboard covering all aspects of critical business events, like Cyber Week. Some of those plugins were contributed back to the open source community, such as the interface for our API Linter, Zally. Our platform's personalization features, especially those for users who don't own components themselves but who have some accountability for them, increased adoption amongst principal engineers and leadership. They have also helped contributors to Sunrise provide, with very little effort, reporting-like features that they never had before, which in turn drive more regular use within engineering teams.

Which team operates the platform? Any challenges that you had to overcome to support Zalando's user base?

Arthur: Our team is called Builder Portal, and has been operating and evolving the platform since its inception. Our biggest technical challenge at Zalando's scale has been managing the various pre-existing sources of data and determining how to sync them with Backstage's Catalog system. We currently have over 40k registered entities (between applications, teams, and users) which we sync daily with the respective source of truth services. In terms of adoption, the biggest challenge from the get-go was to make sure that the experience is approachable and consistent for all users, regardless of which part of the development journey they are working on. Builders can be very opinionated in their ways of working, so making sure that our decisions are well thought out and will ultimately support them in working productively and happily can be challenging sometimes, but it's also very rewarding. And hey – we're Builders ourselves too, so we also enjoy using Sunrise while maintaining it.

Lacey: A lot of what we see impacting adoption of new features is that people have built habits – and incredibly long bookmark lists – to make up for deficiencies of the fragmented tooling. What turned out to be most impactful for solving this problem is ensuring that we redirect users from old features to the new ones in Sunrise shortly after making them generally available and then completely shut down the old tooling.

Backstage is open-source. How do Zalando and your team approach upstream contributions? Can you name some notable examples?

Arthur: Whenever we find some limitation in Backstage in comparison to what we want a feature to look like, we reflect on whether this is something that could impact other adopters of the platform or whether it's a Sunrise-specific problem. If it's the former, we reach out via a GitHub issue (e.g. bug report and feature request). If we know how to solve it, we also contribute a pull request (e.g. respective bugfix and new feature). We also keep an eye out for opportunities to share in-house plugins with the community. As mentioned by Lacey earlier, last year we open-sourced our API Linter plugin.

Fig 2. Sunrise: open-sourced API linter plugin

How about the internal features? How easy has it been to get contributions from outside of your team?

Arthur: We have at least ten other plugins (the number grows sporadically) owned and maintained by other teams in Zalando, including our own Continuous Delivery and Machine Learning Platform teams. There's always an initial barrier of entry (as with any other application and framework) for contributors to understand the domain-specific language of Backstage, as well as the standards we have implemented on the platform, especially since many platform teams don't have a lot of front-end engineers available to work on the user interface of their plugins. We invest a lot in creating standard components and documenting our patterns so contributors can spend less time figuring out which button to use and more time improving the overall experience for their users.

You recently reached a major milestone – 2,000 PRs merged to the repository and Sunrise replacing multiple internal tools and the prior generation of the developer portal. What's the next big milestone that you look forward to?

Lacey: Creating comprehensive visibility into everything running in production and mapping the relationships between entities – automatically where possible – so that we can centrally support global improvements to the operational health of systems and teams. The SBOM work you mentioned in your recent post is a big part of that, but we are also working on surfacing the relationships between entities like data pipelines and applications, as well as the relationships of applications and their components to business problems through a standardized and semi-automated documentation of Domains. Having that oversight will enable us to shift left not only security and compliance, but also productivity, reliability, and cost efficiency by providing insights about the current balance of operational health in relationship to business metrics relevant to our high-level Domains. It will give Builders easier access to the information they need to involve the right stakeholders and make decisions about what kind of work to invest in and when. To put it shortly: we're all a bit happier, more secure, and more efficient when working with transparency and less uncertainty.

Any tips that you'd give to teams who are also adopting Backstage as the foundation for their developer portal?

Lacey: Haha, the list is long because I've learned a lot over the life of this initiative. I'd sum it up as:

Having a clear, inspirational vision that includes (and delineates) the needs of both users and contributors – and that you constantly communicate – will be key for motivating contributors and for reaching the critical mass of user journeys needed for users to feel the benefit of your platform.
To drive adoption and impact, look for opportunities to personalize content to make it easier to recognise and understand, and invest in increasing the interoperability along the journeys your users take to complete tasks between both fully integrated interfaces and features, as well as external tooling – and don't forget to shut down old tooling!
Whether you're using an open source plugin or building something yourself from scratch, investing in great UX research and design is critical for building an experience that will remain cohesive as it grows – that's important so that your users are enabled to actually find the things you build, and are happy to use them.

Arthur: My tip is to leverage the power of open source! The Backstage Community is ever-growing and provides a lot of interesting, well-maintained plugins for you to make use of, so don't shy away from engaging with it. The framework itself is also constantly evolving and growing its scope, and with some big adopters already leveraging it (including us!), you're sure to see a lot of examples of interesting use cases that will support your teams to be more productive.

Bartosz: Thanks for the conversation and for walking us through our approach to building a Developer Platform!

If you would like to know more about Sunrise, check out Henning's talk Cloud native developer experience at Zalando or the related post.

All you need to know about timeouts

2023-07-26T00:00:00+02:00

Nobody likes to wait. We at Zalando are not an exception. We don't like our customers to wait too long for delivery, we don't like them to wait during checkout, and we don't like microservices that take too long to respond. In this post we're going to talk about how to set reasonable timeouts for your microservices to achieve maximum performance and resilience.

Why set timeout

Before we start, let’s answer the simple question: "Why timeout?". A successful response, even if it takes time, is better than a timeout error. Hmm… not always, it depends!

First of all, if your server does not respond or takes too long to respond, nobody will wait for it. Instead of challenging the patience of your users, follow the fail-fast principle. Let your clients retry or handle an error on their side. When possible return a fallback value.

Another important aspect is resource utilisation. While a client is waiting for a response, various resources are being utilised: threads, HTTP connections, database connections, etc. Even if the client has closed the connection, without a proper timeout configuration the request is still being processed on your side, which means that resources are busy.

Remember, when you increase timeouts you potentially decrease the throughput of your application!

Using an infinite or very high timeout is a bad strategy. For a while you won't see the problem, until one of your downstream services gets stuck and your thread pool gets exhausted. Unfortunately, many libraries set default timeouts too high or make them infinite. They aim to attract as many users as possible and try to make their library work in most situations. But for production services, this is not acceptable. It can even be dangerous. For example, for the native Java HttpClient the default connection/request timeout is infinite, which is unlikely to be within your SLA :)

The default timeout is your enemy, always set timeouts explicitly!

Connection timeout vs. request timeout

The distinction between connection timeout and request timeout can cause confusion. First, let's have a look at what Connection timeout is.

If you google or ask ChatGPT you’ll get something like this:

A connection timeout refers to the maximum amount of time a client is willing to wait while attempting to establish a connection with a server. It measures the time it takes for a client to successfully establish a network connection with a server. If the connection is not established within the specified timeout period, the connection attempt is considered unsuccessful, and an error is typically returned to the client.

What does it mean to establish a connection? TCP uses a three-way handshake to establish a reliable connection. The connection is full duplex, and both sides synchronize (SYN) and acknowledge (ACK) each other. The exchange of these four flags is performed in three steps—SYN, SYN-ACK, and ACK.

A connection timeout should be sufficient to complete this process and the actual transmission of packets is gated by the quality of the connection.

In simple words, the value for the connection timeout should be derived from the quality of the network between services. If a remote service is running in the same datacenter or the same cloud region, connection time should be low. And the opposite, if you’re working on a mobile application then connection time to a remote service might be quite high.

To give you some insights: round-trip time (RTT) in fiber is ~42ms from New York to San Francisco and ~160ms from New York to Sydney. You can also look at Connection Health Check by Amazon. This is what I get from my local machine: RTT 28ms to the recommended AWS Region.

When does connection timeout occur

A connection timeout occurs only upon starting the TCP connection. This usually happens if the remote machine does not answer. This means that the server has been shut down, you used the wrong IP/DNS name, the wrong port or the network connection to the server is down. Another frequent condition is when a given endpoint simply drops packets without a response. The remote endpoint's firewall or security settings may be configured to drop certain types of packets or traffic from specific sources.

Connection timeout best practices

A common practice for microservices is to set a connection timeout equal to or slightly lower than the timeout for the operation. This approach may not be ideal since the two processes are different. Whereas establishing a connection is a relatively quick process, an operation can take hundreds or thousands of ms!

You can set up a connection timeout which is some multiple of your expected RTT. Connection timeout = RTT * 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs.

In general, the connection timeout for a microservice should be set low enough so that it can quickly detect an unreachable service, but high enough to allow the service to start up or recover from a short-lived problem.
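As a rough illustration of the RTT * 3 rule, here is a minimal TypeScript sketch. The helper name connectionTimeoutMs, the floor and ceiling values, and the 2ms same-region RTT are assumptions made up for this example; only the 42ms and 160ms figures come from the numbers quoted above.

```typescript
// Hypothetical helper: derive a connection timeout from an expected round-trip time.
// The default multiplier of 3 is the conservative rule of thumb from the text.
// The floor and ceiling are illustrative assumptions, not recommendations.
function connectionTimeoutMs(
  expectedRttMs: number,
  multiplier: number = 3,
  floorMs: number = 10,
  ceilingMs: number = 5000,
): number {
  if (!Number.isFinite(expectedRttMs) || expectedRttMs <= 0) {
    throw new RangeError("expectedRttMs must be a positive, finite number");
  }
  const raw = Math.ceil(expectedRttMs * multiplier);
  // Clamp so that a tiny RTT does not produce an unrealistically small timeout
  // and a huge RTT does not produce an effectively unbounded one.
  return Math.min(ceilingMs, Math.max(floorMs, raw));
}

const sameRegion = connectionTimeoutMs(2);    // assumed 2ms RTT, clamped up to the floor
const nyToSf = connectionTimeoutMs(42);       // 42 * 3
const nyToSydney = connectionTimeoutMs(160);  // 160 * 3
```

The floor exists because very low timeouts inside one region leave no room for a short-lived hiccup, which is the "high enough to recover" half of the trade-off described above.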

Request Timeout

A request timeout, on the other hand, pertains to the maximum duration a client is willing to wait for a response from the server after a successful connection has been established. It measures the time it takes for the server to process the client's request and provide a response.

Setting optimal request timeout

Imagine you are going to integrate your microservice with a new API.

The first step would be to look at SLAs provided by the microservice or API you are calling. Unfortunately, not all services provide SLAs and even if they do you should not trust them blindly. The SLA value is good enough only as a starting point for testing real latency.

If possible, run an integration with the new API in shadow mode and collect metrics. This code should run parallel to the existing production integration, but without affecting the production system (run it in a separate thread-pool, mirror traffic, etc).

After collecting latency metrics such as p50, p99, and p99.9, you can define the so-called acceptable rate of false timeouts. Let's say you go with a false timeout rate of 0.1%: that means the max timeout you can set is the p99.9 latency of the downstream service.
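To make the relationship between the false timeout rate and the latency percentile concrete, here is a small sketch. It assumes you have raw latency samples at hand and uses the nearest-rank percentile definition; the function names percentile and maxTimeoutMs are made up for this example, and in practice your metrics system will usually compute the percentiles for you.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // The small epsilon guards against floating-point noise such as 999.0000000000001.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length - 1e-9));
  return sorted[rank - 1];
}

// With an acceptable false-timeout rate r (e.g. 0.001 for 0.1%), the maximum
// timeout is the latency at the (1 - r) * 100 percentile.
function maxTimeoutMs(samplesMs: number[], falseTimeoutRate: number): number {
  if (falseTimeoutRate < 0 || falseTimeoutRate >= 1) {
    throw new RangeError("falseTimeoutRate must be in [0, 1)");
  }
  return percentile(samplesMs, (1 - falseTimeoutRate) * 100);
}

// Synthetic data for illustration only: 1000 samples of 1ms, 2ms, ..., 1000ms.
const samples = Array.from({ length: 1000 }, (_, i) => i + 1);
const p50 = percentile(samples, 50);
const p99 = percentile(samples, 99);
const timeout = maxTimeoutMs(samples, 0.001); // p99.9 of the samples
```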

At this step you have a max timeout value you can set but you have a trade-off:

set timeout to the max value
decrease timeout and enable retry

Based on the test results you need to choose the timeout strategy. We'll cover retries a little bit later.

The next challenge you will face is a chain of calls. Imagine your service has an SLA of 1000ms and it sequentially calls Order Service with p99.9 = 700ms and then Payment Service with p99.9 = 700ms. How do you configure timeouts without breaching the SLA?

Option 1: Share your time budget

One option would be to share your time budget (your SLA) between services and set timeouts accordingly: 500ms for Order Service and 500ms for Payment Service. In this case, you have a guarantee that you will not breach your SLA, but you might have some false positive timeouts.

Option 2: Introduce a TimeLimiter for your API

Since different services will not simultaneously respond with the maximum delay, you can wrap the chained calls in a time limiter and set the maximum acceptable timeout for both services. In this case you could create a time limiter of 1sec and set a timeout of 700ms for each downstream service.

In Java, you could use CompletableFuture, which offers several methods, among them orTimeout and completeOnTimeout, that provide built-in support for dealing with timeouts.

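// Run both downstream calls in sequence and complete the whole chain
// exceptionally with a TimeoutException if it takes longer than 1 second.
CompletableFuture
    .supplyAsync(() -> orderService.placeOrder(...))
    .thenApply(order -> paymentService.updateBalance(...))
    .orTimeout(1, TimeUnit.SECONDS);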

There is also a nice TimeLimiter module provided by the Resilience4j library

Retry or not retry

The idea is simple - consider enabling retry when there is a chance of success.

Temporary failures: Retry is suitable for temporary failures that are expected to be resolved after a short period, such as network glitches, server timeouts, or database connection issues. Retry can also avoid a bad node. Given a large enough deployment (e.g. 100 pods), a single pod might have a substantial performance regression, but if requests are load balanced in a sufficiently random way, retrying is faster than awaiting a response from the bad node.

Retry on timeout errors and 5xx errors
Do not retry on 4xx errors
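The two rules above can be captured in a small predicate. This is only a sketch: the CallOutcome type and the isRetryable name are invented for the example, and treating network errors as retryable is an assumption based on the "network glitches" mentioned above. It also says nothing about idempotency, which is covered next.

```typescript
// Hypothetical model of how a remote call can end.
type CallOutcome =
  | { kind: "response"; status: number }
  | { kind: "timeout" }
  | { kind: "networkError" };

// Retry on timeouts, network errors and 5xx responses; never on 4xx,
// because the same request will be rejected again for the same reason.
function isRetryable(outcome: CallOutcome): boolean {
  switch (outcome.kind) {
    case "timeout":
    case "networkError":
      return true;
    case "response":
      return outcome.status >= 500 && outcome.status <= 599;
  }
}
```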

Idempotent operations: If the operation being performed is idempotent, meaning that executing it multiple times has the same result as executing it once, retries are generally safe.

Non-idempotent operations can cause unintended side effects if retried multiple times. Examples include operations that modify data, perform financial transactions, or have irreversible consequences. Retrying such operations can lead to data inconsistency or duplicate actions.

Even if you think an operation is idempotent, if possible, ask the service owner whether it is a good idea to enable retries.

For safely retrying requests without accidentally performing the same operation twice, consider supporting additional Idempotency-Key header in your API. When creating or updating an object, use an idempotency key. Then, if a connection error occurs, you can safely repeat the request without the risk of creating a second object or performing the update twice. You can read more about this idempotency pattern here Idempotent Requests by Stripe and Making retries safe with idempotent APIs by Amazon.
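On the server side, the core of the pattern is remembering which keys have already been processed. The sketch below is deliberately simplified: OrderApi, createOrder and the in-memory Map are all hypothetical, and a real service would need durable storage, key expiry and protection against two concurrent requests with the same key.

```typescript
interface Order {
  id: number;
  item: string;
}

class OrderApi {
  private nextId = 1;
  private readonly orders: Order[] = [];
  // Maps an Idempotency-Key value to the response that was produced for it.
  private readonly seenKeys = new Map<string, Order>();

  createOrder(idempotencyKey: string, item: string): Order {
    const previous = this.seenKeys.get(idempotencyKey);
    if (previous !== undefined) {
      // A retry of a request we already handled: replay the stored response
      // instead of creating a second order.
      return previous;
    }
    const order: Order = { id: this.nextId++, item };
    this.orders.push(order);
    this.seenKeys.set(idempotencyKey, order);
    return order;
  }

  orderCount(): number {
    return this.orders.length;
  }
}

const api = new OrderApi();
const first = api.createOrder("key-123", "sneakers");
const retried = api.createOrder("key-123", "sneakers"); // same key: no new order
```

The client generates the key once per logical operation and reuses it for every retry of that operation.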

Circuit breaker: always consider implementing circuit breakers when enabling retries. When failures are rare, retries are not a problem. But when a downstream service is already overloaded, retries add even more load and can make matters significantly worse.
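To show the idea, here is a minimal count-based circuit breaker. It is a sketch under simplifying assumptions, not a replacement for a library such as Resilience4j: the class and method names are invented, it opens after a fixed number of consecutive failures, and in the half-open state it lets all requests through rather than a limited number of probes.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly openForMs: number;
  private readonly now: () => number;

  // The clock is injectable so the behaviour can be tested without waiting.
  constructor(failureThreshold: number, openForMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.openForMs = openForMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.openForMs ? "halfOpen" : "open";
  }

  // Callers, including retry loops, check this before every attempt.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    const wasHalfOpen = this.state() === "halfOpen";
    this.consecutiveFailures += 1;
    if (wasHalfOpen || this.consecutiveFailures >= this.failureThreshold) {
      // Open (or re-open) the breaker and restart the cool-down period.
      this.openedAt = this.now();
    }
  }
}
```

While the breaker is open, both first attempts and retries fail immediately on the client side, which gives the struggling service room to recover.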

Exponential backoff: Implementing exponential backoff can be an effective retry strategy. It involves increasing the delay between each retry attempt exponentially, reducing the load on the failing service and preventing overwhelming it with repeated requests. Here is a fantastic blog on how AWS SDKs support exponential backoff and jitter as a part of their retry behaviour.
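The delay calculation itself is short. The sketch below uses the "full jitter" variant, where the delay is picked at random between zero and the exponentially growing ceiling. The function name, the default base of 100ms and the cap of 10 seconds are assumptions for illustration; the random source is a parameter so the function can be tested deterministically.

```typescript
// attempt is 0 for the first retry, 1 for the second, and so on.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 10000,
  random: () => number = Math.random,
): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  // Exponential growth, bounded by the cap so delays do not grow forever.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: spread clients across the whole interval [0, ceiling)
  // so they do not all retry at the same moment.
  return Math.floor(random() * ceiling);
}

const delays = [0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt));
```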

Time-sensitive operations: Retries may not be appropriate for time-critical operations. The trade-off here is to decrease a timeout and enable retries or keep the max acceptable timeout value. Retries might not work well where p99.9 is close to p50.

Look at the graphs. On the first one, timeouts happen occasionally and there is a big difference between p99 and p50: a good case for enabling retries.

On the second graph, timeouts happen periodically and p99 is close to p50: do not enable retries.

Recap

set timeout explicitly on any remote calls
set connection timeout = expected RTT * 3
set request timeout based on collected metrics and SLA
fail-fast or return a fallback value
consider wrapping chained calls into time limiter
retry on 5xx error and do not retry on 4xx
think about implementing a circuit breaker when retrying
be polite and ask the API owner for permission to enable retries
support Idempotency-Key header in your API

Resources

Speed of Light and Propagation Latency
Timeouts, retries, and backoff with jitter by AWS
The Tail at Scale - Dean and Barroso 2013
The Tail at Scale - Adrian Colyer 2015
The complete guide to Go net/http timeouts by Cloudflare
Handling timeouts in a microservice architecture
Making retries safe with idempotent APIs by AWS
Idempotent Requests by Stripe

Rendering Engine Tales: Road to Concurrent React

2023-07-11T00:00:00+02:00

Welcome back to our web platform blog series! It's been a while since we last talked about our approach to large-scale front-end development at Zalando. We are excited now to reconnect and share with you some substantial enhancements we've made to the streaming and rendering architecture of our Rendering Engine framework.

The first post of this new series will recap how Rendering Engine works, its relationship with Concurrent React, and our journey with it including design and implementation challenges as well as successes gained so far.
Additionally, it covers the main hydration mismatch errors we faced during this upgrade, our solutions and recommendations for avoiding them, and some extra tips and tricks for debugging this type of issue.

Intro

"Rendering Engine" is the web framework that is maintained by and currently used in Zalando to render the Fashion Store website, and is designed for building any web application with similar needs.

You might know Rendering Engine (RE) from our previous blog posts about Micro Frontends at Zalando and our journey through them from Project Mosaic with its fragments and Tailor, to Interface Framework (part 2).

In a nutshell, RE is a web framework best suited for creating a website that:

Uses React to render the UI
Inherently implements universal rendering (server side / client side) with high emphasis on server rendering and page load performance
Has its page content, layout and UI steering highly driven by the backend in a nestable approach
The backend can be a recommendation engine, a CMS-like system able to define the shape and content of pages, or any other similar system.

The building blocks of RE's language for defining what to render are Entities. Each Entity is a block of content that, from a business-logic perspective, has a specific identity and can have other Entities nested inside. For example, in the context of a fashion store, an Entity could be a Product, a Collection of products, an Outfit, etc. When organized in tree-like structures, Entities can be used to define the full layout and contents of pages. Defining each Entity from the backend is done by specifying a type, an id, and optional extra data in the form of hints. We'll skip how RE handles defining layouts from the backend for the time being.

So by considering Entities to be responsible for describing "what to render" (by the backend), then specifying "how to render" is the responsibility of what we call a Renderer (by the client).
Each Renderer is a self-contained TypeScript module powered by multiple RE features provided during server- and client-side rendering. Each Renderer is responsible for rendering a specific type of Entity, while each Entity type can be represented by multiple Renderers depending on the extra hints data.

This assignment mapping is defined via something called Rendering Rules. These configurations, which are passed to RE, include "selectors" for matching the incoming Entity definitions from the backend, and support nested and per-page rules.

There are a handful of other features built into this framework including monitoring, experimentation, tracking, a different rendering output for server driven mobile apps, etc. but for now this introduction should do.

React 18's Concurrent Rendering

(and how it fits Rendering Engine like a glove)

Performance has always been one of the key focus areas of Rendering Engine from its beginnings. Aside from being built with performance in mind and going through many micro improvements over the years, it also comes with some performance features built inside, including but not limited to streaming, lazy-loading, partial streaming and partial hydration (yes, almost the same concept as in Concurrent React!).

Although these performance related features have proven to be very important in the success of the Fashion Store website, their code's maintenance, improvements and required education as well as knowledge sharing come with a cost.

But more importantly, we anticipated having React's built-in support for these features would most probably bring even more performance boosts to the table.

Additionally, React's concurrent rendering APIs seamlessly integrate with the architecture of RE because its Renderers serve as ideal candidates for being encapsulated within a Suspense boundary. This enables them to function as individual blocks that can be server-rendered, streamed, hydrated, and client-rendered "concurrently". Especially since many of them have already been using Rendering Engine's own partial hydration/streaming features!

As a result, we have been very excited about the concurrent React 18 for quite a while and as soon as the opportunity arrived, we started the migration and refactoring of Rendering Engine's core functionalities to use the concurrent features.

Needless to say, this migration task has also had its challenges and costs! So now that we have finished some important milestones and are close to completion, we thought it is a good chance to start sharing our challenges, successes and learnings with you.

Design challenges with Concurrent Rendering

Rendering Engine at its core includes logic for handling the resolution of server's specified Entity definitions or layout into the corresponding Renderers, fetching their data as well as handling all the other aforementioned features like experimentation, tracking, etc. And only after that, it hands over the UI rendering responsibilities to React.
These happen gradually (and if needed, recursively) in a way that makes sure that Renderers remain independent while getting their data and rendering/streaming their final html, which makes way for performance gains.

So initially, with React 18 we thought of moving as much of this logic as possible (from data fetching to experimentation, tracking, etc.) to the React concurrent APIs such as Suspense and useTransition through custom hooks, which is often referred to as the "Render-As-You-Fetch" pattern, with the aim of reducing complexity and required effort among other things.

But after a trial phase and implementing a proof of concept, we faced some issues, the main ones being:

In cases where keeping the correct order of the content during streaming/hydration is important, the closest available solution would be to use the SuspenseList API. But it still seems to be experimental, with some limitations.
The useTransition API does not consider nested Suspense boundaries, causing bad UX in some scenarios.
By utilizing hooks to initiate requests or other async operations, the timing of fetch operations becomes coupled with the order of rendering, which may not be optimal for performance.
Progressive hydration and streaming necessitate the availability of all the data required for client-side rendering as early as possible. This implies that, in addition to the HTML generated by components, it is crucial to stream their data to prevent redundant requests from being made by the client.
- During the trial phase, the streaming and caching layer to support this issue wasn't yet handled by React. And as of now, the latest supporting feature is still not final.

Chosen technical design

Due to the limitations mentioned above, we finally decided to go with a mixed solution.

In this approach, the concurrent streaming, hydration, rendering and basically all the Concurrent benefits are still achieved via fully utilizing React: by wrapping every Renderer in a Suspense boundary, and handling changes through concurrent APIs.
But at the same time, we created an "Application State" layer which encapsulates the main logic and Renderers data outside of React components/hooks in a central place, which dictates to the Suspense boundaries their state.

This way, the full power of orchestrating when to suspend a component (Renderer) depending on its place in the tree, handling the order of the suspended components, and deciding how to manage a transition considering the nested Suspense boundaries, would all be available and customizable in this Application State layer.
We will share the details of the technical solution for ordered streaming/hydration in another post.

In other words, every time RE finds the matching Renderer and resolves all its corresponding data for an Entity definition (through the "resolveEntity" step), the output will be written to the Application State layer. In the meantime React is rendering the Renderer components which are wrapped with Suspense.
To access data from the Application State, the suspendable Renderers use the "Connector hook".
The Connector hook reads from the application state which either returns the data that was asked for, or creates a promise that will be resolved once the data has been written. The promise is then used to suspend the component and React will automatically re-render once the Promise has been resolved.
Imagine Redux's useSelector hook, but instead of immediately returning selected data you get a Promise that only resolves once a reducer has made the data available.
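To illustrate the mechanism, here is a heavily simplified sketch of such a state layer in plain TypeScript, without React. It is not RE's actual implementation: the ApplicationState class, its read and write methods and the useConnector signature are all assumptions for this example, and it relies on the convention used by Suspense-based data libraries of throwing a pending promise to suspend a component.

```typescript
type Entry<T> =
  | { status: "resolved"; value: T }
  | { status: "pending"; promise: Promise<T>; resolve: (value: T) => void };

class ApplicationState<T> {
  private readonly entries = new Map<string, Entry<T>>();

  // Called once the data for an Entity has been resolved (hypothetically by
  // the resolveEntity step). Wakes up anything suspended on this key.
  write(key: string, value: T): void {
    const existing = this.entries.get(key);
    this.entries.set(key, { status: "resolved", value });
    if (existing !== undefined && existing.status === "pending") {
      existing.resolve(value);
    }
  }

  // Returns the data if it is already there. Otherwise it creates (or reuses)
  // a promise for the key and throws it, which a Suspense boundary catches
  // in order to show its fallback and re-render once the promise resolves.
  read(key: string): T {
    let entry = this.entries.get(key);
    if (entry === undefined) {
      let resolve!: (value: T) => void;
      const promise = new Promise<T>((res) => {
        resolve = res;
      });
      entry = { status: "pending", promise, resolve };
      this.entries.set(key, entry);
    }
    if (entry.status === "resolved") return entry.value;
    throw entry.promise;
  }
}

// Inside a Renderer, a connector hook could then be as thin as this.
function useConnector<T>(state: ApplicationState<T>, key: string): T {
  return state.read(key);
}
```

Because the promise is stored per key, repeated reads while the data is still pending throw the same promise instead of creating a new one on every render.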

Benefits gained from Concurrent Rendering

As we are still going through the changes and final steps of the full-fledged concurrent mode described above, the full benefits of it are yet to be observed.

To date, we have achieved some performance improvements, mainly by using the new streaming and hydration root APIs.

Performance improvements from `renderToPipeableStream` and `hydrateRoot` APIs

As one of the milestones, after the pure version upgrade and handling breaking changes, we changed only RE's internal streaming and hydration code to use the new React 18 APIs, i.e. renderToPipeableStream instead of renderToNodeStream, and hydrateRoot instead of hydrate.
We rolled out this change through an A/B test covering all pages of our e-commerce website, and in the end we observed these mild performance (and business metric) improvements:

Overall

INP: -5.69%
FID: -8.81%
LCP: -2.43%
FCP: -0.23%
Bounce rate: -0.24%

Per page: (some of the frequently visited pages)

Metric	Home page	Catalog page (list of products and search)	Product Details page
INP	-2.92%	-6.76%	-6.09%
FID	-2.98%	-17.11%	-6.06%
Exit Rate	-0.43%	-0.06%	-0.06%

Needless to say, this shows great promise, and we are now even more excited about the results of the next steps.

Technical challenges: Rise of the Hydration Mismatch errors!

As also stated in some of the documentation around React 18, the new React APIs are far more sensitive to existing hydration mismatch issues, so after the migration to the new streaming and hydration APIs we started receiving a lot more hydration error logs (via Sentry) for Zalando Fashion Store.
So during this migration, we've been finding and fixing these issues to prevent negative user impact as much as possible. And after fixing dozens of different types of issues deep inside hundreds of Renderers, we were able to considerably reduce the number of hydration mismatch errors occurring in the wild. That being said, there are still some more errors to fix which are harder to reproduce and find due to the dynamic nature of the page content in Fashion Store.
Nevertheless, below you can find the most common issues we found so far, and how we were able to fix them.

After that, we also briefly share some tips and tricks about the debugging process. Because - as you may also know if you have faced these errors in your projects - debugging them is not always a straightforward task, and to be honest, React's error logs (especially coming from the production environment) aren't very helpful!

Main types of issues we faced, and suggested solutions

Before going through details of each type, in some cases we realized that based on product requirements, one might actually not need to render some content on SSR (Server Side Rendering) and only the CSR (Client Side Rendering) would be enough.
Hence the obvious fix might be to just skip rendering on SSR and only show the content once the app is mounted on the user's browser.

To do that, we can rely on React hooks and lifecycle methods to ensure the app/component has been mounted on the browser. For example:

Instead of

  //...
  const { dataThatDiffersBetweenClientAndServer } = props;
  return (
    <div>{dataThatDiffersBetweenClientAndServer}</div>
  );

Use:

  //...
  const [isMounted, setIsMounted] = React.useState(false);
  // Effects only run in the browser, so isMounted stays false during SSR and hydration.
  React.useEffect(() => {
    setIsMounted(true);
  }, []);
  const { dataThatDiffersBetweenClientAndServer } = props;

  return (
    <div>{isMounted ? dataThatDiffersBetweenClientAndServer : "some fallback" /* or null */}</div>
  );

There are similar cases where due to the basic differences between the SSR and the CSR, like some data only being available on client side, one might need to render different content or elements on the two. For example, based on the exact specifications of the user's device, you want to display an app download banner.

For these scenarios, the suggestion would again be to simply wait until the initial hydration phase is finished on the client side, and then render the different content.

Note: in such cases, be mindful of layout shifts that can happen as a result of some element popping into the view.

With that out of the way, let's dive into the list of issues.

1. Timers

This is a common and somewhat expected source of hydration mismatch issues simply because if you're calculating and rendering the distance between two specific points in time (usually from past/future to now), it will result in slightly different values when calculated on SSR compared to a few moments later on CSR.

As also mentioned in React docs, in such cases where the mismatch is unavoidable, the suggestion is to simply tell React that the difference is expected and that React should ignore the mismatch during hydration. The way to do this is by passing the prop suppressHydrationWarning={true} to the element that contains such a mismatch. Keep in mind that this prop only works one level deep, so you have to pass it to the closest element wrapping the mismatching text. For example:

Instead of

  //...
  const timeDistance = targetDate.getTime() - Date.now();
  return (
    <div>{timeDistance}</div>
  );

Use:

  //...
  const timeDistance = targetDate.getTime() - Date.now();
  return (
    <div suppressHydrationWarning={true}>{timeDistance}</div>
  );

2. Localization of dates and different time-zones

Converting date values from raw formats (e.g. ISO 8601 2023-01-01T20:00:00.000Z) to human-readable strings can be a tricky cause of hydration mismatch errors.
Because if the timezone used for conversion is different between the server and client, the resulting values can be different as well.

So for example if the timezone is not specified while using the localization APIs (e.g. Intl.DateTimeFormat or Date.prototype.toLocaleString), then the host timezone will be used and if the SSR server has a different timezone than the user, it will lead to different localized date values in the end.

It's hard to decide what the best solution is in these cases, especially because as of now it is not possible to know the exact local timezone of the user on SSR based on HTTP headers (in the initial request).
On top of that, the question of which timezone to use for displaying dates is ultimately a product decision.

But if a specific universal timezone is approved and provided (for example the website's domain's matching timezone), then specifying that universal timezone to the conversion APIs on both the client and server code can fix this issue. Meaning:

Instead of

  //...
  return (
    <div>
      {someDate.toLocaleString(locale)}
      {new Intl.DateTimeFormat(locale).format(someDate)}
    </div>
  );

Use

  //...
  return (
    <div>
      {someDate.toLocaleString(locale, { timeZone: universalTimezone })}
      {new Intl.DateTimeFormat(locale, { timeZone: universalTimezone }).format(someDate)}
    </div>
  );

That being said, depending on the situation and product requirements, an alternative approach is to move the conversion to the backend, so that the client simply receives dates in their final form, with timezone conversion and localization already applied.
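The effect of pinning the timezone can be checked outside React with a small sketch. The helper name formatTime and the zone "Europe/Berlin" are assumptions chosen for illustration; only Intl.DateTimeFormat and its timeZone option come from the text above.

```javascript
// Hypothetical helper: formats the time part of a date in an explicitly given timezone.
// Because timeZone is passed explicitly, the result does not depend on the host's timezone.
function formatTime(date, locale, timeZone) {
  return new Intl.DateTimeFormat(locale, {
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23",
    timeZone,
  }).format(date);
}

const someDate = new Date("2023-01-01T20:00:00.000Z");

// The same instant rendered for two different zones gives two different strings,
// which is what happens implicitly when server and browser use their own host timezones.
const utcTime = formatTime(someDate, "en-GB", "UTC");
const berlinTime = formatTime(someDate, "en-GB", "Europe/Berlin");
console.log(utcTime, berlinTime);
```

In January, Berlin is one hour ahead of UTC, so the two strings differ. Passing the same timeZone value on the server and on the client removes that difference.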

3. Localization of numbers

(and a Safari bug for "de-AT" locale!)

As with dates and timezones, when converting raw numbers to localized human-readable strings (e.g. 12345 to "12,345"), the host's locale is used if no locale is specified, which can lead to different results on the server and the client. So it's important to always pass a universal locale to these APIs, one that is consistent during server and client rendering:

Instead of

  //...
  return (
    <div>
      {someNumber.toLocaleString()}
      {new Intl.NumberFormat().format(someNumber)}
    </div>
  );

Use

  //...
  return (
    <div>
      {someNumber.toLocaleString(universalLocale)}
      {new Intl.NumberFormat(universalLocale).format(someNumber)}
    </div>
  );

But in very specific cases, we observed that the localization APIs behave differently between SSR and CSR, which again leads to different values and thus hydration mismatches!

We encountered this issue in particular with the Safari browser: for the de-AT locale, the localization APIs (like Intl.NumberFormat or toLocaleString) generate values like "2.345", while other browsers including Chrome and Firefox, as well as Node.js, generate values like "2 345" for the same locale!

So an alternative approach in these cases would be to receive the final localized values from the backend and show that to the user without needing any more modifications, thus eliminating the mismatches.
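To illustrate why an implicit locale is risky, here is a minimal sketch that formats the same number with two explicitly chosen locales. The helper name formatNumber is an assumption. The de-AT case is deliberately left out, because its output differs between engines as described above.

```javascript
// Hypothetical helper: always formats with an explicitly chosen locale,
// never with the host default.
function formatNumber(value, universalLocale) {
  return new Intl.NumberFormat(universalLocale).format(value);
}

// The same number produces different strings depending on the locale,
// so a server and a browser with different default locales would disagree.
const enUS = formatNumber(12345, "en-US");
const deDE = formatNumber(12345, "de-DE");
console.log(enUS, deDE);
```

If the locale argument were omitted, each environment would silently pick its own default, and the rendered text could differ between SSR and CSR.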

4. Invalid HTML nesting

This might be a new cause of hydration mismatches in React 18. It happens as a result of invalid HTML, such as nesting a <div> inside a <p> or a <button> inside a <button>. We couldn't find clear documentation from React explaining why HTML validity issues lead to hydration mismatch errors (aside from community discussions). Regardless, adding markup validation steps (for example, an ESLint plugin) can help to avoid them.

Either way, in such cases the obvious goal is to use semantically correct HTML elements while nesting. For example:

Instead of

  //...
  return (
    <div>
      <p><div>Some text</div></p>
      <button><button>Button text</button></button>
    </div>
  );

Use

  //...
  return (
    <div>
      <p><span>Some text</span></p>
      <button><span>Button text</span></button>
    </div>
  );
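As a rough sketch of what a markup validation step checks, the snippet below encodes a small, deliberately incomplete subset of the HTML content rules. The helper name isInvalidNesting and the rule table are assumptions for illustration. A real linter or React's own validateDOMNesting covers far more cases.

```javascript
// Partial, illustrative rule set (not the full HTML content model):
// <p> may only contain phrasing content, and <button> must not contain interactive content.
const FORBIDDEN_CHILDREN = new Map([
  ["p", new Set(["div", "p", "ul", "ol", "table", "section", "h1", "h2", "h3"])],
  ["button", new Set(["button", "select", "textarea"])],
]);

// Hypothetical helper: returns true if childTag must not appear inside parentTag
// according to the small rule table above.
function isInvalidNesting(parentTag, childTag) {
  const forbidden = FORBIDDEN_CHILDREN.get(parentTag.toLowerCase());
  return forbidden !== undefined && forbidden.has(childTag.toLowerCase());
}

console.log(isInvalidNesting("p", "div"));         // true
console.log(isInvalidNesting("button", "button")); // true
console.log(isInvalidNesting("p", "span"));        // false
```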

Some debugging tips & tricks

Soon after we started receiving the new hydration mismatch logs in our error tracking system (Sentry), it became clear that the most important first step in debugging them is finding out whether we can reproduce them.
Due to the nature of React hydration errors in the production bundle, there is not much detail to be had from the error messages in Sentry. Including the componentStack from hydrateRoot's onRecoverableError callback in the logs comes in quite handy (especially after cleaning up the stack a bit to make it more readable). However, because the production bundle of your application is minified and uglified, you still have to use the provided line/column numbers together with sourcemaps to find the closest components.

On top of that, if a website has dynamic content served to each user like Zalando Fashion Store, it may be even harder to reproduce the exact page (with the same content) that was receiving a specific error.

Another issue we encountered was that React usually calls the onRecoverableError callback multiple times for a single hydration mismatch, which both pollutes our Sentry logs and makes debugging harder.
This seems to be due to the way the hydration phase works: React compares a list of server-rendered DOM nodes with a list of client-rendered React elements ("fibers") and tries to match them in order to hydrate the nodes. When matching fails for a specific node instance, the error is logged and React tries to hydrate the next one. What we observed was that (at least in some cases) the mismatching node/fiber breaks the order of the lists, which causes all the following ones to fail as well. The result is a lot of additional hydration mismatch error logs that aren't necessarily correct.
To mitigate this in the production environment, we modified our error tracking code to only send the first hydration error log to Sentry. We also found this very helpful to keep in mind when debugging during development.
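The "first error only" idea can be sketched as follows. This is not our production code: createFirstErrorOnlyHandler and the stand-in reporter are hypothetical names, and the sketch suppresses every later recoverable error, not only hydration mismatches.

```javascript
// Hypothetical wrapper: forwards only the first recoverable error to the reporter
// and counts the rest, so follow-up mismatches do not flood the error tracker.
function createFirstErrorOnlyHandler(report) {
  let alreadyReported = false;
  let suppressedCount = 0;
  return {
    onRecoverableError(error, errorInfo) {
      if (alreadyReported) {
        suppressedCount += 1;
        return;
      }
      alreadyReported = true;
      report(error, errorInfo && errorInfo.componentStack);
    },
    getSuppressedCount: () => suppressedCount,
  };
}

// Usage with a stand-in reporter (a real setup would call the error tracking SDK here).
const reported = [];
const handler = createFirstErrorOnlyHandler((error, componentStack) => {
  reported.push({ message: error.message, componentStack });
});
handler.onRecoverableError(new Error("first mismatch"), { componentStack: "at Header" });
handler.onRecoverableError(new Error("follow-up mismatch"), { componentStack: "at Footer" });
console.log(reported.length, handler.getSuppressedCount()); // 1 1
```

In an application, handler.onRecoverableError would be passed as the onRecoverableError option of hydrateRoot. The function does not rely on `this`, so it can be passed on its own.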

But if the error can be reproduced locally, we found these steps helpful:

- Work on the first error log, and after it's fixed, check whether any others remain.
- Based on the log and the componentStack, find the closest component(s) causing the issue.
- In some cases the cause of the issue is obvious from the specified component's source code, for example issue number 4 mentioned above (Invalid HTML nesting).
  - With HTML nesting issues, the log usually contains the text validateDOMNesting(...).
- In other cases where the cause is not obvious, what we found helpful was to open the React dev bundle (react-dom/umd/react-dom.development.js) and set breakpoints in the places that log the hydration errors (usually the checkForUnmatchedText or throwOnHydrationMismatch functions).
  - Then load the page and try to find the exact React fiber that causes the issue, and based on that find the component/element. Don't be afraid to go higher in the stack and set more breakpoints!
  - In some cases we found that the fiber is the very element that caused the issue. In others it is more confusing, because the fiber was rendered after a mismatching (usually missing) node instance that was the actual cause of the issue.
  - Here it also helps to inspect variables like fiber, nextInstance, current, etc., including the props they received.

Conclusion

The migration to React 18 and its concurrent features was of extra importance for our Rendering Engine framework due to its unique architecture. And despite the challenges, the results have been promising so far, especially since we observed improvements in the Fashion Store website's Core Web Vitals and bounce rate.

Additionally, the upgrade shined a light on the hidden hydration mismatch issues scattered in different components, which led us to not only fix many of them, but also collect and internally document them along with recommendations and debugging tips for further reference.

Next Steps

We are planning to share more detailed posts in the future about the architecture and technical specs of Rendering Engine - especially in light of the Concurrent features.
Additionally, we aim to share the effects of the new features and the final architecture on Zalando Fashion Store's performance.

Next up, we're excited to start using React Server Components which have shown great promise so far. Stay tuned!

Riptide HTTP Client tutorial

2023-06-29T00:00:00+02:00

Overview

Riptide is a Zalando open source Java HTTP client that implements declarative client-side response routing. It makes it easy to dispatch HTTP responses to different handler methods based on various characteristics of the response, including status code, status family, and content type. This works much like server-side request routing, where any request that reaches a web application is usually routed to the correct handler based on the combination of URI (including query and path parameters), method, and the Accept and Content-Type headers. With Riptide, you can define handler methods on the client side based on the response characteristics. See the concept document for more details. Riptide is part of the core Java/Kotlin stack and is used in production by hundreds of applications at Zalando.

In this tutorial, we'll explore the fundamentals of Riptide HTTP client. We'll learn how to initialize it and examine various use cases: sending simple GET and POST requests, and processing different responses.

Maven Dependencies

First, we need to add the library as a dependency into the pom.xml file:

<dependency>
    <groupId>org.zalando</groupId>
    <artifactId>riptide-core</artifactId>
    <version>${riptide.version}</version>
</dependency>

Check the Maven Central page to see the latest version of the library.

Client Initialization

To send HTTP requests, we need to build an Http object, then we can use it for all our HTTP requests for the specified base URL:

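// Keep a reference to the client; it is reused for every request to this base URL.
Http http = Http.builder()
        .executor(executor) // the Executor used to send requests
        .requestFactory(new SimpleClientHttpRequestFactory())
        .baseUrl(getBaseUrl(server))
        .build();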

Sending Requests

Sending requests using Riptide is pretty straightforward: you need to use an appropriate method from the created Http object depending on the HTTP request method. Additionally, you can provide a request body, query params, content type, and request headers.

GET Request

Here is an example of sending a simple GET request:

http.get("/products")
        .header("X-Foo", "bar")
        .call(pass())
        .join();

POST Request

POST requests can also be sent easily:

http.post("/products")
        .header("X-Foo", "bar")
        .contentType(MediaType.APPLICATION_JSON)
        .body("str_1")
        .call(pass())
        .join();

In the next sections, we will explain the meanings of the call, pass, and join methods from the code snippets above.

Response Routing

One of the main features of the Riptide HTTP client is declarative response routing. We can use the dispatch method to specify processing logic (routes) for different response types. The dispatch method accepts a Navigator object as its first parameter, which specifies the response attribute used for the routing logic.

Riptide has several default Navigators:

Navigator	Response characteristic
`Navigators.series()`	Class of status code
`Navigators.status()`	Status
`Navigators.statusCode()`	Status code
`Navigators.reasonPhrase()`	Reason Phrase
`Navigators.contentType()`	Content-Type header

Simple Routing

Let's see how we can use response routing:

http.get("/products/{id}", 100)
        .dispatch(status(),
                on(OK).call(Product.class, product -> log.info("Product: " + product)),
                on(NOT_FOUND).call(response -> log.warn("Product not found")),
                anyStatus().call(pass()))
        .join();

In this example, we demonstrate retrieving a product by its ID and handling the responses. We use the Navigators.status() static method to route our responses based on their statuses. We then describe processing logic for different statuses:

- OK - we use a version of the call method that deserializes the response body into the specified type (Product in our case). This deserialized object is then used as a parameter for a consumer, which is passed as a second argument to the call method. In our example, the consumer simply logs the Product object.
- NOT_FOUND - we assume that we won't receive a Product response, so we use another version of the call method with a single argument: a consumer accepting org.springframework.http.client.ClientHttpResponse. In this scenario, we decide to log a warning message.
- All other statuses - we intend to process them in the same way. To achieve this we use the Bindings.anyStatus() static function, which allows us to describe the processing logic for all remaining statuses. In our case, we have decided that no action is required for such statuses, so we use the PassRoute.pass() static method, which returns a do-nothing handler.

In Riptide all requests are sent using an Executor (configured via the executor method in the Client Initialization section). Because of this, responses are always processed in separate threads and the dispatch method returns a CompletableFuture<ClientHttpResponse>. To make the invoking thread wait for the response to be processed, we use the join() method in our example.

Nested Routing

We can have nested (multi-level) routing for our responses. For example, the first level of routing can be based on the response series, and the second level - on specific status codes:

http.get("/products/{id}", 100)
        .dispatch(series(),
                on(SUCCESSFUL).call(Product.class, product -> log.info("Product: " + product)),
                on(CLIENT_ERROR).dispatch(
                        status(),
                        on(NOT_FOUND).call(response -> log.warn("Product not found")),
                        on(TOO_MANY_REQUESTS).call(response -> {throw new RuntimeException("Too many reservation requests");}),
                        anyStatus().call(pass())),
                on(SERVER_ERROR).call(response -> {throw new RuntimeException("Server error");}),
                anySeries().call(pass()))
        .join();

In the example above, we implement nested routing. First, we dispatch our responses based on the series using the static method Navigators.series(), and then we dispatch CLIENT_ERROR responses based on their specific statuses. For other series such as SUCCESSFUL, we utilize a single handler per series without any nested routing.

Similar to the previous example, we use the PassRoute.pass() static method to skip actions for certain cases. Additionally, we use the Bindings.anyStatus() and Bindings.anySeries() methods to define default behavior for all series or statuses that are not explicitly described. Furthermore, in this example we've chosen to throw exceptions for specific cases (see the routes for the TOO_MANY_REQUESTS status and the SERVER_ERROR series). These exceptions can then be caught and processed in the invoking code.

Returning Response Objects

In some cases we need to return a response object from a REST endpoint invocation. We can use the riptide-capture module to do so.

Let's take a look at a simple example:

ClientHttpResponse clientHttpResponse = http.get("/products/{id}", 100)
        .dispatch(status(),
                on(OK).call(Product.class, product -> log.info("Product: {}", product)),
                anyStatus().call(response -> {throw new RuntimeException("Invalid status");}))
        .join();

As mentioned earlier, when we invoke the dispatch method, it returns a CompletableFuture<ClientHttpResponse>. If we then invoke the join() method and wait for the result of invocation - we'll get an object of type ClientHttpResponse. However, with the assistance of the riptide-capture module, we can return a deserialized object from the response body instead. In our example, the deserialized object has a type Product.

First, we need to add a dependency for the riptide-capture module:

<dependency>
    <groupId>org.zalando</groupId>
    <artifactId>riptide-capture</artifactId>
    <version>${riptide.version}</version>
</dependency>

Now let's rewrite the previous example using the Capture class. This class allows us to extract a value of a specified type from the response body:

Capture<Product> capture = Capture.empty();
Product product = http.get("/products/{id}", 100)
        .dispatch(status(),
                on(OK).call(Product.class, capture),
                anyStatus().call(response -> {throw new RuntimeException("Invalid status");}))
        .thenApply(capture)
        .join();

In this example, we pass the capture object to the route for the OK status. The purpose of the capture object is to deserialize the response body into a Product object and store it for future use. Then we invoke the thenApply(capture) method to retrieve the stored Product object. The thenApply(capture) method returns a CompletableFuture<Product>, so we can again use the join() method to get a Product object, as we did in the previous examples. See also the riptide-capture module page for more details.

Conclusion

In this article, we've demonstrated the fundamental use cases of the Riptide HTTP client. You can find the code snippets with complete imports on GitHub.

In future articles, we'll explore the usage of Riptide plugins, which provide additional logic for your REST client, such as retries, authorization, and metrics publishing. Additionally, we'll look at the Riptide Spring Boot starter, which simplifies the initialization of an Http object.

Context Based Experience in Zalando

2023-06-26T00:00:00+02:00

In 2022 we developed a unique partner experience that speaks to dedicated requirements from selective distribution brands and retailers around visual representation, brand storytelling and protecting brand equity. Our solution provides dedicated brand exposure across the experience and at the same time respects special requirements to secure brand equity. In order to achieve consistency with other articles, a general context-aware mechanism needed to be implemented.

We derived a plan to create distinction and elevation in the experience. The criteria for enabling an experience are based on explicit customer intent. For instance, searching for the retailer name or one of its brands will enable the elevated experience. Viewing their product details page will also enable it. These intentions are identified by our backend systems with specific business domain rules, i.e. the Search backend will have different rules from the Product backend.

Until then, the Fashion Store was based solely on domain-specific data. These new rules, defined by customer intent and context, introduced new challenges at Zalando and required a new solution. For instance, the same product can behave differently depending on that context. When a customer views the catalog without any intent for a brand-distinctive experience, all products, including ones belonging to other distribution brands, have a gray background for the sake of consistency, even though the brand's elevated experience may dictate, for example, a white background.

In order to achieve this we needed to identify what we should apply for each use case, meaning what are the brand's requirements, and when they should be applied - which rules should be checked in order to understand the customer's context or intent.

Brand requirements can be a complicated matter. We identified some that are global at the merchant level; for instance, one of the distribution brands is required to have different packshot images with white backgrounds, whilst we typically use gray backgrounds at Zalando. Other requirements are brand-specific. Some brands are only to be shown in the product catalog when the brand or its products are explicitly requested through specific search queries or catalog filters.

In order to support different kinds of requirements, we use the concept of experiences. Experiences are simply a collection of policies that we need to apply, and a list of selection rules.

For example, a policy may be the theme configuration that needs to be applied, or whether we are allowed to show the product under certain conditions. The selection rules define the criteria that enable the experience, e.g. selection by brand codes. This means that selecting a specific brand in the brand filter will change the experience to the one that has been configured for that brand.

{
  "id": "XP_ID",
  "name": "XP_NAME",
  "policies": [
    {
      "name": "THEME",
      "value": {
        "name": "THEME_NAME",
        "theme_config1": []
            ...
      }
    },
    {
      "name": "PRODUCT__FLAGS__HIDE_SALE",
      "value": true
    }
  ],
  "selection_metadata": [
    {
      "name": "experience_brands",
      "type": "brand_code",
      "value": ["BRANDNAME"]
    }
  ]
}

Selection rules can be another complicated matter. For instance, how do we decide which experience to choose when two brands belong to different experiences? The key is to think about the right use cases to support the business needs whilst keeping things simple. Our approach to some of these cases is to define Fallback experiences that catch them.
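A selection rule with a fallback could be sketched like this. The field names follow the JSON example above, but the function resolveExperience, the matching logic, and the sample IDs are assumptions made for illustration, not the actual backend rules.

```javascript
// Sample data in the shape of the JSON example (IDs and brand codes are made up).
const experiences = [
  { id: "XP_A", selection_metadata: [{ name: "experience_brands", type: "brand_code", value: ["BRAND_A"] }] },
  { id: "XP_B", selection_metadata: [{ name: "experience_brands", type: "brand_code", value: ["BRAND_B"] }] },
];
const fallbackExperience = { id: "XP_FALLBACK", selection_metadata: [] };

// Hypothetical resolver: an experience matches when one of its brand_code rules
// covers every selected brand. Anything ambiguous or unmatched gets the fallback.
function resolveExperience(allExperiences, selectedBrandCodes, fallback) {
  if (selectedBrandCodes.length === 0) return fallback;
  const matches = allExperiences.filter((experience) =>
    experience.selection_metadata.some(
      (rule) =>
        rule.type === "brand_code" &&
        selectedBrandCodes.every((code) => rule.value.includes(code))
    )
  );
  return matches.length === 1 ? matches[0] : fallback;
}

console.log(resolveExperience(experiences, ["BRAND_A"], fallbackExperience).id);
console.log(resolveExperience(experiences, ["BRAND_A", "BRAND_B"], fallbackExperience).id);
```

Selecting a single brand resolves to that brand's experience, while selecting brands from two different experiences falls through to the fallback.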

As mentioned in other posts on the Zalando Engineering Blog, Zalando has many microservices, and even our Frontend’s architecture is based on micro frontends. We defined the general data structure to understand the experience, but how can we orchestrate it across Zalando's ecosystem?

In order to get into that, we need to break down the flow into two steps. The first one is the Experience Resolution step. This starts very early during the root entity resolution.

Let's say that a customer browses a catalog page. This will send a request to Rendering Engine, which will resolve the root entity by sending a request to the Fashion Store API (GraphQL), which will then query the Catalog backend system. The catalog has its own business logic to understand the customer’s intent and it will find the best matching experience, using its selection_metadata.

The resolved experience name is then stored in the Rendering Engine request state.

Fig 1. Root Entity Experience Resolution

At this point we have only resolved the root entity. We don’t yet know which renderers (micro-frontends) are required. During this process, we start the second step, where each of them queries the Fashion Store API independently, only this time the query uses the previously resolved experience. In the catalog, we have product cards, whose data is populated by a different backend, the Product backend. As we have already resolved the experience, the Product backend can now understand which policies are required. For Zalando’s experience it will select the gray background images with the watermark, instead of the white ones.

Fig 2. Child Renderers are reusing previous resolved experience

Using this new mechanism, we successfully managed to introduce new concepts to Zalando. It has opened a door for so many new possibilities that we can leverage to further enhance the customer experience.

How Software Bill of Materials change the dependency game

2023-04-13T00:00:00+02:00

Dependency hygiene

Dependency updates are a tedious task when maintaining thousands of microservices. Some teams use tools like dependabot or scala-steward, which create pull requests in repositories when new library versions are available. Other teams update dependencies regularly in bulk, supported by build system plugins (e.g. maven-versions-plugin, gradle-versions-plugin). Playing the catch-up game and getting only some visibility through incoming pull requests or changes is far from great, though, and we can do better here.

On the importance of dependency data and hygiene

What's needed for dependency management is the ability to get a complete picture of the dependencies in use and to analyze trends over time. This granular data allows teams to step up their game.

Critical vulnerabilities in commonly used libraries (e.g. log4j, spring, commons-text) require an ability to find all affected applications in minutes. Only this way can the impact of a vulnerability be assessed and mitigated quickly. Some projects, like openssl, preannounce security updates allowing for more preparation time.

Similarly, upgrades to major versions of libraries, changes in licensing of open-source libraries (for example Akka) create the need to understand the library footprint to assess the need for action or migration costs. Bugs in libraries tend to eventually trigger production incidents and it's necessary to have a way to find all affected teams, track progress of patches across all applications, and identify reasons why teams struggle to keep up.

At Zalando, we use Software Bill of Materials (aka. SBOMs) to help answer various questions about application dependencies. We publish a curated data set containing dependency data from the SBOM for every application we deploy, based on its Container image. The data set is available in our data lake and thus can be easily queried and visualized by any engineer.

What are SBOMs?

The Software Bill of Materials contains information about the packages and libraries used by an application. It can be generated for an application based on its source code or extracted from a Docker container. The SBOM includes packages used by the operating system as well as the application and its dependencies. For each entry, the name, version, and license are tracked. Common formats like CycloneDX or SPDX help with portability and integration into various tooling. For example, syft can generate an SBOM file that can be further parsed with grype to periodically scan the application's SBOMs for vulnerabilities. On top of that, GitHub recently introduced an on-demand SBOM generation feature.
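To make the name/version/license entries concrete, here is a minimal sketch that reads a hand-written document in the CycloneDX JSON layout. The component values are made up and the helper name listComponents is an assumption. Real SBOMs carry many more fields.

```javascript
// Minimal hand-written document in the CycloneDX JSON layout (values are made up).
const sbom = {
  bomFormat: "CycloneDX",
  specVersion: "1.4",
  components: [
    {
      type: "library",
      name: "example-lib",
      version: "1.2.3",
      licenses: [{ license: { id: "Apache-2.0" } }],
    },
    { type: "library", name: "other-lib", version: "0.9.0" },
  ],
};

// Hypothetical helper: flattens components into name/version/license rows.
// A license entry may carry an SPDX id, a free-text name, or an expression.
function listComponents(bom) {
  return (bom.components || []).map((component) => ({
    name: component.name,
    version: component.version,
    licenses: (component.licenses || []).map(
      (entry) =>
        entry.expression ||
        (entry.license && (entry.license.id || entry.license.name)) ||
        "unknown"
    ),
  }));
}

const componentRows = listComponents(sbom);
console.log(componentRows);
```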

The SBOM needs to be generated with every software change, for example as part of the CI/CD pipeline. Some countries recommend or even mandate the use of SBOMs in certain scenarios in order to manage cyber security and software supply chain risks (see Securing the Software Supply Chain: Recommended Practices Guide for Developers).

What questions can the SBOM help to answer?

In the context of dependency management, SBOMs collected for all applications help us answer a variety of questions:

- Which applications use dependency X (in version Y)?
- How many distinct versions of dependency X do we use across all applications?
- Does the dependency hygiene differ per language?
- How quickly after release are new versions of libraries adopted? Does adoption differ for versions that have known security vulnerabilities?
- When adopting a new Docker base image, what are its contents?
- Which applications have dependencies licensed under license X?
- Which distinct licenses are being used by application dependencies?
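The first two questions can be sketched as simple queries over flattened dependency rows. The row shape, the application names, and the helper names applicationsUsing and distinctVersions are assumptions for illustration. Our actual data set lives in the data lake and is queried there.

```javascript
// Hypothetical flattened rows: one row per (application, dependency) pair. Values are made up.
const rows = [
  { application: "app-a", dependency: "log4j-core", version: "2.14.1" },
  { application: "app-b", dependency: "log4j-core", version: "2.17.1" },
  { application: "app-b", dependency: "jackson-databind", version: "2.15.0" },
  { application: "app-c", dependency: "log4j-core", version: "2.14.1" },
];

// Which applications use dependency X (optionally in version Y)?
function applicationsUsing(allRows, dependency, version) {
  const apps = allRows
    .filter((row) => row.dependency === dependency)
    .filter((row) => version === undefined || row.version === version)
    .map((row) => row.application);
  return [...new Set(apps)].sort();
}

// How many distinct versions of dependency X are in use?
function distinctVersions(allRows, dependency) {
  const versions = allRows
    .filter((row) => row.dependency === dependency)
    .map((row) => row.version);
  return [...new Set(versions)].sort();
}

console.log(applicationsUsing(rows, "log4j-core", "2.14.1")); // [ 'app-a', 'app-c' ]
console.log(distinctVersions(rows, "log4j-core").length);     // 2
```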

From Docker image metadata, we can infer the owning team and thus target communication when reaching out to teams. For large-scale patch actions (like the famous log4j upgrade), we prepare change sets for different types of build files and automate the Pull Request creation across all repositories. This allows for central tracking of the patch progress and requires minimal support from the team for the deployment.

Another insight from analyzing the SBOM data was our usage of the AWS SDK. We noticed that some applications were using the full SDK (200MB+ in Java) instead of its individual modules. Addressing this finding helped reduce build times and lower resulting docker image size significantly.

Show me real data!

Our diverse application footprint across languages allows us to compare the number of libraries that typical applications have. Looking at the data, the number of dependencies grows exponentially. Here is an example for Python:

Fig 1. Number of dependencies in Python applications

Looking across languages, two outliers have the most dependencies: for Python it's jupyter (2.5x the next biggest app) and for Java it's tableau (3.14x the next biggest app).

To compare how hungry each language ecosystem is for dependencies, we can plot the percentiles for the number of dependencies per application. Python wins the race with the lowest number of dependencies, followed by golang (ca. 1.4-2x compared to Python). Next in line is Java (this covers Java, Kotlin, and Scala, as the SBOM scanner detects java-archives) with 2-3x more dependencies than golang, and lastly JavaScript (incl. TypeScript) with 5-10x more dependencies than Java.

Fig 2. Number of dependencies per language
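For readers who want to reproduce such a comparison on their own data, a percentile over per-application dependency counts can be sketched as follows. The nearest-rank method, the helper name percentile, and the sample counts are assumptions. The counts are made up and are not our measurements.

```javascript
// Hypothetical helper: nearest-rank percentile of a list of numbers.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up dependency counts for five applications of one language.
const dependencyCounts = [80, 12, 150, 45, 30];
const p50 = percentile(dependencyCounts, 50);
const p90 = percentile(dependencyCounts, 90);
console.log(p50, p90); // 45 150
```

Computing the same percentiles per language and plotting them side by side gives a chart of the kind shown in Fig 2.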

Another popular library used across Java and Kotlin projects

This example highlights the challenge with long-term maintenance of a large application footprint. As the frequency of changes to an application reduces, it's more difficult for teams to plan dependency updates for those applications, unless there are security issues to address. The following graph looks at the usage of an internal library with three data snapshots.

Fig 3. Usage of an internal library

We can see that versions 0.22.0+ exhibit expected behavior by being replaced with the next available version. On the other hand, usage of version 0.21.0 constantly increases, even though three newer versions are available in Q4. This situation requires further inspection. It is likely that new applications are created by using the same application template, which misses the dependency update.

SBOM Data quality

The SBOM data quality varies. For the JVM languages, we observed inconsistencies in the detected package names and group IDs. This increases the complexity of correlating library use across languages. Further, some SBOMs did not show any java-archive entries, because the team's build process flattened all dependencies into an uber-jar and the metadata needed for library detection was lost. Hence, we recommend caution when using SBOM tools and double-checking that the SBOM generation works correctly for all applications.

Summary and future outlook

In addition to smaller findings like the one with AWS SDK, the value of SBOMs has already been proven with the very low time it takes us to analyze the impact of the Akka license change or CVEs.

We look forward to diving deeper into our SBOM data as we collect more historical data. Aside from observing trends in library usage and adoption, we hope to be able to correlate dependency data with dependency hygiene practices, deployment frequency, change failure rates, and lead times for each application. For our shared libraries, we aim to understand how to help reduce the burden of dependency updates, acknowledging that plugin adoption is insufficient to maintain a healthy dependency posture.

If you're not using SBOMs for dependency analysis yet, you're missing out on a great tool helping you to create more transparency. We're curious to read your stories and insights on SBOMs.

Gender Equity in IT Panel by Zalando Women in Tech Employee Resource Group

2023-04-12T00:00:00+02:00

As part of their week-long International Women's Day event series, the Zalando Women's Network and the Zalando Women in Tech Employee Resource Groups recently held an event to discuss the challenges that women in tech face in the workplace and to share ideas about how to overcome them. We welcomed women in tech leadership to the panel, who shared their experiences and insights into the world of work: Joyce Chen, VP Engineering Beauty; Tian Su, VP Customers, and host Ana Peleteiro Ramallo, Director of Applied Science.

Joyce Chen shared her past experience of being the first woman engineer in an all-men engineering group. She acknowledged that unconscious bias education has made progress over the last 10 years, and that she now has the language to describe what she went through. However, she also noted that the ratio of women to men in engineering, particularly in leadership positions, is still not good enough. To overcome this, Joyce shared the importance of mentoring, sponsorship, and reskilling.

Joyce also acknowledged that she often feels like she needs to work harder to prove her worth in a field dominated by men. She highlighted that this is a common feeling among women, and it stems from historic biases that still exist today. "To overcome this feeling: network, seek mentorship, believe in yourself, and empower yourself to achieve greatness."

Tian Su highlighted, "Men have historically been in leadership positions and therefore shaped society's perception of what good leadership looks like. This is why leadership is often seen through masculine traits. By bringing diversity into leadership, we can get different leadership styles, which can be beneficial for everyone." Tian also discussed the challenges in a former company of being the only mother on her team, which meant that she was not always able to attend social and training events after work. However, when she shared this with her former team, they realised that they hadn't considered this at all! They took the time and care to understand her situation, and they improved.

Ana Peleteiro Ramallo explained, "The way we think we need to behave at work is shaped by the leadership styles we see around us. It's important to bring clarity and your own perspective to your manager in order to help them understand your point of view".

The panelists also discussed the importance of role models, allies, and mentoring in helping women to succeed in the workplace. Joyce stressed the need for sponsorship and support, and encouraged allies to speak up and amplify women's voices. Tian noted that her husband is her biggest ally, and that intentional outreach from colleagues who are men can also make a difference. Ana emphasized the importance of finding allies who understand you and are willing to listen.

The event then opened to a Q&A session, and the panel was asked how to build resilience and overcome unconscious bias. Ana stressed the importance of communicating your perspectives and raising your voice when necessary, while Tian suggested taking conversations to a 1:1 setting to create a safe and open environment. Joyce emphasized the need for transparency and training, starting from the interview stage.

Overall, the event was a great opportunity to share ideas and support women in the workplace. By continuing to have these conversations and advocating for change, we can work towards a more equitable and inclusive future for all. Thanks to the Zalando Women's Network and the Women in Tech Employee Resource Groups for organizing this session, and the panellists for sharing their experiences and thoughts with us!

Applied Methods from Mathematical Optimization and Machine Learning in E-commerce

2023-02-21T00:00:00+01:00

Last year, Zalando hosted the 106th meeting of the Gesellschaft für Operations Research e.V. (German Operations Research Society) working group on Practice of Mathematical Optimization. The workshop took place on October 6-7, 2022 at the Zalando Headquarters in Berlin.

Applied Methods from Mathematical Optimization and Machine Learning

Techniques from mathematical optimization on the one hand and machine learning on the other have been crucial components in delivering solutions to customers in the e-commerce industry. Serving over 50 million customers and delivering a quarter billion orders last year, Zalando is one of the largest online retail stores in Europe. Operating at such a large scale gives rise to a plethora of technical problems within these two fields that our applied scientists tackle across various teams. Thus, Zalando was uniquely positioned to host this workshop at the confluence of these two scientific fields, titled "Applied Methods from Mathematical Optimization and Machine Learning in E-commerce". The workshop included a number of talks by representatives from industry and academia from all over Germany. The presentations included applications ranging from forecasting to network design, pricing, logistics, scheduling, and vehicle routing, among others. See the full program of the workshop for more details.

The event took place in hybrid mode with streaming available for virtual attendees and presenters. The majority of participants, around sixty, attended the event in person. They took advantage of the various networking opportunities during coffee breaks, the conference dinner and a tour of the historic East Side Gallery, the largest remaining section of the Berlin Wall, right across from the workshop venue at Zalando headquarters in Berlin.

Applied Scientists from Zalando presented two different use-cases at the confluence of optimization and ML in the workshop. The pricing team gave a talk about challenges in large scale article discounting, while the logistics team made a presentation about stock distribution and its challenges.

Pricing

The pricing team is responsible for the science behind offering attractive prices to customers. Their talk about Challenges in Large Scale Article Discounting gave a glimpse into the multitude of challenges that are connected to discounting for the entirety of Zalando's assortment.

Even with a proven machinery that manages to recommend millions of discounts under given business targets, many pitfalls have to be circumvented. We discussed the following complications and mentioned potential treatments.

Forecasting Challenges

The demand for niche articles, typically with just a few sales per month, is hard to predict accurately. Moreover, articles with many sizes, e.g. jeans with many length and width combinations, can behave like multiple separate articles: different customers consider purely their own size, which creates a demand only on certain sizes. On top of that, some costs like shipping and returns are a mixed calculation based on the collection of articles handled together.

Optimization Challenges

An optimization model has to respect the business setup in its decisions. Several constraints were created so that the model follows business decisions, e.g. the model has to sell to customers in a sales period even if it would be more profitable to keep items for sales in the future. Without these constraints, the model could propose taking an article offline for a certain period, or prefer to sell more in countries where shipment costs are lower. On the technical side, some optimization problems can become infeasible because of incompatible business targets, and then require recommendations on how to adjust those targets.
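To make the effect of such constraints more tangible, here is a minimal sketch in Python. It is not the model used by the pricing team: it is a brute-force toy for a single article, and the prices, costs, demand figures and the function name best_discount are all made up for illustration. It shows how a "must sell in the sales period" constraint changes the recommended discount, and how two incompatible business targets leave no feasible discount at all.

```python
from typing import Optional

# All numbers below are made up for illustration.
FULL_PRICE = 100  # price per unit without discount
UNIT_COST = 60    # cost per unit
STOCK = 40        # units available in the sales period
# Assumed demand (units sold in the period) per discount in percent.
DEMAND_BY_DISCOUNT = {0: 10, 10: 14, 20: 20, 30: 28, 50: 45}


def best_discount(min_units: int = 0,
                  min_margin: Optional[int] = None) -> Optional[int]:
    """Return the most profitable discount (in percent) that meets the
    business targets, or None if the targets are incompatible."""
    best, best_profit = None, None
    for discount, demand in DEMAND_BY_DISCOUNT.items():
        units = min(demand, STOCK)  # we cannot sell more than we have
        margin = FULL_PRICE * (100 - discount) // 100 - UNIT_COST
        if units < min_units:
            continue  # violates the "sell in this period" target
        if min_margin is not None and margin < min_margin:
            continue  # violates the margin target
        profit = units * margin
        if best_profit is None or profit > best_profit:
            best, best_profit = discount, profit
    return best


print(best_discount())                             # 10
print(best_discount(min_units=25))                 # 30
print(best_discount(min_units=25, min_margin=15))  # None
```

In the last call, no discount level sells at least 25 units while keeping a margin of 15 per unit, so the toy problem is infeasible. In that situation a real system would need to recommend which target to relax.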

Processes and Measuring

Further considerations stem from the connected processes around pricing. Matching competitors' prices, incorporating sales events and warehouse capacities are crucial in order to recommend profitable discounts. Ultimately, the impact has to be measured via A/B testing. When it comes to pricing, we have to set up such tests carefully to rule out customer discrimination through different prices and to enable gathering valuable insights.

Logistics

The logistics team delivered a talk titled Mathematical Optimization Meets Machine Learning to Optimize Stock Distribution. Zalando operates a network of interconnected warehouses and return centers serving its customer base across Europe. In order to best serve our customers we need to make our stock available to them where and when they desire it. This requires listening to our customers' demands and distributing stock across our network and within each facility accordingly. In this talk, we outlined the challenges at the core of this stock distribution problem and dived deep into some technical aspects.

Demand Forecasting

We model demand prediction as a time series forecasting problem at the individual article level for each of the markets we are active in for any given day. We produce probabilistic forecasts for each such problem using a deep recurrent neural network. Challenges abound in demand forecasting for the fashion industry where articles have fast turnover due to seasonality, the fast moving nature of fashion, and the diversity of trends in our vast customer base. This probabilistic demand forecast is used as input to solve two major optimization problems: (i) Item Network Distribution Problem: how best to distribute our stock across our facilities, and (ii) In-warehouse Item Relocation Problem: how best to position our articles within each facility.

Item Network Distribution

In the item network distribution problem, items are moved between warehouses: We need to ensure that for each country, the warehouses serving that country have the article assortment and stock quantities that best fulfill the country's expected demand. Our objectives are to maximize sales and minimize delivery times and costs. We discussed the algorithm currently used to make distribution decisions and presented some results.

In-warehouse Item Relocation

The in-warehouse item relocation problem is defined at the warehouse level. A warehouse contains various storage areas with different capacities and different speeds for collecting an item. Given a constant stream of incoming and outgoing items, we can relocate items between storage areas to achieve a distribution that is optimal for the demand served by that warehouse. We presented a formalization of the problem and prospective approaches to solve it.

How we manage our 1200 incident playbooks

2023-01-31T00:00:00+01:00

At Zalando, we use Incident Playbooks to support our on-call teams with emergency procedures that can be used to mitigate incidents. In this post, we describe how we structured incident playbooks, and how we manage these across 100+ on-call teams.

Incident Playbooks - where are we now?

We consolidated our incident playbooks as part of preparation for Cyber Week in 2019. Fast forward to 2023 and we have over 1200 playbooks that our teams have authored. Given the 850+ applications in scope for on-call coverage across 100+ on-call teams, that's 1.41 playbooks per application and ca. 12 playbooks per on-call team. The diagram below shows how our playbook collection has increased over the years. It's easy to see how Cyber Week preparations in Q3 of each year result in significant increases in the playbook collection.

Count of incident Playbooks over time

As expected, most applications have just a few playbooks. Below, you can see the number of applications per playbook count.

Number of applications per playbook count

What are incident playbooks?

Our Incident Playbooks cover emergency procedures to initiate in case a certain set of conditions is met, for example when one of our systems is overloaded and the existing resiliency measures (e.g. circuit breakers) are insufficient to mitigate the observed customer impact. In such cases there are often measures we can take, though they will degrade the customer experience. These emergency procedures are pre-approved by the respective Business Owner of the underlying functionality, allowing for quicker incident response without the need for explicit decision making while critical issues are ongoing.

Further, playbooks make incident response less stressful for colleagues on on-call rotations. Each on-call member takes the time to become familiar with the procedures and understands the toolbox they have available during incidents. New playbooks are reviewed by the on-call team, shared as part of on-call handover or operational reviews, and practiced in game days, or as part of preparation for big events.

The procedures document the conditions (e.g. increased error rates), business impact (e.g. conversion rate decrease), operational impact (e.g. reduction of DB load), mean time to recover, and the steps to execute. This structure allows all stakeholders involved in incident response to clearly understand the executed actions and target state of the system to expect. Lastly, by having playbooks in a single location, our Incident Responders and Incident Commanders have easy access to all available emergency procedures in a consistent format. This simplifies collaboration across teams during outages.

More often than not, our playbooks cover the whole system (a few microservices) instead of its individual components being covered through separate procedures. When the bigger system context is considered, there are more options available to mitigate issues.

When we started in 2019, we first focused on a collection of procedures that were already known, but not consistently documented. Next, as part of the Cyber Week preparations we wanted to explore and strengthen the mechanisms we have in place to mitigate overload or capacity issues across the different touchpoints of the customer (e.g. product listing pages) and partner journeys (e.g. processing of price updates).

Let's consider two examples:

1) Product Listing Pages (aka. catalog)

Our catalog pages integrate multiple data sources, such as teasers, sponsored products, and outfits. Fetching data from all sources comes at increased costs compared to a simple article grid. Therefore, we have a set of playbooks that disable the different data sources in order to reduce the load on the backends providing the APIs and the underlying Elasticsearch cluster. The playbooks are sorted in such a way that we apply the playbooks with the least business impact first. In one of our evening Cyber Week shifts, we encountered performance degradation resulting in increased latencies, which was hard to diagnose. While one part of the team was busy troubleshooting the issue, another part of the team executed several of the prepared playbooks in sequence in order to mitigate the customer impact.

Example playbook for catalog:

Title: Disable calls for outfits in the Catalog’s article grid
Trigger: High latency for fetching outfits for the article grid or High CPU usage for Elasticsearch's outfit queries
Mean time to recover: 3 minutes after updating configuration
Operational Health Impact: No more outfit calls from Catalog, reduced request rates to Elasticsearch by x%.
Business Impact: Outfits won't be shown as part of the catalog pages.
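As a rough illustration, the playbook structure and the "least business impact first" ordering could be modelled as follows. This is a sketch under assumptions, not our actual tooling: the Playbook class, the business_impact_rank field, the execution_order helper and the second playbook are hypothetical; only the first entry is taken from the example above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Playbook:
    # These five fields mirror the playbook template shown above.
    title: str
    trigger: str
    mean_time_to_recover: str
    operational_health_impact: str
    business_impact: str
    # Hypothetical field: a lower rank means less business impact.
    business_impact_rank: int


def execution_order(playbooks: Iterable[Playbook]) -> List[Playbook]:
    """Return the playbooks with the least business impact first."""
    return sorted(playbooks, key=lambda p: p.business_impact_rank)


# The first entry comes from the example above. The second entry and
# both ranks are made up for illustration.
catalog_playbooks = [
    Playbook(
        title="Disable calls for outfits in the Catalog's article grid",
        trigger="High latency for fetching outfits for the article grid",
        mean_time_to_recover="3 minutes after updating configuration",
        operational_health_impact="No more outfit calls from Catalog",
        business_impact="Outfits won't be shown on the catalog pages",
        business_impact_rank=2,
    ),
    Playbook(
        title="Disable teasers on catalog pages (hypothetical)",
        trigger="High latency for fetching teasers",
        mean_time_to_recover="unknown (illustrative entry)",
        operational_health_impact="No more teaser calls from Catalog",
        business_impact="Teasers won't be shown on the catalog pages",
        business_impact_rank=1,
    ),
]

for playbook in execution_order(catalog_playbooks):
    print(playbook.business_impact_rank, playbook.title)
```

Keeping the rank as explicit data means the on-call engineer does not have to judge the business impact of each playbook in the middle of an incident.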

2) Monitoring system

Our monitoring system ZMON had a component ingesting metrics data and storing these in KairosDB TSDB, backed by Cassandra. Pre-scaling of the Zalando platform for Cyber Week peak workload resulted in a multi-factor increase in metrics pushed by the individual application instances, resulting in ingestion delays due to Cassandra cluster overload. To mitigate similar incidents, we developed a tiering system with three criticality tiers for the metrics, so that in case of overload of the TSDB, we could still ingest the most important metrics necessary to plot essential dashboards required to monitor the Cyber Week event. This playbook is still in place today, even though we changed our metrics storage.

Example playbook for ZMON:

Title: Drop non-critical metrics due to TSDB overload
Trigger: Metrics Ingestion SLO is at risk of being breached (link to alert/dashboard)
Mean time to recover: 2 minutes after updating configuration
Operational Health Impact: Loss of tier-3 and tier-2 metrics. Only tier-1 metrics are processed, leading to 40% load reduction on the metrics TSDB.
Business Impact: None
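The tiering idea behind this playbook can be sketched in a few lines. This is an assumption-laden illustration rather than ZMON's implementation: the Metric type, the select_for_ingestion function and the metric names are hypothetical. The only detail taken from the text is that there are three criticality tiers and that only tier-1 metrics are processed while the playbook is active.

```python
from typing import Iterable, List, NamedTuple


class Metric(NamedTuple):
    name: str
    tier: int  # 1 = most critical, 3 = least critical
    value: float


def select_for_ingestion(metrics: Iterable[Metric],
                         max_tier: int) -> List[Metric]:
    """Keep only metrics whose tier is at most max_tier."""
    return [m for m in metrics if m.tier <= max_tier]


# Hypothetical metric names, for illustration only.
batch = [
    Metric("checkout.error_rate", 1, 0.02),
    Metric("catalog.latency_p99_ms", 2, 180.0),
    Metric("jvm.gc_pause_ms", 3, 12.5),
]

normal = select_for_ingestion(batch, max_tier=3)      # everything is ingested
overloaded = select_for_ingestion(batch, max_tier=1)  # playbook is active

print([m.name for m in overloaded])  # ['checkout.error_rate']
```

With a single max_tier setting, executing the playbook amounts to one configuration change, which fits the short recovery time stated above.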

How do we author playbooks?

We use a documentation site built with mkdocs to host the documentation containing a description of the incident process and all playbooks. We generate the playbook directory structure based on our OpsGenie on-call teams. This way there is always a skeleton available for every team to contribute their playbooks to. When we started in 2019 we had a team of 3 reviewers who, as part of the playbook reviews, were committed throughout the year to explaining the purpose of and guidance for the playbooks and to aligning these to a common standard. With sufficient examples and knowledge spread across the organization, we switched to using CODEOWNERS to delegate the reviews to representatives of the departments who are skilled in operational excellence.
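A skeleton generator of this kind could look roughly like the sketch below. It makes several assumptions: the list of team names is passed in directly (in practice it would come from the on-call tooling, which is not shown here), and the directory layout, file name and placeholder text are invented for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def create_skeleton(docs_root: Path, team_names: Iterable[str]) -> List[Path]:
    """Create docs_root/playbooks/<team>/index.md for every team that
    does not have one yet, and return the newly created files."""
    created = []
    for team in sorted(team_names):
        team_dir = docs_root / "playbooks" / team
        index = team_dir / "index.md"
        if index.exists():
            continue  # never overwrite playbooks a team already wrote
        team_dir.mkdir(parents=True, exist_ok=True)
        index.write_text(
            f"# Playbooks: {team}\n\nNo playbooks yet.\n",
            encoding="utf-8",
        )
        created.append(index)
    return created
```

Because existing files are skipped, the generator can be re-run whenever the set of on-call teams changes without touching playbooks that teams have already contributed.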

To remind new contributors about our playbook guidelines, we use a pull request template with a few check boxes as a means for self-verification of playbook completeness. The 1st line of the template contains a TODO with a nudge for a 1-line summary of the changes. This proved to be an easy way of providing reviewers with more context about the performed changes.

Integrating playbook data with application reviews

Aside from the information about triggers and impact for playbooks, we also collect additional metadata allowing us to integrate playbooks with our application review process:

Application – links playbooks to the involved applications
Expiry date – allows us to nudge teams to re-review playbooks that will expire soon

To keep integration simple, along with the documentation, we also generate a JSON file with playbook metadata. During the application review process it's indicated per application (from a certain criticality tier onward) whether there are any playbooks defined for it and whether any of these are expired.
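The check performed during application reviews can be sketched as follows. The JSON field names (title, applications, expiry_date), the application identifiers and the dates are assumptions made for this example; the actual metadata format is not described in this post.

```python
import json
from datetime import date

# Hypothetical metadata file content; field names and dates are assumed.
METADATA_JSON = """
[
  {"title": "Disable calls for outfits in the Catalog's article grid",
   "applications": ["catalog"], "expiry_date": "2023-06-30"},
  {"title": "Drop non-critical metrics due to TSDB overload",
   "applications": ["zmon"], "expiry_date": "2022-12-31"}
]
"""


def playbook_status(metadata_json: str, application: str, today: date) -> dict:
    """Report whether an application has playbooks and which are expired."""
    entries = json.loads(metadata_json)
    own = [e for e in entries if application in e["applications"]]
    expired = [
        e["title"] for e in own
        if date.fromisoformat(e["expiry_date"]) < today
    ]
    return {"has_playbooks": bool(own), "expired": expired}


today = date(2023, 1, 31)
for app in ("catalog", "zmon", "checkout"):
    print(app, playbook_status(METADATA_JSON, app, today))
```

In this made-up data set, "catalog" has a valid playbook, "zmon" has an expired one, and "checkout" has none, which are the three cases a review would flag differently.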

With time, we made it mandatory for applications of certain criticality to have an assigned playbook. This partially increased the scope of the playbooks beyond the key emergency procedures while at the same time providing training to our engineers in the authoring of playbooks and thinking about the overload and failure scenarios that can occur.

Summary

When we initially created the incident playbooks site, maintenance of playbooks as markdown files was considered to be a good means for ensuring consistency, but rather of a temporary nature. To be consistent with our UI-driven application review workflow, we intended to manage playbooks in the same way. Managing structured data in markdown is not ideal, despite the ability to use front matter for metadata. However, managing playbooks in a code repository provides us with an easy means for cross-team reviews using pull requests. This key advantage keeps us from moving to a UI-driven workflow where such collaboration would be limited.

We can certainly recommend that every team thinks about the failure scenarios their systems can experience, for example as part of production readiness reviews or game days. Without playbooks, several key incidents would have had a markedly larger impact on our customer experience.

Imagining how to react to such scenarios by putting the system into a degraded state, trading customer experience for availability, can spark interesting conversations about resilience mechanisms that can be built into the software. These conversations drive engineers to make changes to their design to fundamentally improve availability, or at least, to ensure their software facilitates easier intervention.

If used often enough, playbooks should ideally be automated.

How You Can Have Impact As An Engineering Manager

2023-01-26T00:00:00+01:00

If you are a good leader,
Who talks little,
They will say.
When your work is done,
And your aim fulfilled,
“We did it ourselves”

- Lao-Tse

Last year, I shared how Zalando enables and supports the continued growth of our Software Engineers. The piece was written from a leadership perspective. A natural sequel to that would describe how our leaders are empowered. Specifically, I would like to provide my own perspective on how Engineering Managers can create impact and shape organisational culture.

Team Structures

To provide some context, Engineering Managers use the distinction between the “Team You Lead” and the “Team You Are On”. For the former, an Engineering Manager is responsible for a single delivery team of Software Engineers or Applied Scientists. This is the team that they are leading. The latter refers to the Engineering Manager’s own team (their peer group that forms a department, and is led by a Head of Engineering).

The Team You Lead

I use the team you lead as the starting point to describe Engineering Management, because this, in my opinion, is the bread and butter of the role. Forming and leading a high-performing delivery team is no small feat. The team of individuals must collectively progress through the four stages of forming (purpose and raison d’être), storming (sharing feedback, ideation, and defining roles within the group), norming (establishing ways of working and responsibilities), and performing (peak delivery and problem-solving). Take a look at Patrick Lencioni’s Five Dysfunctions of a Team (or read the Manga Edition for a more illustrated journey) to peek into the complex problems that leaders need to resolve in order to keep their team healthy.

Engineering Managers are accountable for driving the delivery of projects from start to finish - encompassing the entire lifecycle of what the team builds, how they structure step-changes to systems, how they can monitor and measure the performance of said systems for operational excellence, and all the other ingredients that go into delivering effective software.

The Team You Are On

Beyond the team that they lead, I mentioned that Engineering Managers have another team, and this is their peer group. No two organisations are identical, but typically, multiple teams are grouped to form a department, which is fulfilling a part of the larger group strategy. This for me, is where the magic happens for Engineering Management, and it is where I encourage my direct reports to make the biggest impact.

Andy Grove defined a Manager’s output as the output of her/his organisation, plus the output of neighbouring organisations under her/his influence. To put that in context, this is the output of the Team You Lead, plus the output of the teams of your peer group. For the sake of this post, I make the assumption that these teams are interacting, and I do this because “A system is never the sum of its parts; it’s the product of their interaction”.

Interaction is Culture

So, if the yield of a system is the product of how the parts interact, you might be wondering how Managers influence this.

Culture has entered the chat...

Culture is how work happens between people and between teams, which sounds simple, but culture is complex, and takes considerable time and effort to instil.

I recently read a great description of culture, which hypothesised that culture is composed of behaviour, processes, and practices. Let’s take a look at each, and home in on the Manager’s role within.

Behaviour

A well-known study of engineering team effectiveness from Google, named Project Aristotle, identified the common elements of their best teams, and at the top of that list was Psychological Safety. Psychological Safety “...refers to an individual’s perception of the consequences of taking an interpersonal risk”. If we strip this down to bare metal, it is referring to how comfortable, and encouraged, team members are to speak up, to give their opinions, and to support one another.

Engineering Management is not about dictating what our engineers do, nor is it about having all the answers to the hard questions. Similarly, engineers are not blindly following instructions, nor are they viewed as code labourers. Instead, Engineering Management is about creating an environment that sets clear expectations and goals, encourages voices and opinions, destigmatizes failure, encourages diverse thinking, and supports the individual growth of each team member.

To accomplish this, Engineering Managers are provided with the autonomy to support their teams and to enable success as they know best. They should be guided by Our Founding Mindset (OFM), but be led by their own experience and know-how.

Achieving this within the Team You Lead is one thing, but the key is achieving this across the wider scope of the teams within your influence. This requires customer-first thinking, working backwards from the organisational goals, and ensuring that all teams have enough information and support to achieve their target. In other words, putting purpose over ego, and doing what’s right for the organisation and the customer.

Processes

A successful organisation is driven by autonomous and empowered teams. Peek inside each of these teams and you will find a diverse collective of talented, ambitious, and driven individuals. We are actively shaping the Zalando of the future by hiring great people with high potential. Our Engineering Managers are responsible for contributing to, and defining, the processes that will enable these teams of individuals to succeed.

Processes at Zalando are constantly evolving; responding to the ever-changing landscape in which we operate. In order to successfully equip an organisation with the necessary processes for momentum, decision making, and enablement, our Engineering Managers are required to collaborate with other leaders across multiple disciplines and job families, such as Principal Engineering, Product Management, Technical Program Management, and Design.

Perhaps they might be collaborating with Talent Acquisition Partners to refine the candidate experience during the hiring process or creating a Mentorship program. In other cases, they might be contributing to a cross-functional working group to define KPIs to measure progress relative to the Group Strategy. Perhaps they might be supporting the Cyber Week preparations. You get the idea. These are just four examples that my cohort of Managers have been working on recently. However, they all share the running theme of intrapreneurial spirit - embodying our “Act Like an Owner” founding mindset - and of making things happen throughout the organisation that ultimately become a tail-wind for impact.

Practices

If the purpose of processes is to shape the environment such that group thinking and empowered decision making is supported, then practice is the more granular day to day activities that sit atop the processes. These practices help Engineers to get things done.

As before, if we take the team you lead as the base, the Engineering Manager is responsible for working with their team to define fruitful ways of working that embrace best practices and foster collaboration. This will take time, especially for a newer team, but through trial and error, you will find that sweet spot.

When we home in on practices beyond the team, we see wider collaborations across disciplines to get things done across the department.

Practices, in my opinion, are the catalyst for helping Engineering Managers to understand how to scale themselves, by delegating and supporting the individuals on their team to step up and take on more responsibility. If we take a look at Communities of Practice, Operational Review Meetings, or Guilds, we typically see Engineers taking more of a leading role in establishing these practices, but in order to do this, our Engineering Managers are playing more of a supporting role. We are identifying opportunities and matching those to individual goals and aspirations. We are setting those individuals up for success by coaching, providing feedback, utilising training and development budgets, and stepping back to let them drive.

As individuals are growing into these responsibilities, it is important to nurture experimentation, to celebrate successes and failures, and most importantly, to provide the context (the why) of how these practices are related to the bigger picture.

Conclusion

Engineering Managers are responsible for steering and enabling a high-performing team of engineers, but their scope of influence and impact extends far beyond the realms of the team. Managers help to shape the behaviours, the processes, and the practices of the organisation to yield, and foster, a culture of innovation, delivery, empowerment and drive. This culture is what enables organisations to succeed in our non-linear world.

The Harvard Business Review recently published a terrific article, stating that in order to retain your best employees, you need to invest in your best managers. This article resonates with my own view that the success of an Engineering Organisation is greatly supported by our Engineering Managers - the ones who are close enough to the metal to implement culture, yet elevated enough to encompass a broad scope of influence, and provided with enough autonomy to innovate for the organisation.

I would like to finish this article off with an extract from our Role Expectations for the Management track:

“Great managers come in all shapes and sizes. There is no ‘checklist’ for leadership … No leader can do everything - some will exceed in certain capabilities while others will exceed in a different combination - this is OK and intended”.

Growth Engineering at Zalando

2022-07-26T00:00:00+02:00

We recently closed out our annual performance review for employees. Naturally, this period is for us to focus on how we are performing, what we aspire to achieve, and how we can progress towards those goals, with the support of our leads.

As a leader, I’ve spent a great deal of time working with Software Engineers on their development, and helping them to drive their career progression. These conversations and discussions are usually driven by the engineer, with managers playing a guiding and supporting role, and typically consist of self-reflection, ideation, motivation, and the culmination of a development plan.

I thought that it might be helpful to share some notes on a few of the ways that we enable growth for Engineers at Zalando.

Role Expectations

A standard progression for an engineer is from Junior to Mid to Senior. Unfortunately, aside from the title, we (and I include myself from my own engineering days), are not always completely clear on what the differences are between the levels. In order to progress as a Software Engineer, it is imperative that we understand the expectations at each level.

At Zalando, all of our engineers are provided with a copy of our Software Engineering Role Expectations. This document very clearly defines the expectations per grade across a wide range of functional areas, such as Scope, Delivery & Impact, and Community Contributions.

Moreover, the expectations very clearly describe the requirements for advancing to the next grade. A common activity for engineers reviewing their performance is to look at the functional areas on their current grade, and the grade above, and with the help of their lead, to perform a RAG (red, amber, green) assessment on their performance. This will usually shine a spotlight on areas for growth, and also on strengths that should be doubled down upon.

A concrete role expectations document is something that I would have greatly benefited from whilst coming up as an engineer.

Alice: "Would you tell me, please, which way I ought to go from here?"

The Cheshire Cat: "That depends a good deal on where you want to get to."

Alice: "I don’t much care where."

The Cheshire Cat: "Then it doesn’t much matter which way you go."

Alice: "...so long as I get somewhere."

The Cheshire Cat: "Oh, you’re sure to do that, if only you walk long enough."

Performance Reviews

I mentioned in the introduction that we have recently concluded our latest performance review. Performance reviews of some shape and form are relatively standard practice across the industry, but no two systems are the same.

Our reviews are held annually, with a half-yearly check-in*. The reviews provide an opportunity for employees to receive rounded feedback, which incorporates inputs from their peers, stakeholders, and lead. In addition, it requires self-assessment. The self-assessment is particularly important. We are all responsible for owning our careers.

The performance reviews serve to:

Recognise and celebrate their contributions over the last period.
Identify their strengths and the areas that they shine in.
Highlight any development areas or blindspots.
Calibrate these elements relative to the aforementioned role expectations.
Develop a goal and milestones to work towards over the course of the next review period.

I personally cherish the development areas, and love to hear where I can push myself more, and course correct any bad habits or issues (we all have them).

*Growth and progression are a constant and ongoing collaboration between you and your lead, but the actual timelines for the official review periods are annually and half-yearly.

Continuous Feedback

When I started out my career in engineering, one of the exciting aspects was the tight feedback loop. Using the REPL or compiler, I could quickly validate my solution. Tight feedback loops allow us to quickly course correct when something is wrong, but also provide a nourishing hit of endorphins when things go well. This supercharged-catalyst approach is something that we use for the delivery of continuous feedback at Zalando.

One of our values is High challenge, high support, which states that

Feedback is a gift. We give and receive honest and timely feedback. At the same time, we provide each other with support, and we care about the person beyond their role.

The use of the word timely is critical here. The best time to provide feedback, especially critical feedback, is when the action is fresh in the mind. This is when context is plentiful and crystal clear. My lead never waited until our next 1:1 to provide me with feedback, and this is something that I have continued.

Mentoring (noun)

the practice of helping and advising a less experienced person over a period of time, especially as part of a formal programme in a company, university, etc.

Mentoring is everywhere in Zalando. We have many official mentoring programmes (some are company wide, others are nurtured within departments), and we also have many unofficial mentoring relationships. During my tenure, I have benefitted from being a mentor, and a mentee.

Typically, for early stage engineers, seeking out an experienced mentor is a great way to broaden their network, to gain experience, and to accelerate their growth. Your mentor will likely be from a different team or business unit, so they can offer a more diverse approach to problem solving and development.

For our more tenured engineers, and especially those who are progressing towards Senior Engineering, mentoring a less experienced engineer* helps to prepare you for the seniority expectations such as coaching, guiding, providing feedback, and paving the way for a new generation.

*I have witnessed some success stories where engineers have mentored non-engineers and helped them to secure their first engineering role.

Personal Development Budget

We provide our engineers with a healthy personal development budget, which can be used for learning materials, educational resources, training and certifications, and the like. Every person is unique, and whilst you might prefer to upskill using sites like Coursera, I might prefer to read a book on a particular topic, or to join a local study group.

Personal development is certainly not limited to technical skills, and should also include soft skills and other attributes that shape a well-rounded career. A personal example: I recently sought to improve my public speaking skills and took an eight-week online course on Presentation Skills. The course was aimed at individuals who often need to speak to groups, and who find it uncomfortable. To my surprise, the cohort consisted of quite a few engineering leaders.

Courses and activities like these can be cost-prohibitive to some, and having the investment of your company to support you is a huge boost to your development.

Missing it? Make it Happen!

Another one of our values is Act like an owner, which states that

“Ownership” is about being responsible to our customers, partners and colleagues, not about being entitled. We own our destiny and are not stopped by circumstances: Zalando is what you make of it.

We are all encouraged to take ownership of our careers and development. One such example of this is the large number of communities and groups that were founded and run by engineers. In my particular department, I have seen people create and run React meetups, Book Clubs, Podcasts, Show & Tells, Hackathons, etc. At one point in time, these forums did not exist - an engineer wanted to attend one, and so they took ownership and created it.

Founding and organising such initiatives is no small feat, and you can be sure that the creators developed many skills along the way.

Organisations are ever evolving, and don’t come equipped with everything that you would like. If there’s something that you want, then go and make it happen.

Support, Support, Support.

I have been incredibly fortunate to work with leaders and peers who support my growth and development. They have provided me with open and honest feedback on what I am doing well, and of course, what I am doing not so well.

Growing within an organisation with such a deeply woven culture of supporting one another is surprisingly easy. Our engineers’ growth and engagement is a top priority for our leadership cohort, and they have our full support for unlocking their potential. Support isn’t sugar-coated, and sometimes that means having difficult conversations, but we do this to set you up for success.

An Introduction to the Zalando Design System

2022-07-21T00:00:00+02:00

Yet Another "What is a Design System?"

There is a lot of literature and countless blog posts around the very definition of the concept of design systems. In this post, we'd like to look at it from an engineering perspective and describe the journey from the initial idea to the complete adoption here at Zalando.

You can also find more information about the creation process from a design point of view in this blog post.

At its core, a Design System is a collection of specifications describing a set of design primitives, reusable components, and arbitrary guidelines to ensure consistency and visual identity. Given such a broad definition, there are no fixed rules when it comes to technical implementation, but some patterns started to emerge in the industry.

Implementation-less Design System

How a Design System is implemented into a reusable library is highly influenced by the specific business use case, technologies and frameworks used, platforms to support, as well as teams and company wide processes and structure. In a very large company with many different products and a diverse panorama of tech stacks, providing a single solution that suits every context may become extremely difficult, if not impossible. On the other hand, visual consistency and brand identity are likely to still be a requirement.

A radical, but common, approach in these use cases is not providing an implementation at all. The Design System is defined via a strict set of platform and technology agnostic definitions. Different teams/products/departments can implement their own library using the best tool for the job as long as the specifications are respected.

Design Tokens

Relying exclusively on a set of specifications offers more flexibility. However, as more and more implementations are developed, the problem of guaranteeing that they are in sync with the latest specs becomes increasingly hard.

A step toward increasing consistency without sacrificing flexibility is to provide a set of core variables and assets to be used across implementations. Those variables, called tokens, represent all the shared values that will help us maintain consistency across our system. Some practical examples are color palettes, spacing, typography, and assets like logos, icons, etc.

Design Tokens are usually maintained in a centralised place and via some tooling they are converted into different formats to be consumed by a vast array of different platforms. Every independent implementation will use the latest version of those tokens as the only source of truth for the core variables and assets used. With such a setup, we can quickly roll out changes to Design System core elements across an arbitrary number of implementations.
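To make the "one source, many formats" idea more concrete, here is a minimal sketch of such a conversion step. The token names and values are invented for illustration, and the two output formats (CSS custom properties and a flat JSON map) are assumptions about what consumers might need, not a description of Zalando's actual tooling.

```python
import json

# Hypothetical token values; the real tokens are not shown in this post.
TOKENS = {
    "color": {"primary": "#ff6900", "text": "#1a1a1a"},
    "spacing": {"s": "8px", "m": "16px"},
}


def flatten(tokens, prefix=()):
    """Yield (path, value) pairs for every leaf in a nested token dict."""
    for key, value in tokens.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def to_css(tokens):
    """Render tokens as CSS custom properties on :root (web consumers)."""
    lines = [f"  --{'-'.join(path)}: {value};" for path, value in flatten(tokens)]
    return ":root {\n" + "\n".join(lines) + "\n}"


def to_json(tokens):
    """Render tokens as a flat JSON map (e.g. for a non-web platform)."""
    return json.dumps({".".join(path): value for path, value in flatten(tokens)}, indent=2)
```

Because both outputs are derived from the same dictionary, changing a value in one place changes it for every consumer on the next build.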

The Single Component Library

The term "Design System" is often used as a synonym for a component library. While it is true that one of the practical implementations of a Design System is one of such libraries, overloading the term is a practice that may turn out to be counter-productive. A lot of emphasis is given to the technicalities of how the different components are developed in a specific architecture, glossing over the Design System's core goals, which are to enforce a visual consistency and identity while reducing the maintenance costs. These fundamental aspects are instead often relegated to vague concepts of default styles or custom themes.

The confusion of those terms is easy to understand: in many cases the one single component library is the main contact point between the Design System as a concept and its practical consumers. Referring to this contact point with the “design system” term is an understandable shortcut. Regardless of the terminology, we are dealing with very different concepts. For example, a Design System can exist without a component library, the same way a component library can be abstract enough to not enforce any visual identity.

The Zalando Implementation for the Web Platform

Our design system was initially conceived and developed at roughly the same time as its web platform implementation. This gave us the opportunity to gradually adopt certain technical decisions with a very tight feedback loop during a major visual and architecture redesign. In retrospect, that was both an advantage and a disadvantage: starting from scratch gave us the freedom to make choices based on suitable use cases without the constraints of a legacy live system. On the other hand, the lack of a complete set of specifications led to many changing requirements that naturally caused a certain amount of refactoring and duplicated work.

Overall it was an extremely interesting challenge and I would like to share some of the learnings and decisions we encountered on the way. As a first step, we identified some of the functional requirements we could foresee based on past experience and current business needs.

Team Autonomy

A high level of autonomy has consistently been reinforced by Zalando, even after years of change and growth. Different teams, especially on the customer-facing side, own specific parts of the experience and expect to independently develop new features without being blocked by overly centralised teams and architectures.

Speed

We knew that speed, in every meaning of the word, would be a requirement: from the performance of the components, to the ability to quickly iterate over existing implementations and provide new features, to avoiding, as much as possible, becoming a bottleneck for other teams.

Consistency

One of the key metrics to evaluate the success of a Design System is the consistency and identity of the final customer-facing product. From a technical perspective, there are always some trade-offs between consistency, speed, and flexibility. While it can be complex, if not impossible, to maximize all of them, we tried to incentivize the "consistent way" by making it the easiest and fastest option whenever possible. We still had to consider possible escape hatches for certain edge cases, but we wanted the most obvious and simple option to be the one providing the highest level of consistency.

Consider Other Platforms

While our main focus was to support the web platform, we decided from the beginning to identify opportunities to maintain a certain level of code sharing across platforms. Some variables could be shared across all platforms, part of the CSS used on the website may be used for emails, some teams may want to use a different JS framework. Those are some of the possible use cases we thought could arise at some point. While we didn’t want to over-engineer our solution based on these uncertain requirements, we tried to keep a loosely coupled architecture that would allow some of these scenarios to be addressed more easily in the future.

Extended Atomic Metaphor

Our web component library follows an approach loosely based on the concept of Atomic Design. The basic idea is to have different abstractions that build on each other, from the simplest to the most complex, in the same way that complex living organisms are composed of simpler molecules, which in turn are composed of simpler atoms, and so on. A layered approach is a natural fit for many complex and continuously evolving systems. In particular, the different speeds at which layers of different complexity change in nature tend to be mirrored in artificial constructs like a Design System and in many other complex systems. A very interesting read on the topic, which I strongly suggest, is Pace Layering: How Complex Systems Learn and Keep Learning. For our web architecture we ended up with these different layers:

Design Tokens: A centralised source of truth for variables and assets that define the core of the Design System. Some examples are: colour palette, spacing, typography, fonts, icons, etc.
Electrons: A subset of the CSS grammar that only allows properties and values that are consistent with the specifications of the Design System, e.g. paddingTop_m, fontFamily_sansSerif, etc.
Atoms: A composition of electrons and/or other atoms that serve a single generic purpose and cannot be divided further without losing its functionality. E.g. the collection of electrons needed to create a button. In our implementation this is the last layer that directly uses CSS.
Molecules: A composition of atoms and/or other molecules forming a single generic component. An example could be the React implementation of different button types ready to be used as a package. At this level there should not be any business logic and emphasis is given to reusability and consistency with design specifications. For example, most of these molecules will also be available as components in the shared designer library.
Organisms: A composition of molecules, atoms, and/or other organisms to fulfill a specific business use case. They are not part of the core component library and are owned by different teams owning the specific feature they enable.
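The relationship between tokens, electrons and atoms can be sketched in a few lines. This is only an illustration of the layering idea: the spacing scale, the property list and the function names are hypothetical, and only the naming pattern (paddingTop_m) is taken from the examples above.

```python
# Hypothetical spacing scale and property whitelist; only the naming pattern
# (e.g. paddingTop_m) follows the examples given in the text.
SPACING = {"s": "8px", "m": "16px", "l": "24px"}
ALLOWED_PROPERTIES = {"paddingTop": "padding-top", "marginBottom": "margin-bottom"}


def build_electrons(properties, scale):
    """Return a mapping of electron name -> CSS declaration.

    Only combinations of whitelisted properties and token values exist,
    so anything outside the Design System simply has no electron.
    """
    return {
        f"{name}_{size}": f"{css_property}: {value};"
        for name, css_property in properties.items()
        for size, value in scale.items()
    }


def compose_atom(electrons, names):
    """Compose an atom from electrons, rejecting names outside the allowed set."""
    unknown = [name for name in names if name not in electrons]
    if unknown:
        raise ValueError(f"not part of the design system: {unknown}")
    return " ".join(names)
```

The point of the sketch is the failure mode: asking for something like paddingTop_13px raises an error instead of silently introducing an off-spec value.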

Consistent with the natural world analogy, elements belonging to the simpler layers like electrons and atoms, tend to be stable and only very rarely receive any updates, for example a major redesign every few years. On the other hand as the complexity of the layer increases, changes happen more and more frequently. Based on this expected behaviour, we shaped our architecture in order to optimise for:

very frequent changes in organisms
occasional changes in molecules
very rare changes in atoms and electrons.

We were also able to use these assumptions as a technical leverage to maximise other dimensions like bundle size, enforced visual consistency, testing, and documentation.

Contributions and Ownership

In terms of tangible entities, the Zalando Design System is composed of different parts with different ownership and contribution processes in place; this article covers the details of our "contribution model" in more depth. Here, we will focus on the parts affecting the web platform, but a similar structure can be encountered for mobile app development as well.

Design Tokens repository: Owned by the larger Design System team, including designers as well as web and app engineers.
Figma component library: Includes a visual representation of the Design System specifications as well as a centralised component library that can be used by designers in many different teams to create screens and requirements for arbitrary features.
Web component library: Structured as a monorepo, it exports a single npm package for each atom, molecule and organism as well as a single highly optimised CSS bundle. The central Design System team has the ownership of the CSS layer, the atoms, the molecules, and some generic organisms.

Using GitHub code owners, different teams own specific organisms and are responsible for maintaining any business logic required. Pull requests on code owned folders are usually faster to approve and merge as we ensure that changes on a code owned component will not affect other exported packages.

The only way to use CSS in organisms and molecules is via atoms. This ensures a certain amount of consistency and makes it easy to spot possible deviations from the Design System specifications. Using a single, predictable CSS bundle and a set of React hooks and patterns, we encourage consistency and composability over one-off implementations. In return we get a very scalable library in which any number of organisms always results in the same CSS bundle size, and organisms do not affect each other's JS bundle size.

Challenges and Pain Points

Creating a Design System from scratch and driving its adoption in a large company was definitely challenging, from gathering the requirements from many hidden use cases to getting enough traction to refactor complex applications; it has been a journey where communication and coordination have played a major role. Finding a technical solution able to grow and scale as fast as our requirements was also a challenge.

While the system has been running relatively smoothly for more than 2 years and the adoption rate is close to 100%, there are some long-lasting pain points and possible areas for future improvements.

Fragmented Ownerships

Finding the right owner for specific common components like product cards, carousels, banners, etc. is extremely difficult from an organisational point of view. Even when an owner is found, it is hard to prevent some conflicts and overlaps of responsibilities. For example, multiple variations of the same components start to appear with different ownerships, features that require coordination across certain premises need the involvement of different teams, and the discoverability of what is currently available becomes a crucial requirement.

Coupling with Deployments

In software engineering, it is usually considered a best practice to group things that change together. Currently, a new version of the component library and a new version of the live customer-facing application are handled by different pipelines and the codebases live in different repositories. Although having independent releases and a platform-agnostic pipeline may be convenient, we cannot ignore the reality of having one main consumer. In this case, a solution involving a larger monorepo may help with the bottleneck problem created by the need to keep versions in sync.

Conclusions

A Design System tends to behave like most complex systems. Different layers evolve and stabilize at different paces, with a slow-changing core and fast iterations on the edges. The biology metaphor fits these behaviors quite well and was popularized by Atomic Design. Porting those complexity layers into a technical implementation was not always straightforward, but overall it was a good decision.

Code can be observed with the same curiosity we have when looking at nature. Identifying the boundaries between different layers and their relationships is the key to controlling the complexity involved. While, to some extent, exceptions will always exist, knowing which parts of the system are stable, which ones are changing fast, and how they affect each other is a powerful tool. The architecture and processes around a Design System can be shaped around these characteristics in order to optimize for fast iterations on the edge layers and stability on the core ones. Embracing the chaotic nature of changes while learning and observing the larger patterns at play is the key to achieving long-term stability and a healthy evolution process.

International Women in Engineering Day (June 23rd)

2022-06-23T00:00:00+02:00

What were the biggest learnings in your career so far? And what advice would you give your younger self today? How do you get ahead in your career? We’re celebrating International Women in Engineering Day by talking to three senior Zalando Women in Tech: Mahak Swami, Engineering Manager; Floriane Gramlich, Director of Product Payments; and Ana Peleteiro Ramallo, Head of Applied Science. We caught up with them during the Women in Tech Global Conference 2022 — let’s find out their advice!

What’s the best thing about your job?

Mahak: In my team, we build products for the Zalando mobile app. The best thing is the technical challenges: working on them and solving them.

Floriane: I have an incredible team who I love to work with – it’s fun, but it’s also inspiring. Also, I work in payments, which is all about customer convenience: Ultimately, if I don’t do my job right, then people can’t pay, so I love that I’m making a difference.

Ana: The best thing about my job is that I get to work on super-interesting topics, and with really amazing and interesting colleagues.

Looking back at your career, what’s your tip for fostering a more inclusive environment?

Mahak: It’s really important that everyone’s opinions are considered when you’re solving a problem. An engineer could bring equally important input to the design, and vice versa. Everybody needs to bring their own values to the table, so that we can find the best solutions to the problem.

Floriane: Being yourself is super-important. That means accepting who you are, and not trying to imitate somebody else. Because, if you can’t be true to yourself, how can you be true to others?

Ana: The first thing is to make people aware when there is not an inclusive environment. Many times people want to be inclusive, and don’t realise there’s a problem.

What’s the best professional advice you’ve ever received?

Mahak: The best advice I’ve had was around executive presence: To speak about my work and represent it just as well as I was doing it. A lot of women don’t advocate for the work they’re doing. That’s one thing I’d definitely push for.

Floriane: So, the worst advice I ever received was, ‘Don’t be too ambitious’. I was told that a LOT in previous companies, in almost every performance talk. It’s terrible advice and I wonder if a man would be told the same thing. Now, it’s really important to me to be that multiplier for my teams, I say: Be ambitious!

Ana: The best professional advice I ever got was, ‘If you want something, just go and get it’. Because many times we doubt ourselves, but it’s about wanting to get something and having a plan for how to get it.

What advice would you give your younger self?

Mahak: Try out as many things as you can in your career. It’s very important to figure yourself out. Don’t be afraid to find out what clicks for you as a professional.

Floriane: Know what you want. Say what you want. Do what you want. And stand true to that. It’s super-important to invest in self-reflection quite early. You need to really understand who you are.

Ana: What I learned is to be really proactive and never stop learning. Continuous learning will help you to grow.

What other tips would you give to women starting their career in STEM?

Mahak: In general, women have this perception of tech: that it isn’t a place for them, and perhaps it’s difficult to get into. But that’s not the case. Tech is very logical, a lot of fun and now very inclusive too. When I started my career, I was often the first and only woman on the team. But now that’s not the case. You will have company and you will have fun – try it!

Floriane: Be curious and don’t let other people tell you what you can or can’t do. On a more practical level, look for role models (there are lots out there), find yourself a mentor, build your network, and really learn from others. Getting advice from outside your usual zone is very powerful.

Ana: Never allow anyone to tell you what you can or can’t do. You’re the only one who knows your goals and what you want to achieve. Also, there are no things for girls or things for boys – there’s only things you like. So, if there’s something you like, go ahead and enjoy it.

Learn more about International Women in Engineering Day and for more inspiration, check out our three Zalando speakers at the recent Women in Tech Global Conference.

Accelerate testing in Apache Airflow through DAG versioning

2022-06-10T00:00:00+02:00

Introduction

In the Performance Marketing department, we run paid advertisement campaigns for Zalando. To do so, we build services that allow us to manage campaigns, optimize and distribute content, and measure the performance of the campaigns at scale.

Talking about measurement, one of the core systems we’ve built and continuously extended over the years is our so-called marketing ROI (return on investment) pipeline. The ROI pipeline is a batch-based data and machine learning pipeline powered by Databricks Spark and orchestrated by Apache Airflow. It consists of various sub-pipelines (components), some of which are built using our Python SDK zFlow. Examples of these components are our input data preparation, marketing attribution model or an incremental profit forecast for our campaigns. These components are owned and developed by different cross-functional teams (applied science, engineering, product) within Performance Marketing. You can read more about the way we measure campaign effectiveness from a functional perspective in our previous blog post.

Problem Statement

A recurring problem we faced during the development relates to the nature of the marketing ROI which lacks a ground truth¹. It means that while we oftentimes have assumptions on what the impact of a change in input data or to our components has on the ROI, we require the new version of the ROI pipeline to be run end-to-end to confirm our assumptions. Since different teams are working on different components of the ROI pipeline in parallel, evaluating the impact of a change on the final ROI in isolation is required to work effectively (i.e. teams not blocking each other). The following section explains the problem in more depth.

As mentioned earlier, we are using Airflow to orchestrate the overall pipeline. The Airflow code is stored in a github repository. We have two servers, production and test. When a pull request is opened, the Airflow pipeline is deployed to the test server. On merge to the main branch, we deploy to the production server. In this setup, we have two so-called pipeline environments, a production (live) and a test environment. The live pipeline uses the live data environment while the test pipeline uses the test data environment. As our data layer, we’re mainly using AWS S3 with data organized as Spark tables. A set of Spark tables represents a data environment. Only one version of an Airflow DAG such as our marketing ROI pipeline can exist in each environment. When multiple features are developed at the same time, they have to share the test environment which oftentimes leads to conflicts since testing in isolation is not possible. Alternatively, the features can be tested sequentially which leads to delays. To solve the problem, we implemented a mechanism to enable a flexible number of Airflow environments. Moreover, we also developed a script to spin up new data environments.

Figure 1 depicts the relationship between a pipeline and data environments.

Figure 1: Environments

Pipeline Environment

A pipeline environment is a version of a pipeline (set of Airflow DAGs) deployed to an Airflow server on which it can run end-to-end. Each environment contains all DAGs necessary to produce the required output (e.g. marketing ROI in our case), so multiple environments can co-exist on one server and can be used independently.

Data Environment

A data environment is a set of Spark/Hive databases, tables and views. A pipeline environment uses a single data environment for reading and writing data.

Airflow Environments

Our main objective was to create a new Airflow environment once a pull request is opened on which the developed version of the pipeline can be tested in isolation. The most trivial way is to create a new Airflow server for every pull request, which would be time consuming and costly. For example, Amazon Managed Workflows for Apache Airflow (MWAA) needs up to 30 minutes to create a new Airflow server and you have to pay for additional resources. With our solution, a new environment is created on the existing test server once a pull request is opened, resulting in multiple environments on the same Airflow server. The creation of a new environment takes less than one minute.

Figure 2 shows what this could look like on the test server. We have two Airflow DAGs, qu.test_dag and qu.test_dag_2, in three different environments: feature1, feature2 and feature3. "qu" is the name of an internal team at Zalando; the DAGs always have the team name as a prefix. It means that the same DAGs are adapted and deployed through three separate pull requests.

Figure 2: Airflow Environments

When the corresponding pull request is closed, the environment will be deleted automatically. How did we implement this, given that the concept of environments does not exist in Airflow? We adjusted the source code of the Airflow library and developed a cron job which deletes the environments later on. The following sections explain the modifications we made.

Deploying Airflow code as a zip file

The Airflow code is deployed as a single zip archive using the Packaging DAGs feature. This feature prevents dependency conflicts because every deployment only uses the dependencies which are defined in the same zip file. The zip file has the name of the branch from which we are deploying. For example, when we deploy the Airflow code from branch feature1, the zip file is called feature1.zip.
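As a rough illustration of the packaging step, the sketch below zips a source tree into an archive named after the branch. The function name and directory layout are assumptions made for this example; our actual deployment script is not shown in this post.

```python
import zipfile
from pathlib import Path


def package_dags(source_dir: Path, target_dir: Path, branch: str) -> Path:
    """Zip every file under source_dir into <target_dir>/<branch>.zip.

    Assumes target_dir is not inside source_dir and that the branch name
    is usable as a file name (no slashes).
    """
    archive = target_dir / f"{branch}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the source root so that the
                # layout inside the archive mirrors the repository.
                zf.write(path, path.relative_to(source_dir).as_posix())
    return archive
```

Calling package_dags(repo_root, dags_folder, "feature1") would then produce feature1.zip in the DAGs folder, matching the naming convention described above.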

Use correct Jinja Paths

A problem that occurs when using a zip file is that Jinja templates for files no longer work. Jinja detects the absolute path of the file correctly, but the file cannot be read because it’s inside a zip file. For this reason we also deploy the unpackaged zip archive in a different location. Inside the dag.py file (see Figure 3, lines 13-19) we add the location of the unpackaged files to the template search path. As a result, Jinja now searches for templates inside the unpackaged folder.

Renaming Dag Ids

On one Airflow server, it’s not possible to create multiple DAGs with the same id. Therefore, we have to rename the DAG ids for every deployment. For that reason we adapted the dag.py file (see Figure 3) of the Airflow library, which contains the DAG class. Inside the __init__ method we check the file path of the Python file which is initializing the DAG. The path contains the name of the zip file, e.g. feature1.zip. This way we can differentiate the environments. We modify the original DAG id and inject the environment name (see Figure 3, lines 3-11). Furthermore, we add the environment name as a tag to enable filtering on environments.

class DAG():
…
    def __init__(...):
         # /usr/local/airflow/dags/feature1.zip/qu/main/file.py
         file_path = get_path_of_file_which_initialized_dag()

         #feature1
         feature_name = get_zip_file_name(file_path)

         dag_id = f"{team_name}.{feature_name}.{dag_id.split(f'{team_name}.')[1]}"
         tags.append(feature_name)

         # /usr/local/airflow/features/feature1/
         feature_dir_path = get_feature_dir_path(file_path)
         template_searchpath.add(feature_dir_path)

         # /usr/local/airflow/features/feature1/qu/main/
         feature_file_path = get_feature_file_dir_path(file_path)
         template_searchpath.add(feature_file_path)
…

Figure 3: Pseudo Code of adapted dag.py
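For readers who want to experiment with the idea outside of Airflow, the path handling from the pseudo code can be written as plain functions. This is a sketch and not the patched dag.py: the function names are hypothetical, and the directory layout is taken from the comments in Figure 3.

```python
from pathlib import PurePosixPath

# Location of the unpackaged archives, as shown in the Figure 3 comments.
FEATURES_ROOT = PurePosixPath("/usr/local/airflow/features")


def environment_from_path(file_path: str) -> str:
    """Return the zip file stem on the path, e.g. 'feature1' for .../feature1.zip/..."""
    for part in PurePosixPath(file_path).parts:
        if part.endswith(".zip"):
            return part[: -len(".zip")]
    raise ValueError(f"{file_path} is not inside a zip archive")


def versioned_dag_id(dag_id: str, team_name: str, environment: str) -> str:
    """Inject the environment name after the team prefix."""
    prefix = f"{team_name}."
    if not dag_id.startswith(prefix):
        raise ValueError(f"{dag_id} does not start with {prefix}")
    return f"{prefix}{environment}.{dag_id[len(prefix):]}"


def template_search_paths(file_path: str) -> list[str]:
    """Map a path inside the zip to the unpackaged folders Jinja should search."""
    parts = PurePosixPath(file_path).parts
    zip_index = next(i for i, part in enumerate(parts) if part.endswith(".zip"))
    feature_dir = FEATURES_ROOT / parts[zip_index][: -len(".zip")]
    # Everything between the zip file and the Python file itself.
    file_dir = feature_dir.joinpath(*parts[zip_index + 1 : -1])
    return [str(feature_dir), str(file_dir)]
```

With the example path from Figure 3, qu.test_dag in the feature1 archive becomes qu.feature1.test_dag, and the search paths point at the unpackaged copy rather than at the zip file.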

Environment Cleanup

We have developed a cron job that checks the status of pull requests. Once a pull request is closed, the corresponding environment is deleted on the Airflow server. The job deletes the zip file and the folder which contains the unpackaged files. Then, it queries the Airflow metastore for all associated DAGs and deletes them via Airflow cli.
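The cleanup logic can be sketched as a pure planning function that decides what to delete for a closed pull request. The function and class names are hypothetical, the list of DAG ids is passed in rather than queried from the metastore, and the DAG id layout assumes the renaming scheme from Figure 3. The only Airflow interface it relies on is the `airflow dags delete` CLI command with its `--yes` flag.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class CleanupPlan:
    paths_to_delete: list[str]
    cli_commands: list[list[str]]


def plan_cleanup(environment, all_dag_ids, team_name="qu",
                 dags_root="/usr/local/airflow/dags",
                 features_root="/usr/local/airflow/features"):
    """Build the list of files and DAGs to remove for one environment.

    all_dag_ids stands in for the result of the metastore query.
    """
    marker = f"{team_name}.{environment}."
    dag_ids = sorted(dag_id for dag_id in all_dag_ids if dag_id.startswith(marker))
    return CleanupPlan(
        paths_to_delete=[
            str(PurePosixPath(dags_root) / f"{environment}.zip"),  # packaged DAGs
            str(PurePosixPath(features_root) / environment),       # unpackaged copy
        ],
        # One CLI call per DAG; --yes skips the interactive confirmation.
        cli_commands=[["airflow", "dags", "delete", "--yes", dag_id] for dag_id in dag_ids],
    )
```

Separating the plan from its execution makes the job easy to dry-run: the commands can be logged first and only passed to subprocess.run once they look right.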

Data Environments

Every Airflow environment also requires a data environment, otherwise conflicts on the data layer could occur during parallel feature development. Our data is mainly organized as Spark databases stored on S3. A data environment is a set of Spark databases with a corresponding suffix, e.g. all databases of the live environment have the suffix _live. The DDLs of our databases and tables are stored in a git repository. We developed a script which uses the DDLs to create a new data environment (see Figure 4). The databases have the environment name as a suffix, e.g. db_attribution_feature1.

Figure 4: Create new Data Environment
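The naming convention for data environments is simple enough to sketch directly. The helper names below are hypothetical and the snippet only generates the database-level statements; the real script also creates tables from the stored DDLs, which is omitted here.

```python
import re


def database_name(base: str, environment: str) -> str:
    """Append the environment name as a suffix, e.g. db_attribution_feature1."""
    # Branch names may contain characters that are not valid in identifiers,
    # so reject them here instead of producing broken SQL.
    if not re.fullmatch(r"[A-Za-z0-9_]+", environment):
        raise ValueError(f"invalid environment name: {environment!r}")
    return f"{base}_{environment}"


def create_database_statements(databases, environment):
    """Return one CREATE DATABASE statement per base database."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {database_name(db, environment)}"
        for db in databases
    ]
```

Each returned string can be passed to spark.sql(...) on a cluster; the IF NOT EXISTS clause keeps the script safe to re-run.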

A new data environment is initially empty, i.e. the databases do not contain any data. We could copy the data, but this costs time and money. A more elegant way is the table environment feature, which we implemented with the data environment script. Instead of copying data, the script creates a view pointing to the respective test data (see Figure 5). Table environments are defined in a configuration file which is automatically created via the table environment script. The script uses information about input and output tables of all tasks, which are predefined as YAML files. An example table environment configuration is db_attribution.m_events:TEST, resulting in the creation of the following example view.

CREATE VIEW db_attribution_feature1.m_events AS
SELECT * FROM db_attribution_test.m_events

Figure 5: Creating a view instead of copying data
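To make the mapping from a table environment entry to a view more concrete, here is a minimal Python sketch. It is not the actual data environment script: the function names `parse_table_environment` and `view_statement` are hypothetical, and the sketch assumes that the part after the colon names the source environment and that database names follow the `<database>_<environment>` suffix convention described above.

```python
from typing import Tuple


def parse_table_environment(entry: str) -> Tuple[str, str]:
    """Split an entry such as 'db_attribution.m_events:TEST' into (table, environment)."""
    table, environment = entry.rsplit(":", 1)
    # Assumption: environment suffixes are lower case, as in '_live' and '_test'.
    return table, environment.lower()


def view_statement(table: str, source_env: str, target_env: str) -> str:
    """Build a CREATE VIEW statement that points target_env at the data of source_env."""
    database, table_name = table.split(".", 1)
    return (
        f"CREATE VIEW {database}_{target_env}.{table_name} AS\n"
        f"SELECT * FROM {database}_{source_env}.{table_name}"
    )


table, source_env = parse_table_environment("db_attribution.m_events:TEST")
statement = view_statement(table, source_env, target_env="feature1")
print(statement)
```

Run against the example entry, the sketch prints the same statement as Figure 5.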

A view is only created if the table is not used as output by one of the respective tasks. In some cases you need initial data for tables which are used as output. Therefore, the table environment script creates a configuration stub for these tables like this:

db_attribution.m_events:
    partitions:
        - date BETWEEN "x" AND "y"

If you define the partition ranges and execute the data environment script, it creates the table and copies the data for you.

Summary

In this blog post we presented how we enabled versioning of our performance marketing pipeline, which is based on Apache Airflow. Versioning is necessary to enable more convenient simulation and testing. We modified the Airflow DAGs class and used the Packaging DAGs feature of Apache Airflow to make it possible to have multiple versions of the same DAGs on a single server. This allows us to deploy a git branch consisting of Airflow DAGs directly to a single Airflow server where they can run isolated from other versions. The deployment takes less than 1 minute compared to up to 30 minutes when you create a new Airflow server for the deployment. To enable isolation at the data level we implemented a script which spins up a new Data Environment consisting of Spark/Hive tables on S3. As a result, every pipeline version can use a dedicated Data Environment.

This is simplified; ultimately we consider the results of our a/b tests as ground truth. Yet, a/b tests are only run in certain periods of the year and are used to correct our marketing attribution results also in-between a/b test periods. Here, due to internal and external factors such as spend changes or campaign efficiency changes, the ground truth could in fact have changed as well.

Operation-Based SLOs

2022-04-28T00:00:00+02:00

Anyone who has been following the topic of Site Reliability Engineering (SRE) has likely heard of Service Level Objectives (SLOs), and Service Level Indicators (SLIs). SLIs and SLOs are at the core of the SRE practices. They are fundamental to establish the balance between building new features on a product, shipping fast, or working on the reliability of that product. But they are not easy to get right. Zalando has gone through different iterations of defining SLOs, and we’re now in the process of maturing our latest iteration of SLO tooling. With this iteration, we are addressing fragmentation problems that are inherent to service based SLOs in highly distributed applications. Instead of defining reliability goals for each microservice, we are working with SLOs on Critical Business Operations that are directly related to the user experience (e.g. "View Catalog", "Add Item to Cart"), rather than a specific application (Catalog Service, Cart Service). In this blog post we’re going to present our Operation Based SLOs, how we define them, the tooling around them, how they are part of our development process, and also how they contributed to a healthier on-call.

The first iterations of defining SLOs

To understand where we are right now, it’s important to understand how we got here. When we introduced SRE in Zalando back in 2016 we also introduced SLOs. At the time, we went with service based SLOs. Each microservice would have SLOs on whatever SLIs service owners defined (usually availability and latency), and they would get a weekly report of those SLOs, through a custom tool that was tightly coupled with our homebrew monitoring system.

Service Level Reporting tool

As these were new concepts in the company, we ran multiple workshops across the company for Engineers and Product Managers to train them on the basics and to kick-start the definition of SLOs across all engineering teams. Product Managers and Product Owners started to get unexpected questions from other peers and engineers:

"What is the desired level of service you wish to provide to your customers?"
"How fast should your product be?"
"When is the customer experience degraded to an unacceptable level?"

The last one was particularly relevant for services that have different levels of graceful degradation. Say the service cannot respond in the ideal way; it uses its first fallback strategy that is still "good enough" so we consider it a success. But what if that first fallback also fails? We can use a second fallback just so we don’t return an error, but maybe that is no longer a response of acceptable quality. Even though the response was successful from the client’s perspective, we still count it as an error. What was particularly interesting about this thought process was that it created a break from defining availability exclusively based on HTTP status codes (where failure is anything in the 5xx range). It’s good to keep this reasoning in mind, as it will be useful further down.

SLOs saw an increasing adoption across the company, with many services having SLOs defined and collected. This, however, did not mean that they were living up to their full potential, as they were still not used to help balance feature development and improving reliability. In a microservice architecture, a product is implemented by multiple services. Some of those services contribute to multiple products. As such, Product Managers had a hard time connecting the myriad of SLOs and their own expectations for the products they are responsible for. Because SLOs are on a microservice level, the closest manager would be on the team level. Taking into consideration the previous point that a product is implemented by multiple services, aligning the individual SLOs for a single product would mean costly cross-team alignment. Raising the SLO discussion to a higher management level would also be challenging, as microservices are too fine grained for a Head or Director to be reviewing. The learning at this stage was that the boundaries of Products did not match individual microservices.

In this service landscape we see that products can share individual services

We later tried to add additional structure to the existing SLOs. One of the challenges we had with service based SLOs was the sheer amount of services that had to be measured and monitored for their SLOs. Realistically speaking, they could not all have the same level of importance. To ensure teams focused on what mattered the most, a system of Tier classifications was developed - Tier 1 being most critical and Tier 3 being least critical. With each service properly classified, teams knew what they should be keeping a close eye on. Having the Tier definition also allowed us to set canonical SLOs according to an application's tier classification. Our tooling evolved to keep up with these changes.

To summarise, our experience with service based SLOs struggled to overcome the following challenges:

High number of microservices. The more there are, the more SLOs teams have to monitor, review, and fine tune.
Mapping microservice SLOs to products and their expectations. When products use different services to provide the end-user functionality and with some services supporting several products, SLOs easily conflict with each other.
SLOs on a fine grained level made it challenging for management to align on them. When dealing with SLOs on such a granular level as microservices, management support beyond the team level is difficult to get. And within the team level, it requires costly cross-team alignment.

Symptom Based Alerting

In our role as SREs we were in frequent contact with different teams, helping them with PostMortem investigation, or reviewing their monitoring (what metrics were collected and paging alerts that were set up). While teams were quick to collect many different metrics, figuring out what to alert on was a more challenging task. The default was to alert on signals that could indicate a wider system failure ("Load average is high", "Cassandra node is down"). Knowing the right thresholds to alert on was another challenge. Too strict, and you’re being paged all the time with false positives. Too relaxed, and you’re missing out on potential customer impacting incidents. Even figuring out whether the alert always translates to customer impact was also tricky at times. All of this led us to push for a different alerting strategy: Symptom Based Alerting.

You can find more details about Symptom Based Alerting in the slides of one of the talks we did on this topic. But the main message of that talk is that there are some parallels between SLOs and Symptom Based Alerts. Namely, about what makes a good SLO, or a symptom worth alerting, and how many SLOs and alerts you should have. Both SLOs and Symptom based alerts should be focused on key customer experiences¹²³ by defining alerts and SLOs on signals that represent those experiences. Those signals are stronger when they are measured closer to the customer, so we should measure them on the edge services. There are benefits to keeping both alerts and SLOs at a low number²³. Focusing on the customer experience, rather than all the services and other components that make up that experience helps ensure that. By alerting on symptoms, rather than potential causes for issues, we can also identify issues in a more comprehensive way⁴, as anything that may negatively affect the customer experience will be noticed by the signal at the edge.

Let's see how this works in practice by taking the following SLO as an example: "Catalog Service has 99.9% availability". Let's assume Catalog Service is an edge service responsible for providing our customers with the catalog information, its categories, and the articles included in each category. If that service is not available, customers cannot browse the Catalog. Because it is an edge service it can fail due to issues in any of the downstream services. That, in turn, would negatively affect the availability SLO. Any breach of the SLO means that the customer experience is also affected. Due to the connection between the SLO's performance and the customer experience we come to the conclusion that the degradation of the SLI "Catalog Service availability" is a symptom of a degraded customer experience. The SLO sets a threshold after which that degradation is no longer acceptable, and immediate action is required. Or in other words, we should page when our SLO is missed, or in danger of being missed.

From this we derived the following formula:

Service Level Objective = Symptom + Target

Essentially, we wanted to capture high level signals (or symptoms) that represented customer interactions. These signals could be captured at the edge services that communicate with our customers. If those signals degraded, then the customer experience degraded. Regardless of whatever it was that caused that degradation. If we couple that with an SLO, then, following the formula above, we get our alert threshold implicitly.
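As a rough illustration of the formula, the sketch below models an SLO as a symptom (the SLI being measured) plus a target, and checks measured traffic against it. The class name, field names, and request counts are hypothetical and only serve the example; this is not how our tooling represents SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    symptom: str   # the SLI that represents the customer experience
    target: float  # required fraction of successful events, e.g. 0.999

    def is_met(self, good_events: int, total_events: int) -> bool:
        """The alert threshold falls out of the target: below it, the SLO is missed."""
        if total_events == 0:
            return True  # no traffic, nothing was degraded
        return good_events / total_events >= self.target


slo = ServiceLevelObjective(symptom="Catalog Service availability", target=0.999)

print(slo.is_met(good_events=99_950, total_events=100_000))  # within the target
print(slo.is_met(good_events=99_800, total_events=100_000))  # SLO missed
```

The first call reports the SLO as met (99.95% availability), the second as missed (99.8%), which is the point at which a page would be warranted.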

There is an additional feedback loop between SLOs and symptom based alerts when you couple them like that:

If you get too many pages, then the respective SLO should be reviewed, even if temporarily.
If you get too few pages, then maybe you can raise the SLO, as you are overdelivering.
If you have a customer experience that is not covered by an alert, then you likely also identified a new SLO.

The problem with setting up alerts at those edge services, however, was that it would always fall down to the team owning those services to receive the paging alerts and perform the initial triage to figure out what was going on.

While the concept seemed solid, and made a lot of sense, we were still missing one key ingredient: how could we measure and page based on these symptoms, without burning out the team at the edge given they'd be paged all the time?

Introducing Operation Based SLOs

When rolling out Distributed Tracing in the company, one of the challenges we faced was where to begin with the service instrumentation work to showcase its value early on. Our first instinct was to instrument the Tier 1 services (the most critical ones). We decided against this approach because we wanted to observe requests end-to-end, and instrumenting services by their criticality would not give us the coverage across system boundaries we were aiming for. Also, it is relevant to highlight that Tracing is an observability mechanism that is operation based, so we thought that going with a service based approach would be counter-intuitive. We then decided to instrument a complete customer operation from start to finish. But the question then became: "Which operation(s)?".

Earlier, for our Cyber Week load testing efforts, SREs and other experienced engineers compiled a list of "User Functions". These were customer interactions that were critical to the customer-facing side of our business. Zalando is an e-commerce fashion store, so operations like "Place Order" or "Add to Cart" are key to the success of the customer experience, and to the success of the business. The criticality argument was also valid to guide our instrumentation efforts, so that is what we used to decide which operations to instrument. This list became a major influence on the work we did from then on.

One of the key benefits we quickly got from Distributed Tracing was that it allowed us to get a comprehensive look at any given operation. From looking at a trace we could easily understand what the key latency contributors were, or where an error originated in the call chain. As these quick insights started becoming commonplace during incident handling, we started wondering if we could automate this triage step.

That train of thought led us to the development of an alert handler called Adaptive Paging (you can see the SRECon talk to learn more details about Adaptive Paging). When this alert handler is triggered, it reads the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem. Essentially, by taking Adaptive Paging, and having it monitor an edge operation, we achieved a viable and sustainable implementation of Symptom Based Alerting.

Adaptive Paging will traverse the Trace and identify the team to be paged
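The traversal idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual Adaptive Paging algorithm: the `Span` structure, the team names, and the rule "follow errored spans down the call chain and page the owner of the deepest one" are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    operation: str
    team: str                      # team owning the service that emitted the span
    error: bool = False            # whether the span was marked as failed
    children: List["Span"] = field(default_factory=list)


def team_to_page(span: Span) -> Optional[str]:
    """Follow errored spans down the call chain and return the owner of the deepest one."""
    if not span.error:
        return None
    for child in span.children:
        team = team_to_page(child)
        if team is not None:
            return team            # the error originates further down the chain
    return span.team               # no errored child: this span is closest to the problem


# Hypothetical trace of a failed "Place Order" operation.
trace = Span("place_order", "checkout", error=True, children=[
    Span("reserve_stock", "logistics"),
    Span("authorize_payment", "payments", error=True, children=[
        Span("query_fraud_score", "risk", error=True),
    ]),
])

print(team_to_page(trace))
```

In this hypothetical trace the error surfaces at the edge operation owned by the checkout team, but the sketch selects the risk team, which owns the deepest failing span.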

But rather than going around promoting Adaptive Paging as another tool that engineers could use to be alerted, we were a bit more selective. A single Adaptive Paging alert, monitoring an edge operation can cover all the services in the call chain, which span multiple teams. No need to have every individual team monitoring their own operations, when a single alert would serve the same purpose (while being less noisy, and easier to manage). And figuring out what to alert on was rather straightforward thanks to our list of "User Functions". We renamed it to Critical Business Operations (CBO), to be able to encompass more than strictly user operations, and once again followed that list to identify the signals we wanted to monitor. Alerts need a threshold to work, though. Picking alert thresholds was always a challenging task. If we are talking about an alert handler that can page any number of teams across several departments, this becomes an even more sensitive topic that requires stronger governance.

Our list of CBOs was a customer centric list of symptoms that could "capture more problems more comprehensively and robustly". And SLOs should represent the "most critical aspects of the user experience". Basically, all we needed was a target (which would be our alert threshold) and we would also have SLOs. CBOs then became an implementation of Operation Based SLOs.

Let’s take as an example "Place Order". This operation is clearly critical to our business, which is why it was one of the first to make the Critical Business Operations list. As there are many teams and departments owning services that are contributing to this operation, the ownership for the SLO is critical. We chose the senior manager owning the customer experience of the Checkout and Sales Order systems to define and be accountable for the SLO of the "Place Order" operation. This also ensured that SLO had management support. We repeated this process for the remaining CBOs. We identified the senior managers responsible for each of the CBOs (Directors, VPs and above) and discussed the SLOs for those operations. With each discussion we would end up with: a CBO with an SLO signed off by senior management; and a new alert on that same CBO that would be sure to page only on situations where customers were truly affected.

Our Operation Based SLOs tackled the issues we had with the service based approach:

Service Based SLOs	Operation Based SLOs
High number of SLOs.	A short list of SLOs, easier to maintain as changes in service landscape have no implications on the SLO definition.
Difficult mapping from services to products.	SLOs are now agnostic of the services implementing the Critical Business Operations.
SLOs on a fine grained level made it challenging for management to align on them.	Products have owners. We also changed the approach from bottom-up, to top-down to bring additional transparency to that ownership.

There were additional benefits that came with this new strategy:

Longevity of the SLOs → "View Product Details" is something that has always existed in the company’s history, but as a feature it has gone through different services and architectures implementing it.
Using SLOs to balance feature development with reliability → Before, the lack of ownership meant that teams were not clear when to stop feature development work to improve reliability should the availability decline. Now they had a clear message from the VP or Director that the SLO was a target that had to be met.
Out-of-the-box alerts → Our Adaptive Paging alert handler was designed to cover CBOs. As soon as a CBO has an SLO, it can have an alert with its thresholds derived from the SLO.
Transport agnostic measurements → Availability SLOs no longer need to be about 5xx rate, or using additional elaborate metrics. OpenTracing’s error tag makes it a lot easier for engineers to signal an operation as conceptually failed. This enables the graceful degradation scenario mentioned earlier.
Understanding impact during an incident → 50% error rate in Service Foo is not easily translatable to customer or business impact, without deep understanding of the service landscape. A 50% error rate on “Add to cart” is much clearer to communicate and derive urgency of needing to be addressed immediately.
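To illustrate the transport agnostic measurement, the following sketch computes availability from a conceptual error flag instead of the HTTP status code alone. It does not use a tracing library; `OperationResult`, `fallback_level`, and the rule that a second fallback counts as an error are hypothetical stand-ins for what a service would signal through the error tag on its span.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OperationResult:
    http_status: int
    fallback_level: int  # 0 = ideal response, 1 = first fallback, 2 = second fallback


def is_error(result: OperationResult, max_acceptable_fallback: int = 1) -> bool:
    """A response is a failure if it errored outright or degraded beyond what is acceptable."""
    return result.http_status >= 500 or result.fallback_level > max_acceptable_fallback


def availability(results: List[OperationResult]) -> float:
    """Fraction of operations that were conceptually successful."""
    failures = sum(1 for result in results if is_error(result))
    return 1.0 - failures / len(results)


results = [
    OperationResult(http_status=200, fallback_level=0),  # ideal response
    OperationResult(http_status=200, fallback_level=1),  # "good enough" fallback
    OperationResult(http_status=200, fallback_level=2),  # 200 OK, but unacceptable quality
    OperationResult(http_status=503, fallback_level=0),  # plain failure
]

print(availability(results))
```

Counting only 5xx responses would report 75% availability for these four requests, while the conceptual definition reports 50%, because the second fallback is treated as a failure even though the client received a successful response.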

SRE continued the rollout of CBOs by working closely with the senior management of several departments agreeing on SLOs that would be guarded by our Adaptive Paging alert handler. With this we also continued the adoption of Symptom Based Alerting. As more and more CBOs were defined, we needed to improve the reporting capabilities of our tooling, and developed a new Service Level Management tool that catered to this operation based approach.

Our Service Level Management Tool (operation based - not actual data)

As the coverage of CBOs and their respective alerts took off, we started getting reports that the alerts were too sensitive. Particularly, there were multiple occasions of short lived error spikes that resulted in pages to on-call responders. To prevent these situations, engineers started adding complex rules to the alerts on a trial and error basis (usually using time of day, throughput, duration of the error condition). SRE was aiming at creating alerts that did not require much effort from engineers to set them up, with no fine tuning required, or that would not change as components and architecture evolved. We were not there yet, but we soon evolved our Adaptive Paging alert handler to use the Multi Window Multi Burn Rate strategy which uses burn rates to define alert thresholds. The Error Budget became much more relevant with this change. The alerts went from being triggered whenever the error rate breached the SLO, to having the decision of whether a page should go out or not based on the rate we are burning the error budget for an operation. This not only prevented on-call responders from being paged by short lived error spikes, but also meant we could pick up on slowly burning error conditions. Because the Error Budget is derived from the SLO, it is still the SLO that made it possible to derive the alert threshold automatically. Together with the adaptability of Multi Window Multi Burn Rate which made it unnecessary to fine tune alerts, this meant engineering teams required no effort to set up and manage these alerts. We also made sure that the Error Budget was visible in our new Service Level Management tool.

Error Budget over three 28 day periods
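The burn rate logic can be sketched as follows. This is a minimal illustration rather than our alert handler: the window pairs and thresholds (14.4 for 1 hour / 5 minutes, 6 for 6 hours / 30 minutes) are the commonly cited example values from Google's SRE Workbook, which assume a 30 day SLO period, and the error rates are made up.

```python
from typing import Dict

# (long window, short window, burn rate threshold); example values, not our configuration.
WINDOWS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Speed of error budget consumption; 1.0 spends the budget exactly over the SLO period."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(error_rates: Dict[str, float], slo_target: float) -> bool:
    """Page only if both the long and the short window of some pair exceed the threshold."""
    return any(
        burn_rate(error_rates[long_window], slo_target) >= threshold
        and burn_rate(error_rates[short_window], slo_target) >= threshold
        for long_window, short_window, threshold in WINDOWS
    )


slo_target = 0.999

# Short lived spike: high error rate in the short windows only.
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.002, "6h": 0.0005}
# Slow burn: a modest but sustained error rate across all windows.
slow_burn = {"5m": 0.008, "30m": 0.008, "1h": 0.008, "6h": 0.008}

print(should_page(spike, slo_target))
print(should_page(slow_burn, slo_target))
```

The spike does not page because the long windows stay below their thresholds, while the slow burn pages through the 6 hour / 30 minute pair even though its error rate never comes close to the spike's peak.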

Putting this model to the test

Everything we described so far seems to make perfect sense. And as we explained it to several teams, no one seemed to make any argument against it. But still, we were not seeing the initiative gaining the momentum we expected. Even teams that did adopt CBOs weren’t disabling their cause based alerts. Something was missing. We needed the data to support our claims of a better process that would reduce false positive alerts, while ensuring wide coverage of critical systems. That’s what we set out to do, by dogfooding the process within the department.

For 3 months we put the whole flow to the test within the SRE department. We defined and measured CBOs for our department, with their SLO targets (at the same time demonstrating that this approach wasn’t exclusively for the use of end-user or external customer systems). Because SRE owns the Observability Platform our CBOs included operations like "Ingest Metrics", or "Query Traces". Those CBOs were monitored by Adaptive Paging alerts. Within our weekly operational review meeting we would look at the alerts and incidents created in the previous week, and gradually identify which cause based alerts could be safely disabled or not. All of this had the support of senior management, granting engineers the confidence to take these steps.

By the end of that quarter we reduced the False Positive Rate for alerts within the department from 56% to 0%. We also reduced the alert workload from 2 to 0.14 alerts per day. And we did this without missing any relevant user-facing incidents. In the process we disabled over 30 alerts from all the teams in the department. Those alerts were either prone to False Positives, or already covered by the symptom based alerts.

One thing the on-call team did bring up was that shifts had become too calm. They risked losing their on-call ‘muscle’. We tackled this with regular "Wheel of Misfortune" sessions, to keep knowledge fresh, and review incident documentation and tooling.

What's next?

We are not done yet with our goal of rolling out Operation Based SLOs. There are still more Critical Business Operations that we can onboard, for one. And as we onboard those operations, teams can start turning off their cause based alerts that lead to false positives.

And there are additional evolutions we can add to our product.

Alerting on latency targets

Right now, CBOs only set Availability targets. We also want CBO owners to define latency targets. After all, our customers not only care that the experience works, but also that it is fast. While we already have the latency measurements, and could, technically, trigger alerts when that latency breaches the SLO, it is challenging to use our current Adaptive Paging algorithm to track the source of the latency increase. We don’t want to burden the team owning the edge component with every latency alert, so we are holding off on those alerts until a proper solution is found.

Event based systems

So far we’ve been focusing on direct end-customer experiences, which are served mostly by RPC systems. There is a good chunk of our business that relies on event based systems, and that we also want to cater for with our CBO framework. This is quite the undertaking, as monitoring of event based systems is not as well established as traditional HTTP APIs. Also, Distributed Tracing, the telemetry pillar behind our current monitoring and alerting of CBOs, was not designed with an event based architecture in mind. And the loss of the causality property reduces the usefulness of our Adaptive Paging algorithm.

Non-edge customer operations

We always tried to measure customer experience as close to the edge as possible. There are, however, some operations that are deeper in the call chain, but would still benefit from closer monitoring. To prevent an uncontrolled growth of CBOs, well defined criteria need to be in place to properly identify and onboard these operations.

Closing notes

Operation Based SLOs granted us quite a few advantages over Service Based SLOs. Through this type of SLOs we were also able to implement Symptom Based Alerting, with clear benefits for the on-call health of our engineers. And we were even able to demonstrate the effectiveness of this new approach with numbers, after trialing it within the SRE department.

But the purpose of this post is not to present a new and better type of SLOs. We see operation based SLOs and service based SLOs as different implementations of SLOs. Depending on your organization, and/or architecture, one implementation or the other may work better for you. Or maybe a combination of the two.

Here at Zalando we are still learning as the adoption of this framework grows in the organization. We'll keep sharing our experience when there are significant changes through future blog posts. Until then we hope this inspired you to give operation based SLOs a try, or that it inspires the development of a different implementation of SLOs.

Zalando's Machine Learning Platform

2022-04-19T00:00:00+02:00

To optimize the fashion experience for 46 million of our customers, Zalando embraces the opportunities provided by machine learning (ML). For example, we use recommender systems so you can easily find your favorite shoes or that great new shirt. We want these items to fit you perfectly, so a different set of algorithms is at work to give you the best size recommendations. Our demand forecasts will ensure that everything is in stock, even when you decide to make a purchase in the middle of a Black Friday shopping spree.

As we grow our business, we look for innovative ideas to improve user experience, become more sustainable, and optimize existing processes. What does it take to develop such an idea into a mature piece of software operating at Zalando's scale? Let's look at it from the point of view of a machine learning practitioner, such as an applied scientist or a software engineer.

Experimenting with Ideas

Jupyter notebooks are a frequently used tool for creative exploration of data. Zalando provides its ML practitioners with access to a hosted version of JupyterHub, an experimentation platform where they can use Jupyter notebooks, R Studio, and other tools they may need to query available data, visualize results, and validate hypotheses. Internally we call this environment Datalab. It is available via a web browser, comes with web-based shell access and common data science libraries.

Because Datalab provides pre-configured access to various data sources within Zalando, such as S3, BigQuery, MicroStrategy, and others, its users don't have to worry about setting up the necessary tools and clients on their own laptops. Instead, they're ready to start experimenting in less than a minute.

While Datalab is well suited for prototyping and getting quick feedback, it's not always enough, especially when big data is involved. Apache Spark is much better suited for that purpose, and Zalando users can access it via Databricks. It's a well-known tool within the data science community, suitable for both experimentation via notebooks and for running large-scale data processing jobs in Spark clusters.

Some experiments require extra processing power, e.g. when they involve computer vision or training of large models. For these purposes, our applied scientists have access to a high-performance computing cluster (HPC) equipped with powerful GPU nodes. Using the HPC is as easy as connecting to it via SSH.

ML Pipelines in Production

One of the most frequently discussed problems in machine learning is crossing the gap between experimentation and production, or in more crude terms: between a notebook and a machine learning pipeline.

Jupyter notebooks don't scale well to requirements typical for running ML in a large-scale production environment. These requirements include secure and privacy-respecting access to large datasets, reproducibility, high performance, scalability, documentation, and observability (logging, monitoring, debugging). A machine learning pipeline is a sequence of steps that can meet these additional requirements, and describes how the data will be extracted and processed, what hardware infrastructure is required, and how to train and deploy the model. Additionally, ML pipelines at Zalando should follow best practices of software engineering: the code needs to be stored in git, clean, readable, and reviewed by at least two people. An ML pipeline can be visualized as a graph, like the one shown below.

But how does one implement such a pipeline? In early 2019 we at Zalando decided to use AWS Step Functions for orchestrating machine learning pipelines. Step Functions is a platform for building and executing workflows consisting of multiple steps that may call various other services, such as AWS Lambda, S3 and Amazon SageMaker. These calls can be used to perform all steps comprising an ML pipeline, from data processing (e.g. by invoking Databricks API), to running training and batch processing jobs in Amazon SageMaker and creating SageMaker endpoints for real-time inference. The fact that Zalando already used AWS as its main cloud provider, and the flexibility provided by integrations with other services made Step Functions a good fit for our machine learning needs.

A Step Functions workflow is a state machine that can either be created visually using an editor provided by AWS or deployed as a JSON or YAML file known as a CloudFormation (CF) template. CloudFormation is another AWS service that implements the concept of infrastructure as code, and allows developers to specify needed AWS resources in a text file. We can thus use a CF template to describe Lambda functions and security policies used by the Step Functions workflow that is our ML pipeline. After the template is deployed to AWS, CloudFormation will create all resources listed in the file.
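To give a feeling for what such a state machine looks like, here is a small Python sketch that chains three Task states into a definition in the Amazon States Language used by Step Functions. The helper `sequential_state_machine` and the state names are hypothetical, and the `Parameters` block that each service integration needs is left out, so the result illustrates the structure rather than something deployable as is.

```python
import json
from typing import Dict, List, Tuple


def sequential_state_machine(stages: List[Tuple[str, str]]) -> Dict:
    """Chain (state name, resource ARN) pairs into a sequential state machine definition."""
    states = {}
    for index, (name, resource) in enumerate(stages):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(stages):
            state["Next"] = stages[index + 1][0]  # continue with the following stage
        else:
            state["End"] = True                   # the last stage terminates the workflow
        states[name] = state
    return {"StartAt": stages[0][0], "States": states}


definition = sequential_state_machine([
    ("DataProcessing", "arn:aws:states:::lambda:invoke"),
    ("Training", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("BatchInference", "arn:aws:states:::sagemaker:createTransformJob.sync"),
])

print(json.dumps(definition, indent=2))
```

Even for three sequential steps the generated JSON is fairly long, and a real template additionally has to declare parameters, Lambda functions, and security policies.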

CloudFormation templates are highly expressive and allow developers to describe even minute details. Unfortunately, CF files can become verbose and are tedious to edit manually. We addressed this problem by creating zflow, a Python tool for building machine learning pipelines. Since its creation, zflow has been used to create hundreds of pipelines at Zalando.

A pipeline in a zflow script is a Python object with a series of stages attached to it. zflow provides a number of custom functions for configuring ML tasks, for example training, batch transform, and hyperparameter tuning. It also offers flow control so stages can be run conditionally or in parallel. Together these functions form a Domain Specific Language (DSL) for describing pipelines in a concise and readable form. Because zflow code is annotated with type hints, users can spot mistakes early on, and the available warnings go beyond simple syntax checks available for JSON and YAML templates.

The code listing below demonstrates an example zflow pipeline, with some configuration options omitted for brevity. It shows how three stages are created and added to a pipeline in the desired order. The pipeline is then added to a stack (a group of CloudFormation resources). The last line specifies where the resulting template should be saved.

data_processing = databricks_job("data_processing_job")
training = training_job("training_job")
batch_inference = batch_transform_job("batch_transform_job")

pipeline = PipelineBuilder("example-pipeline")
pipeline \
    .add_stage(data_processing) \
    .add_stage(training) \
    .add_stage(batch_inference)

stack = StackBuilder("example-stack")
stack.add_pipeline(pipeline)

stack.generate(output_location="zflow_pipeline.yaml")

When a pipeline script is executed, zflow uses AWS CDK to generate a CloudFormation template file. The file contains all the information needed to create the necessary AWS resources. All that is needed now is to commit and push the generated template to the git repository and let the Zalando Continuous Delivery Platform (CDP) deploy it to AWS. When that is done, our pipeline will appear in the AWS Console as a Step Functions state machine. It can then be executed, either via scheduler (like in our example), manually in the Console, or programmatically via an API call.
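The idea of turning an ordered list of stages into a workflow can be illustrated with a toy builder. This is not how zflow is implemented (zflow delegates template generation to AWS CDK); `ToyPipeline`, its methods, and the placeholder resource strings are hypothetical and only show how a fluent `add_stage` chain could map onto linked states.

```python
from typing import Dict, List


class ToyPipeline:
    """Hypothetical stand-in for a pipeline builder; not part of zflow."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._stages: List[str] = []

    def add_stage(self, stage_name: str) -> "ToyPipeline":
        self._stages.append(stage_name)
        return self  # returning self is what enables method chaining

    def to_definition(self) -> Dict:
        """Link each stage to the next one; the last stage ends the workflow."""
        if not self._stages:
            raise ValueError("a pipeline needs at least one stage")
        states: Dict[str, Dict] = {}
        for index, stage in enumerate(self._stages):
            # The resource string is a placeholder, not a valid ARN.
            state: Dict = {"Type": "Task", "Resource": f"arn:placeholder:{stage}"}
            if index + 1 < len(self._stages):
                state["Next"] = self._stages[index + 1]
            else:
                state["End"] = True
            states[stage] = state
        return {"StartAt": self._stages[0], "States": states}


definition = (
    ToyPipeline("example-pipeline")
    .add_stage("data_processing_job")
    .add_stage("training_job")
    .add_stage("batch_transform_job")
    .to_definition()
)
```

The order of the `add_stage` calls alone determines the `Next` links, which is why the script reads like the pipeline it describes.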

With zflow, a pipeline can be coded in a concise way, tested, then versioned in a git repository, deployed, run, and scaled as needed. To ensure that it works as expected, we can track its executions using a custom web interface. Pipeline tracking is a part of the internal Zalando developer portal running on top of Backstage, an open-source platform for building such portals. Here is a screenshot of a series of pipeline executions in the ML portal.

This ML web interface provides a detailed, real-time view of pipeline execution. Pipeline authors can monitor how metrics evolve across multiple runs of training pipelines and can view these changes on a graph. They can also view model cards for models created by the pipelines. These are just a few features of the ML portal, and the tool is actively developed to improve the process of experimenting with notebooks and deploying the pipelines in production.

The detailed journey of a pipeline is shown in the diagram below.

Admittedly, that's a lot to take in! Let's summarize the steps and tools we discussed so far:

We use JupyterHub, Databricks, and a high-performance computing cluster for ML experimentation.
We describe our ML pipelines in Python scripts with zflow DSL. Pipelines can use various resources, such as Databricks jobs for big data processing and Amazon SageMaker endpoints for real-time inference.
When we run the pipeline script, zflow will internally call AWS CDK to generate a CloudFormation template.
We commit and push the template to a git repository, and Zalando Continuous Delivery Platform will then upload it to AWS CloudFormation.
CloudFormation will create all the resources specified in the template, most notably: a Step Functions workflow. Our pipeline is now ready to run.
A web portal built with Backstage provides a visual overview of running pipelines, together with additional information relevant to ML practitioners.

zflow and the dedicated web UI abstract away most of the complexity of building production pipelines with AWS tooling, such as CDK and CloudFormation, so ML practitioners can focus on their domain rather than the infrastructure. While zflow takes full advantage of AWS, it also allows us to integrate other tools used within the company and to quickly respond to our specific needs.

The Organization

Tooling is just one side of using any technology. Another aspect is the organizational structure that allows experts to work and collaborate effectively. While applying ML within the company, Zalando uses a distributed setup with additional resources in place to support reusing tools and practices across the organization. Most expertise is spread across over a hundred product teams working in their specific business domains. These teams have dedicated software engineers and applied scientists who in their daily work use both 3rd party products (e.g. AWS, Databricks) and internal tools (zflow, ML web portal).

Our experts are assisted by a few central teams which operate and develop some of the aforementioned tools. For example, a dedicated team provides support and improvements to our JupyterHub installation and the HPC cluster. Two teams actively develop zflow and monitoring tools for pipelines. Another group consisting of ML consultants works closely with product teams, offering trainings, architectural advice, and pair programming. A separate research team actively explores and disseminates the state-of-the-art in algorithmics, deep learning, and other branches of AI.

On top of that, our data science community provides platforms to exchange best practices from internal teams, academia, and the rest of the industry through expert talks, workshops, reading groups, and an annual internal conference.

Exciting Times

Teams at Zalando tackle many of the difficult problems in the space of machine learning and MLOps, such as reducing the time needed to validate and implement new ideas at scale and improving model observability. We constantly look for new ways to use technology to be faster, more efficient, and innovative in meeting all fashion-related needs of our customers. Best news: we would like to work with you on these exciting ML challenges!

Functional tests with Testcontainers

2022-04-12T00:00:00+02:00

In this article, I will show how teams at Zalando Marketing Services are using functional tests. We will cover the idea behind functional tests: the main concept and the attributes of a good functional test. Then, we will discuss an example based on the Testcontainers library used in the Spring environment.

An introduction to the Testcontainers library is out of scope for this article; you can find one in my previous article, Integration tests with Testcontainers.

Definition of functional test

There are many definitions of functional testing. For example, the definition found on Wikipedia is:

“Functional testing is a quality assurance (QA) process and a type of black-box testing that bases its test cases on the specifications of the software component under test. Functions are tested by feeding them input and examining the output, and internal program structure is rarely considered (unlike white-box testing). Functional testing is conducted to evaluate the compliance of a system or component with specified functional requirements. Functional testing usually describes what the system does.”

Functional tests answer the fundamental question: Do the features work as intended? Functional tests do not answer the question of HOW the system works internally, but rather WHAT the result should be.

Non-functional vs. functional testing

What is the key difference between non-functional software testing and functional testing?

The answer is relatively simple: non-functional testing is concerned with how, and functional testing is concerned with what. Functional testing verifies what the system should do, and non-functional testing tests how well the system works. The intention of functional testing is to verify software actions, and non-functional testing validates the behavior of the application.

Another comparison you might see when discussing this is black-box testing vs white-box testing. Black-box testing looks at the functionality of the software without looking at the internal structures. White-box testing is aware of the internal structures.

Concept

Testcontainers is a JVM library that allows users to run and manage docker containers and control them from Java code. Zalando uses it mainly for integration and functional tests.

The main purpose of functional tests with the Testcontainers library is to set up a black-box test, by using an environment closest to the production one. To achieve this:

package and run your service in a docker container;
run all its dependencies, like: database, queues, streams, as separate docker containers;
make your service connect to locally run dependencies;
make your testing code independent of implementation;

The structure of invocation can look like below.

Your entire production code needs to be packaged and run as a docker image. If your service needs to communicate with the database, you need to run the database as a docker image as well. Your functional tests will test your code running as a docker image, so your testing code does not have any connection to production code.

You also need to remember that a proper pyramid of tests is (when sorted from the highest to the lowest number of tests):

unit tests
component tests
integration tests
functional tests
system tests

It is very nice to have functional tests, but they should not dominate your testing structure.

Packaging your application into a docker container

Packaging your application into a docker image is pretty simple. In the root of your repository, define a Dockerfile like this:

FROM openjdk:17-alpine
COPY service/target/application-exec.jar application.jar
EXPOSE 8080
ENTRYPOINT java ${ADDITIONAL_JAVA_OPTIONS} -jar application.jar

As an alternative solution, I would suggest using Jib.

Code separation

I recommend organizing code into a multi-module maven project with two modules: service and functional-tests. The functional-tests module cannot have any dependency on the service module.

.
├── service
│   └── pom.xml
├── functional-tests
│   └── pom.xml
├── Dockerfile
└── pom.xml

Because we don’t have access to the service code, we cannot use any DTO objects, database repositories, etc.

We should operate on the simplest possible interfaces. For example, if we call a REST endpoint, send plain JSON and read JSON. Don’t create any internal DTOs. It would place you in the position of a real client of your service.
I recommend using only official interfaces to create resources, e.g. create entities via the REST interface. We could create the entity directly inside the database and inside the test to just retrieve it, but it would not be a black-box test then. If there are changes to the storage of the service in the future, we would need to change our tests.

AbstractFunctionalTest

All functional tests extend the AbstractFunctionalTest class where all needed docker images are run. In our example, I will run my microservice which is connected to the database.

public class AbstractFunctionalTest {
  private static final int HTTP_PORT = 8080;
  private static final int DEBUG_PORT = 5005;
  private static final Logger LOGGER =
      LoggerFactory.getLogger("Docker-Container");
  private static final Network network = Network.newNetwork();

  public static final PostgreSQLContainer postgreSQLContainer =
    (PostgreSQLContainer) new PostgreSQLContainer("postgres:14.2")
    .withUsername("username")
    .withPassword("password")
    .withDatabaseName("databaseName")
    .withNetwork(network)
    .withNetworkAliases("postgres");

  private static final GenericContainer<?> backendContainer;

  static {
    postgreSQLContainer.start();
    backendContainer = ofNullable(System.getenv("CONTAINER_VERSION"))
      .map(version ->
          new ServiceContainer("docker-repository/application", version))
      .orElseGet(() -> new ServiceContainer(".", Paths.get("../")))
      .withExposedPorts(HTTP_PORT, DEBUG_PORT)
      .withFixedExposedPort(DEBUG_PORT, DEBUG_PORT)
      .withEnv("SPRING_PROFILES_ACTIVE", "functional")
      .withEnv("ADDITIONAL_JAVA_OPTIONS",
          "-agentlib:jdwp=transport=dt_socket,"
        + "server=y,suspend=n,address=0.0.0.0:" + DEBUG_PORT)
      .withNetwork(network)
      .withCreateContainerCmdModifier(cmd -> cmd.withName("application"))
      .withLogConsumer(new Slf4jLogConsumer(LOGGER)
          .withPrefix("Service"))
      .waitingFor(Wait.forHttp("/actuator/health").forPort(HTTP_PORT)
          .withStartupTimeout(Duration.ofMinutes(2)));
    backendContainer.start();
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      backendContainer.stop();
      postgreSQLContainer.stop();
    }));
  }
}

As an alternative solution, I would suggest creating a JUnit 5 extension. In this case, we would use an annotation instead of inheritance, with the same logic.

Logging

When running the docker image with our service, it is critical to add logging. Without it, you lose visibility into errors. Don't forget to add a logger to the container code:

.withLogConsumer(new Slf4jLogConsumer(LOGGER).withPrefix("Service"))

Stopping images

One of the biggest advantages of the Testcontainers library is that it starts a Ryuk container, which stops all other containers when the initial JVM process is terminated. It protects us from unwanted zombie containers (and networks, volumes) in the system. But if you run docker images from multiple maven modules, the Ryuk container can be too slow and the build can crash. That’s why I additionally register a shutdown hook, which stops all docker containers when test execution finishes.

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  backendContainer.stop();
  postgreSQLContainer.stop();
}));

Example of a functional test

An example functional test can look like the one below. The test method uses several helper methods to keep the test simple. Helper methods are key to making the code readable.

public class AccountFunctionalTest extends AbstractFunctionalTest {

  @Test
  void shouldUpdateAccount() throws JSONException {
    // given
    createAccount();

    // when
    ResponseEntity<String> response = updateAccount();

    // then
    assertThat(response.getStatusCodeValue())
        .isEqualTo(HttpStatus.NO_CONTENT.value());
    var actual = getAccount("00000000-0000-0000-0000-000000000001");
    var expected = readFromResources("get_account_dto.json");
    JSONAssert.assertEquals(expected, actual, JSONCompareMode.LENIENT);
  }

  private void createAccount() {
    var json = readFromResources("create_account_dto.json");
    ResponseEntity<String> response = getTestRestTemplate()
        .exchange("/accounts",
            HttpMethod.POST,
            new HttpEntity<>(json, getPostHeaders()),
            String.class);
    assertThat(response.getStatusCodeValue())
        .isEqualTo(HttpStatus.CREATED.value());
  }

  private ResponseEntity<String> updateAccount() {
    // Fetch the current ETag first so the PATCH request can carry it
    var etag = getEtag("00000000-0000-0000-0000-000000000001");
    return getTestRestTemplate()
        .exchange("/accounts/00000000-0000-0000-0000-000000000001",
            HttpMethod.PATCH,
            new HttpEntity<>(readFromResources("patch_account_dto.json"),
                getPatchHeaders(etag)),
            String.class);
  }

  private String getEtag(String id) {
    ResponseEntity<String> response = getTestRestTemplate()
      .getForEntity("/accounts/{id}", String.class, id);
    return response.getHeaders().getETag();
  }

  private String getAccount(String id) {
    ResponseEntity<String> response = getTestRestTemplate()
      .getForEntity("/accounts/{id}", String.class, id);
    return response.getBody();
  }

  private HttpHeaders getPostHeaders() {
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.APPLICATION_JSON);
    return headers;
  }

  private HttpHeaders getPatchHeaders(String etag) {
    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(
        new MediaType("application", "merge-patch+json"));
    headers.add(HttpHeaders.ETAG, etag);
    return headers;
  }
}

Advantages of functional tests

The biggest advantages of functional tests are:

We force engineers to think about the API first principle.
We are able to test the service as a black box, meaning that when you have good functional test coverage, you are able to do a deep refactoring without changing the functional tests.
It gives developers a lot of confidence that the code does what it should do.
You are sure that your application is correctly packed as a docker image, so another layer of application is tested.
Functional tests give you a lot of confidence that the application works as expected. I find it very useful during code refactoring.

Disadvantages of functional tests

Writing functional tests can be time-consuming. Especially when something doesn’t work as expected, debugging becomes much harder. From a different point of view, if you have well-written helper classes you can speed up this process.
Because functional tests run services and dependencies (like databases and queues) as docker images, we need to start them at least once, and that is usually slow. For example, PostgreSQL as a docker image needs around 4 seconds to start on my machine, while Localstack, which emulates AWS components, can take much longer to start, even 20 seconds.
In an ideal world, we should run new containers for each test, but it would be way too slow. So, we need to run it once for all tests. If functional tests are written in a bad way, they can make tests interfere with each other. It is critical that tests use different object identifiers and that there is a clean state after the test.

Summary

Unit tests force developers to think about methods. Functional tests do the same for applications/components.

I find functional tests to be an interesting concept. The Testcontainers library makes it possible to use this concept in the Java world. They can be pretty expensive to implement, but they also give you great confidence that a system still works during deep refactoring.

Functional tests implemented in this way are not for everybody. I would suggest using them in systems where microservice contracts do not change very fast. Despite the high cost of development, they give us a very high level of confidence that the delivered applications work as intended.

Code example

You can find examples of usages in my GitLab project.

GraphQL persisted queries and Schema stability

2022-02-17T00:00:00+01:00

Persisted Queries

Persisted Queries in GraphQL are like stored procedures in databases. To learn about Apollo's approach to automatic persisted queries, please follow their documentation here. At Zalando, we took a different approach - we disable GraphQL in production. It might sound counterintuitive at first - we have a GraphQL service, but we disable GraphQL in production - why?

Let us go over how the system works and explain the reasons for how it helps us maintain a stable schema.

Part 1: Build time persistence

At development time for the web and apps, the developers enjoy the power of GraphQL - the automatic code and type generation, combining multiple parts of the application to send queries and aggregation of those queries to perform one optimized batched request, etc.

When the code in the UI layers (web and app) is actually merged to the main deployment branch, at the build time, there is one extra step - persist the queries to the GraphQL service. The GraphQL service generates an ID for a particular query (ID is just the hash of the normalized query in terms of formatting and operation selection), and returns it back to the UI layers to bundle with the actual built files.

When the actual query is used in production, the GraphQL service does not allow GraphQL queries, but rather only allows the query IDs that are persisted. So, instead of the request looking like this:

POST /graphql

{
  "query": "query productCard($id: ID!) { product(id: $id) { name } }",
  "variables": {
    "id": "12345"
  }
}

it would look like this - with id instead of query:

POST /graphql

{
  "id": "a1b2c3",
  "variables": {
    "id": "12345"
  }
}
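The two parts of this flow (persisting at build time, resolving by ID in production) can be sketched in a few lines. This is an illustrative model only: the `PersistedQueryStore` class is hypothetical, SHA-256 is an assumed hash function, and the whitespace-only normalization is a simplification of whatever normalization the real service performs.

```python
import hashlib
from typing import Dict


class PersistedQueryStore:
    """Hypothetical in-memory store; a real service would use a database."""

    def __init__(self) -> None:
        self._queries: Dict[str, str] = {}

    @staticmethod
    def normalize(query: str) -> str:
        # Simplification: collapse whitespace only. A real implementation would
        # more likely parse the query and print it back in a canonical form.
        return " ".join(query.split())

    def persist(self, query: str) -> str:
        """Build-time step: store the query and return its ID."""
        normalized = self.normalize(query)
        query_id = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        self._queries[query_id] = normalized
        return query_id

    def resolve(self, body: dict) -> str:
        """Production step: accept only requests that carry a known ID."""
        if "query" in body:
            raise PermissionError("raw GraphQL queries are disabled in production")
        query = self._queries.get(body.get("id", ""))
        if query is None:
            raise KeyError("unknown persisted query ID")
        return query


store = PersistedQueryStore()
query_id = store.persist("query productCard($id: ID!) { product(id: $id) { name } }")
# The same query with different formatting maps to the same ID.
same_id = store.persist("query productCard($id: ID!) {\n  product(id: $id) { name }\n}")
resolved = store.resolve({"id": query_id, "variables": {"id": "12345"}})
```

Because the ID is derived from the normalized query text, persisting the same query twice is harmless, and a request containing a `query` key is rejected outright.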

Part 2: Inspecting the persisted queries database

Now that we have a database of queries, we can perform certain inspections on these persisted queries. Because we do not allow non-persisted queries in production, we know at any time what parts of the schema are used in production and what are not used in production.

We leverage these persisted queries for better monitoring and alerting for each individual query separately. We are also able to tell if certain fields can have a breaking change because the field is no longer used or never used in production.
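As a rough sketch of such an inspection, assume each persisted query is stored together with the set of `Type.field` coordinates it selects. That storage layout, the helper names, and the example fields are assumptions for illustration; extracting the coordinates requires parsing each query against the schema, which is omitted here.

```python
from typing import Dict, Set


def unused_fields(schema_fields: Set[str], persisted: Dict[str, Set[str]]) -> Set[str]:
    """Return schema fields that no persisted query selects."""
    used: Set[str] = set()
    for fields in persisted.values():
        used |= fields
    return schema_fields - used


def queries_using(field: str, persisted: Dict[str, Set[str]]) -> Set[str]:
    """Return the IDs of persisted queries that would be affected if `field` changed."""
    return {query_id for query_id, fields in persisted.items() if field in fields}


# Hypothetical schema fields and persisted-query contents.
schema_fields = {"Product.name", "Product.price", "Product.legacyBadge"}
persisted = {
    "a1b2c3": {"Product.name"},
    "d4e5f6": {"Product.name", "Product.price"},
}

print(unused_fields(schema_fields, persisted))
print(queries_using("Product.price", persisted))
```

A field that appears in no persisted query can be changed or removed safely, while `queries_using` tells us exactly which queries to look at before touching a field that is in use.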

Schema Stability

As mentioned previously, our GraphQL schema covers a wide variety of use-cases, and different parts of the schema can have different levels of stability as new product features get added.

Every API's dream is to have a non-breaking model that evolves well. In most cases, it is impossible to design everything so well up front in a changing product landscape. In other cases, the time we would spend deliberating over certain models to get the best possible design is not justified by the time available to implement them end-to-end.

The schema is a collaboration of the UI engineers and the GraphQL server maintainers. It should be possible for the UI engineers to prototype something fast and break it later. But once the schema is merged to the main deployment branch, the GraphQL server maintainers do not wish to have breaking changes. How do we solve this conflict in a neat way?

Let's use branch deployments to satisfy this constraint, so the main branch stays clean. Though it looks simple and easy enough to understand, the mixing of branches across various projects soon becomes a nightmare in reality. At Zalando, we have microservices and the GraphQL layer is an aggregator from multiple other services. So, maintaining multiple feature branches across 3-5 projects for 1 or 2 product features isn't going to help any developer or team move smoothly. The complexity increases non-linearly as we mix different features that must work together.

Draft status

In the previous section, we learned about the power of persisted queries controlled by the GraphQL layer - we know exactly what part of the schema is used in production. So, our solution to schema stability starts by leveraging how we handle persisted queries - by marking certain parts of the schema as not ready for production, and preventing them from getting into the persisted queries database.

For this we use GraphQL directives:

directive @draft on FIELD_DEFINITION

The above directive will help annotate certain fields in the schema as draft. And during the persistence time, we validate if the query contains a field which is marked as such and disallow persisting it.

import { GraphQLError } from "graphql";

export function draftRule(context) {
  return {
    Field(node) {
      // getFieldDef() is undefined for unknown fields, and built-in fields
      // such as __typename have no astNode, so guard both cases.
      const field = context.getFieldDef();
      const directives = field?.astNode?.directives ?? [];
      const isDraft = directives.some(
        (directive) => directive.name.value === "draft"
      );
      if (isDraft) {
        context.reportError(
          new GraphQLError(`Cannot persist draft field "${node.name.value}"`)
        );
      }
    },
  };
}

This is an example implementation of the rule which you can pass to the GraphQL validation. The usage in the schema would look like:

type Product {
  fancyNewField: FancyNewType @draft
}

type FancyNewType {
  testField: String
}

In the above definition of a Product, when we add the new field fancyNewField, we begin by adding a draft status. When someone tries to persist it, it would fail.

This brings us new opportunities and guarantees:

The field cannot be used in production
We can break it at will, since we allow ONLY persisted queries in production
We can merge it to the main branch (and even deploy it)

The draft status, together with how our persisted queries work, improves the workflow. We are able to develop multiple features faster, experiment with them across different codebases, and still have the safety of allowing production usage only after we have stabilized the schema (removed draft) by testing it end-to-end.

Experimenting in Production

The draft status allows us to deny persisting certain queries which we know are not ready for production usage. When they are ready, we want to carry certain experiments forward to production. But we can still be unsure about the stability of this schema. This is tricky, but it is often a valid use-case. Certain product features go into production as an experiment, and then they may change form or structure a little.

One obvious option is to remove the draft. But we do not restrict who can persist it. For example, some other parts of the UI may start persisting those experimental fields, and we might not notice it until we inspect the queries. We certainly cannot break the schema once it is in production. So, how do we ensure that this experimental field is used only by the components that are part of the experiment?

Here, we introduce two new directives which act as access control for fields in production. The @component directive, and @allowedFor directive:

directive @component(name: String!) on QUERY
directive @allowedFor(componentNames: [String!]!) on FIELD_DEFINITION

These two directives complement each other: one is used in the query and the other is used in the schema (here, on a field definition). We ask query authors to tag their queries with a component name, and we match those names against the allowedFor directive at persist time.

Note: Instead of component name, you can also use the operation name of the query itself.

For example:

type Product {
  fancyProp: String @allowedFor(componentNames: ["web-product-card"])
}

and a query product card:

query productCard @component(name: "web-product-card") {
  product {
    fancyProp
  }
}

This would be allowed and any other query which uses the field fancyProp would fail to persist.

The component and allowed-for directives / annotations allow us to take an experimental feature to production by restricting the usage to one component of the UI. This allows us to handle breaking changes more easily as we have a guarantee that only that part of the UI needs to update when we have a minor breaking change.
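The persist-time matching can be sketched independently of any GraphQL library. In this simplified model, the fields selected by a query and the `@allowedFor` lists from the schema are assumed to be already extracted into plain data structures; `can_persist` and the `web-wishlist` component are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional


def can_persist(
    selected_fields: List[str],
    component: Optional[str],
    allowed_for: Dict[str, List[str]],
) -> bool:
    """Persist-time check: restricted fields need a matching component name."""
    for field in selected_fields:
        allowed_components = allowed_for.get(field)
        if allowed_components is None:
            continue  # field carries no @allowedFor restriction
        if component not in allowed_components:
            return False
    return True


# Mirrors the schema example: fancyProp is restricted to one component.
allowed_for = {"Product.fancyProp": ["web-product-card"]}

print(can_persist(["Product.fancyProp"], "web-product-card", allowed_for))
print(can_persist(["Product.fancyProp"], "web-wishlist", allowed_for))
```

A query without a `@component` tag is treated like any other non-matching component, so restricted fields stay out of untagged queries by default.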

Conclusion

When we first extend the GraphQL schema, we start with the draft annotation. Then we promote new fields to restricted usage in production using the allowedFor annotation. After we have finally stabilized the schema, we remove all of these annotations and have a non-breaking contract in the form of persisted queries.

This is just the starting point of our exploration into saving developer time while ensuring the stability of the GraphQL schema. It helps us evolve the schema rather than having to re-model it every single time.

Depending on how you want to evolve the schema, and how you prefer to handle breaking changes, you can use these concepts and save precious time - by thinking about schema evolution in a non-destructive manner.

Principal Engineering at Zalando

2022-02-10T00:00:00+01:00

In many companies, Senior Engineers who do not pursue Engineering Management end up in a dead end in terms of their career progression. At Zalando, we have had a career path for individual contributors since 2016. Senior Software Engineers can choose one of the three possible career paths:

Engineering Management
Principal Engineering
Technical Program Management

In this post, we detail how we leverage our senior individual contributors (Principal Engineers) throughout the company. In the last two years, we have observed an increasing number of companies emphasizing the value of career development for individual contributors. At this level, the roles vary widely across companies, hence the importance of exchanging different approaches to structuring this role.

Principal Engineering

Beyond the Senior Software Engineer level, Engineers have increasingly varying profiles depending on their career journey and unique expertise. Depth-focused Principal Engineers are experts in their unique field (or more than one) whereas breadth-focused Principal Engineers have an expert view across many domains and aspects of the software development life cycle with an ability to leverage unique expertise of others or when needed dive deep themselves.

Up until 2021, there was no literature that we knew of that discussed in detail individual contributors above the senior level in tech companies. While Software Companies traditionally defined the role of an (Enterprise) Architect, the industry moved away from centralized architecture teams with hands-off individuals, as these were detached from the software development process and the necessary feedback loops to continuously adjust their approaches. More often than not, delivery teams are empowered with technical decision making and conduct architectural design adhering to guardrails set by the department and the company (in our case, the Tech Radar). Principal Engineers support the team in the architectural design and help to maintain architectural integrity in the scope of the department and beyond.

In March 2021, the book Staff Engineer: Leadership beyond the management track was published and added some common vocabulary about technical leadership and strategies for leading without formal authority. In addition, four archetypes are listed and provide classification for the types of tasks Principal Engineers are most commonly working on. It is important to note that individuals may transition between these archetypes throughout their career depending on their strengths or the organizational needs:

Tech Lead: leads critical technical initiatives across the department and beyond. Partners with more than one team to support teams and individuals with delivery and coaching. Usually, Principal Engineers transitioning from a Senior Engineer role in a single team to a Principal Engineer acting across teams will go through this path. Initially, delivery includes high focus on coding alongside the team for high-impact and critical projects.
Architect: manages technical direction, quality, and approach within an area or project. Navigates different levels of leadership to address mid to long-term challenges.
Solver: digs deep into an area or problem, captures findings, aligns a set of recommendations. May apply both to short-term and long-term engagements and include driving the implementation of the recommended solutions.
Right Hand: extends an executive's attention and borrows their scope and authority to address certain problem areas.

Principal Engineers at Zalando

Principal Engineers¹ at Zalando are senior individual contributors and role models for our Engineers. While they have no people management responsibilities, they are part of the leadership team. Principal Engineers report to a Manager of Managers (e.g. Head of Engineering) and assume the scope of the person they report to. Typically, this means they have 2-5 engineering teams that they support. Overall, Principal Engineers constitute around 4% of our total Engineering population.

At Zalando, Principal Engineers are responsible for the architecture of the systems built within the department they're part of. They enable others and facilitate the design process across teams. They are proactively initiating and executing process and technical improvements (e.g. scaling, technical debt reduction) across the department and beyond. Principal Engineers play a leading role in the full product development lifecycle. They're consulting Product and Engineering Management on projects early on, ensuring that technical considerations are factored into the project's scoping and planning processes.

Our (usually breadth-focused) Principal Engineers are leading the technical design for mid to large scale projects that their department is part of. This involves trade-off discussion, scope definition and negotiation with Product Designers and Product Managers as well as advice on structuring the projects into iterations optimized for reducing delivery risk, dependencies on teams, or ensuring quick time to market. Principal Engineers facilitate design discussions with the involved teams, delegate design or experimentation of well-defined parts of the design to other Engineers. They outline key design decisions and trade-offs and seek feedback through peer-reviews. To understand how their designs perform in production, they guide teams throughout the execution time of the project and support launch readiness through production readiness reviews and project launch coordination.

At Zalando, we peer-review technical designs on different organizational levels, depending on their scope and complexity. During peer-reviews for Zalando group-wide projects requiring contributions from multiple business units, Principal Engineers support the project teams in finding the best solution for realizing the project's goals. Additionally, they provide teams with a different perspective on the suggested solutions and discuss trade-offs related to dependencies, relation to other pending or ongoing projects, and risks and challenges anticipated during project delivery. In this way, we ensure consistency of technical solutions, promote standardized solutions and practices, connect teams who solved similar problems with one another, and seek to incorporate learnings from other projects into future designs.

Focus on operational excellence is key to delivering high-value customer experiences. Principal Engineers play a crucial role in scaling knowledge and raising the bar. They coach teams on resilience patterns, observability and facilitate weekly operational meetings where the operational performance of the system and past incidents are reviewed. They peer-review post-mortem documents and runbooks that the teams prepare as part of the incident response. Finally, they collaborate on alignment and implementation of cross-team action items.

Depth-focused Principal Engineers are most frequently part of platform or infrastructure teams. When compared with their peers, these individuals are also spending the highest share of their time writing code. They are thought-leaders influencing the long-term product roadmap. Through their network and collaborations with other Engineers across the company (e.g. via language guilds), they look for opportunities to scale the adoption of existing infrastructure solutions or initiate new ones, with the focus on making our teams or systems more efficient (e.g. shared libraries, application templates, operational guidance or patterns). Lastly, they contribute to setting Engineering Standards and support others in technology selection, evaluation, and adoption as part of our Tech Radar process.

Principal Engineers also make important contributions that go beyond core engineering tasks. They are bar raisers during the interview process, mentor other Engineers, and play a key role in our engineering communities. This way, they have opportunities to coach other engineers, role model our culture, and help identify and develop promising talent.

Principal Engineering Community

Principal Engineers form a company-wide community of experts, who support one another in their challenges and journey at Zalando. They self-organize both company-wide and per business unit in order to discuss and drive technical topics that they or their leadership consider as important to meet the business growth and operational excellence of Zalando's technical systems. The Community provides expertise around know-how, patterns, solutions, and the approach to rollout of these in teams. Further, Principal Engineers support one another in order to continuously upskill themselves and others, through mentorship, coaching, or pairing up on tasks.

Engineering-wide initiatives driven by the community are documented in a task list, which in addition to providing transparency on the community efforts, serves as an opportunity to (i) highlight tasks that any Engineer at Zalando can contribute to, or (ii) for anyone to request support on an engineering topic. Similar task lists exist in a smaller scope and provide ways to involve the Engineering talent from these organizations.

Helping Principal Engineers with their new role

The majority of our Principal Engineers have been promoted from within Zalando. Some of our senior individual contributors have switched career tracks from Engineering Management back to individual contributors. As the principal engineering role is tailored to our specific needs and organizational structure, it was important for us to set up newcomers to the role for success.

A few Principal Engineers teamed up and compiled a guide to beginning the journey of a Principal Engineer and how to structure the first 100 days in this role. This guide has proven to be helpful for our Principal Engineers, their Managers, and for colleagues who are planning their own career development towards the individual contributor track. In addition to the guide, our more seasoned Principal Engineers provide mentorship to other Principal Engineers.

We also realize that the role of a Principal Engineer may not be a fitting career opportunity for every Senior Engineer. Principal Engineering is not just a label for the best Senior Engineers. In the end, it's a technical leadership role with strong emphasis on cross-team coordination, communication skills, and requiring the ability to lead without authority. The initiatives that an individual is driving tend to have a much longer time horizon for the impact to become visible and are often realized through the hands of others. This delayed gratification can negatively affect motivation, especially for individuals who as problem-solvers with deep expertise value and source their energy from solving large-scale problems with fast iteration cycles (e.g. as part of incident response). At Zalando, we leverage stretch assignments as development opportunities to allow our colleagues to try out aspects of the Principal Engineer role and verify whether it's a good fit for them while allowing them to easily step back to their prior activities otherwise.

Managing Principal Engineers

Some of our Engineering Managers have not worked with nor managed Principal Engineers before. This can lead to situations where the potential of the individuals is under-leveraged. Individual contributors on this level require a degree of flexibility and share of their time to explore the potential of addressing the problem areas they have identified. They also need the necessary sponsorship and support in change management for solutions that are introduced within the department and beyond.

To address this challenge at scale, we compiled guidance for our managers on how to support and effectively work with Principal Engineers. This guide includes a short checklist allowing organizational leaders to easily verify whether they have structured the ways of working and expectations towards the Principal Engineers in the right way. This includes ensuring that the Principal Engineer is part of leadership rounds providing the right context about the department's priorities and upcoming projects, creating the necessary connections between key stakeholders and Heads of Product, and also includes examples of initiatives that Principal Engineers have driven at Zalando.

Summary

In this post we have provided insights into the key aspects of the role of a Principal Engineer at Zalando. While this is not an extensive description of the challenges and intricacies of the role, we hope that the information shared in this post will shed some light on the opportunities that the individual contributor path provides. Likewise, we will be happy if it serves as an inspiration for you to consider putting stronger focus on the individual contributor career path in your company.

There is no consistency in the industry for naming Senior+ roles. Some companies use (i) Senior, Staff, Senior Staff, Principal (e.g. Spotify), whereas others go for (ii) Senior, Principal, Senior Principal, ..., Distinguished Engineer (e.g. Amazon). We chose a naming scheme based on the second model.

Releasing Connexion to the Community

2022-02-07T00:00:00+01:00

Connexion is a Python framework that automagically handles HTTP requests based on the OpenAPI specification (formerly known as Swagger Spec) of your API, described in YAML format. Connexion allows you to write an OpenAPI specification and then maps the endpoints to your Python functions; this makes it unique, as many tools generate the specification based on your Python code. You can describe your REST API in as much detail as you want; Connexion then guarantees that it will work as you specified.

After 6 years and 3.9k GitHub stars, Zalando is now releasing Connexion to the community. What does this mean? Connexion's repository will move from Zalando's GitHub organization to the new community-owned "spec-first" organization. This repository transfer highlights changes in Connexion's maintainer structure. Connexion's license (Apache 2.0) and release package on PyPI will not change.

Connexion was a huge enabler for Zalando to move towards API-first in 2015, i.e. to write the API specification before implementing the backend code. While Python is a first class citizen in Zalando's tech landscape (see our Tech Radar), Zalando's customer-facing production software is usually implemented in modern JVM languages such as Kotlin, Java, or Scala. Maintenance of Connexion stalled with core developers changing focus and nobody new stepping up within Zalando. Thankfully, ML6 took over most of the regular maintenance from Zalando. We are very glad to have found new active maintainers. Special thanks go to my colleague João as the original author, Rafael for his significant contributions, Robbe and Ruwan from ML6 for taking over, and to Daniel for donating the "spec-first" organization. The "spec-first" organization will serve as a company-neutral new home for this awesome open source project. The project is what it is today because of its community. Big thanks to all 165 contributors and to the numerous users of Connexion out there!

Moving Connexion out of Zalando's GitHub organization won't affect how the project is used within Zalando. With JVM-based languages powering most of Zalando's Fashion Store, Connexion is used for low-traffic services and tools in various departments. For example, Connexion powers parts of our internal Continuous Delivery Platform, serves metadata for our internal realtime business monitoring platform, exposes APIs for our inhouse machine learning platform, and is used in our pricing department. Connexion has gained some popularity among Zalando's data science community as Python is the most commonly used language for data scientists.

Personally, I'm very happy to see Connexion graduate and have it released to a new community-owned home. I will follow its path into the future and try to be helpful when time allows.

If you are interested in learning more about Connexion, check out the documentation.

Utilizing Amazon DynamoDB and AWS Lambda for Asynchronous Event Publication

2022-02-03T00:00:00+01:00

In our Microservices Architecture, services communicate both asynchronously via events and synchronously via REST calls. Frequently, a synchronous REST call modifies data in a data store and emits an event based on the changes made. Publishing data change events can be decoupled from performing the changes in the data store in order to increase the resilience of the application.

We will show how this is achieved with the Transactional Outbox pattern, presenting a cloud native approach utilizing Amazon DynamoDB, AWS DynamoDB Streams and AWS Lambda.

Problem Statement

In Zalando Payments we have a service, called Order Store, that stores payment related data for a given order in a DynamoDB table. Updating this data happens via a synchronous REST call. Changes to the stored payment information need to be propagated to other services too, which is realized by sending events to Nakadi, Zalando's message bus.

Initially, the service created/updated data in DynamoDB and then sent events to Nakadi to inform other services about the change in payment information. This meant the service had two downstream dependencies to complete the request, namely the database and the message bus. As the availability of a service is the product of the availabilities of its dependencies, the more dependencies a service has, the lower its own availability. Let's assume DynamoDB and the message bus have availabilities of 99.9% each. Thus, the maximum availability for the service is 99.9% * 99.9% ≈ 99.8%.

Aiming for the highest availability possible, reducing the dependency to only DynamoDB results in a higher availability of the service. After explaining the transactional outbox pattern, we will provide a concrete solution, the technologies it comprises and how we achieved decoupling the process.
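The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `serial_availability` is made up for illustration, and the 99.9% figures are the assumed values from the text, not measured numbers.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Upper bound on availability when every dependency must succeed."""
    return prod(availabilities)


# Assumed figures from the text: database and message bus at 99.9% each.
with_both = serial_availability(0.999, 0.999)
database_only = serial_availability(0.999)

print(f"database + message bus: {with_both:.4%}")
print(f"database only:          {database_only:.4%}")
```

Dropping the message bus from the synchronous path raises the upper bound from roughly 99.8% back to the 99.9% of the database alone.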

Transactional Outbox

Let us look at the underlying concept of how to decouple data update and event publication. The pattern we are describing here is known as Transactional Outbox. Our goal is for a service, called synchronously via a REST API, to create, delete or update a data store entry and also propagate the change to other services via messaging. However, publishing the message is decoupled from updating the data store.

In this drawing we provide the setup of the environment. Our flow consists of 4 steps, where the starting point is a synchronous call that triggers further actions.

Change Entry and Populate Outbox

After the call is received, the service triggers a change for an entry in the data store. This is denoted with 1. The actions that trigger a change consist of Create, Update or Delete, as a Read operation would not alter any data. Modifying data in the data store is transactional and once it is successfully completed, the service already returns a success response code to its caller.

As part of the transaction in the data store, the actual data change is written to an outbox. This is depicted in step 1.5. The outbox can be thought of as a write append log. Each data change operation in the data store will produce an entry in the outbox.

Consume Outbox and Publish Event

The transaction in the data store was successful and the data entry got updated or created. Thus, a new entry in the outbox exists. A so-called message relay reads that entry from the outbox. To make the message relay aware of the new entry, the outbox notifies the message relay, which upon notification consumes the entry. This is depicted with number 2.

Upon consumption, the message relay extracts the data, transforms it to an event and publishes it, marked in the diagram with 2.5. Only after successful publication is the entry marked as consumed.
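The four steps can be sketched in Python with SQLite standing in for the data store. This is an illustration under assumptions, not the implementation described in this post: the table names, `change_entry`, and `relay_once` are hypothetical, and the relay here simply reads the outbox when called instead of being notified.

```python
import json
import sqlite3


def change_entry(conn, order_id, status):
    """Steps 1 and 1.5: change the data and append to the outbox in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO orders (id, status) VALUES (?, ?)",
            (order_id, status),
        )
        conn.execute(
            "INSERT INTO outbox (payload, consumed) VALUES (?, 0)",
            (json.dumps({"order_id": order_id, "status": status}),),
        )


def relay_once(conn, publish):
    """Steps 2 and 2.5: read unconsumed entries, publish them, then mark them consumed."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE consumed = 0 ORDER BY id"
    ).fetchall()
    for entry_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the entry stays unconsumed
        with conn:
            conn.execute("UPDATE outbox SET consumed = 1 WHERE id = ?", (entry_id,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, consumed INTEGER)"
)

published = []  # stand-in for the message bus
change_entry(conn, "order-1", "PAID")
relay_once(conn, published.append)
print(published)
```

Because the entry is only marked as consumed after `publish` returns, a crash between publishing and marking leads to a repeated publication rather than a lost event, so consumers of such a relay should tolerate duplicates.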

Concrete Solution

After describing the pattern we now want to present the concrete solution. In order to decouple the asynchronous event emission from the synchronous process we take advantage of various cloud services AWS has to offer.

The following diagram shows the complete flow from a synchronous REST API call to the publication of the Nakadi event following the new approach:

DynamoDB Streams

Recently, DynamoDB was extended with a Change Data Capture implementation – DynamoDB Streams. Once activated, as soon as an item in the DynamoDB table is changed (added, updated or deleted) a corresponding dataset is sent to the stream. In our case this dataset contains the old image, containing the table item before the change, and the new image, containing the table item after the change. It can be configured which images AWS exposes to the DynamoDB stream. With both these images we are now able to assemble a corresponding Nakadi event using AWS Lambda.

AWS Lambda

The trigger for our AWS Lambda is a DynamoDB Stream item. We chose Python for our implementation as it is more lightweight compared to Java. The lambda function will receive the item containing the old and new image. Then it will assemble the data change event, which contains the complete item after its change as well as a patch node containing the diff. As a last step the assembled event is published to Nakadi.
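To make the assembly step more concrete, here is a minimal sketch of such a handler. It is not our production code: the function `build_data_change_event` and the output fields `operation`, `data`, and `patch` are hypothetical, publishing to Nakadi is left out, and attribute values are kept in the typed format in which DynamoDB Streams delivers them (for example `{"S": "PAID"}`).

```python
def build_data_change_event(record: dict) -> dict:
    """Assemble a data change event from one DynamoDB Streams record."""
    images = record["dynamodb"]
    old = images.get("OldImage", {})  # absent for newly inserted items
    new = images.get("NewImage", {})  # absent for removed items
    # Patch node: every attribute whose value differs between the two images.
    patch = {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }
    return {
        "operation": record["eventName"],  # INSERT, MODIFY or REMOVE
        "data": new,
        "patch": patch,
    }


def handler(event, context):
    """Lambda entry point: one stream invocation can carry several records."""
    return [build_data_change_event(record) for record in event["Records"]]


sample = {
    "Records": [
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "OldImage": {"order_id": {"S": "order-1"}, "status": {"S": "OPEN"}},
                "NewImage": {"order_id": {"S": "order-1"}, "status": {"S": "PAID"}},
            },
        }
    ]
}
events = handler(sample, None)
print(events[0]["patch"])
```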

In case the publication to Nakadi fails, e.g. due to timeouts, the request is retried. If all the retries fail then we make use of an AWS SQS queue as fallback storage which is further explained in the next chapter. This also means that we do not guarantee that the events are published in the correct order.

AWS SQS & Kubernetes CronJob

AWS SQS is a message queue service. When creating a new AWS Lambda function it already comes with an AWS SQS queue attached as a dead letter queue. This queue ensures that no events are lost in case of a failed publication or, even worse, a temporary outage. Whenever Nakadi event publishing fails, the event is sent to the dead letter queue. Event publishing is retried with exponential backoff to minimize the number of events that end up in the dead letter queue. To retry sending the queued events at regular intervals, we created a Kubernetes CronJob. The CronJob simply runs the same Python code as the AWS Lambda and tries to publish the events to Nakadi again. Once publication succeeds, the event is removed from the SQS queue.
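The interplay of retries, backoff, and the fallback queue could look roughly like the following sketch. Everything in it is an assumption made for illustration: a plain Python list stands in for the SQS queue, `FlakyPublisher` simulates a message bus that times out, and the function names and retry parameters are invented.

```python
import time


def publish_with_retries(event, publish, dead_letters,
                         max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try to publish; back off exponentially; fall back to the dead letter queue."""
    for attempt in range(max_attempts):
        try:
            publish(event)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    dead_letters.append(event)  # all attempts failed: park the event
    return False


def drain_dead_letters(dead_letters, publish):
    """What the periodic job does: retry parked events, keep those that still fail."""
    remaining = []
    for event in dead_letters:
        try:
            publish(event)
        except Exception:
            remaining.append(event)
    dead_letters[:] = remaining


class FlakyPublisher:
    """Simulated message bus that fails a fixed number of times before recovering."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def __call__(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("simulated publish timeout")
        self.sent.append(event)


queue, delays = [], []
publisher = FlakyPublisher(failures=3)
ok = publish_with_retries({"id": 1}, publisher, queue,
                          base_delay=1.0, sleep=delays.append)
drain_dead_letters(queue, publisher)  # the outage is over, so this succeeds
```

Since a parked event is published later than events that succeeded in the meantime, this sketch also shows why ordering is not guaranteed.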

Conclusion

We successfully decoupled synchronous data changes from eventually consistent event publishing. Through decreasing dependencies, we increased the resiliency of our service. Besides improving the architecture, the team also got to work with DynamoDB streams and AWS Lambda for the first time, offering a great possibility to learn about AWS technologies. Having implemented this pattern, we are working with our infrastructure teams to offer an implementation of this pattern to all teams at Zalando. We already have an implementation of the Transactional Outbox for PostgreSQL, managed centrally via a Kubernetes operator.

Maps with PostgreSQL and PostGIS

2021-12-02T00:00:00+01:00

This blog post explains to you which tools to use to serve geospatial data from a database system (PostgreSQL) to your web browser. All you need is a database server for the data, a web map application for the frontend and a small service in between to transfer user requests. I will also show you how these components can run on top of Kubernetes in a highly available cloud native fashion.

PostGIS - a spatial database

As a first step the dataset in your database you want to put on a map must include a geospatial representation: Two coordinates or an address. For Zalando it might be interesting to know the demand hotspots across Europe e.g. by joining the zip codes of shipments with administrative boundaries which are often available as Open Data. The database must support geo data types and indexes to answer spatial queries. At Zalando, the open source database system PostgreSQL is used by many teams and it offers a geospatial component called PostGIS. It is used for example to allow our customers to select the nearest pickup and return points. Over the years, PostGIS has grown a strong community and is widely accepted in the industry as the de facto standard to manage geospatial data. There are many different tools and interfaces available to import data in various formats into PostGIS and access it from your favorite data science environment - be it Jupyter, R or Tableau.

Bring the map to your browser

Creating a web mapping app is simple with tools like Leaflet.js. For the basemap we can use OpenStreetMap, the wiki-style free alternative to commercial map providers. Adding extra layers with e.g. over 100,000 polygons on top of it would slow down map navigation a lot. Splitting the data into a grid of tiles and loading only the ones of the area you are currently looking at on your screen is what makes a browser map fast and responsive. Until recently, a middleware was usually required to produce these tile structures. That middleware had to consider not only the grid creation, but also take care of different zoom levels. When you zoom out the geometry of streets, rivers, forests etc. should be coarser and styled differently - some details should be even left out at a smaller scale for the sake of readability.

Streaming spatial data from PostGIS as vector tiles into the browser map

The good news is, these days PostGIS can take over most of the middleware’s job and produce map tiles for you. You only need a lightweight server between the frontend and the database that takes in requests from the map and sends queries to your spatial database to produce the tiles you want. pg_tileserv is such a solution. You configure the table name that contains the spatial data and that’s it. If you want to learn more about vector tiles I can recommend this talk by Paul Ramsey, one of the PostGIS authors.

Running it on Kubernetes

The Postgres Operator, created by my team at Zalando, provides you with an easy creation and update path for PostgreSQL servers running on top of Kubernetes. Engineers only have to write a short YAML manifest which can look like this:

apiVersion: acid.zalan.do/v1
kind: Postgresql
metadata:
  name: acid-geo
spec:
  numberOfInstances: 2
  postgresql:
    version: "14"
  volume:
    size: 10Gi
  teamId: acid
  preparedDatabases:
    map_db:
      defaultUsers: true
      extensions:
        postgis: geo
      schemas:
        geo: {}

The operator will notice the new manifest and create all the necessary resources in Kubernetes - a stateful set with 2 database pods, services to connect to the database, secrets for authentication etc. When preparedDatabases is specified, the operator creates a new database with schemas as well as a set of database roles (reader, writer, owner) with default access privileges assigned. Plus, you can list extensions to be created in a certain schema. The Postgres cluster is based on the Spilo docker image which includes the PostGIS extension.

To import arbitrary geodata formats I can recommend GDAL’s ogr2ogr command-line tool. In my case I’ve imported the latest European NUTS polygons of 2021 and the 1km² population grid of 2018 by Geostat.

To roll out pg_tileserv on Kubernetes I’m using a deployment resource. To run it within the Zalando infrastructure I had to put the tileserver behind our oauth2 proxy with a dedicated /tileserver base path, which required me to override pg_tileserv’s default configuration. Configuration of pg_tileserv happens via TOML files, so I’ve put that into a config map and mounted it into the container. Here you can see the manifest (leaving out the resources section in this example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: acid-geo-tileserver-config
data:
  pg_tileserv.toml: |
    BasePath = "/tileserver/"
    Debug = true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: acid-geo-tileserver
spec:
  replicas: 1
  selector:
    matchLabels:
      application: acid-geo-tileserver
  template:
    metadata:
      labels:
        application: acid-geo-tileserver
    spec:
      containers:
      - name: acid-geo-tileserver
        image: pramsey/pg_tileserv:latest
        ports:
        - containerPort: 7800
          protocol: "TCP"
        volumeMounts:
        - name: configs
          mountPath: /config
        env:
        - name: "DATABASE_URL"
          value: postgresql://map_db_reader_user@acid-geo:5432/map_db
        - name: "PGPASSWORD"
          valueFrom:
            secretKeyRef:
              name: map_db_reader_user.acid-geo.credentials
              key: password
      volumes:
      - name: configs
        configMap:
          name: acid-geo-tileserver-config

Another deployment is needed serving our Leaflet application, e.g. using a simple Ubuntu docker image with nginx running.

Dynamic mapping layers

The web map requests tiles from pg_tileserv which sends back protobuf files. In our case, a request looks like this - with geo.boundaries_europe being the schema qualified table name:

${BASE_URL}/tileserver/geo.boundaries_europe/{z}/{x}/{y}.pbf

Z is the zoom level, and X and Y are the column and row of the requested tile in the tile grid of that zoom level. Leaflet’s VectorGrid class can be used to display the vector tiles returned from PostGIS. For the boundaries, the result can look like the first picture above. The vector tile format does not have to consist solely of the geometry. Multiple thematic attributes can be included, making it possible to change the style on the fly without sending another request to the database. pg_tileserv will take information from all columns it finds in a spatial table.
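To illustrate how the map client arrives at the X and Y values, the sketch below converts a longitude/latitude pair into tile indices using the common Web Mercator ("slippy map") tiling formula. The function name and the Berlin coordinates are chosen for illustration only; Leaflet performs this calculation internally.

```python
import math


def lonlat_to_tile(lon: float, lat: float, z: int) -> tuple:
    """Return the (x, y) index of the Web Mercator tile containing a WGS84 point."""
    n = 2 ** z  # number of tiles per axis at zoom level z
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


# Tile covering central Berlin at zoom level 10, and the matching request path.
z = 10
x, y = lonlat_to_tile(13.405, 52.52, z)
path = f"/tileserver/geo.boundaries_europe/{z}/{x}/{y}.pbf"
print(path)
```

Each additional zoom level doubles the number of tiles per axis, which is why the same location maps to different X and Y values as you zoom in.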

Alternatively, it allows me to serve vector tiles not only from a table but also from an SQL function using a query with PostGIS’ vector tile creator function ST_AsMVT. pg_tileserv’s README on GitHub provides some cool examples for such function layers. For example, PostGIS allows you to create a grid of squares or hexagons within a defined extent, e.g. the envelope of a single tile. The grid can be intersected with another spatial data set to produce a heatmap. The following example is inspired by pg_tileserv's example of Advanced Function Layers.

CREATE OR REPLACE FUNCTION geodata.population_hexagons(
  z integer, x integer, y integer,
  step integer default 4)
RETURNS bytea AS
$$
WITH bounds AS (
  -- get web mercator tile bounds to given coordinate
  SELECT ST_TileEnvelope(z, x, y) AS geom
), hexes AS (
  -- generate hexgrid within bounds and join with population grid
  SELECT row_number() OVER () AS grid_id,
        h.geom, h.i, h.j,
        sum(p.popcount) * 0.5 AS popcount  -- oversimplified, of course
   FROM bounds b
   JOIN LATERAL ST_HexagonGrid(  -- 1. hex size, 2. boundary
          (ST_XMax(b.geom) - ST_XMin(b.geom)) / pow(2, step), b.geom
        ) h ON (true)
   -- do spatial join between our artificial grid and the Geostat grid
   -- the hex grid is in web mercator coordinate reference system (CRS)
   -- it must be transformed into the same CRS as the population grid (WGS84 - 4326)
   JOIN geodata.population p
     ON p.geom && ST_Transform(h.geom, 4326)
  GROUP BY h.geom, h.i, h.j
), mvt AS (
  -- processing geometry for vector tiles
  SELECT ST_AsMVTGeom(h.geom, b.geom) AS geom,
         (h.i::text || h.j::text || h.grid_id::text)::int AS grid_id,
         h.popcount
    FROM hexes h, bounds b
)
-- baking mvt geom, grid_id and popcount into MVT encoding
SELECT ST_AsMVT(mvt, 'geodata.population_hexagons') FROM mvt;
$$
LANGUAGE 'sql' STABLE STRICT PARALLEL SAFE;

Your function must take the Z, X and Y parameters as arguments and return Postgres' bytea type, which is just a BLOB for the PBFs returned from ST_AsMVT. In the first part of the query we need to get the tile envelope for the given input. Within this square we generate the grid and join it against the Geostat population grid. For each hexagon we sum up the population of every intersecting Geostat grid cell. This is quite coarse, indeed. It would be more precise to join the generated grid against a point data set, e.g. one could generate centroids for each data polygon.

Dynamic hexagon grid joined against Geostat population data using an SQL function

Because this is all based on database queries triggered from user interactions with the map, such a heatmap can be dynamic and change while zooming in and out. As the vector tile grid gets smaller on a larger scale the heatmap becomes more fine-grained. In the map legend you can see that values adapt to the zoom level and hexagon size. This is much easier on the eye: the observer is not overwhelmed when the full picture is shown and gets better guidance to points of interest.

A Systematic Approach to Reducing Technical Debt

2021-11-30T00:00:00+01:00

Introduction

While technical debt is a recurring issue in software engineering, the case of the Merchant Orders team within Zalando Direct was an outlier as, due to a lack of a clearly defined process, technical debt more or less only ever accumulated. When I joined this team in autumn 2020 as its new engineering lead, the technical debt backlog had entries dating back to 2018. In this article, I describe the process we set up in Q1/2021 in order to regain control of our technical debt. While the situation in your own team may not be quite as dire, you may nonetheless find some aspects of this blog post useful to adopt. Our backlog of technical debt tickets used to be in excess of 70, with no end in sight. With the adoption of the methodology described in this article, we have already shipped more than ten features or improvements over the course of eight weeks, i.e. four sprints. For the first time in three years, i.e. ever since my team started tracking technical debt, we are reducing it.

This article is written from a managerial perspective and has Engineers and Engineering Managers as its target audience, though I hope that engineers of all levels find value in this article. Furthermore, I can only encourage any software engineer reading this article to approach their lead if ever-growing technical debt is an issue in their team. There is a non-zero chance that they will appreciate you raising the issue, considering that all of us are aware that technical debt is a serious problem. If you do not pay it down, you will get more technical debt on top for free, until your only option is a complete rewrite. This is quite similar to compound interest driving debtors into bankruptcy in the real world. Obviously, we would like to avoid such an outcome.

An excerpt from my team’s technical-debt backlog as of April 2021. As you can see, there are items from 2018 and 2019 on it.

Technical debt, Known and Unknown

Using the vocabulary of the Johari window, you can probably identify plenty of “known known” technical debt in your codebase. However, some technical debt constitutes an “unknown unknown”, i.e. technical debt we do not know that we have. In our case, we had a long backlog of known technical debt, with many dozens of entries. Given that we have over a dozen services to maintain, this is probably not even a particularly frightening number. However, there is also technical debt that you are completely unaware of. This may seem counter-intuitive, in particular if you subscribe to the notion of being able to perfectly design services in advance, as well as once and for all eternity. Yet, this is not a caricature, considering that you can encounter non-technical leads who hold rather similar beliefs. In some circumstances, this could even be a perfectly valid position to hold, for instance in static environments.

There are at least two sources of unknown technical debt. First, there are problems with your services that you simply have not yet identified. This can happen easily because once you agree on a design and subsequently carry out its implementation, you may not question any decisions the team has agreed on. This can of course mean that there are drawbacks in your design or implementation that someone with a fresh pair of eyes, for instance a new joiner, may be able to spot. Second, technology is a fast-moving field. This means that today’s cutting-edge design-patterns, development processes, testing strategies, or even programming languages and paradigms may get superseded. Your current best practices replaced your previous set of best practices one by one, and there are new developments that will one day make you wonder why anybody ever thought that a hitherto valid approach was ever a good idea. Of course, there is also the problem that we sometimes need to deliver features quickly to seize a business opportunity, which may lead to sub-optimal design and implementation decisions.

Not all change is positive, however. As much as we engineers may pride ourselves on our objectivity, our industry is also driven by fads. This is such a big issue that a company like Gartner makes money by selling their analyses about where on the “hype cycle” certain technologies are. Sometimes, we also regress as an industry, for instance by adopting technologies that are popular but less powerful. Yet, if they are being pushed by corporations with an annual marketing budget of many hundreds of millions of dollars, they can get a lot of traction in industry. Any of your services might look much different if it was rewritten today. As a practical consequence, I think you should take the time to re-review your existing services and look for improvements, but, if possible, with a very critical view toward buzzwords du jour. Even TeX, arguably one of the most mature software products in the world, receives fixes to this very day. Its first version was released more than four decades ago. Taking this into account, it is probably not an entirely implausible assumption that your services could be improved as well. On a related note, Zalando has formal processes in place for selecting technologies as well as adopting new technologies. This is certainly helpful for engineering leaders, yet it cannot address the problem that some technologies fall out of favor over time due to shortcomings.

As we create software solutions in a highly dynamic environment where both customer requirements and technologies can change, a semi-regular review of any of your services may uncover areas of improvement. All of that should be categorized as (hitherto unknown) technical debt. A very welcome consequence of such an exercise is that your engineers will gain greater familiarity with their services. This is particularly valuable if your services need to be reliable anytime. Preferably, each engineer on your on-call rotation should have very detailed knowledge of your services, so thoroughly studying the source code of your existing service will be very helpful to them.

Motivating your Engineers

In management theory, a popular concept is Theory X/Theory Y. These two show up in pairs. According to Theory X, people only work because they need money and, if they could get away with it, they would prefer to not work at all. In contrast, Theory Y posits that people are intrinsically motivated, care about their work, and want to advance in their career. Reality is probably somewhere in-between. However, as a leader, the problem is how to get people to want to work on technical debt. In our case, the problem was that the backlog had tickets on it that were three years old, which seems to imply a lack of motivation to work on such tickets.

As leaders we can of course simply tell people what to work on (Theory X). The problem, however, is that people tend to be more productive if they work on tickets they really do want to work on (Theory Y). Furthermore, my experience as an engineer was that work on technical debt can be fulfilling and can also open up new opportunities. Consequently, I use a Theory Y approach with my team, stressing the benefits of this kind of work. Please note that this is not in any way a cynical approach. A good part of my growth as an engineer was due to resolving hairy technical problems, oftentimes with a focus on performance improvements. In one of my internships I was given the task of increasing the performance of an artificial neural network, and this work later led to me getting hired in a very competitive field. I also highlighted to my team that work on technical debt can sometimes be easily quantified. An engineer’s CV certainly looks better with hard data on percentages of performance increases or space reductions. Examples are: “Reduced weekly AWS hosting fees by $500 by evaluating resource requirements” (this is an actual result of our work) or “reduced space requirements of one of our databases by 12% by optimizing data types and removing redundant information.”

The Technical-Debt Rotation

My team already has several rotations in place. Thus, I set up technical debt as another rotation. I aim to give my team autonomy in their work, so my proposal was the following: all engineers take turns in the technical-debt rotation, and one iteration lasts for one week. In practice, this means that on every Monday an engineer should spend some time on identifying technical debt they want to work on. This can either be known technical debt, i.e. one or more tickets from the technical-debt tracker, or unknown technical debt. For the latter, my suggestion is to pick one of our many services, study the source code, and look for improvements. This should lead to a number of additional tickets. Preferably, an engineer identifying possible improvements of an existing service should also do the corresponding work. This is particularly the case when we only have a hypothesis that requires some work to test it.

I want the engineers on the technical-debt rotation to work on tickets related to technical debt before taking on any tickets from our regular backlog, which is of course considered during the planning meeting. In terms of the time commitment, I am rather flexible. I would like the engineer on the rotation to spend at least one day working on technical debt. However, there are situations where a bigger commitment may be warranted. This is particularly the case with larger subprojects, which is detailed in the next section. You may have noticed that I have not addressed the issue of urgency as, clearly, not all technical debt is created equal. We tend to address pressing issues as soon as possible. We commonly do not even classify them as technical debt but instead as necessary bug fixes or “operations” issues. Nonetheless, some of our accumulated technical debt is merely nice-to-resolve. My advice to fellow leaders would be to keep an eye on what your team is working on by tracking the technical-debt tickets your team closes. There should be a healthy mix of relative importance. If not, you will have to address this, perhaps in a separate session for backlog refinement. I would not advise you to rank all technical-debt tickets by urgency and simply assign them, however, for reasons specified in the previous section.

We also have a simple system in place for categorizing technical debt where we use the two metrics "complexity" and "impact", and rank both on a scale from one to five. In our case, these estimations are initially done by the engineer who adds entries to the tech-debt backlog, but they are reviewed intermittently. I think a good starting point is picking a few items that could be considered low-hanging fruit, i.e. work that pairs relatively low complexity with moderate to high impact. You may want to encourage your engineers to also tackle more complex work with a medium to high impact. You may also find that some of the technical debt is not worth resolving at the current point in time as the impact would be low to non-existent. Those you may want to save for a less busy time, for instance the code freeze before Cyber Week.
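The two-metric categorization can be sketched in a few lines. This is only an illustration: the DebtTicket type, the ticket titles, and the cut-offs for "low complexity" (at most 2) and "moderate to high impact" (at least 3) are assumptions made for the example, not the values our team uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtTicket:
    title: str
    complexity: int  # 1 (trivial) .. 5 (very complex)
    impact: int      # 1 (negligible) .. 5 (very high)


def is_low_hanging_fruit(ticket: DebtTicket) -> bool:
    # Assumed cut-offs: low complexity <= 2, moderate to high impact >= 3.
    return ticket.complexity <= 2 and ticket.impact >= 3


def suggested_order(tickets):
    # Highest impact first; among equal impact, lowest complexity first.
    return sorted(tickets, key=lambda t: (-t.impact, t.complexity))


# Hypothetical backlog entries.
backlog = [
    DebtTicket("Replace boxed types in hot path", complexity=4, impact=4),
    DebtTicket("Right-size service resources", complexity=2, impact=4),
    DebtTicket("Rename legacy package", complexity=1, impact=1),
]

quick_wins = [t.title for t in backlog if is_low_hanging_fruit(t)]
ordered_titles = [t.title for t in suggested_order(backlog)]
```

Sorting is only a suggestion for the engineer on rotation; as described above, the choice of ticket stays with them.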

Capitalizing Technical Debt

One of the duties of software engineering leads is to ensure that the work their team performs is properly capitalized. This means that any software we create that increases our digital assets should also be added to our financial assets. In turn, this reduces our tax liabilities. Maintenance work, however, cannot be capitalized as it is instead considered an expense. A collection of technical debt tickets could constitute a mini-project that can be capitalized, however. One example would be a migration to new infrastructure or a significant rewrite that leads to performance improvements. Admittedly, packaging technical-debt tickets into a project may be an overly idealistic scenario. Yet, it is a possible outcome. In our team’s case, we have recently identified a number of issues with our Scala code base, due to an over-reliance on object-oriented programming constructs. If we resolved them, we would have a more maintainable system; we also predict an improvement in performance as there are many instances where objects are used instead of primitive types. Similarly, you may be able to identify a group of technical-debt tickets, provided your backlog is long enough, that could constitute a small project.

Results

The team has been following the technical-debt rotation as described in this article for about six months. Feedback from the team has been positive. Among other things, the engineers remarked that it adds variety to their work and that they appreciate the increased autonomy. Of course, the latter will only be the case for as long as there is a large enough backlog of technical-debt tickets to choose from. At some point, hopefully, we will have reduced our backlog significantly, and then we will have to rely on the intrinsic motivation of wanting to better understand an existing system by diving deeper into implementation details or the satisfaction of improving the performance or design of a service. From the perspective of an engineering leader, my end goal is to pay down as much technical debt as possible. In fact, the ideal size of our technical-debt backlog would be zero. This is a distant goal, but we have taken successful steps towards it. First, I wanted to reduce the rate of increase of the backlog. We achieved this within the first two weeks. If you preside over a technical-debt backlog that has only been growing for three years, it is already satisfying to see that it is no longer growing as quickly. The next step was to keep the number of tickets on the backlog steady, which we reached soon afterwards. Now we are at the point where the total number of tickets on our technical-debt backlog is, possibly for the first time ever, declining. The team is very happy about it. One year from now, I expect us to have drastically reduced our technical-debt backlog.

Parallel Run Pattern - A Migration Technique in Microservices Architecture

2021-11-04T00:00:00+01:00

The business landscape in Zalando is growing every day. This continuous growth implies that we need to be able to cope with an ever-changing environment. Everyone with experience in software development knows that dealing with changes is a challenging problem, especially when the software is already running in production. Changing the software in production is like changing the tires on a car while it is still moving.

In large organisations such as Zalando, where microservices architecture is the standard, changes are even more frequent. Technologies become obsolete, organization structures change, teams split or merge, monoliths are being rewritten, and yesterday's microservices become today's monoliths. All those examples impose dramatic changes in codebases.

Naturally, testing is the first solution that comes to our minds when trying to minimize the regression of a change. But, in scenarios like decomposing a monolith or replacing a legacy component with a newer one, testing might not be enough. Furthermore, there are always dark corners in our systems that we have never tested or whose behavior we no longer know. Sometimes, as you may well know from your own experience, legacy systems don't even have tests one can use as a reference.

In this article, we will explore a design pattern called the Parallel Run¹ which is a strategy to make sure those dramatic changes will not break the system. We will walk you through a real-world example and describe how we managed to replace a service by taking advantage of this pattern and show you the challenges and surprises we dealt with. In the end, we summarize the upsides and downsides of this pattern to better help you choose when to implement it and when not.

Decomposing the monolith, a case study

Zalando is aiming to unify the user experience across platforms². As part of this effort we, the Returns team, were required to extract the returns logic out of a soon-to-be legacy monolithic application. Returns logic, as the name might imply, deals with everything to do with customers returning articles they've bought on the Zalando Fashion Store. This article will explore how our team used the Parallel Run pattern to transparently and safely extract the returns logic from the monolith to the new Returns microservice.

This new service should behave exactly like the respective part in the monolith and the customers should not notice any difference after the migration. In order to achieve this, the following complications needed to be overcome:

While reading the old code is possible, we might miss some parts of the logic or misunderstand the code.
Some parts of the code are not tested, so running the tests over the new code (if possible) would not guarantee the exact behavior.
The criticality of the application precludes downtime.

Parallel Run Pattern

In order to solve these problems, wouldn't it be nice if we could verify that each request handled by the new system would be handled exactly in the same way as for the system currently running in production? The parallel run pattern does exactly that.

When using a parallel run, rather than calling either the old or the new implementation, instead we call both, allowing us to compare the results to ensure they are equivalent. Despite calling both implementations, only one is considered the source of truth at any given time. Typically, the old implementation is considered the source of truth until the ongoing verification reveals that we can trust our new implementation.

-- Sam Newman, Monolith to microservices

Implementation

There are several ways of implementing this pattern. Hereafter we present how we solved it for the above use case.

The following diagram shows the flow for each incoming request:

(1-2) The Client makes a request that is immediately processed and answered by the monolith to avoid any degradation in performance.
(3-4) After responding, the monolith POSTs a request to the /consistency-checks endpoint of the new Returns microservice, which immediately answers with 202 (Accepted), indicating that the request will be handled asynchronously. In this way we avoid the monolith having to wait, and we free its resources.
(5-6-7) The Returns microservice starts processing the request in the background by first re-issuing the same request to itself, this time calling the actual endpoint.
(8) Then the response from the Returns microservice is collected and compared with the one from the monolith.
(9) Finally, metrics and logs about the consistency are produced, so that we can later verify that the expected consistency has been reached and investigate any inconsistencies.

The async request sent to the ConsistencyChecker part of the Returns microservice contains information about the original request: the URL with its query parameters, the method, the headers and, when present, the body. This information represents the new request to be sent to the Returns microservice. It also includes the HttpStatus, the headers, and the body of the response returned by the monolith, so that they can be checked against the response from the Returns microservice.

The following is an example of the structure that we used:

{
  "request": {
    "url": {
      "path": "api/example?param=something"
    },
    "headers": {
      "Content-Type": "application/json;charset=UTF-8",
      "Accept-Language": "de-DE"
    },
    "method": "GET",
    "body": null
  },
  "response": {
    "status": 200,
    "headers": {
      "Content-Type": "application/json;charset=UTF-8",
      "transfer-encoding": "chunked"
    },
    "body": "json-response-body"
  }
}
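To make the use of this structure more concrete, the following is a minimal sketch of the comparison step, assuming the payload shape shown above. The function names, the set of ignored headers, and the "replayed" response are hypothetical stand-ins; in the real service the second response comes from re-issuing the request against the actual endpoint.

```python
import json

# Headers assumed to be irrelevant for the comparison (hypothetical choice).
IGNORED_HEADERS = {"transfer-encoding", "date"}


def normalize_headers(headers):
    # Header names are case-insensitive, so compare them in lower case.
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() not in IGNORED_HEADERS
    }


def find_differences(expected, actual):
    """Return a list naming the parts of the two responses that differ."""
    differences = []
    if expected["status"] != actual["status"]:
        differences.append("status")
    if normalize_headers(expected["headers"]) != normalize_headers(actual["headers"]):
        differences.append("headers")
    if expected["body"] != actual["body"]:
        differences.append("body")
    return differences


payload = json.loads("""
{
  "request": {"url": {"path": "api/example?param=something"},
              "headers": {"Accept-Language": "de-DE"},
              "method": "GET", "body": null},
  "response": {"status": 200,
               "headers": {"Content-Type": "application/json;charset=UTF-8",
                           "transfer-encoding": "chunked"},
               "body": "json-response-body"}
}
""")

# Stand-in for the response obtained by replaying payload["request"]
# against the new service.
replayed = {
    "status": 200,
    "headers": {"content-type": "application/json;charset=UTF-8"},
    "body": "json-response-body",
}

differences = find_differences(payload["response"], replayed)
```

Returning the names of the differing parts, rather than a single boolean, makes it easier to log why a check did not match.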

Each endpoint of the monolith has its own expected consistency to be reached in order to declare the migration successful. Once that threshold has been achieved, the migration can be considered safe, and we can perform the switch from the monolith to the new Returns microservice for that endpoint.

Monitoring and Reporting

In order to consider an endpoint ready, it had to reach a satisfying consistency percentage. For each request we produced the result metrics using Prometheus, and we displayed them with Grafana. Each endpoint, defined by an operation_id, had its own metric and its own tolerance. This was done because, as usual, fixing those last few percentages has a cost higher than the value it brings; given that each endpoint is completely separated from one another, each endpoint had its own target percentage to consider it consistent (enough).

Matched: counter for all the requests that matched between the monolith and the Returns microservice.

Unmatched: counter for all the requests that did not match between the two services. Possible examples could be:

Different HttpStatuses: such as 2xx and 4xx or even 201 and 200
Different Headers set: a missing header in one of the two responses or different values for the same header
Different Body responses: missing fields/attributes in the responses or different values for the same field/attribute

Failed: counter for all the requests where the response was terminated by temporary issues, for example any 5xx. In these cases, even if the responses matched, the result would not be valuable information given that the request couldn't be properly fulfilled due to a transient server-side issue. On the other hand, if the responses did not match in a 5xx case, the unmatched counter should be increased because it means the overall behavior of the Returns microservice doesn't match the one from the monolith, and it requires a deeper investigation.
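The three counters can be illustrated with a small sketch. It uses an in-memory Counter instead of Prometheus, and it makes two assumptions that are not spelled out above: a check counts as failed when the responses match and either side returned a 5xx, and failed checks are left out of the consistency percentage. The target value is hypothetical as well.

```python
from collections import Counter


def classify(monolith_status, service_status, responses_match):
    """Map one consistency check to 'matched', 'unmatched' or 'failed'."""
    if not responses_match:
        # Any disagreement counts as unmatched, including 5xx cases.
        return "unmatched"
    if monolith_status >= 500 or service_status >= 500:
        # Agreement on a transient server error carries no useful signal.
        return "failed"
    return "matched"


def consistency(counts):
    """Share of matched checks among those that produced a verdict."""
    decided = counts["matched"] + counts["unmatched"]
    return counts["matched"] / decided if decided else 0.0


counts = Counter()
checks = [
    (200, 200, True),
    (200, 201, False),
    (503, 503, True),
    (200, 200, True),
    (200, 500, False),
]
for monolith_status, service_status, responses_match in checks:
    counts[classify(monolith_status, service_status, responses_match)] += 1

# Hypothetical per-endpoint target; real targets differed per operation_id.
TARGET = 0.99
ready_to_switch = consistency(counts) >= TARGET
```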

Rollout

The switch was done gradually, and it was done per endpoint to allow the system to be tested in a fully functional way. This was achieved by using a proxy to move the forwarding of the requests to the Returns microservice one by one once they were ready. In our case we used Skipper, an open-source Proxy developed by Zalando.

In this way, by limiting each switch to a single endpoint, we avoided introducing a massive set of changes in one go, and we were able to collect additional feedback from every single switch while still working on finalizing the other ones.
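The idea of a per-endpoint switch can be sketched as a routing table that maps each endpoint to a backend. This is not Skipper configuration, only an illustration of the principle; the host names and operation ids are hypothetical.

```python
MONOLITH = "https://monolith.example"        # hypothetical host
RETURNS_SERVICE = "https://returns.example"  # hypothetical host


class EndpointRouter:
    """Tracks which backend serves each endpoint (operation_id)."""

    def __init__(self, operation_ids):
        # Every endpoint starts on the monolith.
        self._backend = {op: MONOLITH for op in operation_ids}

    def switch(self, operation_id):
        # Move a single endpoint to the new service.
        self._backend[operation_id] = RETURNS_SERVICE

    def rollback(self, operation_id):
        # Point the endpoint back at the monolith; no redeployment involved.
        self._backend[operation_id] = MONOLITH

    def backend_for(self, operation_id):
        return self._backend[operation_id]


router = EndpointRouter(["get_return", "create_return"])
router.switch("get_return")
```

Because each endpoint has its own entry, switching or rolling back one endpoint leaves all the others untouched.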

Clean-up

Once the migration was successfully finalized, all the code related to the parallel run logic needed to be cleaned up. The three main parts to remove were the handler performing the consistency check (use cases layer), the gateway to call the localhost (gateway layer) and the domain model related to the consistency logic (entities layer). Additional clean-ups were done for configuration files such as the feature toggle to enable/disable the consistency checker and the config for the localhost gateway, the dependency injection in the Main file, the consistency-checker API in the route and, of course, all the tests to validate the consistency check logic. Code-wise we removed ~700 lines of code and ~1.3k lines between unit and component tests.

Advantages of this approach

Live data for testing: We can leverage the real production data as test cases. Therefore, given enough time, the system will be tested potentially under all the "real-life" use cases.
Gradual rollout: The rollout is done per endpoint minimizing the amount of changes per switch.
Incremental development: The gradual rollout also enables the possibility to approach the implementation per endpoint.
Easy rollback: By using a proxy to do the traffic switch, rolling back just requires a change to the proxy to migrate the endpoint back to use the previous host instead of the microservice one; this avoids the need of redeploying, making the whole process faster.
Finding bugs: Since the new microservice will be tested with real data, there might be cases where even the monolith was behaving incorrectly. This approach can make those edge cases visible.
Load testing: In case of using a different technology for the newer service, parallel run pattern helps to understand the performance characteristics of the new service. As a result, the development team can target more realistic performance goals or SLOs before going live.

Considerations and Limitations

While this approach makes the migration safer and smoother, it also comes with some concerns and issues that need to be taken into account.

Increased load: Given that requests received by the monolith are forwarded to the microservice, the load across all components increases, potentially doubling.
Refine the comparisons: In the comparison check not everything needs to match 100%. For example, in our case we ignored some headers that were not relevant for the outcome of the request.
GDPR: While collecting the data for the comparison we need to take into account that sensitive information should either not be stored or be cleaned afterwards. In the former case, analyzing some inconsistencies for the fields containing personal data might not be easy.
Non-trivial comparisons: Comparing the results is not always a straightforward task. For example, comparing PDFs might be complicated due to different but negligible metadata, a change in the HTTP frameworks might result in different default response headers, or collections could have different orderings.
Non-idempotent endpoints: Idempotency should always be taken into account. For example, this approach can be used for POSTs that are idempotent but not when the idempotency of the endpoint cannot be guaranteed. When doing this investigation always consider the idempotency of each operation and its possible side effects (for example calling another POST API, updating a database, or publishing an event).
Not a quick-win: Even if this approach leads to a smooth and safe migration, it requires quite some time and effort to be properly set up and tuned.
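As an illustration of the collection-ordering problem mentioned above, the following sketch compares two JSON bodies while ignoring the order of list elements. It assumes that ordering is never meaningful in the compared responses, which will not hold for every endpoint; the function names and sample bodies are made up for the example.

```python
import json


def canonical(value):
    """Return a form of a JSON value that ignores list ordering."""
    if isinstance(value, dict):
        return {key: canonical(item) for key, item in value.items()}
    if isinstance(value, list):
        # Sort by serialized form so that lists of objects can be ordered too.
        return sorted(
            (canonical(item) for item in value),
            key=lambda item: json.dumps(item, sort_keys=True),
        )
    return value


def bodies_equivalent(left, right):
    return canonical(json.loads(left)) == canonical(json.loads(right))


monolith_body = '{"items": [{"sku": "A"}, {"sku": "B"}], "count": 2}'
service_body = '{"count": 2, "items": [{"sku": "B"}, {"sku": "A"}]}'
same = bodies_equivalent(monolith_body, service_body)
```

For endpoints where the order of a collection is part of the contract, such a normalization would hide real inconsistencies and should not be applied.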

Verdict

Implementing a parallel run is rarely a trivial affair, and is typically reserved for those cases where the functionality being changed is considered to be high risk. (...) the work to implement this needs to be traded off against the benefits you gain.

-- Sam Newman, Monolith to microservices

The parallel run pattern is a powerful technique to overcome the complexities and stress of migration projects, but not every migration project is a good fit for it. Increased traffic, complexities in comparing the results, and the amount of effort are the risks that should be considered before implementing this pattern.

In the end, this pattern is just a tool that should be used wisely considering constraints, use cases, and team capacity when planning for it. When it is done properly, it saves you a lot of headaches.

¹ Newman S. (2020). Monolith to Microservices. 2nd ed. O’Reilly Media, Inc.
² You can learn more about this effort in a series of articles about GraphQL in this blog.

Tracing SRE’s journey in Zalando - Part III

2021-10-15T00:00:00+02:00

This is the third and last part of our journey to roll out SRE in Zalando. You’ll find the previous chapters here and here. Thanks for following our story.

2020 - From team to department

The road so far: 2016 saw an attempt at the rollout of a Site Reliability Engineering (SRE) organization that did not quite materialize but still left the seed of SRE in the company; in 2018 and 2019 we had a single SRE team working on strategic projects that improved the reliability of Zalando’s platform. The success of that last team brought with it many requests for collaboration, which had to be balanced with SRE’s own roadmap. In this chapter we’ll learn how SRE adapted in order to achieve sustainable growth.

In late 2019 there was a reorg in our Central Functions unit. This reorg was centered around a set of principles, chief among them “Customer Focus”, “Purpose” and “Vision”. Through that reorg SRE became a department that encompasses the original SRE Enablement team, the teams building monitoring services and infrastructure, and incident management. This is a clear investment from the company in the value SRE repeatedly demonstrated. The close collaboration those teams had had in the previous years already hinted at a common purpose between them. Through the Incident Commander role and the support for Postmortems, SRE was always in close contact with Incident Management. Distributed Tracing, where SRE invested much of its efforts, was actually owned by one of the monitoring teams. Now that everyone was under the same ‘roof’ we could further strengthen the synergies that were already in place.

Zalando’s SRE ~~team~~ department logo

In 2019 SRE had already started to dedicate time to its own products, but the creation of a department further endorsed SRE’s long term plans. But with an entire department under the SRE label, we had to be smart about our next steps. Particularly in the long term. Also, we had to adjust to what it meant operating as a department. Before, with a single team we could be (and occasionally had to be) more flexible, picking ad hoc projects. But now we had teams with a better defined purpose. And we wanted to have all teams working together towards a common goal. It was time to come up with a plan for how we could implement our new purpose: to reduce the impact of incidents while supporting all builders at Zalando to deliver innovation to their users reliably and confidently. That plan was materialized into the SRE Strategy, which was published in 2020, and it set the path for the years to come.

Following the same set of principles that influenced the creation of the SRE department (”Customer Focus”, “Purpose” and “Vision”), the SRE Strategy had at its core Observability. How did Observability fit with those principles and bound the three teams? For the teams developing our monitoring products it’s quite obvious. But Observability is also key for SRE: we drive our work through SLOs, and it is at the base of the Service Reliability Hierarchy. Finally, Incident Management is made that much more efficient with the right Observability into our systems, by identifying issues in our platform, and also making it easier to understand what is affecting the customer experience.

Our strategy set a target of standardizing Observability across Zalando. Through that standardization we could achieve a common understanding of Observability within the company, reduce the overhead of operating multiple services and make it easier to build on top of well defined signals (like we did before with OpenTracing). The concrete step for making this possible was to develop SDKs for the major programming languages in use at Zalando. Standardization was something we grew quite fond of in the previous years. While operating as a single team, doing several projects with different teams, we were uniquely positioned to identify common pain points or inefficiencies across the company. But eventually we also realised one thing: as a single team it would be challenging to scale our enablement efforts to cover hundreds of teams in the company. Waiting for the practices we tried to establish to spread organically would also take too long. The only way we could properly scale our efforts and reach our goals was to develop the tools and practices that every other team would use in their day to day work. We couldn’t do everything at once, but our new strategy gave us the starting point: Observability.

Observability is also at the base of Service Reliability Hierarchy

We started collecting metrics on our performance regarding Incident Response. How many incidents were we getting? What was the Mean Time To Repair? How many were false positives? What was the impact of those incidents? Now that incident management was part of SRE, it was important to understand how the incident process was working, and how it could be improved. We were already rolling out Symptom Based Alerting, so that alone would already help with reducing the False Positive Rate. But we took it a step further and devised a new incident process that separated Anomalies and Incidents. It’s easy to map these improvements to benefits for the business and to our customers, but there’s also something to be said about the health of our on-call engineers. Having an efficient incident process (and the right Observability into a team’s systems), goes a long way to making the lives of on-call engineers better. Pager fatigue is something that should not be dismissed, and can hurt a team through lower productivity and employee attrition. Something important to highlight in this whole process is that we started by collecting the numbers to see if they would match what our observations had already been pointing to. This is a common practice that guides our initiatives. That is also why one of the first things we did after creating the department was to define the KPIs that would guide our work, make sure they were being measured, and facilitate the reporting of those KPIs.

SRE continued the rollout of Operation Based SLOs by working closely with the senior management of several departments and agreeing on their respective SLOs. Those SLOs would be guarded by our Adaptive Paging alert handler. With this we also continued the adoption of Symptom Based Alerting. With Adaptive Paging we had an interesting development. Our initial approach was to make the SLO the threshold upon which we would page the on-call responder. What we soon discovered is that it made our alerts too sensitive to occasional short lived spikes, similar to any other non-Adaptive Paging alert. We mitigated this by providing additional criteria that engineers could use to more granularly control the alert itself (time of day, throughput, length of the error rate). What initially was supposed to be a hands off task for engineers (defining alerts and thresholds) quickly led us down a path we were already familiar with. Engineers were back at defining alerting rules because the target set by the SLO was not enough. After some experiments, we improved Adaptive Paging by having it use Multi Window Multi Burn Rate alert threshold calculation. This change resulted in two relevant outcomes. First, it brought Error Budgets to the forefront. Deciding whether to page someone or not was no longer a question of whether the SLO was breached, but rather of whether the Error Budget was at risk of being depleted. The second outcome, and arguably the more important one, is that the rules for the operations guarded by our alert handler (length of the sliding windows and the alarm threshold) could now be derived automatically from the SLO without any effort from the engineering teams, who previously had to find them through trial and error.
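The principle behind a multi window, multi burn rate check can be sketched as follows. This is not the Adaptive Paging implementation; the function names are hypothetical, and the threshold of 14.4 is the commonly cited example value for a one-hour window against a 30-day SLO period (a burn rate that consumes 2% of the budget in one hour), not necessarily what we use.

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is consumed; 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo)


def should_page(long_window_error_rate, short_window_error_rate, slo, threshold):
    # Page only when both windows exceed the threshold: the long window shows
    # that the burn is significant, the short one that it is still ongoing.
    return (
        burn_rate(long_window_error_rate, slo) >= threshold
        and burn_rate(short_window_error_rate, slo) >= threshold
    )


SLO = 0.999
THRESHOLD = 14.4  # assumed example value: 2% of a 30-day budget in one hour

sustained_outage = should_page(0.02, 0.02, SLO, THRESHOLD)
already_recovered = should_page(0.02, 0.0005, SLO, THRESHOLD)
short_spike = should_page(0.001, 0.05, SLO, THRESHOLD)
```

Both the window lengths and the threshold follow from the SLO and the chosen budget consumption, which is what allows the rules to be derived automatically. Only the sustained outage pages in this example; the short spike and the already recovered case do not.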

The challenge with rolling out Operation Based SLOs was that reporting and getting an overview of those SLOs was not easy, with the data fragmented in different tools. To address this issue, a new Service Level Management tool was developed. As we evolved the concept of SLOs, so too did we evolve the tooling that supported it. Other than reporting SLOs for the different operations, we also gave a view on the Error Budget. Knowing how much Error Budget is left makes it easier to use it to steer prioritization of development work.

Our operation based Service Level Management Tool (not actual data)
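The Error Budget view boils down to simple arithmetic, sketched here under the assumption of a request-based SLO. The function name and the numbers are made up for illustration and do not come from the tool itself.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left in the reporting period."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        # No traffic or a 100% target leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over one million requests allows about 1000 failures;
# 250 failures so far leave roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

A negative result means the budget for the period has been overspent.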

Late in 2020 we began developing what we called the SRE Curriculum. This was an initiative that aimed at scaling the educational benefits of SRE. Specifically, this meant sharing the wealth of knowledge that SREs have accumulated over time about the sharp edges of production. We were looking not only at raising the bar on the company’s operational capabilities, but also to facilitate any interactions with other teams by providing a common understanding on the topics covered by the curriculum. In the previous years we did several training sessions for incident response, distributed tracing, and alerting strategies. These were ad hoc engagements when teams requested our support. With the advent of the pandemic, many things changed and we had to adapt. Those training sessions were one of those things. The format for those sessions was based on having them in person. We did try to do some via video conference, but it did not have quite the same result. At the same time, the company’s Tech Academy was facing the same challenges. We grouped together to develop a new series of training sessions in a new format. The deliverables of this new format were a video and a quiz for each topic, with the content of each training being created and reviewed by subject matter experts to ensure a common understanding and a high quality training. This way we captured the knowledge that could be consumed by anyone in the company at any given time and different pace. Also, by having those training sessions part of the onboarding process, any engineer joining Zalando would get an introduction to some of the SRE practices we were rolling out.

The studio where we recorded some of the training sessions

The support of the SRE Enablement team is still in high demand for ad hoc projects. After another collaboration between SRE and the Checkout teams, the senior management of that department officially pitched for the creation of an Embedded SRE team. This is something we had in the back of our minds for further down the road. But to have it requested by another department was an interesting development. In any case, here we were. This development presented quite a few new challenges (and opportunities):

What will the team work on? What will its responsibilities be?
Who will the team report to?
Is this time bound? Or is it a permanent setup?
If they report to separate departments, how will they review the collaboration? Or how do we do performance evaluation effectively for SREs working in a different department?
How will the embedded SRE team collaborate with the product development team?
How will the embedded team keep in sync with the central team?

The Embedded team will report to the SRE department, and both SRE and product area management have aligned on a set of KPIs like Availability and On-call Health. The former will be dictated by the SLOs defined for that product area, while the latter aims at making sure the operational aspect is not taking its toll on the product development team. On-call Health will be measured taking into account paging alerts and how often an individual is on-call.

We’re still figuring out most things as we go along, but this is an exciting development. This team will be different from the Enablement team, in the sense that it will have a much more concrete scope. This team will be able to be more hands-on on the code and tooling used within the product development team. It will be a voice for reliability within that product area, able to influence the prioritization of topics which ensure a reliable customer experience in our Fashion Store. The SRE department will also benefit from having a source providing precious feedback on whatever the department is trying to roll out to the wider engineering community.

You may remember that in our last article we mentioned that hiring was always a challenge (a topic you can also read about in the experience of other companies that rolled out SRE). Now we’re planning to bootstrap another team, so that cannot be making things any easier. But the truth is that having a department with teams that are different in nature also had an unexpected benefit for our hiring. Before, our capacity constraints prevented us from hiring candidates who were not yet a good fit for the original position and then developing them towards the SRE mindset. Now a candidate with potential can join one of the teams in the department, and from there grow into the SRE role. Whether they later join the SRE Enablement team or not is not that important (although team rotation is something that is quite active in Zalando). Any team can benefit from having someone with the SRE mindset. Also, we strive for close collaboration within the department, so it’s not like engineers are isolated in their respective teams.

And this is it, mostly. You are all caught up with how SRE has been adopted in Zalando, and what we’ve been up to. And what a ride it has been! Attempting to create a full SRE organization, later starting with a single central team, reaching the limits of that team, creating a department, further growing that department with an embedded SRE team… Were we 100% successful? No (also, SREs don’t believe in 100%). But we’ve done the Postmortem where we failed, and the learnings we got from there turned into action items in our strategy. This has been working really well for us, but there’s still so much to do. There are many interesting directions that SRE can develop in, so we’re really excited to see what challenges we’ll get next. Until we reach our next stage of evolution, we’ll keep doing what we do best: dealing with ambiguity and uncertainty. And helping Zalando ensure customers can buy fashion reliably!

Tuning Image Classifiers using Human-In-The-Loop

2021-10-13T00:00:00+02:00

In this blog post we describe an algorithm we developed when building our product image analysis infrastructure, where we use human-in-the-loop to tune the thresholds of our image classifiers. We discuss the algorithm in the following, and present some mathematical details and a simple code example in the appendices.

Background

When customers browse for a product on the Zalando website they may use descriptive terms to search for what they want. For example, a customer may use a specific term such as leopard print dress instead of a more generic term such as casual dress. One approach we use to support product search using descriptive terms is to automatically generate additional product information from product images using computer vision techniques. In particular, we train image classifiers to identify products that have a particular fashion attribute such as a specific pattern or style, e.g. leopard print, which corresponds to a descriptive search term.

Problem

A typical image classifier generates a class-confidence score (a value between 0 & 1) at its output to indicate that a given input image belongs to one of the specified output classes, i.e., the image shows a particular fashion attribute. To generate a binary decision from the classifier output a class-confidence threshold parameter is selected based on a classifier performance metric such as precision & recall. Once the threshold has been selected the model can be deployed and used to generate class labels for an input image, which can be used in product search.

Over time the characteristics of the input product images may change, leading to a drift in the input data distribution. For image classifiers that are used to generate predictions for out-of-distribution input images the performance of the classifier may degrade. For example this may occur when an image classifier is trained on Zalando product images before the introduction of a new photography style on a revamped Zalando website, for which there are no annotated image examples in the new style available to retrain the model.

To solve this problem we modify the class-confidence threshold of the classifier to compensate for data distribution drift. For this purpose we developed an Expectation-Maximization (EM) algorithm that we call AutoThreshold. AutoThreshold estimates an optimal class-confidence threshold for an image classifier using manual annotations from a selection of the classifier's predictions on the out-of-distribution data. Additionally, the process of creating annotations for the out-of-distribution data helps in the generation of a new data set that can be used to train a new version of the image classifier.

Selecting Classifier Thresholds

The optimal threshold value for an image classifier is the class-confidence score, a value between 0 & 1, for which the set of predictions above that score leads to optimal classifier performance. Ideally this value would be 0.5, i.e., the center of the range. However, for a number of reasons this is rarely the case in practice, and the threshold is usually estimated after training to achieve the best results.

The estimated optimal threshold for each output class of an image classifier is evaluated using an annotated image data set, i.e. validation set, where each image in the set is manually assigned a class label. The image classifier is tested by using the validation data set as input and comparing the classifier's predictions to the manually assigned labels. We can measure classifier performance using metrics such as precision & recall, which indicate the quality and quantity of the results. Optimizing the threshold is usually a tradeoff between precision & recall, where we want to find a threshold value that results in an acceptable score for both. Typically, a performance metric that combines both precision and recall, such as the $f_\beta$-measure, is used, and the class-confidence score that maximizes the metric is chosen as the threshold value.
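As a minimal sketch of this selection procedure, the snippet below sweeps every observed class-confidence score as a candidate threshold and keeps the one that maximizes the $f_\beta$-measure on a small validation set. The helper names `f_beta` and `best_threshold` and the toy data are assumptions made for illustration, not part of our production code.

```python
def f_beta(annotations, labels, beta=1.0):
    """F-beta score from binary annotations and binary predicted labels."""
    tp = sum(1 for a, y in zip(annotations, labels) if a == 1 and y == 1)
    fp = sum(1 for a, y in zip(annotations, labels) if a == 0 and y == 1)
    fn = sum(1 for a, y in zip(annotations, labels) if a == 1 and y == 0)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def best_threshold(scores, annotations, beta=1.0):
    """Return the candidate score that maximizes F-beta, and that F-beta value."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        # Predictions at or above the candidate threshold are labelled true.
        labels = [1 if s >= t else 0 for s in scores]
        f = f_beta(annotations, labels, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical validation set: class-confidence scores and manual labels.
scores = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.91]
annotations = [0, 0, 1, 0, 1, 1, 1]
threshold, score = best_threshold(scores, annotations)
print(threshold, round(score, 3))
```

Choosing `beta` above 1 weights recall more heavily and a value below 1 favours precision, which is one way to express the tradeoff described above.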

Estimating Thresholds in the Absence of Data

For our use case there exists no training or validation data set for the out-of-distribution input image set. Furthermore, we do not annotate all images in advance, as this would be a costly, and time consuming, exercise for the scale of the data at Zalando (currently around 600k products). To overcome these issues we make use of the simple fact that when classifier predictions are ordered by class-confidence score—for a well trained image classifier—high-confidence class predictions exhibit greater correspondence with the image annotations than low-confidence predictions, which indicates model performance, and allows us to search for an optimal threshold between both extremes (demonstrated in the plot below). With this in mind, we frame threshold selection as an optimization problem using manual annotators, who generate annotations to be used in the metric calculations required to estimate a threshold.

Specifically, we take an iterative approach, where images to be annotated are conditioned on the image classifier, and annotators annotate a subset of the classifier's most confident predictions first. The generated annotations are used to estimate a threshold using our selected performance metric, and the process is repeated until our estimated threshold converges. This process can be implemented as an Expectation-Maximization algorithm, and describes a human-in-the-loop procedure, which generates a validation data set for the out-of-distribution data over a number of iterations. Furthermore, the data set is generated in an efficient way, both in terms of the number of annotations required, and the selection of image examples which contribute most to discovery of an optimal threshold.

Problem Definition

Taking a binary image classifier as our motivating example, which typically has a sigmoid output layer, the value generated at the output for each of the $n$ input images can be interpreted as a class-confidence score, or probability $p_{i}$, that an input image, $\mathbf{x}_i$, belongs to the output class, $c$. For the purposes of image attribute identification, the predictions at the output, $\mathbf{p} =[p_{1},\dots,p_{n}]$, undergo a thresholding operation to replace the class-confidence scores with a binary class label, which indicates a transform from a continuous to a categorical probability distribution. Since the output layer is a sigmoid function, where output values are thresholded by the parameter $t$ into two binary categories, true & false, we can model the classifier's output distribution using a Bernoulli distribution, i.e., $P(\mathbf{x}_i=c | p_{i})$. Furthermore, the distribution of annotations also follows a Bernoulli distribution. Using these details, we frame the problem of threshold estimation within the framework of the Expectation-Maximization algorithm. We present the algorithm details below, and a more detailed mathematical explanation in Appendix A.

Threshold Estimation Using the EM Algorithm

The Expectation-Maximization algorithm is an iterative method to find maximum likelihood estimates of parameters (such as our classifier threshold) in the presence of unobserved latent variables. In our problem setting, the predictions made by the classifier are observed by our annotators to generate image annotations. However, the order of the images presented to the annotators is conditioned on the classifier's class-confidence score, which is unknown to our annotators. As mentioned, the estimated optimal threshold corresponds to a class-confidence score, and thus our latent variable allows us to estimate an optimal threshold for our classifier. Each iteration of the EM algorithm alternates between performing an Expectation step (E-step), which constructs a likelihood function to estimate the latent variable, and a Maximization step (M-step), which computes parameters that maximize the function constructed in the E-step. For our algorithm, the E-step generates annotations for the classifier's most confident predictions and the M-step estimates the optimal class-confidence threshold using the new set of annotations. Both steps are repeated at each iteration until the estimated threshold converges.

Algorithm Details - Binary Classifier

For a set of images, $\mathbf{X}=[\mathbf{x}_1,\dots,\mathbf{x}_n]$, and their class-confidence scores, $\mathbf{p}$, we construct a set of images ordered by their scores, $\mathbf{X}_{\tt asc} = {\tt sort}(\mathbf{X},\mathbf{p})$, to estimate the optimal threshold, $\hat{t}$, for the output class. We use $\mathbf{X}_{\tt asc}$ as input to the AutoThreshold algorithm, and specify a number of hyperparameters including the subset window size $m$, and classifier performance metric ${\tt metric}(.)$ (e.g., $f_{\beta}$-measure). We define a data windowing function that selects images to be annotated by centering a window of size $m$ on $\mathbf{X}_{\tt asc}$ at the position that corresponds to the current threshold estimate (class-confidence score), i.e., $\mathbf{X}_{\tt subset} = {\tt window}(\mathbf{X}_{\tt asc}, \hat{t}, m)$. We denote the associated predictions for the windowed subset as $\mathbf{p}_{\tt subset}$, and denote the annotations generated for this set as $\mathbf{a}_{\tt subset}$. Furthermore, we define a thresholding function ${\tt threshold}(\mathbf{p}_{\tt subset}, t)$, which generates true and false class labels from model predictions to be used as input to the performance metric.
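The two helper functions can be sketched as follows. This is an illustrative version only: it assumes the scores are already sorted in ascending order, returns index bounds rather than images, and locates the window centre with a binary search, which is our choice for the sketch rather than a requirement of the algorithm.

```python
from bisect import bisect_left


def window(p_asc, t_hat, m):
    """Return (start, stop) indices of a size-m window centred on t_hat.

    p_asc holds class-confidence scores in ascending order. The window is
    clipped at the ends of the list, so near t_hat = 1 only about m/2
    items are selected.
    """
    centre = bisect_left(p_asc, t_hat)
    start = max(0, centre - m // 2)
    stop = min(len(p_asc), centre + m // 2)
    return start, stop


def threshold(p_subset, t):
    """Map class-confidence scores to binary labels (1 = true, 0 = false)."""
    return [1 if p >= t else 0 for p in p_subset]


p_asc = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
start, stop = window(p_asc, t_hat=1.0, m=4)
p_subset = p_asc[start:stop]
labels = threshold(p_subset, t=0.9)
print(start, stop, p_subset, labels)
```

The slice `p_asc[start:stop]` plays the role of $\mathbf{p}_{\tt subset}$, and the same index bounds would be used to pick the images sent for annotation.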

The EM algorithm is outlined below:

1. Specify hyperparameters $m$ & ${\tt metric}$
2. Initialise the current threshold estimate $\hat{t}$ to the maximum class-confidence score, i.e. 1
3. E-step: Generate a new subset of manual annotations, $\mathbf{a}_{\tt subset}$, for the selected images, $\mathbf{X}_{\tt subset} = {\tt window}(\mathbf{X}_{\tt asc}, \hat{t}, m)$
4. M-step: Compute a new threshold estimate which corresponds to the maximum metric value for the new set of annotations, $\hat{t} = {\underset {t} {\operatorname {argmax} }} \ \, {\tt metric}(\mathbf{a}_{\tt subset}, {\tt threshold}(\mathbf{p}_{\tt subset}, t))$
5. Return to step 3 until convergence

Practical Details

Below are some practical details on the operation of the algorithm:

Note that $\hat{t}$ can be initialized to any value between 0 & 1. If a good initial estimate is available it can be used to initialize the algorithm; if not, initializing to 1 is a good choice. Also note that when initializing to the maximum, due to edge effects, the windowing function will only capture the $m/2$ examples beneath $\hat{t}$.
The EM algorithm typically converges to a local optimum. For our use case there is a global optimum, and we have observed (for a suitably selected subset size) very good convergence and results with this approach.
Note that the algorithm operates on subsets of the unannotated data, so the number of available unannotated images, $n$, could grow as the algorithm runs, i.e., $n$ is not required to be fixed. Furthermore, the number of required annotations (and hence algorithm iterations) will depend on the metric and subset size chosen.

Finally, for a multilabel classifier, where the output classes, $\mathbf{c}= [c_1,\dots,c_k]$, are independent but not mutually exclusive of each other, the above algorithm can be performed for each class separately, where the task is to estimate $\hat{t}_j$ for each of the $j=1,\ldots,k$ classes.

Threshold Estimation Example

Below we present an annotation plot for a run of our EM algorithm for a leopard print image classifier, which is a binary classifier and has a single class output. The middle subplot presents the annotations for the images sent to a crowdsourcing platform, ordered in ascending class-confidence score (as illustrated by the orange curve), where positively labelled images are indicated at the top of the subplot by blue dashes and negatively labelled images are indicated at the bottom of the subplot by purple dashes. We can see that for high confidence predictions there are many positive annotations with few negative annotations, illustrating that the classifier is performing well. However, there is a point at which the occurrence of positive labels is frequently punctuated by negative annotations, illustrating that the classifier performs poorly beyond this point. We can see from the subplot that the threshold estimated by the EM algorithm (as indicated by the black dot) is positioned just before the classifier begins to perform poorly, which demonstrates the algorithm's usefulness in estimating an optimal class-confidence threshold. Furthermore, the annotation density subplot indicates a natural separation between the clusters of positive and negative annotations, and the estimated threshold corresponds to this also.

To illustrate further we present a slope plot below, where we generate a cumulative sum of annotations and examine the slope of the curve, where annotations are assigned values 1 & 0 for positive and negative labels respectively, and are ordered by the class-confidence scores generated by the classifier (as was the case in the previous plot). The resultant plot is piecewise linear, where flat-line segments in the curve above the threshold represent consecutive False Positives, whereas those beneath the threshold represent consecutive True Negatives. Conversely, sloped-line segments in the curve above the threshold represent consecutive True Positives, whereas those beneath the threshold represent consecutive False Negatives. For our purposes we would like the curve above the threshold to have a slope as close to 1 as possible, and on average to have a steeper slope above the threshold than beneath it.
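The quantities behind such a slope plot can be sketched in a few lines. The annotation values and the threshold index below are made up for illustration; the point is only that the average slope of a segment of the cumulative curve equals the fraction of positive annotations in that segment.

```python
from itertools import accumulate


def cumulative_annotations(annotations):
    """Running sum of 0/1 annotations, ordered by ascending class-confidence."""
    return list(accumulate(annotations))


def mean_slope(annotations):
    """Average slope of the cumulative curve = fraction of positive labels."""
    return sum(annotations) / len(annotations) if annotations else 0.0


# Hypothetical annotations ordered by ascending class-confidence score.
annotations = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
thresh_ind = 6  # hypothetical index of the estimated threshold

curve = cumulative_annotations(annotations)
slope_below = mean_slope(annotations[:thresh_ind])
slope_above = mean_slope(annotations[thresh_ind:])
print(curve, slope_below, slope_above)
```

In this toy case the average slope above the threshold is 5/6 and the one beneath it is 1/6, which is the kind of separation we look for in the real plot.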

In the slope plot we observe the following:

There are many long sloped-line segments above the threshold, whereas there are few beneath the threshold
There are many long flat-line segments beneath the threshold, whereas there are few above the threshold
The slope on average above the threshold is steeper than beneath it

Therefore, for the leopard print image classifier predictions, we see that the threshold estimated by the AutoThreshold algorithm successfully identifies an appropriate class-confidence threshold.

Conclusion

We have presented a novel algorithm for the task of optimal threshold estimation for an image classifier that is applied to out-of-distribution data, where an EM algorithm and human-in-the-loop is used to generate annotations for the out-of-distribution data, which are used to calculate a threshold to compensate for the difference in distributions. The algorithm is simple to implement, and is efficient in terms of the number of annotated image examples required to estimate an optimal threshold.

In future work, we will explore using the EM algorithm and human-in-the-loop to train a classifier in the context of active learning, i.e., the case where there is no annotated data set to train a classifier.

If you would like to work on similar problems, consider joining our Data Science teams!

Appendix A: Mathematical Details

Below we provide further details on the presented algorithm's interpretation as an Expectation-Maximization (EM) algorithm.

EM Algorithm Description

Using standard notation, the EM algorithm can be described as follows: For a set of observed data $\mathbf{X}$ generated from a statistical model with unknown parameters $\boldsymbol{\theta}$, and a set of latent variables $\mathbf{Z}$, which are unobserved but affect the distribution of the data nonetheless, we estimate the values for $\boldsymbol{\theta}$ by maximizing the marginal likelihood of the observed data,

${\displaystyle L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} \mid {\boldsymbol {\theta }})=\int p(\mathbf {Z} \mid \mathbf {X} ,{\boldsymbol {\theta }})p(\mathbf {X} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} }$,

i.e., we generate a maximum likelihood estimate (MLE) for $\boldsymbol{\theta}$. However, this quantity is often intractable since $\mathbf {Z}$ is unobserved and its distribution is unknown before obtaining $\boldsymbol{\theta}$.

The EM algorithm seeks to overcome this issue, and finds the MLE of the marginal likelihood by iteratively maximizing a specified $Q$ function, which is defined as the expected value of the log likelihood function of ${\boldsymbol {\theta }}$, i.e., $Q({\boldsymbol {\theta }}\mid {\boldsymbol {\theta }}^{(t)})=\operatorname {E} _{\mathbf {Z} \mid \mathbf {X} ,{\boldsymbol {\theta }}^{(t)}}\left[\log L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )\right]\,$ (here the superscript $(t)$ indexes the iteration, not the threshold). The $Q$ function is maximized over two steps: In the first step (the E-step) the data-dependent parameters of the $Q$ function are calculated, while in the second step (the M-step) we seek to maximize the function constructed in the E-step over the parameters $\boldsymbol{\theta}$, where the value that achieves the maximum is our new estimate, $\boldsymbol {\theta }^{(t+1)}$.

AutoThreshold as an EM Algorithm

Using the above notation and translating to our algorithm description, our observations, $\mathbf{X}$, are the vector of annotations generated by human-in-the-loop, $\mathbf{a}$; our unobserved latent variables, $\mathbf{Z}$, are the ordered classifier predictions used to generate $\mathbf{a}$, i.e. $\mathbf{p}$; and the unknown model parameters, ${\boldsymbol {\theta }}$, are defined by the statistical model used to generate $\mathbf{X}$, which in our case is the Bernoulli distribution, as the annotators answer a yes-no question when generating annotations for our image data set. For the Bernoulli distribution, there is a single model parameter $p$, which is simply the probability that an observation will be true.

For our use case, where we estimate a class-confidence threshold, $\hat{t}$, for an image classifier in order to generate binary predictions, the parameter $p$ has a direct correspondence, which can be explained as follows: For an ideal image classifier with perfect accuracy applied to a balanced data set (i.e., a data set with an equal number of true and false examples) the output distribution of the class labels will be uniform and the parameter $p$ will be 0.5, as all predictions will be correct, and a true or false outcome will have equal probability as the observations are balanced. Similarly, in the ideal case the sigmoid units at the output will be perfectly normalized and the class-confidence threshold used to assign predictions to categories will also be 0.5 (as is the standard assumption with logistic regression analysis etc.). Also, 0.5 corresponds to the sample mean of the observed predictions (where true & false are represented numerically by 1 & 0) which is the MLE for the parameter $p$.
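For readers who want the missing step, the claim that the sample mean is the MLE for $p$ follows from a short standard derivation. We write it here with the annotations $a_1,\dots,a_n \in \{0,1\}$ as the observations, which is a notational assumption made for this sketch.

```latex
\[
\log L(p;\mathbf{a}) = \sum_{i=1}^{n}\Big[a_i \log p + (1-a_i)\log(1-p)\Big]
\]
\[
\frac{\partial \log L}{\partial p}
  = \frac{\sum_{i} a_i}{p} - \frac{n-\sum_{i} a_i}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} a_i
\]
```

For a balanced set with $n/2$ true observations this gives $\hat{p} = 0.5$, matching the ideal case described above.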

Known Unknowns

As we move away from the ideal case where the data may not be balanced or the image classifier may exhibit errors, the parameter $p$ and threshold $t$ deviate from 0.5 and both become unknown (but still remain in the range from 0 to 1), since the classifier's output class distribution, $\mathbf{y}$, becomes unknown. However, a direct correspondence between the two parameters remains. To overcome this issue, and estimate an appropriate value for $t$ using a known distribution, i.e., $\hat{t}$, we generate a validation data set, i.e., a set of manually annotated images, and test the image classifier by generating class predictions for the images and comparing them against the image annotations. The goal is to estimate a value for $\hat{t}$ that will generate a class label output distribution, $\mathbf{y}$, as close as possible to $\mathbf{a}$.

However, as already discussed in this article, there are additional practical considerations when evaluating the performance of an image classifier such as precision & recall, and simply comparing annotations to class predictions to determine performance may not lead to the selection of a useful classifier. To choose a suitable image classifier, the effect of the class-confidence threshold itself must be considered, which leads to a meta-labeling of the model's class predictions using the annotations in the validation data set. In particular, all positively annotated images that are correctly classified are known as True Positives (TP), whereas those that are incorrectly classified are known as False Negatives (FN). Conversely, all negatively annotated images that are correctly classified are known as True Negatives (TN), whereas those that are incorrectly classified are known as False Positives (FP).
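The meta-labeling step can be illustrated with a small sketch. The function names and the toy validation data are hypothetical; the mapping of annotation and predicted label to TP, FN, TN and FP follows the definitions in the paragraph above.

```python
from collections import Counter


def meta_label(annotation, label):
    """Return TP, FN, TN or FP for one annotation / predicted-label pair."""
    if annotation == 1:
        return "TP" if label == 1 else "FN"
    return "FP" if label == 1 else "TN"


def meta_label_counts(annotations, scores, t):
    """Count the four categories after thresholding scores at t."""
    labels = [1 if s >= t else 0 for s in scores]
    counts = Counter(meta_label(a, y) for a, y in zip(annotations, labels))
    return {k: counts.get(k, 0) for k in ("TP", "FN", "TN", "FP")}


# Hypothetical validation data.
annotations = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
counts = meta_label_counts(annotations, scores, t=0.5)
print(counts)
```

Moving `t` up or down shifts images between these categories, which is why the threshold itself has to be part of the evaluation.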

Using these four categories of class prediction, a performance metric can indicate how close an image classifier's class output distribution is to the validation data set, while also giving an indication of the classifier's performance when it comes to precision & recall.

Averages Over Categories

As mentioned above, an important component of the EM algorithm is how to calculate the maximum likelihood estimate for the unknown parameter $\boldsymbol{\theta}$. For our use case where the observations are generated by a Bernoulli distribution, the MLE for the parameter $p$ is the sample mean. However, as discussed above, for our use case we must also consider precision & recall, which necessitates the use of a performance metric to determine a class-confidence threshold that optimizes $p$ with respect to the validation data set. Performance metrics such as precision & recall can nevertheless be interpreted as averages over categories, which provides a direct connection to the MLE for $p$. For example, recall can be considered an average over the meta-labeled positive annotations TP & FN, i.e., recall = TP/(TP+FN); while precision can be considered an average over the meta-labeled annotations above the threshold, i.e., precision = TP/(TP+FP). Furthermore, as discussed, precision and recall may be combined to create a performance metric such as the $f_\beta$-measure; such derived performance metrics also perform averaging over the values for precision & recall. In summary, for a chosen performance metric, the optimal value for $\hat{t}$ has the effect of generating a Bernoulli distribution $\mathbf{y}$ which is as close as possible to $\mathbf{a}$, and also specifies a level of control over precision and recall.
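For reference, the standard definition of the $f_\beta$-measure makes the averaging explicit: it is a weighted harmonic mean of precision and recall, and it can be rewritten directly in terms of the meta-labeled counts.

```latex
\[
f_\beta
  = (1+\beta^2)\,\frac{\mathrm{precision}\cdot\mathrm{recall}}
                      {\beta^2\,\mathrm{precision}+\mathrm{recall}}
  = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP+\beta^2\,FN+FP}
\]
```

As a made-up worked example, with TP = 8, FP = 2 and FN = 4 we get precision = 0.8, recall = 8/12 ≈ 0.667 and, for $\beta = 1$, $f_1 = 16/22 \approx 0.727$.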

Optimization Loop

Now that we have described how the AutoThreshold algorithm fits within the framework of the EM algorithm, we will provide further detail on the algorithm's optimization loop.

At each iteration, the number of items in $\mathbf{a}$, and their corresponding $\mathbf{p}$, increases by our specified window size, $m$. This increases the amount of data available to calculate our specified performance metric, ${\tt metric}(.)$, and also increases the number of candidate values for $\hat{t}$. We increase the available observations in the E-step (by generating new annotations from our most confident predictions) and maximize the metric over the threshold in the M-step to estimate the optimal threshold. Here the E-step is arguably most important, since it generates the required validation data set, as the original problem is to generate a sufficient number of annotations for an unannotated data set to estimate a threshold. Furthermore, in the E-step, we increase the available observations using a suitably large subset size until the algorithm converges, which allows us to minimize the overall number of annotations needed to estimate an optimal threshold, which is what we wish to achieve with this algorithm.

Finally

To conclude we present some other interesting points to consider about this algorithm:

For this use case we apply the EM algorithm to a discrete probability distribution using categorical observations, i.e., annotations. Typically EM is applied to problems where observations are drawn from a continuous probability distribution, such as the Gaussian distribution.
For this use case we have our latent variables, $\mathbf{p}$, before we obtain our observations, $\mathbf{a}$. This is the reverse of the standard implementation of EM, and illustrates the flexibility of the EM algorithm's two-step learning iteration when applied to human-in-the-loop.
For this use case we have human-generated observations, where usually the EM algorithm is applied to sensor observations.

Appendix B: AutoThreshold Python Implementation

Below we present a simple code implementation of the AutoThreshold algorithm applied to a binary classification task using synthetic data.

#!/usr/bin/env python3.8

import numpy as np
from collections import namedtuple
from sklearn.metrics import f1_score

SyntheticData = namedtuple('SyntheticData', ['predictions', 'annotations'])

def generate_predictions_and_annotations(n):
    """Returns synthetic predictions and annotations for a step classifier response,
    ordered by prediction score.

    Note: The returned synthetic data has an optimal threshold at 0.5

    """
    predictions = np.linspace(0, 1, n)
    annotations = np.concatenate((np.zeros(n//2), np.ones(n//2)))
    return SyntheticData(predictions, annotations)

def predictions_generator(synthetic_data, thresh_ind):
    """Returns predictions for the current subset window as specified by `thresh_ind`.

    Note: In the normal operation of AutoThreshold this step would generate predictions
    for our out-of-distribution images from our image classifier. Here, our toy example
    is run on synthetic data and our precomputed predictions are simply returned.
    """
    return synthetic_data.predictions[thresh_ind-M//2:thresh_ind+M//2]

def annotations_generator(synthetic_data, thresh_ind):
    """Returns annotations for the current subset window as specified by `thresh_ind`.

    Note: In the normal operation of AutoThreshold this step would source annotations
    from a crowdsourcing platform. Here, our toy example is run on synthetic data and
    our precomputed annotations are simply returned.
    """
    return synthetic_data.annotations[thresh_ind-M//2:thresh_ind+M//2]

def calculate_optimal_threshold(annotations, predictions):
    """Returns the index of the optimal threshold using the F1 score.

    **Example:**

    >>> predictions = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
    >>> annotations = [0, 0, 0, 1, 1, 1]
    >>> thresh_ind = calculate_optimal_threshold(annotations, predictions)
    >>> threshold = predictions[thresh_ind]
    >>> threshold
    0.6
    """
    scores = []
    for threshold in predictions:
        labels = []
        for prediction in predictions:
            label = 1 if prediction >= threshold else 0
            labels.append(label)
        scores.append(f1_score(annotations, labels))
    return np.argmax(scores)

def auto_threshold(synthetic_data, annotation_generator):
    """Main loop of the AutoThreshold algorithm.
    """
    # Specify initial estimate; here we start from the highest confidence which is
    # the n-th ordered prediction
    thresh_ind = N
    thresh_est = synthetic_data.predictions[thresh_ind-1]

    for i in range(MAX_ITERS):

        # E-Step: Generate annotations for the subset of ordered predictions
        predictions_subset = predictions_generator(synthetic_data, thresh_ind)
        annotations_subset = annotation_generator(synthetic_data, thresh_ind)

        # M-Step: Estimate local threshold index for the newly annotated subset
        thresh_ind_subset = calculate_optimal_threshold(annotations_subset, predictions_subset)

        # Estimate new threshold
        thresh_ind_old = thresh_ind
        thresh_ind = (thresh_ind_old - M//2) + thresh_ind_subset
        thresh_est = synthetic_data.predictions[thresh_ind]

        print('Iter: {}, Est: {:.3f}'.format(i, thresh_est))

        # Check convergence
        if thresh_ind == thresh_ind_old:
            print('Converged')
            break

    return thresh_est

if __name__ == "__main__":

    print("\nAutoThreshold Toy Example.\n")

    # Specify arguments: Max algorithm iterations, number of synthetic predictions & subset size
    MAX_ITERS = 25; N = 10000; M = 500

    # Synthetically generate ordered classifier predictions and annotations
    synthetic_data = generate_predictions_and_annotations(N)

    # Run AutoThreshold to estimate optimal classifier threshold
    thresh_est = auto_threshold(synthetic_data, annotations_generator)
    print("\nEstimated threshold value: {:.3f}".format(thresh_est))

    print("\n\tFin.\n")

Code output will look like:

$ ./autothreshold.py
AutoThreshold Toy Example.

Iter: 0, Est: 0.975
Iter: 1, Est: 0.950
Iter: 2, Est: 0.925
Iter: 3, Est: 0.900
Iter: 4, Est: 0.875
Iter: 5, Est: 0.850
Iter: 6, Est: 0.825
Iter: 7, Est: 0.800
Iter: 8, Est: 0.775
Iter: 9, Est: 0.750
Iter: 10, Est: 0.725
Iter: 11, Est: 0.700
Iter: 12, Est: 0.675
Iter: 13, Est: 0.650
Iter: 14, Est: 0.625
Iter: 15, Est: 0.600
Iter: 16, Est: 0.575
Iter: 17, Est: 0.550
Iter: 18, Est: 0.525
Iter: 19, Est: 0.500
Iter: 20, Est: 0.500
Converged

Estimated threshold value: 0.500

        Fin.

Space efficient machine learning feature stores using probabilistic data structures - a benchmark

2021-10-05T00:00:00+02:00

The problem

When building Machine Learning (ML) applications - such as recommender systems - there is often a need to provide a "feature store" which can enrich the request to the system with additional ML features.

For example, whether a user has looked at an article before is often very informative about whether they will click or buy that article this time. So companies keep a record of which articles their users have clicked or bought recently, and use this data in their recommender systems. Other commonly used data include past browsing history, purchase history, user information such as demographics, explicit preferences the users have shared, etc.

These data are usually stored in key-value stores like Redis, using the user ID as the key, and the features as value.

When a request is made to the recommender system, a query is made to this key-value store using the user ID, and the retrieved features are fed to the recommendation algorithm together with the data contained in the original request. When there are many users, these feature stores can easily get very large.
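As a rough sketch of the lookup flow described above, the snippet below enriches a request with historical features before it reaches the model. A plain dictionary stands in for the external key-value store, and all names and values (`feature_store`, `enrich`, the feature keys) are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict

# Stand-in for an external key-value store such as Redis: user ID -> features.
# In a real deployment this lookup would be a network call.
feature_store: Dict[int, Dict[str, Any]] = {
    42: {"recently_viewed": [10, 20], "purchases_last_30d": 1},
}


def enrich(request: Dict[str, Any]) -> Dict[str, Any]:
    """Merge stored historical features into the incoming request."""
    historical = feature_store.get(request["user_id"], {})
    return {**request, **historical}


# The enriched dictionary is what would be handed to the recommendation model.
model_input = enrich({"user_id": 42, "country": "DE", "article_id": 10})
print(model_input)
```

Users without stored history simply keep their request features, which mirrors the examples in the benchmark that had no historical data.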

This creates significant challenges in terms of the development and operation of ML applications.

They add to the processing time: Adding a network call commonly adds 2-10ms to your response time. To make matters worse, it also adds a lot of variance to the response time due to the variation of message sizes across users.
Additional hosting and maintenance costs: Distributed databases with strict performance requirements can be expensive to host.
Additional operational complexity: Operations like backfills can become very expensive to set up and execute.
Development complexity: An external database adds a dependency to the application code, which adds some complexity to the development/testing process (like having to pre-populate this DB for tests). Intrusive performance optimizations like size limits, aggregations, and prioritization of users are often necessary, which adds development time and increases the coupling between model design and infrastructure.
Multiple lookups can be prohibitively expensive: For example, imagine you want to rank a thousand products and retrieve features for each product - this would be extremely difficult with an external database under a strict latency budget. Another hypothetical example is retrieving features for composite keys (interactions), e.g. "How many times were product X and Y bought together?". If the feature state is small enough to live in the same process's memory, multiple lookups are far cheaper and thus feasible.

The solution

What if, instead of having a big, unwieldy database, we could read a much smaller dataset into memory, and query that as a feature store from within the process? This is essentially what we can do with "sketching" data structures, a type of probabilistic data structure.

Sketching data structures can store large amounts of data in a compact (sublinear) space at the expense of accuracy. In other words, they store a "summary" of the original data. They are essentially a lossy compression algorithm for your features. Just like JPEG compression for your images, they can compress input data at varying "compression levels" - a low compression level means better quality but larger sizes, and a high compression level means lower quality but smaller sizes.

This allows us to trade accuracy for space. As we will see below, the trade-off is highly favorable - a very small sacrifice in accuracy can save a lot of space.

In this article we will only describe and benchmark bloom-filter-backed feature stores in detail, but theoretically, other sketching data structures like HyperLogLog, Count-Min Sketch, Quotient Filters etc. could be used, too.

Benchmark of a sketching-data-structure-based feature store backed by a Bloom-Filter

Below is a benchmark based on a real-life click prediction dataset. It shows that prediction models that use a bloom-filter-based feature store can achieve the same level of prediction accuracy & prediction throughput with a vastly smaller feature state that can easily be fit into memory.

Benchmark setting

We used a real-life click prediction dataset which has two types of features:

Request features: Features that are immediately available in the request, like country, article id, device type, context URL and so on
Historical features: Features that are based on accumulated historical data, like browsing history, purchase history, preferences that were saved in the past etc.

The historical features were aggregated using count, max etc. (e.g. how many times did a user browse an item, what was the last time they looked at it etc.) and were then discretized to yield categorical features. They were then stored into feature stores.
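The aggregation and discretization step can be sketched roughly as follows. This is only an illustration under assumed inputs: the event list, the bucket boundaries and the `views=` label format are all made up for the example and are not taken from the benchmark.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical raw history: one (user_id, article_id) tuple per view event.
events = [(1, 10), (1, 10), (1, 10), (1, 20), (2, 10), (2, 10)]

# Aggregate: number of views per (user, article) pair.
view_counts = Counter(events)

# Discretize counts into buckets. The boundaries are illustrative assumptions.
BUCKET_EDGES = [1, 2, 5]
BUCKET_LABELS = ["0", "1", "2-4", "5+"]


def bucket(count: int) -> str:
    """Map a raw count to a categorical bucket label."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, count)]


# Categorical feature per (user, article) pair, ready to be put into a store.
features = {key: f"views={bucket(count)}" for key, count in view_counts.items()}
print(features)
```

Discretizing matters for the approach discussed later: a categorical value can be encoded as part of a key and tested for presence, whereas a raw numeric value cannot.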

The training data had about 5.7 million examples. Out of these, 2.8 million had historical data (the rest had only request features). Combined, the data had 1.762 billion data points after feature extraction.

Finally, a logistic regression classifier was used to predict clicks. Our variants were as follows:

No history: A model without a feature store (so that it could only use request features)
Uncompressed history: A model that simulated use of a conventional feature store (the features were pre-fetched)
Compressed history: A model that used a bloom filter based compressed feature store

Implementation of the bloom-filter-based-compressed-feature-store

Below is a simplified implementation in Python that illustrates how the feature store was implemented. It returns what articles a user had looked at before, given their user_id. This is not the actual implementation that was used in the benchmark. The benchmark used a JVM-based implementation, and was more general in nature (it stored arbitrary categorical features).

from typing import Set

from bloom_filter import BloomFilter


class FeatureStore:
    """Bloom-filter-backed store of which articles each user has viewed."""

    def __init__(self, store: BloomFilter):
        self.store = store
        # Universe of all article IDs seen so far; needed to reconstruct a user's set
        self.possible_articles: Set[int] = set()

    def add(self, user_id: int, article_ids: Set[int]) -> None:
        for article_id in article_ids:
            self.possible_articles.add(article_id)
            composite_key = f'{user_id}^{article_id}'
            self.store.add(composite_key)

    def retrieve_articles(self, user_id: int) -> Set[int]:
        # Probe every known article ID for this user (brute force)
        ret = set()
        for article_id in self.possible_articles:
            composite_key = f'{user_id}^{article_id}'
            # The bloom_filter package exposes membership tests via `in`
            if composite_key in self.store:
                ret.add(article_id)
        return ret

The most important element to point out is the additional state self.possible_articles. This would hold the set of all possible features (in this case, all article IDs), and the code is brute forcing all of them in order to reconstruct the set of articles viewed by the user. This may appear to be a very expensive thing to do, but in practice it is very cheap in relation to the total processing. In my simple benchmark, the difference was undetectable. It is also worth noting that this process could be optimized, for example through the use of binary search, and/or by only querying for important features.
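To see the brute-force reconstruction run end to end without any external dependency, here is a minimal sketch. `TinyBloomFilter` is a hypothetical, deliberately naive stand-in written for this example (it is neither the library used above nor the JVM implementation from the benchmark), and the sizes and user data are made up.

```python
import hashlib
from typing import Dict, Iterator, Set


class TinyBloomFilter:
    """Minimal Bloom filter for illustration only (hypothetical helper)."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive n_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def retrieve_articles(store: TinyBloomFilter, user_id: int,
                      possible_articles: Set[int]) -> Set[int]:
    """Probe every known article ID for the given user."""
    return {a for a in possible_articles if f"{user_id}^{a}" in store}


store = TinyBloomFilter(n_bits=4096, n_hashes=3)
viewed: Dict[int, Set[int]] = {1: {10, 20, 30}, 2: {20, 40}}
possible_articles: Set[int] = set()
for user_id, article_ids in viewed.items():
    for article_id in article_ids:
        possible_articles.add(article_id)
        store.add(f"{user_id}^{article_id}")

for user_id, truth in viewed.items():
    reconstructed = retrieve_articles(store, user_id, possible_articles)
    # Every article the user actually viewed is always returned.
    assert truth <= reconstructed
    print(user_id, sorted(reconstructed))
```

The reconstructed set is always a superset of the true set: extra articles can appear, but viewed ones are never lost.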

The compressed history variant had a parameter that determined the level of compression - i.e. higher compression level meant lower quality and size, lower compression level meant higher quality and size. What do we mean by "quality" here? In a nutshell, the bloom filter tells us if a binary categorical feature is present (1) or not (0). When the bloom filter says a feature is NOT present, it is always correct - i.e. there are no false negatives. However, when the bloom filter says a feature is present, it can be an error. In other words, at some probability, we will mistakenly set the feature value to 1, when in fact it should have been 0 (i.e. false positive). This adds noise to our model's input. This probability can be tuned via a parameter, and the higher the false positive rate, the smaller the state size.
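The relationship between false positive rate and state size can be made concrete with the textbook sizing formula for a standard Bloom filter: for n items and a target false positive rate p, the optimal number of bits is m = -n ln(p) / (ln 2)^2, with k = (m / n) ln 2 hash functions. The sketch below just evaluates that formula; the item count is an assumed figure for illustration, not one from the benchmark.

```python
import math


def bloom_filter_bits(n_items: int, false_positive_rate: float) -> int:
    """Bits needed by an optimally configured standard Bloom filter."""
    return math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, n_bits: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(n_bits / n_items * math.log(2)))


N_ITEMS = 100_000_000  # assumed number of stored keys, for illustration only

for p in (0.001, 0.01, 0.1):
    bits = bloom_filter_bits(N_ITEMS, p)
    megabytes = bits / 8 / 1e6
    print(f"p={p}: {megabytes:.0f} MB, {optimal_hash_count(N_ITEMS, bits)} hashes")
```

Because the size depends on ln(p), each tenfold increase in the tolerated false positive rate removes a fixed number of bits per stored key, which is why large savings cost relatively little accuracy.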

For more details on how this compression level parameter works, and generally how bloom filters work and their characteristics, see e.g. here, here and here.

As an evaluation metric, we used click ROC-AUC (Area Under the Curve of the Receiver Operating Characteristic curve), a common metric for recommender systems.

Result

The scatter plot below shows the AUC (y axis) of the classifier at varying compression levels (x axis = size of the feature store in bytes in logarithmic scale). The dotted green line is the AUC with a key-value-store-based feature store equivalent (i.e. Uncompressed). The dotted red line is the AUC without any history features (i.e. No history).

As expected, our bloom-filter-backed feature store achieves performance between the two lines (uncompressed ~= 0.80 and no history ~= 0.70).

The estimated size of the key-value-store-based feature store was about 15GB. Hence, the results show that our compressed feature store achieves the same level of classification performance (AUC~=0.7997) using just 3% of memory (470MB vs 15GB). The state size can be further reduced at the expense of classification performance. For example, 90% of the uplift provided by the feature store can be retained by using merely ca. 40MB of state (AUC~=0.79). This would be just 0.3% of the size of an uncompressed feature store. Note that this "saving" grows as the data volume increases due to the sublinear space complexity.
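The percentages quoted above follow from simple arithmetic, reproduced here as a quick check. The AUC baselines are the approximate values read from the plot description (0.70 and 0.80), and 1GB is treated as 1000MB.

```python
AUC_NO_HISTORY = 0.70     # approximate, from the plot description
AUC_UNCOMPRESSED = 0.80   # approximate, from the plot description
UNCOMPRESSED_MB = 15_000  # about 15GB


def size_fraction(compressed_mb: float) -> float:
    """Compressed state size as a fraction of the uncompressed store."""
    return compressed_mb / UNCOMPRESSED_MB


def uplift_retained(auc: float) -> float:
    """Share of the AUC gain over the no-history model that is kept."""
    return (auc - AUC_NO_HISTORY) / (AUC_UNCOMPRESSED - AUC_NO_HISTORY)


print(f"470MB state: {size_fraction(470):.1%} of the uncompressed size")
print(f"40MB state: {size_fraction(40):.2%} of the uncompressed size")
print(f"AUC 0.79 retains {uplift_retained(0.79):.0%} of the uplift")
```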

When it comes to throughput (computational efficiency), all of the variants achieved similar throughput (20-22k predictions per second per core on my 2018 Mac). I.e. the additional overhead was undetectable with my performance tests.

The Limitation

So the benchmark results look very good - why would anyone use a conventional key-value-store-based feature store at all? Alas, the new feature stores come with severe limitations and are thus not a drop-in replacement for conventional feature stores.

You have to know what to ask

As described above, we need to keep the set of possible features in order to get the desired output. In a lot of use cases this is not an issue, but in some situations it may be prohibitively expensive (e.g. imagine reconstructing a bag-of-words encoding of past user reviews).

They are difficult to update (and thus keep them "fresh")

The second, and probably by far the more important, weakness is the difficulty associated with updating them.

Feature "freshness", as in how quickly recent events can be reflected in the feature store, is very important, as recent events tend to have high informational value. Many distributed key-value stores have good write performance, and thus it's very feasible to keep them very "fresh" even when high load is involved. The situation is very different with sketching-data-structure-based feature stores.

First, let's consider the appending of new information to our new feature store.

Most sketching data structures (including bloom filters) allow incremental appends (so far, so good). However, since the complete state is loaded into each node's RAM, every write must be applied on every node, so each node (process) must be able to handle 100% of event traffic. This is usually impossible - common event streams like views and clicks are usually very high volume, and processing that amount of writes on a single node is not a practical option. One could consider batching, but in many key-value-based feature stores, the target update latency is shorter than a few seconds, which makes this option extremely difficult.

Theoretically bloom filters could be distributed so that each node only needs to process a shard of the traffic - but at this point one would have converted one's real-time transaction server into a distributed database.

Second, let's consider deletion (expiry) of information.

The situation is even worse because, due to their nature, sketching data structures don't allow deletes of individual records. Thus, to delete a record from our new feature state, one has to completely regenerate it by re-processing the entire source dataset (sans the information we want to delete). This is extremely expensive and thus can only be done on a low-frequency batch basis. There are some sketching-data-structure variants that allow some degree of expiry (see e.g. Age-partitioned Bloom Filters), but there are no mature implementations available.
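The rebuild-to-delete pattern can be sketched as below. This is a toy under stated assumptions: the filter is a naive bitmask written for this example (`build_filter`, `might_contain` and `positions` are hypothetical helpers), and the source dataset is three made-up composite keys. Bits cannot simply be cleared for the deleted key, because other keys may share them.

```python
import hashlib
from typing import Iterable, List

N_BITS, N_HASHES = 1024, 3  # illustrative sizes


def positions(key: str) -> List[int]:
    """Bit positions for a key, derived from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % N_BITS
        for i in range(N_HASHES)
    ]


def build_filter(keys: Iterable[str]) -> int:
    """Build a Bloom filter (as an integer bitmask) from scratch."""
    bits = 0
    for key in keys:
        for pos in positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, key: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(key))


source = ["1^10", "1^20", "2^20"]  # hypothetical source dataset
bloom = build_filter(source)

# "Deleting" 1^20 means re-processing the whole source without it.
expired = {"1^20"}
bloom = build_filter(key for key in source if key not in expired)

# The remaining keys are still guaranteed to be found after the rebuild.
print(might_contain(bloom, "1^10"), might_contain(bloom, "2^20"))
```

With three keys the rebuild is instant; with billions of data points it becomes the low-frequency batch job described above.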

They cannot support complex queries and updates

Finally, sketching-data-structure-based feature stores don't support complex queries or updates like "remove all events that happened on day X". With key-value-store-based feature stores, the additional cost of storing some metadata (like event timestamps) is relatively minor. But this can be a major undertaking for sketching-data-structure-based feature stores.

Conclusion

Sketching-data-structure-based feature stores cannot replace conventional feature stores in all use cases, but they can be an attractive option when using an external feature store is prohibitively expensive. For example, if:

One can't afford the additional network call to an external feature store
Many feature lookups need to be performed per request

Tracing SRE’s journey in Zalando - Part II

2021-09-21T00:00:00+02:00

Welcome to the second part of our journey establishing SRE in Zalando. You’ll find the first part here. Don’t miss out on the third and final post in one week.

2018 - The Return of SRE

In our previous blog post we left off with the plans for Site Reliability Engineering (SRE) in Zalando having to change. So, what were those changes and what were the challenges we faced in this new iteration? In this blog post we’ll go straight to the first quarter of 2018, when two sister SRE teams were bootstrapped around the same time in different departments. One of them was the SRE Enablement team in Digital Foundation (DF - a central functions department). The other was the Digital Experience SRE team (DX - the department responsible for the customer facing part of our Fashion Store). The latter was created from a grassroots initiative, whereas the DF team was reimagined by the management of that department.

Since the decision made back in 2017 to grow the number of teams on call, the issue with overwhelmed on call teams was gone. As expected, the side effect of that decision was that teams were now much more aware of the operational burden of their services and would take steps to reduce that burden. Post-Mortems started becoming a regular practice in 2017, which also helped (although the practice was not yet well established). But while teams were slowly becoming more ‘operationally capable’, the complexity of our platform was growing at a much faster pace, with no one to keep a holistic view on the service landscape. You’ll notice from the name of the DF team that there is already something implied: SRE Enablement. This is where the new team differentiates itself from the 2016 initiative. The challenge that gave purpose to the Enablement team was raising the bar on our operational practices. This was around: monitoring, incident response, chaos engineering, resilience engineering.

Service Landscape

Both SRE teams had very limited resources (only 2 engineers each), and they obviously shared the same goals. To better align the efforts of both teams, an SRE Program was kicked off that united them around common goals. As before, the practices and mindset described in Google’s original SRE book were used as the main inspiration for our own SRE teams. The teams were composed of experienced engineers, with a strong background in software development, knowledge of systems engineering, and incident response (very much aligned with the profile that was outlined back in 2016). These engineers also enjoyed a fair amount of social capital across the organization, which greatly facilitated the collaboration with other teams.

Compared to the previous iteration, the SRE Program was not aiming at significant organizational changes. This gave some degree of freedom regarding the projects the Program would tackle. At the beginning of the Program, the 2 teams got together and made a list of all the topics that were SRE relevant and that we wanted to work on. When we were done, the size of the list was considerable (there are so many interesting, relevant and challenging topics in SRE). With our limited capacity, however (6 team members between the two teams - 1 Lead, 1 Program Manager, 4 Engineers), we had to be careful when picking our initiatives. Although this meant that we had to drop many of the topics we wanted to work on, that careful selection contributed significantly to the success of the Program, and the reputation we built for the SRE name within the company.

The SRE Program took on the rollout of Distributed Tracing across the engineering organization, helped improve the Page Load Time for some of Zalando’s pages, staffed the newly created Incident Commander role, and helped with Cyber Week preparations, namely Load Tests. SREs, in the role of Incident Commanders, provided on-site support during Black Friday in a dedicated Situation Room. SREs also worked with other teams on efficiency topics that led to significant cost savings with cloud infrastructure while preserving reliability targets.

Distributed Tracing Workshop

SLOs, as introduced back in 2016, were still in place, with hundreds of new services specifying SLOs. Despite the growing number of SLOs, they were still not used to help the teams strike a balance between feature development and operational improvements. One of the things that made it more challenging was the fact that Zalando runs many thousands of services in production. We figured that not all of them had the same relevance. To try to put some structure into the SLOs we had, Service Tier definitions were published. To help with the Service Tiers, a new SLO reporting tool was developed. The new tool defined canonical SLIs and used the tier classification. However, this work was limited in scope: it targeted a single department, Digital Experience, home to one of the SRE teams. Services in other departments were not included in this effort and there was no mandate for them to adopt the new Service Tier definitions. Attempting to roll this out for the entire company (>4000 services) would not have been feasible.

On the cultural level, the SRE Program took ownership of the SRE Guild. Guilds in Zalando are self-organized groups of colleagues, sharing a common interest, that meet regularly to exchange knowledge. The SRE Guild was actually a remnant from the 2016 initiative, but was left dormant. We saw the SRE Guild as an agent of cultural change to help us spread the SRE mindset. We then devoted efforts to develop a format that would be engaging and sustainable. Guild sessions provided a regular event with talks around all things SRE, whether it’s presenting the work of the SRE Program, or giving the floor for other teams or engineers to share knowledge. Postmortems became a regular topic in these sessions. This format is still in place today.

Black Friday 2018 Situation Room

Despite the success of the SRE Program, the fact that the individual teams were part of different organizations with different reporting chains led to some challenges related to the priorities of those different departments. Those different priorities and guidelines posed another problem when they would be at odds with each other. Teams in Zalando would seek out guidance from SRE, not knowing which team to reach out to, or even that there were 2 separate teams. To understand how two SRE teams that were working together could offer inconsistent guidance, it’s important to remember that they belonged to different departments. The SRE DX team could focus on the problem space of the DX department and offer customized solutions for those teams. The SRE DF team had the entire company in scope, so whatever that team did, it had to be applicable on a different scale. The SRE Program was planned for the year of 2018, culminating with the end of Cyber Week. Following that plan, after Cyber Week was over the program ended and each team went back to work on projects relevant to their respective departments.

2019 - Combining forces as a single SRE team

In early 2019 both SRE teams were officially united into a single team in the DF department (the department of one of the original teams). With this merger, SRE now had a single voice in the company.

The experience with Distributed Tracing in the previous year was quite positive - Do you get the pun in the blog post’s name, now? 🤓. For one, it became a fundamental tool for incident response because it allowed for quicker insights, saving time from incidents. The coverage across Zalando’s services kept growing. The standardized data model and the development of Zalando specific Semantic Conventions, and an API to consume the tracing data allowed the SRE team to build additional value from it.

One of the tools we developed based on Distributed Tracing is an Alert Handler called Adaptive Paging (which we talked about in SRECon’19). This alert handler monitors the error rate of what we call Critical Business Operations¹ (CBO) and when it is triggered it uses the tracing data to determine where the error comes from across the entire distributed system, and pages the team that is closest to the problem. This alert handler was also a game changer in our push for a different alerting strategy: Symptom Based Alerting. You can learn more about it in the slides of one of the talks we did on this topic.

Adaptive Paging will traverse the Trace and identify the team to be paged
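To give a feel for the idea, here is a heavily simplified sketch of walking a trace to pick the team to page. It is not the actual Adaptive Paging implementation: the `Span` structure, the "follow the erroring spans to the deepest one" heuristic, and all operation and team names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work in a trace, owned by a team (hypothetical model)."""
    operation: str
    team: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def team_to_page(span: Span) -> Optional[str]:
    """Follow erroring spans downwards and return the team owning the deepest one."""
    if not span.error:
        return None
    for child in span.children:
        team = team_to_page(child)
        if team is not None:
            return team
    # No erroring child: this span is the closest we can get to the problem.
    return span.team


trace = Span("place_order", "checkout", error=True, children=[
    Span("get_cart", "cart"),
    Span("reserve_stock", "logistics", error=True, children=[
        Span("query_inventory", "inventory", error=True),
    ]),
])

print(team_to_page(trace))  # inventory
```

In this example the alert fires on the top-level operation, but the page goes to the team owning the deepest failing span rather than to the team that owns the entry point.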

A throughput calculator based on Tracing data was also developed, which helped the Load Test efforts for Cyber Week preparations. By applying the expected throughput for a CBO, we could estimate the impact on all the components that are part of the same journey, usually through cascading remote procedure calls.

Throughput Calculator
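The principle behind such a calculator can be sketched as follows. This is an illustration only, not the tool itself: it assumes that tracing data has already been reduced to an average number of downstream calls per incoming request, that the call graph is acyclic, and all component names and ratios are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical call ratios: component -> [(callee, calls per incoming request)].
CALL_RATIOS: Dict[str, List[Tuple[str, float]]] = {
    "place_order": [("cart", 1.0), ("payment", 1.0)],
    "cart": [("pricing", 2.0)],
    "payment": [("fraud_check", 0.5)],
}


def estimate_throughput(entry: str, requests_per_second: float) -> Dict[str, float]:
    """Propagate the expected entry throughput through the (acyclic) call graph."""
    load: Dict[str, float] = defaultdict(float)
    stack = [(entry, requests_per_second)]
    while stack:
        component, rate = stack.pop()
        load[component] += rate
        for callee, ratio in CALL_RATIOS.get(component, []):
            stack.append((callee, rate * ratio))
    return dict(load)


# Expected load on each component if 'place_order' receives 1000 requests/s.
print(estimate_throughput("place_order", 1000.0))
```

With these made-up ratios, 1000 requests per second at the entry point translate into 2000 per second on the pricing component and 500 per second on the fraud check, which is the kind of estimate a load test target can be derived from.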

Finally, through our use of Distributed Tracing, and Adaptive Paging, we made a significant change in our SLO strategy. We moved away from service based SLOs, and started rolling out Operation based SLOs.

Through internal and external hiring we grew the team up to 7 SREs. But that team size notwithstanding, hiring was always a challenge. Then, and today. The combination of the required skill set for an SRE at Zalando and the different definitions of the SRE role across the industry means many candidates do not meet the bar, or simply have a different skill set. Nevertheless, it was agreed that we would not compromise on our hiring. While growing engineers and teaching the SRE mindset was seen as something positive (and definitely a way to scale the team further), with our reduced size we could not provide effective mentorship. Any engineers we hired who needed that mentorship would not be set up for success.

We took the previous year’s Distributed Tracing Workshop to SRECon’19

Both 2018 and 2019 were successful years for SRE, but there are quite a few differences between the two. In 2018 we worked exclusively on topics that SRE did not own. We were a mix of a consulting team and a kitchen sink team. We either volunteered for some of the projects we worked on, or were asked to help due to capacity reasons or because the projects required a specific skill set. Our main challenge was how to decide what to work on. There was no mathematical formula to determine this. It was always a matter of balancing the following dimensions:

Likelihood of success (Would we be in way over our head? Could we actually influence the outcome?)
Company’s priorities
Enablement (If we’re working with a team, will that team learn something from the engagement, or were we expected to do everything ourselves?)

In 2019 we still operated partially in the same kitchen sink/consulting mode, but the big difference is that in 2019 we started working on our own products, which also means we started taking some control of our roadmap.

Overall, 2019 was the year we started reaping the benefits of the achievements from the previous years. We had given a clear signal that a single (small) team of engineers dedicated to Reliability could bring significant benefits to an organization the size of Zalando. But, to an extent, we were also a victim of our success. Despite having our own backlog and a list of topics we wanted to work on, the team became increasingly more in demand from different parts of the organization. Our help was requested to improve Operational Excellence in departments, to assist in the roll out of major launches, to review Technical Design Documents, to help in PostMortem investigations, Cyber Week preparations, Production Readiness Reviews… As before, we had to pick our battles carefully. Accepting every challenge with our reduced capacity meant that we would likely do a poor job in all of them. And anything in our backlog that we had promised and wouldn’t deliver would also affect our reputation.

Things are starting to get interesting. After a few successful projects, SRE’s reputation in the company grew. We merged the two SRE teams into a single team, making sure that SRE could continue to grow unaffected by fragmentation. The SRE Guild kept on going, further spreading the SRE mindset. We grew the team, and even started to focus on our own backlog. But SRE is still a single, small, team in a very large organization. How far can we stretch this model? Well, that's what we're going to talk about in our last blog post on this series in one week's time.

EDIT 1: Don't stop now. The third and last part of our series is already available here.

¹ Grossly summarizing it, Zalando is an e-commerce platform, so a Critical Business Operation is anything that affects our Business, like ‘Add To Cart’, ‘Place Order’ or ‘View Catalog’.

Tracing SRE’s journey in Zalando - Part I

2021-09-13T00:00:00+02:00

2016 - First attempt at rolling out SRE

Welcome to the first installment of our three part series following Zalando’s SRE journey. Be sure to come back for the other two, with the next one being published in a week.

Site Reliability Engineering (SRE) is a recent discipline in the Software Engineering field that is growing in popularity, with many companies turning to this new way of working to solve their operational issues, or to support their growing scale. But being a recent discipline, it’s not yet well established how organizations should adopt SRE, or even what the role of a Site Reliability Engineer is (although the role enjoys an increasing demand). At Zalando we also took a stab at implementing SRE within our organization. We looked at it as a way to help us scale our engineering efforts, improving efficiency and making life for our developers easier. Today, Zalando includes in its organization a Site Reliability Engineering department, but the journey to reach this point was filled with challenges and learnings that we are now sharing with everyone.

In this series of blog posts we will take our readers through the road so far. We’ll describe what worked well for us, and what didn’t. Where we failed, and where we succeeded. We’ll also look into how we defined the role of an SRE within the company, and how SRE is growing in Zalando.

Before we get to the ‘How’, let’s start with the ‘Why’. Why would we want to have SRE in Zalando? Well, for that we need to understand the point that we were at as a company before this journey began. That takes us back to 2016 when we were well into our move to the cloud, migrating our monoliths to a micro services architecture (you can find more details about this and what came after in the blog post from our colleague Henning Jacobs).

A view of Zalando Tech pre-cloud

The move to the cloud came with disruptive changes to the way we were working. Teams were now responsible end-to-end for the software they built. That meant designing, developing, testing, deploying and operating the applications the teams owned. I’ll skip the gruesome details, but to put it simply, before this time, developers developed, and operators operated¹. This meant that the vast majority of our engineers were not experienced in a good chunk of their newfound responsibilities. This lack of experience coupled with the hypergrowth that we were going through resulted in a lot of different and complex issues. These issues were mostly around the operational aspect of software development (monitoring, automated testing, deploying, incident handling, managing the cloud runtime).

One of the more obvious pain points was the on-call support. Before we started the microservice migration, our service landscape was small enough that 5 on-call teams could cover the whole stack. Each team had a large enough rotation, and the domain was well understood by each team member. The monoliths were also quite similar in terms of monitoring and operations, making it easier to tackle issues even in services that a given engineer would not be so familiar with. That gradually changed as new teams were created, and more and more services were deployed in the cloud. And there was little standardization across those services. The on-call teams did not grow to meet the new demands, and were increasingly overwhelmed by the new services that they were responsible for.

Our deploy tool for our data center services

But 2016 was also the year that Google published their book Site Reliability Engineering. The practices and mindset described in that book seemed to provide some answers to the growth pains we were experiencing. For that reason, it became the main inspiration for implementing the SRE mindset, role and practices in Zalando. How it all started, though, was through a grassroots initiative to promote and pitch for an investment in SRE. After convincing enough managers, mostly through explaining the pain points being felt by the engineering teams, and how SRE could be a solution for those pains, a group of engineers teamed up under a project scope to drive this implementation. One of their main goals was to solve the on-call situation, and make it sustainable. A quick side note: If it feels like the ‘convincing’ of management is grossly summarized, or feels like it was just too easy, it’s important to bring up that Zalando is a company that does not shy away from change. It’s a core part of the company’s DNA and culture. And the culture of an organization always plays a key role in enabling (or resisting) such changes.

SRE Brainstorming session

Now that there was an initial buy-in from management, there were oh so many things to discuss at the time. But the one that had the most influence on the following steps was “How do we structure SRE?”. Again, remember that this had to be done in a way that would solve the on-call problem. Should we go for a central team? We were already too big for that (our headcount had grown to 1,000+), so odds were that we wouldn’t be effective, although it would make staffing easier because we’d need fewer SREs. Should we distribute one SRE per team? The scope would be too large for the lone SREs. Not to mention that, over time, they’d likely become the Ops engineer for the team they were in. It was agreed that we would need several SRE teams. But that still raised the question: What is the granularity at which we would create SRE teams? In the end we went with one SRE team per Product Cluster. This would give SREs end-to-end responsibility over a domain, without having too wide of a scope.

There was another concern around the reporting chain. This was an easy discussion, as we quickly converged on following the guidance in the SRE book: consider reliability work a specialized role and keep SREs separate from the product delivery teams.

To further gauge the interest in the SRE role and mindset, we sent out a survey to our engineering Org. In that survey we included a description of the desired profile for an SRE. That profile included: Software engineering, Operational mindset, Systems engineering, Software architecture skills, Troubleshooting skills.

Survey to gauge SRE interest

The survey results also gave us an idea on the talent pool that might be interested in a move to an SRE role. To further promote the role and the initiative within the company, several talks were done across the company and its different hubs, which, at the time, already included Helsinki, Dublin, and Dortmund.

With few engineers able to fit that profile we had to be smart about where to start rolling out SRE. Ideally, we start with the area with the most need for SRE practices. But to know which area that would be, we first had to measure the health of the different products at Zalando, to then be able to prioritize. Fortunately, at the core of SRE we have Service Level Objectives (SLOs) and Service Level Indicators (SLIs). With the lack of a standardized way of measuring availability, the first thing the team working on the SRE initiative decided to do was to roll out SLOs and SLIs. Workshops were conducted across the company for Engineers and Product Managers, and the first SLO reporting tool (SLR) was developed.

Zalando’s SRE Logo

To further demonstrate the educational benefit of SRE, the SRE program team ran Reliability Workshops as part of Cyber Week preparations to discuss and review Reliability Patterns for the more critical services. In those Reliability Workshops we covered Retry Strategies, Circuit Breakers and Fallbacks.

Many services did have SLOs defined and collected, but they still did not end up influencing the software development process. The vast majority of SLOs were defined through initiatives from Engineers. But in a microservice architecture, a product is implemented by multiple services. Product Managers had a hard time establishing a link between the different SLOs and their own expectations for the products they were responsible for. Management was kept in the loop, but not directly involved, so there was no real motivation for management to uphold the SLOs.

Senior Management agreed that SRE concepts like SLOs and reliability patterns were much needed practices, and that teams should continue applying them. However, there was a clear preference to keep building the missing operational capabilities in the Delivery Teams. The way chosen to kickstart that capability building was to put each delivery team on-call for the critical services they owned. This decision was fundamental to properly establishing the “you build it, you run it” mentality we still have today.

With teams now responsible 24/7 for their own services, the plans for Zalando SRE would necessarily have to change. Join us for the next chapter of our series to learn more about the next steps of this journey.

EDIT 1: No reason to stop the reading here. The second part of our series is already available here.

We did have some engineers with end to end responsibility. They would deploy, monitor and even be on-call for the services of their respective area. This was not standardized in the company, and it would depend greatly on the leadership of their respective teams. ↩

Micro Frontends: Deep Dive into Rendering Engine (Part 2)

2021-09-09T00:00:00+02:00

Zalando's Fashion Store has been running on top of microservices for quite some time already. This architecture has proven to be very flexible, and project Mosaic has extended it – although partially – to the frontend, allowing HTML fragments from multiple services to be stitched together, and served as a single page.

Fragments in Mosaic can be seen as the first step towards a Micro Frontends architecture. With the ambitions of the Interface Framework as presented in the first blog post, we did not want to just stop at serving multiple HTML pieces, we wanted more:

Implemented once, works anywhere - UI blocks should work in different contexts and be context-aware, not context-bound.
Declarative data dependencies - Components get the data they need but do not re-implement data fetching over and over.
Simplified A/B Testing - Zalando's decisions are data driven, so experimentation is at the core of our decision making. Running an A/B test that spans multiple pages and user flows should be possible with minimal alignment and zero delivery interruption.
Feels like Zalando - We want a consistent and accessible look and feel for all user journeys and ability to experiment with design fast, across multiple user flows.
Power to the engineers - Any developer should be able to contribute to all the Fashion Store experience. This means universal tooling and setup, first-class React integration, easy testing (also for work-in-progress code), and continuous integration.

That's how Renderers came to be.

Introducing Renderers

A Renderer is a self-contained JavaScript module that runs inside the Rendering Engine framework. It fully relies on the framework to encapsulate all the implementation details like data fetching and layout composition.

A Renderer declares its data dependencies using GraphQL queries and, based on that data, provides a visual representation of a single Entity type (check Part 1 for a detailed explanation on Entities).

This visual representation is a React component, but data management and layout composition are handled solely by the Rendering Engine framework.

So, Renderers are visualisation components for Entities.

The mapping of Entities to Renderers is one-to-many, since different visual representations may exist for a given entity type. A Product Entity, for example, can be represented as a detailed product page, or as a compact card component in collection view. Each Renderer, on the other hand, corresponds to one specific entity type only.

All Renderers share some important properties:

Renderers are composable. A Renderer is able to embed other Renderers as children, or be embedded by other Renderers.
Renderers are declarative. They specify their dependencies and behaviour but delegate all implementation to the Rendering Engine, the framework that runs them.
Renderers are self-sufficient. A Renderer can visualise its Entity no matter on which page or in which context it appears. This ensures that the choice and arrangement of Renderers remains as flexible as possible.

Enabling dynamic content for Zalando’s mobile apps

Project Mosaic was solely focused on the web. However, Zalando offers its Fashion Store as two experiences: the Web and the Native Apps. Since they share most parts of the user journey, it was natural to explore if the Apps could benefit from a system based on Entities and Renderers, too.

We knew it would be too much of a stretch for Mosaic fragments. But there's literally nothing that binds Renderers specifically to the Web!

In the Zalando app, we had already implemented server-side layout steering for some parts of the application experience such as the main App landing page. Instead of relying on hardcoded views, the app would receive layouts from a remote Zalando server over the network. The preferred format here would be JSON, but otherwise the same challenges were present: we wanted dynamic, personalizable UIs with declarative data dependencies.

If Renderers were able to output JSON instead of HTML, we could reuse the same rendering core as for the web with the same benefits. Our Renderers relied on React for their output. To cover the app-specific use case, we added a custom React reconciler that consumed custom React elements, and output app-compatible JSON instead of HTML. Now, web developers are able to contribute Native App features using the same set of APIs they already use to deliver web experiences, which brings the web and native app experiences closer together. All the existing tools, infrastructure support, and the constantly evolving platform APIs are now shared.
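To make the idea of "same tree, different output" more concrete, here is a minimal sketch that serialises a component tree into JSON. It deliberately does not use React or a real reconciler; the `UiElement` and `AppJsonNode` shapes, the `toAppJson` function and the element names are hypothetical and only illustrate the principle.

```typescript
// Hypothetical element shape; the real Rendering Engine consumes custom React elements.
interface UiElement {
  type: string;
  props: Record<string, string | number | boolean>;
  children: UiElement[];
}

// Hypothetical JSON node that a native app client could interpret.
interface AppJsonNode {
  component: string;
  attributes: Record<string, string | number | boolean>;
  children: AppJsonNode[];
}

// Walks the element tree depth-first and emits plain, JSON-serialisable nodes.
function toAppJson(element: UiElement): AppJsonNode {
  return {
    component: element.type,
    attributes: { ...element.props },
    children: element.children.map((child) => toAppJson(child)),
  };
}

const carousel: UiElement = {
  type: "carousel",
  props: { title: "Similar outfits" },
  children: [
    { type: "card", props: { id: "outfit-1" }, children: [] },
    { type: "card", props: { id: "outfit-2" }, children: [] },
  ],
};

// The string that would be sent to the app instead of HTML.
const payload = JSON.stringify(toAppJson(carousel));
```

In this sketch the output target is the only thing that changes; the tree that describes the UI stays the same, which is the property the custom reconciler relies on.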

The life of a Renderer

So, how does it look under the hood?

We decided to organise the Renderers API as a set of so-called lifecycle methods, each accepting a function that declares the Renderer's behaviour for a given context or case. All Renderers are implemented using TypeScript.

Let’s have a look at a simplified version of a collection carousel Renderer:

import { MOVE } from "@tracking/event-names";
import { SimpleCarousel } from "@dx/react-carousel-tile";
import { tile, ViewTracker } from "@if/rendering-engine/api";
import * as React from "react";
import * as query from "./query.graphql";

export default tile()
  .withQueries(({ entity: { id } }) => ({
    carousel: { query, variables: { id } },
  }))
  .withProcessDependencies(({ data }) => {
    if (data === null) {
      return { action: "error", message: "No collection data found." };
    }
    return {
      action: "render",
      data,
      tiles: { entities: getCollectionEntities(data) },
    };
  })
  .withRender((props) => {
    const {
      data: { collection },
      tiles: { entities },
      tools,
    } = props;
    return (
      <ViewTracker>
        <SimpleCarousel
          {...collection}
          onNextClickCarousel={() => {
            tools.tracking.track({ name: MOVE });
          }}
        >
          {entities}
        </SimpleCarousel>
      </ViewTracker>
    );
  });

Renderers are implemented using the fluent interface approach. By calling the tile() function of the Rendering Engine API, we are setting up a Renderer that defines various lifecycle methods. Each method receives a function that encapsulates the associated behaviour and has fully typed interfaces. Since renderers are declarative, they do not execute any of the lifecycle methods themselves. Instead, the Rendering Engine framework runs all of them, in due order and context, fetches data and dependencies, and passes the output down to other methods when necessary.
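The fluent, declarative pattern can be illustrated with a small sketch. This is not the Rendering Engine API: `TileBuilder`, `RendererDefinition`, `runRenderer` and the string-based render output are simplified assumptions that only show how chained methods store callbacks which a framework later invokes in its own order.

```typescript
// Hypothetical, heavily simplified types; not the real Rendering Engine API.
type Action =
  | { action: "render"; data: unknown }
  | { action: "error"; message: string };

interface RendererDefinition {
  queries?: (entityId: string) => Record<string, unknown>;
  processDependencies?: (data: unknown) => Action;
  render?: (data: unknown) => string;
}

class TileBuilder {
  readonly definition: RendererDefinition = {};

  // Each method only stores the callback and returns `this` for chaining.
  withQueries(fn: (entityId: string) => Record<string, unknown>): this {
    this.definition.queries = fn;
    return this;
  }

  withProcessDependencies(fn: (data: unknown) => Action): this {
    this.definition.processDependencies = fn;
    return this;
  }

  withRender(fn: (data: unknown) => string): this {
    this.definition.render = fn;
    return this;
  }
}

const tile = (): TileBuilder => new TileBuilder();

// Stand-in for the framework: it, not the Renderer, decides when callbacks run.
// `fetched` simulates the data a real framework would fetch for the queries.
function runRenderer(builder: TileBuilder, entityId: string, fetched: unknown): string {
  const { queries, processDependencies, render } = builder.definition;
  queries?.(entityId); // a real framework would use this declaration to fetch data
  const result: Action = processDependencies
    ? processDependencies(fetched)
    : { action: "render", data: fetched };
  if (result.action === "error") {
    return `error: ${result.message}`;
  }
  return render ? render(result.data) : "";
}

const carouselTile = tile()
  .withQueries((id) => ({ carousel: { variables: { id } } }))
  .withProcessDependencies((data) =>
    data === null
      ? { action: "error", message: "No collection data found." }
      : { action: "render", data }
  )
  .withRender((data) => `<carousel>${JSON.stringify(data)}</carousel>`);
```

Because the Renderer never calls its own callbacks, the framework stays free to reorder, batch or skip steps (for example, skipping `withRender` when the action is an error).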

The most important lifecycle methods are:

`withQueries`

Declares a data dependency via a GraphQL query. Data is fetched automatically by the framework and is available when the other life cycle methods are called.

`withProcessDependencies`

Based on data delivered by withQueries, defines further action (render, error etc.) and allows data pre-processing, which is then passed to the withRender method. The chosen action tells the Rendering Engine whether the Renderer should render normally, redirect, or be displayed in an error state.

This lifecycle method is also responsible for specifying child entities of the current Renderer. In this example we want to display the collection entities as outfit or product cards based on their entity type. It is important to note that a given Renderer does not know which Renderers will be used for its child entities.

`withRender`

Returns the root React component to be used as the Renderer output.

For the Web, this is transformed into HTML and rendered on the server (SSR). Later on, the markup is hydrated on the client side with the data. For the Apps, we use a custom React reconciler and custom (non-Web) components to output JSON instead of HTML. However, most of the data flow, dev tooling and infrastructure remain the same for both use cases.

Renderers also offer more advanced features:

Progressive Hydration: we can mark specific renderers to be hydrated early, i.e. kicking off their React hydration as fast as possible on the client side, and thus making their content interactive before their parent renderer.
Code Splitting: we only load and parse the Renderers needed on a given, personalised page which gives us a good performance out of the box.
Renderer State: Renderers have access to a local Renderer State. The concept is similar to React’s setState. It enables you to re-run renderer lifecycle methods for example to fetch additional data, and re-render the updated child entities. The "classical" React state can still be used via React Hooks.

Data sharing

Renderers are not intended to share data with each other that is based on the client side state. We want to avoid unwanted data coupling and allow Renderers to be reused in other contexts with minimal risks.

Renderers have access to Zalando’s GraphQL Mutation APIs which allows remote data to be modified. Since all Renderers use the same data schema for their data dependencies, they can subscribe to changes in the schema to limit the need for cross-renderer communication.

Rendering Engine

Rendering Engine is the framework powering the Renderers. It is a backend service written in TypeScript and running in Node.js, coupled to a client-side JavaScript module that runs in the browser.

Rendering Engine encapsulates all the complexity and implementation details for the declarative Renderers. It processes incoming customer requests, matches Entities to Renderers, fetches data and other dependencies such as A/B testing assignments, asynchronously renders the response and delivers it back to the Web and Native App clients.

The following sections describe the main responsibilities of Rendering Engine.

UI Composition

All layouts in Interface Framework are represented as trees of nested entities that are visualized using the matching Renderers. The mapping of Entities to Renderers is fully described by a set of rendering rules.

In computer science terms, Rendering Engine recursively and asynchronously transforms a tree of entities into a tree of UI elements. On each step, it takes an entity node and its metadata as input, outputs a UI node plus zero or more child entity nodes, and then recurs over children.

The page rendering always starts with an Entity. We call it the Root Entity since it typically defines what the page is about. After the Rendering Engine receives a request, it extracts the root Entity from the request headers and looks up a matching Renderer. Once a Renderer is found, the Rendering Engine runs the Renderer lifecycle methods to fetch data. In case there are any child entities associated with this Renderer, the same resolution process happens recursively. Thus, each Renderer may "suggest" which entities should be rendered as its children, but has no control over the actual renderer choice. That choice is based exclusively on the Rendering Rules.
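The recursive, asynchronous resolution described above can be sketched as follows. All names here (`Entity`, `UiNode`, `SketchRenderer`, `resolveTree`) are hypothetical, and the lookup function is a stand-in for the Rendering Rules; the sketch only shows the shape of the recursion, not the streaming behaviour.

```typescript
// Hypothetical shapes for illustration only.
interface Entity {
  type: string;
  id: string;
}

interface UiNode {
  renderer: string;
  entityId: string;
  children: UiNode[];
}

// A renderer resolves its own data and suggests child entities.
interface SketchRenderer {
  rendererName: string;
  resolve: (entity: Entity) => Promise<Entity[]>;
}

// Stand-in for the Rendering Rules: decides which renderer handles an entity.
type RendererLookup = (entity: Entity) => SketchRenderer | undefined;

async function resolveTree(entity: Entity, lookup: RendererLookup): Promise<UiNode | null> {
  const renderer = lookup(entity);
  if (!renderer) {
    return null; // no rule matched, nothing to render
  }
  const childEntities = await renderer.resolve(entity);
  // Children are resolved concurrently; siblings do not wait for each other.
  const children = await Promise.all(
    childEntities.map((child) => resolveTree(child, lookup))
  );
  return {
    renderer: renderer.rendererName,
    entityId: entity.id,
    children: children.filter((node): node is UiNode => node !== null),
  };
}

const renderers: Record<string, SketchRenderer> = {
  outfit: {
    rendererName: "outfit_view",
    resolve: async (entity) => [{ type: "collection", id: `${entity.id}-similar` }],
  },
  collection: {
    rendererName: "collection_simple-carousel",
    resolve: async () => [],
  },
};

const lookupRenderer: RendererLookup = (entity) => renderers[entity.type];
const resolvedTree = resolveTree({ type: "outfit", id: "o1" }, lookupRenderer);
```

Note that the renderer only returns child entities; which renderer handles each child is decided by the lookup, mirroring the "suggests but does not control" relationship in the text.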

The important part here is that we do not block the resolution process. As soon as the entity is matched to a Renderer and the data resolved, the Rendering Engine kicks off the rendering process and starts streaming the HTML content to the client.

Data Fetching

The Rendering Engine takes care of fetching the GraphQL queries from the Fashion Store API. It uses an implementation of Perron, a data client with built-in support for circuit breakers, error handling and retries.

All queries to the Fashion Store API (FSA) are batched and cached based on a DataLoader implementation. This prevents duplicate calls to backends during the same request.
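A rough idea of how batching and per-request caching work together is sketched below. This is a hand-rolled `TinyLoader`, not the DataLoader library and not the Rendering Engine's code; the class name and the batch function are assumptions made for illustration.

```typescript
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (reason: unknown) => void;
}

// Minimal loader: collects keys requested in the same tick into one batch
// and caches the promise per key so duplicates never reach the backend.
class TinyLoader<K, V> {
  private cache = new Map<K, Promise<V>>();
  private queue: Pending<K, V>[] = [];

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    const cached = this.cache.get(key);
    if (cached) {
      return cached; // duplicate keys share one promise
    }
    const promise = new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (this.queue.length === 1) {
        // Schedule a single flush for all loads issued in the same tick.
        void Promise.resolve().then(() => this.flush());
      }
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      batch.forEach((item, index) => item.resolve(values[index]));
    } catch (error) {
      batch.forEach((item) => item.reject(error));
    }
  }
}

// Records every backend call so the batching effect is visible.
const backendCalls: string[][] = [];
const productLoader = new TinyLoader<string, string>(async (ids) => {
  backendCalls.push(ids);
  return ids.map((id) => `product:${id}`);
});

const loaded = Promise.all([
  productLoader.load("a"),
  productLoader.load("b"),
  productLoader.load("a"),
]);
```

Creating one loader per incoming request keeps the cache scoped to that request, which matches the "during the same request" guarantee in the text.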

Universal Rendering

Zalando being an e-commerce platform, our typical web page would have a prevalence of static content with islands of interactivity and we aim at serving content as fast as possible. This is why Rendering Engine was built from the ground up with full Server-Side Rendering (SSR) support. Each Renderer first generates its markup on the server and the Rendering Engine stitches it all together and streams the HTML to the client which then hydrates the components using our runtime module.

For the Web use case, we provide additional Zalando-specific APIs which add interactivity, mutate data if necessary, lazy-load extra contents etc. For the Native app, the Rendering Engine only serves the JSON markup and the actual rendering happens in App clients for iOS and Android.

Mosaic backward compatibility

We knew that the migration from Mosaic to Interface Framework would not happen in a day. Our Mosaic codebase was extensive and actively maintained. Therefore, the Rendering Engine allowed Mosaic fragments to be used directly inside Renderers.

This made our migration path very smooth. In fact, we now view Mosaic fragments as a powerful API our framework supports, and we still use them sometimes. In addition, this opened up extra integration and observability benefits for the legacy implementations.

Monitoring and Tracing

Improved observability is yet another benefit of the integrated platform. The Rendering Engine automatically collects and reports Web Vitals so that we can correlate performance variations with code changes. A number of custom client-side metrics are also collected. All this happens automatically, so developers who contribute to Renderers can focus on the customer experience. We also integrate a variety of common enterprise tools for logging aggregation, Open Tracing and client-side error monitoring, with zero integration time for the Renderer developers.

Developer Experience

Rendering Engine focuses on providing a great developer experience with the following features:

Local Development Environment: the framework provides an integrated development server and an on-demand compilation of Renderers. It only builds the Renderers that are shown on the current page. This ensures fast build times even when more and more Renderers are added to the application.
Multiple version support: Rendering Engine uses the Zalando Design System as a UI component library. The UI components are defined as dependencies for each particular Renderer. To allow greater flexibility, it supports using multiple versions including convenient tools and hooks to simplify the version maintenance.
Continuous Integration & Deployment: New code changes get tested and built automatically with specific performance reports for every page. These reports include bundle sizes and Lighthouse metrics. The deployments to Kubernetes happen continuously in the preview and production environments.
Automatic Persisted Queries: all GraphQL queries to the Fashion Store API are persisted on the server side together with a unique identifier. It helps reduce the request size, since the Rendering Engine client runtime sends the identifier instead of the whole query string.
Localization: Rendering Engine supports localized bits of text inside Renderers.
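The persisted-query idea from the list above can be sketched in a few lines. How the Rendering Engine actually derives its identifiers is not described here, so the hash-based `queryId`, the `PersistedQueryStore` class and the request shape are all assumptions; FNV-1a is used only to keep the sketch dependency-free.

```typescript
// Simplified FNV-1a over UTF-16 code units; a production system would more
// likely use a cryptographic hash. This choice is an assumption of the sketch.
function queryId(query: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < query.length; i++) {
    hash ^= query.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

class PersistedQueryStore {
  private queries = new Map<string, string>();

  // Called ahead of time (e.g. at build or deploy): stores the query, returns its id.
  persist(query: string): string {
    const id = queryId(query);
    this.queries.set(id, query);
    return id;
  }

  // Called per request: the client sends only the id.
  lookup(id: string): string | undefined {
    return this.queries.get(id);
  }
}

const queryStore = new PersistedQueryStore();
const outfitQuery = `{ outfit(id: "4NXOAez0Qti") { id creator { variant { name } } } }`;
const persistedId = queryStore.persist(outfitQuery);

// What the client would send instead of the full query string.
const requestBody = JSON.stringify({ id: persistedId });
```

The saving grows with query size: the identifier has a fixed length, while real queries with many fields and fragments can be far longer than this example.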

Page Rendering Explained

Let’s have a look at what happens in Interface Framework on a high-level when you visit a page on the Zalando website. In this example, the user visits an outfit view by choosing one from Zalando’s Get the Look page.

The request gets picked up by Skipper, which is an HTTP router and reverse proxy for service composition. Skipper identifies the matching route and forwards the request to the Rendering Engine along with the entity parameters:

entity-type: "outfit"
entity-id: "ern:outfit::4NXOAez0Qti"

The Rendering Engine gets the request with the entity above, which is called the root entity. The root entity defines the main content of the page. Based on the Rendering Rules, a matching Renderer is selected for this root entity.

For the outfit page, the set of Rendering Rules looks like the following:

export const outfitViewRule: RenderingRule = {
  selector: { entity: "outfit" },
  renderer: "outfit_view",
  children: [
    {
      selector: { entity: "outfit" },
      renderer: "outfit_highlight-b",
      children: [
        {
          selector: { entity: "product" },
          renderer: "product_horizontal-highlight-product-card",
        },
      ],
    },
    {
      selector: { entity: "collection" },
      renderer: "collection_simple-carousel",
      children: [
        {
          selector: { entity: "outfit" },
          renderer: "outfit_outfit-card",
        },
      ],
    },
  ],
};

The Renderer for the root entity is the Outfit View Renderer. We can refer to it as the top-level or root Renderer for the request. The Renderer has a data dependency in the form of the following GraphQL query.

{
 outfit(id: "4NXOAez0Qti") {
   id
   creator {
     variant {
       name
     }
   }
   relevantEntities(first: 2) {
     edges {
       node {
         id
       }
     }
   }
 }
}

The query is executed in the Fashion Store API and various parts of the query go through different resolvers depending on the fields that are present. Each of the resolvers then calls one or many microservices that provide data.

In our example, we ask for the creator’s name of the outfit together with two relevant entities. One resolver will call the Recommendation System to get the relevant entities for this outfit. Here, our relevant entities are a collection with other outfits from the same creator and a collection with outfits that look similar.

Each Renderer decides which relevant entities appear as its children and adds placeholders for them. This is achieved via the withProcessDependencies lifecycle method. The Rendering Engine picks up all relevant entities and determines matching Renderers. For each of these nested Renderers, the process repeats recursively until no more nested entities must be rendered.
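The matching step can be sketched as a recursive walk over a rule tree. The rule shape below follows the Rendering Rules example earlier in this post, but the matching semantics (first rule whose selector equals the entity type, child rules scoped to their parent) and the `EntityNode`, `ResolvedNode` and `applyRules` names are assumptions.

```typescript
interface RenderingRule {
  selector: { entity: string };
  renderer: string;
  children?: RenderingRule[];
}

// Hypothetical input: the entity tree suggested by the Renderers.
interface EntityNode {
  type: string;
  children: EntityNode[];
}

interface ResolvedNode {
  renderer: string;
  children: ResolvedNode[];
}

// Picks the first rule matching the entity type, then recurs into the
// children using only the rules nested under the matched rule.
function applyRules(entity: EntityNode, rules: RenderingRule[]): ResolvedNode | null {
  const rule = rules.find((candidate) => candidate.selector.entity === entity.type);
  if (!rule) {
    return null; // no rule for this entity in this position
  }
  const children: ResolvedNode[] = [];
  for (const child of entity.children) {
    const resolved = applyRules(child, rule.children ?? []);
    if (resolved) {
      children.push(resolved);
    }
  }
  return { renderer: rule.renderer, children };
}

// Trimmed version of the outfit rule shown earlier.
const outfitViewRule: RenderingRule = {
  selector: { entity: "outfit" },
  renderer: "outfit_view",
  children: [
    {
      selector: { entity: "collection" },
      renderer: "collection_simple-carousel",
      children: [{ selector: { entity: "outfit" }, renderer: "outfit_outfit-card" }],
    },
  ],
};

const outfitPage: EntityNode = {
  type: "outfit",
  children: [
    {
      type: "collection",
      children: [
        { type: "outfit", children: [] },
        { type: "outfit", children: [] },
      ],
    },
  ],
};

const resolvedPage = applyRules(outfitPage, [outfitViewRule]);
```

In this sketch the same entity type, outfit, resolves to `outfit_view` at the root and to `outfit_outfit-card` inside the carousel, which illustrates why a Renderer cannot know which Renderers its children will get.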

After all the Renderers and their data dependencies are collected, the Rendering Engine renders the React components of each Renderer and streams the content to the client. The next picture shows a sketch of the outfit page that is divided into the corresponding Renderers. Each Renderer is responsible for one part of the page.

Conclusion

We have presented a deep dive into Rendering Engine with all its key functionalities. The final part of this blog series will cover a comparison between Mosaic and Interface Framework and what we have learned during the migration.

Update 2023/07: See Rendering Engine Tales: Road to Concurrent React for an update on Rendering Engine and how we integrated React Concurrent features as part of our upgrade to React 18.

Using Internal Mobility For Growth

2021-09-02T00:00:00+02:00

Long time readers of this blog will remember that back in 2019, we published a feature on the benefits of rotating engineers between teams. For those of you who have not seen it, the article described an initiative that aimed to establish cross-functional knowledge sharing, encourage cross team collaboration, and bring greater product awareness, by providing engineers with an opportunity to work on different teams within our Developer Productivity department.

Within Zalando, we are incredibly passionate about enabling our engineers to progress and to develop. This empowerment and growth mindset is deeply woven into our fabric. Take a peek at Our Founding Mindset. Four of them are focused on empowerment. I myself am particularly drawn to #makeUsBetterNotBigger.

Let’s take a look at how another of our business units, Zalando Direct, our B2B marketplace, is using Internal Mobility as a catalyst for development. Within the unit, the leadership team maintains a directory of opportunities that are used to foster growth within engineers. This repository covers community driven initiatives such as our architecture review groups and our weekly hacking sessions, in addition to our department driven topics and task forces such as improving observability of systems. One development opportunity is Internal Mobility.

Internal Mobility is described as an exciting avenue for growth that enables engineers to join a different team on either a fixed-length assignment, or on a permanent basis. In this article, I would like to focus on the former, which was our most recent success story. This story involved a Frontend Engineer who had been with Zalando Direct for over one year, and was joining my team on a short-term assignment for one month.

The goals of the team swap were to:

Provide a solid opportunity to expand knowledge and expertise by contributing to a new domain.
Provide the destination team with an experienced extra engineer to contribute to their large and growing backlog.
Further highlight that Internal Mobility should be used to successfully provide a development opportunity for our engineers.

Kicking Things Off

The engineer’s lead initiated the assignment, so let’s understand what that entails. First and foremost, it is imperative that our engineer is comfortable with, and excited about, the opportunity. Taking ownership of one’s own career progression and personal development is something that I look for when an engineer is on a seniority trajectory. I am always more than happy to double down my investment in them if I know that it will be maximised.

Thereafter, it is important to agree on scope and duration. Engineers know that diving into an unscoped project is a fool’s errand, and this is no different. Up front, it is important to be clear on what is expected from all parties, and what are the boundaries. In this case, it was agreed that the duration would be one month, and that the scope was to work on a particular area of partner-facing functionality within our platform, zDirect. For some additional context, zDirect is a web application that enables our partners to grow and steer their business on Zalando.

Onboarding

Onboarding a new joiner to our team is always a great opportunity to critically assess how well our process works. One factor that can accelerate onboarding is the new joiner’s familiarity with the languages and tools. We were able to keep the tech stack unified, which is a subset of the technologies sponsored by Zalando as part of the tech radar. This, coupled with the engineer’s understanding of the ecosystem, meant that we were able to get up and running in no time at all. Additionally, we got some incredibly helpful feedback that enabled us to improve our onboarding documentation. Given that we are growing at an incredible pace, streamlining the onboarding process for new hires pays dividends on productivity and experience. Always be squeezing your Time-To-Ship!

From this point onwards, we had a new team member. They joined all of our ceremonies, paired with their colleagues, and got to grips with the team’s ways of working. Similarly, they attended social settings such as team lunches and activities. They immediately started shipping value, and right away boosted our team’s throughput. This required collaboration with our engineers, our product manager, and our designer. We do not work in isolation, and this is an important aspect of the assignment. Please don’t extract somebody from their team environment and have them work alone. A well known study on team dynamics stated that “Who is on a team matters less than how the team members interact, structure their work, and view their contributions”.

Use this opportunity to solidify your team and to hone the dynamics of collaboration.

So How Did This Experiment Go?

Ultimately, this assignment enabled our team to deliver increased value for our stakeholders. Throughput aside, however, the assignment yielded much more. As a leader, I thrive from helping my team to succeed. One of the most rewarding stages of this assignment was doing a final retrospective with our new team member. Throughout the process I could see a continuous stream of high quality deliveries, but I wanted to drill down further into the personal experience. To hear that they

“developed technically, acquired a better understanding of how the business operates, and identified different processes and ideas to bring back to their own team”

was of course music to my ears. Moreover, they were inspired to go out and enroll into a Typescript course (we provide every engineer with a healthy training budget to use for their own growth) and incorporate it into their development plan. I like to think of this as the flywheel effect on growth.

My last question to them was “Would you do it again?”, which was answered with an enthusiastic “Yes”.

Conclusion

Internal mobility assignments are a really effective way to provide engineers with an opportunity to learn new skills, to work in a new domain, and to push themselves out of their comfort zone.

All experiments come with learning opportunities, and the goal of trying something new is to broaden our understanding and experiences. Two important learnings for us (as receiving team) were that

We needed to improve our onboarding documentation.
Engineers should not have to switch back and forth during such an assignment.

For the former, our new member was able to pinpoint some gaps in the process, and we have since created an internal ways-of-working document to alleviate this for the next person. For the latter, there was an instance when our new member needed to respond to a topic for their original team, which broke the productivity flow and led to some context switching. This is something that we will avoid next time.

Sidenote: Context-switching is a productivity killer. I remember reading Quality Software Management: Systems Thinking, by Gerald Weinberg, and being horrified by the impact that switching has on delivery.

That being said, I believe that any endeavour that yields learnings is a successful endeavour. The benefits and learnings that come from internal rotation are in abundance, and I would highly recommend that you try this in your organisation. Presently, we have a number of engineers on different assignments, ranging from weeks to months.

I opened up this article by referring back to an experiment conducted back in 2019. One of the goals that the authors hoped for was that rotations would become more of a regular thing in Zalando, and it’s awesome to be able to write this piece two years later, and say that, yes it is something that we are doing regularly, and continuously learning from.

Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data

2021-07-29T00:00:00+02:00

The Master Data Management Challenge

Master data management (MDM) is a technology-enabled discipline in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets.¹ At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style.

Typically, MDM projects are started because an organisation does not have a central view of a specific subject matter and, instead, that information, such as the contact details of a business partner, is scattered across systems, with each maintaining its own differing or identical record of these details. In our practical approach, MDM is a set of practices to create a common, shared, and trusted view on data, also called a golden record, for a particular domain. In our MDM project, source systems are identified, their data is consumed, processed through a match and merge process, cleansed and quality assured, and then stored centrally according to a canonical data model. This centrally stored golden record is then published back to the source systems for consideration and possible correction in their respective systems.

We are currently designing a central MDM component that harmonises the different records into the central and trusted golden record. Its form needs to be defined in a logical data model. This is a set of definitions of tables and columns in which the consolidated record pulled, matched, and merged from the different sources is stored. Deriving this model is usually done manually, which has the following drawbacks:

The amount of manual work to create the logical data model increases in proportion to the number of system tables.
Usually, the data models are read and created by colleagues from engineering with limited business know-how.
The communication of the data model of source records and the data model of the golden record is shown as technical and textual definition files (SQL schema or a spreadsheet).
For business stakeholders that are domain experts the understanding of contents and how they relate to each other is hard to grasp from these technical definition files.
The domain expert is limited from conveying correctly the knowledge to the engineers creating the data model, which leads to errors and misunderstandings.

Because of these drawbacks, the risk is that a MDM tool is released with a faulty and incorrect model that needs iterations of rework. As the logical data model is a main driver for the effort of creating a MDM tool effecting user interface, processes, business rules, and data storage, this risk might have a large impact and delays the business value delivery.

As the communication between business and engineering about a correct logical data model is happening upon textual technical specification files, an effective and efficient data governance decision making process is hindered, too, which is important to make the golden record also trustworthy.

The logical data model is not the only deliverable in such an MDM project. We also have to deliver the mapping from each system's data model to that of the golden record and define whether the mapping can be done directly 1-to-1, or whether it needs to go through some kind of transformation. For example, system A may define an address differently from system B.

System A: Address

address_line_1
address_line_2
address_line_3

System B: Address

street
zip_code
city
country_code

The golden record data model needs to define the optimal and correct way to store an address object as well as define how the differing systems' data models map to it. If done manually, this work also increases with the number of system tables.

Using Knowledge Graph Technologies

In order to improve this manual definition, we made use of knowledge graph technologies by describing all systems' data models in a named directed graph. We then mapped each column of a system to a set of business concepts, such as "address", "contact person", or "business partner". These business concepts have attributes as well as relationships with other concepts. For example, the business partner concept is connected to the address concept as in the image below.²

We are using Neo4J to create these human-readable images of the mappings, since it has, in our opinion, the best look-and-feel in the current landscape of knowledge graph technologies. Most domain experts can read these images much better than the above-mentioned data model definition files. Currently, we are mapping tens of tables and hundreds of columns, so creating images by hand would mean more manual, error-prone work; generating them from the knowledge graph is far more efficient. The number in brackets in the colour legend is the total number of nodes of this type in the knowledge graph.

For the above-mentioned example of systems A and B storing address information differently, we can model this in the knowledge graph in the following way. Columns from system A, such as address line 1, 2, and 3, map indirectly (one-to-many) to the address concept. This means that these columns need to be processed into the MDM system with a transformation algorithm. Columns from system B, however, map directly (one-to-one) to respective attributes of the address concept. See the image below for an illustration.

Focusing Manual Work Where it Should Be

The only manual work that is done is to record the mapping from systems' tables and columns to business concepts, their attributes, and their relationships. For example, systems A and B are mapped in the following way:

System A: Address

address id -> concept: Address, relationship: has contact (target)
business partner id -> concept: Business Partner, relationship: has contact (source)
address_line_1 -> concept: Address
address_line_2 -> concept: Address
address_line_3 -> concept: Address

System B: Address

id -> concept: Address, relationship: has contact (target)
business partner id -> concept: Business Partner, relationship: has contact (source)
street -> concept: Address, attribute: street name
zip_code -> concept: Address, attribute: postal code
city -> concept: Address, attribute: city name
country_code -> concept: Address, attribute: country code

And that is all that needs to be done manually. A domain expert can provide us with these definitions; the only coordination required is that exactly the same names are used for concepts, attributes, and relationships across systems. This is done by cross-referencing the systems' business concepts and unifying their wording.
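The mapping records above could be captured in a small data structure before they are written into the graph. The following is a minimal sketch under stated assumptions, not the tooling used in the project: the `ColumnMapping` class, its field names, and the `kind` labels are hypothetical and chosen only for illustration. It shows how a mapping that names no attribute can be recognised as indirect (needing a transformation), while one that names an attribute is direct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnMapping:
    """One manually recorded mapping from a system column to a business concept."""
    system: str
    table: str
    column: str
    concept: str
    attribute: Optional[str] = None     # set when the column maps one-to-one
    relationship: Optional[str] = None  # e.g. "has contact"
    role: Optional[str] = None          # "source" or "target" of the relationship

    @property
    def kind(self) -> str:
        # Hypothetical labels: identifier columns carry a relationship,
        # direct columns name an attribute, everything else is indirect.
        if self.relationship is not None:
            return "identifier"
        return "direct" if self.attribute is not None else "indirect"


MAPPINGS: List[ColumnMapping] = [
    ColumnMapping("A", "Address", "address id", "Address", relationship="has contact", role="target"),
    ColumnMapping("A", "Address", "business partner id", "Business Partner", relationship="has contact", role="source"),
    ColumnMapping("A", "Address", "address_line_1", "Address"),
    ColumnMapping("A", "Address", "address_line_2", "Address"),
    ColumnMapping("A", "Address", "address_line_3", "Address"),
    ColumnMapping("B", "Address", "id", "Address", relationship="has contact", role="target"),
    ColumnMapping("B", "Address", "business partner id", "Business Partner", relationship="has contact", role="source"),
    ColumnMapping("B", "Address", "street", "Address", attribute="street name"),
    ColumnMapping("B", "Address", "zip_code", "Address", attribute="postal code"),
    ColumnMapping("B", "Address", "city", "Address", attribute="city name"),
    ColumnMapping("B", "Address", "country_code", "Address", attribute="country code"),
]


def columns_needing_transformation(mappings: List[ColumnMapping], system: str) -> List[str]:
    """Return the columns of one system that map indirectly to a concept."""
    return [m.column for m in mappings if m.system == system and m.kind == "indirect"]


print(columns_needing_transformation(MAPPINGS, "A"))  # the three address lines
print(columns_needing_transformation(MAPPINGS, "B"))  # empty: all columns map directly
```

Keeping the records in one flat list mirrors how a domain expert would hand them over, for example as rows of a spreadsheet.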

Generating the Logical Data Model

The mapping from systems' tables and columns to business concepts is processed and written into the knowledge graph, which then holds the following types of nodes:

System, the name of one system owning tables and columns.
Table, the name of a table from a particular system.
Column, one column in one system with respective schema definitions, such as data type.
Concept, a business concept such as Address.
Attribute, a single piece of data describing a concept, such as street name for the address concept.
Relationship, a connection between two concepts, directed from one, the source concept, to the other, the target concept. For example, business partner "has contact" address.

The logical data model is then systematically created (via a Python script) from the concepts, attributes, and relationships. Each concept is created with a table of its own, where the columns are all of its attributes and an internal identifier for the concepts. Each relationship also becomes a table of its own with the internal identifiers of the source and target concepts as foreign key columns.
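The generation rule described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the actual script: the function names, the snake_case naming convention, the `id`, `source_id`, and `target_id` column names, and the dictionary output format are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def snake(name: str) -> str:
    """Turn a business name such as "Business Partner" into "business_partner"."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def build_logical_model(
    concepts: Dict[str, List[str]],
    relationships: List[Tuple[str, str, str]],
) -> Dict[str, dict]:
    """Derive tables from concepts and relationships.

    concepts:      concept name -> list of attribute names
    relationships: (source concept, relationship name, target concept)
    """
    tables: Dict[str, dict] = {}
    # Each concept becomes a table: an internal identifier plus its attributes.
    for concept, attributes in concepts.items():
        tables[snake(concept)] = {
            "columns": ["id"] + [snake(a) for a in attributes],
            "foreign_keys": {},
        }
    # Each relationship becomes a table of its own that references
    # the internal identifiers of its source and target concepts.
    for source, name, target in relationships:
        table = f"{snake(source)}_{snake(name)}_{snake(target)}"
        tables[table] = {
            "columns": ["source_id", "target_id"],
            "foreign_keys": {"source_id": snake(source), "target_id": snake(target)},
        }
    return tables


model = build_logical_model(
    concepts={
        "Address": ["street name", "postal code", "city name", "country code"],
        "Business Partner": [],
    },
    relationships=[("Business Partner", "has contact", "Address")],
)
for table_name, definition in model.items():
    print(table_name, definition["columns"], definition["foreign_keys"])
```

In practice the concepts and relationships would be read from the knowledge graph instead of being passed in as literals.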

Since the graph records which systems' tables and columns contribute to each concept, we can then also generate the so-called transformation data model, which shows how each system's column maps (directly or indirectly) to the logical data model of the golden record.

By using knowledge graphs for a live-data representation of all systems' logical data models and how they map to a semantic layer of business concepts, we are able to automatically generate the logical data model of the golden record inside the knowledge graph with additional information on how it connects to the systems' data models. This enables us to keep a record of data lineage from each system to the golden record and, additionally, to use contemporary knowledge graph visualisation tools to give domain experts an intuitive and understandable representation of how each system is connected to the golden record. We see here two main advantages:
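A transformation data model of this kind could look like the following sketch. It assumes the mappings are available as plain tuples and that golden record tables and columns use snake_case names; the function name, the tuple layout, and the output format are hypothetical and only meant to illustrate the idea.

```python
import re
from typing import List, Optional, Tuple

# (system, table, column, concept, attribute or None); a hypothetical record layout
Mapping = Tuple[str, str, str, str, Optional[str]]


def snake(name: str) -> str:
    """Turn a business name such as "postal code" into "postal_code"."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def transformation_model(mappings: List[Mapping]) -> List[dict]:
    """List, per system column, where it lands in the golden record and how."""
    rows = []
    for system, table, column, concept, attribute in mappings:
        target_table = snake(concept)
        if attribute is not None:
            # Direct: the column fills exactly one attribute column.
            target, how = f"{target_table}.{snake(attribute)}", "direct"
        else:
            # Indirect: only the target table is known; a transformation
            # has to decide which attribute columns receive the content.
            target, how = target_table, "indirect"
        rows.append({"source": f"{system}.{table}.{column}", "target": target, "mapping": how})
    return rows


rows = transformation_model([
    ("A", "Address", "address_line_1", "Address", None),
    ("B", "Address", "zip_code", "Address", "postal code"),
])
for row in rows:
    print(row)
```

Each row doubles as a lineage record: it names the originating system column and the golden record location it contributes to.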

The dialogue between business and technology in designing the golden record logical data model has improved and accelerated the process of creating a correct model.
All deliverables, such as the logical data model and the transformation data model, can be queried directly from the knowledge graph and do not need to be created manually, which is less error-prone.

We estimate that during the development of the MDM component this approach will keep on saving time for us by avoiding misunderstandings and improving stakeholder communication.

Wikipedia on Master Data Management 23.7.2021 ↩
For knowledge graph experts it is worthwhile to note that because this is a schema for the logical data model, relationships between concepts are also modeled as nodes. This is a deliberate design choice. It enables us to map data model information to relationships. ↩

How we use Kotlin for backend services at Zalando

2021-07-01T00:00:00+02:00

The adoption of Kotlin at Zalando

As outlined in prior posts, Zalando uses a Tech Radar to provide guidance on technology selection.

Recently, we moved Kotlin from TRIAL to ADOPT. With this change we are doubling down on the support of Kotlin as the 3rd JVM language next to Java and Scala. This is the result of increased adoption within the company (100+ new applications were written in Kotlin in a year), positive feedback from engineers starting to use it, as well as creation of guidelines, coding standards, reference projects, and service templates by the Zalando Kotlin Guild.

The experience that our Engineering Community gained over the recent years with Kotlin matches the developer stories of other companies. A nice collection of success stories can be found on the Android blog. Kotlin allows writing more succinct code with fewer pitfalls compared to Java and comes with a lot of useful features and libraries (e.g. data classes, null safety) that Java does not (yet) have as part of its standard library. This is probably also a reason why it is more wanted and less dreaded than Java and Scala in the 2020 Stack Overflow insights. Additionally, type inference, read-only collections, as well as the rich support for functional programming in the standard libraries were among the things our developers see as benefits compared to Java.

The Kotlin Guild

The Kotlin Guild was founded with around 10 core members who want to help the language grow in Zalando. Moving the language to ADOPT in the latest Tech Radar Update was a central milestone in that effort, as the ADOPT status comes with support from central infrastructure teams and the created documentation as well as templates, which help to promote a standardized tech stack and make bootstrapping new services easier. Due to being driven by our language guild, the whole process was kept transparent and open for contributions from the Engineering Community.

As a preparation for wider adoption of Kotlin, we collected internal good practices as well as the definition of tools and libraries for the development of RESTful backend services and Android apps with Kotlin that are recommended as default choices. For additional input we looked at how frequently things are used within the company, sat together with experts on specific topics, consulted external sources, and asked the whole Engineering Community to review final recommendations via a survey. Overall, we made sure that our recommendations support a positive developer experience and fit the need of most services, which are not directly serving customer traffic.

Looking forward, the Kotlin Guild will continue to foster knowledge exchange as well as community building for its 250+ members. We also plan to cover more use cases with our documentation, like pure functional services using Arrow, and will make sure we stay up to date with new developments in the Kotlin space. In addition, the members support each other with technical issues, and regular talks are hosted.

How we build Backend Services at Zalando

Our internal developer tooling allows engineers to initialize a repository from a template project. These templates come with out-of-the-box configuration and integrations which teams can then adapt to their needs. As an added benefit, they nudge teams towards higher consistency across different services and departments.

All APIs are defined in the OpenAPI format using Swagger. This allows our API portal to list all available APIs in one place along with their API linting results via Zally. API linting can also be required to pass for MUST validations on every build. Many of our teams follow the API first principle throughout service development.

Given that most services are deployed in Kubernetes, we consider Skipper filters the best way to handle Authentication and Authorization. This can be achieved either in Skipper directly or via Route Groups or Fabric Gateway. Skipper is designed to handle a large number of requests and is less likely to be misconfigured than, for example, Spring Security.

Many JVM based Web services in Zalando are built using Spring Boot and we believe that this is also a good option when using Kotlin. This choice is mainly driven by the large adoption, but also because Spring integrates really well with Kotlin, is compatible with multiple application servers, and supports reactive programming via WebFlux. We do also see growing adoption of Ktor and predict it to gain popularity within Zalando in the future, possibly even in conjunction with GraalVM.

Libraries we use for Backend Services

As a build system, we prefer Gradle over Maven because of its great customizability and build performance. Gradle is also used to build Kotlin itself and is used by many major framework projects like Spring Boot. On top of that, the build configuration scripts can be written in Kotlin.

Linting is a very good practice to keep the style consistent in a codebase and to settle disputes over correct indentation. Ktlint is our tool of choice as it follows the official coding conventions, is easy to run in Gradle, and does not enforce too many rules such that it seamlessly integrates into the software development process.

Kotlin-logging is recommended for logging as it automatically adds class names to the log, lazily evaluates messages, and is built on top of slf4j.

For Redis access, we recommend using Lettuce which is part of spring-boot-starter-data-redis, as it is a thread safe client with nice support for reactive programming.

To access relational databases, we see spring-boot-starter-data-jpa as a solid choice in case you like to use ORM, but advise considering jOOQ in cases where database transactions become more complex. It is also worth mentioning that jOOQ can be used together with other clients, as it can be used on top of JPA. jOOQ also has the added benefit that it supports database specifics like Postgres JSON types.

Zalando is investing into traceability with Open Tracing and we recommend opentracing-toolbox which eases integration of tracers, particularly in Spring Boot projects. Tracing allows linking requests across services and is also great to set up automated alerting.

Conclusion

We hope this gives you some idea why Kotlin is gaining popularity for backend development within Zalando.

Zalando Tech Radar - Scaling Contributions to Technology Selection

2021-06-24T00:00:00+02:00

Introduction

In our previous post about Technology Choices at Zalando we spoke about a few problems with scaling technology selection in Tech companies. Since then, we have focused on the remaining categories of the Tech Radar beyond languages and the Tech Radar contribution process. Now, we'd like to reflect on our lessons learned, which you can use when designing technology selection processes.

Scaling contributions

One of the challenges for us to solve was scaling contributions to the Tech Radar across our 250+ delivery teams. Technologists are often more excited about promoting a new, promising technology than working on guidelines or sharing knowledge about already well-known tech. Such individuals are also essential for continued innovation. On the other hand, companies look for organizational efficiency by ensuring talent mobility across teams supported by a more or less standardized tech stack. This makes it easier to address cross-team dependencies in product delivery by allowing teams to contribute to code bases beyond their area of responsibility. Further, it creates career opportunities for Engineers, who can quickly switch teams and work on a challenging, high impact project. Thus, for technology selection, there is a natural tension between early adopters' vested interest and the needs of the organization they work for. At Zalando, we have created a two-sided contribution model to the Tech Radar:

Anyone in Zalando is encouraged to contribute knowledge about technologies we have on the Tech Radar or suggest ones that are promising to evaluate and play a key role in this process.
Our Principal Engineers are maintainers of the Tech Radar and are responsible for moderating information collection on incoming suggestions, driving creation of good practices for technologies being evaluated or used, and promoting technologies to increase their adoption.

Ring change suggestions are supported by issue templates in our internal Tech Radar GitHub repository. These templates provide guidance on common questions around use case fit, key differences from alternatives already on the Tech Radar, conformance to our Technology Selection Principles, and support within the Engineering Community.

We encourage and expect our Engineers to contribute information about usage, lessons learned from production incidents, or challenges they face at scale. Voluntary contributions alone are insufficient to keep an updated view of the technologies we use. Thus, to support usage information collection, we collect usage data from our AWS accounts, source code repositories, or our infrastructure platform offerings. The collected information is compiled in a documentation page with a common structure across all entries.

Finally, we leverage Principal Engineers to moderate and drive discussions around technology adoption at Zalando. These colleagues have a sufficiently broad view on technology usage and performance in production across multiple teams and serve as a multiplying factor. They're responsible for encouraging teams they work with to share knowledge and highlight technology usage based on the software systems in their areas - either themselves or by enabling others to do so. Additionally, they moderate discussions within technology guilds or initiate working groups to create specific artifacts for the technologies, like collections of good practices or guidelines tailored to our environment, use cases, and scale. Such working groups are also excellent opportunities to develop or identify talent within the company.

Re-scoring - how have we decided upon changes?

After a longer period of time with no regular changes to the Tech Radar, we had a re-scoring exercise to complete. A similar approach was used originally at ThoughtWorks and can be used to create a Tech Radar from the ground up.

Within our Principal Engineering Community, we formed a working group per dimension: Datastores, Data processing, Infrastructure, and Queues. Our Tech Radar visualization merges Data processing and Queues in a single Data Management dimension for simplicity. Each working group was responsible for the data collection and analysis. One person from each group compiled the information in a structured format where per technology there was a case made for a ring change (or not). The change reasoning was supported by data points on usage, incidents, and expertise we gained since the technology was added to the Tech Radar (a few years in some cases) as well as conformance with our Technology Selection Principles. Where necessary to build a solid case, we reached out to teams in order to understand more details about their use cases or experience, if this was not sufficiently documented through recent information in our Tech Radar.

Based on the collected data, Principal Engineers participated in a review and re-scoring exercise. In a spreadsheet, we collected votes. Every 'nay' vote required a short rationale which we later discussed in the group to ensure we did not miss out on usage or use cases. We also found inconsistencies in the way we handle technologies with multiple deployment options (self-hosted vs. managed or vendor offerings), for which we have not found a good solution yet.

After the voting, the collected ring changes were discussed with our Senior Leadership Team. The main focus was on ensuring long-term support for the technologies we promote to ADOPT and that technologies on lower rings are in line with long-term strategies (e.g. Data Strategy).

Finally, the changes were shared with our Engineers, together with a detailed rationale per ring change and further information on the re-scoring process and contributions moving forward.

Notable changes

With the re-scoring, we moved a few technologies to ADOPT, confirming our investment in these. To scale adoption, in some cases, we formed dedicated teams that operate service offerings available to all Zalando Engineers and Data Scientists.

Airflow

Apache Airflow is a Workflow Orchestration tool used by data teams in Zalando. We have a central infrastructure team responsible for managing Airflow as a Service for our data teams.

Databricks

We've been using Apache Spark for various analytical and Machine Learning use cases and talked about our usage before (see Data Warehousing with Spark Streaming at Zalando). Databricks is also the core element of our Machine Learning Platform, available to all Engineers. More recently, we went from a centralized Data Lake approach towards a distributed Data Mesh architecture backed by Spark and built on Delta Lake powered by Databricks. See our talk Data Mesh in Practice: How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake for more information.

GraphQL

We've blogged about our GraphQL usage before. We have 200+ developers that contributed to the GraphQL API layer powering the Zalando shop over the past 2.5 years. We also have other use cases in production, for example in back-office applications for our Buying department.

Kotlin & TypeScript

Having seen continued and growing usage of Kotlin and TypeScript, we have initiated workstreams within our language guilds to define guidelines, coding standards, reference projects, and service templates. These artifacts are helping teams in adopting the languages moving forward. Further, they help build a shared understanding of what we consider production-proven frameworks and libraries along with recommended configuration options. We've shared our TypeScript best practices in the past and more details about promoting Kotlin at Zalando.

SageMaker

We have blogged before about our usage of Amazon SageMaker for ML Pipelines with Real-Time Inference and for distributed training. See also our talk on using SageMaker for training ML models from the AWS Summit 2019.

Tech Radar changes moving forward and future focus

The re-scoring exercise described in this post was a house-keeping exercise supported by clarifying the purpose of the Tech Radar, long-term ownership, and the contribution model. The number of upcoming changes will of course depend on contributions from our Engineering Community and our appetite for trying out new technologies. While changes to ADOPT/HOLD are going to be evaluated on a quarterly basis, we have a steady stream of ongoing assessments and trials.

The Principal Engineering Community focuses on:

supporting and guiding contributions from the Engineering Community,
identifying promising technologies to invest in,
collecting best practices and expertise around technologies on TRIAL and ADOPT.

With the last point we aim to define paved roads for Engineers describing for example battle-tested configurations for typical use cases or standardized monitoring dashboards with their explanation for the key and most common technologies. While this is today already the case for our PostgreSQL as a Service offering built on top of Patroni and Postgres Operator, given a dedicated team responsible for this infrastructure, we don't have such guidance collected across all our ADOPT technologies yet.

Challenges we have not solved yet

There are a few challenges that the Tech Radar does not solve for today, mostly related to consistency and completeness of the technology landscape. If we resolve any of these challenges, we will surely share our insights and lessons learned.

Some technologies (e.g. etcd) have been successfully used in our infrastructure teams, but we would not want any delivery team to use these (e.g. for configuration management counting as "infrastructure") as we have more suitable building blocks in our platform.

In other cases, we have invested into service offerings built around open-source software (e.g. Airflow) and we would prefer teams to extend this platform offering rather than deploy their own infrastructure.

We also have solutions built in-house (e.g. our request router - Skipper) which are an essential part of our cloud infrastructure. Teams don't really have a choice to easily opt-out of these. These technologies will most likely be moved to a different place that will represent the maturity of the development infrastructure at Zalando from a Product perspective.

For technologies where we chose vendor offerings built on top of a technology (e.g. Databricks for Spark), the question arises whether to include one or both and with which ring assignment (setting Spark to HOLD while keeping Databricks on ADOPT may sound confusing). Here, we consider using the underlying technology and outlining the recommended deployment options.

Finally, there are 3rd party products, which allow us to deliver solutions faster, without the need to reinvent the wheel. One example is Content Management Systems: we've built a few over the past years and strive not to do this again. The question is how to make these sufficiently visible to our Engineers, so that they're considered while building future products for our customers.

Making the Remote Onboarding a Success

2021-04-22T00:00:00+02:00

When the pandemic started in 2020, many Zalando employees started working from home. It changed our working habits and many other things, and Zalando published remote working guidelines to support its employees. These concentrate only on remote working, but what happens if you change companies during the pandemic?

Joining a new company and getting onboarded can already be pretty tough during normal times. Starting a new job requires you to learn new skills and build up new relations within the company. Working from home amplifies those problems by introducing virtual barriers. It's not possible to walk up to somebody and ask a question or introduce yourself to people you meet by chance in different situations.

We were recently confronted with the challenge of growing our engineering team from two to five people within two months. In this article I describe how we tackled this challenge to make sure that the new team members get onboarded quickly and feel welcome in this new setup.

Onboarding Buddy

One of the first decisions we made was to assign an onboarding buddy to each new team member. The onboarding buddy is the go-to person for the new team member in case of questions or problems where support is needed, e.g. setting up the notebook. As some people might feel uncomfortable asking unknown people for help, especially remotely, daily 1:1 sessions were set up to discuss the current state of the onboarding, answer open questions, and provide regular feedback. As time went on, the frequency of the 1:1s decreased, because people got used to working in the team.

Feedback

Providing regular feedback is the key to success during the onboarding. It’s supposed to create this continuous feedback loop to inform the new team members about how their contribution is viewed, get them used to Zalando's feedback culture and to also reflect on how the onboarding is working out and if it needs to be tweaked. To make sure we don’t forget to provide feedback, we set up monthly feedback sessions between the team and each new team member. While doing this we experimented with three different formats.

An open round where everybody shares the feedback freely.
The feedback is given in short 1:1 sessions between the new member and each team member.
The team collects the feedback and then presents one summarized view to the new team member.

Overall it’s impossible to say which format is the best. It could be intimidating in the beginning to receive feedback from the whole team in an open round, but fine at a later point in time when the team knows each other better. It depends on the situation and the people and we gave our new team members the possibility to choose. As those feedback sessions were also meant for the new member to provide feedback to the team, we prepared some questions to collect the feedback.

What do you think about the onboarding so far?
Is there any information that you missed or would have liked to receive earlier?
Is your workload manageable for you? Are the tasks too easy/too difficult?
Would you like to receive more/less support?
Is there anything you would like to work more on?
How comfortable would you feel if all other team members fall sick and you are alone working on tasks and support requests?

The last question is probably the most important one. It asks the new team members to reflect on themselves and check how confident they already are about their skills. This is an important indicator for the team to put some focus on areas that were missed so far in the onboarding. This way we found out that we needed to become better at introducing the on-call and incident process in our team, as this was completely missed.

Technical Onboarding

The onboarding of course includes some technical onboarding as well. We did the obligatory domain introduction and some introductions into our ways of working, like the sprint ceremonies. It’s important not to overwhelm the new team members at the start. Much, if not most, information can also be shared down the line when it’s necessary. It’s better to focus on the basics in the beginning and give them time to sink in. But at some point the new team members need to get their hands dirty and work on some real tasks. To make the start easier, we defaulted to pair programming or even mob programming in the beginning. The rule was that tasks had to be done by at least two people unless other circumstances prevented it. Pair programming is even more important than usual while working remotely, not only because it allows for easy, “on the job” knowledge sharing, but also because it allows the participants to bond and get to know each other. The pair programming was done with simple tools. The person programming used the IDE of their choice and shared their screen via the call so that the others could watch the coding. Of course other tools and IDE plugins exist that try to make the whole setup even better, but in our experience it worked pretty well without them.

In our team we have a team role that rotates each day and that person takes care of incoming support requests from internal clients. Usually this requires a certain level of domain and system knowledge. We decided to onboard the new team members pretty fast to the role. On the one hand it frees up some time from the more experienced engineers and on the other hand it provides another learning opportunity for the new team members. As long as this was transparently communicated with clients, they didn’t mind that some support requests took longer than usual and the new team members made huge progress on domain knowledge in a relatively short time.

Relationships

The last part of the onboarding relates to the relationships inside the team. We are not just robots coming into work, but we are humans with emotions, goals and sometimes also problems. I believe that trust is an essential ingredient for efficient teams. It allows you to speak up freely and to make mistakes, and it turns conflicts into constructive discussions. And during the pandemic you are missing out on a lot of opportunities to get to know your new team-mates as there are no team lunches, no short discussions at the coffee machine and no rounds of table tennis during the breaks. This can quickly start to feel like you are being left alone with your problems. Therefore we introduced a weekly “Team Bonding” session which was moderated by our producer. The producer is responsible for team processes in our team; in case you don't have such a role, any person, be it a team member, team lead or somebody outside the team, could facilitate this meeting.

Every week she came up with new ideas for the session. Sometimes we just presented personal objects from our homes to each other, another time we did PowerPoint karaoke or we played a game like Tabu. Some of those exercises had some goals, like improving your presentation skills, but in the end it was always about the people and getting to know them. What drives your team-mates? What kind of humour do they have? What keeps them up at night right now? Opening up really helps to create this bond and increase the trust among each other. Such exercises can of course also be done when everybody is back at the office to continue the bonding between team-mates and are not only valuable when you are working remotely.

Summary

Summing up this article, it boils down to some simple points. Take your time to do a proper onboarding and be transparent with clients and leads about possible delays for support requests or roadmaps. Remind yourself constantly about providing feedback to give guidance and prevent unpleasant surprises. And don’t forget about the personal relationships that need to be created, because they will allow you to trust each other and also feel safe while making mistakes. Following those rules is very time intensive, but it pays off in the long run and we were able to build an awesome team in just about three months that already increased the productivity compared to before. Of course there is no one-size-fits-all solution regarding the onboarding and different teams might have different needs, but this setup worked very well for us.

Modeling Errors in GraphQL

2021-04-13T00:00:00+02:00

GraphQL Errors

GraphQL is an excellent language for writing data requirements in a declarative fashion. It gives us a clear and well-defined concept of nullability constraints and error propagation. In this post, let's discuss how GraphQL lacks in certain places regarding errors and how we can model those errors to fit some of our use-cases.

Before we dive into the topic, let's understand how GraphQL currently treats and handles errors. The response of a GraphQL query is of the following structure -

{
  "data": {
    "foo": null
  },
  "errors": [
    {
      "message": "Something happened",
      "path": ["foo", "bar"]
    }
  ]
}

Error extensions

The Schema we define for GraphQL is used only in the data field of the response. The errors field is a well-defined structure - Array<{ message: string, path: string[] }> in its simplest form. The Schema we define does not affect this Error.

Let's say the client queries a field using an ID. How can the client know from the above error object whether the Error is due to an Internal Server Error or the ID is Not_Found? Parsing the message is a no-go because it is not reliable.

Luckily, GraphQL lets us attach additional data to an error through its extensions field. The error.extensions can convey other information related to the Error - properties, metadata, or other clues from which the client can benefit. As for the above example, we can model the response to be -

const err = {
  data: {},
  errors: [
    {
      message: "Not Found",
      extensions: {
        code: "NOT_FOUND",
      },
    },
  ],
};
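
On the client, the code can then be read without touching the message. The following is a minimal sketch of that idea; the helper name getErrorCodes is an assumption made for illustration and is not part of any GraphQL library.

```javascript
// Hypothetical helper: collects the machine-readable codes of all errors
// in a GraphQL response. Errors without extensions.code are skipped.
function getErrorCodes(response) {
  return (response.errors ?? [])
    .map((error) => error.extensions?.code)
    .filter((code) => code !== undefined);
}

const notFoundResponse = {
  data: {},
  errors: [{ message: "Not Found", extensions: { code: "NOT_FOUND" } }],
};

const codes = getErrorCodes(notFoundResponse);
console.log(codes.includes("NOT_FOUND")); // true
```

The client branches on a stable code, so the wording of the message can change without breaking it.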

Errors for Customers

When a GraphQL API delivers content to end-users, i.e., the customers, it has two levels of users -

The Developer or user of the API - UI/UX/front-end developer.
The Customer or end-user - The one who does not see any technical layers but gets the product's experience in its most presentable format. The Front-end developer builds this experience using data from the GraphQL API.

Since using the word user might be confusing, from now on, Developer will refer to the front-end developer, and Customer will refer to the end-user.

When we have an API whose data is directly consumed by two levels of these users - Developer and Customer, there might be different error data requirements. For example, let's take mutations - when the Customer enters an invalid email address,

The Developer who uses the GraphQL API needs to know, in a parseable format, that the Customer has entered an invalid email address - a boolean, an enum, or any other data structure works, as long as it does not require parsing the error message.
The Customer needs to care about the error message in a nicely styled format close to the text box. Also, for different languages or locales, the error message needs to be in the corresponding translated text.

Let's try to model this using the error extensions discussed above -

{
  "data": {},
  "errors": [
    {
      "message": "Die E-Mail-Addresse ist ungültig",
      "extensions": {
        "code": "INVALID_EMAIL"
      }
    }
  ]
}

While this would work, we soon end up in a case where multiple input fields in a mutation can be invalid. What can we do here? Do we model them as different errors or fit everything into the same Error?

The Customer errors still need to be usable by the Developers to propagate them. The front-end developers are the ones ultimately transforming our data structures to UI elements. So they need to understand the Error to highlight that input text-box with a red border. So, to make it easy, let's try modeling these as a single error with multiple validation messages -

{
  "data": {},
  "errors": [
    {
      "message": "Multiple inputs are invalid",
      "extensions": {
        "invalidInputs": [
          {
            "code": "INVALID_EMAIL",
            "message": "Die E-Mail-Addresse ist ungültig"
          },
          {
            "code": "INVALID_PASSWORD",
            "message": "Das Passwort erfüllt nicht die Sicherheitsstandards"
          }
        ]
      }
    }
  ]
}

The codes INVALID_EMAIL and INVALID_PASSWORD will help the front-end dev or Developer highlight the field in the UI, and the message will be displayed to the Customer right under that text-box.
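
As a rough sketch of what the front-end has to do with this structure: the mapping from codes to form fields (FIELD_BY_CODE) and the helper name toFieldMessages are assumptions for illustration, since nothing in the response itself tells the Developer which text-box a code belongs to.

```javascript
// Assumption: the front-end maintains its own mapping from error codes to
// form field names. These names are illustrative only.
const FIELD_BY_CODE = {
  INVALID_EMAIL: "email",
  INVALID_PASSWORD: "password",
};

// Builds { fieldName: translatedMessage } from the invalidInputs of all errors.
function toFieldMessages(response) {
  const fieldMessages = {};
  for (const error of response.errors ?? []) {
    for (const input of error.extensions?.invalidInputs ?? []) {
      const field = FIELD_BY_CODE[input.code];
      if (field !== undefined) {
        fieldMessages[field] = input.message;
      }
    }
  }
  return fieldMessages;
}

const response = {
  data: {},
  errors: [
    {
      message: "Multiple inputs are invalid",
      extensions: {
        invalidInputs: [
          { code: "INVALID_EMAIL", message: "Die E-Mail-Adresse ist ungültig" },
          {
            code: "INVALID_PASSWORD",
            message: "Das Passwort erfüllt nicht die Sicherheitsstandards",
          },
        ],
      },
    },
  ],
};

console.log(toFieldMessages(response));
```

Every optional access in this helper is there because the shape of extensions is not described anywhere the client can discover it.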

All this leads to a complicated structure very soon and is not as friendly as the data modeled with a GraphQL schema.

Why you no Schema?

The biggest problem we face in modeling these in the extension object is that it's not discoverable. We use such a powerful language like GraphQL to define each field in our data structure using Schemas, but when designing the errors, we went back to a loose mode of not using any of the ideas GraphQL brought us.

Maybe, in future extensions of the language, we can write schemas for Errors as we write for Queries and Mutations. The developers using the Schema get all the benefits of GraphQL even when handling errors. For now, let's concentrate on modeling this using the existing language specification.

Errors in Schema

We want to enjoy the power of GraphQL - the discoverability of fields of data, the tooling, and other aspects for errors. Why don't we put some of these errors in the Schema instead of capturing them in extensions?

For example, the mutation discussed previously can be modeled like this -

mutation returns a Result type
Result type is a union of Success, Error.
Error schema contains necessary error info - like translated messages, etc.

type Mutation {
  register(email: String!, password: String!): RegisterResult
}

union RegisterResult = RegisterSuccess | RegisterError

type RegisterSuccess {
  id: ID!
  email: String!
}

type RegisterError {
  invalidInputs: [RegisterInvalidInput]
}

type RegisterInvalidInput {
  field: RegisterInvalidInputField!
  message: String!
}

enum RegisterInvalidInputField {
  EMAIL
  PASSWORD
}

This structure looks exactly like the one we designed above inside error extensions. The advantage of modeling it like this would be that we are using the benefits of GraphQL for errors.
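
On the client side, a union like this is typically told apart by its __typename. The sketch below assumes the query selects __typename on RegisterResult; the function name describeRegisterResult and the sample values are made up for illustration.

```javascript
// Assumption: the mutation query selects __typename on the RegisterResult
// union, together with the fields of each member type.
function describeRegisterResult(result) {
  switch (result.__typename) {
    case "RegisterSuccess":
      return `registered ${result.email}`;
    case "RegisterError":
      // invalidInputs is a nullable list in the schema, so guard against null.
      return (result.invalidInputs ?? [])
        .map((input) => `${input.field}: ${input.message}`)
        .join("; ");
    default:
      throw new Error(`Unexpected type: ${result.__typename}`);
  }
}

console.log(
  describeRegisterResult({
    __typename: "RegisterError",
    invalidInputs: [{ field: "EMAIL", message: "invalid" }],
  })
); // EMAIL: invalid
```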

When you have a hammer,

Now, with the idea of modeling errors as Schema types, we are left with more questions than answers -

Should I model all errors as GraphQL types?
How should I decide when to use error extensions and when to use GraphQL types for modeling errors?
etc.

When we have multiple teams maintaining the platform, many people contribute and think about modeling different parts of the Schema. There should be clear definitions for the different aspects of the existing data structures and the idea behind how we reached such solutions. The design and the Schema are changed far less often than they are read or used.

GraphQL gave us the mindset of "Thinking in Graphs". If we suggest a new way of modeling errors, we need to talk about this mindset and its ideas. Not all errors fit into this modeling (error types in Schema), and it will make the GraphQL API less usable if we approach it by looking at all the errors as nails.

Classification

To model errors, let's try to find some analogies. I want to think about modeling these errors in terms of programming language errors. For example,

Go: Error vs. panic
Java: Error vs. Exception
Rust: Result vs. panic

Programming languages also model errors as two variants. In one model (an error value in Go), we inform the Developer who uses the function. The Developer decides either to handle it or to pass it through. In the other variant (a panic in Go), we skip everything and bring the program to a halt. We inform the end-user of the program that something has happened. This small variation, captured as two different things, helps us understand the intention of data in errors.

Part 1. Action-ables

What is an error? It tells us that something is wrong and gives us some information on what action can be taken. We can think of errors as containers of action-ables. When modeling them, we classify them into different groups depending on who can take that action.

In the GraphQL context, the front-end takes care of some errors - either by a fallback or a retry. For other errors, like invalid inputs, the front-end cannot take action; only the Customer who entered the invalid input can fix it.

Instead of modeling the errors loosely, we now have a concrete use-case - model it for whoever can take action.

Part 2. Bugs in the system

Errors convey information - either to the Developer or to the Customer. If the Error conveys a bug in the system, it should not be modeled as a schema error type. Here, the system means all the services and software involved in our entire product and not just the GraphQL service. This distinction is essential because it separates the end-user / Customer from the Developer who uses the API - the end-user looks at our product as one thing, not many individual services.

In the 404 Not Found case, if we had modeled the errors as schema types, it would make the Schema less usable. Let's take a product look-up use-case -

{
  product(id: "foo") {
    ... on ProductSuccess {
      success
    }
    ... on ProductError {
      error
    }
  }
  collection(id: "bar") {
    ... on CollectionSuccess {
      products {
        ... on ProductSuccess {
          success
        }
      }
    }
    ... on CollectionError {
      error
    }
  }
}

This way of handling errors at every level is not friendly for front-end developers. It's too much to type in a query and too many branches to handle in the code.

Part 3. Error propagation

We also have to remember not to disrupt GraphQL semantics of error propagation. If an error occurs in one place in the query, it propagates upwards in the tree till the first nullable field occurs. This propagation does not happen with error types in Schema. It is essential to model these schema error types for only specific use-cases. We go back to Part 1: Action-ables - we design these types for actions that the end-user or Customer can take.
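
To make the propagation rule concrete, here is a deliberately simplified model of it. It is not how a GraphQL executor is implemented; the function propagateError, the isNonNull predicate and the sample data are assumptions for illustration.

```javascript
// Simplified model of GraphQL null propagation, for illustration only.
// `isNonNull(path)` is a hypothetical predicate telling whether the field at
// `path` is declared non-null (`!`) in the schema.
function propagateError(data, errorPath, isNonNull) {
  // Walk upwards from the failing field to the first nullable field.
  let depth = errorPath.length;
  while (depth > 0 && isNonNull(errorPath.slice(0, depth))) {
    depth -= 1;
  }
  if (depth === 0) {
    return null; // every field on the path is non-null: the whole data becomes null
  }
  const result = JSON.parse(JSON.stringify(data)); // cheap deep copy
  let parent = result;
  for (const key of errorPath.slice(0, depth - 1)) {
    parent = parent[key];
  }
  parent[errorPath[depth - 1]] = null;
  return result;
}

// Illustrative data: brand.name is non-null, brand itself is nullable.
const data = { product: { name: "Fancy T-Shirt", brand: { name: "Acme" } } };
const nonNullPaths = new Set(["product.brand.name"]);
const isNonNull = (path) => nonNullPaths.has(path.join("."));

// The error in brand.name stops at brand, the first nullable field.
console.log(propagateError(data, ["product", "brand", "name"], isNonNull));
```

A field that returns a schema error type resolves successfully from the executor's point of view, so none of this nulling happens for it.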

The Problem type

Naming is half the battle in GraphQL. Since the name error is already taken by the GraphQL language (response.errors), it would be confusing to name our error types in Schema as Error. Looking for inspiration as we did before, there is a well-defined concept in RFC 7807 - Problem Details for HTTP APIs. So, we will call all our errors in Schema Problems and, as it has always been, all other errors errors.

The above register schema with the Problem type would look like this -

type Mutation {
  register(email: String!, password: String!): RegisterResult
}

union RegisterResult = RegisterSuccess | RegisterProblem

type RegisterSuccess {
  id: ID!
  email: String!
}

type RegisterProblem {
  "translated message encompassing all invalid inputs."
  title: String!
  invalidInputs: [RegisterInvalidInput]
}

type RegisterInvalidInput {
  field: RegisterInvalidInputField!
  "translated message."
  message: String!
}

enum RegisterInvalidInputField {
  EMAIL
  PASSWORD
}

Problem or Error

Problem refers to the Error as a Schema type. Error refers to the Error that appears in the response.errors array with an error code at error.extensions.code.

Case 1: Resource Not Found

404s are bugs in the system in case of navigation. If the user navigates from the home page to a product page and ends up on a 404 page, some service selected an id that leads to a 404 when resolved, and this was most likely already the case at selection time. The error is not caused by something the user entered. Also, these errors need to be propagated. So, this becomes an Error with the error code NOT_FOUND and not a Problem.

Case 2: Authorization

Authorization errors are of the Error type and do not fit a problem type. Here, the action taker looks like it's the Customer who needs to log in. But, the UI can take action here and show a login dialog box to the Customer. In apps, the app decides to take the Customer to the login view. The action belongs to the Front-end and only then the Customer. So, we model it for the developer/front-end as an Error with error code NOT_AUTHORIZED and not a Problem.

Case 3: Mutation Inputs

Mutation Inputs is the only case where it is crucial to construct Problem types. It contains inputs directly from the Customer, and only the Customer can take action for this. So, we model these errors as Problems and not Errors.

Case 4: All other bugs / errors

Any runtime exception in the code or Internal Server Errors from any backends that the GraphQL layer connects to should be modeled as Error and need not contain an error code. This way, it is easy for the front-end to treat all errors without an error code as Internal Server Errors and take action accordingly - to retry or show the Customer an error page.
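
Cases 1, 2 and 4 can be summarized as a small decision function on the front-end. This is only a sketch: the codes NOT_FOUND and NOT_AUTHORIZED come from the cases above, while the function name decideAction and the action names are hypothetical.

```javascript
// Maps an entry of response.errors to a front-end action.
// The action names are hypothetical placeholders.
function decideAction(error) {
  switch (error.extensions?.code) {
    case "NOT_FOUND":
      return "SHOW_NOT_FOUND_PAGE";
    case "NOT_AUTHORIZED":
      return "SHOW_LOGIN";
    default:
      // No error code: treat it as an Internal Server Error.
      return "RETRY_OR_SHOW_ERROR_PAGE";
  }
}

console.log(decideAction({ message: "Not Found", extensions: { code: "NOT_FOUND" } }));
console.log(decideAction({ message: "Something happened" }));
```

Case 3 does not appear here, because Problems arrive in data and are handled where the mutation result is rendered.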

Conclusion

We have discussed the Problem type as a possible solution where the error object in the GraphQL response does not suffice for the use-cases. But we have to be careful not to overuse it in the many use-cases where the error extensions already provide enough value.

We have to understand that the Problem type in unnecessary places does make the query and front-end code complicated. Our GraphQL Schema should try to simplify and provide a friendly interface.

Optimize GraphQL Server with Lookaheads

2021-03-18T00:00:00+01:00

In our first post about How we use GraphQL at Zalando, we briefly shared about performance optimizations using GraphQL-JIT. GraphQL-JIT allowed us to scale our implementation without performance degradations. In this post, we share another optimization we use - Lookaheads.

Same Model; Different Views

In our GraphQL service, we do not have resolvers for every single field in the schema. Instead, we have certain groups of fields resolved together as a single request to a backend service that provides the data. For example, let's take a look at the product resolver,

resolvers = {
  Query: {
    product(_, { id }) {
      return ProductBackend.getProduct(id);
    },
  },
};

This resolver will be responsible for getting multiple properties of the Product - name, price, stock, images, material, sizes, brand, color, other colors, and further details. The same Product type in the schema can render as a Product Card in a grid or the entire Product Page. The amount of data required for a Product card is less than the complete product details of a product page.

Every time the product resolver is called, the entire response from the product backend is requested by the GraphQL service. Though GraphQL allows us to specify the data requirements to fetch optimally, it becomes beneficial only between the client-server communication. The data transfers between the GraphQL server and the Backend server remain unoptimized.

Partial Responses

Most of the backend services in Zalando support Partial responses. In the request, one can specify a list of fields. The backend service treats this list as a filter and returns only those fields, trimming everything that was not requested. It is similar to what GraphQL offers us, and the request looks somewhat like this -

GET /product?id=product-id&fields=name,stock,price

Here, the fields query parameter is used to declare the required response fields. The backend can use this to compute only those response fields. Likewise, the backend can pass it further down the pipeline to another service or database. The response for the above request would look like the following -

{
  "name": "Fancy T-Shirt",
  "stock": "AVAILABLE",
  "price": "EUR 35.50"
}

Partial responses help in reducing the amount of data over the wire and give a good performance boost. A GraphQL query is also precisely the same thing - it provides a well-defined language for the fields parameter in the above request.
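
To illustrate the filtering, here is a minimal sketch of what a backend could do with the fields parameter. The function pickFields and the extra material value are assumptions; real backends may well compute only the requested fields instead of filtering a complete object.

```javascript
// Hypothetical backend-side filter: keeps only the requested top-level fields.
function pickFields(resource, fieldsParam) {
  const requested = fieldsParam
    .split(",")
    .map((field) => field.trim())
    .filter(Boolean);
  const partial = {};
  for (const field of requested) {
    if (Object.prototype.hasOwnProperty.call(resource, field)) {
      partial[field] = resource[field];
    }
  }
  return partial;
}

// Illustrative complete resource; "material" is not requested below.
const product = {
  name: "Fancy T-Shirt",
  stock: "AVAILABLE",
  price: "EUR 35.50",
  material: "Cotton",
};

console.log(pickFields(product, "name,stock,price"));
```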

Lookahead

Let's leverage these partial responses and use them in the GraphQL server. When resolving the product, we must know which fields are selected within this product; in other words, we need to look ahead in the query to get the sub-fields of the product.

query {
  product(id: "foo") {
    name
    price
    stock
  }
}

A thing to note - name, stock, and price do not have explicitly declared resolvers. When resolving product, how can we know what its sub-selections are? Here, navigating the query AST (Abstract Syntax Tree) helps. During execution, the resolver function will receive the AST of the current field. The structure of the AST depends on the language and implementation. For GraphQL-JS, or GraphQL-JIT executors, it is available in the last parameter (of the resolver function) which is called a Resolve Info.

resolvers = {
  Query: {
    product(_, { id }, context, info) {
      const fields = getFields(info);
      return ProductBackend.getProduct(id, fields);
    },
  },
};

We use the query AST in the resolve info to compute the list of fields under product, pass this list of fields to the product backend, which supports partial responses, and then send the backend response as the resolved result.

Field Nodes

The resolve info is useful for doing a lot of optimizations. Here, for this case, we are interested in the fieldNodes. It is an array of objects, each representing the same field - in this case - product. Why is it an array? A single field may appear in more than one place in a query - for instance, fragments, inline fragments, aliasing, etc. For simplicity, we will not consider fragments and aliasing in this post.

The entire query is a tree of field nodes where the children at each level are available as selection sets.

Each fieldNode has a Selection Set, a list of subfield nodes - here - the selection set will be the field nodes of name, stock, and price. So the getFields implementation (without considering fragments and aliasing) will look like the following -

function getFields(info) {
  // TODO: handle all field nodes in other fragments
  return info.fieldNodes[0].selectionSet.selections.map(
    (selection) =>
      // TODO: handle fragments
      selection.name.value
  );
}

When we pass the product resolver's info, the getFields function returns ["name", "price", "stock"], in the order the fields appear in the query. We can take this list and pass it to the backend as the query parameter.
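
For readers who want to go beyond the simplification, the sketch below shows one way the TODOs could be approached. It assumes the GraphQL-JS AST shape (Field, InlineFragment and FragmentSpread nodes, with named fragments under info.fragments) and ignores type conditions, aliases and the @skip / @include directives. The name getFieldsWithFragments and the hand-built info object are for illustration only.

```javascript
// Collects the names of the direct sub-fields of the current field,
// following inline fragments and named fragment spreads.
function getFieldsWithFragments(info) {
  const names = new Set();

  function visit(selectionSet) {
    for (const selection of selectionSet.selections) {
      if (selection.kind === "Field") {
        names.add(selection.name.value);
      } else if (selection.kind === "InlineFragment") {
        visit(selection.selectionSet);
      } else if (selection.kind === "FragmentSpread") {
        visit(info.fragments[selection.name.value].selectionSet);
      }
    }
  }

  // The same field can appear more than once, so visit every field node.
  for (const fieldNode of info.fieldNodes) {
    if (fieldNode.selectionSet) {
      visit(fieldNode.selectionSet);
    }
  }
  return [...names];
}

// Hand-built stand-in for the resolve info of:
//   product { name ...priceFields ... on Product { stock name } }
const field = (value) => ({ kind: "Field", name: { kind: "Name", value } });
const selectionSet = (selections) => ({ kind: "SelectionSet", selections });
const info = {
  fieldNodes: [
    {
      ...field("product"),
      selectionSet: selectionSet([
        field("name"),
        { kind: "FragmentSpread", name: { kind: "Name", value: "priceFields" } },
        {
          kind: "InlineFragment",
          selectionSet: selectionSet([field("stock"), field("name")]),
        },
      ]),
    },
  ],
  fragments: {
    priceFields: {
      kind: "FragmentDefinition",
      selectionSet: selectionSet([field("price")]),
    },
  },
};

console.log(getFieldsWithFragments(info)); // [ 'name', 'price', 'stock' ]
```

The Set deduplicates fields that are selected in more than one place, such as name above.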

For simple use-cases like these, where the backend data structure and the GraphQL schema are the same, it's possible to use GraphQL fields as the backend fields. When it's a bit different, we need to map the schema fields to backend fields for the request. Also, we need to map the backend fields back to schema fields for the response.

Different schemas

If the backend fields are different from the GraphQL schema fields, then there exists a mapping from schema fields to backend fields. A simple mapping may be the difference in the name of the fields. For example, name in schema might be title in the backend. This mapping can get complex where a single schema field might derive from multiple backend fields. For example, price in schema might be a concatenation of currency and amount from the backend. It gets interesting when we have nested structures - for example, price in schema might be a concatenation of price.currency and price.amount.

The response is partial

Another aspect of this mapping is that it's not enough to think about it one way - from schema fields to backend fields. That direction only covers the request from the GraphQL server to the backend server. The response that the backend sends must be transformed to match the schema, and this isn't free when the mapping of fields is this complicated.

When we have a single transform function that converts backend response to match the schema, we have to understand that it is built from a partial response and not the complete response -

function backendProductToSchemaProduct(backendProduct) {
  return {
    name: backendProduct.title,
    // we have a problem here -
    price: `${backendProduct.price.currency} ${backendProduct.price.amount}`,
    stock: backendProduct.stock_availability,
  };
}

In the above implementation, when the query is { product(id) { name } }, the transformer will try to convert, assuming the complete response is available. Since the backend responded with partial data (only the title field was requested), the access to a nested property will throw an error - Cannot read property 'currency' of undefined. We could have a null check at every place, but the code becomes unmaintainable. So we need a way to model it both ways -

Map schema fields to backend fields during the request to the backend
Map backend fields to schema fields with the response from the backend

Dependency Maps

The mapping we talked about above is what a dependency map is. Every schema field depends on one or many nested fields in the backend. A way to represent this can be as simple as an object whose keys are schema fields, and the values are a list of object paths.

const dependencyMap = {
  name: ["title"],
  price: ["price.currency", "price.amount"],
  stock: ["stock_availability"],
};

From this dependency map, we can create our request to the backend. Let's say the backend takes a query parameter fields in the following form - a comma-separated list of object path strings. Depending on the implementation, there can be a wide variety of formats for this. Here, we will take a simple one.

function getBackendFields(schemaFields, dependencyMap) {
  // Set helps in deduping
  const backendFields = new Set(
    schemaFields
      .map((field) => dependencyMap[field])
      .reduce((acc, field) => [...acc, ...field], [])
  );
  // a Set has no join method, so spread it into an array first
  return [...backendFields].join(",");
}

For schema fields name and price, the computed backend fields would be the string title,price.currency,price.amount, and we can construct the request to the backend -

GET /product?id=foo&fields=title,price.currency,price.amount

Transformation Maps

After the request, we know that the backend returns a partial response instead of the complete response. We also saw above that a single function that transforms the entire backend response to schema fields is not enough. Here, we use a transformation map. It's a map of schema fields to transformation logic. Like the dependency map, the keys are schema fields, but the values are transform functions that use only specific fields from the backend.

const transformerMap = {
  name: (resp) => resp.title,
  price: (resp) => `${resp.price.currency} ${resp.price.amount}`,
  stock: (resp) => resp.stock_availability,
};

As you see here, each value is a function where the only properties used inside this function are from the dependency map. To construct the result object from the partial response of the backend, we use the same computed sub-fields (from the getFields function) and use them on the transformer map. For example -

function getSchemaResponse(backendResponse, transformerMap, schemaFields) {
  const schemaResponse = {};
  for (const field of schemaFields) {
    schemaResponse[field] = transformerMap[field](backendResponse);
  }
  return schemaResponse;
}
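
As a quick check of the two maps working together, here is a compact, self-contained variant of the helpers above. It closes over the maps instead of taking them as parameters, and it assumes the backend nests price as price.currency and price.amount, as the dependency map suggests.

```javascript
const dependencyMap = {
  name: ["title"],
  price: ["price.currency", "price.amount"],
  stock: ["stock_availability"],
};

// Each transformer touches only the backend fields listed in dependencyMap.
const transformerMap = {
  name: (resp) => resp.title,
  price: (resp) => `${resp.price.currency} ${resp.price.amount}`,
  stock: (resp) => resp.stock_availability,
};

// Schema fields -> deduplicated list of backend field paths.
function getBackendFields(schemaFields) {
  return [...new Set(schemaFields.flatMap((field) => dependencyMap[field]))];
}

// Partial backend response -> schema object, for the requested fields only.
function getSchemaResponse(backendResponse, schemaFields) {
  return Object.fromEntries(
    schemaFields.map((field) => [field, transformerMap[field](backendResponse)])
  );
}

const schemaFields = ["name"];
console.log(getBackendFields(schemaFields).join(",")); // title

// The backend was asked for title only, so price is absent from the response.
const partialResponse = { title: "Fancy T-Shirt" };
console.log(getSchemaResponse(partialResponse, schemaFields)); // { name: 'Fancy T-Shirt' }
```

The price transformer is never called for this query, so the missing price object cannot cause an error.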

So far,

Let's recap the concepts we have unwrapped so far -

getFields: compute sub-fields by looking ahead in AST
getBackendFields: compute backend fields from sub-fields and dependency map
request the backend with the computed backend fields
getSchemaResponse: compute schema response from partial backend response, sub-fields, and the transformer map

Batching

At Zalando, most of our backends support batching multiple requests into a single request, just as they support partial responses. Instead of getting a resource by its id, most backends let us get resources by ids. For example,

GET /products?ids=a,b,c&fields=name

will return the response,

[{ "name": "a" }, { "name": "b" }, { "name": "c" }]

We should take advantage of such features. One of the popular libraries that aid us in batching is the DataLoader by Facebook.

We provide the dataloader - an implementation for handling an array of inputs that returns an array of outputs/responses in the same order. The dataloader takes care of combining and batching requests from multiple places in the code in an optimal fashion. You can read more about it in the Dataloader's documentation.
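
To build some intuition for what happens inside, here is a deliberately tiny stand-in for the batching idea. It is not DataLoader's implementation: the class TinyLoader is hypothetical, and it has no caching, no deduplication and a simpler scheduling strategy than the real library.

```javascript
// Collects all keys requested in the same tick and calls batchFn once.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = [];
  }

  load(key) {
    return new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // Schedule a single dispatch for everything queued in this tick.
      if (this.queue.length === 1) {
        queueMicrotask(() => this.dispatch());
      }
    });
  }

  async dispatch() {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      // Results must come back in the same order as the keys.
      batch.forEach((item, index) => item.resolve(values[index]));
    } catch (error) {
      batch.forEach((item) => item.reject(error));
    }
  }
}

let backendCalls = 0;
const loader = new TinyLoader(async (ids) => {
  backendCalls += 1;
  return ids.map((id) => ({ name: id })); // pretend backend response
});

// Three independent load calls end up in one backend call.
const pending = Promise.all([
  loader.load("a"),
  loader.load("b"),
  loader.load("c"),
]);
pending.then((products) => {
  console.log(backendCalls); // 1
  console.log(products.map((product) => product.name)); // [ 'a', 'b', 'c' ]
});
```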

Dataloader for Product resolver

When a Product appears in multiple parts of the same GraphQL query, each will create separate requests to the backend. For example, let's consider this simple GraphQL query -

query {
  foo: product(id: "foo") {
    ...productCardFields
  }
  bar: product(id: "bar") {
    ...productCardFields
  }
}

The products foo and bar are batched together into a single query using aliasing. If we implement a resolver for a product that calls the ProductBackend, we will end up with two separate requests. Our goal is to make a single request. We can implement this with a dataloader -

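const DataLoader = require("dataloader");

async function getProductsByIds(ids) {
  // fetch resolves to a Response; the products are in its JSON body
  const response = await fetch(`/products?ids=${ids.join(",")}`);
  return response.json();
}

const productLoader = new DataLoader(getProductsByIds);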

We can use this productLoader in our product resolver -

resolvers.Query.product = async (_, { id }) => {
  const product = await productLoader.load(id);
  return product;
};

The Dataloader takes care of the magic of combining multiple calls to the load method into a single call to our implementation - getProductsByIds.

Complexities

The DataLoader deduplicates inputs, optionally caches the outputs, and also provides a way to customize these functionalities. In the productLoader defined above, our input is the product id - a string. When we introduce the concept of partial responses, the backend expects more than just the id - it also expects the fields parameter used to select the fields for the response. So our input to the loader is not just a string - let's say, it's an object with keys - id and fields. The dataloader implementation now becomes -

async function getProductsByIds(inputs) {
  const ids = inputs.map((input) => input.id);
  //
  // We have a problem here
  //                    v
  const fields = inputs[0].fields;
  const response = await fetch(
    `/products?ids=${ids.join(",")}&fields=${fields}`
  );
  return response.json();
}

Here, in the above code-block, the problem is highlighted with a comment - each of the productLoader.load calls can have a different set of fields. What is our strategy for merging all of these fields? Why do we need to merge?

Let's go back to an example and understand why we should handle this -

query {
  foo: product(id: "foo") {
    name
  }
  bar: product(id: "bar") {
    price
  }
}

The product foo requires name and product bar requires price. If we remind ourselves how this gets translated to backend fields using the dependency map, we end up with the following calls -

productLoader.load({
  id: "foo",
  fields: ["title"],
});

productLoader.load({
  id: "bar",
  fields: ["price.currency", "price.amount"],
});

If these two calls get into a single batch, we need to merge the fields such that both of them work during the transformation of backend fields to schema fields. Unfortunately, it's impossible to select different fields for different ids in the backend in most cases. If this is possible in your case, you probably do not need merging. But for our use-case and probably many others, let's continue the topic assuming merging is necessary.

Merging fields

In the above example, the correct request to the backend would be -

GET /products?ids=foo,bar&fields=title,price.currency,price.amount

The merge strategy is quite simple; it's a union of all the fields. Structurally we need the following transformation - [ { id, fields } ] to { ids, mergedFields }. The following implementation merges the inputs -

function mergeInputs(inputs) {
  const ids = [];
  const fields = new Set();
  for (const input of inputs) {
    ids.push(input.id);
    for (const field of input.fields) {
      fields.add(field);
    }
  }

  return {
    ids,
    mergedFields: [...fields].join(","),
  };
}

Putting it all together

Combining all the little things we handled so far, the flow for the product field resolution would be -

getFields: compute sub-fields by looking ahead in AST
getBackendFields: compute the list of backend fields from sub-fields and dependency map
productLoader.load({ id, fields: backendFields }): use the product loader to schedule in the dataloader to fetch a product.
mergeInputs: merge the different inputs to dataloader into a list of ids and union of all backendFields from all inputs.
Send the batched input as a request to the backend and get the partial response
getSchemaResponse: compute schema fields from partial backend response, sub-fields computed in the first step, and the transformer map

const DataLoader = require("dataloader");

const productLoader = new DataLoader(getBackendProducts);

const resolvers = {
  Query: {
    async product(_, { id }, __, info) {
      const fields = getFields(info);
      const backendFields = getBackendFields(fields, dependencyMap);
      const backendResponse = await productLoader.load({
        id,
        fields: backendFields,
      });
      const schemaResponse = getSchemaResponse(
        backendResponse,
        transformerMap,
        fields
      );
      return schemaResponse;
    },
  },
};

const dependencyMap = {
  name: ["title"],
  price: ["price.currency", "price.amount"],
  stock: ["stock_availability"],
};

const transformerMap = {
  name: (resp) => resp.title,
  price: (resp) => `${resp.price.currency} ${resp.price.amount}`,
  stock: (resp) => resp.stock_availability,
};

function getFields(info) {
  // TODO: handle all field nodes in other fragments
  return info.fieldNodes[0].selectionSet.selections.map(
    (selection) =>
      // TODO: handle fragments
      selection.name.value
  );
}

function getBackendFields(schemaFields, dependencyMap) {
  // Set helps in deduping
  const backendFields = new Set(
    schemaFields
      .map((field) => dependencyMap[field])
      .reduce((acc, field) => [...acc, ...field], [])
  );
  return backendFields;
}

async function getBackendProducts(inputs) {
  const { ids, mergedFields } = mergeInputs(inputs);
  const response = await fetch(
    `/products?ids=${ids.join(",")}&fields=${mergedFields}`
  );
  return response.json();
}

function mergeInputs(inputs) {
  const ids = [];
  const fields = new Set();
  for (const input of inputs) {
    ids.push(input.id);
    for (const field of input.fields) {
      fields.add(field);
    }
  }

  return {
    ids,
    mergedFields: [...fields].join(","),
  };
}

function getSchemaResponse(backendResponse, transformerMap, schemaFields) {
  const schemaResponse = {};
  for (const field of schemaFields) {
    schemaResponse[field] = transformerMap[field](backendResponse);
  }
  return schemaResponse;
}
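
One detail the listing glosses over: DataLoader expects the batch function to return one result per input, in the same order as the inputs, and an Error instance at a position rejects only that load call. The sketch below shows a possible alignment step. It assumes that every product in the batched response carries its id even when fields filters the rest of the payload, which may not hold for every backend; the name alignWithInputs is hypothetical.

```javascript
// Reorders a batched backend response so that it matches the loader inputs.
// Assumption: each product includes its `id`.
function alignWithInputs(inputs, products) {
  const byId = new Map(products.map((product) => [product.id, product]));
  return inputs.map(
    (input) =>
      // An Error instance rejects only the corresponding load() call.
      byId.get(input.id) ?? new Error(`No product for id ${input.id}`)
  );
}

const inputs = [{ id: "foo" }, { id: "bar" }, { id: "baz" }];
const products = [
  { id: "bar", title: "Bar" },
  { id: "foo", title: "Foo" },
];

const aligned = alignWithInputs(inputs, products);
console.log(aligned.map((item) => (item instanceof Error ? "error" : item.title)));
// [ 'Foo', 'Bar', 'error' ]
```

In getBackendProducts, such a step would sit between parsing the response and returning it.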

Conclusion

All of the code, patterns, and nuances we have seen until now may differ for different applications or different languages. The critical aspect is to leverage the declarative nature of GraphQL and optimize for better user experience at all points throughout the lifecycle of a request.

Field filtering using Dependency Maps and Transformer Maps enables us to handle complexities in optimizing GraphQL servers for performance. Though this looks like a lot of work, at runtime, this outperforms the otherwise unoptimized handling of huge responses from the backend - JSON parsing cost + transfer of bytes + construction time of the response by the backend.

You also have to consider the trade-off of whether such optimizations work for every backend. As the GraphQL schema grows, these solutions scale well. At Zalando's scale, it has proved to be better than transferring a giant unoptimized blob of data.

Flexbox Layout Behavior in Jetpack Compose

2021-03-16T00:00:00+01:00

Introduction

The CSS Flexible Box Layout specification (AKA flexbox) is a useful abstraction for describing layouts in a platform-agnostic way. For this reason, it is widely used on the web and even on mobile. Readers familiar with ConstraintLayout can think of flexbox as conceptually similar to the Flow virtual layout it supports. This type of layout is ideal for grids or other groups of views with varying sizes.

In the Zalando Fashion Store apps, we use flexbox to define the layout of our backend-driven screens, which I spoke about previously. Thus far, we have rendered these screens with Litho on Android and Texture on iOS, both of which use the flexbox-based Yoga layout engine. They support things that are essential when building fully dynamic UI at runtime, such as async layout, efficient diffing of changes, and view flattening.

As Google prepares Jetpack Compose (now in beta) for production release, we have started evaluating it as a successor to Litho. Compose offers numerous layout composables, many with bits of flexbox-like behavior. However, there is no Flexbox composable that does it all and no blog post explaining how flexbox concepts map to Compose, so I wrote this one. I also built this sample app, parts of which I will reference in code examples below.

Before we continue, yes, I know technically it's called Compose UI and not simply Compose, but as Jake said, most of us are already thinking of it this way. Insert a "UI" where necessary while reading if you'd like.

Flex

Let's start with the flex attributes, which describe the direction, size, and horizontal/vertical alignment of a layout's children.

Flex Direction

Flex direction specifies whether items are arranged vertically or horizontally. Compose has Row and Column composables that work for simple horizontal and vertical layouts.

@Composable
fun RowExample() {
    Row(
        modifier = Modifier.fillMaxWidth()
            .padding(bottom = 16.dp)
            .background(color = MaterialTheme.colors.primaryVariant),
    ) {
        Child()
        Child()
        Child()
    }
}

If flex wrap behavior is needed to control how items wrap across multiple rows, the FlowRow and FlowColumn composables will do this. However, these were deprecated before I even finished writing this article, so the best we can do is use the old implementation as a reference for our own.

@Deprecated("Uses the deprecated FlowRow composable; kept for reference only")
@Composable
fun FlowRowExample() {
    FlowRow(
        mainAxisSpacing = 8.dp,
        crossAxisSpacing = 8.dp
    ) {
        repeat(20) {
            Child(width = 48.dp, height = 24.dp)
        }
    }
}

The above code results in the following UI:

Flex Grow & Shrink

Flex grow controls how children will expand to fill available space in their parent layout. Flex shrink is its opposite, controlling how children will shrink relative to siblings if their parent layout does not have room for all of them.

Use the weight() modifier for flex grow behavior. Compose does not really have a flex shrink analog, but with its variety of layout composables, this can be overcome with a different approach in most cases. Depending on your specific needs, one approach could be to use Modifier.preferredWidth(IntrinsicSize.Min) to specify that a composable should not take up any more space than its children require. You can read more about it here in this question reposted from the kotlinlang Slack in Mr. Mark Murphy's excellent jetc.dev newsletter.

@Composable
fun FlexGrowExample() {
    Row(
        modifier = Modifier.fillMaxWidth()
            .padding(bottom = 16.dp)
            .background(color = MaterialTheme.colors.primaryVariant),
    ) {
        FlexChild(modifier = Modifier.weight(1F))
        FlexChild(modifier = Modifier.weight(2F))
        FlexChild(modifier = Modifier.weight(1F))
    }
}

The above code results in the following UI:

When the utmost flexibility is needed, there's always implementing your own Layout composable or the raw power of the ConstraintLayout composable, which can be used directly from Compose. If you don't mind reading Java instead of Kotlin, the implementation in Google's flexbox-layout library is a good starting point for understanding the algorithm.

Alignment

Alignment controls how items are arranged on their vertical and horizontal axes. This can be done on a parent layout with the *-content properties or on the children themselves using the *-self properties.

Main Axis

Main axis alignment refers to how children are aligned on the main axis of their parent: horizontal for rows and vertical for columns. In the flexbox spec, this is known as justify-content. In Compose, main axis alignment is controlled by the horizontalArrangement parameter passed to Row and the verticalArrangement parameter passed to Column. Both accept options such as start/end, center, and space around/between/evenly.

@Composable
fun ArrangementExample() {
    Row(
        modifier = Modifier.fillMaxWidth()
            .padding(bottom = 16.dp)
            .background(color = MaterialTheme.colors.primaryVariant),
        horizontalArrangement = Arrangement.SpaceBetween,
    ) {
        Child()
        Child()
        Child()
    }
}

The above code results in the following UI:

Cross Axis

Cross axis alignment refers to how children are aligned on the non-main axis of their parent: vertical for rows and horizontal for columns. In the flexbox spec, align-items and align-content control layout children while align-self allows children to do so themselves. In Compose, cross axis alignment is controlled by the verticalAlignment parameter passed to Row, the horizontalAlignment parameter passed to Column, and the align modifier on their child composables. All three accept start (top), end (bottom), and center values.

@Composable
fun AlignmentExample() {
    Row(
        modifier = Modifier.fillMaxWidth()
            .height(150.dp)
            .padding(bottom = 16.dp)
            .background(color = MaterialTheme.colors.primaryVariant),
        verticalAlignment = Alignment.CenterVertically,
    ) {
        Child()
        Child()
        Child()
    }
}

The above code results in the following UI:

You may have noticed that the space around/between/evenly options from justify-content are not listed for the cross axis. This is because there is no cross axis space around/between alignment in Compose. However, the resulting layout could be achieved via other composable combinations.

Flexbox also specifies a stretch option for cross axis alignment. In Compose, the stretch equivalent would be individual children using the fillMaxSize()/fillMaxWidth()/fillMaxHeight() modifiers.

Layout

Finally, let's look at a few other attributes that affect a view's size and position.

Aspect Ratio

Compose's aspectRatio() modifier works exactly as you'd expect. It takes a float representing the desired ratio and uses that value to determine the size in the unspecified layout direction (width or height).

For example, specifying fillMaxWidth() and aspectRatio(16F / 9F) results in a rectangle that fills the width of the screen with a height corresponding to 9/16 of that width.

@Composable
fun AspectRatioExample() {
    Box(
        modifier = Modifier.padding(bottom = 16.dp)
            .background(color = MaterialTheme.colors.secondary)
            .fillMaxWidth()
            .aspectRatio(16F / 9F)
            .border(width = 2.dp, color = MaterialTheme.colors.secondaryVariant)
    )
}

The above code results in the following UI:

Padding & Margins

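Compose has a padding() modifier, but none for margins. A margin can be treated as extra padding: because modifier order matters, a padding() applied before background() leaves its space outside the background, as the padding(bottom = 16.dp) calls in the examples above do.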

Absolute Position

When absolute positioning is needed to place one composable on top of another, the Box composable can be used. Box children can use the align() modifier to specify where they are aligned within the box including top start/center/end, bottom start/center/end, and center start/end.

@Composable
fun AbsolutePositionExample() {
    Box {
        Box(
            modifier = Modifier.fillMaxWidth()
                .height(240.dp)
                .background(color = MaterialTheme.colors.primaryVariant)
        )
        Child(modifier = Modifier.align(Alignment.TopStart))
        Child(modifier = Modifier.align(Alignment.TopEnd))
        Child(modifier = Modifier.align(Alignment.BottomStart))
        Child(modifier = Modifier.align(Alignment.BottomEnd))
        Child(modifier = Modifier.align(Alignment.Center))
    }
}

The above code results in the following UI:

Conclusion

In this article, we have seen how much of the layout behavior defined in the flexbox spec has a direct analog in Compose and a few places where we have to do a bit more work to approximate certain concepts. Please see the sample app repo for the code as well as my first attempt at working with the Compose Navigation library.

During our recent Hack Week, we had a chance to spend more time with Compose. We were impressed with how easy it was to get started and managed to build a fairly performant Compose powered implementation of our home screen. For a beta, it's quite promising!

Thanks for reading!

Micro Frontends: from Fragments to Renderers (Part 1)

2021-03-11T00:00:00+01:00

In 2015, we wanted to improve how we delivered features to customers and move away from a monolithic shop system. Project Mosaic and its microservices approach for the frontend were vital to support this transition. Mosaic enabled a relatively large number of teams to work on the main Zalando website independently and without performance compromises. At its core, Mosaic architecture relies on page Fragments, which are owned by different teams.

Mosaic helped us deliver features quickly and experiment at scale, contributing to Zalando’s growth, but we identified limitations to the Fragments approach. The main pain points for Zalando at that time were:

Differences in tech stacks, bundling, and deployment practices across fragments led to inconsistent user experience and cross-team collaboration difficulties
A high entry barrier for teams contributing to the customer experience. To be able to add new features to the website, engineers had to
- build and operate their fragments (usually frontend and backend services)
- discover and integrate with all the data sources
- re-implement or adapt the UI
- re-implement or adjust tracking & A/B testing

In 2018, we started designing Interface Framework (IF) to overcome these issues. The new transition’s key goal was to build a platform that unified the tech stack and centralized the deployment and operation process for various parts of the Zalando website. It would enable a fully personalized customer experience, and guarantee overall UX consistency based on a new design language.

Now, we'd like to give you an update on our approach to frontend development in the form of a blog series. The first part covers the key features of the new framework and provides an overview of its architecture.

Why Interface Framework

Consistent Entity Data

We identified a reasonably small number of content pieces in use by Zalando that can be visualized or used for personalization purposes, for example a Product, a Collection, or an Outfit. When organized in tree-like structures, they can be used to define layouts and content of the Zalando core user journey pages. When used individually, they can be the common language used across microservices to exchange data.

We call them Entities. Each Entity has a type and a unique id.

Dynamic View & Content Composition

Interface Framework supports dynamic composition of the user interface. It composes a page by forming a tree of nested Entities and transforming it into a tree of matching Renderers. The mapping of Entities to Renderers is specified in a declarative set of layout rules, which we call rendering rules. A Renderer is responsible for visualizing data related to an Entity.

Let's assume we are presenting a product page with some slots below the article to show additional content. Our personalization service chooses to provide three pieces of content: a collection, an outfit, and another collection. It determines what content the customers see on the page.

The Rendering Engine then decides to visualize the collection as a carousel, outfit as a card component, and the third collection as another carousel. It is responsible for how the content gets rendered to the customers.
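The walkthrough above can be sketched in a few lines of JavaScript. This is only an illustration: the rule format, the renderer names (ProductView, Carousel, OutfitCard), and the resolveRenderers helper are assumptions made for this example, not the actual Interface Framework API.

```javascript
// Hypothetical rendering rules: Entity type -> Renderer name.
const renderingRules = {
  product: "ProductView",
  collection: "Carousel",
  outfit: "OutfitCard",
};

// Transform a tree of Entities into a tree of matching Renderers.
function resolveRenderers(entity, rules) {
  const renderer = rules[entity.type];
  if (!renderer) {
    throw new Error(`No rendering rule for entity type "${entity.type}"`);
  }
  return {
    renderer,
    entityId: entity.id,
    children: (entity.children ?? []).map((child) => resolveRenderers(child, rules)),
  };
}

// The content chosen by the personalization service: what the customer sees.
const productPage = {
  type: "product",
  id: "p-1",
  children: [
    { type: "collection", id: "c-1" },
    { type: "outfit", id: "o-1" },
    { type: "collection", id: "c-2" },
  ],
};

// The rules decide how each Entity is rendered.
const rendererTree = resolveRenderers(productPage, renderingRules);
```

Keeping the rules as plain data separates what is shown (the Entity tree) from how it is shown (the Renderer tree), which is the split the two paragraphs above describe.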

Integrated Monitoring

Interface Framework automatically connects all views to the internal monitoring tools, ensuring that only the unified, user consent compliant, and thoroughly tested implementation is used. It helps to prevent incidents and disruptions in business reporting and personalization.

Orchestrated A/B Testing

A/B tests can now run in an orchestrated way to compare the results and make informed choices. This ensures features are tested with a representative user base, using standardized A/B testing scenarios and KPIs to ease comparison between features. Defining and setting up global A/B tests also means reducing the overhead of doing it for every page.

The integration of Zalando’s A/B testing platform in IF allows us to:

Implement experiments with only a few lines of code, and get the implementation automatically validated
Track experiments automatically without additional efforts to analyze the results
Continue managing experiments via the intuitive A/B testing platform UI
Keep experiment latency overhead low by batching all requests to the A/B testing platform for all Renderers
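As a rough sketch of the batching idea in the last point, the experiments declared by all Renderers on a page could be collected into one deduplicated request. The Renderer shape, the experiment names, and the buildExperimentBatch helper are hypothetical and only illustrate the principle.

```javascript
// Each Renderer declares the experiments it participates in (hypothetical shape).
const renderers = [
  { name: "Carousel", experiments: ["carousel-autoplay", "new-price-label"] },
  { name: "OutfitCard", experiments: ["new-price-label"] },
  { name: "ProductView", experiments: [] },
];

// Build one deduplicated batch instead of issuing one request per Renderer.
function buildExperimentBatch(allRenderers) {
  const names = new Set();
  for (const renderer of allRenderers) {
    for (const experiment of renderer.experiments) {
      names.add(experiment);
    }
  }
  // Sorted so the payload is stable across renders.
  return { experiments: [...names].sort() };
}

const batch = buildExperimentBatch(renderers);
```

With this shape, the page pays for a single round trip to the A/B testing platform regardless of how many Renderers it contains.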

Integrated Testing for Developers

As Interface Framework provides a single integration point where all code is developed and deployed, we give developers access to deployment previews, which allow any open pull request to be previewed in an environment very close to production. This setup is different from the traditional staging approach. The preview deployment is connected to production endpoints and follows 100% production routing while ensuring that only authenticated developers can access it.

Consistent UX Design

For all pages running on Interface Framework, the look & feel, accessibility features, and actual components used are defined by a design system. Our server-side rendering framework, which we call Rendering Engine, takes over the complexity of component version management and optimizes client code bundle size.

Page Performance Quality Gates

We evaluated best practices from CI/CD pipelines for Fragments from various teams and combined them to measure the performance of pages served by Interface Framework. We support the following tools:

Lighthouse CI: a tool to automatically run performance and accessibility tests for specific pages. It validates assertions with results and decides whether the current score is good enough for production.
Bundle Size Limits: we have a tool to automatically compute and check bundle sizes for Renderers on every pull request. It shows the results for all Renderers that have changed with the current version being released.
Client Metrics: we provide a built-in layer to report Web Vitals and custom metrics to capture all Zalando pages’ user experience.

Increased Organizational Speed and Efficiency

We are still organized around feature teams which have frontend engineers embedded. The main difference is that now they are working in a monolithic repository providing a unified and automated environment that offers new joiners a quick onboarding. The teams develop features and UI elements within Renderers. These Renderers are associated with Entities that make up our new page semantic.

There is quite a cultural shift as some ownership lines are now blurred in Renderers, with multiple teams contributing to most of them. As a result, we now have a much more collaborative development environment where teams benefit from each other's best practices. A centralized repository also means it is easier to ship large project changes and contribute to other teams' code, supported by a set of contribution guidelines.

We now have an aligned set of modern frontend technologies (React, TypeScript, GraphQL), a centralized server infrastructure, a release process, and a robust set of monitoring capabilities with dashboards and alerts. We are more efficient in terms of operations, and new reliability patterns immediately impact the whole website.

Architecture Overview

The following chart gives an overview of the underlying architecture. It contains all the core components of Interface Framework.

The GraphQL API is a data aggregation layer. It is to become the primary data source for all web pages at Zalando and reduce data integration costs across many teams. It provides a unified way for accessing content as an output of personalization services like the Recommendation System.

The Rendering Engine is a backend service and client-side runtime running in Node.js and the browser. Its primary purpose is to resolve and render a tree of Entities for a given request. The Recommendation System controls the structure of this tree.

A Renderer is a self-contained, reusable piece of code that runs inside the Rendering Engine. It declaratively specifies all of its data dependencies via GraphQL and uses the Zalando Design System to represent a single Entity visually.

The mapping of Entities to Renderers is one-to-many since different visual representations are possible for an Entity. An outfit Entity, for example, can be represented as a main view or a card component within a collection. Each Renderer, on the other hand, corresponds to one specific Entity type.

We do support a hybrid approach with Interface Framework. The Rendering Engine can serve views in different configurations:

View is a Mosaic Template and only uses Fragments.
View contains both Renderers and Fragments.
View only consists of Renderers.

This support for both rendering modes was and is still very beneficial for teams migrating their page from Mosaic to IF. Currently, we serve around 90% of traffic via Interface Framework.

Future Posts

In upcoming posts, we will dive deeper into the framework’s core components and share what we have learned during the transition from Mosaic to Interface Framework.

Part 2: Deep Dive into Rendering Engine

Update 2023/07: See Rendering Engine Tales: Road to Concurrent React for an update on Rendering Engine and how we integrated React Concurrent features as part of our upgrade to React 18.

How we use GraphQL at Europe's largest fashion e-commerce company

2021-03-04T00:00:00+01:00

Background

Today's large-scale organizations leveraging microservice architecture face a plethora of problems at the data aggregation and presentation layers. Managing consistent and backwards-compatible APIs for Web and Mobile App frontends is definitely one of the more complex ones. Balancing frontend developers' need for a consistent data source against product managers' need to deliver new features quickly is a tough nut to crack in a fast-paced, large organization. It is very common for frontend developers to struggle to find the right backend service to deliver a given feature.

The Backend-for-frontend (BFF) concept is a pattern pioneered by SoundCloud wherein a backend application is created for every business and frontend specific use case. With our adoption of microservices at Zalando in 2015, we used this pattern to create a large number of BFFs for the Web Product details page, Web wishlist page, Mobile app wishlist view, Mobile app home view and so on. The BFF is very similar to Netflix’s approach of Embracing the Differences, which pointed out four key characteristics for APIs serving frontend applications:

Embrace the differences of the devices
Separate content gathering from content formatting/delivery
Redefine the border between “Client” and “Server”
Distribute innovation

While these two approaches addressed most of these concerns of frontend development, they also introduced other issues for a large organisation like Zalando:

Lack of optimal balance between fast feature delivery and developer experience
Duplication of efforts due to the large number of Backend-for-Frontend microservices
Inconsistent experience for Zalando customers across platforms
Fragmented handling of Security and Authentication concerns
Fragmented Observability implementations

Of the above problems, inconsistent experience for Zalando customers across platforms is the subtlest. It becomes evident when the same business logic and aggregation are implemented in different ways in multiple backends, leading to broken customer experiences. This is a classic example of Conway's law: the system mirrors the organization's structure and, in doing so, can ignore the point of view of users, who experience different behavior across frontend applications of the same organization.

The diagram below shows the inconsistency issue that is not uncommon across different user interfaces for the same application if served via multiple backends. In the mobile app the delivery date range for an article on Zalando is 5-9 Feb whereas in the desktop version it’s 1-3 Feb. Even though this particular example is hypothetical, we have seen such inconsistent data bugs at Zalando in the past due to the different BFFs having fragmented logic across different services.

All of the above problems become exponentially harder at large scale. We observed this at Zalando too and used our Unified Backend-For-Frontend graph of Entities approach to address most of these concerns.

Our setup

GraphQL is a query language developed by Facebook to enable declarative data fetching. The users of the API declaratively specify the shape of the data requirement via the query and response structure they expect.

For example, in order to fetch the name of the example product mentioned above you can query it as:
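A minimal sketch of such a query, sent from JavaScript. The field name product, its id argument, and the id value are assumptions for illustration and do not reflect Zalando's actual schema.

```javascript
// Hypothetical query: asks only for the `name` of one product.
const query = `
  query ProductName($id: ID!) {
    product(id: $id) {
      name
    }
  }
`;

// A GraphQL request over HTTP is typically a JSON body with `query` and `variables`.
const requestBody = JSON.stringify({
  query,
  variables: { id: "example-product-id" },
});

// The response mirrors the shape of the query,
// for example: { "data": { "product": { "name": "..." } } }
```

The client names exactly the fields it needs, and nothing else is returned.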

From the GraphQL specification design principles, GraphQL was created with business requirements and hierarchical views in modern applications in mind:

Hierarchical: The GraphQL specification structures the language hierarchically, which makes it well suited to the hierarchical views of modern frontend applications

Product-centric: The evolution of a GraphQL schema is directly influenced by the product/business features being developed by frontend engineers

These are the two main principles we have kept in mind at Zalando while building a single GraphQL API as a Unified Backend-For-Frontends (UBFFs) for all Web and mobile App frontend feature teams. We use a monorepo which has a shared ownership across 12+ domain teams using a set of contribution principles. This is similar to the one unified graph concept highlighted in Principled GraphQL.

We use an Entity system where entities are the first-class citizens in the graph with our custom implementation of GraphQL specification (graphql-jit) for performance optimization. The Entities themselves represent content and domain models spread across the Zalando shop e.g. Product, Campaign (elaborating the Entity model will be its own post in the series). The overall application data flow looks like this.

We started with the GraphQL solution at Zalando in the first half of 2018 and have had the service in production since the end of 2018. The unified GraphQL schema has grown significantly in the last 2 years to a dense graph now with more than 12 domains and serves more than 80% of Web and 50% of the App use cases (as of February 2021).

Advantages

With our implementation of GraphQL running in production for the last 2 years at Zalando, we addressed most of the aforementioned concerns and observed multiple advantages including:

Improved efficiency for developers to find and access data in one place as opposed to finding and integrating with the individual APIs.
Improved developer experience via GraphQL tools such as explorer with live assortment data.
Faster deployments leading to shipping features faster, leading to happy product managers.
Consistent customer experience across platforms with a single consistent data source for frontends.
Reduced duplication of effort to develop the same feature across platforms.
Easy to enforce governance and organisational best practices.
The GraphQL layer has a "No Business Logic" principle, which allows domain specific backend APIs to steer domain or platform (Web vs. App) specific content on their own.

Known concerns and challenges

Code reuse leading to bloated code base

Our approach with GraphQL has been to avoid any platform or domain specific logic in the GraphQL layer and instead let the domain specific teams drive this via presentation layer backend services. This allows us to keep a business logic agnostic data-aggregation layer which serves frontend developers and also helps in operational maintenance.

Adoption and learning curve

Given GraphQL was a new technology for our teams, it involved investment in terms of learning curve and adoption. We addressed the adoption using some common mechanisms:

One-stop-shop Documentation: We use a single structured documentation with embedded GraphQL editor, schema documentation, Voyager for schema exploration, practice exercises to allow our new users to adopt GraphQL.
Support chat: Just like any platform team, we also provide a support channel for any queries from users and contributors of the GraphQL service.
Trainings: Given that GraphQL is new at Zalando, we conducted GraphQL adoption training with 150+ developers participating to learn about using GraphQL at Zalando. The training had a broad impact on a large population of developers intending to switch to GraphQL.
Consultation: GraphQL schema design is always a tricky topic even for frontend developers who can use GraphQL. In order to ensure a single, dense, unified graph, our team also provided consultation for all new domains being integrated into the Unified graph.

These four measures have resulted in increasing the number of contributors to our monorepo from 50 to 150+ in 2020 and developers using GraphQL for feature development from 70 to 200.

God Component

A God component is a design smell in which a component is excessively large, either in terms of LOC or number of classes. Our unified GraphQL service lives in a monorepo, which makes it a potential architectural and operational risk. We address the architectural risk through a shared ownership mechanism at Zalando, guided by a set of contribution principles. For the operational risk, we observe and address most issues with reliability patterns such as circuit breakers, timeouts, and retries. We also introduced the Bulkhead pattern to provide more fault tolerance and isolation by deploying the application separately per platform (separate deployments for Web and mobile Apps).
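To make one of these reliability patterns concrete, here is a minimal circuit breaker sketch. It is not the implementation used in the GraphQL service: the class, its option names, and the thresholds are assumptions, and it wraps synchronous calls only to keep the example short.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures the
// circuit opens and calls are rejected until `resetAfterMs` has elapsed.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 1000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock, useful for testing
    this.failures = 0;
    this.openedAt = null;
  }

  get state() {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetAfterMs ? "half-open" : "open";
  }

  call(fn) {
    if (this.state === "open") {
      throw new Error("circuit open: call rejected without reaching the backend");
    }
    try {
      const result = fn();
      // Any success (including a half-open trial call) closes the circuit.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open, or re-open after a failed trial call
      }
      throw err;
    }
  }
}
```

Rejecting calls while the circuit is open keeps a struggling backend from receiving even more traffic, which complements timeouts and retries rather than replacing them.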

Related work on Unified GraphQL

Unified Graph is a known concept which is being adopted by a lot of large organisations. Below is a list of some of the large organisations using unified GraphQL in production:

GitHub has a GraphQL implementation with a single graph of all the domains including repos, users, marketplace etc. in it.
Shopify has a single GraphQL implementation for each of its StoreFront (customer facing) and Admin (merchant facing) APIs, allowing customers and partners to build experiences using the unified graph for each of those.
Airbnb has been working on creating a Unified Schema for its GraphQL solution, which they shared during their GraphQL Summit 2019 talk.
Expedia moved from a REST specific service to a Central data graph using GraphQL to solve their problems of using REST endpoints where developers were spending more time to figure out which service to call than to develop features.
Apollo Federation is Apollo's solution for providing single data Graph over multiple Graphs across an organization. The difference between the Unified Graph we have at Zalando and Apollo's federation is that instead of having multiple Graphs connected via a library and gateway we have a single service at Zalando which connects all the domains in a single schema Graph. This has tradeoffs which we have addressed as mentioned here, since we gain by keeping a single Graph in terms of tooling, deployment and governance.
Netflix also has its own version of one-graph that they use in the Netflix Studio ecosystem and elaborated the setup in this blog post series.

Conclusion and next steps

The Unified Backend-For-Frontend (UBFF) GraphQL is not a silver bullet, but is a tradeoff which has worked well for our frontend data fetching problems at Zalando. In the next few articles in this series we will cover other aspects of our usage of GraphQL at Zalando in context of Observability, Performance Optimization, Security, Tooling, Errors etc. which allowed us to scale the adoption of the service to 200+ Web and App developers and serve the use cases of more than 25-30 feature teams.

Building an End to End load test automation system on top of Kubernetes

2021-03-02T00:00:00+01:00

Introduction

At Zalando we continuously invent new ways for customers to interact with fashion. In order to provide an excellent customer experience, we must ensure our systems can technically handle high traffic events such as Cyber Week or other sales campaigns. We have published a detailed article on how Zalando prepares for Cyber Week. Checkout and payments related systems are particularly important during sales events. As we continuously evolve our systems and add new features to optimize the customer experience, it is cumbersome and expensive to manually test our systems' capability to handle high traffic.

Our department is responsible for the payment processing systems of Zalando, which must maintain high availability and reliability. We set out to build an automated end-to-end load testing system capable of simulating real user behaviour across the whole system composed of microservices in order to achieve high stability in our systems. This testing system automatically steers generated traffic based on a dynamically adjusted orders per minute configuration. In order to really push our services to the edge, we wanted to run the load testing system in our test cluster, as this enables us to break things when necessary without causing customer impact. These tests can then be conveniently managed and triggered by our team and serve as the first quality gate of the Payment system. As part of the Cyber Week preparation, we formed a dedicated project team tasked with making our vision come to life.

To summarize, we wanted to build a load testing tool with the following features:

Automatic load test execution based on a schedule.
Simple API through which developers can manually trigger a load test.
Load test tool to be run in our test environment, which scales our Kubernetes services and Amazon ECS¹ (Elastic Container Service) environment up to our production configuration and then executes load tests.
Automated alarms if a load test causes SLO (Service Level Objective) breaches.
The generated load test traffic must imitate our customer's checkout flow.

The diagram below illustrates how the testing system (NodePool A) and our Payment platform (NodePool B and ECS) are deployed:

Traffic generation

Our first step was to select a load testing framework. We considered multiple options such as Locust, Vegeta and JMeter. This was filtered down to Locust and Vegeta due to JMeter not being popular internally. We chose Locust as it was more popular within our development teams, thus the test suite would be easier to maintain. We have also blogged before on how we leveraged Locust in prior preparations for sales events.

Locust works in both standalone and distributed mode. In distributed mode, it operates a controller with multiple workers. In order to generate higher loads, a distributed setup is required to overcome resource limitations. We created Locust scripts covering multiple business processes mimicking real world traffic patterns to our services. These scripts were then packaged as a Docker container and deployed as a distributed Locust system.

Mock External Dependencies

When we defined the scope of the load tests we all agreed we would only focus on testing internal service components and did not want to involve external dependencies for routine tests. Therefore we decided to mock these dependencies.

The table below compares a variety of tools that can be used to implement mocks.

	Mobtest	Wiremock	Mockserver	Mokoon	Hoverfly
Language	Javascript	Java	Java	Javascript	Golang
Github star/fork	1289/173	3453/934	2280/616	1402/63	1468/131
Config (API, route, ...)	Json config	Json	Js config	Js config	Json
Latency simulation	Fixed	Fixed / Random	Fixed	Fixed	Fixed / Random
Fault simulation	Yes	Yes	Yes	Yes	Yes
Stateful behaviour	No	State machine	No	No	key-value map
Easy to extend	No	Yes	Yes	No	Yes
Proxying	Yes	Yes	Yes	Yes	Yes
Response templating	Yes	Yes	No	Yes	Yes
Request matching	Yes	Yes	Yes	No	Yes
Record & Replay	No	No	Yes	No	Yes

After evaluating multiple options, we settled on Hoverfly as the mocking solution. Hoverfly provides the ability to easily set up mocks with static or dynamic responses. Mocks were created and deployed for multiple external dependencies. Furthermore, we wanted to run the load tests against services that could at the same time be used for other tests. This meant that a service needed to switch dynamically between the real dependency and its mock. For this, we leveraged header-based routing using Skipper, so that a service can decide whether to use a mock or the actual dependency by examining whether the request belongs to a load test.
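The routing decision described above can be sketched in a few lines. This is only an illustration in Python, not the Skipper configuration we run: the header name `X-Load-Test`, both base URLs and the function names are hypothetical.

```python
# Hypothetical names: the header and both base URLs are assumptions for illustration.
LOAD_TEST_HEADER = "X-Load-Test"
REAL_BACKEND = "https://payment-provider.example.org"
MOCK_BACKEND = "http://hoverfly-mock.example.local:8500"


def is_load_test(headers: dict) -> bool:
    """Return True when the request carries the load test marker header."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(LOAD_TEST_HEADER.lower(), "").strip().lower() == "true"


def select_backend(headers: dict) -> str:
    """Send load test traffic to the mock and everything else to the real dependency."""
    return MOCK_BACKEND if is_load_test(headers) else REAL_BACKEND
```

Keeping the decision per request is what allows the same deployed service to serve load tests and other tests at the same time.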

Hoverfly example mocking a service with PATCH endpoint:

{
    "data": {
        "pairs": [
            {
                "request": {
                    "path": [{
                        "matcher": "exact",
                        "value": "/test"
                    }],
                    "method": [{
                        "matcher": "exact",
                        "value": "PATCH"
                    }]
                },
                "response": {
                    "status": 204,
                    "body": "",
                    "encodedBody": false,
                    "headers": {
                        "Date": [
                            "{{ currentDateTime 'Mon, 02 Jan 2006 15:04:05 GMT' }}"
                        ],
                        "Load-Test": [
                            "true"
                        ]
                    },
                    "templated": true
                }
            }
        ],
        "globalActions": {
            "delays": []
        }
    },
    "meta": {
        "schemaVersion": "v5",
        "hoverflyVersion": "v1.1.2",
        "timeExported": "2020-01-07T13:21:02+02:00"
    }
}

To start Hoverfly using this configuration, one can simply run:

hoverfly -webserver -import simulation.json
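With several external dependencies to mock, writing each request/response pair by hand gets repetitive. The sketch below generates a simulation in the same shape as the JSON above. The helper names (`exact`, `make_pair`, `make_simulation`) are hypothetical, and the `meta` section is reduced to the schema version; whether Hoverfly accepts a file without the other `meta` fields is an assumption we have not verified here.

```python
import json


def exact(value):
    """Build an exact-match matcher list, as used for path and method above."""
    return [{"matcher": "exact", "value": value}]


def make_pair(path, method, status, body=""):
    """Build one request/response pair for a mocked endpoint."""
    return {
        "request": {"path": exact(path), "method": exact(method)},
        "response": {
            "status": status,
            "body": body,
            "encodedBody": False,
            "headers": {"Load-Test": ["true"]},
            "templated": False,
        },
    }


def make_simulation(pairs):
    """Wrap pairs in the top-level structure shown in the example above."""
    return {
        "data": {"pairs": pairs, "globalActions": {"delays": []}},
        "meta": {"schemaVersion": "v5"},  # assumption: other meta fields omitted
    }


if __name__ == "__main__":
    simulation = make_simulation([
        make_pair("/test", "PATCH", 204),
        make_pair("/health", "GET", 200, "ok"),
    ])
    print(json.dumps(simulation, indent=4))
```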

Load Test Conductor

In order to meet our goal of running automated load tests in the test cluster, we needed to design a system that could manage the full lifecycle of a load test and ensure that the cluster and deployed applications match our production configuration. Applications in the load test environment are therefore updated to match the resource allocation, number of instances and application version of the production environment.

Load test lifecycle

We defined the lifecycle of one load test as follows:

Deploy all applications in the test environment to be the same version as production.
Scale up the applications in the test environment to meet the resource configuration of the production environment.
Generate load test traffic that replicates real customer behaviour.
Scale down applications in the test environment after the test as a cost saving measure.
Clean up databases and remove unnecessary test data.

For this purpose, we built a microservice in Golang called the load-test-conductor that executes and manages these load test phases and transitions. Our service design was heavily influenced by what Kubernetes popularized for infrastructure management. We wanted our system to be declarative. Therefore, the service provides a simple API that engineers can use to run load tests by defining the desired state of a load test. Executing a load test is now just one API call away!
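The lifecycle above can be pictured as a small sequence of phases with guaranteed teardown. The following is a sketch in Python rather than the Golang service itself. The `Phase` enum, `run_load_test` and the `actions` mapping are hypothetical names, and running the clean-up phase after a failure is our assumption (the scale-down after a failed test is described in the Deployment and Scaling section).

```python
from enum import Enum


class Phase(Enum):
    DEPLOY = "deploy production versions"
    SCALE_UP = "scale up to production configuration"
    GENERATE_LOAD = "generate load test traffic"
    SCALE_DOWN = "scale down"
    CLEAN_UP = "clean up test data"


def run_load_test(actions):
    """Run the phases in order; always scale down and clean up, even on failure.

    `actions` maps each Phase to a zero-argument callable (hypothetical interface).
    Returns the list of phases that were started and whether the test succeeded.
    """
    executed = []
    succeeded = True
    try:
        for phase in (Phase.DEPLOY, Phase.SCALE_UP, Phase.GENERATE_LOAD):
            executed.append(phase)
            actions[phase]()
    except Exception:
        # Any failure aborts the remaining forward phases.
        succeeded = False
    finally:
        # Teardown runs regardless of the outcome to avoid paying for idle capacity.
        for phase in (Phase.SCALE_DOWN, Phase.CLEAN_UP):
            executed.append(phase)
            actions[phase]()
    return executed, succeeded
```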

On the diagram below, you can find the system components of the Load Test Conductor:

Deployment and Scaling

To ensure that the exact version of the service running in production is deployed and services are pre-scaled, we automated deployment and scaling of the application within the Load Test Conductor. We use our Continuous Delivery Platform (CDP) to find the version deployed in production using the Kubernetes client and trigger a new deployment of this exact version in our staging environment. Applications which need to be included in a load test can be provided as an environment-specific configuration. The Deployer component will trigger a deployment and wait till all the deployments are completed. Afterwards, the Scaler component triggers scaling based on the target configuration. Our load test conductor currently supports scaling resources in Kubernetes and AWS ECS environments. It also handles scaling down to the previous state once the load test is completed or failed.

Load generation

We chose to run Locust in distributed mode to mimic customer traffic. Each Locust worker executes our test scripts and interacts with our microservices in order to simulate the customer journey through our systems. We wanted to be able to test different load scenarios, so we decided to implement an algorithm in the load-test-conductor that controls Locust through the API it provides. The Locust API provides the functionality to change the number of simulated users and the rate at which they are spawned (the hatch rate). We designed an algorithm that ramps up Locust users based on a business KPI (orders placed per minute). Users of the test system can define a ramp-up time, a plateau time and the target orders per minute that the test should reach. Our algorithm then hatches Locust users based on the configured parameters and dynamically recalculates the hatch rate and user count needed to reach the defined orders per minute target.

Load generation pseudo code

set initial number of users to 1
set calculation interval to 60 seconds
while load test duration has not been exceeded
    get locust status
    calculate orders per defined calculation interval
    calculate orders per minute
    set number of orders to the number of orders reported by locust

    if user count in locust status is equal to zero
        print "load test is being initialized."
        set loadtest hatch rate to one
        set loadtest user count to initial number of users
        set loadtest orders per minute to 0
        set loadtest number of orders to 0
    else if orders per minute is equal to zero
        print "load test stalled due to no orders getting generated."
        set loadtest hatch rate to one
        set loadtest user count to one
    else
        calculate total users needed to achieve target orders per minute rate using
        current locust user count and orders per minute rate
        calculate users that need to be created
        calculate time left for the load test
        calculate iterations left for the load test
        calculate users to spawn in this iteration
        calculate hatch rate
        set loadtest hatch rate to calculated hatch rate
        set loadtest user count to calculated users
    update locust with load test parameters, this triggers load generation
    sleep for calculation interval time
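To make the arithmetic in the last branch of the pseudocode more concrete, here is one possible reading of a single iteration, written in Python. It is a sketch and not the conductor's actual Golang code: the names `next_update` and `LocustUpdate`, the rounding choices and the `min_hatch_rate` floor are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class LocustUpdate:
    user_count: int
    hatch_rate: float  # users spawned per second


def next_update(current_users, orders_per_minute, target_orders_per_minute,
                seconds_left, interval_seconds=60, initial_users=1,
                min_hatch_rate=1.0):
    """Compute the user count and hatch rate for the next calculation interval."""
    if current_users == 0:
        # Load test is still being initialized.
        return LocustUpdate(initial_users, 1.0)
    if orders_per_minute == 0:
        # Stalled: no orders are being generated, so restart from one user.
        return LocustUpdate(1, 1.0)

    # Observed throughput of one simulated user.
    orders_per_user = orders_per_minute / current_users
    # Users needed to hit the target at that throughput.
    users_needed = math.ceil(target_orders_per_minute / orders_per_user)
    users_to_create = users_needed - current_users  # negative when overshooting
    # Spread the remaining change evenly over the remaining iterations.
    iterations_left = max(1, math.ceil(seconds_left / interval_seconds))
    users_this_iteration = math.ceil(users_to_create / iterations_left)
    new_user_count = max(1, current_users + users_this_iteration)
    hatch_rate = max(min_hatch_rate, abs(users_this_iteration) / interval_seconds)
    return LocustUpdate(new_user_count, hatch_rate)
```

For example, with 10 users producing 100 orders per minute, a target of 600 orders per minute and 5 iterations left, 50 more users are needed, so this iteration raises the user count by 10. Recomputing every interval lets the ramp correct itself when the orders-per-user ratio drifts under load.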

Test Execution & Test Evaluation

To trigger the load test, we used a Kubernetes CronJob that calls the API of the load test conductor. For our Payment system, load tests take about 2 hours to complete.

To monitor the system during test execution, we leverage Grafana dashboards that provide insights into the most important metrics, for example latency, throughput and response code rates. Through manual inspection of the graphs, we also evaluate whether a load test was successful. Additionally, we use alerts that trigger when a service does not meet its SLO during a test.

Test results have to be manually evaluated to decide if the outcome is successful or not, which is sufficient for us for the time being.

Conclusions

Overall, the solution fulfilled the goal of a successful preparation and scaling of our applications. However, running load tests on the test cluster posed several challenges. Sometimes, new deployments were rolled out during tests, which caused the service to point to pods with minimal resources instead of the scaled-up ones. Several infrastructure components, such as the cluster node type, databases and centrally managed event queues (Nakadi), had to be adjusted for similarity with the production environment. This required a lot of communication effort and alignment with the teams managing these services.

We made the deployment of the production versions of the applications an optional feature, so that developers can test their feature branch code. The load test tool has become our standard way to verify for every developed change that the applications can handle peak production traffic.

Giving developers the possibility to run load tests by a simple API call encourages and enables them to thoroughly load test applications.

Since these load tests are conducted in a non-production environment, we could stress the services till they fail. In combination with load tests in production, this was essential for preparing our production services for higher load.

¹ ECS is only used by a small set of isolated services; all other services run on Kubernetes.

Integration tests with Testcontainers

2021-02-25T00:00:00+01:00

In this article, I will show how teams at Zalando Marketing Services are using integration tests in Java-based backend applications. We will look at the idea behind integration tests: the main concept and the attributes of a good integration test. Then, we will discuss an example based on the Testcontainers library used in the Spring environment.

Integration tests

There are many definitions of integration testing. For example, the definition found on Wikipedia is: Integration testing is the phase in software testing in which individual software modules are combined and tested as a group.

For this article, we define integration tests as tests of communication between our code and external components, e.g. database, one of the AWS services (like S3, Kinesis, DynamoDB, SQS, and others) or an external system with which we are communicating over HTTP.

The purpose of integration tests is to assess how our code behaves when communicating with external services, not only in happy path scenarios but especially in corner cases, e.g. when an external service responds with an unexpected HTTP code, an HTTP response arrives after a defined timeout, or AWS S3 responds with internal errors.

Amount of integration tests

While implementing tests, we need to remember to maintain the proper balance between different test types. Integration tests cannot be the core of the testing codebase.

The testing pyramid shows us the proportions of different types of tests. For backend applications, the foundations are unit tests and component tests. Integration tests complement unit tests and other test types like component, system, and manual tests.

System tests and manual tests should ideally be the rarest type of tests. From our experience, we estimate the number of integration tests to be around 25% of unit tests, but it varies from application to application.

Integration tests with Testcontainers library

Let's see how to organize an integration test with the Testcontainers library, and how to manage the startup and teardown of Docker containers. Testcontainers is a JVM library that allows users to run and manage Docker images and control them from Java code. Zalando uses it mainly for integration tests. To implement an integration test, you run your application similarly to a unit test (in a method annotated with @Test).

The integration test additionally runs external components as real Docker containers. External components can be one of:

database storage - for example, run real PostgreSQL as a Docker image,
mocked HTTP server - you can mimic the behavior of other HTTP services by using Docker images from MockServer or WireMock,
Redis - run real Redis as a Docker image,
streams or queues (like RabbitMQ and others),
AWS components like S3, Kinesis, DynamoDB, and others, which you can emulate with Localstack,
any other application that can be run as a Docker image.

It is very easy to run Docker images from Java code. Every Docker image can be run with GenericContainer. For the most popular Docker images, there are prepared wrapper classes for convenient usage.

To make sure that every Docker image will be stopped after usage and resources are released, the library uses JVM ShutdownHooks and a special Docker image Ryuk. ShutdownHooks stops images when tests are finished. In case the Java process is no longer available, the Ryuk container stops all Docker images. It is worth mentioning that it is possible to disable Ryuk containers.

Maven configuration

To use Testcontainers, add a maven dependency with a current library version.

<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers</artifactId>
  <version>${testcontainers.version}</version>
  <scope>test</scope>
</dependency>

It's important to have control over test execution. Unit tests should be executed before integration tests. This follows from the testing pyramid and helps to ensure that feedback loops are short. In some cases, you may want to skip integration tests, for example when your local machine is slow and you want to run them only on CI/CD.

To run the integration tests after your unit tests, simply add maven-failsafe-plugin to your project. The Failsafe and Surefire plugins work in different build phases. By default, the Maven Surefire plugin executes unit tests during the test phase. It includes all classes whose names start with Test or end with Test, Tests or TestCase. The Failsafe plugin runs integration tests in the integration-test phase. To separate execution, we configure the Failsafe plugin to run classes with the postfix IntegrationTest. We also create a special profile, here with-integration-tests, to control whether we want to run integration tests or not.

<profiles>
 <profile>
   <id>with-integration-tests</id>
   <build>
     <pluginManagement>
       <plugins>
         <plugin>
           <groupId>org.apache.maven.plugins</groupId>
           <artifactId>maven-failsafe-plugin</artifactId>
           <executions>
             <execution>
               <goals>
                 <goal>integration-test</goal>
                 <goal>verify</goal>
               </goals>
             </execution>
           </executions>
           <configuration>
             <includes>
               <include>**/*IntegrationTest.java</include>
             </includes>
           </configuration>
         </plugin>
       </plugins>
     </pluginManagement>
   </build>
 </profile>
</profiles>

The Maven invocation would look like:

mvn clean verify -P with-integration-tests
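One detail worth checking in your own build: a class name that ends with IntegrationTest also ends with Test, so it matches Surefire's default includes as well as the Failsafe include configured above. The sketch below models only the class-name part of the patterns (not Maven itself) to show which plugin would pick up which class. The function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Class-name part of Surefire's default includes.
SUREFIRE_DEFAULT_INCLUDES = ("Test*", "*Test", "*Tests", "*TestCase")
# Class-name part of the Failsafe include configured in the profile above.
FAILSAFE_CONFIGURED_INCLUDES = ("*IntegrationTest",)


def matches(class_name, patterns):
    """Return True if the class name matches at least one include pattern."""
    return any(fnmatchcase(class_name, pattern) for pattern in patterns)


def plugins_running(class_name):
    """List the plugins whose include patterns match the given test class name."""
    plugins = []
    if matches(class_name, SUREFIRE_DEFAULT_INCLUDES):
        plugins.append("surefire")
    if matches(class_name, FAILSAFE_CONFIGURED_INCLUDES):
        plugins.append("failsafe")
    return plugins
```

If a class shows up under both plugins, one option is to exclude the IntegrationTest pattern in the Surefire configuration so that the slow tests only run in the integration-test phase.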

Basic integration test with Testcontainers

Let’s set up a basic integration test with JUnit 5 and Spring Boot.

An integration test class can look like the example below. The test class inherits from AbstractIntegrationTest. The test method creates an entity in a database that runs as a Docker container. We then read the entity back from the database and verify that it has been written correctly.

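import static org.assertj.core.api.Assertions.assertThat;

import java.util.Optional;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;

class AccountRepositoryIntegrationTest extends AbstractIntegrationTest {

    // Repository under test, backed by the PostgreSQL container
    @Autowired
    private AccountRepository dao;

    @Test
    void shouldCreateAccount() {
        // given
        Account account = createAccount();

        // when
        dao.save(account);

        // then
        Optional<Account> actualOptional = dao.findById(account.getId());
        Account expected = createAccount();
        assertThat(actualOptional).isPresent();
        assertThat(actualOptional.get()).isEqualTo(expected);
    }
}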

The class below is an abstract class that is inherited by all integration tests. It contains static references to Docker containers (the singleton container pattern). In the static block, we start all containers. We do not need to stop them; this is done automatically. In the example below, the PostgreSQLContainer is going to listen on a random port. To facilitate adding properties with dynamic values, we used the @DynamicPropertySource annotation that was introduced in Spring Framework 5.2.5 (it has a more compact syntax than ApplicationContextInitializer).

@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
public abstract class AbstractIntegrationTest {

    public static PostgreSQLContainer<?> postgreSQL =
      new PostgreSQLContainer<>("postgres:13.1")
            .withUsername("testUsername")
            .withPassword("testPassword")
            .withDatabaseName("testDatabase");

    static {
        postgreSQL.start();
    }

    @DynamicPropertySource
    static void postgresqlProperties(DynamicPropertyRegistry registry) {
        registry.add("db_url", postgreSQL::getJdbcUrl);
        registry.add("db_username", postgreSQL::getUsername);
        registry.add("db_password", postgreSQL::getPassword);
    }
}

@Testcontainers annotation

There are also other ways of running your containers. You can use the set of annotations provided by the junit-jupiter Maven module:

<dependency>
 <groupId>org.testcontainers</groupId>
 <artifactId>junit-jupiter</artifactId>
 <version>${testcontainers.version}</version>
 <scope>test</scope>
</dependency>

A test class annotated with the @Testcontainers annotation runs all containers annotated with the @Container annotation. When a container field is static, the container is shared between test methods. You can control the startup order of containers by using the dependsOn method of GenericContainer. The main limitation is that containers cannot be reused between test classes. Moreover, this extension has only been tested with sequential test execution. Using it with parallel test execution is unsupported and may have unintended side effects. The test class would look like the example below.

@Testcontainers
@SpringBootTest(webEnvironment = WebEnvironment.RANDOM_PORT)
@ActiveProfiles("test")
public class ApplicationIntegrationTest {

    @Container
    public static PostgreSQLContainer<?> postgreSQL =
      new PostgreSQLContainer<>("postgres:13.1")
            .withUsername("testUsername")
            .withPassword("testPassword")
            .withDatabaseName("testDatabase");

    @DynamicPropertySource
    static void postgresqlProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.datasource.url", postgreSQL::getJdbcUrl);
        registry.add("spring.datasource.password", postgreSQL::getPassword);
        registry.add("spring.datasource.username", postgreSQL::getUsername);
    }

    @Test
    public void contextLoads() {
    }

}

Lifecycle of integration test

All tests (including integration tests) should follow principles defined as FIRST. The acronym FIRST was defined in the book Clean Code written by Robert C. Martin.

[F]ast - A test should not take more than a second to finish the execution.
[I]solated - No order-of-run dependency.
[R]epeatable - A test method should NOT depend on any data in the environment/instance in which it is running.
[S]elf-Validating - No manual inspection required to check whether the test has passed or failed.
[T]horough - Should cover every use case scenario and NOT just aim for 100% coverage.

Running a Docker image for every test method can take an enormous amount of time. To increase performance we need to make a real-life compromise. We can run a Docker image per class or even run once for all integration test executions. The second approach has been presented in the code. If we decide to share Docker images between tests, we need to be ready for it. There are many ways to achieve it, e.g.:

Tests should operate on unique IDs, names, etc. That way, we can avoid collisions of database constraints. In this case, you don’t need to clean up after the test execution. Some problems can occur, for example when you count elements in the database table. You can count elements created by different tests.
Tests should clean up the state after execution. This approach consumes much more development time and is error-prone.
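The first approach can be illustrated with a small, self-contained sketch. It is written in Python for brevity rather than in the Java test code used in this article, and everything in it is hypothetical: an in-memory dictionary stands in for a table in the shared database container, and the helper names are made up. It shows both the benefit (no collisions on unique constraints) and the counting pitfall mentioned above, which is avoided by scoping the count to data the test created itself.

```python
import uuid

# Stand-in for a table in the database shared by all tests (hypothetical, in-memory).
shared_accounts = {}


def unique_name(prefix):
    """Generate a name that will not collide with data written by other tests."""
    return f"{prefix}-{uuid.uuid4().hex}"


def create_account(name, owner):
    """Insert an account, enforcing a unique constraint on the name."""
    if name in shared_accounts:
        raise ValueError(f"unique constraint violated: {name}")
    shared_accounts[name] = {"name": name, "owner": owner}
    return shared_accounts[name]


def count_accounts_owned_by(owner):
    """Count only the rows belonging to one owner instead of the whole table."""
    return sum(1 for account in shared_accounts.values() if account["owner"] == owner)
```

A test that asserts on `len(shared_accounts)` breaks as soon as another test writes to the same table, whereas a test that creates its own unique owner and asserts on `count_accounts_owned_by(owner)` stays stable.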

If we would like to run tests concurrently, it would require even more discipline from developers.

Advantages of using the Testcontainers library

You run tests against real components, for example, the PostgreSQL database instead of the H2 database, which doesn’t support the Postgres-specific functionality (e.g. partitioning or JSON operations).
You can mock AWS services with Localstack or Docker images provided by AWS. It will simplify administrative actions, cut costs and make your build offline.
You can run your tests offline - no Internet connection is needed. It is an advantage for people who are traveling or if you have a slow Internet connection (when you have already run them once and there is no version change in the container).
You can test corner cases in HTTP communication like:
- programmatically simulate timeout from external services (e.g. by configuring MockServer to respond with a delay that is bigger than the timeout set in your HTTP client),
- simulate HTTP codes that are not explicitly supported by our application.
Implementation and tests can be written by the same backend developers and submitted in the same pull request.
Even one integration test can verify if your application context starts properly and your database migration scripts (e.g. Flyway) are executing correctly.

Disadvantages of using the Testcontainers library

You bring another dependency into your system that you need to maintain.
You need to run containers at least once, which consumes time and resources. For example, PostgreSQL as a Docker image needs around 4 seconds to start on my machine, whereas the H2 in-memory database needs only 0.4 seconds. From my experience, Localstack, which emulates AWS components, can take much longer to start, as long as 20 seconds on my machine.
A continuous integration machine (e.g. Jenkins) needs to be bigger (the build uses more RAM and CPU).
Your local computer should be pretty powerful. If you run many Docker images, it can consume a lot of resources.
Sometimes, integration tests with Testcontainers are still not sufficient. For example, while testing REST responses with a MockServer container, you can miss changes in the real API. If the integration test does not reflect them, your code can still crash in production. To minimize the risk, you may consider leveraging Contract Testing via Spring Cloud Contract.

Code example

You can find examples of usages in my GitHub project.

A Machine Learning Pipeline with Real-Time Inference

2021-02-16T00:00:00+01:00

Customers love the freedom to try the clothes first and pay later. We’d love to offer everyone the convenience of deferred payment. However, fraudsters exploit this to acquire goods they never pay for. The better we know the probability of an order defaulting, the better we can steer the risk and offer the convenience of deferred payment to more customers.

That’s where our Machine Learning models come into play.

We have been tackling this problem for a while now. Everything started with a simple Python and scikit-learn setup. In 2015 we decided to migrate to Scala and Spark in order to scale better. You can read about this transition on our engineering blog. Last year we started to explore the potential value of tooling provided by Zalando's Machine Learning Platform (ML Platform) team as part of our strategy investment.

Pain Points with the existing solution

Our current solution serves us well. However, it has a few pain points, namely:

It’s highly coupled to Scala and Spark which makes using state of the art libraries (mostly Python) difficult.
It contains custom tailored code for functionalities which nowadays can be replaced by managed services. This adds an additional layer of complexity, making it difficult to maintain and to onboard new team members.
It is a bit problematic in production: it uses a lot of memory, suffers from latency spikes, and starts new instances rather slowly, which affects scalability.
It has a monolithic design, meaning that feature preprocessing and model training are highly coupled. There is no pipeline with clear steps and everything runs on the same cluster during training.

Requirements for the New System

We started the project by writing down requirements for the new solution. The requirements fulfilled by our current system still stand:

API: the new system needs to conform to the existing API. We receive a JSON request with order data, and return a response in JSON format.
Latency: the deployed service must respond to requests quickly. 99.9% of responses must be returned under a threshold in the order of milliseconds.
Load: the busiest model must be able to handle hundreds of requests per second (RPS) on a regular basis. During sales events, the request rate for a model may be an order of magnitude higher.
Support for multiple models in production: several models, divided per assortment type, market, etc., must be available in the production service at any given time.
Unified feature implementation: our model features require preprocessing (extraction from the request JSON) both in production and in our training data (which comes in the same JSON format). The preprocessing applied to incoming requests in production must be identical to that applied to the training data. We want to avoid implementing this logic twice for both cases.
Performance metrics: we must be able to compare the performance between the new and the old version of a model (using the same data) to improve our tagging capabilities.
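The "unified feature implementation" requirement boils down to having one function that both the training path and the serving path call. A minimal Python sketch of that idea follows. The JSON field names (`items`, `price`, `quantity`, `payment_method`) and the three features are invented for illustration and are not our real features.

```python
def extract_features(order: dict) -> list:
    """Turn one order JSON document into a fixed-order feature vector.

    All field names here are hypothetical examples.
    """
    items = order.get("items", [])
    total = sum(item.get("price", 0.0) * item.get("quantity", 1) for item in items)
    return [
        float(len(items)),                                         # number of line items
        float(total),                                              # order value
        1.0 if order.get("payment_method") == "invoice" else 0.0,  # deferred payment flag
    ]


def build_training_matrix(orders):
    """Training path: apply the shared function to every historical order."""
    return [extract_features(order) for order in orders]


def preprocess_request(order):
    """Serving path: apply the very same function to a single incoming request."""
    return extract_features(order)
```

Because both paths delegate to `extract_features`, a change to a feature definition cannot silently diverge between training and production.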

To alleviate the current pains, we require our new system to meet the following criteria in addition to those above:

Independence from a specific model framework: our research team develops improved models with different frameworks, such as PyTorch, Tensorflow, XGBoost, etc.
Fast scale-up: the production system should adjust to growing traffic and accept requests in a matter of minutes.
Clear pipeline: the pipeline should have clear steps; in particular, the separation between data preprocessing and model training should be easy to understand.
Use existing services: ML tooling has made quite a leap in recent years, and where possible we should take advantage of what’s available instead of building custom solutions.

Architecture of the New System

The system is a machine learning workflow built primarily from services provided by AWS. At Zalando, we use a tool provided by Zalando’s ML Platform team called zflow. It is essentially a Python library built on top of AWS Step Functions, AWS Lambdas, Amazon SageMaker, and Databricks Spark, that allows users to easily orchestrate and schedule ML workflows.

With this approach we steer away from implementing the whole system from scratch, hopefully making it easier to understand, which was one of the pain points (#2) of our prior system.

In this new system, a single workflow orchestrates the following tasks:

Training data preprocessing, using a Databricks cluster and a scikit-learn batch transform job on SageMaker
Training a model using a SageMaker training job
Generating predictions with another batch transform job
Generating a report to demonstrate model’s performance, done with a Databricks job
Deploying a SageMaker endpoint to serve the model

The platform solution allowed us to create a clean workflow with a lot of flexibility when it comes to technology selection for all the steps. We consider this a big improvement with regard to our pain point #4.

Using a SageMaker training job allows us to substitute the model training step with any model available as a SageMaker container. In rare cases, when the algorithm is not already provided, we still have the possibility to implement the container on our own. This gives us much more flexibility and deals with pain point #1 discussed before.

Model Evaluation

After the training is finished, a SageMaker model is generated. To evaluate the performance of the model candidate we perform inference on a dedicated test dataset. As we needed to check additional metrics to the ones provided out of the box by SageMaker, we added a custom Databricks job to calculate those metrics and to plot them in a PDF report (example below, where we see a model performing poorly).

Model Serving

At inference time, a SageMaker endpoint serves the model. Requests include a payload which requires preprocessing before it is delivered to the model. This can be accomplished using a so-called “inference pipeline model” in SageMaker.

The inference pipeline here consists of two Docker containers:

A scikit-learn container for processing the incoming requests, i.e. extracting features from the input JSON or basic data transformations
Main model container (e.g. XGBoost, PyTorch) for model predictions

The containers are lightweight and optimized for serving. They are able to scale-up sufficiently fast. This solved our pain point #3.

Performance Metrics

Latency and Success Rate

We then performed a series of load tests. During every load test the endpoint was hit continuously for 4 minutes. We varied:

The EC2 instance type
Number of instances
The request rate. Different rates were applied to different AWS instance types. For example, it does not make sense to use ml.t2.medium instances to serve a model at the highest request rate, as they are not meant for such a load.

We reported the following metrics:

Success: the percentage of all requests which returned an HTTP 200 OK status. 100% is optimal. Although there is no hard threshold here, the success rate should be high enough to serve endpoint requests.
99th: the 99th percentile of the response times of all requests, in milliseconds. To be usable, an endpoint must be able to respond to requests within the agreed sub-second threshold.
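Both metrics are simple to compute from the raw load test samples. The sketch below uses the nearest-rank definition of a percentile, which is one common choice and not necessarily the one our load testing tool applies. The function names are hypothetical.

```python
import math


def success_rate(status_codes):
    """Percentage of requests that returned HTTP 200 OK."""
    if not status_codes:
        raise ValueError("no requests recorded")
    ok = sum(1 for code in status_codes if code == 200)
    return 100.0 * ok / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples recorded")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With 100 latency samples of 1 ms to 100 ms, `percentile(samples, 99)` returns the 99th smallest sample, i.e. 99 ms. A single slow outlier therefore does not move the p99, but a slow 2% of requests does.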

Sample results, for m5.large instance type:

Some of our findings:

For a rate of 200 requests/s, a single ml.m5.large instance can handle the load with a p99 of under 80ms.
For a rate of 400 requests/s, the success rate is not near 100% until 4 or more ml.m5.large instances are used. The response times are under 50ms.
For the 1000 requests/s rate, 2 or more ml.m5.4xlarge or ml.m5.12xlarge instances can maintain the success rate with response times below 200ms.

Cost

Based on our estimates, the cost of serving our models will increase significantly after the migration. We anticipate an increase of up to 200%. The main reason is the cost efficiency of the legacy system, where all the models are served from one big instance (multiplied for scaling). In the new system, every model gets one or more separate instances.

Still, this is a cost increase that we are willing to accept for the following reasons:

Model flexibility. Having a separate instance per model means every model can use a different technology stack or framework for serving.
Isolation. Every model’s traffic is separated, meaning we can scale each model individually, and flooding one model with requests doesn’t affect other models.
Use of managed services instead of maintaining a custom solution.

Scale-up Time

We would like to be able to adjust our infrastructure to traffic as fast as possible. This is why we verified how much time it takes to scale the system up. Based on our experiments, adding an instance to a SageMaker endpoint with our current configuration takes 50% less time than scaling up our old system. However, we wish to explore options for reducing this time further.

Cross Team Collaboration

Development of this system was a collaborative effort of two different teams: Zalando Payments and Zalando Machine Learning Platform, with each contributing members to a dedicated virtual team. This inter-team collaborative workstyle is typical for the ML Platform team, which offers the services of data scientists and software engineers to accelerate onboarding to the platform. To define the scope of the collaboration, the two teams created a Statement of Work (or SoW) to specify what services and resources the ML Platform will provide, and for what length of time. The entire collaboration lasted 9 months.

The two teams collaborated in a traditional Kanban development style: we developed user stories, broke them into tasks, and completed each task. We met weekly for replanning and had daily standups to catch up.

Our collaboration was not without friction. Having developers from two different teams means overhead from two different teams. For example:

We had periods where the ML Platform team members had to deliver training programs for other parts of the company, and could not devote much time to this project. Similarly, members of the Payments team would occasionally need to attend to unrelated firefighting duties and miss a week of the collaborative project. Clearly communicating these external influences was very important, as the Payments team members were not aware of what was happening in the ML Platform team, and vice versa.
Sharing knowledge between the two teams was crucial, especially in the early stages of the project. While the Payments team members are experts in the underlying business domain, the ML Platform team members were not. Similarly, while the ML Platform team members are experienced with the tools used for the project, the Payments team members did not have this expertise.

Conclusion and Outlook

Our new system fulfills the requirements of the old system, while addressing its pain points:

Because we use Amazon SageMaker for the model actions (i.e. training, endpoints, etc.), the system is guaranteed to be independent from the modeling framework.
Each model served behind a SageMaker endpoint scales more quickly than in the old system, and we can easily increase the number of instances used for model training to speed up our pipeline execution.
Each stage of the pipeline has a clear purpose and thanks to SageMaker Inference Pipelines, the data processing and model inferencing can take place within a single endpoint.
Because we are using Zalando ML Platform tooling, our new system takes advantage of technology from AWS, in particular Amazon SageMaker.

We plan to use a similar architecture in other data science products.

The project was a successful test of a team collaboration across departments, and proved that such collaboration can bring great results.

Find out what challenges Customer Conversion solves at Zalando

2021-02-11T00:00:00+01:00

When our Hiring Sprint kicks off next month, we will be looking for great professionals to join some of our stellar teams – Shopping Cart, Checkout, Sales Orders and Returns. All meaningful segments of our Customer Conversion organization, these teams are responsible for forging and shaping some of the most relevant experiences in Zalando customer journey. Skilled in innovating and versed in perfection, our Customer Conversion organization might become your next career step if you ace our Hiring Sprint.

To give you a better idea of what awaits you here, I spoke with our Director Customer Conversion, Pascal Hahn, who talked me through the priorities of his teams and shared some advice for those who are keen to join ;)

Pascal, could you introduce the major functions and priorities of your teams?

Customer Conversion is the organization that enables our 35M customers to shop on Zalando. We are split in two departments: the Purchase department that delivers experiences from Shopping cart to Order confirmation, and the Post Purchase department that is responsible for processing orders, sorting out order details, order history as well as return experiences. Each department delivers experiences end-to-end, from ideation, product inception and development to operating and scaling them. Our mission is to let customers buy their beloved pieces easily and effortlessly by providing seamless, convenient and reliable experiences throughout. The work we do is a broad mix of designing and building new capabilities, experimenting, expanding and extending existing experiences or improving scalability and operational posture overall.

"Solving something that matters" - what does it mean for the team? What does it mean for you personally?

There’s no e-commerce without people shopping; and to work on the experiences that Zalando customers across all 17 markets use when they shop for their next favorite piece is a great mission. Being part of delivering excellent shopping experiences is what makes working at Zalando very special for me.

What do you appreciate the most about the challenges you face in your job?

To have a shot at solving problems that affect millions of users, together with some of the industry’s brightest minds is a privilege. When I started here about a year ago, I didn't know much about the inner workings of retail, and ever since I haven’t had a single day at Zalando without learning something new. Going forward, I still feel like there’s so much to learn.

Pascal, could you give some advice to people who'd like to work in the Customer Conversion organization?

If you’re excited about innovating at the intersection of the physical and the digital; if you take pride in building and operating systems that “just work”; if you enjoy using state-of-the-art tech at scale – this is the right place for you to work at. Whether you choose to work on product innovations with our product management team, or join us as an engineer or engineering leader that owns, delivers and operates our experiences, or as a data scientist who works on detecting transactional risks that affect our overall business – we offer a number of roles and challenges.

What do you think is the main achievement of the teams in Customer Conversion of the past few years?

The COVID pandemic has posed many challenges to our customers, team members, teams and business. When some markets introduced severe lockdowns, we had to react quickly building new features with very short timelines. Keeping the Zalando Store open and coping with the increased scale while delivering new features to our customers continually has been no easy feat. In addition, all the while we were working from home and had to cope with our own personal difficulties brought on by the virus and the imposed restrictions.

For more details on how to participate in our 1st Hiring Sprint follow this Link!

It's Never Too Late For a Career Change

2021-02-04T00:00:00+01:00

Is it ever too late to follow your dream and start a new career? Well, I was 30 and had been working for Zalando for more than 4 years when I decided to change my career path for the second time. I made the decision a year ago, joined my new team in April 2020, and I didn't regret it for a single day.

Since that transition, a lot of people approached me with questions and asked me for advice. I started to realize that my experience could be valuable to others out there. Some people may want to change their career too but are afraid of failure or do not have enough support from their friends or colleagues, or maybe haven’t even shared their thoughts with anyone yet.

This article contains answers to the questions I was frequently asked. I hope it might support you with the decision whether a career in software engineering is what you always wanted, provide you with arguments to convince people around you that switching careers is a great idea if you do it for the right reasons, or just help you go through a difficult time of uncertainty.

What did you do before you became an engineer?

I studied business mathematics and joined Zalando as a Business Analyst after completing my master's degree. At my first job, I was helping out one of the Product Managers (PM) in my department. One year later I was offered the opportunity to become a PM myself. By that time, product duties had already taken more than 50% of my working time, so it was an easy decision. I continued to work as PM for another 3 years.

How did you become interested in coding?

I was always working quite closely with engineers in my team. At some point, they realized that I enjoy thinking about technical stuff too, and started to involve me in their discussions. I still remembered a bit of coding that I did during my bachelor years, and I started spending some of my free time attending online courses and re-learning how to code.

How did you learn to code?

My interest was growing, but at the same time, I had to admit that I couldn't spend enough time coding outside my work. You should know that I'm a very social person, so almost every evening in my normal week is blocked for some kind of social activity. I love to travel, so the weekends didn't help either. I decided to give it a proper try: take a sabbatical and do a full-time course at Ironhack coding camp for 9 weeks. With the help of this course I built the foundation for my current programming skill set.

Why did you decide to switch to engineering?

After 9 weeks of coding every day¹, I still enjoyed it. So I said to myself, this is what I'd like to be paid for! It felt right to pursue something that is so much fun even while it's sometimes frustrating.

How did you know it was the right decision?

This was the key question for me. It was a life-changing decision, so I wanted to be fully aware of my motivations and confident that I really want it. My key takeaways were:

Make sure not to trade one trouble for another. It's absolutely crucial to know that you want to become an engineer rather than just escape your current job. To verify that it was not about my current product or team, I first switched to another department, still as a PM but working on a completely different topic. Only after spending half a year with the new project could I say with certainty that my wish was not about the circumstances but the engineering job itself.
Make sure you want to become an engineer for the right reasons. I made a list of pros and cons for both my current job and software development and then talked to engineers I knew to ensure it's not just how I imagine this job to be. If some aspects of your current role make you unhappy, make sure it's not going to be a major part of your future role. If you are happy with your job, but the main reason is that you think you could earn more money as an engineer – please, think twice. However, if you can see how becoming a software engineer would fit your interests, character, and life goals much better than your current job – go for it!

What do you like most about engineering?

My favorite topic! There are so many things! Here are just a few highlights:

Power of creativity: when you write code, you create something that wasn't there before. Sometimes it's really touchable, like a new button, sometimes it's a new behavior you introduce, sometimes a performance gain. Whatever it is, the act of creation makes you feel almost like a god ^^.
Joy of focus: I love that engineering goals are usually very tangible. I also love that, at least at the beginning of your engineering career, you can focus on one task at a time. In my previous roles, I would often end up juggling a lot of balls at the same time, which can be very exhausting. It’s an extremely satisfying experience to really complete something end to end, even if it’s just a little button that does exactly one thing.
Solving puzzles: you often have to solve what feels like real mysteries. When you investigate failures or look for root causes of a bug, you are the Sherlock Holmes in this story. If you are into this kind of puzzles, it's going to be amazing.
Constant learning: no matter how long you are in this job, there is always more to learn - new frameworks, programming languages, tools, principles, concepts, entire new areas of technology. This feeling is shared by every engineer I know, regardless of how many years of experience they have. Your brain is always working, and it's beautiful.

Weren't you afraid to start on a new path after 4 years of a professional career?

Of course I was! Every new start is terrifying. But if you know why you are doing it and you have the support of your colleagues, friends and family, it's less scary. Even if you don't have that, the engineering community is a lovely place – there are always people who will point you in the right direction when you ask for help. Also, what's the worst thing that could happen? If a year down the line I should realize that it's not the right thing for me, I can always return to my previous job with even more valuable experience in my mental backpack.

How did you feel about throwing away years of professional experience?

The answer is simple: I didn't throw them away. Whatever you were doing before, whatever you learned and practiced, stays with you and you can most certainly use it in your new role. In my case, it was easy to justify: I brought with me the knowledge about the software development lifecycle, soft skills and business acumen. If you worked in a different role before, you still learned useful things there: maybe you were part of a team, a problem solver or a great communicator, or maybe you are amazing at structuring things. Whatever it is, you are going to need it and it's going to help you.

How did your friends and family react?

I was a bit afraid to tell them. "I'm 30, and I finally figured out what I want to become when I grow up" sounded weird even in my own head. But almost everyone I shared my idea with was so supportive and excited once I explained my motivation, that soon I started to gain a lot of energy from telling people about my goal and sharing my plans.

Is it better to do the change inside your current company or join a new one?

Well, it really depends on your current situation. On the one hand, I would highly recommend taking the first steps in your current company because it makes things easier. You already know the company, you know some people, you are not a complete newbie. I’m not sure if Zalando is special that way, but I received unimaginable amounts of support from my leads, colleagues and the company itself. Zalando invests in its people, so I was financially supported from the very first milestone on this way. My wonderful company paid for my coding camp, and the only thing I had to do in return was to sign that I wouldn’t leave within the next year (which I didn’t intend to do anyway). Every next step would also have been way harder in a new environment. On the other hand, if you are not happy with your current employer, staying there only to make the transition easier is probably not the best idea. In short: if you like your company, make your transition there; if not, don't be afraid to leave.

What concrete steps can I take towards switching to engineering?

The way to engineering can be very different. Here is how I would go about it:

Try online programming courses to see if you like it. While doing that myself, I collected a list of resources that I found helpful. Feel free to check it out and add new ones using the comments.
If you are still not quite sure, take a vacation or a sabbatical and give it a full-time test-drive.
Write a list of things that you love about your current job and that you think you might love about being an engineer. Talk to someone about it and verify that you have the right motivation.
Talk to your manager about your goal. Together you can figure out what would be the right way: a slow transition with a part-time involvement, or a full switch at a time frame that is satisfactory for both of you.
Do it :)

Conclusion

I have met a lot of wonderful people who would like to change their careers and try something new. Many of them have always dreamed of becoming an engineer but were told not to. Actually, my own sister once said that I shouldn’t study Computer Science because I’m not smart enough for that, so I didn’t. It can be scary, you might feel like people are going to be judgmental about it, you might be afraid to lose your stability, and it’s all justified. My goal here is to let you know that you are not alone with your fear. The change is not as crazy as it might sound to you, and there are more people like you who have already successfully made the transition and can support you. Give it a try!

If you have any questions that I haven’t covered here, don't hesitate to reach out to me, and I'll gladly share everything I know.

I'd like to point out that this was a very special situation for a limited amount of time. In normal times and especially during quarantine I pay a lot of attention to my work-life-balance and strongly recommend everyone to do the same. ↩

Stop using constants. Feed randomized input to test cases.

2021-02-02T00:00:00+01:00

Introduction

Testing is a widely accepted practice in the software industry. I am an iOS Engineer and, like most of us, have been writing tests. The way I approach testing changed radically a few years back, and I have since used and shared this new technique within Zalando and outside. In this post, I will explain what is wrong with most test cases and how to apply randomized input to improve tests.

This is our sample code under test:

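import Foundation

struct DomainStore {
    private let internalStorage = UserDefaults.standard

    // Persist the value under the given key.
    func set(value: String, for key: String) {
        internalStorage.set(value, forKey: key)
    }

    // Return the stored String, or nil if nothing is stored for the key.
    func get(for key: String) -> String? {
        internalStorage.string(forKey: key)
    }
}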

The usual testing approach

func test_setValueCanBeRetrieved() {
    let storage = DomainStore()

    storage.set(value: "Zalando", for: "companyName")
    let obtained = storage.get(for: "companyName")!
    XCTAssertEqual("Zalando", obtained)
}

Imagine someone opens your code a few months down the road and modifies the code under test ever so slightly.

struct DomainStore {
    private let internalStorage = UserDefaults.standard

    func set(value: String, for key: String) {
        internalStorage.set(value, forKey: key)
    }

    func get(for key: String) -> String? {
        return "Zalando"        // Note
    }
}

This diligent test runs on your machine or on CI and it passes. Does it mean the production code works fine? Of course not. Most Test Driven Development (TDD) practitioners would move past this DomainStore, but should you? How can we reveal similar quality issues and address them?

Fundamentally, we are testing with a constant String while the production method suggests it can take any String.

Consider this function signature:

func set(value: String, for key: String)

It says the method can take any String instance, not just "Zalando". However, our previous test asserted on only one instance of the String type.

Better approach: Feed Randomized Input to test cases

The fundamental idea of this technique is never to feed test cases hand typed constants. What do we feed in then? Welcome randomness.

This is our fixed test case.

func test_setValueCanBeRetrieved() {
      let storage = DomainStore()

      let value = String.random  // Note
      let key = String.random

      storage.set(value: value, for: key)
      let obtained = storage.get(for: key)!
      XCTAssertEqual(value, obtained)
}

Note:

String.random produces a random instance of a String. At Zalando, we use this Randomizer library for generating random inputs. It covers most of the commonly used types in the Standard Library.
If Randomizer doesn’t fit your need, feel free to extend it or add your custom conformance to Random protocol requirement.

Now the tampered code above will not pass this test case. Until we run the test, we don’t know what values we are going to test with, and these values differ across runs. This effectively exercises our production code with many permutations of possible values. This is the essence of randomized input tests (sometimes referred to as permutation tests).
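To make the idea concrete, here is a minimal sketch of what a Random protocol and a String conformance could look like, and how random input exposes the hard-coded return value. This is not the Randomizer library's actual implementation: the protocol shape, the generator, and the TamperedStore stand-in are assumptions made purely for illustration.

```swift
// Hypothetical protocol; the real Randomizer library may define it differently.
protocol Random {
    static var random: Self { get }
}

extension String: Random {
    // Assumed generator: an alphanumeric string of 1 to 20 characters.
    static var random: String {
        let alphabet = Array("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
        let length = Int.random(in: 1...20)
        return String((0..<length).map { _ in alphabet.randomElement()! })
    }
}

// Stand-in for the tampered DomainStore: get(for:) ignores what was stored.
struct TamperedStore {
    func set(value: String, for key: String) {}
    func get(for key: String) -> String? { "Zalando" }
}

let value = String.random
let key = String.random
let store = TamperedStore()
store.set(value: value, for: key)

// With the constant "Zalando" this comparison would be true.
// With random input it is true only if the generator happens to produce "Zalando".
let roundTrips = store.get(for: key) == value
```

In an XCTest case, the final comparison would be the XCTAssertEqual from the test above, which fails for practically every generated value.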

Going beyond a simple case

Here’s one example test case from our module. The code below creates a random label component and sets random accessibility options on the model layer, then asserts that the rendered view has the correct accessibility information.

func test_whenAccessibilityProvided_andComponentHasTapAction_thenAccessibilityIsSet() {
        let props = LabelProps.random
        let accessibilityModel = APIAccessibility.random
        let component = LabelComponent(
          componentId: .random,
          flex: .random,
          actions: .random,
          props: props,
          accessibility: Accessibility(with: accessibilityModel, componentType: .label(props)),
          debugProps: DebugProps()
        )
        let node = MockNode()
        component.actions = [EventType.tap: [ComponentAction(.random, .log(.random))]]

        component.updateAccessibility(node)

        XCTAssertTrue(node.isAccessibilityElement)
        XCTAssertEqual(node.accessibilityLabel, accessibilityModel.label)
        XCTAssertEqual(node.accessibilityHint, accessibilityModel.hint)
        XCTAssertTrue(node.accessibilityTraits.contains(.staticText))
        XCTAssertTrue(node.accessibilityTraits.contains(.button))
}

Note:

User defined types (usually Structs) are composed of standard library types and predefined custom types. We can extend user defined types in our test target to conform to Random. An example conformance of LabelProps is as below:

struct LabelProps: Codable, Hashable {

    let text: String
    let backgroundColor: String?
    let font: FontProps

}

extension LabelProps: Random {
    public static var random: LabelProps {
        return LabelProps(text: .random, backgroundColor: .random, font: .random)
    }
}

We could do code generation in a build phase to synthesize the Random conformance. Although this is out of scope for this post, it’s how Equatable conformance works.
Thanks to Swift’s type inference, .random will use the exact type’s Random conformance.
For cases where we need to compare against an input value, we can store the generated model in a local property, like we did for accessibilityModel.
There are times when the function under test expects an Email, URL, Deeplink or PhoneNumber. These data types are often represented by String. However, String.random is not good enough in this case. There are 2 ways of tackling this. One is to extend String to have String.randomEmail. Another is to create a concrete type which conforms to Random.
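The two options from the last note could be sketched as follows. The Random protocol is redeclared here so the snippet stands on its own, and randomEmail, Email and the generated address shape are hypothetical names and choices for illustration, not part of the Randomizer library.

```swift
// Hypothetical protocol, redeclared so this sketch is self-contained.
protocol Random {
    static var random: Self { get }
}

extension String {
    // Option 1: a dedicated generator on String.
    // Assumed shape: lowercase local part, domain and a 2-3 letter top-level domain.
    static var randomEmail: String {
        let letters = Array("abcdefghijklmnopqrstuvwxyz")
        func part(_ range: ClosedRange<Int>) -> String {
            String((0..<Int.random(in: range)).map { _ in letters.randomElement()! })
        }
        return "\(part(1...10))@\(part(1...10)).\(part(2...3))"
    }
}

// Option 2: a concrete type that conforms to Random.
struct Email: Random, Equatable {
    let rawValue: String

    static var random: Email {
        Email(rawValue: .randomEmail)
    }
}
```

The concrete type has the advantage that a function taking an Email can be fed .random directly, while the String extension keeps existing String-based signatures unchanged.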

Conclusion

This technique was not my own discovery. I picked up the phrase “Don’t use constants on tests” from Jorge Ortiz during his workshop on Clean Architecture at Swift Aveiro, 2017. It then changed the way I write tests. I hope this technique will help you too.

The technique of permutation testing by using random input applies to all software testing; not just iOS development. The only requirement is Type.random.

Creating a uniform landscape for macOS Software

2021-01-21T00:00:00+01:00

At the time of this writing, we have a universe of Mac applications, identified and version-inventoried, within a fleet of a little over 3,000 Mac devices at Zalando. A subset of them, selected by importance, frequency of updates or size of the install base, is part of a so-called software lifecycle.

However, in July 2019, when a vulnerability was discovered in Zoom (long before becoming the mainstream video conference app during the COVID-19 pandemic), Information Security requested the immediate deployment of the latest patch to every device that had the app installed and a report of the progress of this task.

The report and the patch were not a challenge in themselves — this was already part of what we were doing with core applications such as Google Chrome, or Chat — but the process was nothing more than a set of manual and repetitive chores that could be streamlined.

So this defined a set of goals:

Procure patches and updates in a proactive way
Test them and then deploy to our users as soon as possible after their release
Keep detailed information about the patch levels of key applications
Automate, as much as possible, all these tasks

Our tools

JAMF Patch Management

The Mac Management Platform in use in Zalando, called JAMF Pro, provides Patch Management functionalities that are great at detecting the patch level of devices and deploying the appropriate versions; however, getting this functionality to work properly has the following requirements.

A source of patch definitions

The first thing the system needs is the so-called definition of the title¹ including dates, versions, OS requirements, etc. in a JSON format. JAMF (the company behind JAMF Pro) offers a web service with a basic set of titles, but of course, that doesn’t cover all our core applications. Fortunately, it’s also possible to configure additional sources of patch definitions, either local or from third parties.

Installation packages

Each vendor has different locations to provide their installers; additionally, for the management platform to be able to install applications (or its updates), they need to be uploaded to distribution points in a PKG format, which is not always what the vendor provides.

AutoPkg

An open source tool developed by the community of Mac admins around the world, called AutoPkg, provides a framework to automate many of the tasks surrounding patch management. The steps taken through the process are defined in plist-format files called recipes, which AutoPkg follows.

Recipes

The community of AutoPkg users has generated recipes that cover a broad range of applications and that are updated regularly; nevertheless, for security reasons, AutoPkg requires manual inspection of downloaded recipes or the creation of local copies, before allowing an automated execution. AutoPkg recipes have a parent-child relationship which brings modularity and also the chance of having different results depending on the child recipe that was executed.

Processors

Each step of a recipe is executed by a piece of Python code called a processor. AutoPkg includes dozens of these processors, each with a specific functionality, but can also run custom processors, coded by users, to provide functionality not covered by the standard ones.

Our solution

The combination of JAMF Patch Management and AutoPkg was the right one to accomplish our goals, but it did not meet our needs out of the box, so our work evolved into three different projects.

Cookbook

The name was obvious for the project aiming to standardize and manage our AutoPkg recipes.

For improved modularity of the process, each application that we have introduced into the software lifecycle has its own set of recipes:

Download from the vendor
Create a package
Sign the package²
Upload to the distribution points

In addition to the recipes, we created three custom processors to:

Announce in a Google Chat group the availability of a new version, packaged and uploaded to our system
Generate the JSON patch definition and upload it to our own definition server, for titles not covered by JAMF
Update information in our reporting tool, LineUp

Finally, for better organization of the workload, Cookbook is a git repository. We work locally, push our changes to the repository and then after merging, we pull on a server called Apple Packaging Station that runs AutoPkg on a regular schedule with help from a third party tool called AutoPkgR.

LineUp

When we first created a report about the deployment of the patch of Zoom, we pulled the information from our platform directly into a Google Spreadsheet and then used Google Data Studio to generate a chart.

This may seem okay for a one-shot requirement, but in reality this happens often throughout the year and becomes hard to maintain or scale. So then we opted for a custom database (hosted in Zalando’s shared Postgres cluster) queried with Grafana, which offers great visualization capabilities.

But then, with a proper database structure already holding the data, the next logical step was to add a custom visualization tool and provide it with its own API to update the information. This is when LineUp was born.

At the beginning, we were just looking for a simple mechanism to show information from the database without requiring a client application or the user to run SQL queries, and even the simplest web development frameworks, once connected to a database, have power to do much more than this. We selected Django as our framework and after developing these simple views, we decided to leverage its capabilities and come up with detailed views for each Mac application, creating a module to use JAMF’s API to get up-to-date information about them.

Then, while working on this, it was natural to expand the scope and include the inventory of applications running in the Windows and Ubuntu platforms and to do so, we developed a module to query Zalando’s asset management platform.

PackageChanger

After each scheduled execution of our AutoPkg recipes we end up with a set of packages uploaded to the distribution points, notifications about them in our Chat group, and the JAMF server aware of these new versions of applications. Now it’s time to test the updates and release them if they are working properly.

This became a new tedious process which is done in JAMF’s web UI. Each update implies going to a set of screens to associate the new version with a package, assign that version to a group of testers and later, release the version to the rest of the users as well as setting this version as the baseline installer for new devices.

To simplify these steps, we created PackageChanger, a command line tool that, through JAMF’s API, lets us work with packages and versions in a faster and simpler way than using a web UI.

To work with the API we selected Ruby-JSS — a Ruby library developed by the Mac admins at Pixar Animation Studios — which to this day is the most comprehensive and well documented library to interact with it.

Our next steps

The work done so far has significantly improved the way we make updates available, especially for key applications, and has given us real-time information during the first few hours after a software vulnerability is disclosed. Nevertheless, we are still missing some refinements to have a completely streamlined software lifecycle.

User interaction

Patch management from JAMF offers us two ways to deploy patches: automatic push or through the Self Service application notifying the user when updates are available. The latter would be optimal, but the notification mechanism does not work and leaves us with our user base unaware of patches. On the other hand, pushing updates has proven to be a source of discomfort for users, especially because updated applications need to be closed and reopened and it’s really difficult to find a convenient moment to do this.

As a response, we are working on an alternative notification mechanism, so we can continue to offer updates through Self Service, but making users aware of them with enough frequency and convenience so that they install them in a comfortable and timely manner.

Quality gate

Before generally releasing a patch we deploy it to a small subset of devices whose owners are considered testers. This allows us to know if the installer works and if the application runs as expected after the update.

These tests may be enough for simple applications — such as Google Chat — but fall short for specialized or complex ones — such as Tableau Desktop — where only a trained user would be able to tell if the new version is ready to be deployed to the user base.

The next improvement in this direction would be a quality gate, in which additional tests for releases are described and a bigger set of testers can go through them, decide if they are passed successfully, and then approve collectively the deployment of a patch.

Increased selection of titles

The initial set of applications covered by patch management was selected because of the obvious level of use they get within Zalando: Google Chrome, Chat, Backup and Sync, etc.

Afterwards, when LineUp provided us with information about the number of installations of each application, we had a roadmap of sorts to know which applications should be covered next. For example, we discovered that over one third of the Mac fleet has Docker installed on them, so we decided to start offering it in Self Service and provide patch management so that we can be sure our user base has easy access to this tool.

Here, the next step is part of a continuous improvement cycle, in which we will keep adding applications to the automated lifecycle.

Within patch management, the word title is used to refer to pieces of software that can be inventoried and have versioning, and range from internal tools to applications from the App Store. ↩
At the time of this writing macOS Catalina and macOS Big Sur allow the installation, through an MDM³, of unsigned packages. This may change with future releases of macOS and make it crucial to include an automated signing step, which we already have. ↩
MDM stands for Mobile Device Management, which consists of a platform and a set of tools for the administration of mobile devices such as smartphones, tablets and laptops. ↩

Experimentation Platform at Zalando: Part 1 - Evolution

2021-01-12T00:00:00+01:00

Online controlled experimentation, also known as A/B testing, has been the gold standard for evaluating improvements in software systems. By changing one factor at a time, an A/B test causally measures, from real users, whether one product variant is better than the other.

As an increasingly important area in tech companies, experimentation platforms face -- apart from their scientific challenges -- many unique engineering problems. In this blog series, we will share what we’ve learned at Zalando. During this journey, we have presented our work at well-known conferences including PyData 2018, Conference on Digital Experimentation 2018, and Causal Data Science Meeting 2020.

In this first post, we’ll introduce the evolution of the experimentation platform at Zalando. The technical challenges of the experimentation engine, analysis system, data quality, and data visualization, along with their solutions, will follow in upcoming posts.

The next sections are structured using the Experimentation Evolution Model in Fabijan et al., 2017.

Phase one: crawl (before 2016)

As natural as data-driven decisions sound today, they were not the focus in the early stages of Zalando. In the early days, A/B tests were set up by each team individually and manually, as were their analyses.

Soon we discovered that such a setup could not ensure A/B test quality, nor could we know whether product teams actually ran A/B tests before making decisions. There was very little A/B testing knowledge in most product teams at the time -- we realized the need for a centralized experimentation service. In order to take full control of the data infrastructure as well as the analysis features, we needed an in-house experimentation platform at Zalando instead of off-the-shelf A/B testing tools.

In 2015, the first version of Zalando's experimentation platform, Octopus, was released. It is named after Paul the Octopus, who correctly predicted the winners of matches at the 2010 FIFA World Cup with a small error rate. That’s the essence of an experimentation platform, except that our metrics are based on trustworthy statistics rather than Paul’s mood of the day.

In this period, our biggest challenge was a lack of cross-functional knowledge. The initial platform was built by a virtual team with members from various parts of Zalando. The platform had three parts: experiment management, experiment execution, and experiment analysis. In the early days, the team's focus was on execution because there were few service customers, and analyses could be performed manually in the worst case. This initial virtual team consisted of engineers and data scientists who had little knowledge of each other's domain at that time. For example, data scientists didn't have production software experience and didn't know Scala, whereas software engineers didn't know concepts of statistics. To decouple the development processes of one subgroup from another, we ended up building an open-source statistics library wrapped by the backend production system.

Phase two: walk (2016-2020)

Even though wrapping analysis scripts into a production software system is not a scalable solution, it worked for the load at that time. Through hard groundwork, we achieved a platform where teams can configure and manage their A/B tests in one place. Another major benefit of platformization is that randomization process and analysis methods are now standardized. Octopus uses a two-sided t-test with 5% significance level to analyze results.
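To make the standardized analysis concrete, here is a minimal sketch of a two-sided test at the 5% significance level. It is not the Octopus implementation: the function name and the sample data are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for the large samples typical of A/B tests.

```python
import math
from statistics import NormalDist, mean, variance

ALPHA = 0.05  # 5% significance level, as stated in the text


def two_sided_test(control, treatment, alpha=ALPHA):
    """Compare two samples and return (statistic, p_value, significant).

    The statistic is the Welch-style t statistic (unequal variances allowed).
    Assumption: samples are large, so the standard normal distribution is
    used in place of the t distribution when computing the p-value.
    """
    n_c, n_t = len(control), len(treatment)
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    statistic = (mean(treatment) - mean(control)) / std_err
    # Two-sided: probability mass in both tails beyond |statistic|.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value, p_value < alpha


# Hypothetical conversion data: 10% vs. 15% conversion on 1,000 users each.
control = [0] * 900 + [1] * 100
treatment = [0] * 850 + [1] * 150
statistic, p_value, significant = two_sided_test(control, treatment)
```

Because the test is two-sided, a variant that performs significantly worse than control is flagged just like one that performs better.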

During these years, we have boosted the number of running A/B tests at Zalando.

There was a decrease in the number of A/B tests in early 2020. This decrease could have been due to teams focusing on large-scale coordinated product initiatives, which were not A/B testable during this period. Another possible cause is that we suggested pausing A/B tests due to abnormal user behaviour at the beginning of COVID-19 in Europe.

On the other hand, we also faced a few big challenges. The keywords of improvements in this period are scalability and trustworthiness:

Establishing experimentation culture. Many teams started to make product decisions through A/B testing. However, it’s a big company and the experimentation culture didn’t reach every corner. We started to look at use cases from various departments and integrated them into Octopus. We also provided in-person A/B testing training in the company at regular intervals. In addition, there is a company-wide initiative to ensure each team has embedded A/B test owners (product analysts or data scientists) who have sufficient knowledge of experimentation.
Source data tracking. The experimental data were collected from each product team through tracking events (we track only users who provided appropriate consent). A dedicated tracking team ingested these events, unified the data schema, and stored them in a big data database. However, data tracking concepts were not holistically understood across the company -- some teams defined their own version of the tracking event schema. This inconsistency resulted in corrupted and missing data. As a consumer of this data, our A/B test analyses suffered from poor data quality. This situation started to improve after a period of extensive cross-team communication and reorganization.
A/B test design quality. Since we found that A/B tests from different teams had varying levels of quality, we introduced an A/B test design audit process as well as weekly consultation hours. Aspects of quality include a testable hypothesis, a clear problem statement, a clear outcome KPI, A/B test runtime, and finishing based on planned stopping criteria. We also wrote internal blog posts regularly to share our tips for effective A/B testing in Octopus.
A/B test analysis method quality. To make our services trustworthy, we revisited our analysis methods rigorously in peer reviews with applied scientists from other teams. We documented analysis steps transparently. Through scientific peer reviews, we have identified potential improvement areas such as non-inferiority tests.
The right analysis tool. A/B tests are not feasible for every use case, for example when comparing performance between two countries. In such cases, quasi-experimental methods are better suited. We provided guidelines and software packages to help analysts choose the right causal inference tool.
Randomization engine latency. Some applications have strict latency requirements. For example, slightly higher loading times of product detail pages may cause customers to churn. We reduced the latency of our services through a few engineering optimizations. Technical details will be discussed in later posts.
Controlled rollout. In some cases, teams want to gradually increase the traffic into the tests, so that they don’t accidentally show a buggy variant to a lot of users. In other cases, several teams are working on a complex feature release and want to release the product at the same time. In general, such staged rollouts are called controlled rollouts. To support these use cases, Octopus created new features such as traffic ramp-up in experimentation and feature toggles.
Analysis system scalability. The biggest challenge we had in this period was that our initial analysis system could no longer handle the load of concurrent A/B tests due to constraints in its architecture. As the maintenance cost of the analysis system became too high, we didn't have capacity to work on improving the analysis methods. We concluded that the need for a new analysis system was pressing. In the end, we spent two years building the new analysis system in Spark. Our lessons learned will be shared in a separate post.
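The traffic ramp-up mentioned under controlled rollout can be illustrated with deterministic, hash-based assignment. This is a sketch of one common approach, not a description of how Octopus works: the function names, the bucket count, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib

BUCKETS = 10_000  # resolution of the traffic split (hypothetical choice)


def bucket(unit_id: str, experiment: str) -> int:
    """Map a unit deterministically to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign(unit_id: str, experiment: str, exposure: float) -> str:
    """Return 'excluded', 'control' or 'treatment'.

    `exposure` is the share of traffic admitted to the test (0.0 to 1.0).
    """
    b = bucket(unit_id, experiment)
    if b >= exposure * BUCKETS:
        return "excluded"
    # Bucket parity gives a 50/50 split that does not depend on the exposure,
    # so a unit keeps its variant while the exposure is ramped up.
    return "control" if b % 2 == 0 else "treatment"


# Ramping up from 10% to 50% only adds units; nobody switches variants.
before = assign("user-42", "new-checkout", exposure=0.10)
after = assign("user-42", "new-checkout", exposure=0.50)
```

The point of the design is stability: because the bucket depends only on the unit and the experiment, increasing the exposure admits new units without reshuffling those already in the test.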

Phase three: run (2020-)

At this point, experimentation culture is established in most parts of the company. With the scalable infrastructure ready, the team can now work on more advanced statistical methods.

We are looking forward to bringing experimentation at Zalando to a new stage by:

Scaling out experimentation expertise. We have designed a new company-wide training curriculum that offers a smoother learning experience. It covers causality, statistical inference, and analysis tools at Zalando. We have also increased the scope of causal inference research peer reviews to the whole company.
Automating data quality indicators. A/B testing results are highly sensitive to data quality. The most important data quality indicator is sample ratio mismatch -- the actual sample size split is significantly different from the expected sample size split. Companies similar to Zalando have identified that between 6% and 10% of their A/B tests have sample ratio mismatch; a similar analysis on our historical data shows that at least 20% of A/B tests are affected within Zalando. Our platform automatically raises alerts to the affected team when sample ratio mismatch is detected. Further data investigation is then needed before analysis results are shown to users in the platform's dashboard. Another major data quality issue is the data tracking consent required by GDPR. As we process data only for visitors who provided their consent, we have been researching the resulting selection bias in A/B tests and how to address it.
Overall evaluation criteria. Over the last few years, we have learned from our users that selecting an outcome KPI for A/B tests is a big pain point. We now provide teams with qualitative guidelines: a) KPIs should be team-specific. KPIs should be sensitive to the product that each team controls, i.e. each team can drive their KPIs by changing product features; b) KPIs should be proxies for long-term customer lifetime value, instead of short-term revenue. We plan to incorporate these guidelines into Octopus with scientifically proven methods.
Faster experimentation. We found that the median runtime of an A/B test at Zalando is about three weeks. This is longer than at similar companies in the tech industry. Many users might claim their test has time constraints based on business requirements. We plan to support trustworthy analysis for faster experimentation through more advanced analysis methods, such as variance reduction, Bayesian analysis, and multi-armed bandits.
Stable unit assumption. In practice, each unit in an A/B test may not represent a unique person. For example, we are currently not able to detect the same person on the Zalando website and in the Zalando App and assign them the same variant. A solution to this problem creates new engineering challenges due to latency requirements.
Data visualization. Smart data visualization provides answers to questions you didn’t know you had. With complex and hierarchical data from A/B tests, there is quite some potential for data visualization designs.
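Sample ratio mismatch, described in the list above, is commonly detected with a chi-square goodness-of-fit test on the observed group sizes. The sketch below shows the idea for two variants; the function names and the alert threshold of 0.001 are assumptions chosen for illustration, not the values used by the platform.

```python
import math


def srm_p_value(observed_control, observed_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-variant sample split."""
    total = observed_control + observed_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (observed_control - expected_control) ** 2 / expected_control
        + (observed_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_sample_ratio_mismatch(observed_control, observed_treatment,
                              expected_control_share=0.5, threshold=0.001):
    """Flag a test when the observed split is very unlikely under the design."""
    p_value = srm_p_value(observed_control, observed_treatment, expected_control_share)
    return p_value < threshold


# Hypothetical counts for a test designed as a 50/50 split.
flagged = has_sample_ratio_mismatch(50_000, 51_500)
```

A strict threshold is typical for this kind of check, since the alert fires on every test and false alarms would quickly erode trust in it.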

Summary

To sum up, the experimentation platform at Zalando has evolved a lot since 2015. Nevertheless, we are and will always be focusing on bringing more scalable and more trustworthy experimentation to Zalando. We thank all team members, contributors and leadership who made it happen during this incredible journey.

Future posts

In the upcoming posts, we will provide more details about the technical challenges and solutions of the experimentation engine, analysis system, data quality issues, and data visualization. Stay tuned!

How Zalando prepares for Cyber Week

2020-10-08T00:00:00+02:00

Introduction

Cyber Week has become an increasingly important time of the year in e-commerce. In 2019, we attracted 840,000 new customers and our sales (Gross Merchandise Volume) increased by 32% compared to the previous year. During the event we grew faster as a business than throughout the rest of the year, when we grow at a 20-25% rate. Our peak orders per minute reached 7,200 compared to 4,200 the year before (+71% YoY).

From an engineering point of view, Cyber Week is a very exciting time, during which all systems are exposed to load that is far beyond any peak seen throughout the year. The experience of supporting the event itself has been extremely rewarding for everyone involved due to close collaboration between teams and strong focus on operational excellence and reliability. During the preparation time for the Cyber Weeks we created new capabilities in our teams and platform that serve us throughout the whole year. Looking back at the past years, we would like to share our experience and how our capabilities evolved over time around key themes of: Site Reliability Engineering, Load Testing in Production, and the Preparation approach itself.

Site Reliability Engineering

Phase 1: Building up knowledge about reliability engineering

Six years ago, when our e-commerce platform was still within on-premise data centers, we had a handful of on-call teams. Two of these teams were responsible for the backend and frontend systems of our e-commerce platform and were primarily responsible for Cyber Week preparations and support during the event. When we started moving more and more critical systems into the AWS cloud as part of our micro-frontend architecture, we adopted the "you build it - you run it" mindset and the number of on-call teams has increased dramatically to around 100 teams today. This also meant that we needed to educate many teams about designing for reliability. To achieve that, we formed a team of 10 colleagues, who were passionate about SRE and who signed up to perform production readiness reviews of our applications ahead of Cyber Week. In preparation for that, we ran a series of workshops with teams to share knowledge about reliability patterns and identified clusters of applications that required adjustments, so that the platform is stable in case of various failure types (e.g. failures of dependencies, overload, timeouts).

Phase 2: Distributed tracing

We use distributed tracing following the OpenTracing standard across our platform. This allows us to inspect the performance of our distributed system and quickly find contributing factors for increased latency or error rates across our applications. After instrumenting a set of applications and proving the intended wins resulting from it, we leveraged Cyber Week preparations to scale this effort. In year one, we focused on critical, tier-1 systems involved in the hot path of the browse journey in our shop. The following year, we expanded the coverage further to tier-2 systems for applications in the scope of Cyber Week. During the instrumentation, we adopted additional conventions that help us identify the traffic sources: App, Web, push notifications, load tests. This allows us to better understand traffic patterns and perform capacity planning based on the request ratios between incoming traffic and the respective parts of our platform.

Phase 3: Dedicated team for SRE enablement

What started as a grass-roots movement around SRE practices in Phase 1 has evolved into an SRE department within Zalando, which is focused on reliability engineering, observability, and providing the necessary infrastructure around monitoring, logging and distributed tracing. The SRE team also organizes trainings and knowledge exchange within the SRE guild, where teams share lessons learned and pitfalls about operating systems in production and collaborate on formulating best practices.

Distributed tracing has been a game-changer for us. We have leveraged tracing data to reduce alert fatigue of our on-call teams through an approach called adaptive paging. It's an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics is applied to identify the most probable cause, paging the respective team instead of the alert owner. See our SRECon talk "Are We All on the Same Page? Let's Fix That", which explains our approach in detail.
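To give a feel for the idea, the sketch below shows one simple heuristic that such an alert handler could apply: starting from the span that triggered the alert, follow failing child spans downwards and page the owner of the deepest one. This is an illustration only, not Zalando's implementation; the `Span` structure, its `team` field, and the "first failing child" rule are assumptions, while the boolean `error` flag mirrors the `error` tag from OpenTracing's semantic conventions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    operation: str
    team: str            # owning team (hypothetical field)
    error: bool = False  # mirrors the OpenTracing "error" tag
    children: List["Span"] = field(default_factory=list)


def team_to_page(alerting_span: Span) -> str:
    """Return the owner of the deepest failing span below the alerting span."""
    current = alerting_span
    while True:
        failing = [child for child in current.children if child.error]
        if not failing:
            # No failing dependency further down: this span is the likely cause.
            return current.team
        # Simplification: follow only the first failing child.
        current = failing[0]


# Hypothetical trace: the checkout alert is caused by a failing stock lookup.
stock = Span("get-stock", team="inventory", error=True)
pricing = Span("get-price", team="pricing")
cart = Span("get-cart", team="cart", error=True, children=[stock, pricing])
checkout = Span("checkout", team="shop", error=True, children=[cart])
paged_team = team_to_page(checkout)
```

In this example the alert is owned by the shop team, but the page goes to the team that owns the failing stock lookup.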

Load testing in Production

Phase 1: Feeling lucky

Over the years of operating our shop in the Data Center, we learned how to scale our shop's frontend. We kept adding servers and scaling our Solr fleet responsible for Product Data and Search until this became impractical due to the multi-month lead time needed to get new, physical servers. The Solr fleet was the one benefiting most from auto-scaling in the cloud and thus the first system that we moved to the cloud six years ago. Our backend services (e.g. product information management, inventory management, order management, customer accounts and data), however, formed an over-provisioned system with a fixed number of instances in the Data Center. At its heart were PostgreSQL instances heavily optimized by our Database infrastructure team that we scaled through sharding and switching from spinning disks to SSDs.

This was sufficient for Cyber Week in 2015 where commercial campaigns were just about the right size for our capacity. With no past knowledge about what type of traffic to expect we were amazed how much more headroom our backend systems really had. Never before had we seen load throughout the day that surpassed every past evening peak we saw. There were of course some challenges with scaling, but we could overcome these with small tuning of the system configuration during the event. This was achieved mostly through pausing some asynchronous processing that was not essential for accepting and processing orders.

Phase 2: Load Tests in Production

In a cloud-based system that relies heavily on auto-scaling for cost-optimization, proper testing and capacity planning is a must. To achieve that, we set the target to better understand our scalability limits. We tried many approaches and, in our experience, the only one we found effective for a large-scale system like ours is live load testing in production. Testing in production is an established practice, but difficult to execute well. Mistakes become really costly as the customer experience is degraded, and thus this approach requires the ability to quickly notice customer impact and react by aborting the test or otherwise mitigating the incident.

To achieve our goal, we wrote simulators that place sales orders for test products that can be clearly differentiated from real customer orders, processed to a certain degree, and then skipped at the stage of fulfillment. This gives us an understanding of the limitations of our order processing system and all its dependencies, including inventory management and payment processing. Further, as shared before in end-to-end load testing Zalando’s production website, we wrote a simulator that traverses the user journey across key customer touch-points in our shop. We run this simulation in production for all countries and mimic the traffic patterns we observe for sales events. Through that we uncover scalability bottlenecks and verify whether certain resilience patterns work properly. Running the simulation is a fun and thrilling exercise, especially if the whole team suddenly starts hearing pagers fire as we continue to increase the test traffic.
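The idea of marking simulated orders so that they run through the real pipeline but stop before fulfillment can be sketched as follows. Everything here is hypothetical: the `Order` fields, the `is_load_test` marker, and the step names only illustrate the principle and do not reflect the actual order processing system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Order:
    order_id: str
    sku: str
    is_load_test: bool = False  # hypothetical marker set by the simulator


def process(order: Order) -> List[str]:
    """Run an order through the pipeline and return the steps it went through."""
    # Test orders exercise the same steps as real orders up to this point,
    # so the load reaches inventory and payment dependencies as well.
    steps = ["validated", "stock_reserved", "payment_authorized"]
    if order.is_load_test:
        # Simulated orders must never reach a warehouse.
        steps.append("fulfillment_skipped")
    else:
        steps.append("sent_to_fulfillment")
    return steps


real_steps = process(Order("o-1", "sku-123"))
test_steps = process(Order("o-2", "sku-test", is_load_test=True))
```

Keeping the branch as late as possible is what makes the test representative: the more of the real path a simulated order takes, the more of the system the load test actually covers.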

Phase 3: Load Tests inform capacity planning

Having written and evolved the user journey simulator for two years, we were not fully satisfied with its ability to generate load at scale. There were too many rough edges, and tuning the simulator to generate the required load profiles consumed too much of our development time. We decided that it was better to leverage an existing product that would do the job better. This paid off heavily, as last year we were able to run the tests on both App and Web platforms simultaneously.

The different types of load tests that we ran in production last year helped inform capacity planning based on commercial goals and the projected sales. The final, clean run of tests also gave us sufficient confidence that the platform was scaled to sustain a certain amount of incoming traffic and sales in the peak minute and thus contributed to a smooth event for our teams.

Preparation as a project

The Cyber Week project is always at the top of our project lists and we dedicate the highest attention to the preparation work. Over the past years, we have progressively increased collaboration between the engineering and commercial teams and have dedicated Program Managers responsible for the delivery of the project. Every year we tune the structure and reporting within this project.

Thanks to the high priority of the Cyber Week preparations, every year we are able to invest in a key theme that helps us build up new capabilities that we did not have before - be it resilience engineering know-how, load testing in production, capacity planning, production readiness reviews, or collaboration across the company. On top of that, we also run dedicated projects aimed at increasing scalability of our platform and deliver changes to the customer experience for sales events.

During the event

After months of preparation, the event itself is a cherry on top - it's the time where we see how the time invested has paid off. If we are well prepared, we expect a rather uneventful time in terms of the number of production incidents. For the key period where we expect the highest load on our systems, we organize a Situation Room to ensure rapid incident response. In the room, we gather representatives from key engineering teams, SRE team, and dedicated Incident Commanders to closely watch the operational performance of our platform. It's basically a control center with dozens of screens and graphs, that looked like this in 2019:

Summary

We've explored three key themes in Zalando's Cyber Week preparation journey. We are constantly tuning our approach based on insights from each year and adapting the areas we invest in to the business growth and commercial campaign requirements. This year has the added twist of remote working, which will likely require us to rethink how to organize the Situation Room efficiently. With seven weeks until Cyber Week, our preparations for this year's event are well underway and we are looking forward to sharing results and lessons learned in follow-up posts. With our growing application landscape, there are sufficient challenges ahead as we have more than 1122 applications (out of 4000+) in scope of the Cyber Week preparations.

Meet Boris Malensek, Our Head Of Engineering In Merchant Operations

2020-09-08T00:00:00+02:00

We spoke about his professional journey within Zalando, the evolution of Merchant Operations, and the engineering culture within the company.

The interview was initially conducted for Zalando’s External Talent Community.

Boris, let’s go back to the start. What attracted you to Zalando in the first place?

The main reason for my attraction to Zalando was how quickly the company was able to adapt to change. I liked that they were constantly trying out new things, even if at that given moment they didn’t seem like the best solutions. At Zalando, there have always been believers in the change, and for me that is important. I think of the process as a journey, and who you share this journey with has always been important to me.

Do you think that’s the main incentive for people to join Zalando – the constant change?

I don’t think there is just one formula, one reason, why people choose to join the company. But what candidates should understand is that Zalando will always change. We will probably become a more stable organisation over time, but there will always be changes. We will continue to try out new things, and people should not be afraid of that. Some things turn out to be a great success, others don't, but we will always try to innovate and be better than before.

What is special and particular about Software Engineering at Zalando?

The engineering culture. Since the day I joined it remains the most impressive engineering culture I’ve experienced. What I refer to by the engineering culture is the support you receive on various levels: from a single line of code up to global challenges. There is always someone ready to help you, someone to learn from, and that’s really powerful. Our feedback culture is getting stronger with people having healthy attitudes towards sharing feedback. In general, we strive to build a community based on trust. Zalando has invested a lot in technology and our solutions and tooling are state-of-the-art. The way we enable our engineering teams to deploy their software – fast, autonomously, at scale and still compliant – is impressive. That sets us apart from many other companies. Our approach to solving problems is unique. We always try to put the customer first, we try to understand why we do what we do, what the purpose is, and this is important. We always aim to explain our strategy in the clearest way possible.

As the Head of Engineering in Merchant Operations, what do you do and what are your responsibilities?

Firstly, on a daily basis I enable the team to tackle complex challenges by providing guidance when they are unsure of how to come to an optimal solution. However, my main goal is to make myself “obsolete”: I aim to develop the team in such a way that they feel empowered to solve problems independently. An important part of my role as a leader is to hire the best talent for our business unit and the broader organisation. I am also responsible for planning and outlining strategies for upcoming technological, architectural or organisational changes that support the longer term Zalando Group Strategy. I work on building a network within and outside Zalando, so that I can turn to like-minded engineers and leaders for help with problems. Finally, I am accountable for the software that we deliver: it needs to be scalable and resilient, and when we fail, we need to fail fast, learn from it, and move forward to continuously improve on what we have done before.

Boris, you have just had your 5-year anniversary at Zalando and have gone through several stages of career growth from a Senior Software Engineer to an Engineering Manager, to a Head of Engineering. When the time came to pursue the next steps in your development, what motivated you to choose a management path? What does being an engineering leader entail?

Most of us want to grow by simply stepping out of our comfort zone. That’s definitely something that still drives me today, and at Zalando I have opportunities to do that. I came to Zalando as an experienced Senior Software Engineer, and leading people and projects was not new to me. When I joined Zalando, there was a reorganisation within the company and with perseverance and self-driven efforts, I enthusiastically grabbed the opportunity to become an Engineering Manager. Being a leader has taught me the importance of creating opportunities for career growth within an organisation. I aim to provide opportunities for growth both within my team and beyond - I believe that it's important to support employees' growth first and foremost, no matter where it may take them.

Merchant Operations is often referred to as a great success story within Zalando, could you tell us about how this business unit evolved?

Merchant Operations has a rich history. I have been involved with the department from the very start, but when I joined it five years ago it was called Brand Solutions. Brand Solutions was building a prototype for a marketplace. It had a small tech team, and I was the third software engineer to be hired for the team. We had a great commercial team working alongside us, developing the idea of the marketplace and managing important partner relationships. Over time, we grew into a fully-fledged organisation. Three years ago, David Roberts joined us as the VP of Merchant Operations, and around the same time our objective became clear: build a B2B marketplace model, to bring Zalando closer to being the Starting Point for Fashion by increasing our assortment to include external partners. Currently, we have around 80 people in the engineering organisation, compared to just 10 in the early days. We have engineers in Berlin and Dublin. Our Dublin team has been a great success story, having ramped up really quickly after the beginning of our expansion in October 2019 to a team of 15 today. What makes Merchant Operations unique is that it started as a pure operations team. However, if you want to reach the scale required to become a giant in the fashion e-commerce industry, you need to focus on innovating through technology - and that is how we began to transform. Our biggest initiative currently is Zalando Direct (zDirect) which steers the business of external partners to Zalando's platform and extensive customer base, which increases our offering and convenience proposition exponentially.

Lastly, could you give a piece of advice for a Senior Software Engineer who would like to join Zalando?

Patience is very important. I think it is always important to give yourself some time to learn, grow and focus on what you believe to be your ultimate goal. If you are a Senior Software Engineer and still in doubt about the direction you would like to take with your development, you have to think about this first and foremost. Your goal may be ambitious. But it’s really important that you think of constructive steps you can take to move towards it. Be disciplined. Stay determined, don't be afraid to ask for what you want, and remember to remain open to a path of continuous learning. It's only when you step outside of your comfort zone, that you realise what you are capable of.

Inbox Zero is not a Lifestyle

2020-07-17T00:00:00+02:00

The following guidelines and tricks help me with task management, time management, planning & prioritization, reacting to ad-hoc situations, and the sense of not having accomplished anything during the day. There is some overlap with our Remote Work Guidelines¹. My meta-advice for applying anything from this article: start with one improvement, don’t try it all at once. Start with tools you have at hand. It’s an ongoing improvement process, and it’s ok to fail and start over. I've been iterating over this on and off for roughly three years now.

Having worked as a software developer in my early career, I've been a manager for roughly 10 years now, with one year in between in which I went back to an individual contributor role. An aspect to consider when reading about my experience and the suggestions provided is that a manager's schedule is somewhat different from a maker's schedule. Depending on your organization's challenges, a manager still needs to be able to create, e.g. to provide structure and strategy. This needs an environment comparable to that of a maker. On the other hand, makers will benefit from applying some of the solutions outlined in this article when they need to adapt to a challenging environment themselves. "Different types of work need different types of schedules"², and while this article is primarily aimed at managers, I believe that makers can take away some learnings, too, especially when they are planning to transition from an individual contributor role to a manager's career path.

To limit the scope of this article and the suggested solutions, a nice concept to introduce is the concept of constants. I'm going to refer to constants as constraints that are considered to be true, and can’t be ignored, at least not for too long: I have eight hours per day and 40 hours per week for work. I need to eat and take a break. I will need to process email and other requests. I need time to plan, and some plans I made will need to be changed.

In order to address all this, I need transparency on what kind of time and energy I have available, and what work needs to be done by when. I also need to understand how flexibly I can change what I have planned in order to adapt to a new situation. For all this, I use Google Calendar and a task management tool.

Configure work time

Setting up your working hours in Google Calendar is a good reminder for you and your colleagues when you are available and when you should not be working. Make conscious decisions to break the rule of working outside of your working hours when needed. When your colleagues see they're inviting you to an event outside of your work hours, they will reconsider, or at least reach out to you first. That way you assert a certain control over your calendar and the invites you are getting.

Make a decision for every event

Events without a decision clutter your calendar and make the organizers’ lives harder. Make a decision on the same day, or the next day at the latest, for every incoming event, and move on. State a clear reason in the comment if you decline an event.

Hide declined events

You’ve already made a decision on those events, and you don’t need declined events to clutter your calendar. If you ever need to revisit that decision, you can enable showing declined events for that purpose in your calendar's settings, and disable it again afterwards.

Defragment your calendar

If you have many short appointments like 1:1's, group them together. If short appointments come in, try to fill gaps or place them next to other meetings. That way you optimize for continuous free space which helps with blocking time for focused work that takes more than just 30 minutes. You can also use Google Calendar's reschedule event functionality to ask the organizer to reschedule, if you prefer a different time, and the other participants are available.

Block recurring events

Take back control over how and when you are working on what. Some things need to be done every day (processing email, responding to calendar invites and chats, having lunch, or planning and prioritizing work) and you need to make room for that. You can always cut back if you’re running out of overhead tasks. My work time as you can see in the following screenshot is from 10:00 to 19:00. I usually do not exceed my 40 hours work week with this setup.

For all tasks that need doing, I follow a Getting Things Done (GTD) approach³. I process my inbox after lunch because I like to get started with work I planned instead of new input from my inbox. When processing, I make prioritization decisions mostly on importance and urgency⁴. Processing means that I try to organize all tasks into my task management system, which makes it easier for me to discover these tasks at the right time in the right context. A task management system can be anything from a formatted text file or a Google document to a more sophisticated, dedicated task management app. Setting this up is a topic of its own. I suggest starting with whatever you have at hand. I try to follow a strict agenda for task processing:

Review perspectives⁵
- What is happening today and the next few days?
- What input am I waiting for that will be provided by someone else?
- What is stalled (i.e. it’s not clear what the next step would be)?
Process email inbox
Process assigned Google Followup Action Items
Process our internal communication platform
Process Google Chat (pull mode)
Process other inboxes (e.g. task management tool inbox). Categorize and compartmentalize tasks & projects.
Plan and schedule events in the calendar for important or full focus tasks
Flag tasks I plan to complete today

Tasks that are flagged are the focus for today and are highlighted in my task management system (e.g. listed on top of the text file). That way I can always go to one spot after some inevitable context switching to get back on track fast. In the evening I try to clear out my inbox, and process and schedule all tasks that came in after lunch for the next few days, so I can start the next morning without having to look into my email inbox. That way I might reach Inbox Zero from time to time, which feels extremely good. A much more important aspect than trying to achieve Inbox Zero all the time, is measuring how much you have on your plate and if your inbox is constantly filling up, or if you're able to keep a healthy balance. Inbox Zero is a signal, not a lifestyle.

Categorize calendar entries

When you categorize your calendar entries, you can see immediately what can be easily rescheduled or canceled in case of emergencies and urgent and important ad-hoc requests. You see how much time you have available, and you can reflect much better on what you did at the end of the day or week. It’s good to feel accomplished about your “focus week”, or “hiring week”, the “catch-up week” or an “off-the-charts week” if you made those choices deliberately. I use the following colors to categorize events.

Red: Lunch (to remind myself of the importance)
Bright blue: Inbox processing / quick topics / Getting Things Done (GTD)
Light purple: 1:1's / Jour Fixes with directs and skip-level directs
Dark grey: Recurring department or team meetings
Yellow: Everything hiring related like interviews, preparation and briefings
Orange: Focus time
Dark blue: Mentoring, Career Development, Performance Management
Light orange: Trainings
Green: Everything else (default for incoming events, because green is hope)

You can also use emojis to make your calendar look nicer. I’m a visual person and I used this trick to cheat myself into caring more about my calendar and getting into the habit of maintaining good calendar hygiene. If emojis don’t work for you, maybe you’ll find something else. My colleague Lacey Nagel uses an elaborate emoji mapping for events she owns:

🌊 blockers for time to focus on specific tasks
📌 user research/interviews
🥙 planned breaks / lunch by myself
🍱 lunch with other Zalando's
🙌 1:1's
🐩 backlog refinement
🗺 planning
🔬 retro
🎂 reminders for colleagues’ birthdays

I use some of those and use the following additional emojis for my calendar:

📥 processing my inbox / mail
🧹 finishing up for the day
🎓 career development

A Hiring Week

Looking at my calendar, I know at one glance I don’t have to try and reschedule something yellow, but I can delay focus time, or make a conscious decision to cut back on inbox processing, or move a 1:1. Even if you didn’t work on what you planned to (e.g. product review), because you had to jump in and interview a candidate, you can feel good about it looking at the yellow accomplishments at the end of your week.

Plan and schedule your focus work

If you don’t block those time slots in your calendar, someone else will do it. Understand your energy levels⁶. You might just want to get a few small things done and out of the way, to get the energy to work on the product strategy next. Maybe you don’t have a lot of energy left, so you can read a document that was shared, or watch an all-hands that was recorded earlier. Different kinds of tasks need different levels of energy. I adopted the energy levels “Short Dashes”, “Full Focus”, “Hanging around”, and “Depleted”. These can be contexts, tags, categories, or different To-Do lists in your task management system, to allow easy access to these tasks.

A Focus week

In the example below I had to get the Performance & Development statements for my directs ready before the due date, so I put blockers in the calendar and focused on it. I also finalized a quarterly product review. Another thing you can see is I felt in the mood to go through a few emails and process my inbox earlier on Tuesday so instead of cutting my lunch short, I switched the inbox processing event and the lunch event around.

A Management week

In the next example you can see that preparing material for performance management is a diligent effort and takes a lot of time, same as participating in the corresponding alignment meetings (PRCs). I cut back heavily on inbox processing and lunch, and did some overtime to make it work. At the same time I did not want to cancel the training sessions I had scheduled a long time ago, and had been looking forward to, or miss out on a project closing dinner on Thursday to celebrate success. That was a conscious decision again, so I can’t complain about it afterwards. Cutting back on a routine can be a slippery slope to breaking an established good habit, so be mindful to get back to a normal setup as soon as possible, and compensate for the overtime by taking some time off the following week.

Feel accomplished working asynchronously

Transitioning from the office to working remotely, especially when using asynchronous communication, can further reduce the feeling of being appreciated and accomplished. The lack of face-to-face communication means less exposure to this type of appreciation. As someone giving feedback, or when reading something that someone else created or contributed to, you can compensate by explicitly expressing your appreciation. A thank you here and there goes a long way, even if it’s not actionable feedback. It doesn’t have to be. As someone who misses this kind of appreciation, I try to find other signals that potentially correlate with doing a good job and being appreciated for it, e.g. the number of readers of a document, or the number of comments, discussions, and other contributions on topics I'm driving.

What has changed since going full-remote in March 2020?

One thing that has changed is that because of the lack of commute, I had more time in the morning, and I started to eat breakfast. Not doing that before meant that I would need to have lunch at noon because I hadn't eaten properly in the morning and would be hungry already. Now with a proper breakfast to start the day, I have shifted lunch to 1pm and process my inbox right before at 12. I essentially switched those events around. You also see that we introduced recurring executive sync meetings at the end of the day to stay connected while working in a remote-first setup.

Closing comment

I hope this blog post helps you in leading yourself. Reflecting on how I feel today compared to when I started out on this journey a few years ago, it is a night and day difference. When you learn concepts like the Eisenhower matrix, or Getting Things Done (GTD), most of the time you don't get specific tips and details of how to apply it on a day to day basis. I'm sharing my concrete experience as a template for you to start out with, customize, and iterate on.

Guidelines for remote work at Zalando ↩
Maker vs. Manager ↩
GTD in 15 minutes – A Pragmatic Guide to Getting Things Done ↩
Eisenhower Matrix ↩
The term 'perspective' is task management tool specific: A modern approach to GTD contexts and perspectives in OmniFocus ↩
A modern approach to GTD contexts and perspectives in OmniFocus ↩

Technology Choices at Zalando - Updating our Tech Radar Process

2020-07-15T00:00:00+02:00

Challenges with our Tech Radar

The Zalando Tech Radar is modelled after the Thoughtworks Technology Radar and includes a ring-based scoring for a certain technology/framework along with supplementary information about pros, cons, restrictions, usage, and lessons learned at Zalando available as a knowledge base for our teams. Since publishing, the approach and visualization engine has been used by others and also showcased at conferences as an example of how tech companies manage their technology choices.

Our initial concept of the Tech Radar suffered from a series of problems, which we have observed in the Engineering Community while maintaining the Tech Radar:

The ring change criteria were too high level: they were not specific to technology types (e.g. programming languages, data stores) or context (e.g. backend, data science, mobile), and they did not consider support by our infrastructure or the impact on engineering usage. They didn’t allow for transparent, objective, and recurring rescoring of the Tech Radar, nor did they give our engineers clear guidance on how to select or suggest technologies to evaluate.
The Tech Radar has been easy to ignore due to the lack of a formal process, and oftentimes delivery teams have been making key technology choices in isolation without consulting the guild maintaining the Tech Radar. Radar entries and ring changes were often proposed only after technologies were already in production, instead of following the Tech Radar cycle. This led to a disconnect between the ring assignments and actual usage across teams.
The Tech Radar relied on voluntary contributions, which degraded in frequency because they were neither clearly incentivized nor part of the job expectations for higher grades. Contributions were usually driven by a small group of engineers forming an informal guild, who drove the collection of lessons learned material and encouraged teams across the organization to contribute. The guild lacked a formal mandate to make company-wide technology decisions and insufficiently represented the departments across the company.

Confirming the problem statements

To address these problems we have embarked on a journey starting with confirming the observed problems with our Engineering Managers and getting more insights on how they manage technology choices in their teams. We also explored potential effects on delivery in the past years. We found that Engineering Managers have felt insufficiently supported by the company to manage expectations and technology choices in their teams and missed the ability to lean on stricter guidance. Further, too broad technology choice has had an effect on the growth rate of their teams and created challenges with cross-team code contributions.

Technology choices in Tech companies

Having confirmed the problem, we’ve been collecting ideas on how the problems can be approached. We began with researching how other tech companies are managing technology selection. Unlike Zalando, other established tech companies (Google, Spotify, Tencent, Foursquare, and other CNCF End User companies) use a much stricter technology selection process, limit programming language choices, and invest into changing the way applications are built to leverage centralized control planes, which increases development velocity. They limit the tech stack choices due to the amount of investment into infrastructure support and the high cost of removing technologies that did not prove to be useful.

Too many technologies adopted company-wide make it challenging and expensive for Infrastructure teams to provide high-quality and well integrated tooling, e.g. CI/CD, observability, profiling, vulnerability scanning, compliance, governance, etc. It also causes the teams that provide infrastructure solutions to strongly depend on coordinated and continuous community contribution for technologies that are not supported centrally. A broad freedom of choice makes it harder to support software long-term once the original authors have left the company, which is guaranteed to happen sooner or later. There are also other problems related to development collaboration: (1) the cost of cross-language communication becomes significant, as teams will repeatedly implement the same functional components in different ways, (2) the code duplication rate increases and it becomes costly to address non-functional requirements of services in terms of performance, high availability, and scalability, and (3) cross-team collaboration across different code bases is hindered.

Generally, aside from specialized use cases, flexibility around technology choices provides especially high value when organizations are able to identify technologies that bring a paradigm shift (e.g. Kubernetes) paired with business value and use case fit. This proves to be a difficult task and companies rarely get the timing right.

Data collection

We sourced information from the Engineering Community through a Programming Language survey among our developers. The survey indicated how many engineers are currently using a certain language, which they feel comfortable working with and to which degree, as well as which language they would like to support others with in terms of guidelines or ad-hoc help. We cross-checked this data with our 4,000+ applications and derived how the different programming languages have gained traction and popularity over time.

Setting the bar for ADOPT languages

We have collected expectations around the level of support that we would like to see for ADOPT languages, ranging from clear guidelines on the VM lifecycles, integration into CI/CD systems, observability, size and health of the community within and outside of the company, ability to hire engineers to grow our teams using those languages, up to best practices for common tasks like performance analysis and tuning through inspection of heap dumps or flame graphs. We then collected data on how all our languages used in production benchmark against those criteria to see how big the gap between our expectations and reality is.

Defining new ring semantics

We have redefined the ring semantics as follows:

ADOPT: technologies with broad adoption, in which Zalando is willing to invest long-term
TRIAL: captures all current experiments in production
ASSESS: active, non-production assessments of promising technologies and trends
HOLD: discouraged from broad adoption where the company is not willing to invest further; no new applications may use this technology
NIL: no ring assignment, captures previous assessments and findings for long-term documentation purposes (we periodically archive HOLD entries as NIL)

We optionally limit the ring assignments through a clear scope recommendation: Backend, Mobile, Web, Data, Machine Learning, and Infrastructure. This allows us to better differentiate between the specifics of those use cases. The updated semantics allow us to be broad in assessing the value of emerging technologies, but be selective in terms of their deployments to production and level of investment into adoption and promotion within the company. For TRIAL, we also involve explicit sponsorship from our Engineering Heads, who will support production trials and commit to being accountable for divesting from non-promising technologies and the removal of failed experiments from our technology landscape.

Technology Selection Principles and Principal Engineering Community

The timing for making changes to the Tech Radar was fortunate for two reasons. First, we had started an update of our role expectations for Software Engineers and Engineering Managers and included the responsibility and accountability for technology selection, along with incentives for contributing to the process, in the new expectations. Second, we created a community of Principal Engineers with the most senior engineers across the company as members, who have been empowered to make decisions on technology selection and thus maintain the Tech Radar. We kicked off the community with a day-long remote off-site where we captured engineering challenges we face at Zalando, brainstormed on principles for technology selection, and had an initial exchange about the implications of new ring assignments and learnings about the programming languages we use in production. In departments that were not represented by Principal Engineers, we have included our Senior Engineers to contribute instead. Following the off-site, we have formalized Technology Selection Principles that provide guidance on technology choices in terms of breadth and depth, focus on company instead of local decision making, etc. Principle-based decision making enables healthy discussions and differs enormously from preference-based decision making, which easily becomes personal and leads to conflicts.

Parting ways with Clojure, Haskell, and Rust

Having reviewed the use cases where our teams have used the languages that are not on ADOPT, their current adoption within Zalando since 2016, the available set of languages, and the level of investment required to bring them to ADOPT, we have decided to part ways with Clojure, Haskell, and Rust and not create new applications in those languages moving forward. Although our teams have built many services using these languages and learned how to operate these at scale with many successes, following our technology selection principles, we decided to not further invest in these languages as their unique capabilities are not giving us any further leverage at this point in time. Instead, we are focusing our community efforts on Kotlin and TypeScript and expect our language communities to help us move these to ADOPT later this year.

Please note that this decision is specific to the context of Zalando (1,200+ developers, 4,000+ applications) and our current technology landscape and engineering practices. As such, this decision is not transferable to other organizations nor to be understood as a statement about the technical capabilities of the languages themselves. We encourage readers to follow a similar exercise as ours to derive decisions for their context.

Next steps

So far, we have reviewed the area of programming languages as the one having the biggest long-term impact on our engineers and system architecture as well as being the one sparking many debates on which language is better and why (when arguing based on preferences). As the next step, we are proceeding with reviewing the remaining categories of the Tech Radar, so stay tuned for further updates on our journey. (Update: check out our follow-up post on Scaling Contributions to the Tech Radar)

Launching the Engineering Blog

2020-07-01T00:00:00+02:00

Our Engineering Blog was launched in June 2020 after a long break of the previous tech blog. This post describes the technical setup behind engineering.zalando.com.

You will learn:

Which static site generator we selected and why.
What customizations we applied to design the blog and the publishing process.
How we serve static HTML using Skipper and S3.

Static Site Generator

Our previous tech blog used a CMS which only a limited number of people had access to. The CMS also lacked a workflow to propose and review drafts. As authors of the Engineering Blog will (mostly) be software engineers, we decided to switch to a git-based workflow and a static site generator.

StaticGen provides a nice overview of many different static site generators. Nearly all of them provide the necessary features to generate a static HTML site from blog posts written in Markdown. So which static site generator to choose?

With the need to customize the blog engine, e.g. with custom templates and features like author titles, the main criterion for the static site generator is that it uses a familiar programming language for templating and for plugins. The static site generator should generate plain HTML and not contain unnecessary features we won't use. The winner was Pelican:

Pelican is written in Python. Python is the language most people at Zalando are familiar with, so it's a safe bet.
Templates are written in Jinja. Jinja is a popular templating system; it's used in Zalando Open Source and I use it in my own OSS projects.
Atom/RSS feeds are supported out-of-the-box
There are many existing plugins and it's easy to write your own in Python.
It's actively developed. The last git commit was 16 days ago at the time of writing.

Customization

We implemented the blog's design with plain HTML/CSS. The CSS is generated via PostCSS and Tailwind CSS. Customizing Pelican's Jinja templates was straightforward.

Other customizations we did:

Enable the Atom feed via the FEED_ATOM setting in pelicanconf.py.
Generate the sitemap XML with the sitemap plugin.
Add author titles with the pelican-metadataparsing plugin.
Minify generated HTML with the pelican-htmlmin plugin.
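
As an illustration of the first two items, a pelicanconf.py fragment could look roughly like the sketch below. FEED_ATOM is the setting named above; the feed path, the plugin directory, and the sitemap options are assumptions for illustration and not the blog's actual configuration.

```python
# pelicanconf.py (sketch, values are assumptions)

# Write a site-wide Atom feed to this path inside the output directory.
FEED_ATOM = "feeds/atom.xml"

# Hypothetical local checkout of the community plugins.
PLUGIN_PATHS = ["plugins"]
PLUGINS = ["sitemap"]

# The sitemap plugin reads its options from this dict.
SITEMAP = {"format": "xml"}
```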

In addition to the above, we want to make sure that automatic linting is in place for blog posts:

Required meta keys must be present, e.g. title, summary, and author names.
The blog post Markdown file must be in the right year/month folder.
Article tags should be curated via an explicit allowlist. We want to avoid introducing many unnecessary tags and different tags for the same concept, e.g. "Postgres" vs. "PostgreSQL".

Linting is done via pre-commit which calls a custom Python script to validate blog post Markdown files. The .pre-commit-config.yaml looks something like this:

minimum_pre_commit_version: 1.21.0
repos:
  - repo: meta
    hooks:
      - id: check-hooks-apply
      - id: check-useless-excludes

  - repo: local
    hooks:
      - id: validate-content
        name: Validate blog content
        language: system
        # run with poetry to get dependencies (Pelican)
        entry: poetry run ./validate-content.py
        types: [markdown]
        exclude: ^content/pages/.*\.md$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.1.0
    hooks:
      - id: check-added-large-files
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: mixed-line-ending

Zalando's CI/CD system automatically lints all files by executing make lint.
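
The validation script itself is not shown here. A minimal sketch of the three lint rules listed above might look like this; the key names, the tag allowlist, and the function names are all hypothetical, and the real validate-content.py may work differently.

```python
import re
from pathlib import PurePosixPath

REQUIRED_KEYS = {"title", "summary", "authors"}        # assumed key names
ALLOWED_TAGS = {"Kubernetes", "PostgreSQL", "Python"}  # hypothetical allowlist


def parse_meta(text):
    """Read the leading 'Key: value' lines of a Markdown post."""
    meta = {}
    for line in text.splitlines():
        match = re.match(r"^([\w ]+):\s*(.*)$", line)
        if not match:
            break  # the header ends at the first non-metadata line
        meta[match.group(1).strip().lower()] = match.group(2).strip()
    return meta


def validate(path, text):
    """Return a list of lint errors (empty if the post passes)."""
    meta = parse_meta(text)
    # Rule 1: required meta keys must be present.
    errors = [f"missing meta key: {key}" for key in sorted(REQUIRED_KEYS - set(meta))]

    # Rule 2: content/<year>/<month>/... must match the post's date.
    date = re.match(r"^(\d{4})-(\d{2})-\d{2}", meta.get("date", ""))
    folder = PurePosixPath(path).parts[1:3]
    if date and folder != date.groups():
        errors.append(
            f"expected folder {date.group(1)}/{date.group(2)}, found {'/'.join(folder)}"
        )

    # Rule 3: only curated tags are allowed.
    for tag in (t.strip() for t in meta.get("tags", "").split(",")):
        if tag and tag not in ALLOWED_TAGS:
            errors.append(f"tag not in allowlist: {tag}")
    return errors


post = "Title: Hello\nDate: 2020-07-04\nAuthors: Jane Doe\nTags: Postgres\n\nBody text"
for error in validate("content/2020/06/hello/2020-07-04-hello.md", post):
    print(error)
```

Returning a list of errors instead of failing on the first one lets a pre-commit hook report every problem in a post in a single run.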

Writing a blog post

Anybody at Zalando can pitch a blog post idea by creating an issue in the git repo.

Bootstrapping a new blog post looks like this:

hjacobs@ZALANDO-123:~/workspace/engineering-blog$ make new
poetry run ./scripts/new-post.py
This will create a new blog post, please answer a few questions..
Title of blog post: Launching the Engineering Blog
Slug [launching-the-engineering-blog]:
Date (estimated) of publishing [2020-07-04]:
Author names (separate with semicolon) [Henning Jacobs]:
Author titles (separate with semicolon) [Senior Principal Engineer]:
========================================
Title:         Launching the Engineering Blog
Slug:          launching-the-engineering-blog
Authors:       Henning Jacobs
Author Titles: Senior Principal Engineer
Date:          2020-07-04
URL:           /posts/2020/07/launching-the-engineering-blog.html
========================================
Does this look correct? Answer 'y' or 'n': y
Creating content/2020/07/launching-the-engineering-blog/2020-07-04-launching-the-engineering-blog.md ..

Useful commands:
- make devserver    Start local webserver, find your draft on http://localhost:8000/drafts/
- make lint         Validate content and formatting.

Please edit your article in content/2020/07/launching-the-engineering-blog/2020-07-04-launching-the-engineering-blog.md
and don't forget to open a PR :-)
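
The slug, file path, and URL in the transcript above follow a simple pattern. The sketch below reproduces that pattern; the function names are hypothetical and the real new-post.py is not shown here, so treat this as an approximation of its naming scheme only.

```python
import re
from datetime import date


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def post_paths(title, published):
    """Derive slug, Markdown source path, and public URL for a post."""
    slug = slugify(title)
    folder = f"{published:%Y/%m}"  # year/month folder, e.g. 2020/07
    source = f"content/{folder}/{slug}/{published.isoformat()}-{slug}.md"
    url = f"/posts/{folder}/{slug}.html"
    return slug, source, url


slug, source, url = post_paths("Launching the Engineering Blog", date(2020, 7, 4))
print(slug)
print(source)
print(url)
```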

Opening a PR to the Engineering Blog repository will trigger a build (make html) on our Zalando Continuous Delivery Platform. The PR build will publish a preview of the blog under a private (authenticated) URL.

After merging the blog post PR, it will automatically be published on the live site engineering.zalando.com.

Serving static HTML

Zalando's Continuous Delivery Platform has a built-in feature to upload files to a given S3 bucket. This feature is used to upload all files from the output directory (generated by Pelican) to the blog's S3 bucket. The S3 bucket is created via CloudFormation which also configures the S3 website:

AWSTemplateFormatVersion: 2010-09-09
Metadata:
  StackName: "engineering-blog"
  Tags:
    application: "engineering-blog"
Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "<BUCKET-NAME>"
      AccessControl: PublicRead
      WebsiteConfiguration:
        IndexDocument: index.html
        ErrorDocument: error.html
    DeletionPolicy: Retain
  BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      PolicyDocument:
        # ...

The WebsiteConfiguration property will make the bucket contents available on http://<BUCKET-NAME>.s3-website.<REGION>.amazonaws.com. The S3 website only provides an HTTP endpoint (no SSL) and not a domain we would want to use publicly.

One way to serve the contents with a custom domain and SSL is to create a CloudFront web distribution. I decided to not use CloudFront as all the required infrastructure for domain+SSL is already in place.

We have Skipper as the Kubernetes Ingress proxy running for all our 140+ Kubernetes clusters. External DNS automatically configures the DNS name and the Kubernetes Ingress Controller for AWS configures the AWS ALB with the right ACM SSL certificate. So let's reuse this infrastructure and let Skipper proxy all requests to the S3 website bucket endpoint. This can be achieved by adding a default Skipper route as Ingress annotation:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: "engineering-blog"
  labels:
    application: "engineering-blog"
  annotations:
    zalando.org/skipper-routes: |
      redirect_app_default: * -> compress() -> setDynamicBackendUrl("http://<BUCKET-NAME>.s3-website.<REGION>.amazonaws.com") -> <dynamic>;
spec:
  rules:
  - host: "engineering.zalando.com"
    http:
      paths:
      - backend:
          serviceName: "engineering-blog"
          servicePort: 80

Skipper's compress() filter enables response compression, as the S3 endpoint does not provide it out-of-the-box. The ACM certificate, HTTP/2 support, the S3 website response, and the enabled compression are visible when doing a curl request (output shortened):

$ curl -v --compressed https://engineering.zalando.com -o /dev/null
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
*  subject: CN=engineering.zalando.com
*  subjectAltName: host "engineering.zalando.com" matched cert's "engineering.zalando.com"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
> GET / HTTP/2
> Host: engineering.zalando.com
> user-agent: curl/7.68.0
> accept: */*
> accept-encoding: deflate, gzip, br
< HTTP/2 200
< content-type: text/html
< content-encoding: deflate
< etag: "304fcc9c31aac19255bf1d84669059df"
< last-modified: Sat, 27 Jun 2020 07:23:19 GMT
< server: AmazonS3
< vary: Accept-Encoding

Performance

The static website should be fast. So let's test. We can use Vegeta for some basic HTTP load testing. 60ms as p99 latency looks good:

$ echo "GET https://engineering.zalando.com/" | vegeta attack -duration=60s | vegeta report
Requests      [total, rate, throughput]         3000, 50.02, 50.00
Duration      [total, attack, wait]             59.995s, 59.98s, 15.246ms
Latencies     [min, mean, 50, 90, 95, 99, max]  12.418ms, 19.751ms, 17.049ms, 25.05ms, 38.382ms, 59.958ms, 244.094ms
Bytes In      [total, mean]                     51441000, 17147.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:3000
Error Set:
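
As a quick sanity check, the summary figures in the report can be recomputed from its own numbers. The sketch below assumes Vegeta's definitions: rate is requests divided by the attack duration, and throughput is successful requests divided by the total duration (attack plus wait). The variable names are only for illustration.

```python
total_requests = 3000
successes = 3000             # every response had status 200
attack_seconds = 59.98       # time spent issuing requests
total_seconds = 59.995       # attack plus wait for the last response
bytes_in_total = 51_441_000

rate = total_requests / attack_seconds
throughput = successes / total_seconds
mean_bytes = bytes_in_total / total_requests

print(f"rate={rate:.2f}/s throughput={throughput:.2f}/s mean_bytes={mean_bytes:.2f}")
```

The results round to the values in the report: a rate of 50.02, a throughput of 50.00, and a mean response size of 17147 bytes.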

The user experience with a real browser is much more interesting. Chrome Lighthouse can be used to assess the page performance. Google's PageSpeed Insights uses Lighthouse for its score calculation. Running PageSpeed Insights for the blog reports a nice score of 100 out of 100 (desktop):

Thanks go out to our Employer Branding colleagues who created the design and implemented the responsive HTML/CSS layout!

Summary

I hope this blog post gives you some inspiration for setting up your own blog with Pelican or some other static site generator. After re-launching our Engineering Blog, our main focus will be providing regular and high quality content. We still have to figure out the best way to source, review, and schedule blog posts.

Follow ZalandoTech on Twitter and subscribe to the Atom/RSS feed to get the latest articles.

PgBouncer on Kubernetes and how to achieve minimal latency

2020-06-24T00:00:00+02:00

Introduction

In the new Postgres Operator release 1.5 we have implemented a couple of interesting new features, including connection pooling support. Master Wq says there is "No greatest tool": to run something successfully in production, one needs to understand its pros and cons. Let's dig into the topic and take a look at the performance aspect of connection pooler support, mostly from a scaling perspective.

But first, an introduction. Why do we quite often need a connection pooler for PostgreSQL (and in fact for many other databases too)? Having too many open connections to a database has several performance implications, which result from how a connection is opened (PostgreSQL uses a "process per user" client/server model, in which too many connections mean too many processes fighting for resources and drowning in context switches and CPU migrations) and from how certain aspects of transaction handling are implemented (e.g. GetSnapshotData has O(connections) complexity). Having said that, there are three options for where to implement a connection pooler:

on the database side, as proposed in this patch
as a separate component between the database and the application
on the application side

For Postgres Operator we have chosen the second approach. Although there are pros and cons for all of those options, the other two would obviously require a lot of effort (an application-side connection pooler is not under the operator's control, and an internal connection pooler for PostgreSQL is a major feature that is yet to be developed). Another interesting choice to make is which connection pooling solution to use. At the moment there are a couple of options available for PostgreSQL (listed in no particular order):

PgBouncer is probably the most popular and the oldest solution. Pgpool-II can actually do much more than just connection pooling (e.g. it can do load balancing), but it means it's a bit more heavyweight than others. Odyssey and pgagroal are much newer and try to be more performance optimized and scalable than the alternatives.

Eventually we went for PgBouncer, but the current implementation allows us to switch to any other solution that conforms to a basic common standard. Now let's take a look at how PgBouncer performs in tests.

Setup

In fact, we ran a significant number of benchmarks with PgBouncer for different workloads on our Kubernetes clusters and learned a few interesting details. For example, I didn't know that a Kubernetes Service can distribute the workload in a not exactly uniform way, so that one can see something like this, where the third pod is only about half utilized and in fact gets about half as many queries as the others:

NAME                         CPU(cores)   MEMORY(bytes)
pool-test-7d8bfbc47f-6bbhr   977m         5Mi
pool-test-7d8bfbc47f-8jtnp   995m         6Mi
pool-test-7d8bfbc47f-ghvpn   585m         6Mi
pool-test-7d8bfbc47f-s945p   993m         6Mi

This could happen if kube-proxy works in iptables mode and calculates probabilities to land on a pod instead of strict round-robin.
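As a rough illustration of why probabilistic selection is not the same as strict round-robin, the following Python sketch models a chain of rules in which each rule is matched with probability 1 / (remaining pods), and then assigns a batch of connections by walking that chain. The pod count, connection count and seed are made up for the example, and this is a simplified model under those assumptions, not kube-proxy's actual code.

```python
import random
from collections import Counter


def rule_probabilities(n_pods):
    """Per-rule match probabilities for a chain of n_pods rules.

    A rule is only reached if all previous rules did not match, so it uses
    probability 1 / (remaining pods); the last rule always matches.
    """
    return [1.0 / (n_pods - i) for i in range(n_pods)]


def effective_probabilities(n_pods):
    """Overall probability of landing on each pod after chaining the rules."""
    result = []
    not_matched_yet = 1.0
    for p in rule_probabilities(n_pods):
        result.append(not_matched_yet * p)
        not_matched_yet *= 1.0 - p
    return result


def assign_randomly(n_connections, n_pods, seed):
    """Assign each connection to a pod by walking the rule chain."""
    rng = random.Random(seed)
    probabilities = rule_probabilities(n_pods)
    counts = Counter({pod: 0 for pod in range(n_pods)})
    for _ in range(n_connections):
        for pod, p in enumerate(probabilities):
            if rng.random() < p:
                counts[pod] += 1
                break
    return counts


def assign_round_robin(n_connections, n_pods):
    """Assign connections to pods in strict rotation, for comparison."""
    counts = Counter({pod: 0 for pod in range(n_pods)})
    for i in range(n_connections):
        counts[i % n_pods] += 1
    return counts


if __name__ == "__main__":
    print(effective_probabilities(4))
    print(sorted(assign_randomly(40, 4, seed=1).items()))
    print(sorted(assign_round_robin(40, 4).items()))
```

Every pod is equally likely in expectation, but with a limited number of long-lived connections the actual counts per pod can differ noticeably, whereas round-robin always yields an even split.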

But in this article I want to offer one example, produced in the more artificial environment of my laptop. That's mostly because there we can collect more detailed metrics that are interesting for this particular case, but that do not make sense to collect for all workloads. My original idea was to play around with CPU management policies and exclusive CPUs, to show what happens if a PgBouncer runs with a fixed cpuset. But interestingly enough, another effect introduced an even bigger difference, so the following experiment is more about the scaling of PgBouncer instances.

To simulate the networking part of our experiment, let's set up a separate network namespace, where we will run PostgreSQL and PgBouncer, and connect it via a veth link with the root namespace.

# setup veth link with veth0/veth1 at the ends
$ ip link add veth0 type veth peer name veth1

# check that they're present
$ ip link show type veth

# add a new network namespace
$ ip netns add db

# move one end into the new namespace
$ ip link set veth1 netns db

# check that now only veth0 is visible
$ ip link show type veth

# check that veth1 is visible from the other namespace
$ ip netns exec db ip link show type veth

# add corresponding addresses and bring everything up
$ ip addr add 10.0.0.10/24 dev veth0
$ ip netns exec db ip addr add 10.0.0.1/24 dev veth1
$ ip link set veth0 up
$ ip netns exec db ip link set veth1 up
$ ip netns exec db ip link set lo up

This link is going to be blazingly fast, so let's add a small delay to the veth interface, which corresponds to the empirical network latency we observe in our Kubernetes clusters. Distribution parameter here is mostly to emphasize its presence, since it's normal by default anyway.

$ tc qdisc add dev veth0 root netem delay 1ms 0.1ms distribution normal

In our experiment we will run a pgbench test with the query ;, which is the smallest SQL query one can come up with. The idea is to not load the database itself too much and to see how a PgBouncer instance handles many connections, in this case 1000 connections dispatched via 8 threads. A word of warning: use pgbench carefully, since in some cases it can itself be a bottleneck and produce confusing results. In our case we try to limit this by pinning each component to a separate core and by collecting performance counters, to see where we spend time and to be alerted about strange results. For a more diverse workload and a more holistic approach you can use oltpbench or benchmarksql.

The result will be a per-transaction execution log. Every component, namely:
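The exact pgbench invocation is not shown here, so the following Python sketch only illustrates how such a command line might be assembled. The host, port, duration, script file name and CPU number are assumptions made up for the example; only the 1000 clients and 8 threads come from the description above. The sketch builds the argument list and does not run anything.

```python
import shlex


def pgbench_command(host, port, clients=1000, threads=8, duration_s=60,
                    script_path="noop.sql", cpu=None):
    """Build (but do not run) a pgbench command line as a list of arguments.

    host, port, duration_s, script_path and cpu are illustrative assumptions;
    only the 1000 clients and 8 threads come from the text.
    """
    args = [
        "pgbench",
        "-h", host,
        "-p", str(port),
        "-n",                    # skip vacuuming before the run
        "-c", str(clients),      # number of concurrent client connections
        "-j", str(threads),      # number of pgbench worker threads
        "-T", str(duration_s),   # run time in seconds
        "-l",                    # write a per-transaction log
        "-f", script_path,       # custom script; here a file containing ";"
    ]
    if cpu is not None:
        # Prefix with taskset to pin the process to the given CPU.
        args = ["taskset", "-c", str(cpu)] + args
    return args


if __name__ == "__main__":
    # Hypothetical address and port of the pooler inside the "db" namespace.
    print(shlex.join(pgbench_command("10.0.0.1", 6432, cpu=3)))
```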

PostgreSQL instance
Two PgBouncer instances
PgBench workload generator

is bound to a single CPU core, with Intel Turbo Boost disabled and the CPU scaling governor for all cores set to performance. The two instances of PgBouncer run with the so_reuseport option, which is essentially a way to get PgBouncer to use more CPU cores. The only degree of freedom we will investigate is where the PgBouncer instances are placed: on genuinely separate physical cores, or merely on separate hyperthreads of the same core.
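To tell whether two CPU numbers are separate physical cores or just hyperthreads of the same core, one option on Linux is to read the topology files under sysfs. The Python sketch below is a minimal illustration under that assumption (it is not the tooling used for this experiment); the helper names are made up for the example.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-1,4" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def thread_siblings(cpu):
    """Return the CPUs sharing a physical core with `cpu`, or None if unknown."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return None


def same_physical_core(cpu_a, cpu_b):
    """True/False if the topology is readable, None otherwise."""
    siblings = thread_siblings(cpu_a)
    return None if siblings is None else cpu_b in siblings


def pin_current_process(cpu):
    """Pin the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})


if __name__ == "__main__":
    print(thread_siblings(0))
    print(same_physical_core(0, 1))
```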

Benchmark

Here are the benchmark results, presenting rolling mean, 99th percentile latency and standard deviation, obtained on a rather modest setup with 2 physical cores, each with 2 hyperthreads, for three cases:

Only one instance of PgBouncer on an isolated real core
Two PgBouncers on isolated hyperthreads, but on the same physical core.
Two PgBouncers on isolated cores (with potential noise from other components on the different hyperthread).

Hyper-Threading means that two components are still fighting for CPU time, but share some execution state and cache. Usually this results in more latency variation, which we will keep in mind.

One nice feature we can immediately see is that the results are relatively stable, which is good. Another interesting note is that, despite the fact that we were only changing the core location of each component, we can see a significant difference in latency. For a single PgBouncer instance we get the lowest latency, while for two PgBouncers on the same physical core it is almost two times higher (with a rather minimal increase in throughput). In the case of two PgBouncers on different physical cores, even with potential competition for resources with another component (and a different resource consumption pattern), the latency is somewhere in between (with the best throughput of the three). Why is that?

In the course of the investigation, more and more puzzling measurements were collected: sampling with perf showed no significant difference in the activity of PostgreSQL or of either PgBouncer instance. Let's take a closer look at what PgBouncer is actually doing:

As expected, it spends a lot of its time doing networking. The kernel docs say:
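For reference, rolling statistics of this kind could be derived from a per-transaction pgbench log roughly as in the Python sketch below. It assumes the default log layout, in which the third whitespace-separated field is the transaction latency in microseconds, and uses a nearest-rank percentile; the window size is an arbitrary choice for the example and not the one used for the plots.

```python
import math
import statistics
from collections import deque


def parse_latency_us(line):
    """Extract the latency column (third field, in microseconds)."""
    return int(line.split()[2])


def percentile(values, q):
    """Nearest-rank percentile of `values`, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def rolling_stats(latencies, window):
    """Yield (mean, p99, stddev) for each full sliding window."""
    buf = deque(maxlen=window)
    for value in latencies:
        buf.append(value)
        if len(buf) == window:
            yield (statistics.mean(buf),
                   percentile(buf, 99),
                   statistics.pstdev(buf))


if __name__ == "__main__":
    # A few made-up log lines, only to show the expected shape of the input.
    sample = [
        "0 1 1200 0 1592000000 100000",
        "1 1 1350 0 1592000000 100200",
        "0 2 1100 0 1592000000 101500",
    ]
    latencies = [parse_latency_us(line) for line in sample]
    for row in rolling_stats(latencies, window=2):
        print(row)
```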

For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system.

This could be our working assumption. Network interrupts probably do not scale well across hyperthreads, so one needs a real core to scale them out. To get a bit more evidence, let's take a look at interrupt latencies in both cases, different cores and different hyperthreads. For that we can use the irq:softirq_entry and irq:softirq_exit tracepoints and a script from Brendan Gregg:

# one PgBouncer instance is running on a CPU2 with no other PgBouncer on the
# same physical core. We're interested only in NET_RX,NET_TX vectors.

$ perf record -e irq:softirq_entry,irq:softirq_exit \
    -a -C 2 --filter 'vec == 2 || vec == 3'
$ perf script | awk '{ gsub(/:/, "") } $5 ~ /entry/ { ts[$6, $10] = $4 }
    $5 ~ /exit/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,
    ($4 - l) * 1000000; ts[$6, $10] = 0 } }' > latencies.out

And the same for another case when a PgBouncer sits together with another one on the same physical core. Here is the 99th percentile of the resulting latencies:

This indeed points in the direction of network interrupts being a bit slower when both PgBouncers share the same physical core. In theory, it means that we can get surprising performance results after adding more pods to a connection pooler deployment, depending on where those new pods land: on an isolated CPU, or on a CPU whose other hyperthread is already busy. In view of these results it could be beneficial to configure the CPU manager in the cluster, so that this is not an issue.
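A percentile like this can be computed from latencies.out in a few lines. The Python sketch below assumes the two-column format written by the awk script (timestamp and duration, both in microseconds) and uses statistics.quantiles, which interpolates between samples; it is one possible way to summarise the file, not necessarily how the original numbers were produced.

```python
import statistics


def parse_latencies_us(lines):
    """Collect the second column (softirq duration in microseconds)."""
    durations = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            durations.append(float(fields[1]))
    return durations


def p99(values):
    """99th percentile; statistics.quantiles needs at least two values."""
    return statistics.quantiles(values, n=100)[98]


def p99_of_file(path):
    """Convenience wrapper for a file such as latencies.out."""
    with open(path) as f:
        return p99(parse_latencies_us(f))


if __name__ == "__main__":
    # Made-up sample lines in the "timestamp duration" format.
    sample = ["1000 4\n", "1010 6\n", "1025 5\n", "1040 30\n"]
    print(p99(parse_latencies_us(sample)))
```

Running p99_of_file once per recording (separate cores, shared core) gives the two values to compare.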

Conclusion

Having said all of the above, I must admit it's just the tip of the iceberg. If there can be such interesting complications in running a connection pooler within a single node, you can imagine what happens at a higher architectural level. We've spent a lot of time discussing different design possibilities for Postgres Operator, e.g. whether it should be a single "big" pgbouncer instance (with many processes reusing the same port) with an affinity to be close to the database, or multiple "small" instances equidistant from the database. Every design has its own trade-offs regarding network round trips and availability, but since we value simplicity (especially in view of such a complicated topic) we went for a rather straightforward approach relying on standard Kubernetes functionality:

Postgres Operator creates a single connection pooler deployment and exposes it via a new service.
Connection pooler pods are distributed between availability zones.
Due to the nature of connection pooling, pods are doing CPU-intensive work with a minimal amount of memory (less than a hundred megabytes in a simple case), so it makes sense to create as many as needed to prevent resource saturation. Those pods could be scattered across multiple nodes and availability zones, which means latency variability.
For those cases when this variability cannot be tolerated, we would consider manually creating a single "big" pooler instance with an affinity that puts it on the same node as the database, and adjusting the CPU manager to squeeze everything we can from this setup. This new instance would be the primary one for connections, with another one providing HA.

This simplicity should not be confused with ignorance: it's based on an understanding of the limitations of the proposed solution and of what could be adjusted beyond them. As in my other blog posts and talks, I would love to emphasize the importance of the described methodology: even if you have a system as complicated as Kubernetes in your hands, it's important to understand what happens underneath!

Learnings from Distributed XGBoost on Amazon SageMaker

2020-06-22T00:00:00+02:00

Overview

XGBoost is a popular Python library for gradient boosted decision trees. The implementation allows practitioners to distribute training across multiple compute instances (or workers), which is especially useful for large training sets.

One tool used at Zalando for deploying production machine learning models is the managed service from Amazon called SageMaker. XGBoost is already included in SageMaker as a built-in algorithm, meaning that a prebuilt docker container is available. This container also supports distributed training, making it easy to scale training jobs across many instances.

Despite SageMaker handling the infrastructure side of things, I found that distributed training with XGBoost and SageMaker is not as easy as simply increasing the number of instances. I discovered a few small "gotchas!" when attempting a few simple trainings. This post will step through my failed attempts, and end with a genuine distributed training with XGBoost in Amazon SageMaker.

Experiment Setup

I wanted to get an intuitive idea of how well the training time with XGBoost scales as the number of instances increases, as well as how the training time changes when the data size increases. I am not especially interested in producing the "best" model to solve a problem, per se, but there is a natural trade-off between training time and model accuracy that should be considered.

For a data set, I used the Fashion MNIST by Zalando Research. The problem itself is to classify small images (28x28 pixels) of clothing as being from 1 of 10 different classes (t-shirts, trousers, pullovers, etc). The data set has 60,000 images for a training set and 10,000 images for a validation set.

To increase the training size, I duplicate the training data to measure the scaling of model training time as the computational resources change. The number of times the training data is duplicated is referred to as the "replication factor". For a typical ML project, you probably don't want to duplicate the training set outright. Although doing so improves our training and validation accuracies here, this method is likely not as efficient as changing hyperparameters (however, you might create new images with noise to improve regularization). For reference, the size on disk of the training data for different replication factors is provided below.

Replication factor 1: 0.63 GB, 60,000 images
Replication factor 2: 1.24 GB, 120,000 images
Replication factor 4: 2.48 GB, 240,000 images
Replication factor 8: 4.95 GB, 480,000 images
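The duplication step itself is simple. As a minimal sketch, assuming the training data is a headerless CSV file with one example per line (the function names and file layout are assumptions for illustration, not the code used in the experiment):

```python
def replicate_rows(rows, replication_factor):
    """Return the rows repeated `replication_factor` times, in order."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return list(rows) * replication_factor


def replicate_csv(src_path, dst_path, replication_factor):
    """Write the rows of src_path to dst_path `replication_factor` times.

    Returns the number of rows written.
    """
    with open(src_path) as src:
        # Make sure every row ends with a newline so copies do not run together.
        rows = [r if r.endswith("\n") else r + "\n" for r in src]
    replicated = replicate_rows(rows, replication_factor)
    with open(dst_path, "w") as dst:
        dst.writelines(replicated)
    return len(replicated)


if __name__ == "__main__":
    # 60,000 training examples with a replication factor of 8.
    print(len(replicate_rows(range(60000), 8)))
```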

I wanted to use hyperparameters that would give a somewhat reasonable performance for accuracy, so I used a hyperparameter tuning job in SageMaker, with one instance per training. I tuned all of the tunable hyperparameters, except "num_round", which was fixed to 100. This hyperparameter increases the number of decision trees used, and increases training time and accuracy as its value increases. My hyperparameters were as follows:

hps = {'alpha': 0.0,
       'colsample_bylevel': 0.4083530569296091,
       'colsample_bytree': 0.8040025839325579,
       'eta': 0.11764087266272522,
       'gamma': 0.43319156621549954,
       'lambda': 37.547406128070286,
       'max_delta_step': 10,
       'max_depth': 6,
       'min_child_weight': 5.076838893848415,
       'num_round': 100,  # Not tuned: kept fixed
       'subsample': 0.8915771964367318,
       'num_class': 10,  # Not tuned: defined by Fashion MNIST
       'objective': 'multi:softmax'  # Not tuned: defined by Fashion MNIST
      }

There are additional hyperparameters beyond those listed above which are not tunable. I took those at their default values (which, as you will see, can cause some unexpected results). The full list of hyperparameters offered by XGBoost is different from those offered by the SageMaker container, as SageMaker adds a few additional hyperparameters which do not control model performance. The objective "multi:softmax" produces a metric called merror, which is defined as #(wrong cases)/#(all cases).
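To make the merror definition concrete, here is a small Python sketch that computes it directly from labels and predictions. It follows the #(wrong cases)/#(all cases) definition above and is only an illustration, not XGBoost's own implementation.

```python
def merror(y_true, y_pred):
    """Multiclass classification error: wrong predictions / all predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    wrong = sum(1 for truth, pred in zip(y_true, y_pred) if truth != pred)
    return wrong / len(y_true)


if __name__ == "__main__":
    # Four examples, one misclassified.
    print(merror([0, 1, 2, 3], [0, 1, 2, 0]))
```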

Lastly, the tools. I wrote all of the code for my experiments in Python 3.7 using the Amazon SageMaker Python SDK. I used the SageMaker docker container version 0.90-1 for XGBoost, the URI of which can be found by using the SageMaker Python SDK:

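import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

# AWS region of the current session; the container URI is region-specific
region = boto3.Session().region_name
container = get_image_uri(region, 'xgboost', repo_version='0.90-1')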

For each of the SageMaker training jobs, I used the ml.m5.xlarge instance.

Failed Attempt: Naive Distributed Computing

My first attempt was to check how the training time scaled as the number of instances increases. I expected to see a roughly linear improvement: if the number of instances doubles, then the training time should be cut in half.

I used a naive approach: other than the settings mentioned in the "Experiment Setup" section, I used default values, including a replication factor of 1. What I found was very different from my expectations:

There are two things to note here. Going from 1 to 2 instances increases the training time, though I expected to see the training time cut in half. Going beyond 2 instances, the training time is relatively flat.

Going from 1 to 2 instances demonstrates an internal switch of non-distributed to distributed training with XGBoost. There is a hyperparameter called tree_method which sets the algorithm used for computing splits at a node in a decision tree of XGBoost. The default for tree_method is "auto". For one instance, their greedy algorithm called "exact" is used. For more than 1 instance, an algorithm called "approx" is used, which approximates the greedy algorithm. The logic behind "auto" is explained in the XGBoost documentation (as well as other algorithm choices) and the implementation of "exact" and "approx" are described in the XGBoost paper from Chen and Guestrin.

The second thing to note is that after two instances, the training time remains flat as more instances are added. This is because each instance is training on the same data. In SageMaker, unless otherwise specified, the entire training data set is distributed to each instance. This setting is called "FullyReplicated". Another way to think of this is that the training data is replicated a number of times equal to the number of instances, and one copy is sent to each instance. XGBoost, however, expects each instance to receive a subset of the full data set.

The data distribution can be corrected by sharding the training data (dividing it into different files, one for each instance), and defining an s3_input object such as

import sagemaker
s3_input_train = sagemaker.s3_input(s3_data=s3_location,
                                    content_type='csv',
                                    distribution='ShardedByS3Key')

and then starting a training job by passing the s3_input objects for training (and a similar one for validation):

xgb.fit(inputs={'train': s3_input_train, 'validation': s3_input_validation})
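The sharding step itself (dividing the data randomly into one file per instance) could look roughly like the following Python sketch. The file naming scheme and the helper names are assumptions for illustration; uploading the resulting files under a common S3 prefix is left out.

```python
import random


def shard_rows(rows, n_shards, seed=0):
    """Randomly split rows into n_shards lists of (almost) equal size."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Take every n_shards-th row so shard sizes differ by at most one.
    return [shuffled[i::n_shards] for i in range(n_shards)]


def write_shards(rows, n_shards, prefix, seed=0):
    """Write one CSV file per shard and return the file paths.

    The "<prefix>-NNN.csv" naming is a hypothetical convention.
    """
    paths = []
    for i, shard in enumerate(shard_rows(rows, n_shards, seed)):
        path = f"{prefix}-{i:03d}.csv"
        with open(path, "w") as f:
            f.writelines(r if r.endswith("\n") else r + "\n" for r in shard)
        paths.append(path)
    return paths


if __name__ == "__main__":
    shards = shard_rows([f"row{i}" for i in range(10)], n_shards=3)
    print([len(s) for s in shards])
```

With one file per instance under the same prefix, "ShardedByS3Key" can then hand each instance a different part of the data.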

Take-Aways

XGBoost takes different default actions for the hyperparameter tree_method when moving to distributed training from non-distributed training. We should be mindful of this when estimating training times and when tuning hyperparameters.
XGBoost expects data to be split for each of the instances, but SageMaker by default sends the entire data set to each instance. We need to set the data distribution to "ShardedByS3Key" in SageMaker to match the expectations of XGBoost.

Failed Attempt: Using the Greedy Algorithm

To correct my previous failed attempt at distributed training, I made two changes to my experiment:

I set the value of the hyperparameter tree_method to "exact", so that each training job uses the same value of tree_method.
I set the data distribution for SageMaker to "ShardedByS3Key", and divided my training set randomly so that each instance gets a different piece of the training set.

In addition to the expected training times (i.e. doubling the number of instances cuts the training time in half), I also tried increasing the replication factor to get a sense of the scaling of training time compared to the size of the training set. I expect something similar: if the size of the training data doubles, then the training time should double.

The first plot shows the training times for each of the 4 replication factors. The second plot is the same as the first, but in log scale. The dotted lines indicate my expected training time (i.e. doubling the number of instances should halve the training time).
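The expectation behind the dotted lines can be written down as a one-line model: training time proportional to the amount of data and inversely proportional to the number of instances. In the Python sketch below, base_time_s stands for a measured single-instance time at replication factor 1; the value used in the example is made up.

```python
def expected_training_time(base_time_s, instances, replication_factor=1):
    """Ideal scaling: time grows with data size and shrinks with instances."""
    if instances < 1 or replication_factor < 1:
        raise ValueError("instances and replication_factor must be >= 1")
    return base_time_s * replication_factor / instances


if __name__ == "__main__":
    base = 600.0  # hypothetical single-instance training time in seconds
    for n in (1, 2, 4, 8):
        print(n, expected_training_time(base, n, replication_factor=4))
```

Measured times above this ideal line indicate overhead from distributing the work.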

This actually looks pretty good! The training times match well with what one might expect. The trainings with higher replication factors require more time to run computations as there is more data to process. In fact, it's about a factor 2 increase in the training time when the training data size is doubled. It's also worth pointing out that more training data results in better scalability. In fact, with lower replication factors, the training time plateaus (actually, it even increases a little) with a larger number of instances. This would suggest that the overhead costs are eating the benefits of distributing the workload.

At first glance, everything seems to be ok: more training data implies longer training times, more compute resources implies shorter run times. But a check of the training and validation errors shows that something is not right:

As the number of instances increases, the error for training and validation increases. This is an artifact of the hyperparameter tree_method. For distributed training, XGBoost does NOT implement the "exact" algorithm. However, SageMaker has no problem letting us select this value in the distributed trainings. In this situation, the training data is divided among the instances, and then each instance calculates its own XGBoost model, ignoring all other instances. Once each instance is finished, the model from the first instance is saved, and the others are discarded.

The timing and error graphs reflect this behavior: as the number of instances increases, the training data on any given instance is smaller, resulting in faster trainings but worse error. A cheaper way to replicate this experiment is to throw away a percentage of the training data and then train with only one instance.

Take-Aways

Don't use "exact" for the value of tree_method with distributed XGBoost, because it's not actually implemented on the XGBoost side. Use instead "approx" or "hist".
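Since SageMaker does not reject this combination, a small guard before starting a training job can help. The Python sketch below is a hypothetical helper (not part of the SageMaker SDK or XGBoost) that refuses "exact" for more than one instance and otherwise sets tree_method explicitly, so that single-instance and distributed runs do not silently use different algorithms.

```python
def with_explicit_tree_method(hyperparameters, instance_count,
                              distributed_method="approx"):
    """Return a copy of the hyperparameters with tree_method set explicitly.

    Raises ValueError if "exact" is requested for a distributed training.
    """
    hps = dict(hyperparameters)
    tree_method = hps.get("tree_method", "auto")
    if instance_count > 1:
        if tree_method == "exact":
            raise ValueError(
                '"exact" is not implemented for distributed XGBoost; '
                'use "approx" or "hist" instead')
        if tree_method == "auto":
            # Make explicit what would otherwise be chosen implicitly.
            hps["tree_method"] = distributed_method
    return hps


if __name__ == "__main__":
    print(with_explicit_tree_method({"max_depth": 6}, instance_count=4))
```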

Successful Attempt: Distributed XGBoost with SageMaker

With the learnings from the previous attempt, I repeated the experiment, but this time using "approx" for tree_method. This does introduce a new hyperparameter, called sketch_eps, for which I use the default value.

The scaling looks good here and is similar to that of the second attempt, albeit with longer training times. A check of the training and validation errors is more satisfying:

From the training and validation errors, we do see noise appearing. Note that there is randomness to using XGBoost: the piece of the training set given to each instance was selected randomly for each training, and node splitting in a decision tree has randomness (see hyperparameters like subsample or colsample_bylevel).

Take-Aways

Using many instances with a "low" amount of training data is a waste of computational resources. For example, using a replication factor of 1, the training time of using 10 instances is not much better than using 3 or 4.
When the training data is sufficiently large, doubling the number of instances approximately halves the training time.
The scaling in training data size is about what we expect: doubling the training data approximately doubles the training time.

Conclusion

Amazon SageMaker makes it easy to scale XGBoost algorithms across many instances. But with so many "knobs" to play with, it's easy to create an inefficient machine learning project.

How to work remotely at Zalando

2020-03-13T00:00:00+01:00

This document is heavily informed by remote work guidance from other companies and authors. Notable sources include FYI's 11 Best Practices for Working Remotely and Laurel Farrer’s How to Design Powerful Rituals for Successful Distributed Companies. Special thanks to Timo from GiantSwarm for sharing learnings in an ad-hoc phone call. Other sources are linked in the appendix. We would like to highlight that we added a link to Alice Goldfuss’ Work in the Time of Corona, which was published after this document was available internally, because of how succinctly and thoroughly she covers areas that other guidelines address partially at best. Zalando has some remote working experience due to our tech hubs, but we do not consider ourselves experts in this matter. That being said, we want to share our internal guidelines in the hope that others might find them useful.

Going fully remote as a company from one day to the next is a challenge. Working remotely requires (1) a clear set of “rules to live by” that have 100% buy-in across the company, and (2) a healthy system of meetings, events, and habits that keep people communicating.

Due to the current circumstances, we have an opportunity to practice remote collaboration. Compared to just one team member doing mobile work and everyone else being co-located, we have the advantage that everybody is in the same situation (all remote). You can even get to know your colleagues better. Maybe introduce your co-workers to your cat during a video call.

This document contains guidelines, tips, and expectations to make 'remote' possible in our current situation. Please read these carefully and apply them in your teams adjusting to your special circumstances, if needed. The most important baseline rules to follow are:

Get VPN (needed for some internal Zalando tools and datacenter access) and make yourself familiar with Zalando's privacy information [internal link]
Establish daily standups via chat and video
Have weekly 1:1s between manager and team members
Perform weekly team retrospectives
Establish personal and team rituals
Prioritize documentation and clear communication
Embrace asynchronous work and communication

We expect every tech leader in Zalando to follow these baseline requirements, and support and empower their teams. The appendix contains the FAQ, additional tips, and resource links.

Guidelines

Managers

💬 Establish daily standups via chat (asynchronous) and video call (synchronous).
👫 Establish regular weekly 1:1 meetings (video calls) to check in regularly with your directs.
😊 Create a safe environment and culture for team members to report when they are away from the keyboard (e.g. "I'm AFK" in team chat, or via Google Chat Snooze) to prevent the feeling of being pressured to always be online.

Practice good meeting etiquette

🎥 Prefer Hangouts Meet over chat, turn on video to understand non-verbal communication.
📵 Be present and don’t fiddle with the phone.
🤩 Use agendas to communicate the purpose of a meeting.
📄 Share a document as pre-read and solicit comments before the meeting.
📝 Write meeting notes (assign a note-taker!) and share them.
👍 Define action items and owners.
⏲️ Start on time, end on time.

GitLab provides some good advice for All-Remote Meetings.

Prioritize documentation and clear communication

Document more than normal e.g. outlines of your ideas, next steps, meeting notes.
Collaborate virtually, e.g. virtual whiteboards & sticky notes (use Google Slides or Google Jamboard, a digital whiteboard), work on documents in real-time. Check out “Working with Google Software at the Zalando Workspace” instructions [internal link].
Share how you feel by using emojis 🤗. What’s going well? What’s not going well? Explain how you are feeling and when you need help.
Empathy is everything: always assume positive intent. Tone and nuance can get lost over chat, so assuming your colleague is coming from a positive place helps with potential misunderstandings. If you think your colleague acts weird, or a chat is getting too long or confusing, have a video call.
Say what is obvious too: communicating everything explicitly is key to avoid misunderstandings.
Take care of the Google Drive structure so that people can find documents faster. Familiarize yourself with the search features, e.g. searching within a subfolder is possible via the triangle on the right of the search bar 🔍.

Create boundaries between work and life

Boundaries between work and life get blurred when working remotely. We want to prevent the work environment and the home environment from merging into one. It’s easy to adopt bad routines, like waking up and immediately checking your email, sitting down for breakfast while working, or working throughout the day without going for lunch or regularly drinking some water. Suddenly it’s 21:00 and you’re dehydrated, hungry, a headache is creeping up, but you’re still working. Unplugging is important to stay healthy. Our core working hours are between 10:00 and 16:00 local time and yes, you are responsible for getting your work done and for attending meetings while working your regular hours, but please use the following guidance to stay healthy.

📅 Time-block your day so you have a start and end time: configure your work time in Google Calendar. This makes it transparent for your colleagues and manager when you are available and when not.
🍲 Plan and block your lunch slot as a recurring public event. This helps you stay healthy and manages expectations for availability.
⏰ Plan regular breaks, e.g. by setting a break reminder, and stay hydrated.
💻 Create a physical space for work at home that you can leave at the end of the day (i.e. don’t work from bed).
💼 Use props that signal your brain that you’re working (e.g. work shoes, work shirt).
📴 Switch off when you're away from work.
🎵 Use background music or sound to help with concentration. Background noise helps in creating an environment which you associate with working. You can share your favorite playlists within the team for that.

Tune In

👋 Check in to the team chat by stating that you’re starting to work and what you worked on the day before.
📥 Assign tickets (e.g. GitHub issues) to yourself when you start working on them. Leave a comment to inform the whole team about progress.
🔕 Update your chat status (e.g. mute) when you need to focus.

Tune Out

✌️ Check out of the team chat ("heading out from work", "AFK" for "away from keyboard").
🌜 Use the Google Chat "Snooze Notifications" feature to signal absence. If you have set up your work hours in Google Calendar, this happens automatically for non-work hours.
📤 Commit work frequently instead of only committing locally. Finish up by committing in the evening and provide a short summary in the ticket on the progress or blockers.

Make yourself visible and be responsive

Organizing expectations around communication creates a healthy relationship between employees and supervisors — no one will have concerns about productivity expectations or be left in the dark.

✉️ Catch up on email at least twice a day to stay informed.
📅 Check your calendar, respond to invites with a 'yes' or 'no' plus comment. Attend appointments.
💬 Scan relevant chats (esp. your team chat) every hour.
📟 Find a balance between synchronous team interaction and embracing the benefits of an asynchronous work style. You can stay online when working, and update your team via chat on what you’re working on, or manage expectations around check-ins. This way we compensate for the loss of ad-hoc availability from not sitting next to each other.

Reflect and Adapt

The new remote situation is radically different from how your team worked before. Set up weekly team retrospectives (video call) to recap what worked well and what can be improved. We recommend using Google slides to simulate a whiteboard with sticky notes: the first slide is the whiteboard. The following slides are for each team member (one slide per member) where they can prepare red & green "sticky notes" before the retrospective meeting. The meeting runs similar to a physical meeting: 1) every team member copies their notes to the "whiteboard" (1st slide), 2) the team clusters the notes on the whiteboard, 3) the team selects 1-2 most important issues, 4) the team defines action items and next steps.

APPENDIX

FAQ

What about the monthly tech onboarding and engineering bootcamp?

Tech onboarding and engineering bootcamp will happen remotely through Google Hangouts Meet.

What should I do if my Internet connection at home is unavailable or slow?

If you don't have Internet at home, or your connection is unstable or slow, and you have no company-provided phone, please contact Helpdesk, which can provide phones for tethering.

Other Tips for Successful Remote Work

These tips are copied from Trello's excellent The Best Advice For Remote Work Success From 10 Global Teams (free PDF guide).

Chat vs. Video Calls

Recognizing the humanity in team members via seeing their face on a video call is a game-changer:

Tools can mask intention and humanity: Keep in mind that at the end of the chat is a human being with feelings and reactions.
If you have constructive feedback to give, do it over a video call so your intentions come across.
Due to a lack of verbal and emotional cues: One person may perceive a chat convo as an argument when the other person perceives it as a discussion.
Resentment builds over time due to underlying issues not being addressed. Digital communication gone rogue can breed misunderstandings and hurt feelings.

Expect Structure

Establish a process, structure, and agenda around meetings and updates so everyone can follow along no matter their location. Assign a meeting lead and scribe (note taker) to ensure key decisions are captured in writing.

Treat Others With Transparency

Keep important information accessible for everyone: log side chat decisions, record video meetings, and always take notes to share in public (company-internal) spaces.

Use Video for Face-to-Face

Seeing as up to 10,000 non-verbal cues can be exchanged in one minute of face-to-face interaction, video meeting tools (Hangouts Meet) are essential for building relationships with others. You can set up team-building activities over video that play into the strengths of remote work, like sharing your office view or introducing your cat to your coworker’s dog and watching the furry friendship unfold.

Never work from bed

"When I started working 100% remotely at Buffer, I set the rule for myself that I would never work from bed, and here’s why:
- It becomes more difficult to fall asleep because working from bed weakens the mental association between your bedroom and sleep.
- You may start to feel like you’re always at work and lose a place to come home to.
- Your quality of sleep will decrease because using electronics before bed reduces the melatonin you need to fall asleep."
- Hailley Griffis, Future of Work Marketer, Buffer

Resources

Grafana: How to work from home effectively: Tips from the remote-first Grafana Labs team

mistro: 10 hacks to improve your WFH experience in 10 minutes (or less)

Alice Goldfuss: Work in the Time of Corona

GitLab: All-Remote Meetings

GitLab: What not to do when implementing remote: don't replicate the in-office experience remotely

Yonder: How to Design Powerful Rituals for Successful Distributed Companies

fyi: 11 Best Practices for Working Remotely

TechRepublic: The 10 rules found in every good remote work policy

GiantSwarm: Taking Care Remotely

GiantSwarm: Giant Swarm is "Remote First" and I put it to the test

GiantSwarm: Surviving and Thriving: How To Really Work Remotely

Trello: The Best Advice For Remote Work Success From 10 Global Teams [Free Guide]

SRECon2017: Don't Call Me Remote! Building and Managing Distributed Teams - Facebook

Inc.: It Only Takes 7 Words to Create the Last Work-From-Home Policy You'll Ever Need

Andreas Klinger: Managing Remote Teams - A Crash Course - Startup Lessons Learned

Wired: How to Work From Home Without Losing Your Mind

Open Source: June Updates - New releases, continue to foster diversity and inclusion in tech

2019-07-15T00:00:00+02:00

Project Highlights

Kopf - Kubernetes Operator Pythonic Framework now supports built-in resources and can be used to write controllers of any kind (pods, namespaces, mixed), not only of custom resources. Check out the latest release for more details https://github.com/zalando-incubator/kopf/releases
Skipper publishes new releases weekly. Important features implemented include support for proxying the Kubernetes API server and for Kubernetes externalName services from ingress.
Kubernetes Ingress Controller for AWS added dualstack and ssl-policy support in its last release. The controller helps to configure AWS application load balancers according to Kubernetes Ingress resources.

Foster diversity and inclusion in tech

Zalando hosted the launch event of Persian Women In Tech Berlin. This is the first event of the new Berlin chapter of the international organization Persian Women in Tech. Shery Brauner spoke about her career path from Iran to Germany, daring to take risks, and now leading an engineering team at Zalando.

Learn more about Zalando's initiatives around diversity and inclusion topics here: https://jobs.zalando.com/en/diversity

Zalando Around The World

Meet and connect with Zalando representatives at tech events around the world:

OpenExpo Europe, Madrid, Jun 20: Hong Phuc Dang, InnerSource manager, shared how Zalando applies open source practices internally, tools and processes that we use to foster alignment and collaboration within the company. View Hong's slides

Data Engineering Meetup, Berlin, Jun 20: Suyash Garg gave an update on the Zalando Nakadi project, with a focus on Nakadi SQL - a SQL engine for streaming queries over Nakadi Event Types.

ContainerDays, Hamburg, Jun 24 - 26: Henning Jacobs presented his well-known Kubernetes Failure Stories

Public Zalando Tech Presentations Repository is a compiled list of public talks by Zalando employees including meetup presentations, recorded conference talks, slides, etc. We try to keep the list up-to-date. Do check it out!

How we release open source projects

2019-05-27T00:00:00+02:00

This blog post describes how we manage the process of proposing, reviewing and approving projects to become open source, while at the same time ensuring project code follows our compliance rules, and the maintainers of the projects are aware of their responsibilities.

See our formal release guidelines

Overview

The process involves five steps that take the project from internal source code, through a review phase to our incubator, which eventually results in the project being graduated into our top level organisation, or archived as an inactive project due to lack of activity or maintainers:

An internal project is proposed for release by a Zalando engineer
The project proposal is reviewed by the internal open source review group
If approved, the project is published on the Zalando Incubator on GitHub
The project activity and health is monitored by the open source team
The project graduates from the incubator and into the main Zalando organisation, or the project is decommissioned and marked as archived.

How we monitor incubator projects and decide on whether to promote or archive them will be detailed in a later blog post.

Proposing a new open source project

The first step to getting an internal project published on the Zalando Incubator is to fill out a Google form, which is available here, and confirm that you understand our requirements.

Anyone inside of Zalando can do this and this step serves 2 purposes:

To collect information required to publish a project, such as its current location, who will be maintaining it and the long term plan for maintaining it.
To set expectations for the maintainers, such as amount of time needed to maintain the project, sign-off from the developers' engineering lead and ensuring the project does not require internal Zalando dependencies.

You can see a public version of the approval form without validation here.

Questions addressing who signed off on publishing the project and how many hours the developers can commit to maintaining it serve as a good way to set expectations, both for the lead who approved and for the maintainers. Running a sustainable project requires commitment, and we do not expect developers to use their private time. Instead, making time to work on the project should be part of the conversation.

We also address the need to have basic project health files in place such as a Code of Conduct, ways for users to get in touch in case of security issues, features or bugs, by providing maintainers with a standard set of files for guidance. We do this for 2 reasons:

Ownership of code should be visible to other teams inside Zalando and to potential audits. Beyond ownership, these files also communicate how to contribute, how to report security issues, and our code of conduct.
Communication channels must be public, so maintainers of a project can be approached by external contributors. We want to avoid the throw code over the wall antipattern, so having clear ways to reach our maintainers is a central part of taking active ownership of code.

The Open Source Review Group

When a project is proposed, it is automatically shared on an internal mailing-list that consists of everyone at Zalando currently maintaining an approved project. This group is currently about two hundred people, which allows us to spread the decision making process across many different people and viewpoints.

Discussing the why

The point here is to have as many eyes on the proposal as possible. Specifically, we are interested in discussing the WHY of releasing a project, and the 3 questions below are central to this discussion:

Will the project be sustainable?
Does Zalando gain any value from open sourcing and maintaining it long term?
Does it have any value to anyone outside Zalando?

When code is released as open source, you are essentially sharing something of value, and you are also taking responsibility for committing time to the additional overhead associated with open sourcing. This commitment and exchange of value should be justified. There are multiple ways to look at this, such as:

The project contributes positively to the employer branding efforts and supports hiring of tech talent
The project helps establish the company as a leader in a certain domain
The project will gain features and bugfixes from external community members
The maintainer team could gain valuable knowledge through collaborating with external community members

At Zalando we've seen several projects contribute to our employer branding efforts. It is, however, a side effect and should not be the main reason for open sourcing. It is of course nice that Zalando is recognised for its Kubernetes (External-DNS, Stackset-Controller, es-operator), PostgreSQL (Patroni and postgres-operator) and Machine Learning projects (Flair and Fashion-mnist). Nonetheless, it is hard to measure the brand impact of such projects, and it is not a long-term motivation for the maintainers or Zalando.

Justifying open sourcing is not easy. A fair amount of guessing is involved, since you do not know how people outside the company will receive and adopt your projects. However, making an assessment of the possible impact before release will be good guidance for the project maintainers.

Reviewing project quality

Besides discussing the WHY, the open source team looks at compliance-specific areas which could be a blocker for releasing:

Do we use dependencies which have incompatible licensing?
Does the source code contain anything confidential (such as tokens, URLs, passwords, etc.)?
Does the project contain functionality or IP which gives Zalando a competitive advantage (such as the code that powers our search results)?
Is the project something Zalando would consider trying to patent?

We use a dependency licensing scanning tool, as well as a source code scanner to look for tokens and passwords, to automate this as much as possible.
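
The post does not name the scanners used, so the following is only a rough sketch of the idea behind a source code scan for secrets. It assumes two illustrative regular expressions (an AWS-style access key ID and a quoted password or token assignment). The pattern names and the scan_text helper are hypothetical, and a real scanner covers far more cases.

```python
import re
from typing import List, Tuple

# Illustrative patterns only (assumptions, not the rules of any specific tool).
PATTERNS = {
    # AWS access key IDs are 20 characters long and commonly start with "AKIA".
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted value assigned to something that looks like a credential name.
    "password_assignment": re.compile(
        r"(?i)\b(password|passwd|secret|token)\b\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line in text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


if __name__ == "__main__":
    sample = 'user = "bob"\npassword = "hunter22"\nkey = AKIAABCDEFGHIJKLMNOP\n'
    for lineno, name in scan_text(sample):
        print(f"line {lineno}: possible {name}")
```

A scan like this produces false positives and misses secrets that match neither pattern, which is one reason to treat it as an aid to the review rather than a replacement for it.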

Review Meeting

Once a month the review group sits down with the maintainers proposing new projects. The discussions from the mailing-list are considered by the group and a decision is made. The project is either released or rejected, or the maintainers are asked to improve certain aspects of the project before it can be released. By including the maintainers directly in the discussion we avoid having a blackbox reviewing projects in secrecy. Instead, the discussions are fast and transparent to everyone involved.

Depending on the number of project proposals, the meeting takes between 30-60 minutes. For each project reviewed, the open source team writes a one page release notes document which outlines why the project is being released, the discussion in the meeting and the measures taken to ensure our compliance rules are followed.

After the review meeting, the open source team sits down with the maintainers and performs the release of the project on GitHub.

Publishing the source code

After mailing-list discussion and approval in the monthly meeting, the project is released. We have a specific approach to doing this:

We only transfer the current state of the repository to GitHub, so we do not include the git history. While having the history would be very valuable for tracking down the decisions behind code changes, it is simply too big of a security risk and would require the maintainers to audit all commits.
We automatically merge project files with our baseline files to ensure all repositories have a minimal set of files. These are templated with employee names, emails and GitHub names, so contact info and metadata are consistent.
The project is set up with a dedicated team assigned to it, with the correct branch protection in place and compliance tooling installed by default (we have a bot called Zincr for this).

And that is our process for initially releasing new projects. I hope it gave you an insight into what a company of Zalando's size has to consider before releasing new code and how we have tried to keep the process simple and transparent for the maintainers of our projects.

In future posts, I will go through how we monitor current projects, how we decide what to keep and what to decommission as the projects evolve.

Understanding Redis Background Memory Usage

2019-05-16T00:00:00+02:00

A closer look at how the Linux kernel influences Redis memory management

Recently, I was talking to a long-time friend, previous university colleague and former boss, who mentioned that Redis was failing to persist data to disk in low memory conditions. For that reason, he advised never letting a Redis in-memory dataset grow bigger than 50% of the system memory. Considering how wasteful that practice would be, it's interesting to understand why this can happen and to look for alternatives that ensure Redis can use as much memory as is available to it, without sacrificing its durability.

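#!/usr/bin/env python

import random
import string
import uuid
import redis

MEM_GB = 2 * 1024**3  # total RAM of the test VM (2 GB), expressed in bytes
KEY_SIZE = 1024**2  # size of each value: 1 MB
TOTAL_KEYS = int((MEM_GB * 0.5) / KEY_SIZE)  # enough keys to fill half of the RAM

def gen_data():
    # Build 1 KB of random characters and repeat it 1024 times to get a 1 MB value.
    return ''.join([random.choice(string.ascii_letters + string.digits) for x in range(1024)]) * 1024

r = redis.StrictRedis()  # connects to localhost:6379 by default

for i in range(TOTAL_KEYS):
    # Keys are converted to str because redis-py 3.0 and later reject UUID objects.
    r.set(str(uuid.uuid4()), gen_data())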

This script generates random key/value pairs of 1MB each, using up to half of the total memory available. As it was executed on a virtual machine with 2GB of RAM, it creates a dataset about 1GB in size. Considering the memory used by the OS and all other processes, we can be sure that Redis is now using a bit more than 50% of the total system memory. From this point in time, calling BGSAVE will result in an error:

127.0.0.1:6379> BGSAVE
(error) ERR

And the following message will appear in /var/log/redis/redis-server.log (on an Ubuntu 18.04 LTS system):

10202:M 13 Sep 11:34:16.535 # Can't save in background: fork: Cannot allocate memory

Looking at the source code for this operation, this message is shown when the fork() system call returns -1. In its man page, we can see that this return code only means that it failed and no child process was created. Based on that information and the error message, one might say that the process failed because it was duplicating the entire dataset in memory, an action that can't be done with less than half of the memory available.

Digging through a bit of Unix history, we'll find that the first-generation of Unix OSes indeed duplicated the whole parent address space when fork() was called. On modern kernels like Linux, this doesn't happen anymore and the NOTES section of the same man page mentions this in detail:

Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.

A copy-on-write approach is much more efficient than actually copying data from one place to the other. The child process will share the same memory pages as its parent, but in the end will only need enough memory to create pointers to the actual data. Each of these memory pages will be copied if, and only if, the child process tries to write something to it, hence the name copy-on-write (CoW). As the data is being dumped to disk, this is a read-only operation that results in virtually no increase in memory usage.
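
To make the behaviour concrete, here is a minimal sketch in Python, assuming a Unix-like system where os.fork() is available. The helper name fork_and_mutate is made up for illustration. It only demonstrates the visible semantics that copy-on-write preserves: a write in the child lands on the child's private copy, and the parent's data stays untouched. It does not measure memory, and because CPython's reference counting writes to pages even when data is only read, it is not a faithful model of the memory footprint of a Redis background save.

```python
import os
from typing import Tuple


def fork_and_mutate(data: bytearray) -> Tuple[bytes, bytes]:
    """Fork, let the child overwrite its view of data, return (parent_view, child_view)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it starts with the same memory contents as the parent.
        try:
            os.close(read_fd)
            data[:] = b"child"  # this write only affects the child's pages
            os.write(write_fd, bytes(data))
        finally:
            os._exit(0)  # leave immediately, without running parent cleanup code
    # Parent process: read what the child saw after its write, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_view = pipe.read()
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    parent_view, child_view = fork_and_mutate(bytearray(b"parent"))
    print("parent sees:", parent_view)
    print("child saw:", child_view)
```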

The question now is: if nowhere near double the amount of memory is needed, why is it still failing? The answer is that the Linux kernel cannot make the compromise of allowing a child process to point to that amount of data, as there's no guarantee it won't modify it. If the kernel allowed that, it could result in a situation where the total system memory wouldn't be enough to hold everything that was allocated by both parent and child processes. The good news is that there's a way to overcome that, presented as a tip in the Redis log file:

10202:M 13 Sep 11:33:09.943 # WARNING overcommit_memory is set to 0! Background
save may fail under low memory condition. To fix this issue add
'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the
command 'sysctl vm.overcommit_memory=1' for this to take effect.

The message is a bit misleading, as a system that is using a bit more than 50% of memory isn't exactly in a "low memory condition," but it is still consistent with what we know about the problem so far. Before trying any command or configuration without knowing exactly what it does, let's look at what the 1 option means in the overcommit_memory section of the proc file system man page:

In mode 1, the kernel pretends there is always enough memory, until memory actually runs out. One use case for this mode is scientific computing applications that employ large sparse arrays. In Linux kernel versions before 2.6.0, any nonzero value implies mode 1.
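
Before changing the setting, it can help to check which mode is currently active. The sketch below assumes a Linux system that exposes the value at /proc/sys/vm/overcommit_memory. The helper names and the short descriptions are my own paraphrase of the three modes listed in the proc man page, not kernel output.

```python
from pathlib import Path
from typing import Optional

# Short paraphrases of the modes documented in proc(5).
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit, never refuse an allocation up front",
    2: "strict accounting, allocations beyond the commit limit are refused",
}


def describe_overcommit(mode: int) -> str:
    """Translate a vm.overcommit_memory value into a human-readable description."""
    return OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit_mode(path: str = "/proc/sys/vm/overcommit_memory") -> Optional[int]:
    """Return the current mode, or None if the file is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_overcommit_mode()
    if mode is None:
        print("overcommit mode not available on this system")
    else:
        print(f"vm.overcommit_memory = {mode}: {describe_overcommit(mode)}")
```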

$ sudo sysctl vm.overcommit_memory=1
vm.overcommit_memory = 1
$ redis-cli
127.0.0.1:6379> BGSAVE
Background saving started

After that there will be much better messages in the Redis log:

10202:M 13 Sep 11:47:04.663 * Background saving started by pid 10337
10337:C 13 Sep 11:47:05.833 * DB saved on disk
10337:C 13 Sep 11:47:05.839 * RDB: 0 MB of memory used by copy-on-write
10202:M 13 Sep 11:47:05.885 * Background saving terminated with success

Back-Pressure Strategy for a Sharded Akka Cluster

2019-05-09T00:00:00+02:00

AWS SQS polling from sharded Akka Cluster running on Kubernetes

NOTE: This blog post requires the reader to have prior knowledge of AWS SQS, Akka Actors and Akka Cluster Sharding.

My last post introduced Akka Cluster Sharding as a Distributed Cache running on Kubernetes.

As that proof of concept (PoC) proved promising, we started building a high-throughput and low-latency system based on the experience we gained.

Background

The system under consideration polls (fetches) messages from AWS SQS and does the following:

Processes polled SQS messages (such as JSON modifications)
Stores polled SQS messages in a datastore
Stores the latest state derived from polled SQS messages in-memory
Publishes the processed SQS message to destination AWS SQS (for other systems to work with them)
Finally, acknowledges the polled SQS messages back to the source AWS SQS.

This sounds pretty simple to implement at first, but turns into a challenging task when it happens at scale (up to 45,000 SQS-messages-processed/second).

Characteristics of the SQS message(s):

The size of an SQS message varies from 5KB to 100KB
An SQS message is uniquely identified by an identifier, let’s call it event_id, and there are more than 250,000 unique event_id(s) in the system
SQS messages are versioned, and some lower-versioned SQS messages are acknowledged back to the source AWS SQS (as these messages do not affect the state of the system) without any processing (JSON modification), storing into the datastore, or publishing to the destination AWS SQS
SQS messages are evenly distributed by event_id, i.e., in theory, all the SQS messages in one batch have a unique event_id

The Problem

Polling AWS SQS is easy. Controlled, dynamic polling based on the workload of a highly distributed system, where failure is inevitable, is challenging.

In the beginning, the implementation was simple and straightforward. One Actor (let’s say SQS Batch Poller) was responsible for polling and sending those polled SQS messages to desired entity actors to be processed, stored, published to destination SQS and eventually be acknowledged back to source SQS.

Moreover, the performance (time taken to process, CPU, memory, etc.) of the system depended on the size of the SQS messages. A 5KB SQS message was quicker to process and required fewer resources than a 100KB SQS message. This variation in the size of the SQS messages made the workload of the system very dynamic and unpredictable.

This implementation worked fine with a few thousand messages in SQS, but failed catastrophically when this number grew to millions.

The failure happened because the SQS Batch Poller Actor kept polling SQS messages from AWS SQS without any knowledge of the state (processed or unprocessed) of already polled SQS messages. This filled the cluster with more than 120,000 unprocessed SQS messages and reduced the throughput to 10–12 SQS-messages-processed/sec. This resulted in unreachable Akka cluster nodes (Kubernetes Pods), killing them with OOM and eventually bringing down the whole system (Akka cluster).

Why did the Akka Cluster stop polling after ~120,000 SQS messages? Because that’s the limit imposed by AWS SQS. SQS can only have ~120,000 un-acknowledged or in-flight messages.

A better approach to poll SQS, without hitting the Akka cluster’s limits and killing it, was needed. The SQS Batch Poller Actor needed to be aware of the workload of the system and adjust the rate of polling AWS SQS accordingly.

Solution

The solution was to inform the SQS Batch Poller Actor about the number of unprocessed SQS messages (the workload) in the system, i.e., to implement Back-Pressure.

The key point in the Back-Pressure strategy was to limit the number of unprocessed messages the cluster can have at any given point in time. This strategy ensured that SQS is only polled if there is a demand for more SQS messages in the system and allowed the system to behave in a predictable manner irrespective of the size of SQS message.

The diagram below depicts the high-level architecture of the Back-Pressure Strategy.

The architecture consists of two main Actors, namely SQSBatchPollerManager and SQSBatchPoller, responsible for managing Back-Pressure and Polling SQS.

Before starting to define and implement Back-Pressure strategy, a few important details/assumptions need to be laid down.

maxUnprocessedMessages: A configurable limit on the maximum number of SQS messages that can be present in the system at any given point in time. This limit can be adapted according to the throughput requirements and system limits. Increasing this limit comes at the cost of higher resource usage (memory, CPU, network, etc.).
parallelism: Parallelism factor to limit the number of SQS batches polled in parallel. This guards against peaks in resource usage, such as overwhelming a database or a third-party service with a burst of thousands of requests at once to load the initial state of Entity actors.
batchSize: Each SQS batch can have a maximum of 10 SQS messages.
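
How these three settings could combine into a polling budget can be sketched as follows. This is an assumption about the arithmetic, not the production code: the function name batches_to_poll is hypothetical, and it reserves a full batchSize for every batch because a poll may return up to 10 messages.

```python
def batches_to_poll(
    unprocessed: int,
    max_unprocessed: int,
    parallelism: int,
    batch_size: int = 10,
) -> int:
    """Return how many SQS batches may be polled right now.

    unprocessed is the number of messages currently in the system,
    max_unprocessed corresponds to maxUnprocessedMessages, and
    batch_size to batchSize (at most 10 messages per SQS batch).
    """
    # Free capacity left before the system would exceed its limit.
    free_slots = max(0, max_unprocessed - unprocessed)
    # Never start more batches than the capacity or the parallelism allows.
    return min(parallelism, free_slots // batch_size)


if __name__ == "__main__":
    print(batches_to_poll(unprocessed=0, max_unprocessed=1000, parallelism=8))
    print(batches_to_poll(unprocessed=950, max_unprocessed=1000, parallelism=8))
    print(batches_to_poll(unprocessed=995, max_unprocessed=1000, parallelism=8))
```

With an empty system the parallelism factor is the binding limit. As the system fills up, the remaining capacity takes over until no batch is polled at all.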

Involved Actors in Back-Pressure strategy

SQS Batch Poller Manager Actor (SQSBatchPollerManager): The SQSBatchPollerManager actor is responsible for keeping track of unprocessed SQS messages in the system and for calculating the number of messages to be polled from SQS.

SQS Batch Poller Actor (SQSBatchPoller): The SQSBatchPoller actor polls an SQS message batch from AWS SQS and keeps track of the lifecycle of the polled SQS messages. It also reports back to the SQSBatchPollerManager once the SQS message batch has been completely processed.

Entity Actor (EntityActor): The EntityActor is responsible for processing (such as JSON modification), storing into the datastore, publishing to the destination SQS, acknowledging the polled SQS message back to the source SQS and, finally, reporting back to the SQSBatchPoller about successful or failed processing of this polled SQS message.

How do these actors collectively implement the Back-Pressure strategy? After successful cluster formation, the cluster is ready to poll and process SQS messages. Let’s go through the whole process of Back-Pressured SQS polling step by step for a better understanding.

1. SQSBatchPollerManager receives a PollSqs message to start SQS polling.
2. Upon receiving the PollSqs message, SQSBatchPollerManager calculates the number of SQS batches that can be polled in parallel (parallelism) while not exceeding the maximum number of unprocessed SQS messages (maxUnprocessedMessages) the cluster can sustain. It then creates the SQSBatchPoller child actor(s) and sends each a PollSqsBatch message.
3. Upon receiving the PollSqsBatch message from SQSBatchPollerManager, SQSBatchPoller polls AWS SQS and sends the polled SQS messages to the Cluster Shard Region Actor, which in turn forwards them to the respective EntityActor.
4. Upon receiving an SQS message, EntityActor processes it (such as JSON modification), stores the state into the datastore, publishes to the destination SQS, acknowledges the polled SQS message to the source SQS and, finally, sends an SQSMessageProcessed message back to SQSBatchPoller.
5. SQSBatchPoller waits for all the EntityActor(s) to send back an SQSMessageProcessed acknowledgement. After receiving all the acknowledgements from the EntityActor(s) concerned, it sends a BatchProcessed message back to SQSBatchPollerManager and stops itself.
6. Upon receiving BatchProcessed, SQSBatchPollerManager sends itself a PollSqs message and the whole process repeats from step 2.
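
The loop described in these steps can be illustrated with a small single-threaded simulation. It is a sketch under loud assumptions: there is no Akka and no SQS here, the source queue is a plain deque, a "poller" is just an entry in a dictionary, and the class and function names are hypothetical. It only shows that the number of unprocessed messages never exceeds the configured limit.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class PollerManagerSketch:
    """Toy stand-in for SQSBatchPollerManager (not the production actor)."""

    def __init__(self, max_unprocessed: int, parallelism: int, batch_size: int = 10):
        self.max_unprocessed = max_unprocessed
        self.parallelism = parallelism
        self.batch_size = batch_size
        self.active: Dict[int, int] = {}  # poller id -> messages in its batch
        self.next_id = 0

    @property
    def unprocessed(self) -> int:
        return sum(self.active.values())

    def on_poll_sqs(self, source: Deque[int]) -> List[int]:
        """Handle PollSqs: start pollers while the budget allows (step 2)."""
        started = []
        while (
            source
            and len(self.active) < self.parallelism
            and self.unprocessed + self.batch_size <= self.max_unprocessed
        ):
            size = min(self.batch_size, len(source))
            batch = [source.popleft() for _ in range(size)]
            self.active[self.next_id] = len(batch)
            started.append(self.next_id)
            self.next_id += 1
        return started

    def on_batch_processed(self, poller_id: int, source: Deque[int]) -> List[int]:
        """Handle BatchProcessed: free the budget, then poll again (step 6)."""
        del self.active[poller_id]
        return self.on_poll_sqs(source)


def simulate(total_messages: int, max_unprocessed: int, parallelism: int) -> Tuple[int, int]:
    """Return (messages processed, peak number of unprocessed messages)."""
    source: Deque[int] = deque(range(total_messages))
    manager = PollerManagerSketch(max_unprocessed, parallelism)
    pending = deque(manager.on_poll_sqs(source))
    peak = manager.unprocessed
    processed = 0
    while pending:
        poller_id = pending.popleft()  # pretend this batch finished processing
        processed += manager.active[poller_id]
        pending.extend(manager.on_batch_processed(poller_id, source))
        peak = max(peak, manager.unprocessed)
    return processed, peak


if __name__ == "__main__":
    print(simulate(total_messages=95, max_unprocessed=30, parallelism=5))
```

In this run the limit of 30 unprocessed messages is the binding constraint, so only three batches are in flight at a time even though the parallelism factor would allow five.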

With this strategy, AWS SQS polling is controlled by the speed of processing SQS messages by the system (Akka Cluster).

What’s next

What’s described above is a simplified version of the actual Back-Pressure strategy used in the production system, but the underlying principle of Back-Pressure is exactly the same. Some obvious caveats, such as handling SQS failures, node crashes, actor crashes, and optimizations in polling AWS SQS, are excluded here and are out of the scope of this post.

I will try to write more about the handling of the failure cases listed above and optimizations in following posts.

How to Manage Stakeholder Requests in Big Organizations

2019-05-03T00:00:00+02:00

An important factor of success in an agile environment is that the team works well together. It is also important for a software engineer to be able to focus for longer periods of time with limited interruptions.

Many companies have solved the challenge of focus and dedication for the team by having a designated role, such as Scrum Master or Producer, who is responsible for managing stakeholder requests, prioritizing them and communicating to the development team.

But sometimes requests can't be evaluated by a responsible person in the first place. There are topics where somebody from the development team needs to have a look as well to help understand the technical side more deeply.

On top of that, sometimes anomalies appear on monitoring dashboards. Network slowdowns, operational issues and many other things might happen during working time and immediate action could be required to fix the issue.

Who should be responsible in this case? How can we ensure that team’s stakeholder relationships stay healthy and help us move forward?

Recently my team, which is responsible for developing an innovative machine learning product for the fashion world, faced a very similar issue.

Engineers were spending a significant percentage of their time working on ad-hoc requests from our stakeholders. There was no proven way to track or organize such requests well, so we could not guarantee the level of support we strive for.

At that point we understood that we needed either a clear owner of such topics or a pre-defined collaborative responsibility. It simply did not work out-of-the-box and the team needed to institutionalise the meaning of ownership. And we were up for the challenge.

So the team decided to introduce an internal role in our technical team: a facilitator or, as we call it, Batman. The role is perceived more as an honor than a burden, and everyone is comfortable with doing it ad-hoc.

Key principles of the role are:

Every member of the team shares responsibility for all stakeholder requests and service health, by performing facilitator duties on a shift basis
Facilitator duties are only valid during working hours

Key benefits of the role:

Stakeholders are always provided with support within a guaranteed lead time (normally up to 2 hours)
Knowledge about services and requests is spread more evenly in the team by performing the duties and learning from other team members
Quantity of ad-hoc requests / issues is always visible on the team’s dashboard

When to set it up?

The role is reasonable when:

The team is working in a big organization with many external and internal stakeholders
There are regular incoming ad-hoc requests and/or anomalies on monitoring dashboards
A significant part of the team (40-50%) is busy working on ad-hoc requests during the sprint
Iteration goals are not achieved regularly because of influx of unplanned work

How to set it up?

Make sure to have an open conversation in the team. Acknowledge things that you would like to take care of. Spend some time agreeing within the team on what the role means and how it should work. The final definition and duties of the role can differ; everything depends on the team and the skill set within it.

Here are a few important guidelines we want to share that helped us to set up the role:

Define a clear set of expectations for the role, and make sure duties are well described and understood by the team and stakeholders
Set up a schedule for taking on the role and make sure it is flexible enough.
Weekly shifts have proven to be optimal for our team.
Rule of thumb: no person should do two shifts in a row.
Clearly define what is not within the scope of the role.
Set up an F.A.Q. section and maintain it; it will serve both team and stakeholders well.
Consider talking about the role actively in team retrospectives, or even having a dedicated retrospective about the role every 1-2 months. Openly talking about successes and challenges helps to adjust the process.
Count in the amount of time needed to perform the duties during planning.
Set up a short handover meeting for remaining tasks from the previous shift.
Document the findings, so the need for the role fades away with time (invest in proper runbooks and knowledge sharings)

What is the feedback from the team regarding the role?

Structured approach to ad-hoc requests is a big plus.
Noticeable improvement in knowledge sharing among team members.
Everyone should assume equal responsibility when doing the facilitator job.
The role only works well if everybody in the team embraces it and is diligent when performing facilitator duties.
When team members have different backgrounds (for example, backend engineering and research engineering), time is needed to adjust to each other’s technical stack and way of thinking.
Handover of existing requests needs to be thought through better.
Sometimes there are not enough small tickets for the team member on the facilitator shift to pick up.

What is next?

We truly believe that talking about the role and how it develops helps us to adjust the process as we move forward.

We will continue developing the role inside our team to help us become even better.

Learning DevOps as a Software Engineer

2019-04-25T00:00:00+02:00

At Zalando the teams are autonomous and involved in the entire software development process - from gathering stakeholder requirements to design, implementation, testing and deployment. For me, this was one of the greatest challenges/opportunities of joining Zalando and it allowed me to grow on so many dimensions of software development, one of these being DevOps.

When I initially joined Zalando I had previously been focused only on software development and I was eager to understand how my software should be deployed and operated.

As part of the autonomy mindset, each team is given an AWS account where they can deploy their services. There is common infrastructure based on STUPS (fully open source by the way) that provides a common way to handle logging, monitoring and deployment concerns. Today we are actively moving to a Kubernetes based setup and a fully integrated continuous delivery platform.

There are three main topics that I faced while doing DevOps: Monitoring or Visibility, Reliability, and Software Delivery. Let’s focus on each one individually and how learning about it improved the solutions I bring to production.

Monitoring / Visibility

For a period of time, we did not know how our application was behaving. This lack of visibility included not knowing whether our users were seeing errors, or what the latencies of our backends for frontend were.

This problem became apparent when there were some errors in one endpoint and we only learned about it when notified by the end users. This was a personal wake up call to better understand how the applications the team owns should be operated.

We started by measuring the four golden signals:

Latency - We gathered the latency perceived by our application on the various endpoints and from the load balancer’s perspective. Differences between these two signals can for example showcase long Garbage Collector pauses that may not be visible in internal application metrics.
Request rate - Abnormal variations should be investigated, especially during a deployment. One can also learn about the saturation point in terms of requests by monitoring this signal during load tests.
Saturation - We included in this CPU and memory consumption, TCP connection stats like new connections, total connections and the ones in TIME_WAIT and CLOSE_WAIT states.
Error and Success rate - Like the latency, we measure this inside the application on the various endpoints and on the load balancer level. Inconsistencies between these two could be explained by misconfiguration of timeouts on the LB level or other abnormal scenarios like the application refusing new connections.

We chose to not alert on saturation signals, and only use the latency and the error rate since these are the metrics that affect the end user experience. If there is no impact on latency and error rate, having the CPU at 99% is completely acceptable and actually a sign of good design since it would mean that application requires very little slack.
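The alerting choice described above can be sketched as a small rule. This is only an illustration: the class name, method name and both thresholds are hypothetical and not taken from our actual alerting setup.

```java
final class AlertRule {
    // Hypothetical thresholds, chosen only for illustration.
    static final double MAX_P99_LATENCY_MS = 500.0;
    static final double MAX_ERROR_RATE = 0.01; // 1% of requests

    /**
     * Decides whether to raise an alert. CPU usage is accepted as an input
     * only to make the point explicit: it is deliberately ignored.
     */
    static boolean shouldAlert(double p99LatencyMs, double errorRate, double cpuUsage) {
        return p99LatencyMs > MAX_P99_LATENCY_MS || errorRate > MAX_ERROR_RATE;
    }

    public static void main(String[] args) {
        // CPU at 99% but users are unaffected: no alert.
        System.out.println(shouldAlert(120.0, 0.001, 0.99)); // false
        // Moderate CPU but users see slow responses: alert.
        System.out.println(shouldAlert(800.0, 0.001, 0.40)); // true
    }
}
```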

These monitoring capabilities provide us with an understanding of how our system is behaving in real time and information about application usage patterns, and help us to foresee possible problems. Now when we are developing a new service, we do not go to production without having good monitoring in place beforehand.

Reliability

Once the monitoring was improved, we saw a lot of inefficiencies that were introduced by our backend for the frontend. We expected our latencies to match closely the backend metrics but this was not the case. Upon further investigation, it was discovered that our authentication strategy was introducing significant unnecessary latency.

We also looked at where the stateful components of the system were being stored and added a Redis deployment to hold the session data. Previously with every deployment our users would need to log in again which meant that releases had to be aligned with them.

Our work on reliability highlighted that we had not properly considered how different components interacted when designing the system. Now, thinking about which components can fail and how their failure can affect the system is a common exercise when building new services or even refactoring current ones.

Software Delivery

The last topic we focused on was improving the way we deliver the software the team develops. Initially releases based around docker images were manual and done from developer machines, which was not even compliant with our internal policies. The first attempt at improving the situation was focused on producing these docker images by using a Jenkins job which improved the compliance status. The second iteration moved the team to a continuous delivery workflow using Kubernetes and an internal Continuous Delivery Platform. In order to enable this without reducing the quality of delivery, we introduced end to end testing (you can read about it here). These tests run on the staging and production deployments before the traffic switch to the new deployment. If the tests fail, the deployment is aborted and we are notified via instant messaging. I am happy with the current state of our delivery process but I continue to learn and try to find improvements.

Having a continuous delivery workflow reduced the operational needs of the team and allowed us to deliver faster to our stakeholders.

With our migration to Kubernetes we have also improved our application architecture. We have simplified it to just one service in the frontend and we moved all the stateful components to a Redis datastore. In the image below you can see the architecture before and after the Kubernetes migration.

System architecture before migration to Kubernetes

System architecture after migration to Kubernetes

As a software engineer, getting involved with DevOps helped me to better understand how our applications are delivered to our customers and empowered me with crucial knowledge to investigate and fix issues autonomously. From my working experience, gaining DevOps knowledge as a software engineer has greatly improved my ability to have an impact.

Open Source: March Updates - A new Kubernetes operator & more Cloud Native Apps

2019-04-25T00:00:00+02:00

Project Highlights

A new operator is added to Zalando’s list of Cloud Native Applications. Elasticsearch Operator - an operator for running Elasticsearch in Kubernetes with focus on operational aspects, like safe draining and offering auto-scaling capabilities for Elasticsearch data nodes, rather than just abstracting manifest definitions.

To make things even simpler for developers, we also released a new framework that helps to build Kubernetes operators in Python. Kopf - Kubernetes Operator Pythonic Framework - a framework and a library to make Kubernetes operators development easier, just in few lines of Python code. The main goal is to bring the Domain-Driven Design to the infrastructure level, with Kubernetes being an orchestrator/database of the domain objects (custom resources), and the operators containing the domain logic (with no or minimal infrastructure logic).

Dedicated Open Source Time In The Zalando Cloud Infrastructure Team

The engineering team led by Jannis Rake-Revelant, which is responsible for some of our most popular open source projects, has, since the beginning of the year, dedicated 20% of its time to ensuring its open source projects are actively maintained and improved. As a company we believe it is important to take long term responsibility and show commitment to the open source community which we benefit from every day.

Zalando Around The World

Meet and connect with Zalando representatives at tech events around the world:

GAIA Conference, Göteborg, Apr 9: Mikio Braun - our AI expert - gave a keynote on Putting Data Science into Production. Check out his talk below:

Strata Conference, London, Apr 29 - May 2: Dirk Petzoldt - Head of Engineering will share how Zalando handles big data in our online marketing platform. More details

Coding Serbia Conference, Novi Sad, May 15 - 17: Luis Mineiro - Senior Site Reliability Engineer, will explain how we set up monitoring and alerting at Zalando and go over the basic concepts of Distributed Tracing and OpenTracing.

GitHub Satellite, Berlin, May 22 - 23: Per Ploug - Open Source Manager, will talk about Open Source and security and try to answer the question: who is actually responsible for the security of open source dependencies?

How to set an ideal thread pool size

2019-04-18T00:00:00+02:00

We all know that thread creation in Java is not free. The actual overhead varies across platforms, but thread creation takes time, introducing latency into request processing, and requires some processing activity by the JVM and OS. This is where the Thread Pool comes to the rescue.

The thread pool reuses previously created threads to execute current tasks and offers a solution to the problem of thread cycle overhead and resource thrashing.

In this post, I want to talk about how to set an optimal thread pool size. A well-tuned thread pool can get the most out of your system and help you survive peak loads. On the other hand, even with a thread pool in place, thread handling could be a bottleneck.

Why should I set a limit for my thread pool?

There is a lovely pre-configured thread pool - Executors.newCachedThreadPool. Why don't we just use it?

Let's look at how it works:

/** Thread Pool constructor */
public ThreadPoolExecutor(int corePoolSize,
                          int maximumPoolSize,
                          long keepAliveTime,
                          TimeUnit unit,
                          BlockingQueue<Runnable> workQueue) {...}

/** Cached Thread Pool */
public static ExecutorService newCachedThreadPool() {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                  60L, TimeUnit.SECONDS,
                                  new SynchronousQueue<Runnable>());
}

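Do you see this SynchronousQueue? A SynchronousQueue has no internal capacity, so tasks are handed directly to a thread rather than queued. Combined with a maximum pool size of Integer.MAX_VALUE, this means that each new task will create a new thread if all existing threads are busy. In the case of high load, at best we will get a thread "starvation" situation, at worst OutOfMemoryError.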

It is better to maintain control and not allow clients to "DDoS/throttle" our service.
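As a contrast to the cached pool, here is a minimal sketch of a pool that is bounded in both thread count and queue length, and that rejects work once both are full. The class and method names are made up for this example, and the sizes (2 threads, 3 queue slots) are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolExample {

    /** Builds a pool with a fixed upper bound on threads and on queued tasks. */
    static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                        // core == max: the pool never grows past this
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded queue instead of SynchronousQueue
                new ThreadPoolExecutor.AbortPolicy());   // reject when threads and queue are full
    }

    /** Submits `tasks` blocking tasks and returns how many of them were rejected. */
    static int countRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = newBoundedPool(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until all tasks are submitted
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 tasks run, 3 wait in the queue, the remaining 5 are rejected.
        System.out.println("rejected = " + countRejections(2, 3, 10));
    }
}
```

Whether rejecting is the right reaction depends on the service; ThreadPoolExecutor.CallerRunsPolicy is another built-in option that slows the submitter down instead of failing.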

Know your limits

Before you start sizing a thread pool you have to understand what you are limited to. And I don’t only mean hardware.

For example if a worker thread depends on a database, the thread pool is limited by the database's connection pool size. Does it make any sense to have 1000 running threads in front of a database connection pool with 100 connections?

Or if a worker thread calls an external service which can handle only a few requests simultaneously, the thread pool is limited by the throughput of this service as well.

It is obvious but we often forget it.

Of course, one of the most important resources for a thread pool is the CPU. We can get the total number of CPUs that we have as follows:

int numOfCores = Runtime.getRuntime().availableProcessors();

This was the classic way to get the number of CPUs for many years. But be careful with this call if you run your service in a container environment. *Without specifying any constraints, a containerized process will be able to see the hardware on the host OS.

*Here are some nice articles on this topic: Better Containerized JVMs in JDK10

and: Nobody puts Java in a container.
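One defensive option is to let an operator state the CPU budget explicitly and only fall back to what the JVM detects. The sketch below assumes a hypothetical environment variable named WORKER_CPU_LIMIT; this name is not a JVM or Kubernetes convention. Recent JVMs also offer the -XX:ActiveProcessorCount flag to override the detected value.

```java
final class CpuBudget {

    /**
     * Returns the number of cores to size pools against.
     * `override` is a hypothetical operator-supplied value; when it is
     * absent or invalid, fall back to what the JVM detected.
     */
    static int effectiveCores(String override, int detected) {
        if (override != null) {
            try {
                int parsed = Integer.parseInt(override.trim());
                if (parsed > 0) {
                    return parsed;
                }
            } catch (NumberFormatException ignored) {
                // fall through to the detected value
            }
        }
        return Math.max(1, detected);
    }

    public static void main(String[] args) {
        int detected = Runtime.getRuntime().availableProcessors();
        // WORKER_CPU_LIMIT is an assumed variable name used only in this sketch.
        int cores = effectiveCores(System.getenv("WORKER_CPU_LIMIT"), detected);
        System.out.println("detected=" + detected + ", using=" + cores);
    }
}
```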

Other constraints like memory, file handles, socket handles, could be critical as well.

Just give me the formula!

Brian Goetz in his famous book "Java Concurrency in Practice" recommends the following formula:

 Number of threads = Number of Available Cores * (1 + Wait time / Service time)

Wait time - the time spent waiting for IO-bound tasks to complete, say waiting for an HTTP response from a remote service.

(It is not limited to IO-bound tasks; it can also be time spent waiting to acquire a monitor lock, or time when the thread is in the WAITING/TIMED_WAITING state.)

Service time - the time spent being busy, say processing the HTTP response, marshaling/unmarshaling, any other transformations etc.

Wait time / Service time - this ratio is often called blocking coefficient.

A computation-intensive task has a blocking coefficient close to 0. In this case, the number of threads is equal to the number of available cores. If all tasks are computation intensive, then this is all we need. Having more threads will not help.

For example:

A worker thread makes a call to a microservice, serializes response into JSON and executes some set of rules. The microservice response time is 50ms, processing time is 5ms. We deploy our application to a server with a dual-core CPU:

  2 * (1 + 50 / 5) = 22 // optimal thread pool size
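The same calculation as a small helper, in case you want to derive the size from measured values. The class and method names are invented for this sketch.

```java
final class ThreadPoolSizing {

    /** Number of threads = cores * (1 + wait time / service time). */
    static int optimalThreads(int cores, double waitTimeMs, double serviceTimeMs) {
        if (cores <= 0 || waitTimeMs < 0 || serviceTimeMs <= 0) {
            throw new IllegalArgumentException("cores and service time must be positive, wait time non-negative");
        }
        double blockingCoefficient = waitTimeMs / serviceTimeMs;
        return (int) Math.round(cores * (1 + blockingCoefficient));
    }

    public static void main(String[] args) {
        // Dual-core server, 50ms waiting for the microservice, 5ms of processing.
        System.out.println(optimalThreads(2, 50, 5)); // 22
        // Purely computational task: one thread per core.
        System.out.println(optimalThreads(2, 0, 5));  // 2
    }
}
```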

But this example is oversimplified. Besides an HTTP connection pool, your application may have requests from JMS and probably a JDBC connection pool.

If you have different classes of tasks it is best practice to use multiple thread pools, so each can be tuned according to its workload.

In the case of multiple thread pools, just add a target CPU utilization parameter to the formula.

Target CPU utilization is a value in [0..1]; 1 means the thread pool will keep the processors fully utilized.

The formula becomes:

 Number of threads = Number of Available Cores * Target CPU utilization * (1 + Wait time / Service time)
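To illustrate, the sketch below sizes two pools that share a dual-core machine. The 50/50 split of the CPU and the wait and service times of the JDBC pool are assumptions made up for this example; in practice they would come from your own measurements.

```java
final class MultiPoolSizing {

    /** Number of threads = cores * target utilization * (1 + wait time / service time). */
    static int poolSize(int cores, double targetUtilization, double waitTimeMs, double serviceTimeMs) {
        if (cores <= 0 || serviceTimeMs <= 0 || waitTimeMs < 0
                || targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("invalid sizing inputs");
        }
        double size = cores * targetUtilization * (1 + waitTimeMs / serviceTimeMs);
        // Never return an empty pool, even for very small utilization shares.
        return (int) Math.max(1, Math.round(size));
    }

    public static void main(String[] args) {
        // Each pool is allowed to use half of the two cores.
        int httpPool = poolSize(2, 0.5, 50, 5); // 2 * 0.5 * (1 + 10) = 11
        int jdbcPool = poolSize(2, 0.5, 20, 5); // 2 * 0.5 * (1 + 4)  = 5
        System.out.println("http=" + httpPool + ", jdbc=" + jdbcPool);
    }
}
```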

Little's law

At this step we can get an optimal thread pool size, we know our theoretical upper bounds and we have some metrics in place. But how does the number of parallel workers change the latency or throughput?

Little's law can be used to answer this question. The law says that the number of requests in a system equals the rate at which they arrive, multiplied by the average amount of time it takes to service an individual request. We can use this formula to calculate how many parallel workers there should be to handle a predefined throughput at a particular latency level.

L = λ * W

L - the number of requests processed simultaneously
λ – long-term average arrival rate (RPS)
W – the average time to handle the request (latency)

Using this formula, we can calculate the system capacity, or how many instances running in parallel we need in order to handle the required number of requests per second with a stable response time.

Let's get back to our example. We have a service with an average response time of 55ms (50ms wait time + 5ms service time) and a thread pool with 22 worker threads.

Applying Little's law formula we get:

22 / 0.055 = 400 // the number of requests per second our service can handle with a stable response time
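Little's law can be applied in both directions: from a pool size to the throughput it sustains, or from a target throughput to the number of workers needed. A minimal sketch, with helper names invented for this example:

```java
final class LittlesLaw {

    /** lambda = L / W: requests per second that `workers` can sustain at the given latency. */
    static double throughputRps(int workers, double latencyMs) {
        if (workers <= 0 || latencyMs <= 0) {
            throw new IllegalArgumentException("workers and latency must be positive");
        }
        return workers * 1000.0 / latencyMs;
    }

    /** L = lambda * W: workers needed for a target rate, rounded up. */
    static int workersNeeded(double targetRps, double latencyMs) {
        if (targetRps < 0 || latencyMs <= 0) {
            throw new IllegalArgumentException("rate must be non-negative and latency positive");
        }
        return (int) Math.ceil(targetRps * latencyMs / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(throughputRps(22, 55));  // 400.0
        System.out.println(workersNeeded(400, 55)); // 22
    }
}
```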

Conclusion

These formulas are not a silver bullet and will not magically fit every project, but they can be a great starting point. Their disadvantage is that they focus on the average number of requests in the system and might not suit various traffic burst patterns. You can start with the values calculated by these formulas and then adjust your thread pool properties after load testing.

And one more time - “measure don’t guess”!

End-to-end load testing Zalando’s production website

2019-04-11T00:00:00+02:00

Black Friday is the busiest day of the year for us, with over 4,200 orders per minute during the event in 2018. We need to make sure we’re technically able to handle the huge influx of customers.

As a part of our preparations we ask all of our teams to perform load tests to ensure their individual components will handle the expected load. In addition, and due to the distributed nature of our system's architecture, we also need to ensure it will handle the expected load once all components have to work together. To ensure this, we simulate real user behaviour using different scenarios that contain the most common user actions (e.g. visiting the homepage, browsing the catalogue, adding an item to cart, checking out) on a large scale on the production system.

In preparation for Black Friday 2018 our Testing & Quality Strategy team, in cooperation with our SRE (Site Reliability Engineering) team, took on the challenge of providing the tooling required to perform these simulations.

A new set of tools

Our starting point was to look at what was done to prepare for Black Friday 2017. We reviewed a tool that had been created internally to perform end-to-end load testing. It used scenarios written in JavaScript and ran using a distributed set of Puppeteer nodes, each of them interacting with an instance of a Chrome browser. Unfortunately, due to the heavy usage of resources by the browser instances at such a large scale, it was prohibitively expensive to run and so couldn’t be used again.

We went back to the drawing board and, along with feedback gathered from stakeholders that were involved in the previous year’s efforts, started to design a new solution.

We first looked at existing load testing tools such as JMeter, Locust, and Vegeta; all of which we had previous experience with. We quickly realised that, whilst they all individually had their merits, none of them alone completely solved the problem.

We needed a way of recording scenarios representing a user interacting with our website in order to simulate traffic from real users. What's more, we needed a method of translating the scenarios into load test scripts that could be replayed in a lightweight manner and reused. Finally, we needed a mechanism for cost-effective scaling of the load.

After a few design rounds we came up with the following multi-tool solution:

Locust

From the learnings of the previous year, we knew that creating our own load test runner from scratch would not be feasible, nor desirable, in the time we had. Therefore, we decided the core of our solution would be one of the already existing load testing tools that we had previously investigated. We settled on using Locust due to its in-built ability to run in a distributed mode and its support for scripting (it uses Python files as inputs).

HAR files

In order to easily record the scenarios, we realised we could again reuse existing technologies: a web browser’s session can be easily exported by modern browsers as HAR (HTTP Archive) files. This, however, presented us with a new challenge: how do we convert these HAR files into something Locust can run?

Transformer

We built Transformer to convert the scenarios recorded as HAR files into Locust’s input format, a Python file (the "locustfile").

Transformer considers each HAR file as a single scenario. It takes every HTTP request recorded there, and expresses it in Locust's words. The result is a locustfile that exactly replays these requests. Transformer can combine multiple HAR files (i.e. multiple scenarios) into a single locustfile, allowing us to replay many scenarios in the same load test, each with its own customizable amount of load (more users visit the catalog than the Help page). And because there are always exceptions, a plugin mechanism allows us to arbitrarily modify and enrich each request by injecting pre- and post-processing code in the locustfile. This allowed us to, amongst other things, replay dynamic requests requiring temporary, JavaScript-generated tokens without actually executing any JavaScript.

Zelt

The final piece of the puzzle, cost-effectively generating the required load at large scale, was solved by our in-house Kubernetes infrastructure. We built Zelt to orchestrate the execution of Transformer, the distribution of the generated locustfile, and the deployment of the Locust controller and worker nodes into one of our Kubernetes clusters. It allowed us to easily provision, scale-up/down, and execute our load tests.

One more for the road

Another tool, a Node.js library called PuppetHAR, was created to allow us to programmatically generate HAR files from Puppeteer scripts rather than manually in the browser; ultimately this was never used.

In practice

We built these tools in close collaboration with our SRE team. They provided us with the scenarios, crafted using data from our analytics teams to represent real user journeys through the Zalando website. They also provided us with the inputs to the equations required to translate our internal target metrics, in requests per second (RPS), to Locust’s input format of number of concurrent users.

To run the load tests, virtual situation rooms were created including us, SRE, and members of the component teams. Using the previously created locustfile, we used Zelt to deploy the load testing infrastructure in Kubernetes, and used the Locust dashboard to initiate and control the tests.

As the tests were running, the teams that owned the various components receiving the load were monitoring their production components using our in-house monitoring tools and would let us know if and how things were showing signs of strain under load. We used the same monitoring tools to observe our progress towards reaching our load targets and concluded the tests once they had been reached and sustained for a period of time (or if a component team requested us to stop because of a bottleneck found).

In our final configuration, we ended up running four Locust stacks consisting of 300 nodes each, and reached a total of 130,000 RPS observed.

Learnings

Overall, the project was a success. We were able to execute end-to-end load tests against the production website on a scale larger than the actual traffic received during the peak of the Cyber Week campaign. Thanks to this, the teams were able to act upon the information gathered, discover their optimal scaling configuration, and fix the bottlenecks that were discovered, all before Black Friday.

Throughout the process, however, we faced some challenges that we needed to overcome.

Reverse engineering

With all record and playback methods, there is no guarantee that what you record will be replayable without error as states tend to change over time.

Our tooling was no different and we faced this issue frequently. Session identifiers would expire, articles would go out of stock, rate limiting would kick-in, and security measures would catch us out.

For each instance we had to essentially reverse-engineer our own website and work out which piece was tripping us up and how to work around it. Not only was this a technical challenge but also one of communication and coordination as we needed to find the teams responsible for the components we were fouling and work with them to find solutions.

Often we could only verify our solutions during a load test, as the symptoms would only appear in high-load scenarios; this was obviously costly and slow. In order to try to alleviate this, we started working even closer with the component teams, bringing them to sit with us and pair on developing solutions whilst they monitored their systems for us.

Locust

We were happy with Locust initially, but as our solution grew more specific and the scale of the load increased, the disadvantages of the tool started to show up.

Two of the Locust features that we relied on the most were the distributed mode of the test runners and the weight system for the scenarios. As we learned the hard way, unfortunately the two features combined don’t work as expected on a large scale. We soon started to realize that the health of the Locust project was far from what we had hoped - some very old issues were not fixed, new issues were not addressed and the maintainers were not responsive. By this time it was already too late to change the tool. Eventually we forked the project and made the necessary changes to immediately address the most painful issue.

Next steps

At the time of writing, we’ve already open-sourced Transformer and Zelt, and plan to open-source PuppetHAR in the future; so keep an eye on our Zalando Incubator homepage!

Internally, we’re already preparing for Black Friday 2019 and continue to improve our tools and processes for ensuring a smooth customer experience during any and all high-load situations.

Developing Zalando APIs

2019-04-04T00:00:00+02:00

How Zalando software engineers develop internal and external APIs

Imagine a distributed system consisting of 8,000+ active service applications; developed and operated by 300+ delivery teams in six tech hubs. 1,200+ software engineers use various technologies to implement business needs and are responsible end-to-end for those components.

A pretty complex system of people and software. And a real challenge to manage the complexity and balance fast delivery and technical debt.

We believe that interfaces are highly valuable technical assets. That’s why we decided early on standards for API engineering including a common API specification language for RESTful service-to-service communication. In our case, it is the OpenAPI standard for synchronous REST interface specification and JSON Schema for asynchronous events.

API-as-a-Product and API First Principle

Zalando is customer-obsessed. As software engineers at Zalando, we treat our APIs as products, always putting ourselves “in the customer’s shoes.” The best way to provide value is to create a well-designed, explicitly defined, discoverable, reusable, easy-to-understand interface which implements the demanded functionality.

We believe in the API First principle and always follow it. It allows us better alignment between a service provider and consumers (i.e. contract) and contributes greatly to the API and overall system design quality.

And here is how we typically develop an API:

API Design

Often it starts with a business requirement or an idea for a new product. As a software engineer, I make myself familiar with the domain and the requirements. Already, I think about who the potential consumers of my new API are, how they interact with the interface and what are the main building parts (business processes, resources) of the domain and the API.

The next step is to draft an API outside the code first. We adopted RESTful API web service principles with JSON as main payload format, and use OpenAPI Specification language (a.k.a. Swagger Spec) as format for our API descriptions.

API design is a crucial aspect of the API quality. In order to have the same look-and-feel experience for the API consumers and to raise the quality bar of APIs, our engineers and architects condensed their knowledge and experience in Zalando API Guidelines. I consult them often for design principles and best practices when drafting a new API.

Zalando’s API Portal provides a central repository where API specifications of all deployed services can be discovered. I regularly check related APIs to learn from API design practices and to align my application API with other service APIs of our ecosystem.

The API Portal is the central hub for all API-related information. I can use a comprehensive search to find APIs with their deployment and version information. Basically, I get all I need here to be able to consume the API: contact and deployment information, service location, authorization and authentication requirements, and the most important part: the OpenAPI specification of the interface. This is a great source for examples and inspirations providing even the history of APIs.

With all information in place, I can draft a very first version of the API specification, using the editor of my choice, be it Zalando’s IntelliJ IDEA “Swagger Plugin”, Swagger Editor, or vi.

All API specifications have to be compliant to our API Guidelines. This ensures the same quality and look-and-feel experience across all Zalando APIs. The API Guild is the owner of the guidelines, but everyone is encouraged to contribute.

Becoming and staying compliant takes some effort. Fortunately, some of the guideline rules can be automatically checked by Zally - our API Linter. Zally is a set of open source tools to automate compliance and quality assurance of RESTful APIs. It’s able to check lower-level aspects like the format and naming, as well as higher-level interface specification details like error handling and security.

Now it is time to get some real feedback.

Early Review and Feedback

After a team-internal discussion and prototyping work, I ask our peers, the API consumers and other stakeholders, for feedback. They should get the best experience and be able to easily integrate it into their components. Typically, I create a GitHub (Enterprise) pull request, a great tool for collaborative reviews, on the API specification file. If the review is a bigger one (new prominent, external, highly used API, or a bigger change) I additionally invite a special group of API enthusiasts, the API Guild, and involved architects. They provide feedback on API guidelines compliance and best practises, and inspire me to improve and harden my API design.

API Implementation

After the API design is aligned, implementation of the service is the easiest and the most fun part. We have a polyglot microservice application environment. Based on our Zalando Tech Radar principles, our teams have high autonomy to pick the best technologies to implement their services. Hence, there are lots of ways to realize the API. Depending on the implementation use-case, I would pick, for instance, Spring for Java or Kotlin, Akka HTTP for Scala application, or would Go for Resty. If I decide to use Python this time, our open sourced Connexion framework will implement a big part for me. It handles HTTP requests as specified in API specification and maps endpoints to Python functions. Many teams manually implement the API definitions. Sometimes, generators are also used to create, for example, Java or Scala client and server stubs out of the API specification.

Publishing and Operation

In order to promote my newly implemented service, I’m going to publish its API. This is done via deployment artifact, in our case a Docker image. All I need to do is to include the API specification into the image. That’s it. After a deployment to our Kubernetes production infrastructure, the API and all context information appears in API Portal. From now on the API’s history is tracked and it can be discovered by everyone at Zalando.

From the first deployment on, I’m interested in the performance of my API. With a few lines of (Kubernetes deployment) configuration, I can activate monitoring and get a (ZMON) monitoring dashboard “for free.” It is endpoint-based and provides metrics like the number of requests per status code class, latency (incl. percentiles), and some basic client load monitoring. Additionally, I can easily configure authorization & authentication settings and rate limitations for the endpoints via deployment configuration. Especially in times of many microservices, these infrastructure features are a great relief from the operational perspective.

Conclusion and Outlook

Our vision is to build new business capabilities in days, not in weeks, to be highly efficient in engineering and operation of our SaaS ecosystem at scale, based on consistent high quality APIs that are sustainable and fun to use. We are now closer to this vision due to our tools and infrastructure features, like API design principles and guidelines, open API review culture, API portal, and API monitoring.

We are happy that our principles and tools find adoption outside Zalando by other tech companies and API enthusiasts. Our open source API Guidelines and API Linter gain external contributors and improve every day.

We plan to enrich our API service infrastructure with features like out-of-the-box monitoring, authentication/authorization, and rate limitation. Our API Portal will be a central access hub for relevant API service operation information (e.g. hostnames of deployed API services, effective rate limits) and will support backward compatibility checks and subscriptions for notification on API changes, and much more. We will raise adoption and developer experience via application-centric integration of all infrastructure services consistently supporting the developer productivity journey over design, code, build, deploy, and operate phases.

If you want to learn more about API engineering at Zalando, please also check out the InfoQ interview How Zalando Delivers APIs with autonomous teams, and the earlier tech blog post On APIs and the Zalando API Guild.

A Story of Rust

2019-03-28T00:00:00+01:00

Introducing Rust in an Enterprise Environment

Discovery

Sometime in 2013, I stumbled across a new programming language on the internet called Rust. Taking a look at the language, I was impressed by its high-level features. At that time I was a backend Scala developer with a .NET background. When examining Rust, I found most of the features I used every day, like pattern matching, the “New Type Pattern” and a “Scala-like” Iterator API. It also offered something I had really been missing: no nulls and no exceptions. Since Rust is also a low-level language without a garbage collector, I was convinced to keep following the language’s progress.
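To make the features listed above a bit more concrete, here is a minimal Rust sketch. All names in it (`CustomerId`, `Order`, `find_order`, `parse_quantity`, `describe`, `total_quantity`) are hypothetical and invented purely for illustration; they are not taken from any Zalando code.

```rust
use std::num::ParseIntError;

/// New Type Pattern: a distinct type wrapping a plain u64, so a customer ID
/// cannot be mixed up with any other number at compile time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CustomerId(pub u64);

#[derive(Debug, Clone, PartialEq)]
pub struct Order {
    pub customer: CustomerId,
    pub quantity: u32,
}

/// No nulls: a possibly missing value is expressed in the type as `Option`.
pub fn find_order(orders: &[Order], id: CustomerId) -> Option<&Order> {
    orders.iter().find(|o| o.customer == id)
}

/// No exceptions: a possible failure is expressed in the type as `Result`.
pub fn parse_quantity(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Pattern matching: the compiler insists that both cases are handled.
pub fn describe(order: Option<&Order>) -> String {
    match order {
        Some(Order { customer: CustomerId(id), quantity }) => {
            format!("customer {} ordered {} item(s)", id, quantity)
        }
        None => String::from("no order found"),
    }
}

/// Scala-like Iterator API: filter, map and sum are chained lazily.
pub fn total_quantity(orders: &[Order], min: u32) -> u32 {
    orders
        .iter()
        .filter(|o| o.quantity >= min)
        .map(|o| o.quantity)
        .sum()
}
```

A caller cannot forget the "missing" or "failed" case here: `find_order` has to be unwrapped or matched before the order can be used, and `parse_quantity("seven")` returns an `Err` value instead of throwing.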

Early Prototyping

It was in 2016 when I joined Zalando as a Scala Developer. After half a year we were thinking about introducing a new application for a simple task. Somehow the question came up on what technology to use and Rust was suddenly mentioned. We did a prototype quickly, and implementing it was quite easy. It also turned out that implementing a domain model was very painless, especially regarding serialization due to Rust’s high level abstractions. Unfortunately, we did not need the application anymore but nevertheless Rust proved to be a valid candidate for solving our problems.

The Experiment

A short time later, we had some problems with our main service. It was a Scala web service that resides at a critical position within Zalando. Under high load, the application consumed great amounts of memory and sometimes even crashed with the GC running out of memory at almost 100% CPU load. This forced us to massively overscale the application. So we asked ourselves what would happen if we rewrote the application in Rust. We did just that, and it took only a few days to reimplement the application. Load tests revealed that the Rust application had much better latencies, consumed less memory and less CPU than the Scala application under the same load and, even better, it could handle more load without crashing. It is of course always easier to rewrite an application than to write one from scratch.

We added some more features and then considered taking it live. This was where we faced the first challenge. Our lead reminded us that Rust was not an “official” technology within Zalando at the time and that taking the application live would be a serious risk. That was of course correct. Our lead asked us to collect the requirements for safely taking such an application live.

Afterwards, we approached Zalando’s Technologists Guild and presented our results during a Tech Stand Up. With our Technologists Guild, we came to the conclusion that Rust should stay with the “Assess” state on our Tech Radar until we gathered more experience. We also collected requirements for deploying a Rust application but unfortunately things came to a halt since we had to focus on other topics.

What happened was that we started to implement some tooling in Rust.

Justifying Rust

It was in the middle of 2017 when we needed to implement a new service. By that time we already had a Rust Study Group running and the Rust ecosystem had evolved further. Since we knew that we couldn’t just start a service in a new technology on our own, we asked our lead whether we could do it with Rust. It was a simple streaming application doing some REST calls and writing data to Redis.

Again, he had serious concerns. We would need really good reasons to use Rust over Scala, which was still our main technology stack. He also questioned whether the tooling was ready for production use, and the question of how to onboard new team members with such a technology would have to be answered too. There were of course more questions and the stakes were high, but this was completely understandable from a lead’s perspective.

In the following weeks, reasons for using Rust were collected. We started to analyze the problems we had with our current applications and figured out how those problems could be avoided with Rust. Of course there was also the performance argument but that was definitely not the most important reason. The main reasons were Rust’s safety and productivity features. But there was one more thing: With Rust we were able to use resources efficiently and there was already the plan to move to Kubernetes. Being able to have small pods running on Kubernetes could be a real cost saver.

There was a lot of communication with our lead and we got valuable feedback on the topics where we might need a bit more reasoning. Still, things were moving slowly and the end of the year was near. At that point in time we had serious doubts that we would ever use Rust for production systems.

When things become real

It was at the end of 2017 when it was announced that the teams would be restructured due to changing requirements. We were a team of six developers and would be reduced to four. When this was announced to us there was also another revelation: Our lead said that from now on we would be a “Rust Team”. That was really unexpected and I have to admit that I did not really know what to respond to that.

Since we were planning to replace our old system with a new one, we almost immediately started to implement the first service we needed. It was a rather simple CRUD service, which was a good opportunity to onboard some of the team members to Rust. The service was ready to be used more quickly than expected, even though it was not yet fully finished. Since we needed more applications to reach our goal, we started to implement the smallest applications in parallel, thereby gradually increasing the difficulty level for the team to the final service which fully utilizes non-blocking IO.

In the end we managed to reach our goal in time, introducing a new technology along the way. Currently we have two REST services, a streaming application and multiple batching applications written in Rust, all running on Kubernetes. The new applications have been live serving data for two countries over 2018 and are expected to serve even more countries in the near future. The resource usage of our applications is far below that of our former Scala services, which reduces costs remarkably.

Conclusion

With Rust, one can build microservices taking the word “micro” literally. Rust gives the developer an “if it compiles it runs” experience which allows focus on business logic. Refactoring and even reengineering can be done quite fearlessly. The compiler is very helpful and even suggests solutions. A newcomer coming from Scala or C# already knows concepts like closures and the Iterator API which makes things a lot easier. And there is the borrow checker. Given enough support, newcomers can learn to handle it while still being productive. But one still has to be a bit resistant to pain when it comes to compile times and a lack of an easy-to-use “corporate” version of crates.io. When starting a project it is beneficial to have an experienced Rust developer on board and to not just start from scratch. We are still waiting for a stabilization of futures and async/await and the web ecosystem to become more mature since it is currently a challenge to choose an appropriate web framework/toolkit.

For us, Rust has so far been a story of success and it is likely that it will stay like this.

Running Apache Flink on Kubernetes

2019-03-22T00:00:00+01:00

Recently, I was developing a small stream processing application using Apache Flink. Zalando uses Kubernetes as the default deployment target, so naturally I wanted to deploy Flink and the developed job to our Kubernetes cluster. I learned a lot about Flink and Kubernetes along the way, which I want to share in this article.

Challenges

Compliance - At Zalando, all code running in production has to be reviewed by at least two people and all deployed artifacts have to be traceable to a git commit. The default way of deploying Flink Jobs is to upload a JAR containing the Job with any other required dependencies to a running Flink cluster. This is not compatible with our internal compliance guidelines.

Container Orchestration Readiness - One of the key selling points of Flink is to do fault tolerant stream processing. However - as will be outlined in the next section - the reliability features were not designed with container orchestration systems in mind, which makes operating a Flink cluster on Kubernetes not as straightforward as it could be.

Fragmented Documentation - Both Flink and Kubernetes are evolving quickly, rendering some documentation (especially blog posts like this one and forum/newsgroup posts) out of date. Unfortunately, the official documentation currently does not provide all the information that is needed to run Flink in a reliable way on Kubernetes.

Flink Architecture & Deployment Patterns

In order to understand how to deploy Flink on a Kubernetes cluster, a basic understanding of the architecture and deployment patterns is required. Feel free to skip this section if you are already familiar with Flink.

Flink consists of two components, Job Manager and Task Manager. The Job Manager coordinates the stream processing job, manages job submission and its lifecycle and allocates work to Task Managers. Task Managers execute the actual stream processing logic. There should always be exactly one active Job Manager and there can be n Task Managers.

In order to enable resilient, stateful stream processing, Flink uses Checkpointing to periodically store the state of the various stream processing operators on durable storage. When recovering from a failure, the stream processing job can resume from the latest checkpoint. Checkpointing is coordinated by the Job Manager - notably, the Job Manager knows the location of the latest completed checkpoint, which will become important later on.
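The checkpoint-and-resume idea can be illustrated with a toy model. This is a sketch under heavy assumptions and not Flink’s API: `Checkpoint`, `DurableStore`, `JobCrashed` and `run_job` are hypothetical names, the “durable storage” is just an in-memory struct that remembers the latest completed checkpoint, and the whole job is a single operator that sums its input.

```rust
/// Snapshot of the job: how far the input was read plus the operator state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub offset: usize, // number of input records already processed
    pub sum: i64,      // state of the single summing operator
}

/// Stand-in for durable storage; it only keeps the latest completed checkpoint.
#[derive(Debug, Default)]
pub struct DurableStore {
    latest: Option<Checkpoint>,
}

impl DurableStore {
    pub fn save(&mut self, checkpoint: Checkpoint) {
        self.latest = Some(checkpoint);
    }

    pub fn latest(&self) -> Option<Checkpoint> {
        self.latest
    }
}

/// Marker error for a simulated crash.
#[derive(Debug, PartialEq, Eq)]
pub struct JobCrashed;

/// Sums `input`, writing a checkpoint after every `interval` records.
/// If `fail_at` is `Some(i)`, the job "crashes" right before processing the
/// record at index `i`; everything since the last checkpoint is lost.
pub fn run_job(
    input: &[i64],
    store: &mut DurableStore,
    interval: usize,
    fail_at: Option<usize>,
) -> Result<i64, JobCrashed> {
    assert!(interval > 0, "interval must be positive");
    // Resume from the latest completed checkpoint, or start from scratch.
    let start = store.latest().unwrap_or(Checkpoint { offset: 0, sum: 0 });
    let mut sum = start.sum;
    for (i, value) in input.iter().enumerate().skip(start.offset) {
        if fail_at == Some(i) {
            return Err(JobCrashed);
        }
        sum += value;
        let processed = i + 1;
        if processed % interval == 0 {
            store.save(Checkpoint { offset: processed, sum });
        }
    }
    Ok(sum)
}
```

Calling `run_job` once with a simulated crash and then again with `fail_at` set to `None` on the same store yields the same total as an uninterrupted run, because the second call restarts from the stored offset and state rather than from the beginning. In this toy, whoever holds the store knows where the latest checkpoint is, which mirrors the role the text ascribes to the Job Manager.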

Flink Clusters can be run in two distinct modes: The first mode, called Standalone or Session Cluster, is a single cluster that is running multiple stream processing jobs. Task Managers are shared between jobs. The second mode is called Job Cluster and is dedicated to run a single stream processing job.

A Flink Cluster can be run in HA mode. In this mode, multiple Job Manager instances are running and one is elected as a leader. If the leader fails, leadership is transferred to one of the other running Job Managers. Flink uses ZooKeeper for handling Leader Election.

Kubernetes Deployment

Out of the two modes described in the previous section, we chose to run Flink as a Job Cluster. Two reasons drove the decision: The first reason is that the Docker image for Job Clusters needs to include the JAR with the Flink Job. This neatly solves the compliance problem as we can re-use the same workflow as we are using for regular JVM applications. The second reason is that this deployment model allows us to scale Task Managers independently for each Flink Job.

The Job Manager is modeled as a Deployment with one replica, Task Managers as a Deployment with n replicas. The Task Manager discovers the Job Manager via a Kubernetes Service. This setup deviates from the official documentation that recommends running the Job Manager of a Job Cluster as a Kubernetes Job. We think that using a Deployment is the more reliable option in this case (which is a never-ending streaming job) as the Deployment will make sure that one pod is always running whereas a Job could complete, leaving the cluster without any Job Manager. This is why our setup resembles the one describing a session cluster in the documentation.

Failures of Job Manager pods are handled by the Deployment Controller which will take care of spawning a new Job Manager. Since this is usually a relatively fast operation, this frees us from the need to maintain multiple Job Managers in hot-standby, which would increase the complexity of the deployment. Task Managers address the Job Manager with a Kubernetes Service.

As outlined above, the Job Manager keeps some state related to checkpointing in its memory. This state would be lost on Job Manager crashes, which is why this state is persisted in ZooKeeper. This means that even though there is no real need for the leader election and discovery part of Flink’s HA mode (as this is handled natively by Kubernetes), it still needs to be enabled just for storing the checkpoint state.

As we already had an etcd cluster and etcd-operator deployed in our Kubernetes cluster, we did not want to introduce another distributed coordination system. We gave zetcd a try which is a ZooKeeper API backed by etcdv3. This setup works fine, so we decided to stick with it.

One other issue we faced with this setup was that the Job Manager sometimes got stuck in an unhealthy state that could only be fixed by restarting the Job Manager. The restart is triggered by a livenessProbe which checks if the Job Manager is still healthy and the job is still running.

It is also noteworthy that this setup only works correctly with Flink > 1.6.1, as earlier versions had a bug that prevented resuming from checkpoints in job clusters.

Conclusion

The above setup has now been running in production for a couple of months and is serving our use case well. This shows that it is possible to reliably run Flink on Kubernetes, even though there are some small roadblocks on the way.

Going Further

“Flink in Containerland” by Patrick Lucas - main inspiration of the points of this post
“Redesigning Flink’s Distributed Architecture” by Till Rohrmann

Open Source: February Updates - Release new projects, join Google Summer of Code Program

2019-03-17T00:00:00+01:00

Project Highlights

Kube Metrics Adapter gained community attention as it was featured in a medium post 'Kubernetes autoscaling with Istio metrics'. Users provided very positive feedback on the project. Kube Metrics Adapter is currently maintained by Developer Productivity team at Zalando. It is a general purpose metrics adapter for Kubernetes that can collect and serve custom and external metrics for Horizontal Pod Autoscaling.
Introscope is a newly released project. It is a babel plugin and a set of tools for delightful unit testing of modern ES6 modules. It allows you to override imports, locals, globals and built-ins (like Date or Math) independently for each unit test by instrumenting your ES6 modules on the fly.
Postgres Operator has been accepted as a mentor organization of Google Summer of Code, a global program focused on bringing more student developers into open source software development. This is the first year we participate in Google Summer of Code with Postgres Operator - a project to create an open-sourced managed PostgreSQL service for Kubernetes. Students can submit their proposals until April 9. Apply now!

Cloud Native: Bug squashing night!

We are inviting users and contributors of Zalando Cloud Native Applications to meet project maintainers at our tech office here in Berlin. We will spend this evening together answering users’ questions, reviewing pull requests, improving documentation and fixing as many bugs as possible. Sign up now!

Participating projects:

Zalando Open Source Around The World

Meet and connect with Zalando developers and project maintainers at open source events around the world:

KubeCon Europe, Barcelona, May 20 - 23: There are two sessions conducted by Henning Jacobs, Head of Developer Productivity and Mikkel Larsen, Senior Software Engineer. Check out more details below:

Es-operator: Building an Elasticsearch Operator From the Bottom Up: The talk will walk through how the Elasticsearch operator was designed, what problems it solves, and how building it from the bottom up made it possible to get it into production fast, gather more learnings, and later extend the feature set to make it less manual to operate and reduce the cost of the overall infrastructure.
Kubernetes Failure Stories and How to Crash Your Clusters: This talk will show Zalando’s approach to Kubernetes provisioning on AWS, operations and developer experience, especially horror stories of operating 100+ clusters, lessons learned from incidents, failures, user reports and general observations.

PostgreSQL Day Italy, Bologna, May 16 - 17: Dmitry Dolgov will speak about ‘PostgreSQL at low-level’. In this session, he will discuss how much impact different knobs and options of the Linux kernel have on PostgreSQL and why, and what happens if you run databases in a virtualized environment or inside a container. Dmitry will share experiences of running PostgreSQL inside Kubernetes, show how to see what's going on inside and how to break something spectacularly.

The Microservices & Serverless Conference in Berlin, Berlin, Apr 1 - 2: Oliver Trosien and Mikkel Larsen will share how Zalando utilizes Kubernetes to operate large-scale Elasticsearch clusters during their presentation titled 'Operating Elasticsearch in Kubernetes'.

Devops Gathering, Bochum, Mar 11 - 13: Henning Jacobs conducted a session on ‘Ensuring Kubernetes Cost Efficiency across (many) Clusters’. His talk provided insights on how Zalando approaches this problem with central cost optimizations (e.g. Spot), cost monitoring/alerting, active measures to reduce resource slack, and automated cluster housekeeping.

Rotating Engineers at Zalando

2019-03-14T00:00:00+01:00

For the past year, our group of Engineering Leads worked to improve collaboration and cross functional communication across teams. This was the result of team retrospectives and employee surveys indicating required improvement in these areas. One initiative which we took to address these issues was to implement role rotation amongst engineers. The goal of this developer rotation was to establish cross-functional knowledge sharing, encourage cross team collaboration within the department, and bring greater product awareness.

Preparation

In order to prepare for the rotation, we first needed to answer a few questions: which teams are involved, who can rotate, for how long, and how can we ensure business continuity? For our first implementation of team rotation, we limited it to 5 teams in the Developer Productivity department, and those at the lead level and above were excluded. Next, we put forth an open sign-up form to gauge interest in taking part in the rotation; 20% of the engineers signed up. Given this interest, we concluded that with proper preparation, a two-week rotation could be done without impacting deliverables. This preparation included ensuring each team had proper onboarding documentation for new team members. This required teams to brush up their documentation, an added benefit. Additionally, teams were asked to prepare a good first issue for the new team members, much as many open source projects help new joiners with issues labeled good first issue or similar. Lastly, each new team member was paired with a mentor who could answer questions and provide context. We called those mentors buddies. This setup gave those who wanted mentoring experience a way to practice their skills in guiding, managing and coaching.

Rotation

In December 2018 we started the first two-week engineering rotation. Those taking part moved desks to their new teams. Buddies were paired up with rotating engineers to get their environment set up. The rotating engineers said hello in team stand ups and were involved in other team meetings like team retrospectives, team lunches and department stand ups. Buddies helped rotators to start with good first issues and paired with them along the way. Some of the rotators also got to see different Engineering Lead styles in 1 on 1 meetings. Other rotators participated in answering support questions.

Feedback / What we learnt

A follow-up retrospective revealed strong positive feedback. The rotation was perceived as a valuable experience to understand better what other teams do. It was also a learning experience about other teams’ products. Rotations also helped in exchanging ideas about different process workflows and problem solving approaches of Developer Productivity teams.

The rotating engineers’ fresh view without history brought a different perspective to the teams, which was beneficial. A great example of what I would like to call a success here was a deployment visualization, spanning backend and frontend components, that was driven by rotating engineers. The users’ feedback was very positive for the feature, so it was rolled out to all clusters soon after. This demonstrated that rotating engineers were able to have end user impact.

What we learned from the rotation retrospective and final survey was the need to reconsider the timing next time. December contains a holiday season at the end of the month, which influenced some rotations. First issues should be shaped in a way that rotating engineers are able to finish the tasks within the rotation time frame. Another point mentioned was that rotations, not only for engineers, should be performed regularly.

Conclusion

In retrospect, the initiative of rotating engineers was a success. Throughout the rotation period in Developer Productivity, we were able to sharpen the awareness of the teams across their processes, workflows, tasks and methods. Both buddies and rotating engineers shared their experiences and knowledge with their original teams. It also highlighted improvement areas for team processes and tasks such as offboarding, integration testing and access roles. The success of the initiative was further indicated by the request for future, regularly conducted, rotation opportunities. Our goal is to continue with regular rotations and to expand beyond the engineering role to management and supporting functions.

How to Rock your Next Product Training

2019-03-11T00:00:00+01:00

Need to introduce end-users to your product? It can be fun: we show you how

Want to give your users a great first experience with your new IT application? User trainings for your software product are the perfect opportunity. As team WMS Training, we develop and deploy training solutions for tech products within the world of Zalando Logistics, and today we’ll show you how to quickly and easily develop a training session for your product.

We’ll start you off with three steps to creating a user-centred product training and then follow up with a few ideas to make your training more fun, memorable, and engaging.

Three Steps to User-Centred Product Trainings

Imagine that you and your team have been working on a new feature. After weeks of alignment, stakeholder management, and development everything is ready for Monday’s go live. You’re eager to finally see these weeks of work materialize into a solution for your end users.

As you’re finishing up your last email of the day, one of your stakeholders pops by your desk with a “quick question.”

“Just a quick question. I know we’re going live next week. Will there be trainings for our end users?”

To which you reply, “What? Oh yes, of course. We’ll show them how to use it.”

But as they walk away, doubts begin to well up inside of you: *The go-live is coming up! There’s no time to prepare anything… What if I bore them?... But, I’m not a trainer… What if they don’t get it?*

Well, don’t worry. Even if you don’t have a lot of training experience or time, we’ve got three steps that will help you develop a user-centred training session quickly and easily, and can be applied to live trainings, webinars, and eLearnings.

**Step 1. Identify your target group and their learning objectives**

Think back to the last training you attended that felt completely irrelevant to you. Maybe it was a standard safety training, with a focus on lifting things properly, because you work for a logistics company.

Chances are that if the training felt irrelevant, the content was not aligned with your personal learning objectives. By identifying your target group and their training goals, you’re managing your training like a product, with the learner as your user – whose problems you want to solve.

Example: As a production manager, I want to learn how to pull current performance data from the system in order to evaluate my department’s output and react to it.

Learning objectives can be framed like user stories, which can make learning objectives clear.

Try it out now: Let’s work through an example to gain a better understanding. Imagine someone from your family, totally new to smartphones, and a few of their friends have recently developed a passion for photography. They want to show it off, so they decide they’d like to start using Instagram. But they need your help, as someone who is into tech matters. Describe the target group in this example. Where and how should they be trained? What should they be able to do after the training?

**Step 2. Design an assignment to check that you’ve achieved learning objectives**

How will you know that your audience has achieved the learning objectives? Test their knowledge along the way and give everyone a chance to practice.

We all know that practice makes perfect. During training, you have the unique opportunity to give the user the chance to practice with you around, before jumping into it on their own.

If your system is still under development, but accessible, you can have your users login and search for test data or even ask the participants to perform exactly the tasks that they will have in the future.

If this isn’t feasible, never fear. You can easily integrate knowledge checks, with questions like: “I can enter performance data into the production screen. True or false?” This allows you to reiterate and reinforce key points in an interactive way.

Try it out now: It’s time to create an assignment that will show you that your family member and their friends have learned what they needed to. Take five minutes to identify one activity that will show you that they have fulfilled the learning objectives you outlined in Step 1.

**Step 3. Determine what learners need to know to complete your assignment**

Now we come to the final step: creating training content. Often many of us run into the following trap; we start by focusing on content, collecting any and all materials we have: technical descriptions, complex flowcharts, stakeholder presentations, etc.

The problem? A good deal of that material may not actually be relevant to your audience and may not help them achieve their learning objective. If anything, it may overwhelm them. These three steps will help you avoid that trap.

Now that you’ve built a user-centred training, how can you go one step further and ensure that the training engages your audience? The key here is to remember that your audience has a limited attention span, so avoid long explanations when possible. Instead, break down big concepts into smaller ones and leverage interactivity to make sure that you haven’t lost anyone.

We’ve found that teaching content in an interactive way engages our audience, gives them a chance to practice what they’ve learned, and helps them to better remember important points. It has the added benefit of allowing us to check what they’ve learned. We have some examples of how you can gamify your training in the following section.

Try it out now: How many of you have read technical manuals? How much fun are those? Instead, think about how you can present your family member and their friends with the information they need. Try to avoid information overload and provide your target group only with what they need. If you can, bring in an element of interactivity to increase user engagement and enjoyment.

The result? A user-centred training that gives your audience the skills they need to successfully use your product and leaves them with a great first impression.

Three Easy-to-Implement Learning Activities

Not sure where to start when it comes to developing interactive content? We’ve got you covered. Here are a few ideas of easy-to-add interactions.

**Objective:** The learner should be able to understand a high-level process or the data-flow between systems.

Prepare: Develop a flowchart of the process or the data-flow (e.g. with Powerpoint). Print it out and cut the single steps into puzzle pieces.

Conduct: Divide your audience into teams of 2 - 5 participants. Every team gets a set of puzzle pieces and needs to discuss the order of the workflow. Afterwards you show them your solution.

**Objective:** The learner should be able to understand important terms used in your software product and know where to find them on the screen.

Prepare: Take a screenshot of your product and develop a slide with terms and descriptions of important screen elements. Print it out and cut the single steps into puzzle pieces.

Conduct: Divide your audience into teams of 2 - 5 participants. Every team gets a set of puzzle pieces and needs to discuss their positions on the screenshot. Afterwards you show them your solution.

**Objective:** The learner should be able to distinguish between options/make the right decision.

Prepare: Formulate statements for certain decisions and why they are true. Delete parts from it and leave a blank line. Print out the worksheets.

Conduct: Divide your audience into teams of two or let them work alone. Everyone receives a worksheet and they need to come up with the right solution. Afterwards you show them your solution.

If you want to learn more about the three step process, try out the free online course from NovoEd. We took the course last year and found it very useful. And feel free to get in touch with us or to send us your solutions to your exercise if you’d like us to check how you did. We’d love to see them!

How to Make Space for Research & Innovation?

2019-02-28T00:00:00+01:00

Redesigning research and product development so that the explorative nature of data science becomes a driver for innovation

Zalando leverages cutting edge machine learning technologies to be Europe’s leading online platform for fashion and lifestyle. In order to develop these products, data scientists and product roles have to work together closely.

As the Agile Coaching Team, we went on a journey to discover ways to make our machine learning product development more effective. As a result, we created tools and environments to help data scientists and product roles work together from day one; making solutions better, testing ideas faster and simplifying collaboration. Now, people in very different roles understand each other’s backgrounds and motivations better, reducing conflicts and handover costs. Solutions are viewed early from multiple perspectives and often the best ideas come from places where you least expect them.

Discovering the data science impossibility patterns

To avoid jumping to solutions that would have no effect, we did over 20 interviews looking at multidisciplinary areas and teams who build data science/machine learning (DS/ML) heavy products. We were looking out for “pains and gains” of all involved roles, analyzing artefacts and work styles, shadowing their meetings.

We found several patterns for which we developed a solution:

“Data science takes a long time” - was often stated as a dogma by data scientists, product roles and leads. Of course, there are technical constraints, large amounts of data to fetch and models to be tested and trained. But it is much more about psychology: as data science takes time, it costs a lot of money, and where so much money goes, impactful deliveries are expected. Therefore, expectations to deliver results pile up in big batches in the backlogs, and OKRs that “focus 120% on delivery” of course take a long time! That creates the second problem:
“Data scientists do not know which customer problem they are solving” - as there is a complete focus on delivery, there is no time to do proper customer discovery and problem definition. Therefore backlog refinement, planning and demoing get “hard.” That creates the third problem:
“Agile does not work in data science” - as “Scrum” is often a synonym for Agile, without a clear problem to solve, Scrum ceremonies do not work. Also, retrospectives hardly help fix the ceremonies as they are not broken. Instead, Scrum is just the “wrong” tool to discover customer problems.

One of our biggest contributions as coaches was that we created time for innovation: a five-day learning journey combining directed and self-driven workshops, an open space, coaching sessions, community work, peer-to-peer learning and teach-back formats in a co-creative way.

On this journey we outlined a three-day workshop around two top priority topics as real cases, each tackled by one multidisciplinary initiative. The initiative pulled together experts from formerly separate teams, to deliver customer value end-to-end. In this way, we progressed while learning (did not interrupt work with trainings), using multidisciplinary collaboration while introducing it. Tackling real cases like “Personal Relevance in Browsing” and “Transparency About Personalization” was a clear requirement to get the buy-in from the leads.

We had over 40 customer interviews performed by data scientists and engineers, fostering a much deeper understanding of the customer and problem space they build solutions for. Almost as a side effect, it raised empathy for product roles, improving collaboration and lowering the cost of handovers and conflicts.

With this journey, we served and enabled a toolchain for customer discovery, problem definition and small, fast and cheap experiments of solutions “prototypes” (ideas, hypotheses, assumptions). Zalando packaged these tools in a framework that we call 4D: Discover - Define - Design - Deliver.

One key element of the learning journey was the co-creation of a concept of how we can use “hypothesis” to steer collaboration. Building on hypotheses as known from “hypothesis driven software development,” product work and data science, we developed a concept that enables working with hypotheses across the whole DS/ML workflow. We start with “Explorative Research,” where we manage inputs like time, effort and scope because, by the nature of science, the output is not predictable during exploration. As an early finish criterion we set the capability to formulate a “directional hypothesis.” This enables us to switch from “Explorative Research” to “Hypothesis Testing Research,” gaining more transparency, predictability and control. Combined with hypothesis-driven product work and engineering, we can use this and science as central elements to streamline collaboration in DS/ML heavy products.

In a multidisciplinary setup we co-discovered the customer, and created hypotheses around their needs, pains and gains. We tested these hypotheses early and learned, refined and iterated.

With this journey, we created space for the unknown; the place where innovation is rooted. We equipped our teammates with the “right” tools and workflows, and sent them on a learning experience across all disciplines, ranks, teams and units to find truly new land.

A Journey On End To End Testing A Microservices Architecture

2019-02-21T00:00:00+01:00

End to end testing is a testing technique used to test the flow of an application through a business transaction. In microservices architecture there are different components working together to enable a business capability, therefore testing all of them can get tricky. In this article you can read about our team’s journey:

What our system looks like
What do you get from e2e testing?
How to define e2e tests
How to deal with authentication
What testing framework we chose
How to test canvas
How to test async flows
Automation

1) Our system

In our team we maintain a system that offers business capabilities such as the ability to explore and filter orders. The high level components that are used to enable that feature are: the front-end application, the backend for the front end, various databases (PostgreSQL, Solr and DynamoDB), message brokers (we use Nakadi), and a bunch of microservices. You can read more details about our architecture later in this article.

2) What to expect from e2e testing

As you can see the architecture is quite complex and things might break on different levels. You might have great unit testing coverage for each component, but if they can’t talk to each other, users’ expectations of the product are not met.

You can introduce some integration testing but things might get out of sync if more than one team is responsible for the same product or even if not all team members share equal ownership of each component (which they should).

You can achieve integration testing from a user perspective by mocking your dependencies (by intercepting requests). This approach adds complexity in writing tests. End to end testing, on the other hand, shifts the complexity to running all systems in a desired state where you can make your assertions confidently.

Because we wanted to be able to ensure that whenever we are releasing a new feature we are not breaking anything else, and because changes in the backend could introduce bugs if we are not on the same page, we decided to introduce end to end testing in our systems. This way we could spot bugs on staging environments.

“Having end to end tests is also a very nice way to document all the user journeys of your application.”

3) Coverage and tests definition

The first step that we took was defining the scope of the systems that we were going to put under testing. It is strongly advised that when you perform end to end testing you should put all the components under testing, but at the scale of our company this is not always possible.

In our case we decided to do “domain scoped e2e testing”, because systems out of the domain might already have some e2e testing and our systems are decoupled from each other. Also it is pretty hard to put systems that are out of the domain in the desired state you need to perform your tests.

The architecture of the systems that we wanted to test is something like this:

After the scope was known we defined the list of features that we wanted to test. Basically that was all the features, but either way having a list around helps a lot. Once you have it, it is easy to group related features (one way of grouping features is by business capability) and split them into smaller tasks so the whole team can work on them. This will also help you prioritise groups and implement the important ones first. You can prioritise by urgency as well so you have the critical ones covered first. This also helped us identify which of them require some support from other teams and this way you remove some potential blockers early.

4) Authentication on test environment

One of the problems we had was authentication. We are using an SSO server to authenticate users. The only way to log in through the SSO is by having an actual email address and a password. To achieve this we would have needed some real accounts, and having those came with a few implications. Because of this we decided to authenticate users with an auto-generated token when running end to end tests. We only implemented this feature for the staging environment, and this way we bypassed SSO.

5) Choosing a testing framework that solves main problems

So far so good. We had an idea what we wanted to achieve and pretty quickly we ended up thinking about how to start and write end to end tests.

It is suggested that when you write and run e2e tests you should be able to have a deterministic state of all the systems so that you can easily assert whether the action was performed as it should.

Our main problems were:

Have a desired state of the system. This was hard because we didn’t own all the systems.
Have a desired state of the application. This was hard because we had a component that uses the html canvas.

The first one was solved by having an API that allows us to insert some data into the system which normally was not an application use case. The second one was solved by being able to talk to the state management component from e2e tests.

Now comes the best part: choosing a testing framework. We did some research and decided to focus on 2 options, Zalenium and Cypress. Options like Nightwatch and Puppeteer were considered as well. Both Zalenium and Cypress offered a really nice set of features like video recording, pretty nice integration with CI and Docker, a clean API and a nice dashboard, but the final winner for us was Cypress. We chose it because, first of all, our users mainly use Chrome. Also, Cypress seemed to be much faster than Zalenium and it managed to solve the problem of flaky tests. Another cool Cypress feature is its dashboard, which you can use to interact with your tests. But the killer feature is that Cypress executes tests in the same environment as your application.

6) What if you want to test canvas?

Some parts of our application are written in canvas, and interacting with canvas is almost impossible. We decided to avoid canvas completely and interact with the application runtime. Our application is written in React and because Cypress runs on the same environment as our application we could dispatch actions and read from state in our tests.
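A minimal sketch of this idea, under stated assumptions: the state shape, the action names and `createAppStore` below are hypothetical and not taken from our code base. The sketch models a tiny state container with `dispatch` and `getState`, which is what a test talks to instead of clicking on canvas pixels.

```typescript
// Hypothetical application state for a canvas-rendered order view.
interface AppState {
  selectedOrderId: string | null;
  zoomLevel: number;
}

// Hypothetical actions the canvas component would dispatch on user input.
type AppAction =
  | { type: "order/selected"; orderId: string }
  | { type: "canvas/zoomed"; factor: number };

// Pure reducer: computes the next state from the current state and an action.
function reduce(state: AppState, action: AppAction): AppState {
  switch (action.type) {
    case "order/selected":
      return { ...state, selectedOrderId: action.orderId };
    case "canvas/zoomed":
      return { ...state, zoomLevel: state.zoomLevel * action.factor };
  }
}

// Minimal store exposing the two operations an e2e test needs.
function createAppStore(initial: AppState) {
  let current = initial;
  return {
    dispatch(action: AppAction): void {
      current = reduce(current, action);
    },
    getState(): AppState {
      return current;
    },
  };
}

// What a test does instead of simulating a click on the canvas:
const appStore = createAppStore({ selectedOrderId: null, zoomLevel: 1 });
appStore.dispatch({ type: "order/selected", orderId: "order-42" });
appStore.dispatch({ type: "canvas/zoomed", factor: 2 });
```

In a Cypress test, the same two calls would go through the application window, for example by exposing the store on `window` in the staging build. That wiring is an assumption here, not something described above.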

7) Testing asynchronous flows

An interesting problem while testing was how to test application parts which are highly asynchronous in terms of communication with the backend. We have parts of the application that do short polling. To test this Cypress offers a dynamic way of configuring timeouts. For instance you could do something like:

`cy.get('some-selector', {timeout: 50000})`

This way Cypress periodically checks whether the element is present and keeps retrying until the timeout expires. As a timeout value we simply used SLO targets which were agreed between teams.
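To make the retry behaviour concrete, here is a minimal sketch of retry-until-timeout polling in plain TypeScript. It is not Cypress's implementation; `waitFor`, its parameters and the 50 ms default interval are hypothetical names and values chosen for illustration.

```typescript
// Resolves after the given number of milliseconds.
const pause = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls `probe` until it returns a value other than undefined,
// or rejects once `timeoutMs` has elapsed.
async function waitFor<T>(
  probe: () => T | undefined,
  timeoutMs: number,
  intervalMs = 50
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = probe();
    if (result !== undefined) {
      return result;
    }
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs} ms`);
    }
    await pause(intervalMs);
  }
}

// Simulated short-polling backend: the value shows up on the third check.
let checks = 0;
const pollBackend = (): string | undefined =>
  ++checks >= 3 ? "order-shipped" : undefined;
```

Calling `waitFor(pollBackend, 1000)` resolves with the value once the third check succeeds, while a probe that never succeeds rejects after the timeout.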

8) Automating tests

Automating tests was quite straightforward. In our CI/CD server we spin up 2 containers, one that runs the application and another one that runs the tests. After the process is done, those containers are destroyed.

All this process was quite fun to work on and I learned a lot. Having end to end tests helped us understand how users could use our system and we automated Quality Assurance, something that was previously a manual process and sometimes also error prone.

Typescript Best Practices

2019-02-14T00:00:00+01:00

Typescript is becoming more and more popular. As with everything, there are good and bad sides. How good it is depends on how you use it in your application. This article will not discuss the good and bad sides of Typescript but some best practices that will help you get the best out of it.

1. Strict configuration

Strict configuration should be mandatory and enabled by default, as there is not much value using Typescript without these settings. Without it, programs are slightly easier to write but you also lose many benefits of static type checking. The flags that need to be enabled in tsconfig.json are:

    {
        "compilerOptions": {
            "forceConsistentCasingInFileNames": true,
            "noImplicitReturns": true,
            "strict": true,
            "noUnusedLocals": true
        }
    }

The most important one is the "strict" flag, which enables a whole family of strict checks that you can also switch on independently. Four of them are:

noImplicitThis: Complains if the type of this isn’t clear.
noImplicitAny: With this setting, every value must have a known type, either declared or inferred. This mainly applies to parameters of functions and methods.

    const fn = ( worker ) => worker.name

If you don’t turn on noImplicitAny, worker will implicitly be of type any.

strictNullChecks: null is not part of any type (other than its own type, null) and must be explicitly mentioned if it is an acceptable value.

    interface Worker {
       name: string;
    }
    const getName = (worker?: Worker) => worker.name

This code snippet won’t compile because "worker" is an optional parameter and can be undefined.

alwaysStrict: Use JavaScript’s strict mode whenever possible.

Further compiler options are documented here:

https://www.typescriptlang.org/docs/handbook/compiler-options.html

2. General types - prefer to use primitive types

Use the primitive types number, string and boolean instead of Number, String and Boolean. The capitalized types refer to non-primitive boxed objects, which are almost never used appropriately in Javascript.
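A short sketch of why the distinction matters. The `shout` function is a hypothetical example:

```typescript
// Accepts the primitive type, which is what string literals and most APIs produce.
function shout(text: string): string {
  return text.toUpperCase() + "!";
}

const primitive: string = "hello";  // primitive string
const boxed = new String("hello");  // type String (boxed object)

const loud = shout(primitive);      // OK
// shout(boxed);                    // compile error: 'String' is not assignable to 'string'

// Boxed values also behave differently at runtime:
const kindOfPrimitive = typeof primitive; // "string"
const kindOfBoxed = typeof boxed;         // "object"
```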

3. Type inference

Instead of explicitly declaring the type, let the compiler infer it for you wherever the initial value makes the type obvious:

    let name = 'David';  //name is string
    let age = 11; // age is number

4. Callback types

For callbacks whose return value will be ignored, prefer the return type void over any. With any, the return value can be used without any checks:

    function cal(x: () => any) {
        var y = x();
        y.doAnything();  // ok but unchecked
    }

Using void is safer because it prevents you from accidentally using the return value in an unchecked way:

    function cal(x: () => void) {
        var y = x();
        y.doAnything(); // Error
    }

5. Function parameters

For functions with a lot of parameters, or with several parameters of the same type, it makes sense to change the function to take a single object as its parameter instead.

    function cal(x: string, y: string, z: string) {}

With such a function, it’s quite easy to pass the parameters in the wrong order, for instance: cal(x, z, y). Change the function to take an object:

    function cal(foo: {x: string, y: string, z: string}) {}

The function call will look like: cal({x, y, z}) which makes it easier to spot mistakes and review code.

6. Overloads - Ordering

The more specific overloads should be put before the more general overloads. The following order is wrong:

     interface Person {}
     interface Worker extends Person {}
     declare function tun (a: any) : any;
     declare function tun (p: Person) : string;
     declare function tun (w: Worker) : number;
     declare var w: Worker;
     var y = tun (w); // y: any

Define them in the following order instead:

     declare function tun (w: Worker) : number;
     declare function tun (p: Person) : string;
     declare function tun (a: any) : any;
     declare var w: Worker;
     var y = tun (w); // y: number

This is because the compiler resolves a call to the first matching overload. When a more general overload is declared first, the more specific ones after it are hidden.

Overload - use optional parameter

Instead of several overloads that differ only in their trailing parameters:

    interface Business {
       cal (x: string) : number;
       cal (x: string, y: string) : number;
       cal (x: string, y: string, z: number) : number;
    }

you can declare a single function with optional parameters:

    interface Business {
       cal (x: string, y?: string, z?: number) : number;
    }

But this only works when all overloads have the same return type.
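As a sketch, a single implementation can then serve all three call shapes. The `SimpleBusiness` class and its length-based calculation are hypothetical and only illustrate the signature:

```typescript
interface Business {
  cal(x: string, y?: string, z?: number): number;
}

// Hypothetical implementation: sums the lengths of the strings plus z.
class SimpleBusiness implements Business {
  cal(x: string, y?: string, z?: number): number {
    // Missing optional arguments arrive as undefined, so default them explicitly.
    return x.length + (y?.length ?? 0) + (z ?? 0);
  }
}

const business = new SimpleBusiness();
const one = business.cal("ab");             // 2
const two = business.cal("ab", "cde");      // 5
const three = business.cal("ab", "cde", 4); // 9
```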

Overload - use union type

    interface Business {
       cal () : string;
       cal (x: string) : number;
       cal (x: number) : number;
    }

Instead of overloads that differ only in the type of one argument, as above, you might use a union type like this:

    interface Business {
       cal () : string;
       cal (x: string | number) : number;
    }
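A minimal sketch of how an implementation can work with the union-typed parameter. The `calLength` function is a hypothetical stand-in for `cal`:

```typescript
// Hypothetical implementation of a single union-typed signature.
function calLength(x: string | number): number {
  // typeof acts as a type guard and narrows the union in each branch.
  if (typeof x === "string") {
    return x.length;   // x is string here
  }
  return Math.abs(x);  // x is number here
}

const fromText = calLength("abcd"); // 4
const fromNumber = calLength(-7);   // 7

// A caller holding a union value can pass it straight through,
// which would not compile against two separate overloads.
const mixed: Array<string | number> = ["ab", 3];
const total = mixed.reduce<number>((sum, item) => sum + calLength(item), 0); // 5
```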

7. Don’t use "bind"

"bind" returns any. If you take a look at the definition of bind:

    bind (thisArg: any, ...argArray: any[]) : any

This means that bind always returns "any", and that bind() itself does no type checking: it accepts arguments of any type:

    function add (x: number, y: number) {
       return x + y;
    }
    let curryAdd = add.bind(null, 111);
    curryAdd(333); // Ok, but not type checked
    curryAdd('333') // Allowed, because there is no type check

It is better to write it with an arrow function:

     let curryAdd = (x: number) => add(111, x);
     curryAdd(333) // Ok, and type checked
     curryAdd('333') // Error

With the static type check, the compiler catches the wrong type instead of silently allowing any type through bind. Note that since Typescript 3.2, the strictBindCallApply flag (which is part of "strict") types "bind" on function types more strictly.

8. Non-existing value - prefer undefined over null

When a value on an object property or a function parameter is missing, you can use Typescript optional "?" to say so.

     interface Worker {
        name: string;
        address?: string;
     }

The Typescript "?" marker doesn’t cover null: an optional property or parameter may be undefined, but not null. Javascript has two values for "nothing", null and undefined, and in practice both are used to represent a missing value. That’s why it’s recommended to use only undefined for non-existing values and to forbid null with the TSLint rule:

{ "no-null-keyword": true }

The optional "?" marker cannot be used on a variable declaration or a function return type; there you have to write the union with undefined explicitly. To safely handle a missing 'worker' before using its properties, we can rely on type guards: inside the guard, Typescript narrows the type of the parameter, which unwraps our optional worker:

    type Optional<T> = T | undefined
    const getName = (worker?: Worker) => {
        if (worker) {
           return worker.name;
        }
        return 'no worker';
    }
    // Either declaration works; the alias is just shorter:
    // let worker: Worker | undefined;
    let worker: Optional<Worker>;
    console.log(getName(worker));   // 'no worker'

The above code snippet prints 'no worker' because our worker is undefined, but the type guard safely handles the missing object. The "Optional" alias is a little shorter than writing "Worker | undefined" and has the same result.

On the Effectiveness of Online Marketing

2019-02-07T00:00:00+01:00

Measuring the incremental effect of online marketing to optimize advertising investment

One of the core values at Zalando is to be Customer Obsessed, and this applies to online marketing as well. For many Zalando customers, their experience starts with a catchy ad. Therefore, in Personalized Marketing, our mission is to reach customers with a personalized message and suggest products tailored to their needs or wants.

By increasing the relevance of marketing, we aim to increase the number of customers interested in our offer, and, in turn, generate profitable sales for Zalando. While doing so, we constantly face the “never-out-of-fashion” (to quote our latest Christmas campaign) question: What does marketing really do?

Simple question, complex answers

So the central question is, “What is the true incremental effect of online marketing?”

Would a customer have bought this pair of shoes even in the absence of marketing?
How much are we growing Zalando’s customer base thanks to online marketing?

The answers to these questions are complex and multi-faceted. Measuring incrementality without mistaking a success for a failure can be nearly impossible, as shown in [1]. Even well-established and successful tech giants are not done answering the question. Hohnhold, O’Brien and Tang [2] tried to find the best way to measure the impact of marketing beyond its short-term effect. Optimizing for the next few days or weeks may lower the impact of online marketing in the long run. As these Google researchers put it in [2]: “We have long recognized that optimizing for short-term revenue may be detrimental in the long term if users learn to ignore the ads.”

One of our objectives is to compute a Return-on-Investment (ROI) for every campaign. This metric allows us to allocate our resources efficiently. We maximize sales generation and new customer acquisition for every campaign, given a ROI target.

Performance measurement landscape

Zalando took up the challenge and aims to measure the performance of online marketing at scale. The ROI of our marketing activities is computed through a pipeline composed of several products. While each of them would deserve a dedicated blog post, this article aims to simply outline their purposes and main challenges.

Figure 1: Product Pipeline Overview

We have built a flexible and scalable data infrastructure based on S3, Hive and Spark on AWS. Spark’s parallelizing capabilities in combination with AWS EC2 ensure that we can meet our strict SLAs even with a continuously growing number of customers and amount of traffic. In the future we plan to automatically scale the size of the cluster depending on the size of the input data. We decoupled our sub-products and use Hive tables as interfaces between them. This allows for more autonomy in product development and generally lets us move faster.

At the start of our pipeline, we source all marketing clicks, sales and conversions (e.g. customer acquisitions, app installs) from Zalando’s Data Lake and DWH to build a structured and unified event data layer. This is one of our greatest challenges since the data is very diverse with regard to quality, update frequency, syntax and semantics. Therefore we are making great efforts to move from client-side towards server-side ad tracking, and we closely monitor our data through data quality dashboards. After updating the event data layer, we use our internal cross-device graph to create the customers’ journeys across all their devices, from first ad interaction to conversion.

Next, with our attribution model, we determine how much incremental value was created by every ad click. The particularity of the attribution problem is its unknown ground truth. As we cannot interview every single customer, we will never exactly know why a given customer bought their latest jacket on Zalando. We built a framework that allows us to iterate quickly and test many different attribution models. We are using SQL for simple transformations, while Scala is our choice for more complex computations. This way we are able to explore far beyond simple models (e.g. Last touch) and leverage our huge dataset with more complex models.
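To illustrate how attribution models differ, here is a minimal sketch comparing a last-touch model with a simple linear model. The `Click` shape, the channel names and both functions are hypothetical; they are not our production models, and the numbers are made up.

```typescript
// One ad click in a customer journey that ended in a conversion.
interface Click {
  channel: string;
  timestamp: number; // e.g. epoch seconds
}

type Credit = Record<string, number>;

// Last touch: the most recent click receives all of the conversion value.
function lastTouch(journey: Click[], value: number): Credit {
  const credit: Credit = {};
  if (journey.length === 0) return credit;
  const last = journey.reduce((a, b) => (b.timestamp >= a.timestamp ? b : a));
  credit[last.channel] = value;
  return credit;
}

// Linear: every click receives an equal share of the conversion value.
function linear(journey: Click[], value: number): Credit {
  const credit: Credit = {};
  const share = value / journey.length;
  for (const click of journey) {
    credit[click.channel] = (credit[click.channel] ?? 0) + share;
  }
  return credit;
}

const journey: Click[] = [
  { channel: "display", timestamp: 100 },
  { channel: "search", timestamp: 200 },
  { channel: "display", timestamp: 300 },
  { channel: "email", timestamp: 400 },
];

const lastTouchCredit = lastTouch(journey, 100); // { email: 100 }
const linearCredit = linear(journey, 100);       // { display: 50, search: 25, email: 25 }
```

The same journey yields very different credit per channel depending on the model, which is why the model needs to be calibrated against something closer to ground truth.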

Figure 2: Attribution Illustration

As reality is unknown, we are running many randomized experiments with the aim of causally inferring the incremental impact of each marketing campaign. We use geo-based [3] and audience-based test methodologies to achieve this. In the former, marketing activities are turned off in certain regions and we quantify the impact on revenue, profit and customer acquisition. The latter splits a given customer base into two groups, giving one group a specific treatment and measuring the difference in behaviour. We use the results to calibrate our attribution and ensure it reflects reality.
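As a worked sketch of the audience-based methodology, the incremental effect can be estimated as the difference between the conversion rates of the treated group and the control group. All numbers below are made up, and `incrementalLift` is a hypothetical helper; a real analysis would also need significance testing.

```typescript
interface GroupResult {
  customers: number;   // group size
  conversions: number; // customers who converted during the test
}

// Conversion rate of a group; 0 for an empty group to avoid dividing by zero.
function conversionRate(group: GroupResult): number {
  return group.customers === 0 ? 0 : group.conversions / group.customers;
}

// Absolute lift: extra conversions per treated customer attributable to the campaign.
function incrementalLift(treatment: GroupResult, control: GroupResult): number {
  return conversionRate(treatment) - conversionRate(control);
}

// Made-up example: 10,000 customers per group.
const treated: GroupResult = { customers: 10000, conversions: 300 };
const holdout: GroupResult = { customers: 10000, conversions: 250 };

const lift = incrementalLift(treated, holdout);          // about 0.005
const incrementalConversions = lift * treated.customers; // about 50
```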

Continuously running such a large number of parallel experiments is a great challenge. The test results need to accurately reflect the incrementality of marketing campaigns, even though they can be highly affected by seasonality or ever-changing consumer behaviour. Hence, we are currently building an experimentation platform that sets experiments up and analyzes the results in an automated way.

Is That All?

The next logical step is bidding based on the ROI. We invest a lot of resources to predict the performance of marketing. Every day, we estimate the incremental profit marketing campaigns will generate in the coming weeks. Each impression can lead to a click, each click can lead to a conversion. Every marketing campaign is a different time series, with its own behavior and characteristics. The scale of these time series may differ by several orders of magnitude, and while most of them are quite unique, it is possible to infer some similarities (embeddings are one solution). We are experimenting with state of the art machine learning models such as DeepAR [4]. All of this makes it an extremely complicated and deeply interesting problem to model.

The measurement of incrementality opens up many interesting topics that we also tackle in the Personalized Marketing Team, such as generating the best ads or setting the best target and budget.

Thanks to Pablo Croppi, Carolyn Hodgson, Dirk Petzoldt, Dominik Rief for reviewing this article, and to Yanwolf Hoffmann for design help.

REFERENCES

[1] Randall A. Lewis and Justin M. Rao. On the Near Impossibility of Measuring the Returns to Advertising, 2013
[2] H. Hohnhold, D. O’Brien, D. Tang. Focusing on the Long-term: It’s Good for Users and Business
[3] J. Vaver and J. Koehler. Measuring Ad Effectiveness Using Geo Experiments, 2011
[4] V. Flunkert, D. Salinas, J. Gasthaus. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Open Source: January Updates - Celebrate 'I Love Free Software Day'

2019-02-07T00:00:00+01:00

Project Highlights

Lionel Montrieux brought Nakadi to FOSDEM 2019. This is one of the largest open source projects released by Zalando. Nakadi is a distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues. It is used in production by over a hundred teams daily and handles over 100 TB of data every day. Try out Nakadi!

New projects

Two new projects entered Zalando-Incubator this month:

Transformer is a tool to transform/convert web browser sessions (HAR files) into Locust load testing scenarios (locustfile). This tool can be used when users have HAR files (containing recordings of interactions with your website) that they then want to replay in load tests using Locust.
Autoscaler is a component that automatically adjusts the size of a Kubernetes Cluster so that all pods have a place to run and there are no unneeded nodes. This is a fork of Kubernetes Autoscaler. Our goal is to test and deploy the project in a large scale environment at Zalando, then propose upstream contributions, so the whole community can benefit from our experiments.

In addition to the project highlights, we recently released a policy to handle harassment in open source. At Zalando we encourage our employees to take an active part in open source development, and we as a company commit to providing our full support to developers who engage in open source on Zalando’s behalf. Find out more details here.

Celebrate I Love Free Software Day

Join Zalando this Valentine’s Day to celebrate our love for Free Software. This is a chance for us to show our appreciation to people who contribute to the free and open source community. We are delighted to welcome our special guest speakers, who have devoted several years to growing and fostering the FOSS movement around the world.

Lennart Poettering - Author of systemd
Chris Travers - Contributor of PostgreSQL since 1999
Daniel-Constantin Mierla - Core developer of Kamailio
Thilo Borgmann - Developer of FFmpeg
Robert Foss - Contributor of the Linux Kernel
Nicco Kunzmann - Mentor of FOSSASIA

I Love Free Software Day was first introduced by Free Software Foundation Europe. Find more details here

Defining a company policy to handle harassment in open source

2019-02-04T00:00:00+01:00

Open Source Participation

When you as a Zalando employee engage in open source communities as part of your work, you will interact with the wider open source communities outside Zalando - this is generally a good experience and collaborating with many different types of developers with different backgrounds is generally a positive input to your personal development.

However, there is also a small risk of encountering negative or even abusive behavior from community members when you act as an open source contributor or maintainer.

As an employer encouraging open source participation, we have decided to devise a policy for how we as a company can support our employees in case of harassment.

Statistics on harassment in open source

An extensive survey by GitHub in 2017 showed that nearly one out of five respondents have experienced negative behavior personally and 50% have witnessed it between other people. Fortunately, outright harassment is much less likely, with 14% witnessing it and 3% experiencing it personally.

Witnessing and experiencing behavior such as name-calling, stereotyping and outright harassment can have a big negative impact on people’s desire to be part of open source communities, especially for women or ethnic or sexual minorities who are already underrepresented in the open source world (3% female, 16% ethnic minority, 7% sexual minority).

So, the open source community sees an underrepresentation of minorities, and those who do participate run a risk of encountering hostile behavior. Is the risk of harassment big? No - generally speaking the risk is low, but the impact of potential harassment is very real.

As an industry we must prioritize the topic of diversity in open source. Abusive behavior should not be tolerated, and if it does happen, companies should be ready to support their employees in dealing with it.

Supporting employee participation

As an employer, Zalando encourages its employees to take an active part in open source development. Developers are granted time to maintain the projects we release and to contribute upstream to projects which are of strategic importance to Zalando. We as a company therefore have an obligation to ensure that we support employees who engage in open source on Zalando’s behalf.

Support isn't just about granting time and resources for open source development. Support also means understanding the potential risks employees face when doing open source, and being ready to offer legal and HR guidance to employees in the event of harassment.

It is with this mindset that we have put together a formal policy for dealing with harassment in open source for our maintainers and contributors, a policy which employees can use to determine where inside Zalando they can find help to deal with such behavior and also to clarify what they can expect from Legal and HR.

The policy

We have divided the policy into 2 parts: proactive and reactive measures.

First of all, proactively: we recommend that employees only engage with projects that have a code of conduct in place. We also enforce that all new projects released by Zalando have a code of conduct, as part of the boilerplate files we provide. As part of our internal mandatory training for open source maintainers and during on-boarding of new employees we also make our expectations very clear: in case of behavior in breach of the code of conduct, it is expected that you enforce the code or ask the open source team for help on how to act.

Secondly, if an employee does need guidance, we reactively provide the following options:

P&O (Zalando HR) can guide you on how to react to abusive behavior and help you determine if legal action is required. Talk to your lead if you need assistance, or reach out directly to the open source team.
The open source team will assist you in reporting the abuse to the responsible platform owner (such as GitHub)
Zalando legal will provide legal guidance in case such is required
If it is established that there is a need to report the incident to law enforcement, P&O (Zalando HR) and Zalando legal will assist you in collecting evidence and filing a report

A small step forward

While policies will never solve the root cause, this one is a step in the right direction. We believe in equal opportunity and access to the world of open source. We believe open source is important, not just to tech companies, but to society as a whole, and we must all do what we can to ensure that the communities building the software that we all rely on are inclusive and safe for everyone to be part of.

The Product Playbook

2019-01-31T00:00:00+01:00

Shared language and visualizing to deliver great products

Football is an environment with changing variables that players and coaches need to react to. Teams attempt to move the ball down the field by running or passing in a set number of plays.

If you’ve ever watched a football game you will see coaches holding a subset of plays from the coach’s playbook they think may work for the game they are playing. This lets them make decisions in the moment. A coach may have 1,000 plays in the playbook, but will only use a fraction in a game situation. And each team may have a different playbook.

At Zalando, we came across the idea of creating playbooks for building products in a great article by Jon Lax. We also spotted the nice application of it at Typeform.

What is “a play”?

A play is an agreed upon set of actions the team takes in a given situation. When the coach says “let’s run Statue of Liberty Buck Sweep” everyone knows what that means and knows what they need to do to execute that play. A playbook is the collected knowledge of a coach or team on “HOW they do what they do”.

*It inspired us to make a playbook of how we build products — how we go from identifying value opportunities, delivering solutions, iterating or dropping on them.

*Anything you do that has some repeatable action is a play.

*Visualizing our product development like this helps us highlight the emphasis we put on keeping things simple. It helps to demystify and push people involved in product related tasks, and lead to continuous improvement.

Equally important, it forces us to name our plays, which helps create a shared language:

We define each name as precisely as we can to make sure everyone understands what it means. A play called MVP could have a lot of meanings.
We clarify the situations in which to run a play. While most plays could be run at any point in a product’s life cycle, each is most effective in a certain situation: big bets require extra scoping efforts, quick wins go straight to design kickoff, spikes are recurrent and ensure continuous discovery, running loads of AB tests in a pre product-market fit situation may not be best for us, etc.
We also explain why this play is the right one to deliver value to the team.

Let’s take a step back, and go through the playbook pillars:

The 4Ds
Getting real
The 50% rule
Learning loops

🖖 The 4Ds

Maybe you believe the Customer Journey map method is best, or the Double Diamond, the Hooked model, six-week cycles or the Lean Canvas.

It doesn’t matter.

Plays can be grouped any way you want. Simply organize your plays so that they map onto each of the phases.

At Zalando we commonly use the 4Ds framework: Discover, Define, Design, Deliver.

Every team member contributes to “Discover” which leads to richer ideas and involvement.

The simplicity of the framework helps us ship early to the customer, the only validation we ultimately seek. It ensures we develop a great customer experience while delivering business impact.

📱 Getting real

This mantra is inspired by the essay Getting Real by 37 Signals.

Getting Real is about skipping all the stuff that represents real (charts, graphs, boxes, arrows, schematics, wireframes, etc.) and actually building the real thing.

Getting real is less. Less mass, less software, less features, less paperwork, less of everything that’s not essential.

Getting Real is staying small and being agile.

❌ Things we don’t do:

Timelines that take months, version numbers, roadmaps that predict the perfect future
Functional specs, scalability debates
Endless preference options
Proprietary data formats
The “need” to hire dozens of employees
Asking users hypothetical questions; instead, we ask them to complete tasks

✅ Things we do:

Fewer meetings, fewer abstractions and fewer promises
To launch on time and on budget, we avoid throwing more time or money at a problem; instead, we scale back the scope
It’s better to make half a product than a half-assed product
“Just-in-time” thinking
Multi-tasking team members
An open culture that makes it easy to admit mistakes
Basic documentation which makes clear what we do and includes people
Dead simple prototyping

🎨 Prototypes often start in a notebook

Overall, less mass lets you change direction quickly. You can react and evolve. You can focus on the good ideas and drop the bad ones.

🌗 The 50% rule

We believe that for any business to succeed, you’ll need to achieve 3 things:

a viable product/service
a large enough market
and a way to reach your customers

As described by Gabriel Weinberg in Traction: Startups often spend most of their resources developing their products; by the time they realize they need to get more customers and try to ramp up their sales+marketing efforts, they’ve run out of money.

This is why, from the outset, we spend 50% of our time on product development and 50% on traction development. We can’t predict which traction channels will work; the only way is to test them.

To stick to the 50% rule, we share ‘simple’ documentation about what we are planning. This ensures alignment and makes it clear what we are trying to achieve.

💫 Learning loops

At the core of our product DNA is collecting and sharing learnings. The 4Ds cycles foster learnings inside the team and reinforce our plays.

4Ds cycles are our “learning loops”

To ensure learnings circulate, we have three (internally) public initiatives:

A team newsletter to highlight the achievements, but also the failures, from the past six weeks
A team website that documents the features as well as the AB test hypotheses and results
Real-time funnels to monitor the impact of each change, whether it concerns traction or product

Learning loops are what keep us ahead of competitors. They enable us to iterate or pivot based not only on instinct but also on data. They ensure we ultimately ship in the right direction 🚢

Conclusion

The product playbook is a powerful way of explaining our underlying thinking.
Thinking in terms of a playbook provides a shared language and visualizes how we do things. It crystallizes a common understanding of how we build products.
It allows us to embrace continual improvement: we remove old plays and continuously add new ones through learning loops.
It ensures our team dynamic: we look in the same direction and move forward against clear business goals.
To an extent, it helped to build our relationships.

The product playbook template. It’s yours to make a copy and adapt it 👌

Nakadi Goes to FOSDEM

2019-01-29T00:00:00+01:00

Nakadi is Zalando’s open source event streaming platform. It is based on Apache Kafka. It started as a simple HTTP proxy, providing a REST interface to publish and consume JSON messages. It quickly evolved, with the addition of schema validation and evolution, self-service authorization, a subscription API for easy consumption, deep integration with Zalando’s infrastructure, a SQL-over-streams engine, and much more. It has now become a real platform for event streaming, and plays an essential role in Zalando’s architecture.
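To make the "REST interface to publish JSON messages" idea concrete, here is a minimal sketch of how a client might build a publish request. It assumes the `POST /event-types/{name}/events` endpoint from Nakadi's public API and a "business" category event type, where each event carries a `metadata` object with `eid` and `occurred_at`. The host `nakadi.example.org`, the event type `order.created`, and the helper `build_publish_request` are hypothetical. Authentication is left out, and the request is only constructed, not sent.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.request import Request


def build_publish_request(base_url: str, event_type: str, payloads: list) -> Request:
    """Build (but do not send) an HTTP request publishing a batch of events."""
    events = [
        {
            # Assumed metadata for a "business" category event.
            "metadata": {
                "eid": str(uuid.uuid4()),
                "occurred_at": datetime.now(timezone.utc).isoformat(),
            },
            **payload,
        }
        for payload in payloads
    ]
    return Request(
        url=f"{base_url}/event-types/{event_type}/events",
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and event type, for illustration only.
req = build_publish_request(
    "https://nakadi.example.org", "order.created", [{"order_id": "42"}]
)
print(req.get_method(), req.full_url)
```

Passing the resulting request to `urllib.request.urlopen` would send the batch. Events are always sent as a JSON array, even when the batch contains a single event.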

Nakadi is meant to be simple to use, and self-service. With Nakadi-UI, our open source web interface, users can create and manage resources such as event types, subscriptions, and SQL queries, by themselves. They can even inspect the contents of their event types, publish events, and get access to monitoring and alerting tools so they can keep on top of the health of their streams. Nakadi-UI is written in Elm, and it is probably one of the largest open source projects in that language.

Fig. 1: A view of an event type and its schema in Nakadi-UI

The Nakadi community has come up with a collection of great client libraries for the most used languages inside Zalando: Java, Scala, Python, and Golang. You can find Nakadi-UI and all the community projects on https://github.com/zalando-nakadi, and Nakadi itself, with its code and documentation, on https://nakadi.io. Head to the Nakadi-UI repository to get started right away with Docker Compose: you’ll get a local deployment of Nakadi and Nakadi-UI with all their dependencies to play with.

At Zalando, we have been running Nakadi in production for over 3 years. These days, it handles over 100 TB of data every day. It is used by over a hundred teams daily, yet it is entirely maintained by a small team of 8 engineers. Not only do we develop and maintain Nakadi, but we also operate Zalando’s internal deployments, take care of operations, user support, 24x7 incident response, documentation, and much more.

This Sunday, at FOSDEM, we will show how we manage to do all this - and still find time to write code. Join us at 12:15 in the HPC, Big Data, and Data Science devroom for our talk - or grab us in the hallway track during the weekend!

A Day in the Life of a Frontend Engineer at Zalando

2019-01-24T00:00:00+01:00

You’ve probably never had the same day twice at your current job. At Zalando it’s no different. Here, it not only depends on the product you're currently working on but also on your peers.

Actually, what's expected from a frontend engineer can vary according to a company philosophy or your own previous experience: usually a frontend engineer can be seen as a Swiss army knife when in reality at Zalando, for example, we see them as masters of trades.

If you're considering joining us as a frontend engineer, be aware that a day in the life of a frontend engineer for us usually means:

…BEING A PROBLEM SOLVER / WEARING MULTIPLE HATS First and foremost, you're going to be asked on a daily basis to come up with solutions. Topics change quite often since a lot is asked: from defining data models and structuring APIs together with the Backend Engineers to challenging the user interfaces defined by the design team. A day as a frontend engineer can be a bit overwhelming at first, but there’s nothing for it but to take a deep breath and get your hands dirty.

Your focus is always going to be the user, which means that you'll have users on your mind every day. It's expected from a frontend engineer to have good UX notions and to always deliver the best experience to our customers.

…SPEAKING JAVASCRIPT ALL DAY LONG Discussing JavaScript is basically what we do constantly.

Nobody knows JavaScript from A to Z, but since it is a technology that changes at the speed of light, staying on top of it is quite important and it is quite healthy to share knowledge amongst colleagues.

We heavily rely on frameworks and libraries at Zalando (mainly React, but you can always encounter other things like Angular, Vue or Polymer; if you're curious about our stack, check out our Tech Radar) and we do use other technologies for explicit typing (like TypeScript or Flow). However, what we value most is:

the knowledge of the language itself;
its core functionality;
its asynchronous/synchronous nature;
its browser APIs.

We also take some time to consider what's best for the products: "Do I really need a library for this or do I know a better solution?", "Is this piece of code performant?" These are questions we ask ourselves every day.

Not being afraid of trying new technologies and new ways of implementing the same thing is also part of the job: It takes a lot of experience to understand that a Senior Engineer is not the one that writes the most complex code but the one that always writes the simplest instead!

…WRITING THE BEST TEMPLATES On a day-to-day basis, we know that a line of CSS can save quite a few lines of JavaScript, so we take our templating very seriously.

We take the time to make sure our HTML makes sense semantically, that it is accessible to all of our users, and that it’s clear to any colleague who may lay eyes upon it. We work in component-based projects, so styling might get overlooked or might not even be needed, but we do care about clean and performant code, so we see CSS as a vital part of achieving it.

…ALWAYS BEING A STEP AHEAD Being a frontend engineer is very demanding learning-wise: so we allocate a bit of our time to always keep pushing forward and knowing what's coming.

…HAVING QUALITY AND PERFORMANCE AS TOP PRIORITIES Browsers are tricky. We know that we have to allocate some time in order to make sure everything is working correctly and as intended… Debugging comes as a second nature: sometimes it's just a GraphQL Mutation or a PUT request that didn't work but it's part of our job to know where to look for the mistake and figure out a proper solution.

Non-Functional Requirements are also there to be defended and challenged and we constantly need to figure out the most efficient ways to achieve them.

Since we use open source technologies, we need to evaluate the risk of encountering vulnerabilities in our products constantly. Every action our code allows (especially when communicating with backend services) is a potential security problem, so we do what we can to prevent something like XSS or DOM manipulation from happening. As mentioned before, we always have the best interests of our customers in mind, and that includes their data and assets.

Other than that, another part of our day-to-day is dedicated to preventing something from going wrong. Unit testing is part of our definition of done. We are fans of UI testing/E2E tests and we have no problem in testing and verifying each other's work.

…NOT BEING AFRAID TO GO OPERATIONAL As DevOps teams, we perform quite a lot of operational work (even if it's just taking care of a deployment). We don't exactly expect all of us to be AWS or Kubernetes experts, but we do our best to train each other on all we need, so that we can be more independent.

We set up projects, from the simplest one to a complete and robust one, so pretty much all of us frontenders at Zalando are familiar with tools like Webpack or Babel.

We also value Continuous Integration and Continuous Delivery and that's always a concern on a daily basis.

…HAVING A CONSTANT AGILE MINDSET Having worked with any Agile Methodology before is pretty important. It does not matter whether it was Scrum, Kanban or a Tribe Model. What is important is that we work as a team and we place the team’s needs above our egos.

We work on scalable projects with lots of dependencies and external parties, so it's quite important to adapt the ways of working to deliver the most value possible. We do it the Agile way.

…BEING A MENTOR Knowledge sharing. Someone next to you is always eager to learn more and another part of our day is dedicated to share whatever we know is worth sharing.

…BEING A COMMUNITY CONTRIBUTOR Knowledge sharing: community version. However we can contribute, we are encouraged to do so. Doesn't matter if it is for Open Source projects, speaking at conferences or organizing meet-ups, we do our best to help the surrounding community.

…BEING INTERNATIONAL IS KIND OF MANDATORY We have over 100 nationalities at Zalando, so English is a big part of our day and that is irrespective of which office or country we work in. Embracing different cultures is one of the most rewarding aspects of having such a diverse team and it is a lot of fun, from the dogs running around on some floors to the Nerf gun wars.

So… A DAY IN THE LIFE OF A FRONTEND ENGINEER FOR US MEANS: Arriving at the office with an open mind; knowing that not everything is always going to be perfect and easy, but striving for continuous improvement and getting better at what we do.

We are a united team that enjoys the journey of being in the same boat and solving problems together. On an individual level, always being ready to share knowledge, to learn from others, as well as being responsible and accountable for the amazing work that you can do, are some of the most important qualities that we hope new potential team members would have.

The Magic Coaching Wand

2019-01-10T00:00:00+01:00

How the Zalando Personalization Unit improved with a diagnostic

In our coaching work, doing diagnostics can already create huge improvements without a lot of action on our part. When working at scale (Zalando has around 150 tech teams), this helps create an impact on the whole organisation.

In this blog post, I will share the story of a diagnostic done in a unit of seven machine learning and data science (ML/DS) teams in Berlin, Helsinki and Dublin. Key points include:

a diagnostic is an improvement on its own: what gets measured gets improved, be it that the unit becomes aware of blind spots or they get confirmation from an expert.
you can initiate improvements at scale if you do the diagnostic co-creatively and openly, having everybody in the unit agree on the overall situation using tools like “Lean Change Canvas.”
systemic problems become visible: they affect local teams and roles, but cannot be solved there. They need to be tackled at a systemic level.

What follows is a personal experience, and how sometimes solutions are not obvious and have to be found by following a path that only emerges as you walk it.

The universal key that did not unlock the door

On the request of the Dedicated Owner (DO) of the unit we did interviews to get multiple perspectives on the “problem”. We talked to the DO, the Leads, Senior Engineers, Data Scientists, Product Managers, Producers, UX… This is our universal key of request clarification to differentiate symptoms from root causes and to find the systemic pattern.

Normally request clarifications unveil the path to a solution. This time it failed.

We talked to motivated, honest and open leads who really want to make a difference and to support and grow their colleagues. We met a DO who gives freedom and support to his leads and teams. We found really passionate Data Scientists, Engineers and Product Managers. All of them were aware of the problems they collectively faced and what was causing them.

Why was an empowered group like this – with willingness and skill – not able to solve their own problems?

“You cannot understand a system until you try to change it.”

We started interacting with the system. Which of our solutions would it adopt and which ones would it refuse? Which problems would the system allow to be solved? We tried a one-day leadership training, a three-day agile workshop with two teams, a session about agile at scale, story splitting, and a few more topics. We got a lot of good feedback for this work. It led to a lot of local optimizations and improvements.

But listening to the people, it felt like the “problems” stayed the same. The mood of the people hardly changed. Are we as humans so used to having problems that we refuse to let them go? That we actually miss them when they are gone?

The big picture made small

What are we not seeing? We tried a new approach. Based on Jeff Anderson's “Lean Change,” we created a simple canvas: Urgencies - Vision - Next Steps. Rooting it in a more complex framework would allow us to scale the canvas later into a more powerful collaborative change board.

This time we asked the entire team to fill in the canvases within a coaching session. The outcome was, once again, unbelievable. All teams had a great vision of how they wanted to work. They knew precisely which next steps they could take to improve.

The elephant in the room

We have great teams. They have leaders who ask for and support self-driven improvements.

Why don’t the teams “just do it”?

We asked the Dedicated Owner for a meeting. We prepared a room and then asked the Dedicated Owner to go and pick up the canvases (urgencies, vision, next steps) from the open team spaces and pin them up in this room. To start the meeting, this was a physical and transparent act by the Dedicated Owner of taking care of the problems.

The Breakthrough

The magic happened in the meeting when we had the canvases from all seven teams on one wall in the session.

The Dedicated Owner started discovering the pattern. First the smaller, local patterns, then the systemic pattern that seems to affect every team to different degrees but cannot be linked to a single team or role. These are the patterns you can only see when you take a step back and look at the whole picture.

It wasn’t clear which role should drive which topic or improvement, when, and for what reason. We called this the “Ownership Pattern.” We also saw that we were jumping from having an idea or a goal right into delivering on it. We called this the “Product Pattern.”

On the local level, the responsibilities of who owned what and who did what seemed pretty clear. For topics “in between” (i.e. two roles) and “across” (i.e. several teams) as well as “through” (i.e. certain processes) there was a lot more uncertainty.

Why? What happened in the past that we now have this pattern?

Zalando introduced further team autonomy and dedicated ownership. Zalando became successful because of its ability to execute and deliver new products very quickly. What is the effect of this on the culture of this area? Are there even more organisational or cultural influences? We were deep-diving into Zalando's past.

The Magic Happens

When we understood the origins we could understand the pattern and the effects. Now, could we initiate change by telling everyone what insights we found? No.

The magic happens when everyone is having their own, “Aha!” moment, just like the Dedicated Owner in the meeting before.

The next weeks we invested in creating these “Aha!” moments across the whole department, sharing and aligning the insights in a self-exploratory way. We also made sure that no one felt blamed or hurt by the insights i.e. about their role, but everyone had a shared understanding so we could jointly move on.

It was in this time we suddenly saw improvements happening in the unit without us triggering them: new boards, visual backlogs, canvases, roadmaps, UX sketches, goal alignments started popping up on the walls. The teams were acting on their next steps realizing their visions.

It was not us coaches, but the Senior Data Scientists, Engineers, Producers, Leads, Product Specialists, Product Owner, UX, … everybody moving a piece and making an improvement.

We coaches learned that co-creative and open dialogs with personal moments can unlock the door to continuous improvement. We learned that an outside perspective and self-reflection, without blaming or hurting anyone, are needed from time to time to get unstuck and move forward as a unit.

On your next request, instead of creating an improvement plan, you can try an open and co-creative diagnostic and, with a bit of good fortune, create a self-engaged and sustainable improvement at the scale of a unit.

Open Source: December Review - Patroni, Machine Learning Meetup and more

2019-01-07T00:00:00+01:00

Project Highlights

Patroni, one of Zalando's best-known open source projects, is now deployed as the Postgres Failover Manager on GitLab.com. Patroni was created a few years back when we needed automatic failover to manage hundreds of in-house clusters. The project started as a fork of Compose Governor, but Patroni quickly overtook the original and has become one of the most widely used templates for PostgreSQL High Availability. It has also been adopted by IBM Cloud. Our team at Zalando published a searchable documentation site to help users get started easily. Do check it out and join the Patroni community if you have any questions.

Besides Patroni, Zalando has also released other PostgreSQL-driven projects such as:

Spilo, a Docker image that provides PostgreSQL and Patroni bundled together. Spilo makes it simpler to deploy scalable Postgres clusters in a Kubernetes environment and to perform maintenance tasks.
PGObserver, a battle-tested monitoring solution for PostgreSQL databases. The project was originally developed to monitor performance metrics of Zalando's different PostgreSQL clusters.
Postgres-operator, which is used internally to manage over 500 Postgres clusters across a large number of Kubernetes installations. Learn more about the current development of this project here.

Inside Zalando Open Source

Machine Learning Meetup: the Zalando Open Source Guild hosted a Holiday Hack event, which brought together 71 researchers, developers and people interested in the field of Machine Learning to share knowledge and try out open source frameworks and solutions developed by Zalando Research and Engineering Teams.

During the event Zalando Open Source Maintainers conducted talks and guided the attendees to complete multiple challenges and hands-on exercises on two projects 1) Flair - a natural language processing library and 2) Connexion - a Swagger/OpenAPI framework for Python.

Open Source 2018 Year End Report: it has been an amazing year for open source at Zalando, with 25 new projects, 11,239 commits and 5,000 pull requests, 31% of them coming from non-employees. We have seen activity and contributions across Zalando repositories throughout the entire year, even in the busy holiday month of December. Click here to see the full report.

Zalando Open Source Around The World

KubeCon, December 10 - 14: Alexander Kukushkin, Database Engineer at Zalando, gave a talk on 'Building your own PostgreSQL-as-a-Service on Kubernetes'.

35C3, December 27 - 30: a number of Zalandos participated in the 35th Chaos Communication Congress (35C3), the annual four-day conference on technology, society and utopia organised by the Chaos Computer Club (CCC). This was a great opportunity for us to meet and connect with developers, tech communities and hackerspaces across Germany and Europe. Hong Phuc Dang, Zalando Open Source Team, had the honor of speaking on the podium on Feminist Perspectives on Inclusive and Diverse Spaces and Communities, where she exchanged lessons learned and ideas with other panelists on how to create and sustain more diversity in the tech community.

Front-End Micro Services

2018-12-06T00:00:00+01:00

The “micro frontends” idea has been around for a while now, with great resources such as this Tom Söderlund article, which includes a list of current existing implementations.

In this article, I would like to take an in-depth look at the reference implementation using fragments: explain what it tries to achieve, where it falls short and possible solutions to those limitations.

What are Fragments in the first place? They can be described as isolated pieces of your HTML page, built and served by independent services (and usually teams) such as Header, Product, Search, etc.

Example of an e-commerce website using different fragments to render a product page.

There are at least four benefits from typical micro services that fragments are trying to bring to the front end:

Ease of deployments with better isolation
Improved scalability with smaller pieces
Technological stack isolation with API integrations
Localized complexity with every piece easier to reason about

All of those usually lead to more autonomous and engaged teams with an improved DevOps culture.

The idea of fragments was made popular by the Zalando project, Mosaic. Many companies like HelloFresh are also following this approach.

Main implementations for fragments include:

Zalando’s Tailor, inspired by Facebook’s BigPipe
Web Components using Server Side Includes (Michael Geers talk)
Web server HTML transclusion using Edge Side Includes (Gustaf Nilsson Kotte talk)

These fragments-based solutions claim technological stack isolation but in practice all those fragments are only running a single framework (often React), which is probably a good thing as client bundle size would otherwise have to include different frameworks.

However, they achieve ease of deployments, improved scalability and are easily server-side rendered. There is a small catch though.

Like on the back-end side, a distributed architecture managed by different teams slowly leads to inconsistencies and different ways of doing things. While it might not be such a big deal for back-end side systems, creating inconsistent user interfaces and user experiences is an issue most customer-facing websites cannot ignore. The split of your UI components pipelines also means more infrastructure work to build and ship them to production.

Of course there are solutions to mitigate this. Immowelt, for example, went for a front-end micro service boilerplate. The boilerplate includes an advanced setup of Immowelt’s front-end stack: React, Redux, Universal rendering, etc. The advantage is that it reduces the time needed to set up a new service, limits fragmentation and shares common practices between teams while still keeping flexibility.

Another solution, presented in detail by Allegro, is to compose the HTML page from the same high-level front-end components, whose unit they call “Box”, and to focus on sharing and reusing components. In this context, the unit or “Box” declares its data dependency and can include other “Boxes.”

Zalando also identified those issues. The most important ones for us as a company are the inconsistent digital experience, which penalises our brand proposition, and the high barrier to entry for contributions from other teams, caused by the complete technological stack required to build a new fragment.

We are currently working on a replacement for Tailor (Zalando’s fragments based approach) which we call "Interface Framework" — an architecture stack composed of the following components:

Fashion Store API: GraphQL API aggregation layer
Renderers: self-contained pieces of code declaring their own data dependency and visual representation
Recommendation System: backend service which decides which renderers to display for page composition
Rendering Engine: backend service and client-side runtime orchestrating the view composition based on the data returned by the recommendation system
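The interplay of these four components can be illustrated with a small sketch. This is not Zalando's implementation: the names `Renderer`, `fetch`, `recommend` and `compose`, as well as the in-memory `CATALOG` standing in for the GraphQL aggregation layer, are hypothetical and only meant to show renderers declaring their data dependencies while an engine composes the page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Renderer:
    """A self-contained unit: the data it needs plus how to display it."""
    name: str
    fields: Tuple[str, ...]
    render: Callable[[Dict[str, str]], str]


# Hypothetical stand-in for the API aggregation layer.
CATALOG = {"title": "Sneaker", "price": "59.95", "brand": "Acme"}


def fetch(fields: Tuple[str, ...]) -> Dict[str, str]:
    """Return only the fields a renderer declared as its dependency."""
    return {field: CATALOG[field] for field in fields}


RENDERERS = {
    "header": Renderer(
        "header", ("brand",), lambda d: f"<header>{d['brand']}</header>"
    ),
    "product": Renderer(
        "product",
        ("title", "price"),
        lambda d: f"<section>{d['title']}: {d['price']}</section>",
    ),
}


def recommend(page_type: str) -> List[str]:
    """Stand-in for the recommendation system: choose renderers for a page."""
    layouts = {"product_page": ["header", "product"]}
    return layouts.get(page_type, ["header"])


def compose(page_type: str) -> str:
    """Rendering engine: resolve each renderer's data, then render in order."""
    parts = []
    for name in recommend(page_type):
        renderer = RENDERERS[name]
        parts.append(renderer.render(fetch(renderer.fields)))
    return "".join(parts)


print(compose("product_page"))
```

Because each renderer only states which fields it needs, the engine owns data fetching and composition, which is what lets the shared stack live in one place instead of being duplicated per fragment.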

Interface framework

Renderers are developed using Design System components in a mono-repository to ensure consistency. The previously redundant fragment stacks are now all centralised within the rendering engine, which leads to faster on-boarding and reduced time to market for feature teams developing renderers.

This new architecture also enables dynamic view composition: at any point in the user journey, the data layer can choose how the page should look for personalisation purposes. We also want our partners to be able to build renderers themselves so that they can seamlessly integrate their content within our website.

Update: See also our series on details of the Interface Framework:

Open Source: November Review - Maintainer training, new releases and more

2018-12-06T00:00:00+01:00

Project Highlights

ExternalDNS version 0.5.9 is ready for testing. This project allows you to control DNS records dynamically via Kubernetes resources in a DNS provider-agnostic way. ExternalDNS also successfully made its way to the Kubernetes Incubator. Check out the list of changes in this new release.

Zalando-Incubator welcomed two brand new open source projects: 1) Darty, a data dependency manager for data science projects that helps to share data across projects and control data versions, and 2) opentracing-sqs-java, which, as the name suggests, is a Java utility library for simplifying instrumentation of SQS messages with OpenTracing.

Skipper announced another new release this month. 1,400 commits have been made since the project was first introduced in 2015. Skipper is an HTTP router and reverse proxy for service composition. It is designed to handle >300k HTTP route definitions with detailed lookup conditions, and flexible augmentation of the request flow with filters. This release includes a number of new features: apiMonitoring, an east-west service-to-service API gateway setup in Kubernetes, and automatic HTTP redirects in the Kubernetes ingress controller running in GCP.

Inside Zalando Open Source

Maintainer training program is a work in progress. Earlier this month, the Open Source team began designing a new training course for our existing and aspiring Zalando project maintainers. While Zalando tech is well-known for doing open source in the open, we never stop exploring new ways to improve and scale up our projects across Zalando. This professional training initiative aims to enhance maintainers’ knowledge around adoption, compliance, community management and sustainability in open source, and thereby ensure they become confident to take full ownership of their open source projects independently. The course is expected to launch in Q1 2019 and will cover the following topics:

Introduction to a maintainer’s multiple roles
Open source adoption guidelines
Process to release open source
Compliance
Advocacy and stewardship
Mentorship and coaching

Machine Learning meets Fashion. We are inviting researchers, scientists and anyone who is interested in the field of Machine Learning and AI to join our journey of reimagining fashion. We believe that only by working together with the community worldwide can we bring our technologies and know-how to the next level. The Zalando Research team has released a number of publications around the most exciting research topics, such as Deep Learning, Computer Vision and Natural Language Processing, Large Scale Bayesian Inference, Reinforcement Learning and Causality. And we are very proud to share our work with the community, having so far released 13 research projects as open source, with Flair, a natural language processing library, as the most recent one.

Zalando Research Lab

Zalando Open Source Around The World

HighLoad Moscow, November 8 - 9 -- Zalando engineers participated in this conference to connect with local developers and Russian tech communities. We also gave three presentations: Valentine Gogichashvili, our Head of Engineering, spoke about data engineering inside Zalando; Henning Jacobs, Head of Developer Productivity, gave a lecture on ‘Optimizing Kubernetes Resource Requests’; and Alexander Kukushkin, our Database Expert, shared his experiences on ‘the migration of a 10 TB PostgreSQL Cluster to AWS’.

Henning Jacobs (top), Valentine Gogichashvili (left), Alexander Kukushkin (right)

Open Source Diversity Meetup Berlin, November 20 -- Hong Phuc Dang from the Zalando Open Source Team shared the ‘what and why’ of how she started her open source journey. At Zalando, we are working hard to ensure that inclusion and diversity are firmly embedded in our culture; several initiatives were introduced by the Diversity Guild, such as Women In Leadership, Inclusive Language, Diversity Day etc. Moving forward, we are working on increasing diversity across our open source projects.

CodeMotion Berlin, November 20 - 21 -- Paul Adams, Zalando Open Source Lead, talked about Adopting Open Source Best Practice for the Enterprise, with specific examples and policies that he and his team implemented inside Zalando.

Tag-based Navigation of a Fashion Catalog

2018-11-29T00:00:00+01:00

Exploring the Zalando Assortment by Browsing a Product Similarity Graph

Introduction

As Europe's leading online fashion and lifestyle platform, Zalando is continually developing new features to enable our customers to find the products they want. While the standard tools of Search, Categorization & Attribute Filtering are par-for-the-course for purchasing items online, with an ever-expanding fashion assortment and an increase in the data available to describe a product, this browsing experience is becoming more cumbersome and time-consuming, particularly on mobile devices.

At Zalando's Fashion Insights Centre in Dublin, while keeping a focus on developing AI and Big Data driven products and features in the medium term, we sometimes have time to explore new ideas with a longer-term vision, either through our annual Hackweek (an internal week-long hackathon) or our Slingshot programme (an “intrapreneurship” programme fostered by Zalando's Innovation Lab). In this blog post we will share with you a project that has journeyed through, and benefited from, both programmes, and present a new method for browsing an online fashion catalog using a Product Similarity Graph.

Product Similarity

What do we mean when we say that two products are similar? Do we mean that the products are from the same fashion trend, that they appear visually similar, or that they have a number of attributes in common, for example brand or product type? In fact it can be all of these things, or a select few, summed up to create an overall similarity score between two products, using all the data that makes sense for the task at hand.

When looking at the Zalando catalog in total, what does product similarity mean here? Well, it means calculating the previously described similarity score for each product against all others, (sometimes referred to as a similarity self-join) and storing the similarity scores in a suitable way. Typically, the scores are represented in a matrix format, which leads to the construction of a Product Similarity Matrix, where each row contains the similarity scores for one product against all others in the catalog, likewise for the columns since the matrix is symmetric.

Many of you will notice a potential pitfall as the catalog grows, which is, as the number of products, n, increases, the number of scores required to be generated will increase by n for every new product added to the catalog, and as such, the algorithmic complexity of generating a Product Similarity Matrix is n-squared, i.e., O(n**2). Depending on the use case the catalog size could be in the millions. To make this problem manageable we use distributed systems and algorithms such as Locality Sensitive Hashing. However, we will spare the details for our purposes here, for now just consider that our Product Similarity Matrix is big. We use the Product Similarity Matrix within Zalando to provide a Product Similarity Service, which is currently used by our recommendation team to tackle the cold-start problem associated with Collaborative Filtering.
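To make the quadratic growth concrete, here is a minimal sketch that counts how many unique pairwise scores are needed for a few catalog sizes. It assumes a symmetric similarity score, so each unordered pair is scored once; the function name `unique_pair_count` is our own, purely for illustration.

```python
def unique_pair_count(n):
    """Number of unordered product pairs in a catalog of n products."""
    return n * (n - 1) // 2


# Illustrative catalog sizes (not actual Zalando figures).
for n in (1_000, 100_000, 1_000_000):
    print("{:>9,} products -> {:,} similarity scores".format(
        n, unique_pair_count(n)))
```

Even with the symmetry saving, a catalog of one million products requires roughly half a trillion scores, which is why an exhaustive self-join quickly becomes impractical.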

Product Similarity Graph

The Product Similarity Matrix, due to its construction, can be easily interpreted as an Adjacency Matrix, which in turn can be interpreted as a Network Graph, where products in the catalog are represented as nodes, and the similarity relationship between products is represented by connections between nodes. Network graphs are interesting mathematical objects and appear in a number of different areas of Computer Science, such as the study of online social networks (the social graph) and the study of internet traffic (communications graph). Here we are interested in the relationships between products in a fashion catalog, and we use a network graph to organise and store Zalando’s product data.
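As a rough sketch of this interpretation, the snippet below turns a tiny similarity matrix into an adjacency mapping, where every non-zero off-diagonal score becomes a connection between two product nodes. The product IDs, the scores and the `matrix_to_graph` helper are all hypothetical and only serve to illustrate the idea.

```python
# Hypothetical symmetric similarity matrix for three products.
product_ids = ["P1", "P2", "P3"]
similarity = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]


def matrix_to_graph(ids, matrix, threshold=0.0):
    """Return an adjacency mapping {node: {neighbour: score}}."""
    graph = {pid: {} for pid in ids}
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            # Skip self-similarity and pairs at or below the threshold.
            if i != j and score > threshold:
                graph[ids[i]][ids[j]] = score
    return graph


graph = matrix_to_graph(product_ids, similarity)
print(graph)
```

Here P1 and P3 have a score of zero, so no connection is created between them, while P2 is connected to both.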

Below, we present a visualization of a Product Similarity Graph for a small number of the products sold by Zalando. As is typical for graph visualizations we can see a rich structure emerge, where clusters of very similar products form. These clusters correspond to high-level product attributes such as product type, with similar products being close together, e.g., low shoes are clustered close to boots but far away from trousers. Within clusters, other more detailed low-level attributes such as color, materials and styles create a distinction between products. Other connections span clusters that are far apart, indicating a weak similarity relationship between products. However, these connections offer an opportunity to browse and explore other parts of the graph and hence other items in the catalog, offering both a hunting and exploring mode of product discovery. Finally, it is important to note that the graph is not fully connected, since many product pairs will exhibit no or low similarity.

Figure 1: A visualization of the Product Similarity Graph for a small selection of items from the Zalando Catalog.

Browsing via Graph Traversal

From looking at the visualization above, it is easy to imagine traversing the graph to find similar products of interest by choosing appropriate connections, quickly looking at one product then another until you find something you would like to buy. This is similar to what you might do in a bricks-and-mortar fashion store, where you start browsing in the trousers section, purchase an item, and decide to buy a matching shirt in another part of the store. However, how would we implement this feature? How do we enable a customer to discover products by browsing a network graph?

Network Graphs have been studied for many years and there exist many different algorithms to extract information from a graph, including algorithms to analyse the structure of a graph and algorithms to determine the optimal path through a graph between two nodes. However, for our use case where we would like to use a graph to drive a browsing experience – a scenario that has no predefined terminating node – there has not been much previous work. With this in mind, we present here a new Graph Exploration Algorithm called Graph Browser to enable browsing on a Product Similarity Graph, and provide a solution to the technical issues with browsing & exploring a graph in general.

Introducing the Graph Browser Algorithm

The Graph Browser algorithm enables browsing on a graph by generating a unique set of Navigation Tags for a product of interest on the graph, which we call the anchor product. The tags are generated directly from the product data neighbouring the anchor product and describe product attributes such as color, product type, brand etc. Furthermore, the tags indicate attribute differences between the anchor product and its neighbours in the graph, and allow the user to browse to a neighbouring product in the graph by simply selecting a tag. Common product attributes, such as color, may reference many neighbouring products, resulting in the tag referencing many possible products, while other tags may reference a single product. In the latter case, the referenced product becomes the new anchor product; in the former, the user chooses a single product to be the new anchor. Once a new anchor product is selected, a new set of navigation tags is generated, and the process is repeated until the customer finds a product they would like to buy or exits the process. In this way, the browsing experience is only ever concerned with products that are in the direct neighbourhood of the anchor product, and the graph is traversed one connection at a time, where each product visited in turn is similar to the previous one but differs by the selected tag. To explain further, we provide a diagram of the process in Figure 2 below.

Figure 2: A diagram of a Product Similarity Graph and its Navigation Tags (presented in curly braces) as generated by the Graph Browser Algorithm. Starting at P1, “black nike textile shoe”, one possible path to P3, “white fila leather shoe”, would be to select the following navigation tags in succession: “white” and “fila”, where selecting “white” will make P2 the new anchor, and selecting “fila” will move the anchor to P3. Alternatively, selecting the “white” and “leather” tags would also bring the customer to P3, based on a preference for a leather shoe rather than a brand preference.

To explain more formally, we present some details on the Graph Browser algorithm below, but first define some preliminaries: A Product Similarity Graph is constructed of product nodes, P = {p1,...,pn}, and connections between nodes that represent the similarity score between two products pi & pj, sij = sim_score(pi, pj). Each node has a record of all the product attributes for that product, pi = {a1,...,am}, which are used to generate the navigation tags. The algorithm is as follows:

1. A single node in the graph is selected as the anchor node, pi.
2. A set of all nodes connected to pi is constructed, Pcon = {p1,...,pk}.
3. A mapping is constructed of the attribute differences between the anchor node, pi, and the attributes of the products contained in the set of connected nodes, M = diff_attrs(pi, Pcon), where M[pj] returns the set of attributes that differ, Dij = {a1,...,ad}, between products pj & pi.
4. We construct a mapping of single attributes, or tags, to the products they reference by inverting our attribute difference mapping, and indexing attributes individually to the products they belong to, Q = tag_map(M), where products in Pcon that have common differing attributes are indexed by the same tag, i.e., the mapping Q[ai] returns the connected product, or products, that the tag references in the graph.
5. The set of navigation tags, T = {t1,...,tp}, for the anchor, pi, corresponds to the keys of the mapping Q. The user selects a tag, ts, from T and the new anchor product is selected from the indexed set of products, Q[ts]: pi becomes the single product in Q[ts] if only one product is indexed, or is selected by the user if there is more than one option.
6. The algorithm returns to step 1 and the process repeats.

This completes the description of the Graph Browser algorithm. We present a small Python code implementation of the Graph Browser algorithm at the end of this article.
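Before the full listing, a single iteration of the algorithm can be sketched with plain dictionaries and sets. This is a simplified illustration rather than the appendix implementation: the attribute records and the chain of connections P1 - P2 - P3 are hypothetical (loosely following Figure 2), and `diff_attrs` and `tag_map` are our own stand-ins for the mappings M and Q described above.

```python
# Hypothetical attribute records and connections (chain P1 - P2 - P3).
attrs = {
    "P1": {"nike", "shoe", "textile", "black"},
    "P2": {"nike", "shoe", "textile", "white"},
    "P3": {"fila", "shoe", "leather", "white"},
}
neighbours = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}


def diff_attrs(anchor, connected):
    """M[pj]: attributes of neighbour pj that the anchor does not have."""
    return {pj: attrs[pj] - attrs[anchor] for pj in connected}


def tag_map(m):
    """Q[tag]: invert M so that each tag lists the neighbours it references."""
    q = {}
    for pj, diffs in m.items():
        for tag in sorted(diffs):
            q.setdefault(tag, []).append(pj)
    return q


def navigation_tags(anchor):
    """Navigation tags for one anchor: the tag-to-products mapping Q."""
    return tag_map(diff_attrs(anchor, neighbours[anchor]))


print(navigation_tags("P1"))
print(navigation_tags("P2"))
```

From P1 the only available tag is “white”, which leads to P2; from P2 both “fila” and “leather” lead to P3, matching the two paths described in the caption of Figure 2.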

Other Considerations

You will notice above that the initialization of the anchor node, pi, is not defined in the algorithm; choices include random initialization, user-specific recommendations and search query initialization. Furthermore, we do not discuss how navigation tags are presented to the user; they can be ordered using a ranking function that optimizes the positioning of the tags, for example by using the similarity scores themselves or a customer’s preferences. Moreover, for the use case we present here we are interested in differences between products; however, we could also generate tags that represent common product attributes.
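One possible ranking function, sketched under our own assumptions, orders the tags by the highest similarity score among the products each tag references. The tag mapping, the scores and the `rank_tags` helper are hypothetical; a production ranking would likely also take customer preferences into account.

```python
def rank_tags(tag_to_products, scores):
    """Order tags by the best similarity score among the products they reference.

    tag_to_products: {tag: [product_id, ...]}
    scores: {product_id: similarity of that product to the anchor}
    """
    def best_score(tag):
        return max(scores[p] for p in tag_to_products[tag])

    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(tag_to_products, key=lambda t: (-best_score(t), t))


# Hypothetical navigation tags and similarity scores for one anchor product.
nav_tags = {"fila": ["P3"], "leather": ["P3"], "black": ["P1"]}
scores_to_anchor = {"P1": 0.6, "P3": 0.33}

print(rank_tags(nav_tags, scores_to_anchor))
```

With these inputs the “black” tag is listed first because it leads to the most similar neighbour.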

Finally

The Product Similarity Graph, Graph Browser algorithm and the Navigation Tags it generates all combine to produce a quick and easy way to browse an online product catalog. While we are only beginning to explore the possibilities of the Product Similarity Graph within Zalando, we are hopeful that it will be used to drive some of our product discovery and tag-based navigation features in the future.

Appendix: Python Graph Browser Implementation

import networkx as nx
# Define product records
products = [
    # ID,  Brand,  Type,   Material,   Color
    ['P1', 'nike', 'shoe', 'textile', 'black'],
    ['P2', 'nike', 'shoe', 'textile', 'white'],
    ['P3', 'fila', 'shoe', 'leather', 'white'],
    ['P4', 'fila', 'sock', 'synth', 'blue']
]

# Use jaccard index as similarity score
def jaccard_index(x, y):
    """Returns the jaccard index between `x` & `y`.
    """
    x, y = set(x), set(y)
    return len(x.intersection(y)) / len(x.union(y))

def product_similarity_matrix(products, sim_score=jaccard_index):
    """Returns a Product Similarity Matrix for specified ``products`` &
    similarity score, ``sim_score``.
    """
    prod_sim_mat = {}
    # n(n-1)/2 scores
    for i, product_i in enumerate(products):
        for product_j in products[:i]:
            idi, *attr_i = product_i
            idj, *attr_j = product_j
            prod_sim_mat[(idi, idj)] = sim_score(attr_i, attr_j)
    return prod_sim_mat

def product_similarity_graph(prod_sim_mat, products):
    """Combine Product Similarity Matrix and ``products`` to construct a
    Product Similarity Graph.
    """
    # Create networkx graph
    PSG = nx.DiGraph()

    # Add nodes and attrs to graph
    for product in products:
        id_, *attrs = product
        PSG.add_node(id_, attrs=attrs)

    # Add edges and scores to nodes
    for ind, score in prod_sim_mat.items():
        if score > 0:
            start, end = ind
            PSG.add_edge(start, end, score=score)
            PSG.add_edge(end, start, score=score)
    return PSG

def generate_diff_attrs(prod_sim_graph):
    """Generate a set of attribute differences between each pair of connected
    nodes in ``prod_sim_graph`` and add to edges.
    """
    for anchor, neighbour in prod_sim_graph.edges():
        anchor_attrs = set(prod_sim_graph.nodes[anchor]['attrs'])
        neighbour_attrs = set(prod_sim_graph.nodes[neighbour]['attrs'])
        # Tags are the attributes the neighbour has that the anchor lacks
        prod_sim_graph[anchor][neighbour]['diff_tags'] = \
            neighbour_attrs - anchor_attrs
    return prod_sim_graph

def generate_nav_tags(prod_sim_graph):
    """Generate a navigation tag map for each node of the ``prod_sim_graph``.
    """
    for anchor in prod_sim_graph.nodes():
        tag_map = {}
        for neighbour in prod_sim_graph.neighbors(anchor):
            for tag in prod_sim_graph[anchor][neighbour]['diff_tags']:
                tag_map[tag] = tag_map.get(tag, []) + [neighbour]
        prod_sim_graph.nodes[anchor]['nav_tags'] = tag_map
    return prod_sim_graph


if __name__ == "__main__":

    # Construct Product Similarity Matrix
    prod_sim_mat = product_similarity_matrix(products)

    # Construct Product Similarity Graph
    PSG = product_similarity_graph(prod_sim_mat, products)

    # Generate difference attributes and attach to edges
    PSG = generate_diff_attrs(PSG)

    # Generate navigation tags and attach to nodes
    PSG = generate_nav_tags(PSG)

    ### Browse the PSG using simulated user inputs ###

    # Setup simulated user inputs
    anchor = 'P1'
    tag_selections = ['white', 'fila', 'sock']
    product_selections = ['P2']

    # Run through simulated inputs
    for selection in tag_selections:
        print("Anchor: {}".format(anchor))
        print("Tag selection: {}".format(selection))
        # Graph Browser
        candidates = PSG.nodes[anchor]['nav_tags'][selection]
        if len(candidates) > 1:
            product_selection = product_selections.pop(0)
            assert product_selection in candidates, "Bad selection"
            anchor = product_selection
            print("Product selection: {}".format(product_selection))
        else:
            anchor, = candidates
    print("Bought product: {}".format(anchor))

    print("\n\tFin.\n")

Zalando Postgres Operator: One Year Later

2018-11-26T00:00:00+01:00

Zalando Postgres operator: one year later

The Postgres operator provides a managed Postgres service for Kubernetes. It extends the Kubernetes API with a custom “postgresql” resource that describes desired characteristics of a Postgres cluster, monitors updates of this resource and adjusts Postgres clusters accordingly. Zalando successfully uses the operator to manage more than 450 Postgres clusters across a large number of Kubernetes installations.

Moving to production

More than a year and a half ago, Zalando prepared for running stateless and stateful applications alike on Kubernetes. With tens of teams working with hundreds of databases across multiple Kubernetes clusters, any kind of manual operations was out of the question. To keep the workload manageable, Zalando’s database team therefore decided to automate the operations procedures. The operator pattern, well known in the Kubernetes universe, turned out to be a perfect fit for the job.

At present the operator manages more than 400 Postgres clusters in Zalando: it watches requests for additions, deletions and updates of Postgres manifests and automatically carries out all necessary actions on the clusters. This saves time for engineers and the admins alike: instead of manually configuring numerous Kubernetes objects, they just submit a single YAML file describing the desired Postgres cluster setup, and the operator takes care of the rest.
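To give a feel for what such a manifest contains, here is a hedged sketch that builds one as a Python dictionary and prints it as JSON (which is also valid YAML). The field names and values are modelled on the operator's minimal example manifest as we recall it and should be treated as assumptions; check them against the operator documentation for the version you run.

```python
import json

# Assumed field names, modelled on the operator's minimal example manifest;
# verify them against the operator documentation before use.
manifest = {
    "apiVersion": "acid.zalan.do/v1",
    "kind": "postgresql",
    "metadata": {"name": "acid-minimal-cluster"},
    "spec": {
        "teamId": "acid",                    # owning team
        "numberOfInstances": 2,              # number of Postgres pods
        "volume": {"size": "1Gi"},           # persistent volume per pod
        "users": {"zalando": ["superuser", "createdb"]},
        "databases": {"foo": "zalando"},     # database name -> owner
        "postgresql": {"version": "10"},
    },
}

print(json.dumps(manifest, indent=2))
```

Everything else, such as the underlying Kubernetes objects, is derived by the operator from this single description.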

A year ago, the operator had just left the prototype stage and was still in its infancy. Since then we have extended it into a production-ready Postgres-on-Kubernetes managed service with numerous features such as:

Role-based access control: By its very nature, the operator requires broad permissions to operate databases in the Kubernetes environment. Given the importance of security, we factored out a separate operator-specific service account and employed the RBAC capabilities of Kubernetes to precisely define the rights required by the operator adhering to the principle of least privilege.
Integration with external services: Postgres databases do not run in isolation but rather within a complex tech infrastructure. Seamless integration with existing tools is of great importance for our customer experience. Our generic sidecar container support enables running third-party applications side-by-side with the database pods. An example of such an approach is a Scalyr sidecar that ships the Postgres container logs to the Scalyr service transparently to the user, enabling employees to use standard log processing tools.
Log shipping of Postgres logs to cloud storage: While Postgres normally rotates its log files within one week, the operator and Spilo can join forces to continuously archive the database log history in the cloud for as long as necessary.
Support for multiple namespaces: Namespaces enable us to better structure applications of different teams within a single Kubernetes cluster; a typical use case involves running experiments in a dedicated namespace and then deleting the no longer needed results by simply dropping the namespace. To take full advantage of multiple namespaces, we designed and built into the operator the ability to manage databases running in namespaces other than the default one.
API versioning: We keep an eye on the ongoing evolution of Kubernetes and promptly adopt the most useful features for the benefit of operator users. Recently, we started to use Kubernetes-standard code generation to implement the API of the “postgresql” custom resource. By doing so we introduced API versioning to the operator and greatly reduced the manual effort needed to support new Kubernetes versions within the operator codebase.
Last but not least, we recognized the ever increasing adoption of our software and for that reason contributed documentation to ease running this service in environments other than ours.

Our efforts culminated in the release of the operator’s first stable version in August 2018. As the software we have built proved to be such a success within Zalando, we reached out to the broader cloud computing community to share the experience of developing and operating a managed stateful service on top of Kubernetes. We are pleased to share our achievements with the community at top-tier industry conferences such as FOSDEM 2018 and KubeCon North America 2018.

Want to delve in?

If you want to know more, check out our talks for a deeper technical perspective on what we are doing. For those of you who are willing to obtain hands-on experience with the hot technologies such as Postgres, Kubernetes, or golang in the thriving open-source environment, we prepared a list of good first issues.

Zalando Research Releases “Flair”

2018-11-22T00:00:00+01:00

Open sourcing machine learning research for natural language processing (NLP)

Two years ago, Zalando Research launched with a clear purpose to ensure that Zalando Tech is at the forefront of research in the areas of data science, machine learning, natural language processing and artificial intelligence.

Our researchers’ work was previously focused mainly within Zalando. Therefore, we are very excited to announce that we have released “Flair”, our state-of-the-art natural language processing (NLP) library. Flair is released under the MIT license and will continue as an actively maintained open source project under Zalando leadership.

Zalando Research Team

The Flair project is our cutting edge framework for natural language processing (NLP), meaning a framework to give a computer the ability to understand, tag and classify written texts. Flair is useful when you want to understand the meanings of email messages, customer responses, website comments, or any other scenario where users submit text feedback that you want to automatically classify or otherwise process.

The library is implemented in Python on top of the popular PyTorch deep learning framework. It packages pre-trained models for NLP tasks, including named entity recognition (NER) to detect things like person or location names in text and part-of-speech tagging to detect syntactic word types like verbs and nouns. It allows you to easily apply our pre-trained models to your text, or train your own sequence labeling or text classification models.

For instance, we can train Flair to recognize fashion concepts such as brands, colors or seasons in text, or to classify whole text documents into one or more categories. Check out the results of such below:

Due to its versatility, Flair is already part of several in-production systems at Zalando, as machine learning has become a natural part of our engineering toolbox.

You can find documentation and the source code of Flair on Github.

This is an important milestone for the open source and research teams at Zalando. Having research mature into in-production tooling and made available to the wider tech ecosystem as open source indicates a healthy and cutting-edge engineering culture at Zalando.

Comparison with the state-of-the-art

Flair outperforms the previous best methods on a large range of NLP tasks; evaluation against industry-standard datasets shows substantial improvements:

Get involved

We invite you to start using Flair. There is already extensive documentation available on how to use the framework, so you can quickly get up and running and experiment with the models included, or train your own if you wish.

There is a growing community around Flair already, contributing new features and support for other languages.

Train Deep Learning Models on AWS

2018-11-08T00:00:00+01:00

A real-life example of how to train a Deep Learning model on an AWS Spot Instance using Spotty

Spotty is a tool that simplifies training of Deep Learning models on AWS.

Why will you ❤️ this tool?

it makes training on AWS GPU instances as simple as training on your local computer
it automatically manages all necessary AWS resources including AMIs, volumes and snapshots
it makes your model trainable on AWS by everyone with a couple of commands
it detaches remote processes from SSH sessions
it saves you up to 70% of the costs by using Spot Instances

To show how it works, let’s take a non-trivial model and try to train it. I chose one of the implementations of Tacotron 2. It’s a speech synthesis system by Google.

Clone the repository of Tacotron 2 to your computer:

git clone https://github.com/Rayhane-mamah/Tacotron-2.git

Docker Image

Spotty trains models inside a Docker container. So we need to either find a publicly available Docker image that satisfies the model’s requirements, or create a new Dockerfile with a proper environment.

This implementation of Tacotron uses Python 3 and TensorFlow, so we could use the official TensorFlow image: tensorflow/tensorflow:latest-gpu-py3. But this image doesn’t satisfy all the requirements from the “requirements.txt” file. So we need to extend this image and install all necessary libraries on top.

Create the Dockerfile file in the root directory of the project:

FROM tensorflow/tensorflow:latest-gpu-py3

WORKDIR /root

# install pyaudio library
RUN apt-get update \
   && apt-get install -y python3-pyaudio \
   && apt-get clean \
   && rm -rf /var/lib/apt/lists/*

# install other requirements
COPY requirements.txt requirements.txt
RUN grep -v '^pyaudio' requirements.txt > requirements_updated.txt \
   && pip3 install -r requirements_updated.txt

Here we’re extending the original TensorFlow image and installing all other requirements (I couldn’t install the pyaudio library through pip, so I did it using apt).

Also, create the .dockerignore file with the following content:

# ignore everything
**

# allow only requirements.txt file
!/requirements.txt

Otherwise, you would get an out-of-space error, because Docker would copy the entire build context (including the heavy “training_data/” directory) to the Docker daemon.

Spotty Configuration File

Once we have the Dockerfile, we’re ready to write a Spotty configuration file. Create the spotty.yaml file in the root directory of the project.

It consists of 3 sections: project, instance and scripts.

Section 1: Project

project:
 name: Tacotron2
 remoteDir: /workspace/project
 syncFilters:
   - exclude:
       - .idea/*
       - .git/*
       - '*/__pycache__/*'
       - training_data/*

The section contains the following parameters:

Name of the project: The name will be used in names of AWS resources. For example, in the name of the S3 bucket that will be used to synchronize the project code with the instance.
Remote directory: It’s a directory where the project will be stored on the instance.
Synchronization filters: Filters are being used to exclude directories which shouldn’t be synchronized with the instance. For example, we ignore PyCharm configuration, Git files, Python cache files and training data.

Section 2: Instance

instance:
 region: us-east-2
 instanceType: p2.xlarge
 volumes:
   - name: Tacotron2
     directory: /workspace
     size: 50
 docker:
   file: Dockerfile
   workingDir: /workspace/project
   dataRoot: /workspace/docker
 ports: [6006, 8888]

The section contains the following parameters:

Region: AWS region where a Spot Instance will be launched.
Instance type: Type of AWS EC2 instance.
List of volumes: Each volume has a name, a directory where the volume will be mounted, and a size. When you’re starting an instance the first time, the volume will be created. When you’re stopping the instance, a snapshot will be taken and automatically restored next time.
Docker: Here we set the path to our Dockerfile. An alternative approach is to build the image locally and push it to the Docker Hub Registry; then you can use the name of the image instead of a file. We also set a working directory, which will be used by the scripts from the “scripts” section. In addition, we can change the Docker data root directory to a directory on the attached volume, so that the downloaded images are saved with a snapshot of the volume. Next time it will take less time to restore the image.
Ports: Ports to expose. In this example, we open 2 ports: 6006 for TensorBoard and 8888 for Jupyter Notebook.

Read more about other parameters in the documentation.

Section 3: Scripts

scripts:
 preprocess: |
   curl -O http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
   tar xvjf LJSpeech-1.1.tar.bz2
   rm LJSpeech-1.1.tar.bz2
   python3 preprocess.py
 train: |
   python train.py --model='Tacotron-2'
 tensorboard: |
   tensorboard --logdir /workspace/project/logs-Tacotron-2
 jupyter: |
   /run_jupyter.sh --allow-root

Scripts are optional, but very useful. They can be run on the instance using the following command:

$ spotty run <SCRIPT_NAME>

For this project we’ve created 4 scripts:

preprocess: downloads the dataset and prepares it for training,
train: starts training,
tensorboard: runs TensorBoard on the port 6006,
jupyter: starts Jupyter Notebook server on the port 8888.

That’s it! The model is ready to be trained on AWS!

Spotty Installation

Requirements

Python 3
Installed and configured AWS CLI (see Installing the AWS Command Line Interface)

Installation

Install Spotty using pip:

$ pip install -U spotty

Create an AMI with NVIDIA Docker. Run the following command from the root directory of your project (where the spotty.yaml file is located):

$ spotty create-ami

In several minutes you will have an AMI that can be used for all your projects within the AWS region.

Model Training

Start a Spot Instance with the Docker container:

$ spotty start

Once the instance is up and running, you will see its IP address. Use it to open TensorBoard and Jupyter Notebook later.

Download and preprocess the data for the Tacotron model. We already have a custom script in the configuration file to do that. Just run:

$ spotty run preprocess

Once the preprocessing is done, train the model. Run the “train” script:

$ spotty run train

On a “p2.xlarge” instance it will probably take around 8–9 days to reach 120 thousand steps. But you could use instances with more performant GPUs to make the training faster.

You can detach this SSH session using the Ctrl + b, then d key combination. The training process won’t be interrupted. To reattach that session, just run the spotty run train command again.

TensorBoard

Start the TensorBoard using the “tensorboard” script:

$ spotty run tensorboard

TensorBoard will be running on port 6006. You can detach the SSH session using the Ctrl + b, then d key combination; TensorBoard will keep running.

Jupyter Notebook

You can use Jupyter Notebook to download trained models to your computer. Use the “jupyter” script to start it:

$ spotty run jupyter

Jupyter Notebook will be running on the port 8888. Open it using the IP address of the instance and the URL that you see in the output of the command.

SSH Connection

To connect to the running Docker container via SSH, use the following command:

$ spotty ssh

It uses a tmux session, so you can always detach it using the Ctrl + b, then d key combination and attach that session later using the spotty ssh command again.

Stop Instance

Don’t forget to stop the instance once you are done! Use the following command:

$ spotty stop

When you’re stopping the instance, Spotty automatically creates snapshots of the volumes. When you start an instance next time, it will restore the snapshots automatically.

Conclusion

Using Spotty is a convenient way to train Deep Learning models on AWS Spot Instances. It will save you not just up to 70% of the cost, but also a lot of time on setting up an environment for your models and notebooks. Once you have a Spotty configuration for your model, everyone can train it with a couple of commands.

This article was originally published on Medium.

Open Source: October Review - Hacktoberfest, new releases and more.

2018-11-06T00:00:00+01:00

Project Highlights

Connexion version 2.0 with OpenAPI 3 support is ready, check out what is new in our latest release! Connexion is the Swagger/OpenAPI first framework for Python on top of Flask with automatic endpoint validation & OAuth2 support. With 87 active contributors and more than 1,000 repositories that depend on it worldwide, Connexion is one of the most successful open source releases of Zalando.
Postgres-Operator: After one year of development, this operator now manages more than 500 Postgres clusters across a large number of Kubernetes installations inside Zalando. Our engineers do not need to manually configure numerous Kubernetes objects; they just submit a single text file describing the desired Postgres cluster, and the operator takes care of the rest. Postgres-Operator was first started by Zalando’s database team to provide a managed PostgreSQL service for Kubernetes. Try out the operator here!
Flair: Zalando Research recently released a new version of this open source Natural Language Processing framework. It now runs on both Linux and Mac, click here to test! Flair gives users the ability to tag, classify and understand the meanings of email messages, customer responses, website comments, or any other scenario where users submit text feedback to be automatically classified or otherwise processed.

Inside Zalando Open Source

Zalando hosted a Hack Night at the Berlin office to celebrate Hacktoberfest - the month of open source. The main event started with a number of lightning talks by open source projects, followed by a hacking session, where Zalando engineers gathered as teams and worked on challenges under domains of machine learning, database and web plug-ins. Project maintainers were present to support participants completing their very first contribution and pull request.
The first Open Source Onboarding Training was conducted on October 9th as a part of Zalando’s Tech Bootcamp, where we explained to the new joiners the importance of open source at Zalando and how open source fits with Zalando’s culture and the way we work. During this training, we also highlighted the open source journey of a developer and guided people on how to contribute to open source projects.
Open Source Team released a promotion framework that helps engineering teams to grow an ecosystem around their open source projects through various outreach and onboarding activities. This framework includes blogging tips, utilizing social media, organizing a release party, and writing tips for public announcement.

Zalando Open Source Around The World

PostgreSQL Conference Europe, October 23 - 26, 2018: At PGConf Europe, Alexander Kukushkin, Zalando Database Engineer, presented how Zalando migrated one of the largest Postgres clusters to AWS EC2 with Patroni, a template for PostgreSQL High Availability with ZooKeeper, etcd, or Consul.
Open Source Summit Europe, October 22 - 24, 2018: Our first speaker, Dmitry Dolgov, Zalando Software Engineer, delivered a talk on PostgreSQL + Linux Kernel, showing common techniques of configuring the Linux kernel to work efficiently with PostgreSQL. The second speaker, Per Ploug, Zalando Open Source Community Manager, gave a presentation on ‘Turning Policy into Tooling’, where he outlined concrete efforts, tools and services that Zalando has developed and uses to remove compliance barriers. Finally, Zalando InnerSource Manager, Hong Phuc Dang, spoke on a 'Mentorship Panel' together with Open Source Program Managers of Intel, Google, Bitergia and a researcher from Inria. The discussion covered topics such as the value of mentorship, mentorship metrics, challenges and diversity.

Open Source Summit Europe: From left to right – Julia Lawall, Senior Researcher (Inria) - Josh Simmons, Open Source Program Manager (Google) - Hong Phuc Dang, InnerSource Manager (Zalando) - Jeffrey Osier-Mixon, Open Source Program Manager (Intel)

GitHub Universe USA, October 16 - 17, 2018: Zalando joined GitHub, Oracle and Comcast in a panel discussion about ‘The keys to open source success for enterprise teams’.

GitHub Universe: From left to right – Bonnie Chatterjee, Director, Professional Services (GitHub) - Chad Arimura, Vice President of Serverless (Oracle) - Shilla Saebi, Open Source Community Lead (Comcast) - Per Ploug, Open Source Community Manager (Zalando)

Connexion 2.0 Release

2018-11-05T00:00:00+01:00

Today, we released Connexion 2.0 with OpenAPI 3 support.

Connexion is a Python framework that automagically handles HTTP requests based on OpenAPI Specification (formerly known as Swagger Spec) of your API described in YAML format. Connexion allows you to write a Swagger specification, then maps the endpoints to your Python functions.

Besides routing, Connexion also validates requests and responses automatically based on OpenAPI specifications, handles common authentication schemes, supports API versioning and supports automatic serialization of payloads. It can use both Flask and aiohttp as backend servers.

Besides OpenAPI 3 support, this release includes a more streamlined internal structure, better adherence to Swagger 2.0 spec by default, and support for basic authentication and apikey authentication. For a more detailed list of changes, check Connexion's Read Me.

Connexion 2.0 would not have been possible without the help of all our 87 contributors, especially our newest maintainer Daniel Grossmann-Kavanagh, who deserves most of the credit for this release.

#NoEstimates

2018-11-01T00:00:00+01:00

Why I advocate a practice of no estimates as a software engineer

Before I get to the topic, I would like to clarify one thing: I don’t want to ban estimations generally from software development, as there are good and solid reasons for it. In a nutshell, business needs to be predictable.

I want to show a software developer's view on how to reduce or even get rid of endless estimation meetings with doubtful outcomes. Critics would argue that software developers should improve their estimation skills in order to:

develop shared understanding within the team, especially in case of uncertainty
make informed decisions when very little data (about the product) is available
make the product more predictable

But let’s go step by step on how no estimates lead to the same goals.

Note: When I mention “team,” I mean a software development team and by “developer,” a software developer.

Improving predictability - Is there only one way?

There are many factors to improve the speed and predictability of software development. In case a team is missing some of those factors, they should be set and measured as team objectives in the meantime. Here are some of them:

Stable and autonomous team: The product is built by the teams, so the focus should be to make them stable. Autonomy gives teams self-confidence and fosters maturity, which allows for decisions on how to build things.
Reducing meetings: Keeping the teams busy with meetings will result in less time for developing. Software development consumes a lot of concentration and energy, so distractions should be kept to the minimum.
Cadence of collaboration: Planning together with designers and product teams to clarify the tasks, building and reviewing together to keep a high quality of code and strengthen knowledge sharing within the team.
High visibility and transparency: Making work and progress of the development transparent to stakeholders and managers will increase trust within the organization.

Really, no estimates?!

Let’s start to gather some arguments for estimations. Business runs on goals and commitments, and deciding which product should be built is a decision of computing cost against expected profit. With no estimates, how can costs be calculated? How and which commitments should be made? How can different projects be compared?

At some level, estimations have to be done in order to have numbers like cost and time for business decisions. But should this happen on the task level? No! Besides spending time to estimate every single task, it puts time pressure on the developers. Estimation also includes guessing. How certain can you be before starting a task? What is more important? Delivering on time or finding the best solution for the task? Let the developer decide this, as he/she is building the product. The idea behind it is to let the teams focus on what they do best: building the right products, which are reliable, stable and predictable.

How can this look?

Here, I want to quote one of the most active advocates of #NoEstimates (who’s been recommending this for as long as 15 years), Vasco Duarte:

The backlog should not contain well estimated stories, increasing the backlog after every sprint. Product should define the most important task and after completing it, they should define what to proceed with i.e. being agile!
Chunks should be small. It’s easier to split the work within the team and keep everybody up to date about the code base i.e. knowledge sharing!
Keeping chunks small decreases the developing and reviewing time. I feel amazed after finishing a task which didn’t take several days or get merged with another task i.e. boosting motivation to the next level!
Receive feedback from customers/stakeholders i.e. again, being agile!

My takeaway from the practice and experience of #NoEstimates is to empower teams to unleash their best. Provide them with the space and environment where nobody tells them what needs to be done but rather they do what needs to be done autonomously. My emphasis on #NoEstimates is to make sure that teams think about the value they deliver rather than going into a vicious circle of discussions. In the end, what matters is the outcome and not the input. Teams, especially developers, need to shift their mind towards value-driven development. This enables them to trigger discussions based on the value they could possibly deliver to the end customer, which is often the missing piece of puzzle. Teams should not follow the complaint mechanism but rather outcome-oriented, customer-centric decisions. Leaders play a vital role in enabling this kind of culture by providing teams with a “safe to fail” environment, where praise is given for experimentation. This, in the end, helps the team to grow and flourish.

For more information, why not check out some of my favorite sources on #NoEstimates: (1) #NoEstimates video by Vasco Duarte. (2) #NoEstimates: 6 Software Experts Give Their View on the Movement by Thomas Carney.

Singleton Types

2018-10-25T00:00:00+02:00

A Scala 3 Experiment

I'll start this post by admitting that I’ve never gone deeply into any kind of Scala coding on the typelevel. It's not what I, as a common application (or microservice) developer, usually need.

Having stated that, of course, I might be missing out on a whole world of opportunities for better code without knowing. And because of that, I put some effort into trying to understand the features of Scala that might sound strange, overly-theoretical, and maybe even useless, at first sight.

A concept I couldn't imagine a proper use case for was the so-called "singleton types" (also called "literal types" or even "literal singleton types"). As it happens, I recently attended a Scala Days talk about them. In this talk, singleton types are used for improving the type-safety of database queries, and inspired by this, I finally got an idea of where I could try them out for myself.

Remember matrix multiplication from math, and how the dimensions have to fit? And how all the usual libraries for matrix multiplication take matrices of any dimension, and then throw runtime exceptions when the dimensions don't fit? That's what I'll use singleton types for in the following.

Let's start with the conventional approach, leaving out the actual multiplication details for brevity:

final case class Matrix(n: Int, m: Int) {
  def *(other: Matrix): Matrix = {
    require(m == other.n,
        s"matrix dimensions must fit ($m != ${other.n})")
    Matrix(n, other.m)
  }
}

In this piece of code, the runtime check ensures that the matrix dimensions are not just any integers, but also that they actually fit, so that the matrix multiplication can work at all.
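To see the runtime check in action, here is a minimal sketch that repeats the class from above and wraps two multiplications in Try. The RuntimeCheckDemo object and its value names are illustrative assumptions and not part of the original code.

```scala
import scala.util.Try

// Conventional approach: dimensions are plain Ints and are checked at runtime.
final case class Matrix(n: Int, m: Int) {
  def *(other: Matrix): Matrix = {
    require(m == other.n,
      s"matrix dimensions must fit ($m != ${other.n})")
    Matrix(n, other.m)
  }
}

// Hypothetical demo object, only here to exercise the check.
object RuntimeCheckDemo {
  val a = Matrix(2, 4)
  val b = Matrix(4, 3)

  // Dimensions fit: a 2x4 matrix times a 4x3 matrix gives a 2x3 matrix.
  val ok: Try[Matrix] = Try(a * b)

  // Dimensions do not fit: this compiles, and fails only when evaluated.
  val bad: Try[Matrix] = Try(b * a)

  def main(args: Array[String]): Unit = {
    println(ok)
    println(bad)
  }
}
```

The mismatched product b * a is accepted by the compiler and only fails once it is evaluated, when require throws an IllegalArgumentException.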

This check is only necessary because we allow for all kinds of integers here in the first place. This is not the only option we have, though.

Here's a small Scala 3 REPL session (using dotr) that might surprise you:

scala> val y = 3
val y: Int = 3

scala> val y: 3 = 3
val y: Int(3) = 3

scala> val z: 4 = 3
1 |val z: 4 = 3
  |           ^
  |           found:    Int(3)
  |           required: Int(4)

See how y and z somehow get their value ascribed as their type? This is what singleton types are about: A singleton type is a type inhabited by exactly one value. So we might as well name the type after the value. Of course, singleton types like 3 or 4 are subtypes of Int, just as singleton types like "meep" or "foo" are subtypes of String.

This is all well and good, but how to make use of these types?

The basic idea here is to restrict the type of each of the two matrix dimensions to be a singleton type, instead of Int. Then we can ensure at compile time that two dimensions are exactly the same number by ensuring that they have the same singleton type.

In order to restrict a type to a singleton type, Scala 3 has a type called Singleton. Combined with the new intersection type operator & (very similar to what Scala previously had with the with keyword, but symmetrical), we can express:

A should be an integer singleton type

By writing:

A <: Singleton & Int

Making use of this, we can define our matrix class in the following way:

type Dim = Singleton & Int

final case class Matrix[A <: Dim, B <: Dim](n: A, m: B) {
  def *[C <: Dim](other: Matrix[B, C]): Matrix[A, C] =
    Matrix(n, other.m)
}

And with this, we get the compile-time behavior we were aiming for:

scala> val a = Matrix(2, 4)
val a: Matrix[Int(2), Int(4)] = Matrix(2,4)

scala> val b = Matrix(4, 3)
val b: Matrix[Int(4), Int(3)] = Matrix(4,3)

scala> a * b
val res10: Matrix[Int(2), Int(3)] = Matrix(2,3)

scala> b * a
1 |b * a
  |    ^
  |    found:    Matrix[Int(2), Int(4)](a)
  |    required: Matrix[Int(3), C]
  |
  |    where:    C is a type variable with constraint <: Dim

scala> val c = Matrix(3, 5)
val c: Matrix[Int(3), Int(5)] = Matrix(3,5)

scala> res10 * c
val res11: Matrix[Int(2), Int(5)] = Matrix(2,5)

And that's actually all there is to it. As an aside, notice how the error message from the compiler is pretty concisely telling us where we went wrong.
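The same behavior can be reproduced outside the REPL. The following is a minimal sketch, assuming a current Scala 3 compiler; the CompileTimeCheckDemo object and its value names are illustrative and not from the code above, and the type annotations are written out only to make the types visible. It relies on typeChecks from scala.compiletime.testing, which reports at compile time whether a code snippet would type check.

```scala
import scala.compiletime.testing.typeChecks

// A dimension is an Int that is also a singleton (literal) type.
type Dim = Singleton & Int

final case class Matrix[A <: Dim, B <: Dim](n: A, m: B) {
  // The column type B of this matrix must equal the row type of the other one.
  def *[C <: Dim](other: Matrix[B, C]): Matrix[A, C] =
    Matrix(n, other.m)
}

// Hypothetical demo object, only here to exercise the compile-time check.
object CompileTimeCheckDemo {
  val a: Matrix[2, 4] = Matrix(2, 4)
  val b: Matrix[4, 3] = Matrix(4, 3)

  // Accepted: the inner dimensions are both the singleton type 4.
  val ab: Matrix[2, 3] = a * b

  // typeChecks is evaluated by the compiler, so the mismatch is detected
  // without ever running a multiplication.
  val fitsAccepted: Boolean = typeChecks("a * b")
  val mismatchRejected: Boolean = !typeChecks("b * a")

  def main(args: Array[String]): Unit = {
    println(ab)
    println(s"a * b accepted: $fitsAccepted, b * a rejected: $mismatchRejected")
  }
}
```

Writing b * a directly in this object would stop compilation, which is why the sketch asks the compiler about it through a string instead.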

So here we are, having used singleton types to make a very simple matrix multiplication API a bit more typesafe, coming out of this with one more tool in our Scala 3 tool belt.

And now, it's your turn to find more use cases for singleton types!

Growing a Product Area at Zalando

2018-10-18T00:00:00+02:00

The six month journey of the customer inbox multi-disciplinary team

The customer inbox multi-disciplinary area operates in the Fashion Store pillar of the Zalando platform organization. The purpose of the Customer Inbox Unit is to serve customers personal and practical fashion messages, through multiple channels, i.e. “Target the customers at the right time, at the right place.”

In this post, we share how the Customer Inbox area simplified and transformed from four delivery teams having a component focus and complicated structures, to a performing unit with a business focus able to grow simple structures healthily.

Complicated structures do not scale. Healthy organizations tackle the complexity of product development by growing simple practices and simple structures. Simplicity allows complex mechanisms – like face to face conversation, individual interactions, continuous improvement – to emerge and it takes complexity to handle complexity.

Customer centric teams

The multidisciplinary delivery teams had a strong technical component focus with certain benefits and also pitfalls: “It is not our component responsibility” was often given as a reply to the product managers during product feature discussions. As a result, product was writing technical requirements fitting the narrow technical teams’ purposes and a lot of time was spent on organizing dependencies.

Together with the Head of Engineering, we triggered a team workshop, where all of the four teams’ members realized the waste generated by component-focused teams, and decided to shift to business-focused teams. The realization happened using the easter egg simulation (that we tweaked a little to illustrate component vs business-focused teams).

Teams are now conducting end-to-end business initiatives, stretching across various components, and innersource (or manage themselves) the dependencies. Product managers write business initiatives with a customer focus (as opposed to tickets with narrow technical focus). The amount of overhead to manage dependencies is reduced. Customer centric structures can grow.

An exemplary situation to illustrate the behavioral changes happened shortly after the workshop. In order to differentiate between customers with and without commercial consent in the email templates, the product team asked our Smart Communication Team for a corresponding data enrichment feature. The feature turned out to require API changes in a service that is owned by the Communication Channels Team. Now the product team didn’t need to manage dependencies anymore since the Smart Communication Team innersourced the API changes.

Single source of information

Each team was using their components’ GitHub repositories as a place for long term plans. The 400 issues distributed in the 10 component backlogs were used as a baseline for planning.

Sitting down with the leadership team in an overall retrospective we analyzed the situation to understand the impact of such a structure using causal loop diagramming.

The leadership team explained to the Inbox Teams the impact of the current backlog structure. The unit cleaned up the different repositories, paid back a part of the technical debt, and moved the rest of the technical debt to a single product backlog. The team installed a zero bug policy. Any new bug generated during feature development is directly fixed without debates. Small issues (improvement ideas, to-dos) not treated in the next two iterations are systematically deleted.

GitHub backlogs are now used as a backlog for the next interval of work (sprint backlog) only. One single product backlog is used to store product and non-functional requirements. The number of tickets decreased to 100. This simplification of the structure allows better usage, adoption and efficiency. All technical tickets are linked to the corresponding product ticket, which increases transparency both for the product and engineering side.

Shared meeting structure

The causal loop diagram triggered another change. The Inbox Team had four planning sessions per cycle, with key people filling up their day with these meetings and carrying dependencies from one session to another: in short, being bottlenecks. Two of the three teams were not refining requirements at all before trying to pull them into a cycle, making planning sessions long and ineffective. Team members were participating in these planning sessions without engagement, and the NPS of the planning sessions was -70. Largely due to this ineffective setup, average delivery time was usually 50% longer than planned.

All Inbox Teams now run a common pre-planning (i.e. refinement) session and a common planning session where all teams refine and plan at the same place, at the same time, for the next cycle of work (two weeks). Each session is a 60 to 90 minute time box, and key people move from one team to another. Teams align alone, consulting others when needed for technical support and dependency management. The team members are satisfied with the setup, the NPS of the planning session is up to 30, and plans are accurate. To quote one product manager: “When it’s Friday, the sprint is over and the work is done.”

Conclusion - Handling complexity through simplification

Movements like Cynefin, management 3.0 and LeSS draw the same conclusions. To grow healthy and tackle the complexity of product development at the scale of a big company, you need to un-scale and simplify your structures. From an un-growable complicated structure of 10 backlogs, 400 issues, multiple planning sessions, and component focused teams, the Customer Inbox area moved to a resilient and scalable structure with a single source of information, shared planning sessions, and customer centric feature teams. Some thoughts from the team:

Omar Elasfar (Producer) “Silos around specific components were torn down, unleashing our teams’ potential to be able to focus on our customers and incrementally deliver on solutions that solve their problems.”

Sina Golesorkhi (Engineer) “We have now a team-centric mindset and people value the incremental development and collaboration more than before.”

Petra Graß (Product Lead) “Thanks to the workshop, we now think rather about the products and the impact they have than about the team structure and single features.”

These changes occurred in a period of six months driven by the area leadership team and supported by the agile coaches. It became real thanks to the people working in the Inbox Team.

A Team for Teams

2018-10-10T00:00:00+02:00

How we revolutionized the way we worked agile

One and a half years ago we started something new at Zalando. We asked all producers of our department to join one team with the purpose of helping us create great teams to get things done in the best way possible.

Where did we start from?

The producer role had been introduced at Zalando to provide a team with whatever it lacked at a certain moment in time, be it a roadmap, team building, process improvement, documentation or even testing. The role was an extremely clever idea to get through ongoing organizational changes as it made sure that in times of restructuring the most crucial needs of a team were met and the flow could continue. But as useful as it was to get Zalando through the change, it caused very diverse perspectives on the responsibilities and capabilities of one and the same role. For example, some producers would assume the responsibility for a team’s roadmap whereas others saw that responsibility with the product managers. These diverging perspectives resulted in diverging expectations, and eventually disappointment and frustration between producers, leads and product managers.

On top of that, producers were designed to be “part of the team,” helping the team to organize from within whatever it took to do that.

This brought about a few questions and problems:

How does a producer empower a team to grow autonomous, if he/she is a part of that team and continuously takes over operational tasks?
A team change for a producer would mean a leadership change in many cases thus the long term development was often interrupted.
How does he/she know what career level he/she is eligible for if there is no clear set of skills and the variety of the execution differs so much?

In Retail Operations, teams receive different input since every producer has a different focus. Producers learn to effectively enable multiple teams in a shorter time. In this setup we might be able to help more teams with fewer producers. Producers would no longer be a part of the team but would be partners to the team lead, working collaboratively, and could thus support on a wider set of topics, for example product delivery collaboration or facilitating the alignment workshop for a high level architecture.

So what did we do?

All of this we wanted to tackle with our newly formed team of agile coaches, and so we had to set up quite a lot of things to lift the role to the next level.

Coaches would sit with the teams they worked with and for, but be directly answerable to a producer lead; they would learn from and support each other. For example, if an agile team coach is stuck with a challenge, he/she gets support through a team “intervision.”
We established a team with its own leader to take care of people development, create common practices and standards, roughly organize the assignments, and be a backup expert for any questions the coaches might have.
We clarified the role and expectations on each job grade.
We looked into our processes to find out what setup this team needed to effectively integrate with the rest of the department organisation. For example, we started to align on the goal with the respective team leader before a coach started to work with a team.
We invested in the development of the skills necessary to become excellent.

Role Clarification

The goal was to find the most crucial gap in our organisation, and taking our skills into account, narrow our contribution down to the most impactful place. After interviews with engineering leads, and others, I found that the teams needed professional support in the adoption of agile processes and excellent collaboration. Eventually we stopped working as generalist producers and started to work as specialised agile team coaches. We inserted the term “team” as there is a central team of agile coaches at Zalando, who develop the agile culture on a company level in coaching upper management and whole departments, as well as offering standardized trainings. In contrast, we coach teams on their processes and collaboration for a long period of time and facilitate cross-functional workshops in one department.

Process

However, to get to a place where this role could fully unfold its impact and fit in with the rest of the roles, we needed to take a few more steps. First of all we had to define a setup with our most important stakeholders: the engineering leads at retail operations. While the overlap of responsibilities in producers and leads would sometimes cause friction or responsibility diffusion, the new role should be complementary and supportive. Through a couple of iterations we came up with a sponsorship model that starts with an engineering lead or product specialist requesting the help of an agile team coach. The agile team coach then observes the team for a while and writes down what he/she understood the problem to be, the root causes he/she identified, and the success measures and milestones for the coaching. To be a little more specific here: Usually the leads ask us to “make the team faster.” However, each team loses speed for different reasons. So the coach analyses the situation and comes back with insights on the “slowness.” This could be that the collaboration with the product manager is difficult or that the team has not learned to speak openly about issues. It could be that the team does not know how to turn a big problem into small, manageable chunks of work. Each of these root causes needs to be addressed with a different coaching approach. This is aligned with the sponsor, the agile lead and, if possible at that point in time, with the team.

The sponsor and the coach from then on have regular check-ins to talk about the status quo, next steps and distribution of tasks amongst each other. In case of severe dissatisfaction on any side, the issue can be escalated to the agile lead, who will mediate and try to reunite the sponsors and the coach. Our internal processes have evolved into a one hour bi-weekly operational meeting, where we talk about our own organization and discuss management updates.

The second meeting is our one hour deep dive. Here we raise all topics that cannot be discussed in a short amount of time but need some reflection or longer explanation. The last regular meeting we have is a half hour board meeting to keep us updated on each other’s sponsorships.

Development

The last piece of the puzzle to a successful agile coaching team is the trainings we invested in. The combination of a deep understanding of agile frameworks, team dynamics, innovation and moderation was not a standard for the producers at Zalando and is only in rare cases to be found in SCRUM masters. So we identified three places for the development of an agile team coach and took trainings accordingly.

The first area was team dynamics and team building. A three day training with an experienced academy in Berlin helped us to learn basic concepts and the necessary attitude for facilitating team building, and just as important, it helped us to get to know our limits. In the same quarter we visited an onsite change management training. For both trainings, we made sure to check in on our learnings and experiences in our deep dive sessions. The second area to learn about was delivery processes. As everyone on the team had a good understanding of SCRUM we took an intense training on Kanban, which also helped us to better reflect the success of the process improvements back to our teams. The last area is innovation and moderation. Team members took part in onsite visualization trainings, moderation trainings and we are now learning about innovation formats to better be able to support cross-functional ideation and planning. We took time to reflect on our learnings in our weekly deep dive sessions, co-planned a lot of the support we offered to the teams and shared new tools and ideas regularly.

So what did we gain?

All of these measures were a big investment to our department and Zalando, and obviously it’s important to check whether they were worth it. However, what we got from this change is considerable:

No role conflicts between engineering leads and product managers and agile team coaches anymore but appreciative, respectful working relationships

Professional approach to change initiatives: instead of fixing one small problem after the other we develop change strategies and measure our success

A coaching team that effectively and efficiently helps each other

Happy coaches and very good candidates in our last recruiting process

We gained trust beyond the tech teams and are currently involved in four non tech teams creating transparency, team spirit and self organization for them

We extrapolated our skills beyond supporting the delivery process and instead we now contribute great facilitation in the discovery, definition and design phases of the product development process

We have established a more mature agile culture in our department as the agile coaching team has established an alignment on some best practices such as clear planning process, estimations and expectation workshops

On the basis of this, we also started to work with elements of Scaled agile frameworks, such as a board for overall team coordination

Four Pillars Of Leading People

2018-10-04T00:00:00+02:00

Essential building blocks for strong leadership that enables people to grow and achieve results

The story of how I ended up working for Zalando in Berlin starts with a LinkedIn message from Joseph Wilkinson, one of our tech recruiters. In tech, we get a lot of messages on LinkedIn, but this one was different and made me very interested to know more about Zalando. I already knew something of the company because I was also working in a fashion e-commerce platform, but I was not aware of how big and challenging Zalando was. Going from that first contact to starting at Zalando was an easy decision. For me, it was the next big step to grow as a lead and to help the company grow even more.

A little over seven months ago, I had the opportunity to help open and grow our third international tech location in Lisbon, Portugal. The first six months were demanding, rewarding and impressive. Zalando is a remarkable and very well-known company in Europe, but in Portugal our brand is not well established yet. Since we still don't have our Fashion Store in Portugal, it's funny that some people and companies think we are still a small startup. Once they hear the numbers, they’re impressed. The first seven months have been a mixture of being able to deliver products, build the right environment for proper product development, as well as hiring the most talented and high-potential engineers, product managers and designers. We clearly had a good start, since we are a small team but managed to already have an impact in the company.

As a lead in a new tech hub, you have a clear influence on shaping the hub and its future. You need to lead by example and be a beacon of responsibility and accountability that every lead should assume regardless of their company or circumstances. Being one of the first leads at the tech hub is a great opportunity, as it allows me to create and shape a culture of high performance teams. The last years have also taught me that as a lead, I always have to think about the best interest of the company, business and the product. This can only be done with the right people and with the right mindset, focused on achieving greatness together. To achieve that it requires a solid foundation from a lead. And so I asked myself:

WHAT MAKES A GOOD LEAD?

What I have learned from leading teams for almost five years is that there are four pillars that are essential for strong leadership to enable people to grow and achieve results: Empathy, Inspiration, Trust and Honesty.

Empathy For me, this is the number one skill every lead should have. Empathy helps with communicating, solving problems and ensuring that everyone shares the same understanding. If you can't understand people you won't be able to communicate with them effectively.

Inspiration As leads, we don't tell people what to do. We inspire them, give them access to all the information and together we define the goals and how the journey will be. Every meeting, every talk you have with your team; it's an opportunity to motivate people and to encourage that extra strength for the extra mile.

Trust You need to trust people and they will trust you. Being trustworthy for everyone, for all the ideas to become real, to ensure the right level of autonomy happens in the teams, is the foundation of a high performance team.

Honesty And you can't achieve the above three qualities without being honest. Honesty to yourself and to your team, in every moment, the good ones and the complicated ones: when you need to have a hard talk, but also when you have to recognize team and individual efforts. Honesty to yourself matters because leads are not the owners of all knowledge, and asking for help and support is also a part of being a great lead.

High performing teams are like family. You trust them no matter what and you know they are there for you. Building this culture takes time: it requires hard work, and maintaining it is even harder. I envision a leadership that brings clarity in uncertain moments. Each of these pillars is extremely hard to achieve, but keep them in mind with every decision, talk or restructure, and remember the company’s values and culture to make the right decision. The ultimate goal is a leadership able to inspire and share knowledge, so that we all know the why and run towards the same goal.

There are many different leadership styles and strategies, and I believe the ability to adapt to the environment you are working in is crucial: Get to know the people you work with, understand their motivations and how you can empower and sponsor them to be better. A good lead needs to be emotionally adaptable to the environment they are working in, and be the guiding light, be committed to the team and to the company. Leading by example defines how much people will trust you or not. A good leader knows the rules but a great leader knows when to break them.

Leading high performing tech teams at Zalando

Zalando is a big company, so the challenges of leadership are also big. Every decision you make and every piece of software your team develops has an influence on millions of people. It is an extremely diverse environment and a place where the broad knowledge of engineering and product is continuously challenged. Your strategies need to be much more global across departments and business units, so you need to communicate effectively. Zalando has been in my life for almost two years and I've had the chance to get to know so many amazing teams and people; people I don't just call colleagues but friends, and that is something that takes time and a trustworthy environment.

I have been in the Lisbon Hub for seven months now and we are building strong and high performance teams responsible for the development and delivery of core Zalando products. The products we are currently developing will have a huge impact on the company’s strategy and help to achieve the demanding goals we have for the future. The part I’m most passionate about are the people in this team; working with them, facing challenges together, growing stronger every day, every week. We are a small team that is capable of great things and that gets me out of bed every morning.

Leading people to help them achieve their goals

In the end, the job of a lead is actually very easy. You just need to talk to people, understand their motivations and their strengths, and then define proper strategies for how those skills can benefit the different goals and objectives of teams and the business. As a lead you need to be prepared to fail and to act on failure, not just give up. You also need to make sure communication is transparent for all team members, so that everyone knows the “why” and what result and impact are expected. A lead has the responsibility and accountability to help the company meet its goals. It's a day to day job that needs to be consistent.

The small victories of being able to influence, to mentor and to see how the people you lead achieve their goals are very rewarding. The most recent meaningful experience I had was talking to people I mentored and led in the past, and learning they are now also leading people and have grown so much. Knowing that you were an example and had a direct impact on their growth and on their lives makes it all worth it. There is no bigger satisfaction and sense of achievement than when you help people to achieve their goals and dreams. It's the best feeling you can have as a person. Just imagine all the times you felt you achieved something great and impressive: you never did it alone, and you don't want to celebrate on your own either. Helping your family, your friends, a colleague, a peer or team member, even a stranger, to conquer their dreams allows you to leave a mark. If you think about what you do at your job and how much it impacts or helps others, it will help you make better and smarter decisions; not only as a lead, but as a person.

The Journey to Connecting Retail

2018-09-27T00:00:00+02:00

Digitizing brick & mortar fashion stores through Connected Retail

Everything started back in 2015 when Zalando was already successful as an online fashion retailer in Europe. However, a B2B problem was identified that needed to be tackled: brick-and-mortar fashion stores need a way to increase their sales. Zalando saw the need to connect offline with online in order to help merchants solve this problem, and that is when I joined as a Product Manager, in early 2016, at the newly established Helsinki Tech Hub.

I started working on a topic back then called “Offline” and my first task was to do market research on the problem mentioned above. I learned that the situation was not ideal, but that if we could connect our online and offline sales channels through a technical solution, there were three main areas where we could have a huge impact. For Zalando this could mean more inventory and connecting local offerings online. For brick and mortar stores it could mean additional sales and huge potential for opening up new in-store use cases. And most importantly, for our Zalando customers this could mean reduced delivery times and new experiences.

Based on my findings, I started to build a team to work on a pilot project. Our team, which envisioned being a “fashion connector”, soon became known as team “Silta” (meaning “bridge” in Finnish), which was quite fitting as that is exactly what we were aiming to do: bridge offline and online fashion. We wanted to digitize brick and mortar fashion stores to help stores sell online, as well as reduce Zalando delivery times and improve the in-store customer experience.

In order to validate the hypothesis, we created a pilot with adidas, delivering a parcel from a Berlin brick and mortar store within 25 minutes of ordering from the online Zalando fashion store. The pilot, which was launched in June 2016, was a great success and it gained a lot of recognition in the news and the e-commerce industry.

From a technical point of view, the pilot was not a scalable solution, but it validated our hypothesis, and after this, the real work started to lay the foundation for a possible solution. After the pilot in the summer of 2016, our team started to grow (we were already a team of four) and we started working in the product discovery phase. From the stakeholder point of view we needed to deal with the cross-location complexity of having teams in Berlin and Helsinki. This is where the position of product manager played a key role in the team by ensuring transparency and clear information flow. During the last two and a half years, I have visited Berlin for business trips about 80 times, and I remember times when I needed to travel to Berlin every week during several months in order to meet face to face with my stakeholders, and to keep close to the users in order to keep the project moving forward swiftly.

By 2017, based on the pilot and the groundwork our team had done, Zalando decided to build a dedicated product to tackle offline merchants’ problems. This was great news for Team Silta, and for me personally, having laid the foundation and been along for the journey from the start. We decided to have a unique name that would be easy to identify both inside and outside Zalando, which would simply describe what we were trying to do, so our product became known as Connected Retail.

Now, in September 2018, we have just soft-launched a pilot of our new Connected Retail product with a Seidensticker store in Berlin that will help Zalando scale across Europe, connecting thousands of offline stores and delighting millions of customers. Our MVP (Minimum Viable Product) is a custom built Connected Retail system, and includes a ship-from-store feature. This launch is a very important milestone for both the teams located in Helsinki and Berlin, who have worked on this topic since 2016 across locations, and it will also play a big role in changing the way our brick and mortar merchants approach their customers.

However, there is still much to learn and one of the biggest challenges we face is “stock accuracy,” which is a multidimensional problem: we need to identify what is being sold at the offline store and what is being sold in other channels. Another complex problem that Connected Retail faces is how to digitize merchants for whom technology is still largely an unknown and who don’t have their stores enabled for it. What I have learned is that merchants know best how they work, and if we can build a product that will solve problems that merchants have, then they will naturally use the product. Although Connected Retail still has many challenges to overcome, what drives us is the vision of a future where every Zalando customer can purchase any article located in any physical store, making every store a small frictionless warehouse.

From a product management point of view, it has been an amazing journey that has covered the entire product life cycle: starting from market research, product discovery and competitive analysis, moving into phases such as prototyping and user testing, and on towards an MVP definition and launch. I am proud to see how far we’ve come and excited to see how far Connected Retail will go in helping to digitize brick and mortar fashion stores.

Shop the Look with Deep Learning

2018-09-12T00:00:00+02:00

Retrieving fashion products based on a query image

Have you ever seen a picture on Instagram and thought, “Oh, wow! I want these shoes”, or been inspired by your favourite fashion blogger and looked for similar products (for example, on Zalando)? Visual search for fashion, the task of identifying fashion articles in an image and finding them in an online store, has been the subject of an ever growing body of scientific literature over the last few years (see for example [1-11]).

At Zalando, we have many outlets where this search is possible: our app, our Facebook chatbot, etc. We want to provide our customers with the best shopping experience possible, and words are not always enough to describe fashion.

Visual search poses some interesting challenges: how to deal with variations in image quality, lighting, background, different human poses and article distortion, or finding the right product in a large database in real-time.

Our working scenario so far has been to build on our home-grown FashionDNA to retrieve blazers, dresses, jumpers, shirts, skirts, trousers, t-shirts and tops in fashion images, with or without backgrounds.

Our Data Source

As a fashion company, Zalando creates outfits every day and therefore generates many fashion images annotated with their corresponding products. This means that we can use state-of-the-art learning techniques such as deep nets, which have revolutionized computer vision. As can be seen in Figure 1, these images include full body poses, half-body close-ups as well as detailed close-ups on a garment of interest. Although model poses are usually standardized and do not really reflect the more natural poses found on Instagram, having these different kinds of shots allows us to handle different scales. These images also display occlusions (shirts occluded by jackets for example) and back views.

Figure 1: Examples of images in our dataset. Image types (a-d) are query images featuring models, image type (e) represents the articles we retrieve from.

Unfortunately, an overwhelming majority of our fashion images have standardised clean backgrounds as shown in Figure 1, which means we have to think of a workaround to learn how to handle images with natural backgrounds.

Studio2Shop: matching model

We have designed a ConvNet model that takes a fashion image with Zalando clean backgrounds and an assortment of interest as input and returns a ranking of the products in the assortment for the eight categories mentioned above.

The products in the assortment are not represented by images, as is common in the literature, but by their FashionDNA. In other words, only a feature representation of the article is needed.
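To make the idea of ranking an assortment from feature vectors concrete, here is a minimal sketch. It is not the Studio2Shop model: the real matching is learned by a ConvNet, whereas this sketch assumes a query image has already been turned into a vector and simply ranks articles by cosine similarity. The function names, the article ids and the three-dimensional vectors are all hypothetical stand-ins for FashionDNA features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_assortment(query_vec, assortment, top_k=50):
    """Return article ids ordered by decreasing similarity to the query.

    `assortment` maps an article id to its feature vector. Cosine similarity
    is an assumption made for this sketch, not the learned matching function.
    """
    scored = [(cosine(query_vec, feats), article_id)
              for article_id, feats in assortment.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored[:top_k]]


# Hypothetical toy assortment: ids and vectors are made up for illustration.
assortment = {
    "dress-a": [0.9, 0.1, 0.0],
    "shirt-b": [0.1, 0.9, 0.1],
    "skirt-c": [0.7, 0.2, 0.3],
}
query = [1.0, 0.0, 0.1]
ranking = rank_assortment(query, assortment, top_k=2)
print(ranking)
```

Because only feature vectors are needed on the product side, the assortment can be swapped without touching the query side, which is what makes retrieval from other assortments possible.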

Figure 2 below illustrates the setting and the results we can get. On the left is the image of a person wearing an outfit, on the right side are the 50 top-ranking products in the assortment. The articles that are actually present in the outfit are marked in green.

Figure 2: Random examples of the retrieval test using 20,000 queries against 50,000 Zalando articles. Query images are in the left-most column. Each query image is next to two rows displaying the top 50 retrieved articles, from left to right, top to bottom. Green boxes show exact hits.

To show its generalization capabilities, we have tested our model on part of an independent dataset published in [7], without fine-tuning it. Results are shown in Figure 3 below. Unfortunately, the dataset was modified to fit our setting, so our performance is not comparable with the one reported in [7].

Figure 3: Random examples of outcomes of the retrieval test on query images from DeepFashion In-Shop-Retrieval [7]. Query images are in the left-most column. Each query image is next to two rows displaying the top 50 retrieved articles, from left to right, top to bottom. Green boxes show exact hits.

Note that this exercise is a little academic: we focus on finding the exact products because it allows us to assess models quantitatively. In practice, retrieving exact matches is not critical for two reasons: a) it is quite unlikely that the exact product is part of the assortment, b) usually the customer feels inspired and a similar item will feel just as rewarding to them, if not more, because it can have a rounder neckline for example.

Thanks to how this model is built, it is able to provide similar items as a by-product. Figures 2 and 3 show that the style of the 50 top-ranking garments fits the style of the outfit, and that these garments are quite similar to one another.

This means that we can also retrieve similar products from other assortments. Figure 4 below shows the 50 top-ranking garments from a Zalando assortment on query images from [7], without our model being fine-tuned for such images.

Figure 4: Random examples of outcomes of the retrieval test on query images from DeepFashion In-Shop-Retrieval [7] against 50,000 Zalando articles. Query images are in the left-most column. Each query image is next to two rows showing the top 50 retrieved articles, from left to right, top to bottom.

The details of this work can be found in [12].

Extension to images with backgrounds

Unfortunately, training a similar model for natural images would require large amounts of natural fashion images annotated with products, which we don’t have. However, we do have large amounts of unannotated fashion images, in particular those available from public datasets such as Chictopia (10k), but also our own in-house images. The advantage of public datasets is that the ground-truth segmentation is given, whereas we have to segment our images ourselves.

Using these images and their segmentation, we have designed and trained Street2Fashion, a U-net-like segmentation model that finds the person in the image and simply replaces the background with white pixels. The results shown in Figure 5 below are good enough to focus on the fashion in the image.
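The background replacement step can be illustrated in isolation. The sketch below assumes a segmentation model has already produced a binary mask (1 for the person, 0 for the background); it does not implement the U-net-like model itself. The function name and the nested-list image representation are hypothetical choices made to keep the example dependency-free.

```python
WHITE = (255, 255, 255)


def whiten_background(image, mask):
    """Replace every background pixel with white.

    `image` is a list of rows of (r, g, b) tuples and `mask` is a list of rows
    of 0/1 values with the same shape, where 1 marks a person pixel. Producing
    the mask is the job of the segmentation model and is assumed here.
    """
    if len(image) != len(mask) or any(
        len(row) != len(mask_row) for row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if keep else WHITE for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Hypothetical 2x2 image and mask for illustration.
image = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (100, 110, 120)],
]
mask = [
    [1, 0],
    [0, 1],
]
segmented = whiten_background(image, mask)
print(segmented)
```

The output then resembles a studio shot with a clean background, which is the kind of input the matching model was designed for.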

Figure 5: Examples of segmentation results on test images.

We use Street2Fashion as a preprocessing step, and build Fashion2Shop, a model with the same architecture as Studio2Shop but trained on segmented images. We refer to the full pipeline described in Figure 6 as Street2Fashion2Shop. In practice, a query fashion image is processed by the segmentation model to remove the background, and can then go through the matching model described above to be matched with appropriate products.

Figure 6: Street2Fashion2Shop. The query image (top row) is segmented by Street2Fashion, while FashionDNA is run on the title images of the products in the assortment (bottom row) to obtain static feature vectors. The results of these two operations form the input of Fashion2Shop, which handles the product matching.

Figure 7 shows results obtained using Street2Fashion2Shop.

(a) Random examples of Zalando products retrieval using query images from LookBook [13].

(b) Random examples of Zalando products retrieval using query images from street shots.

Figure 7: Qualitative results on external datasets. For each example, the query image is displayed on the far left, followed by the segmented image and by the top 50 product suggestions. Better viewed with a zoom.

The details of this work will shortly be available in [14].

[1] X. Wang and T. Zhang. Clothes search in consumer photos via color matching and attribute learning. Multimedia Conference (MM), 2011.

[2] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[3] J. Fu, J. Wang, Z. Li, M. Xu and H Lu. Efficient clothing retrieval with semantic-preserving visual phrases. Asian Conference on Computer Vision (ACCV), 2012.

[4] Y. Kalantidis, L. Kennedy and L.J. Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. International Conference on Multimedia Retrieval (ICMR), 2013.

[5] K. Yamaguchi, M.H. Kiapour and T.L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. International Conference on Computer Vision (ICCV), 2013.

[6] J. Huang, R.S. Feris, Q. Chen and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. International Conference on Computer Vision (ICCV), 2015.

[7] Z. Liu, P. Luo, S. Qiu, X. Wang and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. Computer Vision and Pattern Recognition (CVPR), 2016.

[8] E. Simo-Serra and H. Ishikawa. Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction. Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[9] X. Wang, Z. Sun, W. Zhang, Y. Zhou and Y.G. Jiang. Matching user photos to online products with robust deep features. International Conference on Multimedia Retrieval (ICMR), 2016.

[10] D. Shankar, S. Narumanchi, H.A. Ananya, P. Kompalli and K. Chaudhury. Deep learning based large scale visual recommendation and search for e-commerce. CoRR, 2017.

[11] X. Ji, W. Wang, M. Zhang and Y. Yang. Cross-domain image retrieval with attention modeling. Multimedia Conference (MM), 2017.

[12] J. Lasserre, K. Rasch and R. Vollgraf. Studio2Shop: from studio photo shoots to fashion articles. International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2018.

[13] D. Yoo, N. Kim, S. Park, A.S. Paek and I. Kweon. Pixel-level domain transfer. European Conference on Computer Vision (ECCV), 2016.

[14] J. Lasserre, C. Bracher and R. Vollgraf. To appear in Lecture Notes in Computer Science, 2018.

-- Our visual search engines are currently powered by the company Fashwell; this work is at the research stage. --

Visual Creation and Exploration at Zalando Research

2018-09-06T00:00:00+02:00

Adversarial texture distribution learning as a tool of artistic expression

Deep learning is progressing fast these days. Besides advances that were expected to happen sooner or later (e.g. accurate face and speech recognition), there are some new developments that would have seemed like a pipe dream years ago: neural networks can now generate realistic images just by looking at a few examples of their properties.

Zalando Research is currently exploring such methods and their potential to aid Zalando’s content creation, private fashion labels, and sizing recommendation teams, and to offer our customers a new fashion experience. In addition, working with large image collections and generative machine learning models has great synergy with cutting edge neural network art. The tools created for fashion research purposes are also useful as tools for visual artistic creation and exploration.

Generative texture models

Our research in texture generation is a good example of this. Earlier this year, we developed new deep learning generative models to learn textures from just a few sample images, and textures are key ingredients in multiple artistic techniques. Having a tool like a Periodic Spatial Generative Adversarial Network, or PSGAN, to learn texture distributions gives great flexibility in choosing applications for it. Figure 1 shows what textures we can learn and sample using only a single image (Figure 2) as training material.

Figure 1. Ocean textures generated from our model, a PSGAN trained using the single image from Figure 2.

Figure 2. An example ocean texture, which is used as training material for PSGAN

Textures and mosaics

Mosaics are a classical artform starting from the times of ancient Romans and going to modern texture transfer techniques. The artist Max Ernst captured textures by physically copying and using them for his paintings, a technique called Frottage. In present times, selecting a texture as a type of stylization and applying it to a large image with global composition is a very popular case of ML art, as signified by the success of Neural Art Style Transfer, but also as seen in multiple advertising campaigns such as Figure 3.

Figure 3. Example outdoor advertisement billboard using mosaic techniques for stylization, seen recently on the streets of Berlin.

In follow-up work we apply generative texture synthesis to create high-resolution mosaics from input content images. It was demonstrated at the NIPS 2017 Workshop for Machine Learning and Art, a great venue to explore collaborations between machine learning researchers and artists. Figure 4a shows how the texture process learned in the previous paragraph can be conditioned and used to stylize a human face from Figure 4b.

Figure 4a. A mosaic stylisation of the image from Figure 4b.

Figure 4b. A fashion model from Zalando's catalogue.

Texture control: from noise to music

Static image stylization works by conditioning the latent factors of textures on a target content image, which is a 2D array. Textures flow into one another in space, and the artist can just as flexibly condition our tool on a signal that moves in time, which leads to smooth transitions and animations between textures. Music visualization is one such application of our technique. Figure 5 shows our music video accepted at the ECCV 2018 Computer Vision for Fashion, Art and Design Workshop.

Figure 5. Water textures morphing, controlled by a music audio signal.

With an appropriate audio descriptor we can map the distribution of audio samples to the distribution of textures. And since music is a smoothly varying signal, we can create an animation frame by frame of a texture process controlled by music. Selecting input images with a suitable theme (water) to train the PSGAN allows us to emphasize the artist’s vision and represents a novel form of digital synesthesia. This tool also opens a totally new pathway for collaboration between musicians and generative model visualizations. It is also a conceptually new approach to art, since we replace the source of randomness with the variation of a music piece.
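As a rough illustration of driving a texture model with music, the sketch below turns an audio signal into one latent vector per animation frame. Everything specific here is an assumption: the post does not say which audio descriptor is used, so the sketch picks per-frame RMS energy, and it maps that energy to a latent code by linear interpolation between two hypothetical codes, `z_quiet` and `z_loud`. The PSGAN generator that would turn each latent code into a texture image is not included.

```python
import math


def frame_energies(samples, frame_size):
    """RMS energy of each non-overlapping frame of the audio signal."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return energies


def normalise(values):
    """Rescale values to the range [0, 1]; a constant input maps to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def latent_path(energies, z_quiet, z_loud):
    """One latent vector per frame, blended between two latent codes.

    A smoothly varying descriptor gives a smoothly varying latent path, which
    is what produces smooth texture transitions in the animation.
    """
    weights = normalise(energies)
    return [
        [(1 - w) * q + w * l for q, l in zip(z_quiet, z_loud)]
        for w in weights
    ]


# Hypothetical toy signal: silence, then a medium and a loud passage.
samples = [0.0, 0.0, 0.5, -0.5, 1.0, -1.0]
energies = frame_energies(samples, frame_size=2)
path = latent_path(energies, z_quiet=[0.0, 0.0], z_loud=[1.0, 2.0])
print(energies, path)
```

In this setup the music takes the place of the random noise that would normally be sampled for the latent code, which matches the idea of replacing the source of randomness with the variation of a music piece.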

We will be showcasing the video at the workshop mixer on September 12th, at the Container Collective in Munich.

Follow Nikolay Jetchev’s twitter account for the latest developments and experiments with art and generative deep learning.

Zalando Strengthens its InnerSource Strategy

2018-09-05T00:00:00+02:00

Zalando is known for its commitment to the open source world. Many of our engineers are active contributors to open source projects like PostgreSQL or Kubernetes. The Zalando tech department currently consists of more than 2,000 employees in over 200 delivery teams and virtual teams. Zalando engineers come from 77 nations and are based in various locations across Europe, which makes us super international but also quite distributed. Collaboration and alignment across delivery teams is challenging as the company continues to grow at an incredible speed. Enhancing InnerSource is an approach that could help Zalando to tackle those internal challenges.

What is InnerSource

InnerSource is an adaptation of open source software development practices within organizations. This means applying the collaborative culture and open source methodologies to internal projects, even if the projects are proprietary. At its essence, InnerSource applies not only to software development but also extends to other business areas such as Finance, Marketing, HR etc.

The Benefits of InnerSource are:

To improve developer productivity by increasing cross-team alignment and collaboration.
To improve developer mobility by enabling our software engineers to contribute to the efforts of other teams and to get familiar with software projects and tools used by other teams.
To increase development speed by removing team blockers and pushing discoverability of existing software products and components.
To decrease onboarding time and improve knowledge handover by providing well documented and discoverable internal projects.

InnerSource at Zalando focuses on:

Fostering the ‘open source’ culture from within, encouraging individual teams to open up their work and accept feedback and contributions from developers outside of their team.
Promoting pull requests as an initial tool for cross-team collaboration.
Creating a platform where teams have a chance to talk about their work and learn from each other.
Introducing InnerSource pilot projects around Machine Learning starting at Digital Foundation.
Developing collaborative documentation of team best practices and examples.

More about InnerSource at Zalando

Check out How to InnerSource to learn how our development teams prepare for their InnerSource participation.

Three Years of our Helsinki Tech Hub

2018-08-30T00:00:00+02:00

Getting to know our Finnish tech hub as it turns three

In early 2015, Zalando decided to expand its tech expertise and open tech hubs around Europe. First up was Dublin in April, and not far behind, the Helsinki Tech Hub was launched in August 2015. The Helsinki hub has had an exciting journey so far; from scaling to over 60 employees and designing a custom office to fit our community in our first year, to continuing to grow to over 100 employees with over 30 nationalities by our second anniversary. Fast forward to 2018, we look back on how we grew and what made us the unique Zalando Helsinki (or #Zelsinki) community we are today. We spoke with one of the most integral members of our Helsinki Tech Hub, our #Zelsinki Community Manager Elina Zimpfer, who has been with Zalando Helsinki since August 2016.

When you joined what was the Helsinki Tech Hub like?

The Zelsinki team was about half the size it is now when I joined. We were still at the old office, which was so full that newbies didn’t get their own desks when starting. The renovation and the move to the new office had been delayed by some weeks, so I started in the hurricane of moving offices and doing the final touches to the new location.

It is my everyday job to have an understanding of the local community and make sure that everyone in it is a part of the global Zalando ecosystem. It is really important that we as remote sites are represented at the heart of Zalando, and I’m proud to be a part of supporting that goal. The Helsinki Tech team has a very good record on giving talks at our internal knowledge sharing platform, and I see that as an important window for us to showcase our work to the rest of the company.

What kind of things did you do to build the community?

At a remote site with a smaller team, we have the chance to do a wide variety of activities and focus on smaller details. It soon became clear that our Zelsinki people are very competitive, so different tournaments are very popular. In Zelsinki, you can win a fantastic handmade trophy in almost anything: we have tournaments for pool, table tennis, mölkky, and Mario Kart, just to name a few. We also love to have fun together and celebrate our achievements, so “cupcakes & bubbly” occasions are not unusual. In addition to our Helsinki internal activities, participating in the global Zalando Tech Community projects and events is a good way to keep the “one tech team” spirit alive. As Zalando is a relatively new company in the Helsinki Tech scene, it’s also important to raise our profile and give back to the external community. I organize external meetups and other events, and encourage our people to participate and give talks.

What is the most important thing about building a local community?

Each community has its own special features, traditions and characteristics, so the things that work for one community might not work for others. It’s important to keep the customer in mind and personalize solutions to fit the community members. Getting people involved in the creation of common events is a great way to really grasp the needs of the community. We have many special activities in our Helsinki community, amongst them our annual Summer Adventure.

It’s the most wonderful time of the year for Finns. Nightless nights, Finnish sunkissed strawberries, Midsummer and of course for our Helsinki Tech Hub’s Zalandos, the “Zelsinki” Summer Adventure. This was our third Summer Adventure and we made it a good one!

This is not your typical summer party. At Zalando we want to personalize our experiences for our users, and we want our employees to have the same experience. Three years ago when we started out in Helsinki, we were a newly grown team of 50 people with 60% of our new colleagues from outside of Finland. Most of our colleagues were software engineers, so we knew they loved to solve problems and puzzles.

So we got to thinking about how we might show them around Helsinki, get the teams to know each other, and solve some problems along the way to our summer team event location. Escape rooms and geocaching were very popular at the time, so we decided to create an amazing race-type experience tailored for our Zelsinki team. And it was a hit!

That was the summer of 2016. Last summer, we decided to iterate on the game by involving our team more, and we formed an event committee. The goal again was to solve the puzzles in randomly selected teams to get to our secret end location. We incorporated a theme, and our Zelsinki Survivors had to face the jungle terrors, all while getting to know some famous Helsinki locations such as Hietaniemi beach and the Jean Sibelius monument.

For our third iteration this summer, the Zelsinki Adventure became somewhat legendary and we had even more interest from our own community to create something great for their peers. Our Team Assistant Essi Marttila and I were at the helm, enabling and empowering our people to get involved, and we got together a great organizing committee with diverse skills. Jari Kalinainen created an iOS app, Antti Pennanen composed music, and Essi and I came up with puzzles, activities and an exciting Wild West themed storyline. We had more props than ever, and it might have been possible to see a group of software engineers ride a hobby horse in a park in Helsinki city centre.

This is definitely a tradition that will last in our Helsinki Tech and we can’t wait to see how we develop the app in years to come, and have new colleagues join in to create an amazing experience for their peers.

What is the best thing about your job?

Definitely the people! We have such a great team! I also love the ever-changing nature of my job: every day is different and a new challenge.

We have a great team here and we work on some of the cornerstones of the fashion store: personalization, browsing, new emerging business, connected retail and logistics solutions. We’re looking forward to the next three years!

Zalando at the DatSci Awards 2018

2018-08-23T00:00:00+02:00

Building data science products in multidisciplinary teams

For the last three years, I have been working on different data science projects at Zalando, helping our more than 24 million customers find the most relevant items in the assortment we have. Along the way, I have learned how to scale data science, or how to build a new personalization product from scratch. Thanks to my experiences, I am a firm believer in having dedicated and autonomous multi-functional teams to solve complex problems, especially when they involve learning.

As data scientists, we are used to looking at problems from a data perspective, which has helped the teams I have worked with gain huge amounts of domain knowledge. We strive also to make data-driven decisions, where running A/B tests or doing online and offline evaluations of the models we build are some of the most important tools we have. However, what does it look like to work in a multifunctional team?

In a Zalando team, we usually have one or two data scientists, one or two engineers, a product manager, a designer, and sometimes a business developer. The details change from team to team, but you get the picture. Not all of these people are dedicated to the team 100% of their time; a designer, for example, may work with two or three teams, depending on their areas of interest. The main advantage of working with this setup is that we are able to tackle uncertainties and risks from many more angles, and much faster than in a researchers-only team.

Something I have learned from working with designers is the many advantages of early testing and prototyping, and of their customer-centric approach to problem solving. Moreover, because they tend to work on different products from similar areas, knowledge transfer usually happens more naturally and faster, and it completely changes the way we work. When working closely with our copywriting team, we learn how to communicate our products in the right way for our customers, and working with engineers we learn how to build machine learning solutions that scale and that we are able to operate.

A very good example I have previously written about is the latest product I was in charge of building, where we were able to collaboratively design a prototype to solve our customer problem of, “How can we make recommended content more transparent and relevant to our customers?” We did this in four days, writing only a minimum amount of code. We built six personalized prototypes for user testing, by manually adding “recommended” content into a static version of the Zalando App. Instead of using an algorithm, we “faked” the algorithmic result by using human expert curators to choose which content would be shown to each customer.

By faking the personalization part, we were not only able to understand our customers’ expectations about our product, but we also saved months of development of an algorithmic solution that was not what the customer expected. In particular, the feedback we got from our customers was far more specific and natural than when using non-personalized prototypes. For example, instead of asking someone to “imagine you love leather jackets and we recommend you matching boots,” we knew beforehand that they had bought a leather jacket last week, and we created “recommendations” of the boots we thought would best match their style.

Working in this environment is also aligned with our autonomous teams. During the process, everyone involved gained customer understanding and domain knowledge from the problem we are trying to solve, something extremely valuable for data scientists. Moreover, iterating on this is way cheaper and faster than iterating on A/B test cycles, even when we have a really strong testing-as-a-service infrastructure.

This is only one example that shows how much I like working with people from different backgrounds and functions, which also proves how important diversity is for building great machine learning products, especially in a B2C market that operates on a European scale like Zalando does.

*Humberto Corona is a product specialist and data scientist in Zalando's Fashion Insights Center in Dublin. A regular contributor to the tech blog, Humberto is a finalist in this year's DatSci awards, where this piece was originally published. Ana Peleteiro Ramallo took the Data Scientist of the Year title in her role at Zalando in 2017.*

Battle of the Frameworks

2018-08-16T00:00:00+02:00

How to choose a JavaScript framework?

Developers are often biased about their technology choices. At the beginning of the year, I was about to start working on a new product and my team could choose any tech stack. I did not want to be one of these biased developers who chose the framework they liked. I wanted to make an informed and educated decision. I already had experience with React and AngularJS. I had a good knowledge of Angular and experience with TypeScript. But what about Vue, the framework that most JavaScript developers wanted to learn according to the State of JavaScript 2017 survey?

A friend of mine likes saying that JavaScript frameworks are like weeds: every day a new framework gets released. It does feel like this, doesn’t it? I was quite skeptical about Vue when it was released, and to be honest, I remained skeptical for a long time afterwards. Did we really need another JavaScript framework? I did not really think so. But I had some free time on my hands and decided to use it to learn Vue so that I could make an informed decision about which framework to choose.

History Lesson

AngularJS was started as a side project at Google around 2009. Later it was open-sourced and v1.0 was officially released in 2012.

React was developed at Facebook. It was open-sourced at JSConf US in May 2013.

Vue was created by Evan You after working for Google using AngularJS in a number of projects. He wanted to extract what he really liked about AngularJS and build something lightweight. Vue was originally released in February 2014.

Angular 2.0 was announced at the ng-Europe conference in September 2014. The drastic changes in the 2.0 version created considerable controversy among developers. The final version was released in September 2016.

Side note: AngularJS and Angular 2.0, which was later simply called Angular, are two different frameworks. The naming really caused a lot of confusion. I believe that the Angular team would have been better off choosing a different name.

In December 2016 Angular 4 was announced, skipping version 3 to avoid confusion due to the misalignment of the router package's version, which was already distributed as v3.3.0. The final version was released in March 2017.

Nowadays, developers are looking for smaller, faster, and simpler technologies. All three frameworks (Angular, React and Vue) are doing lots of work in this direction. You can expect pretty good performance from these frameworks. Performance benchmarks show similar performance.

In April 2017, Facebook announced React Fiber, a new core algorithm of React. It was released in September 2017.

Angular 5 was released in November 2017. Key improvements in Angular 5 included support for progressive web apps, and a build optimizer.

Google pledged to do twice-a-year upgrades. Angular 6 was released in April 2018. Angular 7 will be released September/October 2018.

At the beginning of 2018, a schedule was announced for phasing out AngularJS: after the release of 1.7.0, active development on AngularJS will continue until the end of June 2018. Afterwards, 1.7 will be in long-term support until June 2021.

The bottom line is that these three frameworks, React, Vue and Angular, are quite mature. And it is likely they’ll be around for a while.

Key Concepts

React uses the Virtual DOM pattern. React creates an in-memory data structure cache, computes the resulting differences, and then updates the browser's displayed DOM efficiently.
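To make the "compute the differences" step more concrete, here is a minimal sketch of a tree diff in TypeScript. It is not React's actual reconciliation algorithm: `VNode`, `Patch` and `diff` are hypothetical names made up for this illustration, and the sketch ignores keys, text nodes and component boundaries.

```typescript
// A minimal virtual DOM sketch. VNode, Patch and diff are hypothetical names
// chosen for illustration; they are not React's internal API.
type VNode = {
  tag: string;
  props: Record<string, string>;
  children: VNode[];
};

type Patch =
  | { kind: "replace"; path: number[]; node: VNode }
  | { kind: "setProp"; path: number[]; key: string; value: string }
  | { kind: "removeProp"; path: number[]; key: string }
  | { kind: "append"; path: number[]; node: VNode }
  | { kind: "removeChild"; path: number[]; index: number };

// Compare two in-memory trees and list the changes needed to turn prev into next.
function diff(prev: VNode, next: VNode, path: number[] = []): Patch[] {
  if (prev.tag !== next.tag) {
    // Different element type: replace the whole subtree.
    return [{ kind: "replace", path, node: next }];
  }
  const patches: Patch[] = [];
  Object.keys(next.props).forEach((key) => {
    const value = next.props[key];
    if (value !== undefined && prev.props[key] !== value) {
      patches.push({ kind: "setProp", path, key, value });
    }
  });
  Object.keys(prev.props).forEach((key) => {
    if (!(key in next.props)) {
      patches.push({ kind: "removeProp", path, key });
    }
  });
  next.children.forEach((child, index) => {
    const old = prev.children[index];
    if (old === undefined) {
      patches.push({ kind: "append", path, node: child });
    } else {
      diff(old, child, path.concat(index)).forEach((patch) => patches.push(patch));
    }
  });
  // Remove surplus children from the end so earlier indexes stay valid.
  for (let index = prev.children.length - 1; index >= next.children.length; index--) {
    patches.push({ kind: "removeChild", path, index });
  }
  return patches;
}

const oldTree: VNode = {
  tag: "ul",
  props: { class: "todo" },
  children: [{ tag: "li", props: {}, children: [] }],
};
const newTree: VNode = {
  tag: "ul",
  props: { class: "todo done" },
  children: [
    { tag: "li", props: {}, children: [] },
    { tag: "li", props: {}, children: [] },
  ],
};
// Only two small changes are reported instead of rebuilding the whole list.
const changes = diff(oldTree, newTree);
```

The point of the pattern is that only the resulting list of patches needs to touch the real DOM, which is typically the expensive part.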

React is all about components. Your React codebase is basically just one large pile of big components that call smaller components. Props are how components talk to each other; they are the data, which is passed to the child component from the parent component. It’s important to note that React’s data flow is unidirectional: data can only go from parent components to their children, not the other way around.

The component approach means that both HTML and JavaScript code live in the same file. React’s way to achieve this is the JSX language. It allows us to write HTML-like syntax which gets transformed into lightweight JavaScript objects.
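As a rough illustration of both ideas, the sketch below models components as plain functions and JSX output as plain objects. The helper `h` and the type `ElementDescription` are hypothetical stand-ins for what a JSX compiler and React produce, not React's real API; `TodoList` and `TodoItem` are invented components.

```typescript
// Hypothetical stand-in for the lightweight objects that JSX is compiled into.
// This is not React's real element type.
type ElementDescription = {
  type: string;
  props: Record<string, string>;
  children: Array<ElementDescription | string>;
};

// Hypothetical helper playing the role of the function a JSX compiler calls.
function h(
  type: string,
  props: Record<string, string>,
  ...children: Array<ElementDescription | string>
): ElementDescription {
  return { type, props, children };
}

type Todo = { title: string; done: boolean };

// Child component: receives its data through props and never reaches upwards.
// In JSX this body would read roughly: <li class="done">{props.title}</li>
function TodoItem(props: Todo): ElementDescription {
  return h("li", { class: props.done ? "done" : "open" }, props.title);
}

// Parent component: owns the list and passes each entry down as props.
function TodoList(props: { items: Todo[] }): ElementDescription {
  return h("ul", {}, ...props.items.map((item) => TodoItem(item)));
}

const tree = TodoList({
  items: [
    { title: "Buy milk", done: true },
    { title: "Write post", done: false },
  ],
});
```

Data only flows from `TodoList` into `TodoItem`; the child has no handle on the parent's list, which is the unidirectional flow described above.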

To build an Angular application you define a set of components for every UI element, screen, and route. An application will always have a root component that contains all other components. Components have well-defined inputs and outputs, and a lifecycle.

The idea behind dependency injection is that if you have a component that depends on a service, you do not create that service yourself. Instead, you request one in the constructor, and the framework will provide you one. This allows you to depend on interfaces, not concrete types. This results in more decoupled code and improves testability.
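The following is a minimal sketch of constructor injection in plain TypeScript, without Angular's decorators or injector. All names (`TranslationService`, `GreetingComponent` and the two implementations) are hypothetical and exist only for this example; in Angular the framework, rather than the calling code, would supply the service instance.

```typescript
// The component depends on this interface, not on a concrete class.
interface TranslationService {
  translate(key: string): string;
}

class GreetingComponent {
  private readonly translations: TranslationService;

  // The component requests the service instead of constructing it itself.
  constructor(translations: TranslationService) {
    this.translations = translations;
  }

  render(): string {
    return "<h1>" + this.translations.translate("greeting") + "</h1>";
  }
}

// A production-style implementation backed by a lookup table.
class StaticTranslationService implements TranslationService {
  private readonly table: Record<string, string> = { greeting: "Hello" };

  translate(key: string): string {
    const value = this.table[key];
    return value === undefined ? key : value;
  }
}

// A test double that satisfies the same interface.
class FakeTranslationService implements TranslationService {
  translate(key: string): string {
    return "[" + key + "]";
  }
}

// Whoever wires the application decides which implementation is provided.
const production = new GreetingComponent(new StaticTranslationService());
const underTest = new GreetingComponent(new FakeTranslationService());
```

Because `GreetingComponent` only knows the interface, swapping in the fake for a test requires no change to the component, which is where the testability benefit comes from.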

Property bindings make Angular applications interactive.

Vue also makes use of the Virtual DOM like React.

In Vue.js the state of the DOM is just a reflection of the data state. You connect the two together by creating "ViewModel" objects. When you change the data, the DOM updates automatically.
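The sketch below shows the general idea of "change the data, the view follows" using getter/setter pairs. It is a simplification under stated assumptions: `observe` is a hypothetical helper and not Vue's API, the "DOM" is just a log of rendered strings, and updates happen synchronously, whereas a real framework would track dependencies and batch updates.

```typescript
// Hypothetical helper, not Vue's API: wraps every property of `data` in a
// getter/setter pair so that each write triggers the render callback.
function observe<T extends Record<string, unknown>>(
  data: T,
  render: (model: T) => void
): T {
  const reactive = {} as T;
  Object.keys(data).forEach((key) => {
    let current = data[key];
    Object.defineProperty(reactive, key, {
      enumerable: true,
      get: () => current,
      set: (value) => {
        current = value;
        render(reactive); // the view is re-derived from the data on every change
      },
    });
  });
  return reactive;
}

// Stand-in for the DOM: every render appends the current markup to a log.
const renderLog: string[] = [];

const state = observe({ title: "Buy milk", count: 1 }, (model) => {
  renderLog.push(model.title + " (" + model.count + ")");
});

// No manual DOM call is needed: assigning to the data triggers the render.
state.title = "Buy oat milk";
```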

You create small, decoupled units so that they are easier to understand and maintain. In Vue the components are ViewModels with pre-defined behaviours. The UI is a tree of components.

In Vue the HTML, JS and CSS for each component live in the same file. Some people hate single-file components, some love them. I personally think that they are very handy and can make you more productive, as they reduce context switching.

Ecosystems

This table shows the libraries you may be familiar with in React, Angular or Vue alongside their equivalents in the other frameworks:

It is important to note here that Angular is somewhat more prescriptive. Some developers do not like this and prefer to have freedom choosing the tools they use. It is more of a personal preference.

Lessons Learned

I had the idea to build a small app with all three frameworks and compare them. And so I did. But it was completely unnecessary because of TodoMVC, a project which offers the same Todo application implemented using MV* concepts in most of the popular JavaScript frameworks of today.

TodoMVC is supposed to help you select an MV* framework. But the Todo app is way too simple.

If you are new to web development and you are learning a new framework, the TodoMVC is probably a good start.

But if you are experienced and would like to build real-world, more complex applications there are better alternatives.

Some better alternatives are RealWorld and HNPWA.

RealWorld allows you to choose any frontend (React, Angular, Vue and even more) and any backend (Node, Scala etc) and see how they power a real-world full-stack medium.com clone.

HNPWA is a collection of unofficial Hacker News clients built with a number of popular JavaScript frameworks. Each implementation is a complete Progressive Web App that utilises different progressive technologies.

Lesson #1: The Todo App is too simple

Use RealWorld or HNPWA to see what a real-world application would look like. Play with them, build on them and learn.

Lesson #2: Documentation is very important

Good documentation helps you to get started quickly. Vue really excels at documentation. This is one of the reasons why it is so easy to get started with Vue.

React and Angular also have good documentation, though in my opinion it is still not as good as Vue's.

The main problem with the Angular documentation is that often you will stumble upon documentation about AngularJS instead and it can be very confusing and frustrating. That is why I said earlier that the Angular team would have been better off if they had chosen a different name for Angular.

Lesson #3: Community is important

When documentation fails, you learn that community is also very important. You want to be sure that it will be easy to get help if you get stuck and cannot find information in the documentation. You want to choose a framework whose community is extensive and helpful: one where you can discuss best practices and get solid advice.

Ultimately, you need to answer the following question: Would it be easy to hire more developers who are experts or willing to work with and learn this framework?

Other questions worth asking when choosing a JavaScript framework

How high is the “Bus Factor”?

The “Bus Factor” is a number equal to the number of team members who, if run over by a bus, would adversely affect a project. To put it more simply: Can other people continue working on your projects if you are hit by a bus tomorrow?

Remember that talent is hard to hire. You need to know how easy it is to find developers for each of the frameworks. Also, what does the learning curve look like for each framework? Again, I think that Vue really excels here. It has the gentlest learning curve of the three.

What does the product roadmap look like?

Is it just a prototype? Choose whatever, learn something new.

Would it have a single function that would never change? Do you have to ship it quickly? Choose whatever you are most familiar with.

Is your product business critical? Probably it is a good decision to be more conservative in your choice.

Is the product going to evolve, have new features, etc.? It should be scalable in that case.

Wrap Up

There is a point in your programming career when you realise that there isn’t a best framework. All the frameworks solve the same problems but in different ways. Is it a good thing that there are so many alternatives? Yes. In my opinion, the competition between Angular, Vue, React, and the other frameworks out there is very healthy. It brings a lot of innovation and improvements in the entire JavaScript ecosystem. We all benefit from that no matter which framework we work with.

We are developers. We like fighting about all sorts of important things like tabs versus spaces, trailing commas, etc. Joking aside, it is somehow in our blood to fight about silly things. I feel that we should appreciate the improvements all these JavaScript frameworks bring. Because there isn’t a best framework.

Don’t ask what the best framework is, ask what the most suitable framework for your product and your team is.

The Future of Data Science

2018-08-09T00:00:00+02:00

Debunking the myth of the data science bubble

We’ve all read articles indicating the looming decline of data science. Some coined the term ‘data science bubble,’ and some even went so far as to set a date for the ‘death of data science’ (they give it five years before the bubble implodes). This reached a point where anyone working in the field needed to start paying attention to these signals. I investigated the arguments backing this ‘imminent death’ diagnosis, detected some biases, and drafted an early answer on LinkedIn. The Zalando communication team picked up on it and, following their encouragement, I prepared a revised version for the Zalando Blog. This post doesn’t aim to make any bold predictions about the future without proper evidence; I have always found these to be relatively pointless. It just aims to point out that, for all the noise, there is no solid reason to believe that any of us should worry about our jobs in the years to come. In fact, the very arguments used to predict a ‘data science bubble’ can be turned around as reasons not to worry.

The arguments used by proponents of the data science bubble are generally of three sorts:

1- Increased commoditization

2- Data scientists should not become software engineers

3- Full automation

Increased Commoditization: It is clear that data science work is getting increasingly commoditized: almost all ML frameworks now come with libraries of off-the-shelf models that are pre-architectured, pre-trained and pre-tuned. Want to do image classification? Download a pre-trained ResNet for your favorite deep-learning framework and you are almost ready to go. The net effect is that a single well-rounded data scientist can now solve in a week what a full team couldn't solve in six months 10 years ago.

Does that mean less demand for data scientists? Certainly not. It only means that investing in data science is now viable for a lot of domains for which data science was simply too expensive or too complex before, hence a rising demand for data science and data scientists. It is useful to take software engineering as a comparison here. Over the years, most of the complexity around programming has been abstracted and commoditized. Only a few could start anything in assembly; C made it much easier to develop complex projects; Java commoditized memory management; and so on. Did it make the demand for software engineers vanish? Certainly not. On the contrary, it increased their productivity and hence their net value to any organisation.

Data scientists should not become software engineers: I strongly disagree with this assessment. One wouldn’t believe the number of data science projects that end up in a PowerPoint presentation with pretty graphs and then just die an ignominious death. Why? Because data scientists often lack the ability to make their projects deliver continuous value in a well-maintained and monitored production environment. 95% of the data science projects I see do not make it past the POC stage. Going beyond the POC requires a software engineering mindset.

It is still rare to find data scientists actually capable of (1) putting a model in a production environment, and then (2) guaranteeing that machine-learning-based value is continuously delivered, monitored and maintained in the long run. Sadly, that is precisely where the ROI for any data science investment lies. I am not sure pushing data scientists to move towards management would help there: chronic over-powerpointing and the urge for serial POCs that never make it beyond the MVP stage are very much a management-induced sickness. I am not saying data scientists should become software engineers but, if anything, data scientists need better engineering and software architecture abilities, not worse.

The risk of automation: Full automation is very unlikely because, in many regards, data science is still more an art than a technique. There is a huge gap between the ‘hello MNIST’ TensorFlow example and applying ML to a new domain for which no golden data set or known model archetype exists. Ever had to use crowdsourcing for gathering labels? Ever ventured into the uncharted territories of ML? Ever had to solve a problem for which you couldn’t piggyback on an existing git repo? You will know what I am talking about…

And there we enter the real discussion: data scientists who are not able to go beyond the TensorFlow MNIST CNN example, the ResNet boilerplate or the vanilla word2vec + LSTM archetype are indeed going to become extinct, in the same way that no programmer can make a living out of the ‘Hello World’ code they wrote during the first year of college. But for those who know how to go beyond that and make ML actually work in a continuous delivery environment, there is a bright future ahead, and there are good reasons to think it will last much longer than the next five years.

Sources:

https://blogs.oracle.com/datawarehousing/the-end-of-the-data-scientist-bubble

https://towardsdatascience.com/the-data-science-bubble-99fff9821abb

https://medium.com/@TebbaVonMathenstien/are-programmers-headed-toward-another-bursting-bubble-528e30c59a0e

Agile Principles Over Frameworks

2018-08-02T00:00:00+02:00

Embracing the diverse in working agile

Very often I get asked what agile working looks like at Zalando. Do we use Scrum? Do we use Kanban? Do we work with LeSS? Do we use SAFe? The answer to all of these is, “Yes”.

As Agile Coaches we value principles more than frameworks. The principles are derived out of these diverse frameworks and they evolve over time. We iterate them, we rewrite them and we focus only on the needs of Zalando. Our guiding principles are:

Customer Centricity to build the right thing: Through practices we ensure that customer value provides the direction for our work. We focus on solving customer problems and exploiting market potential rather than finishing work whose purpose or impact we do not see or know.
Visualization for transparency and control: As software development and other areas are brain and knowledge work, we create transparency by visualizing our work. We visualize concepts, workflows and KPIs, and can therefore align and control the work.
Accelerated Feedback to lower risk and increase learning: All our work aims to get feedback as fast as possible. Can we involve the user and customer directly and systematically in the process to receive feedback? Everything else is just assumption, and this principle avoids building on assumptions.
Manage the flow to optimize for value creation: The focus of our work is value creation, so we focus on the work items running through our system and optimize their flow. By focusing on those work items, we see work from a different perspective than managing only people.
Build quality in to enable sustainable delivery: This principle ensures that, through our ways of working, we have quality as a default mode. Coding practices and team policies make quality not something to take care of at the end of the process, but the standard of our work.
Continuously improve to stay in the healthy performance zone: Wherever we start, we constantly improve everything. We improve the human interactions to ensure conflicts don’t build up. We also improve our ways of working and our workflow. By constantly improving we remain agile, which is the sweet spot between chaos and bureaucracy.

Why are we agnostic about frameworks?

We have more than 150 teams connected to software development. The skill sets and experiences of individuals are very diverse. Why should we force a team to use Scrum if they work more effectively in a different mode and with different rules? As long as they are customer centric to build the right things, we give them the autonomy to use whatever frameworks, or parts of them, they like. The context of each team differs. Focusing on principles allows as much customization of working style as possible and ensures that the most important practices are used.

How do teams learn and use principles concretely?

The teams learn about our principles in a two-day deep dive workshop. The aim is to get teams to understand the “why” behind the principles. We found that an agile mindset is far more important than following a certain practice. Practices themselves, without understanding them, do not lead to the intended impact. Next to understanding the “why,” we also connect concrete practices to each principle. For example, we connected eight concrete practices to the principle of “Accelerated Feedback” to lower risk and increase learning:

(1) Backlog refinement
(2) Estimation
(3) Story splitting
(3) Daily stand up
(4) Test driven development
(5) Continuous integration
(6) Emergent design
(7) Live concept test
(8) Vertical User Stories

Teams use the principles as a reflection moment, e.g. in retrospectives to see where they can improve. Some teams take inspiration from the practices, see learning needs or potential for deep dives.

By working with principles over frameworks, we can use agile methods at scale with a high level of diversity. The mixture of mindset and concrete practices makes them very impactful. The principles as a whole are the joint work of Frank Ewert, Holger Schmeisky, Samir Hanna, Tobias Leonhardt and myself.

Agile in People Operations

2018-07-26T00:00:00+02:00

Applying agile frameworks to HR processes

At Zalando we set up multi-disciplinary teams to develop our products. We do not have a central tech unit, but tech is distributed everywhere. This means that the way our techies work together has also spread across the company. Everywhere in the organization people have touch points with agile frameworks and practices.

The People Operations team is the backbone of our HR processes, managing a high number of tickets around sick notes, working contracts, job references, work permit support, SAP data maintenance and similar topics. The team consists of 80 people. You might think this is not the first use case that comes to mind for applying agile practices and frameworks. But we decided to try it out.

To realize our goal of implementing agile practices within our team, we applied three principles:

Visualization
Manage the flow
Continuous improvement

Visualization

What did we do? In order to manage something properly, it needs to be transparent. We first started with a big board and then with a big screen, measuring and displaying all relevant data, e.g. open tickets, lost calls, employee satisfaction, etc. Measuring straight away already made a change, because we needed to think about ticket categories and immediately discovered improvement points.

What impact did it generate? As a result the team now has complete transparency about tickets and they can manage them using the two following agile principles.

Manage the flow

What did we do? We process the tickets differently now. Before the change, all tickets went into a big bucket and teams looked for their topic based upon keywords. If a keyword was misleading, the ticket was sometimes delayed or an important topic was discovered too late. After making the data transparent, we now use a dedicated role (the Channel Manager) to dispatch tickets three times a day. The Channel Manager is able to see patterns and build new keywords or change existing ones in direct contact with the internal users. The motivation for this role is to become as automated as possible.

What impact did it generate? Moving from only managing teams to managing the flow of work items, we achieved faster cycle times and reduced our backlogs. We can prioritize very easily and time-critical tickets can be managed adequately. All of this happened through structured improvement.

Continuous Improvement

What did we do? We rearranged our workplace so everybody has a clear view of our big screen. Every two weeks we have a standup for around 10 minutes at our screen, looking at the numbers and briefly sharing patterns and improvement points. After the standup, we move on to deep dives on an individual level. Looking at the visualized and omnipresent data, the team members see very clearly how and where they can improve themselves, and they have easier access to bring in their improvements. The same goes for the leadership team: having transparent results helps them understand which impediments they need to remove for the teams on a more systematic and holistic level.

What impact did it generate? We are constantly improving and the level of tickets dispatched in an automated fashion is increasing every month. Through this, we have managed to reduce our backlog and cycle time every month. We also set up our own way to discover and tackle our issues quickly: fewer big “improvement projects” eating a lot of resources, and more weekly improvements based on KPIs and numbers.

Of course, there are still improvements to be made and we are not adopting all agile principles directly. Nevertheless, it is a good example for us of how we improved the work of a people operations team through agile working.

Lean Testing, or Why Unit Tests are Worse than You Think

2018-07-19T00:00:00+02:00

An economic perspective on testing

Testing is a controversial topic. People have strong convictions about testing approaches. Test Driven Development is the most prominent example. Clear empirical evidence is missing, which invites strong claims. First, I advocate for an economic perspective towards testing. Second, I claim that focussing too much on unit tests is not the most economic approach. I call this testing philosophy “Lean Testing.”

The main argument is as follows: different kinds of tests have different costs and benefits. You have finite resources to distribute into testing. You want to get the most out of your tests, so use the most economic testing approach. For many domains (e.g. GUIs), tests other than unit tests give you more bang for your buck.

Confidence and Tests

The article 'Write tests. Not too many. Mostly integration' and the related video by Kent C. Dodds express the ideas behind Lean Testing well. He introduces three dimensions with which to measure tests:

Cost (cheap vs. expensive)
Speed (fast vs. slow)
Confidence (low vs. high, e.g. click doesn't work vs. checkout doesn't work)

The following is the 'Testing Trophy' suggesting how to distribute your testing resources.

Compared to Fowler's Testing Pyramid, confidence as a dimension is added. Another difference is that unit tests do not cover the largest area.

One of Kent C. Dodds' major insights is that you should actually consider the confidence a test gives you: "The more your tests resemble the way your software is used, the more confidence they can give you."

Return on Investment of Tests

The return on investment (ROI) of an end-to-end test is higher than that of a unit test. This is because an end-to-end test covers a greater area of the code base. Even taking into account higher costs, it provides disproportionately more confidence.

Plus, end-to-end tests test the critical paths that your users actually take, whereas unit tests may test corner cases that are never or very seldom encountered in practice. The individual parts may work but the whole might not. The previous points can be found in 'Unit Test Fetish' by Martin Sústrik.

Further, Kent C. Dodds claims that integration tests provide the best balance of cost, speed and confidence. I subscribe to that claim. We don't have empirical evidence showing that this is actually true, unfortunately. Still, my argument goes like this: End-to-end tests provide the greatest confidence. If they weren't so costly to write and slow to run we would only use end-to-end tests, although better tools like Cypress mitigate these downsides. Unit tests are less costly to write and faster to run but they test only a small part that might not even be critical. Integration tests lie somewhere between unit tests and end-to-end tests so they provide the best balance.

As an aside: The term “integration test,” and even more so “end-to-end test,” seems to generate intense fear in some people. Such tests are supposed to be brittle, hard to set up and slow to run. The main idea is to just not mock so much.

In the React context of Kent C. Dodds’ article, integration testing refers to not using shallow rendering. An integration test covers several components at once. Such a test is easier to write and more stable since you do not have to mock so much and you are less likely to test implementation details.

In the backend world, an integration test would run against a real database and make real HTTP requests (to your controller endpoints). It is no problem to spin up a Docker database container beforehand and have its state reset after each test. Again, these tests run fast, are easy to write, reliable and resilient against code changes.

Code Coverage

Another point is that code coverage has diminishing returns. In practice, most people agree, as most projects set the lower bound for coverage to around 80%. There is actually supporting research such as 'Exploding Software-Engineering Myths.' What follows are general arguments.

Even with 100% code coverage you trust your dependencies. They can, in principle, have 0% code coverage.

For many products, it is acceptable to have the common cases work but not the exotic ones (Unit Test Fetish). If you miss a corner case bug due to low code coverage that affects 0.1% of your users you might survive. If your time to market increases due to high code coverage demands you might not survive. And "just because you ran a function or ran a line does not mean it will work for the range of inputs you are allowing" (source).
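To make the last quote concrete, here is a minimal sketch. The `average` helper is hypothetical and not taken from any of the cited articles. A single check executes every line of the function, so line coverage is 100%, yet an input outside that check still misbehaves.

```javascript
// Hypothetical helper: arithmetic mean of a list of numbers.
function average(values) {
  let sum = 0;
  for (const v of values) {
    sum += v;
  }
  return sum / values.length;
}

// This one call runs every line of average(): 100% line coverage.
console.log(average([2, 4, 6])); // 4

// An input the check never tried still goes wrong:
// 0 / 0 evaluates to NaN in JavaScript.
console.log(average([])); // NaN
```

Whether the empty-list case matters depends on the product, which is exactly the economic trade-off discussed above.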

Code Quality and Unit Tests

There is the claim that making your code unit-testable will improve its quality. Many arguments and some empirical evidence in favor of that claim exist so I will shed light on the other side.

The article ‘Unit Test Fetish’ states that unit tests are an anti-architecture device. Architecture is what makes software able to change. Unit tests ossify the internal structure of the code. Here is an example:

"Imagine you have three components, A, B and C. You have written extensive unit test suite to test them. Later on you decide to refactor the architecture so that functionality of B will be split among A and C. you now have two new components with different interfaces. All the unit tests are suddenly rendered useless. Some test code may be reused but all in all the entire test suite has to be rewritten."

This means that unit tests increase maintenance liabilities because they are less resilient against code changes. Coupling between modules and their tests is introduced! Tests are system modules as well. See ‘Why Most Unit Testing is Waste’ for these points.

There are also some psychological arguments. For example, if you value unit-testability, you would prefer a program design that is easier to test than a design that is harder to test but is otherwise better, because you know that you'll spend a lot more time writing tests. Some further points can be found in 'Giving Up on Test-First Development'.

The article 'Test-induced Design Damage' by David Heinemeier Hansson claims that to accommodate unit testing objectives, code is worsened through otherwise needless indirection. The question is whether extra indirection and decoupled code are always better. Do they not have a cost? What if you decouple two components that are always used together? Was it worth decoupling them? You can claim that indirection is always worth it but you cannot, at least, dismiss harder navigation inside the code base and during run-time.

Conclusion

An economic point of view helps to reconsider the Return on Investment of unit tests. Consider the confidence a test provides. Integration tests provide the best balance between cost, speed and confidence. Be careful about code coverage as too high aspirations there are likely counter-productive. Be skeptical about the code-quality improving powers of making code unit-testable.

To make it clear, I do not advocate never writing unit tests. I hope that I provided a fresh perspective on testing. As a future article, I plan to present how to concretely implement a good integration test for both a frontend and backend project.

If you desire clear, albeit unnuanced, instructions, here is what you should do: Use a typed language. Focus on integration and end-to-end tests. Use unit tests only where they make sense (e.g. pure algorithmic code with complex corner cases). Be economic. Be lean.

Additional Notes

One of the problems of discussing the costs and benefits of unit tests is that the boundary between unit and integration tests is fuzzy. The terminology is not completely unambiguous so people tend to talk at cross purposes.

To make it clear, low code coverage does not imply fewer bugs. As the late Dijkstra said (1969): “Testing shows the presence, not the absence of bugs.”

There is research that didn’t find Test Driven Development (TDD) improving coupling and cohesion metrics. TDD and unit tests aren’t synonyms but in the context of this article it’s still interesting: ‘Does Test-Driven Development Really Improve Software Design Quality?’ Another article, ‘Unit Testing Doesn’t Affect Codebases the Way You Would Think’, analyzes code bases and finds that code with more unit tests has more cyclomatic complexity per method, more lines of code per method and similar nesting depth.

This article focussed on how to distribute your automated testing budget across different kinds of tests. Let's take a step back and consider reducing the automated testing budget altogether. Then we'd have more time to think about the problems, find better solutions and explore. This is especially important for GUIs as often there is no 'correct' behavior but there is 'good' behavior. Paradoxically, reducing your automated testing budget might lead to a better product. See also ‘Hammock Driven Development’.

There is a difference between library and app code. The former has different requirements and constraints where 100% code coverage via unit tests likely makes sense. There is a difference between frontend and backend code. There is a difference between code for nuclear reactors and games. Each project is different. The constraints and risks are different. Thus, to be lean, you should adjust your testing approach to the project you're working on.

Styling-API Reinvented

2018-07-12T00:00:00+02:00

Decoupled styling in UI components

Styling isolation

Styling isolation achieved via CSS-modules, various CSS-in-JS solutions or Shadow-DOM simulation is already a commonly used and embraced pattern. This important step in CSS evolution was really necessary for UI components to be used with more confidence. No more global scope causing name conflicts and CSS leaking in and out! The entire component across HTML/JS/CSS is encapsulated.

Styling API - exploration

I expect CSS technology to offer much more in the future. The encapsulation usually comes hand in hand with the interface, for accessing what was hidden in an organised way. There are different ways to provide styling-APIs, for customising the component CSS from the outside.

One of the simplest methods is to support modifiers; flags for the component, used to change appearance, behavior or state:

This is convenient if there are a few predefined modifiers. But what if the number of different use cases grows? The number of modifiers could easily go off the scale if we combined many factors, especially for non-enum values like "width" or "height".
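As a rough illustration of the modifier approach, the sketch below maps boolean flags to class names. The class names and the `buttonClassName` helper are hypothetical and not taken from the original example. It also shows why free-form values such as a width have no natural place in a fixed set of modifiers.

```javascript
// Hypothetical class names; a real component would get these from
// its CSS isolation tool (CSS Modules, CSS-in-JS, ...).
const BASE_CLASS = "button";
const MODIFIER_CLASSES = {
  primary: "button--primary",
  disabled: "button--disabled",
  large: "button--large",
};

// Turn boolean modifier flags into the class string for the root element.
function buttonClassName(modifiers = {}) {
  const names = [BASE_CLASS];
  for (const [flag, enabled] of Object.entries(modifiers)) {
    const known = Object.prototype.hasOwnProperty.call(MODIFIER_CLASSES, flag);
    if (enabled && known) {
      names.push(MODIFIER_CLASSES[flag]);
    }
  }
  return names.join(" ");
}

console.log(buttonClassName({ primary: true, large: true }));
// "button button--primary button--large"

// A non-enum value like width has no predefined modifier and is ignored.
console.log(buttonClassName({ width: 120 }));
// "button"
```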

Instead we could expose separate properties that provide a certain level of flexibility:

In such cases different modifiers can simply be constructed by users of the component. But what if the number of CSS properties is large? This solution also doesn't scale nicely. Another con is that any modification of the component's styles will likely force us to change the API as well.

Another solution is to expose the class that will be attached to the root element (let’s assume it's not a global class and proper CSS isolation technique is in place):

Attaching a class from the outside will effectively overwrite the root element CSS. This is very convenient for positioning the component, with such CSS properties as "position", "top", "left", "z-index", "width" and "flex". Positioning of the component is rarely the responsibility of the component itself. In most cases it is expected to be provided from outside. This solution is very convenient and more flexible than the former proposals. But it’s limited to setting the CSS only for the component's root element.

The combination of the above solutions would likely allow us to address many use cases, but is not perfect, especially for component libraries, where a simple, generic and consistent API is very important.

Decoupled styling

I'd like to take a step back and rethink the whole idea of styling-API for components. The native HTML elements come with minimal CSS, enough to make the elements usable. The users are expected to style them themselves. We are not talking about "customisation", as there is basically no inherent styling in place to "customise". Users inject styling, via a “class” attribute or “className” property:

In latest browsers like Chrome, we can also set the styling for more complex HTML5 elements like video elements:

.fashion-store-video::-webkit-media-controls-panel {
  background-color: white;
}

Thanks to Shadow DOM and webkit-pseudo-elements users can set the styles not only for the root element, but also for important inner parts of the video component. However webkit pseudo-elements are poorly documented and seem to be unstable. It’s even worse for custom elements, because currently it’s not possible to customise the inner parts of elements (::shadow and /deep/ have been deprecated). However, there are other proposals that will likely fill the gap:

Let's summarise the native approach, which I call "decoupled styling":

A component is responsible only for its functionality (and API) and comes with minimal or no styling
A component's styling is expected to be injected from outside
There is a styling-API in place to style the inner parts

Benefits

The nature of styling is change, the nature of functionality (and API) is stability. It makes perfect sense to decouple both. Decoupled styling actually solves many issues that UI-component library developers and users are facing:

styling couples components together
same changelog for styling and functionality/API causes upgrading issues (e.g. forced migrations)
limited resilience - changes in styling propagate to all parts of the frontend project
costs of rewriting components to implement a new design
costs of rewriting/abandoning projects, because of outdated components
limitations of styling-API to address different use cases
bottleneck of component library constantly adjusted for different use cases

API proposal

In the world of custom UI components, many components are constructed from other components. Contrary to the native HTML/CSS implementation, which injects a single class name, here we need an API for accessing the nested components. Let’s look at the following proposal for the API.

Imagine a “Dialog” component that contains two instances of a “Button” component (“OK” and “Cancel” buttons). The Dialog component wants to set the styling for OK button but leave the styling for the Cancel button unchanged (default):

OK

Cancel

We used the “classes” property to inject the CSS classes for two of Button’s internal elements: the icon and the text elements. All properties are optional. It’s up to the component itself to define its styling-API (set of class names referencing its child elements).

To use Dialog with its default, minimal styling:

But for cases where we want to adjust the styles, we will inject it:

We injected a class that will be attached to the root element. But we can do much more:

The example above shows how we can access every level of the nested component structure in the Dialog. We’ve set the CSS classes for the root element and the OK button. By doing that we will effectively overwrite the styling for the OK button, which is preset inside Dialog.

In the same way we will be able to set the styling for components that contain Dialogs, and farther up, to the highest level of the application. On the root level of the application, defining the styles will practically mean defining the application theme.

Implementation

I implemented two examples using React and TypeScript, first with CSS Modules and second with Emotion (CSS-in-JS library). Both are based on the same concept:

default, minimal styling for components is predefined as an isolated set of classes
styling-API (set of class names) is defined using TypeScript interface, with all properties optional
components allow injection of class names object (via “classes” parameter) which is “deeply-merged” with default class names object, overwriting the styles
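The “deep merge” in the last point could be sketched as follows. This is a framework-free illustration under stated assumptions, not the code from the linked repositories: the `mergeClasses` helper, the class names and the shape of the Dialog's classes object are all hypothetical, and an injected class name is assumed to replace the default rather than be appended to it.

```javascript
// Default, minimal class names for a hypothetical Dialog with two nested Buttons.
const defaultDialogClasses = {
  root: "dialog",
  okButton: { root: "button", icon: "button-icon", text: "button-text" },
  cancelButton: { root: "button", icon: "button-icon", text: "button-text" },
};

// Recursively merge injected class names over the defaults.
// Nested objects are merged key by key; any other value replaces the default.
// Properties that are left out (or undefined) keep their default value.
function mergeClasses(defaults, injected = {}) {
  const result = { ...defaults };
  for (const [key, value] of Object.entries(injected)) {
    if (value === undefined) continue;
    const base = result[key];
    const bothObjects =
      value !== null && typeof value === "object" &&
      base !== null && typeof base === "object";
    result[key] = bothObjects ? mergeClasses(base, value) : value;
  }
  return result;
}

const classes = mergeClasses(defaultDialogClasses, {
  root: "checkout-dialog",
  okButton: { text: "checkout-ok-text" },
});

console.log(classes.root);              // "checkout-dialog"
console.log(classes.okButton.text);     // "checkout-ok-text"
console.log(classes.okButton.icon);     // "button-icon" (default kept)
console.log(classes.cancelButton.text); // "button-text" (untouched)
```

Because every property is optional, a caller only names the parts it wants to restyle and everything else falls back to the component's minimal defaults.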

React, TypeScript, CSS Modules: https://github.com/mrac/decoupled-styling-css-modules React, TypeScript, Emotion: https://github.com/mrac/decoupled-styling-css-in-js

Conclusion

Decoupling styling from UI components may be a step towards making them really reusable, drawing from the original idea behind Cascading Style Sheets to separate the presentation layer from UI logic and markup. Defining boundaries between UI logic and markup on one side and styling on the other side would likely change the way UX designers collaborate with engineers. Here designers would style components based on the API provided by engineers. It would then be easier to specify what constitutes a breaking change within that contract. Putting an ever-changing skin on top of what is stable would likely save costs, reduce friction and contribute to software quality.

Dortmund Turns Six

2018-07-10T00:00:00+02:00

Zalando’s maiden tech hub celebrates in style

With our 10th anniversary celebrations coming up, 2018 is a very special year in the Zalando universe. But while the company celebrates 10 years, we in Dortmund are excited to celebrate our own birthday as we turn six.

Every year in July, we stop for a moment in Dortmund to reflect on our past journey together and celebrate the opening of our Dortmund Tech Hub in 2012. Being the first technology hub outside Berlin, it still feels very special to be part of this team and its continuing journey.

Back in the day, this was a big step for Zalando, one in which no one really knew what to expect. How would remote communication work? How can the community and Zalando grow over several locations but remain one team? What would be the core task here in Dortmund? Dortmund has a proud history as an industrial city, famous for coal and steel. A rather pragmatic way of working has developed here over time. So for us, the local "Zalandos," the goal was clear: make it work.

With a team of six developers and product managers from the outset, the office outgrew its initial space after our Smart Inventories, Gift Card, and Payments teams increased in number and scope. Boasting over 90 employees now, our Dortmund hub still holds an important role within the overall organization of Zalando, and maintains a consistent line of communications with Berlin and beyond.

At our birthday celebration, we looked back on this awesome journey and our great accomplishments. We achieved six years of growing together, taking our place in the Zalando universe, becoming known as the core pillar for Payment Services and our Fulfillment network, the backbone of our business. At last count, we number over 13 different teams that all contribute to core functionality of the Zalando Platform.

Here's to our future!

Utilizing the Finite State Machine

2018-07-05T00:00:00+02:00

How using a State Machine saved our apps & flows from refactoring

There is a lot to learn about a "Finite State Machine" (FSM).

A little intro: what is an FSM?

A Finite State Machine is an abstract model of computation that can be in exactly one of a finite number of states at any given moment. Finite State Machines are used to model problems in different domains such as AI, games, application flows, etc.

In simpler words: It describes how a program should behave by defining a fixed set of states and the routes between them.

A Real World Example

Let's imagine a safe lock:

Put simply, this lock has two states: locked and open. The diagram below shows the routes (transitions) between these states.

Let's say every action is a transition: each button you press on the lock is a transition that leaves the lock in the same state.

Only after entering the correct combination will the lock move to the open state. Afterwards, there is a security timeout that returns to the locked state after a certain time has expired.

Let's imagine a very simple manual way to code this lock in JavaScript:

const OPEN_STATE = "open";
const LOCKED_STATE = "locked";
const lockTimeout = 3000; // milliseconds before the lock closes again

class StateMachine {

  constructor(code) {
    this.state = LOCKED_STATE;
    this.code = code;
    this.entry = "";
  }

  enterDigit(digit) {
    this.entry += digit;
  }

  unlockDevice() {
    // Transition condition: the entry must match the code.
    if (this.entry === this.code) {
      this.state = OPEN_STATE;
      // The arrow function keeps `this` bound to the state machine;
      // passing this.lockDevice directly would lose it.
      setTimeout(() => this.lockDevice(), lockTimeout);
    }
  }

  lockDevice() {
    this.state = LOCKED_STATE;
    this.entry = "";
  }

}

const fsm = new StateMachine("123");
console.log(fsm.state); // prints "locked"

fsm.enterDigit("1");
fsm.unlockDevice();
console.log(fsm.state); // still "locked"

fsm.enterDigit("2");
fsm.unlockDevice();
console.log(fsm.state); // still "locked"

fsm.enterDigit("3");
fsm.unlockDevice();
console.log(fsm.state); // "open"

Every time unlockDevice() is called, it checks if the current entry matches the code. This is called the transition condition. If true, it allows the state to transition to the next (or previous state).
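The same idea can also be written as data rather than methods. The following is a minimal sketch, not the code of any particular FSM library: each transition lists its source state, its target state and a guard function that plays the role of the transition condition. The names `transitions`, `step`, `guard` and `timedOut` are made up for illustration.

```javascript
// Each transition names its source state, its target state and a guard
// (the transition condition) that must hold for the transition to fire.
const transitions = [
  { from: "locked", to: "open", guard: (ctx) => ctx.entry === ctx.code },
  { from: "open", to: "locked", guard: (ctx) => ctx.timedOut === true },
];

// Return the next state, or the current one if no guard is satisfied.
function step(state, ctx) {
  const match = transitions.find((t) => t.from === state && t.guard(ctx));
  return match ? match.to : state;
}

console.log(step("locked", { code: "123", entry: "12" }));  // "locked"
console.log(step("locked", { code: "123", entry: "123" })); // "open"
console.log(step("open", { timedOut: true }));              // "locked"
```

Keeping the transitions in a table means a new state or rule is added as one more entry instead of one more branch in the code.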

Here are some examples of FSM libraries in JavaScript that you might find useful:

Our use case

At Zalando, our team is responsible for building the Guest Checkout Flow, which allows customers without a Zalando account to make a purchase. We first started with the basic flow and didn't have much in mind about what was to come.

The basic flow was:

Product Page -> Personal Info -> Address Info -> Payment -> Confirmation -> Receipt

Every page in this design was responsible for the transition to the next page, for example:

// product-detail.js

// ...
const buyButtonClicked = () => {
    goToPersonalPage();
};
// ...

// personal.js

// ...
const confirmButtonClicked = (personalInfo) => {
    // Only move on once the personal info is complete.
    if (personalInfoComplete(personalInfo)) {
        goToAddressPage();
    }
};
// ...

But there's one small flaw with this simple design. It's not extendable, not even testable.

Our product team wanted to introduce some new functionality to the flow, namely "Login Functionality," which would completely break the whole design.

Logged in users, without personal info or address info:

Product Page -> Login -> Personal Info -> Address Info -> Payment -> Confirmation -> Receipt

Logged in users, without payment info:

Product Page -> Login -> Payment -> Confirmation -> Receipt

Logged in users, without address info, but with payment info:

Product Page -> Login -> Address -> Confirmation -> Receipt

Logged in users, without address info or payment info:

Product Page -> Login -> Address Info -> Payment -> Confirmation -> Receipt

And what about Guest Users now? Too much if-else.

Enter The State Machine

This design screams for a state-machine-like design. We laid down the states we want, defined some rules between them, and let the state machine do its magic.

This is a simplified example of how the FSM would work. If you notice, almost all pages return to the FSM to ask where to go next. The FSM has validation rules that allow it to decide what to do next; it uses the Redux Store to decide.

We called this function goNext(). We defined all the possible rules and transitions we have in the system; a fallback would be to just render the product page if the state is not compatible with any of the transitions.

The state machine takes the state, follows through the rules and keeps "going next" until it finally reaches the proper state.

An earlier example of a user with personal + address but with no payment would be:

Personal state: User? Has personal? Yes? Go next. Address State: Has address? Yes? Go next. Payment: Has payment? No? Stay here.
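The walk described above could look roughly like this. It is a simplified sketch under assumptions, not our production code: the state names, the `isDone` rules and the shape of the store object are illustrative, and a plain object stands in for the Redux Store.

```javascript
// Ordered checkout states. Each rule says whether the state can be skipped
// given the store contents. Names and fields are illustrative only.
const states = [
  { name: "personal", isDone: (store) => Boolean(store.personalInfo) },
  { name: "address", isDone: (store) => Boolean(store.address) },
  { name: "payment", isDone: (store) => Boolean(store.payment) },
  { name: "confirmation", isDone: () => false }, // always shown
];

const FALLBACK = "product";

// Walk forward from `current` and stop at the first state that still needs input.
function goNext(current, store) {
  const start = states.findIndex((s) => s.name === current);
  if (start === -1) return FALLBACK; // unknown state: render the product page
  for (let i = start; i < states.length; i++) {
    if (!states[i].isDone(store)) return states[i].name;
  }
  return FALLBACK;
}

// User with personal and address info but no payment: stop at payment.
console.log(goNext("personal", { personalInfo: {}, address: {} })); // "payment"
// User with nothing stored yet: stay on the personal page.
console.log(goNext("personal", {}));                                // "personal"
// A state the machine does not know falls back to the product page.
console.log(goNext("somewhere", {}));                               // "product"
```

Since the rules are plain functions of the store, each scenario can be checked by calling goNext() with a prepared store object.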

A challenge to that design

A good challenge to this design was the implementation of “going back.” The state machine was designed to always move forward, right? What happens if the user decides to go back to the previous page? Luckily the Redux State System manages this; however, it was not implemented in our initial design with goNext(). The answer is simple. We implemented goPrev(), which follows the same concept as going forward, just the other way around. Same rules apply: different direction. It worked quite well (after ironing out some nasty bugs).

Pros of this FSM Design

Easily maintainable, transitions and states are clearly defined
Testable, unit tests can easily be written with pre-defined states for multiple scenarios
Easily extendable, allowing for new states to be just plugged in along with their rules

Cons of this FSM Design

If some scenarios are not well defined, the FSM just redirects the user to the product page when they were almost at the payment page. For example, if some underlying backend service (e.g., a payment provider) returns an unexpected response, the Redux state would get corrupted and the FSM wouldn't know what to do, redirecting the user to the product page and leaving the user confused: "What on Earth happened to my credit card now?"

We try to cover as many scenarios as possible, also providing the user with a proper error page so that they do not get confused.

A next-step improvement would be allowing the FSM to "re-try" if something fails.

And as they say, computers and humans aren't perfect.

The State of Open Source

2018-06-28T00:00:00+02:00

The evolution and future of open source at Zalando

Open source software has been the core of Zalando’s tech stack since the company’s humble beginnings, selling flip-flops from a basement 10 years ago; it’s part of our DNA as a tech company.

For engineering teams at Zalando, open source is a natural part of how we solve problems. We consult and share the TechRadar for guidance on appropriate technologies to use, we contribute to projects such as Kubernetes, and we work in the open on a very large part of our infrastructure setup, such as Nakadi, Connexion and Patroni.

Today we are releasing the very first of our reports on open source at Zalando to give everyone inside and outside the company an insight into how we are actually doing. The increased insight gives us many reasons to celebrate our contributions, but also signposts to take action in the areas where we see a need to improve.

While the overall picture is positive and open source at Zalando is maturing, we also see challenges: contributions are driven by a relatively small group of Zalandos, there are legal and organisational barriers to entry, and no consistent process for open source work at Zalando.

This is where the newly created open source team will get to work. As outlined in the report, our focus for the rest of the year is ensuring that proper processes are in place, that there is no uncertainty on how to engage in open source, and that our open source efforts grow in size and reach.

We have drafted the following team objectives to work towards:

Increase inter-team alignment with inner source initiatives
Align community interests with Zalando’s business interests
Create & nurture open source projects
Improve developer recruitment and retention
Ensure efficient & legally safe adoption and publishing of open source

You can follow the process of the open source team on our issue tracker, find our open sourced policies on our website, and finally you can read the report on open source at Zalando here.

The Intrapreneurship Journey at Zalando

2018-06-21T00:00:00+02:00

Sharing our innovation stories: success, failures, and learnings

Franzi, Humberto, Neil, Lenia, Vivek... These are just some names of the people who are willing to put in the extra effort and run the additional mile to impact the organization in a way they haven’t done before. The stories of these Zalando intrapreneurs are the ones I summarized at the Innov8rs conference in Madrid.

Back in October 2015, Zalando took a leap and opened a new playground for Tech Innovation. It allowed tech teams to experiment with emerging technologies, support product discovery, kickstart hardware initiatives around 3D printing, and for prototyping of all sorts. Since then, our innovation strategy and approach have completely changed. How do we innovate in what we do at Zalando? One of the many vehicles for innovation is Slingshot, our intrapreneurship program.

Slingshot

A name drawn from aerospace engineering and orbital mechanics: “a gravitational slingshot is the use of the relative movement and gravity of an astronomical object to alter the path and speed of a spacecraft, typically to save propellant and reduce expenses”. Same as the influence it has on a spacecraft, Slingshot is Zalando’s intrapreneurship program aimed to accelerate ideas or redirect their path. Fostered by the Innovation Lab, it is the opportunity to validate ideas in a fixed time frame of 12 days within a three month period. Open to everyone at Zalando, it allows individuals and teams to dedicate 20% of their paid working time to the project, and provides plenty of space, help and expertise to validate their ideas for further funding: more time, more money, more support and much more passion and commitment.

Our initial approach

Our playground a couple of years ago was mostly aimed at developing tech-focused ideas. This meant that innovators came to our lab to experiment with brilliant visionary ideas with topics around chatbots, AR/VR, IoT and Conversational Technologies. Some of these ideas were built based on how we understood the technology and how customers could use it to browse and buy products from the Zalando Store. Good things happened back then, media coverage was one of them and the team who developed our first fashion chatbot made it to the Facebook F8 Conference. We were just on the right track! At least, that’s what we thought. We were not. Our use cases couldn’t be validated and our customers didn’t find them useful. In simple terms: we were trying to create solutions where we didn’t understand the problem of our customers. Our teams got so attached and fell in love with their solution to the extent that sometimes the problem lost its meaning along the way.

Our most valuable pivot

Lean startups talk about quick iteration, pivots, and how to be more innovative while maximizing precious resources. Honestly, we failed here and there. It took us some time to realize some of the mistakes we made until we finally decided to move away from a “Solution first” approach to a “Problem first” approach. What does this mean? Our innovation approach focuses first on the customer problem backed with relevant insights and data, and then a defined problem statement. Only then are we able to explore many different solutions, prototype one, two or three of them and validate with the defined target audience. We didn’t reinvent the wheel for this, we relied on wide-spread tools and methodologies, especially on Design Sprints from Google Ventures.

An example of this approach is a recent project which came to us. These intrapreneurs wanted to explore Computer Vision as a solution for Search. When we started the project, we first tried to understand the problem from a user perspective. We got out of the building, crafted a research question to know how our customers search for outfits and met several of them. We came up with the user journey in figure 1 after talking to them. The important revelation here was that most of our users started their journey on social media channels and then had a lot of hacky ways to find a similar outfit. This changed the course of the whole project and we figured out how to innovate these journeys and provide a superior experience with less effort for our users. Starting with the customer problem: different approach, huge impact!

Support is paramount

The work of our innovators can’t be done by themselves. We rely on the expertise of hundreds of our colleagues who help us move forward every single one of the 12 allocated days. We are more than thankful for their willingness to cooperate and work together to validate ideas. There are teams whose leads are keen and open to share their talent: designers, developers, business, product people, etc. On top of that, none of these initiatives could happen without the support of the management in Zalando, who invest in and support fostering a culture for innovation across the company boosted by quarterly events like Hack Week.

Our impact

Slingshot has been around since the end of 2015. Small experiments have been created in Hack Week, validated in Slingshot, and new teams have been created and grown to something big, as big as Zalando’s Sizing Team, creating remarkable business impact. It all started from the bottom and that is how we succeed with our people: we foster an environment for bottom-up innovation boosted by the passion and drive of our people who have a deep desire to engage in building new products that reinforce our platform strategy and shape the future of Zalando.

Stories such as Innovation in Digital Experience by a Zalando data scientist also feature in this blog, showing how deeply we impact our people; people who are able to follow their passion, are truly obsessed with solving customer problems, and who are eager to see their products coming to life.

Moving forward

Slingshot is everywhere. What started as a “tech only” initiative has spread around the company and our learnings have helped us have the right tools in place, iterate faster on our learnings, and offer a compelling value proposition company-wide. That is to say, enabling a platform for intrapreneurs to kickstart radical ideas that generate business value or fail fast. Our vision is also clear: to be the accelerator for Zalando visionaries to innovate our business.

At Zalando, more than 100 nationalities are represented, and we are all able to innovate and reimagine our business. When it comes to business success, it is all about people, people and people.

Interested to know more? Luis gave a talk about this journey at Innov8rs Conference in Madrid in May. If you have any questions or want to get in touch, you can reach him through his LinkedIn.

All Aboard

2018-06-14T00:00:00+02:00

What new tech employees can expect from Zalando onboarding

So, you’ve applied for a technical role at Zalando and you’ve just accepted the offer! If you’re wondering what to expect, look no further. We are excited to share a peek behind the scenes, so you can see what awaits you in the first few weeks of this journey, regardless of whether you’re joining in Berlin, Dortmund, Dublin, Hamburg, Helsinki or Lisbon, to make sure you’re well-equipped to dive into life at Zalando.

Part One

We designed a special onboarding program for our tech community. The program introduces our new hires to their work environments. This starts with a company-wide onboarding day in which new hires from all across Zalando find out more about Zalando’s history, structure, and strategic future. Zalando started as a very small company in Berlin in 2008, and has now grown to be Europe’s largest online fashion platform with over 15,000 employees. As a new “Zalando,” you’re now an integral part of this journey, and it’s our job to make sure you’re fully equipped to hit the ground running!

Part Two: General Tech Onboarding

For most of you, your second day begins with the Tech Onboarding Program, together with all the other tech newbies. This is taken care of by the Zalando Tech Academy, our internal training centre which caters specifically to tech roles. This is an exciting time, because you get to interact with future colleagues from across the tech spectrum; product managers to UX designers, frontend and backend software engineers, as well as tech management, to get deeper insights into what we do. Early exposure to these different aspects is what makes Zalando Tech such a great place to work.

In this four-day program, a major highlight is a full-day trip to one of our logistics centres near Berlin, where you are able to see how Zalando Logistics operates on the inside, and how we employ technology to improve processes every step of the way. Every one of Zalando’s employees works towards the same goal: to ensure a seamless customer experience, whether you’re working in tech, on the business side, in logistics, or anywhere else. This trip to the warehouse is to provide exposure to even more areas of Zalando’s business; helping our newbies appreciate the work that goes into delivering customer satisfaction at every level.

The tech community is an extremely important aspect of working at Zalando, and one of our main goals from the beginning is to make it easy for you to connect with as many people as possible. That’s why part of your onboarding includes an overview of Zalando’s culture and community. Zalando is home to dozens of “guilds”; self-organized groups of people who gather to exchange knowledge, share their expertise, or just indulge in things they are passionate about. Our community management team helps manage the guilds and provides the framework necessary for Zalando’s techies to thrive. Especially for newcomers and international employees, this is the perfect opportunity to get involved, and receive recommendations and support.

Part Three: Engineering Bootcamp

After covering the basics in the first few days, software engineers embark on a further onboarding program, which dives a bit deeper into the tech environment: Engineering Bootcamp. In this three-day program, engineers gain deep insights into our software development lifecycle, covering all the tools and technologies used by Zalando. You’ll be given hands-on exercises on all of our tools, so that you can get to coding and deploying projects as soon as possible.

Topics covered include why we use REST APIs, how we use GitHub Enterprise, how we deploy software using AWS, Stups and Kubernetes, and much more. We also introduce newbies to our “tech radar”: a graphical illustration of which frameworks, infrastructure, data management tools, and languages are in use at Zalando, which are being trialed, and which are on hold and not recommended for new projects.

Our tech onboarding program prepares you for almost any team within the Zalando Tech universe, but of course it’s followed by team-specific onboarding, in which you learn the ins and outs of projects you’re working on and how they fit into Zalando’s strategy.

Whether you’re joining the Zalando team as a software engineer, product manager, UX designer, data scientist, or any other role, we’ve got you covered! At Zalando, we want to ensure that you have the best experience possible when joining the team.

Loading Time Matters

2018-06-11T00:00:00+02:00

How Zalando's overall site speed improved by more than 25% in five months

We all know that providing a fast user experience is key. Still, it was somewhat a wake-up call for us last fall when we saw our aggregated loading time increasing; not because we had increased latency in our systems but simply because the share of mobile visits kept increasing. By now, over 75% of our traffic comes from mobile devices (nearly equally split between app and web). And customer expectations are rising, especially on mobile!

We took this wake-up call as an opportunity to explore the impact of site speed in more detail. Yes, at Zalando every millisecond of latency counts, but what’s the concrete impact of another 100 msec improvement? We analyzed the correlations of loading time and revenue per session across every step of the user journey and for every device. The pattern was very clear and consistent (even if somewhat different in size). Shorter loading times go hand in hand with higher revenue per session. An A/B test brought the final confirmation: 100 msec loading time improvement led to a 0.7% uplift in revenue per session.

At Zalando, we live our values by setting bold expectations and making them highly visible for everyone in the company. Our ambition was a 20% loading time improvement in the first half of 2018. We’re excited that our efforts paid off and we reached a 25% improvement on our overall loading time within five months. We’re obviously thrilled that this is noted in Google’s “Mobile Speed Leaderboards” study, which rated us as the fastest mobile site in fashion retail. We’d like to share how we achieved this.

Given Zalando’s size, with hundreds of engineering teams and a breakneck pace of development, some teams are entirely self-sufficient when it comes to managing their performance, while others embark on a crash program to eliminate bottlenecks. That’s where Mission Control comes in: targeted engagements with engineering teams and Zalando’s Site Reliability Engineering (SRE) program. Our site reliability engineers roll up their sleeves and apply their specialized experience to achieve immediate results, while providing the tools for self-management after the engagement ends.

Over the last few months, a special focus has been on the optimization of the render time and time to interact with our website. On almost every step of the user journey, the engineers reduced the time to interaction by decreasing the amount of code that has to be executed. This sounds obvious, but it is not always easy to implement due to the chosen technology.

We identified an older React version as one of the reasons for a slow loading time. So our platform team updated the React version that we use from 15.6.1 to 16.2.0. This update was solely responsible for improving the JavaScript execution time by over 100 milliseconds.

Our engineers from the Search and Browse team started the optimization by profiling their front-end components with component-level profiling, which was introduced in React 15.4.0 and is turned on by default in React 16. It shows the rendering time (mount and update) of each component, and warns about possible performance bottlenecks like updates triggered in lifecycle methods. This was a killer feature for us. Even though it is only available in the development build, the proportions of the rendering times resemble those of the production build.

Combined with Chrome’s Performance Tab, it helped us to identify the bottlenecks.

When we looked into profiling results, it was clear that reflows were the biggest bottlenecks. The purple boxes are reflows in JavaScript execution on production.

On mobile and tablet, react-lazyload for product images was triggering two reflows. The Catalog page renders eight products on the server side and 76 products on the client side. The second reflow took a very long time because it calculates the layout of a big area on the screen for the newly rendered 76 products. We removed the lazyload and implemented Low Quality Image Placeholders (LQIP) instead to avoid the reflow altogether.

Before (Mobile):

After (Mobile):

On desktop and tablet, react-virtualized for a product filter dropdown was triggering reflow. The product filter component does not show anything until it is clicked, but it was rendered to provide links for crawlers. We stopped rendering the hidden product filter component and removed the reflow. For crawlers, we prepared links generated with string concatenation outside of React components.
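The idea of producing crawler links with plain string concatenation, outside of any React component, can be sketched roughly as follows. This is only an illustration: the `FilterOption` shape, the `renderCrawlerLinks` helper, and the example URLs are assumptions made for this sketch and are not taken from the actual Catalog code.

```typescript
// Hypothetical shape of one filter option; the field names are assumptions.
interface FilterOption {
  label: string;
  urlPath: string;
}

// Escape characters that are unsafe inside HTML text or attribute values.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the crawler-facing links as one plain string, so no React elements
// are created (and no layout work is triggered) for the hidden dropdown.
function renderCrawlerLinks(options: FilterOption[]): string {
  let html = "";
  for (const option of options) {
    html +=
      '<a href="' + escapeHtml(option.urlPath) + '">' +
      escapeHtml(option.label) +
      "</a>";
  }
  return html;
}

const crawlerLinks = renderCrawlerLinks([
  { label: "Black", urlPath: "/shoes/?color=black" },
  { label: "Size 42 & up", urlPath: "/shoes/?size=42" },
]);
```

The resulting string can be embedded into the server-rendered HTML once, while the interactive filter component is only rendered when the user actually opens it.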

Before (Desktop):

After (Desktop):

As a result, we managed to reduce JavaScript execution time of Catalog by about 200 milliseconds on desktop and about 300 milliseconds on mobile devices at the 90th percentile.

Desktop:

Mobile:

Another optimisation we made was reducing the bundle size. Only sending code that is necessary helped to optimize the performance significantly. In the end, each byte counts, as JavaScript is expensive for the browser to process. Also, surprisingly many visitors don’t have cache (needs data), so it’s important to keep JavaScript bundles as small as possible. To identify where we have to look and where potentially the best results can be achieved, we used the webpack-bundle-analyzer.

We identified libraries that are large in size but not essential for us, and we used tree shaking to eliminate dead code. Unfortunately, some CommonJS libraries did not work well with tree shaking. In these cases, we removed the packages and chose a smaller alternative or wrote our own. Also, we found out that some internal libraries were bundling their dependencies into their bundles with webpack. This caused our bundle to have the same code multiple times because NPM’s deduping mechanism couldn’t find the duplication.

By applying this approach we reduced the overall size of our Header Fragment by 25% (36.6 KB -> 27.4 KB gzipped):

Header Fragment (before):

Header Fragment (after):

Because each byte counts, we also reduced the page size in total (number of DOM elements, JSON data size, e.g. props).

React client-side hydration needs the props that are used for server-side rendering. The props are typically embedded into HTML as JSON. In the JSON, we had some unnecessary properties in large arrays of objects that were passed through from backend APIs. Removing those unused properties reduced the page size by up to 17 KB gzipped.
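A minimal sketch of this kind of trimming is shown below. It assumes a hypothetical product payload: the `ApiProduct` fields, the `toProductProps` mapper, and the `serializeProps` helper are all made up for illustration and do not reflect Zalando's real API.

```typescript
// Fields the components actually render (assumed for this sketch).
interface ProductProps {
  sku: string;
  name: string;
  price: string;
}

// Hypothetical backend item that carries extra fields the UI never reads.
interface ApiProduct extends ProductProps {
  warehouseCodes: string[];
  internalFlags: Record<string, boolean>;
}

// Copy only the rendered fields, so pass-through data is not serialized.
function toProductProps(product: ApiProduct): ProductProps {
  return { sku: product.sku, name: product.name, price: product.price };
}

// Serialize the hydration props that get embedded into the HTML as JSON.
function serializeProps(products: ApiProduct[]): string {
  return JSON.stringify({ products: products.map(toProductProps) });
}

const apiProducts: ApiProduct[] = [
  {
    sku: "A1",
    name: "Sneaker",
    price: "59.95",
    warehouseCodes: ["BER-1", "ERF-2"],
    internalFlags: { reviewed: true },
  },
];

const trimmedJson = serializeProps(apiProducts);
const untrimmedJson = JSON.stringify({ products: apiProducts });
```

Mapping explicitly to a props type, rather than deleting known-bad keys, means that fields added to the backend response later are not embedded by accident.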

As the Zalando website uses SVG for icons, part of reducing the page size was also the SVG optimization. The SVG Optimizer (SVGO) is a great tool for optimizing SVG images. We have already been using the tool for a while, but recently we noticed that we had forgotten to do decimal precision optimization. It specifies the precision of floating point coordinates. SVG images generated from graphic software usually have too precise numbers to render pixels. After the optimization we reduced the SVG size by about 50%.
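To make the effect of decimal precision concrete, here is a small sketch that rounds the coordinates in SVG path data. It is not SVGO's implementation; `reducePrecision` is a hypothetical helper, and it assumes that numbers are separated by spaces, commas, or command letters (it does not handle compact notation such as `1.5.5` or exponents).

```typescript
// Round every decimal number in a path string to `precision` places.
// Assumption: numbers are separated by whitespace, commas, or command
// letters; compact notation and exponents are out of scope for this sketch.
function reducePrecision(pathData: string, precision: number): string {
  return pathData.replace(/-?\d*\.\d+/g, (match) => {
    // toFixed rounds, Number() then drops trailing zeros like "30.50".
    const rounded = Number(parseFloat(match).toFixed(precision));
    return String(rounded);
  });
}

const originalPath = "M10.123456 20.987654L30.5 40";
const optimizedPath = reducePrecision(originalPath, 2);
```

In practice the rounding is done by the optimizer itself; the sketch only shows why fewer decimals translate directly into fewer bytes.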

The biggest learning we had from our optimization efforts is:

Remove as many of your dependencies as possible, keep the amount of code as small as possible, and your webpage will be fast (again). A small and fast webpage will make your customers happy and will result in more conversion.

Looking to the future, SRE is making a number of improvements to make it easier for Zalando’s hundreds of engineering teams to self-manage their performance. It starts with setting expectations by Service Level Objectives that are meaningful from the customer perspective. With expectations set, we measure our Service Level Indicators against those expectations and we dive deep to optimize bottlenecks; that’s where distributed tracing comes in. With expectations and deep instrumentation, we gain the ability to implement monthly error budgets to help engineering teams better achieve operational excellence. The journey continues...

State Management in React

2018-05-31T00:00:00+02:00

Comparing Redux, MobX & setState in React

by Kaiser Anwar Shad and revised by Eugen Kiss

Introduction

React is a declarative, efficient, and flexible JavaScript library for building user interfaces. Compared to other frontend libraries and frameworks, React’s core concept is simple but powerful: ‘React makes it painless to design simple views and renders by using virtual DOM’. However, I don’t want to go into detail about virtual DOM here. Rather, I want to show three ways you can manage state in React. This post requires a basic understanding of the following state management approaches. If you don’t have it yet, check out the docs first.

setState: React itself ships with built-in state management in the form of a component’s `setState` method, which will queue a render operation. For more infos => reactjs.org
MobX: This is a simple and scalable library that transparently applies functional reactive programming (TFRP). Its philosophy is: ‘Anything that can be derived from the application state, should be derived. Automatically.’ For more infos => mobx.js.org
Redux: Maybe the most popular state management solution for React. The core concepts are having a single source of truth, immutable state and that state transitions are initiated by dispatching actions and applied with pure functions (reducers). For more infos => redux.js.org

Location

setState is used locally in the component itself. If multiple children need to access a parent’s local state, the data can either be passed from the state down as props or, with less piping, using React 16’s new Context API.
MobX can be located in the component itself (local) or in a store (global). So depending on the use case the best approach can be used.
Redux provides the state globally. This means the state of the whole application is stored in an object tree within a single store.

Synchronicity

setState is asynchronous.*
MobX is synchronous.
Redux is synchronous.

*Why asynchronous? Because delaying reconciliation in order to batch updates can be beneficial. However, it can also cause problems when, e.g., the new state doesn’t differ from the previous one. It makes it generally harder to debug issues. For more details, check out the pros and cons.

Subscription

setState is implicit, because it directly affects the state of the component. Changing the state of child components can be done via passing props (or Context API in React 16).
MobX is implicit, because it is similar to setState with direct mutation. Also component re-renders are derived via run-time usage of observables. To achieve more explicitness/observability, actions can (and generally should) be used to change state.
Redux is explicit, because a state represents a snapshot of the whole application state at a point in time. It is easy to inspect as it is a plain old object. State transformations are explicitly labeled/performed with actions.

Mutability

setState is mutable because the state can be changed by it.
MobX is mutable, because actions can change the state of the component.
Redux is immutable, because state can’t be changed. Changes are made with pure functions which are transforming the state tree.

* With mutability the state can be changed directly, so the new state overrides the previous one. Immutability is protecting the state from changes and (in Redux) instead of directly changing the state it dispatches actions to transform the state tree into a new version.
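The immutable style can be sketched without any library. The snippet below is a hand-written stand-in, not Redux itself: `createTinyStore`, `cartReducer`, and the cart example are assumptions made for illustration. It shows a pure reducer that returns a new state object for each dispatched action instead of changing the previous one.

```typescript
interface CartState {
  readonly items: readonly string[];
}

type CartAction = { type: "ADD_ITEM"; item: string } | { type: "CLEAR" };

// Pure function: builds a new state object and never mutates the old one.
function cartReducer(state: CartState, action: CartAction): CartState {
  switch (action.type) {
    case "ADD_ITEM":
      return { ...state, items: [...state.items, action.item] };
    case "CLEAR":
      return { ...state, items: [] };
    default:
      return state;
  }
}

// Minimal store for illustration only (hypothetical, not the Redux API).
function createTinyStore<S, A>(reducer: (state: S, action: A) => S, initialState: S) {
  let state = initialState;
  return {
    getState: (): S => state,
    dispatch: (action: A): void => {
      state = reducer(state, action);
    },
  };
}

const initialCart: CartState = { items: [] };
const store = createTinyStore(cartReducer, initialCart);

const previousState = store.getState();
store.dispatch({ type: "ADD_ITEM", item: "sneakers" });
const nextState = store.getState();
```

Because `previousState` is left untouched, comparing two states by reference is enough to detect a change, which is what makes snapshots of the state tree easy to inspect.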

Data structure

setState -
MobX Graph: multidirectional ways; loops can be used. The state stays denormalized and nested.
Redux Tree: a special kind of graph, which has only one direction: from parent to child. The state is normalized like in a database. The entities only reference each other by identifiers or keys.
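The difference between nested and normalized state is easier to see with data. The sketch below uses a made-up posts/authors example; the type names and the `normalize` helper are assumptions for illustration and do not come from either library.

```typescript
interface Author {
  id: string;
  name: string;
}

// Nested (denormalized) shape: every post embeds its full author object.
interface NestedPost {
  id: string;
  title: string;
  author: Author;
}

// Normalized shape: entities live in lookup tables and refer to each other by id.
interface NormalizedState {
  authors: Record<string, Author>;
  posts: Record<string, { id: string; title: string; authorId: string }>;
}

function normalize(posts: NestedPost[]): NormalizedState {
  const state: NormalizedState = { authors: {}, posts: {} };
  for (const post of posts) {
    // Each author is stored once, no matter how many posts reference it.
    state.authors[post.author.id] = post.author;
    state.posts[post.id] = { id: post.id, title: post.title, authorId: post.author.id };
  }
  return state;
}

const author: Author = { id: "a1", name: "Ada" };
const normalized = normalize([
  { id: "p1", title: "First post", author },
  { id: "p2", title: "Second post", author },
]);
```

In the normalized form an author's name is updated in one place, whereas the nested form would need every embedding post to be touched.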

Observing Changes

setState -
MobX: Reactions are not producing new values, instead they produce side effects and can change the state.
Redux: An object describes what happened (which action was emitted).

Conclusion

Before starting to write your application you should think about which problem you want to solve. Do you really need an extra library for state management or is React’s built-in setState fulfilling your needs? Depending on the complexity you should extend it. If you love to go for the mutable way and expect the bindings automatically, then MobX can fit your needs. If you want to have a single source of truth (storing state in an object tree within a single store) and keep states immutable, then Redux can be the more suitable solution.

Hopefully this post gave you a brief overview of the different ways to manage state in React. Before you start with one of those libraries, I recommend going through the docs of each. There are a lot more treasures to discover!

TL;DR:

This post is inspired by:

State Management With MobX & MobX-state-tree by Michel Weststrate - Workshop in React Amsterdam 2018
Comparing Redux and MobX with two CTO's and React experts - state management using reactjs

Scaling Agile at Zalando

2018-05-17T00:00:00+02:00

Sharing successful large scale agile experiences

Zalando has been known for radical approaches to agility since 2015. In order to keep growing and stay successful, we took the next step in 2017, forming around 30 business units. Each business unit is formed around one particular business problem, topic or product with end-to-end responsibility. All disciplines needed are inside this business unit, from commercial roles to tech teams.

Challenges in large scale product groups

Looking at this setup, we experience challenges. You’re probably familiar with this if you work in a similar setup or if your company is around the size of one of our business units (<100 people).

Who takes product decisions at this size with several teams and product people?
How to keep the focus on the actual product with so many technical components and intermediate steps?
How to enable 50 engineers to understand their everyday task contribution to the overall quarterly goals?
How to do sprint planning with 20 people?
How to handle cross-cutting concerns like 24/7 and platform work in a feature team setup?

By far the biggest question was however: How can this work inside Zalando?

Our Solution Approach

How to support these 30+ business units to reach their business goals through agile working? Rome was not built in a day. We knew we had to work by process and collaboration.

We used the power of our network and collected successful solutions from all units. The first and most important observation was that no solution can be mechanically copied, but always has to be adapted to the specific needs of the unit (“There are no best practices, only practices that work in a specific context”). To enable this adaptation and learning, in addition to the bare facts we collected:

the story and motivation around the solutions
the details of how they are adopted
the (contact details of the) people who created them

For the first factor, we invited people from these teams for teachback sessions open for everyone to share their experiences in a try/avoid format.

Secondly, from these we created a 20 page guide on how to structure large teams with background details. Finally, we connected people we talked to who have similar challenges to the pioneers, because all advice needs to be adapted to the specific BU needs.

Concrete Examples

For example, the Fashion Store Apps group (5 teams) struggled with their narrow product definition: Each platform and the API were treated as separate products, with separate teams, backlogs, product owners, etc. These needed to be managed, synchronized, and aligned, and code needed to be integrated. As you can imagine, somewhere along the way the focus on the customer gets hard to find. To address this, the team redefined the product as “Fashion Store Apps,” reorganized the teams to reflect this, and merged all backlogs into one.

Another example is how Personalization (6 teams) increased the understanding of the goals and unlocked possibilities. As is typical in a large organization, goals and concrete products were difficult to define for this department and usually the understanding did not carry over to the engineering and data science teams. To tackle this, everyone (including engineers) took responsibility for creating or refining the press releases that underlie the epics for the upcoming quarter. Ideas to achieve goals are as likely to come from Product as they are to come from delivery teams. The concrete outcome is an aligned and commonly understood overview of the next quarter’s sprints. This led to much higher involvement and identification during the quarter, and to more motivated teams.

A LeSS introduction backwards

These are only two examples from many more instances of how we scale agile at Zalando. The whole approach is somehow a LeSS introduction backwards. We make note of what trials work, and we find a high similarity to the LeSS framework without ever using the word or the whole framework. The practices emerged themselves as they made sense to the people inside the organization. As one engineering lead put it after reading a LeSS book, “It’s nice to see that we were not the only ones with these ideas.”

Our key learning directed to all fellow Agile Coaches and Agile Change Agents is to not implement frameworks, but to source from working solutions and share the successes.

Eventually we will end up in a form of LeSS organization without anybody inside Zalando connecting emotionally to the framework itself.

Many thanks for the input and support of our colleagues Samir Keck, Tobias Leonhard and Frank Ewert.

Dublin’s Data Science Guild

2018-05-15T00:00:00+02:00

How to establish and evolve your data science community

At Zalando, we have many guilds: self-organized groups of people who share interests. The topics, scope, size, and ways to organize the guilds vary. We have technical guilds like the web or API guilds, local and artistic guilds like the knitting guild in Helsinki, and some guilds that support the growth of people in certain job families, like the Data Science Guild.

For more than two years, I have been co-organizing the Data Science Guild in our tech hub in Dublin, creating a place to share data science knowledge and best practices, and creating a framework that allows the guild to evolve and grow autonomously.

I like to think of guilds like teams or products. This philosophy helps you build the kind of framework you need to run or be part of a guild. It gives the guild a reason to exist, and it can potentially tell you when it is time to pivot or move on to the next thing. When we started the data science guild, it was not perfect (and it isn’t perfect now), but as we were “releasing” a new product, it was important to see if the product was viable: Do people find our talks valuable? Can we guarantee to have content for 70% of the 52 weeks in a year?

Initially, we had really good feedback. Everyone was interested in giving talks, attendance was high, and as expected, a lot of people had ideas on how to make the guild better! A few months after the initial positivity, we started to see some challenges that needed to be addressed if the guild was going to survive. We needed a structure to maintain a constant flow of content (talks, discussions, etc.,) and we needed to scale the organization by creating a sense of collective ownership.

Once we saw there was both a need and value from having the data science guild after the initial ramp-up, we established our mission, “Sharing Data Science (DS) knowledge and experience to expand our DS expertise.” We devised and measured Objectives and Key Results (OKRs), and implemented the collective ownership model.

We focused on three types of content: Internal talks led by any of our data scientists, in which they present a topic in depth. It could be a new library they are using, how to bring a model into production, or describing the results for the latest A/B test the team ran. This type of format is very useful to do dry runs of conference talks. Secondly, we have a “learning club,” a smaller and less formal setting, in which we discuss a recent scientific paper, or watch and discuss a video lecture. Finally, we also invite researchers from universities to present their work. For now, we ask them to present their work to PhD students, who benefit from getting feedback from our data scientists in different teams, and seeing new opportunities for the application of their work in different contexts.

When we implemented the collective ownership model, we iterated a few times. The idea was to give our community the opportunity to shape how the guild works, and to avoid having bottlenecks or too few people shouldering too much of the work. At first, we had one person who had ownership for each of the topics; one in charge of the speaker lineup every week, one sending the invites, and one taking care of the budget. Worth saying: that didn’t work. It required a lot of alignment between individuals, adding unnecessary overhead, which no one enjoyed.

We settled on a much smaller model, where we have fewer contributors who are part of small committees for half year periods (aligned with how we set and evaluate our OKRs). The structure looks like this:

Our content team designs and maintains a content portfolio that reflects our OKRs. They plan the topics, invite speakers, book rooms and send agenda updates.
Our audiovisual (AV) committee is a group of volunteers who know how to operate our not-so-easy-to-use AV system for streaming and recording presentations. Lately, we also have support from IT for this topic, which eases some of the burden.
Our social committee is in charge of coordinating the communication with the Data Science guild in Berlin and running our social events (this involves selecting and buying a mountain of cakes and sweets).

When running the Data Science Guild, the most important aspect to consider is communication. Because we depend on our colleagues to give talks, lead discussions or invite speakers for our external event series, I found that people are much more willing to say “yes” to participating when the request comes in person. Face to face, our guild member can spend time explaining exactly what is required, answer any questions the potential speaker has, and set a date in the calendar for the talk. Of course, we then need to inform the attendees with enough time so they can plan to attend a talk; nothing is worse than spending time preparing a talk to which no one shows up! Finally, we also communicate the evaluation of our objectives and key results to our stakeholders and a wider internal audience in our monthly meeting; that helps form the image and reputation of the guild inside our office.

Being part of the Data Science guild has been a wonderful experience. There have been plenty of internal and external successes, and more importantly, we created a place where we Data Scientists come together beyond our day to day teams. In the last two years, almost all of us have presented at least once (some much more), we have co-organized and presented our work in two company-wide data science conferences, invited half a dozen PhD students to give talks in Zalando, and we have built a reputation for openly sharing knowledge and best practices in the Data Science community in Dublin, resulting in our members being shortlisted for two DatSci Awards in 2017. I believe Zalando is a “good neighbor”; a company where everyone can make a positive impact in their community, whether that is with their team, a guild, or the whole company.

How to Make Product Management for Enterprise Systems Work

2018-05-09T00:00:00+02:00

Moving from a more traditional internal IT setup to a product-driven culture

I love building enterprise systems, because you get to work with your customers/users every day and literally see their lives change as you release new features. In my case, at Zalando, these are systems for fashion buying, supply chain management, inventory management and procure-to-pay processes (e.g. paying our suppliers for merchandise we bought from them). But building good enterprise systems is prone to failure, as author and product consultant Sam McAfee has recently pointed out.

While most product management articles and blogs talk about product management for consumer-facing or enterprise SaaS products, I’ve found that many of the methods and insights discussed in these can also be applied to developing enterprise software to be used by your company’s internal users.

This article is an attempt to share some of the lessons we’ve learned at Zalando. We’re Europe’s leading online fashion platform and in the past few years we’ve moved from a more traditional “internal IT setup” to a product-driven culture that develops industry-leading bespoke enterprise solutions.

Allow your enterprise systems teams to own a problem

As with any product decision, the first step to finding the best solution is to own and understand the problem. In a traditional internal IT setup, someone from the “business side” has an idea of what to improve. Frequently, this idea includes a detailed description of the solution they would like “IT to implement.”

This is where the problems usually start. Many of these ideas create only limited value, and together they do not form a coherent product vision. And this is not a surprise; these colleagues know their business but they are not trained product managers. Unfortunately, the “product manager” (often called “business analyst” in such a setup) does not have the authority to push back on such requirements, but rather focuses on translating the business requirements into a detailed design that software engineers can then implement.

At Zalando, we changed this. We created multi-disciplinary teams that own specific business KPIs, together with their internal users. For example, our Competence Centre Inventory Management co-owns merchandise availability (an important KPI for every retailer) with the merchandise planners. The team consists of software engineers, product managers, product designers, controllers and analysts, who work very closely with our 250 merchandise planners (their internal users) and our 1,500-plus suppliers to identify the biggest levers to improve merchandise availability (1) (and related KPIs). Some of these levers may require system improvements, some may not. They are free to decide what to work on next. They are also free to decide what systems to build in-house and where to leverage external solutions. They just need to ensure that the KPIs they co-own improve; and their progress is reviewed each quarter.

Bridge the Gap Between “Tech” and “Business”

The above setup made it even more important for us to break down functional silos, more specifically between commercial teams (i.e. “the business”) and tech teams (engineers, product managers). This was a big challenge for us. When we started, our commercial teams had never heard of agile or MVPs – and often just saw “MVP” as an excuse to cut down on the features they wanted, or “agile” as an excuse for not being able to commit to release dates. Similarly, our tech teams did not understand the main business decisions and the need for planning ahead. For example, when your business grows by 20-25% a year, knowing whether a key problem, one that creates a lot of manual effort, will be solved in six months or not, has significant implications on how many people you need to start hiring today. Similarly, when you plan a new business line launch (such as Beauty as a new category on Zalando), and you need to prepare a big marketing campaign and buy the relevant merchandise, all to go live at the start of a new season in nine months from now, you need some form of a commitment that the required system functionality will be available by then. We found the tech team would reject deadlines, as in their view, they contradicted an agile approach. It literally required us to bring two different worlds together.

How did we go about this? From an organisational standpoint, we fully merged the “business” with the “IT” teams (we used the terminology “tech is everywhere”). We now no longer have one big monolithic IT organization, but have formed smaller units, where business and tech teams report into the same senior leaders. At the same time, we invested a lot into strengthening the overall tech community through the creation of topic-specific guilds or interest groups (2), and other means, such as cross-team tech talks. We also invested in educating our commercial leaders in basic tech concepts, such as agile development, MVPs and the importance of not building up technical debt.

At a more operational level, we aim for all colleagues in our Enterprise Systems Teams (engineers, product designers, analysts, product managers) to take internships with their users regularly. This greatly helps to build a network and increases understanding of the business problems. We’ve built very active key user communities around all our systems, and use these communities to help us regularly align our feature backlog and priorities, usually monthly. We also ensure that users are always present in sprint reviews. In addition, we regularly organise informal exchanges across disciplines through random lunches and other events.

Recruit and Train the Right Product Managers

You may think this is all obvious, and you are right. However, recruiting and training the right kind of product manager for this kind of internal role is not straightforward, and I would argue even more challenging than recruiting and training product managers for other product management roles.

First, they need to be interested in redesigning and improving internal business processes and KPIs, a skill set and interest not every product manager brings. Second, they need to be great at managing change. When you build consumer-facing applications, the consumer either gets it or they will leave. With internal systems (and a captive audience), you can achieve more impact when you combine a system change with a process change, i.e. a change of “how we do things” (but this necessarily requires seeking and achieving buy-in from many stakeholders). This means that product managers for in-house enterprise systems will spend a lot of time with users, taking them along on the journey, running system training, and getting their users’ buy-in for the new solution.

Finally, our product managers need to make make-or-buy decisions and switch between developing a system in-house as part of an agile team and implementing an external solution alongside a systems integrator. Traditionally, these two are different roles with different skill sets. At Zalando, we want our product managers to own the problem and solve it in the best way possible, and they can only do this if they know how to use both internal and external solutions to solve that problem.

Of course, each of our product managers also needs to have full command of the main product management methodologies to discover, define, design, and deliver the right solutions for our users and problems. To do this, we regularly work with tools, such as press releases, pre-mortems, user story maps, design sprints (which we modified slightly), MVPs, and different ways to prioritize problems and solutions.

So where do we find these unique product managers? Well, there (still) seems to be only a small number of product managers out there who bring the full skill set we are looking for. Therefore, we either recruit people from a business team and teach them product management, or recruit (consumer-facing) product managers and enable them to learn the additional skills required. Luckily, by now we have quite a large product management community where colleagues can exchange best practices and help each other. If you are starting from scratch, I suggest you do what we did a few years ago: get a handful of smart people with different profiles and have them learn from each other, supported by a more formal product management curriculum.

Of course, I could go into much more detail, but I’ll leave it here for now. I would love to hear any comments from the in-house systems product community out there, so please share your thoughts with us.

(1) There are various ways to measure this KPI, but it roughly measures the percentage of customers who find that the product they are interested in is in stock at the retailer.

(2) In guilds, colleagues who share the same technical interests can come together and exchange knowledge, e.g. we have an API guild, a Scala guild, etc.

*This piece was first published on MTP.

Many-to-Many Relationships Using Kafka

2018-05-08T00:00:00+02:00

Real-time joins in event-driven microservices

As discussed in my previous blog post, Kafka is one of the key components of our event-driven microservice architecture in Zalando’s Smart Product Platform. We use it for sequencing events and building an aggregated view of data hierarchies. This post expands on what I previously wrote about the one-to-many data model and introduces more complex many-to-many relationships.

To recap: to ensure the ordering of all the related entities in our hierarchical data model (e.g. Media for Product and the Product itself) we always use the same partition key for all of them, so they end up sequenced in a single partition. This works well for a one-to-many relationship: Since there’s always a single “parent” for all the entities, we can always “go up” the hierarchy and eventually reach the topmost entity (“root” Product), whose ID we use to derive the correct partition key. For many-to-many relationships, however, it’s not so straightforward.
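To make the recap concrete, here is a minimal Java sketch of deriving a partition from the root entity. The class and method names (RootPartitioner, link, rootOf, partitionFor) are made up for illustration, the hierarchy is assumed to be acyclic, and the hash-modulo step is only a stand-in for whatever partitioner the producer is actually configured with.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: picks a partition based on the root entity's ID. */
final class RootPartitioner {
    // child ID -> parent ID; root entities have no entry in this map
    private final Map<String, String> parentOf = new HashMap<>();

    /** Registers a one-to-many link, e.g. a Media entity belonging to a Product. */
    void link(String childId, String parentId) {
        parentOf.put(childId, parentId);
    }

    /** Walks "up" the hierarchy until an entity without a parent is reached. */
    String rootOf(String entityId) {
        String current = entityId;
        while (parentOf.containsKey(current)) {
            current = parentOf.get(current);
        }
        return current;
    }

    /** Same root ID gives the same key, and therefore the same partition. */
    int partitionFor(String entityId, int numPartitions) {
        // Assumption: a plain hash modulo stands in for the real partitioner.
        return Math.floorMod(rootOf(entityId).hashCode(), numPartitions);
    }
}
```

Because every entity resolves to the ID of its root, a Media event and the event for its “root” Product end up in the same partition and keep their relative order. This is exactly the step that has no equivalent once an entity can have many “parents”.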

Let’s consider a simpler data model that only defines two entities: Products (e.g. shoes, T-shirts) and Attributes (e.g. color, sole type, neck type, washing instructions, etc., with some extra information like translations). Products are the “core” entities we want to publish to external, downstream consumers, and Attributes are metadata used to describe them. Products can have multiple Attributes assigned to them by ID, and a single Attribute may be shared by many Products. There’s no link to a Product in an Attribute.

Given the event stream containing Product and Attribute events, the goal is to create an “aggregation” application that consumes both event types, “resolves” the Attribute IDs in Product entities into the full Attribute information required by the clients, and sends these aggregated entities further down the stream. This assumes that Attributes are only available in the event stream, and calling the Attribute service API to expand IDs to full entities is not feasible for some reason (access control, performance, scalability, etc.).

Because Attributes are “meta data”, they don’t form a hierarchy with the Product entity; they don’t “belong” to them, they’re merely “associated” with them. It means that it’s impossible to define their “parent” or “root” entity and, therefore, there’s also no single partition key they could use to be “co-located” with the corresponding Products in a single partition. They must be in many (potentially: all) of them.

This is where the Kafka API comes in handy! While Kafka is probably best known for its key-based partitioning capabilities (see: ProducerRecord(String topic, K key, V value) in Kafka’s Java API), it’s also possible to publish messages directly to a specific partition using the alternative, probably lesser-known ProducerRecord(String topic, Integer partition, K key, V value). This, on its own, allows us to broadcast an Attribute event to all the partitions in a given topic, but if we don’t want to hardcode the number of partitions in a topic, we need one more thing: the producer’s ability to provide the list of partitions for a given topic using the partitionsFor method.

The complete Scala code snippet for broadcasting events could now look like this:

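import org.apache.kafka.clients.producer.ProducerRecord
import scala.collection.JavaConverters._
import scala.concurrent.Future

// Requires an implicit ExecutionContext in scope.
// One record per partition of the topic, so every partition receives the event.
Future.traverse(producer.partitionsFor(topic).asScala) { pInfo =>
  val record = new ProducerRecord[String, String](topic, pInfo.partition, partitionKey, event)

  // send the record (this step has to return a Scala Future)
}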

I intentionally didn’t include the code to send the record, because Kafka’s Java client returns a Java Future, so converting this response to a Scala Future would require some extra code (i.e. using Promise), which could clutter this example. If you’re curious about how this could be done without the awful, blocking Future { javaFuture.get } or similar (please, don’t do this!), you can have a look at the code here.
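For readers on the Java side, the same non-blocking idea can be sketched with a CompletableFuture playing the role of the Scala Promise. This is not the code linked above, only an illustration. Kafka’s Java producer does accept a completion callback when sending, but to keep the sketch free of external dependencies the two interfaces below (SendCallback, CallbackSender) are hypothetical stand-ins for it, and NonBlockingSend.sendAsync is an invented helper name.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for a completion callback: metadata on success, exception on failure. */
interface SendCallback<M> {
    void onCompletion(M metadata, Exception exception);
}

/** Hypothetical stand-in for a producer whose send method accepts a callback. */
interface CallbackSender<R, M> {
    void send(R rec, SendCallback<M> callback);
}

final class NonBlockingSend {
    private NonBlockingSend() {
    }

    /** Completes the returned future from inside the callback instead of blocking on get(). */
    static <R, M> CompletableFuture<M> sendAsync(CallbackSender<R, M> sender, R rec) {
        CompletableFuture<M> result = new CompletableFuture<>();
        sender.send(rec, (metadata, exception) -> {
            if (exception != null) {
                result.completeExceptionally(exception); // propagate the failure
            } else {
                result.complete(metadata);               // propagate the result
            }
        });
        return result;
    }
}
```

No thread waits for the result here: the future is completed by whichever thread invokes the callback, which is the property the blocking variant gives up.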

This way we made the same Attribute available in all the partitions, for all the “aggregating” Kafka consumers in our application. Of course it carries consequences and there’s a bit more work required to complete our goal.

Because the relationship information is stored in Product only, we need to persist all the received Attributes somewhere, so when a new Product arrives, we can immediately expand the Attributes it uses (let’s call it “Attribute Local View”, to emphasise it’s a local copy of Attribute data, not a source of truth). Here is the tricky part: Because we’re now using multiple, parallel streams of Attribute data (partitions), we need an Attribute Local View per partition! The problem we’re trying to avoid here, which would occur in case of a single Attribute Local View, is overwriting the newer Attribute data coming from “fast” partition X, by older data coming from a “slow” partition Y. By storing Attributes per partition, each Kafka partition’s consumer will have access to its own, correct version of Attribute at any given time.
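A minimal in-memory Java sketch of such a per-partition view could look as follows. It is an illustration under assumptions, not our storage layer: the class name AttributeLocalView, its methods and the use of plain strings for payloads are all invented for this example, and a real implementation would persist the data.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: one local copy of each Attribute per Kafka partition. */
final class AttributeLocalView {
    // partition ID -> (Attribute ID -> last payload seen on that partition)
    private final Map<Integer, Map<String, String>> byPartition = new HashMap<>();

    /** Stores an Attribute event under the partition it arrived on. */
    void onAttributeEvent(int partition, String attributeId, String payload) {
        byPartition.computeIfAbsent(partition, p -> new HashMap<>()).put(attributeId, payload);
    }

    /** Returns the version seen on this partition, or null if none has arrived yet. */
    String lookup(int partition, String attributeId) {
        Map<String, String> view = byPartition.get(partition);
        return view == null ? null : view.get(attributeId);
    }
}
```

If “fast” partition 0 has already received version 2 of an Attribute while “slow” partition 1 is still delivering version 1, the late event only touches the view of partition 1, so the consumer of partition 0 keeps seeing version 2.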

While storing Attributes per partition might be as simple as adding the Kafka partition ID to the primary key in the table, it may cause two potential problems. First of all, storing multiple copies of the same data obviously means that the storage space requirements for the system rise significantly. While this might not be a problem (in our case Attributes are really tiny compared to the “core” entities), it is definitely something that has to be taken into account during capacity planning. In general, this technique is primarily useful for problems where the broadcast data set is small.

Secondly, by associating the specific versions of Attributes with partition IDs, the already difficult task of increasing the number of partitions becomes even more challenging, as Kafka’s internal topic structure has now “leaked” to the database. However, we think that growing the number of partitions is already a big pain (it breaks the ordering guarantees at the point where partitions are added!) that requires careful preparation and additional work (e.g. migrating to a new topic with more partitions, rather than adding partitions “in place” to the existing one), so it’s a tradeoff we accepted. Also, to reduce the risk of extra work, we try to carefully estimate the number of partitions required for our topics and tend to overprovision a bit.

If what I just described sounds familiar to you, you might have been using this technique without even knowing what it is called: it’s a broadcast join. It belongs to a wider category of so-called map-side joins, and you can find different implementations of it in libraries like Spark or Kafka Streams. However, what makes this implementation significantly different is the fact that it reacts to data changes in real time. Events are broadcast as they arrive, and local views are updated accordingly. The updates to aggregations on product changes are instant as well.

Also, while this post assumes that only a Product update may trigger entity aggregation, the real implementation we’re using does it on Attribute updates as well. While, in principle, it’s not a difficult thing to do (a mapping of Attribute-to-Product has to be maintained, as well as the local view of the last seen version of a Product), it requires significantly more storage space and carries some very interesting performance implications, as a single Attribute update may trigger an update for millions of Products. For that reason I decided to keep this topic out of the scope of this post.

As you just saw, you can handle many-to-many relationships in an event-driven architecture in a clean way using Kafka. You’ll benefit from not risking having outdated information and not resorting to direct service calls, which might be undesirable or even impossible in many cases. As usual, it comes at a price, but if you weigh the pros and cons carefully upfront, you might be able to make a well-educated decision to your benefit.

Investing in the Future of Engineering and Design

2018-05-03T00:00:00+02:00

Our cooperation with CODE University

At Zalando, we strive to create an environment in which all our engineers, product, and design specialists feel they can inspire each other, make their ideas a reality, and contribute to providing the best possible platform for Zalando’s customers to have the ultimate customer experience.

Part of this is making sure we understand what the future generation of product managers, interaction designers, and software engineers are thinking and what ideas and innovations they can bring to the table.

Since fall 2017, Zalando has been an official partner of CODE University, a private, state-recognized university of applied sciences based in Berlin. It offers courses in various software fields in a hands-on environment embedded in Berlin’s network of digital and tech enterprises. Zalando decided to get on board with this exciting project in order to offer students insights into real-world applications of the techniques and principles they learn about in their studies.

Credit: Manuel Dolderer (CODE)

Bringing Zalando challenges into the classroom

The major part of Zalando’s cooperation with CODE is covered by the semester projects. Each semester, Zalando outlines projects for which it seeks input from software engineers, product managers, and interaction designers. These projects challenge the students to find successful solutions to realistic problems, and to facilitate their learning in a dynamic, realistic working environment.

For the winter semester of 2017/2018, Zalando offered two projects. One sought to make same-day delivery by bike courier more efficient by integrating a chatbot to enable real-time delivery updates, while the other looked into the potential impact an autonomous indoor drone could have on warehouse processes. Students who select the projects offered by Zalando work together in groups to address issues with the most efficient solutions, while receiving information and guidance from the Zalando employees who mentor these projects. At the end of the semester, the students’ projects are presented to an audience which includes university faculty, other students, and business partners such as Zalando, and form part of their final assessment. Two groups which worked on Zalando projects were awarded Best Pitch and Best Design for the winter semester 2017/2018.

According to the students, working on projects with real applications teaches them what it means to be innovative. Student Dominic von Zielinski says, “I don’t want people to have to think about my innovations while using them. Because that’s what innovative means to me.”

Getting involved in Zalando’s Hack Week

In addition to the longer-term projects, which are an essential part of studies at CODE University, Zalando invited CODE University students to its new-format Hack Week in March of this year. The new format involves cross-collaboration between company departments, such as Digital Foundation and People & Organization, to bring engineers and product managers together with Zalando employees from different disciplines and to see how they can find digital solutions to challenges within the company.

“What got me on board was the desire to get a better understanding of what’s possible in one week, and how things can come together so quickly,” says Dominic von Zielinski. “In the end, what kept me on board was seeing diverse people get together and do crazy and innovative stuff, all while being a part of the team.”

The awards ceremony awarded five different teams in different categories, acknowledging their hard work and providing them with valuable feedback from Zalando’s senior management. Additionally, two teams (including one with CODE University students) were given the opportunity to take part in the Zalando Innovation Lab’s Slingshot Program, in which winning team members are given the resources they need to dedicate two sprints towards bringing their project to the next level.

“During Zalando’s Hack Week, the main challenge I experienced was to transform complexity into simplicity,” says Edmund Maruhn, another CODE student who took part in Zalando’s Hack Week. “Once the week was up, I started to challenge myself and went the extra mile to improve what we had done, for two reasons: I fell in love with the problem we worked on from the first day, and I was motivated by the fact that we won an award!”

Investing in the future

By partnering with CODE University, Zalando not only wants to provide students with hands-on experience in finding solutions to digital problems, but also seeks to become more involved in Berlin’s wider tech ecosystem, connecting with other tech companies on this level to inspire students to be innovative, to always question the status quo, and to continue to nurture the pool of tech talent which has made Berlin the tech hub it is today.

Our Dublin Tech Hub Turns Three

2018-05-02T00:00:00+02:00

Celebrating Zalando’s first international tech hub

Three years ago, Zalando decided to start looking beyond Germany’s borders to tap into Europe’s pool of tech talent. Diverse and brilliant minds from other European cities and beyond contributed to cementing Zalando’s place as Europe’s most fashionable tech company. So, back in 2015, Zalando’s first move was across the Irish Sea, and now the team is very excited to celebrate its third anniversary!

Every birthday deserves some cake.

The Dublin Fashion Insights Center was opened to tap into that market’s unique pool of talented data scientists. The team has grown significantly over the years, now numbering almost 110 dedicated members. The last year alone saw the team write 8.7 million lines of code, attend 17 conferences and hire 49 new colleagues from a staggering 2,500 applications.

One of our newest additions is Sean Mullaney, who joins us as Dublin’s first VP of Information. Sean has founded a number of startups, and formerly worked at Google. He brings a huge amount of drive and experience to the Dublin office, and is keen to help shape Zalando’s data strategy.

“My passion has always been on applied innovation, particularly how to combine Big Data, machine learning, and UX to create high impact products and services,” says Sean.

Our Dublin team celebrates in style.

As it grows, the Dublin office will retain its focus on data science and provide Zalando with the tools it needs to drive strategic growth through artificial intelligence and leveraging its large datasets, bringing together product managers, designers, data scientists and software engineers from all backgrounds in Dublin’s Docklands.

Short Story of a Long Migration

2018-04-26T00:00:00+02:00

How we migrated the Zalando Logistics Operating Services to Java 8

“Never touch working code!” goes the old saying. How often do you disregard this message and touch a big monolithic system? This article tells you why you should ignore common wisdom and, in fact, do it even more often.

Preface

Various kinds of migration are a natural part of software development. Do you remember the case when the current database didn’t scale enough? Or maybe there is a need for a new tech stack when the existing stack does not meet changing requirements? Or perhaps a hard migration from a monolithic application to a microservice architecture? There could also be smaller-scale migrations like upgrading to a newer version of a dependency, e.g. Spring, or the Java Runtime Environment (JRE). This is the story of how a relatively simple task, migrating from Java 7 to Java 8, was performed on a large-scale monolithic application that is critical to the business.

Zalos as the service for Logistics Operations

Zalos (Zalando Logistics System) is a set of Java services, backend and frontend, that contains submodules to operate most functions inside the warehouses operated by Zalando. The scale of Zalos can be summarized by the following statistics:

more than 80,000 git commits,
more than 70 active developers in 2017,
almost 500 maven submodules,
around 13,000 Java classes with 1.3m lines of code, plus numerous production and test resource files,
operates with around 600 PostgreSQL tables and more than 3,000 stored procedures.

Zalos 2, denoted as just Zalos below, is the second generation of the system, and has grown to this size over the past five years. Patterns that were, at the time, easy to adopt for scaling up architectural functionality, have quickly become a bottleneck with the growing number of teams maintaining it. It is deployed to all Zalando warehouses every second week, and every week there is a special procedure to create a new release branch. Each deployment takes about five hours, branching takes about the same time. When also considering urgent patches, it takes a significant portion of each team’s time to do regular deployment or maintenance operations.

Now, what happens if the system is left unmaintained for a while? The package dependencies and Java libraries become obsolete and, as a consequence, security risks grow. Then, one day, one of the core infrastructure systems has to change its SSL certificate, and this causes some downtime in all relevant legacy systems running a deprecated Java version. For the logistics services these problems might become a big disaster, and you start thinking: “What does it take to migrate Zalos from Java 7 to Java 8?”

Migration? Easy!

With some basic experience with Java 9, the option to go even further was rejected pretty fast: a combination of Java 9 modularity and 500 submodules doesn’t look very promising. Well, bad luck. What else do you need to keep in mind for Java 8 support? Spring? Sure. GWT? Maybe. Guava? Oh yes. Generics? This too.

This is a good time to talk about the tech stack for Zalos. It contains backend as well as frontend parts, both running Spring 3. The backend uses PostgreSQL databases via the awesome sprocwrapper library. Both backend and frontend rely on Zalando-internal parent packages to take care of dependency management. The frontend engine is GWT 2.4 with some SmartGWT widgets. And, to mention one more challenge, it uses Maven overlays with JavaScript, but more on this later.

Our first strategy was to bump as many package dependencies as we could: Spring 4, which fully supports Java 8; GWT 2.8.2, which already has support for Java 9; Guava 23.0; etc. But we use GWT 2.4, so this would have been a jump of over five years development-wise. The hard dependency on our internal Zalando dependencies ruled out the major Spring upgrade too. Guava 23 has deprecated some methods, and we would have needed to change quite a lot of code: again, a failure.

Let’s try another strategy then: bump as little as we can. This strategy worked much better. We only needed to have Spring 3.2.13 and Guava 20.0, plus required upgrades like javassist and org.reflections. The matrix of compatible versions is shown in the appendix. The GWT dependency was left untouched, although it limits our client code to Java 7. A compromise but not a blocker: there is little active development of new GWT code anyway.

Now, overlays, or in our case Dependency Hell. Overlays are a Maven feature for including dependencies from a WAR or a ZIP file: Maven “inlines” the complete package as is, with all its dependencies. This means that, should an overlay have a different version of spring-core, you get two versions of spring-core in the final WAR artifact. When the application starts, it gets confused about which version to use for which parts of the application, and various ClassNotFound exceptions pop up. Bad luck: republishing all WAR overlays with updated dependencies is required.

Go-live or don’t rush?

It took just two weeks of highly motivated and self-driven work for two people to crack the problem and run the 500-module monolith on a laptop with Java 8. It took two more weeks to deploy it to the staging environment after fixing multiple issues. After that, it took two more months to finally deploy it to the production environment. Why so long? Because we are dealing with a highly critical system that has several serious constraints, and here they are:

Deployments. Deployment to production lasts up to five hours and it should not interfere with any other deployment, due to internal limitations of the deployment system. With absolute priority for production deployment there isn’t much time for experimenting with the migration. Solution? Tweaking the deployment service helped reduce deployment time by about one third to have some freedom for experimenting on a staging environment.
Development. There are still about 25 commits per day in the main branch. Breaking it would have a significant impact on feature development, and it isn’t easy to experiment with JDK versions from the feature branch. This isn’t good, but still there is a more serious constraint.
Warehouse operations. They are the backbone of an e-commerce company and should not be interrupted by the migration. The risk of any bug should be carefully minimized to maintain the service liveness.

To solve at least two of these constraints, we created a concrete three-step plan to execute the migration in a safe manner and be able to roll back at any time:

Upgrades of all packages compatible with both Java 7 and 8 without changing the runtime version. This ensured that there were no changes for deployment.
Switch to Java 8 runtime (JRE) keeping source code in Java 7 mode. This step ensured that we can safely change the deployment settings without touching the code and dependencies.
Switch to Java 8 development mode to fully support Java 8 features. No major deployment changes were done with this step.

In addition to the staging environment, every step was carefully tested on a so-called beta environment, which operates on production data.

Outlook

The migration was completed despite some failed attempts a few years ago. Several things have happened. The service has become a little more stable and secure. The code can now be written with lambdas, method references, etc. Deployment service has been improved too. But most importantly, the legacy system got attention. Even though we had one camp of people who said, “We tried that before, why do you want to try again?” there was also the second camp with, “You are crazy but yeah, do it”. No matter what was tried before and in what manner, it is never too late to try again.

Keep your legacy code under careful supervision: add code quality metrics, minimize maintenance efforts, optimize release cycles. With this you will stop having “Legacy Nightmares” but rather have a maintained piece of code.

Appendix

Here is a list of Maven dependencies and related changes that finally made it all work together:

In addition, the following compilation and runtime settings were required:

and properties for maven-compiler-plugin set to 1.8
tomcat 7, i.e. run services with “mvn tomcat7:run-war” and not “mvn tomcat:run-war” which uses tomcat 6 by default.

Improving Efficiency in Offline Campaigns

2018-04-24T00:00:00+02:00

Using an API to drive marketing profitability: a gift card study

Gift cards are becoming increasingly popular in the US and Europe. For time-pressed consumers trying to find a convenient gift for friends and family, gift cards are an easy solution, and it shows: gift cards are projected to grow at a 24% Compound Annual Growth Rate (CAGR) until 2023, according to Allied Market Research. Brands reap the benefits too: they can use them to engage their best customers to onboard family and friends, driving profitable sales.

Zalando wants to build Europe’s most beloved gift card, but the market is competitive. This blog post outlines the challenges of entering the market, describes our technical approach, and lists some key learnings.

Online - brave new world

Successful online businesses use data to their advantage. They make marketing decisions based on data from their own sites and those of partners. Data science and machine learning are increasingly applied to aggregate campaign outcomes into insights and provide guidance or even fully steer marketing investments. As a result, marketing processes are being digitized rapidly.

Offline - data scattered throughout proprietary systems

Yet to succeed in the market, online marketing prowess only brings you so far. To reach consumers on a broad scale, a brand must sell its gift cards through leading retail stores, as well as offer them via employee benefit and loyalty programmes.

E-commerce players like Zalando are used to easy integration with partners and have access to a wealth of marketing and sales data. Providing comparable insights on campaigns executed with brick-and-mortar retailers and businesses presented us with some tough nuts to crack, so we took a look at our business objectives.

I. Business Objectives

1. Integrate with gift card distribution networks. Unlike e-commerce, gift cards are not only sold online: a very substantial share is actually sold through retail and to businesses. A sprawling ecosystem of aggregators and resellers serves tens of thousands of businesses and retail stores. If a brand wants to reach consumers on a broad base, it has to digitally integrate this vast ecosystem. But how do you find an approach that scales?

2. Harvest data in real time. Using real-time data allows e-commerce players to observe the effects of campaigns in real time, giving them superior control over their investments. In the offline world, however, data points are few and far between and are constrained to monthly CSV reports. So how do you enable real-time data insights into sales via brick-and-mortar businesses?

3. Understand Return on Investment (ROI). The right data makes the difference between running a profitable campaign and losing money. In this brave new world, ROI steering has become a mandate. An apples-to-apples comparison between online, retail and business campaigns demands using the same tools and metrics. Yet how do you A/B test campaigns in a brick-and-mortar world?

II. Digitally Integrating an Ecosystem

Our data strategy follows a simple mantra: a single API harvests all commercially relevant data from every partner in real time. Standardized data backhaul reduces complexity and frees up engineers and analysts to focus on initiatives that create business value. Real time data enables timely and precise business steering as well as improved operations.

1. Migration to RESTful API - challenges

a. Idempotency Since gift cards represent financial value for the customer, it‘s important to provide a robust and fault-tolerant way of operations. Errors, such as network problems, service interruptions, or user error are a fact of digital life. We handle incidents based on best practices, yet one of the most important cornerstones for our API is idempotency. We require our partners to provide a unique operation identifier on every call. Based on these identifiers, we can ensure that the required operation will be executed only once, even in the case of repeating calls, which can happen due to connectivity interruptions.

b. Scalability To address requirements towards a modern partner API, we designed our API to be scalable to address future growth. During design and implementation, we took substantial efforts to determine performance limits and push beyond these limits. Load testing is the best friend of the developer, allowing you to prove that your service is fulfilling the business requirements. As a nice side effect it may expose hidden problems in the implementation. And it‘s always better to find your problems and solve them before they affect actual customers.

c. Security Gift cards represent value, thus any processes that touch a gift card code must adhere to the highest standards of compliance. We worked closely with internal auditing personnel to identify weaknesses in our process and address these. Our basic mantra is that a gift card code should only be touched by the customer. This meant harmonizing diverse existing processes, such as manual distribution by mail, transfer to SFTP servers or upload to websites, towards a common, secure process.
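
The idempotency requirement described under (a) can be illustrated with a minimal sketch. It assumes an in-memory store and a hypothetical `activate` operation; the class, method and identifier names are illustrative only and are not taken from our actual API.

```python
from dataclasses import dataclass, field


@dataclass
class GiftCardService:
    """In-memory stand-in for a gift card backend (hypothetical)."""

    balances: dict = field(default_factory=dict)
    # Results of already executed operations, keyed by operation identifier.
    completed: dict = field(default_factory=dict)

    def activate(self, operation_id: str, card_code: str, amount: int) -> dict:
        # A repeated call with a known identifier returns the stored result
        # instead of executing the operation a second time.
        if operation_id in self.completed:
            return self.completed[operation_id]
        self.balances[card_code] = self.balances.get(card_code, 0) + amount
        result = {"card": card_code, "balance": self.balances[card_code]}
        self.completed[operation_id] = result
        return result


service = GiftCardService()
first = service.activate("op-123", "CARD-A", 50)
retry = service.activate("op-123", "CARD-A", 50)  # e.g. resent after a timeout
```

In this sketch the retry returns the same result as the first call and the balance is credited only once. A production service would presumably keep the operation identifiers in durable storage rather than in memory.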

2. Backhaul of commercial data - challenges

a. Standardization Though brick and mortar businesses harvest a wealth of commercial data from their networks, they use proprietary schemes, making further processing of any such data almost impossible. We decided to mandate the provision of core sales data from partners and customers. While this sounds somewhat onerous, it was essential for a solution designed to address business requirements.

b. Real time To enable marketing executives to make the right decisions, data needs to be available with little delay. In today’s increasingly competitive markets, a monthly report is not a solution; it is a problem. Using our API for data backhaul gets rid of that headache.

c. Attribution To track how campaigns perform requires attribution to the specific partner that executed a particular campaign. In retail this means knowing which retailer sells a particular gift card product at a particular time. Combining such data with Salesforce-based campaign planning enabled us to assign discounts to individual cards.

d. A/B testability To execute A/B tests in the real world of retail demands the ability to execute a campaign in two specific, comparable geographies. Thus, the retailer must be able to restrict campaigns by region, which is a substantial challenge. In addition, analysis of results requires information about the specific store that sold an individual card.
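
To make attribution and A/B testability more concrete, here is a rough sketch of how store-level sales records could be compared across two regions. The record fields, region names and amounts are invented for illustration; they do not describe our real data model or results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records as they might arrive through the data backhaul.
sales = [
    {"store": "S1", "region": "north", "amount": 50},
    {"store": "S1", "region": "north", "amount": 25},
    {"store": "S2", "region": "north", "amount": 25},
    {"store": "S3", "region": "south", "amount": 50},
    {"store": "S4", "region": "south", "amount": 100},
    {"store": "S5", "region": "west", "amount": 30},
]

# Hypothetical assignment of two comparable regions to campaign variants.
variant_by_region = {"north": "A", "south": "B"}


def revenue_per_store(records, variants):
    """Return the average revenue per store for each campaign variant."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        variant = variants.get(record["region"])
        if variant is None:
            continue  # sale outside the test regions
        totals[variant][record["store"]] += record["amount"]
    return {v: mean(stores.values()) for v, stores in totals.items()}


result = revenue_per_store(sales, variant_by_region)
```

The sketch only works because each record carries both the region and the specific store, which is exactly the metadata the retailer has to provide.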

III. Learnings

1. Plan efforts and timings generously Migrating an operative API that connects external partners is a challenge. Any bug will directly impact customers and revenue, so all inherent risks need to be addressed. Partners are generally loath to change a running system, thus migration takes considerable time, during which any systems and processes must be operated in parallel.

2. Focus on the what, be flexible on the how Winning over partners to change a process is never easy. What made the difference was pragmatism; focusing on outcomes while being flexible on the solution. We had to accept some constraints and put in some extra effort. But this way, we got our partners to onboard in a timely way.

3. Data sharing requires a win-win proposition While the strategic value of data is clear to businesses, the rationale for sharing such data with partners is less so. Today, sharing data with offline partners is mostly restricted to downloadable monthly reports. Providing extended data in real time requires changes to systems and processes. To win such discussions, you must think through what benefits the change will provide to your partner.

4. Real time data - a treasure trove for machine learning Our data team was asked to provide advice on a number of incidents where gift cards were being misused in inventive ways. The metadata collected for marketing purposes proved extraordinarily useful in such cases. Using machine learning, we were able to precisely identify patterns of misuse. Real time data enables real time decisions.

IV. Conclusion

While it takes considerable effort and persistence to get the necessary metadata from partners operating in the offline world, our early investments in a modern data backhaul are paying off by providing the transparency we require.

The concrete and tangible benefits for us are:

Full attribution of discounts granted by retail and business to profit calculation
Track campaign impact on a daily basis
Ability to steer on Return on Investment
A/B test offline campaigns across geographical regions

Distributed Cache

2018-04-19T00:00:00+02:00

Using Akka cluster-sharding and Akka HTTP on Kubernetes

This article captures the implementation of an application serving data over HTTP which is stored in cluster-sharded actors and deployed on Kubernetes.

Use case: An application serving data over HTTP at a high request rate, with latency on the order of 10ms and limited database IOPS available.

My initial idea was to cache it in memory, which worked pretty well for some time. But this meant larger instances due to duplication of cached data in the instances behind the load balancer. As an alternative I wanted to use Kubernetes for this problem and do a proof of concept (PoC) of a distributed cache with Akka cluster-sharding and Akka-HTTP on Kubernetes.

This article is by no means a complete tutorial to Akka cluster sharding or Kubernetes. It outlines knowledge I gained while doing this PoC. The code for this PoC can be found here.

Let’s dig into the details of this implementation.

To form an Akka Cluster, there needs to be a pre-defined, ordered set of contact points, often called seed nodes. Each Akka node will try to register itself with the first node from the list of seed nodes. Once all the seed nodes have joined the cluster, any new node can join the cluster programmatically.

The ordered part is important here, because if the first seed node changes frequently, the chance of split-brain increases. More info about Akka Clustering can be found here.

So, the challenge with Kubernetes was the ordered set of predefined nodes, and this is where StatefulSets and Headless Services come to the rescue.

StatefulSet guarantees stable and ordered pod creation, which satisfies the requirement of our seed nodes, and Headless Service is responsible for their deterministic discovery in the network. So, the first node will be “<application>-0”, the second will be “<application>-1”, and so on.

<application> is replaced by the actual name of the application.

The DNS names for the seed nodes will be of the form:

<application>-<ordinal>.<service>.<namespace>.svc.cluster.local
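
As a small sketch of this naming scheme, the seed node list used in the StatefulSet manifest further down can be derived from the application name, service name, namespace and replica count. The helper `seed_nodes` is hypothetical and not part of the PoC code.

```python
def seed_nodes(app: str, service: str, namespace: str, replicas: int, port: int) -> str:
    """Build a comma-separated seed node list from StatefulSet pod DNS names."""
    hosts = [
        # Pods of a StatefulSet are numbered from 0 upwards, in creation order.
        f"{app}-{ordinal}.{service}.{namespace}.svc.cluster.local:{port}"
        for ordinal in range(replicas)
    ]
    return ",".join(hosts)


nodes = seed_nodes("distributed-cache", "distributed-cache", "default", 3, 2551)
```

With these arguments the result matches the AKKA_SEED_NODES value in the manifest, and the order of the list stays the same across restarts because the pod names are stable.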

Steps:

Start by creating the Kubernetes resources. First, the Headless Service, which is responsible for deterministic discovery of seed nodes (Pods), can be created using the following manifest:

kind: Service
apiVersion: v1
metadata:
 name: distributed-cache
 labels:
   app: distributed-cache
spec:
 clusterIP: None
 selector:
   app: distributed-cache
 ports:
   - port: 2551
     targetPort: 2551
     protocol: TCP

Note that “clusterIP” is set to “None”, which indicates that it’s a Headless Service.

Create a StatefulSet, which is a manifest for ordered pod creation:

apiVersion: "apps/v1beta2"
kind: StatefulSet
metadata:
  name: distributed-cache
spec:
  selector:
    matchLabels:
      app: distributed-cache
  serviceName: distributed-cache
  replicas: 3
  template:
    metadata:
      labels:
        app: distributed-cache
    spec:
      containers:
        - name: distributed-cache
          image: "localhost:5000/distributed-cache-on-k8s-poc:1.0"
          env:
            - name: AKKA_ACTOR_SYSTEM_NAME
              value: "distributed-cache-system"
            - name: AKKA_REMOTING_BIND_PORT
              value: "2551"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: AKKA_REMOTING_BIND_DOMAIN
              value: "distributed-cache.default.svc.cluster.local"
            - name: AKKA_SEED_NODES
              value: "distributed-cache-0.distributed-cache.default.svc.cluster.local:2551,distributed-cache-1.distributed-cache.default.svc.cluster.local:2551,distributed-cache-2.distributed-cache.default.svc.cluster.local:2551"
          ports:
            - containerPort: 2551
          readinessProbe:
            httpGet:
              port: 9000
              path: /health

Create a Service, which will be responsible for redirecting outside internet traffic to pods:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: distributed-cache
  name: distributed-cache-service
spec:
  selector:
    app: distributed-cache
  type: ClusterIP
  ports:
    - port: 80
      protocol: TCP
      # this needs to match your container port
      targetPort: 9000

Create an Ingress, which is responsible for defining a set of rules to route traffic from the outside internet to services:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: distributed-cache-ingress
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "distributed-cache.com"
      http:
        paths:
          - backend:
              serviceName: distributed-cache-service
              servicePort: 80

And the distributed cache is ready to use.

Summary

This article covers Akka Cluster-sharding on Kubernetes with the prerequisites of an ordered set of seed nodes and their deterministic discovery in the network, and how these can be met with StatefulSets and Headless Services.

This approach of caching data in a distributed fashion offered the following advantages:

Less database lookup, saving database IOPS
Efficient usage of resources; fewer instances as a result of no duplication of data
Lower latencies to serve data

This PoC opens up new doors to think about how we cache data in-memory. Give it a try (all steps to run it locally are mentioned in the Readme).

The Democratization of ‘Data Science As A Service’

2018-04-17T00:00:00+02:00

How data science is becoming available ‘for the good of all’ businesses

In his 2010 TED Talk “When Ideas Have Sex,” Matt Ridley posits that human prosperity was caused by one thing and one thing only: our unique human ability to specialise and exchange ideas and tools.

Ridley’s example of the invention of the reading light illustrates how far we’ve come. Thousands of years ago, making an hour of reading light required hunting an animal and killing it, before rendering it down to make a candle. Today, the average human earns an hour of reading light in less than half a second. The reclaimed time is spent relaxing, traveling, and working day to day in specialized industries for the benefit of other humans. Specialization and exchange creates new technologies faster, and at an ever decreasing cost.

The Democratization of Data Science

‘Data science as a service’ is the latest way for humans to specialize and exchange data science ideas and tools, and is fast accelerating a new wave of computing innovations at an ever decreasing cost. At Zalando, Europe’s leading online fashion platform, we’ve been ‘all in’ on data science almost since the start of our journey to ‘reimagine fashion for the good of all’; delivering customized experiences, quality search results, and contextually relevant recommendations through AI and Machine Learning. Today, we’re betting on ‘data science as a service’ as a new way to democratize the previously specialized power of data science to teams across Zalando.

Understanding Why Data Science Is Different

To democratize technologies, you must understand how this ‘new’ innovation is similar and different from legacy innovations that people already use; in this case, traditional ‘as a service’ API platforms.

First, most platform APIs typically enable users to do one of two things: perform CRUD-like operations to create, read, update, and delete information from a central source of truth (the platform), or ask questions of pre-defined and indexed datasets within a platform (most people call this ‘analytics’).

Data science platform APIs are different. Acronyms and words like NLP and deep learning are used to describe data science, but what data scientists really do is help machines understand the evolving, unstructured world around us as humans do. ‘Data science as a service’ APIs provide power by adding structure to unstructured random inputs and questions like:

In the examples above, what “this” is could refer to unstructured text, images, video, or audio. “Meaningful groups” and “unusual things” could be subjective. Humans can be biased when answering questions like these. Machines don’t (yet) have human biases, so helping them to understand unstructured inputs, and create loosely structured outputs requires a different way for these machines to talk to each other, and to humans.

Second, most platform APIs deliver confident results in a binary way. As an example, querying an API for a set of records created within a date range will deliver back the correct set of records created within that date range, provided the data initially provided is accurate. Similarly, when an API is used to read a single record in a database, the API will confidently retrieve that record and its contents for you.

Again, data science APIs are different. Imagine someone stopping you on the street, showing you the photo above, and asking you, “What do you see in this picture?” Your answers might begin with “I’m pretty sure I see...” (a boat on the Hudson River), or “I definitely see...” (the Empire State Building). You might also be asked clarifying questions like, “Where do you think you see it?” As your answers will be either high, medium, or low confidence answers, ‘data science as a service’ APIs must also have a means to express their level of confidence.
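
To illustrate, here is a minimal sketch of how a client might translate numeric confidence scores into the high, medium, and low answers described above. The response shape, label names, scores and thresholds are all hypothetical and do not describe any specific Zalando API.

```python
def confidence_level(score: float, high: float = 0.9, medium: float = 0.6) -> str:
    """Map a score between 0 and 1 to a coarse confidence level."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Hypothetical response of an image analysis API.
response = {
    "labels": [
        {"name": "Empire State Building", "confidence": 0.97},
        {"name": "boat", "confidence": 0.72},
        {"name": "bridge", "confidence": 0.31},
    ]
}

levels = {
    label["name"]: confidence_level(label["confidence"])
    for label in response["labels"]
}
```

Where the thresholds sit is a product decision: a feature shown directly to customers may only want to use high confidence labels, while an internal tool could also surface medium ones.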

Evolving the API Developer Experience, for Data Science

At Zalando, we’ve formed a deep understanding of why ‘data science as a service’ APIs are different through building our Fashion Content Platform Team. Simply put, our team of data scientists, engineers, designers, and product managers develops fashion-focussed AI models, capabilities and APIs to enable any team in Zalando to integrate self-serve ‘data science as a service’ APIs when building relevant and immersive experiences for customers. Here are some key lessons we learned along the way.

Make it real

‘Data science as a service’ APIs are different, and for both technical and non-technical users, it’s important to understand why they’re different, by ‘making it real’. For non-technical users, easy to use demo UIs and a ‘Labs’ environment make it easy for any member of any team to understand what our deep learning and NLP models do, and how integrating them can help them deliver unique customer experiences. They also take the mystery out of data science, through familiar inputs, simple language, and visual responses with clear explanations. For technical users, ‘make it real’ happens through easy to use tools to call APIs from within the documentation. Certain fields are pre-filled with images and text to reduce time-to-first API call.

What you see, what you get

For image analysis deep learning APIs, developers must understand quickly how JSON responses – or features built with them – might surface to customers within their applications. We carry interactive and visual cues through the documentation and JSON responses are clear, and in-context too. Where relevant, Taxonomies are visual, using imagery to quickly articulate what ‘A-line’, ‘Cropped’, and ‘Paisley’ might mean. For NLP text analysis APIs like Entity Relations, JSON responses are structured for easy interpretation, and demo UIs are available for users to understand visually how Entities relate to each other.

Set expectations

Many developers ‘fail first time’ when using ‘data science as a service’ APIs, as they feel they’ve incorrectly integrated, or are not getting the results back they require. Like humans, machines are limited to what they know based on what they’ve seen before, and simple, visual explanations within documentation help developers understand what the machine knows now, and what it might be learning soon (the AI product roadmap). Providing examples of inputs that work, and inputs that do not, will help set expectations, and likely help fuel your AI product roadmap with new feature requests. Product limitations are always an opportunity to prioritise feature requests faster.

Explain the seemingly obvious

While terms like ‘confidence score’ and ‘features’ are part of day-to-day conversations amongst data science teams, it’s easy to forget that developers newer to data science may not understand what they mean, or what their JSON output represents. Stating the seemingly obvious not only helps developers adopt and integrate with APIs more quickly, it provides an opportunity for all types of developers to skill up and learn about new technologies, and hopefully will spark some ideas for them, too.

Data science as a service for the good of all

“Data science” in all its various forms has existed for more than 30 years, but the majority of businesses in the world today don’t understand what it is, or don’t understand the benefits data science can deliver for their business. ‘Data science as a service’ will address that knowledge and tools gap, enabling businesses everywhere to understand large datasets, automate manual processes, and deliver relevant customer experiences. We’re way closer to the beginning than the end of this journey at Zalando, and would love to hear from you if you’re as excited about the possibilities as we are.

Discovering Design Sprints

2018-04-12T00:00:00+02:00

Our experience of The Sprint

About two years ago, Jake Knapp, John Zeratsky and Braden Kowitz from Google Ventures published “The Sprint.” They describe a methodology that helps you answer critical business questions, develop ideas, or tackle problems in just five days, and last year Jake Knapp shared his insights in a fireside chat at Zalando.

Last week, we had the chance to see it in action. In this article, we will not go into the details about how the Design Sprint works, since it’s already described perfectly in the book. We will rather share another valuable asset with you: our experience and learnings.

Why Design Sprint?

Our team started work on a new replenishment process and since Design Sprints were already successfully used at Zalando, we decided to give them a try. We had already tested some crucial assumptions for the new process we were working on with an Excel prototype, and now wanted to craft the look of the customer interface to allow for deeper learnings.

Setup and Preparations

The Team

We recruited a multi-functional group of eight people across the department plus two facilitators. It’s quite a large group, but we aimed for involving the whole developer team in the discovery as early as possible to get everyone on the same page in terms of knowledge and decision-making. The following colleagues were involved:

Five engineers
One process specialist
One UX designer
One product manager
Two producers to facilitate the workshop

Learnings:

It’s hard to find a week where everyone who you would like to participate can get rid of all meetings and appointments. But it’s definitely worth it!
Get a strong facilitator. You will not be able to manage the sprint and to contribute to the workshop at the same time.

The War Room

According to the guidance, we booked a room in our office for the whole week. Unfortunately, it had terrible acoustics and turned out to be too small, so after the first day we moved to a bigger and more comfortable one.

Learnings:

Test your room before you start the sprint! Make sure it’s not too noisy and it’s big enough. Your team should easily fit in, together with the big whiteboards and all kinds of supplies. Don’t hesitate to change it if it doesn’t feel right: your team will thank you! Apart from that, you will need all your energy to focus on the sprint and not to waste any brainpower on room complaints.

What did we do and what did we learn?

Day 1: Knowledge Sharing and Alignment The first day was dedicated to setting the frame. We introduced the roles of Decider and Facilitator, aligned on the long-term goal, invited several experts from different teams for interviews, mapped the new process and formulated some open questions.

Status: Finally we get to do a Design Sprint!

Learnings:

Make sure to formulate the sprint questions correctly. Otherwise, you’ll have to come back to it in the middle of the sprint, and will possibly lose a lot of energy clarifying it. Look for critical hypotheses you can verify or falsify. This might put you out of your comfort zone, but you’ll only learn something new if you risk being wrong.
Map the process on a deeper level! We did it high-level in the beginning, but realised in the middle of the sprint that we had to go deeper. It was extremely difficult to come up with the storyboard without a common understanding of the different process steps. After some trial-and-error, we decided to step back and invest time in re-doing the process map.

Day 2: Sketching Solutions Having a common baseline of knowledge, we shared some ideas that we particularly liked in other products (“Lightning Demos”) and could incorporate into our product. Then we started to sketch our solutions. Google Ventures suggests some approaches to structuring the sketching process, and all of them are based on individual work. There is no exchange or feedback planned for this half of the day; everyone just develops their own idea.

Status: This idea is going to be good...

Learnings:

Explain the purpose of the Lightning Demos. This is an extremely helpful exercise, but sometimes you end up pitching your idea instead of just showing and explaining it.
Individual sketching might not always be the 100% right solution. Since there was no exchange of ideas, we felt like we lost the momentum of combining great concepts or getting inspired by the work of others. Next time we would either exchange ideas or mix the suggested way of sketching with an iterative method such as rapid wire-framing.

Day 3: Review and Decide On Wednesday we reviewed the ideas drawn by the team, and decided what we were going to prototype. After the Decider voted, we started to build a storyboard for our prototype. As mentioned already, at this point we ran into several issues: it was not clear what exact question should be answered by the end of the week, and we were missing a detailed process map. We had to make a mind switch from “We can test everything” to “We have to find the most important hypothesis.” It was very difficult to accept that limitation and align on one statement. Nevertheless, we overcame the difficulties. We aligned on one hypothesis to test and drew a detailed process map. After this was accomplished, we came up with good results, but it was definitely the most emotionally exhausting day of the sprint!

Status: Roller coaster of emotions!

Learnings:

The Decider is the best role ever. Having a dedicated Decider role definitely has its advantages especially when there is a time constraint. Our Decider did an awesome job by not just making the decision about which prototype we should build, but also explaining the background and his thoughts to the group.
Make sure everyone knows the process. As mentioned already, after we realized that we were not on the same page we had to spend some time on redoing the process map.

Day 4: The Prototyping Here you go: after just 3 days you start building things! Okay, not really building, since the prototype is just a facade. But it looks pretty much like a real product!

Learnings:

Assigning different roles really helped to get everything together in time. Everyone knew what they were responsible for, and working on for the next few hours.
Don’t get lost in details. The main challenge turned out to be deciding how to spend our time. We concluded: users don’t really care about how exact your numbers are or if your buttons look awesome; they focus a lot more on the workflow and how the data is displayed. So we invested more energy in the look of the interface and neglected the correctness of the data.

Day 5: The Moment of Truth The most exciting part of the workshop: your team watches how real users interact with the prototype! This day is definitely the hardest one for the interviewer. You have to be well prepared to talk to users for five hours and be watched by the rest of the team.

Learnings:

Live testing is awesome! Getting real-life feedback was a great experience and provided a lot of insights. If you are planning on doing a design sprint, invest some time in advance to find users and align on time slots.
Get some feedback from the team. While this wasn’t part of the design sprint concept we think it’s important to always gather feedback and take away learnings for the next sprint to come.

What now?

We are very satisfied with the return on time invested. It was a great experience with solid results we can further test along the way. It can be very difficult to find a consensus within the team due to different opinions and approaches. The Design Sprint turned out to be very useful not only for building and testing prototypes, but also for getting the buy-in from the whole team, the management, and the users, and therefore increase everyone’s passion about the topic. Additionally, doing a design sprint is an efficient and low-budget way to find out whether you are actually building the right product for your users or if you need to re-think your approach.

We hope we have encouraged you to try a design sprint for whatever problem you are facing right now.

Managing Personalized Products

2018-04-10T00:00:00+02:00

A product manager's insights on customization

Personalization is a common term with digital products. But what does it actually mean, why do we do it, and how does it affect the product manager?

To illustrate, let me tell a personal story. I have gone to the same hairdresser for 10 years. He has seen a big part of my life, with big changes and evolutions. He knows my preferences. I like to have my appointments in the morning. I like to try new styles, but I expect him to come up with the new ideas. Additionally, he knows my hair, so he is able to produce a great end result every time. He remembers all of this about all of his regular customers, and because of that, not only does he do great cuts, he is able to serve everyone in a slightly different, personal way.

So the basis of personalization is good memory plus the ability to use everything you remember about the person.

The story also illustrates why personalization is important. It is almost impossible to make me go anywhere else, because the cost of switching is just too high. The trust that has been built over time in this relationship is so strong, I don’t have any interest in exploring alternative service providers. No matter how technically superior some other hairdresser is, he can’t win without the relationship.

At platforms like Zalando, the challenge of personalization comes from three facts: the scale of stock and customers (300,000 items and 23 million customers), the requirement of automated relationship building, and the tricky balance between being cool or creepy.

How does a product manager’s work change when the product is personalized? If we define product management as the intersection of tech, business and design, we can observe the impact from these three angles.

Design is all about the user experience. If you haven’t been customer obsessed before, now is the time. You need to be able to understand your customers at a very deep level. What are the aspects of the experience that actually add value, if personalized? Is it just the item recommendations, or everything from marketing to delivery? What is the right level of personalization, what makes it feel good and valuable? This might vary a lot depending on the culture and the individual. What are the customer’s goals at different times? Are they exploring or exploiting? How do you recognize those goals and then help customers reach them? Personalization forces you to a new level of customer insight, user research, and design.

From a business point of view, you need to level up your KPI skills. Zalando is divided into independent product teams which are responsible for one functional part of the whole; think of home page, category page, product page, wishlist, search, etc. Personalization must run consistently through all of these. Personalization alone does not make one single sale. So what is the goal and how do you measure the value of personalization? Looking at short-term metrics such as click-through rate might lead us to optimize for interest that has no lasting value, causing negative effects further along in the customer journey. Optimizing for the current session might also be harmful for long-term relationship building. If the vision is to build a relationship like the one I have with my hairdresser, then you need to go for long-term KPIs like customer lifetime value, and some measures of how deep the relationship is; maybe how diverse the usage of the shop is, and how widely the user shops across categories. Measuring these kinds of KPIs and attributing them correctly is not trivial.

In tech, we turn to data. First of all, you need the data that is predictive to your problem. You might have to start by building something just to gather data that enables the next steps. You can get to a certain level with rule-based systems, but at some point you enter the world of Machine Learning. Being a Product Manager with a ML product requires understanding of both data science and machine learning. You will be modeling human behaviour with a machine, so you need to be able to facilitate that discussion with the design, business, and tech experts. Data Science lives up to its name; it is science, not your standard software development. Be prepared for model exploration, data gathering, cleaning, labeling, lots of iterations, and discussions on when the performance is good enough. Dig up your statistics and probability skills, because you need to understand how those relate to what the customer will experience.

In summary, personalization offers an exciting challenge for a product manager to stretch their skills in all aspects of the role. But like in every product, the core is still the same: Deep understanding of the customer problem, and an exciting vision for the solution will take you far.

The Perks of Being in a Hackathon

2018-04-05T00:00:00+02:00

How stepping out of our comfort zone led to a hackathon victory

Zalando Tech doesn't just put on hackathons, we love to attend them too! Here, we catch up with software engineers, Lisa Knolle and Izabela Bratovic about their time at #picturepunk.

At the end of last year we took part in a hackathon. We made this decision to expose ourselves to new experiences, new people, and new technologies. Apart from our in-house equivalent, the esteemed Zalando HackWeek, we were both completely inexperienced in participating in such events. Our event of choice was #picturepunk, hosted by the German Press Agency (DPA) in cooperation with Adobe, Microsoft, and Google News Lab. Its focus was on media journalism, and the participants were tasked with finding ways to further improve the industry.

Following the “preparation is key” rule, we sat together and started to brainstorm for the next game-changing idea we wanted to bring with us. Three fruitful hours later, we decided our sheer willingness to work and contribute would just have to do. As the date quickly approached, the “hackathon anxiety” started to slowly settle in as well. Backing out was out of the question. But of course, this initial anxiety turned out to be completely unjustified.

The hackathon was kicked off with an idea pitch that was open for all contestants. Seeing that our brainstorming wasn’t so off-point immediately boosted our morale. More than a few ideas piqued our interest. The pitches ranged from enabling journalists to create smart photostories using an app, to making use of blockchain technologies for image licensing. Luckily, a rather interesting group of participants started to gather around the idea that appealed to us the most, and sooner rather than later, we found ourselves jotting down features on some post-its together with our freshly founded team. Our goal was set on finding ways to help journalists comb through the vast amounts of available stock media material, optimizing their search results and saving them some of their precious time. Our team of seven people consisted of three UX designers, a media science student, and an entrepreneur; the latter two both with some programming experience and, finally, the two of us full-time software engineers.

The next 48 hours now seem like nothing more than a short blur. Slightly sleep-deprived and high on a constant supply of Club Mate, we gave it our best shot to build the MVP. This was achieved using some technologies we have expertise in mixed with some that we don’t get to use in our day-to-day jobs. The sponsoring companies made sure we had their APIs at our disposal, and it was just as compelling exploring them as it was using them. Google’s Cloud Vision API for image analysis and Microsoft Cognitive Services, which detect human emotions in images, were some of the tools we had the privilege of using. It was enlightening to see the state of technology in that field and try to put it to good use. Our application’s backend pulled media from Adobe Stock, enriched it with relevant metadata using the aforementioned APIs, and handed it over to our friendly user interface. The journalists would then be presented with many options for filtering through these images, be it by selecting metatags, liking or disliking images that come up, or even by details of those very images that were detected through image analysis in the previous step. Having less than 48 hours to prove our skills and put all of that together was what motivated us most and kept us going.

But, as we would soon find out, building an MVP does not necessarily make a winner; what did was the team of cool individuals we worked with. Having met on the spot, armed with very different skill sets and personalities, we worked together towards a common goal. The designers’ resourcefulness and fast-paced working style made it possible for the team to effortlessly impress the jury. Equally important, our teammates’ aptitude for the business side of the startup world was what brought our presentation to another level.

Photo credit: Silas Stein

Finally, and contrary to our expectations, our team won awards from the judges in two out of four categories: Best Overall and Best of API. You can imagine why we can only recommend the experience! A hackathon gives you so much room to learn interesting new things, meet people who share the same drive as you, and challenge yourself in a different domain or even industry.

Cross-Department Hackathons at Zalando

2018-03-29T00:00:00+02:00

Last week, at our new-format Zalando Hack Week, two important departments dropped their day-to-day tasks and embarked on what, for many of them, was their first ever hackathon.

How did this come about? Zalando’s new Hack Week is a departure from the hackathons we have organised in the past, which typically only involved the tech department and happened once a year. However, at Europe’s most fashionable tech company, technology is a huge part of everything we do. This presents us with an excellent opportunity to make the hackathon truly cross-departmental, and to involve talented and innovative minds from many different areas of the business. And since innovative ideas don’t wait around to crop up once a year, our Hack Week has become a quarterly event.

This time, our Senior Vice President (SVP) of the People & Organization department, Boris Ewenstein, and our VP of Digital Foundation, Eric Bowman, got together to organize a collaborative hackathon between their teams, with the goal of awarding the best projects with the chance to make their ideas a reality. The theme this time? For the Good of All: 10 Years of Zalando. Innovating and improving processes across the company to create an even better workplace.

What followed was a frenzied week of brainstorming, designing, discussing, and testing. Our judges, consisting mostly of senior members of the two departments, listened to the pitches of some 50+ teams, and awarded prizes to the best five. On top of the company’s recognition for their excellent work, the jury provided the winners with helpful feedback on their projects, should they wish to continue their work on the prototype.

The winners:

The Bravest of the Brave award went to the group wanting to introduce Zalando’s very own cryptocurrency.
The group wanting to enable the spread of AI insights across Zalando was awarded the Money Makers prize.
The Geeks of the Week award went to the group wanting to make cluster-based requests on Skipper a reality.
The Swiss Army Knife award went to the group behind the idea of creating a one-stop resource of information for internationals looking to move to Berlin for a job at Zalando.
This quarter’s Customer Heroes wanted to bring data from different fashion stakeholders, from the designer through to the buyer and the producer, together to allow fashion teams at Zalando to create products with even more different combinations of fabrics, design, colours, and patterns.

The awards were presented in a glitzy awards ceremony, complete with presenters dressed to the nines who called the participants and judges onto the stage to show recognition for their hard work and excellent initiative.

The most coveted prize was, of course, the golden ticket to Zalando’s Slingshot Program, our internal entrepreneurial development incubator. Teams who win the golden ticket are given all the resources they need, and dedicate 20% of their working time for a short period to their projects, taking them to the next level and developing the first steps. In this way, Zalando recognizes the great ideas pitched by Zalando’s employees, from the bottom up, and helps them make them a reality.

This time around, two golden tickets were awarded: The first was to the winners of the Swiss Army Knife award, for their resource for internationals arriving in Berlin. The second went to a team who is setting up a mentorship matching program, allowing people to connect across the organization with the aim of improving diversity at the management level. Congrats to all groups involved!

At Zalando, we dare all of our employees to Reimagine Fashion. This can be done on many different levels, and at Zalando, this all starts with technology. Stay tuned for more insights into our hack weeks, and how our cross-functional teams are working together to reinvent fashion, tech and Zalando.

Discovering a Future in Tech

2018-03-27T00:00:00+02:00

How former Zalando trainee, Anriika Kauppi, found her calling

Fresh out of high school, Anriika Kauppi, 19, was interested in becoming a teacher, but instead of taking the scholastic route, she did a summer traineeship at Zalando’s Helsinki Tech Hub. With a family background in tech, Anriika wanted to see what the field had to offer as a career. A year and a half later, she has lived abroad for three months, completed another internship in the tech field, and applied to study engineering. Now Anriika is in her first year of studying engineering at the Tampere University of Technology in Finland, and she is passionate about inspiring her peers and young girls to study technology too.

Do you remember your first encounter with “tech”? What was it?

It depends on what you define as “tech,” but when I was little, one of my first impressions was when my father showed me a computer that had a dummy program that taught the basics of how computers work. I printed my name a hundred times. At that time, however, it didn’t sound very fascinating, probably because it didn’t even occur to me that what it was doing was so special. In retrospect, after working at Zalando and starting to study, I became fascinated and realized the possibilities of coding.

Anriika stretched herself as a trainee at Zalando.

What was your role at Zalando? Why was it interesting for you?

I was a trainee at Zalando in the summer of 2016 for three months, right after graduating from high school. I did subjective testing for one of the projects in the Helsinki tech hub. Together with this, I did application testing on a newly soft-launched application. I also planned how to automate my work for when my traineeship was up, so that the team could work independently without me. My idea was around testing one search engine’s accuracy against another one’s automatically, which in the end, the team implemented and used after my traineeship was over.

I loved finding the shortcomings of the search engines. However weird they were, there was always still some logic behind them, and figuring this out was fascinating to me. So much so, I wondered about the fun I could have working in the tech industry. If I could solve problems such as these in my career, I definitely wanted to go study IT at university, even though my previous goal was to become a school teacher.

One year after my traineeship at Zalando, I was accepted to study Information Technology at the Tampere University of Technology in Finland. I am currently in my first year, studying Python and C++.

What are some of your favourite tech products?

I can’t name one, but the ideal kind of tech product for me is one that’s simple to use and aesthetically pleasing, while combining multiple touchpoints in one device. For example, some mobile banking applications in Finland do this very well.

Who is a hero of yours? Why?

My grandma. She was already studying programming before it was taught in universities. Basically, she learned programming with punched tape before the ‘70s, and she worked with microprocessors. She often wondered whether she was one of the first female microprocessor programmers in Finland, as at the time there were hardly any.

Nowadays, it’s also great that I can connect with my father on a similar level. He is also a software developer, and it’s cool to be able to have deep tech-focused conversations around topics we’re both passionate about, and to learn from his experience.

Anriika is part of “Zelsinkas”: Zalando Helsinki’s Women in Tech

Where do you see yourself in ten years?

I see myself working abroad in an international tech company specializing in perhaps cyber security, testing, or UX design with a programming aspect. Although, I don’t yet know 100% which direction I’ll go in because the tech world is so vast, and it changes constantly. I have only started to discover a small part of it so far, but I’m excited by the opportunities to continue solving the problems that intrigue me. I could also combine teaching and IT. In Finland, they started to teach coding in all elementary and middle schools as of September 2016.

Now, I want to encourage others, especially girls, to understand the opportunities and possibilities of tech, and show that it is a creative and fun field with a highly logical mindset.

Looking back at that moment when I saw the dummy program as a child, I realize now that the program was probably only a for-loop and a print function. It’s pretty awesome that I finally get it!

In February 2018, Anriika represented herself and Zalando as part of “Zelsinkas”: Zalando Helsinki’s Women in Tech, at the Super-Ada event for 16 to 22-year-old women. The event encourages women to study and start a career in technology. Our Helsinki tech hub, as well as our other locations around Europe: Berlin, Dublin & Lisbon, are looking for inspirational female tech talent to join us in creating amazing experiences for our customers. Happy Women’s History Month!

In Praise of TypeScript

2018-03-22T00:00:00+01:00

Insights on making NodeJS APIs great

NodeJS is getting more and more popular these days. It’s gone through a long and painful history of mistakes and learning. By being a “window” for front-end developers to the “world of back-end,” it has improved the overall tech knowledge of each group of engineers by giving them the opportunity to write actual end-to-end solutions themselves using familiar approaches. It is still JavaScript, however, and that makes most back-end engineers nauseous when they see it. With this article and a number of suggestions, I would like to make NodeJS APIs look a bit better.

If you prefer looking at code over reading an article, jump to the sample project directly.

As a superset of JavaScript, TypeScript (TS) enhances ES6 inheritance with interfaces, access modifiers, abstract classes and methods (yep, you read that correctly... abstract classes in JS), and static properties, and it brings strong typing. All of those can help us a lot. So, let’s walk through these cool features and check out how we can use them in NodeJS applications.

I split this post into two parts: an overview and actual code samples. If you know TS pretty well, you can jump to part two.

PART 1. OVERVIEW

INTERFACES, CLASSES, ABSTRACT CLASSES, AND TYPE ALIASES

When I first tried TS, I sometimes felt like it went nuts checking and applying types. It’s technically possible to define a variable’s type with type aliases, interfaces, classes, and abstract classes, so they look pretty similar (kind of like twins, or quadruplets in this case), but as I looked into TypeScript more, I found that, just like siblings, they are actually quite individual.

Interfaces are “virtual structures” that are never transpiled into JS. Interfaces play a double role in TS: they can be used to check whether a class implements a certain contract, and also as type definitions (so-called “structural subtyping”).
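To make the “type definition” role concrete, here is a minimal sketch of structural subtyping. The Named interface, the Sidekick class, and the greet function are hypothetical names made up for this illustration; the point is only that a value is accepted because its shape matches, not because its class declares the interface.

```typescript
// Hypothetical interface used only for this illustration.
interface Named {
  name: string;
}

// A class that never mentions Named in an "implements" clause.
class Sidekick {
  constructor(public name: string, public mentor: string) {}
}

// Accepts anything whose shape includes a string "name" property.
function greet(someone: Named): string {
  return `Hello, ${someone.name}`;
}

// Sidekick is accepted because its shape matches Named structurally.
const robin = new Sidekick("Robin", "Batman");
const greeting = greet(robin);
console.log(greeting);
```

Adding an explicit “implements Named” to the class would not change what is accepted here; it would only make the compiler verify the contract at the class declaration.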

I really like how TS allows us to extend interfaces so we can always modify already existing ones to our own needs.

Say we have a middleware function that performs some checks on a request and adds an additional property named “superheroName” to it. The TS compiler will not allow you to add it to a standard Express request, so we can extend this interface with the needed property.

import { Request, Response } from "express";

interface SuperHeroRequest extends Request {
 superheroName: string;
}

And then use it in a route:

app.get("/heroes", (req: SuperHeroRequest, res: Response) => {
 if (req.superheroName) {
   res.send("I'm Batman");
 }
});

Of course, let’s not forget about the main function of interfaces: enforcing that classes meet a particular contract.

interface Villain {
 name: string;
 crimes: string[];
 performCrime(crimeName: string): void;
}
/* The compiler will ensure that all members of the Villain interface are specified in the implementing class and will report an error at compile time if something is missing. */
class SuperVillain implements Villain {
 public name: string;
 public crimes: string[];

 constructor(name: string, crimes: string[] = []) {
   this.name = name;
   this.crimes = crimes;
 }

 performCrime(crime: string) {
   this.crimes.push(crime);
 }

 getCrimesList() {
   return this.crimes.join("\n");
 }
}



const doctorEvil = new SuperVillain("Doctor Evil");
doctorEvil.performCrime("Take over the world");
doctorEvil.performCrime("Eat a donut");

console.log(doctorEvil.getCrimesList());

Abstract classes are usually used to define base level classes from which other classes may be derived.

abstract class Hero {
 constructor(public name: string, public _feats: string[]) {}
 // Similar to interfaces, we can specify a method signature that must be defined in derived classes.
 abstract performFeat(feat: string): void;
 // Unlike interfaces, abstract classes can provide an implementation along with the method signature.
 getFeatsList() {
   return this._feats.join("\n");
 }
}
class SuperHero extends Hero {
 constructor(name: string, _feats: string[] = []) {
   super(name, _feats);
 }
 performFeat(feat: string) {
   this._feats.push(feat);
   console.log(`I have just: ${feat}`);
 }
}

const Thor: SuperHero = new SuperHero("Thor", ["Stop Loki"]);
Thor.performFeat("Save the world");
console.log(Thor.getFeatsList());


// Abstract classes can be used as a type as well.

const Hulk: Hero = new SuperHero("Bruce Banner");
Hulk.performFeat("Smash aliens");
console.log(Hulk.getFeatsList());


// Trying to instantiate an abstract class directly is a compile-time error
const Loki: Hero = new Hero("Thor", ["Stop Loki"]);

As you can see, we can potentially use any of these when specifying a variable's type. So what should be used, and when? Let's sum it up.

Type aliases can be used to name both primitive and reference types: string, number, boolean, object. Unlike an interface, a type alias can’t be reopened and extended with new members after it has been declared.
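As a rough sketch of what aliases can name, the snippet below declares an alias for a primitive, one for a union, and one for an object shape. All of the names (HeroId, Alignment, HeroSummary, RankedHero, describeHero) are made up for this example. It also shows that, although an alias cannot be reopened later, a new alias can still be composed from an existing one with an intersection.

```typescript
// Aliases for a primitive, a union, and an object type (all names are hypothetical).
type HeroId = number;
type Alignment = "good" | "evil";
type HeroSummary = { id: HeroId; alias: string; alignment: Alignment };

// An intersection builds a new alias on top of an existing one.
type RankedHero = HeroSummary & { rank: number };

function describeHero(hero: RankedHero): string {
  return `#${hero.rank} ${hero.alias} (${hero.alignment})`;
}

const summary = describeHero({ id: 1, alias: "Thor", alignment: "good", rank: 3 });
console.log(summary);
```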

Interfaces can define only reference (object) types. The TS documentation recommends that we use interfaces for object type literals. Interfaces can be extended and can have multiple merged declarations, so users of your APIs may benefit from that. An interface is a “virtual” structure that never appears in compiled JavaScript.
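The “multiple merged declarations” point can be sketched as follows. HeroProfile and FlyingHeroProfile are hypothetical interfaces invented for this example: the two HeroProfile declarations are merged by the compiler into a single interface, and FlyingHeroProfile then extends the merged result.

```typescript
// Two declarations of the same (hypothetical) interface merge into one.
interface HeroProfile {
  alias: string;
}
interface HeroProfile {
  powers: string[];
}

// Extending creates a new interface on top of the merged one.
interface FlyingHeroProfile extends HeroProfile {
  maxAltitudeKm: number;
}

// The object must satisfy all three members to type-check.
const superman: FlyingHeroProfile = {
  alias: "Superman",
  powers: ["flight", "heat vision"],
  maxAltitudeKm: 100,
};
const profileLine = `${superman.alias}: ${superman.powers.join(", ")}`;
console.log(profileLine);
```

This is the same mechanism that lets consumers of a library add properties to an interface the library exports, without touching the library’s source.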

Classes, as opposed to interfaces, not only check how an object looks but ensure concrete implementation as well.

Classes allow us to specify the access modifiers of their members.

The TS compiler always transpiles classes to actual JS code, so they should be used when an actual instance of the class is created. Native ECMAScript classes can also be used as type definitions.

let numbersOnly: RegExp = /[0-9]/g;
let name: String = "Jack";

Abstract classes are really a mix of the previous two. As it’s not possible to instantiate them directly, you can only use them as the type of an instance created from a derived class, and that type only exposes the members declared on the abstract class, not any additional methods or properties the derived class provides.
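A small sketch of what using an abstract class as a type means in practice. Vehicle and Motorbike are hypothetical classes for this illustration only. A variable typed as the abstract class sees just the members declared there, and an instanceof check is one way to get back to the members of the derived class.

```typescript
abstract class Vehicle {
  constructor(public label: string) {}
  // Must be implemented by every derived class.
  abstract wheels(): number;
  // Shared implementation provided by the abstract class itself.
  describe(): string {
    return `${this.label} has ${this.wheels()} wheels`;
  }
}

class Motorbike extends Vehicle {
  wheels(): number {
    return 2;
  }
  // Extra member that exists only on the derived class.
  wheelie(): string {
    return `${this.label} pops a wheelie`;
  }
}

// Typed as the abstract class: only Vehicle's members are visible.
const ride: Vehicle = new Motorbike("Batcycle");
const description = ride.describe();
// ride.wheelie(); // compile-time error: 'wheelie' does not exist on type 'Vehicle'

// Narrowing with instanceof makes the derived members available again.
const trick = ride instanceof Motorbike ? ride.wheelie() : "no tricks";
console.log(description, "/", trick);
```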

ACCESS MODIFIERS

Unfortunately, JS doesn’t provide access modifiers, so you can’t create, for example, a real private property. It’s possible to mimic private property behaviour with closures and additional libraries, but such code looks a bit fuzzy and rather long. TS solves this issue just like any other object-oriented programming language. There are three access modifiers available in TS: public, private, and protected.
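Here is a minimal sketch of the three modifiers side by side. SecretIdentity and LocalHero are hypothetical classes invented for this example, and the password check is purely illustrative. Note that these modifiers are enforced by the compiler; the lines that would violate them are left as comments so that the snippet still compiles.

```typescript
class SecretIdentity {
  public alias: string;      // readable and writable from anywhere
  protected city: string;    // visible here and in derived classes
  private realName: string;  // visible only inside SecretIdentity

  constructor(alias: string, city: string, realName: string) {
    this.alias = alias;
    this.city = city;
    this.realName = realName;
  }

  // Public method that controls access to the private member.
  public reveal(password: string): string {
    return password === "alfred" ? this.realName : "classified";
  }
}

class LocalHero extends SecretIdentity {
  public patrolArea(): string {
    // Allowed: protected members are reachable from a derived class.
    return `${this.alias} patrols ${this.city}`;
  }
}

const batman = new LocalHero("Batman", "Gotham", "Bruce Wayne");
const patrol = batman.patrolArea();
const leaked = batman.reveal("wrong");
// batman.realName; // compile-time error: 'realName' is private
// batman.city;     // compile-time error: 'city' is protected
console.log(patrol, "/", leaked);
```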

PART 2. THE APPLICATION OR A DIVE INTO THE CODE.

So now that we know the tooling and have everything we need, we can build something great. For example, I would like to build the backend part of a MEAN (MongoDB, ExpressJS, Angular, NodeJS) stack: a simple RESTful service that will allow us to perform CRUD operations on some articles. As including all the code would make this post too long, I’ll skip some parts, but you can always check the full version in the GitHub repository.

For project structure, see below:

To make code more declarative, easier to maintain and reusable, I’ll take advantage of ES6 classes and split the application into logical parts. I’m leaving most of the explanation in the comments.

./classes/Server.ts

import * as express from "express";
import * as http from "http";
import * as bodyParser from "body-parser";
import * as mongoose from "mongoose";
import * as dotenv from "dotenv";
import * as logger from "morgan";

/* Create a reusable server class that will bootstrap basic express application. */

 export class Server {

 /* Most of the core properties below have their types defined by already existing interfaces. IDE users can jump directly to an interface definition by clicking on its name. */

 /* A protected member is accessible from derived classes. */
 protected app: express.Application;

 /* And here we are using http module Server class as a type. */
 protected server: http.Server;

 private db: mongoose.Connection;

 /* restrict member scope to Server class only */
 private routes: express.Router[] = [];
 /* This could also be written with generic-like syntax. Choose whichever looks better to you:
 private routes: Array<express.Router> = [];
*/

 /* public modifiers are default ones and could be omitted. I prefer to always set them, so code  style is more consistent. */
 public port: number;

 constructor(port: number = 3000) {
   this.app = express();
   this.port = port;
   this.app.set("port", port);
   this.config();
   this.database();
 }

 private config() {
  // set bodyParser middleware to get form data
   this.app.use(bodyParser.json());
   this.app.use(bodyParser.urlencoded({ extended: true }));
   // HTTP requests logger
   this.app.use(logger("dev"));
   this.server = http.createServer(this.app);

   if (!process.env.PRODUCTION) {
     dotenv.config({ path: ".env.dev" });
   }
 }

 /* A simple public method to add routes to the application. */
 public addRoute(routeUrl: string, routerHandler: express.Router): void {
   if (this.routes.indexOf(routerHandler) === -1) {
     this.routes.push(routerHandler);
     this.app.use(routeUrl, routerHandler);
   }
 }

 private database(): void {
   mongoose.connect(process.env.MONGODB_URI);
   this.db = mongoose.connection;
   this.db.once("open", () => {
     console.log("Database started");
   });
   mongoose.connection.on("error", () => {
     console.log("MongoDB connection error. Please make sure MongoDB is running.");
     process.exit();
   });
 }

 public start(): void {
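   // Listen on the http.Server created in config(), so that derived classes using this.server share the listening server.
   this.server.listen(this.app.get("port"), () => {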
     console.log(("  App is running at http://localhost:%d in %s mode"), this.app.get("port"), this.app.get("env"));
     console.log("  Press CTRL-C to stop\n");
   });
 }
}

export default Server;

I have set the server and app properties to “protected” because I want to keep them hidden from outside code, so it’s not possible to override or access them directly, while still keeping them reachable from derived classes. For example, if we want to add web socket support to our server, we can extend it with a new class and use the “server” or “app” properties as we need.

./classes/SocketServer.ts

import Server from "./Server";
import * as io from "socket.io";

class SocketServer extends Server {

 /* this.server of the parent Server class is a protected property, so we can access it to attach a socket. */
 private socketServer = io(this.server);

 constructor(public port: number) {
   super(port);
   this.socketServer.on('connection', (client) => {
     console.log("New connection established");
   });

 }
}
export default SocketServer;

Going back to the application.

./app.ts

import Server from "./classes/Server";
import ArticlesRoute from "./routes/Articles.route";

const app = new Server(8080);
const articles = new ArticlesRoute();
app.addRoute("/articles", articles.router);
app.start();

As we can have multiple kinds of articles (products) e.g. electronic, fashion, digital, etc. and they might have rather different sets of properties, I’ll create a base abstract class with a number of default properties that should be common for all types of articles. All other properties can be defined in derived classes.

./classes/AbstractArticle.ts

// put basic properties into abstract class.

import ArticleType from "../enums/ArticleType";
import BaseArticle from "../interfaces/BaseArticle";
import * as uuid from "uuid";
import Price from "../interfaces/Price";

abstract class AbstractArticle implements BaseArticle {
 public SKU: string;
 constructor(public name: string, public type: ArticleType, public price: Price, SKU: string) {
   // Fall back to a generated UUID when no SKU is supplied.
   this.SKU = SKU ? SKU : uuid.v4();
 }
}

export default AbstractArticle;

For this example, I’ll create a Shoe class that will derive from an AbstractArticle class and set its own properties.

./classes/Shoe.ts

import AbstractArticle from "./AbstractArticle";
import ArticleType from "../enums/ArticleType";
import Colors from "../enums/Colors";
import FashionArticle from "../interfaces/FashionArticle";
import Price from "../interfaces/Price";
import Sizes from "../enums/Sizes";

class Shoe extends AbstractArticle implements FashionArticle {
 constructor(public name: string,
             public type: ArticleType,
             public size: Sizes,
             public color: Colors,
             public price: Price,
             SKU: string = "") {
   super(name, type, price, SKU);
 }
}

export default Shoe;

You might have noticed that the Shoe class implements the FashionArticle interface. Let’s take a look at it and see how we can benefit from interfaces and the ability to extend them.

./interfaces/BaseArticle.ts

import ArticleType from "../enums/ArticleType";
import Price from "./Price";

interface BaseArticle {
 SKU: string;
 name: string;
 type: ArticleType;
 price: Price;
}

export default BaseArticle;

Interface extension allows us to build on our own interfaces by adding properties.

./interfaces/FashionArticle.ts

import Colors from "../enums/Colors";
import BaseArticle from "./BaseArticle";
import Sizes from "../enums/Sizes";

interface FashionArticle extends BaseArticle {
 size: Sizes;
 color: Colors;
}

export default FashionArticle;

We can also extend already existing interfaces. As an example, I’ll create a FashionArticleModel interface that extends both the Document interface from Mongoose and our FashionArticle interface, so we can use it when creating the database schema.

./interfaces/FashionArticleModel.ts

import { Document } from "mongoose";
import FashionArticle from "./FashionArticle";

interface FashionArticleModel extends FashionArticle, Document {}
export default FashionArticleModel;

Using the FashionArticleModel interface in the schema allows us to create a model with properties from both the Mongoose Document and FashionArticle interfaces.

./schemas/FashionArticle.schema.ts

import { Schema, Model, model} from "mongoose";
import FashionArticleModel from "../interfaces/FashionArticleModel";

const ArticleSchema: Schema = new Schema({
 name: String,
 type: Number,
 size: String,
 color: Number,
 price: {
   price: Number,
   basePrice: Number
 },
 SKU: String
});

// Use the Model generic from mongoose to create a model of the FashionArticleModel type.
const ArticleModel: Model<FashionArticleModel> = model<FashionArticleModel>("Article", ArticleSchema);
export {ArticleModel};

I hope this example application already shows how TypeScript can make your code more declarative, self-documenting, and potentially easier to maintain. Using TS is also a good exercise for frontend developers to learn and apply OOP paradigms in real-life projects, and backend developers should find many familiar practices and code constructs.

Finally, I would suggest jumping into the Articles route class to check the CRUD functionality of the application.

./routes/Articles.route.ts

import { Request, Response, Router } from "express";
import ArticleType from "../enums/ArticleType";
import Colors from "../enums/Colors";
import Shoe from "../classes/Shoe";
import Sizes from "../enums/Sizes";
import { ArticleModel } from "../schemas/FashionArticle.schema";
import FashionArticleModel from "../interfaces/FashionArticleModel";

class ArticlesRoute {
 public router: Router;

 constructor() {
   this.router = Router();
   this.init();
 }

 // Putting all routes into one place makes it easy to search for specific functionality
 // As these methods will be called in the context of a different object, we need to bind them to the current class instance.
 public init() {
   this.router.route("/")
     .get(this.getArticles.bind(this))
     .post(this.createArticle.bind(this));

   this.router.route("/:id")
     .get(this.getArticleById.bind(this))
     .put(this.updateArticle.bind(this))
     .delete(this.deleteArticle.bind(this));
 }
 // I'm not a huge fan of JavaScript callback hell, especially in NodeJS, so I'll use promises instead.
 public getArticles(request: Request, response: Response): void {
   ArticleModel.find()
     .then((articles: FashionArticleModel[]) => {
       return response.json(articles);
     })
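     .catch((error: Error) => {
       console.error(error);
       // Send an error response so the request does not hang.
       return response.status(400).json({ error: error });
     });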
 }

 public getArticleById(request: Request, response: Response): void {
   const id = request.params.id;
   ArticleModel
     .findById(id)
     .then((article: FashionArticleModel) => {
       return response.json(article);
     })
     .catch((error: Error) => {
       console.error(error);
       return response.status(400).json({ error: error });
     });
 }

 public createArticle(request: Request, response: Response): void {
   const requestBody = request.body;
   const article = new Shoe(requestBody.name, requestBody.type, requestBody.size, requestBody.color, requestBody.price);

   const articleModel = new ArticleModel({
     name:  article.name,
     type:  article.type,
     size:  article.size,
     color: article.color,
     price: article.price,
     SKU:   article.SKU
   });

   articleModel
     .save()
     .then((createdArticle: FashionArticleModel) => {
       return response.json(createdArticle);
     })
     .catch((error: Error) => {
       console.error(error);
       return response.status(400).json({ error: error });
     });
 }

 public updateArticle(request: Request, response: Response): void {
   const id = request.params.id;
   const requestBody = request.body;
   // FashionArticle is an interface and cannot be instantiated, so build a Shoe as in createArticle.
   const article = new Shoe(requestBody.name, requestBody.type, requestBody.size, requestBody.color, requestBody.price, requestBody.SKU);

   ArticleModel.findByIdAndUpdate(id, article)
     .then((updatedArticle: FashionArticleModel) => {
       return response.json(updatedArticle);
     })
     .catch((error: Error) => {
       console.error(error);
       return response.status(400).json({ error: error });
     });
 }

 public deleteArticle(request: Request, response: Response): void {
   const articleId = request.params.id;
   ArticleModel.findByIdAndRemove(articleId)
     .then((res: any) => {
       return response.status(204).end();
     })
     .catch((error: Error) => {
       console.error(error);
       return response.status(400).json({ error: error });
     });
 }
}
export default ArticlesRoute;
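The route class binds every handler with bind(this) before handing it to the router. The sketch below illustrates why, without using Express at all: ArticleHandlers and callHandler are hypothetical stand-ins, where callHandler plays the part of a router that receives a bare function reference and invokes it later.

```typescript
class ArticleHandlers {
  private prefix = "article";

  public describe(id: string): string {
    // Relies on "this" pointing at the ArticleHandlers instance.
    return `${this.prefix}:${id}`;
  }
}

// Stand-in for a router: it only receives a function reference and calls it later.
function callHandler(handler: (id: string) => string, id: string): string {
  try {
    return handler(id);
  } catch (err) {
    return "handler lost its this";
  }
}

const handlers = new ArticleHandlers();

// Without bind, "this" is not the instance when the caller invokes the handler,
// so the call either throws or reads the wrong object.
const unboundResult = callHandler(handlers.describe, "42");

// With bind, "this" is fixed to the instance regardless of who calls the function.
const boundResult = callHandler(handlers.describe.bind(handlers), "42");

console.log(unboundResult, "/", boundResult);
```

Declaring the handlers as arrow-function properties would be another way to keep “this” attached, at the cost of creating one function per instance.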

In conclusion, TypeScript is a powerful tool that brings a really flexible, rich type-checking system to your code. It also introduces well-known constructs like interfaces, abstract classes, and access modifiers.

Of course, the application is not ready for production use, as we have to cover everything with tests and set up a proper development environment, but we can cover that in the future.