OpenTelemetry for JavaScript Observability at Zalando

How Zalando improved observability for Node.js and web applications using OpenTelemetry

Mohit Karekar

Senior Software Engineer

Posted on Jul 29, 2024


"What’s happening inside my application?" - an age-old question bothering anyone who deploys a software service. Packaging source code for an application makes it a black box for its users who can only interact with it through explicitly available APIs. Fortunately, we’ve had several developments in the field of observability in recent years that help us peek into this black box and react to anomalies.

OpenTelemetry has become the widely accepted open standard for application observability across the software engineering community. It evolved from the earlier OpenTracing project, which introduced standards for distributed tracing, and brought all observability signals under one umbrella with common specifications and implementations. At Zalando too, OpenTelemetry is the adopted standard for observability, and our platform teams provide SDKs in several languages for engineers to instrument their applications.

For applications running in a JavaScript environment, though, the story was quite different. We have a significant number of Node.js applications, and before 2023 their observability state was quite poor. During an incident, on-call responders would try to locate the root cause of the issue only to find that some applications in the request flow had no instrumentation at all. In one particularly memorable example, we had almost zero visibility into what the affected application was doing, which made understanding the root cause more difficult than it should have been.

Often, the reason for the missing visibility was not the complexity of implementing it, but rather the "mundane" effort engineers would have to put in. The impact of good observability is often intangible, which can lead to complacency on the part of service owners. We wanted to solve this problem without adding operational burden to already busy engineering teams.

Standardised Node.js Observability

At the end of 2022, the SRE Enablement and the Web Platform teams at Zalando collaborated to build a Node.js observability SDK based on OpenTelemetry. Observability SDKs had already proven successful at Zalando, providing several advantages:

  • Automatic configuration including out-of-the-box environment variable parsing.
  • Standard semantic conventions and APIs across languages.
  • Built-in auto-instrumentations and platform-specific metrics.
  • Central control over use/restriction of features, e.g. security and compliance.

Our Node.js Observability SDK is a small wrapper on top of the open-source core OpenTelemetry packages; it adds Zalando-specific configuration and acts as a proxy for all underlying dependencies. We also decided to provide a set of critical Node.js metrics by default: CPU and memory usage, garbage collection metrics, and event loop lag. SDK users can set a boolean flag in the initialisation configuration to enable HTTP instrumentation, and optionally Express.js instrumentation. Moreover, the SDK can be initialised in a single statement.

import { SDK } from "@zalando/observability-sdk-node";

new SDK().start();

The SDK constructor takes an optional configuration argument, but thanks to the platform environment variables made available to every application deployed on Kubernetes at Zalando, the SDK autoconfigures itself from those values. Calling the start() function enables several features in the background:

  • Auto-instrumentations are registered, e.g. HTTP functions are monkey-patched to record span data during various network calls.
  • In-built metric collection is enabled at a configured interval.
  • Span and metric exporters are enabled to export telemetry data at a specified interval to the telemetry backend.
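
For illustration, opting into HTTP and Express.js instrumentation could look something like this (a sketch; the option names are assumptions, not the SDK's documented configuration):

import { SDK } from "@zalando/observability-sdk-node";

// Option names are illustrative, not the SDK's actual configuration keys.
new SDK({
  instrumentations: {
    http: true, // monkey-patch Node.js http client/server calls
    express: true, // optionally add Express.js middleware spans
  },
}).start();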

Providing these fundamental capabilities in the SDK out-of-the-box made it easy to instrument Node.js applications, and the SDK saw a good rate of adoption.

Still blind on the Client

While we as a company were improving in terms of server-side observability, observability on the client side was still a distant concept for us. Before 2023, we had only baseline operational visibility into how our web applications were performing in our customers’ browsers, with Sentry error logging being the only tool in our arsenal. While console error logging helps, it does not tell you much about why an issue occurred.

One example of a case where we needed this kind of visibility was our web checkout experience. There were known instances of a small portion of incoming requests being blocked by our web application firewall (WAF) during checkout because it flagged them as coming from bots. At times these requests came from genuine customers, and there was no way to detect this because our tracing spans began on the server, specifically at the proxy level (Skipper). The only way to know how many customers were affected would have been to connect a user interaction (e.g. a button click) to an incoming, or missing, request at our proxy.

Taking inspiration from the server-side efforts in improving Node.js observability, we decided to start developing a web observability SDK, using corresponding OpenTelemetry packages.

Things are tricky on the Client side

We bootstrapped a minimal SDK for web applications at Zalando that exposed tracing and metric collection APIs. Thanks to one of the early contributors to the Node.js SDK, we had already separated types and APIs into an independent package. This API package was then used to implement both the Node.js and web SDKs. This structure became especially useful when instrumenting isomorphic applications – those that run on both the server and the client.

@zalando/observability-api
@zalando/observability-sdk-node
@zalando/observability-sdk-browser
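
Shared instrumentation code can then depend on the API package alone and run under either SDK; a sketch (the Tracer type and its methods are assumed here for illustration):

// Illustrative: shared code imports only from the API package, so the
// same instrumentation runs on both server and client.
import type { Tracer } from "@zalando/observability-api";

export function traceEntityResolution(tracer: Tracer) {
  const span = tracer.startSpan("resolve_entities");
  // ... resolve entities; works under either the Node.js or browser SDK ...
  span.finish();
}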

While developing the SDK, we realised that instrumenting on the client side involves more operational challenges than technical effort. These are the peculiarities of instrumenting on the web versus the server:

Performance Implications

On the web every byte counts, so adding instrumentation packages increases the page payload and can hurt your website's performance. In the past, we tried to integrate some telemetry packages only to realise they added about 400 KB to the page size! There are ways to load these packages asynchronously, but some features are easiest to implement when they run in the critical page load path (e.g. tracing the page load, generating propagation context for API requests).
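
A common pattern, sketched below, is to keep only the critical pieces in the initial bundle and pull in the rest after the load event; the module path here is illustrative:

// Defer non-critical instrumentation until the page has finished loading.
window.addEventListener("load", () => {
  import("./deferred-instrumentation").then(({ registerExtraInstrumentations }) => {
    registerExtraInstrumentations();
  });
});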

We found the OpenTelemetry packages to be very customisable, and in the end we could cherry-pick the packages we considered crucial for the initial load and delay-load everything else. Overall, we added about 30 KB to our page size. While developing the SDK, we also came across Grafana Faro, a similar implementation of frontend observability by Grafana. If you are starting from scratch, it's a great package to check out.

Additionally, we made the telemetry network requests as non-critical as possible by using navigator.sendBeacon(), which hands the data to the browser to deliver asynchronously without competing with the page's own requests.
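
A minimal sketch of such an export, assuming a hypothetical /telemetry/v1/traces endpoint:

// sendBeacon queues the payload for the browser to deliver in the
// background, even while the page is unloading.
function exportTelemetry(payload: object): void {
  const body = JSON.stringify(payload);
  if (navigator.sendBeacon("/telemetry/v1/traces", body)) {
    return;
  }
  // Fallback: keepalive lets a fetch outlive the page as well.
  void fetch("/telemetry/v1/traces", { method: "POST", body, keepalive: true });
}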

Sending Telemetry Data and User Consent

The next challenge is where to send the data from the browser, and whether you are allowed to send it at all. On the server side this is easy, since the services receiving telemetry data are usually deployed in the same cluster and no special configuration is required for the host application. On the client side, though, the data has to travel over the public internet, so you need a publicly accessible endpoint. We used our edge proxy (Skipper) to route frontend telemetry to an internal collector. This also allowed us to implement endpoint protection measures such as rate limits. To support adoption of the SDK in other applications, we also provided a custom template to deploy a proxy that acts as a telemetry backend.
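
With such a proxy in place, the browser-side exporter only needs to know the public route. A sketch using the standard OTLP HTTP exporter (the URL is a placeholder, not our actual route):

import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// The public endpoint is served by the edge proxy, which forwards
// telemetry on to the internal collector.
const exporter = new OTLPTraceExporter({
  url: "https://www.example.com/telemetry/v1/traces",
});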

Collecting data from customers’ browsers requires their explicit consent under GDPR. We had to be mindful while exporting telemetry data – sending the export request only if the user had consented.
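
One way to express this is a thin wrapper around the span exporter; a sketch, assuming a hypothetical hasTelemetryConsent() helper from the consent layer:

import { SpanExporter, ReadableSpan } from "@opentelemetry/sdk-trace-base";
import { ExportResult, ExportResultCode } from "@opentelemetry/core";

// Provided by the consent layer; hypothetical.
declare function hasTelemetryConsent(): boolean;

// Drops telemetry silently unless the user has consented.
class ConsentGatedExporter implements SpanExporter {
  constructor(private inner: SpanExporter) {}

  export(spans: ReadableSpan[], done: (result: ExportResult) => void): void {
    if (!hasTelemetryConsent()) {
      done({ code: ExportResultCode.SUCCESS });
      return;
    }
    this.inner.export(spans, done);
  }

  shutdown(): Promise<void> {
    return this.inner.shutdown();
  }
}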

Unprecedented Visibility

Early this year, we rolled out the integration of the web SDK in Zalando’s web framework – Rendering Engine. To start with, we traced runtime operations of the framework, e.g. page load, entity resolution and AJAX requests. Telemetry started coming in, and with it an unprecedented visibility into client-side operations.

Figure: Page load operations traced on the client

Figure: A client-side trace

Leveraging the Framework

Rendering Engine is an expressive web framework built around the concept of “renderers”: independent units that declare their data dependencies and UI. We decided to expose the capabilities of the SDK through platform APIs, allowing frontend developers to trace custom operations inside renderers. At a high level, this is how the API looks for a filter update operation on the client:

/*
 * This is a renderer in Rendering Engine.
 */
export default view()
  .withQueries(...)
  .withProcessDependencies(...)
  .withRender(props => {
    /*
     * withRender() is where the React component for the renderer is
     * declared; props.tools.observability carries the tools related to
     * client-side observability.
     */
    const traceAs = props.tools.observability.traceAs;

    const fetchFilteredProducts = (filter: string) => {
      // Create a span covering the whole asynchronous operation.
      const span = traceAs("fetch_filtered_products");
      span.addTags({ href: window.location.href });
      serviceClient.get(`/search?q=${filter}`)
        .then(res => {
          // process response
        })
        .catch(err => {
          // Mark the span as errored before finishing it.
          span.addTags({ error: true });
        })
        .finally(() => {
          span.finish();
        });
    };

    return (
      <div>
        <button onClick={() => fetchFilteredProducts("shoes")}>Fetch Shoes</button>
      </div>
    );
  });

The traceAs function lets renderer developers create a new span for a specific operation. The span can then be tagged with attributes, passed between functions, and used to create child spans.
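
Under the hood, traceAs can be a thin adapter over the OpenTelemetry tracer. A sketch of one plausible shape (illustrative, not the SDK's actual implementation):

import { trace, Attributes } from "@opentelemetry/api";

const tracer = trace.getTracer("rendering-engine");

// Adapts the OpenTelemetry span API to the addTags()/finish()
// surface used inside renderers; illustrative only.
export function traceAs(operationName: string) {
  const span = tracer.startSpan(operationName);
  return {
    addTags: (tags: Attributes) => span.setAttributes(tags),
    finish: () => span.end(),
  };
}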

This API allowed us to trace crucial client-side operations that are asynchronous in nature. Previously, we depended on the status of incoming HTTP requests triggered by user interactions, an indirect, “pseudo” way of determining service health. Instrumenting user interactions directly made the visibility much more “real”.

Web Performance Metrics

For the web shop, we already had real user monitoring (RUM) in place to collect various web performance metrics, including the web vitals. These metrics form a crucial part of our experimentation strategy when we release features that might impact page performance. Our existing infrastructure was custom-built, with a service collecting and aggregating metrics and storing them in a database. While this worked well over the years, it lacked the flexibility to add custom attributes to the collected metrics, which made correlating regressions with features difficult.

With the SDK already in the frontend application, we decided to enable OpenTelemetry metrics on the client side. Since most of the implementation for recording metrics was already present, we only had to create a new OpenTelemetry exporter. The feature was quickly rolled out, and we started receiving core web vitals (FCP, LCP, INP, CLS) tagged with numerous attributes.
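
As a sketch of the general idea, web vital callbacks can feed OpenTelemetry histograms directly, here using the open-source web-vitals package (the attribute names are illustrative):

import { metrics } from "@opentelemetry/api";
import { onCLS, onFCP, onINP, onLCP } from "web-vitals";

const meter = metrics.getMeter("web-vitals");

// One histogram per vital; each page load records a single value.
const lcp = meter.createHistogram("lcp", { unit: "ms" });
const fcp = meter.createHistogram("fcp", { unit: "ms" });
const inp = meter.createHistogram("inp", { unit: "ms" });
const cls = meter.createHistogram("cls");

onLCP(metric => lcp.record(metric.value, { page_type: "catalog" }));
onFCP(metric => fcp.record(metric.value, { page_type: "catalog" }));
onINP(metric => inp.record(metric.value, { page_type: "catalog" }));
onCLS(metric => cls.record(metric.value, { page_type: "catalog" }));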

One immediate application of these metrics was measuring the performance impact of the newly created “designer” experience. These pages consist of some complex client-side animations and visualisations, and the owning team wanted to measure how they affected overall web performance. We added a new attribute denoting the current experience on the page, and soon we could group and filter metrics by this attribute, thanks to the existing tooling from ServiceNow Cloud Observability (previously Lightstep), built according to the OpenTelemetry specifications.

Figure: LCP per experience

A side-effect of this is that we no longer need our custom metric collection setup and can happily decommission it soon.

Challenges

Our adoption of OpenTelemetry had its own set of challenges.

Migration from OpenTracing

While most of the concepts in OpenTelemetry are similar to OpenTracing, the language SDKs have different APIs from the corresponding OpenTracing implementations. The new APIs make it difficult to migrate existing instrumentation code, especially in a large codebase like ours. For example, the JavaScript OpenTelemetry SDK uses context to track the currently active span, whereas in OpenTracing you had to pass the span object around manually between functions. The context approach is really useful, but we found that for an application already instrumented the OpenTracing way, it is rather a source of frustration.

// 1. Starting a span in OpenTracing
const span = tracer.startSpan("name");
await callOtherFunction(span);

// 2. Starting an active span in OpenTelemetry; the span is passed to
// the callback and is the active span for everything inside it
await tracer.startActiveSpan("name", async (span) => {
  await callOtherFunction(span);
});

// 3. Starting a span with a custom context in OpenTelemetry
const context = getContextFromSomewhere();
const span = tracer.startSpan("name", {}, context);
await callOtherFunction(span);

We ended up not using context, as migrating from OpenTracing was easiest that way (approach 3 above).

OpenTelemetry's Node.js packages use AsyncLocalStorage to propagate context values through asynchronous function calls. On the web, though, there is no such runtime API (yet), and the same has to be achieved with Zone.js, which monkey-patches global functions. We are not big fans of this, especially in the customer's browser, and hence opted out of context on the client side as well, falling back to manual passing of span objects.
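
For reference, this is roughly what the context-based parenting we opted out of looks like with the OpenTelemetry API:

import { context, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("example");

const parent = tracer.startSpan("parent");
// Everything inside the callback sees "parent" as the active span; on
// Node.js this works across awaits thanks to AsyncLocalStorage.
context.with(trace.setSpan(context.active(), parent), () => {
  const child = tracer.startSpan("child"); // parented automatically
  child.end();
});
parent.end();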

Metrics & Bucketing

Collecting metrics on the client side is tricky because they are usually standalone values sent only once per page load (e.g. core web vitals). To obtain a percentile distribution of, for example, the Largest Contentful Paint, a histogram instrument has to be used to record LCP values. These instruments are primarily designed to record values over time, e.g. memory usage on servers, and use value "buckets" to count how many recorded values fall into each range. By default, the OpenTelemetry JavaScript histogram declares the following buckets:

[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000];

Using these default buckets would give a skewed view of the recorded values, since the histogram sorts each value into the closest bucket and the natural range varies substantially between metrics. E.g. LCP usually falls between 600 and 2000 milliseconds, while cumulative layout shift (CLS) ranges between 0 and 1.

Our solution was to define custom buckets using a view, which OpenTelemetry supports as a custom aggregation method. The API has recently been improved to make this easy to do when creating a histogram. Our buckets for client-side metrics looked as follows:

const metricBuckets = {
  fcp: [
    0, 100, 200, 300, 350, 400, 450, 500, 550, 650, 750, 850, 900, 950, 1000,
    1100, 1200, 1500, 2000, 2500, 5000,
  ],
  lcp: [
    0, 100, 200, 300, 400, 500, 550, 650, 750, 850, 900, 950, 1000, 1050,
    1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650,
    1700, 1800, 1900, 2000, 2500, 5000,
  ],
  cumulativeLayoutShift: [
    0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25, 0.275,
    0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95,
    1, 1.25, 1.5, 1.75, 2,
  ],
  ...
};
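
Wiring these buckets into the SDK can be done with views from @opentelemetry/sdk-metrics; a sketch (the exact API depends on the SDK version):

import {
  ExplicitBucketHistogramAggregation,
  MeterProvider,
  View,
} from "@opentelemetry/sdk-metrics";

// Each view overrides the default bucket boundaries for one instrument.
const meterProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: "lcp",
      aggregation: new ExplicitBucketHistogramAggregation(metricBuckets.lcp),
    }),
    new View({
      instrumentName: "cls",
      aggregation: new ExplicitBucketHistogramAggregation(
        metricBuckets.cumulativeLayoutShift
      ),
    }),
  ],
});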

I got a chance to discuss this limitation with OpenTelemetry contributors at the recent KubeCon in Paris, and we also talked about the events API, which could potentially be a solution for browser-based metrics.

Next Steps

With the newly obtained visibility on the client side, we plan to introduce new critical business operations (CBOs) and modify existing ones to reflect the operations more realistically. CBOs are operational markers for the health of an important user feature. Moving them to the client side helps us track their health at a finer level.

Take the catalog page (a.k.a. the product listing page): applying a filter is a critical user feature that allows customers to narrow down their search during their shopping journey. If they are not able to filter, they might drop out, which would negatively affect the business. Client-side tracing provides visibility into this segment of the user journey, as it happens mainly in customers’ browsers. Measuring the health of this operation and alerting on anomalies helps us detect issues in customer journeys faster, allowing us to deliver a high quality of service for our customers.


We're hiring! Do you like working in an ever evolving organization such as Zalando? Consider joining our teams as a Frontend Engineer!


