Node.js and the tale of worker threads

Join me on a Friday night on-call investigation into a rogue Node.js service.

photo of Jeremy Colin
Jeremy Colin

Engineering Manager

Posted on Jul 25, 2024

A disrupted gaming night

I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did.

I was about to start another session of Elden Ring (a video game in which everything is pretty much trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it". I joined a call and was then told that the observed impact was due to one of the dependencies: the translation service, for which my on-call rotation was responsible for. The translation service was indeed very slow to respond (its p99 latency had increased from 100ms to 500ms) and its error rate had gone from 0 to 4%. This did not really explain why the service calling us (the campaign service) was on a cloud resource consumption spree.

Increased latency on translation service

We started with distributed tracing, however the campaign service was not instrumented so we could not get much out of our tracing tooling. We did see some context cancelled error messages on our request spans which usually means that the connection was unexpectedly closed from the client side. We quickly moved on to logging and sure enough, we found the same evidence in the translation service logs: java.lang.IllegalStateException: Response is closed

We are relatively well instrumented at Zalando in terms of operations, especially with built-in Kubernetes dashboards. Using our Kubernetes API Monitoring Clients dashboard we confirmed that the calling service (the campaign service) was misbehaving and instead of its usual 1 000 requests per minute to the translation service, it was making over 20 000 requests per minute.

Increased requests on translation service

It looked like the campaign service was effectively increasing the pressure on our translation service. This meant that our translation service was then slower to respond and sometimes not responding at all, which in turn somehow increased the amount of requests that the campaign service was making and the cloud resources it was consuming.

We were looking at a positive feedback loop that was destabilising both systems. Fortunately for us, the effects of the loop were eventually stopped at some point when both systems reached their allocated cloud limits, memory for the campaign service and the maximum number of replicas for our translation service. This had been going on for several hours as the campaign service is not on the critical path of the customer journey, so 4% was a slow burn error and we were only paged because the team that owns the service started investigating this anomaly and found this interaction with our translation service.

In an attempt to resolve the situation, we reduced the number of pods for the campaign service and allowed our translation service to scale up, and sure enough the situation improved by itself in a matter of minutes. As I was about to pick up my game controller again, I took one last look at the graphs and, lo fand behold, the error rate was back up and the positive feedback loop had resumed, as if in defiance of my gaming night.

Increased latency on translation service again

Not so fast Tarnished

In Elden Ring, you have to retry boss fights quite a lot so I rolled up my sleeves and started investigating again. This time, resolved to understand the systems' "patterns".

Taking another look at the campaign service logs was quite interesting to say the least. Yes, it did start with a bunch of request failed, read timeout but then it was followed by a lot of logs like Worker fragment (pid: 51) died and Worker 549 started. When I say a lot, I mean A LOT, more than 20 per second in total.

Campaign service logs

At this point, we needed to understand where they were coming from and yes, I started reading the code on github. We were dealing with a simple Node.js application. The entry point was a file called cluster.js and the first thing it did was get the number of CPUs from the OS and spawn a worker for each CPU core.

const cluster = require("cluster");
const numCPUs = require("os").cpus().length;

// master wrapper
if (cluster.isMaster) {
  console.log(`Master ${process.pid} is running`);
  console.log(`CPU Total ${numCPUs}`);

  // fork workers
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on("exit", (worker) => {
    // when worker exits
    console.warn(`Worker fragment (pid: ${worker.process.pid}) died`);
    cluster.fork();
  });
}

Pretty smart right? Node.js is single-threaded and you don't want to leave those precious CPU cores idle. Well that depends on where your code is running!

Following a migration, this service was now running on Kubernetes with pods requesting the equivalent of 1 CPU unit, so far so good. However, when called inside a Kubernetes container, the os.cpus().length method returned the number of cores available on the host machine, instead of the amount of CPU allocated to the container by Kubernetes. At this point, the campaign service was running on machines with 48 cores so it was spawning a whopping 48 processes (yes, for Node.js, this is a whopping number, I can see you Golang people judging us). In fact, using cluster mode for Node.js in a Kubernetes environment is discouraged because Kubernetes can help you do this in a simple way out of the box, for example by setting cpu request to 1000m to allocate one CPU core per pod.

Another interesting thing we could read in cluster.js was that when a worker thread exited, it immediately spawned another one, and we could quickly sense how this could lead to a dangerous situation. Well, while that explained the high number of workers and the logs we saw above, it still didn't explain why they kept exiting and spawning.

Enter translation-fetcher.js, a file that exposes a method to fetch translations from a remote API (our translation service for which I was paged for). Interestingly, when the fetch call fails, the catch clause calls process.exit(1).

TranslationManager.fetchAll()
  .then((data) => {
    const fallbackFilename = "./fallback-translations.json";
    fs.writeFileSync(fallbackFilename, JSON.stringify(data, null, 4));
  })
  .catch((e) => {
    console.error(e);
    process.exit(1);
  });

So there we had it! We had 48 forked worker processes, most of which were exiting, respawning and trying to fetch translations again on startup. We felt pretty confident that we understood what was happening as this was the only place in the whole codebase where a worker thread could exit. We also concluded that the fact that our translation service was slower to respond and sometimes not at all was what fed the positive feedback loop I described above. Indeed, if the call to the translation service failed, the worker thread was killed and a new one spawned, triggering a new call to the translation service, and so on.

Now it was time to patch the issue, so I could get back to being slain by monsters in my video game. We updated the service to no longer use cluster mode as in fact a few pods would be more than able to handle the load even at peak traffic. We struggled to deploy the service to production as it hadn't been deployed for a while and we were missing some permissions, but that's too boring a story to go into. Once the service was deployed, the number of requests to our translation service dropped from 20 000 requests per minute to 100 requests per minute and the health of our translation service quickly recovered and the service even scaled down. What happened next in Elden Ring will stay in Elden Ring.

Digging deeper

Fast forward to Monday, we start working on a detailed post-mortem analysis describing what I wrote above and I decide to write this up as our Site Reliability Engineering (SRE) team loves to hate on Node.js. When I get to the part where I talk about translation-fetcher.js, I get perplexed. It does not really make sense to call process.exit() in a live environment.

Also what about the response closed and context cancelled errors we were seeing, they did not match our current understanding of the worker being killed after the call itself failed. As I hate to share something I do not have a very good understanding of, I dove once more into the campaign service code and, lo and behold, I found a huge oversight we had made on that Friday night. The translation-fetcher.js code was not being called in the live environment, it was another file, obviously called translation.js that was being called on application startup, still calling our translation service but returning fallbacks if the call failed.

function initTranslations() {
  return TranslationManager.fetchAll()
    .catch((error) => {
      console.error("catch error", error);
      return fallbackTranslations;
    })
    .then((initialData) => {
      return manager.watch(initialData);
    });
}

const server = initAndPrefetchOAuth()
  .then(() => initTranslations())
  .then(() => {
    app.listen(PORT);
    log.info("Server started");
  });

So there never was a positive feedback loop with the translation service, it was all up in our heads and I felt a bit stupid about it.

What was happening then? We still didn't understand why workers were being killed and respawning, which led to a very high amount of requests to our translation service. That did, however, put the focus back on the campaign service: what happened at 2am on that Friday that had never happened before, destabilising the service? Could the logs tell us more?

Luckily for me, someone was curious about the number of CPUs allocated and as the main application started, the code was logging the number of CPUs in the machine. So I scanned the last 30 days of logs and got the history of the number of CPUs for the allocated machines: it was always 4, 8 or 16. Well, except for last Friday but also on the 6th of April 2022 at 10:49 when the AWS gods had gifted us a 48 cores machine, interesting... What was the state of the application at that point in time? Well it wasn't great, one pod, unsurprisingly allocated in the node (machine) with the 48 cores, was over-utilising both its CPU and memory allocations and was repeatedly being killed. At the exact same time, our beloved translation service had also begun to consume more resources, scaling massively from 4 to 20 pods despite only receiving twice as many requests.

Why did it not escalate at this point? Because once the pod was killed, despite being replaced twice by a pod on the same 48 cores node, the third time it was replaced by a pod on a different node with only 16 cores. The campaign service generously requested 2GB of memory for each of its pods so with 4, 8 or 16 cores, it was only spawning 2, 4 or 16 extra workers on top of the main process. It turns out that the campaign service application process needs around 120 MB of memory to run properly so it was painfully able to accommodate up to 16 cores, but 48 cores meant that each process only had around 40 MB of memory each (which is still 10 000 times more than the Apollo guidance computer that got us to the moon by the way) and around 20m CPU ("twenty millicpu"), which is really not that much for a single thread.

At this point, I still did not understand why the node thread workers kept dying, although I had an intuition that it was due to the low amount of resources available but I could not see any garbage collection or memory issues in the stack traces. I decided to run the service locally, updated the cluster file to spawn 50 worker threads regardless of the number of CPUs, built it and started it in a Docker container. At first, I gave the container a single core from my 5 year old Macbook and despite being excruciatingly slow, every worker thread spawned and triggered its initial request to get the translations. I repeated the operation, this time giving the container only 1000MB of memory and sure enough, after spawning around half of the workers, I saw the same logs as in production: Worker fragment (pid: 1) died, Worker 31 started.

That was the aha moment, up to that point, I was expecting a clue as to why a worker would be killed by Node.js, but it never came and it turns out that Node.js simply starts killing worker threads when it needs to reclaim memory. And if you remember the code in cluster.js, immediately after the worker thread exited, the application spawned another one, so we end up with lots of worker threads spawning and dying in quick succession, living just long enough time to say hello to our translation service. This also explains very well the context cancelled errors we saw in the translation service, because when the worker thread dies, the socket it created unexpectedly hangs up. It also explains well the read timeout errors in the campaign service as the processes did not have enough time (due to their very low CPU resource allocation) to read the translation service response. Unfortunately, this information was not readily available to us because the campaign service did not instrument its event loop lag, the degradation of which is a common root cause of API call read timeouts.

Building better observability

This story happened back in April 2022 and was one of the motivations for developing a Zalando Observability SDK for Node.js. Two years later, we have 53 Node.js applications instrumented with the SDK, which means that investigating incidents involving Node.js is now easier with common signals readily available. This will be the topic of a subsquent blog post, stay tuned!


We're hiring! Do you like working in an ever evolving organization such as Zalando? Consider joining our teams as a Frontend Engineer!



Related posts