End-to-End Latency Challenges for Microservices

Read about Typhoon, our open source project to assess distributed software architecture.

Software Developer

Posted on Aug 15, 2016

There is pressure to define a global platform architecture and to distill the core business concepts within it. Microservices are an appropriate design style for this goal: they let us evolve systems in parallel, make things look uniform, and implement stable and consistent interfaces across the system. Unfortunately, this architecture style brings additional complexity and new problems. Network latency is crucial for online businesses because it has a direct impact on sales.

Latency is an important part of Quality of Service and determines the degree of consumer satisfaction. End-to-end latency is one of the user-oriented characteristics used for quality assessment of distributed software architecture. The ultimate goal is the ability to quantitatively evaluate the architecture and make trade-offs in it to ensure competitive end-to-end latency of software solutions. Therefore, we have created Typhoon, an open source project for assessing distributed software architecture.

Why Typhoon was developed

Typhoon helps us to solve a series of short-term and long-term decision problems. For example, short-term decisions include the determination of optimal software and infrastructure configuration; long-term decisions concern the development and extension of data and service architectures, choice of technologies, or runtime environments. Typhoon helps us control the actual end-to-end latency and specify emergency actions when systems are overloaded or technical faults occur.

Typhoon is a distributed system stress and load testing tool. It simulates traffic from a test cluster towards a system under test. Its purpose is the validation of system performance and scalability while spawning a huge number of concurrent sessions. The tool provides out-of-the-box, cross-platform solutions to investigate the protocols and latencies of microservices.

We evaluated a few existing solutions against our needs (our major requirements are listed below) but never found the right fit. Therefore, we created our own tool, Typhoon.

Latency. We are looking at sources of latency through the prisms of infrastructure, protocol, and application. A deep dive is required to understand and approximate latencies in each domain. We need to know network delay, round trip time, a protocol’s handshake latency, time-to-first-byte, and time-to-meaningful-response. Typhoon evaluates protocol overhead by approximating packet metrics and estimates application performance and scalability.

Realtime. We are looking for a cost-efficient, distributed solution suitable for spawning a huge number of concurrent sessions. It is important to mitigate any bottlenecks within the tool and ensure responsiveness of the runtime environment. Another aspect of this is real-time streaming and analysis of measurements.

Visualization. The time-series data visualization crisis is well depicted by Mike Bostock. The proposed visualization technique, cubism.js, improves the readability and analysis of latencies reported by the tool. We are looking for highly adaptable and customizable visualization, preferably based on D3.

Usability. Being easy to deploy and configure are mandatory requirements for us. We are looking for a zero-config solution, scalable up to dozens of individual nodes hosted in a cloud environment. The tool should offer a sophisticated approach to define workload scenarios. We believe a pure functional language is the best approach to express artificial behaviour.

Two strong candidates were evaluated: Tsung and Locust. They are widely known by the community as load testing frameworks. However, they do not meet our needs: detailed latency analysis and customizable visualization are key features for us. This led us to develop Typhoon with a focus on latency, visualization, and usability.

Latency challenge

The latency challenge has existed since the beginning of distributed computing. Transparent end-to-end communication involves various technologies and communication principles.

Infrastructure. Software architectural decisions should account for the complexity of the underlying network infrastructure. The Internet is not a single network; it is a series of heterogeneous systems composed of backbone networks, infrastructures managed by service providers, and various edge/access networks.

The infrastructure appears as a system that makes peers wait. This waiting time consists of network and transmission delays: the network delay is the time from when message delivery is requested until that message begins delivery at the remote end; the transmission delay is the time from when the message begins delivery until delivery is completed.

Short interactive scenarios such as client-service interaction concern network delay. Typhoon uses round-trip-time to estimate network delay.

Long interactive scenarios involve data transfer, streaming, etc. This interaction concerns network transmission. We are using packet metrics to approximate latency experienced by applications due to infrastructure.

Protocols. The rise in popularity of microservices architecture introduces new challenges, such as network latency and the overhead of communication protocols. Protocol internals have a significant impact on end-to-end latency in a heterogeneous network environment when communication is constrained by network delay, packet loss, and the capacity of network equipment.

Connection-oriented protocols require a handshake procedure before data transmission. Typhoon measures the latency required to establish TCP and TLS connections and approximates a packet rate that shows the efficiency of the techniques used by the protocol to provide value-added services (e.g. reliable communication, data integrity, overflow, etc).
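To make the handshake measurement concrete, here is a minimal sketch of how TCP connection setup time could be sampled from Erlang. It is not Typhoon's implementation: the module name handshake_latency and its functions are hypothetical, and the demo connects to a listener on the loopback interface so that it runs without any external service.

```erlang
-module(handshake_latency).
-export([connect_time/3, demo/0]).

%% Time, in microseconds, needed to establish a TCP connection.
%% The socket is closed right away; only the handshake is timed.
connect_time(Host, Port, Timeout) ->
   T0 = erlang:monotonic_time(microsecond),
   case gen_tcp:connect(Host, Port, [binary, {active, false}], Timeout) of
      {ok, Sock} ->
         T1 = erlang:monotonic_time(microsecond),
         ok = gen_tcp:close(Sock),
         {ok, T1 - T0};
      {error, Reason} ->
         {error, Reason}
   end.

%% Self-contained demo: listen on an ephemeral loopback port and
%% measure how long it takes to connect to it.
demo() ->
   {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}, {ip, {127,0,0,1}}]),
   {ok, Port}   = inet:port(Listen),
   Result = connect_time({127,0,0,1}, Port, 1000),
   ok = gen_tcp:close(Listen),
   Result.
```

Calling handshake_latency:demo() returns {ok, Microseconds}. A TLS handshake could be timed in the same way around ssl:connect/4, given suitable TLS options; on a real network path the connect time is dominated by one round trip, which is why it also serves as a rough round-trip-time estimate.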

From a consumer perspective, the most valuable metric is the time to display a response, which is influenced by infrastructure, protocol, and the application environment. Typhoon provides an analysis of application-level protocol behavior. One of these metrics is time-to-first-byte. This is a concrete, consumer-oriented, easily measurable indicator, defined independently of underlying solutions or technologies. It represents initial confirmation that the remote host is responding and the client application can proceed to rendering. The second metric, time-to-meaningful-response, defines the network delay required to deliver the application payload.
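The difference between the two metrics can be sketched with plain sockets. The example below is an illustration under our own assumptions, not Typhoon code: the module name ttfb is hypothetical, the "server" is a loopback process that sends its reply in two parts, and time-to-meaningful-response is approximated as the time until the peer closes the connection.

```erlang
-module(ttfb).
-export([measure/3, demo/0]).

%% Send Request and return {TimeToFirstByte, TimeToFullResponse, Bytes}.
%% Both times are in microseconds, counted from the moment the request
%% is sent. The response is treated as complete when the peer closes.
measure(Host, Port, Request) ->
   {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {active, false}], 1000),
   T0 = erlang:monotonic_time(microsecond),
   ok = gen_tcp:send(Sock, Request),
   {ok, Head} = gen_tcp:recv(Sock, 0, 5000),   %% returns when first bytes arrive
   T1 = erlang:monotonic_time(microsecond),
   Body = drain(Sock, Head),
   T2 = erlang:monotonic_time(microsecond),
   ok = gen_tcp:close(Sock),
   {T1 - T0, T2 - T0, byte_size(Body)}.

%% Read until the connection is closed (or an error/timeout occurs).
drain(Sock, Acc) ->
   case gen_tcp:recv(Sock, 0, 5000) of
      {ok, Data}  -> drain(Sock, <<Acc/binary, Data/binary>>);
      {error, _}  -> Acc
   end.

%% Loopback demo: the server replies in two parts, 20 ms apart.
demo() ->
   {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}, {ip, {127,0,0,1}}]),
   {ok, Port}   = inet:port(Listen),
   spawn_link(fun() -> serve(Listen) end),
   Result = measure({127,0,0,1}, Port, <<"ping\n">>),
   ok = gen_tcp:close(Listen),
   Result.

serve(Listen) ->
   {ok, Sock} = gen_tcp:accept(Listen),
   {ok, _Req} = gen_tcp:recv(Sock, 0),
   ok = gen_tcp:send(Sock, <<"head">>),
   timer:sleep(20),
   ok = gen_tcp:send(Sock, <<"-payload">>),
   gen_tcp:close(Sock).
```

In the demo the first byte arrives quickly while the full payload takes at least the 20 ms the server sleeps, so the two numbers diverge in the same way they would for a slow backend behind a fast front end.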

Microservice. The latency analysis of applications requires techniques to investigate component behavior using information gathered as the service sustains the load. A series of technology decision problems arises concerning both short-term and long-term arrangements. The short-term decisions include aspects of software configuration and capacity provisioning. The long-term decisions focus on technological and architectural requirements, the system’s ability to handle a certain amount of work, and its potential to be enlarged to accommodate growing traffic requirements.

Erlang inside

Massive scalability (the ability to spawn a huge number of concurrent user sessions) and real-time (accuracy of measurements and real-time data processing) are two major requirements that led us to select Erlang as our runtime environment. This language is recommended as an indispensable technology in similar applications.

Incremental scalability and decentralization are the key principles we used to define the architecture. Typhoon is a peer-to-peer system that uses consistent hashing to assemble and orchestrate the cluster. Erlang distribution and add-on third-party libraries provide a highly available and eventually consistent actor management layer. It helps the system to deal with network failures and provides high availability for synthetic load and telemetry collection. The design employs an optimistic technique to replicate data.

Consistent hashing forms a ring topology from cluster nodes. Each node claims ownership of virtual shards. Each shard is responsible for coordinating workload scenarios based on its identity, spawning the load session across cluster nodes, aggregating telemetry, etc.
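The ring described above can be sketched in a few lines of Erlang. This is a simplified illustration, not Typhoon's actual cluster code: the module name ring, the use of erlang:phash2/2 as the hash function, and the number of shards per node are all assumptions made for the example.

```erlang
-module(ring).
-export([new/2, owner/2]).

-define(SPACE, 4294967296).   %% size of the hash space, 2^32

%% Build a ring: each node claims Shards virtual shards, placed at
%% positions derived from hashing {Node, Index}. The result is a list
%% of {Position, Node} pairs sorted by position.
new(Nodes, Shards) ->
   lists:sort([{erlang:phash2({Node, I}, ?SPACE), Node}
               || Node <- Nodes, I <- lists:seq(1, Shards)]).

%% The owner of Key is the first shard at or after the key's position;
%% wrap around to the first shard when the key hashes past the last one.
owner(Key, [{_, First} | _] = Ring) ->
   Pos = erlang:phash2(Key, ?SPACE),
   case lists:dropwhile(fun({P, _}) -> P < Pos end, Ring) of
      [{_, Node} | _] -> Node;
      []              -> First
   end.
```

For example, ring:owner("urn:http:zalando:api", ring:new([a, b, c], 8)) picks the node that would coordinate that scenario. The useful property is that when a node joins, a key either keeps its owner or moves to the new node, so only a fraction of the scenarios needs to be relocated.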

Workload definition language

The workload definition language is one of the challenges we’re looking to solve. An expressive language is required to cover the variety of traffic generation use-cases. We believe a pure functional language is the best approach to express artificial behaviour. It gives us rich techniques to hide the complexity of Typhoon from developers, using monads as abstractions. Think about monads as computations. The workload scenario is defined as a chain of network operations wrapped in the IO monad.
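To illustrate what "a chain of operations wrapped in a monad" means, here is a minimal sketch in plain Erlang with no parse transform. It is not the monad Typhoon uses: the module name mio, the functions return/1, bind/2 and run/1, and the Fetch stand-in for a network call are all hypothetical.

```erlang
-module(mio).
-export([return/1, bind/2, run/1, demo/0]).

%% A computation is a zero-arity fun; nothing happens until it is run.
return(X) -> fun() -> X end.

%% bind/2 sequences two computations: run M, feed its result to F,
%% then run the computation that F returns.
bind(M, F) -> fun() -> (F(M()))() end.

%% Execute a computation and return its result.
run(M) -> M().

demo() ->
   %% Stand-in for a network operation: no real request is made.
   Fetch = fun(Url) -> fun() -> {200, Url} end end,
   Chain = bind(Fetch("https://example.com/a"), fun(_) ->   %% discard result
           bind(Fetch("https://example.com/b"), fun(A) ->   %% bind result to A
           return(A) end) end),
   run(Chain).
```

Here mio:demo() returns {200, "https://example.com/b"}. The nested bind calls are what a do-notation spares the scenario author from writing by hand: each step receives the result of the previous one, and the whole chain is only a description until something runs it.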

We have decided to use Erlang-flavored syntax for scenario definition in the first releases. The decision was mainly driven by time-to-market. Typhoon is built on the Erlang/OTP runtime, so parsing, compilation, and debugging of workload scenarios are provided by the runtime: scenarios are valid Erlang code. However, the development of workload scenarios does not necessarily require an Erlang development environment installed on your computer. Typhoon provides a REST API to lint and compile the scenario code. Scenario development requires a basic understanding of functional programming concepts and knowledge of Erlang syntax. The Erlang language tutorials, module tutorials, and expressions reference can help you get started on the subject.

A simple workload scenario in pure-functional notation can be seen below. It uses the Zalando Shop API to demonstrate Typhoon’s capabilities.

%%
%% Skeleton workload scenario
-module(skeleton).
-compile({parse_transform, monad}).

%% human-readable scenario title
title() ->
   "Skeleton Workload Scenario".

%% scenario entry point
run(_Config) ->
   do(['Mio' ||        %% sequence of requests to execute as IO-monadic computation
      _ <- request(),  %% execute HTTP request and discard results
      A <- article(),  %% execute HTTP request and assign response to variable A
      return(A)        %% it just takes a value A and puts it in an IO context.
   ]).

%% build and execute a single HTTP request
request() ->
   scenario:request(
      scenario:header("Accept-Language", "de-DE",
         scenario:url("https://api.zalando.com/",
            scenario:new("urn:http:zalando:api")
         )
      )
   ).

%% article/0 is not shown in this excerpt; it is assumed to be
%% built in the same way as request/0.

To be continued

The latest Typhoon release provides a solid foundation for latency analysis. You can measure network delay, round trip time, protocol handshake times, time-to-first-byte, and time-to-meaningful-response. It provides scalable traffic production in a cloud environment.

Typhoon requires improvements in data analysis and visualization. We need to develop new metrics (e.g. capacity estimation, active user estimation) and enhance reports to address aspects of executive-level reporting and dashboarding. The workload definition language is another pain point: we continue to use Erlang-flavored syntax as our core language, but support for other widely adopted functional languages (e.g. Scala, JavaScript) is still required.
