12 Golden Signals To Discover Anomalies And Performance Issues on Your AWS RDS Fleet

Automate anomaly detection for AWS RDS at scale.

Dmitry Kolesnikov

Senior Principal Engineer

Posted on Feb 20, 2024


TL;DR: The database-per-service pattern in the microservices world brings overhead in operating database instances and observing their health status and anomalies. Standardising on methodology and tooling is a key factor for success at scale. We have incorporated learnings from past incidents, anomalies and empirical observations into a methodology for observing health status using 12 golden signals. The simplest way to adopt this methodology within your engineering environment is rds-health, an open source utility we recently released.

The problem of maintaining robustness at scale

Since Zalando addressed the organisation's scalability with the microservice pattern, the company has experienced steady growth across multiple dimensions: the number of users, the technology landscape and the number of teams involved in building and running systems. Today, Zalando is a leading European online fashion retailer. It is critical that our architecture is robust enough to withstand challenges and uncertainties while teams innovate and experiment with new ideas.

Overhead from the microservices world. Microservices became a design style for us to define system architectures, purify core business concepts, evolve solutions in parallel, make things look uniform, and implement stable and consistent interfaces across systems. Our engineering teams independently design, build and operate multiple microservices. Often, a microservice is implemented with a datastore following the database-per-service design pattern, where each service deploys its own database instance. The Zalando TechRadar guides teams on database selection and deployment options, with AWS RDS with Postgres being one of the available options.

Hidden costs of toil. Operating a swarm of small databases at company scale quickly gets tough. Complex anomaly detection tasks, such as Byzantine failures or issues with SQL statements, take a noticeable investment across the board. A combination of manual processes and ad-hoc scripts to manage the health of database instances is not an option at scale. It becomes increasingly time-consuming and error-prone; some teams have had to allocate engineers for a sprint or even months for such activities.

Standardisation is one of the factors that reduces this complexity. It is well known that if teams use the same frameworks or design patterns, making changes at scale becomes easier. The same concept extends to the operations domain. We have limited fragmentation by providing stronger guidelines to our engineers on which metrics to observe from datastore components.

We have developed a methodology for detecting anomalies in AWS RDS workloads through 12 “golden signals”. We also decided to release an open-source command line utility (https://github.com/zalando/rds-health) to help automate and streamline the detection of anomalies and performance issues. The utility provides a consistent and repeatable way to automatically analyse database metrics, reducing the risk of errors and improving overall efficiency.

12 Golden Signals

Setting up and operating high-performing databases requires observability of a large variety of signals across multiple buckets: CPU, memory, disk and workload. Thanks to past incidents and empirical observations, we have reduced this complexity so that only a few signals from each bucket need to be analysed to reach a reliable conclusion about the health status of a database instance. This is how we arrived at twelve golden signals. A minimal sketch of how these signals can be fetched programmatically follows the list below.

  1. C1: CPU Utilisation (os.cpuUtilization.total) - typical database workloads are bound by memory or storage, so high CPU usage is an anomaly that requires further investigation. Our past experience shows that CPU utilisation above 40 - 60% on database instances eventually leads to incidents.

  2. C2: CPU Await (os.cpuUtilization.await) - the Linux kernel reports the share of time spent waiting for IO requests, from issue to completion, as the await metric. A high value indicates that a database instance is bound by the IO bandwidth of its storage. Similar to the previous metric, we have concluded that any value above 5 - 10% eventually leads to an incident.

  3. M1: Swapped In from disk (os.swap.in) - swap is an extension of RAM onto disk. The operating system swaps RAM pages to disk and back when there is not enough memory to run the workload. Any intensive swap activity indicates that the database instance is running low on memory. Given that disk performance is orders of magnitude slower than RAM, any swap activity slows down the operating system and its applications.

  4. M2: Swapped Out to disk (os.swap.out) - see the explanation above.

  5. D1: Storage Read IO (os.diskIO.rdsdev.readIOsPS) - storage IO bandwidth is an essential resource for high-performing databases. The IO bandwidth has to be aligned with the overall database workload so that there is enough headroom to handle it. In the case of AWS RDS, the metric value should be compared with the storage configuration deployed for the database instance. With the GP2 volume type, IOPS are provisioned by volume size: 3 IOPS per GB of storage with a minimum of 100 IOPS (for example, a 200 GB GP2 volume provides 600 IOPS). With provisioned IOPS volume types, the value is defined explicitly at deployment time. Note that a very low value indicates that the entire dataset is served from memory.

  6. D2: Storage Write IO (os.diskIO.rdsdev.writeIOsPS) - see the explanation above. Also note that a high value indicates that the workload is write-mostly and potentially bound by the IO capacity of storage.

  7. D3: Storage IO Latency (os.diskIO.rdsdev.await) - the overall performance of storage is a function of its IO bandwidth and its latency. The latency metric reflects the time the storage takes to load data blocks into memory. High storage latency implies higher latency for the application workload on the database. Our empirical observations show that storage latency above 10 ms eventually leads to an incident, and latency above 5 ms impacts application SLOs. A typical storage latency for database systems should be below 4 - 5 ms.

  8. P1: Cache Hit Ratio (db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read)) - databases read and write application data in blocks. The number of blocks read by the database from physical storage has to be aligned with the storage IO bandwidth provisioned to the database instance. The database caches these blocks in memory to optimise application performance. When clients request data, the database checks the cache first; if the relevant data is not there, it has to be read from disk and queries become slower. Any value below 80% indicates that the database has an insufficient amount of shared buffers or physical RAM: data required for the most frequently called queries does not fit into memory, and the database has to read it from disk.

  9. P2: Blocks Read Latency (db.IO.blk_read_time) - this metric reflects the time spent by the database reading blocks from storage. High storage latency implies high latency for the application workload. We have observed an impact on SLOs when this latency grows above 10 ms.

  10. P3: Database Deadlocks (db.Concurrency.deadlocks) - the number of deadlocks detected in the database. Ideally, it should be 0. If the number is high, the application schema and IO logic require evaluation.

  11. P4: Database Transactions (db.Transactions.xact_commit) - the number of transactions committed by the database. A low number indicates that the database instance is a standby.

  12. P5: SQL Efficiency (db.SQL.tup_fetched / db.SQL.tup_returned) - SQL efficiency shows the percentage of rows fetched by the client versus rows returned from storage. The metric does not necessarily indicate a performance issue with the database, but a high ratio of returned versus fetched rows should trigger questions about optimising SQL queries, the schema or indexes. For example, if you run select count(*) from million_row_table, one million rows will be returned but only one row will be fetched.
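
To make the signals above concrete, here is a minimal sketch of pulling a few of the OS-level signals from AWS Performance Insights and flagging values above the empirical thresholds quoted in the list. It is not part of rds-health itself; the resource id, region, metric identifiers (with the `.avg` statistic suffix) and thresholds are illustrative and should be checked against the Performance Insights counters available for your instance.

```python
# Minimal sketch: fetch a few OS-level golden signals from AWS Performance
# Insights and flag values above the thresholds discussed in the list above.
# Assumes Performance Insights is enabled and AWS credentials are configured;
# the resource id, region and thresholds below are illustrative only.
from datetime import datetime, timedelta, timezone

import boto3

pi = boto3.client("pi", region_name="eu-central-1")

RESOURCE_ID = "db-EXAMPLE1234567890"  # DbiResourceId of the RDS instance

# Signal -> (Performance Insights metric with statistic suffix, warning threshold)
SIGNALS = {
    "C1 CPU utilisation (%)": ("os.cpuUtilization.total.avg", 60.0),
    "C2 CPU await (%)": ("os.cpuUtilization.await.avg", 10.0),
    "D3 storage IO latency (ms)": ("os.diskIO.rdsdev.await.avg", 5.0),
}

now = datetime.now(timezone.utc)
response = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier=RESOURCE_ID,
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    PeriodInSeconds=300,
    MetricQueries=[{"Metric": metric} for metric, _ in SIGNALS.values()],
)

for name, (metric, threshold) in SIGNALS.items():
    # Find the data points returned for this metric and take the worst value.
    points = next(
        m["DataPoints"] for m in response["MetricList"]
        if m["Key"]["Metric"] == metric
    )
    values = [p["Value"] for p in points if "Value" in p]
    if not values:
        continue
    peak = max(values)
    status = "WARN" if peak > threshold else "ok"
    print(f"{status:4} {name}: peak {peak:.2f} (threshold {threshold})")
```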
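The two derived signals, P1 and P5, are simple ratios over the native PostgreSQL counters (blks_hit, blk_read, tup_fetched, tup_returned). The sketch below shows the arithmetic with illustrative input numbers, mirroring the count(*) example from item 12.

```python
# Derived golden signals P1 and P5 computed from PostgreSQL counters.
# Input numbers are illustrative.

def cache_hit_ratio(blks_hit: float, blks_read: float) -> float:
    """P1: share of block reads served from the cache (target: above 80%)."""
    total = blks_hit + blks_read
    return 100.0 * blks_hit / total if total else 100.0

def sql_efficiency(tup_fetched: float, tup_returned: float) -> float:
    """P5: rows fetched by clients versus rows returned from storage."""
    return 100.0 * tup_fetched / tup_returned if tup_returned else 100.0

# Example: 95,000 blocks served from cache, 5,000 read from disk -> 95%.
print(f"P1 cache hit ratio: {cache_hit_ratio(95_000, 5_000):.1f}%")

# Example: select count(*) over a million-row table returns 1,000,000 rows
# from storage but fetches only one row -> efficiency close to 0%.
print(f"P5 SQL efficiency: {sql_efficiency(1, 1_000_000):.4f}%")
```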

Open Source Command Line Utility

AWS offers a wide range of observability solutions for AWS RDS, such as AWS CloudWatch, AWS Performance Insights and others. These off-the-shelf solutions help with setting up alerts and debugging anomalies when one of the twelve golden signals is violated. What was missing was an efficient utility to holistically observe the status of the entire AWS RDS fleet in an account with “a single click of a button”.

Screenshot of the rds-health utility

This is how the rds-health utility was born. It analyses AWS RDS instances using time-series metrics collected by AWS Performance Insights. In essence, the utility is a frontend for AWS APIs that automates the analysis of the discussed golden signals across your accounts and regions. It can be easily customised for specific use cases, allowing users to tailor their workflows to their needs. Some of the key features are listed below, followed by a conceptual sketch of the AWS API calls they build on:

  • Show configuration of all AWS RDS instances and clusters;
  • Check health of all AWS RDS deployments;
  • Conduct capacity planning for your AWS RDS deployments.
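
As a hint of what the first feature does under the hood, the sketch below lists the configuration of all RDS instances in a region via the AWS SDK. This is not the utility's actual code, only an illustration of the same idea using boto3; region and output columns are arbitrary choices.

```python
# Conceptual sketch of the AWS API behind "show configuration of all AWS RDS
# instances": enumerate every instance in a region and print its key settings.
# Not rds-health's implementation; just an illustration using boto3.
import boto3

rds = boto3.client("rds", region_name="eu-central-1")

paginator = rds.get_paginator("describe_db_instances")
for page in paginator.paginate():
    for db in page["DBInstances"]:
        print(
            f"{db['DBInstanceIdentifier']:30} "
            f"{db['Engine']} {db['EngineVersion']:12} "
            f"{db['DBInstanceClass']:18} "
            f"{db['AllocatedStorage']:>5} GB {db['StorageType']}"
        )
```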

Check out our open source project at https://github.com/zalando/rds-health. It guides you through simple installation and configuration steps, together with tutorials about its features. We look forward to hearing your feedback and suggestions for improvement. Please raise an issue on the project.

Conclusion

Our objective is to reduce complexity by limiting fragmentation within our engineering ecosystem and enabling teams with engineering and operational guidelines. The discussed methodology for detecting anomalies in AWS RDS workloads through 12 “golden signals” is one example of how we tackle this complexity at Zalando.

Standardisation is not only about guidelines but also about automating repetitive tasks, freeing up time for more creative and strategic work. We are happy to share our learnings and approaches to observing AWS RDS instances at scale with the open source community through the rds-health utility. Apply these learnings within your teams.

If you have any questions about our methodology or the open source utility rds-health itself, please raise an issue on the project. Contributions are welcomed and encouraged!
