Understanding observability vs. monitoring. Part 1
July 26, 2022
The development of clouds, the DevOps movement, and distributed microservice-based architecture have come together to make observability vital for modern architecture. We’re going to dive into what observability is and how to approach the metrics we need to track. Observability is a way of spotting and troubleshooting the root causes of problems involving software systems whose internals we might not understand. It extends the concept of monitoring, applying it to complex systems with unpredictable and/or complex failure scenarios. I’ll start with some of the basic principles of observability that I’ve been helping to implement across a growing number of products and teams at Nord Security.
Monitoring vs. Observability
“Monitoring” and “observability” are often used interchangeably, but these concepts have a few fundamental differences.
Monitoring is the process of using telemetry data to understand the health and performance of your application. Monitoring telemetry data is preconfigured, implying that the user has detailed information on their system’s possible failure scenarios and wants to detect them as soon as they happen.
In the classical approach to monitoring, we define a set of metrics, collect them from our software system, and react to changes in the metric values that interest us. For example:
Excessive CPU usage can indicate that we need to scale up to compensate for increasing system load;
A drop in successfully served requests after a fresh release can indicate that the newly released version of the API is malfunctioning;
Health checks process binary metrics that represent whether the system is alive at all or not.
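The classical monitoring loop above can be sketched as a simple threshold check. This is a minimal illustration, not a production alerting system; the metric names and thresholds are hypothetical:

```python
# Hypothetical sketch of classical monitoring: predefined metrics,
# predefined thresholds, predefined reactions.

def check_metrics(metrics: dict) -> list[str]:
    """Return alerts for every metric that crosses its predefined threshold."""
    alerts = []
    if metrics.get("cpu_usage_percent", 0) > 80:
        alerts.append("High CPU usage: consider scaling up")
    if metrics.get("success_rate_percent", 100) < 99:
        alerts.append("Success rate dropped: check the latest release")
    if not metrics.get("health_check_ok", True):
        alerts.append("Health check failed: system may be down")
    return alerts

print(check_metrics({"cpu_usage_percent": 92,
                     "success_rate_percent": 99.5,
                     "health_check_ok": True}))
```

The key limitation is visible in the code itself: it can only detect the failure scenarios someone thought to encode in advance, which is exactly the gap observability addresses.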
Observability extends this approach. Observability is the ability to understand the state of the system by performing continuous real-time analysis of the data it outputs.
Instead of just collecting and watching predefined metrics, we continuously collect different output signals. The most common types of signals - the three pillars of observability - are:
Metrics: Numeric data aggregates representing software system performance;
Logs: Time-stamped messages gathered by the software system and its components while working;
Traces: Maps of the paths taken by requests as they move through the software system.
The development of complex distributed microservice architectures has led to complex failure scenarios that can be hard or even impossible to predict. Simple monitoring is not enough to catch them. Observability helps by improving our understanding of the internal state of the system.
Metrics
Choosing the right metrics to collect is key to establishing an observability layer for our software system. Here are a few popular approaches, each defining a unified framework of must-have metrics for any software system.
The USE method
Originally described by Brendan Gregg, the USE method focuses more on white-box monitoring - monitoring of the infrastructure itself. Here’s the framework:
Utilization - resource utilization.
% of CPU / RAM / Network I/O being utilized.
Saturation - how much remaining work hasn’t been processed yet.
CPU run queue length;
Storage wait queue length;
Errors - errors per second.
CPU cache miss;
Storage system fail events;
Note: Defining “saturation” in this approach can be a tricky task and may not be possible in specific cases.
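Utilization in the USE method is usually derived from two samples of a monotonically increasing counter, such as the per-CPU jiffies exposed in /proc/stat on Linux. A minimal sketch of that calculation, with made-up sample values:

```python
def cpu_utilization(busy_prev: int, total_prev: int,
                    busy_now: int, total_now: int) -> float:
    """Percentage of CPU time spent busy between two counter samples
    (e.g. jiffies read from /proc/stat on Linux)."""
    delta_total = total_now - total_prev
    if delta_total == 0:
        return 0.0  # no time elapsed between samples
    return 100.0 * (busy_now - busy_prev) / delta_total

# 300 busy jiffies out of 400 elapsed -> 75% utilization
print(cpu_utilization(busy_prev=1000, total_prev=2000,
                      busy_now=1300, total_now=2400))  # 75.0
```

Saturation, by contrast, would require a queue-depth signal (such as the CPU run queue length), which is why the note above warns that it is not always available.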
Four Golden Signals
Originally described in the Google SRE Handbook, the Four Golden signals framework is defined as follows:
Latency - time to process requests;
Traffic - requests per second;
Errors - errors per second;
Saturation - resource utilization.
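The four signals can all be computed from a window of served requests plus one resource reading. A minimal sketch, assuming in-memory request records and using CPU percentage as a stand-in for saturation (in practice you would pick the resource most constrained for your service):

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code

def golden_signals(requests: list, window_seconds: float,
                   cpu_percent: float) -> dict:
    """Compute the Four Golden Signals over one observation window."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms_avg": sum(r.duration_ms for r in requests) / n if n else 0.0,
        "traffic_rps": n / window_seconds,
        "errors_per_s": errors / window_seconds,
        "saturation_pct": cpu_percent,
    }

reqs = [Request(120, 200), Request(80, 200), Request(400, 500), Request(200, 200)]
print(golden_signals(reqs, window_seconds=2, cpu_percent=65.0))
```

Note that latency is averaged here only for brevity; real dashboards typically track latency percentiles, and latency of failed requests is often reported separately from successful ones.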
The RED method
Originally described by Tom Wilkie, the RED method focuses on black-box monitoring - monitoring the microservices themselves. This simplified subset of the Four Golden Signals uses the following framework:
Rate - requests per second;
Errors - errors per second;
Duration - time to process requests.
Choosing and following one of these approaches lets you unify monitoring concepts across the whole system and makes it easier to understand what is happening. The approaches complement one another, and your choice may depend on which part of the system you want to monitor. None of them excludes additional business-related metrics, which vary from one component of the software system to another.
Logs
System logs are a useful source of additional context when investigating what is going on inside a system. They are immutable, time-stamped text records that provide context to your metrics.
Logs should be kept in a unified structured format like JSON. Use dedicated log storage and visualization tools to simplify working with the massive amount of text data the software system produces. One very well-known and popular solution for log storage is Elasticsearch.
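Structured JSON logging needs no special tooling on the application side. A minimal sketch using only Python’s standard logging module (the logger name and fields are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line,
    so downstream tools can index fields instead of parsing free text."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")  # hypothetical service logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge completed")
# emits e.g. {"timestamp": "...", "level": "INFO", "logger": "payments", "message": "charge completed"}
```

Keeping one JSON object per line is the convention most log shippers and storage backends expect, so records stay machine-parseable end to end.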
Traces
Traces help us better understand the request flow in our system by representing the full path any given request takes through a distributed software system. This is very helpful in identifying failing nodes and bottlenecks.
Traces themselves are hierarchical structures of spans, where each span is a structure representing the request and its context in every node in its path. Most common tracing visualization tools like Jaeger or Grafana display traces as waterfall diagrams showing the parent and child spans caused by the request.
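The span hierarchy described above can be sketched as a small data structure. This is an illustrative model only, not the OpenTelemetry or Jaeger data format; the service names are made up:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class Span:
    """One unit of work in a trace; children mirror downstream calls."""
    name: str
    trace_id: str                       # shared by every span in the trace
    parent_id: str = None               # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        span = Span(name, self.trace_id, parent_id=self.span_id)
        self.children.append(span)
        return span

# One request entering an API gateway and fanning out to two services:
root = Span("GET /checkout", trace_id=uuid.uuid4().hex)
auth = root.child("auth-service: verify token")
pay = root.child("payment-service: charge card")
db = pay.child("postgres: INSERT payment")
```

Walking this tree top-down is exactly what the waterfall views in Jaeger or Grafana render: each row is a span, indented under its parent, with real implementations also recording start and end timestamps per span.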
Conclusion
Building an observable software system lets you identify failure scenarios and possible risks during the whole system life cycle. A combination of metrics, extensive log collection, and traces helps us understand what’s happening inside our system at any moment and speeds up investigations of abnormal behavior.
This article was just the first step. We’ve covered the standard approaches to metrics and briefly discussed traces and logs. But to implement an observable software system, we need to set up its components correctly to supply us with the signals we need. In part 2, we’ll discuss instrumentation approaches and modern standards in this field.