The Future of Observability

September 30, 2021 / September 29, 2021 · Tom Zach · cloud-native applications, KubeCon, observability, OpenTelemetry

We live in an era of microservices. New products are being developed not only by funded companies but also by creative individuals. Customers are used to having a great experience, and if they don’t get it, they leave.

This means developers must be able to quickly debug and resolve any issue they encounter.

This has also led to a growing need for observability. But the truth is, we need to look at observability differently than we currently do. Instead of treating the three pillars of observability (metrics, logs and traces) as three separate tools, we must treat them as one combined unit. Here’s why.

The Past

To better understand the future, it is important to understand where it all began.

Back in the days of monolithic architecture, you would generally have one server with one large codebase to manage everything in your system. You would usually also have a relational database attached to that server.

The problems you’d encounter were somewhat predictable: Database is a bottleneck; high server CPU; high server memory; etc. It was extremely unlikely that you would encounter a scenario you never thought was possible, and if you did, you would simply jump to the relevant line of code and start debugging. This is why there wasn’t a real need for what we nowadays refer to as “observability.” Back then, you could use traditional monitoring solutions: Set an alarm for high CPU/memory, and when that alarm went off, simply start investigating.

Since microservices emerged and rose in popularity, everything has become distributed, and issues have become far less predictable. Asynchronous messaging systems serve as the glue of data transfer between services. This is great, but it comes at a price: more moving parts to maintain and more places to look when we have an issue.

The increasing number of moving parts and microservices has increased the need for libraries like OpenCensus and OpenTracing, which later merged to form OpenTelemetry. They collect data from each microservice to allow us to view and query this data when issues occur.

A lot of articles have been written on OpenTelemetry, so I won’t go into too much detail here. But the most important thing to remember is that it provides a standardized way of collecting data from our microservices, with the aim of enabling much-needed observability.

The Present State of Observability

You’ve probably heard a lot about the three pillars of observability: logs, metrics and traces. These three tools are our core arsenal when it comes to resolving production issues. The OpenTelemetry specification talks about them in-depth, but I’ll briefly explain each of them.

Logs

Logs are human-readable text that describes what is happening in our system at any given time. A log record could be an error, a warning or a debug message. Logs are, and always will be, a core part of the way developers approach resolving problems.

Metrics

In any computer system or service, we have gauges that we would like to monitor. A metric is a number associated with a point in time, such as CPU usage or memory usage. We can observe the metrics of our services using tools like the open source Prometheus or various vendors’ solutions. We can also use metrics to set up alerts; for example, if CPU usage rises above 95%, notify the team.
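As a minimal sketch of the alert rule described above, a metric can be modeled as a named number with a timestamp and checked against a threshold. The `MetricPoint` type, the `should_alert` helper and the sample values are hypothetical and only for illustration; they are not part of Prometheus or of any vendor's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricPoint:
    """A metric: a number associated with a point in time (hypothetical type)."""
    name: str
    timestamp: float  # seconds since the Unix epoch
    value: float


def should_alert(point: MetricPoint, threshold: float = 95.0) -> bool:
    """Return True when the value rises above the alert threshold."""
    return point.value > threshold


# Made-up CPU samples taken one minute apart.
samples: List[MetricPoint] = [
    MetricPoint("cpu_usage_percent", 1632990000.0, 72.5),
    MetricPoint("cpu_usage_percent", 1632990060.0, 96.1),
]

# Only the second sample crosses the 95% threshold.
alerts = [p for p in samples if should_alert(p)]
for p in alerts:
    print(f"ALERT {p.name}={p.value} at {p.timestamp}")
```

In a real setup the threshold rule would normally live in the monitoring system rather than in application code.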

Traces

We also can instrument our systems to create traces. A trace tracks the progression of a single request. Let’s use a sign-up form as an example. A user enters the form details, then hits ‘Submit.’ The authentication microservice accepts the request, performs validations and saves the user info to a database.

Each internal operation (like the POST request or saving to the DB) is called a span. A trace is made of one or more spans. The initial span is called the root span; it covers the whole request, so it records the time from the moment the user clicked the button until they received either success or failure. Tracing can be very useful when we want to see which operations took more time and which took less.
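The sign-up example can be sketched with a plain data structure. This is not the OpenTelemetry API: the `Span` type, its field names, the helper functions and the timings below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One operation inside a trace (hypothetical, simplified model)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def root_span(spans: List[Span]) -> Span:
    """Return the single span that has no parent."""
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError("a trace is expected to have exactly one root span")
    return roots[0]


def slowest_child(spans: List[Span]) -> Optional[Span]:
    """Return the non-root span with the longest duration, if any."""
    children = [s for s in spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms, default=None)


# Made-up timings for the sign-up request described above.
signup_trace = [
    Span("trace-42", "a", None, "POST /signup", 0, 180),
    Span("trace-42", "b", "a", "validate input", 5, 25),
    Span("trace-42", "c", "a", "save user to DB", 30, 170),
]

root = root_span(signup_trace)
slowest = slowest_child(signup_trace)
print(f"{root.name} took {root.duration_ms} ms; slowest step: {slowest.name}")
```

With these sample timings, the database write accounts for most of the request, which is the kind of answer a trace view is meant to give at a glance.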

What Observability Means Today

The average software company has a vendor for logging, uses a vendor/cloud provider for metrics and uses either another vendor or open source tools for traces.

So, let’s imagine the following scenario:

We got an alert from our metrics provider saying our server (that holds all our microservices) is at 95% CPU.
The engineers go to the logs provider’s website to search for events in that time frame. They may or may not find a relevant log record that seems related.
Then, they may go to the tracing provider to see which requests took the longest, under the assumption that those are the ones that also take the most CPU.
They would most likely go back and forth between all those systems, until they reach the problematic line of code or the moving part that is responsible for the spike, and fix it.

This process is far from perfect. Using the three pillars as three different tools is slow, frustrating and error-prone. Hopping between those systems takes a lot of time and mental energy, and leaves plenty of room for guessing why things occurred. I believe we can change this if we shift our view on observability.

The Future: Unified Observability

OpenTelemetry, which has defined a specification for those three pillars, is constantly evolving and is headed toward allowing a deeper correlation between them.

Let’s go back to our scenario from before: we received an alert saying CPU usage is at 95% for our server. But let’s introduce one difference: imagine using a single system, our observability provider.

The engineers go to view the metrics on the provider’s website. Metrics would be sent via OpenTelemetry SDKs with exemplars, which are highlighted values in a time series. Those exemplars can contain a set of key-value pairs that would store relevant information including trace IDs that were active during that time.
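To make the idea concrete, here is a minimal sketch of a metric data point that carries exemplars, and of how the trace IDs behind a CPU spike could be pulled out of it. The `Exemplar` and `CpuPoint` types, the `trace_ids_above` helper and all sample values are hypothetical; this is not the OpenTelemetry SDK's own data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Exemplar:
    """A sample measurement plus key-value context (hypothetical type)."""
    value: float
    timestamp: float
    trace_id: str  # the trace that was active when the value was recorded
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class CpuPoint:
    """One point in a CPU usage time series, with attached exemplars."""
    timestamp: float
    value: float
    exemplars: List[Exemplar] = field(default_factory=list)


def trace_ids_above(points: List[CpuPoint], threshold: float) -> List[str]:
    """Collect unique trace IDs from exemplars of points above the threshold."""
    seen: List[str] = []
    for point in points:
        if point.value <= threshold:
            continue
        for exemplar in point.exemplars:
            if exemplar.trace_id not in seen:
                seen.append(exemplar.trace_id)
    return seen


# Made-up series: the last two points are above the 95% alert threshold.
series = [
    CpuPoint(1632990000.0, 61.0, [Exemplar(61.0, 1632990000.0, "trace-07")]),
    CpuPoint(1632990060.0, 96.4, [
        Exemplar(96.4, 1632990055.0, "trace-42", {"http.route": "/signup"}),
        Exemplar(95.8, 1632990058.0, "trace-77"),
    ]),
    CpuPoint(1632990120.0, 97.2, [Exemplar(97.2, 1632990119.0, "trace-42")]),
]

suspect_trace_ids = trace_ids_above(series, threshold=95.0)
print(suspect_trace_ids)
```

The result is a short list of candidate traces to open next, instead of a time window to search by hand.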

Clicking on a trace ID would immediately open that trace, letting the engineers identify a potentially problematic one for deeper investigation. The system may even let them sort those IDs by execution time. In the trace view, they would see the logs associated with that trace, so they would see only what is relevant at any given time.

This is possible because the OpenTelemetry specification allows a trace ID to be stored in a log record, meaning that for each log we know exactly which trace produced it. It also enables the reverse: we can look up logs by trace ID.
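A minimal sketch of that two-way lookup, assuming log records that carry an optional trace ID. The `LogRecord` type, the `index_by_trace` helper and the sample records are hypothetical and are not OpenTelemetry's own classes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    """A log line that may carry the ID of the trace that produced it."""
    timestamp: float
    severity: str
    body: str
    trace_id: Optional[str] = None  # None when no trace was active


def index_by_trace(records: List[LogRecord]) -> Dict[str, List[LogRecord]]:
    """Group log records by trace ID, skipping records without one."""
    index: Dict[str, List[LogRecord]] = defaultdict(list)
    for record in records:
        if record.trace_id is not None:
            index[record.trace_id].append(record)
    return dict(index)


# Made-up log records from the sign-up scenario.
logs = [
    LogRecord(1632990055.1, "INFO", "signup request received", "trace-42"),
    LogRecord(1632990055.9, "ERROR", "DB write timed out, retrying", "trace-42"),
    LogRecord(1632990056.0, "INFO", "health check ok"),
    LogRecord(1632990058.2, "INFO", "signup request received", "trace-77"),
]

logs_by_trace = index_by_trace(logs)

# Trace -> logs: everything that happened during trace-42.
for record in logs_by_trace["trace-42"]:
    print(record.severity, record.body)

# Log -> trace: the error record points straight back at its trace.
error_trace_ids = [r.trace_id for r in logs if r.severity == "ERROR"]
```

An observability backend would do this lookup with an index over stored records rather than an in-memory dictionary, but the correlation key is the same.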

This way, the odds that engineers are viewing the right information at the right time increase significantly. There is no need to guess or to manually correlate the time frames of logs, traces and metrics when issues occur. At the click of a button, you can jump between the OpenTelemetry signals, or the three pillars: logs, metrics and traces. I believe this alone can save engineers countless hours that could be used instead to develop great products. To sum it up: One system plus three unified pillars equals a bright future for observability. I, for one, am very excited to be there when that happens.