The Future of Observability

At KubeCon + CloudNativeCon in Barcelona two years ago, I gave a keynote with Frederic Branczyk called Metrics, Logs & Traces: What Does the Future Hold for Observability?

We offered three predictions, the first of which was that we would start seeing more and more correlation between the three pillars of observability: logs, metrics and traces.


This was the initial model, sketched out several years ago.

Three Pillars of Observability

This model was relatively linear and one-dimensional, going from metrics to logs to traces. But in real life, you often have to switch back and forth, for instance between metrics and logs.

So over the last couple of years, we’ve been evolving this model, and I’m happy to say that we’ve made good on our prediction of more and more correlation.

Built on the open source projects Grafana (for visualization), Grafana Loki (for logging), Grafana Tempo (for tracing) and Prometheus (for metrics), our stack now lets you combine metrics, logs and traces in Grafana dashboards for a single debugging or incident response experience.

LGTM means, of course, “looks good to me.” But at Grafana Labs, it also stands for logs, Grafana, traces and metrics. This is the new model of our vision for correlation, which shows more movement between the three pillars, and in both directions. I’ll walk you through the developments we’ve made to get here.

Metrics to Logs and Logs to Metrics

We launched Loki in 2018, and starting with v1.0, you can jump from your metrics to the relevant logs quickly, easily and automatically. The logs are labeled systematically and consistently, in the same way as Prometheus metrics, so the labels on a metric reliably lead you to the matching logs.
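To illustrate why shared labels make this jump possible, here is a minimal sketch. It assumes a metric series and its log stream carry the same identifying labels; the helper `to_log_selector`, the `keep` list and all label values are hypothetical and not part of Loki or Grafana.

```python
def to_log_selector(series_labels, keep=("job", "instance", "namespace", "pod")):
    """Build a LogQL stream selector from a Prometheus series' labels.

    Only labels expected to exist on both metrics and logs are kept.
    The `keep` list is an assumption made for this sketch, and label
    values are not escaped.
    """
    pairs = [
        f'{name}="{series_labels[name]}"'
        for name in keep
        if name in series_labels
    ]
    return "{" + ", ".join(pairs) + "}"


# Labels as they might appear on a Prometheus series (illustrative values).
series = {
    "__name__": "http_requests_total",
    "job": "nginx",
    "instance": "web-1:80",
    "code": "500",
}
selector = to_log_selector(series)
print(selector)
```

Metric-only labels such as `code` are dropped, because a log stream would not normally carry them.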

Last October, the Loki v2.0 release brought many improvements to LogQL, the Loki query language that’s inspired by PromQL, the Prometheus query language. LogQL now allows you to extract metrics from your logs, which can be particularly useful if you have legacy applications that either aren’t instrumented or don’t expose metrics themselves.

For instance, you can generate ad hoc metrics directly from NGINX logs in real time and visualize them in Grafana.
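As a rough sketch of what extracting a metric from logs means, the snippet below counts requests per HTTP status code from a few made-up NGINX access log lines. The LogQL string is an illustrative query written for this example, not one taken from the talk, and the `job="nginx"` label is an assumption. The Python function only mimics the same extraction locally.

```python
import re
from collections import Counter

# Illustrative LogQL metric query (an assumption, not from the article):
# count log lines per HTTP status over one-minute windows.
LOGQL_QUERY = r'sum by (status) (count_over_time({job="nginx"} | regexp `" (?P<status>\d{3}) ` [1m]))'

# The same extraction applied locally: the status code follows the quoted request.
STATUS_RE = re.compile(r'" (?P<status>\d{3}) ')


def count_by_status(lines):
    """Count log lines per HTTP status code, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group("status")] += 1
    return dict(counts)


# Made-up access log lines in the common NGINX format.
sample_lines = [
    '10.0.0.1 - - [01/Mar/2021:10:00:01 +0000] "GET / HTTP/1.1" 200 612',
    '10.0.0.2 - - [01/Mar/2021:10:00:02 +0000] "GET /api HTTP/1.1" 500 0',
    '10.0.0.1 - - [01/Mar/2021:10:00:03 +0000] "GET /img.png HTTP/1.1" 200 2048',
]
status_counts = count_by_status(sample_lines)
print(status_counts)
```

In Loki the counting happens at query time over the stored log lines, so the application itself needs no instrumentation.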

Logs to Traces

Full native support for trace data was built into Grafana 7.0 to enable a seamless and unified experience that connects and visualizes metrics, logs, and traces. And as of Loki 2.0, if any log contains an identifier for a trace, you can click on it and jump directly to that trace in Tempo.
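The jump from a log line to a trace depends on finding a trace identifier in the line. Here is a minimal sketch of that step, assuming a hypothetical log format in which the identifier appears as `traceID=<hex>`. The field name, the pattern and the helper `find_trace_id` are illustrative assumptions; real log formats vary.

```python
import re

# Hypothetical log format: the trace identifier appears as traceID=<hex>.
TRACE_ID_RE = re.compile(r"traceID=(?P<trace_id>[0-9a-f]{16,32})")


def find_trace_id(log_line):
    """Return the trace ID embedded in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(log_line)
    return match.group("trace_id") if match else None


# A made-up logfmt-style line carrying a trace identifier.
line = 'level=error msg="upstream timeout" traceID=4bf92f3577b34da6a3ce929d0e0e4736 duration=2.1s'
trace_id = find_trace_id(line)
print(trace_id)
```

The extracted identifier is what a tracing backend such as Tempo would be queried with.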


Metrics to Traces

As of Grafana 7.4, you can use exemplars to go directly from metrics to traces without detouring through logs. Exemplars let you see a sampled subset of requests plotted as dots on a graph. Each dot represents an individual trace, and you can click on it to jump straight to that trace.
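To make the idea concrete: in the OpenMetrics text format, an exemplar is attached to a sample after a `#` and carries its own labels, typically a trace identifier. The sketch below parses one such line with a deliberately simplified pattern. It is not a full OpenMetrics parser, and the metric name, the `trace_id` label and the values are made up for illustration.

```python
import re

# Simplified pattern for one sample line carrying an exemplar:
#   <series> <value> # {<exemplar labels>} <exemplar value> [<timestamp>]
# Assumes the series part contains no spaces.
EXEMPLAR_RE = re.compile(
    r'^(?P<series>\S+)\s+(?P<value>\S+)\s+#\s+'
    r'\{(?P<labels>[^}]*)\}\s+(?P<ex_value>\S+)(?:\s+(?P<ex_ts>\S+))?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line):
    """Return the exemplar attached to a sample line, or None if it has none."""
    match = EXEMPLAR_RE.match(line.strip())
    if match is None:
        return None
    return {
        "series": match.group("series"),
        "labels": dict(LABEL_RE.findall(match.group("labels"))),
        "value": float(match.group("ex_value")),
    }


# A made-up histogram bucket sample with an exemplar pointing at one request.
sample = 'http_request_duration_seconds_bucket{le="0.5"} 12 # {trace_id="4bf92f3577b34da6"} 0.43 1614592800.0'
exemplar = parse_exemplar(sample)
print(exemplar)
```

The exemplar's value places the dot on the graph, and its `trace_id` label is what links the dot to a trace.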


Traces to Logs

To link from a trace back into logs, the Grafana Agent allows you to decorate your traces, logs and metrics with consistent metadata, which then creates correlations that were not previously possible. After jumping from an exemplar to a trace, you can now go directly to the logs for the struggling service. The trace immediately identifies what element of your request path caused the error, and the logs help you identify why.
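One way to picture the trace-to-logs jump is as a log query scoped to the service that emitted a span and to the span's time window. The sketch below assumes the stream selector is already known from the shared metadata and that log lines contain the trace identifier. The helper `logs_query_for_span`, the five-second padding and all values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def logs_query_for_span(selector, trace_id, span_start, span_end,
                        padding=timedelta(seconds=5)):
    """Return a LogQL query and a padded time range covering one span.

    `selector` is the stream selector for the service that emitted the span.
    The padding is an arbitrary choice to catch lines logged slightly
    before or after the span itself.
    """
    return {
        "query": f'{selector} |= "{trace_id}"',
        "start": (span_start - padding).isoformat(),
        "end": (span_end + padding).isoformat(),
    }


# A made-up two-second span from a hypothetical checkout service.
start = datetime(2021, 3, 1, 10, 0, 0, tzinfo=timezone.utc)
request = logs_query_for_span(
    '{job="checkout"}', "4bf92f3577b34da6", start, start + timedelta(seconds=2)
)
print(request)
```

The `|=` line filter keeps only the log lines that mention the trace, which narrows the search to the failing request.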


Observability This Year and Beyond

As you can see, we’ve already built out a great deal of this model. Our plan is to develop many more of these transitions, add more aspects, and join them up in more interesting ways. Up to now, LGTM has been specific to Prometheus, Loki and Tempo. This year and beyond, we’re working on making these workflows more accessible, easier to use, and available to other tools as well, with Grafana bringing them all together.

We’re looking forward to joining the open source community at KubeCon + CloudNativeCon EU this May, to share our LGTM vision and hear your thoughts.

Tom Wilkie

Tom Wilkie is VP, Product at Grafana Labs, but really he is a software engineer. Tom is a maintainer on the Prometheus project and a maintainer and the original author of Cortex, both CNCF projects. In his spare time, Tom likes to make craft beer and build 3D printers.
