How To Avoid Cloud-Native Observability Tooling Sprawl

Observability is becoming increasingly crucial for companies employing containerized environments and Kubernetes. A recent CNCF report finds that 72% of respondents use as many as nine different tools to monitor their cloud-native environments and applications. Naturally, the use of many different monitoring tools could lead to sprawl if left ungoverned.

Without proper observability, it can be challenging to troubleshoot and diagnose issues promptly. Companies now use a wide range of observability tools, from traditional APM to modern logging systems. So what is the best way to navigate this sprawl? In this article, we explore why observability matters, the current state of cloud-native observability and how to avoid cloud-native observability tooling sprawl.

What Is Observability Data and Why Does It Matter?

Observability data is the telemetry collected from sources such as logs, metrics and traces that are used to monitor an application. It shows how the application behaves at runtime, helping teams identify issues and areas for improvement. This matters especially in containerized environments, which are often ephemeral and change quickly; without that visibility, diagnosing and troubleshooting problems becomes difficult.
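
To make the three signal types concrete, here is a minimal, self-contained Python sketch of a request handler that emits a structured log line, bumps a counter metric and records a trace span. The field names and in-memory "backends" are invented for illustration and are not any particular library's API.

```python
import json
import time
import uuid

# In-memory sinks standing in for a metrics backend and a trace store.
REQUEST_COUNT: dict[str, int] = {}
SPANS: list[dict] = []

def handle_request(path: str) -> None:
    """Toy handler that emits all three signal types for one request."""
    trace_id = uuid.uuid4().hex          # correlates the log line with the span
    start = time.time()

    # 1. Log: a timestamped, structured event describing what happened.
    print(json.dumps({"level": "info", "msg": "request received",
                      "path": path, "trace_id": trace_id, "ts": start}))

    # 2. Metric: a numeric measurement that gets aggregated over time.
    REQUEST_COUNT[path] = REQUEST_COUNT.get(path, 0) + 1

    # 3. Trace span: a timed unit of work, linked back via trace_id.
    SPANS.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_ms": (time.time() - start) * 1000})

handle_request("/checkout")
print(REQUEST_COUNT, len(SPANS))         # {'/checkout': 1} 1
```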

The types of data collected depend on the application, ranging from application log files and user data to container metrics and data from task schedulers. As the number of containers, databases and other applications running in Kubernetes continues to grow, so does the importance of observability data. How that data is consumed also matters, since it is what keeps teams aligned with their SLIs, SLOs and SLAs.
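
As a rough illustration of tying collected data back to an SLO, the following Python sketch computes an availability SLI from request counts and compares it to a 99.9% objective. The counts and the objective are made-up example values.

```python
# Minimal sketch: turning raw request counts into an availability SLI
# and checking it against an SLO. The numbers are illustrative.

TOTAL_REQUESTS = 1_000_000
FAILED_REQUESTS = 1_200          # e.g., HTTP 5xx responses
SLO_TARGET = 0.999               # 99.9% availability objective

sli = 1 - FAILED_REQUESTS / TOTAL_REQUESTS            # measured availability
error_budget = (1 - SLO_TARGET) * TOTAL_REQUESTS      # failures the SLO allows
budget_remaining = error_budget - FAILED_REQUESTS

print(f"SLI: {sli:.4%}")                              # SLI: 99.8800%
print(f"Error budget: {error_budget:.0f} requests")   # Error budget: 1000 requests
print(f"Budget remaining: {budget_remaining:.0f}")    # Budget remaining: -200
print("SLO met" if sli >= SLO_TARGET else "SLO breached")
```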

The State of Cloud-Native Observability

We are currently in the early innings of cloud-native observability, and the topic needs more knowledge and mindshare, says Anurag Gupta, co-founder and CEO of Calyptia. According to Gupta, teams should start by identifying their goalposts. For example, one goalpost to consider is what high availability looks like for the application: In essence, how much downtime can it incur and still meet its objective?
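
As a back-of-the-envelope illustration of that goalpost, the short Python snippet below translates a few common availability targets into allowed downtime per 30-day month; the targets are examples, not recommendations.

```python
# Rough illustration: how an availability goalpost translates into
# allowed downtime. A 30-day month has 43,200 minutes.

MINUTES_PER_MONTH = 30 * 24 * 60

for target in (0.99, 0.999, 0.9999):
    allowed = MINUTES_PER_MONTH * (1 - target)
    print(f"{target:.2%} availability -> {allowed:.1f} minutes of downtime per month")
# 99.00% availability -> 432.0 minutes of downtime per month
# 99.90% availability -> 43.2 minutes of downtime per month
# 99.99% availability -> 4.3 minutes of downtime per month
```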

Once this point is established, teams should identify the gaps in their current setup and gradually evolve how they use observability to enrich the datasets and metrics they collect. It is important to start small and iterate; trying to do too much at once can be overwhelming and can keep teams from making reasonable progress. Gupta recommends using whatever tools teams are already comfortable with and gradually folding containers into that setup. He also suggests favoring open source tools, which help avoid vendor lock-in and let teams take advantage of the community while keeping control of their observability stack.

Avoiding Tooling Sprawl

The number of containers in use continues to balloon; companies might now run 100 to 200 containers per server. Each container produces log files and application data, and following this stream of data has become essential for diagnosing whether something is going wrong. A report from Datadog found that container environments end up with significantly more monitors than non-container environments.
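
As a toy illustration of following that stream, the sketch below scans pooled container log lines and counts errors per container so a misbehaving one stands out; the log format and container names are invented for the example.

```python
# Toy illustration: scan a pooled stream of per-container log lines and
# count errors per container so a spike is easy to spot.
from collections import Counter

LOG_STREAM = [
    "web-1  level=info  msg=request ok",
    "web-2  level=error msg=timeout talking to db",
    "db-1   level=info  msg=checkpoint complete",
    "web-2  level=error msg=timeout talking to db",
]

errors_per_container = Counter(
    line.split()[0] for line in LOG_STREAM if "level=error" in line
)
print(errors_per_container.most_common())   # [('web-2', 2)]
```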

Tooling sprawl has become an issue for many companies, which often end up using multiple tools to monitor these environments, whether it’s an APM suite, Datadog, Splunk or others. “Over time, each of these use cases has grown to solve different verticals,” says Gupta. Many different backend solutions have evolved, each serving its own vertical of data. For enterprise monitoring, that means visibility into core application metrics can become fractured and opaque.

However, tooling sprawl can be avoided with an abstraction layer. Gupta recommends a layer that can take in data in various formats from multiple agents and transform it into whatever view is necessary. This helps reduce agent sprawl and the amount spent on compute resources. It also gives teams more control: They can enact rules and policies, debug more easily with error logs and send logs to different clouds, all from a centralized location.
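
The sketch below illustrates the abstraction-layer pattern in Python: records arriving in two made-up agent formats are normalized into one shape, then routed by a simple policy. It is only an illustration of the idea, not Calyptia’s, Fluent Bit’s or any other vendor’s actual schema or API.

```python
# Illustrative sketch of the abstraction-layer idea: normalize records
# arriving in different agent formats, apply a simple policy, then route
# them to a backend. Formats, fields and backend names are made up.
from typing import Any, Callable

def normalize(record: dict[str, Any], source: str) -> dict[str, Any]:
    """Map agent-specific fields onto one common shape."""
    if source == "agent_a":            # e.g., {"msg": ..., "lvl": ...}
        return {"message": record["msg"], "level": record["lvl"], "source": source}
    if source == "agent_b":            # e.g., {"log": ..., "severity": ...}
        return {"message": record["log"], "level": record["severity"], "source": source}
    raise ValueError(f"unknown source: {source}")

def route(event: dict[str, Any]) -> str:
    """Routing policy: error logs go to long-term storage, the rest to analytics."""
    return "archive-backend" if event["level"] == "error" else "analytics-backend"

BACKENDS: dict[str, Callable[[dict[str, Any]], None]] = {
    "archive-backend":   lambda e: print("archive   <-", e),
    "analytics-backend": lambda e: print("analytics <-", e),
}

for raw, src in [({"msg": "disk full", "lvl": "error"}, "agent_a"),
                 ({"log": "user login", "severity": "info"}, "agent_b")]:
    event = normalize(raw, src)
    BACKENDS[route(event)](event)
```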

Solving Sprawl With Abstraction and Aggregation

Observability is becoming increasingly important for securing and operating containerized environments and Kubernetes, as it helps teams diagnose and troubleshoot issues as they arise. However, teams must be careful to avoid tooling sprawl, which can be costly and degrade performance. One way to avoid it is an abstraction layer that takes in data from multiple agents and transforms it into dashboards catered to the team’s needs. Teams should also take advantage of open source tools and agents, which can help reduce data volumes and costs while giving teams more control and flexibility.

Bill Doerrfeld

Bill Doerrfeld is a tech journalist and analyst. His beat is cloud technologies, specifically the web API economy. He began researching APIs as an Associate Editor at ProgrammableWeb, and since 2015 has been the Editor at Nordic APIs, a high-impact blog on API strategy for providers. He loves discovering new trends, interviewing key contributors, and researching new technology. He also gets out into the world to speak occasionally.
