Why Prometheus is an Essential Observability Tool

Prometheus remains an essential monitoring tool and a key component of observability platforms for cloud-native environments in numerous organizations. As one of the Cloud Native Computing Foundation's (CNCF) "fastest-growing" projects, the time series database is especially useful for gathering metrics from Kubernetes clusters and is typically paired with Grafana dashboards for visualization.

In this post, we explore why Prometheus has become widely adopted and integrated with observability platforms, using New Relic as an example of a leading observability player whose platform ingests Prometheus metrics. We also discuss why Prometheus metrics, and observability data in general, have become increasingly essential for many overextended, highly distributed DevOps teams in today's pandemic context, especially those working in development environments.

Prometheus' design and capabilities are especially well-suited for monitoring Kubernetes clusters, since the tool offers developers a simple, open and vendor-agnostic way to embed monitoring instrumentation into their services at development time, Buddy Brewer, general vice president and field chief technology officer for the Americas at New Relic, tells Container Journal.
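In practice, that instrumentation often amounts to a few lines of code added with one of the official Prometheus client libraries. The sketch below uses the Go client (github.com/prometheus/client_golang); the metric name, handler and port are illustrative assumptions, not drawn from Brewer's comments.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is a hypothetical counter of handled HTTP requests.
// promauto registers it with the default Prometheus registry.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_http_requests_total",
	Help: "Total number of HTTP requests handled.",
})

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Inc() // one call instruments the hot path
	w.Write([]byte("hello\n"))
}

func main() {
	http.HandleFunc("/", handler)
	// Expose the /metrics endpoint a Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Because the exposition format is open and vendor-agnostic, any backend that understands Prometheus scraping, New Relic's included, can consume the same endpoint.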

“Application architectures involving containers and microservices have reached a point where developers today often can’t instrument fast enough to keep up with the pace of software development, leading to observability gaps,” says Brewer. “Prometheus is one source of telemetry data that helps developers observe modern software.”

Prometheus is not, however, the only source organizations use for observability data. Teams also draw on logs, traces and high-cardinality metrics in addition to the metrics data that a tool such as Prometheus provides, says Brewer. A big part of understanding what is going on across the full stack of a modern software architecture involves analyzing the intersections of these different types of data, he adds.

“In New Relic’s case, we bring Prometheus and other telemetry data sources into a single data store called the Telemetry Data Platform,” says Brewer. “We then provide curated views on top of this with our Full Stack Observability offering. Developers use this to get a holistic view of what’s going on without having to pivot between different tools and data stores.”

These capabilities are especially important as modern IT environments grow in scale and complexity while, at the same time, engineering teams are being asked to manage more than they ever have before, says Brewer.

“We are at a point where engineers are physically unable to track all of the changes and dependencies, and on top of that, digital transformation is speeding up due to COVID-19,” says Brewer. “The average engineer needs to get more done across more data in less time. They are asking themselves, ‘How can I get the fastest, simplest view of everything that is going on across my entire system?’ — all while also being able to drill down into the details and discover entity relationships, changes across applications, etc.”

The goal behind an observability platform such as New Relic's Explorer is to help DevOps teams gather actionable data more efficiently and, thus, do more with fewer resources. New Relic, for example, integrates Prometheus metrics into its observability platform to help DevOps teams surface the so-called "unknown unknowns" in observability data.

In this way, Prometheus monitoring metrics, as well as alerting, dashboards, and all the other traditional monitoring tasks, are part of observability, says Brewer.

"The real difference with observability is opening up the ability to ask questions of your system in real time — or allowing AI to do it for you — through the data it throws off," says Brewer. This data consists of aggregated metrics; events, meaning discrete, high-value occurrences predefined by engineers; logs, a line-by-line history of everything happening in the instance; and traces, end-to-end stitched paths of a request's trip through the system, says Brewer.

"To do this, you need all the data in one place with high cardinality (for engineers or AI), with correlation and connection, so you can understand your system, connect signals to the things that produced them and understand dependencies," says Brewer.

B. Cameron Gain

B. Cameron Gain is the founder and owner of ReveCom Media Inc. (www.revecom.io), which offers competitive analysis and testing services for software tools used by developer, operations and security teams. He first began writing about technology when he hacked the Commodore 64 home computer in the early 1980s and documented his exploit. Since his misspent youth, he has put his obsession with software development to better use by writing thousands of papers, manuals and articles for both online and print. His byline has appeared in Wired, PCWorld, Technology Review, Popular Science, EE Times and numerous other media outlets.
