Tuning K8s Performance and Cost With NGINX Ingress Metrics

There are load balancers, and then there are load balancers. Broadly speaking, NGINX comes in two variations. The first is open source, with many contributors and an install base of, perhaps, billions. The second is NGINX Plus, a commercial product with support and proprietary load-balancing algorithms. Although there is no hard data available, it is reasonable to assume that NGINX Plus has a tiny fraction of the contributors and installations of its open source cousin.

Is Bigger Better?

Opsani recently ran an independent test to determine how its autonomous optimizer for cloud workloads would perform with each version of NGINX. The assumption was that the choice of load balancer fronting the optimization service would have no bearing on how the service performed. The outcome surprised us: the two setups produced significantly different results.

DevOps Experience

Our principal focus was extracting NGINX metrics for the application that we were optimizing. By definition, the application being optimized by our service depends on NGINX for load balancing. 

Once the test harness is set up and the application is up and running, optimization occurs automatically over a few application load cycles. The optimal configuration is usually calculated after we observe three or four peak periods. 

The Application

Our reference application was Bank of Anthos, an application that mimics both a traditional three-tier enterprise application and the more modern microservices architecture.

Test Method and Setup

To determine a baseline configuration and measure the impact of both the metrics-gathering methodology and the load-balancing mechanism in place, we configured and ran the following tests:

  1. Kubernetes “upstream” ingress with the community-maintained open source NGINX ingress controller, deployed with the standard process. Metrics were gathered by adding an Envoy proxy in line with the application under test, running as a sidecar container in the application pod.
  2. Kubernetes ingress using the NGINX Plus-based ingress controller with the deployment process described here. Metrics were gathered with the same deployment and configuration as the community version, to ensure that metrics gathering itself was not a source of any direct impact on optimization performance.
  3. Kubernetes ingress using the NGINX Plus-based ingress controller as described in the previous test. In this model, we removed the Envoy sidecar and collected RED metrics directly from the NGINX Plus ingress controller.

In the first two cases, we collected RED metrics (of primary interest were throughput, p50 latency and transaction errors) with a Prometheus sidecar co-resident in the optimization controller pod. For the third case, we manually implemented the same base configurations that our connector implements, but targeted them at the NGINX controller instead. This required defining separate Kubernetes connector and Prometheus metrics configurations.
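The mechanics of pulling RED metrics from Prometheus can be sketched as follows. This is a minimal illustration against the standard Prometheus HTTP API; the PromQL expressions and metric names are assumptions for illustration, not Opsani's actual connector code or the exact series the Envoy or NGINX Plus exporters emit.

```python
import json
import urllib.parse
import urllib.request

def first_sample(body: dict) -> float:
    """Extract the first sample value from a Prometheus instant-query response."""
    results = body.get("data", {}).get("result", [])
    return float(results[0]["value"][1]) if results else 0.0

def instant_query(prom_url: str, expr: str) -> float:
    """Run a PromQL instant query against the Prometheus HTTP API."""
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return first_sample(json.load(resp))

# Hypothetical PromQL expressions for the three RED inputs (rate, errors,
# duration); the metric names are illustrative placeholders.
RED_QUERIES = {
    "throughput": 'sum(rate(http_requests_total[1m]))',
    "p50_latency": 'histogram_quantile(0.5, '
                   'sum(rate(http_request_duration_seconds_bucket[1m])) by (le))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[1m])) '
                  '/ sum(rate(http_requests_total[1m]))',
}
```

In a live cluster, each query would be run once per measurement window and the three values fed to the optimizer as one observation.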

The two proxies (Envoy versus NGINX Plus) expose different metrics parameters. Still, we found a correlatable set, using the median latency from NGINX Plus as the counterpart to the p50 latency from Envoy. In addition, throughput and error rates were available in both metrics services and made up the rest of the required inputs for our optimization process.
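The mapping of proxy-specific stats onto a common set of optimizer inputs can be sketched like this. The field names are illustrative assumptions, not the actual Envoy or NGINX Plus stats schemas; the point is that NGINX Plus's median latency stands in for Envoy's bucketized p50.

```python
def normalize_red(source: str, raw: dict) -> dict:
    """Map proxy-specific raw stats onto the common RED inputs (hypothetical keys)."""
    if source == "envoy":
        # Envoy exposes bucketized histograms, so a true p50 is available.
        return {
            "throughput": raw["requests_per_second"],
            "p50_latency_ms": raw["latency_p50_ms"],
            "error_rate": raw["error_rate"],
        }
    if source == "nginx_plus":
        # NGINX Plus reports a median latency, used here as the p50 stand-in.
        return {
            "throughput": raw["requests_per_second"],
            "p50_latency_ms": raw["latency_median_ms"],
            "error_rate": raw["error_rate"],
        }
    raise ValueError(f"unknown metrics source: {source}")
```

With both sources normalized to the same schema, the optimizer can consume either feed without knowing which proxy produced it.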

Results

After optimizing Bank of Anthos with two variations of NGINX (Upstream and NGINX Plus) and two metrics collection variations (Envoy sidecar and Prometheus), we delivered the following results:

| Setup Variation | Cost Reduction | P50 Response Time (latency) ᵃ | P90 Response Time (latency) ᵃ | Error Rate ᵃ | CPU Optimization | Memory Optimization |
|---|---|---|---|---|---|---|
| Upstream NGINX via Envoy | +44% | ᵇ | ᵇ | ᵇ | +44% | +44% |
| NGINX Plus via Envoy ᶜ | +70% | +4% | ᵇ | ᵇ | +63% | +81% |
| NGINX Plus via Prometheus | +70% | -5% | NA ᵈ | ᵇ | +63% | +81% |

    a. Lower is better.
    b. Minor change from the baseline and within margin of error.
    c. NGINX Plus load balancing is more effective at distributing load than the open source path; Opsani optimization works more efficiently with even load distribution.
    d. Only the mean was extracted, as statistically bucketized metrics were not readily available.

The following facts stand out:

  1. NGINX Plus load balancing is more effective at distributing load than the open source path, and our optimization works more efficiently when load is evenly distributed. As a result, NGINX Plus allows for superior optimization results.
  2. Collecting metrics via Prometheus, and thereby eliminating the sidecar mechanism, delivers superior performance results, as measured by lower latency.

“Free” Costs Money

It is worth emphasizing the big difference in cost savings when the same application ran behind the widely distributed and installed open source NGINX versus the proprietary NGINX Plus. Optimizing the same stack with only the load balancer changed yielded a cost-savings difference of 26%.

Here is another way of looking at it: If your application costs $1.26 per hour when using the free version of NGINX, it would cost you only $1.00 per hour if you upgraded to (and paid for) NGINX Plus. On a small scale, this difference is trivial. But, if you are running any larger-scale production applications, the difference in operating costs is substantial.
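The arithmetic above scales linearly with fleet size, which is where the difference stops being trivial. A minimal sketch, using the article's $1.26 versus $1.00 hourly rates; the fleet sizes and always-on assumption (8,760 hours per year) are illustrative.

```python
def annual_savings(instances: int,
                   oss_rate: float = 1.26,    # $/hour with open source NGINX
                   plus_rate: float = 1.00,   # $/hour with NGINX Plus
                   hours_per_year: int = 8760) -> float:
    """Hourly cost delta, multiplied by fleet size and hours in a year."""
    return (oss_rate - plus_rate) * instances * hours_per_year

# A single always-on instance saves roughly $2,278/year; a hypothetical
# 100-instance fleet saves roughly $227,760/year (before license costs).
single = annual_savings(1)
fleet = annual_savings(100)
```

Any real comparison would, of course, net the NGINX Plus license fee against these savings.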

As it turns out, “free” costs money.


Amir Sharif

Amir Sharif is the VP of product at Opsani. Prior to Opsani, Amir co-founded Aporeto in November 2015. Aporeto was a cloud-native security startup that was acquired by Palo Alto Networks in November of 2019. Amir started his career at Sun Microsystems. In 2003, he joined Topspin Communications as a product manager. After the acquisition of Topspin by Cisco Systems, Amir joined VMware as the product manager for VMware ESXi. When not working on inventing the future in a startup, Amir enjoys time with his six children and, when the kids let him, he curls up with a history book.
