Introduction
Topology Aware Routing (TAR) is a Kubernetes feature designed to keep traffic within the same availability zone (AZ). This can reduce cross-AZ traffic costs on cloud providers like AWS and GCP, where inter-AZ traffic incurs charges. Additionally, it can lower latency by keeping network requests local.
However, TAR is not a silver bullet. While it helps optimize costs and performance, it strictly prohibits cross-zone traffic, regardless of the system’s health or workload distribution. This limitation can lead to unintended service disruptions.
In this post, I’ll demonstrate how a small misconfiguration, combined with TAR, led to severe performance issues.
Enabling Topology Aware Routing in Kubernetes
To enable Topology Aware Routing for a Kubernetes Service, you need to add the following annotation:
service.kubernetes.io/topology-mode: Auto
When this feature is enabled, traffic is routed only within the same availability zone: if the Service has enough endpoints in each zone, Kubernetes assigns Topology Hints to its EndpointSlices, ensuring that requests are always served from the same AZ.
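For illustration, a minimal Service manifest with TAR enabled could look like this (the service name, labels, and ports are hypothetical):
apiVersion: v1
kind: Service
metadata:
  name: backend                                  # hypothetical name
  annotations:
    service.kubernetes.io/topology-mode: Auto    # enables Topology Aware Routing
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 8080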
More details on TAR can be found in the official Kubernetes documentation: Topology Aware Routing
The Problem: Unexpected Service Degradation
We use the KEDA Cron Scaler to scale workloads out before business hours and scale them back in after hours. However, this scaler was supposed to be disabled for TAR-enabled services by an if statement in our Helm chart.
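For context, a KEDA ScaledObject with such a cron trigger looks roughly like this (the target name is hypothetical; the schedule matches the chart example later in this post):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: frontend-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: frontend                 # hypothetical Deployment
  minReplicaCount: 1
  triggers:
    - type: cron
      metadata:
        timezone: America/New_York
        start: 30 8 * * *          # scale out at 8:30 AM
        end: 30 18 * * *           # scale back in at 6:30 PM
        desiredReplicas: "2"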
One Friday evening, we received a Slack alert that an application had become extremely slow. Our initial response was to check the database and external systems, but everything appeared normal.
After further investigation, we discovered a strange load imbalance across AZs:
- Pods in one AZ were completely idle, processing zero requests.
- Pods in two other AZs were overloaded, reaching 100% CPU usage.
So what went wrong? The answer was TAR.
Root Cause Analysis
Our application consists of two tiers:
- Frontend (handling user requests)
- Backend (processing business logic)
The backend service had TAR enabled, meaning it would only accept traffic from frontend pods within the same AZ.
For some reason, only 4 frontend pods were running, and they were spread across just 2 out of 3 AZs. This led to all requests going to backend pods in those two AZs, leaving backend pods in the third AZ completely idle.
But why weren’t the frontend pods evenly distributed across AZs? And why did a scale-in event happen in the first place?
The Autoscaler Behavior
We had the Horizontal Pod Autoscaler (HPA) configured, but when HPA scales in, pods are terminated essentially at random. It does not consider:
- Pod Affinity
- Topology Spread Constraints
This randomness led to an imbalance: one AZ had no frontend pods left after a scale-in event.
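For reference, a zone spread constraint in the frontend Deployment’s pod spec would look something like this (labels are hypothetical):
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway      # only considered when pods are scheduled
    labelSelector:
      matchLabels:
        app: frontend                      # hypothetical label
The catch is that such constraints only influence where new pods land; nothing re-balances the remaining pods after HPA lowers the replica count.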
The Bug in Our Helm Chart
After deeper investigation, we found a bug in our Helm chart that enabled the cron scaler for TAR-enabled workloads.
The problematic condition:
{{- if lt (int (default .Values.autoscaling.minReplicas 1)) 2 }}
- type: cron
  metadata:
    timezone: America/New_York
    start: 30 8 * * *        # At 8:30 AM
    end: 30 18 * * *         # At 6:30 PM
    desiredReplicas: "2"
{{- end }}
Here, default .Values.autoscaling.minReplicas 1 always evaluates to 1, so the condition is always true and the cron scaler is rendered for every workload: the DEFAULT and GIVEN values were switched. Documentation: Using the default function
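The argument order matters: default takes the fallback value first and the actual value second, and returns the actual value unless it is empty. A minimal illustration (assuming .Values.autoscaling.minReplicas is set to 3 in values.yaml):
{{ default 1 .Values.autoscaling.minReplicas }}   {{/* renders 3: falls back to 1 only if minReplicas is unset */}}
{{ default .Values.autoscaling.minReplicas 1 }}   {{/* always renders 1: the literal is treated as the given value */}}
Rendering the chart locally with helm template and checking which workloads get the cron trigger would have surfaced the problem before it hit production.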
The fixed condition:
{{- if lt (int (default 1 .Values.autoscaling.minReplicas)) 2 }}
- type: cron
  metadata:
    timezone: America/New_York
    start: 30 8 * * *        # At 8:30 AM
    end: 30 18 * * *         # At 6:30 PM
    desiredReplicas: "2"
{{- end }}
Backend HPA
The backend also had HPA enabled, but its scaling was based on average CPU usage across all pods. Since one-third of the backend pods were idle, the average CPU usage stayed below the threshold, preventing further scaling.
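For illustration, the backend HPA was conceptually similar to this (the numbers are hypothetical):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend                    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # averaged across ALL pods, regardless of zone
With two zones at 100% CPU and one zone idle, the average is roughly (100 + 100 + 0) / 3 ≈ 67%, just under a 70% target, so the HPA saw no reason to add pods.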
Key Takeaways
- Topology Aware Routing is not always beneficial. It prevents cross-AZ traffic even when some AZs have no available capacity. In some cases, this can lead to performance bottlenecks instead of optimization.
- HPA does not respect topology constraints. When scaling in, it randomly terminates pods without considering PodAffinity or Topology Spread Constraints. This can cause an uneven distribution of workloads.
- Small Helm chart misconfigurations can have a big impact. Always review commits carefully!
Improvements
What could we do better?
- Implement the descheduler to automatically evict pods that violate affinity or Topology Spread Constraints (see the policy sketch after this list).
- Configure HPA per AZ so that workloads in each AZ scale independently.
- Use a service mesh like Linkerd instead of TAR.
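For the descheduler option, a policy along these lines (sketched here in the v1alpha1 policy format, not taken from our setup) would periodically evict pods that violate topology spread constraints so the scheduler can place them back into the under-populated zone:
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
    params:
      includeSoftConstraints: true   # also evict for soft (ScheduleAnyway) constraints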
Conclusion
Topology Aware Routing can be a powerful tool for reducing cross-AZ traffic costs and optimizing latency. However, it must be used with caution, as it strictly prevents traffic from leaving an AZ - even when the system is under stress.
In our case, a minor Helm misconfiguration combined with TAR led to an imbalance in our deployment.
Would we enable Topology Aware Routing again? Yes, but with additional safeguards in place.