4 Kubernetes Failure Stories to Learn From



"The JVM version was identical. Application ran on EC2 containers so containerization wasn't a problem.


What would you call a situation where an application hosted on EC2 responds within 20ms, while the same application on Kubernetes takes ten times as long? Bizarre? Unlikely? Well, the Adevinta team handling the deployment was equally clueless. They ran the standard diagnostics, to no avail.

"The JVM version was identical. Application ran on EC2 containers so containerization wasn't a problem.

Indeed, there were some external dependencies, and the team was quick to point the blame at DNS. As they say, "It's always DNS." So they ran a few DNS queries from the container:

[root@be-851c76f696-alf8z /]# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 22 msec
;; Query time: 22 msec
;; Query time: 29 msec
;; Query time: 21 msec
;; Query time: 28 msec
;; Query time: 43 msec
;; Query time: 39 msec

Then they ran the same queries from an EC2 instance:

bash-4.4# while true; do dig "elastic.spain.adevinta.com" | grep time; sleep 2; done
;; Query time: 77 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec
;; Query time: 0 msec

So the container was indeed paying ~30ms of DNS resolution overhead, but that was nowhere near enough to explain the 10x latency. It did, however, raise two questions:

  1. Why on earth was Kubernetes taking an eternity to talk to AWS?
  2. Isn't the JVM supposed to cache DNS lookups?

"When DNS isn't the problem, it often leads to the one."

A tcpdump analysis cleared DNS of blame, but revealed something odd about how requests were handled: for every incoming request, the service was issuing extra GET requests over the same TCP connection to the AWS Instance Metadata service, and those calls accounted for the latency spikes.

It turned out the two calls were part of the application's authorization workflow. The first queried the IAM role associated with the instance, while the second requested temporary credentials for that role. The credentials expire at the Expiration time, after which the client has to request new ones.

/ # curl http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::<account_id>:role/some_role
{
    "Code" : "Success",
    "LastUpdated" : "2012-04-26T16:39:16Z",
    "Type" : "AWS-HMAC",
    "AccessKeyId" : "ASIAIOSFODNN7EXAMPLE",
    "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "Token" : "token",
    "Expiration" : "2017-05-17T15:09:54Z"
}

The team knew the Expiration could become a performance overhead, but they also knew the client is supposed to cache credentials until close to the Expiration time. Yet, for reasons unknown, the AWS Java SDK wasn't caching them. The refresh threshold is hardcoded in the SDK, and the Expiration is included in the credentials themselves. The team retrieved the credentials from both the container and EC2, and to their surprise, the ones from the container had a much shorter validity: just 15 minutes.

The AWS Java SDK force-refreshes any credentials with less than 15 minutes left before expiration. Since the container's credentials were only ever valid for 15 minutes, every request triggered a refresh, making two extra calls to the AWS API and adding a huge amount of latency.

To fix this, they reconfigured the credentials with a longer expiration period. Once the change was applied, requests were served without involving the AWS Metadata service, and latency dropped even below EC2 levels.

Kubernetes is not so simple after all

Kubernetes applications run on groups of servers called nodes. Each node hosts pods, and each pod encapsulates an application's container(s); the Kubernetes scheduler decides which pods run on which nodes. Kubernetes powers Moonlight's web resources, so when Kubernetes reported a few nodes and pods as unresponsive, the team linked the issue to an ongoing Google Cloud outage. But when the issue persisted through the whole weekend and the Moonlight website went offline the following week, they contacted Google Cloud support.
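The node/pod/container hierarchy described above is easiest to see in a minimal manifest. This is an illustrative sketch only; the name and image are placeholders, not anything from Moonlight's setup:

```yaml
# A minimal Pod: one container, scheduled by Kubernetes onto some node.
apiVersion: v1
kind: Pod
metadata:
  name: web            # placeholder name
spec:
  containers:
    - name: app        # each Pod wraps one or more containers
      image: nginx:1.25
```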

Google Cloud support quickly escalated the issue to their engineering team, which reported an obscure behavior in Moonlight's Kubernetes deployment: the nodes were experiencing persistent 100% CPU load, causing kernel panics and subsequent crashes. Being a fault-tolerant system, the Kubernetes scheduler kept rescheduling the crashed pods over and over, which only exacerbated the underlying issue.

Once the issue was identified, the fix was a matter of raising fault tolerance: they added an anti-affinity rule, which automatically spreads pods out across nodes to balance the load.
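For illustration, an anti-affinity rule of the kind described above can look like this inside a Deployment's pod template. The `app: web` label and the weight are assumptions for the sketch, not Moonlight's actual configuration:

```yaml
# Prefer not to co-locate replicas carrying the label app=web
# on the same node (topologyKey = hostname).
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname
```

With `preferredDuringScheduling`, the scheduler spreads replicas when it can but still schedules them if no other node fits; the stricter `requiredDuringScheduling` variant would refuse to co-locate them at all.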

When it isn't DNS, it's got to be AWS IAM

A continuous delivery platform, or CDP, takes care of CI/CD cycles, and Zalando operates a custom CDP for its builds. One fine day, builds started failing: the Kubernetes builder Pods were unable to find their AWS IAM credentials. For a Pod to get AWS IAM credentials, kube2iam needs the Pod's IP address, which is set in the Pod's status by the kubelet on the associated node. The kubelet, however, was taking several minutes to update Pod status: in its default configuration it is rate-limited when talking to the API server, and Zalando's CDP cluster happened to have only one node available to the builder Pods. The rapid creation and deletion of Pods on that single node overwhelmed the kubelet. The problem was only solved once the team scaled the cluster up to more than one node.
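Zalando's actual fix was to scale the cluster out, but the kubelet's rate limit mentioned above is itself configurable. A hedged sketch of a KubeletConfiguration fragment raising those limits; the values are illustrative, and the defaults vary by Kubernetes version:

```yaml
# KubeletConfiguration fragment (illustrative values, not Zalando's).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeAPIQPS: 50     # sustained kubelet -> API server request rate
kubeAPIBurst: 100  # short bursts allowed above the sustained rate
```

Raising these limits trades API server load for faster status updates, so it is a tuning knob rather than a substitute for adequately sized node pools.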

Migrations aren't always a "walk in the park"

The team at Ravelin had a smooth migration to Kubernetes and Google Cloud until the API layer came into the picture. The API layer still lived on old VMs, and to ease the migration they settled on an ingress. On paper it was a cakewalk:

  1. Define the ingress controller
  2. Tinker with Terraform to get some IP addresses
  3. Google will take care of nearly everything else

If you look at most documentation, this is how a pod is removed from a service:

  1. "The replication controller decides to remove a pod.
  2. The pod's endpoint is removed from the service or load-balancer. New traffic no longer flows to the pod.
  3. The pod's pre-stop hook is invoked, or the pod receives a SIGTERM.
  4. The pod 'gracefully shuts down'. It stops listening for new connections.
  5. The graceful shutdown completes, and the pod exits, when all its existing connections eventually become idle or terminate."

What if I told you that steps 2 and 3 are supposed to happen simultaneously, not in sequence? You might ask what difference it makes; both happen quickly. The problem is that ingresses are relatively slow to act on the change.

In this case, the pod received the SIGTERM long before the endpoint change was actioned at the ingress. The pod began shutting down while new connections kept arriving, and clients received one 500 error after another. So much for a smooth Kubernetes migration.

The team at Ravelin ultimately solved the issue by adding a pre-stop lifecycle hook that simply sleeps through the grace period, so the pod keeps serving traffic as if nothing happened while the ingress catches up.
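A sketch of what such a hook can look like in a pod spec. The 30-second sleep, 45-second grace period, and container name/image are assumed values for illustration, not Ravelin's actual settings:

```yaml
spec:
  terminationGracePeriodSeconds: 45     # must exceed the preStop sleep
  containers:
    - name: api                         # placeholder name
      image: example/api:latest         # placeholder image
      lifecycle:
        preStop:
          exec:
            # Keep serving while the ingress/load balancer drops
            # this pod from its endpoints; SIGTERM arrives after.
            command: ["sleep", "30"]
```

The sleep delays the SIGTERM, buying the ingress time to stop routing new connections to the pod before it actually shuts down.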

Takeaways

Indeed, a common source of problems is neither bugs in Kubernetes or other software, nor fundamental flaws in microservices or containerization. Problems often appear simply because we put some unlikely pieces together in the first place: complex pieces of software that have never run together in production before, intermingled in the expectation that they will cooperate as a single, larger system.

More moving pieces == more touch points == more room for failure == more entropy.

Nevertheless, Kubernetes heads off a lot of trouble before it even surfaces, thanks to its built-in fault tolerance.

Despite these cases, Kubernetes remains the most reliable container orchestration platform around, with plenty of success stories to its name, which I will cover in the next blog.



editorial
The Chief I/O

The team behind this website. We help IT leaders, decision-makers and IT professionals understand topics like Distributed Computing, AIOps & Cloud Native

