Kubernetes Horror Stories 👻

This week we’re looking into Kubernetes failures and incidents. As you might imagine, these issues can have a nasty impact on users, and recovery plays the most critical role in limiting the damage.

Issue #76

David Xia, Infrastructure Engineer at Spotify, goes in depth with his story of how Spotify accidentally deleted all of its Kubernetes clusters with no user impact. Since the postmortem, Spotify has improved its safeguards and recovery procedures.

Back in July 2019, Grafana’s customers experienced a production outage caused by Grafana’s pod priority configuration. Recovery took around 30 minutes, but the incident still had an impact on the cluster’s larger pods.
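For context on how pod priority works: Kubernetes assigns priorities through PriorityClass objects, and the scheduler may preempt (evict) lower-priority pods to make room for higher-priority ones. A minimal sketch of the mechanism, with hypothetical names not taken from the Grafana incident:

```yaml
# Hypothetical PriorityClass: pods referencing it can preempt
# lower-priority pods when the scheduler needs room.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload      # illustrative name
value: 1000000                 # higher value = scheduled first, preempted last
globalDefault: false
description: "Latency-sensitive pods; lower-priority pods may be evicted."
---
# A pod opts in by naming the class in its spec.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # illustrative name
spec:
  priorityClassName: critical-workload
  containers:
    - name: app
      image: nginx
```

The flip side of this feature is exactly what makes it a horror-story ingredient: a mis-scoped high priority can trigger a wave of preemptions across the cluster.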

Blue Matador recently wrote a blog post outlining issues with nodes in Kubernetes clusters running out of memory. This case study covers the incident, the fix, and the takeaways.
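Out-of-memory problems like this are usually tamed with memory requests and limits on each container, so the scheduler knows how much to reserve and the kubelet knows when to intervene. A minimal sketch (names and values are illustrative, not from the post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-bounded-pod     # illustrative name
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "256Mi"      # scheduler reserves this much on a node
        limits:
          memory: "512Mi"      # container is OOM-killed if it exceeds this
```

Without requests, the scheduler can pack more pods onto a node than its memory can hold; without limits, one leaking pod can starve every neighbor on the node.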

Moonlight’s website dealt with multiple issues, including kernel panics on nodes and pods consuming 100% of their nodes’ CPU. Kernel panics are interesting problems to deal with in Kubernetes.

This article explores how stability issues led to a cluster being overloaded by a script. Monitoring proved to be the key to recovery in this particular incident.

In this story, a team runs into trouble migrating one of its application tiers to Kubernetes.

Interested in reading about more Kubernetes failures? Check out this tweet.