This week we’re looking into Kubernetes failures and incidents. As you might imagine, these issues can have a nasty impact on users, which is where recovery plays the most critical role.
David Xia, Infrastructure Engineer at Spotify, goes in depth on how Spotify accidentally deleted all its kube clusters with no user impact. Since the postmortem, Spotify has strengthened its security and recovery tactics.
Back in July 2019, Grafana’s customers experienced a production outage caused by Grafana’s pod priority system. Although recovery took around 30 minutes, the outage still had an impact on large kube pods.
Blue Matador recently wrote a blog post outlining some issues with nodes in Kubernetes clusters running out of memory. This case study covers the incident, the fix, and the takeaways.
Moonlight’s website dealt with multiple issues, including kernel panics on nodes and pods using 100% of CPU resources. Kernel panics are interesting problems to deal with in Kubernetes.
This article explores how stability issues led to a script-overloaded cluster. Monitoring played a strong role in the recovery from this particular incident.
This story describes the trouble encountered while migrating one application tier to K8s. Read more on this above!
Interested in reading about more Kubernetes failures? Check out this tweet.