Preparing your cluster(s) for your time off

With the holidays right around the corner and some much-needed R&R on the horizon, we are taking a look at some concepts to help you enjoy your Festivus with a little less anxiety. Remember, we are all SREs now, with powerful tools at our fingertips. We just have to make sure we understand best practices so we can all enjoy the break! ☃️

Issue #148

Google literally wrote the book on site reliability engineering, and here is a free copy for you to study while all systems are still nominal. This guide explains the best practices Google has developed over the years to keep its infrastructure up and running. Read this, and go forth and reduce toil! 📖

Unpredictable stuff happens, and you are never going to know every edge case or outage ahead of time. But you can respond professionally and responsibly when an incident occurs. The folks over at FireHydrant have put together a great guide to understanding and preparing for being on call, measuring impact, and communicating with end users. 🚒

Understanding what's happening in your infrastructure is always the first step in dealing with a problem. This extensive whitepaper from Caleb over at Sumo Logic gives us a primer on observability in our Kubernetes clusters. There is a lot to digest here, but the better you can observe and understand your cluster, the faster you can respond and fix your infrastructure.

The folks over at Grafana share this in-depth guide on how to get in front of incidents and track your SLIs/SLOs before someone else does. Check out the SRE Handbook (above) if you need a refresher on SLIs and SLOs. The guide gets very specific, and if you follow it you can rest assured you will get a text message if and when things hit the fan! 📗
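To make "track your SLOs before someone else does" a bit more concrete, here is a minimal Python sketch of a burn-rate check, the idea behind most SLO-based paging. It is not taken from the Grafana guide. The function names and the 14.4 threshold are assumptions for illustration (14.4 is the rate at which you would spend 2% of a 30-day error budget in one hour), so tune them to your own service.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% error budget
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(short_window_error_ratio: float,
                long_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    Requiring the short window too keeps a page from firing long after
    the problem has already stopped. The threshold is an assumed value.
    """
    return (burn_rate(short_window_error_ratio, slo_target) >= threshold
            and burn_rate(long_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical numbers: 2% errors over 5 minutes, 1.6% over the last hour.
    print(should_page(0.02, 0.016, slo_target=0.999))
```

With a 99.9% target, both windows in the example burn well above the threshold, so this would page. A brief spike that has already faded from the short window would not.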

The folks over at Nobl9 give us a writeup on getting your SLOs right for running in the cloud. Knowing what to look at is half the battle, and this has some great input on how and what to measure. Understanding your SLOs going into the break can go a long way toward making your time off completely disconnected.
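If you want a quick gut check on what an SLO target actually allows before you log off, the arithmetic is simple. The sketch below is our own illustration, not something from the Nobl9 writeup. The function name and the 30-day window are assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


if __name__ == "__main__":
    # Compare a few common targets over an assumed 30-day window.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} SLO allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget. That is worth knowing before deciding who carries the pager over the break.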

Being ready to scale for unexpected traffic can really eliminate some of those holiday late nights. This article explains the nuances of scaling out, scaling up, and scaling the cluster itself, the Kubernetes way. It's not easy to get right, but once you do, an automated scaling strategy can be the best present you can give yourself this December! 🚦
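For a feel of what "scaling out" does under the hood, the Kubernetes documentation describes the Horizontal Pod Autoscaler's core rule as: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The Python below is only a sketch of that formula with a function name of our own choosing. The real controller also applies min/max replica bounds, stabilization windows, and pod readiness handling, all omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling formula from the Kubernetes docs."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Hypothetical load: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

In the example, pods running at 1.5 times the target lead the formula to ask for 6 replicas instead of 4. A reading of 63% against the same 60% target falls inside the tolerance and changes nothing.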

The growth of CNCF is just mind-blowing, and this wrap-up of 2021 really shows how quickly things are moving. 🚀