When you run Kubernetes clusters, understanding the health of the services running on them is job number one. Thanks to Google and their SRE handbook, we have a pretty good idea of how to do this. So, without further ado, let's jump into some ways to measure health (that is, SLOs and SLIs).
We will start with the experts on all things service level to get good definitions of these metrics and of the differences, subtle and large, between them. Those differences can be murky to understand and communicate, so this is a great place to refer back to. 📈
This practical guide dives into a few options and mental frameworks for thinking about your SLOs. It also gives a pretty good overview of Prometheus, Grafana, and even Jaeger (tracing), and how to use them for your service-level metrics. 📘
This quick article gives you a hit list of the metrics and things you should / could monitor as SLIs for your platform. This article really helped me wrap my head around where to start when planning out my SLIs and how to think about them.
We head back to the experts for this in-depth playbook for how to set up your own SLOs. This one gets pretty deep pretty fast and gives you a great way to think about setting your service level metrics and how to measure them.
This is a great writeup on implementing your SLOs and setting up your Prometheus dashboards. Coming from the folks over at Buoyant, it’s no surprise that it argues for using a service mesh when setting up your service level metrics, but they make a good case for it.
Knowing about problems before they make it to production is a path to happiness for SREs. In this writeup they explain how to use Keptn along with Prometheus to shift things left using Quality Gates. 🚪
If you’re considering registering for the Contributor Summit at KubeCon, virtually or in person, register now so it doesn't get cancelled!