Service Level Metrics

When you run a Kubernetes cluster, understanding the health of the services on it is job number one. Thanks to Google and their SRE handbook, we have a pretty good idea of how to do this. So, without further ado, let's jump into some ways to measure health (a.k.a. SLOs and SLIs).

Issue #140

We will start with the experts on all things service level and get good definitions of these metrics, covering both the nuances and the big differences between them. It can be a bit murky trying to understand and communicate those differences, so this is a great place to refer back to. 📈

This practical guide dives into a few options and mental frameworks for thinking about your SLOs. It also gives a pretty good overview of Prometheus, Grafana, and even Jaeger (tracing), and how to use them for your service-level metrics. 📘

This quick article gives you a hit list of the metrics and other things you should (or could) monitor as SLIs for your platform. It really helped me wrap my head around where to start when planning out my SLIs and how to think about them. 🧠

We head back to the experts for this in-depth playbook on setting up your own SLOs. This one gets pretty deep pretty fast, and it gives you a great way to think about setting your service level metrics and measuring them.
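To make the SLI / SLO relationship a bit more concrete, here is a minimal sketch of the arithmetic, not taken from any of the linked articles. It assumes a request-based availability SLI (good events divided by total events) and treats the error budget as whatever the SLO target leaves over. The function names and the example numbers are made up for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were 'good' (e.g. non-5xx responses)."""
    if total_events == 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    allowed_failure = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests in the window, 500 of them failed.
    sli = availability_sli(good_events=999_500, total_events=1_000_000)
    remaining = error_budget_remaining(sli, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

With a 99.9% target, those 500 failures use up roughly half of the budget for the window, which is the kind of number you would want on a dashboard or in an alert.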

This is a great writeup on implementing your SLOs and setting up your Prometheus dashboards. It comes from the folks over at Buoyant, so the service mesh angle is no surprise, but they still make a good case for using one when setting up your service level metrics.

Knowing about problems before they make it to production is a path to happiness for SREs. This writeup explains how to use Keptn along with Prometheus to shift things left using Quality Gates. 🚪

If you’re considering registering for the Contributor Summit at KubeCon, virtually or in person, register now so it doesn't get cancelled!