Skip to main content

Episode 6: Site Reliability Engineering and Cloud-Native Architecture

Lars Kamp
Andreas Grabner

Andreas Grabner is a DevOps Activist at Dynatrace, where he has fifteen years of experience helping developers, testers, operations, and XOps folks do their jobs more efficiently.

In this episode, Andreas and I discuss how the shift to cloud-native and more dynamic infrastructure is followed by a change in how developers, architects, and site reliability engineers (SREs) work together.

With the sheer quantity of resources running in cloud-native infrastructure and the monitoring signals produced by each resource, the only way to keep growing without "throwing people at the problem" is to turn to automation.

Andreas makes a noteworthy distinction between DevOps engineers and SREs:

  • DevOps engineers use automation to speed up delivery and get new changes into production.
  • SREs use automation to keep production healthy.

SREs are often former IT operations and system administrators responsible for physical machines, virtual machines (VMs), and Kubernetes clusters. As SREs, they move up the stack and become responsible for everything from the bottom of the stack all the way up to serverless functions and the service itself.

We dive into the differences between SLAs, SLOs, and Google's four golden signals of monitoring—latency, traffic, errors, and saturation. Andreas shares the example of a bank and how they started defining SLOs to measure the growth of their mobile app business versus just defining engineering metrics.

This episode covers "engineering for game days," chaos engineering, and making the unplannable, plannable. Andreas shares his perspective on the general trend to "shift left" and include performance engineering in the development and architecture of cloud-native systems.