SRE - Monitoring Applications and Infrastructure

Fergus MacDermot
5 min read · Apr 8, 2019

Recently, I attended the Hong Kong edition of the Microsoft Ignite tour, where David Blank-Edelman gave a great talk: "Monitoring Infrastructure and Apps in Production, and Diagnosing Failure in the Cloud". Thanks to him for a very entertaining session. The talk was conceptual and covered things relevant to any system. Below are my takeaways, supplemented with my own experience. Note: this is only a very small area of SRE, and I recommend reading around the subject.

What is Site Reliability Engineering?

Ben Treynor Sloss wrote:

My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team

Or said another way, “SRE helps an organization sustainably achieve the appropriate level of reliability in systems, services and products”.

There are three key words here:

  • sustainably: not burning people out with long working hours. People achieve more and make fewer mistakes during normal working hours, i.e. not at 3 in the morning.
  • appropriate: 100% reliability is the wrong goal. If an application depends on external integrations, nothing you do can guarantee those dependencies stay up. More important is the ability to handle failure gracefully.
  • reliability: no one wants to see white screens or stack traces. Reliability means giving back a useful response. If Netflix's recommendation engine is down, users still receive a list of default recommendations.

Why monitor?

Simply put: to verify an application is behaving as I and others expect. This means the service meets the goals we set for it, and that we can understand what happens when a change is made. Moreover, we want to know about problems before the customer does.

SRE Practices

It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors

Source: the Site Reliability Engineering book (linked below).

Based on Maslow's hierarchy of needs, the Google book on SRE has a hierarchy of service reliability/production needs, with monitoring as the foundation. (Source: the Site Reliability Engineering book, linked below.)

Please read the relevant sections of the linked book for the underlying explanation of the hierarchy. When considering service reliability, we often think about the following:

  • reliability
  • error rate
  • availability
  • request latency
  • system throughput
  • correctness (are the responses/answers correct?)
  • quality (can the service degrade gracefully?)
  • fidelity (how often do we show the full experience we want?)
  • durability (is the data still there when we want it?)
  • freshness (how up to date is the data?)

Most important of all: it has to be about the customer. 'Start by thinking about (or finding out!) what your users care about, not what you can measure' (probably said by many; I got it from the Google SRE book linked below). If all the metrics look good but the customer experience is terrible, then look for new ways to measure.

To monitor effectively, we need to set criteria by which a service can be judged, so that we can verify the application is behaving as we expect. This involves setting up service level indicators (SLIs) and service level objectives (SLOs).

Service Level Indicators

Service Level Indicators are measures of some aspect of the service, created to give us a defined view of how a service is behaving. They may include things such as reliability, availability, request latency, error rate, and system throughput.

Service Level Indicators are made up from a ratio/proportion, where common measures are latency, error rate, throughput and availability. The data is often turned into a metric such as a ratio, average or percentile, and must include the location where the measurement took place. Some examples are:

# of successful HTTP calls/# of HTTP calls at the LB
# of operations that completed in < 10ms/# of operations at the client
# of “full quality responses”/# of responses in the server log
# of records processed/# of records as determined by the app

And they can be turned into percentage figures in the following way:

Ratio * 100 = % proportion
e.g. 50 successful HTTP calls out of 100 HTTP calls is a ratio of 0.5 (50/100)
0.5 * 100 = 50
50% availability
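
As a quick illustration, here is a minimal Python sketch of the same arithmetic; the function name and the zero-traffic convention are mine, not from the talk:

def sli_percentage(good_events: int, total_events: int) -> float:
    """Proportion of good events, expressed as a percentage."""
    if total_events == 0:
        return 100.0  # no traffic at all: conventionally treat the SLI as met
    return good_events / total_events * 100

# 50 successful HTTP calls out of 100 -> ratio 0.5 -> 50% availability
print(sli_percentage(50, 100))  # 50.0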

Measurements can come from multiple places, such as log collectors, clients, the application itself, load balancers or the front end. However, don't track too many: it is hard to pay attention to all of them, yet too few leaves areas unexamined.

Service Level Objectives

A service level objective is a target value or range of values for a service, measured by a service level indicator. So the basic SLO forms would be: the SLI is at or below a target (SLI ≤ target), or the SLI is between the lower and upper bounds of what is deemed acceptable (lower bound ≤ SLI ≤ upper bound).
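
A minimal Python sketch of those two forms (the function names and thresholds are illustrative, not from the talk):

def meets_target(sli: float, target: float) -> bool:
    # SLO form 1: SLI <= target, e.g. error rate at or below 0.1%
    return sli <= target

def within_bounds(sli: float, lower: float, upper: float) -> bool:
    # SLO form 2: lower bound <= SLI <= upper bound
    return lower <= sli <= upper

print(meets_target(0.0005, 0.001))    # True: 0.05% error rate vs 0.1% target
print(within_bounds(0.95, 0.9, 1.0))  # True: the SLI sits inside the acceptable band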

SLO Basic Recipe

Start monitoring from the customer perspective; they are the ones we do this for.

The thing to measure would be:
* HTTP requests
* storage
* operations
The SLI proportion would be:
* successful 50% of the time
* able to read the data 99.9% of the time
* returned in 10ms 90% of the time
The time statement would be:
* in the last ten-minute period
* during the last quarter
* in the previous rolling 30-day period

An example would be:
50% of HTTP requests, as reported by the load balancer, succeeded in the last 30-day window.
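
Putting the recipe together, here is a hedged sketch of how the three parts (thing, proportion, time window) compose into one checkable statement; the dataclass and its fields are my own framing, not from the talk:

from dataclasses import dataclass

@dataclass
class Slo:
    thing: str        # what we measure, e.g. HTTP requests at the LB
    threshold: float  # the required proportion, e.g. 0.5 for 50%
    window: str       # the time statement, e.g. "last 30 days"

checkout_slo = Slo("HTTP requests succeeded at the LB", 0.5, "last 30 days")

observed_ratio = 48 / 100  # 48 successes out of 100 requests in the window
met = observed_ratio >= checkout_slo.threshold
print(f"{checkout_slo.thing} over {checkout_slo.window}: "
      f"{observed_ratio:.0%} vs {checkout_slo.threshold:.0%} -> {'met' if met else 'missed'}")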

Example compound SLOs would be:
90% of reads in the last week took place in < 10ms (as reported by the disk checker)
95% of reads in the last week took place in < 20ms (as reported by the disk checker)
Segmented:
percentiles of things (50th, 90th, 95th, 99th)
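
Here is a hedged Python sketch of checking such a compound, percentile-based SLO; the nearest-rank percentile helper and the sample latencies are made up for illustration:

def percentile(samples: list[float], pct: float) -> float:
    # nearest-rank percentile: the value at the rank closest to pct% of the samples
    ordered = sorted(samples)
    index = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[index]

read_latencies_ms = [3, 4, 5, 6, 7, 8, 9, 9, 9, 18]  # pretend disk-checker data

p90 = percentile(read_latencies_ms, 90)
p95 = percentile(read_latencies_ms, 95)
print(f"p90 = {p90}ms, target < 10ms: {'met' if p90 < 10 else 'missed'}")
print(f"p95 = {p95}ms, target < 20ms: {'met' if p95 < 20 else 'missed'}")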

Remember to keep the SLOs to something the customer cares about.

Alerting

Let's think about what alerting is: detecting that something is wrong, and by "wrong" we mean we are not meeting our service level objectives from the customer's perspective.

Actionable Alerting

First, let's start with what actionable alerts are not: they are not logs, notifications, heartbeats, or any normal everyday activity. Who wants to be buzzed or distracted when everything is OK?
Actionable alerts are events that need a human to investigate, and it should be the right human.
It's very important to target the right person to respond, and to give them the relevant information in the alert so they can act. The crucial details are the context:

  • where the alert came from
  • which expectation was violated
  • why it is an issue for the customer
  • how to resolve it

Remember, this could be 3 in the morning, so the quicker the issue is resolved the better. Giving context up front will save a huge amount of investigation time.
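
As an illustration (not from the talk), an alert payload carrying that context might look like the sketch below; every field name and value here is made up:

alert = {
    "source": "load balancer, us-east-1",                             # where it came from
    "violation": "availability 97.2% < 99.9% SLO over the last 30 days",  # which expectation was violated
    "customer_impact": "~3% of checkout requests are failing",        # why the customer cares
    "runbook": "https://example.com/runbooks/checkout-availability",  # how to start resolving it
}

# the responder sees everything needed to start acting immediately
for key, value in alert.items():
    print(f"{key}: {value}")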

Creating good service alerts

It may take some trial and error to get this right. Some general guidelines are:

  • not too many or too few
  • don’t overlap them
  • alert for production services only
  • create them people-first
  • use different styles of alerts for issues, planned maintenance and advisories

Links

There are too many to list them all. Best to go back to the basics:

Site Reliability Engineering (free online)

The Site Reliability Workbook (free online)

Or just google site reliability engineering resources.
