In the last video, we said that one of the most important characteristics of a good SLI is that it has a predictable relationship with the happiness of your users. But what does that mean in practice?

Most services running in production and serving users already have something monitoring many aspects of their operation. System metrics like load average, CPU utilization, memory usage, or bandwidth are commonly graphed and visible on monitoring dashboards. It's tempting to reach for these metrics when looking for SLIs, because sharp changes in them are often associated with outages. But that's not a good idea. If you think about it from the perspective of your users, they don't directly see your CPUs pegged at 100 percent; they see your service responding slowly. You may also be monitoring the internal state of your services and have noticed correlations between things like thread pool fullness or request queue length and outages. Again, these may seem like good SLIs, but just like the CPU utilization example above, the data is noisy, and there are many reasons why large changes could occur. It's pretty clear that none of these metrics has a predictable, linear relationship with user happiness.

So, what are the characteristics of a good SLI? We've already covered one: it should have a predictable relationship with the happiness of your users. Remember CRE's second principle from our introductory module: it's the experience of our users that determines the reliability of our services. This offers another rule of thumb: the SLI should aim to answer the question, "Is our service working as our users expect it to?" One of the core assumptions we're making is that when we violate our users' expectations, they become unhappy. So the closer we can get to directly measuring our performance against those expectations, the more accurate our SLI will be as a measurement of user happiness.

We also recommend that an SLI be expressed as a ratio of two numbers, good events divided by valid events, to give it a value between 0 and 100 percent. This is the SLI equation, which we'll cover in more detail in the next lesson. Lastly, we suggest that the SLI be aggregated over a reasonably long time window to smooth out noise from the underlying data. It's easy to set a static threshold for "too unhappy" when your SLIs have all these characteristics.

Let's see these rules of thumb in practice. Here we have two monitoring metrics that could potentially be used as SLIs. Why do you think the lower metric is better for use as an SLI than the upper one? These metrics show different data for a period of time in which the service in question suffered an outage. If we have some way of knowing our users were "too unhappy" during this time period, like the number of scathing messages they were posting on Twitter, and that this period is represented by the red area on these two graphs, then we can assert that the lower metric is a far more useful measure of user happiness. While the bad metric does show an obvious downward trend during the outage, there's a lot of variance in the data: the range of values we see during normal operation overlaps heavily with the range seen during our outage. So this prospective SLI fails both of our tests. On the other hand, the good metric has a noticeable dip that closely matches the time span of our outage. Because the data has been smoothed over a time window, there's much less noise.
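To make the ratio and smoothing ideas concrete, here is a minimal sketch of the SLI equation applied over a sliding window. This isn't from the video; the class name, window size, and event counts are all hypothetical, chosen only to illustrate good events divided by valid events, aggregated over time.

```python
from collections import deque


class WindowedSLI:
    """Compute an SLI as good events / valid events over a sliding window.

    Aggregating across many intervals, instead of looking at each one
    individually, smooths out noise in the underlying data.
    """

    def __init__(self, window_size: int):
        # Each entry is a (good_count, valid_count) pair for one interval,
        # e.g. one minute of request logs.
        self.intervals = deque(maxlen=window_size)

    def record(self, good: int, valid: int) -> None:
        self.intervals.append((good, valid))

    def value(self) -> float:
        """Return the SLI as a percentage between 0 and 100."""
        good = sum(g for g, _ in self.intervals)
        valid = sum(v for _, v in self.intervals)
        if valid == 0:
            return 100.0  # no valid events, so nothing to be unhappy about
        return 100.0 * good / valid


# Hypothetical usage: 60 one-minute intervals gives a one-hour window.
sli = WindowedSLI(window_size=60)
sli.record(good=995, valid=1000)   # a typical minute of traffic
sli.record(good=940, valid=1000)   # a noisier minute
print(f"SLI: {sli.value():.2f}%")  # -> SLI: 96.75%
```

A single bad minute barely moves the windowed value, which is exactly the smoothing behavior that makes the good metric in our example so much easier to read.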
During normal operation, the good metric has a narrow range of values that is noticeably different from the equally narrow range observed during our outage. This makes it a much better indicator of user happiness, since it is both predictable and accurately tracks the performance of our service against our users' expectations.

This matters because SLIs need to provide a clear definition of good and bad events, and a metric with lots of variance and poor correlation with the user experience is much harder to set a meaningful threshold for. For the bad metric, our choices are to set a tight threshold and run the risk of false positives, or to set a loose threshold and risk false negatives. Worse, choosing the middle ground means accepting both risks. The good metric is much easier to set a threshold for; the biggest risk we have to contend with is that the SLI may not recover as quickly as we'd hoped after the outage ends.

Ensuring your metrics have these properties goes a long way toward making them effective SLIs. But where you measure prospective SLIs also plays a crucial part in their overall utility. In the next video, we'll go over the five main ways of measuring SLIs and talk a bit about the engineering trade-offs involved.
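As a closing illustration of that threshold trade-off, here is a rough sketch, not part of the video's example, with entirely made-up numbers. It labels each measurement window with whether users were actually unhappy, then counts the alerting errors a candidate threshold would produce:

```python
def threshold_errors(sli_values, outage_flags, threshold):
    """Count alerting errors for a candidate SLI threshold.

    sli_values:   per-window SLI percentages
    outage_flags: True where users were actually unhappy (ground truth)
    A window "fires" when its SLI drops below the threshold.
    """
    false_positives = sum(
        1 for sli, outage in zip(sli_values, outage_flags)
        if sli < threshold and not outage  # fired, but users were fine
    )
    false_negatives = sum(
        1 for sli, outage in zip(sli_values, outage_flags)
        if sli >= threshold and outage     # stayed quiet during real pain
    )
    return false_positives, false_negatives


# Hypothetical noisy metric: normal and outage windows overlap in value.
slis    = [99.1, 97.8, 99.4, 96.9, 98.5, 99.0]
outages = [False, False, False, True, True, False]

for t in (97.0, 98.0, 99.0):  # loose, middle, and tight thresholds
    fp, fn = threshold_errors(slis, outages, t)
    print(f"threshold {t}: {fp} false positive(s), {fn} false negative(s)")
```

In this toy data, the loose threshold misses a real outage window, the tight one fires during normal operation, and the middle ground does both, which is exactly the dilemma the bad metric forces on us.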