We left off at the end of the last module asking you to set availability targets for our example mobile gaming infrastructure. For this module, we're going to go with a target of three and a half nines, that is, 99.95 percent availability. Even better, we're pretty sure that we should be able to meet this target at steady state because it's based on historical data from our load balancers.

So far so good, but we're not done yet. For all we know, the last six weeks might have been unusually quiet and outage-free. Our serving systems aren't static. We're constantly making improvements and pushing new features to keep our users happy. So we can't assume today's steady state will persist forever. To borrow an old adage from finance, past performance is not an indicator of future reliability.

This leads us to ask two related questions once we've set our SLO target. First, is the error budget realistic? Can we expect to burn less than the budget if we consider longer time horizons, over years instead of weeks or months, and take into account large long-tail events? Second, what are the biggest single sources of burnt error budget? Is there any low-hanging fruit that, if fixed, could allow our service to attain higher levels of availability for less engineering effort?

To answer these questions, we gaze deeply into our crystal ball, mutter strange incantations, and surround ourselves with the light of true vision. After seeing a vague premonition of a terrible fate befalling our load balancers, we realize that we need a data-driven approach and reach for that beloved tool for predicting uncertain futures, the spreadsheet.

The practice of safety engineering begins with identifying a hazard, a condition that could lead to an accident. The risk of that accident happening can be modeled as the impact of the accident multiplied by the probability of the hazard causing the accident. The accident we're trying to prevent is missing our SLO targets. The hazards we deal with are the root causes of outages or unavailability, like the failure of one of our cloud provider's availability zones. If we can quantify the probability of one of these hazards occurring and the likely impact on our service, we can understand the risk to our SLO posed by that hazard.

We will start off by trying to enumerate risks to our SLO based on what we know about our dependencies, serving infrastructure, application, and user behavior. This is an exercise in constructive pessimism, something that SREs in particular tend to have a lot of experience with. It's often easy to come up with lots of very specific potential risks. Instead of getting bogged down in details, it's better to think about classes of risks, like "one of our serving zones is unavailable," instead of "application servers in Europe West 3A are suffering 10 percent packet loss."

Even when we categorize risks like this, it's common to find that we have a large number of heterogeneous classes. So, how do we compare them directly? What do we do when the list is far too long to realistically fix everything? The next couple of lessons will introduce our risk spreadsheet, which we can use to answer these questions. But first, we'd like you to have a go at brainstorming some risks for our example service.
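As a quick sanity check on what a three-and-a-half-nines target implies, here's a minimal sketch that converts the availability target into an error budget. It assumes a 99.95 percent SLO measured over a rolling 30-day window; the window length is our own illustrative choice, not something fixed by the module.

```python
# Minimal sketch: convert an availability SLO into an error budget.
# Assumes a 99.95% ("three and a half nines") target and a 30-day window;
# both numbers are illustrative choices, not prescribed by the module.

SLO_TARGET = 0.9995          # three and a half nines
WINDOW_DAYS = 30             # assumed measurement window

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_fraction = 1 - SLO_TARGET
error_budget_minutes = window_minutes * error_budget_fraction

print(f"Error budget: {error_budget_fraction:.4%} of the window")
print(f"~{error_budget_minutes:.1f} minutes of full downtime per {WINDOW_DAYS} days")
# ~21.6 minutes per 30 days, or roughly 4.4 hours per year
```

That handful of minutes is the budget every hazard below has to share, which is why we need a way to compare them.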
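To make the risk-equals-impact-times-probability idea concrete before we open the spreadsheet, here's a small sketch of the kind of arithmetic it automates. The hazard classes, frequencies, durations, and impact fractions below are invented for illustration; only the shape of the calculation (expected budget burn per hazard class, roughly frequency times duration times impact) reflects the model described here.

```python
# Sketch: estimate expected error-budget burn per hazard class.
# Every number below is a made-up placeholder for the example gaming service;
# the point is the structure: risk ~ probability/frequency of a hazard x its impact.

from dataclasses import dataclass

@dataclass
class Hazard:
    name: str
    incidents_per_year: float   # how often we expect this hazard to bite
    outage_minutes: float       # typical duration once it does
    impact_fraction: float      # fraction of users/requests affected

    def expected_bad_minutes_per_year(self) -> float:
        return self.incidents_per_year * self.outage_minutes * self.impact_fraction

# Hypothetical hazard classes, not taken from the course material.
hazards = [
    Hazard("Cloud availability zone failure", 0.5, 120, 0.33),
    Hazard("Bad application release rolled back", 4, 30, 1.0),
    Hazard("Load balancer misconfiguration", 1, 60, 1.0),
]

annual_budget_minutes = 365 * 24 * 60 * (1 - 0.9995)   # ~262.8 min/year at 99.95%

for h in sorted(hazards, key=lambda h: h.expected_bad_minutes_per_year(), reverse=True):
    burn = h.expected_bad_minutes_per_year()
    print(f"{h.name:40s} {burn:6.1f} bad min/yr "
          f"({burn / annual_budget_minutes:.0%} of budget)")
```

Sorting hazard classes by expected burn is what surfaces the low-hanging fruit: the biggest single line item is where a fix buys back the most error budget for the least engineering effort.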