At the end of the last lesson, we asked you to fill out your own copy of the risk spreadsheet. I'm sure the more curious amongst you have had a peek at the risk stack rank sheet, where the magic happens. Let's take a look at what the stack rank looks like for the set of risks we were looking at in the last lesson. Why don't you pull up your own copy and play along? Let's go to the risk stack rank sheet.

The first thing you'll notice is that we've changed two of the target availabilities from the master spreadsheet. If you think back to the start of the module, we decided to set an availability target of three and a half nines for our user profile journey, but this isn't one of the three default choices. The gray cells below these targets show the available, accepted, and unallocated error budget, as well as the cost at which a single risk becomes too large to safely accept. Below this, the risks from our risk catalog sheet are stack ranked in order of the amount of error budget they consume. The cells for each risk are shaded according to the key on the right, which makes it easy to see that a three nines target, where most of the cells are green, is far easier to meet than a four nines target, where they're mostly red. Given the difference in the available error budget for these targets, this isn't a huge surprise.

A key variable here is the topmost blue cell: the maximum percentage of your overall yearly budget that an individual risk is allowed to consume. It defaults to 25 percent, but let's set it to 100 percent temporarily. The default is lower because it's a bad idea to have a single known class of risk that's expected to consume most of your available error budget. Using 100 percent highlights in red the risks that will blow more than your entire error budget over the course of the year, which you have absolutely no choice but to fix to meet your desired level of availability. In this case, there's no single risk that's going to use all of our error budget if we're targeting three and a half nines of availability. But you'll notice that some of the cells have turned yellow. This is telling us we can't accept all the risks and avoid doing any engineering work at all. Cells turn yellow when the sum cost of all the green cells below them consumes enough error budget that adding their own cost would make the total larger than the available budget. So, if we were comfortable carrying large single risks, we could choose to accept a subset of risks that fit within the error budget and plan engineering work to mitigate the rest.

Let's flip the risk limit back to 25 percent of our budget and examine our options. We've got no yellow cells, so the simple option here is to plan engineering work to alleviate all the risks highlighted in red and ignore those that are green. A closer examination of the red risks shows that they might be hard to eliminate completely. The top risk, overload caused by an extremely popular new content release for our game triggering a stampede of clients, is super easy to mitigate. After all, we built our service to be horizontally scalable. We can just turn up a huge amount of temporary serving capacity for the launch day, eat the cost, and we should be fine. However, overload in the user profile database is a little harder to cope with. This may require a caching layer or the addition of more read-only replicas. If writes are the bottleneck, things become a lot harder, but maybe we can accept that risk.
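As an aside, it's worth making those budget numbers concrete. Here's a minimal sketch, not the course spreadsheet itself, of the arithmetic involved: the yearly error budget implied by an availability target, the single-risk cap, and the expected cost of a risk given its frequency, time to detect and repair, and the fraction of users it affects. The exact formulas in the real sheet may differ; this just mirrors the quantities we've been discussing, with three and a half nines taken as 99.95 percent.

```python
# A rough sketch of the arithmetic behind the stack rank sheet, not the actual
# course spreadsheet. "Bad minutes" here means user-facing downtime weighted by
# the fraction of users affected.

MINUTES_PER_YEAR = 365.25 * 24 * 60
SINGLE_RISK_CAP = 0.25  # the topmost blue cell: max share of budget per risk

def error_budget_minutes(target_availability: float) -> float:
    """Allowed bad minutes per year for a given availability target."""
    return (1 - target_availability) * MINUTES_PER_YEAR

def risk_cost_minutes(incidents_per_year: float, minutes_to_detect: float,
                      minutes_to_repair: float, fraction_impacted: float) -> float:
    """Expected bad minutes per year contributed by a single risk
    (assumed model: frequency x outage length x user impact)."""
    return incidents_per_year * (minutes_to_detect + minutes_to_repair) * fraction_impacted

budget = error_budget_minutes(0.9995)  # three and a half nines
print(f"annual error budget: {budget:.0f} bad minutes")
print(f"single-risk cap ({SINGLE_RISK_CAP:.0%}): {SINGLE_RISK_CAP * budget:.0f} bad minutes")
```

At 99.95 percent the budget works out to roughly 263 bad minutes a year, so the default 25 percent cap flags any single risk expected to cost more than about 66 bad minutes.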
We're going to navigate back to the risk catalog sheet now and tweak the database overload risk. Let's say these mitigations, a caching layer and more read-only replicas, reduce the frequency of these outages by a factor of four, so that they only occur once every two years on average. Let's suppose that when they do happen, they only impact writes, which are 20 percent of the traffic. However, provisioning a larger database VM to handle the write load doubles our recovery time, because we have to ask our cloud provider very nicely for a non-standard machine type. Let's go back to the stack rank and see what happens. This shifts the risks firmly into green territory, so we should be okay now.

Our next large risk is deletion or corruption of our database: it's rare but catastrophic, and we can't really eliminate it. Our best bet is to cut that huge time to recovery, where we're currently expecting to take an entire working day to get the service back up and running following the outage. Switching to LVM snapshots for recent database backups should cut lengthy restore times down. But to be certain of how long it will take, we should automate the entire restore process and run restores regularly. Ideally, we do things like run our complete integration test suite and QA process against servers pointed at the restored data, so we've got confidence the restored data is usable. It's a lot of work, but if we can cut the recovery time down to under four hours or so, the risk becomes acceptable.

Another database risk crops up next: the chance that the performance of our master degrades to unacceptable levels because of slow physical disk volumes, a bad schema push, or similar. Because we don't have separate SLOs for read and write latency, we're likely to be a bit slow off the mark catching this. Since fixes require costly database maintenance or a support ticket, our recovery times are relatively slow too. This problem also happens surprisingly frequently. The simplest solution is to add a separate SLO tracking the latency of requests that mutate state, so increases in latency are not masked by read traffic. If we detect these problems quickly enough, we can accept the risk, though having some form of testing in place to reduce the frequency of bad schema pushes would almost certainly be worthwhile too.

Our final unacceptable risk is the one from the post-mortem exercise in the last module. Fortunately, we already have a good plan for that: adding a blackbox prober. Like the previous risk, this will cut our detection time drastically, allowing us to accept the remaining risk.

If we flip back to our stack rank, we're almost all green, and we've planned out at least a couple of quarters of high-value, impactful project work for our operations team. But what do we do about that last pernicious yellow box? It's our database restore risk, which we have to accept because we've already mitigated it as much as possible. Let's take the risks that we've just planned work to reduce and mark them as accepted. Oh no, that's made things worse: we now have two yellow boxes. We also need to accept the risk of a global networking outage at our cloud provider and the risk of corruption due to hardware failure, since there's not a huge amount we can do about either of those.
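All this juggling of which risks to accept is just bookkeeping against the budget: walk down the stack rank, keep a running total of the expected cost of each accepted risk, and flag anything whose cost would push the total over the available budget, which is essentially what the yellow cells are showing. Here's a minimal sketch of that walk; the risk names come from our catalog, but the costs are made-up illustrative numbers, not values from the spreadsheet.

```python
# A minimal sketch of the accept-or-flag walk behind the yellow cells. The
# costs below are made up for illustration; the real numbers live in the
# risk stack rank sheet.

BUDGET_MINUTES = (1 - 0.9995) * 365.25 * 24 * 60  # ~263 bad minutes/year

# Risks in stack rank order, with illustrative expected costs (bad minutes/year).
risks = [
    ("global networking outage at our cloud provider", 70),
    ("database corruption due to hardware failure", 50),
    ("database restore needed (after mitigation)", 60),
    ("increased errors during release pushes", 45),
    ("loss of an availability zone", 35),
    ("load balancer misconfiguration", 30),
]

accepted_total = 0.0
for name, cost in risks:
    if accepted_total + cost <= BUDGET_MINUTES:
        accepted_total += cost
        print(f"accept: {name} (running total {accepted_total:.0f} min)")
    else:
        print(f"flag:   {name} (would exceed the {BUDGET_MINUTES:.0f} min budget)")
```

With numbers in this ballpark, one of the remaining risks doesn't fit, so something has to give: either we mitigate it, or we mitigate other risks to free up budget, which is exactly the trade-off we work through next.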
That leaves three other risks, but the spreadsheet is telling us we can only fit two of them into our available budget. Fortunately, two of them look like things we should fix anyway. Increased errors during our release pushes are a constant source of background pain for our users. Perhaps we're running our servers too hot, and when a fraction of them are down for updates the rest are pushed over the limit. Or perhaps we've got a cold-start problem, where the servers take a while to set up connections to all their backends, warm in-process caches, or just-in-time compile bytecode. Whatever the root cause is, it's worth finding and fixing it. Similarly, we should realistically be able to take the loss of an availability zone without dropping a third of our traffic on the floor, so that's worth fixing too. That lets us accept the risk of a load balancer misconfiguration, and it means we've got more project work for our poor ops team.

After all this work to mitigate large sources of risk, the remaining accepted risks constitute a little under four-fifths of our error budget, which gives us confidence that we can meet our three and a half nines SLO over the long term, with a little headroom for the unexpected.

Now, we'd like you to do a similar analysis and planning exercise for the set of risks you have in your own sheet. For each risk, either accept it or plan work to mitigate or fix it. Do you think our service can meet three and a half nines?
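As you work through your own sheet, the same simple cost model is a handy way to check whether a planned mitigation actually moves a risk into the green. For example, the database overload mitigation from earlier in this lesson works out roughly like this; the 60-minute baseline recovery time is an assumption for illustration, while the factor-of-four frequency reduction, the 20 percent write share, and the doubled recovery time come from the scenario.

```python
# Before/after check for the user profile database overload risk. The
# 60-minute baseline recovery time is an illustrative assumption; the other
# factors match the scenario (frequency / 4, writes = 20% of traffic,
# recovery time x 2).

def risk_cost_minutes(incidents_per_year, outage_minutes, fraction_impacted):
    return incidents_per_year * outage_minutes * fraction_impacted

before = risk_cost_minutes(incidents_per_year=2.0,     # twice a year
                           outage_minutes=60,          # assumed baseline
                           fraction_impacted=1.0)      # all traffic affected

after = risk_cost_minutes(incidents_per_year=2.0 / 4,  # once every two years
                          outage_minutes=60 * 2,       # slower VM resize
                          fraction_impacted=0.2)       # writes only

print(f"before: {before:.0f} bad minutes/year")
print(f"after:  {after:.0f} bad minutes/year ({after / before:.0%} of before)")
```

Quartering the frequency and limiting the impact to writes more than outweighs the doubled recovery time: whatever the baseline, the expected cost drops to a tenth of what it was, which is why that risk lands firmly in green territory.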