When you exceed your error budget, how can you improve your service to reach your reliability target? Your SRE team can do a lot, with cooperation from the Dev team, to make your target SLO more attainable. One of the most obvious ways, which you've probably already thought about, is rolling changes out gradually so that only a small group of users is impacted by any failure. By reducing impact, you ensure a smaller percentage of your users is affected when your service misses its SLO. Another is removing single points of failure in your system. If your service runs in only a single region, that could spell trouble, especially if a squirrel decides to sneak into the datacenter, chew on a live power cable, start a fire, and cause cataclysmic failures. Side note: this is not a joke, and it happens more often than you think. In fact, if you're interested in reading more about datacenter dangers and the furry friends that accompany them, check out the supplemental reading we've added to this module. You can do many things to mitigate such failures, like adding redundancy in other regions that will hopefully withstand an army of cable-chewing squirrels.

We also have other improvements we can make. Think about the time from when a problem first affects a user to when it's resolved and no longer an issue. First, a user is impacted, and then someone, such as your SRE on-call, is informed so they can fix the issue. The gap between these two moments is called the time-to-detect, or TTD. From there, the time it takes from someone being informed of the issue to actually fixing it and recovering the service is the time-to-repair, or time-to-resolution, which you'll see abbreviated as TTR. Any of these components can be improved to make your service more reliable.

We can generalize this with a formula. The expected impact of a particular type of failure on your error budget is proportional to the time-to-detect plus the time-to-resolution, multiplied by the percentage of impact, divided by the time-to-failure. This last value, TTF, expresses how frequently you expect this particular failure to occur. You may also have seen it expressed as TBF, or time-between-failures; it's basically the same idea. So to improve reliability, you can reduce time-to-detect or time-to-resolution, reduce impact percentage, or increase the time-to-failure.

To improve the time-to-detect, you can implement mechanisms to catch outages faster. For example, you can add automated alerting that pages a human instead of relying on people to notice a bad graph or abnormal performance. Another common approach is monitoring. While you may have monitoring to measure metrics like your SLIs, active users, or ad clicks, do you have monitoring in place to measure your SLO compliance? If not, that's a good starting point. Knowing when you're fast approaching your error budget limit, or when you're safely within your target SLO, is a helpful tool for improving reliability.

To improve time-to-resolution, you can implement mechanisms to fix outages quicker, such as developing a playbook or making it easier to parse and collate your servers' debug logs, so that the poor souls who are first to respond to the outage don't have to figure out what to do from scratch while they're stressed. Or, perhaps, you can automate a manual task, such as draining a zone and redirecting traffic while you investigate. Making it quicker and easier to troubleshoot, and having fast, easy tools for mitigation at your disposal, can drastically reduce the time an outage lasts.
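To make that formula concrete, here is a minimal sketch in Python. The function name and the example numbers are made up purely for illustration; they aren't taken from any real service.

```python
def expected_budget_impact(ttd, ttr, impact_fraction, ttf):
    """Expected error-budget burn from one failure mode.

    Proportional to (time-to-detect + time-to-resolution) times the
    fraction of users impacted, divided by how often the failure
    happens (time-to-failure). All times in the same unit (minutes here).
    """
    return (ttd + ttr) * impact_fraction / ttf


THIRTY_DAYS = 30 * 24 * 60  # minutes

# Hypothetical failure mode: a human notices after 15 minutes, the fix
# takes another 60 minutes, 10% of users are affected, and it happens
# roughly once a month.
baseline = expected_budget_impact(ttd=15, ttr=60, impact_fraction=0.10, ttf=THIRTY_DAYS)

# Same failure, but automated alerting detects it in 2 minutes and a
# staged rollout limits it to 1% of users.
improved = expected_budget_impact(ttd=2, ttr=60, impact_fraction=0.01, ttf=THIRTY_DAYS)

print(f"baseline burn: {baseline:.6f}")
print(f"improved burn: {improved:.6f}")
print(f"reduction: {1 - improved / baseline:.0%}")
```

The levers fall straight out of the formula: shrink TTD or TTR, shrink the impact fraction, or stretch TTF, and the expected budget burn drops.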
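And to picture the SLO-compliance monitoring just mentioned, here is a small hypothetical check that pages someone when the error budget is burning too fast. The SLO target, the burn-rate threshold, and the page_oncall function are all assumptions for the sketch, not a prescribed policy or a real paging API.

```python
# Hypothetical SLO-compliance check: page a human when the measured
# error rate would burn through the error budget too quickly.
SLO_TARGET = 0.999              # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET   # so 0.1% of requests may fail
BURN_RATE_THRESHOLD = 10        # page if the budget is burning 10x too fast


def page_oncall(message: str) -> None:
    # Stand-in for whatever paging system you actually use.
    print(f"PAGE: {message}")


def check_slo_compliance(good_requests: int, total_requests: int) -> None:
    """Compare the error rate in a monitoring window against the budget."""
    if total_requests == 0:
        return
    error_rate = 1 - good_requests / total_requests
    burn_rate = error_rate / ERROR_BUDGET
    if burn_rate >= BURN_RATE_THRESHOLD:
        page_oncall(f"error budget burning {burn_rate:.0f}x faster than sustainable")


# 2% of requests failing against a 0.1% budget: someone gets paged.
check_slo_compliance(good_requests=98_000, total_requests=100_000)
```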
A common technique to reduce impact percentage is to limit the number of users a particular change can affect in a given amount of time. For example, a percentage-based rollout, where a new feature is released to only 0.1 percent of users, then 1 percent, then 10 percent, and so on in a staged manner, gives changes time to bake instead of pushing a potentially outage-causing release to all of your users at once; there's a quick sketch of this idea at the end of this section. Another way of reducing impact percentage is engineering your service to run in a degraded mode during a failure. For example, your service may decide to allow read-only operations but not allow writes. If the majority of your user actions are reads, this may mitigate the impact of an outage.

Lastly, increasing time-to-failure means making any particular failure less likely to happen. There are lots of approaches here, such as running your service in multiple failure domains and automatically directing traffic away from a zone or region that has failed.

These are just a few things you can do to improve reliability. There are also operational approaches you can take, which we'll touch on in the next video.
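Here is that sketch of a percentage-based rollout: a minimal Python example that assumes a stable hash of the user ID decides who falls inside the current stage. The stage percentages, feature name, and function names are illustrative, not any specific release tool.

```python
import hashlib

# Hypothetical staged rollout: a stable hash of (feature, user) decides
# whether a given user sees the new code path at the current stage.
ROLLOUT_STAGES = [0.001, 0.01, 0.10, 0.50, 1.00]   # 0.1% -> 1% -> 10% -> ...


def in_rollout(user_id: str, feature: str, fraction: float) -> bool:
    """Return True if this user falls inside the current rollout fraction.

    Hashing keeps the decision stable across requests, so the same small
    group of users bakes the change while everyone else stays on the old path.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < fraction


# Stage 0: only about 0.1% of users get the new feature, so a bad release
# keeps the impact percentage in the formula above very small.
stage = 0
for user_id in ("user-1001", "user-1002", "user-1003"):
    enabled = in_rollout(user_id, "new-checkout-flow", ROLLOUT_STAGES[stage])
    print(user_id, "sees new feature:", enabled)
```

Because the decision is a deterministic hash rather than a coin flip per request, each user consistently sees either the old or the new behavior, which is what lets a small stage genuinely limit the blast radius.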