Now that we've covered some of the general things to keep in mind when developing an error budget policy, let's dive a little deeper into how to find the balance between feature and reliability work. A good way of defining this trade-off in your error budget policy is to have consequences tied to increasing levels of error budget burn. For example, if an outage burns a week of error budget, you may require that the development team dedicate a pair of engineers to the action items from the post-mortem. But if an outage burns 28 days of error budget, you may also halt all feature releases until the service is back within the SLO over the trailing 28-day window. It's important to note that this isn't the same as stopping feature development. It's very common for new feature releases to be associated with a higher error budget burn rate. After all, any change to a system increases the risk of outages. Since you've made your users considerably less happy with an outage, stopping feature releases pays back some of that damage and gets you back into SLO faster.
Stopping feature releases or enforcing the prioritization of post-mortem action items are just examples of consequences. They are useful because they align the incentives of the ops and development teams. The team that holds the pager wants a more reliable service, and development teams want to build and release new features to improve their product. This way, one shouldn't happen without the other, because both teams care about and work towards each other's top priority. If SREs help developers go as fast as is safe for their users, and developers commit to prioritizing reliability issues and avoiding bad practices, then everyone gets what they want. Choosing consequences that have this alignment of incentives can make your error budget policy far more effective, and making sure those incentives push both SRE and development teams toward your overall business goals makes it easier to sell the policy to executives.
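To make the idea of tiered consequences concrete, here is a minimal sketch in Python. It assumes the two example thresholds mentioned above (a week of budget and the full 28 days); the names `Tier`, `TIERS`, and `consequences` are hypothetical and only for illustration, and a real policy would choose its own thresholds and actions.

```python
from dataclasses import dataclass
from typing import List

WINDOW_DAYS = 28  # trailing SLO window from the example above


@dataclass(frozen=True)
class Tier:
    min_days_burned: float  # threshold, in days of error budget burned
    consequence: str


# Hypothetical tiers, ordered from least to most severe.
TIERS = [
    Tier(7, "dedicate a pair of engineers to post-mortem action items"),
    Tier(WINDOW_DAYS, "halt feature releases until back within SLO"),
]


def consequences(days_burned: float) -> List[str]:
    """Return every consequence whose threshold the outage has reached."""
    return [t.consequence for t in TIERS if days_burned >= t.min_days_burned]
```

The consequences are cumulative in this sketch: an outage that burns the full 28 days triggers both the post-mortem staffing and the release halt, which matches the "you may also halt" wording of the example.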
It's helpful if the policy also includes some mention of the criteria that first alert an SRE team to a service being unreliable. If these criteria are expressed in terms of relative error budget burn rates, for example, paging someone when the service burns nine hours of budget in an hour, then they can be applied to any service with an SLO, no matter the availability target. If the policy commits the SRE team to make some effort to bring the service back into SLO before escalating to developers, it can buy some goodwill in the inevitable negotiations.
It's best to keep the core of the policy, the thresholds and actions, relatively lean and unencumbered by justifications or explanation. This may sound counter-intuitive, but you will want the people following it to be able to make clear decisions quickly. You absolutely do still need justifications for the choices made, and ideally some worked examples of applying the policy, because there will always be corner cases that people will have to use their best judgment on. It's worth working through some of these scenarios to try to identify those corner cases before setting the policy in stone. Keeping a record of precedents set when the policy has been applied in the past will also help guide future decision-making and maintain consistency.
Lastly, expect lots of negotiation on this. Depending on the size of your business, there may be several teams and personalities involved in the decision-making, each with conflicting opinions, at least at first. In this case, it helps to have a designated decider, such as an executive from the business side. Also, don't go in expecting to get everything right at the beginning. As with your SLOs themselves, make sure you leave some room for a retrospective on your error budget policy every so often to ensure that it's still performing well for your organization.
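The burn-rate paging criterion mentioned earlier can be sketched as follows. This assumes the usual definition of burn rate as the observed error ratio divided by the error ratio the SLO allows, so that "nine hours of budget in an hour" corresponds to a burn rate of 9. The function names and the `PAGE_THRESHOLD` constant are hypothetical, not part of any particular monitoring tool.

```python
PAGE_THRESHOLD = 9.0  # "nine hours of budget in an hour", from the example


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumed per unit of time, relative to the rate the SLO allows.

    error_ratio: fraction of bad events in the measurement window (0..1)
    slo_target:  availability target as a fraction, e.g. 0.999
    """
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0 to leave any budget")
    return error_ratio / allowed


def should_page(error_ratio: float, slo_target: float,
                threshold: float = PAGE_THRESHOLD) -> bool:
    """Page when the service burns budget at or above the threshold rate."""
    return burn_rate(error_ratio, slo_target) >= threshold
```

Because the threshold is relative, the same rule works across targets: a 1% error ratio is a burn rate of about 10 against a 99.9% target and pages, while the same 1% against a 99% target is a burn rate of about 1 and does not.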
Remember, SLOs and the policies behind them create a common language and align incentives across teams, enabling your organization to effectively prioritize and engineer reliability.