If you have ever seen a typical American TV crime drama, you know the part where the police interrogate suspects, throwing out blame and accusations to try to get them to talk. That's pretty similar to how many engineers have historically felt during incident reviews. It's hyperbolic, of course, but few things have struck more dread into the hearts of developers than the words "incident review." However, the thinking on incident reviews and how they are approached has evolved considerably over the past few years, thanks in large part to organizations adopting DevOps principles. In this video lesson, you'll learn how the industry's approach to incident reviews has evolved, how you can change the mindset around them, and how to use them to your advantage to improve your organization's processes.

When I started in the industry, incident reviews tended to be what I would describe as a blame game. A group of people would sit in a room, put an engineer in the hot seat, and interrogate that person to find the root cause of an incident and to decide on an action plan to make sure it never happened again. Such a review would often generate a long list of tasks aimed at preventing a repeat of the incident, but those tasks were rarely followed up on and rarely completed. Frankly, the whole process was frustrating for everyone involved, and I remember people becoming very apprehensive whenever a review was scheduled.

Over the past five years, a couple of major changes in thinking about incident reviews have occurred. The first is the concept of a blameless incident review. The intention of this shift is to create an environment that encourages candor and to treat the review as a learning opportunity for everyone. The idea is to ensure that the team members questioned about the incident don't feel shame or guilt for what happened, and instead critically and objectively analyze what caused the incident so that prevention strategies can be planned for the future.

I have a phrase I use a lot in this context: honoring and extracting reality. Often, leaders don't want to understand reality; they just want to see something fixed. But if the leaders in an organization are not encouraging people to surface what really happened, they definitely won't be able to extract it. This concept is directly tied to the Westrum model we discussed earlier in the course. In pathological cultures, failure leads to blame and scapegoating. In generative cultures, failure leads to inquiry. I believe whether an organization leans more pathological or more generative starts right at the top, with how leaders engage in an activity like an incident review. If a review is conducted in the spirit of learning and is blame free, teams will engage very differently than if you're seeking to blame, and perhaps even punish, someone.

For blameless incident reviews, it's really important to identify a good facilitator, someone who can encourage open exploration. Etsy, the e-commerce site dedicated to selling handmade and vintage items, has done a really great job of this. Check out Etsy's facilitator's guide, linked in this module, for a detailed description of how to conduct a learning debrief. In addition to investigating incidents, it is also extremely important to study successes.
This isn't considered a whole lot in our industry, but sometimes you can learn as much from a successful change as you can from a failed one. What worked? What can we learn and apply to other parts of the organization?

The second concept that has evolved in the last few years concerns root cause. It used to be common practice to do an RCA, or root cause analysis, in an attempt to find a single root cause for an incident. Often, techniques like the five whys would be used to drill down to that single root cause. The thinking on this has changed to recognize that in the complex systems that are the norm in our industry today, there really is no single root cause. Instead, there are usually multiple contributing factors. It is important to understand those factors, and it's a fallacy to think you can reduce them to a single root cause.

Another myth is that human error can be a root cause. Human error is never the root cause; something in the system broke down. Usually it's a process failure, and I've seen incidents happen because an engineer was overworked and made a change without fully understanding its impact on the rest of the system.

Let me share an example to illustrate why human error is not a root cause. At one of the places I worked, a widespread network outage occurred, and the first explanation of the root cause was human error. After we dug a little deeper, we found that the failure was due to another issue altogether. When talking with employees, we learned that the engineer who made the change had been working full workdays and, in addition, had been doing maintenance activities from midnight to 3:00 AM for four nights in a row. When the engineer tried to take a day off to recover, they realized they had forgotten to do some work for a critical project that needed to happen that day. The engineer came into the office to do the project work and booted up their laptop, which happened to be an Apple computer, and all of the previously open windows and applications relaunched, including multiple SSH sessions. The engineer copied, pasted, and applied a configuration to what they thought was a non-production router; in reality, the configuration went to a production router.

After conducting some blameless inquiry, we realized it was not human error after all. It was a management failure. The fact that this engineer was working too much and getting little to no sleep for days in a row was the result of poor capacity planning and prioritization by the leaders in the organization. Instead of blaming the engineer, the focus turned to questions like: How could we move deployments to the daytime, during normal work hours? How could we treat infrastructure as code and automate configuration changes? And how could we build better leading indicators to help balance workload before it becomes too much? I'll share a small sketch of what that kind of configuration guardrail might look like at the end of this lesson.

In summary, the industry has evolved quite a bit with regard to incident management. Traditionally, organizations focused on finding a single root cause, and human error was often used as an incident classification. Now, organizations are recognizing that in complex systems there will never be a single root cause of an incident; there are multiple contributing factors. When an incident occurs, high-performing organizations use it as an opportunity to learn, not as an opportunity to blame. Human error is never a root cause, so instead focus on what broke down in the system.
It could be a gap in automation or an opportunity to reduce burden on an individual and/or team.
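To make the automation point a little more concrete, here is a minimal sketch of the kind of guardrail a team might put in place once configuration changes are treated as code rather than as commands typed into whichever SSH session happens to be open. This is not from the incident I described; the hostnames, the inventory, and the apply step are all hypothetical stand-ins, and the sketch assumes a simple Python wrapper around whatever tool actually pushes the configuration.

import sys

# Hypothetical inventory of devices that are safe to change by hand.
# In a real infrastructure-as-code setup this would live in version
# control alongside the configurations themselves.
NON_PRODUCTION_ROUTERS = {
    "rtr-lab-01.example.net",
    "rtr-staging-01.example.net",
}

def apply_config(target_host, config_path):
    # Default deny: refuse anything not explicitly listed as non-production.
    if target_host not in NON_PRODUCTION_ROUTERS:
        sys.exit(
            f"ABORT: {target_host} is not in the non-production inventory. "
            "Production changes must go through the automated pipeline."
        )
    # Placeholder for the real push (a config-management tool or device API call).
    print(f"Applying {config_path} to {target_host} ...")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("Usage: apply_config.py <router-hostname> <config-file>")
    apply_config(sys.argv[1], sys.argv[2])

The design choice worth noting is the default deny: the wrapper refuses any target that is not explicitly listed as non-production, so a tired engineer with the wrong terminal window open gets stopped by the tooling instead of by luck.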