Making sure that work is humming along within your organization and is being done effectively is very important. In this lesson, we're going to focus on how monitoring is tackled in the DevOps world. After this lesson, you should be able to explain why incident dashboards aren't always helpful, discuss the importance of transitioning to proactive monitoring, and identify business metrics, such as Mean Time To Detect, that can assist in monitoring.

As we begin our discussion about monitoring with DevOps, I'd first like to talk about the evolution of monitoring in our industry. When I started, the focus of monitoring was primarily on system and infrastructure metrics. Most of the monitoring we did was related to the availability and performance of our services, which usually translated to, "Is the server up?" Because of this, performance was measured very inconsistently. I started my career in infrastructure, so a lot of the performance measurement we looked at was about network latency and round-trip time. We used lots of dashboards and alerts to help us keep an eye on things. The problem with this approach was that the dashboards were mostly only used during incidents. In other words, they were reactionary. They couldn't give us a heads-up when we needed to address something to keep a service from going down; they only told us when something was down and needed fixing. They didn't allow us to be very proactive, and the alerts were noisy, meaning that the alerts sent to operations teams were often false alarms. So over time, the alerts would be ignored. For those of you familiar with the story of the boy who cried wolf, that's essentially what we had going on here.

As I learned more about monitoring early in my career, it became clear that this type of monitoring was no longer sufficient. As many of you know, just because the server is up does not mean the application is available. If we're sending alerts that require no action, why are we sending the alerts at all? If no one is looking at the dashboards, how do we make them more useful and proactive? Also, even though our infrastructure was stable, we continued to have unstable applications. Clearly, none of this was a recipe for long-term success. So, we started looking at layer seven monitoring, which gave us better overall visibility into the application's health. We added that to our dashboards and alerts, and even started sending alerts to the application teams. Guess what? The same thing happened: reactive use of the dashboards, and alerts that weren't actionable. We realized that we needed to evolve how we were monitoring.

While I was at Nordstrom, we started to transition to looking at business metrics as a way of monitoring instead. For example, if we were used to seeing 100 transactions per minute and we saw that dip below a certain threshold, then we would send an alert to the appropriate team to investigate. We also transitioned to tracking things like Mean Time To Detect, also known as MTTD, especially in our physical stores. Often, the way we found out about an issue with a point-of-sale system was when a store employee called the help desk, and sometimes it could take up to 30 minutes for that call to get routed to the appropriate engineering team. With a focus on MTTD, we could find out in seconds, with the alert automated and system-generated, instead of requiring someone to call the help desk.
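To make the business-metric idea concrete, here is a minimal sketch, assuming a hypothetical metrics source and alerting callback (get_transactions_per_minute and send_alert are placeholders, not any specific tool's API): it checks transactions per minute against a baseline-derived threshold, and it computes MTTD from incident timestamps.

```python
from datetime import datetime

# Hypothetical baseline: we normally see ~100 transactions per minute.
BASELINE_TPM = 100
ALERT_THRESHOLD = 0.5 * BASELINE_TPM  # alert if volume drops below 50% of normal


def check_transaction_volume(get_transactions_per_minute, send_alert):
    """Fire an alert when the business metric dips below the threshold."""
    current_tpm = get_transactions_per_minute()  # placeholder metric source
    if current_tpm < ALERT_THRESHOLD:
        send_alert(
            team="checkout",
            message=f"Transactions per minute dropped to {current_tpm} "
                    f"(baseline ~{BASELINE_TPM}). Please investigate.",
        )


def mean_time_to_detect(incidents):
    """MTTD: average seconds between when an issue started and when it was detected."""
    deltas = [
        (i["detected_at"] - i["started_at"]).total_seconds()
        for i in incidents
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0


# Example: an automated, system-generated alert detects a point-of-sale issue
# in seconds instead of waiting for a help desk call.
incidents = [{"started_at": datetime(2019, 5, 1, 9, 0, 0),
              "detected_at": datetime(2019, 5, 1, 9, 0, 20)}]
print(mean_time_to_detect(incidents))  # 20.0
```

The key shift is that the trigger is a business signal (transaction volume) rather than a server-level check, so the alert is actionable for the team that owns the customer experience.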
As described in the book Accelerate, high-performing organizations make monitoring a priority. Like the organizations cited in the book, you should keep refining your infrastructure and application monitoring, make sure you're collecting information on the right services, and put that information to good use. The visibility and transparency yielded by effective monitoring are invaluable. Proactive monitoring was strongly related to performance and job satisfaction in the most recent State of DevOps report, and it is a key part of a strong technical foundation.

Observability is also becoming the industry norm for monitoring. In ordinary English, observability means that you have the instrumentation you need to understand what's happening within your software. Observability focuses on the development of the application and the rich instrumentation you need, not to poll it and monitor it against thresholds or defined health checks, but to ask any arbitrary question about how the software works. Charity Majors, the CEO of Honeycomb, defines observability in this manner. She is one of the thought leaders in this space, and she speaks a lot about the benefits of observability. In one interview, she talked in depth about it. She points out that the current best-practice approaches to developing software, such as microservices, containers, cloud native, schedulers, and serverless, all mean coping with massively more complex systems. However, our approach to monitoring has not kept pace. Majors argues that the health of the system as a whole no longer matters. We've entered an era where what actually matters is the health of each individual event, each individual user's experience, each shopping cart's experience, or other high-cardinality dimensions. Engineers are now talking about observability instead of monitoring, and about unknown unknowns instead of known unknowns.

Databases and networks were the last two priesthoods of systems specialists. They had their own special tooling, insider language, and specialists, and they didn't really belong to the engineering organization. That time is over. It will always be the engineer's responsibility to understand the operational ramifications and failure modes of what we're building. It's vital to auto-remediate the issues we can, fail gracefully where we can't, and shift as much operational load as humanly possible to the providers whose core competency it is. You don't want to attempt to monitor everything, because you can't. Engineers often waste so much time trying that they lose track of the critical path, and their important alerts drown in the noise. In the chaotic future we're all hurtling toward, you actually have to have the discipline to have radically fewer paging alerts, not more.

Elsewhere, Majors also talks about distributed systems, or any mature, complex application at scale built by good engineers, where the majority of your questions trend toward unknown unknowns. Debugging distributed systems typically means investigating seemingly impossible things that rarely happen. You can't predict them all, and you shouldn't even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll back via automated canaries, gradual rollouts, feature flags, et cetera. The same goes for large apps that have been in production a while. No good engineering team should be getting a sustained barrage of pages for problems they can immediately identify.
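To illustrate the kind of instrumentation Majors describes, here is a minimal sketch (not Honeycomb's SDK, just a generic illustration) of emitting one wide, structured event per request. The handler name, fields, and values are assumptions; the point is capturing high-cardinality dimensions like user and cart IDs so you can later ask arbitrary questions about individual experiences.

```python
import json
import time
import uuid


def handle_checkout(request):
    """Handle one request and emit a single wide, structured event describing it."""
    event = {
        "request_id": str(uuid.uuid4()),
        "endpoint": "/checkout",
        # High-cardinality fields let you slice by an individual user or cart later.
        "user_id": request["user_id"],
        "cart_id": request["cart_id"],
        "build_sha": "abc123",  # hypothetical: which build served this request
    }
    start = time.monotonic()
    try:
        # ... business logic would go here ...
        event["status"] = 200
        return {"ok": True}
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        # A real system would send this to an event pipeline; here we just print it.
        print(json.dumps(event))


# Example call with made-up IDs:
handle_checkout({"user_id": "user_42", "cart_id": "cart_9001"})
```

With events like these flowing, you can slice by any of those fields when you're chasing an unknown unknown, instead of being limited to predefined health checks.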
Coming back to paging: if you know how to fix something, you should fix it so it doesn't page you. Fix the bug, auto-remediate the problem, or just disable paging alerts during off hours and make the system resilient enough to wait until morning. In the end, the result is the same: engineering teams should mostly get paged only about novel, undiagnosed issues, which means that debugging the unknown unknowns is more and more critical. You can't predict what information you're going to need in order to answer a question you also couldn't predict, so you should gather as much context as possible, all the time. Any API request that enters your system can legitimately generate 50 to 100 events over its lifetime, so you'll need to sample heavily; a small sketch of one sampling approach follows at the end of this lesson. Majors' company, Honeycomb, provides some excellent documentation on sampling best practices; you might want to check that out.

So, as we wrap up, I just want to say that the importance of monitoring cannot be overstated, and for organizations to be successful in their monitoring, it's really important to transition to some form of proactive monitoring. Identifying and diagnosing unknown unknowns is critical. Looking at observable business metrics, as well as sampling and setting up automated, actionable alerts that provide earlier warnings before something really bad happens, is really the way to go.
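Picking up on the sampling point above, here is a minimal sketch of one common approach (not Honeycomb's documented method, and the field names and thresholds are assumptions): keep every error or unusually slow event, sample routine successes heavily, and record the sample rate so counts can be re-weighted later.

```python
import random

KEEP_ALL = 1            # keep every event of this kind
KEEP_ONE_IN_100 = 100   # keep roughly 1 out of every 100 events


def should_keep(event):
    """Always keep errors and very slow requests; sample routine successes heavily."""
    if event.get("status", 200) >= 500 or event.get("duration_ms", 0) > 2000:
        return True, KEEP_ALL
    return random.random() < 1 / KEEP_ONE_IN_100, KEEP_ONE_IN_100


def record(event, send):
    keep, sample_rate = should_keep(event)
    if keep:
        # Store the sample rate so downstream tools can re-weight counts correctly.
        event["sample_rate"] = sample_rate
        send(event)


# Example: a routine success is kept ~1% of the time; an error is always kept.
record({"status": 500, "duration_ms": 120}, send=print)
```

This keeps the interesting, rare events intact while cutting the volume of routine traffic, which is what makes gathering rich context on every request affordable.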