Hello. What I want to talk about now is the challenge of fault detection, isolation, and recovery. FDIR has been the state of practice for many decades, really, initiated by NASA during the Apollo Moon landing program, so it has a history as long as or longer than rate monotonic theory, and it's been quite effective. There is research on newer methods to improve upon this state of practice, but first we want to understand what the state of practice is and how it can be used to assist us with building mission-critical, highly available, highly reliable systems. Okay, so FDIR is related to failure modes and effects analysis, FMEA. The idea is that when we design a system, we want to understand in advance how we think it may fail, and based on our analysis of how it might fail, we also want to understand the effect of a failure. What that means is that we want to understand the risk, or probability of occurrence, of a failure, the impact of any given failure scenario, and the cost to mitigate or fix it. And so through FMEA, that analysis process during design and during verification, as we build the design and test it, we want to fix the highest-risk, highest-impact, lowest-cost-to-fix issues first. That's pretty clear. Of course, this assumes that you can clearly identify all those failure modes, so that is the tricky part that requires rigorous testing and stress testing, highly accelerated lifetime testing, for hardware and software and the combined system, which is part of our quality assurance approach to software, hardware, and systems. So we'll talk about that as well, like how well we can actually do that, but it's pretty clear that we probably can't do it perfectly, right? One of our big problems in quality assurance is: when are we done testing? When have we tested enough? It's difficult to know. There are methods, but that remains a challenge. So, but assume we've done a rigorous, diligent job of failure modes and effects analysis.
We could then attempt to fix the highest-risk, highest-impact, lowest-cost-to-fix issues first and work our way up, probably addressing high-cost issues too if they are high risk and high impact. The fact that they're high cost doesn't mean we're not going to address them; we'll probably address all high-risk, high-impact issues, and the cost just means we'll fix the low-cost ones first because we can get success there. You might even say we want to flip that around and fix the high-cost ones first. Perhaps, so there are some controversies even about FMEA. Well, FDIR is kind of the response to FMEA. So once you've done your FMEA, now what you can say is, well, when there's a fault, one way to fix it is to detect it when it happens, and then isolate it, stop its impact, right? Basically, electrically isolate it, mechanically isolate it; if it's software, stop executing it as soon as possible, even on the very next clock cycle or instruction. So just isolate it by making sure that it has no possibility of impact when it's in this failure mode that we ideally have detected. So you can see there are a lot of challenges in FDIR. Isolation is definitely one of them, detection is also a very challenging problem, and recovery isn't easy either. So while the strategy makes a lot of sense, it's difficult to implement in practice, but it's the state of practice. There's research to improve FDIR, and alternative strategies being researched. One well-known alternative strategy is autonomic theory. This is bio-inspired by the human immune system and the autonomic nervous system, and it has a catchy acronym: self-CHOP. We want systems that are self-configuring, self-healing, self-optimizing, and self-protecting. In some ways it's unfortunate that this is really just a new strategy or aspiration more than a new solution. So like FDIR, the concept of being self-configuring, -healing, -optimizing, -protecting is fantastic, but how do you do it, right? That's always the question.
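The FMEA prioritization idea can be sketched in code. This is just an illustrative sketch, not from the lecture: the failure modes, ratings, and costs below are hypothetical, and the risk priority number (severity × occurrence × detectability) is the classic FMEA heuristic for ranking issues.

```python
# Sketch of FMEA-style prioritization. Ratings use the common 1-10 scales;
# all entries here are made-up examples.
failure_modes = [
    # (name, severity, occurrence, detectability, relative cost to fix)
    ("memory bit flip",     7, 6, 4, 2),
    ("sensor dropout",      9, 3, 5, 8),
    ("watchdog starvation", 8, 4, 3, 3),
]

def rpn(mode):
    """Risk priority number: severity x occurrence x detectability."""
    _, sev, occ, det, _ = mode
    return sev * occ * det

# Address highest-risk issues first; among equals, fix the cheapest first.
prioritized = sorted(failure_modes, key=lambda m: (-rpn(m), m[4]))

for name, sev, occ, det, cost in prioritized:
    print(f"{name}: RPN={sev * occ * det}, cost={cost}")
```

Whether cost should reorder the list (fix cheap wins first, or tackle expensive high-risk items first) is exactly the judgment call discussed above; the sort key makes that policy explicit and easy to change.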
So whether it's autonomic or FDIR, continued research on how to best detect, isolate, and recover, or configure, heal, optimize, and protect, is underway. Right now, part of the state of practice is that it's system-specific. So the engineers and computer scientists working on the system need to figure out how to best detect, isolate, and recover for a system that has a specific mission. So the big detection challenge is interesting, since that's what you start with. There's no way to recover if you can't detect that something has failed. Ideally we would have definitive detection, but very often there is no definitive detection; there is real potential for false positives and false negatives when we're trying to detect something. I will talk more about that. Isolation: can we isolate the failure and its impact safely and quickly? And then finally, recovery. Well, the most common recovery is to simply have redundancy, to duplicate components, subsystems, or whole systems. And when there's a fault detected in a system, we isolate it by essentially disabling interfaces, somehow quiescing or safing that component, subsystem, or whole system, and then switching operations over to the secondary system, which can be what's called a cold spare, warm spare, or hot spare. Ideally it would be a hot spare, so it's just ready to go; it's been able to operate and control our system the whole time, but we just haven't been letting it do so. A great example of this is pilot and copilot in commercial aviation, right? If anything happens to the pilot, the pilot has a heart attack, something goes wrong with the pilot, we've got the copilot right there. The copilot has redundant controls for the aircraft, and the copilot can fly the plane and complete the flight, or at least make a fail-safe landing at an alternate airport, right? So this strategy was used before computing, with human kinds of failure modes and recovery. So that's the state of practice.
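The detect-isolate-recover cycle with a hot spare can be sketched very simply. This is a minimal illustration, not a real FDIR implementation: the class and function names are hypothetical, and real systems would add health monitoring, interface quiescing, and state synchronization to the spare.

```python
class Component:
    """A component that can be marked failed and isolated."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.isolated = False

def fdir_step(primary, spare, fault_detected):
    """One FDIR cycle: on detection, isolate the primary and switch to the spare."""
    if fault_detected:
        primary.healthy = False
        primary.isolated = True   # isolation: disable interfaces, safe the unit
        return spare              # recovery: hand operations to the hot spare
    return primary                # no fault: primary stays active

primary = Component("primary")
spare = Component("hot-spare")
active = fdir_step(primary, spare, fault_detected=True)
```

The hot spare works here because it is assumed to be running and ready the whole time, like the copilot; a cold or warm spare would need a startup or state-restore step before it could take over.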
That's a quick overview of FMEA and FDIR, which could certainly be an entire class on its own. It's an open research area, but it has been applied with reasonable success to date in real-time embedded systems and mission-critical systems. And we'll talk more about recovery once we get through detection and isolation. We'll just cover the basic theory here and make sure that you understand the state of practice. But like I said, if we really wanted to dig into it, that could be a whole course on its own. So, the general theory of detection and correction. We've looked at a couple of examples. We looked at SECDED for bits at rest, stored bits. And it really is a perfect detector and corrector for single-bit errors, as we've seen with the Hamming code. And it's a perfect detector for a double-bit error, right? It can't correct it, but it can still perfectly detect a double-bit error. But it can't perfectly detect a triple, quadruple, or higher multi-bit error. It's not reliable; we don't really know what's going to happen. In our scenario we would have to test all 8,100 additional cases and make sure there's no possibility for confusion, no possibility for a false positive, and no possibility for a false negative where somehow there would be a triple-, quadruple-, or five-bit error and we wouldn't detect it, right? And we decided with SECDED that it wasn't worth bothering, because the probability of a triple-bit error is so low, by FMEA, right? So we were practicing FMEA there; I didn't really mention it, but what we're saying is that it's low risk. In other words, it's very unlikely; the impact is still very high, and the cost to fix is probably high as well, right? We would have to have some sort of more advanced code beyond SECDED, and that might require research, or at least, even if we have the method ready to go, much more complicated logic than the SECDED we looked at. So, SECDED has other limitations. It's based on bits at rest, bits that aren't changing.
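The SECDED behavior described above, perfect single-bit correction and perfect double-bit detection, can be demonstrated concretely. This is a small sketch for a 4-bit data word using Hamming(7,4) plus an overall parity bit; it is illustrative, not necessarily the exact code from the earlier lecture.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit SECDED word: Hamming(7,4) + overall parity."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    cw = [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7
    overall = 0
    for b in cw:
        overall ^= b
    return cw + [overall]               # 8th bit: parity over positions 1..7

def decode(cw8):
    """Return (status, data4): 'ok', 'corrected' (single-bit), or 'double' (detect only)."""
    c = list(cw8[:7])
    syndrome = 0
    for pos in range(1, 8):
        if c[pos - 1]:
            syndrome ^= pos             # XOR of 1-bit positions gives the error position
    overall = 0
    for b in cw8:
        overall ^= b                    # 1 iff an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        if syndrome:                    # single-bit error in positions 1..7: flip it back
            c[syndrome - 1] ^= 1
        status = "corrected"            # syndrome == 0 means the parity bit itself flipped
    else:
        status = "double"               # even number of flips, nonzero syndrome: uncorrectable
    return status, [c[2], c[4], c[5], c[6]]
```

A single flipped bit yields an odd overall parity and a syndrome that points at the flipped position; two flipped bits leave the overall parity even but a nonzero syndrome, so the error is detected but cannot be located. Three or more flips can alias onto those cases, which is exactly the "not reliable beyond double errors" limitation discussed above.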
They may be changing, but they only change through the read and write mechanism. They don't change while they're sitting there, unless there's an upset event, and that makes detection easier. So in general, the upsets are occasional and distributed, and that's a big part of why SECDED is successful. We also looked at RAID for large-scale storage of blocks and device failures, XOR and mirroring. And in fact, if you want to go deeper on this, you can look at generalized erasure codes, Reed-Solomon and other so-called flat XOR codes, which can actually correct double faults and triple faults. The one thing we ignored in RAID that's interesting is we assumed we can detect the errors. That may be true if, say, a disk storage device is actually yanked out of a system, right? We would notice it missing; there would be an interface error in the driver. But what if it was something more subtle? One of the concepts in storage that is somewhat controversial, or where it's not clear how often it happens or whether we should worry about it, is what's called bit rot: the idea that bits can essentially change over time for a number of reasons. Like in flash, the concept of read disturb, where continually reading bits in flash may eventually cause a bit to flip. Electromagnetic interference and compatibility issues, as well as radiation and environmental scenarios, could also cause data corruption. Now, we covered that in working memory, but how about in storage, right? Well, there is a solution for that. We could use a data digest such as MD5 that would detect corruption. And we can certainly also detect device failures through the interface, like Small Computer System Interface errors. But the question is: is that perfect detection? Well, data digests are pretty darn good. It's hard to fool a data digest, to get a correct digest with the wrong data, but it's not perfect. So we're seeing a theme here.
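The data-digest idea can be sketched with Python's standard hashlib. The scheme is simple: store a digest alongside each block when it is written, then recompute and compare on read. MD5 is adequate for detecting accidental corruption like bit rot (it is not collision-resistant enough for security use); the block contents below are made up.

```python
import hashlib

def store(data: bytes):
    """Write path: keep a digest alongside the block (separately, in practice)."""
    return data, hashlib.md5(data).hexdigest()

def verify(data: bytes, digest: str) -> bool:
    """Read path: recompute the digest and compare to detect silent corruption."""
    return hashlib.md5(data).hexdigest() == digest

block, digest = store(b"sensor log 0042")
ok = verify(block, digest)                 # True: data unchanged
rotted = b"sensor log 0942"                # one character silently corrupted
caught = not verify(rotted, digest)        # True: corruption detected
```

This is still imperfect detection in principle, since two different blocks can in theory share a digest, but for random bit rot the probability of a 128-bit digest matching corrupted data is vanishingly small, which is the "pretty darn good but not perfect" point above.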
All detection has the potential to be imperfect at some point, right? We also didn't cover all the segments of a system where we might have errors. Another important segment is networking, or wireless: bits in flight. That typically is handled with a Reed-Solomon code. It's a general erasure code detector for up to n bits erased or lost while they're in flight, and it's better for bit error rate control on transmission channels, RF channels, disk read/write channels, etc. Reed-Solomon can be used for RAID; it is one of the possible methods for RAID-6, but it's fairly complex, and there are simpler methods like row-diagonal parity, EVENODD, and flat XOR that can work instead. These all fall under the category of generalized erasure codes. So in general, you can see these are all very specific to what it is we're trying to protect with detection, isolation, and recovery. We get into an in-depth conversation about each device, each component, each subsystem when we start working on FDIR. That's why it's actually hard to teach as a general subject, but there are high-level theories. We know there are perfect and imperfect detectors. Imperfect detectors we understand based on probability; perfect detectors are based on logic and math, something that's provably correct. Really, the genesis of much of this theory started with radar during World War II. When radar was first invented, you could, of course, detect enemy aircraft, but you might also mistake a bird for an enemy aircraft. [LAUGH] And this may or may not be a significant problem. If you wind up shooting down a bird, well, that's probably not a complete tragedy, other than for the bird. [LAUGH] But you did waste ammunition, right? The enemy may be coming later, and if you're out of ammunition because you wasted it on that poor bird, that could be a problem. So that's called the false positive problem. Worse than that with radar is a false negative.
The enemy is coming and you just never see them, and they actually destroy the radar station, right? So this is well-known theory in radar that's been around forever, which we'll quantify here in a moment. In the imperfect scenario, the truth model is really only known either by physics or by human review. If we have a human look more closely at the data, maybe they can correctly say, yeah, that really was an aircraft, even though it didn't look like it to the radar. And when we say it didn't look like it to the radar, what we mean is it didn't meet the threshold criteria for being classified as an aircraft, right? And so maybe we adjust that threshold, or maybe the algorithm used to determine whether a returned signal from the radar is an aircraft is flawed somehow, right? So if a human looks at the data, they could say, no, that actually should have been detected as an aircraft, or no, it was actually just a bird or a bug, okay? The perfect concept has to be based in physics, logic, and math, something that's provably correct. Like we did with the single-bit error: we showed that in a set of bits, we can always detect the error, and there's no confusion in that detection. So I think that's why I started with that; we'll work up to these imperfect detectors, which are more complicated to understand. So let's look at an imperfect detector. This is based on some research that I'm engaged in with universities I'm associated with, the University of Colorado and Embry-Riddle Aeronautical University. We've been working with radar and electro-optical and infrared detection systems to detect and track aerial objects for the purposes of urban air mobility, unmanned aerial system traffic management, and safety. So when you see an object like this one here, you could ask: what is that? This is seen with a long-wave infrared camera, and it's basically an object in the air over the campus at Embry-Riddle, and the question is: is that an aircraft?
Is that a drone? Is that someone's kite that got away from them? Is it a meteor? Is it a bug? And it turns out it's a bug. How did we figure this out? Well, we went up and looked at what was going on, and we saw bugs in front of the camera. By the time of day, in Arizona, the bugs come out in the morning as the temperature changes, and they fly right in front of the lens. And by the way, bugs are a problem with radar as well as electro-optical. Once a human learns how to recognize the bugs, you have to basically do frame-by-frame analysis and look at it and say, no, that's not a drone, that's not a plane, that's a bug. The only other alternative you have is some sort of physical modeling. Certainly the other thing we've done is we've purposely flown drones. We knew where they were by experimental design and by the fact that we had GPS tracking and things like that, and then we could see if we could detect them. So we would know that, yes, this should be in the field of view; you should be able to see it now, based on where it is and where the camera is pointing. And if you're not seeing it, then that's a false negative, right? Certainly the bugs are not part of our experimental design; they're an annoyance. So doing this research, I and my students had to go through and review data frame by frame to get a human truth. What we would normally do is do this independently, with a program called AutoIt that the students came up with, and as long as two, three, or more humans are in 99% agreement, we would say that that is the best known truth for bugs, for things that weren't part of the experiment. For the actual drones, we know when they should have been in the field of view and when they shouldn't, so we can use simulation and modeling to determine truth there. But either way, the bug, the uncontrolled part of our experiment, is a potential false positive.
Unless we can classify it accurately as something else, automatically or through human review. So machine detection in general is imperfect, but it can be effective. What do I mean by effective? Well, you can create algorithms and work on tuning these systems so you have a high true positive rate for a low false positive rate. That's an effective system. It's imperfect, but in other words it's 99.999% correct, or something like that, right? We have some metric where we can say you can trust this thing just like a human sentinel. If we put someone on the roof, a person up there, it's at least as trustworthy as that person, or maybe better, because it doesn't get tired, it doesn't get distracted, right? So the key is, in detection theory, it's all about the true positive and true negative rates; those should be high. The system should have a good ability to determine the truth for whatever it is you're trying to detect, right, while excluding things that you don't want to detect. Now, you could detect them and classify them as things that aren't of interest; that's also a strategy. But either way, we have to worry about our false positives. False positives are a concern, but they're not critical issues. False positives load the system; they're an annoyance, or a cost of detection, which is usually the way you think of it in FDIR. The test indicates you may have a problem, so further analysis is required. This would be similar to, say, going in for a medical test because you have symptoms and you're concerned about your health, or it's just a periodic checkup, and the test indicates that you may have a problem: I'm sorry to report that it looks like you may have cancerous cells, right? That would obviously be worrisome; it would cause you to have follow-up visits to your doctor, more cost and expense for that false positive. So we don't want a high false positive rate; we need a low false positive rate, right?
But we want, obviously, a high true positive rate, and of course the complement of the true positive rate is the true negative rate. Normally we're going to generate some false positive rate as a side effect of having a high true positive rate, and that, as we'll see, is captured by what's called the receiver operating characteristic, which comes from radar. There are other, newer measures like precision, recall, and F-measure that I won't talk about here; that would be a topic for a class dedicated to these areas of development and research. So more dangerous, much more dangerous, is a false negative: the test indicates you're healthy, but you really have an illness. You go in for a cancer screening and the doctor tells you, you're fine, you're healthy, we'll see you in a year, and you walk out the door with cancer cells in your body. You may not live to see the follow-up test in a year, right? So that would be much more dangerous. Or, in the scenario of radar, the enemy aircraft is approaching the radar station, you haven't seen it, and it destroys your radar station. So for imperfect detectors, the tradeoff is most often threshold-based, but I would say also algorithm-based, right? What is the algorithm for detection based on the sensor data, and given the algorithm, when do you determine that this really is something of potential interest? And then probably the third thing is classification: we detected it, but is it an object of interest? That concept is what's called salient object detection, and that's more of a machine learning and AI problem rather than simple detection. So you can see this can get very involved, but let's just assume that in our problem domain we've got some great ideas for algorithms and how to set thresholds and so forth for our imperfect detectors. Or, if we have perfect detectors, we don't even really have to have this discussion, right?
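The rates discussed above come straight from confusion-matrix counts: TPR = TP/(TP+FN) and FPR = FP/(FP+TN). A minimal sketch, with made-up prediction and ground-truth lists:

```python
def rates(predictions, truth):
    """Compute (TPR, FPR) from parallel boolean lists of detections and ground truth."""
    tp = sum(p and t for p, t in zip(predictions, truth))          # hits
    fp = sum(p and not t for p, t in zip(predictions, truth))      # false alarms
    fn = sum((not p) and t for p, t in zip(predictions, truth))    # misses
    tn = sum((not p) and (not t) for p, t in zip(predictions, truth))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hypothetical example: 3 real targets, 5 non-targets.
truth = [True, True, True, False, False, False, False, False]
preds = [True, True, False, True, False, False, False, False]
tpr, fpr = rates(preds, truth)   # one miss (a false negative), one false alarm
```

In the example, the detector catches 2 of 3 real targets (TPR = 2/3) and raises one false alarm out of 5 non-targets (FPR = 1/5); the one miss is the dangerous false negative.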
So we've got good performance: a high true positive rate with a low false positive rate. And there's always a tradeoff, so what we're saying is there's going to be a curve where, as we get a higher and higher true positive rate, that's going to come at the cost of false positives, right? The false positive rate is going to go up toward the right as we increase the true positive rate, and we're going to have different locations on this curve. We can decide where we want to be on the curve by tuning our system, in terms of the algorithm used and the thresholds applied. And we can say, well, this is acceptable because we're only going to tell 10% of our patients that they may have cancer when in fact they don't, and we're going to get a reasonable detection accuracy, maybe 98%, so only 2% are going to be sent away and told they're healthy when they're not. Maybe that's unacceptable; maybe we've got to get a lot closer to 100%, say 99.9%. This is an engineering judgment call, but we're going to have a much higher false positive rate, and we're going to have more follow-up visits and retesting and things like that, right? So, okay, a perfect detector may be too costly or impossible, right? We can certainly work on that; this is R&D, how to build a better detector, and it might involve machine learning, AI, statistics, mathematics, and so on. But when we're just stuck with whatever we've got, we need to understand how well it performs in terms of the receiver operating characteristic. I say curve, but it's technically called the characteristic. So we plot the false positive rate, the FPR, against the true positive rate: the FPR is down on the x-axis and the true positive rate, TPR, is on the y-axis.
And then that tells us how we're doing, so we can compare two different algorithms, or we can also adjust the threshold. Normally you get this curve by adjusting thresholds or parameters in the algorithm. In other words, this is a complicated thing to get data for, in that each data point on the curve is a different setting for your detector, and so it can be laborious just to get this curve. Comparing different curves would normally be for different algorithms or different detection methods, okay? A false positive can be viewed as a false alarm, and the true positive rate is basically accurate detection. So, what does this look like? We've done this for one of our aircraft detection systems in our research on drone detection. We classify aerial objects into aircraft, drone, and then we have another category called B, which is biological: bug, bat, bird. It turns out everything biological in the air seems to start with B. And then, of course, the fourth, implicit classification is nothing. Using that detection and classification method, here we're just looking at detection for a specific object of interest: aircraft. If we're just guessing, we're going to be somewhere on a diagonal line, right? If we just kind of guess there's something there and it must be an aircraft, because anything that moves might be an aircraft in the air, right? If we take that uninformed approach, typically I can get 100% accuracy by just saying anything that moves in the air is an aircraft. But at some point I might have close to 100% false positives too, right? Because I'm just going to start calling everything an aircraft. Let's say there's no aircraft that day in the air, but there's plenty of other things in the air, bugs and bats and birds, and I just call all of them aircraft. And then there's one plane, right? So I get that one plane, so my true positive rate is 100%.
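The curve-gathering process described above can be sketched by sweeping a detection threshold: each threshold setting yields one (FPR, TPR) point, which is exactly why collecting ROC data is laborious. A sketch with hypothetical detector scores (higher score meaning "more likely an aircraft"):

```python
def roc_points(scores, truth):
    """Trace an ROC curve: one (FPR, TPR) point per distinct threshold setting."""
    pos = sum(truth)
    neg = len(truth) - pos
    points = []
    # Sweep thresholds from strictest (detect nothing) to loosest (detect everything).
    thresholds = sorted(set(scores), reverse=True) + [min(scores) - 1]
    for th in thresholds:
        preds = [s > th for s in scores]
        tp = sum(p and t for p, t in zip(preds, truth))
        fp = sum(p and not t for p, t in zip(preds, truth))
        points.append((fp / neg, tp / pos))
    return points

# Made-up scores and ground truth for eight aerial objects (True = aircraft).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truth  = [True, True, False, True, False, False, True, False]
curve = roc_points(scores, truth)   # runs from (0, 0) up to (1, 1)
```

Lowering the threshold can only add detections, so both rates are non-decreasing along the sweep; a detector whose curve rises well above the diagonal "random line" toward the upper left is the better one.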
But it's going to be just under 100% for the false positive rate, right? Because the 99 things that day I called an aircraft weren't, so that would be a 99% false positive rate, and the one thing that actually was an aircraft I correctly identified as an aircraft. So in the limit it becomes this diagonal line, right? And that is guessing, or really even beyond guessing, just calling everything what we were looking for. [LAUGH] The idea that anything that moves must be an aircraft. So we need to be above this diagonal, what I call the random line, which is basically just saying everything I see must be what I'm looking for. And so we're going to be above that, and we see a tradeoff. So this is actual data from our research, and we're doing better than the diagonal line. And we might have another competing detector, and let's say it looks more like this. Well, then it's not as good, right? It would be better to have a detector out here, in fact, that gets close to 100% sooner, right? So we really want things to move in this direction on a receiver operating characteristic. And it's that simple, so we have a way of quantifying this. And the interesting thing, from the FDIR perspective, is that the more false alarms we have, the more we're going to have to isolate and recover unnecessarily, right? And this can be a problem because it might mean a brief service interruption during recovery. We're certainly going to have to expend computing resources to quiesce and isolate systems and things like that, and there is potentially some risk from trying to recover when recovery is not necessary. On the other hand, the worst scenario is a false negative, where we have total system failure because we didn't detect the fault. Okay, so that's it for FDIR. That's a pretty brief and rapid introduction, but it gives you the basic ideas. Thank you very much.