Revisiting Product Validation. So, here's a cartoon picture of a rack of your products P. P is your product, whatever it is, some device, maybe that hypothetical board I was showing you earlier. Most companies do this as part of the validation process. It's called RDT at most places where I've worked; other companies may call it something else. RDT stands for Reliability Demonstration Testing. You put a whole bunch of your products in a rack, you run them for a while, and you put the product through its paces. There's quite a bit of work that goes into crafting test plans for RDT. There are always things that need to be tested: you've got to test all of the requirements. Typically, you start with your requirements document and write a series of tests that will validate every single requirement. I've shown 16 units here, but the rack could have 400 to 500 units in it, even a couple of thousand. You plug them all in, turn the thing on, hook it up to a PC that monitors the testing, and then you stand back and let it run for hours or days, maybe even weeks sometimes. You want to accumulate so many hours on there. Hopefully, when you get to the end of whatever that reliability demonstration testing period is, you don't have any failures. You get to the end of that and say, "All right, looks like we've done a pretty good job. Mechanical's good, electrical's good, firmware's good. We've tested everything, ready to go and start ramping production on this product." So, shipping in volume. But partway through your testing, this particular product right here fails. Now bear in mind, they're all running the same tests in parallel, pretty much in step. Not clock-for-clock lockstep, but pretty close, because this PC says "start test one, start test one, start test one" to each unit in turn. So, they're skewed in time just a little bit, but they're close in time.
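The orchestration that monitoring PC is doing can be sketched roughly like this. Everything here (function names, the unit and test counts, the pass/fail convention) is hypothetical, just to show why the units run nearly in step rather than clock-for-clock:

```c
#include <stdio.h>

#define NUM_UNITS 16   /* a real rack might hold 400-500 units, or a couple thousand */
#define NUM_TESTS 3    /* the real plan covers every requirement in the requirements doc */

/* Stand-in for "send 'start test t' to unit u, wait for pass/fail".
 * In a real harness this would go over USB/serial to the rack. */
int run_test_on_unit(int unit, int test_id)
{
    (void)unit;
    (void)test_id;
    return 0;   /* 0 = pass */
}

/* The monitor PC walks the test plan, starting each test on every unit
 * in turn. That's why units run nearly in step: skewed slightly in time,
 * but all on the same test at roughly the same moment. */
int run_rdt_pass(int *failures)
{
    *failures = 0;
    for (int t = 1; t <= NUM_TESTS; t++)      /* "start test one, ..." */
        for (int u = 0; u < NUM_UNITS; u++)   /* ...sent to each unit in order */
            if (run_test_on_unit(u, t) != 0)
                (*failures)++;
    return *failures == 0;                    /* 1 = clean RDT pass */
}
```

Because the inner loop starts test t on unit 0, then unit 1, and so on, every unit runs the same test at roughly, but not exactly, the same time.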
So, this unit here reports a failure. You check the log pages, and the log page doesn't explain the failure. The processor fetched an illegal instruction, or the code that's executing just doesn't make any sense at all. You look at the program counter and you go, how in the world did it get there? It should never have gotten there. It was running such-and-such a test, and that's recorded in a log page someplace, and you have no idea how the PC got to this memory location. This happens a lot. This is a question firmware engineers ask often during product validation in RDT: something happened, what was it, how did it get there? If you don't have the appropriate infrastructure in place, it is nearly impossible to figure out. Firmware engineers who don't have a properly built debug infrastructure will spend hours and hours and hours trying to figure out how the program counter got to a certain place, as just one example of many. So, you have some choices. Suppose you're going to restrict your analysis to in-form-factor products; I mentioned that earlier. I'll use this iClicker box here: this is in form factor. You can take the case off, and there'll be a printed circuit board in there, right? Even the printed circuit board has a certain size; that would be an in-form-factor product. An out-of-form-factor product is a bigger printed circuit board that has lots of test points around it. It's functionally equivalent, but it lets you hook up logic analyzers, scopes, whatever test equipment, and access debug ports that aren't reachable inside the case. That's called an out-of-form-factor product. So, restricting our analysis to the in-form-factor products in our RDT rack, one option is to create a new firmware image that has asserts coded into it. Raise your hand if you've heard the term firmware assert. Okay, a couple of you have. All right, good. Asserts: think of them like an ifdef.
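That "recorded in a log page someplace" infrastructure is what makes the mystery solvable at all. As a hedged sketch (the struct layout, field names, and magic value are all made up for illustration), a fault handler might capture something like this:

```c
#include <stdint.h>

/* Hypothetical log-page record: what a fault handler might capture so
 * post-mortem analysis can see where the PC was and which test was
 * running. Field names and the magic value are invented for this sketch. */
typedef struct {
    uint32_t magic;        /* marks a valid record */
    uint32_t fault_pc;     /* program counter at the fault */
    uint32_t fault_lr;     /* return address: who called the faulting code */
    uint16_t test_id;      /* which RDT test was running */
    uint16_t fault_type;   /* illegal instruction, bus error, ... */
} fault_record_t;

#define LOG_MAGIC 0xDEADFA11u

fault_record_t g_log;      /* would live in noinit RAM or a flash log page */

/* Called from the (hypothetical) exception vector with the stacked registers. */
void record_fault(uint32_t pc, uint32_t lr, uint16_t test_id, uint16_t type)
{
    g_log.magic      = LOG_MAGIC;
    g_log.fault_pc   = pc;
    g_log.fault_lr   = lr;
    g_log.test_id    = test_id;
    g_log.fault_type = type;
    /* real firmware: flush this to a log page the monitoring PC can fetch */
}

int log_is_valid(void) { return g_log.magic == LOG_MAGIC; }
```

With a record like this persisted, "how did the PC get to this memory location?" at least starts from evidence (the faulting PC and its caller) instead of nothing.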
There's some header file someplace that defines, say, the word ASSERT. Most of the time that line is commented out. Down in the body of your code, strewn throughout all the code, are #ifdef ASSERT blocks. Remember my syntax example; we're not in assembly language here. It's #ifdef, I think, isn't it? Right. Yeah. [inaudible]. Yeah. So, somewhere in some header file, some .h file that gets included into all this stuff, there's that define, and most of the time it's commented out. Inside the #ifdef ASSERT, you have a bunch of diagnostic code. This code does additional error checking during runtime, in system. So, you turn asserts on. Some projects have varying levels of asserts: assert one, assert two, assert three, or some numbering system. The bigger the number, the more assert code gets compiled into the executable to try and figure out what's going on. So, you create this new firmware image that has asserts coded into it, and you've got this extra error checking that will now execute in the in-form-factor product, code that would not normally be included in a production distribution. You're turning it on to try to help you figure out what in the world is going on, because you've got a really hard problem you're trying to solve. Your boss keeps calling you every hour: "You got it figured out yet? You got it figured out yet?" "No, leave me alone. Let me figure it out." Drinking Mountain Dew, eating Twinkies, staying up 24/7. So, that's the notion: you add this code in to help you figure it out. The downside is that the presence of the asserts changes the timing of the execution. Because you've changed the timing with all this assert code that's trying to help you see what's happening, you may no longer see the failure at all.
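Here's a minimal sketch of what a leveled, #ifdef-style assert scheme could look like in C. The macro names, the ASSERT_LEVEL knob, and the queue example are all assumptions for illustration, not any particular company's implementation:

```c
#include <stdio.h>

/* Hypothetical leveled-assert scheme. ASSERT_LEVEL is the "mostly
 * commented out" define: undefined in production builds, set here
 * for the RDT investigation build. Higher levels compile in more
 * checking (assert one, assert two, ...). */
#define ASSERT_LEVEL 2

int g_assert_hits;   /* counts failures so this sketch is observable */

void assert_failed(const char *file, int line)
{
    /* real firmware might write a log page and halt; here we just report */
    g_assert_hits++;
    printf("assert failed at %s:%d\n", file, line);
}

#if defined(ASSERT_LEVEL) && ASSERT_LEVEL >= 1
#define ASSERT1(cond) do { if (!(cond)) assert_failed(__FILE__, __LINE__); } while (0)
#else
#define ASSERT1(cond) ((void)0)   /* compiles away: no code, no timing change */
#endif

#if defined(ASSERT_LEVEL) && ASSERT_LEVEL >= 2
#define ASSERT2(cond) ASSERT1(cond)   /* heavier invariant checks, debug only */
#else
#define ASSERT2(cond) ((void)0)
#endif

/* Example: a queue pop with a cheap level-1 check and a level-2 invariant. */
int queue_pop(int *queue, int *count)
{
    ASSERT1(queue != 0);   /* null check: cheap enough for level 1 */
    ASSERT2(*count > 0);   /* invariant check, enabled only at level 2+ */
    return *count > 0 ? queue[--*count] : -1;
}
```

In a production build you'd leave ASSERT_LEVEL undefined, and ASSERT1/ASSERT2 compile away to nothing. That also shows the trade-off: enabling them adds real instructions, which is exactly how asserts change execution timing.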
So, that approach might work or might not work. It's one approach. Alternatively, you can go back to the fully instrumented system with the production firmware that failed, and try to rerun those same tests highly instrumented: logic analyzers hooked up, the debugger and trace boards, the UART, JTAG, all of that. You rerun the tests with the production firmware that failed in the RDT system and see if you can get that unit to repeat its failure. Now you've got a lot more eyes on it; you've got a much higher level of visibility. With your debugger and logic analyzer, hopefully you will figure out how the code got to where it was. In the firmware we worked on at my company, there's a very, very small amount of assert code that ships in production. Performance is such a big deal, such an important metric to our customers, that it has to be very small and very fast. But there is a little bit of assert code in there to handle the panic cases: you get a divide-by-zero error, or an illegal instruction exception, or something went wonky and you have no idea how you got there. There's a little bit of code to catch that and try to do no harm. The last thing you want to do is destroy more by, say, resetting. You might think, "Well, I'll just reset myself and then everything will be good again," right? That may seem like a reasonable approach, but by resetting a storage device you might lose user data. It can have catastrophic effects and result in unrecoverable data loss. So, resetting may not be a good idea. You may just want to have the CPU jump to itself and wait for the technician to come along and pull the device out of service. That might be a perfectly valid approach: don't do anything.
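That "do no harm" panic policy can be sketched like this. The fault code, function names, and the divide-by-zero guard are hypothetical; a real handler would spin in place (the CPU jumping to itself) rather than return, which this host-runnable sketch avoids so it can terminate:

```c
#include <stdint.h>

typedef enum { ACT_HALT, ACT_RESET } panic_action_t;

volatile uint32_t g_panic_code;   /* hypothetical noinit word a debugger or log page can read */

/* Policy for an unexplained fault (illegal instruction, divide by zero,
 * "how did the PC get here?"). Resetting looks tempting, but on a storage
 * device a reset mid-operation can mean unrecoverable data loss, so the
 * conservative choice is to halt and wait to be pulled from service. */
panic_action_t on_panic(uint32_t code)
{
    g_panic_code = code;       /* leave evidence before doing anything else */
    return ACT_HALT;           /* do no harm: real firmware would now jump to itself */
}

/* Example guarded operation: catch divide by zero before the CPU traps. */
int32_t safe_div(int32_t num, int32_t den, int *panicked)
{
    if (den == 0) {
        on_panic(0xDEAD0D1Eu); /* hypothetical fault code */
        *panicked = 1;
        return 0;
    }
    *panicked = 0;
    return num / den;
}
```

The deliberate design choice is that on_panic never requests ACT_RESET: the halted unit keeps its state and log pages intact for whoever pulls it from service.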