This lecture is about software engineering in the context of genomic data science. Software engineering is often given short shrift in the world of computer science and programming, and in the world at large. But software engineering is critical to almost everything we do in computational analysis of data. By engineering, I mean paying attention not only to what the software does, but to how reliable it is, how many cases it handles, and whether it's really performing the way you expect it to perform. When we write programs and we're doing a lot of data analysis, you can see equations in our programs and in our descriptions, equations that describe relationships between variables. Here's a very, very simple equation: z equals x over y. Everybody sees equations like this when they're taking algebra in middle school or high school. So, this is math. We're familiar with that. But programming is a different thing. We have these same equations in our programs, but we have to do more. Of course, you can assign a value like x over y to a variable like z in a program. But you have to check to make sure that y isn't equal to 0. So here, in pseudo-code, is how we would write that in a program: if y is not equal to 0, then we can assign the value x over y to z. Why do we have to worry about things like that? It's because we know that in our computer, undefined values are going to cause problems; they're going to make our program crash. So we have to check for conditions that are unexpected but that might come up and would mess up our program in some way. Software engineering is all about trying to think of all the different cases that your program is going to be handling, and writing code to make sure those cases are handled.
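As a minimal sketch of the check described above, here is the z equals x over y example written in Python. The function name `safe_ratio` and the choice to return `None` when y is 0 are assumptions made for illustration; the lecture only says that the division should happen when y is not equal to 0.

```python
from typing import Optional


def safe_ratio(x: float, y: float) -> Optional[float]:
    """Return x / y, or None when y is zero and the ratio is undefined."""
    if y != 0:
        # The expected case: y is nonzero, so the division is well defined.
        return x / y
    # The unexpected case: report "no value" instead of crashing.
    return None


z = safe_ratio(6.0, 3.0)          # y is nonzero, so z is assigned x / y
undefined = safe_ratio(6.0, 0.0)  # y is zero, so no division is attempted
```

Returning `None` is only one way to handle the unexpected case. Raising an error with a clear message, or skipping the record and logging it, would be equally reasonable depending on what the surrounding analysis needs.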
Then, when we deal with big data sets, all sorts of very bizarre cases that you might think happen very rarely, and that in fact are very rare, do happen, because the data sets are so large. So good software engineering is critical to reliable computational biology and analysis programs. Why do we need to understand genomic software? If you move into the field of computational biology or computational data science, you're going to need to run programs on very large data sets. You need to understand what those programs are doing and not treat them simply as black boxes that feed you an answer. If you don't understand what the programs are doing under the hood, you can be very confused by the output. So I'm going to give you one example, to give you an appreciation for software engineering and also for the kinds of reports a program can make that can initially be confusing. A little background here. There's an interesting process in the world of genomics called RNA editing. So what is RNA editing? It has nothing to do with software engineering. DNA we've already explained. DNA gets converted into proteins by first being copied, or transcribed, into RNA, and that RNA gets translated into proteins. Now, there was an interesting phenomenon discovered quite a while ago, more than a decade ago: once in a while, in very rare circumstances, some of the nucleotides in the RNA can actually be edited by the cell. That is, they can be changed. So you have a piece of RNA that doesn't quite match your DNA; there are maybe one or two or three positions in your RNA sequence that are different, and therefore your protein might be different. This is a really interesting phenomenon. We actually understand some of the molecular basis of it, but it only affects one or two of the nucleotides.
Those can be changed into one or two others. It can't happen to just any nucleotide, and not every nucleotide can be changed into any other. But it's very interesting, and it's used to regulate genes. So why do I mention this problem? If you wanted to study this phenomenon, you could take a big data set and try to find new examples of RNA editing. This would be an interesting scientific discovery, and there are people who study this as part of their research programs. So how do we detect something like RNA editing? We sequence DNA, lots of it, because that's how we sequence DNA; we always get lots of it these days. And we can also sequence RNA. If I sequence DNA and RNA from the same person, I can then align those sequences to one another. With alignment, we're now talking about a computational problem. Alignment is a well-defined problem. We've been studying it for years, and there are many programs you can use to do alignment. It's important that you understand how those programs operate. If you just run those programs and align DNA and RNA from the same person to each other, you would expect, not knowing anything else, that everything would match perfectly. And anywhere there's a mismatch, an RNA-DNA difference, you could say, aha, I found RNA editing. Now, as I said a minute ago, RNA editing does occur. It's rare, but there are a few nucleotides it affects. A-to-I is the main kind of RNA editing, where the A nucleotide gets changed to something called inosine, or I, which gets read as a G by the sequencing machine. So you'll see these differences, and you can conclude that those genes underwent RNA editing. So why am I giving this example? Well, it turns out that a couple of years ago, some scientists who were looking for new types of RNA editing did exactly what I'm describing, using some of the current best software for doing alignment.
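To make the detection idea concrete, here is a toy sketch of the naive approach just described: compare an aligned DNA sequence and RNA-derived sequence position by position and report every difference. The sequences, the function names `find_differences` and `looks_like_a_to_i`, and the assumption of an ungapped, same-strand, equal-length alignment are all made up for illustration; real alignment programs are far more involved.

```python
from typing import List, Tuple


def find_differences(dna: str, rna: str) -> List[Tuple[int, str, str]]:
    """List (position, dna_base, rna_base) wherever two aligned sequences differ."""
    if len(dna) != len(rna):
        raise ValueError("aligned sequences must have the same length")
    return [(i, d, r) for i, (d, r) in enumerate(zip(dna, rna)) if d != r]


def looks_like_a_to_i(dna_base: str, rna_base: str) -> bool:
    """A-to-I editing appears as A in the DNA and G in the sequenced RNA."""
    return dna_base == "A" and rna_base == "G"


dna_read = "ACGTACGA"  # made-up DNA sequence
rna_read = "ACGTGCGT"  # made-up RNA-derived sequence, assumed already aligned
differences = find_differences(dna_read, rna_read)
candidates = [d for d in differences if looks_like_a_to_i(d[1], d[2])]
```

Note what this sketch cannot do: it reports every mismatch it sees, and it has no way to tell a real edit from a read that was aligned to the wrong place. It trusts the alignment completely.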
And they found thousands of new examples of RNA editing in many different genes, none of which were known to undergo RNA editing before. The way they found them was exactly how I described. They used software; they weren't the ones who developed the software, and they didn't quite understand all the details of how it worked; and the software found lots of RNA-DNA differences for them. They were very excited. This was published, and it generated a lot of discussion and a lot of controversy, because they had discovered a new mechanism for RNA editing, or so it seemed. Well, it turns out that when you're dealing with big data sets and complex software, surprising things can happen, and you shouldn't be that surprised. So suppose I align a large data set of RNA to a large data set of DNA, even from the same person, and I find lots of mismatches. The question you want to ask, before writing your paper and getting excited about a new discovery, is: are those differences real? Most of the alignments are correct; the software that does alignment works most of the time. When I say most of the time, it might work 99.999% of the time. But even one mistake in a million in your software, one misalignment in a million, leads to hundreds of errors in a large data set. This is something you really have to start working with these data sets to appreciate. So you might get very excited at seeing what, in this example, you believe are hundreds of novel RNA edits that no one has ever observed before. And you might think, aha, I've discovered something. But the first question you need to ask yourself is: what did that software do? Was the software engineered to handle all possible cases? Because I just fed it 200 million reads to align against, say, 200 million other reads, and asked it to show me all the differences.
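The arithmetic behind "hundreds of errors" can be sketched in a few lines of Python. The function name `expected_errors` is hypothetical, and the calculation assumes that errors occur independently at a fixed rate per read, which is a simplification and not something the lecture states.

```python
def expected_errors(num_reads: int, errors_per_million: int) -> float:
    """Expected number of misaligned reads at a fixed error rate per million reads."""
    return num_reads * errors_per_million / 1_000_000


reads = 200_000_000  # the 200 million reads from the example

# "One misalignment in a million" still yields 200 wrong alignments.
print(expected_errors(reads, 1))

# 99.999% accuracy is 10 errors per million, which yields 2000.
print(expected_errors(reads, 10))
```

Either number is easily large enough to look like hundreds or thousands of novel RNA-DNA differences if every mismatch is taken at face value.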
So even though the software may seem very reliable, in that it doesn't crash and it gives me lots of answers that seem to be correct, if it's not handling all possible cases and is occasionally making an error in a big data set, you can reach misleading conclusions. So the message of this lecture is that software engineering is critical for genomic software. Those of us who develop it, including me, do our best to handle all possible cases, but you have to realize that the software you're using, even if it's been carefully engineered, can sometimes do the wrong thing when you run it on a big data set. Just because a program runs to completion and produces answers doesn't mean that it's free of bugs. Of course, if a program crashes, you know something went wrong. But the more dangerous cases for our own analyses are when your software runs and gives you answers, and you then run with those answers and draw more conclusions without checking to be sure that you understood exactly what the software was doing and that all the outputs were reliable.