What is computational biology software? This is a big topic, so we're only going to give you a brief introduction and try to give you a flavor for what kinds of software we mean when we talk about computational biology software. Basically, computational biology software is what we use to transform raw data into information, information that you can use to make biological discoveries and guide experiments. The data that comes in from large-scale genome experiments is generally just DNA sequence data: long strings of As, Cs, Gs and Ts that, if you just look at them, are mind-numbing and not very informative. So you need software to go from this very raw form of data, which comes in several formats and in very large files, to something like, say, what's on this slide: a picture of how a certain section of DNA might be duplicated in different people, causing different phenotypes. We need different types of software to do that. Software takes us from this raw, mind-numbing, more or less uninterpretable data to something that a student, a postdoc or an investigator can look at and make sense of.
One thing we often talk about when we're analyzing data with computational biology software is analysis pipelines. You'll hear this word a lot. A pipeline here is not a pipeline to carry water or oil. It's a way of taking raw data and feeding it through a whole series of programs, each of which makes a transformation on that data, so that at the end you have conclusions, or condensed data, from which you can make your biological discoveries. So we might start with a raw data file and do something like clean it up to remove noise. There are multiple programs that do that kind of task. Then you might summarize it in another way: you might assemble it, you might compare the sequences to each other, or you might compare them to a reference genome.
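To make the idea of a pipeline concrete, here is a minimal sketch in Python. It is not any real tool. The step names (`remove_noisy_reads`, `deduplicate`, `summarize`) and the tiny list of reads are made up for illustration, and real pipelines chain separate programs over very large files rather than functions over a list. The shape is the same, though: each step takes the previous step's output and transforms it.

```python
from functools import reduce

def remove_noisy_reads(reads):
    """Drop reads containing anything other than A, C, G or T (hypothetical cleanup step)."""
    return [r for r in reads if set(r) <= set("ACGT")]

def deduplicate(reads):
    """Collapse identical reads, keeping the order in which they were first seen."""
    return list(dict.fromkeys(reads))

def summarize(reads):
    """Condense the reads into a small summary a person can actually look at."""
    total_bases = sum(len(r) for r in reads)
    gc_bases = sum(r.count("G") + r.count("C") for r in reads)
    return {
        "n_reads": len(reads),
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }

def run_pipeline(data, steps):
    """Feed data through each step in order; each output becomes the next input."""
    return reduce(lambda current, step: step(current), steps, data)

# Made-up raw data: one duplicate read and one read with ambiguous bases (N).
raw_reads = ["ACGTAC", "ACGTAC", "ACNNTG", "GGCCAT"]
result = run_pipeline(raw_reads, [remove_noisy_reads, deduplicate, summarize])
print(result)
```

Swapping one step for another program that does the same job changes nothing about the structure, which is part of why pipelines are convenient and also why the choice of each step matters.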
After steps like those, you go on and condense the data further. There are literally hundreds, if not thousands, of computational biology programs out there, as well as pipelines. There are programs that let you visualize data in different ways. We have a course here on Galaxy, which puts together many of these programs and gives you one way of visualizing things. There are pipelines that do things like variant discovery, which I'm not going to talk about today, except to say what I mean by discovery. If we're looking at someone's DNA, one of the things we sometimes want to figure out is what's different about their genome compared to the reference genome. The way we do that is to run a computational biology pipeline that does all sorts of preprocessing, taking raw reads, short sequences of 100 or 200 base pairs, and converting them into a readout of what's different about this person from the reference genome. That's what I mean by software pipelines.
Just to give you the flavor of what a computational biology software pipeline is, we're going to talk about one example, a pipeline for something called RNA-seq. RNA-seq is one of the most popular experimental protocols in genomics today. It's the name for a protocol that takes RNA from a cell or collection of cells and essentially sequences it to figure out which genes are turned on in those cells. There are literally thousands of kinds of experiments you can imagine doing with RNA-seq. We can use it to measure the difference between, say, two cell types, to see which genes are turned on in one cell type and not the other. We can use it to look at cancer cells to see what's gone wrong with them. So essentially, with RNA-seq, we're taking our cells and we're extracting RNA.
And we're turning that RNA into sequences, raw sequences, which, as with all sequencing, are very short reads. In this case, because it was RNA, they represent the genes that were turned on in the sample we were looking at. So how do you go from those raw reads, those short 100 or 200 base pair sequences, to the readout you're interested in for an RNA-seq experiment, which is a list of genes and their expression levels? There's a pipeline that my group and others have been involved in developing, sometimes called the Tuxedo tools, that comprises three main programs called Bowtie, TopHat and Cufflinks. The names are why it's called the Tuxedo tools. There are actually more programs than just these, but here is the overall flavor of what this pipeline does.
We start with RNA sequencing reads, these short reads. We align them to the human genome with Bowtie, and that gives you a bunch of alignments. Because these reads come from RNA, and RNA is spliced, the introns have been removed, which means some of the reads will span two or more exons. So a single short read might actually align to two places on the genome that are separated by an intron, and in humans introns can be thousands of base pairs long, even tens of thousands or hundreds of thousands. That's a different alignment problem. The reads that span multiple exons have to be aligned differently, so there's another program to do that, called TopHat. Once you have all these alignments, you still don't have something you could give to a biologist to make any sense of. You need to take those alignments and assemble them, by comparing them to each other, into the genes that were in the original sample. There's a program called Cufflinks that does that assembly.
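To see why reads that span exons are a different alignment problem, here is a toy sketch in Python. It is emphatically not how Bowtie or TopHat work: those use genome indexes and tolerate mismatches, while this uses exact string search on a made-up 38-base "genome". The function name `align_read` and the `min_anchor` parameter are hypothetical. The point is only that a read straddling an exon-exon junction fails as a single contiguous match but succeeds when split into two pieces separated by an intron.

```python
def align_read(read, genome, min_anchor=3):
    """Toy aligner: try one contiguous exact match, then one splice junction.

    Returns a dict describing the hit, or None if nothing matches.
    min_anchor is the smallest piece allowed on either side of a junction.
    """
    start = genome.find(read)
    if start != -1:
        return {"kind": "unspliced", "start": start}

    # Try every way of cutting the read into a left piece and a right piece.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + len(left)
            # The right piece must land downstream, leaving a gap (the intron).
            right_start = genome.find(right, left_end + 1)
            if right_start != -1:
                return {
                    "kind": "spliced",
                    "left_start": left_start,
                    "right_start": right_start,
                    "intron_length": right_start - left_end,
                }
            left_start = genome.find(left, left_start + 1)
    return None

# Made-up sequences: two exons separated by a 14-base intron.
exon1 = "ACGTTGCA"
intron = "GTAAGTCCCTTTAG"
exon2 = "TCGGATCC"
genome = "TTTT" + exon1 + intron + exon2 + "AAAA"

# A read from the spliced RNA: the end of exon 1 joined directly to the start of exon 2.
junction_read = exon1[-4:] + exon2[:4]

spliced_hit = align_read(junction_read, genome)
unspliced_hit = align_read(exon1, genome)
print(spliced_hit)
print(unspliced_hit)
```

Even in this toy, the spliced search does far more work than the contiguous one, and with real introns tens of thousands of bases long and real sequencing errors the search space is much larger, which is why it gets its own program.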
After assembly, you have to go further, because now you can see perhaps which genes were present, but what you're really interested in is their expression levels. Cufflinks also estimates those. Beyond that, when you're doing experiments, you want more than just the levels of expression in one sample. Almost always, what you're doing is comparing two or more samples to each other. So you have to take the genes and their expression levels from one set of data, compare them to the genes and expression levels from another set of data, and see which genes went up and which went down. From there, you can start to make biological conclusions. The Cufflinks package includes a program called Cuffdiff to do that. So that's a pipeline. It goes from raw reads to, basically, tables showing you genes that went up and down. That's exactly what you want if you're the experimenter, and you may not really care about what's going on in the software underneath, but it's important to understand these software tools.
So here's a picture of what Tuxedo can spit out. Actually, it doesn't spit it out in quite this graphical form, we're not there yet, but with visualization tools you can get a view very much like this. The idea is that you go from raw reads to a set of genes. One reason this is important in our studies of the human genome is that we now know that almost all human genes, well over 90% of human genes, have more than one splice variant, or isoform. What we're calling a gene is an interval on the genome that has some function, and usually what we mean is that it's translated into a protein. But the exons that make up these genes can be chopped up and combined in different ways. We call those splice variants, or splice isoforms. And each of those different isoforms can be expressed at different levels.
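Here is a minimal sketch, in Python, of the last step: comparing expression levels between two samples to see what went up and what went down. It is not Cuffdiff, which does proper statistical testing across replicates. This only computes a log2 fold change with a pseudocount, and the isoform names, expression values, `pseudocount` and `threshold` are all made up for illustration.

```python
import math

def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    """Return log2(b / a) per isoform; the pseudocount avoids dividing by zero."""
    changes = {}
    for name in sorted(set(sample_a) | set(sample_b)):
        a = sample_a.get(name, 0.0) + pseudocount
        b = sample_b.get(name, 0.0) + pseudocount
        changes[name] = math.log2(b / a)
    return changes

def classify(changes, threshold=1.0):
    """Label each isoform 'up', 'down' or 'unchanged' (threshold 1.0 means 2-fold)."""
    labels = {}
    for name, change in changes.items():
        if change >= threshold:
            labels[name] = "up"
        elif change <= -threshold:
            labels[name] = "down"
        else:
            labels[name] = "unchanged"
    return labels

# Made-up expression levels for two isoforms of one gene and one isoform of another.
control = {"GENE1.isoA": 7.0, "GENE1.isoB": 31.0, "GENE2.isoA": 15.0}
treated = {"GENE1.isoA": 31.0, "GENE1.isoB": 7.0, "GENE2.isoA": 15.0}

changes = log2_fold_changes(control, treated)
labels = classify(changes)
for name in changes:
    print(f"{name}\t{changes[name]:+.2f}\t{labels[name]}")
```

Notice that in this made-up data the total for GENE1 is the same in both samples (38). Only the isoforms have swapped, so a gene-level comparison would see no change at all, which is one reason isoform-level quantification matters.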
So not only do we have to figure out from these raw reads which genes were expressed, we also need to figure out which isoforms of those genes were expressed and what their expression levels are. We'd like our software to do all that so we don't have to worry about those details, and we'd like it to do it all correctly. This is a very complicated process, which is why you need the software pipeline.
Pipelines change, and it's important that we keep up with those changes. The programs I just told you about have already been superseded by even newer programs, some of which are just being published, that make up the next generation of the Tuxedo pipeline. Bowtie is now Bowtie2. TopHat is still around, but its core engine is being replaced by a faster engine called HISAT. The Cufflinks assembler has now been superseded by a program called StringTie, which does the same thing, assembling transcripts and quantitating them, but does it somewhat faster and somewhat more accurately. And for the differential expression task, you can now use a program called Ballgown. There are papers describing all these programs, and you can go and read those papers and decide for yourself which are the best. For some tasks, it may not matter that critically. But very often, it does matter. These programs produce different answers. The biologists you may work with, if you're the computational analyst for a project, aren't going to understand how these programs work and aren't going to know which ones are the best. So it's up to you to keep up with the state of the art in computational biology software if you want to do analysis.
And you might think, well, can it really matter that much? These are well-defined, discrete problems. It's computing.
Computing seems like a well-defined, discrete task where, given the input, the output is always going to be pretty accurate, right? Well, I would argue that that's not true at all. These are very complicated datasets, and we've found that the programs that operate on them produce very different answers, even for something as well defined as alignment. Alignment is in some ways the most basic of the computational problems we deal with. By alignment, I mean taking a short read and aligning it to the human genome. You would think that if the read is long enough to align to one place in the genome, then all programs will give the same answer, so for accuracy it doesn't really matter which program I use, as long as the program is fast enough. You might think, I'll just choose the fastest program, and they'll all give the same answers.
Well, no, that's not true. We've done tests comparing two of the leading aligners, Bowtie2 and BWA, on many datasets, and we found that they don't actually align the same reads. Here's just one example, from some exome data, showing that there are about 660,000 reads in this dataset that Bowtie2 aligned and BWA didn't, and another 300,000 reads that BWA aligned and Bowtie2 didn't. There are also a few reads that are unaligned by both. You can say, well, okay, 98% of the reads were aligned by both, so maybe it doesn't matter. But maybe those reads that got aligned by only one of the programs are the ones you care about. So you at least need to be aware of that. And this slide doesn't even show you the fact that when the two programs both align a read, they don't always align it to the same place.
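The comparison on that slide boils down to set arithmetic on read identifiers. Here is a minimal Python sketch, assuming you have already pulled each aligner's output into a dictionary mapping read ID to an aligned position. The function names, read IDs and positions below are made up, and parsing the aligners' real output files is left out.

```python
def compare_aligners(all_reads, hits_a, hits_b):
    """Partition read IDs by which aligner placed them.

    hits_a and hits_b map read ID -> (chromosome, position) for aligned reads only.
    """
    everything = set(all_reads)
    a, b = set(hits_a), set(hits_b)
    return {
        "both": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "neither": everything - (a | b),
    }

def placement_disagreements(hits_a, hits_b):
    """Reads that both aligners placed, but at different locations."""
    shared = set(hits_a) & set(hits_b)
    return sorted(r for r in shared if hits_a[r] != hits_b[r])

# Made-up results for five reads from two hypothetical aligner runs.
all_reads = ["r1", "r2", "r3", "r4", "r5"]
aligner_a = {"r1": ("chr1", 100), "r2": ("chr1", 250), "r3": ("chr2", 40)}
aligner_b = {"r1": ("chr1", 100), "r2": ("chr3", 900), "r4": ("chr2", 77)}

groups = compare_aligners(all_reads, aligner_a, aligner_b)
disagreements = placement_disagreements(aligner_a, aligner_b)

for label, reads in groups.items():
    print(label, sorted(reads))
print("placed differently:", disagreements)
```

In this made-up case, read `r2` counts as "aligned by both", yet the two aligners put it on different chromosomes. A high percentage of reads aligned by both can hide exactly that kind of disagreement.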
So your choice of software matters a lot, and it's critical, if you want to be a computational biologist, that you keep up with the latest software and that you're aware of the differences between the software packages you might apply to a dataset.
The overall message I want to leave you with is that software is changing because technology is changing. Sequencing technology has been changing more rapidly over the past decade than almost any other technology we know of, and the software has been changing rapidly as well to keep up with it. Today there's new sequencing technology generating longer and longer reads, and in many cases there isn't even software yet to process those kinds of sequences. The technology we use for the highest-throughput sequencing, which currently is Illumina sequencing, is also changing, getting faster and higher throughput, and the nature of the data itself is changing. This means that if we use software from three or four years ago on the latest sequencing data, we might get the wrong answer. What's important to realize is that some of these programs are pretty well engineered, so you'll get an answer. You'll get alignments, you'll get genes, you'll get expression levels, you'll get differences between genes in experiments. Everything will look like it's okay, but if the sequencing technology has changed, the software may no longer be doing the right thing, so you might be getting misleading results. So you have to keep up with the technology, and you have to keep up with the software.