[NOISE] [MUSIC] In this lecture, I will explain some concepts from big data in biology that can be addressed by what I'll call computational pipelines. One of the problems in analyzing large data sets is that there are so many different algorithms, and so many different people involved, that the complexity of the analyses being conducted is simply very high. To deal with that, I'll propose here that pipelines are a useful way of thinking about these computations.

And what do I mean by computational pipelines? If we start out with a computational module being a program that takes an input, performs a computation, and returns some output, then we can define a pipeline as a collection of such computational modules where the outputs and inputs of the modules are wired together in a specific arrangement. Shown here at the bottom is the simplest case of such a pipeline: two computational modules wired together in sequence.

A real-life example of a computational pipeline in biology is the Tuxedo pipeline. The goal of this pipeline is to take raw reads in the form of RNA sequences from two different conditions and then determine which genes are differentially expressed. The whole computation is broken down into a sequence of steps, and each step is performed by a specific program: the TopHat program, the Cufflinks program, the Cuffmerge program, the Cuffdiff program, and the CummeRbund program.

Why bother separating the tasks and having a different program do each of these steps, you might ask? The main reason is that it makes the task modular. It is therefore much easier to change the specific configuration of each program, and much easier to swap out a program if you would rather use some other tool for a particular step. It even lets you build new pipelines: if you have a standard set of programs, it is quite easy to assemble new pipelines from those program elements.

Then there are these other things that I'll call flow diagrams, which are very reminiscent of pipelines as I have defined them. Here I would just like to stress that these are fundamentally different, in that there is no direct correspondence between the links and nodes and the computation. What these flow diagrams are meant to represent is something more abstract about the computations and the analysis that is happening. That is tremendously useful, but it is important to keep in mind exactly what is meant by the links and nodes in such diagrams. This particular diagram also describes a whole class of different RNA sequencing-based methods, so you could even say that the Tuxedo pipeline, as I described it before, is a subset of what is described in this flow diagram.

If you have a strict correspondence between the computational modules and the visual or network representation, that enables graphical user interfaces where you essentially have a set of modules and put them together using drag and drop or other visual elements, which is very convenient for the non-specialist analyst. I have seen a lot of people use the KNIME software, and it is actually quite efficient for what it does.
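To make the earlier definition concrete before moving on, here is a minimal, hypothetical sketch of two modules wired in sequence on the command line. The program names and file names are placeholders for illustration, not tools from the lecture: the output file of the first module simply becomes the input file of the second.

```bash
# Minimal sketch of a two-module pipeline (placeholder names):
# the output of module1 is wired to the input of module2.
module1 --input raw_reads.txt    --output intermediate.txt
module2 --input intermediate.txt --output final_result.txt
```

The Tuxedo pipeline follows the same pattern, only with five real programs and several intermediate files passed between the steps.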
It is also quite interesting that having this one-to-one correspondence between your computation and its visual representation lets you communicate quite effectively what a specific pipeline you have put together is doing. There is a whole range of other visual representations of workflows; Taverna is another example. I am showing you here a relatively simple workflow, although it looks a little complicated, that runs a BLAST search, retrieves the hits as sequences, performs a multiple alignment, constructs a phylogenetic tree, and produces some visualizations. That is a fairly standard bioinformatics analysis, and maybe some of you have done it before.

What is special about this one is that it uses mostly web resources: the local client that runs the workflow is essentially sending queries out to different web services, much as you sometimes do in bioinformatics when you copy and paste something into a web browser and get a result back. This workflow lets you stitch together different components, where some of those components can be web-based. That is interesting, I think, because it is an example of using cloud computing, and people also share the workflows they have made on the internet. Currently there are about 2,700 of these workflows available on myexperiment.org, and if you make something new, you may want to contribute to that body of available pipelines.

Graphical workflows are all well and good, but it is important to realize that they do not actually do anything you could not do on the command line with the programs themselves, either in an environment like R or directly on a UNIX command line. Here I am showing you an example of a program invocation where we call the program TopHat from the Tuxedo pipeline we saw before and pass it some arguments: -p 8 asks it to start eight threads, since it is a multithreaded application; we say that genes.gtf is among the input files, that the output should be in a particular format, and what the output file names should be.

In the UNIX system there is a special pipe operator, the vertical bar, which takes the output of the program on its left and feeds it into the input stream of the program on its right. So here, program1 reads input.txt and generates output that is sent to the input of program2; the output of program2 is sent to the input of program3; and program3 writes its output to output.txt. This exactly qualifies as a pipeline as we discussed previously. The special thing about UNIX pipes is that they are an example of stream-based computing: when you issue this command, the operating system starts all of the programs at the same time, so you essentially have a parallel process running, which can be very useful if the programs you are invoking support it.

Another way to organize these pipelines would be to have a script with these pipe commands, or even just a sequential script with all of the different steps of the pipeline. But sometimes you get into situations where things become a little too complicated: maybe you have multiple branches, and maybe you want to keep track of dependencies. At some point these scripts no longer do enough for you.
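As a hedged sketch of the two invocations described above: the TopHat line is representative rather than the exact command from the slide (the genome index and read file names are illustrative placeholders; -p, -G, and -o are standard TopHat options for threads, gene annotation, and output directory), and program1, program2, and program3 are the generic placeholders from the pipe example.

```bash
# Representative TopHat call (illustrative file names):
#   -p 8        start eight threads
#   -G genes.gtf supply the gene annotation as input
#   -o cond1_out write results into this output directory
tophat -p 8 -G genes.gtf -o cond1_out genome_index cond1_reads_1.fq cond1_reads_2.fq

# The UNIX pipe operator: each program's standard output becomes the
# next program's standard input; program3 finally writes output.txt.
program1 < input.txt | program2 | program3 > output.txt
```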
One solution to this problem is to use makefiles and the program Make, which can keep track of dependencies. Make is most often used for compiled programs, to keep track of compilation dependencies, but it is fairly flexible and you can also use it for data dependencies. The example here is artificially constructed: it again runs a BLAST search against some database, which produces an intermediate result, and then a filter program called filterBlast takes the BLAST results as its input and does some filtering on them. The dependency graph tells you, statically, the relationships between the different files, and it also includes the dependencies on the programs themselves. So if I update the filterBlast program and then run the Makefile, Make will know which results could potentially have changed because I changed that program; it will work backwards through the dependency graph and rerun all of the affected commands. That is quite useful.

There are a bunch of other very useful tools as well, and here I will just mention one: COSMOS, a Python library for keeping track of these kinds of dependencies. To get more technical, it maintains a directed acyclic graph of the dependencies. COSMOS will do a number of other things for you as well. It keeps track of which files each of the tools you define has as input and output, and you basically wrap the program invocation: the command you see in the return statement in section (a) is just a program invocation like before, but you wrap it in a Python class, and COSMOS keeps track of the input-output relationships for you. It also has these special functions, which some of you may recognize: add, map, and reduce, very similar to the MapReduce paradigm in parallel computation. What this hints at, and I think this is the primary reason for bringing up COSMOS as an example, is that it can construct these directed acyclic graphs dynamically, specifically for parallel computation. If you have a task that is easily parallelizable, you can specify at run time the rules for splitting a specific job into sub-jobs, and what their dependencies are, and the COSMOS package will keep track of that for you. [MUSIC]
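As a hedged sketch of the kind of Makefile described above (the file names, the database name, and the exact BLAST command are illustrative assumptions, not the lecture's actual example), note how each target lists the files and programs it depends on:

```make
# Filtered hits depend on the raw BLAST output AND on the filterBlast
# program itself, so editing filterBlast causes this rule to rerun.
filtered_hits.txt: blast_hits.txt filterBlast
	./filterBlast blast_hits.txt > filtered_hits.txt

# The raw BLAST output depends on the query sequences; the database
# "mydb" is assumed to be formatted already (BLAST+ style options).
blast_hits.txt: query.fasta
	blastp -query query.fasta -db mydb -out blast_hits.txt
```

Running `make filtered_hits.txt` rebuilds only the targets whose prerequisites have changed, which is exactly the backwards walk through the dependency graph described above.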