All right. This is going to be our first hands-on lecture using Galaxy, and we're going to start learning about the Galaxy interface by working with some genomic interval data. The analysis we're going to do is a very simple one. We're going to look at human chromosome 22 and ask: which coding exons have the largest number of repetitive elements overlapping them? This exercise is also available online at usegalaxy.org/galaxy101, if you want to follow along there. So the general plan is the following. First we need to get data into Galaxy. Getting data in is almost always going to be the first thing you do when you're using Galaxy. There's a variety of ways to do this, and I'll show you some of those. So we're going to get the exon and repeat data. Then we're going to identify which exons actually have repeats, that is, match up the exons to repetitive elements. And then we're going to count the number of repeats per exon, and I'll use this exercise to also show you some other features of Galaxy and navigating Galaxy in general. All right, so we're going to start by getting the data for both exons, which are going to be coding exons on human chromosome 22, and repeats. So, if we go over to our web browser, here I have GalaxyProject.org. This is the starting point for everything related to Galaxy. Here you'll see you can use Galaxy through the public instance. You can get Galaxy either through local installation or in the cloud; we'll talk about that in later lectures. And then we have lots of learning materials about Galaxy. So if you want to go beyond what we're talking about here, there are other screencasts, there are interactive tutorials, and then finally, there's the Get Involved link where you can find our mailing list. If you want to ask questions, we also have a Biostar site and a variety of other things that you can use.
But we're going to go ahead and just click Use Galaxy. This takes us to usegalaxy.org. You can always get to the main public instance of Galaxy by going to usegalaxy.org. And here you see the standard Galaxy interface. Across the top we have a number of tabs for different ways of using Galaxy and different activities within Galaxy, as well as help and user management. We're going to stick with Analyze Data today, so we're going to be doing interactive, step-by-step analysis of data. On the left side of the screen, you'll see the tools. This is the list of all the tools that are installed in this Galaxy instance. If you remember from a previous lecture, I told you that Galaxy is extensible, so you can install different tools into different Galaxy instances and customize them for different purposes. On the right we have our history, which is currently empty; it's currently just called Unnamed history. Now, the first thing you should do when you're using a Galaxy instance: I highly recommend that you create an account right away. If you go to the User tab and just click Register, you can put in your email address and a password. And you should put in a public name, so that when you share things or publish them in Galaxy, this is an identifier that will be used to uniquely identify you. We don't require any other information from you to sign up for Galaxy. You may want to look at the terms and conditions for use of Galaxy, and you likely will want to subscribe to the Galaxy announcement mailing list, because that is a low-volume mailing list that sends you important information about Galaxy. I already have an account at this Galaxy instance, so I'm going to go ahead and log in instead of registering. This is what you'll do after your first time registering.
You can also link your Galaxy account to an OpenID if you want, so you could use your Google account or something else to log in. All right. So now that I'm logged in, you'll see I have a history here that's already loaded. If this is your first time using Galaxy, you won't need to do this, but if I click on this cog icon at the top of the history, this is how you get to all of the history operations. And so, I'm just going to say Create New. This is going to give me a new blank history. So now, as I said before, we want to get some data into Galaxy. If you click on Get Data here, you'll see lots of options for getting data. There is the simplest one, of course: you can upload data from your own computer. And in fact, if you click this Upload link right here, you can drag and drop. You can upload multiple files. You can pull files from URLs, so there are lots of ways to get data in that way. And then there are a number of these database servers, BioMart, FlyMine, that provide data that can be accessed directly from Galaxy. But what we're going to use for the purposes of this exercise is the UCSC (Santa Cruz) Table Browser. This is a database that's associated with their genome browser. So go ahead and click on UCSC Main, and now the Table Browser interface will load inside Galaxy. So as I said before, I want to first get coding exons. We're going to go ahead and use UCSC Genes. You want to make sure that hg19 is selected here, and Genes and Gene Predictions, UCSC Genes. For position I'm just going to use chromosome 22. So click on the radio button next to position, and just type chr22. That'll give you the entirety of chromosome 22. And then make sure the output format is set to BED, browser extensible data, and that the box Send Output to Galaxy is checked. That's all good. Then you can say Get Output.
Because you're getting gene features here, it's going to give you another screen. We want to select coding exons; this is going to extract the coding regions from each of these genes and send them back to Galaxy. So finally, you can say Send Query to Galaxy, and now Galaxy comes back up. Now, the nice thing here is that it's going to fetch this data in the background, and so we can continue to interact with Galaxy. And so I'm going to go ahead and say UCSC Main again. This time, if we go to the Repeats section, we'll just use the track RepeatMasker, and we should still have chromosome 22 selected here. Again we need to make sure that the output is set to BED format, then Send to Galaxy, and Get Output. Here we have fewer options and we can just say Create one BED record per Whole Gene, which is really referring to the whole repeat. Okay, you'll notice that now in our history we have two elements. The first one is the exons that we just fetched. It's shown in green, which means that the fetching job is completed. For any analysis you run, the dataset will start out grey, meaning it's queued. It will go to yellow as it's running, and then it will turn either green or red, depending on whether it successfully completed or failed. If we click on the name of the dataset, we can see a preview and a variety of options. And if we click on this eye icon we can actually see the entire dataset in the main window here. And so this is what the BED format looks like. This is a very common format for genomic intervals. We have one, two, three fields for representing the chromosome and position on that chromosome, then a name, an optional score, and the strand. And you'll see if you look at the repeats, similarly, these are in the same format with the same fields. Okay, so now we have our two datasets, the coding exons and the repeats. What we want to do is look for cases where they overlap.
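As a brief aside on the BED format just described, here is a minimal Python sketch of reading one six-field BED line. The `BedRecord` type, the `parse_bed_line` function, and the sample line are hypothetical and made up for illustration; they are not part of Galaxy or of real UCSC output. BED start coordinates are 0-based and end coordinates are exclusive, so the length of an interval is simply end minus start.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    """One six-field BED interval (hypothetical helper type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: str   # kept as text; this sketch does not interpret it
    strand: str  # "+" or "-"


def parse_bed_line(line: str) -> BedRecord:
    """Split a tab-delimited BED line into its first six fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name, score, strand = fields[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# Made-up example line, not real UCSC output.
example = "chr22\t1000\t1200\texon_A\t0\t+"
record = parse_bed_line(example)
print(record.chrom, record.name, record.end - record.start)
```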
There are a few ways of doing this in Galaxy, but the one we want to use is one that's actually going to allow us to count the number of overlaps. And so the tool we're going to use is called Join. It will match up all pairs that overlap, and it's in the section Operate on Genomic Intervals. So if we go back to our web browser and say Operate on Genomic Intervals > Join, for the first dataset we'll select the coding exons. For the second dataset we select the repeats. We'll require at least one base pair of overlap, and this last option we're going to leave at its default. For example, if you wanted to include exons that had no repeats in the output, you could change this, but for now, we only want records that have some overlap. So we'll say Execute. So now we're running the tool. Previously we'd been fetching data from external data sources. This is the first actual Galaxy tool that we're running. It's gone into the queue now, and what it's going to do is find all of those pairs of exons and repeats that overlap. While that's running, I can show you a few other features of the Galaxy history. In addition to the eye icon here, we have the pencil for editing attributes. This allows you to change the names of your datasets to make things easier to find. You can provide annotation and other info. The x icon will delete that dataset. Deleted datasets can be recovered. However, after a set period of time, which is specific to each Galaxy instance, they will be purged. But if you delete something and then immediately realize you didn't want to do that, you can recover it. And we'll talk about some of these other features later. So our join is finished, and if we look at the output here, we'll see we now have exactly those two datasets, but joined side by side where there's overlap. So here the first six fields, once again, are from the exons file. And the second six fields are from the repeats file. And in each case here, we have an overlap.
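To make the side-by-side output concrete, here is a minimal Python sketch of an interval join that requires at least one base pair of overlap. This is not Galaxy's implementation: it is a simple all-pairs comparison, and the function names and the tiny exon and repeat lists are made up for illustration. Coordinates are assumed to follow the BED convention (0-based start, exclusive end).

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals (0 if none)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def join_intervals(exons, repeats, min_overlap=1):
    """Return exon + repeat rows for every pair overlapping by min_overlap bases.

    Each input row is a 6-field tuple: chrom, start, end, name, score, strand.
    This compares every exon with every repeat, which is fine for a sketch
    but slow on large inputs.
    """
    joined = []
    for exon in exons:
        for repeat in repeats:
            if exon[0] != repeat[0]:
                continue  # different chromosomes never overlap
            if overlap_length(exon[1], exon[2], repeat[1], repeat[2]) >= min_overlap:
                joined.append(exon + repeat)  # 12 fields, side by side
    return joined


# Made-up intervals, not real chromosome 22 data.
exons = [
    ("chr22", 100, 200, "exon_A", "0", "+"),
    ("chr22", 500, 600, "exon_B", "0", "-"),
]
repeats = [
    ("chr22", 150, 160, "AluSx", "0", "+"),
    ("chr22", 190, 250, "L1", "0", "-"),
    ("chr22", 600, 700, "MIR", "0", "+"),  # starts where exon_B ends: no shared base
]

joined = join_intervals(exons, repeats)
for row in joined:
    print("\t".join(str(field) for field in row))
```

In this toy data only exon_A appears in the output, once for each of the two repeats it overlaps, which mirrors the default behavior described above of keeping only records that have some overlap.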
So now, what we wanted to do was actually get a count of the number of repeats overlapping each exon. The way we can do this is using the Group tool, which is in the section Join, Subtract and Group. This basically allows you to take all of the rows of a file and group them in particular ways. The dataset we want is the Join output, which is probably number three in your history. We want to group by column four. Why column four? Well, over here, that's our exon name. For each exon, we want to know how many repeats it overlapped, and grouping by column four is going to do that for us. Now, in addition to just grouping, we want the count for each group. So we can click Add New Operation, and we have all these different operation types. The one we want is going to be Count, again, on column four. And then we hit Execute. So now this is going to go through this file, and every place where we have the same exon name, it's going to group those together and give us an associated count. While that's running, I'll show you a few more features here. For any job, there is the View Details option. This gives you additional information. So particularly if you had an error in your job, you can look at this and see more information, see the parameters that were used, and see other datasets that were used by this job. And then there's the Rerun button. This is a very useful feature of Galaxy. As I've mentioned before, Galaxy is keeping all of the provenance, and so for every job, every dataset that you have in your history, you can go back and see the exact tool that was run and the exact parameters and datasets that were used when it was run. You can see that at any time by clicking the Rerun button. And then you can modify those parameters if you want to, for example, run that job in a slightly different way. Okay, so our group is finished.
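The grouping step just described, group by column four and count, can be sketched in a few lines of Python using the standard library's `collections.Counter`. The `count_by_column` helper and the sample rows are hypothetical and made up for illustration; the column number is 1-based to match the way columns are referred to in the lecture.

```python
from collections import Counter


def count_by_column(rows, column):
    """Count how many rows share each value in the given 1-based column."""
    return Counter(row[column - 1] for row in rows)


# Made-up joined rows: six exon fields followed by six repeat fields.
joined_rows = [
    ("chr22", 100, 200, "exon_A", "0", "+", "chr22", 150, 160, "AluSx", "0", "+"),
    ("chr22", 100, 200, "exon_A", "0", "+", "chr22", 190, 250, "L1", "0", "-"),
    ("chr22", 900, 950, "exon_C", "0", "-", "chr22", 940, 990, "MIR", "0", "+"),
]

counts = count_by_column(joined_rows, 4)  # column four is the exon name
for name, count in counts.items():
    print(name, count, sep="\t")

# The exon with the most overlapping repeats in this toy data.
top_exon, top_count = counts.most_common(1)[0]
```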
And if we click on the eye icon again, now we have the count: we have the exon name in the first field and the count associated with each exon in the second. For example here, this exon had five repeats overlapping it. So we've answered the question, but the last thing we'd like to do is get back the rest of the exon information. The way we can do this is again by using a join, but this time joining on the exon name with our original file. So here, under Join, Subtract and Group, we can say Join Two Datasets. The first file we want is our known genes dataset, the exons, and we're going to join on column four, which is the name. We want to join that against the Group output, using column one, which again was the name. So now Execute there. And this dataset should run shortly. Okay. So now that the join job has run, if we look at the output here, we've recovered the full exon information along with the count. Now suppose we actually want to put that count into the score field of this BED format file. What we can do is use the Cut tool under Text Manipulation to extract a set of columns from that dataset. We want the three position columns and the name, so we specify the first four: c1, c2, c3, and c4. And if we look at our dataset, the count we want is actually in column eight, and so we can say c8. Execute that, and what this is going to do is just extract those columns in the order that we've specified and create a new dataset that's now a valid BED format file with the number of repeats overlapping each exon embedded. And again, clicking on the eye icon, we see that in the main view. So lastly, if you want to get data out of Galaxy, you can click on this download icon, and this will save the data to your computer in its native, raw format. When you're clicking the eye icon here, you'll notice that this data has been formatted nicely for you in a tab-delimited way.
If you downloaded that formatted view it would not be as useful, but here this has downloaded the raw data, which we can then work with if we have other tools locally on our system, or if we want to upload it to another Galaxy instance. So, in summary, interactive analysis in Galaxy is performed by using tools available in the Tools panel to operate on datasets. Datasets are immutable. Running tools always creates one or more new datasets, and you can always go back and see the old datasets that were used. This ensures that every analysis is archived and can always be inspected for transparency and reproducibility. Datasets are available through the history, which gives you the complete provenance, in chronological order, of the analysis that was performed.
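For anyone working with the downloaded data locally, the last two steps of the walkthrough, joining the counts back onto the exons by name and then cutting columns 1 to 4 and 8, can be sketched in Python as follows. This is an illustration under stated assumptions, not Galaxy's implementation: the helper names and sample rows are made up, each exon name is assumed to appear at most once in the counts, and exons without a count are dropped.

```python
def join_on_name(exons, counts):
    """Append (name, count) to each exon row whose name has a count.

    exons: 6-field tuples (chrom, start, end, name, score, strand).
    counts: (name, count) pairs, as produced by the grouping step.
    """
    count_by_name = dict(counts)
    return [
        exon + (exon[3], count_by_name[exon[3]])
        for exon in exons
        if exon[3] in count_by_name  # assumption: unmatched exons are dropped
    ]


def cut(rows, columns):
    """Keep only the given 1-based columns, in the order listed."""
    return [tuple(row[c - 1] for c in columns) for row in rows]


# Made-up data, not real chromosome 22 records.
exons = [
    ("chr22", 100, 200, "exon_A", "0", "+"),
    ("chr22", 500, 600, "exon_B", "0", "-"),
]
counts = [("exon_A", 2)]

joined = join_on_name(exons, counts)   # 8 fields; the count is in column 8
bed_rows = cut(joined, [1, 2, 3, 4, 8])
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

The result has five columns per row: chromosome, start, end, exon name, and the repeat count sitting in the score position.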