And now in this class you'll be introduced to the Granges class from genomic ranges and you'll be introduced to an Annotation Hub. And now we're going to take a use case or a little data analysis where we're going to put these different components together, and show how easy we can achieve certain simple analysis. The question we're going to try to answer in this use case, is we're going to study histone mark called H3K4 trimethylation, that is set to mark promoters. If that's really the case we should find a lot of this particular histol mark inside promoters of genes. We're going to use InCode data on the H3K4 trimethylation and we're specifically going to use InCode data on a specific cell line called GM12878 which is a so-called tier one cell line from InCode. I just picked that more or less at random. The H3KR trimethylation data has also been, the experiment has been done by encode, and we are going to use the analysis which solves from N code. This is a chip seek experiment. And the analysis we saw is a set of peaks in the genome. Where it's believed that this particular modification is enriched. So, the first part of this use case here is we're going to get the encode data, using annotation hub. So, let's fire up annotation hub [SOUND], oh, that was the wrong one. And hub. And the first thing we're going to do is we are going to make sure that we just deal with human data. So we're going to subset the hub, we're going to say the species is homo sapiens. Now let's try to start by searching for the history modification data. Let's do a query on the A hub, and let's use two search terms. We're going to use a modification h4k4 and we're going to use the name of the cell line, that was gm12878. Let's stall this so we can look at it. Okay. Something came out. That's a good start. We have 11 records, and now the question is, which of this data should we use? We can see in the print out that the data comes both from the Institute and UCSC. And if we just look at the title, there seems to be two broad classes of titles. Something starting with WGencode, and something starting with E116. Now if you look on the Internet a little bit and you try to Google these titles and you know a little bit about encode data and UCSC, you'll recognize the first set of tiles as coming from encode data hosted at UCSC. The second set of tiles, the one that starts with E116 got me a little confused. But I Googled E116, and it turns out that E116 is an internal ID from the Roadmap Image Genomics Project for this particular cell line, the GM12878 cell line. So basically, we have data here from incode and we have data from the Roadmap Project. The Roadmap Project is newer, it's probably better processed, better analyzed, but I said in the beginning I would analyze in code data, so I'm going to stick with the in code data. Now, the question is which one of these data sets should I use? If you look at the name, you can see that it starts with W University of Washington. That's where the experiment was done. And then there's something would replicate one and replicate two and broad peak and narrow peak. And that's all a little bit confusing. I'm going to stick with University of Washington. And then I'm going to look at replicate one for the broad peak and one on the narrow peak and see if I can make some sense out of it. So I'm going to get through genome ranges, and I'm going to get the second element, and I'm going to get the fourth element. Let's have a look. So it's GRanges which makes sense. These are supposed to be peaks. And we have some coordinates in the genome and we have some additional metadata. Let's go up here. We have a name that seems to be missing. Everywhere we have a score that the list of this one seems to be zero. Then we have something called signal value. A good guess is probably that higher signal value, more enrichment. [COUGH] We're going to ignore that for the time being. Let's have a look at how big these peaks were. Remember, they were called broad peaks, and narrow peaks. So, let's do a summary of width of the broad peaks, I think this was a broad peaks. And they seem to have a median of about 400 basis. But some of them are really long. The longest is 22 kilobasis. That's pretty broad for this type of histol modification, which is something I happen to know. Let's look at the other one in the peak. And that looks a little extreme right? The minimum and the third quartile of the distribution is 150, and if we could do a table over the different widths, we'll see that essentially all the peaks have a width of 150 bases. Clearly something with the data processing has enforced this particular width on all the peaks. At this point in time, once you spend some time thinking about which data do I want? What should the source be, and so on and so forth, we're going to sidestep that discussion. And just use the replicate one from the narrow peak from University of Washington. So that was gr2, I'm going to rename my GRanges peak. And I'm just going to check just to be sure what the genome was for these things here. Let's see, this here was number four. And you can see here that the genome was hg19. This is pretty important. If you can download data from different genome versions using Annotation Hibernate. If you're not careful, you're going to get weird analysis out of it. Now, we're going to take these peaks, and we're going to ask, are these peaks enriched in promoters? So in order to answer that question we need to get the promoters. Promoters are usually defined as a small interval, say 2 KB or so, around the transcription start side of genes. So now we're going to get the transcription start side of transcripts of genes. The way to really do that in is to use something called a transcript database object or TSDB object, which we haven't covered yet. So instead I'm going to get gene annotation from using an AnnotationHub and I'm specifically going to go for a specific type of gene annotation called Ref C. Ref C gives a highly curated set of validated genes. We have very high confidence in them. We know a lot of them we know are probed and coding. We know a lot of information about them. But it's also very limited in the sense that for most RefSeq genes, we only have a single transcript or a single isoform in the database. We are going to see that. Don't take my word for it. Okay. So we are going to log an annotation help for. Let's do a query and let's save the query so we can look at it. And let's just look for RefSeq. Okay. This is promising. We have eight datasets, some are called RefSeq Genes, and other are called Other RefSeq, and given what we are attempting to get, it's probably a good guess that we want to go for the RefSeq genes. But why are there four datasets with the exact same title? That got me a little confused at first, but then I looked at the genome of these things here. You basically have four sets on four different versions of the human genome. Before we knew that the peaks were in the eight gene, ninth gene so we're also going to get, we're also going to get the RefSeq Genes in hg19. That turns out to be the first data set, so let's download that from our AHub, from the qhs, we use a double bracket so we download it. Here we have the genes, we have 50,000 genes or transcripts. And if you scroll up a little bit you can see that each row is 'g ranges', not a 'g ranges' list though it's actually called a UCSC track but it looks very much like a 'g ranges'. We can see that each row has a single range associated with it and a strand and a name. This is the outer coordinates of the transcripts, or the coordinates of the pre mRNA of the transcript. If we scroll down a little bit, we can see that there's a metadata thing called blocks, and these blocks here actually gives you the coordinates of the different exons in the transcript. We don't really care about the different exons, we just need the start side. The transcription stop site, which is the start of these single ranges coordinates. Now, I said RefSeq was very conservative. The name we have here is a RefSeq identifier which goes at the gene level. So two transcripts from the same gene is going to have the same name. So we can use this to quickly check for how many transcripts do we have pet gene and RefSeq? So we do that if we do a table or with a gene's [INAUDIBLE] name. We are for each name, we are going to know how often this'll occur. Then I can do a little trick. I can call "table" on this again, and then I get basically how often I see a single transcript, how often I see two transcripts and so forth. And as you can see, the output by far most of the times, have a single transcript and there's about one-thousand genes that has multiple transcripts. Now that's a little aside. We're going to get back to the promoters. Remember to each could be both in the forward and reverse strand, and depending on the strand, we want to get either end of the interval we have. Normally in a in genomic we have a convenience for us called promoters. So I'm going to just call promoters on my things. And I get the promoters. Now where it respects the different strands. So it takes the start of the transcript. Now how wide are these promoters? I said that here it's a little open questions how wide you want to define a promoter. And it turns out that they are all exactly 2.2 kb. We can look at the arguments for the promoters function, and we can see that the default is the pick 2000 bases upstream, and 200 bases downstream of the I'm just going leave it as is and continue. So now we have the promotals and the peaks and we are ready to ask, do we have some significant overlap?