In this lecture, we're going to look at doing a ChIP-Seq analysis in Galaxy using a tool called MACS. Just a reminder about how ChIP sequencing works: this is an experimental technique where we take protein-DNA complexes that have been cross-linked together, pull them down using an antibody, and then purify and sequence the DNA. If that antibody is for a particular transcription factor, we should get DNA that's enriched for regions of the genome that are bound by that transcription factor. Or the antibody might be to a histone modification, in which case we're going to get regions that are enriched for that particular histone modification. So this allows us to map locations of protein-DNA binding and modified histones. When you perform this assay, you get sequencing reads, which we align back to the genome, and then from those alignments we need to make a call as to whether we think a region is enriched. Our goal is to reconstruct the region of the genome that is bound by that protein.
Model-based Analysis of ChIP-Seq, or MACS, is a widely used tool for doing what's called peak calling for ChIP-Seq data. From the ChIP-Seq data, it tries to estimate how far it needs to shift the mapped reads to identify the boundaries of the region that's actually being bound by the protein. It works by sampling a number of high-quality windows, which are regions of the genome that appear to be enriched for binding, and then it uses those regions to estimate the shift amount.
So, we're going to get some ChIP-Seq data that's available in the Galaxy instance under Demonstration Datasets. Here I am in Galaxy. I'm already logged in, and I have a new history, which we can always get by clicking on the cog icon and Create New. I'm going to go to Shared Data > Data Libraries.
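Before we pick out the library, here's a quick aside on the shift estimate described a moment ago. This is only a minimal sketch of the idea, not how MACS itself is implemented: it assumes we already have the 5' end positions of forward-strand and reverse-strand reads inside one enriched window, and it uses the distance between the medians of the two piles as the fragment size. The function names and the toy coordinates are made up for illustration.

```python
from statistics import median

def estimate_fragment_size(forward_starts, reverse_starts):
    """Estimate fragment size d as the distance between the centers of the
    forward-strand and reverse-strand read piles (a simplification)."""
    return int(round(median(reverse_starts) - median(forward_starts)))

def shift_reads(forward_starts, reverse_starts, d):
    """Shift each read by d/2 toward the presumed binding site:
    forward-strand reads move right, reverse-strand reads move left."""
    half = d // 2
    shifted = [p + half for p in forward_starts]
    shifted += [p - half for p in reverse_starts]
    return sorted(shifted)

# Hypothetical 5' end positions in one enriched window.
forward = [1000, 1005, 1010, 1015, 1020]
reverse = [1100, 1105, 1110, 1115, 1120]

d = estimate_fragment_size(forward, reverse)
shifted = shift_reads(forward, reverse, d)
print("estimated fragment size:", d)
print("center of shifted reads:", median(shifted))
```

After shifting, the two strand-specific piles land on top of each other, which is the single peak in the middle that the MACS model plot shows.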
The library you want is called Demonstration Datasets, so click on that. There are two sets of data here. We want this one, Mouse ChIP-Seq G1E CTCF Binding, so we can expand that. We want to get all four of these datasets and import them into the history, so you can just select the datasets and click Import to Current History. Once they're imported, we can go back to our Analyze Data interface, and now we should see these four datasets in the history.
Just to give you an idea of what these datasets are: we have data from two different cell conditions. There are two datasets here that are labeled G1E, G1E input and G1E CTCF. G1E is a mouse cell line where a particular transcription factor, GATA1, is null. G1E-ER4 is that same cell line but with the transcription factor restored. So what you're able to see here is the change from one cell type to another. The particular transcription factor we're looking at here is called CTCF. This is an important transcription factor: it's involved in genome structure, insulator activity, and gene regulation.
Once we've imported all the data from the data library, the first thing we want to do is map it. We're going to skip quality control. If you want to look at doing quality control and verification, you can look at the previous lecture on FastQC. This would be a natural time to do that, but we're going to forge ahead. What we're going to do is use Bowtie 2 to map our data onto the mouse genome. Under the NGS: Mapping section, one of the tools that you have available is Bowtie 2, and if you click on it, you'll see the tool form. We want to make sure that the dataset selected under FASTQ file, which is the reads we're actually going to be mapping onto the genome, is G1E CTCF. The other thing that we need to do is select the genome that we're going to map onto.
Right now we just have reads, and the read data alone doesn't tell us what genome this data originally came from. So we can click on Select reference genome, and the appropriate reference genome here is the mouse genome. We're going to use build mm9 of mouse, so if you just type mm9 into the box, you should be able to select mouse build mm9. There are a variety of sets of default parameters, or we can configure additional parameters. We're just going to go ahead and use the defaults. Now we can click Execute, and this will run Bowtie 2.
Okay, so once Bowtie 2 is finished, you should have a green dataset in your history that says Bowtie 2 on data 1. These are the aligned reads, in a binary alignment format called BAM, so we can't look at this data directly. But we have some summary statistics showing that about 95% of our reads align exactly one time, and that's quite good.
Now that we have the reads aligned, the next thing we need to do is peak calling. As I said before, we're going to use MACS, Model-based Analysis of ChIP-Seq. This is in the NGS: Peak Calling section in Galaxy. So we can go to NGS: Peak Calling, or remember you can always use the search here to find the tool, in this case MACS. Click on MACS, and the tool form will come up. One of the first things we can do here is give this experiment a name, which can be helpful for keeping track of things. We'll say MACS on G1E CTCF here. This is single-end data, again. The only tag file we have available right now, based on the file formats, is Dataset 5, our Bowtie 2 on data 1, the aligned reads in BAM format. So go ahead and make sure that's selected. We do want to modify a couple of things here. Mainly, let's change our tag size to 36, since these are 36 base pair reads. And we want to keep MFOLD, which is our fold enrichment, at 32.
And for this last option, which asks whether to perform the new peak detection method, futurefdr, we'll say Yes. So now click Execute. This is going to generate two files. The first, which is Dataset 6 here in my history, is the actual peaks. These are regions that MACS is saying are enriched for CTCF in this cell type. It's also going to give us an HTML report that will give us some information on the run, on the shifting model, and so on. So we'll go ahead and let MACS run.
Okay, so once MACS is finished, we can see that we have our peak data here, which is a BED file. Galaxy is telling us that there are 605 regions, so that's the number of peaks that have been identified on chromosome 19. We also have our HTML report. We can click the eye icon if we want to view the report. This is just telling us some information from MACS, and if we click on Model, we can look at the peak model, showing the reads aligning to the forward strand, the reads aligning to the reverse strand, and the shifted model, which has a peak right in the middle.
We can visualize this BED file of peaks. For example, there's a display at UCSC main link, since Galaxy has the ability to display data in different genome browsers. If we click on display at UCSC main, it'll bring up the mouse mm9 UCSC Genome Browser with our peaks as a custom track. So now in the browser here, we can see our MACS peaks on G1E CTCF across chromosome 19 in the UCSC Genome Browser.
In doing ChIP-Seq analysis, it's important to think about biases that can affect your results. There are a number of issues, such as chromatin accessibility, which is going to affect how your DNA gets fragmented, issues with amplification, and repetitive regions, which are going to be difficult to map back to.
So it's very important in a ChIP-Seq experiment to use some kind of control, and a common control is an input DNA control, where we have DNA that is fragmented but the immunoprecipitation, the antibody pull-down, has not been performed. So we're going to rerun our MACS peak detection, but now using a control. What this allows MACS to do is use that control data for determining the background expectation of the number of peaks you would see, and use that in making your peak calls. It also allows MACS to compute a false discovery rate. Without a control, MACS will not compute a false discovery rate, but using a control, it's able to model that background.
If we go back to Galaxy, we already have the input DNA control here. This was Dataset 4, our G1E input, and it's also a FASTQ dataset, so we can go ahead and run Bowtie 2 on that as well. Click Bowtie 2 again and select G1E input, which is Dataset 4. Single-end again. We need to make sure we select mm9 as our reference genome, and go ahead and click Execute. Bowtie 2 is going to run again and give us aligned reads for the input control now. And we wait.
All right, once Bowtie 2 is finished, again we'll have a green dataset, in this case Bowtie 2 on data 4. This is all of our input reads aligned to the genome. Again, we have 95% aligning exactly once, so this appears to be a pretty good alignment.
Now we can run MACS again. Search for and click on MACS in the Tool menu, and we'll give it MACS G1E CTCF with control as the name. For our ChIP-Seq Tag File, this is the actual enriched sample that we expect to be representative of CTCF binding in this case, so we want to make sure we select Dataset 5. That was the dataset that we created previously.
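As a short aside on why the control matters, here's a minimal sketch of scoring one window against a background taken from the input sample. It assumes a Poisson model for read counts, with the expected count set to the larger of a genome-wide background rate and the control count scaled to the ChIP library size. That is a simplification of what MACS does, and the function names, parameters, and numbers are made up for illustration.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam).
    Computed as 1 - P(X <= k - 1), so very small p-values lose precision."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def window_p_value(chip_count, control_count, chip_total, control_total,
                   background_lam):
    """Score one window: how surprising is the ChIP read count given the
    control? The control count is scaled to the ChIP library size first."""
    scale = chip_total / control_total
    lam = max(background_lam, control_count * scale)
    return poisson_upper_tail(chip_count, lam)

# Hypothetical windows, both with 30 ChIP reads and equal library sizes.
# In the first the control is nearly empty; in the second the control is
# also piled up, for example in a repetitive or very accessible region.
p_clean = window_p_value(30, 5, 1_000_000, 1_000_000, background_lam=1.0)
p_biased = window_p_value(30, 25, 1_000_000, 1_000_000, background_lam=1.0)
print("p-value, quiet control:", p_clean)
print("p-value, noisy control:", p_biased)
```

The same ChIP count looks far less convincing once the control shows a pile-up in the same place, which is how a control can remove peaks that would otherwise be called.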
Here we have this optional ChIP-Seq control, which we did not specify before, but now we're going to go ahead and specify that the control is Dataset 8. We'll change our tag size again to 36 base pairs and use the new peak detection method, and go ahead and click Execute to run MACS. Again, MACS will take a couple of minutes to run.
All right, so once MACS is finished, we again have two datasets in our history: the BED file, which contains the peaks, and the HTML report. If we look at our peaks here, we now have a slightly different number of peaks, in fact a smaller number of peaks when we call them using a control. In this case, we've got 529 regions now that are enriched for CTCF binding. This is a BED file, which represents genomic intervals, and just like in the earlier lectures, you can now start to use genomic interval operations and other tools to interpret these peaks.
So, in summary, MACS is one tool that's available in Galaxy for the analysis of ChIP-Seq data. There are other tools, of course, but this is one that's widely used, particularly for transcription factor binding. Controls are extremely important for accurately calling ChIP-Seq peaks, and typically input controls are used. And as for most genomic problems, there are different tools that might be appropriate, depending on the type of data. So while MACS is good for punctate, kind of pointy, transcription factor binding, other tools may be more appropriate for broad histone modifications. One such tool is SICER, which is also available in Galaxy.
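As a follow-up on the genomic interval operations mentioned above, here's a minimal sketch of one such operation outside Galaxy: finding which peaks from one BED file overlap a peak in another, for example the calls made without and with the control. It assumes plain BED text with chromosome, start, and end in the first three columns, and the coordinates below are invented toy values, not the lecture's results.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples.
    Skips blank lines and track/browser/comment lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals

def overlaps(a, b):
    """BED intervals are 0-based and half-open, so two intervals that
    merely touch (one ends where the other starts) do not overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def peaks_with_overlap(query, reference):
    """Return the query peaks that overlap at least one reference peak.
    Quadratic in the number of peaks, which is fine for a few hundred."""
    return [q for q in query if any(overlaps(q, r) for r in reference)]

# Toy peak sets with hypothetical coordinates.
without_control = parse_bed("chr19\t100\t200\nchr19\t500\t600\nchr19\t900\t950\n")
with_control = parse_bed("chr19\t150\t250\nchr19\t600\t700\n")

shared = peaks_with_overlap(without_control, with_control)
print(len(shared), "of", len(without_control), "peaks are also called with the control")
```

For larger peak sets you would sort the intervals or use a dedicated interval tool rather than comparing every pair.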