This video is about the process for a ChIP-seq analysis. ChIP-seq can be used for a couple of different things, but what we're going to be using it for is measuring the way in which proteins interact with DNA. So if you remember the central dogma of molecular biology, information flows from DNA to RNA to protein. You can regulate the amount of transcription in many different ways, and one of those is by the binding of proteins to the DNA. This is the case for transcription factors, for example, that might regulate the expression of particular genes. And so you can adapt next-generation sequencing technology to measure how much particular proteins are bound to particular locations on the DNA.

So the first step in this process is to cross-link proteins to DNA, as you'll recall from the Genomic Technologies class, and then fragment the DNA. Then, if you have an antibody for the particular protein that is cross-linked to the DNA, you can do an antibody pulldown. This basically enriches for the particular subset of the DNA that's associated with the proteins you're interested in, whether they are transcription factors or something else. So we pull down just those fragments, and then we sequence them. And then we're going to be looking at how we analyze the data that come from sequencing the fragments from this pulldown experiment.

So the first step in this analysis, just like the first step in most next-generation sequencing experiments, is to align. In this case, unlike with, say, RNA-seq or some other kinds of sequencing, you don't necessarily need to worry about doing anything other than a straightforward alignment to the genome. And so the popular software for doing this includes Bowtie2 and BWA.
These are very fast aligners to the genome.

The next step in the process is to detect peaks. Again, we've enriched particular sequences because they've been pulled down with the proteins, and so we want to be able to identify where there are peaks, that is, where there are big piles of reads corresponding to those sequences. There are a couple of different software packages for doing this; CisGenome, MACS, and PICS are a few of the popular ones. PICS is in Bioconductor, and CisGenome and MACS are both standalone software. They can be used to detect where those peaks are in the sequence.

The next step is to count, that is, to obtain a measurement of the number of reads that cover a particular peak. Now, there is a question as to how quantitative the ChIP-seq technology is in terms of how much binding there is. But it is useful to have a quantification of how many reads fall into each of the different peaks.

The next step is normalization, especially cross-sample normalization, and it's only relatively recently that this has been widely introduced. Until relatively recently, many ChIP-seq experiments didn't have a large number of replicates, but the numbers are definitely increasing over time. And so some of the ideas that have been used in RNA sequencing analysis and other places have been moved over into the ChIP-seq world. It's now common to apply some kind of normalization, whether that's MA normalization, as I've shown in this figure here, or some other kind. The DiffBind package in Bioconductor and the MAnorm package are two approaches that use different types of normalization to make the peak counts comparable to each other.

And then you need to do some sort of statistical test. This is basically to identify whether there are any differences between the cases that you care about.
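The counting and normalization steps described above can be sketched roughly as follows. This is only a minimal illustration, not how MACS, DiffBind, or MAnorm work internally: it assumes a single chromosome, represents each read by its start position, and stands in for MA normalization with a simple median-centering of the log ratios (the real tools fit a more careful model). All function names and the toy numbers are hypothetical.

```python
import math
import statistics
from bisect import bisect_left


def count_reads_in_peaks(read_starts, peaks):
    """Count reads whose start position falls inside each [start, end) peak."""
    positions = sorted(read_starts)
    counts = []
    for start, end in peaks:
        # Reads with start <= position < end lie between the two insertion points.
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return (M, A) lists: M is the log2 ratio, A the mean log2 count, per peak."""
    m_values, a_values = [], []
    for a, b in zip(counts_a, counts_b):
        log_a = math.log2(a + pseudocount)  # pseudocount avoids log2(0)
        log_b = math.log2(b + pseudocount)
        m_values.append(log_a - log_b)
        a_values.append(0.5 * (log_a + log_b))
    return m_values, a_values


def median_center(m_values):
    """Shift the M values so their median is zero (a crude normalization)."""
    shift = statistics.median(m_values)
    return [m - shift for m in m_values]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    peaks = [(100, 200), (500, 600)]
    sample_a = count_reads_in_peaks([90, 100, 150, 199, 200, 550], peaks)
    sample_b = count_reads_in_peaks([110, 120, 130, 140, 150, 160, 170, 510], peaks)
    m, a = ma_values(sample_a, sample_b)
    print(sample_a, sample_b)
    print(median_center(m))
```

The idea behind looking at M against A is that most peaks are assumed not to change between samples, so any systematic shift in M is treated as a technical effect and removed before testing.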
Usually you do this in a comparative ChIP-seq experiment, and when you do that, there are different tests that people use. Some people use a binomial test, or you can use other types of tests. As for the software packages, the CisGenome suite covers the whole process, and so does MACS; these typically focus on the two-sample case. DiffBind is a package that can analyze multi-sample ChIP-seq data, or ChIP-seq data with outcomes that aren't necessarily just two groups.

The next step in the ChIP-seq analysis comes after we've identified the regions in the genome where the particular transcription factor has bound. We might then want to understand what the sequence motifs underlying that particular transcription factor are. There are several sets of software that will allow you to annotate the sequences that have been bound by the transcription factor, and try to identify the transcription factor binding sites, or the sequence motifs that are associated with that particular protein.

And then downstream from that, it's often the case that we want to ask whether there is further annotation we can do with these sequences. For example, are they commonly enriched in particular genomic motifs? People also try to build a network model that relates the transcription factor binding back to gene expression. That's a little bit beyond the scope here, but these software packages will get you started on annotating, and point you in the direction of analyzing these ChIP-seq data after you've done the actual statistical analysis.
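As a rough illustration of the binomial test mentioned for the two-sample case, here is a minimal sketch for a single peak. It assumes that, if binding does not differ, the reads in a peak split between the two samples in proportion to their sequencing depths; that null model, the function names, and the example numbers are assumptions for illustration, not the method any particular package uses.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_two_sided_p(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binomial_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binomial_pmf(i, n, p)
        # Small relative tolerance so ties are not lost to rounding.
        if prob <= observed * (1 + 1e-9):
            total += prob
    return min(1.0, total)


def peak_binomial_test(count_a, count_b, depth_a, depth_b):
    """Test whether a peak's reads split between samples as their depths predict."""
    n = count_a + count_b
    p_null = depth_a / (depth_a + depth_b)  # expected share of reads from sample A
    return binomial_two_sided_p(count_a, n, p_null)


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    print(peak_binomial_test(5, 5, 1000, 1000))
    print(peak_binomial_test(10, 0, 1000, 1000))
```

In practice you would run a test like this for every peak and then correct for multiple testing, and with replicates a count-based model of the kind used for RNA-seq is usually a better fit than a plain binomial.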