Alright, in this week's class we are touching on cis-element mapping and prediction. In this lecture we will go over why we would want to determine cis-elements, and methods for determining cis-elements both in vitro and in silico. We'll also talk a little bit about some cis-element databases, and then tools for predicting cis-elements de novo from, for instance, sets of co-expressed gene promoters.
So, transcriptional regulation in eukaryotes, and also in prokaryotes, is quite complex. A number of proteins need to bind to the promoter of a gene in order for that gene to be transcribed, and hence for that transcript to be translated into protein. The structure of promoters is oftentimes complex and can be quite long, especially in the case of mammals. Here we might have some proximal transcription factor binding sites close to the start of transcription, a cis-regulatory module comprising several cis-elements, and perhaps a distal transcription factor binding site, all of which act together to recruit a coactivator complex, which acts on a transcription initiation complex to bring about transcription. And this is completely ignoring the structure of the chromatin, which also needs to be in a certain configuration for transcription to occur.
So these proteins that are recruited to the promoters have specificities, as determined by the amino acids that make up those proteins. This is just a SeqLogo, we've seen it before, of the CAP-DNA complex. CAP is the Catabolite Activator Protein, and it binds to specific residues in the promoters of E. coli genes, in this case to bring about transcription under certain metabolic conditions. You can see there's quite high specificity, as determined by the bit score of the sequence logo, for some residues here, and a little bit less specificity at these residues.
And again, recall from the motif lab, the first lab of Bioinformatic Methods II, that for DNA sequences the maximum the bit score can be, which reflects the conservation at a given position, is two. The other thing about these cis-elements is that they're actually quite short. That makes things a bit tricky when we're trying to identify these patterns in promoters, say, of co-expressed genes. Here, for instance, is the SeqLogo of the ABA responsive element from Arabidopsis thaliana. ABA is a plant hormone. This is the extent of it, so it's 1, 2, 3, 4, 5, 6, 7, 8 positions, and this last position isn't particularly informative. So just keep that in mind when we're trying to predict these elements. That makes things a little bit trickier.
So, we can actually determine transcription factor binding specificity in vitro or in vivo. Knowing which transcription factor binds to which promoter can help us understand when a given gene is expressed, and why it's expressed in response to different perturbations. Where the transcription factor binds is called the cis-element, and determining that is quite tricky. There's a method called SELEX, which is the Systematic Evolution of Ligands by Exponential Enrichment, which we can carry out, and then we can follow up using EMSA, which is the electrophoretic mobility shift assay. In this case, what we do is take a given transcription factor and mix it with a library of random oligonucleotides of some length. This library would have PCR primer sites on either end of the variable region.
And the oligonucleotides that get bound by the given transcription factor are selectively amplified by a PCR step, and those that the transcription factor binds to are sequenced to identify the specific sequence. To follow that up, to confirm that the sequence is specifically bound by the transcription factor, we can mix an oligonucleotide with that sequence and the TF, and then we would see a shift in its mobility on a gel. The oligonucleotide bound by the transcription factor migrates more slowly than the free oligonucleotide, so it would run higher up on the gel.
We can also do ChIP-chip or ChIP-seq. Here, in an in vivo situation, we can take the transcription factors that are bound to the DNA and cross-link them to it with formaldehyde. Then we isolate the DNA-protein mix from, say, tissue of a mouse or part of a plant, and we shear the DNA into smaller fragments. We pull down the DNA that the transcription factor is bound to using immunoprecipitation. Then the DNA fragments that were bound are released, and they're detected either by hybridization to a microarray, so that's ChIP-chip, or by next-generation sequencing, and that's ChIP-seq. This is labor intensive in the sense that we actually have to create transgenic animals that express the transcription factor with a tag attached to it, so we can pull it down using a generic antibody. Or we would have to develop an antibody that recognizes the specific transcription factor. These are all very labor-intensive things.
The other way of determining transcription factor binding specificity is to overexpress a given transcription factor, say in E. coli, and purify it. Then we can hybridize that transcription factor to something called de Bruijn arrays. These de Bruijn arrays, from Martha Bulyk's lab for instance, contain all possible 10-mers, packed compactly onto a microarray as overlapping sequences. And then you look, using effectively an antibody approach,
at where the transcription factor has bound on that array, and then you can figure out which 10-mer the given transcription factor is binding to. This can be done in a high-throughput manner. We can actually determine, say, the binding specificities of 800 mouse transcription factors in a relatively timely manner.
So, there are also methods for determining transcription factor binding sites and cis-elements in silico. Here the big assumption is that co-expressed genes share common regulatory elements, and the trick is to identify those programmatically. We can also assume that orthologous promoters from closely related species would tend to be conserved, and we could use this sequence information to look for smaller conserved regions within the promoters. But the caveat here is obviously that cis-elements are quite small, and they can also be degenerate between species. The methods that we'll talk about today are word counting methods, as exemplified by Promomer, and Gibbs sampling methods, as exemplified by MotifSampler, and then phylogenetic footprinting and comparative genomic methods.
But before we get to that, let's talk about ways of representing cis-elements. We could represent them as a standard multiple sequence alignment, as over here. We can then convert that multiple sequence alignment into a PSSM. We talked about PSSMs in the first lecture. Here, instead of amino acids, we're just representing nucleotide frequencies at each of the positions of the multiple sequence alignment. In this multiple sequence alignment we have five As at the first position, so in our PSSM we'd have a 5 there. We've got five Cs at the second position, so we'd have a 5 for C in that particular cell of the matrix, and so on. So you can use PSSMs to describe a particular pattern in promoters, and you can also represent that PSSM as a SeqLogo.
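As a rough illustration of the alignment-to-PSSM step just described, here is a minimal Python sketch. The five aligned sequences are made up for illustration (they are not the alignment shown on the slide), and the function names `count_matrix` and `information_content` are hypothetical. The second function computes the per-position bit score that a sequence logo plots, without any small-sample correction, which is a simplifying assumption.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(alignment):
    """Count each nucleotide at each column of an ungapped alignment."""
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for column in zip(*alignment):
        counts = Counter(column)
        matrix.append({base: counts[base] for base in ALPHABET})
    return matrix

def information_content(column_counts):
    """Bits of information at one position: 2 minus the Shannon entropy.

    Assumption: no small-sample correction is applied.
    """
    total = sum(column_counts.values())
    entropy = 0.0
    for n in column_counts.values():
        if n:
            p = n / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy

# Hypothetical alignment of five binding sites (illustrative only).
sites = ["ACGTGG", "ACGTGG", "ACGTGT", "ACGTTG", "ACGTGG"]
pssm = count_matrix(sites)
for position, column in enumerate(pssm, start=1):
    print(position, column, round(information_content(column), 2))
```

A perfectly conserved column, such as the first one here with five As, reaches the maximum of two bits, while the columns with a mismatch score lower.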
I described how to generate a SeqLogo for amino acids in the first lecture, and in the case of this particular alignment here, this is what our SeqLogo would look like. Again, the maximum bit score for perfect conservation is two in the case of nucleotides. We could also represent that as a regular expression, or a pattern that we could use to search a given sequence. If we just read across the top of the SeqLogo we get the consensus sequence, which would be ACGTGG. But we can also allow some wobbles in the fifth and sixth positions by including the wobble nucleotides in square brackets, like that.
So, let's talk about the word count method for determining potential cis-elements in promoters. This is exemplified by the Promomer program. Here what we do is count the occurrences of each word of length k nucleotides in the promoter or promoter set. So we would break the promoters up into, say, six-mer words. We would compare the frequency of each word to its frequency in the whole promoter set, or in random sets that we can generate by randomly picking gene identifiers from the total set possible. Then we would evaluate the significance, either using percentile occurrence in the case of a single promoter, or bootstrapping in the case of a set of promoters, and then the user gets an easy-to-interpret graphic of the results.
Here's an example with a set of ABA responsive promoters. For one of the over-represented words, four-mers in this case, ACGT, if we click on it we see that there is a good over-representation relative to the background set. That's nice to note, because ACGT actually does represent the core of the ABA responsive element. If we look at a permutation of ACGT, CGTA, we don't see a significant difference in the distributions. So this method does seem to work.
The downside to this method is that we can't actually have wobbles. This is a word counting based approach, and that's the way it works. But it can be useful in certain situations.
A more probabilistic way of discovering cis-elements is called Gibbs sampling, and it's used in a lot of different areas, in banking, speech recognition and so on. Basically what it does is search for the statistically most probable motifs in unaligned sequences. So this is the important thing: we don't have to align the sequences. We'll work through an example with 29 sequences. Here, first, we guess the width of the motif, and we can automate this. In our example, we're going to set the width of our motif to be equal to six. We choose a random position for the start of the motif in all but one of the sequences, and we don't use this left-out sequence yet. Then we estimate the nucleotide frequencies in the motif columns of all but the left-out sequence, and we also estimate the background frequencies.
So what does this mean visually? Here we've chosen just a random start position for our motif. You can see there's quite a bit of variation in what this ostensible motif is. We then tabulate the numbers of occurrences of A, C, G, and T at each motif position, and we count the number of occurrences of A, C, G, and T at the non-motif positions, so these positions here and these positions here. And actually, what we do is convert that to a frequency matrix, so these columns would sum to one. Then we leave out the first sequence and do the same process: we count the number of As, Cs, Gs, and Ts at each position to come up with this frequency matrix. Then we move to the left-out sequence, so that's the first sequence. We scan it with our matrix and ask which position in the left-out sequence actually best matches our frequency matrix.
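The leave-one-out procedure can be sketched as follows. This is a minimal sketch, not the code of MotifSampler or MEME. It scores each window of the left-out sequence by the ratio of its probability under the motif frequencies to its probability under the background frequencies, and, following the lecture, takes the highest-scoring position; many Gibbs sampler implementations instead draw the position at random in proportion to the score. The pseudocount of 1 (to avoid zero frequencies), the function names, and the toy sequences are all assumptions for illustration.

```python
import random

ALPHABET = "ACGT"

def frequency_model(seqs, starts, width, pseudocount=1.0):
    """Motif-column and background nucleotide frequencies for the given
    sequences, using the current motif start in each sequence."""
    motif = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    background = {b: pseudocount for b in ALPHABET}
    for seq, start in zip(seqs, starts):
        for i, base in enumerate(seq):
            if start <= i < start + width:
                motif[i - start][base] += 1
            else:
                background[base] += 1
    motif = [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in motif]
    total = sum(background.values())
    background = {b: background[b] / total for b in ALPHABET}
    return motif, background

def odds_score(window, motif, background):
    """P(window | motif) divided by P(window | background)."""
    score = 1.0
    for col, base in zip(motif, window):
        score *= col[base] / background[base]
    return score

def update_left_out(seqs, starts, left_out, width):
    """Best-scoring motif start in the left-out sequence."""
    others = [s for i, s in enumerate(seqs) if i != left_out]
    other_starts = [p for i, p in enumerate(starts) if i != left_out]
    motif, background = frequency_model(others, other_starts, width)
    seq = seqs[left_out]
    scores = [odds_score(seq[i:i + width], motif, background)
              for i in range(len(seq) - width + 1)]
    return max(range(len(scores)), key=scores.__getitem__)

def find_motif(seqs, width, sweeps=100, seed=0):
    """Start from random positions, then repeatedly leave each sequence out."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for left_out in range(len(seqs)):
            starts[left_out] = update_left_out(seqs, starts, left_out, width)
    return starts

# Toy sequences (illustrative only), each containing ACGTGG somewhere.
toy = ["TTTTACGTGGTTTT", "TTACGTGGTTTTTT", "TTTTTTACGTGGTT", "TACGTGGTTTTTTT"]
print(find_motif(toy, width=6))
```

Since the outcome depends on the random starting positions, a run can settle on a poor solution, so in practice one would restart from several different seeds.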
And we do this by calculating an odds ratio score for each position. So here we would break up our sequence into six-mers, in this case, and for each six-mer we calculate this odds ratio. For this second position here, which is CAGAAC, we simply multiply the frequency of C at the first position by the frequency of A at the second position, G at the third position, A at the fourth position, A at the fifth position, and C at the sixth position, to come up with P observed. We divide that by P background, which is computed the same way from the values in this zero column for C-A-G-A-A-C, and that gives us the odds score. Basically, the position with the highest odds score is used to decide the probable location of the motif in the left-out sequence, so we set that to be the location of the motif in the left-out sequence. Then we move on to the next sequence. We leave out the next sequence, we update our frequency weight matrix, and we do this more than 100 times. Eventually, we'll converge to a given motif. This is the sequence logo after just one sequence has been left out and the frequency matrix updated, but after a hundred times we actually do see some kind of motif being discovered in the promoters of those genes. To discover a new motif, you'd actually repeat the process with new, random start positions for the motif. MEME works in a similar kind of way, and MEME is the program that we'll be using today in the lab.
So, there are several transcription factor binding site databases. JASPAR is probably the best one out there right now. It's available at that URL. It's open access, and there are no limitations on its use. It's species agnostic, in that it has transcription factor binding sites for many different species, and it's fairly responsive. AGRIS is quite good for plants. Both JASPAR and AGRIS contain transcription factor binding specificities in the sense that they are PSSM-based.
So it's not just regular expressions. You could also consider using CisBP, which is a widely used transcription factor binding site database at the University of Toronto, from Tim Hughes' lab.
There are several tools that we'll be exploring today for cis-element mapping and prediction: MEME, which I just mentioned, which is this probabilistic kind of discovery tool; Promomer, a word counting tool; TF2Network, which is great for generating gene regulatory networks; and JASPAR, for getting information about transcription factor binding specificity for, in the case of today's lab, Homo sapiens and Arabidopsis. We can also do some simple mapping with the JASPAR interface.
So there are lots of sites and programs available for predicting cis-elements, and a lot of these programs are standalone command line applications that are quite computationally intensive. Basically, if several programs identify a given cis-element, then the likelihood that it's biologically relevant increases. However, it's important to have an appropriate background model here, so randomly generated sets of promoters, in order to be able to assess the statistical significance. In the lab we'll explore the difference between the hypergeometric tests that Athena (and others) use and a more sophisticated distribution test using bootstrap sets of promoters, that is, randomly selected sets of promoters.
Many different algorithms have been developed to predict regulatory elements. It's a stubborn problem, and this graph from Tompa et al. in Nature Biotechnology in 2005 shows how poorly most of these programs actually work. What you see is the nucleotide correlation coefficient. This is a measure of how well a predicted motif matches known sites in a sequence set, with one being a perfect prediction and values near zero being no better than chance. You can actually generate sequence sets with seeded motifs, then run the programs on these data sets and ask how many times we get this motif back from the prediction program.
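The nucleotide correlation coefficient can be computed from the per-nucleotide overlap between known and predicted sites. Below is a minimal sketch that assumes the usual correlation form over true and false positives and negatives. The function name, the boolean-list input format, and the choice to return 0.0 when the coefficient is undefined are assumptions for illustration.

```python
import math

def nucleotide_correlation(known, predicted):
    """Correlation between known and predicted site membership.

    `known` and `predicted` are equal-length lists of booleans, one entry
    per nucleotide, True where the nucleotide lies inside a site.
    """
    if len(known) != len(predicted):
        raise ValueError("inputs must cover the same nucleotides")
    tp = sum(1 for k, p in zip(known, predicted) if k and p)
    tn = sum(1 for k, p in zip(known, predicted) if not k and not p)
    fp = sum(1 for k, p in zip(known, predicted) if not k and p)
    fn = sum(1 for k, p in zip(known, predicted) if k and not p)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as no correlation
    return (tp * tn - fn * fp) / denominator

# Toy example: a 6-nucleotide known site, and a prediction shifted by 3.
known = [False] * 4 + [True] * 6 + [False] * 10
predicted = [False] * 7 + [True] * 6 + [False] * 7
print(round(nucleotide_correlation(known, predicted), 3))
```

A prediction that only partly overlaps the known site gets partial credit, which is why this measure is more forgiving than requiring an exact match.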
And this darker blue column here actually shows that most of the programs are running at the level of 0.1, which is not very good. So it is a tricky problem. But perhaps as more data become available on transcription factor binding specificity, as determined by these de Bruijn arrays, we'll understand better how this regulation works. However, there's a lot of combinatorial stuff going on, as discussed in the introduction, and we're a long way from completely understanding cis-regulation and gene regulatory networks in any organism.
So, to close this course and also Bioinformatic Methods One: we have provided an overview of the commonly used bioinformatic methods, and the value of such methods will only increase as data generation in biology becomes easier and easier. I feel that the key in the future will be integrating and leveraging data from diverse sources in a systems-biological manner to get new biological insight, using some of the tools we examined but also using other tools, like GeneMANIA, that are more integrative in nature.
I'm just going to suggest a few other Coursera courses. These two, Computational Molecular Evolution and Bioinformatics, are probably good next steps, in particular this Bioinformatics one out of Peking University. If you want to learn how to program, there's an Intro to Programming in Python. There's also one for Ruby on Rails to develop web apps, if you're into that. These courses, Epigenetics and Genomic and Precision Medicine, I would consider to be higher-level courses. And the Useful Genetics one might be a good one for some background information on genetics. The book we use for the University of Toronto based version of this course is Zvelebil and Baum from 2008. It's a little older but it's excellent. It's really quite comprehensive.
It's called Understanding Bioinformatics, and I think that would be a good place to read up more on some of the things that we talked about in these courses. I hope you found this course, or these courses, very useful. And I hope that you learned a lot, and we'll perhaps see you in the future. Take care, bye.