[MUSIC] Today we will talk about algorithms for finding regulatory motifs in DNA. And we will start from game when we will be implanting patterns in random strings. Let's generate ten random sequences. And then let's take a 15-nucleotide-long pattern, and insert it at random positions in these ten sequences. Now, turn your head, I will hide these patterns, and tell me where these patterns are. What algorithm should you run to figure out where I have inserted this pattern? And you already probably recognized that if we slightly modified our frequent words problem, we will find the pattern I have inserted. Indeed, you can concatenate all the sequences and find the most frequent word in the resulting concatenate. The implanted pattern will appear ten times in this concatenate. It is surprisingly frequent, which will most most likely mean that it was the pattern that I have inserted. Now let me change the experiment slightly. Now instead of the patterns that I implant without any changes, I will insert the same 15 nucleotide long patter with four random mutations at random positions. And, in this is case, this pattern forms a so-called (k,d) motif which is a k-mer that appears in every sequence at most d mutations. Can you find such a pattern if I hide from you where I inserted it? Do you think that the Frequent Words Problem will be able to help us? Well, that all may be entertaining but what biologists think about this game? I think biologists are falling asleep, because it's absolutely unclear what this problem has to do with real biological problems. And I will try to show you that this problem is actually about a biological problem of finding regulatory elements in DNA. My next question will be, do we have a clock gene? Think about our daily schedule and think how our life depends on day and night. In every cell, we probably have to express different proteins depending on what time of the day is it. But who are these molecular time keepers who tell the cells in our body what is the time right now? How do these molecular time keepers change gene expression to produce proteins needed at night as opposed to proteins needed in the morning? And anoother questions that is relevant to this, can we find genes responsible for sleep disorders When our Circadian Clock has problems. Now, we will focus on plants rather than human because for plants, keeping time is a matter of life and death. Just think about photosynthesis, flowering, or frost resistance. If you don't know what is the time, you will be dead if you are a plant. And, the question that arises is how probably a thousand plus genes in plants follow the circadian clock. Who controls them? Who are the molecular managers who tells these genes to increase or decrease gene expression? Here comes the surprise. Just three genes in plants are molecular managers that are responsible for orchestrating this circadian behavior. And they're called CCA1, LCY, and TOC1. These genes are regulatory proteins, also know as transcription factors. And these genes, to control other genes, (to exert control over circadian clock) they bind to short regions (maybe 10 nucleotide, 15 nucleotides) in the upstream regions of the genes. For example, if one of these regulatory proteins wants to control a particular circadian gene, it probably has a region within thousand nucleotides from the start of the gene where it binds. For example, here's CCA1, one of the three regulators. and to exert control over these genes, it has to bind in the upstream reason of these genes. But how does CCA1 know where to bind? Probably there are some hidden messages that tell CCA1: "Bind here!" Our goal today is to find this particular hidden message (that's shown on the slide) where CCA1 binds to. Of course, we don't know where these hidden messages are actually hiding. In the upstream regions, and today we will learn about algorithms that are aimed at finding such hidden messages. We will start from formulating the problem. So, when we talked about implanting a 15-mer with four random mutations, we figured out that the Frequent Words Algorithm wouldn't be able to find it, because there are no frequent words that appearing without mutations in the resulting strings after implantations. That's why we need to solve a different problem that I call the Implanted Motif Problem. In this problem, you are given a set of stings Dna, and integers k and d. And you need to find all (k,d) motifs in Dna. How should we solve this problem? Should we possibly explore all possible 4 to the power k k-mers and see which of them represent (k,d)-motifs. That will take time. That's why, let's try something else. Let's try to see whether a pairwise comparison (comparing sequences to each other) would help. Think about this. Each sequence is random. The only non-randomness in these sequences is are these implanted motifs, and that's why they probably exhibit larger similarity than other regions of random sequences. Thus, my idea, let's find the most. closed 15-mers, [INAUDIBLE], let's say in the first and second sequence, and it will give us an idea on how the implanted pattern looks like. Would it work? Let's figure out whether the pairwise comparison between strings will help us to find the implanted motifs. Well, unfortunately, it won't work because when we implant a pattern, it has four differences from the original pattern, but we don't have access to the pattern, so there is nothing to compare with. The only thing that we have are implanted instances. But every two implanted instances actually may be four plus four mutations apart. How we can find them? Since pairwise comparison won't work, maybe the only option we have is just to explore all four to the power k possible k-mers. Should we explore all of them? Not necessarily, because if a k-mer is so far away from all k-mers in the strings that we analyze, there is no reason to explore it. It cannot possibly be an implanted pattern. That's why motif enumeration algorithm would look like this. We will start from each k-mer, from a sequence. Let's call it a. For each such k-mer, let generate all possible mutations with up to d nucleotides mutated, and for each such mutated k-mer a', let's check whether it appears as a (k,d)-mer in the string. That will work, (it will be slow but definitely will work if k is small) but the question we should ask, would it solve the real biological problem? Unfortunately it won't, it won't work because our model for implanting patterns is not very biological. It doesn't properly reflect biological reality. I would say a little bit messy biological reality, because when biologist generate, let's say, a set of genes, that are controlled by circadian clock, they cannot guarantee that all these genes have a particular pattern implanted. Some genes don't have any patterns implanted in their upstream region. As a result, we need to find out a way to search for motifs, even if some sequence don't have any motifs implanted. And that makes our life, little bit more difficult and we need to develop scoring for every set of motifs, even in the case when some of the implanted pattern don't look very similar to the canonical motif. And we'll move to the next topic, which is motif finding problem that is more adequate for finding regulatory motifs.