[MUSIC] I am now introducing the median string problem, and to define this problem, we need to define the notion of distance not only between k-mers of the same size, but also the distance between a k-mer and the longer string. So what is this distance between two strings of different lengths. We simply start from a smaller string (a k-mer) and compare it to the first k-mer in the text. Distance in this case is seven, we move further. distance six, continue, continue, continue until the end of the string. And then, return to the place where the distance was minimal. So in this way, we will find a k-mer within a longer string that is most close to our k-mer of interest, and this is the distance between a pattern and a string. What is the distance between a k-mer and a set of strings. In this case, it's simply the sum of distances between a pattern and each string in the set. For example, for a pattern AAA and a set of strings shown on the slide, we can compute distance with every string and simply sum up all the distances. The result is five. And median string for the set of strings Dna is a k-mer pattern minimizing distance between pattern and DNA over all possible k-mers. And our median string problem will be based on the problem we formulated before, which is yet another equivalent motif finding problem. If you remember this problem, then it looks like even more complicated than the original motif finding problem, because in this problem, we need to find a k-mer and a set of motifs satisfying certain properties. But in fact, the only thing we need to find is a k-mer pattern as shown in this median string problem. In this case, we search for a k-mer pattern minimizing distance between this pattern and the set of strings Dna (among all possible k-mers). Now, there is a very simple algorithm for solving this problem. We simply need to try all possible k-mers and see whether the distance between the pattern and DNA is minimal among all possible k-mer. The running time of this algorithm is 4 to the power k times n*t*k, where we analyze t sequences of length n. So what we have just accomplished? We started from motif finding problem, and the brute force algorithm for solving this problem was incredibly slow. We've turned it into yet another equivalent motif finding problem, and through this problem, we were able to switch to the median string problem with this very different run time. In fact, it's still an exponential algorithm but for practical applications (where k is usually less than 15) we still can run this algorithm. So dramatic improvement on the Motif Finding Problem that we started from. So sometimes, change of perspective helps. And to move further, we will notice that although the Median String Problem is much faster than the Motif Finding Problem, it's still slow if we search for long motifs and little bit later we will figure out how to deal with such motifs. Our last topic in this segment is Greedy Motif Search. We'll now talk about a greedy algorithm, for solving the Motif Finding Problem. Given a set of motifs, we have already learned how to construct the consensus string. Now let's construct the count matrix where in every column we simply have counts for all nucleotides. And also. this count matrix can be transformed into the so-called profile matrix where we have frequencies of nucleotides in every column. Now, these frequencies of nucleotides in every column can be viewed as a four-sided, biased dice, representing probabilities of ending up on a given face of a dice. And we can view it as a probability distribution. And if it is a die, let's ask a question: "What is the probability that tossing such a die based on the motifs will generate a given string of DNA?" For example, what will be the probability for generating a string shown at this light? So given the following profile, we ask, what is the probability of generating the consensus string, AAACCT? And this probability will be simply computed by multiplication of corresponding elements in the profile matrix. If you take another string, we will get a different probability. And you have already noticed that the closer the string is to the consensus string of the profile, the higher is the probability of generating this string. And now, we define the notion of Profile-most probable k-mer in a sequence. This is a k-mer with the highest probability among all k-mers in the string. And here's an example of how I can generate Profile-most probable k-mer in a string. We start from the first k-mer in the string, compute probability, and record it in the last column of the matrix. We continue until we fill all the matrix, and finally, select the largest probability, which in this case, corresponds to the red entry in the matrix. And with this at hand, we are now ready for GreedyMotifSearch, which works in a very simple way. We start from the set of i-1 motifs selected from the first i-1 sequences and show how to extend this set of motifs by the motif in the i-th sequence. What we do, we form profile of motif from the first i-1 1 sequences and then select profile most probable k-mer in the sequence number i. Afterwards, we iterate. And that will the result in the greedy algorithm for solving the problem. And we will see later whether it works or doesn't work for our goals.