So in the previous section, Phillip explained why the problem of multiple pattern matching is important. And he further gave an algorithm for solving the multiple pattern matching problem. But let's see, what is the way in which he solved the problem? Here is the genome as the road, and here is the pattern as a student. So basically what Phillip does, is he asked the student to go along the road and find all the matches that a pattern can match. But please remember that we have multiple patterns. So this approach force us to slide the patterns along the genome multiple times. OK? Here we slide. This approach takes a lot of time. The time complexity of this algorithm is O( |genome size| * |Patterns| ). In brute force, the combined length of all the patterns, is huge. And this, algorithm is not that practical. Now, I'm describing another approach, so instead of asking individual patterns to go along the genome, one at a time I pack all of the patterns into a single structure and move this structure along the genome only once. So, it is similar to packing all the students onto a bus and move the bus just one on the road. This data structure has a very important property. It can retrieve the pattern very quickly. Like, if a father here asks if Taca is on this bus? The bus driver can say "I can retrieve Taca" very quickly. So, this data structure is called a Trie, coming from the word retrieval. So the Trie is a data structure for representing a collection of patterns. And the tries support fast pattern matching. Now, given us a set of patterns, the idea is to combine all these patterns into a rooted tree, with branches labeled by letters in the alphabet. And every string in the patterns correspond to a path from the root to a leaf. Let's go through this example with me, to understand how we can construct a trie from a collection of patterns. OK? We start with a root and the first pattern is banana. So from the root the first character of banana is b. There is no b here so we have to create a new branch and label it my b. The next character is a. So we have to create a new branch and label it by a, and so on, so fourth. An a. So for the next pattern, p-a-n. We start from the root, there's no p. So we have to create a new branch p here, OK? And a, n. The next one is "and". We start from the root, there's no "and" going down from the root, so we have to create one, 'a' and the next one is n-d. Next one is n-a-b, "nab". We start from the root. We don't have 'n' so we have to create a new one and a-b. OK? Antenna, we start from the root. But please note that' a' here we have a branch labelled by 'a', so we don't have to create new one. We just follow it down. Go to next 'n', here the next one is 't'. We don't have' t', so we have to create a new one. And e-n-n-a. The next one is bandana. We start from the root. We have b already, so we don't have to create a new one. So, a-n, there's no d going down, so we have to create a new one. And a. Next one is ananas. We, we have a already, so we just follow down. And here 'a', we don't have 'a', we have to create a new 'a'. OK? n-a-s. The next one is nana. We start from the root. "nana": we go through a n-a, there is no 'n' next, so we have to create a new one, here right. OK? So this tree, this rooted tree is called a trie, constructed from a set of patterns. So a trie is a rooted tree with branches labelled by letters in the alphabet. Every pattern is a path from the root to the leaf. For example, banana. It correspond to path from the root to the leaf. But please note that in this example, we have some errors here. If you want to construct a trie for yourself, you need to add a new characters dollar at the end of each pattern.
(Denoting an end of a word in the trie.) We don't have space so allow us to have some mistake here. So, the question now is, how to use the trie for pattern matching? Before, Phillip had to move individual pattern, one at a time, along genome, to find a match. Now, we have already created a special structure called a trie that packes all the patterns into a single data structure. Now the ultimate thing we have to do is to move this data structure along the genome. OK. Here we have the string, that's the genome, panamabananas. The trie, constructed from the set of patterns we want to match. First we put the trie at the first position of the string. P, and we want to ask a question, whether there is any pattern in this connection that can match at this position on the genome. So we just swap down from the root nearest p, so this is a p-match. We walk down a-n. So good. So at this position we can find a match: "pan". We move the trie to the next position, so, and we try to walk down the tree from the root to the leaf of the tree, to find if there is any match. So, we could find 'a' here, next one is 'n'. Next one is 'a' here, yes we can find. Next one is 'm'. No. In the trie, we don't have anything. Going out from this we don't have label 'm' so at this position we cannot find any match. We move the trie, one more time. An n here a-n we cannot find n, so that is no match at this position. We move to the next. We start with a-n, here we got stuck we cannot match so we have to move. We can say that at this position there is no path that matches. This position immediately there is no match. Next position 'a' here, there is no match here, we move to the next b-a-n-a-n-a-s. OK, so we can find a match here. So we move it to the next one, a-n-a-n-a-s. Exactly. So we can find a match at this position. So next one is n-a-n-a. We can find a match. So we move next, a-n-a-s. There is no 's', so walking down. So at this position we cannot find any path that matches. Similarly, we move the trie, here we cannot find any match, here we cannot find match, here, immediately we cannot match. OK? So, to recap. In this section I have described a data structure called a trie. Which is a representation of a collection of strings. This data structure allows us to retrieve a pattern from the collection very quickly. And the run time of the trie-matching, now we have reduced it to the genome multiplied by the longest pattern. Where in the previous algorithm that Phillip described, it takes genome multiplied by the total length of all the patterns.