In this session, I will introduce the matching functionality in Biostrings for finding subsequences in other sequences. This is used a lot nowadays; it is exactly what short read aligners such as Bowtie or MAQ do. If you really have millions of short reads, I would still recommend using a dedicated short read aligner such as Bowtie, MAQ, or one of the many other choices. But sometimes it can be incredibly convenient to be able to match a smaller set of sequences to the genome.

So, let's start off by getting a genome and a string. I'm going to load the yeast genome, version two, and I'm going to have a small DNA string here, dnaseq, which we print on the screen.

There are a number of functions for matching strings in Biostrings. There's matching a string to a string, matching a set of strings to one string, matching one string to a set of strings, and matching a set of strings to a set of strings. That sounds very fancy; the distinction has to do with computational efficiency.

Matching a single string to a single string is something we do with matchPattern. We take our sequence, and we now have to pick a specific yeast chromosome, because we are mapping a single string against a single string. We get a single hit on this particular DNA string. The return object is something called a Views, which looks very much like a DNAStringSet. We will discuss Views in a different session. Whenever we have a match function, there's also a count function (here countPattern) that just counts how many matches we have.

Now, most of the time, you're not interested in matching against a single chromosome; you're really interested in matching against a set of chromosomes. For that we have vmatchPattern, which matches one sequence against a set of sequences. Now the return object is a GRanges, and there are a couple of interesting things here. First of all, there are hits on both the forward and the reverse strand.
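To make the difference between matching against one sequence and matching against a set of sequences concrete, here is a minimal sketch in plain Python. It is not how Biostrings implements these functions; the helper names `match_pattern`, `count_pattern`, and `vmatch_pattern` are hypothetical stand-ins that only mimic the behavior described above, and the toy sequences are made up rather than taken from the yeast genome.

```python
# Illustrative only: exact matching, no mismatches or indels.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def match_pattern(pattern, subject):
    """Return 1-based (start, end) pairs for every exact occurrence of
    pattern in subject, on the forward strand only. Overlaps are allowed."""
    hits = []
    pos = subject.find(pattern)
    while pos != -1:
        hits.append((pos + 1, pos + len(pattern)))
        pos = subject.find(pattern, pos + 1)
    return hits


def count_pattern(pattern, subject):
    """Count the matches instead of returning their positions."""
    return len(match_pattern(pattern, subject))


def vmatch_pattern(pattern, subjects):
    """Match one pattern against a dict of named sequences on both strands.
    A reverse-strand hit is found by searching for the reverse complement;
    coordinates are always reported on the forward strand."""
    rc = reverse_complement(pattern)
    hits = []
    for name, seq in subjects.items():
        hits += [(name, s, e, "+") for s, e in match_pattern(pattern, seq)]
        hits += [(name, s, e, "-") for s, e in match_pattern(rc, seq)]
    return hits


# Toy "genome" with two made-up chromosomes.
toy_genome = {"chrA": "TTGACCATGACC", "chrB": "AAGGTCAA"}
pattern = "GACC"

print(match_pattern(pattern, toy_genome["chrA"]))
print(count_pattern(pattern, toy_genome["chrA"]))
print(vmatch_pattern(pattern, toy_genome))
```

The single-sequence search returns positions only, while the set version also has to report which sequence and which strand each hit is on.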
We didn't get strand information with matchPattern, because matchPattern didn't search both the forward and the reverse strand of the object. Here we do. Secondly, we see in the output that for every hit on the forward strand we have a hit on the reverse strand, and that seems a little weird. It turns out that this is because we have a weird sequence: the DNA sequence we are searching for is its own reverse complement. We can check that by asking whether dnaseq is equal to reverseComplement of itself. We get TRUE, and that's why whenever we have a hit on the forward strand, we have a hit in the same location on the reverse strand. This is not an error.

Finally, there's a set of functions called matchPDict; the PDict is for dictionary. It takes a set of sequences, such as short reads of the same length, builds a dictionary on them, and then matches them against the full genome. matchPDict, matchPattern, and vmatchPattern all allow a small number of mismatches and allow indels, and they are a very fast and efficient way of searching the genome for a small set of sequences.

There's another set of matching procedures in Bioconductor that are a little bit more esoteric but can be very useful when you need them. There is something called matchPWM, for position weight matrix. For people in the know, a position weight matrix is also known as a sequence logo or a transcription factor binding motif. It is a probabilistic representation of a short sequence. matchPWM allows us to search the genome for, for example, binding sites for a given transcription factor.

Then we have a function called pairwiseAlignment. pairwiseAlignment implements the classic pairwise alignment algorithms known in computational biology: either a global or a local pairwise alignment, that is, a Needleman-Wunsch or Smith-Waterman type algorithm. And it actually allows you to map millions of reads against a short sequence such as a gene.
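As a rough illustration of what a local alignment of this kind computes, here is a minimal Smith-Waterman scoring sketch in plain Python. It is not the pairwiseAlignment implementation; the function name `local_alignment_score`, the scoring values (match 1, mismatch -1, linear gap penalty -2), and the toy gene and reads are all assumptions chosen for illustration.

```python
def local_alignment_score(read, reference, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of read against reference,
    using a linear gap penalty. Only the score is returned, so two rows of
    the dynamic programming table are enough."""
    prev = [0] * (len(reference) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            same = read[i - 1] == reference[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# A made-up short "gene" and three made-up reads: an exact substring,
# the same read with one substitution, and an unrelated read.
gene = "ATGGCCATTGTAATGGGCCGCTGA"
reads = ["GCCATTGTA", "GCCATAGTA", "TTTTTTTT"]
scores = [local_alignment_score(r, gene) for r in reads]
print(scores)
```

The work per read is proportional to the read length times the reference length, which stays small when the reference is a single gene.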
It turns out that these local and global alignment routines, which use dynamic programming, are impossible to use when you map against the entire genome. But they are still very useful, even if you have millions of reads, as long as you align them to a very small section of the genome, for example a gene. This is underutilized, I would say; I could see a lot of interesting applications of this idea.

Finally, there is a function called trimLRPatterns for trimming off specific patterns on the left and the right of a DNAStringSet. The use case here is trimming off sequencing adapters, but trimLRPatterns has a very rich set of functionality, allowing indels and mismatches in the sequencing adapters.
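The idea behind adapter trimming can be sketched in a few lines of plain Python. This is not trimLRPatterns itself: the helper names, the `min_overlap` parameter, and the adapter sequence are hypothetical, and the sketch handles mismatches only, not indels.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_right_adapter(read, adapter, max_mismatch=0, min_overlap=3):
    """Remove an adapter, or a leading part of it, from the right end of
    read. The leftmost position where the rest of the read agrees with the
    start of the adapter (up to max_mismatch mismatches) is treated as the
    adapter start, and everything from there on is dropped."""
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:start + len(adapter)]
        if hamming(tail, adapter[:len(tail)]) <= max_mismatch:
            return read[:start]
    return read


def trim_left_adapter(read, adapter, max_mismatch=0, min_overlap=3):
    """Mirror image of trim_right_adapter: reverse both strings, trim on the
    right, and reverse the result back."""
    trimmed = trim_right_adapter(read[::-1], adapter[::-1],
                                 max_mismatch, min_overlap)
    return trimmed[::-1]


adapter = "AGATCGGAAG"  # made-up adapter sequence for illustration

print(trim_right_adapter("ACGTTGCAAGATCGGAAG", adapter))   # full adapter
print(trim_right_adapter("ACGTTGCAAGATC", adapter))        # partial adapter
print(trim_right_adapter("ACGTTGCAAGTTCGGAAG", adapter, max_mismatch=1))
print(trim_left_adapter("GGAAGACGTTGCA", adapter))         # adapter on left
```

Allowing a partial adapter at the end of the read matters because a read often runs out partway through the adapter.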