In this section, we will be talking about a special type of genomic feature: alignments. Think about a particular application. This could be a clinical investigator who wishes to sequence samples from a number of patients and controls in order to identify and analyze DNA or RNA variations, or it could be the sequencing of a new species, a plant species for instance. In either case, this will start by isolating the DNA or the RNA and sequencing it. The next step is to place each read into its genomic context, in other words to map it to the genome, and we call this an alignment. Intuitively, if a particular read came from some place on the genome, then there should be somewhere on the genome a sequence that strongly resembles it. I'm saying resembles it, and not that it is identical, because there can be a number of differences. These can be polymorphisms, or they can be errors introduced by sequencing. And our task is to find an alignment. An alignment is essentially a mapping between the letters of the read and the letters of the genomic sequence at the location of origin, where we match each base that is identical between the two and use spacers to fill in the gaps, or the portions that do not match. So I'm showing you one example here at the bottom with two sequences, sequence one and sequence two. Just for reference, let's say that sequence one is the reference, generally the genome, and sequence two represents the read. We talked about bases that match, and those are shown here with vertical bars. But there are also a number of positions that do not match, and they need to be accounted for. We have three types of so-called edit operations that we can use at these positions. The simplest one perhaps is the substitution: with a substitution, we have one letter in the genome and another letter in the read.
We have three substitutions here marked in red, and in the third one, for instance, we have a G in the genome and an A in the read. This might be a polymorphism or it could simply be a sequencing error, as I said earlier. Now there might be cases in which a letter is present in the genome but is missing from the read. We say that's a deletion, and that's shown here in blue. And there may be another case in which there's a letter in the read for which we have no corresponding letter in the genome. That's shown here in green, and we're going to call it an insertion. So it's an insertion with respect to the genome: all this notation treats the genome as the reference sequence. So let's think a little bit about alignments as segments and as genomic features. If we start with the DNA shown here on the left-hand side, we produce a number of reads, represented in red. In the vast majority of cases, each read can be mapped contiguously along the genomic region of its origin. So we have three reads and their corresponding locations along the genome. In the case of an mRNA, however, the mRNA splices together information that is located at different intervals within the genome, the exons. The exons are separated by introns. So if a read falls entirely within an exon, and that's what you're seeing with the red reads here, then we will have a contiguous alignment. The alignment will be in one piece. However, if our read, in this case shown in green, spans the boundary between two exons, then it has to be split along the genome, and the points of division are precisely the boundaries of the exons, that is, the start and end of the intron. So that's what you're seeing here in green, and we call such an alignment a spliced alignment. So those are the two types of alignment structure that we can find when we're sequencing DNA and mRNA, respectively.
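To make the three edit operations concrete, here is a minimal Python sketch that walks a gapped alignment column by column and labels each column as a match, substitution, deletion, or insertion relative to the genome. The function name `classify_columns`, the use of `-` as the spacer character, and the two example rows are assumptions made for illustration; they are not the sequences from the slide.

```python
def classify_columns(ref_row, read_row, gap="-"):
    """Count matches and edit operations in a gapped pairwise alignment.

    ref_row is the reference (genome) row and read_row is the read row.
    Both rows must have the same length; `gap` marks a spacer.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    for ref_base, read_base in zip(ref_row, read_row):
        if ref_base == gap and read_base == gap:
            raise ValueError("a column cannot be a spacer in both rows")
        if read_base == gap:
            counts["deletion"] += 1      # letter in the genome, missing from the read
        elif ref_base == gap:
            counts["insertion"] += 1     # letter in the read, nothing in the genome
        elif ref_base == read_base:
            counts["match"] += 1         # would be drawn with a vertical bar
        else:
            counts["substitution"] += 1  # different letters in genome and read
    return counts


# Hypothetical example rows (not taken from the lecture slide).
ref_row = "ACGTG-CA"
read_row = "ACATGTC-"
counts = classify_columns(ref_row, read_row)

# Substitutions, deletions, and insertions together give the number of edits.
edit_distance = counts["substitution"] + counts["deletion"] + counts["insertion"]
print(counts, edit_distance)
```

In this example the third column is a substitution (G in the genome, A in the read), the sixth is an insertion, and the last is a deletion, so the alignment uses three edit operations.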
A couple more concepts related to alignments, especially as they relate to alignments of next generation sequences. You may recall that the DNA or the RNA is sheared into fragments, and the fragment sizes follow a fairly well characterized distribution, a normal distribution with a given mean and a given standard deviation. With paired reads, the two reads come from the two ends of the same fragment. So we'd expect that wherever we find these two reads along the genome, they will follow the same rules with respect to the distance between them. If the reads are mapped on the genome in opposite directions, facing each other, and the distance between them is within the bounds specified for the original set of fragments, then we call that particular mapping properly paired. This is the case in the cartoon on the left-hand side. So we're saying that these reads are properly paired, or that they are concordant. Now of course, if any of those conditions is violated, we're going to say that the mapping is non-concordant. So what does it mean for a condition to be violated? For instance, the reads may not map in the expected orientations. They might both be facing outward, or they might both be facing in the same direction. So that's one case. Another case is when the mapping does not meet the requirements for the distance between the reads. In the example that I'm showing in column number three at the top, the reads are mapped in the proper orientation; however, they are too far apart to be concordant. One variation here is that the reads might actually map to different chromosomes. So we have properly paired, or concordant, mappings, and we can have non-concordant mappings as well. And we shall see how these are used in the following sections. Now as I said, these are very special types of genomic features, and they have their own way of being represented.
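The concordance rules above can be sketched as a small Python check. This is only an illustration under stated assumptions: each mate is given as a hypothetical `(chromosome, start, end, strand)` tuple, the "distance" is taken to be the outer span from the start of the forward mate to the end of the reverse mate, and the allowed range is the mean fragment size plus or minus three standard deviations. Real aligners define and tune these criteria in their own ways.

```python
def is_properly_paired(mate1, mate2, min_span, max_span):
    """Return True if two mapped mates look concordant.

    Each mate is (chrom, start, end, strand) with 1-based inclusive
    coordinates and strand '+' or '-'. min_span and max_span bound the
    outer distance covered by the pair (an assumed definition).
    """
    chrom1, _, _, strand1 = mate1
    chrom2, _, _, strand2 = mate2
    if chrom1 != chrom2:
        return False                      # mates on different chromosomes
    if strand1 == strand2:
        return False                      # both facing the same direction
    forward, reverse = (mate1, mate2) if strand1 == "+" else (mate2, mate1)
    if forward[1] > reverse[1]:
        return False                      # mates facing outward
    span = reverse[2] - forward[1] + 1    # outer span of the pair
    return min_span <= span <= max_span


# Hypothetical fragment size distribution: mean 200, standard deviation 20.
mean_size, sd_size = 200, 20
min_span, max_span = mean_size - 3 * sd_size, mean_size + 3 * sd_size

concordant = is_properly_paired(("chr1", 10021, 10070, "+"),
                                ("chr1", 10151, 10200, "-"), min_span, max_span)
same_direction = is_properly_paired(("chr1", 10021, 10070, "+"),
                                    ("chr1", 10151, 10200, "+"), min_span, max_span)
facing_outward = is_properly_paired(("chr1", 10151, 10200, "+"),
                                    ("chr1", 10021, 10070, "-"), min_span, max_span)
too_far_apart = is_properly_paired(("chr1", 10021, 10070, "+"),
                                   ("chr1", 20151, 20200, "-"), min_span, max_span)
other_chromosome = is_properly_paired(("chr1", 10021, 10070, "+"),
                                      ("chr2", 10151, 10200, "-"), min_span, max_span)
print(concordant, same_direction, facing_outward, too_far_apart, other_chromosome)
```

Only the first pair satisfies all three conditions at once: same chromosome, mates facing each other, and a span within the expected range.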
The standard format for representing alignments of next generation sequences is SAM. By compression we can also obtain a binary format called BAM. But over the next few slides, I will be talking about the SAM format and what the information in it represents. In a SAM file, the first portion of the file, the first few lines, represents the header. Every line in the header starts with an @ sign, followed by a very short code that tells us what kind of line it is. The first one is an HD line, the header line. It tells us, among other things, whether the file is sorted or not. Here we have a file that is sorted by coordinate, for instance. The following few lines, marked with an SQ code, are lines that correspond to sequences, in particular to genomic sequences. You'll see the identifier for each one of them, in the order in which they are listed in the genome file. So we're starting with chromosome one, followed by chromosome ten, eleven, and so on. Here they are listed alphabetically, but other orders can be used. Following the name we have an indicator of the length, LN: 248 megabases for chromosome one, and so on. Following the sequence lines we have a PG line that tells us the program, or the way this file was generated. In this case it was generated by a program called TopHat. And just to look ahead to some of the future lectures, TopHat is a spliced alignment program. We talked about spliced alignment on the previous slide. Following the program name, we have the version as well as some information about the command line parameters that were used to produce this file. So that's the header. Following the header, we have the alignments, and there is one line for each alignment. Everything that is shown here at the bottom is actually just one line. So there's a lot of information; let's go through it. By the way, all fields within a line are separated by tabs.
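As an illustration of the header layout just described, the following Python sketch groups header lines by their two-letter code and splits each one into its `KEY:value` pairs. The function name `parse_sam_header` and the example header are hypothetical: the sequence lengths, the version numbers, and the `CL` placeholder are made-up stand-ins, not the values from the file on the slide.

```python
def parse_sam_header(lines):
    """Collect SAM header lines into a dict keyed by record type (HD, SQ, PG, ...)."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break                                  # first alignment line ends the header
        fields = line.rstrip("\n").split("\t")     # header fields are tab-separated
        record_type = fields[0][1:]                # drop the leading '@'
        if record_type == "CO":
            entry = {"comment": "\t".join(fields[1:])}   # free-text comment line
        else:
            entry = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(entry)
    return header


# Hypothetical header, loosely modeled on the one described in the lecture.
example_header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:248956422",
    "@SQ\tSN:10\tLN:133797422",
    "@SQ\tSN:11\tLN:135086622",
    "@PG\tID:TopHat\tVN:0.0.0\tCL:<command line>",
]
header = parse_sam_header(example_header)

sort_order = header["HD"][0]["SO"]
sequence_names = [sq["SN"] for sq in header["SQ"]]
program_id = header["PG"][0]["ID"]
print(sort_order, sequence_names, program_id)
```

Splitting each field on the first colon only keeps values that themselves contain colons, such as a command line, in one piece.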
So the first one is the read ID, which is taken from the FASTQ header. You might recall that from the FASTQ representation. The second one is a FLAG, and it's a complex entity which I'm going to explain on the next slide. The third one, just like in other genomic feature representations, gives us the scaffold or chromosome ID, that is, the underlying genomic axis. It is followed by the start of the alignment, in this case position 10,021. The following field gives us the mapping quality, and the next one, shown here as 50M, tells us something about the alignment. Basically it's a compact, compressed representation of the alignment, and it is called a CIGAR string. We'll be talking about this in a couple of slides. The following field tells us where the mate of this particular read is located, that is, on what chromosome. If the chromosome is the same as the current chromosome, we're going to see an equal sign. If it's a different chromosome, we're going to see the name of that chromosome, and if the mate is not mapped, we're going to see a star sign. So that can give us some information, in a very restricted sense, as to whether this pair has a chance of being concordant. The next column gives the position where the mate starts, in this case position 10,151, as I said on the same chromosome. It is followed by the mate distance measured along the genomic axis. The next field is a string field and gives the actual sequence of the query, the read. It is immediately followed by the query base qualities, as I've shown you for the FASTQ format. And lastly, we have a number of optional fields, all separated by tabs, that give us some further information about the alignment. There is a variety of such fields, or tags; some of them are standard and some of them are user specified. All the tags that start with an X are user specified, or rather they are program specific.
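The field-by-field walk through an alignment line can be sketched in Python as follows. The lower-case field names follow the column names used in the SAM specification; the class `SamRecord`, the helper functions, and the example line are assumptions for illustration. In the example line only the start position (10021), the CIGAR string (50M), the equal sign, and the mate position (10151) come from the lecture; the read ID, flag, chromosome name, mapping quality, mate distance, sequence, and qualities are made up.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read ID, taken from the FASTQ header
    flag: int    # bitwise FLAG
    rname: str   # scaffold or chromosome ID
    pos: int     # 1-based start of the alignment
    mapq: int    # mapping quality
    cigar: str   # compact representation of the alignment
    rnext: str   # chromosome of the mate: '=', a name, or '*'
    pnext: int   # start position of the mate
    tlen: int    # mate distance along the genomic axis
    seq: str     # read sequence
    qual: str    # base qualities
    tags: list   # optional TAG:TYPE:VALUE fields, left unparsed here


def parse_alignment_line(line):
    """Split one tab-separated SAM alignment line into its fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def mate_chromosome(record):
    """Resolve the mate's chromosome: '=' means same as the read, '*' means unmapped."""
    if record.rnext == "=":
        return record.rname
    if record.rnext == "*":
        return None
    return record.rnext


# Hypothetical alignment line; see the paragraph above for which values are made up.
example_line = "\t".join([
    "read_0001", "99", "1", "10021", "1", "50M", "=", "10151", "180",
    "ACGT" * 12 + "AC", "I" * 50, "NH:i:10",
])
record = parse_alignment_line(example_line)
print(record.rname, record.pos, record.cigar, mate_chromosome(record))
```

Keeping the optional fields as an unparsed list separates the eleven fixed columns from the variable-length tail of tags.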
A tag, as you can see here, has three components, separated by colons. The first component is a two-letter code. It is followed by a letter that tells us what kind of field it is. For instance, A marks a character, i marks an integer field, f a float, Z a string, and H a hex string. And then lastly, we have the value for that particular attribute. So we have a tag code, followed by the type, followed by the attribute value. You're going to see only a few of them represented here. AS here is the alignment score, and it's an integer value. NM is again an integer value, and it is used to denote the edit distance to the reference. In other words, it tells us how many of those edit operations, how many substitutions, insertions, and deletions, we needed to include to map the query onto the corresponding genomic sequence. Then we have NH, which gives us the number of hits. You can see that this read has ten hits. We have the predicted strand in the field XS, and you might notice that it's program specific. And then we have HI, the hit index for this alignment. If we have multiple matches, and each one of them has a different quality or a different alignment score, this tells us the position of this alignment within that particular list. So that's the SAM format in a nutshell, but I would like to spend a bit more time on the FLAG and CIGAR strings, because they are complex fields.
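The three-part structure of a tag can be illustrated with a short Python sketch. The function name `parse_tag` and the example values are assumptions: only the ten hits in NH come from the lecture, and the other values are made up. Array-valued tags (type B in the SAM specification) are deliberately not handled here.

```python
def parse_tag(text):
    """Split one optional field into (code, type, value), converting numeric types."""
    code, type_code, value = text.split(":", 2)   # the value itself may contain ':'
    if type_code == "i":
        value = int(value)                        # integer field
    elif type_code == "f":
        value = float(value)                      # float field
    elif type_code not in ("A", "Z", "H"):        # character, string, hex string
        raise ValueError(f"unsupported tag type: {type_code}")
    return code, type_code, value


# Hypothetical tags; only NH:i:10 reflects a value mentioned in the lecture.
example_tags = ["AS:i:-3", "NM:i:0", "NH:i:10", "XS:A:+", "HI:i:0"]
parsed = {}
for tag_text in example_tags:
    code, type_code, value = parse_tag(tag_text)
    parsed[code] = (type_code, value)

# Following the rule of thumb from the lecture: codes starting with X are program specific.
program_specific = [code for code in parsed if code.startswith("X")]
print(parsed, program_specific)
```

Limiting the split to two colons matters for string values, which may themselves contain colons.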