Hello everyone, this is the supplementary video for this week. I will explain and illustrate the basic ideas and usage of three NGS analysis softwares: BWA, samtools, and GATK. I will first introduce the sequence compression and alignment algorithm employed by BWA, the read mapping software. Then I will introduce the common softwares for variant calling and genotyping. Last but not least, I will illustrate how to use BWA, samtools, and GATK from the command line. The BWT algorithm is the compression algorithm used by BWA. Its compression is lossless. By sorting and transforming the character matrix rotated from the initial string, this algorithm could achieve a much higher compression ratio. It will search for identical(matched) regions using reverse-char method. It cannot handle gaps. Specifically, here is a demonstration of BWT algorithm. Move up elements of the input sequence, i.e. the first column sequence by one position, and put the original first element at the end of the string. We will then get the second column sequence. Repeating this procedure again and again, we will get a 7-by-7 matrix displayed at the left of the slide. Then we treat the rows of this matrix as strings and sort them alphabetically to transform the matrix from the left one to the right one. Then we take the last column of this matrix to get the sequence below. Now we can see that the first row in the left matrix is the second row in the right matrix. So using BWT we get an I=2 and a transformed sequence for this sequence. Next we need to reverse this sequence back to the original sequence. As described earlier, the left matrix is constructed by repeating moving up elements in the current column by one position and appending the original first element at the end of the column. This means that the transformed sequence and the original sequence are made up of the same characters And the first column in the right matrix is generated from sorting the rows of the left matrix. So we can recover the first column by simply sorting the last column of the right matrix. We will call the first column Column F, and the last column Column L. Why do we need to do this? The reason is that the transformed (right) matrix has two useful properties. The first property is that, for each character X of the Column L sequence, the character that succeeds X in the input sequence is just the character in the Column F that is in the same row as X is. The second property is that, for each character that occurs more than once in the input string, if we label them 1,2,3,⦠by the order they occur in the input string, we will find that these labels have the same order in Column L and Column F. The first property is relatively easier to understand. For the second property, please note that, for example, the first G in Column F is also the first G in Column L. In other words, the G at Row 5, Column L is the same G as that at Row 2, Column L. As we know that the input sequence is the second row of the transformed (right) matrix, the G at Row 2 in Column L will thus be the last character of the input sequence. Now we can recover the input sequence using the two properties just mentioned. The order by which we recover the input sequence is displayed on the right. For example, the input sequence is the second row of the transformed matrix. Since the last character of the input sequence is G, which is also the first G in Column L, the position of this particular G in Column F will be Row 5. The preceding character will thus be the character at Row 5 in Column L, which is the second G. Repeating this procedure again and again, we can recover the input sequence by the order displayed on the right. However this order is reversed. We have just introduced the BWT algorithm. So how could we apply this algorithm to sequence alignment? Assume we have another input sequence TACG. Now we need to find subsequences that are identical to this input sequence in the original sequence. The last character of this input sequence is G. We can see that there are two Gs in Column L, so we will start searching from these two Gs. First, the G at Row 2. After two steps of searching we will find that it is preceded in the original sequence by G, not C. So we stop here. Now we try the G at Row 5. Following the characters as described above results in a subsequence that is identical to this input sequence. So we find it. This is the basic idea of BWA algorithm. If we get the Column F and Column L of the reference sequence to search, and build indices, we can map the reads very fast. Letâs try again with another input sequence GGAC. If youâre good at computing, you will still find a subsequence that matches. This is because this method of recovering the sequence cannot tell whether the current character is at the start (or the end) of the original sequence. Therefore it cannot tell whether the match crosses the start and end of the original sequence. Nevertheless this can be easily resolved by introducing a way of discriminating the âstartâ and âendâ when building indices. As shown above, we add a $ at the end of the original sequence. Letâs see what we will get. The Column F and Column L become this. Searching for GGAC again will abort at the third step, leaving the problem resolved. Thatâs all for the BWA/BWT algorithm. Next we will introduce some common softwares for variant calling and genotyping. samtools is one of the early common NGS tools for processing SAM/BAM files. The SAM/BAM files are the result of mapping reads to the reference genome. This tool has also facilitated the development of variant calling and genotyping algorithms based on BAM files. The program mpileup and bcftools can accomplish these tasks. Most genotyping algorithms are based on Bayesian statistics. The diploid genome can have three different genotypes, two different homozygous ones and one heterozygous one. Considering the base distribution and base quality profile at the locus deduced from the reads, we can infer the genotype by choosing the one that is most likely to exist (i.e. the one whose posterior probability is the largest among the three genotypes). GATK is a recently developed tools from the 1000Genomes Project. It is often used to call variants from the genome sequencing data of one or more healthy people. Currently the common variant callers listed on its website are the UnifiedGenotyper and the HaplotypeCaller. We will illustrate the usage of UnifiedGenotyper only, as it is more commonly used. This is the analysis pipeline for DNA sequencing data recommended by the GATK website. First, the reads are mapped to the reference genome by BWA or similar tools. Then we can use the âindel realignmentâ model to re-map the reads around indels again and more carefully (as the mapping around indels is often poorly performed and raise false positive variants easily). We can also use the âbase recalibrationâ model to estimate and possible correct the base quality profile given by the sequencing machine. After these preprocessing steps, we can use the âUnifiedGenotyperâ model (or others) to call the variants and the genotypes. After filtering the variants with respect to their qualities, we can start the downstream analyses such as the variant annotation which might be introduced next week. Thatâs the slides for this supplementary video. Now letâs see how to use them from the command-line. Hello everyone, Iâm Yue Huang from Center for Bioinformatics, Life Science College, Peking University. Next I will introduce and illustrate how to discover the mutation loci of interest from the raw NGS data. Generally, the raw NGS sequencing data is stored as fastq files as shown below. Here we will use the BWA software first to align these fastq files to the genome. Due to the time limit, we will only use the human mitochondria genome as the genome to map reads to. The first step of BWA analysis is to build indices for the mitochondria genome. After that we will map the paired-end reads. After mapping we can see that the output is saved in â.saiâ format. We cannot open this format yet; we need another step of BWA processing. As we can see, we add the parameter â-râ and the following quoted parameters in this step. These parameters will be added to the header part of the final BAM file. They are needed for many downstream analyses. Letâs have a look at the alignment file in BAM format. The â.bamâ is a binary file. We need to use âsamtools viewâ to view it. In the BAM format, each line represents a alignment of a read to the genome, including the name of the read, the name of the chromosome this read has been aligned to, the coordinates of the alignment, and some other information. However such information cannot be used straightforward to call variants. Thus we need some other processing here. Before processing, we need to sort these BAM files first. After sorting, we need to build indices for them. Now we will use the âindel realignmentâ tool provided by GATK to tune the primary alignment result, because the primary alignment result might have some error around some indels. This tuning by GATK will be done in two steps. As you can see, we first use the RealignerTargetCreator tool to find the loci we need to realign. Then we will use the IndelRealigner tool, the real âindel realignmentâ tool, to process the BAM files. After the indel realignment, we will use another toolkit provided by GATK, Base[Quality]Recalibration, to tune the base quality of the alignment result. This tuning is also done in two steps. In the first step, we need to provide a training data set of known variant loci. Here we use the two VCF files provided by dbSNP. In the second step, we will run the TableRecalibration tool to generate the final.bam file we need. After these GATK tuning procedures, we can start calling variants from our alignment result now. I will illustrate the usages of two tools here. The first one is the mpileup provided by samtools, and the second one is the UnifiedGenotyper provided by GATK. Letâs see how to use samtools. First, we need to build for the fasta file another type of index needed by samtools. Then letâs run the samtools mpileup command. In this command, we provide the original genome sequence, and the preprocessed alignment result in â.BAMâ format. Letâs have a look at the â.VCFâ file we have generated. As you can see, the first several lines are comments. Here are the SNP loci we have found. As you can see, they reside on these positions of the mitochondria genome: 73, 150, 199, 263, etc. The succeeding information describes the details of the SNP, which are explained in detail in the VCF format specification. Aside from samtools, we can also use the UnifiedGenotyper provided by GATK to call variants. Here we also need to provide the genome sequence and the preprocessed alignment result, and the output is still in VCF format. As you can see, this VCF file is nearly the same as the previous VCF file generated by samtools, with some differences in the header. We can see that the variants called differ between these two tools. How the differences have emerged depends on how you use these two tools. You need to find in practice which tool suits you better. Thatâs all. Thank you everyone. Bye.