Hello everyone, I’m Yue Huang from the Center for Bioinformatics, Life Science College, Peking University. Next I will show how to discover the mutation loci of interest from raw NGS data. Generally, raw NGS data is stored as fastq files, as shown below. Here we will first use the BWA software to align these fastq files to the genome. Because time is limited, we will map reads only to the human mitochondrial genome.
The first step of the BWA analysis is to build indices for the mitochondrial genome. After that we will map the paired-end reads. After mapping, we can see that the output is saved in “.sai” format. We cannot open this format yet; we need another step of BWA processing. As we can see, in this step we add the parameter “-r” and the quoted read group information that follows it. This information will be added to the header of the final BAM file, and it is needed for many downstream analyses.
Let’s have a look at the alignment file in BAM format. A “.bam” file is binary, so we need to use “samtools view” to read it. In the BAM format, each line represents an alignment of a read to the genome, including the name of the read, the name of the chromosome the read has been aligned to, the coordinates of the alignment, and some other information. However, this information cannot be used to call variants directly, so we need some further processing here. Before that processing, we need to sort the BAM files. After sorting, we need to build indices for them.
Now we will use the “indel realignment” tool provided by GATK to refine the primary alignment result, because the primary alignment might contain errors around indels. This tuning by GATK is done in two steps. As you can see, we first use the RealignerTargetCreator tool to find the loci that need to be realigned. Then we use the IndelRealigner tool, which performs the actual realignment, to process the BAM files.
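To make the description of an alignment record more concrete, here is a minimal Python sketch that splits one text line, of the kind “samtools view” prints, into the fields mentioned above: the read name, the chromosome name, and the alignment coordinate. The example record is invented for illustration; the read name, the chromosome name “chrM”, and the read group tag are assumptions and do not come from the data used in this demonstration.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_name: str   # QNAME: name of the read
    flag: int        # FLAG: bitwise flags describing the alignment
    reference: str   # RNAME: chromosome (reference sequence) name
    position: int    # POS: 1-based leftmost coordinate of the alignment
    mapq: int        # MAPQ: mapping quality
    cigar: str       # CIGAR: how the read lines up against the reference


def parse_sam_line(line: str) -> Alignment:
    """Parse one tab-separated alignment line into its leading fields."""
    fields = line.rstrip("\n").split("\t")
    # The SAM format defines 11 mandatory fields; optional tags may follow.
    if len(fields) < 11:
        raise ValueError(
            f"expected at least 11 tab-separated fields, got {len(fields)}"
        )
    return Alignment(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
    )


def is_mapped(aln: Alignment) -> bool:
    """Bit 0x4 of FLAG is set when the read is unmapped."""
    return not (aln.flag & 0x4)


# Hypothetical record, made up for illustration only.
EXAMPLE_LINE = "\t".join([
    "read_001", "99", "chrM", "73", "60", "10M",
    "=", "250", "187", "ACGTACGTAC", "IIIIIIIIII", "RG:Z:sample1",
])

if __name__ == "__main__":
    aln = parse_sam_line(EXAMPLE_LINE)
    print(aln.read_name, aln.reference, aln.position, is_mapped(aln))
```

In practice a dedicated library would usually be used to read BAM files; the sketch is only meant to show which pieces of information each line carries.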
After the indel realignment, we will use another tool provided by GATK, Base Quality Recalibration, to tune the base qualities in the alignment result. This tuning is also done in two steps. In the first step, we need to provide a training data set of known variant loci. Here we use the two VCF files provided by dbSNP. In the second step, we run the TableRecalibration tool to generate the final.bam file we need.
After these GATK tuning procedures, we can start calling variants from our alignment result. I will illustrate the usage of two tools here. The first one is mpileup, provided by samtools, and the second one is the UnifiedGenotyper, provided by GATK.
Let’s see how to use samtools. First, we need to build another type of index for the fasta file, which samtools requires. Then let’s run the samtools mpileup command. In this command, we provide the original genome sequence and the preprocessed alignment result in BAM format. Let’s have a look at the VCF file we have generated. As you can see, the first several lines are comments. Here are the SNP loci we have found. As you can see, they reside at these positions of the mitochondrial genome: 73, 150, 199, 263, etc. The remaining columns describe the details of each SNP, which are explained in the VCF format specification.
Aside from samtools, we can also use the UnifiedGenotyper provided by GATK to call variants. Here we also need to provide the genome sequence and the preprocessed alignment result in BAM format, and the output is again in VCF format. As you can see, this VCF file has nearly the same format as the previous one generated by samtools, with some differences in the header. However, the variants called differ between these two tools. These differences depend on how each tool is configured and used. You need to find out in practice which tool suits you better.
That’s all. Thank you everyone. Bye.
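One way to see how the two call sets differ is to collect the variant positions from each VCF file and compare them as sets. The Python sketch below does this on two small VCF texts. These records are invented for illustration: the chromosome name “chrM”, the alleles, and the second list of positions are assumptions, and they are not the actual output of samtools or the UnifiedGenotyper in this demonstration.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome name, 1-based position)


def read_variant_sites(vcf_text: str) -> Set[Site]:
    """Collect (CHROM, POS) pairs from the data lines of a VCF text."""
    sites: Set[Site] = set()
    for line in vcf_text.splitlines():
        # Lines starting with '#' are the header (the comment lines).
        if not line or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def compare_call_sets(first: Set[Site], second: Set[Site]) -> Dict[str, Set[Site]]:
    """Split two call sets into shared sites and sites unique to each."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


def make_vcf(records) -> str:
    """Build a tiny VCF text from (pos, ref, alt) tuples; illustration only."""
    header = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    body = [f"chrM\t{pos}\t.\t{ref}\t{alt}\t50\t.\t." for pos, ref, alt in records]
    return "\n".join(header + body) + "\n"


# Hypothetical call sets, made up for illustration only.
FIRST_VCF = make_vcf([(73, "A", "G"), (150, "C", "T"), (199, "T", "C"), (263, "A", "G")])
SECOND_VCF = make_vcf([(73, "A", "G"), (150, "C", "T"), (263, "A", "G"), (310, "T", "C")])

if __name__ == "__main__":
    result = compare_call_sets(
        read_variant_sites(FIRST_VCF), read_variant_sites(SECOND_VCF)
    )
    for label, sites in result.items():
        print(label, sorted(sites))
```

Comparing only chromosome and position is a simplification; a stricter comparison would also check the reference and alternate alleles at each site.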