Welcome back to the Peking University MOOC "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's start this week's topic: reads mapping in deep sequencing data analysis. Starting this week, we begin the second session of this MOOC. In this unit, we will focus on deep sequencing data analysis and illustrate, by examining specific problems, how classical methods are adjusted and new computational tools are introduced. These adjustments and new tools help to analyze and process the huge amount of next-generation sequencing data efficiently, and thereby help to solve biological problems. Specifically, in this week and the next we will introduce DNA data analysis methods for genome resequencing. For each topic, we will first introduce the analysis methods for deep sequencing data. Then we will discuss how to use the results produced by these methods to explore the biological problems further. We hope you will understand not only the basic ideas behind these tools but also how to use them, so that you can use them better. Therefore, our TAs and other students from our lab will film special videos on the basic usage and ideas of the common tools involved in each unit. These videos will be uploaded as supplementary learning materials. You are encouraged to think and to discuss with other students and TAs in the forums. Now let's begin the first unit of this week: From Sequencing to NGS. As mentioned in Week 1, the genome is merely a series of four letters (A, T, C, and G), yet it contains the genetic information of life. Since the importance of DNA was realized in the early 20th century, people have used sequencing, i.e. determining the exact sequence of a DNA molecule, as one of the important methods for understanding life. However, nucleotide sequencing could not be applied on a large scale until 1977, when the English biochemist Frederick Sanger developed such a method.
For this contribution, he shared the Nobel Prize in Chemistry in 1980 with Walter Gilbert. It also laid a solid foundation for the completion of the human genome draft in the 20th century. The wide application of Sanger sequencing made it possible to sequence genomes on a large scale. Sequencing technology has developed rapidly in the 21st century. Marked by the establishment of the 454 technology in 2005, next-generation sequencing (NGS) technology entered the life sciences. Nowadays there are about a thousand NGS sequencers in over a hundred research institutes and companies all over the world. They are widely used in research, teaching, and applications in various fields such as biology, medical science, and agricultural science. Next-generation sequencing is also called deep sequencing. We will switch between the two terms “next-generation sequencing” and “deep sequencing” in later units depending on the context. On the other hand, next-generation sequencing produces reads that are shorter and, on average, have a higher error rate than those of Sanger sequencing. Downstream bioinformatics analyses thus face a bigger challenge. To make this easier to understand, Yao He (one of our TAs) has filmed a supplementary video for this unit. The video explains the specific parameters and usage of the related sequencers. You can watch it if you're interested. The data generated by these sequencers are often stored in the FASTQ format, except for the SOLiD sequencers made by ABI. In this format, each read is stored with its nucleotide sequence and the quality information for each base. To keep it human-readable, the sequencing quality of a single base, Q, is defined as -10 times the base-10 logarithm of the probability that this base is called erroneously.
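As a minimal Python sketch of this definition, the conversion between Q and the error probability can be written as follows. The function names are hypothetical and chosen only for illustration; they do not come from the lecture or from any particular tool.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Return Q = -10 * log10(p) for an error probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Invert the definition above: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

A higher Q therefore means a lower probability that the base was called erroneously.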
In other words, when Q is 20, the probability that this base is called erroneously is 0.01. The Q values are further encoded as ASCII characters and stored in FASTQ files. The reliability of each base call can thus be inferred from the corresponding quality information. Empirically, bases with a Q value below 20, i.e. with an error probability larger than 0.01, are considered unreliable. If such bases make up more than 20% of a read, the read might be discarded. Also, to mitigate the problems caused by short read length, next-generation sequencing uses paired-end reads, which come from the two ends of a longer DNA segment. The names of the two reads in each pair are appended with /1 and /2, respectively. The emergence of next-generation sequencing has driven research in related fields considerably. Currently, in addition to genomic DNA sequencing, NGS is also used to study epigenetic modifications, the transcriptome, protein-DNA interactions, and many other important biological problems. We can discover genetic variations between individuals by mapping the reads from an individual's genome to the reference genome. By associating these genetic variations with specific phenotypic variations, we can run association studies to investigate the genetic basis of phenotypic variation. This provides important clues for studying the mechanisms of genetic diseases, and thus for their diagnosis and clinical treatment. RNA-Seq is a deep sequencing technique used to study the transcriptome. It can systematically sequence all the transcripts expressed in cells. RNA-Seq allows researchers to reconstruct the transcriptome quickly and thus identify alternative splicing isoforms. This can hardly be done with traditional techniques such as microarrays.
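Returning briefly to the FASTQ quality rule described earlier in this unit (bases with Q below 20 are considered unreliable, and a read with more than 20% such bases might be discarded), a minimal Python sketch could look like the following. It assumes the Phred+33 ASCII encoding, in which each character's ASCII code minus 33 gives Q; the lecture does not specify the offset. The function names and default thresholds are illustrative only.

```python
from typing import List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Q values.

    Assumption: Phred+33 encoding (ASCII code minus 33 equals Q).
    """
    return [ord(ch) - offset for ch in quality_line]


def keep_read(quality_line: str,
              min_q: int = 20,
              max_low_fraction: float = 0.2,
              offset: int = 33) -> bool:
    """Return False if more than max_low_fraction of bases have Q < min_q."""
    quals = decode_qualities(quality_line, offset)
    if not quals:
        return False  # an empty read carries no usable information
    low_count = sum(1 for q in quals if q < min_q)
    return low_count / len(quals) <= max_low_fraction
```

Under this assumed encoding, the character `I` stands for Q = 40 and `#` for Q = 2, so a ten-base read with three `#` characters would be rejected by `keep_read`.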
Besides, we can estimate the expression level of each locus by counting the reads mapped to that locus. With the estimated expression levels of genes, we can analyze differential expression, cluster genes, and find genes related to particular biological processes. Similar to RNA-Seq, ChIP-Seq is another technique that uses deep sequencing to study transcriptional regulation. Unlike RNA-Seq, ChIP-Seq uses deep sequencing to sequence the DNA segments pulled down by a specific antibody. The sequencing results can thus be used to infer protein-DNA interactions. By selecting different antibodies, ChIP-Seq can be used to detect the binding sites of transcription factors or regions with specific chromatin modifications. Thus this technique is widely used in studies of transcriptional regulation and epigenetics. That’s all for deep sequencing techniques and their main applications. In the next unit, we will introduce the first step of deep sequencing data analysis: reads mapping.
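As a small illustration of the read-counting idea mentioned in this unit, the following Python sketch counts how many mapped reads fall within each locus. It is a simplification under stated assumptions: each read is represented only by its chromosome and mapped start position, loci are given as 1-based inclusive intervals, and the raw counts are not normalized in any way. The function name and the data layout are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Assumed layout: (chromosome, start, end), 1-based and inclusive.
Locus = Tuple[str, int, int]


def count_reads_per_locus(read_positions: Iterable[Tuple[str, int]],
                          loci: Dict[str, Locus]) -> Dict[str, int]:
    """Count the reads whose mapped position lies inside each locus."""
    counts = {name: 0 for name in loci}
    for chrom, pos in read_positions:
        for name, (locus_chrom, start, end) in loci.items():
            if chrom == locus_chrom and start <= pos <= end:
                counts[name] += 1
    return counts


# Toy example with made-up coordinates, for illustration only.
example_loci = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 600)}
example_reads = [("chr1", 150), ("chr1", 100), ("chr1", 550),
                 ("chr2", 150), ("chr1", 300)]
example_counts = count_reads_per_locus(example_reads, example_loci)
```

The nested loop is chosen for clarity rather than speed; with many reads and loci, an interval index would likely be a better fit.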