Welcome back to the Peking University MOOC, Bioinformatics: Introduction and Methods. I'm Liping Wei from the Center for Bioinformatics at Peking University. Let's continue with our discussion of functional prediction of genetic variations. In this unit, let me show you some of the main databases of genetic variations so that you know where to find these data for your research and get a global picture of known genetic variations. Databases not only allow easy access to large amounts of information but also provide data for training machine learning methods and for mining genome-wide statistical patterns. Most of the databases of genetic variations were created in the past three decades. I'll briefly show you some of the databases in this slide and then elaborate on some of them a bit more in the slides that follow.
In 1998, NCBI established dbSNP as a central repository of single nucleotide variations and other small genetic variations identified in our species. In 2010, NCBI established dbVar to store large-scale genomic variations. In 2012, the 1000 Genomes Project released the first batch of genome and exome sequences of over 1,000 healthy individuals from all over the world.
On the other hand, many databases have been created for disease-associated genes and mutations. A good example is OMIM, which is a well-annotated database of human diseases and associated genes. If you want to find the exact mutations that are associated with diseases, the Human Gene Mutation Database (HGMD), created in 1996, is a great choice. In 2007, LSDBs, the Locus Specific Databases, were created to organize mutations and polymorphisms around each locus. Two additional databases, dbGaP and ClinVar, contain genes and variations discovered by genome-wide association studies (GWAS) or next-generation sequencing to either cause diseases or increase disease risk. Finally, in 2004, the COSMIC database was created to contain somatic mutations discovered in tumors.
Now, let's look at some of these databases in greater detail. dbSNP was created in September 1998 by NCBI, the National Center for Biotechnology Information in the U.S., in collaboration with NHGRI, the National Human Genome Research Institute. Version 138 of dbSNP contains genetic variations from 131 species, mostly small variations such as SNPs, indels, microsatellite markers, short tandem repeats, and so on. Large genetic variations are primarily stored in the dbVar database. As a central repository, dbSNP contains most of the human genetic variations identified so far in both healthy and sick individuals. It has over 230 million submissions covering over 60 million reference SNP clusters, 27 million of which are in genes. As this figure shows, the amount of data in dbSNP has been increasing rapidly over the years.
Each record in dbSNP contains lots of useful information about the variation. For instance, this record is about a single nucleotide variation, A/T. The ancestral allele is adenine; the ancestral allele is usually determined by comparison with closely related primates. A small part of the human population acquired the variant thymine. The frequency of the minor allele, thymine, is 0.7 percent. Down here, you can click to find additional useful information about this variation, such as flanking sequences and minor allele frequencies in different human sub-populations. You may have noticed here that this variation was contributed by the 1000 Genomes Project.
The 1000 Genomes Project was launched in January 2008 as an international research effort to establish the most detailed catalog of human genetic variations. In October 2012, the genome sequences of 1,092 healthy individuals from all around the world were announced in a Nature publication and made freely available to the public.
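To make the minor allele frequency in the dbSNP record above concrete, here is a minimal Python sketch of how such a number can be computed from allele counts. The function name `minor_allele_frequency` and the counts are hypothetical, chosen only so that the result matches the 0.7 percent quoted for thymine; they are not taken from dbSNP.

```python
def minor_allele_frequency(allele_counts):
    """Return (minor_allele, frequency) for a biallelic site.

    allele_counts maps each allele to the number of sampled
    chromosomes that carry it (hypothetical input format).
    """
    if len(allele_counts) != 2:
        raise ValueError("this sketch only handles biallelic sites")
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no chromosomes were sampled")
    # The minor allele is simply the less common of the two alleles.
    minor = min(allele_counts, key=allele_counts.get)
    return minor, allele_counts[minor] / total


# Hypothetical counts: 10,000 sampled chromosomes, 70 of which carry T.
counts = {"A": 9930, "T": 70}
allele, freq = minor_allele_frequency(counts)
print(f"{allele}: {freq:.1%}")  # T: 0.7%
```

The same calculation, run separately on the chromosomes from each sub-population, would give the per-population frequencies that the record links to.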
Phase I of the project used next-generation sequencing technologies, including Illumina GA II and HiSeq, Life Technologies SOLiD, and 454, for low-coverage whole-genome sequencing and higher-coverage whole-exome sequencing. Phase II of the project is adding more sequencing coverage as well as adding 1,000 more individuals. Genetic variations called from the sequencing data from this project are periodically deposited into dbSNP.
OMIM, or Online Mendelian Inheritance in Man, is a database of all human diseases known to have a genetic component and their causal or risk genes. It was initially created in 1966 by Dr. Victor McKusick as a catalog of Mendelian traits and disorders in the form of a book named "Mendelian Inheritance in Man", or MIM. Twelve book editions of MIM were published. As the data grew larger and larger and the book grew thicker and thicker, an online version, OMIM, was created in 1985 and made publicly available two years later. The black bars in this figure show the number of genetic disorders with known molecular basis. The gray bars show the number of mapped disorders with as yet unknown molecular basis. OMIM now includes not only single-gene Mendelian disorders but also complex diseases with susceptibility genes and some somatic cell genetic diseases. As of October 2013, OMIM has detailed descriptions of 14,000 genes, 4,000 phenotypes with known molecular basis, 1,700 mapped phenotypes with unknown molecular basis, and 1,800 suspected Mendelian disorders and complex diseases.
There are two main types of records in OMIM. The first type of record focuses on genetic disorders. For example, this record on breast cancer describes the clinical features, diagnosis, and treatment of breast cancer, the inheritance models of different types of breast cancer, the genes and mutations discovered so far, and other useful information such as animal models. You can type in 114480 on the OMIM website to see the complete record.
The second type of record focuses on genes that have been associated with certain genetic diseases. For example, this record on BRCA1 shows its cytogenetic location, genomic coordinates, gene structure, gene function, and evolution. More importantly, it shows how BRCA1 was discovered to be associated with several types of cancer, the families and patients studied, the mutations identified, and the genotype-phenotype correlations. OMIM is a beautiful resource for gene-disease associations. However, if you're interested in getting not only the disease genes but also the known mutations in those genes, you may want to know about the Human Gene Mutation Database (HGMD).
HGMD is a comprehensive database, manually curated from the literature, of gene mutations that underlie or are associated with human genetic diseases. The origin of HGMD traces back to 1987, when biologist David Cooper and mathematician Michael Krawczak decided to collaborate on the study of the mutational mechanisms of human genes. Their study was based on extensive statistical and mathematical analysis of known mutations that they manually collected from the scientific literature. In 1996, they decided to make the collection into a public online database, HGMD. As this figure shows, the number of mutations published each year has been on a steep rise. Collectively, as of February 2013, there are over 140,000 human gene mutations in HGMD, including 60,000 missense substitutions, 15,000 nonsense substitutions, 13,000 splicing substitutions, and so on. Here is an example of mutations in BRCA1. For each mutation, you can find the nucleotide change, the amino acid change, the phenotypic change, and the reference.
Learning from all these genetic variation data and their known phenotypic effects, we can develop methods to predict the functional and phenotypic effects of new genetic variations.
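As a small illustration of two of the substitution categories counted in HGMD above, missense and nonsense, here is a minimal Python sketch that labels a single-codon change using the standard genetic code. The function name `classify_substitution` and the example codons are illustrative assumptions and are not part of HGMD; splicing substitutions depend on positions around exon boundaries and are not handled here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Label a codon change as synonymous, nonsense, stop-loss, or missense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop), protein unchanged
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"    # stop codon becomes an amino acid codon
    return "missense"         # one amino acid replaced by another


print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("CAG", "TAG"))  # Gln -> stop
print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
```

A label like this only says what kind of change occurred. Deciding whether a particular missense change is likely to be damaging is the harder question that prediction methods try to answer.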
In the next unit, I will share with you a conservation-based method, SIFT, and a rule-based method, PolyPhen. I look forward to seeing you then.