Welcome back. Let's continue our discussion of bioinformatics resources. In this unit, let's look at a few examples of databases and software at the National Center for Biotechnology Information, or NCBI. NCBI was established on November 4th, 1988, as a division of the National Library of Medicine, NLM, at the National Institutes of Health, NIH, in the US. It has a lot of raw data and secondary data, as well as a number of software tools. It is one of the resources that I use the most.
This table lists examples of resources at NCBI. There are several large data repositories, including GenBank for nucleotide sequences, dbEST for expressed sequence tags, which are no longer popular, GEO for gene expression data, and SRA for next-generation sequencing reads. For DNA, there are annotated databases, Genome and Gene. To assist research in comparative genomics, there are the Taxonomy resource and HomoloGene. Genetic variations and disease mutations can be found in dbSNP, dbVar, OMIM, dbGaP, and ClinVar. Annotated RNAs are available in RefSeq and UniGene. Proteins and annotations are available in Protein, RefSeq, and Conserved Domain. Many of you have used PubMed to search for literature. MeSH is a hierarchical controlled vocabulary used to index the citations in PubMed. The most widely used tool at NCBI is the BLAST suite of tools.
Now let's look at a few of them in a little more detail. As of the end of 2013, over 1,000 complete whole-genome sequences are available from the NCBI Genome resource. Partial genomes whose sequencing is still in progress can also be found here. In addition to the sequences, the Genome resource also contains maps, chromosomes, assemblies, and annotations. For instance, you can browse the human genome by chromosome. The genes are annotated, including name, description, disease association, and so on. The data is available not only for browsing, but also for searching, batch Entrez query, and FTP download.
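As a quick recap of the resource table, here is a minimal Python sketch that stores the categories and databases named in this unit in a dictionary and looks up where a given resource appears. The category labels and the function name `categories_for` are my own illustrative choices, not official NCBI terminology.

```python
# Catalog of NCBI resources as grouped in this unit.
# Category labels are illustrative, not official NCBI names.
NCBI_RESOURCES = {
    "data repositories": ["GenBank", "dbEST", "GEO", "SRA"],
    "annotated DNA": ["Genome", "Gene"],
    "comparative genomics": ["Taxonomy", "HomoloGene"],
    "variation and disease": ["dbSNP", "dbVar", "OMIM", "dbGaP", "ClinVar"],
    "annotated RNA": ["RefSeq", "UniGene"],
    "proteins": ["Protein", "RefSeq", "Conserved Domain"],
    "literature": ["PubMed", "MeSH"],
    "tools": ["BLAST"],
}


def categories_for(resource):
    """Return every category that lists the resource (case-insensitive)."""
    wanted = resource.lower()
    return [
        category
        for category, names in NCBI_RESOURCES.items()
        if wanted in (name.lower() for name in names)
    ]


print(categories_for("RefSeq"))  # ['annotated RNA', 'proteins']
print(categories_for("sra"))     # ['data repositories']
```

Note that RefSeq shows up twice, because it covers both RNA and protein sequences.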
Another useful resource at NCBI is called RefSeq, short for Reference Sequence, for nucleic acid sequences and protein sequences. RefSeq contains sequences that have been manually curated by an expert team to be the most complete and accurate sequence for each gene, out of all the sequence fragments generated by different experiments and different labs. An entry ID starting with NM denotes a nucleic acid sequence, and an entry ID starting with NP denotes a protein sequence.
NCBI also has a resource called Gene that integrates lots of useful information about each gene. It is a good starting point for learning about a gene of interest. For example, as shown here, the function, chromosomal location, gene ontology, genetic variation, pathway, and interactions of a gene are all shown on one page. How convenient. As of the end of 2013, the NCBI Gene resource provides annotations for about 14 million genes in 11,000 species. If you're studying human genes, there is another great resource with even more annotations than NCBI Gene, and that is GeneCards. GeneCards is a curated secondary database of extensive experimental and predicted data about each human gene. It is an excellent starting point for learning about a human gene. By the way, GeneCards is not part of NCBI.
Another important resource at NCBI is the Sequence Read Archive, SRA, which we have touched upon in earlier lectures. SRA stores raw sequencing data from next-generation sequencing technologies. From the SRA, you can download not only the raw read sequences, but also metadata about the experimental design, sample information, library, sequencing platform, and so on. As of the end of 2013, SRA contains 2.69 petabase pairs of data from over 323,000 samples in over 27,000 records. As this figure shows, the data has been increasing exponentially since SRA was created in 2007. The amount of data doubles every five months. This figure was made in September of 2013.
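To get a feel for what "doubles every five months" means, here is a small Python sketch of the arithmetic. It assumes the five-month doubling time quoted above holds steadily, which is a simplification, and the function names are hypothetical ones chosen for this illustration.

```python
import math

DOUBLING_MONTHS = 5.0      # doubling time quoted in the lecture
SRA_END_2013_PBP = 2.69    # petabase pairs at the end of 2013, from the lecture


def growth_factor(months_elapsed, doubling_months=DOUBLING_MONTHS):
    """Factor by which the data grows after the given number of months."""
    return 2.0 ** (months_elapsed / doubling_months)


def months_to_grow(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed for the data to grow by the given factor."""
    return doubling_months * math.log2(factor)


# Growth over the three months between the figure and the end of 2013.
print(f"growth in 3 months: x{growth_factor(3):.2f}")

# Naive one-year extrapolation; an assumption, not a reported number.
projected = SRA_END_2013_PBP * growth_factor(12)
print(f"projected size after 12 more months: {projected:.1f} petabase pairs")

# How long a tenfold increase would take at this pace.
print(f"months for a tenfold increase: {months_to_grow(10):.1f}")
```

Even a three-month gap corresponds to roughly one and a half times as much data, which is why a plot of SRA growth goes out of date so quickly.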
That figure is already obsolete just three months later. It's amazing how fast the data is growing. It's a source of data that cannot be ignored.
Another useful resource at NCBI is the Taxonomy database. It provides a curated classification and nomenclature for all of the organisms that have at least one DNA sequence in the public databases. It currently represents about 10% of all described species of life on Earth. The taxonomy is a rooted hierarchical tree. For instance, Homo sapiens is a small branch of Hominidae, which is a branch of Primates, which is a branch of Mammalia, which is a branch of eukaryotes, etc. As of the end of 2013, the Taxonomy database represents about 300,000 species and about 100,000 other higher or lower taxa on the taxonomy tree.
Another very useful resource at NCBI that many of you are familiar with is PubMed. PubMed contains over 23 million citations for biomedical articles. For most of the citations, PubMed provides the title and abstract of the article, the authors, their affiliations, the journal and the volume where the paper was published, related citations, and so on. In particular, you can follow the links on the top right of the page to see the full text of the article. If you see the Open Access sign here, it means that the journal makes the article freely available to everybody in the world. As a matter of fact, other than the air we breathe and the sun we bathe in, nothing is completely free for everybody. Somebody has to pay for it to make it free for someone else. For Open Access articles, readers can freely access the articles only because the authors paid the journals about $1,000 per article. For instance, after our paper on KOBAS was peer reviewed and accepted, the journal Nucleic Acids Research required our group to pay about $1,200 to make our paper freely available for everybody in the world to read.
I'm not complaining. We were happy to make our paper freely accessible to everybody who is interested. It's just like this MOOC: everybody who is interested in learning can have the opportunity to learn. Similarly, most bioinformatics resources are freely available to users around the world because government grant agencies, universities, and research institutes have paid for the resources to be developed. This Open Access movement in the life sciences is a new movement that started in the late 80s. I think it has made a very positive impact on life science research. By the way, authors who have financial difficulties can apply to the journal to have the publication fee waived.
Back to this webpage: if you see a "Free in PMC" sign here, it means that NCBI has made the article freely available through a related resource at NCBI called PMC, or PubMed Central. PMC contains the complete full text of 2.9 million articles, about 13% of all PubMed citations, as of the end of 2013. All articles in PMC are freely available to every user.
Another useful resource at NCBI that not many people know about is called MeSH, or Medical Subject Headings. It is a hierarchical controlled vocabulary that the National Library of Medicine, NLM, a division within NIH, developed for indexing articles in PubMed. At the highest level, there are 16 categories, such as Anatomy, Organisms, Diseases, and so on. If you expand the category Phenomena and Processes, you will see that it includes 17 subcategories, such as physical phenomena, chemical phenomena, etc. If you further expand the subcategory genetic phenomena, you will see finer subcategories, such as genetic variation, which includes mutation, which includes allelic imbalance, base pair mismatch, etc. This hierarchical controlled vocabulary makes it more efficient to organize and search for citations in PubMed.
Next, I'd like to say a few quick words about another useful resource for your research called My NCBI.
In your research, you should always try to know about the latest published work relevant to the biological question you study and the technology you use. My NCBI allows you to set up keyword searches that are run automatically every day, week, or month, with the search results delivered automatically to your email inbox. If there are journals for which it is important for you to read every issue, you can also set that up in My NCBI. Then you will receive regular emails containing citations of all the new papers that match your keywords or appear in your favorite journals. It is an excellent way to keep up with the latest literature. I've asked all the students in my group to sign up for it. I suggest that you do so as well.
Finally, one of the most popular resources at NCBI is the BLAST suite of tools, including nucleotide BLAST, protein BLAST, PSI-BLAST, and so on. It has a nice user interface. You can easily do a BLAST search using the default parameters, or you can change the parameters to do more sophisticated analysis. The output of BLAST also has a nice graphical user interface. In addition to the online version of BLAST, NCBI also provides a standalone version for download called BLAST+. You can easily embed BLAST+ into your own computer programs. There is also wwwblast, which you can download and embed into your own webpages. Very useful.
In the next unit, I will show you some of the resources at the European Bioinformatics Institute, EBI. See you at the next unit.
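As a footnote to the remark about embedding BLAST+ into your own programs, here is a minimal Python sketch of one way to do it: build a `blastn` command line and run it as a subprocess. It assumes BLAST+ is installed and on your PATH and that a nucleotide database has already been prepared. The file names, the database name, and the helper function names are hypothetical, and the E-value cutoff is only an example setting.

```python
import shutil
import subprocess


def build_blastn_command(query_fasta, database, out_path, evalue=1e-5):
    """Assemble a blastn command line as a list of arguments."""
    return [
        "blastn",
        "-query", query_fasta,    # FASTA file with the query sequences
        "-db", database,          # name of a preformatted BLAST database
        "-out", out_path,         # file that will receive the hits
        "-outfmt", "6",           # tabular output, one hit per line
        "-evalue", str(evalue),   # E-value cutoff (example value)
    ]


def run_blastn(query_fasta, database, out_path, evalue=1e-5):
    """Run blastn and return the path of the output file."""
    command = build_blastn_command(query_fasta, database, out_path, evalue)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("blastn was not found on PATH; is BLAST+ installed?")
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return out_path


# Only print the command here; the file and database names are placeholders.
print(" ".join(build_blastn_command("query.fa", "my_nt_db", "hits.tsv")))
```

Keeping the command construction separate from the execution makes it easy to inspect or log the exact call before running a long search.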