Welcome to Peking University MOOC “Bioinformatics: Introduction and Methods”. I’m Liping Wei from the Center for Bioinformatics at Peking University. Let’s continue with this week’s lectures on Ontology and Identification of Molecular Pathways. In Unit 1, let’s look at ontology and the Gene Ontology. You may still remember this slide, which I showed you in Week 1’s lectures. The “informatics” in “bioinformatics” refers to the computer and computational technologies that we use to address biological questions. It follows the theme of “from Data to Discovery”. In the past few weeks, Dr. Gao and I spent most of the time telling you about many bioinformatic algorithms, software tools, and web servers. They are all based on biological data and operate on biological data. How can the data be organized so as to facilitate computation? This is the subject of this unit. If you think about it, biologists really don’t make it easy for computers. Take the example of a human gene that many of you are familiar with, the WNT1 gene. Some people, however, write it as WNT-1. But a computer cannot automatically know that WNT1 and WNT-1 refer to the same gene. Even worse, WNT1 was previously known as INT1. There is no way for a computer to know that INT1 and WNT1 refer to the same gene. WNT1 is shorthand for WINGLESS-TYPE MMTV INTEGRATION SITE FAMILY, MEMBER 1. The poor computer is really getting confused. WNT1 is a member of the WINGLESS-TYPE MMTV INTEGRATION SITE FAMILY. But a computer looking at “WNT1” and “WINGLESS-TYPE MMTV INTEGRATION SITE FAMILY” has no way of knowing their relationship. To complicate things even further for the poor computer, the Drosophila research community has a habit of naming a gene after its mutant phenotype. So the homologue of WNT1 is called wingless in Drosophila. You may be able to guess it easily, but it is very hard for a computer to guess things like this accurately. Another example is the highly conserved delta DNA polymerase.
It used to be called CDC2 in yeast and has now been renamed POL3. It is called DNApol-delta in Drosophila, and Pold1 in mouse and human. How can a computer know that all these different names refer to the same gene? A computer is only as smart as we make it. So if we want our computer to use its power to compute for us, we need to clearly define the data for it. We need to define all the entities, give each entity a unique name, enumerate all its synonyms and acronyms, and describe its properties. We also need to define the relationships between the entities, such as “WNT1 is a WNT family member”. Basically we need to define a vocabulary. Not just any vocabulary, but a vocabulary that is “controlled”, dictating the use of predefined, authorized terms that have been preselected by the designer of the vocabulary, in contrast to free natural language vocabularies. The vocabulary is “common”, meaning that different people agree to use it as a common language across different applications. Last but not least, the vocabulary is “hierarchical”, defining the hierarchical relationships between the entities. When this is done in a very formal way, the vocabulary is called an Ontology. When I was doing my Ph.D. at Stanford, I learnt in the bioinformatics algorithms class taught by my advisor Russ Altman that an ontology was “a specification of a conceptualization”. I still remember how sophisticated it felt every time I said these long words. To put it in more easy-to-understand language, an ontology can be defined more operationally as a set of concepts within a domain, defined by a shared vocabulary to denote the types and properties of the concepts as well as the relationships between the concepts. Ontology has had deep roots in philosophy since the 17th century, long before it was adopted in computer science in the mid-20th century.
In philosophy it is defined as the study of the nature of being, becoming, existence, or reality, as well as the basic categories of being and their relations. So why is ontology important? What does it enable us to do? First, communication. An ontology allows different people to communicate unambiguously. For example, with an ontology of gene function, different groups annotating different genomes can have a common language. Second, computation. An ontology represents knowledge in a computable form so that computer programs can analyze the data automatically. Third, discovery of patterns across different hierarchies. The hierarchical structure of an ontology enables us to go above a set of individual genes to find the larger functional categories or pathways involved. It gives us a bird’s-eye view. Open Biomedical Ontologies (OBO) collects major ontologies in life sciences, including Gene Ontology, Anatomical Entity Ontology, Disease Ontology, Sequence Ontology, Systems Biology Ontology, and so on. The most widely used by far is the Gene Ontology. You may remember from our first week’s lectures that, since the first free-living organism was sequenced in 1995, more and more genomes were getting sequenced in the late 1990s. The yeast genome sequencing was completed in 1996, the C. elegans genome in 1998, and the Drosophila genome in 2000, and the mouse genome was well on its way. In 1998 the genome scientists were busy annotating the genomes. They soon realized that a large fraction of genes were conserved across different species, but the homologous genes were often given completely different names in different species, making comparisons difficult. This made it hard to borrow knowledge from one species to aid the understanding of homologous genes in another species.
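To make the naming problem concrete, here is a minimal sketch of what “enumerating all the synonyms” could look like for a computer. The table is hand-made from the WNT1 example in this lecture, and the names `SYNONYMS`, `build_lookup`, and `canonical_name` are made up for illustration; it is not how any particular database stores gene names.

```python
# Hypothetical, hand-made synonym table built from the WNT1 example in the lecture.
# A real resource would hold many more genes and curated aliases.
SYNONYMS = {
    "WNT1": [
        "WNT-1",
        "INT1",
        "WINGLESS-TYPE MMTV INTEGRATION SITE FAMILY, MEMBER 1",
    ],
}


def build_lookup(synonyms):
    """Map every known spelling (upper-cased) to its one canonical name."""
    lookup = {}
    for canonical, aliases in synonyms.items():
        for name in [canonical, *aliases]:
            lookup[name.upper()] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def canonical_name(name):
    """Return the canonical gene name, or None if the spelling is unknown."""
    return LOOKUP.get(name.strip().upper())


print(canonical_name("wnt-1"))  # resolved through the table
print(canonical_name("int1"))   # the old name is resolved too
print(canonical_name("POL3"))   # not in the table, so the computer cannot guess
```

The point of the sketch is that the computer only resolves names we have explicitly listed; anything outside the table stays unknown, which is exactly why a shared, controlled vocabulary is needed.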
To provide a unified common language for biology, in 1998, scientists of three model organism genome databases, FlyBase, Saccharomyces Genome Database (SGD), and Mouse Genome Database (MGD), decided to collaborate to create the Gene Ontology. The Gene Ontology defines a structured, common, controlled vocabulary to describe attributes of genes and gene products across organisms. Collaboration is key to building a consensus vocabulary. Today the Gene Ontology consortium has grown from the initial three genome groups to over 20 members, spanning multiple large genome sequencing projects. The Gene Ontology is divided into three categories to describe the genes and gene products from three different angles: Molecular Function, Biological Process, and Cellular Component. Each category is a hierarchical controlled vocabulary. The Molecular Function describes the elemental activities or tasks performed by individual gene products, for instance carbohydrate binding or ATPase activity. The Biological Process describes the biological goals or objectives that are accomplished by ordered assemblies of molecular functions, for example mitosis or purine metabolism. The Cellular Component describes the subcellular structures, locations, and macromolecular complexes that the gene product belongs to, for example nucleus, telomere, and RNA polymerase II holoenzyme. Each of the three categories of the Gene Ontology can be represented as a Directed Acyclic Graph. The three categories are the three roots. Part of the Biological Process graph is shown here. Each node of the graph is a concept. For instance the node “pigmentation during development” is a concept of a biological process. The edges between the nodes represent the relationships between the concepts. For example “pigmentation during development” is a special kind of “pigmentation”, so you see this “is_a” edge pointing towards “pigmentation”.
The graph is “directed” because the edges, or relationships, have directions, such as A is part-of B, in which case B cannot be a part of A. The graph is “acyclic” because all the edges point towards the root, so they can’t form loops. How can this graph be stored in the computer? Please pause for a moment to think about how you would do this. There are several formats for storing the Gene Ontology. One of the most widely used formats is the OBO format. Gene Ontology currently uses OBO version 1.2. Here each concept is named a “Term”. Each concept has a few properties defined, including a unique id, a name, and a namespace, which specifies which of the three categories the term belongs to. “def” is the definition of the term. “synonym” lists all the synonyms and acronyms of the name. Finally, there is the definition of the concept’s relationships with other concepts, such as “is_a”. Here is a real example of a part of an OBO data file. The first line is [Term]. The next line specifies the unique ID, which is GO:0000001 here. The next line specifies the name of the concept, “mitochondrion inheritance”. Next comes the namespace, “biological_process”. The definition of the concept is included here. And the synonyms and acronyms are listed here. “mitochondrial inheritance” and “mitochondrion inheritance” are very similar to human eyes, but if you don’t define this for the computer, the computer won’t know that they refer to the same thing. The relationships with other concepts are specified here. Each concept can have relationships with multiple other concepts. Here “mitochondrion inheritance” is a kind of “organelle inheritance”, which has the ID GO:0048308, and it is a kind of “mitochondrion distribution”, which has the ID GO:0048311. There is an empty line between two terms. As you can see, such a standard format makes it easy to write a computer program to parse the data. Another popular format for storing ontology is the XML format.
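Before we move on to XML, here is a minimal sketch of how the OBO stanzas described above might be parsed. The sample text is hand-typed and abbreviated for illustration (its definition line is shortened and is not a verbatim copy of the Gene Ontology file), and the function name `parse_obo_terms` is made up. The sketch only handles simple `tag: value` lines and ignores many details of the real OBO 1.2 format.

```python
# Hand-typed, abbreviated sample in the spirit of the OBO example from the lecture.
# It is illustrative only and not a verbatim excerpt of the Gene Ontology file.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria into daughter cells." []
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
"""


def parse_obo_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None  # the stanza being filled, or None outside a [Term]
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A stanza header: keep [Term] stanzas, skip any other kind.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if not line or current is None:
            continue  # blank line, or a line outside a [Term] stanza
        tag, _, value = line.partition(":")
        # Drop the trailing "! human-readable comment", if any.
        value = value.split(" ! ")[0].strip()
        current.setdefault(tag.strip(), []).append(value)
    return terms


TERMS = parse_obo_terms(SAMPLE_OBO)
for term in TERMS:
    print(term["id"][0], term["name"][0], term.get("is_a", []))
```

Every tag is stored as a list because some tags, such as “is_a” and “synonym”, can appear several times in the same term.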
The XML format is used not just in bioinformatics, but also in many other disciplines. It’s a commonly used data storage structure. Each concept in the Gene Ontology is called a “go:term”. A term has an accession number, which is like the ID in the OBO format. A term has a name, synonyms, and a definition, just like in the OBO format. The relationships with other terms are specified here, such as “isa”. Finally, the term is cross-referenced to related entries in other databases. Here is a real example of a part of an RDF-XML file, which is a specific type of XML file used in Gene Ontology. Each term starts with “<go:term”. The URL of the corresponding entry in the online Gene Ontology server is listed here. Each term ends with </go:term>. The slash signifies the end of it. You may notice that this structure looks very similar to an HTML file as well. The GO accession number is given here between <go:accession> and </go:accession>. The name, synonyms, and definition are given here. The relationships to other terms are specified here. In the second term shown here, the cross-reference to its corresponding entry in the InterPro database is specified as IPR009446. The definition of relationships deserves a closer look. The first relationship is “is a”. “B is a A” means B is a subtype of A. A is sometimes called a mother or parent node, and B is sometimes called a child node. For example, “mitochondrion inheritance” “is a” “organelle inheritance”; “pigmentation during development” “is a” “pigmentation”. In the Directed Acyclic Graph the “is a” relationship is shown with a letter “I” on the edge, with the arrow pointing to the mother node. The second relationship is “part of”, such as “B is a part of A”. For example, “ribosomal large subunit assembly” is “part of” “ribosome assembly”. “pigment metabolic process during pigmentation” is “part of” “pigmentation”.
In the Directed Acyclic Graph the “part of” relationship is shown with a letter “P” on the edge, with the arrow pointing to the mother node. The third relationship is “regulates”, such as “B regulates A”. There are two subrelationships: “positively regulates” and “negatively regulates”. For example, the “R” on the edge specifies that “regulation of pigmentation during development” “regulates” “pigmentation during development”. The R is on a black background. An “R” with a green background on an edge specifies that “positive regulation of pigmentation during development” “positively regulates” “pigmentation during development”. Yes, you guessed it. An “R” with a red background on an edge specifies that “negative regulation of pigmentation during development” “negatively regulates” “pigmentation during development”. Once the relationships are defined, you can reason over them, and more relationships can be deduced. If “A is a B” and “B is a C”, then “A is a C”. If “A is a B” and “B part of C”, then “A part of C”. If “A part of B” and “B is a C”, then “A part of C”. If “A part of B” and “B part of C”, then “A part of C”. If “A is a B” and “B regulates C”, then “A regulates C”. If “A regulates B” and “B is a C”, then “A regulates C”. If “A regulates B” and “B part of C”, then “A regulates C”. Similarly, if “A is a B” and “B positively regulates C”, then “A positively regulates C”. If “A positively regulates B” and “B is a C”, then “A positively regulates C”. If “A positively regulates B” and “B is part of C”, then “A positively regulates C”. If “A is a B” and “B negatively regulates C”, then “A negatively regulates C”. And so on. Gene Ontology is a very useful resource. All the data is freely available online at this web site. As of December 2013, Gene Ontology has about 40,000 terms, covering 2,800 species, with over 573,000 genes annotated.
The Gene Ontology defines about 76,000 relations, 83% of which are “is a” relations, 8% “part of” relations, and the remaining 9% “regulates” relations. The deepest branch has 12 levels. You can browse and search Gene Ontology from the AmiGO tools on the same web site. Browsing can be done in a tree-like structure. You can expand a mother node to see all its child nodes. You can clearly see the relationships between the nodes. You can search for GO terms. You can also search with a gene name to find out about its GO annotation. Here are some summary questions for you to think about. See you at the next unit!
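As a supplement to the reasoning rules covered earlier in this unit, here is a minimal sketch of how a program might deduce new relationships from stated ones. It encodes only the rules listed in the lecture; the relation spellings (`is_a`, `part_of`, `positively_regulates`, and so on) and the function names `compose` and `infer` are assumptions made for illustration, and the edges are the pigmentation examples from the slides. It is not the reasoner that the Gene Ontology itself uses.

```python
REGULATES = {"regulates", "positively_regulates", "negatively_regulates"}


def compose(first, second):
    """Relation implied by 'A first B' and 'B second C', or None if no rule applies."""
    if first == "is_a":
        return second      # is_a followed by X gives X
    if second == "is_a":
        return first       # X followed by is_a gives X
    if first == "part_of" and second == "part_of":
        return "part_of"   # part_of is transitive
    if first in REGULATES and second == "part_of":
        return first       # regulating a part regulates the whole
    return None


def infer(edges):
    """Apply compose() repeatedly until no new (A, relation, C) triple appears."""
    known = set(edges)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(known):
            for b2, r2, c in list(known):
                if b != b2:
                    continue
                relation = compose(r1, r2)
                if relation is not None and (a, relation, c) not in known:
                    known.add((a, relation, c))
                    changed = True
    return known - set(edges)


# Edges taken from the pigmentation examples in this unit.
PDD = "pigmentation during development"
EDGES = [
    (PDD, "is_a", "pigmentation"),
    ("regulation of " + PDD, "regulates", PDD),
    ("positive regulation of " + PDD, "positively_regulates", PDD),
    ("pigment metabolic process during pigmentation", "part_of", "pigmentation"),
]

INFERRED = infer(EDGES)
for triple in sorted(INFERRED):
    print(triple)
```

The loop stops once a full pass adds nothing new, which is guaranteed to happen because there are only finitely many possible triples over a fixed set of terms and relations.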