Hey, welcome back. Let's continue with our lectures on ontology and identification of molecular pathways. In the last unit we learned about the important concept of ontology. We looked closely at the Gene Ontology, which is a hierarchical controlled vocabulary for describing the molecular function, biological process, and cellular component of genes and gene products. There is another way of organizing and representing biological knowledge: organizing the molecules into biological pathways. We'll take a pathway database as an example. Let's continue with unit two of our lectures.

So what is a biological pathway? As you know, molecules don't work in isolation in our body. In fact, they work together in teams, just like people in a large factory that manufactures a product. Some people have a very specialized job. They take a half-finished product from the person on their left, add a part to it, and pass it down to the person on their right, who adds another part. There are also production managers walking around, controlling the pace of production and making sure that no one works too fast or too slow. There are sales managers who monitor the demand for the product on the market and pass this message on to the supply manager. The supply manager then brings more or less raw material to the production manager, who in turn speeds up or slows down production.

A biological pathway is similar. It is a series of actions among molecules in a cell that leads to a certain product or a change in the cell. There are three main types of pathways: metabolic pathways, which are like a factory's production assembly line; gene regulation pathways, which are like the production management; and signal transduction pathways, which are like the monitoring of the [INAUDIBLE] in cells and the transmission of that information to the supply manager and production manager.

Why do we need to have pathway databases?
Experimental biologists spend lots of time and effort discovering new components of pathways, new connections between the components, and even brand-new pathways. However, this knowledge used to be scattered across many papers in different formats, which made it hard to find. In the past few decades, some good-hearted bioinformatics scientists have taken the trouble to collect this knowledge into databases with graphical interfaces, so that biologists can now easily learn about hundreds of pathways by simply pointing and clicking on their computer. In addition, as we will see later in this week's lectures, pathway databases also enable computation and analysis that can discover important patterns above the level of individual genes.

So, what pathway databases are out there? We've listed here a few of the main pathway databases, including KEGG, BioCarta, BioCyc, PANTHER, PID, and Reactome. These are all great resources. In this unit, we'll take KEGG as an example. KEGG stands for Kyoto Encyclopedia of Genes and Genomes. It organizes data in several overlapping ways, including pathways, diseases, drugs, compounds, and so on. We will focus on KEGG pathways here. As of 2013, there are 450 reference pathways in KEGG.

KEGG pathways are divided into seven categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Some of these categories, such as human diseases, are relatively new and not yet comprehensive, but categories like metabolism are extremely useful. Each category is organized in a hierarchical structure. For instance, carbohydrate metabolism is a kind of metabolism, and starch and sucrose metabolism is a kind of carbohydrate metabolism. KEGG is a collection of mostly manually drawn pathway maps, like this starch and sucrose metabolism pathway.
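
The "is a kind of" hierarchy just described can be sketched in a few lines of Python. This is only an illustration, assuming a nested dictionary is an acceptable stand-in for the real database; the names KEGG_HIERARCHY and find_lineage are hypothetical, and only the categories mentioned in this lecture are filled in.

```python
# Toy model of the KEGG pathway category hierarchy described in the lecture.
# Only the categories named in the lecture are included; most branches are
# left empty. This is NOT a complete or official listing.
KEGG_HIERARCHY = {
    "Metabolism": {
        "Carbohydrate metabolism": ["Starch and sucrose metabolism"],
    },
    "Genetic information processing": {},
    "Environmental information processing": {},
    "Cellular processes": {},
    "Organismal systems": {},
    "Human diseases": {},
    "Drug development": {},
}


def find_lineage(tree, target, path=()):
    """Return the chain of categories from the top level down to `target`,
    or None if `target` does not appear anywhere in `tree`."""
    if isinstance(tree, dict):
        for name, subtree in tree.items():
            here = path + (name,)
            if name == target:
                return list(here)
            found = find_lineage(subtree, target, here)
            if found is not None:
                return found
    else:
        # A list holds the pathway names at the bottom of a branch.
        for name in tree:
            if name == target:
                return list(path + (name,))
    return None


print(" > ".join(find_lineage(KEGG_HIERARCHY, "Starch and sucrose metabolism")))
```

Storing the hierarchy as explicit structure, rather than as free text, is what makes a lookup like this a few lines of code.
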
The nodes marked with rectangles are gene products, mostly proteins but sometimes RNAs. In this pathway, most of the nodes are enzymes. Small circles denote other molecules, mostly chemical compounds such as the substrates. The pathway is also linked out to other upstream or downstream pathways. As you can see, there are many interactions between the nodes. Let's look at the interactions more closely.

There are several types of interactions in the pathway. The first type is protein-protein interactions. These include phosphorylation and dephosphorylation, marked with +p or -p on the arrow, and ubiquitination, glycosylation, and methylation, marked with +u, +g, or +m on the arrow. Activation and inhibition are marked with a standard arrowhead or a T-shaped head. Other types of effects, such as indirect effects, state change, binding or association, and dissociation, are marked with different arrows. Finally, protein complexes are shown as grids. A second type of interaction is gene expression regulation, including expression and repression, either through a chemical compound or directly, as well as indirect regulatory effects. A third type of interaction is enzyme-enzyme relationships, such as two consecutive reaction steps, shown here.

This is a pathway entry page that you'd see on the web. The pathway entry is stored in two formats. The first is a simple flat-file format, very similar to what you see on the web. The second, more informative format is the KEGG Markup Language, or KGML, format. Here a pathway has properties such as its basic information: name, organism, number, et cetera. It contains a number of elements such as entries, reactions, and relations, while substrates and products are defined as features of reactions. One feature of each entry is the graphics, including the coordinates, label, shape, size, and color, which are used to generate the nice graphical representation of the pathway that we looked at earlier. In KGML format, it's stored in a computer like this.
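
As a rough illustration of the KGML structure just described, here is a minimal Python sketch that parses a tiny hand-written KGML-like snippet with the standard library. The element and attribute names (pathway, entry, graphics, relation, subtype, reaction, substrate, product) follow KGML, but every ID, coordinate, and color in the snippet is a made-up placeholder rather than real KEGG data, and parse_kgml is a hypothetical helper name, not part of any KEGG tool.

```python
import xml.etree.ElementTree as ET

# Hand-written KGML-like snippet. All IDs, labels, coordinates, and colors
# below are invented placeholders for illustration, not real KEGG records.
KGML_SNIPPET = """
<pathway name="path:hsa00500" org="hsa" number="00500"
         title="Starch and sucrose metabolism">
  <entry id="1" name="hsa:1111" type="gene" reaction="rn:R99991">
    <graphics name="ENZYME_A" type="rectangle" x="120" y="80"
              width="46" height="17" fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="2" name="hsa:2222" type="gene" reaction="rn:R99992">
    <graphics name="ENZYME_B" type="rectangle" x="280" y="80"
              width="46" height="17" fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="3" name="cpd:C99991" type="compound">
    <graphics name="C99991" type="circle" x="60" y="80"
              width="8" height="8" fgcolor="#000000" bgcolor="#FFFFFF"/>
  </entry>
  <entry id="4" name="cpd:C99992" type="compound">
    <graphics name="C99992" type="circle" x="200" y="80"
              width="8" height="8" fgcolor="#000000" bgcolor="#FFFFFF"/>
  </entry>
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="4"/>
  </relation>
  <reaction id="1" name="rn:R99991" type="irreversible">
    <substrate id="3" name="cpd:C99991"/>
    <product id="4" name="cpd:C99992"/>
  </reaction>
</pathway>
"""


def parse_kgml(text):
    """Collect pathway properties, entries, relations, and reactions
    from a KGML string into plain Python dictionaries and lists."""
    root = ET.fromstring(text.strip())
    pathway = {key: root.get(key) for key in ("name", "org", "number", "title")}

    entries = []
    for entry in root.findall("entry"):
        graphics = entry.find("graphics")
        entries.append({
            "id": entry.get("id"),
            "name": entry.get("name"),
            "type": entry.get("type"),
            # Graphics attributes drive how the node is drawn on the map.
            "graphics": dict(graphics.attrib) if graphics is not None else {},
        })

    relations = [{
        "entry1": rel.get("entry1"),
        "entry2": rel.get("entry2"),
        "type": rel.get("type"),
        "subtypes": [sub.get("name") for sub in rel.findall("subtype")],
    } for rel in root.findall("relation")]

    # Substrates and products are nested inside each reaction element.
    reactions = [{
        "name": rxn.get("name"),
        "type": rxn.get("type"),
        "substrates": [s.get("name") for s in rxn.findall("substrate")],
        "products": [p.get("name") for p in rxn.findall("product")],
    } for rxn in root.findall("reaction")]

    return {"pathway": pathway, "entries": entries,
            "relations": relations, "reactions": reactions}


parsed = parse_kgml(KGML_SNIPPET)
print(parsed["pathway"]["title"])
for entry in parsed["entries"]:
    g = entry["graphics"]
    print(entry["name"], entry["type"], g.get("type"), g.get("x"), g.get("y"))
```

Because each entry carries its own graphics attributes, a program can redraw or recolor a node without touching the rest of the pathway.
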
An example of an entry is shown here, following this format. The KEGG pathways can be browsed through their hierarchical structure. You can also search for a pathway of interest. I mentioned earlier that each entry has a feature called graphics. The user can input a list of genes that she wants to highlight by specifying the gene ID and the background and foreground colors, and the genes will be highlighted in the pathway map. So you see, the underlying computer representation of the pathways enables a flexible and user-friendly interface. You cannot do this kind of searching and browsing easily if you store your data in free-text files without a well-defined data structure.

Talking about structure, KEGG also defines something that is somewhat similar to the Gene Ontology, called the KEGG Orthology, or KO, which describes gene functions in a hierarchical controlled vocabulary. KO has four levels. The top level, shown here, has a few very broad categories. If you click on metabolism, you will see that it has a number of subcategories such as carbohydrate metabolism, energy metabolism, and so on. If you click on carbohydrate metabolism, you'll see a list of different types of carbohydrate metabolism. If you click on starch and sucrose metabolism, you see the bottom level of KO, which is the list of gene products that have functional roles in starch and sucrose metabolism.

To end this unit, here are some summary questions for you to think about. We talked at length about the representation of the Gene Ontology and KEGG pathways in the computer. But how are individual genes associated with Gene Ontology terms and KEGG pathways? That is the topic of our next unit. I look forward to seeing you then.