[SOUND] [MUSIC] So one of the things that we are supposed to do as the BD2K-LINCS Data Coordination and Integration Center is to find solutions for putting together all the big data that is collected in biomedical research, specifically molecular data at the genome-wide scale, trying to make sense of it, and developing tools that can be used to further extract knowledge from all this data. We have made great strides in this direction, and recently published two large reviews that lay out our plan for how we can leverage all of this data and put it together.

In the first review, we listed the most comprehensive resources of experimental data collected in the field, and categorized the data into seven subsections. The first is drug and gene knockdown followed by genome-wide expression, and this is what the LINCS program is all about: the Connectivity Map that we discussed, the work done by many of the LINCS centers, as well as the data deposited in the Gene Expression Omnibus. The next category is transcription factors and histone modifications profiled by ChIP-seq. There are two large-scale NIH projects, called ENCODE and the Roadmap Epigenomics Project, that systematically use ChIP-seq analysis to profile the binding of proteins onto the DNA in different human cell types and conditions. The next type of data is cell viability data after single-gene knockdowns or drug perturbations of many different human cell lines. The next type of data is knockout or mutation data and their association with disease. We now have more and more gene expression data from individual patients and from different tissues of those patients; the GTEx project provides such data. Combined with genomic sequencing projects like The Cancer Genome Atlas, these provide large collections of data at different regulatory layers from cancer patients, including genomics, transcriptomics, and proteomics from individual tumors, across many different types of cancers, for cohorts of hundreds of patients per cancer. There is also accumulated knowledge of protein-protein interactions and of metabolic and cell signaling pathways, which are continually extracted from the literature or can now be profiled with high-content screens. Finally, there is accumulated knowledge about drugs and toxic chemicals that cause adverse events and toxicity. Those provide links between small molecules and the human phenotype.

All this data can be converted to what we call attribute tables; single-entity-type networks, for example gene-gene association networks, also called functional association networks; gene set libraries, which we will discuss when we cover enrichment analysis; and bipartite graphs that connect genes to entities. These are just different views of the same data. So by collecting many of those datasets and abstracting them into attribute tables, bipartite graphs, gene sets, and single-node networks, the challenge of integrating all of those resources becomes easier and tractable. The ensemble of bipartite graphs, gene sets, and networks allows us to form connections between biological entities that are typically not identifiable by standard methods. Those can be of great interest to biomedical researchers, because they can reveal interesting relationships that are not obvious when looking at one dataset alone. Graph theory algorithms and machine learning methods can then be applied to draw novel inferences from this integrated data.
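To make these interchangeable views concrete, here is a minimal Python sketch. The genes (TP53, MDM2, CDKN1A) and the pathway and disease labels are placeholders for illustration, not data from the reviews. It derives a bipartite edge list, a gene set library, and a functional association network from one small binary attribute table:

    from itertools import combinations

    # Binary attribute table: rows are genes, columns are attributes
    attribute_table = {
        #           pathwayA  pathwayB  diseaseC
        "TP53":   [1,        1,        0],
        "MDM2":   [1,        0,        1],
        "CDKN1A": [0,        1,        1],
    }
    attributes = ["pathwayA", "pathwayB", "diseaseC"]

    # Bipartite graph: one gene-attribute edge per nonzero entry
    bipartite_edges = [
        (gene, attributes[j])
        for gene, row in attribute_table.items()
        for j, value in enumerate(row) if value
    ]

    # Gene set library: each attribute labels the set of its associated genes
    gene_set_library = {
        attr: {g for g, a in bipartite_edges if a == attr} for attr in attributes
    }

    # Functional association network: connect genes sharing >= 1 attribute
    gene_gene_edges = [
        (g1, g2)
        for g1, g2 in combinations(attribute_table, 2)
        if any(a and b for a, b in zip(attribute_table[g1], attribute_table[g2]))
    ]

    print(gene_set_library)  # {'pathwayA': {'TP53', 'MDM2'}, ...}
    print(gene_gene_edges)   # [('TP53', 'MDM2'), ('TP53', 'CDKN1A'), ...]

The point is that all of these structures carry the same underlying associations, so one family of set and graph algorithms can operate across every resource once it is abstracted this way.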
So data integration opens an opportunity to discover new connections among drugs, genes, diseases, tissues, and other biological entities, and this will become gradually clearer as we advance through the course. Let's look at the data structures that can be used for this unification of representation and analysis.

The first data structure, which is relatively obvious, is the bipartite graph. In this data structure you have two types of nodes; in our case, those are genes and the biological properties, or functions, that are associated with those genes. The most typical way that data is represented in biology is through attribute tables. In most cases, we have genes as the rows, and the biological conditions that measure the state of those genes, or those variables, such as gene expression or protein expression, as the columns. These could be measurements of different cell types, or of different drugs applied to those cell types. The numbers in those tables don't have to be zeros and ones; they can be absolute values, or differential changes compared to a control. We can always apply a threshold and decide when there is a connection. In other cases, a binary representation is the only possibility: for example, if we knock out a gene in a mouse and look at the possible phenotypes that the knockout can cause.

The gene set library representation is basically the same; it is just a transformation of the data, where the term, or label, of each set is the common function, and the genes are the members of each set. The label, for example, could be a pathway, and the genes in the pathway comprise the actual set. This can also be transposed, so that the genes become the labels and the terms become the set members. From this data we can also construct a network. If the nodes in the network are the genes, they can be connected based on their shared attributes, and these are called functional association networks. The attributes can also be connected to form attribute-attribute networks, which, for example, can be a network of disease terms. An adjacency matrix is a representation of a network using a matrix. General matrix and set operations can then be used to analyze, manipulate, and integrate datasets conforming to these data structures.

As we've seen throughout the course, data analysis and integration methods can be applied to those data structures, and these include clustering, classification with machine learning, benchmarking, graphical models, and the detection of biases. These topics are reviewed in more detail throughout the course. One thing that we haven't spoken about in the course is the skewed biases in the representation of genes. In one of the reviews, we show that different types of data have different biases in which genes are most commonly represented. If we look at literature-based data, the most well-studied genes, like p53, have many references, citations, and connections, whereas different genes are the ones most commonly identified when we apply ChIP-seq methods; the same is true for gene expression data or proteomics. Our BD2K-LINCS DCIC has already integrated and analyzed many of those datasets, and we now have a database called the Harmonizome, which can be accessed on this website.
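To make the matrix operations just described concrete, here is a short NumPy sketch with made-up numbers; the 1.5 cutoff and the matrix dimensions are arbitrary choices for illustration. Thresholding turns a continuous attribute table into the binary incidence matrix of a bipartite graph, and two matrix products recover the gene-gene and attribute-attribute adjacency matrices from it:

    import numpy as np

    # Continuous table: rows = genes, columns = conditions (e.g., drugs);
    # entries are differential expression values relative to a control.
    X = np.array([
        [ 2.1, -0.3,  1.8],
        [-1.9,  2.4,  0.2],
        [ 0.1,  1.7, -2.2],
    ])

    # Apply a threshold to decide when a gene-attribute connection exists
    B = (np.abs(X) > 1.5).astype(int)   # binary bipartite incidence matrix

    # Gene-gene network: genes connected through shared attributes
    gene_adjacency = B @ B.T
    np.fill_diagonal(gene_adjacency, 0)

    # Attribute-attribute network: conditions connected through shared genes
    attribute_adjacency = B.T @ B
    np.fill_diagonal(attribute_adjacency, 0)

    print(B)
    print(gene_adjacency)       # counts of shared attributes per gene pair
    print(attribute_adjacency)  # counts of shared genes per condition pair

The same two products work for any dataset conforming to the attribute table structure, which is what makes the shared representation so useful for integration.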
From the Harmonizome website, you can download over a hundred processed datasets extracted from over 60 publicly available resources. In the API presentations that follow, you will learn how to access this resource programmatically. We hope that you will also find this resource useful for applying various data analyses to the processed datasets, to shorten processing time and accelerate big-data-to-knowledge discoveries. In the next three lectures, Andrew Rouillard from our lab will describe how he processed and analyzed several of those resources in order to construct the Harmonizome database.
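As a preview of those API lectures, here is a minimal sketch of what programmatic access can look like from Python. The base URL reflects the Harmonizome address at the time of this course, and the endpoint paths and JSON field names ("entities", "symbol", "name") are assumptions based on the site's API documentation, so check the website for the current details:

    import json
    from urllib.request import urlopen

    # Base URL at the time of recording; see the website for the current one
    BASE = "http://amp.pharm.mssm.edu/Harmonizome/api/1.0"

    # Fetch the entity record for a single gene (TP53 used as an example)
    with urlopen(BASE + "/gene/TP53") as response:
        gene = json.load(response)
    print(gene.get("symbol"), "-", gene.get("name"))

    # Each processed dataset is also exposed as an entity; listings
    # are returned a page at a time
    with urlopen(BASE + "/dataset") as response:
        datasets = json.load(response)
    for entry in datasets.get("entities", [])[:5]:
        print(entry.get("name"))

[MUSIC]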