[SOUND] In this lecture I will introduce several more data sets that we processed and converted to gene set libraries, gene-gene networks, attribute tables, and bipartite graphs.
The Human Endogenous Complexome data set reports proteins identified by mass spectrometry that were recovered from 3000 immuno-precipitations. Each immuno-precipitation provides a snapshot of the proteins that participate in one or more complexes with the protein that was targeted in the immuno-precipitation. This data set is useful for learning physical interactions, which could be direct or indirect, among proteins. Many of the immuno-precipitations targeted DNA binding proteins and their co-regulators, so this data set is especially useful for learning about protein complexes that regulate gene expression. The Human Endogenous Complexome is structured as an unweighted gene set library, where the protein targeted by each antibody labels the set of proteins detected in the pull-down. A gene set library can also be represented as a bipartite graph. The items in the sets, in this case the proteins detected in the pull-down, form one set of nodes, while the items labeling the sets, in this case the immuno-precipitations, form the other set of nodes. Edges are drawn from each item in a set to the item labeling the set. By applying an algorithm such as Sets2Networks, data of this kind can be used to infer pairwise interactions between the pull-down proteins. This heat map is a visualization of the Human Endogenous Complexome data set. Proteins label the rows, immuno-precipitations label the columns, and the colored regions indicate which proteins were detected in which immuno-precipitations. This information is useful for understanding which proteins work together.
The Online Mendelian Inheritance in Man data set is a collection of gene-phenotype and gene-disease associations curated from the biomedical literature.
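For the gene set library described above for the Human Endogenous Complexome, the conversion to a bipartite graph can be sketched in a few lines. This is only an illustration: the bait and protein labels are made up, and the pair counting is a naive co-occurrence tally used here as a stand-in, not the algorithm mentioned in the lecture.

```python
from collections import Counter
from itertools import combinations

# Toy gene set library (made-up labels, not real data): each key is the
# protein targeted in a pull-down, each value is the set of proteins
# detected in that pull-down.
gene_set_library = {
    "BAIT_A": {"P1", "P2", "P3"},
    "BAIT_B": {"P2", "P3", "P4"},
    "BAIT_C": {"P1", "P5"},
}


def to_bipartite_edges(library):
    """Return sorted (item, set_label) edges, one per set membership."""
    return sorted(
        (item, label) for label, items in library.items() for item in items
    )


def cooccurrence_counts(library):
    """Count how many sets each unordered pair of items shares.

    This is a simple assumption-laden proxy for pairwise association,
    not a published network inference method.
    """
    counts = Counter()
    for items in library.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


edges = to_bipartite_edges(gene_set_library)
pair_counts = cooccurrence_counts(gene_set_library)
```

Pairs that are detected together in many pull-downs receive higher counts, which is the intuition behind inferring interactions from sets, although a real method would also need to account for how often each protein is detected overall.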
Each association is supported by evidence showing that mutation of a gene has some effect on a phenotype or disease. The OMIM data set is structured as a list of disease-gene pairs, which is already in the form of an unweighted bipartite graph. This heat map is a visualization of the OMIM data set. Genes label the rows, diseases and phenotypes label the columns, and colored regions indicate which genes are associated with which diseases. Overall, the information about disease genes is quite sparse, which makes it difficult to draw inferences from this data set and makes computational methods that can do a good job of predicting disease genes very valuable. This data set can be made more useful by mapping the disease names to an ontology, such as the Disease Ontology. This would allow gene similarities to be discovered by connecting genes through more general disease terms.
The Proteomics Database (ProteomicsDB) data set contains protein expression measurements for many cell lines and tissues. The Human Proteome Map is another source of protein expression data. Given the diversity of biological functions performed by different cell types in tissues, a survey of protein expression across a variety of samples is naturally a very good way of learning functional associations among proteins. Furthermore, given the tissue specificity of many diseases, this data could be useful for prioritizing disease genes. The ProteomicsDB data set is structured as a matrix with proteins labeling the rows, cell types or tissues labeling the columns, and protein expression levels filling the matrix. As we have seen from the previous examples, a matrix like this can be interpreted as a bipartite graph, where the rows, in this case proteins, form one set of nodes, the columns, in this case cells and tissues, form the other set of nodes, and the matrix values define the weights of the edges connecting the proteins to cells and tissues. As is evident from this example, proteomics data tend to contain many missing values, shown as zeros in the matrix, which makes quantitative analysis challenging. One approach is to threshold the data to extract associations between proteins and the cells or tissues in which those proteins are expressed at higher or lower levels than usual. This heat map is a visualization of a bipartite graph derived from the ProteomicsDB data set. Red regions of the heat map indicate cells and tissues with higher protein expression than usual, and blue regions indicate cells and tissues with lower expression than usual. Many proteins have similar patterns of expression across tissues, which suggests that they are functionally similar. Such information could be useful in combination with the OMIM data set, for example, to attempt to infer novel disease genes. Given these two data sets, we could design a machine learning algorithm to attempt to predict whether two genes that have similar protein expression across tissues are likely to have roles in similar diseases.
The Allen Brain Atlas data sets contain gene expression measurements for human and mouse brain tissues at several developmental time points. As argued for the protein expression data we just discussed, a survey of gene expression across a variety of tissue samples is naturally a very good way of learning functional associations among genes. The Allen Brain Atlas data sets are structured as matrices with genes labeling the rows, brain tissues labeling the columns, and gene expression levels filling the matrix. As discussed previously, we can extract from this matrix a bipartite graph connecting genes to brain tissues in which those genes are expressed higher or lower than average. This heat map is a visualization of a bipartite graph derived from one of the Allen Brain Atlas data sets. We see that the genes cluster by their pattern of differential expression across brain tissues, which suggests that the data contain information about the functional similarity of genes.
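The thresholding step described for the expression matrices can be sketched as follows. This is a minimal illustration under stated assumptions: "higher or lower than usual" is interpreted here as a per-row z-score beyond a cutoff, the cutoff of 1.5 is arbitrary, the labels and values are made up, and missing values are not handled specially.

```python
from statistics import mean, pstdev

# Toy expression matrix (made-up values): rows are genes or proteins,
# columns are tissues.
tissues = ["T1", "T2", "T3", "T4"]
expression = {
    "G1": [1.0, 1.0, 1.0, 5.0],
    "G2": [2.0, 2.0, 2.0, 2.0],
    "G3": [4.0, 1.0, 4.0, 4.0],
}


def threshold_to_edges(matrix, columns, cutoff=1.5):
    """Turn a matrix into signed bipartite edges (row, column, sign).

    Each row is standardized across columns; an edge with sign +1 is
    emitted when the z-score is at or above the cutoff, and an edge with
    sign -1 when it is at or below the negative cutoff.
    """
    edges = []
    for row_label, values in matrix.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # flat row: no column stands out
        for column_label, value in zip(columns, values):
            z = (value - mu) / sigma
            if z >= cutoff:
                edges.append((row_label, column_label, 1))
            elif z <= -cutoff:
                edges.append((row_label, column_label, -1))
    return edges


edges = threshold_to_edges(expression, tissues)
```

The +1 and -1 signs correspond to the red and blue regions of the heat maps. Other definitions of "usual", such as a column-wise or global baseline, would be equally reasonable choices.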
We can use this information for a variety of inference problems, for example, to predict disease genes, to predict phenotypes of gene knockouts, to predict pathways in which gene products participate, and more.
The Genotype-Tissue Expression (GTEx) project data set contains expression quantitative trait loci. Expression quantitative trait loci are regions of the genome where DNA sequence variation is correlated with variation in the expression of a gene or set of genes. Expression quantitative trait loci mapping is very important for improving our understanding of how sequence variation in non-protein-coding regions of the genome affects molecular biology and human health. The GTEx data set is structured as a list of gene, SNP, p-value triplets. This places it in the form of a weighted bipartite graph, where the p-values can be used to define the weights of edges connecting genes to SNPs. Alternatively, a significance threshold can be applied to create an unweighted bipartite graph. This heat map is a visualization of the GTEx data set. Genes label the rows, SNPs label the columns, and colored regions indicate which SNPs affect the expression level of which genes. Overall, the information about expression quantitative trait loci appears to be quite sparse, indicating a need for more data of this kind. The apparent clustering of the data into horizontal bars indicates that genes tend to have several expression quantitative trait loci, but these are rarely shared with other genes. More insight may be drawn from this data by mapping the SNPs to known regulatory sequences such as promoters, enhancers, and insulators, or by analyzing the regions around the SNPs for sequence motifs. [MUSIC]
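The two graph constructions described for the gene, SNP, p-value triplets can be sketched as follows. The identifiers and p-values are made up, and using the negative base-10 logarithm of the p-value as the edge weight, as well as the 1e-5 significance threshold, are assumptions chosen for illustration rather than details given in the lecture.

```python
import math

# Toy (gene, SNP, p-value) triplets with made-up identifiers and values.
triplets = [
    ("GENE_A", "snp_1", 1e-8),
    ("GENE_A", "snp_2", 3e-6),
    ("GENE_B", "snp_3", 0.04),
    ("GENE_C", "snp_4", 2e-10),
]


def weighted_edges(rows):
    """Weighted bipartite graph: edge weight is -log10(p).

    Smaller p-values give larger weights. A p-value of exactly zero
    would raise an error here and would need special handling.
    """
    return {(gene, snp): -math.log10(p) for gene, snp, p in rows}


def unweighted_edges(rows, alpha=1e-5):
    """Unweighted bipartite graph: keep edges with p at or below alpha."""
    return {(gene, snp) for gene, snp, p in rows if p <= alpha}


weights = weighted_edges(triplets)
significant = unweighted_edges(triplets)
```

In the unweighted version, a gene with several significant SNPs contributes several edges in the same row, which matches the horizontal bars visible in the heat map.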