Alright, this week we are going to be exploring protein-protein interactions. In this lecture we'll talk about why we want to study protein-protein interactions, methods for determining protein-protein interactions, protein-protein interaction databases, the properties of protein-protein interaction networks, and tools for investigating protein-protein interaction networks.
So if we consider the cell as a city, how could we go about describing the city? We could describe where the people live, how they get from A to B, how they interact, the methods that they use to get from A to B... these kinds of parameters. We could also consider the cell as a circuit, and here I've depicted a Colpitts oscillator as an electronic diagram. You can see that we've got a transistor here and various other parts: a couple of resistors, a capacitor, a power supply. And it's important to know how these parts are connected in order to be able to figure out what that circuit is going to do. Now ultimately, all we care about in terms of an electronic circuit, in terms of our iPhone say, is what happens when we turn it on. Can we talk into it? Can we hear music from it? And this is actually what happens when you turn on the Colpitts oscillator: we get an oscillating signal coming out of it.
We're sort of approaching that level of understanding for biology, and that's the area of systems biology: understanding biological systems as a collection of parts in order to understand how things respond, how we respond to environmental cues. An important step to being able to get there, however, is to know how things are wired, and that's why we want to figure out how the parts of the cell are wired together, so that hopefully, down the road, we can actually understand how the response happens.
So proteins do not typically operate in isolation but rather as part of larger complexes, or as signal transduction cascades or metabolic modules. So we need to elucidate PPIs, protein-protein interactions, to understand the biological system in question. Such studies can tell us whether a given protein is a key player or a peripheral member of a given system, and how many interactors that protein has.
So how can we determine protein-protein interactions? There are many classical biochemical methods: chromatographic methods and so on. We can also get a lot of information from the literature, where people have used these methods to determine protein-protein interactions. We can do yeast two-hybrid studies, followed by clone sequencing. We can do affinity purification: TAP tagging, followed by mass spectrometry to determine the interactors. We can infer interactions based on orthology, and we can also use some other high-throughput methods that I'll talk about in a second.
So, in the case of Y2H (yeast two-hybrid), what we're doing is taking the GAL4 protein and splitting it into two parts: the activation domain and the DNA-binding domain. The activation domain is shown here as this pink dot and the binding domain is shown as this green blob. We attach one protein, here called the bait, to the binding domain, and we attach another protein (the prey) to the activation domain. If the bait and the prey interact, then we basically reconstitute the function of the GAL4 protein, bringing the activation domain into proximity with the start of transcription, and we get a transcriptional read-out from a reporter gene such as LacZ.
In the case of TAP tagging, what we're doing is adding a small tag to the end of a protein, the target protein. And we can actually add a couple of tags. So here's the calmodulin-binding peptide, and this is protein A. And that's actually the tandem part of the acronym.
Tandem Affinity Purification (TAP). We introduce this construct into a living system and then, after disruption of the system in question, we can pull down the protein using this tag, the protein A tag, hopefully still attached to the other proteins with which it interacts. Then we can do a second round of purification using the calmodulin-binding peptide to get a highly purified enrichment of the interactors of our protein of interest, and we can determine what those interactors are by mass spectrometry.
So there are several advantages and disadvantages to yeast two-hybrid systems. The advantages are that it's very amenable to automation: we can do a lot of screens in a high-throughput manner. It's a yeast-based system, so creating the clones is very rapid, and doing the tests is quite rapid. The disadvantages are that it's a somewhat artificial system, in the sense that we're targeting our proteins to the nucleus of yeast and typically they're over-expressed, so if the protein is inherently sticky, we might get a lot of false positives out of the system. It's also not a great way to determine non-binary interactions. It's great for determining binary protein-protein interactions, but if we want to determine the membership of a complex, we might want to use TAP tagging. One other advantage of yeast two-hybrid is that it tends to be quite good for detecting transient interactions, the kinds of interactions that occur in signaling pathways.
Now, the advantages of TAP tagging are that it's typically performed in in vivo systems. We introduce this construct back into the organism from which the protein was originally isolated. We might express it at endogenous, native levels, so we're not over-expressing it, and therefore we can hopefully avoid the problem of stickiness that might occur with yeast two-hybrid.
The disadvantage of this method is that we have to introduce the construct into an in vivo system. If we're working with mice, or some other larger organism, it might take a lot of time to generate the appropriate organism with which we can do the TAP tagging. The advantage of TAP tagging in terms of identifying protein-protein interactions is that we can actually identify complexes quite well. These details are described in one of the grey boxes in the lab.
There are other experimental methods for determining protein-protein interactions, and they're listed in this table here. (This table is sorted by whether the methods are high throughput or low throughput.) High-throughput methods include yeast two-hybrid and affinity purification mass spectrometry, the two I just told you about, as well as DNA microarrays and gene co-expression. We'll talk a little bit about that in the gene expression analysis lecture and lab. Protein microarrays are another method, whereby proteins are spotted onto membranes and then we can wash a different protein over those membranes to see which proteins it binds to on the array. We can also use synthetic lethality or phage display. Low-throughput methods include X-ray crystallography: we can co-crystallize two proteins and see how they interact, and that's great, but it's definitely a very low-throughput method. We can use FRET, surface plasmon resonance, atomic force microscopy and electron microscopy. These are all quite low-throughput methods.
Right. So, as I mentioned, we can also use the concept of interacting orthologs to predict whether or not two proteins interact, and that's what my lab has done, in collaboration with Matt Geisler at Southern Illinois University. What we did here is we took the genome sequence databases from four organisms: yeast, fly, worm and human.
We took the Arabidopsis genome and we computed the orthologs using a piece of software called INPARANOID to come up with a list of orthologs of Arabidopsis genes in these other species. Then we took the interactome databases of those four species and did a match-replace for the orthologs to come up with a predicted Arabidopsis interactome, with a score associated with each predicted interaction. And then we validated this predicted interactome by, on the one hand, doing co-expression analysis and looking at whether or not the interacting orthologs were co-expressed, and on the other hand, looking at whether the interacting orthologs have the same subcellular localization. That's what I just said on the previous slide, just in point form here.
In terms of the colocalization of the Arabidopsis interologs, if we look at part of the network of the interacting orthologs and we colour the nodes (the nodes represent the proteins) according to their subcellular localization, we can already start to see that nodes of the same colour tend to cluster together where they share interactions. So just from a visual observation of the network, where again the proteins are represented by the nodes and the interactions are represented by the edges, we see that proteins that are in the same compartment do tend to interact, or to put it the other way, proteins that are predicted to interact do seem to be in the same compartment. We can actually test this statistically, and that's described in this paper here. We see definite enrichment for interacting orthologs being in the same compartment, as shown along the diagonal here by this red colouring, where we have significant enrichment (P less than 0.01) for them being in the same compartment. And in fact, we actually see depletion in the case of the interacting orthologs being in different compartments.
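The ortholog match-replace step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline used in the study: the protein identifiers are made up, the function name `predict_interologs` is hypothetical, and the score (the number of source interactions supporting each predicted pair) is an assumed stand-in for whatever scoring scheme was actually used.

```python
from itertools import product

def predict_interologs(orthologs, interactions):
    """Map known interactions onto a target species through orthologs.

    orthologs:    dict mapping a source-species protein ID to the set of
                  its orthologs in the target species.
    interactions: iterable of (protein_a, protein_b) pairs known to
                  interact in the source species.
    Returns a dict mapping each predicted target-species pair (a sorted
    tuple) to the number of source interactions that support it.
    """
    predicted = {}
    for a, b in interactions:
        # Every combination of orthologs of a and b is a predicted pair.
        # If either protein has no ortholog, nothing is predicted.
        for x, y in product(orthologs.get(a, ()), orthologs.get(b, ())):
            pair = tuple(sorted((x, y)))
            predicted[pair] = predicted.get(pair, 0) + 1
    return predicted

# Hypothetical toy data: two source species, three target proteins.
orthologs = {
    "yeastA": {"AT1"},
    "yeastB": {"AT2"},
    "flyA": {"AT1"},
    "flyB": {"AT2", "AT3"},
}
interactions = [("yeastA", "yeastB"), ("flyA", "flyB"), ("flyB", "flyC")]

for pair, support in sorted(predict_interologs(orthologs, interactions).items()):
    print(pair, support)
```

In this toy example the AT1/AT2 pair is supported by interactions from both source species, so it gets a higher score than AT1/AT3, which is supported by only one.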
We also asked the question of whether or not the interacting orthologs are co-expressed. If the proteins are going to interact, then they probably should be expressed at the same time in the same place. We used a Pearson correlation coefficient to compute this co-expression score, and we used a compendium of gene expression data across about 1,000 different conditions, tissues, responses to abiotic stress, and so on, to calculate it. Then we compared this to three random data sets: random from the entire proteome, random from the interolog dataset, and random with the same topology as the interolog network.
And these are the results. This is a distribution graph with the Pearson correlation coefficient along this axis, and there are two important curves on it. The first is the blue line here, which denotes the distribution of the Pearson correlation coefficient scores for our predicted interologs, the predicted interactors. We see that the Pearson correlation score is around 0.8 on average for these predicted interactors. A score of one means that the genes are perfectly co-expressed: they're always on at the same time in the same place. A score of zero means that the genes are not at all correlated in terms of their expression pattern. And a score of minus one would mean that the genes are anti-correlated in terms of their expression pattern. The other important line here is this purple line, and that is the random network that we generated. We see that the Pearson correlation coefficient is approximately 0.2 on average for this random network. So we do see a distinct difference between our predicted interactors and our random network in terms of the co-expression scores, and this gives us another level of support for our predicted interactors. So how can we use this predicted interaction network?
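As a brief aside on the co-expression score itself, the Pearson correlation coefficient between two expression profiles can be sketched as follows. This is a minimal, assumption-laden example: the expression values are invented, the function name `pearson` is hypothetical, and a real analysis would run over the full expression compendium rather than four conditions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Hypothetical expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises and falls together with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # does the opposite of gene_a

print(pearson(gene_a, gene_b))  # perfectly co-expressed pair
print(pearson(gene_a, gene_c))  # perfectly anti-correlated pair
```

Computing this score for every predicted interacting pair, and again for randomly chosen pairs, gives the two distributions that are compared on the graph.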
We can extend known pathways. So this is a small subset of the SNARE-Syntaxin network, which is involved in vesicle trafficking, from an interaction database called BIND. The original network is based on these literature examples in BIND, and we can extend what's known quite dramatically with our predictions.
In terms of protein-protein interaction network topology, PPI networks tend to be scale free and follow a power law distribution with respect to the connectivity of the nodes. What this means is that there are relatively few nodes with a high degree of connectivity, and these are the hub proteins, and there are many more nodes with a low degree of connectivity. Connectivity here is just the number of interactions radiating out from a given node. So in this case, this node here in grey would have one, two, three, four, five, six connections to it, whereas this node in white would only have one connection, so a degree of connectivity of one. There are some other terms for describing networks, including betweenness, which is how many network paths pass through a given node; connectivity, which is how many nodes or edges need to be removed to disconnect the remaining nodes from each other; and the degree, which I just told you about. We can also generate clusters of protein-protein interactions using some of these parameters as cut-offs. And then with those clusters we can ask the question: is there enrichment for any particular term associated with that cluster?
More than 50 protein-protein interaction databases exist for several model organisms. It's important to know how the data in the database were generated and, if the database aggregates protein-protein interaction data, whether it provides a link back to the primary data source, the literature reference.
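The node degree and degree distribution described above can be illustrated with a short Python sketch. The network below is a made-up toy, loosely modelled on the grey hub node and the white single-connection node from the slide; the function names are hypothetical, and self-interactions are ignored as a simplifying assumption.

```python
from collections import Counter, defaultdict

def node_degrees(edges):
    """Return the degree (number of distinct partners) of each node."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # assumption: ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}

def degree_distribution(degrees):
    """Count how many nodes have each degree."""
    return dict(Counter(degrees.values()))

# Hypothetical toy network: "grey" is a hub, "white" has a single partner.
edges = [
    ("grey", "p1"), ("grey", "p2"), ("grey", "p3"),
    ("grey", "p4"), ("grey", "p5"), ("grey", "white"),
    ("p1", "p2"),
]

degrees = node_degrees(edges)
print(degrees["grey"], degrees["white"])
print(degree_distribution(degrees))
```

Even in this tiny network most nodes have a low degree and a single node has a high one. In a real PPI network, plotting the degree distribution on log-log axes is a common way to check for the power law behaviour mentioned above.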
There are several tools available for analyzing protein-protein interaction data, and we'll be using one called Cytoscape in the lab today. It's a fairly powerful tool. There are several initiatives to provide protein-protein interaction data on the fly for use in tools such as Cytoscape. One of these is the PSICQUIC initiative, which aims to make queries across interaction and other proteomics databases seamless for easier programmatic access, that is, easier access by computational tools. And web services in general are kind of where bioinformatics is going, so that we don't have to maintain local databases ourselves.
One thing that we'll do in this protein-protein interaction lab is to ask the question: "Are the protein-protein interactors enriched for a certain set of terms?" The terms that we're testing for enrichment are called Gene Ontology terms. An ontology is a controlled vocabulary for describing a knowledge system, in the case of the Gene Ontology for classifying genes, or gene products, actually. There are three main organizing principles: molecular function, biological process and cellular component. So, in the case of the Gene Ontology "biological process", we can see several categories underneath, like cell growth and/or maintenance, with more specific terms further down, such as nuclear division. And nuclear division, in turn, would contain some subset of the genes of a given organism. So the Gene Ontology is organism agnostic, and that's actually quite nice, because then you can start to compare between organisms.
Now, GO was initially proposed in 1998 by Michael Ashburner for an ISMB bioinformatics conference in Montreal. Its aims were to develop a set of shared vocabularies of terms that describe aspects of molecular biology and that are common to more than one life form. To describe gene products held in each contributing model organism database.
And to provide a scientific resource for access to the vocabularies, the annotations and the associated data, and to provide a software resource to assist in the curation of GO term assignments to biological objects.
The structure of GO is a directed acyclic graph, and I'll show you what that means in a minute. It simply means that one child term can have more than one parent term. We can explore GO using godatabase.org. Here's an example of the GO biological process for DNA metabolism. We see several sub-categories underneath that, such as DNA degradation, DNA replication and DNA recombination. And under, say, the category DNA ligation, we could have genes from yeast in blue, genes from Drosophila in magenta and the corresponding genes from mouse in red. So really it doesn't matter what the organism is: they can all use the Gene Ontology term for DNA ligation to ascribe a particular function to a gene or set of genes. The other aspect of the Gene Ontology that I was just mentioning is the directed acyclic nature of the graph. We see that DNA ligation actually has three parent terms: DNA recombination, DNA repair and DNA-dependent DNA replication. So it's not a strict hierarchy, where each term would have a single parent. It is a directed acyclic graph, which makes it very flexible.
In order to assess whether or not any particular GO categories are enriched in a particular set of proteins that interact, we can use a hypergeometric P value. This is just a bit of information from the Excel help file. We can calculate the hypergeometric test score using this function in Excel, HYPGEOMDIST, and it needs four parameters. We need sample_s, which is the number of genes in your list with a given function, and we need the number in the sample, which is the total number of genes in our list of interest.
We need the total number of genes with a given function among all the genes available for sampling, and we need the number in the population, which is the total number of genes available for sampling. So in the case of our GO test, what we're doing is entering the number of genes in our list with a particular GO term, that's this value here, and the total number of genes in the list, this number here. Then, for the overall set of genes or gene products, we're entering the total number of genes with that function and finally the total number of genes in the population, if that makes sense.
So we can use the P value to assess whether or not a particular GO category is enriched in a set of interacting proteins, and that will help us to get an idea of what that set of proteins might do. You might be asking, well, why do we care about that? The reason is that oftentimes these protein-protein interactions are determined in the absence of any information about the biology. In the case of the yeast two-hybrid system, we're not really focused on any particular aspect of biology; we just generate information on whether or not a given pair of proteins interact. And so methods such as Gene Ontology enrichment can be really helpful for making sense of the large data sets that are generated.
That's this week's lecture. I hope you enjoy the lab. Thank you.
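For reference, the hypergeometric calculation described in this lecture can be sketched in Python as below. The four arguments correspond to the four numbers entered into Excel's HYPGEOMDIST; the gene counts in the example are invented. Note that HYPGEOMDIST returns the probability of seeing exactly the observed number of annotated genes, whereas an enrichment P value is commonly taken as the probability of seeing that many or more, so the sketch shows both. The function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, n, K, N):
    """Probability of exactly k annotated genes in a list of n genes,
    when K of the N genes in the population carry the annotation."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def enrichment_p_value(k, n, K, N):
    """Upper-tail probability of k or more annotated genes in the list."""
    upper = min(n, K)  # cannot see more hits than the list or the term holds
    return sum(hypergeom_pmf(i, n, K, N) for i in range(k, upper + 1))

# Hypothetical numbers: 2 of the 4 genes in our list carry a GO term
# that annotates 5 of the 20 genes in the whole population.
k, n, K, N = 2, 4, 5, 20
print(hypergeom_pmf(k, n, K, N))       # point probability for exactly 2 hits
print(enrichment_p_value(k, n, K, N))  # probability of 2 or more hits
```

With real data the population would be all annotated genes of the organism, and because many GO terms are tested at once, some correction for multiple testing is usually applied to the resulting P values.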