Hello? This week we've been looking at networks as dynamic structures that change over time. In the past two videos, we looked at different models of network evolution that give you the dynamics for how a network evolves from the beginning to the end. Today we're going to look at a related problem which is, given a fixed network, can we predict how these network is going to look into future? More specifically, we're going to be looking at the link prediction problem, which is, given a network, can we predict which edges are going to be formed in the future? This problem has a lot of applications. For example, if your Facebook and you want to create a friend recommendation algorithm where a friend or a friend recommender to tell people who they should friend. Then basically you're solving this problem. You're looking at the current Facebook network and you're trying to predict which new friendships are likely to form in the future. Let's take this small network as an example and let's try to ask the question, what new edges are likely to be formed? More specifically, let's formulate the problem in this way. Let's say that we take a specific pair of nodes and we want to assess whether or not they're likely to connect. What we're going to do today is we're going to go over several measures, seven them, to try to make this assessment, to try to decide whether a specific pair of nodes are likely to become friends or not. We've actually thought about these kind of question before, if you remember, we talked about triadic closure, which is the tendency for people who share connections in a social network to become connected themselves. A triadic closure actually gives us a hint for the first measure that we're going to look at is and that is very simple measure. Just look at the number of common neighbors that that the two nodes have. Now this is very simple, but let me define this measure to introduce some notation. We're going to say that the number of common neighbors of nodes X and Y is going to be the size of a set, which is the intersection of the sets NX and NY, where NX defines the set of neighbors of the node X. For example, the common neighbor measure for nodes A, C in this network is going to be two because nodes A and C have two common neighbors, that is node B and node D so AC have two common neighbors. In network X, we can use the function common neighbors, which takes in as input the graph two nodes and it outputs an iterator of all the common neighbors of the two nodes. Here what I'm doing is I'm creating a list of tuples which have the two nodes and the number of common neighbors. I'm only including the nodes that are not connected with each other. The ones that don't have an edge between them by using the function non edges. If I sort this list, we can see the pairs of nodes that have the most common neighbors between them. We see that the pair AC has two neighbors in common. There are many others that have only one or in common, and then many others that have zero neighbors in common. If we start to compare between different edges, so for example, we look at the pair AG and the pair HI and ask which one of these two is more likely to become connected. Then by looking at the number of common neighbors, we actually can't tell because both of these have exactly one neighbor in common. Let's look at other measures that may potentially give us different answers for this particular pairs of nodes. The next measure we're going to look at is called the Jaccard coefficient and what it does, is that it looks at the number of common neighbors, but it normalizes it by the total number of neighbors of the two nodes. The way that we write it down is we say the Jaccard coefficient of nodes X and Y is going to be the fraction of the number of common neighbors that's in the numerator. It's the intersection of the sets NX and NY divided by the number of neighbors of X and Y, which would be the union of NX and NY. Here the pair of nodes A and C have a Jaccard coefficient of 1.5 because they have two common neighbors, B and D. They have four total neighbors, nodes B, D, E, and F. That's 2 over 4, which is 1.5. In network X, we can use the function Jaccard coefficient, which takes as input the graph and that outputs a iterator of tuples which have the two nodes and the Jaccard coefficient of the two nodes. But it only outputs the pairs of nodes that are not already connected, so the non edges. If we sort this list, we now find that IH have the highest Jaccard coefficient of one. That's because I and H are both connected to a single neighbor, which is a common neighbor, G. Whereas the nodes A and G, they have one common neighbor, but they have more neighbors that are in the union of the neighbors of A and G and so therefore they have a lower Jaccard coefficient. The next measure is called the research allocation index. The intuition behind it is that it considers a fraction of a resource for example information or something else that a node can send to another node through their common neighbors. First let me tell you how it's defined mathematically and then I'll tell you a little bit more about what the intuition is behind the measure. The resource allocation index of the nodes XY is going to be the sum over all the common neighbors of X and Y of 1 over the degree of the nodes. In this case, if X and Y have a lot of common neighbors and they're going to have a large resource allocation index. But if they have a lot of neighbors that have low degree, then they're going to have an even larger research allocation index. Now what's the intuition behind this? Let's consider two nodes, X and Y. Let's say that we're measuring the resource allocation index between these two nodes. Let's say that they have exactly one common neighbor Z. Now imagine that X is trying to send to Y a unit of something, let's say information or something else, and it's going to do so by passing it first to Z and then hoping that Z will pass this unit to Y. But actually what Z does is that when it receives this unit from X, it's going to distribute this unit evenly among all the neighbors of Z. Then in that case, well, Y is only going to get a fraction of that unit in which fraction depends on what the degree of Z is. If Z has degree n, then Y is only going to get 1 over n of that unit. If Z is the only common neighbor of X and Y, and Z has a lot of neighbors, a very large degree, then X is going to be able to send less to Y than if Z had very few neighbors. That's why this resource allocation index penalizes pairs of nodes that have common neighbors, that themselves have lots of other neighbors. For example, when we measure the research allocation index of nodes A and C, we would have one-third for node B, because node B is a common neighbor of A and C and it has degree 3, plus 1 over 3, which is for node D, which also has degree 3 and it's also a common neighbor of A and C. So the resource allocation index over AC is going to be two-thirds. We can use the function resource allocation index to compute the resource allocation index of all pairs of nodes that are not connected by an agile already in the graph. If we sort it, we can see the resource allocation index of all pairs in this network. If we look at the two edges that were being following, we see that in this case, AG has higher resource allocation index then IH. That is because IH has a common neighbor, which is G. But G has a high degree, has a degree of four, whereas A and G, their common neighbor is E, and E has a slightly smaller degree of three. That will give AG a larger research allocation index than IH. Next match measure is called the Adamic-Adar index, which is very similar to the resource allocation index. The only difference is that rather than dividing by the degree, it divides by the log of the degree. In this case, the Adamic-Adar index for nodes AC instead of being 2 over 3 is going to be 1 over the log of 3 plus 1 over the log of 3, which is 1.82. There's a function for it too that you can use on network X is Adamic-Adar index. Again, we can compare AG and HI, but in this case is going to be very similar to the resource allocation index, since all we did was to put the log in the denominator. Next, measure 5 is going to be the preferential attachment score. If you recall, the preferential attachment model has the feature that nodes that have very high degree are more likely to get more neighbors. The intuition behind this measure is that, well, if I'm looking at a pair of nodes and they both have a very high degree, then they're more likely to be connected to each other in the feature. What it does is very simply to take the product of the degree of the two nodes. It's very simple. The preferential attachment score of XY is going to be the product of the number of neighbors of X times the number of neighbors of Y. AC would have a preferential attachment score of nine. In Network X, we can use the preferential attachment function to compute the preferential attachment score over all the pairs of nodes that are not connected by an agile ready. Here we can look at the two edges that we've been following, A, G and H, I. We see that A, G has a higher preferential attachment Score then I, H, and that's because a has a degree of three and g has a degree of four, which makes the preferential attachment score if 12, whereas I and H both only have one neighbor and so they have a score of one. Next we're going to look at two other measures that take into account the community structure of the network. In cases where you have a network and on top of the network you have some knowledge about different communities. Here we'll think of community service sets of nodes and we'll make the assumption that each node belongs to only one community. One example is if this network was, for example, a network of communication among employees in a company, then you might think that the department for which an employee works for would define a community. For example, there's HR and there's legal and so on. You could imagine thinking of those as communities. If you had that type of structure, then you could use different measures for determining whether two nodes are likely to connect to each other or not. In this case, let's assume that these network has to communities. These are the two communities. There's A, B, D, and C belongs to Community 1 and the other nodes belong to Community 2. What these two measures do is they make the assumption that if two nodes belong to the same community and they have many neighbors that also belong to the same community, then they're more likely to form an edge than if they had neighbors that belong to different communities or if the two nodes themselves were in different communities. Those are the basic assumption that we make. This allows us to compute new measures. Let's go were two of them. The six measure that we're going to look at today is the number of common neighbors, but with a bonus for neighbors that belong to the same community as the two notes we're looking at. This is called the common neighbors on the origin Hopcroft score. If we're looking at nodes X and Y is simply going to be the number of common neighbors. The size of the intersection of X and end Y plus some bonus, which is going to be the sum over all the common neighbors of this function fu. FU is simply an indicator that tells us whether the node u, which is a common neighbor of X and Y, belongs to the same community as X and Y or not. If it does, then it's a 1, if not it's a 0. It just simply gives a bonus for the neighbors that belong to the same community as X and Y. For example, if we look at nodes A and C, their common neighbors are B and D, and they all belong to the same community. A and C would get a score of two for having two common neighbors plus 2 bonus points, so we 4. Notes E and I would have a score of two because they both have a common neighbor, namely G, and it belongs to the same community, so would get a bonus point for a total of two. But if we look at the edge A, G, for example, they have one common neighbor, namely E. But A and G are not in the same community, so E is not in sync community as A and G, therefore, A and G have a score of just one. Now to use this on network acts, we first have to tell Network X which communities each node belongs to. Here we're adding a new attribute to each one of the nodes named community that tells which community the node belongs to. For A through D you are say, belongs to Community 0 and for the other ones have belongs to the Community 1. Now we can use the function, cn_soundarajan_hopcroft, G which outputs an iterator of the tuples with of the nodes and the score for each one, each pair that is not already connected by an H. We can sort it and find which nodes have the highest score. If we look at the two edges that we've been following, then we find out I, H It has a score of two because they're common neighbor G belongs to the same community as they do, whereas A, G has a score of one. The last measure we're going to look at is a measure that is similar to the resource allocation index, but it only takes into account nodes that are in the same community as a two nodes where we're looking at. If we're computing these measure, which is called the resource allocation Soundarajan-Hopcroft score. This was after the researchers came up with this measure of X and Y, then what we do is we sum over all the neighbors of X and Y. Rather than summing just one degree of the nodes of the common neighbors like we did in the standard Resource Allocation Index, we now have this f u in the denominator of the fraction. This function f u is the same as before, is one if you belongs to the same community, x, x, and y, and zero or otherwise. In the case where you have a common neighbor that does not belong to the same community as x and y, then that neighbor is not contributing anything to the sum because you have a zero in the numerator. If we look at some examples for the nodes A and C, we would find that, because both of the neighbors B and D belong to the same community as A, C, then we have the same score as we did before 1/3 plus 1/3 which is 2/3 for these Resource Allocation Index. If we look at, for example, nodes E and I, we find they have one common node, which is node G, and node G has a degree of four. It has a score of 1/4 because G belongs to the same community as E and I. However, if we look at two nodes that are not in the same community like A and G, they would have a score of zero because their common neighbor, while they have one, namely E, belongs to a different community, at least one of the two nodes and so it doesn't contribute anything to the sum. These two don't get anything in the sum, so you'll get a score of zero. We can use this function in network x, ra_index_soundarajan_hopcroft(G) to compute the scores of all the non-edges. Here we find that I, H has a score of 0.25, whereas A, G has a score of 0 because A, G, they are common neighbors, not in the same communities, at least one of A and G. In this case again, we find that I, H has a higher score. These were the seven measures I wanted to talk about today. I want to point out two things. One, none of these measures actually tell you whether or not you should predict that a particular edge is going to come up in the future or not. It just gives a score that is supposed to give you a sense for whether or not these two nodes are likely to connect. The second thing is that different measures can give you different scores. For example, we saw that some measures would give the edge H, I higher score than A, G, and some measures would do the opposite. These measures aren't necessarily consistent with each other. If you're actually trying to solve the link prediction problem, typically what would happen is that you would use these measures as features then you would use a classifier if you have some labeled data, you would train a classifier and use these measures as features in order to make the prediction. In summary, we talked about the link prediction problem, which says that given a network we're trying to predict which edges are going to be formed in the feature. To do so we talked about five different measures : the number of common neighbors, the Jaccard coefficient which takes the number of common neighbors but normalizes by the total number of neighbors that the two nodes have. The Resource Allocation Index which is thinking along the lines of if a node needed to pass information or pass a resource to another node through their common neighbors, how much of that will actually get there? The Adamic-Adar Index which is similar to the Resource Allocation Index, but it takes the log and the denominator and the Preferential Attachment Score, which bases the score on the idea that in preferential attachment nodes with a high degree tend to accumulate more edges, as well as two measures that require community information. If you do have community information on your network, then you can use these two and they're very similar to other ones we saw above. We have one that takes the common neighbors and it gives a bonus for the common neighbors that are in the same community as the two nodes and then the resource allocation one which only considers common neighbors that are in the same community as the two nodes. This is all for this video. I hope to see you next time. Thanks.