Hello. Today we're going to talk about the concept of distance in social networks. The idea here is that sometimes we'd like to know how far nodes are away from each other. For example, in this network that you see here, how far is node A from node H? Or would like to know, for example, are some nodes far away from each other and other nodes close to each other in general, in this network? If so, which nodes are closest and which nodes are the furthest away from each other in the network.? To answer all these questions, we need to develop a concept of distance between nodes, and that's what we're going to do in this video today. To answer this question, the first concept we need is a concept of a path. A path is simply a sequence of nodes that are connected by an edge. For example, we can find paths that go from the node G to the node C. Here's the path G-F-C, and you can find different paths. For example, here's the path G-F-E-C. How far is node A from node H? Well, we will look at different paths that go from A to H. For example, we can find a path A-B-C-E-H and notice that this path takes four hops to get from A to H. There are other paths. For example, there is a path A-B-C-F-E-H, which takes five hops. For a path, we're going to define the path length to be the number of steps that it contains from the beginning to the end. In this case, path 1 has length four and path 2 has length five. To the find the distance between two nodes, we're going to define it to be the length of the shortest possible path between the two nodes. Going back to the question of what is the distance between node A to node H, the answer is four, because the shortest path between A and H has four hops or has length four. In network x, you can use the function shortest path to find the distance from any node to any other node. Here in this example, I'm finding the distance between nodes A and H in the graph G, which is the graph that you see here. Here you get the shortest path between A and H. If you're interested in just the length of this path, then you can use the function shortest path length, and this gives you the length of this path which is four. Sometimes what we'd like to do when we have real social networks is to find the distance from a single node to all the other nodes in the network, to figure out how far away are other nodes from this specific route node. In this case A, let's say we're interested in figuring out what distance from node A to all the other nodes in the network is. Well, this is something that you can easily do manually in the network that's very small like this, but this could become very tedious to do manually if you have a very large social network. What you can do is you can program a computer to do this. What we're going to do is we're going to talk about an efficient or one of the efficient ways that we have to compute the distances from a given node to all the other nodes, and this one is called breadth-first search. What it basically does is that you start at a node and you start discovering different nodes with different layers, and at each given layer, you discover the set of nodes that are one distance away from the nodes that were in the previous layer. I'm going to walk you through an example of how breadth-first search works. Here we have the network, and we're interested in figuring out the distance from node A to all the other nodes in the network. What we're going to do is we're going to start at A, and we're going to start discovering new nodes as we walk through this network. We're going to be writing down all the notes that we discover. We start at A and we process the node A by looking at who is connected to A. In this case, K and B are connected to A, and so those are going to be a distance one away because they're the shortest path from each one of those nodes to A is just one hop, a path of length one. Now we're going to process each one of the newly discovered nodes, and ask which nodes are connected to this newly discovered node that we haven't discovered yet, and those nodes are going to be assigned to the next layer. Let's say we process node B, node B is connected to K, A, and C. But we've already discovered nodes A and K, so the only node that we discover here is node C. Now we're going to process node K, and node K is connected to nodes A and B, but we've already discovered both of those. So the only newly discovered node is node C, and it's a distance two away from A. Now we process node C, which is connected to B, F, and E. Here we've already discovered B, so the only two nodes that we discover are F and E. Those are a distance three away from A. Now we're going to process node E. Node E has five connections. Out of those five C and F we already discovered, so the only new ones are the other three which are D, I, and H. We assign those to the next layer. Now we process node F, which is connected to three nodes, G, C, and E. But the only one we haven't discovered yet out of all those is G. I want this gets assigned to the next layer, and all of those nodes are a distance four away from A. Now, we have to process each one of those newly discovered nodes. By now you can see that we're already almost done here. Let's process node D, which is only connected to E, but we've already discovered E. D does not discover any new nodes. Now let's go with I. I is connected to E and J. We haven't discovered J yet so this one is assigned to the next layer. Next we process H, which is only connected to E, but we are ready discovered E. Finally we process G, which is connected to F, which we've already discovered. J is a distance 5 away. We have to process J, but J is only connected to I, which we already discovered, and now we're done. We've processed all the nodes. There are no new nodes to discover. Here you can see how we efficiently figure out the distance between A and all the other nodes in the network. This is something that you can program using a computer to do this in an efficient way. You can use network X to run the breadth-first search algorithm by using the function bfs_tree. What it does is it gives you the tree that we've built and this graph that we built here is called a tree, and is the tree of the nodes that you discover. With this function you're able to obtain this tree. If you look at the edges of the graph T here you get that these are the edges of the tree. There's A, K, and A, B, and B, C, and so on. Now, if you're interested in not necessarily the tree in the order in which these nodes were discovered, but simply the actual distances between A and all the other nodes, then you can use shortest path length and you give it the graph, which in this case is G, and the root node, which in this case it's A, and you get a dictionary of all the distances from the node to all the other nodes. A is a distance 0 from itself, a distance 1 from B, a distance 2 from C, and so on. We've defined distance between any two nodes in a network. But if we go back to the original questions we had in the beginning, we were interested in characterizing the distances between all pairs of nodes in the graph. Are nodes in general close to each other, far away from each other? If some are close and some are far, then how can we figure out which are close, and which are far, and so on. Now we're going to try to answer these questions. The first thing you can do is you can simply take the average distance between every pair of nodes in the network. In network X you can do that by using the function average shortest path length in G. In this case, this graph has an average path length of 2.53. It means that on average any pair of nodes in this graph are distance 2.53 from each other. That tells you on average what happens. Now, what is the maximum possible distance between any two nodes? What are the two nodes that are the furthest away from each other? How long is that? How far away from each other are they? This is called the diameter and use simply the maximum distance between any pair of nodes. In Network X you can use the function diameter to get it. In this case, the diameter of the graph is five. You can see that by looking at the distance from K to J, which has a length of five. The other thing that is useful to define is the eccentricity of a node. The eccentricity of a node is the largest distance between the node and all the other nodes in the network. You take a node, measure the distance from the node to all the other nodes, and figure out which one of those instances is the largest one of all. In network X you can use the function eccentricity to get all those distances. Here you can see, for example, that A has an eccentricity of five. As we had seen it has a distance 5 to some node, in this case, that's J. That's pretty large. That's actually as big as it could get. Because the diameter, the largest possible distance between two nodes was five. But if you think of a node like, for example, node E here, which looks like is closer to all the other nodes, then you see that it has a eccentricity of 3, which means that no node in this graph is a distance larger than three from node E. Now we have these eccentricities, the radius of the graph is the minimum eccentricity in the network. Basically asks, what is the maximum distance that a node has from all the other nodes? That's eccentricity and the radius takes the smallest one of those. In network X you can use the function radius to get the radius of the network. As you could see from the dictionary of eccentricities, the radius of this network is three. The next question we had was, now that we know how to calculate distances and now that we have a sense for what is a large distance in the network, like the diameter, what is the short distance like the radius, we can try to identify which nodes are far away from all the other nodes, and which nodes are close to all the other nodes. For the first one we have the periphery, which is the set of nodes in a graph that have an eccentricity equals to the diameter of the network, which is the largest eccentricity that you could have since the diameter is the largest distance between any two nodes in the network. If you apply this definition to these graph you find that the periphery of this graph are the nodes A, K, and J. You can get these set of nodes using network X function periphery. As you imagine, these nodes tend to be on the outskirts of the network far away from all the other nodes. The opposite concept is the concept of nodes that are central. The center of the graph is a set of nodes that have eccentricity equal to the radius. When you check which nodes are in the center in this graph, using the center function network X, you find that C, F, and E are nodes that are in the center of the graph, and this makes sense. They're towards the center of the graph. They tend to be close to most of the nodes. With these tools, now you're able to take a look at a network, a real social network, and start to ask questions about how far are these nodes from each other, which nodes are central, which nodes are not, and so on. Let's run an example in this using the karate club network, which we had seen in a previous video. This is a network of friendship in a karate club. As you may remember, the story here is that Node 1 is the instructor of the karate club and this Node 34 for is an assistant, and they have some type of dispute. They're not friends with each other. So the club actually splits into two groups, and this is the separation of the two groups. The set of students on the left go with one of the instructors or with the assistant, and the other ones go with the original instructor. If we take this network and apply the definitions about distances that we just covered, we can discover far nodes are from each other and who's central and who's not. Let's begin by loading up this network. This network is so famous that actually in Network X, you can simply load it by using the function karate club graph. So that one returns this particular graph. Now, I'm converting the nodes' labels to be integers, so that they match the figure I have here. That's what I'm doing with that command there. Then I could ask different questions about the network. In this case, the average shortest path between the nodes is about 2.41. The radius of the network is three, and the diameter is five. So meaning the pair of nodes here that are distance far away from each other, and that's the largest that the distance can be. Then we're going to ask, who's at the center of this network? Here are the nodes that are in the center and here I'm highlighting them in red. As you can see, the instructor is in the center and all the other nodes that are in the center are also connected to the instructor, and they also tend to have high degree. So they're easily zero connected to many other high degree nodes and they just have small distances from them to all the other nodes in the network. Now, when you look at the periphery, these are the peripheral nodes. I'm highlighting them here in blue. As you can see, they're on the outside. They tend to have small number of connections and none of them are actually connected to the instructor. Now, you may look at this and say, "This makes sense." But for example, this Node 34 was the assistant here. He seems pretty central. He's connected to a bunch of nodes. It seems like he could be close to all the other nodes in the graph as well. Why is 34 not showing up in the center? Well, it turns out that if you look carefully, Node 34 has a Distance 4 to Node 17. To get from 34-17, you have to go 34, 32, 1, 6, and 17. It couldn't be in the center because the radius of the graph is three and this one has a node that is a distance far away from it. Now it turns out that actually if this Node 17 was just a bit closer, for example, if this Node 17 was a Distance 3 away from 34, then 34 would actually be in the center because 34 is a distance, at most three to every other node in the network. This shows that this definition of a center is pretty sensitive to just one node that happens to be far away. So if you make a very small change to the graph that makes particular node further away from the other, you can make it or break it for the node to be in the center or not. In the future, we'll look at other measures of centrality for nodes that are less sensitive to small changes like that. To summarize, we've looked at a set of measures. The first one is the distance between two nodes. That was defined to be the length of the shortest path between the two nodes, and the eccentricity of a node, which is defined to be the largest distance between the node and all the other nodes in the network. Both of those definitions take place at the either node level or at the pair of nodes level. We've also looked at definitions that take place at the whole graph level. They try to characterize the distances in the whole network, and they were the average distance between any two pair of nodes, the diameter, which is the largest distance between any two pair of nodes, and then the radius, which was the smallest eccentricity in the graph. Then we use those definitions to identify the periphery and the center of the graph, which is the sets of nodes. The periphery is the sets of nodes with eccentricity equal to the diameter, and the center is a set of nodes with eccentricity equal to the radius. That's all for this video and I hope to see you in the next one.