Hi, it's Matt again. Now we're going to be talking a little bit about the diameter of a network. So we're still in our basic definitions, trying to understand how we represent networks and what we can say about their characteristics, and diameter is a very important one. Diameter and average path length refer to how close nodes are to each other. How long will it take to get from one node to another, in some sense? How fast will information spread? How quickly can we get from one part of the graph to another? These are questions that are going to be answered by things like diameter and average path length. So, when we look at the diameter, the definition here is the largest geodesic in the network, that is, the largest shortest path. If we have a network which is not connected, often when people talk about the diameter, they'll talk about the diameter of the largest component. It gets a little iffy because a lot of the time you're going to be stuck with a few isolated nodes in a network, and so diameter will often just refer to the largest component. One thing about diameter is that it can be prone to outliers. It could be that there happens to be one pair of nodes which are very far from each other, while a lot of the other ones are relatively well connected to each other. So, average path length is going to be a different measure: instead of the maximum geodesic, it's going to be the average geodesic in the network. And if we look at different network structures, diameter begins to tell us things about them. For instance, if we look here at the network on the left, we've got a ring, or a circle, of different nodes connected to each other. And in that situation, when we look at these nodes, we end up seeing that some of them can be quite far from each other. And the diameter in this case is on the order of half the number of nodes.
So, either n over 2 or n minus 1 over 2, depending on whether we've got an even or odd number of nodes. Whereas, if we look at a tree, the situation on the right here in this picture, what we have is a much more efficient way of getting from one node to another. Think about the number of levels that this tree has. In this case it has three different sets of links, so a depth of one, two, three. With K levels, we end up covering a number of nodes where n is equal to 2 to the K plus 1, minus 1. You can just go through and keep track of that yourself. When you look at the relationship between how many levels you have and how many nodes you have, you end up, in this case, having K equal to log base 2 of n plus 1, minus 1. So, a very simple formula. And with that, in this case the diameter is 6; it's twice what this K was. Getting from node 25 here to node 30 takes a path length of 6: you go up 3 and then down 3. So, 2 times K is going to be the longest shortest path in this network. And what this tells us is that the diameter is going to be roughly 2 times this log base 2 of n plus 1. So, that's giving us a sense of how quickly we reach out. And the idea here is that a tree is a much more efficient way of connecting nodes in terms of short paths than we saw with the circle, because now we end up connecting nodes to each other through a branching process which is quite efficient. Even with a small number of links, we're able to connect things more efficiently. And so, instead of scaling linearly with the number of nodes, the diameter is going to scale with the log of the number of nodes, which is going to tend to be a much smaller number.
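To make the ring and tree comparison concrete, here is a small sketch (the function names and the way the networks are built are my own choices, not from the lecture) that represents both networks as adjacency lists and computes the diameter by breadth-first search:

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest-path lengths (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest geodesic: the maximum over all pairs of the shortest-path length."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)

# Ring on n nodes: each node is linked to its two neighbors on the circle.
def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# Binary tree with K levels below the root, so n = 2**(K + 1) - 1 nodes.
def binary_tree(K):
    n = 2 ** (K + 1) - 1
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        adj[i].append(parent)
        adj[parent].append(i)
    return adj

print(diameter(ring(10)))        # 5 = n/2 for even n
print(diameter(ring(11)))        # 5 = (n - 1)/2 for odd n
print(diameter(binary_tree(3)))  # 6 = 2K: up 3 levels, then down 3
```

Running this matches the formulas in the lecture: the ring's diameter grows like n over 2, while the tree's diameter is 2 times K, which grows only like the log of the number of nodes.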
Now, in terms of small average path lengths and diameters, one thing that's been found quite extensively in looking at social networks in the real world is that they tend to have short average path lengths and short diameters. And what do we mean by short? Intuitively, it means that we connect a large number of nodes with a fairly small number of connections. The most famous example of this was Stanley Milgram's experiment, where he had people sending letters from one part of the US to another. I believe the letters started in the Midwest, say, Kansas and Nebraska, and you were trying to get a letter to somebody on the East Coast, say, in Massachusetts. He would give people a person's name, the rough location of where this person was, and maybe some basic background information about the person. So, you know, I'm supposed to get something to John Smith, who lives in Massachusetts and is a lawyer in Amherst, and I live in Lincoln, Nebraska. The idea was that I should send the letter to somebody I know, and then the instructions in the letter tell them to send it to somebody they know, and so forth, with the intent of the letter eventually getting to the person that was named. And the amazing thing that Milgram found was that out of the letters that were originally mailed from the initial points, the median number of hops it took to get to the end point was 5. And in fact, 25% of the letters made it. That's, first of all, a fairly high response rate for this kind of participation in an experiment. And secondly, the median of five seemed to be quite small, given that you're starting with somebody in one part of the country who had no knowledge of the other individual in another part of the country, and it took only five steps to get to that individual. This kind of experiment has been replicated in more advanced settings, working with e-mail and other kinds of things.
One thing that's sort of interesting about this experiment is that when we think about the median of 5, we have to be a little bit careful about how we interpret that result. Why? Well, think about the 75% that didn't make it. If there's some probability at each step that people won't send the letter on, then the longer the path a letter would actually have to take to go from one person to another, the more likely it becomes that it won't show up as having made it in the experiment. The letter will never reach the final point where Stanley Milgram could then collect it and get all the names. So, the chains that actually complete are going to tend to be shorter than the others. In experiments that have followed up on this, you can begin to try and estimate the lengths of the chains that didn't make it, and those estimates push the numbers upwards. But it's still an important number, because people were getting the letter through in only five steps, and they didn't get to see the network. It's not as if they saw the network of all the connections in the US and were able to figure out what the most efficient path was. They were just trying to send it to somebody they thought might know somebody. You know, if this person's a lawyer, then if I know a lawyer in Massachusetts, maybe I send it to them, or if I know a lawyer in Chicago who might have a friend in Massachusetts. So, I'm trying to navigate the network without actually knowing the explicit structure of the network. There are a number of other studies that have found small path lengths in different settings. Grossman's study of co-authorships among mathematicians: average path length 7.6, maximum, within a giant component, 27. Newman's study of physics co-authorships: average 5.9, max 20. A study by Goyal and co-authors of economics co-authorships: 9.5 as an average, max 29.
So here, you're seeing again that the averages are somewhere on the order of 5 to 10, with the max closer to 30 in these cases. There is a study by Lada Adamic and Pitkow on the World Wide Web, where they found an amazing number: a mean of 3.1 on a giant component reaching 85% of about 50 million pages. For Facebook, a mean of about 4.74 was recently found by Backstrom and a set of co-authors; that's 721 million users. Actually, when you start working with networks of that size, even trying to calculate the average path length is going to take some doing; you're going to have to work carefully to find algorithms to calculate these things. So, there are some very interesting findings there in terms of the average path length. In order to say a little bit more about diameters and understand why we see these short average path lengths, it's first going to be useful to have a couple more definitions in hand. The neighborhood of a node i in a network g, N sub i of g, is going to be the set of other nodes that i is linked to in g. And again, our usual convention is that we won't allow i to be linked to i. Then, the degree of a node, which is a very important concept: the degree of node i is just simply counting up how many neighbors a given node i has in the network g. So, degree is a basic idea which can be very important in keeping track of things in networks. And in understanding why we have short average path lengths, one thing that's going to be very useful is understanding the most basic form of random graph, which is due to Erdos and Renyi in a set of papers in the 1950s; there was also independent work by others around the same time. The model is the simplest possible network you could imagine in terms of a random graph. We start with n nodes, and each link is just formed independently with some probability p. So, what do we do?
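As a quick illustration of these two definitions, here is a minimal sketch (the example graph and the helper names are made up for illustration) of the neighborhood and the degree computed from an adjacency-set representation of a network:

```python
# An undirected network g as adjacency sets; by convention, no self-links.
g = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2},
}

def neighborhood(g, i):
    """N_i(g): the set of other nodes that i is linked to in g."""
    return g[i]

def degree(g, i):
    """The degree of node i: how many neighbors i has in g."""
    return len(g[i])

print(neighborhood(g, 2))  # {1, 3, 4}
print(degree(g, 2))        # 3
```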
We have a coin; say it has a three-quarters chance of coming up heads. And for every possible link, we look: okay, should one be linked to two? Flip the coin, three-quarters probability of yes, one-quarter probability of no. If it comes up heads, we form the link; if not, we don't. And we do that for every link; we just keep repeatedly flipping the coin, okay? And the coin comes up heads with some probability p. This is known as a G(n, p) graph: a graph on n nodes with link probability p. It's a very important benchmark in understanding things. Why is it an important benchmark? Because if we see some network out there in the real world, and we know what the properties of a random network on n nodes with some probability p would be, then we can compare the real-world network that we see to this benchmark network and ask: does it look like something systematic is going on? Does this network look systematically different than if nature had just picked links at random? So, that's going to be an important benchmark. And in order to talk about properties of these kinds of networks, it's going to be useful to talk about large networks. Why large networks? Well, because if we're dealing with very small networks, every single network has some possibility of arising if we're just flipping coins, and so making statements about what's going to happen on a small network is going to be a very complicated probabilistic statement. But when we get to large numbers, laws of large numbers are going to allow us to say things like: with a probability close to one, the network should be connected, something like that. So, we can talk about, you know, the average path length ought to be 6 with a probability very close to 1, on a network of such and such a size. By looking at sequences of networks on larger and larger sets of nodes, we can begin to make statements like that.
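The coin-flipping procedure can be sketched in a few lines (the function name and parameter choices are mine): for each of the n(n-1)/2 possible links, draw a uniform random number and form the link whenever it comes in below p, which is exactly a coin that comes up heads with probability p:

```python
import random

def gnp(n, p, seed=None):
    """Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible links forms
    independently with probability p (one 'coin flip' per pair of nodes)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # heads with probability p
                adj[i].add(j)
                adj[j].add(i)
    return adj

g = gnp(1000, 0.0075, seed=1)
avg_degree = sum(len(nbrs) for nbrs in g.values()) / len(g)
print(avg_degree)  # the expected degree is p * (n - 1), about 7.5 here
```

The realized average degree will fluctuate around p times (n minus 1) from run to run, which is exactly the kind of probabilistic statement that becomes sharp only as n grows large.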
So, in particular, let's look at Erdos and Renyi networks. I'm going to talk about large networks, and in particular, we're going to look at networks that have two different properties. One is we're going to be thinking about the degree now as a function of n. So, we have to say, you know, if the network has a million nodes, what's the average degree; if it has two million nodes, and so forth. We want the degree to be at least as large as the log of the number of nodes, and we'll talk about that later. What this is going to make sure is that there are actually paths between any two nodes with a high probability. So, with a high probability, you can get from any node to any other node, and then we can actually calculate the diameter. Okay, so we'll take d of n to be at least log n. And we'll also make sure that the network isn't too complete: d of n ought to be small relative to the overall size n, so that it's not as if every node just reaches every other node in a path of length one. We'll want to have something interesting going on. So, it's not as if I'm connected to everybody in the world; I'm connected to some fraction of the people in the world, and we'll make sure that that's at least log n in terms of size. Okay. So, what's the theorem? A basic theorem on this structure is that if we have these two conditions, then once we start looking at large enough n, the average path length and the diameter of these random G(n, p) graphs are going to be approximately proportional to the log of the number of nodes over the log of the degree. So, a very simple formula should give you a very accurate approximation, with a high probability, to what the average path length and diameter are in these kinds of networks. These kinds of theorems have been proven for the original Erdos-Renyi random graphs that were studied through the 50s and 60s.
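As a rough numerical check of the theorem, not a proof, here is a sketch (all names and parameter choices are mine) that generates a G(n, p) whose average degree d sits above log n and well below n, and compares a sampled average path length against log n over log d:

```python
import math
import random
from collections import deque

def gnp(n, p, rng):
    """Erdos-Renyi G(n, p) as adjacency sets."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def avg_path_length(adj, sample, rng):
    """Average geodesic, estimated by BFS from a sample of start nodes."""
    total, count = 0, 0
    for s in rng.sample(list(adj), sample):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

rng = random.Random(0)
n, d = 2000, 12                    # d > log(n) ~ 7.6, and d is tiny next to n
adj = gnp(n, d / (n - 1), rng)
apl = avg_path_length(adj, 50, rng)
predicted = math.log(n) / math.log(d)
print(apl, predicted)              # the two numbers should be close
```

With these parameters the prediction log n over log d is around 3, and the sampled average path length lands near it, which is the short-distances phenomenon the theorem formalizes.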
There's a paper by Moon and Moser, and a set of calculations by Bollobas, a series of papers which look at these kinds of calculations. It's been looked at in richer and richer models: Chung and Lu have a paper in 2001 looking at a different set of models, and I have a paper in 2008 that allows for node types and other kinds of complications to this model. You still get the same formula in that kind of setting. So, what does this tell us? This gives us an idea of why we end up with very short average path lengths and short diameters if we have something like a random graph. And that will tell us something about real-world networks, to the extent that random graphs are part of those, or embedded in those kinds of things. We can state the theorem more succinctly in terms of mathematical notation. A formal statement of the theorem is that under these conditions, if we're looking at this G(n, p), then the ratio of the average distance to log n over log d converges in probability to 1. So basically, the average distance is going to be proportional to log n over log d, and the same for the diameter. The next thing we're going to do is just try and understand the logic behind this. And then I'll also have a separate, more advanced lecture that'll give part of the proof of this for people who really want to dig into the mathematics. But the intuition behind this is fairly simple. We'll go through that intuition, and then people who want the more advanced part can actually go through pieces of the proof as well.