0:00

Hi, it's Matt again. Now we're going to be talking a little bit about the diameter of a network.

So again, we're still in our basic definitions, trying to understand how we represent networks and what we can say about their characteristics, and diameter is a very important one. Diameter and average path length refer to how close nodes are to each other. So, how long will it take to get from one node to another, in some sense? How fast will information spread? How quickly can we get from one part of the graph to another? These are questions that are going to be answered by things like diameter and average path length. So, when we look at the diameter, the definition here is the largest geodesic in the network.

So, the largest shortest path. If we have a network which is not connected, then often when people talk about the diameter, they'll talk about the diameter of the largest component. Otherwise it gets a little iffy: a lot of times you're going to be stuck with a few isolated nodes in a network, and so diameter will often just refer to the largest component.

One thing about diameter is that it can be prone to outliers. It could be that there happens to be one pair of nodes which are very far from each other, while a lot of the other ones are relatively well connected to each other.

So, average path length is going to be a different measure from the diameter: instead of the maximum geodesic, it's the average geodesic in the network. And if we look at different network structures, diameter begins to tell us things about them.
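As a concrete check of these two definitions, here is a minimal sketch in Python (my own illustration, not from the lecture; the adjacency-dictionary representation is an assumption) that computes both the diameter and the average path length of a small undirected graph using breadth-first search:

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_avg_path(adj):
    """Largest geodesic (the diameter) and average geodesic over connected pairs."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        dist = bfs_distances(adj, s)
        for t, d in dist.items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, total / pairs

# A small undirected example: a path 0-1-2-3 with an extra link 0-2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(diameter_and_avg_path(adj))  # diameter 2, average 4/3
```

The loop visits every ordered pair once, which is fine for small examples; for very large networks, as noted later in the lecture, more careful algorithms are needed.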

So, for instance, if we look here at the network on the left, we've got a ring, or circle, of different nodes connected to each other. In that situation, when we look at these nodes, we end up seeing that some of them can be quite far from each other. The diameter in this case is on the order of half the number of nodes: either n/2 or (n-1)/2, depending on whether we've got an even or odd number of nodes.

Whereas, if we look at a tree, the situation on the right here in this picture, what we have is a much more efficient way of getting from one node to another. If we think about the number of levels that this tree has, in this case it has three different tiers of links, so one, two, three: depth K = 3. In this case, we end up covering a number of nodes n = 2^(K+1) - 1; you can go through and verify that yourself. When you look at the relationship between how many levels you have and how many nodes you have, you end up, in this case, having K = log_2(n+1) - 1. So, a very simple formula. And with that, in this case the diameter is 6; it's twice what this K was. Getting from node 25 here to node 30 takes a path of length 6: you go up 3 and then down 3. So, 2K is going to be the longest shortest path in this network. And what this tells us is that the diameter is 2(log_2(n+1) - 1), so roughly on the order of 2 log_2(n+1). That's giving us a sense of how quickly we reach out.

And the idea here is that a tree is a much more efficient way of connecting nodes, in terms of short paths, than what we saw with the circle. We end up connecting nodes to each other in a manner which is quite efficient because of the branching process. So, even with a small number of links, we're able to connect things more efficiently. Instead of scaling linearly with the number of nodes, the diameter is going to scale with the log of the number of nodes, which is going to tend to be a much smaller number.
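The ring-versus-tree comparison can be verified numerically. Here is a small sketch (my own illustration; the node count n = 15 and the array-indexed tree layout are choices I've made so that both graphs have the same number of nodes):

```python
from collections import deque

def eccentricity(adj, s):
    """Length of the longest geodesic starting at node s, via BFS."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    return max(eccentricity(adj, s) for s in adj)

def ring(n):
    """A circle of n nodes, each linked to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def binary_tree(K):
    """A binary tree of depth K: n = 2**(K + 1) - 1 nodes, node i's parent is (i-1)//2."""
    n = 2 ** (K + 1) - 1
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        adj[i].append(parent)
        adj[parent].append(i)
    return adj

print(diameter(ring(15)))        # (n - 1) / 2 = 7 for an odd ring
print(diameter(binary_tree(3)))  # 2 * K = 6, leaf to leaf through the root
```

With the same 15 nodes, the ring's diameter is 7 while the tree's is 6, and the gap grows quickly with n since the tree scales logarithmically.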

Now, in terms of small average path lengths and diameters, one thing that's been found quite extensively in looking at social networks in the real world is that they tend to have short average path lengths and short diameters. And what do we mean by short? Intuitively, it means that we connect a large number of nodes with a fairly small number of connections.

The most famous example of this was Stanley Milgram's experiment, where he had people sending letters from one part of the US to another. I believe it was part of the Midwest, say, Kansas and Nebraska, and you were trying to get a letter to somebody on the East Coast, say, in Massachusetts.

He would give them, say, a person's name, the rough location of where this person was, and maybe some basic background information about the person. So, you know: I'm supposed to get something to John Smith, who lives in Massachusetts and who is a lawyer in Amherst, and I live in Lincoln, Nebraska. The idea was that I should send the letter to somebody I know, and then the instructions in the letter tell them to send it to somebody they know, and so forth, with the intent of the letter eventually getting to the person that was named. And the amazing thing that Milgram found was that, out of the letters that were originally mailed from the initial points, the median number of hops it took to get to the end point was 5.

And in fact, 25% of the letters made it. That's, first of all, fairly high in terms of a response rate for this kind of participation in an experiment. And secondly, the median of five seemed to be quite small, given that you're starting with somebody in one part of the country who had no knowledge of the other individual in another part of the country, and it took only five steps to get to that individual. This kind of experiment has been replicated in more advanced settings, working with e-mail and other kinds of things.

One thing that's sort of interesting about this experiment is that, when we think about the median of 5, we have to be a little bit careful about how we interpret that result.

Why? Well, think about the 75% that didn't make it. If there's some probability at each step that people won't send the letter on, then the longer the path it would actually take to go from one person to another, the more likely it becomes that the letter won't show up as having made it in the experiment: it will never reach the final point where Stanley Milgram could collect it and record all the names. So, the chains that actually complete are going to tend to be shorter than the others. In experiments that have followed up on this, you can begin to try and estimate how many didn't make it, and those corrected estimates are going to be biased upwards relative to the observed ones.

But it's still sort of an important number, because people are getting it there in only five steps, and they didn't get to see the network. It's not as if they saw the network of all the connections in the US and were able to figure out what the most efficient path was. They were just trying to send it to somebody they thought might know somebody. You know, if this person's a lawyer, then if I know a lawyer in Massachusetts, maybe I send it to them; or maybe I know a lawyer in Chicago who might have a friend in Massachusetts. So, I'm trying to navigate the network without actually knowing the explicit structure of the network.

Â 7:53

There are a number of other studies that have found short path lengths in different settings. Grossman's study of co-authorships among mathematicians: average path length 7.6, maximum in a giant component 27. Newman's study of physics co-authorships: average 5.9, max 20. Goyal, [UNKNOWN], and Gonzales' study of economics co-authorships: average 9.5, max 29. So here, you're seeing again that the average is somewhere on the order of 5 to 10, with the max closer to 30 in this case. There's a study by Lada Adamic and Pitkow on the World Wide Web, where they found an amazing number: a mean of 3.1 on a giant component reaching 85% of about 50 million pages. For Facebook, a mean of about 4.74 was recently found by Backstrom and a set of co-authors, on 721 million users. Actually, when you start working with networks of that size, even trying to calculate the average path length is going to take some doing; you're going to have to work carefully to find algorithms to calculate these things.

So, there are some very interesting findings there in terms of the average path length. In order to say a little bit more about diameters and understand why we see these short average path lengths, it's going to be useful first to have a couple more definitions in hand.

The neighborhood of a node i in a network g, N_i(g), is the set of other nodes that i is linked to in g. And again, our usual convention is that we won't allow i to be linked to itself. Then, the degree of a node, which is a very important concept: the degree of node i is simply a count of how many neighbors node i has in the network g. So, degree is a basic idea which can be very important in keeping track of things in networks.
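In code, using an adjacency-dictionary representation (my own convention, not the lecture's), these two definitions are essentially one-liners:

```python
def neighborhood(adj, i):
    """N_i(g): the set of nodes that i is linked to in g (i itself excluded)."""
    return set(adj[i]) - {i}

def degree(adj, i):
    """The degree of node i: how many neighbors i has in g."""
    return len(neighborhood(adj, i))

# Node 1 is linked to nodes 2 and 3, so its neighborhood is {2, 3} and its degree is 2.
adj = {1: [2, 3], 2: [1], 3: [1]}
print(neighborhood(adj, 1), degree(adj, 1))
```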

Â 10:16

In understanding why we have short average path lengths, one thing that's going to be very useful is understanding the most basic form of random graph, which is due to Erdos and Renyi in a set of papers in the 1950s. There was also independent work by [UNKNOWN] around the same time. The model is the simplest possible network you could imagine in terms of a random graph. We start with n nodes, and each link is formed independently with some probability p. So, what do we do? We have a coin; say it has a three-quarters chance of coming up heads. For every link, we ask: should one be linked to two? Flip the coin: three-quarters probability of yes, one-quarter probability of no. If it comes up heads, we color in the link; if not, we don't. And we do that for every possible link, repeatedly flipping the coin, where the coin comes up heads with some probability p. This is known as G(n,p): a graph on n nodes with link probability p.
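The coin-flipping procedure just described translates directly into code. Here's a minimal G(n,p) sampler (my own sketch; the adjacency-dictionary representation and the `seed` parameter are choices I've made for reproducibility):

```python
import random

def gnp(n, p, seed=None):
    """Erdos-Renyi G(n,p): flip a p-weighted coin once for each of the
    n*(n-1)/2 possible links, and form the link when it comes up heads."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:   # heads with probability p
                adj[i].append(j)
                adj[j].append(i)
    return adj

g = gnp(6, 0.75, seed=1)   # the three-quarters coin from the example, on 6 nodes
print(sum(len(nbrs) for nbrs in g.values()) // 2, "of 15 possible links formed")
```

On average, a G(6, 0.75) draw forms about 0.75 × 15 ≈ 11 of the 15 possible links.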

It's a very important benchmark in understanding things. Why? Because if we see some network out there in the real world, and we know the properties of a random network on n nodes with some probability p, then we can compare the real-world network that we see to this benchmark network and ask: does it look like something systematic is going on? Does this network look systematically different than if nature had just picked links at random? So, that's going to be an important benchmark.

And in order to talk about properties of these kinds of networks, it's going to be useful to talk about large networks. Why large networks? Well, because if we're dealing with very small networks, every single network has some possibility of arising if we're just flipping coins. And so, making statements about what's going to happen on a small network is going to be a very complicated probabilistic statement. But when we get to large numbers, laws of large numbers are going to allow us to say things like: with probability close to one, we should observe that the network is connected, something like that. So, we can talk about, you know, the average path length ought to be 6 with probability very close to 1 on a network of such and such a size. By looking at sequences of networks on larger and larger sets of nodes, we can begin to make statements like that.

So, in particular, let's look at Erdos-Renyi networks, and I'm going to talk about large networks. In particular, we're going to look at networks that have two different properties. One is that we're going to be thinking about the degree now as a function of n. So we have to ask: if the network has a million nodes, what's the average degree? If it has two million nodes, and so forth. We want that to be at least as large as the log of the number of nodes, and we'll talk about why later. What this is going to make sure is that, with a high probability, there are actually paths between any two nodes. So, with a high probability, you can get from any node to any other node, and then we can actually calculate the diameter. Okay, so we'll take d(n) to be at least log n.

We're also going to make sure that the network isn't too complete. So, d(n) ought to be small relative to the overall size n, so that it's not as if every node reaches every other node in a path of length one. We'll want to have something interesting going on. So, it's not as if I'm connected to everybody in the world; I'm connected to some fraction of people in the world, and we'll make sure that fraction is at least log n in terms of size. Okay.

So, what's the theorem? A basic theorem on basic structure is that if we have these two conditions, then, for large enough n, the average path length and the diameter of these random G(n,p) graphs are approximately proportional to the log of the number of nodes over the log of the degree: log(n)/log(d). So, a very simple formula should give you, with a high probability, a very accurate approximation of what the average path length and diameter are in these kinds of networks.

These kinds of theorems have been proven for the original Erdos-Renyi random graphs that were studied through the 50s and 60s: a paper by Moon and Moser, a set of calculations by Bollobas, a series of papers which look at these kinds of calculations. It's been looked at in richer and richer models. Chung and Lu have a paper in 2001 looking at a different set of models. I have a paper in 2008 that allows for node types and other kinds of complications to this model; you still get the same formula in this kind of setting. So, what does this tell us?

This gives us an idea of why we end up with very short average path lengths and short diameters if we have something like a random graph. And that'll tell us something about real-world networks, to the extent that random graphs are going to be part of those, or embedded in those kinds of things. When we look at this, we can state the theorem more succinctly in mathematical notation. A formal statement of the theorem is that, under these conditions, if we're looking at this G(n,p), then the average distance divided by log(n)/log(d) converges in probability to 1. So basically, average distance is going to be proportional to log(n) over log(d), and the same for the diameter.

So, the next thing we're going to do is just try and understand the logic behind this. Then I'll also have a separate, more advanced lecture that'll give part of the proof of this, for people who really want to dig into the mathematics. But the intuition behind this is fairly simple. We'll go through that intuition, and then people who want the more advanced part can go through pieces of the proof as well.
