0:00

Okay folks. So our next topic is again taking these growing random network models.

Now we've got a version which is fairly

rich and spans a whole series of different degree distributions.

Let's have a look at some data,

see how this work out.

And if you recall what we had in terms of the degree distribution

was that we ended up with an equation for the degree distribution which looks like this.

And what that does is if we - it's easy to do this in the log-log world,

take log of each side and well we're actually bring the F over here,

then take a log of each side.

What we're going to end up with this log of 1 minus the distribution function is

proportional to c minus x log d of a plus m times x.

Okay. So what's the difficulty with actually fitting this kind of equation?

We could just look at,

you know, frequency of degrees,

regress that off of log of a function of degrees and then try and estimate what the x is.

The difficulty is that the x here is

entering in both parts so the x is telling us something about the a.

So we've got this a coming in in several places right?

So a is the parameter we want to estimate how much is uniformly at random.

How much is being formed via preferential attachment.

The a is entering here.

It also enters this x term.

So what we've got is we've got c minus two over one minus a log

of d plus am two over one minus a, right?

So we've got several places where a is entering both here and here.

And so what we going to do is,

what we'll do is, search over a's.

We'll search over a's until we minimize the distance between

the degree distribution generated

by this particular a and the actual the degree distribution.

Okay? So we'll try and select an a to minimize

that distance between the actual distribution and the models distribution.

So when we search over a's, to do that,

and let me say a little bit about how one can

search so you can do the the program search by different methods.

One is you can just plug in a bunch of different a's and for each one that'll give you

a degree distribution and then try and see to what extent the actual data,

you know, what's the distance between each frequency distribution in

the data and the actual ones here?

Sum those up, square them and try and minimize that.

Alternatively what you can begin to do is guess an a over here

and then do a regression that'll give

you an expression here that gives you a new estimate for a,

plug it back in here.

Regress again, and so forth.

You can actually show that that is a contraction.

It's a well-behaved mathematical process.

It'll converge to the right a. I don't want to go through a proof of that but that's

an alternative technique so one is just brute force

put down a bunch of a's between zero and one,

search over a grid see how each one,

how close does it come to matching the data,

so you can look at

the actual frequency distributions compared to the true frequency distributions.

Right? So what we're going to end up with is,

so what we've got is true frequencies for each degree,

we've got for each a we've got a frequency,

we can look at these things, sum them up,

square them and say try and choose the a that

minimizes the sum of the squared distances between

the actual distribution and the one for a and just choose a to minimize this.

So for each a we get an expression plug it in.

Calculate all those numbers and then minimize that.

Or you can do this iterative process where you guess and a,

do a regression and get a new x that gives you

a new a estimate and iterate and that'll give you another way to do this.

Okay. So what do you end up with?

Let's look at some plots.

So here we've got different datasets.

This is one where we're looking at citations of the original Small Worlds paper.

And what do you end up with?

You end up with a at about 0.56 and

the fit of that is about 98% so you're doing very well in terms of

actually calculating the variations of

the actual variation in

the frequency distribution that is matched by this model is about 98%.

So there's about a 2% error when you look at the f of d's minus

a of d's compared to the overall squares of f of d's.

So you're doing very well here.

When you look at different ones this is the prison inmates dataset that we looked at,

we talked about before McCrae's dataset.

Here you get a, the best fit is a equals one.

So it looks like it's pretty much uniformly at random and R-squared of about 94%.

So again a fairly tight fit.

That one looks like it's being matched fairly well by uniformly at

random and in fact there's a number of ones that come out at one.

So if you do there's another one, Amateur Radio operators,

so for people that don't know before the Internet people

used to use short wave radios and talk to each other so instead of having

chat rooms you could get on the radio and then just search

to find somebody else who might be

also broadcasting on the radio, you could talk to them.

These were amateur radio operators.

There's a dataset that looks at these

again a equals one uniformly at random fairly good fit.

The high school romance that we looked at, Bearman, Moody,

and Stovel's data, the degree distribution there.

Again very well fit uniformly at random.

So looks better, like more like uniformly random here the R-squared is .99.

So a remarkably strong fit to

uniformly at random model that says something about high school romances, in any case.

So you know these are just a couple of notes on these fits.

Some of these are actually even more curved than with a equals one.

And in those cases it could be that instead of having

this growing network model if you go back to the

static uniformly at random model you can even do a better fit.

So those are fit better even by not including some of the growth.

So the growth isn't a major factor in those.

The citations networks, you know,

some of the things is it has too many with degree zero to fit.

And then part of what's happening in the model is

the model starts with everybody having some degree and it's not

taking into account the fact that some citations are directed and

might never be hitting some articles.

Interestingly, if we go back to

the Notre Dame World Wide Web that

has sort of generated a lot of interest in the power law.

When you look at the best fitted series it actually has a at

.36 so a little more than a third of the links are being formed uniformly at random.

Now when you actually plot out,

now instead of just binning the things a

very fine binning of the actual degrees the R-Squared is .97.

And interestingly this still looks,

in a log-log plot it looks fairly close to a line.

Right. But it's saying that the actual fit has some curvature.

And one thing that's very important to understand about

these kinds of things is that a lot of the data,

when you do a log-log plot,

you know, most of your data is actually in a smaller part of the graph.

Right. So the log of the frequency is changing the relative weights.

And so what you actually see in your eye is,

you know, capturing the long tail.

And so this looks fairly linear but there actually can be a lot of curvature that you've

missed because the log

straightens things out and it's very difficult for your eye to detect that.

And so what this tells us is if we actually

want accurate understandings of what's really going on in

these things just eyeballing these things is not going

to give us a good idea of what the true distributions look like.

When you actually fitted distribution,

the distribution here looks like it's sort of two thirds preferential attachment but one

third uniformly at random is

a much better fit than just a pure preferential attachment model.

And so you know there there might be other things going on whereas you know

pages accumulate some of their links

through some other process other than preferential attachment.

And that's an important caveat to these things.

So when we understand power law is one thing we should understand is there are

fat tails but not necessarily purely linear relationships.

And when you fit them it's very difficult to let your eyes do things in a log-log plot.

You can get a very different distribution if you're careful about it statistically.

All right, so we end up here,

you know, just to emphasize that eyeballing log-log plots can be misleading.

So fat tails, yes.

An actual power distribution, no.

Here we're finding about two thirds of a power distribution and

one third of uniform at random seems to be a better fit of this.

Okay.

One other thing in terms of fitting this kind of model into data.

There are other things that the Friends of Friends model allows us to do.

So you know it does give us

a variety of degree distribution so we can span different degree distributions.

And it ties the degree distribution to the way that links were

formed in terms of having some meeting friends of friends.

But to emphasize some other things things that were missing from

the original models including a straight preferential attachment model is you

still don't get clustering you're just forming links at random and the chance that two

of my friends were already connected to each

other begins to vanish as time becomes large.

When you actually do things by this Friends of Friends model then what you end up doing

is connecting to people in ways that form triangles.

So let me just sort of give you an illustration of that

so you understand it's an important idea.

And so the idea,

recall here when we've got these nodes,

so we've got a new node and let's say suppose we have some existing nodes

and let's suppose that those existing nodes were already connected to each other.

Right. So this network and the way in which the node form some attachments

was first found one uniformly at random it found

this node and then it attached to one of its friends.

So the fact that the attachment was via this search process of

finding a friend of a friend means that we've actually formed a triple here.

And so we ended up with two of this nodes friends

being friends with each other partly because the way

it found people was through finding friends of friends.

So we get natural clustering directly through this process and that's different

than what happens in the peer preferential attachment model

or the uniformly at random model.

So the fact that you're doing things through network search gives clustering directly.

So the clustering is going to depend on this

a parameter but we'll find clustering in the data.

Second thing is diameter is naturally going to be

small partly because you've already got either you

know you've got random network formation

so you've either got something that looks uniformly at random or

preferential which even has a lower diameter than dash-ready kinds of networks.

Another thing and one thing we haven't talked about yet in much detail but you

will observe in a lot of networks is what's known as assortativity.

Okay and what does assortativity mean?

Let's think of the older nodes.

The older nodes are more likely to be

connected to each other because when they were born and young

the formation process was only between the ones that were existing at

that time period and so older nodes are more likely to be connected to other older nodes.

And so what we and up with is a correlation in the age of

nodes in connections which also means that they are

the ones that have a higher degree are going to tend to be connected

to other ones that have high degree and ones that have

lower degree are going to have

more connections to ones with lower degree because they're forming

ones later on their forming their links later on and

so will tend to have more links to younger nodes too.

And so what we'll end up with is

a degree distribution that exhibits some correlation in degrees.

More of a high degree node's friends will be

high degree and more low degree's nodes will be low degree.

So we get a series of other things coming out directly.

This is something which is going to come out of any of these growing systems.

So this comes out of the growth.

This just comes out of the fact that we've got random network.

And this part, the clustering,

comes out of the friends of friends part.

Right. So we've got the growth that's giving us assortativity,

the randomness gives us small world diameter,

and the clustering part,

the other part of small worlds is coming from the friends of friends of the process.

So we're going to get three extra things.