0:00

Welcome to Lecture three of Networks, Friends, Money, and Bytes.

Â In today's lecture, we will continue to talk about Google.

Â But this time, how Google rank webpages. In the last lecture, we talked about how

Â Google auctions the advertising slots, usually on the, the right panel of the

Â search results. And today, we'll talk about how Google

Â ranks web pages displayed in the main panel of the Search Result page.

Â And we will see that, the algorithm that Google uses to rank these web pages, can

Â be viewed as the solution to a gigantic system of linear equations.

Â So, while in this lecture, we'll start using some Linear Algebra and matrix

Â notation to accelerate to the presentation, you should know that at the

Â heart of all these notation, it is just solving linear equations.

Â Now, the idea that links can be embedded in text dates back to at least the middle

Â of last century. But then in the 1990s, with the

Â introduction of the Web in '89, the browser in 1990, and the web portal in

Â 1994, this vision of linking text grew to a gigantic scale.

Â Now, we can view these web pages, as a network.

Â They form a network of information represented as web pages.

Â So, each node is a web page. And a link is what's called a hyperlink.

Â It connects the text in one web page to that of another web page and you go from

Â this page to the other by clicking on that link.

Â As we all know, that webpage A points to webpage B doesn't mean that B points back

Â to A so we have to provide a sense of direction to this link.

Â We call this a directional link. And the resulting graph is called a

Â directed graph. How big is this graph?

Â It is very big. Nobody knows for sure but it is estimated

Â that N, the number of webpages out there, is somewhere between 40 to 60 billion.

Â At the same time, this graph is also very sparse meaning that there aren't that many

Â links compared to the number of no's out there.

Â Okay. Many of these webpages have very few links

Â pointing in or links pointing out. So, we're talking about a huge and a

Â sparse graph. It's a very special kind of graph.

Â 3:10

And the idea is that we will be able to leverage the pattern of connectivity in

Â this graph to calculate what we call the importance score of each node.

Â And in this case, each node is a web page, and importance score will be used to rank

Â that page and what we want to do is to say that let's turn a seemingly circular logic

Â that says, important nodes pointing to, say you, you being a web page, means that

Â you are also important. This argument almost sounds like a cyclic

Â argument. But, in fact, we can use it to derive a

Â recursive definition to define, define an equilibrium.

Â Now, this equilibrium is not the equilibrium of a convergence of an

Â algorithm, like in power control, or the equilibrium of strategic thinking as in

Â gang theory. This is a fixed point equation equilibrium

Â as we will see in the rest of this lecture, alright?

Â So, coming back to this question. How can we quantify the relative

Â importance of different nodes in a graph? Or, more specifically in this case,

Â quantify the importance of each web page. We will see later in Lecture eight, quite

Â a few different ways to quantify the notion of importance of nodes in a graph.

Â In today's lecture, we'll look at a specific one used to, by Google.

Â Now, there are quite a few intuitions you can think about this.

Â One is to say, well, you want to quantify node importance, right?

Â So, maybe, I'll just count how many links does this node have.

Â Well, since this is a directograph, where each directed link, I'll just count how

Â many links point into this node. In this case, there are three links

Â pointing into this node. So, we call this the in-degree.

Â 5:35

And the out-degree is one because there's just one node, one link pointing out of

Â this node. And the total degree of this node is four.

Â Okay. So, this is a simple intuition as we will

Â see that it's sort of getting on the right track, but it lacks many elements for it

Â to be a useful metric. Now, we will also see in the discussion of

Â important scores of nodes, a recurring theme in this course.

Â And that recurring theme says, that we never will talk about a network whether

Â it's a network of people, a network of webpages, a network of wireless devices.

Â There actually are two different elements in it.

Â One concerns the topology, that is the pattern of connectivity, and that's

Â usually represented visually by graphs and algebraically, by matrices.

Â We will see within today's lecture, three matrices to describe an increasing order

Â of complexity, the graph that we are dealing with.

Â But that's only half of the story. The other half is the functionality that

Â you would like to study that's running on top of this network.

Â That's what you do on the graph. In today's case, what you do is, can be

Â viewed as a special case of navigation or search.

Â Later in Lecture nine, we will talk six degrees of separation.

Â That's also a kind of navigation and search.

Â In Lecture thirteen, we'll talk about routing on the internet.

Â That's another special case here. So, in order to fully understand a

Â network, we need to know not just properties of the topology, but also

Â viewed models of functionalities. And, in fact, in Modules three and four of

Â today's lecture, we will see some very crude models of the functionality of

Â search and navigation that is implicitly assumed behind Google's ranking algorithm.

Â Alright, now, back to our intuition. That intuition says, let's count the

Â number of links. We call that Try number zero, that's the

Â very basic intuition. And we will soon see that is hardly a very

Â useful one, when it comes to ranking web pages.

Â Here's a little smarter idea that says, what we can do is to say, each node has a

Â certain score. Give it a number, give it a positive real

Â number. We're going to denote that number as pi

Â sub node index. In this case, say, node one.

Â So, it's got a number denoted as pi sub one.

Â And another node, say node two, also has a number, pi sub two, okay?

Â And here's another one, node three with a number pi sub three.

Â And our job for the rest of this lecture is to compute these pis, what we called

Â importance scores, we can collect them into a vector, pi one, pi 2...

Â Up to pi N. This is a very long vector and by

Â convention, it's a column vector and we call this vector pi, okay?

Â In the textbook or in written media printed media, you can't use bold face.

Â It's hard to draw bold face on the screen so we're just going to use this arrow to

Â represent that vector. So, pi vector, that's what we want to

Â compute. So, Try one says, that look, you got these

Â pis, okay. Maybe you can just add up the pis that

Â points to you. For example, if node one and node two both

Â point to node three, then I can say that node 3's importance is the sum of pi one

Â and pi two, okay. Simple enough.

Â This is slightly better than just counting degrees cuz I now allow each node to have

Â different important score, okay? And, of course, in general, first node and

Â second node, they may have many other connections coming in and going out.

Â 10:28

So, this is a reasonable choice, okay? However, what it ignores is that some

Â node, like this node one here and two, they are pointing to many other nodes, not

Â just pi of node number three. So, think of a professor who writes

Â reference letters for her students. And if this professor keeps writing that

Â this is the very best student I have ever had in my 80 years of teaching at this

Â university, and so on. If every student should rise in that way,

Â and should rise like, say 100 letters every semester, then soon people will

Â realize that, hey, this professor's reference letters should be discounted.

Â And that's the idea here, is that you've got to normalize by how you spread your

Â important score. If you point to too many web pages, then

Â each of those web pages that you point to should not receive all your importance.

Â 12:40

So, that's try number two now, okay? The ideas that we get to normalize by the

Â strata of importance. It turns out that by this point, we're

Â sort of close to what Google does. In the next of three modules, we will have

Â to quantify this intuition and then, provide two more enhancements about what

Â are getting there. The idea is that, look at all the

Â importance scores, add them up across the links pointing to you, but make sure you

Â normalize by the outgoing number of links. Now, you may wonder, if I have a node

Â here, number two with important square pi two, I can write pi two as some

Â combination of the important scores normalized from the incoming neighbors.

Â And each of these, say, this involves pi one, can also be written as the summation

Â of normalized important scores from its incoming neighbors.

Â So, the question is, on the left-hand side, I will have pi one, pi two, and all

Â the pis. On the right-hand side, I also have all

Â these pis, divided by their outgoing number of neighbors.

Â So, I've got pis both on the left-hand side and the right-hand side of this, this

Â linear equation system. Can they actually make sense collectively?

Â Is there is a set of consistent scores pi? Is there a vector pi that, can satisfy all

Â these linear equations simultaneously? If so, then we have found a particular

Â meaningful and self-consistent way to assign importance scores.

Â If not, then we're in trouble. So, the question now is how can we modify

Â this basic intuition in our try number two now, in order to guarantee both the

Â existence of such a set of consistent scores and an efficient way to compute

Â these scores? But before we go into the general space,

Â let's look at a simple, very small example.

Â There are only four nodes and six directed links.

Â Nodes one and two points to three, three points to four, and then four points to

Â back to one, two, and three, okay? Now, what is the intuition of the

Â importance score vector pi, which should be a vector of length four here?

Â If we assume that such a pi consistent set of scores actually exist.

Â Here's one particular view. We can think of nodes three and four as

Â collectively acting like a supernode, okay?

Â Let's call this supernode three+4. And here's node one.

Â Here's node two. One points to supernode three + five, so

Â does two, and a supernode points back, okay?

Â So now, intuitively, you look at the supernode and say, this supernode got two

Â links pointing inwards. But each of these nodes, one and two, only

Â has one length pointing inward. So, intuitively, you'll think that nodes

Â one and two should each, has a smaller important score than the supernode three +

Â four. And furthermore, by symmetry of this graph

Â here, nodes one and two should have the same importance score.

Â And then, the supernode three + four. Need to distribute its score in some way

Â between nodes three and four. So, at least we know intuitively that pi

Â one should equal to pi two, and each of them should be smaller than pi3 + four,

Â the supernode score, okay? So, intuitively that's what we should.

Â Anticipate. Now, we can write down the math a little

Â further and calculate in this small example very easily the result.

Â Now, first of all you'll also notice that, in this node of three pointing to node

Â four, it says, that basically all the important score of pi three that we get

Â goes into node four. And node four has only got this one

Â incoming link. So, pi four should equal to pi three, or

Â pi three divided by one, cause there's only one outgoing link from node three.

Â And plus what? There's no other terms.

Â There's only one link pointing inwards to node four.

Â So, we can just say pi four equals pi three, and pi one equals pi two.

Â Now, once we know that, the job is very simple now.

Â Let's, say, pi one equals pi two, this number is x.

Â Pi three equals pi four, this number is y, okay?

Â Now, we know that there is one link from node four pointing to node one, right?

Â 19:19

So now, we got two equations. One says, x = y / three.

Â The other says, by normalization 2x + 2y should equal to, say, a constant one.

Â Now, we got two variables in two linear equations.

Â In this case, we got a solution. The solution is x = one / eight and, and y

Â = three / eight, which satisfies our intuitive understanding in the last slide.

Â Now, therefore, what we're going to output is that we will say, nodes three and four

Â pi for position number one. Three, and four, and nodes one and two

Â tied for position number, I guess three, okay?

Â Cuz positions one and two are pi between nodes three and four.

Â So, you can't display three and then four, and then one and then two, for example, on

Â the Search Result page by Google. Now, is this the right answer or not?

Â And that's actually a tricky, almost philosophical question.

Â What is a right answer? In some sense, the right answer depends on

Â how does each user find that Google search results useful or not as useful.

Â Now, because Google is such a dominant force in the search space today, it's

Â actually hard to imagine what other search orders might look like so it's not easy to

Â actually falsify whatever we might conjecture about that.

Â And it is just very hard to measure or quantify subjective view on the usefulness

Â of the search results, okay? So, what we are doing here is really a

Â proxy of the real problem, which is make each individual users find the search

Â results as useful as possible. And furthermore, as you can see, that we

Â carried out a whole bunch of computation. Well, in this case, it's not that much,

Â two variables and two linear equations. But the actual numbers of the important

Â scores are not displayed in the search result page.

Â All you see is the rank order. You do not see the scale behind the order.

Â So, order is displayed. Scale, the numerical scale is hidden from

Â users' view, okay? So, in that sense, scale is really a means

Â to the ends. So, we don't know, for example, that this

Â node is three times more important than that node.

Â