0:21

Named after the mathematician Bayes, whose work was then further extended by the famous mathematician Laplace. And Laplace raised a very interesting question: what's the chance that the sun will rise tomorrow, if you have observed that it has been rising every morning for the past, say, 100 million days?

Now, you may think this is a funny question. But for now, ignore the fact that you know something about the underlying physics of why the sun rises. View this purely as an observational question: something has been coming up every morning for the last 100 million days, so what's the chance you predict it will come up again tomorrow? Can you say, well, it must be exactly 100 percent, mathematically? Again, we're not talking about the underlying physics, just about the fact that this is such an overwhelming number.

So, what should it be? This is an interesting thought experiment, and we can simplify it a little further. Suppose I give you a sequence of n experiments, and I tell you that s out of the n experiments, for some s smaller than n, returned one. Okay? And I ask you the question: what's the chance that the next experiment also returns one? Now, without going through the foundations of probability theory, intuitively the answer is just s over n: in the past, s out of n runs gave us the result of one as our observation. So, call this question number one.

Suppose now I switch to another question: again, I run n experiments. But this time, the experiment is that there's a coin that's loaded, and I flip it. If it's heads, I return one; if it's tails, I return zero. And again, s out of the last n runs, I observe one. I ask you the question: what's the chance that next time you'll also see one? You may say, hold on, isn't this question the same as the last question? Shouldn't the answer also be s over n?

Well, not according to the Bayesian view of probability. The Bayesian view is a very powerful, and for some time controversial, view in the probability theory community. What it says is this: now that I've given you some prior information, namely that the experiment consists of flipping a loaded coin, you can build a model of that coin based on the observations in the last n rounds. And therefore, your answer may be different. What is that answer? Let's derive it in the next five minutes, and then I will come back to answer this question: why is it different from s over n?

The essence of the Bayesian view, its philosophical underpinning, is captured in this picture: you've got an underlying model. Later in the course, we'll also see different kinds of latent factor models, hidden models, and reverse engineering of network topology as well as protocols. Philosophically, they all follow a similar spirit. In this simple question, the underlying model is captured by a single parameter p: the chance that a single flip of this loaded coin will result in heads, and therefore in the observation one.

Now, different p's will clearly lead to different observations. That goes without saying. But the Bayesian view also says that different observations tell us something about p. And the more observations I have, the better a model I can build for p, from which I can then do forward engineering to predict the future outcome. You first reverse engineer the p before you make the prediction.

So, in our case, if you know the value of p, then the chance of seeing s heads out of n flips is simply given by the binomial distribution. It is p to the power s, because we observed s such cases, times one minus p to the power n minus s, because we observed n minus s of the opposite cases, and there are n choose s possible arrangements of the sequence with s out of n being one. So, this is just the binomial distribution, which we all know, for a fixed p.
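(A quick sketch, not from the lecture: the binomial likelihood just described can be written in a few lines of Python.)

```python
from math import comb

def likelihood(p, n, s):
    # P(s heads in n flips | coin bias p): the binomial distribution.
    # comb(n, s) counts the possible arrangements of s heads among n flips.
    return comb(n, s) * p**s * (1 - p)**(n - s)

# For example, a fair coin, 10 flips, 5 heads:
print(likelihood(0.5, 10, 5))  # about 0.246
```

For any fixed p, summing this over all s from 0 to n gives 1, as a probability distribution over outcomes should.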

What now flips things around, what turns the table, is this: since that's the case, the probability distribution of p, let's call it f of p, should be proportional to this observation frequency. If we count the frequency of heads in the observations, that tells us something about the underlying probability distribution of this value p. And let's pause for one second, because while it sounds intuitive, it's actually counterintuitive at first sight.

Because we're now saying: before we go ahead and predict, let's build a model of p, and that model says p's distribution should be proportional to our observation. This proportionality principle is what Laplace used to extend Bayes' understanding of the relationship between observation and model. And I say "proportional" because this by itself is not a distribution. We have to normalize it by integrating over all the possible p's, where p can range from zero to one. Now, this is indeed a probability distribution, okay? As a function of p, that's the probability distribution f of p.

So, all we need to do now is evaluate this integral and then do the division. Skipping the detailed steps, because that's not what we care about in this course, we get the following answer: it's n plus one factorial, over s factorial times n minus s factorial, times p to the s, times one minus p to the n minus s.
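(A sanity check, not part of the lecture: we can code this density f(p) = (n+1)!/(s!(n-s)!) p^s (1-p)^(n-s) and verify that it integrates to one.)

```python
from math import factorial

def posterior(p, n, s):
    # f(p): the normalized distribution of the coin bias p
    # after observing s heads in n flips.
    coef = factorial(n + 1) // (factorial(s) * factorial(n - s))
    return coef * p**s * (1 - p)**(n - s)

# A midpoint Riemann sum over [0, 1] should come out very close to 1.
steps = 100_000
area = sum(posterior((k + 0.5) / steps, 10, 7) for k in range(steps)) / steps
print(area)  # very close to 1.0
```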

Â 7:11

Okay? So, that is the probability distribution of p, as a function of n and s: as a function of how many runs you have carried out and how many ones you have observed. And now we're basically done, because we know the conditional probability of observing another one, given a particular value of p, is just p, right? Because we know what p is; conditioned on that, the chance of the coin flip coming up heads is just p. So, we have to remove the conditioning by integrating this answer, p, against the probability distribution f of p, for p from zero to one.

And, substituting that complicated-looking expression for f of p and doing the integral, again taking the steps from integral tables, the answer is remarkable, and remarkably simple: s plus one, over n plus two. Let's recap.

We started by saying that you want to make a prediction, right? But hold on: first, you know something about a model. So, based on your observations, you first build the model around p, and that's how we get the distribution of p. Then you make the prediction, and the answer is s plus one over n plus two, which is clearly not s over n. So, mathematically, you know this is the right answer if you follow the Bayesian analysis. But, intuitively, why not s over n?

Well, here's one way to make yourself feel comfortable intuitively. The moment I tell you this prior information about the model, it's as if I've told you that I've run two trials for free: one trial showed heads, one trial showed tails. So, when you tally, you should count n plus two trials, and the number of heads observed is s plus one, okay? Because the impact of that prior knowledge about the underlying model is, roughly speaking, intuitively speaking, equivalent to two free runs: one showing heads, one showing tails.
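(A numerical check, not from the lecture: integrating p against the posterior density derived earlier should reproduce Laplace's closed form s+1 over n+2.)

```python
from math import factorial

def posterior(p, n, s):
    # f(p) = (n+1)!/(s!(n-s)!) * p^s * (1-p)^(n-s)
    coef = factorial(n + 1) // (factorial(s) * factorial(n - s))
    return coef * p**s * (1 - p)**(n - s)

n, s = 100, 73
steps = 100_000
# E[p] = integral over [0,1] of p * f(p) dp, via a midpoint Riemann sum
expected = sum((k + 0.5) / steps * posterior((k + 0.5) / steps, n, s)
               for k in range(steps)) / steps
print(expected)           # numerically matches...
print((s + 1) / (n + 2))  # ...the closed form, 74/102
```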

Â 9:42

So now, we can take this philosophy and use it to understand how to rank competing products of the same category on Amazon. We know that seeing heads show up 100 times out of 103 runs tells us something different from observing ten heads out of thirteen runs. Similarly, on Amazon, knowing that one product has a lot more reviews than its competitor should give us some information.

So, suppose you've got a list of products indexed by i. For each product i, there are n sub i reviews, and the simple, naive average rating is r sub i, okay? Then, look across all of the products in the same category. You can add up the ni's; call that total number of reviews for all the products in the category capital N, okay? And then you can look at the reviews: the summation of niri, divided by N. That is the naive average across the whole product category; call it capital R.

So, if I know a certain product i got a lot of reviews, a large ni relative to N, I should put more trust in the corresponding ri. Whereas if there are very few reviews, a very small ni relative to N, then I would rather put my trust in the category average, okay? It's like saying: if this product got a lot of reviews, I should trust its own average. But if it got too few reviews relative to N, then I should pretend it is an average product across the entire category.

Â 11:53

And it turns out that the Bayesian adjusted value, ri tilde, is given by the following expression: it is NR, which is really the summation of niri, plus niri, all divided by N plus ni. So, we can view this as a sliding ruler. On one end is product i's own average review, ri; on the other end is the total average R across all products in the category. And the actual ri tilde, the Bayesian adjusted average rating, is somewhere between the naive average of your own product and the category-wide average, depending on how much I can trust this ri. If this ri has a lot of ni backing it up, I'll move closer to ri. If it's got very few, then I'll just say, you know what? I think I would rather not trust this ri that much, and move more towards R, the average rating over all the products. So, ri tilde slides somewhere in between, and it slides according to this formula. Let's take a look at two extreme cases.

In one extreme case, the ni for a certain product is so small that we can ignore it compared to N. Say ni is near zero: say it's one, and N is ten trillion. Alright. So, we can pretty much ignore the niri term and the ni term, and in that case ri tilde is just R. It moves all the way to this extreme end. Or, let's say ni for a certain product is N over two, so half of the reviews in this category go to this particular product. Relative to the other competing products, we can say this product i got a lot of reviews, so we should be able to trust its rating more. If we substitute ni equals N over two into this expression, what we see is that ri tilde now equals two-thirds R plus one-third ri. So, it's moving much closer to ri now.
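(A small sketch, with made-up product numbers, of the sliding-ruler formula just discussed:)

```python
def bayesian_adjusted(reviews):
    # reviews: list of (n_i, r_i) pairs -- review count and naive average.
    # Returns r~_i = (N*R + n_i*r_i) / (N + n_i) for each product,
    # where N = sum of n_i and R = (sum of n_i*r_i) / N.
    N = sum(n for n, _ in reviews)
    R = sum(n * r for n, r in reviews) / N
    return [(N * R + n * r) / (N + n) for n, r in reviews]

# Product A has many reviews, so its adjusted rating stays near its own 4.0;
# product B has only 3 reviews, so its 5.0 is pulled nearly all the way to R.
adjusted = bayesian_adjusted([(1000, 4.0), (3, 5.0)])
print(adjusted)
```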

And it turns out that exactly this formula is used in the Internet Movie Database, IMDb's, ranking of the top 250 movies of all time, based on all the ratings of these movies, okay? It doesn't use a simple naive average, because some movies get a lot more reviews and ratings than other movies. Instead, it uses this Bayesian adjusted value, NR plus niri, over N plus ni. The relative sample size of the reviews, ni, relative to the total across all the movies, plays a role here. Here's another example: Beer Advocate.

This is a website that ranks all the different kinds of beers around the world, and it uses almost the same formula, okay? Each beer i's Bayesian adjusted rating, r tilde i, equals N min, and I'll explain what that is in a minute, times R, plus niri, over N min plus ni. Here, N min is the minimum number of ratings that you need in order to even show up in the top beer ranking on this website. So, why use N min instead of N, the total accumulated number of ratings? One reason is that, if you use this N, then as time goes by, the total N would only increase, and the dynamic range of r tilde i would keep shrinking.
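(A sketch of this N min variant, with illustrative numbers; the actual site's threshold and averages are not from the lecture:)

```python
def adjusted_with_min(n_i, r_i, R, n_min):
    # Same sliding ruler, but anchored by n_min (the minimum number of
    # ratings needed to appear on the list) instead of the ever-growing N.
    return (n_min * R + n_i * r_i) / (n_min + n_i)

# A beer just past the threshold is pulled strongly toward the site average R;
# with many more ratings, its own average r_i dominates.
print(adjusted_with_min(n_i=12, r_i=4.8, R=3.7, n_min=10))   # about 4.3
print(adjusted_with_min(n_i=1000, r_i=4.8, R=3.7, n_min=10))
```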

So, Beer Advocate decided to use a twist on this formula: instead of looking at the total number of ratings, just look at the minimum number of ratings that you need in order to show up on the website. In fact, you can pick this anchor as N, as N min, or as some other number that somehow corresponds to a notion of the total review population. What about Amazon, then?

Does Amazon use Bayesian analysis in its ranking? And what else does it use? We'll come to this in the next and final segment of this lecture's video. But first, let me highlight some drawbacks, some limitations, of this Bayesian ranking. Number one: this is useful for ranking.

Â 16:55

Therefore, you must have a backdrop of the scale of all competing products' numbers of ratings. If you just want to adjust the number of stars of a single product, say 4.5 stars from 121 reviews, you cannot do that in isolation, okay? You must also know about the competing products: how many reviews and what naive number of stars did they receive, in order to do the adjustment and arrive at a ranking of the products. That's why, on Amazon's home page, you can click through a number of links to look at the ranking, which actually incorporates the Bayesian adjustment.

But if you look at just the number of stars, they still show you the simple naive average. And you have to process, in your own head, the impact of the number of ratings, somehow according to your own notion of N. The second limitation is that this formula implicitly assumes a Gaussian distribution of the review scores. As we mentioned, reviews often have a bipolar, bimodal distribution, or, in general, a multimodal distribution, which would lead to a slightly different and more complicated formula. But we will not have time to go through that set of material.

So, what we're going to do now is go through two examples on Amazon, a small one and a larger one, to figure out what Amazon does and how it uses the Bayesian viewpoint to do ranking.
