0:14

Cinematch, the internal recommendation engine that Netflix was using.

And it gave me a certain RMSE, and

I'm wondering if I can do better.

For example, can I improve this by say 10%?
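To make the target concrete: RMSE (root mean squared error) measures how far predicted ratings are from actual ones, and "improving by 10%" means shrinking the RMSE itself by 10%. Here is a minimal sketch; the ratings are invented toy numbers.

```python
import math

def rmse(predictions, actuals):
    """Root-mean-square error between predicted and actual ratings."""
    n = len(predictions)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n)

# Toy example: predictions vs. true 1-5 star ratings.
preds = [3.5, 4.0, 2.5, 5.0]
truth = [4, 4, 2, 4]
baseline = rmse(preds, truth)

# A 10% improvement means shrinking the RMSE itself by 10%.
target = 0.9 * baseline
```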

You might say, wow, 10% is not a big improvement, is it?

Well, first of all, Cinematch was not a naive algorithm, and

second, even a few percent improvement in the RMSE can change

the order of the recommended movies and therefore the effect

on customer satisfaction, inventory control, and so on.

From 1999 to 2006,

I've got seven years of ratings by

different users on different movies.

1:51

So it's big data.

It's also sparse data, because only

about 1% of the total number of

possible ratings were actually known.

If you multiply the number of movies by the number of users,

you see that the known ratings are actually a small fraction of that.
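A quick back-of-the-envelope check using the published dataset dimensions (roughly 480,189 users, 17,770 movies, and about 100.5 million ratings) confirms the roughly 1% density:

```python
# Published Netflix Prize dataset dimensions (approximate).
num_users = 480_189
num_movies = 17_770
num_ratings = 100_480_507

possible = num_users * num_movies   # ~8.5 billion possible (user, movie) pairs
density = num_ratings / possible    # fraction of entries actually rated

print(f"{possible:,} possible entries, {density:.2%} filled")
```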

2:21

The average numbers of users per movie and

movies per user are reasonably big numbers.

But, some users actually rated a lot of movies.

One user actually rated almost every single movie on there.

3:11

The prize would go to whichever team, or person, could get to 10% first.

So this was the famous Netflix prize,

it was open to all countries and people, except just a couple.

It's online.

Okay, you don't have to show up in person.

And it's international.

3:34

Clearly, both the $1 million and, perhaps more importantly, the huge data

set made available were very attractive to the machine learning,

data mining, and information retrieval research communities.

So, we're going to take a quick look at the data set that's involved here.

The entire data set, which is a little over 100 million entries of

ratings r_ui, was divided into four parts:

the training set, the probe set, the quiz set, and the test set.

You may ask why they made it so complicated: for example,

why not just make it two sets, the training set and the test set?

It turns out this is a pretty smart arrangement.

The training and probe sets were made public,

and they are just big enough to be very interesting,

yet small enough to fit on a normal laptop in 2006.

And then the quiz and test sets, these data were hidden, okay?

Only Netflix had the ground truth.

4:51

So, as far as competitors go, say you and I form a team.

As far as we're concerned, we have access to the training data,

which is almost 100 million ratings, plus 1.4 million ratings of probe data.

5:04

Since we have access to this data, we can hold the probe set out as ground truth,

train our algorithm, and then look at the recommendation accuracy on the probe set.

It turns out that the probe set's size and statistical characteristics,

in terms of the distribution of movies and users,

are very similar to those of the quiz set and the test set, which we don't know.

So we can use this to test our methods as frequently as we want, many times a day.
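This workflow can be sketched as follows, with made-up data and a hypothetical `evaluate_on_probe` helper: hold the probe set out, train on the rest, and score predictions by RMSE.

```python
import math

def rmse(preds, truth):
    """Root-mean-square error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth))

def evaluate_on_probe(predict, probe):
    """probe: list of (user, movie, true_rating); predict(u, m) -> float."""
    preds = [predict(u, m) for u, m, _ in probe]
    truth = [r for _, _, r in probe]
    return rmse(preds, truth)

# Toy probe set and a trivial "always predict the global mean" model.
probe = [(1, 10, 4), (2, 11, 3), (3, 12, 5)]
global_mean = 3.6  # in practice, computed from the training set
score = evaluate_on_probe(lambda u, m: global_mean, probe)
```

Because the probe set is public, this evaluation can be repeated as often as desired without touching Netflix's official quiz-set submission channel.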

5:49

Why is there a limit on submissions?

Because if we could submit too many times, say every 50 seconds,

we might get enough hints about the actual underlying data to

reverse engineer the ground truth, and then it wouldn't be useful for Netflix.

6:54

And you wonder, this 10% improvement, why not, say, 11%?

Why not 9%?

I don't know how Netflix decided; 10 is a good round number.

But had it been 11%, it would have been a lot more challenging.

It so happens that, for this metric and this set of data,

both training and test, getting to 11% would have been extremely difficult,

and getting to 9% would have been a little too easy.

So 10% was a good pick in hindsight.

7:23

So what happened in the competition is a very interesting

story; we'll just briefly present some highlights.

First of all, very soon, within a week,

somebody was already able to beat Cinematch.

But you need to beat it by 10%, starting from an RMSE of 0.9514,

measured to four digits of accuracy on the quiz set.

Remember, the test set is reserved for the final phase.

So beating Cinematch by a tiny bit,

in terms of RMSE on the quiz set, was achievable within one week.

8:07

Now you have to push this down by 10%, all the way to 0.8563.

Now you may wonder: these are very small numbers, less than 1, and

10% of that is very small.

Well, don't forget, this whole thing is on a scale of 1 to 5 stars,

so a deviation of 1 is actually pretty big.

For example, if you see a movie recommended with

4.5 stars versus 3.5 stars, that clearly shows a pretty big difference.

All right. And then, in almost one year's time,

by September 2007,

there was a team called BellKor that made

an 8.26% improvement over Cinematch.

Very good; it seemed that 10% would be eminently achievable,

but that turned out not to be the case.

The first place changed hands a few times until,

in the last hour before the first year of the competition ended,

this same team retained the leading position with an 8.43% improvement.

And that gave them the $50,000 progress prize for

leading the chart, even though they didn't succeed in reaching 10%.

And then what happened from 2007 to 2008 is that teams started to merge.

For a mom-and-pop shop, it became very

difficult to stay anywhere on the chart.

Because the smart teams merged together so

that they could share their secrets. In particular, BellKor

and another team, BigChaos, merged,

and the combined team won the 2008 progress

prize in October of that year.

At that point the RMSE got down to 0.8616,

and then they merged again with a team called Pragmatic Theory.

10:19

So now these three teams became one team,

called BellKor's Pragmatic Chaos,

grabbing one word from each of the three original component teams.

And in June 2009, almost three years into the competition,

this merged team became the first one to achieve more than a 10% improvement.

In fact, it got a 10.06% RMSE

improvement over the benchmark.

At that point, the competition entered the final, last-call phase:

all teams got 30 days to make their final submissions.

At the end of that time, two teams had actually beaten Cinematch by more than 10%.

11:07

One team is BellKor's Pragmatic Chaos,

which got an RMSE of 0.8554.

The other, called The Ensemble, got 0.8553, actually

slightly better; but remember, that's on the quiz set.

11:36

And here is the grand finale: on the test set,

both of them received an RMSE of 0.8567.

The same performance up to four digits,

which is the last digit that counts. So, who's going to win?

It had to be determined by who submitted the final solution earlier.

It turns out BellKor's Pragmatic Chaos submitted their algorithm 20 minutes

12:12

Before The Ensemble. So, because they both achieved the same performance, or

RMSE, on the test set, it came down to the submission time.

And BellKor's Pragmatic Chaos won the competition with a differential

of 20 minutes in this wonderful, almost three-year scientific quest.

That's the grand finale. So

you must be wondering: what did BellKor's Pragmatic Chaos

use to achieve this milestone and receive the prize?

12:49

Well, you can actually go online and see three documents,

three PDFs, where you can read all the algorithmic details.

As part of the competition deal, you have to release your methods publicly and

allow Netflix to use them or their variants.

So this is all public information at this point.

But if you go through it, you'll see it is just a bag of many tricks,

with thousands of parameters fine-tuned on the training data.

13:30

That's what we're going to target in the remaining part of this lecture.

If you want to get to just above 10%,

it would take many more tricks.

So I will focus on how to get to 8 or 9%,

and the key ideas behind it.

So here is a pictorial illustration of the problem, the problem is that you've got

many users, you've got many movies, and you can put them into a matrix or table.

You can also think of this as a bipartite graph, with the user nodes on the left,

movies nodes on the right, and it's bipartite because only users and

movies are paired up.

And then you can give weights to these links,

1 to 5, so there are many possible angles.

First we're going to view this as a table.

Now, this table is actually a little misleading because it is too small.

There are only 48 possible entries, whereas there

can be around 10 billion possible entries in the actual Netflix Prize.

It's also very dense: you can see that about half of

the cells are filled with actual rating data;

it's a lot sparser in the actual Netflix Prize.

So the size and the sparsity are not accurate, but

they suffice to conceptually illustrate what the problem is.

You're given these data points, which are integers from 1 to 5,

and a blank means that this user has not rated this movie.

Now, she may or may not have watched the movie, but there's no rating recorded.

And you see four question marks; these are basically

entries where Netflix knows the ground truth, but

you and I, as teams entering the prize competition, do not.

15:43

Or maybe we do, but we have reserved it as the probe set.

In any case, when we train, we use the data shown here.

And then when we test, or when Netflix tests,

they can use the hidden ground truth.

16:10

Maybe I'll say: just look at this movie's column, okay?

If this movie has got a lot of high ratings, just give it a high rating like 5.

Or maybe this one is 5, 2.

This one is 4, 4, 5.

That looks like a good movie;

make it 4.5.

Okay, this one is 2, 3, 4.

That's a tough call.

16:29

Maybe I'll look at this row here.

Okay, I'm trying to predict how user 5 would like movie 3.

User 5 seems to give 3, 3 all the time;

maybe we should also give a 3 here.

What about this entry?

16:46

Movie 7 gets 1, 3, and 4, which is kind of diverse,

and this user gives 2 and 4.

So it's really hard to say what this might be.

Okay, it could be 1, 2, 3, 4, or 5.

I don't know; maybe just give the average

of all the entries here.

Say the average is 3.5; all right, 3.5.

Sounds a little like filling out the table in Sudoku.

17:59

In other words, just look at an individual column or row.

A column corresponds to a movie; let's say we take an individual column

and do not worry about what the other columns show.
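The per-column heuristic above, predicting a missing entry from that movie's average, can be sketched like this; the table values below are invented for illustration, echoing the 4, 4, 5 example:

```python
def movie_average_predict(ratings, movie):
    """Predict a missing rating as the mean of known ratings for that movie.
    ratings: dict mapping (user, movie) -> rating."""
    known = [r for (u, m), r in ratings.items() if m == movie]
    if not known:
        return None  # no information about this movie at all
    return sum(known) / len(known)

# Toy table: movie 2 has ratings 4, 4, 5; movie 7 has 1, 3, 4.
ratings = {(1, 2): 4, (3, 2): 4, (5, 2): 5, (2, 7): 1, (4, 7): 3, (6, 7): 4}
print(movie_average_predict(ratings, 2))  # mean of 4, 4, 5
```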

18:11

And this is similar, in a crude analogy, to Google.

If Google were computing a relevance score from the content

of each page alone, that would be content-based.

But if you want to compute a score by looking at the connectivity pattern

among the web pages, or, in this case, what the other movies show and

tell about this movie, you need to go from content-based to collaborative filtering.

19:03

Okay?

So if some user likes this movie, maybe she will like that movie too.

We are basically defining a neighborhood relationship between movies.

Now, we could also do a neighborhood relationship construction for the users.

If certain users' existing ratings show that their tastes are very

similar, then maybe, if user 3 likes a movie,

user 4, even though she hasn't watched that movie, will like it too.
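A minimal sketch of such a movie-movie neighborhood predictor, using cosine similarity over the users who rated both movies. The data and the specific weighting scheme are illustrative, not the prize-winning variant:

```python
import math

def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity between two movies' rating vectors,
    restricted to users who rated both. ratings_*: dict user -> rating."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    na = math.sqrt(sum(ratings_a[u] ** 2 for u in common))
    nb = math.sqrt(sum(ratings_b[u] ** 2 for u in common))
    return dot / (na * nb)

def predict(user, movie, by_movie):
    """Similarity-weighted average of the user's ratings on neighbor movies."""
    num = den = 0.0
    for other, r in by_movie.items():
        if other == movie or user not in r:
            continue
        s = cosine_sim(by_movie[movie], r)
        num += s * r[user]
        den += abs(s)
    return num / den if den else None

# Toy data: by_movie[movie] = {user: rating}.
by_movie = {
    "A": {1: 5, 2: 4, 3: 1},
    "B": {1: 5, 2: 5, 3: 2},
    "C": {1: 1, 2: 2},         # user 3 hasn't rated C yet
}
print(round(predict(3, "C", by_movie), 2))
```

Since user 3 gave low ratings to the similar movies A and B, the prediction for C comes out low as well.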

19:42

There's another idea, which is a little more involved, and

we'll postpone it until the advanced material part of the video.

It says that maybe there is some kind of feature set for these movies.

For example, is this a romantic movie, an action movie,

a thriller, a sci-fi movie?

Is it long, is it short, and so on.

And this feature set is much smaller than the set of possible movies:

there might be 20,000 movies but only about, say, 20 different genres.
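This is the seed of the latent factor idea: learn a small vector of hidden features for each user and movie so that their dot product approximates the known ratings. Below is a bare-bones SGD matrix factorization on toy data; every hyperparameter is illustrative, not the prize-winning recipe.

```python
import random

def factorize(ratings, n_users, n_movies, k=2, steps=3000, lr=0.02, reg=0.05):
    """Learn user factors P and movie factors Q by stochastic gradient
    descent on squared error. ratings: list of (user, movie, rating),
    with 0-based user and movie indices."""
    rng = random.Random(0)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_movies)]
    for _ in range(steps):
        for u, m, r in ratings:
            pred = sum(P[u][f] * Q[m][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qm = P[u][f], Q[m][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qm - reg * pu)
                Q[m][f] += lr * (err * pu - reg * qm)
    return P, Q

ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]
P, Q = factorize(ratings, n_users=3, n_movies=2)
pred = sum(P[2][f] * Q[0][f] for f in range(2))  # user 2's unseen movie 0
```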

20:48

We just mentioned this: large and sparse data is the big challenge.

One solution is to just look at each row and

each column in isolation; that's content-based filtering.

The other is to look at the entire table.

That's called collaborative filtering.

Collaborative among the users and movies, I guess.

And like content-based filtering, it's a centralized computation, so

we don't need a distributed computation.

21:14

And within collaborative filtering there are two main flavors.

It turns out these two are also interestingly connected,

but we will skip that;

it's in the textbook and the advanced material.

One is the neighborhood method.

The other is the latent factor method.

21:31

On the way to the neighborhood method, we'll take a couple of detours.

One is into a particular optimization problem called least squares,

which is one of the most frequently used

optimization problems in many diverse fields:

from the control of airplanes to your remote,

from how light bulbs are controlled on the factory floor to supply chains.

It is used in finance

22:04

and economic modeling; it is used in many communication systems.

It's a hugely useful optimization problem.

And it is a special case of something called convex optimization.

So we will take about a five-minute detour to define convex

optimization, because this will be a recurring theme;

we'll see it a few times later in the course.
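As a small taste of least squares: with a single feature, minimizing the sum of squared residuals has a closed-form solution via the normal equations, and because the objective is convex, that stationary point is the global minimum. A toy line fit:

```python
def least_squares_line(xs, ys):
    """Fit y ~ a*x + b by minimizing the sum of squared residuals.
    Closed-form solution of the normal equations for one feature."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on y = 2x + 1; least squares recovers the line.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```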

22:49

And the timestamp t_ui is also important.

First, movies come in and out of fashion over that period of seven years.

And second, different users may also have different moods on different days.

There's a batch-processing effect: if this user decides to rate

the last ten movies she watched on Netflix

on the same day, those ten ratings tend to have a different statistical distribution

than if they had been entered right after watching each movie, on different days.
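One simple way to expose this batch effect is to bucket each user's ratings by the date they were entered; ratings sharing a date came from one sitting and can be given their own offset. A toy sketch with invented data:

```python
from collections import defaultdict

def daily_user_means(ratings):
    """Mean rating per (user, date). Ratings entered in one batch share a
    date, so a per-day offset can capture the batch-processing effect."""
    buckets = defaultdict(list)
    for user, movie, rating, date in ratings:
        buckets[(user, date)].append(rating)
    return {key: sum(v) / len(v) for key, v in buckets.items()}

# Toy log: user 1 rates three movies in one sitting, one movie another day.
log = [
    (1, "A", 4, "2005-03-01"),
    (1, "B", 4, "2005-03-01"),
    (1, "C", 5, "2005-03-01"),
    (1, "D", 2, "2005-06-10"),
]
print(daily_user_means(log))
```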