0:14

the internal recommendation engine that Netflix was using. And it gave me a certain RMSE value, and I'm wondering if I can do better. For example, can I improve this by, say, 10%? You might say, wow, 10% is not a big improvement, is it? Well, first of all, Cinematch was not a stupid algorithm, and second, even a few percent improvement in the RMSE can change the order of the recommended movies, and therefore the effect on customer satisfaction, inventory control, and so on. From 1999 to 2006, I've got seven years of ratings by different users on different movies.

1:51

So it's big data. It's also sparse data, because only about 1% of the total number of possible ratings were actually known. If you multiply the number of movies by the number of users, you see that the known ratings are actually a small fraction of that.
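As a rough sanity check on that 1% figure, here is a back-of-the-envelope sketch; the user, movie, and rating counts are approximate published Netflix Prize figures, not numbers stated in this lecture.

```python
# Approximate published Netflix Prize figures (assumed, not from this lecture):
num_users = 480_000
num_movies = 17_770
num_ratings = 100_000_000          # known ratings, a little over 100 million

possible = num_users * num_movies  # all possible (user, movie) pairs: ~8.5 billion
sparsity = num_ratings / possible  # fraction of the table actually filled in
print(f"{sparsity:.1%}")           # roughly 1%
```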

2:21

The average number of users per movie, and of movies per user, are reasonably big numbers. But some users actually rated a lot of movies; one user actually rated almost every single movie on there.

3:11

or person, that could get to 10% first. So this was the famous Netflix Prize. It was open to all countries and people, except just a couple. It was online, okay, you didn't have to show up in person. And it was international.

3:34

Clearly, both the $1 million and, perhaps more importantly, the huge data set made available were very attractive to the machine learning, data mining, and information retrieval research communities. So we're going to take a quick look at the data set that's involved here. The entire data set, which is a little over 100 million rating entries r_ui (the rating given by user u to movie i), was divided into four parts: the training set, the probe set, the quiz set, and the test set. You might say, why did they make it so complicated? For example, why not just make it two sets, the training set and the test set? It turns out this is a pretty smart arrangement. The training and probe sets were made public, and they are just big enough that it becomes very interesting, yet small enough that they could fit onto a normal laptop in 2006. And then the quiz and test sets were hidden, okay? Only Netflix had the ground truth.

4:51

So, as far as the competitors are concerned, say you and I form a team: we have access to the training data, which is almost 100 million ratings, plus 1.4 million entries of probe data.

5:04

Since we have access to this data, we can just hide part of it as ground truth, train our algorithm, and then look at the recommendation accuracy on the probe set. It turns out that the probe set's size and statistical characteristics, in terms of the distribution of movies and users, are very similar to those of the quiz set and the test set, which we don't know. So we can use this to test our methods as frequently as we want, many times a day.
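That local train-then-probe loop can be sketched as follows, with a tiny hypothetical data set and the simplest possible predictor, the global mean of the training ratings; only the split idea comes from the lecture.

```python
import math

# Hypothetical (user, movie, rating) triples standing in for the ~100M-entry
# training set and the ~1.4M-entry probe set.
train = [(1, 1, 5), (1, 2, 3), (2, 1, 4), (2, 3, 2), (3, 2, 4)]
probe = [(3, 1, 4), (1, 3, 3)]

# Simplest possible "model": predict the global mean of the training ratings.
global_mean = sum(r for _, _, r in train) / len(train)

# Score on the probe set, which plays the role of our private ground truth.
errs = [(global_mean - r) ** 2 for _, _, r in probe]
probe_rmse = math.sqrt(sum(errs) / len(errs))
print(probe_rmse)
```

We can repeat this loop as often as we like without ever touching Netflix's hidden quiz and test sets.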

5:49

Why a limitation on submissions? Because if we could submit too many times, say every 50 seconds, we might gather enough hints about the actual underlying data to reverse engineer the ground truth, and then the hidden sets wouldn't be useful for Netflix.

6:54

And you may wonder about this 10% improvement: why not, say, 11%? Why not 9%? I don't know how Netflix decided; 10 is a good round number. But had it been 11%, it would have been a lot more challenging: it so happens that, for this metric and this set of data, both the training and the test set, getting to 11% would have been extremely difficult, and getting to 9% would have been a little too easy. So 10% was a good pick, in hindsight.

7:23

What happened in the competition is a very interesting story, and we'll just briefly present some highlights. First of all, very soon, within a week, somebody was already able to beat Cinematch. But you need to beat it by 10%, starting from 0.9514, with RMSE measured to four digits of accuracy on the quiz set. Remember, the test set is reserved for the final phase. So beating Cinematch by a tiny bit, on the quiz set in terms of RMSE, was achievable within one week.

8:07

Now we have to push this down by 10%, all the way to 0.8563. Now, you may wonder: these are very small numbers, less than 1, and 10% of that is very small. Well, don't forget, this whole thing is on a scale of 1 up to 5 stars, so a deviation of 1 is actually pretty big. For example, a movie recommended at 4.5 stars versus 3.5 stars clearly shows a pretty big difference.
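For concreteness, RMSE on this 1-to-5 star scale can be computed as below; the ratings are made-up toy numbers, not Netflix data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Toy example: predictions versus true 1-5 star ratings.
print(rmse([4.5, 3.0, 2.0], [5, 3, 1]))  # about 0.65 stars off on average
```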

All right. And then, in almost one year's time, by September 2007, there was a team called BellKor that made an 8.26% improvement over Cinematch. Very good; it seemed that 10% would be eminently achievable, but that turned out not to be the case. First place changed hands a few times until, in the last hour before the first year of the competition ended, this same team retained the leading position with an 8.43% improvement. And that gave them a $50,000 progress prize for leading the chart, even though they hadn't succeeded in getting to 10%. Then, from 2007 to 2008, teams started to merge. For a mom-and-pop shop, it became very difficult to be anywhere on the chart, because the smart teams merged together so that they could share their secrets. In particular, BellKor and another team, BigChaos, merged, and this combined team won the 2008 progress prize in October of that year. At that point the RMSE got down to 0.8616, and then they merged again with a team called Pragmatic Theory.

10:19

So now these three teams became one team, called BellKor's Pragmatic Chaos, grabbing one word from each of the three original component teams. And in June 2009, almost three years into the competition, this merged team became the first one to achieve more than a 10% improvement: in fact, it got a 10.06% improvement in RMSE over the benchmark. At that point the competition entered the final, last-call phase, and all the teams got 30 days to make their final submissions. At the end of that time, two teams had actually beaten Cinematch by more than 10%.

11:07

One team was BellKor's Pragmatic Chaos, which got an RMSE of 0.8554. The other, called The Ensemble, got 0.8553, which is actually slightly better; but remember, that's only on the quiz set, not the test set.

11:36

And here is the grand finale: both of them received 0.8567 as the RMSE on the test set. The same performance, up to the four digits that count; so, who's going to win? It had to be determined by who submitted the final solution earlier. It turns out BellKor's Pragmatic Chaos submitted their algorithm 20 minutes


12:12

before The Ensemble. So, because they both achieved the same RMSE performance on the test set, it came down to submission time, and BellKor's Pragmatic Chaos won the competition with a differential of 20 minutes in this wonderful, almost three-year scientific quest. That's the grand finale. So you must be wondering: what did BellKor's Pragmatic Chaos use to achieve this milestone and receive the prize?

12:49

Well, you can actually go online and see three documents, three PDFs, where you can read all the algorithmic details. As part of the competition deal, you had to release your solution as public information and allow Netflix to use it or a variant of it. So this is all public information at this point; but if you go through it, you'll see it is just a bag of many tricks, with thousands of parameters fine-tuned on the training data.

13:30

That's what we're going to target in the remaining part of this lecture. If you want to get to just a little above 10%, it would take many more tricks, so I will focus on how to get you to 8 or 9%, and on the key ideas behind it. So here is a pictorial illustration of the problem: you've got many users and many movies, and you can put them into a matrix or table. You can also think of this as a bipartite graph, with the user nodes on the left and the movie nodes on the right; it's bipartite because only users and movies are paired up. And then you can give weights to these links, 1 to 5. So there are many possible angles.
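Both views, the table and the bipartite graph, can be held in one small data structure; here is a minimal sketch with hypothetical user and movie names.

```python
# Ratings keyed by (user, movie) pairs. Edges only ever connect a user to a
# movie, which is exactly what makes the graph view bipartite.
ratings = {("u1", "m1"): 5, ("u1", "m2"): 3, ("u2", "m2"): 4}

# Table view: look up a cell; None stands for a blank (unrated) entry.
print(ratings.get(("u2", "m1")))  # None: u2 has not rated m1

# Graph view: the neighbors of movie m2 are the users who rated it.
print(sorted(u for (u, m) in ratings if m == "m2"))  # ['u1', 'u2']
```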

Â First we're going to view this as a table,

Â now in this table actually is a little misleading because a is too small.

Â There are only 48 possible entries, whereas actually we

Â can have around 10 billion possible entries in the actual Netflix price.

Â It's also very dense, you can see that about half of

Â the cells are actually filled with actual rating data,

Â it's a lot sparser in the actual Netflix prize.

Â So the size and the sparsity are actually not accurate, but

Â suffice as to conceptually illustrate what the problem is.

Â You're given these data points which are 1 to 5, integers,

Â and whatever is a blank means that this user has not graded this movie.

Â Now, she may have watched the movie or not, but there's no rating recorded.

Â And you see four question marks, these are basically

Â entries where Netflix knows the grand truth, but

Â you and I as teams entering to the prize competition, we do not.

15:43

Or maybe we do, but then we have to reserve it as the probe set. In any case, we train on the data shown here, and then when we test, or when Netflix tests, they can use this hidden ground truth.

16:10

Maybe I'll say: just look at this movie, okay? If this movie has got a lot of high ratings, just give it a high rating. Like 5, 5, or maybe this one is 5, 2. This one is 4, 4, 5: clearly a good movie, so make it 4.5. Okay, this one is 2, 3, 4: that's a tough call.

16:29

Maybe I'll look at this row here. Okay, I'm trying to predict how user 5 would like movie 3. User 5 seems to give 3, 3 all the time, so maybe I should also give a 3 here. What about this entry?

16:46

Movie 7 gets 1, 3, and 4, which is kind of diverse, and this user has given 2 and 4. So it's really hard to say what this might be; it could be 1, 2, 3, 4, or 5. I don't know, maybe just give it an average, whatever the average of all the entries here is. Say the average is 3.5; all right, 3.5. It sounds a little like filling out the table in Sudoku.
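The per-movie averaging just described can be sketched as follows; the table below is a made-up miniature, with None marking blank cells.

```python
# Rows are users, columns are movies; None marks an unrated cell.
table = [
    [5, None, 2],
    [4, 3, None],
    [None, 3, 4],
]

def movie_average(table, movie):
    """Predict a blank entry in a column from that column's known ratings."""
    known = [row[movie] for row in table if row[movie] is not None]
    return sum(known) / len(known)

print(movie_average(table, 0))  # (5 + 4) / 2 = 4.5
```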

17:59

In other words, just look at an individual column or row; a column corresponds to a movie, so let's say an individual column. Do not worry about what the other columns show.

18:11

And this is similar, in a crude analogy, to Google computing a relevance score. If you want to compute a score by looking at the connectivity pattern among the web pages, or, in this case, at what the other movies show and tell about this movie, then you need to go from content-based filtering to collaborative filtering.

19:03

Okay? So if some user likes this movie, maybe she will like that movie too. We are basically defining a neighborhood relationship between movies. Now, we could also construct a neighborhood relationship for the users: if certain users, judging from the different existing entries, show that their tastes are very similar, then maybe if user 3 likes a movie, user 4 will like it too, even though she hasn't watched it.
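One common way to quantify such a neighborhood relationship (the lecture does not fix a formula here, so this is an illustrative choice) is cosine similarity over the users two movies have in common:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two movies, each a dict of user -> rating,
    computed over the users who rated both."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

# Hypothetical ratings: users 1 and 2 rated both movies similarly.
movie_a = {1: 5, 2: 4, 3: 2}
movie_b = {1: 4, 2: 5, 4: 3}
print(cosine_similarity(movie_a, movie_b))  # close to 1: near neighbors
```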

19:42

There's another idea, which is a little more involved, and we'll postpone it until the advanced-material part of the video. It says that maybe there is some kind of feature set for these movies: for example, is this a romantic movie, an action movie, a thriller, a sci-fi movie? Is it long, is it short, and so on. And this feature set is a much smaller set than the set of possible movies: there might be 20,000 movies but only, say, about 20 different genres.
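A minimal sketch of this latent-factor idea, with two hypothetical "genre" dimensions and hand-made vectors (the real methods learn these vectors from the ratings, as the advanced material discusses):

```python
# Each user and movie gets a short feature vector; a rating is predicted as
# their dot product. The numbers below are invented for illustration.
user_factors = {"alice": [0.9, 0.1]}          # leans heavily toward genre 1
movie_factors = {"rom_com": [5.0, 1.0],       # strongly genre 1
                 "action":  [1.0, 5.0]}       # strongly genre 2

def predict(user, movie):
    u, m = user_factors[user], movie_factors[movie]
    return sum(ui * mi for ui, mi in zip(u, m))

print(predict("alice", "rom_com"))  # high predicted rating
print(predict("alice", "action"))   # low predicted rating
```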

20:48

As we just mentioned, the large and sparse data is the big challenge. One solution is to just look at each row and each column in isolation; that's content-based filtering. The other is to look at the entire table; that's called collaborative filtering, collaborative among the users and the movies, I guess. And either way this is a centralized computation, so we don't need distributed computation.

21:14

And within collaborative filtering there are two main flavors. One is the neighborhood method; the other is the latent factor method. It turns out these two are also interestingly connected, but we will skip that; it's in the textbook and the advanced material.

21:31

On this way to the neighborhood method, we'll take a couple of detours. One is to go into a particular optimization problem called least squares, one of the most frequently used optimization problems in many diverse fields: from the control of airplanes to your remote, from how light bulbs are controlled on the factory floor to a lot of supply chains. It is used in finance

22:04

and economic modeling, and it is used in many communication systems. It's a hugely useful optimization problem, and it is a special case of something called convex optimization. So we will take about a five-minute detour into defining a convex optimization problem, because this will be a recurring theme; we'll see it a few times later in the course.
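As a tiny preview of least squares (a sketch, not the lecture's formulation): fit y ≈ a·x + b to data points by minimizing the sum of squared errors. For a single variable, the optimum has a closed form.

```python
# Data lying exactly on y = 2x + 1, so the fit should recover a = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form least-squares solution for slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(a, b)  # 2.0 1.0
```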

22:49

And the timestamp t_ui is also important. First, movies come in and out of fashion over that period of seven years. And second, different users may also have different moods on different days. There's a batch-processing effect: if this user decides to rate the last 10 movies she watched on Netflix on the same day, those ten ratings tend to have a different statistical distribution than if they had been entered right after watching each movie, on different days.
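The batch effect can be made visible by grouping ratings by (user, day); the log below is entirely hypothetical.

```python
from collections import defaultdict

# Hypothetical (user, movie, rating, date) log entries.
log = [
    ("u1", "m1", 4, "2005-03-01"),
    ("u1", "m2", 4, "2005-03-01"),
    ("u1", "m3", 4, "2005-03-01"),  # three ratings entered in one sitting
    ("u2", "m1", 2, "2005-03-02"),
]

# Group ratings by (user, date) to expose same-day batches.
batches = defaultdict(list)
for user, movie, rating, date in log:
    batches[(user, date)].append(rating)

print(len(batches[("u1", "2005-03-01")]))  # a batch of 3 same-day ratings
```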
