0:00

In this video, I'm going to talk about applying restricted Boltzmann machines to collaborative filtering. Collaborative filtering means trying to figure out how much a user would like one product, based on how much that user liked other products and how much many other users liked those products.

The particular case we'll look at is the Netflix competition, in which a machine learning algorithm has to predict how much a particular user will like a particular movie. The training data for this competition consists of 100 million ratings of 18,000 movies by half a million users. So it's quite a big data set. It's not the kind of thing anybody at the time imagined a Boltzmann machine could deal with.

As we'll see, there's an important trick to getting a restricted Boltzmann machine to cope with the fact that nearly all the ratings of nearly all the movies are missing. But when we use this trick, we're able to train models that are very useful in practice, and that were in fact used in the winning entry.

So now I'm going to explain how restricted Boltzmann machines were used for collaborative filtering in the Netflix competition. In that competition, you're given most of the ratings that half a million users gave to 18,000 movies, and each movie gets rated on a scale from one to five. Each user, of course, only rates a small fraction of the movies. But even so, there are about a hundred million ratings, so it's quite a big data set.

You have to predict the ratings that the users gave to held-out movies, and if you can do that well, you get a big prize. You actually win a million dollars if you're the best at doing that.

So, you can draw the ratings in a big matrix like this, with movies across the top and users down the side. For example, user two gave a rating of five to movie one, and a rating of one to movie three. User four gave a rating of four to movie one, and the question is: what rating did he give to movie three? You might decide he's quite like user two, because he rated movie one the same way. So maybe, like user two, he hated movie three. On the other hand, user four liked movie six, so maybe he likes all the movies. By the time you've done that much reasoning, you realize you'd better use some statistics.

Â 2:42

So here's some of the data from the table on the previous slide, and we just have to predict the third term of a triple. If we built a language model, what we would do is convert each user into a vector of features for that user, a vector that we learned, and convert each movie into a vector of features for that movie, another vector that we learned; and from those two feature vectors, we'd try to predict the rating.

Now, the obvious way to do this is to put in a big hidden layer, make the feature vectors feed into the hidden layer, and then have the hidden layer predict the rating. We tried that, and we couldn't get it to work any better than a very simple method, which is simply to take the scalar product of the feature vector for the user and the feature vector for the movie. You just multiply them together pointwise, add it up, and output that as your rating. So it's not even a softmax: you actually output whatever real number you get from the scalar product.

Now, that's exactly equivalent to doing something else, which is normally called a matrix factorization model. If we arrange the user features along the rows and the movie features along the columns, we can see that if we multiply that matrix of users times features by the matrix of features times movies, we get predictions for the ratings. And it's exactly equivalent to the language model beside it. The matrix factorization model is the most commonly used model for this kind of collaborative filtering, and it works pretty well.
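The scalar-product prediction and its matrix form can be sketched like this. The feature vectors here are random placeholders just to show the equivalence; a real model would learn them from the ratings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_features = 4, 6, 3

# Hypothetical learned feature vectors (random here, only to show shapes).
U = rng.normal(size=(n_users, n_features))    # one row of features per user
M = rng.normal(size=(n_features, n_movies))   # one column of features per movie

# The scalar-product prediction for one (user, movie) pair...
single = U[1] @ M[:, 2]

# ...is exactly one entry of the full matrix-factorization product U @ M.
all_predictions = U @ M
assert np.isclose(single, all_predictions[1, 2])
```

So learning per-user and per-movie feature vectors and predicting with their scalar product is the same thing as factorizing the rating matrix into two low-rank factors.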

Now let's consider an alternative model, using our restricted Boltzmann machine. It's not obvious how you would apply a restricted Boltzmann machine to this problem, and so we had to do some thinking. In the end, we decided that we would treat each user as a training case. So a user is really a vector of movie ratings, and for each movie we have a visible unit that has five alternative values. So the visible units, instead of being binary, are five-way softmaxes.

The network of our restricted Boltzmann machine looks like this. Each of its visible units is a five-way softmax, with one visible unit per movie. You might start worrying about there being 18,000 visible units here. And then we had about 100 binary hidden units. Each hidden unit is connected to all five values of the softmax, and it also has a bias. So you can see the number of parameters we'll have is large. The CD learning rule for a softmax, incidentally, is exactly the same as for a binary unit. And, like I said, we've got about 100 hidden units.
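As a rough sanity check on that parameter count, using the sizes mentioned above (18,000 movies, five rating values per softmax, about 100 hidden units), a full-matrix version of the model would have roughly nine million parameters:

```python
# Assumed sizes from the lecture: 18,000 movies, 5-way softmax visible
# units, and about 100 binary hidden units.
n_movies, n_ratings, n_hidden = 18000, 5, 100

n_weights = n_movies * n_ratings * n_hidden   # one weight per (movie, rating, hidden)
n_visible_biases = n_movies * n_ratings       # one bias per softmax value
n_hidden_biases = n_hidden                    # one bias per hidden unit

total = n_weights + n_visible_biases + n_hidden_biases
print(total)  # -> 9090100
```

That is the count before the missing-data trick described next; each individual user's RBM only touches the small slice of these parameters belonging to the movies that user rated.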

And what we're going to do is learn a model and then try to fill in one of the missing values using the model. Now, the problem with this approach is that we don't want to have an RBM with 18,000 visible units, only a few of which have known values; that's a huge number of missing values to begin with. And there's a neat way around that.

Â 6:14

For each user, we use an RBM that only has as many visible units as the movies that user rated. So it's possible that every user will correspond to a different RBM, with a different subset of visible units. But all of these RBMs are going to share the same weights. That is, we know which movie is which, and so if two users saw and rated the same movie, the weights from that movie to the hidden units will be the same for both users. So we're doing an awful lot of weight sharing here, and that's lucky, because for each user we only get one training case. We make a specific RBM for each user, with the right architecture, that is, the right number of visible units for the movies that user rated. And now there's only one training case, which is that rating vector. But all of these half a million training cases share weights to the hidden units, so the learning works fine.
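A minimal sketch of this weight-sharing trick, assuming the parameters are stored as one shared (movie, rating value, hidden unit) tensor that each per-user RBM merely indexes into:

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, n_ratings, n_hidden = 18000, 5, 100

# One weight tensor shared by every user's RBM:
# axis 0 = movie, axis 1 = rating value (1..5), axis 2 = hidden unit.
W = rng.normal(scale=0.01, size=(n_movies, n_ratings, n_hidden))

def user_rbm_weights(rated_movie_ids):
    # The per-user RBM consists of just the weight slices for the
    # movies that this particular user rated.
    return W[rated_movie_ids]                 # shape: (n_rated, 5, n_hidden)

# Two users who both rated movie 7 see exactly the same weights for it,
# so every gradient from either user updates the same shared slice.
w_a = user_rbm_weights(np.array([7, 42]))
w_b = user_rbm_weights(np.array([7, 99]))
assert np.array_equal(w_a[0], w_b[0])
```

The `user_rbm_weights` helper is hypothetical, but it captures the idea: half a million tiny RBMs, one parameter store.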

Â 7:27

The models are trained with CD1 and then, after a while, with CD3; that is, you go up and down three times before you collect the statistics for the negative phase. Then with CD5, and then with CD9. And how well does it work?
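That CD schedule can be sketched as follows. This is a deliberately simplified version, with binary visible units rather than five-way softmaxes, probabilities used in place of samples, and no bias terms; only the go-up-and-down-k-times structure is the point here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, k):
    """One CD-k gradient estimate for a tiny binary RBM (mean-field
    sketch: probabilities instead of samples, biases omitted)."""
    h0 = sigmoid(v0 @ W)          # positive phase: hiddens given the data
    v, h = v0, h0
    for _ in range(k):            # go up and down k times...
        v = sigmoid(h @ W.T)      # down: reconstruct visibles
        h = sigmoid(v @ W)        # up: re-infer hiddens
    # ...then collect negative-phase statistics from where you end up.
    return np.outer(v0, h0) - np.outer(v, h)

W = rng.normal(scale=0.01, size=(6, 4))
v0 = rng.integers(0, 2, size=6).astype(float)
for k in (1, 3, 5, 9):            # the CD1 -> CD3 -> CD5 -> CD9 schedule
    W += 0.1 * cd_k_update(v0, W, k)
```

Starting with CD1 and raising k later is a common practical recipe: early on, even one up-down step gives a useful gradient direction, while larger k gives better negative-phase statistics once the model is roughly right.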

Well, the RBMs work about as well as the matrix factorization methods, but they give very different errors. What that means is that if you average the predictions of the RBMs with the predictions of the matrix factorization methods, you get a big win. And the winning group actually used multiple different RBM models in their average, and multiple different matrix factorization models, and I think probably other models as well.
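A small simulation of why averaging two comparable models helps, under the assumption that their errors are equally large but roughly independent (all numbers here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true ratings, and two models whose errors have the
# same magnitude but are independent of one another.
truth = rng.uniform(1, 5, size=10000)
rbm_pred = truth + rng.normal(scale=0.9, size=truth.size)
mf_pred = truth + rng.normal(scale=0.9, size=truth.size)

def rmse(pred):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Averaging the two prediction sets beats either model on its own:
# independent errors partially cancel, cutting the RMSE by about 1/sqrt(2).
blend = (rbm_pred + mf_pred) / 2.0
assert rmse(blend) < min(rmse(rbm_pred), rmse(mf_pred))
```

This is exactly why the very different errors matter: averaging two copies of the same model buys you nothing, but averaging models whose errors are decorrelated does.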

As far as I know, their main models were matrix factorization models and RBMs.