0:00

In this module, we're going to be talking about a general case of learning recommenders. Many algorithms are built directly from some algorithmic structure and an insight into what might produce a good recommendation, such as "find items similar to ones the user likes" or "decompose the matrix for latent features," and then we use metrics to evaluate how well the algorithm did. But there are many other algorithms, including many modern algorithms, that use the metric directly in the process of building up the recommendation algorithm: they learn the best recommender, within a family, for some particular metric. We've already seen this when we looked at the FunkSVD matrix factorization method, where we directly learned the matrices P and Q to minimize the squared error. So what we want to do here is see how we can generalize that approach to many different kinds of recommender algorithms.

1:20

The structure of learning a recommender algorithm is expressed in this formula. We have theta, which is the set of parameters of our recommendation model; for FunkSVD, theta is going to be our P and Q matrices,

1:43

that are going to be used to produce the recommendations. And then we have some function, g, which is either the error or the utility of computing predictions or recommendations using a particular model and parameters.

2:03

We compute the theta that minimizes g; argmin means find the theta for which g of theta is smallest. If g is a utility function, that is, it's high when the recommendations are good rather than high when there's lots of error, then we want to maximize g. The approaches are equivalent; it's just a matter of whether we're looking for the smallest error or the largest utility.
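Written out, the objective takes this standard form (using the symbols from this module: theta for the parameters, g for the error or utility function):

```latex
\hat{\theta} = \operatorname*{arg\,min}_{\theta}\, g(\theta)
\qquad \text{or, when } g \text{ is a utility,} \qquad
\hat{\theta} = \operatorname*{arg\,max}_{\theta}\, g(\theta)
```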

2:35

There's a model, such as matrix factorization, linear regression, or nearest neighbors. The model generally has some parameters, theta: the coefficients in a linear model, the matrices in a matrix decomposition.

3:29

We have some model, let's say matrix factorization, and it has some parameters, P and Q. Now the model will take the parameters and produce some output: predictions.

4:00

Then we compute how good or bad the output is with respect to the training data. The optimization algorithm takes that loss or utility function and computes a new set of parameters to try. And we iteratively generate output using the model and parameters, measure how bad it is, and improve the parameters, until we are satisfied with the parameter values that we have chosen. As we saw in FunkSVD, this is often through some kind of convergence criterion: a threshold on how much the parameters are changing, or just "we're going to train 40 times, 100 times, whatever." But these pieces work together to let us train the algorithm.
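That loop can be sketched in a few lines of Python. This is illustrative only, with hypothetical names: the "model" predicts every rating with a single parameter theta, and the optimizer is plain gradient descent on the squared error, stopping on a convergence threshold.

```python
# Sketch of the generic train loop: produce output, measure how bad it
# is, improve the parameters, repeat until convergence.

def train(ratings, lr=0.05, max_iter=500, tol=1e-8):
    theta = 0.0                              # initial parameter guess
    for _ in range(max_iter):
        # gradient of sum((theta - r)^2) is 2 * sum(theta - r)
        grad = 2 * sum(theta - r for r in ratings)
        new_theta = theta - lr * grad / len(ratings)
        if abs(new_theta - theta) < tol:     # convergence criterion
            break
        theta = new_theta
    return theta

ratings = [4.0, 3.5, 5.0, 2.0, 4.5]
theta = train(ratings)                       # converges toward the mean, 3.8
```

A real recommender swaps in a richer model (more parameters) and a richer loss, but the predict/measure/improve structure stays the same.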

For a simple example, we can look at estimating ratings with a single value. Our scoring function is going to predict the rating with a single global value, so our parameters are just that value, b. For our error function, we want to minimize the squared error, so g is going to be the sum, over the users and items, of the squared error: the rating minus this baseline value, squared. With this problem, our optimization algorithm is just to look up that statistics tells us b equals mu, the mean rating, and now we have the best value. There's not a lot of sophistication to the training, but the basic principle is there: we want to find the value that minimizes the error.
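A minimal sketch of this trivial case (function names are hypothetical). The closed-form answer is just the mean, and we can check that nudging b in either direction only increases the error:

```python
def fit_global_baseline(ratings):
    # The single value minimizing squared error is the mean rating:
    # setting d/db sum((r - b)^2) = 0 gives b = mean(ratings)
    return sum(ratings) / len(ratings)

def sse(b, ratings):
    # sum of squared errors for predicting every rating with b
    return sum((r - b) ** 2 for r in ratings)

ratings = [4.0, 3.5, 5.0, 2.0, 4.5]
b = fit_global_baseline(ratings)             # b = 3.8
# any nearby value gives a larger error than the mean
assert sse(b, ratings) <= sse(b + 0.1, ratings)
assert sse(b, ratings) <= sse(b - 0.1, ratings)
```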

Now this approach can be applied to many different kinds of models. We can do a bias model, which takes a global bias, the mean rating, and then per-user and per-item biases. Effectively, what this gives us is a personalized mean: how much better or worse you are going to like this item than average, adjusted by how much better or worse you like items on average. In this case, the parameters are going to be b, b sub u for every user, and b sub i for every item.

So if we have,

6:42

That's a lot of parameters. If you want to apply the gradient descent approach that we saw in FunkSVD, that's 1,501 derivatives. Fortunately, they all follow common patterns, and a lot of them are going to be zero. But the parameters explode pretty quickly here.
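The bias model above can be sketched compactly. This version fits the biases by successive means rather than gradient descent or a joint least-squares solve (a common simple heuristic, not the only way); the function names and the tiny dataset are hypothetical.

```python
from collections import defaultdict

def fit_bias_model(ratings):
    """Fit b + b_u + b_i by successive means.
    `ratings` is a list of (user, item, rating) tuples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    for u, _, r in ratings:
        by_user[u].append(r - mu)            # user residuals vs. global mean
    b_u = {u: sum(v) / len(v) for u, v in by_user.items()}
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_item[i].append(r - mu - b_u[u])   # what the user bias doesn't explain
    b_i = {i: sum(v) / len(v) for i, v in by_item.items()}
    return mu, b_u, b_i

def predict(mu, b_u, b_i, user, item):
    # personalized mean: global average adjusted by user and item biases
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

ratings = [("u1", "i1", 4.0), ("u1", "i2", 5.0),
           ("u2", "i1", 2.0), ("u2", "i2", 3.0)]
mu, b_u, b_i = fit_bias_model(ratings)
```

Unknown users or items fall back to a bias of zero, i.e. the global (or one-sided) mean.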

You can also use a linear regression that

7:17

Or we can do matrix factorization, which we've seen worked out in detail, where we're learning the matrix values; we may also include the baseline values themselves in our learning process. So we've got these different models; we can then apply them to different metrics. We can look at prediction accuracy metrics like we've seen, RMSE, or simplify that to the sum of the squared error.
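The two are interchangeable as training objectives: for a fixed number of ratings, RMSE is a monotone function of the sum of squared errors, so the parameters that minimize one minimize the other. A sketch (function names are mine):

```python
import math

def sse(preds, truth):
    # sum of squared errors: the simpler form to differentiate in training
    return sum((p - t) ** 2 for p, t in zip(preds, truth))

def rmse(preds, truth):
    # RMSE = sqrt(SSE / n); monotone in SSE for fixed n, so both
    # objectives pick the same parameters
    return math.sqrt(sse(preds, truth) / len(truth))
```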

We could optimize for classifier accuracy: accurately classifying things as good or bad, consistent with the user's preferences. We often model this as what's called a logistic regression, which is a linear regression wrapped up in the logistic function, so that it's good for predicting boolean values.
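A sketch of that idea (the feature values and weights below are made up for illustration, not learned):

```python
import math

def logistic(x):
    # squashes any real-valued score into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def p_good(features, weights, intercept):
    """Logistic regression: a linear model wrapped in the logistic
    function, read as the probability the user likes the item."""
    score = intercept + sum(w * f for w, f in zip(weights, features))
    return logistic(score)

p = p_good([1.0, 0.5], [2.0, -1.0], 0.0)     # logistic(1.5), about 0.82
```

Training would then adjust the weights and intercept to make these probabilities match the user's observed good/bad judgments.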

We can also optimize directly for rank accuracy, where we want to optimize the results of ranking to put good things towards the top. This generally requires some extra intermediate steps to actually make it optimizable; we're going to talk more about that in one of the later videos. And with the model and the error metric, we then need an optimization method.

8:38

Any optimization method for finding the best values given a utility function may be applicable, and some are going to be easier or harder to apply to different kinds of error functions and different kinds of recommendation models. One common theme in a lot of these is that they use derivatives: a lot of them need the derivative of the loss or utility function in order to compute updates, because most of these are iterative methods, where they start with a guess for the parameters, see how bad the recommendations are with that guess, and iteratively improve the parameters until we get to a minimum value for the error.

9:41

Many different algorithms can be framed in this way. To do this, we need to determine a quality measure: how are we going to judge whether what the algorithm has produced is good? Then we need to devise a prediction or recommendation model and determine its parameters. We need to map all of this to an optimization method that's going to find the best parameter values. We need to train, test, cross-validate, and measure how good our final trained recommender is. And it's useful to have a smooth, differentiable error response that we can plug into our learning algorithms.

10:29

Many modern recommendation methods are built on these principles, and it's a very general approach that can result in a wide array of recommender designs. Training these kinds of recommenders can be an expensive process; the iterative process of computing recommendations, measuring error, and improving the parameters can take a while. And how interpretable the outcome is, how explainable the recommendations are, depends heavily on the model design. Some are going to produce very explainable recommendations; for others, it's going to be a little harder to explain to users where the recommendations came from. But the resulting recommenders can be extremely effective.
