So let's take a look at how we can develop a recommender system if we had features of each item, or features of each movie. Here's the same dataset that we had previously, with the four users having rated some but not all of the five movies. What if we additionally have features of the movies? Here I've added two features, x_1 and x_2, that tell us how much each of these is a romance movie, and how much each of these is an action movie. So for example, Love at Last is a very romantic movie, so this feature takes on the value 0.9, but it's not at all an action movie, so that feature takes on 0. Nonstop Car Chases, it turns out, has just a little bit of romance in it, so its romance feature is 0.1, but it has a ton of action, so that feature takes on the value 1.0.

You'll recall that I had used the notation n_u to denote the number of users, which is 4, and m to denote the number of movies, which is 5. I'm also going to introduce n to denote the number of features we have here, and so n = 2, because we have two features x_1 and x_2 for each movie. With these features we have, for example, that the feature vector for movie one, that is the movie Love at Last, would be x^(1) = [0.9, 0], and the feature vector for the third movie, Cute Puppies of Love, would be x^(3) = [0.99, 0].

Let's start by taking a look at how we might make predictions for Alice's movie ratings. For user one, that is Alice, let's say we predict the rating for movie i as w · x^(i) + b. So this is just a lot like linear regression. For example, if we end up choosing the parameters w^(1) = [5, 0] and b^(1) = 0, then the prediction for movie three, whose features are [0.99, 0], would be w^(1) · x^(3) + b^(1) = 0.99 × 5 + 0 × 0 + 0, which turns out to be equal to 4.95. And this rating seems pretty plausible. It looks like Alice has given high ratings to Love at Last and Romance Forever, two highly romantic movies, but given low ratings to the action movies, Nonstop Car Chases and Swords vs. Karate. So if we look at Cute Puppies of Love, predicting that she might rate it 4.95 seems quite plausible, and these parameters w and b for Alice seem like a reasonable model for predicting her movie ratings.

Let me add a little to the notation. Because we have not just one user but multiple users, really n_u = 4 users, I'm going to add a superscript (1) to denote that these are the parameters w^(1) and b^(1) for user 1, so that we actually have different parameters for each of the 4 users in the dataset. And more generally, in this model we can, for user j, not just user 1 now, predict user j's rating for movie i as w^(j) · x^(i) + b^(j). Here the parameters w^(j) and b^(j) are the parameters used to predict user j's rating for movie i, which is a function of x^(i), the feature vector of movie i. This is a lot like linear regression, except that we're fitting a different linear regression model for each of the 4 users in the dataset.

So let's take a look at how we can formulate the cost function for this algorithm. As a reminder on notation: r(i, j) = 1 if user j has rated movie i, and 0 otherwise, and y(i, j) is the rating given by user j to movie i. On the previous slide we defined w^(j) and b^(j) as the parameters for user j, and x^(i) as the feature vector for movie i. So the model we have is: for user j and movie i, predict the rating to be w^(j) · x^(i) + b^(j).
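As a quick illustration, here's a minimal sketch of that prediction in NumPy, using the numbers from the example. The variable names here are just illustrative, not from the lecture:

```python
import numpy as np

# Feature vector for movie 3, Cute Puppies of Love: [romance, action]
x3 = np.array([0.99, 0.0])

# Parameters chosen for user 1 (Alice) in the example
w1 = np.array([5.0, 0.0])
b1 = 0.0

# Predicted rating for user 1 on movie 3: w^(1) . x^(3) + b^(1)
rating = np.dot(w1, x3) + b1
print(rating)  # 4.95
```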
I'm going to introduce just one new piece of notation: I'll use m^(j) to denote the number of movies rated by user j. So if the user has rated 4 movies, then m^(j) would be equal to 4, and if the user has rated 3 movies, then m^(j) would be equal to 3. What we'd like to do is learn the parameters w^(j) and b^(j) given the data that we have, that is, given the ratings a user has given to a set of movies. The approach we're going to use is very similar to linear regression.

Let's write out the cost function for learning the parameters w^(j) and b^(j) for a given user j, and let's just focus on that one user j for now. I'm going to use the mean squared error criterion. So the cost will be the prediction, w^(j) · x^(i) + b^(j), minus the actual rating y(i, j) that the user gave, squared. We're trying to choose parameters w and b to minimize the squared error between the predicted rating and the actual rating that was observed. But the user hasn't rated all the movies, so we're going to sum only over the values of i where r(i, j) = 1, meaning only over the movies i that user j has actually rated. And then finally we take the usual normalization 1 over 2m^(j):

J(w^(j), b^(j)) = 1/(2 m^(j)) × Σ_{i : r(i,j)=1} ( w^(j) · x^(i) + b^(j) − y(i, j) )²

This is very much like the cost function we have for linear regression with m, or really m^(j), training examples, where you sum over the m^(j) movies for which you have a rating, take the squared error, and normalize by 1 over 2m^(j). And if we minimize this as a function of w^(j) and b^(j), we should come up with a pretty good choice of parameters w^(j) and b^(j) for making predictions for user j's ratings.

Let me add just one more term to this cost function, which is the regularization term to prevent overfitting. So here's our usual regularization parameter, lambda, divided by 2m^(j), and then times the sum of the squared values of the parameters w:

+ λ/(2 m^(j)) × Σ_{k=1}^{n} ( w_k^(j) )²

Here n is the number of features in x^(i), which is the same as the number of elements in w^(j). If you were to minimize this cost function J as a function of w and b, you should get a pretty good set of parameters for predicting user j's ratings for other movies.

Now, before moving on, it turns out that for recommender systems it's convenient to actually eliminate the division by m^(j). Since m^(j) is just a constant in this expression, even if you take it out, you end up with the same minimizing values of w and b. Let me take this cost function down to the bottom here and copy it to the next slide. So we have that, to learn the parameters w^(j), b^(j) for user j, we minimize this cost function as a function of w^(j) and b^(j). But instead of focusing on a single user, let's look at how we learn the parameters for all of the users. To learn the parameters w^(1), b^(1), w^(2), b^(2), ..., w^(n_u), b^(n_u), we take the cost function on top and sum it over all n_u users:

J(w^(1), b^(1), ..., w^(n_u), b^(n_u)) = Σ_{j=1}^{n_u} [ 1/2 × Σ_{i : r(i,j)=1} ( w^(j) · x^(i) + b^(j) − y(i, j) )² + λ/2 × Σ_{k=1}^{n} ( w_k^(j) )² ]

This becomes the cost for learning all the parameters for all of the users. And if we use gradient descent or any other optimization algorithm to minimize this as a function of w^(1), b^(1) all the way through w^(n_u), b^(n_u), then we have a pretty good set of parameters for predicting movie ratings for all the users.
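To make this concrete, here's a minimal vectorized sketch of that overall cost in NumPy, with the division by m^(j) already dropped. The matrix layout and names below are my own assumptions, not something specified in the lecture:

```python
import numpy as np

def cost(W, b, X, Y, R, lam):
    """Regularized squared-error cost summed over all users.

    Assumed shapes (illustrative):
      W: (n_u, n)  parameter vectors w^(j), one row per user
      b: (n_u,)    bias terms b^(j)
      X: (m, n)    feature vectors x^(i), one row per movie
      Y: (m, n_u)  ratings y(i, j)
      R: (m, n_u)  R[i, j] = 1 if user j rated movie i, else 0
      lam: the regularization parameter lambda
    """
    predictions = X @ W.T + b                    # (m, n_u) predicted ratings
    squared_err = ((predictions - Y) ** 2) * R   # zero out unrated movies
    return squared_err.sum() / 2 + (lam / 2) * (W ** 2).sum()
```

Multiplying the squared errors elementwise by R implements the "sum only over i where r(i, j) = 1" condition, since unrated entries contribute zero to the cost.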
And you may notice that this algorithm is a lot like linear regression, where w^(j) · x^(i) + b^(j) plays a role similar to the output f(x) of linear regression, only now we're training a different linear regression model for each of the n_u users. So that's how you can learn parameters and predict movie ratings, if you have access to features x_1 and x_2 that tell you how much each movie is a romance movie, and how much each movie is an action movie. But where do these features come from? And what if you don't have access to such features that describe the movies in enough detail with which to make these predictions? In the next video, we'll look at a modification of this algorithm that'll let you make predictions, and make recommendations, even if you don't have access to features that describe the items or the movies in sufficient detail to run the algorithm we just saw. Let's go on and take a look at that in the next video.