In this lecture, we're going to begin our study of similarity-based recommender systems by introducing one of the most common variants, which is called collaborative filtering. We'll give a few instances of this type of recommender system. So what we're trying to do in this lecture is build recommender systems that are based on the notion of similarity between users or similarity between items. So how might you go about estimating the similarity between two users? Well, one way to do this is to look at users in terms of the items they've purchased. Conversely, how might you go about measuring the similarity between two items? Well again, you might do that in terms of the users who purchased those two items. So we can define similarity between items in terms of users, and similarity between users in terms of items. For example, on Amazon, if we were trying to recommend substitutable or complementary items, we might do so by saying people who viewed X also viewed Y, or people who bought X also bought Y. So those recommendations are somehow estimates of whether two items are similar to each other, whether the query at the top is similar to the items in the first row or the items in the second row in terms of click or purchase behavior. So the first row might be items that are similar in terms of users who have clicked, and the second row might be items that are similar in terms of users who bought. A few definitions to start us off. We're going to start by introducing the item set I_u, which is the set of items purchased by a particular user u, and conversely U_i, the set of users who purchased a particular item i. We can also describe the same concept in terms of a matrix. You might imagine a very large, very sparse matrix of purchases or clicks, where each column corresponds to a different item and each row corresponds to a different user, and we have a one in a particular position if that user has clicked on or purchased that item, and a zero if they have not. We can extract the binary representation of a user by taking a particular row of that matrix, which describes which items that user has purchased, and likewise we can extract the binary representation of an item by taking a column of that matrix, which says which users purchased that item. These two views can easily be mapped to each other: to recover the item set I_u, the set of items purchased by user u, we just take the set of items i such that R_u,i is equal to 1. Similarly, U_i, the set of users who purchased item i, is the set of users u such that R_u,i is equal to 1. So we can represent this problem equally well in terms of sets or matrices. Given these definitions of user sets and item sets, or matrices describing user-item interactions, we can now come up with a number of different similarity measures. The first we might try is something like the Euclidean distance between two items i and j, or equivalently between two users if we would like.
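To make those definitions concrete, here is a minimal Python sketch, using made-up purchase data and hypothetical variable names (itemsPerUser, usersPerItem), showing how the item sets I_u and the user sets U_i relate to the interaction matrix R:

```python
from collections import defaultdict

# Hypothetical (user, item) purchase pairs; in practice these would come from a log of clicks or purchases.
purchases = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i2")]

itemsPerUser = defaultdict(set)  # I_u: the set of items purchased by each user u
usersPerItem = defaultdict(set)  # U_i: the set of users who purchased each item i

for u, i in purchases:
    itemsPerUser[u].add(i)
    usersPerItem[i].add(u)

# The equivalent (sparse) matrix view: R[u][i] = 1 exactly when item i is in I_u, i.e. when user u is in U_i.
R = defaultdict(dict)
for u, i in purchases:
    R[u][i] = 1
```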
So we're saying the difference between two items might be defined in terms of the sets of users who purchased them. The difference between i and j might be measured by asking how many users purchased item i but not j, and how many users purchased item j but not i. That gives some notion of Euclidean distance between those two items, i and j. We can represent that in terms of sets or matrices: in terms of sets, it's the size of the set difference between the users who purchased i and the users who purchased j, plus the size of the set difference between the users who purchased j and the users who purchased i, that is, |U_i \ U_j| + |U_j \ U_i|. You can represent something like that using a Venn diagram. We can think of the set of users who purchased i as the set A here, and the set of users who purchased j as the set B here, and we're just looking at the parts of those two sets that do not overlap. The more different those products are, the larger we expect the non-overlapping component to be. That seems like a reasonable first idea; let's see whether it actually works in practice. So imagine we have four products here, 1, 2, 3, and 4, and for each of those products we have the set of users (user IDs, really) who purchased it. Let's look at the differences between them. What's the difference between product one and product two according to this Euclidean distance measure? In other words, how many users purchased product one but not two, and how many users purchased product two but not one? If we look carefully, we find that there's a total of three users who purchased product two but did not purchase product one: user number 6, user number 35, and user number 38. On the other hand, what is the distance between product three and product four? That's the second equation there, and it's going to be only two: there's one user who purchased product three and one user who purchased product four, they're different users, so the Euclidean distance between them is two. Based on these equations, we would conclude that items three and four are more similar to each other than items one and two. But that seems like the wrong conclusion, because items three and four have no users in common, whereas items one and two have many users in common. What's missing here is some normalization by the number of purchases those products have. This distance measure is going to favor, or consider more similar, products that have very few associated purchases. So just to summarize, this measure is going to favor small sets even if they have few elements in common, and we can try to fix that by normalizing by the number of purchases of a particular product. That brings us to the second measure, a very commonly used one called Jaccard similarity, which really is performing this type of normalization. The definition of Jaccard similarity in general is to take the size of a set intersection, A intersect B, and divide it by the size of the set union. In the case of two products i and j, we would be asking what is the size of the intersection of the users who purchased those products versus the size of the union of the users who purchased those products, i.e. |U_i ∩ U_j| / |U_i ∪ U_j|. Again, we can describe this with Venn diagrams. In the numerator we have the intersection, that is, the set of users who purchased both products. In the denominator we have the union, that is, the set of users who purchased either product. Then we're just taking the ratio between those two things. So this is going to take a value between zero and one.
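As a rough sketch of these two set-based measures, assuming a usersPerItem dictionary mapping each item to the set of users who purchased it (as in the earlier sketch), the "Euclidean" distance and the Jaccard similarity might look like this:

```python
def euclidean_distance(i, j, usersPerItem):
    # Users who purchased i but not j, plus users who purchased j but not i: |U_i \ U_j| + |U_j \ U_i|
    Ui, Uj = usersPerItem[i], usersPerItem[j]
    return len(Ui - Uj) + len(Uj - Ui)

def jaccard(i, j, usersPerItem):
    # Intersection over union of the two user sets: |U_i ∩ U_j| / |U_i ∪ U_j|
    Ui, Uj = usersPerItem[i], usersPerItem[j]
    union = len(Ui | Uj)
    return len(Ui & Uj) / union if union > 0 else 0
```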
The intersection is always going to be smaller than (or equal to) the union, in other words. It takes a maximum value of one when the two sets overlap completely, that is, when the two users purchased exactly the same set of items, or, in this case, when the two items were purchased by exactly the same set of users. It takes a value of zero if the two sets are completely disjoint. So just to summarize, we have two sets here describing the users who purchased each of the two items, and we're measuring similarity according to how much those sets overlap, normalized by the size of those sets. The third idea is one called cosine similarity. It's a very similar idea, but we're considering angles between vectors rather than set intersections. The idea is that we have two vectors, A and B, which could be two rows or columns of our matrix describing which users purchased i or which users purchased j, and we'll take the angle between those two vectors. It's actually a very similar idea to the Jaccard similarity. Given the rows or columns of that matrix, we can represent them in some very high-dimensional space: the vector describing a particular item in our catalog, which represents the set of users who purchased that item, is a bunch of zeros and a bunch of ones, and we can treat that as a coordinate in this very high-dimensional space, where the dimensionality is the number of users. We can do that for two movies and then compute the angle between them. If those movies were purchased by a similar set of users, those should be similar, or nearby, vectors; if they were purchased by very different sets of users, they should be very different, or closer to orthogonal, vectors. Then we take the cosine of that angle. The cosine of the angle would be one if the angle were very close to zero, which would mean those vectors point in exactly the same direction, or the two items were purchased by the same set of users. It would take a value of negative one if the vectors pointed in exactly the opposite direction. In this case, that can't actually happen, since there are no negative values in our matrix; all the values are zero or one, so the smallest value the cosine can take for this type of data is zero. That would mean we have a 90-degree angle between those two vectors, or there are no users who purchased the two items in common. That seems very, very similar to the Jaccard similarity; it's just a slightly different definition of the similarity between two items, and I'll explain later on why we might want to use one rather than the other. Okay, so why did we introduce this cosine similarity? The Jaccard similarity only works where we can talk about users and items in terms of sets; it's defined by set intersection and set union. Alternately, the cosine similarity is defined on vectors, so it can work for arbitrary vectors that may not correspond to sets. So imagine we had a setting where we had opinions in addition to purchases. Rather than having this matrix of clicks or this matrix of purchases, we might have a matrix of thumbs-up, thumbs-down values. In this case, we have a value of one if a user bought an item and then liked it, a value of zero if a user just didn't buy an item, and a value of negative one if a user disliked an item. Now, there's no such thing as Jaccard similarity applied to that kind of matrix, because we can no longer represent it as easily using sets, but the cosine similarity is still perfectly well-defined for things like thumbs-up, thumbs-down values or star ratings.
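Since cosine similarity is defined on arbitrary vectors rather than sets, a minimal sketch only needs two equal-length vectors; these could be rows or columns of the purchase matrix (0/1 entries) or of a thumbs-up, thumbs-down matrix (-1/0/+1 entries):

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors: dot product divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0
```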
Okay, so in our previous example, if we now had thumbs-up, thumbs-down values, we would again have a representation of a particular movie in terms of not just whether a user bought it, but whether they actually liked it or not. The vector would then point in the opposite direction if a user disliked a particular movie. Okay, now the cosine can take values between one and negative one. One again means that the two items are very similar, because the vectors point in the same direction, which means a similar set of users bought both movies and they rated them both the same way; they might have rated them both positively or both negatively. It would take a value of negative one if the movies were rated by the same set of users, but the users disagreed; that's when the vectors point in totally opposite directions. The movies were purchased by the same people, but those people had opposite opinions about them. And the similarity is zero here if the vectors are orthogonal, which means the movies are essentially unrelated to each other; they were purchased by different sets of users altogether. Finally, there's a fourth definition of similarity we might use, called the Pearson correlation. What would happen if we had numerical ratings rather than just thumbs up and thumbs down? Maybe the cosine similarity would no longer be the best idea. Think about a data set like this, where we have star ratings: an entry is zero if a user hasn't rated the movie, and takes a value of 1, 2, 3, 4, or 5 if they have rated it, depending on what their star rating of it is. In this case, unlike the thumbs-up, thumbs-down setting, five now means you bought it and liked it, zero means you just didn't buy it, and one means you bought it and disliked it. Okay, so to see how this is different from cosine similarity, or why cosine similarity might be a bad idea to apply to this data, think about what happens in the following scenario. We're again given a data set where we have three users, so a three-dimensional space, and two movies that are being rated. Suppose one of our movies is Harry Potter, and it was watched by two of those users and they both rated it five stars. That might correspond to a vector like (5, 5, 0) in this space, where we take a particular row or column of that matrix. On the other hand, we have another movie like Pitch Black, which was watched by the same users, but they really didn't like it, so they both gave it one star. What's important to note is that these two vectors actually point in the same direction. So if we were to take the cosine similarity between them, the angle theta between those two movies, well, the angle is zero degrees, right? Those vectors point in exactly the same direction, so the cosine similarity, which is the cosine of that angle, would be equal to one. That's the wrong notion: these movies should be as opposite as possible. The users who watched both movies liked one and hated the other. So somehow we might like to normalize that data, say, in this case, by subtracting the average. The average rating here would be three. If we were to subtract the average, we would end up with re-normalized data where Harry Potter took a value of (2, 2, 0) and Pitch Black took a value of (-2, -2, 0), and in this case the angle between them would be equal to 180 degrees. Those movies would point in exactly opposite directions. They'd be treated as very different movies in terms of recommending them to people, and that's what we might like.
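Continuing the sketch from before (this reuses the cosine() function defined above, and the hypothetical ratings from the example), we can check that the raw vectors look identical while the mean-subtracted ones look opposite:

```python
harry_potter = [5, 5, 0]   # two users rated it five stars, the third didn't rate it
pitch_black  = [1, 1, 0]   # the same two users rated it one star

print(cosine(harry_potter, pitch_black))  # 1.0: the raw star ratings point in the same direction

def center(v, mean=3):
    # Subtract the average rating from rated entries only; unrated entries stay at zero
    return [x - mean if x != 0 else 0 for x in v]

print(cosine(center(harry_potter), center(pitch_black)))  # -1.0: after centering they point in opposite directions
```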
So this next notion of similarity, called the Pearson correlation, is going to try and perform that re-normalization, that average subtraction, to make sure that items like these point in opposite directions. Okay, so the basic idea is that we don't want one-star ratings to be parallel to five-star ratings; rather, we would like to subtract the average so that we know that five stars is positive but one star is negative. That's all that's going on here: the equation looks fairly frightening, but we're really just subtracting averages appropriately. This is nothing other than the cosine similarity equation, where we have first subtracted averages in terms of the ratings given by each user, or the ratings given for each item. So we can just compare and contrast our definition of cosine similarity with Pearson correlation. The equation here is given for the similarity between users, but you could equivalently define it for the similarity between items, just by interchanging I_u with U_i wherever you see it. All we're doing to compute the similarity between two users is taking the items rated by both users, but first subtracting each user's average rating; that means we're determining whether a rating is a positive or a negative one by the standards of that individual user. Other than that average subtraction, the Pearson correlation looks pretty much equivalent to the cosine similarity. Okay, so that's about it: a bunch of different notions of recommendation based on computing the similarity between two items or two users, in terms of sets, vectors, or matrices, where the entries of those matrices could correspond to things like clicks, thumbs-up, thumbs-down values, or star ratings. So how does this actually work in practice? How is Amazon really generating those recommendations of people who bought X also bought Y, or people who liked X also liked Y, things like that? Well, according to a fairly old tech report from Amazon, they're not doing anything very complicated; it's very similar to what I've described. They take the set of users who purchased a particular item i, they rank other items j in terms of one of these similarity measures, for example Jaccard similarity, and then they surface those items which have the highest similarity compared to the others. So just to note, all we've developed so far is a bunch of recommender systems based on similarity between users and items. What's really surprising about this is that we did so just by looking at sets of users, sets of items, or these giant anonymous matrices of clicks and purchases. We have no idea who the users are. We don't know that this person is a 25-year-old female or that this movie is an action movie. We're just looking at raw click or rating data. We completely ignored any notion of features, which makes it clear how different this is from any type of regression or classification system we've developed so far. There are a few problems we'll try to address later. First of all, the solution is slow. We'll implement it later on with some code, and we'll see some tricks we can use to make it faster by avoiding some of the more costly comparisons. Secondly, this is of no use for brand-new users or brand-new items. If a user has never rated anything before, or an item has never been rated before, we have no clue which items are similar to that item or which users are similar to that user.
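As a rough sketch of the Pearson correlation between two users, assuming a hypothetical ratings dictionary keyed by (user, item) and the itemsPerUser sets from earlier (note that whether the averages are taken over all of a user's ratings or only over the items in common varies between formulations):

```python
import math

def pearson(u, v, ratings, itemsPerUser):
    # Cosine similarity computed after subtracting each user's average rating
    common = itemsPerUser[u] & itemsPerUser[v]   # items rated by both users
    if not common:
        return 0
    mean_u = sum(ratings[(u, i)] for i in itemsPerUser[u]) / len(itemsPerUser[u])
    mean_v = sum(ratings[(v, i)] for i in itemsPerUser[v]) / len(itemsPerUser[v])
    num = sum((ratings[(u, i)] - mean_u) * (ratings[(v, i)] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings[(u, i)] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings[(v, i)] - mean_v) ** 2 for i in common))
    return num / (den_u * den_v) if den_u > 0 and den_v > 0 else 0
```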
A third issue, which is maybe a more subtle one, is that this type of technique won't necessarily encourage diverse results. We saw, for example, the recommendations that were generated when you click on a pair of jeans: a bunch of pairs of very similar-looking jeans. Maybe I was really just looking for things made of denim, or maybe I was looking for things that are colored blue, and I should be given other recommendations of that type, rather than so many recommendations that are almost identical to the original query. Okay, but most of those ideas will be left for later lectures in this course. The purpose of this lecture was just to introduce a few of the most commonly used similarity-based recommender systems, and to describe the various situations where each might be preferable over the others.