Welcome back. In previous videos we've looked at matrix factorization techniques that are based on linear algebra and have geometric, vector-space kinds of interpretations. In this video we're going to look at ones that are not based in algebra and geometry; in particular, they're going to be based in probability theory. Linear algebra is not the only basis we can use for matrix factorization. There are many different interesting algorithms that are based on probabilistic models. The basic idea is that we assume the data is generated by some random process, and we treat all uncertainty as randomness. We make an assumption about the structure of that process, and then we learn parameters such that data generated by the process would look a lot like the data we actually have.

For an initial example, we can think of a non-personalized probabilistic model. Consider the problem of whether a user is going to buy an item. We can say the probability that they're going to buy an item i is just based on the number of times i was bought, scaled to be a probability: the number of buys of i over the total number of buys. This is independent of the user u, so it's non-personalized. We just say they're going to buy item i with a probability equal to its popularity. This is a probabilistic model: it says the probability that the user is going to buy an item, watch a movie, or read a book is based on how popular the item is.

But we can try to personalize that. We can try to compute the probability that a particular user is going to buy a particular item, and/or the probability that they're going to give it a particular rating. The problem, if we just try to learn the probability that the user is going to buy item i directly, is that there are too many parameters. We don't observe every user deciding to buy or not buy every item, so we don't have enough data to learn this. So what we do is break it down into pieces.

One way to do this is called PLSA, or Probabilistic Latent Semantic Analysis. Latent Semantic Analysis, or Latent Semantic Indexing, is a term commonly used for singular value decomposition in information retrieval. The goal is to compute the probability that the user is going to buy item i, and we decompose that into latent factors. Just like we do for computing ratings or otherwise estimating preferences in other matrix factorization techniques, we say that the probability that the user is going to buy item i is based on two things: the probability that the user is going to pick a particular random topic z (a random factor z), and then, given that they've picked that factor, the probability that they're going to pick a particular item.

We can draw this out in the common notation used for probabilistic models. We have a user u who picks a topic z, and then from the topic z they pick an item i. This happens once for every rating, or for every item they buy. So we have to learn two probability distributions. The first is P(z|u): how likely is the user to pick a topic? These topics are latent, like they are with other matrix factorizations, so there's little reason to believe they map onto anything nameable. This is the probability that the user decides, say, "I want to watch an animated science fiction movie." Then we also learn P(i|z): if the user says they want to watch an animated science fiction movie, what's the probability that they're going to pick The Incredibles?
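To make that decomposition concrete, here is a minimal sketch in Python of how the two distributions combine to score items. It assumes P(z|u) and P(i|z) have already been learned (for example, with expectation maximization); the matrix names and sizes below are made up purely for illustration.

```python
import numpy as np

# p_z_given_u[u, z] = P(z | u): how likely user u is to pick latent topic z
# p_i_given_z[z, i] = P(i | z): how likely topic z is to produce item i
# Both are assumed to be learned already; each row must sum to 1 so that
# they are valid probability distributions.

rng = np.random.default_rng(0)
n_users, n_topics, n_items = 5, 3, 8

p_z_given_u = rng.random((n_users, n_topics))
p_z_given_u /= p_z_given_u.sum(axis=1, keepdims=True)

p_i_given_z = rng.random((n_topics, n_items))
p_i_given_z /= p_i_given_z.sum(axis=1, keepdims=True)

# P(u buys i) = sum over topics z of P(z | u) * P(i | z)
p_buy = p_z_given_u @ p_i_given_z   # shape: (n_users, n_items)

# Each user's distribution over items sums to 1.
print(p_buy.sum(axis=1))
```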
So the individual item probabilities are not dependent on the user; they're just dependent on what topic the user picked. The topic probabilities are dependent on the user. If we've got 40 topics and the user buys a bunch of items, we can try to infer how likely they are to like each topic based on the items they buy. This gives us a much smaller parameter space. Just like matrix factorization in general lets us represent users compactly, as their preference for 40 or 100 features rather than as the full list of items they buy, here we can represent them as the probability that they're going to pick each of our different features.

The end result has the same form as SVD. We have our P matrix, which is the probability that the user is going to pick a particular topic, and we have our Q matrix, which is the probability that they're going to pick an item given a topic. In our formula, the probability of the user picking item i is the sum over all topics: P(u buys i) = sum over z of P(z|u) P(i|z). We take the sum because we want to consider every topic through which the user could get to an item: what's the probability that they pick that topic, and then the probability that they get that item from it? Many items are relevant to several topics. The Incredibles is animated, it's science fiction, it's a superhero movie. So any of those paths, if those are some of our latent topics, are ways to get to The Incredibles.

One difference, though, is that the left and right matrices are stochastic instead of orthogonal. That is, they encode probability distributions rather than linear algebra vector spaces. We can learn this model with an algorithm called expectation maximization. We can also extend it to model the ratings as a distribution that depends on a variety of factors; there are several different ways to do that, and I'll refer you to the literature to learn more.

One downside to PLSA, though, is that it's pretty slow to learn. There's an alternative formulation, probabilistic matrix factorization, that is much quicker to learn and models ratings as being drawn from a normal distribution. We can extend this model to consider other distributions as well. A normal distribution, or Gaussian distribution, is just your standard bell curve. So we assume ratings come from a bell curve, and that the mean of that bell curve is determined by the user and the item via the latent features each of them has. We look at the user's values for the different features and the item's values for the different features, and that determines the mean of the normal distribution; the user's rating is then a random draw from that distribution. This is what we call a Gaussian mixture model when we're treating it as normals.

So we assume these ratings are normally distributed with a mean determined by the dot product of our user feature vector and our item feature vector. We have a variance that models the noise of ratings: how much the ratings users actually give vary around these means. We also model the user feature preferences and the item feature values as being randomly, normally distributed as well. So we get a model where the user has a vector p sub u and the item has a vector q sub i; the user vector is drawn once for every user and the item vector once for every item. Then their rating depends on the feature vector values from the user and the item, and each of these distributions also depends on its variance.
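As a rough sketch of that generative story (not the exact formulation from any particular paper), here is how the model can be written down in Python. The sizes and the variances sigma_u, sigma_i, and sigma_r below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Generative story for probabilistic matrix factorization:
# user vectors p_u and item vectors q_i are drawn from zero-mean Gaussians,
# and each rating is drawn from a Gaussian whose mean is the dot product p_u . q_i.

rng = np.random.default_rng(42)
n_users, n_items, n_features = 4, 6, 3
sigma_u, sigma_i, sigma_r = 1.0, 1.0, 0.5   # illustrative variances (as std. devs.)

P = rng.normal(0.0, sigma_u, size=(n_users, n_features))   # one p_u per user
Q = rng.normal(0.0, sigma_i, size=(n_items, n_features))   # one q_i per item

means = P @ Q.T                        # mean rating for each (user, item) pair
ratings = rng.normal(means, sigma_r)   # observed ratings are noisy draws around the means

print(ratings.round(2))
```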
So we have a user feature vector for each user and an item feature vector for each item, and we combine them to generate the normal distribution from which our ratings are drawn. We can train this whole thing using any of the methods for inferring a probabilistic model, such as expectation maximization, gradient descent, and so on; a small gradient descent sketch follows below.

Matrix factorization, both probabilistic and not, is core to many algorithms currently used in recommender systems. There are various other latent feature models that can be interpreted probabilistically as well, such as neural nets and restricted Boltzmann machines; there are probabilistic elements to the way those work. I refer you to the reading, and to the additional videos in this course, for more details on many of these concepts.
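Here is a minimal sketch of fitting the Gaussian model with gradient descent over observed ratings, where the L2 regularization term plays the role of the Gaussian priors on the user and item vectors. The learning rate, regularization strength, and toy data are all assumptions made for the example, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_features = 10, 12, 4
lr, reg, n_epochs = 0.01, 0.1, 50   # illustrative hyperparameters

# Toy observed ratings: (user, item, rating) triples on a 1-5 scale.
observed = [(u, i, rng.integers(1, 6))
            for u in range(n_users) for i in range(n_items)
            if rng.random() < 0.3]

# Small random initialization of user and item feature vectors.
P = rng.normal(0, 0.1, size=(n_users, n_features))
Q = rng.normal(0, 0.1, size=(n_items, n_features))

for _ in range(n_epochs):
    for u, i, r in observed:
        pu, qi = P[u].copy(), Q[i].copy()
        err = r - pu @ qi                      # prediction error for this rating
        P[u] += lr * (err * qi - reg * pu)     # gradient step on the user vector
        Q[i] += lr * (err * pu - reg * qi)     # gradient step on the item vector

rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in observed]))
print(f"training RMSE: {rmse:.3f}")
```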