This lecture is about the specific smoothing methods for language models used in the probabilistic retrieval model. In this lecture, we will continue the discussion of language models for information retrieval, particularly the query likelihood retrieval method, and we're going to talk specifically about the smoothing methods used for such a retrieval function.

This is a slide from a previous lecture, where we showed that with query likelihood ranking and smoothing with the collection language model, we end up with a retrieval function that looks like the following. So this is the retrieval function based on the assumptions that we have discussed. You can see it's a sum over all the matching query terms. Inside the sum is the count of the term in the query and some weight for the term in the document; we have a TF-IDF weighting here, and then we have another constant here, n. So clearly, if we want to implement this function in a programming language, we still need to figure out a few variables. In particular, we need to know how exactly to estimate the probability of a word here and how to set alpha sub d. To answer these questions, we have to think about very specific smoothing methods, and that is the main topic of this lecture. We're going to talk about two smoothing methods.

The first is simple linear interpolation with a fixed coefficient, also called Jelinek-Mercer smoothing. The idea is actually very simple. This picture shows how we estimate a document language model using the maximum likelihood estimate, which gives us word counts normalized by the total number of words in the text. The idea of this estimate is to maximize the probability of the observed text. As a result, a word like "network" that is not observed in the text gets zero probability, as shown here. The idea of smoothing, then, is to rely on the collection language model, where this word does not have a zero probability, to help us decide what nonzero probability should be assigned to such a word. Note that "network" has a nonzero probability here. So in this approach, we do a linear interpolation between the maximum likelihood estimate here and the collection language model, controlled by the smoothing parameter lambda, which is between 0 and 1. The larger lambda is, the more smoothing we have. By mixing the two together, we achieve the goal of assigning nonzero probabilities to a word like "network".

Let's see how this works for some of the words here. For example, suppose we compute the smoothed probability for "text". The maximum likelihood estimate gives us 10 over 100, which goes here, and the collection probability is this, so we just combine them with this simple formula. We can also see that the word "network", which used to have a zero probability, now gets a nonzero probability of this value. That's because the count of "network" is zero here, but this part is nonzero. That's basically how this method works. Now, if you think about it, you can easily see that alpha sub d in this smoothing method is simply lambda, because, remember, that's the coefficient in front of the probability of the word given by the collection language model here. So this is the first smoothing method.
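To make this concrete, here is a minimal Python sketch of Jelinek-Mercer smoothing, assuming the small example from the slide: a 100-word document in which "text" occurs 10 times and "network" does not occur at all. The function name and the collection probabilities below are hypothetical choices for illustration; the slide's actual collection probabilities are not reproduced here.

```python
def jm_smooth(count_w_d, doc_len, p_w_C, lam):
    """Jelinek-Mercer smoothing: a fixed-coefficient linear interpolation of
    the maximum likelihood estimate c(w,d)/|d| with the collection language
    model p(w|C). For this method, alpha_d is simply the constant lambda."""
    return (1 - lam) * (count_w_d / doc_len) + lam * p_w_C

# Hypothetical numbers for illustration (not the exact slide values):
doc_len = 100          # |d|: total number of words in the document
lam = 0.1              # smoothing parameter, 0 <= lambda <= 1
p_text_C = 0.01        # assumed p("text"|C)
p_network_C = 0.005    # assumed p("network"|C)

# "text" occurs 10 times: ML estimate 10/100, mixed with the collection model
print(jm_smooth(10, doc_len, p_text_C, lam))    # 0.9*0.1 + 0.1*0.01  = 0.091
# "network" never occurs: ML estimate is 0, but the smoothed value is nonzero
print(jm_smooth(0, doc_len, p_network_C, lam))  # 0.9*0 + 0.1*0.005   = 0.0005
```

Increasing lam pushes the smoothed probability toward the collection model, which is what "more smoothing" means here.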
The second smoothing method is similar, but it uses a dynamic, document-dependent coefficient for the linear interpolation. It's often called Dirichlet prior smoothing, or Bayesian smoothing. So again, here we face the problem of zero probability for an unseen word like "network", and again we will use the collection language model, but in this case we're going to combine the two in a somewhat different way.

The formula can first be seen as an interpolation of the maximum likelihood estimate and the collection language model, as in the JM smoothing method, only the coefficient is not lambda, a fixed number, but a dynamic coefficient of this form, where mu is a non-negative parameter. You can see that if we set mu to a constant, the effect is that a long document gets a smaller coefficient here, because a long document has a larger length, and therefore the coefficient is smaller. So a long document gets less smoothing, as we would expect. This seems to make more sense than smoothing with a fixed coefficient. Of course, this part has to be of this form so that the two coefficients sum to 1.

Now, this is one way to understand this smoothing: basically, it is an interpolation with a dynamic coefficient. There is another way to understand this formula that is even easier to remember, and that's on this side. It's easy to see how we can rewrite the smoothing method in this form. In this form, we can easily see what change we have made to the maximum likelihood estimate, which would be this part: the count normalized by the document length. We can see that what we did is add this quantity to the count of every word. What does this mean? Well, this is basically the probability of the word in the collection, multiplied by the parameter mu. When we combine this with the count here, we are essentially adding pseudocounts to the observed text: we pretend every word has this many extra pseudocounts. The total count of a word is then the sum of these pseudocounts and the actual count of the word in the document. In total, we would have added mu pseudocounts. Why? Because if you sum this quantity over all the words, the probabilities of the words sum to 1, and that gives us just mu. So mu is the total number of pseudocounts that we added, and these probabilities still sum to 1.

So in this case, we can easily see that the method essentially adds pseudocounts to the data: we pretend we have augmented the data with some pseudo data defined by the collection language model. As a result, we have more counts, and the total count for a word would be like this. Even if a word has a zero count here, it would still have a nonzero count because of this part. So this is how this method works.

Let's also take a look at a specific example here. For "text", we again have 10 as the count that we actually observe, but we also add some pseudocounts, so the probability of "text" would be of this form. Naturally, the probability of "network" would be just this part. Here you can also see what alpha sub d is. Can you see it? If you want to think about it, you can pause the video. You'll notice that this part is basically alpha sub d. So we can see that, in this case, alpha sub d does depend on the document, because the document length depends on the document, whereas in the linear interpolation, the JM smoothing method, it is a constant.
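Here is a companion Python sketch of Dirichlet prior smoothing in the same style, showing that the dynamic-coefficient interpolation view and the pseudocount view give the same probability. As before, the function names, mu, and the collection probabilities are hypothetical values chosen only for illustration.

```python
def dirichlet_smooth(count_w_d, doc_len, p_w_C, mu):
    """Dirichlet prior (Bayesian) smoothing, pseudocount view:
    add mu * p(w|C) pseudocounts to every word, then normalize by |d| + mu."""
    return (count_w_d + mu * p_w_C) / (doc_len + mu)

def dirichlet_smooth_interpolation_view(count_w_d, doc_len, p_w_C, mu):
    """Equivalent interpolation view with a document-dependent coefficient:
    alpha_d = mu / (|d| + mu), so longer documents get less smoothing."""
    alpha_d = mu / (doc_len + mu)
    return (1 - alpha_d) * (count_w_d / doc_len) + alpha_d * p_w_C

# Hypothetical numbers for illustration (not the exact slide values):
doc_len = 100
mu = 1000              # Dirichlet prior parameter, mu >= 0
p_text_C = 0.01        # assumed p("text"|C)
p_network_C = 0.005    # assumed p("network"|C)

# "text": observed count 10 plus mu * p("text"|C) pseudocounts
print(dirichlet_smooth(10, doc_len, p_text_C, mu))                     # (10 + 10) / 1100
# "network": zero observed count, but nonzero pseudocounts
print(dirichlet_smooth(0, doc_len, p_network_C, mu))                   # (0 + 5) / 1100
# The two views give identical probabilities
print(dirichlet_smooth_interpolation_view(10, doc_len, p_text_C, mu))  # also 20 / 1100
```

With mu fixed, alpha_d = mu / (|d| + mu) shrinks as the document grows, so longer documents are smoothed less, matching the discussion above.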
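Finally, a rough sketch of how either smoothing method could plug into the query likelihood retrieval function referenced at the start of the lecture. This assumes the standard rank-equivalent form of the smoothed query likelihood, namely a sum over matched query terms of c(w,q) log(p_seen(w|d) / (alpha_d p(w|C))) plus n log alpha_d, with n the query length; the helper names and the toy data are hypothetical.

```python
import math
from collections import Counter

def query_likelihood_score(query_tokens, doc_tokens, p_w_C, smooth, alpha_d):
    # Rank-equivalent smoothed query likelihood:
    #   sum over matched terms of c(w,q) * log(p_seen(w|d) / (alpha_d * p(w|C)))
    #   + n * log(alpha_d), where n is the query length.
    # `smooth(count, doc_len, p)` returns the smoothed p(w|d) for a seen word;
    # `p_w_C` maps each word to its collection probability p(w|C).
    c_q, c_d = Counter(query_tokens), Counter(doc_tokens)
    n = sum(c_q.values())
    score = n * math.log(alpha_d)
    for w, cq in c_q.items():
        if c_d[w] > 0:  # only matching query terms contribute to the sum
            p_seen = smooth(c_d[w], len(doc_tokens), p_w_C[w])
            score += cq * math.log(p_seen / (alpha_d * p_w_C[w]))
    return score

# Toy usage with both smoothing methods (hypothetical data):
p_w_C = {"text": 0.01, "mining": 0.002, "network": 0.005}
doc = ["text"] * 10 + ["mining"] * 5
query = ["text", "mining"]
lam, mu = 0.1, 1000

jm_score = query_likelihood_score(
    query, doc, p_w_C,
    smooth=lambda c, dl, p: (1 - lam) * c / dl + lam * p,
    alpha_d=lam)                       # JM: alpha_d is the constant lambda
dir_score = query_likelihood_score(
    query, doc, p_w_C,
    smooth=lambda c, dl, p: (c + mu * p) / (dl + mu),
    alpha_d=mu / (len(doc) + mu))      # Dirichlet: alpha_d depends on |d|
print(jm_score, dir_score)
```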