0:14

There are in general two kinds of approaches to text categorization by machine learning.

One is generating probabilistic models.

The other is discriminative approaches.

In this lecture, we're going to talk about the generative models.

In the next lecture, we're going to talk about discriminative approaches.

So the problem of text categorization

is actually very similar to document clustering,

in that we assume each document belongs to one category or one cluster.

The main difference is that in clustering we don't really know

what the predefined categories, or clusters, are.

In fact, discovering them is the goal of text clustering.

0:59

But in the case of categorization, we are given the categories.

So we have predefined categories, and

based on these categories and training data, we would like to allocate

a document to one of these categories, or sometimes multiple categories.

Because of the similarity of the two problems,

we can actually adapt the document clustering models for text categorization.

And we can understand how to use generative models to do

text categorization from the perspective of clustering.

So this is a slide that we've talked about before, about text clustering,

where we assume there are multiple topics represented by word distributions.

Each topic is one cluster.

Once we have estimated such a model,

we face the problem of deciding which cluster document d should belong to.

And this question boils down to deciding which theta i has been used to generate d.

2:27

Well, in general, we use Bayes' rule to make this inference, and

you can see this prior information here that we need to consider: if a topic or

cluster has a higher prior, then it's more likely

that the document has come from this cluster.

And so we should favor such a cluster.

The other part is the likelihood, which is this part.

2:56

And this has to do with whether the topic word distribution

can explain the content of this document well.

We want to pick a topic that scores high on both values.

More specifically, we just multiply them together and

then choose the topic that has the highest product.

So more rigorously, this is what we'd be doing:

we're going to choose the topic that maximizes

the posterior probability of the topic given the document. It's called posterior

because this one, p of theta i, is the prior.

That's our belief about which topic is more likely,

before we observe any document.

But this conditional probability here is

the posterior probability of the topic after we have observed the document d.

4:05

And this is related to how well this word distribution

explains the document here, and the two are related in this way.

So finding the topic that has the highest

posterior probability is equivalent to maximizing this product,

as we have seen multiple times in this course.

4:32

And we can then decompose the probability of the document into a product of

the probabilities of the individual words, and that's just because we've made

an assumption of independence in generating each word.

So this is just something that you have seen in document clustering.
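As a rough sketch of this decision rule, with made-up priors and word distributions (nothing here comes from the lecture's actual slides), picking the cluster that maximizes prior times likelihood might look like this:

```python
# Hypothetical estimated model: priors p(theta_i) and word
# distributions p(w | theta_i) for two topics. All names and
# numbers are illustrative, not from the lecture.
priors = {"sports": 0.6, "science": 0.4}
word_dists = {
    "sports": {"game": 0.5, "team": 0.4, "atom": 0.1},
    "science": {"game": 0.1, "team": 0.1, "atom": 0.8},
}

def best_topic(doc_words):
    """Pick the topic maximizing p(theta_i) * prod_w p(w | theta_i)."""
    def score(topic):
        s = priors[topic]
        for w in doc_words:
            s *= word_dists[topic][w]  # independence assumption
        return s
    return max(priors, key=score)
```

Here a document mentioning "atom" repeatedly gets assigned to the second topic even though its prior is lower, because the likelihood dominates the product.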

4:50

And we now can see clearly how we can assign a document to a category

based on the information about word distributions for

these categories and the prior on these categories.

So this idea can be directly adapted to do categorization.

And this is precisely what a Naive Bayes classifier is doing.

So here it's mostly the same information, except

that we're looking at the categorization problem now.

So we assume that theta i

represents category i accurately, meaning that the word distribution

characterizes the content of documents in category i accurately.

Then what we can do is precisely what we did for text clustering.

Namely, we're going to assign document d to the category

that has the highest probability of generating this document.

In other words, we're going to maximize this posterior probability as well.

5:56

And this is related to the prior and

the likelihood, as you have seen on the previous slide.

And so, naturally, we can decompose

this likelihood into a product, as you see here.

Now, here I've changed the notation so that we write down the product as

a product over all the words in the vocabulary,

even though the document doesn't contain all the words.

The product still accurately represents the product

over all the words in the document because of this count here.

6:37

When a word doesn't occur in the document,

the count is 0, so this term just disappears.

So effectively, we just have the product over the words in the document.

So basically, with a Naive Bayes classifier,

we're going to score each category for the document by this function.

6:56

Now, you may notice that this involves a product of a lot of small probabilities,

and this can cause an underflow problem.

So one way to solve the problem is to take the logarithm of this function,

which doesn't change the ordering of the categories,

but helps us preserve precision.

And so this is often the function that we actually use to

score each category, and then we're going to choose

the category that has the highest score by this function.
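A minimal sketch of this log-space scoring, again with invented toy parameters rather than anything estimated from real data, could be:

```python
import math

# Illustrative two-category model; all probabilities are made up.
priors = {"cat1": 0.5, "cat2": 0.5}
word_dists = {
    "cat1": {"text": 0.7, "data": 0.3},
    "cat2": {"text": 0.2, "data": 0.8},
}

def log_score(category, word_counts):
    """log p(theta_i) + sum over words of c(w, d) * log p(w | theta_i).
    A word with count 0 contributes nothing, matching the note above
    that absent words drop out of the product."""
    s = math.log(priors[category])
    for w, c in word_counts.items():
        if c > 0:
            s += c * math.log(word_dists[category][w])
    return s

def classify(word_counts):
    # The logarithm is monotonic, so the category ranking is unchanged,
    # but we avoid multiplying many tiny probabilities together.
    return max(priors, key=lambda cat: log_score(cat, word_counts))
```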

So this is called a Naive Bayes classifier. Now, the keyword Bayes is

understandable, because we are applying Bayes' rule here when we go from

the posterior probability of the topic to a product of the likelihood and the prior.

7:47

It's called naive because we've made an assumption that every word

in the document is generated independently, and this is indeed a naive

assumption, because in reality words are not generated independently.

Once you see some words, then other words are more likely to occur.

For example, if you have seen a word like "text",

then words such as "mining" or "clustering" are more likely to appear

than if you have not seen "text".

8:15

But this assumption allows us to simplify the problem,

and it's actually quite effective for many text categorization tasks.

But you should know that this kind of model doesn't

have to make this assumption.

We could, for example, assume that words may be dependent on each other.

That would give us a bigram language model or a trigram language model.

And of course, you can even use a mixture model to model what the documents look

like in each category.

So in general, they would all be using Bayes' rule to do classification,

but the actual generative model for documents in each category can vary.

Here, we just talk about a very simple case, perhaps the simplest case.

9:00

So now the question is, how can we make sure theta i

actually represents category i accurately?

Now, in clustering, we learned this category i, or

the word distribution for category i, from the data.

But in our case, what can we do to make sure

this theta i indeed represents category i?

9:34

Indeed, in text categorization,

we typically assume that there is training data available, and

those are documents that are known to have been generated from each category.

In other words, these are documents with known categories assigned, and

of course, human experts must do that.

Here, you see that T1 represents the set of documents

that are known to have been generated from category 1,

and T2 represents the documents that are known to have been

generated from category 2, etc.

Now, if you look at this picture, you'll see that the model

here is really a simplified unigram language model.

It's no longer a mixture model. Why?

Because we already know which distribution has been used to generate which documents.

There's no uncertainty here, and no mixing of different categories.

10:30

So the estimation problem, of course, is simplified.

But in general, what we want to do is

estimate the probabilities that I've marked here.

And what are the probabilities that we have to estimate in order to do categorization?

Well, there are two kinds.

One is the prior, the probability of theta i, and

this indicates how popular each category is, or

how likely we are to observe a document in that category.

The other kind is the word distributions, and

we want to know which words have high probabilities in each category.

11:18

And in general, we can do this separately for the different categories.

That's just because these documents are known to have been generated

from a specific category.

So once we know that,

it's in some sense irrelevant what other categories we are also dealing with.

11:51

And this is a problem that we have seen several times in this course.

Now, if you haven't thought about this problem, or

haven't seen a Naive Bayes classifier,

it would be very useful for you to pause the video for a moment

to think about how to solve this problem.

So let me state the problem again.

Let's just think about category 1.

We know there is one word distribution that has been used to generate documents,

and we generate each word in a document independently. We know that we have

observed a set of N sub 1 documents, the set T1.

These documents have all been generated from category 1,

namely, all generated using this same word distribution.

Now, the question is, what would be your guess, or

estimate, of the probability of each word in this distribution?

And what would be your guess of the prior probability of this category?

Of course, this prior probability depends on

how likely you are to see documents in other categories.

12:55

So think for a moment: how do you use all this training data, including

all these documents that are known to be in these k categories,

to estimate all these parameters?

If you spend some time thinking about this,

it will help you understand the following few slides.

So do spend some time to try to solve this problem, or

do your best to solve the problem yourself.

Now, if you have thought about it, then you will realize the following.

13:29

First, what's the basis for estimating the prior, or the probability of each category?

Well, this has to do with whether you have observed a lot of documents

from that category.

Intuitively, if you have seen a lot of documents in sports and

very few in medical science,

then your guess is that the probability of the sports category is larger, or

your prior on that category would be larger.

13:57

And what about the basis for estimating the probability of words in each category?

Well, the same: you would just assume that words that are observed

frequently in the documents known to have been generated from a category will

likely have a higher probability.

And that's just the maximum likelihood estimate.

Indeed, that's what we can do. So to estimate the probability of each category, and

to answer the question of which category is most popular,

we can simply normalize the count of documents in each category.

So here you see N sub i denotes the number of documents in category i.
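So the prior estimate is just a normalized document count. As a small sketch, with hypothetical category counts:

```python
# Hypothetical document counts per category (the N sub i values).
doc_counts = {"sports": 3, "science": 1}

def estimate_priors(doc_counts):
    """p(theta_i) = N_i / (sum over j of N_j)."""
    total = sum(doc_counts.values())
    return {cat: n / total for cat, n in doc_counts.items()}
```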

14:55

Now, what about the word distribution?

Well, we do the same.

Again, this time we can do it for each category.

So let's say we're considering category i, or theta i.

Which word has a higher probability?

Well, we simply count the word occurrences

in the documents that are known to have been generated from theta i.

15:20

And then we put together all the counts of the same word in the set.

Then we just normalize these counts to make a distribution

over all the words, so that the probabilities of these words sum to 1.

So in this case, you're going to see that this is proportional to the count of

the word in the collection of training documents T sub i, and

that's denoted by c of w and T sub i.
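As a sketch, the maximum likelihood estimate for one category's word distribution, using a couple of invented training documents for that category:

```python
from collections import Counter

# Hypothetical training set T_i for one category: each document is
# represented as a list of words.
T_i = [["text", "mining", "text"], ["text", "data"]]

def estimate_word_dist(docs):
    """p(w | theta_i) proportional to c(w, T_i), normalized so that
    the probabilities of all words sum to 1."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)  # pool counts of the same word over the set
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```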

15:49

Now, you may notice that we often write down probability

estimates in the form of being proportional to certain counts.

And this is often sufficient,

because we have some constraints on these distributions,

so the normalizer is dictated by the constraint.

In this case, it will be useful for you to think about

what the constraints on these two kinds of probabilities are.

Once you figure out the answer to this question,

you will know how to normalize these counts.

And so this is a good exercise to work on if it's not obvious to you.

There is another issue in Naive Bayes, which is smoothing.

In fact, smoothing is a general problem in all estimation of language models,

and it has to do with

what happens when you have observed only a small amount of data.

Smoothing is an important technique to address this data sparseness.

In our case, the training data can be small, and when the data set is small and

we use the maximum likelihood estimator, we often face the problem of zero probability.

That means if an event is not observed,

then the estimated probability would be zero.

In this case, if we have not seen a word in the training documents for,

let's say, category i,

then our estimate would be zero for the probability of this word in that category,

and this is generally not accurate.

So we have to do smoothing to make sure there is no zero probability.

The other reason for smoothing is that it is a way to bring in prior knowledge,

and this is also generally true in many situations where smoothing is used:

when the data set is small,

we tend to rely on some prior knowledge to solve the problem.

So in this case, our prior says that no word should have zero probability,

and smoothing allows us to inject this prior that no

word has a true zero probability.

17:54

There is also a third reason, which is sometimes not very obvious, but

we'll explain it in a moment.

And that is to help achieve discriminative weighting of terms.

This is also called IDF weighting,

inverse document frequency weighting, which you have seen in mining word relations.

18:22

So one possible way of smoothing the probability of a category

is to simply add a small non-negative constant delta to the count.

We pretend that every category actually has some extra number of

documents, represented by delta.

18:40

And in the denominator, we also add k multiplied by delta, because

we want the probabilities to sum to 1.

In total, we've added delta k times, because we have k categories.

Therefore, in this sum, we have to also add k multiplied by

delta as the total pseudocount that we add to the estimate.
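This add-delta smoothed prior can be sketched like this (the counts and delta values are illustrative):

```python
def smoothed_priors(doc_counts, delta):
    """p(theta_i) = (N_i + delta) / (sum over j of N_j + k * delta):
    every one of the k categories gets delta pseudo-documents,
    and the k * delta in the denominator keeps the estimates
    summing to 1."""
    k = len(doc_counts)
    total = sum(doc_counts.values()) + k * delta
    return {cat: (n + delta) / total for cat, n in doc_counts.items()}
```

With a very large delta, the pseudocounts swamp the real counts and the priors approach a uniform distribution.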

19:06

Now, it's interesting to think about the influence of delta.

Obviously, delta is a smoothing parameter here,

meaning that the larger delta is, the more smoothing we do, and

the more we rely on pseudocounts.

We would indeed ignore the actual counts if delta were

set to infinity.

Imagine what would happen if delta approaches positive infinity.

Well, we would be saying every category has an infinite number of documents.

Then there's no distinction among them, so the distribution becomes just uniform.

19:44

What if delta is 0?

Well, we just go back to the original estimate, based on the observed training

data, to estimate the probability of each category.

Now, we can do the same for the word distribution.

But in this case, sometimes we find it useful

to use a nonuniform pseudocount for each word.

So here you see we add a pseudocount to each word, and

that's mu multiplied by the probability of

the word given by a background language model, theta sub B.

Now, that background model in general can be estimated by using

a large collection of text.

Or, in this case, we can use the whole set of training data to

estimate this background language model.

But we don't have to use this one;

we can use larger text data available from somewhere else.
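A sketch of this background-model smoothing, with an invented background distribution and counts; for simplicity it assumes every observed word is in the background vocabulary, so the result still sums to 1:

```python
def smoothed_word_dist(counts, background, mu):
    """p(w | theta_i) = (c(w, T_i) + mu * p(w | theta_B)) / (N + mu),
    where N is the total word count in the category's training docs.
    Common words get larger pseudocounts because the background model
    gives them higher probability."""
    total = sum(counts.values()) + mu
    return {w: (counts.get(w, 0) + mu * background[w]) / total
            for w in background}
```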

20:36

Now, if we use such a background language model to assign pseudocounts,

we'll find that some words receive more pseudocounts.

So what are those words?

Well, those are the common words, because they get a high probability from

the background language model.

So the pseudocounts added for such words will be higher.

Rare words, on the other hand, will have smaller pseudocounts.

This use of a background model thus causes a nonuniform

smoothing of these word distributions.

We're going to bring the probability of those common words to a higher level,

because of the background model.

Now, this helps make the difference in the probability of

such words smaller across categories,

because every category gets some help from the background model for words

like "the" and "a", which have high probabilities.

Therefore, it's not so important whether each category

has documents that contain a lot of occurrences of such words, as

the estimate is more influenced by the background model.

The consequence is that when we do categorization,

such words tend not to influence the decision as much

as words that have small probabilities under the background language model.

Those words don't get much help from the background language model,

so their differences come primarily from the differences in their occurrences

in the training documents of the different categories.

22:25

So mu is also a non-negative constant, and

it's [INAUDIBLE] set to control smoothing.

Now, there are some interesting special cases to think about as well.

First, let's think about what happens when mu approaches infinity.

Well, in this case, the estimate would approach

22:43

the background language model.

So we would bring every word distribution to the same background language model, and

that would essentially remove the difference between these categories.

Obviously, we don't want to do that.

The other special case is to think about the background model, and

suppose we actually set it to a uniform distribution,

let's say, 1 over the size of the vocabulary.

Each word has the same probability; then this smoothing formula is

going to be very similar to the one at the top where we add delta,

because we're adding a constant pseudocount to every word.

23:29

So in general,

in Naive Bayes categorization, we have to do such smoothing.

And then, once we have these probabilities,

we can compute the score of each category

for a document, and

then choose the category with the highest score, as we discussed earlier.

23:49

Now, it's useful to further understand whether

the Naive Bayes scoring function actually makes sense.

To understand that, and also to understand why adding a background model

actually achieves the effect of IDF weighting and penalizes common words,

suppose we have just two categories and

we're going to score based on the ratio of probabilities.

24:24

Let's say this is our scoring function for

two categories.

So this is the score of a document for these two categories,

and we're going to score based on this probability ratio.

If the ratio is larger,

then it means the document is more likely to be in category one.

So the larger the score is, the more likely

the document is in category one.

By using Bayes' rule,

we can write down this ratio as follows, and you have seen this before.

25:09

Now, we generally take the logarithm of this ratio to avoid small probabilities,

and this then gives us the formula in the second line.

And here we see something really interesting,

because this is our scoring function for deciding between the two categories.

25:41

The first part doesn't really depend on the document.

It just says which category is more likely a priori, and we would then favor

that category slightly, right?

The second part is a sum over all the words.

These are the words that are observed in the document, but

in general, we can consider all the words in the vocabulary.

So here we're going to collect the evidence

about which category is more likely.

Inside the sum, you can see there is a product of two things.

The first is the count of the word,

and this count of the word serves as a feature to represent the document.

26:27

And this is what we can collect from the document.

The second part is the weight of this feature,

here the weight on each word.

This weight tells us to what extent observing

this word contributes to our decision

to put this document in category one.

Now remember, the higher the scoring function is,

the more likely the document is in category one.

Now, if you look at this weight, it's basically based

on the ratio of the probabilities of the word from each of the two distributions.

Essentially, we're comparing the probability of the word under the two

distributions.

If it's higher according to theta 1

than according to theta 2, then this weight is positive.

And that means when we observe such a word,

we will say the document is more likely to be from category one.

The more we observe such a word,

the more likely the document will be classified as theta 1.

27:35

If, on the other hand, the probability of the word from

theta 1 is smaller than the probability of the word from theta 2,

then you can see that this weight is negative.

Therefore, this is negative evidence against category one.

That means the more we observe such a word,

the more likely the document is actually from theta 2.

27:58

So this formula now makes sense, right?

We're going to aggregate all the evidence from the document;

we take a sum over all the words.

We can call these the features

that we collect from the document that help us make the decision.

And then each feature has a weight that tells us how much it

contributes to the decision.

28:32

And then finally we have this constant of bias here.

Â So that formula actually is a formula that can

Â be generalized to accommodate more features and

Â that's why I have introduce some other symbols here.

Â To introduce beta 0 to denote the Bayes and fi to denote the each feature and

Â beta sub i to denote the weight on each feature.

Â Now we do this generalisation, what we see is that in

Â general we can represent the document by feature vector fi,

Â here of course in this case fi is the count of a word.

Â But in general, we can put any features that we think are relevant for

Â categorization.

Â For example, document length or

Â font size or count of other patterns in the document.
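The generalized scoring function, and the two-category Naive Bayes log-odds as a special case of it, might be sketched as follows (all probabilities and feature names are invented for illustration):

```python
import math

def linear_score(features, weights, beta0):
    """score(d) = beta_0 + sum over i of beta_i * f_i(d)."""
    return beta0 + sum(weights[name] * value
                       for name, value in features.items())

# Naive Bayes as a special case: features are word counts, each weight
# is the log ratio log(p(w | theta1) / p(w | theta2)), and the bias is
# the log ratio of the priors. All numbers are illustrative.
p1 = {"text": 0.6, "sports": 0.1}
p2 = {"text": 0.2, "sports": 0.5}
prior1, prior2 = 0.5, 0.5

beta0 = math.log(prior1 / prior2)
weights = {w: math.log(p1[w] / p2[w]) for w in p1}
```

A positive score favors category one and a negative score favors category two: a document full of "text" pushes the score up, while one full of "sports" pushes it down.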

29:42

So each f sub i is a feature value; we multiply the value

by the corresponding weight, beta sub i, and we just take the sum.

This is the aggregation of all the evidence that we can collect from these

features.

And of course, there are parameters here.

So what are the parameters?

Well, these are the betas.

These betas are weights.

And with a proper setting of the weights, we can expect such a scoring function

to work well for classifying documents, just like in the case of Naive Bayes.

We can clearly see the Naive Bayes classifier as a special case of

this general classifier.

Actually, this general form is very close to a classifier called logistic

regression, and this is one of those conditional approaches, or

discriminative approaches, to classification.

30:32

We're going to talk more about such approaches later, but

here I want you to note that there is a strong connection,

a close connection, between the two kinds of approaches.

This slide shows how the Naive Bayes classifier can be connected to

logistic regression.

And you can also see that in discriminative classifiers,

which tend to use the more general form at the bottom,

we can accommodate more features to solve the problem.
