0:07

When we talked about the Naïve Bayes model and the theory and the formulation behind it,

we didn't really focus on the features and what the features represent.

There are two ways in which Naïve Bayes features can be modeled,

and these correspond to the two classic variants of Naïve Bayes for text.

You have the multinomial Naïve Bayes model, and the other one is the Bernoulli model,

and we will talk about both of them soon.

The multinomial Naïve Bayes model is one in

which you assume that the data follows a multinomial distribution.

So what does that mean?

It means that when you have the set of features that define a particular data instance,

we're assuming that these features each come independently of each other and

can also have multiple occurrences or multiple instances of each feature.

So, counts become important in this multinomial distribution model.

So each feature value is

some sort of a count or a weighted count.

Examples would be word occurrence counts or TF-IDF weights and so on.

So, suppose you have a piece of text, a document,

and you are finding out all the words that were used in it.

That would be called a bag-of-words model.

And if you just use the words,

recording whether they were present or not,

then that is a Bernoulli distribution for each feature.

So, it becomes a multivariate Bernoulli model when you're talking about all the words together.

But if you say that the number of

times a particular word occurs is important, so for example,

if the statement is "to be or not to be,"

and you want to somehow record that the word "to" occurs twice,

the word "be" occurs twice,

the word "or" occurs just once, and so on,

then you want to keep track of the frequency of each of these words.
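This kind of count-based feature extraction can be sketched in a few lines of plain Python, using the same "to be or not to be" example:

```python
from collections import Counter

# Multinomial (count) features: each word maps to the number of
# times it occurs in the document.
tokens = "to be or not to be".split()
counts = Counter(tokens)
print(counts)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```

These per-word counts are exactly the feature values a multinomial Naïve Bayes model consumes.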

And then, if you want to give more importance to rarer words,

you would add on something called term frequency-

inverse document frequency (TF-IDF) weighting.

So you not only give importance to the frequency,

but also ask how common this word is in the entire collection,

and that's where the idea of weighting comes from.

So for example, the word "the" is very common:

it occurs in almost every sentence,

it occurs in every document,

so it is not very informative.

But a word like

"significant" is significant precisely because it's not going to occur in every document.

So, you want to give a higher importance to a document that

has the word "significant" than to the word "the,"

and that kind of variation in

weighting is possible when you're doing a multinomial Naïve Bayes model.
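A toy version of this weighting can be written directly from the definition. This sketch uses one common TF-IDF variant, tf × log(N/df), over a made-up three-document collection; real toolkits apply smoothing and normalization on top of this:

```python
import math

# Tiny illustrative corpus (made up for this example).
docs = [
    "the cat sat on the mat",
    "the dog barked",
    "a significant result",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def tfidf(word, doc_tokens):
    tf = doc_tokens.count(word)                         # term frequency in this doc
    df = sum(word in toks for toks in tokenized)        # docs containing the word
    return tf * math.log(N / df)

# "the" appears in 2 of 3 docs -> low weight despite tf = 2;
# "significant" appears in 1 of 3 docs -> higher weight from a single occurrence.
print(tfidf("the", tokenized[0]))          # ≈ 0.81
print(tfidf("significant", tokenized[2]))  # ≈ 1.10
```

The common word gets discounted by its document frequency, which is exactly the intuition about "the" versus "significant" above.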

The second model is the Bernoulli Naïve Bayes model.

Here, the assumption is that the data follows a multivariate Bernoulli distribution,

where each feature is a binary feature, that is,

the word is present or not present,

and it's only the information about the word being present that is

modeled; it does not matter how many times that word was present.

In fact, it also does not matter whether the word itself is

significant or not, in the sense of whether it is a word like "the,"

which is fairly common in everything,

or a word like "significant,"

which is less common across documents.

So when you have just the binary features,

that is, just a binary model for every feature,

then the entire data,

the set of features, follows what is called a multivariate Bernoulli model.
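Binary feature extraction for the Bernoulli model is even simpler than counting; here is a sketch over a small hypothetical vocabulary, reusing the "to be or not to be" example:

```python
# Bernoulli (binary) features: presence/absence only, over a fixed
# vocabulary. "hamlet" is included to show an absent word.
vocab = ["to", "be", "or", "not", "hamlet"]
tokens = set("to be or not to be".split())
binary = [1 if w in tokens else 0 for w in vocab]
print(binary)  # [1, 1, 1, 1, 0]
```

Note that "to" and "be" each occur twice in the sentence but contribute only a 1, since the Bernoulli model discards frequency.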

So these are the two standard classic variants of Naïve Bayes,

and you'll see that most of the approaches and most of

the tools that you have for Naïve Bayes modeling give you this option:

the option of multinomial Naïve Bayes or Bernoulli Naïve Bayes.

It's fairly common with text documents to use multinomial Naïve Bayes,

but there are instances where you would want to go the Bernoulli route,

especially if you want to say that the frequency is

immaterial and it's

the presence or absence of a word that is more important.
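To make the multinomial variant concrete, here is a minimal stdlib-only Naïve Bayes classifier with Laplace smoothing, trained on a made-up toy corpus. Toolkits such as scikit-learn package both variants as ready-made classes, but the core computation is just this:

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Collect per-class word counts and class priors (multinomial model)."""
    vocab = {w for d in docs for w in d.split()}
    counts = defaultdict(Counter)          # word counts per class
    priors = Counter(labels)               # class frequencies
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    return vocab, counts, priors, len(docs)

def predict(text, model):
    """Pick the class with the highest log posterior, with Laplace smoothing."""
    vocab, counts, priors, n = model
    best, best_lp = None, -math.inf
    for y in priors:
        total = sum(counts[y].values())
        lp = math.log(priors[y] / n)       # log prior
        for w in text.split():
            if w in vocab:                 # skip words never seen in training
                lp += math.log((counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Toy sentiment data, invented purely for illustration.
model = train(["good great fun", "great movie", "bad boring", "boring plot"],
              ["pos", "pos", "neg", "neg"])
print(predict("great fun movie", model))  # pos
```

Swapping in the Bernoulli variant would mean replacing the word counts with presence indicators and scoring absent vocabulary words as well, which is exactly the modeling choice discussed above.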
