So why is Naive Bayes called naive? The example I gave you had just a single feature, so our X was just a scalar. What happens if X is actually a vector? What happens if we have to deal with multiple features, not just the height, but height and weight, or many, many other features? So, let's say that X is a vector with d dimensions. Well, if we compute the likelihood of X given class Cj, and X is a vector, then by the chain rule we have to compute the probability of X1, the first element of the vector, given X2, X3, and so on up to Xd and the class, times the probability of the second component X2 given X3 and so on up to Xd and Cj, times the third component and so on and so on, times the probability of Xd given Cj, times the prior probability of the class. And that's a very difficult computation to make: each individual attribute conditioned on the joint values of everything else. So to handle this, we assume that X1, X2, X3, and so on up to Xd are conditionally independent of each other, given the class. If this assumption holds, that these attributes are conditionally independent of each other given the class, then we can rewrite the probability of X1, X2, and so on up to Xd given the class, which is that whole chain, as the probability of X1 given the class, times the probability of X2 given the class, and so on, times the probability of Xd given the class. So you don't have to compute that horrible thing anymore. Now, the problem is that this is a very bold assumption. It's a very naive assumption to make. And that's where the name naive Bayes comes from, because this is the assumption we make when we use naive Bayes. Now, suppose you're doing something like text classification and your features are actually words. For example, you are writing a spam/ham classifier: you want to detect spam email messages, right?
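To make the factorization concrete, here is a minimal sketch of that simplification in code, using per-feature Gaussians as in the height-and-weight example. All the numbers (class means, variances, priors, and the observation) are invented for illustration; the point is only that the likelihood is a product of one-dimensional terms rather than one big joint density.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_likelihood(x, means, variances):
    """P(x1, ..., xd | Cj) under conditional independence:
    the product of per-feature likelihoods P(xi | Cj)."""
    p = 1.0
    for xi, m, v in zip(x, means, variances):
        p *= gaussian_pdf(xi, m, v)
    return p

# Made-up per-class parameters: ([means], [variances]) for (height, weight).
params = {
    "class_A": ([170.0, 70.0], [25.0, 16.0]),
    "class_B": ([160.0, 55.0], [20.0, 12.0]),
}
priors = {"class_A": 0.5, "class_B": 0.5}

x = [168.0, 66.0]  # one observation: (height in cm, weight in kg)

# Posterior is proportional to prior times the factorized likelihood.
posteriors = {c: priors[c] * naive_likelihood(x, m, v)
              for c, (m, v) in params.items()}
prediction = max(posteriors, key=posteriors.get)
```

Without the independence assumption you would need a full joint density over all d features per class; with it, you only ever estimate d one-dimensional distributions per class.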
And your features are words. If one of your features is, let's say, 'win,' and another feature is 'money,' then the probability of 'money' appearing together with 'win' in a message would be greater than, say, the probability of 'win' appearing with, I don't know, 'win a hippopotamus,' if you wish. We can draw a hippopotamus here. Just being silly now. So what I'm trying to say is that the features you base your conditional probabilities on will not be independent of each other. It's much more likely to see 'win' and 'money' together in an email message than 'win a hippopotamus.' So this assumption doesn't normally hold. The good news is that although this is a very naive assumption, it actually works really well in practice. You can easily write a spam classifier with a very high accuracy, over 90 percent, using this assumption. Anyway, again, this was a side note that tells you why naive Bayes is called naive Bayes: because of this assumption of conditional independence that we make. The final thing I would like to mention is that Gaussian naive Bayes, which we talked about, is not the only type of naive Bayes algorithm you can use. There are different implementations in general, using different distributions, like Bernoulli (sometimes called binomial) naive Bayes. For example, maybe your features are not continuous like the features in our example, which used a continuous normal distribution; maybe your features are binary. Again, going back to the spam detector: each feature states whether the observation contains or doesn't contain a specific word from a dictionary, and you denote this by ones and zeros. Then you use a different distribution, the Bernoulli distribution. So your P(X | Cj) will be given by something other than the Gaussian: it will be the product over i from 1 to d of p_ji^(x_i) * (1 - p_ji)^(1 - x_i), where p_ji is the probability that word i appears in a message of class j.
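Here is a small sketch of that Bernoulli likelihood for the spam/ham setting. The vocabulary, word probabilities, and priors are all made up for illustration, not estimated from any real data; the sketch only shows how the product p_ji^(x_i) * (1 - p_ji)^(1 - x_i) is evaluated for a binary feature vector.

```python
# Toy vocabulary of three words; a message is encoded as a binary vector
# x where x[i] = 1 if word i is present, 0 otherwise.
vocab = ["win", "money", "hippopotamus"]

# p[class][i] = P(word i present | class). Invented values for illustration.
p = {
    "spam": [0.8, 0.7, 0.01],
    "ham":  [0.1, 0.2, 0.01],
}
priors = {"spam": 0.4, "ham": 0.6}

def bernoulli_likelihood(x, probs):
    """P(x | class) = product over i of p_i^x_i * (1 - p_i)^(1 - x_i)."""
    lik = 1.0
    for xi, pi in zip(x, probs):
        lik *= pi if xi == 1 else (1.0 - pi)
    return lik

x = [1, 1, 0]  # message contains "win" and "money", no hippopotamus

posteriors = {c: priors[c] * bernoulli_likelihood(x, p[c]) for c in p}
prediction = max(posteriors, key=posteriors.get)
```

In a real classifier the p_ji values would be estimated from word frequencies in a labeled training corpus, but the decision rule, prior times factorized likelihood, is exactly the same as in the Gaussian case.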
It will be given by the Bernoulli distribution. There is also multinomial naive Bayes, for when your features are counts, like word counts, and so on. So these are all different ways of using naive Bayes, but at the core it's essentially the same thing: you pick a distribution, you use Bayes' rule, so you use the priors and the likelihood to get the posterior, and you classify with the maximum a posteriori rule.