Let's go back and understand this, instead of just giving you definitions. I hope you finished the exercise on the spam filter. The spam data, remember, was for classifying emails into spam or non-spam. It had various independent variables: the number of occurrences of the word "money", the number of occurrences of the word "make", and so on. You can make up your own variables, but this is one way of detecting spam. So we will build a spam filter, as we wanted to, and we will measure various measures of performance.

Building it is very easy, as you would have noticed. Here is all the data; we are building a classifier for spam or not spam. We load the dataset called spam.csv, select "Model tree", which is what we asked you to do, hit "Execute", and it spits out the decision tree. We want to see the tradeoff between using all the variables versus using a subset of variables. We can also make the model as generous as possible, fit it, and then see what happens when we make it a more parsimonious model.

So how do we make the model as complex as possible? Rattle, when you start it, gives you some default values. What we will do is change these values to the most generous ones possible: a maximum depth of 30, a complexity of zero, a minimum bucket size of one, and a minimum split of zero. Basically, we are telling it: build as big a tree as you can, don't stop. Setting the minimum split, minimum bucket size, and complexity parameter to their minimum values, and the depth to its maximum value, allows you to build the most complex tree possible.

The error is pretty low on your training set; there is no question you are doing very well. Your average class error is 3.6 percent, so you are misclassifying very little spam. But on your validation set, your error rate is pretty high, at 17.85 percent.
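The "most generous" settings described above can be sketched outside Rattle as well. Here is a minimal illustration using scikit-learn's DecisionTreeClassifier as a stand-in for Rattle's rpart controls; since spam.csv is not reproduced here, a synthetic classification problem stands in for the spam data (an assumption), but the same overfitting pattern appears: near-zero training error, noticeably higher validation error.

```python
# Sketch of "build as big a tree as possible, don't stop": depth 30,
# zero complexity penalty, minimum bucket size of one, minimum split.
# Synthetic data stands in for spam.csv, which is not available here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

complex_tree = DecisionTreeClassifier(max_depth=30,        # very deep
                                      ccp_alpha=0.0,       # no complexity penalty
                                      min_samples_leaf=1,  # minimum bucket of one
                                      min_samples_split=2, # smallest legal split
                                      random_state=0)
complex_tree.fit(X_train, y_train)

train_err = 1 - complex_tree.score(X_train, y_train)
val_err = 1 - complex_tree.score(X_val, y_val)
print(f"training error:   {train_err:.3f}")  # near zero -- the tree memorises
print(f"validation error: {val_err:.3f}")    # noticeably higher
```

The exact numbers depend on the data, but the gap between training and validation error is the signature of an over-complex model.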
So what should go through your head, when you ask how you can do so well on the training data and not so well on the validation data, is that maybe the model is too complex. So how do I reduce the complexity of the model? One thing I can do is just choose the default values in Rattle: a maximum tree depth of three, a complexity cutoff of 0.01, a minimum bucket size of seven, and a minimum split of 20. We create a new model. How does it do? Well, the error on the training set went up, as expected, because you have a less complex model; it went up from about three percent to 14 percent. But lo and behold, as expected, the validation set accuracy actually improved. So here is a simple tradeoff involving the complexity of the model. The complexity could come in two ways: in the tree you are creating, and also in the number of features you are using. So there are two ways of controlling complexity in this particular example.

Now, the next concept. The concept we just learned is how to control the complexity of the tree, or of whatever method you are using, to trade off between overfitting and underfitting. This tradeoff differs for each method, and you have to solve a few examples to get a feel for it. But in many cases it's pretty obvious: if the model is too complex, you are overfitting; if the complexity is very low, the error is very high.

Now, the second thing you worry about in classification is how to classify an object into a particular class. Typically, you want to assign it to the class with the highest probability. That covers the case of two classes; if there are multiple classes, you can construct a model which classifies one class against all the other classes, and then treat these two as binary classes. That is one simple way of doing it: red versus all the other colors, and then we decide whether it's red or not red. So how do we decide it falls in one category or the other?
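The "red versus all the other colors" idea can be sketched with a simple binary recode. The dataset and class names below are illustrative stand-ins (the iris data, not the lecture's example): one class is singled out, and everything else becomes "other", turning a multi-class problem into a binary one.

```python
# A minimal one-vs-rest sketch: recode a three-class problem as
# "class 0 versus everything else" and fit an ordinary binary classifier.
# The iris data here is an illustrative stand-in, not the lecture's dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1 if the observation is class 0 ("red"), 0 otherwise ("all other colors").
y_binary = (y == 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y_binary)

# Each prediction is now a simple yes/no: is it class 0 or not?
print(clf.predict(X[:5]))
```

Repeating this once per class gives one binary model per class, which is the standard one-vs-rest construction.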
The default is a cutoff value of 50 percent. That means: assign the object to that class whose probability is greater than 50 percent. To demonstrate this, I'm going to use a split of 99:01; later on I will change it back. In this example, for now only, I'm using 99 percent of the data for training, because I want to visually show you what we can do, and I'm holding back one percent of the data. So here it is: I've fit the model, and I've asked it to provide a score of the probabilities, because I want to show you what happens as I change the probability cutoff value. We are classifying the same spam data. Remember, I'm training on 99 percent of the data and the validation set is one percent, so the size of the validation set is pretty small: it has 30 plus 17, that is, 47 values.

This is the prediction which Rattle saves for you, in the file which you saw. When you ask it to save the probabilities, it creates a file called spam validate score. I put it into Excel so that we can play with it, and I've automated it. Initially, I'm going to put in a cutoff value of 50 percent. What does that mean? It means that if the predicted probability, and this is the probability that is being predicted, is more than 50 percent, I'm going to say yes, it is spam; if it is less than 50 percent, I'm going to say it is not spam. You can see it makes four mistakes: in this case, in these two cases, and in this case. In these four cases the probability is less than 50 percent, so it says it's not spam. That's the error it's making here: the actual is yes, but the prediction is no, not spam. So out of the 30 it predicts as no, it has actually misclassified four of them. I hope you see that. What about the nos? Well, the nos are predicted wrong when the probability is more than 50 percent: the model says yes, it is spam, but really it's not spam. So look at it.
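Applying the 50 percent cutoff by hand looks like this. Rattle's saved score file is not available here, so a few made-up probability scores stand in for it; the numbers are illustrative only, but they show the same two error types the lecture points at.

```python
# Applying the default 50 percent cutoff to predicted spam probabilities.
# These actual labels and probabilities are invented for illustration;
# they are NOT the values from the lecture's score file.
import numpy as np

actual = np.array(["yes", "yes", "yes", "no", "no", "no", "no"])
prob_spam = np.array([0.92, 0.35, 0.48, 0.10, 0.62, 0.05, 0.20])

cutoff = 0.50
predicted = np.where(prob_spam > cutoff, "yes", "no")

# An actual "yes" below the cutoff is missed spam (a false negative);
# an actual "no" above the cutoff is a false alarm (a false positive).
false_neg = np.sum((actual == "yes") & (predicted == "no"))
false_pos = np.sum((actual == "no") & (predicted == "yes"))
print(predicted)
print("false negatives:", false_neg)  # 2
print("false positives:", false_pos)  # 1
```

In this toy table, two real spams fall below the cutoff and one non-spam falls above it, which mirrors the four-plus-one mistake pattern described in the lecture's validation set.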
Actually, the nos start somewhere here, and it has predicted this particular value, 4204, as a yes. You can see the error it has made here: it predicted this is spam, but actually it is not spam. So it has made five mistakes in total, and you can compute the error rate, because it's 5 divided by 47. Now, you may be more interested in predicting one class correctly than the other. In this example, say I don't want the model to be aggressive in predicting spam, because I don't want to lose important emails. What should I do? I increase the cutoff value. If I increase the cutoff value to 80 percent, it predicts many of those which really are spam as non-spam. You can see that here: it has moved many predictions which are really spam over to non-spam, and it's getting almost half of them wrong. Out of the 20 which are spam, it gets only 11 of them right. But it's doing very well when it says something is non-spam; it's getting 100 percent accuracy as far as the non-spam is concerned. So you can see that by changing the cutoff value, I can increase or decrease the sensitivity to one of these classes, and there are nice terms that describe this tradeoff. What I want you to take away from this is: by changing the cutoff value, you can increase the sensitivity to the yes values or to the no values.
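The cutoff tradeoff can be sketched as a small sweep. The probabilities below are again made up for illustration, not the lecture's score file; sensitivity here means the fraction of actual spam caught, and specificity the fraction of actual non-spam correctly left alone, which are the standard names for the two quantities being traded off.

```python
# Sweeping the cutoff: a higher cutoff catches less spam (lower sensitivity)
# but makes fewer false alarms on non-spam (higher specificity).
# Labels and probabilities are invented for illustration.
import numpy as np

actual = np.array(["yes"] * 6 + ["no"] * 6)
prob_spam = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.45,
                      0.40, 0.35, 0.30, 0.20, 0.55, 0.05])

results = {}
for cutoff in (0.50, 0.80):
    predicted = np.where(prob_spam > cutoff, "yes", "no")
    sens = np.mean(predicted[actual == "yes"] == "yes")  # spam caught
    spec = np.mean(predicted[actual == "no"] == "no")    # non-spam left alone
    results[cutoff] = (sens, spec)
    print(f"cutoff {cutoff:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

At the 80 percent cutoff the toy model misses most of the spam but makes no mistakes on non-spam, which is the behaviour the lecture demonstrates: conservative spam labelling protects the important emails.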