When we build a supervised learning model, whether for classification or regression, it's not enough for the model to make correct predictions on the examples it saw in the training data. If that were the only definition of success, then supervised learning algorithms could simply memorize the entire training set and always achieve 100 percent accuracy. Of course, that's not what happens in the real world. What we actually want from a supervised machine learning algorithm is the ability to predict a class or target value correctly on a test set of future examples that haven't been seen before. This ability to perform well on a held-out test set is the algorithm's ability to generalize.

How do we know that our trained supervised model will generalize well, in other words, perform well on an unseen test set? Typically, machine learning makes an important assumption: that the future test set has the same properties as the training set, or, more technically, is drawn from the same underlying distribution. This means that if we observe that our model has good accuracy on the training set, and the training and test sets are similar, we can also expect it to do well on the test set. Unfortunately, sometimes this doesn't happen, due to an important problem known as overfitting.

Informally, overfitting typically occurs when we try to fit a complex model with an inadequate amount of training data. An overfitting model uses its capacity to capture complex patterns to become very good at predicting lots and lots of specific data samples, or areas of local variation, in the training set. But it often misses the global patterns in the training set that would help it generalize well on the unseen test set. It can't see these more global patterns because, intuitively, there's not enough data to constrain the model to respect the global trends. As a result, the training set accuracy is a hopelessly optimistic indicator of the likely test set accuracy if the model is overfitting. Understanding, detecting, and avoiding overfitting is perhaps the most important aspect of applying supervised machine learning algorithms that you should master. There are serious real-world consequences to overfitting; the reading for this week contains one example from the medical world you might find interesting.
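In practice, the most direct way to detect overfitting follows from the discussion above: compare the model's accuracy on the training set with its accuracy on a held-out test set, and watch for a large gap. Here's a minimal sketch of that check in Python with scikit-learn, using a synthetic dataset and a k-nearest neighbors classifier; the dataset and parameter values are illustrative assumptions, not taken from the lecture's examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class dataset: a stand-in for any real training data.
X, y = make_classification(n_samples=200, n_features=2,
                           n_informative=2, n_redundant=0,
                           random_state=0)

# Hold out a test set so we can estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K=1 is prone to overfitting: it memorizes local variation.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

print('Training accuracy:', knn.score(X_train, y_train))  # 1.0: memorized
print('Test accuracy:    ', knn.score(X_test, y_test))    # typically lower
```

A large gap between training and test accuracy, rather than the training accuracy alone, is the signal that the model is overfitting.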
To understand the phenomenon of overfitting better, let's look at a few visual examples. The first example involves regression. In this chart, on the x-axis we have a single input variable, which might be, for example, the size of a piece of property, and on the y-axis we have a target variable, for example, the market selling price of a house that sits on that property. In regression, we're looking to find the relationship between the input variable and the target variable, and we start with the idea of a model we want to fit that might explain this relationship.

One model we could try would be a linear model, predicting a linear relationship between the input variable and the target variable, so the regression model fits a straight line to these points. In this case, the model has underfit the data: it's too simple for the actual trends that are present, and it doesn't even do well on the training points. (The blue points here represent the training points, the input to the training process for the regression.) When we underfit, we have a model that's too simple, one that doesn't even do well on the training data and thus is not at all likely to generalize well to test data.

On the other hand, if we pick a better model, for example, the idea that this might be a quadratic relationship, we can apply that and get what looks like a much better fit. This is an example of a reasonably good model fit: we've captured the general trend of the points while ignoring the small variations that likely exist due to noise.

As a third example, we might hypothesize that the relationship between the input variable and the target variable is a function with many more parameters, say a higher-degree polynomial, so something that's very bumpy. If we fit this more complex model to the same training data, we might end up with something that looks like this. Because this model has more parameters, it's able to capture more subtle behavior, but it also has much higher variance, as you can see: it's focused on capturing the local variations in the training data rather than finding the more global trend that we, as humans, can see in the data. This is an example of overfitting. In this case, there's not enough data to constrain the parameters of the model enough for it to recognize the global trend.
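This progression from underfitting to a good fit to overfitting is easy to reproduce in code by fitting polynomials of increasing degree to a small, noisy dataset. The sketch below is an illustrative Python version; the quadratic ground truth and the degree choices are assumptions made for the demo, not the lecture's actual chart.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

# A small noisy sample from an (assumed) quadratic relationship,
# e.g. property size vs. selling price.
X = rng.uniform(0, 10, size=(15, 1))
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(scale=5, size=15)

for degree in (1, 2, 9):
    # Expand x into polynomial features, then fit a linear model to them.
    X_poly = PolynomialFeatures(degree).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(f'degree {degree}: training R^2 = {model.score(X_poly, y):.3f}')
```

Typically, degree 1 underfits (a low R^2 even on the training data), degree 2 matches the underlying trend, and degree 9 overfits, scoring near-perfectly on the training points while producing a bumpy curve that would generalize poorly to new points.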
The second example looks at overfitting in classification. This diagram shows a simple two-dimensional classification problem, where each data instance is represented by a point and there are two features associated with each instance. It's a binary classification problem, so points are labeled either red or blue. The problem in classification is to find a decision boundary: as we saw with k-nearest neighbors, we want to take this feature space and find regions that are associated with the different labels. One type of classifier that we'll look at shortly, called a linear classifier, tries to find a decision boundary that's essentially a straight line through the space.

Just as in regression, we can have underfitting models, good-fitting models, and overfitting models. An example of an underfitting model would be something that didn't really look much at the data, maybe something like this that only looked at one feature. Again, this is an overly simplistic model: it doesn't even capture the division between the classes in the training data well, and so it would be unlikely to generalize. A reasonably good model would be a linear model that finds the general difference between the positive class over here and the negative class over here. You can see that it's aware of the global pattern of having most of the blue negative points in the upper left and most of the red positive points more toward the lower right. And it's robust in the sense that it ignores the occasional blue point in the red region or red point in the blue region; instead, it's found the more global separation between the two classes. An overfitting model, on the other hand, would typically be a model with lots of parameters, so it can capture complex behavior. It would try to do something very clever: completely separate the red points from the blue points, in a way that results in a highly variable decision boundary. This has the advantage, a questionable advantage, that it captures the training data classes very well; it predicts the training labels almost perfectly. But if the actual division between the classes really is captured by the linear model, the overfit model is going to make lots of mistakes in the regions where it's trying to be too perfect, in some sense. You'll see this typically with overfitting: the overfit model is highly variable, trying to capture too many of the local fluctuations, without enough data to see the global trend that would result in better overall generalization performance on new, unseen data.

The third example shows the effect of modifying the K parameter in the k-nearest neighbors classifier that we saw previously. The three plots shown here give the decision boundaries for K=10, K=5, and K=1, using the fruit dataset again, with the height of a piece of fruit on the x-axis and the width on the y-axis. In the K=10 case, as you recall, for each query point that we want a prediction for, say over here, we look at the 10 nearest points to the query point and average or combine their votes to predict the label for that point. So in the K=10 case, we need the votes of 10 different data instances in the training set to make our prediction. As we reduce K, say to K=5, we only need five neighbors to make a prediction: if the query point were here, we'd look at these four and then whichever of, say, this one or this one was closest. Finally, the K=1 case is the most unstable, in the sense that for any query point we look only at the single nearest neighbor to that point.

The effect of reducing K in the k-nearest neighbors classifier is to increase the variance of the decision boundary, because the boundary can be affected by outliers. A point that lies far away from the rest of its class has a much greater effect on the decision boundary in the K=1 case than in the K=10 case, where the votes of nine other neighbors are also needed. So by adjusting the value of K for the k-nearest neighbors classifier, we can control, in some sense, the degree of model fitting that's appropriate for a dataset. The actual value of K that works best can only be determined by evaluating on a test set. But the general idea is that as we decrease K for k-NN classifiers, we increase the risk of overfitting, because, for the reasons mentioned before, with K=1, for example, we're trying to capture very local changes in the decision boundary that may not lead to good generalization behavior on future data.
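That evaluate-on-a-test-set idea is easy to try directly: sweep K and compare training and test accuracy. The sketch below uses a synthetic two-feature dataset as a stand-in for the fruit dataset, which is specific to this course's materials; the sample size and noise level are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy two-class, two-feature data as a stand-in for the
# fruit dataset's height/width features.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f'K={k:2d}  train acc={knn.score(X_train, y_train):.3f}  '
          f'test acc={knn.score(X_test, y_test):.3f}')
```

The typical pattern matches the plots: K=1 gives perfect training accuracy but a jagged, high-variance boundary and lower test accuracy, while larger K smooths the boundary and usually generalizes better, until K becomes so large that the model starts to underfit.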