In this video, we're going to talk about how we actually evaluate our networks. We've talked a lot about the different options we have for defining a network. We can make very shallow networks, we can use logistic regression, which is considered to be a fairly simple model, or we can use these deep neural networks, and we're going to talk about more and more complicated networks as this course goes on. But one of the fundamental questions we're going to have is: how do we actually evaluate how well these networks are going to work? We're going to start out with a question. We've talked about logistic regression, and we went through in depth what that network is doing: it takes all of our features, combines them in a linear fashion, and then converts the result into a probability. This is a fairly simple model, backed by decades of statistical theory on what it's actually doing and when it's going to work. On the right, we're showing a multilayer perceptron. This is a deep neural network, and it can do a lot more than logistic regression; it can capture many more properties than logistic regression ever could. Now, the question is, do we always want to do this? So, let's think about what happens as we increase the complexity. First, creating deep models really helps us learn complex relationships. What I'm showing here is a famous toy dataset called the two moons dataset. We have our red class on the top, which curls around, and our blue class on the bottom, which curls around. If we use something like logistic regression, it can't separate these two classes; it can never predict very well here because the model is just very limited in what it can capture. If we use a deep neural network, we can capture much more complicated relationships. What I'm showing here is, in red, the region where a multilayer perceptron has learned to predict the red class, and in blue, the region where it has learned to predict the blue class. A deep neural network, in this case a multilayer perceptron, can represent this boundary very accurately, and it predicts very well. But we can come up with a lot of cases where this type of complexity is not wanted. A deep model can give perfect performance on the training dataset and then fail completely when we go out into the real world, and that is really bad, so we need to validate the performance: is this neural network going to work as well as we think it is? So, how can we actually do this validation, and why do we need to do it? Before we jump into how we do the validation, let's think a little bit more about the concept of overfitting. Overfitting is what happens when our learned model increases in complexity and fits the observed training data too well. We predict extremely well on our training dataset, but then, when it comes to real data, we don't predict well. So, why does this happen? Let's start out with a really simple example. On the right here, I'm showing a collection of a few data points. We have a few observations, and we want to come up with a function that will predict what our observation will be given the value of x. So, what do we want to use to fit these example data points?
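As a quick aside, here is a minimal sketch of the two moons comparison just described. It assumes scikit-learn, which the video does not name, and the dataset size, noise level, and MLP architecture are illustrative choices rather than values from the video.

```python
# Logistic regression vs. a small multilayer perceptron on the two moons dataset.
# scikit-learn, the noise level, and the MLP size are assumptions for illustration.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Two interleaving half-circles: the red class and the blue class that curl around each other.
X, y = make_moons(n_samples=500, noise=0.15, random_state=0)

# A single linear boundary cannot separate the two curled classes well.
logreg = LogisticRegression().fit(X, y)
print("logistic regression accuracy on its own training data:", logreg.score(X, y))

# A small MLP can carve out the curved boundary and fits this data almost perfectly.
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X, y)
print("MLP accuracy on its own training data:", mlp.score(X, y))
```

With that aside out of the way, let's return to the simple curve-fitting example.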
One strategy we can use, from a very classical point of view, is to increase our polynomial order. On the left here, I'm showing a simple linear regression fit: we have our observations, we have x here, and this red line is the first order fit. If we get a new value of x, the red line tells us what our prediction is going to be. This does pretty well, and a lot of people would be very satisfied with it. But there's this question: can we increase the complexity a little bit to predict even better? So, let's look at what happens when we go to a third order fit, a cubic fit. In this cubic fit, we have x here, we have f of x, and you can see the curve we trace out. If we look at our observed data points, it actually fits a little bit better. Over the range plotted here, it seems pretty feasible that this could be what's happening. You might be a little bit worried about what's happening outside the visible range, so outside zero and four, where the curve could be going off and doing something strange. But inside this range, it actually seems to be doing really well. We can keep increasing the complexity, though. What I'm showing here is an eighth order fit to these data points. I think if you show this to anybody who is taking you seriously, they will stop taking you seriously, because this is not a legitimate prediction. We would never have enough information to say that this is good; it just looks ridiculous. I don't think anyone would argue that this is the model that's going to work best in reality. It fits our training data really well, but it will not predict future points well. So, what actually happens in overfitting? When we increase the depth of our neural network, which is what we've talked about so far, we increase the number of parameters in the model. Logistic regression has a relatively small number of parameters, and each time we add another layer to a multilayer perceptron, we get more and more parameters. But remember, we have to estimate these parameters. If we have more parameters to estimate, we're going to get each parameter a little bit more wrong, and all of those errors add up across the parameters. The second thing, which we actually really like, is that these more complex models or networks can learn more complex relationships. But maybe these relationships are too complex for reality: we can learn any complex relationship we want, but does that complex relationship actually happen? When we're overfitting, it means we're not going to generalize. We want our models and our analysis to generalize, which means that if we go out into the real world and say, "This is the model that we trained," and we now make predictions, we want those predictions to actually hold. We want this to work out in the real world. So, we want to know how well our network and our model fit are actually going to work in the real world. We had our training dataset, we used our training dataset, and we talked about how to set up our goal mathematically to get our parameters; now we want to ask, "How well is this actually going to work in the real world?" I want to now introduce a very standard validation strategy.
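To make the polynomial-order idea concrete, here is a minimal sketch using NumPy; the data points, noise level, and the range from zero to four are stand-ins for the plotted example, not the video's actual values.

```python
# Fitting the same handful of points with polynomials of increasing order, using NumPy.
# The data below are made up for illustration; the video's actual points are not given.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 10)                          # a few observed values of x
y = 0.5 * x + rng.normal(scale=0.3, size=x.size)   # roughly linear observations with noise

for order in (1, 3, 8):
    coeffs = np.polyfit(x, y, order)               # least-squares polynomial fit
    predictions = np.polyval(coeffs, x)
    train_mse = np.mean((predictions - y) ** 2)
    # The training error keeps shrinking as the order grows, even though the
    # eighth-order curve wiggles wildly and would predict future points poorly.
    print(f"order {order}: training MSE = {train_mse:.4f}")
```

Notice that the training error can only go down as the order increases; that improvement by itself says nothing about how well future points will be predicted.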
So, if our goal in the end is to understand how our network will perform in the real world, a good approach, and really the gold standard approach, is to actually try it in the real world. What can we do? We can take our training dataset, estimate our parameters, and then get new real-world data and say, "Let's use our network on this. Did we predict well?" We can estimate real-world performance in this way. This really is the gold standard: if you want to say that your network is working, this is the best thing you can do. However, it is extremely costly, so we don't want to run a new experiment every time we just want to validate whether our model is working. Instead, can we use existing data to estimate performance? The answer is yes. We're going to use our existing data intelligently to create ways of validating how well we think the model will work in the real world, based only on the data we've already collected; we don't want to have to go out and run a new experiment. So, we have all of our available data here, and what we're going to do is split this data into separate groups. We're going to define three groups of data, and I'm going to talk through what each of these groups is actually doing and what we use it for. For now, we're just going to have a training dataset, a validation dataset, and a testing dataset, shown here in different colors. We're going to take all of our available data, put some of it into our training dataset, some into a validation dataset, and some into a test dataset. We're just going to randomly assign the data to these different groups, and we're going to assign them in different proportions. We've already talked a lot about what the training dataset is used for: we use it to get our model parameters. So we already know what to do with the training dataset, as shown right here. But we've now created two additional datasets, a testing dataset and a validation dataset. Why have we done this, and why is it a good thing to do? First, the test set. The test set is a very standard practice in machine learning, and we want to create it prior to any analysis: we don't want to use the data at all before we create the test set, and the test set will never be used to learn or fit any parameters. After learning the network, we can evaluate its performance on the test set. The idea here is that this data was not included in the training or fitting. So, once we've trained or learned from our training dataset, we try the model on this test set, which we have never seen before; this is analogous to running a new experiment. It's a synthetic experiment, but it's analogous to running a new one. The big caveat is that the test set should ideally be used only once. If we reuse the test set, it's going to lead to bias, which means that our performance estimates will be optimistic.
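Here is a minimal sketch of the random three-way split just described, assuming scikit-learn's train_test_split; the 60/20/20 proportions are an illustrative choice, not ones prescribed in the video.

```python
# A sketch of randomly splitting all available data into training, validation,
# and test sets. scikit-learn and the 60/20/20 proportions are assumptions for illustration.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Stand-in for "all of our available data".
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# First carve off 60% for training, then split the remaining 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
# Result: 60% training, 20% validation, 20% test, assigned at random.
```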
Coming back to that bias: when we say the estimates are optimistic, it means that when we go out into the real world, we think our performance is going to be higher than it actually is, and this comes from the fact that we kept reusing our test dataset. To get around this problem, since we only want to use the test set once, we're going to create a validation set. The idea of the validation set is that we want to be able to compare which approach is best, and we can't really do that if we only use the test set once. We can't compare all of these approaches on the test set and just pick the best one, because then our test-set estimate would be overestimating performance. So, we create a second, held-out dataset that we call the validation dataset. The validation dataset is also never used to learn parameters, but it can be used repeatedly to estimate the performance of a model. The idea is that if we're going to try a lot of different network structures, we'll have our logistic regression model, our multilayer perceptron, and a number of other models we want to try. We can learn all of them and estimate their performance on the validation dataset. Once we've picked the model we want to use, we run the final evaluation on the test data, and we only run on the test data once. To visualize what's happening here: we have our training dataset, which we split out from all of our available data, and we learn our parameters from it. Then, using our validation dataset, we estimate the model's performance, and we use that performance estimate to refine our model or network. We change our architecture; we say, well, maybe we need more complexity, maybe we need less complexity, and we keep doing this until we think we know which model is best for our data. Once we have that, we break out of this loop, and we use our test dataset once to estimate our final performance. This is the structure we want to use when we're doing machine learning: we use our validation data to help us figure out which model to use, and our test data to estimate real-world performance.
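Finally, here is a minimal sketch of this whole loop, again assuming scikit-learn; the candidate models and their hyperparameters are illustrative, and in practice the architecture refinements would be guided by the validation scores rather than fixed up front.

```python
# A sketch of the train / validate / test workflow described above.
# The library and the candidate models are illustrative assumptions, not from the video.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Stand-in for "all of our available data", split 60/20/20 as in the earlier sketch.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Candidate models of increasing complexity (illustrative choices).
candidates = {
    "logistic regression": LogisticRegression(),
    "small MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "larger MLP": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
}

# Learn parameters on the training set; repeatedly estimate performance on the validation set.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)
print("validation accuracies:", val_scores)

# The test set is touched exactly once, after the model has been chosen.
print("chosen model:", best_name)
print("final test accuracy:", candidates[best_name].score(X_test, y_test))
```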