0:00

Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include in your learning algorithm. Or suppose you'd like to choose the regularization parameter, lambda, for a learning algorithm. How do you do that? These are called model selection problems, and in our discussion of how to solve them, we'll talk about not just how to split your data into a train and test set, but how to split your data into what we'll discover is called the training, validation, and test sets.

We'll see in this video just what these things are and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters, theta zero, theta one, theta two, and so on, to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. And the more general principle is that once your parameters have been fit to some set of data, maybe the training set, maybe something else, then the error of your hypothesis measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now, let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to data. Should you choose a linear function, a quadratic function, a cubic function, all the way up to a tenth order polynomial? So, it's as if there's one extra parameter in this algorithm, which I'm going to denote d, which is the degree of polynomial you want to pick. In addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using a data set. The first option is d equals one, if you fit a linear function; we can choose d equals two, d equals three, all the way up to d equals ten.
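To make this step concrete, here's a small sketch in Python using NumPy, with made-up data and my own variable names (none of this is the lecture's code): we fit one hypothesis per candidate degree d.

```python
import numpy as np

# Made-up one-dimensional training data (hypothetical example).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(30)

# Fit one hypothesis per candidate degree d = 1, ..., 10.
# thetas[d] plays the role of theta superscript (d): the parameters
# obtained by minimizing the training error for that degree.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}
```

Here `np.polyfit` does the least-squares fit for us; a degree-d fit returns d + 1 coefficients.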

So, we'd like to fit this extra parameter, which I'm denoting by d. And concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do. You could first take your first model and minimize the training error, and this would give you some parameter vector theta. You could then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript one, superscript two, and so on, where theta superscript one just means the parameters I get by fitting the first model to my training data, and theta superscript two just means the parameters I get by fitting the quadratic function to my training data. By fitting a cubic model I get parameters theta superscript three, and so on, up to, say, theta superscript ten. One thing you could do then is take these parameters and look at the test set errors: I can compute, on my test set, J test of theta superscript one, J test of theta superscript two, J test of theta superscript three, and so on, up to J test of theta superscript ten. So, I'm going to take each of my hypotheses with the corresponding parameters and just measure their performance on the test set. Then, in order to select one of these models, I could see which model has the lowest test set error.
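Continuing the earlier sketch (again with hypothetical data and my own helper names), this selection-by-test-error step would look like the following; the lecture goes on to explain why reporting this same test error afterwards is biased.

```python
import numpy as np

def j_error(theta, x, y):
    """Mean squared error of the polynomial with coefficients theta."""
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2

# Hypothetical training and test data.
rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(30)
x_test = np.linspace(0, 1, 10)
y_test = np.sin(2 * np.pi * x_test) + 0.1 * rng.standard_normal(10)

# theta^(d) fit to the training set for each degree d.
thetas = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}

# J_test(theta^(d)) for each candidate degree d, then pick the lowest.
j_test = {d: j_error(thetas[d], x_test, y_test) for d in thetas}
best_d = min(j_test, key=j_test.get)  # degree with the lowest test error
```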

And let's just say for this example that I ended up choosing the fifth order polynomial. So, this seems reasonable so far. But now let's say I want to take my fifth order model and ask, how well does this model generalize? One thing I could do is look at how well my fifth order polynomial hypothesis did on my test set. But the problem is, this will not be a fair estimate of how well my hypothesis generalizes. And the reason is, what we've done is fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set. And so, the performance of my parameter vector theta superscript five on the test set is likely to be an overly optimistic estimate of the generalization error. Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set: I've chosen the degree d of polynomial using the test set, and so my hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before, which is what we care about.

So, just to reiterate: on the previous slide, we saw that if we fit some set of parameters, say theta zero, theta one, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because these parameters were fit to the training set, so they're likely to do well on the training set, even if they don't do well on other examples. And in the procedure I just described on this slide, we did the same thing. Specifically, we fit this parameter d to the test set, and by having fit the parameter to the test set, the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem, in a model selection setting, if we want to evaluate a hypothesis, this is what we usually do instead. Given the data set, instead of just splitting it into a train and test set, what we're going to do is split it into three pieces. The first piece is going to be called the training set, as usual. And the second piece of this data I'm going to call the cross validation set,

Â 7:02

cross validation. And I'm going to abbreviate cross validation as CV. Sometimes it's also called the validation set instead of the cross validation set. And then the last part I'm going to call my usual test set. A pretty typical ratio in which to split these would be to send 60% of your data to your training set, maybe 20% to your cross validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio would be pretty typical.
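A 60/20/20 split like this can be sketched in plain NumPy (a hypothetical example with made-up data; shuffling before splitting keeps each piece representative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))  # made-up feature matrix
y = rng.standard_normal(100)       # made-up targets

# Shuffle the indices, then split 60% / 20% / 20%.
m = len(X)
idx = rng.permutation(m)
n_train = int(0.6 * m)
n_cv = int(0.2 * m)

train_idx = idx[:n_train]
cv_idx = idx[n_train:n_train + n_cv]
test_idx = idx[n_train + n_cv:]

X_train, y_train = X[train_idx], y[train_idx]
X_cv, y_cv = X[cv_idx], y[cv_idx]          # cross validation set, m_cv examples
X_test, y_test = X[test_idx], y[test_idx]  # test set, m_test examples
```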

And so our training set will now be only maybe 60% of the data, and our cross validation set, or validation set, will have some number of examples, which I'm going to denote m subscript CV. So that's the number of cross validation examples. Following our earlier notational convention, I'm going to use (x(i)CV, y(i)CV) to denote the ith cross validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.

So, now that we've defined the training, validation (or cross validation), and test sets, we can also define the training error, cross validation error, and test error. Here's my training error, which I'm writing as J subscript train of theta. This is pretty much the same as the J of theta that we've been writing so far; it's just the error as measured on your training set. Then J subscript CV is my cross validation error, and it's pretty much what you'd expect: just like the training error, except measured on the cross validation data set. And here's my test set error, same as before.

So, when faced with a model selection problem like this, instead of using the test set to select a model, we're going to use the validation set, or cross validation set, to select the model. Concretely, we're going to first take our first hypothesis, this first model, and minimize the cost function. This will give me some parameter vector theta for the linear model, and as before, I'm going to put a superscript one just to denote that this is the parameter for the linear model. We do the same thing for the quadratic model and get some parameter vector theta superscript two, get some parameter vector theta superscript three, and so on, down to, say, the tenth order polynomial. And instead of testing these hypotheses on the test set, I'm going to test them on the cross validation set: I'm going to measure J subscript CV to see how well each of these hypotheses does on my cross validation set. Then I'm going to pick the hypothesis with the lowest cross validation error. So, for this example, let's say, for the sake of argument, that it was my fourth order polynomial that had the lowest cross validation error. In that case, I'm going to pick this fourth order polynomial model.

And finally, what this means is that, for that parameter d, remember d was the degree of polynomial, so d equals two, d equals three, up to d equals ten, what we've done is fit that parameter d, we have said d equals four, and we did so using the cross validation set. And so this degree of polynomial parameter is no longer fit to the test set. We've saved away the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm.
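Putting the whole procedure together, here's a minimal end-to-end sketch (my own hypothetical data and helper names, not the lecture's code): fit each degree on the training set, pick d by cross validation error, and only then estimate the generalization error once on the test set.

```python
import numpy as np

def j_error(theta, x, y):
    """Mean squared error of the polynomial with coefficients theta."""
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

# 60/20/20 split into training / cross validation / test sets.
idx = rng.permutation(100)
tr, cv, te = idx[:60], idx[60:80], idx[80:]

# Fit theta^(d) on the training set only, for d = 1, ..., 10.
thetas = {d: np.polyfit(x[tr], y[tr], deg=d) for d in range(1, 11)}

# Select d by cross validation error, not test error.
j_cv = {d: j_error(thetas[d], x[cv], y[cv]) for d in thetas}
best_d = min(j_cv, key=j_cv.get)

# The test set was never used for selection, so this is a
# fair estimate of the chosen model's generalization error.
j_test = j_error(thetas[best_d], x[te], y[te])
```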

So, that was model selection and how you can take your data, split it into a training, validation, and test set, and use your cross validation data to select a model and evaluate it on the test set.

One final note. I should say that, in machine learning as it is practiced today, there are many people who will do that earlier thing I talked about, the one that isn't such a good idea: selecting your model using the test set, and then using that same test set to report the error, as though selecting your degree of polynomial on the test set and then reporting the error on the test set gave a good estimate of generalization error. Unfortunately, many people do follow that sort of practice, and if you have a massive, massive test set, it's maybe not a terrible thing to do. But most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so you only have a training set and a test set. And while that's not considered good practice, you'll see some people do it. But if possible, I would recommend against doing that yourself.
