0:00

Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include in your learning algorithm.

Or suppose you'd like to choose the regularization parameter, lambda, for a learning algorithm. How do you do that? These are called model selection problems, and in our discussion of how to do this, we'll talk about not just how to split your data into a train and test set, but how to split it into what we'll see is called the training, validation, and test sets.

We'll see in this video just what these things are and how to use them to do

model selection. We've already seen many times the problem of overfitting: just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis.

More generally, this is why the training set error is not a good predictor for how

well the hypothesis would do on new examples.

Concretely, if you fit some set of parameters, theta zero, theta one, theta

two, and so on, to your training set, then the fact that your hypothesis does

well on the training set doesn't mean much in terms of predicting

how well your hypothesis will generalize to new examples not seen in the training

set. And the more general principle is that once your parameters have been fit to some set of data, maybe the training set, maybe something else, then the error of your hypothesis measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now, let's consider the model selection

problem. Let's say, you're trying to choose what

degree polynomial to fit to data. Should you choose a linear function, a

quadratic function, a cubic function, all the way up to a tenth order polynomial?

So, it's as if there's one extra parameter in this algorithm, which I'm going to denote d: what degree of polynomial do you want to pick? That is, in addition to the theta parameters, there's one more parameter, d, that you're trying to determine using a data set. The first option, d equals one, means you fit a linear function; you could instead choose d equals two, d equals three, all the way up to d equals ten. So, we'd like to fit this extra parameter d as well.
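As a concrete sketch (my own construction, not code from the lecture), the degree-d hypothesis corresponds to mapping each input x to the features x, x^2, ..., x^d, for example in numpy:

```python
import numpy as np

def poly_features(x, d):
    """Map each scalar input x to the feature vector (x, x^2, ..., x^d).

    The intercept term theta_0 is left to the learning code
    (e.g. a prepended column of ones), matching
    h(x) = theta_0 + theta_1*x + ... + theta_d*x^d.
    """
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, d + 1)])

# Example: with d = 3, each x becomes (x, x^2, x^3).
X = poly_features([1.0, 2.0], d=3)
```

Choosing d then amounts to choosing how many columns this feature matrix has.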

And concretely, let's say that you want to choose a model, that is, choose a degree of polynomial,

choose one of these ten models, and fit that model and also get some estimate of

how well your fitted hypothesis would generalize to new examples.

Here's one thing you could do. You could first take your first model and minimize the training error, and this would give you some parameter vector theta. You could then take your second

model, the quadratic function and fit that to your training set and this will

give you some other parameter vector theta.

In order to distinguish between these different parameter vectors I'm going to

use a superscript one, superscript two there, where theta superscript one, just

means the parameters I get by fitting this model to my training data and theta

superscript two, just means the parameters I get by fitting this

quadratic function to my training data, and so on.

And by fitting a cubic model, I get parameters theta three up to, you know,

say, theta ten. And one thing you could do is take these parameters and look at the test set errors: I can compute, on my test set, J test of theta one, J test of theta two, J test of theta three, and so on.

So, I'm going to take each of my hypotheses with the corresponding

parameters and just measure their performance on the test set.

Now, in order to select one of these models, I could see which model has the lowest test set error.

And let's just say for this example, that I ended up choosing the fifth order

polynomial. So, this seems reasonable so far. But now let's say I take that hypothesis, this fifth order model, and ask, how well does this model generalize?

One thing I could do is look at how well my fifth order polynomial hypothesis had

done on my test set. But the problem is, this will not be a fair estimate of how

well my hypothesis generalizes. And the reason is,

what we've done is we fit this extra parameter d, that is this degree of

polynomial, and we fit that parameter d using the test set.

Namely, we chose the value of d that gave us the best possible performance on the

test set. And so, the performance of my parameter vector theta five on the test set is likely to be an overly optimistic estimate of generalization error. Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on that same test set. Having chosen the degree d of polynomial using the test set, my hypothesis is likely to do better on this test set than it would on new examples it hasn't seen before, and new examples are what we care about.
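To see this optimism concretely, here is a small simulation I've made up (the data, helper names, and numbers are all assumptions, not from the lecture): we repeatedly choose the degree d that minimizes the test error, then compare the test error we would report against the error on genuinely new examples.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(x, y, d):
    """Least-squares fit of a degree-d polynomial (coeffs for 1, x, ..., x^d)."""
    A = np.vander(x, d + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def cost(theta, x, y):
    """Squared-error cost (1/(2m)) * sum((h(x) - y)^2)."""
    A = np.vander(x, len(theta), increasing=True)
    r = A @ theta - y
    return r @ r / (2 * len(y))

def data(m):
    """Noisy linear data: degree 1 is 'right'; higher degrees fit noise."""
    x = rng.uniform(-1, 1, m)
    return x, 2.0 * x + rng.standard_normal(m)

reported, actual = [], []
for _ in range(200):
    xtr, ytr = data(100)     # training set
    xte, yte = data(20)      # small test set, (mis)used to choose d
    xnew, ynew = data(2000)  # stands in for truly new examples

    thetas = {d: fit(xtr, ytr, d) for d in range(1, 11)}
    best_d = min(thetas, key=lambda d: cost(thetas[d], xte, yte))
    reported.append(cost(thetas[best_d], xte, yte))  # what we'd report
    actual.append(cost(thetas[best_d], xnew, ynew))  # what really happens

# Choosing d on the test set makes the reported test error the minimum of
# ten noisy estimates, so on average it understates the true error.
mean_reported, mean_actual = float(np.mean(reported)), float(np.mean(actual))
```

Averaged over the repetitions, mean_reported comes out below mean_actual: selecting d on the test set makes that test error an optimistic estimate.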

So, just to reiterate, on the previous slide, we saw that if we fit some set of

parameters, you know, say, theta zero, theta one, to some training set, then the

performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because these parameters were fit to the training set, so they're likely to do well on the training set even if they don't do well on other examples. And in the procedure I just described on this slide, we did the same thing.

And specifically, what we did was we fit this parameter d to the test set,

and by having fit the parameter to the test set, this means that the performance

of the hypothesis on that test set may not be a fair estimate of how well the

hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, if we want to evaluate

a hypothesis, this is what we usually do instead.

Given the data set, instead of just splitting it into a train and test set,

what we are going to do is instead split it into three pieces, and the first piece

is going to be called the training set, as usual.

So, let me call this first part the training set. And the second piece of this data, I'm going to call the cross validation set,

7:02

And I'm going to abbreviate cross validation as CV. Sometimes, it's also called the

validation set instead of cross validation set.

And then the last part, I'm going to call my usual test set.

And a pretty typical ratio in which to split these things would be to send 60% of your data to your training set, maybe 20% to your cross validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio would be pretty typical.

And so, our training set will now be only maybe 60% of the data.

And our cross validation set or our validation set will have some number of

examples. I'm going to denote that M subscript CV.

So, that's the number of cross validation examples.

Following our earlier notational convention, I'm going to use

(x_cv^(i), y_cv^(i)) to denote the ith cross validation example.

And finally, we also have a test set over here with

M subscript test, being the number of test examples.
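A minimal sketch of that split (the 60/20/20 ratio is from the lecture; the shuffling and the function name are my own):

```python
import numpy as np

def train_cv_test_split(X, y, seed=0):
    """Shuffle the examples, then split them 60% / 20% / 20% into
    training, cross validation (CV), and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(0.6 * len(y))
    n_cv = int(0.2 * len(y))
    tr, cv, te = np.split(idx, [n_train, n_train + n_cv])
    return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])

# Example: 10 examples give m_train = 6, m_cv = 2, m_test = 2.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
(Xtr, ytr), (Xcv, ycv), (Xte, yte) = train_cv_test_split(X, y)
```

Shuffling before splitting matters when the data arrives in some sorted order; otherwise the three pieces would not be representative samples.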

So, now that we've defined the training, validation (or cross validation), and test sets, we can also define the training error, cross validation error, and test error. So, here's my training error, and I'm just

writing this as J subscript train of theta.

This is pretty much the same as the J of theta that we've been writing so far; it's just the error as measured on your training set. Then J subscript CV is my cross validation error, and it's pretty much what you'd expect: just like the training error, except measured on the cross validation data set. And here's my test set error, same as before.

So, when faced with a model selection problem like this, what we're going to do is, instead of using the test set to

select a model, we're instead going to use the validation set or the cross

validation set to select the model. Concretely, we're going to first take

our first hypothesis, take this first model, and say, minimize the cost

function. And this will give me some parameter

vector theta for the linear model. And as before, I'm going to put a superscript

one, just to denote that this is the parameter for the linear model.

We do the same thing for the quadratic model, get some parameter vector theta

two, get some parameter vector theta three, and so on, down to, say, the tenth order polynomial. And what I'm going to do is, instead of testing these hypotheses on the test set, I'm instead going to test them on the cross validation set. I'm going to measure J subscript CV to see how well each of these hypotheses does on my cross validation set.

And then, I'm going to pick the hypothesis with the lowest cross

validation error. So, for this example, let's say, for the

sake of argument, that it was my fourth order polynomial that had the lowest

cross validation error. So, in that case, I'm going to pick this

fourth order polynomial model. And finally,

what this means is that the parameter d, remember, d was the degree of polynomial, ranging over d = 2, d = 3, up to d = 10, what we've done is we've fit that parameter, saying d = 4, and we did so using the cross validation set. And so, this degree-of-polynomial parameter is no longer fit to the test set, and we've now saved away the test set: we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm.
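Putting the whole procedure together, here is a hedged sketch (the toy data, helper names, and least-squares fitting are my assumptions, standing in for "minimize the cost function"):

```python
import numpy as np

def fit_poly(x, y, d):
    """Fit theta for h(x) = theta_0 + theta_1*x + ... + theta_d*x^d
    by unregularized least squares on the given data."""
    A = np.vander(x, d + 1, increasing=True)  # columns: 1, x, ..., x^d
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def j_error(theta, x, y):
    """Squared-error cost (1/(2m)) * sum((h(x) - y)^2) on a data set."""
    A = np.vander(x, len(theta), increasing=True)
    r = A @ theta - y
    return r @ r / (2 * len(y))

# Toy data: a quadratic signal plus a little noise, split 60/20/20.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.05 * rng.standard_normal(200)
xtr, ytr = x[:120], y[:120]          # training set (60%)
xcv, ycv = x[120:160], y[120:160]    # cross validation set (20%)
xte, yte = x[160:], y[160:]          # test set (20%)

# Fit theta^(d) on the TRAINING set for each candidate degree d = 1..10,
# then pick the degree with the lowest CROSS VALIDATION error.
thetas = {d: fit_poly(xtr, ytr, d) for d in range(1, 11)}
best_d = min(thetas, key=lambda d: j_error(thetas[d], xcv, ycv))

# The test set played no part in choosing d, so this is a fair
# estimate of how well the selected model generalizes.
test_error = j_error(thetas[best_d], xte, yte)
```

Because d was chosen on the cross validation set, J test of the selected theta remains a fair estimate of generalization error.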

So, that was model selection and how you can take your data, put it into a

training, validation, and test set, and use your cross validation data to

select a model and evaluate it on the test set.

One final note. I should say that in machine learning as it is practiced today, there are many people who will do the thing I talked about earlier and said isn't such a good idea: selecting your model using the test set, and then reporting the error on that same test set as though it were a good estimate of generalization error. Unfortunately, many people do this, and if you have a massive test set, it's maybe not a terrible thing to do. But most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they have only a training set and a test set, and you will see some people do it. But if possible, I would recommend against doing that yourself.