0:00

Suppose you're left to decide what degree of polynomial to fit to a data set; that is, what features to include to give you a learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for a learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into train and test sets, but how to split your data into what are called the train, validation, and test sets. We'll see in this video just what these things are, and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters, say theta 0, theta 1, theta 2, and so on, to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. And the more general principle is that once your parameters have been fit to some set of data, maybe the training set, maybe something else, then the error of your hypothesis as measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, how well the hypothesis will generalize to new examples.

Now let's consider the model selection problem. Let's say you're trying to choose what degree of polynomial to fit to data. Should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial?

Â 1:51

So it's as if there's one extra parameter in this algorithm, which I'm going to denote d: the degree of polynomial you want to pick. So it's as if, in addition to the theta parameters, there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals 1, if you fit a linear function. You can choose d equals 2, d equals 3, all the way up to d equals 10. So we'd like to fit this extra sort of parameter, which I'm denoting by d. And concretely, let's say that you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples. Here's one thing you could do. You could first take your first model and minimize the training error, and this would give you some parameter vector theta. You could then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2, and so on, where theta superscript 1 just means the parameters I get by fitting the first model to my training data, and theta superscript 2 means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get theta superscript 3, and so on, up to theta superscript 10. And one thing we could do is take these parameters and look at the test set error. So I can compute, on my test set, J test of theta 1, J test of theta 2, and so on.
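Concretely, this procedure (fit one parameter vector per candidate degree on the training set, then score every fitted model on the test set) could be sketched as follows. The toy data and the use of NumPy's polynomial fitting are my own illustrative assumptions, not the lecture's code:

```python
import numpy as np

# Toy stand-ins for the training and test sets (assumed for illustration).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.1, 30)
x_test = rng.uniform(-1, 1, 30)
y_test = np.sin(2 * x_test) + rng.normal(0, 0.1, 30)

def mse(theta, x, y):
    """Mean squared error of a fitted polynomial (coefficient vector) on a data set."""
    return np.mean((np.polyval(theta, x) - y) ** 2)

# Fit theta^(1) ... theta^(10): one parameter vector per candidate degree d.
thetas = {d: np.polyfit(x_train, y_train, d) for d in range(1, 11)}

# Measure J_test(theta^(d)) for each fitted model.
j_test = {d: mse(thetas[d], x_test, y_test) for d in range(1, 11)}
```

Each `thetas[d]` plays the role of theta superscript d, and `j_test[d]` is its test set error.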

Â 3:53

So I'm going to take each of my hypotheses with the corresponding parameters and just measure the performance of each on the test set. Now, one thing I could do then, in order to select one of these models, is see which model has the lowest test set error. And let's just say, for this example, that I ended up choosing the fifth-order polynomial. So this seems reasonable so far. But now let's say I want to take my fifth hypothesis, this fifth-order model, and ask: how well does this model generalize?

Â 4:27

One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. The reason is that what we've done is fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set: namely, we chose the value of d that gave us the best possible performance on the test set. And so the performance of my parameter vector theta 5 on the test set is likely to be an overly optimistic estimate of generalization error. Because I fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set; I chose the degree d of polynomial using the test set. And so my hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before, and that's what I really care about. So just to reiterate: on the previous slide, we saw that if we fit some set of parameters, say theta 0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples. That's because these parameters were fit to the training set, so they're likely to do well on the training set, even if they don't do well on other examples. And in the procedure I just described on this slide, we did the same thing. Specifically, we fit this parameter d to the test set, and by having fit the parameter to the test set, the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem, in a model selection setting, if we want to evaluate a hypothesis, this is what we usually do instead. Given the data set, instead of just splitting it into a training and test set, what we're going to do is split it into three pieces. The first piece is going to be called the training set, as usual.

Â 6:54

And the second piece of this data I'm going to call the cross-validation set, which I'll abbreviate as cv. Sometimes it's also called the validation set instead of the cross-validation set. And then the last piece is going to be called the usual test set. A pretty typical ratio at which to split these things is to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio would be pretty typical. And so our training set will now be only maybe 60% of the data, and our cross-validation set, or our validation set, will have some number of examples, which I'm going to denote m subscript cv. So that's the number of cross-validation examples.

Â 7:52

Following our earlier notational convention, I'm going to use (x^(i)_cv, y^(i)_cv) to denote the i-th cross-validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.
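As a sketch of that 60/20/20 split, assuming a helper of my own naming (the shuffling step is also an assumption, in case the original ordering isn't random):

```python
import numpy as np

def split_data(x, y, seed=0):
    """Split a data set 60/20/20 into train, cross-validation, and test sets.

    Shuffles first (an assumption) so each piece is a random sample.
    """
    m = len(x)
    idx = np.random.default_rng(seed).permutation(m)
    m_train = int(0.6 * m)   # 60% -> training set
    m_cv = int(0.2 * m)      # 20% -> cross-validation set (m_cv examples)
    train = idx[:m_train]
    cv = idx[m_train:m_train + m_cv]
    test = idx[m_train + m_cv:]  # remaining 20% -> test set (m_test examples)
    return (x[train], y[train]), (x[cv], y[cv]), (x[test], y[test])
```

With 100 examples this yields 60 training, 20 cross-validation, and 20 test examples, with no example appearing in more than one piece.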

So, now that we've defined the training, validation (or cross-validation), and test sets, we can also define the training error, cross-validation error, and test error. So here's my training error, which I'm just writing as J subscript train of theta. This is pretty much the same as the J of theta that I've been writing so far; it's just the training set error, measured on the training set. Then J subscript cv is my cross-validation error; this is pretty much what you'd expect: just like the training error, except it's measured on the cross-validation data set. And here's my test set error, same as before.
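All three errors share one formula and differ only in which data set they're averaged over. A minimal sketch, assuming the course's 1/(2m) squared-error convention and polynomial hypotheses:

```python
import numpy as np

# One formula serves all three errors; only the data set passed in changes.
# Uses the 1/(2m) squared-error convention (an assumption about the course's cost).
def j_cost(theta, x, y):
    m = len(x)
    h = np.polyval(theta, x)  # hypothesis h_theta(x)
    return np.sum((h - y) ** 2) / (2 * m)

# J_train = j_cost(theta, x_train, y_train)
# J_cv    = j_cost(theta, x_cv,    y_cv)
# J_test  = j_cost(theta, x_test,  y_test)
```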

Â 8:49

So when faced with a model selection problem like this, what we're going to do is, instead of using the test set to select the model, use the validation set, or the cross-validation set, to select the model. Concretely, we're going to first take our first hypothesis, take this first model, and minimize the cost function, and this gives me some parameter vector theta for the linear model. As before, I'm going to put a superscript 1 just to denote that this is the parameter vector for the linear model. We do the same thing for the quadratic model and get some parameter vector theta 2, then some parameter vector theta 3, and so on, down to theta 10 for the 10th-order polynomial. And what I'm going to do is, instead of testing these hypotheses on the test set, test them on the cross-validation set, and measure J subscript cv to see how well each of these hypotheses does on my cross-validation set.

Â 9:53

And then I'm going to pick the hypothesis with the lowest cross-validation error. So for this example, let's say, for the sake of argument, that it was my 4th-order polynomial that had the lowest cross-validation error. In that case I'm going to pick this fourth-order polynomial model. And finally, what this means is that that parameter d, remember, d was the degree of polynomial, so d equals 2, d equals 3, all the way up to d equals 10, what we've done is fit that parameter d, and we'll say d equals 4, and we did so using the cross-validation set. And so this degree of polynomial, this parameter, is no longer fit to the test set. We've saved aside the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm.
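Putting the whole procedure together (fit each candidate degree on the training set, pick d by cross-validation error, and only then touch the test set once), a sketch with an illustrative function name of my own:

```python
import numpy as np

def select_degree(x_tr, y_tr, x_cv, y_cv, x_te, y_te, max_degree=10):
    """Fit each candidate degree on the training set, choose d by
    cross-validation error, then estimate generalization error on the
    (so far untouched) test set."""
    def mse(theta, x, y):
        return np.mean((np.polyval(theta, x) - y) ** 2)
    # theta^(1) ... theta^(max_degree), each fit to the training set only.
    fits = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, max_degree + 1)}
    best_d = min(fits, key=lambda d: mse(fits[d], x_cv, y_cv))  # argmin of J_cv
    gen_error = mse(fits[best_d], x_te, y_te)  # fair: test set unused until now
    return best_d, gen_error
```

Because `best_d` is chosen using only the cross-validation set, the final test set error is a fair estimate of how the selected model generalizes.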

So, that was model selection, and how you can take your data, split it into a training, validation, and test set, use your cross-validation data to select the model, and evaluate it on the test set.

Â 10:59

One final note: I should say that in machine learning as it's practiced today, there are many people who will do that earlier thing I talked about and said isn't such a good idea: selecting your model using the test set, and then reporting the error on that same test set as though it were a good estimate of generalization error. That sort of practice is unfortunately something many, many people do. If you have a massive test set, that's maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate train, validation, and test sets. I just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so you have only a training set and a test set. That's considered not as good practice, though you will see some people do it. But, if possible, I would recommend against doing that yourself.
