0:54

In this case, all of these parameters theta one, theta two, theta three, and so on will be heavily penalized. And so we end up with most of these parameter values being close to zero, and the hypothesis h of x will be roughly equal, or approximately equal, to theta zero. So we end up with a hypothesis that more or less looks like a flat, constant, straight line. This hypothesis has high bias, and it badly underfits this data set; a horizontal straight line is just not a very good model for this data set.

At the other extreme is if we have a very small value of lambda, such as lambda equal to zero. In that case, given that we're fitting a high-order polynomial basically without regularization, or with very minimal regularization, we end up with our usual high-variance, overfitting setting. Basically, if lambda is equal to zero, we're just fitting without regularization, so the hypothesis overfits. It's only if we have some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.

So, how can we automatically choose a good value for the regularization parameter lambda?

Just to reiterate, here's our model and here's our learning algorithm's objective. For the setting where we're using regularization, we define J train of theta to be something different: the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J train of theta to be the same as J of theta, the cost function. But when we're using regularization, with the extra term, we're going to define J train to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account the regularization term.

And similarly, we're also going to define the cross-validation set error and the test set error, as before, to be the average sum of squared errors on the cross-validation and test sets. So, just to summarize, my definitions of J train, J cv, and J test are just one half of the average squared error on my training, validation, and test sets, without the extra regularization term.
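As a concrete sketch of these definitions (assuming a linear hypothesis represented by a parameter vector theta and a design matrix X; the helper name is illustrative), the same unregularized error function serves as J train, J cv, and J test depending only on which data split you pass in:

```python
import numpy as np

def j_error(theta, X, y):
    """One half of the average squared error, WITHOUT the regularization term.

    The same function computes J_train, J_cv, or J_test;
    only the data split (X, y) passed in differs.
    """
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)
```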

So, this is how we can automatically choose the regularization parameter lambda.

What I usually do is have some range of values of lambda I want to try out. So, I might be considering not using regularization, and then here are a few values I might try: lambda equal to 0.01, 0.02, 0.04, and so on. I usually set these up in multiples of two, up until some larger value. If I were doing this strictly in multiples of two, I should end up with 10.24 instead of 10 exactly, but this is close enough, and the third or fourth decimal places won't affect your result that much. So this gives me maybe 12 different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or values larger than 10, but I've just truncated the range here for convenience.
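The grid of candidate values described here (lambda equal to zero, then 0.01 doubling up to 10.24) could be generated like so; this is just one way to build it:

```python
# 12 candidate regularization values: lambda = 0 (no regularization),
# then 0.01, 0.02, 0.04, ... doubling up to 10.24
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
```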

Given each of these 12 models, what we can do is then the following.

We can take the first model, with lambda equal to zero, and minimize my cost function J of theta, and this will give me some parameter vector theta. Similar to the earlier video, let me just denote this theta superscript one. Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta; let me denote that theta two. Similarly, I end up with theta three for my third model, and so on, until for the final model, with lambda set to 10, or 10.24, I end up with theta 12.

Next, I can take all of these hypotheses, all of these parameters, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, and so on, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set. Let's say, for the sake of this example, that I end up picking theta five, the fifth model, because that has the lowest cross-validation error.
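This selection procedure can be sketched in a few lines, assuming regularized linear regression solved in closed form via the normal equation (all helper names here are illustrative, not part of the lecture's notation):

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Regularized linear regression via the normal equation.
    By convention theta_0 (the intercept, first column of X) is not penalized."""
    n = X.shape[1]
    L = lam * np.eye(n)
    L[0, 0] = 0.0  # do not regularize theta_0
    return np.linalg.solve(X.T @ X + L, X.T @ y)

def cv_error(theta, X, y):
    """Average squared error (times 1/2) without the regularization term."""
    m = len(y)
    r = X @ theta - y
    return (r @ r) / (2 * m)

def select_lambda(X_train, y_train, X_cv, y_cv, lambdas):
    """Fit one model per candidate lambda, then pick the fit
    with the lowest cross-validation error."""
    thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]
    errors = [cv_error(th, X_cv, y_cv) for th in thetas]
    best = int(np.argmin(errors))
    return lambdas[best], thetas[best]
```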

Having done that, finally, what I would do if I want to report a test set error is to take the parameter vector theta five that I selected and look at how well it does on my test set. And once again, here it is as if we fit this parameter, lambda, to my cross-validation set, which is why I'm saving aside a separate test set that I'm going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples.

So that's model selection applied to selecting the regularization parameter lambda.

The last thing I'd like to do in this video is get a better understanding of how the cross-validation error and the training error vary as we vary the regularization parameter lambda. And so, just a reminder, that was our original cost function J of theta. But for this purpose, we're going to define the training error without using the regularization term, and the cross-validation error without using the regularization term.

Â 7:35

If lambda is small, then we run a larger risk of overfitting. Whereas if lambda is large, that is, if we're on the right part of this horizontal axis, then with a large value of lambda we run a higher risk of having a bias problem.

So, if you plot J train and J cv, what you find is that for small values of lambda, you can fit the training set relatively well, because you're not regularizing. For small values of lambda, the regularization term basically goes away, and you're just minimizing pretty much the squared error. So when lambda is small, you end up with a small value for J train; whereas if lambda is large, then you have a high bias problem, and you might not fit your training set well, so you end up with a value up there. So J train of theta will tend to increase as lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well, whereas a small value of lambda corresponds to being able to fit your training set well.

Â 8:51

Whereas over here on the right, if we have a large value of lambda, we may end up underfitting, and so this is the bias regime; the cross-validation error will be high. Let me just label that: that's J cv of theta, because with high bias we won't be doing well on the cross-validation set. Whereas here on the left, this is the high variance regime, where if we have too small a value of lambda, then we may be overfitting the data. And if we're overfitting the data, then the cross-validation error will also be high. So this is what the cross-validation error and the training error may look like on a training set as we vary lambda.

And so, once again, it will often be some intermediate value of lambda that is "just right", or that works best, in terms of having a small cross-validation error or a small test set error. Now, the curves I've drawn here are somewhat cartoonish and somewhat idealized; on a real data set, the curves you get may end up looking a little bit messier and a little bit noisier than this. For some data sets you will really see these sorts of trends, and by looking at a plot of the hold-out cross-validation error, you can either manually or automatically try to select a point that minimizes the cross-validation error, and select a value of lambda corresponding to low cross-validation error.
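These two curves can be produced numerically with a self-contained sketch like the one below (closed-form regularized linear regression on synthetic noisy data with high-order polynomial features; all names are illustrative):

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Regularized linear regression via the normal equation (theta_0 not penalized)."""
    L = lam * np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + L, X.T @ y)

def avg_sq_error(theta, X, y):
    """Half the average squared error, without the regularization term."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

# Noisy samples of a simple underlying trend, fit with degree-7 polynomial features
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = x + 0.3 * rng.standard_normal(30)
X = np.vander(x, 8, increasing=True)   # columns: 1, x, x^2, ..., x^7
X_train, y_train = X[::2], y[::2]      # even indices -> training set
X_cv, y_cv = X[1::2], y[1::2]          # odd indices  -> cross-validation set

lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
j_train = [avg_sq_error(fit_regularized(X_train, y_train, lam), X_train, y_train)
           for lam in lambdas]
j_cv = [avg_sq_error(fit_regularized(X_train, y_train, lam), X_cv, y_cv)
        for lam in lambdas]
# J_train grows as lambda grows (more bias); J_cv is typically U-shaped,
# with its minimum at some intermediate lambda.
```

Plotting `j_train` and `j_cv` against `lambdas` gives the curves described above, though on real data they are usually noisier than the idealized sketch.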

When I'm trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting a figure like the one shown here helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter lambda.

So hopefully, that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm. By now, you've seen bias and variance from a lot of different perspectives. What we'll do in the next video is take all of the insights we've gone through and build on them to put together a diagnostic called learning curves, which is a tool that I often use to try to diagnose whether a learning algorithm may be suffering from a bias problem, a variance problem, or a little bit of both.