You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is affected by, the regularization of your learning algorithm.

Suppose we're fitting a high-order polynomial like the one I show here. But to prevent overfitting, we're going to use regularization, as shown here, so we have this regularization term to try to keep the values of the parameters small. And as usual, the regularization sum runs from j equals one rather than j equals zero, so theta zero is not penalized.

Let's consider two cases. The first is the case of a very large value of the regularization parameter lambda, such as lambda equal to 10,000, a huge value. In this case, all of these parameters theta one, theta two, theta three, and so on will be heavily penalized, and so we end up with most of these parameter values being close to zero. The hypothesis h of x will be roughly equal, or approximately equal, to theta zero, and so we end up with a hypothesis that more or less looks like a flat, constant straight line. So this hypothesis has high bias, and it badly underfits this dataset; a horizontal straight line is just not a very good model for this dataset.

At the other extreme is if we have a very small value of lambda, such as lambda equal to zero. In that case, given that we're fitting a high-order polynomial basically without regularization, or with very minimal regularization, we end up with our usual high-variance, overfitting setting. Basically, if lambda equals zero, we're just fitting without regularization, and so we overfit the hypothesis. And it's only if we have some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.

So, how can we automatically choose a good value for the regularization parameter lambda? Just to reiterate, here's our model, and here's our learning algorithm's objective. For the setting where we're using regularization, we define J train of theta to be something different: to be the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J train of theta to be the same as J of theta, the cost function. But when we're using regularization, with this extra term, we're instead going to define J train to be just my average squared error on the training set, without taking into account the regularization term. And similarly, we're also going to define the cross-validation set error and the test set error, as before, to be the average squared error on the cross-validation and test sets. So, just to summarize, my definitions of J train, J cv, and J test are just one half of the average squared error on my training, validation, and test sets, without the extra regularization term.
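To make those two definitions concrete, here is a minimal sketch in Python. It assumes a hypothesis of the form h(x) = X theta (for example, on polynomial features); the function names are illustrative, not code from the course.

```python
# A minimal sketch of the two cost definitions above, assuming a
# hypothesis of the form h(x) = X @ theta. Names are illustrative.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta): one half the average squared error, plus the
    regularization term; the sum skips j = 0, so theta_0 is not penalized."""
    m = len(y)
    squared_error = np.sum((X @ theta - y) ** 2)
    penalty = lam * np.sum(theta[1:] ** 2)
    return (squared_error + penalty) / (2 * m)

def unregularized_error(theta, X, y):
    """J_train / J_cv / J_test: the same one-half average squared error,
    but WITHOUT the regularization term."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))
```

The point of the second function is exactly what's said above: you minimize the regularized cost to fit theta, but you measure training, cross-validation, and test error without the penalty term.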
So this is how we can automatically choose the regularization parameter lambda. What I usually do is have some range of values of lambda that I want to try out. I might consider not using regularization at all, that is, lambda equal to zero, and then a few values such as lambda equal to 0.01, 0.02, 0.04, and so on. I usually step these up in multiples of two, until some larger value; if I were doing this exactly in multiples of two, I should end up with 10.24 instead of 10, but this is close enough, and the third or fourth decimal places won't affect the result that much. So this gives me maybe 12 different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or larger than 10, but I've just truncated the range here for convenience.

Given these 12 models, what we can do is the following. We can take the first model, with lambda equal to zero, and minimize my cost function J of theta, and this will give me some parameter vector theta; similar to the earlier video, let me denote this theta superscript one. Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta; let me denote that theta superscript two. For my third model I end up with theta superscript three, and so on, until for the final model, with lambda set to 10 (or 10.24), I get theta superscript 12.

Next, I can take all of these hypotheses, all of these parameter vectors, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, and so on, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set. Let's say, for the sake of this example, that I end up picking theta superscript five, the fifth model, because that has the lowest cross-validation error.

Having done that, finally, if I want to report the test set error, I take the parameter vector theta five that I selected and look at how well it does on my test set. And once again, it is as if I fit this parameter lambda to my cross-validation set, which is why I'm setting aside a separate test set that I'm going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples. So that's model selection applied to selecting the regularization parameter lambda.
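Here is a hedged sketch of that whole sweep on synthetic data. The closed-form regularized normal equation stands in for whatever routine you use to minimize the regularized cost, and the toy dataset and names are assumptions for illustration, not the course's code.

```python
# A sketch of the lambda sweep described above, on synthetic data.
# fit_model is an illustrative stand-in for minimizing the regularized cost;
# here it is the closed-form regularized normal equation.
import numpy as np

def poly_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_model(X, y, lam):
    n = X.shape[1]
    reg = lam * np.eye(n)
    reg[0, 0] = 0.0  # theta_0 is not penalized
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def error(theta, X, y):
    """One-half average squared error, without the regularization term."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = x ** 2 + 0.1 * rng.standard_normal(60)  # toy quadratic target
X = poly_features(x, 8)                     # high-order polynomial features

# Split into training, cross-validation, and test sets.
X_tr, y_tr = X[:30], y[:30]
X_cv, y_cv = X[30:45], y[30:45]
X_te, y_te = X[45:], y[45:]

# Twelve candidate values: 0, then 0.01 doubled up through 10.24.
lambdas = [0] + [0.01 * 2 ** i for i in range(11)]

# One parameter vector theta^(i) per candidate lambda, fit on the training
# set, then scored on the cross-validation set without the penalty term.
thetas = [fit_model(X_tr, y_tr, lam) for lam in lambdas]
cv_errors = [error(th, X_cv, y_cv) for th in thetas]

best = int(np.argmin(cv_errors))  # pick the model with lowest CV error
print(f"picked lambda = {lambdas[best]:.2f}, "
      f"test error = {error(thetas[best], X_te, y_te):.4f}")
```

Notice that the cross-validation set is used only to pick among the 12 candidates, and the test set is touched only once, at the end, to report generalization error.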
The last thing I'd like to do in this video is get a better understanding of how the cross-validation error and the training error vary as we vary the regularization parameter lambda. So, just as a reminder, that was our original cost function J of theta; but for this purpose, we're going to define the training error without the regularization term, and the cross-validation error without the regularization term. And what I'd like to do is plot this J train and plot this J cv, meaning: how well does my hypothesis do on the training set, and how well does my hypothesis do on the cross-validation set, as I vary my regularization parameter lambda? As we saw earlier, if lambda is small, then we're not using much regularization, and we run a larger risk of overfitting; whereas if lambda is large, that is, if we're toward the right of this horizontal axis, then with a large value of lambda we run a higher risk of having a bias problem.

So, if you plot J train and J cv, what you find is that for small values of lambda you can fit the training set relatively well, because you're not regularizing. For small values of lambda, the regularization term basically goes away, and you're just minimizing pretty much the squared error. So when lambda is small, you end up with a small value for J train; whereas if lambda is large, then you have a high-bias problem, and you might not fit your training set well, so you end up with a value up there. So J train of theta will tend to increase as lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well; whereas a small value of lambda corresponds to being able to freely fit fairly high-degree polynomials to the data, let's say.

As for the cross-validation error, we end up with a figure like this. Over here on the right, if we have a large value of lambda, we may end up underfitting, and so this is the bias regime, and the cross-validation error will be high. Let me just label that: that's J cv of theta, because with high bias we won't be doing well on the cross-validation set. Whereas here on the left, this is the high-variance regime, where if we have too small a value of lambda, then we may be overfitting the data; and if we're overfitting the data, then the cross-validation error will also be high. So this is what the cross-validation error and the training error may look like as we vary lambda. And once again, it will often be some intermediate value of lambda that is, quote, "just right," or that works best, in terms of having a small cross-validation error or a small test set error.

Now, the curves I've drawn here are somewhat cartoonish and somewhat idealized, so on a real dataset the curves you get may end up looking a little bit messier and a little bit noisier than this. For some datasets you will really see these sorts of trends, and by looking at the plot of the cross-validation error, you can, either manually or automatically, try to select a point that minimizes the cross-validation error, and select the value of lambda corresponding to that low cross-validation error. When I'm trying to pick the regularization parameter lambda for a learning algorithm, I often find that plotting a figure like the one shown here helps me understand better what's going on, and helps me verify that I am indeed picking a good value for the regularization parameter lambda (see the short plotting sketch at the end of this transcript).

So hopefully, that gives you more insight into regularization and its effect on the bias and variance of a learning algorithm. By now, you've seen bias and variance from a lot of different perspectives. What we'll do in the next video is take all of the insights that we've gone through and build on them to put together a diagnostic called learning curves, which is a tool that I often use to try to diagnose whether a learning algorithm may be suffering from a bias problem, or a variance problem, or a little bit of both.
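As promised above, here is a short sketch of how you might produce the J train / J cv versus lambda plot. It continues the earlier sweep sketch, reusing its variables (thetas, lambdas, cv_errors, error, X_tr, y_tr); matplotlib is an added assumption, not course code.

```python
# Plot J_train and J_cv against lambda, continuing the sweep sketch above.
import matplotlib.pyplot as plt

train_errors = [error(th, X_tr, y_tr) for th in thetas]

plt.plot(lambdas, train_errors, marker="o", label="J_train")
plt.plot(lambdas, cv_errors, marker="s", label="J_cv")
plt.xscale("symlog", linthresh=0.01)  # lambda spans 0 up to 10.24
plt.xlabel("regularization parameter lambda")
plt.ylabel("error (without the regularization term)")
plt.legend()
plt.show()
```

On this kind of plot you would expect J train to rise as lambda grows (high bias on the right) and J cv to be high at both extremes, with its minimum at some intermediate lambda, as described in the lecture.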