In this video, I'd like to convey to you the main intuitions behind how regularization works, and we'll also write down the cost function that we'll use when we're using regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. And if you do the programming exercises after this, you'll get a chance to see regularization in action for yourself. So, here's the intuition. In the previous video, we saw that if we were to fit a quadratic function to this data, it would give us a pretty good fit to the data. Whereas if we were to fit an overly high-order polynomial, we'd end up with a curve that may fit the training set very well, but that overfits the data and doesn't generalize well. Consider the following. Suppose we were to penalize the parameters theta3 and theta4 and make them really small. Here's what I mean. Here's our optimization objective, an optimization problem where we minimize our usual squared-error cost function. Let's say I take this objective and I modify it by adding 1000 * theta3^2 + 1000 * theta4^2, where 1000 is just standing in for some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta3 and theta4 are small, right? Because otherwise, with 1000 times theta3 squared, this new cost function is going to be big. So when we minimize this new function, we're going to end up with theta3 close to zero and theta4 close to zero. And that's as if we're getting rid of these two terms over there. And if we do that, if theta3 and theta4 are close to zero, then we're basically left with a quadratic function.
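The effect of adding a huge penalty on theta3 and theta4 can be checked numerically. Here's a minimal sketch, not from the lecture itself: the data, the true coefficients, and the penalty weight 1000 are all illustrative assumptions. Adding 1000 * theta3^2 + 1000 * theta4^2 to a sum-of-squares objective is equivalent to appending two weighted pseudo-observations to the least-squares problem:

```python
import numpy as np

# Hypothetical data: a noisy quadratic, to which we fit a 4th-order polynomial.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(x.size)

# Design matrix with columns 1, x, x^2, x^3, x^4.
X = np.vander(x, 5, increasing=True)

# Adding 1000*theta3^2 + 1000*theta4^2 to the squared-error objective is the
# same as appending the pseudo-observations sqrt(1000)*theta3 = 0 and
# sqrt(1000)*theta4 = 0 to the least-squares problem.
penalty_rows = np.sqrt(1000.0) * np.eye(5)[3:5]
X_aug = np.vstack([X, penalty_rows])
y_aug = np.concatenate([y, np.zeros(2)])

theta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
print(theta)  # theta[3] and theta[4] are driven close to zero
```

The minimizer keeps theta3 and theta4 near zero, so the fitted curve is essentially the quadratic the data came from.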
And so we end up with a fit to the data that's a quadratic function, plus maybe tiny contributions from the terms theta3 and theta4, which are very close to zero. So we end up with, essentially, a quadratic function, which is good, because it's a much better hypothesis. In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here's the idea behind regularization. The idea is that if we have small values for the parameters, then having small values for the parameters will usually correspond to having a simpler hypothesis. So in our last example, we penalized just theta3 and theta4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to zero, as in this example, that gave us a quadratic function. More generally, it's possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore less prone to overfitting. I realize that the reasoning for why having all the parameters be small gives us a simpler hypothesis may not be entirely clear to you right now, and it is kind of hard to explain unless you implement it and see it for yourself. But I hope that the example of having theta3 and theta4 be small, and how that gave us a simpler hypothesis, at least gives some intuition as to why this might be true. Let's look at a specific example.
For housing price prediction, we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And unlike the polynomial example, we don't know, right? We don't know that theta3 and theta4 are the high-order polynomial terms. If we have just a set of 100 features, it's hard to pick in advance which ones are less likely to be relevant. So we have 100, or 101, parameters, and we don't know which parameters to pick to try to shrink. So in regularization, what we're going to do is take our cost function, here's my cost function for linear regression, and modify it to shrink all of my parameters, because I don't know which one or two to try to shrink. I'm going to modify my cost function to add a term at the end, like so, and we add square brackets here as well. We're going to add an extra regularization term at the end, to shrink every single parameter, and so this term will tend to shrink all of my parameters theta1, theta2, theta3, up to theta100. By the way, by convention, the summation here starts from one, so I'm not actually going to penalize theta0 for being large. It's a convention that the sum is from j equals one through n, rather than j equals zero through n. But in practice it makes very little difference: whether you include theta0 or not will make very little difference in the results. By convention, though, we usually regularize only theta1 through theta100. Writing down the regularized optimization objective, or regularized cost function, again, here it is, here is J of theta, where this term on the right is the regularization term, and lambda here is called the regularization parameter.
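The regularized cost function just described is J(theta) = (1/(2m)) * [ sum over i of (h(x_i) - y_i)^2 + lambda * sum over j = 1..n of theta_j^2 ]. A minimal sketch in NumPy, with function and variable names of my own choosing, assuming a design matrix X whose first column is all ones:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta).

    X is an (m, n+1) design matrix whose first column is all ones,
    y is the (m,) target vector, and lam is the regularization
    parameter lambda.  By convention theta[0] is NOT penalized.
    """
    m = y.size
    residuals = X @ theta - y
    # Squared-error term plus lambda times the sum of theta_1..theta_n squared.
    return (residuals @ residuals + lam * (theta[1:] @ theta[1:])) / (2 * m)

# Tiny usage example: two points on the line y = x, hypothesis theta = [0, 1].
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0])
print(regularized_cost(np.array([0.0, 1.0]), X, y, 0.0))  # perfect fit, no penalty
print(regularized_cost(np.array([0.0, 1.0]), X, y, 4.0))  # penalty term alone
```

With lambda = 0 this reduces to the usual squared-error cost; any lambda > 0 adds a price for large parameter values.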
And what lambda does is control a trade-off between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training set well. And the second goal, captured by the second term, the regularization term, is that we want to keep the parameters small. What lambda, the regularization parameter, does is control the trade-off between these two goals: between the goal of fitting the training set well, and the goal of keeping the parameters small and therefore keeping the hypothesis relatively simple, to avoid overfitting. For our housing price prediction example, whereas previously, with a very high-order polynomial, we may have wound up with a very wavy or curvy function like this, if you still fit a high-order polynomial with all the polynomial features in there, but instead just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler. Maybe a curve like the magenta line, which gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement this algorithm yourself with regularization, you will be able to see the effect firsthand. In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is that we will end up penalizing the parameters theta1, theta2, theta3, theta4 very highly. That is, if our hypothesis is this one down at the bottom, and we penalize theta1, theta2, theta3, theta4 very heavily, then we end up with all of these parameters close to zero.
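The trade-off lambda controls can be seen by sweeping it on a small example. This is my own illustration, not from the lecture: the sine-shaped data, the 9th-order polynomial, and the particular lambda values are assumptions. As lambda grows, the parameter norm shrinks and the training error rises:

```python
import numpy as np

# Hypothetical data: a noisy sine, fit with 9th-order polynomial features.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
X = np.vander(x, 10, increasing=True)  # columns 1, x, ..., x^9

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/2m)[||X theta - y||^2 + lam * sum theta_j^2],
    leaving theta_0 (the intercept) unpenalized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0  # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

for lam in (0.001, 1.0, 100.0):
    theta = ridge_fit(X, y, lam)
    train_mse = np.mean((X @ theta - y) ** 2)
    print(f"lambda={lam:7.3f}  ||theta_1..n||={np.linalg.norm(theta[1:]):8.3f}  "
          f"train MSE={train_mse:.4f}")
```

Small lambda favors fitting the training set; large lambda favors small parameters, which is exactly the trade-off described above.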
Right, theta1 will be close to zero, theta2 will be close to zero, and theta3 and theta4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis saying that housing prices are simply equal to theta0. That amounts to fitting a flat, horizontal straight line to the data, and this is an example of underfitting. In particular, this hypothesis, this straight line, just fails to fit the training set well: it's a flat line that doesn't go anywhere near most of our training examples. Another way of saying this is that the hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta0, and despite the clear data to the contrary, it chooses to fit a flat horizontal line to the data (I didn't draw that very well). So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well. When we talk about model selection later in this course, we'll talk about a variety of ways of automatically choosing the regularization parameter lambda. So, that's the idea behind regularization and the cost function we'll use in order to apply it. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can get them to avoid overfitting problems.
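The underfitting caused by a very large lambda can also be checked numerically. Here's a small sketch of my own, assuming the same ridge normal equations with an unpenalized intercept: with lambda huge, theta1 through theta4 collapse to essentially zero and the hypothesis degenerates to the flat line h(x) = theta0:

```python
import numpy as np

# Hypothetical data that clearly increases, fit with a 4th-order polynomial.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x, x**2, x**3, x**4])

lam = 1e8  # absurdly large regularization parameter
D = np.eye(5)
D[0, 0] = 0.0  # theta_0 is not penalized, by convention
theta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
print(theta)  # theta[1:] essentially zero; theta[0] near the mean of y
```

Despite the clearly increasing data, the fitted hypothesis is a horizontal line at roughly the mean of y, which is exactly the underfitting described above.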