Date: 2017-10-30
Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works. So, recall that our high bias, high variance, and "just right" pictures from our earlier video looked something like this. Now, let's say we're fitting a large and deep neural network. I know I haven't drawn this one too large or too deep, but let's say [INAUDIBLE] some neural network that is currently overfitting. So you have some cost function, J of W, b, equal to the sum of the losses, like so, right? And what we did for regularization was add this extra term that penalizes the weight matrices for being too large. And we said that was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting? One piece of intuition is that if you crank the regularization parameter lambda up to be really, really big, the optimization will be strongly incentivized to set the weight matrices, W, reasonably close to zero. So maybe it'll set the weights so close to zero for a lot of hidden units that it's basically zeroing out a lot of the impact of those hidden units. And if that's the case, then this much simplified neural network becomes a much smaller neural network. In fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that will take you from this overfitting case much closer to the other, high bias case on the left. But, hopefully, there'll be an intermediate value of lambda that gives a result closer to this "just right" case in the middle. So the intuition is that cranking up lambda to be really big sets W close to zero, although, in practice, this isn't exactly what happens.
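As a rough illustration of how a larger lambda pulls the weights toward zero, here is a minimal sketch. It swaps the neural network for plain linear ridge regression, because that has a closed-form solution; the function name `ridge_weights`, the synthetic data, and the list of lambda values are all assumptions made for this example and do not come from the lecture.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2 (linear model, not a neural net)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

# Fit with increasingly strong L2 penalties and record the size of the weights.
lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_weights(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:7.1f}  ||w||={norm:.4f}")
```

The printed norms shrink as lambda grows, but the weights never become exactly zero, which matches the point that the units are damped rather than removed.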
We can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network, one that gets closer and closer to just using logistic regression. The intuition of completely zeroing out a bunch of hidden units isn't quite right. It turns out that what actually happens is that the network still uses all the hidden units, but each of them has a much smaller effect. But you do end up with a simpler network, as if you had a smaller network that is, therefore, less prone to overfitting. So I'm not sure if this intuition helps, but when you implement regularization in the programming exercise, you'll actually see some of these variance reduction results yourself. Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this. This is g of z equals tanh of z. If that's the case, notice that so long as z is quite small, so if z takes on only a smallish range of values, maybe around here, then you're just using the linear regime of the tanh function. It is only if z is allowed to wander to larger values or smaller values, like so, that the activation function starts to become less linear. So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And if the weights, W, are small, then because z is equal to W times a, and then, technically, plus b, if W tends to be very small, z will also be relatively small. And in particular, if z ends up taking relatively small values, just in this little range, then g of z will be roughly linear. So it's as if every layer will be roughly linear, as if it is just linear regression.
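The "linear regime" of tanh can be checked numerically. The sketch below measures how far tanh(z) strays from the straight line g(z) = z on a narrow and on a wide interval, and then shows that scaling W down scales z down by the same factor when b is ignored. The helper name `max_linear_gap`, the interval limits, and the random W and a are assumptions chosen for illustration.

```python
import numpy as np

def max_linear_gap(z_limit, num_points=1001):
    """Largest |tanh(z) - z| over the interval [-z_limit, z_limit]."""
    z = np.linspace(-z_limit, z_limit, num_points)
    return float(np.max(np.abs(np.tanh(z) - z)))

small_gap = max_linear_gap(0.1)  # z confined to a narrow range around 0
large_gap = max_linear_gap(3.0)  # z allowed to wander further out

print(f"max |tanh(z) - z| for |z| <= 0.1: {small_gap:.6f}")
print(f"max |tanh(z) - z| for |z| <= 3.0: {large_gap:.6f}")

# Ignoring b: shrinking W by a factor shrinks z = W a by the same factor.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # hypothetical weight matrix
a = rng.normal(size=4)       # hypothetical activations from the previous layer
z_original = W @ a
z_shrunk = (0.01 * W) @ a
print("largest |z| with original W:", float(np.max(np.abs(z_original))))
print("largest |z| with 0.01 * W:  ", float(np.max(np.abs(z_shrunk))))
```

On the narrow interval the gap is tiny, so tanh behaves almost like the identity there; on the wide interval the gap is large because tanh saturates.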
And we saw in course one that if every layer is linear, then your whole network is just a linear network. And so even a very deep network with a linear activation function is, at the end of the day, only able to compute a linear function. So it's not able to fit those very complicated, very non-linear decision boundaries that allow it to really overfit to data sets, like we saw in the overfitting, high variance case on the previous slide, ok? So just to summarize, if the regularization parameter is very large, the parameters W will be very small, so z will be relatively small, kind of ignoring the effects of b for now. Or really, I should say, z takes on a small range of values. And so the activation function, if it's tanh, say, will be relatively linear. And so your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function, rather than a very complex, highly non-linear function. And so it is also much less able to overfit, ok? And again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects yourself. Before wrapping up our discussion on regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and we actually modified it by adding this extra term that penalizes the weights for being too large. And so if you implement gradient descent, one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that the cost function J decreases monotonically after every iteration of gradient descent. And if you're implementing regularization, then please remember that J now has this new definition.
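For reference, the new definition of J can be written out as follows. The transcript only says "sum of the losses" plus a Frobenius norm penalty, so the 1/m and lambda/(2m) scaling, the number of training examples m, and the number of layers L are assumptions that follow the usual convention for L2 regularization.

```latex
J\bigl(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}\bigr)
  = \underbrace{\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\bigl(\hat{y}^{(i)}, y^{(i)}\bigr)}_{\text{first term: the old definition of } J}
  + \underbrace{\frac{\lambda}{2m} \sum_{l=1}^{L} \bigl\lVert W^{[l]} \bigr\rVert_F^{2}}_{\text{second term: the regularization penalty}},
\qquad
\bigl\lVert W^{[l]} \bigr\rVert_F^{2} = \sum_{i} \sum_{j} \bigl(W^{[l]}_{ij}\bigr)^{2}
```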
If you plot the old definition of J, just this first term, then you might not see it decrease monotonically. So to debug gradient descent, make sure that you're plotting this new definition of J that includes this second term as well. Otherwise, you might not see J decrease monotonically on every single iteration. So that's it for L2 regularization, which is actually the regularization technique that I use the most in training deep learning models. In deep learning, there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
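Going back to the debugging tip, the sketch below records both versions of J during gradient descent. It is a simplified stand-in, not the course's programming exercise: it uses a single logistic regression unit instead of a deep network, and the names `costs` and `gradient_descent`, the synthetic data, the starting weights, and the values of lambda and the learning rate are all assumptions.

```python
import numpy as np

def costs(w, b, X, y, lam):
    """Return (data-only cost, full L2-regularized cost) for logistic regression."""
    m = X.shape[0]
    z = X @ w + b
    # Cross-entropy written stably as log(1 + e^z) - y * z.
    data_term = float(np.mean(np.logaddexp(0.0, z) - y * z))
    penalty = lam / (2 * m) * float(np.sum(w ** 2))
    return data_term, data_term + penalty

def gradient_descent(X, y, lam, lr=0.5, num_iters=100):
    """Run gradient descent on the regularized cost, recording both cost curves."""
    m, n = X.shape
    w = np.full(n, 3.0)  # deliberately large starting weights (an arbitrary choice)
    b = 0.0
    data_history, full_history = [], []
    for _ in range(num_iters):
        data_cost, full_cost = costs(w, b, X, y, lam)
        data_history.append(data_cost)
        full_history.append(full_cost)
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        dw = X.T @ (p - y) / m + (lam / m) * w  # gradient includes the penalty term
        db = float(np.mean(p - y))
        w = w - lr * dw
        b = b - lr * db
    return data_history, full_history

# Synthetic binary classification data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 * rng.normal(size=200) > 0).astype(float)

data_history, full_history = gradient_descent(X, y, lam=5.0)

full_is_monotone = all(b <= a + 1e-12 for a, b in zip(full_history, full_history[1:]))
print("regularized J never increases:", full_is_monotone)
print("first and last regularized J:", full_history[0], full_history[-1])
```

The curve to plot against the iteration index is `full_history`, since that is the quantity gradient descent is actually minimizing; `data_history` is kept only for comparison and is not guaranteed to fall on every step.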