0:00

Why does regularization help with overfitting?

Why does it help reduce variance problems?

Let's go through a couple of examples to gain some intuition about how it works.

So, recall these pictures of high bias and high variance

from our earlier video, which look something like this.

Now, let's say you fit a large and deep neural network.

I know I haven't drawn this one too large or too deep,

but think of some neural network that is currently overfitting.

So you have some cost function like J of W,

b equal to the sum of the losses.

So what we did for regularization was add

this extra term that

penalizes the weight matrices for being too large.

So that was the Frobenius norm.
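As a concrete sketch of that modified cost (the function and variable names here are my own, not from the lecture), the regularized J adds the scaled sum of squared entries of every weight matrix to the ordinary loss:

```python
import numpy as np

def l2_regularized_cost(data_loss, weights, lambd, m):
    """Cost J with the L2 (Frobenius norm) penalty added.

    data_loss: average loss over the m training examples
    weights: list of weight matrices W[1], ..., W[L]
    lambd: the regularization parameter lambda
    m: number of training examples
    """
    # Frobenius penalty: sum of squares of every entry of every W
    frobenius_penalty = sum(np.sum(np.square(W)) for W in weights)
    return data_loss + (lambd / (2 * m)) * frobenius_penalty
```

With a single 2x2 weight matrix of all ones, the penalty term is `(lambd / (2 * m)) * 4`, so for example `lambd = 2` and `m = 4` adds exactly 1.0 to the data loss.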

So why is it that shrinking the L2 norm, or

the Frobenius norm, of the parameters might cause less overfitting?

One piece of intuition is that if you

crank the regularization parameter lambda up to be really, really big,

you'll be really incentivized to set

the weight matrices W to be reasonably close to zero.

So one piece of intuition is that maybe it sets the weights so close to zero for

a lot of hidden units that it's basically zeroing

out the impact of many of these hidden units.

And if that's the case,

then this much simplified neural network becomes a much smaller neural network.

In fact, it is almost like a logistic regression unit,

but stacked multiple layers deep.

And so that would take you from

this overfitting case much closer to the high bias case on the left.

But hopefully there'll be an intermediate value of lambda that

gives you a result closer to this just-right case in the middle.

So the intuition is that by cranking up lambda to be

really big, it will set W close to zero,

although in practice this isn't exactly what happens.

We can think of it as zeroing out, or at least reducing,

the impact of a lot of the hidden units, so you end up

with what might feel like a simpler network

that gets closer and closer to just being logistic regression.

The intuition of completely zeroing out a bunch of hidden units isn't quite right.

It turns out that the network will still use all the hidden units,

but each of them will just have a much smaller effect.

You do end up with an effectively simpler network, and

a smaller network is therefore less prone to overfitting.
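To see the shrinking effect numerically, here is a tiny sketch (all values below are made up for illustration): the L2 penalty contributes `(lambda / m) * W` to the gradient, so each gradient step multiplies the weights by a factor slightly less than one, which drives them toward zero when lambda is large:

```python
import numpy as np

# Hypothetical hyperparameters and weights, chosen only to illustrate
# the shrinking effect of a large lambda.
alpha, lambd, m = 0.1, 50.0, 100
W = np.array([[1.0, -2.0],
              [0.5, 3.0]])

for _ in range(100):
    dW_from_loss = np.zeros_like(W)      # pretend the data gradient is zero
    dW = dW_from_loss + (lambd / m) * W  # L2 penalty adds (lambda/m) * W
    W = W - alpha * dW                   # each step scales W by (1 - alpha*lambda/m)

# After many steps, every entry of W has been driven close to zero.
```

Each update here multiplies W by `1 - alpha * lambda / m = 0.95`, so after 100 steps the weights have shrunk by a factor of roughly 0.95^100, i.e. to well under 1% of their starting magnitude.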

Hopefully this intuition will become clearer

when you implement regularization in the programming exercise,

where you'll actually see some of these variance reduction results yourself.

Here's another attempt at additional intuition

for why regularization helps prevent overfitting.

And for this, I'm going to assume that we're using

the tanh activation function, which looks like this.

This is g of z equals tanh of z.

So if that's the case,

notice that so long as z is quite small,

so if z takes on only a smallish range of values,

maybe around here, then you're just using the linear regime of the tanh function.

It's only if z is allowed to wander up to larger or smaller values like these

that the activation function starts to become less linear.

So the intuition you might take away from this is that if lambda,

the regularization parameter, is large,

then your parameters will be relatively small,

because they are penalized for being large in the cost function.

And so if the weights W are small, then because z is

equal to W times a (plus b, technically),

if W tends to be very small,

then z will also be relatively small.

And in particular, if z ends up taking relatively small values,

just within this small range,

then g of z will be roughly linear.

So it's as if every layer is roughly linear,

almost like linear regression.

And we saw in course one that if every layer

is linear, then your whole network is just a linear network.

And so even a very deep network

with a linear activation function

can, in the end, only compute a linear function.
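A quick numerical check of this near-linearity (my own illustration, not from the lecture): tanh(z) stays very close to z itself for small z, but deviates strongly once z wanders out to larger magnitudes:

```python
import numpy as np

# Small inputs: the linear regime of tanh, where tanh(z) is nearly z.
z_small = np.linspace(-0.1, 0.1, 5)
# Large inputs: tanh saturates toward +/-1 and is far from the identity.
z_large = np.linspace(-3.0, 3.0, 5)

# Worst-case deviation of tanh(z) from z on each range
err_small = float(np.max(np.abs(np.tanh(z_small) - z_small)))
err_large = float(np.max(np.abs(np.tanh(z_large) - z_large)))
```

On the small range the deviation is tiny (well under 0.001), while on the large range it exceeds 2, which is why keeping z small keeps each layer approximately linear.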

So it's not able to fit those very complicated,

highly non-linear decision boundaries that allow it to really

overfit to data sets, like we saw in

the overfitting, high variance case on the previous slide.

So just to summarize,

if the regularization parameter becomes very large,

the parameters W become very small,

so z will be relatively small,

ignoring the effects of b for now.

So z will be relatively small or,

really, I should say it takes on a small range of values.

And so the activation function, if it's tanh,

say, will be relatively linear.

And so your whole neural network will be computing something not too far from

a big linear function, which is therefore a pretty

simple function, rather than a very complex, highly non-linear function.

And so it's also much less able to overfit.

And again, when you implement regularization for yourself in the programming exercise,

you'll be able to see some of these effects yourself.

Before wrapping up our discussion of regularization,

I just want to give you one implementational tip.

Which is that, when implementing regularization,

we took our definition of the cost function J and we actually modified

it by adding this extra term that penalizes the weights for being too large.

And so if you implement gradient descent,

one of the steps to debug gradient descent is to plot the cost function J as a function

of the number of iterations of gradient descent, and you want to see that

the cost function J decreases monotonically after every iteration of gradient descent.

And if you're implementing regularization,

then please remember that J now has this new definition.

If you plot the old definition of J,

just this first term,

then it might not decrease monotonically.

So to debug gradient descent, make sure that you're plotting

this new definition of J that includes this second term as well.

Otherwise you might not see J decrease monotonically on every single iteration.
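As a sketch of this debugging tip (the per-iteration snapshots below are hypothetical stand-ins for a real training run), you would compute the new definition of J at each iteration and check that it decreases:

```python
import numpy as np

def regularized_cost(data_loss, weights, lambd, m):
    """The new definition of J: data loss plus the L2 penalty term."""
    penalty = sum(np.sum(np.square(W)) for W in weights)
    return data_loss + (lambd / (2 * m)) * penalty

# Hypothetical (loss, weight matrices) snapshots from three iterations
# of gradient descent, standing in for a real training loop.
snapshots = [
    (0.90, [np.full((2, 2), 1.0)]),
    (0.70, [np.full((2, 2), 0.8)]),
    (0.55, [np.full((2, 2), 0.6)]),
]
lambd, m = 0.7, 10

history = [regularized_cost(loss, Ws, lambd, m) for loss, Ws in snapshots]

# Debug check: the regularized J should decrease monotonically.
is_monotone = all(a >= b for a, b in zip(history, history[1:]))
```

If you plotted only the `data_loss` column instead of `history`, the curve could wiggle even when the full J is behaving correctly, which is exactly the trap this tip warns about.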

So that's it for L2 regularization, which is actually

the regularization technique that I use the most in training deep learning models.

In deep learning, there is another sometimes-used regularization technique

called dropout regularization.

Let's take a look at that in the next video.
