0:00

In the last video, you learned about gradient checking. In this video, I want to share with you some practical tips, or some notes, on how to actually go about implementing it for your neural network.

First, don't use grad check in training; use it only to debug. What I mean is that computing d theta approx i for all the values of i is a very slow computation. So to implement gradient descent, you'd use backprop to compute d theta, and just use backprop to compute the derivative. It's only when you're debugging that you would compute d theta approx to make sure it's close to d theta. But once you've done that, you would turn off grad check and not run it during every iteration of gradient descent, because that's just much too slow.
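The debug-only check described above can be sketched roughly as follows; the function and variable names here are illustrative, not taken from the course code:

```python
import numpy as np

def grad_check(theta, compute_cost, d_theta, epsilon=1e-7):
    """Two-sided numerical gradient check (debug only -- far too slow to run
    inside the training loop).

    theta: flattened parameter vector
    compute_cost: function mapping a parameter vector to the scalar cost J
    d_theta: gradient from backprop, same shape as theta
    """
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        d_theta_approx[i] = (compute_cost(plus) - compute_cost(minus)) / (2 * epsilon)
    # Relative difference: roughly 1e-7 is great, 1e-3 or worse is worrying.
    diff = np.linalg.norm(d_theta_approx - d_theta) / (
        np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta))
    return diff, d_theta_approx
```

The loop over every component of theta is exactly why this is only for debugging: each component costs two full forward passes.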

Second, if an algorithm fails grad check, look at the individual components and try to identify the bug. What I mean by that is, if d theta approx is very far from d theta, look at the different values of i to see which values of d theta approx are really very different from the corresponding values of d theta. For example, suppose you find that the components that are very far off all correspond to db for some layer or layers, but the components for dw are quite close. Remember, different components of theta correspond to different components of b and w. When you find this is the case, then maybe the bug is in how you're computing db, the derivative with respect to the parameters b. And vice versa: if you find that the components of d theta approx that are very far from d theta all came from dw, or from dw in a certain layer, that might help you hone in on the location of the bug. This doesn't always let you identify the bug right away, but sometimes it gives you some guesses about where to track it down.
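As a sketch of that component-wise inspection, something like the following could surface the indices where backprop and the numerical approximation disagree most; mapping those flat indices back to a particular dW or db for some layer depends on how you unrolled theta, and the names here are hypothetical:

```python
import numpy as np

def worst_components(d_theta, d_theta_approx, top_k=3):
    """Return the flat indices of theta where the backprop gradient and the
    numerical approximation differ most (by relative difference), largest
    first, along with those relative differences."""
    rel = np.abs(d_theta - d_theta_approx) / (np.abs(d_theta_approx) + 1e-12)
    order = np.argsort(rel)[::-1][:top_k]
    return order, rel[order]
```

If the worst offenders all fall inside the slice of theta holding db for one layer, that is where to start looking for the bug.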

Â 1:56

Next, when doing grad check, remember your regularization term if you're using regularization. So if your cost function is J of theta equals 1 over m times the sum of your losses, plus the regularization term, lambda over 2m times the sum over l of w[l] squared, then that whole expression is the definition of J. And d theta should be the gradient of J with respect to theta, including the regularization term. So just remember to include that term.
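Concretely, a helper along these lines (the dict layout and names are assumptions for illustration, not the course's code) would add the L2 term to the cost that grad check differences:

```python
import numpy as np

def cost_with_l2(cross_entropy_cost, params, lambd, m):
    """Add the L2 regularization term to the unregularized cost.

    params: hypothetical dict of parameters, e.g. {"W1": ..., "b1": ..., ...};
    only the weight matrices (keys starting with "W") are regularized.
    """
    l2 = sum(np.sum(np.square(W)) for name, W in params.items()
             if name.startswith("W"))
    return cross_entropy_cost + (lambd / (2 * m)) * l2
```

If backprop includes the (lambda / m) * W terms in dW but the cost you difference omits the L2 term, grad check will fail even though both pieces are individually correct.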

Next, grad check doesn't work with dropout, because in every iteration dropout randomly eliminates different subsets of the hidden units. There isn't an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but it's a cost function defined by summing over all the exponentially many subsets of nodes that could be eliminated in any iteration. So that cost function J is very difficult to compute, and you're just sampling it every time you eliminate a different random subset of nodes when you use dropout. So it's difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout. So if you want, you can set keep_prob in dropout equal to 1.0, run grad check, and then turn on dropout and hope that your implementation of dropout was correct.

Â 3:30

There are some other things you could do, like fixing the pattern of nodes dropped and verifying that grad check for that fixed pattern is correct, but in practice I don't usually do that. So my recommendation is: turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn on dropout.
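As a sketch, inverted dropout on one layer's activations might look like this (the names are illustrative). With keep_prob set to 1.0, the mask is all ones and the forward pass becomes deterministic, so grad check can run; afterwards you restore keep_prob below 1 for training:

```python
import numpy as np

def dropout_forward(a, keep_prob, rng=np.random.default_rng(0)):
    """Inverted dropout on activations a.

    With keep_prob == 1.0, rng.random() values in [0, 1) are always below the
    threshold, so the mask is all ones and a is returned unchanged -- which is
    what makes grad check feasible.
    """
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    return a * mask / keep_prob  # scale up so expected activations are unchanged
```

Dividing by keep_prob is the "inverted" part: it keeps the expected value of the activations the same whether dropout is on or off.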

Finally, and this is a subtlety, it rarely happens, but it's not impossible that your implementation of gradient descent is correct when w and b are close to 0, that is, at random initialization, but that as you run gradient descent and w and b become bigger, your implementation of backprop gets more and more inaccurate. So one thing you could do, though I don't do this very often, is run grad check at random initialization, then train the network for a while so that w and b have some time to wander away from 0, away from your small random initial values, and then run grad check again after you've trained for some number of iterations.
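A toy version of that idea, using an illustrative scalar cost rather than a real network, checks the analytic gradient both near a small random initialization and again at larger parameter values:

```python
import numpy as np

def numeric_grad(f, w, eps=1e-6):
    """Two-sided numerical gradient of a scalar cost f at parameter vector w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (f(wp) - f(wm)) / (2 * eps)
    return g

# Toy cost and its true gradient; a buggy "backprop" could agree with the
# numerical gradient near w ~ 0 yet diverge once w grows, which is why
# checking both at initialization and after some training is worthwhile.
f = lambda w: np.sum(w ** 3)
analytic_grad = lambda w: 3 * w ** 2
```

Running the comparison at both scales is cheap insurance against a bug that only shows up once the parameters grow.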

So that's it for gradient checking. And congratulations on coming to the end of this week's materials. This week, you've learned how to set up your train, dev, and test sets; how to analyze bias and variance; and what to do if you have high bias, high variance, or maybe both high bias and high variance. You also saw how to apply different forms of regularization, like L2 regularization and dropout, to your neural network, as well as some tricks for speeding up the training of your neural network. And then finally, gradient checking. So you've seen a lot this week, and you get to exercise a lot of these ideas in this week's programming exercise. So best of luck with that, and I look forward to seeing you in the week two materials.
