0:00

You've seen how using momentum can speed up gradient descent.

There's another algorithm called RMSprop, which stands for root mean square prop, that can also speed up gradient descent.

Let's see how it works.

Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction.

To provide intuition for this example, let's say that the vertical axis is the parameter b and the horizontal axis is the parameter w.

It could just as well be w1 and w2; some of the parameters have been named b and w purely for the sake of intuition.

And so, you want to slow down the learning in the b direction, or in the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction.

So this is what the RMSprop algorithm does to accomplish this.

On iteration t, it will compute as usual the derivatives dW and db on the current mini-batch.

Â 1:15

So I'm going to keep an exponentially weighted average, but instead of VdW, I'm going to use the new notation SdW.

So SdW is equal to beta times its previous value plus 1 minus beta times dW squared.

So for clarity, this squaring operation is an element-wise squaring operation.

So what this is doing is really keeping an exponentially weighted average of the squares of the derivatives.

And similarly, we also have Sdb equals beta times Sdb plus 1 minus beta times db squared, where again the squaring is an element-wise operation.

Next, RMSprop updates the parameters as follows.

W gets updated as W minus the learning rate alpha times, whereas previously we had just dW, now dW divided by the square root of SdW.

And b gets updated as b minus the learning rate alpha times db, now divided by the square root of Sdb.
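The two updates above can be sketched in Python with NumPy. This is a minimal sketch, not the course's code; variable names such as s_dW and the default hyperparameter values are my own choices, and the epsilon term discussed later in the video is omitted here:

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.01, beta=0.9):
    """One RMSprop update on one mini-batch (sketch; no epsilon term yet)."""
    # Exponentially weighted average of the element-wise squared gradients
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Divide each gradient by the root of its running mean square
    W = W - alpha * dW / np.sqrt(s_dW)
    b = b - alpha * db / np.sqrt(s_db)
    return W, b, s_dW, s_db
```

Note that on the very first step, with s_dW and s_db initialized to zero, the update size works out to alpha / sqrt(1 - beta) regardless of the gradient's magnitude, which previews the scale-equalizing effect discussed next.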

Â 2:39

So let's gain some intuition about how this works.

Recall that in the horizontal direction, or in this example the W direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example the b direction, we want to slow down the oscillations.

So with these terms SdW and Sdb, what we're hoping is that SdW will be relatively small, so that here we're dividing by a relatively small number, whereas Sdb will be relatively large, so that here we're dividing by a relatively large number in order to slow down the updates in the vertical dimension.

And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction.

So the slope is very large in the b direction, right?

So with derivatives like this, this is a very large db and a relatively small dW, because the function is sloped much more steeply in the vertical, or b, direction than in the horizontal, or w, direction.

And so, db squared will be relatively large, so Sdb will be relatively large, whereas by comparison dW will be smaller, or dW squared will be smaller, and so SdW will be smaller.

So the net effect of this is that your updates in the vertical direction are divided by a much larger number, and that helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.

So the net impact of using RMSprop is that your updates will end up looking more like this.

Â 4:22

Your updates in the vertical direction are damped out, while in the horizontal direction you can keep going.

And one effect of this is that you can therefore use a larger learning rate alpha, and get faster learning, without diverging in the vertical direction.
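To see this effect numerically, here is a small sketch on a toy cost of my own (not from the lecture) that is much steeper in b than in w: at a learning rate where plain gradient descent diverges along b, RMSprop keeps the b updates bounded.

```python
import numpy as np

def run(alpha, steps=50, beta=0.9, eps=1e-8, use_rmsprop=True):
    """Descend J(w, b) = 0.5 * w**2 + 12.5 * b**2: steep in b, shallow in w.

    Returns the largest |b| seen, a proxy for vertical oscillation.
    """
    w, b = 5.0, 1.0
    s_w = s_b = 0.0
    max_abs_b = 0.0
    for _ in range(steps):
        dw, db = w, 25.0 * b          # gradients of the toy cost
        if use_rmsprop:
            s_w = beta * s_w + (1 - beta) * dw ** 2
            s_b = beta * s_b + (1 - beta) * db ** 2
            w -= alpha * dw / (np.sqrt(s_w) + eps)
            b -= alpha * db / (np.sqrt(s_b) + eps)
        else:                          # plain gradient descent
            w -= alpha * dw
            b -= alpha * db
        max_abs_b = max(max_abs_b, abs(b))
    return max_abs_b
```

With alpha = 0.09, plain gradient descent multiplies b by (1 - 0.09 * 25) = -1.25 on every step, so |b| explodes, while RMSprop's division by the root mean square of db keeps the vertical steps small.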

Now, just for the sake of clarity, I've been calling the vertical and horizontal directions b and w to illustrate this.

In practice, you're in a very high-dimensional space of parameters, so maybe the vertical dimensions where you're trying to damp the oscillations are some subset of the parameters, say w1, w2, w17, and the horizontal dimensions might be w3, w4, and so on.

So the separation into b and w is just for illustration.

In practice, dW is a very high-dimensional parameter vector, and db is also a very high-dimensional parameter vector, but your intuition is that in the dimensions where you're getting these oscillations, you end up computing a larger exponentially weighted average of the squares of the derivatives, and so you end up damping out the directions in which there are these oscillations.

So that's RMSprop, and it stands for root mean square prop, because you're squaring the derivatives, and then you take the square root here at the end.

So finally, just a couple of last details on this algorithm before we move on.

Â 5:49

In the next video, we're actually going to combine RMSprop together with momentum.

So rather than using the hyperparameter beta, which we had used for momentum, I'm going to call this hyperparameter beta 2, just so we don't use the same hyperparameter for both momentum and RMSprop.

Also, you want to make sure that your algorithm doesn't divide by zero: what if the square root of SdW is very close to zero?

Then things could blow up.

Just to ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator.

It doesn't really matter what epsilon is used; 10 to the -8 would be a reasonable default.

This just ensures slightly greater numerical stability, so that due to numerical round-off or whatever reason, you don't end up dividing by a very, very small number.
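Concretely, the stabilized denominator looks like this (a sketch, using the 10^-8 default mentioned above; the helper name is my own):

```python
import numpy as np

EPS = 1e-8  # small constant for numerical stability; exact value barely matters

def scaled_update(param, grad, s, alpha=0.01):
    """RMSprop-style update with sqrt(s) + EPS so a near-zero s can't blow up."""
    return param - alpha * grad / (np.sqrt(s) + EPS)
```

Even if the running average s is exactly zero, the division is by EPS rather than by zero, so the update stays finite.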

So that's RMSprop.

Similar to momentum, it has the effect of damping out the oscillations in gradient descent, including mini-batch gradient descent, allowing you to maybe use a larger learning rate alpha, and certainly speeding up the learning speed of your algorithm.

So now you know how to implement RMSprop, and this will be another way for you to speed up your learning algorithm.

One fun fact about RMSprop: it was actually first proposed not in an academic research paper, but in a Coursera course that Geoff Hinton had taught on Coursera many years ago.

I guess Coursera wasn't intended to be a platform for the dissemination of novel academic research, but it worked out pretty well in that case.

It was really from the Coursera course that RMSprop started to become widely known, and it really took off.

We've talked about momentum.

We've talked about RMSprop.

It turns out that if you put them together, you can get an even better optimization algorithm.

Let's talk about that in the next video.
