hyperparameters: the learning rate alpha, as well as the parameter Beta,

which controls your exponentially weighted average.

The most common value for Beta is 0.9.

Just as Beta equals 0.9 meant averaging over roughly the last ten days of temperature in the earlier example,

here it means averaging over roughly the last ten iterations' gradients.

And in practice, Beta equals 0.9 works very well.

Feel free to try different values and

do some hyperparameter search, but 0.9 appears to be a pretty robust value.
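As a concrete sketch, here is what one step of the momentum update described so far might look like in NumPy; the parameter shape, learning rate, and the constant stand-in gradient are all made up purely for illustration:

```python
import numpy as np

beta = 0.9   # averages over roughly the last 1 / (1 - beta) = 10 gradients
alpha = 0.1  # made-up learning rate

W = np.ones((2, 2))     # made-up parameter matrix
vdW = np.zeros_like(W)  # velocity term, initialized to zero

for step in range(3):
    dW = np.full_like(W, 0.5)           # stand-in for a backprop gradient
    vdW = beta * vdW + (1 - beta) * dW  # exponentially weighted average
    W = W - alpha * vdW                 # update with the smoothed gradient
```

In a real training loop, dW would come from backprop on the current mini-batch; everything else stays the same.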

Well, how about bias correction?

That is, do you want to take vdW and vdb and divide them by 1 minus Beta to the power t?

In practice, people don't usually do this, because after just ten iterations

your moving average will have warmed up and is no longer a biased estimate.

So in practice, I don't really see people bothering with bias correction

when implementing gradient descent or momentum.
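If you did want the correction, it is just a division by 1 minus Beta to the t. This toy loop (with a made-up constant gradient of 1) shows the corrected estimate next to the raw moving average during the warm-up period:

```python
beta = 0.9
v = 0.0  # raw exponentially weighted average, initialized to zero
g = 1.0  # made-up constant gradient, so the true average is 1.0
history = []
for t in range(1, 11):
    v = beta * v + (1 - beta) * g
    v_corrected = v / (1 - beta ** t)  # bias-corrected estimate
    history.append((v, v_corrected))
# With a constant signal, the corrected estimate equals the true
# average from the very first step, while the raw average only
# approaches it as beta ** t decays toward zero.
```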

And of course, this process initializes vdW to 0.

Note that this is a matrix of zeroes with the same dimension as dW,

which has the same dimension as W.

And vdb is also initialized to a vector of zeroes

with the same dimension as db, which in turn has the same dimension as b.
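A minimal sketch of that initialization (the parameter shapes here are made up):

```python
import numpy as np

W = np.random.randn(4, 3)  # made-up parameter shapes
b = np.zeros((4, 1))

vdW = np.zeros_like(W)  # matrix of zeroes, same dimensions as dW and W
vdb = np.zeros_like(b)  # vector of zeroes, same dimensions as db and b
```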

Finally, I just want to mention that if you read the literature on gradient

descent with momentum often you see it with this term omitted,

with this 1 minus Beta term omitted.

So you end up with vdW equals Beta times vdW plus dW.

And the net effect of using this version in purple is that vdW ends up being

scaled up by a factor of 1 over 1 minus Beta.

And so when you're performing these gradient descent updates, alpha just needs

to be scaled down by a corresponding factor of 1 minus Beta.
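Here is a small sketch of that equivalence on a made-up gradient sequence; scaling alpha by 1 minus Beta makes the version without the 1 minus Beta term produce exactly the same parameter updates:

```python
beta = 0.9
alpha = 0.1                    # made-up learning rate
grads = [0.5, -0.2, 1.0, 0.3]  # made-up gradient sequence

v1 = v2 = 0.0
w1 = w2 = 1.0  # same starting parameter for both versions
for g in grads:
    v1 = beta * v1 + (1 - beta) * g  # formulation with the 1 - beta term
    w1 -= alpha * v1
    v2 = beta * v2 + g               # 1 - beta term omitted
    w2 -= alpha * (1 - beta) * v2    # learning rate scaled down to compensate
# w1 and w2 match: the two formulations differ only in how alpha is tuned.
```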

In practice, both of these will work just fine,

it just affects what's the best value of the learning rate alpha.

But I find that this particular formulation is a little less intuitive.

Because one impact of this is that if you end up tuning the hyperparameter Beta,

then this affects the scaling of vdW and vdb as well.

And so you end up needing to retune the learning rate, alpha, as well, maybe.

So I personally prefer the formulation on the left,

the one with the 1 minus Beta term, rather than the version that leaves it out.

But both versions having Beta equal 0.9 is a common choice of hyper parameter.

It's just at alpha the learning rate would need to be tuned differently for

these two different versions.

So that's it for gradient descent with momentum.

This will almost always work better than the straightforward

gradient descent algorithm without momentum.

But there are still other things we could do to speed up your learning algorithm.

Let's continue talking about these in the next couple videos.