
During the history of deep learning, many researchers, including some very well-known ones, proposed optimization algorithms and showed that they worked well on a few problems. But those optimization algorithms were subsequently shown not to generalize that well to the wide range of neural networks you might want to train. So over time, I think the deep learning community actually developed some amount of skepticism about new optimization algorithms. A lot of people felt that gradient descent with momentum really works well and that it was difficult to propose things that work much better.

So RMSprop, and the Adam optimization algorithm which we'll talk about in this video, are among those rare algorithms that have really stood up and have been shown to work well across a wide range of deep learning architectures. So this is one of the algorithms that I wouldn't hesitate to recommend you try, because many people have tried it and seen it work well on many problems. The Adam optimization algorithm basically takes momentum and RMSprop and puts them together. So let's see how that works.

To implement Adam, you would initialize V_dW = 0, S_dW = 0, and similarly V_db = 0, S_db = 0.
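In code, this initialization might look like the following minimal NumPy sketch (the variable names and shapes here are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical parameters for a single layer (shapes are illustrative)
W = np.random.randn(3, 2)
b = np.zeros((3, 1))

# Adam state: momentum-like (V) and RMSprop-like (S) accumulators,
# all initialized to zeros of the same shape as the parameters
V_dW, S_dW = np.zeros_like(W), np.zeros_like(W)
V_db, S_db = np.zeros_like(b), np.zeros_like(b)
```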

Then on iteration t, you would compute the derivatives: compute dW and db using the current mini-batch. So usually, you do this with mini-batch gradient descent.

Then you do the momentum exponentially weighted average: V_dW = β1 * V_dW + (1 - β1) * dW. I'm going to call this β1 to distinguish it from the hyperparameter β2 we'll use for the RMSprop portion of this. So this is exactly what we had when we were implementing momentum, except the hyperparameter is now called β1 instead of β. And similarly, you have V_db = β1 * V_db + (1 - β1) * db.

Then you do the RMSprop update as well, now with a different hyperparameter β2: S_dW = β2 * S_dW + (1 - β2) * dW². And again, the squaring there is an element-wise squaring of your derivatives dW. Then S_db = β2 * S_db + (1 - β2) * db². So this is the momentum-like update with hyperparameter β1, and this is the RMSprop-like update with hyperparameter β2.
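These two exponentially weighted averages can be sketched as a small helper function; this is an illustrative implementation, not code from the lecture:

```python
def update_moments(V, S, grad, beta1=0.9, beta2=0.999):
    """Momentum-like average of the gradient and RMSprop-like
    average of its element-wise square."""
    V = beta1 * V + (1 - beta1) * grad
    S = beta2 * S + (1 - beta2) * grad ** 2  # element-wise square
    return V, S
```

For example, starting from V = S = 0 with a gradient of 2.0, one call gives approximately V = 0.2 and S = 0.004.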

In the typical implementation of Adam, you do implement bias correction. So you're going to have V_corrected, where corrected means after bias correction: V_dW_corrected = V_dW / (1 - β1^t), if you've done t iterations. And similarly, V_db_corrected = V_db / (1 - β1^t). Then you implement this bias correction on S as well: S_dW_corrected = S_dW / (1 - β2^t), and S_db_corrected = S_db / (1 - β2^t).
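The bias-correction step can be sketched like this (again an illustrative helper, with the iteration count t starting at 1):

```python
def bias_correct(V, S, t, beta1=0.9, beta2=0.999):
    """Divide the running averages by (1 - beta^t) so early
    iterations are not biased toward the zero initialization."""
    V_corrected = V / (1 - beta1 ** t)
    S_corrected = S / (1 - beta2 ** t)
    return V_corrected, S_corrected
```

On the very first iteration (t = 1), starting from zero-initialized averages, this exactly recovers the current gradient and its square, which is the point of the correction.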

Finally, you perform the update. If you were just implementing momentum, you'd update W as W minus alpha times V_dW, or maybe V_dW_corrected. But now we add in the RMSprop portion, so we also divide by the square root of S_dW_corrected plus epsilon: W = W - alpha * V_dW_corrected / (sqrt(S_dW_corrected) + epsilon). And similarly, b gets updated with a similar formula: b = b - alpha * V_db_corrected / (sqrt(S_db_corrected) + epsilon).
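Putting the corrected averages into the parameter update, a minimal sketch (with illustrative names, not the lecture's code) might be:

```python
import numpy as np

def adam_param_update(param, V_corrected, S_corrected,
                      alpha=0.001, eps=1e-8):
    """Momentum-style numerator divided by the RMSprop-style
    square-root denominator (eps avoids division by zero)."""
    return param - alpha * V_corrected / (np.sqrt(S_corrected) + eps)
```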

And so, this algorithm combines the effect of gradient descent with momentum together with that of RMSprop. It's a commonly used learning algorithm that has proven to be very effective for many different neural networks of a very wide variety of architectures.

So this algorithm has a number of hyperparameters. The learning rate hyperparameter alpha is still important and usually needs to be tuned, so you just have to try a range of values and see what works. A common choice, really the default choice, for β1 is 0.9. This is the moving weighted average of dW; this is the momentum-like term.

For the hyperparameter β2, the authors of the Adam paper, the inventors of the Adam algorithm, recommend 0.999. Again, this is computing the moving weighted average of dW² as well as db².

And then epsilon: the choice of epsilon doesn't matter very much, but the authors of the Adam paper recommend 10^-8. You really don't need to set this parameter, and it doesn't affect performance much at all. So when implementing Adam, what people usually do is just use the default values for β1 and β2 as well as epsilon. I don't think anyone ever really tunes epsilon. And then try a range of values of alpha to see what works best. You could also tune β1 and β2, but it's not done that often among the practitioners I know.
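Tying all the pieces together with the recommended defaults, one full Adam iteration could be sketched as follows (an illustrative single-parameter version, not the lecture's code):

```python
import numpy as np

def adam_step(param, grad, V, S, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam iteration: moment updates, bias correction, parameter update.
    beta1=0.9, beta2=0.999 and eps=1e-8 are the defaults recommended in
    the lecture; alpha is the one hyperparameter you would typically tune."""
    V = beta1 * V + (1 - beta1) * grad           # momentum-like average
    S = beta2 * S + (1 - beta2) * grad ** 2      # RMSprop-like average
    V_c = V / (1 - beta1 ** t)                   # bias correction
    S_c = S / (1 - beta2 ** t)
    param = param - alpha * V_c / (np.sqrt(S_c) + eps)
    return param, V, S
```

On the first iteration the corrected moments equal the gradient and its element-wise square, so the step is roughly alpha times the sign of the gradient.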

So, where does the term 'Adam' come from? Adam stands for Adaptive Moment Estimation. β1 is computing the mean of the derivatives; this is called the first moment. And β2 is used to compute an exponentially weighted average of the squares, and that's called the second moment. So that gives rise to the name adaptive moment estimation. But everyone just calls it the Adam optimization algorithm.

And, by the way, one of my long-term friends and collaborators is called Adam Coates. As far as I know, this algorithm doesn't have anything to do with him, except for the fact that I think he uses it sometimes. But sometimes I get asked that question, so just in case you're wondering.

So, that's it for the Adam optimization algorithm. With it, I think you will be able to train your neural networks much more quickly. But before we wrap up for this week, let's keep talking about hyperparameter tuning, as well as gain some more intuitions about what the optimization problem for neural networks looks like. In the next video, we'll talk about learning rate decay.
