0:00

One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this.

Let's start with an example of why you might want to implement learning rate decay.

Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch, maybe just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy, and they will tend towards this minimum over here, but they won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha, and there's just some noise in your different mini-batches.

But if you were to slowly reduce your learning rate alpha, then during the initial phases, while alpha is still large, you can still have relatively fast learning. But as alpha gets smaller, the steps you take become slower and smaller, and so you end up oscillating in a tighter region around this minimum, rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that during the initial steps of learning you can afford to take much bigger steps, but as learning approaches convergence, a slower learning rate allows you to take smaller steps.

So here's how you can implement learning rate decay.

1:42

Recall that one epoch is one pass through the data, right?

So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, the second pass is the second epoch, and so on. So one thing you could do is set your learning rate alpha = 1 / (1 + decay_rate * epoch_num) * alpha_0, where the decay rate is a parameter, a hyperparameter, and epoch_num counts how many epochs, so how many passes through your data, you've taken.

2:35

If alpha_0 = 0.2 and the decay rate = 1, then during your first epoch alpha will be 1 / (1 + 1) * alpha_0, so your learning rate will be 0.1. That's just evaluating this formula when the decay rate is equal to 1 and the epoch-num is 1. On the second epoch, your learning rate decays to about 0.067. On the third, 0.05; on the fourth, 0.04; and so on. Feel free to evaluate more of these values yourself, and get a sense that, as a function of your epoch number, your learning rate gradually decreases according to this formula up on top.
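As a quick sketch (my own illustration, not code from the lecture), this decay rule and the worked example above can be written in Python; the function name `decayed_lr` is a hypothetical choice:

```python
def decayed_lr(alpha0, decay_rate, epoch_num):
    # alpha = 1 / (1 + decay_rate * epoch_num) * alpha0
    return alpha0 / (1.0 + decay_rate * epoch_num)

# With alpha0 = 0.2 and decay_rate = 1, as in the example:
for epoch in [1, 2, 3, 4]:
    print(epoch, round(decayed_lr(0.2, 1, epoch), 3))
# -> 0.1, 0.067, 0.05, 0.04
```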

So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyperparameter alpha_0 and this decay rate hyperparameter, and then try to find a value that works well.

Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, raised to the power of epoch-num, times alpha_0. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha = some constant over the square root of epoch-num, times alpha_0, or some constant k, another hyperparameter, over the square root of the mini-batch number t, times alpha_0.
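These alternatives can be sketched the same way (a rough illustration; the function names and default constants are my assumptions, not from the lecture):

```python
def exponential_decay(alpha0, epoch_num, base=0.95):
    # alpha = base^epoch_num * alpha0, for some base < 1
    return (base ** epoch_num) * alpha0

def sqrt_epoch_decay(alpha0, epoch_num, k=1.0):
    # alpha = k / sqrt(epoch_num) * alpha0
    return k / epoch_num ** 0.5 * alpha0

def sqrt_minibatch_decay(alpha0, t, k=1.0):
    # alpha = k / sqrt(t) * alpha0, where t is the mini-batch number
    return k / t ** 0.5 * alpha0
```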

And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half again, and after a while by one half again. So this is a discrete staircase.
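A staircase schedule like that can be sketched as follows (the halving interval of 10 epochs is an arbitrary assumption for illustration):

```python
def staircase_decay(alpha0, epoch_num, interval=10):
    # Halve the learning rate once every `interval` epochs.
    return alpha0 * 0.5 ** (epoch_num // interval)
```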

Â 4:55

So far, we've talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay. If you're training just one model at a time, and if your model takes many hours or even many days to train, what some people will do is just watch the model as it's training over a large number of days, and then manually say, it looks like learning has slowed down, I'm going to decrease alpha a little bit. Of course this works, this manually controlling alpha, really tuning alpha by hand, hour by hour or day by day. It works only if you're training a small number of models, but sometimes people do that as well.

So now you have a few more options for how to control the learning rate alpha. Now, in case you're thinking, wow, this is a lot of hyperparameters, how do I select amongst all these different options? I would say, don't worry about it for now. Next week, we'll talk more about how to systematically choose hyperparameters. For me, learning rate decay is usually lower down on the list of things I try. Setting alpha to just a fixed value, and getting that to be well tuned, has a huge impact. Learning rate decay does help. Sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyperparameter tuning, you'll see more systematic ways to organize all of these hyperparameters, and how to efficiently search amongst them.

So that's it for learning rate decay. Finally, I was also going to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're training these neural networks. Let's go on to the next video to see that.
