One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this. Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch, maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy, and they will tend towards this minimum over here, but they won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for Alpha and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate Alpha, then during the initial phases, while your learning rate Alpha is still large, you can still have relatively fast learning. Then as Alpha gets smaller, the steps you take will be slower and smaller, and so you end up oscillating in a tighter region around this minimum rather than wandering far away, even as training goes on and on. The intuition behind slowly reducing Alpha is that maybe during the initial steps of learning, you could afford to take much bigger steps, but then as learning approaches convergence, having a slower learning rate allows you to take smaller steps. Here's how you can implement learning rate decay. Recall that one epoch is one pass through the data. If you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, and then the second pass is the second epoch, and so on. One thing you could do is set your learning rate Alpha to be equal to 1 over the quantity 1 plus a parameter, which I'm going to call the decay rate, times the epoch num, and all of this is multiplied by some initial learning rate Alpha 0.
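As a minimal sketch of the formula just described, here is one way you might write it in Python. The function name `decayed_learning_rate` and the sample values for Alpha 0 and the decay rate are illustrative assumptions, not part of the lecture's notation, and both hyperparameters would still need tuning in practice.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Return alpha = alpha0 / (1 + decay_rate * epoch_num).

    alpha0     -- initial learning rate (Alpha 0)
    decay_rate -- decay-rate hyperparameter
    epoch_num  -- index of the current epoch (1 for the first pass)
    """
    return alpha0 / (1.0 + decay_rate * epoch_num)


# Hypothetical values chosen only to show the shape of the schedule.
alpha0 = 0.2
decay_rate = 1.0

# Learning rate for each of the first four epochs.
schedule = [decayed_learning_rate(alpha0, decay_rate, epoch)
            for epoch in range(1, 5)]

for epoch, alpha in enumerate(schedule, start=1):
    print(f"epoch {epoch}: alpha = {alpha:.3f}")
```

Computing Alpha once per epoch, rather than once per mini-batch, follows the way the formula is stated in terms of the epoch num.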
Note that the decay rate here becomes another hyperparameter which you might need to tune. Here's a concrete example. If you take several epochs, so several passes through your data, if Alpha 0 is equal to 0.2 and the decay rate is equal to 1, then during your first epoch, Alpha will be 1 over 1 plus 1, times Alpha 0, so your learning rate will be 0.1. That's just evaluating this formula when the decay rate is equal to 1 and epoch num is 1. On the second epoch, your learning rate decays to about 0.067. On the third, 0.05. On the fourth, 0.04, and so on. Feel free to evaluate more of these values yourself and get a sense that, as a function of epoch number, your learning rate gradually decreases according to this formula up on top. If you wish to use learning rate decay, what you can do is try a variety of values of both the hyperparameter Alpha 0 and this decay rate hyperparameter, and then try to find a value that works well. Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where Alpha is equal to some number less than 1, such as 0.95, raised to the power of epoch num, times Alpha 0. This will decay your learning rate exponentially quickly. Other formulas that people use are things like Alpha equals some constant over the square root of epoch num, times Alpha 0, or some constant k, which is another hyperparameter, over the square root of the mini-batch number t, times Alpha 0. Sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one-half, after a while by one-half again, and so on, so this is a discrete staircase. So far, we've talked about using some formula to govern how Alpha, the learning rate, changes over time. One other thing that people sometimes do is manual decay.
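Before moving on to manual decay, here is a rough sketch of the formula-based alternatives just mentioned. Exponential decay is taken here to mean 0.95 raised to the power of the epoch num, times Alpha 0. The function names, the default `k=1.0`, and the `steps_per_drop` parameter are hypothetical choices made for illustration; the lecture does not fix any of them.

```python
import math


def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base ** epoch_num * alpha0, with 0 < base < 1."""
    return (base ** epoch_num) * alpha0


def inverse_sqrt_decay(alpha0, step, k=1.0):
    """alpha = k / sqrt(step) * alpha0.

    `step` can be the epoch num or the mini-batch number t.
    It must be at least 1 to avoid dividing by zero.
    """
    return k / math.sqrt(step) * alpha0


def staircase_decay(alpha0, step, steps_per_drop=10, factor=0.5):
    """Multiply alpha by `factor` (one-half here) every `steps_per_drop` steps."""
    return alpha0 * factor ** (step // steps_per_drop)


# Compare the three schedules at a few steps (illustrative values only).
for step in (1, 10, 20):
    print(step,
          exponential_decay(0.2, step),
          inverse_sqrt_decay(0.2, step),
          staircase_decay(0.2, step))
```

Each of these adds at least one more hyperparameter (the base, the constant k, or the drop interval), so they would be tuned alongside Alpha 0.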
If you're training just one model at a time, and if your model takes many hours or even many days to train, what some people would do is just watch the model as it's training over a large number of days, and then say, oh, it looks like learning has slowed down, I'm going to decrease Alpha a little bit. Of course, manually controlling Alpha like this, really tuning Alpha by hand, hour-by-hour, day-by-day, works only if you're training a small number of models, but sometimes people do that as well. Now you have a few more options for how to control the learning rate Alpha. Now, in case you're thinking, wow, this is a lot of hyperparameters, how do I select amongst all these different options? I would say don't worry about it for now, and next week, we'll talk more about how to systematically choose hyperparameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting Alpha to just a fixed value and getting that to be well-tuned has a huge impact. Learning rate decay does help, and sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyperparameter tuning, you'll see more systematic ways to organize all of these hyperparameters and how to efficiently search amongst them. That's it for learning rate decay. Finally, I also want to talk a little bit about local optima and saddle points in neural networks, so you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve when you're training these neural networks. Let's go on to the next video to see that.