In this video, we'll look at two further refinements to the reinforcement learning algorithm you've seen. The first idea is called using mini-batches, and this turns out to be an idea that can both speed up your reinforcement learning algorithm and is also applicable to supervised learning. It can help you speed up your supervised learning algorithm as well, like training a neural network, or training a linear regression or logistic regression model. The second idea we'll look at is soft updates, which it turns out will help your reinforcement learning algorithm do a better job of converging to a good solution. Let's take a look at mini-batches and soft updates.

To understand mini-batches, let's just look at supervised learning to start. Here's the dataset of housing sizes and prices that you had seen way back in the first course of this specialization on using linear regression to predict housing prices. There we had come up with this cost function for the parameters w and b: it was 1 over 2m times the sum of the prediction minus the actual value y, squared. The gradient descent algorithm was to repeatedly update w as w minus the learning rate alpha times the partial derivative with respect to w of the cost J of w, b, and similarly to update b as follows. Let me just take this definition of J of w, b and substitute it in here.

Now, when we looked at this example, way back when we were starting to talk about linear regression and supervised learning, the training set size m was pretty small. I think we had 47 training examples. But what if you have a very large training set? Say m equals 100 million. There are many countries, including the United States, with over 100 million housing units, and so a national census will give you a dataset of this order of magnitude in size. The problem with this algorithm when your dataset is this big is that every single step of gradient descent requires computing this average over 100 million examples, and this turns out to be very slow.
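
Written out, the cost function and the update rule described above look roughly like this. The transcript only describes the slide verbally, so the model notation f_{w,b}(x) = wx + b is an assumption carried over from the usual linear regression setup.

```latex
% Cost over all m training examples
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2

% One step of batch gradient descent, with J substituted in
w := w - \alpha \, \frac{\partial}{\partial w} \, \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2
b := b - \alpha \, \frac{\partial}{\partial b} \, \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2
```

With m equal to 100 million, each of these two sums touches every example on every step.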
Every step of gradient descent means you would compute this sum or this average over 100 million examples. Then you take one tiny gradient descent step, and you have to go back and scan over your entire 100 million example dataset again to compute the derivative on the next step, then take another tiny gradient descent step, and so on and so on. When the training set size is very large, this gradient descent algorithm turns out to be quite slow.

The idea of mini-batch gradient descent is to not use all 100 million training examples on every single iteration through this loop. Instead, we may pick a smaller number, let me call it m prime, equal to say 1,000. On every step, instead of using all 100 million examples, we would pick some subset of 1,000, or m prime, examples. This inner term becomes 1 over 2m prime times the sum over some m prime examples. Now each iteration through gradient descent requires looking only at 1,000 rather than 100 million examples, so every step takes much less time, and this just leads to a more efficient algorithm.

What mini-batch gradient descent does is, on the first iteration through the algorithm, maybe it looks at that subset of the data. On the next iteration, maybe it looks at that subset of the data, and likewise for the third iteration and so on, so that every iteration is looking at just a subset of the data and each iteration runs much more quickly.

To see why this might be a reasonable algorithm, here's the housing dataset. If on the first iteration we were to look at just, say, five examples, this is not the whole dataset, but it's somewhat representative of the straight line you might want to fit in the end, and so taking one gradient descent step to make the algorithm better fit these five examples is okay. But then on the next iteration, you take a different five examples, like those shown here. You take one gradient descent step using these five examples, and on the next iteration you use a different five examples, and so on and so forth.
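
The loop just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's own code: the function name `minibatch_gradient_descent`, the synthetic data, and the learning rate and step count are all assumptions chosen so the example runs quickly.

```python
import numpy as np

def minibatch_gradient_descent(x, y, alpha=0.1, batch_size=5, num_steps=5000, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    m = x.shape[0]
    for _ in range(num_steps):
        # Pick m' = batch_size examples at random instead of using all m.
        idx = rng.choice(m, size=min(batch_size, m), replace=False)
        x_batch, y_batch = x[idx], y[idx]
        err = w * x_batch + b - y_batch
        # Gradients of (1 / (2 m')) * sum(err ** 2) with respect to w and b.
        grad_w = np.mean(err * x_batch)
        grad_b = np.mean(err)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Hypothetical noiseless data lying on the line y = 3x + 1.
data_rng = np.random.default_rng(1)
x_train = rng_x = data_rng.uniform(0.0, 2.0, size=1000)
y_train = 3.0 * x_train + 1.0

w_fit, b_fit = minibatch_gradient_descent(x_train, y_train)
```

Each step here reads only five examples, however large `x_train` is, which is the whole point of the method.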
You can scan through this list of examples from top to bottom. That would be one way. Another way would be if on every single iteration you just pick a totally different five examples to use.

You might remember batch gradient descent: if these are the contours of the cost function J, then batch gradient descent would say, start here and take a step, take a step, take a step, take a step, take a step. Every step of gradient descent causes the parameters to reliably get closer to the global minimum of the cost function here in the middle. In contrast, mini-batch gradient descent, or a mini-batch learning algorithm, will do something like this. If you start here, then the first iteration uses just five examples. It'll head in roughly the right direction, but maybe not the best gradient descent direction. Then on the next iteration it may do that, the next iteration that, and that, and sometimes just by chance the five examples you chose may be an unlucky choice and it may even head in the wrong direction, away from the global minimum, and so on and so forth.

But on average, mini-batch gradient descent will tend toward the global minimum, not reliably and somewhat noisily, but every iteration is much less computationally expensive, and so mini-batch learning or mini-batch gradient descent turns out to be a much faster algorithm when you have a very large training set. In fact, for supervised learning where you have a very large training set, mini-batch learning or mini-batch gradient descent, or a mini-batch version of other optimization algorithms like Adam, is used more commonly than batch gradient descent.

Going back to our reinforcement learning algorithm, this is the algorithm that we had seen previously. The mini-batch version of this would be, even if you have stored the 10,000 most recent tuples in the replay buffer, what you might choose to do is not use all 10,000 every time you train a model. Instead, what you might do is just take a subset.
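
Taking such a subset from the replay buffer can be sketched as follows. This is a minimal sketch that assumes uniform random sampling without replacement; the names `replay_buffer`, `store_experience` and `sample_minibatch` are hypothetical, and the stored values are placeholders rather than real Lunar Lander states.

```python
import random
from collections import deque

# Replay buffer that keeps only the 10,000 most recent (s, a, R(s), s') tuples.
replay_buffer = deque(maxlen=10_000)

def store_experience(buffer, state, action, reward, next_state):
    """Append one experience tuple; the deque drops the oldest when full."""
    buffer.append((state, action, reward, next_state))

def sample_minibatch(buffer, batch_size=1_000, rng=random):
    """Draw up to batch_size tuples uniformly, without replacement."""
    k = min(batch_size, len(buffer))
    return rng.sample(list(buffer), k)

# Fill the buffer with placeholder experiences (integers stand in for states).
for t in range(12_000):
    store_experience(replay_buffer, state=t, action=t % 4, reward=0.0, next_state=t + 1)

# Use 1,000 of the 10,000 stored tuples to build one set of training examples.
minibatch = sample_minibatch(replay_buffer, batch_size=1_000, rng=random.Random(0))
```
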
You might choose just 1,000 examples of these (s, a, R(s), s prime) tuples and use them to create just 1,000 training examples to train the neural network. It turns out that this will make each iteration of training a model a little bit more noisy but much faster, and this will overall tend to speed up this reinforcement learning algorithm. That's how mini-batching can speed up both a supervised learning algorithm like linear regression as well as this reinforcement learning algorithm, where you may use a mini-batch size of, say, 1,000 examples, even if you have stored away 10,000 of these tuples in your replay buffer.

Finally, there's one other refinement to the algorithm that can make it converge more reliably. I've written out this step here of Set Q equals Q_new, but it turns out that this can make a very abrupt change to Q. If you train a new neural network Q_new, maybe just by chance it is not a very good neural network. Maybe it is even a little bit worse than the old one, and then you have just overwritten your Q function with a potentially worse, noisy neural network. The soft update method helps to prevent Q from getting worse through just one unlucky step.

In particular, the neural network Q will have some parameters, W and B, all the parameters for all the layers in the neural network. When you train the new neural network, you get some parameters W_new and B_new. In the original algorithm on that slide, you would set W to be equal to W_new and B equal to B_new. That's what Set Q equals Q_new means. With the soft update, what we do instead is Set W equals 0.01 times W_new plus 0.99 times W. In other words, we're going to make W be 99 percent the old version of W plus one percent of the new version, W_new. This is called a soft update because whenever we train a new neural network W_new, we're only going to accept a little bit of the new value. And similarly, B equals 0.01 times B_new plus 0.99 times B.
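
In code, the soft update is a weighted average of the old and new parameters. The sketch below is illustrative only: the name `soft_update`, the constant `TAU`, and the tiny example arrays are assumptions, and a real network would have one such pair of arrays per layer.

```python
import numpy as np

TAU = 0.01  # hypothetical name for the fraction of the new parameters we accept

def soft_update(params, new_params, tau=TAU):
    """Return tau * new + (1 - tau) * old for every parameter array."""
    return [tau * p_new + (1.0 - tau) * p for p, p_new in zip(params, new_params)]

# Placeholder parameters standing in for one layer of the Q network.
W, B = np.zeros((2, 2)), np.zeros(2)
W_new, B_new = np.ones((2, 2)), np.full(2, 10.0)

# W becomes 0.01 * W_new + 0.99 * W, and likewise for B.
W, B = soft_update([W, B], [W_new, B_new])
```
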
These numbers, 0.01 and 0.99, are hyperparameters that you can set. They control how aggressively you move W toward W_new, and the two numbers are expected to add up to one. One extreme would be if you were to set W equals 1 times W_new plus 0 times W, in which case you're back to the original algorithm up here, where you're just copying W_new onto W. But a soft update allows you to make a more gradual change to Q, or to the neural network parameters W and B that affect your current guess for the Q function Q of s, a. It turns out that using the soft update method causes the reinforcement learning algorithm to converge more reliably. It makes it less likely that the reinforcement learning algorithm will oscillate or diverge or have other undesirable properties.

With these two final refinements to the algorithm, mini-batching, which actually applies very well to supervised learning as well, not just reinforcement learning, as well as the idea of soft updates, you should be able to get your learning algorithm to work really well on the Lunar Lander. The Lunar Lander is actually a decently complex, decently challenging application, and so if you can get it to work and land safely on the moon, I think that's actually really cool, and I hope you enjoy playing with the practice lab.

Now, we've talked a lot about reinforcement learning. Before we wrap up, I'd like to share with you my thoughts on the state of reinforcement learning, so that as you go out and build applications using different machine learning techniques, be they supervised, unsupervised, or reinforcement learning techniques, you have a framework for understanding where reinforcement learning fits into the world of machine learning today. Let's go take a look at that in the next video.