Let's see how we can use reinforcement learning to control the Lunar Lander, or to solve other reinforcement learning problems. The key idea is that we're going to train a neural network to compute, or to approximate, the state-action value function Q of s, a, and that in turn will let us pick good actions. Let's see how it works. The heart of the learning algorithm is that we're going to train a neural network that inputs the current state and the current action and computes, or approximates, Q of s, a. In particular, for the Lunar Lander, we're going to take the state s and any action a and put them together. Concretely, the state was that list of eight numbers we saw previously: x, y, x dot, y dot, theta, theta dot, and then l and r for whether the left and right legs are grounded. So that's a list of eight numbers that describes the state. Then finally, we have four possible actions: nothing, left thruster, main engine, and right thruster. We can encode any of those four actions using a one-hot feature vector. If the action were the first action, we might encode it as 1, 0, 0, 0, or if it were the second action, firing the left thruster, we might encode it as 0, 1, 0, 0. This list of 12 numbers, eight numbers for the state and then four numbers for the one-hot encoding of the action, is the input we'll have to the neural network, and I'm going to call this x. We'll then take these 12 numbers and feed them to a neural network with, say, 64 units in the first hidden layer, 64 units in the second hidden layer, and then a single unit in the output layer. The job of the neural network is to output Q of s, a, the state-action value function for the Lunar Lander, given the input s and a. Because we'll be using neural network training algorithms in a little bit, I'm also going to refer to this value Q of s, a as the target value y that we're training the neural network to approximate.

Notice that I did say reinforcement learning is different from supervised learning, but what we're going to do is not input a state and have it output an action. What we're going to do is input a state-action pair and have it try to output Q of s, a, and using a neural network inside the reinforcement learning algorithm this way will turn out to work pretty well. We'll see the details in a little bit, so don't worry if it doesn't make sense yet. But if you can train a neural network, with appropriate choices of parameters in the hidden layers and in the output layer, to give you good estimates of Q of s, a, then whenever your Lunar Lander is in some state s, you can use the neural network to compute Q of s, a for all four actions. You can compute Q of s, nothing, Q of s, left, Q of s, main, Q of s, right, and then finally, whichever of these has the highest value, you pick the corresponding action a. For example, if out of these four values, Q of s, main is largest, then you would decide to go and fire the main engine of the Lunar Lander.

The question becomes, how do you train a neural network to output Q of s, a? It turns out the approach will be to use Bellman's equation to create a training set with lots of examples x and y, and then we'll use supervised learning, exactly as you learned in the second course when we talked about neural networks, to learn a mapping from x to y, that is, a mapping from the state-action pair to this target value Q of s, a. But how do you get the training set with values for x and y that you can then train a neural network on? Let's take a look.
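To make this concrete, here is a minimal sketch in Python with TensorFlow/Keras of the network architecture just described, plus a helper that picks an action by evaluating Q of s, a for all four actions. The function names build_q_network and pick_action are just illustrative, not part of the course code.

```python
# A minimal sketch of the Q-network described above, assuming TensorFlow/Keras.
# Input x is 12 numbers (8 state values + 4 one-hot action values); the single
# output approximates Q(s, a).
import numpy as np
import tensorflow as tf

def build_q_network():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(12,)),            # 8 state numbers + 4 one-hot action numbers
        tf.keras.layers.Dense(64, activation="relu"),  # first hidden layer, 64 units
        tf.keras.layers.Dense(64, activation="relu"),  # second hidden layer, 64 units
        tf.keras.layers.Dense(1),                      # single output unit: the estimate of Q(s, a)
    ])

def pick_action(q_network, state):
    """Evaluate Q(s, a) for all four actions and return the index of the best one."""
    actions = np.eye(4)                                   # one-hot encodings of nothing, left, main, right
    x = np.hstack([np.tile(state, (4, 1)), actions])      # four 12-number inputs, one per action
    q_values = q_network.predict(x, verbose=0).flatten()  # Q(s, nothing), Q(s, left), Q(s, main), Q(s, right)
    return int(np.argmax(q_values))                       # pick the action with the highest Q value
```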
Here's the Bellman equation: Q of s, a equals R of s plus gamma times the max over a' of Q of s', a'. The right-hand side is what you want Q of s, a to be equal to, so I'm going to call this value on the right-hand side y, and the input to the neural network is a state and an action, so I'm going to call that x. The job of the neural network is to input x, that is, input the state-action pair, and try to accurately predict what the value on the right will be. In supervised learning, we were training a neural network to learn a function f, which depends on a bunch of parameters w and b, the parameters of the various layers of the neural network, and it was the job of the neural network to input x and hopefully output something close to the target value y. The question is, how can we come up with a training set with values x and y for a neural network to learn from?

Here's what we're going to do. We're going to use the lunar lander and just try taking different actions in it. If we don't have a good policy yet, we'll take actions randomly: fire the left thruster, fire the right thruster, fire the main engine, do nothing. By just trying out different things in the lunar lander simulator, we'll observe a lot of examples of when we're in some state and we took some action, maybe a good action, maybe a terrible action, either way. Then we got some reward R of s for being in that state, and as a result of our action, we got to some new state s'. As you take different actions in the lunar lander, you record many of these (s, a, R of s, s') groups, which we often call tuples in Python code. For example, maybe one time you're in some state s, and just to give this an index, we'll call it s^1, and you happen to take some action a^1; this could be nothing, left thruster, main engine, or right thruster. As a result of which you got some reward, and you wound up at some state s'^1. Maybe at a different time you're in some other state s^2, you took some other action, could be a good action, could be a bad action, could be any of the four actions, and you got the reward, and then you wound up with s'^2, and so on, multiple times. Maybe you've done this 10,000 times or even more than 10,000 times, so you would have saved away not just s^1, a^1, and so on, but up to s^10,000, a^10,000.

It turns out that each of these tuples will be enough to create a single training example, x^1, y^1. In particular, here's how you do it. There are four elements in this first tuple. The first two will be used to compute x^1, and the second two will be used to compute y^1. In particular, x^1 is just going to be s^1 and a^1 put together: s^1 would be eight numbers, the state of the lunar lander, and a^1 would be four numbers, the one-hot encoding of whatever action this was. And y^1 would be computed using the right-hand side of the Bellman equation. In particular, the Bellman equation says, when you input s^1, a^1, you want Q of s^1, a^1 to be this right-hand side, to be equal to R of s^1 plus gamma times the max over a' of Q of s'^1, a'. Notice that these two elements of the tuple on the right give you enough information to compute this. You know what R of s^1 is: that's the reward you saved away here. Plus the discount factor gamma times the max over all actions a' of Q of s'^1, a', where s'^1 is the state you got to in this example, and you take the maximum over all possible actions a'. I'm going to call this y^1. When you compute this, it will be some number, like 12.5 or 17 or 0.5 or some other number.
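As a rough sketch of how one of these tuples turns into a training example, here's how you might compute x^1 and y^1 in code, reusing the numpy import and Q-network from the earlier sketch. The tuple_to_example name and the GAMMA value are just illustrative.

```python
# A minimal sketch: turn one (s, a, R(s), s') tuple into a training example (x, y)
# using the Bellman equation and the current guess q_network for the Q function.
GAMMA = 0.995  # illustrative discount factor; use whatever your MDP specifies

def tuple_to_example(q_network, s, a, reward, s_prime):
    # x: the 8 state numbers followed by the one-hot encoding of action a
    x = np.hstack([s, np.eye(4)[a]])

    # y: R(s) + gamma * max over a' of Q(s', a'), with Q given by the current guess
    actions = np.eye(4)
    x_prime = np.hstack([np.tile(s_prime, (4, 1)), actions])
    q_prime = q_network.predict(x_prime, verbose=0).flatten()
    y = reward + GAMMA * np.max(q_prime)
    return x, y
```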
We'll save that number here as y^1, so that this pair x^1, y^1 becomes the first training example in this little dataset we're building. Now, you may be wondering, where does Q of s', a', or Q of s'^1, a', come from? Well, initially we don't know what the Q function is. But it turns out that when you don't know what the Q function is, you can start off with a totally random guess for what the Q function is. We'll see on the next slide that the algorithm will work nonetheless. At every step, Q here is just going to be some guess, which it turns out will get better over time, of what the actual Q function is.

Let's look at the second example. If you had a second experience where you were in state s^2, took action a^2, got that reward, and then got to that state s'^2, then we would create a second training example in this dataset, x^2, where the input is now s^2, a^2, so the first two elements go into computing the input x, and then y^2 will be equal to R of s^2 plus gamma times the max over a' of Q of s'^2, a'. Whatever this number is, y^2, we put it over here in our small but growing training set, and so on and so forth, until maybe you end up with 10,000 training examples with these x, y pairs. What we'll see later is that we'll actually take this training set, where the x's are inputs with 12 features and the y's are just numbers, and train a neural network with, say, the mean squared error loss to try to predict y as a function of the input x. What I described here is just one piece of the learning algorithm we'll use. Let's put it all together on the next slide and see how it all comes together into a single algorithm.

Let's take a look at what a full algorithm for learning the Q function looks like. First, we're going to take our neural network and initialize all the parameters of the neural network randomly. Initially we have no idea what the Q function is, so let's just pick totally random values for the weights. We'll pretend that this neural network is our initial random guess for the Q function. This is a little bit like when you are training linear regression and you initialize all the parameters randomly and then use gradient descent to improve the parameters. Initializing it randomly for now is fine. What's important is whether the algorithm can slowly improve the parameters to get a better estimate. Next, we will repeatedly do the following: we will take actions in the lunar lander, so float around randomly, take some good actions, take some bad actions. It's okay either way. But you get lots of these tuples of when it was in some state s, you took some action a, got the reward R of s, and got to some state s'. What we will do is store the 10,000 most recent examples of these tuples. As you run this algorithm, you will see many steps in the lunar lander, maybe hundreds of thousands of steps. But to make sure we don't end up using excessive computer memory, common practice is to just remember the 10,000 most recent such tuples that we saw taking actions in the MDP. This technique of storing only the most recent examples is sometimes called the replay buffer in reinforcement learning algorithms. For now, we're just flying the lunar lander randomly, sometimes crashing, sometimes not, and getting these tuples as experience for our learning algorithm. Occasionally, then, we will train the neural network. In order to train the neural network, here's what we'll do. We'll look at these 10,000 most recent tuples we had saved and create a training set of 10,000 examples.
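Here's one way the replay buffer and the training-set construction might look in code, again just a sketch building on the helpers above; a deque with maxlen of 10,000 automatically drops the oldest tuples so only the 10,000 most recent are kept.

```python
# A minimal sketch of the replay buffer: store only the 10,000 most recent
# (s, a, R(s), s') tuples, then turn them into a supervised training set.
from collections import deque

replay_buffer = deque(maxlen=10_000)   # oldest tuples are discarded automatically

def store_experience(s, a, reward, s_prime):
    replay_buffer.append((s, a, reward, s_prime))

def build_training_set(q_network):
    examples = [tuple_to_example(q_network, s, a, r, s_prime)
                for (s, a, r, s_prime) in replay_buffer]
    X = np.array([x for x, _ in examples])   # shape (N, 12)
    Y = np.array([y for _, y in examples])   # shape (N,)
    return X, Y
```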
The training set needs lots of pairs of x and y. For our training examples, x will be the s, a from the first part of each tuple. That will be a list of 12 numbers: the 8 numbers for the state and the 4 numbers for the one-hot encoding of the action. The target value that we want the neural network to try to predict will be y equals R of s plus gamma times the max over a' of Q of s', a'. How do we get this value of Q? Well, initially it's this neural network that we had randomly initialized. It may not be a very good guess, but it's a guess. After creating these 10,000 training examples, we'll have training examples x^1, y^1 through x^10,000, y^10,000. We'll train a neural network, and I'm going to call the new neural network Q_new, such that Q_new of s, a learns to approximate y. This is exactly like training that neural network f with parameters w and b to input x and try to approximate the target value y. Now this neural network should be a slightly better estimate of what the Q function, or the state-action value function, should be. What we'll do next is take Q and set it to this new neural network that we had just learned. Many of the ideas in this algorithm are due to Mnih et al.

It turns out that if you run this algorithm, where you start with a really random guess of the Q function and then use Bellman's equation to repeatedly try to improve the estimate of the Q function, then by doing this over and over, taking lots of actions and training a model, you will improve your guess for the Q function. For the next model you train, you now have a slightly better estimate of what the Q function is, and then the next model you train will be even better. When you update Q equals Q_new, then the next time you train a model, Q of s', a' will be an even better estimate. As you run this algorithm, on every iteration Q of s', a' hopefully becomes an even better estimate of the Q function, so that when you run the algorithm long enough, this will actually become a pretty good estimate of the true value of Q of s, a, and you can then use it to pick hopefully good actions in the MDP. The algorithm you just saw is sometimes called the DQN algorithm, which stands for Deep Q-Network, because you're using deep learning and a neural network to train a model to learn the Q function, hence DQN, or learning Q using a neural network. If you use the algorithm as I described it, it will work okay on the lunar lander. Maybe it'll take a long time to converge, maybe it won't land perfectly, but it'll work. But it turns out that with a couple of refinements, the algorithm can work much better. In the next few videos, let's take a look at some refinements to the algorithm that you just saw.
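Putting the pieces together, here's a rough sketch of the full loop described above. It assumes a lunar lander simulator with a gym-style interface, wrapped here in hypothetical env_reset and env_step helpers because reset/step signatures vary across versions, and the iteration counts are illustrative, not tuned.

```python
# A minimal sketch of the full DQN loop: act in the simulator, store tuples,
# build targets with the Bellman equation, train Q_new, then set Q = Q_new.
q_network = build_q_network()                      # random initial guess for Q

num_iterations = 100                               # illustrative numbers, not tuned
steps_per_iteration = 1_000

for _ in range(num_iterations):
    # Fly around and collect experience tuples.
    state = env_reset(env)                         # hypothetical wrappers around the
    for _ in range(steps_per_iteration):           # simulator's reset/step calls
        action = pick_action(q_network, state)     # (or act randomly some of the time)
        next_state, reward, done = env_step(env, action)
        store_experience(state, action, reward, next_state)
        state = env_reset(env) if done else next_state

    # Train a new network on the 10,000 most recent tuples, then replace Q.
    X, Y = build_training_set(q_network)
    q_new = build_q_network()
    q_new.compile(optimizer="adam", loss="mse")    # mean squared error on the targets y
    q_new.fit(X, Y, epochs=10, verbose=0)
    q_network = q_new
```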