You are now in the third week of the Capstone Project, which means we are halfway through building a complete RL system. So far, we have formalized the lunar lander problem using the language of MDPs. Last week, we discussed which algorithm to use to solve that MDP. Now, let's discuss the meta-parameter choices that you will have to make to fully implement the agent. This means we need to decide on the function approximator, choices in the optimizer for updating the action values, and how to do exploration.

So which function approximator will you use? My general advice is to always start simple. For function approximation, that would mean using a fixed basis like tile coding. Unfortunately, that might not be the best choice for this problem without carefully designing the tile coder. If you tile all the inputs together, the number of features grows exponentially with the input dimension. For example, if you want to use ten tiles per dimension for this eight-dimensional problem, you could easily end up with 100 million features. So maybe we should consider using a neural network instead. One hidden layer should be sufficiently powerful to represent the value function for lunar lander, and it will be a bit easier for you to implement.

We need to decide the number of hidden units in that layer. Remember that you get to choose the size of the hidden layer of a neural network. As you add more nodes to a layer, you add more representational power. However, the more nodes you add, the more parameters there are to learn. We also need to pick the activation functions. We could use a sigmoidal function like tanh, but these have issues with saturation. Think about when the inputs to the activation function are high magnitude, either positive or negative. The gradient is nearly zero in these flat regions of the activation. A gradient near zero does not provide much signal to change our weights and can slow learning. We could also use a linear threshold unit, or LTU.
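To make the architecture concrete, here is a minimal sketch of a one-hidden-layer action-value network with ReLU activations, written with numpy. The eight inputs and four action values match lunar lander, but the 128 hidden units, the weight scale, and the function names are illustrative choices, not the course's prescribed implementation.

```python
import numpy as np

def init_network(n_inputs=8, n_hidden=128, n_actions=4, seed=0):
    # Small random weights; the hidden-layer size is one of the
    # meta-parameters you get to choose.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def action_values(params, state):
    # ReLU is flat only for negative inputs, so unlike tanh it does
    # not saturate on both sides.
    h = np.maximum(0.0, state @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]
```

Calling `action_values(params, state)` on an eight-dimensional state vector returns one estimated value per action, which the agent can then feed into its exploration policy.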
But again, these flat regions make it hard to train the neural network. A pretty effective and common choice is to use rectified linear units, or ReLUs. So let's go ahead with those.

We also need to discuss how we are going to train the neural network. Using vanilla stochastic gradient descent will likely be too slow for this project. So what are other options? We could try an algorithm called AdaGrad. The downside is that AdaGrad decays the step sizes towards zero, which can be problematic for non-stationary learning. We could try RMSProp, which uses information about the curvature of the loss to improve the descent step. However, we'd also like to incorporate momentum to speed up learning. A good choice can be the ADAM optimizer. This combines the curvature information from RMSProp with momentum.

We finally need to discuss which exploration method we will use. What about optimistic initial values? This would be a reasonable choice if we were using a linear function approximator with non-negative features. But since we are using a neural network, it is difficult to maintain optimistic values, so it is unlikely to be effective. We could also consider epsilon-greedy, which is very straightforward to implement. The downside, though, is that its exploration completely ignores whatever information the action values might have. It is equally likely to explore an action with really negative value as an action with moderate value. How about we use a Softmax policy instead? This choice could be better because the probability of selecting an action is proportional to the value of that action. This way we are less likely to explore actions that we think are really bad. By the way, in Course 3, we only talked about Softmax policies on action preferences. We used policy gradient methods to adjust the action preferences. But it is not a big leap to consider using an action-value method like expected SARSA, and use a Softmax directly on the learned action values.
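To see how ADAM combines the two ideas just mentioned, here is a sketch of the update for a single weight vector, assuming we are minimizing a TD loss by gradient descent. The function name and the default hyperparameters are illustrative; the two moving averages and the bias correction are the standard ADAM mechanics.

```python
import numpy as np

def adam_update(w, grad, m, v, t,
                step_size=1e-3, beta_m=0.9, beta_v=0.999, eps=1e-8):
    # Momentum: exponentially weighted average of past gradients.
    m = beta_m * m + (1 - beta_m) * grad
    # RMSProp-style average of squared gradients (curvature information).
    v = beta_v * v + (1 - beta_v) * grad ** 2
    # Bias correction, needed because m and v start at zero.
    m_hat = m / (1 - beta_m ** t)
    v_hat = v / (1 - beta_v ** t)
    # Per-weight step sizes: large curvature shrinks the step.
    w = w - step_size * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The agent keeps `m` and `v` alongside the weights and increments the step counter `t` on every update.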
There are a few things to consider when using a Softmax on the action values. First, let's think about how it affects the expected SARSA update. Remember that we need to compute the expectation over action values for the next state. This means we'll need to compute the probabilities for all the actions first under the Softmax function. Next, we also need to consider how much the agent focuses on the highest-valued actions. We control this with a temperature parameter called tau. If tau is large, then the agent is more stochastic and explores more of the actions. For very large tau, the agent behaves nearly like a uniform random policy. For very small tau, the agent mostly selects the greedy action.

Finally, we need to consider an additional trick to avoid overflow issues when computing the Softmax. Imagine that the action values are large. Exponentiating those values can produce very large numbers. Instead, we can use the fact that subtracting a constant from the action values when computing the probabilities has no effect. For example, we can subtract the maximum action value divided by the temperature. Then all the exponents are negative, and we avoid taking the exponent of large positive numbers.

Altogether, we now have a reasonable strategy to learn an optimal soft policy that also explores a bit more intelligently than epsilon-greedy. The agent takes actions according to its current Softmax policy and uses expected SARSA updates.

That's it for this video. Today, we brainstormed some of the key choices in your agent. Overall, there are a lot of choices to make. Most of these choices we set by reasoning through what might be most appropriate, like we did for the choices in the function approximator. Other choices, like specific step sizes in the optimizer or exploration parameters like the temperature, can be less obvious to simply select. We will discuss more ways to determine these parameters in Module 5. See you then.
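The two pieces above, the numerically stable Softmax with a temperature and the expectation in the expected SARSA target, can be sketched together as follows. This is a minimal illustration, not the assignment's reference code; the function names are mine.

```python
import numpy as np

def softmax_probs(q, tau=1.0):
    # Divide the action values by the temperature, then subtract the
    # maximum preference so every exponent is <= 0 and np.exp cannot
    # overflow. Subtracting a constant leaves the probabilities unchanged.
    prefs = q / tau
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def expected_sarsa_target(reward, gamma, q_next, tau=1.0):
    # Expectation over the next state's action values under the
    # current Softmax policy.
    pi = softmax_probs(q_next, tau)
    return reward + gamma * np.sum(pi * q_next)
```

With equal action values the Softmax is uniform, and as tau shrinks the probabilities concentrate on the greedy action, which is exactly the temperature behavior described above.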