In this video we are going to talk about the global optimality of agent decisions, and also about how to express such a degree of optimality in a convenient mathematical form. Recall that our ultimate goal, as I have said previously, is to find an optimal policy. There are many different ways to do so, and today we will cover one of the most fundamental methods, dynamic programming. What is dynamic programming? It is a method of solving a complex problem in a step-by-step fashion. First, break the problem into a set of small pieces. Second, launch an iterative process: until no unsolved pieces remain, solve the next piece, reusing the results of all the pieces already solved. This dynamic programming approach lies at the very heart of reinforcement learning, and thus it is essential to understand it deeply.

Before we delve into the dynamic programming approach, let us first concentrate on a measure of the optimality of the agent's behavior. Such a measure is absolutely required to build an optimal policy. So, what is the right measure of optimality of an agent's actions? Well, in fact we have already seen a possible one: the cumulative discounted reward, which is also called the return. Maybe this is the right measure of global optimality of agent decisions. Can we say that if the sum of discounted rewards is the largest possible from any time t onward, then the agent acts optimally? Well, almost. There is one interesting peculiarity of the return: it is random. The return may be random both because of policy randomness and because the environment itself could be random. The agent can choose actions by sampling them from a state-dependent distribution, and the environment can react with a different reward and next state given the same state and action. Thus the question arises: how should one maximize the return, which is a random variable? How do we handle this randomness?

One of the most intuitive ways of accounting for randomness is to take an expectation. This approach of taking an expectation of the return is precisely what is used in reinforcement learning, and it corresponds to the notion of a value function. The value function, v(s), is also called the state value function because it depends only on the current state. The intuition is as follows: the value function is the mean return that the agent could get from the environment, starting from state s and following policy pi onward. The value function, as you may have guessed, is defined simply as the expected return, conditioned on the state the agent currently stands in, that is, S sub t. That is expressed in the first line of the formula on this slide. This formula is a decomposition of the value function which you will encounter many times during the course. Now we are going to go through this formula line by line.

Note that this expectation is taken over both policy stochasticity and environment stochasticity, and over all time steps from time t to the very end of the episode. Taking the expectation with respect to the randomness in each and every state doesn't seem very practical. So we proceed with the second line of the formula, which decomposes the return under the expectation into the sum of two terms: the immediate reward, R sub t, and the discounted return from the next state onward, gamma times G sub t+1. In fact, we can write the expectation over both sources of randomness in an explicit manner.
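The slide the speaker refers to is not reproduced in the transcript. The following is a hedged reconstruction of the line-by-line decomposition just described, using the common notation pi(a|s) for the policy and p(s', r | s, a) for the environment dynamics; the exact symbols on the actual slide may differ.

```latex
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right] \\
           &= \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, G_{t+1} \mid S_t = s \,\right] \\
           &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
              \Bigl[\, r + \gamma\, \mathbb{E}_{\pi}\!\left[\, G_{t+1} \mid S_{t+1} = s' \,\right] \Bigr] \\
           &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
              \Bigl[\, r + \gamma\, v_{\pi}(s') \Bigr]
\end{aligned}
```

Note that some textbooks index the immediate reward as R sub t+1 rather than r sub t; the indexing above simply follows the transcript.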
And this is precisely what is going on in the third line of the formula. Additionally, note that on that line we have pushed the expectation with respect to all future randomness inside the outer sums. This expectation is highlighted in green on the slide. Now, note that this expectation is by definition nothing but the value function of the next state, s prime. This recurrence is written explicitly on the last line of the formula. Please make sure that everything in this line is clear to you, because this recurrence is ubiquitous in reinforcement learning. The decomposition of the value function into the immediate reward and the discounted value of the next state, the last line, is something we are going to rely on heavily. This decomposition is so important that it has its own name: it is called the Bellman expectation equation for the value function. So, to reiterate: to compute the right-hand side of the Bellman expectation equation, we should average the reward plus the discounted next value over all combinations of actions, rewards, and next states, weighted by the probabilities of their occurrence.

We can also think about the computation in the Bellman expectation equation as propagating values in a tree. This tree is known in the reinforcement learning literature as a backup tree, or backup diagram. The backup diagram consists of all possible combinations of action selections and environment reactions to the selected action. At the root of the tree there is the state s for which we want to compute the value function. In that root node the agent can choose among different actions, which are represented as filled blue circles on the second level of the tree. Once the action is selected, the environment takes its turn. For each action made in the root state, the environment can react with a reward and a transition to the next state. These possible next states are denoted by filled green circles at the bottom level of the backup tree.

Each arc in this tree diagram is associated with its own probability. The top level of the tree corresponds to action selection, so the top-level arcs are associated with the probabilities assigned to actions by our policy pi. The bottom-level arcs, on the other hand, correspond to the probabilities of receiving a particular reward and transitioning to a particular next state. These probabilities are known as the environment dynamics.

To compute the value of the state s in the root, according to the formula on the slide, we need to average the reward plus the discounted value of the next state over all possible combinations of action selection and environment reaction. This averaging can be represented as backing up the values from the leaves to the root of the tree. That is, we start from the leaves; in each leaf we compute the value of that next state, s prime. Then we propagate the computed values back up to the root. Once a value exits a next-state node it gets multiplied by the discount factor gamma, and as a value ascends a particular arc, it gets multiplied by the probability associated with that arc. Additionally, in each node of this tree, all the values incoming through the arcs are summed and propagated further up to the root. Note that this diagram is nothing but a visual representation of the computation performed by the Bellman expectation equation. However, such a visual cue may be useful for understanding methods and contrasting them with each other.

Another crucial concept in the reinforcement learning literature is the action value function, q.
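As an illustration only, here is a minimal Python sketch of this backup computation for a single state. The data structures policy[s][a] and dynamics[(s, a)] (a list of (probability, reward, next_state) triples) are hypothetical placeholders for the policy pi and the environment dynamics, not something defined in the lecture.

```python
def backup_state_value(s, policy, dynamics, V, gamma=0.99):
    """One application of the Bellman expectation equation for v(s).

    policy[s][a]      -- probability pi(a | s) of picking action a in state s
    dynamics[(s, a)]  -- list of (prob, reward, next_state) triples, i.e. p(s', r | s, a)
    V                 -- dict with current value estimates of the next states (the leaves)
    """
    value = 0.0
    for a, pi_a in policy[s].items():                  # top-level arcs: action probabilities
        for prob, reward, s_next in dynamics[(s, a)]:  # bottom-level arcs: environment dynamics
            value += pi_a * prob * (reward + gamma * V[s_next])
    return value


# Toy example: one state with two equiprobable actions and deterministic dynamics.
policy = {"s0": {"left": 0.5, "right": 0.5}}
dynamics = {("s0", "left"): [(1.0, 0.0, "s1")],
            ("s0", "right"): [(1.0, 1.0, "s2")]}
V = {"s1": 2.0, "s2": 0.0}
print(backup_state_value("s0", policy, dynamics, V, gamma=0.9))  # 0.5*(0 + 0.9*2) + 0.5*(1 + 0) = 1.4
```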
It is a function of two arguments: state s and action a. The action value function is defined similarly to the state value function, but with an interesting twist. It is defined as the expectation of the return conditioned on both the state and the action. That is a subtle but very important distinction. The intuition of the state-action value function, which is also called the q function, is as follows. The q function is the mean return that the agent could get from the environment after making action a in state s and subsequently following policy pi. Please note that the action a, the argument of the q function, is not connected with the policy pi. The policy pi could even have zero probability of selecting action a in state s, but this by no means restricts us from being able to evaluate the value of action a in state s.

The formula, and its recursive form, is very similar to those of the value function. One important difference exists, though. When computing the q function, one is free of policy stochasticity in the first action selection step. Equivalently, we don't have the sum over all possible initial actions, because we know for sure that the first action is equal to the argument of the q function. Otherwise, the definition of the q function works precisely the same way as for the value function.

For a deep understanding of reinforcement learning, it is really necessary to get acquainted with the value and action value functions. Moreover, we will need to re-express one function in terms of the other, so let us first make sure we understand how to do this. By now, we already know how to write the q function recursively in terms of the v function. The formula stems from the direct decomposition of the q function into the immediate reward and the value of the next state. To aid understanding, this dependence of the q function on the value of the next state is repeated from the previous slide as the first formula. But what if we want to write the v function in terms of the q function? That is, what if we want to express the value of a state in terms of the action values of that state? Well, in fact, this is pretty easy. If you recall the definitions of the v and q functions, you'll notice that the v function is nothing but an expectation of the action value function with respect to the policy probabilities. That observation is highlighted in green in the second formula on the slide. This observation gives us the means to write the q function of the current state and action in terms of the q function of the next state and next actions. This dependence is made explicit in the last formula on the slide.

Meanwhile, the dependence of the q function on the q function of the next state can also be visualized with a backup tree. That is precisely what we are going to do next. At the top of the slide, there is a formula that expresses a q function in terms of q functions of the next state and actions. This diagram looks strikingly similar to the previously discussed backup tree for the value function computation. However, unlike the previous backup diagram, this diagram for the q function has a different root. It is no longer a state, but rather a pair of state and action, that is, the arguments of the q function. The top-level arcs of the tree are also changed. For the computation of q(s,a), the top-level arcs are the environment reactions to the action a in state s. That is, the environment reacts with a reward r and a next state s prime, and it reacts to the action a made in state s.
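Again, the slide itself is not shown in the transcript; the following is a hedged reconstruction of the three relations described above, in the same notation used earlier.

```latex
\begin{aligned}
q_{\pi}(s, a) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\, A_t = a \,\right]
               = \sum_{s',\, r} p(s', r \mid s, a)\Bigl[\, r + \gamma\, v_{\pi}(s') \Bigr]
               && \text{(q in terms of v)} \\
v_{\pi}(s) &= \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)
               && \text{(v in terms of q)} \\
q_{\pi}(s, a) &= \sum_{s',\, r} p(s', r \mid s, a)\Bigl[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a') \Bigr]
               && \text{(q in terms of next q)}
\end{aligned}
```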
In the next state s prime, however, many different actions are possible, and the probabilities of those actions are associated with the arcs on the bottom level. The computational part, however, hasn't changed. As before, we start from the very bottom of the tree, compute the action values at the leaves, and multiply the values by the arc probabilities of the corresponding actions. Then we sum all the incoming values in the intermediate green node corresponding to the next state, s prime, and multiply the sum by gamma. Then the same process is done on the top level, with the probabilities of actions replaced by the probabilities of the environment dynamics. (A small sketch of this computation appears at the end of this section.)

So far, we have discussed many different topics. We know what the return is, and why the return is a meaningful measure of policy performance. We also know that the Bellman expectation equations allow us to effectively assess the policy quality; they do so by recursive computation of the value and action value functions. However, our ultimate goal is still to find an optimal policy. We want to know how to act optimally in each and every state. To figure this out, however, we should first decide how to compare policies with each other. That's precisely what we are going to do next.
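Mirroring the earlier sketch for the state value backup, here is a minimal, purely illustrative Python version of the q-function backup just described. The policy and dynamics containers are the same hypothetical data structures as before, and Q holds current action value estimates for the leaf state-action pairs.

```python
def backup_action_value(s, a, policy, dynamics, Q, gamma=0.99):
    """One application of the Bellman expectation equation for q(s, a).

    dynamics[(s, a)] -- list of (prob, reward, next_state) triples, i.e. p(s', r | s, a)
    policy[s'][a']   -- probability pi(a' | s') of the next action
    Q[(s', a')]      -- current action value estimates at the leaves
    """
    value = 0.0
    for prob, reward, s_next in dynamics[(s, a)]:     # top-level arcs: environment dynamics
        # bottom-level arcs: expectation of q over the next-action probabilities
        v_next = sum(pi_a * Q[(s_next, a_next)]
                     for a_next, pi_a in policy[s_next].items())
        value += prob * (reward + gamma * v_next)     # discount and weight by the dynamics
    return value
```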