So in a previous video, we converted our problem to a more standard formulation, where the unknown Q-function is expanded in a new basis Ψ_n of dimension 3M, with coefficients W_t given by the parameters of the original matrix W_t stored column-wise. Now our problem is to compute these coefficients W_t. As in the dynamic programming solution, we compute the coefficients recursively, by going backwards in time: starting from time T-1, then T-2, and so on, all the way to the current time t = 0. The way to do it is similar to what we did with the dynamic programming approach. Again, we look at the Bellman optimality equation, shown here. It says that the time-t Q-function is given by the expectation of the expression on the right-hand side of this equation. Now we can interpret it as a regression of this expression onto the function Q_t, written as an expansion in state-action basis functions Ψ_n, where ε_t is a random noise term with mean zero. These two equations are clearly equivalent in expectation, because if we take expectations of both sides of the second equation, we recover the Bellman optimality equation. So now we have a standard regression problem for coefficients W_t in this setting. Let's note that this is both similar to and also different from how we computed the optimal Q-function, with the optimal action, in our dynamic programming solution in the last week. There we had a regression approximation for the Bellman equation for the optimal Q-function, evaluated at the optimal action a_t*. Respectively, we expanded the Q-function in a set of basis functions Φ_n(x), because the action a_t there was fixed to be the optimal action a_t*. Rewards R_t were computed there as a part of the solution of the problem. Now, all this can be compared with the present reinforcement learning setting.
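To make the expansion concrete, here is a minimal numpy sketch of how a state-action basis of dimension 3M can be built from a state basis Φ(x) and the action vector (1, a, a²/2). The simple polynomial choice of Φ here is a placeholder assumption for illustration only, standing in for whatever state basis (for example, splines) the lectures actually use.

```python
import numpy as np

def state_basis(x, M=4):
    # Hypothetical state basis Phi_m(x), m = 0..M-1: simple monomials
    # x^0, x^1, ..., x^(M-1); a stand-in for the real choice of basis.
    return np.vander(x, M, increasing=True)            # shape (K, M)

def state_action_basis(x, a, M=4):
    # Psi_n(x, a): outer product of the action vector A = (1, a, a^2/2)
    # with the state basis Phi(x), flattened to a vector of length 3*M.
    Phi = state_basis(x, M)                            # (K, M)
    A = np.stack([np.ones_like(a), a, 0.5 * a**2], axis=1)  # (K, 3)
    return (A[:, :, None] * Phi[:, None, :]).reshape(len(x), 3 * M)

# Example: 3 observed (state, action) pairs, M = 4 state basis functions
x = np.array([0.0, 0.5, 1.0])
a = np.array([1.0, -1.0, 2.0])
Psi = state_action_basis(x, a, M=4)                    # shape (3, 12)
```

The linear regression for W_t then runs on the K×3M design matrix `Psi`, exactly as in an ordinary least-squares problem.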
This time, we compute the optimal Q-function using a regression interpretation of the Bellman equation for arbitrary actions a_t, and not just for the optimal action a_t*, as we had in the dynamic programming solution. The other related difference is that now our function approximation for the Q-function relies on an expansion in state-action basis functions Ψ_n, rather than state-dependent basis functions Φ_n, as was the case with the dynamic programming solution. And furthermore, while in the dynamic programming approach the rewards are computed rather than seen, in the reinforcement learning setting both the rewards R_t and the actions a_t are observed. So, because we reduced the problem to a regression, we can now find the coefficients W_t by minimization of the objective function shown in this equation. Note that this relation holds for any data, including data that may be off-policy, that is, collected under a suboptimal policy. Before we move to this general case, let's discuss what we should get for the simple case of on-policy learning, when all recorded actions are optimal actions. For this case we simply substitute a_t by a_t* in the previous equation. Then we have the first equation shown here. But now we can see that it is almost identical to the objective function shown in the second equation here, which is the objective function we had in our dynamic programming solution. The only difference is in the last term in these two equations, but this is exactly the term that is optimized by the choice of parameters. If we have an infinite number of observations, then these two objective functions will have the same minimum, because the last terms can be made to coincide pointwise in this limit. Therefore, both methods converge asymptotically to the same solution if we deal with on-policy Q-learning.
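The objective function just described is an ordinary least-squares loss on the Bellman regression target. A hedged sketch, with function and variable names chosen for illustration:

```python
import numpy as np

def bellman_lsq_objective(W, Psi, rewards, q_next_max, gamma=0.99):
    # L_t(W) = sum_k ( R_{t,k} + gamma * max_a Q*_{t+1}(X_{t+1,k}, a)
    #                  - sum_n W_n Psi_n(X_{t,k}, a_{t,k}) )^2
    target = rewards + gamma * q_next_max   # regression target from the Bellman equation
    resid = target - Psi @ W
    return float(resid @ resid)
```

Because the Q-function is linear in W, minimizing this over W is a plain linear least-squares problem, solvable for example with `np.linalg.lstsq(Psi, target)`.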
And because Q-learning is an off-policy algorithm, and because the optimization for coefficients W_t is convex, we are guaranteed that the off-policy method asymptotically converges to the same solution. We can also find an explicit link between the two solutions for large but finite N, if we impose equality between them in the least-squares sense. This gives us the objective function in the first equation here. Solving it for coefficients W_t in terms of coefficients ω_t of the DP solution, we obtain the second equation here. This equation gives the solution of the reinforcement learning problem for on-policy learning in terms of the dynamic programming solution, and this relation becomes exact in the limit where N goes to infinity. Now, let's come back to the more general case of finite N and off-policy learning, given by our least-squares optimization for coefficients W_t. This case is also easy to solve. Similarly to our dynamic programming solution, we introduce a matrix S_t and a vector M_t, shown in these equations and expressed in terms of the new basis functions Ψ_n. Then the optimal weights W_t* are simply given by the inverse of matrix S_t times vector M_t. Once we have the coefficients W_t*, we know the optimal Q-function, and we also know the elements of the vector U_W that we introduced earlier. We compute this optimal Q-function for the current time step t, and then we move to the previous time step t-1 in our backward recursion. But now, according to the Bellman optimality equation, we will need the optimal Q-function at the optimal action a_t* from the next time step t to solve for coefficients W_{t-1} for the current time step. How exactly we do this is very important for the algorithm, so let's discuss it next. Once we solve for coefficients W_t for the current time step t, we can compute the elements of vector U_W that we introduced earlier, and write the optimal Q-function for the next time step, t+1,
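The closed-form solution W_t* = S_t⁻¹ M_t described above can be sketched in a few lines of numpy; the tiny ridge term is an assumption added here purely to keep the matrix invertible on small samples:

```python
import numpy as np

def solve_weights(Psi, rewards, q_next_max, gamma=0.99, reg=1e-10):
    """One backward step: W_t* = S_t^{-1} M_t for the least-squares problem.

    Psi        : (K, 3M) state-action basis values Psi_n(X_{t,k}, a_{t,k})
    rewards    : (K,)    observed rewards R_{t,k}
    q_next_max : (K,)    optimal next-step Q-values max_a Q*_{t+1}
    """
    # S_t^{(nm)} = sum_k Psi_n Psi_m   (tiny ridge added for invertibility)
    S = Psi.T @ Psi + reg * np.eye(Psi.shape[1])
    # M_t^{(n)}  = sum_k Psi_n (R_{t,k} + gamma * max_a Q*_{t+1})
    M = Psi.T @ (rewards + gamma * q_next_max)
    return np.linalg.solve(S, M)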
as a second-order polynomial in components of this vector, as shown in this equation. But the critical point is that it would be completely wrong to use the point of maximum of this expression, as a function of a_{t+1}, as the optimal action a_{t+1}*. This is because it would amount to using the same dataset to estimate both the optimal action and the optimal Q-function, which would lead to a possible overestimation of Q_{t+1}*, due to Jensen's inequality and the convexity of the max function. The correct way to proceed is to plug in our analytical solution for a_{t+1}*, shown here, which was obtained at the previous step of the backward recursion. Due to the availability of this analytical optimal action, we do not face a potential overestimation problem in our setting, because now we do not use the same dataset to estimate two functions: the calculation of the optimal action is now analytical. And here I would like to note that the potential for overestimation is a classical problem of Q-learning, where it is sometimes addressed using various numerical fixes, such as, for example, double Q-learning. You can read more on such methods in the additional reading for this week. But in our framework, such fixes are not needed, because analytical solutions are available. This produces a numerically stable solution for the model. So, to summarize, we saw how to make one backward step in the fitted Q-iteration method. Going in this way all the way back to the current time t = 0, we get the optimal option hedge and price, but this time in a completely data-driven and model-independent way, due to the fact that fitted Q-iteration is a model-free and off-policy algorithm.
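The overestimation effect mentioned above is easy to reproduce in a toy experiment: when several actions have the same true value, the max of their noisy estimates is biased upward by Jensen's inequality. This is exactly the bias that numerical fixes like double Q-learning target, and that the analytical optimal action sidesteps. A minimal illustration (the numbers here are arbitrary, chosen only to make the bias visible):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

# Every action's TRUE value is 0, so the true maximum value is also 0.
# Each trial produces one noisy value estimate per action.
noisy_estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Taking the max over noisy estimates of equally good actions
# systematically overestimates the true maximum value:
bias = noisy_estimates.max(axis=1).mean()
print(bias)  # clearly positive, far from the true value 0
```

The average of the per-trial maxima sits well above zero, even though no action is actually better than any other, which is why maximizing a Q-function fitted on the same data it is evaluated on inflates Q_{t+1}*.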