So, in the last video we introduced two parameterizations for the Q function: one in terms of a matrix W_t, and another in terms of a vector U_W, which was given by a product of the matrix W_t and a vector Phi of basis functions. We saw that if we somehow observed optimal actions a_t-star and optimal Q function values Q_t-star, this would give us two equations for the three unknown components of the vector U_W. Now, this suggests that the right solution, the one that would be found by the dynamic programming method if the dynamics were known, should lie in the space of solutions parameterized by a time-dependent matrix W_t, as we did in our specification of the Q function. So, we have to find a way to learn the matrix of coefficients W_t. Let's see how this can be done.

What we need to do is to rearrange terms in our definition of the Q function to convert it into a product of a parameter vector and a vector that depends on both the state and the action. So, how do we do this? We do it in several simple steps. First, let's write explicitly the value of the Q function as a trace of the matrix expression given here. We write it as the element-wise product of the elements of the matrix W with the elements of the matrix formed by the outer product of the vectors A and Phi. This gives us the second line in this expression. We have two types of multiplication here. First, this symbol stands for the element-wise product of two matrices; this product is also known as the Hadamard product of two matrices. Second, this other expression carries the circled-cross sign, which stands for the matrix given by the outer product of the two vectors A_t and Phi_t. By definition, the element ij of this matrix is given by the product of A_ti and Phi_tj.

So, so far so good: we converted the whole expression into a trace of a product of two matrices. But now we can do one more thing and represent this expression as a scalar product of two vectors. What we have to do to this end is to convert both matrices entering this expression into vectors. We can do this by concatenating the columns of the matrix W and of the matrix given by the outer product of A_t and Phi_t, and this gives us the third line in this equation. Now, we can compactly write this whole expression as a dot product of a vector W and a vector Psi. Here, the vector Psi_t is obtained by concatenating the columns of the outer product of the vectors A_t and Phi_t. It can be viewed as a new set of basis functions that depend on both the state and the action. The dependence on the action a_t is quadratic, but the dependence on the state X_t can be arbitrary and depends on the functional form of the original basis functions Phi_n for the state space. If we use, for example, cubic B-splines as basis functions, as we did in the dynamic programming solution to the model, then the X dependence will be locally cubic.

Now, let's compare what we got with the setting of dynamic programming. In our current Reinforcement Learning setting, we reduced the expression for the Q function to a dot product of the vector W and the new vector of state-action basis functions Psi_t. Both have 3M components, so the number of unknown coefficients in this problem is 3M, as you could also easily guess just by looking at the matrix W_t, which has dimension three by M. So, we have 3M unknowns here. Now, what about the number of observed variables?
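To make this rearrangement concrete, here is a minimal sketch in Python/NumPy of how the state-action basis vector Psi_t and the resulting Q value could be assembled. The function names are hypothetical, and the exact convention for the action vector A_t (for example, whether the quadratic term carries a factor of one half) is an assumption for illustration; it should match whatever convention was fixed earlier in the course.

```python
import numpy as np

def psi_basis(a_t, phi_t):
    """Build the state-action basis Psi_t = vec(A_t outer Phi_t).

    a_t   : scalar action at time t
    phi_t : length-M array of state basis function values Phi_n(X_t)

    The action vector A_t = (1, a_t, a_t**2 / 2) is an assumed convention;
    what matters is that Q is quadratic in the action, as in the lecture.
    """
    A_t = np.array([1.0, a_t, 0.5 * a_t**2])    # 3-component action vector (assumed form)
    outer = np.outer(A_t, phi_t)                # 3 x M matrix, element (i, j) = A_ti * Phi_tj
    return outer.flatten(order="F")             # concatenate columns -> length-3M vector

def q_value(W_t, a_t, phi_t):
    """Q_t(X_t, a_t) = vec(W_t) . Psi_t, with W_t the 3 x M coefficient matrix."""
    w_vec = np.asarray(W_t).flatten(order="F")  # same column-wise ordering as Psi_t
    return w_vec @ psi_basis(a_t, phi_t)
```

The column-wise flattening is just a bookkeeping choice; any fixed ordering works as long as W_t and the outer-product matrix are flattened the same way, so that the dot product reproduces the trace expression above.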
For each time step in the batch-mode reinforcement learning setting, we have 3N observables, because we observe the state X_t, the action a_t, and the reward R_t for each of the N historical or Monte Carlo paths for the stock. So, we have 3N observables for 3M unknowns, and if N is larger than M, then our problem is well-posed: we have N divided by M observations per free parameter in this setting.

Now, we can compare this with the situation we had in the dynamic programming approach. In that case, we had to find two quantities, the optimal policy and the optimal Q function. In our dynamic programming solution, we expanded both in M basis functions, so we had 2M unknown coefficients in total. This is less than the 3M unknown coefficients in the Reinforcement Learning setting. But on the other hand, the number of observations per time slice in the dynamic programming approach was also less than in the Reinforcement Learning approach. This is because in the dynamic programming setting, we only observe the states. So, for N Monte Carlo paths, we only had N data points at each time. The number of observations per parameter was therefore N divided by 2M, which is actually less than the ratio of N divided by M that we have for the Reinforcement Learning setting.

So, it appears that, at least when the data for Reinforcement Learning is generated from the dynamic programming solution, both the reinforcement learning and dynamic programming methods should perform about equally well. This setting is called on-policy learning, and it means that the actions used for training the Reinforcement Learning agent were actually optimal actions. It turns out that this is indeed the case, and both algorithms produce nearly identical results in such a setting.

This actually gives us a simple way to benchmark our Reinforcement Learning algorithms. Because we can solve our model, when it is fully specified, by means of dynamic programming, we can compute the optimal policy from this solution. Then we can just use the observed actions and rewards obtained with this policy as data for the Fitted Q-iteration method, and compute the optimal policy and Q function using this method. Because this solution is already known from dynamic programming, we can use it to benchmark our Reinforcement Learning solution and check the speed of convergence, the final results, and so on. This is actually very reassuring, as we now know exactly what we expect to see, so any problems with Fitted Q-iteration would immediately show up if found there. If we now want to try some other Reinforcement Learning algorithm instead of Fitted Q-iteration, again we have a good benchmark obtained with the dynamic programming approach. Therefore, we can test many different Reinforcement Learning algorithms with our simple MDP model for option pricing. Let's take a short pause here and continue in the next video with the actual solution of the model with the Fitted Q-iteration method.
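To illustrate why this setup is well-posed when N is larger than M, note that one backward step of Fitted Q-iteration reduces to a cross-sectional least-squares regression of N target values against the 3M-dimensional basis Psi. The sketch below assumes the data (states, DP-optimal on-policy actions, and rewards) have already been simulated and turned into Psi vectors as in the previous snippet; the function name and the small ridge regularizer are illustrative assumptions, not part of the lecture's derivation.

```python
import numpy as np

def fit_W_t(Psi_t, targets, ridge=1e-3):
    """One backward-recursion regression step of Fitted Q-iteration.

    Psi_t   : (N, 3*M) array -- rows are state-action basis vectors Psi(X_t, a_t),
              one per Monte Carlo path, built here from DP-optimal (on-policy) actions.
    targets : (N,) array -- Bellman targets, e.g. R_t + gamma * max_a Q_{t+1}(X_{t+1}, a).
    ridge   : small regularizer (an added assumption) that keeps the normal
              equations well-conditioned when basis functions overlap.

    Returns the length-3M coefficient vector, i.e. the flattened matrix W_t.
    """
    A = Psi_t.T @ Psi_t + ridge * np.eye(Psi_t.shape[1])   # (3M, 3M) normal matrix
    b = Psi_t.T @ targets                                   # (3M,) right-hand side
    return np.linalg.solve(A, b)
```

With N paths and 3M coefficients per time slice, this regression is exactly the N-over-M counting argument made above, and comparing the resulting Q values and policy against the known dynamic programming solution is the benchmarking procedure described in this video.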