Now, it's time to summarize what we have obtained so far with our reinforcement learning solution to the MDP model for option pricing and hedging. We can summarize it on two counts: first by considering it as a reinforcement learning problem, and second by considering the financial aspects of the model that we developed here.

Let's start with the reinforcement learning and machine learning side. On the machine learning side, we formulated the discrete-time Black-Scholes model as a Markov decision process model. Then we produced two solutions for this model. The first one is a dynamic programming solution that can be applied when the model is known. We implemented this solution in a Monte Carlo setting, where we first simulate paths of the stock price and then solve the optimal control problem by backward recursion, using expansions in basis functions as function approximations for the optimal action and the optimal Q-function. This gives us a benchmark solution of the Bellman optimality equation. We can also check that the solution obtained by such a dynamic programming method reproduces the classical results of the Black-Scholes model when time steps are very small and the risk aversion lambda goes to zero.

Then we turned to a completely data-driven and model-independent way of solving the same MDP model. Because our model can be solved using Q-Learning, and its more practical version for batch reinforcement learning called Fitted Q-Iteration, such a data-driven solution of the model is actually feasible. This means that the model can be used with actual stock price data and data obtained from an option trading desk. Because Q-Learning and Fitted Q-Iteration are off-policy methods, the recorded actions and rewards do not need to correspond to optimal actions; the algorithm can learn from sub-optimal actions as well.

A very nice property of our MDP model is that it can be solved by both dynamic programming and reinforcement learning methods. Therefore, the dynamic programming solution can be used as a benchmark for testing any reinforcement learning method, and not only the Fitted Q-Iteration method. What is more, while so far we have considered the model with only a single stock, and hence a single risk factor, we can extend this framework to include more complex dynamics. Again, in such a setting the dynamic programming solution can be used as a benchmark to test various reinforcement learning algorithms. Another nice thing about the reinforcement learning and DP solutions of the model is that both of them are very simple and involve only simple linear algebra. Because the model is very simple, optimization is performed analytically in both solutions, which makes the model reasonably fast.

Now, after we have discussed these machine learning aspects, let's talk about the financial modeling aspects of our MDP model. On the financial side, we found that the famous Black-Scholes option pricing model can be understood as a continuous-time limit of another famous financial model, namely the Markowitz portfolio model, albeit applied in a very special setting. This is because in our MDP problem, the problem of optimally hedging an option amounts to a multi-period version of the Markowitz theory, where the investment portfolio is very simple and consists of only one stock and cash. The dynamics of the world in this model are assumed to be log-normal, as in the Black-Scholes model.
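Before coming back to this financial picture, here is a minimal sketch of the regression-based backward recursion mentioned on the machine learning side. It assumes the simulated stock paths and one-step risk-adjusted rewards are already available as NumPy arrays; the function names, the simple polynomial basis, and the value-only recursion are illustrative assumptions, not the exact formulas from the lectures (which also compute the optimal hedge analytically at each step).

```python
import numpy as np

def basis(x, degree=3):
    # Polynomial basis functions of the state; a deliberately simple,
    # hypothetical choice made only for this sketch.
    x = np.asarray(x, dtype=float)
    return np.vstack([x ** k for k in range(degree + 1)]).T   # (n_paths, degree + 1)

def backward_dp(states, rewards, gamma=1.0):
    # Backward recursion: at each time step, fit the value of the hedge
    # portfolio by a linear least-squares regression on basis functions.
    #   states  : (n_paths, n_steps + 1) array of simulated state variables
    #   rewards : (n_paths, n_steps)     array of one-step risk-adjusted rewards
    n_paths, n_steps = rewards.shape
    v_next = np.zeros(n_paths)                      # value beyond the last step
    coeffs = []
    for t in reversed(range(n_steps)):
        target = rewards[:, t] + gamma * v_next             # Bellman target at time t
        phi = basis(states[:, t])
        w, *_ = np.linalg.lstsq(phi, target, rcond=None)    # analytic least squares
        coeffs.append(w)
        v_next = phi @ w                            # fitted value, reused at t - 1
    return list(reversed(coeffs))                   # one coefficient vector per time step

# Toy usage with simulated log-normal paths and a terminal-only reward
rng = np.random.default_rng(0)
paths = 100.0 * np.exp(np.cumsum(0.02 * rng.standard_normal((5000, 25)), axis=1))
states = np.hstack([np.full((5000, 1), 100.0), paths])
rewards = np.zeros((5000, 25))
rewards[:, -1] = -np.maximum(states[:, -1] - 100.0, 0.0)   # minus a call payoff at maturity
coefficients = backward_dp(states, rewards)
```

As noted above, each step of such a recursion reduces to a linear least-squares fit, which is why both the DP and the Fitted Q-Iteration solutions stay fast and involve only simple linear algebra.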
Now, coming back to the financial picture: what we found is that if this replication portfolio of one stock and cash matches the option payoff at maturity, capital T, then the option price and hedge can be computed from this portfolio by doing a dynamic optimization of risk-adjusted returns, exactly as in the classical Markowitz portfolio theory. By making the replication portfolio for the option Markowitz-optimal, we reproduce the Black-Scholes model in the limit of vanishing time steps and risk aversion lambda. So the classical Black-Scholes model is matched in this limit, which is good. But the present model does much more than the Black-Scholes model, or, for that matter, than other, more complex models of mathematical finance do. Our MDP model produces both the price and the optimal hedge for an option that incorporate the actual risk in the option, which always persists because options are never re-hedged continuously. Models of mathematical finance such as the Black-Scholes model, or various local or stochastic volatility models, focus on the so-called fair, or risk-neutral, option price, assuming that hedging errors are second-order effects relative to the expected cost of the hedge, which is just equal to the fair option price in these models. But the reality is that in many cases such second-order effects are as large as the first-order effects, and risk in the option becomes a first-class citizen. The MDP model that we presented here captures this risk in a model-independent and data-driven way. What is more, in this formulation pricing and hedging are consistent, as they are both obtained from maximization of the same objective function. Such consistency is important, especially given the fact that many previous models of discrete-time risky option pricing did not provide an explicit link between the option price and the option hedge, so different option prices could correspond to the same hedging method. But in the framework that we presented, the risky option price is fully consistent with risk minimization of the option hedge. Also, because our approach is model-independent, it frees us from the need to construct and calibrate some complex model of stock dynamics, which is actually one of the main objectives of models in traditional mathematical finance.

Alright. Now, after this summary of the MDP option pricing model, let's take a look at some examples. In the first set of experiments we test the performance of Fitted Q-Iteration in an on-policy setting. In this case, we simply use the optimal hedges and rewards obtained with the dynamic programming solution as data for Fitted Q-Iteration. The results are shown in these graphs for two sets of Monte Carlo simulations, shown respectively in the left and right columns. Because we deal here with on-policy learning, the resulting optimal Q-function and its optimal value Q-star-t at the optimal action a-star-t, which are shown in the second and third rows respectively, are virtually identical in these graphs. The option price for the parameter values used in this example is 4.90 plus or minus 0.12, which is identical to the option value obtained with the dynamic programming method. The Black-Scholes option price for this case is 4.53, and this number can be recovered if we take a smaller value of the risk aversion lambda.

We can also test the performance of Fitted Q-Iteration in an off-policy mode. To make off-policy data, we multiply the optimal hedges computed by the DP solution of the model by a random uniform number in the interval from one minus eta to one plus eta, where eta is a parameter between zero and one that controls the noise level in the data.
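As a minimal sketch of this perturbation, assuming the DP-optimal hedges sit in a NumPy array of shape (number of paths, number of time steps); the function name and the placeholder array are hypothetical:

```python
import numpy as np

def make_off_policy_hedges(optimal_hedges, eta, rng=None):
    # Multiply each DP-optimal hedge by an independent uniform factor drawn
    # from [1 - eta, 1 + eta]; eta in (0, 1) controls the noise level.
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(1.0 - eta, 1.0 + eta, size=np.shape(optimal_hedges))
    return np.asarray(optimal_hedges) * noise

# Placeholder standing in for the hedges produced by the DP solution (n_paths x n_steps)
optimal_hedges_dp = np.ones((10_000, 24))
noisy_hedges = make_off_policy_hedges(optimal_hedges_dp, eta=0.50)   # e.g. 50% noise
```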
In our experiments we considered values of eta equal to 0.15, 0.25, 0.35, and 0.50 to test the noise tolerance of our algorithm. In this figure, you can see the results obtained for off-policy learning with eta equal to 50 percent, with five different scenarios of sub-optimal actions. We observe some non-monotonicity in these graphs, but this is due to a low number of scenarios. Please note that the impact of sub-optimality of actions in the recorded data is rather mild, at least for a moderate level of noise in the actions. This is expected, because Fitted Q-Iteration is an off-policy algorithm. It means that, as long as the data set is large enough, our MDP model can learn even from data with purely random actions. In particular, it can learn the Black-Scholes model itself if the world is log-normal, because Q-Learning is a model-free method.

So, this concludes the third week of this course. In your homework you will implement Fitted Q-Iteration, and also evaluate the performance of the algorithm in both on-policy and off-policy settings. Good luck with this work, and see you next week.