Now, it's time to summarize what we have obtained so far with

our reinforcement learning solution to the MDP model for option pricing and hedging.

We can summarize it on two counts.

First, by considering it as a reinforcement learning problem,

and second, by considering the financial aspects of the model that we developed here.

Let's start with the reinforcement learning and machine learning side.

On the machine learning side, we formulated

the discrete-time Black-Scholes model as a Markov decision process (MDP) model.

Then we produced two solutions for this model.

The first one is a dynamic programming

solution that can be applied when the model is known.

We implemented this solution in a Monte Carlo setting,

where we first simulate paths of the stock and

then solve the optimal control problem by doing backward recursion,

using expansions in basis functions for

function approximation of the optimal action and the optimal Q-function.

This gives us a benchmark solution for the Bellman optimality equation.

We can also check that the solution obtained by this dynamic programming method

reproduces the classical results of

the Black-Scholes model when time steps are very small,

and risk aversion lambda goes to zero.
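As a loose illustration of this Monte Carlo plus backward-recursion idea, here is a minimal sketch: it simulates log-normal stock paths and then projects a value function onto polynomial basis functions by least squares, step by step backwards in time. All parameter values, the strike, and the choice of basis are illustrative assumptions, and the analytic optimization over the hedge is omitted for brevity, so this is not the course's reference implementation.

```python
import numpy as np

np.random.seed(0)
n_paths, n_steps = 5000, 24
S0, r, sigma, T = 100.0, 0.05, 0.15, 1.0   # illustrative parameters
K = 100.0                                  # assumed option strike
dt = T / n_steps

# Simulate geometric Brownian motion paths of the stock
Z = np.random.randn(n_paths, n_steps)
log_increments = (r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * Z
S = S0 * np.exp(np.hstack([np.zeros((n_paths, 1)),
                           np.cumsum(log_increments, axis=1)]))

def basis(x, degree=3):
    """Polynomial basis in the standardized state variable."""
    s = x.std()
    z = (x - x.mean()) / s if s > 0 else np.zeros_like(x)
    return np.vander(z, degree + 1, increasing=True)

# Terminal condition: the value equals the call option payoff at maturity T
V = np.maximum(S[:, -1] - K, 0.0)

# Backward recursion: at each step, regress the discounted next-step
# value onto basis functions of the current stock price by least squares
for t in range(n_steps - 1, -1, -1):
    A = basis(S[:, t])
    coef, *_ = np.linalg.lstsq(A, np.exp(-r * dt) * V, rcond=None)
    V = A @ coef   # fitted value function at time t

price_estimate = V.mean()
```

The backward loop is the same pattern the full solution uses for the optimal action and Q-function, just without the hedge optimization at each step.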

Then we turned to a completely data-driven and model-independent

way of solving the same MDP model.

Because our model can be solved using Q-Learning,

and its more practical version for

batch reinforcement learning called Fitted Q-Iteration,

such a data-driven solution of the model is actually feasible.

This means that the model can be used with the actual stock price data,

and data obtained from an option trading desk.

Because Q-Learning and Fitted Q-Iteration are off-policy methods,

the recorded action and reward data

do not have to correspond to optimal actions.

The algorithm can learn from sub-optimal actions as well.
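To make the batch, off-policy update concrete, here is a loose sketch of Fitted Q-Iteration with a Q-function that is linear in hand-crafted features and quadratic in the action, so the greedy action is available analytically. The toy data, reward, and feature map are illustrative assumptions, not the course's model.

```python
import numpy as np

np.random.seed(1)
n = 2000
states = np.random.randn(n)                # recorded states
actions = np.random.randn(n)               # recorded (possibly sub-optimal) actions
rewards = -(actions - states) ** 2         # toy reward: the best action is a = s
next_states = states + 0.1 * np.random.randn(n)
gamma = 0.9

def features(s, a):
    """Feature map: quadratic in the state and in the action."""
    return np.column_stack([np.ones_like(s), s, s ** 2, a, s * a, a ** 2])

def greedy_action(theta, s):
    # Q(s, a) = ... + (c3 + c4*s)*a + c5*a^2 is maximized analytically
    # at a* = -(c3 + c4*s) / (2*c5), provided the a^2 coefficient c5 < 0
    c3, c4, c5 = theta[3], theta[4], theta[5]
    return -(c3 + c4 * s) / (2.0 * min(c5, -1e-6))

theta = np.zeros(6)
for _ in range(30):   # fitted Q-iterations over the fixed batch of data
    a_star = greedy_action(theta, next_states)
    targets = rewards + gamma * (features(next_states, a_star) @ theta)
    theta, *_ = np.linalg.lstsq(features(states, actions), targets, rcond=None)
```

Note that the recorded actions here are pure noise relative to the optimal policy, yet the iteration still recovers a Q-function whose greedy action tracks the optimum, which is exactly the off-policy property described above.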

Now, a very nice property of our MDP model is that it can

be solved by both dynamic programming and reinforcement learning methods.

Therefore, the dynamic programming solution can be used as

a benchmark for testing any reinforcement learning method,

and not only the Fitted Q-Iteration method.

Moreover, while so far we have considered the model

with only a single stock, and hence a single risk factor,

we can extend this framework to include more complex dynamics.

Again, in such setting the dynamic programming solution can be

used as a benchmark to test various reinforcement learning algorithms.

The other nice thing about

the reinforcement learning and DP solutions to the model is that

both of them are very simple and involve only simple linear algebra.

Because the model is very simple,

optimization is performed analytically in both solutions,

which makes the model reasonably fast.

Now, after we have discussed these machine learning aspects,

let's talk about the financial modeling aspects

of our MDP model.

On the financial side,

we found that the famous Black-Scholes option pricing model can be

understood as a continuous-time limit of another famous financial model, namely

the Markowitz portfolio model,

albeit applied in a very special setting.

This is because in our MDP problem,

the problem of optimally hedging an option

amounts to a multi-period version of the Markowitz theory,

where the investment portfolio is very simple and consists of only one stock and cash.

The dynamics of the world are assumed in

this model to be log-normal, as in the Black-Scholes model.

Now, what we found is that

if this investment portfolio matches the option pay-off at maturity T,

then the option price and the hedge can be computed from

this portfolio by doing a dynamic optimization of risk-adjusted returns,

exactly as in the classical Markowitz portfolio theory.
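As a rough schematic of this risk-adjusted optimization (the notation here is an assumption of this summary, not the lecture's exact formulas), the Markowitz-style objective for the hedge portfolio value at step t, with hedge position a_t and risk aversion lambda, can be written as:

```latex
\max_{a_t} \;\; \mathbb{E}_t\left[ \Pi_t \right] \; - \; \lambda \, \operatorname{Var}_t\left[ \Pi_t \right]
```

where the option price and the optimal hedge are then read off from the optimal value and the optimizer of this objective.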

By making the replication portfolio for the option Markowitz optimal,

we reproduce the Black-Scholes model in the limit of

vanishing time steps and risk aversion lambda.

So, the classical Black-Scholes model is matched in this limit, which is good.

But the present model does much more than the Black-Scholes model,

or, for that matter, than other more complex models of mathematical finance do.

Our MDP model produces both the price and the optimal hedge

for an option that incorporate the actual risk in the option,

which always persists because options are never re-hedged continuously.

Models of mathematical finance, such as the Black-Scholes model

or various local or stochastic volatility models,

focus on the so-called fair, or risk-neutral, option price.

They assume that hedge errors are

second-order effects relative to the expected cost of the hedge,

which is just equal to the fair option price in such models.

But the reality is that in many cases,

such second-order effects are as large as the first-order effects,

and the risk in options becomes a first-class citizen in such cases.

The MDP model that we presented here captures

this risk in a model-independent and data-driven way.

Moreover, in this formulation,

pricing and hedging are consistent as they are both

obtained from maximization of the same objective function.

Such consistency is important,

especially given that many previous models of discrete-time,

risky option pricing did not provide an

explicit link between the option price and the option hedge,

so that different option prices could correspond to the same hedging method.

But in the framework that we presented,

the risky option price is fully consistent with the risk minimization of the option hedge.

Also, because our approach is model-independent,

it frees us from the need to construct

and calibrate some complex model of stock dynamics,

which is actually one of the main tasks of

traditional mathematical finance.

Alright. Now, after this summary of the MDP option pricing model,

let's take a look at some examples.

In the first set of experiments, we test

the performance of Fitted Q-Iteration in an on-policy setting.

In this case, we simply use the optimal hedges and rewards obtained

with the dynamic programming solution as data for Fitted Q-Iteration.

The results are shown in these graphs for two sets of

Monte Carlo simulations, shown in the left and right columns respectively.

Because we deal here with on-policy learning,

the resulting optimal Q-function and its optimal value

Q_t*(a*), which are shown in the second and third rows respectively,

are virtually identical in the graphs.

The option price for the parameter values used in

this example is 4.90 plus or minus 0.12,

which is identical to the option value obtained with the dynamic programming method.

The Black-Scholes option price for this case is 4.53,

and this number can be recovered if we take a smaller value of the risk aversion lambda.

We can also test the performance of Fitted Q-Iteration in an off-policy setting.

To make off-policy data, we multiply the optimal hedges computed by the DP solution of

the model by a random uniform number in the interval from one minus eta to one plus eta,

where eta is a parameter between zero and

one that controls the noise level in the data.
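This noise construction is simple enough to sketch directly; the "optimal" hedges below are placeholder values standing in for the DP output, and the array shape is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the optimal hedges produced by the DP solution
optimal_hedges = rng.normal(0.5, 0.1, size=(1000, 24))
eta = 0.15   # noise level, a parameter between 0 and 1

# Multiplicative uniform noise in [1 - eta, 1 + eta] makes the data off-policy
noise = rng.uniform(1.0 - eta, 1.0 + eta, size=optimal_hedges.shape)
off_policy_hedges = optimal_hedges * noise
```

Setting eta to zero recovers the on-policy data, while larger values of eta push the recorded actions further from optimality.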

In our experiments, we considered values of eta equal to 0.15,

0.25, 0.35, and 0.50 to test the noise tolerance of our algorithm.

In this figure, you can see the results obtained

for off-policy learning for eta equal to 50%,

with five different scenarios of sub-optimal actions.

We observe some non-monotonicity in these graphs,

but this is due to a low number of scenarios.

But please note that the impact of sub-optimality of actions

in the recorded data is rather mild, at least for a moderate level of noise in the actions.

This is expected, because Fitted Q-Iteration is an off-policy algorithm.

This means that, once the data set is large enough,

our MDP model can learn even from data with purely random actions.

In particular, it can learn the Black-Scholes model itself if the world is log-normal,

because Q-Learning is a model-free method.

So, this concludes our third week of this course.

In your homework, you will implement Fitted Q-Iteration,

and also evaluate the performance of the algorithm

in both on-policy and off-policy settings.

Good luck with this work.

And see you next week.