In the last lesson, we constructed the Markov Decision Process, or MDP, model for option

pricing and hedging, and then solved it

using the Monte Carlo-based Dynamic Programming approach.

To remind you of this Monte Carlo setting,

we simulated paths for the underlying stock,

and then computed optimal policy,

and hence optimal actions.

The optimal option price was obtained as the negative of the time-zero optimal Q-function,

with its second argument taken to be the optimal action.
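In symbols, this relation can be restated compactly as follows (a sketch using the lesson's notation, with C_0 denoting the time-zero option price, a symbol not named explicitly in the transcript):

```latex
C_0 = -\, Q_0^{\star}\!\left(X_0, a_0^{\star}\right),
\qquad
a_0^{\star} = \arg\max_{a} \, Q_0^{\star}\!\left(X_0, a\right)
```

Here X_0 is the initial state and a_0^* is the optimal action at time zero.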

Now we move to reinforcement learning.

As we said several times before,

reinforcement learning solves the same problem as dynamic programming,

that is it finds the optimal policy.

But unlike dynamic programming reinforcement learning does not

assume that transition probabilities and the reward function are known.

Instead it relies on samples to find an optimal policy.

Now, why is this approach interesting?

It's interesting because it tries to go to the heart of the problem,

but without solving first another problem,

namely the problem of building a model of the world.

The conventional approach to Option Pricing requires

that we build a model of the stock dynamics

by designing some stochastic process and

then calibrating it to option and stock price data.

But let's note that building a model of the world is not our objective.

Our purpose is rather to find an optimal option price and hedge.

In other words our task is to find an optimal policy from data.

But this is clearly a very different task from the task of building a model of the world.

Moreover, in some cases,

the world might have very complex dynamics,

yet an optimal policy can be a very simple function.

Vladimir Vapnik, the father of

support vector machines once formulated the principle that one

should avoid solving more difficult intermediate problems when solving a target problem.

Support vector machines that we discussed in

the previous course actually implement Vapnik's principle.

Now, in our case,

to price and hedge an option,

we do not have to explain the world,

but rather just need to learn to act optimally in this world.

This is our target task in the sense of Vapnik's principle.

The intermediate task would be to explain the world,

that is to build a model of the world.

According to the classical approach of quantitative finance,

we always first have to build a model of the world.

That is to formulate the law of dynamics and estimate model parameters.

This is called model calibration,

and it amounts to minimization of

some loss function between observables and model outputs.
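The calibration step just described can be sketched in a few lines. This is a minimal illustration, not a real pricer: the one-parameter "model", the toy pricing formula, and the market quotes below are all made up for the example; only the structure (minimize a loss between market observables and model outputs) comes from the transcript.

```python
# Minimal sketch of model calibration as loss minimization.
# model_price is a hypothetical toy pricer with one free parameter, sigma.

def model_price(strike, sigma):
    # Illustrative only: intrinsic value plus a volatility-dependent term.
    return max(100.0 - strike, 0.0) + 40.0 * sigma / (1.0 + 0.01 * strike)

# Observed market prices keyed by strike (made-up numbers).
market = {90.0: 14.0, 100.0: 4.0, 110.0: 1.5}

def loss(sigma):
    # Squared-error loss between observables and model outputs.
    return sum((model_price(k, sigma) - p) ** 2 for k, p in market.items())

# Calibration: pick the parameter minimizing the loss (grid search here;
# a real calibration would use a numerical optimizer).
sigmas = [i / 100.0 for i in range(1, 101)]
sigma_star = min(sigmas, key=loss)
```

In practice this optimization over many parameters and instruments is exactly the resource-demanding step the lecture refers to.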

Now depending on the model,

this might be a very resource demanding procedure.

But even after it is completed,

this is not the end yet,

as we still have not solved our main problem of option hedging and pricing.

This takes another calculation,

though normally much less time consuming than

the first one because it doesn't involve optimization.

So, let's look at this traditional approach from the point of

view of the original problem of option hedging and pricing,

which we will now view as a problem of optimal control in reinforcement learning.

If we build a model of the world first,

we can apply dynamic programming to solve the problem of optimal control.

But any model introduces model misspecifications.

These can then propagate into the quantities we actually care about most.

That is the optimal price and hedge.

So, what the reinforcement learning approach does

is focus on this original task while

relying on data samples instead of a model of the world.

Therefore, this approach implements Vapnik's principle.

Now, once we agree that the general approach of reinforcement learning

takes us straight to the ultimate goal at least conceptually,

we can discuss different particular specifications of this approach.

For example, we can still have a model of the world, or maybe know

some important model parameters, and use them within reinforcement learning.

Such an approach would correspond to what is called model-based

reinforcement learning, as opposed to model-free reinforcement learning.

Furthermore, there are different types of reinforcement learning.

Some of them focus on direct policy search,

while others work with a value function, as we outlined in the previous course.

We will concentrate for now on value-based reinforcement learning, which condenses

the information it needs from the world to optimize the policy into a value function.

Now, we will consider offline reinforcement learning also

known as batch-mode, or simply batch, reinforcement learning.

In this setting, we only have access to some historically collected data.

There is no access to real time environment,

and we also assume that no simulator of such environment is available.

Now, what do our data look like in the setting of batch reinforcement learning?

Data are given by a set of N trajectories,

and the information set F_t at time t is given by the information sets

F_t^(n) available from all N separate trajectories.

Each set F_t^(n) includes

the historical values up to time t of the following tuples of values.

What is in these tuples?

We record the underlying stock price S_t,

which we express as a function of the state variable X_t,

along with the hedge position a_t,

the instantaneous reward R_t, and the next-time value X_{t+1}.

So F_t^(n) is a collection of all such tuples, as shown in this equation.

In fact, as long as the dynamics are Markov, a collection of N trajectories of

length T each is equivalent to a data set of N times T single-step transitions.
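This flattening step can be sketched directly. A minimal example, assuming the tuple layout described above, (X_t, a_t, R_t, X_{t+1}); all the numbers below are made-up toy values:

```python
# Two toy trajectories (N = 2) of two steps each (T = 2).
# Each step is a tuple (X_t, a_t, R_t, X_{t+1}); values are illustrative.
trajectories = [
    [(100.0, 0.5, -0.2, 101.0), (101.0, 0.4, -0.1, 99.5)],
    [(100.0, 0.6, -0.3, 98.0), (98.0, 0.5, 0.1, 97.0)],
]

# Under Markov dynamics, the trajectory structure can be discarded:
# the data set is just the pooled collection of one-step transitions.
transitions = [step for traj in trajectories for step in traj]

N, T = len(trajectories), len(trajectories[0])
assert len(transitions) == N * T  # N trajectories of length T -> N * T transitions
```

This pooled set of single-step transitions is exactly the input that batch reinforcement learning algorithms consume.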

We assume that such a data set is available either as

simulated data, or as real historical stock price data

combined with some artificial data that would track the performance

of a hypothetical stock-and-cash replicating portfolio for a given option.

Now, neither the dynamics nor

the true reward distribution are assumed known in the reinforcement learning approach.

All that is given is a set of one-step transitions.

Now we can compare these data for reinforcement learning

with the data we had for the dynamic programming solution of the model.

In a dynamic programming setting,

our only data was samples of stock prices simulated using Monte Carlo.

In the course of the backward recursion,

we computed instantaneous rewards,

and then computed optimal actions that one should take to hedge the option.

Now let's compare this with the setting of batch reinforcement learning.

We have stock prices and next-step

stock prices, which is the same data as in dynamic programming.

But instead of rewards and actions being

computed from the model using the knowledge of model dynamics,

in batch reinforcement learning we are given sampled values of rewards and actions.

In the next video,

we will see how we can work with this data.