Now, we will re-formulate the discrete-time BSM model as a Markov Decision Process, or MDP, model. This means that now we will take a view of the problem of option pricing and hedging as a problem of stochastic optimal control in discrete time. The system being controlled in this case will be a hedge portfolio, and the control will be a stock position in this hedge portfolio. This problem will then be solved by a sequential maximization of some rewards. As we will see shortly, these rewards are in fact the negatives of the one-step variances of the hedge portfolio, multiplied by the risk-aversion parameter lambda, plus a drift term.

Now, when addressing the problem this way, we need to differentiate between two cases. The first case corresponds to the scenario when the model of the world is known. In this case, we can use methods of dynamic programming, or DP, or model-based reinforcement learning to solve the problem. In the second scenario, the model of the world is unknown. In this case, we can use model-free reinforcement learning methods, which rely on samples of data rather than on a model of the world, to solve the problem of optimal control. These two situations correspond to two different ways to solve the Bellman optimality equation for this model. So, we will start with the formulation that is suited for a dynamic programming setting, and then later we will show how it can be adjusted for a reinforcement learning setting.

We first define state variables in our model. It turns out that though we could use stock prices St as state variables, they are not very convenient to work with, because they drift upwards with time, creating non-stationary dynamics. But this is easy to fix. To this end, we introduce state variables Xt that are related to stock prices St as shown in the first equation here. As the second equation here shows, Xt is just a re-scaled standard Brownian motion. And the third equation shows how the stock price St can be computed from Xt by adding a deterministic drift and exponentiating.

Now, I want to digress a bit with a general remark. In principle, we can consider either discrete-state or continuous-state problems of optimal control in finance. While a continuous-state formulation is more practically relevant, a discrete-state formulation is easier to understand. So, I would like to comment here on how we can go back and forth in the model formulation between the continuous-state and discrete-state MDP problem. In general, we will assume that we work with a continuous state space. But if we want to work with a discrete state space, we can simply use a discretized version of the Black-Scholes model and construct optimal hedges directly in this discrete model. This can be done in a number of ways, and one of the easiest would be to use a Markov chain approximation to the Black-Scholes model that was developed by Duan and Simonato in 2001. Even though we will not be working with such discretized dynamics in our course, I thought it would be a good idea to at least mention it, given that discrete-state MDP problems simplify certain things. For example, for such systems we can use simple algorithms such as Q-learning, which we will be discussing in the next lesson. So, let's just remember that everything we are discussing here can also be formulated in a discrete state space, and let's continue with specifying our MDP problem.
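To make the change of variables between St and Xt concrete, here is a minimal sketch in Python. It assumes the usual geometric Brownian motion dynamics for St and the transformation Xt = ln St - (mu - sigma^2/2) t, which removes the deterministic drift so that Xt is a re-scaled Brownian motion (up to the constant ln S0); the parameter values and names (mu, sigma, S0, n_steps, n_paths) are illustrative, not the course's code.

```python
import numpy as np

# A minimal sketch of the change of variables discussed above, assuming
# standard GBM dynamics for S_t and the transformation
#   X_t = ln S_t - (mu - sigma^2/2) * t,
# so that X_t is a re-scaled Brownian motion and
#   S_t = exp(X_t + (mu - sigma^2/2) * t).

mu, sigma, S0 = 0.05, 0.15, 100.0
T_mat, n_steps, n_paths = 1.0, 24, 10000
dt = T_mat / n_steps

rng = np.random.default_rng(42)
t_grid = np.linspace(0.0, T_mat, n_steps + 1)

# Simulate GBM paths of the stock price S_t
dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
log_S = np.log(S0) + np.cumsum((mu - 0.5 * sigma**2) * dt + sigma * dW, axis=1)
log_S = np.hstack([np.full((n_paths, 1), np.log(S0)), log_S])
S = np.exp(log_S)

# State variable X_t: subtract the deterministic drift from ln S_t
X = log_S - (mu - 0.5 * sigma**2) * t_grid

# Recover S_t from X_t by adding the drift back and exponentiating
S_check = np.exp(X + (mu - 0.5 * sigma**2) * t_grid)
assert np.allclose(S, S_check)
```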
First, as our new state variables are Xt instead of St, we will introduce a new action variable a(Xt). The original action variables u(St) can be obtained from actions a(Xt) if we just re-express Xt in terms of St. Next, to describe actions for different states of the world, we introduce the notion of a deterministic policy pi that maps a state Xt and time t into the action At, given by the value pi(t, Xt) of the policy function.

Now, we are ready to specify the Bellman equation for our model. We have already introduced the value function Vt as the negative of the option price, and it is given by the first equation here. In the second line, I split off the term corresponding to t-prime equal to t in the sum, so that now the sum over t-prime runs from t plus one to capital T. But now we can replace the last term in this equation with the future value function V of t plus one. If we use the definition of V of t plus one and rearrange terms in this relation, we obtain the second equation on the slide. In this equation, we also introduce the discrete-time discount factor gamma, expressed in terms of a continuous-time discount rate r and the timestep delta t.

Now, we can substitute the second relation into the first equation. This produces the Bellman equation for the value function Vt. This Bellman equation is the first equation here, and it has the standard form, as it should. What is specific to the model we have here is the form of the reward function R(Xt, At, Xt+1). This function is shown in the second equation here. As this equation shows, there are two contributions to the instantaneous reward Rt. The first contribution is proportional to the change of the stock price, multiplied by the position in the stock. The second term, proportional to lambda, is the negative risk at timestep t, as measured by the variance of the hedge portfolio at time t. The second line in this equation shows its explicit functional dependence on the action At. In this second equation, I expressed the variance of a random quantity as an expectation of the squared demeaned value of this quantity, so that values with hats here, such as Pi-hat or S-hat, stand for demeaned quantities.

If we now take the time-t expectation of this instantaneous reward Rt, we obtain the expected reward shown in this equation. Now, there are two things that stand out in this relation. First, this relation is quadratic in the action variable At. Therefore, it is easy to optimize with respect to At. Second, this formula shows that the quadratic risk penalty given by the second term is incorporated as a one-step expectation. Therefore, once we have this modification to the expected reward, we can just use the standard risk-neutral MDP approach to solve the problem. You might remember that in the previous course, we talked about the inadequacy of the standard risk-neutral MDP formulation for financial problems, as the latter do care about risk. Now, the formulation that we just gave here provides a simple cure to this problem. We can just include the expected quadratic risk in the reward function and then use such risk-adjusted rewards within an otherwise standard risk-neutral MDP valuation.

It is also interesting to note that if we take lambda to zero in this expression, we see that the expected reward becomes linear in At in this limit. So, it means that in this limit there is nothing to maximize anymore if we focus on the expected rewards. There is no longer an objective function in the sense of an MDP problem, but it doesn't mean, of course, that there is no way to find any optimal hedges.
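To see why the quadratic dependence on At matters in practice, here is a minimal sketch, assuming a one-step reward of the form gamma * At * dSt - lambda * gamma^2 * (Pi-hat_{t+1} - At * dS-hat_t)^2, which is consistent with the description above (hats denote demeaned quantities), but is written here as an illustrative assumption rather than the exact slide formula. With Monte Carlo samples standing in for the expectations, the maximizing action is just the vertex of a concave parabola. The names (Pi_hat_next, dS_hat, dS, lam, gamma_disc) are illustrative.

```python
import numpy as np

def optimal_action_quadratic(Pi_hat_next, dS_hat, dS, lam, gamma_disc):
    """Maximize E[gamma*a*dS - lam*gamma^2*(Pi_hat_next - a*dS_hat)^2]
    over the scalar action a, using Monte Carlo averages for the expectations
    at a single time step t (all inputs are 1-d arrays of samples)."""
    # Expected reward is  c2 * a^2 + c1 * a + const, with
    #   c2 = -lam * gamma^2 * E[dS_hat^2]   (negative, so the parabola is concave)
    #   c1 =  gamma * E[dS] + 2 * lam * gamma^2 * E[Pi_hat_next * dS_hat]
    c2 = -lam * gamma_disc**2 * np.mean(dS_hat**2)
    c1 = gamma_disc * np.mean(dS) + 2.0 * lam * gamma_disc**2 * np.mean(Pi_hat_next * dS_hat)
    # Vertex of the concave parabola gives the optimal action
    return -c1 / (2.0 * c2)
```

Note how the result splits into a pure risk-minimizing piece (the covariance-over-variance ratio hidden in c1 and c2) plus a drift correction that is inversely proportional to lambda; this is exactly the structure exploited in the discussion of the lambda-to-zero limit below.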
The optimal hedge can still be found by a quadratic risk minimization, as we did just before; the only thing is that in this limit there is no longer a link between such risk minimization and any objective function that would be linked to the option price. In other words, in the limit of lambda going to zero, the hedging and pricing problems become decoupled.
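For completeness, here is a minimal sketch of that pure quadratic risk minimization, under the same illustrative assumptions and names as in the previous sketch: minimizing the one-step variance E[(Pi-hat_{t+1} - a * dS-hat_t)^2] over a gives the least-squares slope of Pi-hat_{t+1} on dS-hat_t.

```python
import numpy as np

def min_variance_hedge(Pi_hat_next, dS_hat):
    """One-step variance-minimizing hedge ratio from Monte Carlo samples
    of the demeaned next-step portfolio value and demeaned stock change."""
    return np.mean(Pi_hat_next * dS_hat) / np.mean(dS_hat**2)
```

This coincides with the lambda-independent part of the quadratic maximizer in the earlier sketch. As lambda goes to zero, the expected reward reduces to the linear drift term and no longer pins down the hedge, so the hedge comes from this variance minimization alone while pricing has to be handled separately, which is the decoupling just described.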