0:02

To begin with, I want you to answer this simple mathematical quiz.

Let's say you have a task to compute a derivative: you want to compute the derivative of the logarithm of f(x), or in this case, the logarithm of p(z). How do you do that? How do you simplify the derivative?

Well, the trick we've probably all been taught at school is to simply find a table of derivatives and use the chain rule. So in this case, you can say that the derivative of the logarithm of f(x) is the product of the derivative of the logarithm and the derivative of the function itself. The derivative of the logarithm is one over whatever is under the logarithm, multiplied by the derivative of π, or f in the abstract case.

Now, if you take the 1/π(z) and move it from the right-hand side to the left, inverting it of course, you arrive at this other formula: the derivative of a function is equal to the function times the derivative of its logarithm, that is, ∇π(z) = π(z) ∇log π(z). It's a kind of universal identity that follows from the basic properties of derivatives.
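This identity is easy to verify numerically. Here is a minimal sketch, using a toy function f(x) = exp(-x²) of my own choosing (not from the lecture) and central finite differences, that checks ∇f(x) = f(x) · ∇log f(x):

```python
# Numeric sanity check of the log-derivative identity:
#   f'(x) = f(x) * (log f(x))'
# using the toy function f(x) = exp(-x^2).
import math

def f(x):
    return math.exp(-x * x)

def num_grad(g, x, h=1e-5):
    # Central finite difference approximation of g'(x).
    return (g(x + h) - g(x - h)) / (2.0 * h)

x = 0.7
lhs = num_grad(f, x)                                  # f'(x)
rhs = f(x) * num_grad(lambda t: math.log(f(t)), x)    # f(x) * (log f)'(x)
print(abs(lhs - rhs) < 1e-6)  # the two sides agree up to numeric error
```

Any smooth positive function would do here; the identity holds wherever f(x) > 0.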

And now we are going to apply this equation to our formula to make it more convenient for approximation. Remember, we want to compute the derivative of our expected reward, the so-called ∇J here. The problem reduces to computing this inner derivative, because the outer integration does not depend on the policy in our case.

To approximate this, let's first plug in our formula: let's replace the ∇p here with the p · ∇log p expression we've just derived. The result is pretty much what you'd expect: you have the same integration, but now in the inner integral you have not a derivative, but π itself times ∇log π, times your reward.

The unique part about this formula is this: your original formula required you to explicitly compute the integrals, because ∇π is not a probability density, so that expression is not a mathematical expectation. The second formula, in contrast, allows you to approximate it via sampling. In this case, you can sample states and sample actions, and then over those samples compute the part that is left inside the formula, ∇log π times the reward.

Now, how would this formula change if you substitute the integrals with an expectation, as it was originally written? Yes, right: this is your expectation, and it becomes just the expectation over states and actions of ∇log π times the reward.
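To make the sampling idea concrete, here is a small sketch with a toy one-dimensional setup of my own (not from the lecture): the "policy" is a Gaussian a ~ N(θ, 1), the reward is R(a) = a, so the true gradient d/dθ E[R(a)] = 1, and we estimate it as the sample mean of ∇θ log π(a) · R(a):

```python
# Score-function (likelihood-ratio) estimator on a toy problem:
#   grad_theta E[R(a)] ~= mean over samples of grad_theta log pi(a) * R(a)
# Policy: a ~ N(theta, 1); reward R(a) = a, so the true gradient is 1.
import random

random.seed(0)
theta = 0.5
n = 100_000

# For a unit-variance Gaussian, grad_theta log pi(a) = (a - theta).
est = 0.0
for _ in range(n):
    a = random.gauss(theta, 1.0)
    est += (a - theta) * a    # score times reward
est /= n

print(est)  # close to the true gradient, 1.0
```

No derivative of the sampling process itself is needed, which is exactly why this trick matters: we only differentiate the log-density.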

The final thing we have to do is find out how this translates to a multi-step decision process. It's not cool if you can only solve one-step processes this way. The exact derivation of the final formula is a little more complicated than the original one, so we'll skip the derivation this time. The final result is unsurprising again: instead of having a single reward here, you use the discounted reward.

So if you want to maximize the expected discounted reward, what you can do is sample states and actions, and this way you can approximate the expectation of the derivative of the log-policy times the discounted reward, your G, or the optimal Q, whatever you'd prefer to call it.
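The discounted reward weighting each ∇log π term can be computed with one backward pass over a sampled trajectory. A minimal sketch (the helper name and the reward values are my own, for illustration):

```python
# Compute discounted returns G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# for every step of a sampled trajectory, in one backward pass.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # G_t = r_t + gamma * G_{t+1}
        returns.append(g)
    return returns[::-1]

# Toy trajectory with three steps and gamma = 0.5:
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

In a policy gradient update, each sampled ∇log π(a_t | s_t) would then be multiplied by the corresponding G_t.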

And this is how you apply the policy gradient to Breakout, to remote control, to any complicated process. In the next section, we'll see how this idea, the policy gradient, is turned into a practical algorithm called REINFORCE, and how that algorithm can be used to solve practical problems.
