0:02

Now, if you recall the definition of mathematical expectation from any probability theory course you took, the expectation is basically a sum or an integral over all possible outcomes, weighted by their respective probabilities.
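As a quick illustrative sketch (my own example, not from the lecture), a continuous expectation under a normal distribution can be approximated by Monte Carlo sampling instead of computing the integral; the choice of f(θ) = θ², μ and σ here is arbitrary:

```python
import numpy as np

# Monte Carlo estimate of E[f(theta)] for theta ~ N(mu, sigma^2):
# draw many samples and average f over them, which approximates
# the integral of f(theta) * pdf(theta | mu, sigma^2) d(theta).
rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
samples = rng.normal(mu, sigma, size=200_000)

estimate = np.mean(samples ** 2)   # E[theta^2], estimated by sampling
exact = mu ** 2 + sigma ** 2       # E[theta^2] in closed form
print(estimate, exact)
```

With 200,000 samples the estimate lands close to the closed-form value; this is exactly the kind of sampling scheme the lecture will try (and initially fail) to apply to the gradient.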

In the first expectation we had on the previous slide, the expectation over the normal distribution turns into an integral, because there are continuously many possible outcomes: they are all the vectors of real numbers.

And the integrand in the formula is weighted by the probability density function of the normal distribution. So, you have this N of theta given mu, sigma squared, which is the exact PDF of the normal distribution, where you plug in the mu and sigma squared parameters from the previous slide.

The second expectation is a little bit harder to think about; even computing the probabilities might be a challenge in any practical environment. This is an expectation over trajectories. And a trajectory is not just a number, it's a complicated structure sampled from a process.

So, you have a first state sampled from the distribution of initial states, or maybe a fixed first state, depending on the environment. And you generally don't know the distribution of initial states until you explicitly sample them and maybe fit some probability distribution to model them.

And then, your agent has to pick an action. So, you take the thetas from the left part of our formula, the ones sampled from the normal distribution, and you plug those thetas into whatever policy network you're using. You compute this policy at the initial state, and on the last layer, say a softmax layer, you get the probabilities of actions.

Now you have to pick an action, which is also something that should be carried out at random. And then you feed this action back into the environment to get the next state. The next state is again sampled from the probability distribution of the next state given the current state and the chosen action.

Then, you take the next action after this next state, then the next-next state, the next-next action, and so on. And on every other step, the one where you sample the next state, you have to use this unknown probability distribution of the next state, which is not given to you by the black-box environment.
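To make the sampling process above concrete, here is a minimal sketch of rolling out one trajectory; the toy environment and the linear-softmax policy are invented for illustration and are not part of the lecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class ToyEnv:
    """A stand-in black-box environment (hypothetical, for illustration only).
    The transition distribution p(s' | s, a) is hidden inside step()."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        # first state sampled from an initial-state distribution unknown to the agent
        self.state = self.rng.normal(size=3)
        return self.state

    def step(self, action):
        self.t += 1
        # next state sampled from an unknown distribution given (state, action)
        self.state = self.state + 0.1 * action + self.rng.normal(size=3)
        reward = -np.abs(self.state).sum()
        return self.state, reward, self.t >= 10  # done after 10 steps

def rollout(theta, env, rng):
    """Sample one trajectory with a linear-softmax policy parametrized by theta."""
    s = env.reset()
    total_return, done = 0.0, False
    while not done:
        probs = softmax(theta @ s)           # action probabilities from the policy
        a = rng.choice(len(probs), p=probs)  # the action itself is drawn at random
        s, r, done = env.step(a)
        total_return += r
    return total_return

R = rollout(np.zeros((2, 3)), ToyEnv(), np.random.default_rng(1))
```

Note that randomness enters at every level: the initial state, each action, and each transition; this is why the trajectory expectation is so hard to write down analytically.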

So, this is how you could technically, analytically compute this thing. But of course, we won't be able to do so in any real circumstance.

What we need this integral for is to get a clue on how to maximize this expectation. And to maximize it, the simple way is to apply a gradient-based approach, for which you have to compute the gradient of this J with respect to something that you can optimize, something you can influence this value with.

What variables do you compute the gradient with respect to? Yep, those are the parameters of the probability distribution: the vector of mus and sigma squareds, or maybe some other vectors in case you define the probability distribution differently.
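In symbols (my own notation for the quantities described on the slide), the objective and the parameters we differentiate with respect to are:

```latex
J(\mu, \sigma^2)
  = \mathbb{E}_{\theta \sim \mathcal{N}(\mu, \sigma^2)}
    \Big[ \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \big[ R(\tau) \big] \Big]
  = \int \mathcal{N}(\theta \mid \mu, \sigma^2)\,
         \underbrace{\mathbb{E}_{\tau \sim p(\tau \mid \theta)}\big[R(\tau)\big]}_{F(\theta)}
         \, d\theta,
\qquad \text{and we want } \nabla_{\mu,\sigma^2}\, J.
```

Here F(θ), the expected trajectory return for a fixed θ, is itself an intractable integral over trajectories.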

Now, once we try to compute this gradient, we'll find ourselves in the following situation. If you just put the derivative sign before this double integral, you'll find that, luckily for us, a large part of this integral doesn't explicitly depend on mu and sigma squared once you sample a particular theta. So, the second part, where you compute the expected return of a trajectory given a particular theta, does not depend on mu and sigma squared once you give it a value of theta sampled from the normal distribution defined by those mu and sigma squared.

This allows you to push the derivative sign a little bit inside and move the second integral outside of the derivative. So now you get the integral over all possible thetas of the derivative of the normal distribution's probability density function, times the second integral, which is just the expected trajectory return given this particular theta.
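Written out, the step just described (again in my own notation) is:

```latex
\nabla_{\mu,\sigma^2}\, J
  = \nabla_{\mu,\sigma^2} \int \mathcal{N}(\theta \mid \mu, \sigma^2)\, F(\theta)\, d\theta
  = \int \Big[ \nabla_{\mu,\sigma^2}\, \mathcal{N}(\theta \mid \mu, \sigma^2) \Big]\, F(\theta)\, d\theta,
```

where F(θ), the expected trajectory return for a fixed θ, does not depend on μ and σ², so the derivative only hits the PDF.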

So, this is still a double integral, and we have to devise some way to estimate it in practice. Previously, we used Monte Carlo sampling of trajectories: for each theta sampled from the normal distribution, we rolled out trajectories several times. And right now, we have to devise something similar, unless we want to take integrals of distributions we don't know.

So, is it possible to devise a scheme that takes samples from this one, or is there something that prevents us from doing so?

As it turns out, we are kind of screwed. Because previously, we had the integrand times the normal distribution, which is a valid probability density function. But the gradient of this distribution is no longer a probability density function, or at least in our case it isn't.

The simplest explanation here is that if you take the derivative of a probability density function, sometimes the derivative might be, say, negative, because the probability density function decreases. And a probability density function itself cannot be negative, so we're screwed.
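A quick numeric check (my own example) that the gradient of a Gaussian PDF with respect to its mean indeed goes negative, and so cannot itself be treated as a density to sample from:

```python
import math

def normal_pdf(theta, mu, sigma):
    """PDF of N(mu, sigma^2) evaluated at theta."""
    return math.exp(-(theta - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_pdf_dmu(theta, mu, sigma):
    """Derivative of the PDF with respect to mu:
    d/dmu N(theta | mu, sigma^2) = (theta - mu) / sigma^2 * N(theta | mu, sigma^2)."""
    return (theta - mu) / sigma ** 2 * normal_pdf(theta, mu, sigma)

# Negative wherever theta < mu, so this "weight" is not a valid density:
print(normal_pdf_dmu(-1.0, 0.0, 1.0))
# And for a steep (small-sigma) Gaussian, its magnitude easily exceeds 1:
print(abs(normal_pdf_dmu(0.05, 0.0, 0.1)))
```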

Of course, the other properties are broken as well: for example, the gradient might be larger than one for some steep curves. But this alone should be reason enough for us to give up on this naive sampling approach and think about some other way, and this makes things a little bit more complicated.

We'll need some trick from the math domain that helps us resolve this issue. And this trick is, as it turns out, a very popular one, which we're going to use several times further along the course.
