0:00


Now, the only issue with the method that we have left untackled is how to apply it efficiently to problems at a greater scale. For the bicycle-riding problem, with, say, ten states and maybe four actions, the tabular approach is probably sufficient. But hardly any real reinforcement learning problem is that small. Let's see how the method actually applies to something like steering an autonomous self-driving car, or playing a game from raw feedback.

From here on, it's no longer possible to store a table with action probabilities for all the possible states, because you don't have ten states. You have, say, a bajillion states, or even a continuous space of states, which is technically impossible to record.

If you consider raw camera feeds from your robot or your game, you will find that the number of possible camera images is the color depth raised to the number of pixels, say 256 to the power of the number of pixels. So even for a 100 by 100 pixel image, which is super small, you would have 256 to the power of 10,000 states, which is insane. You won't be able to record this on any hard drive ever made, or maybe you will in a few years.
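To put a number on that claim, here is a quick back-of-the-envelope check in Python, using the same assumptions as above (a 100 by 100 image with 256 intensity levels per pixel):

```python
# Count of distinct 100x100 images with 256 intensity levels per pixel:
# 256 ** (100 * 100) possible states.
num_pixels = 100 * 100
num_states = 256 ** num_pixels   # Python computes this as an exact big integer

# Printing the full number is pointless; its digit count alone shows
# why no table over such a state space could ever be stored.
print(len(str(num_states)))      # -> 24083 decimal digits
```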

Not only would storing this table be impractical or even impossible in practice, it would also be rather inefficient because of how state spaces work in complex environments.

Now, we mentioned you're playing this Doom game, and you have this camera image. It's probably reasonable to expect that if an agent visits this state, say, ten times, it will reliably learn that turning to the right and firing afterwards is a reasonable strategy to execute. The only problem is that if you change this state ever so slightly, say you change the health bar from 59% to, say, 49%, then the agent has to learn everything from scratch, because it's now a different state that it has never seen before.

Â 1:53

Of course, it's not the case for humans.

Humans can easily generalize, and this is a property you want from your reinforcement learning algorithms as well.

This is where we arrive at so-called approximate reinforcement learning. This time we don't use a tabular policy; we don't record all the probabilities explicitly. Instead, we use some kind of machine learning model, or, well, any model you want, to model this probability distribution over actions given the state.

For example, it could be a neural network, basically one like those previously used for classification: it takes a state, an image, maybe applies some convolutional layers, if you're familiar with image recognition with neural networks, and ends with a softmax output layer with as many units as you have actions. You could also use any other method: logistic regression, any other linear model, or a random forest. Anything that can predict action probabilities for you.
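As an illustration, here is a minimal sketch of such a parametric policy: a single linear layer followed by a softmax over actions. The class name and sizes are hypothetical; a real agent would use a deeper network, e.g. convolutional layers for image states:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    z = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearPolicy:
    """Toy parametric policy: one linear layer plus a softmax.

    Replaces the probability table: instead of storing p(a|s) for every
    state, we store a few weights and compute p(a|s) on demand.
    """
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_features, n_actions))
        self.b = np.zeros(n_actions)

    def action_probs(self, state):
        """Probability of each action in the given state."""
        return softmax(state @ self.W + self.b)

    def sample_action(self, state, rng):
        """Draw an action according to the current policy."""
        p = self.action_probs(state)
        return rng.choice(len(p), p=p)
```

Because the weights are shared across all states, two nearly identical states (such as the same screen with a slightly different health bar) get nearly identical action probabilities, which is exactly the generalization the table lacked.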

And since you can no longer set the probabilities or update them explicitly after each iteration, you'll have to replace that phase with training your neural network. Again, you play N games in this Doom environment. Then you select the M best of them; you call those the elite games. And instead of recomputing the whole table, you simply perform several iterations of gradient descent, or build several new trees.

Well, let's consider the neural network option for now. In the case of a neural network, what you do is initialize it with random weights; you probably remember some clever ways of doing so from the deep learning course. You take this network and use it to pick actions, to actually play, say, 100 games in this Doom environment we had on the previous slide. Then you again take the several best sessions and denote them as the elite sessions.

And you train your neural network to increase the probability of the actions taken in the elite sessions. You do so the way you usually train neural networks to classify: the states are the inputs of your network, and the actions taken in the sessions serve as the correct answers, the target y in supervised notation.

Now you take some stochastic optimization algorithm, say stochastic gradient descent, RMSProp, or anything of that sort, and you perform one or several iterations of network updates on those M best sessions. Then you repeat the process.
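The elite-selection step described above can be sketched as a small helper. The session format (per-session lists of states and actions plus a total reward) and the percentile threshold are assumptions for illustration:

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=70):
    """Keep state-action pairs from sessions whose total reward is
    at or above the given percentile -- the "elite" sessions."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch,
                                       rewards_batch):
        if reward >= threshold:
            elite_states.extend(states)   # flatten: one (state, action)
            elite_actions.extend(actions)  # pair per time step
    return elite_states, elite_actions
```

Sessions that exactly tie the threshold are kept here; whether to keep or drop ties is a minor design choice. The returned pairs are then fed to the model as training inputs and targets.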

Â 4:07

Now, you can of course implement this in high-level frameworks, and it looks remarkably simple. You have a neural network, and whenever you want to update the policy it provides, you just call whatever the training method requires you to call, say fit in scikit-learn, providing your elite states as the inputs to your network and the elite actions as the reference answers.
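For instance, with scikit-learn's MLPClassifier one policy update could look like the sketch below. The data is made up, and partial_fit stands in for the repeated fit calls on each fresh batch of elite sessions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical elite data: four 2-feature states and the actions
# taken in them (out of three possible actions 0, 1, 2).
elite_states = np.array([[0.1, 0.2], [0.3, 0.1], [0.9, 0.7], [0.8, 0.6]])
elite_actions = np.array([0, 0, 2, 2])

agent = MLPClassifier(hidden_layer_sizes=(20, 20), random_state=0)

# partial_fit performs one pass of updates per call -- the
# "several iterations, then repeat" loop described above. The full
# action set must be declared up front, since some action may not
# appear in the current batch of elite sessions.
agent.partial_fit(elite_states, elite_actions, classes=[0, 1, 2])

# The updated policy: probabilities over all three actions for a new state.
probs = agent.predict_proba(np.array([[0.5, 0.5]]))
```

Sampling actions from `probs` while playing the next batch of games keeps the exploration going; `predict` alone would collapse the policy to greedy choices too early.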

Or you can get something more sophisticated up and running if you utilize the particular features of your neural network architecture. But this is the simplest, bare-bones version of the algorithm that will still work, and work rather efficiently. You'll see it in action in the practical assignment.
