
So, once we're done with those Atari-specific DQN architectures,

I want to direct your attention to the elephant in the room.

This is the elephant that we have promptly ignored since the second week of our course, right until now.

The elephant is the fact that in almost any practical reinforcement learning problem, the environment doesn't strictly abide by the Markov Decision Process rules.

The main issue for now is the fact that in almost no case will your agent have direct access to the environment's state, as was assumed in the MDP.

The environment state in an MDP is basically everything there is to know about the environment.

Strictly speaking, if you're navigating a robotic car through city streets, then you not only need to know the camera image from this robot.

You need to know the exact positions of all your surroundings, and quite frankly, you would have to know the properties of all the quantum particles in the known universe, because technically that's the only possible scenario in which the Markov property holds exactly.

So, it's an issue that you cannot solve exactly.

It's also an issue that you cannot simply ignore, because there's a lot of stuff that prevents you from learning the optimal policy.

So, we have to somehow mitigate the fact that our agent's observations are imperfect.

For the robotic car, we would very much like to know what's happening right behind us, even though our camera might only be facing forward.

For example, we want to take into account the fact that if someone is beeping behind us, then there's probably a car there, even though we might not see it directly.

The same is true for any other practical situation.

For example, if you're trying to trade stocks or anything similar, then you might benefit from knowing the history of how the asset traded over the last month, not just its current price.

Finally, within Atari games, you don't know a lot of variables, like the velocities of objects on your game field.

This brings a lot of complications even for a usual DQN, complications that we have so far mitigated with duct tape.

Now, to mitigate this issue, the first thing we have to do is redefine the way our decision process works.

The usual Markov Decision Process looks like this.

We have an agent and an environment; the environment sends a state to the agent, which in turn uses its policy to pick an action, and the action gets fed back into the environment to get the next state.

We assume that there is some probability of getting to the next state s_{t+1} and obtaining a reward r, given the particular state and action at the previous step.

In any practical case, we don't know this distribution explicitly, but we can estimate it from samples if we wish.

[inaudible]

Nevertheless, even though we don't know the probability distribution, we do have direct access to the states.

Since we know the state, we can devise our policy based on it.

However, the actual situation looks much closer to this scheme here.

Don't freak out, it's always a bit more complicated.

The difference is that while previously the agent received the environment state directly, now we have this observation function, the O(s) on the left here.

The observation function is basically some function which limits what the agent can and cannot see.

So, there is a hidden state in the environment, this s here in a circle, and technically it exists.

There is even a transition distribution that takes it to the next state.

But you never see not just the probability distribution, but the state itself as well.

So, you don't get to see the state; you only see a sequence of observations.

You may somehow judge what happens inside the environment based on those observations, but that's the best you can count on.
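
To make the distinction concrete, here is a minimal sketch of such an environment (the class and all names are hypothetical, invented for illustration): the true state exists and drives the transitions, but the agent only ever receives o = O(s), which aliases several states into one observation.

```python
import random

class NoisyCorridor:
    """Toy partially observable environment: the true state is the agent's
    exact position in a corridor, but the observation only says whether
    the agent currently stands next to a wall."""

    def __init__(self, length=5):
        self.length = length
        self._state = random.randrange(length)  # hidden state s: never shown to the agent

    def _observe(self):
        # Observation function O(s): all interior positions look identical,
        # so the observation alone cannot tell the agent where it is.
        if self._state in (0, self.length - 1):
            return "wall"
        return "corridor"

    def step(self, action):
        # action is -1 (left) or +1 (right); transitions act on the hidden state
        self._state = min(self.length - 1, max(0, self._state + action))
        reward = 1.0 if self._state == self.length - 1 else 0.0
        return self._observe(), reward  # the agent sees O(s'), not s'

env = NoisyCorridor()
obs, r = env.step(+1)
```

Note that the agent could only disambiguate the interior positions by remembering how many steps it has taken since last seeing a wall, which is exactly the kind of memory discussed next.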

So, to actually solve this new process, called a Partially Observable Markov Decision Process because of this observation function, we need to introduce something else to the agent, which helps it operate in this situation.

What would help you, for example, not forget about a person that you are not seeing directly right now?

If you've just looked away from a person, what would help you still keep them in mind?

Well, yes. What we want to introduce is some kind of persistent agent memory: a memory cell where the agent can store some information between iterations.

So, there is this hidden variable h, which is yet another vector or any other set of numbers, and on every iteration the agent can update its memory, computing its new h given its observation and the previous memory.

There is of course a bunch of different ways you can actually implement this memory cell, and we'll discuss them in just a few minutes, but first let's find out why this thing is useful.
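
As a sketch (both functions here are made up for illustration, not a learned model), the interaction loop with a persistent memory looks like this: the only change from a memoryless agent is that h is threaded through every step, h_t = f(o_t, h_{t-1}).

```python
def update_memory(h, obs):
    # Hypothetical memory update: h keeps a decaying running summary of
    # observations; a trained agent would use a learned RNN cell here instead.
    return [0.9 * hi + oi for hi, oi in zip(h, obs)]

def act(h):
    # Hypothetical policy: pick the index of the largest memory component.
    return max(range(len(h)), key=lambda i: h[i])

h = [0.0, 0.0]                      # initial memory: "no prior information"
for obs in [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]:
    h = update_memory(h, obs)       # h_t = f(o_t, h_{t-1})
    action = act(h)                 # the policy conditions on memory, not raw obs
```

After the three observations above, the memory still carries a trace of the first observation even though the last two look nothing like it, which is precisely what a memoryless policy cannot do.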

Now, one way you can think about it is that this memory state, the h here, is a tool for the agent to approximate what's happening in the hidden environment state, this s which is now hidden under the blue cloud at the bottom.

So, there is some hidden variable; you could use your h to learn how to reverse-engineer this hidden variable from a sequence of observations, and then use this reconstructed variable to drive your policy.

For example, take the robotic car again: say you're currently approaching a traffic light, and your current observation is just the current state of this traffic light.

It's either green or red or yellow in most cases, and let's say it doesn't have a timer or any other additional information sources.

Now, for your usual MDP agent, this information is insufficient in many situations.

For example, if you know that the traffic light is green but it's going to turn red in just, say, three seconds, then you want to speed up to get past the traffic light before it switches, to reach your destination faster.

That's where our memory comes into play.

Basically, your memory is updated based on the incoming observations.

So, your agent can learn to, for example, count the number of seconds since the traffic light last switched, and this way it can basically infer information about what's going to happen next, because it understands how the hidden variable, the hidden traffic light timer, operates, and basically reverse-engineers it.

The same is true for a lot of other cases, and of course, you won't get a drastic increase in performance just from this one ability to recognize traffic light properties.

But if you add up all the gains from all the situations that you can reverse-engineer with this memory, you'll get a huge boon.

Of course, this memory is only useful if the agent can operate with it effectively.

In fact, there's one thing which technically qualifies as memory, although it's not learned or anything that complicated, which we already used to make a usual Deep Q-Network work with Atari efficiently, to be able to get some information about object velocities.

What kind of memory was that one? Well, yes.

Basically, for the Atari games, we just used the frame buffer heuristic.

We said that we cannot get the state variables exactly, but we can get almost everything we need if we just stack, say, the last four observations, or any other number of observations.

Technically, this gives us all the information we need for Atari, but it has a number of flaws.

For instance, this way we cannot remember anything which happened more than four frames before this particular step.

If you want to monitor something more complex, then four frames is not going to be enough.

In Atari, the effect of this heuristic is only so great because most of the hidden information is just velocity and maybe acceleration of objects, which are traceable from two and three frames respectively.

So, the architecture we used for the Deep Q-Network with the frame buffer was basically this neat scheme here.

The difference between the one-frame DQN and this one is that we have the frame buffer, which contains four images: the image for the current time frame, the previous one, the one before the previous one, and so on, and together they are used to estimate the kind of motion, the dynamics of things, via all those convolutions that they are fed into as different channels.

Now, this kind of stack here, this queue, the first-in-first-out structure, is in a way a simplified agent memory.

Again, it's not that complicated, but it's something which persists between the iterations.

Technically, it solves our problem to some limited extent.
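
The frame buffer itself is a few lines around a FIFO queue; a sketch (frames here are stand-in values, and in practice each would be an image array fed to the network as channels):

```python
from collections import deque

class FrameBuffer:
    """Keeps the last k observations and exposes them as one stacked 'state'."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)   # FIFO: the oldest frame falls out automatically

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame so the stack is always full.
        self.frames = deque([first_frame] * self.k, maxlen=self.k)

    def push(self, frame):
        self.frames.append(frame)

    def state(self):
        # The DQN receives these k frames stacked as input channels.
        return list(self.frames)

buf = FrameBuffer(k=4)
buf.reset(first_frame=0)
for t in [1, 2, 3, 4, 5]:
    buf.push(t)
buf.state()   # only the four most recent frames survive: [2, 3, 4, 5]
```

The `maxlen` argument is what makes this a fixed-size memory: anything older than k steps is gone for good, which is exactly the flaw discussed above.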

However, there's a much more powerful approach here.

The overall idea is that we're trying to train some architecture which assumes that there is a hidden process of environment states, and that you can only see some observation, some visible part, which is not the entire process.

You want to infer that hidden state.

There's actually one architecture in deep learning which works with these exact assumptions, and uses them rather well. What am I talking about?

Yeah. Recurrent Neural Networks.

Of course, there's a bunch of those guys, but the general idea is that you have a vector of numbers, and you learn a transformation which maps your previous vector of numbers, your previous memory state, together with your current observation, your current time frame in Atari or anything else, into a new vector with the same number of elements, so that you can then apply this transformation iteratively for as many steps as you want.
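
Here is that recurrence written out by hand, a minimal sketch with a single scalar hidden state (the weights are fixed toy numbers, not learned ones): the same cell is applied at every step, so it can be unrolled over a sequence of any length.

```python
import math

def rnn_step(h, x, w_hh=0.5, w_xh=1.0):
    # One recurrent step: new hidden state from previous state and current input.
    return math.tanh(w_hh * h + w_xh * x)

def unroll(observations, h0=0.0):
    # Apply the same cell iteratively over the whole observation sequence;
    # the final h depends on every observation that was fed in.
    h = h0
    for x in observations:
        h = rnn_step(h, x)
    return h

h_final = unroll([0.1, 0.2, 0.3])   # summary of the whole sequence
```

The zero initial state h0 plays the role of "no prior information", exactly as in the lecture's picture.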

So, now I'm just going to repeat the information you've already been taught in the very first course of the Advanced Machine Learning specialization, namely the Introduction to Deep Learning.

There, in the final week by Ekaterina Lobacheva, we were taught how recurrent neural networks operate and how they're trained.

You basically initialize those weights here, the blue squares, with random values, and then you simply apply this transformation, basically to itself, for as many steps as you want.

For example, you can take your observations from 10 steps ago, so 10 Markov decision process ticks before you got the current observation, and you feed them into your recurrent neural network from the bottom, from this yellow triangle.

You initialize the initial hidden state, the old state here on the picture, at, say, zero: some fixed value that indicates that the network has no prior information.

Then you simply apply the first iteration of those weights, and then the second iteration, now feeding in the next frame.

So the first one was 10 steps ago, this one is nine steps ago.

Then the third frame and the fourth frame and so on, until you get to the current frame, where now your hidden state, your h, depends on all the previous frames, starting from 10 frames ago, or potentially as many frames as you want.

You may start from, say, a million frames ago; it would only take you, say, a few years to train.

Now, after this whole process, your final hidden state is used to evaluate the Q-function.

Just as usual, you could use either the usual Q-learning or maybe dueling Q-learning or any other hack if you want, to train your network to predict Q-values with the temporal difference error, just like before.

Now, a closer look at those formulas reveals the fact that the only thing that changed since our usual DQN is that wherever we used to depend on states directly, the state is replaced with this O(s), the observation of the state.

Instead of taking just one state, the current s_t, we consider all the observations from, say, s_{t-10}, or some state further in the past, up to the current state.

So, it might be a huge sequence if you're going to train it for long enough.

The way it does so is by learning this recurrent formula.

So basically, the Q-function is, as usual, just a dense layer with one unit per action.

Q depends on h_t, which depends on h_{t-1}, h_{t-2}, h_{t-3}, yada yada yada, until h_{t-k} for some fixed number of steps at which you've decided to stop.

Each h_t in turn depends on its observation: h_t depends on the observation of s_t, h_{t-1} depends on the observation of s_{t-1}.

This is how it works; it's just one huge differentiable formula.
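
Written out (with W_q the dense output weights, W_hh and W_oh the recurrent and input weight matrices, and tanh standing in for whatever nonlinearity the cell uses), the recurrence from the last few paragraphs is:

```latex
Q(a \mid o_{t-k}, \dots, o_t) = W_q \, h_t
\qquad\text{(dense layer, one unit per action)}

h_t = \tanh\!\big(W_{hh}\, h_{t-1} + W_{oh}\, o(s_t)\big),
\qquad h_{t-k-1} = \mathbf{0}
```

Expanding the recurrence k times shows why Q ends up depending on the whole window of observations at once.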

Now, this formula has some parameters, namely those weights, the blue matrices here: the weights from the previous hidden state to the new hidden state, and from the input to the new hidden state.

How do you train those weights? How do you actually tune them to make your Q-function as accurate as possible?

How do you do that?

Yes, you backpropagate.

While this formula might scare all the courage out of you, it will most definitely be a much easier job for TensorFlow or Theano or PyTorch or any other automatic differentiation framework, which will just take the formula and compute the necessary gradients.

Then you can use Adam or RMSProp or any method you prefer to tune the weights, just as you did for the convolutional network.

So, here's how it happens.

Unless you're going to use some huge time frames or train it very extensively, it will to some extent learn how to use the previous states.

It will learn to remember some useful information and forget the useless parts.

But recurrent neural networks have a lot of nasty properties.

For example, if you train recurrent neural networks over long time spans, not 10 but, say, 100 previous steps, which might make sense in a lot of situations, there are two problems: gradient vanishing and gradient explosion.

Gradient vanishing is when, by multiplying those gradients together, you run the risk of getting something very close to zero, because if just one of the factors dh_t/dh_{t-1} in the product gets near zero, then the entire product is going to be close to zero as well.

The opposite problem is gradient explosion, which is when you multiply a lot of large gradients and you get some cosmic shift in your weights which basically throws them beyond the range of float32.

Those problems impair the training quite a bit, so to fight them people know a lot of tricks and a lot of architectures that have some kind of workarounds.

For gradient vanishing, there are LSTM and gated recurrent units.

For gradient explosion, there's gradient clipping.

You probably know more than one way to do so.

Actually, if you're more into theoretical deep learning, you also know that there are unitary recurrent neural networks, which have neither of those problems by construction.
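
Gradient clipping is simple enough to sketch in a few lines. This is clipping by global norm, one common variant (a toy standalone function, with gradients as a flat list of floats rather than framework tensors):

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads            # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([30.0, 40.0], max_norm=5.0)   # norm 50 -> rescaled down to 5
```

Because every component is rescaled by the same factor, the direction of the update is preserved; only its magnitude is bounded, which is what stops the "cosmic shift" in the weights.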

So here's how it goes.

You simply introduce a new architecture which has to be trained not just on (s, a, r, s') tuples.

It still has to be trained on observations, actions, rewards, and next observations, but this time it has to access all the observations from, say, 100 or 10 steps ago until now, and run the recurrent neural network over them.

Here's how it's going to work. Now, there's one very popular implementation of DQN with a recurrent neural network, which differs from our previous picture in just a few ways.

First, this Deep Recurrent Q-Network uses LSTM, because of course it does.

LSTM is basically the version of RNN which doesn't suffer from vanishing gradients and has all those nice, almost interpretable properties with forget and update gates and so on.

As usual, it just takes the output of the LSTM, the kind of public recurrent state, the non-cell part, the h of the LSTM, and computes Q-values with a dense layer on top of it.

Okay, like I've just mentioned, you have to train this network in a special way.

You can no longer simply sample single transitions; instead you sample trajectories.

That is, you sample not just a single (s, a, r, s') tuple, but subsequences of those tuples that come one after another.
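
Sampling subsequences instead of single transitions can be sketched like this (a toy illustration: the "replay buffer" is just a list of (o, a, r, o') tuples from one recorded episode, and the transitions are dummy values):

```python
import random

def sample_subsequence(episode, seq_len):
    """Pick a random contiguous window of transitions from one recorded episode,
    so the recurrent network can be unrolled over consecutive observations."""
    assert len(episode) >= seq_len
    start = random.randrange(len(episode) - seq_len + 1)
    return episode[start:start + seq_len]

# episode: list of (obs, action, reward, next_obs) tuples, oldest first
episode = [(t, 0, 0.0, t + 1) for t in range(100)]
chunk = sample_subsequence(episode, seq_len=10)   # 10 consecutive transitions
```

The transitions inside one chunk are consecutive by construction, which is exactly why they are correlated rather than i.i.d., the problem discussed next.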

Here's where one problem occurs.

The problem is that if you sample those trajectories in this special way, then you no longer get independent and identically distributed data.

So, technically, your sampling is no longer i.i.d., and your minibatch optimization is going to be slightly less efficient in this case.

Sometimes those DRQNs are even known to diverge, because there's so much that can go wrong, and something eventually does.

So, basically, if you compare the DRQN versus the usual DQN on the known benchmarks, you'll get something like this.

Sometimes it's better, sometimes it's way better, like here.

But in some cases you can also see that the DRQN is not only not better, but actually worse than the original DQN.

This is because it is much harder to actually train, much more complicated to get to converge.

We'll also study some tricks to improve this performance during the next week, when we use policy-based methods, because there is a method which is specific and very convenient for them, and which also solves this problem just as a side quest.

Until then, you can still train a DRQN with experience replay with some efficiency.

So here's how you mitigate the problem of POMDPs, the partially observable Markov decision processes.

Of course, there's much more to it.

There are special architectures, like the deep neural network equivalent of a planning model, that allow your agent to think proactively.

There is a lot of cool stuff when you have model-based planning.

We'll include all the links about it in the reading section, so that the curious among you can have their curiosity satisfied. Until next week.
