0:02

Welcome back to our reinforcement learning course.

By the end of last week, you had learned one huge concept: if you know the value function or the action-value function, then you can find the optimal policy very easily. We also found at least one way to infer this value or action-value function by means of dynamic programming.

This is all nice and well, but now we're going to try to understand how this transfers to practical problems, and find out all the limitations that come with it.

What I mean by practical problems is basically any kind of problem that arises in the wild, where you don't have access to all the state-action transition probabilities and so on.

In the worst case, as I have probably shown you already, you have all the access you want to the agent, you can implement anything there, but the environment is mostly a black box that just responds to your actions in some hardly modellable way. Think of the environment as your Atari game, or maybe a robotic car, or an automatic pancake flipper, for that matter.

Even in the trivial case of Go, and please don't blame me for calling it trivial, you don't know anything about how, for example, your opponent is going to react. You can model it with some theory, but you're never given accurate predictions of what is going to happen on the next turn.

So the first problem that arises here, and it is actually a huge problem, is that you no longer have access to the state transition probability distribution, nor to the reward function. You can sample states and rewards from the environment, but you don't know the exact probabilities of them occurring.

So, in this case, of course, you cannot simply compute the expectation of the action values with respect to possible outcomes. And this prevents you from both training and using your optimal policy given the value function.

So, what are you going to do to approach this problem? Is there any trick of the trade from machine learning that you use when you don't know a probability distribution? Oh yeah, you kind of learn it. This is what you do in machine learning when you have an unknown dependency hidden in the data and a lot of samples to train a model on.

You can train another network, for example, one that takes your Breakout game state and predicts the probability of the next state. It would kind of, sort of, technically work, but the problem here is that it's usually much harder to learn how the environment works than to find an optimal policy in it.

In Breakout, this transition function is actually an image-to-image problem, for which you would probably have to use a fully convolutional network that takes an image and predicts the next image, which is super complicated compared to simply picking an action.

In a more leisurely problem, if you're trying to decide whether or not you want a cup of coffee, you're not required to find out how the coffee machine works. Or you can, but that's a lot of spare work that you don't actually need.

Instead, what you want to do is design a new algorithm that gets rid of this probability distribution, one that works by only using samples from the environment.
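As a taste of what such a sample-based algorithm can look like, here is a minimal sketch in the style of tabular Q-learning. Everything here, the state and action sets, the learning rate, the simulated transition, is a hypothetical placeholder; the point is only that the update consumes sampled transitions (s, a, r, s') and never touches the transition probabilities themselves.

```python
import random
from collections import defaultdict

# Sketch of a sample-based value update (tabular Q-learning style),
# assuming discrete states and actions. The environment is queried only
# through samples (s, a, r, s'), never through P(s' | s, a).
Q = defaultdict(float)           # Q[(state, action)] -> value estimate
ACTIONS = [0, 1]                 # hypothetical action set
GAMMA, ALPHA = 0.9, 0.1          # discount factor and learning rate

def update(s, a, r, s_next):
    """Move Q(s, a) toward the sampled one-step target r + gamma * max_a' Q(s', a')."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# One simulated transition: action 1 in state 0 yields reward 1 and lands back in state 0.
update(0, 1, 1.0, 0)
```

Repeating this update on fresh samples makes the estimate drift toward the true expected return, without ever writing down the distribution over next states.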

So, let's add a bit more formulas, a bit more detail, to this problem. With your usual value iteration, there are two missing links, two steps that you cannot compute explicitly.

First, you cannot compute the maximum over all possible actions. To do so, you would have to actually see the rewards for all actions. And in the model-free setting, the black-box setting, this would take at least one attempt for each action.

So, to figure out whether your robot should do action A or action B (should it, for example, jump forward or just make a single step forward), it would have to do both things and then see which one of them works out better, plus the value function of the resulting state. This is kind of impossible, because in real life, if you take some particular action, there is no undoing it; you cannot go back in time.

Now, the problem of the expectation. Here, you actually have to take an expectation over all possible outcomes, all possible next states. And this is another problem that you cannot approach directly, because in real life you're only going to see one outcome, one possible result.

So, say you're playing a slot machine. If you pull the lever, you are only going to see one outcome, not the whole set of outcomes with their respective probabilities. Otherwise, you would be much better off at any kind of gambling.
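The slot-machine point can be made concrete with a tiny Monte Carlo sketch: each pull reveals only one sampled outcome, but averaging many pulls estimates the expectation you can never read off directly. The payout distribution below is entirely hypothetical.

```python
import random

def pull_lever(rng):
    # Hypothetical machine: pays 10 with probability 0.2, else 0.
    # True expected payout = 0.2 * 10 = 2.0, but no single pull reveals it.
    return 10.0 if rng.random() < 0.2 else 0.0

def estimate_payout(n_pulls, seed=0):
    """Estimate the expected payout by averaging sampled outcomes."""
    rng = random.Random(seed)
    return sum(pull_lever(rng) for _ in range(n_pulls)) / n_pulls
```

With enough pulls, the sample average lands close to the true expectation of 2.0; this is exactly the substitution of sampling for expectation that model-free methods live on.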

Now, what happens here is that you basically have a lot of expectations and maximizations that you cannot take exactly. So, let's find out what we actually can do, how we can approximate them and approach this problem.

The usual model-free setting requires that you train not from all the states and actions, but from trajectories. A trajectory is basically a history of your agent playing the game, that is, a history from state zero to the last state.

So, you begin playing Breakout, you take a few actions, you break a few bricks, and then you lose or you win, or whatever, depending on your agent. Or it may be a partial session, so you have begun but maybe haven't finished yet. This trajectory is basically a set of states, actions, and rewards coming in a sequence: first state, first action, first reward, second state, second action, second reward, and so on.
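A rollout that produces such a trajectory can be sketched as follows. The tiny chain environment and the policy here are hypothetical stand-ins for a real black-box environment; the interface (reset/step) mimics the common gym-style convention but is defined locally so nothing external is assumed.

```python
class ChainEnv:
    """Toy environment: walk right until position 3; each step costs -1, reaching the end pays +10."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action: 0 = stay, 1 = move right
        self.pos = min(self.pos + action, 3)
        done = self.pos == 3
        return self.pos, (10.0 if done else -1.0), done

def rollout(env, policy, max_steps=100):
    """Play one (possibly partial) session and record the (state, action, reward) sequence."""
    s, trajectory = env.reset(), []
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))  # one (s_t, a_t, r_t) triple
        if done:
            break
        s = s_next
    return trajectory
```

Calling `rollout(ChainEnv(), lambda s: 1)` yields the sequence (s0, a0, r0), (s1, a1, r1), ... described above, ending when the episode terminates or the step budget runs out.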

Of course, you can sample many of those trajectories, and most of the time you will need plenty of them. But in many cases, and practical applications are most important here, each trajectory is a particular expense on your side.

So, if you're training your robotic car, you would actually have to consider the expenses: the gasoline, and maybe the amount of time you spend, just to get one five-minute session of driving down a street.

Now, if we're talking about, let's say, Atari games, it's a little bit cheaper, because you no longer need to spend money directly; you just need to spend computer resources, which do convert into money. Those costs are different for each environment, but they are usually nonzero, so you have to take that into consideration.

Now, the other issue is, again, that we are not able to see all the possible outcomes; we only see one outcome, because you only try one action at a time. So, to cover all the possible outcomes, you have to sample many trajectories and average over the different actions and different outcomes in them. That can be quite costly.
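The averaging just described can be sketched as a Monte Carlo estimate: collect many trajectories, compute the discounted return that followed each (state, action) pair, and average. The trajectory format, (state, action, reward) triples, matches the sequence described earlier; the numbers are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor

def mc_q_estimate(trajectories):
    """Estimate Q(s, a) as the average discounted return observed after each (s, a)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        g = 0.0
        # Walk backwards so the return after step t is built incrementally:
        # g_t = r_t + gamma * g_{t+1}
        for s, a, r in reversed(traj):
            g = r + GAMMA * g
            totals[(s, a)] += g
            counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```

The estimate sharpens as more trajectories are averaged in, which is precisely why the per-trajectory cost discussed above matters.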

So, once you have those trajectories, we have to somehow use them to train our algorithm. And the first question, which is a question for you, by the way, is: which kind of value function would you prefer to train? If you only have trajectories and no probability distribution, would it be better to have a perfect value function of a state, or an action-value function of a state and an action? Which is better?

Well, right. As you probably remember, or if not, you might have guessed using your common sense: if you have a state value function, then to find an optimal policy you actually have to average with probabilities that come from the environment. That is, you have to compute the expectation of this value function over all the possible next states. You don't get those probabilities unless you explicitly approximate them.

On the contrary, if you already have a perfect Q-function, an action-value function, you just pick the action with the highest action value, and you're golden: that's your optimal policy. So, the first decision here is that, unless we're trying something very specific and exotic, we would be better off learning a Q-function than a V-function, and even an imperfect Q-function will do.
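The contrast can be made explicit in a few lines. With a Q-function, policy extraction is a bare argmax; with a V-function, it additionally needs the transition probabilities P(s' | s, a), which a black-box environment hides. The tiny two-state MDP below, its values, rewards, and transitions, is entirely hypothetical.

```python
GAMMA = 0.9
Q = {(0, 0): 1.0, (0, 1): 3.0}              # assumed action values in state 0
V = {0: 1.0, 1: 5.0}                        # assumed state values
P = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0}}    # P[(s, a)] -> {s': prob}; model-free agents lack this
R = {(0, 0): 0.0, (0, 1): 1.0}              # assumed expected immediate rewards

def greedy_from_q(s, actions=(0, 1)):
    """With Q, the greedy action is a plain argmax: no model needed."""
    return max(actions, key=lambda a: Q[(s, a)])

def greedy_from_v(s, actions=(0, 1)):
    """With V, you must average over next states, which requires knowing P."""
    def one_step(a):
        return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())
    return max(actions, key=one_step)
```

Both extractors agree here, but `greedy_from_v` is only computable because this toy example hands us `P` and `R`; in the model-free setting they are exactly what is missing.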

Now, to keep it strict and formal, let's recap what the action value is and how it is defined. The definition from the last lecture is that the action value is the expected discounted return, the reward plus gamma times the next reward and so on, that you would get if you start from state s and take action a, where both s and a are fixed, given here. Then you end up in the next state, after which you follow your policy. If this policy is an optimal policy, this gives you Q-star. If it's just your current policy, it will be Q-pi, as far as the notation goes.
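Written out, the definition just stated reads as follows (in the usual notation, which I'm assuming matches the slides):

```latex
Q^{\pi}(s, a) = \mathbb{E}\!\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \;\middle|\; s_t = s,\ a_t = a,\ a_{t+1}, a_{t+2}, \ldots \sim \pi \,\right]
```

Replacing the policy π with an optimal policy π* in the conditioning turns this into Q*.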

So, the good part about the Q-function is that knowing it, by definition, gives you access to an optimal policy, which is moreover deterministic. And the Q-function itself is very easy to express in terms of the V-function. So this formula, with only an expectation over next states here, gives you a way to estimate the Q-function, and if you unroll the V term here as an expectation of action values over the policy, you'll get a recurrent formula for the Q-function. This is all probably familiar information for you, since you've gone through the last week.
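For reference, the two relations just described can be written as (notation assumed to match the lecture slides):

```latex
Q^{\pi}(s, a) = \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma\, V^{\pi}(s') \,\right],
\qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\!\left[\, Q^{\pi}(s, a) \,\right]
```

and substituting the second equation into the first gives the recurrent formula for the Q-function:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{s'}\!\left[\, r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s', a') \,\right]
```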
