
So, welcome to week two of our reinforcement learning course. During this week, we will talk about core concepts lying at the heart of reinforcement learning. How do we explain to an agent what we want it to do?

The answer to this question is the reward hypothesis, which states that we can formulate any goal and any purpose in terms of the cumulative sum of a scalar signal. This signal is called the reward, and the sum of this signal over time is called the return.

So, the return G sub t is the sum of immediate rewards from an arbitrary time t until the very end of the episode. The end of the episode is denoted by a capital T. This return consists of all rewards up to the end of the episode and thus is a measure of the global optimality of the agent's policy. The return is a random variable, because each immediate reward depends on the agent's action and also on the environment's reaction to this action.
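As a minimal sketch of this definition (the reward values here are made up for illustration), the undiscounted return from time t is just the sum of the remaining rewards in a finished episode:

```python
def undiscounted_return(rewards, t=0):
    """G_t: sum of the immediate rewards from timestep t to the end of the episode."""
    return sum(rewards[t:])

episode_rewards = [0.0, 1.0, 1.0, 0.0, 3.0]  # made-up immediate rewards
print(undiscounted_return(episode_rewards))       # 5.0 — return from the start
print(undiscounted_return(episode_rewards, t=2))  # 4.0 — rewards before t are excluded
```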

For now, consider as an example the game of chess. Let's assume that we have designed the immediate reward to be the value of the opponent's pieces taken at a particular timestep t. Then the return is equal to the total value of all the opponent's pieces an agent has managed to take by the end of the game.

Although mathematically convenient, such a formulation of our desire could have side effects. So, to better understand the limitations of the cumulative sum of immediate rewards, consider another example: a non-stop cooling system for a data center.

An agent controls the temperature in a data center room and can adjust the speed of different fans to enforce the required temperature regime. We reward this agent with plus one for each second the system's temperature is sufficiently low, and give it a reward of zero otherwise. So, can you think of any possible problem with such a design?

Well, the problem here is the length of the episode. Unlike chess games, this task doesn't have a natural ending. That is, the cooling system is meant to operate every day of the week, every single moment, and to keep operating that way forever.

The essence of the problem with an infinite horizon lies in the optimization problem: our objective, the return, is infinite for a myriad of non-optimal behaviors. For instance, it is infinite even for an agent whose behavior violates the temperature regime at every second timestep. Tasks with infinite horizons are often called continuing tasks.

Of course, we could split such an infinite horizon into fixed-length chunks, for example hours or days, and assess the agent on the basis of its performance during these chunks. However, this approach requires a manual decision about which chunk length is appropriate, and it does not generalize well to arbitrary problems.

The infinite horizon is not the only problem which may occur with this formulation of the return. To better understand another problem with the cumulative sum of rewards, consider as an example an agent responsible for floor cleaning and air conditioning in a building. We encourage our agent to clean the floor by rewarding the cleaning action with a large value of 100, but because we also want the air in the building to be fresh, we also reward the agent for turning the air conditioning system on and subsequently off, with a reward equal to one.

In this example there is no infinite horizon, because the episode ends when the day is over. So, what potential problem do you see in such a design?

Well, this example illustrates the problem of a positive feedback loop. A positive feedback loop is a cycle which allows an agent to gain a large, or possibly infinite, reward. In this example, an agent will completely ignore the task of floor cleaning, because that activity takes the whole day to finish, while turning the air conditioning on and off takes one second. And by performing this sequence of actions many, many times throughout the day, the agent is able to obtain a cumulative reward much larger than 100.
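A back-of-the-envelope check of that claim (the cycle duration of two seconds is an assumption for illustration): a day of toggling the air conditioner dwarfs the single cleaning reward of 100.

```python
day_seconds = 24 * 60 * 60   # length of the episode in seconds
cycle_seconds = 2            # one on/off toggle (assumed duration)
toggle_reward = 1
cleaning_reward = 100

# Cumulative reward from doing nothing but toggling all day long.
toggling_return = (day_seconds // cycle_seconds) * toggle_reward
print(toggling_return)                     # 43200
print(toggling_return > cleaning_reward)   # True — cleaning is ignored
```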

Well, the feedback loop in this task is an example of bad reward design. We will discuss reward design a little bit later, but for now, think about environments where such an infinite loop is in fact desired. For example, giving a positive reward for each second an agent keeps riding a bicycle is an example of a desired positive feedback loop.

We want the agent to never fall off the bike, and in this case, as the agent gets better and better at riding the bicycle, we could also face the problem of an unboundedly increasing sum of rewards. This unboundedness can greatly harm our optimization procedure and thus could break the learning process.

But how can we deal with an infinite return? What could help us against the return being very large due to a positive feedback loop? One of the most common approaches is discounting. Discounting means that we introduce some multiplier gamma, which is less than one and greater than or equal to zero.

This multiplier represents the rate at which things lose their value over time. That is, each reward received later than the current moment t is reduced by multiplying the reward by gamma raised to the power of the number of timesteps until this reward. So discounting focuses the agent's attention more on close rewards and reduces the value of very distant ones.

More informally, compared to today's cake, the same cake is worth gamma times less tomorrow, gamma squared times less the day after tomorrow, and so on. So a reward received n timesteps in the future is worth gamma to the power of n times less compared with the same reward arriving right now.
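A minimal sketch of the discounted return under this convention (the rewards are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 + 0.729 = 3.439
```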

Such discounting makes the infinite sum finite, provided that each term in the sum is bounded both from above and from below.

Well, why does discounting solve the problem of the return being infinite? This is so because of a mathematical property of the geometric progression. For example, if each immediate reward is equal to one, as in the example of the data center cooling system, the infinite sum is equal to one over one minus gamma. Note that this maximal return has nothing to do with some number of steps after which the agent stops caring, and this fact is useful to keep in mind.
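A quick numerical check of the geometric-series bound, assuming a reward of one on every step:

```python
gamma = 0.9
n_terms = 1000  # enough terms to approximate the infinite sum

# Partial sum of 1 + gamma + gamma^2 + ... approaches 1 / (1 - gamma).
partial_sum = sum(gamma ** k for k in range(n_terms))
print(partial_sum)        # very close to 10.0
print(1 / (1 - gamma))    # 10.0
```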

On the plots, you can see what a reward of plus one becomes after being discounted n times with different gammas. Note that the slope of each curve decreases as n increases; this decrease suggests that events in the near future are discounted at a higher rate than events in the distant future.

Also note that as long as gamma is not exactly equal to zero, every reward, even an infinitely distant one, contributes to the return computation. It may contribute very, very little for very distant rewards, but it certainly does so. That is, our agent still cares about distant rewards, just not as much as in the case of the undiscounted return.

However, you should understand that the agent will be almost indifferent to very distant rewards, and this change of the optimization objective definitely changes what the optimal policy looks like in each of these cases.

So, why is discounting a meaningful thing to do? Well, the reason for discounting is partially mathematical convenience and partially inspiration from human behavior.

Humans, just like animals, given two similar rewards, show a preference for the one that arrives sooner rather than later. Humans also discount the value of a later reward by a factor that increases with the length of the delay.

However, the scientific literature suggests that the human and animal discounting function f(t) is different and looks more like hyperbolic discounting. The discounting function used in reinforcement learning is very similar to so-called quasi-hyperbolic discounting. Namely, if we assume that beta in quasi-hyperbolic discounting is equal to one, then we get precisely the same discounting scheme you were shown on the previous slide. In some sense, quasi-hyperbolic discounting is a discrete approximation to a hyperbolic discounting function, and thus it's rather close to how humans discount. However, unlike hyperbolic discounting, it has some nice mathematical properties.

The second reason for this particular kind of discounting is mathematical convenience. It not only makes infinite sums finite while preserving some amount of contribution from each reward, but it also allows us to express the return in a recurrent form. We will heavily rely upon this recursive definition of G_t, as R_t plus gamma multiplied by G_{t+1}, in future lectures. So, make sure you understand and remember it.
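A small sketch of that recursion on made-up rewards: computing G_t for every timestep in one backward pass via G_t = R_t + gamma * G_{t+1}.

```python
def returns_backward(rewards, gamma):
    """Compute G_t for every t in one backward pass using the recursion."""
    G = 0.0  # G_T is zero past the end of the episode
    out = []
    for r in reversed(rewards):
        G = r + gamma * G          # G_t = R_t + gamma * G_{t+1}
        out.append(G)
    return out[::-1]

rewards = [1.0, 0.0, 2.0]
print(returns_backward(rewards, gamma=0.5))
# G_2 = 2.0, G_1 = 0 + 0.5*2 = 1.0, G_0 = 1 + 0.5*1 = 1.5 → [1.5, 1.0, 2.0]
```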

We can also think about discounting from a different, more theoretical perspective. Under the Markov assumption, any action affects only the immediate reward and the next state. Yet any action can affect all or some of the future rewards by affecting the next state. That is, when you find a treasure in the corner of a room, you definitely should assign the credit for this reward to the action of opening the door to this room.

So, let's assume that the effect of an action lasts some number of subsequent steps after the action was committed, and then ends. Let us also treat gamma as the probability that the effect continues, and one minus gamma as the probability that the effect ends. Then the expected amount of return which is due to the current action is exactly equal to the discounted return. That is, with probability one minus gamma the effect ends immediately after the action, and only the immediate reward R0 is attributed to the action committed now. With probability gamma the effect lasts two timesteps and then ends with probability one minus gamma; in this case, R0 and R1 are attributed to the current action, and so on. So, you get the idea.
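A sketch verifying this interpretation by simulation (the rewards are made up): if the action's effect survives each step with probability gamma, the expected attributed sum matches the discounted return.

```python
import random

def expected_attributed_return(rewards, gamma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the reward attributed to the current action
    when its effect continues past each step with probability gamma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for r in rewards:
            total += r
            if rng.random() >= gamma:   # the effect ends here
                break
    return total / n_samples

rewards = [1.0, 2.0, 3.0]
gamma = 0.5
estimate = expected_attributed_return(rewards, gamma)
exact = sum(gamma ** k * r for k, r in enumerate(rewards))  # 1 + 1 + 0.75 = 2.75
print(round(estimate, 2), exact)  # estimate is close to 2.75
```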

Let us now speak a little bit about reward design. The two examples given in this lecture are, in fact, examples of bad reward design. Consider, for example, the game of chess and a reward equal to the value of the taken opponent's piece. In this case, an agent will not have the desire to win, because it knows for sure that it will not be rewarded for winning. That is so because we reward the agent only for taking pieces. In the cleaning robot example, the same applies to the reward given for the action of cleaning the floor.

These two examples share the same mistake: in both of them, the reward is given for how an agent should perform its task, and not for what we want it to do.

In the game of chess, we should reward an agent for winning the game and not for taking pieces, because the latter could simply result in losing every single game, with a lot of the opponent's pieces taken along the way. That is surely not what we want. The same is valid for the second example: we should reward an agent for fresh air and a clean floor, but not for using the means to achieve this.

However, such sparse rewards, given only at the very end of an episode, say plus one in the chess game for winning, may introduce additional difficulties into the agent's learning. To some extent, this may be mitigated by reward shaping, which we'll discuss a bit later.

The next concern is about reward shifting. In machine learning courses, you may have gotten used to standardizing the training and testing data before doing anything with it. For example, subtraction of the mean and division by the standard deviation is often a good idea in supervised and unsupervised learning.

In reinforcement learning, this is valid only for state representations, but not for rewards: you should not standardize rewards. So, let us see what will happen with the optimal policy if you shift all the rewards by some mean value.

Consider a world with four states, where the episode starts in S1 and ends in S4. Rewards for transitions are written above the arrows. In this example, the mean of the rewards is negative, say minus three, and subtraction of this value turns a negative feedback loop between states S2 and S3 into a positive feedback loop.

So, the optimal policy for this world with modified rewards will definitely make use of this positive feedback loop. However, in the world with the untransformed rewards, this policy will get a possibly unboundedly bad return. So the second takeaway is: do not subtract anything from the rewards unless you are pretty sure about what you are doing.
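A toy illustration with made-up numbers (the slide's exact rewards are not reproduced here): a loop whose rewards are negative becomes profitable once you subtract a negative mean.

```python
loop_rewards = [-2.0, -1.0]  # made-up rewards for the S2 -> S3 -> S2 cycle
mean_reward = -3.0           # assumed mean of all rewards in the MDP

per_cycle = sum(loop_rewards)                                   # -3.0: looping is bad
shifted_per_cycle = sum(r - mean_reward for r in loop_rewards)  # +3.0: looping now pays
print(per_cycle, shifted_per_cycle)
```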

So, what transformations do not change the optimal policy? One such transformation is reward scaling. That means you can divide all the rewards in a Markov decision process by any positive constant and be sure that the optimal policy hasn't changed.

Scaling by a constant may be especially helpful if you know beforehand the greatest immediate reward possible in a Markov decision process. It is also useful for approximate methods, which we are going to cover in future videos.
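A minimal check of why scaling is safe (the policies and their returns are made up): dividing all rewards by a positive constant rescales every policy's return by the same factor, so the ranking of policies, and hence the optimal policy, is unchanged.

```python
returns = {"policy_a": 10.0, "policy_b": 4.0, "policy_c": 7.5}  # made-up returns
scale = 10.0  # e.g. the largest possible immediate reward

scaled = {name: g / scale for name, g in returns.items()}
best = max(returns, key=returns.get)
best_scaled = max(scaled, key=scaled.get)
print(best, best_scaled)  # the argmax over policies is preserved
```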

Another transformation which doesn't change the optimal policy in a Markov decision process is related to so-called reward shaping. Reward shaping is an umbrella term for many methods that aid an agent in the learning process by adding extra values to the immediate rewards coming from the MDP.

In general, under such reward shaping the optimal policy changes. But there exists an important theorem specifying what the extra rewards should look like in order to preserve the optimal policy. In fact, these extra values should be equal to the difference of the potentials of the next and current states. The potential function is denoted by phi on this slide. We can also multiply the potential of the next state by the discount factor gamma, if we want.

The resultant potential-based function, F of s, a and s prime, is presented on the slide. In particular, if we know nothing about the transition probabilities and rewards in an MDP, a potential-based function F of the form depicted on this slide is the only one that preserves the optimal policy under reward shaping.

The intuition for why such an addition of F doesn't change the optimal policy may be as follows. Pretend for simplicity that the discount factor gamma is equal to one, that is, there is no discounting in the MDP. In such a case, the potential-based function summed over any circular sequence of states, that is, one starting and ending in the same state, is precisely equal to zero.

That is why an agent cannot fall into the trap of a positive feedback loop caused by the introduction of potential-based rewards. Also, note that the added values do not depend on the action a, but only on the current and next state potentials. Thus, the potentials phi cannot influence the action selection directly, only by means of the next state. But the influence of the next state's potential is also eliminated, because it is subtracted from the total return once we subsequently exit the next state s prime.
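A sketch of potential-based shaping with gamma equal to one (the potential values are made up): the extra reward is F(s, s') = phi(s') - phi(s), and summed along any cycle these terms telescope to zero, so no feedback loop can be created.

```python
phi = {"S1": 0.0, "S2": 5.0, "S3": 2.0}  # made-up potentials

def shaping_bonus(s, s_next, gamma=1.0):
    """Potential-based extra reward F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi[s_next] - phi[s]

cycle = ["S2", "S3", "S2"]  # a circular state sequence
total = sum(shaping_bonus(s, s_next) for s, s_next in zip(cycle, cycle[1:]))
print(total)  # 0.0 — the shaping terms telescope along any cycle
```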

This potential-based shaping trick is a rather advanced technique. So, if you got interested, you are highly encouraged to read the relevant paper specified at the bottom of the slide.
