In everyday life, we learn

a lot without getting explicit,

positive, or negative feedback.

Imagine, for example, that you are riding

a bike and hit a rock that sends you off balance.

Let's say you recover so you don't suffer any injury.

You might learn to avoid rocks in the future and

perhaps react more quickly if you do hit one.

How do we know that hitting a rock is bad

even when nothing bad happened this time?

The answer is that we recognize that the state of losing

our balance is bad even

without falling and hurting ourselves.

Perhaps we had similar experiences in

our past when things didn't work out so nicely.

In reinforcement learning,

a similar idea allows us to relate the value of

the current state to the value of future states

without waiting to observe all the future rewards.

We use Bellman equations to formalize

this connection between the value

of a state and its possible successors.

By the end of this video,

you'll be able to derive

the Bellman equation for state value functions,

define the Bellman equation for action value functions,

and understand how Bellman equations

relate current and future values.

First, let's talk about

the Bellman equation for the state value function.

The Bellman equation for

the state value function defines a relationship

between the value of a state and

the value of its possible successor states.

Now, I'll illustrate how to derive this relationship from

the definitions of the state value function and return.

Let's start by recalling that the state value function

is defined as the expected return starting

from the state S. Recall that

the return is defined as

the discounted sum of future rewards.

We saw previously that the return at time t

can be written recursively as

the immediate reward plus

the discounted return at time t plus 1.
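In standard notation, with \gamma as the discount factor, these two definitions read:

v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}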

Now, let's expand this expected return.

First, we expand the expected return as

a sum over possible action choices made by the agent.

Second, we expand over possible rewards

and next states, conditioned on state S and action A.

We can break it down in this order because

the action choice depends only on the current state,

while the next state and reward

depend only on the current state and action.

The result is a weighted sum of terms consisting of

immediate reward plus expected future returns

from the next state S prime.

All we have done is explicitly write

the expectation as it's defined,

as a sum of possible outcomes

weighted by the probability that they occur.
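Written out, the expansion looks like this, where \pi(a \mid s) is the probability of selecting action a in state s, and p(s', r \mid s, a) is the dynamics function:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma\, \mathbb{E}_\pi[\, G_{t+1} \mid S_{t+1} = s' \,] \right]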

Note that capital R_t plus 1 is a random variable,

while the little r represents

each possible reward outcome.

The expected return depends on states and

rewards infinitely far into the future.

We could recursively expand

this equation as many times as we want,

but it would only make the expression more complicated.

Instead, we can notice that this expected return

is also the definition of

the value function for state S prime.

The only difference is that the time index is t

plus 1 instead of t. This

is not an issue because

neither the policy nor p depends on time.

Making this replacement, we get

the Bellman equation for the state value function.
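That is:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]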

The magic of value functions is that we can use them as a

stand-in for the average of

an infinite number of possible futures.

We can derive a similar equation

for the action value function.

It will be a recursive equation for the value of

a state action pair in terms of

its possible successor state action pairs.

In this case, the equation does not

begin with the policy selecting an action.

This is because the action is already

fixed as part of the state action pair.

Instead, we skip directly to the dynamics function

p to select the immediate reward and next state S prime.

Again, we have a weighted sum over

terms consisting of immediate reward

plus expected future return given

a specific next state little s prime.
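In symbols, this intermediate step is:

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, \mathbb{E}_\pi[\, G_{t+1} \mid S_{t+1} = s' \,] \right]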

However, unlike the Bellman equation

for the state value function,

we can't stop here.

We want a recursive equation for the value of

one state action pair in

terms of the next state action pair.

At the moment, we have

the expected return given only the next state.

To change this, we can express the expected return from

the next state as a sum

of the agent's possible action choices.

In particular, we can change

the expectation to be conditioned on

both the next state and

the next action and then sum over all possible actions.

Each term is weighted by the probability under Pi

of selecting A prime in the state S prime.

Now, this expected return is the same as

the definition of the action value function

for S prime and A prime.

Making this replacement, we get

the Bellman equation for the action value function.
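That is:

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right]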

So we have now covered how to derive

the Bellman equations for

state and action value functions.

These equations provide relationships

between the values of a state or

state action pair and

the possible next states or next state action pairs.

You might be wondering why we care so much

about these definitions and relationships.

The Bellman equations capture

an important structure of

the reinforcement learning problem.

Next, we will discuss why this relationship is

so fundamental in reinforcement

learning and how we can use

it to design algorithms which

efficiently estimate value functions.
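To make this concrete, here is a minimal sketch of how the Bellman equation for the state value function can be turned into an iterative update rule. The toy two-state MDP and the names P, R, pi, and gamma are all illustrative assumptions, not anything defined in this video:

import numpy as np

# Toy MDP: 2 states, 2 actions (all numbers are made up for illustration).
# P[s, a, s2]: probability of moving to state s2 after taking action a in state s.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a]: expected immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# pi[s, a]: probability the policy selects action a in state s.
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])
gamma = 0.9

# Repeatedly apply the Bellman equation as an update:
#   v(s) <- sum_a pi(a|s) * ( R[s, a] + gamma * sum_s2 P[s, a, s2] * v(s2) )
v = np.zeros(2)
for _ in range(1000):
    v_new = np.einsum('sa,sa->s', pi, R + gamma * (P @ v))
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v)  # approximate state values under pi

Because gamma is less than 1, this fixed-point iteration converges; it is essentially iterative policy evaluation, one of the algorithms alluded to in the closing line.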