So let's get to the second part of our material for this week.
This time we're going to take a closer look at what it actually takes to train with policy-based methods, compared to the methods we already know, the value-based ones.
Now, instead of giving you yet another long list of points, I want you to do some of the analysis and draw some of the conclusions yourself. Remember, there are key differences in what value-based methods learn and what policy-based methods learn, and I want you to guess what the possible consequences of this difference are.
Yes, the actions themselves. And yes, there is definitely more than one way in which they differ. The most important advantage of policy-based methods is that they learn a simpler problem: the policy itself, rather than the exact values of every action. We'll see how this difference in approach gives you better average rewards later on, when we cover particular implementations of policy-based algorithms.
Now, another huge point here is that value-based and policy-based algorithms have different ideas of how to explore. With value-based methods, if you remember, you have to specify an explicit exploration strategy, say epsilon-greedy or Boltzmann softmax. Basically, you take your Q-values and determine the probabilities of actions from those Q-values, plus whatever extra parameters you want.
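As a rough sketch, assuming discrete actions and an array of Q-values for the current state, those two exploration strategies might look like this:

    import numpy as np

    def epsilon_greedy(q_values, epsilon=0.1):
        # with probability epsilon pick a random action, otherwise the greedy one
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def boltzmann(q_values, temperature=1.0):
        # turn Q-values into a softmax distribution and sample from it;
        # the temperature controls how greedy the sampling is
        logits = np.asarray(q_values, dtype=float) / temperature
        logits -= logits.max()                      # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))

Both epsilon and the temperature are knobs you set by hand, which is exactly the point: exploration is something you bolt on top of the Q-values.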
Now, in policy-based methods you don't have this knob. Instead, you sample actions directly from your policy. And this is both a boon and a quirk, basically. What you have is that, in policy-based methods, you cannot explicitly tell the algorithm to explore more or less. Instead, the algorithm decides for itself: whether it wants to explore more at this stage because it's not yet sure what to do, or whether it wants to do the opposite because it's already confident from the outset.
You can, of course, still affect how policy-based algorithms explore. We'll work this out in just a few slides.
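For comparison, here is a minimal sketch of exploration on the policy-based side, assuming the policy network gives you the action probabilities pi(a|s) for the current state:

    import numpy as np

    def sample_action(policy_probs):
        # the policy itself is a distribution over actions, so exploration
        # is just sampling from it: there is no separate epsilon or temperature
        policy_probs = np.asarray(policy_probs, dtype=float)
        return int(np.random.choice(len(policy_probs), p=policy_probs))

If you do want to push the policy towards more exploration, one common trick is to add an entropy bonus to the training objective.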
Now, finally, you can point out some areas where the current state of the art is better developed for value-based methods, and some where it favors the policy-based ones. For value-based methods, one main strength is that, unlike policy-based methods, they give you a free estimate of how good a particular state is. You can use it to monitor training progress, for example on learning curves, and in other algorithms that rely on this value estimate.
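For instance, with discrete actions you can read the state value straight off the Q-values; a tiny sketch with a hypothetical set of Q-values for one state:

    import numpy as np

    q_values = np.array([1.2, 0.3, -0.5])   # hypothetical Q(s, a) for one state
    state_value = q_values.max()            # V(s) = max_a Q(s, a) under the greedy policy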
Finally, value-based methods have more mechanisms designed to train off-policy. For example, both Q-learning and Expected Value SARSA, simple algorithms as they are, can be trained on sessions sampled from stored experience just as well as on their own sessions. The main advantage here is that, since you can train off-policy, you improve sample efficiency: your algorithm requires less training data, less actual playing, to converge to the optimal strategy.
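Here is a minimal sketch of why this works for Q-learning, assuming tabular Q-values and a replay buffer of transitions that may have been recorded under an older policy (all sizes and transitions below are made up):

    import random
    import numpy as np

    def q_learning_update(Q, transition, alpha=0.1, gamma=0.99):
        # the target uses the max over next actions, not the action the behaviour
        # policy actually took, so replayed transitions work just as well
        s, a, r, s_next, done = transition
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    Q = np.zeros((4, 2))                                   # toy: 4 states, 2 actions
    replay_buffer = [(0, 1, 1.0, 2, False), (2, 0, 0.0, 3, True)]
    q_learning_update(Q, random.choice(replay_buffer))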
Now, of course, there are ways to reuse old sessions for policy-based methods as well, but they are slightly harder to grasp and even harder to implement.
Speaking of the advantages of policy-based methods, first, you have this innate ability to work with any kind of probability distribution. For example, if your actions are not discrete but continuous, you can specify a multi-dimensional normal distribution, or a Laplacian distribution, or anything else that suits your particular task, plug it into the policy gradient formula, and it will work like a charm.
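A sketch of the continuous case, assuming the policy network outputs a mean and log-standard-deviation per action dimension (the numbers below stand in for hypothetical network outputs):

    import numpy as np

    def sample_continuous_action(mu, log_sigma):
        # pi(a|s) is a diagonal Gaussian: N(a; mu(s), diag(sigma(s)^2))
        sigma = np.exp(log_sigma)
        action = mu + sigma * np.random.randn(*mu.shape)
        # log pi(a|s) is all the policy gradient formula needs from the distribution
        log_prob = -0.5 * np.sum(((action - mu) / sigma) ** 2
                                 + 2 * log_sigma + np.log(2 * np.pi))
        return action, log_prob

    mu, log_sigma = np.array([0.1, -0.3]), np.array([-1.0, -1.0])
    action, log_prob = sample_continuous_action(mu, log_sigma)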
Basically, this allows you to train reasonably well on continuous action spaces, and of course you can do better with specialized algorithms that we're going to cover in the reading section. All things considered, though, it is a strong argument for using policy-based methods.
Finally, since policy-based methods learn the policy, the probability of taking an action in a state, they have one super neat property: they train exactly the thing you train when you do supervised learning. This basically means that you can transfer between policy-based reinforcement learning and supervised learning without changing anything in your model. So you have a neural network, you train it as a classifier, and you can reuse it directly as the policy of an agent trained with REINFORCE or actor-critic. You might have to train another head for actor-critic, but that is not as hard as retraining a whole set of Q-values.
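As a sketch of that last point, assuming a PyTorch model with made-up layer sizes: the body and the policy head are exactly a classifier over actions, and the value head is the extra piece you bolt on for actor-critic.

    import torch
    import torch.nn as nn

    class PolicyWithCritic(nn.Module):
        def __init__(self, n_inputs=4, n_actions=2, hidden=128):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_inputs, hidden), nn.ReLU())
            self.policy_head = nn.Linear(hidden, n_actions)   # logits of pi(a|s), i.e. the classifier
            self.value_head = nn.Linear(hidden, 1)            # state value V(s) for the critic

        def forward(self, states):
            h = self.body(states)
            return self.policy_head(h), self.value_head(h)

    logits, value = PolicyWithCritic()(torch.randn(1, 4))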