
At a glance, Monte Carlo tree search is nothing but a family of decision-time planning algorithms which may be viewed as distant relatives of heuristic search. Note that Monte Carlo tree search is a planning approach, and it requires a fair amount of computation time. In particular, algorithms of the Monte Carlo tree search family rely heavily on Monte Carlo rollout estimates of the action-value function. However, these algorithms usually do not throw such estimates away after an action is made, but preserve some of them from state to state. This slightly reduces the computation time needed for the algorithm to work well.

The main difference compared to the previously discussed naive version of planning with Monte Carlo estimates is the presence of two policies: the tree policy and the rollout policy. The latter is already familiar to us, and in practice it should be as good and as fast as possible. The former, however, is not yet familiar to us. The tree policy is the policy that actually determines the direction of the search while respecting the uncertainty of the estimates. Let us now discover the main structure of Monte Carlo tree search algorithms and then discuss the details of the UCT algorithm, the most popular representative of the Monte Carlo tree search family.

Monte Carlo tree search, like the naive algorithm we have discussed previously, uses Monte Carlo rollouts to estimate the action values. However, unlike the previously discussed alternatives, Monte Carlo tree search algorithms additionally build a tree. This tree consists of the initial segments (states and actions) of trajectories that have obtained higher returns previously. In some sense, these beginnings of trajectories are the most interesting to us because they are the most likely to lead us to the best possible trajectory. We will discuss how to build such a tree in a while. For now, let's discuss the main scheme of a Monte Carlo tree search algorithm.

Please look at the diagram on the slide. On this diagram, empty circles represent states, and filled circles correspond to actions. A typical Monte Carlo tree search algorithm can be divided into four main phases: selection, expansion, simulation, and backup. To produce meaningful decisions, these phases should be repeated as many times as possible; when the time runs out, the algorithm is stopped and the best action for the current state, that is, the root of the tree, is selected. Together, these phases describe how a single rollout is performed.
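This repeat-until-timeout scheme can be sketched as an anytime loop. This is only an illustrative skeleton, not a standard API: the four phase functions and `best_action` are hypothetical placeholders supplied by the caller.

```python
import time

def plan_action(root, select, expand, simulate, backup, best_action,
                time_budget=0.05):
    """Anytime MCTS loop: repeat the four phases until time runs out,
    then commit to the best root action found so far.

    All phase implementations are passed in as functions (names illustrative).
    """
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:
        leaf, path = select(root)      # 1. selection via the tree policy
        leaf = expand(leaf)            # 2. expansion: grow the tree
        g = simulate(leaf)             # 3. simulation: rollout to episode end
        backup(path, g)                # 4. backup: store the return
    return best_action(root)           # decision for the root state
```

Because the loop only checks the clock between iterations, it can be stopped at essentially any moment, which is what makes the algorithm "anytime".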

First, during the selection phase, the rollout starts from the root of the tree, that is, the current state, and descends the tree, selecting actions according to the tree policy. Once such a rollout reaches a leaf of the tree, the expansion phase is launched. During the expansion phase, a new node or nodes directly adjacent to the current leaf node are added to the tree. For example, a new node may be the new state an agent finds itself in after making an action in the leaf state. More typically, however, the new state is added to the tree together with all actions available in that state. Then, when the simulation exits the tree, the rollout policy takes control. All subsequent actions, until the very end of the episode, are made according to the rollout policy, which may be completely random, for example. When the episode is over, the return of the rollout becomes available. This return should be propagated back to each of the state-action pairs that were visited by the current rollout. This backpropagation is performed by simply storing, in each such state-action node, the cumulative discounted return from that node until the end of the episode.
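The backup phase can be sketched as follows, assuming each visited state-action node keeps a visit count `n` and a running sum of returns `w` (so the value estimate is `w / n`). All names here are illustrative, not from the lecture.

```python
def backup(path, rewards, tail_return, gamma=0.99):
    """Propagate the rollout's return to every visited (s, a) node.

    path[i] is the i-th state-action node of the rollout, rewards[i] is the
    reward observed right after it, and tail_return is the discounted return
    collected by the rollout policy after leaving the tree.
    """
    g = tail_return
    for node, r in zip(reversed(path), reversed(rewards)):
        g = r + gamma * g        # discounted return from this node onward
        node["n"] += 1           # one more visit through this node
        node["w"] += g           # running sum of returns, so Q = w / n
```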

Well, the last two phases should already be familiar to you, but several things are still unclear. First and most importantly, what does the tree policy look like? Second, what are the strategies for choosing the action once the time is up? Because Monte Carlo tree search is not a single algorithm but a family of algorithms, there are plenty of different choices of the tree policy. However, we are going to cover only one choice, mostly because of its effectiveness and popularity: Upper Confidence Bounds for Trees, abbreviated as UCT.

What the tree policy should do is balance between exploitation, that is, concentrating on the directions of the search that seem most profitable, and exploration, which is needed mostly because of the uncertainty in the current estimates and because of the large search space. Indeed, there are multiple sources of uncertainty in our estimates, such as stochasticity in the environment, the finite-sample estimate of the return, and the random action choices of the rollout policy. To balance effectively between exploration and exploitation, there are a lot of approaches, but the simplest one is to treat each action selection as an independent multi-armed bandit problem.

This problem can be solved with many techniques, for example with the upper confidence bound algorithm known as UCB. The application of the UCB algorithm as a tree policy is in fact what is called the UCT algorithm. So, the tree policy in the UCT algorithm is defined as the argmax over all actions of an expression with two terms. The first term is an approximate action-value function, defined as the average Monte Carlo return gained by the simulations after making action a in state s. This first term promotes exploitation because it favors actions which were previously shown to lead to large returns. The second term is the one that encourages exploration.

Let's see why. N(s) in the numerator is the total number of simulations that have previously visited state s. N(s, a) in the denominator is the total number of simulations that have made action a in state s. When action a is made in state s, the denominator increases, and it increases faster than the numerator because of the logarithm. That effectively decreases the contribution of the whole exploration term for this action. On the other hand, incrementing N(s, a) after making action a in state s effectively increases the exploration values of all other actions in that state. That is because N(s) increases every time an agent finds itself in state s, but N(s, a) increases only for the action that was committed. This exploration-exploitation balance not only has good theoretical properties, but it is also very simple to implement and has proven very effective in practice.

Please note that the constant C in front of the second term can be used to increase or decrease the overall amount of exploration made by the tree policy. Also note that the exploration term ensures that each action is chosen at least once: whenever N(s, a) is zero, the second term is infinite.
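Putting the two terms together, the UCT tree policy can be sketched like this. This is a minimal illustration; the function names and the value 1.4 of the exploration constant C are my own choices, not part of the lecture.

```python
import math

def uct_score(q_sa, n_s, n_sa, c=1.4):
    """Exploitation term Q(s, a) plus exploration bonus c * sqrt(ln N(s) / N(s, a))."""
    if n_sa == 0:
        return math.inf          # untried actions are chosen first
    return q_sa + c * math.sqrt(math.log(n_s) / n_sa)

def tree_policy(q_values, visit_counts, c=1.4):
    """Argmax over actions of the two-term UCT expression."""
    n_s = sum(visit_counts)      # N(s): total simulations through state s
    scores = [uct_score(q, n_s, n, c)
              for q, n in zip(q_values, visit_counts)]
    return scores.index(max(scores))
```

Note how an unvisited action gets an infinite score, which is exactly what guarantees that each action is tried at least once.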

The last but certainly not least topic we need to discuss about Monte Carlo tree search is how to actually select an action when planning is finished or interrupted. Remember, an agent stands at the root of the tree built by Monte Carlo tree search and needs to choose an action from that state. In fact, there are many possible strategies for selecting such an action while making use of the planning results. The most straightforward one is to select the action with the biggest estimate of the action-value function.

Despite being simple and usually effective, this may not be the best strategy. One case where it fails is the case of very rare but also very large returns, that is, outlier returns. If such returns are possible in your environment, you may benefit more from the robust strategy of selecting the most visited action, that is, the one with the greatest N(s, a). You may also want to continue planning until the first two strategies select the same action. This approach is called the max-robust strategy and was shown to be particularly effective for the game of Go.

Another strategy is called the secure strategy. It is about choosing the action that maximizes the lower confidence bound, that is, the same expression as the tree policy but with the plus sign changed to a minus in front of the second term. That may be the way to go if you are solving a real-life task with some expensive robot as an agent. In that case, you may want to sacrifice some amount of reward in favor of safe trajectories, that is, the ones that will definitely not damage an expensive robot.
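The selection strategies above, except max-robust, which simply repeats planning until the max and robust choices agree, can be sketched as follows (all names illustrative):

```python
import math

def final_action(q, n, strategy="max", c=1.4):
    """Pick the root action once planning stops.

    q[a] is the action-value estimate of root action a,
    n[a] is its visit count N(s, a).
    """
    actions = range(len(q))
    if strategy == "max":        # biggest action-value estimate
        return max(actions, key=lambda a: q[a])
    if strategy == "robust":     # most visited action
        return max(actions, key=lambda a: n[a])
    if strategy == "secure":     # biggest lower confidence bound
        n_s = sum(n)
        return max(actions,
                   key=lambda a: q[a] - c * math.sqrt(math.log(n_s) / n[a]))
    raise ValueError(f"unknown strategy: {strategy}")
```

On the outlier example from above: an action with a huge but rarely observed return wins under "max", while "robust" and "secure" prefer the well-tried alternative.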

Well, now let me conclude with the properties of Monte Carlo tree search. Monte Carlo tree search is domain independent, that is, it does not require any hand-designed heuristics. It performs an asymmetric search, which means that it allows us to focus computation on the most promising directions of the search. Monte Carlo tree search is also what is called an anytime algorithm. That means you can stop the algorithm at any moment and it will output the best decision it has found so far. Another benefit of Monte Carlo tree search is that it saves a lot of computation time, because it preserves all the estimates of the subtree whose root the agent finds itself in after committing an action. It is also incredibly simple and easy to implement. However, its performance depends on the quality of the rollout policy, and Monte Carlo tree search may be very computationally intensive. One last thing: Monte Carlo tree search is a great algorithm which is very important to know about.
