We've discussed a few methods for balancing exploration and exploitation. Because we estimate our action values from sampled rewards, there is inherent uncertainty in the accuracy of those estimates. We explore to reduce this uncertainty so that we can make better decisions in the future. In this video, we will discuss another method for selecting actions to balance exploration and exploitation, called UCB. By the end of this video, you will understand how upper-confidence bound action selection uses uncertainty in the estimates to drive exploration.

Recall the Epsilon-greedy action selection method that we discussed previously. This method takes exploratory actions an epsilon fraction of the time, and those exploratory actions are selected uniformly at random. Can we do better? If we had a notion of uncertainty in our value estimates, we could select actions in a more intelligent way.

What does it mean to have uncertainty in the estimates? Q of a here represents our current estimate for action a. These brackets represent a confidence interval around q star of a. They say we are confident that the true value of action a lies somewhere in this region; for instance, we believe it may be here, or here. The left bracket is called the lower bound, the right is the upper bound, and the region in between is the confidence interval, which represents our uncertainty. If this region is very small, we are very certain that the value of action a is near our estimated value. If the region is large, we are uncertain that the value of action a is near our estimated value.

In UCB, we follow the principle of optimism in the face of uncertainty. This simply means that if we are uncertain about something, we should optimistically assume that it is good. For instance, say we have three actions with associated uncertainties. Our agent has no idea which is best, so it optimistically picks the action that has the highest upper bound. This makes sense because either that action really does have the highest value and we get good reward, or by taking it we get to learn about the action we know least about, as in the example on the slide. Let's let the algorithm pick one more action. This time Q2 has the highest upper-confidence bound because its estimated value is highest, even though its interval is small.

We can use upper-confidence bounds to select actions with the following formula: we select the action that maximizes our estimated value plus the upper-confidence bound exploration term,

$$A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right].$$

The upper-bound term can be broken into three parts: the constant c, the natural log of the time step t, and the count N_t(a) of how many times action a has been selected so far. The parameter c is a user-specified parameter that controls the amount of exploration. We can clearly see here how UCB combines exploration and exploitation: the first term in the sum represents the exploitation part, and the second term represents the exploration part.

Let's look at a couple of examples of the exploration term. Say we've taken 10,000 steps so far, and imagine we've selected action a 5,000 times. The uncertainty term will be the square root of the natural log of 10,000 divided by 5,000, which is about 0.043, times the constant c. If instead we had selected action a only 100 times, the uncertainty term would be about 0.30, roughly seven times larger.

Let's investigate the performance of upper-confidence bound action selection using the 10-armed testbed. We will use the same setup as before. The q star values for these actions are normally distributed with mean zero and standard deviation one, and the rewards are sampled from a unit-variance normal distribution with mean q star. As before, we will average over 2,000 independent runs.
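Before looking at the results, here is a minimal Python sketch of the action-selection rule above. The function name ucb_action and its arguments are my own for illustration, not from the course; untried actions are treated as maximally uncertain and selected first, following the convention in the textbook.

```python
import numpy as np

def ucb_action(q_estimates, action_counts, t, c):
    """Return the index of the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    q_estimates   -- array of current action-value estimates, Q_t(a)
    action_counts -- array of how many times each action has been taken, N_t(a)
    t             -- current time step (starting at 1)
    c             -- user-specified exploration parameter
    """
    # An action that has never been taken has an infinitely wide
    # confidence interval, so it is selected before any tried action.
    untried = np.flatnonzero(action_counts == 0)
    if untried.size > 0:
        return int(untried[0])
    ucb = q_estimates + c * np.sqrt(np.log(t) / action_counts)
    return int(np.argmax(ucb))
```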
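A sketch of the full testbed experiment might then look as follows. The 1,000-step run length is my assumption (the video does not state it), and the parameter values c = 2 and epsilon = 0.1 match the comparison discussed next.

```python
def run_bandit(n_steps=1000, k=10, c=None, epsilon=None, rng=None):
    """One run on the 10-armed testbed with sample-average estimates.

    Uses UCB when c is given; otherwise Epsilon-greedy with the given epsilon.
    """
    rng = rng if rng is not None else np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, k)      # true action values, mean 0, std 1
    q_est = np.zeros(k)                   # sample-average estimates Q_t(a)
    counts = np.zeros(k)                  # action counts N_t(a)
    rewards = np.zeros(n_steps)
    for t in range(1, n_steps + 1):
        if c is not None:
            a = ucb_action(q_est, counts, t, c)
        elif rng.random() < epsilon:
            a = int(rng.integers(k))      # uniform exploratory action
        else:
            a = int(np.argmax(q_est))     # greedy action
        r = rng.normal(q_star[a], 1.0)    # unit-variance reward around q*(a)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
        rewards[t - 1] = r
    return rewards

# Average reward at each step over 2,000 independent runs.
rng = np.random.default_rng(0)
ucb_curve = np.mean([run_bandit(c=2.0, rng=rng) for _ in range(2000)], axis=0)
eps_curve = np.mean([run_bandit(epsilon=0.1, rng=rng) for _ in range(2000)], axis=0)
```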
To compare UCB to Epsilon-greedy on the 10-armed testbed problem, we set c equal to 2 for UCB and epsilon equal to 0.1 for Epsilon-greedy. Here we see that UCB obtains greater reward on average than Epsilon-greedy after about 100 time steps. Initially, UCB explores more to systematically reduce uncertainty. UCB's exploration reduces over time, whereas Epsilon-greedy continues to take a random action 10 percent of the time. In this video, we discussed upper-confidence bound action selection, which uses uncertainty in the value estimates to balance exploration and exploitation.
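For reference, a figure like the one described above can be reproduced by plotting the two curves from the earlier sketch, assuming matplotlib is available; it should show UCB's average reward overtaking Epsilon-greedy's after roughly a hundred steps.

```python
import matplotlib.pyplot as plt

plt.plot(ucb_curve, label="UCB, c = 2")
plt.plot(eps_curve, label="Epsilon-greedy, epsilon = 0.1")
plt.xlabel("Steps")
plt.ylabel("Average reward")
plt.legend()
plt.show()
```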