[MUSIC] There are, as it turns out, multiple ways to do so. The first one is specific to this UCB. First, in general, you don't need the whole distribution. You only need to know its percentile. And you might have to compromise a little, in that you don't know the exact value; you only know it from some inequality. It turns out that in statistics there is more than one way to compute that. There are concentration inequalities, like the Hoeffding inequality, also known as a Chernoff-type bound, and maybe five more of them, that all bound the probability of some particular value being more than some particular threshold. The catch is that this bound holds regardless of the distribution, or, as in Hoeffding's case, regardless of the distribution's shape, as long as the possible values of this action's q-value are between 0 and 1. In this case, the formula on the slide is a simplified version of the Hoeffding inequality; you can easily find the full version if you just look it up on Wikipedia. But the essence is that the probability of your action value, the q-value, being larger than the average q-value you've measured so far by more than t, which is some value, say plus 5, decays as the exponent of -2 times n, the number of times you've measured this q-value, times t squared. So if your average q-value is 10 and t is equal to 5, the probability of your q-value being larger than 15 is the exponent of -2 times n times 25, where n might be some 1,000. That is a really, really small number, but it is nonzero nevertheless. What you want to do with this inequality is solve it not for the probability, but for t, for this offset. You want to find such an offset, such a difference between the average you've measured and the true action value, the q-value, that the probability of being above this offset is less than or equal to 5%.
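The bound above and its inversion for t can be sketched in a few lines. This is a minimal sketch of the simplified Hoeffding bound described in the lecture, using the lecture's illustrative numbers; the function names are my own.

```python
import math

def hoeffding_bound(n, t):
    """Simplified Hoeffding bound from the lecture (values assumed in [0, 1]):
    P(Q > Q_hat + t) <= exp(-2 * n * t**2), after n samples of this q-value."""
    return math.exp(-2.0 * n * t ** 2)

def offset_for_confidence(n, delta):
    """Solve exp(-2 * n * t**2) = delta for t: the offset such that the true
    value exceeds the measured average by more than t with probability <= delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# A modest example: after 1,000 samples, the chance of being off by more
# than 0.1 is at most exp(-20), i.e. about 2e-9.
print(hoeffding_bound(1000, 0.1))

# The lecture's numbers (t = 5, n = 1000) give exp(-50000): mathematically
# nonzero, though so small it underflows to 0.0 in double precision.
print(hoeffding_bound(1000, 5))

# Inverting for t at the 5% level, as described in the text:
print(offset_for_confidence(1000, 0.05))
```

Note that the lecture's t = 5 example is for intuition only; strictly, Hoeffding's assumption requires the q-values to lie in [0, 1], so such large offsets would not arise in the bounded setting.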
Or, taking the complementary probability, the chance of being within this offset is at least 95%. This gives you some estimate of the percentile you wanted, albeit, since it comes from an inequality, a possibly less efficient one. If you solve this thing for t, plugging in some probability that you want, say you want to be 99% sure that you are within this range, then your solution is going to look something like this. Of course, there can be a coefficient under the square root which depends on the probability, the certainty you want to get. If you want to be 90% sure, it would be less than 1 here, and if you want to be 99.99999% sure, it would be maybe 10 or 100 times more. But the general idea looks like this. You take the number of times you've picked this action in a state, and the number of times you've ever been to the state and picked any action, regardless of which one. And then you plug them into this square root. So you take the logarithm of the number of times you've been to the state, divide it by the number of times you've picked this action in the state, multiply by two, take the square root, all that stuff. And this is how far you can get from the expected value. So the final priority of an action is the sum of its expected value and the addition from the UCB, the upper confidence bound. Now, let's see how this formula works. Let's plug in some numbers. For starters, let's say there is a new action that we have never picked, and I want you to tell me how interested we are in this action. What are the odds that we're going to pick it? Yes, we're actually really interested. Unless there is some other action we have also picked zero times, the action which has not yet been picked will be the top priority. Because if n_a, the number of times we've picked this action, is equal to 0, then the priority of this v-hat gets set to infinity. And it means that we are dead sure to pick this action.
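The priority rule just described, the average q-value plus the square-root exploration bonus, can be sketched as follows. This is a minimal sketch under the lecture's formula; the function name is my own.

```python
import math

def ucb1_priority(q_hat, n_a, n_total):
    """UCB-1 priority: the measured average q-value plus the exploration
    bonus sqrt(2 * ln(N) / n_a), where N is the number of visits to the
    state and n_a the number of times this action was picked there."""
    if n_a == 0:
        # A never-picked action gets infinite priority,
        # so it is always tried at least once.
        return float("inf")
    return q_hat + math.sqrt(2.0 * math.log(n_total) / n_a)

# A brand-new action outranks even a well-measured good one:
print(ucb1_priority(0.0, 0, 100))    # inf
print(ucb1_priority(0.9, 50, 100))   # 0.9 plus a bonus of about 0.43
```

The `if n_a == 0` branch is exactly the behavior discussed above: division by zero in the bonus means an untried action is the top choice.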
It might be optimal, so we might just as well try it now. Again, this is kind of reasonable: we want to try all the actions at least once, and then figure out which of them we want to continue trying. But let's see what happens as we move to larger values of n_a and N. Let's say we have one state and three actions. We've been to this state, say, three million times, and we've picked each action exactly one million times. So N equals 3 million, and n_a equals 1 million. What's the v-hat going to be here? Well, yes, it turns out the v-hat is going to be very close to the expectation, and it's going to get even closer as N and n_a increase towards infinity. In the limit, it will converge exactly to the expected value, because the logarithm of some value, divided by that value itself or by something proportional to it, converges to zero. Okay, so this is how exploration behaves here. And again, it makes sense, because after you've explored everything countless times, the exploration bonus doesn't really matter anymore. You know everything perfectly, so you can just as well exploit. This is one approach to computing the upper bound. Practically, it means that you have to store, for each state and each action, how many times you've picked this action, and how many times you've been to this state. That's roughly three times as many parameters compared to standard q-learning, which is kind of bearable in a tabular case. If you're not using a table, if you're using some kind of approximation, you'll also have to approximate the visit counts of the state and of the action being picked in the state with some parametric functions. But that's going to be covered in the reading section in much more detail.
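The shrinking of the exploration bonus can be checked numerically with the lecture's own counts. This is a small illustrative sketch; the function name is my own.

```python
import math

def exploration_bonus(n_a, n_total):
    """UCB-1 exploration bonus sqrt(2 * ln(N) / n_a). It shrinks toward
    zero as the counts grow, since ln(N) grows far slower than n_a."""
    return math.sqrt(2.0 * math.log(n_total) / n_a)

# The lecture's example: one state visited 3 million times,
# each of 3 actions picked exactly 1 million times.
print(exploration_bonus(1_000_000, 3_000_000))   # roughly 0.0055

# With 10x more experience the bonus shrinks further,
# so the priority converges to the plain average q-value:
print(exploration_bonus(10_000_000, 30_000_000))
```

With q-values on the order of 1, a bonus of about 0.005 is negligible, which is the point made above: once everything has been tried many times, the agent effectively just exploits.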
So this is, of course, just one thing we've derived from the Hoeffding, or Chernoff-type, inequality, but you can find your own modified formula which would work for any other inequality of your preference, and it might be better. The catch here is that these are all inequalities, so they are upper bounds on how interested, okay, sorry, lower bounds on how confident we should be. So if UCB-1 says that you should explore an action, it might be that you really should, but it might also be that it simply overestimates the value of this action, because it doesn't know anything about the distribution beforehand. And again, for this particular inequality, we assume that the q-values are bounded between 0 and 1. [MUSIC]