Now, if you tackle the problem of exploration versus expectation from the perspective of Bayesian methods, you probably will find yourself doing the exact same thing you've been taught at the respective course for [inaudible]. Mainly you would start from some prior distribution. In this case, distribution on the rewards in bounded setting or the Q function in multi-step and the P setting. Then you'll learn a posterior distribution of the same variable after observing the data. Namely, the actions you've taken in states and the rewards you've obtained for those actions. After you've learned this distribution, you can then simply take the percentile of this distribution, say 95th or 99th, and sort actions by this percentile to pick whichever is most interesting to you. The rest comes naturally after you've done the update properly. The only blank spot here is, how do we actually learn those distributions? How do we learn this P of Q given data? And there are two main families of methods that can do that. One approach here, is the so-called parametric Bayesian inference. This basically means that you have to select some kind of known probability distribution family to use as the P of Q given data. One popular, if slightly two popular choices for this role, is the normal distribution family, which is franchised by, in this case, Mu and Sigma squared. First being the expected value and the other being the variance. Once you've learned the Mu and Sigma squared for the P(Q) with Q given data for each possible action in each possible state or the P of Q given data for each action bounded setting, you can then simply plug them into an easily global formula for the percentile, which is the net pure in reading section, to actually get the percentile then execute on your UCB policy. Another way you could do the known UCB exploration is to get those Mu and Sigmas and sample from them using normal distribution sample, a standard function most math packages to get samples and perform the forms and sampling strategy. This approach is quite powerful but has its limitations. The main limitation of Parametric Bayesian Inference is that, it relies wholly on your choice of the distribution. For example, if you choose the normal distribution, it means that your distribution is, by definition, unimodels. So there's one peak and the further you go from this peak, the lower the probability is. And the other problem is that, the probability decays exponentially, the exponent of the square lower from the distance. And this might be useful in some cases but in other cases, it might be detrimental. It might even cause you to overestimate some actions like it did before or underestimate and never learn the optimal policy. But if I was to try more complicated family, like a mixture of normal distributions or well, anything else that suits your particular problem maybe a Gamma distribution, maybe a Landau distribution, or any other well, less popular choices. But this is still limited because if you don't have some expert knowledge of how they would look like, then you have to simply, well, leave what's currently there. Now, alternative approach, which is slightly more advanced and a whole lot more complicated, is non-parametric Bayesian Inference. In this case, we try to explain it on example of Bayesian neural networks. Again, this is a rather complicated thing to explain in the scope of our course. But if I had to squeeze it into a few sentences, Bayesian neural networks would be a mechanism by which you learn a natural model that predicts not just one value at its output, but a distribution. It does so by having not just one weight at each connection, but distribution and it's weight. Now, think about it, the regular neural networks have the neurons which are connected with those weights, and each weight is just basically a number anywhere between minus infinity to plus infinity. And a Bayesian neural network, and if expanded by saying that each neuron is not just one number but some distributions, say a normal distribution, and to infer from this network, you have to sample those numbers from the distribution. You have to sample weights each time and, of course, each time you will have slightly different well, probabilities of Q at the end. Now, this approach is slightly more powerful than the first one because it doesn't just take itself to a particular formula for normal distribution. Namely, it doesn't have to be unimodal, or exponentially decaying as normal distribution, but it can if so are the demands of this particular task. Or condone some kind of trimodal, or bimodal thing whether flat plate or at the middle which might make sense in some cases. Again, this is much more powerful but it comes at a price, and in this case the prices that you don't get the entire probability based from just one round of your method. Namely if you have a Bayesian neural network, then if you sample all those weights given, so if you feed an input to this sample set of weights, you only going to get one sample, and if you sample it again,you'll get another outcome. So to get the whole picture, you have to repeat the sampling procedure many times until you get enough data to [inaudible] the percentiles. Of course, if you sample say a 1,000 times, you can more or less reliably find the 90th percentile by just taking sample number 900 ranked from the lowest to the highest one, and this is how you can compute empirical percentiles in basic statistics. And of course, chaining those networks is another challenge in itself which takes a lot of well, another set of prior distributions there, only with those parameters and [inaudible] tricks and all these stuff you've been learning throughout the Bayesian networks course. So we are not including them into the compulsory part of our course. So you can just read about them in more detail on the bonus section. Okay so, the first approach is a rather simple, the second approach is slightly less simple, but potentially more powerful and has of course other approaches which are also non-parametric but doesn't have Bayesian neural networks in them. So if you chose the Bayesian UCB over the frequency one, the one based on inequality. It means you've just traded a few more added benefits for a few more drawbacks that usually goes with any applied science. On the plus side of the Bayesian methods, get that they no longer depend on the inequality as the [inaudible] ones do. This means that you no longer run at risk of having to explore actions past the point that it become irrelevant, just because you're inequality is slightly more optimistic than it should have been. Instead you have a particular probability distribution with an exact percentile that can measure that which makes sense given your exact choice of prior. And here we kind of stumble in the first major drawback, the Bayesian Methods can be made or broken with a choice of prior. And if you choose prior to, well loosely it means that you're going to over explore, and it means that you are not certain of anything, and means that you might have to spend more time exploring at the very beginning. Alternatively, if you choose a more, well, steep prior the one that looks closer to a data function, it means that your inference is going to be a bit more green than she'd have been. And in the worst case it might not even learn the optimal policy. Another benefit is that, if you pick a more complicated probability distribution family, say the one from Bayesian neural network, or even a mixture of normal distribution, you can engineer a particular formulation of the distribution of this well [inaudible] previously of this, the probability of a reward given a particular action, and you can tailor this for your particular problem given all the expert knowledge you have about it. This is slightly complicated but you've probably been taught a few tricks and how to do this in a setting of applied Bayesian methods. Again previously not specialization. So here it goes. You have a more accurate estimate which wholly depends on the prior and can turn for a task. This is how Bayesian methods compared with the frequency ones. So again, here's some motivational image that shows that distributions should not be unimodal and so on and so forth. This was learned with Bayesian neural networks. So if you compare all the methods we just learned, the thumbs of sampling, the UCB, and the Bayesian UCB on a problem of information retrieval, the one where you are band it tries to select lieges for a user to satisfy his query, and you'll see that they compare one way or another. This is they regret [inaudible] over time. And you can see that basically, depending on a choice of prior and depending on a particular inequality, those methods can have different performance. Of course that doesn't mean that you have to just pick the one which loops the highest here, but it means instead that the performance methods depends on the particular problem and you better try all of them, maybe have some library that gets all that implemented before picking a particular one. Of course, if you have more domain knowledge than Bayesian method it's going to be slightly more efficient as a rule of thumb than the frequent ones, or if you have almost no domain knowledge, and you're just about to figure out priors from top of your hats then that means that you're likely going to shoot yourself in the foot by under exploring or over exploring with a wrong prior. This concludes a section of our course where we learn about the advanced exploration methods. And by now we have learned a few, or quite a lot actually of the details of how the exploration methods are defined and how you can compare them against one another. The first known thing with the notion of regret, or how much rewards or money would your agent lose if it's started exporting from random compared to the agent that had the optimal policy from the get go. And this is basically a universal measure of how you can compare one choice of exploration versus the other. We learned about the future regret, which is regret that is measured from say now till 10 days into the future and might make sense in some finite time horizon processes, and the notion of infinite regret, which basically sums the difference between your agent and the optional agent from now until the moments when it reaches the optimal policy if at all. And the infinite regret can grow algorithmically or it can grow linearly or it can grow otherwise depending on the particular exploration policy. We've also measured a few of extended exploration strategies like Epsilon Greek exploration, [inaudible] dis-regret. We figured out that unless you pick their parameters in a very particular way, you can end up with something that never converges an optimal policy, and has a linearly growing regret, which is terrible, the worst thing you can get actually. Now the other kind of new batch of exploration methods we've learned are the ones bassed on uncertainty. Those that are represented with two kinds of UCB. The frequent is one based on [inaudible] equation of inequality and the Bayesian one we just covered. As it was the Thompson sampling which doesn't make the percentile because they're just samples from the distributions. Now those exploration methods only allow you to improve and trade on your agents exploration in your contextual [inaudible] setting, like a line Ads or recommendation systems, or in information retrieval. But we're not going to get it a little further, were going to extend those exploration methods to a model-based multistep setting. In other words, we're going to find out how we can employ all those UCB like exploration methods based on uncertainty, in order to improve how you plan in an environment where you know the probability of next stage given state an action. This is going to come right now at the next section by [inaudible] which I recommend you to switch right now. Farewell.