0:02

The one we're going to study now is the fact that the first option is unknown.

There is this new particle, and no one knows anything about it.

Well, the second one, it turns out, is something most people have tried at some time in their childhood years.

And even if they don't remember it directly, they may have read about it or seen it happen to someone else.

They probably know the outcome of the second research opportunity well enough not to try it at home anytime soon.

So, we want to prioritize actions that are uncertain, whose outcomes are not yet that well known to us.

And to actually implement this in any practical algorithm, we won't just require the Q-values themselves.

We'll want a probability distribution over the Q-values, in the Bayesian sense.

So basically our belief, expressed as a probability, that the Q-value is going to turn out to be this or that number.

In the plot at the bottom of the slide, you can see three probability distributions for the actions in one particular state; each of them represents our belief about that particular action's Q-value in that state.

Again, for the third time, we emphasize that this is a Bayesian probability.

It means that the variance, the breadth of this curve, doesn't represent the randomness in the action's outcome itself, but only our belief.

It means that if the green action is actually deterministic, but we have never tried it yet so we have no idea, or we've only tried it a few times, then its distribution is going to be quite broad anyway.

The orange action, on the other hand, can actually be very noisy.

Its return can vary widely, say plus or minus 10, while we are dead sure that its expected return is going to be, what is it, 0.5 or whatever, 0.6.
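This distinction, noisy returns but a confident belief about the expected return, is easy to see numerically. A minimal sketch with made-up numbers (neither the mean nor the noise level is from the slide): the individual returns are spread over roughly plus or minus 10, yet after many tries the belief about the expected return becomes narrow, since for Gaussian noise the posterior width of the mean shrinks like sigma over the square root of n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy action: returns scattered widely around a mean of 0.6
# (both numbers are assumptions for illustration, not from the slide)
true_mean, noise_std = 0.6, 10.0
returns = rng.normal(true_mean, noise_std, 10_000)

# The individual returns are wildly spread ...
spread = returns.std()

# ... but after 10,000 tries the belief about the *expected* return is
# tight: for Gaussian noise the posterior std shrinks like sigma/sqrt(n)
belief_std = noise_std / np.sqrt(len(returns))
print(spread, returns.mean(), belief_std)
```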

So it's our belief, expressed as a probability distribution, the same way you did in the Bayesian methods course earlier on.

And our challenge here is to pick an action not just given the action values, but given the beliefs we have about them.

So now let's work with this.

We have this one state and three actions, and our beliefs about the Q-values of those actions are represented by those distributions.

So these are our actual beliefs about what the Q-values are going to turn out to be.

Now I want you to tell me: which of those three actions are even eligible?

Which of them does it make sense to pick, regardless of what method we use?

Well, it turns out some are not.

The thing is, the blue action is a poor pick regardless of how we choose, because the orange one dominates it in expectation.

So if we want to exploit, it would always be the orange one, not the blue one; it's just better.

And if we want to explore, the blue action is dominated by the green one.

So, the green one has some chance of being better than the orange one, if we believe these belief distributions, sorry for the tautology.

But the blue one won't ever be the better option.

So, pick either the green one or the orange one, depending on how you prefer to trade off exploration and exploitation.
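This "some chance of being better" can actually be checked numerically. A minimal sketch, assuming Gaussian beliefs with made-up means and widths that roughly match the description (orange narrow and good, green broad, blue narrow and mediocre):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative Gaussian beliefs; the exact means and widths are
# assumptions for this sketch, not numbers from the slide.
orange = rng.normal(0.6, 0.1, n)   # confident, good expected return
green  = rng.normal(0.0, 0.5, n)   # barely tried, so very broad
blue   = rng.normal(0.0, 0.1, n)   # well known and mediocre

# Probability that each action's Q-value turns out to be the largest,
# under our current beliefs
p_green_best = np.mean((green > orange) & (green > blue))
p_blue_best  = np.mean((blue > orange) & (blue > green))
print(p_green_best, p_blue_best)
```

Under these assumed beliefs, the green action keeps a non-negligible chance of being best, while the blue one essentially never is.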

So now let's get to some algorithms that actually decide on the probability of picking the green action versus the orange one, and hopefully never pick the blue one.

Let's begin with Thompson sampling.

This algorithm is actually more general, but we are only going to study its simpler form for now.

Thompson sampling suggests that you take one sample from each of those distributions: if they are normal distributions, you can just sample from them, and if they're empirical distributions, histograms, you can just take a numpy random sample or whatever.

You'll then get three points, three sampled Q-values, one for each action.

Thompson sampling wants you to pick the action whose sampled Q-value is the largest.
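The sampling step above can be sketched in a few lines, assuming for simplicity that the beliefs are Gaussians with illustrative means and widths (the numbers are assumptions, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian beliefs over each action's Q-value,
# in the order orange, green, blue (illustrative numbers)
q_means = np.array([0.6, 0.0, 0.0])
q_stds  = np.array([0.1, 0.5, 0.1])

def thompson_pick(means, stds, rng=rng):
    """Draw one sample from each belief, then act greedily on the samples."""
    sampled_q = rng.normal(means, stds)
    return int(np.argmax(sampled_q))

action = thompson_pick(q_means, q_stds)
```

Note that the randomness of the pick comes entirely from the beliefs: a confidently-best action is picked often, a broad one occasionally, a dominated one almost never.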

The question to you is: if you look at those distributions, which action is going to be picked the most on average, which the least, and which in between?

What are the probabilities of taking each action?

Of course, there's more than one possible way to interpret those histograms, but in general, you can more or less say that the blue one is going to be picked with probability near zero, because regardless of what you sample from it, with probability close to one the sample from the orange one will be larger.

So, it no longer needs to be explored.

A sample from the orange one lies between, say, roughly 0.3 and 0.8 or 0.9, while a sample from the green one can be anywhere between minus whatever and plus one and something.

This actually means that at some points you will pick the green one, as it is better for you to explore it, and at other points you'll pick the orange one.

Of course, you can also sample with some temperature, using importance sampling to skew this towards more exploration or more exploitation.

Flatten everything, and you're going to explore more.

Or you can sample proportionally to the square of each pick probability, and you'll exploit more.
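One way to sketch this skewing, under the same kind of illustrative Gaussian beliefs: estimate the Thompson pick probabilities by simulation, then raise them to a power 1/tau before sampling an action. Here tau is a hypothetical temperature knob, not something from the slide; tau > 1 flattens the pick distribution, and tau = 0.5 squares it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative beliefs for two eligible actions: orange and green
means = np.array([0.6, 0.0])
stds  = np.array([0.1, 0.5])

def thompson_pick_probs(means, stds, n=50_000, rng=rng):
    """Estimate each action's Thompson pick probability by simulation."""
    samples = rng.normal(means[:, None], stds[:, None], (len(means), n))
    wins = np.bincount(samples.argmax(axis=0), minlength=len(means))
    return wins / n

def skewed_pick(means, stds, tau=1.0, rng=rng):
    """tau > 1 flattens the pick distribution (more exploration);
    tau < 1 sharpens it, e.g. tau = 0.5 squares it (more exploitation)."""
    p = thompson_pick_probs(means, stds, rng=rng) ** (1.0 / tau)
    p = p / p.sum()
    return int(rng.choice(len(means), p=p))
```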
