0:02

For now, we shall stick to the banner ads example. It's the first one and arguably the simplest to understand. One more nice feature of it is that it's one of those rare problems where you get to compute the gradient of your money, the expected revenue you're going to earn over the next month, with respect to the weights of a neural network, and then optimize by gradient ascent, literally following the gradient of your money. How cool is that? So, okay: banner ads.

From now on, let's imagine that you're not a data scientist but an executive officer, say a CEO. You're the CEO of a startup that hosts a site which gets most of its revenue from banner ads. You have a huge daily active user count by now, and what you want to do is squeeze as much dirty money from those users as you can. So, what you're going to do as an executive officer is hire, or pay, some data scientists to do that job for you. And it turns out that two candidates are willing to do the job.

The first one says, "Hello, CEO, I've just invented a super-mega method. It uses deep learning, blockchain, all those fancy words, to learn an optimal policy of banner placement, and it solves the bandit problem." Then comes the second one, who says, "Wow, this thing is total bullshit. Instead, you should use this method of mine, which is much better. It uses, whatever, another fancy word, one buzzword, two buzzwords, three, four." And it solves the bandit problem as well, although in a different way.

So you have those two black-box methods, and being an executive officer, you basically don't know anything about what's inside the black boxes. Instead, you have to somehow measure the efficiency, the business value, of those methods. So what are you going to do? How do you pick one of them, method A or method B? Right, the obvious thing to do is to measure money: give each of those methods, say, five or ten percent of your users, while the rest are served by your classic advertisement method, and see what happens to them after, say, a month of advertising.
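As a rough sketch, that traffic split could look like the following. This is a minimal illustration, not a production experiment framework; the five-percent slices and the bucket names are made-up parameters:

```python
import random

def assign_bucket(user_id, share_a=0.05, share_b=0.05, salt="ab-test-1"):
    """Deterministically place a user into method A, method B, or control.

    share_a / share_b are the hypothetical traffic slices for the two
    black-box methods; everyone else keeps seeing the classic ads.
    """
    # Seed a per-user RNG with a string so the assignment is stable:
    # the same user always lands in the same bucket across sessions.
    u = random.Random(f"{salt}:{user_id}").random()
    if u < share_a:
        return "A"
    if u < share_a + share_b:
        return "B"
    return "control"

buckets = [assign_bucket(uid) for uid in range(100_000)]
```

After the month is up, you would compare revenue per user in the A and B buckets against the control group.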

Your task is obviously to earn as much money as you can, but there are a few dilemmas to solve.

For example, imagine that method A gives you a profit of, say, one million dollars per month. It starts paying from day zero and stays there. Method B, meanwhile, starts by giving you almost no profit, but eventually creeps upward and gets better than method A in, say, half a year. Or maybe method A creeps up as well, but not at the same pace. This brings up special cases where your decision might depend on whether you plan to stay in business for the next month, the next year, or forever. In theoretical terms, this brings you to the notion of so-called regret. Regret is basically how much money your algorithm wasted, or, equivalently, how much more money you could have earned if your algorithm had used the optimal policy from the get-go.

So now let's plot this eta, the regret value, as a function of time. Given a point in time, we sum up all the differences, the reward under the optimal policy versus the reward under your policy, from time step zero up to that time step. So, eta of 10 would be the sum over time steps zero, one, two, three, and so on, until you reach step 10; inclusive or exclusive doesn't really change what we're going to see.
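In code, that running sum might be sketched like this. It's a toy illustration; the per-step reward sequences below are made up:

```python
def cumulative_regret(optimal_rewards, policy_rewards):
    """Eta(t): running sum of (reward under the optimal policy minus
    reward under your policy) from step 0 up to step t."""
    eta, total = [], 0.0
    for r_opt, r_pol in zip(optimal_rewards, policy_rewards):
        total += r_opt - r_pol  # non-negative if the optimal policy really is optimal
        eta.append(total)
    return eta

# Toy example: the optimal banner earns 1.0 per step; our policy
# starts worse but matches it from step 2 onward.
eta = cumulative_regret([1.0] * 5, [0.2, 0.5, 1.0, 1.0, 1.0])
# eta grows while the policy is suboptimal, then flattens out.
```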

Since those differences are all non-negative and you're adding them up, the function is going to grow, or at least stay where it is if you've converged. And the curves are going to look something like this.

Let's begin with the blue curve. You can see that its regret, the y-axis value, starts out very good: it's below everything else at iterations like 200, 400, and so on. But eventually it exceeds every other curve and keeps on growing. What happened is, basically, that this strategy, the blue one, failed to converge to the optimal policy.

This actually means that if the final policy is not optimal, then the per-step difference in reward between your policy and the optimal policy stays positive, and it's going to stack up and grow, and grow, and grow over time, until, as the time step t goes to infinity, your regret also goes to infinity. So this is the theoretically bad case, where your policy is not optimal at any given moment in time. Now let's find out what happens if the policy does get to the optimum.

The red curve, the second one, behaves quite differently. It starts out much worse: it accumulates a lot of regret early on because it explores; it has to mix in suboptimal actions to get a better look at how the action space is laid out, at how the rewards are distributed over all those actions. But eventually it converges. In this case, you can see that it either asymptotically approaches a horizontal line or converges to an exactly horizontal line. What this means is that it has finally found the optimal strategy after some point in time, and so the regret is basically a constant, it's fixed, after some number of iterations. This is, theoretically again, the best outcome: there's a guarantee that you're eventually going to get what you want. Of course, if you get there faster, your policy is even better.

Of course, there can also be a middle ground, where your regret starts out not that badly but then keeps growing, though not linearly: its growth speed keeps decreasing toward zero, which means the regret grows logarithmically, while the blue curve's regret grows linearly. Bounds like this can, again, only be derived by looking at toy tasks and at particular properties of your algorithm that can be exploited in a mathematical derivation.
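These regimes are easy to reproduce on a toy task. The sketch below is not any particular published algorithm's analysis, just epsilon-greedy on a made-up two-armed Bernoulli bandit with click probabilities 0.5 and 0.6: a fixed exploration rate keeps paying for exploration forever, so its regret grows linearly, while a decaying rate makes regret growth slow down over time:

```python
import random

def run_bandit(epsilon_at, steps=20_000, p=(0.5, 0.6), seed=0):
    """Expected cumulative regret of epsilon-greedy on a two-armed
    Bernoulli bandit; epsilon_at(t) is the exploration rate at step t."""
    rng = random.Random(seed)
    counts, sums = [0, 0], [0.0, 0.0]
    best = max(p)
    regret, total = [], 0.0
    for t in range(steps):
        if 0 in counts or rng.random() < epsilon_at(t):
            arm = rng.randrange(2)  # explore: pull a random arm
        else:
            # exploit: pull the arm with the best empirical mean so far
            arm = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        reward = 1.0 if rng.random() < p[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += best - p[arm]  # expected (pseudo-)regret of this pull
        regret.append(total)
    return regret

fixed = run_bandit(lambda t: 0.2)                        # explores forever
decayed = run_bandit(lambda t: min(1.0, 500 / (t + 1)))  # explores less and less
```

Plotting `fixed` against `decayed` gives curves in the spirit of the blue (linear) and middle-ground (bending over) cases above.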

In practice, however, we're going to see something much rougher around the edges, much less smooth, and you may well see a step-like ascending trajectory in a real plot obtained by actually feeding users your banners. But let's move on.
