
So we've seen how Gaussian processes can be used for regression and classification problems. Now let's see how they can be used for optimization problems.

So, what is black-box optimization? You have some function f(x) that you don't know anything about, and you want to find the maximum of this function. When the gradient is known, you can use gradient descent with restarts. When the gradient is unknown, you can, for example, numerically estimate it, or, if the function is not differentiable, you can use grid search or random search.

However, there is one special class of optimization problems where evaluating each point is really expensive. For example, suppose you want to find out where oil is, and you have to do some probe drilling to estimate the amount of oil at certain points. Here x would be geographic coordinates, and f(x) would be, for example, the amount of oil. In this case the cost of one sample could be around $1 million.

If you would like to tune the hyperparameters of a neural network, x would be the vector of hyperparameters, for example the sizes of the layers, the number of layers, and so on, and f(x) would be some objective function. In this case the cost of one sample could be 10 hours, since you have to wait that long for the network to converge.

In drug discovery it is even more important. Here x could be a candidate drug, and f(x) its effect against some severe disease. In this case one sample, that is, one test of the proposed drug, could cost you 2 months of clinical trials, $10,000, for example, to pay for assistants, and maybe also the life of a rat if you do experiments on rats.

So these are cases where evaluating the function at any point is really, really expensive. We will see how Gaussian processes can be used to find the maximum of the function while minimizing the number of samples you have to make. So again, our goal is to optimize a function, that is, to find its maximum, and also to minimize the number of trials.

The key here is to define a surrogate model, which we'll call f̂. It doesn't have to be exactly equal to the original function; however, we want it to be cheap to evaluate, and it should approximate the function well. Then we'd be able to find, for example, the maximum of the surrogate model and sample the original function at that point.

We'll also need an acquisition function, which we will call μ(x). It shows how desirable it is for us to sample next at a given point: it should estimate the profit for the optimization of sampling the point x. And it should use the surrogate model, since the surrogate model is really cheap to evaluate.

So first of all, let's see which surrogate model we will use. The model should be able to approximate arbitrarily complex functions, so the natural choice here is a nonparametric model. It should also be possible to estimate uncertainty, so a Gaussian process is a really good choice for this case.

Now let's see what we need from an acquisition function. It should balance exploration and exploitation. Exploration means searching in regions with high uncertainty: since the uncertainty there is high, there is some chance that you'll find a new maximum there. Exploitation means searching in regions where the value is already estimated to be high: the uncertainty there may be low, but you will still be able to pin down the position of the maximum more precisely.

So, now let's see three different kinds of acquisition functions. The first is the maximum probability of improvement, or MPI for short. We have the current best value; let's write it down as f*. The acquisition function is the probability that, after sampling the point x, the value of the function will be greater than or equal to the current best value. That is, it is the probability that f̂(x), the surrogate model, is greater than or equal to f*, the current best, plus some parameter ε. Since everything is normal, it is really easy to compute this value, and it can be written using the cumulative distribution function of the normal distribution, as shown on the right.
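As a sketch (not from the lecture), MPI can be written in a few lines once you have the GP posterior mean and standard deviation at the candidate points; the function name and the `xi` parameter are my own, standing in for the ε above:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """MPI acquisition: P(f_hat(x) >= f* + xi) for a Gaussian posterior
    with mean mu and standard deviation sigma at each candidate point."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Phi((mu - f* - xi) / sigma); points with zero variance get probability 0
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(sigma > 0, (mu - f_best - xi) / sigma, -np.inf)
    return norm.cdf(z)
```

For example, at a point where the posterior mean equals f* and ε = 0, the probability is exactly 0.5, as you'd expect for a symmetric Gaussian.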

Another one is called the upper confidence bound. Here you take the mean value and add η times the standard deviation of the surrogate model at this point. By varying the value of η, you can favor sampling at points where the mean is high or at points where the variance is high.
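A minimal sketch of UCB (the function name is mine), again assuming we already have the posterior mean and standard deviation:

```python
import numpy as np

def upper_confidence_bound(mu, sigma, eta=2.0):
    """UCB acquisition: posterior mean plus eta times the posterior
    standard deviation. A large eta favors exploration (high-uncertainty
    regions); a small eta favors exploitation (high-mean regions)."""
    return np.asarray(mu, dtype=float) + eta * np.asarray(sigma, dtype=float)
```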

The third acquisition function we will see is called expected improvement. As the title suggests, you compute the expected value of the gain you'll get after sampling the new point. Again, since everything is normal, you can compute it analytically, and the formula looks like this.
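The closed-form expression can be sketched as follows (my own naming, with `xi` again playing the role of an exploration offset ε): EI(x) = (μ − f* − ε)·Φ(z) + σ·φ(z) with z = (μ − f* − ε)/σ, where Φ and φ are the normal CDF and PDF.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition: E[max(f_hat(x) - f* - xi, 0)] under a Gaussian
    posterior with mean mu and standard deviation sigma."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improve = mu - f_best - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(sigma > 0, improve / sigma, 0.0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    # where the posterior is deterministic, improvement is just the positive part
    return np.where(sigma > 0, ei, np.maximum(improve, 0.0))
```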

So let's see an example. We start with a few points, in this case three. While the process has not converged, you train the Gaussian process and find the maximum of the acquisition function, for example using gradient descent or some other optimization technique. Since computing the values of the surrogate model, the Gaussian process, is relatively cheap, this step won't take much time. Then you evaluate the function at the position of the maximum of the acquisition function. Then you train the Gaussian process again with the new point, find the maximum of the acquisition function again, evaluate at the new point, and so on.
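The loop just described can be sketched for a 1-D function like this (a sketch, not the lecture's code: it uses scikit-learn's GP and maximizes expected improvement over a dense grid rather than with gradient descent, for simplicity):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=3, n_iter=10, xi=0.01, seed=0):
    """Fit a GP to the observed points, maximize the acquisition
    (expected improvement), evaluate f at that point, and repeat."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))          # the few starting points
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  normalize_y=True, alpha=1e-6)
    grid = np.linspace(lo, hi, 500).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(X, y)                                   # train the Gaussian process
        mu, sigma = gp.predict(grid, return_std=True)
        improve = mu - y.max() - xi
        with np.errstate(divide="ignore", invalid="ignore"):
            z = np.where(sigma > 0, improve / sigma, 0.0)
        ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = grid[np.argmax(ei)]                   # acquisition maximum
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))                 # evaluate the true f there
    return X[np.argmax(y), 0], y.max()
```

For instance, maximizing -(x - 2)² over [0, 5] should land near x = 2 after a handful of evaluations.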

So let's see how it works. Again, here I have three points. Now I add a new one, and below you see the plot of the acquisition function. We find its maximum; it is somewhere on the right. Notice that the points on the left have a relatively small value of the acquisition function: the mean there is quite low relative to the current best point, and the variance is also small, so it is very unlikely that we will ever get a point there with a value greater than the current optimum. So we'll sample near the current optimum for the rest of the experiment. As you see, the surrogate changes a bit, and now we're almost sure that the position of the maximum is somewhere near this point. And so, finally, after sampling a few points, we find the position of the maximum.

It is also very important to compare random search with Gaussian processes. Random search is really easy to parallelize: you just sample some random points and then compute the value of the function at them in parallel. For a Gaussian process, however, you have to come up with some heuristic to sample not one point but many. One such heuristic is to take the maximum of the acquisition function and add a new point to the Gaussian process at that position, with its value being the mean of the Gaussian process. Then you compute the maximum of the acquisition function again, take its maximum again, and repeat as many times as you want. You'll end up with a set of points that you can then sample in parallel.
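This heuristic (often called the "constant liar" or "kriging believer" trick in the literature) can be sketched like this; the function name is mine, and UCB with parameter η stands in for whichever acquisition you prefer:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_batch(X, y, grid, batch_size=4, eta=2.0):
    """Pick the acquisition maximum, pretend we observed the GP mean
    there, refit, and repeat: the variance near each fake point
    collapses, so the next maximum lands elsewhere."""
    X = np.array(X, dtype=float)
    y = np.array(y, dtype=float)
    batch = []
    for _ in range(batch_size):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                      normalize_y=True, alpha=1e-6)
        gp.fit(X, y)
        mu, sigma = gp.predict(grid, return_std=True)
        best = np.argmax(mu + eta * sigma)             # UCB maximum
        batch.append(grid[best])
        # the "lie": add the point with the GP mean as its value
        X = np.vstack([X, grid[best]])
        y = np.append(y, mu[best])
    return np.array(batch)   # evaluate the true f at these points in parallel
```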

Random search actually needs many more points on average to converge, and it also works poorly in high-dimensional spaces. For example, in the plot below you can see that the Gaussian process converges much faster than random search, and it also converges to a better optimum. On the other hand, random search works for any function, since sampling a random x is usually quite cheap. However, it is important to know that Bayesian optimization with Gaussian processes is useful only when each evaluation of the function is expensive, since training the Gaussian process also takes some time. If it takes more time than evaluating the true function, then you'd better use random search.
