So what you do, is then sample r uniformly, at random, between a and b.
So in this case, r would be between -4 and 0.
And then you set alpha,
your randomly sampled hyperparameter value, to 10 to the r, okay?
So just to recap, to sample on the log scale, you take the low value,
take logs to figure out what is a.
Take the high value, take a log to figure out what is b.
So now you're trying to sample from 10 to the a up to 10 to the b, uniformly on a log scale.
So you set r uniformly, at random, between a and b.
And then you set the hyperparameter to be 10 to the r.
So that's how you implement sampling on this logarithmic scale.
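As a quick sketch, assuming NumPy and the example range of 0.0001 to 1 for the learning rate from above, that recipe might look like the following. The variable names here are just for illustration.

```python
import numpy as np

# Search range for the learning rate alpha: 0.0001 to 1 (the example above).
low, high = 1e-4, 1.0

# Take logs of the low and high values to get a and b: here a = -4, b = 0.
a, b = np.log10(low), np.log10(high)

# Sample r uniformly at random between a and b, then set alpha = 10^r.
r = np.random.uniform(a, b)
alpha = 10 ** r
```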
Finally, one other tricky case is sampling the hyperparameter beta,
used for computing exponentially weighted averages.
So let's say you suspect that beta should be somewhere between 0.9 and 0.999.
Maybe this is the range of values you want to search over.
So remember, that when computing exponentially weighted averages,
using 0.9 is like averaging over the last 10 values,
kind of like taking the average of the last 10 days' temperature,
whereas using 0.999 is like averaging over the last 1,000 values.
So similar to what we saw on the last slide, if you want to search between 0.9
and 0.999, it doesn't make sense to sample on the linear scale, right?
Uniformly, at random, between 0.9 and 0.999.
So the best way to think about this,
is that we want to explore the range of values for 1 minus beta,
which is going to now range from 0.1 to 0.001.
And so we'll sample 1 minus beta,
which takes values from 0.1 down to 0.001.
So using the method we have figured out on the previous slide,
this is 10 to the -1, this is 10 to the -3.
Notice on the previous slide, we had the small value on the left and
the large value on the right, but here it's reversed.
We have the large value on the left, and the small value on the right.
So what you do, is you sample r uniformly, at random, from -3 to -1.
And you set 1 - beta = 10 to the r, and so beta = 1 - 10 to the r.
And this becomes your randomly sampled value of your hyperparameter,
chosen on the appropriate scale.
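As a sketch, again assuming NumPy, that procedure for beta might look like this:

```python
import numpy as np

# beta between 0.9 and 0.999 means 1 - beta ranges from 0.1 down to 0.001,
# which is 10^-1 down to 10^-3 on the log scale.
r = np.random.uniform(-3, -1)   # sample r uniformly at random from -3 to -1
beta = 1 - 10 ** r              # 1 - beta = 10^r, so beta = 1 - 10^r
```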
And hopefully this makes sense, in that this way,
you spend as much resources exploring the range 0.9 to 0.99,
as you would exploring 0.99 to 0.999.
So if you want a more formal mathematical justification for why we're
doing this, why is it such a bad idea to sample on a linear scale,
it is that when beta is close to 1,
the results you get become very sensitive to even small changes in beta.
So if beta goes from 0.9 to 0.9005,
it's no big deal, this is hardly any change in your results.
But if beta goes from 0.999 to 0.9995,
this will have a huge impact on exactly what your algorithm is doing, right?
In the first case, it's averaging over roughly 10 values either way.
But in the second case, it's gone from an exponentially weighted average over about
the last 1,000 examples, to now, the last 2,000 examples.
And it's because that formula we have, 1 / (1 - beta),
is very sensitive to small changes in beta when beta is close to 1.
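To see that sensitivity concretely, here's a tiny check of the rough 1 / (1 - beta) window for the values mentioned above:

```python
# The effective averaging window is roughly 1 / (1 - beta).
for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(beta, round(1 / (1 - beta)))
# 0.9 and 0.9005 both give a window of about 10 values (hardly any change),
# while 0.999 gives about 1,000 and 0.9995 about 2,000 (the window doubles).
```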
So what this whole sampling process does,
is that it causes you to sample more densely in the region where beta is close to 1.
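And just to illustrate that numerically, here's a quick sketch, assuming NumPy, showing that this scheme splits the samples roughly evenly between 0.9 to 0.99 and 0.99 to 0.999, rather than crowding them down near 0.9:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(-3, -1, size=10_000)
betas = 1 - 10 ** r

# Each half of the log-scale range gets about half of the samples.
print(np.mean((betas >= 0.9) & (betas < 0.99)))     # roughly 0.5
print(np.mean((betas >= 0.99) & (betas <= 0.999)))  # roughly 0.5
```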