
In the last video, you saw how sampling at random over the range of hyperparameters can allow you to search the space of hyperparameters more efficiently. But it turns out that sampling at random doesn't mean sampling uniformly at random over the range of valid values. Instead, it's important to pick the appropriate scale on which to explore the hyperparameters. In this video, I want to show you how to do that.

Let's say that you're trying to choose the number of hidden units, n[l], for a given layer l. And let's say that you think a good range of values is somewhere from 50 to 100. In that case, if you look at the number line from 50 to 100, picking some values at random within this range is a reasonable way to search for this particular hyperparameter. Or if you're trying to decide on the number of layers in your neural network, which we're calling capital L, maybe you think the total number of layers should be somewhere between 2 and 4. Then sampling uniformly at random among 2, 3, and 4 might be reasonable. Or even using a grid search, where you explicitly evaluate the values 2, 3, and 4, might be reasonable. So these were a couple of examples where sampling uniformly at random over the range you're contemplating is a reasonable thing to do.
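As a quick sketch of the uniform sampling just described (assuming NumPy; the variable names are mine, not from the video):

```python
import numpy as np

# Uniformly sample the number of layers L from the discrete set {2, 3, 4}
L = np.random.choice([2, 3, 4])

# Uniformly sample the number of hidden units n[l] from the range 50 to 100
n_l = np.random.randint(50, 101)  # high end is exclusive, so this covers 50..100
```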

But this is not true for all hyperparameters. Let's look at another example. Say you're searching for the hyperparameter alpha, the learning rate. And let's say that you suspect 0.0001 might be on the low end, or maybe it could be as high as 1. Now if you draw the number line from 0.0001 to 1 and sample values uniformly at random over this number line, about 90% of the values you sample would be between 0.1 and 1. So you're using 90% of the resources to search between 0.1 and 1, and only 10% of the resources to search between 0.0001 and 0.1. That doesn't seem right. Instead, it seems more reasonable to search for hyperparameters on a log scale, where instead of using a linear scale, you'd have 0.0001 here, and then 0.001, 0.01, 0.1, and then 1, and you instead sample uniformly at random on this type of logarithmic scale. Now you have more resources dedicated to searching between 0.0001 and 0.001, between 0.001 and 0.01, and so on.

So in Python, the way you implement this is with two lines of code: first you sample r, then you set alpha to 10 to the r. After this first line, r will be a random number between -4 and 0, and so alpha will be between 10 to the -4 and 10 to the 0. So 10 to the -4 is the left end of the range, and 1 is 10 to the 0.
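A minimal sketch of those two lines, assuming NumPy (np.random.rand() draws uniformly from [0, 1)):

```python
import numpy as np

r = -4 * np.random.rand()  # r is uniform between -4 and 0
alpha = 10 ** r            # alpha is log-uniform between 10^-4 and 1
```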

In a more general case, suppose you're trying to sample between 10 to the a and 10 to the b on the log scale. In this example, the low end of the range is 10 to the a, and you can figure out what a is by taking the log base 10 of 0.0001, which tells you a is -4. And the value on the right is 10 to the b, and you can figure out what b is by taking log base 10 of 1, which tells you b is equal to 0. So what you do is then sample r uniformly at random between a and b. In this case, r would be between -4 and 0. And you can set alpha, your randomly sampled hyperparameter value, to 10 to the r.

So just to recap: to sample on the log scale, you take the low value and take its log to figure out what a is. You take the high value and take its log to figure out what b is. Now you're trying to sample between 10 to the a and 10 to the b on a log scale. So you sample r uniformly at random between a and b, and then you set the hyperparameter to 10 to the r. That's how you implement sampling on this logarithmic scale.
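The recap above can be sketched as a small helper function (assuming NumPy; the name sample_log_scale is my own, not from the video):

```python
import numpy as np

def sample_log_scale(low, high):
    """Sample a value log-uniformly at random between low and high."""
    a = np.log10(low)            # e.g. log10(0.0001) = -4
    b = np.log10(high)           # e.g. log10(1) = 0
    r = np.random.uniform(a, b)  # r is uniform between a and b
    return 10 ** r

alpha = sample_log_scale(0.0001, 1)
```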

Finally, one other tricky case is sampling the hyperparameter beta, used for computing exponentially weighted averages. So let's say you suspect that beta should be somewhere between 0.9 and 0.999. Maybe this is the range of values you want to search over. Remember that when computing exponentially weighted averages, using 0.9 is like averaging over the last 10 values, kind of like taking the average of 10 days' temperature, whereas using 0.999 is like averaging over the last 1,000 values.

So similar to what we saw on the last slide, if you want to search between 0.9 and 0.999, it doesn't make sense to sample on the linear scale, uniformly at random between 0.9 and 0.999. The best way to think about this is that we want to explore the range of values for 1 minus beta, which now ranges from 0.1 down to 0.001. So we'll sample 1 minus beta, taking values between 0.1 and 0.001. Using the method we figured out on the previous slide, 0.1 is 10 to the -1 and 0.001 is 10 to the -3. Notice that on the previous slide we had the small value on the left and the large value on the right, but here that's reversed: we have the large value on the left and the small value on the right. So what you do is sample r uniformly at random between -3 and -1, and you set 1 - beta = 10 to the r, and so beta = 1 - 10 to the r. And this becomes your randomly sampled value of your hyperparameter, chosen on the appropriate scale.
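A minimal sketch of this sampling step, assuming NumPy:

```python
import numpy as np

r = np.random.uniform(-3, -1)  # r is uniform between -3 and -1
beta = 1 - 10 ** r             # beta is sampled on a log scale between 0.9 and 0.999
```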

And hopefully this makes sense: this way, you spend as many resources exploring the range 0.9 to 0.99 as you would exploring 0.99 to 0.999.

In case you want a more formal mathematical justification for why we're doing this, that is, why it is such a bad idea to sample on a linear scale: it is that when beta is close to 1, the sensitivity of the results you get changes, even with very small changes to beta. So if beta goes from 0.9 to 0.9005, it's no big deal; this is hardly any change in your results, since in both of these cases you're averaging over roughly the last 10 values. But if beta goes from 0.999 to 0.9995, this will have a huge impact on exactly what your algorithm is doing: it's gone from an exponentially weighted average over about the last 1,000 examples to one over about the last 2,000 examples. And it's because that formula we have, 1 / (1 - beta), is very sensitive to small changes in beta when beta is close to 1. So what this whole sampling process does is cause you to sample more densely in the region where beta is close to 1, or, alternatively, where 1 - beta is close to 0, so that you can be more efficient in how you distribute the samples and explore the space of possible outcomes.
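The sensitivity of 1 / (1 - beta) described above can be checked with a few lines (a sketch; the helper name window is mine, not from the video):

```python
def window(beta):
    """Approximate number of values averaged by an exponentially weighted average."""
    return 1 / (1 - beta)

print(window(0.9))     # roughly 10
print(window(0.9005))  # roughly 10.05, hardly any change
print(window(0.999))   # roughly 1000
print(window(0.9995))  # roughly 2000, doubled by a tiny change in beta
```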

So I hope this helps you select the right scale on which to sample the hyperparameters. In case you don't end up making the right scaling decision on some hyperparameter choice, don't worry too much about it. Even if you sample on the uniform scale where some other scale would have been superior, you might still get okay results, especially if you use a coarse-to-fine search, so that in later iterations you focus in more on the most useful range of hyperparameter values to sample. I hope this helps you in your hyperparameter search. In the next video, I also want to share with you some thoughts on how to organize your hyperparameter search process, which I hope will make your workflow a bit more efficient.