0:00

Hi, and welcome back. You've seen by now that training neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters.

One of the painful things about training deep nets is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha to the momentum term beta, if using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta one, beta two, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then of course, you might need to choose the mini-batch size.

So it turns out, some of these hyperparameters are more important than others. For most learning applications, I would say alpha, the learning rate, is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I would maybe tune next would be the momentum term beta; say, 0.9 is a good default. I'd also tune the mini-batch size to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the number of hidden units.

Of the ones I've circled in orange, these are really the three that I would consider second in importance to the learning rate alpha. And then third in importance, after fiddling around with the others, the number of layers can sometimes make a huge difference, and so can learning rate decay. And then when using the Adam algorithm, I actually pretty much never tune beta one, beta two, and epsilon. Pretty much I always use 0.9, 0.999, and 10^-8, although you can try tuning those as well if you wish. But hopefully this does give you some rough sense of which hyperparameters might be more important than others: alpha, most important, for sure, followed maybe by the ones I've circled in orange, followed maybe by the ones I've circled in purple. But this isn't a hard and fast rule, and I think other deep learning practitioners may well disagree with me or have different intuitions on these.
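To make the Adam defaults mentioned above concrete, here is a rough sketch (not from the video) of a single Adam update for one scalar parameter, hard-coding beta one = 0.9, beta two = 0.999, and epsilon = 10^-8; the learning rate alpha is still left as something you'd tune:

```python
import math

# Default Adam hyperparameters from the lecture; alpha remains problem-dependent.
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8

def adam_step(theta, grad, m, v, t, alpha=0.001):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = BETA1 * m + (1 - BETA1) * grad       # first-moment (momentum) estimate
    v = BETA2 * v + (1 - BETA2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)             # bias correction (t = step count)
    v_hat = v / (1 - BETA2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + EPS)
    return theta, m, v
```

Because epsilon only guards against division by zero here, nudging it rarely changes the result, which is why it is usually left at its default.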

Now, if you're trying to tune some set of hyperparameters, how do you select a set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'm calling hyperparameter one and hyperparameter two here, it was common practice to sample the points in a grid like so and systematically explore these values. Here I am placing down a five-by-five grid. In practice, it could be more or less than a five-by-five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. And this practice works okay when the number of hyperparameters is relatively small.

In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, say 25, and then try out the hyperparameters on this randomly chosen set of points. And the reason you do that is that it's difficult to know in advance which hyperparameters are going to be the most important for your problem. And as you saw in the previous slide, some hyperparameters are actually much more important than others. So to take an example, let's say hyperparameter one turns out to be alpha, the learning rate. And to take an extreme example, let's say that hyperparameter two is that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot, and your choice of epsilon hardly matters. So if you sample in the grid, then you've really tried out only five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you've now trained 25 models and only gotten to try five values for the learning rate alpha, which I think is really important. Whereas in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and therefore you'd be more likely to find a value that works really well.
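The counting argument above can be sketched in a few lines of Python (with made-up unit ranges standing in for the real hyperparameter ranges): a five-by-five grid reuses the same five values of hyperparameter one across all 25 trials, while 25 random points give 25 distinct values of it:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Grid: 5 x 5 points -> only 5 distinct values of hyperparameter one.
grid = [(a, e) for a in range(5) for e in range(5)]  # indices into 5 candidates each
distinct_grid = len({a for a, _ in grid})            # -> 5

# Random: 25 points drawn uniformly -> 25 distinct values of hyperparameter one.
rand = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(25)]
distinct_rand = len({a for a, _ in rand})
```

If hyperparameter two turns out not to matter (like epsilon), the random scheme has effectively spent all 25 trials exploring hyperparameter one, while the grid has wasted 20 of them.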

I've explained this example using just two hyperparameters. In practice, you might be searching over many more hyperparameters than these, so if you have, say, three hyperparameters, I guess instead of searching over a square, you're searching over a cube, where this third dimension is hyperparameter three, and then by sampling within this three-dimensional cube you get to try out a lot more values of each of your three hyperparameters. And in practice you might be searching over even more hyperparameters than three, and sometimes it's just hard to know in advance which ones turn out to be the really important hyperparameters for your application, and sampling at random rather than in a grid means that you more richly explore the set of possible values for the most important hyperparameters, whatever they turn out to be.

When you sample hyperparameters, another common practice is to use a coarse-to-fine sampling scheme. So let's say in this two-dimensional example that you sample these points, and maybe you found that this point worked the best, and maybe a few other points around it tended to work really well. Then in the coarse-to-fine scheme, what you might do is zoom in to a smaller region of the hyperparameter space and then sample more densely within this space, maybe again at random, but then focus more resources on searching within this blue square, if you're suspecting that the best setting of the hyperparameters may be in this region. So after doing a coarse sample of this entire square, that tells you to then focus on a smaller square. You can then sample more densely within this smaller square. So this type of coarse-to-fine search is also frequently used.
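As a minimal sketch of this coarse-to-fine idea (with a made-up score function standing in for "train a model and evaluate it on the dev set", and an assumed zoom radius of 0.1): first sample the whole square at random, then re-sample a small box around the best coarse point:

```python
import random

random.seed(1)  # fixed seed for reproducibility

def sample(lo1, hi1, lo2, hi2, n):
    """Draw n random points in the box [lo1, hi1] x [lo2, hi2]."""
    return [(random.uniform(lo1, hi1), random.uniform(lo2, hi2)) for _ in range(n)]

def score(h1, h2):
    """Hypothetical stand-in for dev-set performance; peaks at (0.3, 0.7)."""
    return -((h1 - 0.3) ** 2 + (h2 - 0.7) ** 2)

# Coarse pass: 25 random points over the entire square.
coarse = sample(0.0, 1.0, 0.0, 1.0, 25)
c1, c2 = max(coarse, key=lambda p: score(*p))

# Fine pass: zoom into a small square around the best coarse point
# and sample just as many points there, i.e. much more densely.
r = 0.1
fine = sample(c1 - r, c1 + r, c2 - r, c2 + r, 25)
b1, b2 = max(fine, key=lambda p: score(*p))
```

The fine pass spends the same 25 trials on a region 25 times smaller, which is the "focus more resources on the blue square" step from the lecture.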

And by trying out these different values of the hyperparameters, you can then pick whatever value allows you to do best on your training set objective, or does best on your development set, or whatever you're trying to optimize in your hyperparameter search process. So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are: use random sampling rather than a grid search, and optionally consider implementing a coarse-to-fine search process. But there's even more to hyperparameter search than this. Let's talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
