Batch norm processes your data one mini batch at a time, but at test time you might need to process examples one at a time. Let's see how you can adapt your network to do that.

Recall that during training, here are the equations you'd use to implement batch norm. Within a single mini batch, you'd sum over that mini batch of the z^(i) values to compute the mean. So here, you're just summing over the examples in one mini batch. I'm using m to denote the number of examples in the mini batch, not in the whole training set. Then you compute the variance, and then you compute z_norm by scaling by the mean and standard deviation, with epsilon added for numerical stability. And then z tilde is taking z_norm and rescaling by gamma and beta. So, notice that mu and sigma squared, which you need for this scaling calculation, are computed on the entire mini batch.
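As a rough sketch, the training-time equations described above might look like this in NumPy (the function name and shapes are illustrative, not taken from any particular framework):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch norm over a mini batch Z of shape (n_units, m).

    Illustrative sketch: mu and sigma2 are computed across the m
    examples of the mini batch, then z_norm is rescaled by the
    learned parameters gamma and beta.
    """
    mu = np.mean(Z, axis=1, keepdims=True)       # mean over the m examples
    sigma2 = np.var(Z, axis=1, keepdims=True)    # variance over the m examples
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)    # eps for numerical stability
    Z_tilde = gamma * Z_norm + beta              # rescale by gamma and beta
    return Z_tilde, mu, sigma2
```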

But at test time, you might not have a mini batch of 64, 128, or 256 examples to process at the same time. So, you need some different way of coming up with mu and sigma squared. And if you have just one example, taking the mean and variance of that one example doesn't make sense. So, what's actually done in order to apply your neural network at test time is to come up with some separate estimate of mu and sigma squared.

In typical implementations of batch norm, what you do is estimate these using an exponentially weighted average, where the average is across the mini batches. So, to be very concrete, here's what I mean.

Let's pick some layer L, and let's say you're going through mini batches X^{1}, X^{2}, together with the corresponding values of Y, and so on. So, when training on X^{1} for that layer L, you get some mu^[L]; in fact, I'm going to write this as mu for the first mini batch and that layer. And then when you train on the second mini batch for that layer, you end up with some second value of mu. And then for the third mini batch in this hidden layer, you end up with some third value of mu.

So, just as we saw how to use an exponentially weighted average to compute the mean of theta 1, theta 2, theta 3 when you were computing an exponentially weighted average of the current temperature, you would do the same here to keep track of the latest average value of this mean vector you've seen.

So that exponentially weighted average becomes your estimate for what the mean of the z's is for that hidden layer. And similarly, you use an exponentially weighted average to keep track of the values of sigma squared that you see on the first mini batch in that layer, the sigma squared that you see on the second mini batch, and so on. So you keep a running average of the mu and the sigma squared that you're seeing for each layer as you train the neural network across different mini batches.
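The running-average update described here can be sketched as follows; the momentum value of 0.9 is just an illustrative choice, and frameworks pick their own defaults:

```python
import numpy as np

def update_running_stats(running_mu, running_sigma2, mu, sigma2, momentum=0.9):
    """Exponentially weighted (running) average of batch statistics.

    Illustrative sketch: after each mini batch, blend the batch's
    mu and sigma2 into the running estimates used at test time.
    """
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma2 = momentum * running_sigma2 + (1 - momentum) * sigma2
    return running_mu, running_sigma2
```

Calling this once per mini batch, per layer, keeps the estimates tracking the most recent batch statistics.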

Then finally, at test time, in place of this equation you would just compute z_norm using whatever value your z has, and using your exponentially weighted average of mu and sigma squared, whatever the latest values were, to do the scaling here. And then you would compute z tilde on your one test example, using that z_norm we just computed on the left, and using the beta and gamma parameters that you learned during your neural network training process.
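Putting it together, test-time normalization of a single example could look like this sketch (names are illustrative, and the running statistics come from the training-time averages):

```python
import numpy as np

def batchnorm_test_time(z, running_mu, running_sigma2, gamma, beta, eps=1e-8):
    """Normalize one example using running statistics, not batch statistics.

    Illustrative sketch: same scaling as training, but mu and sigma2
    come from the exponentially weighted averages kept during training.
    """
    z_norm = (z - running_mu) / np.sqrt(running_sigma2 + eps)
    return gamma * z_norm + beta
```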

So the takeaway from this is that during training, mu and sigma squared are computed on an entire mini batch of, say, 64, 128, or some number of examples. But at test time, you might need to process a single example at a time. So, the way to do that is to estimate mu and sigma squared from your training set, and there are many ways to do that.

You could, in theory, run your whole training set through your final network to get mu and sigma squared. But in practice, what people usually do is implement an exponentially weighted average, where you just keep track of the mu and sigma squared values you're seeing during training, and use that exponentially weighted average, also sometimes called the running average, to get a rough estimate of mu and sigma squared. And then you use those values of mu and sigma squared at test time to do the scaling you need of the hidden unit values z.

In practice, this process is pretty robust to the exact way you estimate mu and sigma squared. So, I wouldn't worry too much about exactly how you do this, and if you're using a deep learning framework, it will usually have some default way to estimate mu and sigma squared that should work reasonably well. Any reasonable way to estimate the mean and variance of your hidden unit values z should work fine at test time.

So, that's it for batch norm and using it. I think you'll be able to train much deeper networks and get your learning algorithm to run much more quickly. Before we wrap up this week, I want to share with you some thoughts on deep learning frameworks as well. Let's start to talk about that in the next video.
