0:00

In this video, I'm going to describe a new way of combining a very large number of

neural network models without having to separately train a very large number of

models. This is a method called dropout that's

recently been very successful in winning competitions.

For each training case, we randomly omit some of the hidden units.

So, we end up with a different architecture for each training case.

We can think of this as having a different model for every training case.

And then, the question is, how could we possibly train a model on only one

training case and how could we average all these models together efficiently at test

time? The answer is that we use a great deal of

weight sharing. I want to start by describing two

different ways of combining the outputs of multiple models.

In a mixture, we combine models by averaging their output probabilities.

So, if model A assigns probabilities of 0.3, 0.2 and 0.5, to three different

answers, model B assigns probabilities of 0.1, 0.8 and 0.1, the combined model

simply assigns the averages of those probabilities.

A different way of combining models is to use a product of the probabilities. Here,

we take a geometric mean of the same probabilities.

So, model A and model B again assign the same probabilities as they did before.

But now, what we do is we multiply each pair of probabilities together and then

take the square root. That's the geometric mean and the

geometric means will generally add up to less than one.

So, we have to divide by the sum of the geometric means to normalize the

distribution so that it adds up to one again.
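As a quick sketch of these two combination rules in NumPy, using the example probabilities above:

```python
import numpy as np

# The two models' output distributions from the example above.
p_a = np.array([0.3, 0.2, 0.5])
p_b = np.array([0.1, 0.8, 0.1])

# Mixture: arithmetic mean of the probabilities.
mixture = (p_a + p_b) / 2          # [0.2, 0.5, 0.3]

# Product: geometric mean of each pair, then renormalize,
# because the geometric means add up to less than one.
geo = np.sqrt(p_a * p_b)
product = geo / geo.sum()
```

With these particular numbers the two rules happen to agree fairly closely; the veto effect shows up when one model assigns a probability near zero to some answer, which drives the product for that answer near zero no matter what the other model says.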

1:54

You'll notice that in a product, a small probability output by one model has veto

power over the other models. Now I want to describe an efficient way to

average a large number of neural nets that gives us an alternative to doing the

correct Bayesian thing. The alternative probably doesn't work

quite as well as doing the correct Bayesian thing, but it's much more

practical. So, consider the neural net with one

hidden layer, shown on the right. Each time we present a training example to

it, what we're going to do is randomly omit

each hidden unit with a probability of 0.5.

So, we crossed out three of the hidden units here.

And we run the example through the net with those hidden units absent.

What this means is that we're randomly sampling from two to the h architectures,

where h is the number of hidden units. That's a huge number of architectures.
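Here's a minimal NumPy sketch of that sampling step. The net sizes, weight values, and function name are made up for illustration, and I've used logistic hidden units:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 4, 6, 3            # made-up sizes for illustration
W1 = rng.normal(size=(n_in, n_hid))     # weights shared by every sampled architecture
W2 = rng.normal(size=(n_hid, n_out))

def forward_train(x, p_drop=0.5):
    h = 1 / (1 + np.exp(-(x @ W1)))     # logistic hidden units
    mask = rng.random(n_hid) >= p_drop  # picks one of 2**n_hid architectures
    return (h * mask) @ W2              # omitted units contribute nothing

x = rng.normal(size=n_in)
logits = forward_train(x)               # a different random architecture each call
```

Each call to `forward_train` samples one of the 2^6 = 64 possible architectures, but the surviving hidden units always use the same shared weights `W1` and `W2`.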

Of course, all of these architectures share weights.

That is, whenever we use a hidden unit, it has the same weights as it has in

other architectures. So, we can think of dropout as a form of

model averaging. We sample from these two to the h models.

Most of the models, in fact, will never be sampled.

And a model that is sampled gets only one training example.

That's a very extreme form of bagging. The training sets are very different for

the different models, but they're also very small.

The sharing of the weights between all the models means that each model is very

strongly regularized by the others. And this is a much better regularizer than

things like L2 or L1 penalties. Those penalties pull the weights toward

zero. By sharing weights with other models, a

model gets regularized by something that's going to tend to pull the weights

towards the correct value. The question still remains what we do at

test time. So, we could sample many of the

architectures, maybe a hundred, and take the geometric mean of the output

distributions. But that would be a lot of work.

There's something much simpler we can do. We use all of the hidden units, but we

halve their outgoing weights. So, they have the same expected effect as

they did when we were sampling. It turns out that using all of the hidden

units with half their outgoing weights exactly computes the geometric mean of

the predictions that all two to the h models would have made, provided we're

using a softmax output group. If we have more than one hidden layer, we

can simply use dropout with probability 0.5 in every layer.
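A small numerical sketch of this claim, checking the expected input to the output layer (the activations and weights are made-up values; averaging the logits of the sampled architectures corresponds to taking the geometric mean of their softmax outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
n_hid, n_out = 5, 3
h = rng.random(n_hid)                   # hidden activations for one test case
W = rng.normal(size=(n_hid, n_out))     # outgoing weights (made-up values)

# Mean net: use every hidden unit but halve the outgoing weights.
mean_net_logits = h @ (0.5 * W)

# Monte Carlo: average the logits over many sampled architectures.
masks = rng.random((200_000, n_hid)) < 0.5   # keep each unit with prob 0.5
mc_logits = ((h * masks) @ W).mean(axis=0)

# The two agree up to sampling noise; under a softmax output,
# averaging the logits is exactly the normalized geometric mean
# of the individual models' predicted distributions.
```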

5:10

We could run lots of stochastic models with dropout, and then average across

those stochastic models. And that would have one advantage over the

mean net. It would give us an idea of the

uncertainty in the answer. What about the input layer?

Well, we can use the same trick there, too.

We use dropout on the inputs, but we use a higher probability of keeping an input.
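The input-layer version can be sketched the same way, just with a higher keep probability (the 0.8 here is an arbitrary illustrative choice, not a value fixed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(x, p_keep):
    """Zero out each component of x independently with prob 1 - p_keep."""
    return x * (rng.random(x.shape) < p_keep)

x = rng.normal(size=10)
noisy_x = dropout(x, p_keep=0.8)   # inputs: keep most of them
# (Hidden layers would use p_keep=0.5; at test time, use all inputs
# and scale the first-layer weights by p_keep instead.)
```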

This trick is already used in a method called denoising autoencoders, developed

by Pascal Vincent, Hugo Larochelle, and Yoshua Bengio at the University of

Montreal, and it works very well. So, how well does dropout work?

Well, the record breaking object recognition net developed by Alex

Krizhevsky would have broken the record even without dropout.

But it broke the record by a lot more using dropout. In general, if you have a deep neural net

and it's overfitting, dropout will typically reduce the number of errors by

quite a lot. I think any net that requires early

stopping in order to prevent it from overfitting would do better by using

dropout. It would, of course, take longer to train, and it might need more hidden

units. If you've got a deep neural net and it's not

overfitting, you should probably be using a bigger one with dropout. That's

assuming you have enough computational power.

There's another way to think about dropout, which is how I originally arrived

at the idea. And you'll see it's a bit related to

mixtures of experts, and what goes wrong when all the experts cooperate:

What's preventing specialization? So, if a hidden unit knows which other

hidden units are present, it can co-adapt to the other hidden units on the training

data. What that means is, the real signal that's

training a hidden unit is: try to fix up the error that's left over when all the

other hidden units have had their say. That's what's being back propagated to

train the weights of each hidden unit. Now, that's going to cause complex

co-adaptations between the hidden units. And these are likely to go wrong when

there's a change in the data, such as new test data.

If you rely on a complex co-adaptation to get things right on the training data,

it's quite likely to not work nearly so well on new test data.

It's like the idea that a big, complex conspiracy involving lots of people is

almost certain to go wrong because there's always things you didn't think of.

And if there's a large number of people involved, one of them will behave in an

unexpected way. And then, the others will be doing the

wrong thing. It's much better, if you want conspiracies,

to have lots of little conspiracies. Then, when unexpected things happen, many

of the little conspiracies will fail, but some of them will still succeed.

So, by using dropout, we force a hidden unit to work with

combinatorially many other sets of hidden units.

And that makes it much more likely to do something that's individually useful

rather than only useful because of the way particular other hidden units are

collaborating with it. But it is also going to tend to do

something that's individually useful and is different from what other hidden units

do. It needs to be something that's marginally

useful, given what its co-workers tend to achieve.

And I think this is what's giving nets with dropout their very good performance.