0:00

In this video, I'm going to talk about improving generalization by reducing the overfitting that occurs when a network has too much capacity for the amount of data it's given during training. I'll describe various ways of controlling the capacity of a network, and I'll also describe how we determine how to set the meta parameters when we use a method for controlling capacity. I'll then go on to give an example where we control capacity by stopping the learning early.

Just to remind you, the reason we get overfitting is that, as well as containing information about the true regularities in the mapping from input to output, any finite set of training data also contains sampling error: there are accidental regularities in the training set, just because of the particular training cases that were chosen. When we fit the model, it can't tell which of the regularities are real, and would also exist if we sampled the training set again, and which are caused by the sampling error. So the model fits both kinds of regularity. And if the model's too flexible, it'll fit the sampling error really well, and then it'll generalize badly.

1:19

So we need a way to prevent this overfitting. The first method I'll describe is by far the best, and it's simply to get more data. There's no point coming up with fancy schemes to prevent overfitting if you can get yourself more data. Data has exactly the right characteristics to prevent overfitting: the more of it you have, the better, assuming your computer's fast enough to use it.

A second method is to try to judiciously limit the capacity of the model, so that it's got enough capacity to fit the true regularities but not enough capacity to fit the spurious regularities caused by the sampling error. This, of course, is very difficult to do, and I'll describe in the rest of this lecture various approaches to trying to regulate the capacity appropriately.

In the next lecture, I'll talk about averaging together many different models. If we average models that have different forms and make different mistakes, the average will do better than the individual models. We could make the models different just by training them on different subsets of the training data. This is a technique called bagging. There are also other ways to mess with the training data to make the models as different as possible.
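As a rough sketch (not from the lecture), bagging could look like this: train one model per bootstrap resample of the training data, then average. Here a one-parameter linear regression stands in for the models, and all the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise.
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + rng.normal(scale=0.5, size=200)

def fit_slope(x, t):
    """Least-squares slope for a zero-intercept linear model."""
    return (x @ t) / (x @ x)

# Bagging: train one model per bootstrap resample, then average predictions.
# (For a linear model, averaging predictions equals averaging the slopes.)
n_models = 25
slopes = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    slopes.append(fit_slope(X[idx], y[idx]))

bagged_slope = np.mean(slopes)  # the averaged model
```

Each resample gives a slightly different model; the average is less at the mercy of any one resample's sampling error.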

3:45

A very common way to control the capacity of a neural network is to give it a number of hidden layers, or of units per layer, that is a little too large, and then to penalize the weights, using penalties or constraints on the squared values of the weights or the absolute values of the weights. And finally, we can control the capacity of a model by adding noise to the weights, or by adding noise to the activities. Typically, we use a combination of several of these different capacity control methods.

Now, for most of these methods, there are meta parameters that you have to set, like the number of hidden units, or the number of layers, or the size of the weight penalty.
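To make the squared-weight penalty concrete, here is a minimal sketch (my own illustration, with made-up data) of gradient descent on squared error plus an L2 penalty; `lam` is the penalty strength, one of the meta parameters just mentioned.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear problem: cost = 0.5*||y - Xw||^2 + 0.5*lam*||w||^2
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

def train(lam, steps=500, lr=0.01):
    """Gradient descent with a squared-weight (L2) penalty of strength lam."""
    w = np.zeros(3)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w  # lam*w is the penalty's gradient
        w -= lr * grad
    return w

w_free = train(lam=0.0)
w_penalized = train(lam=50.0)
# The penalty pulls the weights toward zero, reducing effective capacity.
```

A larger `lam` shrinks the weights more; choosing it well is exactly the meta-parameter problem the lecture turns to next.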

4:32

An obvious way to set those meta parameters is to try lots of different values of one of them, like, for example, the number of hidden units, and see which gives the best performance on the test set. But there's something deeply wrong with that: it gives a false impression of how well the method will work if you give it another test set. The settings that work best for one particular test set are unlikely to work as well on a new test set drawn from the same distribution, because they've been tuned to that particular test set. And that means you get a false impression of how well you would do on a new test set.

Let me give you an extreme example of that. Suppose the test set really is random; quite a lot of financial data seems to be like that. So the answers just don't depend on the inputs, and can't be predicted from the inputs. If you choose the model that does best on your test set, that will obviously do better than chance, because you selected it to do better than chance. But if you take that model and try it on new data that's also random, you can't expect it to do better than chance. So by selecting a model, you got a false impression of how well the model will do on new data. And the question is: is there a way around that?
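The random-data example is easy to simulate. In this sketch (my own illustration, not from the lecture), every "model" is a fixed random prediction vector, so nothing can genuinely beat 50% accuracy; yet the best-of-fifty model looks better than chance on the test set used to select it, and drops back to chance on fresh data.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_models = 1000, 50

# Labels are pure noise: no model can genuinely beat 50% accuracy.
test_labels = rng.integers(0, 2, size=n)
new_labels = rng.integers(0, 2, size=n)

# Each "model" is just a fixed random binary prediction vector.
preds = rng.integers(0, 2, size=(n_models, n))

acc_on_test = (preds == test_labels).mean(axis=1)
best = np.argmax(acc_on_test)  # model selection done on the test set

selected_test_acc = acc_on_test[best]                   # above chance
selected_new_acc = (preds[best] == new_labels).mean()   # back near 0.5
```

The gap between `selected_test_acc` and `selected_new_acc` is exactly the false impression the lecture warns about.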

6:13

You hold back some validation data, which isn't going to be used for training, but is going to be used for deciding how to set the meta parameters. In other words, you're going to look at how well the model does on the validation data to decide what's an appropriate number of hidden units, or an appropriate size of weight penalty. But then, once you've done that, and trained your model with what looks like the best number of hidden units and the best weight penalty, you're then going to see how well it does on the final set of data that you've held back, which is the test data. And you must only use that once. That'll give you an unbiased estimate of how well the network works, and in general that estimate will be a little worse than on the validation data.

Nowadays in competitions, the people organizing the competitions have learned to hold back that true test data and get people to send in predictions, so they can see whether people really can predict on true test data, or whether they're just overfitting to the validation data by selecting meta parameters that do particularly well on the validation data but won't generalize to new test sets.

One way we can get a better estimate of our weight penalties, or number of hidden units, or anything else we're trying to fix using the validation data, is to rotate the validation set. So we hold back a final test set to get our final unbiased estimate. But then we divide the other data into N equal-sized subsets, and we train on all but one of those N, and use the Nth as a validation set. Then we can rotate and hold back a different subset as a validation set, and so we can get many different estimates of what the best weight penalty is, or the best number of hidden units is. This is called N-fold cross-validation.
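A minimal sketch of the rotation just described, using ridge regression as a stand-in for the neural net (the data and the candidate `lam` values are made up for illustration): each fold serves once as the validation set, and we keep the weight penalty with the lowest average validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for tuning the weight penalty `lam` by N-fold cross-validation.
X = rng.normal(size=(120, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.5, size=120)

def ridge(Xtr, ytr, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    d = Xtr.shape[1]
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)

def cross_val_error(lam, n_folds=4):
    folds = np.array_split(np.arange(len(X)), n_folds)
    errs = []
    for i, val_idx in enumerate(folds):
        tr_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge(X[tr_idx], y[tr_idx], lam)  # train on the other N-1 folds
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errs)  # average over the N rotations

best_lam = min([0.01, 0.1, 1.0, 10.0, 100.0], key=cross_val_error)
```

A separate, untouched test set would still be needed afterwards for the final unbiased estimate.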

It's important to remember that the N different estimates we get are not independent of one another. If, for example, we were really unlucky and all the examples of one class fell into one of those subsets, we'd expect to generalize very badly, whether that subset was the validation subset or whether it was in the training data.

So now I'm going to describe one particularly easy-to-use method for preventing overfitting. It's good when you have a big model on a small computer and you don't have the time to train the model many different times with different numbers of hidden units or different sizes of weight penalty. What you do is start with small weights, and as the model trains, they grow. You watch the performance on the validation set, and as soon as it starts to get worse, you stop training.

9:00

Now, the performance on the validation set may fluctuate, particularly if you're measuring the error rate rather than a squared error or a cross-entropy error. So it's hard to decide when to stop, and what you typically do is keep going until you're sure things are getting worse, and then go back to the point at which things were best. The reason this controls the capacity of the model is that models with small weights generally don't have as much capacity: the weights haven't had time to grow big.
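The "keep going until you're sure, then go back" recipe can be sketched as follows (my own illustration on a made-up linear problem; a real network would replace the gradient line): track validation error every step, remember the best weights seen, and stop once the error has failed to improve for a while.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy problem: a few real regularities, plus sampling noise to overfit.
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=1.0, size=200)
Xtr, ytr, Xva, yva = X[:100], y[:100], X[100:], y[100:]

w = np.zeros(20)  # start with small weights, as the lecture says
best_w, best_err, patience = w.copy(), np.inf, 0
for step in range(2000):
    grad = Xtr.T @ (Xtr @ w - ytr) / len(ytr)
    w -= 0.05 * grad
    val_err = np.mean((Xva @ w - yva) ** 2)
    if val_err < best_err:
        best_err, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience > 50:  # sure things are getting worse: stop
            break
# best_w is the "go back to the point where things were best" model.
```

The `patience` counter is the tolerance for fluctuation: we only stop after the validation error has been worse than its best value for 50 consecutive steps.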

9:37

So consider a model with some input units, some hidden units, and some output units. When the weights are very small, if the hidden units are logistic units, their total inputs will be close to zero, and they'll be in the middle of their linear range. That is, they'll behave very like linear units. What that means is that when the weights are small, the whole network is the same as a linear network that maps the inputs straight to the outputs. So if you multiply the weight matrix W1 by the weight matrix W2, you'll get a weight matrix that you can use to connect the inputs directly to the outputs. Provided the weights are small, a net with a layer of logistic hidden units will behave pretty much the same as that linear net, provided we also divide the weights in the linear net by four, which takes into account the fact that the hidden units, in that linear region, have a slope of a quarter. So it's got no more capacity than the linear net: even though the network I'm showing you has 3×6 + 6×2 weights, it's really got no more capacity than a network with 3×2 weights. That changes as the weights grow: we start using the non-linear region of the sigmoid, and then we start making use of all those parameters.

11:06

So if the network behaves as if it has six weights at the beginning of learning and has 30 weights at the end of learning, then we can think of the capacity as changing smoothly from six parameters to 30 parameters as the weights get bigger. And what's happening in early stopping is that we're stopping the learning when it has the right number of parameters to do as well as possible on the validation data. That is, when it's optimized the trade-off between fitting the true regularities in the data and fitting the spurious regularities that are just there because of the particular training examples we chose.
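The linearization argument can be checked numerically. In this sketch (my own illustration, with tiny random weights), a 3-6-2 logistic net agrees almost exactly with the linear net obtained by multiplying the two weight matrices and dividing by four, because near zero the sigmoid is approximately 0.5 + z/4.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A 3-6-2 net with tiny weights: 3*6 + 6*2 = 30 weights in total.
W1 = 0.001 * rng.normal(size=(3, 6))  # input -> hidden
W2 = 0.001 * rng.normal(size=(6, 2))  # hidden -> output

x = rng.normal(size=(5, 3))  # a batch of 5 input vectors
net_out = sigmoid(x @ W1) @ W2

# Near zero, sigmoid(z) ~= 0.5 + z/4, so the hidden layer is effectively
# linear with slope 1/4: the net reduces to x @ (W1 @ W2) / 4 plus a bias.
linear_out = (0.5 + (x @ W1) / 4.0) @ W2

max_gap = np.max(np.abs(net_out - linear_out))
```

While the weights stay small, the 30-weight net really does carry no more capacity than the 3×2 linear map `W1 @ W2 / 4`; the gap only opens up as the weights grow into the sigmoid's non-linear region.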
