0:00

In this video, I'm going to talk about the reason why we want to combine many models when we're making predictions. If we have a single model, we have to choose some capacity for it. If we choose too little capacity, it won't be able to fit the regularities in the training data. And if we choose too much capacity, it will be able to fit the sampling error in the particular training set we have. By using many models, we can actually get a better tradeoff between fitting the true regularities and overfitting the sampling error in the data. At the start of the video, I'll show you that when you average models together, you can expect to do better than any single model. This effect is largest when the models make very different predictions from each other. And at the end of the video, I'll discuss various ways in which we can encourage the different models to make very different predictions. As we've seen before, when we have a limited amount of training data, we tend to get overfitting. If we average the predictions of many different models, we can typically reduce that overfitting.

1:17

For regression, the squared error can be decomposed into a bias term and a variance term, and that allows us to analyze what's going on. The bias term is big if the model has too little capacity to fit the data. It measures how poorly the model approximates the true function. The variance term is big if the model has so much capacity that it's good at modeling the sampling error in our particular training set. It's called variance because, if we go and get another training set of the same size from the same distribution, our model will fit differently to that training set, because it has different sampling error. And so we'll get variance in the way the models fit to different training sets.

2:03

If we average models together, what we're doing is averaging away the variance, and that allows us to use individual models that have high capacity and therefore high variance. These high-capacity models typically have low bias. So we can get the low bias without incurring the high variance, by using averaging to get rid of the variance. So now let's try to analyze how an individual model compares with an average of models.

2:36

On any one test case, some individual predictors may be better than the combined predictor, and different individual predictors will be better on different cases. But if the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases. So we should aim to make the individual predictors disagree, without making them be poor predictors. The art is to have individual predictors that make very different errors from one another, but are each fairly accurate.

3:13

So now let's look at the math of what happens when we combine networks. We're going to compare two expected squared errors. The first expected squared error is the one we get if we pick one of the predictors at random and use that for making our predictions. What we do is average, over all predictors, the error we'd expect to get if we followed that policy. So y-bar is the average of what all the predictors say, and y_i is what an individual predictor says. That is, y-bar is just the expectation, over all the individual predictors i, of y_i, and I'm using angle brackets to represent an expectation, where the subscript after the angle brackets tells you what it's an expectation over. We can write the same thing as 1/N times the sum, over all N predictors, of the y_i.

4:09

Now, if we look at the expected squared error we'd get if we chose a predictor at random: what we'd have to do is compare that predictor with the target, take the squared difference, and then average that over all predictors. That's shown on the left-hand side there. If I simply add a y-bar and subtract a y-bar, I don't change the value, and now it's going to be easier to do some manipulations.

4:36

I can now multiply out that square, and inside the expectation brackets I have (t minus y-bar) squared, plus (y_i minus y-bar) squared, minus twice (t minus y-bar) times (y_i minus y-bar), and we'll see that this cross term disappears. The first term, (t minus y-bar) squared, doesn't have an i in it anymore, so we can drop the expectation brackets for that term: it really is just (t minus y-bar) squared. And that's the squared error you'd get if you compared the average of the models with the target. Our aim is to show that the thing on the left-hand side is bigger than that, i.e., that by using the average we've reduced the expected squared error. The extra term we have on the right-hand side is the expectation of (y_i minus y-bar) squared, and that's just the variance of the y_i: it's the expected squared difference between y_i and y-bar. And the last term disappears. It disappears because the difference of y_i from y-bar has zero mean, and we expect it to be uncorrelated with the difference between the target and the average of the networks. So we're multiplying together two things that are zero-mean and uncorrelated, and we expect to get zero on average.
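Written out as equations (in the angle-bracket notation just introduced), the decomposition being described is:

```latex
\left\langle (t - y_i)^2 \right\rangle_i
  = \left\langle \big((t - \bar{y}) - (y_i - \bar{y})\big)^2 \right\rangle_i
  = (t - \bar{y})^2
  + \left\langle (y_i - \bar{y})^2 \right\rangle_i
  - 2\,(t - \bar{y})\left\langle y_i - \bar{y} \right\rangle_i
```

and since $\langle y_i \rangle_i = \bar{y}$, the last term vanishes exactly, leaving

```latex
\left\langle (t - y_i)^2 \right\rangle_i
  = (t - \bar{y})^2 + \left\langle (y_i - \bar{y})^2 \right\rangle_i .
```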

So the result is that the expected squared error we get by picking a model at random exceeds the squared error we get by averaging the models, and it exceeds it by exactly the variance of the outputs of the models. That's how much we win by when we take an average. I want to show you that in a picture. Along the horizontal line, we have the possible values of the output, and in this case, all of the different models predict a value that is too high.
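The identity derived above can be checked numerically with a quick simulation (illustrative values of my own, not from the lecture): the expected error of a randomly picked predictor equals the error of the average plus the variance of the predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.0                                   # the target
y = t + 1.0 + rng.normal(0, 0.5, 1000)    # predictors that all guess too high
y_bar = y.mean()

error_random_pick = np.mean((t - y) ** 2)   # <(t - y_i)^2>_i
error_of_average = (t - y_bar) ** 2         # (t - y_bar)^2
variance = np.mean((y - y_bar) ** 2)        # <(y_i - y_bar)^2>_i

# Expected error of a random pick = error of the average + variance:
print(np.isclose(error_random_pick, error_of_average + variance))  # True
```

Because the identity is algebraic, it holds exactly for any set of predictions, not just on average.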

6:33

The predictors that are further than average from t make bigger-than-average squared errors, like that bad guy in red, and the predictors that are less than the average distance from t make smaller-than-average squared errors. The first effect dominates, because we're using squared error. So if you look at the math, let's suppose the good guy and the bad guy are equally far from the mean, at y-bar minus epsilon and y-bar plus epsilon. Then the average squared error they make is the mean of (y-bar minus epsilon minus t) squared and (y-bar plus epsilon minus t) squared. When we work that out, we get the squared error that the mean of the predictors makes, plus an epsilon squared. So we win by averaging the predictors before we compare them with the target. That's not always true.
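The good-guy/bad-guy arithmetic works out as claimed; here is a quick check with arbitrary example numbers (my own, chosen only for illustration):

```python
import math

# Two predictors sit at y_bar - eps and y_bar + eps, equally far from their mean.
y_bar, eps, t = 2.0, 0.5, 1.2
good, bad = y_bar - eps, y_bar + eps

# Their average squared error against the target t...
avg_sq_err = ((good - t) ** 2 + (bad - t) ** 2) / 2

# ...equals the squared error of the mean predictor, plus eps squared.
print(math.isclose(avg_sq_err, (y_bar - t) ** 2 + eps ** 2))  # True
```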

It depends very much on using a squared error. If, for example, you have a whole bunch of clocks and you try to make them more accurate by averaging them all, that'll be a disaster. It'll be a disaster because the noise you expect in clocks isn't Gaussian noise. What you expect is that many of them will be very slightly wrong, and a few of them will have stopped or will be wildly wrong. If you average, you make sure they are all significantly wrong, which is not what you want. The same thing applies to discrete distributions, such as when we output class-label probabilities.

8:13

Is it better to pick one model at random, or is it better to average the two probabilities and predict the average of p_i and p_j? Suppose the measure we use is the log probability of getting the right answer. Then the log of the average of p_i and p_j is going to be a better bet than the average of the log of p_i and the log of p_j. That's most easily seen in a diagram, because of the shape of the log function. So that black curve is the log.
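The same comparison can be checked in code, with hypothetical values for the two predicted probabilities of the correct class (0.2 and 0.8 are my own example numbers):

```python
import math

p_i, p_j = 0.2, 0.8

log_of_average = math.log((p_i + p_j) / 2)              # the blue dot
average_of_logs = (math.log(p_i) + math.log(p_j)) / 2   # mid-point of the gold line

# By the concavity of log (Jensen's inequality), averaging the
# probabilities first always scores at least as well:
print(log_of_average > average_of_logs)  # True
```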

On the horizontal axis, I've drawn p_i and p_j, and the gold-colored line joins log p_i to log p_j. You can see that if we first average p_i and p_j, to get the value at the blue arrow, and then compute the log, we get that blue dot. Whereas if we first take the log of p_i, and separately take the log of p_j, and then average those two logs, we get the mid-point of that gold line, which is below the blue dot. So to make this averaging be a big win, we want our predictors to differ by a lot, and there are many different ways to make them differ. You could just rely on a learning algorithm that doesn't work too well and gets stuck in a different local optimum each time. It's not a very intelligent thing to do, but it's worth a try. You could use lots of different kinds of models, including ones that are not neural networks.

So it makes sense to try decision trees, Gaussian process models, and support vector machines. I'm not explaining any of those in this course, but in Andrew Ng's machine learning course on Coursera you can learn about all those things. And you could try many other different kinds of model. If you really want to use a bunch of different neural-network models, you can make them different by using a different number of hidden layers, a different number of units per layer, or different types of unit. For example, in some nets you could use rectified linear units, and in other nets you could use logistic units. You could also use different types or strengths of weight penalty: you might use early stopping for some nets, an L2 weight penalty for others, and an L1 weight penalty for others.

10:42

You could use different learning algorithms. For example, you could use full-batch learning for some models and mini-batch learning for others, if your data set is small enough to allow that. You can also make the models differ by training them on different training data. There's a method introduced by Leo Breiman called bagging, where you train different models on different subsets of the data, and you get these subsets by sampling the training set with replacement. So if we sample from a training set that has examples A, B, C, D, and E, we get five examples again, but with some missing and some duplicated, and we train one of our models on that particular resampled training set. This is done in a method called random forests, which Leo Breiman was also involved in inventing, that uses bagging with decision trees. When you train decision trees with bagging and then average them together, they work much better than single decision trees by themselves. In fact, the Kinect uses random forests to convert information about depth into information about where your body parts are. We could use bagging with neural nets, but it's very expensive. If you wanted to train, say, twenty different neural nets this way, you'd have to create your twenty different training sets, and then it would take twenty times as long as training one net. That doesn't matter with decision trees, because they're so fast to train. Also, at test time, you'd have to run those twenty different nets. Again, with decision trees that doesn't matter, because they're so fast to use at test time.
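The bootstrap sampling step at the heart of bagging can be sketched as follows (`train_model` is a hypothetical placeholder for whatever learner you use; only the resampling is shown):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_sample(examples):
    # Draw n indices with replacement, so some examples are missing
    # and some are duplicated in each model's training set.
    n = len(examples)
    idx = rng.integers(0, n, size=n)
    return [examples[i] for i in idx]

training_set = ["A", "B", "C", "D", "E"]
for m in range(3):                    # e.g. three models in the ensemble
    subset = bootstrap_sample(training_set)
    print(subset)                     # e.g. ['B', 'E', 'E', 'A', 'C']
    # model_m = train_model(subset)   # hypothetical training call
```

Each model sees a same-sized but differently composed training set, which is what makes the resulting predictors disagree.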

Another method for making the training data different is to train each model on the whole training set, but to weight the cases differently. So in boosting, we typically use a sequence of fairly low-capacity models, and we weight the training cases differently for each model: we up-weight the cases the previous models got wrong, and we down-weight the cases the previous models got right. That way, the next model in the sequence doesn't waste its time trying to model cases that are already handled correctly; it uses its resources to try to deal with the cases the other models are getting wrong. An early use of boosting was with neural nets for MNIST, back when computers were much slower. One of the big advantages was that it focused the computational resources on modeling the tricky cases, and didn't waste a lot of time going over the easy cases again and again.
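The reweighting idea can be sketched with a simplified multiplicative update (an illustrative assumption on my part, not the exact scheme of any particular boosting algorithm such as AdaBoost):

```python
import numpy as np

def reweight(weights, got_wrong, factor=2.0):
    # Up-weight the cases the previous model got wrong,
    # down-weight the ones it got right, then renormalize.
    w = np.where(got_wrong, weights * factor, weights / factor)
    return w / w.sum()

weights = np.full(5, 0.2)                            # start uniform over 5 cases
got_wrong = np.array([True, False, False, True, False])
weights = reweight(weights, got_wrong)
print(weights)   # the two wrong cases now carry most of the weight
```

The next model in the sequence would be trained with these case weights, so its resources go to the examples the previous model mishandled.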
