By now, you've seen a couple of different learning algorithms: linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain what this overfitting problem is, and in the next few videos after this, we'll talk about a technique called regularization that will allow us to ameliorate, or reduce, the overfitting problem and get these learning algorithms to work much better.

So what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or flatten out, as we move to the right, and so this algorithm does not fit the training data well. We call this problem underfitting, and another term for it is that this algorithm has high bias. Both of these roughly mean that it's just not fitting the training data very well. The term is kind of a historical or technical one, but the idea is that if we fit a straight line to the data, it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. Despite the evidence to the contrary, this preconception, or bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.

Now, in the middle, we could fit a quadratic function, and with this data set, if we fit the quadratic function, maybe we get that kind of curve, and that works pretty well. And at the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. So here we have five parameters, theta zero through theta four, and with that we can actually fit a curve that passes through all five of our training examples. You might get a curve that looks like this.

Â 2:57

The term high variance is another historical or technical one. But the intuition is that if we're fitting such a high-order polynomial, then the hypothesis can fit almost any function, and this space of possible hypotheses is just too large, too variable. And we don't have enough data to constrain it to give us a good hypothesis, so that's called overfitting. And in the middle, there isn't really a name, but I'm just going to write "just right", where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data.
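To make the three cases concrete, here is a small sketch in NumPy. The five (size, price) points are made up for illustration, not the lecture's actual data; the point is that a degree-4 polynomial has enough parameters to pass through all five training examples, driving the training error to essentially zero, while degree 1 underfits and degree 2 is "just right":

```python
import numpy as np

# Five made-up (size, price) training examples with the plateau shape
# described in the lecture.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([1.5, 2.6, 3.2, 3.5, 3.6])

# Fit polynomials of degree 1 (underfit), 2 ("just right"), and 4 (overfit).
for degree in [1, 2, 4]:
    coeffs = np.polyfit(sizes, prices, degree)          # least-squares fit
    predictions = np.polyval(coeffs, sizes)
    train_error = np.mean((predictions - prices) ** 2)  # mean squared error
    print(f"degree {degree}: training error = {train_error:.6f}")
```

With five parameters and five examples, the degree-4 fit interpolates the data exactly (training error at floating-point noise level), which is precisely the overfitting behavior described above.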

To recap a bit: the problem of overfitting arises when we have too many features. The learned hypothesis may fit the training set very well, so your cost function may actually be very close to zero, or maybe even exactly zero, but you may then end up with a curve that tries too hard to fit the training set, so that it fails to generalize to new examples and fails to predict prices on new examples as well. Here, the term "generalize" refers to how well a hypothesis applies even to new examples, that is, to data, to houses, that it has not seen in the training set.
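One way to see this generalization failure numerically is to hold out some examples that the model never sees during fitting and compare training error against held-out error. This is a sketch with made-up data (a plateauing price curve plus noise), not the lecture's dataset; the high-degree fit gets a tiny training error but that does not carry over to the held-out houses:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Made-up housing data: prices plateau as size grows, plus a little noise.
sizes = np.linspace(1.0, 5.0, 30)
prices = np.log1p(sizes) + rng.normal(scale=0.05, size=sizes.shape)

# Hold out half of the examples; the model never sees them while fitting.
train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)

for degree in [1, 2, 12]:
    # Polynomial.fit rescales the domain internally, which keeps the
    # high-degree least-squares problem well conditioned.
    p = Polynomial.fit(sizes[train], prices[train], degree)
    train_err = np.mean((p(sizes[train]) - prices[train]) ** 2)
    test_err = np.mean((p(sizes[test]) - prices[test]) ** 2)
    print(f"degree {degree:2d}: train error = {train_err:.4f}, "
          f"held-out error = {test_err:.4f}")
```

Training error only goes down as the degree grows, which is why a near-zero cost on the training set alone cannot tell you whether the hypothesis will generalize.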

On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here is a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is my sigmoid function. And if you do that, you end up with a hypothesis trying to use, maybe, just a straight line to separate the positive and the negative examples. And this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of the hypothesis having high bias. In contrast, if you were to add these quadratic terms to your features, then you could get a decision boundary that might look more like this. And that's a pretty good fit to the data. Probably about as good as we could get on this training set.
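Here is a minimal sketch of that contrast, using plain batch gradient descent on the logistic loss. The data set (points labeled by a circular boundary) and the training settings are my own illustration, not from the lecture; the point is that a straight-line hypothesis underfits this data, while adding the quadratic terms x1², x1·x2, and x2² lets the decision boundary curve:

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -30.0, 30.0)  # avoid overflow in exp for saturated inputs
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Plain batch gradient descent on the logistic-regression loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(200, 2))                     # features x1, x2
y = (x[:, 0] ** 2 + x[:, 1] ** 2 < 0.5).astype(float)     # circular boundary

# Linear hypothesis g(theta0 + theta1*x1 + theta2*x2): a straight-line
# boundary, which cannot separate a circle -- underfitting / high bias.
X_lin = np.column_stack([np.ones(len(x)), x])
theta_lin = fit_logistic(X_lin, y)
acc_lin = np.mean((X_lin @ theta_lin >= 0).astype(float) == y)

# Add quadratic terms: the boundary can now be an ellipse.
X_quad = np.column_stack([X_lin, x[:, 0] ** 2, x[:, 0] * x[:, 1], x[:, 1] ** 2])
theta_quad = fit_logistic(X_quad, y)
acc_quad = np.mean((X_quad @ theta_quad >= 0).astype(float) == y)

print(f"linear features:    training accuracy = {acc_lin:.2f}")
print(f"quadratic features: training accuracy = {acc_quad:.2f}")
```

Note that sigmoid(z) ≥ 0.5 exactly when z ≥ 0, which is why the predicted class is just the sign of the linear combination.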

And finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms as features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And if the features x1 and x2 are for predicting, say, whether breast tumors are malignant or benign, this really doesn't look like a very good hypothesis for making predictions. So, once again, this is an instance of overfitting, of a hypothesis having high variance, and it is unlikely to generalize well to new examples.

Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting, and also when underfitting, may be occurring. But for now, let's talk about the problem of, if we think overfitting is occurring, what can we do to address it?

In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree polynomial. So, earlier, for the housing prices example, we could just plot the hypothesis and maybe see that it was fitting the sort of very wiggly function that goes all over the place to predict housing prices. We could then use figures like these to select an appropriate degree polynomial. So plotting the hypothesis could be one way to try to decide what degree polynomial to use. But that doesn't always work. In fact, more often we may have learning problems where we just have a lot of features, and it is not just a matter of selecting what degree polynomial to use. And, in fact, when we have so many features, it also becomes much harder to plot the data, and much harder to visualize it, to decide which features to keep or not. So, concretely, if we're trying to predict housing prices, sometimes we can just have a lot of different features, and all of these features seem, maybe, kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem.
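A quick sketch of why many features plus little data is dangerous (all numbers here are made up for illustration): when the number of parameters matches the number of training examples, ordinary least squares can drive the training error to essentially zero even when the "prices" are pure noise with no real structure at all.

```python
import numpy as np

rng = np.random.default_rng(2)

# 12 training examples and 12 parameters (intercept + 11 features):
# the model can simply memorize the training set.
n_examples, n_features = 12, 11
X = np.column_stack([np.ones(n_examples),
                     rng.normal(size=(n_examples, n_features))])
prices = rng.normal(size=n_examples)  # targets with no real structure

# Ordinary least squares: with a square, full-rank design matrix this
# solves the system exactly.
theta, *_ = np.linalg.lstsq(X, prices, rcond=None)
train_error = np.mean((X @ theta - prices) ** 2)
print(f"training error with {n_features} features: {train_error:.2e}")
```

A near-zero training error here obviously says nothing about predicting new houses; the fit is pure memorization of noise.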

In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. Concretely, one thing we could do is manually look through the list of features and use that to try to decide which are the more important features, and therefore which features we should keep and which we should throw out. Later in this course, we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out. This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection, we'll go into this in much greater depth. But the disadvantage is that, by throwing away some of the features, you are also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information, or some of our features, away.
