
Hello and welcome to lesson two.

This lesson is going to explore the topic of cross-validation.

Cross-validation is a technique that can reduce the likelihood of overfitting.

So it's an important concept especially given the problems

overfitting can cause and the challenge in identifying overfitting.

So to get started in this lesson we're going to review overfitting.

We're going to also talk a little bit about

the bias-variance trade-off, which underlies the challenge of overfitting.

We're going to introduce cross-validation,

as well as learning and validation curves.

So by the end of this lesson, you should be able to explain what

that bias variance trade-off is and how it impacts model selection.

You're going to be able to talk about what

cross-validation is and how it can reduce the likelihood of overfitting.

And you should be able to apply cross-validation by using the scikit-learn library.

Now, there are two readings and a notebook for this particular lesson.

The first reading is talking about overfitting and what it is,

and how to identify it and overcome it.

So this is a nice little introductory review,

if you will of overfitting,

which we've talked about in a previous lesson,

but I want to make sure that we've covered it here as well.

Next, we're going to talk about the bias-variance tradeoff.

The issues of bias and variance have to do

with whether you're failing to capture the complexity in the data,

which is the bias term,

or whether you have too many fluctuations in your model and you're overfitting,

which is the variance term.

And this reading does a very nice job of explaining it graphically,

as well as mathematically and textually.

So it's a great article that talks about

this fundamental challenge in machine learning and data analysis.

Next, is our notebook.

This notebook is going to talk about cross-validation.

We're going to introduce the validation curve.

We're going to talk about different cross-validation techniques

and why you might use one versus the other.

And then we're going to end with a discussion of validation and learning curves.

First, we have our notebook set up code.

One thing that's different is I've set global font sizes for

the title and the x and y axis labels and a legend font size.

This is so that we don't have to continually do this in our plotting code.

So, what are we going to do?

First, we're going to talk a little bit about the bias-variance tradeoff.

To do this, we first make a signal,

then sample from the signal and add noise to those samples.

So the blue is the signal,

the light purple here are the observations which are signal plus noise.

We're then going to go ahead and fit this particular curve.

We're going to use a polynomial regressor.

Here are our terms, again just showing you how you apply polynomial features to make

a polynomial able to be fit to the data as a linear model.
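As a rough sketch of that idea (this is illustrative code, not the notebook's exact cell; the sine signal, noise level, and cubic degree are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sample a sine signal and add Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# PolynomialFeatures expands x into [1, x, x^2, x^3], so an ordinary
# linear model can fit a cubic curve: it stays linear in the weights.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x.reshape(-1, 1), y)
pred = model.predict([[np.pi / 2]])[0]
print(pred)  # roughly sin(pi/2) = 1, up to noise and cubic approximation error
```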

So then we're going to go ahead and do this.

We actually generate both an underfit and an overfit model.

For the underfit case, this purple curve here is the signal.

The light blue points are the training data,

the bigger red points are the test data.

So we're fitting curves using

the blue points and then we measure how well we did off of the red points,

and these are randomly selected from the original data.

You also see a first-order model here in this dashed line

and a third-order polynomial fit in this green dot-dashed line.

Both of these models don't do a very good job of capturing the variations of the data.

So we would say they underfit and are a high bias model.

On the other hand, we can also apply higher order polynomial terms.

These are shown in the green dashed line and the black dot-dashed line.

While these look like they fit nicely in parts of the parameter space,

they do a bad job particularly at the ends.

This is often what you will see in an overfit model,

right where the data that you're fitting to ends.

The model doesn't do a good job because it doesn't

know what the data really should be doing after the ends.

And so you get these fluctuations at the ends.

Neither of these models does a good job of

capturing the real signal in the data; they're high variance models.

So one thing we can do is measure something called a validation curve.

With the validation curve,

we're going to plot how well our model does.

So here's our R-squared score versus the complexity of the model.

And we do this for both the training and testing scores.

Normally, you might just say the training score tells me how well I'm doing.

Eventually, you see it starts to fall off.

This tells us that we are losing the ability to

capture the structure in the data.

And so we can see that after about 12th order,

we're not doing very well with this polynomial.

And remember, the ones I showed you

were out here, at 19th and 24th order.

Same thing with the testing scores.

The testing scores rise rapidly and they sort of

flatten off with some small fluctuations out here.

Again, we're not going to do very well.

Once we get out here to these higher-complexity models,

we're losing accuracy, and that's not something we want.

So the validation curve gives you a way of sort of

quantifying what's the right complexity of the model.

And if we look at this curve, you'd see anywhere between, say,

five or six and out here to, say, 15th order

gives you reasonable performance.

After that, we start to get a decrease.

So, in general, you again want to keep your model as simple

as possible, so we would probably choose values down here, six or seven,

unless we have a very good reason to use anything more complex.

So what about this? We can do cross-validation where,

rather than doing a single two-way split into training and test data,

we split the data into three parts.

And the reason we do this is simple.

If you split your data into a training data set and you

train a model and then you apply it to the test data set,

that test data has now been seen by the model.

If you then say, well, what happens if I change this hyperparameter and I retrain?

Say you do this as you are making a validation curve,

such as the one I just showed.

As the model is repeatedly retested,

you're effectively getting what's known as data leakage, where

information about the test data is being used to select the final model.

That means your test data is no longer giving you an accurate measure

of how well that model is performing and it can lead to overfitting.

So, what we'll do is create three data sets.

One for training, one for validating

the hyperparameters we use for our models, and then

one test data set that's held out until

the very end to give the final performance of our model.

And it's very important to make sure that last data set is only

used once for the final performance metrics.
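A minimal sketch of that three-way split, using two calls to scikit-learn's train_test_split (the toy arrays and split fractions here are just for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples
y = np.arange(10)

# First carve off the held-out test set (used exactly once, at the end).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```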

So the way you can split this data,

there's many different ways,

and that's what leads to

the different types of cross-validation that we're going to look at.

We can do a KFold, where we'll split our data into K folds.

So say we have five folds.

We would train on four folds and then validate on the remaining fold.

Now here's the trick, with cross-validation,

we then change which four folds we

use and this allows us to have five different tests done.

And we can combine all of those together to get what we

think the best model is for that particular data.

Any time we see the word stratified in the name of a cross-validation,

it means we're preserving the relative ratios of the labelled classes within each fold.

This can be important if you have

different numbers of training data for different classes.

When we did the iris dataset, we had 50 in each class.

But imagine you had 100 of one and 25 of the other two,

when you break that up you would want to maintain those ratios in each Fold.

And the way to do that, is to use the stratified KFold.

You can also use GroupKFold.

GroupKFold is similar, except that now

we only use one group within each validation fold.

Then there's LeaveOneOut, which is similar to KFold, but you leave a single observation

out to validate the model trained on the remaining data,

doing this once for each observation.

LeavePOut, same idea except you leave P,

so instead of one, it's P observations.

And then lastly, a shuffle split, where you just generate

a user-defined number of training/validation data sets.

So these ideas may be a little alien to you,

so what we're going to do is generate some data,

a 10 element array,

and we're going to use that array to demonstrate these different techniques.

So remember, we had zero,

one, two, three, four, five, six, seven, eight,

nine. We can do a KFold and

split it five times.

We can then show what our data is.

So with this, we split it up into five.

That means that each item is going to be used once for testing in our KFold.

So you can see the first thing we do,

is we take zero and one out and then,

our training data is two through nine,

our test is zero one.

The next fold's training data is zero,

one, four, five, six, seven, eight, nine.

The test was two, three, etc.

So you see, every element was used once in the test or,

in this case, actually validation data, and the others are all used in the training.

And this is a 5-Fold cross-validation iterator.
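The KFold demo just described can be reproduced with a few lines like these (a sketch, not the notebook's exact cell):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)

# Without shuffling, KFold uses consecutive chunks as the test fold,
# so each element lands in the test (validation) fold exactly once.
splits = list(KFold(n_splits=5).split(data))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```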

We can then show LeaveOneOut.

LeaveOneOut is a little more complex because here we're going to go through

each data point, use it once for test, and then the remaining data are all used for training.

Thus for 10 elements,

we have 10 essentially iterations through the data to get our training and test data.
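In code, that LeaveOneOut behavior looks roughly like this:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.arange(10)

# One split per observation: each element is held out once for
# validation while the other nine are used for training.
splits = list(LeaveOneOut().split(data))
print(len(splits))  # 10
for train_idx, test_idx in splits[:2]:
    print("train:", train_idx, "test:", test_idx)
```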

LeavePOut is a little more complex than that because now

it's a combinatorial problem, so we'll be limiting ourselves to five elements because,

otherwise, it's a huge number and it will fill the screen.

So we're going to say, let's just use the first five.

What do we do? Well, we can start here.

We can take zero and then there's four ways to choose the next number.

Then we can choose one and there's only three ways to choose the next number.

Then there's two and there's two ways to choose the next number,

then there's three and there's only one.

So we choose our test data set,

the remaining data are used as our training and you could

see here there's 10 different ways to

build this data with a LeaveTwoOut cross-validator.
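A sketch of that LeavePOut demo with p=2 on five elements:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

data = np.arange(5)

# Every combination of p=2 held-out elements: C(5, 2) = 10 splits,
# matching the 4 + 3 + 2 + 1 counting argument above.
splits = list(LeavePOut(p=2).split(data))
print(len(splits))  # 10
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```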

Then we could demonstrate shuffle split.

We're going to say we want, in this case, ten splits;

our test size is 0.2, so that's two items removed for testing.

And notice this, we actually reuse points.

So here, five was used three times.

You can see that nine was used three times,

eight was used three times,

one is only used twice, etc.

So this randomly pulls out two items,

uses the others for training and we do it however many times we said.
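A sketch of that ShuffleSplit demo (the random_state here is an assumption, so the exact items drawn will differ from the lecture's):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.arange(10)

# Ten independent random splits; unlike KFold, the same element can
# appear in the validation set of several different splits.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
splits = list(ss.split(data))
for train_idx, test_idx in splits:
    print("test:", test_idx)
```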

Stratified, as I said before, maintains those class label ratios.

So here, we actually create a class label, ones and zeros.

And you can see that when we create the folds,

here we pull out four items, and it kept three,

five, seven, eight, and nine.

So we had both zeros and ones selected, and the rest go with that as well.
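A sketch of StratifiedKFold with made-up class labels (five zeros and five ones, which may differ from the notebook's labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)  # hypothetical class labels

# Each validation fold keeps the 50/50 class ratio of the full data.
splits = list(StratifiedKFold(n_splits=5).split(X, y))
for train_idx, test_idx in splits:
    print("test:", test_idx, "classes:", y[test_idx])
```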

GroupKFold, any time we have a group,

we have to give a grouping to the data so we do that.

We say, here's our data;

we went out to 15 elements now,

zero through 14 inclusive.

We then have three group labels:

10, 12, and 11.

And when we break the data up,

it breaks it up such that only one group is used at each test.

So you can see right there is zero, there's two,

there's three, there's four,

there's six, there's seven, there's nine.

That was all the tens.

And then all the 11's and all the 12's.
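A sketch of that GroupKFold demo; the group labels below are hypothetical, just arranged so there are three groups across 15 samples:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(15).reshape(-1, 1)
# Hypothetical group labels: each sample belongs to group 10, 11, or 12.
groups = np.array([10, 11, 10, 10, 10, 12, 10, 10, 11, 10, 12, 11, 12, 11, 12])

# With n_splits equal to the number of distinct groups, each
# validation fold contains exactly one whole group.
splits = list(GroupKFold(n_splits=3).split(X, groups=groups))
for train_idx, test_idx in splits:
    print("test groups:", sorted(set(groups[test_idx])))
```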

So hopefully this has given you all the information

you need to understand these cross-validators.

We demonstrate how to use them primarily later on,

but we also make the validation curves here.

To do that, we're going to use something called cross_val_score, which is going to use

a StandardScaler and an SVC here.

We're going to put it into a pipeline.

This then is going to run a 5-Fold cross-validation by default.

If we put in a different cross-validation technique, we would put that in here.

And it's going to show us the different results, and

we can compute the mean and standard deviation.

So we ran five different scaling and SVC algorithms on that data and took the mean.
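A minimal sketch of that cross_val_score pipeline, assuming the iris data set (mentioned earlier in the lesson) as the example data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# cv defaults to 5-fold (stratified for classifiers), so the whole
# pipeline is fit and scored five times on rotating splits.
scores = cross_val_score(pipe, X, y)
print(scores.mean(), scores.std())
```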

We can actually use different metrics with that.

We can use the F1 score,

and we can use a different cross-validator.

Here we're going to use a shuffle split, and we're going to get the same sort of thing.
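Sketching that variation: the same pipeline, but with a macro-averaged F1 score and a ShuffleSplit cross-validator (the iris data and the specific split settings are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# Swap in a different metric and a different cross-validator.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```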

We can then use this to build a validation curve,

which shows us, as a parameter changes,

how the score changes, and you can see this for the training scores,

as well as the cross-validation scores.

And this shows you, right in here,

is a good value for that gamma parameter.
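A sketch of building that curve with scikit-learn's validation_curve helper; the SVC estimator, the iris data, and the gamma range are assumptions based on the parameter the lecture names:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
gammas = np.logspace(-4, 2, 7)  # candidate values for gamma

# For each candidate gamma, score the SVC on the training folds and
# the cross-validation folds (5-fold by default).
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=gammas, cv=5)
print(train_scores.shape, valid_scores.shape)  # one row per gamma, one column per fold
```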

We also have learning curves which are similar,

except they talk about the impact of the training data size on the model performance.

So again, same idea:

as the training size increases, we plot our score for

our cross-validator and our training data, and you can

see that it jumps up and then starts leveling off,

with very small changes after that.

So it shows you that once you get up to about half the data size,

this is doing reasonably well.
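A sketch of the corresponding learning_curve call (the iris data and the size grid are assumptions; shuffle=True avoids single-class training subsets at small sizes):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score the model at increasing training-set sizes to see where
# adding more data stops helping.
sizes, train_scores, valid_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    shuffle=True, random_state=0)
print(sizes)  # absolute number of training samples at each step
```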

So, we've gone through a lot of ideas here.

I realize that teaching you the learning curve,

validation curves, and cross-validation, all of that's a lot.

Do go through this carefully,

it's a very important concept.

If you have questions let us know and good luck.