0:15

Hello and welcome to lesson two.

This lesson is going to explore the topic of cross-validation. Cross-validation is a technique that can reduce the likelihood of overfitting, so it's an important concept, especially given the problems overfitting can cause and the challenge of identifying it.

To get started in this lesson, we're going to review overfitting. We're also going to talk a little bit about the bias-variance trade-off, which underlies the challenge of overfitting. We're going to introduce cross-validation, as well as learning and validation curves.

By the end of this lesson, you should be able to explain what the bias-variance trade-off is and how it impacts model selection. You'll be able to talk about what cross-validation is and how it can reduce the likelihood of overfitting. And you should be able to apply cross-validation by using the scikit-learn library.

Now, there are two readings and a notebook for this particular lesson. The first reading talks about overfitting: what it is, how to identify it, and how to overcome it. It's a nice introductory review, if you will, of overfitting, which we've talked about in a previous lesson, but I want to make sure we've covered it here as well.

Next, we're going to talk about the bias-variance tradeoff. The issues of bias and variance have to do with whether you're capturing the model complexity, which is the bias term, or whether you have too many fluctuations in your model and you're overfitting, which is the variance term. This reading does a very nice job of explaining it graphically, mathematically, and textually. It's a great article about this fundamental challenge in machine learning and data analysis.

Next is our notebook. This notebook is going to talk about cross-validation. We're going to introduce the validation curve, talk about different cross-validation techniques and why you might use one versus another, and end with a discussion of validation and learning curves.

First, we have our notebook setup code. One thing that's different is that I've set global font sizes for the title, the x and y axis labels, and the legend. This is so that we don't have to keep repeating this in our plotting code.
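
As a rough sketch, a setup cell like that might use matplotlib's `rcParams` (the specific sizes here are my own choices, not necessarily the notebook's):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt

# Set global font sizes once, instead of repeating them in every plot.
plt.rcParams['axes.titlesize'] = 20    # plot titles
plt.rcParams['axes.labelsize'] = 16    # x and y axis labels
plt.rcParams['legend.fontsize'] = 14   # legend text
```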

So, what are we going to do? First, we're going to talk a little bit about the bias-variance tradeoff. To do this, we first make a signal, sample from that signal, and add noise to the samples. The blue curve is the signal; the light purple points are the observations, which are signal plus noise. We're then going to fit this particular curve using a polynomial regressor. Here are our terms, again just showing you how you apply polynomial features so that a polynomial can be fit to the data as a linear model.
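
A minimal sketch of that idea, using a toy noisy sine wave standing in for the lesson's signal (the actual signal and degree in the notebook may differ):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data standing in for the notebook's noisy signal (assumed).
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# PolynomialFeatures expands x into [1, x, x^2, x^3], so an ordinary
# linear model can fit a polynomial curve.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x.reshape(-1, 1), y)
print(model.score(x.reshape(-1, 1), y))  # R-squared on the training data
```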

So then we're going to go ahead and do this. We actually generate both an underfit and an overfit model. In the underfit plot, the purple curve is the signal, the light blue points are the training data, and the bigger red points are the test data. We fit curves using the blue points and then measure how well we did on the red points, which are randomly selected from the original data. You also see a first-order model in the dashed line and a third-order polynomial fit in the green dot-dash line. Neither of these models does a very good job of capturing the variations in the data, so we would say they underfit and are high-bias models.

On the other hand, we can also apply higher-order polynomial terms. These are shown in the green dashed line and the black dot-dash line. While these look like they fit nicely in parts of the parameter space, they do a bad job, particularly at the ends. This is often what you will see in an overfit model: where the data you're fitting to ends, the model doesn't do a good job because it doesn't know what the data really should be doing beyond the ends, and so you get these fluctuations there. Neither of these models does a good job of capturing the real signal in the data; they're high variance.

So one thing we can do is measure something called a validation curve. With the validation curve, we plot how well our model does, here the R-squared score, versus the complexity of the model, and we do this for both the training and testing scores. Normally, you might just say the training data tells me how well I'm doing. Eventually, though, it starts to fall off. This tells us that we are losing the ability to capture the structure in the data. We can see that after about 12th order, we're not doing very well with this polynomial. And remember, the ones I showed you were out here, at orders 19 and 24. Same thing with the testing scores: they rise rapidly and then sort of flatten off with some small fluctuations out here. Again, once we get out to these higher-complexity models, we're losing accuracy, and that's not something we want.

So the validation curve gives you a way of quantifying the right complexity for the model. If we look at this curve, anywhere between, say, fifth or sixth order out to about 15th gives you reasonable performance. After that, performance starts to decrease. In general, you want to keep your model as simple as possible, so we would probably choose values of six or seven, unless we have a very good reason to choose anything more complex.

So what about this? We can do cross-validation where, rather than doing a two-fold split of training and test data, we split the data into three parts. And the reason we do this is simple. If you split your data into a training set, train a model, and then apply it to the test set, that test data has now been seen by the model. If you then ask, well, what happens if I change this hyperparameter and retrain? Say you do this as you are making a validation curve, such as the one I just showed. As the model is retested, you're effectively getting what's known as data leakage, where information about the test data is being used to select the final model. That means your test data is no longer giving you an accurate measure of how well the model is performing, and it can lead to overfitting.

So, what we'll do is create three data sets: one for training, one for validating the hyperparameters we use for our models, and then the test data set, which is held out until the very end to give the final performance of our model. It's very important to make sure that last data set is only used once, for the final performance metrics.
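
One common way to sketch that three-way split is to call `train_test_split` twice (the sizes and data here are illustrative, not the lesson's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
y = np.arange(10)

# First carve off the held-out test set (used only once, at the end)...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split what remains into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```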

There are many different ways to split this data, and that's what leads to the different types of cross-validation we're going to look at. We can do a KFold, where we split our data into K folds. Say we have five folds: we would train on four folds and then validate on the remaining fold. Now here's the trick with cross-validation: we then change which four folds we use, and this gives us five different tests. We can combine all of those together to get what we think is the best model for that particular data.

Any time we see the word "stratified" in the name of a cross-validator, it means we're preserving the relative ratios of the labelled classes within each fold. This can be important if you have different numbers of training samples for different classes. With the iris dataset we had 50 in each class, but imagine you had 100 of one class and 25 of each of the other two: when you break that up, you would want to maintain those ratios in each fold. The way to do that is to use StratifiedKFold.

You can also use GroupKFold. Group is similar, except that now each group appears in only one fold. Then there's LeaveOneOut, which is similar to KFold but you leave a single observation out to validate the model trained on the remaining data. LeavePOut is the same idea, except instead of one, you leave out P observations. And lastly, ShuffleSplit, where you just generate a user-defined number of training/validation splits.

These ideas may be a little alien to you, so what we're going to do is generate some data, a 10-element array, and use that array to demonstrate these different techniques. So remember, we have zero through nine. We can do a KFold, split it five times, and then show what our data is. With five splits, each item is used exactly once for testing. You can see the first thing we do is take zero and one out: our training data is two through nine, and our test data is zero and one. In the next fold, the training data is zero, one, four, five, six, seven, eight, nine, and the test data is two and three, and so on. So every element is used once in the test (or, in this case, really validation) data, and the others are all used in the training data. This is a 5-fold cross-validation iterator.
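
A sketch of that demonstration (the notebook's exact printing code may differ, but `KFold` on a 10-element array behaves this way):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # the 10-element array: 0 through 9

# With n_splits=5, each fold holds out 2 consecutive elements for
# validation and trains on the other 8.
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(data):
    print("train:", train_idx, "test:", test_idx)
```

The first line printed shows training indices 2 through 9 and test indices 0 and 1, matching the walkthrough above.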

We can then show LeaveOneOut. Here we go through the data, using each observation once for testing while the remaining data are all used for training. Thus, for 10 elements, we have essentially 10 iterations through the data to get our training and test sets.

LeavePOut is a little more complex than that because now it's a combinatorial problem, so we'll limit ourselves to five elements; otherwise, it's a huge number and would fill the screen. So we're going to say, let's just use the first five. What do we do? Well, we can start with zero, and there are four ways to choose the next number. Then we can choose one, and there are only three ways to choose the next number. Then there's two, with two ways to choose the next number, and then three, with only one. So we choose our test data set, the remaining data are used as our training data, and you can see there are 10 different ways to build this data with a leave-two-out cross-validator.
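
The counting above (4 + 3 + 2 + 1 = 10 pairs) can be checked directly; a sketch of both iterators on the first five elements:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

data = np.arange(5)  # first five elements, as in the walkthrough

# LeaveOneOut: one split per observation, so 5 splits here.
loo = LeaveOneOut()
print(loo.get_n_splits(data))  # 5

# LeavePOut with p=2: every pair is held out once, C(5, 2) = 10 splits.
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(data))  # 10
for train_idx, test_idx in lpo.split(data):
    print("train:", train_idx, "test:", test_idx)
```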

Then we demonstrate ShuffleSplit. We're going to say we want ten splits in this case, and our test size is 0.2, so that's two items removed for testing. And notice this: we actually reuse points. Here, five was used three times, nine was used three times, eight was used three times, one was only used twice, and so on. So this randomly pulls out two items, uses the others for training, and repeats however many times we specified.
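
A sketch of that (the random seed is my own, so the exact items drawn will differ from the lesson's output):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.arange(10)

# 10 random splits, each holding out 20% (2 items) for validation.
# Unlike KFold, items can reappear in the test set across splits.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(data):
    print("test:", test_idx)
```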

StratifiedKFold maintains, as I said before, those class ratios. Here we actually create class labels, ones and zeros. You can see that when we create the folds, here we pull out items such as three, five, seven, eight, and nine, so both zeros and ones are selected, and the rest go to the training data accordingly.
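
A simplified sketch of that idea, with a made-up 50/50 labelling (not the exact labels used in the lesson), showing that every test fold contains both classes:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

data = np.arange(10)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # illustrative labels

# Each test fold preserves the 50/50 class ratio: with n_splits=5,
# every fold's test set contains one sample from each class.
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(data, labels):
    print("test:", test_idx, "labels:", labels[test_idx])
```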

For GroupKFold, any time we have groups, we have to assign a grouping to the data, so we do that. Here's our data; we went out to 15 elements now, zero through 14 inclusive. We then have three groups, labelled 10, 11, and 12. When we break the data up, it breaks up such that only one group is used in each test set. You can see the test set right there is zero, two, three, four, six, seven, nine: that was all the tens. And then all the 11s and all the 12s.
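
A sketch with an invented group assignment (the lesson's actual assignment differs), showing that each test set contains exactly one group:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

data = np.arange(15)  # 0 through 14 inclusive
# Illustrative group labels 10, 11, and 12; not the lesson's assignment.
groups = np.array([10, 11, 12, 10, 10, 11, 10, 10, 12, 10, 11, 12, 11, 12, 11])

# With 3 splits and 3 groups, each group's samples all land in exactly
# one test fold, so no group is shared between training and validation.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(data, groups=groups):
    print("test groups:", set(groups[test_idx]))
```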

So hopefully this has given you all the information you need to understand these cross-validators. We demonstrate how to use them a lot more later on, but we also make the validation curves. To do that, we're going to use something called cross_val_score, together with a StandardScaler and a classifier, which we put into a pipeline. This then runs a 5-fold cross-validation by default; if we wanted a different cross-validation technique, we would pass it in here. It shows us the different results, and we can compute the mean and standard deviation. So we ran five different scaling-plus-SVC fits on that data and took the mean.
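
A sketch of that pattern, using iris as a stand-in dataset (the lesson's actual data may differ):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fit on each training
# fold; this avoids leaking test-fold statistics into the scaler.
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold CV by default
print(scores.mean(), scores.std())
```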

We can also use different metrics, such as the F1 score, and a different cross-validator. Here we use a ShuffleSplit and get the same kind of result.
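
For instance, a sketch swapping in an F1-based metric and a ShuffleSplit via the `scoring=` and `cv=` arguments (macro-averaged F1 here, since iris has three classes; the lesson's exact choices may differ):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, ShuffleSplit

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# Different metric and different cross-validator, same function call.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(pipe, X, y, scoring='f1_macro', cv=cv)
print(scores.mean())
```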

We can then use this to build a validation curve, which shows how the score changes as a parameter changes, for both the training data and the cross-validation scores. And this shows you, right in here, a good value for that gamma parameter.
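
A sketch of how such a curve might be computed with scikit-learn's `validation_curve` (the parameter range here is my own guess, not the lesson's):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# Sweep gamma and record training and cross-validation scores at each
# value; plotting the means against gamma gives the validation curve.
gamma_range = np.logspace(-4, 2, 7)
train_scores, valid_scores = validation_curve(
    pipe, X, y, param_name='svc__gamma', param_range=gamma_range, cv=5)
print(train_scores.shape, valid_scores.shape)  # one row per gamma value
```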

We also have learning curves, which are similar, except they show the impact of the training data size on model performance. So again, same idea: as the training size increases, we track the score for our cross-validation and our training data, and you can see that it jumps up and then starts leveling off, with very small changes after that. So it shows you that once you get up to about half the data size, this model is doing reasonably well.
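
A sketch with scikit-learn's `learning_curve`, again on iris as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())

# Score the model at increasing training-set sizes; plotting the mean
# scores against train_sizes gives the learning curve.
train_sizes, train_scores, valid_scores = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
print(train_sizes)  # actual number of training examples at each step
```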

So, we've gone through a lot of ideas here. I realize that teaching you learning curves, validation curves, and cross-validation all at once is a lot. Do go through this carefully; these are very important concepts. If you have questions, let us know, and good luck.
