Model evaluation tells us how our model performs in the real world.

In the previous module, we talked about in-sample evaluation.

In-sample evaluation tells us how well our model fits the data already given to train it.

It does not give us an estimate of how well the trained model can predict new data.

The solution is to split our data up, using the in-sample data, or training data, to train the model.

The rest of the data, called test data, is used as out-of-sample data.

This data is then used to approximate how the model performs in the real world.

Separating data into training and testing sets is an important part of model evaluation.

We use the test data to get an idea of how our model will perform in the real world.

When we split a dataset, usually the larger portion of the data is used for training and a smaller part is used for testing.

For example, we can use 70 percent of the data for training and 30 percent for testing.

We use the training set to build a model and discover predictive relationships.

We then use the testing set to evaluate model performance.

When we have completed testing our model, we should use all the data to train the model.

A popular function in the scikit-learn package for splitting datasets is the train_test_split function.

This function randomly splits a dataset into training and testing subsets.

In the example code snippet, this function is imported from sklearn.model_selection.

The input parameter y_data is the target variable; in the car appraisal example, it would be the price.

The input parameter x_data is the list of predictor variables; in this case, it would be all the other variables in the car dataset that we are using to try to predict the price.

The output is four subsets: x_train and y_train, the subsets for training, and x_test and y_test, the subsets for testing.

The test_size parameter is the proportion of the data reserved for the testing set; here, it is 30 percent.

The random_state parameter is a random seed that makes the random dataset splitting reproducible.
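As a sketch of that usage (the x_data and y_data arrays below are synthetic placeholders, not the course's actual car dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the car dataset's predictors and price target.
x_data = np.random.rand(100, 3)   # predictor variables
y_data = np.random.rand(100)      # target variable (e.g., price)

# Reserve 30% of the rows for testing; random_state fixes the random seed
# so the same split is reproduced on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.30, random_state=0)

print(x_train.shape, x_test.shape)  # (70, 3) (30, 3)
```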

Generalization error is a measure of how well our model does at predicting previously unseen data.

The error we obtain using our testing data is an approximation of this error.

This figure shows the distribution of the actual values, in red, compared to the predicted values from a linear regression, in blue.

We see the distributions are somewhat similar.

If we generate the same plot using the test data, we see the distributions are relatively different.

The difference is due to generalization error and represents what we would see in the real world.

Using a lot of data for training gives us an accurate means of determining how well our model will perform in the real world, but the precision of the performance estimate will be low.

Let's clarify this with an example.

The center of this bull's eye represents the correct generalization error.

Let's say we take a random sample of the data, using 90 percent of the data for training and 10 percent for testing.

The first time we run the experiment, we get a good estimate of the generalization error.

If we experiment again, training the model with a different combination of samples, we also get a good result, but the result will be different relative to the first experiment.

Repeating the experiment again with a different combination of training and testing samples, the results are relatively close to the generalization error, but distinct from each other.

Repeating the process, we get a good approximation of the generalization error, but the precision is poor, i.e., the results are quite different from one another.

If we use fewer data points to train the model and more to test it, the accuracy of the generalization estimate will be lower, but the estimate will have good precision.

The figure above demonstrates this: all our error estimates are relatively close together, but they are further away from the true generalization performance.

To overcome this problem, we use cross-validation.

Cross-validation is one of the most common out-of-sample evaluation techniques.

In this method, the dataset is split into K equal groups; each group is referred to as a fold.

For example, we might use four folds.

Some of the folds are used as a training set, which we use to train the model, and the remaining fold is used as a test set, which we use to test the model.

For example, we can use three folds for training, then use one fold for testing.

This is repeated until each partition has been used for both training and testing.

At the end, we use the average of the results as the estimate of out-of-sample error.

The evaluation metric depends on the model; for example, we can use the R squared.
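The procedure can be sketched by hand with scikit-learn's KFold splitter (the synthetic data below is a placeholder):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
x = rng.normal(size=(120, 2))
y = x @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=120)

scores = []
kf = KFold(n_splits=4)  # four folds, as in the example above
for train_idx, test_idx in kf.split(x):
    # Train on three folds, test on the remaining one.
    model = LinearRegression().fit(x[train_idx], y[train_idx])
    scores.append(model.score(x[test_idx], y[test_idx]))  # R^2 on the held-out fold

# Each fold has served once as the test set; average the four R^2 scores
# to estimate the out-of-sample performance.
print(len(scores), np.mean(scores))
```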

The simplest way to apply cross-validation is to call the cross_val_score function, which performs multiple out-of-sample evaluations.

This function is imported from sklearn's model_selection package.

We then call the function cross_val_score.

The first input parameter is the type of model we are using for the cross-validation.

In this example, we initialize a linear regression model, or object, lr, which we pass to the cross_val_score function.

The other parameters are x_data, the predictor variable data, and y_data, the target variable data.

We can manage the number of partitions with the cv parameter; here, cv equals three, which means the dataset is split into three equal partitions.

The function returns an array of scores, one for each partition that was chosen as the testing set.

We can average the results together to estimate the out-of-sample R squared using the mean function from NumPy.
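A runnable sketch of that call (x_data and y_data here are synthetic placeholders for the car data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
x_data = rng.normal(size=(90, 2))  # placeholder predictor variables
y_data = x_data @ np.array([1.5, 0.5]) + rng.normal(scale=0.2, size=90)

lr = LinearRegression()
# cv=3 splits the data into three equal partitions; each score is the R^2
# obtained when that partition is held out as the test set.
scores = cross_val_score(lr, x_data, y_data, cv=3)
print(scores)            # one R^2 per partition
print(np.mean(scores))   # estimated out-of-sample R^2
```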

Let's look at an animation showing how the score array in the last slide is produced.

First, we split the data into three folds.

We use two folds for training and the remaining fold for testing.

The model will produce an output, and we will use the output to calculate a score.

In the case of the R squared, i.e., the coefficient of determination, we will store that value in an array.

We will repeat the process, using two folds for training and one fold for testing, and save the score.

We then use a different combination for training and the remaining fold for testing, and store the final result.

The cross_val_score function returns the score values that tell us the cross-validation result.

What if we want a little more information?

What if we want to know the actual predicted values supplied by our model before the R squared values are calculated?

To do this, we use the cross_val_predict function.

The input parameters are exactly the same as for the cross_val_score function, but the output is a prediction.
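A minimal sketch of the difference (again with synthetic placeholder data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
x_data = rng.normal(size=(90, 2))  # placeholder predictor variables
y_data = x_data @ np.array([1.0, 2.0]) + rng.normal(scale=0.2, size=90)

lr = LinearRegression()
# Same inputs as cross_val_score, but the output is the prediction each
# sample received while its fold was held out as the test set.
y_pred = cross_val_predict(lr, x_data, y_data, cv=3)
print(y_pred.shape)  # one prediction per sample: (90,)
```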

Let's illustrate the process.

First, we split the data into three folds.

We use two folds for training and the remaining fold for testing.

The model will produce an output, and we will store it in an array.

We will repeat the process, using two folds for training and one fold for testing.

The model produces an output again.

Finally, we use the last two folds for training, then use the remaining fold as the testing data.

This final testing fold produces an output.

These predictions are stored in an array.