
If you recall, in the last module we discussed polynomial regression. In this section, we will discuss how to pick the best polynomial order and the problems that arise when selecting the wrong order.

Consider the following function: we assume the training points come from a polynomial function plus some noise. The goal of model selection is to determine the order of the polynomial that provides the best estimate of the function y(x).

If we try to fit the data with a linear function, the line is not complex enough to fit the data, and as a result there are many errors. This is called underfitting: the model is too simple to fit the data.

If we increase the order of the polynomial, the model fits better, but it is still not flexible enough and exhibits underfitting.

This is an example of an 8th-order polynomial used to fit the data. We see that the model does well at fitting the data and estimating the function, even at the inflection points.

Increasing it to a 16th-order polynomial, the model does extremely well at tracking the training points but performs poorly at estimating the function. This is especially apparent where there is little training data: the estimated function oscillates instead of tracking it. This is called overfitting, where the model is too flexible and fits the noise rather than the function.
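As a rough sketch of both failure modes (using assumed synthetic data, not the lecture's plots), the code below samples a noisy cubic and measures how well polynomial fits of order 1, 3, and 16 recover the noise-free function:

```python
import numpy as np

# Assumed synthetic data: a cubic function plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y_true = x**3 - 2 * x                            # the underlying function
y = y_true + rng.normal(scale=3, size=x.shape)   # observed training points

def fit_mse(order):
    """Fit a polynomial of the given order and return its MSE
    against the noise-free function (not the noisy samples)."""
    coeffs = np.polyfit(x, y, order)             # least-squares polynomial fit
    return float(np.mean((np.polyval(coeffs, x) - y_true) ** 2))

for order in (1, 3, 16):
    print(order, fit_mse(order))
```

The order-1 fit underfits, leaving a large error against the true function, while the order-16 fit chases the noise (NumPy may also warn that the high-order fit is poorly conditioned).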

Let's look at a plot of the mean squared error for the training and testing sets across polynomials of different orders. The horizontal axis represents the order of the polynomial; the vertical axis is the mean squared error. The training error decreases with the order of the polynomial.

The test error is a better means of estimating the error of a polynomial. The error decreases until the best polynomial order is reached, then begins to increase. We select the order that minimizes the test error; in this case, it was eight.

Anything to the left would be considered underfitting; anything to the right is overfitting.
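This selection procedure can be sketched as follows, with an assumed noisy-cubic dataset standing in for the lecture's example (the variable names here are illustrative, not from the course code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Assumed synthetic data: a noisy cubic.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(120, 1))
y = x[:, 0] ** 3 - 2 * x[:, 0] + rng.normal(scale=2, size=120)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Fit one model per polynomial order and record its test-set MSE.
test_mse = {}
for order in range(1, 11):
    pr = PolynomialFeatures(degree=order)
    lm = LinearRegression().fit(pr.fit_transform(x_train), y_train)
    test_mse[order] = mean_squared_error(y_test, lm.predict(pr.fit_transform(x_test)))

# Select the order that minimizes the test error.
best_order = min(test_mse, key=test_mse.get)
print(best_order)
```

Training error would keep shrinking as the order grows; the test-set MSE is what turns back up once the model starts fitting noise.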

Even if we select the best order of the polynomial, we will still have some error. If you recall the original expression for the training points, we see a noise term. This term is one reason for the error: the noise is random, and we can't predict it. It is sometimes referred to as irreducible error.

There are other sources of error as well. For example, our polynomial assumption may be wrong: our sample points may have come from a different function.

For example, in this plot the data is generated from a sine wave, and the polynomial function does not do a good job of fitting it.
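As a small illustration of that mismatch (the sine-wave samples here are assumed, not the lecture's data), a low-order polynomial fit to two periods of a sine wave leaves a large residual:

```python
import numpy as np

# Data generated from a sine wave over two full periods.
x = np.linspace(0, 4 * np.pi, 50)
y = np.sin(x)

# A 3rd-order polynomial cannot follow two full oscillations.
coeffs = np.polyfit(x, y, 3)
mse = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
print(mse)  # large relative to the signal's variance of 0.5
```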

For real data, the model may be too difficult to fit, or we may not have the correct type of data to estimate the function.

Let's try polynomials of different orders on the real data, using horsepower. The red points represent the training data; the green points represent the test data.

If we just use the mean of the data, our model does not perform well. A linear function fits the data better. A second-order model looks similar to the linear function, and a third-order function also appears to increase, like the previous two orders.

Here, we see a fourth-order polynomial. At around 200 horsepower, the predicted price suddenly decreases; this seems erroneous.

Let's use R-squared to see if our assumption is correct. The following is a plot of the R-squared values; the horizontal axis represents the order of the polynomial models. The closer the R-squared is to one, the more accurate the model. Here, we see that R-squared is optimal when the order of the polynomial is three. R-squared drastically decreases when the order is increased to four, validating our initial assumption.

We can calculate the different R-squared values as follows. First, we create an empty list to store the values, and a list containing the different polynomial orders. We then iterate through the orders using a loop. For each order, we create a polynomial-feature object with the order of the polynomial as a parameter, transform the training and test data into polynomial features using the fit_transform method, fit the regression model using the transformed data, then calculate the R-squared using the test data and store it in the list.
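The steps above can be sketched as follows. The course's horsepower/price data isn't reproduced here, so an assumed synthetic stand-in is used; the list name `Rsqu_test` is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in data: in the lecture this would be horsepower (feature)
# and price (target), already split into training and test sets.
rng = np.random.default_rng(2)
x_train = rng.uniform(50, 250, size=(80, 1))
x_test = rng.uniform(50, 250, size=(40, 1))
y_train = 0.3 * x_train[:, 0] ** 2 + rng.normal(scale=500, size=80)
y_test = 0.3 * x_test[:, 0] ** 2 + rng.normal(scale=500, size=40)

Rsqu_test = []        # empty list to store the R-squared values
order = [1, 2, 3, 4]  # the polynomial orders to try

for n in order:
    pr = PolynomialFeatures(degree=n)                 # polynomial-feature object
    x_train_pr = pr.fit_transform(x_train)            # transform the training data
    x_test_pr = pr.fit_transform(x_test)              # transform the test data
    lm = LinearRegression().fit(x_train_pr, y_train)  # fit on transformed data
    Rsqu_test.append(lm.score(x_test_pr, y_test))     # R-squared on test data
```

After the loop, `Rsqu_test` holds one R-squared value per order, which is what the plot above displays.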
