0:00

In this video, we'll discuss ridge regression.

Ridge regression prevents overfitting.

In this video, we will focus on polynomial regression for visualization,

but overfitting is also a big problem when

you have multiple independent variables, or features.

Consider the following fourth order polynomial in orange.

The blue points are generated from this function.

We can use a tenth order polynomial to fit the data.

The estimated function in blue does a good job of approximating the true function.

In many cases, real data has outliers.

For example, this point shown here does not appear to come from the function in orange.

If we use a tenth order polynomial function to fit the data,

the estimated function in blue is incorrect,

and is not a good estimate of the actual function in orange.

If we examine the expression for the estimated function,

we see the estimated polynomial coefficients have a very large magnitude.

This is especially evident for the higher order terms.

Ridge regression controls the magnitude of

these polynomial coefficients by introducing the parameter alpha.

Alpha is a parameter we select before fitting or training the model.

Each row in the following table represents an increasing value of alpha.

Let's see how different values of alpha change the model.

This table shows the polynomial coefficients for different values of alpha.

The columns correspond to the different polynomial coefficients,

and the rows correspond to the different values of alpha.

As alpha increases, the parameters get smaller.

This is most evident for the higher order polynomial features.
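This shrinking effect can be reproduced with a short sketch in scikit-learn. The data here is synthetic (a noisy fourth order polynomial standing in for the video's example), and the alpha grid is illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the video's example: samples from a noisy
# fourth order polynomial.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = x**4 - 3 * x**2 + x + rng.normal(scale=0.5, size=x.shape)

# Expand x into tenth order polynomial features.
X_poly = PolynomialFeatures(degree=10).fit_transform(x.reshape(-1, 1))

# Fit a ridge model for each alpha and print the largest coefficient
# magnitude: it shrinks as alpha grows.
for alpha in [0.0001, 0.001, 0.01, 1, 10]:
    model = Ridge(alpha=alpha).fit(X_poly, y)
    print(f"alpha={alpha}: max |coef| = {np.abs(model.coef_).max():.3f}")
```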

But alpha must be selected carefully.

If alpha is too large,

the coefficients will approach zero and underfit the data.

If alpha is zero,

the overfitting is evident.

For alpha equal to 0.001,

the overfitting begins to subside.

For alpha equal to 0.01,

the estimated function tracks the actual function.

When alpha equals one,

we see the first signs of underfitting.

The estimated function does not have enough flexibility.

When alpha equals 10,

we see extreme underfitting.

The estimated function does not even track the data points.

In order to select alpha,

we use cross validation.

To make a prediction using ridge regression,

import Ridge from sklearn.linear_model.

Create a Ridge object using the constructor.

The parameter alpha is one of the arguments of the constructor.

We train the model using the fit method.

To make a prediction, we use the predict method.
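These steps look like the following in scikit-learn; the feature matrix and targets here are made-up placeholders, not data from the video:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up training data: five samples with two features each.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 3.0, 7.0, 7.0, 10.0])

# The parameter alpha is an argument of the constructor.
ridge_model = Ridge(alpha=0.1)

# Train the model with fit, then make a prediction with predict.
ridge_model.fit(X, y)
yhat = ridge_model.predict(X)
print(yhat)
```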

In order to determine the parameter alpha,

we use some data for training.

We use a second set called validation data.

This is similar to test data,

but it is used to select parameters like alpha.

We start with a small value of alpha.

We train the model, make a prediction using the validation data,

then calculate the R-squared and store the value.

We repeat the process for a larger value of alpha.

We train the model again,

make a prediction using the validation data,

then calculate and store the value of R-squared.

We repeat the process for different alpha values,

training the model, and making a prediction.

We select the value of alpha that maximizes the R-squared.

Note that we can use other metrics to select the value of alpha,

like mean squared error.
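The selection procedure above can be sketched as a loop over candidate alphas, each scored on a held-out validation set; the data and candidate grid below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: a noisy linear relationship with three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out a validation set used only to compare candidate alphas.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train one model per alpha and store its validation R-squared.
scores = {}
for alpha in [0.001, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    scores[alpha] = r2_score(y_val, model.predict(X_val))

# Select the alpha that maximizes R-squared on the validation data.
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```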

The overfitting problem is even worse if we have many features.

The following plot shows the different values of R-squared on the vertical axis.

The horizontal axis represents different values of alpha.

We use several features from our used car data

set and a second order polynomial function.

The training data is in red and the validation data is in blue.

We see that as the value of alpha increases,

the R-squared on the validation data increases and converges at approximately 0.75.

In this case, we select this maximum value of alpha because

running the experiment for higher values of alpha has little impact.

Conversely, as alpha increases,

the R-squared on the training data decreases.

This is because the term alpha prevents overfitting.

This may improve the results on unseen data,

but the model has worse performance on the training data.

See the lab on how to generate this plot.
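The lab covers the plot itself; the computation behind such a plot might look like the sketch below, with synthetic data standing in for the used car dataset. Training R-squared can only fall as alpha grows, while the validation R-squared is what we maximize:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the used car data: four features, noisy target.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=2.0, size=200)

# Second order polynomial features, as in the video's plot.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
X_train, X_val, y_train, y_val = train_test_split(
    X_poly, y, test_size=0.3, random_state=0
)

# Score each alpha on both the training and the validation data.
alphas = [0.01, 0.1, 1, 10, 100]
train_scores, val_scores = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_scores.append(r2_score(y_train, model.predict(X_train)))
    val_scores.append(r2_score(y_val, model.predict(X_val)))

# These two lists are the red and blue curves of the plot.
print(train_scores)
print(val_scores)
```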
