
Hello and welcome to Lesson IV, Introduction to Regularization.

This lesson introduces the concept of regularization, which is a very important concept used a lot with linear models, and it's used to reduce the impact of overfitting.

It does this by introducing penalties, such as on model complexity.

The idea is, if overfitting often results from overly complex models, then you penalize complex models to encourage less complex models to have higher scores.

Some of the techniques that implement this are known as Lasso, Ridge regression, and Elastic Net regularization.

So by the end of this lesson, you should be able to explain the benefits of regularization, articulate the differences between these three types of regularization, and apply the Lasso, Ridge, or Elastic Net regularization methods by using the scikit-learn library.

This particular lesson focuses solely on this notebook.

Now, the idea here is to penalize an overly complex model.

So we're going to introduce some data, then talk about fitting a polynomial to that data and what it might mean to have an optimal model fit, and then we're going to introduce and demonstrate these three different types of regularization.

So, first we have our initial startup code.

We then talk a little bit about our data.

To demonstrate this, we're going to need a model, and here it is: we're going to have a linearly spaced x, with a signal and a noise component.

I'll keep those separate so that it's easier to model them independently, if we so wish.

We can then plot this data. Here you go.

Here's our nice little curve, along with the noise and signal.
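The data setup described here can be sketched as follows. Since the notebook itself isn't shown, the sine signal, the noise scale, and the number of points are all assumptions, not the lesson's exact values:

```python
import numpy as np

# A minimal sketch of the lesson's setup: linearly spaced x with separate
# signal and noise components (the sine signal and noise scale are assumptions).
rng = np.random.default_rng(42)

x = np.linspace(0, 1, 50)                    # linearly spaced x
signal = np.sin(2 * np.pi * x)               # smooth underlying signal
noise = rng.normal(scale=0.25, size=x.size)  # random noise component
y = signal + noise                           # observed data

print(x.shape, y.shape)
```

Keeping `signal` and `noise` as separate arrays, as the lesson suggests, makes it easy to plot or analyze either component on its own.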

And then, we can actually think about fitting this data.

This is very similar to the previous notebook, where we showed the bias and variance of different fits.

Here we're going to build the polynomial features.

Then we're going to fit a linear model, but the linear model is of course using these polynomial features, which allows us to have a more complex model, and we're going to plot the different model fits.
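This step can be sketched with scikit-learn's `PolynomialFeatures` and `LinearRegression` in a pipeline; the data and the particular degrees below are stand-ins for the notebook's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Fit polynomial models of several orders; higher orders track the
# training data more and more closely.
r2 = {}
for degree in (1, 3, 7, 21):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    r2[degree] = model.score(X, y)  # R^2 on the training data
    print(f"degree {degree}: training R^2 = {r2[degree]:.3f}")
```

Note that the training score alone always favors the higher-order models, which is exactly the overfitting problem the rest of the lesson addresses.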

So this is a bit of a busy plot; let me make sure we're all on the same page here.

The first-order model is the dashed line that's sort of a light yellow.

We then have a third-order fit, which is the short-dashed line here.

We then have a seventh-order fit, which actually does a really good job of fitting the data.

And then we have some higher-order fits, 21st and 23rd order, and you can see these are wiggling all over the place and not doing a very good job of capturing the true signal.

And then of course, our data has been broken into training and testing sets, blue and red respectively.

Now, this plot shows both underfitting for the low-order polynomials and overfitting for the really high-order polynomials, while that seventh-order fit does a pretty good job.

So, how do we figure out the right value?

Well, one way is to pick a particular metric, fit the model to our training data, and then compute the test error based on that metric.

So we can plot that.

This plot shows the model performance for the training data as the blue dot-dash line and for the test data as the solid line.

You can see that the error drops rapidly until we get to about five, then it sort of flattens off, and after about seven or eight it starts going back up a little bit for the test data.

This indicates that our optimal fit is occurring somewhere in that range.
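This degree sweep might look roughly like the sketch below; the train/test split proportions and the use of mean squared error as the metric are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)
X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

train_err, test_err = [], []
for degree in range(1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_err.append(mean_squared_error(y_te, model.predict(X_te)))

# Training error keeps falling as the degree grows; test error flattens
# out and eventually rises again once the model starts overfitting.
```

Plotting `train_err` and `test_err` against degree reproduces the kind of curve the lesson describes.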

Now, when we want to do regularization, we can apply Ridge, Lasso, or Elastic Net.

The ridge regression that we're going to demonstrate adds a penalty term, which is technically the L2 norm of the regression coefficients; that's just the Euclidean norm, if you remember.

And the idea is to smooth fits out, to penalize wildly fluctuating fits.

Now, this is a little more complex to demonstrate, because if we're trying to show fit coefficients, technically we would want to show them all the way out to the highest order, and doing that would make a very wide display.

So we're only going to show the first seven fit coefficients here.

We're going to fit this model; it's actually going to be a 17th-order polynomial fit to the data, and we're going to plot and display the fit coefficients.

So we have alpha, which controls the strength of the ridge regression.

When it's zero, there's no effect, and you can see that these terms are very big; remember, we're truly going out to 17th order.

As we start applying the penalty, you notice that these numbers drop and become much smaller.

That's the impact of ridge regression: it penalizes wildly fluctuating fits and encourages smaller coefficients for these higher-order terms.

If I actually make that a little smaller, you should be able to see the full terms here.

There, I made them a little smaller, so now you can see the coefficients and how they start off very high and then drop.

You can also now see the fits.

So here are the fits to the data; this is a 17th-order polynomial.

Let me start with alpha equals zero.

Notice it's doing a horrible job of fitting; the curve is hardly passing through any of these points.

As soon as we start adding in an alpha, though, the fits start getting better.

And you can see that almost immediately, this alpha of 0.0001 does a really good job of fitting the data.

That shows you the impact of ridge regression.
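A sketch of this ridge demonstration with scikit-learn's `Ridge` estimator (the data and the particular alpha values are placeholders): a larger alpha means a stronger L2 penalty, which shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)
X = x.reshape(-1, 1)

# A 17th-order polynomial fit with increasing ridge penalties;
# the coefficient norm shrinks as alpha grows.
coef_norm = {}
for alpha in (1e-6, 1e-4, 1.0):
    model = make_pipeline(PolynomialFeatures(17), Ridge(alpha=alpha))
    model.fit(X, y)
    coef_norm[alpha] = np.linalg.norm(model.named_steps["ridge"].coef_)
    print(f"alpha={alpha:g}: ||coef|| = {coef_norm[alpha]:.3g}")
```

Printing the norm rather than all eighteen coefficients sidesteps the wide-display problem mentioned above while still showing the shrinkage.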

So let me make this bigger again and go down to Lasso.

Lasso is a second type of regularization, but here we use an L1 norm instead of the L2 norm.

The idea is that this is going to drive some of the coefficients to zero. Not just make them small, but drive them all the way to zero, so we get a sparser solution.

Again, to demonstrate, we're going to do a 17th-order polynomial fit.

We will only display the first seven polynomial coefficients, but we'll plot all of them.
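The Lasso version of the same demonstration can be sketched like this (data, alpha, and iteration cap are assumptions); the point is that the L1 penalty zeroes out many coefficients entirely:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)
X = x.reshape(-1, 1)

# The L1 penalty drives many coefficients exactly to zero (sparsity).
model = make_pipeline(
    PolynomialFeatures(17),
    Lasso(alpha=0.01, max_iter=100_000),
)
model.fit(X, y)
coefs = model.named_steps["lasso"].coef_
n_zero = int(np.sum(coefs == 0.0))
print(f"{n_zero} of {coefs.size} coefficients are exactly zero")
```

Contrast this with ridge, where coefficients get small but rarely hit exactly zero.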

So if we come down here, we're going to see that these higher-order terms went to zero, and they went to zero pretty quickly.

We're not even showing the no-effect case with no Lasso regression; you immediately see a mix of big coefficients and zeroed ones.

And as alpha gets really big, up to one, even these terms all go to zero.

So that's obviously not necessarily a good thing.

So here we can again see the same thing: you can see how the model changes as alpha starts increasing.

Here is alpha equals one.

Notice that it all fits roughly the same out here; it's really only here on the left side, and if you remember, the model was actually supposed to go up there.

So we actually aren't capturing the full signal here, and we might want to use a smaller value of alpha to see whether that improves the model performance or not.

Remember, we're training on the blue points, so it may not be surprising that it's not capturing that full upturn, given the wide spread of the blue points over here.

Now, in some cases you might want to use both the L1 and L2 norms, and this is known as Elastic Net, which effectively combines Ridge and Lasso regression; the ratio between the two is set by the hyperparameter l1_ratio.

We still have the alpha parameter, but now we're setting this l1_ratio parameter to 0.5, which is a roughly equal split between Lasso and Ridge.

Everything else is the same, except that now we're applying Elastic Net regression.

So some coefficients are going to get driven to zero, and all of them are going to be driven smaller.

You can then see the effect of this; the plot here shows that.
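The Elastic Net demonstration, sketched with scikit-learn's `ElasticNet` estimator (data and alpha are placeholders; l1_ratio=0.5 matches the roughly equal split described above):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=x.size)
X = x.reshape(-1, 1)

# l1_ratio=0.5 splits the penalty roughly evenly between the L1 (Lasso)
# and L2 (Ridge) terms; alpha scales the overall penalty strength.
model = make_pipeline(
    PolynomialFeatures(17),
    ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=100_000),
)
model.fit(X, y)
coefs = model.named_steps["elasticnet"].coef_
print(f"{int(np.sum(coefs == 0.0))} of {coefs.size} coefficients are zero")
```

As the lesson says, the result is a blend: some coefficients are driven exactly to zero, and the rest are shrunk.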

So again, here in this notebook we've introduced the basic idea of regularization and shown you how to perform it with Ridge regression, Lasso regression, and Elastic Net.

Hopefully, you've learned about the importance of regularization and how it can be used to prevent overfitting.

If you have any questions, let us know, and good luck.
