Hello and welcome to Lesson IV, Introduction to Regularization.

This lesson introduces the concept of regularization,

which is a very important concept used a lot with

linear models, and it's used to reduce the impact of overfitting.

And it does this by introducing penalties,

such as on model complexity.

The idea is, if overfitting often results from overly complex models,

then you penalize complex models to encourage less complex models to have higher scores.

Some of the techniques that implement this are known as Lasso,

Ridge regression, and Elastic Net regularization.

So by the end of this lesson, you should be able

to explain the benefits of regularization,

articulate the differences between these three different types

of regularization and be able to apply the Lasso,

Ridge, or Elastic Net regularization methods by using the scikit-learn library.

This particular lesson focuses solely on this notebook.

Now, the idea here is to penalize an overly complex model.

So we're going to introduce some data and then

we're going to talk about fitting a polynomial to this data,

what it might mean to have an optimal model fit to that data,

and then we're going to introduce and demonstrate

these three different types of regularization.

So, first we have our initial startup code.

We then talk a little bit about our data.

To demonstrate this, we're going to need a model.

And so, here's our model. We're going to have linearly spaced X.

We're going to have a signal and a noise component.

And I'll keep those separate so that it makes it easier

to model them independently, if we so wish.
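
As a rough sketch of the kind of data described here, the following builds a linearly spaced X with separate signal and noise components. The sine signal, the noise level, the number of points, and the seed are all assumptions for illustration; the lesson does not state which ones the notebook uses.

```python
import numpy as np

# Assumed values: the lesson does not state the signal function,
# the noise level, the number of points, or the random seed.
rng = np.random.default_rng(23)

x = np.linspace(0.0, 1.0, 50)              # linearly spaced X
signal = np.sin(2.0 * np.pi * x)           # hypothetical "true" curve
noise = rng.normal(0.0, 0.2, size=x.size)  # hypothetical Gaussian noise
y = signal + noise                         # observed data = signal + noise

print(x[:3], y[:3])
```

Keeping signal and noise in separate arrays, as the lesson suggests, makes it easy to plot or model each one on its own.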

We then can plot this data. Here you go.

Here's our nice little curve,

along with the noise and signal.

And then, we can actually think about fitting this data.

So this is very similar to the previous notebook where we

showed the bias and variance of different fits.

Here we're going to do the polynomial features.

Then again, we're going to want to fit a linear model to this,

but the linear model is of course using these polynomial features, which allows us to have

a more complex model, and we're then going to plot different model fits.
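
One way this step might look in scikit-learn is a pipeline of PolynomialFeatures followed by LinearRegression, fit once per polynomial order. This is a minimal sketch, not the notebook's code: the sine data, the 60/40 split, and the seed are assumptions, while the orders match the ones discussed in the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sine signal plus Gaussian noise (not from the lesson).
rng = np.random.default_rng(23)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23
)

# Fit a linear model on polynomial features for each order.
fits = {}
for degree in (1, 3, 7, 21, 23):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    fits[degree] = model.predict(X)  # predictions over the full x range

print({degree: pred.shape for degree, pred in fits.items()})
```

The model stays linear in its coefficients; only the inputs are expanded into powers of x, which is what lets a linear model trace a curve.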

So this is a bit of a busy plot.

Let me make sure we're all on the same page here.

The first-order model is the dashed line,

that's sort of a light yellow.

We then have a third-order fit,

which is a short-dashed line here.

We then have a seventh-order fit,

which actually does a really good job of fitting the data.

And then we have some higher-order fits, 21st and 23rd, and you can see these are

wiggling all over the place and not doing a very good job of capturing that true signal.

And then of course, our data has been broken into

training and testing, blue and red respectively.

Now, this model shows both underfitting for the low-order polynomials,

as well as overfitting for these really high-order polynomials.

And that seventh-order fit does a pretty good job of fitting.

So, how do we know, how do we figure out that right value?

Well, one way to do it is to actually pick a particular statistic, compute the model,

fit it to our training data,

and then compute what the test error is based on that particular metric.

So we can plot that.

This is now showing the model performance for the training data in

the blue dot-dashed line and the test data in the solid line.

And you could see that the score drops rapidly until we get to about five,

then it sort of flattens off and then after about seven or eight,

it starts actually going up a little bit for the test data.

This indicates that our optimal fit is occurring somewhere in here.
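
A sketch of that comparison, under some assumptions: mean squared error stands in for the "particular statistic" (the lesson does not name the metric), and the data, split, and range of orders are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sine signal plus Gaussian noise (not from the lesson).
rng = np.random.default_rng(23)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.4, random_state=23
)

degrees = range(1, 16)  # assumed range of polynomial orders
train_err, test_err = [], []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)  # fit on the training data only
    train_err.append(mean_squared_error(y_train, model.predict(X_train)))
    test_err.append(mean_squared_error(y_test, model.predict(X_test)))

# The order with the lowest test error is a candidate for the optimal fit.
best_degree = degrees[int(np.argmin(test_err))]
print("order with lowest test error:", best_degree)
```

The training error can only go down as terms are added, so it is the test error that signals where overfitting begins.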

Now, when we want to do regularization,

we can apply Ridge,

Lasso or Elastic Net.

The ridge regression that we're going to demonstrate,

adds a penalty term,

which is technically the squared L2 norm of the regression coefficients,

where the L2 norm is just the Euclidean norm, if you remember.

And the idea is to smooth fits out,

to penalize wildly fluctuating fits.
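
Written out, the ridge objective in its usual form looks like the following. The symbols X (feature matrix), y (targets), and w (regression coefficients) are notation introduced here for illustration; alpha is the penalty strength discussed in this lesson. This matches the objective documented for scikit-learn's Ridge, where the penalty is the squared L2 norm.

```latex
\hat{w} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2,
\qquad
\lVert w \rVert_2^2 = \sum_{j} w_j^2
```

With alpha equal to zero the second term vanishes and this reduces to ordinary least squares; larger values of alpha make large coefficients more costly.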

Now, this is a little more complex to demonstrate,

because if we're trying to show fit coefficients, technically,

we would want to be showing all the way out to 20th order,

doing that would make a very wide display.

And so, we're only going to show the first seven fit coefficients here.

We're going to fit this model.

It's going to actually be a 17th order polynomial fit to

the data and we're going to plot and display the fit coefficients.

So we have alpha,

which controls the effect of the ridge regression.

When it's zero, there's no effect.

You can see that these terms are very big.

Remember, truly, we're going out to 17th order.

As we start applying these,

you notice that these numbers are dropping, becoming much smaller.

That's the impact of ridge regression,

that it penalizes wildly fluctuating fits

and encourages smaller coefficients for these higher order terms.
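
To see this shrinkage in code, here is a minimal sketch using scikit-learn's Ridge on 17th-order polynomial features. The data and the grid of alpha values are assumptions; alpha equal to zero is left out of the grid because it amounts to an ordinary least-squares fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sine signal plus Gaussian noise (not from the lesson).
rng = np.random.default_rng(23)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
X = x.reshape(-1, 1)

coef_norms = {}
for alpha in (1e-4, 1e-2, 1.0):  # assumed grid of penalty strengths
    model = make_pipeline(
        PolynomialFeatures(17, include_bias=False),  # 17th-order features
        Ridge(alpha=alpha),
    )
    model.fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    coef_norms[alpha] = float(np.linalg.norm(coefs))
    # Show only the first seven coefficients, as in the lesson's display.
    print(alpha, np.round(coefs[:7], 3))

print(coef_norms)
```

The overall size of the coefficient vector shrinks as alpha grows, but ridge generally leaves the coefficients small rather than exactly zero.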

If I actually make that a little smaller,

you should be able to see the full terms here.

There I made them a little smaller,

so now you can see the coefficients and how it starts off very high and then it drops.

You also now can see the fits.

So here are the fits to the data.

This is a 17th-order polynomial.

Let me start with alpha equals zero.

Notice it's doing a horrible job of fitting.

The prediction is hardly going through any of these points.

As soon as we start adding in an alpha though,

the fit starts getting better.

And you can see that almost immediately here,

this alpha of 0.0001 does a really good job of fitting the data.

That shows you the impact of ridge regression.

So let me make this bigger again and go down to Lasso.

Lasso is a second type of regularization,

but here we use an L1 norm instead of the L2 norm.

And the idea is that this is going to drive some of the coefficients to zero.

Not just make them small, but drive them to zero.

The idea is we get a sparser solution.
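
A minimal sketch of that sparsity, using scikit-learn's Lasso on 17th-order polynomial features and counting how many coefficients are exactly zero. The data, the alpha grid, and the raised max_iter are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sine signal plus Gaussian noise (not from the lesson).
rng = np.random.default_rng(23)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
X = x.reshape(-1, 1)

n_zero = {}
for alpha in (1e-3, 1e-2, 1.0):  # assumed grid of penalty strengths
    model = make_pipeline(
        PolynomialFeatures(17, include_bias=False),  # 17th-order features
        Lasso(alpha=alpha, max_iter=100_000),  # extra iterations to help convergence
    )
    model.fit(X, y)
    coefs = model.named_steps["lasso"].coef_
    n_zero[alpha] = int(np.sum(coefs == 0.0))  # coefficients driven exactly to zero
    print(alpha, np.round(coefs[:7], 3))

print(n_zero)
```

In this sketch an alpha of one is strong enough to zero out every polynomial coefficient, leaving only the intercept, which is the over-penalized case the lesson cautions about.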

Again to demonstrate, we're going to do a 17th order polynomial fit.

We will only display the first seven polynomial coefficients,

but we'll plot all of them.

So if we come down here, we're going to see

that these higher-order terms went to zero.

And they went to zero pretty quickly.

We're not even showing the no-effect case, with no Lasso regression.

You see immediately big and small.

And as alpha gets really big,

gets to one, even these terms all go to zero.

So that's obviously not necessarily a good thing there.

So here we can, again, see the same thing.

You can see as alpha starts increasing how the model changes.

Here is alpha equals one.

Notice that it all fits roughly the same out here.

It's really only here on the left side and if you remember the model,

actually it was supposed to go up.

So we actually aren't capturing the full signal here;

we might actually want to use a smaller value of alpha to see whether that

improves the model performance or not.

Remember, we're training on the blue.

And so, it may not be surprising that it's not capturing that

full turn up here given the wide spread of the blue points over here.

Now, in some cases,

you might want to use both the L1 and L2 norms,

and this is known as Elastic Net,

that effectively is combining Ridge and Lasso regression,

and the ratio between the two is set by this hyperparameter, l1_ratio.

We still have the alpha parameter,

but now we're setting this l1_ratio parameter to be 0.5,

which is a roughly equal split between Lasso and Ridge.

Everything else is the same,

except that now we're applying Elastic Net regression.

So some coefficients are going to get driven to zero.

All of them are going to be driven smaller.

You can then see the effect of this and this plot here shows that.
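
For reference, a minimal sketch with scikit-learn's ElasticNet, where l1_ratio=0.5 mixes the L1 and L2 penalties as described. The data, the alpha grid, and the raised max_iter are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: sine signal plus Gaussian noise (not from the lesson).
rng = np.random.default_rng(23)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)
X = x.reshape(-1, 1)

enet_coefs = {}
for alpha in (1e-3, 1e-2, 1.0):  # assumed grid of penalty strengths
    model = make_pipeline(
        PolynomialFeatures(17, include_bias=False),  # 17th-order features
        # l1_ratio=0.5 weights the Lasso (L1) and Ridge (L2) terms together.
        ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=100_000),
    )
    model.fit(X, y)
    enet_coefs[alpha] = model.named_steps["elasticnet"].coef_
    print(alpha, np.round(enet_coefs[alpha][:7], 3))
```

Setting l1_ratio to 1 makes the penalty purely L1, as in Lasso, and setting it to 0 makes it purely L2, so this one parameter moves between the two behaviors.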

So again, here in this notebook we've introduced the basic idea of regularization,

shown you how to perform it with Ridge regression,

Lasso regression and Elastic Net.

Hopefully, you've learned about the importance of

regularization and how it can be used to prevent overfitting.

If you have any questions, let us know, and good luck.