Welcome to our section on regularization and model selection. In this section, we'll discuss the power of regularization, which will ultimately be one of the major keys to choosing a model that strikes the right balance between bias and variance.

Now let's go over the learning goals for this section. We'll start with a quick recap of the relationship between model complexity and model error. We'll discuss how regularization can be used as an approach to avoid overfitting: as we move to a more complex model, we often end up overfitting to our training data, and we'll use regularization to ensure that we do not. We'll cover some standard approaches to regularization, including Ridge and Lasso, which we were introduced to briefly in our notebooks, but now we'll look under the hood at the actual math, as well as elastic net, which is a sort of balance between Ridge and Lasso regression. And finally, we'll discuss recursive feature elimination, which is a means of eliminating features that may be diluting the model's ability to generalize, so that we only include features that really have predictive power.

From last time, we recall that when building a model, we want both the training and test errors to be as small as possible. The curve on top is, again, the error on the test set, and we recall that as the model becomes more complex, the test set error eventually increases because the model is not generalizing well. One way to handle the tradeoff between complexity and error is to use a simpler overall model. Another way is to use regularization to take our existing model and make it less complex. So can we tune with more granularity than just choosing a polynomial degree, for example, when we were working with linear regression? Yes, we can, by using something called regularization.

So what do we mean by this term regularization? Recall that the means by which our machine learns the parameters from the data is that we try to minimize some cost function. As we saw for linear regression, this means minimizing the mean squared error, the squared error between the outcome variable and our predicted values. Our new cost function with regularization will be that original cost function, which we represent here as M(w), plus the lambda symbol, which we saw when we introduced Lasso and Ridge regression in our last notebook, multiplied by R(w), where R is a function of the strength of our different parameters. The regularization portion, lambda times R(w), is added onto our original cost function so that we can penalize the model extra if it is too complex. Essentially, this allows us to dumb down the model: the stronger our weights, the stronger our parameters, the higher this cost function will be, and since we're ultimately trying to minimize it, we won't be able to fit as closely to the training data. Lambda adds a penalty proportional to the strength of these parameters, or some function of these parameters, and we'll get deeper into how it can be a function of the parameters later on. But the takeaway is that the larger this lambda is, the more we penalize strong parameters.
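To make this concrete, here is a minimal sketch of a regularized cost function in Python. It assumes NumPy, uses mean squared error for M(w), and uses the sum of squared weights (Ridge-style) or absolute weights (Lasso-style) for R(w); the function name and arguments are illustrative, not the exact formulation used later in the course.

```python
import numpy as np

def regularized_cost(w, X, y, lam, penalty="ridge"):
    """Original cost M(w) plus the regularization term lambda * R(w)."""
    residuals = X @ w - y                  # prediction error for each sample
    M = np.mean(residuals ** 2)            # M(w): mean squared error
    if penalty == "ridge":
        R = np.sum(w ** 2)                 # R(w): sum of squared weights
    else:                                  # "lasso"
        R = np.sum(np.abs(w))              # R(w): sum of absolute weights
    return M + lam * R                     # larger lam penalizes strong weights more
```

With lam set to zero this reduces to the plain mean squared error; increasing lam makes large weights more expensive, which is exactly the penalty described above.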
And again, the more we penalize our model for having strong parameters, the less complex that model can be as we try to minimize this function. We are trying to minimize the strength of all of our parameters while also minimizing our original cost function, so we increase the total cost according to how much we want to penalize the model for being more complex. The regularization strength parameter lambda allows us to manage this complexity tradeoff. More regularization, a higher lambda, gives a simpler model and more bias: increasing lambda means more penalty for strong weights, and the more penalty we put on the weights, the less complex our model can be. On the other end, less regularization lets the model be more complex and increases the variance. If our model is overfit, meaning the variance is too high, which we can see because the test set error is very high while the training error is very low, regularization can help improve the generalization error and reduce that variance.

Now, let's take a step back from regularization on its own and think about it in the context of feature selection, that is, figuring out which of our features are important enough to include in the model. Regularization is essentially a form of feature selection, since it reduces or eliminates the contribution of each feature as we add more weight to the penalty. This is most obvious with Lasso regression, which will actually drive some of the coefficients in our linear regression down to zero. If you think about a coefficient of zero, you're removing the contribution of that feature altogether. It has the same effect as manually removing some features prior to modeling, except that Lasso finds which ones to remove automatically according to a mathematical formula, which we'll see in a bit.

Another way to perform feature selection is to remove some of the features right at the start. This should be done quantitatively: remove one feature at a time and measure the predictive results via cross-validation. If eliminating the feature improves the model on the holdout set, or at least doesn't increase the error by much, then we can probably safely remove that feature, and the model may even generalize better without it (see the short sketch below).

So why are we doing this? Wouldn't you assume that models with more features are better? Given what we've discussed, that's not always going to be the case; not all features are necessarily relevant. Think about the analysis you may do in your own business practices. For customer churn, the customer name probably does not add much value, but it could help your predictions if you are just fitting to the training set. It adds extra noise that allows you to overfit your model, tricking your algorithm into thinking that it is an important feature, and then when you get new data, you're not able to generalize well. Besides reducing the effects of overfitting, feature elimination can also improve fitting time and/or the results of some models.
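Here is a small sketch of both ideas, assuming scikit-learn is available; the synthetic data, the alpha value, and the choice of which feature to drop are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: 5 features, but only the first two actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Lasso drives the coefficients of unhelpful features toward exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)

# Manual alternative: drop one feature and compare cross-validated scores
full = cross_val_score(Lasso(alpha=0.1), X, y, cv=5).mean()
without_last = cross_val_score(Lasso(alpha=0.1), X[:, :-1], y, cv=5).mean()
print("All features:", full, "Without last feature:", without_last)
```

If the score without the feature is about the same or better, that feature is a reasonable candidate for removal.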
Feature elimination is particularly useful for models that don't come with a built-in regularization term the way Ridge or Lasso do, since it lets us avoid training on every feature we have available. Feature elimination can also be used to identify the most important features, which can improve model interpretability (a short sketch of recursive feature elimination follows at the end of this section). And this is often an important business requirement, as we actually try to learn how we can affect the outcome variable. So that closes out this section on regularization. In the next video, we will continue by looking at Ridge regression. Thank you.
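As a quick supplement to the recursive feature elimination mentioned in this section, here is a minimal sketch using scikit-learn's RFE; the synthetic data and the choice to keep two features are illustrative.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 5 features, only the first two carry real signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Recursively fit the model and drop the weakest feature until two remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)
print("Kept features:", selector.support_)   # boolean mask of retained features
print("Ranking:", selector.ranking_)         # 1 = kept; larger numbers were dropped earlier
```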