First, let's talk about how we can use regularization to create sparser, simpler models. Earlier in the course, we learned about L2 regularization, which adds the sum of the squared parameter weights to the loss function. This is great at keeping weights small, giving us stability and a unique solution, but it can leave the model unnecessarily large and complex, since all of the features may still remain, just with small weights. Using L1 regularization instead adds the sum of the absolute values of the parameter weights to the loss function, which tends to force the weights of not-very-predictive features to zero. This acts as a built-in feature selector by killing off the bad features and leaving only the strongest in the model.

This sparse model has many benefits. First, with fewer coefficients to store and load, there is a reduction in the storage and memory needed and a much smaller model size, which is especially important for embedded models. Also, with fewer features there are far fewer multiply-adds, which not only increases training speed but, more importantly, prediction speed.

Many machine learning models already have more than enough features as it is. For instance, let's say I have data that contains the date-time of orders being placed. Our first-order model would probably include seven features for the days of the week and 24 features for the hours of the day, plus possibly many other features. Day of week plus hour of day is already 31 inputs on its own. Now, what if we want to look at the second-order effects of the day of the week crossed with the hour of the day? That's another 168 inputs on top of our 31, plus the others, for a grand total of almost 200 features just for that one date-time field, on top of whatever other features we are using. If we cross this with a one-hot encoding of US state, for example, the triple Cartesian product is already at 8,400 features, many of them probably very sparse and full of zeros. Hopefully this makes it clear why built-in feature selection through L1 regularization can be a very good thing.

What other strategies could we use, besides L1 regularization, to remove feature coefficients that aren't useful? We could simply count which features have non-zero weights. The L0 norm is exactly that count of non-zero weights, but optimizing for it is an NP-hard, non-convex optimization problem. This diagram illustrates what a non-convex optimization error surface might look like. As you can see, there are many local peaks and valleys, and this is just a simple one-dimensional example. You would pretty much have to explore lots and lots of starting points with gradient descent, which makes this an NP-hard problem to solve completely. Thankfully, the L1 norm, just like the L2 norm, is convex, and it also encourages sparsity in the model.

In this figure, the probability distributions of the L1 and L2 norms are plotted. Notice how the L2 norm has a much smoother peak at zero, which results in weight magnitudes that are close to zero. The L1 norm, however, is more of a cusp centered at zero, so much more of the probability mass sits exactly at zero than with the L2 norm. There are an infinite number of norms, generalized by the p-norm. Some other norms are the L0 norm we already covered, which is the count of the non-zero values in a vector, and the L-infinity norm, which is the maximum absolute value of any value in a vector.
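To make this concrete, here is a minimal sketch, assuming NumPy and scikit-learn as the tooling (not necessarily what this course uses) and illustrative data and alpha values: it computes the norms just mentioned on a small weight vector, then shows how an L1 penalty tends to zero out uninformative weights while an L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# The norms discussed above, computed on a small weight vector.
w = np.array([0.0, -3.0, 0.5, 0.0, 2.0])
print(np.linalg.norm(w, ord=0))       # L0: count of non-zero weights   -> 3.0
print(np.linalg.norm(w, ord=1))       # L1: sum of absolute values      -> 5.5
print(np.linalg.norm(w, ord=2))       # L2: sqrt of sum of squares      -> ~3.64
print(np.linalg.norm(w, ord=np.inf))  # L-infinity: max absolute value  -> 3.0

# Synthetic data: 100 features, but only 10 carry real signal.
X, y = make_regression(n_samples=1000, n_features=100,
                       n_informative=10, noise=10.0, random_state=42)

l1_model = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: lambda * sum(|w_i|)
l2_model = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: lambda * sum(w_i^2)

# L1 drives most of the uninformative weights to exactly zero;
# L2 keeps them small but (almost always) non-zero.
print("L1 zero weights:", np.sum(l1_model.coef_ == 0.0))
print("L2 zero weights:", np.sum(l2_model.coef_ == 0.0))
```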
In practice though, the L2 norm usually provides more generalizable models than the L1 norm. However, we end up with much more complex, heavier models if we use L2 instead of L1. This happens because features often have high correlation with each other, and L1 regularization will choose one of them and throw the other away, whereas L2 regularization will keep both features and keep their weight magnitudes small. So with L1 you can end up with a smaller model, but it may be less predictive. Is there any way to get the best of both worlds?

The elastic net is just a linear combination of the L1 and L2 regularization penalties (see the sketch at the end of this section). This way, you get the benefits of sparsity for the really poorly predictive features, while also keeping the decent and great features with smaller weights to provide good generalization. The only trade-off is that there are now two hyperparameters to tune instead of one, the two different lambda regularization parameters.

What does L1 regularization tend to do to the parameter weights of a model's low-predictive features? The correct answer is: have zero values. Whenever we use regularization techniques, we are adding a penalty term to the loss function, or more generally the objective function, so that we don't over-optimize our decision variables, the parameter weights. We choose the penalty term based on prior knowledge, the shape of the function, et cetera. L1 regularization has been shown to induce sparsity in the model, and due to its probability distribution having a high peak at zero, most weights, except for the highly predictive ones, will be shifted from their non-regularized values to zero. L2 regularization would be the answer for having small magnitudes, and its negative for having large magnitudes, both of which are incorrect. Having all positive values would be like adding many additional constraints to the optimization problem, bounding all decision variables to be greater than zero, which is also not what L1 regularization does.
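Here is the elastic net sketch referred to above, again assuming scikit-learn and illustrative parameter values rather than anything from the course. Note that scikit-learn folds the two lambdas into an overall strength (alpha) plus a mixing weight (l1_ratio), so there are still two knobs to tune.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Same kind of synthetic data: many features, few of them informative.
X, y = make_regression(n_samples=1000, n_features=100,
                       n_informative=10, noise=10.0, random_state=42)

# Elastic net penalty = L1 part + L2 part, controlled by two hyperparameters.
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print("zero weights:", np.sum(model.coef_ == 0.0))        # sparsity from the L1 part
print("largest |weight|:", np.max(np.abs(model.coef_)))   # kept small by the L2 part
```

Tuning alpha and l1_ratio (typically by cross-validation) lets you trade off how aggressively poor features are zeroed out against how strongly the remaining weights are shrunk.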