We know we are going to use regularization methods that penalize model complexity. Now the question is, how do we measure model complexity? Both L1 and L2 regularization methods represent model complexity as the magnitude of the weight vector, and try to keep that in check. From linear algebra you should remember that the magnitude of a vector is given by a norm function. Let's quickly review the L1 and L2 norm functions.

The weight vector can have any number of dimensions, but it's easier to visualize in two-dimensional space. So a vector with w0 = a, w1 = b would look like this green arrow. Now, what's the magnitude of this vector? You may instantly think c, because you are applying the most common method, the one we learn in high school: the Euclidean distance from the origin. c would be the square root of a squared plus b squared. In linear algebra, this is called the L2 norm, denoted by double bars and a subscript of 2, or no subscript at all, because 2 is the default. The L2 norm is calculated as the square root of the sum of the squared values of all vector components. But that's not the only way the magnitude of a vector can be calculated. Another common method is the L1 norm. L1 measures the absolute value of a plus the absolute value of b, basically the yellow path highlighted here.

Now remember, we're looking for a way to define model complexity. We use L1 and L2 as regularization methods, where model complexity is measured as the magnitude of the weight vector. In other words, if we keep the magnitude of our weight vector smaller than a certain value, we've achieved our goal. Now let's visualize what it means for the L2 norm of our weight vector to be under a certain value, say 1. Since L2 is the Euclidean distance from the origin, our desired vector must stay within this circle with a radius of 1, centered on the origin. When trying to keep the L1 norm under a certain value, the area in which our weight vector can reside takes the shape of this yellow diamond. The most important takeaway here is that, when applying L1 regularization, the optimal value of certain weights can end up being zero, and that's because of the extreme, pointy diamond shape of the region we are interested in, as opposed to the smooth circular shape in L2 regularization.

Let's go back to the problem at hand: how to regularize our model using the vector norm. This is how you apply L2 regularization, also known as weight decay. Remember, we're trying to keep the weight values close to the origin. In 2D space, the weight vector would be confined within a circle. You can easily expand the concept to 3D space, but beyond 3D it's hard to visualize, so don't try. To be perfectly honest, in machine learning we cheat a little in the math department: we use the square of the L2 norm to simplify the calculation of derivatives. Notice there is a new parameter here, lambda. This is a simple scalar value that lets us control how much emphasis we want to put on model simplicity over minimizing training error. It's another tuning parameter which must be explicitly set, and unfortunately the best value for any given problem is data dependent. So we'll need to do some tuning, either manually or automatically, using a tool like hyperparameter tuning, which we will cover in the next module. To apply L1 regularization, we simply swap our L2 norm with the L1 norm. Careful though, the outcome can be very different. L1 regularization results in a solution that's more sparse.
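To make the formula on the slide concrete, here is a minimal NumPy sketch of the same idea; the toy weight vector, the lambda value, and the helper function names are just illustrative, not part of the lesson. It computes the L1 and L2 norms of a weight vector and then adds the squared L2 norm (for weight decay) or the L1 norm, scaled by lambda, to a training loss.

```python
import numpy as np

# Toy weight vector (w0 = a, w1 = b from the 2D example).
w = np.array([3.0, 4.0])

# L2 norm: sqrt(a^2 + b^2), the Euclidean distance from the origin.
l2_norm = np.sqrt(np.sum(w ** 2))   # same as np.linalg.norm(w, ord=2) -> 5.0

# L1 norm: |a| + |b|, the length of the "yellow path".
l1_norm = np.sum(np.abs(w))         # same as np.linalg.norm(w, ord=1) -> 7.0

# L2 regularization (weight decay): add the *squared* L2 norm, scaled by
# lambda, to the training loss. Squaring drops the square root and keeps
# the derivatives simple.
def l2_regularized_loss(training_loss, w, lam):
    return training_loss + lam * np.sum(w ** 2)

# L1 regularization: simply swap the squared L2 norm for the L1 norm.
def l1_regularized_loss(training_loss, w, lam):
    return training_loss + lam * np.sum(np.abs(w))

print(l2_norm, l1_norm)                        # 5.0 7.0
print(l2_regularized_loss(0.8, w, lam=0.1))    # 0.8 + 0.1 * 25 = 3.3
print(l1_regularized_loss(0.8, w, lam=0.1))    # 0.8 + 0.1 * 7  = 1.5
```

Note how lambda scales the penalty term: a larger lambda pushes the optimizer harder toward small weights at the expense of a higher training error.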
Sparsity in this context refers to the fact that some of the weights end up with an optimal value of exactly zero. Remember the diamond shape of the optimal area? This property of L1 regularization is extensively used as a feature selection mechanism. Feature selection simplifies the ML problem by causing a subset of the weights to become zero. Zero weights then highlight the features that aren't informative enough and can be safely discarded.
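As a quick sketch of that feature-selection effect, here is a small scikit-learn example; the toy data, the alpha (lambda) values, and the printed weights are illustrative assumptions, not numbers from the lesson. Lasso is L1-regularized linear regression and Ridge is L2-regularized, so comparing their learned weights shows L1 driving the uninformative features to exactly zero while L2 only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)

# Toy data: 5 features, but only the first two actually drive the target.
X = rng.randn(200, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# L1-regularized linear regression (Lasso): sparse solution.
lasso = Lasso(alpha=0.1).fit(X, y)

# L2-regularized linear regression (Ridge): weights shrink but rarely hit zero.
ridge = Ridge(alpha=0.1).fit(X, y)

print("L1 weights:", np.round(lasso.coef_, 3))  # e.g. [ 2.9 -1.9  0.  0.  0. ]
print("L2 weights:", np.round(ridge.coef_, 3))  # small but nonzero everywhere

# Feature selection: keep only the features whose L1 weight is nonzero.
selected = np.flatnonzero(lasso.coef_ != 0)
print("Features to keep:", selected)            # likely [0 1]
```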