You've seen that activation functions are important for deep learning models. There are many functions used as activation functions, and in this video you'll see some of the most popular ones used today. This video will focus on four commonly used activation functions that you'll be using in your GANs: the first is ReLU, the second is a variant called Leaky ReLU, and the last two are sigmoid and tanh. There are actually an infinite number of possible activation functions, but not all of them are ideal.

One of the most popular and effective activation functions is known as the Rectified Linear Unit, or ReLU for short. What ReLU does is take the max between z and zero, meaning if its input is z from the current layer l, this activation g, where g is ReLU here, takes the maximum of zero and z. Really what that means is that it squashes out all the negatives. Graphically, the function looks like a straight line with a slope equal to one for positive values: any value coming in as z, such as the value two, will still be the value two after it goes into g. Now, if a negative number z is passed into g, then whenever that value z is negative, it outputs zero. Essentially, no negative values are allowed, so it looks similar to a hockey stick, which makes it non-linear; linear would mean a single straight line. You might notice that ReLU is, strictly speaking, not differentiable at z equals zero, but by convention and in implementations, the derivative of ReLU at z equals zero is often set to zero.

All right. The flat part of the ReLU activation function, when z is negative, always has a derivative equal to zero. This can be problematic because the learning process depends on the derivative to provide important information on how to update the weights. With a zero derivative, some nodes get stuck on the same value and their weights stop learning, so that part of the network stops learning. In fact, the previous components of the network will be affected as well. This is known as the dying ReLU problem, because it's the end of learning, and that's why a variation of ReLU exists, called the Leaky ReLU. What the Leaky ReLU does is maintain the same form as ReLU for the case when z is positive, which means it keeps the same positive value as whatever the input is, but it adds a little leak, or slope, in the line when z is less than zero, when z is negative down here. It's still non-linear, with a bend in the slope at z equals zero, but now it has a non-zero derivative when z is negative. This slope is intended to be less than one, so it doesn't form a single straight line with the positive side, which would make it linear. Note that the derivative at exactly z equals zero is still set to zero by convention. This slope is treated as a hyperparameter, labeled a here, but it's typically set to 0.1, meaning the leak is quite small relative to the positive slope. This solves the dying ReLU problem by and large. In practice, most people still use ReLU, but Leaky ReLU is catching up in popularity.
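To make this concrete, here is a minimal sketch of ReLU and Leaky ReLU and their derivatives, written with NumPy purely for illustration (an assumption; the course implementations may use a framework such as PyTorch instead). The leak slope a is set to 0.1 as mentioned above.

```python
import numpy as np

def relu(z):
    # ReLU: max(z, 0) -- all negative inputs are squashed to zero
    return np.maximum(z, 0.0)

def leaky_relu(z, a=0.1):
    # Leaky ReLU: same as ReLU for positive z, but a small slope a for negative z
    return np.where(z > 0, z, a * z)

def relu_derivative(z):
    # 1 on the positive side, 0 on the flat negative side (the source of dying ReLU);
    # the value at exactly z = 0 is a convention and varies by implementation
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu_derivative(z, a=0.1):
    # 1 on the positive side, a small non-zero slope a on the negative side,
    # so gradients keep flowing even when z is negative
    return np.where(z > 0, 1.0, a)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))                   # 0, 0, 0.5, 2
print(leaky_relu(z))             # -0.2, -0.05, 0.5, 2
print(relu_derivative(z))        # 0, 0, 1, 1
print(leaky_relu_derivative(z))  # 0.1, 0.1, 1, 1
```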
Now I'll show you two other common activation functions that look fairly similar to one another. First is the sigmoid activation, which has a smooth S shape and outputs values between 0 and 1. When z is greater than or equal to zero, the sigmoid outputs a value between 0.5 and 1, and when z is less than zero, it outputs a value between 0 and 0.5. Because it outputs a value between 0 and 1, the sigmoid activation function is often used in the last layer of binary classification models to indicate a probability between 0 and 1, for example, predicting that there is a cat in a picture with a probability of 0.95. The sigmoid activation function isn't used very often in hidden layers, though, because the derivative of the function approaches zero at the tails. You can imagine that this function continues in both directions, because it can take any real value as input; it asymptotically approaches one at the top and zero at the bottom. This produces what you call vanishing gradient problems, because you have these saturated outputs at the tails, where the values will always be close to one or close to zero once the input z gets too far from zero.

Another function with a similar shape to the sigmoid is the hyperbolic tangent, or tanh for short. In contrast with the sigmoid, however, it outputs values between -1 and 1. When z is positive, it outputs a value between 0 and 1, and when z is negative, it outputs a negative value between -1 and 0. One key difference from the sigmoid is that tanh keeps the sign of the input z, so negatives stay negative, which can be useful in some applications. Because its shape is similar to the sigmoid, however, the same saturation and vanishing gradient issues occur: again, the tails extend on both sides, approaching one at the top and negative one at the bottom.

Both are used in neural networks; in fact, all of these activation functions are used in neural networks, and especially in GANs, which you'll be implementing shortly. There are more activations out there, and researchers are developing new ones all the time. If you're interested, you could come up with your own activation function; just make sure it's non-linear and differentiable. To sum up, many different functions are currently used as activation functions. I showed you ReLU, Leaky ReLU, sigmoid, and tanh, and most of them come with their own problems: ReLU with the dying ReLU problem, and sigmoid and tanh with the vanishing gradient and saturation problems. You'll see and use all of the activations presented here in the models that you'll implement throughout the specialization.
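To make the saturation point concrete, here is a small sketch, again assuming NumPy for illustration, of sigmoid and tanh along with their derivatives, showing how the gradients shrink toward zero at the tails.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid: 1 / (1 + e^(-z)), squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma(z) * (1 - sigma(z)), which approaches zero at both tails
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_derivative(z):
    # 1 - tanh(z)^2, which also approaches zero at both tails
    return 1.0 - np.tanh(z) ** 2

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:+.0f}  sigmoid={sigmoid(z):.5f}  d_sigmoid={sigmoid_derivative(z):.5f}  "
          f"tanh={np.tanh(z):+.5f}  d_tanh={tanh_derivative(z):.5f}")
# At z = -10 and z = +10 both derivatives are nearly zero: the outputs are saturated,
# which is the vanishing gradient problem described above.
```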