The fundamental goal of the activation function is to add nonlinearity to the network so that it can model nonlinear relationships. We generally use functions like the hyperbolic tangent, which has a region of the input space where the hidden unit is inactive, a region where it is active, and an intermediate region where the transition occurs.

When training neural networks, the backpropagation algorithm computes the gradient of the loss with respect to the weights, which by the chain rule involves the derivative of each hidden unit's activation function with respect to its input. This derivative determines how the weights feeding that hidden unit change from one iteration of the optimization to the next. For the hyperbolic tangent, the derivative is very close to zero in both the inactive and the active regions, where the function saturates at -1 or +1. This is problematic: because the gradient is what drives changes in the weight values, a neuron stuck in a saturated region effectively stops training. Its weights no longer change, and the neuron cannot encode any more information about relationships in the data.

This "vanishing gradient" problem is compounded in deep neural networks because backpropagation multiplies gradients from successive layers together. If the gradients are small in the final few layers, the effect cascades back to the first few layers of the network, where training slows to a crawl or stops altogether.

A common deep learning alternative is the rectified linear unit (ReLU). This piecewise function is 0 for all negative inputs and equal to the input for all positive inputs. It has an inactive region where it is zero (all negative inputs) and an active region that does not saturate: the gradient is nonzero for all positive inputs.
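To make the contrast concrete, here is a minimal NumPy sketch (the function names are ours, chosen for illustration) comparing the two derivatives and showing how saturation compounds across layers:

```python
import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2; near 0 where tanh saturates at +/-1
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    # Derivative of ReLU: 0 for negative inputs, 1 for positive inputs.
    # No derivative formula is evaluated -- only a check of the input's sign.
    return (np.asarray(x) > 0).astype(float)

xs = np.array([-5.0, -0.5, 0.5, 5.0])
print(tanh_grad(xs))   # tiny at the saturated ends, larger near 0
print(relu_grad(xs))   # exactly 0 or exactly 1

# Backprop multiplies per-layer gradients, so saturation compounds with depth:
depth = 10
saturated = tanh_grad(3.0) ** depth  # shrinks toward 0 exponentially
active = relu_grad(3.0) ** depth     # stays exactly 1
print(saturated, active)
```

Raising each derivative to the tenth power stands in for a 10-layer chain-rule product: the saturated tanh gradient all but vanishes, while the active ReLU gradient passes through unchanged.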
The ReLU has a vanishing gradient only in its inactive region, which reduces the number of neurons that "die" with their gradients stuck at zero. Its gradient is also computationally cheap: it is 0 for negative inputs and 1 for positive inputs, so backpropagation never needs to evaluate a derivative formula; the derivative can be assigned directly from the sign of the input. Although we have compared the rectified linear unit only with the hyperbolic tangent, there are other activation functions used in deep learning that address the same issues.
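One such alternative is the leaky ReLU, which keeps a small nonzero slope for negative inputs so that no neuron's gradient is ever exactly zero. A minimal sketch (the slope parameter `alpha` is a conventional default, not a required value):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope alpha
    # instead of being zeroed.
    return np.where(np.asarray(x) > 0, x, alpha * np.asarray(x))

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha for negative inputs,
    # so "dead" neurons can still recover during training.
    return np.where(np.asarray(x) > 0, 1.0, alpha)

xs = np.array([-2.0, 3.0])
print(leaky_relu(xs))       # small negative slope for x < 0, identity for x > 0
print(leaky_relu_grad(xs))  # alpha for x < 0, exactly 1 for x > 0
```

Because the negative-side gradient is alpha rather than zero, weights feeding a neuron that has drifted into the inactive region still receive a (small) update each iteration.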