Hello everyone. In this lecture, we will talk about deep neural networks. We will first talk about the components of deep neural networks, then the forward pass and backward pass algorithms, and then introduce some applications of deep neural networks in health care.

To describe deep neural networks, we start with the simplest one, which comprises a single neuron. A neuron is a computational unit that takes numerical input values x_1 to x_n and their associated weights w_1 to w_n, along with a bias term b, and produces an output value y through a linear combination followed by a nonlinear activation function g. More specifically, the linear combination produces an intermediate output z, which is the weighted sum of the x_i's plus the bias term b. Then we pass this intermediate value z through the nonlinear function g to produce the final output y. Depending on the specific task, the target that y tries to approximate can be either binary for classification tasks or numerical for regression tasks. To learn a model for a single neuron, we need to specify the activation function g and learn the parameters, namely the weights w's and the bias term b, from data.

First, let's talk about activation functions. An activation function describes the nonlinear transformation in a neuron and needs to be specified by the user. It is usually not learned from the data, and popular choices of activation function include sigmoid, tanh, and ReLU. Next, let's talk about each one of them.

The sigmoid function takes an arbitrary real value as input and produces an output in the range of zero to one, which can naturally be interpreted as the probability of an event, for example, the probability of having heart disease. As a result, the sigmoid function is a popular choice for classification tasks. Mathematically, the sigmoid function is 1 over 1 plus e to the minus x. As I said, it has a natural interpretation as a probability, but it also has the vanishing gradient problem.

Vanishing gradient is a tough problem for learning neural networks in general. As we will show later, neural network learning depends on gradient-based optimization. If the gradient is too close to zero, the optimization process cannot make progress. This is called the vanishing gradient problem. For example, the gradient of the sigmoid function at the tails, when the input is very large or very small, is close to zero, so sigmoid may cause the vanishing gradient problem.

The tanh function is another popular activation function. It has multiple mathematical forms: it can be written either as e to the x minus e to the minus x, divided by e to the x plus e to the minus x, or as 2 over 1 plus e to the minus 2x, minus 1. Computationally, the second formulation makes more sense, as it only requires one exponential calculation, while the other one requires two. Tanh is a shifted and rescaled version of sigmoid, and its output values range from minus one to positive one. Because of this rescaling, tanh has stronger or larger gradients than sigmoid, especially near the origin. However, it still has the vanishing gradient problem when x is far away from zero.

The rectified linear function, which is called ReLU, is a simple and more modern activation function. ReLU is specified as the maximum of zero and the input x. Visually, it is a linearly increasing curve for x greater than zero and zero when x is less than or equal to zero. ReLU is different from sigmoid and tanh in the sense that its output does not have an upper bound.
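Before moving on, here is a minimal NumPy sketch of the single-neuron computation and the three activation functions described above. The function names, inputs, weights, and bias are made up purely for illustration; this is a sketch of the idea, not code from the lecture.

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}): maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # 2 / (1 + e^{-2z}) - 1: the one-exponential form of tanh
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0

def relu(z):
    # max(0, z): zero for z <= 0, linear for z > 0, no upper bound
    return np.maximum(0.0, z)

def neuron_forward(x, w, b, g):
    # Linear combination z = sum_i w_i * x_i + b, then nonlinear activation y = g(z)
    z = np.dot(w, x) + b
    return g(z)

# Made-up inputs, weights, and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.3])
b = 0.2
print(neuron_forward(x, w, b, sigmoid))  # bounded in (0, 1), natural for classification
print(neuron_forward(x, w, b, tanh))     # bounded in (-1, 1)
print(neuron_forward(x, w, b, relu))     # zero here, since z is negative
```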
Unlike tanh and sigmoid, ReLU does not suffer from the vanishing gradient problem, as its gradient is a constant one for all values of x greater than zero. To summarize, activation functions specify the nonlinear transformation in a neuron. There are a few popular choices of activation functions, including sigmoid, tanh, and ReLU, and their relative relationships are plotted in this figure: sigmoid and tanh are bounded in a small range, while ReLU is lower bounded by zero without an upper bound. To learn a neural network model, you need to specify the choice of activation functions as part of the neural network architecture.
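To make the gradient comparison concrete, here is a small sketch that evaluates the derivative of each activation at a few points. The helper names (sigmoid_grad, tanh_grad, relu_grad) are my own, assumed for illustration; the derivative formulas themselves are the standard ones.

```python
import numpy as np

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); at most 0.25, near zero at the tails
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; equals 1 at the origin, near zero far from it
    t = np.tanh(z)
    return 1.0 - t * t

def relu_grad(z):
    # d/dz ReLU(z) = 1 for z > 0 and 0 otherwise: a constant, non-vanishing gradient for z > 0
    return (z > 0).astype(float)

zs = np.array([-10.0, 0.0, 10.0])
print("sigmoid:", sigmoid_grad(zs))  # ~[0.000045, 0.25, 0.000045] -> vanishes at the tails
print("tanh:   ", tanh_grad(zs))     # ~[0.0, 1.0, 0.0] -> larger near zero, still vanishes
print("ReLU:   ", relu_grad(zs))     # [0.0, 0.0, 1.0] -> stays at one for positive inputs
```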