Multilayer neural networks. A neural network is constructed by connecting many neurons so that the output of one neuron becomes the input of another. Take this simple multilayer neural network as an example. From left to right, we have the input layer, which has three input units, x_1 to x_3, and one bias unit; the bias unit can be thought of as always supplying the input 1. The middle layer is called the hidden layer, since its values are not observed in the training data. In this example, the hidden layer has three hidden units, h_1, h_2, and h_3, and again one bias unit. The rightmost layer is called the output layer; in our example it has only one output unit, y. By convention, the input layer is considered layer 1, and the remaining layers are layers 2, 3, and so on. Here, the hidden layer is layer 2 and the output layer is layer 3.

Next, the parameters in this network. The weight w^(1)_{ji} denotes the connection between the j-th node of the hidden layer and the i-th node of the input layer. For example, w^(1)_{12} connects the input x_2 to the first hidden unit h_1. The order of the indices is actually quite important: the first index corresponds to the output, that is, the node in the next layer, and the second index corresponds to the input, the node in the previous layer. Why is it specified this way? It will become clear when we deal with the matrix version of this in a few slides. The whole thing becomes a matrix operation, so the first index corresponds to the row of the weight matrix and the second index to the column. For now, just keep in mind that the first index is the output and the second index is the input. We also have the bias terms b^(1)_j; for example, b^(1)_3 connects the bias unit to the third hidden unit. Likewise, w^(2)_i indicates the weight from h_i to the output y, and b^(2) is the bias term for that layer. For example, w^(2)_1 connects hidden unit h_1 to the output unit y. Because there is only one output unit in this neural network, these weights carry a single subscript, unlike the previous layer where each weight has two subscripts. Note that the superscripts (1) and (2) indicate the associated layer numbers.

Besides the weights and bias terms, we also have activation functions. g^(2) is the activation function of layer 2, and a^(2)_1 is the output of h_1 in layer 2. The nonlinear activation g^(2) is applied to the linear combination of x_1, x_2, and x_3 using the weights w^(1)_{11}, w^(1)_{12}, and w^(1)_{13} and the bias term b^(1)_1; all of these weights come from layer 1, which is why they carry the superscript (1). Similarly, the output unit has another activation function, g^(3). Next, we will go into more detail and explain the full forward-pass calculation. But keep in mind the overall plan: once we have the derivatives of the loss function with respect to all the weight parameters w_{ji} and the bias terms b_j, we can run standard SGD updates to learn the parameters of this neural network. Next, we will go through the forward pass and the backward pass, and then we can understand how those derivatives are calculated. Let's first focus on the forward pass.
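To make the notation and the learning setup concrete, here is a minimal NumPy sketch of the parameters of this 3-3-1 network and of the standard SGD update; the small random initialization, the learning rate of 0.01, and the function name sgd_step are illustrative assumptions, not something fixed in the slides.

```python
import numpy as np

# Parameters of the 3-3-1 network from the slides.
# W1[j, i] corresponds to w^(1)_{ji}: row j -> hidden unit h_j, column i -> input x_i.
W1 = np.random.randn(3, 3) * 0.1   # layer-1 weights (3 hidden units x 3 inputs)
b1 = np.zeros(3)                   # layer-1 biases b^(1)_j, one per hidden unit
W2 = np.random.randn(1, 3) * 0.1   # layer-2 weights w^(2)_i; one row, since y is a single unit
b2 = np.zeros(1)                   # layer-2 bias b^(2)

def sgd_step(param, grad, lr=0.01):
    """Standard SGD update: move each parameter against its gradient.
    `grad` stands for dLoss/dparam, which backpropagation will provide later."""
    return param - lr * grad
```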
Multiple quantities are computed in the forward pass. The forward pass is important for scoring a new data point to obtain the output y, and it is also a crucial building block for learning the neural network, as we will explain later with the backpropagation algorithm. Let's first illustrate the forward computation step by step. In this example, given the three-dimensional data point x_1, x_2, x_3, we need to compute three linear combinations, one for each hidden unit h_1, h_2, and h_3. In particular, z^(2)_1 is the linear combination for hidden unit h_1, computed from the input x with the weights w^(1)_{11}, w^(1)_{12}, and w^(1)_{13} and the bias term b^(1)_1; again, the superscript (1) indicates the layer number. Once we have the linear combination, we apply the nonlinear activation g^(2) to z^(2)_1 to obtain the output of h_1, which is a^(2)_1. Similarly, the second linear combination, for unit h_2, is z^(2)_2, a weighted sum of the inputs with weights w^(1)_{21}, w^(1)_{22}, and w^(1)_{23} plus the bias term b^(1)_2. Unit h_2 then applies the same nonlinear activation g^(2) to z^(2)_2 to obtain its output, a^(2)_2. We do the same for h_3: first compute the linear combination z^(2)_3, then apply the same activation function to get a^(2)_3 = g^(2)(z^(2)_3).

Now we have the outputs of the hidden layer, and we can treat these outputs a^(2)_1, a^(2)_2, and a^(2)_3 as the input to the next layer, which in this network is the output layer. In particular, we compute the linear combination z^(3) as a weighted sum of a^(2)_1, a^(2)_2, and a^(2)_3 with the weights w^(2)_1, w^(2)_2, and w^(2)_3 plus the bias term b^(2). Note that, unlike the previous layer, there is only a single output unit y, so there is only one index for w and no index for the bias term b. The final output y is obtained by applying the activation function g^(3) to the linear combination z^(3).

To summarize, we collect all the forward-computation equations for this particular neural network here. The network is fully connected, meaning that all units in adjacent layers are connected: x_1, x_2, x_3 connect to h_1, h_2, and h_3, and h_1, h_2, h_3 connect to y. Each unit computes a linear combination z, followed by a nonlinear activation a. As you can see, there is a lot of symmetry: many neurons perform similar operations, and collectively they learn a complex function mapping the input to the output.

The forward computation can be represented in a more compact vector notation. All the weights w_{ij} become entries of a weight matrix W, and the bias terms become a vector. The linear combination becomes a matrix-vector product: z^(2) = W^(1) x + b^(1). The activation function g^(2) is applied element-wise to the vector z^(2). Similarly, we have another matrix-vector multiplication plus another bias vector, z^(3) = W^(2) a^(2) + b^(2), and the activation function g^(3) is applied element-wise to z^(3).
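To show how the vector-form equations turn into code, here is a minimal NumPy sketch of the forward pass for this particular 3-3-1 network; the sigmoid activation and the random initialization are placeholder assumptions, since the slides do not fix either.

```python
import numpy as np

def sigmoid(z):
    # Placeholder activation; the slides leave g^(2) and g^(3) unspecified.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Vector-form forward pass for the 3-3-1 network:
    z^(2) = W^(1) x + b^(1),  a^(2) = g^(2)(z^(2)),
    z^(3) = W^(2) a^(2) + b^(2),  y = g^(3)(z^(3))."""
    z2 = W1 @ x + b1      # three linear combinations, one per hidden unit
    a2 = sigmoid(z2)      # element-wise activation for h_1, h_2, h_3
    z3 = W2 @ a2 + b2     # single linear combination for the output unit
    return sigmoid(z3)    # final output y

# Score one 3-dimensional data point with randomly initialized parameters.
x = np.array([0.5, -1.0, 2.0])
W1, b1 = np.random.randn(3, 3), np.zeros(3)
W2, b2 = np.random.randn(1, 3), np.zeros(1)
y = forward(x, W1, b1, W2, b2)
```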
Here is a complete summary of the forward pass over the neural network: on one side the element-wise computation, and on the other the vector-form computation. It is actually crucial to represent neural network computation as vector operations, because they can be run much more efficiently in parallel, especially on a GPU. Finally, when you have a neural network with multiple layers, the most general form of this forward computation, from layer l to layer l+1, is given by the following equations: the linear combination z^(l+1) = W^(l) a^(l) + b^(l), and the activation a^(l+1) = g^(l+1)(z^(l+1)), which is just the activation function g^(l+1) applied element-wise to the vector z^(l+1).
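As a minimal sketch of this general form, assuming the per-layer weight matrices, bias vectors, and activation functions are simply kept in Python lists (the function name and argument layout are illustrative, not part of the slides):

```python
import numpy as np

def forward_all_layers(x, weights, biases, activations):
    """General forward pass: for each layer l,
    z^(l+1) = W^(l) a^(l) + b^(l)  and  a^(l+1) = g^(l+1)(z^(l+1)).
    weights[l], biases[l], activations[l] hold W^(l), b^(l), and g^(l+1)."""
    a = x  # a^(1) is simply the input vector
    for W, b, g in zip(weights, biases, activations):
        z = W @ a + b   # matrix-vector product plus bias vector
        a = g(z)        # activation applied element-wise
    return a            # activation of the final (output) layer
```

Writing the loop this way makes the reliance on matrix-vector products explicit, which is exactly what allows the computation to be parallelized efficiently on a GPU.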