Now let's consider the simplest neural network, a single neuron. We break the computation of a neuron into two steps: a linear combination and a nonlinear transformation. The linear combination computes a weighted sum over the inputs x_1 through x_n plus a bias term b, z = w_1 x_1 + ... + w_n x_n + b; we call this intermediate result z. The nonlinear transformation then applies an activation function to z to produce the output y. In a supervised learning setting, we want the output y to be close to a target label t. For example, t can be a binary indicator of whether a patient has heart disease or not. To measure model quality, we need to specify a loss function that quantifies the difference between the output y and the target label t. For example, we can use the squared loss, also called the squared Euclidean distance, L = (1/2)(y - t)^2. The goal of training such a neuron is to minimize this loss function on the training data by adjusting the weights and the bias term of the network.

Before we talk about neural network learning, we need to introduce a general optimization method called gradient descent. Gradient descent is a basic optimization method that has been used widely in machine learning applications. Next, let's go through the high-level procedure for applying the gradient descent method to classification or regression problems.

Recall that the input to any classification or regression problem is a training dataset consisting of n pairs of data points x and the corresponding targets y. The optimization process outputs the parameter set Theta for a specific classification or regression model. For example, Theta in the linear regression model is the set of linear coefficients Beta.

First, we specify the likelihood function of the data given the model parameters Theta. The likelihood function is the joint probability of the data given the model parameters Theta. In this step, we often use the log-likelihood instead of the likelihood function: since the log transformation is monotonically increasing, it leads to the same optimum as the original likelihood function, but the log-likelihood is usually easier to manipulate numerically, so we prefer it here.

Second, we find the derivative, also called the gradient, of the likelihood or log-likelihood function. This step is crucial because most of the computation really happens here, and depending on how the likelihood function is specified, finding the derivative can be easy or sometimes hard.

After that, we update the model parameters by moving the old parameters in the opposite direction of the gradient of the loss, which is equivalent to moving in the direction that increases the log-likelihood.

Finally, we repeat this process until it converges, which means Theta is no longer changing.

A few remarks. This algorithm requires some additional tuning parameters, in particular the learning rate, or step size, which controls how far we move Theta based on the gradient. The step size can be tuned through cross-validation. Gradient descent is the simplest gradient-based optimization method; there are many other more advanced gradient-based algorithms, such as conjugate gradient and quasi-Newton methods. But as long as you can specify the formula for computing the gradient, you can use those more advanced optimization methods as a black box.
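To make these steps concrete, here is a minimal Python sketch of the single-neuron computation, the squared loss, and a generic gradient descent loop. The sigmoid activation, the NumPy dependency, and the values for the learning rate and stopping tolerance are illustrative assumptions, not something fixed by the lecture:

```python
import numpy as np

def sigmoid(z):
    # Nonlinear transformation: squashes z into (0, 1).
    # (Sigmoid is an assumed choice of activation function.)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, b, x):
    # Step 1: linear combination of the inputs plus the bias.
    z = np.dot(w, x) + b
    # Step 2: apply the activation function to z.
    return sigmoid(z)

def squared_loss(y, t):
    # Squared Euclidean distance between output y and target t.
    return 0.5 * (y - t) ** 2

def gradient_descent(theta, grad_loss, eta=0.01, tol=1e-6, max_iters=10_000):
    # Generic gradient descent: repeatedly move theta in the opposite
    # direction of the loss gradient until theta stops changing.
    for _ in range(max_iters):
        step = eta * grad_loss(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

For example, gradient_descent(np.array([1.0, -2.0]), lambda th: 2 * th) descends the bowl-shaped loss ||theta||^2, whose gradient is 2*theta, toward its minimum at the origin.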
Here is an illustration of the gradient descent algorithm for linear regression. The input data are a set of data points x, each mapped to an output target y. First, we specify the log-likelihood function for linear regression, which takes the form of a constant minus a sum-of-squared-errors term, (1/2) Σ_i (y_i - β^T x_i)^2, corresponding to the log-likelihood of a Gaussian distribution. Second, we compute the gradient of this log-likelihood function, which gives Σ_i (y_i - β^T x_i) x_ij for each coefficient. Note that this notation means the partial derivative with respect to the j-th element of Beta, and the entire gradient is a vector collecting these partial derivatives over all j. Third, we update each β_j by moving in the opposite direction of the gradient of the negative log-likelihood, that is, in the direction that increases the log-likelihood: β_j ← β_j + η Σ_i (y_i - β^T x_i) x_ij. Here eta (η) is the learning rate, usually set to a small constant. Note that this gradient computation requires going through all the data points: the summation runs over all i in the entire dataset, which can be expensive for a large dataset. We then repeat this process many times until it converges.

The stochastic gradient descent, or SGD, method is a variant of gradient descent for handling large datasets. When the gradient computation on the entire dataset is too expensive, we can use SGD. The idea is to compute the likelihood function on a random subset of data points, say B data points. Sometimes SGD refers specifically to performing such an online update on a single data point, that is, B = 1, while for larger B we call these mini-batch updates. Regardless, the idea is that when you have a large input dataset, you take a subset, compute the gradient on that subset, update the parameters, and then iterate. That's the general idea of SGD, and most neural network optimization uses it.

Here is an illustration of using stochastic gradient descent for the linear regression problem. Given a large dataset, we compute the log-likelihood on a single data point, compute the gradient for that single data point, update the Beta coefficients with that single gradient calculation, then randomly sample another data point and repeat the process. That's the SGD algorithm for linear regression.

Now let's look at general SGD for neural networks. The algorithm iteratively takes a single data point, which consists of a set of input features x and an output target t. You also need to specify a learning rate, eta (η). For each neural network, we have the weight parameters w and the bias b. We initialize them to some small values, then iteratively pick a training example (x, t) and compute the gradients for the weight vector w and the scalar bias b. We then update the weight vector w with the newly calculated gradient, and likewise update the bias b with its newly calculated derivative. As a reminder, the nabla operator ∇_w denotes taking the partial derivative with respect to each of the different w's and collecting them together in vector form. Neural network learning is really about how to efficiently calculate all these gradients and then update the corresponding parameters using SGD.
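As a sketch of this procedure, here is what SGD with B = 1 could look like for a single sigmoid neuron trained with the squared loss from earlier. The sigmoid activation, the initialization scale, the learning rate, and the epoch count are illustrative assumptions rather than values given in the lecture:

```python
import numpy as np

def sgd_single_neuron(data, eta=0.1, epochs=10, seed=0):
    """Train a single sigmoid neuron with squared loss using SGD (B = 1).

    data: list of (x, t) pairs, where x is a feature vector and t a scalar target.
    """
    rng = np.random.default_rng(seed)
    n_features = len(data[0][0])
    w = rng.normal(scale=0.01, size=n_features)  # small random initial weights
    b = 0.0                                      # scalar bias
    for _ in range(epochs):
        for i in rng.permutation(len(data)):     # visit examples in random order
            x, t = data[i]
            x = np.asarray(x, dtype=float)
            z = np.dot(w, x) + b                 # step 1: linear combination
            y = 1.0 / (1.0 + np.exp(-z))         # step 2: sigmoid activation (assumed)
            # Chain rule for the squared loss L = 0.5 * (y - t)^2:
            # dL/dz = (y - t) * y * (1 - y); dL/dw = dL/dz * x; dL/db = dL/dz.
            dz = (y - t) * y * (1.0 - y)
            w -= eta * dz * x                    # move opposite to the gradient
            b -= eta * dz
    return w, b
```

Each update here touches only one example, so the per-step cost is independent of the dataset size, which is exactly why SGD scales to large datasets.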