Training neural networks requires tuning the weight parameters to minimize the error function. The goal is to choose weights that give the best model performance, but since we start from random weights, we need an optimization procedure that iteratively improves them. The simplest such algorithm is gradient descent: the derivative of the error function is taken with respect to the weights, and a step is taken in the direction of the negative gradient. Stepping in the negative gradient direction changes the weight values in a way that reduces the overall error. This procedure is iterated until convergence, which occurs when the gradient is zero or very close to zero. The intuitive interpretation of gradient descent is that we are moving the weight values through an error landscape in the direction of steepest descent: at each iteration, we find the direction in which the error decreases the most, and we take a step that way. Each iteration of gradient descent is generally cheap to compute, but it can take many iterations before convergence occurs.

A more advanced method is the Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS), which uses an approximation to the matrix of second derivatives to update the weights. This method belongs to the class of quasi-Newton methods, meaning that instead of stepping along the negative gradient, we step toward the local extremum of a quadratic Taylor expansion of the error function. The basic idea is that, given an error function, we can write a quadratic approximation of it and find the extremum of that approximation directly. This procedure is iterated until convergence, just as with gradient descent, and like gradient descent it finds local rather than global extrema. Each BFGS iteration is generally more computationally expensive, but far fewer iterations are usually needed than with gradient descent.
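As an illustration, consider a hypothetical one-dimensional error function E(w) = (w - 3)^2 with its minimum at w = 3. The loop below is a minimal gradient-descent sketch (the starting weight, learning rate, and iteration count are arbitrary choices, not prescriptions); the final lines show the quasi-Newton idea on the same function, where a single step based on the quadratic approximation lands exactly on the minimum because E is itself quadratic.

```python
def error(w):
    return (w - 3.0) ** 2      # hypothetical error function, minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # dE/dw

# Gradient descent: many cheap steps along the negative gradient.
w = 0.0                        # starting weight
lr = 0.1                       # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)          # w ends up very close to 3.0

# Quasi-Newton idea: step to the extremum of the quadratic approximation.
# Here the second derivative of E is exactly 2, so one Newton-style step
# from w0 = 0 reaches the minimum immediately.
w0 = 0.0
w_newton = w0 - grad(w0) / 2.0  # lands at 3.0 in a single step
```

For a general (non-quadratic) error function the Newton-style step would use an approximation to the second-derivative matrix, as BFGS does, and several iterations would still be needed.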
Deep learning models generally have many parameters and are trained on large data sets. The previous two optimization algorithms run into trouble on extremely large problems, so shortcuts must be taken to train deep neural networks effectively. Stochastic gradient descent uses the same mathematical formalism as gradient descent, but instead of computing the gradient over the entire data set, it computes the gradient using a single randomly chosen row of data. This approximate gradient is much cheaper to compute than the full gradient, dramatically speeding up training. Because it does not use the full gradient, stochastic gradient descent can take many more iterations to converge, but each iteration is computed so much faster that the overall algorithm typically converges in far less time than traditional or "batch" gradient descent. Generalizations of stochastic gradient descent use "mini-batches," computing the gradient over a small subset of the data instead of a single row; these methods can be thought of as an intermediate between full-batch gradient descent and pure stochastic gradient descent.

The BFGS algorithm is already designed to avoid computing the full Hessian matrix (the matrix of second derivatives), but even approximating the Hessian can be computationally intensive. Limited-memory BFGS (often called L-BFGS) is a variant of BFGS that builds its Hessian approximation from a small amount of stored information rather than maintaining the whole matrix. The BFGS approximation is constructed by iteratively computing differences in gradients and storing those differences; L-BFGS saves memory and time by keeping only the gradient differences from the most recent iterations.
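The single-row update at the heart of stochastic gradient descent can be sketched in a few lines. Below, one weight w is fit to the hypothetical data y = 2x by repeatedly sampling a single row and stepping against the gradient of the squared error on that row alone; the learning rate, data, and iteration count are illustrative choices.

```python
import random

random.seed(0)
xs = [float(i) for i in range(1, 11)]      # toy inputs
ys = [2.0 * x for x in xs]                 # targets generated by "true" weight 2.0

w = 0.0        # starting weight
lr = 0.001     # learning rate (step size)
for _ in range(1000):
    i = random.randrange(len(xs))          # one random row, not the full data set
    g = 2.0 * (w * xs[i] - ys[i]) * xs[i]  # gradient of (w*x - y)^2 on that row
    w -= lr * g                            # step toward the negative gradient

# A mini-batch variant would instead average g over a small random subset,
# e.g. the rows indexed by random.sample(range(len(xs)), k).
```

Each update here touches one row of data, so its cost is independent of the data set size, which is exactly why the method scales to large training sets.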
The "limited memory" part of the name means that the update step in the optimization routine is computed from a local approximation to the second derivative, built from differences of the most recent gradients while discarding information from earlier iterations, and hence from distant regions of the error landscape. Both stochastic gradient descent and L-BFGS were developed to overcome the computational difficulties encountered when training deep learning models with traditional methods.
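A compact way to see the limited-memory idea is the standard L-BFGS "two-loop recursion," which reconstructs an approximate Newton step using only the m most recent pairs (s, y) of weight differences and gradient differences. The sketch below applies it to a hypothetical two-dimensional quadratic error function; the memory size, line-search constants, and tolerance are illustrative choices rather than a production implementation.

```python
from collections import deque

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def two_loop_direction(g, memory):
    """Approximate the Newton direction -H^{-1} g from stored (s, y) pairs."""
    q = list(g)
    alphas = []
    for s, y in reversed(memory):              # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if memory:
        s, y = memory[-1]
        gamma = dot(s, y) / dot(y, y)          # scaling for the initial Hessian guess
    else:
        gamma = 1.0                            # no curvature info yet: steepest descent
    r = [gamma * qi for qi in q]
    for (s, y), (a, rho) in zip(memory, reversed(alphas)):  # oldest pair first
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]

# Hypothetical error function E(w) = (w0 - 1)^2 + 10*(w1 + 2)^2, minimum at (1, -2).
def error(w):
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2

def grad(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]

w = [0.0, 0.0]
g = grad(w)
memory = deque(maxlen=5)                       # "limited memory": only 5 recent pairs
for _ in range(50):
    d = two_loop_direction(g, memory)
    step = 1.0                                 # crude backtracking (Armijo) line search
    while error([wi + step * di for wi, di in zip(w, d)]) > error(w) + 1e-4 * step * dot(g, d):
        step *= 0.5
    w_new = [wi + step * di for wi, di in zip(w, d)]
    g_new = grad(w_new)
    memory.append(([a - b for a, b in zip(w_new, w)],   # s: change in weights
                   [a - b for a, b in zip(g_new, g)]))  # y: change in gradients
    w, g = w_new, g_new
    if max(abs(gi) for gi in g) < 1e-8:        # converged: gradient is ~zero
        break
```

The deque with `maxlen=5` is the whole "limited memory": old (s, y) pairs fall off automatically, so the cost per iteration stays proportional to the memory size times the number of parameters, never to the full Hessian.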