In this video, we'll discuss training. In the last video, we learned that we can use a plane to automatically classify an image. In this video, we will learn how to determine that plane: we use a dataset of images to train the classifier, so that when we have an unknown sample, we can classify it.

First, cost and loss. Training is where you find the best learnable parameters for the decision boundary. Suppose we randomly select a set of learnable parameters, w and b; the superscript is the guess number. In this case, the first decision boundary does a horrible job, as it classifies all the images as cats. The second decision boundary does better. Finally, the third decision boundary performs best.

We need a way to measure how good our decision boundary is. A loss function tells you how good your prediction is. The following loss is called the classification loss. The first column shows the output of the loss function: each time our prediction is correct, the loss function outputs a zero; each time our prediction is incorrect, it outputs a one. The cost is the sum of the losses; it tells us how well our learnable parameters are doing on the dataset. In this case, our model output y hat is incorrect, predicting a cat as a dog and a dog as a cat. In this case, our model output is correct, predicting a dog as a dog and a cat as a cat. For each incorrectly classified sample, the loss is one, increasing the cost; correctly classified samples do not change the cost. For this decision boundary, the cost is three; for this decision boundary, the cost is one; for this decision boundary, the cost is zero.

The cost is a function of the learnable parameters. We see a set of learnable parameters whose decision boundary misclassifies three points; changing the learnable parameters misclassifies the following points; the final learnable parameters perform perfectly. To simplify, let's look at the cost as a function of the bias parameter b.
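The classification loss and cost described here can be sketched in a few lines of Python. The labels and predictions below are made-up examples for illustration, not the samples from the video:

```python
import numpy as np

def classification_loss(y_hat, y):
    """0-1 loss: outputs 0 for each correct prediction, 1 for each incorrect one."""
    return (y_hat != y).astype(int)

def cost(y_hat, y):
    """The cost is the sum of the per-sample losses over the dataset."""
    return int(np.sum(classification_loss(y_hat, y)))

# hypothetical labels: 0 = cat, 1 = dog
y = np.array([0, 0, 1, 1])       # true labels
y_hat = np.array([1, 0, 1, 0])   # predictions: two mistakes
print(classification_loss(y_hat, y))  # → [1 0 0 1]
print(cost(y_hat, y))                 # → 2
```

Each misclassified sample adds one to the cost; correctly classified samples leave it unchanged, exactly as described above.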
We can plot the cost with respect to the learnable parameters; in this case, we plot the cost with respect to the bias parameter b. Let's see the relationship between the cost and the decision boundary. The first line misclassifies the following three points, so the cost for this value of b is three. The second misclassifies the following two points, so the cost is two. The final line performs perfectly: the cost is zero. In reality, the cost is a function of multiple parameters, w and b; even our super simple 2D example has too many parameters to plot.

In practice, the classification error is difficult to work with, so we use the cross-entropy loss, which uses the output of the logistic function rather than the prediction y hat. The cost is still the sum of the losses. The cross-entropy deals with how likely it is that the image belongs to a specific class: if the likelihood of belonging to the incorrect class is large, the cross-entropy loss will be large; if the likelihood of belonging to the correct class is large, the cross-entropy is small, but not zero.

Now let's discuss gradient descent, a method to find the best learnable parameters. Here's a plot of the cost using cross-entropy; notice how the curve is smooth compared to the classification cost curve. If we find the minimum of the cost, we can find the best parameter. The gradient gives you the slope of a function at any point, and gradient descent is a method that uses the gradient to find the minimum of the cost function.

Let's see how gradient descent works. Consider the cost function. We start off with a random guess for the bias parameter, and we use the superscript to indicate the guess number; in this case, it is our first guess, so the superscript is zero. We have to move our guess in the positive direction, and we can move the parameter in that direction by adding a positive number. Examining the sign of the gradient, it is the opposite of the sign of the number we need to add.
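A minimal sketch of the cross-entropy loss built on the logistic function; the input values z below are illustrative assumptions, not numbers from the video:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps z to a likelihood between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(z, y):
    """y is the true label (0 or 1); logistic(z) is the likelihood of class 1."""
    p = logistic(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# likelihood of the correct class is large: loss is small, but not zero
print(cross_entropy_loss(4.0, 1))   # ≈ 0.018
# likelihood of the incorrect class is large: loss is large
print(cross_entropy_loss(-4.0, 1))  # ≈ 4.018
```

Unlike the 0-1 classification loss, this loss is smooth in the parameters, which is what makes gradient descent workable.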
Therefore, we can add a number proportional to the negative of the gradient. Subtracting the gradient also works if we are on the other side of the minimum. In that case, we would like to move in the negative direction, and we can move the parameter value in the negative direction by adding a negative number to the parameter. Examining the sign of the gradient, it is again the opposite of the sign of the number we need to add; therefore, we can add an amount proportional to the negative of the gradient.

Here is the final equation of gradient descent: we add a number proportional to the negative of the gradient, and the superscript i is the iteration number. Eta, the learning rate, dictates how far we move in the direction of the negative gradient; it's usually a small number. We see that for each iteration, the new value of the bias parameter decreases the cost. When we reach the minimum, the gradient is zero, and the parameter value stops updating.

If we use a learning rate that's too low, we may never reach the minimum; if we use a learning rate that's too large, we may oscillate and never reach the minimum. The learning rate is a hyperparameter; we select it by finding the value that gives the best accuracy on validation data.

We can see the relationship between the cost function and the decision plane: each iteration of gradient descent finds a parameter b that decreases the cost, and the decision plane does a better job of separating the classes. It's challenging to perform gradient descent on the threshold function, because its slope is zero in many regions; if we get stuck in these regions, the gradient is zero and the parameters do not update.

The decision plane has multiple parameters, so the gradient is a vector, and we update the parameters as a vector. For the two-dimensional case, we can plot the cost as a surface; it's a bowl shape, and when we update the parameters, we find the minimum. Usually we plot the cost with respect to each iteration i; this is called the learning curve.
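The gradient descent update on the bias parameter can be sketched as follows. The quadratic cost C(b) = (b - 3)**2 and the starting guess are illustrative assumptions standing in for the cross-entropy cost from the video:

```python
def grad(b):
    # derivative of the illustrative cost C(b) = (b - 3)**2
    return 2 * (b - 3)

b = 0.0      # b^0: initial random guess
eta = 0.1    # learning rate eta, usually a small number
for i in range(100):
    b = b - eta * grad(b)   # b^(i+1) = b^i - eta * dC/db

print(b)   # converges toward the minimum at b = 3, where the gradient is zero
```

Note the sign: because the gradient has the opposite sign of the direction we want to move, subtracting it moves the guess toward the minimum from either side.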
Generally, the more parameters you have, the more images and iterations you need to make the model work. Let's look at different learning curves. We can choose a learning rate that's way too large, as shown in the one-dimensional example; we can choose a learning rate that's too small; or we can choose a learning rate that's too large. With a good learning rate, we will reach the minimum of the cost. That's it. Thanks.
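The different learning curves can be reproduced with the same illustrative quadratic cost as before; the specific eta values here are assumptions chosen to show each regime, not values from the video:

```python
def run(eta, steps=50, b=0.0):
    # gradient descent on the illustrative cost C(b) = (b - 3)**2
    for _ in range(steps):
        b = b - eta * 2 * (b - 3)
    return (b - 3) ** 2   # final cost after `steps` iterations

print(run(0.001))  # too small: cost decreases very slowly, still far from zero
print(run(0.1))    # good: cost reaches (almost) zero
print(run(1.2))    # too large: the updates overshoot, oscillate, and diverge
```

Plotting the cost returned by `run` at each iteration for each eta would give the three learning-curve shapes discussed above.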