In the previous video, you saw the logistic regression model to train the parameters W and B, of logistic regression model. You need to define a cost function, let's take a look at the cost function. You can use to train logistic regression to recap this is what we have defined from the previous slide. So you output Y hat is sigmoid of W transports experts be where sigmoid of Z is as defined here. So to learn parameters for your model, you're given a training set of training examples and it seems natural that you want to find parameters WNB. So that at least on the training set, the outputs you have the predictions you have on the training set, which only writers. Why had I that that will be close to the ground troops labels why I that you got in the training set. So to fill in a little bit more detail for the equation on top, we had said that why had as defined at the top for a training example X. And of course for each training example, we're using these superscripts with round brackets with parentheses to index into different train examples. Your prediction on training example I which is why had I is going to be obtained by taking the sigmoid function and applying it to W transporters X I the input for the training example plus B. And you can also define Z I as follows where Z I is equal to, you know, W transport Z I plus B. So throughout this course we're going to use this notation all convention that the super strip parentheses I refers to data be an X or Y or Z. Or something else associated with the I've training example associated with the life example, okay, that's what the superscript I in parenthesis means. Now let's see what loss function or an error function we can use to measure how well our album is doing. One thing you could do is define the loss when your algorithm outputs y hat and the true label is why to be maybe the square error or one half a square error. It turns out that you could do this, but in logistic regression people don't usually do this. Because when you come to learn the parameters, you find that the optimization problem, which we'll talk about later becomes non convex. So you end up with optimization problem, you're with multiple local optima. So great in dissent, may not find a global optimum, if you didn't understand the last couple of comments, don't worry about it, Ww'll get to it in a later video. But the intuition to take away is that dysfunction L called the lost function is a function will need to define to measure how good our output white hat is when the true label is y. And squared era seems like it might be a reasonable choice except that it makes great in descent not work well. So in logistic regression were actually define a different loss function that plays a similar role as squared error but will give us an optimization problem that is convex. And so we'll see in a later video becomes much easier to optimize, so what we use in logistic regression is actually the following loss function, which I'm just going right out here is negative. y lock y hat plus 1 minus y log 1 minus, y hat here's some intuition the widest loss function makes sense. Keep in mind that if we're using squared error then you want to square error to be as small as possible. And with this logistic regression, lost function will also want this to be as small as possible. To understand why this makes sense, let's look at the two cases, in the first case let's say y is equal to 1, then the loss function. y hat comma Y is just this first term right in this negative science, it's negative laws y hat if y is equal to 1. Because if y equals 1, then the second term 1 minus Y is equal to 0, so this says if y equals 1, you want negative log y hat to be as small as possible. So that means you want log y hat to be large to be as big as possible, and that means you want y hat to be large. But because y hat is you know the sigmoid function, it can never be bigger than one. So this is saying that if y is equal to 1, you want, why I had to be as big as possible, but it can't ever be bigger than one. So saying you want, y hat to be close to one as well, the other cases Y equals zero, if Y equals 0. Then this first term in the loss function is equal to 0 because y equals 0, and in the second term defines the lost function. So the loss becomes negative Log 1 minus y hat, and so if in your learning procedure you try to make the lost function small. What this means is that you want, Log 1 minus y hat to be large and because it's a negative sign there. And then through a similar piece of reasoning, you can conclude that this loss function is trying to make y hat as small as possible, and again, because y hat has to be between zero and 1. This is saying that if y is equal to zero then you're lost function will push the parameters to make white hat as close to zero as possible. Now there are a lot of functions with roughly this effect that if y is equal to one, try to make white head large and y is equal to zero or try to make white head small. We just gave here in green a somewhat informal justification for this particular lost function will provide an optional video later to give a more formal justification for y. In logistic regression, we like to use the loss function with this particular form. Finally, the last function was defined with respect to a single training example. It measures how well you're doing on a single training example, I'm now going to define something called the cost function, which measures how are you doing on the entire training set? So the cost function jay, which is applied to your parameters W N B, is going to be the average, really one of the m of the sun of the lost function apply to each of the training examples. In turn, we're here, y hat is of course the prediction output by your logistic regression algorithm using, you know, a particular set of parameters W N B. And so just to expand this out, this is equal to negative one of them, some from I equals one through of the definition of the lost function above. So this is y I log y hat I plus 1 minus Y, I log 1minus y hat I I guess it can put square brackets here. So the minus sign is outside everything else, so the terminology I'm going to use is that the loss function is applied to just a single training example. Like so and the cost function is the cost of your parameters, so in training your logistic regression model, we're going to try to find parameters W n B. That minimize the overall cost function J, written at the bottom. So you've just seen the setup for the logistic regression algorithm, the lost function for training example, and the overall cost function for the parameters of your algorithm. It turns out that logistic regression can be viewed as a very, very small neural network. In the next video, we'll go over that so you can start gaining intuition about what neural networks do. So that let's go on to the next video about how to view logistic regression as a very small neural network.