0:00

In an earlier video, I wrote down a form for the cost function for logistic regression. In this optional video, I want to give you a quick justification for why we like to use that cost function for logistic regression.

To quickly recap, in logistic regression, we have that the prediction y hat is sigmoid of w transpose x + b, where sigmoid is this familiar function. And we said that we want to interpret y hat as p(y = 1 | x). So we want our algorithm to output y hat as the chance that y = 1 for a given set of input features x. So another way to say this is that if y is equal to 1, then the chance of y given x is equal to y hat. And conversely, if y is equal to 0, then

Â 1:00

the chance that y is 0 is 1 - y hat, right? So if y hat is the chance that y = 1, then 1 - y hat is the chance that y = 0.

So, let me take these last two equations and just copy them to the next slide. What I'm going to do is take these two equations, which basically define p(y|x) for the two cases of y = 0 and y = 1, and summarize them into a single equation. And just to point out, y has to be either 0 or 1, because in binary classification, y = 0 and y = 1 are the only two possible cases. Now let me take these two equations and summarize them as follows. Let me just write out what it looks like, and then we'll explain why it looks like that.

So, p(y|x) = y hat to the power of y, times (1 - y hat) to the power of (1 - y). It turns out this one line summarizes the two equations on top.

Let me explain why. In the first case, suppose y = 1. If y = 1, then the first term ends up being y hat, because that's y hat to the power of 1. The second term ends up being (1 - y hat) to the power of 1 - 1, so that's the power of 0. But anything to the power of 0 is equal to 1, so that goes away. And so this equation says p(y|x) = y hat when y = 1, which is exactly what we wanted.

Now how about the second case, what if y = 0? If y = 0, then this equation above is p(y|x) = y hat to the power of 0, but anything to the power of 0 is equal to 1, so that's just equal to 1 times (1 - y hat) to the power of (1 - y). And 1 - y is 1 - 0, which is just 1. And so this is equal to 1 times (1 - y hat) to the power of 1, which is 1 - y hat.
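As a quick numerical sanity check of this case analysis, here is a small Python sketch (the value 0.7 for y hat is just an illustrative probability, not from the lecture):

```python
# Check that the single formula p(y|x) = y_hat**y * (1 - y_hat)**(1 - y)
# reduces to y_hat when y = 1 and to 1 - y_hat when y = 0.

def p_y_given_x(y, y_hat):
    """Combined expression for p(y|x) when y is 0 or 1."""
    return y_hat ** y * (1 - y_hat) ** (1 - y)

y_hat = 0.7  # hypothetical predicted probability that y = 1

assert p_y_given_x(1, y_hat) == y_hat      # y = 1 case: exponent on (1 - y_hat) is 0
assert p_y_given_x(0, y_hat) == 1 - y_hat  # y = 0 case: exponent on y_hat is 0
```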

3:25

So you've seen that this single equation is a correct definition for p(y|x).

Now, finally, because the log function is a strictly monotonically increasing function, maximizing log p(y|x) gives you the same result as maximizing p(y|x). And if you compute log of p(y|x), that's equal to log of y hat to the power of y, times (1 - y hat) to the power of (1 - y). And so that simplifies to y log y hat + (1 - y) log (1 - y hat). And so this is actually the negative of the loss function that we defined previously.

And there's a negative sign there because, usually, if you're training a learning algorithm, you want to make probabilities large, whereas in logistic regression we want to minimize the loss function. So minimizing the loss corresponds to maximizing the log of the probability.
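That relationship between the loss and the log probability can be sketched in a few lines of Python (the (y, y hat) pairs are illustrative values, not from the lecture):

```python
import math

def log_p(y, y_hat):
    # log p(y|x) = y * log(y_hat) + (1 - y) * log(1 - y_hat)
    return y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)

def loss(y_hat, y):
    # The logistic regression loss L(y_hat, y) from the earlier video.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# For any example, log p(y|x) is exactly the negative of the loss,
# so minimizing the loss maximizes the log probability.
for y, y_hat in [(1, 0.9), (0, 0.2)]:
    assert math.isclose(log_p(y, y_hat), -loss(y_hat, y))
```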

So this is what the loss function on a single example looks like. How about the cost function, the overall cost function on the entire training set of m examples? Let's figure that out. Consider the probability of all the labels in the training set, writing this a little bit informally. If you assume that the training examples are drawn IID, independently and identically distributed, then the probability of all the labels is the product of the individual probabilities: the product from i = 1 through m of p(y(i) | x(i)).

And so if you want to carry out maximum likelihood estimation, you want to find the parameters that maximize the chance of your observations in the training set. But maximizing this is the same as maximizing the log, so we just put logs on both sides. So the log of the probability of the labels in the training set is equal to, well, the log of a product is the sum of the logs. So that's the sum from i = 1 through m of log p(y(i) | x(i)). And we have previously figured out on the previous slide that this is negative L of (y hat (i), y(i)).
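The product-to-sum step, and the link to the per-example losses, can be verified numerically. In this sketch the labels and predictions are a made-up tiny training set of m = 4 examples, not values from the lecture:

```python
import math

# Hypothetical predictions and labels for a tiny training set (m = 4).
y_hats = [0.9, 0.2, 0.8, 0.6]
ys     = [1,   0,   1,   1]

def p(y, y_hat):
    return y_hat ** y * (1 - y_hat) ** (1 - y)

def loss(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Probability of all the labels under the IID assumption: a product over examples.
prob_all = math.prod(p(y, yh) for y, yh in zip(ys, y_hats))

# Log of a product is the sum of the logs...
sum_of_logs = sum(math.log(p(y, yh)) for y, yh in zip(ys, y_hats))
assert math.isclose(math.log(prob_all), sum_of_logs)

# ...and each log p(y(i)|x(i)) equals the negative of the loss on example i.
neg_sum_losses = -sum(loss(yh, y) for y, yh in zip(ys, y_hats))
assert math.isclose(sum_of_logs, neg_sum_losses)
```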

Â 5:48

And so in statistics, there's a principle called the principle of maximum likelihood estimation, which just means to choose the parameters that maximize this thing. Or in other words, that maximize this thing: negative sum from i = 1 through m of L(y hat (i), y(i)), where we just move the negative sign outside the summation. So this justifies the cost we had for logistic regression, which is J(w, b) of this. And because we now want to minimize the cost instead of maximizing the likelihood, we get rid of the minus sign. And then finally, for convenience, to make sure that our quantities are better scaled, we just add a 1 over m extra scaling factor there.
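Putting those pieces together, the resulting cost can be sketched as follows. This assumes the predictions y hat have already been computed from w, b, and x; it is a minimal illustration, not the course's reference implementation:

```python
import math

def cost(y_hats, ys):
    """J(w, b) = (1/m) * sum over i of L(y_hat(i), y(i)),
    with L(y_hat, y) = -(y*log(y_hat) + (1 - y)*log(1 - y_hat))."""
    m = len(ys)
    total = sum(-(y * math.log(yh) + (1 - y) * math.log(1 - yh))
                for y, yh in zip(ys, y_hats))
    return total / m

# Example with made-up predictions: confident, correct predictions give low cost.
print(cost([0.9, 0.2], [1, 0]))
```

Note that minimizing this averaged cost over (w, b) is the same as maximizing the log likelihood, since the 1/m factor and the sign flip don't change which parameters are optimal.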

So to summarize, by minimizing this cost function J(w, b), we're really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples were drawn IID, independently and identically distributed. So thank you for watching this video, even though it's optional. I hope this gives you a sense of why we use the cost function we do for logistic regression.

And with that, I hope you go on to the exercises, the programming exercise and the quiz questions for this week. And best of luck with both the quizzes and the programming exercise.
Â