And one thing I didn't really show but had alluded to is that softmax regression or

the softmax activation function generalizes the logistic activation

function to C classes rather than just two classes.

And it turns out that if C = 2, then softmax

essentially reduces to logistic regression.

And I'm not going to prove this in this video but the rough outline for

the proof is that if C = 2 and if you apply softmax,

then the output layer, a[L], will output two numbers if C = 2,

so maybe it outputs 0.842 and 0.158, right?

And these two numbers always have to sum to 1.

And because these two numbers always have to sum to 1, they're actually redundant.

And maybe you don't need to bother to compute two of them,

maybe you just need to compute one of them.

And it turns out that the way you end up computing that number reduces to

the way that logistic regression is computing its single output.

So that wasn't much of a proof but the takeaway from this is that softmax

regression is a generalization of logistic regression to more than two classes.
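The reduction sketched above can be checked numerically: with C = 2, the first softmax output equals the sigmoid applied to the difference of the two logits, which is exactly the single number logistic regression computes. Here is a minimal sketch in Python, assuming NumPy; the logit values are just illustrative.

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = np.array([1.7, 0.3])   # two logits, so C = 2
p = softmax(z)             # two probabilities that sum to 1

# The first softmax output equals sigmoid of the logit difference,
# so only one number really needs to be computed -- the other is 1 minus it.
print(p[0], sigmoid(z[0] - z[1]))
```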

Now let's look at how you would actually train a neural network

with a softmax output layer.

So in particular,

let's define the loss functions you use to train your neural network.

Let's take an example.

Let's say we have an example in your training set where the target output,

the ground truth label, is 0 1 0 0.

So the example from the previous video,

this means that this is an image of a cat because it falls into Class 1.

And now let's say that your neural network is currently outputting y hat equals,

so y hat would be a vector of probabilities that sum to 1,

0.3, 0.2, 0.1, 0.4, so you can check that sums to 1, and this is going to be a[L].

So the neural network's not doing very well in this example, because this is

actually a cat, yet it assigned only a 20% chance that this is a cat.

So it didn't do very well on this example.
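As a numeric sketch of this worked example in Python (assuming NumPy, and assuming the standard cross-entropy loss used with softmax; the first entry 0.3 of y hat is inferred from the sum-to-1 constraint, since the transcript gives the cat entry as 0.2 and the last two as 0.1 and 0.4):

```python
import numpy as np

# Ground-truth label: class 1 (cat) is the correct class
y = np.array([0, 1, 0, 0])

# Network output a[L]; the 0.3 entry is inferred from the
# sum-to-1 constraint, the 0.2 cat entry from the transcript
y_hat = np.array([0.3, 0.2, 0.1, 0.4])

assert np.isclose(y_hat.sum(), 1.0)   # a valid probability vector

# Cross-entropy picks out -log of the probability assigned to the true class
loss = -np.sum(y * np.log(y_hat))
print(loss)   # large, because the network gave the cat class only 0.2
```

Since y is one-hot, the sum collapses to -log(0.2), so a small probability on the true class yields a large loss, matching the intuition that the network did poorly here.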