0:00

In this video, we're going to look at the softmax output function. This is a way of forcing the outputs of a neural network to sum to one, so they can represent a probability distribution across discrete, mutually exclusive alternatives.

Before we get back to the issue of how we learn feature vectors to represent words, we're going to have one more digression, this time a technical one. So far, I've talked about using a squared error measure for training a neural net, and for linear neurons that's a sensible thing to do. But the squared error measure has some drawbacks. If, for example, the desired output is one, so you have a target of one, and the actual output of a logistic neuron is one billionth, then there's almost no gradient to allow the unit to change. It's way out on a plateau where the slope is almost exactly horizontal. And so it will take a very, very long time to change its weights, even though it's making almost as big an error as it's possible to make.

Also, if we're trying to assign probabilities to mutually exclusive class labels, we know that the outputs should sum to one. Any answer in which we say the probability that it's an A is three quarters, and the probability that it's a B is also three quarters, is just a crazy answer. And we ought to give the network that information; we shouldn't deprive it of the knowledge that these are mutually exclusive answers. So the question is: is there a different cost function that will work better? Is there a way of telling it that these are mutually exclusive, and then using an appropriate cost function? The answer, of course, is that there is.

What we need to do is force the outputs of the neural net to represent a probability distribution across discrete alternatives, if that's what we plan to use them for. The way we do this is by using something called a softmax. It's a kind of soft, continuous version of the maximum function. The way the units in a softmax group work is that they each receive some total input they've accumulated from the layer below. That's z_i for the i-th unit, and it's called the logit. And then they give an output y_i that doesn't just depend on their own z_i; it depends on the z's accumulated by their rivals as well. So we say that the output of the i-th neuron is e to the z_i, divided by the sum of that same quantity over all the neurons in the softmax group: y_i = e^(z_i) / Σ_j e^(z_j). And because the bottom line of that equation is the sum of the top line over all possibilities, we know that when you add up over all possibilities, you'll get one. That is, the sum of all the y_i's must come to one. What's more, the y_i's have to lie between zero and one. So we force the y_i's to represent a probability distribution over mutually exclusive alternatives just by using that softmax equation.
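The softmax equation just described can be sketched in a few lines of plain Python (the function name `softmax` and the example logits are my own, not from the lecture; subtracting the maximum logit before exponentiating is a standard stability trick that cancels out in the ratio):

```python
import math

def softmax(z):
    # Shift by the largest logit for numerical stability; this doesn't change
    # the result because the factor e^(-m) cancels between top and bottom.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    # y_i = e^(z_i) / sum_j e^(z_j)
    return [e / total for e in exps]

y = softmax([2.0, 1.0, 0.1])
print(y)        # each value lies strictly between 0 and 1
print(sum(y))   # the outputs sum to one (up to floating-point rounding)
```

Note that each output depends on all the logits through the shared denominator, exactly as described above: the units in a softmax group are rivals.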

The softmax equation has a nice simple derivative. If you ask how y_i changes as you change z_i, that obviously involves the other z's, because y_i itself depends on all the other z's through the normalization term. But it turns out that you get a nice simple form, just like you do for the logistic unit: the derivative of the output with respect to the input, for an individual neuron in a softmax group, is just ∂y_i/∂z_i = y_i(1 − y_i). It's not totally trivial to derive that. If you try differentiating the equation above, you must remember the terms that turn up in the normalization term on the bottom row. It's very easy to forget those terms and get the wrong answer.
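One way to convince yourself the derivative is right without redoing the algebra is a finite-difference check: nudge one logit and compare the measured slope to y_i(1 − y_i). A sketch (the logit values and step size are my own choices):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 0.5]
i, eps = 1, 1e-6
y = softmax(z)

# Central finite difference of y_i with respect to z_i.
z_plus = z[:];  z_plus[i] += eps
z_minus = z[:]; z_minus[i] -= eps
numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)

# The claimed analytic form: dy_i/dz_i = y_i * (1 - y_i).
analytic = y[i] * (1 - y[i])
print(numeric, analytic)  # the two values agree to many decimal places
```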

Â 4:12

Now the question is: if we're using a softmax group for the outputs, what's the right cost function? And the answer, as usual, is that the most appropriate cost function is the negative log probability of the correct answer. That is, we want to maximize the log probability of getting the answer right. So if one of the target values is a one and the remaining ones are zero, then we simply sum over all possible answers, putting zeros in front of all the wrong answers and a one in front of the right answer: C = −Σ_j t_j log y_j. That gets us the negative log probability of the correct answer, as you can see in the equation. That's called the cross-entropy cost function. It has the nice property that it has a very big gradient when the target value is one and the output is almost zero. You can see that by considering a couple of cases.
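The two cases discussed next (an output of one in a million versus one in a billion, when the target is one) can be checked directly. A sketch; the specific probabilities are the ones from the lecture:

```python
import math

# Cross-entropy when the target for this class is 1: C = -log(y_correct).
c_million = -math.log(1e-6)   # output is one in a million
c_billion = -math.log(1e-9)   # output is one in a billion

print(c_million)  # roughly 13.8
print(c_billion)  # roughly 20.7

# The outputs differ by less than a millionth, yet C improves by
# log(1000), about 6.9 -- a very steep cost surface near zero.
print(c_billion - c_million)
```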

Â 5:17

So a value of one in a million is much better than a value of one in a billion, even though the outputs differ by less than one millionth. When you increase the output value by less than one millionth, the value of C improves by a lot. That means C has a very, very steep gradient there. One way of seeing why a value of one in a million is much better than a value of one in a billion, if the correct answer is one, is this: if you believed the one-in-a-million probability and made a bet at those odds, you'd lose a million dollars when the answer came up. If you thought the probability was one in a billion, you'd lose a billion dollars making the same bet. So we get a nice property.

Â 6:01

That cost function C has a very steep derivative when the answer is very wrong, and that exactly balances the fact that the way in which the output changes as you change the input, ∂y/∂z, is very flat when the answer is very wrong. And when you multiply the two together to get the derivative of the cross-entropy with respect to the logit going into output unit i, you use the chain rule: that derivative is how fast the cost function changes as you change the output of a unit, times how fast the output of that unit changes as you change z_i. And notice we need to sum over all the j's, because when you change z_i, the outputs of all the different units change. The result is just the actual output minus the target output: ∂C/∂z_i = y_i − t_i. And you can see that when the actual and target outputs are very different, that has a slope of one or minus one, and the slope is never bigger than one or minus one. But the slope never gets small until the two things are pretty much the same. In other words, until you're getting pretty much the right answer.
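The y_i − t_i result can also be verified numerically, by comparing a finite-difference derivative of the cross-entropy with respect to each logit against the actual output minus the target. A sketch, with logits and a one-hot target of my own choosing:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, t):
    # C = -sum_j t_j * log(y_j)
    y = softmax(z)
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y))

z = [0.3, 1.2, -0.7]
t = [0.0, 1.0, 0.0]   # the correct answer is class 1
eps = 1e-6
y = softmax(z)

for i in range(len(z)):
    z_plus = z[:];  z_plus[i] += eps
    z_minus = z[:]; z_minus[i] -= eps
    numeric = (cross_entropy(z_plus, t) - cross_entropy(z_minus, t)) / (2 * eps)
    print(numeric, y[i] - t[i])  # the two columns match for every unit
```

The denominator of the softmax is what makes the cross terms appear in the chain rule, and it's exactly those cross terms that collapse the sum down to the simple y_i − t_i form.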
