One way to learn the parameters of the neural network so that it gives you

a good encoding for your pictures of faces is to

define the triplet loss function and apply gradient descent to it.

Let's see what that means.

To apply the triplet loss,

you need to compare pairs of images.

For example, given this picture,

to learn the parameters of the neural network,

you have to look at several pictures at the same time.

For example, given this pair of images,

you want their encodings to be similar because these are the same person.

Whereas, given this pair of images,

you want their encodings to be quite different because these are different persons.

In the terminology of the triplet loss,

what you're going to do is always look at one anchor image, and

then you want the distance between the anchor and the positive image,

really a positive example,

meaning an image of the same person, to be small.

Whereas, when the anchor is compared to

the negative example, you want their distance to be much further apart.

So, this is what gives rise to the term triplet loss,

which is that you'll always be looking at three images at a time.

You'll be looking at an anchor image,

a positive image, as well as a negative image.

And I'm going to abbreviate anchor, positive, and negative as A,

P, and N. So to formalize this,

what you want is for the parameters of

your neural network of your encodings to have the following property,

which is that you want the squared norm of the difference

between the encoding of the anchor and the encoding of the positive example,

||f(A) - f(P)||², to be small, and in particular,

you want this to be less than or equal to the squared norm

of the difference between the encoding of the anchor and the encoding of the negative,

||f(A) - f(N)||²,

where of course, this is d of A,

P and this is d of A,

N. And you can think of d as a distance function,

which is why we named it with the letter d. Now,

if we move the term from the right side of this equation to the left side,

what you end up with is ||f(A) - f(P)||² minus,

taking the right-hand side now,

||f(A) - f(N)||²,

and you want this to be less than or equal to zero.
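As a quick numerical sketch of what this constraint means (this is just illustrative NumPy, not part of any face recognition library; the random vectors here are stand-ins for the encodings f(A), f(P), f(N)):

```python
import numpy as np

def d(u, v):
    # Squared L2 norm between two encodings: ||u - v||^2
    return float(np.sum((u - v) ** 2))

rng = np.random.default_rng(0)
f_A = rng.normal(size=128)                # encoding of the anchor
f_P = f_A + 0.01 * rng.normal(size=128)   # same person: close to f(A)
f_N = rng.normal(size=128)                # different person

# The rearranged constraint: ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 <= 0
satisfied = d(f_A, f_P) - d(f_A, f_N) <= 0
```

Because f(P) was built to lie close to f(A) while f(N) is unrelated, d(A, P) comes out far smaller than d(A, N) and the constraint holds.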

But now, we're going to make a slight change to this expression,

which is one trivial way to make sure this is satisfied,

is to just learn everything equals zero.

If f always equals zero,

then this is zero minus zero,

which is zero, and this is zero minus zero, which is also zero.

And so, well, by saying f of any image equals a vector of all zeroes,

you can almost trivially satisfy this equation.

So, you need to make sure that the neural network doesn't just output zero for all the encodings,

and also that it doesn't set all the encodings equal to each other.

That second case is another way for the neural network to give a trivial output:

if the encoding for every image were identical to the encoding of every other image,

you would again get zero minus zero.

So to prevent a neural network from doing that,

what we're going to do is modify this objective to say that,

this doesn't need to be just less than or equal to zero,

it needs to be quite a bit smaller than zero.

So, in particular, if we say this needs to be less than negative alpha,

where alpha is another hyperparameter,

then this prevents the neural network from outputting the trivial solutions.

And by convention, usually,

we write plus alpha instead of negative alpha there.

And this is also called a margin,

which is terminology that you'd be familiar

with if you've also seen the literature on support vector machines,

but don't worry about it if you haven't.

And we can also modify this equation on top by adding this margin parameter.

So to give an example,

let's say the margin is set to 0.2.

If in this example,

d of the anchor and the positive is equal to 0.5,

then you won't be satisfied if d between

the anchor and the negative was just a little bit bigger, say 0.51.

Even though 0.51 is bigger than 0.5,

you're saying that's not good enough; we want d(A,

N) to be much bigger than d(A,

P), and in particular,

you want it to be at least 0.7 or higher.

Alternatively, to achieve this margin or this gap of at least 0.2,

you could either push d(A, N) up or push d(A, P) down so

that there is at least this gap, this

hyperparameter alpha of 0.2, between

the distance between the anchor and the positive versus the anchor and the negative.

So that's what having a margin parameter here does,

which is it pushes

the anchor positive pair and the anchor negative pair further away from each other.
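In code, the margin check from the numbers above looks like the following (a toy arithmetic sketch; `margin_satisfied` is an illustrative name, and the values 0.5, 0.51, 0.7, and alpha = 0.2 are the ones from the example):

```python
def margin_satisfied(d_ap, d_an, alpha=0.2):
    # The constraint d(A,P) + alpha <= d(A,N): the gap between the
    # two distances must be at least the margin alpha.
    return d_ap + alpha <= d_an

# d(A,N) = 0.51 is bigger than d(A,P) = 0.5, but not by the margin 0.2:
print(margin_satisfied(0.5, 0.51))  # False
# d(A,N) = 0.7 leaves a gap of at least 0.2, which satisfies the margin:
print(margin_satisfied(0.5, 0.7))   # True
```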

So, let's take this equation we have here at the bottom,

and on the next slide,

formalize it, and define the triplet loss function.

So, the triplet loss function is defined on triples of images.

So, given three images,

A, P, and N,

the anchor positive and negative examples.

So the positive example is of the same person as the anchor,

but the negative is of a different person than the anchor.

We're going to define the loss as follows.

The loss on this example,

which is really defined on a triplet of images is,

let me first copy over what we had on the previous slide.

So, that was ||f(A) - f(P)||²

minus ||f(A) - f(N)||²,

and then plus alpha, the margin parameter.

And what you want is for this to be less than or equal to zero.

So, to define the loss function,

let's take the max between this and zero.

So, the effect of taking the max here is that,

so long as this is less than or equal to zero,

then the loss is zero,

because the max of something less than or equal to zero

and zero is just zero.

So, so long as you achieve the goal of making this thing I've underlined in green,

so long as you've achieved the objective of making that less than or equal to zero,

then the loss on this example is equal to zero.

But if, on the other hand,

this is greater than zero,

then when you take the max,

the max ends up selecting

this thing I've underlined in green,

and so you would have a positive loss.

So, by trying to minimize this,

you have the effect of trying to push this term to be zero,

or less than or equal to zero.

And then, so long as it is less than or equal to zero,

the neural network doesn't care how much more negative it is.

So, this is how you define the loss on

a single triplet and the overall cost function for your neural network can

be sum over a training set of these individual losses on different triplets.
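A minimal sketch of this loss and cost in NumPy (illustrative only; the function names are mine, and a real implementation would compute the encodings with the network and use a framework's autodiff for gradient descent):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # max( ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha , 0 )
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return max(d_ap - d_an + alpha, 0.0)

def cost(triplets, alpha=0.2):
    # Overall cost J: the sum of the losses over all training triplets
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)
```

When d(A, N) already exceeds d(A, P) by more than the margin, the loss on that triplet is exactly zero, matching the max with zero described above.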

So, if you have a training set of say 10,000 pictures with 1,000 different persons,

what you'd have to do is take your 10,000 pictures and use them to

select triplets like this, and then train

your learning algorithm using gradient descent on this type of cost function,

which is really defined on triplets of images drawn from your training set.

Notice that in order to define this dataset of triplets,

you do need some pairs of A and P. Pairs of pictures of the same person.

So for the purpose of training your system,

you do need a dataset where you have multiple pictures of the same person.

That's why in this example,

I said if you have 10,000 pictures of 1,000 different persons,

so maybe you have 10 pictures, on average,

of each of your 1,000 persons to make up your entire dataset.

If you had just one picture of each person,

then you can't actually train this system.

But of course, after having trained the system,

you can then apply it to

your one-shot learning problem, where for your face recognition system,

maybe you have only a single picture of someone you might be trying to recognize.

But for your training set,

you do need to make sure you have multiple images of the same person at least for

some people in your training set so that you can

have pairs of anchor and positive images.
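One illustrative way to form such triplets from a labeled dataset (purely random selection here, kept deliberately simple; `make_triplets` is a made-up helper name, not from any library):

```python
import random
from collections import defaultdict

def make_triplets(labels, n_triplets, seed=0):
    # Group image indices by person. Only persons with at least two
    # pictures can supply an anchor/positive pair.
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for i, person in enumerate(labels):
        by_person[person].append(i)
    eligible = [p for p, idxs in by_person.items() if len(idxs) >= 2]
    triplets = []
    for _ in range(n_triplets):
        person = rng.choice(eligible)
        a, p = rng.sample(by_person[person], 2)        # anchor, positive
        other = rng.choice([q for q in by_person if q != person])
        n = rng.choice(by_person[other])               # negative
        triplets.append((a, p, n))
    return triplets
```

Note that a person with only one picture can still appear as a negative, but never as an anchor/positive pair, which is exactly why the training set needs multiple pictures for at least some people.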

Now, how do you actually choose these triplets to form your training set?

If you choose A, P,

and N randomly from your training set, subject to A and P being of the same person,

and A and N being of different persons,

one of the problems is that if they're chosen at random,

then this constraint is very easy to satisfy.

Because given two randomly chosen pictures of people,

chances are A and N are much more different than

A and P. I hope you still recognize this notation;

this d(A, P) is what we had written on the previous slides with the encodings.

So this is just equal to the

squared norm distance between the encodings that we had on the previous slide.

But if A and N are two randomly chosen different persons,

then there is a very high chance that this will be bigger

than the term on the left by much more than the margin alpha.

And so, the neural network won't learn much from it.

So to construct a training set,

what you want to do is to choose triplets A,

P, and N that are hard to train on.

So in particular, what you want is for this constraint to be satisfied for all triplets.

So, a triplet that is hard will be if you choose values for A, P,

and N so that maybe d(A,

P) is actually quite close to d(A,N).

So in that case,

the learning algorithm has to try extra hard to take

this thing on the right and try to push it up or take this thing on

the left and try to push it down so that there is

at least a margin of alpha between the left side and the right side.

And the effect of choosing these triplets is that it

increases the computational efficiency of your learning algorithm.

If you choose your triplets randomly,

then too many triplets would be really easy, and so,

gradient descent won't do anything because your neural network will just get them right,

pretty much all the time.

And it's only by using hard triplets that the gradient descent procedure

has to do some work to try to push these quantities further away from those quantities.
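Here is a sketch of one simple hard-triplet selection strategy: for each anchor, under the current encodings, pick the farthest positive and the nearest negative, and keep only triplets that still violate the margin. This is an illustration of the idea, not the exact FaceNet procedure, and the function name is mine:

```python
import numpy as np

def hard_triplets(encodings, labels, alpha=0.2):
    # For each anchor, pick the positive with the largest d(A,P) and the
    # negative with the smallest d(A,N), then keep only triplets that
    # still violate the margin, i.e. d(A,P) - d(A,N) + alpha > 0.
    labels = np.asarray(labels)
    diff = encodings[:, None, :] - encodings[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)   # pairwise squared distances
    triplets = []
    for a in range(len(labels)):
        same = (labels == labels[a]) & (np.arange(len(labels)) != a)
        other = labels != labels[a]
        if not same.any() or not other.any():
            continue
        p = np.where(same)[0][np.argmax(dists[a][same])]
        n = np.where(other)[0][np.argmin(dists[a][other])]
        if dists[a, p] - dists[a, n] + alpha > 0:
            triplets.append((a, p, n))
    return triplets
```

Triplets the network already gets right by more than the margin contribute zero loss and zero gradient, so filtering them out is what saves the computation.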

And if you're interested,

the details are presented in this paper by Florian Schroff, Dmitry Kalenichenko,

and James Philbin, where they have a system called FaceNet,

which is where a lot of the ideas I'm presenting in this video come from.

By the way, this is also a fun fact about

how algorithms are often named in the deep learning world,

which is that if you work on a certain domain, call it blank,

you often have a system called blank-net or deep-blank.

So, we've been talking about face recognition.

So this paper is called FaceNet,

and in the last video,

you just saw DeepFace.

But this idea of a blank net or deep blank is

a very popular way of naming algorithms in the deep learning world.

And you should feel free to take a look at that paper if you want to learn some of

these other details for speeding up your algorithm

by choosing the most useful triplets to train on; it is a nice paper.

So, just to wrap up,

to train on triplet loss,

you need to take your training set and map it to a lot of triples.

So, here is our triple with an anchor and a positive,

both for the same person and the negative of a different person.

Here's another one where the anchor and positive are of

the same person but the anchor and negative are of different persons and so on.

And what you do, having defined this training set

of anchor, positive, and negative triples, is use

gradient descent to try to minimize the cost function J we defined on an earlier slide,

and that will have the effect of backpropagating to

all of the parameters of the neural network in order to learn

an encoding so that d of

two images will be small when these two images are of the same person,

and it'll be large when these are two images of different persons.


So that's it for the triplet loss and how you can use it to

train a neural network to output a good encoding for face recognition.

Now, it turns out that today's face recognition systems

especially the large-scale commercial face recognition systems,

are trained on very large datasets.

Datasets north of a million images is not uncommon,

some companies are using north of 10 million images and some companies

have north of 100 million images with which to try to train these systems.

So these are very large datasets even by modern standards,

and these datasets are not easy to acquire.

Fortunately, some of these companies have trained

these large networks and posted parameters online.

So, rather than trying to train one of these networks from scratch,

because of the sheer data volume,

this is one domain where it might often be useful for you

to download someone else's pre-trained model,

rather than do everything from scratch yourself.

But even if you do download someone else's pre-trained model,

I think it's still useful to know how these algorithms were trained,

in case you need to apply these ideas from scratch yourself for some application.

So that's it for the triplet loss.

In the next video,

I want to show you also some other variations

on Siamese networks and how to train these systems.

Let's go onto the next video.