So, if the object is, classes 1, 2 or 3, pc will be equal to 1.

And if it's the background class, so

if it's none of the objects you're trying to detect, then pc will be 0.

And pc you can think of that as standing for

the probability that there's an object.

Probability that one of the classes you're trying to detect is there.

So something other than the background class.

Next if there is an object, then you wanted to output bx,

by, bh and bw, the bounding box for the object you detected.

And finally if there is an object, so if pc is equal to 1,

you wanted to also output c1, c2 and

c3 which tells us is it the class 1, class 2 or class 3.

So is it a pedestrian, a car or a motorcycle.

And remember in the problem we're addressing

we assume that your image has only one object.

So at most, one of these objects appears in the picture,

in this classification with localization problem.

So let's go through a couple of examples.

If this is a training set image, so if that is x, then y will be

the first component pc will be equal to 1 because there is an object, then bx, by,

by, bh and bw will specify the bounding box.

So your labeled training set will need bounding boxes in the labels.

And then finally this is a car, so it's class 2.

So c1 will be 0 because it's not a pedestrian,

c2 will be 1 because it is car, c3 will be 0 since it is not a motorcycle.

So among c1, c2 and c3 at most one of them should be equal to 1.

So that's if there's an object in the image.

What if there's no object in the image?

What if we have a training example where x is equal to that?

In this case, pc would be equal to 0, and

the rest of the elements of this, will be don't cares,

so I'm going to write question marks in all of them.

So this is a don't care, because if there is no object in this image,

then you don't care what bounding box the neural network outputs as well as

which of the three objects, c1, c2, c3 it thinks it is.

So given a set of label training examples, this is how you will construct x,

the input image as well as y, the cost label both for

images where there is an object and for images where there is no object.

And the set of this will then define your training set.