0:01

Hello and welcome back.

This week you'll learn about object detection.

This is one of the areas of computer vision that's just exploding and is working so much better than just a couple of years ago.

In order to build up to object detection, you first learn about object localization.

Let's start by defining what that means.

You're already familiar with the image classification task, where an algorithm looks at this picture and might be responsible for saying this is a car.

So that was classification.

0:34

The problem you'll learn to build a neural network to address later in this video is classification with localization.

This means not only do you have to label this as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image.

So that's called the classification with localization problem, where the term localization refers to figuring out where in the picture is the car you've detected.

Later this week, you'll then learn about the detection problem, where now there might be multiple objects in the picture, and you have to detect them all and localize them all.

And if you're doing this for an autonomous driving application, then you might need to detect not just other cars, but maybe also pedestrians and motorcycles and maybe even other objects.

So you'll see that later this week.

So in the terminology we'll use this week, the classification and the classification with localization problems usually have one object, usually one big object in the middle of the image, that you're trying to recognize, or recognize and localize.

In contrast, in the detection problem there can be multiple objects, and in fact, maybe even multiple objects of different categories within a single image.

So the ideas you've learned about for image classification will be useful for classification with localization.

And the ideas you learn for localization will then turn out to be useful for detection.

So let's start by talking about classification with localization.

2:15

You're already familiar with the image classification problem, in which you might input a picture into a ConvNet with multiple layers, so that's our ConvNet, and this results in a vector of features that is fed to maybe a softmax unit that outputs the predicted class.

So if you are building a self-driving car, maybe your object categories are the following, where you might have a pedestrian, or a car, or a motorcycle, or a background, which means none of the above.

So if there's no pedestrian, no car, no motorcycle, then you might output background.

So these are your classes, and you have a softmax with four possible outputs.

So this is the standard classification pipeline.

How about if you want to localize the car in the image as well?

To do that, you can change your neural network to have a few more output units that output a bounding box.

So, in particular, you can have the neural network output four more numbers, and I'm going to call them bx, by, bh, and bw.

And these four numbers parameterize the bounding box of the detected object.

So in these videos, I am going to use the notational convention that the upper left of the image is denoted as the coordinate (0,0), and the lower right is (1,1).

So, specifying the bounding box, the red rectangle, requires specifying the midpoint, so that's the point (bx, by), as well as the height, that would be bh, as well as the width, bw, of this bounding box.

So now if your training set contains not just the object class label, which a neural network is trying to predict up here, but also contains four additional numbers giving the bounding box, then you can use supervised learning to make your algorithm output not just a class label but also the four parameters that tell you where the bounding box of the detected object is.

So in this example, the ideal bx might be about 0.5, because this is about halfway to the right of the image.

by might be about 0.7, since it's maybe about 70% of the way down the image.

bh might be about 0.3, because the height of this red rectangle is about 30% of the overall height of the image.

And bw might be about 0.4, let's say, because the width of the red box is about 0.4 of the overall width of the entire image.
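As a quick aside, this midpoint/size convention can be made concrete with a small sketch. This is not code from the course; the 640x480 image size is a hypothetical example, and the function simply applies the (0,0)-upper-left, (1,1)-lower-right convention described above.

```python
# Hedged sketch: convert a normalized midpoint/size box (bx, by, bh, bw),
# with (0,0) at the upper left and (1,1) at the lower right, into pixel
# coordinates. The 640x480 image size is a hypothetical example.
def box_to_pixels(bx, by, bh, bw, img_w, img_h):
    """Return (left, top, width, height) of the box in whole pixels."""
    left = round((bx - bw / 2) * img_w)   # midpoint minus half the width
    top = round((by - bh / 2) * img_h)    # midpoint minus half the height
    return (left, top, round(bw * img_w), round(bh * img_h))

# The example box from the lecture: bx=0.5, by=0.7, bh=0.3, bw=0.4.
print(box_to_pixels(0.5, 0.7, 0.3, 0.4, 640, 480))
# -> (192, 264, 256, 144)
```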

5:55

So, if the object is class 1, 2, or 3, pc will be equal to 1.

And if it's the background class, so if it's none of the objects you're trying to detect, then pc will be 0.

And pc, you can think of that as standing for the probability that there's an object, the probability that one of the classes you're trying to detect is there, so something other than the background class.

Next, if there is an object, then you want to output bx, by, bh, and bw, the bounding box for the object you detected.

And finally, if there is an object, so if pc is equal to 1, you want to also output c1, c2, and c3, which tell us whether it's class 1, class 2, or class 3, so is it a pedestrian, a car, or a motorcycle?

And remember, in the problem we're addressing, we assume that your image has only one object, so at most one of these objects appears in the picture in this classification with localization problem.

So let's go through a couple of examples.

If this is a training set image, so if that is x, then in y the first component, pc, will be equal to 1 because there is an object, and then bx, by, bh, and bw will specify the bounding box.

So your labeled training set will need bounding boxes in the labels.

And then finally, this is a car, so it's class 2.

So c1 will be 0 because it's not a pedestrian, c2 will be 1 because it is a car, and c3 will be 0 since it is not a motorcycle.

So among c1, c2, and c3, at most one of them should be equal to 1.

So that's if there's an object in the image.

What if there's no object in the image?

What if we have a training example where x is equal to that?

In this case, pc would be equal to 0, and the rest of the elements will be don't cares, so I'm going to write question marks in all of them.

So this is a don't care, because if there is no object in this image, then you don't care what bounding box the neural network outputs, as well as which of the three objects, c1, c2, or c3, it thinks it is.

So given a set of labeled training examples, this is how you will construct x, the input image, as well as y, the class label, both for images where there is an object and for images where there is no object.

And the set of these will then define your training set.
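To make the label layout concrete, here is a small sketch (not code from the course) that builds the 8-component vector y = [pc, bx, by, bh, bw, c1, c2, c3] described above, using None to stand in for the question-mark (don't care) entries. The class numbering (0 = pedestrian, 1 = car, 2 = motorcycle) follows the lecture.

```python
# Sketch of the 8-component label y = [pc, bx, by, bh, bw, c1, c2, c3]
# from the lecture. None stands in for the "don't care" (?) entries.
def make_label(class_id=None, box=None):
    """class_id: 0 = pedestrian, 1 = car, 2 = motorcycle; None = background.
    box: (bx, by, bh, bw) in the normalized [0, 1] coordinates."""
    if class_id is None:              # background image: no object present
        return [0] + [None] * 7      # pc = 0, everything else is a don't-care
    c = [0, 0, 0]
    c[class_id] = 1                   # one-hot class indicator (exactly one 1)
    return [1, *box, *c]              # pc = 1, then the box, then the class

y_car = make_label(class_id=1, box=(0.5, 0.7, 0.3, 0.4))
# -> [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
y_background = make_label()
# -> [0, None, None, None, None, None, None, None]
```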

8:47

Finally, let's describe the loss function you use to train the neural network.

So the ground truth label was y, and the neural network outputs some y hat.

What should the loss be?

Well, if you're using squared error, then the loss can be (y1 hat - y1) squared + (y2 hat - y2) squared + ... + (y8 hat - y8) squared.

Notice that y here has eight components, so the loss is the sum of the squares of the differences of the elements.

And that's the loss if y1 = 1, so that's the case where there is an object.

So y1 = pc.

So if pc = 1, that is, if there is an object in the image, then the loss can be the sum of squares of all the different elements.

9:48

The other case is if y1 = 0, so that's if pc = 0.

In that case the loss can be just (y1 hat - y1) squared, because in that second case, all of the rest of the components are don't cares.

And so all you care about is how accurately the neural network is outputting pc in that case.

So just to recap, if y1 = 1, that's this case, then you can use squared error to penalize the squared deviation between the predicted and the actual values of all eight components.

Whereas if y1 = 0, then the second through the eighth components are don't cares.

So all you care about is how accurately your neural network is estimating y1, which is equal to pc.
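These two cases fit in a few lines of code. This is a minimal illustration of the squared-error loss as just described, not the course's implementation; it assumes labels are 8-element lists with y[0] = pc, as in the earlier label sketch.

```python
# Minimal sketch of the squared-error loss described above:
# all eight terms when pc (= y[0]) is 1, only the pc term when pc is 0.
def localization_loss(y_hat, y):
    if y[0] == 1:                   # object present: penalize all 8 components
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    return (y_hat[0] - y[0]) ** 2   # background: only the pc output matters

y_true = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]        # the car example from earlier
print(localization_loss(y_true, y_true))          # a perfect prediction -> 0
```

Note that when pc = 0, the don't-care entries are never touched, which mirrors the question marks in the label.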

Just as a side comment for those of you who want to know all the details, I've used squared error just to simplify the description here.

In practice, you could probably use a log-likelihood loss for c1, c2, and c3, with a softmax output over those three elements; usually you'd use squared error or something like squared error for the bounding box coordinates; and for pc you could use something like the logistic regression loss.

Although even if you use squared error for everything, it'll probably work okay.
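For the curious, that mixed loss can be sketched as follows. This is a hand-written illustration of the idea, not code from the course: a logistic (log) loss for pc, squared error for the box, and a log-likelihood (cross-entropy) term for the classes, applied only when an object is present. It assumes y_hat already holds probabilities for pc and the class entries (post-sigmoid / post-softmax).

```python
import math

# Hedged sketch of the mixed loss mentioned above. Assumes y_hat holds
# probabilities for pc and the classes, and raw numbers for the box, with
# both y_hat and y laid out as [pc, bx, by, bh, bw, c1, c2, c3].
def mixed_loss(y_hat, y):
    pc_hat = y_hat[0]
    # Logistic-regression (log) loss on pc.
    loss = -(y[0] * math.log(pc_hat) + (1 - y[0]) * math.log(1 - pc_hat))
    if y[0] == 1:
        # Squared error on the bounding box coordinates.
        loss += sum((p - t) ** 2 for p, t in zip(y_hat[1:5], y[1:5]))
        # Log-likelihood (cross-entropy) on the softmax class probabilities.
        loss += -sum(t * math.log(p) for p, t in zip(y_hat[5:], y[5:]))
    return loss
```

As with the squared-error version, the box and class terms are skipped entirely when pc = 0, so the don't-care entries never contribute.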

So that's how you get a neural network to not just classify an object but also to localize it.

The idea of having a neural network output a bunch of real numbers to tell you where things are in a picture turns out to be a very powerful idea.

In the next video, I want to share with you some other places where this idea of having a neural network output a set of real numbers, almost as a regression task, can be very powerful to use elsewhere in computer vision as well.

So let's go on to the next video.
