You've already seen most of the components of object detection.

In this video, let's put all the components

together to form the YOLO object detection algorithm.

First, let's see how you construct your training set.

Suppose you're trying to train an algorithm to detect

three objects: pedestrians, cars, and motorcycles.

You don't need an explicit background class here,

so there are just the three class labels.

If you're using two anchor boxes,

then the output y will be three by three, because you are using a three by three grid of cells,

by two, which is the number of anchors,

by eight, because that's the dimension of this vector.

Eight is actually five plus the number of classes.

So five because you have Pc and then the four bounding box numbers,

that's five, and then C1, C2, C3.

That last dimension is equal to the number of classes.

And you can view this either as three by three by two by eight,

or as three by three by sixteen.
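
To make that arithmetic concrete, here is a minimal sketch in Python. The function name `yolo_output_shape` and its parameter names are made up for illustration; the only thing taken from the lecture is that each anchor box gets five numbers plus one number per class.

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Return the 4-D and the flattened 3-D shape of the target volume y."""
    # Per anchor: Pc, four bounding box numbers, then one slot per class.
    per_anchor = 5 + num_classes
    four_d = (grid_size, grid_size, num_anchors, per_anchor)
    three_d = (grid_size, grid_size, num_anchors * per_anchor)
    return four_d, three_d


four_d, three_d = yolo_output_shape(grid_size=3, num_anchors=2, num_classes=3)
print(four_d)   # (3, 3, 2, 8)
print(three_d)  # (3, 3, 16)
```

Both shapes hold the same 144 numbers; the second one just lays the two anchor boxes end to end along the last axis.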

So to construct the training set,

you go through each of these nine grid cells and form the appropriate target vector y.

So take this first grid cell:

there's nothing worth detecting in that grid cell.

None of the three classes, pedestrian, car, and motorcycle,

appears in the upper left grid cell, and so

the target y corresponding to that grid cell would be equal to this,

where Pc for the first anchor box

is zero because there's nothing associated with the first anchor box,

and Pc is also zero for the second anchor box,

and all of these other values are don't-cares.

Now, most of the grid cells have nothing in them,

but for that box over there,

you would have this target vector y.

So assume that your training set has a bounding box like this for the car;

it's just a little bit wider than it is tall.

And so if your anchor boxes are these,

this is anchor box one,

this is anchor box two,

then the red box has just slightly higher IoU with anchor box two.

And so the car gets associated with this lower portion of the vector.

So notice then that the Pc associated with anchor box one is zero,

so you have don't-cares in all these components.

Then you have this Pc equal to one,

then you use these to specify the position of the red bounding box,

and then specify that the correct object is class two,

that is, that it is a car.

So you go through this, and for each of

your nine grid positions, each of your three by three grid positions,

you would come up with a vector like this.

You come up with a 16-dimensional vector.

And so that's why the final output volume is going to be 3 by 3 by 16.
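
Here is a rough sketch of how the target for a single grid cell could be put together. Several things here are assumptions for illustration rather than anything fixed by the lecture: the helper names `shape_iou` and `cell_target`, the per-anchor ordering (Pc, bx, by, bh, bw, c1, c2, c3), the anchor shapes, comparing anchors by width and height only as if they shared a center, and using `None` to stand in for a don't-care.

```python
NUM_CLASSES = 3  # pedestrian, car, motorcycle


def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, given (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def cell_target(anchors, obj=None):
    """Build the flattened target vector for one grid cell.

    anchors: list of (width, height) anchor shapes.
    obj: None for an empty cell, or a dict with 'box' = (bx, by, bh, bw)
         and 'cls' = class number in 1..NUM_CLASSES.
    Don't-care entries are stored as None.
    """
    # Start with every anchor empty: Pc = 0, everything else a don't-care.
    slots = [[0] + [None] * (4 + NUM_CLASSES) for _ in anchors]
    if obj is not None:
        bx, by, bh, bw = obj["box"]
        # Assign the object to the anchor whose shape has the highest IoU.
        best = max(range(len(anchors)),
                   key=lambda i: shape_iou(anchors[i], (bw, bh)))
        one_hot = [1 if c == obj["cls"] else 0
                   for c in range(1, NUM_CLASSES + 1)]
        slots[best] = [1, bx, by, bh, bw] + one_hot
    return [value for slot in slots for value in slot]


# Made-up anchor shapes: anchor 1 is tall, anchor 2 is wide.
anchors = [(0.3, 0.9), (0.9, 0.3)]
# A car (class 2) that is slightly wider (0.5) than it is tall (0.4).
car = {"box": (0.5, 0.6, 0.4, 0.5), "cls": 2}

print(cell_target(anchors))       # empty cell: both Pc values are 0
print(cell_target(anchors, car))  # car fills the anchor-2 half of the vector
```

Running `cell_target` over all nine cells and stacking the results would give the 3 by 3 by 16 target volume for one training image.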

Oh, and as usual, for simplicity on the slide I've used a 3 by 3 grid.

In practice it might be more like 19 by 19 by 16.

Or in fact, if you use more anchor boxes,

maybe 19 by 19 by 5 by 8, and because five times eight is 40,

that will be 19 by 19 by 40.

That's if you use five anchor boxes.

So that's training: you train a ConvNet that inputs an image,

maybe 100 by 100 by 3,

and your ConvNet would then finally output this output volume, in our example

3 by 3 by 16 or 3 by 3 by 2 by 8.

Next, let's look at how your algorithm can make predictions.

Given an image, your neural network will output this 3 by 3 by 2 by 8 volume,

where for each of the nine grid cells you get a vector like that.

So for the grid cell here on the upper left,

if there's no object there,

hopefully your neural network will output zero here

and zero here, and it will output some other values.

Your neural network can't output a question mark,

can't output a don't-care.

So it will output some numbers for the rest.

But these numbers will basically be ignored because

the neural network is telling you that there's no object there.

So it doesn't really matter whether the output is a bounding box or says there's a car.

So those will basically just be some set of numbers, more or less noise.

In contrast, for this box over here,

the value of y output for that box at the bottom left

would hopefully be something like zero for bounding box one,

and then just a bunch of numbers, just noise.

Hopefully, you'll also output a set of numbers that

corresponds to specifying a pretty accurate bounding box for the car.

So that's how the neural network will make predictions.
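
As a small illustration of reading one of those output vectors, here is a sketch that splits a cell's 16 numbers into one prediction per anchor box. The function name `decode_cell`, the per-anchor ordering (Pc, four box numbers, then class scores), and the output values are all assumptions made up for this example.

```python
CLASS_NAMES = ["pedestrian", "car", "motorcycle"]


def decode_cell(vector, num_anchors=2):
    """Split one cell's flattened output into per-anchor predictions."""
    per_anchor = len(vector) // num_anchors
    predictions = []
    for a in range(num_anchors):
        chunk = vector[a * per_anchor:(a + 1) * per_anchor]
        pc, box, class_scores = chunk[0], chunk[1:5], chunk[5:]
        # The predicted class is the one with the highest score.
        best = max(range(len(class_scores)), key=lambda i: class_scores[i])
        predictions.append({
            "anchor": a + 1,
            "pc": pc,
            "box": tuple(box),
            "cls": CLASS_NAMES[best],
        })
    return predictions


# Made-up network output for the cell that contains the car.
output = [
    0.02, 0.1, 0.7, 0.3, 0.2, 0.4, 0.3, 0.3,     # anchor 1: low Pc, rest is noise
    0.95, 0.5, 0.6, 0.4, 0.5, 0.05, 0.9, 0.05,   # anchor 2: confident car
]
for prediction in decode_cell(output):
    print(prediction)
```

The anchor 1 entry still carries a box and a class, but with a Pc that low those numbers are the noise that gets ignored.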

Finally, you run this through non-max suppression.

So just to make it interesting,

let's look at a new test set image.

Here's how you would run non-max suppression.

If you're using two anchor boxes,

then for each of the nine grid cells,

you get two predicted bounding boxes.

Some of them will have very low probability,

very low Pc, but you still get

two predicted bounding boxes for each of the nine grid cells.

So let's say those are the bounding boxes you get.

And notice that some of the bounding boxes can go

outside the height and width of the grid cell that they came from.

Next, you get rid of the low probability predictions.

So get rid of the ones where even the neural network says,

gee, this object probably isn't there.

So get rid of those.
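
That filtering step might look something like the sketch below. The function name `drop_low_probability`, the dictionary layout, and the threshold of 0.6 are assumptions for illustration; the lecture doesn't give a specific cutoff here.

```python
def drop_low_probability(predictions, threshold=0.6):
    """Keep only the predictions whose Pc is at least the threshold."""
    return [p for p in predictions if p["pc"] >= threshold]


# Made-up predictions; the threshold of 0.6 is an assumed value.
predictions = [
    {"pc": 0.05, "cls": "car"},
    {"pc": 0.92, "cls": "car"},
    {"pc": 0.30, "cls": "pedestrian"},
    {"pc": 0.71, "cls": "pedestrian"},
]
kept = drop_low_probability(predictions)
print([p["pc"] for p in kept])  # [0.92, 0.71]
```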

And then finally, if you have three classes you're trying to detect,

say pedestrians, cars, and motorcycles,

what you do is, for each of the three classes,

independently run non-max suppression on

the objects that were predicted to come from that class.

So run non-max suppression for the predictions of the pedestrian class,

run non-max suppression for the car class,

and run non-max suppression for the motorcycle class.

So you run that basically three times to generate the final predictions.
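
Here is one way running non-max suppression once per class could be sketched. This is a simple greedy version under stated assumptions, not a definitive implementation: boxes are given as corner coordinates (x1, y1, x2, y2), the IoU threshold of 0.5 is an assumed value, and the names `iou`, `non_max_suppression`, and `per_class_nms` are made up for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(predictions, iou_threshold=0.5):
    """Greedy NMS over predictions that all belong to one class."""
    remaining = sorted(predictions, key=lambda p: p["pc"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest Pc still standing
        kept.append(best)
        # Discard boxes that overlap the kept box too much.
        remaining = [p for p in remaining
                     if iou(best["box"], p["box"]) < iou_threshold]
    return kept


def per_class_nms(predictions, classes, iou_threshold=0.5):
    """Run NMS independently for each class and collect the survivors."""
    final = []
    for cls in classes:
        same_class = [p for p in predictions if p["cls"] == cls]
        final.extend(non_max_suppression(same_class, iou_threshold))
    return final


# Made-up predictions that already passed the low-probability filter.
predictions = [
    {"pc": 0.9, "cls": "car", "box": (10, 10, 50, 40)},
    {"pc": 0.7, "cls": "car", "box": (12, 12, 52, 42)},         # overlaps the first car
    {"pc": 0.8, "cls": "pedestrian", "box": (11, 10, 49, 41)},  # overlaps, but another class
    {"pc": 0.6, "cls": "car", "box": (70, 10, 95, 30)},         # a separate car
]
for p in per_class_nms(predictions, ["pedestrian", "car", "motorcycle"]):
    print(p["cls"], p["pc"])
```

In this made-up example the 0.7 car is suppressed by the 0.9 car, while the pedestrian box survives even though it overlaps the same region, because each class is handled on its own.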

And so the output of this is hopefully that you will have

detected all the cars and all the pedestrians in this image.

So that's it for the YOLO object detection algorithm,

which is really one of the most effective object detection algorithms,

and which also encompasses many of the best ideas across

the entire computer vision literature that relate to object detection.

And you'll get a chance to practice implementing many components of this yourself

in this week's programming exercise.

So I hope you enjoy this week's programming exercise.

There's also an optional video that follows this one,

which you can either watch or not watch as you please.

But either way, I also look forward to seeing you next week.
