
In the last video, you learned how to use a convolutional implementation of sliding windows. That's more computationally efficient, but it still has the problem of not quite outputting the most accurate bounding boxes. In this video, let's see how you can get your bounding box predictions to be more accurate.

With sliding windows, you take these three sets of locations and run the classifier through them. And in this case, none of the boxes really matches up perfectly with the position of the car. So, maybe that box is the best match. And also, it looks like the perfect bounding box isn't even quite square; it's actually a slightly wider rectangle, with a slightly horizontal aspect ratio. So, is there a way to get this algorithm to output more accurate bounding boxes?

A good way to get this to output more accurate bounding boxes is with the YOLO algorithm. YOLO stands for You Only Look Once, and it is an algorithm due to Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi. Here's what you do. Let's say you have an input image at 100 by 100. You're going to place down a grid on this image, and for the purposes of illustration, I'm going to use a 3 by 3 grid, although in an actual implementation you'd use a finer one, like maybe a 19 by 19 grid.

And the basic idea is you're going to take the image classification and localization algorithm that you saw in the first video of this week, and apply that to each of the nine grid cells of this image.

So to be more concrete, here's how you define the labels you use for training. For each of the nine grid cells, you specify a label Y, where the label Y is this eight-dimensional vector, same as you saw previously. Your first output, PC, is 0 or 1 depending on whether or not there's an object in that grid cell, and then BX, BY, BH, BW to specify the bounding box if there is an object associated with that grid cell. And then C1, C2, C3, if you're trying to recognize three classes, not counting the background class. So if you're trying to recognize pedestrians, cars and motorcycles, then C1, C2, C3 can be the pedestrian, car and motorcycle classes. So in this image, we have nine grid cells, and you have a vector like this for each of the grid cells.
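As a minimal sketch (not code from the course), one per-cell label can be built like this; the ordering [PC, BX, BY, BH, BW, C1, C2, C3] follows the slide, while the helper name and the use of None for the don't-care entries are illustrative assumptions:

```python
def make_label(has_object, box=None, class_id=None):
    """Build one grid cell's 8-dimensional target vector y.

    Assumed ordering: [pc, bx, by, bh, bw, c1, c2, c3]
    with classes (c1, c2, c3) = (pedestrian, car, motorcycle).
    """
    if not has_object:
        # pc = 0; the remaining seven components are "don't cares"
        # (represented here as None for illustration).
        return [0.0] + [None] * 7
    bx, by, bh, bw = box
    classes = [0.0, 0.0, 0.0]
    classes[class_id] = 1.0            # one-hot class indicator
    return [1.0, bx, by, bh, bw] + classes

# A cell containing a car (class index 1), with an illustrative box:
y_car = make_label(True, box=(0.4, 0.3, 0.5, 0.9), class_id=1)
# An empty cell:
y_empty = make_label(False)
```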

So let's start with the upper-left grid cell, this one up here. For that one, there is no object. So the label vector Y for the upper-left grid cell would have PC equal to zero, and then don't-cares for the rest of the components. The output label Y would be the same for this grid cell, and this grid cell, and all the grid cells with nothing, with no interesting object, in them.

Now, how about this grid cell? To give a bit more detail, this image has two objects. And what the YOLO algorithm does is it takes the midpoint of each of the two objects and then assigns the object to the grid cell containing that midpoint. So the left car is assigned to this grid cell, and the car on the right, whose midpoint is here, is assigned to this grid cell. And so even though the central grid cell has some parts of both cars, we'll pretend the central grid cell has no interesting object, so that for the central grid cell the class label Y also looks like this vector with no object: the first component PC is zero, and then the rest are don't-cares.
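The midpoint-assignment rule can be sketched as a small function; the image dimensions, grid size, and function name here are assumptions for illustration:

```python
def assign_cell(mid_x, mid_y, img_w=100, img_h=100, grid=3):
    """Return (row, col) of the grid cell containing an object's midpoint.

    Each object is assigned to exactly one cell, even if its bounding
    box spans several cells.
    """
    # Scale the midpoint into grid units and take the integer cell index;
    # clamp so a midpoint exactly on the far edge stays inside the grid.
    col = min(int(mid_x / img_w * grid), grid - 1)
    row = min(int(mid_y / img_h * grid), grid - 1)
    return row, col

# A midpoint at (20, 70) on a 100x100 image with a 3x3 grid lands in
# the bottom-left cell:
assign_cell(20, 70)  # (2, 0)
```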

Whereas for this cell, the one I have circled in green on the left, the target label Y would be as follows. There is an object, so PC is one, and then you write BX, BY, BH, BW to specify the position of this bounding box. And then, let's see: if class one was a pedestrian, then that's zero; class two is a car, that's one; class three was a motorcycle, that's zero. And similarly, the grid cell on the right, because it does have an object in it, will also have some vector like this as its target label.

So, for each of these nine grid cells, you end up with an eight-dimensional output vector. And because you have a 3 by 3 grid, nine grid cells in total, the target output is going to be 3 by 3 by 8: for each of the 3 by 3 grid cells, you have an eight-dimensional Y vector. So, for example, this 1 by 1 by 8 volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells. And so for each of the 3 by 3 positions, for each of these nine grid cells, there is a corresponding eight-dimensional target vector Y that you want the network to output, some of whose components could be don't-cares if there's no object there. And that's why the total target output, the output label for this image, is now itself a 3 by 3 by 8 volume.
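Assembling the full 3 by 3 by 8 target volume for one image can be sketched like this; the cell indices and box values are illustrative, and don't-care entries are left as zeros here purely for convenience:

```python
import numpy as np

# Per-cell layout assumed: [pc, bx, by, bh, bw, c1, c2, c3].
grid, depth = 3, 8
Y = np.zeros((grid, grid, depth))   # pc = 0 everywhere by default

# Two cars, each assigned to the cell containing its midpoint
# (cell positions and box encodings below are made up for illustration).
for (row, col), (bx, by, bh, bw) in [((1, 0), (0.4, 0.3, 0.5, 0.9)),
                                     ((1, 2), (0.7, 0.5, 0.4, 0.8))]:
    Y[row, col] = [1.0, bx, by, bh, bw, 0.0, 1.0, 0.0]  # class 2 = car

Y.shape  # (3, 3, 8)
```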

So now, to train your neural network, the input is 100 by 100 by 3; that's the input image. And then you have a usual convnet with conv layers, max pool layers, and so on, chosen so that this eventually maps to a 3 by 3 by 8 output volume. And so what you do is you have an input X, which is the input image, and these target labels Y, which are 3 by 3 by 8, and you use backpropagation to train the neural network to map from any input X to this type of output volume Y.

So the advantage of this algorithm is that the neural network outputs precise bounding boxes, as follows. At test time, what you do is feed in an input image X and run forward prop until you get this output Y. And then for each of the nine outputs, for each of the 3 by 3 positions in the output, you can just read off whether PC is 1 or 0: is there an object associated with that one of the nine positions? And if there is an object, what object is it, and where is the bounding box for the object in that grid cell? And as long as you don't have more than one object in each grid cell, this algorithm should work okay.
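The test-time read-off can be sketched as follows; the detection threshold and class names are assumptions, and real YOLO adds further steps (covered in later videos) on top of this:

```python
import numpy as np

def read_predictions(Y, threshold=0.5,
                     names=("pedestrian", "car", "motorcycle")):
    """Scan a (rows, cols, 8) output volume and collect detections.

    For each grid cell: if pc exceeds the threshold, read off the
    box (bx, by, bh, bw) and the most likely class.
    """
    detections = []
    rows, cols = Y.shape[:2]
    for row in range(rows):
        for col in range(cols):
            pc = Y[row, col, 0]
            if pc > threshold:
                bx, by, bh, bw = Y[row, col, 1:5]
                cls = names[int(np.argmax(Y[row, col, 5:8]))]
                detections.append((row, col, cls, (bx, by, bh, bw)))
    return detections

# Illustrative output volume with one car in cell (1, 0):
Y = np.zeros((3, 3, 8))
Y[1, 0] = [1.0, 0.4, 0.3, 0.5, 0.9, 0.0, 1.0, 0.0]
```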

And the problem of having multiple objects within a grid cell is something we'll address later. Here I've used a relatively small 3 by 3 grid; in practice, you might use a much finer grid, maybe 19 by 19, so you end up with 19 by 19 by 8. That makes your grid much finer and reduces the chance that there are multiple objects assigned to the same grid cell. And just as a reminder, the way you assign an object to a grid cell is you look at the midpoint of the object, and then you assign that object to whichever one grid cell contains that midpoint. So each object, even if it spans multiple grid cells, is assigned only to one of the nine grid cells, or one of the 3 by 3, or one of the 19 by 19, grid cells. And with a 19 by 19 grid, the chance of the midpoints of two objects appearing in the same grid cell is just a bit smaller.

So notice two things. First, this is a lot like the image classification and localization algorithm that we talked about in the first video of this week, in that it outputs the bounding box coordinates explicitly. And so this allows your network to output bounding boxes of any aspect ratio, as well as output much more precise coordinates that aren't just dictated by the stride size of your sliding windows classifier. And second, this is a convolutional implementation: you're not implementing this algorithm nine times on the 3 by 3 grid or, if you're using a 19 by 19 grid, 361 times (19 squared is 361). So, you're not running the same algorithm 361 times, or 19 squared times. Instead, this is one single convolutional implementation, where you use one convnet with a lot of shared computation between all the computations needed for all of your 3 by 3, or all of your 19 by 19, grid cells. So, this is a pretty efficient algorithm. And in fact, one nice thing about the YOLO algorithm, which accounts for its popularity, is that because this is a convolutional implementation, it actually runs very fast. So this works even for real-time object detection.

Now, before wrapping up, there's one more detail I want to share with you, which is: how do you encode these bounding boxes BX, BY, BH, BW? Let's discuss that on the next slide.

So, given these two cars, remember we have the 3 by 3 grid. Let's take the example of the car on the right. In this grid cell there is an object, so the target label Y will have PC equal to one, and then BX, BY, BH, BW, and then 0 1 0, since class two, a car, is the class present.

So, how do you specify the bounding box? In the YOLO algorithm, relative to this grid cell, you take the convention that the upper-left point here is (0, 0) and this lower-right point is (1, 1). So to specify the position of that midpoint, that orange dot, BX might be, let's say, about 0.4; maybe it's about 0.4 of the way to the right. And BY looks like, I guess, maybe 0.3. And then the width and height of the bounding box are specified as fractions of the overall width and height of the grid cell. So, the width of this red box is maybe 90% of that blue line, and so BW is 0.9; and the height of it is maybe one half of the overall height of the grid cell, so in that case BH would be, let's say, 0.5. So, in other words, BX, BY, BH, BW are specified relative to the grid cell.
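This grid-relative encoding can be sketched as a function; the image and grid dimensions are illustrative assumptions, and the numbers in the example mirror the ones from the slide:

```python
def encode_box(mid_x, mid_y, box_w, box_h, img_w=90, img_h=90, grid=3):
    """Encode an absolute box relative to the grid cell holding its midpoint.

    Convention: the cell's upper-left corner is (0, 0) and its lower-right
    corner is (1, 1); bw and bh are fractions of the cell's width/height
    and may exceed 1 for objects larger than one cell.
    """
    cell_w, cell_h = img_w / grid, img_h / grid
    col, row = int(mid_x // cell_w), int(mid_y // cell_h)
    bx = (mid_x - col * cell_w) / cell_w   # in [0, 1] by construction
    by = (mid_y - row * cell_h) / cell_h   # in [0, 1] by construction
    bw = box_w / cell_w                    # can be > 1 for wide objects
    bh = box_h / cell_h                    # can be > 1 for tall objects
    return (row, col), (bx, by, bh, bw)

# Midpoint (75, 45) on a 90x90 image, box 27 wide and 15 tall:
cell, (bx, by, bh, bw) = encode_box(75, 45, 27, 15)
```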

And so BX and BY have to be between 0 and 1, right? Because pretty much by definition, that orange dot is within the bounds of the grid cell it's assigned to; if it weren't between 0 and 1, if it were outside the square, then the object would have been assigned to a different grid cell. But BH and BW could be greater than one. In particular, if you have a car whose bounding box looks like that, then the height and width of the bounding box could be greater than one.

So, there are multiple ways of specifying the bounding boxes, but this would be one convention that's quite reasonable. Although, if you read the YOLO research papers, the YOLO line of research, there are other parameterizations that work even a little bit better, but I hope this gives you one reasonable convention that should work okay. For example, there are some more complicated parameterizations involving sigmoid functions to make sure BX and BY are between 0 and 1, and an exponential parameterization to make sure BH and BW are non-negative, since values like 0.9 and 0.5 have to be greater than or equal to zero. There are some other more advanced parameterizations that work a little bit better, but the one you saw here should work okay.
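One way such a constrained parameterization could look (a sketch of the idea just mentioned, not the exact parameterization from the YOLO papers): the raw, unconstrained network outputs t are squashed so that BX, BY land in (0, 1) and BH, BW stay positive.

```python
import math

def decode(tx, ty, th, tw):
    """Map unconstrained network outputs to a valid box encoding.

    A sigmoid keeps bx, by in (0, 1); an exponential keeps bh, bw > 0.
    """
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx, by = sigmoid(tx), sigmoid(ty)    # midpoint stays inside the cell
    bh, bw = math.exp(th), math.exp(tw)  # sizes are always non-negative
    return bx, by, bh, bw

# Raw outputs of zero decode to the cell's center with a cell-sized box:
decode(0.0, 0.0, 0.0, 0.0)  # (0.5, 0.5, 1.0, 1.0)
```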

So, that's it for the YOLO, or You Only Look Once, algorithm. In the next few videos I'll show you a few other ideas that will help make this algorithm even better. In the meantime, if you want, you can take a look at the YOLO paper referenced at the bottom of these past couple of slides.

Although, just one warning: the YOLO paper is one of the harder papers to read. I remember, when I was reading this paper for the first time, I had a really hard time figuring out what was going on, and I wound up asking a couple of my friends, very good researchers, to help me figure it out, and even they had a hard time understanding some of the details of the paper. So, if you look at the paper, it's okay if you have a hard time figuring it out. I wish it were more uncommon, but it's not that uncommon, sadly, for even senior researchers to read research papers and have a hard time figuring out the details, and to have to look at open source code, or contact the authors, or something else to figure out the details of these algorithms. But don't let me stop you from taking a look at the paper yourself if you wish; it's just one of the harder ones.

So, with that, you now understand the basics of the YOLO algorithm. Let's go on to some additional pieces that will make this algorithm work even better.

Â