This week, we will talk about object detection, which is probably the key problem in computer vision. The goal of object detection is to detect the presence of objects from a certain set of classes and to locate their exact positions in the image. We can informally divide all objects into two big groups: things and stuff. Things are objects of a certain size and shape, like cars, bicycles, people, animals, or planes. We can specify where such an object is located in the image with a bounding box. Stuff, in contrast, refers to regions of the image that correspond to objects like road, grass, sky, or water. It is easier to specify the location of the sky by marking a region in the image than by a bounding box. For now, we will talk mostly about the detection of things. To detect stuff, it is better to use semantic image segmentation methods, which we will discuss during week five.

Compared to image classification, the output of a detector is structured. Each object is usually marked with a bounding box and a class label. A bounding box is described by the position of one of its corners and by the width and height of the box. Object positions and classes are annotated in the ground truth data. If only part of this information is annotated, we call it a "weak" annotation. For example, only the presence of an object in the image can be annotated, without a bounding box. There is a lot of research into object detection with weak annotation, but the performance of such algorithms is lower compared to algorithms trained with full annotation.

To check whether a detection is correct, we compare the predicted bounding box with the ground truth bounding box. The metric is intersection over union, or IoU. It is the ratio of the area of the intersection of the predicted and ground truth bounding boxes to the area of the union of these boxes, as shown on the slide. If the IoU is larger than a threshold, then the detection is correct. The larger the threshold, the more precisely the detector should localize objects. Currently, the threshold is usually set to 0.5.

The detector output is a set of detection proposals. Usually, for each proposal the detector also gives a score as a measure of confidence in the detection, so we can rank all proposals according to the score. Each proposal is then considered in turn. If its IoU with a ground truth object is larger than the threshold, it is a true positive detection; if the IoU is lower, it is a false positive detection. If some ground truth object is not detected at all, it is marked as a misdetection, or false negative.

On the whole dataset, you can measure the precision and the recall of the detector. The precision is the ratio of the number of true detections to the number of all detections. The recall is the ratio of the number of true detections to the number of objects annotated in the ground truth data. By varying some parameter of the detector, usually the threshold on the detection score, you can simultaneously change precision and recall, and then plot the precision-recall curve. To compare two detectors correctly, you should compare their precision-recall curves. If one curve is generally higher than the other, then the first detector is better than the second. To measure the overall quality of a detector with one number, we compute average precision. It is the mean of the precision at 11 points on the curve, for recall values from zero to one in 0.1 intervals. If you have a multi-class detector, then you can compute mean average precision by averaging the average precision across classes.
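To make these definitions concrete, here is a minimal sketch in plain Python of the matching and scoring just described. It assumes boxes are (x, y, width, height) tuples with (x, y) the top-left corner, and a single object class; all function names are illustrative, not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0


def match_detections(pred_boxes, scores, gt_boxes, iou_threshold=0.5):
    """Greedily match score-ranked detections against ground truth boxes.

    Returns (score, is_true_positive) pairs; each ground truth box may be
    matched at most once, so duplicate detections count as false positives,
    and any ground truth box left unmatched is a false negative.
    """
    order = sorted(range(len(pred_boxes)), key=lambda i: scores[i], reverse=True)
    matched = [False] * len(gt_boxes)
    results = []
    for i in order:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if not matched[j]:
                overlap = iou(pred_boxes[i], gt)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        is_tp = best_iou > iou_threshold
        if is_tp:
            matched[best_j] = True
        results.append((scores[i], is_tp))
    return results


def average_precision(detections, num_ground_truth):
    """11-point average precision over (score, is_true_positive) pairs."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _score, is_tp in detections:
        tp += is_tp
        fp += not is_tp
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Mean of the best precision achievable at recall >= r,
    # for r = 0.0, 0.1, ..., 1.0.
    ap = 0.0
    for r in (i / 10 for i in range(11)):
        reachable = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += max(reachable) if reachable else 0.0
    return ap / 11
```

For a multi-class detector, the mean average precision is then simply the mean of this number over all classes.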
On the precision-recall plot, we can select the working point where the precision and the recall are best suited for the task at hand. Note that for a production algorithm, the point is usually selected so that precision is close to one, by sacrificing some recall. Sometimes another metric is used to measure the quality of a detector: the plot of the miss rate against the number of false detections per image. This curve is constructed similarly to the precision-recall curve, by varying the threshold on the detection score.

For object detection, the creation of the ground truth annotation is very important. It has been demonstrated in recent papers that annotations usually differ across datasets and annotators. This leads to significant errors in the training of a detector and in its evaluation. A lot of objects can be missed in the ground truth data. This is especially true for small objects, or objects whose appearance is very similar to the target class. For example, if you want to detect faces, then detectors may detect not only real faces but also, say, prints on clothes or portraits. It is much better to annotate all such objects, and then mark some of them as objects to ignore during training and evaluation. One can also use several annotators for verification of the annotation. The higher the localization precision you want to obtain, the higher, usually, are the requirements on annotation precision. It is very important to have a strict annotation protocol with a clear definition of how objects should be marked in difficult cases. For example, in the recent CityPersons dataset, each pedestrian is marked with a line from the top of the head to the middle point between the feet, and then a bounding box with a fixed width-to-height aspect ratio is placed on top of this annotation.

Some objects are more important for us than others, so detection of such object classes as faces, pedestrians, or cars has a lot of practical applications. Thus, a lot of algorithms have been proposed for the detection of specific classes of objects. Such detectors reach top performance on their classes compared to multi-class detectors. Practical multi-class detectors appeared only with the development of deep learning methods.

There are several popular datasets for multi-class detection. ImageNet is the first example of such datasets. It has objects of 1000 classes, the same as the ImageNet classification task. Each image has an annotation of one class and at least one bounding box. There are 800 training images per class. The detector should produce five guesses per image. Detection on an image is correct if at least one guess has the correct class and the corresponding bounding box is close to the correct one.
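To make this last evaluation rule concrete, here is a small sketch of the per-image check, reusing the iou() helper from the earlier snippet. The 0.5 IoU threshold and the function name are assumptions for illustration, not a quote of the official ImageNet protocol.

```python
def image_detection_correct(guesses, gt_class, gt_boxes, iou_threshold=0.5):
    """ImageNet-style five-guess check for a single image.

    `guesses` is a list of up to five (class_label, box) pairs. The image
    counts as correct if at least one guess has the ground truth class and
    its box overlaps some ground truth box with sufficient IoU.
    """
    for label, box in guesses[:5]:
        if label == gt_class and any(iou(box, gt) > iou_threshold
                                     for gt in gt_boxes):
            return True
    return False
```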