[MUSIC] This video will study human 2D pose estimation: the problem of localizing anatomical key points, or parts, which has largely focused on finding the body parts of individuals. Inferring the pose of multiple people in images, especially socially engaged individuals, presents a unique set of challenges. First, each image may contain an unknown number of people that can occur at any position or scale. Second, interactions between people induce complex spatial interference, due to contact, occlusion, and limb articulations, making the association of parts difficult. A common approach is to employ a person detector and perform single-person pose estimation for each detection. This top-down approach directly leverages existing techniques for single-person pose estimation, but suffers from early commitment: if the person detector fails, and it may do so when people are in close proximity, there is no recourse to recovery. Furthermore, the run time of these top-down approaches is proportional to the number of people, because a single-person pose estimator is run for each detection, and the more people there are, the greater the computational cost. Nevertheless, we will analyze several approaches to single-person pose estimation. The problem of human pose estimation can be formulated as the task of finding the positions of human body key points: head, shoulders, elbows, wrists, and so on. We can formulate this as a regression problem. There have been attempts to train a convolutional neural network to regress the x and y locations of the human joints using an L2 loss. This may work well, but only on a very specific data set, such as a data set of TV presenters. However, instead of regressing the joint positions directly, one idea is to regress a heat map of the joint positions, separately for each joint in the input image.
At training time, the ground-truth labels, or heat maps, are synthesized for each joint separately by placing a Gaussian with fixed variance centered at the joint position. Such a network can be fully convolutional, and the loss can be the L2 loss, which penalizes the squared pixel-wise differences between the produced heat map and the synthesized ground-truth heat map. Of course, the approach of regressing the locations of key points is naive for more complicated tasks. Before we proceed to the next part, it is worth noting that there is another formulation of the human pose estimation task: segmentation. You can see in the image that we predict a class for each pixel of the human body, and we can use any segmentation method to solve this problem. All the above detectors work well for estimating the posture of one person. As we said before, we can apply them after detecting people in the image, but this top-down approach has some drawbacks; the most serious is that the inference time is proportional to the number of people. In contrast, bottom-up approaches are attractive, as they offer robustness to early commitment and have the potential to decouple run-time complexity from the number of people in the image. Yet bottom-up approaches do not directly use global contextual cues from other body parts and other people. In practice, early bottom-up methods did not retain their gains in efficiency, as the final parse requires costly global inference. The DeepCut algorithm, a bottom-up approach that jointly labels part detection candidates and associates them with individual people, requires solving an integer linear program over a fully connected graph. That is an NP-hard problem, and the average processing time is on the order of hours. There was an attempt to build on DeepCut with stronger part detectors based on residual networks and image-dependent pairwise scores.
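The heat-map formulation described above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular paper's implementation: the map size, the joint position, and the sigma are made-up values.

```python
import numpy as np

# Sketch: synthesize a ground-truth heat map for one joint by placing a
# 2D Gaussian with fixed sigma at the joint position (illustrative values).
def joint_heatmap(h, w, cx, cy, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

gt = joint_heatmap(64, 64, cx=20, cy=30)

# The fully convolutional network is trained with a pixel-wise squared
# (L2) loss between its predicted map and the ground-truth map.
pred = np.zeros((64, 64))            # stand-in for a network prediction
loss = np.mean((pred - gt) ** 2)

# At test time, the joint location is read back as the heat map's argmax.
y, x = np.unravel_index(gt.argmax(), gt.shape)
print(x, y)  # → 20 30
```

One map is produced per joint, so a network for K joints outputs a K-channel map; the loss is simply summed over channels.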
That vastly improved the run time, but the method still takes several minutes per image, with a limit on the number of part proposals. Now let's talk about the state of the art in the multi-person pose estimation task that works in real time. The key idea of this method is the use of part affinity fields. The method takes the entire image as the input to a two-branch CNN that jointly predicts confidence maps for body part detection and part affinity fields for part association, that is, for limbs. A parsing step then performs a set of bipartite matchings to associate body part candidates, which are finally assembled into full-body poses for all people in the image. The CNN that predicts feature maps for body parts and limbs is multi-stage. At each stage, the first branch predicts confidence maps of body parts, and the second branch predicts PAFs of limbs. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage for refinement. You can see this process in the left picture: confidence maps of the right wrist in the first row, and the affinity field of the right forearm in the second row, across stages. Although there is confusion between the left and right body parts and limbs in the early stages, the estimates are increasingly refined through global inference in the later stages, as shown in the highlighted areas. Given a set of detected body parts, how do we assemble them to form the full-body poses of an unknown number of people? We need a confidence measure of the association for each pair of body part detections, that is, a measure that they belong to the same person. In other words, we should predict not only the location, but also the orientation, across the region of support of the limb. For this task, we can use part affinity fields. A part affinity field is a 2D vector field for each limb.
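The multi-stage, two-branch data flow described above can be sketched as follows. This is only a shape-level illustration of how the branches and image features are concatenated between stages; the real network uses convolutional blocks, whereas here each stage is a hypothetical random channel-mixing step, and the channel counts are assumed, not taken from the source.

```python
import numpy as np

H, W = 46, 46                 # feature-map resolution (assumed)
C_F, C_S, C_L = 32, 19, 38    # image features, part maps, PAF channels (assumed)

rng = np.random.default_rng(0)
F = rng.standard_normal((C_F, H, W))   # image features from a backbone CNN

def stage(x, c_out, rng):
    """Stand-in for one stage's conv branch: a random 1x1 channel mixing."""
    w = rng.standard_normal((c_out, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum('oc,chw->ohw', w, x)

# Stage 1 sees only the image features.
S = stage(F, C_S, rng)   # branch 1: confidence maps of body parts
L = stage(F, C_L, rng)   # branch 2: part affinity fields of limbs

# Later stages refine the concatenation [S, L, F] of both branches'
# predictions with the image features, as described in the transcript.
for t in range(2, 5):
    x = np.concatenate([S, L, F], axis=0)
    S = stage(x, C_S, rng)
    L = stage(x, C_L, rng)

print(S.shape, L.shape)  # → (19, 46, 46) (38, 46, 46)
```

Each stage has its own loss against the ground-truth maps, which is what drives the stage-by-stage refinement seen in the figure.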
For each pixel in the area belonging to a particular limb, a 2D vector encodes the direction that points from one part of the limb to the other. For instance, in the right picture, in the ground-truth feature map, or part affinity field, corresponding to the forearm, the value at a point p is the unit vector from j1 to j2, where k is the person ID in the image. For all other points, the vector is zero-valued. Having the key-point candidates from the confidence maps of our convolutional neural network and the orientations of the limbs, we can find the optimal association of key points to people by solving a maximum-weight bipartite graph matching problem on the graph whose vertices are the key-point candidates and whose edges are weighted limbs, as in the bottom-left picture. The weights are the confidences of the association, and they can be calculated as the line integral over the corresponding affinity field along the line segment connecting the candidate part locations. To summarize, human pose estimation aims to predict the locations of anatomical key points for individual people. We can employ part-based methods for that, like those used for key-point regression. Semantic segmentation machinery forms a natural basis for pose estimation, along with some other tricks, like the affinity fields. [SOUND]
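The scoring and association step just described can be sketched as below: the line integral of the PAF projected onto each candidate segment gives the edge weights, and a bipartite matching assigns parts to limbs. The toy field and candidate coordinates are made up; the integral is approximated by sampling, as in practice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

H, W = 40, 40
paf = np.zeros((2, H, W))   # 2D vector field for one limb type
paf[0, :, :] = 1.0          # toy field: unit vectors pointing in +x everywhere

def paf_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of the PAF projected on p1 -> p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)  # (x, y) points
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    u = d / norm                        # unit vector along the candidate limb
    score = 0.0
    for t in np.linspace(0, 1, n_samples):
        x, y = (p1 + t * d).round().astype(int)
        score += paf[0, y, x] * u[0] + paf[1, y, x] * u[1]
    return score / n_samples

# Candidate detections for two part types (e.g. elbows and wrists), (x, y).
elbows = [(5, 10), (5, 30)]
wrists = [(25, 10), (25, 30)]
cost = np.array([[-paf_score(paf, e, w) for w in wrists] for e in elbows])

# Maximum-weight bipartite matching (minimize negated scores).
rows, cols = linear_sum_assignment(cost)
print([(int(r), int(c)) for r, c in zip(rows, cols)])  # → [(0, 0), (1, 1)]
```

Here the field points in +x, so the horizontal pairings score 1.0 while the diagonal ones score only about 0.71, and the matching correctly pairs each elbow with the wrist on its own row.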