0:00

[MUSIC]

This video will study human 2D pose estimation.

The problem of localizing anatomical keypoints, or parts, has largely focused on finding the body parts of individuals.

Inferring the poses of multiple people in images, especially socially engaged individuals, presents a unique set of challenges.

First, each image may contain an unknown number of people that can occur at any position or scale.

Second, interactions between people induce complex spatial interference due to occlusion and contact.

Â 0:43

A common approach is to employ a person detector and perform single-person pose estimation for each detection.

This top-down approach directly leverages existing techniques for single-person pose estimation, but suffers from early commitment: if the person detector fails, and it may do so when people are in close proximity, there is no recourse to recovery.

Furthermore, the runtime of these top-down approaches is proportional to the number of people, because a single-person pose estimator is run for each detection; the more people there are, the greater the computational cost.

Nevertheless, we'll analyze several approaches to single-person pose estimation.

The problem of human pose estimation can be formulated as the task of finding the positions of human body keypoints: head, shoulders, elbows, wrists, and so on.

We can formulate this as a regression problem.

Â 2:08

At training time, the ground-truth labels, or heatmaps, are synthesized for each joint separately by placing a Gaussian with fixed variance, centered at the joint position.

The network can be fully convolutional, and the loss can be the L2 loss, which penalizes the squared pixel-wise differences between the predicted heatmap and the synthesized ground-truth heatmap.

Of course, the approach of directly regressing the locations of keypoints is naive for more complicated tasks.
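As a concrete illustration, here is a minimal NumPy sketch of the ground-truth heatmap synthesis and the pixel-wise L2 loss described above. The function names, the 64x64 resolution, and the sigma value are illustrative choices, not from the video:

```python
import numpy as np

def make_heatmap(h, w, joint_xy, sigma=2.0):
    """Synthesize the ground-truth heatmap for one joint: a Gaussian
    with fixed variance, centered at the joint position (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l2_loss(pred, target):
    """Mean squared pixel-wise difference between the predicted heatmap
    and the synthesized ground-truth heatmap."""
    return np.mean((pred - target) ** 2)

# one heatmap per joint; the Gaussian peak sits at the joint position
heat = make_heatmap(64, 64, joint_xy=(30, 20))
```

In a full training setup, one such heatmap is produced for every joint, and the loss is summed over joints.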

Â 3:04

All the above detectors work well for estimating the pose of one person.

As we have said before, we can apply them after detecting people in the image, but this top-down approach has some drawbacks. The most serious is that the inference time is proportional to the number of people.

In contrast, bottom-up approaches are attractive, as they offer robustness to early commitment and have the potential to decouple runtime complexity from the number of people in the image.

Yet bottom-up approaches do not directly use global contextual cues from other body parts and other people. In practice, early bottom-up methods did not retain the gains in efficiency, as the final parse requires costly global inference.

The DeepCut algorithm, a bottom-up approach that jointly labels part detection candidates and associates them to individual people, requires solving an integer linear program over a fully connected graph. That is an NP-hard problem, and the average processing time is on the order of hours.

There was an attempt to build on DeepCut with stronger part detectors based on residual networks and image-dependent pairwise scores. That vastly improved the runtime, but the method still takes several minutes per image, with a limit on the number of part proposals.

Now, let's talk about the state of the art in the multi-person pose estimation task, a method that works in real time.

The key idea of this method is using part affinity fields (PAFs).

This method takes the entire image as the input to a two-branch CNN that jointly predicts confidence maps for body part detection and part affinity fields for part association, that is, for limbs.

A parsing step then performs a set of bipartite matchings to associate body part candidates, finally assembling them into full-body poses for all people in the image.

The CNN that predicts feature maps for body parts and limbs is multistage. At each stage, the first branch predicts confidence maps of body parts, and the second branch predicts PAFs of limbs.

After each stage, the predictions from the two branches, along with the image features, are concatenated as input to the next stage for refinement.

You can see this process in the left picture: confidence maps of the right wrist in the first row, and part affinity fields of the right forearm in the second row, across stages. Though there is confusion between left and right body parts and limbs in the early stages, the estimates are increasingly refined through global inference in later stages, as shown in the highlighted areas.
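The data flow of this multistage, two-branch refinement can be sketched schematically. The "stages" below are stand-in random projections rather than learned convolutions, and the 19-part / 38-channel sizes are illustrative; only the concatenate-and-refine pattern matches the description above:

```python
import numpy as np

def two_branch_stages(image_features, n_stages, n_parts=19, n_limb_channels=38):
    """Schematic data flow of the multistage two-branch network: each
    stage predicts confidence maps (branch 1) and PAFs (branch 2), and
    both predictions are concatenated with the image features as input
    to the next stage. Stages here are placeholder random 1x1
    projections, not learned convolutions."""
    rng = np.random.default_rng(0)
    x = image_features
    for _ in range(n_stages):
        conf = x @ rng.standard_normal((x.shape[-1], n_parts))         # branch 1
        paf = x @ rng.standard_normal((x.shape[-1], n_limb_channels))  # branch 2
        x = np.concatenate([conf, paf, image_features], axis=-1)       # refinement input
    return conf, paf

conf, paf = two_branch_stages(np.ones((8, 8, 32)), n_stages=3)
```

The repeated concatenation is what lets later stages perform the global refinement seen across the rows of the figure.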

Given a set of detected body parts, how do we assemble them to form the full-body poses of an unknown number of people?

We need a confidence measure of the association for each pair of body part detections.

Â 6:17

That is, a measure of whether they belong to the same person.

In other words, we should predict not only the location, but also the orientation across the region of support of the limb.

For this task, we can use part affinity fields.

A part affinity field is a 2D vector field, one for each limb. For each pixel in the area belonging to a particular limb, a 2D vector encodes the direction pointing from one part of the limb to the other.

For instance, the right picture shows the ground-truth feature map, or part affinity field, corresponding to the forearm. The value at a point p is the unit vector from j1 to j2, where k is the person ID in the image. For all other points, the vector is zero-valued.
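A minimal NumPy sketch of constructing this ground-truth PAF for a single limb might look as follows. The `limb_width` parameter and the function name are my own; points are given as (x, y) while arrays are indexed [y, x]:

```python
import numpy as np

def make_paf(h, w, j1, j2, limb_width=1.0):
    """Ground-truth part affinity field for one limb j1 -> j2: every
    pixel within limb_width of the segment stores the unit vector from
    j1 to j2; all other pixels are zero-valued."""
    paf = np.zeros((h, w, 2))
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    norm = np.linalg.norm(v)
    if norm == 0:
        return paf
    v /= norm
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - j1[0], ys - j1[1]          # displacement from j1
    along = dx * v[0] + dy * v[1]            # projection onto the limb axis
    perp = np.abs(dx * v[1] - dy * v[0])     # distance to the limb axis
    mask = (along >= 0) & (along <= norm) & (perp <= limb_width)
    paf[mask] = v
    return paf
```

For multiple people, the fields of the same limb type are averaged over the people whose limbs overlap at a pixel.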

Having these keypoint candidates and confidence maps from our convolutional neural network, together with the orientations of limbs, we can find the optimal association of keypoints to people by solving a maximum-weight bipartite graph matching problem on a graph whose vertices are the keypoint candidates and whose edges are weighted limbs, as in the bottom-left picture.

The weights are the confidences of the associations, and they can be calculated as the line integral over the corresponding affinity field along the line segment connecting the candidate part locations.
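The line integral and the matching can be sketched as follows. This is a simplified version under my own assumptions: the integral is approximated by sampling a fixed number of points on the segment, and the bipartite matching is replaced by a greedy approximation rather than an exact maximum-weight solver:

```python
import numpy as np

def association_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of the part affinity field along
    the segment p1 -> p2: sample points on the segment and average the
    dot product of the field with the segment's unit direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    u = d / norm
    total = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = p1 + t * d
        total += paf[int(round(y)), int(round(x))] @ u
    return total / n_samples

def greedy_match(scores):
    """Greedy approximation to maximum-weight bipartite matching:
    repeatedly take the highest-scoring unused pair
    (part-A candidate, part-B candidate) with a positive score."""
    pairs, used_a, used_b = [], set(), set()
    for idx in np.argsort(scores, axis=None)[::-1]:
        a, b = np.unravel_index(idx, scores.shape)
        if a not in used_a and b not in used_b and scores[a, b] > 0:
            pairs.append((int(a), int(b)))
            used_a.add(int(a))
            used_b.add(int(b))
    return pairs
```

Matched limb pairs sharing a keypoint candidate are then merged into full-body poses.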

Â 7:48

To summarize, human pose estimation aims to predict the locations of anatomical keypoints for individual people.

We can employ part-based methods for that, like those used for keypoint regression.

Semantic segmentation machinery forms a natural basis for pose estimation, along with some other tricks, like the part affinity fields.

Â [SOUND]
