Now we are ready to summarize the entire procedure of structure from motion, which answers two questions: where am I, the camera, as I move through a scene, and where are they, the scene points, in 3D space? We have seen a lot of theory, so it's time to put it into practice. In this example, we played a friendly basketball game with some of our friends and asked each friend to wear a GoPro camera on their head. This gives us views of the game from our own perspective and from each of our friends' perspectives, and each view shows us the 3D world from that person's vantage point. As you can imagine, the cameras jiggle as we run through the space. Using the techniques described in this course, we can compute the three-dimensional camera movements and visualize them as follows. Here we see the points measured in three-dimensional space, and we see ourselves moving through that space relative to each other. Each person has a small pyramid attached to them; that is the camera, drawn at the optical center we are estimating, and its orientation and position are computed automatically. So with this technique, you can take a GoPro or your cell phone camera, record a video as you would normally walk through a space, and entirely reconstruct the three-dimensional space while positioning yourself in that scene.

In the following slides we summarize the full procedure for computing this. The general setting is that we have a set of images, taken from the same camera or from different cameras carried by our friends, as we move around the space. We start with a pair of images, for example these two pictures of the buildings in the background taken from different viewpoints. Given two two-dimensional pictures, we can compute the relative pose of one camera with respect to the other in terms of a rotation and a translation. The first step is an automatic feature-matching algorithm, which consists of two parts: detect feature points automatically in each picture, then find the corresponding points in the second image and link them, as illustrated here by the red lines.

Once we have found corresponding points in two (or more) images, we know the two feature locations are related to each other through the fundamental matrix in a bilinear form, x2ᵀ F x1 = 0. Recall that the fundamental matrix encodes the rotation and translation between the two cameras, squeezed together into a single three-by-three matrix, and it also encodes the camera calibration itself. So the fundamental matrix condenses the calibration, the camera rotation, and the translation into one three-by-three matrix, and our goal is to estimate it from the feature matches. Since this three-by-three matrix has eight unknowns (it is only defined up to scale), eight corresponding points across the two views give us a sufficient number of constraints to solve for it. This is done by forming a linear system A x = 0, where x is a column vector containing the unknown entries of the fundamental matrix F, and solving what we call a linear least-squares problem. Here we summarize the entire procedure for computing the fundamental matrix, known as the eight-point algorithm.
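To make this concrete, here is a minimal sketch of the eight-point algorithm in Python with NumPy. The function name and array layout are illustrative choices, not from the lecture, and for brevity the sketch omits the coordinate normalization that is usually applied in practice to improve conditioning.

```python
import numpy as np

def eight_point(x1, x2):
    """Estimate the fundamental matrix from >= 8 correspondences.

    x1, x2 : (N, 2) arrays of matching pixel coordinates, N >= 8.
    Returns a 3x3 matrix F with x2_h^T F x1_h ~= 0 for each match.
    """
    N = x1.shape[0]
    # Build the N x 9 design matrix A so that A f = 0, where f stacks the entries of F.
    A = np.zeros((N, 9))
    for i in range(N):
        u1, v1 = x1[i]
        u2, v2 = x2[i]
        A[i] = [u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1]
    # Least-squares solution: right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Second SVD: enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    return F / np.linalg.norm(F)
```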
Here we have constructed the least-squares problem and found its solution through an SVD. We then run a second SVD on the estimated F matrix itself to clean it up and arrive at a fundamental matrix of rank two.

Since we have run an automatic matcher between the two views to estimate point correspondences, some matches will be correct and many will be wrong. We need a procedure that groups the correct correspondences together, to give us a correct estimate of the fundamental matrix, and discards the outliers, the incorrect matches. The basic idea of this procedure, called RANSAC, is that all the good correspondences agree with each other, while the bad correspondences disagree with everything. The procedure works as follows. We randomly sample eight points from the set of correspondences, and from those eight points we compute a fundamental matrix; illustrated here is the epipole associated with that fundamental matrix. We then take this fundamental matrix and check whether all the other points agree with it. This is done by drawing, for each point x1 in the first image, its epipolar line in the second image, and checking whether the corresponding point in the second image actually lies on that line. Concretely, we measure the distance from the corresponding point to the epipolar line, set a threshold on this distance, call the points that fall within it inliers, and count how many points agree with the epipolar constraint. We iterate this random sampling process several times and pick the set of eight points that produced the largest inlier count. This lets us remove the outliers, the correspondences that were computed incorrectly, and keep the inliers, which in turn allow us to compute a much more robust estimate of the fundamental matrix.

Once we have the fundamental matrix, the first thing we do is peel away the calibration K to arrive at the essential matrix E, which contains only the rotation and translation. From the essential matrix we can then recover the two quantities we care about: the translation between the two cameras and the rotation between them. This is done by first extracting the translation from E, through its SVD, and then computing the rotation matrix from it. At this stage we can estimate the translation and rotation only up to a four-fold ambiguity: there are four possible configurations that all satisfy the linear constraints we have. To pick one of the four configurations, we triangulate points in 3D space under each assumed camera pose and check that the points project in front of the cameras rather than behind them. This narrows us down from four configurations to the one correct configuration. The triangulation problem itself assumes that the camera poses, and therefore the camera projection matrices P, are known, and estimates the points in three-dimensional space; this is again done through a least-squares estimation problem.
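To make the last few steps concrete, here are two small sketches in Python with NumPy. The first is one possible form of the RANSAC loop just described; it builds on the eight_point helper from the earlier sketch, and the threshold and iteration count are arbitrary illustrative values.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from each point in image 2 to the epipolar line F @ x1_h."""
    ones = np.ones((x1.shape[0], 1))
    x1_h = np.hstack([x1, ones])            # (N, 3) homogeneous points, image 1
    x2_h = np.hstack([x2, ones])            # (N, 3) homogeneous points, image 2
    lines = (F @ x1_h.T).T                  # (N, 3) epipolar lines a*u + b*v + c = 0
    num = np.abs(np.sum(lines * x2_h, axis=1))
    den = np.sqrt(lines[:, 0]**2 + lines[:, 1]**2)
    return num / den

def ransac_fundamental(x1, x2, iters=2000, thresh=1.0):
    """Fit F robustly: sample 8 matches, fit, count inliers, keep the best set."""
    best_inliers = np.zeros(x1.shape[0], dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(x1.shape[0], 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        inliers = epipolar_distance(F, x1, x2) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best sample for a more robust estimate.
    return eight_point(x1[best_inliers], x2[best_inliers]), best_inliers
```

The second sketch recovers the rotation and translation from the essential matrix and resolves the four-fold ambiguity with the "points in front of both cameras" check, using a linear triangulation helper. Again, the names and exact conventions are illustrative rather than taken from the lecture slides.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two views.
    P1, P2: 3x4 projection matrices; x1, x2: 2D pixel coordinates (u, v)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                      # inhomogeneous 3D coordinates

def decompose_essential(E, K, x1, x2):
    """Recover (R, t) from E up to the four-fold ambiguity, then pick the
    configuration that places a triangulated point in front of both cameras."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
    candidates = [(U @ W @ Vt, U[:, 2]), (U @ W @ Vt, -U[:, 2]),
                  (U @ W.T @ Vt, U[:, 2]), (U @ W.T @ Vt, -U[:, 2])]
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    for R, t in candidates:
        P2 = K @ np.hstack([R, t.reshape(3, 1)])
        X = triangulate(P1, P2, x1, x2)
        # Cheirality check: positive depth in both camera frames.
        if X[2] > 0 and (R @ X + t)[2] > 0:
            return R, t
```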
Once we have established how a pair of cameras is positioned and oriented relative to each other, we are ready to add a third and a fourth camera into the setup. For a new frame added to the computation, for example the image at the top right, instead of computing relative motion with respect to the first images, we can simply use the three-dimensional structure computed already and estimate a 3D-to-2D transformation. This step lets us quickly localize new images as they come in, relative to the reconstructed three-dimensional scene. It is computed through the perspective-n-point algorithm, and we have demonstrated linear methods for computing it, as shown in the equation here.

Finally, once we have computed the camera poses one image after another, relative to the first two images, and the three-dimensional scene structure, we refine everything in a nonlinear least-squares estimation called bundle adjustment. Here we measure the errors not in terms of the homogeneous coordinates of points, which correspond to rays from the optical center through the image into space, but on the image plane itself. That is, we take the homogeneous coordinates of each projected image point and divide by the z element to arrive at a two-dimensional measurement on the image plane, and we minimize the error between that and the measured image location of the point, obtained for example from SIFT features. This minimization is nonlinear because of the division by the z element, and also because the rotation matrices live on a nonlinear manifold, the rotation group SO(3).

In total, we have constraints, shown in green here, which are the image measurements of the feature points, and we have variables, which give rise to the reprojections into the images, shown in pink here: the camera parameters and the coordinates of the points in 3D space. The entire procedure for nonlinear least squares centers on one object, the Jacobian. The Jacobian is a matrix with as many rows as we have constraints from the images: if we have K image feature points, we have 2K measurements and therefore 2K rows. The columns of the Jacobian are associated with the variables we are estimating; we have color-coded them, with the orange ones associated with the camera rotations, the cyan ones with the camera positions in 3D, and the remaining ones with the three-dimensional points in space.

With the Jacobian computed, the nonlinear least-squares procedure consists of a sequence of steps, each looking for an update delta x, where delta x consists of updates to the camera rotations, the camera positions, and the three-dimensional points in space. This delta x is computed through the normal equations, with the Jacobian terms on the left and the error function we are minimizing on the right. Each step requires solving a least-squares problem, and we iterate by updating our cameras and points with the delta x we have computed.
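To make the quantity being minimized concrete, here is a minimal sketch of the reprojection residual for one camera, in Python with NumPy. The parameterization, with a rotation matrix R, a translation t, and a calibration matrix K, is an illustrative choice; a real bundle adjuster would typically use a minimal rotation parameterization such as an axis-angle vector. Stacking these residuals over all cameras and points, and differentiating them with respect to the camera and point variables, is what produces the Jacobian described above.

```python
import numpy as np

def reprojection_error(R, t, K, X, x_meas):
    """Reprojection residuals for one camera.

    R, t   : camera rotation (3x3) and translation (3,)
    K      : 3x3 calibration matrix
    X      : (N, 3) estimated 3D points
    x_meas : (N, 2) measured feature locations (e.g. SIFT detections)
    """
    X_cam = (R @ X.T).T + t                 # points in the camera frame
    x_h = (K @ X_cam.T).T                   # homogeneous image coordinates
    x_proj = x_h[:, :2] / x_h[:, 2:3]       # divide by the z element -> 2D points
    return (x_proj - x_meas).ravel()        # 2N residuals for N points
```

The division by the z element in the third line is exactly what makes the problem nonlinear, as noted above.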
Here we illustrate this with a simple example of three cameras and four points. The Jacobian has columns associated with the camera orientations and positions, color-coded in blue, as well as columns associated with the point positions, color-coded in orange. The Jacobian matrices we work with are highly structured; in fact they contain many zero entries, and in this diagram all the blocks color-coded in gray are zero. Focus on the right-hand side of the Jacobian, the columns associated with the point positions P1 to P4: this part has a block-diagonal form, because varying a three-dimensional point only affects the two-dimensional measurements of that same point in the images. Furthermore, notice that not all points are visible in all cameras; those missing observations are illustrated by the lighter, grayed-out boxes. So the Jacobian tells us, as a camera pose or a point moves a little bit, how much the image measurements associated with it change, and this matrix is highly sparse.

In practice we have far more points than cameras in a scene: the points may number in the hundreds of thousands, while the camera poses may be on the order of hundreds. So the Jacobian matrix is in fact highly sparse, as shown in this picture. As a result, J transposed times J, the Hessian approximation, is also very sparse, and recall that this is the matrix we need to invert. In fact, this matrix takes the form shown here. As you can see, the block associated with the camera-to-camera correlations is fairly sparse, consisting mostly of blocks on its diagonal, and the block associated with the point-to-point correlations is also very sparse, again with blocks on the diagonal and zeros off the diagonal. As a result, we have a much more efficient way of computing the inverse of this Hessian when solving the least-squares problem. We won't go into the details here, but only highlight the structure for you: a block A, a block B, and a block C, and the equation here demonstrates how one can efficiently compute the inverse of this matrix. This matters in practice because we have far more points than cameras. If you visualize this sparse matrix, it looks like this: a very large but block-diagonal point-to-point block C, a small camera-to-camera block A, and a camera-to-point block B, which is dense.

Now we are ready to put everything together into bundle adjustment. We take the linear least-squares solutions, which give fairly good estimates of the camera orientations as well as the three-dimensional points. From those we can back-project the three-dimensional points into the computed cameras and obtain the predicted locations, shown in blue. Bundle adjustment then further refines the three-dimensional point locations and the camera orientations so that the reprojected points match the measured points computed from the SIFT matches.
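As a rough illustration of how that block structure is exploited inside each update step, here is a minimal sketch of a Schur-complement solve of the normal equations. It assumes the camera-camera block A, the camera-point block B, the point-point block C, and the two right-hand-side vectors have already been assembled as NumPy arrays; in a real system, C would never be inverted as a whole but rather one small diagonal block at a time.

```python
import numpy as np

def schur_solve(A, B, C, g_cam, g_pts):
    """Solve [[A, B], [B.T, C]] [dx_cam; dx_pts] = [g_cam; g_pts]
    by eliminating the much larger, block-diagonal point block C first."""
    C_inv = np.linalg.inv(C)                 # cheap in practice: C is block diagonal
    S = A - B @ C_inv @ B.T                  # Schur complement, cameras only
    rhs = g_cam - B @ C_inv @ g_pts
    dx_cam = np.linalg.solve(S, rhs)         # small system: one unknown per camera parameter
    dx_pts = C_inv @ (g_pts - B.T @ dx_cam)  # back-substitute for the point updates
    return dx_cam, dx_pts
```

The key point is that the system actually solved densely involves only the camera variables, which are few, while the huge point block is handled through its block-diagonal structure.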
Here is the entire 3D reconstruction obtained from the images we saw earlier. And here is a sequence taken in our lab; in this case we carried a GoPro camera ourselves and also flew a drone with a GoPro on it, and we reconstruct the three-dimensional structure of the lab space as well as the camera trajectories. Here we visualize the point cloud computed in three-dimensional space, together with the cameras moving through the space. We have now seen the entire pipeline: reconstructing the three-dimensional environment of the scene as well as estimating the camera pose for each image. We have seen how we can estimate where we are in the scene and how to compute the three-dimensional structure of the world. For the final project, we will study this in a concrete setting: given a sequence of images, we will apply our tools to recover the camera motion and the three-dimensional structure.

We are living in a very exciting time; camera technology is rapidly improving. Just a few years ago, running around with a GoPro was unthinkable. Looking ahead, there are many opportunities coming from the worlds of robotics as well as entertainment. The technology we have introduced to you is only the beginning of this revolution. I hope you enjoyed the techniques we introduced and will apply them to many places we haven't yet thought about.