In the previous lectures we have seen how we can obtain the relative pose between two cameras, or between two times while the camera is moving. In this lecture we are going to generalize this to multiple views, in what we call bundle adjustment. Let us assume that we take pictures of this house. These pictures might have been taken from different cameras, at different points in time. The question is whether we can extract the three-dimensional model of this house out of these pictures, as well as the poses of the cameras when the pictures were taken. To get to an example closer to us, we took a digital camera and went to a very characteristic place on the University of Pennsylvania campus, the sculpture of Benjamin Franklin on the bench, and we took 15 pictures, without noting anything about the positions of the cameras and without clicking on anything in those images. Then we used one of the bundle adjustment methods that exist out there as open source, the one called PMVS, and we obtained this three-dimensional model of Benjamin Franklin, which is quite accurate. You see, it is also quite dense: it consists of several thousand points, and we can even map the color from the pictures onto these points, so that we have a quite accurate model which is also visually appealing. We can observe the model from viewpoints where no camera was present at all, like this point from the top, or this point from the bottom, which is real proof of the three-dimensional structure we have obtained out of just two-dimensional views. This is really the quintessence of geometric vision: using only two-dimensional views in order to extract three-dimensional models. We have seen the small pieces of it, the epipolar geometry, and how to do it with two frames; here we will explain how we can use many frames together.
This is another example, from the first very large reconstruction in history, from the UrbanScape project at the University of North Carolina, where a very detailed model was obtained using just a few cameras on a moving vehicle, without using any laser scanner or any other three-dimensional device. This is a quite old problem. It was known in photogrammetry as bundle adjustment, and because photogrammetry was quite advanced in Germany, the first term for this problem was the German Bündelblockausgleichung, which is in English bundle block adjustment. The way it was solved when pictures were taken from airplanes was that ground points, the green ground points, were used to get initial estimates of the cameras using something like a PnP algorithm, or a resection, and then all the rest of the environment was computed with what we today call bundle adjustment. So even in photogrammetry, out of very few pictures, you could get a quite detailed three-dimensional model of the environment, like the cliff that you see in this picture, including all the elevations and extrusions of this cliff. Now, back in computer vision, let's look again at our equations and see how we can solve for both the 3D model points, the 3D model, as well as the poses of the camera. The only thing we are given is N points with image coordinates x_p, y_p, where the index p runs over the points; they are observed over F frames, and we use the frame index f as a superscript. We are not given any other information, and the unknowns are the pose of every camera for every frame, R_f and T_f, and the 3D points X_p, Y_p and Z_p, which are constant, independent of the frame. If we assume that we have calibration information, then we can write the two projection equations as the transformation of the (X, Y, Z) points by a rotation and translation, projected with focal length equal to one. These are the two very basic equations for every point.
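The two projection equations per point and frame can be sketched in a few lines of Python. This is a minimal illustration of the calibrated projection described in the text (the function name is mine, not from the lecture):

```python
import numpy as np

def project(R, T, X):
    """Project a 3D point X into a calibrated camera with focal length 1.

    R (3x3 rotation) and T (3-vector translation) are the pose of one frame.
    Returns the two image coordinates (x, y): the two equations per point.
    """
    Xc = R @ X + T                        # transform the point into the camera frame
    return Xc[0] / Xc[2], Xc[1] / Xc[2]   # perspective division (f = 1)
```

For example, with the identity pose, the point (1, 2, 4) projects to (0.25, 0.5).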
Let's see how many points and how many frames we need in order to get at least a unique solution. As usual, we don't have a standard reference frame; we don't use any world frame. We have only these pictures, like the cameras around the Benjamin Franklin statue. So we need to choose one of these cameras as a reference frame. For this camera we are going to say that its orientation is the identity matrix and that the origin of the world is exactly this camera, so T_1 = 0. After fixing this first frame, we will still have a global scale factor present, so one of all the unknowns will not be recoverable. That is why usually we just set the translation between the first and the second frame equal to one, or we fix one of the 3D points in the environment at a specific distance. Now, having fixed one of the poses, we have F − 1 poses, which times 6 makes 6(F − 1) unknowns. And we have N points, but because we cannot recover one global scale, we have 3N − 1 unknowns. So in total we have 6(F − 1) + 3N − 1 independent unknowns. For every point in every frame we have two equations, so we have 2NF equations, where we include the equations of the first frame, of course. Now, in order to be solvable, we need the number of equations to be greater than or equal to the number of unknowns. We see, for example, that for two frames we need N ≥ 5 points, which is what we learned already in epipolar geometry: we need at least five points in order to solve for the three rotation unknowns and the two translation unknowns of two frames. For three frames we need four points as a minimum, and so on. Now, obviously, we cannot solve this in closed form, so we are going to do what is called an adjustment. And we call it bundle adjustment because it comes from the bundles of rays to the points.
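The counting argument above can be turned into a small check. Using the counts from the text, 6(F − 1) + 3N − 1 unknowns against 2NF equations, the minimum number of points for F frames is:

```python
import math

def min_points(F):
    """Smallest N satisfying 2*N*F >= 6*(F - 1) + 3*N - 1.

    Rearranging: N*(2F - 3) >= 6F - 7, so N >= (6F - 7) / (2F - 3).
    """
    return math.ceil((6 * F - 7) / (2 * F - 3))
```

This reproduces the numbers in the lecture: `min_points(2)` gives 5 and `min_points(3)` gives 4.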
So we have an error epsilon, which is the reprojection error of every point: x_p minus the x coordinate of the projected 3D point, and y_p minus the projected y coordinate. And we have some error covariance, which we will ignore for now; it does not change our solution much. We really want to minimize this epsilon transposed, covariance inverse, epsilon with respect to all the rotations, all the translations and all the points in the scene. After dropping the covariance matrix, we have an objective function phi(u), where u collects all the unknowns, and this is the inner product epsilon(u) transposed times epsilon(u), where epsilon is the vector of all the reprojection errors, for both x and y coordinates. This is a form of nonlinear least squares. The way we solve it is to take the Taylor expansion with respect to our unknowns. So we take a step delta u, and with the second-order expansion we have that phi(u + delta u) is the value phi(u), plus the linear term, the inner product of the gradient with delta u, plus a quadratic term, where the matrix in this quadratic form, H(u), is the Hessian, the matrix of all the second derivatives. Now, if we minimize this with respect to delta u, we get the equation that the Hessian times delta u equals minus the gradient of phi at u. This is the step we have to take in order to move in a decreasing direction while we are minimizing. If we take delta u steps just in the direction of the negative gradient, we follow the green path of gradient descent. If instead we solve this equation, taking delta u not just along the gradient but as the inverse Hessian times the gradient, we take the red path, which is a much shorter path to the minimum of this function.
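To make the step H · delta u = −grad phi(u) concrete, here is a toy sketch (my own example, not the lecture's figure): on a quadratic objective the Hessian is constant, and a single inverse-Hessian step lands exactly on the minimum, which is why the Newton-style red path is so much shorter than the green gradient-descent path.

```python
import numpy as np

# Toy quadratic phi(u) = 0.5 * u^T A u - b^T u, whose minimizer is A^{-1} b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # the (constant) Hessian H
b = np.array([1.0, 2.0])

u = np.zeros(2)                   # starting point
grad = A @ u - b                  # gradient of phi at u
delta_u = np.linalg.solve(A, -grad)   # solve H * delta_u = -grad
u = u + delta_u                   # one step reaches the minimum exactly
```

Plain gradient descent from the same start would need many small steps along −grad to reach the same point.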
How does the gradient look if our objective function is epsilon transposed epsilon? If we split it among all the epsilon terms, we find easily that the gradient of phi is the Jacobian transposed times the error epsilon, where the Jacobian is nothing else than the partial derivative of every error element with respect to every unknown. If we differentiate once more, we obtain the Hessian, which is the sum of the Jacobian transposed times the Jacobian, plus the error times the Hessian of every error element; we obtain this with the product rule. Now we are going to omit the second, quadratic term inside the Hessian, and we remain with just the Jacobian transposed Jacobian. When the Hessian becomes just the Jacobian transposed Jacobian, the iteration for delta u becomes what is called the Gauss-Newton iteration: minus the inverse of (Jacobian transposed Jacobian) times the gradient, which is the Jacobian transposed times the error. Now remember that the Jacobian is the derivative of all the equations we had, all the constraints, all the error terms, with respect to all the unknowns. So it has one row for every error element and 6F + 3N − 7 columns, one for every unknown; when we form the Jacobian transposed Jacobian we get at the end a (6F + 3N − 7) × (6F + 3N − 7) matrix. So the whole art here is how to efficiently invert the Jacobian transposed Jacobian. We will follow here the sparse bundle adjustment method, which was the first efficient method for inverting J transposed J, invented by Manolis Lourakis. We assume that we have two kinds of unknowns: the motion unknowns, which we collect into a vector a, and as we said there are 6F − 6 of them for F frames; and the structure unknowns, which we collect into a vector b, and these are 3N − 1, where the minus 1 is because of the scale ambiguity.
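The Gauss-Newton iteration delta u = −(J^T J)^{-1} J^T epsilon can be sketched generically (the function and argument names are mine; a real bundle adjuster would exploit the sparsity discussed in the text instead of this dense solve):

```python
import numpy as np

def gauss_newton(residual, jacobian, u0, iters=10):
    """Gauss-Newton for nonlinear least squares min ||eps(u)||^2.

    residual(u) returns the error vector eps, jacobian(u) its Jacobian J.
    Each step solves (J^T J) delta_u = -J^T eps and updates u.
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(iters):
        eps = residual(u)
        J = jacobian(u)
        delta_u = np.linalg.solve(J.T @ J, -J.T @ eps)
        u = u + delta_u
    return u
```

As a quick sanity check, the residual eps(u) = (u_0² − 2, u_1 − 1) drives u to (√2, 1).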
We will explain this case much better with a very simple example: just three frames, with two unknown motions, a1 and a2, and three unknown points, b1, b2 and b3. We are not going to deal here with the first frame, which we have already fixed to the identity, nor with the global scale ambiguity. If we write the Jacobian, we know that every row corresponds to a measurement and every column corresponds to an unknown. The first two block columns correspond to the unknown motions, and the three last block columns correspond to the unknown structure. In this case we have 21 columns of unknowns, because we have nine unknowns for the structure and six for each of the two motions. Each of the A matrices is 2 × 6, because it corresponds to the 2 measurements of a point differentiated with respect to the 6 motion unknowns, and each of the B matrices is a 2 × 3 matrix, the Jacobian of the same two measurements with respect to the structure. Now, if we take the Jacobian transposed Jacobian, we obtain the following pattern. We have a block diagonal matrix, U1, U2, corresponding only to the motion. We have a block diagonal matrix, V1, V2, V3, corresponding only to the structure. And we would already be done if we did not have the off-diagonal elements, the Ws. Because of these off-diagonal elements we cannot take the inverse just by inverting all the diagonal blocks. There is a trick for this, though. Let us write this equation: it is the update step, updating the motion by delta a and the structure by delta b. The error terms are epsilon_a prime and epsilon_b prime, and they are actually the gradient terms, the Jacobian transposed times epsilon; that is why we put the prime there. Let's see whether there is any trick to decouple them. The trick we are going to use is pre-multiplying with the matrix whose blocks are the identity, minus W V inverse, zero, and the identity. We multiply both sides, so that nothing changes in the equations. And now we look at the block in the first row and second column.
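The block pattern described above can be verified numerically. This sketch (my own construction, with random placeholder values) builds the Jacobian for the toy case of two unknown motions and three points, and forms J^T J, whose motion-motion and point-point off-diagonal blocks come out exactly zero, leaving only the U, V and W blocks:

```python
import numpy as np

# Toy case from the text: 2 unknown motions (a1, a2), 3 points (b1, b2, b3).
# The observation of point p in frame f contributes a 2x6 block A (motion part)
# and a 2x3 block B (structure part). Values are random placeholders, since
# only the sparsity pattern of J^T J matters here.
rng = np.random.default_rng(0)
n_motions, n_points = 2, 3
J = np.zeros((2 * n_motions * n_points, 6 * n_motions + 3 * n_points))
for f in range(n_motions):
    for p in range(n_points):
        r = 2 * (f * n_points + p)
        J[r:r + 2, 6 * f:6 * (f + 1)] = rng.standard_normal((2, 6))            # A block
        J[r:r + 2, 12 + 3 * p:12 + 3 * (p + 1)] = rng.standard_normal((2, 3))  # B block

H = J.T @ J
U = H[:12, :12]    # block diagonal (U1, U2): motion-motion terms
V = H[12:, 12:]    # block diagonal (V1, V2, V3): structure-structure terms
W = H[:12, 12:]    # off-diagonal W blocks coupling motion and structure
```

Different motions never share a row of J, and neither do different points, which is exactly why U and V are block diagonal.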
With this combination, the identity times W minus W V inverse times V, we see that this block is actually equal to 0, which means that when it multiplies the second unknown, delta b, that term disappears completely. This means that in the first equation we get U minus W V inverse W transposed, and we are left with no term involving the structure update; we have only delta a. This way we have actually isolated the motion unknowns in one equation, and we can solve it. This is not for every single motion; it is for all the motions together, a 6(F − 1) × 6(F − 1) matrix, but still it is without the structure: it is a truly structureless system. Now, after we solve for this delta a, we can substitute it into the second equation and obtain the update for the structure. This is actually very easy, because it decouples for every structure element: we have to invert only a 3 × 3 matrix to find the update for each point, which makes everything very, very efficient. So instead of inverting the whole matrix at the beginning, whose size grows with both the number of frames and the number of points, we invert only a matrix whose size grows with the number of motions, and then, back-substituting the result, we update every point separately. This way we can use thousands of pictures and millions of points. And because the expensive step is independent of the number of points, we can first find the poses of the cameras, which you can see here in this solution by the system called Bundler, and then solve for every point separately. This allows us to reconstruct structures with millions of points. We will see how we use this bundle adjustment method in another setup, which is called visual odometry.
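The decoupling trick can be sketched as a generic block solve (the helper name is mine; note that in real sparse bundle adjustment V is block diagonal, so inverting it costs only many small 3 × 3 inversions, whereas this brief sketch uses a dense inverse):

```python
import numpy as np

def solve_schur(U, W, V, eps_a, eps_b):
    """Solve the block system [[U, W], [W^T, V]] [da; db] = [eps_a; eps_b].

    First da from the reduced, structureless system with the Schur
    complement U - W V^{-1} W^T, then each db by back-substitution.
    """
    V_inv = np.linalg.inv(V)                  # block diagonal in SBA: cheap
    S = U - W @ V_inv @ W.T                   # reduced (motion-only) matrix
    da = np.linalg.solve(S, eps_a - W @ V_inv @ eps_b)
    db = V_inv @ (eps_b - W.T @ da)           # per-point back-substitution
    return da, db
```

The result matches a direct solve of the full system, but only the small motion block ever needs a full factorization.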