In the last lecture we talked about bundle adjustment: how to obtain large three-dimensional models from multiple views. This lecture's topic is very similar; it's called visual odometry, but the emphasis here is not on how to get the three-dimensional structure but on how to get the poses of a camera as a path. Look at this video, which we have taken here in West Philadelphia using our car. We used a panoramic camera, whose input you can see on the upper left. In the other pictures you can see the blue trajectory, which is the trajectory reconstructed by our system, and the black points, which are the feature points we used. We then superimposed the trajectory on a map of Philadelphia. The only input we used was this panoramic camera, the panoramic video you see on the top left. What is odometry? Odometry is really counting your steps: counting how far you go, either by counting your steps or by counting the rotations of your wheels. You have probably heard of the odometers in taxi cabs. In biological perception, we talk about path integration. This is what animals and humans do when they don't have reference points, when they don't have place cognition; they just integrate their path, and they know approximately how far they went. More generally, integration of velocity or acceleration measurements is called inertial odometry. What is visual odometry? Visual odometry is really the process of incrementally estimating your position and orientation with respect to an initial reference by tracking only visual features. It sounds very similar to bundle adjustment. The difference is that bundle adjustment can have very large baselines; it can use different cameras, it can even use random images from the web. Visual odometry is usually from a camera that you either hold or that is mounted on a robot, and because the input is taken as a video, we can really exploit the continuity of the trajectory.
We might even apply a motion model to our video, and we really want to reconstruct this path. We also use the term visual SLAM, and many people use the two interchangeably, but when we say visual SLAM, we put the focus not only on the trajectory but also on the feature map: the map of the visual features, triangulated in the world. It is a very wide field with many advances in the last 15 years. It has not yet made it into textbooks, but there is a very good reference tutorial by Davide Scaramuzza, and very recently, in December 2015, there was the ICCV workshop on the future of real-time SLAM; I really encourage you to visit the website of this workshop and look at all the slides. The most successful application of visual odometry is probably on the planet Mars. NASA has already sent three vehicles there, and even the early rovers, Spirit and Opportunity, had to solve the following problem. Although there was some remote control from Earth to move the vehicles, the delay of sending a command to Mars can be up to 20 minutes, so there is no way to really drive such a vehicle with a joystick. Now, how can a vehicle navigate on Mars? There is no GPS there, so the only thing we can do is apply visual odometry. We might send some waypoints where the robot has to go, but between two waypoints the robot has to solve the visual odometry problem. Another big success of visual odometry is the vacuum cleaner called Dyson 360 Eye, which uses an implementation of Andrew Davison's visual SLAM. It uses an omnidirectional, 360-degree camera system which captures a panoramic picture, and then, using natural features in the environment, it can find its position and traverse a regular pattern while knowing at every point where it is with respect to the first frame. Now let's go back again to our equations, to the multiple-views setting.
We had calibrated point projections x_p and y_p for frame f, and we have the unknown poses R, T and the 3D points X, Y, Z. In visual odometry, given an estimate R_k, T_k of the current camera pose, as well as the 3D points, and having also the corresponding calibrated point projections, we need to update the pose at every time step: we have the pose at time k, and we want to update it to the next time point, k+1. When we say visual odometry, by default we refer to monocular visual odometry, using just one camera, and this means that when we don't use any other sensor, we still have an unknown global scale. The update is done in two steps: one for the rotation and one for the translation. First, when an incoming image at time k+1 arrives, we find features and try to find the right correspondences. These correspondences have many outliers, so we need to apply RANSAC in order to select the inliers, and usually we do it with what we call a minimal problem: in this case by choosing five points, sampling over five points, and then applying the five-point algorithm. After we find the inliers, we solve for the epipolar geometry, which means we find the essential matrix E, and from it we can obtain a rotation estimate. The rotation estimate is really sufficient in order to update the rotation. We also obtain a translation estimate, but it is not enough, because we don't know its scale, so we cannot really apply this last equation. For the translation, what we really need is an estimate of the 3D points. So we first need a triangulation of the 3D points, and then we can update the translation by using a PnP algorithm, a 2D-to-3D algorithm. At this 2D-to-3D step we have the option of updating the rotation and translation together, or only the translation. This is the main cycle of visual odometry: we always have essential matrices between pairs of frames.
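The two-step update just described can be sketched in a few lines of numpy. This is a minimal sketch under my own assumptions: it presumes that the relative rotation and the unit translation direction have already been recovered by decomposing the essential matrix, and that the scale has been obtained separately from triangulated 3D points via PnP; the function name and the world-to-camera convention are illustrative, not from the lecture.

```python
import numpy as np

def update_pose(R_k, t_k, R_rel, t_dir, scale):
    """One step of the monocular VO cycle, in world-to-camera convention
    (x_cam = R X + t). R_rel and t_dir come from decomposing the essential
    matrix between frames k and k+1; t_dir is only a unit direction, so the
    metric scale must be supplied from the triangulated 3D points (PnP)."""
    R_next = R_rel @ R_k                    # rotation needs no scale
    t_next = R_rel @ t_k + scale * t_dir    # translation does
    return R_next, t_next
```

Starting from R = I, t = 0, a pure forward motion t_dir = [0, 0, 1] with scale 2 moves the camera two units along its optical axis.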
We compute relative rotation and translation between two successive images, and then we need to integrate. Because we use these pairs of subsequent frames, depending on the baseline and depending on how many features we can track, this can become very vulnerable to what we call drift. Drift is the main problem of visual odometry, and to really address it, what we do is group a window of frames, say the last n frames, and apply a bundle adjustment. The advantage of this bundle adjustment is not only that we then have a longer baseline, but also that we directly minimize the reprojection error over all the unknowns together, and this creates an excellent local map, using the same reprojection error that we used in bundle adjustment, with pretty much these two equations in a nonlinear least-squares setup. Now, when we apply a global filter over the whole sequence, each 3D point of the structure contributes three unknowns to the state vector, and we have our rotations, which are updated with a first-order integration using the angular velocity: we use the exponential of the skew-symmetric matrix of the angular velocity to update the rotation. Then we have an update for the translation using a velocity; we assume that both velocities, the angular and the translational, are constant. For any filter approach, like the Kalman filter, we also need to update the covariances, which are really estimates of the error. We really need a good propagation of the error, first to make sure that we have some idea of how uncertain we are. You might have seen the big circles around the GPS position when your GPS measurement is uncertain; the same is what we are going to do with visual odometry. But we also really want to know how uncertain our structure is, as we will see in the next slide.
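The rotation update just mentioned, R_{k+1} = exp([ω]_× Δt) R_k, has a closed form via the Rodrigues formula for the matrix exponential of a skew-symmetric matrix. A minimal sketch, not the lecture's implementation:

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w]_x such that [w]_x v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def integrate_rotation(R, omega, dt):
    """R_{k+1} = exp([omega * dt]_x) @ R_k, using the Rodrigues formula
    exp(theta [a]_x) = I + sin(theta) [a]_x + (1 - cos(theta)) [a]_x^2
    for a unit axis a."""
    theta = np.linalg.norm(omega) * dt
    if theta < 1e-12:
        return R          # negligible motion: rotation unchanged
    K = skew(omega / np.linalg.norm(omega))
    exp_K = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return exp_K @ R
```

For example, an angular velocity of π/2 rad/s about the z-axis over one second rotates the x-axis onto the y-axis.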
So if Σ is our covariance and Σ_{k,k-1} is the covariance of the update between frame k-1 and frame k, then, by pre- and post-multiplying with the Jacobian, where the Jacobian has exactly the same meaning as in bundle adjustment, we can update the covariance, and we can really visualize it as an ellipsoid for the 3D points, an ellipsoid for the position, and some other representation for the rotation. Now, when we have triangulated points, we said that this is the only way to update our translation without the scale-factor problem. But these updates can be quite sensitive to errors in the 3D structure if we do them after every step. You see here this sequence of steps, with the triangulation uncertainty of the structure: the uncertainty remains quite large even at the second frame; we have a quite large uncertainty in depth. But when we move forward, if we are lucky enough to still track the same point, then with a very large baseline we can get a very small uncertainty ellipsoid, and in this case, this is really the point at which to update our translation estimate. The frames where this happens are called keyframes, and they are very important in visual odometry implementations. Another issue in visual odometry is really the outliers. Outliers appear because of illumination changes, because of occlusions, or because we might be moving very, very fast. You see here an example of the trajectory we reconstructed without any inlier selection, which is the blue one, and after we really select good inliers, which is the red trajectory. Outliers also cause drift, because they really cause biases in the rotation and translation estimation. In 2004, David Nistér, who invented the five-point algorithm, also provided us with a solution to the inlier problem in the two-view case.
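The pre- and post-multiplication with the Jacobian described here is ordinary first-order error propagation; the following hedged sketch uses variable names of my own choosing, and the optional additive process-noise term Q is my addition in the spirit of a Kalman-filter prediction, not something stated in the lecture.

```python
import numpy as np

def propagate_covariance(Sigma, J, Q=None):
    """First-order (linearized) error propagation: Sigma' = J Sigma J^T.
    J is the Jacobian of the update with respect to the state, exactly as
    in bundle adjustment; Q is optional additive process noise."""
    Sigma_new = J @ Sigma @ J.T
    if Q is not None:
        Sigma_new = Sigma_new + Q
    return Sigma_new
```

For intuition: if the Jacobian doubles every coordinate, all variances grow by a factor of four, which is exactly the growth of the uncertainty ellipsoid.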
Now, choosing quintuples, these groups of five points, inside RANSAC can be very expensive, so we need some alternative, and an alternative was invented here at the University of Minnesota. The idea is that if you know a direction, like the gravity from the IMU, or just a point at infinity, then you already know two degrees of freedom of the rotation, and the remaining problem has three degrees of freedom: one for the yaw angle and two for the translation direction. So every time, before you solve, you align with this direction, for example the gravity, and then you solve a constrained problem in which the rotation part has only one unknown angle, the way you see it here, and the skew-symmetric matrix of the translation has only two unknowns, x and y. Because you know only the direction of this (x, y), you just set it as (cos θ, sin θ), and you have to solve a system of four equations in four unknowns. This can be solved much faster than the five-point algorithm, and we can obtain much better RANSAC solutions without spending most of our time on inlier selection. Another issue in visual odometry is really loop closing. Odometry usually does not involve any loop closure; it just involves counting steps. But when you are counting steps and you come back to the same point, and you see with your eyes that it is the same point, then you really have to enforce in your system that this image has been seen before, and any error that you have in your estimated pose at this picture has to be corrected, so that you come back to the same position where you started, for example. This is an essential element of every visual odometry algorithm, and it has two steps. The first step is that you look in the vicinity of every pose to see whether there is a place you have been before, which means checking whether you are revisiting the same place, and we do it at the feature level.
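The alignment step in the gravity-aided solver described earlier, rotating the camera frame so that the measured gravity direction coincides with the world's "down" axis, after which only the yaw angle of the rotation remains unknown, can be sketched as follows. The axis conventions (world down = [0, 0, -1]) and the function name are my assumptions for illustration.

```python
import numpy as np

def align_to_gravity(g_cam):
    """Rotation that maps the gravity direction measured in the camera
    frame (e.g. from the IMU accelerometer) onto the world 'down' axis
    [0, 0, -1]. This fixes roll and pitch; only yaw stays unknown.
    Uses the standard two-vector Rodrigues alignment formula."""
    g = np.asarray(g_cam, dtype=float)
    g = g / np.linalg.norm(g)
    down = np.array([0.0, 0.0, -1.0])
    c = float(np.dot(g, down))
    if np.isclose(c, -1.0):            # antipodal case: rotate pi about x
        return np.diag([1.0, -1.0, -1.0])
    v = np.cross(g, down)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + (K @ K) / (1.0 + c)
```

If gravity already points straight down in the camera frame, the result is the identity; a sideways gravity reading produces the 90-degree correction.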
For example, with vocabulary trees. Then we just apply geometric consistency checks, and maybe also a bundle adjustment, in order to correct all our poses, so that we are again at the correct pose and we don't create a phantom by hallucinating that we are in a different position. So these are the basic ingredients of every visual odometry algorithm; let us repeat them here. We do bundle adjustment over a window to minimize the drift. We do keyframe selection to really minimize the triangulation error. We apply RANSAC with five points or three points in order to select the inliers. And last, if we are revisiting places, we really have to adjust our position with what we call visual loop closing. Now, newer systems use additional information. One of the systems here, from the University of Minnesota, uses a combination of inertial and visual elements. You can see on the left the incoming images and on the right the trajectory. This uses just a regular cell phone and the inertial measurement unit inside the cell phone. What we really get from the inertial measurement unit is the known scale. An inertial measurement unit measures the acceleration and the angular velocity, and the acceleration, which is really meters per second squared, when integrated together with the velocity, allows us to estimate our pose in terms of meters, not in terms of an unknown global scale, as we have seen before. Another way to cope with this unknown scale is, for example, information about the motion itself. In this case, in this system called libviso, developed by Andreas Geiger in 2012, we exploit the height of the camera above the street and the fact that we are really moving with a planar motion. In this case, you can see the feature tracks on the right, and the projection of the reconstructed trajectory on the left, including, in blue, all the features which were reconstructed up to this point.
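The scale recovery from the known camera height, as in libviso-style systems, essentially amounts to ratioing the true physical height against the reconstructed, unit-scale height of the road plane. This toy sketch, with names and a simple median in place of the robust ground-plane fit a real system would use, is my illustration rather than the actual libviso code:

```python
import numpy as np

def scale_from_camera_height(ground_point_heights, true_height):
    """A monocular reconstruction is only defined up to scale: the road
    plane ends up at some arbitrary distance below the camera. Ratioing
    the known physical camera height against the median reconstructed
    height of the ground points fixes the global metric scale."""
    est_height = np.median(np.asarray(ground_point_heights, dtype=float))
    return true_height / est_height
```

If the road points sit 0.5 units below the camera in the reconstruction while the camera is really mounted 1.5 m above the street, every reconstructed length must be multiplied by 3 to become metric.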
Another recent development in visual odometry is the semi-direct approach, where in addition to features we use the whole image directly in the reprojection error for the motion. This is a quite impressive video from a quadrotor; we see on the right the reconstruction of the trajectory of the vehicle, and also the points reconstructed from the ground, and the way these points are seen and tracked in the picture on the bottom left. Probably the most recent and successful application is the realization of visual-inertial odometry in Project Tango, which started as a small cell phone from Google and is now a tablet. It captures an omnidirectional image and inertial information along the trajectory of this tablet. You see on the left the visual measurements and on the right the reconstructed trajectory; the features are not that many, here perhaps 70 or 80 or 100, and still we can produce a quite accurate trajectory of the tablet. Now, what is the future of visual SLAM? In the future, visual SLAM, in addition to features and inertial information, can really include some semantic information. For example, we recognize the doors, and we have some model for door recognition. In this case here, taken inside the computer science building, we see the reconstruction of the trajectory of the camera using features, but also the doors, whose recognition you see as bounding boxes, as well as the chairs in the environment. Semantic information does not only help us build a semantic map, knowing where the doors and the chairs are, but also allows us to solve the visual loop closing very efficiently. Visual odometry is an application that we are going to use everywhere in inertial navigation wherever there is no GPS, and probably also in many virtual reality setups, where we track the head of a user.