[MUSIC] Welcome to our online course in perception. I am Kostas Daniilidis. >> And my name is Jianbo Shi. >> In this course, we're going to learn how to extract information from images and video. This information might be about the position of a camera mounted on an aerial vehicle, like in the video here, or on a human, or on any robot. Just by watching this video, you get some idea about the trajectory of the camera, and you also get some idea about the distances to the points in the scene. And all that, although you're just watching a two-dimensional video without any intrinsic 3D measurement information.

In the first week of this course, we are going to study the geometry of perspective projection. These mathematical models are very powerful because they allow us, with a lot of intuition, to understand how a camera is posed with respect to the environment and how far away the objects are. In this scene with this very nice ground floor, you can see two vanishing points, the intersections of parallel lines. And the line connecting the two vanishing points, called the horizon, already tells you whether or not you are tilted with respect to the ground floor. It also gives you a very good idea of where the camera is: where the projection center is from which all these points are seen, and through which the rays to these points pass.

In the second week, we're going to see how we can exploit this information to create visual effects. You might have seen virtual advertisements on soccer fields. Your homework will actually be to put this Penn Engineering logo on this video, even though the camera is moving and without having any information about the camera or about the dimensions of the soccer field. This looks like magic, but it is not: it derives from the power of projective geometry. With very simple ideas, like vanishing points and the correspondence of points in the scene, you can pretty much change your environment without having any metric information, that is, without any information about distances. This is one of the uses of projective geometry: to produce powerful augmented reality.

On another note, also about soccer, in the second week we're going to study a very old controversy, the final of the 1966 World Cup, where it remained undecided whether a goal was really behind the goal line or not. Again, in this case, we're going to study just two pictures, without knowing how far away the cameras were in Wembley Stadium, where the match took place. With projective geometry, with vanishing points, and with projective transformations, we're going to find out whether the ball was indeed behind the goal line or in front of it.

>> In week three and week four, we're going to ask the question: where am I? This question is fairly easy if you have a cellphone with you. You can just take it out, look at the GPS location on a Google map, and it will tell you where you are. But what if you're flying a drone, or moving in a space where you don't have GPS or a cellphone with a map attached to it? In weeks three and four, we're going to study how we can reason about where we are in a three-dimensional scene by just looking at the two-dimensional picture itself. There are two groups of methods for achieving this goal. The first group of methods assumes we know something about the 3D environment.
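The virtual-advertisement trick described above comes down to estimating a projective transformation, a homography, from point correspondences. Here is a minimal sketch in Python, assuming we have already clicked the four field points where the logo's corners should land; all coordinates are made-up values and the function names are our own.

```python
# A minimal sketch of the projective mapping behind the logo assignment:
# estimating the 3x3 homography H that sends the corners of a logo image
# to four clicked points on the soccer field.
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: find H (up to scale) with dst ~ H @ src,
    from n >= 4 point correspondences given as n x 2 arrays."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u*x, u*y, u])
        A.append([0.0, 0.0, 0.0, -x, -y, -1.0, v*x, v*y, v])
    # The solution is the null vector of A: the right singular vector
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = (H @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]

# Corners of a 200 x 100 logo, and where they should land on the field.
logo  = np.array([[0, 0], [200, 0], [200, 100], [0, 100]], dtype=float)
field = np.array([[310, 420], [540, 405], [560, 480], [295, 500]], dtype=float)
H = estimate_homography(logo, field)
print(apply_homography(H, logo))   # reproduces the clicked field points
```

To actually paste the logo you would warp its pixels through H (for instance with OpenCV's warpPerspective) for every frame, re-estimating H as the camera moves; no camera calibration and no field dimensions are needed, which is exactly the point made above.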
For example, in this scene where we're standing at a street corner, we can recognize the road in the scene, and we can keep recognizing it from our own view as we move through the space. Once we can recognize the road, the road itself and the lane markings on it allow us to find the vanishing point of the road in the image, as shown here. Once we have a vanishing point in the image, we draw a line from the optical center through the vanishing point; that ray, extended into three-dimensional space, is in fact parallel to the road in 3D. As such, we have a way to orient ourselves with respect to the road along the vanishing-point direction, which we call z. The road surface might not be visible everywhere, but oftentimes we live in a man-made, structured world where we have parallel lines on the ground plane that we can recognize across multiple views. As long as we have those, we can identify our orientation with respect to this point at infinity.

One point at infinity allows us to estimate two degrees of freedom in rotation; to estimate the entire three-dimensional rotation matrix, we can use two vanishing points. What this requires is finding two perpendicular directions in the three-dimensional world, and for those two directions, the vanishing points in the images associated with them. Now we again have two vanishing points in the image and the optical center. We draw a ray from the optical center through vanishing point 1, and from the optical center through vanishing point 2; those two rays, extended into 3D space, are also perpendicular to each other, and in fact align with the two perpendicular directions in 3D. As such, we can orient ourselves fully in 3D. And quite often in man-made environments we have such perpendicular lines. For example, on the building here we can see vanishing points in two different directions, one on one facade and one on the other facade.

If we don't have vanishing points, we can still estimate our orientation with respect to a known object in the world. For example, if there is a planar object on the ground, then as soon as we can recognize four points on it, we can localize ourselves with respect to those four points, meaning that we know the orientation of the camera in 3D as well as its position relative to the object. If we don't have a planar object, then in the more general case we might have a three-dimensional object whose shape we know. With a known three-dimensional shape, we can again infer the position of the camera from the two-dimensional image alone by looking at the 3D-to-2D correspondences.

The second group of methods for computing camera orientation and position requires no known objects in the three-dimensional scene. All we need to do is take multiple pictures of the scene from different angles. As illustrated here, two friends take pictures of the same building across the street from different viewpoints, and if they share the pictures with each other, we can compute the relative three-dimensional rotation and translation between the two views. The key idea is that if Bob can see Alice in his picture and vice versa, if they can see each other in each view, then we have some idea of where they are relative to each other.
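Before continuing with the two friends, here is a minimal sketch of the vanishing-point idea above: orienting the camera from the pixel locations of the vanishing points of two perpendicular scene directions, such as the two facades of a building. The calibration matrix K, the vanishing-point coordinates, and the function name are made-up values for illustration.

```python
# A minimal sketch: back-project two vanishing points of perpendicular
# scene directions into 3D rays; those rays, plus their cross product,
# give the full camera orientation.
import numpy as np

def rotation_from_vanishing_points(v1, v2, K):
    """The two back-projected rays, plus their cross product, form the
    columns of a rotation matrix giving the scene axes in the camera
    frame."""
    Kinv = np.linalg.inv(K)
    d1 = Kinv @ np.array([v1[0], v1[1], 1.0])   # ray toward vanishing point 1
    d2 = Kinv @ np.array([v2[0], v2[1], 1.0])   # ray toward vanishing point 2
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)
    d3 = np.cross(d1, d2)                       # third perpendicular axis
    R = np.column_stack([d1, d2, d3])
    # With noisy vanishing points d1 and d2 are not exactly perpendicular,
    # so snap R to the nearest true rotation matrix via SVD.
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt

# Made-up focal length, principal point, and vanishing-point pixels
# (vanishing points often fall outside the image, hence the coordinates).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = rotation_from_vanishing_points((1120.0, 260.0), (-430.0, 255.0), K)
print(np.round(R, 3), np.linalg.det(R))   # orthonormal columns, det = +1
```

A single vanishing point pins down only one axis, the two degrees of freedom mentioned above; the second vanishing point, together with the cross product, completes all three columns of the rotation.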
Returning to the two friends: in practice, when we take pictures of the same object in a scene, we are usually not positioned in such a way that we can see each other. But we can still recover this relative position by treating each camera as a so-called pinhole; that pinhole, the projection center of the second camera, plays the role of the second camera person in the first person's view. Doing this requires finding correspondences of at least eight points in the scene across the two different views. We do not need to know the three-dimensional shape, just the two-dimensional correspondences of a few selected points in the scene. Once we have those, we can compute the relative camera rotation and translation, in this case between Bob and Alice.

As more images come into play, we can estimate the entire three-dimensional scene structure as well as the camera positions of all the viewers, not just the two views of Bob and Alice but multiple viewers. This is done through a technique called bundle adjustment. This technique allows us to tweak the positions of the three-dimensional points, as well as the camera positions of all the viewers in the scene, together in a nonlinear least-squares optimization. The result is a complete estimate of "where am I" across all the views and, as a by-product, the structure of the scene, as illustrated here.

This course provides you with a set of tools for computing the three-dimensional position of the camera. But this is not really fun unless you take your own camera out of your pocket and go out and take some pictures in your environment, in your city, at your school. You can apply the techniques we teach directly to your own pictures. From that, you can really figure out where you are relative to your friends and to your own earlier positions. And you can, in fact, compute the 3D structure of your own school and its buildings. So I hope you enjoy it, and go out and take many pictures.
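To make the eight-point idea above concrete, here is a minimal sketch: estimating the essential matrix from point correspondences in two calibrated views and reading off the candidate relative rotations and the translation direction. The synthetic scene and function names are our own; a real pipeline would also add Hartley normalization for numerical conditioning, a RANSAC loop for outliers, and a cheirality (points-in-front-of-both-cameras) test to pick the one physically valid pose among the four (R, t) candidates, and bundle adjustment would then refine everything jointly.

```python
# A minimal sketch of the eight-point algorithm, assuming calibrated
# cameras (pixel coordinates already multiplied by K^-1).
import numpy as np

def essential_from_points(x1, x2):
    """Estimate E from n >= 8 correspondences (n x 2 arrays, normalized
    image coordinates) using the constraint x2_h^T E x1_h = 0."""
    A = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the essential manifold: singular values (1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

def decompose_essential(E):
    """Return the two candidate rotations and the translation direction
    (translation is only recoverable up to scale and sign)."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    R1 *= np.sign(np.linalg.det(R1))   # force proper rotations, det = +1
    R2 *= np.sign(np.linalg.det(R2))
    return R1, R2, U[:, 2]

# Tiny synthetic check: 20 points seen by camera 1, and by camera 2
# after a known rotation about y and a sideways translation.
rng = np.random.default_rng(0)
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], (20, 3))
th = 0.1
R_true = np.array([[np.cos(th), 0.0, np.sin(th)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(th), 0.0, np.cos(th)]])
t_true = np.array([1.0, 0.0, 0.0])
X2 = X1 @ R_true.T + t_true
x1, x2 = X1[:, :2] / X1[:, 2:3], X2[:, :2] / X2[:, 2:3]

R1, R2, t = decompose_essential(essential_from_points(x1, x2))
print(np.allclose(R1, R_true) or np.allclose(R2, R_true))  # True
print(np.allclose(np.abs(t), t_true))                      # True, up to sign
```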