Hello, and welcome back. I want to illustrate now how we can use sparse modeling for problems in image and video classification. In particular, I want to address the very challenging problem of classifying or recognizing activity in video. For example, if we look at this video we want to be able to say there is a person carrying an object. If we look at this video we want to be able to say there is a person jumping. If we look at this video we want to be able to say there is a person running. And if we look at this one we want to be able to say there is a person jogging. Of course, we want to be able to recognize all the activities in any video. For example, here we have a clip of the movie Forrest Gump, and we want to be able, hopefully, to say everything that is happening in the movie, basically every single activity in that movie. How can we do that with sparse modeling? Let's see.

We're going to start from what's called supervised classification. The basic idea is that we are going to have a collection of data for training, and that data comes with labels. So somebody is going to basically tell us: this video has this type of activity, this video has that type of activity. For example, we're going to have this video, and it comes with its own activity and its own label. What we're going to do is learn a dictionary for this type of activity, one dictionary per activity or per class.

Now, we could just go and look at all the patches in the video and try to learn from those patches as we have done in some of the previous lectures. But not every patch in this video is actually interesting. We want to use only the patches that have some information about the activity. One way of doing that is just to take frame differences. If we take frame differences, when there is no activity that difference will basically be very small, and when there is activity, when there is a person moving, we will have a big difference. So we just take consecutive frames and subtract one from the other, one frame minus the previous one and so on, to see which regions in each frame have some activity. Let me show you that. Here we have basically a video of frame differences, nothing else than every frame minus the previous one. And you see immediately that we can take apart the regions that have activity from the regions that do not.

Then we only look at patches, basically regions of, let's say, eight by eight by five: we take eight by eight in the frame and five consecutive frames, and we look at which of those have high energy, basically enough high coefficients above a certain threshold that we define. We pick only those patches, and with those patches we learn a dictionary for that class. So this is the dictionary that we learn. It's of course a dictionary in time, because we have used patches that are eight by eight in space and five in time. This dictionary is for this class.

Now, we take a different class, basically videos of a different activity, and we do basically the same. We take the video, take frame differences, take all the patches that have sufficient energy, and learn a dictionary for that class as well.
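Here is a minimal sketch of this training step, assuming each video is a NumPy array of shape (frames, height, width). The patch size follows the lecture's eight by eight by five; the energy threshold, the number of atoms, and the use of scikit-learn's MiniBatchDictionaryLearning are illustrative assumptions, not the lecture's exact setup.

```python
# A minimal sketch of the per-class dictionary training step. The patch size
# matches the lecture (8x8x5); the threshold and atom count are assumptions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def active_patches(video, ph=8, pw=8, pt=5, energy_threshold=1.0):
    """Extract 8x8x5 frame-difference patches whose energy exceeds a threshold."""
    diff = np.diff(video.astype(np.float64), axis=0)   # frame t minus frame t-1
    T, H, W = diff.shape
    patches = []
    for t in range(0, T - pt + 1, pt):
        for i in range(0, H - ph + 1, ph):
            for j in range(0, W - pw + 1, pw):
                p = diff[t:t + pt, i:i + ph, j:j + pw].ravel()
                if np.sum(p ** 2) > energy_threshold:  # keep only patches with activity
                    patches.append(p)
    return np.array(patches)

def learn_class_dictionary(videos, n_atoms=128):
    """Learn one dictionary from all the active patches of one activity class."""
    X = np.vstack([active_patches(v) for v in videos])
    model = MiniBatchDictionaryLearning(n_components=n_atoms)
    model.fit(X)
    return model.components_   # n_atoms x 320 matrix (320 = 8*8*5), one atom per row
```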
And in this way we now have a dictionary for class 2 and another dictionary for class 1. These dictionaries are adapted to their classes, and we continue doing that for every single class that we have in our training database. We are going to have another class of a different activity, and another dictionary for that activity, and we do that for as many classes as we have.

What we end up with at the end is basically a large dictionary that is the concatenation of all the previous dictionaries. We put every dictionary one after the other, block by block, and then we have one large dictionary. Now, we actually know that every block belongs to, or represents, a different activity, and the training has been supervised because every class came with labels and we created one dictionary per class. We're going to talk about unsupervised in a little while.

Now, when a new video comes in, we want to classify it. That's actually the goal of this learning. How do we do the classification? Almost identically to the way we did the training, but now we have the dictionary ready. A new video comes in, for example any of these types of videos, and we do the feature extraction; in this case, very, very simple, just frame differences. Then we go again to every three-dimensional patch, in space and time, and we encode every single patch that has sufficient energy, as we did in training. We encode with this dictionary. But now we remember that every block in the dictionary represents something different. So we encode with that dictionary, and then we look at which regions of the dictionary were actually active. Let's say that when I encode this video, most of the energy of the code that I produce is in the second block. That means the video is from the second class. If most of the energy is in the seventh block, then it is from the seventh class. So we look, for example, at the ℓ1 energy of the coefficients in every single block, and we pick the block that has the maximal energy. There are many ways to pick the right block, but the basic idea is that, although we're encoding with a large dictionary, this dictionary is the concatenation of multiple dictionaries, and we observe which one of the blocks of this large dictionary is the most active; that basically tells me the class of that video.

Very, very simple. We are doing dictionary learning at the training stage, one dictionary for each one of the classes, and then we're doing just sparse encoding, and we say: whichever dictionary you liked the most, that's your class. A very, very simple algorithm that achieves some of the best results that have been published for this type of activity recognition challenge.

Let's see some examples. The examples I'm going to show you are from a database of videos collected from YouTube. Every area of classification has standard databases that the community uses to test their algorithms, and this is one of the most popular ones. It has a lot of variability, just videos downloaded from YouTube. So here are some examples, for instance two different videos of a person playing with a ball, just playing soccer, and of course we want to be able to say these belong to the same class. Some of the videos are used for training, some are used for testing.
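A minimal sketch of this classification step follows, reusing active_patches from the sketch above. The use of scikit-learn's SparseCoder with OMP, the choice of five nonzero coefficients, and the ℓ1 block-energy rule are assumptions for illustration.

```python
# A minimal sketch of the classification step, reusing active_patches from the
# previous sketch. OMP with five nonzero coefficients and the l1 block-energy
# rule are illustrative assumptions, not necessarily the lecture's exact setup.
import numpy as np
from sklearn.decomposition import SparseCoder

def classify(video, class_dictionaries):
    """class_dictionaries: list of per-class (n_atoms x patch_dim) arrays."""
    D = np.vstack(class_dictionaries)               # concatenated block dictionary
    coder = SparseCoder(dictionary=D, transform_algorithm='omp',
                        transform_n_nonzero_coefs=5)
    codes = coder.transform(active_patches(video))  # one sparse code per patch
    # Sum the l1 energy of the coefficients that fall in each class's block.
    energies, start = [], 0
    for Dc in class_dictionaries:
        stop = start + Dc.shape[0]
        energies.append(np.abs(codes[:, start:stop]).sum())
        start = stop
    return int(np.argmax(energies))                 # most active block wins
```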
We also have people jumping, a person just keeps jumping here, and we call these basically the same class: although the videos are completely different, we know that it is basically the same activity. We also have bike riding, examples of horse riding, and examples of diving in a swimming pool. So, different types of activities. We train dictionaries, one dictionary per activity, and then we get a new video and we ask, "Which dictionary do you like the most?" The one you like the most, that's your class.

Let me show you some results. The way I'm going to present the results is with what's normally called a confusion matrix. Here we have all the different activities, and here we have the activities in the same order. In every row we basically see the percentage, or the probability, of doing a correct classification, and what you want is to be as close as possible to one on the diagonal. So 91 here means that 91 percent of the basketball videos were correctly classified. Now, about 2 percent of the basketball videos were classified as class 2, and class 2 is biking. And here, basically 1 percent were classified as class eight; you count the columns, one, two, three, four, five, six, seven, eight, and look at class eight. So that's a confusion matrix. It tells you how often you classified correctly, but it also tells you what type of mistakes you made. As you can see, most of the numbers are very, very close to one. These are actually some of the best published results, with this very, very simple technique: we learn a dictionary per class, then we do sparse coding for the new videos, and we get the result. And the results are really, really good in spite of the variability: videos taken with different cameras, different resolutions, different qualities, but with common activities, and we can actually detect them.

Now, this was supervised. What happens when we don't have supervised data to do the learning? Let's talk a bit about that. Let's just observe this video first, and let us concentrate on what's happening on the right, which is the ground truth. We're going to see two people here that are running and then jumping. Let's just look at that; I'm going to run this video a number of times while I explain what's happening. So the people are running and then they're jumping, and if you watch, there was a box that was green and now it became yellow. Let's run that again. Green means one activity; yellow means a different activity. So this is the ground truth, basically marked by hand.

So how can you use sparse modeling to detect this change in activity? Very simply, and very similarly to what we have just done. We start by taking, let's say, a time period of one second, and we learn a dictionary for it in the same way that we have just done for the supervised case. Then we take the next second and try to encode this new one second of video using the dictionary that we just learned. If we can do a good encoding, meaning I encode it and my encoding vector is very sparse, that means that that dictionary is a very good dictionary for this new time period of one second, and I probably have not changed activity. Now, I keep going, and then I encounter one second where I try to encode with the previous dictionary and get a very bad encoding: either I cannot reconstruct the video, or I need to use many, many atoms in the dictionary, meaning my encoding vector is very non-sparse. That means that the dictionary, which was trained to encode a video with a given activity in a very sparse fashion, is not working for this segment of the video, so the activity has changed. What you can do then is, first of all, call this a new activity, and then actually train a new dictionary for it. Now you have one dictionary per activity that you have seen so far, and you can keep going.
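A minimal sketch of this unsupervised change detector follows, assuming each one-second window has already been turned into a matrix of active frame-difference patches (for example with active_patches above). Flagging a change on relative reconstruction error alone, and the thresholds below, are assumptions, since the lecture mentions both poor reconstruction and loss of sparsity as cues.

```python
# A minimal sketch of the unsupervised change detector. Each window is a
# matrix of active patches from one second of video (e.g., from active_patches
# above). Using relative reconstruction error as the "bad encoding" test, and
# the thresholds below, are illustrative assumptions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, SparseCoder

def detect_activity_changes(windows, n_atoms=64, err_tol=0.1):
    """Return the indices of one-second windows where the activity changes."""
    changes = []
    D = MiniBatchDictionaryLearning(n_components=n_atoms).fit(windows[0]).components_
    for t, X in enumerate(windows[1:], start=1):
        coder = SparseCoder(dictionary=D, transform_algorithm='omp',
                            transform_n_nonzero_coefs=5)
        codes = coder.transform(X)
        err = np.linalg.norm(X - codes @ D) / np.linalg.norm(X)
        if err > err_tol:        # the old dictionary no longer encodes this window well
            changes.append(t)    # declare a new activity...
            D = MiniBatchDictionaryLearning(n_components=n_atoms).fit(X).components_
            # ...and learn a fresh dictionary for the new activity
    return changes
```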
And that's basically what we are doing here. On the left is the result; the ground truth was on the right. Again, this is unsupervised; the marking is there only so we can verify whether we are correct or not. Let's see the performance of sparse modeling, and as you're going to see, the left is basically identical to the right. We were automatically able to detect the change in activity by observing that the dictionary that we learned in the past is not good anymore for the new activity, and therefore we have probably changed activity, and we can detect this change in a completely unsupervised fashion.

You can take multiple people and do the same exercise. You can ask, "Are these two persons, these two groups of people, doing the same activity?" You learn a dictionary for one and a dictionary for the other, and you see if you can interchange the dictionaries. If you can, the dictionaries are good for both activities, which means both activities are probably the same. If you cannot interchange the dictionaries, the dictionaries don't work for both activities, and those activities are different.

So this is one way of using sparse modeling for classification. Very simple: learn dictionaries per class, then use them to encode the new data and see how good that dictionary is, and according to that you can do the classification. So, a very powerful and very simple technique for image and video classification. Thank you very much. Looking forward to seeing you in one of the future videos. Thank you.