Hello and welcome back. In this video I want to talk about the implementation of sparse modeling; in other words, how do we compute the alpha that we have seen in the previous videos? We are also going to discuss some theoretical results; I'm just going to mention their existence and why they are so important. Let's start with that. Assume that we have just produced a signal x as this product, basically as a sparse model. So assume that we know the dictionary D, we randomly pick an alpha with a few non-zero coefficients, and we generate x. Now our goal is to find back this alpha: we know the dictionary, we know the signal, and we want to compute its sparse representation. That's our goal right now. As we said, later on we are going to discuss how to compute the dictionary itself. As we have seen in the previous video, we want to compute the alpha that has the smallest possible number of non-zero coefficients; the set of non-zero locations is also called the support. And we assume that there is no noise right now. We have just generated this signal, so it's kind of the ideal scenario. That's our goal right now. And there are a number of questions. One of the questions is: with whatever algorithm we produce here, which we are going to explain in the next few slides, why would we expect to get exactly the same alpha that produced the signal? In other words, is that alpha unique? Can you think of a scenario where we know that alpha does not necessarily have to be unique? And can you think of a scenario where we know that alpha will be unique? Let's think for a second about that problem. One example where alpha is not unique is when several of the columns of D are identical. Let's assume that the signal was generated by one of those columns. I can actually pick any other identical column and replace that alpha by the alpha corresponding to the other column. Assume that columns one and two are identical.
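The generation step just described, picking an alpha with a few non-zero coefficients and forming x as the product with a known dictionary D, can be sketched in a few lines of numpy. The dimensions, the seed, and the sparsity level below are illustrative choices of mine, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, s = 20, 50, 3              # signal dimension, number of atoms, sparsity level
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)   # normalize each atom (column) to unit length

alpha = np.zeros(k)
support = rng.choice(k, size=s, replace=False)  # pick s random locations
alpha[support] = rng.standard_normal(s)         # a few non-zero coefficients

x = D @ alpha                    # the generated signal: a sparse combination of atoms
```

Given x and D, the problem of the rest of the video is to recover this alpha.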
I assume that, basically, I generated this signal with exactly this pattern, so with the second column. But when I go to reconstruct, I can pick column one instead of column two, and therefore I get a different alpha. That's an example where the solution is not unique. An example where the solution is unique is the DCT, the Discrete Cosine Transform, or the Fourier transform. We know that the transform of a signal in Fourier or in the Discrete Cosine Transform is unique. So those are the two cases; now, what was the difference? In one scenario, D had repeated columns. In the other scenario, we know that in Fourier and in the cosine transform every column is independent of the rest; each one is orthogonal to all the others. So it looks like the structure of the dictionary D has a lot to do with the uniqueness of alpha. Here comes the second question. In the previous case, at least the size of the support, meaning the number of non-zero coefficients, was the same. Could I get maybe an even sparser representation than the one that produced the signal? Once again, it's not hard to think of an example of that. Assume that you generated a signal with these three atoms, and this is x. Now assume that by chance that x itself was one of the columns of the dictionary D. Then when you go to recover it, your sparsity drops to one. You went from three to one. What's the problem there, once again? One of the columns is basically a combination of the others, so if you produce a signal that is such a combination, you might be able to get it back with fewer coefficients. So these two questions depend a lot on a number of things: the dictionary, meaning its size k, and also its general structure.
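The duplicated-column example can be made concrete. Here is a small sketch, with a made-up dictionary and coefficient values, showing two different sparse alphas that generate exactly the same signal:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((5, 6))
D[:, 1] = D[:, 0]                # make columns 0 and 1 identical

a1 = np.zeros(6); a1[0] = 2.0    # generate x using column 0 ...
a2 = np.zeros(6); a2[1] = 2.0    # ... or, equally well, column 1
x = D @ a1

# Two different one-sparse representations of exactly the same signal
assert np.allclose(D @ a1, D @ a2)
```

Both alphas have support size one, yet they differ, so no algorithm could tell which one generated x.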
And the basic idea, as we have seen from this example, is that we want the dictionary D to have atoms, columns, that are as uncorrelated as possible, to guarantee that we have unique solutions and that we get back the sparsest alpha that actually produced the signal. So, depending on the sparsity level, meaning the number of non-zero coefficients, and on some conditions on the dictionary, there is beautiful theoretical research out there that guarantees uniqueness, and that guarantees that we get the smallest possible support when we run the type of reconstruction algorithms that I'm going to show next. It's important for us to know the existence of these very important theoretical results in the background, results that support the techniques we are developing and the use and exploitation of sparse modeling. Our goal now, assuming those theoretical results, is actually to solve sparse modeling. Now we are given a signal y; remember, this is not the ideal case anymore. X is going to be represented by this product; this is our x. We want to find the representation of x, given the dictionary D, such that we get the smallest possible support, meaning the smallest possible number of non-zero coefficients, and we don't get too far away from the observation. That's our goal right now. This is what's called a combinatorial problem, and it's NP-hard: you cannot solve this problem, as I just posed it, in realistic time. Let's just explain why. We are going to do the following, which is the only way to solve this problem exactly. We start with L equal to one, so we look for a representation that has sparsity one. That's what we are going to do first.
So we generate all the possible supports of that size. Because L equals one, the support means only the first atom is active, or only the second, or only the third, and so on. How many possibilities do we have? We have already seen that: L chosen out of k. In the case of L equal to one we have k possibilities. When we iterate this, because then we do L equal to two, we have on the order of k squared possibilities, and as L increases we get more and more possibilities. But for now, we have all the possible supports of size one. For each one of those supports we go and solve the fitting problem. So we say, okay, what happens if I only use the first atom? All that is left is to pick the scale: how much of that first atom will I use? Then I solve this problem. This is a least squares problem, a very simple quadratic problem. There are many ways to solve it, but the basic idea is that we project the signal onto the atom we have just selected. We try this for each atom, and we check the resulting error. If the error is less than what we want, then we're done: we pick the alpha that produced that small error. If not, we increase the support. We have just tried with L equal to one; now let's try with L equal to two. So we pick all the possible supports that have two active, two non-zero coefficients. We say, okay, how can I approximate the signal y with the first and second atoms together, or with the first and third atoms together, or with the seventh and the twenty-fifth atoms together? We pick every possible pair.
Again, there are two chosen out of k possibilities, which is on the order of k squared, and we try all of them. For each one we solve this problem, again just a least squares problem, a projection onto the subspace spanned by the selected atoms, and we keep going and going. Thus we solve the problem exactly. So what's the problem with this technique? The problem is the following. Assume that k = 1000; that's a standard size, about the size we are going to use when we talk about image denoising. And assume that I tell you there have to be ten atoms, so it's ten chosen out of 1000. That starts being on the order of 1000 to the tenth, something like that; a very, very large number. Assume that each one of the computations we have here takes about one nanosecond: you try one possible support of size ten and solve its least squares problem in one nanosecond. These are all reasonable numbers. What you get is that you are going to need a long, long time to solve this problem. Look at this number; it's huge. That's how long you would need even if I already told you that L is equal to ten. Imagine what would happen if you had to run over all possible L's. It's impossible. That's why the problem is very hard; it's impossible to solve as it is. So, are we done with this week? Did we just present a great model that we cannot solve? No, of course we are not done. Now we have to find a way to solve this. How do we do that? There are actually a couple of fundamental ways to solve this problem, meaning to get a good fit with as sparse a representation as possible. One way is what are called relaxation methods: can we change this problem for something that is solvable, something that we can actually run on our computer?
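The exhaustive procedure just described, and why it explodes, can be sketched as follows. This is a toy implementation meant only for tiny dictionaries; the function name, tolerance, and dimensions are my own illustrative choices:

```python
import itertools
import math
import numpy as np

def exhaustive_sparse_fit(D, y, L, eps):
    """Try every support of size 1..L; least-squares fit on each support;
    stop when the residual drops below eps. Exact, but combinatorially expensive."""
    k = D.shape[1]
    for size in range(1, L + 1):
        for support in itertools.combinations(range(k), size):
            sub = D[:, list(support)]
            coef, *_ = np.linalg.lstsq(sub, y, rcond=None)  # projection onto the span
            if np.linalg.norm(y - sub @ coef) < eps:
                alpha = np.zeros(k)
                alpha[list(support)] = coef
                return alpha
    return None                    # no support of size <= L fits well enough

# Why this cannot scale: ten atoms out of a 1000-atom dictionary
print(math.comb(1000, 10))         # roughly 2.6e23 supports to test
```

At one nanosecond per support, those 10-out-of-1000 supports alone take on the order of 10^14 seconds, millions of years, which is the point of the lecture's back-of-the-envelope calculation.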
Meaning, can we relax our condition in such a way that it's almost the same problem, but one we can solve? The other way is to use what are called greedy algorithms, which means: forget about finding all the atoms at the same time; give me the most important atom, then the second, then the third, and so on. Let us briefly present both of these techniques. Relaxation methods are also sometimes called basis pursuit methods. This is our problem, this is what we want to solve, and relaxation methods will change it. Look what I'm doing: I'm replacing the zero norm, or pseudo-norm, by the one norm. Now I'm not just counting the number of non-zero coefficients; I'm weighting them by their magnitude. Remember, in the L0 norm we had a penalty that was flat: no penalty when a coefficient is zero, and the same penalty whenever it is non-zero, regardless of its value. Here, in the L1 norm, the penalty increases with the magnitude. Now comes a bit of a surprise. The beauty of this is that under certain conditions on D, and certain conditions on L, the level of sparsity, we can guarantee that these two problems are equivalent. This is a fantastic result. It says that I can move from an NP-hard, unsolvable problem to an L1 problem. This is a convex problem; it's solvable. And sometimes I obtain exactly the solution I was looking for, but in a reasonable time. There is a very beautiful theory behind this. It's not the only case, but it's a powerful case where we don't actually lose anything by doing relaxation. This problem has been studied quite a lot. As I said, it is called basis pursuit; it is convex, so we can solve it. There are a lot of algorithms in the literature to solve it, and the packages I mentioned, which are free for you to download and to play with sparse modeling, include very efficient implementations of these.
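One standard way to solve the relaxed L1 problem numerically is iterative soft-thresholding (ISTA). This is just one of the many algorithms the lecture alludes to, sketched minimally in numpy; the penalty weight and iteration count in any real use would need tuning:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    """Minimize 0.5 * ||D a - y||^2 + lam * ||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)           # gradient of the quadratic data-fit term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

Each iteration is a gradient step on the fitting term followed by the shrinkage that the L1 penalty induces, which is what makes the solution sparse.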
And this has actually been the source of a lot of research in the community, because now that we know that this problem is convex, and that under certain conditions it solves the original problem, let's try to solve it in the best possible way. This opened the door to a lot of excellent research in the area, and I'm just mentioning a few works here; the literature is huge, and you might actually know about it from your own research area. Now the second approach. So this was the relaxation way, and as I said, it is extremely powerful, so you can choose to use it. The second one is the greedy approach, often called matching pursuit, and the idea is very, very simple. Again, remember we have the signal that we are trying to approximate; that would be y. We have our dictionary D, and we are trying to find the sparsest possible way to represent this signal. Matching pursuit is a standard technique in the statistics community, and it was brought into signal processing about twenty years ago or so. The basic idea is: let's first find the most important atom in the dictionary. So you go and traverse the dictionary; let's observe what's happening here. You traverse the dictionary and you find the most important atom. What do I mean by the most important one? The one that minimizes the squared error, the norm of D alpha minus y squared, when the support is restricted to size one. So you try the first atom, you try the second, the third, the fourth, and you keep the one that does this the best. Actually, because of the metric we're using, this will be the atom that maximizes the inner product with the signal y in absolute value; basically, the one that is most parallel to y. That's going to be this one. Now I keep it; I cannot give it back.
In contrast with what we saw in the previous slide, where we tried L equal to one, then L equal to two, then L equal to three, here I'm forced to keep what I selected; that's why the technique is called greedy. Once you select an atom, you keep it. So now I've selected this atom and I keep it, and there is still an error. If the error is below what I wanted, I stop; if not, I continue. The next step is to pick the atom that makes that error as small as possible. For the first atom, we were trying to approximate y itself; for the second one, we are trying to approximate the error. When we approximate the error, we add that atom to our collection, and the error starts to shrink. So the next time, you keep this atom too, and you go to the one that best fits the error so far. Again, you try all of them and keep the one that minimizes the error; then you try all of them again and keep the next one that minimizes the error. Please note: I kept the first one, and now I'm keeping the second one. So I kept the first and I kept the second, and now I have a new error, because I'm using two atoms. I have a new error, and I'm going to pick the third atom that best approximates that error, and so on. Every time, all you are computing is inner products to approximate those errors, and now the algorithm is doable: at every step you try each of the k atoms, picking one at a time, instead of picking all L at a time and interchanging, putting back, and trying new combinations. The atom you pick, you stay with. This algorithm is very efficient; it's just a bunch of inner products all the time. And you stop when you get to a reasonable error, an error that satisfies your requirements. That's what I wrote here: you stop when your error criterion is satisfied. There are some variations of this.
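The plain matching pursuit loop described above can be written down directly. A minimal sketch, assuming the columns of D have been normalized to unit length:

```python
import numpy as np

def matching_pursuit(D, y, n_atoms, tol=1e-6):
    """Greedy MP: repeatedly pick the atom most correlated with the residual.
    Columns of D are assumed to have unit norm."""
    residual = y.copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual               # inner product with every atom
        j = int(np.argmax(np.abs(corr)))    # the atom most parallel to the error
        alpha[j] += corr[j]                 # keep it; accumulate its coefficient
        residual -= corr[j] * D[:, j]       # shrink the error
        if np.linalg.norm(residual) < tol:  # stop when the error satisfies us
            break
    return alpha, residual
```

Each pass costs one matrix-vector product, the "bunch of inner products" of the lecture, instead of a combinatorial search.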
The most important variation is that every time you pick a new atom, because it's the most important one, you take your signal y again and project it onto all the selected atoms. In other words, when I picked the first atom, I got a certain coefficient alpha here. When I picked the second atom, I got another coefficient. Now that these two are both selected, maybe I can take the signal and change the actual value here and the actual value here to get a smaller error. If I'm keeping the atom anyway, why stay with the coefficient I got at the moment I selected it? I'll change it. So every time, you do a projection onto the subspace of all the selected atoms. Again, that's a least squares procedure, very fast to implement. So, in orthogonal matching pursuit, you select the best atom so far, you project onto all the selected atoms, and that gives you the error so far. In the next step you select, once again, the best atom to reduce the error you are getting so far, you project onto all the selected atoms, and you keep going. It's a very simple algorithm. Again, it's implemented in some of the software we give you the links for. So, the basic idea is that we now have techniques to solve these problems, techniques of several classes. There are the greedy algorithms I just described, where you go one atom at a time. There are many, many variations of these; some people say, why don't we pick two at a time, if I have sufficient computation time to do that, and things like that; and there is the orthogonal matching pursuit that I mentioned. Then there are the relaxation algorithms; the literature is huge on that topic as well, and the basic idea was to relax the problem into something that hopefully gives us the same result we are looking for. And there are hybrid algorithms that do both at the same time, in some very clever combinations. Now, why should these work?
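The orthogonal variant differs from plain matching pursuit only in the projection step: after each new selection, all coefficients on the current support are re-fit by least squares against the original signal. A minimal sketch:

```python
import numpy as np

def omp(D, y, n_atoms, tol=1e-6):
    """Orthogonal matching pursuit: after each new atom, re-fit every selected
    coefficient by least squares (project y onto the selected subspace)."""
    residual = y.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best atom for the current error
        support.append(j)
        # Re-optimize all selected coefficients, not just the newest one
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    alpha[support] = coef
    return alpha
```

The re-projection is exactly the "least squares procedure" of the lecture: since the atoms are kept anyway, their coefficients are updated to give the best possible fit at every step.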
Because there is theory behind them that tells us that sometimes the relaxation, although it's a relaxation, actually produces the right result; there are theorems that give us conditions for that. And there is theory that gives us conditions for the greedy approach: although it's greedy, sometimes it produces the right result. For example, if the matrix D is a Fourier matrix, you get the right result there. So there are ways of handling this L0 type of problem, which cannot be solved exactly, and there is a beautiful theory telling us when the relaxation actually gives us the result we were looking for. If our problem, for example the matrix D, doesn't satisfy those conditions, in practice we still use one of these two classes of algorithms. Why? Because solving the real problem is not doable; it's impossible. So the best we can do is either a relaxation or a greedy approach. If we are in the right conditions, we get the exact solution we were looking for; if not, hopefully we get a very good approximation of it. So where are we so far? We started with image denoising and the need for modeling. We derived sparse modeling as one way of modeling images that will help us in denoising. Then we discussed some of the problems, and we saw how to compute the alpha that gives us the representation. The next step is the dictionary, and that's what we are going to do in the next video. Thank you very much. I'm looking forward to that.