Hello and welcome back. In this video we're going to learn about Roto Brush. Roto Brush is the video segmentation tool inside After Effects, the video processing product from Adobe. Roto Brush was released around April 2010 with a new version of After Effects, and it was considered one of the highlights of that release. So I think it's very cool to learn about the basic concepts behind Roto Brush. Let's dive right in.

When we are doing video segmentation, and in particular video segmentation for products that are going to be used, for example, in advertising or the movie industry, some very basic characteristics need to hold. We need very high accuracy: we need to be able to segment at the pixel, or even sub-pixel, level. As we saw, these are very challenging problems, so the user is going to have to be in the loop, and we want to make sure that the way the user is involved in the process is comfortable for the user. And if the user is involved, this has to happen basically in real time. You cannot ask the user to do something, go home, come back a few hours later, and see the results. The user needs to interact with the computer in real time. Those are basic characteristics that any system of this type needs to have, and Roto Brush certainly has them. And we need high quality: we need to be able to segment the videos with very, very high quality, down to the sub-pixel level. This is a video that we have seen before.

Now, why is this so challenging? For a number of reasons. Some of them we have seen, and some of them I want to go back to. These are some of the challenges that we encounter. We encounter them in video, and we also encountered them in still images, and that's why I'm basically using stills here as well to describe them. What is an overlapping color distribution? What do I mean by that? We saw in the previous slide that I want to segment out these two players. But look: this player has a white shirt, and people in the audience also have white shirts. This player has a blue shirt, and there's also blue here, here, and here. So basically there is a lot of overlap in the colors. We discussed early on that when the colors overlap like this, color alone is not going to be sufficient to do an accurate segmentation. Here is another example, where edges are not going to be sufficient: there are basically no boundaries here. So how do we know how to segment out the car if I don't really see where the car ends? An additional challenge that appears, in particular in video, is what are called topological changes. Look how the arm here is separated from the body; in one of the next frames it's actually touching the body, and a hole has been created. So the actual shape, the global shape of the object of interest, is changing. And once again, it's changing here: there is a new hole. So there's a lot of variability because of the articulation, and that makes the problem really, really challenging.

So, what are the features that we need to have in Roto Brush, and in the segmentation behind Roto Brush, but also in many, many video segmentation approaches of this type? Accuracy, which we already talked about. Robustness. Robustness also refers to the fact that not only can the image be changing,
but we actually might have a very, very rich variety of data, and we want the algorithm to work as well as possible for all of it. It might work better for some of the data, meaning the user needs to be less involved, and for some of it the user needs to be more involved, but we want this to be a really good algorithm across the board. It has to be practical, and this is related to the user intervention: we want to make sure that when the user gets involved, things get better because of that involvement, and don't get worse. We don't want the user to say, "oh, I'm happy with everything but this," and then go and do something that destroys the part that he or she was already happy with. So it has to be a natural workflow, very easy, and, as we said, it has to be very computationally efficient. You know, we have about 30 frames per second in the American NTSC system, and similarly, about 24 to 30 frames per second across the NTSC and PAL systems, and we don't want every frame to take hours; we want every frame to be done in an interactive, comfortable time.

So, the algorithm I'm going to describe has four major steps, which we see here. First of all, we're going to do segmentation from one frame to the next frame. Then we're going to propagate multiple frames ahead. Then we're going to show local correction. And I'm going to briefly mention post-processing, which comes from the fact that we are segmenting videos, not just still images. And we're going to see many, many very cool examples.

So, let's start with the localized classifiers. Let me explain what I mean by localized classifiers. We're going to assume that we start from the segmentation of a given frame. This segmentation was obtained, for example, manually (remember, we are talking about segmenting entire videos, so certainly we can ask the user once in a while to manually do a segmentation), but we know that we don't actually need that, because we have learned many ways to do this type of still-image segmentation. So we start from one segmentation, and we want to propagate this segmentation to the next frame. That's going to be our first challenge here.

Now, we are going to have what we call local classifiers. We're going to put windows centered around the segmentation, around the boundary that we have just computed, windows that basically go all around it. So every pixel has a window, or maybe we place one every few pixels going around the curve, with some overlap, as we have illustrated here. From every one of these windows we get very important information. Once again, this is just a local window. It is here because we are in a frame for which we're given the segmentation: we're in frame T, we have the boundary, and we have the inside and the outside of the object. Because we have that, we have two very important characteristics in that window. We have a shape prior: we believe that the next frame will have a similar shape; remember, frames are very, very close in time, so in a normal video things cannot change a lot between frames. We also have a color model, so we know what type of colors to expect in the foreground and what type of colors to expect in the background. That's for every one of these windows. So let's see how we use that. We are now in frame T, and we want to be able to segment out frame T+1.
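Before moving on, here is a minimal sketch of that window construction in Python, using NumPy and OpenCV. The window size and the spacing are illustrative choices of mine, not Roto Brush's actual parameters.

```python
import numpy as np
import cv2

def boundary_windows(mask, spacing=10):
    """Collect window centers every `spacing` pixels along the contour
    of a binary mask (1 = foreground), so neighboring windows overlap."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    centers = []
    for contour in contours:
        points = contour.reshape(-1, 2)           # (x, y) samples along the curve
        for i in range(0, len(points), spacing):  # every few pixels on the curve
            centers.append(tuple(points[i]))
    return centers

def crop_window(image, center, win_size=31):
    """Cut one local window around a center, clipped at the image border."""
    x, y = center
    h = win_size // 2
    return image[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
```

Each of these windows then carries its own local shape prior (the piece of the mask it sees) and its own local color model.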
This frame that we have is, as I said, either one of the frames we marked by hand, or one of the previous frames, as we're going to see in the next slide when we progress. And we have these windows all around, and I want now to segment out the player here. The first thing we do is move these windows to the new frame. Now, we can just move them with no shift at all, and that wouldn't be too bad, or we can move them following what's called motion estimation. Do you remember, when we talked about video compression, that we take a block and look for the most similar block in the previous frame? That's what we are doing now. So, for example, we can go to every region, either backwards or forwards, and see what's the closest block, what's the closest local window. We can start, for example, with this local window, look around the region here, and then move it to the place that we believe is the closest match. We can use the mean squared error or the absolute difference, the same type of approach that we used for prediction when we were talking about video compression; a small sketch of this matching step appears below. So for every window we have a new position. Again, if you believe that there is not too much motion in the image, you can just copy the windows to the same positions. So we either copy them or we use this motion estimation to move them. Now we have the windows in the new frame that we are about to segment.

Now remember, I'm going to pick just two windows: a window in the current frame and one in the future frame. This is the segmentation that we know; this is where we need to put it. We're transferring the red curve to the blue one, because we have moved the block here, but we don't believe in the blue curve yet. That's the one that we need to adjust. Remember that for this window, which we call the training window, because it's something that we already have, we have a shape prior and we also have color distributions. So we say, more or less, this is what we believe the curve should be here, and these are the distributions of colors in the foreground and in the background. Now, why do I write GMM here? This stands for Gaussian mixture models. That's one way to approximate the distributions of foreground and background. Just think once again about the histogram distributions. Instead of keeping the histograms, we approximate them by a parametric function, and that's why we are using Gaussian mixture models in Roto Brush. But it's not critical now. Just remember, you have a distribution of colors for the foreground and a distribution of colors for the background; for simplicity, just think about histograms.

Now, that's what we're going to try to combine, to basically find the exact position of the blue curve. So once again, the red moved to the blue, but we want to be able to shift it around, and this is going to provide us some information about where it should be. I'm not going to allow it to change completely, because I'm not expecting a complete shape deformation from frame to frame. I'm also not expecting complete changes in the colors, in the pixel values of the image. So this provides me a very good prior: about the colors, for every pixel inside here, I can ask, are you more probably foreground or background? So we can get a picture like this, and the idea is to merge a shape prior and a color prior. How to merge them is what we are going to explain next.
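Going back to the tracking step for a moment, here is a minimal sketch of that block matching: an exhaustive SSD search, illustrative rather than the production implementation, which assumes the window lies inside the frame.

```python
import numpy as np

def match_window(prev_frame, next_frame, center, win_size=31, search=15):
    """Track one window from frame t to frame t+1 by exhaustive search.

    Every candidate shift in [-search, search]^2 is scored with the sum
    of squared differences (SSD), the same criterion used for motion
    prediction in video compression; the best-scoring shift wins.
    """
    x, y = center
    h = win_size // 2
    template = prev_frame[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64)
    best_shift, best_score = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            candidate = next_frame[y + dy - h:y + dy + h + 1,
                                   x + dx - h:x + dx + h + 1].astype(np.float64)
            if candidate.shape != template.shape:  # candidate fell off the image
                continue
            score = np.sum((candidate - template) ** 2)
            if score < best_score:
                best_score, best_shift = score, (dx, dy)
    return (x + best_shift[0], y + best_shift[1])
```

Replacing the SSD line with a sum of absolute differences gives the other criterion mentioned above. With the windows tracked, we can turn to the merging.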
The important part here is that, from what we already know, we can get for every block a probability coming from the shape and a probability coming from the colors, merge them, and then get the probability of every pixel being either foreground or background. That happens for every single one of these blocks that have moved from the previous frame, and therefore, when we put it all back together, we get that for the entire image. From that, for example using graph cuts, we can get a segmentation, and from this frame we will be able to go to the next frame.

Before that, I have to explain to you how we do this merging. For the merging, we have to remember that we have both colors and shape, as indicated here. So for every block, color and shape: how do we combine them? The basic idea is what's written here: if the colors are well separated, trust them; if not, trust the shape; if you're in between, trust both and merge them. So, we have the color distributions here, from the previous frame. We have information from the previous frame, a segmentation that we trust, and from it we got the shape prior and the color priors. If the color distributions are well separated (so you look at the histograms, for example, of every color, or at a joint distribution of all of them, and you see that they are very well separated; there are multiple ways to measure that), then trust the colors, trust that probability, and use the shape prior in a very limited fashion. So the shape prior was somewhere here, that was the curve, and you say: I'm going to let you be anyplace within a relatively large distance from what the shape prior predicted. You basically can go to any pixel within a certain distance of the prior; it's a wide allowed region around the prior. You get these two types of probabilities, you merge them, and you get your confidence map, the probability map that we saw in the previous slide. That's if you have this confidence in the colors. If you don't trust them too much, basically because the distributions are more overlapping, then what you do is take that band around the prior shape and make it a bit narrower. You say, basically, I don't want to let the boundary in this particular block go very far from what I predicted; it has to be more or less where I predicted it. And the less you trust the colors, because of overlap, the narrower you make those bands.

So these are three examples from the video, and these are real examples. This is the back of the player, where the colors are very well separated. This is another region, where there is no such clear separation. And look here: even the image tells us the separation between foreground and background is not very good. For all the cases we get, as we saw in the previous slide, both color and shape. The difference is, once we're given the shape prior (something like this will be here, here it will be something like this, and here it will be this), how much we let that shape prior impose the actual position of the new curve. If we trust the colors a lot, we basically expand that shape prior a lot, and we say: be anywhere you want, because I know the colors are going to help me. If we trust the colors less, we make the band a bit narrower. And when the color distributions basically don't provide any information, the shape prior comes to the rescue and says: I'm sorry, you cannot change too much; it's a very, very narrow band. And this always comes from the prior; it always comes from the colors of the previous frame.
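Here is a minimal sketch of one way to implement this combination; it is my own simplification of the logic just described, not the actual Roto Brush code. It uses scikit-learn's GaussianMixture for the local foreground/background color models and a distance-based band around the predicted contour.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(window, local_mask, n_components=3):
    """Fit one GMM to the foreground pixels and one to the background
    pixels of a local window taken from the frame we already trust."""
    pixels = window.reshape(-1, 3).astype(np.float64)
    labels = local_mask.reshape(-1) > 0
    fg = GaussianMixture(n_components=n_components).fit(pixels[labels])
    bg = GaussianMixture(n_components=n_components).fit(pixels[~labels])
    return fg, bg

def merged_fg_probability(window, fg, bg, shape_prior, dist_to_contour, sigma):
    """Blend color and shape cues into one foreground-probability map.

    shape_prior:     0/1 mask predicted from the previous frame.
    dist_to_contour: per-pixel distance to the predicted boundary.
    sigma:           band width; large when the color models are well
                     separated (color decides almost everywhere), small
                     when they overlap (stay close to the predicted shape).
    """
    pixels = window.reshape(-1, 3).astype(np.float64)
    # Per-pixel color probability from the two GMM log-likelihoods.
    diff = np.clip(bg.score_samples(pixels) - fg.score_samples(pixels), -50, 50)
    p_color = (1.0 / (1.0 + np.exp(diff))).reshape(window.shape[:2])
    # Shape weight: near zero close to the contour (let color decide there),
    # near one far from it (fall back on the predicted shape).
    w_shape = 1.0 - np.exp(-(dist_to_contour ** 2) / (sigma ** 2))
    return w_shape * shape_prior + (1.0 - w_shape) * p_color
```

Thresholding this map, or feeding it to a graph-cut solver, gives the new local boundary.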
The actual width of that band, the confidence of the prior, depends on the confidence we have here. These two probabilities, we add them, we normalize, we weight them, and we get what we saw in the previous slide. Here, for example, there is no way to use the colors, they are too similar, but the shape prior, because the previous frame was so nicely segmented, helps us to continue the segmentation in the next frame, and so on. This is a very important concept, where we merge different types of information to obtain the segmentation that we want.

Let me just illustrate how important this is. We have here one frame, segmented: we have the horse segmented out, and we want to segment the next frame. So first of all, why do this locally? Remember, having this frame, we place blocks around the boundary and we do everything locally. Because the blocks overlap, they share information, but everything is local. If we were not to do that, if we were to consider the entire frame as one block, there is so much variability in the colors that the distributions would basically be meaningless, and we wouldn't be able to segment the horse. If we use the local models, this is what we get already: much, much better, using local color models, because we only care about what is around each one of the pixels on the boundary. Now, there are still problems, as we see here. The horse is white, and the carriage is also white, so certainly the color there won't help a lot, and we see that problem appearing here, okay? But that's where the shape prior comes to the rescue; remember, the previous frame provides both colors and shape for the next frame. And here we see what we get when we combine locality with the shape and color priors: a very nice prior that, as I said before, with an algorithm like graph cuts, immediately gives the nice segmentation, and we propagate the segmentation to the next frame. So both working locally and combining shape and color are very important to get these accurate segmentations. We're going to see very soon that the locality is important beyond what I just explained; it also helps the user have a much friendlier interaction with the images.

So, this is how we actually go from one frame to the next frame, but we don't want to do just one frame, we want to do multiple frames. How do we do that? The first idea that comes to mind is: okay, segment the first frame, go to the next frame; then, from the segmentation of this frame, go to the next one, and from that one to the next, and keep going as long as you can, as far as you can. But this has one fundamental problem, and it is the following. This is a frame that we trusted, and we trusted it a lot, maybe because we did the hand annotation, maybe because we worked very hard on it. Now, this is the result of the algorithm I just described, based on that frame, and it might not be perfect. If I propagate this to the next one, then the errors I created here propagate to the next one, which means that errors in my shape priors and in my color distributions propagate; any errors in the estimations for this frame propagate. One way to address that is to say: okay, in order to stabilize the system, why don't we let the color priors of the trusted frame also influence here? Why don't we let them keep moving, also here, and also here? This is the frame that we trusted a lot, so let's just have it move forward. Remember that we are working in blocks, so locally, we let it move forward; a sketch of this idea follows.
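As a structural sketch, the propagation loop might look like the following. The two helper callables, segment_one and advect, are hypothetical placeholders for the per-frame segmentation and the window-tracking steps described above.

```python
from typing import Callable, List
import numpy as np

def propagate(frames: List[np.ndarray],
              key_mask: np.ndarray,
              key_models: object,
              segment_one: Callable,
              advect: Callable) -> List[np.ndarray]:
    """Propagate a trusted keyframe segmentation over many frames.

    segment_one(frame, shape_prior, color_models) -> new binary mask
    advect(color_models, prev_frame, frame)       -> the same local models,
        moved to their new window positions via motion estimation.

    The shape prior is refreshed from the latest result, but the color
    models always descend from the trusted keyframe, so color errors
    do not accumulate frame after frame.
    """
    masks = [key_mask]
    models = key_models                  # built once from the trusted keyframe
    for t in range(1, len(frames)):
        models = advect(models, frames[t - 1], frames[t])
        masks.append(segment_one(frames[t], masks[-1], models))
    return masks
```

The key design choice is in the loop body: the shape prior is `masks[-1]`, the most recent result, while `models` always trace back to the keyframe.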
In that way, everybody also gets some of the information that we have from the original frame, and from the original segmentation that we trusted so much. And in this way, our segmentation has propagated not just one frame, but multiple frames. So, once again: the information that we trust (and because we're working locally, it doesn't have to be trusted globally, only locally) is information we can propagate multiple frames ahead to help stabilize the system. We could actually iterate, and use the current segmentation as the initial condition, or the prior, for the next segmentation, and so on. That's an option in this type of framework. But the critical part is to use the information that we trust as many frames ahead as possible.

Before I give you some cool examples, there are a couple of additional things that we need to do. One of them is: how are we going to let the user repair mistakes? No matter how great the algorithm we develop, video segmentation is a very, very tough problem, in particular if we are going to segment two-hour movies. So we need to let the user do repairs in the algorithm. Remember, when we trust a frame, we really trust it, so we'd better make sure that all the frames, all the way, are in very good shape. So here is an example. We have this progression, and then we realize that it was good, it was good, but here there is a mistake. So how are we going to let the user repair that mistake? Here is where, once again, working on local blocks helps a lot. We basically want to go and repair it, and there are many packages that will help us do that. Let me illustrate it again: here we have the mistake, and now we repair it. We can repair it inside many different packages, Photoshop, After Effects; you can just go and move that curve.

But of course, remember there was propagation before. So the mistake that we had before propagated, propagated, propagated. Now I want the correction I did to propagate as well. I don't want it to propagate backwards, because I'm happy with what happened in the past; I don't want my correction here to influence what happened back there. So here we have everything based on local blocks, local patches, and that's great. We go and look at every block, or every patch, that is affected by my recent repair. And because we have the motion estimation that we did, we know where these blocks are going in the next frames; we have their positions in the future as well. Then we go and do repairs only in those blocks (maybe we expand a couple to the side just to make the repair more continuous), but we certainly are not going to let the propagation of that correction go all the way out here. And that's how we get this local control. So these blocks are helping us to get local control, and they are also helping us to make the interaction much faster. Look at what happens now with these repairs: we just do them, and here it is, we have repaired them. Let me show that to you again. We had a mistake, and we just repaired that mistake. We let the repair propagate to the corresponding blocks in future frames, and then we get everything repaired, and we can continue from this frame, segmenting the frames that follow. So this is very cool; this is the third step, the local correction.
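A minimal sketch of the bookkeeping behind that; the window size and the function name are illustrative choices of mine.

```python
import numpy as np

def affected_windows(edit_mask, centers, win_size=31):
    """Find the indices of the local windows touched by a user repair.

    edit_mask: HxW boolean array marking the pixels the user corrected.
    centers:   list of (x, y) window centers on the current frame.
    """
    h = win_size // 2
    ys, xs = np.nonzero(edit_mask)
    touched = []
    for i, (cx, cy) in enumerate(centers):
        # A window is affected if any edited pixel falls inside it.
        if np.any((np.abs(xs - cx) <= h) & (np.abs(ys - cy) <= h)):
            touched.append(i)
    return touched
```

Because motion estimation already tells us where each of these windows lands in the following frames, the re-segmentation after a repair can be restricted to just those tracked windows, so the correction flows forward but never backward.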
The final thing is the post-processing. So far we have done segmentation frame by frame, but this is a movie. We need the segmentation to be coherent, consistent across frames. One way of doing that is basically to smooth things out: we have these segments per frame, and we're going to smooth them across time, regularize them, to make them look better. Let me show you what happens. You see here, you might be able to see that the boundaries are a bit rough here, and are much nicer here. We basically take the segmentations of multiple frames and smooth them out together, to go from this rough segmentation to a much nicer, much smoother segmentation, which helps us either to put this object into other movies or to do different effects.

So now it's time to show examples of this very cool technique. This is an example that we have seen a number of times before. Here we have multiple examples, and I'm going to run them now. We have segmented out the person here, and he looks very nice. This has all the components, the repair, the smoothing, everything, and it looks really, really nice. Here is an example that we have seen multiple times, and we see it in the particular case where we blur the background. This is a special effect that we have seen in the past, in one of the introductory videos. We segmented out the players and then we put them on a blurry background, to enhance the players and call the viewer's attention to them. Some additional really nice examples of special effects: here we have the original, and here there is a special filter; you can see the lady surfing has been filtered. We first had to segment her out, with all the problems of splashes from the water and topological changes occurring, as we see here. Once we segment that out, we can make very nice special effects, like here. Here we changed the light on the person, as you can see; we don't have to affect the background, because we have segmented the foreground out. And here is the composite that we have seen before; everything starts from segmenting this very tough image.

I want to keep giving you examples, because I think this is quite cool. This is another example where we basically segment out the player from the background. Look how difficult this problem is: the background is all blue and white, or blue and gray, like the player. A lot of confusion, but we have shape, and that helps a lot. And once we segment this player out, we can modify the background, for example blur it significantly. This is also based on just the concept that I explained to you, of segmenting the foreground out. For this particular example there are some additions: not only do we model the foreground colors, we also model the background colors. All I explained so far was: let's model the foreground, color and shape. Why not also do some of this learning for the background? You can do that, and then you get an improvement in your results. But again, the concepts are very similar. And I want to give you one more example. Here is another special effect, where we basically take the actor here, segment him out, and repeat him in multiple positions. Again, the big challenge here is to segment him out. Here there is very good contrast between the actor, in particular his shirt, and the background, so color is very important here. But so is the shape: the actor is moving, but not completely deforming, so the shape prior is critical.
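Going back for a second to the post-processing step mentioned above, here is a minimal sketch of one very simple form of temporal regularization: a moving average of the binary masks across time. The real post-processing is more elaborate; this is only meant to convey the idea.

```python
import numpy as np

def smooth_masks(masks, radius=2):
    """Average each pixel's mask value over nearby frames, then
    re-threshold, so the boundary stops jittering frame to frame."""
    stack = np.stack(masks).astype(np.float64)   # T x H x W
    out = np.empty_like(stack)
    for t in range(len(stack)):
        lo, hi = max(0, t - radius), min(len(stack), t + radius + 1)
        out[t] = stack[lo:hi].mean(axis=0)
    return (out > 0.5).astype(np.uint8)
```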
So this gives us an idea of what's behind one of the popular video segmentation techniques, in particular Roto Brush, which is incorporated inside Adobe's After Effects. We have seen that it has many components. It uses shape. It uses color. It uses localized structures, localized learning, both to be able to do more accurate predictions and to be friendlier to the user when the user needs to make corrections. All those concepts put together are what leads to this very, very high accuracy and very user-friendly video segmentation. Thank you very much. I hope you enjoyed this, and I will see you in the next video.