So the goal of this video is to prove the correctness of Kosaraju's two-pass, depth-first-search-based, linear-time algorithm that computes the strongly connected components of a directed graph. I've given you the full specification of the algorithm, and I've also given you a plausibility argument for why it might work, at least in the sense that it does something sensible on an example. Namely, it first does a pass of depth-first search on the reversed graph and computes this magical ordering. What's so special about this ordering is that when we then do a depth-first search using this ordering on the forward graph, it seems to do exactly what we want: every invocation of depth-first search from some new node discovers exactly the nodes of a strong component and no extra stuff. Remember, our first observation was that it was unclear whether depth-first search would be useful or not for computing strong components. If you call depth-first search from just the right place, you're going to get exactly the nodes of an SCC and nothing more. If you call it from the wrong place, you might get all of the nodes of the graph, and so no information at all about the structure of the strong components. At least in this example, the first pass with the finishing times seems to be leading us to invoke depth-first search from exactly the right places. So remember how this worked in the example. In the top graph, I've shown you the graph with the arcs reversed. This is where we first invoked DFS-Loop, with a loop over the nodes going from the highest node name, nine, all the way down to the node named one. Here we compute finishing times; that's the bookkeeping that we do in the first pass. We just keep a running count of how many nodes we've finished processing, where finishing a node means we've both explored that node and explored all of its outgoing arcs.
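The first-pass bookkeeping just described can be sketched in code. This is my own Python rendering of DFS-Loop with finishing times, not code from the lecture, and the tiny graph at the bottom is a made-up stand-in rather than the lecture's nine-node example.

```python
def dfs_loop_finishing_times(adj, nodes):
    """First pass of DFS-Loop: DFS from each unvisited node, processing
    start nodes from the highest name down, and record each node's
    finishing time (a running count, bumped once a node and all of its
    outgoing arcs have been fully explored)."""
    visited = set()
    finish = {}          # node -> finishing time
    t = 0                # running count of finished nodes

    def dfs(v):
        nonlocal t
        visited.add(v)
        for w in adj.get(v, []):
            if w not in visited:
                dfs(w)
        t += 1           # v finishes only after all its outgoing arcs
        finish[v] = t

    for s in sorted(nodes, reverse=True):   # highest node name first
        if s not in visited:
            dfs(s)
    return finish

# Made-up 4-node chain (imagine it is already the reversed graph): 4 -> 3 -> 2 -> 1.
finish = dfs_loop_finishing_times({4: [3], 3: [2], 2: [1]}, [1, 2, 3, 4])
assert finish == {1: 1, 2: 2, 3: 3, 4: 4}
```

The recursion here is fine for small examples; for large graphs you would want an iterative DFS to sidestep Python's recursion limit.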
And so that gives us these numbers in red, these finishing times between one and nine, for the various nodes. Those became the new node names in the second graph, and then we reversed the arcs again to get the original graph back. And then we saw that every time we invoked DFS in our second pass, we uncovered exactly the nodes of an SCC. So when we invoked it from the node nine, we discovered nine, eight, and seven; those all have the leader vertex nine. When we next invoked DFS from six, we discovered six, five, and one, and nothing else. And finally we invoked it from four, and we discovered two, three, and four, and nothing else. Those are exactly the three SCCs of this graph. So let's now understand why this works in any directed graph, not just in this one example. Let's begin with a simple observation about directed graphs which is actually interesting in its own right. The claim is that every directed graph has two levels of granularity. If you squint, if you sort of zoom out, then what you see is a directed acyclic graph on its strongly connected components. And if you want, you can zoom in and focus on the fine-grained structure within one SCC. A little more precisely, the claim is that the strongly connected components of a directed graph induce, in a natural way, an acyclic meta-graph. So what is this meta-graph? What are the nodes, and what are the arcs? Well, the meta-nodes are just the SCCs. We think of every strongly connected component as being a single node in this meta-graph; call them, say, C1 up to Ck. And what are the arcs in this meta-graph? They're basically just the ones corresponding to the arcs between SCCs in the original graph. That is, we include in the meta-graph an arc from the strong component C to the strong component C-hat if and only if there's an arc from a node of C to a node of C-hat in the original graph G.
So for example, if this triangle is your C, and this other triangle is your C-hat, and you have one or maybe multiple edges going from C to C-hat, then in the corresponding meta-graph you're just going to have a node for C, a node for C-hat, and a directed arc from C to C-hat. If we go back to some of the directed graphs that we've used as running examples, say the one at the beginning of the previous video, which looked maybe something like this, the corresponding directed acyclic graph has four nodes and four arcs. And for the running example we used to illustrate Kosaraju's algorithm, with the three triangles, the corresponding meta-graph would just be a path with three nodes. So why is this meta-graph guaranteed to be acyclic? Well, remember, meta-nodes correspond to strong components, and in a strong component you can get from anywhere to anywhere else. Also, remember that on a directed cycle you can get from anywhere to anywhere else. So if you had a cycle that involved two different meta-nodes, that is, two supposedly distinct SCCs where you can get from the one to the other and vice versa, they would collapse into a single SCC: you can get from anywhere to anywhere in one, from anywhere to anywhere in the other, and you can also go between them at will, so you can get from anywhere in the union to anywhere in the union. So not just in this context of computing strong components, but also more generally, this is a useful fact to know about directed graphs. On the one hand, they can have very complex structure within a strong component: you have paths from everywhere to everywhere else, and it may look sort of complicated. But at a higher level, if you abstract to the level of SCCs, you're guaranteed to have this simple directed acyclic graph structure. So, to reinforce these concepts and also to segue into thinking about Kosaraju's algorithm in particular,
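As a quick illustration (my own sketch, not code from the lecture; `scc_of` is a hypothetical precomputed node-to-SCC labeling), here is how the meta-graph would be built from the original arcs, using a three-triangle graph shaped like the running example, where the meta-graph should come out as a path with three nodes:

```python
def scc_meta_graph(edges, scc_of):
    """Meta-arcs induced by the SCCs: include (C, C_hat) exactly when some
    original arc goes from a node of C to a node of C_hat."""
    meta_arcs = set()
    for u, v in edges:
        if scc_of[u] != scc_of[v]:                 # skip arcs internal to one SCC
            meta_arcs.add((scc_of[u], scc_of[v]))  # parallel arcs collapse to one
    return meta_arcs

# Three triangles A -> B -> C (a made-up stand-in for the running example).
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4),
         (7, 8), (8, 9), (9, 7), (3, 4), (6, 7)]
scc_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B",
          7: "C", 8: "C", 9: "C"}
meta = scc_meta_graph(edges, scc_of)
assert meta == {("A", "B"), ("B", "C")}   # a path with three meta-nodes
```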
Let me ask you a question about how reversing arcs affects the strong components of a directed graph. The correct answer to this quiz is the fourth one: the strong components are exactly the same as they were before. In fact, the relation that we described is exactly the same as it was before, and therefore the equivalence classes, that is, the strong components, are exactly the same. If two nodes are related in the original graph, meaning there's a path from u to v and a path from v to u, that's still true after you reverse all the arcs; you just use the reversals of the two paths that you had before. Similarly, if two nodes weren't related before, for example because you could not get from u to v, then after you reverse everything you can't get from v to u, so again the relation doesn't hold. So the SCCs are exactly the same in the forward and the backward graph. In particular, in Kosaraju's algorithm, the strong component structure is exactly the same in the first pass of DFS and in the second pass of DFS. So now we understand that every directed graph has a meta-graph where the nodes correspond to its strong components, and there's an arc from one SCC to another if there's an arc from any node in the first SCC to any node of the second SCC in the original graph. I'm now in a position to state the key lemma that drives the correctness of Kosaraju's two-pass algorithm for computing the strongly connected components of a directed graph. Here's the lemma statement. It considers two strongly connected components that are adjacent, in the sense that there's an arc from a node in one of them to a node in the other one. So let's say we have one SCC, C1, with a node i, and another SCC, C2, with a node j, and that in the graph G there's an arc directly from i to j. In this sense we say that these SCCs are adjacent, with the second one being, in some sense, after the first one. Now let's suppose we've already run the first pass, the DFS-Loop subroutine.
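The quiz's answer can be checked directly in code. Here's a small sketch of mine (not from the lecture, and the 4-node graph is made up): it computes the "related" pairs, those with paths in both directions, in a graph and in its reversal, and confirms they coincide, which is exactly why the SCCs are unchanged.

```python
from collections import deque

def reachable(adj, s):
    """Set of nodes reachable from s via breadth-first search."""
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def adjacency(edges, nodes):
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    return adj

# Hypothetical 4-node graph: a 2-cycle {1, 2} plus arcs out to 3 and then 4.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3), (3, 4)]
G = adjacency(edges, nodes)
Grev = adjacency([(v, u) for u, v in edges], nodes)  # all arcs reversed

def mutually_reachable(adj):
    return {(u, v) for u in nodes for v in nodes
            if v in reachable(adj, u) and u in reachable(adj, v)}

# The relation is identical in both graphs, so the SCCs are identical too.
assert mutually_reachable(G) == mutually_reachable(Grev)
```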
And remember, that works on the reversed graph. So we've invoked it on the reversed graph and computed these finishing times; as usual, we'll let f denote the finishing times computed in that depth-first search subroutine on the reversed graph. The lemma then asserts the following. Amongst all the nodes in C1, look at the one with the largest finishing time; similarly, amongst all nodes in C2, look at the one with the largest finishing time. Amongst all of these, the claim is, the biggest finishing time will be in C2, not in C1. What I want to do next is assume that this lemma is true, temporarily, and explore the consequences of that assumption. In particular, I want to show you that if this lemma holds, then we can complete the proof of correctness of Kosaraju's two-pass SCC computation algorithm. Okay, so if the lemma is true, then I'll give you the argument for why we're done, for why we just peel off the strongly connected components one at a time with the second pass of depth-first search. Now of course, a proof with a hole in it isn't a proof, so at the end of the lecture I'm going to fill in the hole; that is, I'm going to supply a proof of this key lemma. But for now, as a working hypothesis, let's assume that it's true. Let's begin with a corollary, that is, a statement which follows essentially immediately from the statement of the lemma. For the corollary, let's forget about just trying to find the maximum finishing time of a single SCC, and let's think about the maximum finishing time in the entire graph. Why do we care about the maximum finishing time in the entire graph? Well, notice that's exactly where the second pass of depth-first search is going to begin, right? It processes nodes in order from largest finishing time to smallest finishing time. So equivalently, let's think about the node at which the second pass of depth-first search is going to begin, i.e., the node with the maximum finishing time.
Where could it be? The corollary is that it has to be in what I'm going to call a sink strongly connected component, that is, a strongly connected component without any outgoing arcs. For example, let's go back to the meta-graph of SCCs for the very first directed graph we looked at when we started talking about this algorithm. Recall there were four SCCs: a C1, a C2, a C3, and a C4. Within each of these components there could be multiple nodes, but they're all strongly connected to each other. Now let's use f1, f2, f3, and f4 to denote the maximum finishing time in each of these SCCs. We now have four different opportunities to apply this lemma, right? There are four different pairs of adjacent SCCs, and so what do we find? Well, comparing f1 and f2: because C2 comes after C1, that is, there's an arc from C1 to C2, the max finishing time in C2 has to be bigger than that in C1, so f2 is bigger than f1. By the same reasoning, f3 has to be bigger than f1. Symmetrically, we can apply the lemma to the pairs C2, C4 and C3, C4, and we get that f4 has to dominate both of them. Notice we actually have no idea whether f2 or f3 is bigger, so that pair we can't resolve. But we do know these relationships: f1 is the smallest and f4 is the largest. And you'll also notice that C4 is a sink SCC, in the sense that it has no outgoing arcs. If you think about it, that's a totally general consequence of this lemma. A simple proof by contradiction goes as follows: consider the strongly connected component with the maximum f value, and suppose it was not a sink SCC, that is, it has an outgoing arc. Follow that outgoing arc to get to some other SCC. By the lemma, the SCC you've gotten to has an even bigger maximum finishing time. That contradicts the fact that you started in the SCC with the maximum finishing time, okay?
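Here's a quick sanity check of the corollary in code, a sketch of my own rather than anything from the lecture. I shrink each SCC of the cartoon down to a single node, so the graph is just the DAG 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, run the first pass (DFS on the reversed graph, start nodes processed from highest name down), and confirm that the maximum finishing time lands in the sink SCC, node 4.

```python
def finishing_order(adj, nodes):
    """First pass of DFS-Loop: return nodes in increasing order of
    finishing time, processing start nodes in the given order."""
    visited, order = set(), []

    def dfs(v):
        visited.add(v)
        for w in adj.get(v, []):
            if w not in visited:
                dfs(w)
        order.append(v)  # v finishes only after all its outgoing arcs

    for s in nodes:
        if s not in visited:
            dfs(s)
    return order

# The cartoon meta-graph with each SCC shrunk to one node, arcs reversed.
rev = {2: [1], 3: [1], 4: [2, 3]}
order = finishing_order(rev, [4, 3, 2, 1])
# The node finished last (max finishing time) is the sink SCC, node 4.
assert order[-1] == 4
```

Note that in this run, node 2 happens to finish before node 3, matching the lecture's point that the relative order of f2 and f3 is not determined by the lemma.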
So just like in this cartoon, where the unique sink SCC has to have the largest finishing time, that's totally general. As another sanity check, we might return to the nine-node graph where we actually ran Kosaraju's algorithm. Looking at the forward version of the graph, which is the one on the bottom, we see that the maximum finishing times in the three SCCs are 4, 6, and 9. It turns out those are the same as the leader nodes, which is not an accident, if you think about it for a little while. And again, you'll observe the maximum finishing time in this graph, namely 9, is indeed in the leftmost SCC, which is the only SCC with no outgoing arcs, okay? But it's totally general. Basically, you can keep following arcs, and you keep seeing bigger and bigger finishing times. So the biggest one of all has to be somewhere where you get stuck, where you can't go forward, where there are no outgoing arcs, and that's what I'm calling a sink SCC. So, assuming the lemma is true, we now know this corollary is true. Now, using this corollary, let's finish the proof of correctness of Kosaraju's algorithm, modulo the proof of the key lemma. I'm not going to do this super rigorously, although everything I say is correct and can be made rigorous; if you want a more rigorous version, I'll post some notes on the course website, which you can consult for more details. So what did the previous corollary accomplish? It allows us to locate the node with the maximum finishing time: it's somewhere in some sink SCC. Let me remind you about the discussion we had at the very beginning of talking about computing strong components. We were trying to understand whether depth-first search would be a useful workhorse for finding the strong components, and the key observation was that it depends where you begin that depth-first search.
So for example, in this graph with four SCCs shown in blue on the right, a really bad place to start a DFS call would be somewhere in C1, somewhere in this source SCC. This is a bad DFS. Why is it bad? Well, remember what depth-first search does: it finds everything findable from its starting point. And from C1, you can get to the entire world. You can get to all the nodes in the entire graph, so you'll discover everything. That's totally useless, because we wanted to discover much more fine-grained structure; we want to discover C1, C2, C3, and C4 individually. So it would be a disaster if we invoked depth-first search from somewhere in C1. Fortunately, that's not what's going to happen, right? We computed this magical ordering in the first pass to ensure that we look at the node with the maximum finishing time first, and by the corollary, the maximum finishing time is going to be somewhere in C4. That's going to be a good DFS, in the sense that when we start exploring from anywhere in C4, there are no outgoing arcs. So of course we're going to find everything in C4, since everything in C4 is strongly connected to each other, but we can't get out. We will not have the option of trespassing on other strong components, and we're not going to find them. We are only going to find C4, nothing more. Now, here's where we're going to be a little informal, although again, everything I'm going to say is correct. What happens now, once we've discovered everything in C4? Well, all the nodes in C4 get marked as explored as we're doing depth-first search, and then they're basically dead to us, right? The rest of our depth-first search loop will never explore them again; they're already marked as explored, and if we ever see them, we don't even go there. So the way to think about that is, when we proceed with the rest of our for loop in DFS-Loop, it's as if we're starting afresh.
We're doing depth-first search from scratch on a smaller graph, the residual graph: the graph G with this newly discovered strong component C-star deleted. So in this example on the right, all of the nodes in C4 are dead to us, and it's as if we run DFS anew just on the graph containing the strong components C1, C2, and C3. So in particular, where is the next invocation of depth-first search going to come from? It's going to come from some sink SCC of the residual graph: it's going to start at the node that remains and has the largest finishing time. Now, there's some ambiguity in this picture. Again, recall we don't know whether f2 or f3 is bigger; it could be either one. So maybe f2 is the largest remaining finishing time, in which case the next DFS invocation is going to begin somewhere in C2. Again, the only things outgoing from C2 are these already-explored nodes; they're effectively deleted, and we're not going to go there again. So this is, essentially, a sink SCC. We newly discover the nodes in C2 and nothing else, and those are now effectively deleted. Then the next invocation of DFS will come from somewhere in C3, that being the only remaining sink SCC in the residual graph, so the third call to DFS will discover exactly that stuff. And now, of course, we're left with only C1, and so the final invocation of DFS will emerge from, and discover, the nodes in C1. In this sense, because we've ordered the nodes by the finishing times of the DFS on the reversed graph, that ordering has this incredible property: when we process the nodes in the second pass, we just peel off the strongly connected components one at a time. If you think about it, it's in reverse topological order with respect to the directed acyclic graph of the strongly connected components. So we've constructed a proof of correctness of Kosaraju's algorithm for computing strongly connected components, but again, there's a hole in it.
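Putting the two passes together, here is a compact sketch of the whole algorithm in Python. This is my own rendering, not code from the course, and the example graph is a made-up stand-in with the same shape as the lecture's (three triangles chained left to right), not its actual node numbering.

```python
from collections import defaultdict

def kosaraju(nodes, edges):
    """Two-pass SCC computation. Returns a dict: node -> leader of its SCC."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)   # the reversed graph, used in the first pass

    # Pass 1: DFS-Loop on the reversed graph, start nodes from the highest
    # name down, recording nodes in increasing order of finishing time.
    visited, finish_order = set(), []
    def dfs1(v):
        visited.add(v)
        for w in radj[v]:
            if w not in visited:
                dfs1(w)
        finish_order.append(v)
    for s in sorted(nodes, reverse=True):
        if s not in visited:
            dfs1(s)

    # Pass 2: DFS-Loop on the forward graph, processing nodes in
    # decreasing order of finishing time; each DFS call peels off one SCC.
    visited, leader = set(), {}
    def dfs2(v, lead):
        visited.add(v)
        leader[v] = lead
        for w in adj[v]:
            if w not in visited:
                dfs2(w, lead)
    for s in reversed(finish_order):
        if s not in visited:
            dfs2(s, s)
    return leader

# Hypothetical example: three triangles {1,2,3} -> {4,5,6} -> {7,8,9}.
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4),
         (7, 8), (8, 9), (9, 7), (3, 4), (6, 7)]
leader = kosaraju(range(1, 10), edges)
groups = defaultdict(set)
for v, l in leader.items():
    groups[l].add(v)
assert set(map(frozenset, groups.values())) == \
    {frozenset({1, 2, 3}), frozenset({4, 5, 6}), frozenset({7, 8, 9})}
```

As in the lecture's trace, the second pass discovers the sink SCC {7, 8, 9} first and the source SCC {1, 2, 3} last, i.e., in reverse topological order of the meta-graph. The recursive DFS is for clarity; a production version would use an explicit stack to avoid Python's recursion limit on big graphs.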
So we completed the argument assuming a statement that we haven't proved. Let's fill in that last gap in the proof, and we'll be done. What we need to do is prove the key lemma. Let me remind you what it says: if you have two adjacent SCCs, C1 and C2, with an arc from a node in C1, call it i, to a node in C2, say j, then the max finishing time in C2 is bigger than the max finishing time in C1, where, as always, these finishing times are computed in the first pass of DFS-Loop on the reversed graph. All right, now, since the finishing times are computed in the reversed graph, let's actually reverse all the arcs and reason about what's happening there. We still have C1, and it still contains the node i; we still have C2, and it still contains the node j. But now, of course, the orientation of the arc has reversed, so the arc points from j to i. Recall we had a quiz which asked you to understand the effect of reversing all arcs on the SCCs, and in particular, there is no effect: the SCCs in the reversed graph are exactly the same as in the forward graph. Now, we're going to have two cases in this proof, and the cases correspond to where we first encounter a node of C1 ∪ C2. Remember, when we do this DFS-Loop in the first pass, we have this outer for loop that iterates over all of the nodes, so we're guaranteed to explore every node of the graph at some point; in particular, we're going to explore, at some point, every single node in C1 ∪ C2. What I want you to do is pause the algorithm when this first pass, for the first time, explores some node that is in either C1 or C2. We need two cases, of course, because that first node might be in C1, so you see something from C1 first, or it might be in C2, so you see something from C2 first. Case 1 is when the first node v that we see from either one happens to lie in C1, and Case 2 is when the first node v that we see happens to lie in C2.
So clearly, exactly one of these will occur. Let's talk about Case 1, where we see a node of C1 before we see any node of C2. In this case, the claim is that we're going to explore everything in C1 before we ever see anything in C2. Why is that true? The reason is there cannot be a path that starts somewhere in C1, for example at the vertex v, and reaches C2. And this is where we're using the fact that the meta-graph of strongly connected components is acyclic. Remember, C1 is strongly connected and C2 is strongly connected. In the reversed graph you can get from C2 to C1, so if you could also get from C1 back to C2, this would all collapse into a single strongly connected component, a contradiction, since we're assuming C1 and C2 are distinct strongly connected components. Therefore, you can't have paths in both directions. We already have a path from right to left, via the arc from j to i, so there's no path from left to right. That's why, if you originate a depth-first search from somewhere inside C1, like this vertex v, you will finish exploring all of C1 before you ever see C2; you're only going to see C2 at some later point in the outer for loop. So what's the consequence of completely finishing with C1 before you ever see C2? It means every single finishing time in C1 is going to be smaller than every single finishing time in C2. That's even stronger than what we're claiming: we're just claiming that the biggest finishing time in C2 is bigger than the biggest in C1, but actually, the finishing times in C2 totally dominate those in C1, because you finish C1 before you ever see C2. So let's now have a look at Case 1 in action; let's return to the nine-node graph on which we actually ran Kosaraju's algorithm to completion. So we go back to this graph, which has the three strongly connected components; remember, the bottom version is the forward version, and the top version is the reversed version.
So if you think about the middle SCC as playing the role of C1, and the leftmost SCC as playing the role of C2, then what we have is exactly Case 1 of the key lemma. So which was the first of these six vertices visited during the DFS-Loop on the reversed graph? Well, that would just be the node with the highest name, node nine. So that was the first of these six vertices that depth-first search ever looked at in the first pass. That node lies in what we are calling C1, and indeed, everything in C1 was discovered in that pass before anything in C2. And that's why all of the finishing times in C2, the seven, eight, and nine, are bigger than all of the finishing times in C1, the one, five, and six. So we're good to go in Case 1: we've proven the key lemma when, amongst the vertices of C1 and C2, the depth-first search in the first pass sees something from C1 first. So now let's look at the other case, which can also totally happen: the first thing we see when depth-first searching in the first pass is something from C2. And here is where we truly use the fact that we're using depth-first search rather than some other graph search algorithm. In a lot of places in this algorithm you could swap in a different search strategy, but in this case you'll see why it's important that we're using depth-first search to compute the finishing times. What's the key point? The key point is that when we invoke depth-first search beginning from this node v, which we are now assuming lies in C2, remember, depth-first search will not complete, we won't be done with v, until we've found everything there is to find from it. Right? We recursively explore all of the outgoing arcs, those nodes recursively explore all of their outgoing arcs, and so on. It's only when all paths going out of v have been totally explored and exhausted that we finally backtrack all the way to v and consider ourselves done with it.
That is, depth-first search in the reversed graph, initiated at v, won't finish with v until everything findable from v has been completely explored. Now, there's an arc from C2 to C1 in this reversed graph. Obviously, everything in C2 is findable from v, since C2 is strongly connected. And we can get from C2 to C1 just using this arc from j to i; C1 being strongly connected, we can then find all of that as well. Maybe we can find other strongly connected components too, but for sure, depth-first search started from v will find everything in C1 ∪ C2, and maybe some other things, and we won't finish with v until we've finished with everything else; that's the depth-first search property. For that reason, the finishing time of this vertex v will be the largest of anything reachable from it. In particular, it will be larger than everything else in C2, but more to the point, it will be larger than everything in C1, which is what we were trying to prove. Again, let's just see this quickly in action in the nine-node network on which we traced through Kosaraju's algorithm. To show the role that Case 2 plays in this concrete example, let's now think of the rightmost strongly connected component as being C1, and the middle strongly connected component as being C2. (Last time we called the middle one C1 and the leftmost one C2; now we're calling the rightmost one C1 and the middle one C2.) So again we ask: of the six nodes in C1 and C2, which is the first one encountered in the depth-first search that we do in the first pass? And again, it is the node originally labeled 9. That's the same node that was relevant in the previous case, but now, with this relabeling of the components, 9 lies in the strongly connected component C2, not in the one labeled C1, and that's why we're now in Case 2, not Case 1. And what finishing time does this originally-labeled-9 node get? It gets the finishing time six.
And you'll notice six is bigger than the finishing time of any of the other nodes in C1 or C2; all the other five nodes have the finishing times 1 through 5. And that's exactly because, when we ran depth-first search in the first pass and started it at the node originally labeled 9, it discovered these other five nodes and finished exploring them first, before finally backtracking all the way back to 9 and deeming 9 fully explored. It's only at that point that 9 got its finishing time, after everything reachable from it had already gotten lower finishing times. So that wraps it up. We had two cases, depending on whether, of these two adjacent SCCs, the first vertex encountered is in C1 or in C2. Either way, it doesn't matter: the largest finishing time has to be in C2. Sometimes it's bigger than everything in C1, sometimes it's just bigger than the biggest finishing time in C1, but it's all the same to us. And to recap how the rest of the proof goes: we had a corollary based on this lemma, which is that the maximum finishing time has to lie in a sink strongly connected component. And that's exactly where we want our depth-first search to initiate. If you initiate it in a strong component with no outgoing arcs, then when you do the DFS, the stuff you find is just the stuff in that component; you do not have any avenues by which to trespass on other strong components. So you find exactly one SCC. In effect, you can peel that off and recurse on the rest of the graph. And our sleek way of implementing this recursion is just via the single second DFS pass, where you treat the nodes in decreasing order of finishing times. That, in effect, unveils all the SCCs in reverse topological ordering. So that's it: Kosaraju's algorithm and the complete proof of correctness, a blazingly fast graph primitive that, for any directed graph, will tell you its strong components.