So let me now introduce you to the union-find data structure. Let's not lose sight of our goal: our goal is to be able to check for cycles in Kruskal's algorithm in constant time. So the first and most basic idea behind this union-find data structure implementation is that we're going to maintain a linked structure for each group, that is, for each connected component with respect to the edges Kruskal has chosen thus far. By linked structure I just mean each vertex of the graph is going to have an extra pointer field. In addition, with each group, with each connected component, we're going to designate one vertex, and we don't care which one, just some vertex from this connected component, as the leader vertex of that component. All right, so what is the point of these extra pointers at each vertex, and what is the point of having these leaders? Well, a key invariant that we're going to maintain is that a given vertex, with its extra pointer, points to the leader vertex of its connected component. So for example, maybe we have two different connected components with three vertices each, one containing the vertices u, v, and w, the other containing the vertices x, y, and z. Any of the three vertices could be the leader of each of these two components, so perhaps u happens to be the leader vertex of the first component and x happens to be the leader vertex of the second component, and then the invariant just says every vertex should be pointing to the leader of its component. So v and w should be pointing to u; u has its own pointer, and it should be pointing to itself. Similarly, in the second component, x is pointing to itself, y points to x, and z also points to x. So in blue here are the actual edges of the graph, and in green are these extra leader pointers that we've invented, where everybody's pointing back to the leader of their component. And so one very simple thing in this setup, which turns out to be a good idea, is that each component in effect inherits the name of its leader vertex. So we refer to a group, we refer to a component, via the object, via the vertex, that happens to be the representative, that happens to be the leader. And what's kind of amazing is that even just this very simple scaffolding on the connected components is enough to give constant-time cycle checks, provided the invariant is satisfied. Well, how do we do that? Remember that checking whether adding the edge (u, v) is going to create a cycle boils down to checking whether there's already a path between u and v. Well, there's already a path between u and v if and only if they're in the same connected component. Given two vertices u and v, how do we know if they're in the same connected component? We just follow the respective leader pointers and see if we get to the same place. If they're in the same component, we get the same leader. If they're in different components, we get different leaders. So checking for a cycle just involves, for each of u and v, comparing their leader pointers for equality, and that is clearly constant time. More generally, the way you implement the find operation for this flavor of disjoint-union data structure is you simply return the leader pointer of the object that you were handed. So if you're given a vertex, you just follow the leader pointer of that vertex, and you return wherever you wind up. So that's pretty excellent: as long as the invariant is satisfied in this simple data structure, we have our desired implementation of constant-time cycle checks.
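To make this concrete, here is a minimal sketch in Python of the leader-pointer idea; the class and method names (UnionFind, find, creates_cycle) are my own illustrative choices, not anything fixed by the lecture.

```python
class UnionFind:
    """Each vertex points directly at the leader of its connected component."""

    def __init__(self, vertices):
        # Initially every vertex is in its own component and is its own leader.
        self.leader = {v: v for v in vertices}

    def find(self, v):
        # Constant time: just return whatever v's leader pointer says.
        return self.leader[v]

    def creates_cycle(self, u, v):
        # Adding edge (u, v) would create a cycle iff u and v already
        # have the same leader, i.e. are already in the same component.
        return self.find(u) == self.find(v)
```

So Kruskal's cycle check for a candidate edge (u, v) is just a call like creates_cycle(u, v): two pointer lookups and one comparison.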
But, and certainly this is a recurring theme in our data structure discussions, whenever you have a data structure that needs to maintain an invariant, and you do an operation that changes the data structure, in this case a union fusing two groups together, you have to worry: does the invariant get destroyed by that operation, and if it does, how are you going to restore it without doing undue work? So in the present context of Kruskal's algorithm, here's how this plays out. We're happily doing our constant-time cycle checks; whenever an edge creates a cycle we don't do anything, we skip the edge, we don't change our data structure, we move on. The issue is when we have a new edge and it doesn't create a cycle, when our cycle check fails. Now Kruskal's algorithm dictates that we add this edge into the set capital T that we're building, and that fuses two connected components together. But remember we have this invariant: every vertex should point to the leader of its component. Well, if we had component A and we had component B, their vertices are pointing to the leader vertices of their respective components. Now when these components fuse into one, we've got to do some updating of leader pointers. In particular, there used to be two leaders; now there has to be just one, and we have to rewire the leader pointers to restore the invariant. So to make sure you're clear on this important problem, let me ask you the following question. Consider the case when, at some point in Kruskal's algorithm, a new edge is added and two connected components fuse into one; to restore the invariant you have to do some leader pointer updates. In the worst case, asymptotically, how many leader pointer updates might be required in order to restore the invariant? The answer to this question, somewhat alarmingly, is the third answer: it might require a number of pointer updates that is linear in the number of vertices n to restore the invariant. Maybe one easy way to see that is to imagine the very last edge that Kruskal is going to add to the set T, the one which fuses the final two connected components down into one. For all you know, those two components have exactly the same size; they have n over 2 vertices each. You're going down from two leaders to one, so one of those sets of n over 2 vertices is going to have to inherit the leader of the other side. That is, one of the two sets is going to have n over 2 vertices get their leader pointer updated. So that's a bummer. We were sort of hoping for a near-linear time bound, but if each one of our linear number of edge additions might trigger a linear number of leader pointer updates, that seems to be giving rise to a quadratic time bound. But remember, when I introduced the union-find data structure I only said that first idea was idea number one. Presumably there's an idea number two, and here it is. And it's very natural; if you were coding up an implementation of the union-find data structure, you would probably do this optimization yourself. All right, so consider the moment in time at which some component A merges with some component B. Each of these two components currently has its own respective leader vertex, and all vertices in each group are pointing to that leader vertex. Now when they fuse, what are you going to do? The first obvious thing to do is to say, well, let's not bother computing some totally new leader; let's either reuse the leader from group A or the leader from group B.
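To see concretely what restoring the invariant involves, here is a naive union sketch continuing the hypothetical UnionFind class above; naive_union is my own name, and this version simply keeps u's old leader and rewires everything that pointed at v's old leader.

```python
    def naive_union(self, u, v):
        # Keep u's leader (an arbitrary choice) and retire v's leader.
        leader_u, leader_v = self.find(u), self.find(v)
        if leader_u == leader_v:
            return  # already in the same component, nothing to rewire
        # Rewire every vertex that pointed at v's old leader.
        for w in self.leader:
            if self.leader[w] == leader_v:
                self.leader[w] = leader_u
```

As written this scans every vertex; even if we kept an explicit member list per component to avoid the scan, the final merge of two components of n over 2 vertices each would still force n over 2 pointer rewirings, which is exactly the linear worst case from the quiz.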
That way, if for example we retain the leader from group A, the only leader pointers that need to be rewired are the ones that come from component B. The vertices in component A can keep their same leader vertex and their same leader pointers as before. So that's the first point: let's just have a new union of two components inherit the leader of one of the constituent parts. Now, if you're going to retain one of the two leaders, which one are you going to retain? Maybe one of the components has 1,000 vertices and the other component only has 100 vertices. Well, given the choice, you're certainly going to keep the leader from the bigger, first component. That way you only have to rewire the 100 leader pointers of the second group. If you kept the leader of the second group, you'd have to rewire the 1,000 pointers from the first group, and that seems silly and wasteful. So the obvious way to implement a merge is to just keep the leader of the bigger group and rewire everybody from the second group, the smaller group. Now you should notice that, in order to actually implement this optimization where you always retain the leader of the larger group, you have to be able to quickly determine which of the two groups is larger, but you can augment the data structure we've discussed to facilitate that. With each group, you just keep a count of how many vertices are in that group; you maintain a size field for each group. That allows you to check in constant time what the populations of two different groups are and figure out, again in constant time, which one is bigger. And notice that when you fuse two groups together it's easy to maintain the size field: it's just the sum of the sizes of the two constituent parts. So let me now revisit the question from the previous slide. In the worst case, given this optimization, how many leader pointers, asymptotically, might you have to rewire when you merge two components together? Well, unfortunately, the song remains the same. The answer is still the third one, for exactly the same reason as on the previous slide. It still might be the case that, say, in the final iteration of Kruskal you're merging two components that both have size n over 2, and then it doesn't matter: no matter which leader you choose, you're going to be stuck updating the leader pointers of n over 2, that is theta of n, vertices. So while this is clearly a smart practical optimization, it doesn't seem to be buying us anything in our asymptotic analysis of the running time. However, however, what if we think about the work done in all of these leader pointer updates in the following different way? Rather than asking how many updates a merging of two components might trigger, let's adopt a vertex-centric view. Suppose you're one of the vertices of this graph. Initially, at the beginning of Kruskal's algorithm, you're in your own isolated connected component; you just point to yourself, you're your own leader. And then as Kruskal's algorithm runs its course, your leader pointer will periodically get updated. At some point you're no longer pointing to yourself, you're pointing to some other vertex. Then at some point your pointer gets updated again, you're pointing to yet some other vertex, and so on. How many times over the entire trajectory of Kruskal's algorithm, in light of our new optimization, might you, as some vertex of this graph, have your leader pointer updated? Well, here's the very cool answer: the answer is the second one.
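Before seeing why the answer is logarithmic, here is a sketch of the union-by-size optimization just described, assuming we also track a member list and a size count per leader; the class and field names (UnionFindBySize, members, size) are again just illustrative.

```python
class UnionFindBySize:
    def __init__(self, vertices):
        self.leader = {v: v for v in vertices}
        self.members = {v: [v] for v in vertices}  # vertices led by each leader
        self.size = {v: 1 for v in vertices}       # population of each component

    def find(self, v):
        # Constant time: follow the single leader pointer.
        return self.leader[v]

    def union(self, u, v):
        a, b = self.find(u), self.find(v)
        if a == b:
            return  # same component already
        # Keep the leader of the larger group; rewire only the smaller one.
        if self.size[a] < self.size[b]:
            a, b = b, a
        for w in self.members[b]:
            self.leader[w] = a
        self.members[a].extend(self.members[b])
        # The new size is just the sum of the two constituent sizes.
        self.size[a] += self.size[b]
        del self.members[b], self.size[b]
```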
So while it remains true that, if you always have the union of two groups inherit the leader of the larger one, a given fusion might still trigger a linear number of leader pointer updates, each vertex will only see its leader pointer updated a logarithmic number of times over the course of Kruskal's algorithm. What is the reason for this? Well, suppose you're a vertex and you're in some group that has maybe 20 vertices, so you're 1 of 20. Now suppose at some point your leader pointer gets updated. Why did that happen? It means that your group of 20 vertices merged with some other group that is at least as big. Remember, your leader pointer only gets rewired in a fusion if you were in the smaller group. So you're joining a group at least as big as yours, which means the size of the union, the size of your new community, your new connected component, is at least double the size of your previous one. So the bottom line is, every time you as a vertex have your leader pointer updated, the population of the component to which you belong is at least twice as large as before. Now, you started in a connected component of size one, and a connected component is never going to have more than n vertices. So the number of doublings you might have to endure is at most log base 2 of n, and that bounds how many leader pointer updates you will see as a vertex in this graph. So in light of that very cool observation, we can now give a good running time analysis of Kruskal's algorithm using the union-find data structure, using this scaffolding on the connected component structure to do cycle checks. We of course have not changed the preprocessing step; we're still sorting the edges from cheapest to most expensive at the beginning of the algorithm, and that still takes O(m log n) time. So what work do we have to do beyond this sorting preprocessing step? Fundamentally, Kruskal's algorithm is just about these cycle checks. In each iteration of the for loop, we have to check whether adding a given edge would create a cycle with the edges we've already added. And the whole point of the union-find scaffolding, these linked structures, is that we can check for cycles in constant time: given an edge, we look at the leader pointers of its endpoints, and a cycle will be created if and only if their leader pointers are identical. So for the cycle checking, we only do constant time work in each of the O(m) iterations. The final source of work is maintaining this union-find data structure, restoring the invariant each time we add a new edge to the set capital T. And here the good idea is that we're not going to bound the worst-case number of leader pointer updates in each iteration, because that could be too expensive, up to linear in just a single iteration. Rather, we're going to do a global analysis, thinking about the total number of leader pointer updates that ever occur. On the previous slide we observed that a single vertex is only going to endure a logarithmic number of leader pointer updates. So summing over all of the n vertices, the total work done for leader pointer updates is only O(n log n). So even though we might do a linear number of pointer updates in just one iteration of this for loop, we still have this global upper bound of O(n log n) on the total number of leader pointer updates. Very cool. So looking over this tally, we observe the stunning fact that the bottleneck for this implementation of Kruskal's algorithm is actually sorting.
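Putting the pieces together, here is a sketch of how Kruskal's main loop might use this structure, with the cost of each step noted in comments; the (cost, u, v) edge format is an assumption of mine, and UnionFindBySize refers to the hypothetical sketch above.

```python
def kruskal_mst(vertices, edges):
    # edges: iterable of (cost, u, v) triples
    uf = UnionFindBySize(vertices)
    tree = []
    for cost, u, v in sorted(edges):      # preprocessing: O(m log n) to sort
        if uf.find(u) != uf.find(v):      # O(1) cycle check per edge
            tree.append((u, v))
            uf.union(u, v)                # O(n log n) pointer updates in total
    return tree
```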
So we do more work in the preprocessing step, O(m log n), than we do in the entire for loop, which is only O(m + n log n). That gives us an overall running time bound of O(m log n), matching the theoretical performance that we achieved with Prim's algorithm implemented with heaps. So just like with our heap-based implementation of Prim's algorithm, we get a running time which is almost linear, only a logarithmic factor slower than reading the input. And just like with Prim's algorithm, you should find Kruskal's algorithm to be very competitive in practice.