
So, in this video, we'll graduate beyond the domain of just vanilla binary search trees, like we've been talking about before, and we'll start talking about balanced binary search trees. These are the search trees you'd really want to use when you want real-time guarantees on your operation time, because they're search trees which are guaranteed to stay balanced, which means the height is guaranteed to stay logarithmic, which means all of the operations search trees support that we know and love will also run in time logarithmic in the number of keys that they're storing.

So, let's just quickly recap. What is the basic search tree property? It should be the case that at every single node of your search tree, if you go to the left, you'll only see keys that are smaller than where you started, and if you go to the right, you'll only see keys that are bigger than where you started. And a really important observation is that, given a set of keys, there are going to be lots and lots of legitimate, valid binary search trees with those keys. So, we've been having these running examples with the keys one, two, three, four, five. On the one hand, you can have a nice and balanced search tree that has height only two with the keys one through five. On the other hand, you can also have these crazy chains, basically devolving into linked lists, where the height, for n elements, could be as high as n - 1. So, in general, you could have an exponential difference in the height. It can be as small, in the best case, as logarithmic, and as big, in the worst case, as linear.

So, this obviously motivates search trees that have the additional property that you never have to worry about their height. You know they're going to be well balanced. You know they're going to have logarithmic height. You're never worried about them having this really lousy linear height. Remember why it's so important to have a small height? It's because the running time of all of the operations of search trees depends on the height. Whether you want to do searches, insertions, find predecessors, or whatever, the height is what governs the running time of all of those operations. So, the high-level idea behind balanced search trees is really exactly what you'd think. The height can't be any better than logarithmic in the number of things you're storing; that's because the trees are binary, so the number of nodes can only double at each level, so you need a logarithmic number of levels to accommodate everything that you are storing. But if it's got to be at least logarithmic, let's make sure it stays logarithmic all the time, even as we do insertions and deletions. If we can do that, then we get a very rich collection of supported operations, all running in logarithmic time. As usual, n denotes the number of keys being stored in the tree.

There are many, many different balanced search trees, and most of them are not super different from each other. I'm going to talk about one of the more popular ones, which are called red-black trees. These were invented back in the '70s. They were not the first balanced binary search tree data structure; that honor belongs to AVL trees, which again are not very different from red-black trees, though the invariants are slightly different. Another thing you might want to look up and read about is a very cool data structure called splay trees, due to Sleator and Tarjan. Unlike red-black trees and AVL trees, which are only modified on insertions and deletions, which, if you think about it, is sort of what you'd expect, splay trees modify themselves even when you're doing lookups, even when you're doing searches. So, they're sometimes called self-adjusting trees for that reason. And they're super simple, but they still have kind of amazing guarantees. And then finally, going beyond just the binary tree paradigm, many of you might want to look up examples of B trees or also B+ trees. These are very relevant for implementing databases. Here, the idea is that in a given node you're going to have not just one key but many keys, and from a node you have multiple branches that you can take, depending on where what you're searching for falls with respect to the multiple keys at that node. The motivation, in a database context, for going beyond the binary paradigm is to have a better match with the memory hierarchy. So, that's also very important, although a little bit out of scope here. That said, of what we discuss about red-black trees, much of the intuition will translate to all of these other balanced tree data structures, if you ever find yourself in a position where you need to learn more about them. So, red-black trees are just the same as binary search trees, except they also always maintain a number of additional invariants. And so, what I'm going to focus on in this video is, first of all, what the invariants are, and then how the invariants guarantee that the height will be logarithmic.

Time permitting, at some point there will be optional videos more about the guts, more about the implementation of red-black trees, namely how you maintain these invariants under insertions and deletions. That's quite a bit more complicated, so it's appropriate for optional material. But understanding what the invariants are and what role they play in controlling the height is very accessible, and it's something I think every programmer should know.

So, I'm going to write down four invariants, and really, the bite comes from the second two, okay, from the third and the fourth invariants. The first two invariants are really just cosmetic. So, for the first one, we're going to store one additional bit of information at each node, beyond just the key, and we'll interpret this bit as indicating whether it's a red or a black node. You might be wondering, you know, why red-black? Well, I asked my colleague, Leo Guibas, about that a few years ago. And he told me that when he and Professor Sedgewick were writing up this article, the journals had just gotten access to a certain kind of new printing technology that allowed very limited color in the printed copies of the journals. And so, they were eager to use it, and so they named the data structure red-black, so they could have these nice red and black pictures in the journal article. Unfortunately, there was then some snafu, and at the end of the day, that technology wasn't actually available, so it wasn't actually printed the way they were envisioning, but the name has stuck. So, that's the rather idiosyncratic reason why these data structures got the name that they did, red-black trees. So, secondly, we're going to maintain the invariant that the root of the search tree is always black; it can never be red.

Okay. So, with the superficial pair of invariants out of the way, let's go to the two main ones. So, first of all, we're never going to allow two reds in a row. By which I mean, if you have a red node in the search tree, then its children must be black. If you think about it for a second, you realize this also implies that if a node is red and it has a parent, then that parent has to be a black node. So, in that sense, there are no two red nodes in a row anywhere in the tree. And the final invariant, which is also rather severe, is that every path you might take from the root to a null pointer passes through exactly the same number of black nodes. So, to be clear on what I mean by a root-null path, what you should think about is an unsuccessful search. What happens in an unsuccessful search? You start at the root; depending on whether you need to go smaller or bigger, you go left or right, respectively. You keep going left or right as appropriate until eventually you hit a null pointer. So, I want you to think about the process by which you start at the root and then, eventually, fall off the end of the tree. In doing so, you traverse some number of nodes. Some of those nodes will be black, some of those nodes will be red. And I want you to keep track of the number of black nodes, and the constraint that a red-black tree, by definition, must satisfy is that no matter what path you take through the tree, starting from the root and terminating at a null pointer, the number of black nodes traversed has to be exactly the same. It cannot depend on the path; it has to be exactly the same on every single root-null path.

Let's move on to some examples.
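Before the examples, the four invariants can be made concrete as a small executable checker. This is a minimal sketch of my own, not code from the lecture; the `Node` class and the `is_red_black` helper are names I've made up for illustration:

```python
# Sketch of a checker for the four red-black invariants (hypothetical
# names, for illustration only).

class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color = key, color   # invariant 1: color is "red" or "black"
        self.left, self.right = left, right

def is_red_black(root):
    def walk(node, lo, hi):
        # A null pointer contributes zero black nodes; return black-height.
        if node is None:
            return 0
        # Basic search tree property: smaller keys left, bigger keys right.
        if not (lo < node.key < hi):
            raise ValueError("not a search tree")
        # Invariant 3: no two reds in a row.
        if node.color == "red":
            for child in (node.left, node.right):
                if child is not None and child.color == "red":
                    raise ValueError("two reds in a row")
        lb = walk(node.left, lo, node.key)
        rb = walk(node.right, node.key, hi)
        # Invariant 4: every root-null path sees the same number of blacks.
        if lb != rb:
            raise ValueError("unequal black counts")
        return lb + (1 if node.color == "black" else 0)

    # Invariant 2: the root is black.
    if root is not None and root.color != "black":
        return False
    try:
        walk(root, float("-inf"), float("inf"))
        return True
    except ValueError:
        return False
```

For instance, the perfectly balanced tree on {3, 5, 7} with every node black passes the checker, while a chain colored black, red, black does not.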

So, here's a claim, and this is meant to kind of whet your appetite for the idea that red-black trees must be pretty balanced; they have to have height basically logarithmic. So, remember, what's the most unbalanced search tree? Well, it's these chains. So, the claim is, even a chain with three nodes cannot be a red-black tree. So, what's the proof? Well, consider such a search tree, say with the key values one, two, and three. The question that we're asking is: is there a way to color these three nodes red and black so that all four of the invariants are satisfied? So, we need to color each red or black. Remember, invariant two says the root, the one, has to be black. So, we have four possibilities for how to color two and three. But really, because of the third invariant, we only have three possibilities. We can't color two and three both red, because then we'd have two reds in a row. So, we can either make two red and three black, two black and three red, or both two and three black. And all of the cases are the same. Just to give one example, suppose that we colored the node two red, and one and three black. The claim is that invariant four has been broken, and invariant four is going to be broken no matter how we try to color two and three red and black. What does invariant four say? It says that on any unsuccessful search, you pass through the same number of black nodes. So, one unsuccessful search would be a search for zero. If you search for zero, you go to the root and immediately go left to hit a null pointer. So, you see exactly one black node, namely the one. On the other hand, suppose you searched for four. Then you'd start at the root, you'd go right to two, you'd go right to three, you'd go right again, and only then would you get a null pointer. And on that unsuccessful search, you'd encounter two black nodes, both the one and the three. So, it's a violation of the fourth invariant; therefore, this would not be a red-black tree. I'll leave it for you to check that no matter how you try to color two and three red or black, you're going to break one of the invariants. If they're both red, you'd break the third invariant; if at most one is red, you'd break the fourth invariant. So, that's a non-example of a red-black tree.
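The case check left as an exercise above can also be confirmed by brute force: enumerate all eight colorings of the chain 1 → 2 → 3 and verify that none satisfies the invariants. This is my own sketch, not code from the lecture, and `chain_coloring_valid` is a made-up helper name:

```python
# Brute-force confirmation that no red/black coloring of the chain
# 1 -> 2 -> 3 (each node's right child is the next key) is a red-black tree.

from itertools import product

def chain_coloring_valid(colors):
    """colors = (c1, c2, c3) for nodes 1, 2, 3 of the chain."""
    c1, c2, c3 = colors
    # Invariant 2: the root must be black.
    if c1 != "black":
        return False
    # Invariant 3: no two reds in a row along the chain.
    if (c1 == "red" and c2 == "red") or (c2 == "red" and c3 == "red"):
        return False
    # Invariant 4: every root-null path sees the same number of black nodes.
    # The left-null pointers off nodes 1, 2, 3 (and the final right-null
    # pointer) see the chain prefixes of lengths 1, 2, and 3 respectively.
    blacks = [c == "black" for c in colors]
    path_counts = {sum(blacks[:i]) for i in (1, 2, 3)}
    return len(path_counts) == 1

# No coloring of the three-node chain satisfies all four invariants.
assert not any(chain_coloring_valid(c)
               for c in product(("red", "black"), repeat=3))
```

The assertion at the bottom is exactly the claim: the root must be black, so the one-node prefix already has one black node; to keep invariant four, two and three would both have to be red, which breaks invariant three.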

So, let's look at an example of a red-black tree: a search tree where you can actually color the nodes red or black so that all four invariants are maintained. One search tree which is very easy to make red-black is a perfectly balanced one. So, for example, let's consider the three-node search tree with the keys three, five, and seven, where the five is the root, so it has one child on each side, the three and the seven. Can this be made a red-black tree? Remember what that question really means: it's asking, can we color these three nodes some combination of red and black so that all four of the invariants are satisfied? If you think about it a little bit, you realize, yeah, you can definitely color these nodes red or black so that all four invariants are satisfied. In particular, suppose we color all three of the nodes black. We've satisfied invariant number one: we've colored all the nodes. We've satisfied invariant number two: in particular, the root is black. We've satisfied invariant number three: there are no reds at all, so there are certainly no two reds in a row. And, if you think about it, we've satisfied invariant four, because this tree is perfectly balanced. No matter what you unsuccessfully search for, you're going to encounter two black nodes. If you search for, say, one, you're going to encounter three and five. If you search for, say, six, you're going to encounter five and seven. So, all root-null paths have exactly two black nodes, and invariant number four is also satisfied.

So, that's great. But, of course, the whole point of having a binary search tree data structure is that you want it to be dynamic; you want it to accommodate insertions and deletions. Every time you have an insertion into a red-black tree, you get a new node, and you have to color it something. And now, all of a sudden, you've got to worry about breaking one of these four invariants. So, let me just show you some easy cases where you can accommodate insertions without too much work. Time permitting, we will include some optional videos on the notion of rotations, which do more fundamental restructuring of search trees so that they can maintain the four invariants and stay nearly perfectly balanced. So, if we have this red-black tree where everything's black, and we insert, say, a six, that's going to get inserted down here, as the left child of the seven. Now, if we try to color it black, it's no longer going to be a red-black tree. And that's because, if we do an unsuccessful search now for, say, 5.5, we're going to encounter three black nodes, whereas if we do an unsuccessful search for one, we only encounter two black nodes. So, that's not going to work.

But the way we can fix it is, instead of coloring the six black, we color it red. And now, this six is basically invisible to invariant number four; it doesn't show up in any root-null path. So, because you had two black nodes on all root-null paths before the six was there, that's still true now that you have this red six. So, all four invariants are satisfied once you insert the six and color it red. If we then insert, say, an eight, we can pull exactly the same trick: we color the eight red. Again, it doesn't participate in invariant four at all, so we haven't broken it. Moreover, we still don't have two reds in a row, so we haven't broken invariant number three either. So, this is yet another red-black tree. In fact, this is not the unique way to color the nodes of this search tree so that it satisfies all four of the invariants. If we, instead, recolor six and eight black, but at the same time recolor the node seven red, we're also golden. Clearly, the first three invariants are all satisfied. But also, in pushing the red upward, consolidating the red at six and eight and putting it at seven instead, we haven't changed the number of black nodes on any given path. Any path that previously went through six went through seven, and anything that went through eight went through seven, so there are exactly the same number of black nodes on each such path as there were before. So, all paths still have an equal number of black nodes, and invariant four remains satisfied.

As I said, I've shown you here only simple examples, where you don't have to do much work on an insertion to retain the red-black properties. In general, if you keep inserting more and more stuff, and certainly if you do deletions, you have to work much harder to maintain those four invariants. Time permitting, we'll cover just a taste of it in some optional videos.
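The easy cases above can be sketched in code: insert the new key as a red leaf, which leaves invariant four untouched; if its parent is black, invariant three holds too and we're done. This is a toy illustration of my own, not the lecture's code, and the hard case (a red parent, where real implementations recolor and rotate) is deliberately stubbed out:

```python
# Toy sketch of the easy red-black insertion case: new nodes enter as
# red leaves; a black parent means no invariant is broken.

class Node:
    def __init__(self, key, color):
        self.key, self.color = key, color
        self.left = self.right = None

def insert(root, key):
    if root is None:
        return Node(key, "black")     # an empty tree: the root must be black
    parent, node = None, root
    while node is not None:           # ordinary BST descent to a null pointer
        parent, node = node, (node.left if key < node.key else node.right)
    child = Node(key, "red")          # red leaf: invisible to invariant 4
    if key < parent.key:
        parent.left = child
    else:
        parent.right = child
    if parent.color == "red":
        # Two reds in a row: a full implementation would recolor/rotate here.
        raise NotImplementedError("hard case: red parent")
    return root

# The running example: all-black {3, 5, 7}, then insert 6 and 8 as red.
root = Node(5, "black")
root.left, root.right = Node(3, "black"), Node(7, "black")
insert(root, 6)
insert(root, 8)
```

After the two inserts, six and eight sit under the black seven as red leaves, exactly the configuration from the slides; inserting, say, 5.5 next would land under the red six and hit the stubbed-out hard case.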

So, what's the point of these four seemingly arbitrary invariants of a red-black tree? Well, the whole point is that if you satisfy these four invariants in your search tree, then your height is going to be small. And because your height's going to be small, all your operations are going to be fast. So, let me give you a proof that if a search tree satisfies the four invariants, then it has super small height: in fact, no more than double the absolute minimum we could conceivably have, about two times log base two of n. So, the formal claim is that every red-black tree with n nodes has height O(log n), and more precisely, at most two times log base two of (n + 1).

So, here's the proof. And what's nice about this proof is that the roles played by invariants three and four are very obvious. Essentially, what the invariants guarantee is that a red-black tree has to look like a perfectly balanced tree, with at most a sort of factor-two inflation. So, let's see exactly what I mean. Let's begin with an observation, and this has nothing to do with red-black trees. Forget about the colors for a moment, and just think about the structure of binary trees. And let's suppose we have a lower bound on how long root-null paths are in the tree. So, for some parameter k, and go ahead and think of k as, like, ten if you want, suppose we have a tree where, if you start from the root, no matter how you navigate left and right child pointers until you terminate in a null pointer, you have no choice but to see at least k nodes along the way. If that hypothesis is satisfied, then if you think about it, the top of this tree has to be totally filled in. So, the top of this tree has to include a perfectly balanced binary tree of depth k - 1. So, let me draw a picture here of the case of k = 3. If, no matter how you go from the root to a null pointer, you have to see at least three nodes along the way, that means the top three levels of this tree have to be full. So, you have to have the root, it has to have both of its children, and it has to have all four of its grandchildren. The proof of this observation is by contradiction: if, in fact, you were missing some nodes in any of these top k levels, well, that would give you a way of hitting a null pointer while seeing fewer than k nodes. So, what's the point? The point is that this gives us a lower bound on the population of a search tree as a function of the lengths of its root-null paths. So, the size n of the tree must be at least the number of nodes in a perfectly balanced tree of depth k - 1, which is 2^k - 1. For example, when k = 3, that's 2^3 - 1, or 7. That's just a basic fact about binary trees, nothing about red-black trees. So, let's now combine that with the red-black tree invariants to see why red-black trees have to have small height.

So again, to recap where we got to on the previous slide: the size n, the number of nodes in the tree, is at least 2^k - 1, where k is the fewest number of nodes you will ever see on a root-null path. So, let's rewrite this a little bit, and instead of having a lower bound on n in terms of k, let's have an upper bound on k in terms of n. So, the minimum length of a root-null path is bounded above by log base two of (n + 1). This is just adding one to both sides and taking the logarithm base two.
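That rearrangement is easy to sanity-check numerically; this is my own quick sketch (the `min_path_bound` name is made up), not part of the lecture:

```python
# Numerical sanity check: n >= 2^k - 1 is equivalent to k <= log2(n + 1).

import math

def min_path_bound(n):
    """Upper bound on the minimum root-null path length, in nodes."""
    return math.log2(n + 1)

for k in range(1, 20):
    n = 2 ** k - 1                 # nodes in a full binary tree of depth k - 1
    assert min_path_bound(n) == k  # equality exactly at the full tree
    # Any tree with at least that many nodes only loosens the bound.
    assert all(k <= min_path_bound(m) for m in range(n, n + 4))
```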

So, what does this buy us? Well, now let's start thinking about red-black trees. So, for a red-black tree with n nodes, what does this say? It says that the number of nodes, forget about red or black, just the number of nodes on some root-null path, has to be at most log base two of (n + 1). At most, all of those are black; maybe some of them are red, but in the extreme case, all of them are black. So, we can write: in a red-black tree with n nodes, there is a root-null path with at most log base two of (n + 1) black nodes. This is an even weaker statement than what we just proved; we proved that some path must have at most log base two of (n + 1) total nodes, so certainly that path has at most log base two of (n + 1) black nodes. Now, let's apply the two knockout punches of our two invariants. Fundamentally, what is the fourth invariant telling us? It's telling us that if we look at a path in our red-black tree, where we go from the root, thinking of it as, say, an unsuccessful search, down to a null pointer, then if we think of the red nodes as invisible, if we don't count them in our tally, we're only going to see basically a logarithmic number of nodes. But when we care about the height of the red-black tree, of course, we care about all of the nodes, the red nodes and the black nodes. So, so far, we know that if we only count black nodes, then we're good: we only have log base two of (n + 1) nodes that we need to count.

So, here's where the third invariant comes in. It says, well, actually, black nodes are a majority of the nodes in the tree, in a strong sense: there are no two reds in a row on any path. So, if we know the number of black nodes is small, then, because you can't have two reds in a row, the total number of nodes on the path is at most twice as large. In the worst case, you have a black root, then red, then black, then red, then black, and so on. In that worst case, the number of red nodes is equal to the number of black nodes, which doubles the length of the path once you start counting the red nodes as well. And this is exactly what it means for a tree to have logarithmic depth. So, this, in fact, proves the claim: if the search tree satisfies invariants one through four, in particular, if there are no two reds in a row and all root-null paths have an equal number of black nodes, then, knowing nothing else about this search tree, it's got to be almost balanced; it's perfectly balanced up to a factor of two. And again, the point is that operations on the search tree are going to run in logarithmic time, because the height is what governs the running time of those operations.
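The claim can also be checked exhaustively on small trees: enumerate every BST shape and every coloring, keep only those satisfying the four invariants, and confirm the height (counted in nodes on the longest root-null path) never exceeds 2 · log2(n + 1). This is my own sketch, not the lecture's code, and all the function names are made up:

```python
# Exhaustive check on small n that every valid red-black coloring has
# height at most 2 * log2(n + 1). Trees are (key, left, right) tuples;
# None is the empty tree.

from itertools import product
import math

def shapes(keys):
    """All BST shapes over the sorted tuple `keys`."""
    if not keys:
        yield None
        return
    for i in range(len(keys)):
        for l in shapes(keys[:i]):
            for r in shapes(keys[i + 1:]):
                yield (keys[i], l, r)

def black_height(t, color):
    """Black nodes per root-null path, or None if invariant 4 breaks."""
    if t is None:
        return 0
    key, l, r = t
    lb, rb = black_height(l, color), black_height(r, color)
    if lb is None or rb is None or lb != rb:
        return None
    return lb + (1 if color[key] == "black" else 0)

def no_red_red(t, color, parent_red=False):
    """True iff invariant 3 holds: no red node has a red child."""
    if t is None:
        return True
    key, l, r = t
    red = color[key] == "red"
    if red and parent_red:
        return False
    return no_red_red(l, color, red) and no_red_red(r, color, red)

def height(t):
    """Nodes on the longest root-null path."""
    return 0 if t is None else 1 + max(height(t[1]), height(t[2]))

for n in range(1, 6):
    keys = tuple(range(1, n + 1))
    for t in shapes(keys):
        for cs in product(("red", "black"), repeat=n):
            color = dict(zip(keys, cs))
            if color[t[0]] != "black":          # invariant 2
                continue
            if not no_red_red(t, color):        # invariant 3
                continue
            if black_height(t, color) is None:  # invariant 4
                continue
            assert height(t) <= 2 * math.log2(n + 1)
```

For n = 3 this filter keeps the balanced shape (for example, all black) and rejects every coloring of the chains, matching the earlier non-example.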

Now, in some sense, I've only told you the easy part, which is that if it just so happens that your search tree satisfies these four invariants, then you're good: the height is guaranteed to be small, so the operations are guaranteed to be fast. Clearly, that's exactly what you want from this data structure. But for the poor soul who has to actually implement this data structure, the hard work is maintaining these invariants even as the data structure changes. Remember, the point here is to be dynamic, to accommodate insertions and deletions. And insertions and deletions can disrupt these four invariants, and then one has to actually modify the tree to make sure they're satisfied again, so that the tree stays balanced and has low height, even under arbitrary sequences of insertions and deletions. We're not going to cover that in this video. It can be done without significantly slowing down any of the operations; it's pretty tricky and takes some nice ideas. There are a couple of well-known algorithms textbooks that cover those details, or if you look at open-source implementations of balanced search trees, you can look at code that implements them. But because it can be done in a practical way, and because red-black trees support such a rich array of operations, you will find them used in a number of practical applications. That's why balanced search trees should be part of your programmer tool box.
