Hi, in this video we will learn the main ideas that allow to build a suffix array of a string efficiently and the general construction strategy. In the next few videos we'll work out the details of these ideas and how to implement the whole algorithm. The first idea is that instead of sorting the suffixes of a string we are going to sort its cyclic shifts. What are the cyclic shifts of a string? We can draw the string around the circle like this string, ababaa$ is currently drawn around the circle, starting from letter A and ending in character dollar. So all the cyclic shifts of the string a are all the ways that we can start with some of the characters on the circle and go through the whole circle until the last character. The first cyclic shift of the string is the string itself, ababaa$. The next cyclic shift is what we get if we start from the next character b, we get babaa$a, we always go through the whole circle. The next cyclic shift is abaa$ab and so on and so forth. There are as many different cyclic shifts of a string as there are characters in the string. Because we can start from any character and it uniquely defines the cyclic shift. What happens if we sort the cyclic shifts that we've just determined for our example string? We sort them lexicographically and we get the following order. We start with $ababaa, because dollar is the smallest character so the cyclic shift which starts with dollar will always be the first one. Then go to the other cyclic shifts in the lexicographic order. Let's look what happens if we cut everything to the right from the first occurrence of the dollar sign in the string. We cut the red parts out and what we get? Well, it is the least of all the suffixes of the initial string in the lexicographic order and it turns out that this is always the case. More specifically, the Lemma states that after adding to the end of the string S a special character $ which is smaller than all the other characters, sorting cyclic shifts of string S. Sorting suffixes of string S is equivalent in terms that the order of positions in which the sorted suffixes start in the string, which is the suffix array. Is the same as the order in which the sorted cyclic shifts of this string will appear if we sort them lexicographically. How can we use that? How do we sort cyclic shifts and why is it easier? This is somewhat easier than sorting suffixes because we can sort partial cyclic shifts. Partial cyclic shifts are sub-strings of the cyclic string S. So basically if we draw a string S around the circle and start with some character and go part of the circle. Not necessarily the whole circle but maybe only apart, then any such sub-string of a cyclic string S is called a partial cyclic shift. Let's see some examples, we draw our string around the circle and let's consider all the cyclic shifts of length 4. The first of them is the prefix of a string of length 4 abab. Second one is a sub-string of our string starting in position 1, which is baba. The third one is abab also a sub-string of the initial string. Fourth is baa$, it is also a sub-string of the initial string but now happens something interesting. The next cyclic shift, aa$a is no longer a sub-string of the initial string. It is a sub-string of the cyclic string but it is not a sub-string of ababa$, which is our initial string. However, this is still considered the partial cyclic shift, and there are two more partial cyclic shifts of length 4. For any length we can consider all the cyclic shifts of this length, in particular for the length 1 all the cyclic shifts of our string are just all single characters separately. The general strategy is we start with sorting just single characters of S and those are also by definition the partial cyclic shifts of the string S of length 1. To sort all the single characters of S is a relatively simple problem because there are only a few different characters. The number of different characters is the size of the alphabet. It is bonded and so sorting them can be easier than sorting the whole suffixes or whole cyclic shifts of the string S. Then as soon as we have cyclic shifts of some length sorted, it turns out that we can sort now cyclic shifts of twice the length. Partial cyclic shifts of length 2L can be sorted if we know the order of the partial cyclic shifts of length L. We will be doubling our partial cyclic shifts for several iterations from length 1, then two, then four and so on. Until the length of the cyclic shifts that we've started exceeds the length of the initial string. At that point the order of this partial cyclic shifts will be the same as the order of this sorted suffixes of the initial string. Let's look at this process on our example of string ababaa$. First we sort the single characters of the string as, the first one will be $ because it is smaller than all the others. Then go all the letters a in position 0, 2, 4, and 5, and then go all the letters b in positions one and three so the order is 6, 0, 2, 4, 5, 1, 3. Of course, it is not important what is the order among the same characters. But we will sort them in some order anyway so let's assume this is the order of single characters. Next we move to the partial cyclic shifts of length 2. Now we sort them and now it all starts with $a because dollar is the smallest again, and then a$aa and so on, and now the new order is 6,5,4,0,2,1,3. And again, sum of the partial cyclic shifts are equal and the relative order between them doesn't matter too much, but we will anyway sort them in some order. Let us assume that it is this order. Now we move to the cyclic shifts of length 4 and the origin becomes 6,5,4,2,0,3,1. Next step, we move to the cyclic shifts of length 8, and the order is 6,5,4,2,0,3,1. Now the order already is the same order as we had for the suffix array of this string; this is what we wanted to get. In four iterations, we've sorted the cyclic shifts of length 8, which is longer than the initial string and we've got the suffix array. This is the general strategy that we're going to use. After that we will just cut out the red parts of the cyclic shifts and we will get the order of the suffixes. In terms of code, this is just the general template and you will understand the full depth of this code by the end of this lecture. But just to get it all structured in your mind what is the general construction strategy. First thing we do is we sort the characters. Then the main part of the algorithm we started with length 1, four of partial cyclic shifts, and then each iteration we sort the doubled partial cyclic shifts of the double length and we multiply the current length by 2. This is the main loop but we also need some augmented auxiliary data structures. This is the data structure class which will basically help us to understand quickly in constant time which of the partial cyclic shifts are equal to each other. We will build this class data structure initially for characters to know which characters are equal. That'll be easy because those are just characters and we just can compare them. Then on each iteration we will also update this information about which partial cyclic shifts are equal and that'll give us the whole algorithm. In the next few videos we will work out the details of sort characters, compute char classes, and all the other procedures and come up with this algorithm and your full understanding of how it works.