Well, welcome to the next video in the Introduction to Data Exploration unit. In this video, we will focus on strings and sequences. And we will spend time to learn about what we mean by a string or sequence, what we mean by a time series, and how do we operate on strings or time series. So if you remember from our earlier videos, we had said that we can represent data in a database or data in a data correlation in different ways. We had said that a common representation is relational or object oriented data. But we said that relational/object oriented data are often known as the structured data types. But in many cases, the data that we want to explore might have less strict structures. And in those cases, we said that we might use other representations. We had enumerated several of them. We had said that data can represented in the form of vectors. We said that data can be represented as string sequences or time series. We had said that the data can be represented in forms of trees or graphs. And we also said that data can be represented in the fuzzy or probabilistic form. In this video, we will focus on strings and sequences and time series. We will define what strings and sequences are. We will define what time series are. And I will basically define the common operations on the sequences or common operations on time series. So let's start from a string or sequence. The definition actually is pretty straightforward. A string or sequence is nothing but a, Sequence, consecutive sequence of symbols from a given alphabet. So there are a couple of terms here. There's an alphabet of symbols. In this example, the alphabet is a, b, c. And we have a sequence that has a starting point, that has an ending point, and a length, Of symbols that together make the sequence. So very simple definition, a string or sequence. And in our daily life, we are very familiar with strings or sequences. Our names, for example, are sequences. What about a time series? So this is slightly different definition. We might not have heard the concept of time series. Maybe in our daily life, we don't necessarily use the term very frequently. But a time series is nothing but a sequence. It is, again, a sequence. But in this case, the sequence is sort of a specific sequence where the alphabet is numbers. So essentially, a time series is sequence of numbers, sequence of values. These numbers could be, for example, real numbers. In some other applications, they can be integers, or positive values, or negatives values, and so on. But essentially, a time series is a specific type of sequence where we are dealing with numbers. In this example, we are seeing a time series. It's called time series because we have a time component. We have a time series of the term big data on the web. So we see that basically, roughly a decade back, the term big data was not very frequently used. In this case, the y-axis essentially is the sort of frequency of use of the term big data. So it's counting the term, so it's a number. And we see that over time, the interest in the term big data increased. There is of course some fluctuation. However, the overall trend is increasing. So essentially, one important use of time series is to see trends. So the time series can be univariate. As an example, there's only one term that we are tracking, so it's a univariate time series. But it can also, in some applications, the time series that we are tracking may be multiple variate. That is, there are multiple variables that are changing over time. And we are tracking all of these in a synchronized way over time. For example, in this application, we have a housing unit. The housing unit has several temperature sensors distributed in different zones of the housing unit. And we are tracking the temperature for different sensors over time. So this is called a multi-variate time series because we have multiple variables we are tracking. And we are tracking these multiple variables in a synchronized way over time. So we will see that when we are dealing with multi-variate time series, we will need to do certain things in a specific way. So we'll come to that concept later. But for now, remember, sequences have a domain or have an alphabet. Time series is a sequence where the alphabet or the domain is numbers. And a multi-variate time series is a time series where we are not tracking one value. We are tracking multiple values simultaneously. So those are our basic definitions. So next, we will basically focus on some of the operations that we do commonly on sequence or time series to support data exploration. We'll start with matching and search operations on strings and sequences. So we'll basically give some sample operations that we often have to implement. We will not discuss about how we implement them yet. In later units of the course, we will come back to this. And we'll learn data structures and algorithms that we will need to use and we will need to implement to achieve these functionalities. For now, we are only defining what these operations are so that we can prepare ourselves for the later units. So one operation that is commonly implemented on strings or sequences is searching for prefixes. That is, I have a database, and the database has a number of strings. And I am asking the question, could you please find me all strings or sequences that start with the sequence T-A-B, tab. In this case, if this is correctly implemented, the database can return me table, tabular, or tablet. All of these sequences have something in common, and that is the fact that they all have prefix T-A-B. So that's called prefix search. The second operation that is commonly implemented in sequence databases is called subsequence search. For example, again, if I have a database that's storing sequences, I can ask the question, please find me all strings or sequences that contains a subsequence, A-R-K, ark. In this case, the system would return me marketing, spark or quark. Again, all of these have the common property of containing the subsequence ark. A related subsequence search operation is called subsequence match. So in this case, I'm not searching for a given subsequence, but I'm analyzing the data for certain common subsequences. For example, my query could be, can you please find the longest matching subsequence given the string, plasticity and scholastic. These are two different strings and I'm telling the system, please find the longest matching subsequence between them. In that case, the system can return N-A-S-T-I-C as the longest common, longest matching subsequence between these two strings. Again, an operation that is commonly done when we are exploring sequences or strings. A related operation is involves, again, looking for subsequences, but in this case, we are looking for frequent not only longest but frequent subsequences. For example, the question could be something like this, find the most frequently repeating three character or three symbol subsequence, in a given string. In this case, if you analyze the given string, you will see that B-A-A, Is the most frequently repeating three character subsequence in this string. Now, we will give another name to these type of frequently repeating subsequences. We will call them motifs. The subsequent, baa, essentially is a motif that is occuring in this sequence. Searching for motifs, will be an important operation and later on we will discuss how to achieve this goal. Now, there are maybe other motifs in the same sequence. For example, if you look closely, you will see that the sequence bba, the subsequence bba, is also a motif of this sequence. It was also occurs three times and it's length is also three. Again, I'm not going to do that, but you might want to search for the motifs bba in this sequence, in this example. A fourth common operation often implemented in sequence exploration systems is, measuring the similarity or measuring distance between strings. So, the question is the following, I am given a sequence table, a string table, and I want to measure the distance between this sequence and three other sequences, cable, tale, and tackle. And I'm asking the system, could you please return me the most similar sequence? Is tale more similar to table? Is cable more similar to table or tackle more similar to table? So how do I define distance or similarity between sequences? If you remember, in earlier lectures, we have seen how to measure distance or similarity between vectors. We ask a similar question, but in this case, our data is not representing a vector space, but is represented as a sequence. So we need to develop distance measures that operate on sequences. So a common way to measure distances between sequences is called, edit distance. Edit distance is defined essentially as, the minimum number, minimum number, of operations that you need to implement to convert one sequence to the other sequence. Minimum number of operations that you need to implement to convert one sequence to the other. Then what are these operations? So what are the type of operations that we can have to convert one string to the others? A common set of operations are insertions, deletions, and replacement of symbols. Let's see some examples. Let's assume that I have table and cable and I'm trying to measure the distance between these two sequences or these two strings. You can see that see that I can achieve this by replacing the symbol t in the input sequence with the symbol c. One replacement is sufficient to convert table to cable. So, in that case, I would say that the distance between table and cable is one unit, only one operation is sufficient. What about table versus tale? How many operations do I need? Well, you can see that if I remove the symbol b from table, I obtain tale. So in this case, again, one operation is sufficient. One deletion operation is sufficient to convert to table to tale. So I would also say that, the distance between these two strings, is one unit. What about table versus tackle? Well, it turns out that in this case, there may be two different ways to achieve the same goal. One of them is deleting b And inserting c and k. Table, delete b, insert c and k, and you will obtain tackle. This has cost three, because I had to do three operations. One deletions and two insertions. But I could also achieve the same goal by replacing b with c. Instead of deleting anything, replacing b with c and then inserting k. In that case, I had to do only two operations, one replacement and one insertion. Since the definition says, find the minimum number of operations that you need, in that case, the distance between the two strings could be found as two. Now, of course, in different applications, replacement versus insertion versus deletion may have different impacts or different costs. So, in that case the distance may have different goals. So, for example, if we are not allowed to replace, we can see that basically, if you are not allowed to replace we can see that This cannot be done in two, and the cost of this also raise from 1 to 2. Because I have to first delete b, then insert c. So once again, almost as always, the definition of the distance is application-dependent. But in any case it did distance a general way to define this. And essentially this says, well give me a set of edit operations, whatever those operations are. I will measure the distance between those two strings by finding the minimum number of operations that I need to convert one sequence to the others. Again, we will not discuss for now how to compute the edit distance. But that's going to be something that we will discuss in future lectures. Now what about time series? Now if you remember, time series is a specific type of string where instead of dealing with any arbitrary symbols, we are dealing with numeric symbols. We are dealing with numbers. So given this, we can measure the distance between time series using the difference between numbers. So for example, a common way to define the distance between time series is basically using Euclidean distance, which we discussed in the past. Essentially what I would do is I would look at each time instance. I would measure the distance between the corresponding values. The difference between the corresponding values for each of the time instances. And then I would, actually this is wrong, I apologize for this. It should have been like this. I apologize for the error in the slide. But in any case, I would find the distances between the terms. I will take their squares, and I would sum them up and I will take the square root of it. This is a commonly known Euclidean distance. And it is a common way to measure distance between time series. Having said that, while this is a commonly used measure, it is not applicable in all applications, why? Because in some applications, it is possible that there is a shift between two time series. But in spite of the shift, we want to basically define them as being similar. For example, you can sort of think like this. Let's assume that this is one stock as a behavior, over time. And this is another stock, price of the stock which is a similar behavior. They are clearly not synchronized. One behavior is occurring this month, the other one starts a few weeks later, but it is a similar behavior. If we try to rely on synchronous measures, like Euclidean distance, we will not be able to find them as matching. Because there will be a big difference between the corresponding values. So if I use a synchronous measure, I will say these are very different from each other. However, in terms of the pattern, they're a similar pattern, even though they are not synchronized. In other similar sort of distortion that has come on the common law occurring which we still want to allow, is called stretch. So let's assume that basically I have two time series. Both of the time series have a similar pattern. Something is increasing and then decreasing after that. They start around the same time. But the increase and decrease pattern occurs with different speeds. This can occur, for example let's assume that the time series is the position of my hand in space, and say I do gestures slowly. And somebody else maybe does the gesture fast. In that case, both of the time series have the same shape overall. But one of them has a different speed than the other. Now again if I have to rely on a synchronous measure like Euclidean distance, I would see that difference between these time series is very large. However, if I'm going to use this for example condition, whether I move my hand slowly or fast, I want to record it as being a similar movement. I need to use a asynchronous or elastic distance measure for time series. Now, there are several of them. We are not going to discuss them in this lecture now. Some examples are, again, edit distance, which we discuss few minutes back. Dynamic time warping, often used to compare time series when shifts or stretches are allowed. And feature-based alignment algorithm, such as robust time series features. Once again, we will discuss these in the near future, but not in this lecture. Now what about motifs, what are motifs? If you remember a few slides back, we defined motifs as sub-sequences of strings that frequently repeat itself. Now, the same similar concept also applies in time series. We can define also in time series three completed repeating patterns. So for example in this case, I have a sequence of numbers. So it's a time series, and if you look at these time series you will see that two patterns are repeating over time in the time series. So in this example, my time series has two motifs. Motif a is repeating itself four times, motif b is repeating itself twice. So the question, essentially, that if I'm interested in finding is repeating patterns. I need to have algorithms, efficient algorithms, that can take a long time series and find these repeating patterns inside. Once again, this is going to be a concept that we will discuss in depth in future lectures. Now what about multivariated time series? Remember we said that basically not all time series are univariates. Many times we are interested in multiple variances simultaneously. For example, we might basically be sensing the temperature of a large building. Or we can track the positions of, if I'm doing for you gesture recognition, I might be tracking positions of different body parts simultaneously. In those cases I need to basically do similar things for multi-variate term series. For example, I might define motifs for multi-variate time series. And these motifs can occur along the same variant over time, or may occur at different variants along time. So I need basically algorithms that can efficiently search for such motifs over time. Once again, the implementing motif search or implementing edit distance. Or similarity measures on univariate and multivariate time series will be something that we'll be discussing in the upcoming lectures. I will stop here for the introduction to sequence and time series. Thank you for your attention.