
You can sort them, and then just scan the sorted list.

So either of these solutions would work. And of course, this is a very simple problem, one that you might encounter in a very first algorithms course. A standard solution is to use a hash table; you could also use sorting. A hash table is just the same thing as a symbol table: as you encounter elements, you put them into the hash table, and if you encounter a duplicate, you realize it immediately.
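As a sketch of this standard approach (my own illustration, not from the lecture), here is the hash-table idea in Python, using a built-in set as the symbol table:

```python
def find_duplicate(stream):
    """Return the first duplicated element in `stream`, or None if none exists.

    A set plays the role of the hash table: as we encounter elements we
    insert them, and a repeat is detected the moment we see it again."""
    seen = set()
    for x in stream:
        if x in seen:
            return x
        seen.add(x)
    return None
```

This works in one pass, but it needs memory proportional to the number of distinct elements, which is exactly what becomes a problem at billion-element scale.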

Okay, so this is a very simple problem. What makes this problem very hard is: what happens if the list is very big? What if you had to do this not on 1,000 elements, but on a billion elements? Or a trillion elements? Then all of a sudden, all of these simple approaches, like using a hash table or sorting, go out of the window.

Number one, you need a whole lot of memory to actually keep a hash table. Modern computers are perhaps capable of holding hash tables even of this size, but at this scale it becomes very costly. And sorting such a huge list takes a very long time.

So, why would you care about a problem like this? Well, think about Google, for example. Some statistics they released earlier this year say that Google, in its entire lifetime, has seen something like 30 trillion URLs. They crawl something like twenty billion pages a day, and they serve about three billion searches a day, which means about 100 billion searches a month.

So that's the scale at which they are operating. And if you had to answer the simplest kinds of questions about what they do, like: how many queries did they get in a day? How many distinct queries did they get in a day? How many distinct queries did they get in a month, or a year? You're faced with solving what seems to be a very simple problem, but one that is very difficult just because of the immense scale at which you need to operate.

Of course, answering this question of how many distinct queries you have is the very simplest statistic that you might want to compute and keep track of as time goes by. But you might want more sophisticated statistics, like: how much did the queries change from day to day? How different are they today compared to yesterday? This month compared to the previous month? This month compared to the same month last year? And so on. All of these very simple questions just seem incredibly difficult when you're dealing with data at the sizes that Google deals with.

There are other settings where you have lots of data and you'd like to compute really simple things. A network router is one example. It sees a lot of packets whizzing by at fast speed: current home routers are capable of speeds of about one gigabit per second, and commercial routers are capable of speeds of something like ten gigabits per second. The router is a really simple computing device; all it does is receive packets and forward them along. It could perhaps do some very simple analysis on the data as it goes by. How could you use this very simple computing device to do simple monitoring? Like: can you tell me how many active flows there are in the router? Which are the flows that are contributing most of the traffic? These things are important because monitoring them is essential to various security measures, like early detection of denial-of-service attacks.

It's important to keep track of what's going on and what sort of traffic you're seeing, because changes in this might signal that something strange is happening, and you might want to take some security measures. So very simple statistics like these are a huge challenge, just because of the amount of data the router is seeing and the fact that the router itself is a very simple device.

So, in general, there's talk of big data everywhere. If you read the news or read magazines, every other day there's an article about how we're surrounded with data and we need to find ways to make sense of it. What I'm really going to talk about is: what are some of the algorithmic techniques that we can use to tame these big data problems?

Okay, alright. So that's the introduction to what the problems are. Let me talk about what models we could envision for designing algorithms. Clearly, our traditional notions of having all of the data available to us, having an algorithm with random access to the data, fitting all the data in memory: all of this doesn't really apply when we're dealing with data at such large scale. So the operative facts are, number one, the data size is very large. It far exceeds the available memory, and that's just a reality we have to face. We can't actually store all the data in memory.

And the other thing is that there's a new emphasis on efficiency. Usually when you design algorithms, polynomial-time algorithms are considered efficient and implementable. But in reality, when you're dealing with data of this scale, your algorithms must run in linear, or perhaps near-linear, time. Even something like quadratic time, at the scales we're talking about, is infeasible. So those are sort of the parameters.

So one kind of model, which we'll see later in the lecture, is the idea of taking our data and somehow reducing it to a smaller size. It turns out that in many cases we can do this: replace the original data by some kind of compact summary, such that, by discarding the original data and just looking at this compact summary, we can still answer the questions that we care about. It would be wonderful if you could do this, because you would drastically reduce the data size, and it turns out that in many cases this is actually possible.

Now, one thing you can give up on is the idea that you really want to solve a problem exactly. In most cases, like if you want to find out how many distinct elements there were among the queries, do you really care whether I give you a one percent approximate answer or the exact answer? It doesn't really matter. So, in most cases of interest, it's okay to not give the exact answer, but to give something that is fairly accurate. And it turns out that this relaxation, moving from getting the exact answer to tolerating some small error, is something that actually makes many problems tractable. We'll see some examples of this idea.

Another idea that's been very useful is a certain model called the streaming model. The streaming model is the following.

You have to design your algorithms to operate in the following way. Imagine the data as being one stream, and you make one pass over the data, or perhaps a few passes over the data. But you can remember only a small amount of information relative to the size of the data. So, you can afford to remember very little, and you are only allowed to see the data in one pass or in several passes. You can't, for example, zigzag back and forth; you don't have random access to the data. And the point is, if you can actually design an algorithm that works in this very restrictive way, you could actually run it on very large data sets very efficiently.

So it seems like we're really tying our hands and putting a blindfold on. How could we possibly do anything in this very restricted mode of operation? But it turns out that there are very interesting things you can do, even when you restrict yourself to interacting with the data in this very controlled way. Just a little bit of history of

where this notion of streaming algorithms came from.

Actually, this idea of trying to design algorithms that work in one pass over the data, remembering very little, predates the internet and even big data sets, and so on. It was really sort of a fun problem that people worked on. In fact, there's some work by Flajolet and Martin that deals exactly with this

problem of estimating distinct elements. We're actually going to see the idea they came up with, because it's still relevant today, at the data sizes we're actually faced with. Another piece of work, back in 1980, was by Munro and Paterson. They looked at the problem of how much space you need in order to compute the median.

So, you have a data set of size n, and you have only a small amount of space. You can make a few passes over the data. With k passes, how much space do you need to actually compute the median? They found a fairly tight answer to this question: it turns out that if you are allowed to make k passes, you can do it using something like n^(1/k) space. And these algorithms, as I said, were developed very early on, even before this model was really formalized. This model was really formalized by the work

of Alon, Matias, and Szegedy in 1996. This was a landmark paper because it really formalized this model. It talked about what you can and cannot do in this model, and gave examples of how to prove negative results. For this work, they won the Gödel Prize in 2005, which is a very prestigious prize in computer science. Let me also mention that some of you

might have seen or heard about MapReduce. That's another model for interacting with large data sets, proposed by Google, in fact. And it is the basis for a bunch of practical implementations of distributed algorithms; there is something called Hadoop that some of you might have heard of. All of this is relevant to the discussion of how you deal with large data sets very efficiently, but unfortunately I'm not going to have time to talk about all of these things. So, in the rest of the lecture I just want to give you a glimpse of some of the ideas, the ingredients of the algorithms used to deal with these problems.

So far I've given you the problems, and I've told you how we might interact with data sets in order to solve them. Now that we've tied our hands to one pass over this very large data, how do we actually do it? Let me talk about data reduction: the idea of taking a large data set and reducing it to a smaller data set. Actually, if we think about it, this is a concept that we know and are very familiar with in our lives.

So, if you were to ask the question: what fraction of voters today prefer Obama over Romney? These are the kinds of questions people care about, but they are very difficult to answer if you really want to know the exact fraction. Right? You'd have to ask every single voter: what do you think? Whom do you prefer? It takes a lot of time and effort to actually get the exact answer; we really do this only once every four years, for good reason.

On the other hand, this is something that people actually make predictions about every day. Every day, you look at the newspaper and you get some estimate: what's the approval rating of Obama today? What was the approval rating yesterday? And various other questions that you might think about. So, this is something that we solve every day, by sampling. We don't actually ask everybody what their opinion is. But the point is, if you pick a random set of people and ask them their opinion, the percentage that you compute on the random sample is actually a very good estimate of the actual percentage over the entire population. That is George Gallup, and all of you have heard about or read about Gallup polls, done by the Gallup organization that he set up.

Okay, so let's think about sampling. What does sampling do? The idea is: if you pick a random voter and ask them, do you prefer Obama, then the probability that this random voter says yes is exactly the fraction of voters that prefer Obama, right?


It's a basic fact. Do we agree on this?

Okay, alright. It's a very simple thing.

But this is actually very useful. Why?

Because now, if you pick several random voters and independently ask them the same question, then the estimate that you get from the random sample is actually a very good estimate of the actual fraction.

Okay? That's, that's the basis for polling.

That's the basis for random sampling, okay?

So let's take a step back and think about what this actually did. The idea is that you produce a random variable which has the correct expectation. In this case, the expectation is the fraction of voters who prefer Obama; this is the quantity that we actually want to ask about. So you design a random variable that has the correct expectation, you take many independent copies of this random variable, and you average. Because each copy has the right expectation, the expectation of the average is exactly what you want; but the point is that by taking many copies and averaging, you get tight concentration.
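As a hedged illustration (my own example, not the lecturer's), here is the sample-and-average pattern for estimating a fraction of a population:

```python
import random

def estimate_fraction(population, predicate, num_samples, rng=None):
    """Estimate the fraction of `population` satisfying `predicate` by
    averaging independent uniform samples (sampling with replacement).

    Each sample is a 0/1 random variable whose expectation is the true
    fraction, so the average concentrates around that fraction."""
    rng = rng or random.Random()
    hits = sum(1 for _ in range(num_samples)
               if predicate(rng.choice(population)))
    return hits / num_samples
```

With 10,000 samples, an estimate of a 30% fraction typically lands within about a percentage point of the truth, regardless of how large the population is.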

Now, in reality, if you really wanna analyze this.

You need to think about what's the variance of the random variable.

You know, what's the confidence balance that you want on your estimate, and so and

so forth. This is a topic for probability which some

of you may have seen, some of you may not. I'm not gonna go into this.

But believe me that, you know, once you design a variable [inaudible] with the

right expectations. With you know, a little more calculations,

you can sort of determine what sort of sample size you need in order to get.

Confidence bonds and so on and so force. Okay so, I'm only going to focus on.

For all the bonds we care about, we're gonna try to design random variables that

have the right expectations. Okay.

And further, the important thing is that these random variables, will be computable

very, very efficiently by making just one pass over the data.

Okay? That's the, that's the key idea that we're

gonna [inaudible]. Alright, so let's, having said that, let's

go back to our version of distinct elements.

It turns out that the solution that [inaudible] and Martin came up with is, is

tied to a simple puzzle. Okay?

So let me tell you what the puzzle is. I know some of you probably know the answer to this question; if you haven't seen it before, I encourage you to think about it. So here's the experiment. Let's say you have the interval [0, 1], and you throw k random darts inside this interval. That is, you pick k random points in the interval [0, 1]. Everyone, is the experiment clear? So the question is: what's the expected value of the minimum of these k points? Any guesses? If you had to guess, what should the minimum be?


Okay, so zero seems a little extreme, because for it to be zero, you'd want some point to land right at zero. What I'd like to know is: what is it as a function of k? And you're right that, in some sense, as k gets larger and larger, this minimum is going to get closer and closer to zero; we want to understand what the behavior is. >> 1/k is actually a very good guess. >> What about 1/(k+1)? >> 1/(k+1) is actually a better guess. So, okay, which do you guys want, 1/k or 1/(k+1)? >> The probability that a new dart lands lower than the current lowest is going to be 1/(k+1), because each of the k+1 darts is equally likely to be the lowest. And that's the same as the probability of it being lower than the lowest dart ...

>> Okay, okay. But how does that help you compute the expected value of the minimum? >> Because then you know that the probability of a new dart landing lower than the lowest dart is 1/(k+1), so the expected length of that lowest segment is 1/(k+1). >> I think you're suggesting a very slick proof which escapes me at this moment. When you stand in front of the class, your IQ usually drops like 50%. So I think you're probably right, but maybe we should talk about this after class. Okay, so the right answer is

indeed 1/(k+1), and we might have a very clever proof of that here. If you didn't know what the answer should be, the way I usually think about this problem is: it's reasonable to assume that these k points are roughly uniformly spread in the interval [0, 1]. And if that's really the case, then each of the gaps should be about 1/(k+1); there are k+1 gaps. So, in fact, the minimum, which is the first gap, would be about 1/(k+1). This is very crude; it's by no means a proof. But if we're to guess, 1/k or 1/(k+1) is a reasonable guess, and if you actually do the calculations, it turns out that it is indeed 1/(k+1). Alright, good.
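You can check the 1/(k+1) answer empirically. This simulation is my own sketch, not from the lecture:

```python
import random

def avg_min_of_darts(k, trials, rng):
    """Empirical estimate of E[min of k uniform random points in [0, 1]]."""
    return sum(min(rng.random() for _ in range(k))
               for _ in range(trials)) / trials
```

For k = 9 darts, the average minimum over many trials comes out very close to 1/10, matching the 1/(k+1) formula.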

So if you take the minimum of k random numbers in [0, 1], the expected value is 1/(k+1). Why is this actually important to the question that we started out with? The question was: how do you estimate how many distinct elements there are in your data stream?


Here's the idea: apply a random hash function that maps each element to a random point in [0, 1], and keep track of the minimum hash value you see. Duplicates hash to the same value, so the minimum behaves like the minimum of K random darts in [0, 1], where K is the number of distinct elements. Repeat with several hash functions and average, and you get a very good estimate of 1/(K+1). Invert this, and you have a very good estimate of the number of distinct elements. There's a really beautiful trick,

originally discovered by Flajolet and Martin. It's amazing how it completely bypasses the need to store any sort of hash table or do any sort of sorting, and comes up with a very good estimate of the number of distinct elements.

Okay? Alright.
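Here is a minimal one-pass sketch of this distinct-elements idea in Python. Simulating independent random hash functions via salted hashing is my own simplification, and averaging the minimums before inverting is a crude, slightly biased estimator, but it shows the shape of the algorithm:

```python
import random

def estimate_distinct(stream, num_hashes=64, seed=0):
    """Estimate the number of distinct elements in one pass.

    Each simulated hash function maps an element to a pseudo-random point
    in [0, 1); duplicates map to the same point, so only distinct elements
    matter. We track the minimum per hash function. Since E[min] = 1/(K+1)
    for K distinct elements, we average the minimums and invert."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    mins = [1.0] * num_hashes
    for x in stream:
        for i, salt in enumerate(salts):
            # Derive a reproducible pseudo-random point in [0, 1) for (salt, x).
            h = random.Random(hash((salt, x))).random()
            mins[i] = min(mins[i], h)
    avg_min = sum(mins) / num_hashes
    return 1.0 / avg_min - 1.0
```

The memory used is just `num_hashes` floats, no matter how long the stream is or how many times elements repeat.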

Next I want to talk a little bit about, and I don't have very much time, so I'm going to have to breeze through this section, a class of methods that are used to figure out, in a very large collection of documents, which documents are similar.

Again, this is actually an important problem. Number one, when a web search engine's crawler goes out, it brings in a bunch of documents, including many duplicates and near-duplicates, and you want to get rid of these. There's no point in Google returning 100 results which are almost the same; it wants to return results that are distinct, and ideally not even store these duplicate documents. And in addition to not storing them, at query time you might want to figure out that some of the results being returned to the user are almost the same, and suppress them. If you've searched on Google, you might often have seen a message like: there are some results that are very similar to the ones we have shown you, and we suppressed them; if you'd like to see them, click here. So there are various reasons why you'd like to detect near-duplicate documents.

Now, this is sort of a difficult question: how do you find and eliminate near-duplicate documents? I'm going to describe a scheme that was originally developed by Andrei Broder and his collaborators, in the context of a search engine called AltaVista, which in the old days predated Google and was the leading search engine of its time. I'd be surprised if very many of you know about AltaVista at all. So, let me very quickly describe the main idea behind this.

It was a very cute idea. First, let me ask you: how do you define similarity of documents? What does it mean for two documents to be similar? There are many ways of doing this. One way is the following. You take your document, say the Declaration of Independence, and you slide a window of three words across the document, looking at all the three-word sequences that you encounter. The collection of these sequences is your set. So now I've taken the document and produced a set, and we'll think of the set as representative of the document. Having produced sets, I can now define similarity of two sets, and I'll use that as a measure of similarity of the original documents. One reasonable measure of similarity of two sets is the following: look at how much they overlap, the size of the intersection divided by the size of the union.

So henceforth let's agree that this is the measure of similarity that we'll use.
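To make this concrete, here is a small sketch (my own, with made-up example sentences) of three-word shingling and the intersection-over-union measure:

```python
def shingles(text, k=3):
    """All k-word sequences obtained by sliding a window over the text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)
```

Two sentences that differ in a single word still share most of their shingles, so their similarity stays high, while unrelated sentences share almost none.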

If two documents are identical, the sets they produce would be identical, so the similarity would be one. And we're going to work with the hypothesis that similarity defined this way is a reasonable measure of document similarity.

So here's the question: can you produce a sketch, a very short summary of a document (a set, in this case), such that you can estimate the similarity just by looking at two sketches, instead of actually looking at the original documents? Clearly, I could look at the full text of the documents and compare them directly, but that would be too slow, and not acceptable. So, can I get rid of the original documents, replace them all by short sketches, and estimate the similarity just by looking at these short sketches? That is something you could actually hope to do at run time; if you had to go and look at the original documents at run time, that would not be a good idea. Alright.

The first thing that might come to mind is random sampling. You might say, well, how about storing a random subset of the elements? This just does not work. Say you have two identical sets with ten thousand elements each, and you take a random sample of 100 elements from set A and a random sample of 100 elements from set B. The expected intersection of these two samples is one. It gives you no information whatsoever that these two sets are identical. So random sampling, per se, doesn't work.

But here's a little variant of random sampling that actually works.

It's actually very cute, and it's related, in some sense, to what we did before. So here's the idea. Imagine you have a random hash function on the universe of elements. You apply it to every element of the set, and you just remember the minimum hash value.


So, great. The key fact is that the probability that the minimum hash value of set A equals the minimum hash value of set B is exactly the similarity of A and B: the size of the intersection over the size of the union. So we're able to design a random variable that has exactly the right expectation. To get a high-confidence estimate, you would do this not with one hash function, but with, say, 100 hash functions, and for every set you would store a sketch consisting of the minimum under each of the 100 hash functions. And now you can use these sketches to get a very good estimate of similarity. Alright.
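Here is a minimal min-hash sketch along these lines. Simulating the 100 random hash functions with salted built-in hashing is my own assumption; the similarity estimate is simply the fraction of hash functions on which the two minimums agree:

```python
import random

def minhash_sketch(s, num_hashes=100, seed=0):
    """For each simulated random hash function, store the minimum hash
    value over the set. The sketch has fixed size regardless of |s|."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_similarity(sketch_a, sketch_b):
    """Fraction of hash functions whose minimums agree. For a random hash
    function, the minimums agree exactly when the overall minimum lands in
    the intersection, so this fraction estimates intersection over union."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Once every document is replaced by its 100-number sketch, pairwise similarity can be estimated without ever touching the original texts.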

I had hoped to talk about a lot more, which I will skip. But maybe I'll give the slides to [inaudible], and he can post them. There's another scheme to estimate similarity that I actually worked on when I was at Google; I'm not going to be able to tell you about it, but you can look at the slides a little later.

Many of these techniques are really useful beyond the problems that we talked about. The idea of random projections, which we didn't quite discuss, is really useful in dealing with very large data sets; it's a very useful technique for reducing data sizes. Nearest-neighbor data structures are very useful in searching large, complex, high-dimensional data sets. And designing hash functions like the ones we saw for set similarity is actually the building block of designing very efficient data structures for these search problems. There's a little issue with randomness: we've sort of assumed that everything is completely random, that we have completely random hash functions. It turns out that in practice this is not really true. There's a way to fix this, which I'm afraid we won't be able to discuss.

And so, in conclusion, I just wanted to say that there are a lot of cool theoretical ideas which are very practical, useful for designing algorithms and working with very large data sets. I wanted to give you a glimpse of this algorithmic toolkit. It was a shorter glimpse than I had thought it would be, but there are many interesting problems here, and I hope I've motivated some of you to think about them.