[SOUND] Let's introduce Apache Spark here. What we are actually trying to achieve in Apache Spark is to extend the MapReduce model to better support iterative algorithms, like, for example, machine learning, graph processing, database access, SQL access, or just general custom algorithms that deal with a lot of data. You want to be able to support interactive data mining. And what do I mean by interactive data mining? I mean sort of an interactive console for the user. So that, instead of writing a program, compiling it, hoping for the best, and, if it doesn't work, modifying the program, compiling it, and running it again, you have an interactive console: say, hey, load this data set. Done. Okay, now what should I do? Let's filter it. Filter. Oh, that's not quite what I wanted. Let's filter with another set of parameters, and you can do this interactively. What you also want to do in Apache Spark is to enhance programmability. You want to be able to use a nice programming language to play with the data. And Spark, the system, is written in Scala. Scala is a nice language written on top of Java, kind of like Clojure was written on top of Java. Scala is written on top of Java, so it allows interoperability with Java, and it has a nice interpreter, a REPL, that you can interact with. So we want to be able to do iterative and interactive algorithms better, and enhance programmability. But the question is how? To be able to answer that question we need to figure out why the current frameworks aren't working as well as we really want them to. The issue is that many of the current cluster programming frameworks, [INAUDIBLE] and others, are based on this idea of acyclic data flow. So the idea is that you have data that flows from one stage of computation to another, from the second stage to the third stage. And it's acyclic, so you do not have loopbacks in your data flow, which helps the design of some of these frameworks. So data basically flows from one stable storage to another stable storage. Data starts its life on HDFS, you do a little bit of computation on it, and it ends up in HDFS. Well, this model has some benefits, of course. The runtime engine and the scheduling algorithms can decide where to run tasks based on data locality, therefore trying to optimize the runtime. It can also provide nicer failure recovery semantics, because it's easy: we know where each task started from, we know where it is supposed to end, and if there's a failure we can restart. But the acyclic data flow paradigm is not quite efficient for applications that use the same working set of data. Iterative algorithms reuse a lot of the data. So you can think of an iterative algorithm: what happens is that you have this huge input data set, and also something called an answer, right? And then in each iteration, you consider this big data set, you consider your answer from last time around, you do a little bit of computation, tweak the answer a little bit, and push it back. So for the next time around, your input data set still remains the same, and your answer has changed a little bit. So you can see that a huge part of the data doesn't change from iteration to iteration. Another example is, again, interactive data mining, right? A lot of times, you load up some data and you do some search on it. Okay, I did the search, then I want to do another search with a different set of parameters.
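To make that interactive workflow concrete, here is a minimal sketch of what such a session could look like in the Spark shell; the HDFS path and the filter terms are made-up assumptions, purely for illustration.

```scala
// Inside the Spark shell (spark-shell), `sc` is the SparkContext the shell provides.
// The HDFS path and the search terms below are hypothetical, for illustration only.
val logs = sc.textFile("hdfs:///data/web-logs.txt")   // load the data set

// First try: keep only the error lines
val errors = logs.filter(line => line.contains("ERROR"))
errors.count()

// Not quite what we wanted -- filter again with different parameters,
// interactively, without recompiling or rerunning a whole program
val timeouts = logs.filter(line => line.contains("ERROR") && line.contains("timeout"))
timeouts.count()
```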
My data hasn't changed. The input data is still the same. Why do I need to read it again from the hard drive, those very slow hard drives? So, as you can see, the acyclic data flow paradigm can be very good for its own use cases, but for some of these use cases it becomes inefficient. Now, the solution that Spark provides is a data and computation abstraction called Resilient Distributed Datasets. What do I mean by that? Resilient: this means that once you load it into memory, it remains in memory until you say, okay, I'm going to get rid of this. It's a data set, obviously, so it contains data. It's also distributed. So when you have a cluster, you, as the programmer, can think of a data set, an RDD, as one logical thing. In reality, pieces of this RDD are spread all around the cluster, okay. So in your algorithm you say RDD1 is loaded from disk, RDD2 is RDD1 filtered by this parameter, RDD3 is RDD2 sorted, RDD4 is RDD3 something else. All of these RDDs are distributed across the cluster and the framework manages them. Okay, so we introduce RDDs, but we want to retain the attractive properties of MapReduce. We want to have all the nice things that you have in a cluster programming framework: fault tolerance, scalability, and the ability to use locality as much as possible. And these are things that are considered in the design of Apache Spark. By using all of these features, Spark allows you to do a wide range of applications. So, it can do regular MapReduce. Although YARN still has its place: when you're dealing with huge, huge amounts of data that don't necessarily fit within the total amount of RAM that you have in the cluster, Hadoop is still the king. If you have a pure data flow algorithm, say log processing, you just have a huge amount of logs and you want to process them, fine. You don't need to iterate, and Hadoop is the king. But you can also do those in Spark. If your data sets are slightly smaller, and you can load them within the memory of your cluster, Spark becomes actually much faster. You can do machine learning, you can do database access. Remember that we talked previously about Apache Hive and Apache HBase. Well, in Spark, you can load data from Hive and HBase, and then you can do interactive search on the data. You can do graph processing. You can do other nice things. So, we'll get to some of those. Getting back to the programming model idea: the RDDs, the Resilient Distributed Datasets that were introduced, have a couple of interesting properties. One, they are immutable. Right. That means that once in your program you have defined one RDD and said RDD1 equals something, fine, you cannot change RDD1 again. You cannot go and say, okay, now RDD1 changes to something else. You can use other RDDs, but one RDD, once it's done its computation, is immutable. And that lets the framework handle these things much better, because it doesn't need to worry about overwriting data in RDDs; once it's done, it's done. RDDs are typically created through parallel transformations, right. So for example, if you have data that's stored in one RDD, you can say, hey, my RDD2 is the map of this function on RDD1, right? In reality, what happens is that RDD1 is stored in a distributed way across the whole cluster. The system can say, okay, I'm going to create a distributed RDD2 across the cluster, and run the computation transforming data items locally on individual machines.
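Here is a hedged sketch of the kind of RDD chain just described; the input path and the particular transformations are assumptions chosen only to illustrate the pattern.

```scala
// The HDFS path and the transformations are made-up assumptions for illustration.
val rdd1 = sc.textFile("hdfs:///data/records.txt")      // RDD1: loaded from disk
val rdd2 = rdd1.filter(line => line.nonEmpty)           // RDD2: RDD1 filtered by some parameter
val rdd3 = rdd2.map(line => line.toLowerCase)           // RDD3: a function mapped over RDD2
val rdd4 = rdd3.sortBy(line => line)                    // RDD4: RDD3 sorted

// None of these lines modifies an existing RDD. Each one defines a new,
// immutable RDD, and the framework computes its partitions in parallel on
// the machines that hold the corresponding pieces of the parent RDD.
```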
So I don't need to bring all of the data into one machine and then send it back out to the machines in the cluster. You can do all of this in parallel, at the same time. And the last interesting thing about RDDs is that the data is stored in memory, so it can be cached for efficient reuse. So once you have an RDD you can say, I want this RDD to be cached. The system doesn't get rid of it, and then you can say RDD3 is also using RDD1. When you say RDD3 equals some function on RDD1, RDD1 is already in memory and it's cached. So you can do a number of operations on RDDs. You can, for example, do counts. You can do joins. You can reduce items like in MapReduce, save, collect, filter. You know, a really nice set of operations. Okay so, in the next video I will show you one example of how to use RDDs in Spark. [MUSIC]
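Before that example, here is a minimal sketch of the caching and a few of the operations mentioned above, again in the Spark shell; the file path and the word-length predicate are made-up assumptions.

```scala
// The path and the word-length predicate are hypothetical, for illustration only.
val words = sc.textFile("hdfs:///data/articles.txt")
              .flatMap(line => line.split("\\s+"))
words.cache()                              // keep this RDD in memory after it is first computed

val total = words.count()                  // action: triggers the computation, populates the cache
val longWords = words.filter(w => w.length > 8)   // transformation on the cached RDD
val sample = longWords.collect()           // action: brings the results back to the driver

// Because `words` is cached, the second pass over it (the filter) reuses the
// in-memory copy instead of re-reading the file from HDFS.
```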