Hello, in this module we will be talking about Apache Spark. First, I would like to introduce myself. I'm Andrea Zonca, and I have a background in astrophysics: I've been studying the beginning of the universe, analyzing large amounts of data from satellite measurements. Now I am on staff at the San Diego Supercomputer Center, where I help scientists run their data analysis pipelines on supercomputers all over the US. I am also one of the instructors of Software Carpentry, where we teach two-day workshops for early-career scientists to help them become more comfortable with computing and with reproducible science.

First of all, I would like to start by showing you how some of the shortcomings of MapReduce led to the development of a new system, Apache Spark.

The first shortcoming is that every time you implement a workflow in MapReduce, you have to force your data analysis into a map phase and a reduce phase, and this cannot accommodate every data analysis workflow. For example, you might want to do a join operation between different datasets, you might want to filter or sample your data, or you might have a more complicated workflow with several steps: maybe a map and a reduce phase, but then another map phase after that. This cannot be accommodated by MapReduce.

Another important bottleneck, and this is actually very critical for performance, is that MapReduce relies heavily on reading data from disk. This hurts especially if you have iterative algorithms that require cycling several times through the data, where you have to run once, get the results of some calculation, then run again using those results. Pipelines like this are very common in machine learning. You end up reading data from disk many times, and this makes your analysis pipeline run very slowly. This was definitely a serious bottleneck in my own previous work.

The next shortcoming is that MapReduce has only one native interface, which is Java. As we saw in the previous module, it is possible to run Python code, but that requires going through the Streaming module, which makes the implementation more complex and not very efficient, especially when you are working not with text data but with floating-point numbers. It would be really nice to have simpler languages available, not only Java, and it would also be nice to have an interactive shell, which is commonly used nowadays by data scientists.

And what might be the solution for this? The solution is to write a new framework from scratch. The point is that this framework doesn't need to be a complete replacement of the Hadoop stack, just a replacement of Hadoop MapReduce, so that, being in the same ecosystem, it can build upon all the available tools. It needs to be a tool that runs in the Java Virtual Machine, so it is compatible with all the rest of the tools around Hadoop. It was developed at UC Berkeley by a computer science lab, and it is now managed by the Apache Foundation, which is the same foundation behind Hadoop.

So how does Apache Spark provide solutions to the problems we saw before? What about, for example, accommodating other workflows? Spark provides a very rich programming interface that gives you more than 20 highly efficient distributed operations. So it's a lot easier for data scientists to port their data analysis pipeline to Spark, and once they build their workflow they can chain any number of steps.
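To give a rough feel for that richer interface, here is a minimal PySpark sketch of a multi-step workflow with a filter, a join, an aggregation, and a sample. The file paths, field layout, and the threshold are assumptions made up for illustration; `sc` is the SparkContext that the pyspark interactive shell provides.

```python
# Minimal sketch of a multi-step workflow that would be awkward to express
# as a single map + reduce phase. Paths and field layout are hypothetical;
# `sc` is the SparkContext provided by the pyspark shell.

# Load two hypothetical datasets from text files.
orders = sc.textFile("hdfs:///data/orders.csv") \
           .map(lambda line: line.split(",")) \
           .map(lambda f: (f[0], float(f[1])))      # (customer_id, amount)

customers = sc.textFile("hdfs:///data/customers.csv") \
              .map(lambda line: line.split(",")) \
              .map(lambda f: (f[0], f[1]))          # (customer_id, country)

# Filter, join, and aggregate in one chained pipeline: several distributed
# operations, not just one map and one reduce.
large_orders = orders.filter(lambda kv: kv[1] > 100.0)
joined = large_orders.join(customers)               # (id, (amount, country))
by_country = joined.map(lambda kv: (kv[1][1], kv[1][0])) \
                   .reduceByKey(lambda a, b: a + b)

# Sampling the data is also a single built-in operation.
sample = by_country.sample(False, 0.1, seed=42)
print(sample.collect())
```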
The other bottleneck was the slow performance of iterative algorithms, and the solution for this is caching data in memory. Generally, in your pipeline you will load your data from disk, do some preprocessing, and clean your data; at that point you have a good state that your iterative algorithms, like machine learning, can work from. In Spark it is possible to mark those data to be kept in memory, so that every time your machine learning algorithm accesses them, the access is extremely fast (there is a short sketch of this caching pattern at the end of this section). Machine learning pipelines can easily gain a factor of 10 or a factor of 100 speedup thanks to this performance increase.

However, I think that the most revolutionary feature of Spark is that it has been designed to make it a lot easier for new users to write their analyses on distributed machines, by providing access from other languages, for example Python and Scala, which are easy to use, and very soon the R interface will also be complete. It also provides an interactive shell where users can explore their data interactively, test new algorithms, and see results straight away. This makes data scientists a lot more productive compared to the usual MapReduce batch workflow, where you have to submit your job and wait for the results to come back before you can look at them.

To conclude this video, I'd like to show you a very interesting application where Spark was compared, performance-wise, to Hadoop MapReduce. In particular, this was a sorting competition: some random data were generated with specific rules, and thanks to an extremely sophisticated communication algorithm, Spark was able to perform the sorting operation in less than half the time, using about 200 nodes instead of the 2,000 nodes used by Hadoop MapReduce. In the reading assignment, I will point you to more information about this.
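As promised above, here is a small PySpark sketch of the caching pattern: load and clean the data once, mark it to be kept in memory, then iterate over it. The file path, the parsing, and the toy update loop are assumptions made up for illustration, not part of the lecture; `sc` is again the SparkContext from the pyspark shell.

```python
# Small sketch of caching a cleaned dataset in memory before an iterative
# computation. Path, parsing, and the toy update loop are hypothetical;
# `sc` is the SparkContext provided by the pyspark shell.

# Load and clean the data once.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(x) for x in line.split()]) \
           .filter(lambda p: len(p) == 2)

# Mark the cleaned RDD to be kept in memory: after the first pass,
# later iterations read from RAM instead of re-reading and re-parsing
# the file from disk.
points.cache()

# A toy iterative loop in the spirit of a gradient-style update: each
# pass scans the cached data, which is where the speedup comes from.
w = [0.0, 0.0]
for i in range(10):
    grad = points.map(lambda p: [p[0] - w[0], p[1] - w[1]]) \
                 .reduce(lambda a, b: [a[0] + b[0], a[1] + b[1]])
    n = points.count()
    w = [w[0] + 0.1 * grad[0] / n, w[1] + 0.1 * grad[1] / n]

print(w)
```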