After a brief overview of some of the processing systems in the Big Data Landscape, it is time for us to dive deeper into Spark. Spark was initiated at UC Berkeley in 2009 and was transferred to the Apache Software Foundation in 2013. Since then, Spark has become a top-level project with many users and contributors worldwide. After this video, you will be able to list the main motivations for the development of Spark, draw the Spark stack as a layer diagram, and explain the functionality of the components in the Spark stack.

As we have seen in our earlier discussions, while Hadoop is great for batch processing using the MapReduce programming model, it has shortcomings in a number of ways. First of all, since it is limited to map- and reduce-based transformations, one has to restrict their big data pipeline to map and reduce steps. While a number of applications can be implemented using map and reduce, it's not always possible, and it is often not the most efficient way to express a big data pipeline. For example, you might want to do a join operation between different data sets, or you might want to filter or sample your data. Or you might have a more complicated data pipeline with several steps, including joins and group-bys; it might have a map and reduce phase, but then another map phase after that. These types of operations are hard or impossible to express using MapReduce and cannot be accommodated by the MapReduce framework in Hadoop.

Another important bottleneck in Hadoop MapReduce that is critical for performance is that MapReduce relies heavily on reading data from disk. This is especially a problem for iterative algorithms that require taking several passes through the data using a number of transformations. Since each transformation needs to read its inputs from disk, this ends up being a performance bottleneck due to I/O. Most machine learning pipelines are in this category, making Hadoop MapReduce not ideal for machine learning.

And as I mentioned in the system overview, the only programming language that MapReduce provides a native interface for is Java. Although it is possible to run Python code, the implementation for it is more complex and not very efficient, especially when you are working not with text data but with floating-point numbers. The programming language issue also affects how interactive the environment is. Most data scientists prefer to use scripting languages because of their interactive shell capabilities, and not having such an interface in Hadoop makes it difficult to use and to be adopted by many in the field. In addition, in the big data era, having support for streaming data processing is key to being able to run similar analyses on both real-time and historical data.

Spark came out of the need to extend the MapReduce framework to overcome these shortcomings and provide an expressive cluster computing environment that offers interactive querying, efficient iterative analytics, and streaming data processing.

So how does Apache Spark provide solutions for these problems? Spark provides a very rich and expressive programming model that gives you more than 20 highly efficient distributed operations, or transformations. Pipelining any of these steps in Spark simply takes a few lines of code. Another important feature of Spark is the ability to run these computations in memory. Its ability to cache and process data in memory makes it significantly faster for iterative applications.
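As a rough illustration of what "a few lines of code" means here, the following is a minimal sketch of such a pipeline in the PySpark shell. It assumes a SparkContext is already available as sc and uses a hypothetical log file path on HDFS; the field layout and the ERROR filter are illustrative only.

# Minimal sketch, assuming `sc` is the SparkContext provided by the PySpark shell
# and that "hdfs:///data/logs.txt" is a hypothetical input path.
lines = sc.textFile("hdfs:///data/logs.txt")

# Pipeline several transformations: a filter, a map, and a group-by-style aggregation.
errors = lines.filter(lambda line: "ERROR" in line)
pairs = errors.map(lambda line: (line.split("\t")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Cache the result in memory so repeated passes avoid re-reading from disk.
counts.cache()

# Actions such as take() trigger the distributed computation.
print(counts.take(5))

Because transformations are lazy, nothing is read from disk until the final action runs, and once the cached result is materialized, later queries can reuse it without going back to HDFS.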
This has been shown to provide a factor of ten, or even one hundred, speed-up in the performance of some algorithms, especially with large data sets. Additionally, Spark provides support for both batch and streaming workloads at once. Last but not least, Spark provides simple APIs for Python, Scala, Java, and SQL programming through an interactive shell, to accomplish analytical tasks through both external libraries and its built-in libraries.

The Spark layer diagram, also called the Spark stack, consists of components that build on top of the Spark computational engine. This engine distributes and monitors tasks across the nodes of a commodity cluster. The components built on top of this engine are designed to interact and communicate through this common engine, so any improvement to the underlying engine becomes an improvement in the other components. This close interaction also enables building applications that span these different components, like querying data using Spark SQL and applying machine learning algorithms to the query results using Spark's machine learning library, MLlib.

Spark Core is where the core capabilities of the Spark framework are implemented. This includes support for distributed scheduling, memory management, and fault tolerance. Interaction with different schedulers, like YARN and Mesos, and with various NoSQL storage systems, like HBase, also happens through Spark Core. A very important part of Spark Core is the set of APIs for defining resilient distributed datasets, or RDDs for short. RDDs are the main programming abstraction in Spark; they carry data across many computing nodes in parallel and transform it.

Spark SQL is the component of Spark that provides querying of structured and unstructured data through a common query language. It can connect to many data sources and provides APIs to convert query results to RDDs in Python, Scala, and Java programs.

Spark Streaming is where streaming data manipulations take place in Spark. Although not a native real-time interface to data streams, Spark Streaming enables creating small aggregates of the data coming from streaming data ingestion systems. These aggregate datasets are called micro-batches, and they can be converted into RDDs in Spark Streaming for processing.

MLlib is Spark's native library for machine learning algorithms as well as model evaluation. All of its functionality is potentially ported to any programming language Spark supports, and it is designed to scale out using Spark.

GraphX is the graph analytics library of Spark. It enables the vertex-edge data model of graphs to be converted into RDDs, and it provides scalable implementations of graph processing algorithms.

To summarize, through these layers Spark provides diverse, scalable, and interactive management and analysis of big data. The interactive shell enables data scientists to conduct exploratory analysis and create big data pipelines, while also enabling big data system integration engineers to scale these analytical pipelines across commodity computing clusters and cloud environments.
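Coming back to the earlier point about applications that span components, here is a minimal sketch of querying data with Spark SQL and handing the result back as an RDD that another component, such as MLlib, could consume. It assumes Spark 2.x or later with PySpark, and a hypothetical JSON file of sensor readings; the column names sensor_id and value are illustrative only.

from pyspark.sql import SparkSession

# Minimal sketch, assuming a hypothetical JSON file of sensor readings on HDFS.
spark = SparkSession.builder.appName("spark-stack-demo").getOrCreate()

readings = spark.read.json("hdfs:///data/readings.json")
readings.createOrReplaceTempView("readings")

# Query structured data through Spark SQL ...
high = spark.sql("SELECT sensor_id, value FROM readings WHERE value > 100")

# ... then drop down to the underlying RDD, which any other Spark library
# could process further on the same cluster.
pairs = high.rdd.map(lambda row: (row.sensor_id, row.value))
print(pairs.take(5))

spark.stop()

The same query result could also stay a DataFrame and be passed to MLlib's DataFrame-based estimators; the point is that one common engine underlies all of these libraries.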