[MUSIC] Sometimes a single machine simply cannot perform a given task fast enough. Sometimes there are too many tasks for a single machine to handle properly. And at other times there is so much data that it must be distributed across multiple resources. These scenarios describe several of the situations where Apache Spark has the potential to directly impact a business opportunity. Whether you are working with Spark locally, on Watson Studio, from within Docker, or as part of a compute cluster, the basics we cover here will still apply.

There are many types of high-performance computing environments. Spark is a cluster computing framework. When you compare it to Hadoop, it essentially competes with the MapReduce component of the Hadoop ecosystem. Spark does not have its own distributed file system, but it can use the Hadoop Distributed File System, or HDFS. Spark uses memory and can use disk for processing, whereas MapReduce has a strictly disk-based processing approach.

Here we have a diagram of a Spark application. When we start a Spark environment, a Spark session is first created, and this manages the driver process. The driver program, or process, can be controlled using APIs in Scala, Python, SQL, Java, and R. The worker nodes on the right are usually distinct machines. Executors are the worker node processes, each in charge of running an individual task, and they are shown as the orange squares inside each of the worker nodes. One cluster configuration is to assign several cores to each executor, leaving one core for additional overhead. The cluster manager, shown at the bottom, which is often YARN, Mesos, or Kubernetes, helps coordinate between the driver program and the worker nodes.

Spark applications run as an independent set of processes on a cluster, coordinated by the Spark context object in your main program. Each application gets its own executor processes, which remain allocated for the duration of the application. The driver program, which encapsulates both the Spark context and the Spark session, is used to submit Spark applications. Once it is given instructions in the form of user code, the driver program will ask the cluster manager to launch executors.

Prior to Spark 2.0, there were multiple points of entry for a Spark application, including the Spark context, SQL context, Hive context, and streaming context. More recent versions of Spark combine all of these objects into a single point of entry, the Spark session, that can be used for all Spark applications. The Spark context is now accessed through the Spark session. In this cell, first we create a Spark session using the Spark session builder, then we show how to access the Spark context. Using the Spark context, we show here how to print some of the configuration properties. The four enumerated steps are shown in the code just below, with the exception of setting up the environment, which we just did.

Spark revolves around the concept of a resilient distributed dataset, or RDD. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The RDD API uses two types of operations: transformations and actions. On top of Spark's RDD API, higher-level APIs are provided, including the DataFrame API and the machine learning API, both of which we will cover in this course. The text file here is an RDD that was subjected to a chain of transformations and used to create counts, another RDD. The action used here is collect, which brings the data back into Python.
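The session-and-context cell described above might look like the following minimal sketch; the application name and the local[*] master setting are assumptions added for illustration, not taken from the video.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session -- the single entry point since Spark 2.0.
# The app name and local[*] master are assumptions for a local sketch.
spark = SparkSession.builder \
    .appName("spark-basics-demo") \
    .master("local[*]") \
    .getOrCreate()

# The Spark context is reached through the session.
sc = spark.sparkContext

# Print some of the configuration properties.
for key, value in sc.getConf().getAll():
    print(key, "=", value)
```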
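The text-file cell itself is not reproduced in this transcript, but a chain of transformations ending in collect, in the spirit of the classic word count and with a placeholder input path, could look like this sketch.

```python
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# text_file is an RDD; each transformation below returns a new RDD,
# and nothing executes until an action is called (lazy evaluation).
text_file = sc.textFile("sample.txt")          # placeholder path (assumption)

counts = (text_file
          .flatMap(lambda line: line.split())  # transformation
          .map(lambda word: (word, 1))         # transformation
          .reduceByKey(add))                   # transformation

# collect() is an action: it pulls the results back into local Python memory,
# so use it with caution on very large datasets.
print(counts.collect())
```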
Collect is something that you should exercise caution with, especially when working with very large datasets. There are several important things to remember about RDDs. The first is that they use what is known as lazy evaluation. This means that Spark will wait until the very last moment to execute your transformations. It does this by constructing a directed acyclic graph, or DAG, over the transformations, and then, when an action like count or collect is called, the task is sent for execution.

Here we show another example of an RDD, but this time we create it from a NumPy array with the parallelize function. The transformation is a simple filter, and the action obtains a count rather than collecting all of the data, since we generally want to avoid pulling all of the data into local memory unless we absolutely have to.

Applications can be submitted to a cluster of any type using the spark-submit command and an accompanying script. The %%writefile magic function shown here saves the code in the cell as a Python script to be called by spark-submit. The file, calculate-pi, unsurprisingly calculates pi, but more importantly it shows an example of how to map a custom function over an RDD. spark-submit can be run from the command line as shown. Generally it is called with a number of options from a batch script, but it can be called using all of the defaults as shown here. Once the script has run, it will produce the output file that is printed in the cell below. The same procedure could be used to make a prediction with a machine learning model that has been tuned, trained, and saved. [MUSIC]
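A sketch of the parallelize-and-filter cell described above might look like the following; the random array and the 0.5 threshold are assumptions used only for illustration.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Create an RDD from a NumPy array with parallelize().
data = np.random.rand(1000)            # arbitrary example data (assumption)
rdd = sc.parallelize(data)

# Transformation: keep only values above a threshold (lazy, nothing runs yet).
filtered = rdd.filter(lambda x: x > 0.5)

# Action: count() triggers execution but avoids pulling all data back locally.
print(filtered.count())
```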
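The calculate-pi script itself is not reproduced in this transcript. In a notebook, the cell would begin with the %%writefile calculate-pi.py magic so that its contents are saved to disk; the script below is a sketch of one common way to estimate pi by mapping a custom function over an RDD, with the sample count and output file name as assumptions.

```python
import random
from pyspark.sql import SparkSession

def inside(_):
    """Return 1 if a random point lands inside the unit quarter-circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    spark = SparkSession.builder.appName("calculate-pi").getOrCreate()
    n = 1_000_000                                  # number of samples (assumption)
    count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
    pi_estimate = 4.0 * count / n
    # Write the result to an output file so it can be printed after the run.
    with open("pi-outfile.txt", "w") as f:         # file name is an assumption
        f.write(f"Pi is roughly {pi_estimate}\n")
    spark.stop()
```

The script could then be run with all of the defaults as `spark-submit calculate-pi.py`, after which the output file can be printed in a following cell.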