Hello, I hope you enjoyed your first programming experience with Spark. Although the word count example is simple, it is useful for starting to understand how to work with RDDs. After this video, you'll be able to use two methods to create RDDs in Spark, explain what immutable means, interpret a Spark program as a pipeline of transformations and actions, and list the steps to create a Spark program.

So let's remember where we are. We have a Driver Program that defines the SparkContext; this is the entry point to your application. The driver converts all the data to RDDs, and everything from this point on gets managed using those RDDs. RDDs can be constructed from files or any other storage. They can also be constructed from data structures or collections in your program, like lists. All the transformations and actions on these RDDs take place either locally or on the Worker Nodes managed by a Cluster Manager. RDDs are immutable, so each transformation results in a new, updated version of the RDD rather than a modification of the original. The RDDs at the end get converted and saved to persistent storage like HDFS or your local drive.

As we mentioned before, RDDs get created in the Driver Program. The developer of the Driver Program, who in this case is you, is responsible for creating them. You can simply read in a file through your SparkContext, or, as we have in this example, you can provide an existing collection, like a list, to be turned into a distributed collection. You can also create an integer RDD using parallelize and provide a number of partitions for distribution, as we do when creating the numbers RDD in this line. Here, the range function in Python gives us the numbers 0 through 9. The parallelize function will split the RDD into three partitions to be distributed, based on the parameter that was provided to it. Spark will decide how to assign the partitions to our executors and worker nodes. The distributed RDD can in the end be gathered back to the driver as a single local collection using the collect action.

Now let's think of a scenario where we start processing the created RDDs. There are two types of operations that help with processing in Spark, namely transformations and actions. When a transformation is applied to an RDD, all of its partitions go through the same transformation in the executors on the worker nodes. Spark uses lazy evaluation for transformations. That means they are not executed immediately, but instead wait for an action to be performed; the transformations only get computed when an action is executed. For this reason, you will often see runtime errors showing up at the action stage and not at the transformation stages. This is very similar to lazy evaluation in functional languages such as Haskell, if any of you are familiar with it.

Let's put some names on these transformations. We can build a pipeline by converting a text file into an RDD with two partitions, filtering some values out of it, and maybe applying a map function to it. In the end, we run the collect action on the mapped RDD to evaluate the pipeline and turn its output into results on the driver. Here, filter and map are transformations, and collect is the action. Although RDDs live in memory, they are not persistent by default; we can use the cache function to keep them in memory. For example, in order to reuse an RDD created from a database query that would otherwise be costly to re-execute, we can cache that RDD. We need to use caution when using the cache option, though, as it can consume too much memory and become a bottleneck itself.
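To make these steps concrete, here is a minimal PySpark sketch of what we just described, assuming a local SparkContext; the file name data.txt, the app name, and the particular filter and map functions are illustrative placeholders rather than part of the course example.

    from pyspark import SparkContext

    # The driver program defines the SparkContext -- the entry point to the application.
    sc = SparkContext("local[*]", "rdd-basics")

    # Create an RDD from a Python collection: the numbers 0 through 9, in 3 partitions.
    numbers = sc.parallelize(range(10), 3)
    print(numbers.getNumPartitions())   # -> 3
    print(numbers.collect())            # collect is an action: results come back to the driver

    # Create an RDD from a text file with 2 partitions, then build a pipeline of
    # transformations. Nothing runs yet -- transformations are lazily evaluated.
    lines = sc.textFile("data.txt", 2)                      # hypothetical input file
    non_empty = lines.filter(lambda line: len(line) > 0)    # transformation
    lengths = non_empty.map(len)                            # transformation
    lengths.cache()                                         # mark this RDD to be kept in memory for reuse

    # The action below triggers the whole pipeline; errors in the lambdas above
    # would typically surface only at this point.
    print(lengths.collect())

Note that calling cache only marks the RDD to be kept in memory; it is actually materialized the first time an action computes it.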
As part of the Word Count example, we mapped the words RDD to generate tuples, then applied reduceByKey to those tuples to generate the counts. In the end, we convert the number of partitions to one, so that the output is a single file when it is written to disk later; otherwise, the output would be spread over multiple files on disk. Finally, saveAsTextFile is an action that kick-starts the computation and writes the results to disk. These steps are sketched in the code after this summary.

To summarize, in a typical Spark program we create RDDs from external storage or from local collections like lists. Then we apply transformations to these RDDs, like filter, map, and reduceByKey. These transformations get lazily evaluated until an action is performed; actions trigger the computation, locally or in parallel on the cluster, to generate the results. Next, we will talk more about transformations and actions in Spark.
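Here is a minimal sketch of the word count steps summarized above, under the same assumptions as before; input.txt and the output directory name are hypothetical placeholders.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "word-count")

    # Read the file and split each line into words, one RDD element per word.
    words = sc.textFile("input.txt").flatMap(lambda line: line.split())

    # Map each word to a (word, 1) tuple, then let reduceByKey sum the ones per word.
    tuples = words.map(lambda word: (word, 1))
    counts = tuples.reduceByKey(lambda a, b: a + b)

    # Reduce to a single partition so the result lands in one output file; saveAsTextFile
    # is the action that kick-starts the computation and writes to disk.
    counts.coalesce(1).saveAsTextFile("word_count_output")

saveAsTextFile writes a directory of output files; with a single partition, that directory holds just one part file.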