In case we lose a partition, Spark knows the lineage, so it knows how the dependencies flow and can recover everything that has been lost, starting from the beginning.
And the same for the second node: it's going to go back down to HDFS,
read the relevant part of the data that has been lost,
and then reprocess everything through to the final output.
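To make that concrete, here is a minimal sketch in Scala, assuming a live SparkContext sc and an invented HDFS path rather than the exact pipeline from the slides. The RDD method toDebugString prints the lineage that Spark would replay to rebuild a lost partition:

// Each RDD remembers the transformations that produced it, so any lost
// partition can be recomputed from the source data in HDFS.
val raw      = sc.textFile("hdfs:///data/events.log")   // invented path
val parsed   = raw.map(_.toLowerCase)
val filtered = parsed.filter(_.contains("error"))

// Prints the chain of dependencies, down to the HDFS read, that Spark
// would re-execute to recover a lost partition of filtered.
println(filtered.toDebugString)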
So let's now take a look at a little more complicated example,
where we have two different data sets being read from disk.
You see the bottom two nodes are exactly the same
operations that we've been doing before, but
now we can assume that there is, for example,
another RDD which is being joined with this RDD.
Join is another wide transformation: it
takes all the keys from the first RDD,
along with their values,
and joins them with the values in the second RDD that have the same key.
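As a small hedged illustration of those join semantics, here is a Scala example with invented data, again assuming a SparkContext sc. Only keys present in both RDDs survive, and their values are paired up:

// Two pair RDDs keyed by user name; the data is made up for illustration.
val clicks = sc.parallelize(Seq(("alice", "page1"), ("bob", "page2")))
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("carol", 41)))

// join matches values by key across the two RDDs, producing
// ("alice", ("page1", 30)) and ("bob", ("page2", 25));
// "carol" is dropped because she has no entry in clicks.
val joined = clicks.join(ages)
joined.collect().foreach(println)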
And so here, if you follow the colors, you can see that
if a partition is lost at the very last RDD,
then we can track all of its dependencies back.
And there is also another very important feature, which is
accomplished by this DAG, and that is the execution order.
So, by building this graph of dependencies,
Spark can understand which parts of your pipeline can run in parallel.
So here we see that the two sections,
the processing of the two RDDs, are independent of each other,
and so they can run in parallel.
And then, when they are both completed, the join operation can happen.
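Here is a sketch of that two-branch shape, with invented paths and logic (assumptions for illustration, not the lecture's data set). Each branch below has no dependency on the other, so Spark's scheduler can run their stages in parallel, and the join stage starts only once both have finished:

// Branch one: a word count, ending in a shuffle from reduceByKey.
val branchA = sc.textFile("hdfs:///data/a.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Branch two: an independent pipeline over a second data set.
val branchB = sc.textFile("hdfs:///data/b.txt")
  .map(line => (line.takeWhile(_ != ','), 1))
  .reduceByKey(_ + _)

// The join depends on both branches, so it runs only after both complete;
// saveAsTextFile is the action that triggers the whole DAG.
branchA.join(branchB).saveAsTextFile("hdfs:///data/out")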
And also, a chain of local operations can be
optimized by Spark and run simultaneously.
For example, in our data set at the bottom we have a flatMap
and a map operation. These two operations are both local.
So they can be executed at the same time by the same process, without even actually materializing the intermediate results.
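As a rough illustration, in the word-count shape we have been using (path invented, sc assumed), the flatMap and map below are both narrow, per-partition operations, so Spark fuses them into a single stage: each task splits a line and emits its pairs in one pass, with no intermediate RDD written out between them. The stage boundary only appears at reduceByKey, where a shuffle is needed:

val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))   // local: runs element by element inside a partition
  .map(word => (word, 1))  // local: fused into the same task as the flatMap
  .reduceByKey(_ + _)      // wide: shuffle here, so a new stage begins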