Hello. In this video,

you will learn how to compute the shortest path between vertices in a graph.

To recap, this week,

we are working with the social graph.

Users are encoded with numbers and edges represent a "follows" relation.

We have loaded and prepared the graph data with the code in this slide.

Has it ever happened to you that you clicked on a random profile in

a social network and found out that you have several mutual friends?

Or have you ever saw a social network with circles like close friends,

friends of friends, friends of friends of friends, and so on?

These features use a craft distance in their internals.

Let me introduce some definitions.

A path or a vertex path is a sequence of

vertices where every two consequent vertices are connected by an edge.

The path length is the number of edges in the path.

The shortest path between two vertices is the one with the minimal length,

the distance between two vertices is the length of the shortest path between them.

Apply to our social graph example,

users at a distance of one are those whom I follow.

Users at a distance of two are followed by those whom I follow, and so on.

Let's compute for the given user X,

the distance to every other user in the graph.

To do so, we will run a breadth-first search from the vertex X.

Precisely, we start with annotating the vertex X with distance zero.

Then, we find all neighbors of X and annotate them with distance one.

On the next iteration,

we start with nearly annotated vertices and repeat the process.

When we have different distances at the vertex,

we take the minimum because we are interested in the shortest path.

We terminate the algorithm when the distances are not changing any more.

As the distance increases with every new iteration,

it is equivalent to stop when there are no unvisited vertices.

Now, let me guide you through the code.

To save some space,

I will omit the prompt marker, perfect.

The idea is to have a mapping from a vertex to its neighbors.

Note that in this slide you have a mapping from a vertex to its followers,

so you need to make a reverse mapping,

let's call it forward_edges.

To bootstrap the algorithm,

let's create an initial distance mapping.

For example, let's start with vertex 12.

Now, you need to find neighbors of the vertex and update their distances.

I'm sure you know how to do this, right?

You join the distance mapping with the forward_edges,

and what will be the result?

On the left side,

there is a pair of vertex 12 and distance 0.

On the right side,

there is a pair of vertex 12 and vertex 13, the neighbor vertex.

So the joint tubule will be vertex 12,

distance 0, and vertex 13.

Now, we need to perform a step to devise that vertex 13 is at a distance of 1.

Is it the final distance?

Well, during the first iteration,

yes, but generally speaking, no.

Recall, you may have visited the vertex already.

Now, how to account for that.

You need to check if the distance was computed already or not.

So, once again, you need to join two datasets as shown in this slide.

Congratulations, you have done one iteration of the algorithm.

Now, you need to make a loop.

Before continuing, let's study up the code and fold

these helper functions for brevity, that's it.

Now, let's think about the termination criteria.

How can you check that you have visited all reachable vertices?

For example, you can check if you have reached any new vertices on the current iteration.

Their distance in this case will be equal to the iteration number.

You need to create a loop,

move the iteration step inside the loop and count the number of new vertices.

Then, if there are no new vertices,

you need to break the loop.

Otherwise, prepare for the next iteration by updating the search frontier.

If you run this code,

you will notice that the iteration time becomes longer and longer,

make a pause here.

Do you understand why?

There are two reasons.

First, Spark discards intermediate data after the computation.

We can hint it to persist distances to avoid recomputing them from scratch,

from the source vertex on every iteration.

Second, if you look at the Spark UI,

you will see that Spark reshuffles the data over and

over again because it knows nothing about the key distribution of data.

We can hint how to partition the data thus avoiding reshuffles.

This optimization technique is covered in

more detail in the next course of the specialization.

Congratulations, you can now compute the distance

from any vertex in the graph to all other vertices.

As an exercise, try to reconstruct the shortest path

from a particular vertex.

To summarize, in this video you have learned how to write

iterative algorithms in Spark and use persistence to avoid performance issues.

You have used this knowledge to implement

a simple breadth-first search algorithm to compute distances from a given vertex.

Take care and see you later.