0:01

So what's the problem?

Well, I did say that for most of the stuff about graph search,

undirected versus directed really doesn't matter; it's pretty much a cosmetic change.

But the big exception is when you're computing connectivity,

when you're computing the pieces of the graph.

So right now, I'm only going to talk about undirected graphs.

The directed case, we can get very efficient algorithms for, but it's quite

a bit harder work, so that's going to be covered in detail in a separate video.

0:56

But we do want to set up a more formal definition, a mathematical

statement that is either true or false about a given graph.

And roughly, we define the connected components of an undirected graph

as the maximal regions that are connected.

In the sense that you can get from any vertex in the region to

any other vertex in the region using a path.

So, maximal connected regions in that sense.
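As a concrete stand-in for "connected via a path," here is a minimal BFS reachability check in Python; the adjacency-dict representation and the tiny example graph are my own illustration, not from the lecture.

```python
from collections import deque

def connected(adj, u, v):
    """Return True iff there is a path from u to v, via BFS from u."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# A hypothetical four-node graph: 1-2-3 form one region, 4 is isolated.
graph = {1: [2], 2: [1, 3], 3: [2], 4: []}
```

With this graph, `connected(graph, 1, 3)` is True while `connected(graph, 1, 4)` is False, so 1, 2, and 3 lie in one maximal connected region and 4 in another.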

Now the slick way to do this is using an equivalence relation.

And I'm going to do this here in part because it's really the right way to think

about the directed graph case, which we'll talk about in some detail later.

1:57

So I'll leave it for you to do the simple check that this squiggle, the relation

u ~ v meaning there is a path between u and v, is indeed an equivalence relation.

I'm going to remind you what equivalence relations are.

This is something you generally learn in your first class on proofs or

your first class in discrete math.

So it's just something which may or may not be true about pairs of objects.

In an equivalence relation, you have to satisfy three properties.

So first you have to be reflexive, meaning everything has to be related to itself,

and indeed in a graph there is a path from any node to itself, namely the empty path.

Second, equivalence relations have to be symmetric, meaning that if u and

v are related, then v and u are related.

Because this is an undirected graph it's clear that this is symmetric.

If there's a path from u to v in the graph there's also a path from v to u so

no problem there.

Finally, equivalence relations have to be transitive.

So that means if u and v are related, and so are v and w, then so are u and w.

That is, if u and v have a path, and v and w have a path, then so

do u and w, and you can prove transitivity just by pasting the two paths together.

And so the upshot is: when you want to say something like

"the maximal subset in which everything is mutually related,"

the right way to make that mathematical is with an equivalence relation.

So over in this blue graph we want to say one, three, five, seven, and

nine, which are all the same in the sense that they're mutually connected and so

that's exactly what this relation is making precise.

All five of those nodes are related to each other.

2 and 4 are related to each other.

6, 8, and 10, all pairs of them are related.

So equivalence relations always have equivalence classes,

the maximal sets of mutually related objects.

And in this graph context, it's exactly these connected components,

exactly what you want.
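To make the three properties tangible, here is a small Python spot-check of the connectivity relation on a hypothetical four-node graph. This brute-force check is for illustration only, not an efficient algorithm.

```python
from collections import deque
from itertools import product

def reachable(adj, u, v):
    """u ~ v iff there is a path from u to v (BFS reachability)."""
    seen, q = {u}, deque([u])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                q.append(y)
    return v in seen

# Hypothetical graph: 1-2-3 connected, 4 isolated.
adj = {1: [2], 2: [1, 3], 3: [2], 4: []}
V = list(adj)

# Materialize the relation as a set of ordered pairs.
R = {(u, v) for u, v in product(V, V) if reachable(adj, u, v)}

reflexive  = all((u, u) in R for u in V)
symmetric  = all((v, u) in R for (u, v) in R)
transitive = all((u, w) in R for (u, v) in R for (v2, w) in R if v == v2)
```

All three flags come out True, and the equivalence classes of R are exactly the connected components {1, 2, 3} and {4}.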

So what I want to show you is that you can use breadth-first search,

wrapped in an outer loop over the vertices, to identify all of

the connected components of a graph in linear time, that is, in O(m + n) time.

Now you might be wondering why do you want to do that.

4:02

And that boils down to just understanding whether the graph that represents your

network is connected, that is, whether it's in one piece or not.

So obviously, you can ask this same question about recreational examples.

So if you return to the movie graph, maybe you're wondering,

can you get from every single actor in the IMDB database to Kevin Bacon?

Or are there actors from which you cannot reach Kevin Bacon

via a sequence of edges,

a sequence of movies in which two actors have both played a role?

So that's something that boils down to a connectivity computation.

4:47

A perhaps less obvious application of

undirected connectivity is that it gives us a nice quick-and-dirty heuristic for

doing clustering if you have pairwise information about objects.

Let me be a little more concrete.

Suppose you have a set of objects that you really care about.

This could be documents, maybe web pages that you crawl, something like that.

It could be a set of images, either your own or drawn from some database.

Or it could be, for example, a set of genomes.

Suppose further, that you have a pairwise function, which for each pair of objects

tells you whether they are very much like each other or very much different.

So let's say that if two objects are very similar to each other,

like two web pages that are almost the same,

or two genomes where you can get from one to the other with a small number

of mutations,

then they have a low score.

So low numbers,

close to zero, indicates that the objects are very similar to each other.

High numbers, let's say, they can go up to a thousand or

something, indicate that they are very different objects.

Two web pages that have nothing to do with each other, two genomes from

totally unrelated organisms, or two images that seem to be of

completely different people or even completely different objects.

5:58

Now here's a graph you can construct using these objects and

the similarity data that you have about them.

So you can have a graph where the nodes are the objects.

Okay, so for each image, for each document, whatever,

you have a single node and then for a given pair of nodes,

you put in an edge if and only if the two objects are very similar.

So for example, you could put in an edge between two objects if and

only if the score is at most ten.

So remember, the more similar two objects are, the closer their score is to zero.

So you're going to get an edge between very similar documents,

very similar genomes, very similar images.

Now in this graph you've constructed, you can find the connected components.

So each of these connected components will be a group of objects, which more or

less are all very similar to each other.

So this would be a cluster of closely related objects in your database.

And you can imagine a lot of reasons why, given a large set of unstructured data,

just a bunch of pictures, a bunch of documents or

whatever, you might want to find clusters of highly related objects.
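The whole pipeline, building the similarity graph with a threshold and then reading off connected components as clusters, can be sketched as follows. The threshold of 10 mirrors the lecture's example; the `score` function and everything else about the representation are my own assumptions.

```python
from collections import deque

def clusters(objects, score, threshold=10):
    """Quick-and-dirty clustering: put an edge between two objects iff
    their similarity score is at most `threshold`, then return the
    connected components of the resulting graph as clusters."""
    # Build the similarity graph: nodes are objects, edges join similar pairs.
    adj = {o: [] for o in objects}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if score(a, b) <= threshold:
                adj[a].append(b)
                adj[b].append(a)
    # Find connected components with BFS wrapped in an outer loop.
    explored, out = set(), []
    for o in objects:
        if o not in explored:
            explored.add(o)
            comp, q = [], deque([o])
            while q:
                x = q.popleft()
                comp.append(x)
                for y in adj[x]:
                    if y not in explored:
                        explored.add(y)
                        q.append(y)
            out.append(comp)
    return out
```

For example, clustering the numbers [0, 3, 5, 100, 104, 500] with absolute difference as a stand-in score yields the clusters {0, 3, 5}, {100, 104}, and {500}.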

7:34

I'm also going to assume that the nodes have names.

Let's say that the names are from 1 to n.

So these names could just be the position in the node array that these nodes occupy.

So this is going to be an outer for loop,

which walks through the nodes in an arbitrary order, let's say from 1 to n.

This outer for loop is to ensure that every single node of the graph will be

inspected for sure at some point in the algorithm.

Now, again, one of our maxims is that we should never do redundant work, so

before we start exploring from some node, we check whether we've already been there.

8:03

And if we haven't seen node i before, then we invoke breadth-first search,

the subroutine we discussed previously in the lecture, in the graph G,

starting from the node i.
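The algorithm just described, an outer for loop over the nodes with an explored check guarding each BFS call, can be sketched in Python like this. The adjacency-dict representation and the example edge set (chosen to be consistent with the lecture's blue graph) are assumptions for illustration.

```python
from collections import deque

def connected_components(adj):
    """Outer loop over the nodes in order; BFS from each not-yet-explored
    node. Each BFS discovers exactly one connected component, and the
    total running time is O(m + n)."""
    explored = set()
    components = []
    for i in sorted(adj):              # walk through nodes 1..n
        if i in explored:              # skip redundant work
            continue
        explored.add(i)
        component = []
        queue = deque([i])
        while queue:                   # standard BFS starting from i
            x = queue.popleft()
            component.append(x)
            for y in adj[x]:
                if y not in explored:
                    explored.add(y)
                    queue.append(y)
        components.append(component)
    return components

# Hypothetical edge set consistent with the lecture's blue graph,
# whose components are {1,3,5,7,9}, {2,4}, and {6,8,10}.
blue = {1: [3, 5], 3: [1, 5], 5: [1, 3, 7, 9], 7: [5, 9], 9: [5, 7],
        2: [4], 4: [2], 6: [8, 10], 8: [6, 10], 10: [6, 8]}
```

Running `connected_components(blue)` discovers the three components in the order the outer loop first touches them: first from node 1, then from node 2, then from node 6.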

So to make sure this is clear,

let's just run this algorithm on this blue graph to the right.

So we start in the outer for loop and we set i equal to 1.

And we say have we explored node number 1 yet.

And of course not, we haven't explored anything yet.

So the first thing we're going to do is we're going to invoke

BFS on node number 1 here.

So now we start running the usual breadth-first search subroutine starting

from this node 1, and layer one here is going to be nodes 3 and 5.

So we explore them in some order; for example, maybe node number 3 is

what we explore second, and node number 5 is what we explore third.

8:51

Then the second layer in this component is going to be the nodes 7 and 9.

So after this BFS initiated from node number one completes,

of course it will have found everything it could possibly find,

namely the five nodes in the same connected component as node number 1.

And of course, all of the five of these nodes will be marked as explored.

Once that BFS exits, we return to the outer for loop and increment i to 2.

Have we already explored node number 2? No, we have not.

And so now we invoke BFS again from node number 2.

So that will be the sixth node we explore.

There's not much to do from two,

all we can do is go to 4, so that's the seventh node we explore.

That BFS terminates after finding the nodes in this connected component,

and then we go back to the outer for loop.

We increment i to 3. Have we already seen node number 3?

Yes we have, we saw that in the first breadth-first search.

So we certainly don't bother to BFS from node 3; then we increment i to 4.

Have we seen 4?

Yes we have, in the second call to BFS.

Have we seen node 5?

Yes we have, in the first call to BFS.

Have we seen node 6?

No, we have not.

So the final invocation of breadth-first search begins from node number 6.

That's going to be the eighth node overall that we see.

And then we're going to see the nodes 8 and 10 in some order, so for

example maybe we first explore node number 8.

That's one node of the first layer in this component,

and then node number 10 is the other node of the first layer in this component.

So in general, what's going on? Well, we know what will happen

when we invoke breadth-first search from a given node i:

we're going to discover exactly the nodes in i's connected component.

Right, anything where there's a path from i to that node, we'll find it.

That's the BFS guarantee, that's also the definition of a connected component.

All the other nodes which have a path to i.

10:39

Another thing that I hope was clear from the example, but

just to reiterate it: across all of the search calls,

once you explore a node, it stays marked as explored through the entire for loop.

So when we invoke breadth-first search from node number 1, we explore nodes 1, 3,

5, 7 and 9, and we keep those marked as explored for the rest of this algorithm.

And that's so

we don't do redundant work when we get to later stages of the for loop.

11:04

So as far as what does this algorithm accomplish, well,

it certainly finds every connected component.

There is absolutely no way it can miss a node because this for

loop literally walks through the nodes, all of them, one at a time.

So you're not going to miss a node.

Moreover, we know that as soon as you hit a connected component for

the first time, and you do breadth-first search from that node,

you're going to find the whole thing.

That's the breadth-first search guarantee.

As far as what's the running time, well it's going to be exactly what we want.

It's going to be linear time, which again means proportional to the number of edges

plus the number of vertices.

And again, depending on the graph, one of these might be bigger than the other.

So why is it O of m plus n?

Well as far as the nodes, we have to do this initialization there where we mark

them all as unexplored, so that takes constant time per node.

We have just the basic overhead of a for loop, so that's constant time per node.

And then we have this check, constant time per node, so that's O of n.

And then recall we proved that within breadth-first search,

you do an amount of work proportional to the size of the component:

constant time for each node in that connected component.

Now, each of the nodes of the graph is in exactly one of the connected components.

So you'll do constant time for

each node in the BFS in which you discover that node.

So that's again O of n, summed over all of the connected components.

And as far as the edges, note we don't even bother to look at edges until

we're inside one of these breadth first search calls.

They played no role in the outer for loop or in the pre-processing.

And remember what we proved about an invocation of breadth-first search.

The running time, you only do constant amount of work

per edge in the connected component that you're exploring.

In the worst case, you look at each edge once from each endpoint, and

each look triggers a constant amount of work.

So when you discover a given connected component,

the edge work is proportional to the number of edges in that connected component.

Each edge of the graph is in exactly one of the connected components, so

over this entire for loop, over all of these BFS calls,

each edge of the graph is responsible for

only a constant amount of the algorithm's work.

So, summarizing: breadth-first search from a given starting node

works in time proportional to the size of that node's connected component.

Piggybacking on that subroutine and looping over all of the nodes of the graph,

we find all of the connected components in time proportional to the number of

edges and nodes in the entire graph.