[MUSIC] So we're ready to tackle the problem of detecting communities within a social network. And we've thought about this problem a couple of times so far, but by the end of this video, you'll be able to describe both the notion of communities within a social network, and think about how to implement an algorithm for finding these communities. So remember our motivation is to try to make sense of huge number of interconnected nodes that appear in this network. But they are somehow clustered into subcommunities within the big network. And so we'd like to capture this notion of communities really being groups of people who have more connections to one another than to people who are outside of the group. And so, if we want to translate that intuition about what community might look like to graphs and algorithms, we need to have a more precise definition of what a community is. And in order to come up with that precise definition, we could really attack the problem in two different ways. The first way would be to look for those tight links, and try to group together nodes that have a lot of edges between them. And somehow build up communities from kernels, starting from within a community, so maybe like the heart of the community, find vertices that have lots and lots of connections between them. And then add on additional vertices if they have a fair number of connections between them, and then somehow grow communities from the middle out. But on the other hand, we could attack the problem from the outside in. And say, if I have a region in my graph where it looks like there's not such a strong link between man part of the graph and another, then there's probably not a community that encompasses both parts of that graph. And so what we might want to do is look for links that we could cut, and cutting those links wouldn't interfere with the community structures and might help us subdivide our problem into smaller chunks. That then we could iterate and look at within those new parts of the graph, see if there's weak links there and cut those and weak links etc. And that way hone in to those communities from the outside in. So, the algorithm that we will be talking about for this video is one of those divisive algorithms that start from the outside and looks for those quick legs. And the intuition that we'll use for describing our communities is that if we have two separate communities then what's going to happen is to get officially from one community to another, we might always have to use one edge that sort of connects those communities. The analogy we might want to think of is like communities geographically located on either side of a highway. So that if we ever want to travel from one of those communities to another, we have to go through this bottleneck highway, and so anytime we want to travel between them, our shortest path will tend to flow through a particular part of the network. And so that particular part of the network is really like a border between our two communities and acts to separate them. So what we want to do is find those borders and then cut them. And so our algorithm might exactly look for those borders. So, what we want to do now is translate this intuition of traffic and borders to graphs. And so, let's look at a simple graph, a toy example. And so we have this network here and let's think about path in this network for getting from one vertex to another, and then see if information about those paths can tell us something about how we might group these vertices together. Now, as a spoiler a bit, just looking at the graph, you probably have a sense of which vertices it would make sense to group. So, think ahead about how you would, as a human without having to use an algorithm, think about grouping these vertices into communities. And then let's see how our algorithm plays out and whether it matches up with your intuition, okay. So let's think about notes 6 and 7 in this graph. Let's think about what would be a shortest path from 6 to 7. What do you think? Well, you probably said there's an edge between them, so of course, we are just going to jump along that edge and immediately get from 6 to 7. All right, so let's do something a little bit more complicated. We saw that there's just that one edge. But what about if we went from 2 to 5? These two vertices are no longer neighbors of one another. We can't just jump along a single edge. So what's the shortest path from 2 to 5? What you may have seen is that there's actually two shortest paths. So even though we said shortest, there's two ways of traversing the graph from vertex 2 to vertex 5 or vice versa, they each take three hops. We can either go up to 1, and then across to 5, or we can go down to 4 and then across to 5. If we started off at 2 and wanted to end at 5. Now what's going to be really important for our analysis of these edges is that either one of these shortest paths still has to take that edge between 3 and 5. And so no matter how we traverse the graph from 2 to 5, if we want to stick to efficient paths in the graph, mainly shortest paths, we're going to need to use that really important edge. And so that edge between 3 and 5 seems to somehow lie between the community of 2 and the community of 5. So, this is what we're going to try to capture when we think about flow. And the visualization you might want to have in mind here is think about traffic flow or liquid flow where you can think about going from 2 to 5, and we have 1 unit of water that we're pouring into the graph. And that water is going to try and take the path of least resistance, so the water's just going to use shortest paths through the graph and try to get to 5. And at the beginning, when we start at 2, half the water will go into the blue path and half the water will go into the purple path because they're both shortest paths and so they'll each take about half the flow. That starts from 2 and goes to 5. But then once they reach vertex 3, they'll join up together, both of these quantities of flow will join up together and the entire flow between 2 and 5 has to traverse that edge from 3 to 5. On the other hand, we had this situation with 6 and 7 where the 1 unit of flow that was allowed to go between vertex 6 and vertex 7 all had to go along that one edge. It didn't separate. There was exactly one unique shortest path between those vertices. And so what we could now is think about this notion of flow between every pair of vertices in the graph, and think about which edges come up as really important edges between many pairs of vertices. And so what we do is we compute the betweenness of edges. So an edge has really high betweenness if it has this important function, as being used within the shortest paths of many pairs of vertices. So if we look at that edge between 6 and 7 again, it had all the flow that went between 6 and 7, but it doesn't have any other contributions from any other flows, because that edge doesn't lie on any shortest path between any other pair of vertices, just between 6 and 7. On the other hand, the edge between 3 and 5 lies on many many shortest paths. It would lie on the shortest path between 2 and 5, as we saw, but also between 1 and 5. And also between 4 and 6. And so that edge is really crucial, it's lying on many shortest paths on this graph. And so somehow we see that it's acting as a boundary between the communities that many of these vertices occupy. And so if we wanted to subdivide those communities we would probably cut that edge, and thinking back to your intuition, it's probably in line with what you were thinking of before. If we were to group the vertices in this graph, we would probably group the four vertices, 1, 2, 3, 4, as 1 community, and the 3 vertices, 5, 6, and 7, as a separate community. Because there's only really one edge connecting the vertices between these two communities, and most of the edges within these communities just stay within those two groups of vertices. And so, this notion of betweenness is capturing which edge we should cut in order to make progress towards subdividing our vertices into groups that represents communities. And so that leads to our algorithm. Because our algorithm now is compute the betweenness of all edges by thinking of these flows through shortest paths. Remove the edges of highest betweenness, we might have a tie, in which case we might remove multiple edges, and then repeat. We can repeat, and repeat, and repeat, and that way, we get a picture of slowly by slowly how we refine the structure of the graph, so it gets into smaller and smaller subdivisions that represent different levels of community. Or we might have a particular end goal in mind, where we wanted to keep going until we have exactly 3 communities or 17 communities and look at the graph structure with results. And so, this is a very high level of description of the graph, but we would probably be pretty hard-pressed to actually implement it at this point. And so we might ask ourselves, well, how do we compute the betweenness? What do we do? And so for that we have to think back to that definition of betweenness, and betweenness was accumulating that proportion of flow that an edge was receiving as we looked at all pairs of vertices in the graph. And so what we might want to do is think about each node v, and the flow that it contributes to all other vertices. And think about how much of that flow does each edge get? So, how do we do that? Well, the heart of that is to find the shortest path between that vertex of v and all of the vertices in the graph. And so we need to find the number of shortest paths from v to every other node, but you already know how to look for shortest paths. If you think back to course three, we talked about breadth-first search. And breadth-first search was a way of exploring a graph and we explored it from a central vertex out in all directions, and every time we hit a new vertex we've got there, we had a potential of finding the shortest path to that vertex. And so we could explore our graph from our original node v, looking out for shortest paths and just count them. And so what we might do is do is do that breadth-first search, compute the numbers shortest path from v to each of the nodes and then proportional distribute the flow according to these paths. And now we have a slightly lower level description of our task for the algorithm. But if you're interested in looking at this problem more carefully and implementing it as part of your project, you still have work to do. You have to think about how do I do this breadth-first search? What information do I need to capture as I'm traversing this graph and as I'm accumulating flow over all the vertices? How do I do this optimally? What data structures should I use and how far should I iterate? Because, even once I've computed the between the small edges now I need to start removing those edges and thinking about the community structure that results. So at this point, we've got an algorithm but it's your turn to explore. And you can explore optimizations to this algorithm, you can compare it to a rich dataset where you already know something about what the inherent communities in this data structure might look like. And compare whether the algorithm that we've just outlined will agree with those communities that come up from additional information beyond just the edges and the nodes that we know in the Facebook data, for example. And you can think about how we might use some of that ground truth rich data to inform our algorithm as we're running it, perhaps to make it run faster, or to make it run more accurately so we don't have to do as many iterations. Or really anywhere your heart desires. So hopefully, this gives you some ideas of what you might do in looking for communities in the network data. And if you're interested in thinking about this project further, you can take it where you would like.