[MUSIC] So I want to spend some time talking about graphs and graph analytics. So we've encountered these before the context of elastic MapReduce assignment, but we haven't spent too much time in the lectures talking about them. But they're increasingly important in the data science context, for reasons we'll talk about. So what is a graph? A graph is a pair of sets, a set of vertices and a set of edges. Okay, you might also see vertices referred to as nodes, but I'm gonna try to avoid that terminology to avoid confusion with clusters of computers, where each computer is referred to as a node. Okay, so V is a set of vertices and E is a set of edges, and each edge is a pair of vertices, a source and a target. And so edges may be considered directed or undirected, and we'll talk a little bit more about this. But maybe it's an edge from a source to a target, or maybe it's just a pair and the order doesn't matter. So these graphs are increasingly common in the wild, right? The web itself can be modeled quite directly as a graph, where every page is a vertex and every link between pages is an edge. The Internet underlying the web can be modeled as a graph where every computer is a vertex and every route is an edge, or maybe even every packet. The social networks, as we've seen in some of the assignments, these are becoming increasingly important as a political driver and as a driver for social change. The influence they have is pretty difficult to overstate, and this is very obviously a graph where every connection between two people in the social network is an edge. Communication logs can be modeled as a graph. Every phone call between two people is an edge. And of course in the news recently, this is becoming quite important with the prison program. And so many more, and so one reason these are so ubiquitous is that it really does kind of get the fundamentals of communication, right. You have a big set of actors, and any time one actor interacts with another, that can be modeled as an edge. And so this is just a very common paradigm that can capture the behavior of a lot of different kinds of systems, okay? The other reason you see this to be so ubiquitous in kind of a data modeling aspect is that objects and relationships between those objects, you can't really get any lower than that, right? So relations talk about, or the relational data model talks about tables, where you have records and attributes, but even here there's a bit of complexity. And you have to sort of think about a schema, and you have to sort of think about every record is gonna have the same schema and so on. Graphs kinda blow up all that and disintegrate everything down to just objects and relationships. And so in some sense, it's a lowest common denominator data model. What we'll get into a little bit is my opinion that I think you throw out quite a bit when sort of drop everything down to that level. But we'll talk about that in a bit. So what do we wanna do with these graphs? So Bordawekar in 2012 wrote a paper about analyzing analytics where they try to sorta categorize various analytics tasks in a variety of categories, and one of the categories was graph analytics. And he broke it down into these three patterns, structural algorithms, traversal algorithms, and pattern-matching algorithms. And I think this is as good a breakdown as any other. So this is the one we'll use in the next several slides. So you think about just structural tasks, structural algorithms, the first thing to do in encountering a graph you're expected to work with or understand is just to collect some basic metrics, right. How big is it, how many vertices and how many edges? And one of the takeaways I want you to have is that the number of edges is more relevant than the number of vertices when you're trying to understand the scale of a graph. So, many times you'll have a conversation with someone, and they'll say, well, I have this really big graph I'm working with. So the next question you should ask is, how many edges does it have? Not how many bytes and not how many vertices. And the reason you don't care about the bytes is because there might be all kinds of other information packed into the labeling of these things. But that can really be factored out from the graph itself and modeled maybe more traditionally in a relational database. And so the performance may not depend so much on exactly the number of bytes. But the graph structure is a challenge as you will see. Okay, and so the number of vertices is one measure, but the number of edges is another. So why is the number of edges more important than the number of vertices? Well, because it scales quadratically with the number of vertices, right. You can have at most, number of vertices squared number of edges, and that's a really, really big number in some cases, right. If there's 2 billion people using a social network, well, 2 billion squared is the number possible friend relationships you have. And, if you're expected to do some analytics on this graph, you're going to be processing all of those edges, perhaps, okay. And then, the second question you should ask after the number of edges is, what is the highest in or out degree? As a proxy for really what is the degree of distribution here. So what do I mean by degree? Well, the in degree of a vertex is the number of edges that are coming into it, and the out degree of a vertex is the number of edges that go out of it, okay? And so, how the edges are distributed among all the vertices tends to be a driver of how difficult the graph is to work with. If they're all pretty evenly distributed, then this isn't much of a problem. You can parallelize things, as we'll see, and everything works out very nicely. However, almost never do graphs in the wild have this property where things are sort of randomly distributed. It tends to be very skewed. So in a social network, there tends to be a person who is friends with everybody in the graph, right, very, very popular people, almost everybody in the graph, okay. On the web, there are pages that tend to be linked to by everyone, right, very, very popular pages. And so this in balance in the degree distribution is one of the challenges in working with very, very large graphs. And so, if you know the number of edges, you've got one indication of sort of gross scale. And then if you know about how many edges are in the most popular vertex, that gives you some idea of how skewed some things are. Okay, so fine, so, when you're working with graphs, these are the things you're gonna be able to do. Can you just count the vertices, count the edges? Can you say given a vertex, can I get the number of in edges and out edges for that vertex? And can I maybe do that for every vertex in the graph? Okay, and so if this is hard to work with, this is gonna be the basis for many other tasks. [MUSIC]