In this lesson, we'll talk about two dominant systems developed for large-scale graph processing. The first one, called Giraph, is from Apache and implements the BSP model on Hadoop. The second system, called GraphX, is developed on the Spark platform, which, as you know, emphasizes interactive, in-memory computation.
While BSP is a popular graph processing model, an actual implementation of BSP in an infrastructure needs additional programmability beyond what we have discussed so far. In Giraph, several additional capabilities are added to make it more practical. A thorough coverage of the Giraph platform is beyond the scope of these lectures. However, we'll touch upon a few of these capabilities. We'll first consider graph I/O, that is, how graphs come into the system, how they are represented inside the system, and how they are written out when the computation completes. Next, we'll describe how Giraph interacts with external data sources. Some of these data sources use a different data model; others are databases. Once a graph is imported, it is important to make sure that the system runs efficiently. We will look at a method that uses a special kind of global aggregate operation, which saves time by reducing the amount of messaging needed to compute aggregate functions like sums and products. Finally, we'll recognize that even though Giraph is designed for iterative, in-memory computation, there are times when it is absolutely necessary to store data on disk. We'll briefly touch upon Giraph's ability to handle out-of-core graphs and out-of-core messages.
A graph can be written in many ways. For Neo4j, we saw how graphs can be imported into the database from a CSV file. In Giraph, two of the most common input formats are Adjacency List and Edge List. For an Adjacency List, each line has the node ID, a node value, which is a single number here, and a list of (destination, weight) pairs. Thus, in line one, A has a value of 10 and two neighbors, B and F, with edge weights 2 and 5, respectively.
Since G has no outgoing edge, its adjacency list is empty. The other common way of representing graphs, the Edge List, is in terms of triplets containing the source and destination nodes followed by an [INAUDIBLE]. Notice that, in the way we have shown it here, the node values are not represented.
Let us simplify the Adjacency List representation a bit. We remove the colons, commas, braces, and parentheses, and get a space-separated set of lines, one line for each vertex. We further replace the node IDs A, B, C, etc., with 1, 2, 3, etc., so that these IDs are integers. So what do we need to specify to parse this for Giraph? One, the graph is a text object and not, let's say, a database object. Two, it is a vertex-based representation: each line is a vertex. Three, the splitter here is a space. The ID of the node is the first value on each line. The node value is the second token. The next pair of items defines an edge, giving the target and the weight, respectively. And lastly, there is a list of these pairs until the end of the line. Therefore, each line would typically lead to the creation of both a node and a set of edges.
This shows a typical reader for an adjacency list, written in Java. Again, you don't have to know Java to get the elements of this program. A reader is clearly customized for your specific input. Very often the starting point is a basic reader provided by Giraph, like the reader that knows how to read vertices from each line of a text file. To customize it, you extend it and create your own version. Now, you need to define how to get the ID and the value of the vertex by writing separate methods for them. Notice that the ID comes from the zeroth item of each line after the split by white space, and the value comes from the next token, the second item, which is indexed by 1 because the items of the line are zero-based. The next code element is this block here. This specifies how to create edges by iterating through every line. To keep this short, we have removed the part that gets the edges here.
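To make the parsing rules above concrete, here is a minimal sketch in plain Python rather than in Giraph's Java reader classes. The function names, the use of floats for values and weights, and the sample lines are all assumptions made for illustration; in particular, the value 0 given to vertex 7 (G) is a placeholder, and the third element of each edge triple is assumed to be the edge weight.

```python
from typing import Iterable, List, Tuple

WeightedEdge = Tuple[int, float]               # (target ID, edge weight)
Vertex = Tuple[int, float, List[WeightedEdge]]  # (ID, value, outgoing edges)


def parse_vertex_line(line: str) -> Vertex:
    """Parse one space-separated line: ID, value, then (target, weight) pairs."""
    tokens = line.split()  # split on white space
    # An ID and a value are required; after that, tokens must come in pairs.
    if len(tokens) < 2 or len(tokens) % 2 != 0:
        raise ValueError(f"malformed vertex line: {line!r}")
    vertex_id = int(tokens[0])   # item 0: the node ID
    value = float(tokens[1])     # item 1: the node value
    edges = [(int(tokens[i]), float(tokens[i + 1]))
             for i in range(2, len(tokens), 2)]
    return vertex_id, value, edges


def to_edge_list(lines: Iterable[str]) -> List[Tuple[int, int, float]]:
    """Flatten vertex lines into (source, target, weight) triples.

    Node values are dropped, as in the edge list format.
    """
    triples = []
    for line in lines:
        source, _value, edges = parse_vertex_line(line)
        triples.extend((source, target, weight) for target, weight in edges)
    return triples


# Hypothetical input: vertex 1 (A) with value 10 and edges to 2 (B) and 6 (F);
# vertex 7 (G) with a placeholder value and no outgoing edges.
sample = ["1 10 2 2 6 5", "7 0"]
```

Each line yields one vertex and zero or more edges, which is the same division of work described for the reader: one method for the ID, one for the value, and a block that walks the remaining pairs.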
As Giraph has matured, it has included many specialized components to interoperate with compatible data sources. This diagram is from Giraph, and it shows some of these sources. We can group them into three different categories. Group one interoperates with Hive and HBase. You possibly remember these systems from a prior course. These systems are designed to give a higher-level data access interface on top of MapReduce. Group two accesses database systems like MySQL and Cassandra. These systems are accessed indirectly, through a software module called Gora. Gora uses a JSON schema to map the relational schema of the SQL database to a structure that Giraph can read. Group three accesses graph databases like Neo4j and DEX, which is now called Sparksee. These systems are also accessed indirectly, using the [INAUDIBLE] service of TinkerPop, which is a graph API layer that can use many different graph stores, including [INAUDIBLE] graph and Titan.
Consider a relational table stored in Hive. The table shown here is extracted from the BioGRID data source that we mentioned in module two. Each row of the table represents a molecular interaction. We can create a network from here just by considering the first two columns. The first column, colored red, represents the source node of an edge, and the second column, colored blue, represents the target node of the edge. The label on the edge comes from the fifth column of the table, which is shown in a black bold font. Let's assume that these three items are the ones we want to pick up from the Hive table. The simplest way to get a record from Hive to Giraph is to extend the class called SimpleHiveRowToEdge. For this class, we need to specify the source node, the target node, and the edge value using three methods, as shown here. My extension is called MyHiveRowToEdge. It shows the implementation of these methods, where we just pick up the first, second, and fifth columns, as we described before.
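The column-picking idea can be sketched outside of Giraph as follows. This is plain Python, not the Giraph Hive API: the class name, the method names, and the sample row are hypothetical and only mirror the three methods described above (source node, target node, edge value). The row contents are made-up placeholders, not actual BioGRID data.

```python
from typing import Sequence, Tuple


class RowToEdgeSketch:
    """Hypothetical stand-in for a row-to-edge converter with three methods."""

    def source_vertex_id(self, row: Sequence[str]) -> str:
        return row[0]  # first column: source node

    def target_vertex_id(self, row: Sequence[str]) -> str:
        return row[1]  # second column: target node

    def edge_value(self, row: Sequence[str]) -> str:
        return row[4]  # fifth column: edge label

    def to_edge(self, row: Sequence[str]) -> Tuple[str, str, str]:
        """Combine the three picks into one (source, target, label) edge."""
        return (self.source_vertex_id(row),
                self.target_vertex_id(row),
                self.edge_value(row))


# Placeholder row with five columns; only columns 1, 2, and 5 are used.
example_row = ["geneA", "geneB", "col3", "col4", "interaction-label"]
example_edge = RowToEdgeSketch().to_edge(example_row)
```

Note that the column positions are one-based in the description (first, second, fifth) and zero-based in the code (0, 1, 4).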
Now, as mentioned before, Giraph interacts with Neo4j through the Gremlin API provided by TinkerPop. One can think of Gremlin as a traversal API, which means it allows one to start from some node and walk the graph step by step. To show this, consider the disease-gene graph on the right. Let's call this graph g. So g.V represents all the vertices of g. Therefore, g.V.has('name', 'MC4R') selects the node that has a property called name whose value is MC4R. Let's add to this path the step .out, which chooses the out edges of the MC4R node and then traverses the associatedWith edge to the orange node called obesity. This call returns the vertex only. Now, adding values('name') to the path gives us obesity. We can also expand differently from the obesity node. When we say inV(), we refer to all nodes that have edges coming in to the current node. In this case, there is only one, the LEPR node. To this, we add the traversal step outE, and thus we get back the outgoing edge from the LEPR node, which is highlighted in the figure.
We can also look at the Giraph-Gremlin-Neo4j connection from TinkerPop's viewpoint. TinkerPop is trying to create a standard language for graph traversal, just like Neo4j is trying to create openCypher as a standard query language for graph databases. In trying to create the standard, TinkerPop recognizes that the actual storage management for graph databases should be provided by another vendor. The vendor needs to implement the Gremlin API for access. Similarly, graph processing, including expensive analytic operations, should be performed by what they call a graph computer. This is the role played by Giraph as well as Spark, both of which interface with TinkerPop.
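The step-by-step style of a traversal can be imitated with a small toy in plain Python. This is not the Gremlin API; the class and its method names (has_name, out_e, out) are hypothetical, vertices are identified simply by their name, and the two edges below are an assumed fragment of the disease-gene graph with the associatedWith label mentioned above.

```python
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[str, str, str]  # (source name, edge label, target name)


class ToyGraph:
    """A tiny in-memory graph whose methods each act as one traversal step."""

    def __init__(self, edges: Iterable[Edge]) -> None:
        self.edges: List[Edge] = list(edges)
        # Collect every vertex name that appears as a source or a target.
        self.vertices: List[str] = sorted(
            {v for (s, _label, t) in self.edges for v in (s, t)})

    def has_name(self, name: str) -> List[str]:
        """Select the vertices whose name equals the given value."""
        return [v for v in self.vertices if v == name]

    def out_e(self, starts: List[str],
              label: Optional[str] = None) -> List[Edge]:
        """Return the outgoing edges of the start vertices."""
        return [e for e in self.edges
                if e[0] in starts and (label is None or e[1] == label)]

    def out(self, starts: List[str],
            label: Optional[str] = None) -> List[str]:
        """Follow outgoing edges and return the vertices they lead to."""
        return [target for (_s, _l, target) in self.out_e(starts, label)]


# Assumed fragment of the graph: two genes associated with one disease.
g = ToyGraph([
    ("MC4R", "associatedWith", "obesity"),
    ("LEPR", "associatedWith", "obesity"),
])

reached = g.out(g.has_name("MC4R"), "associatedWith")  # walk one step out
lepr_edges = g.out_e(g.has_name("LEPR"))               # edges, not vertices
```

Each call takes the result of the previous step as its starting point, which is the sense in which a traversal walks the graph: select a node, then move along its edges, and choose at each step whether vertices or edges come back.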