

From "Think Like a Vertex" to "Think Like a Graph"

Yuanyuan Tian†, Andrey Balmin§1, Severin Andreas Corsten?, Shirish Tatikonda†, John McPherson†
† IBM Almaden Research Center, USA — {ytian, statiko, jmcphers}@us.ibm.com
§ GraphSQL, USA — [email protected]
? IBM Deutschland GmbH, Germany — [email protected]
1 This work was done while the author was at IBM Almaden Research Center.

ABSTRACT

To meet the challenge of processing rapidly growing graph and network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a "think like a vertex" programming model to support iterative graph computation. This vertex-centric model is easy to program and has been proved useful for many graph algorithms. However, this model hides the partitioning information from the users and thus prevents many algorithm-specific optimizations. This often results in longer execution time due to excessive network messages (e.g. in Pregel) or heavy scheduling overhead to ensure data consistency (e.g. in GraphLab). To address this limitation, we propose a new "think like a graph" programming paradigm. Under this graph-centric model, the partition structure is opened up to the users, and can be utilized so that communication within a partition can bypass the heavy message passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open source implementation of Pregel. We explore the applicability of the graph-centric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on well-partitioned data. For example, on a web graph with 118 million vertices and 855 million edges, the graph-centric version of the connected component detection algorithm runs 63X faster and uses 204X fewer network messages than its vertex-centric counterpart.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China.
Proceedings of the VLDB Endowment, Vol. 7, No. 3
Copyright 2013 VLDB Endowment 2150-8097/13/11.

1. INTRODUCTION

Rapidly growing social networks and other graph datasets require a scalable processing infrastructure. MapReduce [7], despite its popularity for big data computation, is awkward at supporting iterative graph algorithms. As a result, a number of distributed/parallel graph processing systems have been proposed, including Pregel [15], its open source implementation Apache Giraph [1], GraphLab [14], Kineograph [6], Trinity [20], and Grace [23].

The common processing patterns shared among existing distributed/parallel graph processing systems are: (1) they divide input graphs into partitions for parallelization, and (2) they employ a vertex-centric programming model, where users express their algorithms by "thinking like a vertex". Each vertex contains information about itself and all its outgoing edges, and the computation is expressed at the level of a single vertex. In Pregel, a common vertex-centric computation involves receiving messages from other vertices, updating the state of the vertex itself and its edges, and sending messages to other vertices. In GraphLab, the computation for a vertex is to read and/or update its own data or the data of its neighbors.

This vertex-centric model is very easy to program and has been proved useful for many graph algorithms. However, it does not always perform efficiently, because it ignores vital information about graph partitions. Each graph partition essentially represents a proper subgraph of the original input graph, not a collection of unrelated vertices. In the vertex-centric model, a vertex is very short-sighted: it only has information about its immediate neighbors, so information propagates through the graph slowly, one hop at a time. As a result, it takes many computation steps to propagate a piece of information from a source to a destination, even if both appear in the same graph partition.

To overcome this limitation of the vertex-centric model, we propose a new graph-centric programming paradigm that opens up the partition structure to users and allows information to flow freely inside a partition. We implemented this graph-centric model in a new distributed graph processing system called Giraph++, which is based on Apache Giraph.

To illustrate the flexibility and the associated performance advantages of the graph-centric model, we demonstrate its use in three categories of graph algorithms: graph traversal, random walk, and graph aggregation. Together, they represent a large subset of graph algorithms. These example algorithms show that the graph-centric paradigm facilitates the use of existing well-known sequential graph algorithms as starting points in developing their distributed counterparts, flexibly supports the expression of local asynchronous computation, and naturally translates existing low-level implementations of parallel or distributed algorithms that are partition-aware.

We empirically evaluate the effectiveness of the graph-centric model on our example graph algorithms. We compare the graph-centric model with the vertex-centric model, as well as with a hybrid model, which keeps the vertex-centric programming API but allows asynchronous computation through system optimization. This hybrid model resembles the approaches GraphLab and Grace take. For a fair comparison, we implemented all three models in the same Giraph++ system. In the experimental evaluation, we consistently observe substantial performance gains from the graph-centric model, especially on well-partitioned data.

For example, on a graph with 118 million vertices and 855 million edges, the graph-centric connected component algorithm ran 63X faster than the vertex-centric implementation and used 204X fewer network messages. It was also 27X faster than the hybrid model, even though it used only 2.3X fewer network messages. These performance gains are due to an algorithm-specific data structure that keeps track of the connected components within a partition and efficiently merges components that turn out to be connected by a path through other partitions. As a result, the graph-centric version needs far fewer messages per iteration and completes in fewer iterations than both the vertex-centric and the hybrid versions.

Note that the proposed graph-centric programming model is not intended to replace the existing vertex-centric model. Both models can be implemented in the same system, as we demonstrated in Giraph++. The vertex-centric model has its simplicity. However, the graph-centric model allows lower-level access, often needed to implement important algorithm-specific optimizations. At the same time, the graph-centric model still provides a sufficiently high level of abstraction and is much easier to use than, for example, MPI [18].

The graph-centric programming model can also be implemented in other graph processing systems. We chose Giraph due to its popularity in the open source community, and more importantly its ability to handle graph mutation. Graph mutation is a crucial requirement for many graph algorithms, especially for graph aggregation algorithms, such as graph coarsening [12, 11], graph sparsification [19], and graph summarization [22]. Incidentally, to the best of our knowledge, Giraph++ is the first system able to support both asynchronous computation and mutation of graph structures.

The performance of many graph algorithms, especially the ones implemented in the graph-centric model, can significantly benefit from a good graph partitioning strategy that reduces the number of cross-partition edges. Although there has been a lot of work on single-node sequential/parallel graph partitioning algorithms [12, 11, 21], the rapid growth of graph data demands scalable distributed graph partitioning solutions. In this paper, we adapted and extended the algorithm of [11] into a distributed graph partitioning algorithm, which we implemented in the same Giraph++ system, using the graph-centric model.

The remainder of the paper is organized as follows: Section 2 provides the necessary background on Giraph. In Section 3, we introduce the graph-centric programming model, and in Section 4, we exploit the graph-centric model in various graph algorithms. In Section 5, we discuss the hybrid model, which is an alternative design to support asynchronous graph computation. The detailed empirical study is provided in Section 6. Section 7 describes the related work. Finally, we conclude in Section 8.

2. GIRAPH/PREGEL OVERVIEW

In this section, we provide an overview of Apache Giraph, an open source implementation of Pregel.

[Figure 1: Example graph and graph partitions — (a) edge list, (b) vertex partitions P1, P2, P3, (c) subgraphs G1, G2, G3]

Giraph distributes a graph processing job to a set of workers. One of the workers acts as the master to coordinate the remaining slave workers. The set of vertices of a graph is divided into partitions. As shown in Figure 1, each partition contains a set of vertices and all their outgoing edges. Each vertex is uniquely identified by an ID, and a partitioner decides which partition a vertex belongs to based on its ID. The partitioner is also used to route messages for a vertex correctly to its partition. The default partitioner is a hash function on the vertex ID. A range partitioner or other customized partitioners can be used as well. The number of partitions is usually greater than the number of workers, to improve load balance.

Giraph employs a vertex-centric model. Each graph vertex is considered an independent computing unit that inherits from the predefined Vertex class. Each vertex has a unique ID, a vertex value, a set of outgoing edges (a set of edges in the case of an undirected graph) with an edge value associated with each edge, and a set of messages sent to it. Figure 2 shows the major functions of the Vertex class, with I as the vertex ID type, V as the vertex value type, E as the edge value type, and M as the message type.

//I: vertex ID type, V: vertex value type, E: edge value type, M: message type
void compute(); //user defined compute function
long getSuperstep(); //get the current superstep number
void sendMsg(I id, M msg);
void sendMsgToAllEdges(M msg);
void voteToHalt();
boolean isHalted();
int getNumOutEdges(); //get the number of outgoing edges
E getEdgeValue(I targetVertexId);
boolean addEdge(I targetVertexId, E edgeValue);
E removeEdge(I targetVertexId);
Iterator<I> iterator(); //iterator to all neighbors
Iterable<M> getMessages(); //get all messages to it
I getVertexId();
V getVertexValue();
void setVertexValue(V vertexValue);
void write(DataOutput out); //serialization
void readFields(DataInput in); //deserialization

Figure 2: Major (not all) functions for Vertex in Giraph.

Giraph follows the Bulk Synchronous Parallel (BSP) computation model. A typical Giraph program consists of an input step, where the graph is initialized (e.g., distributing vertices to worker machines), followed by a sequence of iterations, called supersteps, which are separated by global synchronization barriers, and finally an output step to write down the results. A vertex carries two states: active and inactive. In the beginning, all vertices are active. A vertex can voluntarily deactivate itself by calling voteToHalt(), or be passively activated by some incoming messages from other vertices. The overall program terminates if every vertex is inactive. In superstep i, each active vertex can receive messages sent by other vertices in superstep i-1, query and update the information of the current vertex and its edges, initiate graph topology mutation, communicate with global aggregation variables, and send messages to other vertices for the next superstep i+1. All this computation logic is executed in a user-defined compute() function of the Vertex class. After all active vertices finish their local compute() calls in a superstep, a global synchronization phase allows global data to be aggregated, and messages created by each vertex to be delivered to their destinations.

To reduce the number of messages transmitted and buffered across supersteps, a user can define a combiner function if only an aggregate (such as min, max, sum) of messages is required instead of the individual messages.
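The superstep loop just described — compute per active vertex, a global barrier, message delivery, voteToHalt, reactivation on incoming messages, and an optional combiner — can be sketched as a small single-process simulation. This is illustrative Python, not the actual Giraph API; all names are ours:

```python
def run_bsp(vertices, edges, compute, combine=None, max_supersteps=50):
    """vertices: {id: value}; edges: {id: [out-neighbor ids]};
    compute(vid, value, messages, superstep) -> (new_value, outgoing, halt)
    where outgoing is a list of (target, message) pairs."""
    inbox = {v: [] for v in vertices}
    active = set(vertices)                    # initially every vertex is active
    for superstep in range(max_supersteps):
        if not active:
            break                             # every vertex halted: terminate
        outbox = {v: [] for v in vertices}
        for vid in list(active):
            value, outgoing, halt = compute(vid, vertices[vid], inbox[vid], superstep)
            vertices[vid] = value
            for target, msg in outgoing:
                outbox[target].append(msg)
            if halt:
                active.discard(vid)           # voteToHalt()
        # global barrier: deliver messages; a combiner collapses them per target
        inbox = {}
        for vid in vertices:
            msgs = outbox[vid]
            if combine and msgs:
                msgs = [combine(msgs)]
            inbox[vid] = msgs
            if msgs:
                active.add(vid)               # incoming messages reactivate
    return vertices
```

For instance, an in-degree count under this loop takes exactly two supersteps: every vertex sends 1 to its out-neighbors and halts, then the vertices reactivated by messages sum them up.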

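Returning to the setup at the start of Section 2, the default hash partitioner and the many-partitions-per-worker placement can be sketched as follows (illustrative names, not Giraph's actual classes; round-robin placement is our assumption):

```python
def partition_of(vertex_id, num_partitions):
    # default partitioner: hash the vertex ID; messages addressed to a vertex
    # are routed with the same function, so they always reach its partition
    return hash(vertex_id) % num_partitions

def worker_of(partition_id, num_workers):
    # more partitions than workers improves load balance; here we simply
    # assign partitions to workers round-robin
    return partition_id % num_workers

num_partitions, num_workers = 8, 3
placement = {p: worker_of(p, num_workers) for p in range(num_partitions)}
```

With 8 partitions on 3 workers, every worker owns at least two partitions, so a straggling partition can be balanced by its co-resident ones.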
Fault tolerance in Giraph is achieved by periodic checkpointing. Users can specify the frequency of checkpointing (in terms of the number of supersteps). During checkpointing (which only happens at the beginning of a superstep), the workers save the state of all vertices, including vertex values, edge values, and all incoming messages. Once a failure is detected, the master notifies all workers to enter recovery mode. Then, in the subsequent superstep, workers reload the full state from the last checkpoint and proceed with the normal computation.

Example: Connected Component Algorithm in Giraph. Algorithm 1 shows an example of the connected component algorithm for undirected graphs implemented in Giraph. In this algorithm, the vertex value associated with each vertex is its component label. Initially in superstep 0, each vertex uses its own ID as its component label (each vertex is itself a connected component), then propagates its component label to all its neighbors. In subsequent supersteps, each vertex first finds the smallest label among the received messages. If this label is smaller than the vertex's current component label, the vertex modifies its label and propagates the new label to all its neighbors. When the algorithm finishes, the component label of each vertex is the smallest vertex ID in the corresponding connected component. To reduce the number of network messages, a combiner (lines 9-10) that computes the min of messages is used for this algorithm. For the example graph in Figure 3(a), Figure 3(b) depicts the vertex labels and the message passing in every superstep of this connected component algorithm.

Algorithm 1: Connected Component Algorithm in Giraph
 1 COMPUTE()
 2   if getSuperstep()==0 then
 3     setVertexValue(getVertexID());
 4   minValue = min(getMessages(), getVertexValue());
 5   if getSuperstep()==0 or minValue < getVertexValue() then
 6     setVertexValue(minValue);
 7     sendMsgToAllEdges(minValue);
 8   voteToHalt();
   // combiner function
 9 COMBINE(msgs)
10   return min(msgs);

[Figure 3: Example execution of connected component algorithms in vertex-centric and graph-centric models — (a) the example graph, (b) supersteps 0-6 of the vertex-centric execution, (c) subgraphs G1 and G2, (d) supersteps 0-2 of the graph-centric execution]

3. Giraph++ GRAPH-CENTRIC MODEL

In this section, we introduce the new graph-centric programming model. In a nutshell, instead of exposing the view of a single vertex to the programmers, this model opens up the entire subgraph of each partition to be programmed against.

3.1 Internal Vertices and Boundary Vertices

Just like the vertex-centric model, the graph-centric model divides the set of vertices in the original graph into partitions, as depicted in Figure 1(b). Let G = <V, E> denote the original graph with its vertices and edges, and let P1 ∪ P2 ∪ ... ∪ Pk = V be the k partitions of V, i.e. Pi ∩ Pj = ∅, ∀i ≠ j. For each partition Pi, the vertices in Pi, along with the vertices they link to, define a subgraph Gi of the original graph. Figure 1(c) shows examples of subgraphs. To be more precise, let Vi denote all the vertices that appear in the subgraph Gi. We define Vi = Pi ∪ {v | (u, v) ∈ E ∧ u ∈ Pi}. We say that any vertex u ∈ Pi is an internal vertex of Gi and any vertex v ∈ (Vi \ Pi) is a boundary vertex. In Figure 1(c), A and B are the internal vertices of G1, while D and F are its boundary vertices (shown as shaded vertices). Note that a vertex is an internal vertex in exactly one subgraph, which we call the owner of the vertex, but it can be a boundary vertex in zero or more subgraphs. For example, in Figure 1(c), A is an internal vertex in G1, and is a boundary vertex in both G2 and G3. G1 is the owner of A.

In the rest of this paper, when discussing the graph-centric model, we will refer to subgraphs as partitions when it is not ambiguous to do so.

In the Giraph++ graph-centric model, for each internal vertex in a partition, we have all the information of its vertex value, edge values, and incoming messages. For a boundary vertex in a partition, however, we only associate a vertex value with it. This vertex value is just a local temporary copy. The primary copy of the vertex value resides in its owner's corresponding internal vertex. The local copies of vertex values are essentially caches of local computation in different partitions; they have to be propagated to the primary copy through messages.

The distinction between internal vertices and boundary vertices is crucial, as in Giraph++ messages are only sent from boundary vertices to their primary copies. Because the whole subgraph structure is available in the graph-centric model, information exchange between internal vertices is cheap and immediate. The algorithm can arbitrarily change the state of any internal vertex at any point in time, without a need for a network message or a wait for the next superstep. Boundary vertex values can also be arbitrarily changed, but these changes have to be propagated to the owners through messages, at the end of the superstep.

3.2 Giraph++ Programming API

As we intend to make the graph-centric model a valuable complement to the existing vertex-centric model, our design principle for Giraph++ is to fully reuse the existing Giraph infrastructure. A program in Giraph++ is still executed as a sequence of supersteps, separated by global synchronization barriers. However, in each superstep, the computation is performed on the whole subgraph in a partition.

We reuse the Vertex class in Giraph for internal vertices and boundary vertices in Giraph++, but turn off functions that are not needed. Among the major functions shown in Figure 2, the retained functions for internal vertices are highlighted in blue and red, whereas only the ones highlighted in red are retained for the boundary vertices. Like in Giraph, each internal vertex of a partition has two states: active or inactive. However, a boundary vertex does not have any state.
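The definitions of Vi, internal vertices, and boundary vertices above can be sketched directly. The edge list below is hypothetical — it only mirrors the structure described for Figure 1 (partitions {A,B}, {C,D}, {E,F}), since the figure itself is not reproduced here:

```python
def subgraph_vertices(partition, edges):
    """partition: the set Pi of owned vertex IDs; edges: {id: [out-neighbors]}.
    Returns (internal, boundary) where Vi = internal ∪ boundary."""
    vi = set(partition)
    for u in partition:
        vi.update(edges.get(u, []))   # Vi = Pi ∪ {v | (u, v) ∈ E, u ∈ Pi}
    return set(partition), vi - set(partition)

# hypothetical example graph with three partitions
edges = {'A': ['B', 'D'], 'B': ['A'], 'C': ['D'],
         'D': ['A'], 'E': ['F'], 'F': ['A']}
P = [{'A', 'B'}, {'C', 'D'}, {'E', 'F'}]
views = [subgraph_vertices(p, edges) for p in P]
```

In this example, A is internal in the first subgraph (its owner) and a boundary vertex in the other two — the situation the text describes for vertex A in Figure 1(c).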

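To make Algorithm 1's cost behavior concrete, here is a single-machine simulation of the vertex-centric label propagation (illustrative Python, not the Giraph API, and without a combiner). The superstep count grows with the length of the longest label-propagation path:

```python
def vertex_centric_cc(edges):
    """edges: {id: [neighbors]} with both directions listed (undirected graph).
    Returns (labels, supersteps that sent messages, total messages sent)."""
    labels = {}
    inbox = {v: [] for v in edges}
    superstep, messages = 0, 0
    while True:
        outbox = {v: [] for v in edges}
        sent = False
        for v in edges:
            if superstep == 0:
                labels[v] = v            # initial label: own ID
                candidates = [v]
            elif inbox[v]:
                candidates = inbox[v] + [labels[v]]
            else:
                continue                 # halted, no incoming messages
            m = min(candidates)
            if superstep == 0 or m < labels[v]:
                labels[v] = m            # adopt the smaller label and forward it
                for n in edges[v]:
                    outbox[n].append(m)
                    messages += 1
                sent = True
        if not sent:
            break
        inbox = outbox
        superstep += 1
    return labels, superstep, messages
```

On a four-vertex path the label of vertex 0 needs one superstep per hop to reach the far end, illustrating the "one hop at a time" limitation discussed in the introduction.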
//I: vertex ID type, V: vertex value type, E: edge value type, M: message type
void compute(); //user defined compute function
void allVoteToHalt(); //all vertices vote to halt
long getSuperstep(); //get the current superstep number
void sendMsg(I id, M msg);
boolean containsVertex(I id);
boolean isInternalVertex(I id);
boolean isBoundaryVertex(I id);
Vertex<I, V, E, M> getVertex(I id);
Collection<Vertex<I, V, E, M>> internalVertices();
Collection<Vertex<I, V, E, M>> activeInternalVertices();
Collection<Vertex<I, V, E, M>> boundaryVertices();
Collection<Vertex<I, V, E, M>> allVertices();
void write(DataOutput out); //serialization
void readFields(DataInput in); //deserialization

Figure 4: Major functions for GraphPartition.

To support the graph-centric programming model, we introduce a new class called GraphPartition. Figure 4 lists the major functions of this class. This class allows users to 1) access all vertices in a graph partition, either internal or boundary; 2) check whether a particular vertex is internal, boundary, or neither; 3) send messages to internal vertices of other partitions; and 4) collectively deactivate all internal vertices in this partition. The user-defined compute() function in the GraphPartition class operates on the whole subgraph instead of on an individual vertex.

Fault Tolerance in Giraph++. Like Giraph, Giraph++ achieves fault tolerance by periodic checkpointing. During each checkpoint, Giraph++ saves all the vertices (with their states), edges (with their states), and messages in each partition. Furthermore, since users can freely define auxiliary data structures inside the GraphPartition class, the checkpointing scheme also calls the GraphPartition.write() function to serialize the auxiliary data structures. During failure recovery, besides reading back the vertices, edges, and messages, Giraph++ also deserializes the auxiliary data structures by calling the GraphPartition.readFields() function.

Example: Connected Component Algorithm in Giraph++. Algorithm 2 demonstrates how the connected component algorithm is implemented in Giraph++. For the example graph in Figure 3(a), Figure 3(c) and 3(d) depict the subgraphs of its two partitions and the execution of the graph-centric algorithm, respectively.

Algorithm 2: Connected Component Algorithm in Giraph++
 1 COMPUTE()
 2   if getSuperstep()==0 then
 3     sequentialCC(); // run a sequential CC algorithm
 4     foreach bv IN boundaryVertices() do
 5       sendMsg(bv.getVertexId(), bv.getVertexValue());
 6   else
 7     equiCC = ∅; // store equivalent CCs
 8     foreach iv IN activeInternalVertices() do
 9       minValue = min(iv.getMessages());
10       if minValue < iv.getVertexValue() then
11         equiCC.add(iv.getVertexValue(), minValue);
12     equiCC.consolidate(); // get min for equivalent CCs
13     foreach iv IN internalVertices() do
14       changedTo = equiCC.uniqueLabel(iv.getVertexValue());
15       iv.setVertexValue(changedTo);
16     foreach bv IN boundaryVertices() do
17       changedTo = equiCC.uniqueLabel(bv.getVertexValue());
18       if changedTo != bv.getVertexValue() then
19         bv.setVertexValue(changedTo);
20         sendMsg(bv.getVertexId(), bv.getVertexValue());
21   allVoteToHalt();

Sequential connected component algorithms have been well studied in the graph literature. Since the graph-centric programming model exposes the whole subgraph in a partition, an existing sequential algorithm can be utilized to detect the connected components in each graph partition. If a set of vertices belong to the same connected component in a partition, then they also belong to the same connected component in the original graph. After information is exchanged across different partitions, some small connected components will start to merge into larger connected components.

Exploiting the above property, superstep 0 first runs a sequential connected component algorithm (we use a breadth-first-search based algorithm) on the subgraph of each graph partition and then sends the locally computed component label of each boundary vertex to its corresponding owner's internal vertex. For the example in Figure 3(a), superstep 0 finds one connected component in the subgraph G1 and assigns the smallest label A to all its vertices, including the boundary vertex D. Similarly, one connected component with label C is detected in G2. Messages with the component labels are then sent to the owners of the boundary vertices.

In each of the subsequent supersteps, the algorithm processes all the incoming messages, uses them to find out which component labels actually represent equivalent components (i.e., they will be merged into a larger component), and stores these pairs in a data structure called equiCC. In the above example, vertex D in superstep 1 receives the message A from G1, while its previous component label is C. Thus, the pair (A, C) is put into equiCC to indicate that the connected components labeled A and C need to be merged. In the equiCC.consolidate() function, we use the smallest label as the unique label for each set of equivalent components. In our example, the new label for the merged components should be A. Then the unique labels are used to update the component labels of all the vertices in the partition. If a boundary vertex's component label is changed, a message is sent to its owner's corresponding internal vertex.

Comparing the two executions illustrated in Figures 3(b) and 3(d), the graph-centric algorithm needs substantially fewer messages and supersteps. In superstep 0, all the vertices in P1 already converge to their final labels. It only takes another 2 supersteps for the whole graph to converge. Note that the same combiner function used in Algorithm 1 can also be used in Algorithm 2.

We argue that the graph-centric programming model in Giraph++ is more general and flexible than the vertex-centric model. The graph-centric model can mimic the vertex-centric model by simply iterating through all the active internal vertices and performing vertex-oriented computation. In other words, any algorithm that can be implemented in the vertex-centric model can also be implemented in the graph-centric model. However, the performance of some algorithms can substantially benefit from the graph-centric model. The connected component algorithm in Algorithm 2 is one such example. More examples of the graph-centric model's superior flexibility will be shown in Section 4.

4. EXPLOITING Giraph++

In this section, we explore the application of our proposed Giraph++ graph-centric programming model to three categories of graph algorithms: graph traversal, random walk, and graph aggregation. For each category, we pick one representative algorithm to illustrate how it is implemented in the graph-centric paradigm.

4.1 Graph Traversal

Graph traversal represents a category of graph algorithms that need to visit all the vertices of a graph in a certain manner, while checking and updating the values of vertices along the way. These algorithms often involve a search over the graph.
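A single-machine sketch of Algorithm 2's structure: a sequential connected component pass over each partition's subgraph, an exchange of boundary labels, and an equiCC-style merge of equivalent labels, modeled here with a union-find. Illustrative Python, not the Giraph++ API:

```python
def local_components(view, edges):
    """Sequential (BFS-style) connected components of the subgraph on `view`."""
    comps, seen = [], set()
    for s in view:
        if s in seen:
            continue
        comp, queue = [], [s]
        seen.add(s)
        while queue:
            u = queue.pop()
            comp.append(u)
            for v in edges.get(u, []):
                if v in view and v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def graph_centric_cc(partitions, edges):
    """partitions: list of sets (the Pi); edges: undirected adjacency lists.
    Returns final labels and the number of cross-partition exchange rounds."""
    labels = {v: v for v in edges}
    rounds = 0
    while True:
        parent = {l: l for l in set(labels.values())}  # equiCC as union-find
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for part in partitions:
            # a partition's subgraph: internal vertices plus boundary copies
            view = set(part) | {v for u in part for v in edges[u]}
            for comp in local_components(view, edges):
                roots = [find(labels[v]) for v in comp]
                target = min(roots)
                for r in roots:
                    parent[find(r)] = target   # these labels are equivalent
        new_labels = {v: find(labels[v]) for v in edges}
        if new_labels == labels:
            return labels, rounds
        labels = new_labels
        rounds += 1
```

On a four-vertex path split into two partitions, a single cross-partition exchange round suffices, in contrast to the per-hop supersteps of the vertex-centric version.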

Examples include computing shortest distances, connected components, transitive closures, etc. Many such algorithms are well studied, with sequential implementations readily available in textbooks and online resources. We just examined the implementation of one such algorithm in the vertex-centric and the graph-centric models, in Algorithms 1 and 2, respectively. In the graph-centric model, we applied an existing sequential algorithm to the local subgraph of each partition in superstep 0, and afterwards only propagated messages through the boundary vertices. In the subsequent supersteps, the messages to a vertex can result in value updates of multiple vertices, thus requiring fewer supersteps than a corresponding vertex-centric implementation.

4.2 Random Walk

The algorithms of the second category are all based on the random walk model. This category includes algorithms like HITS [13], PageRank [5], and its variations, such as ObjectRank [2]. In this section, we use PageRank as the representative random walk algorithm.

Algorithm 3 shows the pseudo code of the PageRank algorithm (using damping factor 0.85) implemented in the vertex-centric model. This is not the classic PageRank implementation, which iteratively updates the PageRank based on the values from the previous iteration as in PR_v^i = d × Σ_{u | (u,v) ∈ E} (PR_u^{i-1} / |E_u|) + (1 − d), where |E_u| is the number of outgoing edges of u. Instead, Algorithm 3 follows the accumulative iterative update approach proposed in [25], and incrementally accumulates the intermediate updates into an existing PageRank. It has been proved in [25] that this accumulative update approach converges to the same values as the classic PageRank algorithm. One advantage of this incremental implementation of PageRank is the ability to apply increment values asynchronously, which we will leverage in the graph-centric model below.

Algorithm 3: PageRank Algorithm in Giraph
 1 COMPUTE()
 2   if getSuperstep() ≤ MAX_ITERATION then
 3     delta = 0;
 4     if getSuperstep()==0 then
 5       setVertexValue(0);
 6       delta += 0.15;
 7     delta += sum(getMessages());
 8     if delta > 0 then
 9       setVertexValue(getVertexValue()+delta);
10       sendMsgToAllEdges(0.85*delta/getNumOutEdges());
11   voteToHalt();
   // combiner function
12 COMBINE(msgs)
13   return sum(msgs);

Algorithm 4 shows the PageRank algorithm implemented in the graph-centric model. This algorithm also follows the accumulative update approach. However, there are two crucial differences: (1) Besides the PageRank score, the value of each vertex contains an extra attribute, delta, which caches the intermediate updates received from other vertices in the same partition (lines 16-18). (2) Local PageRank computation is asynchronous, as it utilizes the partial results of other vertices from the same superstep (line 14). Asynchrony has been demonstrated to accelerate the convergence of iterative computation in many cases [8], including PageRank [14]. Our graph-centric programming paradigm allows this local asynchrony to be naturally expressed in the algorithm. Note that both the vertex-centric and the graph-centric algorithms can benefit from a combiner that computes the sum of all messages (lines 12-13 of Algorithm 3).

Algorithm 4: PageRank Algorithm in Giraph++
 1 COMPUTE()
 2   if getSuperstep() > MAX_ITERATION then
 3     allVoteToHalt();
 4   else
 5     if getSuperstep()==0 then
 6       foreach v IN allVertices() do
 7         v.getVertexValue().pr=0;
 8         v.getVertexValue().delta=0;
 9     foreach iv IN activeInternalVertices() do
10       if getSuperstep()==0 then
11         iv.getVertexValue().delta+=0.15;
12       iv.getVertexValue().delta+=sum(iv.getMessages());
13       if iv.getVertexValue().delta > 0 then
14         iv.getVertexValue().pr+=iv.getVertexValue().delta;
15         update=0.85*iv.getVertexValue().delta/iv.getNumOutEdges();
16         while iv.iterator().hasNext() do
17           neighbor=getVertex(iv.iterator().next());
18           neighbor.getVertexValue().delta+=update;
19         iv.getVertexValue().delta=0;
20     foreach bv IN boundaryVertices() do
21       if bv.getVertexValue().delta > 0 then
22         sendMsg(bv.getVertexId(), bv.getVertexValue().delta);
23         bv.getVertexValue().delta=0;

[Figure 5: Graph coarsening example — (a) a graph with a maximal matching, (b) the coarsened graph with super vertices AB, CE, DF and aggregated vertex and edge weights]

4.3 Graph Aggregation

The third category of graph algorithms, which we call graph aggregation, is used to condense a large graph into a structurally similar but smaller graph by collapsing vertices and/or edges. Prominent examples of graph aggregation are graph summarization [22], graph sparsification [19], and graph coarsening [12, 11]. These algorithms are more sophisticated than graph traversal or random walk algorithms. They typically involve mutation of the graph structure by adding and removing vertices and/or edges. As a result, platforms that do not support graph mutation, such as Grace [23] and GraphLab [14], cannot efficiently support these algorithms. We pick graph coarsening as the representative graph aggregation algorithm to implement in the graph-centric paradigm.

4.3.1 Background on Graph Coarsening

Graph coarsening is often used as a first step of a graph partitioning algorithm, to reduce the complexity of the graph. We implemented a modification of the graph coarsening algorithm used in the ParMetis parallel graph partitioning tool [11]. Their coarsening algorithm works on an undirected graph and executes a number of coarsening rounds to generate smaller and smaller graphs. In each round, the algorithm first finds a maximal matching of the current graph. As shown in Figure 5(a), a matching M ⊆ E is a subset of edges, such that no two edges in M share a common incident vertex. We call a matching M maximal if it is not a proper subset of another matching of the graph. After finding a maximal matching, the algorithm collapses the incident vertices of each edge in the matching into a super vertex. During this coarsening process, the algorithm keeps a weight for each vertex and each edge.
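The accumulative update scheme of Algorithm 3 can be checked with a small synchronous simulation (illustrative Python, not the Giraph API; damping factor 0.85, with the (1−d) term seeded as the initial delta of 0.15, and every vertex assumed to have at least one outgoing edge):

```python
def accumulative_pagerank(edges, max_supersteps=200):
    """edges: {vertex: [out-neighbors]}; returns accumulated PageRank scores."""
    pr = {v: 0.0 for v in edges}
    inbox = {v: [0.15] for v in edges}   # superstep 0: delta starts at 0.15
    for _ in range(max_supersteps):
        outbox = {v: [] for v in edges}
        for v in edges:
            delta = sum(inbox[v])        # a sum-combiner would do this on the wire
            if delta > 0:
                pr[v] += delta           # apply the accumulated increment
                share = 0.85 * delta / len(edges[v])
                for n in edges[v]:       # forward the damped share of the delta
                    outbox[n].append(share)
        inbox = outbox                   # barrier: deliver the messages
    return pr
```

After enough supersteps, the scores satisfy the classic fixed point PR_v = 0.15 + 0.85 × Σ_{u→v} PR_u/outdeg(u), which is the convergence property proved for the accumulative approach in [25].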

Initially, the weights are all 1. The weight of a super vertex is the sum of the weights of all vertices collapsed into it. An edge between two super vertices is an aggregation of edges between the original vertices, so its weight is the sum of the individual edge weights. Figure 5(b) demonstrates an example coarsened graph of Figure 5(a).

In ParMetis [11], finding a maximal matching is done in a number of phases. In phase i, a processor randomly iterates through its local unmatched vertices. For each such vertex u, it uses a heavy-edge heuristic to match u with another unmatched vertex v, if there is any. If v is local, the match is established immediately. Otherwise, a match request is sent to the processor that owns v, conditioned upon the order of u and v: if i is even, a match request is sent only when u > v; otherwise, a request is sent only when u < v. This ordering constraint is used to avoid conflicts when both incident vertices of an edge try to match to each other in the same communication step. In phase i + 1, multiple match requests for a vertex v are resolved by breaking conflicts arbitrarily. If a match request from u is granted, a notification is sent back to u. This matching process finishes when a large fraction of vertices are matched.

[Figure 6: Examples in the matching phase of graph coarsening]

4.3.2 Graph Coarsening in Giraph++

The graph coarsening algorithm in ParMetis can be naturally implemented in the graph-centric programming model. The (super) vertex during the coarsening process is represented by the Vertex class in Giraph++. When two (super) vertices are collapsed together, we always reuse one of the (super) vertices. In other words, we merge one of the vertices into the other. After a merge, however, we do not delete the vertex that has been merged. We delete all its edges and declare it inactive, but utilize its vertex value to remember which vertex it has been merged into.

Algorithm Overview. The graph coarsening implementation in our graph-centric model follows a similar process as the parallel ParMetis algorithm: the algorithm executes iteratively in a sequence of coarsening rounds. Each coarsening round consists of m matching phases followed by a collapsing phase. Each of the matching phases is completed by 2 supersteps and the collapsing phase corresponds to a single superstep. We empirically observed that m = 4 is a good number to ensure that a large fraction of the vertices are matched in each coarsening round. Instead of following exactly the same procedure as ParMetis, we add an important extension to the coarsening algorithm to specially handle 1-degree vertices. It has been observed that most real graphs follow a power law distribution, which means a large number of vertices have very low degree. 1-degree vertices can be specially handled to improve the coarsening rate, by simply merging them into their only neighbors. This merge is done in two supersteps to resolve conflicts that arise if two vertices only connect to each other and nothing else.

Vertex Data Structure. The value associated with each vertex consists of the following four attributes: (1) state keeps track of which state the vertex is currently in. It can take one of the 4 values: NORMAL, MATCHREQUESTED, MERGED, and MERGEHOST. NORMAL indicates that the vertex is normal, i.e., ready to do any action; MATCHREQUESTED means that the vertex just sent out a match request; MERGED denotes that the vertex is or will be merged into another vertex; and MERGEHOST means that another vertex will be merged into this vertex. (2) mergedTo records the id of the vertex that this vertex is merged into, so that we can reconstruct the member vertices of a super vertex. This attribute is legitimate only when state=MERGED. (3) weight keeps track of the weight of the (super) vertex during the coarsening process. (4) replacements stores all the pairs of vertex replacements in order to guarantee the correctness of graph structure changes during a merge. For example, consider a graph where A is connected to B, which in turn links to C. If B is matched and merged to A, B needs to notify C that C's connection to B will be replaced by a connection to A after the merge. In this case, B → A is stored in C's replacements before the actual merge happens. Besides vertex values, the value associated with an edge is the weight of the (super) edge.

After dealing with 1-degree vertices in the first 2 supersteps, the algorithm executes in iterations of m matching phases and a collapsing phase.

Matching Phase. Each matching phase consists of 2 supersteps: a match request step and a match notification step. In a match request step, the algorithm randomly scans over the vertices with state=NORMAL. For each such vertex u, it looks through the edges that are not in the replacements list (as vertices in replacements have already been matched), and chooses the one with the highest weight, say (u, v), breaking ties arbitrarily. If v is local and v.state=NORMAL, we establish the match by setting u.state=MERGED, u.mergedTo=v and v.state=MERGEHOST, and notify each of u's neighbors to add u → v to its replacements, through either a local operation or a message depending on whether the neighbor is local or not. Otherwise, if v is remote, a match request message will be sent to v based on the ordering constraints. If a match request is sent, we set u.state=MATCHREQUESTED.

In a match notification step, when a vertex v receives match requests, we first check the state of v. If v.state is MERGED or MERGEHOST, we ignore all the match requests, since v is already matched. If v.state is MATCHREQUESTED, we also ignore the match requests. This is to avoid the conflict when v's match request to w is granted but at the same time v has already granted the match request from u. Under the definition of graph matching, no two matches should share a common vertex. This scenario is demonstrated in Figure 6(a). The ordering constraint doesn't help in this chain case, as the vertices could happen to be ordered u < v < w and the phase number i could happen to be even. Therefore, we use state=MATCHREQUESTED to break chains. In summary, match requests to v are only considered when v.state=NORMAL. Then we choose the request with the heaviest edge weight, say from u, and send u a match notification. At the same time, we change v.state=MERGED and v.mergedTo=u, then notify v's neighbors to add v → u to their replacements.

When a match notification is received by a vertex u at the beginning of the next match request step, we simply change u.state=MERGEHOST. On the other hand, if no match notification is received and u.state=MATCHREQUESTED, we change u's state back to NORMAL.

We have discussed above that a chain of match requests needs to be broken in order to avoid conflicts. As shown in Figure 6(b), for a hub vertex, potentially many other vertices could send match requests to it. However, as long as the hub vertex has issued a match request, all the requests that it receives will be ignored. In order to give a fair chance to all the match requests, we add a probability to decide whether a vertex u with state=NORMAL should send a match request or not. This probability is defined as p_u = 1/(1 + log|E_u|), where |E_u| is the number of edges incident on u. Based on this probability, a hub vertex is less likely to send a match request, and thus more likely to be the receiving end of match requests.
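The heavy-edge choice and the hub-damping probability described above can be sketched in a few lines (the function names and data layout are illustrative, not the paper's implementation):

```python
import math

# Vertex states used by the coarsening algorithm.
NORMAL, MATCH_REQUESTED, MERGED, MERGE_HOST = range(4)

def request_probability(degree):
    """p_u = 1 / (1 + log|E_u|): hub vertices rarely initiate match
    requests, so they mostly act as receivers, which helps break
    chains of requests around high-degree vertices."""
    return 1.0 / (1.0 + math.log(degree))

def choose_match_target(edges, replacements):
    """Heavy-edge heuristic: among incident edges whose endpoints are
    not already matched (i.e. not listed in `replacements`), pick the
    neighbor connected by the largest weight; ties broken arbitrarily.
    edges: dict mapping neighbor id -> edge weight."""
    candidates = [(w, v) for v, w in edges.items() if v not in replacements]
    if not candidates:
        return None
    return max(candidates)[1]
```

For a 1-degree vertex the probability is 1 (log 1 = 0), so low-degree vertices almost always initiate requests, while a vertex with thousands of incident edges initiates only rarely.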

Collapsing Phase. After all the matches are established, in the collapsing phase, each vertex first processes all the replacements on its edges. After that, if an active vertex u has state=MERGED, it needs to be merged into the target vertex with id=u.mergedTo. If the target vertex is local, the merge is processed immediately. Otherwise, a merge request is sent to the target vertex with u's weight and all its edges with their weights. Next, we remove all u's edges but keep u.state and u.mergedTo so that later we can trace back who is merged into whom. After all the merges are done, if a vertex has state=MERGEHOST, we set it back to NORMAL to participate in the next round of coarsening.

5. A HYBRID MODEL

In the previous section, we have shown how the graph-centric model can benefit a number of graph algorithms. One of the major reasons for the improved performance under the graph-centric model is the allowance of local asynchrony in the computation (e.g. the PageRank algorithm in Algorithm 4): a message sent to a vertex in the same partition can be processed by the receiver in the same superstep. In addition to using the flexible graph-centric programming model, asynchrony can also be achieved by a system optimization while keeping the same vertex-centric programming model. We call this approach the hybrid model.

To implement the hybrid model, we differentiate the messages sent from one partition to the vertices of the same partition, called internal messages, from the ones sent to the vertices of a different partition, called external messages. We keep two separate incoming message buffers for each internal vertex: one for the internal messages, called inbox_in, and one for the external messages, called inbox_ex. The external messages are handled using exactly the same message passing mechanism as in the vertex-centric model. An external message sent in superstep i can only be seen by the receiver in superstep i + 1. In contrast, an internal message is directly placed into inbox_in and can be utilized immediately in the vertex's computation during the same superstep, since both the sender and the receiver are in the same partition. Suppose vertices A and B are in the same partition, and during a superstep i, A sends B an internal message M. This message is immediately put into B's inbox_in in the same superstep (with a proper locking mechanism to ensure consistency). If B is processed later in the same superstep, all messages in B's inbox_in and inbox_ex, including M, will be utilized to perform B's compute() function. On the other hand, if B is already processed before M is sent in superstep i, then M will be kept in the message buffer until B is processed in the next superstep i + 1. To reduce the overhead of maintaining the message buffer, we apply the user-defined combine() function on the internal messages, whenever a combiner is provided.

Under this hybrid model, we can keep exactly the same connected component algorithm in Algorithm 1 and the PageRank algorithm in Algorithm 3 designed for the vertex-centric model, while still benefiting from the asynchronous computation. However, one needs to be cautious when using the hybrid model. First of all, not all graph problems can benefit from asynchrony. Furthermore, blindly running a vertex-centric algorithm in the hybrid mode is not always safe. For example, the vertex-centric graph coarsening algorithm won't work under the hybrid model without change. This is because the graph coarsening algorithm requires different types of messages to be processed at different stages of the computation. The hybrid model would mix messages from different stages and confuse the computation. Even for PageRank, although our specially designed accumulative iterative update algorithm works without change in the hybrid model, the classic PageRank algorithm won't fly in the hybrid model without change. We point out that similar care also needs to be taken when designing algorithms in other systems that support asynchronous computation.

Note that GraphLab and Grace also allow asynchronous computation while keeping the vertex-centric model. Both systems achieve this goal mainly through the customization of different vertex scheduling policies in the systems. However, one price paid for supporting asynchrony in their systems is that the scheduler cannot handle mutation of graphs. In fact, both GraphLab and Grace require the graph structure to be immutable. This, in turn, limits their applicability to any algorithm that mutates the structure of graphs, such as all the graph aggregation algorithms discussed in Section 4.3. In comparison, Giraph++ does not have such a conflict of interests: programmers can freely express asynchronous computation in the graph-centric model, while the system maintains the ability to handle graph mutation.

6. EXPERIMENTAL EVALUATION

In this section, we empirically compare the performance of the vertex-centric, the hybrid, and the graph-centric models. For fair comparison among the different models, we implemented all of them in the same Giraph++ system. We refer to these implementations as Giraph++ Vertex Mode (VM), Hybrid Mode (HM), and Graph Mode (GM), respectively. We expect that the relative performance trends that we observed in this evaluation will hold for other graph processing systems, though such a study is beyond the scope of this paper.

6.1 Experimental Setup

We used four real web graph datasets, shown in Table 1, for all of our experiments. The first three datasets, uk-2002, uk-2005 and webbase-2001, were downloaded from law.di.unimi.it/datasets.php. These datasets were provided by the WebGraph [4] and the LLP [3] projects. The last dataset, clueweb50m, was downloaded from boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=Web+Graph. It is the TREC 2009 Category B dataset. All four datasets are directed graphs. Since some algorithms we studied are for undirected graphs, we also converted the four directed graphs into undirected graphs. The numbers of edges in the undirected versions of the graphs are shown in the 4th column of Table 1. These four datasets are good representatives of real life graphs with heavy-tail degree distribution. For example, although the uk-2005 graph has an average degree of 23.7, the largest in-degree of a vertex is 1,776,852 and the largest out-degree of a vertex is 5,213.

All experiments were conducted on a cluster of 10 IBM System x iDataPlex dx340 servers. Each consisted of two quad-core Intel Xeon E5540 64-bit 2.8GHz processors and 32GB RAM, and the servers were interconnected using 1Gbit Ethernet. Each server ran Ubuntu Linux (kernel version 2.6.32-24) and Java 1.6. Giraph++ was implemented based on a version of Apache Giraph downloaded in June 2012, which supports two protocols for message passing: Hadoop RPC and Netty (https://netty.io). We chose Netty (by setting -Dgiraph.useNetty=true), since it proved to be more stable. Each server was configured to run up to 6 workers concurrently. Since a Giraph++ job requires one worker to be the master, there were at most 59 slave workers running concurrently in the cluster.

Note that in any of the modes, the same algorithm running on the same input will have some common overhead cost, such as setting up a job, reading the input, shutting down the job, and writing the final output. This overhead stays largely constant in VM, HM, and GM, regardless of the data partitioning strategy. For our largest dataset (webbase-2001), the common overhead is around 113 seconds, 108 seconds, and 272 seconds, for connected component, PageRank and graph coarsening, respectively.
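To make the two-buffer scheme of Section 5 concrete, here is a minimal single-partition sketch (the class and attribute names such as HybridPartition are illustrative, not the actual Giraph++ API): internal messages are placed into inbox_in at send time, so a local receiver processed later in the same superstep consumes them immediately, while external messages always wait one superstep.

```python
from collections import defaultdict

class HybridPartition:
    """Toy model of the hybrid model's message buffers. An internal
    message left unconsumed at the end of a superstep simply waits in
    inbox_in for the next one; external messages go through inbox_ex
    and only become visible in the following superstep."""

    def __init__(self, vertices, compute):
        self.vertices = list(vertices)      # processing order within a superstep
        self.owned = set(vertices)          # ids owned by this partition
        self.compute = compute              # compute(v, msgs) -> list of (dst, msg)
        self.inbox_in = defaultdict(list)   # internal messages
        self.inbox_ex = defaultdict(list)   # external messages

    def superstep(self, external_arrivals):
        """external_arrivals: (dst, msg) pairs sent by other partitions in
        the previous superstep. Returns messages bound for other partitions."""
        for dst, msg in external_arrivals:
            self.inbox_ex[dst].append(msg)
        outgoing = []
        for v in self.vertices:
            msgs = self.inbox_in.pop(v, []) + self.inbox_ex.pop(v, [])
            for dst, msg in self.compute(v, msgs):
                if dst in self.owned:
                    self.inbox_in[dst].append(msg)  # same-superstep delivery
                else:
                    outgoing.append((dst, msg))     # crosses partitions
        return outgoing
```

With vertices processed in increasing order and a compute() that forwards each message to the next vertex, a message injected at vertex 0 traverses the entire local chain within a single superstep; reversing the processing order makes every hop wait a superstep, which mirrors the purely vertex-centric behavior.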

As many graph algorithms (e.g. PageRank and graph coarsening) require 100s of supersteps, this cost is amortized quickly. For example, if we run 100 supersteps for PageRank and graph coarsening, then even for the fastest execution times on this dataset (GM with the graph partitioning strategy), the common overhead accounts for around 7.7% and 8.7% of the total execution times, respectively. For the connected component algorithm, this cost is more noticeable, however the high-level trends are the same. Even with overheads included in the execution time, GM is 3X faster than VM on hash partitioned data (instead of 3.1X, if discounting the common overhead) for the connected component algorithm. The same speedup on graph partitioned data is 22X (instead of 63X, if discounting the common overhead). In order to focus on the difference in the processing part of VM, HM, and GM, we excluded the common overhead from the reported execution time for all the experiments. In addition, except for the experiments in Section 6.6, we turned off checkpointing to eliminate its overhead.

Table 1: Datasets characteristics

                                                                  partitioning   directed, HP    directed, GP    undirected, HP   undirected, GP
dataset        #nodes       #edges (dir)    #edges (undir)  #partns  time (sec.)  ncut    imblc   ncut    imblc   ncut    imblc    ncut    imblc
uk-2002        18,520,486   298,113,762     261,787,258     177      1,082        0.97    1.01    0.02    2.15    0.99    1.06     0.02    2.24
uk-2005        39,459,925   936,364,282     783,027,125     295      6,891        0.98    1.01    0.06    7.39    0.997   1.61     0.06    7.11
webbase-2001   118,142,155  1,019,903,190   854,809,761     826      4,238        0.97    1.03    0.03    3.78    0.999   1.41     0.03    5.05
clueweb50m     428,136,613  454,075,604     446,769,872     2,891    4,614        0.9997  1.07    0.07    5.56    0.9997  1.96     0.06    6.95

[Figure 7: The execution time and network messages per superstep for connected component detection on uk-2002 dataset. (a) Execution Time; (b) Network Messages]

6.1.1 Scalable Graph Partitioning

Graph partitioning plays a crucial role in distributed graph processing. For each experiment, we need to first decide on the number of partitions. Intuitively, a larger partition size would benefit GM and HM, as it increases the chance that neighboring vertices belong to the same partition. However, smaller partitions increase the potential degree of parallelism and help balance the workload across the cluster. We observed empirically that when each partition contained 100,000 to 150,000 vertices, all three modes performed well. Therefore, we heuristically set the number of partitions for each graph dataset so that it is a multiple of 59 (the number of slave workers) and each partition contains around 100,000 to 150,000 vertices. The number of partitions for each dataset used in our experiments is shown in column 5 of Table 1.

An even more important question is how to partition a graph. A good partitioning strategy should minimize the number of edges that connect different partitions, to potentially reduce the number of messages during a distributed computation. Most distributed graph processing systems, such as Giraph [1] and GraphLab [14], by default use random hash to partition graphs. Obviously, this random partitioning results in a large number of edges crossing partition boundaries. To quantify this property, we use the well-known Average Normalized Cut measure. The Normalized Cut of a partition P, denoted ncut(P), is defined as the fraction of edges linking vertices in P to vertices in other partitions among all the outgoing edges of vertices in P, i.e. ncut(P) = |{(u, v) ∈ E | u ∈ P ∧ v ∉ P}| / |{(u, v) ∈ E | u ∈ P}|. The average ncut of all the partitions can be used to measure the quality of a graph partitioning. As shown in column 7 and column 11 of Table 1, the average ncuts for hash partitioning across different datasets are all very close to 1, which means that almost all the edges cross partition boundaries.

GraphLab [14] and the work in [10] also proposed to employ the Metis [12] (sequential) or the ParMetis [11] (parallel) graph partitioning tools to generate better partitions. However, Metis and ParMetis cannot help when the input graph becomes too big to fit in the memory of a single machine.

We implemented a scalable partitioning approach based on the distributed graph coarsening algorithm described in Section 4.3.2. This algorithm mimics the parallel multi-level k-way graph partitioning algorithm in [11], but is simpler and more scalable. There are 3 phases in this algorithm: a coarsening phase, a partitioning phase, and an uncoarsening phase. In the coarsening phase, we apply the distributed graph coarsening algorithm in Section 4.3.2 to reduce the input graph to a manageable size that can fit in a single machine. Then, in the partitioning phase, a single-node sequential or parallel graph partitioning algorithm is applied to the coarsened graph. We simply use the ParMetis algorithm in this step. At last, in the uncoarsening phase, we project the partitions back to the original graph. This phase does not apply any refinements on the partitions as in [11]. However, uncoarsening has to be executed in a distributed fashion as well. Recall that in the coarsening phase each vertex that is merged into another has an attribute called mergedTo to keep track of the host of the merger. This attribute can help us derive the membership information for each partition. Suppose that a vertex A is merged into B, which in turn is merged into C, and finally C belongs to a partition P. We can use the mergedTo attribute to form an edge in a membership forest. In this example, we have an edge between A and B, and an edge between B and C. Finding which vertices ultimately belong to the partition P is essentially finding the connected components in the membership forest. Thus, we use Algorithm 2 in the last stage of graph partitioning.

Column 6 of Table 1 shows the total graph partitioning time, with graph coarsening running 100 supersteps in the graph-centric model. The average ncuts of the graph partitions produced by this algorithm are listed in columns 9 and 13. On average, only 2% to 7% of edges go across partitions.

There is another important property of graph partitioning that affects the performance of a distributed graph algorithm: load balance.
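The membership-forest resolution used in the uncoarsening phase can be sketched sequentially (in the paper this resolution runs distributed via connected components, Algorithm 2; the helper project_partitions and the path compression here are illustrative):

```python
def project_partitions(merged_to, coarse_part):
    """Project partition assignments of the coarsened graph back to the
    original vertices. merged_to: vertex -> vertex it was merged into
    (the mergedTo attribute); coarse_part: surviving super vertex ->
    partition id. Every vertex inherits the partition of the root of
    its chain of mergedTo pointers."""
    def root(v):
        chain = []
        while v in merged_to:          # v was merged into merged_to[v]
            chain.append(v)
            v = merged_to[v]
        for u in chain:                # path compression for repeated lookups
            merged_to[u] = v
        return v
    return {v: coarse_part[root(v)]
            for v in set(merged_to) | set(coarse_part)}
```

For the example in the text, with A merged into B, B merged into C, and C assigned to partition P, all of A, B and C end up labeled with P.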

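The Average Normalized Cut measure from Section 6.1.1 can be computed with a short utility (an illustrative helper, not the paper's measurement code): for each partition P, ncut(P) is the number of edges leaving P divided by the number of edges whose source lies in P, averaged over all partitions.

```python
def average_ncut(edges, part_of):
    """edges: iterable of directed (u, v) pairs; part_of: vertex ->
    partition id. Returns the average of ncut(P) over all partitions
    that own at least one outgoing edge."""
    cut = {}      # partition -> edges crossing out of it
    total = {}    # partition -> edges originating in it
    for u, v in edges:
        p = part_of[u]
        total[p] = total.get(p, 0) + 1
        if part_of[v] != p:
            cut[p] = cut.get(p, 0) + 1
    ncuts = [cut.get(p, 0) / total[p] for p in total]
    return sum(ncuts) / len(ncuts)
```

On a random hash partitioning of a large graph into k partitions, roughly a (k-1)/k fraction of edges cross partitions, which is why the HP columns of Table 1 are all close to 1.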
9 execution time (sec) network messages (millions) number of supersteps hash partitioned (HP) graph partitioned (GP) graph partitioned (GP) graph partitioned (GP) hash partitioned (HP) hash partitioned (HP) GM GM HM GM VM HM HM VM HM VM VM HM GM VM HM GM VM dataset GM 438 272 532 89 19 3,315 uk-2002 1,937 3,414 17 8.7 33 32 19 33 19 5 441 3,129 1,366 723 1,700 230 90 11,361 10,185 5,188 10,725 1,354 225 22 22 16 22 12 5 uk-2005 370 4,491 4,405 1,427 3,599 1,565 57 13,348 11,198 6,581 11,819 136 58 605 604 39 605 319 5 webbase-2001 69 1,875 1,072 250 103 6,391 5,308 2,703 6,331 129 1,163 38 37 14 38 18 5 clueweb50m 2,103 Table 2: Total execution time, network messages, and number of supersteps for connected component detection HP or GP, GM performs significantly better as it generates much For many algorithms, running time for each partition is significantly fewer network messages per superstep. Furthermore, GM reduces affected by its number of edges. Therefore, we also need to measure load the number of supersteps needed for the algorithm. (4) GM benefits how the edges are distributed across partitions. We define the of a graph partitioning as the maximum number of edges significantly from better partitioning of graphs, which allows it to imbalance in a partition divided by the average number of edges per partition, do majority of the work in the first 3 supersteps. max( × p | E ) | P ∑ HM only speeds up processing of messages local to a partition. u = { ( u, v ) ∈ E | , where ∈ P } and p is E i.e. P | E | P Thus, it provides only marginal benefit under HP where 99% percent the number of partitions. Table 1 also shows the load imbalance of edges go across partitions. However, under GP, where only 2% of factors for both hash partitioning and our proposed graph partition- edges cross partition boundaries, it provides 201X improvement in ing across different datasets. 
Clearly, hash partitioning results in the number of messages, which translates into 6X improvement in better balanced partitions than our graph partitioning method. This overall running time. Still, GM does even better, as it uses algorithm- is because most real graph datasets present a preferential attachment specific, per-partition data structures to store equivalent connected phenomenon: new edges tend to be established between already components. This allows GM to instantly change states of entire well-connected vertices [17]. As a result, a partition that contains groups of vertices, instead of processing them one at a time. Thus, a well-connected vertex will naturally bring in much more edges GM sends 2X fewer messages and runs 4.7X faster than HM. than expected. Producing balanced graph partitions with minimal The above observations also apply to the other 3 datasets as shown communication cost is an NP-hard problem and is a difficult trade- in Table 2. Across all the four datasets, GM dramatically reduces off in practice, especially for very skewed graphs. Newly proposed the number of iterations, the execution time and network messages. partitioning technique with dynamic load balancing [24], and the Under HP, the reduction in network messages ranges from 1.7X to vertex-cut approach introduced in the latest version of Graphlab [9] 2.4X resulting in 1.6X to 3.1X faster execution time. The advantage can potentially help alleviate the problem, but cannot completely of the graph-centric model is more prominent under GP: 48X to solve the problem. 393X reduction of network messages and 10X to 63X speedup in We decided to “eat our own dog food” and evaluate the connected execution time. component and PageRank algorithms, using our graph partitioning Note that a smaller number of network messages doesn’t always (GP) strategy in addition to the default hash partitioning (HP). Note translate to faster execution time. 
For example, for the uk-2005 that the focus of this evaluation is on comparing Giraph++ VM, dataset VM sends fewer messages under GP than under HP, yet it HM, and GM modes. Although we implemented a scalable graph still runs 24% slower, due to load imbalance. The largest partition partitioning algorithm and use its output to evaluate two distributed becomes the straggler in every superstep, thus increasing the total graph algorithms, we leave the in-depth study of graph partitioning wall clock time. algorithms for the future work. The GP experiments should be viewed as a proxy for a scenario where some scalable graph parti- 6.3 PageRank tioning algorithm is used as a part of graph processing pipeline. This Next, we compare the performance of PageRank algorithm in should be the case if graphs are analyzed by (multiple) expensive different modes. In VM and HM we implemented Algorithm 3, and algorithms, so that the performance benefits of the low ncut justify in GM we implemented Algorithm 4. the partitioning cost. Figure 8 compares the convergence rates of PageRank on the uk-2002 dataset under different modes and partitioning strategies. 6.2 Connected Component The error is measured in error against the true PageRank values, L 1 The first algorithm we use for the performance evaluation is the obtained by running Algorithm 4 for 200 iterations. Under HP connected component analysis. In VM and HM we implemented strategy, VM, HM and GM all exhibit very similar convergence behavior with supersteps and time (the corresponding three lines Algorithm 1, and in GM we implemented Algorithm 2. Figure 7 shows the execution time and network messages per overlap in Figure 8(a) and are very close to each other in Figure 8(b)). 
This is because vertices in each partition mostly connect to vertices superstep as the connected component algorithm progresses on outside, thus there is very few asynchronous updates that actually the uk-2002 dataset (the undirected graph version) under the three happen in each superstep. As VM cannot take much advantage of modes and with both hash partitioning (HP) and graph partitioning better partitioning strategies, its convergence behavior under GP (GP) strategies. From this figure, we observe that: (1) The shapes is still similar to that of HP. GM and HM, on the other hand, take of curves of the execution time look very similar to those of the significantly fewer supersteps and less time to converge under GP. network messages. Since the computation is usually very simple Figure 8(b) shows dramatic reduction (more than 5X) of GM in time for graph traversal algorithms, execution time is mostly dominated needed to converge, as compared to VM. HM converges much faster by message passing overhead. (2) Better partitioning of the input than VM, however its total time to converge is still up to 1.5X slower graph doesn’t help much in the VM case, because the vertex-centric than GM, due to HM’s generic, and hence less efficient, in-memory model doesn’t take advantage of the local graph structure. VM sends structures. virtually the same number of network messages with GP and HP. As the algorithms progress, the execution time and network mes- The small differences are due to the use of the combiner. The total sages per superstep on the uk-2002 dataset are depicted in Figure 9. number of messages without combiner is always exactly the same in VM, no matter which partitioning strategy is used. (3) Under either The progression of PageRank looks very different than that of con- 201

…connected component detection. No matter under which mode and which partitioning strategy, the execution time and network messages do not change much across different supersteps.

The performance comparisons are consistent across all four datasets, as shown in Table 3. When input graphs are better partitioned, GM brings in a 41X to 116X reduction in network messages and a 1.6X to 12.8X speedup in execution time per iteration, keeping in mind that it also results in a much faster convergence rate. Again, due to the higher load imbalance resulting from the GP strategy, VM drags its feet for the uk-2005 dataset.

Table 3: Execution time and network messages for PageRank

per-superstep execution time (sec)
 dataset         hash partition (HP)    graph partition (GP)
                 VM     HM     GM       VM     HM     GM
 uk-2002         21     23     23       22     6      4
 uk-2005         89     91     88       167    28     13
 webbase-2001    121    125    125      102    24     13
 clueweb50m      62     64     65       34     21     21

per-superstep network messages (millions)
 dataset         hash partition (HP)    graph partition (GP)
                 VM     HM     GM       VM     HM     GM
 uk-2002         175    168    168      186    1.6    1.6
 uk-2005         739    684    684      701    6.7    6.7
 webbase-2001    717    689    689      751    9.2    9.2
 clueweb50m      181    181    181      168    4.1    4.1

[Figure 8: Convergence of PageRank on uk-2002 dataset. (a) Convergence with # supersteps; (b) Convergence with time. Error (log scale) vs. superstep / execution time for VM, HM and GM under HP and GP.]

[Figure 9: The execution time and network messages per superstep for PageRank on uk-2002 dataset (the first 30 supersteps). (a) Execution Time; (b) Network Messages.]

6.4 Graph Coarsening

We now empirically study the graph coarsening algorithm in VM and GM. Recall that HM cannot be used for the algorithm implementation described in Section 4.3.2, as the algorithm does not allow asynchronous execution. The implementation of the graph coarsening algorithm in VM follows a similar procedure as in GM, except that no local match or merge operations can be performed and everything is done through message passing. Table 4 compares the two models in terms of execution time and network messages for running 100 supersteps of the graph coarsening algorithm. Note that we do not assume a better partitioning of the input graphs is available, since graph coarsening is usually the first step of a graph partitioning algorithm. As a result, all the input graphs are hash partitioned. Across all four datasets, the graph coarsening algorithm benefits at different levels from our graph-centric model, because some match and merge operations can be resolved locally without message passing. For the clueweb50m dataset, the execution time is more than halved.

Table 4: Execution time and network messages for graph coarsening running 100 supersteps

 dataset         execution time (sec)   network messages (millions)
                 VM       GM            VM       GM
 uk-2002         1,215    1,094         626      600
 uk-2005         6,687    5,414         3,206    2,796
 webbase-2001    4,015    3,904         3,983    2,839
 clueweb50m      7,875    3,545         2,039    2,020

Figure 10 charts the progressive execution time and network messages per superstep for the clueweb50m dataset. Both time and messages fluctuate as the algorithm progresses, but since the coarsened graph continues to shrink, both measures decrease as a general trend. The coarsening rates on the clueweb50m dataset, with both supersteps and time, are shown in Figure 11. In terms of supersteps, the coarsening rates look neck and neck for VM and GM. Since clueweb50m is a sparser graph, the first 2 supersteps of handling 1-degree vertices are particularly effective in reducing the number of active vertices. Due to the better efficiency of GM, its coarsening rate over time excels (more than doubled).

6.5 Effect of Partitioning Quality

We now investigate the effect of partitioning quality on the performance of the different models. We chose the uk-2002 dataset for this experiment. Besides the HP and GP partitioning strategies, we constructed a different partitioning (abbreviated as SP) of the uk-2002 graph by randomly swapping 20% of the vertices in different partitions of the GP strategy. The resulting partitions have average ncuts of 0.57 and 0.56 for the undirected and directed versions of the graph, respectively. The corresponding imbalance factors are 1.83 and 1.75. Both the average ncuts and the imbalance factors of SP fall between those of HP and GP. Table 5 lists the execution times of connected component and PageRank with the three partitioning strategies. The results are consistent with the previous experiments. Better partitioning of the input doesn't help much for VM; in fact, VM under GP performs worse than under SP due to higher imbalance. In contrast, both HM and GM benefit a lot from better-quality partitioning strategies, with GM benefiting the most.

6.6 Effect on Fault Tolerance
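The two partition-quality metrics used in Section 6.5 above (average ncut and imbalance factor) are easy to compute directly. The sketch below is our illustration, not code from the paper: it takes the ncut of a partition to be the fraction of its incident edges that cross to another partition (averaged over partitions), and the imbalance factor to be the largest partition size divided by the average partition size, for an undirected graph.

```python
from collections import defaultdict

def partition_quality(edges, part):
    """Return (average ncut, imbalance factor) of a vertex partitioning.

    edges: iterable of (u, v) pairs of an undirected graph.
    part:  dict mapping each vertex to its partition id.
    """
    incident = defaultdict(int)  # edges incident to each partition
    cut = defaultdict(int)       # of those, edges crossing partitions
    for u, v in edges:
        pu, pv = part[u], part[v]
        incident[pu] += 1
        if pu != pv:             # a cut edge is incident to both sides
            cut[pu] += 1
            incident[pv] += 1
            cut[pv] += 1
    avg_ncut = sum(cut[p] / incident[p] for p in incident) / len(incident)

    sizes = defaultdict(int)     # number of vertices per partition
    for p in part.values():
        sizes[p] += 1
    # max size over average size (average = |V| / #partitions)
    imbalance = max(sizes.values()) * len(sizes) / len(part)
    return avg_ncut, imbalance

# A 4-cycle split into two partitions of two vertices each; both cut
# edges touch both partitions, so each partition has ncut 2/3:
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(partition_quality(edges, part))  # -> (0.6666666666666666, 1.0)
```

A lower average ncut means fewer cross-partition edges (and hence fewer network messages in GM/HM), while an imbalance factor near 1.0 means evenly loaded workers; the GP-vs-HP results above trade the first against the second.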

Table 5: Effect of partitioning quality on uk-2002 dataset

          connected component          pagerank
          total time (sec)             per-superstep time (sec)
          HP      SP      GP           HP      SP      GP
 VM       532     441     356          21      18      22
 HM       438     199     89           23      13      6
 GM       272     106     19           23      12      4

In Section 3, we discussed that Giraph++ utilizes the same checkpointing and failure recovery mechanism as Giraph to ensure fault tolerance. The only difference is that in GM mode Giraph++ also stores and reads back the auxiliary data structures for each graph partition (e.g. equiCC for the connected component algorithm). Sometimes, GM mode also introduces more attributes for each vertex, such as the delta value in the PageRank algorithm. It may seem that a lot of overhead is introduced in GM. However, this little bit of extra cost is sometimes outweighed by the dramatic reduction in messages and number of iterations.

During our empirical study, we consistently observed similar or even reduced overhead for checkpointing and failure recovery under GM. Take the relatively large webbase-2001 dataset as an example. If we turn on checkpointing for every 10th superstep of the PageRank computation with hash-partitioned input, the 59 workers spend 7,413 seconds collectively on checkpointing under VM and 7,728 seconds under GM (the elapsed time is roughly 7728/59 = 131 seconds) for running 30 supersteps. In comparison, with better graph partitions, VM takes 7,662 seconds, whereas GM spends 6,390 seconds collectively in the same setting. For the graph coarsening algorithm, VM takes 28,287 seconds in total for checkpointing with 59 workers running 100 supersteps (checkpointing every 10th superstep), whereas GM takes about 27,309 seconds. Furthermore, as GM often results in fewer iterations, the number of checkpoints needed for an algorithm is also reduced. For connected component detection, which takes only tens of supersteps under GM, it is not even necessary to turn on the checkpointing option. Similar behavior was observed for reading the checkpointed state during recovery.

[Figure 10: Execution time and network messages per superstep for graph coarsening on clueweb50m dataset. (a) Execution Time; (b) Network Messages.]

[Figure 11: Coarsening rates on clueweb50m dataset. (a) Coarsening with # Supersteps; (b) Coarsening with time. # active vertices vs. supersteps / execution time for VM and GM.]

6.7 Comparison with Prior Work

Finally, we compare Giraph++ with GraphLab for the connected component algorithm on the largest webbase-2001 dataset. Once again, we want to emphasize that the focus of this paper is the flexible graph-centric programming model and that it can be implemented in different graph processing systems. A head-to-head comparison between different systems is not the purpose of this paper, as different implementation details (e.g. Giraph++ uses Netty or Hadoop RPC for communication, whereas GraphLab uses MPI) and the use of different programming languages (Giraph++ in Java and GraphLab in C++) contribute a great deal to the difference in performance. Nevertheless, the performance numbers in Table 6 shed some light on the potential of the graph-centric programming model in other graph processing systems.

Table 6: Connected component in Giraph++ and GraphLab

          Giraph++ (sec)               GraphLab (sec)
          VM       HM       GM         Sync     Async
 HP       4,491    4,405    912        1,427    error
 GP       3,599    1,565    57         150      161

In this experiment, we ran GraphLab (version 2.1) using the same number of workers on the same 10-node cluster described in Section 6.1. We also used the same hash partitioning (HP) and graph partitioning (GP) strategies described in Table 1 for both Giraph++ and GraphLab. The reported execution time for both systems excludes the common overhead of setting up jobs, reading inputs, writing outputs and shutting down jobs. As Table 6 shows, the synchronous mode (denoted as Sync) in GraphLab has a much more efficient runtime than the equivalent VM mode in the Giraph++ BSP model, partly due to the system implementation differences outlined above. GraphLab also benefits from the better graph partitioning strategy. However, the asynchronous mode (denoted as Async in Table 6) in GraphLab did not deliver the expected better performance. Under HP, Async ran into a memory allocation error, and under GP it performed worse than Sync. Even though Async often needs fewer operations to converge, it also reduces the degree of parallelism due to the locking mechanism needed to ensure data consistency. For the connected component algorithm on this particular dataset, this trade-off results in the inferior performance of Async. Nevertheless, the most important takeaway from this experiment is that GM in Giraph++ outperforms even the best of GraphLab by 2.6X, despite the system implementation differences. This result further highlights the advantage of the graph-centric programming model and shows its potential in other graph processing systems.
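Two derived figures quoted in Sections 6.6 and 6.7 above can be reproduced directly from the raw numbers in the text and Table 6. The helper below is our back-of-envelope sketch, not the paper's code:

```python
def elapsed_per_worker(collective_seconds, workers):
    """Checkpointing cost is reported summed over all workers; the
    wall-clock (elapsed) share per worker is roughly the mean."""
    return collective_seconds / workers

# Sec 6.6: 59 workers spend 7,728 s collectively on GM checkpointing
# for 30 supersteps of PageRank -> about 131 s of elapsed time.
assert round(elapsed_per_worker(7728, 59)) == 131

# Sec 6.7: best GraphLab time (Sync under GP, 150 s) over GM under GP
# (57 s) gives the quoted ~2.6X advantage.
assert round(150 / 57, 1) == 2.6
```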

7. RELATED WORK

An overview of Pregel [15] and Giraph [1] is provided in Section 2. GraphLab [14] also employs a vertex-centric model. However, different from the BSP model in Pregel, GraphLab allows asynchronous iterative computation. As another point of distinction, Pregel supports mutation of the graph structure during the computation, whereas GraphLab requires the graph structure to be static. This limits GraphLab's applicability to problems like graph coarsening [11], graph sparsification [19] and graph summarization [22].

Kineograph [6] is a distributed system for storing continuously changing graphs. However, graph mining algorithms are still performed on static snapshots of the changing graphs. Kineograph's computation model is also vertex-centric.

Trinity Graph Engine [20] handles both online and offline graph processing. For online processing, it keeps the graph topology in a distributed in-memory key/value store. For offline processing, it employs a vertex-based BSP model similar to that of Pregel.

Grace [23] is a single-machine parallel graph processing platform. It employs a vertex-centric programming model similar to Pregel's, but allows customization of vertex scheduling and message selection to support asynchronous computation. However, Grace, too, requires an immutable graph structure.

Naiad [16] is designed for processing continuously changing input data. It employs a differential computation model and exposes a declarative query language based on .NET Language Integrated Query (LINQ).

8. CONCLUSION

In summary, we proposed a new graph-centric programming model for distributed graph processing and implemented it in a system called Giraph++. Compared to the vertex-centric model used in most existing systems, the graph-centric model allows users to take full advantage of the local graph structure in a partition, which enables more complex and flexible graph algorithms to be expressed in Giraph++. By exploiting the graph-centric model in three categories of graph algorithms (graph traversal, random walk and graph aggregation) we demonstrated that the graph-centric model can make use of off-the-shelf sequential graph algorithms in distributed computation, allows asynchronous computation to accelerate convergence, and naturally supports existing partition-aware parallel/distributed algorithms. Furthermore, Giraph++ is able to support both asynchronous computation and mutation of graph structures in the same system.

Throughout our empirical study, we observed significant performance improvements from the graph-centric model, especially when better graph partitioning strategies are used. The better performance of the graph-centric model is due to the significant reduction of network messages and execution time per iteration, as well as the fewer iterations needed for convergence. This makes the graph-centric model a valuable complement to the existing vertex-centric model. In the future, we would like to explore more graph algorithms using this model and conduct a more in-depth study of distributed graph partitioning algorithms.

9. REFERENCES

[1] Apache Giraph. http://giraph.apache.org.
[2] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: authority-based keyword search in databases. In VLDB'04, pages 564–575, 2004.
[3] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW'11, pages 587–596, 2011.
[4] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW'04, pages 595–601, 2004.
[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW'98, pages 107–117, 1998.
[6] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys'12, pages 85–98, 2012.
[7] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04, pages 137–150, 2004.
[8] A. Frommer and D. B. Szyld. On asynchronous iterations. J. Comput. Appl. Math., 123:201–216, 2000.
[9] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: distributed graph-parallel computation on natural graphs. In OSDI'12, pages 17–30, 2012.
[10] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134, 2011.
[11] G. Karypis and V. Kumar. A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm. In SIAM PP'97, 1997.
[12] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998.
[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
[14] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB, 5(8):716–727, 2012.
[15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD'10, pages 135–146, 2010.
[16] F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In CIDR'13, 2013.
[17] M. E. J. Newman. Clustering and preferential attachment in growing networks. Phys. Rev. E, 64:025102, 2001.
[18] P. S. Pacheco. Parallel programming with MPI. Morgan Kaufmann, 1997.
[19] V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In SIGMOD'11, pages 721–732, 2011.
[20] B. Shao, H. Wang, and Y. Li. Trinity: a distributed graph engine on a memory cloud. In SIGMOD'13, pages 505–516, 2013.
[21] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In KDD'12, pages 1222–1230, 2012.
[22] Y. Tian, R. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In SIGMOD'08, pages 419–432, 2008.
[23] G. Wang, W. Xie, A. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR'13, 2013.
[24] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In SIGMOD'12, pages 517–528, 2012.
[25] Y. Zhang, Q. Gao, L. Gao, and C. Wang. Accelerate large-scale iterative computation through asynchronous accumulative updates. In ScienceCloud'12, pages 13–22, 2012.
