Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations

Pitch Patarasuk    Xin Yuan
Department of Computer Science, Florida State University
Tallahassee, FL 32306
{patarasu, xyuan}@cs.fsu.edu

Abstract

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound of the amount of data that must be communicated in order to complete this operation and propose a ring-based algorithm that only requires tree connectivity to achieve bandwidth optimality. Unlike the widely used butterfly-like all-reduce algorithm that incurs network contention in SMP/multi-core clusters, the proposed algorithm can achieve contention-free communication in almost all contemporary clusters, including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. We demonstrate that the proposed algorithm is more efficient than other algorithms on clusters with different nodal architectures and networking technologies when the data size is sufficiently large.

Keywords: All-reduce, collective communication, tree topology, cluster of workstations

1 Introduction

The all-reduce operation combines values from all processes and distributes the results to all processes. It is commonly used in parallel computing. In the Message Passing Interface (MPI) standard [18], the routine for this operation is MPI_Allreduce.

We consider an efficient realization of the all-reduce operation with large data sizes in cluster environments, under the assumption that the reduce operator is associative and commutative. We derive a tight lower bound of the amount of data that must be communicated

Contact author: Xin Yuan, [email protected], phone: (850)644-9133, fax: (850)644-0058.

in order to complete the all-reduce operation, and use the lower bound to establish the minimum time required for this operation. We propose a ring-based algorithm for this operation that (1) achieves bandwidth optimality on tree topologies in that each node sends the minimum amount of data required to complete this operation, and (2) all communications are contention-free.

Currently, the most widely used scheme is the butterfly-like algorithm [22, 23, 27], where the all-reduce operation is realized with a recursive halving reduce-scatter followed by a recursive doubling all-gather. When the network can support the butterfly communication pattern without contention, this algorithm is optimal both in the latency term (using the minimum number of communication rounds needed) and in the bandwidth term (each node communicating the minimum amount of data required). The problem with the butterfly-like algorithm is that the butterfly communication pattern can cause network contention in many contemporary clusters such as the widely deployed SMP/multi-core clusters. In contrast, our ring-based algorithm only requires a tree topology to be bandwidth optimal and can achieve contention-free communication in almost all contemporary clusters, including SMP/multi-core clusters and Ethernet switched clusters with multiple switches. The ring-based algorithm also requires less working memory and can be applied to clusters with non-power-of-two numbers of nodes. One limitation of the proposed algorithm is that it is only optimal in the bandwidth term, but not in the latency term: the number of communication rounds is proportional to the number of processes. Another issue in the ring-based algorithm is that the reduction results are computed with different "bracketing", which may cause problems in the presence of rounding errors.

We evaluate the proposed algorithm on various clusters of workstations, including high-end SMP/multi-core clusters with Myrinet and InfiniBand interconnects and low-end Ethernet switched clusters. The results show that the proposed algorithm significantly outperforms other algorithms when the data size is sufficiently large, which demonstrates the effectiveness

of the proposed algorithm.

The rest of the paper is organized as follows. Section 2 introduces the all-reduce operation and the model that we use. In Section 3, we derive the theoretical lower bound on the communication time required for this operation. Section 4 presents the proposed bandwidth optimal all-reduce algorithm. Section 5 reports the results of our experiments. The related work is discussed in Section 6. Section 7 concludes the paper.

2 Background

2.1 All-reduce operation

We will use a generic operator ⊕ to denote the reduce operator in the all-reduce operation. MPI_Allreduce requires the reduce operator to be associative, that is, (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c). Moreover, all built-in operations for MPI_Allreduce are also commutative, that is, a ⊕ b = b ⊕ a. We assume that the reduce operator is both associative and commutative in this paper.

In terms of operating results, an all-reduce operation is equivalent to a reduction operation that reduces the results to one process, followed by a broadcast operation that distributes the results to all processes. Specifically, let the N processes be denoted as p_0, p_1, ..., p_{N-1}. Before the all-reduce operation, each process p_i, 0 ≤ i ≤ N-1, has X data items a_i^0, a_i^1, ..., a_i^{X-1}. At the end of the operation, all processes have all X item results r^0, r^1, ..., r^{X-1}, where r^j = a_0^j ⊕ a_1^j ⊕ ... ⊕ a_{N-1}^j, 0 ≤ j ≤ X-1.

In the presence of rounding errors, MPI poses some quality requirements for this operation [23]. It is required that all processes must receive the same resulting data. In addition, the same reduction order and bracketing for all elements is not strictly required, but should be strived for.

2.2 Communication performance model

Since we consider operations with large data sizes, we will use a linear model to model the time for point-to-point communications: the time to transfer a message of size m is T(m) =

β × m, where β is a constant. Notice that the linear model ignores communication start-up overheads, and the communication time is proportional to the message size. Let i be a constant; T(i × m) = i × T(m) and T(m_1 + m_2) = T(m_1) + T(m_2).

In analyzing the algorithms, we assume that all processes start the all-reduce operation at the same time. The time for an all-reduce operation is defined as the time from when all processes start the operation until the last process receives the last message (and computes the final result). Our analysis focuses on the communication time and ignores the computation time. In practice, both the communication start-up overheads and the computation in the operation, which are omitted in our analysis, can be significant.

3 The lower bounds

In a general-purpose all-reduce operation, data items are independent of each other. The reductions on different items are independent of one another. Hence, the amount of data that must be communicated in order to complete an all-reduce operation on X items is equal to X times the amount for the single-item all-reduce operation. To obtain the lower bound for a general operation on X items, we only need to find the lower bound for the all-reduce operation on a single item.

In a one-item all-reduce operation on N processes, each process p_i, 0 ≤ i ≤ N-1, has one initial item a_i. At the end of the all-reduce operation, all processes contain the final result r = a_0 ⊕ a_1 ⊕ ... ⊕ a_{N-1}. Let B = {b_0, ..., b_{i-1}} be a set of initial items. We will use the notation ⊕(B) = ⊕({b_0, b_1, ..., b_{i-1}}) = b_0 ⊕ b_1 ⊕ ... ⊕ b_{i-1} to represent the reduction operation on the initial items in B. Let ALL = {a_0, a_1, ..., a_{N-1}} be the set that contains all initial items, r = ⊕(ALL). Let B = {b_0, b_1, ..., b_{i-1}} ⊆ ALL be a subset of initial items; ⊕(B) = b_0 ⊕ b_1 ⊕ ... ⊕ b_{i-1} is a partial result. All initial items a_i, 0 ≤ i ≤ N-1, as well as the final result r, are partial results. Let ∅ be the empty set. Since ∅ ⊆ ALL, for completeness, we introduce a null (empty) partial result, denoted as ⊥ = ⊕(∅). Intuitively, any computation

(reduction) result that is not a partial result will not contribute to the calculation of the final result in any way and can be represented as ⊥. Each partial result except ⊥ has the same size as an initial item. Let ⊕(B) be a partial result and a ∈ B; we say that the effect of item a is in the partial result ⊕(B). Let pr_1 = ⊕(A) and pr_2 = ⊕(B) be two partial results. We extend the reduce operator ⊕ to take partial results as parameters. Specifically,

pr_1 ⊕ pr_2 = ⊕(A ∪ B), if (A ∩ B = ∅) and (A ≠ ∅) and (B ≠ ∅);
pr_1 ⊕ pr_2 = ⊥, if (A ∩ B ≠ ∅) or (A = ∅) or (B = ∅).

Basically, if two partial results cover disjoint sets of initial items, applying the reduce operation on the two partial results yields another partial result that covers the union of the initial items covered by these two partial results. When the items intersect, applying the reduce operation on the two partial results yields something that is not a partial result (some initial items appear multiple times), which is denoted by ⊥ as discussed earlier.

In our proofs, we also use a ⊖ operator on partial results, which is defined as follows. Let pr_1 = ⊕(A) and pr_2 = ⊕(B) be two partial results; pr_1 ⊖ pr_2 = ⊕(A − B).

Basically, the ⊖ operator removes the effects of the items in B from the partial result pr_1. For example, (a ⊕ b ⊕ c) ⊖ b = ⊕({a, b, c} − {b}) = a ⊕ c; a ⊖ a = ⊕({a} − {a}) = ⊥.

During the all-reduce operation, each process may send a partial result that it can compute to another process or receive a partial result from another process. The partial results that can be computed by a process at a given time depend on the initial item at the process as well as the set of partial results received by the process. We define the state of a process at a given time to be the set of partial results that includes the initial item in the process and all partial results that the process received from other processes. We call the set of all partial results that can be computed at a given time in a process the coverage set of the process. Clearly, the coverage set is a function of the state. Let the state of a process be S; we denote the coverage set as COVER(S):

COVER(S) = {pr | ∃ pr_1, pr_2, ..., pr_i ∈ S, pr = pr_1 ⊕ pr_2 ⊕ ... ⊕ pr_i}

The partial results in a state are atomic items that cannot be separated even though they can be combined with other partial results. For example, if a process has a state S = {a, b ⊕ c}, the coverage set is COVER(S) = {⊥, a, b ⊕ c, a ⊕ b ⊕ c}, which does not include a ⊕ b or a ⊕ c.

In the derivation of the lower bound, we focus on all-reduce schemes whose communications have no race conditions, such that all messages in an operation can be sequentialized. All-reduce algorithms with race conditions are usually incorrect and not considered in this paper. Without loss of generality, we will also assume that each message transfers one partial result between two processes: a message with multiple partial results is treated as multiple messages. A message is denoted as (src → dst, pr), where src is the source process, dst is the destination process, and pr is a partial result. Let a generic all-reduce scheme use n messages. Under the assumption that the messages do not have race conditions and can be sequentialized, the n messages can be ordered based on the timing when each message occurs. For concurrent messages that happen at the same time, the tie can be broken arbitrarily. Let M_1, M_2, ..., M_n be the n messages in an all-reduce scheme ordered based on their timing. Let M_j = (s_j → d_j, pr_j), 1 ≤ j ≤ n, and let S_i^{M_j} be the state of process p_i before message M_j. We will use the notation S_i^{M_{n+1}} to denote the state of p_i after the last message M_n. Clearly, we have pr_j ∈ COVER(S_{s_j}^{M_j}).

Since we are only interested in the number of partial results communicated, we will use the sequence of messages in an algorithm to represent the algorithm. Let a message sequence be M_1, M_2, ..., M_k, where M_j = (s_j → d_j, pr_j). We say that the sequence of messages is valid when pr_j ∈ COVER(S_{s_j}^{M_j}), 1 ≤ j ≤ k. Clearly, any all-reduce algorithm can be represented by a valid sequence of messages. Similarly, any valid sequence of messages can be realized by an algorithm.

Lemma 1: The minimum number of partial results to be communicated to complete a

one-item all-reduce operation on N processes over all algorithms is at most 2 × (N − 1).

Proof: The proof is straightforward by constructing an algorithm that communicates 2 × (N − 1) partial results to complete the operation. The all-reduce operation can be completed with a reduce algorithm that communicates N − 1 partial results along a tree, followed by a broadcast algorithm that communicates N − 1 partial results along a tree. □

Lemma 2: The number of partial results to be communicated to complete a one-item all-reduce operation on N processes using any algorithm is at least 2 × (N − 1).

Proof: The lemma applies both when the messages are within the N processes and when additional processes are used to relay messages. We prove the lemma by induction on N.

Base case: N = 2. With one message, at most one process can send its data to the other process, and at most one process can compute a_0 ⊕ a_1, which cannot complete the all-reduce operation. Hence, at least 1 + 1 = 2 = 2 × (N − 1) communications are needed.

Induction case: The induction hypothesis is that the minimum number of partial results to be communicated in order to complete a one-item all-reduce operation on N processes is at least 2 × (N − 1). Using this hypothesis, we will prove that the minimum number of messages required to complete the operation on N + 1 processes is at least 2 × ((N + 1) − 1) = 2 × N.

Let the N + 1 processes be p_0, p_1, ..., p_N, with p_i having initial item a_i before the operation. Let the minimum number of messages required to complete the operation on the N + 1 processes be X + 1, and let the ordered sequence of messages be E_1, E_2, ..., E_X, E_{X+1}, where E_j = (s_j → d_j, pr_j), 1 ≤ j ≤ X + 1, and the state of process p_i, 0 ≤ i ≤ N, before message E_j is S_i^{E_j}. S_i^{E_{X+2}} denotes the state after E_{X+1} (the last message). We have r = a_0 ⊕ a_1 ⊕ ... ⊕ a_N ∈ COVER(S_i^{E_{X+2}}), 0 ≤ i ≤ N.

We will prove X + 1 ≥ 2 × N by constructing a sequence of at most X − 1 messages that allows the all-reduce operation on N processes to be completed, which requires at least 2 × N − 2 messages (induction hypothesis). Let us assume that p_N is the process that receives the last message (this assumption is valid since ⊕ is associative and commutative). The new

sequence is constructed from the X-message sequence E_1, E_2, ..., E_X in two steps.

In the first step, we construct a sequence F of X messages as follows. For each E_j = (s_j → d_j, pr_j), 1 ≤ j ≤ X, there is a corresponding message F_j = (s_j → d_j, pr_j ⊖ a_N). Message sequence F is exactly the same as message sequence E except that the effect of a_N is removed. Note that some messages in sequence F may contain the empty partial result (⊥). Such empty messages will be removed in the next step. Let the state of process p_i, 0 ≤ i ≤ N, before message F_j be S_i^{F_j} (sequence F is applied from the initial condition of the operation). By induction on j, it is straightforward to show that

1. Message sequence F is valid, that is, pr_j ⊖ a_N ∈ COVER(S_{s_j}^{F_j}), 1 ≤ j ≤ X.
2. If pr ∈ COVER(S_i^{E_j}), then pr ⊖ a_N ∈ COVER(S_i^{F_j}), 1 ≤ j ≤ X + 1 and 0 ≤ i ≤ N.

Using message sequence E, before the last message (E_{X+1}), the final result r = a_0 ⊕ a_1 ⊕ ... ⊕ a_N must be in the coverage set of each of the processes p_0, p_1, ..., p_{N-1}, since these N processes do not receive the last message and their states do not change after the message. Hence, r = a_0 ⊕ a_1 ⊕ ... ⊕ a_N ∈ COVER(S_i^{E_{X+1}}), 0 ≤ i ≤ N − 1, and r ⊖ a_N = a_0 ⊕ a_1 ⊕ ... ⊕ a_{N-1} ∈ COVER(S_i^{F_{X+1}}), 0 ≤ i ≤ N − 1: the reduction result of the N processes p_0, p_1, ..., p_{N-1} is in the coverage sets of all of the N processes after the X messages in sequence F. Next, we will construct from sequence F a sequence of at most X − 1 messages that allows the all-reduce operation on the N processes p_0, p_1, ..., p_{N-1} to complete. There are two cases.

Case (a): In sequence F, if there is any message whose message content is ⊥ (empty message), this message can be removed and we obtain a sequence of at most X − 1 messages that allows the all-reduce operation on N processes to complete.

Case (b): If there are no such messages in sequence F, each message in F contains a real partial result. Since r = a_0 ⊕ ... ⊕ a_{N-1} is in the coverage set of all processes p_0, ..., p_{N-1} after sequence F, process p_N must be the sender of at least one of the messages E_1, E_2, ..., E_X so that the effect of a_N can be distributed to other processes. Since we do not

change the senders and receivers from sequence E to sequence F, process p_N must also be the sender at least once in sequence F.

Let the first message in sequence F with p_N as the sender be F_k = (s_k → d_k, pr_k ⊖ a_N), s_k = p_N. We create a new message sequence from sequence F as follows: (1) for every message whose sender and receiver are not p_N, keep the message intact; (2) if the receiver of a message is p_N, change the receiver to d_k; (3) if the sender is p_N and the message is not F_k, change the sender to d_k (if the receiver is also d_k, then remove this message); (4) remove F_k from the sequence. Basically, in the new sequence, d_k receives all the messages sent to p_N in sequence F, and sends out anything that p_N used to send out in sequence F. Since the effect of a_N has been removed in sequence F, this new sequence is valid. Each of the processes other than p_N and d_k receives exactly the same messages in the new sequence as in sequence F. Hence, the state of process p_i, 0 ≤ i ≤ N − 1 and p_i ≠ d_k, is exactly the same after these two message sequences. Now consider process d_k. For the messages that it does not receive (from p_N) in the new sequence, it should be able to compute the partial results in them, since d_k receives everything that p_N receives in sequence F: the state of d_k after the new sequence is a superset of its state after sequence F. Since a_0 ⊕ a_1 ⊕ ... ⊕ a_{N-1} is in the coverage set of processes p_0, p_1, ..., p_{N-1} after sequence F, it should also be in the coverage sets of all these processes after the new sequence, which has at most X − 1 messages (F_k is removed). Thus, X − 1 messages are sufficient to complete the all-reduce operation on N processes. From the induction hypothesis, X − 1 ≥ 2 × N − 2 and X + 1 ≥ 2 × N. This concludes the induction case. □

Combining Lemma 1 and Lemma 2, we obtain the lower bound result in the following lemma. Note that while we prove the lower bound for the case when the reduce operator is commutative, the bound also applies when the reduce operator is non-commutative, since the set of all all-reduce algorithms with a non-commutative reduce operator is a subset of the set of all all-reduce algorithms with a commutative reduce operator.
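The ⊕, ⊖, and COVER definitions used in these proofs lend themselves to a direct executable model. Below is a minimal Python sketch (our own illustration, not code from the paper): a partial result is modeled as a frozenset of initial items, ⊥ as the empty set ⊕(∅), and COVER(S) is enumerated by brute force over combinations of the atomic partial results in S.

```python
from itertools import combinations

BOT = frozenset()  # the null partial result, ⊥ = ⊕(∅)

def op(pr1, pr2):
    """Extended reduce operator ⊕: disjoint non-empty covers combine
    into their union; overlapping or empty operands collapse to ⊥."""
    if pr1 == BOT or pr2 == BOT or (pr1 & pr2):
        return BOT
    return pr1 | pr2

def minus(pr1, pr2):
    """The ⊖ operator: remove the effects of pr2's items from pr1."""
    return pr1 - pr2

def cover(state):
    """COVER(S): every ⊕-combination of one or more atomic partial
    results in S, plus ⊥ itself."""
    out = {BOT}
    items = list(state)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pr = combo[0]
            for nxt in combo[1:]:
                pr = op(pr, nxt)
            if pr != BOT:
                out.add(pr)
    return out

# The example from the text: a state S = {a, b ⊕ c}.
a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
S = {a, op(b, c)}
print(cover(S) == {BOT, a, frozenset("bc"), frozenset("abc")})  # True
```

The final check reproduces the example above: the state {a, b ⊕ c} covers exactly ⊥, a, b ⊕ c, and a ⊕ b ⊕ c, but neither a ⊕ b nor a ⊕ c, since b ⊕ c is atomic within the state.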

Lemma 3: The minimum number of partial results to be communicated to complete a one-item all-reduce operation on N processes over all algorithms is 2 × (N − 1). □

Lemma 4: Assume that (1) data are not compressed during the all-reduce operation; and (2) the initial items are independent of one another. To perform an all-reduce operation on X items with item size itsize bytes on N processes, there exists at least one process that must communicate a total of at least ⌈2 × (N−1)/N × X⌉ × itsize bytes of data, assuming the minimum unit for communication is an item.

Proof: Since the minimum number of items to be communicated in order to complete a one-item all-reduce operation is 2 × (N − 1) (Lemma 3), the minimum number of items to be communicated to complete an X-item all-reduce operation is X × 2 × (N − 1). Since there are N processes that collectively carry out the operation, the communications can be distributed among the N processes. Hence, at least one process needs to communicate ⌈X × 2 × (N−1)/N⌉ = ⌈2 × (N−1)/N × X⌉ items, that is, ⌈2 × (N−1)/N × X⌉ × itsize bytes of data. □

Lemma 5: Under the assumptions stated in Lemma 4, the minimum time for the all-reduce operation is at least T(⌈2 × (N−1)/N × X⌉ × itsize).

Let msize = X × itsize be the total data size in the operation. Using the linear communication model, T(⌈2 × (N−1)/N × X⌉ × itsize) ≈ 2 × T(msize). This indicates that the best any all-reduce algorithm can do is to achieve roughly 2 times the time to send the msize-byte data.

4 A ring-based bandwidth optimal algorithm

We will describe a ring-based bandwidth optimal all-reduce algorithm for tree topologies, and show how such an algorithm can be applied to high-end clusters. The tree topology by itself is not particularly interesting. We develop bandwidth optimal algorithms for the tree topology only because a tree provides the minimum connectivity and most networks have tree embeddings: our bandwidth optimal all-reduce algorithm for trees can be applied to any

topology with a tree embedding. Note that since our algorithm achieves bandwidth optimality, no algorithm can have a better theoretical performance on any topology, assuming that each node has one network port.

4.1 Bandwidth optimal algorithm for tree topologies

Our proposed bandwidth optimal all-reduce algorithm combines three existing ideas: (1) realizing the all-reduce operation by a reduce-scatter operation followed by an all-gather operation [22], (2) realizing both the reduce-scatter operation and the all-gather operation using logical ring based algorithms [7, 25], and (3) constructing a contention-free logical ring on the tree topology [7]. Our contribution lies in combining these three components in an all-reduce algorithm and in showing that the resulting all-reduce algorithm is bandwidth optimal.

To achieve bandwidth optimality using the reduce-scatter followed by all-gather algorithm [22], both the reduce-scatter and the all-gather operations must be bandwidth optimal, which means that: (1) each process should communicate the minimum amount of data required to realize the operations; and (2) the network contention must be avoided. The optimality is achieved by using logical ring based reduce-scatter and all-gather algorithms, and by finding a contention-free logical ring on the tree topology. The logical ring all-gather algorithm is described in [25], and the technique to compute a contention-free logical ring on a tree topology can be found in [7]. Next, we will describe the logical ring reduce-scatter scheme.

Let the N processes be p_0, p_1, ..., p_{N-1}. Let F: {0, ..., N−1} → {0, ..., N−1} be a one-to-one mapping. Thus, p_{F(0)}, p_{F(1)}, ..., p_{F(N-1)} is a permutation of p_0, p_1, ..., p_{N-1}. The logical ring pattern contains the following communications:

p_{F(0)} → p_{F(1)} → p_{F(2)} → ... → p_{F(N-1)} → p_{F(0)}

Using the logical ring pattern, the reduce-scatter operation is performed as follows. First, the source data of msize bytes in each process is partitioned into N segments, all segments having

⌈msize/N⌉ bytes except the last segment, which has a segment size of msize − (N − 1) × ⌈msize/N⌉ bytes. Let us number the segments SEG_0, SEG_1, ..., SEG_{N-1}. The reduce-scatter operation is carried out by performing N − 1 iterations on the logical ring. In the first iteration (iteration 1), process p_{F(i)} sends segment SEG_{(i-1) mod N} to p_{F((i+1) mod N)}. After each process receives the data, it performs a reduction operation on the received segment with its corresponding data segment (the segment with the same segment index) and replaces its own data segment with the (partial) reduction results. For each remaining iteration j, 2 ≤ j ≤ N − 1, each process p_{F(i)} sends the newly computed SEG_{(i-j) mod N} to p_{F((i+1) mod N)}. After receiving the communicated data in each iteration, each process performs the reduction operation on the received data with the corresponding segment in the local array and replaces that segment with the partial reduction results. At the end of the N − 1 iterations, p_{F(i)} holds the reduction results in SEG_i, 0 ≤ i ≤ N − 1. Figure 1 shows the implementation of the logical ring reduce-scatter on four processes. As shown in the figure, in the first iteration, n_{F(0)} sends SEG_3 to n_{F(1)}; n_{F(1)} sends SEG_0 to n_{F(2)}; n_{F(2)} sends SEG_1 to n_{F(3)}; and n_{F(3)} sends SEG_2 to n_{F(0)}. After the communication, the reduction operation is performed: on SEG_3 at n_{F(1)}, on SEG_0 at n_{F(2)}, on SEG_1 at n_{F(3)}, and on SEG_2 at n_{F(0)}. After three iterations, n_{F(0)} has the SEG_0 results, n_{F(1)} has the SEG_1 results, n_{F(2)} has the SEG_2 results, and n_{F(3)} has the SEG_3 results.

[Figure 1: Logical ring reduce-scatter algorithm. The figure shows iterations 1 to 3 on processes p_{F(0)}, p_{F(1)}, p_{F(2)}, and p_{F(3)}, marking the communicated segment in each iteration and the final results.]

Put together, the proposed algorithm first finds a contention-free logical ring on the

tree topology. It then performs the all-reduce operation over the contention-free logical ring by using logical ring algorithms to carry out a reduce-scatter operation followed by an all-gather operation. Both the ring based reduce-scatter and all-gather algorithms perform the communications in N − 1 rounds, with each node sending ⌈msize/N⌉ bytes of data in each round.

Theorem 1: Consider an all-reduce operation on X items with the size of each item being itsize (msize = X × itsize). When X is divisible by N, the proposed algorithm is bandwidth optimal.

Proof: Both the reduce-scatter operation and the all-gather operation are performed in N − 1 steps, with each process sending and receiving (X × itsize)/N data in each step. Since there is no contention on the logical ring, both operations take (N − 1) × T((X × itsize)/N) time. Hence, under the linear model, the total communication time is 2 × (N − 1) × T((X × itsize)/N) = 2 × (N−1)/N × T(X × itsize), which is the theoretical optimal (Lemma 5). Hence, our all-reduce algorithm is bandwidth optimal. □

When X is not divisible by N, the total data size sent by each node is 2 × (N − 1) × ⌈X/N⌉ × itsize, which may be much more than the theoretical optimal of ⌈2 × (N−1)/N × X⌉ × itsize when N is large. In addition, in order for 2 × (N − 1) × T(msize/N) to approximate 2 × (N−1)/N × T(msize), msize needs to be sufficiently large: as msize/N approaches infinity, the performance of the proposed algorithm will be close to optimal. The exact threshold at which the proposed algorithm is better than other algorithms is system dependent.

The proposed algorithm requires ⌈X/N⌉ × itsize additional working memory, which is better than the ⌈X/2⌉ × itsize working memory required by the butterfly-like algorithm. The results in all processors will be the same after the ring-based algorithm. Hence, the algorithm meets the minimum MPI quality requirement [23]. However, the bracketing for computing the results is not the same for different data items.
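The round structure of the ring-based reduce-scatter followed by all-gather can be made concrete with a small runnable simulation. This is a sketch of our own, not the paper's MPI implementation: numpy arrays stand in for the per-process data, '+' stands in for the reduce operator ⊕, and F is taken to be the identity mapping. The segment indices follow the scheme described above.

```python
import numpy as np

def ring_allreduce(data):
    """Simulate the ring-based all-reduce: a logical ring reduce-scatter
    followed by a logical ring all-gather. `data` is a list of N
    equal-length numpy arrays, one per process."""
    N, X = len(data), len(data[0])
    bufs = [d.copy() for d in data]
    # N near-equal segments of the X items.
    bounds = [(s * X // N, (s + 1) * X // N) for s in range(N)]

    # Reduce-scatter: in iteration j, process i sends segment (i - j) mod N
    # to process (i + 1) mod N, which reduces it into its own copy.
    for j in range(1, N):
        for i in range(N):
            lo, hi = bounds[(i - j) % N]
            bufs[(i + 1) % N][lo:hi] += bufs[i][lo:hi]
    # Now process i holds the final reduction results in segment i.

    # All-gather: circulate the completed segments; in iteration j,
    # process i forwards segment (i - j + 1) mod N to process (i + 1) mod N.
    for j in range(1, N):
        for i in range(N):
            lo, hi = bounds[(i - j + 1) % N]
            bufs[(i + 1) % N][lo:hi] = bufs[i][lo:hi]
    return bufs

rng = np.random.default_rng(0)
N, X = 5, 12
data = [rng.integers(0, 10, X) for _ in range(N)]
out = ring_allreduce(data)
expected = sum(data)  # elementwise sums
print(all(np.array_equal(b, expected) for b in out))  # True
```

Each process sends exactly one segment per iteration, so the 2 × (N − 1) rounds move ⌈msize/N⌉ bytes per node per round, matching the bandwidth analysis in Theorem 1.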

4.2 Algorithm for high-end clusters

A high-end SMP/multi-core cluster is typically formed by SMP and/or multi-core nodes connected by a high speed interconnect. The interconnect is usually a single cross-bar switch for small clusters and a fat-tree for large clusters, whose performance is close to a cross-bar switch. The communication between processors in different SMP nodes (inter-node communication) goes through the interconnect, while the intra-node communication is typically performed through memory operations. For the common case when one network interface card is equipped in each SMP node, the system can be approximated by a two-level tree topology as shown in Figure 2.

[Figure 2: The two-level tree model for SMP clusters. An interconnect s_0 connects the SMP nodes s_1, s_2, ...; the first node holds processors p_0, ..., p_{N/M-1}, the second holds p_{N/M}, p_{N/M+1}, ..., p_{2*N/M-1}, and so on, up to p_{(M-1)*N/M}, ..., p_{N-1}.]

The algorithm described in the previous subsection can be applied when the tree topology is determined. In a high-end SMP cluster, even though the system can be approximated with a two-level tree, the exact tree topology is unknown until the SMP nodes are allocated. Hence, this algorithm cannot be directly used in a high-end SMP cluster, since we need to develop the all-reduce routine before it can be used in programs running on a particular platform. However, by default, many SMP clusters assign MPI processes with consecutive ranks to processors (or cores) in each SMP node. For example, in a system with 4-core SMP nodes, n_0, n_1, n_2, and n_3 are assigned to one SMP node (one process per core); n_4, n_5, n_6, and n_7 are assigned to another SMP node; and so on. For such clusters, using the 2-level tree representation, a contention-free logical ring for the default process assignment scheme can be built without knowing the exact topology, as shown in the following lemma, which can be proved by examining the topology in Figure 2.
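The intuition behind this claim can be checked mechanically: for the ring over consecutive ranks, count how often each directed inter-node link of the two-level tree is traversed. The sketch below is our own illustration (the helper name `inter_node_link_uses` is hypothetical, not from the paper); contention-free means every link count is at most 1.

```python
from collections import Counter

def inter_node_link_uses(N, cores_per_node):
    """For the ring p_0 -> p_1 -> ... -> p_{N-1} -> p_0 on a two-level
    tree where consecutive ranks share an SMP node, count how often each
    directed inter-node link ((node, 'up') or (node, 'down')) is used."""
    node = lambda rank: rank // cores_per_node
    uses = Counter()
    for i in range(N):
        src, dst = node(i), node((i + 1) % N)
        if src != dst:  # intra-node hops stay off the interconnect
            uses[(src, "up")] += 1
            uses[(dst, "down")] += 1
    return uses

# 32 processes on 8 four-core nodes: each node's uplink and downlink is
# crossed exactly once, so the ring is contention-free.
print(max(inter_node_link_uses(32, 4).values()) == 1)  # True
```

By contrast, an assignment that scatters consecutive ranks across nodes (say, rank i on node i mod M) would make every ring hop cross the interconnect and reuse each link many times.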

15 Lemma 6 the assumption that MPI pro cesses with consecutiv e ranks are assigned : Under cessors (or in eac h SMP node, logical ring pattern, p to pro ! p cores) ! p ! ! ::: 2 0 1 p 2 ! p free. , is con ten tion 0 1 N 5 Experimen ts the algorithms describ ed in the previous section, Based t all-reduce routines on we implemen For high-end SMP/m ulti-core clusters, we implemen t a stand-alone all-reduce in two forms. based on Lemma 6. For clusters with a physical tree top ology , we dev elop a routine routine that tree es the generator top ology information as input and automatically pro duces an tak for routine the top ology speci c algorithm uses the top ology . The routine all-reduce that reads a top ology description le that describ es the connectivit y among mac hines generator switc hes, determines the order of the mac hines that realizes a con ten tion-free ring, and and the duces code that realizes the operations using the con ten tion-free ring comm unication pro pattern. The generated routines are written in C and are based on MPICH point-to-p oint primitiv es. clusters are in the exp erimen ts: the NCSA Teragrid IA-64 Lin ux cluster [19], used Three State cluster Departmen t of Computer Science, Florida at the Univ ersit y, and an Draco the switc hed cluster. The NCSA Teragrid Ethernet Lin ux cluster is a Myrinet cluster with IA-64 dual Intel Itan tium 2 SMP nodes and 4GB memory per node. The system runs 1.5GHz the Lin 2.4.21-SMP operating system and uses ux mpic h-gm-1.2.6..14b library . The Draco the cluster is an In niBand cluster with Dell PowerEdge 1950 nodes, eac h node having two dual- core Xeon pro cessors (2.33GHz, 4 cores per pro cessor, 8 cores per node) and 8GB E5345 double memory are connected with a 20Gbps nodes data rate In niBand switc h. The . The cluster runs the Lin ux 2.6.9-42.ELsmp kernel and uses the mvapic h2.1.0.2p1 library . 
The compute Ethernet cluster consists of 32 hed nodes connected by Dell Powerconnect switc 2724 Gigabit Ethernet switc hes. The nodes are Dell Dimension 2400, eac h with a 2.8GHz 15

16 P4 pro and 640MB memory . All nodes run Lin ux (Fedora) with the 2.6.5-358 kernel. cessor latest OPENMPI [20] and MPICH 2.1.0.6p1 [10] are installed on this cluster. The 1.2.5 approac of the using the Mpptest is measured h [11]. We performance algorithms The prop osed algorithms with nativ e MPI implemen compare on these clusters. In the tations the butter y-lik e algorithm [22, 23, 27], denote traditional butter y , that is addition, as both latency and bandwidth optimal when the net work con ten tion is not a theoretically is also we also For high-end SMP/m ulti-core clusters, problem compare the pro- compared. scheme with speci c implemen tations. One SMP speci c implemen tation is posed two SMP clusters on tly dev elop ed for SMP recen [26]. The implemen tation, denoted based algorithms SMP-binomial , has four logical phases: as an intra-no de reduce operation for eac h SMP (1) node using tree, (2) an inter-no de reduce operation with a binomial tree to obtain a binomial inter-no the in one node, (3) an results de broadcast operation with a binomial tree reduction to distribute the results to eac h SMP node, and (4) an intra-no de broadcast to distribute the results h pro cessor. While this algorithm does not cause net work con ten tion, the to eac exists de comm bandwidth optimal: there is not a node that comm unicates inter-no unication ( log ( N ) msiz e ) data. We enhance this algorithm by using the butter y-lik e algorithm to O inter-no algorithm operation. This perform is denoted as SMP-butter y . Notice de all-reduce is that the comp etitor of the prop osed algorithm that butter y whose main problem main is the unication pattern cannot be embedded comm SMP cluster without causing link in an con ten tion. SMP-butter y can be considered as an impro vemen t over butter y : eliminating the net con ten tion by grouping intra-no de comm unications together. 
NCSA Teragrid IA-64 Linux cluster results

Figure 3 shows the performance of different all-reduce algorithms on the NCSA Teragrid IA-64 cluster with 128 processors (64 nodes). All programs are compiled with the 'mpicc -lm' command (mpicc invokes the Intel compiler in the system) with no other flag. The native MPI library uses the butterfly algorithm and, as a result, its performance is very close to that of our implementation of butterfly. As shown in the figure, native (butterfly) performs the best when the data size is small. However, as the data size increases, the network contention degrades the performance of native (butterfly): its performance is not as good as that of the proposed algorithm when the data size is larger than 256KB. SMP-butterfly eliminates network contention by performing the operation in three phases. However, using this algorithm, the total amount of inter-node and intra-node data communicated is still larger than what is needed. As a result, using this method to eliminate network contention is not effective in such a high-end cluster, as shown in the figure. It is not surprising that SMP-binomial performs poorly when the data size is large, since there exists one process that communicates O(log(N) * msize) data in this algorithm. The proposed algorithm outperforms butterfly when the data size is larger than 256KB. Notice that 256KB/N = 256KB/128 = 2KB: the threshold value of msize for our bandwidth optimal algorithm to be more efficient is around 2KB on this cluster. As the data size increases, the performance difference becomes more significant: the network contention problem is more severe as the data size increases. This experiment shows that the break-even point for the proposed ring-based algorithm is 256KB when N = 128. When N is larger, however, the break-even point can be much larger since the number of communication rounds in the ring-based algorithm is O(N).

Draco cluster results

Figure 4 shows the results on the Draco cluster. The native MPI library also uses butterfly when the data size is larger than 32KB. In this cluster, the trend of the relative performance of SMP-binomial, SMP-butterfly, native, and the proposed algorithm is similar to that in the NCSA cluster. The proposed algorithm performs significantly better than native (butterfly) for data sizes 64KB, 128KB, 256KB, 2MB and 4MB.
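The break-even behavior observed on these clusters can be made concrete with a simple alpha-beta cost model. The sketch below is our own illustration, not the paper's analysis: alpha (per-message start-up time) and beta (per-byte transfer time) take illustrative, unmeasured values, and a hypothetical contention factor inflates the butterfly's bandwidth term to mimic the link contention the experiments exhibit.

```python
from math import log2

def ring_time(n, size, alpha, beta):
    # 2(n-1) rounds, each moving size/n bytes; contention-free by design.
    return 2 * (n - 1) * (alpha + (size / n) * beta)

def butterfly_time(n, size, alpha, beta, contention=1.0):
    # 2*log2(n) rounds; each node moves 2*(n-1)/n*size bytes in total,
    # slowed down by the (hypothetical) contention factor on congested links.
    return 2 * log2(n) * alpha + 2 * ((n - 1) / n) * size * beta * contention

def break_even(n, alpha, beta, contention):
    """Smallest message size (bytes, scanned in 1KB steps up to 64MB) at
    which the ring model becomes cheaper than the contended butterfly."""
    for size in range(1024, 64 * 1024 * 1024, 1024):
        if ring_time(n, size, alpha, beta) < butterfly_time(n, size, alpha, beta, contention):
            return size
    return None
```

In this model both algorithms move the same 2(N-1)/N x size bytes per node when the contention factor is 1, so the ring's 2(N-1) start-ups can never beat the butterfly's 2 log2(N); a contention factor above 1 yields a finite break-even size, matching the qualitative behavior reported for the 128-process runs.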
However, for data sizes 512KB and 1MB, the performance of the proposed algorithm is slightly worse. We believe this is mainly due to the interaction between memory references and network contention, which results in the abnormal performance at the 512KB and 1MB data points.

Figure 3: Results on the NCSA Teragrid IA-64 cluster (128 processors, Myrinet). (a) Medium sized data; (b) Large sized data (time in ms versus message size in bytes).

Nonetheless, the figure shows the trend of the relative performance of native (butterfly) and the proposed algorithm: the proposed algorithm is more effective as the data size becomes larger.

Figure 4: Results on the Draco cluster (128 cores, InfiniBand (20Gbps)). (a) Medium sized data; (b) Large sized data (time in ms versus message size in bytes).

Ethernet switched cluster results

For the Ethernet switched cluster, we show the results on 32-node clusters with the topology shown in Figure 5. We performed experiments on other topologies with multiple switches and the trend is similar. Figure 6 shows the results for the Ethernet cluster. On this cluster,

the proposed algorithm is more efficient than both Open MPI and MPICH2 when the data size is larger than 256KB. Hence, the threshold value of msize for the proposed algorithm to be more efficient is 256KB/32 = 8KB, which is much larger than the 2KB in the high-end NCSA cluster. This reflects the fact that on an Ethernet switched cluster with Gigabit switches, the communication start-up overhead is much larger than that in high-end clusters. Thus, even for reasonably large data sizes (e.g. 128KB), it is still more important to reduce the communication start-up overheads. However, when the data size is even larger, network contention and bandwidth efficiency become a problem and our bandwidth optimal algorithm is more efficient.

Figure 5: Ethernet topologies used in the experiments (nodes n0-n15 attached to switch S0 and nodes n16-n31 attached to switch S1).

Figure 6: Results for the Ethernet switched cluster (32 single-core nodes). (a) Medium sized message; (b) Large sized message (time in ms versus message size for MPICH2, Open MPI, and Ours).
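The ring-based algorithm evaluated throughout these experiments builds on the standard construction of a reduce-scatter ring followed by an all-gather ring. The single-process simulation below is a minimal sketch of that generic pattern (our own rendering for illustration, not the paper's implementation); it takes 2(N-1) rounds, with each rank sending one of the N data blocks per round, which is where both the bandwidth optimality and the O(N) round count come from.

```python
def ring_allreduce(vectors):
    """Simulate a generic ring all-reduce: N-1 reduce-scatter rounds
    followed by N-1 all-gather rounds.  Each process's vector is split
    into N blocks; in every round each rank sends one block to the
    next rank on the ring."""
    n = len(vectors)
    m = len(vectors[0])
    assert m % n == 0, "for simplicity, the vector length divides evenly"
    bs = m // n
    buf = [list(v) for v in vectors]

    def block(b):
        b %= n
        return slice(b * bs, b * bs + bs)

    # Reduce-scatter ring: in round k, rank r sends block (r - k) mod n
    # to rank r+1, which adds it into its own partial copy.
    for k in range(n - 1):
        sends = [(r, buf[r][block(r - k)]) for r in range(n)]
        for r, data in sends:
            dst = (r + 1) % n
            sl = block(r - k)
            for j, i in enumerate(range(sl.start, sl.stop)):
                buf[dst][i] += data[j]

    # All-gather ring: circulate the fully reduced blocks around the ring.
    for k in range(n - 1):
        sends = [(r, buf[r][block(r + 1 - k)]) for r in range(n)]
        for r, data in sends:
            buf[(r + 1) % n][block(r + 1 - k)] = data
    return buf
```

After the N-1 reduce-scatter rounds, rank r holds the fully reduced block (r+1) mod N; the N-1 all-gather rounds then circulate those blocks so every rank ends with the complete result.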

6 Related Work

The all-reduce operation has been extensively studied. The one-item all-reduce operation has been studied under different names such as census function [2], global combine [3, 5, 6, 27], and gossip [16]. The lower bound for the communication time under various communication models has been established [2, 3, 5, 16]. In [16], it is shown that to complete a one-item all-reduce operation under the telephone model, at least 2N - 4 connections must be established when N > 4. In [2, 3, 5], the lower bounds for the number of rounds of communications and for the number of data items to be communicated in sequence in various postal models are established. However, to the best of our knowledge, the lower bound on the total number of data items to be communicated to complete the operation has not been established.

The all-reduce operation is one of the collective operations supported in the MPI standard [18], and thus all MPI libraries support this operation. Many efficient platform independent algorithms have been proposed for this operation [22, 23, 25, 27]. In [22], Rabenseifner proposed to realize the all-reduce operation by a reduce-scatter operation followed by an all-gather operation, and gave various algorithms for the reduce-scatter and all-gather operations. The butterfly-like algorithm was developed some time ago [22, 27] and has been extended to handle non-power-of-two numbers of processes [23]. Various architecture specific all-reduce schemes have also been developed [1, 4, 12, 17, 26]. An all-reduce algorithm was designed for BlueGene/L systems in [1]. In [12], an all-reduce scheme that takes advantage of remote DMA (RDMA) capability was developed for VIA-based clusters. The work in [17] investigated an adaptive all-reduce algorithm in an InfiniBand cluster that deals with the situation when all nodes do not arrive at the call site at the same time.
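The reduce-scatter-followed-by-all-gather construction behind the butterfly-like algorithm [22, 27] can be illustrated with a small single-process simulation. This is a hedged sketch of the textbook recursive-halving / recursive-doubling pattern for a power-of-two number of processes; the function and variable names are ours, and it is not code from any of the cited implementations.

```python
def butterfly_allreduce(vectors):
    """Simulate the butterfly-like all-reduce: recursive-halving
    reduce-scatter followed by recursive-doubling all-gather.
    `vectors` holds one equal-length vector per process; the number
    of processes must be a power of two."""
    n = len(vectors)
    buf = [list(v) for v in vectors]
    lo, hi = [0] * n, [len(vectors[0])] * n  # index range each rank still owns

    # Reduce-scatter: in each round, partners at distance `dist` split
    # their shared range in half; the lower rank keeps (and reduces) the
    # lower half, the higher rank the upper half.  log2(n) rounds.
    dist = n // 2
    while dist >= 1:
        for r in range(n):
            p = r ^ dist
            if r < p:                       # handle each pair once
                mid = (lo[r] + hi[r]) // 2
                for i in range(lo[r], mid):
                    buf[r][i] += buf[p][i]
                for i in range(mid, hi[r]):
                    buf[p][i] += buf[r][i]
                hi[r], lo[p] = mid, mid
        dist //= 2

    # All-gather: reverse the pattern, doubling the owned range each round.
    dist = 1
    while dist < n:
        for r in range(n):
            p = r ^ dist
            if r < p:
                for i in range(lo[p], hi[p]):
                    buf[r][i] = buf[p][i]
                for i in range(lo[r], hi[r]):
                    buf[p][i] = buf[r][i]
                new_lo, new_hi = min(lo[r], lo[p]), max(hi[r], hi[p])
                lo[r] = lo[p] = new_lo
                hi[r] = hi[p] = new_hi
        dist *= 2
    return buf
```

Each of the 2 log2(N) rounds pairs ranks whose IDs differ in one bit (partner = rank XOR dist); on a physical network that cannot embed this pattern contention-free, those pairwise exchanges are exactly where the link contention discussed above arises.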
A study of the all-reduce operation over WAN can be found in [4]. All-reduce algorithms were developed specifically for SMP clusters in [26]. Among all these algorithms, the butterfly-like algorithm [22, 23, 27] is the one that is widely used. The limitation of this algorithm is that it is difficult to realize the butterfly communication pattern without incurring network contention in contemporary SMP/multi-core clusters.

Many all-gather [7, 24, 25] and reduce-scatter [13, 24, 25] algorithms, which are parts of the proposed all-reduce algorithm, have been developed. The technique for finding contention-free logical rings on tree topologies was developed in [7]. In this paper, we use the existing results in various papers (e.g. [7, 22, 25]) as the components of our proposed algorithm. Our contribution lies in showing that putting these existing components together in the particular way yields a contention-free bandwidth optimal all-reduce algorithm for the tree topology. Like other architecture dependent collective algorithms [7, 8, 15, 21] that work well in some situations, the proposed scheme can be used in advanced communication systems [9, 14, 28].

7 Conclusions

We investigate efficient implementations of the all-reduce operation with large data sizes under the assumption that the reduce operator is both associative and commutative. We derive a theoretical lower bound on the communication time of this operation and develop a bandwidth optimal all-reduce algorithm on tree topologies. This algorithm only requires tree connectivity to achieve bandwidth optimality and can be applied to contemporary clusters. We demonstrate the effectiveness of the proposed algorithm on various contemporary clusters, including high-end clusters with SMP and/or multi-core nodes connected by high-speed interconnects, and low-end Ethernet switched clusters. While our algorithm is contention-free and bandwidth optimal, it is not optimal in the latency term: the number of communication rounds is proportional to the number of processes.

Acknowledgment

This work is supported in part by National Science Foundation (NSF) grants CCF-0342540, CCF-0541096, and CCF-0551555. Experiments are also performed on resources sponsored through an NSF Teragrid grant CCF-050010T.

References

[1] G. Almasi, et al., "Optimization of MPI Collective Communication on BlueGene/L Systems," International Conference on Supercomputing (ICS), pages 253-262, 2005.

[2] A. Bar-Noy, S. Kipnis, and B. Schieber, "Optimal Computation of Census Functions in the Postal Model," Discrete Applied Mathematics, 58:213-222, April 1995.

[3] A. Bar-Noy, J. Bruck, C.-T. Ho, S. Kipnis, and B. Schieber, "Computing Global Combine Operations in the Multiport Postal Model," IEEE Trans. on Parallel and Distributed Systems, 6(8):896-900, August 1995.

[4] L. Bongo, O. Anshus, J. Bjorndalen, and T. Larsen, "Extending Collective Operations With Application Semantics for Improving Multi-cluster Performance," Proceedings of the Third International Symposium on Parallel and Distributed Computing / Third International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Networks (ISPDC/HeteroPar), pages 320-327, 2004.

[5] J. Bruck and C.-T. Ho, "Efficient Global Combine Operations in Multi-port Message-Passing Systems," Parallel Processing Letters, 3(4):335-346, 1993.

[6] J. Bruck, L. De Coster, N. Dewulf, C.-T. Ho, and R. Lauwereins, "On the Design and Implementation of Broadcast and Global Combine Operations Using the Postal Model," IEEE Transactions on Parallel and Distributed Systems, 7(2):256-265, March 1996.

[7] A. Faraj, P. Patarasuk, and X. Yuan, "Bandwidth Efficient All-to-all Broadcast on Switched Clusters," International Journal of Parallel Programming, 36(4):426-453, August 2008.

[8] A. Faraj, X. Yuan, and P. Patarasuk, "A Message Scheduling Scheme for All-to-all Personalized Communication on Ethernet Switched Clusters," IEEE Transactions on Parallel and Distributed Systems, 18(2):264-276, Feb. 2007.

[9] A. Faraj, P. Patarasuk, and X. Yuan, "A Study of Process Arrival Patterns for MPI Collective Operations," International Journal of Parallel Programming, accepted for publication.

[10] W. Gropp, E. L. Lusk, N. E. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, 22(6):789-828, 1996.

[11] W. Gropp and E. L. Lusk, "Reproducible Measurements of MPI Performance Characteristics," Proceedings of PVMMPI, pages 11-18, 1999.

[12] R. Gupta, P. Balaji, D. K. Panda, and J. Nieplocha, "Efficient collective operations using remote memory operations on VIA-based clusters," Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS), page 46, April 2003.

[13] G. Iannello, "Efficient Algorithms for the Reduce-Scatter Operation in LogGP," IEEE Trans. on Parallel and Distributed Systems, 8(9):970-982, Sept. 1997.

[14] A. Karwande, X. Yuan, and D. K. Lowenthal, "An MPI Prototype for Compiled Communication on Ethernet Switched Clusters," Journal of Parallel and Distributed Computing, 65(10):1123-1133, October 2005.

[15] R. G. Lane, S. Daniels, and X. Yuan, "An Empirical Study of Reliable Multicast Protocols over Ethernet-Connected Networks," Performance Evaluation Journal, 64(3):210-228, March 2007.

[16] W. Knodel, "New Gossips and Telephones," Discrete Math., 3(1):95, 1975.

[17] A. Mamidala, J. Liu, and D. Panda, "Efficient Barrier and Allreduce on InfiniBand Clusters using Hardware Multicast and Adaptive Algorithms," Proceedings of the IEEE International Conference on Cluster Computing, pages 135-144, 2004.

[18] The MPI Forum. MPI: A Message-Passing Interface Standard, Version 1.3, May 2008. Available at http://www.mpi-forum.org/docs/mpi-1.3/mpi-report-1.3-2008-05-30.pdf.

[19] NCSA Teragrid IA-64 Linux Cluster. http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/TGIA64LinuxCluster.

[20] Open MPI: Open Source High Performance Computing. http://www.open-mpi.org/.

[21] P. Patarasuk, A. Faraj, and X. Yuan, "Techniques for Pipelined Broadcast on Ethernet Switched Clusters," Journal of Parallel and Distributed Computing, 68(6):809-824, June 2008.

[22] R. Rabenseifner, "Optimization of Collective Reduction Operations," International Conference on Computational Science, LNCS 3036, pages 1-9, 2004.

[23] R. Rabenseifner and J. L. Traff, "More Efficient Reduction Algorithms for Non-power-of-two Number of Processors in Message-Passing Parallel Systems," EuroPVM/MPI, LNCS 3241, pages 36-46, 2004.

[24] W. B. Tan and P. Strazdins, "The Analysis and Optimization of Collective Communications on a Beowulf Cluster," Proc. of the Ninth International Conference on Parallel and Distributed Systems (ICPADS'02), page 659, 2002.

[25] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," International Journal of High Performance Computing Applications, 19(1):49-66, Spring 2005.

[26] V. Tipparaju, J. Nieplocha, and D. Panda, "Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters," Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS), page 84, 2003.

[27] R. van de Geijn, "On Global Combine Operations," Journal of Parallel and Distributed Computing, 22(2):324-328, 1994.

[28] X. Yuan, R. Melhem, and R. Gupta, "Algorithms for Supporting Compiled Communication," IEEE Transactions on Parallel and Distributed Systems, 14(2):107-118, Feb. 2003.