1 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 1143 Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems Senior Member IEEE Member IEEE , Ching-Tien Ho, , , Jehoshua Bruck, , Senior Member IEEE IEEE Member Shlomo Kipnis, , and Derrick Weathersby , Eli Upfal, , , index —We present efficient algorithms for two all-to-all communication operations in message-passing systems: (or all-to- Abstract concatenation all personalized communication) and (or all-to-all broadcast). We assume a model of a fully connected message- passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We als o k k assume that each processor has messages in every communication round. The 1 ports, through which it can send and receive ≥ complexity measures we use are independent of the particular system topology and are based on the communication start-up time, and on the communication bandwidth. n i n In the index operation among blocks of data, and the goal is to exchange the th processors, initially, each processor has j j i n block of processor . We present a class of index algorithms that is designed for all values of th block of processor and with the that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system. n In the concatenation operation, among processors, initially, each processor has one block of data, and the goal is to n n concatenate the processors, and to make the concatenation result known to all the processors. We blocks of data from the n present a concatenation algorithm that is optimal, for most values of , in the number of communication rounds and in the amount of data transferred. —All-to-all broadcast, all-to-all personalized communication, complete exchange, concatenation operation, distributed- Index Terms memory system, index operation, message-passing system, multiscatter/gather, parallel system. —————————— —————————— F 1I NTRODUCTION ollective communication operations [2] are communica- applications across different architectures, and reflect concep- tion operations that generally involve more than two tual grouping of processes. In particular, collective communi- C processors, as opposed to the point-to-point communication cation is used extensively in many scientific applications for between two processors. Examples of collective communica- which the interleaving of stages of local computation with tion operations include: (one-to-all) broadcast, scatter, gather, stages of global communication is possible (see [12]). index (all-to-all personalized communication), and concate- This paper studies the design of all-to-all communication nation (all-to-all broadcast). See rvey of col- [13], [16] for a su algorithms, namely, collective operations in which every lective communication algorithms on various networks with processor both sends data to and receives data from every various communication models. other processor. In particular, we focus on two widely used The need for collective communication arises frequently in operations: index (or all-to-all personalized communication) parallel computation. Collective communication operations (or all-to-all broadcast). concatenation and simplify the programming of applications for parallel comput- The algorithms described here are incorporated into ers, facilitate the implementation of efficient communication the Collective Communication Library (CCL) [2], which schemes on various machines, promote the portability of was designed and developed for the new IBM line of scal- able parallel computers. The first computer in this line, the IBM 9076 Scalable POWERparallel System 1 (SP1), ———————————————— was announced in February 1994. • J. Bruck is with the California Institute of Technology, Mail Code 136-93, Pasadena, CA 91125. E-mail: [email protected] 1.1 Definitions and Applications • C.-T. Ho and E. Upfal are with IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120. E-mail: {ho, upfal}@almaden.ibm.com. , n , p p . , I p processors : The system consists of NDEX º S. Kipnis is with News Datacom Research Ltd., 14 Wedgewood St., Haifa • - n 1 0 1 34635, Israel. E-mail: [email protected] B n [ i , 0], has blocks of data Initially, each processor p i • D. Weathersby is with the Department of Computer Science and Engi- B [ i , j ] is of size i , 1], i , B [ B , n [ 1], where every block - º neering, University of Washington, Seattle, WA 98195. ] (the j , i [ B b . The goal is to exchange block j th data E-mail: [email protected] th data i ] (the ) with block , j [ B i p block of processor i anuscript received 6 Apr. 1994; revised 27 Apr. 1997. M n j , i 1. The final ), for all 0 p block of processor £ £ - For information on obtaining reprints of this article, please send e-mail to: j [email protected], and reference IEEECS Log Number 100822. 1045-9219/97/$10.00 © 1997 IEEE

2 1144 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 , for 0 £ i £ n - 1, holds result is that each processor p general and flexible. For instance, this model allows the i [0, [1, i ], º , B [ n - 1, i ]. i B blocks B ], development of algorithms that are portable between dif- ferent machines, that can operate within arbitrary and dy- p p , , processors : The system consists of n ONCATENATION C 0 1 namic subsets of processors, and that can operate in the . Initially, each processor p has a block of data , p º - i 1 n presence of faults (assuming connectivity is maintained). In i ] of size b . The goal is to make the concatenation of B [ addition, algorithms developed for this model can also be [ n - 1], the n data blocks, namely, [0] B [1] B B helpful in designing algorithms for specific topologies. known to all the n processors. We use the linear model [13] to estimate the commu- Both the index and concatenation operations are used nication complexity of our algorithms. In the linear extensively in distributed-memory parallel computers and -byte message from one model, the time to send an m are included in the Message-Passing Interface (MPI) stan- processor to another, without congestion, can be mod- dard proposal [24]. (The index operation is referred to as T = b + b eled as is the overhead (start-up m t , where MPI_Alltoall in MPI, while the concatenation is referred time) associated with each send or receive operation, and to as MPI_Allgather in MPI.) For example, the index op- t is the communication time for sending each additional eration can be used for computing the transpose of a ma- byte (or any appropriate data unit). trix, when the matrix is partitioned into blocks of rows (or For convenience, we define the following two terms in columns) with different blocks residing on different proc- order to estimate the time complexities of our communica- essors. Thus, the index operation can be used to support tion algorithms in the linear model: the remapping of arrays in HPF compilers, such as re- : the number of communication steps (or rounds) • C 1 mapping the data layout of a two-dimensional array from is an important meas- required by an algorithm. C 1 (block, *) to (cyclic, *), or from (block, *) to (*, block). The ure when the communication start-up time is high, index operation is also used in FFT algorithms [22], in relative to the transfer time, of one unit of data, and Ascend and Descend algorithms [26], in the Alternating the message size per send/receive operation is Direction Implicit (ADI) method [21], and in the solution relatively small. of Poisson’s problem by the Fourier Analysis Cyclic Re- : the amount of data (in the appropriate unit of • C 2 duction (FACR) method [28], [23], or the two-dimensional communication: bytes, flits, or packets) transferred in FFT method [8]. The concatenation operation can be used be the largest size of a a sequence. Specifically, let m i in matrix multiplication [19] and in basic linear algebra message (over all ports of all processors) sent in operations [12]. is the sum of all the s over all m i round . Then, C i 2 is an important measure when the start- C . rounds i 1.2 Communication Model 2 up time is small compared to the message size. We assume a model of a multiport fully connected mes- sage-passing system. The assumption of full connectivity Thus, in our fully connected, linear model, an algorithm means that each processor can communicate directly with T has an estimated communication time complexity of = any other processor and that every pair of processors are + b . It should be noted that there are more detailed t C C 2 1 equally distant. The assumption of multiple ports means communication models, such as the BSP model [30], the that, in every communication step (or round), each proces- Postal model [3], and the LogP model [9], which further k processors and simul- distinct messages to sor can send k take into account that a receiving processor generally com- taneously receive k messages from k other processors, for receive operation later than the corresponding pletes its some k ≥ 1, - £ k £ 1. Throughout the paper, we assume 1 n send operation. However, sending processor finishes its is the number of processors in the system. The where n designing practical and efficient algorithms in these models multiport model generalizes the one-port model that has is substantially more complicated. Another important issue been widely investigated. There are examples of parallel is the uniformity of the implementation. For example, in the k -port capabilities for systems with k > 1, such as the LogP model, the design of collective communication algo- nCUBE/2, the CM-2 (where k is the dimension of the hyper- , the number of processors. Optimal algo- P rithms is based on cube in both machines), and transputer-based machines. rithms for two distinct values of may be very different. This P Such a fully connected model addresses emerging trends presents a challenge when the goal is to support collective in many modern distributed-memory parallel computers communication algorithms for processor groups with various and message-passing communication environments. These sizes while using one collective communication library. trends are evident in systems such as IBM’s Vulcan [6], 1.3 Main Contributions and Organization MIT’s J-Machine [10], NCUBE’s nCUBE/2 [25], Thinking We study the complexity of the index and concatenation llel Machines’ CM-5 [29], and IBM’s 9076 Scalable POWERpara k operations in the -port fully connected message-passing System 1, and in environments such as IBM EUI [1], PICL model. We derive lower bounds and develop algorithms [14], PARMACS [17], Zipcode [27], and Express [31]. These for these operations. The following is a description of our systems and environments generally ignore the specific main results: structure and topology of the communication network and assume a fully connected collection of processors, in which • Lower bounds: Section 2 provides lower bounds on each processor can communicate directly with any other for both the con- C and C the complexity measures 1 2 processor by sending and receiving messages. The fact that catenation and the index operations. this model does not assume any single topology makes it

3 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1145 For the concatenation operation, we show that any n C ), or optimal log and suboptimal lower bound 1 + k 1 algorithm requires Cn log communication ≥ - bn 1 af 1 + 1 k C - ). (at most b 1 more than the lower bound 2 k - bn 1 af C ≥ units of data. rounds and sends 2 k Appendices A and B provide pseudo- Pseudocode: • For the index operation, we show that any algo- code for the index and concatenation algorithms, re- ≥ rithm requires Cn communication rounds log spectively, in the one-port model. Both the index 1 + 1 k and concatenation operations were included in the - bn 1 af C ≥ units of data. We also show and sends 2 k [2] of the Exter- Collective Communication Library is a power of n that, when k + 1, any index algorithm 9076 Scalable nal User Interface (EUI) [1] for the that uses the minimal number of communication POWERparallel System (SP1) by IBM. In addition, these one-port versions of the algorithms have been rounds (i.e., C n ) must transfer = log k +1 1 implemented on various additional software plat- bn Cn log ≥ units of data. Finally, we show that, 2 + k 1 + 1 k forms including PVM [15], and Express [31]. in the one-port model, if the number of communica- tion rounds C must be C n ). n log bn ( W ), then is O (log B 2L OWER OUNDS 2 1 • Index algorithms: Section 3 describes a class of effi- This section provides lower bounds on the complexity n cient algorithms for the index operation among C and for algorithms that perform the con- C measures 2 1 processors. This class of algorithms is designed for catenation and index operations. Proposition 2.1 was and features a trade-off be- n arbitrary values of shown in [13]. We include it here for completeness. tween the start-up time (measure C ) and the data 1 2.1 Lower Bounds for the Concatenation Operation C transfer time (measure ). Using a parameter , r 2 ≥ 1 , any concatenation l, port mode - for k k In the 2.1. P ROPOSITION r £ , the communication complexity where 2 £ n 1 - r ≥ algorithm requires n C communication rounds. log Cn log measures of the algorithms are = 1 + k 1 1 r k 1 - n r n Cb and log . Note that, following our £ P . p . Focus on one particular processor, say, processor ROOF r 2 r 0 k The concatenation operation requires, among other cannot be ob- C and C lower bound results, optimal 2 1 p ssor [0] of proce B things, that the data block be 0 tained simultaneously. To increase the performance of n processors. With communica- broadcast among the k can be carefully the index operation, the parameter r tion ports per processor, data block B [0] can reach at chosen as a function of the start-up time , the data b d processors in d communication rounds. most ( k + 1) transfer rate t , the message size , the number of b d + 1) For ( k dn , we must have log n to be at least ≥ n processors , and the number of ports k . Two special + k 1 cases of this class are of particular interest: One case communication rounds. Ü exhibits the minimal number of communication P k port model, for - k In the 2.2. , any concatena- 1 ≥ ROPOSITION C rounds (i.e., n is minimized to by choosing log 1 + k 1 - bn 1 af C . units of data tion algorithm transfers ≥ + 1), and another case features the minimal k = r 2 k amount of data transferred (i.e., C is minimized to 2 1 data blocks of n . Each processor must receive the - ROOF P 1 - n b n by choosing ). The one-port version of the r = k - n the other 1 processors, the combined size of which index algorithm was implemented on the IBM’s SP-1 - n ( b is 1) units of data. Since each processor can use to confirm the existence of the trade-off between C 1 its input ports simultaneously, the amount of data k n is a power of C . It should be noted that, when and transferred through one of the input ports must be at 2 - bn 1 af two, there are known algorithms for the index opera- least . Ü k tion which are based on the structure of a hypercube (see [5], [20], [18]). However, none of these algorithms 2.2 Lower Bounds for the Index Operation n can be easily generalized to values of that are not powers of two without losing efficiency. The idea of a In the 2.3. P , any index algo- 1 ≥ k port model, for - k ROPOSITION trade-off between C and C is not new and has been Cn log ≥ rithm requires communication rounds. 1 2 1 + k 1 applied to hypercubes in [5], [18]. P £ ], 0 i [ B . Any concatenation operation on an array n i , < ROOF • Concatenation algorithms: Section 4 presents algo- , i [ B can be reduced to an index operation on i £ , ], 0 j rithms for the concatenation operation in the k -port . Thus, the j and i ] for all i [ B ] = j , i [ B , by letting n < j model. These algorithms are optimal for any values of Ü proposition follows from Proposition 2.1. , and k , except for the following range: b , 3, k ≥ n 3, b ≥ d d P k In the k-port model, for , any index algo- ≥ 1 2.4. ROPOSITION + 1) and ( k k - b = 1 + 1) k < ( , for some d . (Thus, if n < - bn 1 af C units of data. rithm transfers ≥ or = 1, which covers most practical cases, our algo- k 2 k rithm is optimal.) In this special range, we achieve ei- . Similar to the proof of Proposition 2.3, the proposi- P ROOF ther optimal C and suboptimal (one more than the C 1 2 tion follows from Proposition 2.2. Ü

4 1146 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 2.3 Compound Lower Bounds for the Index th data block of processor p case that the j is sent directly i Operation from processor p to processor . (That is, each data p j i Here, we provide additional lower bounds for the index block is sent exactly once from its source to its destina- C operation. These lower bounds characterize the measure 1 tion, and no processor can forward data blocks of other C and vice versa. Theorems 2.5 and 2.7 as a function of 2 processors.) In this case, each processor must send 1 - n C show that when is optimized first, the lower bound on - distinct messages to the other n 1 processors. Any such 1 - n 1 C ≥ algorithm must require rounds. Ü C (log O becomes an order of ) higher than the “stand- n 1 k +1 2 k alone” lower bound given in Proposition 2.4. Then, Theo- T C ̆ n = È 2.7. Any index algorithm that uses log HEOREM +1 k 1 rem 2.6 shows that when C is optimized first, the lower 2 communication rounds must transfer at least n C bound on - . becomes ( log as opposed to k 1)/ n bn 1 + k 1 units of data. log Cn = W di 2 + k 1 + k 1 Finally, Theorem 2.9 gives a more general lower bound for P = 1. Con- b . It is sufficient to prove the theorem for ROOF the one-port case. sider any algorithm that finishes the index operation d , for some integer d , then any ≥ 0 T n If 2.5. = ( + 1) k HEOREM C in d = (minimum) rounds. We show that the algo- 1 2 C index algorithm that uses exactly = log n communi- +1 k 1 ( rithm executed a total of W n n ) data transmis- log k +1 bn Cn = log cation rounds must transfer at least 2 + 1 k + k 1 sions (over all nodes), thus, there is a port that trans- n units of data. mitted log units of data. n W di + k 1 + k 1 d . Let k n + 1) = ( . In order to finish the algorithm in ROOF P We first concentrate on the data distribution from a 1 nodes. Any such - n to all other v given source node exactly log rounds, the number of processors = d n k +1 algorithm can be characterized by a sequence of d + 1 having received data from a given processor, say p , i sets, S is the set of nodes that have , S S , , where , S d i 0 1 + 1 in every round. This must grow by a factor of k received their respective data by the end of communi- T defines a unique structure of the spanning tree , i i cation round . Thus, S S = { | = n , and S v contains }, | i d 0 which is rooted at p , that is a generalized version of i S S , plus nodes that received data from nodes in - - 1 i i 1 the binomial tree used to distribute the 1 data - n = | S x |. in the i th communication rounds. Let i i p blocks of processor n among the other 1 proces- - i Clearly, x ( , because each node in x + 1) k £ can S x £ i i +1 i i sors. Denote by the number of processors at level j , j k k -port send data to at most other nodes under the in tree T . One may use induc- rooted at processor p i i model. dj Next, we assign weights to the nodes in the sets, l = k . Now, the total amount of tion to show that ej jj u s, where the weight of a node represents the S in S i i data D that is injected into the network over the i path length (or the number of communication rounds T edges of the binomial tree rooted at p is given by i i u to v incurred) from in achieving the data distribu- d d tion. The weights can be assigned based on the fol- k d j F I dn Dbj bj kb , l = == ÂÂ ij j due to a u lowing rule. If a node S appears first in H K 1 k + i == j j 00 S data transmission from node w in , then the - i 1 where the last equality step can be derived by differ- weight of u is the weight of w plus one. Note that, entiating both sides of once a node is assigned a weight, it holds the same d weight in all subsequent sets. d d j F I kk 1 =+ bg Â By Lemma 2.8, we know that there are at most j H K = j 0 jf nodes of weight in . Our goal is to give a S f k ej j f . Now, clearly, bk and then multiplying both sides by n lower bound for the sum of the weights of the . Without loss of generality, we can as- nodes in S - n 1 d D bn bn i Ü . ≥= = d C n log sume that the sum of the weights is the minimum Â 2 + k 1 + + k k nk 11 = i 0 possible. d d df T 2.6. Any algorithm for the index operation that trans- HEOREM ==+ 1 , d . By the choice of Xkk Let bg ej Â f = f 0 - bn 1 af = fers exactly units of data from each processor C £< + nXn k 1 . 2 bg k - 1 n d C ≥ requires communication rounds. - 1 df 1 2 k Let Yk . Since, for = ej Â f = f 0 P n - 1 data . In the index operation, each processor has ROOF - df ddf d 1 , 1, fkk £ £- e j ej f - df blocks that it needs to send to the other 1 processors. n - 2 k - bn 1 af If each processor is allowed to transfer at most k 1 - d 1 Xkn Y 1 . < =+ £ bg units of data per port over all rounds, then it must be the k + 1

5 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1147 Thus, the algorithm must use all the possible nodes rounds must have the following n log c that finishes in with weights less than 2. d j such that 1 j property. For every £ , there exist h £ cn log To bound the sum of the weights ,we need a lower j messages from each node that travel at least ej j bound on hops in the network. Notice that, in this property, d - 1 2 each message can only be counted once for a given . j d f I F Zf k . = Therefore, the average number of hops a message has Â f K H = f 0 log £ , or log /2, if h n h n to travel for each node is /2, if df £- fd 21, . Thus, at least is monoton in f For k W ≥ ) from Lemma C.1 in n (log log must be h . Since n h ej f ( Ü ). W = bn log n Appendix C, we have C 2 /2 of the nodes have weight at least n 212 . - d di ( W = nd That is, ). Z 3I A NDEX LGORITHMS Summing over all origins, the total number of 2 This section presents a class of efficient algorithms for the nZ transmissions is at least = W ( n ). Thus, at least d index operation. First, we provide an overview of the algo- one port has a sequence of rithms. Then, we focus on the communication phase of the n n F F I I algorithms for the one-port model. Next, we describe two C d n = WW = log G G J J 2 + k 1 k k 1 + K H H K special cases of this class of algorithms. Then, we generalize the algorithms to the k -port model. And finally, we com- data transmissions. Ü ment on the implementation and performance of this class jf L in S nodes of weight f There are no more than 2.8. EMMA k of algorithms. ej j f defined in the proof of Theorem ( 2.7). 3.1 Overview . We prove by induction on . There is clearly no more j ROOF P The class of algorithms for the index operation among n nodes of weight k than one node of weight zero and processors can be represented as a sequence of processor- memory configurations. Each processor-memory configu- one in S . Assume that the hypothesis holds for j - 1. 1 ration has n n blocks each. Columns are la- columns of - jf 1 nodes of weight f contains up to S Note that k ej j f 1 (from left to right in the fig- beled from 0 through - n , plus up to that appeared with the same weight in S ures) and blocks are labeled from 0 through n - 1 (from - j 1 -- 11 jf i top to bottom in the figures). Column represents proces- kk nodes that receive data at communication ej - 1 f represents the j th data block in the , and block j sor p i memory offset. The objective of the index operation, then, round j from nodes with weight f - 1 in . The claim S j 1 - is to transpose these columns of blocks. Fig. 1 shows an holds for j since example of the processor-memory configurations before - j j j - 11 I F I F F I - fff 1 = 5 processors. The n and after the index operation for Ü . = + k k kk f f f - 1 K H K H H K th data block ini- ij ” in each box represents the notation “ j is referred to as j . The label p tially allocated to processor i T k When = 1 2.9. , any algorithm for the index operation HEOREM the block-id. communication rounds must ) n O = (log C that uses 1 All the algorithms in the class consist of three phases. units of data. ) n log bn ( W = C transfer 2 Phases 1 and 3 require only local data rearrangement on each processor, while Phase 2 involves interprocessor c . Assume that there is an algorithm with C £ log n ROOF P 1 communication. ≥ for some constant 1. Consider the binomial distri- c log cn 1. Each processor p data n independently rotates its HASE P i bution , such that h . Let be the minimal , ej j blocks i steps upwards in a cyclical manner. + 1 l log cn n ≥ . One can show that any algorithm ej Â j steps p j th data block j 2. Each processor rotates its HASE P i = j 0 to the right in a cyclical manner. This rotation is im- Fig. 1. Memory-processor configurations before and after an index operation on five processors.

6 1148 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 Fig. 2. An example of memory-processor configurations for the three phases of the index operation on five processors. x r , are rotated z ◊ their block-id is z plemented by interprocessor communication. steps to the right. This is accomplished in a communication round by a n independently rotates its 3. Each processor p data HASE P i direct point-to-point communications between processor x steps downwards in a cyclical manner. i blocks £ , for each 0 n ) mod - n 1. £ i + i and processor ( i z r ◊ Fig. 2 presents an example of these three phases of the For example, when r is chosen to be 3, the fifth block = 5 n algorithm for performing an index operation among will be rotated two steps to the right during Step 2 of processors. Subphase 0, and later rotated again three steps to the right The implementation of Phases 1 and 3 on each proces- during Step 1 of Subphase 1. This follows from the fact sor involves only local data movements and is straight- that 5 is encoded into “12” using radix-3 representation. forward. In the sequel, we focus only on the implementa- subphases, all data blocks have been Note that, after w tion of Phase 2. Different algorithms are derived depend- rotated to the correct destination processor as specified by ing on how the communication pattern of Phase 2 is de- the processor id. However, data blocks are not necessarily composed into a sequence of point-to-point communica- in their correct memory locations. Phase 3 of the algorithm tion rounds. fixes this problem. The following points are made regarding the performance 3.2 The Interprocessor Communication Phase of this algorithm. We present the decomposition of Phase 2 into a sequence of point-to-point communication rounds, assuming the one- Each step can be realized by a single communica- • (for radix ) in the range port model and using a parameter r tion round by packing all the outgoing blocks to the 2 n £ . £ r same destination into a temporary array and send- For convenience, we say that the of the block-id j th data ing them together in one message. Hence, each . Consider the j block in each processor after Phase 1 is subphase can be realized in at most r 1 communi- - j in rotation required in Phase 2. Each block with a block-id cation rounds. needs to be rotated to processor ( i + processor i j . n ) mod • The size of each message involved in a communica- n data. tion round is at most b £ , where 0 j The block-id 1, can be encoded using - n £ j r = radix- r representation using wn log digits. For con- r • Hence, the class of the index algorithms has complex- ity measures Cr n 1 £- log and af 1 venience, we refer to these w - digits from zero through w 1 r starting with the least significant digit. Our algorithm for n , log 1 £- n Cbr w subphases corresponding to the w Phase 2 consists of af r 2 r digits. Each subphase consists of at most r - 1 steps, corre- n £ r £ is chosen in the range 2 r where . r - 1 different non-zero values of a sponding to the (up to) 1, we iterate Step 1 - w £ given digit. In subphase x £ , for 0 x 3.3 Two Special Cases r - 1, as follows: through Step The class of algorithms for the index operation in the one- port model contains two interesting special cases: z • r - 1 and During Step z of subphase x , where 1 £ £ th digit of 1, all data blocks, for which the - w £ x £ 0 x 1) When r = 2, the derived algorithm requires

7 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1149 Fig. 3. An example of memory-processor configurations for the index algorithm on five processors, which has an optimal measure. C 1 rithm remain the same. In Phase 2, we still have log Cn = 12 wn subphases as before, corresponding to the log = w r communication rounds, which is optimal with respect , where 0 digits in radix- representation of any block-id r j £ . Also, in this case, to the measure C 1 1 1. In each subphase, there are, at most, r j n - - £ n “independent” point-to-point communication steps that , log £ Cb n 2 2 2 need to be performed. Since these point-to-point communi- which is optimal (to within a multiplicative factor) for cation steps are independent, they can be performed in par- Cn shows such an ex- . Fig. 3 log = the case when allel, subject to the constraint on the number of parallel in- 12 put/output ports . Thus, every of these communication k k = 5. The shaded data blocks are = 2 and ample with n r steps can be grouped together and performed concurrently. the ones subject to rotation during the next subphase. 1 - r 1) ( = , the derived algorithm transfers 2) When = b n - r n C Therefore, each subphase consists of at most commu- 2 k units of data from each node, which is optimal with nication steps. The complexity measures for the index algo- . The value of in this case respect to the measure C C 1 2 rithm under the -port model, therefore, are k = 1, which is optimal for the case when = is C n - C 1 2 - - 1 1 n r r n Cn Cb log , where and can log r £ £ r r 2 1 r k k 1). ( b n - and be chosen in the range 2 . To minimize both n r £ C £ 1 = 2 should be chosen when the start-up time of the Hence, r underlying machine is relatively significant, and the product 1) mod , one clearly needs to choose = 0. , such that ( C r - k r 2 and the per-element transfer time is rela- of the block size b 3.5 Implementation = should be chosen when tively small. On the other hand, r n the start-up time is negligible. In general, can be fine-tuned r = 1) of the We have implemented the one-port version ( k according to the parameters of the underlying machines to index algorithm on an IBM SP-1 parallel system. (The IBM balance between the start-up time and the data transfer time. SP-1 is closer to the one-port model in the domain of the multiport model.) The implementation is done on top of the 3.4 Generalization to the k -Port Model point-to-point message-passing external user interface We now present a modification to the index algorithm (EUI), running on the EUIH environment. At this level, the above for the -port model. Phase 1 and Phase 3 of the algo- k sec, and the , measures about 29 communication start-up, m b

8 1150 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 sustained point-to-point communication bandwidth is 0.12 m sec/byte. about 8.5 Mbytes/sec, i.e., t < Fig. 4 shows the measured times of the index algorithm as a function of message size with various power-of-two r on a 64 node SP-1. As can be seen, the smaller ra- radices dix tends to perform better for smaller message sizes, and vice versa. Fig. 5 compares the measured times of the index algo- r = 2, r = rithm with = 64, and optimal r among all power- n of-two radices, respectively, on a 64 node SP-1. The break- even point of the message size between the two special cases of the index algorithms (i.e., = 2 and r = n ) occurs at r about 100 to 200 bytes. The index algorithm with optimal power-of-two radix, as expected, is the best overall choice. Fig. 6 shows the measured times of the index algorithm as a function of radix for three different message sizes: 32 bytes, 64 bytes, and 128 bytes. As the message size in- Fig. 6. The measured times of the index algorithm as a function of radix for various message sizes on a 64 node SP-1. creases, the minimal time of the curve tends to occur at a higher radix. When comparing these measured times with our predicted times based on the linear model, we find big discrepancies quantitatively, but relatively consistent qualitatively. Note that we are mainly interested in the qualitatively behavior of the index algorithm on a general message-passing system. We believe the quantitative differences between the measured times and the predicted times are due to the following factors: 1) There are various system routines running in the background that have a higher priority than the user processes. 2) We do not model the copy time incurred by the func- pack, and unpack tion copy , (see the pseudocode in Appendix A). 3) We do not model the congestion behavior of the SP-1. Fig. 4. The measured time of the index algorithm as a function of mes- sage sizes on a 64 node SP-1. 4) There is a slowdown factor, somewhere between one and two, from the linear model to the send_and_receive model. If we model the congestion behavior as a fixed multi- and assume the system routines have t plicative factor of c a fixed slowdown factor of the overall time, then the total time for the index operation can be modeled as C t C t g + . + g T = g c s 3 2 1 1 2 A 4C LGORITHMS ONCATENATION There are two known algorithms for the concatenation op- eration in the one-port model. The first is a simple folklore algorithm which consists of two phases. In the first phase, the n blocks of data from the n processors are accumulated to a designated processor, say processor p . This can be 0 is not a done using a binomial tree (or a subtree of it when n power of two). In the second phase, the concatenation re- p sult from processor processors using n is broadcast to the 0 = 64, = = 2, Fig. 5. The measured times of the index algorithm with r n r the same binomial tree. This algorithm is not optimal since among all power-of-two radices, respectively, on a 64 and optimal r Cn = it consists of log 2 communication rounds and node SP-1. 1

9 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1151 edges {(0, 1), (0, 2), , (0, k )}.) In general, in round r , transfers = 2 b ( n - 1) units of data. The second known C 2 to the £ £ where 0 r S 2, we add edges with offsets in - d r concatenation algorithm is for the case when n is a power of current partial spanning tree to form a new larger partial = 1 (see [20]). This algorithm is based on the two and k spanning tree. It is easy to verify that, after d 1 rounds, - structure of a binary hypercube and is optimal in both C and 1 nodes starting from n the resulting tree spans the first 1 C . For a given k 1, this algorithm can be generalized to the ≥ node 0, namely, nodes 0 through n 1. Fig. 7 illustrates - 2 1 case where k is a power of n + 1 by using the structure of a = 9. n = 2 and k for the case of T the process of constructing 0 , n generalized hypercube [4]. However, for general values of to construct the spanning trees T , for T Next, we use tree i 0 we do not know of any existing concatenation algorithm that to £ i 1. We do this by translating each node £ 1 in j T - n 0 . Also, the round id associated with node ( + i ) mod n T j in is optimal in both C C , even when b = k = 1. and in i 1 2 (which represents the round during T each tree edge in i In this section, we present efficient concatenation algo- which the corresponding communication is performed) is the k rithms for the -port communication model that, in most . Fig. 8 il- T same as that of the corresponding tree edge in 0 cases of C , are optimal in both k and n and C . Throughout 2 1 = 2 and n = 9. It is easy to see for the case of k lustrates tree T 1 this section, we assume that k - n £ is in the range 1 2. k £ was obtained from by adding one (modulo nine) to T that T 1 0 Notice that, for ≥ n - 1, the trivial algorithm that takes a k . T the labels of the nodes in 0 single round is optimal. The main structure that we use for deriving the algorithms is that of circulant graphs. We note here that circulant graphs are also useful in constructing fault-tolerant networks [7]. G . A circulant graph ) is characterized by two S , n ( EFINITION D parameters: the number of nodes . S and a set of offsets n, , n 1, G In - n nodes are labeled from 0 through n ), the S ( i and each node s - i is connected to node (( ) mod n ) and i + s ) mod n ) for all s Œ (see [11]). to node (( S The concatenation algorithm consists of d rounds. Let - 1 d d dn log = , that is, ( + 1) n . Also let < n = ( £ k + 1) k + k 1 - d 1 n = ( k and 1 £ n + 1) , where £ kn n . The rounds n + 2 1 1 2 1 Fig. 7. The two rounds in constructing the spanning tree rooted at node 0 of the algorithm can be divided into two phases. The first = 2. = 9 and for k n 1 rounds, at the end of which every - d phase consists of n node has the concatenation result of the - 1 nodes that 1 precede it in the numbering (in a circulant sense). The second phase consists of a single round and completes the nodes. n concatenation operation among the - 1 Rounds 4.1 The First d 1 rounds, we use a circulant graph G ( n - S ), For the first d , where S S , S S = < < < -2 d 1 0 i i i k , 2( k + 1) = {( , }. k + 1) k ( + 1) , º S i ( n nodes of G n n , We identify the processors with the S ), Fig. 8. The two rounds in constructing the spanning tree rooted at node - 1. which are labeled from 0 through n = 2. They can be derived by translating node ad- = 9 and 1 for k n The communication pattern related to broadcasting the dresses of the spanning tree rooted at node 0 in Fig. 7. data item of each node can be described by a spanning The concatenation algorithm in each node is specified by denote the spanning tree associated with the T tree. Let i £ i £ , for 0 1, as follows: - n the trees T ). We is rooted at node i i ] of node i (namely, T data item B [ i i describe the spanning tree associated with each node by In round i , for 0 - £ i £ d 2 , do: specifying the edges that are used in every communica- - For all 0 £ j £ n j 1, if data item B [ • ] is present at the i tion round. The edges associated with round are called . node, then send it on all round- -edges of tree T i j , and then we round- i -edges. First, we describe the tree T 0 Receive the corresponding data items on the round- - • i i £ n - 1, can be derived from , for 1 £ show how tree T i trees. n edges of all the . T tree 0 which consists only of We start with an initial tree T 0 1 - d After rounds of the above algorithm, every 4.1. HEOREM T T to S node 0. In round 0, we add edges with offsets in 0 0 , where ] j [ B data items n - 1 , has the n node i i £ 0 for , £ 1 to form a partial spanning tree; the added edges are the - d 1 Also, during these ). n + 1 (mod j + ≥ i - i n j ≥ 1 round-0-edges. (That is, in round 0, we add the set of is optimal : rounds, the measure C 2

10 1152 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 b table columns corre- rived as follows. Each of the n 2 . 1 =- C n ch 21 nodes yet to be spanned, and sponds to one of the n k 2 each of the table rows represents one byte. Table ele- b P 1, are derived , for 1 . The spanning trees ROOF T T - £ n £ i i , will use the same offset, ments in the same area, say A i by shifting the indices in a cyclic manner. from T 0 which is determined by the index of the leftmost column . No- Hence, it suffices to focus on the spanning tree T 0 . touched by A i tice that the algorithm can be implemented in a -port k It can be shown that a straightforward algorithm for , we use only the set of model, since, in every round i partitioning the table satisfies the above two conditions for , which consists of is offsets. Also, the tree offsets k T S 0 i , any combination of , and , except for the following k n b , where 0 a spanning tree for the nodes 1, be- £ i £ n p - d d 1 i , for some . + 1) range: 3 and ( 3, < ( < + 1) d k ≥ ≥ k b - k n k in this range can be represented using a cause every i For instance, Table 1 presents a partitioning example for . Hence, after set of distinct offsets from 1 rounds d - S = 3, which fall in the optimal range = 3, = 7, = 3, and k b n n 1 2 of the algorithm, the data items are distributed ac- . From is marked by the number of . The area covered by i n A i cording to the claim of the theorem. this table, one can derive the following scheduling for the associated with the Next, we need to prove that C 2 last round: 1 rounds is as claimed. By induction on , it fol- d - i i The sum of the weighted edges with offset 3 (in area • lows that, before round + 1) , any node has at most ( i k , receives three bytes from ) is 7. Thus, node p p A distinct data items. Hence, in round , any node sends i 3 1 0 i re- , and node receives three bytes from node p p p data items on any given edge. Thus, at most ( + 1) k 5 1 4 . ceives one byte from p 2 b 22 - d The sum of the weighted edges with offset 5 (in area • 11 1 1 1 . n k k Cb k £+++++++ = - L bgbg bg ch 1 2 k , receives two bytes from ) is 7. Thus, node p p A 2 5 0 However, by the lower bound argument, we have , and node receives three bytes from re- node p p p 1 7 6 b . ceives two bytes from p , and the claim follows. 1 Cn ≥- Ü ch 3 21 k The sum of the weighted edges with offset 7 (in area • 4.2 The Last Round ) is 7. Thus, node receives one byte from , node p p A 7 0 3 1, the last round of the algorithm, we have Before round - d receives three bytes from , and node receives p p p 9 8 1 had broadcast its mes- the following situation: Every node i . three bytes from p 2 sage to the 1 nodes succeeding it in the circular graph n - 1 After rotation, to generate spanning trees, each of which n 1 and had received the broadcast message from the - n 1 needs to send is rooted at a different node, each node i nodes preceding it in the circular graph. Consider tree T 0 + 5) mod seven bytes to nodes ( , and ( , ( + 7) + 3) mod n i i n i just before the last round. The first nodes (nodes 0 n 1 , mod 3) mod , and receive seven bytes from nodes ( i - n n through 1) are included in the current tree, and the re- n - 1 ( 7) mod , and ( 5) mod . i n - n - i nodes still need to be spanned. We bring the maining n 2 following proposition. 4.3. T HEOREM The above concatenation algorithm attains optimal - bn 1 af 4.2. P ROPOSITION The last round can be performed with Cn = = log C for any combination of and 1 2 + k 1 k bn 2 , , , C except for the = , for any combination of n b and k 2 k , , 3, 3, n b and ≥ and b k k, except for the following range: ≥ d d d d < , < ( + 1) 3, ( + 1) 3, n k k - following range: b k ≥ and k ≥ . + 1) < ( < + 1) ( d , for some integer k n - k k . d for some P . By combining Theorem 4.1 and Proposition 4.2, we ROOF 1 - bn bn af The proof of this proposition is somewhat complicated, b 2 Cn =-+ = 1 , which matches have ch 21 k k k and we only give the main ideas here. The basic idea is to transform the scheduling problem for the last round of the in Proposition 2.2. the lower bound of C Ü 2 algorithm into a table partitioning problem. (In the sense Fig. 9 presents an example of the concatenation algo- that, if the table partitioning problem can be solved, then = 1 and = 5. Note that, to simplify the pseudo- rithm for n k we have an optimal algorithm by deriving an optimal code included in Appendix A, we actually grow the span- schedule for the last round.) The table partitioning prob- using negative offsets. That is, in both the figure ning tree T i bn 2 = a . Given a table of lem is defined as follows. Let b k TABLE 1 columns, we would like to partition the table rows and n 2 RANSFORMED XAMPLE OF THE N ROBLEM FOR P p = 3 ( E n T A 0 1 into disjoint areas, denoted by , such that , , ..., k A A A 2 1 k BYTES HROUGH AND HROUGH p k T ), p = 3 ( = 7 ( 3 = T b p ), ), n 9 2 2 3 , is at most , for all 1 , the column-span of i k £ £ n A • ORTS ) (P 1 i + 1 if is defined as of where the L R - column-span A i i i and are the rightmost and leftmost columns, re- L R i i ; and spectively, touched by A i , for all 1 , is at the number of table entries in k £ i £ • A i . most a If a solution can be found to the table-partitioning problem, then a schedule for the last round can be de-

11 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1153 Fig. 9. An example of the one-port concatenation algorithm with five processors. and in the pseudocode, left-rotations are performed instead send_and_recv takes six arguments: the outgoing message; of right-rotations. the size of the outgoing message; the destination of the out- going message; the incoming message; the size of the in- n, it is easy to achieve opti- . For the nonoptimal range of EMARK R coming message; and the source of the incoming message. at the expense of increasing C by one round over the mal C 2 1 is supported by IBM’s Mes- send_and_recv The function and subop- lower bound. It is also easy to achieve optimal C 1 sage Passing Library (MPL) [1] on SP-1 and SP-2, and the C , where b more than the lower 1 - is at most timal C 2 2 recent MPI standard [24]. It can also be implemented as a bound. combination of blocking send and nonblocking receive. In the following pseudocode, lines 3 and 4 correspond to A A PPENDIX Phase 1, lines 5 through 20 correspond to Phase 2, and lines P SEUDOCODE FOR THE INDEX ALGORITHM 21 through 23 correspond to Phase 3. In Phase 2, there are w . During each subphase, i subphases, which are indexed by This appendix presents pseudocode for the index algorithm send_and_recv each processor needs to perform the opera- k of Section 3 when = 1. This pseudocode sketches the im- 1 times, except for the last subphase, where each tion r - plementation of the index operation in the Collective send_and_recv operation only processor performs the Communication Library of the EUI [1] by IBM. In the pseu- - w 1 is takes six arguments: docode, the function index outmsg nr 1 times. Lines 7 through 11 take into account the - blklen is the length in an array for the outgoing message; special case for the last subphase. The routine pack is used bytes of each data block; inmsg is an array for the incoming to pack those blocks that need to be rotated to the same A is the is the number of processors involved; n message; intermediate destination into a consecutive array. Specifi- ; i p A different processor ids, such that, n array of the ] = [ i , blklen , B , ) packs some selected nblocks A ( pack , cally, j , i , r , n is the outmsg used to tune the algorithm. Arrays radix and r A blocks of array B ; each block is of size blklen in into array are each of length n bytes. Other routines and inmsg * blklen th digit of the radix- bytes; those blocks, for which the i r , B , A copy that appear in the code are as follows: Routine ( len ) representation of their block ids are equal to , are selected j bytes into array B A copies array of size len . Routine for packing; and value of the number of selected blocks is id . ( id , n , A ) returns the index i that satisfies A [ getrank ] = i . The routine , B , A ( unpack nblocks written to the argument x ) returns the value ( mod The routine x mod y in the y , ) is defined as the inverse function of , n , r , i , j , nblocks blklen y . The function 1, even for negative x range of 0 through -

12 1154 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 where B becomes the input array to be unpacked and pack rithm (lines 13 and 16). Finally, each processor performs a becomes the output array. A local circular shift of the data such that all data blocks in its inmsg array begin with the block B [0] (lines 17 and 18). Function index A , r ) outmsg ( blklen , inmsg , n , , , , Function concat ( outmsg , ) A , n len inmsg ̆ n (1) È = w log r ̆ n (1) d = log È 2 _ rank A , n , pid ) my ( getrank = my _ (2) , my _ rank = getrank ( my _ pid ) A , n (2) - ], n [( tmp , outmsg ( copy (3) my _ rank * ) * blklen rank _ my len (3) ( outmsg , temp , copy ) ) blklen (4) nblk = 1 _ (4) ) * copy ( outmsg [ my _ rank * blklen ], tmp , ( n - my rank (5) _ len = len current blklen ) 1 = 0 (6) d - do for r to (5) dist = 1 (7) ) n , nblk - rank _ my ( mod = rank _ dest i 1 - w do to = 0 for (6) _ n , nblk + rank my rank ( mod = (8) src _ ) then if ( (7) == w - 1) i ë (9) send_and_recv ( temp , current _ len , hndist = (8) temp ], rank _ dest [ A ], len [ _ current (9) else A _ , ]) current _ rank [ src len (10) r = h (10) * 2 nblk = nblk (11) endif 2 (11) current _ len = current _ len * = 1 do (12) for j 1 to h - (12) endfor rank _ j + dest ) * dist , n rank _ my ( mod = (13) len = len _ current (13) ( ) nblk - n * src ) n , dist * j - rank _ my ( mod = (14) rank _ _ (14) , nblk - rank n my ( mod = rank _ dest ) , r , i , j , msg , blklen , n (15) pack ( tmp , _ packed mod rank n , (15) src _ rank = nblk ( my _ + ) ) nblocks _ (16) send_and_recv ( temp , current len , _ packed ë ( send_and_recv (16) msg , blklen * rank [ ], A dest _ ], temp [ current _ len dest [ A , ], rank _ nblocks ]) len , A [ _ src _ current rank * _ msg packed , blklen len ], _ my * rank inmsg , temp ( copy (17) [ src [ A , nblocks ]) rank _ )) - len * ( n my _ rank , j i , r , n , blklen , msg _ packed , tmp ( unpack (17) , ( (18) copy , inmsg )], rank _ my - n ( * len [ temp nblocks ) * my _ rank ) len (18) endfor (19) return r * dist = dist (19) endfor (20) C A PPENDIX do 1 - n for to = 0 i (21) L P ROOF OF A EMMA rank blklen inmsg * ) n i - ], _ my ( mod [ tmp ( copy (22) , [ i * blklen ], blklen ) be integers such that m and c Let C.1. if L 2 £ c £ m. Then, EMMA (23) endfor h cm m return (24) min( ≥ then h , 2 ). c /8 log m /64, m ≥ ej Â j = 0 j P . Assume, for the sake of contradiction, that the ROOF A B PPENDIX lemma does not hold. First, note that the lemma holds P SEUDOCODE FOR THE CONCATENATION /64. Also, if h ≥ m /64, so it must be the case that h < m ALGORITHM cm m note that 12 =< 1 and ≥ h > 64. Therefore, , so m ej 0 This appendix presents pseudocode for the concatenation algorithm of Section 4 when = 1. This pseudocode k £ cm . Because h < m /64 h cm /128, the terms + 1 £ 2 m £ h sketches the implementation of the concatenation operation cm m ≥ are monotonically 2 in the summation ej Â j = j 0 in the Collective Communication Library of the EUI [1] by increasing, so concat takes five ar- IBM. In this pseudocode, the function h guments: is outmsg len is an array for the outgoing message; cm cm mh I F I F £ hcmh h 211 ! £+ £+ af bg bg the length in bytes of array outmsg inmsg ; is an array for the Â j j K K H H = j 0 is the number of processors involved; n incoming message; h +1/2 h is the array of the different processor ids, such that, A n and , so / e h Note that h ! ≥ inmsg . Array bytes. The function n * len is of length [ i ] = p A i m - ( log h £ h ) + + 1/2) log e cm h log ( h + 1) + h log ( concat sends and receives messages using the send_and_recv , copy , routine. The routines getrank + 1) log ( h log h - £ ( h e cm ) + h log . were defined in Appendix A. send_and_recv, and mod m £ /64 ), e m /(2 log Because h < In the following pseudocode, each processor first initial- izes some variables and copies its outmsg array into a tempo- cm ) + log ( m /2 £ ). h (log ( cm ) - log h temp rary array (lines 1 through 5). Then, each processor per- £ /4 m /4, it follows that Because log ( cm ) £ 2 log m £ m 1 rounds of the algorithm (lines 6 through forms the first d - > 64, so h (log ( cm ) - log h ). Let h = m / x and note that x 12). Then, each processor performs the last round of the algo-

13 BRUCK ET AL.: EFFICIENT ALGORIITHMS FOR ALL-TO-ALL COMMUNICATIONS IN MULTIPORT MESSAGE-PASSING SYSTEMS 1155 [18] C.-T. Ho and M.T. Raghunath, “Efficient Communication Primi- ( m x )(log c + log x ), which implies that x £ 4 log £ m /4 / Concurrency: Practice and Experience tives on Hypercubes,” , vol. 4, 8log £ 4 log c . Note that x ≥ - x , and x + 4 log x 4 log x c no. 6, pp. 427–458, Sept. 1992. so x £ 4 log c . Therefore, x £ 8log c and h x /2 £ x - 4 log [19] S.L Johnsson and C.-T. Ho, “Matrix Multiplication on Boolean Parallel Proc- Cubes Using Generic Communication Primitives,” = Ü m / x ≥ m /8 log c , which is a contradiction. essing and Medium-Scale Multiprocessors , A. Wouk, ed., pp. 108– 156. SIAM, 1989. [20] S.L. Johnsson and C.-T. Ho, “Spanning Graphs for Optimum A CKNOWLEDGMENTS Broadcasting and Personalized Communication in Hypercubes,” C.1. We thank Robert Cypher for his help in deriving Lemma IEEE Trans. Computers , vol. 38, no. 9, pp. 1,249–1,268, Sept. 1989. [21] S.L. Johnsson and C.-T. Ho, “Optimizing Tridiagonal Solvers for Jehoshua Bruck was supported in part by U.S. National Alternating Direction Methods on Boolean Cube Multiprocessors,” Science Foundation Young Investigator Award CCR- SIAM J. Scientific and Statistical Computing , vol. 11, no. 3, pp. 563– 9457811, by the Sloan Research Fellowship, and by DARPA 592, 1990. [22] S.L. Johnsson, C.-T. Ho, M. Jacquemin, and A. Ruttenberg, and BMDO through an agreement with NASA/OSAT. “Computing Fast Fourier Transforms on Boolean Cubes and Re- Advanced Algorithms and Architectures for Signal lated Networks,” Processing II , vol. 826, pp. 223–231. Soc. Photo-Optical Instrumen- R EFERENCES tation Engineers, 1987. [1] V. Bala, J. Bruck, R. Bryant, R. Cypher, P. deJong, P. Elustondo, D. [23] O.A. McBryan and E.F. Van de Velde, “Hypercube Algorithms Frye, A. Ho, C.-T. Ho, G. Irwin, S. Kipnis, R. Lawrence, and M. SIAM J. Scientific and Statistical Computing and Implementations,” , Snir, “The IBM External User Interface for Scalable Parallel Sys- vol. 8, no. 2, pp. 227–287, Mar. 1987. Parallel Computing , vol. 20, no. 4, pp. 445–462, Apr. 1994. tems,” MPI: A Message-Passing Inter- [24] Message Passing Interface Forum, [2] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. face Standard , May 1994. Kipnis, and M. Snir, “CCL: A Portable and Tunable Collective [25] J.F. Palmer “The NCUBE Family of Parallel Supercomputers,” IEEE Communication Library for Scalable Parallel Computers,” Proc. Int’l Conf. Computer Design , 1986. Trans. Parallel and Distributed Systems , vol. 6, no. 2, pp. 154–164, [26] F.P. Preparata and J.E. Vuillemin, “The Cube Connected Cycles: A Feb. 1995. Comm. ACM , vol. 24, Versatile Network for Parallel Computation,” [3] A. Bar-Noy and S. Kipnis, “Designing Broadcasting Algorithms in no. 5, pp. 300–309, May 1981. Mathematical Sys- the Postal Model for Message-Passing Systems,” [27] A. Skjellum and A.P. Leung, “Zipcode: A Portable Multicomputer tems Theory, vol. 27, no. 5, pp. 431-452, Sept./Oct. 1994. Proc. Fifth Communication Library Atop the Reactive Kernel,” [4] L. Bhuyan and D. Agrawal, “Generalized Hypercube and Hyper- Distributed Memory Computing Conf. , pp. 328–337, Apr. 1990. IEEE Trans. Computers bus Structures for a Computer Network,” , [28] P.N. Swarztrauber, “The Methods of Cyclic Reduction, Fourier vol. 33, no. 4, pp. 323–333, Apr. 1984. Analysis, and the FACR Algorithm for the Discrete Solution of Pois- [5] S. Bokhari, “Multiphase Complete Exchange on a Circuit- SIAM Rev. son’s Equation on a Rectangle,” , vol. 19, pp. 490–501, Proc. 1991 Int’l Conf. Parallel Processing Switched Hypercube,” , vol. I, 1977. pp. 525–528, Aug. 1991. Connection Machine CM-5 Technical Summary . Thinking Machines [29] [6] J. Bruck, R. Cypher, L. Gravano, A. Ho, C.-T. Ho, S. Kipnis, S. Corporation, 1991. Konstantinidou, M. Snir, and E. Upfal, “Survey of Routing Issues [30] L.G. Valiant, “A Bridging Model for Parallel Computation,” for the Vulcan Parallel Computer,” IBM Research Report, RJ-8839, Comm. ACM , vol. 33, no. 8, pp. 103–111, Aug. 1990. June 1992. Express 3.0 Introductory Guide [31] . Parasoft Corporation, 1990. [7] J. Bruck, R. Cypher, and C.-T. Ho, “Fault-Tolerant Meshes and IEEE Trans. Com- Hypercubes with Minimal Numbers of Spares,” puters , vol. 42, no. 9, pp. 1,089–1,104, Sept. 1993. received the BSc and MSc Jehoshua Bruck [8] C.Y. Chu, “Comparison of Two-dimensional FFT Methods on the degrees in electrical engineering from the Tech- Proc. Third Conf. Hypercube Concurrent Computers Hypercubes,” nion, Israel Institute of Technology, in 1982 and and Applications , pp. 1,430–1,437, 1988. 1985, respectively, and the PhD degree in elec- [9] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, trical engineering from Stanford University in R. Subramonian, and T. von Eicken, “LogP: Towards a Realistic 1989. Proc. Fourth SIGPLAN Symp. Prin- Model of Parallel Computation,” He is an associate professor of computation ciples and Practices Parallel Programming , ACM, May 1993. and neural systems and electrical engineering at [10] W.J. Dally, A. Chien, S. Fiske, W. Horwat, J. Keen, M. Larivee, R. the California Institute of Technology. His re- Lethin, P. Nuth, S. Wills, P. Carrick, and G. Fyler, “The J-Machine: search interests include parallel and distributed Proc. Information Processing a Fine-Grain Concurrent Computer,” computing, fault-tolerant computing, error- ‘89 , pp. 1,147–1,153, 1989. correcting codes, computation theory, and neural and biological sys- [11] B. Elspas and J. Turner, “Graphs with Circulant Adjacency Matri- tems. Dr. Bruck has extensive industrial experience including, serving J. Combinatorial Theory, no. 9, pp. 297–307, 1970. ces,” as manager of the Foundations of Massively Parallel Computing Group [12] G. Fox, M. Johnsson, G. Lyzenga, S. Otto, J. Salmon, and D. at the IBM Almaden Research Center from 1990-1994, a research staff Solving Problems on Concurrent Processors, Vol. I Walker, . Prentice member at the IBM Almaden Research Center from 1989-1990, and a Hall, 1988. researcher at the IBM Haifa Science center from 1982-1985. [13] P. Fraigniaud and E. Lazard, “Methods and Problems of Com- Dr. Bruck is the recipient of a 1995 Sloan Research Fellowship, a Discrete Applied Math. munication in Usual Networks,” , vol. 53, 1994 National Science Foundation Young Investigator Award, six IBM pp. 79–133, 1994. Plateau Invention Achievement Awards, a 1992 IBM Outstanding Inno- [14] G.A. Geist, M.T. Heath, B.W. Peyton, and P.H. Worley, “A User’s vation Award for his work on “Harmonic Analysis of Neural Networks,” Guide to PICL: A Portable Instrumented Communication Li- and a 1994 IBM Outstanding Technical Achievement Award for his con- brary,” ORNL Technical Report no. ORNL/TM-11616, Oct. 1990. tributions to the design and implementation of the SP-1, the first IBM [15] G.A. Geist and V.S. Sunderam, “Network Based Concurrent scalable parallel computer. He has published more than 120 journal and Computing on the PVM System,” ORNL Technical Report no. conference papers in his areas of interests and he holds 20 patents. Dr. ORNL/TM-11760, June 1991. Bruck is a senior member of the IEEE and a member of the editorial [16] S.M. Hedetniemi, S.T. Hedetniemi, and A.L. Liestman, “A Survey . IEEE Transactions on Parallel and Distributed Systems board of the of Gossiping and Broadcasting in Communication Networks,” Networks , vol. 18, pp. 319-349, 1988. [17] R. Hempel, “The ANL/GMD Macros (PARMACS) in FORTRAN for Portable Parallel Programming Using the Message Passing Programming Model, User’s Guide and Reference Manual,” tech- nical memorandum, Gesellschaft für Mathematik und Datenvera- beitung mbH, West Germany.

14 1156 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 11, NOVEMBER 1997 received a BS degree in electri- Ching-Tien Ho received a BSc in mathematics from the Eli Upfal cal engineering from National Taiwan University Hebrew University in 1978, an MSc in computer in 1979 and the MS, MPhil, and PhD degrees in science from the Weizmann Institute in 1980, and computer science from Yale University in 1985, a PhD in computer science from the Hebrew Uni- 1986, and 1990, respectively. versity in 1983. During 1983-1984, he was a re- He joined IBM Almaden Research Center as a search fellow at the University of California at research staff member in 1989. He was manager of Berkeley, and, in 1984-1985, a postdoctoral fellow the Foundations of Massively Parallel Computing at Stanford University. In 1985, Dr. Upfal joined group from 1994-1996, where he led the develop- the IBM Almaden Research Center, where he is ment of collective communication, as part of IBM currently a research staff member in the Founda- MPL and MPI, for IBM SP-1 and SP-2 parallel sys- tions of Computer Science Group. In 1988, he also tems. His primary research interests include communication issues for joined the Faculty of Applied Mathematics and Computer Science at the interconnection networks, algorithms for collective communications, graph Weizmann Institute, where he is currently the Norman D. Cohen Profes- embeddings, fault tolerance, and parallel algorithms and architectures. His sor of Computer Science. Dr. Upfal’s research interest include theory of current interests are data mining and on-line analytical processing. He has algorithms, randomized computing, probabilistic analysis of algorithms, published more than 80 journal and conference papers in these areas. communication networks, and parallel and distributed computing. He is a Dr. Ho is a corecipient of the 1986 “Outstanding Paper Award” of senior member of the IEEE. the International Conference on Parallel Processing. He has received an IBM Outstanding Innovation Award, two IBM Outstanding Technical is a PhD candidate in W. Derrick Weathersby Achievement Awards, and four IBM Plateau Invention Achievement the Department of Computer Science at the Awards. He has 10 patents granted or pending. He is on the editorial University of Washington, Seattle, Washington. . IEEE Transactions on Parallel and Distributed Systems board of the His current research involves compiler optimiza- He will be one of the program vice chairs for the 1998 International tions for collective communication primitives, Conference on Parallel Processing. He has served on program com- portable software support for efficient collective mittees of many parallel processing conferences and workshops. He is communication libraries, and parallel program- a member of the ACM, the IEEE, and the IEEE Computer Society. ming language design. (M’87) received a BSc in mathe- Shlomo Kipnis matics and physics in 1983 and an MSc in com- puter science in 1985, both from the Hebrew Uni- versity of Jerusalem, Israel. He received a PhD in electrical engineering and computer science in 1990 from the Massachusetts Institute of Technol- ogy. From 1990-1993, he worked as a research staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. From 1993- 1995, he worked as a research staff member at the IBM Haifa Research Laboratory in Israel. Currently, he is working as manager of new technologies at NDS Technologies Israel. In addition, since 1994, Dr. Kipnis has been an adjunct professor of com- puter science at Bar Ilan University and at Tel Aviv University. His research interests include parallel and distributed processing, efficient communica- tion structures and algorithms, and system security. Dr. Kipnis is a member of the IEEE, the IEEE Computer Society, ACM, and ILA. He has published in numerous journals and presented his work in many conferences and workshops. He is also an inventor and coinventor of two U.S. patents.