MPI: A Message Passing Interface Standard

Transcript

MPI: A Message-Passing Interface Standard
Version 3.0
Message Passing Interface Forum
September 21, 2012

This document describes the Message-Passing Interface (MPI) standard, version 3.0. The MPI standard includes point-to-point message-passing, collective communications, group and communicator concepts, process topologies, environmental management, process creation and management, one-sided communications, extended collective operations, external interfaces, I/O, some miscellaneous topics, and a profiling interface. Language bindings for C and Fortran are defined.

Historically, the evolution of the standards is from MPI-1.0 (June 1994) to MPI-1.1 (June 12, 1995) to MPI-1.2 (July 18, 1997), with several clarifications and additions and published as part of the MPI-2 document, to MPI-2.0 (July 18, 1997), with new functionality, to MPI-1.3 (May 30, 2008), combining for historical reasons the documents 1.1 and 1.2 and some errata documents to one combined document, and to MPI-2.1 (June 23, 2008), combining the previous documents. Version MPI-2.2 (September 2009) added additional clarifications and seven new routines. This version, MPI-3.0, is an extension of MPI-2.2.

Comments. Please send comments on MPI to the MPI Forum as follows:

1. Subscribe to http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-comments

2. Send your comment to: [email protected], together with the URL of the version of the MPI standard and the page and line numbers on which you are commenting. Only use the official versions.

Your comment will be forwarded to MPI Forum committee members for consideration. Messages sent from an unsubscribed e-mail address will not be considered.

© 1993, 1994, 1995, 1996, 1997, 2008, 2009, 2012 University of Tennessee, Knoxville, Tennessee. Permission to copy without fee all or part of this material is granted, provided the University of Tennessee copyright notice and the title of this document appear, and notice is given that copying is by permission of the University of Tennessee.
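As a brief illustration of the point-to-point message passing and C bindings summarized in the abstract above (an editorial sketch, not part of the standard text; it assumes a working MPI implementation and its usual compile and launch tooling, e.g. mpicc and mpiexec with at least two processes), the following minimal program sends one integer from process 0 to process 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, value = 0;

    MPI_Init(&argc, &argv);                 /* start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */

    if (rank == 0) {
        value = 42;
        /* send one int with tag 0 to rank 1 on the predefined communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matching blocking receive from rank 0; the status is not needed here */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}

The semantics of these calls are defined in the point-to-point communication chapter of the standard.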

Version 3.0: September 21, 2012. Coincident with the development of MPI-2.2, the MPI Forum began discussions of a major extension to MPI. This document contains the MPI-3 Standard. This draft version of the MPI-3 standard contains significant extensions to MPI functionality, including nonblocking collectives, new one-sided communication operations, and Fortran 2008 bindings. Unlike MPI-2.2, this standard is considered a major update to the MPI standard. As with previous versions, new features have been adopted only when there were compelling needs for the users. Some features, however, may have more than a minor impact on existing MPI implementations.

Version 2.2: September 4, 2009. This document contains mostly corrections and clarifications to the MPI-2.1 document. A few extensions have been added; however, all correct MPI-2.1 programs are correct MPI-2.2 programs. New features were adopted only when there were compelling needs for users, open source implementations, and minor impact on existing MPI implementations.

Version 2.1: June 23, 2008. This document combines the previous documents MPI-1.3 (May 30, 2008) and MPI-2.0 (July 18, 1997). Certain parts of MPI-2.0, such as some sections of Chapter 4, Miscellany, and Chapter 7, Extended Collective Operations, have been merged into the Chapters of MPI-1.3. Additional errata and clarifications collected by the MPI Forum are also included in this document.

Version 1.3: May 30, 2008. This document combines the previous documents MPI-1.1 (June 12, 1995) and the MPI-1.2 Chapter in MPI-2 (July 18, 1997). Additional errata collected by the MPI Forum referring to MPI-1.1 and MPI-1.2 are also included in this document.

Version 2.0: July 18, 1997. Beginning after the release of MPI-1.1, the MPI Forum began meeting to consider corrections and extensions. MPI-2 has been focused on process creation and management, one-sided communications, extended collective communications, external interfaces, and parallel I/O. A miscellany chapter discusses items that do not fit elsewhere, in particular language interoperability.

Version 1.2: July 18, 1997. The MPI-2 Forum introduced MPI-1.2 as Chapter 3 in the standard “MPI-2: Extensions to the Message-Passing Interface”, July 18, 1997. This section contains clarifications and minor corrections to Version 1.1 of the MPI Standard. The only new function in MPI-1.2 is one for identifying to which version of the MPI Standard the implementation conforms. There are small differences between MPI-1 and MPI-1.1. There are very few differences between MPI-1.1 and MPI-1.2, but large differences between MPI-1.2 and MPI-2.

Version 1.1: June, 1995. Beginning in March, 1995, the Message-Passing Interface Forum reconvened to correct errors and make clarifications in the MPI document of May 5, 1994, referred to below as Version 1.0. These discussions resulted in Version 1.1. The changes from Version 1.0 are minor. A version of this document with all changes marked is available.

Version 1.0: May, 1994. The Message-Passing Interface Forum (MPIF), with participation from over 40 organizations, has been meeting since January 1993 to discuss and define a set of library interface standards for message passing. MPIF is not sanctioned or supported by any official standards organization.

The goal of the Message-Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and flexible standard for message-passing.

This is the final report, Version 1.0, of the Message-Passing Interface Forum. This document contains all the technical features proposed for the interface. This copy of the draft was processed by LaTeX on May 5, 1994.

Contents

Acknowledgments

1 Introduction to MPI
1.1 Overview and Goals
1.2 Background of MPI-1.0
1.3 Background of MPI-1.1, MPI-1.2, and MPI-2.0
1.4 Background of MPI-1.3 and MPI-2.1
1.5 Background of MPI-2.2
1.6 Background of MPI-3.0
1.7 Who Should Use This Standard?
1.8 What Platforms Are Targets For Implementation?
1.9 What Is Included In The Standard?
1.10 What Is Not Included In The Standard?
1.11 Organization of this Document

2 MPI Terms and Conventions
2.1 Document Notation
2.2 Naming Conventions
2.3 Procedure Specification
2.4 Semantic Terms
2.5 Data Types
2.5.1 Opaque Objects
2.5.2 Array Arguments
2.5.3 State
2.5.4 Named Constants
2.5.5 Choice
2.5.6 Addresses
2.5.7 File Offsets
2.5.8 Counts
2.6 Language Binding
2.6.1 Deprecated and Removed Names and Functions
2.6.2 Fortran Binding Issues
2.6.3 C Binding Issues
2.6.4 Functions and Macros
2.7 Processes
2.8 Error Handling
2.9 Implementation Issues

2.9.1 Independence of Basic Runtime Routines
2.9.2 Interaction with Signals
2.10 Examples

3 Point-to-Point Communication
3.1 Introduction
3.2 Blocking Send and Receive Operations
3.2.1 Blocking Send
3.2.2 Message Data
3.2.3 Message Envelope
3.2.4 Blocking Receive
3.2.5 Return Status
3.2.6 Passing MPI_STATUS_IGNORE for Status
3.3 Data Type Matching and Data Conversion
3.3.1 Type Matching Rules
Type MPI_CHARACTER
3.3.2 Data Conversion
3.4 Communication Modes
3.5 Semantics of Point-to-Point Communication
3.6 Buffer Allocation and Usage
3.6.1 Model Implementation of Buffered Mode
3.7 Nonblocking Communication
3.7.1 Communication Request Objects
3.7.2 Communication Initiation
3.7.3 Communication Completion
3.7.4 Semantics of Nonblocking Communications
3.7.5 Multiple Completions
3.7.6 Non-destructive Test of status
3.8 Probe and Cancel
3.8.1 Probe
3.8.2 Matching Probe
3.8.3 Matched Receives
3.8.4 Cancel
3.9 Persistent Communication Requests
3.10 Send-Receive
3.11 Null Processes

4 Datatypes
4.1 Derived Datatypes
4.1.1 Type Constructors with Explicit Addresses
4.1.2 Datatype Constructors
4.1.3 Subarray Datatype Constructor
4.1.4 Distributed Array Datatype Constructor
4.1.5 Address and Size Functions
4.1.6 Lower-Bound and Upper-Bound Markers
4.1.7 Extent and Bounds of Datatypes
4.1.8 True Extent of Datatypes
4.1.9 Commit and Free

4.1.10 Duplicating a Datatype
4.1.11 Use of General Datatypes in Communication
4.1.12 Correct Use of Addresses
4.1.13 Decoding a Datatype
4.1.14 Examples
4.2 Pack and Unpack
4.3 Canonical MPI_PACK and MPI_UNPACK

5 Collective Communication
5.1 Introduction and Overview
5.2 Communicator Argument
5.2.1 Specifics for Intracommunicator Collective Operations
5.2.2 Applying Collective Operations to Intercommunicators
5.2.3 Specifics for Intercommunicator Collective Operations
5.3 Barrier Synchronization
5.4 Broadcast
5.4.1 Example using MPI_BCAST
5.5 Gather
5.5.1 Examples using MPI_GATHER, MPI_GATHERV
5.6 Scatter
5.6.1 Examples using MPI_SCATTER, MPI_SCATTERV
5.7 Gather-to-all
5.7.1 Example using MPI_ALLGATHER
5.8 All-to-All Scatter/Gather
5.9 Global Reduction Operations
5.9.1 Reduce
5.9.2 Predefined Reduction Operations
5.9.3 Signed Characters and Reductions
5.9.4 MINLOC and MAXLOC
5.9.5 User-Defined Reduction Operations
Example of User-defined Reduce
5.9.6 All-Reduce
5.9.7 Process-Local Reduction
5.10 Reduce-Scatter
5.10.1 MPI_REDUCE_SCATTER_BLOCK
5.10.2 MPI_REDUCE_SCATTER
5.11 Scan
5.11.1 Inclusive Scan
5.11.2 Exclusive Scan
5.11.3 Example using MPI_SCAN
5.12 Nonblocking Collective Operations
5.12.1 Nonblocking Barrier Synchronization
5.12.2 Nonblocking Broadcast
Example using MPI_IBCAST
5.12.3 Nonblocking Gather
5.12.4 Nonblocking Scatter
5.12.5 Nonblocking Gather-to-all
5.12.6 Nonblocking All-to-All Scatter/Gather

5.12.7 Nonblocking Reduce
5.12.8 Nonblocking All-Reduce
5.12.9 Nonblocking Reduce-Scatter with Equal Blocks
5.12.10 Nonblocking Reduce-Scatter
5.12.11 Nonblocking Inclusive Scan
5.12.12 Nonblocking Exclusive Scan
5.13 Correctness

6 Groups, Contexts, Communicators, and Caching
6.1 Introduction
6.1.1 Features Needed to Support Libraries
6.1.2 MPI's Support for Libraries
6.2 Basic Concepts
6.2.1 Groups
6.2.2 Contexts
6.2.3 Intra-Communicators
6.2.4 Predefined Intra-Communicators
6.3 Group Management
6.3.1 Group Accessors
6.3.2 Group Constructors
6.3.3 Group Destructors
6.4 Communicator Management
6.4.1 Communicator Accessors
6.4.2 Communicator Constructors
6.4.3 Communicator Destructors
6.4.4 Communicator Info
6.5 Motivating Examples
6.5.1 Current Practice #1
6.5.2 Current Practice #2
6.5.3 (Approximate) Current Practice #3
6.5.4 Example #4
6.5.5 Library Example #1
6.5.6 Library Example #2
6.6 Inter-Communication
6.6.1 Inter-communicator Accessors
6.6.2 Inter-communicator Operations
6.6.3 Inter-Communication Examples
Example 1: Three-Group “Pipeline”
Example 2: Three-Group “Ring”
6.7 Caching
6.7.1 Functionality
6.7.2 Communicators
6.7.3 Windows
6.7.4 Datatypes
6.7.5 Error Class for Invalid Keyval
6.7.6 Attributes Example
6.8 Naming Objects
6.9 Formalizing the Loosely Synchronous Model

6.9.1 Basic Statements
6.9.2 Models of Execution
Static Communicator Allocation
Dynamic Communicator Allocation
The General Case

7 Process Topologies
7.1 Introduction
7.2 Virtual Topologies
7.3 Embedding in MPI
7.4 Overview of the Functions
7.5 Topology Constructors
7.5.1 Cartesian Constructor
7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE
7.5.3 Graph Constructor
7.5.4 Distributed Graph Constructor
7.5.5 Topology Inquiry Functions
7.5.6 Cartesian Shift Coordinates
7.5.7 Partitioning of Cartesian Structures
7.5.8 Low-Level Topology Functions
7.6 Neighborhood Collective Communication
7.6.1 Neighborhood Gather
7.6.2 Neighbor Alltoall
7.7 Nonblocking Neighborhood Communication
7.7.1 Nonblocking Neighborhood Gather
7.7.2 Nonblocking Neighborhood Alltoall
7.8 An Application Example

8 MPI Environmental Management
8.1 Implementation Information
8.1.1 Version Inquiries
8.1.2 Environmental Inquiries
Tag Values
Host Rank
IO Rank
Clock Synchronization
Inquire Processor Name
8.2 Memory Allocation
8.3 Error Handling
8.3.1 Error Handlers for Communicators
8.3.2 Error Handlers for Windows
8.3.3 Error Handlers for Files
8.3.4 Freeing Errorhandlers and Retrieving Error Strings
8.4 Error Codes and Classes
8.5 Error Classes, Error Codes, and Error Handlers
8.6 Timers and Synchronization
8.7 Startup
8.7.1 Allowing User Functions at Process Termination

8.7.2 Determining Whether MPI Has Finished
8.8 Portable MPI Process Startup

9 The Info Object

10 Process Creation and Management
10.1 Introduction
10.2 The Dynamic Process Model
10.2.1 Starting Processes
10.2.2 The Runtime Environment
10.3 Process Manager Interface
10.3.1 Processes in MPI
10.3.2 Starting Processes and Establishing Communication
10.3.3 Starting Multiple Executables and Establishing Communication
10.3.4 Reserved Keys
10.3.5 Spawn Example
Manager-worker Example Using MPI_COMM_SPAWN
10.4 Establishing Communication
10.4.1 Names, Addresses, Ports, and All That
10.4.2 Server Routines
10.4.3 Client Routines
10.4.4 Name Publishing
10.4.5 Reserved Key Values
10.4.6 Client/Server Examples
Simplest Example — Completely Portable.
Ocean/Atmosphere — Relies on Name Publishing
Simple Client-Server Example
10.5 Other Functionality
10.5.1 Universe Size
10.5.2 Singleton MPI_INIT
10.5.3 MPI_APPNUM
10.5.4 Releasing Connections
10.5.5 Another Way to Establish MPI Communication

11 One-Sided Communications
11.1 Introduction
11.2 Initialization
11.2.1 Window Creation
11.2.2 Window That Allocates Memory
11.2.3 Window That Allocates Shared Memory
11.2.4 Window of Dynamically Attached Memory
11.2.5 Window Destruction
11.2.6 Window Attributes
11.2.7 Window Info
11.3 Communication Calls
11.3.1 Put
11.3.2 Get
11.3.3 Examples for Communication Calls

11.3.4 Accumulate Functions
Accumulate Function
Get Accumulate Function
Fetch and Op Function
Compare and Swap Function
11.3.5 Request-based RMA Communication Operations
11.4 Memory Model
11.5 Synchronization Calls
11.5.1 Fence
11.5.2 General Active Target Synchronization
11.5.3 Lock
11.5.4 Flush and Sync
11.5.5 Assertions
11.5.6 Miscellaneous Clarifications
11.6 Error Handling
11.6.1 Error Handlers
11.6.2 Error Classes
11.7 Semantics and Correctness
11.7.1 Atomicity
11.7.2 Ordering
11.7.3 Progress
11.7.4 Registers and Compiler Optimizations
11.8 Examples

12 External Interfaces
12.1 Introduction
12.2 Generalized Requests
12.2.1 Examples
12.3 Associating Information with Status
12.4 MPI and Threads
12.4.1 General
12.4.2 Clarifications
12.4.3 Initialization

13 I/O
13.1 Introduction
13.1.1 Definitions
13.2 File Manipulation
13.2.1 Opening a File
13.2.2 Closing a File
13.2.3 Deleting a File
13.2.4 Resizing a File
13.2.5 Preallocating Space for a File
13.2.6 Querying the Size of a File
13.2.7 Querying File Parameters
13.2.8 File Info
Reserved File Hints
13.3 File Views

13.4 Data Access
13.4.1 Data Access Routines
Positioning
Synchronism
Coordination
Data Access Conventions
13.4.2 Data Access with Explicit Offsets
13.4.3 Data Access with Individual File Pointers
13.4.4 Data Access with Shared File Pointers
Noncollective Operations
Collective Operations
Seek
13.4.5 Split Collective Data Access Routines
13.5 File Interoperability
13.5.1 Datatypes for File Interoperability
13.5.2 External Data Representation: “external32”
13.5.3 User-Defined Data Representations
Extent Callback
Datarep Conversion Functions
13.5.4 Matching Data Representations
13.6 Consistency and Semantics
13.6.1 File Consistency
13.6.2 Random Access vs. Sequential Files
13.6.3 Progress
13.6.4 Collective File Operations
13.6.5 Type Matching
13.6.6 Miscellaneous Clarifications
13.6.7 MPI_Offset Type
13.6.8 Logical vs. Physical File Layout
13.6.9 File Size
13.6.10 Examples
Asynchronous I/O
13.7 I/O Error Handling
13.8 I/O Error Classes
13.9 Examples
13.9.1 Double Buffering with Split Collective I/O
13.9.2 Subarray Filetype Constructor

14 Tool Support
14.1 Introduction
14.2 Profiling Interface
14.2.1 Requirements
14.2.2 Discussion
14.2.3 Logic of the Design
14.2.4 Miscellaneous Control of Profiling
14.2.5 Profiler Implementation Example
14.2.6 MPI Library Implementation Example
Systems with Weak Symbols

Systems Without Weak Symbols
14.2.7 Complications
Multiple Counting
Linker Oddities
Fortran Support Methods
14.2.8 Multiple Levels of Interception
14.3 The MPI Tool Information Interface
14.3.1 Verbosity Levels
14.3.2 Binding MPI Tool Information Interface Variables to MPI Objects
14.3.3 Convention for Returning Strings
14.3.4 Initialization and Finalization
14.3.5 Datatype System
14.3.6 Control Variables
Control Variable Query Functions
Example: Printing All Control Variables
Handle Allocation and Deallocation
Control Variable Access Functions
Example: Reading the Value of a Control Variable
14.3.7 Performance Variables
Performance Variable Classes
Performance Variable Query Functions
Performance Experiment Sessions
Handle Allocation and Deallocation
Starting and Stopping of Performance Variables
Performance Variable Access Functions
Example: Tool to Detect Receives with Long Unexpected Message Queues
14.3.8 Variable Categorization
14.3.9 Return Codes for the MPI Tool Information Interface
14.3.10 Profiling Interface

15 Deprecated Functions
15.1 Deprecated since MPI-2.0
15.2 Deprecated since MPI-2.2

16 Removed Interfaces
16.1 Removed MPI-1 Bindings
16.1.1 Overview
16.1.2 Removed MPI-1 Functions
16.1.3 Removed MPI-1 Datatypes
16.1.4 Removed MPI-1 Constants
16.1.5 Removed MPI-1 Callback Prototypes
16.2 C++ Bindings

17 Language Bindings
17.1 Fortran Support
17.1.1 Overview
17.1.2 Fortran Support Through the mpi_f08 Module

17.1.3 Fortran Support Through the mpi Module
17.1.4 Fortran Support Through the mpif.h Include File
17.1.5 Interface Specifications, Linker Names and the Profiling Interface
17.1.6 MPI for Different Fortran Standard Versions
17.1.7 Requirements on Fortran Compilers
17.1.8 Additional Support for Fortran Register-Memory-Synchronization
17.1.9 Additional Support for Fortran Numeric Intrinsic Types
Parameterized Datatypes with Specified Precision and Exponent Range
Support for Size-specific MPI Datatypes
Communication With Size-specific Types
17.1.10 Problems With Fortran Bindings for MPI
17.1.11 Problems Due to Strong Typing
17.1.12 Problems Due to Data Copying and Sequence Association with Subscript Triplets
17.1.13 Problems Due to Data Copying and Sequence Association with Vector Subscripts
17.1.14 Special Constants
17.1.15 Fortran Derived Types
17.1.16 Optimization Problems, an Overview
17.1.17 Problems with Code Movement and Register Optimization
Nonblocking Operations
One-sided Communication
MPI_BOTTOM and Combining Independent Variables in Datatypes
Solutions
The Fortran ASYNCHRONOUS Attribute
Calling MPI_F_SYNC_REG
A User Defined Routine Instead of MPI_F_SYNC_REG
Module Variables and COMMON Blocks
The (Poorly Performing) Fortran VOLATILE Attribute
The Fortran TARGET Attribute
17.1.18 Temporary Data Movement and Temporary Memory Modification
17.1.19 Permanent Data Movement
17.1.20 Comparison with C
17.2 Language Interoperability
17.2.1 Introduction
17.2.2 Assumptions
17.2.3 Initialization
17.2.4 Transfer of Handles
17.2.5 Status
17.2.6 MPI Opaque Objects
Datatypes
Callback Functions
Error Handlers
Reduce Operations
17.2.7 Attributes
17.2.8 Extra-State
17.2.9 Constants
17.2.10 Interlanguage Communication

A Language Bindings Summary
A.1 Defined Values and Handles
A.1.1 Defined Constants
A.1.2 Types
A.1.3 Prototype Definitions
C Bindings
Fortran 2008 Bindings with the mpi_f08 Module
Fortran Bindings with mpif.h or the mpi Module
A.1.4 Deprecated Prototype Definitions
A.1.5 Info Keys
A.1.6 Info Values
A.2 C Bindings
A.2.1 Point-to-Point Communication C Bindings
A.2.2 Datatypes C Bindings
A.2.3 Collective Communication C Bindings
A.2.4 Groups, Contexts, Communicators, and Caching C Bindings
A.2.5 Process Topologies C Bindings
A.2.6 MPI Environmental Management C Bindings
A.2.7 The Info Object C Bindings
A.2.8 Process Creation and Management C Bindings
A.2.9 One-Sided Communications C Bindings
A.2.10 External Interfaces C Bindings
A.2.11 I/O C Bindings
A.2.12 Language Bindings C Bindings
A.2.13 Tools / Profiling Interface C Bindings
A.2.14 Tools / MPI Tool Information Interface C Bindings
A.2.15 Deprecated C Bindings
A.3 Fortran 2008 Bindings with the mpi_f08 Module
A.3.1 Point-to-Point Communication Fortran 2008 Bindings
A.3.2 Datatypes Fortran 2008 Bindings
A.3.3 Collective Communication Fortran 2008 Bindings
A.3.4 Groups, Contexts, Communicators, and Caching Fortran 2008 Bindings
A.3.5 Process Topologies Fortran 2008 Bindings
A.3.6 MPI Environmental Management Fortran 2008 Bindings
A.3.7 The Info Object Fortran 2008 Bindings
A.3.8 Process Creation and Management Fortran 2008 Bindings
A.3.9 One-Sided Communications Fortran 2008 Bindings
A.3.10 External Interfaces Fortran 2008 Bindings
A.3.11 I/O Fortran 2008 Bindings
A.3.12 Language Bindings Fortran 2008 Bindings
A.3.13 Tools / Profiling Interface Fortran 2008 Bindings
A.4 Fortran Bindings with mpif.h or the mpi Module
A.4.1 Point-to-Point Communication Fortran Bindings
A.4.2 Datatypes Fortran Bindings
A.4.3 Collective Communication Fortran Bindings
A.4.4 Groups, Contexts, Communicators, and Caching Fortran Bindings
A.4.5 Process Topologies Fortran Bindings
A.4.6 MPI Environmental Management Fortran Bindings

A.4.7 The Info Object Fortran Bindings
A.4.8 Process Creation and Management Fortran Bindings
A.4.9 One-Sided Communications Fortran Bindings
A.4.10 External Interfaces Fortran Bindings
A.4.11 I/O Fortran Bindings
A.4.12 Language Bindings Fortran Bindings
A.4.13 Tools / Profiling Interface Fortran Bindings
A.4.14 Deprecated Fortran Bindings

B Change-Log
B.1 Changes from Version 2.2 to Version 3.0
B.1.1 Fixes to Errata in Previous Versions of MPI
B.1.2 Changes in MPI-3.0
B.2 Changes from Version 2.1 to Version 2.2
B.3 Changes from Version 2.0 to Version 2.1

Bibliography

Examples Index

MPI Constant and Predefined Handle Index

MPI Declarations Index

MPI Callback Function Prototype Index

MPI Function Index

List of Figures

5.1 Collective communications, an overview
5.2 Intercommunicator allgather
5.3 Intercommunicator reduce-scatter
5.4 Gather example
5.5 Gatherv example with strides
5.6 Gatherv example, 2-dimensional
5.7 Gatherv example, 2-dimensional, subarrays with different sizes
5.8 Gatherv example, 2-dimensional, subarrays with different sizes and strides
5.9 Scatter example
5.10 Scatterv example with strides
5.11 Scatterv example with different strides and counts
5.12 Race conditions with point-to-point and collective communications
5.13 Overlapping Communicators Example
6.1 Intercommunicator creation using MPI_COMM_CREATE
6.2 Intercommunicator construction with MPI_COMM_SPLIT
6.3 Three-group pipeline
6.4 Three-group ring
7.1 Set-up of process structure for two-dimensional parallel Poisson solver.
7.2 Communication routine with local data copying and sparse neighborhood all-to-all.
7.3 Communication routine with sparse neighborhood all-to-all-w and without local data copying.
11.1 Schematic description of the public/private window operations in the MPI_WIN_SEPARATE memory model for two overlapping windows.
11.2 Active target communication
11.3 Active target communication, with weak synchronization
11.4 Passive target communication
11.5 Active target communication with several processes
11.6 Symmetric communication
11.7 Deadlock situation
11.8 No deadlock
13.1 Etypes and filetypes
13.2 Partitioning a file among parallel processes
13.3 Displacements
13.4 Example array file layout

13.5 Example local array filetype for process 1
17.1 Status conversion routines

List of Tables

2.1 Deprecated and Removed constructs
3.1 Predefined MPI datatypes corresponding to Fortran datatypes
3.2 Predefined MPI datatypes corresponding to C datatypes
3.3 Predefined MPI datatypes corresponding to both C and Fortran datatypes
3.4 Predefined MPI datatypes corresponding to C++ datatypes
4.1 combiner values returned from MPI_TYPE_GET_ENVELOPE
6.1 MPI_COMM_* Function Behavior (in Inter-Communication Mode)
8.1 Error classes (Part 1)
8.2 Error classes (Part 2)
11.1 C types of attribute value argument to MPI_WIN_GET_ATTR and MPI_WIN_SET_ATTR
11.2 Error classes in one-sided communication routines
13.1 Data access routines
13.2 “external32” sizes of predefined datatypes
13.3 I/O Error Classes
14.1 MPI tool information interface verbosity levels
14.2 Constants to identify associations of variables
14.3 MPI datatypes that can be used by the MPI tool information interface
14.4 Scopes for control variables
14.5 Return codes used in functions of the MPI tool information interface
16.1 Removed MPI-1 functions and their replacements
16.2 Removed MPI-1 datatypes and their replacements
16.3 Removed MPI-1 constants
16.4 Removed MPI-1 callback prototypes and their replacements
17.1 Occurrence of Fortran optimization problems

Acknowledgments

This document is the product of a number of distinct efforts in three distinct phases: one for each of MPI-1, MPI-2, and MPI-3. This section describes these in historical order, starting with MPI-1. Some efforts, particularly parts of MPI-2, had distinct groups of individuals associated with them, and these efforts are detailed separately.

This document represents the work of many people who have served on the MPI Forum. The meetings have been attended by dozens of people from many parts of the world. It is the hard and dedicated work of this group that has led to the MPI standard.

The technical development was carried out by subgroups, whose work was reviewed by the full committee. During the period of development of the Message-Passing Interface (MPI), many people helped with this effort.

Those who served as primary coordinators in MPI-1.0 and MPI-1.1 are:

• Jack Dongarra, David Walker, Conveners and Meeting Chairs
• Ewing Lusk, Bob Knighten, Minutes
• Marc Snir, William Gropp, Ewing Lusk, Point-to-Point Communication
• Al Geist, Marc Snir, Steve Otto, Collective Communication
• Steve Otto, Editor
• Rolf Hempel, Process Topologies
• Ewing Lusk, Language Binding
• William Gropp, Environmental Management
• James Cownie, Profiling
• Tony Skjellum, Lyndon Clarke, Marc Snir, Richard Littlefield, Mark Sears, Groups, Contexts, and Communicators
• Steven Huss-Lederman, Initial Implementation Subset

The following list includes some of the active participants in the MPI-1.0 and MPI-1.1 process not mentioned above.

Ed Anderson, Robert Babb, Joe Baron, Eric Barszcz
Anne Elster, Rob Bjornson, Scott Berryman, Nathan Doss
Sam Fineberg, Jon Flower, Jim Feeney, Vince Fernando
Ian Glendinning, Daniel Frye, Adam Greenberg, Robert Harrison
Leslie Hart, Tom Haupt, Tom Henderson, Don Heller
John Kapenga, Gary Howell, C.T. Howard Ho, Alex Ho
Bob Leary, James Kohl, Susan Krauss, Arthur Maccabe
Phil McKinley, Peter Madams, Alan Mainwaring, Oliver McBryan
Howard Palmer, Peter Pacheco, Dan Nessett, Charles Mosher
Arch Robison, Sanjay Ranka, Paul Pierce, Peter Rigsbee
Alan Sussman, Erich Schikuta, Robert Tomlinson, Ambuj Singh
Steve Zenith, Stephen Wheat, Robert G. Voigt, Dennis Weeks

The University of Tennessee and Oak Ridge National Laboratory made the draft available by anonymous FTP mail servers and were instrumental in distributing the document.

The work on the MPI-1 standard was supported in part by ARPA and NSF under grant ASC-9310330, the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615, and by the Commission of the European Community through Esprit project P6643 (PPPE).

MPI-1.2 and MPI-2.0:

Those who served as primary coordinators in MPI-1.2 and MPI-2.0 are:

• Ewing Lusk, Convener and Meeting Chair
• Steve Huss-Lederman, Editor
• Ewing Lusk, Miscellany
• Bill Saphir, Process Creation and Management
• Marc Snir, One-Sided Communications
• Bill Gropp and Anthony Skjellum, Extended Collective Operations
• Steve Huss-Lederman, External Interfaces
• Bill Nitzberg, I/O
• Andrew Lumsdaine, Bill Saphir, and Jeff Squyres, Language Bindings
• Anthony Skjellum and Arkady Kanevsky, Real-Time

The following list includes some of the active participants who attended MPI-2 Forum meetings and are not mentioned above.

Greg Astfalk, Rajesh Bordawekar, Robert Babb, Ed Benson, Ron Brightwell, Maciej Brodowicz, Pete Bradley, Peter Brennan, Pang Chen, Margaret Cahir, Greg Burns, Eric Brunner, Joel Clark, Albert Cheng, Ying Chen, Yong Cho, Dennis Cottel, Jim Cownie, Laurie Costello, Lyndon Clarke, Zhenqian Cui, Suresh Damodaran-Kamal, Raja Daoud, Doug Doefler, David DiNucci, Judith Devaney, Jack Dongarra, Terry Dontje, Nathan Doss, Anne Elster, Mark Fallon, Karl Feind, Sam Fineberg, Craig Fischberg, Stephen Fleischman, Richard Frost, Hubertus Franke, Ian Foster, Al Geist, David Greenberg, Robert George, Kei Harada, John Hagedorn, Rolf Hempel, Leslie Hart, Shane Hebert, Tom Henderson, Alex Ho, Hans-Christian Hoppe, Joefon Jann, Terry Jones, Susan Kraus, Karl Kesselman, Koichi Konishi, Steve Kubica, Steve Landherr, Mario Lauria, Mark Law, Juan Leon, Bob Madahar, Peter Madams, Lloyd Lewins, Ziyang Lu, Oliver McBryan, Tyce McLarty, John May, Brian McCandless, Jarek Nieplocha, Thom McMahon, Harish Nag, Nick Nevin, Steve Otto, Peter Pacheco, Peter Ossadnik, Ron Oldfield, Perry Partow, Yoonho Park, Pratap Pattnaik, Elsie Pierce, Paul Pierce, Heidi Poxon, Boris Protopopov, Jean-Pierre Prost, James Pruyve, Peter Rigsbee, Joe Rieken, Rolf Rabenseifner, Nobutoshi Sagawa, Arindam Saha, Tom Robey, Anna Rounbehler, Eric Salo, Darren Sanders, Eric Sharakan, Andrew Sherman, Fred Shirley, Lance Shuler, A. Gordon Smith, Ian Stockdale, Stephen Taylor, Greg Tensa, Rajeev Thakur, David Taylor, Marydell Tholburn, Dick Treumann, Simon Tsang, Manuel Ujaldon, David Walker, Jerrell Watts, Klaus Wolf, Parkson Wong, Dave Wright

The MPI Forum also acknowledges and appreciates the valuable input from people via e-mail and in person.

The following institutions supported the MPI-2 effort through time and travel support for the people listed above.

Argonne National Laboratory
Bolt, Beranek, and Newman
California Institute of Technology
Center for Computing Sciences
Convex Computer Corporation
Cray Research
Digital Equipment Corporation
Dolphin Interconnect Solutions, Inc.
Edinburgh Parallel Computing Centre
General Electric Company
German National Research Center for Information Technology
Hewlett-Packard
Hitachi

Hughes Aircraft Company
Intel Corporation
International Business Machines
Khoral Research
Lawrence Livermore National Laboratory
Los Alamos National Laboratory
MPI Software Technology, Inc.
Mississippi State University
NEC Corporation
National Aeronautics and Space Administration
National Energy Research Scientific Computing Center
National Institute of Standards and Technology
National Oceanic and Atmospheric Administration
Oak Ridge National Laboratory
Ohio State University
PALLAS GmbH
Pacific Northwest National Laboratory
Pratt & Whitney
San Diego Supercomputer Center
Sanders, A Lockheed-Martin Company
Sandia National Laboratories
Schlumberger
Scientific Computing Associates, Inc.
Silicon Graphics Incorporated
Sky Computers
Sun Microsystems Computer Corporation
Syracuse University
The MITRE Corporation
Thinking Machines Corporation
United States Navy
University of Colorado
University of Denver
University of Houston
University of Illinois
University of Maryland
University of Notre Dame
University of San Francisco
University of Stuttgart Computing Center
University of Wisconsin

MPI-2 operated on a very tight budget (in reality, it had no budget when the first meeting was announced). Many institutions helped the MPI-2 effort by supporting the efforts and travel of the members of the MPI Forum. Direct support was given by NSF and DARPA under NSF contract CDA-9115428 for travel by U.S. academic participants and by Esprit under project HPC Standards (21111) for European participants.

MPI-1.3 and MPI-2.1:

The editors and organizers of the combined documents have been:

• Richard Graham, Convener and Meeting Chair
• Jack Dongarra, Steering Committee
• Al Geist, Steering Committee
• Bill Gropp, Steering Committee
• Rainer Keller, Merge of MPI-1.3
• Andrew Lumsdaine, Steering Committee
• Ewing Lusk, Steering Committee, MPI-1.1-Errata (Oct. 12, 1998), MPI-2.1-Errata Ballots 1, 2 (May 15, 2002)
• Rolf Rabenseifner, Steering Committee, Merge of MPI-2.1 and MPI-2.1-Errata Ballots 3, 4 (2008)

All chapters have been revisited to achieve a consistent MPI-2.1 text. Those who served as authors for the necessary modifications are:

• Bill Gropp, Front matter, Introduction, and Bibliography
• Richard Graham, Point-to-Point Communication
• Adam Moody, Collective Communication
• Richard Treumann, Groups, Contexts, and Communicators
• Jesper Larsson Träff, Process Topologies, Info-Object, and One-Sided Communications
• George Bosilca, Environmental Management
• David Solt, Process Creation and Management
• Bronis R. de Supinski, External Interfaces and Profiling
• Rajeev Thakur, I/O
• Jeffrey M. Squyres, Language Bindings and MPI-2.1 Secretary
• Rolf Rabenseifner, Deprecated Functions and Annex Change-Log
• Alexander Supalov and Denis Nagorny, Annex Language Bindings

The following list includes some of the active participants who attended MPI-2 Forum meetings or participated in the e-mail discussions of the errata items and who are not mentioned above.

Pavan Balaji, Purushotham V. Bangalore, Brian Barrett, Richard Barrett, Robert Blackmore, Christian Bell, Gil Bloch, Ron Brightwell, Jeffrey Brown, Jonathan Carter, Nathan DeBardeleben, Darius Buntinas, Edric Ellis, Gabor Dozsa, Terry Dontje, Edgar Gabriel, Patrick Geoffray, Karl Feind, Erez Haba, Dave Goodell, David Gingold, Robert Harrison, Steve Hodson, Thomas Herault, Yann Kalemkarian, Joshua Hursey, Torsten Hoefler, Quincey Koziol, Sameer Kumar, Matthew Koop, Mark Pagel, Kannan Narasimhan, Miron Livny, Howard Pritchard, Steve Poole, Avneesh Pant, Craig Rasmussen, Hubert Ritzdorf, Rob Ross, Tony Skjellum, Brian Smith, Vinod Tipparaju, Jesper Larsson Träff, Keith Underwood

The MPI Forum also acknowledges and appreciates the valuable input from people via e-mail and in person.

The following institutions supported the MPI-2.1 effort through time and travel support for the people listed above.

Argonne National Laboratory
Bull
Cisco Systems, Inc.
Cray Inc.
The HDF Group
Hewlett-Packard
IBM T.J. Watson Research
Indiana University
Institut National de Recherche en Informatique et Automatique (INRIA)
Intel Corporation
Lawrence Berkeley National Laboratory
Lawrence Livermore National Laboratory
Los Alamos National Laboratory
Mathworks
Mellanox Technologies
Microsoft
Myricom
NEC Laboratories Europe, NEC Europe Ltd.
Oak Ridge National Laboratory
Ohio State University
Pacific Northwest National Laboratory
QLogic Corporation
Sandia National Laboratories
SiCortex
Silicon Graphics Incorporated
Sun Microsystems, Inc.
University of Alabama at Birmingham

University of Houston
University of Illinois at Urbana-Champaign
University of Stuttgart, High Performance Computing Center Stuttgart (HLRS)
University of Tennessee, Knoxville
University of Wisconsin

Funding for the MPI Forum meetings was partially supported by award #CCF-0816909 from the National Science Foundation. In addition, the HDF Group provided travel support for one U.S. academic.

MPI-2.2:

All chapters have been revisited to achieve a consistent MPI-2.2 text. Those who served as authors for the necessary modifications are:

• William Gropp, Front matter, Introduction, and Bibliography; MPI-2.2 chair
• Richard Graham, Point-to-Point Communication and Datatypes
• Adam Moody, Collective Communication
• Torsten Hoefler, Collective Communication and Process Topologies
• Richard Treumann, Groups, Contexts, and Communicators
• Jesper Larsson Träff, Process Topologies, Info-Object, and One-Sided Communications
• George Bosilca, Datatypes and Environmental Management
• David Solt, Process Creation and Management
• Bronis R. de Supinski, External Interfaces and Profiling
• Rajeev Thakur, I/O
• Jeffrey M. Squyres, Language Bindings and MPI-2.2 Secretary
• Rolf Rabenseifner, Deprecated Functions, Annex Change-Log, and Annex Language Bindings
• Alexander Supalov, Annex Language Bindings

The following list includes some of the active participants who attended MPI-2 Forum meetings or participated in the e-mail discussions of the errata items and who are not mentioned above.

Pavan Balaji, Purushotham V. Bangalore, Brian Barrett, Robert Blackmore, Richard Barrett, Christian Bell, Gil Bloch, Ron Brightwell, Greg Bronevetsky, Darius Buntinas, Jonathan Carter, Jeff Brown, Gabor Dozsa, Terry Dontje, Nathan DeBardeleben, Edgar Gabriel, Karl Feind, Edric Ellis, Johann George, Patrick Geoffray, David Gingold, Erez Haba, Robert Harrison, David Goodell, Marc-André Hermanns, Steve Hodson, Thomas Herault, Bin Jia, Joshua Hursey, Yutaka Ishikawa, Yann Kalemkarian, Terry Jones, Hideyuki Jitsumoto, Matthew Koop, Rainer Keller, Quincey Koziol, Miron Livny, Manojkumar Krishnan, Sameer Kumar, Ewing Lusk, Miao Luo, Andrew Lumsdaine, Kannan Narasimhan, Timothy I. Mattox, Mark Pagel, Avneesh Pant, Steve Poole, Howard Pritchard, Rob Ross, Hubert Ritzdorf, Craig Rasmussen, Pavel Shamis, Galen Shipman, Martin Schulz, Brian Smith, Anthony Skjellum, Christian Siebert, Keith Underwood, Naoki Sueyasu, Vinod Tipparaju, Abhinav Vishnu, Rolf Vandevaart, Weikuan Yu

The MPI Forum also acknowledges and appreciates the valuable input from people via e-mail and in person.

The following institutions supported the MPI-2.2 effort through time and travel support for the people listed above.

Argonne National Laboratory
Auburn University
Bull
Cisco Systems, Inc.
Cray Inc.
Forschungszentrum Jülich
Fujitsu
The HDF Group
Hewlett-Packard
International Business Machines
Indiana University
Institut National de Recherche en Informatique et Automatique (INRIA)
Institute for Advanced Science & Engineering Corporation
Intel Corporation
Lawrence Berkeley National Laboratory
Lawrence Livermore National Laboratory
Los Alamos National Laboratory
Mathworks
Mellanox Technologies
Microsoft
Myricom

NEC Corporation
Oak Ridge National Laboratory
Ohio State University
Pacific Northwest National Laboratory
QLogic Corporation
RunTime Computing Solutions, LLC
Sandia National Laboratories
SiCortex, Inc.
Silicon Graphics Inc.
Sun Microsystems, Inc.
Tokyo Institute of Technology
University of Alabama at Birmingham
University of Houston
University of Illinois at Urbana-Champaign
University of Stuttgart, High Performance Computing Center Stuttgart (HLRS)
University of Tennessee, Knoxville
University of Tokyo
University of Wisconsin

Funding for the MPI Forum meetings was partially supported by awards #CCF-0816909 and #CCF-1144042 from the National Science Foundation. In addition, the HDF Group provided travel support for one U.S. academic.

MPI-3:

MPI-3 is a significant effort to extend and modernize the MPI Standard.

The editors and organizers of MPI-3 have been:

• William Gropp, Steering committee, Front matter, Introduction, Groups, Contexts, and Communicators, One-Sided Communications, and Bibliography
• Richard Graham, Steering committee, Point-to-Point Communication, Meeting Convener, and MPI-3 chair
• Torsten Hoefler, Collective Communication, One-Sided Communications, and Process Topologies
• George Bosilca, Datatypes and Environmental Management
• David Solt, Process Creation and Management
• Bronis R. de Supinski, External Interfaces and Tool Support
• Rajeev Thakur, I/O and One-Sided Communications
• Darius Buntinas, Info Object
• Jeffrey M. Squyres, Language Bindings and MPI-3.0 Secretary
• Rolf Rabenseifner, Steering committee, Terms and Definitions, Fortran Bindings, Deprecated Functions, Annex Change-Log, and Annex Language Bindings

• Craig Rasmussen, Fortran Bindings

The following list includes some of the active participants who attended MPI-3 Forum meetings or participated in the e-mail discussions and who are not mentioned above.

Tomoya Adachi, Sadaf Alam, Tatsuya Abe, Pavan Balaji, Reinhold Bader, Purushotham V. Bangalore, Robert Blackmore, Richard Barrett, Brian Barrett, Ron Brightwell, Greg Bronevetsky, Aurelien Bouteiller, Jed Brown, Devendar Bureddy, Darius Buntinas, Arno Candel, Mohamad Chaarawi, George Carr, Raghunath Raja Chandrasekar, James Dinan, Terry Dontje, Balazs Gerofi, Edgar Gabriel, Brice Goglin, Manjunath Gorentla, David Goodell, Erez Haba, Marc-André Hermanns, Jeff Hammond, Thomas Herault, Atsushi Hori, Nathan Hjelm, Jennifer Herrett-Skjellum, Joshua Hursey, Marty Itzkowitz, Yutaka Ishikawa, Bin Jia, Nysal Jan, Hideyuki Jitsumoto, Yann Kalemkarian, Takahiro Kawashima, Krishna Kandalla, Chulho Kim, Dries Kimpe, Christof Klausecker, Quincey Koziol, Alice Koniges, Dieter Kranzlmueller, Sameer Kumar, Eric Lantz, Manojkumar Krishnan, Bill Long, Andrew Lumsdaine, Jay Lofstead, Ewing Lusk, Miao Luo, Adam Moody, Guillaume Mercier, Amith Mamidala, Nick M. Maclaren, Kathryn Mohror, Douglas Miller, Scott McMillan, Tomotake Nakamura, Takeshi Nanri, Tim Murray, Steve Oyanagi, Swann Perarnau, Mark Pagel, Sreeram Potluri, Howard Pritchard, Rolf Riesen, Kuninobu Sasaki, Hubert Ritzdorf, Timo Schneider, Martin Schulz, Gilad Shainer, Christian Siebert, Anthony Skjellum, Brian Smith, Marc Snir, Raffaele Giuseppe Solca, Shinji Sumimoto, Alexander Supalov, Fabian Tillier, Masamichi Takagi, Sayantan Sur, Richard Treumann, Vinod Tipparaju, Jesper Larsson Träff, Rolf Vandevaart, Keith Underwood, Anh Vo, Min Xie, Abhinav Vishnu, Enqiang Zhou

The MPI Forum also acknowledges and appreciates the valuable input from people via e-mail and in person.

The MPI Forum also thanks those who provided feedback during the public comment period. In particular, the Forum would like to thank Jeremiah Wilcock for providing detailed comments on the entire draft standard.

The following institutions supported the MPI-3 effort through time and travel support for the people listed above.

Argonne National Laboratory
Bull
Cisco Systems, Inc.

Cray Inc.
CSCS
ETH Zurich
Fujitsu Ltd.
German Research School for Simulation Sciences
The HDF Group
Hewlett-Packard
International Business Machines
IBM India Private Ltd
Indiana University
Institut National de Recherche en Informatique et Automatique (INRIA)
Institute for Advanced Science & Engineering Corporation
Intel Corporation
Lawrence Berkeley National Laboratory
Lawrence Livermore National Laboratory
Los Alamos National Laboratory
Mellanox Technologies, Inc.
Microsoft Corporation
NEC Corporation
National Oceanic and Atmospheric Administration, Global Systems Division
NVIDIA Corporation
Oak Ridge National Laboratory
The Ohio State University
Oracle America
Platform Computing
RIKEN AICS
RunTime Computing Solutions, LLC
Sandia National Laboratories
Technical University of Chemnitz
Tokyo Institute of Technology
University of Alabama at Birmingham
University of Chicago
University of Houston
University of Illinois at Urbana-Champaign
University of Stuttgart, High Performance Computing Center Stuttgart (HLRS)
University of Tennessee, Knoxville
University of Tokyo

Funding for the MPI Forum meetings was partially supported by awards #CCF-0816909 and #CCF-1144042 from the National Science Foundation. In addition, the HDF Group and Sandia National Laboratories provided travel support for one U.S. academic each.

Chapter 1

Introduction to MPI

1.1 Overview and Goals

MPI (Message-Passing Interface) is a message-passing library interface specification. All parts of this definition are significant. MPI addresses primarily the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process. Extensions to the "classical" message-passing model are provided in collective operations, remote-memory access operations, dynamic process creation, and parallel I/O. MPI is a specification, not an implementation; there are multiple implementations of MPI. This specification is for a library interface; MPI is not a language, and all MPI operations are expressed as functions, subroutines, or methods, according to the appropriate language bindings which, for C and Fortran, are part of the MPI standard. The standard has been defined through an open process by a community of parallel computing vendors, computer scientists, and application developers. The next few sections provide an overview of the history of MPI's development.

The main advantages of establishing a message-passing standard are portability and ease of use. In a distributed memory communication environment in which the higher level routines and/or abstractions are built upon lower level message-passing routines, the benefits of standardization are particularly apparent. Furthermore, the definition of a message-passing standard, such as that proposed here, provides vendors with a clearly defined base set of routines that they can implement efficiently, or in some cases for which they can provide hardware support, thereby enhancing scalability.

The goal of the Message-Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such, the interface should establish a practical, portable, efficient, and flexible standard for message passing.

A complete list of goals follows.

• Design an application programming interface (not necessarily for compilers or a system implementation library).
• Allow efficient communication: avoid memory-to-memory copying, allow overlap of computation and communication, and offload to communication co-processors, where available.
• Allow for implementations that can be used in a heterogeneous environment.
• Allow convenient C and Fortran bindings for the interface.

• Assume a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.
• Define an interface that can be implemented on many vendors' platforms, with no significant changes in the underlying communication and system software.
• Semantics of the interface should be language independent.
• The interface should be designed to allow for thread safety.

1.2 Background of MPI-1.0

MPI sought to make use of the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard. Thus, MPI was strongly influenced by work at the IBM T. J. Watson Research Center [1, 2], Intel's NX/2 [50], Express [13], nCUBE's Vertex [46], p4 [8, 9], and PARMACS [5, 10]. Other important contributions have come from Zipcode [53, 54], Chimp [19, 20], PVM [4, 17], Chameleon [27], and PICL [25].

The MPI standardization effort involved about 60 people from 40 organizations, mainly from the United States and Europe. Most of the major vendors of concurrent computers were involved in MPI, along with researchers from universities, government laboratories, and industry. The standardization process began with the Workshop on Standards for Message-Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, held April 29-30, 1992, in Williamsburg, Virginia [60]. At this workshop the basic features essential to a standard message-passing interface were discussed, and a working group was established to continue the standardization process.

A preliminary draft proposal, known as MPI-1, was put forward by Dongarra, Hempel, Hey, and Walker in November 1992, and a revised version was completed in February 1993 [18]. MPI-1 embodied the main features that were identified at the Williamsburg workshop as being necessary in a message passing standard. Since MPI-1 was primarily intended to promote discussion and "get the ball rolling," it focused mainly on point-to-point communications. MPI-1 brought to the forefront a number of important standardization issues, but did not include any collective communication routines and was not thread-safe.

In November 1992, a meeting of the MPI working group was held in Minneapolis, at which it was decided to place the standardization process on a more formal footing, and to generally adopt the procedures and organization of the High Performance Fortran Forum. Subcommittees were formed for the major component areas of the standard, and an email discussion service was established for each. In addition, the goal of producing a draft MPI standard by the Fall of 1993 was set. To achieve this goal the MPI working group met every 6 weeks for two days throughout the first 9 months of 1993, and presented the draft MPI standard at the Supercomputing 93 conference in November 1993. These meetings and the email discussion together constituted the MPI Forum, membership of which has been open to all members of the high performance computing community.

1.3 Background of MPI-1.1, MPI-1.2, and MPI-2.0

Beginning in March 1995, the MPI Forum began meeting to consider corrections and extensions to the original MPI Standard document [22]. The first product of these deliberations

was Version 1.1 of the MPI specification, released in June of 1995 [23] (see http://www.mpi-forum.org for official MPI document releases). At that time, effort focused in five areas.

1. Further corrections and clarifications for the MPI-1.1 document.
2. Additions to MPI-1.1 that do not significantly change its types of functionality (new datatype constructors, language interoperability, etc.).
3. Completely new types of functionality (dynamic processes, one-sided communication, parallel I/O, etc.) that are what everyone thinks of as "MPI-2 functionality."
4. Bindings for Fortran 90 and C++. MPI-2 specifies C++ bindings for both MPI-1 and MPI-2 functions, and extensions to the Fortran 77 binding of MPI-1 and MPI-2 to handle Fortran 90 issues.
5. Discussions of areas in which the MPI process and framework seem likely to be useful, but where more discussion and experience are needed before standardization (e.g., zero-copy semantics on shared-memory machines, real-time specifications).

Corrections and clarifications (items of type 1 in the above list) were collected in Chapter 3 of the MPI-2 document: "Version 1.2 of MPI." That chapter also contains the function for identifying the version number. Additions to MPI-1.1 (items of types 2, 3, and 4 in the above list) are in the remaining chapters of the MPI-2 document, and constitute the specification for MPI-2. Items of type 5 in the above list have been moved to a separate document, the "MPI Journal of Development" (JOD), and are not part of the MPI-2 Standard.

This structure makes it easy for users and implementors to understand what level of MPI compliance a given implementation has:

• MPI-1 compliance will mean compliance with MPI-1.3. This is a useful level of compliance. It means that the implementation conforms to the clarifications of MPI-1.1 function behavior given in Chapter 3 of the MPI-2 document. Some implementations may require changes to be MPI-1 compliant.
• MPI-2 compliance will mean compliance with all of MPI-2.1.
• The MPI Journal of Development is not part of the MPI Standard.

It is to be emphasized that forward compatibility is preserved. That is, a valid MPI-1.1 program is both a valid MPI-1.3 program and a valid MPI-2.1 program, and a valid MPI-1.3 program is a valid MPI-2.1 program.

1.4 Background of MPI-1.3 and MPI-2.1

After the release of MPI-2.0, the MPI Forum kept working on errata and clarifications for both standard documents (MPI-1.1 and MPI-2.0). The short document "Errata for MPI-1.1" was released October 12, 1998. On July 5, 2001, a first ballot of errata and clarifications for MPI-2.0 was released, and a second ballot was voted on May 22, 2002. Both votes were done electronically. Both ballots were combined into one document: "Errata for MPI-2," May 15, 2002. This errata process was then interrupted, but the Forum and its e-mail reflectors kept working on new requests for clarification.

Restarting regular work of the MPI Forum was initiated in three meetings, at EuroPVM/MPI'06 in Bonn, at EuroPVM/MPI'07 in Paris, and at SC'07 in Reno. In December 2007, a steering committee started the organization of new MPI Forum meetings at regular 8-week intervals. At the January 14-16, 2008 meeting in Chicago, the MPI Forum decided to combine the existing and future MPI documents into one document for each version of the MPI standard. For technical and historical reasons, this series was started with MPI-1.3. Additional Ballots 3 and 4 resolved old questions from the errata list started in 1995 up to new questions from recent years. After all documents (MPI-1.1, MPI-2, Errata for MPI-1.1 (Oct. 12, 1998), and MPI-2.1 Ballots 1-4) were combined into one draft document, a chapter author and review team were defined for each chapter. They cleaned up the draft to achieve a consistent MPI-2.1 document. The final MPI-2.1 standard document was finished in June 2008, and finally released with a second vote in September 2008 at the meeting in Dublin, just before EuroPVM/MPI'08. The major work of the current MPI Forum is the preparation of MPI-3.

1.5 Background of MPI-2.2

MPI-2.2 is a minor update to the MPI-2.1 standard. This version addresses additional errors and ambiguities that were not corrected in the MPI-2.1 standard as well as a small number of extensions to MPI-2.1 that met the following criteria:

• Any correct MPI-2.1 program is a correct MPI-2.2 program.
• Any extension must have significant benefit for users.
• Any extension must not require significant implementation effort. To that end, all such changes are accompanied by an open source implementation.

The discussions of MPI-2.2 proceeded concurrently with the MPI-3 discussions; in some cases, extensions were proposed for MPI-2.2 but were later moved to MPI-3.

1.6 Background of MPI-3.0

MPI-3.0 is a major update to the MPI standard. The updates include the extension of collective operations to include nonblocking versions, extensions to the one-sided operations, and a new Fortran 2008 binding. In addition, the deprecated C++ bindings have been removed, as well as many of the deprecated routines and MPI objects (such as the MPI_UB datatype).

1.7 Who Should Use This Standard?

This standard is intended for use by all those who want to write portable message-passing programs in Fortran and C (and access the C bindings from C++). This includes individual application programmers, developers of software designed to run on parallel machines, and creators of environments and tools. In order to be attractive to this wide audience, the standard must provide a simple, easy-to-use interface for the basic user while not semantically precluding the high-performance message-passing operations available on advanced machines.
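As a brief, non-normative illustration of the nonblocking collective operations mentioned in Section 1.6, the following C sketch starts a broadcast, overlaps it with independent local computation, and then completes it. The routine names are standard MPI-3 routines; do_local_work is a placeholder for application code and is not defined by the standard.

#include <mpi.h>

/* Placeholder for computation that does not depend on the broadcast data. */
static void do_local_work(void) { /* ... */ }

int main(int argc, char *argv[])
{
    int data = 0;
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data = 42;

    /* MPI-3.0 nonblocking broadcast: starts the collective and returns
       immediately with a request handle. */
    MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);

    do_local_work();                      /* overlap computation with communication */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the broadcast */

    MPI_Finalize();
    return 0;
}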

1.8 What Platforms Are Targets For Implementation?

The attractiveness of the message-passing paradigm at least partially stems from its wide portability. Programs expressed this way may run on distributed-memory multiprocessors, networks of workstations, and combinations of all of these. In addition, shared-memory implementations, including those for multi-core processors and hybrid architectures, are possible. The paradigm will not be made obsolete by architectures combining the shared- and distributed-memory views, or by increases in network speeds. It thus should be both possible and useful to implement this standard on a great variety of machines, including those "machines" consisting of collections of other machines, parallel or not, connected by a communication network.

The interface is suitable for use by fully general MIMD programs, as well as those written in the more restricted style of SPMD. MPI provides many features intended to improve performance on scalable parallel computers with specialized interprocessor communication hardware. Thus, we expect that native, high-performance implementations of MPI will be provided on such machines. At the same time, implementations of MPI on top of standard Unix interprocessor communication protocols will provide portability to workstation clusters and heterogeneous networks of workstations.

1.9 What Is Included In The Standard?

The standard includes:

• Point-to-point communication,
• Datatypes,
• Collective operations,
• Process groups,
• Communication contexts,
• Process topologies,
• Environmental management and inquiry,
• The Info object,
• Process creation and management,
• One-sided communication,
• External interfaces,
• Parallel file I/O,
• Language bindings for Fortran and C,
• Tool support.
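As a small illustration of several of the areas listed above (environmental management, communicators, and point-to-point communication), consider the following C sketch. It is illustrative only, not part of the normative text, and assumes the program is run with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);                  /* environmental management */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* communicator and process-group inquiry */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            value = 17;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* point-to-point send */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* matching receive */
            printf("rank 1 received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}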

1.10 What Is Not Included In The Standard?

The standard does not specify:

• Operations that require more operating system support than is currently standard; for example, interrupt-driven receives, remote execution, or active messages,
• Program construction tools,
• Debugging facilities.

There are many features that have been considered and not included in this standard. This happened for a number of reasons, one of which is the time constraint that was self-imposed in finishing the standard. Features that are not included can always be offered as extensions by specific implementations. Perhaps future versions of MPI will address some of these issues.

1.11 Organization of this Document

The following is a list of the remaining chapters in this document, along with a brief description of each.

• Chapter 2, MPI Terms and Conventions, explains notational terms and conventions used throughout the MPI document.
• Chapter 3, Point to Point Communication, defines the basic, pairwise communication subset of MPI. Send and receive are found here, along with many associated functions designed to make basic communication powerful and efficient.
• Chapter 4, Datatypes, defines a method to describe any data layout, e.g., an array of structures in memory, which can be used as a message send or receive buffer.
• Chapter 5, Collective Communications, defines process-group collective communication operations. Well known examples of this are barrier and broadcast over a group of processes (not necessarily all the processes). With MPI-2, the semantics of collective communication was extended to include intercommunicators, and two new collective operations were added. MPI-3 adds nonblocking collective operations.
• Chapter 6, Groups, Contexts, Communicators, and Caching, shows how groups of processes are formed and manipulated, how unique communication contexts are obtained, and how the two are bound together into a communicator.
• Chapter 7, Process Topologies, explains a set of utility functions meant to assist in the mapping of process groups (a linearly ordered set) to richer topological structures such as multi-dimensional grids.
• Chapter 8, MPI Environmental Management, explains how the programmer can manage and make inquiries of the current MPI environment. These functions are needed for the writing of correct, robust programs, and are especially important for the construction of highly-portable message-passing programs.

• Chapter 9, The Info Object, defines an opaque object that is used as input in several MPI routines.
• Chapter 10, Process Creation and Management, defines routines that allow for creation of processes.
• Chapter 11, One-Sided Communications, defines communication routines that can be completed by a single process. These include shared-memory operations (put/get) and remote accumulate operations.
• Chapter 12, External Interfaces, defines routines designed to allow developers to layer on top of MPI. This includes generalized requests, routines that decode MPI opaque objects, and threads.
• Chapter 13, I/O, defines MPI support for parallel I/O.
• Chapter 14, Tool Support, covers interfaces that allow debuggers, performance analyzers, and other tools to obtain data about the operation of MPI processes. This chapter includes Section 14.2 (Profiling Interface), which was a chapter in previous versions of MPI.
• Chapter 15, Deprecated Functions, describes routines that are kept for reference. However, usage of these functions is discouraged, as they may be deleted in future versions of the standard.
• Chapter 16, Removed Interfaces, describes routines and constructs that have been removed from MPI. These were deprecated in MPI-2, and the MPI Forum decided to remove them from the MPI-3 standard.
• Chapter 17, Language Bindings, discusses Fortran issues and describes language interoperability aspects between C and Fortran.

The Appendices are:

• Annex A, Language Bindings Summary, gives specific syntax in C and Fortran for all MPI functions, constants, and types.
• Annex B, Change-Log, summarizes some changes since the previous version of the standard.
• Index: several pages show the locations of examples, constants and predefined handles, callback routine prototypes, and all MPI functions.

MPI provides various interfaces to facilitate interoperability of distinct MPI implementations. Among these are the canonical data representation for MPI I/O and for MPI_PACK_EXTERNAL and MPI_UNPACK_EXTERNAL. The definition of an actual binding of these interfaces that will enable interoperability is outside the scope of this document.

A separate document consists of ideas that were discussed in the MPI Forum during the MPI-2 development and deemed to have value, but are not included in the MPI Standard. They are part of the "Journal of Development" (JOD), lest good ideas be lost and in order to provide a starting point for further work. The chapters in the JOD are

• Chapter 2, Spawning Independent Processes, includes some elements of dynamic process management, in particular management of processes with which the spawning processes do not intend to communicate, that the Forum discussed at length but ultimately decided not to include in the MPI Standard.
• Chapter 3, Threads and MPI, describes some of the expected interaction between an MPI implementation and a thread library in a multi-threaded environment.
• Chapter 4, Communicator ID, describes an approach to providing identifiers for communicators.
• Chapter 5, Miscellany, discusses miscellaneous topics in the MPI JOD, in particular single-copy routines for use in shared-memory environments and new datatype constructors.
• Chapter 6, Toward a Full Fortran 90 Interface, describes an approach to providing a more elaborate Fortran 90 interface.
• Chapter 7, Split Collective Communication, describes a specification for certain non-blocking collective operations.
• Chapter 8, Real-Time MPI, discusses MPI support for real time processing.

Chapter 2

MPI Terms and Conventions

This chapter explains notational terms and conventions used throughout the MPI document, some of the choices that have been made, and the rationale behind those choices.

2.1 Document Notation

Rationale. Throughout this document, the rationale for the design choices made in the interface specification is set off in this format. Some readers may wish to skip these sections, while readers interested in interface design may want to read them carefully. (End of rationale.)

Advice to users. Throughout this document, material aimed at users and that illustrates usage is set off in this format. Some readers may wish to skip these sections, while readers interested in programming in MPI may want to read them carefully. (End of advice to users.)

Advice to implementors. Throughout this document, material that is primarily commentary to implementors is set off in this format. Some readers may wish to skip these sections, while readers interested in MPI implementations may want to read them carefully. (End of advice to implementors.)

2.2 Naming Conventions

In many cases MPI names for C functions are of the form MPI_Class_action_subset. This convention originated with MPI-1. Since MPI-2 an attempt has been made to standardize the names of MPI functions according to the following rules.

1. In C, all routines associated with a particular type of MPI object should be of the form MPI_Class_action_subset or, if no subset exists, of the form MPI_Class_action. In Fortran, all routines associated with a particular type of MPI object should be of the form MPI_CLASS_ACTION_SUBSET or, if no subset exists, of the form MPI_CLASS_ACTION.
2. If the routine is not associated with a class, the name should be of the form MPI_Action_subset in C and MPI_ACTION_SUBSET in Fortran.

3. The names of certain actions have been standardized. In particular, Create creates a new object, Get retrieves information about an object, Set sets this information, Delete deletes information, and Is asks whether or not an object has a certain property.

C and Fortran names for some MPI functions (those that were defined during the MPI-1 process) violate these rules in several cases. The most common exceptions are the omission of the Class name from the routine and the omission of the Action where one can be inferred.

MPI identifiers are limited to 30 characters (31 with the profiling interface). This is done to avoid exceeding the limit on some compilation systems.

2.3 Procedure Specification

MPI procedures are specified using a language-independent notation. The arguments of procedure calls are marked as IN, OUT, or INOUT. The meanings of these are:

• IN: the call may use the input value but does not update the argument from the perspective of the caller at any time during the call's execution,
• OUT: the call may update the argument but does not use its input value,
• INOUT: the call may both use and update the argument.

There is one special case: if an argument is a handle to an opaque object (these terms are defined in Section 2.5.1), and the object is updated by the procedure call, then the argument is marked OUT or INOUT. It is marked this way even though the handle itself is not modified; we use the OUT or INOUT attribute to denote that what the handle references is updated.

Rationale. The definition of MPI tries to avoid, to the largest possible extent, the use of INOUT arguments, because such use is error-prone, especially for scalar arguments. (End of rationale.)

MPI's use of IN, OUT, and INOUT is intended to indicate to the user how an argument is to be used, but does not provide a rigorous classification that can be translated directly into all language bindings (e.g., INTENT in Fortran 90 bindings or const in C bindings). For instance, the "constant" MPI_BOTTOM can usually be passed to OUT buffer arguments. Similarly, MPI_STATUS_IGNORE can be passed as the OUT status argument.

A common occurrence for MPI functions is an argument that is used as IN by some processes and OUT by other processes. Such an argument is, syntactically, an INOUT argument and is marked as such, although, semantically, it is not used in one call both for input and for output on a single process.

Another frequent situation arises when an argument value is needed only by a subset of the processes. When an argument is not significant at a process then an arbitrary value can be passed as an argument.

Unless specified otherwise, an argument of type OUT or INOUT cannot be aliased with any other argument passed to an MPI procedure. An example of argument aliasing in C appears below. If we define a C procedure like this,

41 2.4. SEMANTIC TERMS 11 1 void copyIntBuffer( int *pin, int *pout, int len ) 2 { int i; 3 for (i=0; i

predefined: A predefined datatype is a datatype with a predefined (constant) name (such as MPI_INT, MPI_FLOAT_INT, or MPI_PACKED) or a datatype constructed with MPI_TYPE_CREATE_F90_INTEGER, MPI_TYPE_CREATE_F90_REAL, or MPI_TYPE_CREATE_F90_COMPLEX. The former are named whereas the latter are unnamed.

derived: A derived datatype is any datatype that is not predefined.

portable: A datatype is portable if it is a predefined datatype, or it is derived from a portable datatype using only the type constructors MPI_TYPE_CONTIGUOUS, MPI_TYPE_VECTOR, MPI_TYPE_INDEXED, MPI_TYPE_CREATE_INDEXED_BLOCK, MPI_TYPE_CREATE_SUBARRAY, MPI_TYPE_DUP, and MPI_TYPE_CREATE_DARRAY. Such a datatype is portable because all displacements in the datatype are in terms of extents of one predefined datatype. Therefore, if such a datatype fits a data layout in one memory, it will fit the corresponding data layout in another memory, if the same declarations were used, even if the two systems have different architectures. On the other hand, if a datatype was constructed using MPI_TYPE_CREATE_HINDEXED, MPI_TYPE_CREATE_HINDEXED_BLOCK, MPI_TYPE_CREATE_HVECTOR, or MPI_TYPE_CREATE_STRUCT, then the datatype contains explicit byte displacements (e.g., providing padding to meet alignment restrictions). These displacements are unlikely to be chosen correctly if they fit a data layout in one memory but are used for data layouts on another process, running on a processor with a different architecture.

equivalent: Two datatypes are equivalent if they appear to have been created with the same sequence of calls (and arguments) and thus have the same typemap. Two equivalent datatypes do not necessarily have the same cached attributes or the same names.

2.5 Data Types

2.5.1 Opaque Objects

MPI manages system memory that is used for buffering messages and for storing internal representations of various MPI objects such as groups, communicators, datatypes, etc. This memory is not directly accessible to the user, and objects stored there are opaque: their size and shape is not visible to the user. Opaque objects are accessed via handles, which exist in user space. MPI procedures that operate on opaque objects are passed handle arguments to access these objects. In addition to their use by MPI calls for object access, handles can participate in assignments and comparisons.

In Fortran with USE mpi or INCLUDE 'mpif.h', all handles have type INTEGER. In Fortran with USE mpi_f08, and in C, a different handle type is defined for each category of objects. With Fortran USE mpi_f08, the handles are defined as Fortran BIND(C) derived types that consist of only one element INTEGER :: MPI_VAL. The internal handle value is identical to the Fortran INTEGER value used in the mpi module and mpif.h. The operators .EQ., .NE., == and /= are overloaded to allow the comparison of these handles. The type names are identical to the names in C, except that they are not case sensitive. For example:

TYPE, BIND(C) :: MPI_Comm
    INTEGER :: MPI_VAL
END TYPE MPI_Comm

The C types must support the use of the assignment and equality operators.

Advice to implementors. In Fortran, the handle can be an index into a table of opaque objects in a system table; in C it can be such an index or a pointer to the object. (End of advice to implementors.)

Rationale. Since the Fortran integer values are equivalent, applications can easily convert MPI handles between all three supported Fortran methods. For example, an integer communicator handle COMM can be converted directly into an exactly equivalent mpi_f08 communicator handle named comm_f08 by comm_f08%MPI_VAL=COMM, and vice versa. The use of the INTEGER defined handles and the BIND(C) derived type handles is different: Fortran 2003 (and later) defines that BIND(C) derived types can be used within user defined common blocks, but it is up to the rules of the companion C compiler how many numerical storage units are used for these BIND(C) derived type handles. Most compilers use one unit for both the INTEGER handles and the handles defined as BIND(C) derived types. (End of rationale.)

Advice to users. If a user wants to substitute mpif.h or the mpi module by the mpi_f08 module and the application program stores a handle in a Fortran common block, then it is necessary to change the Fortran support method in all application routines that use this common block, because the number of numerical storage units of such a handle can be different in the two modules. (End of advice to users.)

Opaque objects are allocated and deallocated by calls that are specific to each object type. These are listed in the sections where the objects are described. The calls accept a handle argument of matching type. In an allocate call this is an OUT argument that returns a valid reference to the object. In a call to deallocate this is an INOUT argument which returns with an "invalid handle" value. MPI provides an "invalid handle" constant for each object type. Comparisons to this constant are used to test for validity of the handle.

A call to a deallocate routine invalidates the handle and marks the object for deallocation. The object is not accessible to the user after the call. However, MPI need not deallocate the object immediately. Any operation pending (at the time of the deallocate) that involves this object will complete normally; the object will be deallocated afterwards.

An opaque object and its handle are significant only at the process where the object was created and cannot be transferred to another process.

MPI provides certain predefined opaque objects and predefined, static handles to these objects. The user must not free such objects.

Rationale. This design hides the internal representation used for MPI data structures, thus allowing similar calls in C and Fortran. It also avoids conflicts with the typing rules in these languages, and easily allows future extensions of functionality. The mechanism for opaque objects used here loosely follows the POSIX Fortran binding standard.

The explicit separation of handles in user space and objects in system space allows space-reclaiming and deallocation calls to be made at appropriate points in the user

program. If the opaque objects were in user space, one would have to be very careful not to go out of scope before any pending operation requiring that object completed. The specified design allows an object to be marked for deallocation, the user program can then go out of scope, and the object itself still persists until any pending operations are complete.

The requirement that handles support assignment/comparison is made since such operations are common. This restricts the domain of possible implementations. The alternative would have been to allow handles to have been an arbitrary, opaque type. This would force the introduction of routines to do assignment and comparison, adding complexity, and was therefore ruled out. (End of rationale.)

Advice to users. A user may accidentally create a dangling reference by assigning to a handle the value of another handle, and then deallocating the object associated with these handles. Conversely, if a handle variable is deallocated before the associated object is freed, then the object becomes inaccessible (this may occur, for example, if the handle is a local variable within a subroutine, and the subroutine is exited before the associated object is deallocated). It is the user's responsibility to avoid adding or deleting references to opaque objects, except as a result of MPI calls that allocate or deallocate such objects. (End of advice to users.)

Advice to implementors. The intended semantics of opaque objects is that opaque objects are separate from one another; each call to allocate such an object copies all the information required for the object. Implementations may avoid excessive copying by substituting referencing for copying. For example, a derived datatype may contain references to its components, rather than copies of its components; a call to MPI_COMM_GROUP may return a reference to the group associated with the communicator, rather than a copy of this group. In such cases, the implementation must maintain reference counts, and allocate and deallocate objects in such a way that the visible effect is as if the objects were copied. (End of advice to implementors.)

2.5.2 Array Arguments

An MPI call may need an argument that is an array of opaque objects, or an array of handles. The array-of-handles is a regular array with entries that are handles to objects of the same type in consecutive locations in the array. Whenever such an array is used, an additional len argument is required to indicate the number of valid entries (unless this number can be derived otherwise). The valid entries are at the beginning of the array; len indicates how many of them there are, and need not be the size of the entire array. The same approach is followed for other array arguments. In some cases NULL handles are considered valid entries. When a NULL argument is desired for an array of statuses, one uses MPI_STATUSES_IGNORE.

2.5.3 State

MPI procedures use at various places arguments with state types. The values of such a data type are all identified by names, and no operation is defined on them. For example, the MPI_TYPE_CREATE_SUBARRAY routine has a state argument order with values MPI_ORDER_C and MPI_ORDER_FORTRAN.
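As a hedged illustration of these conventions (the calls shown are standard MPI routines, but the fragment itself is an example sketch rather than normative text), the following C code allocates an opaque object through a handle, uses an array-of-handles argument with MPI_STATUSES_IGNORE, and then deallocates the object:

#include <mpi.h>

void handle_example(void)
{
    MPI_Comm dup;                 /* handle in user space */
    MPI_Request reqs[2];
    int sendval = 1, recvval, rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate call: dup is an OUT argument returning a valid handle. */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    /* Each process sends to and receives from itself on the new communicator. */
    MPI_Irecv(&recvval, 1, MPI_INT, rank, 0, dup, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, rank, 0, dup, &reqs[1]);

    /* Array-of-handles argument; the statuses are not needed here. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Deallocate call: dup is an INOUT argument and returns as the
       "invalid handle" constant MPI_COMM_NULL. */
    MPI_Comm_free(&dup);
}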

2.5.4 Named Constants

MPI procedures sometimes assign a special meaning to a special value of a basic type argument; e.g., tag is an integer-valued argument of point-to-point communication operations, with a special wild-card value, MPI_ANY_TAG. Such arguments will have a range of regular values, which is a proper subrange of the range of values of the corresponding basic type; special values (such as MPI_ANY_TAG) will be outside the regular range. The range of regular values, such as tag, can be queried using environmental inquiry functions (Chapter 7 of the MPI-1 document). The range of other values, such as source, depends on values given by other MPI routines (in the case of source it is the communicator size).

MPI also provides predefined named constant handles, such as MPI_COMM_WORLD.

All named constants, with the exceptions noted below for Fortran, can be used in initialization expressions or assignments, but not necessarily in array declarations or as labels in C switch or Fortran select/case statements. This implies named constants to be link-time but not necessarily compile-time constants. The named constants listed below are required to be compile-time constants in both C and Fortran. These constants do not change values during execution. Opaque objects accessed by constant handles are defined and do not change value between MPI initialization (MPI_INIT) and MPI completion (MPI_FINALIZE). The handles themselves are constants and can also be used in initialization expressions or assignments.

The constants that are required to be compile-time constants (and can thus be used for array length declarations and labels in C switch and Fortran case/select statements) are:

MPI_MAX_PROCESSOR_NAME
MPI_MAX_LIBRARY_VERSION_STRING
MPI_MAX_ERROR_STRING
MPI_MAX_DATAREP_STRING
MPI_MAX_INFO_KEY
MPI_MAX_INFO_VAL
MPI_MAX_OBJECT_NAME
MPI_MAX_PORT_NAME
MPI_VERSION
MPI_SUBVERSION
MPI_STATUS_SIZE (Fortran only)
MPI_ADDRESS_KIND (Fortran only)
MPI_COUNT_KIND (Fortran only)
MPI_INTEGER_KIND (Fortran only)
MPI_OFFSET_KIND (Fortran only)
MPI_SUBARRAYS_SUPPORTED (Fortran only)
MPI_ASYNC_PROTECTS_NONBLOCKING (Fortran only)

The constants that cannot be used in initialization expressions or assignments in Fortran are:

MPI_BOTTOM
MPI_STATUS_IGNORE
MPI_STATUSES_IGNORE
MPI_ERRCODES_IGNORE
MPI_IN_PLACE
MPI_ARGV_NULL

MPI_ARGVS_NULL
MPI_UNWEIGHTED
MPI_WEIGHTS_EMPTY

Advice to implementors. In Fortran the implementation of these special constants may require the use of language constructs that are outside the Fortran standard. Using special values for the constants (e.g., by defining them through PARAMETER statements) is not possible because an implementation cannot distinguish these values from valid data. Typically, these constants are implemented as predefined static variables (e.g., a variable in an MPI-declared COMMON block), relying on the fact that the target compiler passes data by address. Inside the subroutine, this address can be extracted by some mechanism outside the Fortran standard (e.g., by Fortran extensions or by implementing the function in C). (End of advice to implementors.)

2.5.5 Choice

MPI functions sometimes use arguments with a choice (or union) data type. Distinct calls to the same routine may pass by reference actual arguments of different types. The mechanism for providing such arguments will differ from language to language. For Fortran with the mpif.h include file or the mpi module, the document uses <type> to represent a choice variable; with the Fortran mpi_f08 module, such arguments are declared with the Fortran 2008 + TR 29113 syntax TYPE(*), DIMENSION(..); for C, we use void *.

Advice to implementors. Implementors can freely choose how to implement choice arguments in the mpi module, e.g., with a non-standard compiler-dependent method that has the quality of the call mechanism in the implicit Fortran interfaces, or with the method defined for the mpi_f08 module. See details in Section 17.1.1 on page 597. (End of advice to implementors.)

2.5.6 Addresses

Some MPI procedures use address arguments that represent an absolute address in the calling program. The datatype of such an argument is MPI_Aint in C and INTEGER (KIND=MPI_ADDRESS_KIND) in Fortran. These types must have the same width and encode address values in the same manner such that address values in one language may be passed directly to another language without conversion. There is the MPI constant MPI_BOTTOM to indicate the start of the address range.

2.5.7 File Offsets

For I/O there is a need to give the size, displacement, and offset into a file. These quantities can easily be larger than 32 bits, which can be the default size of a Fortran integer. To overcome this, these quantities are declared to be INTEGER (KIND=MPI_OFFSET_KIND) in Fortran. In C one uses MPI_Offset. These types must have the same width and encode offset values in the same manner such that offset values in one language may be passed directly to another language without conversion.
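The following C sketch (illustrative only; the file name example.dat is a placeholder assumption of the example, not something defined by the standard) shows a compile-time constant used as an array length together with the MPI_Aint and MPI_Offset types described above:

#include <mpi.h>
#include <stdio.h>

void type_width_example(void)
{
    char procname[MPI_MAX_PROCESSOR_NAME];   /* compile-time constant as array length */
    int  namelen;
    double buf[4];
    MPI_Aint addr;
    MPI_File fh;
    MPI_Offset size;

    MPI_Get_processor_name(procname, &namelen);

    /* Absolute address of a location in the calling program. */
    MPI_Get_address(&buf[2], &addr);

    /* File sizes and offsets use MPI_Offset, which may be wider than int. */
    MPI_File_open(MPI_COMM_SELF, "example.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_get_size(fh, &size);
    printf("%s: file size = %lld bytes\n", procname, (long long)size);
    MPI_File_close(&fh);
}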

2.5.8 Counts

As described above, MPI defines types (e.g., MPI_Aint) to address locations within memory and other types (e.g., MPI_Offset) to address locations within files. In addition, some MPI procedures use count arguments that represent a number of datatypes on which to operate. At times, one needs a single type that can be used to address locations within either memory or files as well as express count values, and that type is MPI_Count in C and INTEGER(KIND=MPI_COUNT_KIND) in Fortran. These types must have the same width and encode values in the same manner such that count values in one language may be passed directly to another language without conversion. The size of the MPI_Count type is determined by the MPI implementation with the restriction that it must be minimally capable of encoding any value that may be stored in a variable of type int, MPI_Aint, or MPI_Offset in C and of type INTEGER, INTEGER(KIND=MPI_ADDRESS_KIND), or INTEGER(KIND=MPI_OFFSET_KIND) in Fortran.

Rationale. Count values logically need to be large enough to encode any value used for expressing element counts, type maps in memory, type maps in file views, etc. For backward compatibility reasons, many MPI routines still use int in C and INTEGER in Fortran as the type of count arguments. (End of rationale.)
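As an illustration added to this transcript (not standard text), the MPI-3.0 routine MPI_Type_size_x reports a datatype size as an MPI_Count value, so it cannot overflow even for very large derived datatypes:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Count size;

    MPI_Init(&argc, &argv);
    /* The size is returned in an MPI_Count rather than an int. */
    MPI_Type_size_x(MPI_DOUBLE, &size);
    printf("MPI_DOUBLE occupies %lld bytes\n", (long long)size);
    MPI_Finalize();
    return 0;
}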

2.6 Language Binding

This section defines the rules for MPI language binding in general and for Fortran and ISO C in particular. (Note that ANSI C has been replaced by ISO C.) Defined here are various object representations, as well as the naming conventions used for expressing this standard. The actual calling sequences are defined elsewhere.

MPI bindings are for Fortran 90 or later, though they were originally designed to be usable in Fortran 77 environments. With the mpi_f08 module, two new Fortran features, assumed type and assumed rank, are also required; see Section 2.5.5 on page 16.

Since the word PARAMETER is a keyword in the Fortran language, we use the word "argument" to denote the arguments to a subroutine. These are normally referred to as parameters in C; however, we expect that C programmers will understand the word "argument" (which has no specific meaning in C), thus allowing us to avoid unnecessary confusion for Fortran programmers.

Since Fortran is case insensitive, linkers may use either lower case or upper case when resolving Fortran names. Users of case sensitive languages should avoid the "mpi_" and "pmpi_" prefixes.

2.6.1 Deprecated and Removed Names and Functions

A number of chapters refer to deprecated or replaced MPI constructs. These are constructs that continue to be part of the MPI standard, as documented in Chapter 15 on page 591, but that users are recommended not to continue using, since better solutions were provided with newer versions of MPI. For example, the Fortran binding for MPI-1 functions that have address arguments uses INTEGER. This is not consistent with the C binding, and causes problems on machines with 32 bit INTEGERs and 64 bit addresses. In MPI-2, these functions were given new names with new bindings for the address arguments. The use of the old functions is deprecated. For consistency, here and in a few other cases, new C functions are also provided, even though the new functions are equivalent to the old functions. The old names are deprecated.

Some of the deprecated constructs are now removed, as documented in Chapter 16 on page 595. They may still be provided by an implementation for backwards compatibility, but are not required.

Table 2.1 shows a list of all of the deprecated and removed constructs. Note that some C typedefs and Fortran subroutine names are included in this list; they are the types of callback functions.

Deprecated or removed construct     deprecated since  removed since  Replacement
MPI_ADDRESS                         MPI-2.0           MPI-3.0        MPI_GET_ADDRESS
MPI_TYPE_HINDEXED                   MPI-2.0           MPI-3.0        MPI_TYPE_CREATE_HINDEXED
MPI_TYPE_HVECTOR                    MPI-2.0           MPI-3.0        MPI_TYPE_CREATE_HVECTOR
MPI_TYPE_STRUCT                     MPI-2.0           MPI-3.0        MPI_TYPE_CREATE_STRUCT
MPI_TYPE_EXTENT                     MPI-2.0           MPI-3.0        MPI_TYPE_GET_EXTENT
MPI_TYPE_UB                         MPI-2.0           MPI-3.0        MPI_TYPE_GET_EXTENT
MPI_TYPE_LB                         MPI-2.0           MPI-3.0        MPI_TYPE_GET_EXTENT
MPI_LB^1                            MPI-2.0           MPI-3.0        MPI_TYPE_CREATE_RESIZED
MPI_UB^1                            MPI-2.0           MPI-3.0        MPI_TYPE_CREATE_RESIZED
MPI_ERRHANDLER_CREATE               MPI-2.0           MPI-3.0        MPI_COMM_CREATE_ERRHANDLER
MPI_ERRHANDLER_GET                  MPI-2.0           MPI-3.0        MPI_COMM_GET_ERRHANDLER
MPI_ERRHANDLER_SET                  MPI-2.0           MPI-3.0        MPI_COMM_SET_ERRHANDLER
MPI_Handler_function^2              MPI-2.0           MPI-3.0        MPI_Comm_errhandler_function^2
MPI_KEYVAL_CREATE                   MPI-2.0                          MPI_COMM_CREATE_KEYVAL
MPI_KEYVAL_FREE                     MPI-2.0                          MPI_COMM_FREE_KEYVAL
MPI_DUP_FN^3                        MPI-2.0                          MPI_COMM_DUP_FN^3
MPI_NULL_COPY_FN^3                  MPI-2.0                          MPI_COMM_NULL_COPY_FN^3
MPI_NULL_DELETE_FN^3                MPI-2.0                          MPI_COMM_NULL_DELETE_FN^3
MPI_Copy_function^2                 MPI-2.0                          MPI_Comm_copy_attr_function^2
COPY_FUNCTION^3                     MPI-2.0                          COMM_COPY_ATTR_FUNCTION^3
MPI_Delete_function^2               MPI-2.0                          MPI_Comm_delete_attr_function^2
DELETE_FUNCTION^3                   MPI-2.0                          COMM_DELETE_ATTR_FUNCTION^3
MPI_ATTR_DELETE                     MPI-2.0                          MPI_COMM_DELETE_ATTR
MPI_ATTR_GET                        MPI-2.0                          MPI_COMM_GET_ATTR
MPI_ATTR_PUT                        MPI-2.0                          MPI_COMM_SET_ATTR
MPI_COMBINER_HVECTOR_INTEGER^4      -                 MPI-3.0        MPI_COMBINER_HVECTOR^4
MPI_COMBINER_HINDEXED_INTEGER^4     -                 MPI-3.0        MPI_COMBINER_HINDEXED^4
MPI_COMBINER_STRUCT_INTEGER^4       -                 MPI-3.0        MPI_COMBINER_STRUCT^4
MPI::... (C++ language binding)     MPI-2.2           MPI-3.0        C language binding

^1 Predefined datatype.   ^2 Callback prototype definition.
^3 Predefined callback routine.   ^4 Constant.
Other entries are regular MPI routines.

Table 2.1: Deprecated and Removed constructs
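A short C sketch, added to this transcript and not part of the standard text, of how a program that used the removed routines MPI_ADDRESS and MPI_TYPE_STRUCT maps onto their replacements from Table 2.1 (the helper name build_pair_type is illustrative only):

#include "mpi.h"

/* Build a datatype describing one int followed by one double,
   using the replacement routines MPI_Get_address and
   MPI_Type_create_struct from Table 2.1. */
void build_pair_type(int *i, double *d, MPI_Datatype *newtype)
{
    int          blocklens[2] = {1, 1};
    MPI_Aint     disps[2];
    MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};

    MPI_Get_address(i, &disps[0]);        /* replaces MPI_ADDRESS */
    MPI_Get_address(d, &disps[1]);
    disps[1] -= disps[0];                 /* displacements relative to i */
    disps[0] = 0;

    MPI_Type_create_struct(2, blocklens, disps, types, newtype);  /* replaces MPI_TYPE_STRUCT */
    MPI_Type_commit(newtype);
}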

2.6.2 Fortran Binding Issues

Originally, MPI-1.1 provided bindings for Fortran 77. These bindings are retained, but they are now interpreted in the context of the Fortran 90 standard. MPI can still be used with most Fortran 77 compilers, as noted below. When the term "Fortran" is used it means Fortran 90 or later; it means Fortran 2008 + TR 29113 and later if the mpi_f08 module is used.

All MPI names have an MPI_ prefix, and all characters are capitals. Programs must not declare names, e.g., for variables, subroutines, functions, parameters, derived types, abstract interfaces, or modules, beginning with the prefix MPI_. To avoid conflicting with the profiling interface, programs must also avoid subroutines and functions with the prefix PMPI_. This is mandated to avoid possible name collisions.

All MPI Fortran subroutines have a return code in the last argument. With USE mpi_f08, this last argument is declared as OPTIONAL, except for user-defined callback functions (e.g., COMM_COPY_ATTR_FUNCTION) and their predefined callbacks (e.g., MPI_NULL_COPY_FN). A few MPI operations which are functions do not have the return code argument. The return code value for successful completion is MPI_SUCCESS. Other error codes are implementation dependent; see the error codes in Chapter 8 and Annex A.

Constants representing the maximum length of a string are one smaller in Fortran than in C as discussed in Section 17.2.9.

Handles are represented in Fortran as INTEGERs, or as a BIND(C) derived type with the mpi_f08 module; see Section 2.5.1 on page 12. Binary-valued variables are of type LOGICAL.

Array arguments are indexed from one.

The older MPI Fortran bindings (mpif.h and use mpi) are inconsistent with the Fortran standard in several respects. These inconsistencies, such as register optimization problems, have implications for user codes that are discussed in detail in Section 17.1.16.

2.6.3 C Binding Issues

We use the ISO C declaration format. All MPI names have an MPI_ prefix, defined constants are in all capital letters, and defined types and functions have one capital letter after the prefix. Programs must not declare names (identifiers), e.g., for variables, functions, constants, types, or macros, beginning with the prefix MPI_. To support the profiling interface, programs must not declare functions with names beginning with the prefix PMPI_.

The definition of named constants, function prototypes, and type definitions must be supplied in an include file mpi.h.

Almost all C functions return an error code. The successful return code will be MPI_SUCCESS, but failure return codes are implementation dependent.

Type declarations are provided for handles to each category of opaque objects.

Array arguments are indexed from zero.

Logical flags are integers with value 0 meaning "false" and a non-zero value meaning "true."

Choice arguments are pointers of type void *.

Address arguments are of MPI-defined type MPI_Aint. File displacements are of type MPI_Offset. MPI_Aint is defined to be an integer of the size needed to hold any valid address on the target architecture. MPI_Offset is defined to be an integer of the size needed to hold any valid file size on the target architecture.
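A small C sketch, added to this transcript and not part of the standard text, of the C binding conventions just listed (typed handles, int-valued logical flags, zero-based ranks):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int      flag, rank;
    MPI_Comm comm;                 /* typed handle to an opaque object */

    MPI_Init(&argc, &argv);

    /* Logical flags are plain ints: 0 means "false", non-zero "true". */
    MPI_Initialized(&flag);

    comm = MPI_COMM_WORLD;         /* handles may be assigned */
    MPI_Comm_rank(comm, &rank);    /* ranks are numbered from 0 */
    if (flag && rank == 0) printf("process 0 is running\n");

    MPI_Finalize();
    return 0;
}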

2.6.4 Functions and Macros

An implementation is allowed to implement MPI_WTIME, MPI_WTICK, PMPI_WTIME, PMPI_WTICK, and the handle-conversion functions (MPI_Group_f2c, etc.) in Section 17.2.4, and no others, as macros in C.

Advice to implementors. Implementors should document which routines are implemented as macros. (End of advice to implementors.)

Advice to users. If these routines are implemented as macros, they will not work with the MPI profiling interface. (End of advice to users.)

2.7 Processes

An MPI program consists of autonomous processes, executing their own code, in an MIMD style. The codes executed by each process need not be identical. The processes communicate via calls to MPI communication primitives. Typically, each process executes in its own address space, although shared-memory implementations of MPI are possible.

This document specifies the behavior of a parallel program assuming that only MPI calls are used. The interaction of an MPI program with other possible means of communication, I/O, and process management is not specified. Unless otherwise stated in the specification of the standard, MPI places no requirements on the result of its interaction with external mechanisms that provide similar or equivalent functionality. This includes, but is not limited to, interactions with external mechanisms for process control, shared and remote memory access, file system access and control, interprocess communication, process signaling, and terminal I/O. High quality implementations should strive to make the results of such interactions intuitive to users, and attempt to document restrictions where deemed necessary.

Advice to implementors. Implementations that support such additional mechanisms for functionality supported within MPI are expected to document how these interact with MPI. (End of advice to implementors.)

The interaction of MPI and threads is defined in Section 12.4.

2.8 Error Handling

MPI provides the user with reliable message transmission. A message sent is always received correctly, and the user does not need to check for transmission errors, time-outs, or other error conditions. In other words, MPI does not provide mechanisms for dealing with failures in the communication system. If the MPI implementation is built on an unreliable underlying mechanism, then it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures.

Of course, MPI programs may still be erroneous. A program error can occur when an MPI call is made with an incorrect argument (non-existing destination in a send operation, buffer too small in a receive operation, etc.). This type of error would occur in any implementation. In addition, a resource error may occur when a program exceeds the amount of available system resources (number of pending messages, system buffers, etc.). The occurrence of this type of error depends on the amount of available resources in the system and the resource allocation mechanism used; this may differ from system to system. A high-quality implementation will provide generous limits on the important resources so as to alleviate the portability problem this represents.

In C and Fortran, almost all MPI calls return a code that indicates successful completion of the operation. Whenever possible, MPI calls return an error code if an error occurred during the call. By default, an error detected during the execution of the MPI library causes the parallel computation to abort, except for file operations. However, MPI provides mechanisms for users to change this default and to handle recoverable errors. The user may specify that no error is fatal, and handle error codes returned by MPI calls by himself or herself. Also, the user may provide his or her own error-handling routines, which will be invoked whenever an MPI call returns abnormally. The MPI error handling facilities are described in Section 8.3.

Several factors limit the ability of MPI calls to return with meaningful error codes when an error occurs. MPI may not be able to detect some errors; other errors may be too expensive to detect in normal execution mode; finally some errors may be "catastrophic" and may prevent MPI from returning control to the caller in a consistent state.

Another subtle issue arises because of the nature of asynchronous communications: MPI calls may initiate operations that continue asynchronously after the call returned. Thus, the operation may return with a code indicating successful completion, yet later cause an error exception to be raised. If there is a subsequent call that relates to the same operation (e.g., a call that verifies that an asynchronous operation has completed) then the error argument associated with this call will be used to indicate the nature of the error. In a few cases, the error may occur after all calls that relate to the operation have completed, so that no error value can be used to indicate the nature of the error (e.g., an error on the receiver in a send with the ready mode). Such an error must be treated as fatal, since information cannot be returned for the user to recover from it.

This document does not specify the state of a computation after an erroneous MPI call has occurred. The desired behavior is that a relevant error code be returned, and the effect of the error be localized to the greatest possible extent. E.g., it is highly desirable that an erroneous receive call will not cause any part of the receiver's memory to be overwritten, beyond the area specified for receiving the message.

Implementations may go beyond this document in supporting in a meaningful manner MPI calls that are defined here to be erroneous. For example, MPI specifies strict type matching rules between matching send and receive operations: it is erroneous to send a floating point variable and receive an integer. Implementations may go beyond these type matching rules, and provide automatic type conversion in such situations. It will be helpful to generate warnings for such non-conforming behavior.

MPI defines a way for users to create new error codes as defined in Section 8.5.
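The following C sketch, added to this transcript and not part of the standard text, shows the typical way the default abort behavior is changed so that error codes are returned and can be decoded (the error-handler facilities themselves are defined in Section 8.3):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int  err, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Select the predefined handler MPI_ERRORS_RETURN so that calls
       on this communicator return error codes instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    err = MPI_Barrier(MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        MPI_Error_string(err, msg, &len);   /* decode the implementation-defined code */
        printf("MPI error: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}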
2.9 Implementation Issues

There are a number of areas where an MPI implementation may interact with the operating environment and system. While MPI does not mandate that any services (such as signal handling) be provided, it does strongly suggest the behavior to be provided if those services are available. This is an important point in achieving portability across platforms that provide the same set of services.

2.9.1 Independence of Basic Runtime Routines

MPI programs require that library routines that are part of the basic language environment (such as write in Fortran and printf and malloc in ISO C) and are executed after MPI_INIT and before MPI_FINALIZE operate independently, and that their completion is independent of the action of other processes in an MPI program.

Note that this in no way prevents the creation of library routines that provide parallel services whose operation is collective. However, the following program is expected to complete in an ISO C environment regardless of the size of MPI_COMM_WORLD (assuming that printf is available at the executing nodes).

int rank;
MPI_Init((void *)0, (void *)0);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) printf("Starting program\n");
MPI_Finalize();

The corresponding Fortran programs are also expected to complete.

An example of what is not required is any particular ordering of the action of these routines when called by several tasks. For example, MPI makes neither requirements nor recommendations for the output from the following program (again assuming that I/O is available at the executing nodes).

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Output from task rank %d\n", rank);

In addition, calls that fail because of resource exhaustion or other error are not considered a violation of the requirements here (however, they are required to complete, just not to complete successfully).

2.9.2 Interaction with Signals

MPI does not specify the interaction of processes with signals and does not require that MPI be signal safe. The implementation may reserve some signals for its own use. It is required that the implementation document which signals it uses, and it is strongly recommended that it not use SIGALRM, SIGFPE, or SIGIO. Implementations may also prohibit the use of MPI calls from within signal handlers.

In multithreaded environments, users can avoid conflicts between signals and the MPI library by catching signals only on threads that do not execute MPI calls. High quality single-threaded implementations will be signal safe: an MPI call suspended by a signal will resume and complete normally after the signal is handled.

2.10 Examples

The examples in this document are for illustration purposes only. They are not intended to specify the standard. Furthermore, the examples have not been carefully checked or verified.

Chapter 3

Point-to-Point Communication

3.1 Introduction

Sending and receiving of messages by processes is the basic MPI communication mechanism. The basic point-to-point communication operations are send and receive. Their use is illustrated in the example below.

#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main( int argc, char *argv[])
{
    char message[20];
    int myrank;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    if (myrank == 0)        /* code for process zero */
    {
        strcpy(message,"Hello, there");
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else if (myrank == 1)   /* code for process one */
    {
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received :%s:\n", message);
    }
    MPI_Finalize();
    return 0;
}

In this example, process zero (myrank = 0) sends a message to process one using the send operation MPI_SEND. The operation specifies a send buffer in the sender memory from which the message data is taken. In the example above, the send buffer consists of the storage containing the variable message in the memory of process zero. The location, size and type of the send buffer are specified by the first three parameters of the send operation. The message sent will contain the 13 characters of this variable. In addition, the send operation associates an envelope with the message. This envelope specifies the message destination and contains distinguishing information that can be used by the receive operation to select a particular message.

The last three parameters of the send operation, along with the rank of the sender, specify the envelope for the message sent. Process one (myrank = 1) receives this message with the receive operation MPI_RECV. The message to be received is selected according to the value of its envelope, and the message data is stored into the receive buffer. In the example above, the receive buffer consists of the storage containing the string message in the memory of process one. The first three parameters of the receive operation specify the location, size and type of the receive buffer. The next three parameters are used for selecting the incoming message. The last parameter is used to return information on the message just received.

The next sections describe the blocking send and receive operations. We discuss send, receive, blocking communication semantics, type matching requirements, type conversion in heterogeneous environments, and more general communication modes. Nonblocking communication is addressed next, followed by probing and canceling a message, channel-like constructs and send-receive operations, ending with a description of the "dummy" process, MPI_PROC_NULL.

3.2 Blocking Send and Receive Operations

3.2.1 Blocking Send

The syntax of the blocking send operation is given below.

MPI_SEND(buf, count, datatype, dest, tag, comm)
  IN   buf        initial address of send buffer (choice)
  IN   count      number of elements in send buffer (non-negative integer)
  IN   datatype   datatype of each send buffer element (handle)
  IN   dest       rank of destination (integer)
  IN   tag        message tag (integer)
  IN   comm       communicator (handle)

int MPI_Send(const void* buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)

MPI_Send(buf, count, datatype, dest, tag, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

The blocking semantics of this call are described in Section 3.4.

3.2.2 Message Data

The send buffer specified by the MPI_SEND operation consists of count successive entries of the type indicated by datatype, starting with the entry at address buf. Note that we specify the message length in terms of number of elements, not number of bytes. The former is machine independent and closer to the application level.

The data part of the message consists of a sequence of count values, each of the type indicated by datatype. count may be zero, in which case the data part of the message is empty. The basic datatypes that can be specified for message data values correspond to the basic datatypes of the host language. Possible values of this argument for Fortran and the corresponding Fortran types are listed in Table 3.1.

    MPI datatype            Fortran datatype
    MPI_INTEGER             INTEGER
    MPI_REAL                REAL
    MPI_DOUBLE_PRECISION    DOUBLE PRECISION
    MPI_COMPLEX             COMPLEX
    MPI_LOGICAL             LOGICAL
    MPI_CHARACTER           CHARACTER(1)
    MPI_BYTE
    MPI_PACKED

    Table 3.1: Predefined MPI datatypes corresponding to Fortran datatypes

Possible values for this argument for C and the corresponding C types are listed in Table 3.2.

The datatypes MPI_BYTE and MPI_PACKED do not correspond to a Fortran or C datatype. A value of type MPI_BYTE consists of a byte (8 binary digits). A byte is uninterpreted and is different from a character. Different machines may have different representations for characters, or may use more than one byte to represent characters. On the other hand, a byte has the same binary value on all machines. The use of the type MPI_PACKED is explained in Section 4.2.

MPI requires support of these datatypes, which match the basic datatypes of Fortran and ISO C. Additional MPI datatypes should be provided if the host language has additional data types: MPI_DOUBLE_COMPLEX for double precision complex in Fortran declared to be of type DOUBLE COMPLEX; MPI_REAL2, MPI_REAL4, and MPI_REAL8 for Fortran reals, declared to be of type REAL*2, REAL*4, and REAL*8, respectively; MPI_INTEGER1, MPI_INTEGER2, and MPI_INTEGER4 for Fortran integers, declared to be of type INTEGER*1, INTEGER*2, and INTEGER*4, respectively; etc.

Rationale. One goal of the design is to allow for MPI to be implemented as a library, with no need for additional preprocessing or compilation. Thus, one cannot assume that a communication call has information on the datatype of variables in the communication buffer; this information must be supplied by an explicit argument. The need for such datatype information will become clear in Section 3.3.2. (End of rationale.)

    MPI datatype               C datatype
    MPI_CHAR                   char (treated as printable character)
    MPI_SHORT                  signed short int
    MPI_INT                    signed int
    MPI_LONG                   signed long int
    MPI_LONG_LONG_INT          signed long long int
    MPI_LONG_LONG              signed long long int (as a synonym)
    MPI_SIGNED_CHAR            signed char (treated as integral value)
    MPI_UNSIGNED_CHAR          unsigned char (treated as integral value)
    MPI_UNSIGNED_SHORT         unsigned short int
    MPI_UNSIGNED               unsigned int
    MPI_UNSIGNED_LONG          unsigned long int
    MPI_UNSIGNED_LONG_LONG     unsigned long long int
    MPI_FLOAT                  float
    MPI_DOUBLE                 double
    MPI_LONG_DOUBLE            long double
    MPI_WCHAR                  wchar_t (defined in <stddef.h>; treated as printable character)
    MPI_C_BOOL                 _Bool
    MPI_INT8_T                 int8_t
    MPI_INT16_T                int16_t
    MPI_INT32_T                int32_t
    MPI_INT64_T                int64_t
    MPI_UINT8_T                uint8_t
    MPI_UINT16_T               uint16_t
    MPI_UINT32_T               uint32_t
    MPI_UINT64_T               uint64_t
    MPI_C_COMPLEX              float _Complex
    MPI_C_FLOAT_COMPLEX        float _Complex (as a synonym)
    MPI_C_DOUBLE_COMPLEX       double _Complex
    MPI_C_LONG_DOUBLE_COMPLEX  long double _Complex
    MPI_BYTE
    MPI_PACKED

    Table 3.2: Predefined MPI datatypes corresponding to C datatypes

The datatypes MPI_AINT, MPI_OFFSET, and MPI_COUNT correspond to the MPI-defined C types MPI_Aint, MPI_Offset, and MPI_Count and their Fortran equivalents INTEGER(KIND=MPI_ADDRESS_KIND), INTEGER(KIND=MPI_OFFSET_KIND), and INTEGER(KIND=MPI_COUNT_KIND). This is described in Table 3.3. All predefined datatype handles are available in all language bindings. See Sections 17.2.6 and 17.2.10 on page 650 and 658 for information on interlanguage communication with these types.

If there is an accompanying C++ compiler then the datatypes in Table 3.4 are also supported in C and Fortran.

    MPI datatype    C datatype    Fortran datatype
    MPI_AINT        MPI_Aint      INTEGER(KIND=MPI_ADDRESS_KIND)
    MPI_OFFSET      MPI_Offset    INTEGER(KIND=MPI_OFFSET_KIND)
    MPI_COUNT       MPI_Count     INTEGER(KIND=MPI_COUNT_KIND)

    Table 3.3: Predefined MPI datatypes corresponding to both C and Fortran datatypes

    MPI datatype                   C++ datatype
    MPI_CXX_BOOL                   bool
    MPI_CXX_FLOAT_COMPLEX          std::complex<float>
    MPI_CXX_DOUBLE_COMPLEX         std::complex<double>
    MPI_CXX_LONG_DOUBLE_COMPLEX    std::complex<long double>

    Table 3.4: Predefined MPI datatypes corresponding to C++ datatypes

3.2.3 Message Envelope

In addition to the data part, messages carry information that can be used to distinguish messages and selectively receive them. This information consists of a fixed number of fields, which we collectively call the message envelope. These fields are

    source
    destination
    tag
    communicator

The message source is implicitly determined by the identity of the message sender. The other fields are specified by arguments in the send operation.

The message destination is specified by the dest argument.

The integer-valued message tag is specified by the tag argument. This integer can be used by the program to distinguish different types of messages. The range of valid tag values is 0,...,UB, where the value of UB is implementation dependent. It can be found by querying the value of the attribute MPI_TAG_UB, as described in Chapter 8. MPI requires that UB be no less than 32767.

The comm argument specifies the communicator that is used for the send operation. Communicators are explained in Chapter 6; below is a brief summary of their usage.

A communicator specifies the communication context for a communication operation. Each communication context provides a separate "communication universe": messages are always received within the context they were sent, and messages sent in different contexts do not interfere.

The communicator also specifies the set of processes that share this communication context. This process group is ordered and processes are identified by their rank within this group. Thus, the range of valid values for dest is {0, ..., n-1} ∪ {MPI_PROC_NULL}, where n is the number of processes in the group. (If the communicator is an inter-communicator, then destinations are identified by their rank in the remote group. See Chapter 6.)

A predefined communicator MPI_COMM_WORLD is provided by MPI. It allows communication with all processes that are accessible after MPI initialization and processes are identified by their rank in the group of MPI_COMM_WORLD.
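As an illustration added to this transcript (not standard text), the tag upper bound mentioned above can be obtained by querying the MPI_TAG_UB attribute of a communicator; the attribute value is returned as a pointer to an int:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int flag, *tag_ub;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag) printf("largest usable tag: %d\n", *tag_ub);
    MPI_Finalize();
    return 0;
}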

Advice to users. Users that are comfortable with the notion of a flat name space for processes, and a single communication context, as offered by most existing communication libraries, need only use the predefined variable MPI_COMM_WORLD as the comm argument. This will allow communication with all the processes available at initialization time.

Users may define new communicators, as explained in Chapter 6. Communicators provide an important encapsulation mechanism for libraries and modules. They allow modules to have their own disjoint communication universe and their own process numbering scheme. (End of advice to users.)

Advice to implementors. The message envelope would normally be encoded by a fixed-length message header. However, the actual encoding is implementation dependent. Some of the information (e.g., source or destination) may be implicit, and need not be explicitly carried by messages. Also, processes may be identified by relative ranks, or absolute ids, etc. (End of advice to implementors.)

3.2.4 Blocking Receive

The syntax of the blocking receive operation is given below.

MPI_RECV(buf, count, datatype, source, tag, comm, status)
  OUT  buf        initial address of receive buffer (choice)
  IN   count      number of elements in receive buffer (non-negative integer)
  IN   datatype   datatype of each receive buffer element (handle)
  IN   source     rank of source or MPI_ANY_SOURCE (integer)
  IN   tag        message tag or MPI_ANY_TAG (integer)
  IN   comm       communicator (handle)
  OUT  status     status object (Status)

int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)

MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror) BIND(C)
    TYPE(*), DIMENSION(..) :: buf
    INTEGER, INTENT(IN) :: count, source, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE),
    IERROR

The blocking semantics of this call are described in Section 3.4.

The receive buffer consists of the storage containing count consecutive elements of the type specified by datatype, starting at address buf. The length of the received message must be less than or equal to the length of the receive buffer. An overflow error occurs if all incoming data does not fit, without truncation, into the receive buffer.

If a message that is shorter than the receive buffer arrives, then only those locations corresponding to the (shorter) message are modified.

Advice to users. The MPI_PROBE function described in Section 3.8 can be used to receive messages of unknown length. (End of advice to users.)

Advice to implementors. Even though no specific behavior is mandated by MPI for erroneous programs, the recommended handling of overflow situations is to return in status information about the source and tag of the incoming message. The receive operation will return an error code. A quality implementation will also ensure that no memory that is outside the receive buffer will ever be overwritten.

In the case of a message shorter than the receive buffer, MPI is quite strict in that it allows no modification of the other locations. A more lenient statement would allow for some optimizations but this is not allowed. The implementation must be ready to end a copy into the receiver memory exactly at the end of the receive buffer, even if it is an odd address. (End of advice to implementors.)

The selection of a message by a receive operation is governed by the value of the message envelope. A message can be received by a receive operation if its envelope matches the source, tag and comm values specified by the receive operation. The receiver may specify a wildcard MPI_ANY_SOURCE value for source, and/or a wildcard MPI_ANY_TAG value for tag, indicating that any source and/or tag are acceptable. It cannot specify a wildcard value for comm. Thus, a message can be received by a receive operation only if it is addressed to the receiving process, has a matching communicator, has matching source unless source=MPI_ANY_SOURCE in the pattern, and has a matching tag unless tag=MPI_ANY_TAG in the pattern.

The message tag is specified by the tag argument of the receive operation. The argument source, if different from MPI_ANY_SOURCE, is specified as a rank within the process group associated with that same communicator (remote process group, for intercommunicators). Thus, the range of valid values for the source argument is {0,...,n-1} ∪ {MPI_ANY_SOURCE} ∪ {MPI_PROC_NULL}, where n is the number of processes in this group.

Note the asymmetry between send and receive operations: A receive operation may accept messages from an arbitrary sender, on the other hand, a send operation must specify a unique receiver. This matches a "push" communication mechanism, where data transfer is effected by the sender (rather than a "pull" mechanism, where data transfer is effected by the receiver).

Source = destination is allowed, that is, a process can send a message to itself. (However, it is unsafe to do so with the blocking send and receive operations described above, since this may lead to deadlock. See Section 3.5.)
Advice to implementors. Message context and other communicator information can be implemented as an additional tag field. It differs from the regular message tag in that wild card matching is not allowed on this field, and that value setting for this field is controlled by communicator manipulation functions. (End of advice to implementors.)

The use of dest or source=MPI_PROC_NULL to define a "dummy" destination or source in any send or receive call is described in Section 3.11 on page 81.

3.2.5 Return Status

The source or tag of a received message may not be known if wildcard values were used in the receive operation. Also, if multiple requests are completed by a single MPI function (see Section 3.7.5), a distinct error code may need to be returned for each request. The information is returned by the status argument of MPI_RECV. The type of status is MPI-defined. Status variables need to be explicitly allocated by the user, that is, they are not system objects.

In C, status is a structure that contains three fields named MPI_SOURCE, MPI_TAG, and MPI_ERROR; the structure may contain additional fields. Thus, status.MPI_SOURCE, status.MPI_TAG and status.MPI_ERROR contain the source, tag, and error code, respectively, of the received message.

In Fortran with USE mpi or INCLUDE 'mpif.h', status is an array of INTEGERs of size MPI_STATUS_SIZE. The constants MPI_SOURCE, MPI_TAG and MPI_ERROR are the indices of the entries that store the source, tag and error fields. Thus, status(MPI_SOURCE), status(MPI_TAG) and status(MPI_ERROR) contain, respectively, the source, tag and error code of the received message.

With Fortran USE mpi_f08, status is defined as the Fortran BIND(C) derived type TYPE(MPI_Status) containing three public fields named MPI_SOURCE, MPI_TAG, and MPI_ERROR. TYPE(MPI_Status) may contain additional, implementation-specific fields. Thus, status%MPI_SOURCE, status%MPI_TAG and status%MPI_ERROR contain the source, tag, and error code of a received message respectively. Additionally, within both the mpi and the mpi_f08 modules, the constants MPI_STATUS_SIZE, MPI_SOURCE, MPI_TAG, and MPI_ERROR are defined to allow conversion between both status representations. Conversion routines are provided in Section 17.2.5 on page 648.

Rationale. The Fortran TYPE(MPI_Status) is defined as a BIND(C) derived type so that it can be used at any location where the status integer array representation can be used, e.g., in user defined common blocks. (End of rationale.)

Rationale. It is allowed to have the same name (e.g., MPI_SOURCE) defined as a constant (e.g., Fortran parameter) and as a field of a derived type. (End of rationale.)

In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 which return multiple statuses. The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

Rationale. The error field in status is not needed for calls that return only one status, such as MPI_WAIT, since that would only duplicate the information returned by the function itself. The current design avoids the additional overhead of setting it, in such cases. The field is needed for calls that return multiple statuses, since each request may have had a different failure. (End of rationale.)
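A short C sketch, added to this transcript and not part of the standard text, of a wildcard receive that then reads the status fields described above to discover the actual source and tag:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int        rank, size, p, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Accept one message from every other process, in whatever
           order they arrive, and inspect the envelope afterwards. */
        for (p = 1; p < size; p++) {
            MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d, tag %d\n",
                   data, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        data = rank;
        MPI_Send(&data, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}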

The status argument also returns information on the length of the message received. However, this information is not directly available as a field of the status variable and a call to MPI_GET_COUNT is required to "decode" this information.

MPI_GET_COUNT(status, datatype, count)
  IN   status     return status of receive operation (Status)
  IN   datatype   datatype of each receive buffer entry (handle)
  OUT  count      number of received entries (integer)

int MPI_Get_count(const MPI_Status *status, MPI_Datatype datatype,
                  int *count)

MPI_Get_count(status, datatype, count, ierror) BIND(C)
    TYPE(MPI_Status), INTENT(IN) :: status
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(OUT) :: count
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

Returns the number of entries received. (Again, we count entries, each of type datatype, not bytes.) The datatype argument should match the argument provided by the receive call that set the status variable. If the number of entries received exceeds the limits of the count parameter, then MPI_GET_COUNT sets the value of count to MPI_UNDEFINED. There are other situations where the value of count can be set to MPI_UNDEFINED; see Section 4.1.11.

Rationale. Some message-passing libraries use INOUT count, tag and source arguments, thus using them both to specify the selection criteria for incoming messages and return the actual envelope values of the received message. The use of a separate status argument prevents errors that are often attached with INOUT arguments (e.g., using the MPI_ANY_TAG constant as the tag in a receive). Some libraries use calls that refer implicitly to the "last message received." This is not thread safe.

The datatype argument is passed to MPI_GET_COUNT so as to improve performance. A message might be received without counting the number of elements it contains, and the count value is often not needed. Also, this allows the same function to be used after a call to MPI_PROBE or MPI_IPROBE. With a status from MPI_PROBE or MPI_IPROBE, the same datatypes are allowed as in a call to MPI_RECV to receive this message. (End of rationale.)

The value returned as the count argument of MPI_GET_COUNT for a datatype of length zero where zero bytes have been transferred is zero. If the number of bytes transferred is greater than zero, MPI_UNDEFINED is returned.

Rationale. Zero-length datatypes may be created in a number of cases. An important case is MPI_TYPE_CREATE_DARRAY, where the definition of the particular darray results in an empty block on some process. MPI programs written in an SPMD style will not check for this special case and may want to use MPI_GET_COUNT to check the status. (End of rationale.)
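The following C sketch, added to this transcript and not part of the standard text, shows the common "decode the length" pattern combining MPI_PROBE (described in Section 3.8) with MPI_GET_COUNT; the helper name recv_unknown_length is illustrative only:

#include <stdlib.h>
#include "mpi.h"

/* Receive a message of initially unknown length from "src":
   probe first, decode the element count from the status, then
   allocate a buffer of exactly that size and receive. */
int *recv_unknown_length(int src, int tag, MPI_Comm comm, int *count)
{
    MPI_Status status;
    int       *buf;

    MPI_Probe(src, tag, comm, &status);       /* see Section 3.8 */
    MPI_Get_count(&status, MPI_INT, count);   /* number of MPI_INT entries */

    buf = (int *)malloc((*count) * sizeof(int));
    MPI_Recv(buf, *count, MPI_INT, src, tag, comm, &status);
    return buf;
}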

Advice to users. The buffer size required for the receive can be affected by data conversions and by the stride of the receive datatype. In most cases, the safest approach is to use the same datatype with MPI_GET_COUNT and the receive. (End of advice to users.)

All send and receive operations use the buf, count, datatype, source, dest, tag, comm, and status arguments in the same way as the blocking MPI_SEND and MPI_RECV operations described in this section.

3.2.6 Passing MPI_STATUS_IGNORE for Status

Every call to MPI_RECV includes a status argument, wherein the system can return details about the message received. There are also a number of other MPI calls where status is returned. An object of type MPI_Status is not an MPI opaque object; its structure is declared in mpi.h and mpif.h, and it exists in the user's program. In many cases, application programs are constructed so that it is unnecessary for them to examine the status fields. In these cases, it is a waste for the user to allocate a status object, and it is particularly wasteful for the MPI implementation to fill in fields in this object.

To cope with this problem, there are two predefined constants, MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE, which when passed to a receive, probe, wait, or test function, inform the implementation that the status fields are not to be filled in. Note that MPI_STATUS_IGNORE is not a special type of MPI_Status object; rather, it is a special value for the argument. In C one would expect it to be NULL, not the address of a special MPI_Status.

MPI_STATUS_IGNORE, and the array version MPI_STATUSES_IGNORE, can be used everywhere a status argument is passed to a receive, wait, or test function. MPI_STATUS_IGNORE cannot be used when status is an IN argument. Note that in Fortran MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE are objects like MPI_BOTTOM (not usable for initialization or assignment). See Section 2.5.4.

In general, this optimization can apply to all functions for which status or an array of statuses is an OUT argument. Note that this converts status into an INOUT argument. The functions that can be passed MPI_STATUS_IGNORE are all the various forms of MPI_RECV, MPI_WAIT, MPI_TEST, and MPI_PROBE, as well as MPI_REQUEST_GET_STATUS. When an array is passed, as in the MPI_{TEST|WAIT}{ALL|SOME} functions, a separate constant, MPI_STATUSES_IGNORE, is passed for the array argument. It is possible for an MPI function to return MPI_ERR_IN_STATUS even when MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE has been passed to that function.

MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE are not required to have the same values in C and Fortran.

It is not allowed to have some of the statuses in an array of statuses for MPI_{TEST|WAIT}{ALL|SOME} functions set to MPI_STATUS_IGNORE; one either specifies ignoring all of the statuses in such a call with MPI_STATUSES_IGNORE, or none of them by passing normal statuses in all positions in the array of statuses.
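A minimal C sketch, added to this transcript and not part of the standard text, of passing MPI_STATUS_IGNORE when the status fields would not be examined (the helper name recv_fixed is illustrative only):

#include "mpi.h"

/* If the source, tag, and length of the message are already known,
   nothing in the status would be examined, so the receive can tell
   the implementation not to fill it in at all. */
void recv_fixed(double *buf, int n, int src, int tag, MPI_Comm comm)
{
    MPI_Recv(buf, n, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
}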

3.3 Data Type Matching and Data Conversion

3.3.1 Type Matching Rules

One can think of message transfer as consisting of the following three phases.

1. Data is pulled out of the send buffer and a message is assembled.

2. A message is transferred from sender to receiver.

3. Data is pulled from the incoming message and disassembled into the receive buffer.

Type matching has to be observed at each of these three phases: The type of each variable in the sender buffer has to match the type specified for that entry by the send operation; the type specified by the send operation has to match the type specified by the receive operation; and the type of each variable in the receive buffer has to match the type specified for that entry by the receive operation. A program that fails to observe these three rules is erroneous.

To define type matching more precisely, we need to deal with two issues: matching of types of the host language with types specified in communication operations; and matching of types at sender and receiver.

The types of a send and receive match (phase two) if both operations use identical names. That is, MPI_INTEGER matches MPI_INTEGER, MPI_REAL matches MPI_REAL, and so on. There is one exception to this rule, discussed in Section 4.2: the type MPI_PACKED can match any other type.

The type of a variable in a host program matches the type specified in the communication operation if the datatype name used by that operation corresponds to the basic type of the host program variable. For example, an entry with type name MPI_INTEGER matches a Fortran variable of type INTEGER. A table giving this correspondence for Fortran and C appears in Section 3.2.2. There are two exceptions to this last rule: an entry with type name MPI_BYTE or MPI_PACKED can be used to match any byte of storage (on a byte-addressable machine), irrespective of the datatype of the variable that contains this byte. The type MPI_PACKED is used to send data that has been explicitly packed, or receive data that will be explicitly unpacked, see Section 4.2. The type MPI_BYTE allows one to transfer the binary value of a byte in memory unchanged.

To summarize, the type matching rules fall into the three categories below.

• Communication of typed values (e.g., with datatype different from MPI_BYTE), where the datatypes of the corresponding entries in the sender program, in the send call, in the receive call and in the receiver program must all match.

• Communication of untyped values (e.g., of datatype MPI_BYTE), where both sender and receiver use the datatype MPI_BYTE. In this case, there are no requirements on the types of the corresponding entries in the sender and the receiver programs, nor is it required that they be the same.

• Communication involving packed data, where MPI_PACKED is used.

The following examples illustrate the first two cases.

Example 3.1 Sender and receiver specify matching types.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(b(1), 15, MPI_REAL, 0, tag, comm, status, ierr)
END IF

This code is correct if both a and b are real arrays of size ≥ 10. (In Fortran, it might be correct to use this code even if a or b have size < 10: e.g., when a(1) can be equivalenced to an array with ten reals.)

Example 3.2 Sender and receiver do not specify matching types.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 10, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(b(1), 40, MPI_BYTE, 0, tag, comm, status, ierr)
END IF

This code is erroneous, since sender and receiver do not provide matching datatype arguments.

Example 3.3 Sender and receiver specify communication of untyped values.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a(1), 40, MPI_BYTE, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(b(1), 60, MPI_BYTE, 0, tag, comm, status, ierr)
END IF

This code is correct, irrespective of the type and size of a and b (unless this results in an out of bounds memory access).

Advice to users. If a buffer of type MPI_BYTE is passed as an argument to MPI_SEND, then MPI will send the data stored at contiguous locations, starting from the address indicated by the buf argument. This may have unexpected results when the data layout is not as a casual user would expect it to be. For example, some Fortran compilers implement variables of type CHARACTER as a structure that contains the character length and a pointer to the actual string. In such an environment, sending and receiving a Fortran CHARACTER variable using the MPI_BYTE type will not have the anticipated result of transferring the character string. For this reason, the user is advised to use typed communications whenever possible. (End of advice to users.)

Type MPI_CHARACTER

The type MPI_CHARACTER matches one character of a Fortran variable of type CHARACTER, rather than the entire character string stored in the variable. Fortran variables of type CHARACTER or substrings are transferred as if they were arrays of characters. This is illustrated in the example below.

Example 3.4 Transfer of Fortran CHARACTERs.

CHARACTER*10 a
CHARACTER*10 b

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a, 5, MPI_CHARACTER, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(b(6:10), 5, MPI_CHARACTER, 0, tag, comm, status, ierr)
END IF

The last five characters of string b at process 1 are replaced by the first five characters of string a at process 0.

Rationale. The alternative choice would be for MPI_CHARACTER to match a character of arbitrary length. This runs into problems.

A Fortran character variable is a constant length string, with no special termination symbol. There is no fixed convention on how to represent characters, and how to store their length. Some compilers pass a character argument to a routine as a pair of arguments, one holding the address of the string and the other holding the length of the string. Consider the case of an MPI communication call that is passed a communication buffer with type defined by a derived datatype (Section 4.1). If this communication buffer contains variables of type CHARACTER then the information on their length will not be passed to the MPI routine.

This problem forces us to provide explicit information on character length with the MPI call. One could add a length parameter to the type MPI_CHARACTER, but this does not add much convenience and the same functionality can be achieved by defining a suitable derived datatype. (End of rationale.)

Advice to implementors. Some compilers pass Fortran CHARACTER arguments as a structure with a length and a pointer to the actual string. In such an environment, the MPI call needs to dereference the pointer in order to reach the string. (End of advice to implementors.)

3.3.2 Data Conversion

One of the goals of MPI is to support parallel computations across heterogeneous environments. Communication in a heterogeneous environment may require data conversions. We use the following terminology.

type conversion changes the datatype of a value, e.g., by rounding a REAL to an INTEGER.

representation conversion changes the binary representation of a value, e.g., from Hex floating point to IEEE floating point.

The type matching rules imply that MPI communication never entails type conversion.

On the other hand, MPI requires that a representation conversion be performed when a typed value is transferred across environments that use different representations for the datatype of this value. MPI does not specify rules for representation conversion. Such conversion is expected to preserve integer, logical and character values, and to convert a floating point value to the nearest value that can be represented on the target system.

Overflow and underflow exceptions may occur during floating point conversions. Conversion of integers or characters may also lead to exceptions when a value that can be represented in one system cannot be represented in the other system. An exception occurring during representation conversion results in a failure of the communication. An error occurs either in the send operation, or the receive operation, or both.

If a value sent in a message is untyped (i.e., of type MPI_BYTE), then the binary representation of the byte stored at the receiver is identical to the binary representation of the byte loaded at the sender. This holds true, whether sender and receiver run in the same or in distinct environments. No representation conversion is required. (Note that representation conversion may occur when values of type MPI_CHARACTER or MPI_CHAR are transferred, for example, from an EBCDIC encoding to an ASCII encoding.)

No conversion need occur when an MPI program executes in a homogeneous system, where all processes run in the same environment.

Consider the three examples, 3.1-3.3. The first program is correct, assuming that a and b are REAL arrays of size ≥ 10. If the sender and receiver execute in different environments, then the ten real values that are fetched from the send buffer will be converted to the representation for reals on the receiver site before they are stored in the receive buffer. While the number of real elements fetched from the send buffer equal the number of real elements stored in the receive buffer, the number of bytes stored need not equal the number of bytes loaded. For example, the sender may use a four byte representation and the receiver an eight byte representation for reals.

The second program is erroneous, and its behavior is undefined.

The third program is correct. The exact same sequence of forty bytes that were loaded from the send buffer will be stored in the receive buffer, even if sender and receiver run in a different environment. The message sent has exactly the same length (in bytes) and the same binary representation as the message received. If a and b are of different types, or if they are of the same type but different data representations are used, then the bits stored in the receive buffer may encode values that are different from the values they encoded in the send buffer.

Data representation conversion also applies to the envelope of a message: source, destination and tag are all integers that may need to be converted.

Advice to implementors. The current definition does not require messages to carry data type information. Both sender and receiver provide complete data type information. In a heterogeneous environment, one can either use a machine independent encoding such as XDR, or have the receiver convert from the sender representation to its own, or even have the sender do the conversion.
42 Additional type information might be added to messages in order to allow the sys- 43 tem to detect mismatches between datatype at sender and receiver. This might be 44 End of advice to implementors. ) particularly useful in a slower but safer debug mode. ( 45 46 MPI requires support for inter-language communication, i.e., if messages are sent by a 47 C or C++ process and received by a Fortran process, or vice-versa. The behavior is defined 48 in Section 17.2 on page 645.
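The distinction between typed and untyped transfers can be pictured with a small C fragment. This is an illustrative sketch only (not one of the standard's numbered examples); the communicator comm, the tags, and the variable names are assumptions, and MPI is assumed to have been initialized. The MPI_INT message may undergo representation conversion between heterogeneous hosts, while the MPI_BYTE message is delivered with an identical binary image.

int  rank, tag1 = 0, tag2 = 1;
int  ival = 42;          /* typed data: representation may be converted  */
char raw[8] = {0};       /* untyped data: delivered bit-for-bit          */
MPI_Status status;

MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Send(&ival, 1, MPI_INT,  1, tag1, comm);
    MPI_Send(raw,   8, MPI_BYTE, 1, tag2, comm);
} else if (rank == 1) {
    MPI_Recv(&ival, 1, MPI_INT,  0, tag1, comm, &status);  /* value preserved */
    MPI_Recv(raw,   8, MPI_BYTE, 0, tag2, comm, &status);  /* bits preserved  */
}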

3.4 Communication Modes

The send call described in Section 3.2.1 is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to modify the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.

Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.

The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.

Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.

Rationale. The reluctance of MPI to mandate whether standard sends are buffering or not stems from the desire to achieve portable programs. Since any system will run out of buffer resources as message sizes are increased, and some implementations may want to provide little buffering, MPI takes the position that correct (and therefore, portable) programs do not rely on system buffering in standard mode. Buffering may improve the performance of a correct program, but it doesn't affect the result of the program. If the user wishes to guarantee a certain amount of buffering, the user-provided buffer system of Section 3.6 should be used, along with the buffered-mode send. (End of rationale.)

There are three additional communication modes.

A buffered mode send operation can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. However, unlike the standard send, this operation is local, and its completion does not depend on the occurrence of a matching receive. Thus, if a send is executed and no matching receive is posted, then MPI must buffer the outgoing message, so as to allow the send call to complete. An error will occur if there is insufficient buffer space. The amount of available buffer space is controlled by the user — see Section 3.6. Buffer allocation by the user may be required for the buffered mode to be effective.

A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but it also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive.

If both sends and receives are blocking operations then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. A send executed in this mode is non-local.

A send that uses the ready communication mode may be started only if the matching receive is already posted. Otherwise, the operation is erroneous and its outcome is undefined. On some systems, this allows the removal of a hand-shake operation that is otherwise required and results in improved performance. The completion of the send operation does not depend on the status of a matching receive, and merely indicates that the send buffer can be reused. A send operation that uses the ready mode has the same semantics as a standard send operation, or a synchronous send operation; it is merely that the sender provides additional information to the system (namely that a matching receive is already posted), that can save some overhead. In a correct program, therefore, a ready send could be replaced by a standard send with no effect on the behavior of the program other than performance.

Three additional send functions are provided for the three additional communication modes. The communication mode is indicated by a one letter prefix: B for buffered, S for synchronous, and R for ready.

MPI_BSEND(buf, count, datatype, dest, tag, comm)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)

int MPI_Bsend(const void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_Bsend(buf, count, datatype, dest, tag, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in buffered mode.
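As a usage illustration (a sketch, not part of the standard's numbered examples), the following C fragment posts a buffered-mode send into user-supplied buffer space. The names dest, tag, and comm are assumptions; MPI_Buffer_attach, MPI_Buffer_detach, and the constant MPI_BSEND_OVERHEAD are described in Section 3.6, which also gives the proper way to bound the required buffer size with MPI_PACK_SIZE.

int     bufsize;
char   *bsendbuf;
double  payload[1000];

/* Rough sizing for a single message; Section 3.6 gives the exact bound. */
bufsize  = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD;
bsendbuf = malloc(bufsize);                 /* malloc from <stdlib.h> */

MPI_Buffer_attach(bsendbuf, bufsize);       /* see Section 3.6 */
MPI_Bsend(payload, 1000, MPI_DOUBLE, dest, tag, comm);
/* The call completes locally even if no matching receive is posted;
   the message is copied into the attached buffer if necessary.      */
MPI_Buffer_detach(&bsendbuf, &bufsize);     /* blocks until the message is sent */
free(bsendbuf);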

MPI_SSEND(buf, count, datatype, dest, tag, comm)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)

int MPI_Ssend(const void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_Ssend(buf, count, datatype, dest, tag, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in synchronous mode.

MPI_RSEND(buf, count, datatype, dest, tag, comm)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)

int MPI_Rsend(const void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm)

MPI_Rsend(buf, count, datatype, dest, tag, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

Send in ready mode.

There is only one receive operation, but it matches any of the send modes. The receive operation described in the last section is blocking: it returns only after the receive buffer contains the newly received message. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).

In a multithreaded implementation of MPI, the system may de-schedule a thread that is blocked on a send or receive operation, and schedule another thread for execution in the same address space. In such a case it is the user's responsibility not to modify a communication buffer until the communication completes. Otherwise, the outcome of the computation is undefined.

Advice to implementors. Since a synchronous send cannot complete before a matching receive is posted, one will not normally buffer messages sent by such an operation.

It is recommended to choose buffering over blocking the sender, whenever possible, for standard sends. The programmer can signal his or her preference for blocking the sender until a matching receive occurs by using the synchronous send mode.

A possible communication protocol for the various communication modes is outlined below.

ready send: The message is sent as soon as possible.

synchronous send: The sender sends a request-to-send message. The receiver stores this request. When a matching receive is posted, the receiver sends back a permission-to-send message, and the sender now sends the message.

standard send: First protocol may be used for short messages, and second protocol for long messages.

buffered send: The sender copies the message into a buffer and then sends it with a nonblocking send (using the same protocol as for standard send).

Additional control messages might be needed for flow control and error recovery. Of course, there are many other possible protocols.

Ready send can be implemented as a standard send. In this case there will be no performance advantage (or disadvantage) for the use of ready send.

A standard send can be implemented as a synchronous send. In such a case, no data buffering is needed. However, users may expect some buffering.

In a multithreaded environment, the execution of a blocking communication should block only the executing thread, allowing the thread scheduler to de-schedule this thread and schedule another thread for execution. (End of advice to implementors.)

3.5 Semantics of Point-to-Point Communication

A valid MPI implementation guarantees certain general properties of point-to-point communication, which are described in this section.

Order Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)

If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

Example 3.5 An example of non-overtaking messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_BSEND(buf2, count, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF

The message sent by the first send must be received by the first receive, and the message sent by the second send must be received by the second receive.

Progress If a pair of matching send and receives have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.

Example 3.6 An example of two, intertwined matching pairs.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_SSEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, tag2, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag1, comm, status, ierr)
END IF

Both processes invoke their first communication call. Since the first send of process zero uses the buffered mode, it must complete, irrespective of the state of process one. Since no matching receive is posted, the message will be copied into buffer space. (If insufficient buffer space is available, then the program will fail.) The second send is then invoked. At that point, a matching pair of send and receive operations is enabled, and both operations must complete. Process one next invokes its second receive call, which will be satisfied by the buffered message. Note that process one received the messages in the reverse order they were sent.

Fairness MPI makes no guarantee of fairness in the handling of communication. Suppose that a send is posted. Then it is possible that the destination process repeatedly posts a receive that matches this send, yet the message is never received, because it is each time overtaken by another message, sent from another source. Similarly, suppose that a receive was posted by a multithreaded process. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer's responsibility to prevent starvation in such situations.

Resource limitations Any pending communication operation consumes system resources that are limited. Errors may occur when lack of resources prevent the execution of an MPI call. A quality implementation will use a (small) fixed amount of resources for each pending send in the ready or synchronous mode and for each pending receive. However, buffer space may be consumed to store messages sent in standard mode, and must be consumed to store messages sent in buffered mode, when no matching receive is available. The amount of space available for buffering will be much smaller than program data memory on many systems. Then, it will be easy to write programs that overrun available buffer space.

MPI allows the user to provide buffer memory for messages sent in the buffered mode. Furthermore, MPI specifies a detailed operational model for the use of this buffer. An MPI implementation is required to do no worse than implied by this model. This allows users to avoid buffer overflows when they use buffered sends. Buffer allocation and use is described in Section 3.6.

A buffered send operation that cannot complete because of a lack of buffer space is erroneous. When such a situation is detected, an error is signaled that may cause the program to terminate abnormally. On the other hand, a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.
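The throttling behavior can be pictured with a small C sketch (illustrative only; the ranks, loop bound, tag, communicator, and the helpers produce_next and consume are assumptions, not part of the standard): when internal buffer space is exhausted and no receive is posted, the producer's standard-mode send simply blocks, so the producer cannot run arbitrarily far ahead of the consumer.

double     value;
int        i;
MPI_Status status;

if (rank == producer_rank) {
    for (i = 0; i < nvalues; i++) {
        value = produce_next();              /* hypothetical helper */
        MPI_Send(&value, 1, MPI_DOUBLE, consumer_rank, tag, comm);
        /* blocks here when buffer space runs out and no receive is posted */
    }
} else if (rank == consumer_rank) {
    for (i = 0; i < nvalues; i++) {
        MPI_Recv(&value, 1, MPI_DOUBLE, producer_rank, tag, comm, &status);
        consume(value);                      /* hypothetical helper */
    }
}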
In some situations, a lack of buffer space leads to deadlock situations. This is illustrated by the examples below.

Example 3.7 An exchange of messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF

This program will succeed even if no buffer space for data is available. The standard send operation can be replaced, in this example, with a synchronous send.

Example 3.8 An errant attempt to exchange messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF

The receive operation of the first process must complete before its send, and can complete only if the matching send of the second process is executed. The receive operation of the second process must complete before its send and can complete only if the matching send of the first process is executed. This program will always deadlock. The same holds for any other send mode.

Example 3.9 An exchange that relies on buffering.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF

The message sent by each process has to be copied out before the send operation returns and the receive operation starts. For the program to complete, it is necessary that at least one of the two messages sent be buffered. Thus, this program can succeed only if the communication system can buffer at least count words of data.

Advice to users. When standard send operations are used, then a deadlock situation may occur where both processes are blocked because buffer space is not available. The same will certainly happen, if the synchronous mode is used. If the buffered mode is used, and not enough buffer space is available, then the program will not complete either. However, rather than a deadlock situation, we shall have a buffer overflow error.

A program is "safe" if no message buffering is required for the program to complete. One can replace all sends in such program with synchronous sends, and the program will still run correctly. This conservative programming style provides the best portability, since program completion does not depend on the amount of buffer space available or on the communication protocol used.

Many programmers prefer to have more leeway and opt to use the "unsafe" programming style shown in Example 3.9. In such cases, the use of standard sends is likely to provide the best compromise between performance and robustness: quality implementations will provide sufficient buffering so that "common practice" programs will not deadlock. The buffered send mode can be used for programs that require more buffering, or in situations where the programmer wants more control. This mode might also be used for debugging purposes, as buffer overflow conditions are easier to diagnose than deadlock conditions.

Nonblocking message-passing operations, as described in Section 3.7, can be used to avoid the need for buffering outgoing messages. This prevents deadlocks due to lack of buffer space, and improves performance, by allowing overlap of computation and communication, and avoiding the overheads of allocating buffers and copying messages into buffers. (End of advice to users.)

3.6 Buffer Allocation and Usage

A user may specify a buffer to be used for buffering messages sent in buffered mode. Buffering is done by the sender.

MPI_BUFFER_ATTACH(buffer, size)

IN      buffer   initial buffer address (choice)
IN      size     buffer size, in bytes (non-negative integer)

int MPI_Buffer_attach(void* buffer, int size)

MPI_Buffer_attach(buffer, size, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: buffer
    INTEGER, INTENT(IN) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BUFFER_ATTACH(BUFFER, SIZE, IERROR)
    BUFFER(*)
    INTEGER SIZE, IERROR

Provides to MPI a buffer in the user's memory to be used for buffering outgoing messages. The buffer is used only by messages sent in buffered mode. Only one buffer can be attached to a process at a time. In C, buffer is the starting address of a memory region. In Fortran, one can pass the first element of a memory region or a whole array, which must be 'simply contiguous' (for 'simply contiguous,' see also Section 17.1.12 on page 626).

MPI_BUFFER_DETACH(buffer_addr, size)

OUT     buffer_addr   initial buffer address (choice)
OUT     size          buffer size, in bytes (non-negative integer)

int MPI_Buffer_detach(void* buffer_addr, int* size)

MPI_Buffer_detach(buffer_addr, size, ierror) BIND(C)
    USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
    TYPE(C_PTR), INTENT(OUT) :: buffer_addr
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BUFFER_DETACH(BUFFER_ADDR, SIZE, IERROR)
    BUFFER_ADDR(*)
    INTEGER SIZE, IERROR

Detach the buffer currently associated with MPI. The call returns the address and the size of the detached buffer. This operation will block until all messages currently in the buffer have been transmitted. Upon return of this function, the user may reuse or deallocate the space taken by the buffer.

Example 3.10 Calls to attach and detach buffers.

#define BUFFSIZE 10000
int size;
char *buff;
MPI_Buffer_attach( malloc(BUFFSIZE), BUFFSIZE);
/* a buffer of 10000 bytes can now be used by MPI_Bsend */
MPI_Buffer_detach( &buff, &size);
/* Buffer size reduced to zero */
MPI_Buffer_attach( buff, size);
/* Buffer of 10000 bytes available again */

Advice to users. Even though the C functions MPI_Buffer_attach and MPI_Buffer_detach both have a first argument of type void*, these arguments are used differently: A pointer to the buffer is passed to MPI_Buffer_attach; the address of the pointer is passed to MPI_Buffer_detach, so that this call can return the pointer value. In Fortran with the mpi module or mpif.h, the type of the buffer_addr argument is wrongly defined and the argument is therefore unused. In Fortran with the mpi_f08 module, the address of the buffer is returned as TYPE(C_PTR), see also Example 8.1 on page 341 about the use of C_PTR pointers. (End of advice to users.)

Rationale. Both arguments are defined to be of type void* (rather than void* and void**, respectively), so as to avoid complex type casts. E.g., in the last example, &buff, which is of type char**, can be passed as argument to MPI_Buffer_detach without type casting. If the formal parameter had type void** then we would need a type cast before and after the call. (End of rationale.)

The statements made in this section describe the behavior of MPI for buffered-mode sends. When no buffer is currently associated, MPI behaves as if a zero-sized buffer is associated with the process.
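A common way to size the attached buffer, sketched below in C (illustrative; the names data, count, nmsgs, dest, tag, and comm are assumptions), is to query MPI_PACK_SIZE for the data of each buffered message that may be pending at the same time and to add MPI_BSEND_OVERHEAD per message, following the model implementation described in Section 3.6.1.

int   packsize, bufsize, i;
char *buf;

/* Upper bound for one buffered message of 'count' MPI_INTs,
   plus the per-message entry overhead.                        */
MPI_Pack_size(count, MPI_INT, comm, &packsize);
bufsize = nmsgs * (packsize + MPI_BSEND_OVERHEAD);

buf = malloc(bufsize);                 /* malloc from <stdlib.h> */
MPI_Buffer_attach(buf, bufsize);

for (i = 0; i < nmsgs; i++)
    MPI_Bsend(&data[i * count], count, MPI_INT, dest, tag, comm);

MPI_Buffer_detach(&buf, &bufsize);     /* waits for the buffered sends */
free(buf);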

MPI must provide as much buffering for outgoing messages as if outgoing message data were buffered by the sending process, in the specified buffer space, using a circular, contiguous-space allocation policy. We outline below a model implementation that defines this policy. MPI may provide more buffering, and may use a better buffer allocation algorithm than described below. On the other hand, MPI may signal an error whenever the simple buffering allocator described below would run out of space. In particular, if no buffer is explicitly associated with the process, then any buffered send may cause an error.

MPI does not provide mechanisms for querying or controlling buffering done by standard mode sends. It is expected that vendors will provide such information for their implementations.

Rationale. There is a wide spectrum of possible implementations of buffered communication: buffering can be done at sender, at receiver, or both; buffers can be dedicated to one sender-receiver pair, or be shared by all communications; buffering can be done in real or in virtual memory; it can use dedicated memory, or memory shared by other processes; buffer space may be allocated statically or be changed dynamically; etc. It does not seem feasible to provide a portable mechanism for querying or controlling buffering that would be compatible with all these choices, yet provide meaningful information. (End of rationale.)

3.6.1 Model Implementation of Buffered Mode

The model implementation uses the packing and unpacking functions described in Section 4.2 and the nonblocking communication functions described in Section 3.7.

We assume that a circular queue of pending message entries (PME) is maintained. Each entry contains a communication request handle that identifies a pending nonblocking send, a pointer to the next entry and the packed message data. The entries are stored in successive locations in the buffer. Free space is available between the queue tail and the queue head.

A buffered send call results in the execution of the following code.

• Traverse sequentially the PME queue from head towards the tail, deleting all entries for communications that have completed, up to the first entry with an uncompleted request; update queue head to point to that entry.

• Compute the number, n, of bytes needed to store an entry for the new message. An upper bound on n can be computed as follows: A call to the function MPI_PACK_SIZE(count, datatype, comm, size), with the count, datatype and comm arguments used in the MPI_BSEND call, returns an upper bound on the amount of space needed to buffer the message data (see Section 4.2). The MPI constant MPI_BSEND_OVERHEAD provides an upper bound on the additional space consumed by the entry (e.g., for pointers or envelope information).

• Find the next contiguous empty space of n bytes in buffer (space following queue tail, or space at start of buffer if queue tail is too close to end of buffer). If space is not found then raise buffer overflow error.

• Append to end of PME queue in contiguous space the new entry that contains request handle, next pointer and packed message data; MPI_PACK is used to pack data.

• Post nonblocking send (standard mode) for packed data.

• Return

3.7 Nonblocking Communication

One can improve performance on many systems by overlapping communication and computation. This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Light-weight threads are one mechanism for achieving such overlap. An alternative mechanism that often leads to better performance is to use nonblocking communication. A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed. The use of nonblocking receives may also avoid system buffering and memory-to-memory copying, as information is provided early on the location of the receive buffer.

Nonblocking send start calls can use the same four modes as blocking sends: standard, buffered, synchronous and ready. These carry the same meaning. Sends of all modes, ready excepted, can be started whether a matching receive has been posted or not; a nonblocking ready send can be started only if a matching receive is posted. In all cases, the send start call is local: it returns immediately, irrespective of the status of other processes. If the call causes some system resource to be exhausted, then it will fail and return an error code. Quality implementations of MPI should ensure that this happens only in "pathological" cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations.

The send-complete call returns when data has been copied out of the send buffer. It may carry additional meaning, depending on the send mode.

If the send mode is synchronous, then the send can complete only if a matching receive has started. That is, a receive has been posted, and has been matched with the send. In this case, the send-complete call is non-local. Note that a synchronous, nonblocking send may complete, if matched by a nonblocking receive, before the receive complete call occurs. (It can complete as soon as the sender "knows" the transfer will complete, but before the receiver "knows" the transfer will complete.)

If the send mode is buffered then the message must be buffered if there is no pending receive. In this case, the send-complete call is local, and must succeed irrespective of the status of a matching receive.

If the send mode is standard then the send-complete call may return before a matching receive is posted, if the message is buffered. On the other hand, the send-complete may not complete until a matching receive is posted, and the message was copied into the receive buffer.

Nonblocking sends can be matched with blocking receives, and vice-versa.

Advice to users. The completion of a send operation may be delayed, for standard mode, and must be delayed, for synchronous mode, until a matching receive is posted. The use of nonblocking sends in these two cases allows the sender to proceed ahead of the receiver, so that the computation is more tolerant of fluctuations in the speeds of the two processes.

Nonblocking sends in the buffered and ready modes have a more limited impact, e.g., the blocking version of buffered send is capable of completing regardless of when a matching receive call is made. However, separating the start from the completion of these sends still gives some opportunity for optimization within the MPI library. For example, starting a buffered send gives an implementation more flexibility in determining if and how the message is buffered. There are also advantages for both nonblocking buffered and ready modes when data copying can be done concurrently with computation.

The message-passing model implies that communication is initiated by the sender. The communication will generally have lower overhead if a receive is already posted when the sender initiates the communication (data can be moved directly to the receive buffer, and there is no need to queue a pending send request). However, a receive operation can complete only after the matching send has occurred. The use of nonblocking receives allows one to achieve lower communication overheads without blocking the receiver while it waits for the send. (End of advice to users.)

3.7.1 Communication Request Objects

Nonblocking communications use opaque request objects to identify communication operations and match the operation that initiates the communication with the operation that terminates it. These are system objects that are accessed via a handle. A request object identifies various properties of a communication operation, such as the send mode, the communication buffer that is associated with it, its context, the tag and destination arguments to be used for a send, or the tag and source arguments to be used for a receive. In addition, this object stores information about the status of the pending communication operation.

3.7.2 Communication Initiation

We use the same naming conventions as for blocking communication: a prefix of B, S, or R is used for buffered, synchronous or ready mode. In addition a prefix of I (for immediate) indicates that the call is nonblocking.

MPI_ISEND(buf, count, datatype, dest, tag, comm, request)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)
OUT     request    communication request (handle)

int MPI_Isend(const void* buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_Isend(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a standard mode, nonblocking send.

MPI_IBSEND(buf, count, datatype, dest, tag, comm, request)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)
OUT     request    communication request (handle)

int MPI_Ibsend(const void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_Ibsend(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a buffered mode, nonblocking send.

MPI_ISSEND(buf, count, datatype, dest, tag, comm, request)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)
OUT     request    communication request (handle)

int MPI_Issend(const void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_Issend(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a synchronous mode, nonblocking send.

MPI_IRSEND(buf, count, datatype, dest, tag, comm, request)

IN      buf        initial address of send buffer (choice)
IN      count      number of elements in send buffer (non-negative integer)
IN      datatype   datatype of each send buffer element (handle)
IN      dest       rank of destination (integer)
IN      tag        message tag (integer)
IN      comm       communicator (handle)
OUT     request    communication request (handle)

int MPI_Irsend(const void* buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)

MPI_Irsend(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Start a ready mode nonblocking send.

MPI_IRECV(buf, count, datatype, source, tag, comm, request)

OUT     buf        initial address of receive buffer (choice)
IN      count      number of elements in receive buffer (non-negative integer)
IN      datatype   datatype of each receive buffer element (handle)
IN      source     rank of source or MPI_ANY_SOURCE (integer)
IN      tag        message tag or MPI_ANY_TAG (integer)
IN      comm       communicator (handle)
OUT     request    communication request (handle)

int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_Irecv(buf, count, datatype, source, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, source, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

Start a nonblocking receive.

These calls allocate a communication request object and associate it with the request handle (the argument request). The request can be used later to query the status of the communication or wait for its completion.

A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.

A nonblocking receive call indicates that the system may start writing data into the receive buffer. The receiver should not access any part of the receive buffer after a nonblocking receive operation is called, until the receive completes.

Advice to users. To prevent problems with the argument copying and register optimization done by Fortran compilers, please note the hints in Sections 17.1.10-17.1.20, especially in Sections 17.1.12 and 17.1.13 on pages 626-629 about "Problems Due to Data Copying and Sequence Association with Subscript Triplets" and "Vector Subscripts", and in Sections 17.1.16 to 17.1.19 on pages 631 to 642 about "Optimization Problems", "Code Movements and Register Optimization", "Temporary Data Movements" and "Permanent Data Movements". (End of advice to users.)

3.7.3 Communication Completion

The functions MPI_WAIT and MPI_TEST are used to complete a nonblocking communication. The completion of a send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the content of the send buffer unchanged). It does not indicate that the message has been received, rather, it may have been buffered by the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a matching receive was initiated, and that the message will eventually be received by this matching receive.

The completion of a receive operation indicates that the receive buffer contains the received message, the receiver is now free to access it, and that the status object is set. It does not indicate that the matching send operation has completed (but indicates, of course, that the send was initiated).

We shall use the following terminology: A null handle is a handle with value MPI_REQUEST_NULL. A persistent request and the handle to it are inactive if the request is not associated with any ongoing communication (see Section 3.9). A handle is active if it is neither null nor inactive. An empty status is a status which is set to return tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, error = MPI_SUCCESS, and is also internally configured so that calls to MPI_GET_COUNT, MPI_GET_ELEMENTS, and MPI_GET_ELEMENTS_X return count = 0 and MPI_TEST_CANCELLED returns false. We set a status variable to empty when the value returned by it is not significant. Status is set in this way so as to prevent errors due to accesses of stale information.

The fields in a status object returned by a call to MPI_WAIT, MPI_TEST, or any of the other derived functions (MPI_{TEST|WAIT}{ALL|SOME|ANY}), where the request corresponds to a send call, are undefined, with two exceptions: The error status field will contain valid information if the wait or test call returned with MPI_ERR_IN_STATUS; and the returned status can be queried by the call MPI_TEST_CANCELLED.

Error codes belonging to the error class MPI_ERR_IN_STATUS should be returned only by the MPI completion functions that take arrays of MPI_Status. For the functions MPI_TEST, MPI_TESTANY, MPI_WAIT, and MPI_WAITANY, which return a single MPI_Status value, the normal MPI error return process should be used (not the MPI_ERROR field in the MPI_Status argument).

MPI_WAIT(request, status)

INOUT   request   request (handle)
OUT     status    status object (Status)

int MPI_Wait(MPI_Request *request, MPI_Status *status)

MPI_Wait(request, status, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(INOUT) :: request
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

A call to MPI_WAIT returns when the operation identified by request is complete. If the request is an active persistent request, it is marked inactive. Any other type of request is deallocated and the request handle is set to MPI_REQUEST_NULL. MPI_WAIT is a non-local operation.

The call returns, in status, information on the completed operation. The content of the status object for a receive operation can be accessed as described in Section 3.2.5. The status object for a send operation may be queried by a call to MPI_TEST_CANCELLED (see Section 3.8).

One is allowed to call MPI_WAIT with a null or inactive request argument. In this case the operation returns immediately with empty status.

Advice to users. Successful return of MPI_WAIT after a MPI_IBSEND implies that the user send buffer can be reused — i.e., data has been sent out or copied into a buffer attached with MPI_BUFFER_ATTACH. Note that, at this point, we can no longer cancel the send (see Section 3.8). If a matching receive is never posted, then the buffer cannot be freed. This runs somewhat counter to the stated goal of MPI_CANCEL (always being able to free program space that was committed to the communication subsystem). (End of advice to users.)

Advice to implementors. In a multithreaded environment, a call to MPI_WAIT should block only the calling thread, allowing the thread scheduler to schedule another thread for execution. (End of advice to implementors.)

MPI_TEST(request, flag, status)

INOUT   request   communication request (handle)
OUT     flag      true if operation completed (logical)
OUT     status    status object (Status)

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_Test(request, flag, status, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(INOUT) :: request
    LOGICAL, INTENT(OUT) :: flag
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

A call to MPI_TEST returns flag = true if the operation identified by request is complete. In such a case, the status object is set to contain information on the completed operation. If the request is an active persistent request, it is marked as inactive. Any other type of request is deallocated and the request handle is set to MPI_REQUEST_NULL. The call returns flag = false if the operation identified by request is not complete. In this case, the value of the status object is undefined. MPI_TEST is a local operation.

The return status object for a receive operation carries information that can be accessed as described in Section 3.2.5. The status object for a send operation carries information that can be accessed by a call to MPI_TEST_CANCELLED (see Section 3.8).

One is allowed to call MPI_TEST with a null or inactive request argument. In such a case the operation returns with flag = true and empty status.

The functions MPI_WAIT and MPI_TEST can be used to complete both sends and receives.

Advice to users. The use of the nonblocking MPI_TEST call allows the user to schedule alternative activities within a single thread of execution. An event-driven thread scheduler can be emulated with periodic calls to MPI_TEST. (End of advice to users.)

Example 3.11 Simple usage of nonblocking operations and MPI_WAIT.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
END IF
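The polling style suggested in the advice to users above might look as follows in C (a sketch; buf, count, source, tag, comm, and the helper do_some_other_work are assumptions): the receiver repeatedly tests the request and performs other work until the message has arrived.

int         flag = 0;
MPI_Request request;
MPI_Status  status;

MPI_Irecv(buf, count, MPI_DOUBLE, source, tag, comm, &request);
while (!flag) {
    do_some_other_work();                /* hypothetical helper */
    MPI_Test(&request, &flag, &status);  /* local; does not block */
}
/* here the message is available in buf and status describes it */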

A request object can be deallocated without waiting for the associated communication to complete, by using the following operation.

MPI_REQUEST_FREE(request)

INOUT   request   communication request (handle)

int MPI_Request_free(MPI_Request *request)

MPI_Request_free(request, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(INOUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

Mark the request object for deallocation and set request to MPI_REQUEST_NULL. An ongoing communication that is associated with the request will be allowed to complete. The request will be deallocated only after its completion.

Rationale. The MPI_REQUEST_FREE mechanism is provided for reasons of performance and convenience on the sending side. (End of rationale.)

Advice to users. Once a request is freed by a call to MPI_REQUEST_FREE, it is not possible to check for the successful completion of the associated communication with calls to MPI_WAIT or MPI_TEST. Also, if an error occurs subsequently during the communication, an error code cannot be returned to the user — such an error must be treated as fatal. An active receive request should never be freed as the receiver will have no way to verify that the receive has completed and the receive buffer can be reused. (End of advice to users.)

Example 3.12 An example using MPI_REQUEST_FREE.

CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
IF (rank.EQ.0) THEN
    DO i=1, n
        CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req, ierr)
        CALL MPI_REQUEST_FREE(req, ierr)
        CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
    END DO
ELSE IF (rank.EQ.1) THEN
    CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
    DO I=1, n-1
        CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
        CALL MPI_REQUEST_FREE(req, ierr)
        CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
        CALL MPI_WAIT(req, status, ierr)
    END DO
    CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, req, ierr)
    CALL MPI_WAIT(req, status, ierr)
END IF

3.7.4 Semantics of Nonblocking Communications

The semantics of nonblocking communication is defined by suitably extending the definitions in Section 3.5.

Order Nonblocking communication operations are ordered according to the execution order of the calls that initiate the communication. The non-overtaking requirement of Section 3.5 is extended to nonblocking communication, with this definition of order being used.

Example 3.13 Message ordering for nonblocking operations.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (RANK.EQ.0) THEN
    CALL MPI_ISEND(a, 1, MPI_REAL, 1, 0, comm, r1, ierr)
    CALL MPI_ISEND(b, 1, MPI_REAL, 1, 0, comm, r2, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_IRECV(a, 1, MPI_REAL, 0, MPI_ANY_TAG, comm, r1, ierr)
    CALL MPI_IRECV(b, 1, MPI_REAL, 0, 0, comm, r2, ierr)
END IF
CALL MPI_WAIT(r1, status, ierr)
CALL MPI_WAIT(r2, status, ierr)

The first send of process zero will match the first receive of process one, even if both messages are sent before process one executes either receive.

Progress A call to MPI_WAIT that completes a receive will eventually terminate and return if a matching send has been started, unless the send is satisfied by another receive. In particular, if the matching send is nonblocking, then the receive should complete even if no call is executed by the sender to complete the send. Similarly, a call to MPI_WAIT that completes a send will eventually return if a matching receive has been started, unless the receive is satisfied by another send, and even if no call is executed to complete the receive.

Example 3.14 An illustration of progress semantics.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (RANK.EQ.0) THEN
    CALL MPI_SSEND(a, 1, MPI_REAL, 1, 0, comm, ierr)
    CALL MPI_SEND(b, 1, MPI_REAL, 1, 1, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_IRECV(a, 1, MPI_REAL, 0, 0, comm, r, ierr)
    CALL MPI_RECV(b, 1, MPI_REAL, 0, 1, comm, status, ierr)
    CALL MPI_WAIT(r, status, ierr)
END IF

This code should not deadlock in a correct MPI implementation. The first synchronous send of process zero must complete after process one posts the matching (nonblocking) receive even if process one has not yet reached the completing wait call. Thus, process zero will continue and execute the second send, allowing process one to complete execution.

If an MPI_TEST that completes a receive is repeatedly called with the same arguments, and a matching send has been started, then the call will eventually return flag = true, unless the send is satisfied by another receive. If an MPI_TEST that completes a send is repeatedly called with the same arguments, and a matching receive has been started, then the call will eventually return flag = true, unless the receive is satisfied by another send.

3.7.5 Multiple Completions

It is convenient to be able to wait for the completion of any, some, or all the operations in a list, rather than having to wait for a specific message. A call to MPI_WAITANY or MPI_TESTANY can be used to wait for the completion of one out of several operations. A call to MPI_WAITALL or MPI_TESTALL can be used to wait for all pending operations in a list. A call to MPI_WAITSOME or MPI_TESTSOME can be used to complete all enabled operations in a list.

MPI_WAITANY(count, array_of_requests, index, status)

IN      count               list length (non-negative integer)
INOUT   array_of_requests   array of requests (array of handles)
OUT     index               index of handle for operation that completed (integer)
OUT     status              status object (Status)

int MPI_Waitany(int count, MPI_Request array_of_requests[], int *index,
                MPI_Status *status)

MPI_Waitany(count, array_of_requests, index, status, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(count)
    INTEGER, INTENT(OUT) :: index
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WAITANY(COUNT, ARRAY_OF_REQUESTS, INDEX, STATUS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR

Blocks until one of the operations associated with the active requests in the array has completed. If more than one operation is enabled and can terminate, one is arbitrarily chosen. Returns in index the index of that request in the array and returns in status the status of the completing operation. (The array is indexed from zero in C, and from one in Fortran.) If the request is an active persistent request, it is marked inactive. Any other type of request is deallocated and the request handle is set to MPI_REQUEST_NULL.

The array_of_requests list may contain null or inactive handles. If the list contains no active handles (list has length zero or all entries are null or inactive), then the call returns immediately with index = MPI_UNDEFINED, and an empty status.

The execution of MPI_WAITANY(count, array_of_requests, index, status) has the same effect as the execution of MPI_WAIT(&array_of_requests[i], status), where i is the value returned by index (unless the value of index is MPI_UNDEFINED). MPI_WAITANY with an array containing one active entry is equivalent to MPI_WAIT.

MPI_TESTANY(count, array_of_requests, index, flag, status)

IN      count               list length (non-negative integer)
INOUT   array_of_requests   array of requests (array of handles)
OUT     index               index of operation that completed, or MPI_UNDEFINED if none completed (integer)
OUT     flag                true if one of the operations is complete (logical)
OUT     status              status object (Status)

int MPI_Testany(int count, MPI_Request array_of_requests[], int *index,
                int *flag, MPI_Status *status)

MPI_Testany(count, array_of_requests, index, flag, status, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(count)
    INTEGER, INTENT(OUT) :: index
    LOGICAL, INTENT(OUT) :: flag
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TESTANY(COUNT, ARRAY_OF_REQUESTS, INDEX, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR

Tests for completion of either one or none of the operations associated with active handles. In the former case, it returns flag = true, returns in index the index of this request in the array, and returns in status the status of that operation. If the request is an active persistent request, it is marked as inactive. Any other type of request is deallocated and the handle is set to MPI_REQUEST_NULL. (The array is indexed from zero in C, and from one in Fortran.) In the latter case (no operation completed), it returns flag = false, returns a value of MPI_UNDEFINED in index and status is undefined.

The array may contain null or inactive handles. If the array contains no active handles then the call returns immediately with flag = true, index = MPI_UNDEFINED, and an empty status.

If the array of requests contains active handles then the execution of MPI_TESTANY(count, array_of_requests, index, status) has the same effect as the execution of MPI_TEST(&array_of_requests[i], flag, status), for i=0, 1, ..., count-1, in some arbitrary order, until one call returns flag = true, or all fail. In the former case, index is set to the last value of i, and in the latter case, it is set to MPI_UNDEFINED. MPI_TESTANY with an array containing one active entry is equivalent to MPI_TEST.
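A typical use of MPI_WAITANY, sketched below in C (illustrative; NPEERS, peers, tag, comm, and the helper handle_message are assumptions), is to post one receive per peer and then service the messages in whatever order they arrive. Because a completed non-persistent request is set to MPI_REQUEST_NULL, it is ignored by later calls, so the loop handles each request exactly once.

MPI_Request reqs[NPEERS];
MPI_Status  status;
int         i, index, inbuf[NPEERS];

for (i = 0; i < NPEERS; i++)
    MPI_Irecv(&inbuf[i], 1, MPI_INT, peers[i], tag, comm, &reqs[i]);

for (i = 0; i < NPEERS; i++) {
    MPI_Waitany(NPEERS, reqs, &index, &status);
    /* reqs[index] is now MPI_REQUEST_NULL and will not be returned again */
    handle_message(index, inbuf[index]);   /* hypothetical helper */
}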

MPI_WAITALL(count, array_of_requests, array_of_statuses)
  IN     count                lists length (non-negative integer)
  INOUT  array_of_requests    array of requests (array of handles)
  OUT    array_of_statuses    array of status objects (array of Status)

int MPI_Waitall(int count, MPI_Request array_of_requests[],
              MPI_Status array_of_statuses[])

MPI_Waitall(count, array_of_requests, array_of_statuses, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(count)
    TYPE(MPI_Status) :: array_of_statuses(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*)
    INTEGER ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Blocks until all communication operations associated with active handles in the list complete, and return the status of all these operations (this includes the case where no handle in the list is active). Both arrays have the same number of valid entries. The i-th entry in array_of_statuses is set to the return status of the i-th operation. Active persistent requests are marked inactive. Requests of any other type are deallocated and the corresponding handles in the array are set to MPI_REQUEST_NULL. The list may contain null or inactive handles. The call sets to empty the status of each such entry.

The error-free execution of MPI_WAITALL(count, array_of_requests, array_of_statuses) has the same effect as the execution of MPI_WAIT(&array_of_request[i], &array_of_statuses[i]), for i=0 ,..., count-1, in some arbitrary order. MPI_WAITALL with an array of length one is equivalent to MPI_WAIT.

When one or more of the communications completed by a call to MPI_WAITALL fail, it is desirable to return specific information on each communication. The function MPI_WAITALL will return in such case the error code MPI_ERR_IN_STATUS and will set the error field of each status to a specific error code. This code will be MPI_SUCCESS, if the specific communication completed; it will be another specific error code, if it failed; or it can be MPI_ERR_PENDING if it has neither failed nor completed. The function MPI_WAITALL will return MPI_SUCCESS if no request had an error, or will return another error code if it failed for other reasons (such as invalid arguments). In such cases, it will not update the error fields of the statuses.

Rationale. This design streamlines error handling in the application. The application code need only test the (single) function result to determine if an error has occurred. It needs to check each individual status only when an error occurred. (End of rationale.)
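The following C sketch, not part of the standard text, illustrates the error-handling design described above. It assumes the requests were posted elsewhere, that nreq does not exceed 16, and that the communicator's error handler has been set to MPI_ERRORS_RETURN so that errors are reported as return codes rather than aborting.

/* Sketch (not standard text): complete a batch with MPI_Waitall and, on
 * MPI_ERR_IN_STATUS, inspect the error field of each status. */
#include <mpi.h>
#include <stdio.h>

void complete_batch(MPI_Request *reqs, int nreq)
{
    MPI_Status statuses[16];            /* assumes nreq <= 16 */
    int rc = MPI_Waitall(nreq, reqs, statuses);

    if (rc == MPI_ERR_IN_STATUS) {
        /* Per-request outcome: MPI_SUCCESS, a specific error code, or
         * MPI_ERR_PENDING if the operation neither failed nor completed. */
        for (int i = 0; i < nreq; i++) {
            int err = statuses[i].MPI_ERROR;
            if (err != MPI_SUCCESS && err != MPI_ERR_PENDING)
                fprintf(stderr, "request %d failed with error %d\n", i, err);
        }
    } else if (rc != MPI_SUCCESS) {
        /* Failure for another reason (e.g., invalid arguments);
         * in this case the statuses were not updated. */
        fprintf(stderr, "MPI_Waitall failed: %d\n", rc);
    }
}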

MPI_TESTALL(count, array_of_requests, flag, array_of_statuses)
  IN     count                lists length (non-negative integer)
  INOUT  array_of_requests    array of requests (array of handles)
  OUT    flag                 (logical)
  OUT    array_of_statuses    array of status objects (array of Status)

int MPI_Testall(int count, MPI_Request array_of_requests[], int *flag,
              MPI_Status array_of_statuses[])

MPI_Testall(count, array_of_requests, flag, array_of_statuses, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(count)
    LOGICAL, INTENT(OUT) :: flag
    TYPE(MPI_Status) :: array_of_statuses(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TESTALL(COUNT, ARRAY_OF_REQUESTS, FLAG, ARRAY_OF_STATUSES, IERROR)
    LOGICAL FLAG
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Returns flag = true if all communications associated with active handles in the array have completed (this includes the case where no handle in the list is active). In this case, each status entry that corresponds to an active request is set to the status of the corresponding operation. Active persistent requests are marked inactive. Requests of any other type are deallocated and the corresponding handles in the array are set to MPI_REQUEST_NULL. Each status entry that corresponds to a null or inactive handle is set to empty.

Otherwise, flag = false is returned, no request is modified and the values of the status entries are undefined. This is a local operation.

Errors that occurred during the execution of MPI_TESTALL are handled in the same manner as errors in MPI_WAITALL.

MPI_WAITSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)
  IN     incount              length of array_of_requests (non-negative integer)
  INOUT  array_of_requests    array of requests (array of handles)
  OUT    outcount             number of completed requests (integer)
  OUT    array_of_indices     array of indices of operations that completed (array of integers)
  OUT    array_of_statuses    array of status objects for operations that completed (array of Status)

int MPI_Waitsome(int incount, MPI_Request array_of_requests[],
              int *outcount, int array_of_indices[],

              MPI_Status array_of_statuses[])

MPI_Waitsome(incount, array_of_requests, outcount, array_of_indices,
              array_of_statuses, ierror) BIND(C)
    INTEGER, INTENT(IN) :: incount
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(incount)
    INTEGER, INTENT(OUT) :: outcount, array_of_indices(*)
    TYPE(MPI_Status) :: array_of_statuses(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WAITSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Waits until at least one of the operations associated with active handles in the list has completed. Returns in outcount the number of requests from the list array_of_requests that have completed. Returns in the first outcount locations of the array array_of_indices the indices of these operations (index within the array array_of_requests; the array is indexed from zero in C and from one in Fortran). Returns in the first outcount locations of the array array_of_statuses the status for these completed operations. Completed active persistent requests are marked as inactive. Any other type of request that completed is deallocated, and the associated handle is set to MPI_REQUEST_NULL.

If the list contains no active handles, then the call returns immediately with outcount = MPI_UNDEFINED.

When one or more of the communications completed by MPI_WAITSOME fails, then it is desirable to return specific information on each communication. The arguments outcount, array_of_indices and array_of_statuses will be adjusted to indicate completion of all communications that have succeeded or failed. The call will return the error code MPI_ERR_IN_STATUS and the error field of each status returned will be set to indicate success or to indicate the specific error that occurred. The call will return MPI_SUCCESS if no request resulted in an error, and will return another error code if it failed for other reasons (such as invalid arguments). In such cases, it will not update the error fields of the statuses.

MPI_TESTSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)
  IN     incount              length of array_of_requests (non-negative integer)
  INOUT  array_of_requests    array of requests (array of handles)
  OUT    outcount             number of completed requests (integer)
  OUT    array_of_indices     array of indices of operations that completed (array of integers)
  OUT    array_of_statuses    array of status objects for operations that completed (array of Status)

int MPI_Testsome(int incount, MPI_Request array_of_requests[],
              int *outcount, int array_of_indices[],

              MPI_Status array_of_statuses[])

MPI_Testsome(incount, array_of_requests, outcount, array_of_indices,
              array_of_statuses, ierror) BIND(C)
    INTEGER, INTENT(IN) :: incount
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(incount)
    INTEGER, INTENT(OUT) :: outcount, array_of_indices(*)
    TYPE(MPI_Status) :: array_of_statuses(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TESTSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*),
    ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR

Behaves like MPI_WAITSOME, except that it returns immediately. If no operation has completed it returns outcount = 0. If there is no active handle in the list it returns outcount = MPI_UNDEFINED.

MPI_TESTSOME is a local operation, which returns immediately, whereas MPI_WAITSOME will block until a communication completes, if it was passed a list that contains at least one active handle. Both calls fulfill a fairness requirement: If a request for a receive repeatedly appears in a list of requests passed to MPI_WAITSOME or MPI_TESTSOME, and a matching send has been posted, then the receive will eventually succeed, unless the send is satisfied by another receive; and similarly for send requests.

Errors that occur during the execution of MPI_TESTSOME are handled as for MPI_WAITSOME.

Advice to users. The use of MPI_TESTSOME is likely to be more efficient than the use of MPI_TESTANY. The former returns information on all completed communications; with the latter, a new call is required for each communication that completes.

A server with multiple clients can use MPI_WAITSOME so as not to starve any client. Clients send messages to the server with service requests. The server calls MPI_WAITSOME with one receive request for each client, and then handles all receives that completed. If a call to MPI_WAITANY is used instead, then one client could starve while requests from another client always sneak in first. (End of advice to users.)

Advice to implementors. MPI_TESTSOME should complete as many pending communications as possible. (End of advice to implementors.)

Example 3.15 Client-server code (starvation can occur).

CALL MPI_COMM_SIZE(comm, size, ierr)
CALL MPI_COMM_RANK(comm, rank, ierr)
IF(rank .GT. 0) THEN         ! client code
    DO WHILE(.TRUE.)
       CALL MPI_ISEND(a, n, MPI_REAL, 0, tag, comm, request, ierr)
       CALL MPI_WAIT(request, status, ierr)
    END DO

ELSE         ! rank=0 -- server code
    DO i=1, size-1
       CALL MPI_IRECV(a(1,i), n, MPI_REAL, i, tag,
                      comm, request_list(i), ierr)
    END DO
    DO WHILE(.TRUE.)
       CALL MPI_WAITANY(size-1, request_list, index, status, ierr)
       CALL DO_SERVICE(a(1,index))  ! handle one message
       CALL MPI_IRECV(a(1, index), n, MPI_REAL, index, tag,
                      comm, request_list(index), ierr)
    END DO
END IF

Example 3.16 Same code, using MPI_WAITSOME.

CALL MPI_COMM_SIZE(comm, size, ierr)
CALL MPI_COMM_RANK(comm, rank, ierr)
IF(rank .GT. 0) THEN         ! client code
    DO WHILE(.TRUE.)
       CALL MPI_ISEND(a, n, MPI_REAL, 0, tag, comm, request, ierr)
       CALL MPI_WAIT(request, status, ierr)
    END DO
ELSE         ! rank=0 -- server code
    DO i=1, size-1
       CALL MPI_IRECV(a(1,i), n, MPI_REAL, i, tag,
                      comm, request_list(i), ierr)
    END DO
    DO WHILE(.TRUE.)
       CALL MPI_WAITSOME(size-1, request_list, numdone,
                         indices, statuses, ierr)
       DO i=1, numdone
          CALL DO_SERVICE(a(1, indices(i)))
          CALL MPI_IRECV(a(1, indices(i)), n, MPI_REAL, indices(i), tag,
                         comm, request_list(indices(i)), ierr)
       END DO
    END DO
END IF

3.7.6 Non-destructive Test of status

This call is useful for accessing the information associated with a request, without freeing the request (in case the user is expected to access it later). It allows one to layer libraries more conveniently, since multiple layers of software may access the same completed request and extract from it the status information.
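A minimal C sketch of this layering use case follows; it is not part of the standard text and uses the MPI_REQUEST_GET_STATUS call that is specified next. The function name library_poll is an illustrative, hypothetical helper.

/* Sketch (not standard text): a library-level poll that reports completion
 * without freeing or inactivating the request, so a higher software layer
 * can still wait on, test, or free the same request later. */
#include <mpi.h>

int library_poll(MPI_Request req, int *src, int *tag)
{
    int flag;
    MPI_Status status;
    MPI_Request_get_status(req, &flag, &status);  /* does not deallocate req */
    if (flag) {
        *src = status.MPI_SOURCE;
        *tag = status.MPI_TAG;
    }
    /* Caller must still call MPI_Wait, MPI_Test, or MPI_Request_free. */
    return flag;
}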

MPI_REQUEST_GET_STATUS(request, flag, status)
  IN     request              request (handle)
  OUT    flag                 boolean flag, same as from MPI_TEST (logical)
  OUT    status               status object if flag is true (Status)

int MPI_Request_get_status(MPI_Request request, int *flag,
              MPI_Status *status)

MPI_Request_get_status(request, flag, status, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(IN) :: request
    LOGICAL, INTENT(OUT) :: flag
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_REQUEST_GET_STATUS(REQUEST, FLAG, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
    LOGICAL FLAG

Sets flag=true if the operation is complete, and, if so, returns in status the request status. However, unlike test or wait, it does not deallocate or inactivate the request; a subsequent call to test, wait or free should be executed with that request. It sets flag=false if the operation is not complete.

One is allowed to call MPI_REQUEST_GET_STATUS with a null or inactive request argument. In such a case the operation returns with flag=true and empty status.

3.8 Probe and Cancel

The MPI_PROBE, MPI_IPROBE, MPI_MPROBE, and MPI_IMPROBE operations allow incoming messages to be checked for, without actually receiving them. The user can then decide how to receive them, based on the information returned by the probe (basically, the information returned by status). In particular, the user may allocate memory for the receive buffer, according to the length of the probed message.

The MPI_CANCEL operation allows pending communications to be cancelled. This is required for cleanup. Posting a send or a receive ties up user resources (send or receive buffers), and a cancel may be needed to free these resources gracefully.

3.8.1 Probe

MPI_IPROBE(source, tag, comm, flag, status)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     tag                  message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    flag                 (logical)
  OUT    status               status object (Status)

int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag,
              MPI_Status *status)

MPI_Iprobe(source, tag, comm, flag, status, ierror) BIND(C)
    INTEGER, INTENT(IN) :: source, tag
    TYPE(MPI_Comm), INTENT(IN) :: comm
    LOGICAL, INTENT(OUT) :: flag
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERROR)
    LOGICAL FLAG
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_IPROBE(source, tag, comm, flag, status) returns flag = true if there is a message that can be received and that matches the pattern specified by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program, and returns in status the same value that would have been returned by MPI_RECV(). Otherwise, the call returns flag = false, and leaves status undefined.

If MPI_IPROBE returns flag = true, then the content of the status object can be subsequently accessed as described in Section 3.2.5 to find the source, tag and length of the probed message.

A subsequent receive executed with the same communicator, and the source and tag returned in status by MPI_IPROBE will receive the message that was matched by the probe, if no other intervening receive occurs after the probe, and the send is not successfully cancelled before the receive. If the receiving process is multithreaded, it is the user's responsibility to ensure that the last condition holds.

The source argument of MPI_PROBE can be MPI_ANY_SOURCE, and the tag argument can be MPI_ANY_TAG, so that one can probe for messages from an arbitrary source and/or with an arbitrary tag. However, a specific communication context must be provided with the comm argument.

It is not necessary to receive a message immediately after it has been probed for, and the same message may be probed for several times before it is received.

A probe with MPI_PROC_NULL as source returns flag = true, and the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG, and count = 0; see Section 3.11 on page 81.

MPI_PROBE(source, tag, comm, status)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     tag                  message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    status               status object (Status)

int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)

MPI_Probe(source, tag, comm, status, ierror) BIND(C)

    INTEGER, INTENT(IN) :: source, tag
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

MPI_PROBE behaves like MPI_IPROBE except that it is a blocking call that returns only after a matching message has been found.

The implementation of MPI_PROBE and MPI_IPROBE needs to guarantee progress: if a call to MPI_PROBE has been issued by a process, and a send that matches the probe has been initiated by some process, then the call to MPI_PROBE will return, unless the message is received by another concurrent receive operation (that is executed by another thread at the probing process). Similarly, if a process busy waits with MPI_IPROBE and a matching message has been issued, then the call to MPI_IPROBE will eventually return flag = true unless the message is received by another concurrent receive operation or matched by a concurrent matched probe.

Example 3.17 Use blocking probe to wait for an incoming message.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
     CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
     CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE IF (rank.EQ.2) THEN
     DO i=1, 2
        CALL MPI_PROBE(MPI_ANY_SOURCE, 0,
                       comm, status, ierr)
        IF (status(MPI_SOURCE) .EQ. 0) THEN
100        CALL MPI_RECV(i, 1, MPI_INTEGER, 0, 0, comm, status, ierr)
        ELSE
200        CALL MPI_RECV(x, 1, MPI_REAL, 1, 0, comm, status, ierr)
        END IF
     END DO
END IF

Each message is received with the right type.

Example 3.18 A similar program to the previous example, but now it has a problem.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
     CALL MPI_SEND(i, 1, MPI_INTEGER, 2, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
     CALL MPI_SEND(x, 1, MPI_REAL, 2, 0, comm, ierr)
ELSE IF (rank.EQ.2) THEN
     DO i=1, 2

        CALL MPI_PROBE(MPI_ANY_SOURCE, 0,
                       comm, status, ierr)
        IF (status(MPI_SOURCE) .EQ. 0) THEN
100        CALL MPI_RECV(i, 1, MPI_INTEGER, MPI_ANY_SOURCE,
                         0, comm, status, ierr)
        ELSE
200        CALL MPI_RECV(x, 1, MPI_REAL, MPI_ANY_SOURCE,
                         0, comm, status, ierr)
        END IF
     END DO
END IF

In Example 3.18, the two receive calls in statements labeled 100 and 200 in Example 3.17 are slightly modified, using MPI_ANY_SOURCE as the source argument. The program is now incorrect: the receive operation may receive a message that is distinct from the message probed by the preceding call to MPI_PROBE.

Advice to users. In a multithreaded MPI program, MPI_PROBE and MPI_IPROBE might need special care. If a thread probes for a message and then immediately posts a matching receive, the receive may match a message other than that found by the probe since another thread could concurrently receive that original matching message [29]. MPI_MPROBE and MPI_IMPROBE solve this problem by matching the incoming message so that it may only be received with MPI_MRECV or MPI_IMRECV on the corresponding message handle. (End of advice to users.)

Advice to implementors. A call to MPI_PROBE(source, tag, comm, status) will match the message that would have been received by a call to MPI_RECV(..., source, tag, comm, status), executed at the same point. Suppose that this message has source s, tag t and communicator c. If the tag argument in the probe call has value MPI_ANY_TAG then the message probed will be the earliest pending message from source s with communicator c and any tag; in any case, the message probed will be the earliest pending message from source s with tag t and communicator c (this is the message that would have been received, so as to preserve message order). This message continues as the earliest pending message from source s with tag t and communicator c, until it is received. A receive operation subsequent to the probe that uses the same communicator as the probe and uses the tag and source values returned by the probe, must receive this message, unless it has already been received by another receive operation. (End of advice to implementors.)

3.8.2 Matching Probe

The function MPI_PROBE checks for incoming messages without receiving them. Since the list of incoming messages is global among the threads of each MPI process, it can be hard to use this functionality in threaded environments [29, 26].

Like MPI_PROBE and MPI_IPROBE, the MPI_MPROBE and MPI_IMPROBE operations allow incoming messages to be queried without actually receiving them, except that MPI_MPROBE and MPI_IMPROBE provide a mechanism to receive the specific message

that was matched regardless of other intervening probe or receive operations. This gives the application an opportunity to decide how to receive the message, based on the information returned by the probe. In particular, the user may allocate memory for the receive buffer, according to the length of the probed message.

MPI_IMPROBE(source, tag, comm, flag, message, status)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     tag                  message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    flag                 flag (logical)
  OUT    message              returned message (handle)
  OUT    status               status object (Status)

int MPI_Improbe(int source, int tag, MPI_Comm comm, int *flag,
              MPI_Message *message, MPI_Status *status)

MPI_Improbe(source, tag, comm, flag, message, status, ierror) BIND(C)
    INTEGER, INTENT(IN) :: source, tag
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(OUT) :: flag
    TYPE(MPI_Message), INTENT(OUT) :: message
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IMPROBE(SOURCE, TAG, COMM, FLAG, MESSAGE, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, FLAG, MESSAGE, STATUS(MPI_STATUS_SIZE), IERROR

MPI_IMPROBE(source, tag, comm, flag, message, status) returns flag = true if there is a message that can be received and that matches the pattern specified by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program and returns in status the same value that would have been returned by MPI_RECV. In addition, it returns in message a handle to the matched message. Otherwise, the call returns flag = false, and leaves status and message undefined.

A matched receive (MPI_MRECV or MPI_IMRECV) executed with the message handle will receive the message that was matched by the probe. Unlike MPI_IPROBE, no other probe or receive operation may match the message returned by MPI_IMPROBE. Each message returned by MPI_IMPROBE must be received with either MPI_MRECV or MPI_IMRECV.

The source argument of MPI_IMPROBE can be MPI_ANY_SOURCE, and the tag argument can be MPI_ANY_TAG, so that one can probe for messages from an arbitrary source and/or with an arbitrary tag. However, a specific communication context must be provided with the comm argument.

A synchronous send operation that is matched with MPI_MPROBE or MPI_IMPROBE will complete successfully only if both a matching receive is posted with MPI_MRECV or

MPI_IMRECV, and the receive operation has started to receive the message sent by the synchronous send.

There is a special predefined message: MPI_MESSAGE_NO_PROC, which is a message which has MPI_PROC_NULL as its source process. The predefined constant MPI_MESSAGE_NULL is the value used for invalid message handles.

A matching probe with MPI_PROC_NULL as source returns flag = true, message = MPI_MESSAGE_NO_PROC, and the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG, and count = 0; see Section 3.11. It is not necessary to call MPI_MRECV or MPI_IMRECV with MPI_MESSAGE_NO_PROC, but it is not erroneous to do so.

Rationale. MPI_MESSAGE_NO_PROC was chosen instead of MPI_MESSAGE_PROC_NULL to avoid possible confusion as another null handle constant. (End of rationale.)

MPI_MPROBE(source, tag, comm, message, status)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     tag                  message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    message              returned message (handle)
  OUT    status               status object (Status)

int MPI_Mprobe(int source, int tag, MPI_Comm comm, MPI_Message *message,
              MPI_Status *status)

MPI_Mprobe(source, tag, comm, message, status, ierror) BIND(C)
    INTEGER, INTENT(IN) :: source, tag
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Message), INTENT(OUT) :: message
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_MPROBE(SOURCE, TAG, COMM, MESSAGE, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, MESSAGE, STATUS(MPI_STATUS_SIZE), IERROR

MPI_MPROBE behaves like MPI_IMPROBE except that it is a blocking call that returns only after a matching message has been found.

The implementation of MPI_MPROBE and MPI_IMPROBE needs to guarantee progress in the same way as in the case of MPI_PROBE and MPI_IPROBE.

3.8.3 Matched Receives

The functions MPI_MRECV and MPI_IMRECV receive messages that have been previously matched by a matching probe (Section 3.8.2).

MPI_MRECV(buf, count, datatype, message, status)
  OUT    buf                  initial address of receive buffer (choice)
  IN     count                number of elements in receive buffer (non-negative integer)
  IN     datatype             datatype of each receive buffer element (handle)
  INOUT  message              message (handle)
  OUT    status               status object (Status)

int MPI_Mrecv(void* buf, int count, MPI_Datatype datatype,
              MPI_Message *message, MPI_Status *status)

MPI_Mrecv(buf, count, datatype, message, status, ierror) BIND(C)
    TYPE(*), DIMENSION(..) :: buf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Message), INTENT(INOUT) :: message
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_MRECV(BUF, COUNT, DATATYPE, MESSAGE, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, MESSAGE, STATUS(MPI_STATUS_SIZE), IERROR

This call receives a message matched by a matching probe operation (Section 3.8.2). The receive buffer consists of the storage containing count consecutive elements of the type specified by datatype, starting at address buf. The length of the received message must be less than or equal to the length of the receive buffer. An overflow error occurs if all incoming data does not fit, without truncation, into the receive buffer.

If the message is shorter than the receive buffer, then only those locations corresponding to the (shorter) message are modified.

On return from this function, the message handle is set to MPI_MESSAGE_NULL. All errors that occur during the execution of this operation are handled according to the error handler set for the communicator used in the matching probe call that produced the message handle.

If MPI_MRECV is called with MPI_MESSAGE_NO_PROC as the message argument, the call returns immediately with the status object set to source = MPI_PROC_NULL, tag = MPI_ANY_TAG, and count = 0, as if a receive from MPI_PROC_NULL was issued (see Section 3.11). A call to MPI_MRECV with MPI_MESSAGE_NULL is erroneous.
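The following C sketch, not part of the standard text, shows the intended thread-safe pattern: MPI_MPROBE matches a message of unknown length, MPI_GET_COUNT sizes the buffer, and MPI_MRECV receives exactly the matched message. The tag value 0 and the use of MPI_COMM_WORLD are illustrative assumptions.

/* Sketch (not standard text): receive a message of unknown length via a
 * matching probe, so no other thread's receive can intercept it. */
#include <mpi.h>
#include <stdlib.h>

void receive_any_int_message(void)
{
    MPI_Message msg;
    MPI_Status status;
    int count;

    MPI_Mprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &msg, &status);
    MPI_Get_count(&status, MPI_INT, &count);   /* length of the probed message */

    int *buf = malloc(count * sizeof(int));
    /* Only MPI_Mrecv/MPI_Imrecv can receive the matched message;
     * msg is set to MPI_MESSAGE_NULL on return. */
    MPI_Mrecv(buf, count, MPI_INT, &msg, &status);

    /* ... use buf ... */
    free(buf);
}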

MPI_IMRECV(buf, count, datatype, message, request)
  OUT    buf                  initial address of receive buffer (choice)
  IN     count                number of elements in receive buffer (non-negative integer)
  IN     datatype             datatype of each receive buffer element (handle)
  INOUT  message              message (handle)
  OUT    request              communication request (handle)

int MPI_Imrecv(void* buf, int count, MPI_Datatype datatype,
              MPI_Message *message, MPI_Request *request)

MPI_Imrecv(buf, count, datatype, message, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Message), INTENT(INOUT) :: message
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IMRECV(BUF, COUNT, DATATYPE, MESSAGE, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, MESSAGE, REQUEST, IERROR

MPI_IMRECV is the nonblocking variant of MPI_MRECV and starts a nonblocking receive of a matched message. Completion semantics are similar to MPI_IRECV as described in Section 3.7.2. On return from this function, the message handle is set to MPI_MESSAGE_NULL.

If MPI_IMRECV is called with MPI_MESSAGE_NO_PROC as the message argument, the call returns immediately with a request object which, when completed, will yield a status object set to source = MPI_PROC_NULL, tag = MPI_ANY_TAG, and count = 0, as if a receive from MPI_PROC_NULL was issued (see Section 3.11). A call to MPI_IMRECV with MPI_MESSAGE_NULL is erroneous.

Advice to implementors. If reception of a matched message is started with MPI_IMRECV, then it is possible to cancel the returned request with MPI_CANCEL. If MPI_CANCEL succeeds, the matched message must be found by a subsequent message probe (MPI_PROBE, MPI_IPROBE, MPI_MPROBE, or MPI_IMPROBE), received by a subsequent receive operation or cancelled by the sender. See Section 3.8.4 for details about MPI_CANCEL. The cancellation of operations initiated with MPI_IMRECV may fail. (End of advice to implementors.)

3.8.4 Cancel

MPI_CANCEL(request)
  IN     request              communication request (handle)

int MPI_Cancel(MPI_Request *request)

MPI_Cancel(request, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(IN) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_CANCEL(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

A call to MPI_CANCEL marks for cancellation a pending, nonblocking communication operation (send or receive). The cancel call is local. It returns immediately, possibly before the communication is actually cancelled. It is still necessary to call MPI_REQUEST_FREE, MPI_WAIT or MPI_TEST (or any of the derived operations) with the cancelled request as argument after the call to MPI_CANCEL. If a communication is marked for cancellation, then a MPI_WAIT call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_WAIT behaves as a local function); similarly if MPI_TEST is repeatedly called in a busy wait loop for a cancelled communication, then MPI_TEST will eventually be successful.

MPI_CANCEL can be used to cancel a communication that uses a persistent request (see Section 3.9), in the same way it is used for nonpersistent requests. A successful cancellation cancels the active communication, but not the request itself. After the call to MPI_CANCEL and the subsequent call to MPI_WAIT or MPI_TEST, the request becomes inactive and can be activated for a new communication.

The successful cancellation of a buffered send frees the buffer space occupied by the pending message.

Either the cancellation succeeds, or the communication succeeds, but not both. If a send is marked for cancellation, then it must be the case that either the send completes normally, in which case the message sent was received at the destination process, or that the send is successfully cancelled, in which case no part of the message was received at the destination. Then, any matching receive has to be satisfied by another send. If a receive is marked for cancellation, then it must be the case that either the receive completes normally, or that the receive is successfully cancelled, in which case no part of the receive buffer is altered. Then, any matching send has to be satisfied by another receive.

If the operation has been cancelled, then information to that effect will be returned in the status argument of the operation that completes the communication.

Rationale. Although the IN request handle parameter should not need to be passed by reference, the C binding has listed the argument type as MPI_Request* since MPI-1.0. This function signature therefore cannot be changed without breaking existing MPI applications. (End of rationale.)

MPI_TEST_CANCELLED(status, flag)
  IN     status               status object (Status)
  OUT    flag                 (logical)

int MPI_Test_cancelled(const MPI_Status *status, int *flag)

MPI_Test_cancelled(status, flag, ierror) BIND(C)
    TYPE(MPI_Status), INTENT(IN) :: status
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TEST_CANCELLED(STATUS, FLAG, IERROR)
    LOGICAL FLAG
    INTEGER STATUS(MPI_STATUS_SIZE), IERROR

Returns flag = true if the communication associated with the status object was cancelled successfully. In such a case, all other fields of status (such as count or tag) are undefined. Returns flag = false, otherwise. If a receive operation might be cancelled then one should call MPI_TEST_CANCELLED first, to check whether the operation was cancelled, before checking on the other fields of the return status.

Advice to users. Cancel can be an expensive operation that should be used only exceptionally. (End of advice to users.)

Advice to implementors. If a send operation uses an "eager" protocol (data is transferred to the receiver before a matching receive is posted), then the cancellation of this send may require communication with the intended receiver in order to free allocated buffers. On some systems this may require an interrupt to the intended receiver. Note that, while communication may be needed to implement MPI_CANCEL, this is still a local operation, since its completion does not depend on the code executed by other processes. If processing is required on another process, this should be transparent to the application (hence the need for an interrupt and an interrupt handler). (End of advice to implementors.)

3.9 Persistent Communication Requests

Often a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation. In such a situation, it may be possible to optimize the communication by binding the list of communication arguments to a persistent communication request once and, then, repeatedly using the request to initiate and complete messages. The persistent request thus created can be thought of as a communication port or a "half-channel." It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port. This construct allows reduction of the overhead for communication between the process and communication controller, but not of the overhead for communication between one communication controller and another. It is not necessary that messages sent with a persistent request be received by a receive operation using a persistent request, or vice versa.

A persistent communication request is created using one of the five following calls. These calls involve no communication.
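The following C sketch, not part of the standard text, illustrates the intended usage pattern, Create (Start Complete)* Free, with the initialization and start calls that are specified in the remainder of this section. The partner rank, iteration count, tag, and buffer sizes are illustrative assumptions.

/* Sketch (not standard text): persistent requests bound once and reused
 * inside an iteration loop. */
#include <mpi.h>

void exchange_loop(double *sbuf, double *rbuf, int n, int peer, int niter)
{
    MPI_Request reqs[2];
    MPI_Status statuses[2];

    /* Create: bind the communication argument lists once. */
    MPI_Send_init(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int it = 0; it < niter; it++) {
        MPI_Startall(2, reqs);             /* Start: activate both requests */
        MPI_Waitall(2, reqs, statuses);    /* Complete: requests become inactive */
        /* ... compute on rbuf, refill sbuf ... */
    }

    /* Free: deallocate the now-inactive persistent requests. */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}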

MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request)
  IN     buf                  initial address of send buffer (choice)
  IN     count                number of elements sent (non-negative integer)
  IN     datatype             type of each element (handle)
  IN     dest                 rank of destination (integer)
  IN     tag                  message tag (integer)
  IN     comm                 communicator (handle)
  OUT    request              communication request (handle)

int MPI_Send_init(const void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Send_init(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a standard mode send operation, and binds to it all the arguments of a send operation.

MPI_BSEND_INIT(buf, count, datatype, dest, tag, comm, request)
  IN     buf                  initial address of send buffer (choice)
  IN     count                number of elements sent (non-negative integer)
  IN     datatype             type of each element (handle)
  IN     dest                 rank of destination (integer)
  IN     tag                  message tag (integer)
  IN     comm                 communicator (handle)
  OUT    request              communication request (handle)

int MPI_Bsend_init(const void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Bsend_init(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag

    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a buffered mode send.

MPI_SSEND_INIT(buf, count, datatype, dest, tag, comm, request)
  IN     buf                  initial address of send buffer (choice)
  IN     count                number of elements sent (non-negative integer)
  IN     datatype             type of each element (handle)
  IN     dest                 rank of destination (integer)
  IN     tag                  message tag (integer)
  IN     comm                 communicator (handle)
  OUT    request              communication request (handle)

int MPI_Ssend_init(const void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Ssend_init(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication object for a synchronous mode send operation.

MPI_RSEND_INIT(buf, count, datatype, dest, tag, comm, request)
  IN     buf                  initial address of send buffer (choice)
  IN     count                number of elements sent (non-negative integer)
  IN     datatype             type of each element (handle)
  IN     dest                 rank of destination (integer)
  IN     tag                  message tag (integer)
  IN     comm                 communicator (handle)
  OUT    request              communication request (handle)

int MPI_Rsend_init(const void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Rsend_init(buf, count, datatype, dest, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_RSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

Creates a persistent communication object for a ready mode send operation.

MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request)
  OUT    buf                  initial address of receive buffer (choice)
  IN     count                number of elements received (non-negative integer)
  IN     datatype             type of each element (handle)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     tag                  message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    request              communication request (handle)

int MPI_Recv_init(void* buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_Recv_init(buf, count, datatype, source, tag, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: buf
    INTEGER, INTENT(IN) :: count, source, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype

    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_RECV_INIT(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

Creates a persistent communication request for a receive operation. The argument buf is marked as OUT because the user gives permission to write on the receive buffer by passing the argument to MPI_RECV_INIT.

A persistent communication request is inactive after it was created: no active communication is attached to the request.

A communication (send or receive) that uses a persistent request is initiated by the function MPI_START.

MPI_START(request)
  INOUT  request              communication request (handle)

int MPI_Start(MPI_Request *request)

MPI_Start(request, ierror) BIND(C)
    TYPE(MPI_Request), INTENT(INOUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR

The argument, request, is a handle returned by one of the previous five calls. The associated request should be inactive. The request becomes active once the call is made.

If the request is for a send with ready mode, then a matching receive should be posted before the call is made. The communication buffer should not be modified after the call, and until the operation completes.

The call is local, with similar semantics to the nonblocking communication operations described in Section 3.7. That is, a call to MPI_START with a request created by MPI_SEND_INIT starts a communication in the same manner as a call to MPI_ISEND; a call to MPI_START with a request created by MPI_BSEND_INIT starts a communication in the same manner as a call to MPI_IBSEND; and so on.

MPI_STARTALL(count, array_of_requests)
  IN     count                list length (non-negative integer)
  INOUT  array_of_requests    array of requests (array of handle)

int MPI_Startall(int count, MPI_Request array_of_requests[])

MPI_Startall(count, array_of_requests, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Request), INTENT(INOUT) :: array_of_requests(count)

    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_STARTALL(COUNT, ARRAY_OF_REQUESTS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), IERROR

Start all communications associated with requests in array_of_requests. A call to MPI_STARTALL(count, array_of_requests) has the same effect as calls to MPI_START(&array_of_requests[i]), executed for i=0 ,..., count-1, in some arbitrary order.

A communication started with a call to MPI_START or MPI_STARTALL is completed by a call to MPI_WAIT, MPI_TEST, or one of the derived functions described in Section 3.7.5. The request becomes inactive after successful completion of such call. The request is not deallocated and it can be activated anew by an MPI_START or MPI_STARTALL call.

A persistent request is deallocated by a call to MPI_REQUEST_FREE (Section 3.7.3). The call to MPI_REQUEST_FREE can occur at any point in the program after the persistent request was created. However, the request will be deallocated only after it becomes inactive. Active receive requests should not be freed. Otherwise, it will not be possible to check that the receive has completed. It is preferable, in general, to free requests when they are inactive. If this rule is followed, then the functions described in this section will be invoked in a sequence of the form,

    Create (Start Complete)* Free

where * indicates zero or more repetitions. If the same communication object is used in several concurrent threads, it is the user's responsibility to coordinate calls so that the correct sequence is obeyed.

A send operation initiated with MPI_START can be matched with any receive operation and, likewise, a receive operation initiated with MPI_START can receive messages generated by any send operation.

Advice to users. To prevent problems with the argument copying and register optimization done by Fortran compilers, please note the hints in Sections 17.1.10-17.1.20, especially in Sections 17.1.12 and 17.1.13 on pages 626-629 about "Problems Due to Data Copying and Sequence Association with Subscript Triplets" and "Vector Subscripts", and in Sections 17.1.16 to 17.1.19 on pages 631 to 642 about "Optimization Problems", "Code Movements and Register Optimization", "Temporary Data Movements" and "Permanent Data Movements". (End of advice to users.)

3.10 Send-Receive

The send-receive operations combine in one call the sending of a message to one destination and the receiving of another message, from another process. The two (source and destination) are possibly the same. A send-receive operation is very useful for executing a shift operation across a chain of processes. If blocking sends and receives are used for such a shift, then one needs to order the sends and receives correctly (for example, even processes send, then receive, odd processes receive first, then send) so as to prevent cyclic dependencies that may lead to deadlock. When a send-receive operation is used, the communication subsystem takes care of these issues. The send-receive operation can be used

in conjunction with the functions described in Chapter 7 in order to perform shifts on various logical topologies. Also, a send-receive operation is useful for implementing remote procedure calls.

A message sent by a send-receive operation can be received by a regular receive operation or probed by a probe operation; a send-receive operation can receive a message sent by a regular send operation.

MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
  IN     sendbuf              initial address of send buffer (choice)
  IN     sendcount            number of elements in send buffer (non-negative integer)
  IN     sendtype             type of elements in send buffer (handle)
  IN     dest                 rank of destination (integer)
  IN     sendtag              send tag (integer)
  OUT    recvbuf              initial address of receive buffer (choice)
  IN     recvcount            number of elements in receive buffer (non-negative integer)
  IN     recvtype             type of elements in receive buffer (handle)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     recvtag              receive tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    status               status object (Status)

int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
              int dest, int sendtag, void *recvbuf, int recvcount,
              MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm,
              MPI_Status *status)

MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf,
              recvcount, recvtype, source, recvtag, comm, status, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, dest, sendtag, recvcount, source, recvtag
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF,
              RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
    <type> SENDBUF(*), RECVBUF(*)

    INTEGER SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVCOUNT, RECVTYPE,
    SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

Execute a blocking send and receive operation. Both send and receive use the same communicator, but possibly different tags. The send buffer and receive buffers must be disjoint, and may have different lengths and datatypes.

The semantics of a send-receive operation is what would be obtained if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads.

MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status)
  INOUT  buf                  initial address of send and receive buffer (choice)
  IN     count                number of elements in send and receive buffer (non-negative integer)
  IN     datatype             type of elements in send and receive buffer (handle)
  IN     dest                 rank of destination (integer)
  IN     sendtag              send message tag (integer)
  IN     source               rank of source or MPI_ANY_SOURCE (integer)
  IN     recvtag              receive message tag or MPI_ANY_TAG (integer)
  IN     comm                 communicator (handle)
  OUT    status               status object (Status)

int MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype,
              int dest, int sendtag, int source, int recvtag, MPI_Comm comm,
              MPI_Status *status)

MPI_Sendrecv_replace(buf, count, datatype, dest, sendtag, source, recvtag,
              comm, status, ierror) BIND(C)
    TYPE(*), DIMENSION(..) :: buf
    INTEGER, INTENT(IN) :: count, dest, sendtag, source, recvtag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Status) :: status
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG,
              COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM,
    STATUS(MPI_STATUS_SIZE), IERROR

Execute a blocking send and receive. The same buffer is used both for the send and for the receive, so that the message sent is replaced by the message received.

Advice to implementors. Additional intermediate buffering is needed for the "replace" variant. (End of advice to implementors.)
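The following C sketch, not part of the standard text, shows a circular shift around a ring implemented with MPI_SENDRECV_REPLACE, leaving the deadlock-free ordering of the send and the receive to the MPI library. The tag value 0 and the use of MPI_COMM_WORLD are illustrative assumptions.

/* Sketch (not standard text): circular ring shift with the "replace"
 * variant; buf is sent to the right neighbor and overwritten with the
 * message received from the left neighbor. */
#include <mpi.h>

void ring_shift(double *buf, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination of our data */
    int left  = (rank - 1 + size) % size;   /* source of our new data */
    MPI_Status status;

    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, right, 0, left, 0,
                         MPI_COMM_WORLD, &status);
}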

3.11 Null Processes

In many instances, it is convenient to specify a "dummy" source or destination for communication. This simplifies the code that is needed for dealing with boundaries, for example, in the case of a non-circular shift done with calls to send-receive.

The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a call. A communication with process MPI_PROC_NULL has no effect. A send to MPI_PROC_NULL succeeds and returns as soon as possible. A receive from MPI_PROC_NULL succeeds and returns as soon as possible with no modifications to the receive buffer. When a receive with source = MPI_PROC_NULL is executed then the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG and count = 0. A probe or matching probe with source = MPI_PROC_NULL succeeds and returns as soon as possible, and the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG and count = 0. A matching probe (cf. Section 3.8.2) with MPI_PROC_NULL as source returns flag = true, message = MPI_MESSAGE_NO_PROC, and the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG, and count = 0.
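The following C sketch, not part of the standard text, shows the boundary simplification described above: a non-circular shift along a chain of processes in which the first and last ranks pass MPI_PROC_NULL instead of a real neighbor, so no special-case branching is needed. The tag value 0 and the use of MPI_COMM_WORLD are illustrative assumptions.

/* Sketch (not standard text): non-circular chain shift using MPI_PROC_NULL
 * at the boundaries. */
#include <mpi.h>

void chain_shift(const double *sendbuf, double *recvbuf, int n)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    MPI_Status status;

    /* At the boundaries the send and/or receive return as soon as possible;
     * a receive from MPI_PROC_NULL leaves recvbuf unmodified and the status
     * reports source = MPI_PROC_NULL and count = 0. */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                 recvbuf, n, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, &status);
}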


Chapter 4

Datatypes

Basic datatypes were introduced in Section 3.2.2 Message Data on page 25 and in Section 3.3 Data Type Matching and Data Conversion on page 33. In this chapter, this model is extended to describe any data layout. We consider general datatypes that allow one to transfer efficiently heterogeneous and noncontiguous data. We conclude with the description of calls for explicit packing and unpacking of messages.

4.1 Derived Datatypes

Up to here, all point to point communications have involved only buffers containing a sequence of identical basic datatypes. This is too constraining on two accounts. One often wants to pass messages that contain values with different datatypes (e.g., an integer count, followed by a sequence of real numbers); and one often wants to send noncontiguous data (e.g., a sub-block of a matrix). One solution is to pack noncontiguous data into a contiguous buffer at the sender site and unpack it at the receiver site. This has the disadvantage of requiring additional memory-to-memory copy operations at both sites, even when the communication subsystem has scatter-gather capabilities. Instead, MPI provides mechanisms to specify more general, mixed, and noncontiguous communication buffers. It is up to the implementation to decide whether data should be first packed in a contiguous buffer before being transmitted, or whether it can be collected directly from where it resides.

The general mechanisms provided here allow one to transfer directly, without copying, objects of various shapes and sizes. It is not assumed that the MPI library is cognizant of the objects declared in the host language. Thus, if one wants to transfer a structure, or an array section, it will be necessary to provide in MPI a definition of a communication buffer that mimics the definition of the structure or array section in question. These facilities can be used by library designers to define communication functions that can transfer objects defined in the host language, by decoding their definitions as available in a symbol table or a dope vector. Such higher-level communication functions are not part of MPI.

More general communication buffers are specified by replacing the basic datatypes that have been used so far with derived datatypes that are constructed from basic datatypes using the constructors described in this section. These methods of constructing derived datatypes can be applied recursively.

A general datatype is an opaque object that specifies two things:

• A sequence of basic datatypes

• A sequence of integer (byte) displacements

The displacements are not required to be positive, distinct, or in increasing order. Therefore, the order of items need not coincide with their order in store, and an item may appear more than once. We call such a pair of sequences (or sequence of pairs) a type map. The sequence of basic datatypes (displacements ignored) is the type signature of the datatype.

Let

    Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})}

be such a type map, where type_i are basic types, and disp_i are displacements. Let

    Typesig = {type_0, ..., type_{n-1}}

be the associated type signature. This type map, together with a base address buf, specifies a communication buffer: the communication buffer that consists of n entries, where the i-th entry is at address buf + disp_i and has type type_i. A message assembled from such a communication buffer will consist of n values, of the types defined by Typesig.

Most datatype constructors have replication count or block length arguments. Allowed values are non-negative integers. If the value is zero, no elements are generated in the type map and there is no effect on datatype bounds or extent.

We can use a handle to a general datatype as an argument in a send or receive operation, instead of a basic datatype argument. The operation MPI_SEND(buf, 1, datatype,...) will use the send buffer defined by the base address buf and the general datatype associated with datatype; it will generate a message with the type signature determined by the datatype argument. MPI_RECV(buf, 1, datatype,...) will use the receive buffer defined by the base address buf and the general datatype associated with datatype.

General datatypes can be used in all send and receive operations. We discuss, in Section 4.1.11, the case where the second argument count has value > 1.

The basic datatypes presented in Section 3.2.2 are particular cases of a general datatype, and are predefined. Thus, MPI_INT is a predefined handle to a datatype with type map {(int, 0)}, with one entry of type int and displacement zero. The other basic datatypes are similar.

The extent of a datatype is defined to be the span from the first byte to the last byte occupied by entries in this datatype, rounded up to satisfy alignment requirements. That is, if

    Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

then

    lb(Typemap)     = min_j disp_j,
    ub(Typemap)     = max_j (disp_j + sizeof(type_j)) + epsilon, and
    extent(Typemap) = ub(Typemap) - lb(Typemap).                    (4.1)

If type_i requires alignment to a byte address that is a multiple of k_i, then epsilon is the least non-negative increment needed to round extent(Typemap) to the next multiple of max_i k_i. In Fortran, it is implementation dependent whether the MPI implementation computes the alignments k_i according to the alignments used by the compiler in common blocks,

115 4.1. DERIVED DATATYPES 85 1 BIND(C) derived types, derived types, or derived types that are neither SEQUENCE SEQUENCE 2 extent nor BIND(C) is given in Section 4.1.6 on page 104. . The complete definition of 3 4 at displacement zero, ( char , 8) } (a double 0) , = Example 4.1 Assume that Type double ( , { 5 followed by a char at displacement eight). Assume, furthermore, that doubles have to be 6 strictly aligned at addresses that are multiples of eight. Then, the extent of this datatype is 7 16 (9 rounded to the next multiple of 8). A datatype that consists of a character immediately 8 followed by a double will also have an extent of 16. 9 The definition of extent is motivated by the assumption that the amount Rationale. 10 of padding added at the end of each structure in an array of structures is the least 11 needed to fulfill alignment constraints. More explicit control of the extent is provided 12 in Section 4.1.6. Such explicit control is needed in cases where the assumption does not 13 hold, for example, where union types are used. In Fortran, structures can be expressed 14 SEQUENCE with several language features, e.g., common blocks, derived types, or 15 derived types. The compiler may use different alignments, and therefore, BIND(C) 16 CREATE it is recommended to use for arrays of structures if TYPE _ MPI _ RESIZED _ 17 an alignment may cause an alignment-gap at the end of a structure as described in 18 End of rationale. Section 4.1.6 on page 104 and in Section 17.1.15 on page 629. ( ) 19 20 21 4.1.1 Type Constructors with Explicit Addresses 22 HVECTOR In Fortran, the functions _ CREATE , TYPE _ MPI _ 23 _ MPI CREATE _ TYPE _ MPI HINDEXED _ BLOCK , HINDEXED , _ TYPE _ CREATE _ 24 GET _ ADDRESS accept arguments of type MPI _ TYPE _ CREATE _ STRUCT , and MPI _ 25 _ MPI are used in C. Aint , wherever arguments of type INTEGER(KIND=MPI_ADDRESS_KIND) 26 notation, and where KIND On Fortran 77 systems that do not support the Fortran 90 27 addresses are 64 bits whereas default INTEGER s are 32 bits, these arguments will be of type 28 . INTEGER*8 29 30 Datatype Constructors 4.1.2 31 32 _ MPI The simplest datatype constructor is Contiguous CONTIGUOUS _ TYPE which allows 33 replication of a datatype into contiguous locations. 34 35 _ MPI _ TYPE CONTIGUOUS(count, oldtype, newtype) 36 37 replication count (non-negative integer) count IN 38 IN oldtype old datatype (handle) 39 OUT newtype new datatype (handle) 40 41 42 int MPI_Type_contiguous(int count, MPI_Datatype oldtype, 43 MPI_Datatype *newtype) 44 MPI_Type_contiguous(count, oldtype, newtype, ierror) BIND(C) 45 INTEGER, INTENT(IN) :: count 46 TYPE(MPI_Datatype), INTENT(IN) :: oldtype 47 TYPE(MPI_Datatype), INTENT(OUT) :: newtype 48

116 86 CHAPTER 4. DATATYPES 1 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 2 MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR) 3 INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR 4 5 count copies of newtype is the datatype obtained by concatenating 6 as the size of the concatenated copies. extent oldtype . Concatenation is defined using 7 8 char , 8) } , with extent 16, and let Example 4.2 Let oldtype have type map { ( double , 0) , ( 9 is newtype = 3. The type map of the datatype returned by count 10 double , 16) , ( char , 24) , ( double , 32) , ( char , 40) } ; ( double , 0) , ( char , 8) , { ( 11 12 , 40. i.e., alternating double and char elements, with displacements 0 , 8 , 16 , 24 , 32 13 14 In general, assume that the type map of oldtype is 15 16 { ( ,disp ,disp ) ,..., ( } , type ) type 1 n n 0 − 0 1 − 17 18 count entries defined by: n · with extent has a type map with newtype . Then ex 19 type ,disp + ex ) , ) ,disp , ,..., ) ,disp type ( { ( type type ( + ex ) ,..., ( ,disp 0 n n − 1 − n − 1 0 1 0 n − 1 0 20 21 ( · ( count − 1)) } . ,disp ex ..., + type ,disp + ex · ( count − 1)) ,..., ( type − 0 0 n − 1 n 1 22 23 24 25 26 Vector The function MPI _ TYPE _ is a more general constructor that allows repli- VECTOR 27 cation of a datatype into locations that consist of equally spaced blocks. Each block is 28 obtained by concatenating the same number of copies of the old datatype. The spacing 29 between blocks is a multiple of the extent of the old datatype. 30 31 TYPE _ MPI VECTOR(count, blocklength, stride, oldtype, newtype) _ 32 33 IN count number of blocks (non-negative integer) 34 IN blocklength number of elements in each block (non-negative inte- 35 ger) 36 IN stride number of elements between start of each block (inte- 37 ger) 38 39 IN old datatype (handle) oldtype 40 OUT newtype new datatype (handle) 41 42 int MPI_Type_vector(int count, int blocklength, int stride, 43 MPI_Datatype oldtype, MPI_Datatype *newtype) 44 45 MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) 46 BIND(C) 47 INTEGER, INTENT(IN) :: count, blocklength, stride 48 TYPE(MPI_Datatype), INTENT(IN) :: oldtype

117 4.1. DERIVED DATATYPES 87 1 TYPE(MPI_Datatype), INTENT(OUT) :: newtype 2 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 3 MPI_TYPE_VECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR) 4 INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR 5 6 7 { char , 8) } , with extent has type map oldtype Assume, again, that , 0) , ( ( Example 4.3 double 8 _ 16. A call to MPI TYPE _ VECTOR(2, 3, 4, oldtype, newtype) will create the datatype with 9 type map, 10 32) 16) , ( char , 24) , ( double , , , ( char , 40) , { double 0) , ( char , 8) , ( double ( , 11 12 , ( char , 88) , ( double , 96) , ( char , 104) } . double , 64) , ( ( , 72) , ( double , 80) char 13 14 16 That is, two blocks with three copies each of the old type, with a stride of 4 elements (4 · 15 bytes) between the the start of each block. 16 17 _ _ TYPE A call to VECTOR(3, 1, -2, oldtype, newtype) Example 4.4 MPI will create the 18 datatype, 19 − 24) { ( double , − 64) , ( char , − 56) } . ( double , 0) , ( char , 8) , ( double , − 32) , ( char , , 20 21 22 In general, assume that oldtype has type map, 23 24 ,disp } ,..., ( type ,disp , ) { ( type ) n 0 0 n − 1 − 1 25 blocklength with extent ex . Let bl be the . The newly created datatype has a type map with 26 count · bl · n entries: 27 28 ( type ,disp ) ,..., ( type ,disp ) , { n 1 1 0 n 0 − − 29 30 ( type ,..., ) ex + ,disp + ex ,disp ) ( ,..., type n 1 n − 0 1 0 − 31 32 type ) ,..., ( type ex ,disp ( ,disp + ( bl − + ( bl − 1) · ex ) , 1) · 1 − n 1 − n 0 0 33 ) ex · stride + ,disp type ( ,..., ( type ,disp + stride · ex ) ,..., 34 1 0 0 n − 1 n − 35 stride type ( ,..., · ) ex ex · 1) bl − bl type + ) ,..., + ( ,disp ( + 1) − stride + ( ,disp 0 n − 1 1 − n 0 36 37 ( + ,..., count − 1) · ) · ,disp stride ex type ( 0 0 38 39 ( count − ,disp · ex ) ,..., 1) ( type + stride · n − 1 1 − n 40 41 − bl 1) + − ( · count + ( ,..., ,disp ) type ( ex · 1) stride 0 0 42 type ( ,disp + ( stride · ( count − 1) + bl − 1) · ex ) } . 43 n − 1 n − 1 44 45 46 A call to MPI _ TYPE _ CONTIGUOUS(count, oldtype, newtype) is equivalent to a call to 47 MPI _ _ VECTOR(count, 1, 1, oldtype, newtype) MPI TYPE _ TYPE _ VECTOR(1, , or to a call to 48 n count, n, oldtype, newtype) , arbitrary.
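The following C fragment is a non-normative sketch of the vector constructor just described: it builds a datatype for one column of a row-major n × m matrix of doubles and uses it in a send. The function name, the matrix arguments, and the tag value are assumptions made only for illustration.

#include <mpi.h>

/* Non-normative sketch: send one column of a row-major n x m matrix of
   doubles stored contiguously in `a`.  n, m, col, dest, tag 0, and comm
   are assumptions made for this example. */
static void send_column(const double *a, int n, int m, int col,
                        int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* n blocks of 1 double each; consecutive blocks start m elements
       apart, i.e., one element per matrix row. */
    MPI_Type_vector(n, 1, m, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* The first element of the column is a[col]; the derived datatype
       supplies the layout of the remaining n-1 elements. */
    MPI_Send(a + col, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}

Because the stride is counted in elements (multiples of the extent of MPI_DOUBLE), the same layout could also be described with MPI_TYPE_CREATE_HVECTOR, defined next, using a byte stride of m * sizeof(double).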

Hvector   The function MPI_TYPE_CREATE_HVECTOR is identical to MPI_TYPE_VECTOR, except that stride is given in bytes, rather than in elements. The use for both types of vector constructors is illustrated in Section 4.1.14. (H stands for "heterogeneous").

MPI_TYPE_CREATE_HVECTOR(count, blocklength, stride, oldtype, newtype)
  IN   count        number of blocks (non-negative integer)
  IN   blocklength  number of elements in each block (non-negative integer)
  IN   stride       number of bytes between start of each block (integer)
  IN   oldtype      old datatype (handle)
  OUT  newtype      new datatype (handle)

int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_hvector(count, blocklength, stride, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, blocklength
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: stride
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) STRIDE

Assume that oldtype has type map,

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let bl be the blocklength. The newly created datatype has a type map with count · bl · n entries:

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1}),
 (type_0, disp_0 + ex), ..., (type_{n-1}, disp_{n-1} + ex), ...,
 (type_0, disp_0 + (bl − 1) · ex), ..., (type_{n-1}, disp_{n-1} + (bl − 1) · ex),
 (type_0, disp_0 + stride), ..., (type_{n-1}, disp_{n-1} + stride), ...,
 (type_0, disp_0 + stride + (bl − 1) · ex), ..., (type_{n-1}, disp_{n-1} + stride + (bl − 1) · ex), ...,
 (type_0, disp_0 + stride · (count − 1)), ..., (type_{n-1}, disp_{n-1} + stride · (count − 1)), ...,
 (type_0, disp_0 + stride · (count − 1) + (bl − 1) · ex), ..., (type_{n-1}, disp_{n-1} + stride · (count − 1) + (bl − 1) · ex)}.

Indexed   The function MPI_TYPE_INDEXED allows replication of an old datatype into a sequence of blocks (each block is a concatenation of the old datatype), where each block can contain a different number of copies and have a different displacement. All block displacements are multiples of the old type extent.

MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)
  IN   count                    number of blocks — also number of entries in array_of_displacements and array_of_blocklengths (non-negative integer)
  IN   array_of_blocklengths    number of elements per block (array of non-negative integers)
  IN   array_of_displacements   displacement for each block, in multiples of oldtype extent (array of integer)
  IN   oldtype                  old datatype (handle)
  OUT  newtype                  new datatype (handle)

int MPI_Type_indexed(int count, const int array_of_blocklengths[], const int array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, array_of_blocklengths(count), array_of_displacements(count)
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_INDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR

Example 4.5 Let oldtype have type map {(double, 0), (char, 8)}, with extent 16. Let B = (3, 1) and let D = (4, 0). A call to MPI_TYPE_INDEXED(2, B, D, oldtype, newtype) returns a datatype with type map,

{(double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104),
 (double, 0), (char, 8)}.

That is, three copies of the old type starting at displacement 64, and one copy starting at displacement 0.

In general, assume that oldtype has type map,

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. The newly created datatype has n · Σ_{i=0}^{count−1} B[i] entries:

{(type_0, disp_0 + D[0] · ex), ..., (type_{n-1}, disp_{n-1} + D[0] · ex), ...,
 (type_0, disp_0 + (D[0] + B[0] − 1) · ex), ..., (type_{n-1}, disp_{n-1} + (D[0] + B[0] − 1) · ex), ...,
 (type_0, disp_0 + D[count-1] · ex), ..., (type_{n-1}, disp_{n-1} + D[count-1] · ex), ...,
 (type_0, disp_0 + (D[count-1] + B[count-1] − 1) · ex), ..., (type_{n-1}, disp_{n-1} + (D[count-1] + B[count-1] − 1) · ex)}.

A call to MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype) is equivalent to a call to MPI_TYPE_INDEXED(count, B, D, oldtype, newtype) where

D[j] = j · stride, j = 0, ..., count − 1,

and

B[j] = blocklength, j = 0, ..., count − 1.

Hindexed   The function MPI_TYPE_CREATE_HINDEXED is identical to MPI_TYPE_INDEXED, except that block displacements in array_of_displacements are specified in bytes, rather than in multiples of the oldtype extent.

MPI_TYPE_CREATE_HINDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)
  IN   count                    number of blocks — also number of entries in array_of_displacements and array_of_blocklengths (non-negative integer)
  IN   array_of_blocklengths    number of elements in each block (array of non-negative integers)
  IN   array_of_displacements   byte displacement of each block (array of integer)
  IN   oldtype                  old datatype (handle)
  OUT  newtype                  new datatype (handle)

int MPI_Type_create_hindexed(int count, const int array_of_blocklengths[], const MPI_Aint array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_hindexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, array_of_blocklengths(count)
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: array_of_displacements(count)
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_DISPLACEMENTS(*)

Assume that oldtype has type map,

{(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent ex. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. The newly created datatype has a type map with n · Σ_{i=0}^{count−1} B[i] entries:

{(type_0, disp_0 + D[0]), ..., (type_{n-1}, disp_{n-1} + D[0]), ...,
 (type_0, disp_0 + D[0] + (B[0] − 1) · ex), ..., (type_{n-1}, disp_{n-1} + D[0] + (B[0] − 1) · ex), ...,
 (type_0, disp_0 + D[count-1]), ..., (type_{n-1}, disp_{n-1} + D[count-1]), ...,
 (type_0, disp_0 + D[count-1] + (B[count-1] − 1) · ex), ..., (type_{n-1}, disp_{n-1} + D[count-1] + (B[count-1] − 1) · ex)}.

Indexed_block   This function is the same as MPI_TYPE_INDEXED except that the blocklength is the same for all blocks. There are many codes using indirect addressing arising from unstructured grids where the blocksize is always 1 (gather/scatter). The following convenience function allows for constant blocksize and arbitrary displacements.
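As a non-normative sketch of this gather/scatter pattern, the following C fragment uses MPI_TYPE_INDEXED with all block lengths equal to 1 to send scattered elements of an array in a single message; the constructor defined next expresses the same layout without the auxiliary block-length array. The index array, its length, and the communication arguments are assumptions for the example.

#include <stdlib.h>
#include <mpi.h>

/* Non-normative sketch: send the elements a[idx[0]], ..., a[idx[k-1]]
   of a double array with a single message.  idx, k, dest, tag 0, and
   comm are assumptions made for this example. */
static void send_scattered(const double *a, const int *idx, int k,
                           int dest, MPI_Comm comm)
{
    MPI_Datatype scattered;
    int *blocklens = malloc(k * sizeof(int));

    for (int i = 0; i < k; i++)
        blocklens[i] = 1;                 /* blocksize is always 1 here */

    /* Displacements are given in multiples of the extent of MPI_DOUBLE. */
    MPI_Type_indexed(k, blocklens, idx, MPI_DOUBLE, &scattered);
    MPI_Type_commit(&scattered);
    MPI_Send(a, 1, scattered, dest, 0, comm);
    MPI_Type_free(&scattered);
    free(blocklens);
}

With MPI_TYPE_CREATE_INDEXED_BLOCK, defined next, the construction would reduce to MPI_Type_create_indexed_block(k, 1, idx, MPI_DOUBLE, &scattered).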

MPI_TYPE_CREATE_INDEXED_BLOCK(count, blocklength, array_of_displacements, oldtype, newtype)
  IN   count                    length of array of displacements (non-negative integer)
  IN   blocklength              size of block (non-negative integer)
  IN   array_of_displacements   array of displacements (array of integer)
  IN   oldtype                  old datatype (handle)
  OUT  newtype                  new datatype (handle)

int MPI_Type_create_indexed_block(int count, int blocklength, const int array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_indexed_block(count, blocklength, array_of_displacements, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, blocklength, array_of_displacements(count)
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_INDEXED_BLOCK(COUNT, BLOCKLENGTH, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR

Hindexed_block   The function MPI_TYPE_CREATE_HINDEXED_BLOCK is identical to MPI_TYPE_CREATE_INDEXED_BLOCK, except that block displacements in array_of_displacements are specified in bytes, rather than in multiples of the oldtype extent.

MPI_TYPE_CREATE_HINDEXED_BLOCK(count, blocklength, array_of_displacements, oldtype, newtype)
  IN   count                    length of array of displacements (non-negative integer)
  IN   blocklength              size of block (non-negative integer)
  IN   array_of_displacements   byte displacement of each block (array of integer)
  IN   oldtype                  old datatype (handle)
  OUT  newtype                  new datatype (handle)

int MPI_Type_create_hindexed_block(int count, int blocklength, const MPI_Aint array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_hindexed_block(count, blocklength, array_of_displacements, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, blocklength
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: array_of_displacements(count)
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_HINDEXED_BLOCK(COUNT, BLOCKLENGTH, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_DISPLACEMENTS(*)

Struct   MPI_TYPE_CREATE_STRUCT is the most general type constructor. It further generalizes MPI_TYPE_CREATE_HINDEXED in that it allows each block to consist of replications of different datatypes.

MPI_TYPE_CREATE_STRUCT(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype)
  IN   count                    number of blocks (non-negative integer) — also number of entries in arrays array_of_types, array_of_displacements and array_of_blocklengths
  IN   array_of_blocklengths    number of elements in each block (array of non-negative integer)
  IN   array_of_displacements   byte displacement of each block (array of integer)
  IN   array_of_types           type of elements in each block (array of handles to datatype objects)
  OUT  newtype                  new datatype (handle)

int MPI_Type_create_struct(int count, const int array_of_blocklengths[], const MPI_Aint array_of_displacements[], const MPI_Datatype array_of_types[], MPI_Datatype *newtype)

MPI_Type_create_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, array_of_blocklengths(count)
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: array_of_displacements(count)
    TYPE(MPI_Datatype), INTENT(IN) :: array_of_types(count)
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, ARRAY_OF_TYPES, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_TYPES(*), NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_DISPLACEMENTS(*)

Example 4.6 Let type1 have type map,

{(double, 0), (char, 8)},

with extent 16. Let B = (2, 1, 3), D = (0, 16, 26), and T = (MPI_FLOAT, type1, MPI_CHAR). Then a call to MPI_TYPE_CREATE_STRUCT(3, B, D, T, newtype) returns a datatype with type map,

{(float, 0), (float, 4), (double, 16), (char, 24), (char, 26), (char, 27), (char, 28)}.

That is, two copies of MPI_FLOAT starting at 0, followed by one copy of type1 starting at 16, followed by three copies of MPI_CHAR, starting at 26. (We assume that a float occupies four bytes.)

In general, let T be the array_of_types argument, where T[i] is a handle to,

typemap_i = {(type_0^i, disp_0^i), ..., (type_{n_i−1}^i, disp_{n_i−1}^i)},

with extent ex_i. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. Let c be the count argument. Then the newly created datatype has a type map with Σ_{i=0}^{c−1} B[i] · n_i entries:

{(type_0^0, disp_0^0 + D[0]), ..., (type_{n_0−1}^0, disp_{n_0−1}^0 + D[0]), ...,
 (type_0^0, disp_0^0 + D[0] + (B[0] − 1) · ex_0), ..., (type_{n_0−1}^0, disp_{n_0−1}^0 + D[0] + (B[0] − 1) · ex_0), ...,
 (type_0^{c−1}, disp_0^{c−1} + D[c-1]), ..., (type_{n_{c−1}−1}^{c−1}, disp_{n_{c−1}−1}^{c−1} + D[c-1]), ...,
 (type_0^{c−1}, disp_0^{c−1} + D[c-1] + (B[c-1] − 1) · ex_{c−1}), ..., (type_{n_{c−1}−1}^{c−1}, disp_{n_{c−1}−1}^{c−1} + D[c-1] + (B[c-1] − 1) · ex_{c−1})}.

A call to MPI_TYPE_CREATE_HINDEXED(count, B, D, oldtype, newtype) is equivalent to a call to MPI_TYPE_CREATE_STRUCT(count, B, D, T, newtype), where each entry of T is equal to oldtype.
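As a non-normative C sketch of the struct constructor, the following fragment builds a datatype matching a C structure, using offsetof from <stddef.h> to obtain the byte displacements (MPI_GET_ADDRESS, introduced in Section 4.1.5, could be used instead). The structure definition and function name are assumptions made only for illustration.

#include <stddef.h>
#include <mpi.h>

/* Non-normative sketch: a datatype matching this C structure.  The
   structure itself is an assumption made for the example. */
struct particle {
    double coords[3];
    int    charge;
};

static MPI_Datatype make_particle_type(void)
{
    MPI_Datatype newtype;
    int          blocklens[2] = { 3, 1 };
    MPI_Aint     displs[2]    = { offsetof(struct particle, coords),
                                  offsetof(struct particle, charge) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Type_create_struct(2, blocklens, displs, types, &newtype);
    /* If arrays of struct particle are communicated, the extent may also
       need to be forced to sizeof(struct particle); see
       MPI_TYPE_CREATE_RESIZED in Section 4.1.7. */
    MPI_Type_commit(&newtype);
    return newtype;
}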

4.1.3 Subarray Datatype Constructor

MPI_TYPE_CREATE_SUBARRAY(ndims, array_of_sizes, array_of_subsizes, array_of_starts, order, oldtype, newtype)
  IN   ndims              number of array dimensions (positive integer)
  IN   array_of_sizes     number of elements of type oldtype in each dimension of the full array (array of positive integers)
  IN   array_of_subsizes  number of elements of type oldtype in each dimension of the subarray (array of positive integers)
  IN   array_of_starts    starting coordinates of the subarray in each dimension (array of non-negative integers)
  IN   order              array storage order flag (state)
  IN   oldtype            array element datatype (handle)
  OUT  newtype            new datatype (handle)

int MPI_Type_create_subarray(int ndims, const int array_of_sizes[], const int array_of_subsizes[], const int array_of_starts[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_subarray(ndims, array_of_sizes, array_of_subsizes, array_of_starts, order, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: ndims, array_of_sizes(ndims), array_of_subsizes(ndims), array_of_starts(ndims), order
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_SUBARRAY(NDIMS, ARRAY_OF_SIZES, ARRAY_OF_SUBSIZES, ARRAY_OF_STARTS, ORDER, OLDTYPE, NEWTYPE, IERROR)
    INTEGER NDIMS, ARRAY_OF_SIZES(*), ARRAY_OF_SUBSIZES(*), ARRAY_OF_STARTS(*), ORDER, OLDTYPE, NEWTYPE, IERROR

The subarray type constructor creates an MPI datatype describing an n-dimensional subarray of an n-dimensional array. The subarray may be situated anywhere within the full array, and may be of any nonzero size up to the size of the larger array as long as it is confined within this array. This type constructor facilitates creating filetypes to access arrays distributed in blocks among processes to a single file that contains the global array, see MPI I/O, especially Section 13.1.1 on page 489.

This type constructor can handle arrays with an arbitrary number of dimensions and works for both C and Fortran ordered matrices (i.e., row-major or column-major). Note that a C program may use Fortran order and a Fortran program may use C order.

The ndims parameter specifies the number of dimensions in the full data array and gives the number of elements in array_of_sizes, array_of_subsizes, and array_of_starts.

The number of elements of type oldtype in each dimension of the n-dimensional array and the requested subarray are specified by array_of_sizes and array_of_subsizes, respectively. For any dimension i, it is erroneous to specify array_of_subsizes[i] < 1 or array_of_subsizes[i] > array_of_sizes[i].

The array_of_starts contains the starting coordinates of each dimension of the subarray. Arrays are assumed to be indexed starting from zero. For any dimension i, it is erroneous to specify array_of_starts[i] < 0 or array_of_starts[i] > (array_of_sizes[i] − array_of_subsizes[i]).

Advice to users. In a Fortran program with arrays indexed starting from 1, if the starting coordinate of a particular dimension of the subarray is n, then the entry in array_of_starts for that dimension is n-1. (End of advice to users.)

The order argument specifies the storage order for the subarray as well as the full array. It must be set to one of the following:

MPI_ORDER_C        The ordering used by C arrays (i.e., row-major order)
MPI_ORDER_FORTRAN  The ordering used by Fortran arrays (i.e., column-major order)

A ndims-dimensional subarray (newtype) with no extra padding can be defined by the function Subarray() as follows:

newtype = Subarray(ndims, {size_0, size_1, ..., size_{ndims−1}},
                   {subsize_0, subsize_1, ..., subsize_{ndims−1}},
                   {start_0, start_1, ..., start_{ndims−1}}, oldtype)

Let the typemap of oldtype have the form:

{(type_0, disp_0), (type_1, disp_1), ..., (type_{n-1}, disp_{n-1})},

where type_i is a predefined MPI datatype, and let ex be the extent of oldtype. Then we define the Subarray() function recursively using the following three equations. Equation 4.2 defines the base step. Equation 4.3 defines the recursion step when order = MPI_ORDER_FORTRAN, and Equation 4.4 defines the recursion step when order = MPI_ORDER_C. These equations use the conceptual datatypes lb_marker and ub_marker, see Section 4.1.6 on page 104 for details.

Subarray(1, {size_0}, {subsize_0}, {start_0},
         {(type_0, disp_0), (type_1, disp_1), ..., (type_{n-1}, disp_{n-1})})
  = {(lb_marker, 0),
     (type_0, disp_0 + start_0 × ex), ..., (type_{n-1}, disp_{n-1} + start_0 × ex),
     (type_0, disp_0 + (start_0 + 1) × ex), ..., (type_{n-1}, disp_{n-1} + (start_0 + 1) × ex), ...,
     (type_0, disp_0 + (start_0 + subsize_0 − 1) × ex), ..., (type_{n-1}, disp_{n-1} + (start_0 + subsize_0 − 1) × ex),
     (ub_marker, size_0 × ex)}                                                            (4.2)

Subarray(ndims, {size_0, size_1, ..., size_{ndims−1}},
         {subsize_0, subsize_1, ..., subsize_{ndims−1}},
         {start_0, start_1, ..., start_{ndims−1}}, oldtype)
  = Subarray(ndims − 1, {size_1, size_2, ..., size_{ndims−1}},
             {subsize_1, subsize_2, ..., subsize_{ndims−1}},
             {start_1, start_2, ..., start_{ndims−1}},
             Subarray(1, {size_0}, {subsize_0}, {start_0}, oldtype))                      (4.3)

Subarray(ndims, {size_0, size_1, ..., size_{ndims−1}},
         {subsize_0, subsize_1, ..., subsize_{ndims−1}},
         {start_0, start_1, ..., start_{ndims−1}}, oldtype)
  = Subarray(ndims − 1, {size_0, size_1, ..., size_{ndims−2}},
             {subsize_0, subsize_1, ..., subsize_{ndims−2}},
             {start_0, start_1, ..., start_{ndims−2}},
             Subarray(1, {size_{ndims−1}}, {subsize_{ndims−1}}, {start_{ndims−1}}, oldtype))   (4.4)

For an example use of MPI_TYPE_CREATE_SUBARRAY in the context of I/O see Section 13.9.2.

4.1.4 Distributed Array Datatype Constructor

The distributed array type constructor supports HPF-like [42] data distributions. However, unlike in HPF, the storage order may be specified for C arrays as well as for Fortran arrays.

Advice to users. One can create an HPF-like file view using this type constructor as follows. Complementary filetypes are created by having every process of a group call this constructor with identical arguments (with the exception of rank, which should be set appropriately). These filetypes (along with identical disp and etype) are then used to define the view (via MPI_FILE_SET_VIEW), see MPI I/O, especially Section 13.1.1 on page 489 and Section 13.3 on page 501. Using this view, a collective data access operation (with identical offsets) will yield an HPF-like distribution pattern. (End of advice to users.)
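As a non-normative C sketch of the subarray constructor of Section 4.1.3, the following fragment builds the filetype seen by one process of a two-dimensional block decomposition, of the kind used with MPI_FILE_SET_VIEW as described above. The global size, process-grid shape, divisibility assumption, and C storage order are choices made only for this example.

#include <mpi.h>

/* Non-normative sketch: the local piece of a 2-D block decomposition of a
   gsize x gsize global array of doubles over a prows x pcols process grid.
   gsize, prows, pcols, and the grid coordinates (prow, pcol) of the calling
   process are assumptions; gsize is taken to be divisible by prows and
   pcols to keep the sketch short. */
static MPI_Datatype block_filetype(int gsize, int prows, int pcols,
                                   int prow, int pcol)
{
    int sizes[2]    = { gsize, gsize };
    int subsizes[2] = { gsize / prows, gsize / pcols };
    int starts[2]   = { prow * subsizes[0], pcol * subsizes[1] };
    MPI_Datatype filetype;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    return filetype;   /* e.g., for use with MPI_FILE_SET_VIEW */
}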

MPI_TYPE_CREATE_DARRAY(size, rank, ndims, array_of_gsizes, array_of_distribs, array_of_dargs, array_of_psizes, order, oldtype, newtype)
  IN   size               size of process group (positive integer)
  IN   rank               rank in process group (non-negative integer)
  IN   ndims              number of array dimensions as well as process grid dimensions (positive integer)
  IN   array_of_gsizes    number of elements of type oldtype in each dimension of global array (array of positive integers)
  IN   array_of_distribs  distribution of array in each dimension (array of state)
  IN   array_of_dargs     distribution argument in each dimension (array of positive integers)
  IN   array_of_psizes    size of process grid in each dimension (array of positive integers)
  IN   order              array storage order flag (state)
  IN   oldtype            old datatype (handle)
  OUT  newtype            new datatype (handle)

int MPI_Type_create_darray(int size, int rank, int ndims, const int array_of_gsizes[], const int array_of_distribs[], const int array_of_dargs[], const int array_of_psizes[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_create_darray(size, rank, ndims, array_of_gsizes, array_of_distribs, array_of_dargs, array_of_psizes, order, oldtype, newtype, ierror) BIND(C)
    INTEGER, INTENT(IN) :: size, rank, ndims, array_of_gsizes(ndims), array_of_distribs(ndims), array_of_dargs(ndims), array_of_psizes(ndims), order
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_DARRAY(SIZE, RANK, NDIMS, ARRAY_OF_GSIZES, ARRAY_OF_DISTRIBS, ARRAY_OF_DARGS, ARRAY_OF_PSIZES, ORDER, OLDTYPE, NEWTYPE, IERROR)
    INTEGER SIZE, RANK, NDIMS, ARRAY_OF_GSIZES(*), ARRAY_OF_DISTRIBS(*), ARRAY_OF_DARGS(*), ARRAY_OF_PSIZES(*), ORDER, OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_CREATE_DARRAY can be used to generate the datatypes corresponding to the distribution of an ndims-dimensional array of oldtype elements onto an ndims-dimensional grid of logical processes. Unused dimensions of array_of_psizes should be set to 1. (See Example 4.7, page 101.) For a call to MPI_TYPE_CREATE_DARRAY to be correct, the equation ∏_{i=0}^{ndims−1} array_of_psizes[i] = size must be satisfied. The ordering of processes in the process grid is assumed to be row-major, as in the case of virtual Cartesian process topologies.

Advice to users. For both Fortran and C arrays, the ordering of processes in the process grid is assumed to be row-major. This is consistent with the ordering used in virtual Cartesian process topologies in MPI. To create such virtual process topologies, or to find the coordinates of a process in the process grid, etc., users may use the corresponding process topology functions, see Chapter 7 on page 289. (End of advice to users.)

Each dimension of the array can be distributed in one of three ways:

• MPI_DISTRIBUTE_BLOCK - Block distribution
• MPI_DISTRIBUTE_CYCLIC - Cyclic distribution
• MPI_DISTRIBUTE_NONE - Dimension not distributed.

The constant MPI_DISTRIBUTE_DFLT_DARG specifies a default distribution argument. The distribution argument for a dimension that is not distributed is ignored. For any dimension i in which the distribution is MPI_DISTRIBUTE_BLOCK, it is erroneous to specify array_of_dargs[i] ∗ array_of_psizes[i] < array_of_gsizes[i].

For example, the HPF layout ARRAY(CYCLIC(15)) corresponds to MPI_DISTRIBUTE_CYCLIC with a distribution argument of 15, and the HPF layout ARRAY(BLOCK) corresponds to MPI_DISTRIBUTE_BLOCK with a distribution argument of MPI_DISTRIBUTE_DFLT_DARG.

The order argument is used as in MPI_TYPE_CREATE_SUBARRAY to specify the storage order. Therefore, arrays described by this type constructor may be stored in Fortran (column-major) or C (row-major) order. Valid values for order are MPI_ORDER_FORTRAN and MPI_ORDER_C.

This routine creates a new MPI datatype with a typemap defined in terms of a function called "cyclic()" (see below).

Without loss of generality, it suffices to define the typemap for the MPI_DISTRIBUTE_CYCLIC case where MPI_DISTRIBUTE_DFLT_DARG is not used.

MPI_DISTRIBUTE_BLOCK and MPI_DISTRIBUTE_NONE can be reduced to the MPI_DISTRIBUTE_CYCLIC case for dimension i as follows.

MPI_DISTRIBUTE_BLOCK with array_of_dargs[i] equal to MPI_DISTRIBUTE_DFLT_DARG is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to

(array_of_gsizes[i] + array_of_psizes[i] − 1) / array_of_psizes[i].

If array_of_dargs[i] is not MPI_DISTRIBUTE_DFLT_DARG, then MPI_DISTRIBUTE_BLOCK and MPI_DISTRIBUTE_CYCLIC are equivalent.

MPI_DISTRIBUTE_NONE is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to array_of_gsizes[i].

Finally, MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] equal to MPI_DISTRIBUTE_DFLT_DARG is equivalent to MPI_DISTRIBUTE_CYCLIC with array_of_dargs[i] set to 1.

For MPI_ORDER_FORTRAN, an ndims-dimensional distributed array (newtype) is defined by the following code fragment:

oldtypes[0] = oldtype;
for (i = 0; i < ndims; i++) {
    oldtypes[i+1] = cyclic(array_of_dargs[i],
                           array_of_gsizes[i],
                           r[i],
                           array_of_psizes[i],
                           oldtypes[i]);
}
newtype = oldtypes[ndims];

For MPI_ORDER_C, the code is:

oldtypes[0] = oldtype;
for (i = 0; i < ndims; i++) {
    oldtypes[i + 1] = cyclic(array_of_dargs[ndims - i - 1],
                             array_of_gsizes[ndims - i - 1],
                             r[ndims - i - 1],
                             array_of_psizes[ndims - i - 1],
                             oldtypes[i]);
}
newtype = oldtypes[ndims];

where r[i] is the position of the process (with rank rank) in the process grid at dimension i. The values of r[i] are given by the following code fragment:

t_rank = rank;
t_size = 1;
for (i = 0; i < ndims; i++)
    t_size *= array_of_psizes[i];
for (i = 0; i < ndims; i++) {
    t_size = t_size / array_of_psizes[i];
    r[i] = t_rank / t_size;
    t_rank = t_rank % t_size;
}

Let the typemap of oldtype have the form:

{(type_0, disp_0), (type_1, disp_1), ..., (type_{n-1}, disp_{n-1})},

where type_i is a predefined MPI datatype, and let ex be the extent of oldtype. The following function uses the conceptual datatypes lb_marker and ub_marker, see Section 4.1.6 on page 104 for details.

Given the above, the function cyclic() is defined as follows:

cyclic(darg, gsize, r, psize, oldtype)
  = {(lb_marker, 0),
     (type_0, disp_0 + r × darg × ex), ..., (type_{n-1}, disp_{n-1} + r × darg × ex),
     (type_0, disp_0 + (r × darg + 1) × ex), ..., (type_{n-1}, disp_{n-1} + (r × darg + 1) × ex),
     ...
     (type_0, disp_0 + ((r + 1) × darg − 1) × ex), ..., (type_{n-1}, disp_{n-1} + ((r + 1) × darg − 1) × ex),

     (type_0, disp_0 + r × darg × ex + psize × darg × ex), ..., (type_{n-1}, disp_{n-1} + r × darg × ex + psize × darg × ex),
     (type_0, disp_0 + (r × darg + 1) × ex + psize × darg × ex), ..., (type_{n-1}, disp_{n-1} + (r × darg + 1) × ex + psize × darg × ex),
     ...
     (type_0, disp_0 + ((r + 1) × darg − 1) × ex + psize × darg × ex), ..., (type_{n-1}, disp_{n-1} + ((r + 1) × darg − 1) × ex + psize × darg × ex),
     ...
     (type_0, disp_0 + r × darg × ex + psize × darg × ex × (count − 1)), ..., (type_{n-1}, disp_{n-1} + r × darg × ex + psize × darg × ex × (count − 1)),
     (type_0, disp_0 + (r × darg + 1) × ex + psize × darg × ex × (count − 1)), ..., (type_{n-1}, disp_{n-1} + (r × darg + 1) × ex + psize × darg × ex × (count − 1)),
     ...
     (type_0, disp_0 + (r × darg + darg_last − 1) × ex + psize × darg × ex × (count − 1)), ..., (type_{n-1}, disp_{n-1} + (r × darg + darg_last − 1) × ex + psize × darg × ex × (count − 1)),
     (ub_marker, gsize × ex)}

where count is defined by this code fragment:

nblocks = (gsize + (darg - 1)) / darg;
count = nblocks / psize;
left_over = nblocks - count * psize;
if (r < left_over)
    count = count + 1;

Here, nblocks is the number of blocks that must be distributed among the processors. Finally, darg_last is defined by this code fragment:

if ((num_in_last_cyclic = gsize % (psize * darg)) == 0)
    darg_last = darg;
else
    darg_last = num_in_last_cyclic - darg * r;
if (darg_last > darg)
    darg_last = darg;
if (darg_last <= 0)
    darg_last = darg;

Example 4.7 Consider generating the filetypes corresponding to the HPF distribution:

      FILEARRAY(100, 200, 300)
!HPF$ PROCESSORS PROCESSES(2, 3)
!HPF$ DISTRIBUTE FILEARRAY(CYCLIC(10), *, BLOCK) ONTO PROCESSES

This can be achieved by the following Fortran code, assuming there will be six processes attached to the run:

ndims = 3
array_of_gsizes(1) = 100
array_of_distribs(1) = MPI_DISTRIBUTE_CYCLIC
array_of_dargs(1) = 10
array_of_gsizes(2) = 200
array_of_distribs(2) = MPI_DISTRIBUTE_NONE
array_of_dargs(2) = 0
array_of_gsizes(3) = 300
array_of_distribs(3) = MPI_DISTRIBUTE_BLOCK
array_of_dargs(3) = MPI_DISTRIBUTE_DFLT_DARG
array_of_psizes(1) = 2
array_of_psizes(2) = 1
array_of_psizes(3) = 3
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_TYPE_CREATE_DARRAY(size, rank, ndims, array_of_gsizes, &
     array_of_distribs, array_of_dargs, array_of_psizes, &
     MPI_ORDER_FORTRAN, oldtype, newtype, ierr)

4.1.5 Address and Size Functions

The displacements in a general datatype are relative to some initial buffer address. Absolute addresses can be substituted for these displacements: we treat them as displacements relative to "address zero," the start of the address space. This initial address zero is indicated by the constant MPI_BOTTOM. Thus, a datatype can specify the absolute address of the entries in the communication buffer, in which case the buf argument is passed the value MPI_BOTTOM.

The address of a location in memory can be found by invoking the function MPI_GET_ADDRESS.

MPI_GET_ADDRESS(location, address)
  IN   location  location in caller memory (choice)
  OUT  address   address of location (integer)

int MPI_Get_address(const void *location, MPI_Aint *address)

MPI_Get_address(location, address, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: location
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: address
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
    INTEGER IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ADDRESS

Returns the (byte) address of location.

Advice to users. Current Fortran MPI codes will run unmodified, and will port to any system. However, they may fail if addresses larger than 2^32 − 1 are used in the program. New codes should be written so that they use the new functions. This provides compatibility with C/C++ and avoids errors on 64 bit architectures. However, such newly written codes may need to be (slightly) rewritten to port to old Fortran 77 environments that do not support KIND declarations. (End of advice to users.)

Rationale. In the mpi_f08 module, the location argument is not defined with INTENT(IN) because existing applications may use MPI_GET_ADDRESS as a substitute for MPI_F_SYNC_REG, which was not defined before MPI-3.0. (End of rationale.)

Example 4.8 Using MPI_GET_ADDRESS for an array.

REAL A(100,100)
INTEGER(KIND=MPI_ADDRESS_KIND) I1, I2, DIFF
CALL MPI_GET_ADDRESS(A(1,1), I1, IERROR)
CALL MPI_GET_ADDRESS(A(10,10), I2, IERROR)
DIFF = I2 - I1
! The value of DIFF is 909*sizeofreal; the values of I1 and I2 are
! implementation dependent.

Advice to users. C users may be tempted to avoid the usage of MPI_GET_ADDRESS and rely on the availability of the address operator &. Note, however, that & cast-expression is a pointer, not an address. ISO C does not require that the value of a pointer (or the pointer cast to int) be the absolute address of the object pointed at — although this is commonly the case. Furthermore, referencing may not have a unique definition on machines with a segmented address space. The use of MPI_GET_ADDRESS to "reference" C variables guarantees portability to such machines as well. (End of advice to users.)

Advice to users. To prevent problems with the argument copying and register optimization done by Fortran compilers, please note the hints in Sections 17.1.10-17.1.20. In particular, refer to Sections 17.1.12 and 17.1.13 on pages 626-629 about "Problems Due to Data Copying and Sequence Association with Subscript Triplets" and "Vector Subscripts", and Sections 17.1.16-17.1.19 on pages 631-642 about "Optimization Problems", "Code Movements and Register Optimization", "Temporary Data Movements" and "Permanent Data Movements". (End of advice to users.)

The following auxiliary functions provide useful information on derived datatypes.

MPI_TYPE_SIZE(datatype, size)
  IN   datatype  datatype (handle)
  OUT  size      datatype size (integer)

int MPI_Type_size(MPI_Datatype datatype, int *size)

MPI_Type_size(datatype, size, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_SIZE(DATATYPE, SIZE, IERROR)
    INTEGER DATATYPE, SIZE, IERROR

MPI_TYPE_SIZE_X(datatype, size)
  IN   datatype  datatype (handle)
  OUT  size      datatype size (integer)

int MPI_Type_size_x(MPI_Datatype datatype, MPI_Count *size)

MPI_Type_size_x(datatype, size, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND=MPI_COUNT_KIND), INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_SIZE_X(DATATYPE, SIZE, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_COUNT_KIND) SIZE

MPI_TYPE_SIZE and MPI_TYPE_SIZE_X set the value of size to the total size, in bytes, of the entries in the type signature associated with datatype; i.e., the total size of the data in a message that would be created with this datatype. Entries that occur multiple times in the datatype are counted with their multiplicity. For both functions, if the OUT parameter cannot express the value to be returned (e.g., if the parameter is too small to hold the output value), it is set to MPI_UNDEFINED.

4.1.6 Lower-Bound and Upper-Bound Markers

It is often convenient to define explicitly the lower bound and upper bound of a type map, and override the definition given on page 105. This allows one to define a datatype that has "holes" at its beginning or its end, or a datatype with entries that extend above the upper bound or below the lower bound. Examples of such usage are provided in Section 4.1.14. Also, the user may want to override the alignment rules that are used to compute upper bounds and extents. E.g., a C compiler may allow the user to override default alignment rules for some of the structures within a program. The user has to specify explicitly the bounds of the datatypes that match these structures.

To achieve this, we add two additional conceptual datatypes, lb_marker and ub_marker, that represent the lower bound and upper bound of a datatype. These conceptual datatypes occupy no space (extent(lb_marker) = extent(ub_marker) = 0). They do not affect the size or count of a datatype, and do not affect the content of a message created with this datatype. However, they do affect the definition of the extent of a datatype and, therefore, affect the outcome of a replication of this datatype by a datatype constructor.

Example 4.9 A call to MPI_TYPE_CREATE_RESIZED(MPI_INT, -3, 9, type1) creates a new datatype that has an extent of 9 (from -3 to 5, 5 included), and contains an integer at displacement 0. This is the datatype defined by the typemap {(lb_marker, -3), (int, 0), (ub_marker, 6)}. If this type is replicated twice by a call to MPI_TYPE_CONTIGUOUS(2, type1, type2) then the newly created type can be described by the typemap {(lb_marker, -3), (int, 0), (int, 9), (ub_marker, 15)}. (An entry of type ub_marker can be deleted if there is another entry of type ub_marker with a higher displacement; an entry of type lb_marker can be deleted if there is another entry of type lb_marker with a lower displacement.)

In general, if

Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

then the lower bound of Typemap is defined to be

lb(Typemap) = min_j disp_j                                       if no entry has type lb_marker
            = min_j {disp_j such that type_j = lb_marker}        otherwise

Similarly, the upper bound of Typemap is defined to be

ub(Typemap) = max_j (disp_j + sizeof(type_j)) + ε                if no entry has type ub_marker
            = max_j {disp_j such that type_j = ub_marker}        otherwise

Then

extent(Typemap) = ub(Typemap) − lb(Typemap)

If type_i requires alignment to a byte address that is a multiple of k_i, then ε is the least non-negative increment needed to round extent(Typemap) to the next multiple of max_i k_i.

In Fortran, it is implementation dependent whether the MPI implementation computes the alignments k_i according to the alignments used by the compiler in common blocks, SEQUENCE derived types, BIND(C) derived types, or derived types that are neither SEQUENCE nor BIND(C).

The formal definitions given for the various datatype constructors apply now, with the amended definition of extent.

Rationale. Before Fortran 2003, MPI_TYPE_CREATE_STRUCT could be applied to Fortran common blocks and SEQUENCE derived types. With Fortran 2003, this list was extended by BIND(C) derived types, and MPI implementors have implemented the alignments k_i differently, i.e., some based on the alignments used in SEQUENCE derived types, and others according to BIND(C) derived types. (End of rationale.)

Advice to implementors. In Fortran, it is generally recommended to use BIND(C) derived types instead of common blocks or SEQUENCE derived types. Therefore it is recommended to calculate the alignments k_i based on BIND(C) derived types. (End of advice to implementors.)

Advice to users. Structures combining different basic datatypes should be defined so that there will be no gaps based on alignment rules. If such a datatype is used to create an array of structures, users should also avoid an alignment-gap at the end of the structure. In MPI communication, the content of such gaps would not be communicated into the receiver's buffer. For example, such an alignment-gap may occur between an odd number of floats or REALs before a double or DOUBLE PRECISION data. Such gaps may be added explicitly to both the structure and the MPI derived datatype handle because the communication of a contiguous derived datatype may be significantly faster than the communication of one that is non-contiguous because of such alignment-gaps.

Example: Instead of

TYPE, BIND(C) :: my_data
  REAL, DIMENSION(3) :: x
  ! there may be a gap of the size of one REAL
  ! if the alignment of a DOUBLE PRECISION is
  ! two times the size of a REAL
  DOUBLE PRECISION :: p
END TYPE

one should define

TYPE, BIND(C) :: my_data
  REAL, DIMENSION(3) :: x
  REAL :: gap1
  DOUBLE PRECISION :: p
END TYPE

and also include gap1 in the matching MPI derived datatype. It is required that all processes in a communication add the same gaps, i.e., defined with the same basic datatype. Both the original and the modified structures are portable, but may have different performance implications for the communication and memory accesses during computation on systems with different alignment values.

In principle, a compiler may define an additional alignment rule for structures, e.g., to use at least 4 or 8 byte alignment, although the content may have a max_i k_i alignment less than this structure alignment. To maintain portability, users should always resize structure derived datatype handles if used in an array of structures, see the Example in Section 17.1.15 on page 629. (End of advice to users.)
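The same resizing recommendation applies to C structures. The following non-normative sketch uses MPI_TYPE_CREATE_RESIZED (defined in Section 4.1.7) to force the extent of a structure datatype to sizeof(struct cell), so that consecutive elements of an array of structures are addressed correctly; the structure and names are assumptions made for the example.

#include <stddef.h>
#include <mpi.h>

/* Non-normative sketch: force the extent of a structure datatype to the
   C size of the structure, so that `count` consecutive array elements
   can be communicated with one call. */
struct cell {
    float value;
    char  tag;          /* the compiler may pad after this member */
};

static MPI_Datatype make_resized_cell_type(void)
{
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(struct cell, value),
                                  offsetof(struct cell, tag) };
    MPI_Datatype types[2]     = { MPI_FLOAT, MPI_CHAR };
    MPI_Datatype tmp, celltype;

    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    /* Set lower bound 0 and extent sizeof(struct cell), covering any
       trailing padding inserted by the compiler. */
    MPI_Type_create_resized(tmp, 0, (MPI_Aint) sizeof(struct cell), &celltype);
    MPI_Type_free(&tmp);
    MPI_Type_commit(&celltype);
    return celltype;
}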

4.1.7 Extent and Bounds of Datatypes

MPI_TYPE_GET_EXTENT(datatype, lb, extent)
  IN   datatype  datatype to get information on (handle)
  OUT  lb        lower bound of datatype (integer)
  OUT  extent    extent of datatype (integer)

int MPI_Type_get_extent(MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent)

MPI_Type_get_extent(datatype, lb, extent, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: lb, extent
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_EXTENT(DATATYPE, LB, EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_ADDRESS_KIND) LB, EXTENT

MPI_TYPE_GET_EXTENT_X(datatype, lb, extent)
  IN   datatype  datatype to get information on (handle)
  OUT  lb        lower bound of datatype (integer)
  OUT  extent    extent of datatype (integer)

int MPI_Type_get_extent_x(MPI_Datatype datatype, MPI_Count *lb, MPI_Count *extent)

MPI_Type_get_extent_x(datatype, lb, extent, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND = MPI_COUNT_KIND), INTENT(OUT) :: lb, extent
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_EXTENT_X(DATATYPE, LB, EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_COUNT_KIND) LB, EXTENT

Returns the lower bound and the extent of datatype (as defined in Section 4.1.6 on page 104).

For both functions, if either OUT parameter cannot express the value to be returned (e.g., if the parameter is too small to hold the output value), it is set to MPI_UNDEFINED.

MPI allows one to change the extent of a datatype, using lower bound and upper bound markers. This provides control over the stride of successive datatypes that are replicated by datatype constructors, or are replicated by the count argument in a send or receive call.
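As a non-normative C sketch contrasting the size of a datatype (Section 4.1.5) with its extent, the following fragment queries both for a strided vector type; the chosen counts are assumptions made for the example.

#include <stdio.h>
#include <mpi.h>

/* Non-normative sketch: size counts only the data in the type signature,
   while the extent also spans the gaps between the strided blocks. */
static void report_vector_layout(void)
{
    MPI_Datatype vec;
    MPI_Aint     lb, extent;
    int          size;

    MPI_Type_vector(3, 1, 4, MPI_DOUBLE, &vec);  /* 3 doubles, stride 4 */
    MPI_Type_size(vec, &size);                   /* 3 * sizeof(double)  */
    MPI_Type_get_extent(vec, &lb, &extent);      /* spans 9 doubles     */
    printf("size = %d bytes, lb = %ld, extent = %ld bytes\n",
           size, (long) lb, (long) extent);
    MPI_Type_free(&vec);
}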

MPI_TYPE_CREATE_RESIZED(oldtype, lb, extent, newtype)
  IN   oldtype  input datatype (handle)
  IN   lb       new lower bound of datatype (integer)
  IN   extent   new extent of datatype (integer)
  OUT  newtype  output datatype (handle)

int MPI_Type_create_resized(MPI_Datatype oldtype, MPI_Aint lb, MPI_Aint extent, MPI_Datatype *newtype)

MPI_Type_create_resized(oldtype, lb, extent, newtype, ierror) BIND(C)
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: lb, extent
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_CREATE_RESIZED(OLDTYPE, LB, EXTENT, NEWTYPE, IERROR)
    INTEGER OLDTYPE, NEWTYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) LB, EXTENT

Returns in newtype a handle to a new datatype that is identical to oldtype, except that the lower bound of this new datatype is set to be lb, and its upper bound is set to be lb + extent. Any previous lb and ub markers are erased, and a new pair of lower bound and upper bound markers are put in the positions indicated by the lb and extent arguments. This affects the behavior of the datatype when used in communication operations, with count > 1, and when used in the construction of new derived datatypes.

4.1.8 True Extent of Datatypes

Suppose we implement gather (see also Section 5.5 on page 149) as a spanning tree implemented on top of point-to-point routines. Since the receive buffer is only valid on the root process, one will need to allocate some temporary space for receiving data on intermediate nodes. However, the datatype extent cannot be used as an estimate of the amount of space that needs to be allocated, if the user has modified the extent, for example by using MPI_TYPE_CREATE_RESIZED. The functions MPI_TYPE_GET_TRUE_EXTENT and MPI_TYPE_GET_TRUE_EXTENT_X are provided which return the true extent of the datatype.

MPI_TYPE_GET_TRUE_EXTENT(datatype, true_lb, true_extent)
  IN   datatype     datatype to get information on (handle)
  OUT  true_lb      true lower bound of datatype (integer)
  OUT  true_extent  true size of datatype (integer)

int MPI_Type_get_true_extent(MPI_Datatype datatype, MPI_Aint *true_lb, MPI_Aint *true_extent)

MPI_Type_get_true_extent(datatype, true_lb, true_extent, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: true_lb, true_extent
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_TRUE_EXTENT(DATATYPE, TRUE_LB, TRUE_EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_ADDRESS_KIND) TRUE_LB, TRUE_EXTENT

MPI_TYPE_GET_TRUE_EXTENT_X(datatype, true_lb, true_extent)
  IN   datatype     datatype to get information on (handle)
  OUT  true_lb      true lower bound of datatype (integer)
  OUT  true_extent  true size of datatype (integer)

int MPI_Type_get_true_extent_x(MPI_Datatype datatype, MPI_Count *true_lb, MPI_Count *true_extent)

MPI_Type_get_true_extent_x(datatype, true_lb, true_extent, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND = MPI_COUNT_KIND), INTENT(OUT) :: true_lb, true_extent
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_TRUE_EXTENT_X(DATATYPE, TRUE_LB, TRUE_EXTENT, IERROR)
    INTEGER DATATYPE, IERROR
    INTEGER(KIND = MPI_COUNT_KIND) TRUE_LB, TRUE_EXTENT

true_lb returns the offset of the lowest unit of store which is addressed by the datatype, i.e., the lower bound of the corresponding typemap, ignoring explicit lower bound markers. true_extent returns the true size of the datatype, i.e., the extent of the corresponding typemap, ignoring explicit lower bound and upper bound markers, and performing no rounding for alignment. If the typemap associated with datatype is

Typemap = {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

then

true_lb(Typemap) = min_j {disp_j : type_j ≠ lb_marker, ub_marker},

true_ub(Typemap) = max_j {disp_j + sizeof(type_j) : type_j ≠ lb_marker, ub_marker},

and

true_extent(Typemap) = true_ub(Typemap) − true_lb(Typemap).

(Readers should compare this with the definitions in Section 4.1.6 on page 104 and Section 4.1.7 on page 107, which describe the function MPI_TYPE_GET_EXTENT.)

The true_extent is the minimum number of bytes of memory necessary to hold a datatype, uncompressed.

For both functions, if either OUT parameter cannot express the value to be returned (e.g., if the parameter is too small to hold the output value), it is set to MPI_UNDEFINED.
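The following non-normative C sketch shows the allocation pattern motivated above: scratch space for count items of an arbitrary datatype, sized with the true extent and addressed through a pointer shifted by the true lower bound. The helper name and the use of malloc are assumptions made for the example.

#include <stdlib.h>
#include <mpi.h>

/* Non-normative sketch: allocate scratch space large enough to hold
   `count` items of `datatype`, as an intermediate node of a gather tree
   might do.  The returned pointer is the allocation to be freed later;
   *base is the buffer address to pass to MPI calls. */
static void *alloc_scratch(MPI_Datatype datatype, int count, void **base)
{
    MPI_Aint true_lb, true_extent, lb, extent;

    MPI_Type_get_true_extent(datatype, &true_lb, &true_extent);
    MPI_Type_get_extent(datatype, &lb, &extent);

    /* One item occupies true_extent bytes; successive items are laid
       out extent bytes apart. */
    char *buf = malloc(true_extent + (count - 1) * extent);

    /* Shift the buffer pointer so that the datatype's true lower bound
       falls at the start of the allocation. */
    *base = buf - true_lb;
    return buf;
}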

4.1.9 Commit and Free

A datatype object has to be committed before it can be used in a communication. As an argument in datatype constructors, uncommitted and also committed datatypes can be used. There is no need to commit basic datatypes. They are "pre-committed."

MPI_TYPE_COMMIT(datatype)
  INOUT  datatype  datatype that is committed (handle)

int MPI_Type_commit(MPI_Datatype *datatype)

MPI_Type_commit(datatype, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(INOUT) :: datatype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_COMMIT(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

The commit operation commits the datatype, that is, the formal description of a communication buffer, not the content of that buffer. Thus, after a datatype has been committed, it can be repeatedly reused to communicate the changing content of a buffer or, indeed, the content of different buffers, with different starting addresses.

Advice to implementors. The system may "compile" at commit time an internal representation for the datatype that facilitates communication, e.g., change from a compacted representation to a flat representation of the datatype, and select the most convenient transfer mechanism. (End of advice to implementors.)

MPI_TYPE_COMMIT will accept a committed datatype; in this case, it is equivalent to a no-op.

Example 4.10 The following code fragment gives examples of using MPI_TYPE_COMMIT.

INTEGER type1, type2
CALL MPI_TYPE_CONTIGUOUS(5, MPI_REAL, type1, ierr)
! new type object created
CALL MPI_TYPE_COMMIT(type1, ierr)
! now type1 can be used for communication
type2 = type1
! type2 can be used for communication
! (it is a handle to same object as type1)
CALL MPI_TYPE_VECTOR(3, 5, 4, MPI_REAL, type1, ierr)
! new uncommitted type object created
CALL MPI_TYPE_COMMIT(type1, ierr)
! now type1 can be used anew for communication
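A non-normative C counterpart to Example 4.10, showing the usual construct-commit-use-free lifecycle; the buffer, peer, and tag values are assumptions made for the example, and MPI_TYPE_FREE is defined next.

#include <mpi.h>

/* Non-normative sketch: lifecycle of a derived datatype in C. */
static void lifecycle(double *buf, int peer, MPI_Comm comm)
{
    MPI_Datatype vec;
    MPI_Request  req;

    MPI_Type_vector(3, 5, 4, MPI_DOUBLE, &vec);  /* construct           */
    MPI_Type_commit(&vec);                       /* commit before use   */
    MPI_Isend(buf, 1, vec, peer, 0, comm, &req);
    /* MPI_TYPE_FREE may be called here: the pending send still completes
       normally, and only the user's handle is invalidated. */
    MPI_Type_free(&vec);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}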

MPI_TYPE_FREE(datatype)

INOUT  datatype    datatype that is freed (handle)

int MPI_Type_free(MPI_Datatype *datatype)

MPI_Type_free(datatype, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(INOUT) :: datatype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_FREE(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR

Marks the datatype object associated with datatype for deallocation and sets datatype to MPI_DATATYPE_NULL. Any communication that is currently using this datatype will complete normally. Freeing a datatype does not affect any other datatype that was built from the freed datatype. The system behaves as if input datatype arguments to derived datatype constructors are passed by value.

Advice to implementors. The implementation may keep a reference count of active communications that use the datatype, in order to decide when to free it. Also, one may implement constructors of derived datatypes so that they keep pointers to their datatype arguments, rather than copying them. In this case, one needs to keep track of active datatype definition references in order to know when a datatype object can be freed. (End of advice to implementors.)

4.1.10 Duplicating a Datatype

MPI_TYPE_DUP(oldtype, newtype)

IN   oldtype    datatype (handle)
OUT  newtype    copy of oldtype (handle)

int MPI_Type_dup(MPI_Datatype oldtype, MPI_Datatype *newtype)

MPI_Type_dup(oldtype, newtype, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: oldtype
    TYPE(MPI_Datatype), INTENT(OUT) :: newtype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_DUP(OLDTYPE, NEWTYPE, IERROR)
    INTEGER OLDTYPE, NEWTYPE, IERROR

MPI_TYPE_DUP is a type constructor which duplicates the existing oldtype with associated key values. For each key value, the respective copy callback function determines the attribute value associated with this key in the new datatype; one particular action that a copy callback may take is to delete the attribute from the new datatype. Returns in newtype a new datatype with exactly the same properties as oldtype and any copied cached information, see Section 6.7.4 on page 276. The new datatype has identical upper bound and lower bound and yields the same net result when fully decoded with the functions in Section 4.1.13. The newtype has the same committed state as the old oldtype.
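The following fragment is an illustrative sketch, not part of the standard text, of the interaction between MPI_TYPE_DUP and MPI_TYPE_FREE: freeing a datatype does not affect a duplicate made from it, and the duplicate inherits the committed state of the original.

MPI_Datatype t, tdup;

MPI_Type_contiguous(4, MPI_INT, &t);
MPI_Type_commit(&t);
MPI_Type_dup(t, &tdup);   /* tdup is committed because t was committed */

MPI_Type_free(&t);        /* t is set to MPI_DATATYPE_NULL;            */
                          /* tdup is unaffected and remains usable     */

/* ... communication using tdup ... */

MPI_Type_free(&tdup);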

4.1.11 Use of General Datatypes in Communication

Handles to derived datatypes can be passed to a communication call wherever a datatype argument is required. A call of the form MPI_SEND(buf, count, datatype, ...), where count > 1, is interpreted as if the call was passed a new datatype which is the concatenation of count copies of datatype. Thus, MPI_SEND(buf, count, datatype, dest, tag, comm) is equivalent to,

MPI_TYPE_CONTIGUOUS(count, datatype, newtype)
MPI_TYPE_COMMIT(newtype)
MPI_SEND(buf, 1, newtype, dest, tag, comm)
MPI_TYPE_FREE(newtype).

Similar statements apply to all other communication functions that have a count and datatype argument.
Suppose that a send operation MPI_SEND(buf, count, datatype, dest, tag, comm) is executed, where datatype has type map,

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

and extent extent. (Explicit lower bound and upper bound markers are not listed in the type map, but they affect the value of extent.) The send operation sends n · count entries, where entry i · n + j is at location addr_{i,j} = buf + extent · i + disp_j and has type type_j, for i = 0, ..., count − 1 and j = 0, ..., n − 1. These entries need not be contiguous, nor distinct; their order can be arbitrary.
The variable stored at address addr_{i,j} in the calling program should be of a type that matches type_j, where type matching is defined as in Section 3.3.1. The message sent contains n · count entries, where entry i · n + j has type type_j.
Similarly, suppose that a receive operation MPI_RECV(buf, count, datatype, source, tag, comm, status) is executed, where datatype has type map,

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})},

with extent extent. (Again, explicit lower bound and upper bound markers are not listed in the type map, but they affect the value of extent.) This receive operation receives n · count entries, where entry i · n + j is at location buf + extent · i + disp_j and has type type_j. If the incoming message consists of k elements, then we must have k ≤ n · count; the i · n + j-th element of the message should have a type that matches type_j.
Type matching is defined according to the type signature of the corresponding datatypes, that is, the sequence of basic type components. Type matching does not depend on some aspects of the datatype definition, such as the displacements (layout in memory) or the intermediate types used.

Example 4.11 This example shows that type matching is defined in terms of the basic types that a derived type consists of.

...
CALL MPI_TYPE_CONTIGUOUS(2, MPI_REAL, type2, ...)
CALL MPI_TYPE_CONTIGUOUS(4, MPI_REAL, type4, ...)
CALL MPI_TYPE_CONTIGUOUS(2, type2, type22, ...)
...
CALL MPI_SEND(a, 4, MPI_REAL, ...)
CALL MPI_SEND(a, 2, type2, ...)
CALL MPI_SEND(a, 1, type22, ...)
CALL MPI_SEND(a, 1, type4, ...)
...
CALL MPI_RECV(a, 4, MPI_REAL, ...)
CALL MPI_RECV(a, 2, type2, ...)
CALL MPI_RECV(a, 1, type22, ...)
CALL MPI_RECV(a, 1, type4, ...)

Each of the sends matches any of the receives.
A datatype may specify overlapping entries. The use of such a datatype in a receive operation is erroneous. (This is erroneous even if the actual message received is short enough not to write any entry more than once.)
Suppose that MPI_RECV(buf, count, datatype, source, tag, comm, status) is executed, where datatype has type map,

    {(type_0, disp_0), ..., (type_{n-1}, disp_{n-1})}.

The received message need not fill all the receive buffer, nor does it need to fill a number of locations which is a multiple of n. Any number, k, of basic elements can be received, where 0 ≤ k ≤ count · n. The number of basic elements received can be retrieved from status using the query functions MPI_GET_ELEMENTS or MPI_GET_ELEMENTS_X.

MPI_GET_ELEMENTS(status, datatype, count)

IN    status      return status of receive operation (Status)
IN    datatype    datatype used by receive operation (handle)
OUT   count       number of received basic elements (integer)

int MPI_Get_elements(const MPI_Status *status, MPI_Datatype datatype,
              int *count)

MPI_Get_elements(status, datatype, count, ierror) BIND(C)
    TYPE(MPI_Status), INTENT(IN) :: status
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(OUT) :: count
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_ELEMENTS(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR

MPI_GET_ELEMENTS_X(status, datatype, count)

IN    status      return status of receive operation (Status)
IN    datatype    datatype used by receive operation (handle)
OUT   count       number of received basic elements (integer)

int MPI_Get_elements_x(const MPI_Status *status, MPI_Datatype datatype,
              MPI_Count *count)

MPI_Get_elements_x(status, datatype, count, ierror) BIND(C)
    TYPE(MPI_Status), INTENT(IN) :: status
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER(KIND=MPI_COUNT_KIND), INTENT(OUT) :: count
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_ELEMENTS_X(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, IERROR
    INTEGER(KIND=MPI_COUNT_KIND) COUNT

The datatype argument should match the argument provided by the receive call that set the status variable. For both functions, if the OUT parameter cannot express the value to be returned (e.g., if the parameter is too small to hold the output value), it is set to MPI_UNDEFINED.
The previously defined function MPI_GET_COUNT (Section 3.2.5) has a different behavior. It returns the number of "top-level entries" received, i.e. the number of "copies" of type datatype. In the previous example, MPI_GET_COUNT may return any integer value k, where 0 ≤ k ≤ count. If MPI_GET_COUNT returns k, then the number of basic elements received (and the value returned by MPI_GET_ELEMENTS or MPI_GET_ELEMENTS_X) is n · k. If the number of basic elements received is not a multiple of n, that is, if the receive operation has not received an integral number of datatype "copies," then MPI_GET_COUNT sets the value of count to MPI_UNDEFINED.

Example 4.12 Usage of MPI_GET_COUNT and MPI_GET_ELEMENTS.

...
CALL MPI_TYPE_CONTIGUOUS(2, MPI_REAL, Type2, ierr)
CALL MPI_TYPE_COMMIT(Type2, ierr)
...
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(a, 2, MPI_REAL, 1, 0, comm, ierr)
    CALL MPI_SEND(a, 3, MPI_REAL, 1, 0, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(a, 2, Type2, 0, 0, comm, stat, ierr)
    CALL MPI_GET_COUNT(stat, Type2, i, ierr)     ! returns i=1
    CALL MPI_GET_ELEMENTS(stat, Type2, i, ierr)  ! returns i=2
    CALL MPI_RECV(a, 2, Type2, 0, 0, comm, stat, ierr)
    CALL MPI_GET_COUNT(stat, Type2, i, ierr)     ! returns i=MPI_UNDEFINED
    CALL MPI_GET_ELEMENTS(stat, Type2, i, ierr)  ! returns i=3
END IF

The functions MPI_GET_ELEMENTS and MPI_GET_ELEMENTS_X can also be used after a probe to find the number of elements in the probed message. Note that the functions MPI_GET_COUNT, MPI_GET_ELEMENTS, and MPI_GET_ELEMENTS_X return the same values when they are used with basic datatypes as long as the limits of their respective count arguments are not exceeded.
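As an illustrative sketch, not part of the standard text, the element count of a pending message can be inspected before receiving it; here type2 is assumed to be a committed derived datatype and comm a communicator:

MPI_Status status;
int nelems;

/* wait for a message from rank 0 with tag 0, without receiving it */
MPI_Probe(0, 0, comm, &status);

/* number of basic elements in the pending message, interpreted
   with respect to the derived datatype type2 */
MPI_Get_elements(&status, type2, &nelems);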

Rationale. The extension given to the definition of MPI_GET_COUNT seems natural: one would expect this function to return the value of the count argument, when the receive buffer is filled. Sometimes datatype represents a basic unit of data one wants to transfer, for example, a record in an array of records (structures). One should be able to find out how many components were received without bothering to divide by the number of elements in each component. However, on other occasions, datatype is used to define a complex layout of data in the receiver memory, and does not represent a basic unit of data for transfers. In such cases, one needs to use the function MPI_GET_ELEMENTS or MPI_GET_ELEMENTS_X. (End of rationale.)

Advice to implementors. The definition implies that a receive cannot change the value of storage outside the entries defined to compose the communication buffer. In particular, the definition implies that padding space in a structure should not be modified when such a structure is copied from one process to another. This would prevent the obvious optimization of copying the structure, together with the padding, as one contiguous block. The implementation is free to do this optimization when it does not impact the outcome of the computation. The user can "force" this optimization by explicitly including padding as part of the message. (End of advice to implementors.)

4.1.12 Correct Use of Addresses

Successively declared variables in C or Fortran are not necessarily stored at contiguous locations. Thus, care must be exercised that displacements do not cross from one variable to another. Also, in machines with a segmented address space, addresses are not unique and address arithmetic has some peculiar properties. Thus, the use of addresses, that is, displacements relative to the start address MPI_BOTTOM, has to be restricted.
Variables belong to the same sequential storage if they belong to the same array, to the same COMMON block in Fortran, or to the same structure in C. Valid addresses are defined recursively as follows:

1. The function MPI_GET_ADDRESS returns a valid address, when passed as argument a variable of the calling program.

2. The buf argument of a communication function evaluates to a valid address, when passed as argument a variable of the calling program.

3. If v is a valid address, and i is an integer, then v+i is a valid address, provided v and v+i are in the same sequential storage.

A correct program uses only valid addresses to identify the locations of entries in communication buffers. Furthermore, if u and v are two valid addresses, then the (integer) difference u - v can be computed only if both u and v are in the same sequential storage. No other arithmetic operations can be meaningfully executed on addresses.
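A minimal sketch of the rules above, not part of the standard text: both addresses below are valid by rule 1, and their difference may be computed because both lie in the same sequential storage (the array a).

double a[100];
MPI_Aint addr1, addr2, diff;

MPI_Get_address(&a[0], &addr1);
MPI_Get_address(&a[9], &addr2);
diff = addr2 - addr1;   /* legal: same sequential storage; the result could
                           serve as a displacement in a derived datatype */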

The rules above impose no constraints on the use of derived datatypes, as long as they are used to define a communication buffer that is wholly contained within the same sequential storage. However, the construction of a communication buffer that contains variables that are not within the same sequential storage must obey certain restrictions. Basically, a communication buffer with variables that are not within the same sequential storage can be used only by specifying in the communication call buf = MPI_BOTTOM, count = 1, and using a datatype argument where all displacements are valid (absolute) addresses.

Advice to users. It is not expected that MPI implementations will be able to detect erroneous, "out of bound" displacements, unless those overflow the user address space, since the MPI call may not know the extent of the arrays and records in the host program. (End of advice to users.)

Advice to implementors. There is no need to distinguish (absolute) addresses and (relative) displacements on a machine with contiguous address space: MPI_BOTTOM is zero, and both addresses and displacements are integers. On machines where the distinction is required, addresses are recognized as expressions that involve MPI_BOTTOM. (End of advice to implementors.)

4.1.13 Decoding a Datatype

MPI datatype objects allow users to specify an arbitrary layout of data in memory. There are several cases where accessing the layout information in opaque datatype objects would be useful. The opaque datatype object has found a number of uses outside MPI. Furthermore, a number of tools wish to display internal information about a datatype. To achieve this, datatype decoding functions are provided. The two functions in this section are used together to decode datatypes to recreate the calling sequence used in their initial definition. These can be used to allow a user to determine the type map and type signature of a datatype.

MPI_TYPE_GET_ENVELOPE(datatype, num_integers, num_addresses, num_datatypes, combiner)

IN    datatype        datatype to access (handle)
OUT   num_integers    number of input integers used in the call constructing combiner (non-negative integer)
OUT   num_addresses   number of input addresses used in the call constructing combiner (non-negative integer)
OUT   num_datatypes   number of input datatypes used in the call constructing combiner (non-negative integer)
OUT   combiner        combiner (state)

int MPI_Type_get_envelope(MPI_Datatype datatype, int *num_integers,
              int *num_addresses, int *num_datatypes, int *combiner)

MPI_Type_get_envelope(datatype, num_integers, num_addresses, num_datatypes,
              combiner, ierror) BIND(C)

    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(OUT) :: num_integers, num_addresses, num_datatypes,
              combiner
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_ENVELOPE(DATATYPE, NUM_INTEGERS, NUM_ADDRESSES, NUM_DATATYPES,
              COMBINER, IERROR)
    INTEGER DATATYPE, NUM_INTEGERS, NUM_ADDRESSES, NUM_DATATYPES, COMBINER,
              IERROR

For the given datatype, MPI_TYPE_GET_ENVELOPE returns information on the number and type of input arguments used in the call that created the datatype. The number-of-arguments values returned can be used to provide sufficiently large arrays in the decoding routine MPI_TYPE_GET_CONTENTS. This call and the meaning of the returned values is described below. The combiner reflects the MPI datatype constructor call that was used in creating datatype.

Rationale. By requiring that the combiner reflect the constructor used in the creation of the datatype, the decoded information can be used to effectively recreate the calling sequence used in the original creation. This is the most useful information and was felt to be reasonable even though it constrains implementations to remember the original constructor sequence even if the internal representation is different.
The decoded information keeps track of datatype duplications. This is important as one needs to distinguish between a predefined datatype and a dup of a predefined datatype. The former is a constant object that cannot be freed, while the latter is a derived datatype that can be freed. (End of rationale.)

The list in Table 4.1 has the values that can be returned in combiner on the left and the call associated with them on the right.

MPI_COMBINER_NAMED             a named predefined datatype
MPI_COMBINER_DUP               MPI_TYPE_DUP
MPI_COMBINER_CONTIGUOUS        MPI_TYPE_CONTIGUOUS
MPI_COMBINER_VECTOR            MPI_TYPE_VECTOR
MPI_COMBINER_HVECTOR           MPI_TYPE_CREATE_HVECTOR
MPI_COMBINER_INDEXED           MPI_TYPE_INDEXED
MPI_COMBINER_HINDEXED          MPI_TYPE_CREATE_HINDEXED
MPI_COMBINER_INDEXED_BLOCK     MPI_TYPE_CREATE_INDEXED_BLOCK
MPI_COMBINER_HINDEXED_BLOCK    MPI_TYPE_CREATE_HINDEXED_BLOCK
MPI_COMBINER_STRUCT            MPI_TYPE_CREATE_STRUCT
MPI_COMBINER_SUBARRAY          MPI_TYPE_CREATE_SUBARRAY
MPI_COMBINER_DARRAY            MPI_TYPE_CREATE_DARRAY
MPI_COMBINER_F90_REAL          MPI_TYPE_CREATE_F90_REAL
MPI_COMBINER_F90_COMPLEX       MPI_TYPE_CREATE_F90_COMPLEX
MPI_COMBINER_F90_INTEGER       MPI_TYPE_CREATE_F90_INTEGER
MPI_COMBINER_RESIZED           MPI_TYPE_CREATE_RESIZED

Table 4.1: combiner values returned from MPI_TYPE_GET_ENVELOPE
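As an illustrative sketch, not part of the standard text, the envelope of a datatype built with MPI_TYPE_VECTOR reports the combiner of Table 4.1 together with the argument counts listed in the per-combiner tables later in this section:

MPI_Datatype vec;
int ni, na, nd, combiner;

MPI_Type_vector(3, 2, 10, MPI_DOUBLE, &vec);
MPI_Type_get_envelope(vec, &ni, &na, &nd, &combiner);
/* combiner == MPI_COMBINER_VECTOR,
   ni == 3 (count, blocklength, stride), na == 0, nd == 1 (oldtype) */
MPI_Type_free(&vec);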

If combiner is MPI_COMBINER_NAMED then datatype is a named predefined datatype.
The actual arguments used in the creation call for a datatype can be obtained using MPI_TYPE_GET_CONTENTS.

MPI_TYPE_GET_CONTENTS(datatype, max_integers, max_addresses, max_datatypes, array_of_integers, array_of_addresses, array_of_datatypes)

IN    datatype              datatype to access (handle)
IN    max_integers          number of elements in array_of_integers (non-negative integer)
IN    max_addresses         number of elements in array_of_addresses (non-negative integer)
IN    max_datatypes         number of elements in array_of_datatypes (non-negative integer)
OUT   array_of_integers     contains integer arguments used in constructing datatype (array of integers)
OUT   array_of_addresses    contains address arguments used in constructing datatype (array of integers)
OUT   array_of_datatypes    contains datatype arguments used in constructing datatype (array of handles)

int MPI_Type_get_contents(MPI_Datatype datatype, int max_integers,
              int max_addresses, int max_datatypes, int array_of_integers[],
              MPI_Aint array_of_addresses[],
              MPI_Datatype array_of_datatypes[])

MPI_Type_get_contents(datatype, max_integers, max_addresses, max_datatypes,
              array_of_integers, array_of_addresses, array_of_datatypes,
              ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(IN) :: max_integers, max_addresses, max_datatypes
    INTEGER, INTENT(OUT) :: array_of_integers(max_integers)
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) ::
              array_of_addresses(max_addresses)
    TYPE(MPI_Datatype), INTENT(OUT) :: array_of_datatypes(max_datatypes)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_CONTENTS(DATATYPE, MAX_INTEGERS, MAX_ADDRESSES, MAX_DATATYPES,
              ARRAY_OF_INTEGERS, ARRAY_OF_ADDRESSES, ARRAY_OF_DATATYPES,
              IERROR)
    INTEGER DATATYPE, MAX_INTEGERS, MAX_ADDRESSES, MAX_DATATYPES,
              ARRAY_OF_INTEGERS(*), ARRAY_OF_DATATYPES(*), IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ARRAY_OF_ADDRESSES(*)

datatype must be a predefined unnamed or a derived datatype; the call is erroneous if datatype is a predefined named datatype.

The values given for max_integers, max_addresses, and max_datatypes must be at least as large as the value returned in num_integers, num_addresses, and num_datatypes, respectively, in the call MPI_TYPE_GET_ENVELOPE for the same datatype argument.

Rationale. The arguments max_integers, max_addresses, and max_datatypes allow for error checking in the call. (End of rationale.)

The datatypes returned in array_of_datatypes are handles to datatype objects that are equivalent to the datatypes used in the original construction call. If these were derived datatypes, then the returned datatypes are new datatype objects, and the user is responsible for freeing these datatypes with MPI_TYPE_FREE. If these were predefined datatypes, then the returned datatype is equal to that (constant) predefined datatype and cannot be freed.
The committed state of returned derived datatypes is undefined, i.e., the datatypes may or may not be committed. Furthermore, the content of attributes of returned datatypes is undefined.
Note that MPI_TYPE_GET_CONTENTS can be invoked with a datatype argument that was constructed using MPI_TYPE_CREATE_F90_REAL, MPI_TYPE_CREATE_F90_INTEGER, or MPI_TYPE_CREATE_F90_COMPLEX (an unnamed predefined datatype). In such a case, an empty array_of_datatypes is returned.

Rationale. The definition of datatype equivalence implies that equivalent predefined datatypes are equal. By requiring the same handle for named predefined datatypes, it is possible to use the == or .EQ. comparison operator to determine the datatype involved. (End of rationale.)

Advice to implementors. The datatypes returned in array_of_datatypes must appear to the user as if each is an equivalent copy of the datatype used in the type constructor call. Whether this is done by creating a new datatype or via another mechanism such as a reference count mechanism is up to the implementation as long as the semantics are preserved. (End of advice to implementors.)

Rationale. The committed state and attributes of the returned datatype is deliberately left vague. The datatype used in the original construction may have been modified since its use in the constructor call. Attributes can be added, removed, or modified as well as having the datatype committed. The semantics given allow for a reference count implementation without having to track these changes. (End of rationale.)

In the deprecated datatype constructor calls, the address arguments in Fortran are of type INTEGER. In the preferred calls, the address arguments are of type INTEGER(KIND=MPI_ADDRESS_KIND). The call MPI_TYPE_GET_CONTENTS returns all addresses in an argument of type INTEGER(KIND=MPI_ADDRESS_KIND). This is true even if the deprecated calls were used. Thus, the location of values returned can be thought of as being returned by the C bindings. It can also be determined by examining the preferred calls for datatype constructors for the deprecated calls that involve addresses.

Rationale. By having all address arguments returned in the array_of_addresses argument, the result from a C and Fortran decoding of a datatype gives the result in the same argument. It is assumed that an integer of type INTEGER(KIND=MPI_ADDRESS_KIND) will be at least as large as the INTEGER argument used in datatype construction with the old MPI-1 calls so no loss of information will occur. (End of rationale.)

The following defines what values are placed in each entry of the returned arrays depending on the datatype constructor used for datatype. It also specifies the size of the arrays needed which is the values returned by MPI_TYPE_GET_ENVELOPE. In Fortran, the following calls were made:

PARAMETER (LARGE = 1000)
INTEGER TYPE, NI, NA, ND, COMBINER, I(LARGE), D(LARGE), IERROR
INTEGER (KIND=MPI_ADDRESS_KIND) A(LARGE)
! CONSTRUCT DATATYPE TYPE (NOT SHOWN)
CALL MPI_TYPE_GET_ENVELOPE(TYPE, NI, NA, ND, COMBINER, IERROR)
IF ((NI .GT. LARGE) .OR. (NA .GT. LARGE) .OR. (ND .GT. LARGE)) THEN
  WRITE (*, *) "NI, NA, OR ND = ", NI, NA, ND, &
  " RETURNED BY MPI_TYPE_GET_ENVELOPE IS LARGER THAN LARGE = ", LARGE
  CALL MPI_ABORT(MPI_COMM_WORLD, 99, IERROR)
ENDIF
CALL MPI_TYPE_GET_CONTENTS(TYPE, NI, NA, ND, I, A, D, IERROR)

or in C the analogous calls of:

#define LARGE 1000
int ni, na, nd, combiner, i[LARGE];
MPI_Aint a[LARGE];
MPI_Datatype type, d[LARGE];
/* construct datatype type (not shown) */
MPI_Type_get_envelope(type, &ni, &na, &nd, &combiner);
if ((ni > LARGE) || (na > LARGE) || (nd > LARGE)) {
  fprintf(stderr, "ni, na, or nd = %d %d %d returned by ", ni, na, nd);
  fprintf(stderr, "MPI_Type_get_envelope is larger than LARGE = %d\n",
          LARGE);
  MPI_Abort(MPI_COMM_WORLD, 99);
};
MPI_Type_get_contents(type, ni, na, nd, i, a, d);

In the descriptions that follow, the lower case name of arguments is used.
If combiner is MPI_COMBINER_NAMED then it is erroneous to call MPI_TYPE_GET_CONTENTS.
If combiner is MPI_COMBINER_DUP then

Constructor argument        C            Fortran location
oldtype                     d[0]         D(1)

and ni = 0, na = 0, nd = 1.
If combiner is MPI_COMBINER_CONTIGUOUS then

Constructor argument        C            Fortran location
count                       i[0]         I(1)
oldtype                     d[0]         D(1)

and ni = 1, na = 0, nd = 1.
If combiner is MPI_COMBINER_VECTOR then

Constructor argument        C            Fortran location
count                       i[0]         I(1)
blocklength                 i[1]         I(2)
stride                      i[2]         I(3)
oldtype                     d[0]         D(1)

and ni = 3, na = 0, nd = 1.
If combiner is MPI_COMBINER_HVECTOR then

Constructor argument        C            Fortran location
count                       i[0]         I(1)
blocklength                 i[1]         I(2)
stride                      a[0]         A(1)
oldtype                     d[0]         D(1)

and ni = 2, na = 1, nd = 1.
If combiner is MPI_COMBINER_INDEXED then

Constructor argument        C                         Fortran location
count                       i[0]                      I(1)
array_of_blocklengths       i[1] to i[i[0]]           I(2) to I(I(1)+1)
array_of_displacements      i[i[0]+1] to i[2*i[0]]    I(I(1)+2) to I(2*I(1)+1)
oldtype                     d[0]                      D(1)

and ni = 2*count+1, na = 0, nd = 1.
If combiner is MPI_COMBINER_HINDEXED then

Constructor argument        C                         Fortran location
count                       i[0]                      I(1)
array_of_blocklengths       i[1] to i[i[0]]           I(2) to I(I(1)+1)
array_of_displacements      a[0] to a[i[0]-1]         A(1) to A(I(1))
oldtype                     d[0]                      D(1)

and ni = count+1, na = count, nd = 1.
If combiner is MPI_COMBINER_INDEXED_BLOCK then

Constructor argument        C                         Fortran location
count                       i[0]                      I(1)
blocklength                 i[1]                      I(2)
array_of_displacements      i[2] to i[i[0]+1]         I(3) to I(I(1)+2)
oldtype                     d[0]                      D(1)

and ni = count+2, na = 0, nd = 1.
If combiner is MPI_COMBINER_HINDEXED_BLOCK then

Constructor argument        C                         Fortran location
count                       i[0]                      I(1)
blocklength                 i[1]                      I(2)
array_of_displacements      a[0] to a[i[0]-1]         A(1) to A(I(1))
oldtype                     d[0]                      D(1)

and ni = 2, na = count, nd = 1.
If combiner is MPI_COMBINER_STRUCT then

Constructor argument        C                         Fortran location
count                       i[0]                      I(1)
array_of_blocklengths       i[1] to i[i[0]]           I(2) to I(I(1)+1)
array_of_displacements      a[0] to a[i[0]-1]         A(1) to A(I(1))
array_of_types              d[0] to d[i[0]-1]         D(1) to D(I(1))

and ni = count+1, na = count, nd = count.
If combiner is MPI_COMBINER_SUBARRAY then

Constructor argument        C                         Fortran location
ndims                       i[0]                      I(1)
array_of_sizes              i[1] to i[i[0]]           I(2) to I(I(1)+1)
array_of_subsizes           i[i[0]+1] to i[2*i[0]]    I(I(1)+2) to I(2*I(1)+1)
array_of_starts             i[2*i[0]+1] to i[3*i[0]]  I(2*I(1)+2) to I(3*I(1)+1)
order                       i[3*i[0]+1]               I(3*I(1)+2)
oldtype                     d[0]                      D(1)

and ni = 3*ndims+2, na = 0, nd = 1.
If combiner is MPI_COMBINER_DARRAY then

Constructor argument        C                           Fortran location
size                        i[0]                        I(1)
rank                        i[1]                        I(2)
ndims                       i[2]                        I(3)
array_of_gsizes             i[3] to i[i[2]+2]           I(4) to I(I(3)+3)
array_of_distribs           i[i[2]+3] to i[2*i[2]+2]    I(I(3)+4) to I(2*I(3)+3)
array_of_dargs              i[2*i[2]+3] to i[3*i[2]+2]  I(2*I(3)+4) to I(3*I(3)+3)
array_of_psizes             i[3*i[2]+3] to i[4*i[2]+2]  I(3*I(3)+4) to I(4*I(3)+3)
order                       i[4*i[2]+3]                 I(4*I(3)+4)
oldtype                     d[0]                        D(1)

and ni = 4*ndims+4, na = 0, nd = 1.
If combiner is MPI_COMBINER_F90_REAL then

Constructor argument        C            Fortran location
p                           i[0]         I(1)
r                           i[1]         I(2)

and ni = 2, na = 0, nd = 0.
If combiner is MPI_COMBINER_F90_COMPLEX then

Constructor argument        C            Fortran location
p                           i[0]         I(1)
r                           i[1]         I(2)

and ni = 2, na = 0, nd = 0.
If combiner is MPI_COMBINER_F90_INTEGER then

Constructor argument        C            Fortran location
r                           i[0]         I(1)

and ni = 1, na = 0, nd = 0.
If combiner is MPI_COMBINER_RESIZED then

153 4.1. DERIVED DATATYPES 123 1 Constructor argument C Fortran location 2 A(1) a[0] lb 3 a[1] extent A(2) 4 oldtype d[0] D(1) 5 and ni = 0, na = 2, nd = 1. 6 7 4.1.14 Examples 8 9 The following examples illustrate the use of derived datatypes. 10 Send and receive a section of a 3D array. Example 4.13 11 12 REAL a(100,100,100), e(9,9,9) 13 INTEGER oneslice, twoslice, threeslice, myrank, ierr 14 INTEGER (KIND=MPI_ADDRESS_KIND) lb, sizeofreal 15 INTEGER status(MPI_STATUS_SIZE) 16 17 C extract the section a(1:17:2, 3:11, 2:10) 18 C and store it in e(:,:,:). 19 20 CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) 21 22 CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, sizeofreal, ierr) 23 24 C create datatype for a 1D section 25 CALL MPI_TYPE_VECTOR(9, 1, 2, MPI_REAL, oneslice, ierr) 26 27 C create datatype for a 2D section 28 CALL MPI_TYPE_CREATE_HVECTOR(9, 1, 100*sizeofreal, oneslice, 29 twoslice, ierr) 30 31 C create datatype for the entire section 32 CALL MPI_TYPE_CREATE_HVECTOR(9, 1, 100*100*sizeofreal, twoslice, 33 threeslice, ierr) 34 35 CALL MPI_TYPE_COMMIT(threeslice, ierr) 36 CALL MPI_SENDRECV(a(1,3,2), 1, threeslice, myrank, 0, e, 9*9*9, 37 MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr) 38 39 40 Example 4.14 Copy the (strictly) lower triangular part of a matrix. 41 42 REAL a(100,100), b(100,100) 43 INTEGER disp(100), blocklen(100), ltype, myrank, ierr 44 INTEGER status(MPI_STATUS_SIZE) 45 46 C copy lower triangular part of array a 47 C onto lower triangular part of array b 48

154 CHAPTER 4. DATATYPES 124 1 CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) 2 3 C compute start and size of each column 4 DO i=1, 100 5 disp(i) = 100*(i-1) + i 6 blocklen(i) = 100-i 7 END DO 8 9 C create datatype for lower triangular part 10 CALL MPI_TYPE_INDEXED(100, blocklen, disp, MPI_REAL, ltype, ierr) 11 12 CALL MPI_TYPE_COMMIT(ltype, ierr) 13 CALL MPI_SENDRECV(a, 1, ltype, myrank, 0, b, 1, 14 ltype, myrank, 0, MPI_COMM_WORLD, status, ierr) 15 16 Transpose a matrix. Example 4.15 17 18 REAL a(100,100), b(100,100) 19 INTEGER row, xpose, myrank, ierr 20 INTEGER (KIND=MPI_ADDRESS_KIND) lb, sizeofreal 21 INTEGER status(MPI_STATUS_SIZE) 22 23 C transpose matrix a onto b 24 25 CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) 26 27 CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, sizeofreal, ierr) 28 29 C create datatype for one row 30 CALL MPI_TYPE_VECTOR(100, 1, 100, MPI_REAL, row, ierr) 31 32 C create datatype for matrix in row-major order 33 CALL MPI_TYPE_CREATE_HVECTOR(100, 1, sizeofreal, row, xpose, ierr) 34 35 CALL MPI_TYPE_COMMIT(xpose, ierr) 36 37 C send matrix in row-major order and receive in column major order 38 CALL MPI_SENDRECV(a, 1, xpose, myrank, 0, b, 100*100, 39 MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr) 40 41 42 Another approach to the transpose problem: Example 4.16 43 44 REAL a(100,100), b(100,100) 45 INTEGER row, row1 46 INTEGER (KIND=MPI_ADDRESS_KIND) disp(2), lb, sizeofreal 47 INTEGER myrank, ierr 48 INTEGER status(MPI_STATUS_SIZE)

155 4.1. DERIVED DATATYPES 125 1 2 CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr) 3 4 C transpose matrix a onto b 5 6 CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, sizeofreal, ierr) 7 8 C create datatype for one row 9 CALL MPI_TYPE_VECTOR(100, 1, 100, MPI_REAL, row, ierr) 10 11 C create datatype for one row, with the extent of one real number 12 lb = 0 13 CALL MPI_TYPE_CREATE_RESIZED(row, lb, sizeofreal, row1, ierr) 14 15 CALL MPI_TYPE_COMMIT(row1, ierr) 16 17 C send 100 rows and receive in column major order 18 CALL MPI_SENDRECV(a, 100, row1, myrank, 0, b, 100*100, 19 MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr) 20 21 We manipulate an array of structures. Example 4.17 22 23 struct Partstruct 24 { 25 int type; /* particle type */ 26 double d[6]; /* particle coordinates */ 27 char b[7]; /* some additional information */ 28 }; 29 30 struct Partstruct particle[1000]; 31 32 int i, dest, tag; 33 MPI_Comm comm; 34 35 36 /* build datatype describing structure */ 37 38 MPI_Datatype Particlestruct, Particletype; 39 MPI_Datatype type[3] = {MPI_INT, MPI_DOUBLE, MPI_CHAR}; 40 int blocklen[3] = {1, 6, 7}; 41 MPI_Aint disp[3]; 42 MPI_Aint base, lb, sizeofentry; 43 44 45 /* compute displacements of structure components */ 46 47 MPI_Get_address(particle, disp); 48

156 CHAPTER 4. DATATYPES 126 1 MPI_Get_address(particle[0].d, disp+1); 2 MPI_Get_address(particle[0].b, disp+2); 3 base = disp[0]; 4 for (i=0; i < 3; i++) disp[i] -= base; 5 6 MPI_Type_create_struct(3, blocklen, disp, type, &Particlestruct); 7 8 /* If compiler does padding in mysterious ways, 9 the following may be safer */ 10 11 /* compute extent of the structure */ 12 13 MPI_Get_address(particle+1, &sizeofentry); 14 sizeofentry -= base; 15 16 /* build datatype describing structure */ 17 18 MPI_Type_create_resized(Particlestruct, 0, sizeofentry, &Particletype); 19 20 21 /* 4.1: 22 send the entire array */ 23 24 MPI_Type_commit(&Particletype); 25 MPI_Send(particle, 1000, Particletype, dest, tag, comm); 26 27 28 /* 4.2: 29 send only the entries of type zero particles, 30 preceded by the number of such entries */ 31 32 MPI_Datatype Zparticles; /* datatype describing all particles 33 with type zero (needs to be recomputed 34 if types change) */ 35 MPI_Datatype Ztype; 36 37 int zdisp[1000]; 38 int zblock[1000], j, k; 39 int zzblock[2] = {1,1}; 40 MPI_Aint zzdisp[2]; 41 MPI_Datatype zztype[2]; 42 43 /* compute displacements of type zero particles */ 44 j = 0; 45 for (i=0; i < 1000; i++) 46 if (particle[i].type == 0) 47 { 48 zdisp[j] = i;

157 127 4.1. DERIVED DATATYPES 1 zblock[j] = 1; 2 j++; 3 } 4 5 /* create datatype for type zero particles */ 6 MPI_Type_indexed(j, zblock, zdisp, Particletype, &Zparticles); 7 8 /* prepend particle count */ 9 MPI_Get_address(&j, zzdisp); 10 MPI_Get_address(particle, zzdisp+1); 11 zztype[0] = MPI_INT; 12 zztype[1] = Zparticles; 13 MPI_Type_create_struct(2, zzblock, zzdisp, zztype, &Ztype); 14 15 MPI_Type_commit(&Ztype); 16 MPI_Send(MPI_BOTTOM, 1, Ztype, dest, tag, comm); 17 18 19 /* A probably more efficient way of defining Zparticles */ 20 21 /* consecutive particles with index zero are handled as one block */ 22 j=0; 23 for (i=0; i < 1000; i++) 24 if (particle[i].type == 0) 25 { 26 for (k=i+1; (k < 1000)&&(particle[k].type == 0) ; k++); 27 zdisp[j] = i; 28 zblock[j] = k-i; 29 j++; 30 i = k; 31 } 32 MPI_Type_indexed(j, zblock, zdisp, Particletype, &Zparticles); 33 34 35 /* 4.3: 36 send the first two coordinates of all entries */ 37 38 MPI_Datatype Allpairs; /* datatype for all pairs of coordinates */ 39 40 MPI_Type_get_extent(Particletype, &lb, &sizeofentry); 41 42 /* sizeofentry can also be computed by subtracting the address 43 of particle[0] from the address of particle[1] */ 44 45 MPI_Type_create_hvector(1000, 2, sizeofentry, MPI_DOUBLE, &Allpairs); 46 MPI_Type_commit(&Allpairs); 47 MPI_Send(particle[0].d, 1, Allpairs, dest, tag, comm); 48

158 128 CHAPTER 4. DATATYPES 1 /* an alternative solution to 4.3 */ 2 3 MPI_Datatype Twodouble; 4 5 MPI_Type_contiguous(2, MPI_DOUBLE, &Twodouble); 6 7 MPI_Datatype Onepair; /* datatype for one pair of coordinates, with 8 the extent of one particle entry */ 9 10 MPI_Type_create_resized(Twodouble, 0, sizeofentry, &Onepair ); 11 MPI_Type_commit(&Onepair); 12 MPI_Send(particle[0].d, 1000, Onepair, dest, tag, comm); 13 14 15 Example 4.18 The same manipulations as in the previous example, but use absolute 16 addresses in datatypes. 17 18 struct Partstruct 19 { 20 int type; 21 double d[6]; 22 char b[7]; 23 }; 24 25 struct Partstruct particle[1000]; 26 27 /* build datatype describing first array entry */ 28 29 MPI_Datatype Particletype; 30 MPI_Datatype type[3] = {MPI_INT, MPI_DOUBLE, MPI_CHAR}; 31 int block[3] = {1, 6, 7}; 32 MPI_Aint disp[3]; 33 34 MPI_Get_address(particle, disp); 35 MPI_Get_address(particle[0].d, disp+1); 36 MPI_Get_address(particle[0].b, disp+2); 37 MPI_Type_create_struct(3, block, disp, type, &Particletype); 38 39 /* Particletype describes first array entry -- using absolute 40 addresses */ 41 42 /* 5.1: 43 send the entire array */ 44 45 MPI_Type_commit(&Particletype); 46 MPI_Send(MPI_BOTTOM, 1000, Particletype, dest, tag, comm); 47 48

159 4.1. DERIVED DATATYPES 129 1 2 /* 5.2: 3 send the entries of type zero, 4 preceded by the number of such entries */ 5 6 MPI_Datatype Zparticles, Ztype; 7 8 int zdisp[1000]; 9 int zblock[1000], i, j, k; 10 int zzblock[2] = {1,1}; 11 MPI_Datatype zztype[2]; 12 MPI_Aint zzdisp[2]; 13 14 j=0; 15 for (i=0; i < 1000; i++) 16 if (particle[i].type == 0) 17 { 18 for (k=i+1; (k < 1000)&&(particle[k].type == 0) ; k++); 19 zdisp[j] = i; 20 zblock[j] = k-i; 21 j++; 22 i = k; 23 } 24 MPI_Type_indexed(j, zblock, zdisp, Particletype, &Zparticles); 25 /* Zparticles describe particles with type zero, using 26 their absolute addresses*/ 27 28 /* prepend particle count */ 29 MPI_Get_address(&j, zzdisp); 30 zzdisp[1] = (MPI_Aint)0; 31 zztype[0] = MPI_INT; 32 zztype[1] = Zparticles; 33 MPI_Type_create_struct(2, zzblock, zzdisp, zztype, &Ztype); 34 35 MPI_Type_commit(&Ztype); 36 MPI_Send(MPI_BOTTOM, 1, Ztype, dest, tag, comm); 37 38 39 Handling of unions. Example 4.19 40 41 union { 42 int ival; 43 float fval; 44 } u[1000]; 45 46 int utype; 47 48

160 130 CHAPTER 4. DATATYPES 1 /* All entries of u have identical type; variable 2 utype keeps track of their current type */ 3 4 MPI_Datatype mpi_utype[2]; 5 MPI_Aint i, extent; 6 7 /* compute an MPI datatype for each possible union type; 8 assume values are left-aligned in union storage. */ 9 10 MPI_Get_address(u, &i); 11 MPI_Get_address(u+1, &extent); 12 extent -= i; 13 14 MPI_Type_create_resized(MPI_INT, 0, extent, &mpi_utype[0]); 15 16 MPI_Type_create_resized(MPI_FLOAT, 0, extent, &mpi_utype[1]); 17 18 for(i=0; i<2; i++) MPI_Type_commit(&mpi_utype[i]); 19 20 /* actual communication */ 21 22 MPI_Send(u, 1000, mpi_utype[utype], dest, tag, comm); 23 24 Example 4.20 This example shows how a datatype can be decoded. The routine 25 printdatatype prints out the elements of the datatype. Note the use of free _ Type _ MPI for 26 datatypes that are not predefined. 27 28 /* 29 Example of decoding a datatype. 30 31 Returns 0 if the datatype is predefined, 1 otherwise 32 */ 33 #include 34 #include 35 #include "mpi.h" 36 int printdatatype(MPI_Datatype datatype) 37 { 38 int *array_of_ints; 39 MPI_Aint *array_of_adds; 40 MPI_Datatype *array_of_dtypes; 41 int num_ints, num_adds, num_dtypes, combiner; 42 int i; 43 44 MPI_Type_get_envelope(datatype, 45 &num_ints, &num_adds, &num_dtypes, &combiner); 46 switch (combiner) { 47 case MPI_COMBINER_NAMED: 48 printf("Datatype is named:");

161 4.2. PACK AND UNPACK 131 1 /* To print the specific type, we can match against the 2 predefined forms. We can NOT use a switch statement here 3 We could also use MPI_TYPE_GET_NAME if we prefered to use 4 names that the user may have changed. 5 */ 6 if (datatype == MPI_INT) printf( "MPI_INT\n" ); 7 else if (datatype == MPI_DOUBLE) printf( "MPI_DOUBLE\n" ); 8 ... else test for other types ... 9 return 0; 10 break; 11 case MPI_COMBINER_STRUCT: 12 case MPI_COMBINER_STRUCT_INTEGER: 13 printf("Datatype is struct containing"); 14 array_of_ints = (int *)malloc(num_ints * sizeof(int)); 15 array_of_adds = 16 (MPI_Aint *) malloc(num_adds * sizeof(MPI_Aint)); 17 array_of_dtypes = (MPI_Datatype *) 18 malloc(num_dtypes * sizeof(MPI_Datatype)); 19 MPI_Type_get_contents(datatype, num_ints, num_adds, num_dtypes, 20 array_of_ints, array_of_adds, array_of_dtypes); 21 printf(" %d datatypes:\n", array_of_ints[0]); 22 for (i=0; i

The user specifies the layout of the data to be sent or received, and the communication library directly accesses a noncontiguous buffer. The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part. Another use is that outgoing messages may be explicitly buffered in user supplied space, thus overriding the system buffering policy. Finally, the availability of pack and unpack operations facilitates the development of additional communication libraries layered on top of MPI.

MPI_PACK(inbuf, incount, datatype, outbuf, outsize, position, comm)

IN     inbuf       input buffer start (choice)
IN     incount     number of input data items (non-negative integer)
IN     datatype    datatype of each input data item (handle)
OUT    outbuf      output buffer start (choice)
IN     outsize     output buffer size, in bytes (non-negative integer)
INOUT  position    current position in buffer, in bytes (integer)
IN     comm        communicator for packed message (handle)

int MPI_Pack(const void* inbuf, int incount, MPI_Datatype datatype,
              void *outbuf, int outsize, int *position, MPI_Comm comm)

MPI_Pack(inbuf, incount, datatype, outbuf, outsize, position, comm, ierror)
              BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf
    TYPE(*), DIMENSION(..) :: outbuf
    INTEGER, INTENT(IN) :: incount, outsize
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(INOUT) :: position
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR

Packs the message in the send buffer specified by inbuf, incount, datatype into the buffer space specified by outbuf and outsize. The input buffer can be any communication buffer allowed in MPI_SEND. The output buffer is a contiguous storage area containing outsize bytes, starting at the address outbuf (length is counted in bytes, not elements, as if it were a communication buffer for a message of type MPI_PACKED).
The input value of position is the first location in the output buffer to be used for packing. position is incremented by the size of the packed message, and the output value of position is the first location in the output buffer following the locations occupied by the packed message. The comm argument is the communicator that will be subsequently used for sending the packed message.

MPI_UNPACK(inbuf, insize, position, outbuf, outcount, datatype, comm)

IN     inbuf       input buffer start (choice)
IN     insize      size of input buffer, in bytes (non-negative integer)
INOUT  position    current position in bytes (integer)
OUT    outbuf      output buffer start (choice)
IN     outcount    number of items to be unpacked (integer)
IN     datatype    datatype of each output data item (handle)
IN     comm        communicator for packed message (handle)

int MPI_Unpack(const void* inbuf, int insize, int *position, void *outbuf,
              int outcount, MPI_Datatype datatype, MPI_Comm comm)

MPI_Unpack(inbuf, insize, position, outbuf, outcount, datatype, comm,
              ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf
    TYPE(*), DIMENSION(..) :: outbuf
    INTEGER, INTENT(IN) :: insize, outcount
    INTEGER, INTENT(INOUT) :: position
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM,
              IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR

Unpacks a message into the receive buffer specified by outbuf, outcount, datatype from the buffer space specified by inbuf and insize. The output buffer can be any communication buffer allowed in MPI_RECV. The input buffer is a contiguous storage area containing insize bytes, starting at address inbuf. The input value of position is the first location in the input buffer occupied by the packed message. position is incremented by the size of the packed message, so that the output value of position is the first location in the input buffer after the locations occupied by the message that was unpacked. comm is the communicator used to receive the packed message.

Advice to users. Note the difference between MPI_RECV and MPI_UNPACK: in MPI_RECV, the count argument specifies the maximum number of items that can be received. The actual number of items received is determined by the length of the incoming message. In MPI_UNPACK, the count argument specifies the actual number of items that are unpacked; the "size" of the corresponding message is the increment in position. The reason for this change is that the "incoming message size" is not predetermined since the user decides how much to unpack; nor is it easy to determine the "message size" from the number of items to be unpacked. In fact, in a heterogeneous system, this number may not be determined a priori. (End of advice to users.)

To understand the behavior of pack and unpack, it is convenient to think of the data part of a message as being the sequence obtained by concatenating the successive values sent in that message. The pack operation stores this sequence in the buffer space, as if sending the message to that buffer. The unpack operation retrieves this sequence from buffer space, as if receiving a message from that buffer. (It is helpful to think of internal Fortran files or sscanf in C, for a similar function.)
Several messages can be successively packed into one packing unit. This is effected by several successive related calls to MPI_PACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, and the same values for outbuf, outcount and comm. This packing unit now contains the equivalent information that would have been stored in a message by one send call with a send buffer that is the "concatenation" of the individual send buffers.
A packing unit can be sent using type MPI_PACKED. Any point to point or collective communication function can be used to move the sequence of bytes that forms the packing unit from one process to another. This packing unit can now be received using any receive operation, with any datatype: the type matching rules are relaxed for messages sent with type MPI_PACKED.
A message sent with any type (including MPI_PACKED) can be received using the type MPI_PACKED. Such a message can then be unpacked by calls to MPI_UNPACK.
A packing unit (or a message created by a regular, "typed" send) can be unpacked into several successive messages. This is effected by several successive related calls to MPI_UNPACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, and the same values for inbuf, insize and comm.
The concatenation of two packing units is not necessarily a packing unit; nor is a substring of a packing unit necessarily a packing unit. Thus, one cannot concatenate two packing units and then unpack the result as one packing unit; nor can one unpack a substring of a packing unit as a separate packing unit. Each packing unit, that was created by a related sequence of pack calls, or by a regular send, must be unpacked as a unit, by a sequence of related unpack calls.

Rationale. The restriction on "atomic" packing and unpacking of packing units allows the implementation to add at the head of packing units additional information, such as a description of the sender architecture (to be used for type conversion, in a heterogeneous environment). (End of rationale.)

The following call allows the user to find out how much space is needed to pack a message and, thus, manage space allocation for buffers.

MPI_PACK_SIZE(incount, datatype, comm, size)

IN   incount     count argument to packing call (non-negative integer)
IN   datatype    datatype argument to packing call (handle)
IN   comm        communicator argument to packing call (handle)
OUT  size        upper bound on size of packed message, in bytes (non-negative integer)

int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
              int *size)

MPI_Pack_size(incount, datatype, comm, size, ierror) BIND(C)
    INTEGER, INTENT(IN) :: incount
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, COMM, SIZE, IERROR

A call to MPI_PACK_SIZE(incount, datatype, comm, size) returns in size an upper bound on the increment in position that is effected by a call to MPI_PACK(inbuf, incount, datatype, outbuf, outsize, position, comm). If the packed size of the datatype cannot be expressed by the size parameter, then MPI_PACK_SIZE sets the value of size to MPI_UNDEFINED.

Rationale. The call returns an upper bound, rather than an exact bound, since the exact amount of space needed to pack the message may depend on the context (e.g., first message packed in a packing unit may take more space). (End of rationale.)

Example 4.21 An example using MPI_PACK.

int position, i, j, a[2];
char buff[1000];

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
{
    /* SENDER CODE */

    position = 0;
    MPI_Pack(&i, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Pack(&j, 1, MPI_INT, buff, 1000, &position, MPI_COMM_WORLD);
    MPI_Send(buff, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}
else /* RECEIVER CODE */
    MPI_Recv(a, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Example 4.22 An elaborate example.

166 CHAPTER 4. DATATYPES 136 1 int position, i; 2 float a[1000]; 3 char buff[1000]; 4 5 MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 6 if (myrank == 0) 7 { 8 /* SENDER CODE */ 9 10 int len[2]; 11 MPI_Aint disp[2]; 12 MPI_Datatype type[2], newtype; 13 14 /* build datatype for i followed by a[0]...a[i-1] */ 15 16 len[0] = 1; 17 len[1] = i; 18 MPI_Get_address(&i, disp); 19 MPI_Get_address(a, disp+1); 20 type[0] = MPI_INT; 21 type[1] = MPI_FLOAT; 22 MPI_Type_create_struct(2, len, disp, type, &newtype); 23 MPI_Type_commit(&newtype); 24 25 /* Pack i followed by a[0]...a[i-1]*/ 26 27 position = 0; 28 MPI_Pack(MPI_BOTTOM, 1, newtype, buff, 1000, &position, MPI_COMM_WORLD); 29 30 /* Send */ 31 32 MPI_Send(buff, position, MPI_PACKED, 1, 0, 33 MPI_COMM_WORLD); 34 35 /* ***** 36 One can replace the last three lines with 37 MPI_Send(MPI_BOTTOM, 1, newtype, 1, 0, MPI_COMM_WORLD); 38 ***** */ 39 } 40 else if (myrank == 1) 41 { 42 /* RECEIVER CODE */ 43 44 MPI_Status status; 45 46 /* Receive */ 47 48 MPI_Recv(buff, 1000, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);

167 4.2. PACK AND UNPACK 137 1 2 /* Unpack i */ 3 4 position = 0; 5 MPI_Unpack(buff, 1000, &position, &i, 1, MPI_INT, MPI_COMM_WORLD); 6 7 /* Unpack a[0]...a[i-1] */ 8 MPI_Unpack(buff, 1000, &position, a, i, MPI_FLOAT, MPI_COMM_WORLD); 9 } 10 11 Each process sends a count, followed by count characters to the root; the Example 4.23 12 root concatenates all characters into one string. 13 14 int count, gsize, counts[64], totalcount, k1, k2, k, 15 displs[64], position, concat_pos; 16 char chr[100], *lbuf, *rbuf, *cbuf; 17 18 MPI_Comm_size(comm, &gsize); 19 MPI_Comm_rank(comm, &myrank); 20 21 /* allocate local pack buffer */ 22 MPI_Pack_size(1, MPI_INT, comm, &k1); 23 MPI_Pack_size(count, MPI_CHAR, comm, &k2); 24 k = k1+k2; 25 lbuf = (char *)malloc(k); 26 27 /* pack count, followed by count characters */ 28 position = 0; 29 MPI_Pack(&count, 1, MPI_INT, lbuf, k, &position, comm); 30 MPI_Pack(chr, count, MPI_CHAR, lbuf, k, &position, comm); 31 32 if (myrank != root) { 33 /* gather at root sizes of all packed messages */ 34 MPI_Gather(&position, 1, MPI_INT, NULL, 0, 35 MPI_DATATYPE_NULL, root, comm); 36 37 /* gather at root packed messages */ 38 MPI_Gatherv(lbuf, position, MPI_PACKED, NULL, 39 NULL, NULL, MPI_DATATYPE_NULL, root, comm); 40 41 } else { /* root code */ 42 /* gather sizes of all packed messages */ 43 MPI_Gather(&position, 1, MPI_INT, counts, 1, 44 MPI_INT, root, comm); 45 46 /* gather all packed messages */ 47 displs[0] = 0; 48 for (i=1; i < gsize; i++)

            displs[i] = displs[i-1] + counts[i-1];
        totalcount = displs[gsize-1] + counts[gsize-1];
        rbuf = (char *)malloc(totalcount);
        cbuf = (char *)malloc(totalcount);
        MPI_Gatherv(lbuf, position, MPI_PACKED, rbuf,
                    counts, displs, MPI_PACKED, root, comm);

        /* unpack all messages and concatenate strings */
        concat_pos = 0;
        for (i=0; i < gsize; i++) {
            position = 0;
            MPI_Unpack(rbuf+displs[i], totalcount-displs[i],
                       &position, &count, 1, MPI_INT, comm);
            MPI_Unpack(rbuf+displs[i], totalcount-displs[i],
                       &position, cbuf+concat_pos, count, MPI_CHAR, comm);
            concat_pos += count;
        }
        cbuf[concat_pos] = '\0';
    }

4.3 Canonical MPI_PACK and MPI_UNPACK

These functions read/write data to/from the buffer in the "external32" data format specified in Section 13.5.2, and calculate the size needed for packing. Their first arguments specify the data format, for future extensibility, but currently the only valid value of the datarep argument is "external32."

Advice to users. These functions could be used, for example, to send typed data in a portable format from one MPI implementation to another. (End of advice to users.)

The buffer will contain exactly the packed data, without headers. MPI_BYTE should be used to send and receive data that is packed using MPI_PACK_EXTERNAL.

Rationale. MPI_PACK_EXTERNAL specifies that there is no header on the message and further specifies the exact format of the data. Since MPI_PACK may (and is allowed to) use a header, the datatype MPI_PACKED cannot be used for data packed with MPI_PACK_EXTERNAL. (End of rationale.)

169 4.3. CANONICAL AND MPI _ UNPACK 139 MPI _ PACK 1 _ _ EXTERNAL(datarep, inbuf, incount, datatype, outbuf, outsize, position) MPI PACK 2 datarep data representation (string) IN 3 inbuf IN input buffer start (choice) 4 5 IN incount number of input data items (integer) 6 datatype of each input data item (handle) IN datatype 7 output buffer start (choice) outbuf OUT 8 9 IN outsize output buffer size, in bytes (integer) 10 position INOUT current position in buffer, in bytes (integer) 11 12 int MPI_Pack_external(const char datarep[], const void *inbuf, int incount, 13 MPI_Datatype datatype, void *outbuf, MPI_Aint outsize, 14 MPI_Aint *position) 15 16 MPI_Pack_external(datarep, inbuf, incount, datatype, outbuf, outsize, 17 position, ierror) BIND(C) 18 CHARACTER(LEN=*), INTENT(IN) :: datarep 19 TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf 20 TYPE(*), DIMENSION(..) :: outbuf 21 INTEGER, INTENT(IN) :: incount 22 TYPE(MPI_Datatype), INTENT(IN) :: datatype 23 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: outsize 24 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(INOUT) :: position 25 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 26 MPI_PACK_EXTERNAL(DATAREP, INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, 27 POSITION, IERROR) 28 INTEGER INCOUNT, DATATYPE, IERROR 29 INTEGER(KIND=MPI_ADDRESS_KIND) OUTSIZE, POSITION 30 CHARACTER*(*) DATAREP 31 INBUF(*), OUTBUF(*) 32 33 34 EXTERNAL(datarep, inbuf, insize, position, outbuf, outsize, position) _ UNPACK _ MPI 35 36 IN datarep data representation (string) 37 inbuf input buffer start (choice) IN 38 input buffer size, in bytes (integer) IN insize 39 40 position current position in buffer, in bytes (integer) INOUT 41 outbuf OUT output buffer start (choice) 42 number of output data items (integer) IN outcount 43 44 datatype datatype of output data item (handle) IN 45 46 int MPI_Unpack_external(const char datarep[], const void *inbuf, 47 MPI_Aint insize, MPI_Aint *position, void *outbuf, 48 int outcount, MPI_Datatype datatype)

MPI_Unpack_external(datarep, inbuf, insize, position, outbuf, outcount,
                    datatype, ierror) BIND(C)
    CHARACTER(LEN=*), INTENT(IN) :: datarep
    TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf
    TYPE(*), DIMENSION(..) :: outbuf
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: insize
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(INOUT) :: position
    INTEGER, INTENT(IN) :: outcount
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_UNPACK_EXTERNAL(DATAREP, INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT,
                    DATATYPE, IERROR)
    INTEGER OUTCOUNT, DATATYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) INSIZE, POSITION
    CHARACTER*(*) DATAREP
    <type> INBUF(*), OUTBUF(*)

MPI_PACK_EXTERNAL_SIZE(datarep, incount, datatype, size)
  IN   datarep    data representation (string)
  IN   incount    number of input data items (integer)
  IN   datatype   datatype of each input data item (handle)
  OUT  size       output buffer size, in bytes (integer)

int MPI_Pack_external_size(const char datarep[], int incount,
                           MPI_Datatype datatype, MPI_Aint *size)

MPI_Pack_external_size(datarep, incount, datatype, size, ierror) BIND(C)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    INTEGER, INTENT(IN) :: incount
    CHARACTER(LEN=*), INTENT(IN) :: datarep
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_PACK_EXTERNAL_SIZE(DATAREP, INCOUNT, DATATYPE, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE
    CHARACTER*(*) DATAREP
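The fragment below is a minimal sketch (not part of the standard text) of how the three routines fit together: the sender sizes a buffer with MPI_Pack_external_size, packs an int and a double in the "external32" representation, ships the packed bytes as MPI_BYTE, and the receiver unpacks with MPI_Unpack_external. The use of ranks 0 and 1 and the variable names are assumptions of the example; comm is assumed to be a valid communicator with at least two processes.

    /* Sketch: pack an int and a double in "external32" and exchange them
     * between rank 0 and rank 1 of comm. */
    int      ival;
    double   dval;
    MPI_Aint bufsize = 0, sz, position = 0;
    char    *buf;
    int      myrank;
    ...
    MPI_Comm_rank(comm, &myrank);
    MPI_Pack_external_size("external32", 1, MPI_INT, &sz);     bufsize += sz;
    MPI_Pack_external_size("external32", 1, MPI_DOUBLE, &sz);  bufsize += sz;
    buf = (char *)malloc(bufsize);

    if (myrank == 0) {
        ival = 42;  dval = 3.14;
        MPI_Pack_external("external32", &ival, 1, MPI_INT,
                          buf, bufsize, &position);
        MPI_Pack_external("external32", &dval, 1, MPI_DOUBLE,
                          buf, bufsize, &position);
        /* packed data is sent as MPI_BYTE, not MPI_PACKED */
        MPI_Send(buf, (int)position, MPI_BYTE, 1, 0, comm);
    } else if (myrank == 1) {
        MPI_Recv(buf, (int)bufsize, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
        MPI_Unpack_external("external32", buf, bufsize, &position,
                            &ival, 1, MPI_INT);
        MPI_Unpack_external("external32", buf, bufsize, &position,
                            &dval, 1, MPI_DOUBLE);
    }
    free(buf);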

Chapter 5

Collective Communication

5.1 Introduction and Overview

Collective communication is defined as communication that involves a group or groups of processes. The functions of this type provided by MPI are the following:

• MPI_BARRIER, MPI_IBARRIER: Barrier synchronization across all members of a group (Section 5.3 and Section 5.12.1).

• MPI_BCAST, MPI_IBCAST: Broadcast from one member to all members of a group (Section 5.4 and Section 5.12.2). This is shown as "broadcast" in Figure 5.1.

• MPI_GATHER, MPI_IGATHER, MPI_GATHERV, MPI_IGATHERV: Gather data from all members of a group to one member (Section 5.5 and Section 5.12.3). This is shown as "gather" in Figure 5.1.

• MPI_SCATTER, MPI_ISCATTER, MPI_SCATTERV, MPI_ISCATTERV: Scatter data from one member to all members of a group (Section 5.6 and Section 5.12.4). This is shown as "scatter" in Figure 5.1.

• MPI_ALLGATHER, MPI_IALLGATHER, MPI_ALLGATHERV, MPI_IALLGATHERV: A variation on Gather where all members of a group receive the result (Section 5.7 and Section 5.12.5). This is shown as "allgather" in Figure 5.1.

• MPI_ALLTOALL, MPI_IALLTOALL, MPI_ALLTOALLV, MPI_IALLTOALLV, MPI_ALLTOALLW, MPI_IALLTOALLW: Scatter/Gather data from all members to all members of a group (also called complete exchange) (Section 5.8 and Section 5.12.6). This is shown as "complete exchange" in Figure 5.1.

• MPI_ALLREDUCE, MPI_IALLREDUCE, MPI_REDUCE, MPI_IREDUCE: Global reduction operations such as sum, max, min, or user-defined functions, where the result is returned to all members of a group (Section 5.9.6 and Section 5.12.8) and a variation where the result is returned to only one member (Section 5.9 and Section 5.12.7).

• MPI_REDUCE_SCATTER_BLOCK, MPI_IREDUCE_SCATTER_BLOCK, MPI_REDUCE_SCATTER, MPI_IREDUCE_SCATTER: A combined reduction and scatter operation (Section 5.10, Section 5.12.9, and Section 5.12.10).

• MPI_SCAN, MPI_ISCAN, MPI_EXSCAN, MPI_IEXSCAN: Scan across all members of a group (also called prefix) (Section 5.11, Section 5.11.2, Section 5.12.11, and Section 5.12.12).

One of the key arguments in a call to a collective routine is a communicator that defines the group or groups of participating processes and provides a context for the operation. This is discussed further in Section 5.2. The syntax and semantics of the collective operations are defined to be consistent with the syntax and semantics of the point-to-point operations. Thus, general datatypes are allowed and must match between sending and receiving processes as specified in Chapter 4. Several collective routines such as broadcast and gather have a single originating or receiving process. Such a process is called the root. Some arguments in the collective functions are specified as "significant only at root," and are ignored for all participants except the root. The reader is referred to Chapter 4 for information concerning communication buffers, general datatypes and type matching rules, and to Chapter 6 for information on how to define groups and create communicators.

The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Different type maps (the layout in memory, see Section 4.1) between sender and receiver are still allowed.

Collective operations can (but are not required to) complete as soon as the caller's participation in the collective communication is finished. A blocking operation is complete as soon as the call returns. A nonblocking (immediate) call requires a separate completion call (cf. Section 3.7). The completion of a collective operation indicates that the caller is free to modify locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise implied by the description of the operation). Thus, a collective communication operation may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier operation.

Collective communication calls may use the same communicators as point-to-point communication; MPI guarantees that messages generated on behalf of collective communication calls will not be confused with messages generated by point-to-point communication. The collective operations do not have a message tag argument. A more detailed discussion of correct use of collective routines is found in Section 5.13.

Rationale. The equal-data restriction (on type matching) was made so as to avoid the complexity of providing a facility analogous to the status argument of MPI_RECV for discovering the amount of data sent. Some of the collective routines would require an array of status values.

The statements about synchronization are made so as to allow a variety of implementations of the collective functions. (End of rationale.)

Advice to users. It is dangerous to rely on synchronization side-effects of the collective operations for program correctness. For example, even though a particular implementation may provide a broadcast routine with a side-effect of synchronization, the standard does not require this, and a program that relies on this will not be portable.

Figure 5.1: Collective move functions illustrated for a group of six processes. In each case, each row of boxes represents data locations in one process. Thus, in the broadcast, initially just the first process contains the data A_0, but after the broadcast all processes contain it.

On the other hand, a correct, portable program must allow for the fact that a collective call may be synchronizing. Though one cannot rely on any synchronization side-effect, one must program so as to allow it. These issues are discussed further in Section 5.13. (End of advice to users.)

Advice to implementors. While vendors may write optimized collective routines matched to their architectures, a complete library of the collective communication routines can be written entirely using the MPI point-to-point communication functions and a few auxiliary functions. If implementing on top of point-to-point, a hidden, special communicator might be created for the collective operation so as to avoid interference with any on-going point-to-point communication at the time of the collective call. This is discussed further in Section 5.13. (End of advice to implementors.)

Many of the descriptions of the collective routines provide illustrations in terms of blocking MPI point-to-point routines. These are intended solely to indicate what data is sent or received by what process. Many of these examples are not correct MPI programs; for purposes of simplicity, they often assume infinite buffering.

5.2 Communicator Argument

The key concept of the collective functions is to have a group or groups of participating processes. The routines do not have group identifiers as explicit arguments. Instead, there is a communicator argument. Groups and communicators are discussed in full detail in Chapter 6. For the purposes of this chapter, it is sufficient to know that there are two types of communicators: intra-communicators and inter-communicators. An intracommunicator can be thought of as an identifier for a single group of processes linked with a context. An intercommunicator identifies two distinct groups of processes linked with a context.

5.2.1 Specifics for Intracommunicator Collective Operations

All processes in the group identified by the intracommunicator must call the collective routine.

In many cases, collective communication can occur "in place" for intracommunicators, with the output buffer being identical to the input buffer. This is specified by providing a special argument value, MPI_IN_PLACE, instead of the send buffer or the receive buffer argument, depending on the operation performed.

Rationale. The "in place" operations are provided to reduce unnecessary memory motion by both the MPI implementation and by the user. Note that while the simple check of testing whether the send and receive buffers have the same address will work for some cases (e.g., MPI_ALLREDUCE), they are inadequate in others (e.g., MPI_GATHER, with root not equal to zero). Further, Fortran explicitly prohibits aliasing of arguments; the approach of using a special value to denote "in place" operation eliminates that difficulty. (End of rationale.)

Advice to users. By allowing the "in place" option, the receive buffer in many of the collective calls becomes a send-and-receive buffer. For this reason, a Fortran binding that includes INTENT must mark these as INOUT, not OUT.

Note that MPI_IN_PLACE is a special kind of value; it has the same restrictions on its use that MPI_BOTTOM has. (End of advice to users.)

5.2.2 Applying Collective Operations to Intercommunicators

To understand how collective operations apply to intercommunicators, we can view most MPI intracommunicator collective operations as fitting one of the following categories (see, for instance, [56]):

All-To-All All processes contribute to the result. All processes receive the result.

• MPI_ALLGATHER, MPI_IALLGATHER, MPI_ALLGATHERV, MPI_IALLGATHERV

• MPI_ALLTOALL, MPI_IALLTOALL, MPI_ALLTOALLV, MPI_IALLTOALLV, MPI_ALLTOALLW, MPI_IALLTOALLW

• MPI_ALLREDUCE, MPI_IALLREDUCE, MPI_REDUCE_SCATTER_BLOCK, MPI_IREDUCE_SCATTER_BLOCK, MPI_REDUCE_SCATTER, MPI_IREDUCE_SCATTER

• MPI_BARRIER, MPI_IBARRIER

All-To-One All processes contribute to the result. One process receives the result.

• MPI_GATHER, MPI_IGATHER, MPI_GATHERV, MPI_IGATHERV

• MPI_REDUCE, MPI_IREDUCE

One-To-All One process contributes to the result. All processes receive the result.

• MPI_BCAST, MPI_IBCAST

• MPI_SCATTER, MPI_ISCATTER, MPI_SCATTERV, MPI_ISCATTERV

Other Collective operations that do not fit into one of the above categories.

• MPI_SCAN, MPI_ISCAN, MPI_EXSCAN, MPI_IEXSCAN

The data movement patterns of MPI_SCAN, MPI_ISCAN, MPI_EXSCAN, and MPI_IEXSCAN do not fit this taxonomy.

The application of collective communication to intercommunicators is best described in terms of two groups. For example, an all-to-all MPI_ALLGATHER operation can be described as collecting data from all members of one group with the result appearing in all members of the other group (see Figure 5.2). As another example, a one-to-all MPI_BCAST operation sends data from one member of one group to all members of the other group. Collective computation operations such as MPI_REDUCE_SCATTER have a similar interpretation (see Figure 5.3). For intracommunicators, these two groups are the same. For intercommunicators, these two groups are distinct. For the all-to-all operations, each such operation is described in two phases, so that it has a symmetric, full-duplex behavior.

The following collective operations also apply to intercommunicators:

• MPI_BARRIER, MPI_IBARRIER

• MPI_BCAST, MPI_IBCAST

• MPI_GATHER, MPI_IGATHER, MPI_GATHERV, MPI_IGATHERV,

• MPI_SCATTER, MPI_ISCATTER, MPI_SCATTERV, MPI_ISCATTERV,

• MPI_ALLGATHER, MPI_IALLGATHER, MPI_ALLGATHERV, MPI_IALLGATHERV,

• MPI_ALLTOALL, MPI_IALLTOALL, MPI_ALLTOALLV, MPI_IALLTOALLV, MPI_ALLTOALLW, MPI_IALLTOALLW,

• MPI_ALLREDUCE, MPI_IALLREDUCE, MPI_REDUCE, MPI_IREDUCE,

• MPI_REDUCE_SCATTER_BLOCK, MPI_IREDUCE_SCATTER_BLOCK, MPI_REDUCE_SCATTER, MPI_IREDUCE_SCATTER.

Figure 5.2: Intercommunicator allgather. The focus of data to one process is represented, not mandated by the semantics. The two phases do allgathers in both directions.

5.2.3 Specifics for Intercommunicator Collective Operations

All processes in both groups identified by the intercommunicator must call the collective routine.

Note that the "in place" option for intracommunicators does not apply to intercommunicators since in the intercommunicator case there is no communication from a process to itself.

For intercommunicator collective communication, if the operation is in the All-To-One or One-To-All categories, then the transfer is unidirectional. The direction of the transfer is indicated by a special value of the root argument. In this case, for the group containing the root process, all processes in the group must call the routine using a special argument for the root. For this, the root process uses the special root value MPI_ROOT; all other processes in the same group as the root use MPI_PROC_NULL. All processes in the other group (the group that is the remote group relative to the root process) must call the collective routine

and provide the rank of the root. If the operation is in the All-To-All category, then the transfer is bidirectional.

Rationale. Operations in the All-To-One and One-To-All categories are unidirectional by nature, and there is a clear way of specifying direction. Operations in the All-To-All category will often occur as part of an exchange, where it makes sense to communicate in both directions at once. (End of rationale.)

Figure 5.3: Intercommunicator reduce-scatter. The focus of data to one process is represented, not mandated by the semantics. The two phases do reduce-scatters in both directions.
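As an illustration of the rooted (One-To-All) case, the sketch below (not part of the standard text) broadcasts from one process of group A to every process of group B over an intercommunicator. The intercommunicator intercomm, the flag in_group_A, and the choice of rank 0 of group A as the root are assumptions of the example.

    /* Sketch: one-to-all broadcast across an intercommunicator.
     * intercomm and in_group_A (nonzero for members of group A) are assumed
     * to have been set up elsewhere, e.g., with MPI_Intercomm_create. */
    int data[4];
    int myrank;
    MPI_Comm_rank(intercomm, &myrank);   /* rank within the local group */

    if (in_group_A) {
        if (myrank == 0) {
            /* the actual root of the broadcast */
            MPI_Bcast(data, 4, MPI_INT, MPI_ROOT, intercomm);
        } else {
            /* other members of the root's group participate but move no data */
            MPI_Bcast(data, 4, MPI_INT, MPI_PROC_NULL, intercomm);
        }
    } else {
        /* group B: every process receives; the root argument is the
         * root's rank in group A (here 0) */
        MPI_Bcast(data, 4, MPI_INT, 0, intercomm);
    }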

5.3 Barrier Synchronization

MPI_BARRIER(comm)
  IN  comm   communicator (handle)

int MPI_Barrier(MPI_Comm comm)

MPI_Barrier(comm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR

If comm is an intracommunicator, MPI_BARRIER blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call.

If comm is an intercommunicator, MPI_BARRIER involves two groups. The call returns at processes in one group (group A) of the intercommunicator only after all members of the other group (group B) have entered the call (and vice versa). A process may return from the call before all processes in its own group have entered the call.

5.4 Broadcast

MPI_BCAST(buffer, count, datatype, root, comm)
  INOUT  buffer    starting address of buffer (choice)
  IN     count     number of entries in buffer (non-negative integer)
  IN     datatype  data type of buffer (handle)
  IN     root      rank of broadcast root (integer)
  IN     comm      communicator (handle)

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)

MPI_Bcast(buffer, count, datatype, root, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..) :: buffer
    INTEGER, INTENT(IN) :: count, root
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

If comm is an intracommunicator, MPI_BCAST broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of the group using the same arguments for comm and root. On return, the content of root's buffer is copied to all other processes.

General, derived datatypes are allowed for datatype. The type signature of count, datatype on any process must be equal to the type signature of count, datatype at the root. This implies that the amount of data sent must be equal to the amount received, pairwise between each process and the root. MPI_BCAST and all other data-movement collective routines make this restriction. Distinct type maps between sender and receiver are still allowed.

The "in place" option is not meaningful here.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Data is broadcast from the root to all processes

in group B. The buffer arguments of the processes in group B must be consistent with the buffer argument of the root.

5.4.1 Example using MPI_BCAST

The examples in this section use intracommunicators.

Example 5.1
Broadcast 100 ints from process 0 to every process in the group.

    MPI_Comm comm;
    int array[100];
    int root=0;
    ...
    MPI_Bcast(array, 100, MPI_INT, root, comm);

As in many of our example code fragments, we assume that some of the variables (such as comm in the above) have been assigned appropriate values.

5.5 Gather

MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
  IN   sendbuf    starting address of send buffer (choice)
  IN   sendcount  number of elements in send buffer (non-negative integer)
  IN   sendtype   data type of send buffer elements (handle)
  OUT  recvbuf    address of receive buffer (choice, significant only at root)
  IN   recvcount  number of elements for any single receive (non-negative integer, significant only at root)
  IN   recvtype   data type of recv buffer elements (significant only at root) (handle)
  IN   root       rank of receiving process (integer)
  IN   comm       communicator (handle)

int MPI_Gather(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
               void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
               MPI_Comm comm)

MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
           root, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount, root

    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
           ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

If comm is an intracommunicator, each process (root process included) sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order. The outcome is as if each of the n processes in the group (including the root process) had executed a call to

    MPI_Send(sendbuf, sendcount, sendtype, root, ...),

and the root had executed n calls to

    MPI_Recv(recvbuf + i·recvcount·extent(recvtype), recvcount, recvtype, i, ...),

where extent(recvtype) is the type extent obtained from a call to MPI_Type_get_extent.

An alternative description is that the n messages sent by the processes in the group are concatenated in rank order, and the resulting message is received by the root as if by a call to MPI_RECV(recvbuf, recvcount·n, recvtype, ...).

The receive buffer is ignored for all non-root processes.

General, derived datatypes are allowed for both sendtype and recvtype. The type signature of sendcount, sendtype on each process must be equal to the type signature of recvcount, recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, and comm are significant. The arguments root and comm must have identical values on all processes.

The specification of counts and types should not cause any location on the root to be written more than once. Such a call is erroneous.

Note that the recvcount argument at the root indicates the number of items it receives from each process, not the total number of items it receives.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE as the value of sendbuf at the root. In such a case, sendcount and sendtype are ignored, and the contribution of the root to the gathered vector is assumed to be already in the correct place in the receive buffer.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Data is gathered from all processes in group B to the root. The send buffer arguments of the processes in group B must be consistent with the receive buffer argument of the root.
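As a hedged illustration of the "in place" option (not part of the standard text), the fragment below gathers 100 ints per process at root 0; the root's own block is assumed to sit already in its slot of the receive buffer, so the root passes MPI_IN_PLACE as sendbuf. The variable names are illustrative, and comm is assumed to be a valid intracommunicator.

    /* Sketch: MPI_Gather with MPI_IN_PLACE at the root. */
    MPI_Comm comm;
    int gsize, myrank, root = 0, i;
    int sendarray[100], *rbuf = NULL;
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    if (myrank == root) {
        rbuf = (int *)malloc(gsize*100*sizeof(int));
        for (i = 0; i < 100; i++)
            rbuf[root*100 + i] = sendarray[i];  /* root's block already in place */
        /* sendcount and sendtype are ignored when MPI_IN_PLACE is used */
        MPI_Gather(MPI_IN_PLACE, 100, MPI_INT,
                   rbuf, 100, MPI_INT, root, comm);
    } else {
        /* recvbuf, recvcount, recvtype are significant only at the root */
        MPI_Gather(sendarray, 100, MPI_INT, NULL, 100, MPI_INT, root, comm);
    }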

MPI_GATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements in send buffer (non-negative integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice, significant only at root)
  IN   recvcounts  non-negative integer array (of length group size) containing the number of elements that are received from each process (significant only at root)
  IN   displs      integer array (of length group size). Entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i (significant only at root)
  IN   recvtype    data type of recv buffer elements (significant only at root) (handle)
  IN   root        rank of receiving process (integer)
  IN   comm        communicator (handle)

int MPI_Gatherv(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, const int recvcounts[], const int displs[],
                MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs,
            recvtype, root, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcounts(*), displs(*), root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
            RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT,
            COMM, IERROR

MPI_GATHERV extends the functionality of MPI_GATHER by allowing a varying count of data from each process, since recvcounts is now an array. It also allows more flexibility as to where the data is placed on the root, by providing the new argument, displs.

If comm is an intracommunicator, the outcome is as if each process, including the root process, sends a message to the root,

    MPI_Send(sendbuf, sendcount, sendtype, root, ...),

and the root executes n receives,

    MPI_Recv(recvbuf + displs[j]·extent(recvtype), recvcounts[j], recvtype, j, ...).

The data received from process j is placed into recvbuf of the root process beginning at offset displs[j] elements (in terms of the recvtype).

The receive buffer is ignored for all non-root processes.

The type signature implied by sendcount, sendtype on process i must be equal to the type signature implied by recvcounts[i], recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed, as illustrated in Example 5.6.

All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, and comm are significant. The arguments root and comm must have identical values on all processes.

The specification of counts, types, and displacements should not cause any location on the root to be written more than once. Such a call is erroneous.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE as the value of sendbuf at the root. In such a case, sendcount and sendtype are ignored, and the contribution of the root to the gathered vector is assumed to be already in the correct place in the receive buffer.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Data is gathered from all processes in group B to the root. The send buffer arguments of the processes in group B must be consistent with the receive buffer argument of the root.

5.5.1 Examples using MPI_GATHER, MPI_GATHERV

The examples in this section use intracommunicators.

Example 5.2
Gather 100 ints from every process in group to root. See Figure 5.4.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf;
    ...
    MPI_Comm_size(comm, &gsize);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Example 5.3
Previous example modified: only the root allocates memory for the receive buffer.

Figure 5.4: The root process gathers 100 ints from each process in the group.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, myrank, *rbuf;
    ...
    MPI_Comm_rank(comm, &myrank);
    if (myrank == root) {
        MPI_Comm_size(comm, &gsize);
        rbuf = (int *)malloc(gsize*100*sizeof(int));
    }
    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Example 5.4
Do the same as the previous example, but use a derived datatype. Note that the type cannot be the entire set of gsize*100 ints since type matching is defined pairwise between the root and each process in the gather.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf;
    MPI_Datatype rtype;
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Type_contiguous(100, MPI_INT, &rtype);
    MPI_Type_commit(&rtype);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);

Example 5.5
Now have each process send 100 ints to root, but place each set (of 100) stride ints apart at receiving end. Use MPI_GATHERV and the displs argument to achieve this effect. Assume stride ≥ 100. See Figure 5.5.

Figure 5.5: The root process gathers 100 ints from each process in the group, each set is placed stride ints apart.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int root, *rbuf, stride;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size(comm, &gsize);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100;
    }
    MPI_Gatherv(sendarray, 100, MPI_INT, rbuf, rcounts, displs, MPI_INT,
                root, comm);

Note that the program is erroneous if stride < 100.

Example 5.6
Same as Example 5.5 on the receiving side, but send the 100 ints from the 0th column of a 100×150 int array, in C. See Figure 5.6.

    MPI_Comm comm;
    int gsize,sendarray[100][150];
    int root, *rbuf, stride;
    MPI_Datatype stype;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size(comm, &gsize);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {

Figure 5.6: The root process gathers column 0 of a 100×150 C array, and each set is placed stride ints apart.

        displs[i] = i*stride;
        rcounts[i] = 100;
    }
    /* Create datatype for 1 column of array
     */
    MPI_Type_vector(100, 1, 150, MPI_INT, &stype);
    MPI_Type_commit(&stype);
    MPI_Gatherv(sendarray, 1, stype, rbuf, rcounts, displs, MPI_INT,
                root, comm);

Example 5.7
Process i sends (100-i) ints from the i-th column of a 100×150 int array, in C. It is received into a buffer with stride, as in the previous two examples. See Figure 5.7.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, stride, myrank;
    MPI_Datatype stype;
    int *displs,i,*rcounts;

    ...

    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100-i;   /* note change from previous example */
    }
    /* Create datatype for the column we are sending
     */
    MPI_Type_vector(100-myrank, 1, 150, MPI_INT, &stype);
    MPI_Type_commit(&stype);

Figure 5.7: The root process gathers 100-i ints from column i of a 100×150 C array, and each set is placed stride ints apart.

    /* sptr is the address of start of "myrank" column
     */
    sptr = &sendarray[0][myrank];
    MPI_Gatherv(sptr, 1, stype, rbuf, rcounts, displs, MPI_INT,
                root, comm);

Note that a different amount of data is received from each process.

Example 5.8
Same as Example 5.7, but done in a different way at the sending end. We create a datatype that causes the correct striding at the sending end so that we read a column of a C array. A similar thing was done in Example 4.16, Section 4.1.14.

    MPI_Comm comm;
    int gsize, sendarray[100][150], *sptr;
    int root, *rbuf, stride, myrank;
    MPI_Datatype stype;
    int *displs, i, *rcounts;

    ...

    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    rbuf = (int *)malloc(gsize*stride*sizeof(int));
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        rcounts[i] = 100-i;
    }
    /* Create datatype for one int, with extent of entire row
     */
    MPI_Type_create_resized(MPI_INT, 0, 150*sizeof(int), &stype);
    MPI_Type_commit(&stype);
    sptr = &sendarray[0][myrank];

    MPI_Gatherv(sptr, 100-myrank, stype, rbuf, rcounts, displs, MPI_INT,
                root, comm);

Example 5.9
Same as Example 5.7 at sending side, but at receiving side we make the stride between received blocks vary from block to block. See Figure 5.8.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, *stride, myrank, bufsize;
    MPI_Datatype stype;
    int *displs,i,*rcounts,offset;

    ...

    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);

    stride = (int *)malloc(gsize*sizeof(int));
    ...
    /* stride[i] for i = 0 to gsize-1 is set somehow
     */

    /* set up displs and rcounts vectors first
     */
    displs = (int *)malloc(gsize*sizeof(int));
    rcounts = (int *)malloc(gsize*sizeof(int));
    offset = 0;
    for (i=0; i<gsize; ++i) {
        displs[i] = offset;
        offset += stride[i];
        rcounts[i] = 100-i;
    }
    /* the required buffer size for rbuf is now easily obtained
     */
    bufsize = displs[gsize-1]+rcounts[gsize-1];
    rbuf = (int *)malloc(bufsize*sizeof(int));
    /* Create datatype for the column we are sending
     */
    MPI_Type_vector(100-myrank, 1, 150, MPI_INT, &stype);
    MPI_Type_commit(&stype);
    sptr = &sendarray[0][myrank];
    MPI_Gatherv(sptr, 1, stype, rbuf, rcounts, displs, MPI_INT,
                root, comm);

Figure 5.8: The root process gathers 100-i ints from column i of a 100×150 C array, and each set is placed stride[i] ints apart (a varying stride).

Example 5.10
Process i sends num ints from the i-th column of a 100×150 int array, in C. The complicating factor is that the various values of num are not known to root, so a separate gather must first be run to find these out. The data is placed contiguously at the receiving end.

    MPI_Comm comm;
    int gsize,sendarray[100][150],*sptr;
    int root, *rbuf, myrank;
    MPI_Datatype stype;
    int *displs,i,*rcounts,num;

    ...

    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);

    /* First, gather nums to root
     */
    rcounts = (int *)malloc(gsize*sizeof(int));
    MPI_Gather(&num, 1, MPI_INT, rcounts, 1, MPI_INT, root, comm);
    /* root now has correct rcounts, using these we set displs[] so
     * that data is placed contiguously (or concatenated) at receive end
     */
    displs = (int *)malloc(gsize*sizeof(int));
    displs[0] = 0;
    for (i=1; i<gsize; ++i) {
        displs[i] = displs[i-1]+rcounts[i-1];
    }
    /* And, create receive buffer
     */
    rbuf = (int *)malloc((displs[gsize-1]+rcounts[gsize-1])*sizeof(int));
    /* Create datatype for one int, with extent of entire row
     */
    MPI_Type_create_resized(MPI_INT, 0, 150*sizeof(int), &stype);

    MPI_Type_commit(&stype);
    sptr = &sendarray[0][myrank];
    MPI_Gatherv(sptr, num, stype, rbuf, rcounts, displs, MPI_INT,
                root, comm);

5.6 Scatter

MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
  IN   sendbuf    address of send buffer (choice, significant only at root)
  IN   sendcount  number of elements sent to each process (non-negative integer, significant only at root)
  IN   sendtype   data type of send buffer elements (significant only at root) (handle)
  OUT  recvbuf    address of receive buffer (choice)
  IN   recvcount  number of elements in receive buffer (non-negative integer)
  IN   recvtype   data type of receive buffer elements (handle)
  IN   root       rank of sending process (integer)
  IN   comm       communicator (handle)

int MPI_Scatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
                MPI_Comm comm)

MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
            root, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount, root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
            ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

MPI_SCATTER is the inverse operation to MPI_GATHER.

If comm is an intracommunicator, the outcome is as if the root executed n send operations,

    MPI_Send(sendbuf + i·sendcount·extent(sendtype), sendcount, sendtype, i, ...),

and each process executed a receive,

    MPI_Recv(recvbuf, recvcount, recvtype, i, ...).

An alternative description is that the root sends a message with MPI_Send(sendbuf, sendcount·n, sendtype, ...). This message is split into n equal segments, the i-th segment is sent to the i-th process in the group, and each process receives this message as above.

The send buffer is ignored for all non-root processes.

The type signature associated with sendcount, sendtype at the root must be equal to the type signature associated with recvcount, recvtype at all processes (however, the type maps may be different). This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes, only arguments recvbuf, recvcount, recvtype, root, and comm are significant. The arguments root and comm must have identical values on all processes.

The specification of counts and types should not cause any location on the root to be read more than once.

Rationale. Though not needed, the last restriction is imposed so as to achieve symmetry with MPI_GATHER, where the corresponding restriction (a multiple-write restriction) is necessary. (End of rationale.)

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE as the value of recvbuf at the root. In such a case, recvcount and recvtype are ignored, and root "sends" no data to itself. The scattered vector is still assumed to contain n segments, where n is the group size; the root-th segment, which root should "send to itself," is not moved.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Data is scattered from the root to all processes in group B. The receive buffer arguments of the processes in group B must be consistent with the send buffer argument of the root.
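A hedged sketch of the "in place" option for MPI_SCATTER follows (not part of the standard text): the root keeps its own segment where it already is in sendbuf and passes MPI_IN_PLACE as recvbuf. The variable names are illustrative, and comm is assumed to be a valid intracommunicator.

    /* Sketch: MPI_Scatter with MPI_IN_PLACE at the root. */
    MPI_Comm comm;
    int gsize, myrank, root = 0;
    int *sendbuf = NULL, rbuf[100];
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    if (myrank == root) {
        sendbuf = (int *)malloc(gsize*100*sizeof(int));
        /* ... fill sendbuf; the root's own segment (segment number root)
         * simply stays where it is in sendbuf ... */
        /* recvcount and recvtype are ignored when MPI_IN_PLACE is used */
        MPI_Scatter(sendbuf, 100, MPI_INT, MPI_IN_PLACE, 100, MPI_INT,
                    root, comm);
    } else {
        /* sendbuf, sendcount, sendtype are significant only at the root */
        MPI_Scatter(NULL, 0, MPI_INT, rbuf, 100, MPI_INT, root, comm);
    }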

MPI_SCATTERV(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
  IN   sendbuf     address of send buffer (choice, significant only at root)
  IN   sendcounts  non-negative integer array (of length group size) specifying the number of elements to send to each rank
  IN   displs      integer array (of length group size). Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice)
  IN   recvcount   number of elements in receive buffer (non-negative integer)
  IN   recvtype    data type of receive buffer elements (handle)
  IN   root        rank of sending process (integer)
  IN   comm        communicator (handle)

int MPI_Scatterv(const void* sendbuf, const int sendcounts[], const
                 int displs[], MPI_Datatype sendtype, void* recvbuf,
                 int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount,
             recvtype, root, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcounts(*), displs(*), recvcount, root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT,
             RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT,
            COMM, IERROR

MPI_SCATTERV is the inverse operation to MPI_GATHERV.

MPI_SCATTERV extends the functionality of MPI_SCATTER by allowing a varying count of data to be sent to each process, since sendcounts is now an array. It also allows more flexibility as to where the data is taken from on the root, by providing an additional argument, displs.

If comm is an intracommunicator, the outcome is as if the root executed n send operations,

    MPI_Send(sendbuf + displs[i]·extent(sendtype), sendcounts[i], sendtype, i, ...),

and each process executed a receive,

    MPI_Recv(recvbuf, recvcount, recvtype, i, ...).

The send buffer is ignored for all non-root processes.

The type signature implied by sendcounts[i], sendtype at the root must be equal to the type signature implied by recvcount, recvtype at process i (however, the type maps may be different). This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes, only arguments recvbuf, recvcount, recvtype, root, and comm are significant. The arguments root and comm must have identical values on all processes.

The specification of counts, types, and displacements should not cause any location on the root to be read more than once.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE as the value of recvbuf at the root. In such a case, recvcount and recvtype are ignored, and root "sends" no data to itself. The scattered vector is still assumed to contain n segments, where n is the group size; the root-th segment, which root should "send to itself," is not moved.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Data is scattered from the root to all processes in group B. The receive buffer arguments of the processes in group B must be consistent with the send buffer argument of the root.

5.6.1 Examples using MPI_SCATTER, MPI_SCATTERV

The examples in this section use intracommunicators.

Example 5.11
The reverse of Example 5.2. Scatter sets of 100 ints from the root to each process in the group. See Figure 5.9.

    MPI_Comm comm;
    int gsize,*sendbuf;
    int root, rbuf[100];
    ...
    MPI_Comm_size(comm, &gsize);
    sendbuf = (int *)malloc(gsize*100*sizeof(int));
    ...
    MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

Example 5.12
The reverse of Example 5.5. The root process scatters sets of 100 ints to the other processes, but the sets of 100 are stride ints apart in the sending buffer. Requires use of MPI_SCATTERV. Assume stride ≥ 100. See Figure 5.10.

Figure 5.9: The root process scatters sets of 100 ints to each process in the group.

Figure 5.10: The root process scatters sets of 100 ints, moving by stride ints from send to send in the scatter.

    MPI_Comm comm;
    int gsize,*sendbuf;
    int root, rbuf[100], i, *displs, *scounts;

    ...

    MPI_Comm_size(comm, &gsize);
    sendbuf = (int *)malloc(gsize*stride*sizeof(int));
    ...
    displs = (int *)malloc(gsize*sizeof(int));
    scounts = (int *)malloc(gsize*sizeof(int));
    for (i=0; i<gsize; ++i) {
        displs[i] = i*stride;
        scounts[i] = 100;
    }
    MPI_Scatterv(sendbuf, scounts, displs, MPI_INT, rbuf, 100, MPI_INT,
                 root, comm);

Example 5.13
The reverse of Example 5.9. The root process scatters blocks of 100-i ints into column i of a 100×150 C array at process i; at the sending side, the blocks are stride[i] ints apart. See Figure 5.11.

    MPI_Comm comm;
    int gsize,recvarray[100][150],*rptr;

Figure 5.11: The root scatters blocks of 100-i ints into column i of a 100×150 C array. At the sending side, the blocks are stride[i] ints apart.

    int root, *sendbuf, myrank, *stride;
    MPI_Datatype rtype;
    int i, *displs, *scounts, offset;
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);

    stride = (int *)malloc(gsize*sizeof(int));
    ...
    /* stride[i] for i = 0 to gsize-1 is set somehow
     * sendbuf comes from elsewhere
     */
    ...
    displs = (int *)malloc(gsize*sizeof(int));
    scounts = (int *)malloc(gsize*sizeof(int));
    offset = 0;
    for (i=0; i<gsize; ++i) {
        displs[i] = offset;
        offset += stride[i];
        scounts[i] = 100 - i;
    }
    /* Create datatype for the column we are receiving
     */
    MPI_Type_vector(100-myrank, 1, 150, MPI_INT, &rtype);
    MPI_Type_commit(&rtype);
    rptr = &recvarray[0][myrank];
    MPI_Scatterv(sendbuf, scounts, displs, MPI_INT, rptr, 1, rtype,
                 root, comm);

5.7 Gather-to-all

MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
  IN   sendbuf    starting address of send buffer (choice)
  IN   sendcount  number of elements in send buffer (non-negative integer)
  IN   sendtype   data type of send buffer elements (handle)
  OUT  recvbuf    address of receive buffer (choice)
  IN   recvcount  number of elements received from any process (non-negative integer)
  IN   recvtype   data type of receive buffer elements (handle)
  IN   comm       communicator (handle)

int MPI_Allgather(const void* sendbuf, int sendcount,
                  MPI_Datatype sendtype, void* recvbuf, int recvcount,
                  MPI_Datatype recvtype, MPI_Comm comm)

MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
              comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
              COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLGATHER can be thought of as MPI_GATHER, but where all processes receive the result, instead of just the root. The block of data sent from the j-th process is received by every process and placed in the j-th block of the buffer recvbuf.

The type signature associated with sendcount, sendtype, at a process must be equal to the type signature associated with recvcount, recvtype at any other process.

If comm is an intracommunicator, the outcome of a call to MPI_ALLGATHER(...) is as if all processes executed n calls to

    MPI_Gather(sendbuf,sendcount,sendtype,recvbuf,recvcount,
               recvtype,root,comm)

for root = 0, ..., n-1. The rules for correct usage of MPI_ALLGATHER are easily found from the corresponding rules for MPI_GATHER.

The "in place" option for intracommunicators is specified by passing the value MPI_IN_PLACE to the argument sendbuf at all processes. sendcount and sendtype are ignored.

Then the input data of each process is assumed to be in the area where that process would receive its own contribution to the receive buffer.

If comm is an intercommunicator, then each process of one group (group A) contributes sendcount data items; these data are concatenated and the result is stored at each process in the other group (group B). Conversely the concatenation of the contributions of the processes in group B is stored at each process in group A. The send buffer arguments in group A must be consistent with the receive buffer arguments in group B, and vice versa.

Advice to users. The communication pattern of MPI_ALLGATHER executed on an intercommunication domain need not be symmetric. The number of items sent by processes in group A (as specified by the arguments sendcount, sendtype in group A and the arguments recvcount, recvtype in group B), need not equal the number of items sent by processes in group B (as specified by the arguments sendcount, sendtype in group B and the arguments recvcount, recvtype in group A). In particular, one can move data in only one direction by specifying sendcount = 0 for the communication in the reverse direction.

(End of advice to users.)

MPI_ALLGATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements in send buffer (non-negative integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice)
  IN   recvcounts  non-negative integer array (of length group size) containing the number of elements that are received from each process
  IN   displs      integer array (of length group size). Entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i
  IN   recvtype    data type of receive buffer elements (handle)
  IN   comm        communicator (handle)

int MPI_Allgatherv(const void* sendbuf, int sendcount,
                   MPI_Datatype sendtype, void* recvbuf, const int recvcounts[],
                   const int displs[], MPI_Datatype recvtype, MPI_Comm comm)

MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs,
               recvtype, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcounts(*), displs(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype

    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
               RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM,
            IERROR

MPI_ALLGATHERV can be thought of as MPI_GATHERV, but where all processes receive the result, instead of just the root. The block of data sent from the j-th process is received by every process and placed in the j-th block of the buffer recvbuf. These blocks need not all be the same size.

The type signature associated with sendcount, sendtype, at process j must be equal to the type signature associated with recvcounts[j], recvtype at any other process.

If comm is an intracommunicator, the outcome is as if all processes executed calls to

    MPI_Gatherv(sendbuf,sendcount,sendtype,recvbuf,recvcounts,displs,
                recvtype,root,comm),

for root = 0, ..., n-1. The rules for correct usage of MPI_ALLGATHERV are easily found from the corresponding rules for MPI_GATHERV.

The "in place" option for intracommunicators is specified by passing the value MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcount and sendtype are ignored, and the input data of each process is assumed to be in the area where that process would receive its own contribution to the receive buffer.

If comm is an intercommunicator, then each process of one group (group A) contributes sendcount data items; these data are concatenated and the result is stored at each process in the other group (group B). Conversely the concatenation of the contributions of the processes in group B is stored at each process in group A. The send buffer arguments in group A must be consistent with the receive buffer arguments in group B, and vice versa.

5.7.1 Example using MPI_ALLGATHER

The example in this section uses intracommunicators.

Example 5.14
The all-gather version of Example 5.2. Using MPI_ALLGATHER, we will gather 100 ints from every process in the group to every process.

    MPI_Comm comm;
    int gsize,sendarray[100];
    int *rbuf;
    ...
    MPI_Comm_size(comm, &gsize);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    MPI_Allgather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, comm);

After the call, every process has the group-wide concatenation of the sets of data.
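The following hedged variant of Example 5.14 (not part of the standard text) shows the "in place" form: each process is assumed to have written its own 100 ints directly into its block of rbuf, so sendbuf is MPI_IN_PLACE at every process. The helper compute_value is hypothetical, and comm is assumed to be a valid intracommunicator.

    /* Sketch: in-place MPI_Allgather. */
    MPI_Comm comm;
    int gsize, myrank, i;
    int *rbuf;
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    rbuf = (int *)malloc(gsize*100*sizeof(int));
    for (i = 0; i < 100; i++)
        rbuf[myrank*100 + i] = compute_value(i);  /* hypothetical helper filling this process's own block */
    /* sendcount and sendtype are ignored when MPI_IN_PLACE is used */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  rbuf, 100, MPI_INT, comm);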

5.8 All-to-All Scatter/Gather

MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
  IN   sendbuf    starting address of send buffer (choice)
  IN   sendcount  number of elements sent to each process (non-negative integer)
  IN   sendtype   data type of send buffer elements (handle)
  OUT  recvbuf    address of receive buffer (choice)
  IN   recvcount  number of elements received from any process (non-negative integer)
  IN   recvtype   data type of receive buffer elements (handle)
  IN   comm       communicator (handle)

int MPI_Alltoall(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                 void* recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm)

MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
             comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
             COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR

MPI_ALLTOALL is an extension of MPI_ALLGATHER to the case where each process sends distinct data to each of the receivers. The j-th block sent from process i is received by process j and is placed in the i-th block of recvbuf.

The type signature associated with sendcount, sendtype, at a process must be equal to the type signature associated with recvcount, recvtype at any other process. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. As usual, however, the type maps may be different.

If comm is an intracommunicator, the outcome is as if each process executed a send to each process (itself included) with a call to,

    MPI_Send(sendbuf + i·sendcount·extent(sendtype), sendcount, sendtype, i, ...),

and a receive from every other process with a call to,

    MPI_Recv(recvbuf + i·recvcount·extent(recvtype), recvcount, recvtype, i, ...).

All arguments on all processes are significant. The argument comm must have identical values on all processes.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcount and sendtype are ignored. The data to be sent is taken from the recvbuf and replaced by the received data. Data sent and received must have the same type map as specified by recvcount and recvtype.

Rationale. For large MPI_ALLTOALL instances, allocating both send and receive buffers may consume too much memory. The "in place" option effectively halves the application memory consumption and is useful in situations where the data to be sent will not be used by the sending process after the MPI_ALLTOALL exchange (e.g., in parallel Fast Fourier Transforms). (End of rationale.)

Advice to implementors. Users may opt to use the "in place" option in order to conserve memory. Quality MPI implementations should thus strive to minimize system buffering. (End of advice to implementors.)

If comm is an intercommunicator, then the outcome is as if each process in group A sends a message to each process in group B, and vice versa. The j-th send buffer of process i in group A should be consistent with the i-th receive buffer of process j in group B, and vice versa.

Advice to users. When a complete exchange is executed on an intercommunication domain, then the number of data items sent from processes in group A to processes in group B need not equal the number of items sent in the reverse direction. In particular, one can have unidirectional communication by specifying sendcount = 0 in the reverse direction.

(End of advice to users.)
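A short sketch (not part of the standard text) of a regular complete exchange: each process sends one distinct int to every process, itself included. The variable names are illustrative assumptions, and comm is assumed to be a valid intracommunicator.

    /* Sketch: MPI_Alltoall with one int per destination. */
    MPI_Comm comm;
    int gsize, myrank, i;
    int *sendbuf, *recvbuf;
    ...
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &myrank);
    sendbuf = (int *)malloc(gsize*sizeof(int));
    recvbuf = (int *)malloc(gsize*sizeof(int));
    for (i = 0; i < gsize; i++)
        sendbuf[i] = 100*myrank + i;   /* block i goes to process i */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, comm);
    /* now recvbuf[i] holds the block that process i sent to this process,
     * i.e., 100*i + myrank */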

MPI_ALLTOALLV(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
IN   sendbuf     starting address of send buffer (choice)
IN   sendcounts  non-negative integer array (of length group size) specifying the number of elements to send to each rank
IN   sdispls     integer array (of length group size). Entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j
IN   sendtype    data type of send buffer elements (handle)
OUT  recvbuf     address of receive buffer (choice)
IN   recvcounts  non-negative integer array (of length group size) specifying the number of elements that can be received from each rank
IN   rdispls     integer array (of length group size). Entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i
IN   recvtype    data type of receive buffer elements (handle)
IN   comm        communicator (handle)

int MPI_Alltoallv(const void* sendbuf, const int sendcounts[], const
                  int sdispls[], MPI_Datatype sendtype, void* recvbuf, const
                  int recvcounts[], const int rdispls[], MPI_Datatype recvtype,
                  MPI_Comm comm)

MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts,
              rdispls, recvtype, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcounts(*), sdispls(*), recvcounts(*),
        rdispls(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS,
              RDISPLS, RECVTYPE, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*),
        RECVTYPE, COMM, IERROR

MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of data for the send is specified by sdispls and the location of the placement of the data on the receive side is specified by rdispls.

If comm is an intracommunicator, then the j-th block sent from process i is received by process j and is placed in the i-th block of recvbuf. These blocks need not all have the same size.

The type signature associated with sendcounts[j], sendtype at process i must be equal to the type signature associated with recvcounts[i], recvtype at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.

The outcome is as if each process sent a message to every other process with,

    MPI_Send(sendbuf + sdispls[i] · extent(sendtype), sendcounts[i], sendtype, i, ...),

and received a message from every other process with a call to

    MPI_Recv(recvbuf + rdispls[i] · extent(recvtype), recvcounts[i], recvtype, i, ...).

All arguments on all processes are significant. The argument comm must have identical values on all processes.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcounts, sdispls and sendtype are ignored. The data to be sent is taken from the recvbuf and replaced by the received data. Data sent and received must have the same type map as specified by the recvcounts array and the recvtype, and is taken from the locations of the receive buffer specified by rdispls.

Advice to users. Specifying the "in place" option (which must be given on all processes) implies that the same amount and type of data is sent and received between any two processes in the group of the communicator. Different pairs of processes can exchange different amounts of data. Users must ensure that recvcounts[j] and recvtype on process i match recvcounts[i] and recvtype on process j. This symmetric exchange can be useful in applications where the data to be sent will not be used by the sending process after the MPI_ALLTOALLV exchange. (End of advice to users.)

If comm is an intercommunicator, then the outcome is as if each process in group A sends a message to each process in group B, and vice versa. The j-th send buffer of process i in group A should be consistent with the i-th receive buffer of process j in group B, and vice versa.

Rationale. The definitions of MPI_ALLTOALL and MPI_ALLTOALLV give as much flexibility as one would achieve by specifying n independent, point-to-point communications, with two exceptions: all messages use the same datatype, and messages are scattered from (or gathered to) sequential storage. (End of rationale.)

Advice to implementors. Although the discussion of collective communication in terms of point-to-point operation implies that each message is transferred directly from sender to receiver, implementations may use a tree communication pattern. Messages can be forwarded by intermediate nodes where they are split (for scatter) or concatenated (for gather), if this is more efficient. (End of advice to implementors.)
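As an illustration of how the counts and displacements must match pairwise, the following sketch (not one of the standard's numbered examples; all variable names are illustrative) has each process send rank+1 integers to every destination, so that every process receives i+1 integers from source i. The blocks are packed contiguously, which is reflected in the displacement arrays.

#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, i, stotal = 0, rtotal = 0;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *sendcounts = malloc(size * sizeof(int));
    int *recvcounts = malloc(size * sizeof(int));
    int *sdispls    = malloc(size * sizeof(int));
    int *rdispls    = malloc(size * sizeof(int));

    /* this process sends rank+1 elements to every destination j, and
       therefore receives i+1 elements from every source i */
    for (i = 0; i < size; ++i) {
        sendcounts[i] = rank + 1;
        recvcounts[i] = i + 1;
        sdispls[i] = stotal;      /* blocks packed back to back */
        rdispls[i] = rtotal;
        stotal += sendcounts[i];
        rtotal += recvcounts[i];
    }

    int *sendbuf = malloc(stotal * sizeof(int));
    int *recvbuf = malloc(rtotal * sizeof(int));
    for (i = 0; i < stotal; ++i)
        sendbuf[i] = rank;        /* tag outgoing data with the sender's rank */

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, comm);
    /* recvbuf now holds, starting at rdispls[i], i+1 copies of the
       value i received from process i */

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}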

MPI_ALLTOALLW(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm)
IN   sendbuf     starting address of send buffer (choice)
IN   sendcounts  non-negative integer array (of length group size) specifying the number of elements to send to each rank
IN   sdispls     integer array (of length group size). Entry j specifies the displacement in bytes (relative to sendbuf) from which to take the outgoing data destined for process j (array of integers)
IN   sendtypes   array of datatypes (of length group size). Entry j specifies the type of data to send to process j (array of handles)
OUT  recvbuf     address of receive buffer (choice)
IN   recvcounts  non-negative integer array (of length group size) specifying the number of elements that can be received from each rank
IN   rdispls     integer array (of length group size). Entry i specifies the displacement in bytes (relative to recvbuf) at which to place the incoming data from process i (array of integers)
IN   recvtypes   array of datatypes (of length group size). Entry i specifies the type of data received from process i (array of handles)
IN   comm        communicator (handle)

int MPI_Alltoallw(const void* sendbuf, const int sendcounts[], const
                  int sdispls[], const MPI_Datatype sendtypes[], void* recvbuf,
                  const int recvcounts[], const int rdispls[], const
                  MPI_Datatype recvtypes[], MPI_Comm comm)

MPI_Alltoallw(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts,
              rdispls, recvtypes, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: sendcounts(*), sdispls(*), recvcounts(*),
        rdispls(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtypes(*)
    TYPE(MPI_Datatype), INTENT(IN) :: recvtypes(*)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLTOALLW(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPES, RECVBUF, RECVCOUNTS,
              RDISPLS, RECVTYPES, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPES(*), RECVCOUNTS(*),
        RDISPLS(*), RECVTYPES(*), COMM, IERROR

MPI_ALLTOALLW is the most general form of complete exchange. Like MPI_TYPE_CREATE_STRUCT, the most general type constructor, MPI_ALLTOALLW allows separate specification of count, displacement and datatype. In addition, to allow maximum flexibility, the displacement of blocks within the send and receive buffers is specified in bytes.

If comm is an intracommunicator, then the j-th block sent from process i is received by process j and is placed in the i-th block of recvbuf. These blocks need not all have the same size.

The type signature associated with sendcounts[j], sendtypes[j] at process i must be equal to the type signature associated with recvcounts[i], recvtypes[i] at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.

The outcome is as if each process sent a message to every other process with

    MPI_Send(sendbuf + sdispls[i], sendcounts[i], sendtypes[i], i, ...),

and received a message from every other process with a call to

    MPI_Recv(recvbuf + rdispls[i], recvcounts[i], recvtypes[i], i, ...).

All arguments on all processes are significant. The argument comm must describe the same communicator on all processes.

Like for MPI_ALLTOALLV, the "in place" option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcounts, sdispls and sendtypes are ignored. The data to be sent is taken from the recvbuf and replaced by the received data. Data sent and received must have the same type map as specified by the recvcounts and recvtypes arrays, and is taken from the locations of the receive buffer specified by rdispls.

If comm is an intercommunicator, then the outcome is as if each process in group A sends a message to each process in group B, and vice versa. The j-th send buffer of process i in group A should be consistent with the i-th receive buffer of process j in group B, and vice versa.

Rationale. The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts[i] = 0, this achieves an MPI_SCATTERW function. (End of rationale.)

5.9 Global Reduction Operations

The functions in this section perform a global reduce operation (for example sum, maximum, and logical and) across all members of a group. The reduction operation can be either one of a predefined list of operations, or a user-defined operation. The global reduction functions come in several flavors: a reduce that returns the result of the reduction to one member of a group, an all-reduce that returns this result to all members of a group, and two scan (parallel prefix) operations. In addition, a reduce-scatter operation combines the functionality of a reduce and of a scatter operation.

5.9.1 Reduce

MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)
IN   sendbuf   address of send buffer (choice)
OUT  recvbuf   address of receive buffer (choice, significant only at root)
IN   count     number of elements in send buffer (non-negative integer)
IN   datatype  data type of elements of send buffer (handle)
IN   op        reduce operation (handle)
IN   root      rank of root process (integer)
IN   comm      communicator (handle)

int MPI_Reduce(const void* sendbuf, void* recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm, ierror)
    BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: count, root
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

If comm is an intracommunicator, MPI_REDUCE combines the elements provided in the input buffer of each process in the group, using the operation op, and returns the combined value in the output buffer of the process with rank root. The input buffer is defined by the arguments sendbuf, count and datatype; the output buffer is defined by the arguments recvbuf, count and datatype; both have the same number of elements, with the same type. The routine is called by all group members using the same arguments for count, datatype, op, root and comm. Thus, all processes provide input buffers of the same length, with elements of the same type as the output buffer at the root. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence. For example, if the operation is MPI_MAX and the send buffer contains two elements that are floating point numbers (count = 2 and datatype = MPI_FLOAT), then recvbuf(1) = global max(sendbuf(1)) and recvbuf(2) = global max(sendbuf(2)).

Section 5.9.2 lists the set of predefined operations provided by MPI. That section also enumerates the datatypes to which each operation can be applied.

In addition, users may define their own operations that can be overloaded to operate on several datatypes, either basic or derived. This is further explained in Section 5.9.5.

The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The "canonical" evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition.

Advice to implementors. It is strongly recommended that MPI_REDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of ranks. (End of advice to implementors.)

Advice to users. Some applications may not be able to ignore the non-associative nature of floating-point operations or may use user-defined operations (see Section 5.9.5) that require a special reduction order and cannot be treated as associative. Such applications should enforce the order of evaluation explicitly. For example, in the case of operations that require a strict left-to-right (or right-to-left) evaluation order, this could be done by gathering all operands at a single process (e.g., with MPI_GATHER), applying the reduction operation in the desired order (e.g., with MPI_REDUCE_LOCAL), and if needed, broadcast or scatter the result to the other processes (e.g., with MPI_BCAST). (End of advice to users.)

The datatype argument of MPI_REDUCE must be compatible with op. Predefined operators work only with the MPI types listed in Section 5.9.2 and Section 5.9.4. Furthermore, the datatype and op given for predefined operators must be the same on all processes.

Note that it is possible for users to supply different user-defined operations to MPI_REDUCE in each process. MPI does not define which operations are used on which operands in this case. User-defined operators may operate on general, derived datatypes. In this case, each argument that the reduce operation is applied to is one element described by such a datatype, which may contain several basic values. This is further explained in Section 5.9.5.

Advice to users. Users should make no assumptions about how MPI_REDUCE is implemented. It is safest to ensure that the same function is passed to MPI_REDUCE by each process. (End of advice to users.)

Overlapping datatypes are permitted in "send" buffers. Overlapping datatypes in "receive" buffers are erroneous and may give unpredictable results.

The "in place" option for intracommunicators is specified by passing the value MPI_IN_PLACE to the argument sendbuf at the root. In such a case, the input data is taken at the root from the receive buffer, where it will be replaced by the output data.

If comm is an intercommunicator, then the call involves all processes in the intercommunicator, but with one group (group A) defining the root process. All processes in the other group (group B) pass the same value in argument root, which is the rank of the root in group A. The root passes the value MPI_ROOT in root. All other processes in group A pass the value MPI_PROC_NULL in root. Only send buffer arguments are significant in group B and only receive buffer arguments are significant at the root.
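The intercommunicator root convention just described can be sketched in a few lines of C. This sketch is not one of the standard's numbered examples; the arguments intercomm, in_group_A, and local_rank are assumed to be supplied by the caller (e.g., obtained from MPI_Intercomm_create and MPI_Comm_rank), and the choice of rank 0 of group A as root is illustrative.

#include "mpi.h"

void intercomm_reduce_sum(MPI_Comm intercomm, int in_group_A, int local_rank)
{
    int contribution = local_rank;   /* send buffer: significant only in group B */
    int result = 0;                  /* receive buffer: significant only at the root */
    int root;

    if (in_group_A) {
        /* the root itself passes MPI_ROOT; every other process in the
           root's group passes MPI_PROC_NULL */
        root = (local_rank == 0) ? MPI_ROOT : MPI_PROC_NULL;
    } else {
        /* processes in group B pass the rank of the root within group A */
        root = 0;
    }

    /* the sum of the contributions of group B is returned in "result"
       at rank 0 of group A only */
    MPI_Reduce(&contribution, &result, 1, MPI_INT, MPI_SUM, root, intercomm);
}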

5.9.2 Predefined Reduction Operations

The following predefined operations are supplied for MPI_REDUCE and related functions MPI_ALLREDUCE, MPI_REDUCE_SCATTER, MPI_REDUCE_SCATTER_BLOCK, MPI_SCAN, MPI_EXSCAN, all nonblocking variants of those (see Section 5.12), and MPI_REDUCE_LOCAL. These operations are invoked by placing the following in op.

Name         Meaning
MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical and
MPI_BAND     bit-wise and
MPI_LOR      logical or
MPI_BOR      bit-wise or
MPI_LXOR     logical exclusive or (xor)
MPI_BXOR     bit-wise exclusive or (xor)
MPI_MAXLOC   max value and location
MPI_MINLOC   min value and location

The two operations MPI_MINLOC and MPI_MAXLOC are discussed separately in Section 5.9.4. For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments. First, define groups of MPI basic datatypes in the following way.

C integer:        MPI_INT, MPI_LONG, MPI_SHORT,
                  MPI_UNSIGNED_SHORT, MPI_UNSIGNED,
                  MPI_UNSIGNED_LONG,
                  MPI_LONG_LONG_INT,
                  MPI_LONG_LONG (as synonym),
                  MPI_UNSIGNED_LONG_LONG,
                  MPI_SIGNED_CHAR,
                  MPI_UNSIGNED_CHAR,
                  MPI_INT8_T, MPI_INT16_T,
                  MPI_INT32_T, MPI_INT64_T,
                  MPI_UINT8_T, MPI_UINT16_T,
                  MPI_UINT32_T, MPI_UINT64_T
Fortran integer:  MPI_INTEGER,
                  and handles returned from
                  MPI_TYPE_CREATE_F90_INTEGER,
                  and if available: MPI_INTEGER1,
                  MPI_INTEGER2, MPI_INTEGER4,
                  MPI_INTEGER8, MPI_INTEGER16
Floating point:   MPI_FLOAT, MPI_DOUBLE, MPI_REAL,
                  MPI_DOUBLE_PRECISION,
                  MPI_LONG_DOUBLE,
                  and handles returned from

                  MPI_TYPE_CREATE_F90_REAL,
                  and if available: MPI_REAL2,
                  MPI_REAL4, MPI_REAL8, MPI_REAL16
Logical:          MPI_LOGICAL, MPI_C_BOOL,
                  MPI_CXX_BOOL
Complex:          MPI_COMPLEX, MPI_C_COMPLEX,
                  MPI_C_FLOAT_COMPLEX (as synonym),
                  MPI_C_DOUBLE_COMPLEX,
                  MPI_C_LONG_DOUBLE_COMPLEX,
                  MPI_CXX_FLOAT_COMPLEX,
                  MPI_CXX_DOUBLE_COMPLEX,
                  MPI_CXX_LONG_DOUBLE_COMPLEX,
                  and handles returned from
                  MPI_TYPE_CREATE_F90_COMPLEX,
                  and if available: MPI_DOUBLE_COMPLEX,
                  MPI_COMPLEX4, MPI_COMPLEX8,
                  MPI_COMPLEX16, MPI_COMPLEX32
Byte:             MPI_BYTE
Multi-language types:  MPI_AINT, MPI_OFFSET, MPI_COUNT

Now, the valid datatypes for each operation are specified below.

Op                             Allowed Types
MPI_MAX, MPI_MIN               C integer, Fortran integer, Floating point, Multi-language types
MPI_SUM, MPI_PROD              C integer, Fortran integer, Floating point, Complex, Multi-language types
MPI_LAND, MPI_LOR, MPI_LXOR    C integer, Logical
MPI_BAND, MPI_BOR, MPI_BXOR    C integer, Fortran integer, Byte, Multi-language types

These operations together with all listed datatypes are valid in all supported programming languages, see also Reduce Operations on page 650 in Section 17.2.6.

The following examples use intracommunicators.

Example 5.15
A routine that computes the dot product of two vectors that are distributed across a group of processes and returns the answer at node zero.

SUBROUTINE PAR_BLAS1(m, a, b, c, comm)
REAL a(m), b(m)       ! local slice of array
REAL c                ! result (at node zero)
REAL sum
INTEGER m, comm, i, ierr

! local sum
sum = 0.0
DO i = 1, m
   sum = sum + a(i)*b(i)
END DO

! global sum
CALL MPI_REDUCE(sum, c, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
RETURN
END

Example 5.16
A routine that computes the product of a vector and an array that are distributed across a group of processes and returns the answer at node zero.

SUBROUTINE PAR_BLAS2(m, n, a, b, c, comm)
REAL a(m), b(m,n)     ! local slice of array
REAL c(n)             ! result
REAL sum(n)
INTEGER n, comm, i, j, ierr

! local sum
DO j= 1, n
   sum(j) = 0.0
   DO i = 1, m
      sum(j) = sum(j) + a(i)*b(i,j)
   END DO
END DO

! global sum
CALL MPI_REDUCE(sum, c, n, MPI_REAL, MPI_SUM, 0, comm, ierr)

! return result at node zero (and garbage at the other nodes)
RETURN
END

5.9.3 Signed Characters and Reductions

The types MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR can be used in reduction operations. MPI_CHAR, MPI_WCHAR, and MPI_CHARACTER (which represent printable characters) cannot be used in reduction operations. In a heterogeneous environment, MPI_CHAR, MPI_WCHAR, and MPI_CHARACTER will be translated so as to preserve the printable character, whereas MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR will be translated so as to preserve the integer value.

Advice to users. The types MPI_CHAR, MPI_WCHAR, and MPI_CHARACTER are intended for characters, and so will be translated to preserve the printable representation, rather than the integer value, if sent between machines with different character codes. The types MPI_SIGNED_CHAR and MPI_UNSIGNED_CHAR should be used in C if the integer value should be preserved. (End of advice to users.)

5.9.4 MINLOC and MAXLOC

The operator MPI_MINLOC is used to compute a global minimum and also an index attached to the minimum value. MPI_MAXLOC similarly computes a global maximum and index. One application of these is to compute a global minimum (maximum) and the rank of the process containing this value.

The operation that defines MPI_MAXLOC is:

    (u, i) ∘ (v, j) = (w, k)

where

    w = max(u, v)

and

    k = i          if u > v
      = min(i, j)  if u = v
      = j          if u < v

MPI_MINLOC is defined similarly:

    (u, i) ∘ (v, j) = (w, k)

where

    w = min(u, v)

and

    k = i          if u < v
      = min(i, j)  if u = v
      = j          if u > v

Both operations are associative and commutative. Note that if MPI_MAXLOC is applied to reduce a sequence of pairs (u_0, 0), (u_1, 1), ..., (u_{n-1}, n-1), then the value returned is (u, r), where u = max_i u_i and r is the index of the first global maximum in the sequence. Thus, if each process supplies a value and its rank within the group, then a reduce operation with op = MPI_MAXLOC will return the maximum value and the rank of the first process with that value. Similarly, MPI_MINLOC can be used to return a minimum and its index. More generally, MPI_MINLOC computes a lexicographic minimum, where elements are ordered

according to the first component of each pair, and ties are resolved according to the second component.

The reduce operation is defined to operate on arguments that consist of a pair: value and index. For both Fortran and C, types are provided to describe the pair. The potentially mixed-type nature of such arguments is a problem in Fortran. The problem is circumvented, for Fortran, by having the MPI-provided type consist of a pair of the same type as value, and coercing the index to this type also. In C, the MPI-provided pair type has distinct types and the index is an int.

In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, one must provide a datatype argument that represents a pair (value and index). MPI provides nine such predefined datatypes. The operations MPI_MAXLOC and MPI_MINLOC can be used with each of the following datatypes.

Fortran:
Name                     Description
MPI_2REAL                pair of REALs
MPI_2DOUBLE_PRECISION    pair of DOUBLE PRECISION variables
MPI_2INTEGER             pair of INTEGERs

C:
Name                     Description
MPI_FLOAT_INT            float and int
MPI_DOUBLE_INT           double and int
MPI_LONG_INT             long and int
MPI_2INT                 pair of int
MPI_SHORT_INT            short and int
MPI_LONG_DOUBLE_INT      long double and int

The datatype MPI_2REAL is as if defined by the following (see Section 4.1).

MPI_TYPE_CONTIGUOUS(2, MPI_REAL, MPI_2REAL)

Similar statements apply for MPI_2INTEGER, MPI_2DOUBLE_PRECISION, and MPI_2INT.

The datatype MPI_FLOAT_INT is as if defined by the following sequence of instructions.

type[0] = MPI_FLOAT
type[1] = MPI_INT
disp[0] = 0
disp[1] = sizeof(float)
block[0] = 1
block[1] = 1
MPI_TYPE_CREATE_STRUCT(2, block, disp, type, MPI_FLOAT_INT)

Similar statements apply for MPI_LONG_INT and MPI_DOUBLE_INT.

The following examples use intracommunicators.

Example 5.17
Each process has an array of 30 doubles, in C. For each of the 30 locations, compute the value and rank of the process containing the largest value.

211 5.9. GLOBAL REDUCTION OPERATIONS 181 1 ... 2 /* each process has an array of 30 double: ain[30] 3 */ 4 double ain[30], aout[30]; 5 int ind[30]; 6 struct { 7 double val; 8 int rank; 9 } in[30], out[30]; 10 int i, myrank, root; 11 12 MPI_Comm_rank(comm, &myrank); 13 for (i=0; i<30; ++i) { 14 in[i].val = ain[i]; 15 in[i].rank = myrank; 16 } 17 MPI_Reduce(in, out, 30, MPI_DOUBLE_INT, MPI_MAXLOC, root, comm); 18 /* At this point, the answer resides on process root 19 */ 20 if (myrank == root) { 21 /* read ranks out 22 */ 23 for (i=0; i<30; ++i) { 24 aout[i] = out[i].val; 25 ind[i] = out[i].rank; 26 } 27 } 28 29 Example 5.18 30 Same example, in Fortran. 31 32 ... 33 ! each process has an array of 30 double: ain(30) 34 35 DOUBLE PRECISION ain(30), aout(30) 36 INTEGER ind(30) 37 DOUBLE PRECISION in(2,30), out(2,30) 38 INTEGER i, myrank, root, ierr 39 40 CALL MPI_COMM_RANK(comm, myrank, ierr) 41 DO I=1, 30 42 in(1,i) = ain(i) 43 in(2,i) = myrank ! myrank is coerced to a double 44 END DO 45 46 CALL MPI_REDUCE(in, out, 30, MPI_2DOUBLE_PRECISION, MPI_MAXLOC, root, 47 comm, ierr) 48

212 182 CHAPTER 5. COLLECTIVE COMMUNICATION 1 ! At this point, the answer resides on process root 2 3 IF (myrank .EQ. root) THEN 4 ! read ranks out 5 DO I= 1, 30 6 aout(i) = out(1,i) 7 ind(i) = out(2,i) ! rank is coerced back to an integer 8 END DO 9 END IF 10 11 Example 5.19 12 Each process has a non-empty array of values. Find the minimum global value, the 13 rank of the process that holds it and its index on this process. 14 15 #define LEN 1000 16 17 float val[LEN]; /* local array of values */ 18 int count; /* local number of values */ 19 int myrank, minrank, minindex; 20 float minval; 21 22 struct { 23 float value; 24 int index; 25 } in, out; 26 27 /* local minloc */ 28 in.value = val[0]; 29 in.index = 0; 30 for (i=1; i < count; i++) 31 if (in.value > val[i]) { 32 in.value = val[i]; 33 in.index = i; 34 } 35 36 /* global minloc */ 37 MPI_Comm_rank(comm, &myrank); 38 in.index = myrank*LEN + in.index; 39 MPI_Reduce( &in, &out, 1, MPI_FLOAT_INT, MPI_MINLOC, root, comm ); 40 /* At this point, the answer resides on process root 41 */ 42 if (myrank == root) { 43 /* read answer out 44 */ 45 minval = out.value; 46 minrank = out.index / LEN; 47 minindex = out.index % LEN; 48 }

Rationale. The definition of MPI_MINLOC and MPI_MAXLOC given here has the advantage that it does not require any special-case handling of these two operations: they are handled like any other reduce operation. A programmer can provide his or her own definition of MPI_MAXLOC and MPI_MINLOC, if so desired. The disadvantage is that values and indices have to be first interleaved, and that indices and values have to be coerced to the same type, in Fortran. (End of rationale.)

5.9.5 User-Defined Reduction Operations

MPI_OP_CREATE(user_fn, commute, op)
IN   user_fn   user defined function (function)
IN   commute   true if commutative; false otherwise.
OUT  op        operation (handle)

int MPI_Op_create(MPI_User_function* user_fn, int commute, MPI_Op* op)

MPI_Op_create(user_fn, commute, op, ierror) BIND(C)
    PROCEDURE(MPI_User_function) :: user_fn
    LOGICAL, INTENT(IN) :: commute
    TYPE(MPI_Op), INTENT(OUT) :: op
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_OP_CREATE(USER_FN, COMMUTE, OP, IERROR)
    EXTERNAL USER_FN
    LOGICAL COMMUTE
    INTEGER OP, IERROR

MPI_OP_CREATE binds a user-defined reduction operation to an op handle that can subsequently be used in MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER, MPI_REDUCE_SCATTER_BLOCK, MPI_SCAN, MPI_EXSCAN, all nonblocking variants of those (see Section 5.12), and MPI_REDUCE_LOCAL. The user-defined operation is assumed to be associative. If commute = true, then the operation should be both commutative and associative. If commute = false, then the order of operands is fixed and is defined to be in ascending, process rank order, beginning with process zero. The order of evaluation can be changed, taking advantage of the associativity of the operation. If commute = true then the order of evaluation can be changed, taking advantage of commutativity and associativity.

The argument user_fn is the user-defined function, which must have the following four arguments: invec, inoutvec, len and datatype.

The ISO C prototype for the function is the following.

typedef void MPI_User_function(void* invec, void* inoutvec, int *len,
                               MPI_Datatype *datatype);

The Fortran declarations of the user-defined function user_fn appear below.

ABSTRACT INTERFACE
SUBROUTINE MPI_User_function(invec, inoutvec, len, datatype) BIND(C)
    USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR

    TYPE(C_PTR), VALUE :: invec, inoutvec
    INTEGER :: len
    TYPE(MPI_Datatype) :: datatype

SUBROUTINE USER_FUNCTION(INVEC, INOUTVEC, LEN, DATATYPE)
    <type> INVEC(LEN), INOUTVEC(LEN)
    INTEGER LEN, DATATYPE

The datatype argument is a handle to the data type that was passed into the call to MPI_REDUCE. The user reduce function should be written such that the following holds: Let u[0], ... , u[len-1] be the len elements in the communication buffer described by the arguments invec, len and datatype when the function is invoked; let v[0], ... , v[len-1] be the len elements in the communication buffer described by the arguments inoutvec, len and datatype when the function is invoked; let w[0], ... , w[len-1] be the len elements in the communication buffer described by the arguments inoutvec, len and datatype when the function returns; then w[i] = u[i] ∘ v[i], for i = 0, ... , len-1, where ∘ is the reduce operation that the function computes.

Informally, we can think of invec and inoutvec as arrays of len elements that user_fn is combining. The result of the reduction over-writes values in inoutvec, hence the name. Each invocation of the function results in the pointwise evaluation of the reduce operator on len elements: i.e., the function returns in inoutvec[i] the value invec[i] ∘ inoutvec[i], for i = 0, ..., count-1, where ∘ is the combining operation computed by the function.

Rationale. The len argument allows MPI_REDUCE to avoid calling the function for each element in the input buffer. Rather, the system can choose to apply the function to chunks of input. In C, it is passed in as a reference for reasons of compatibility with Fortran.

By internally comparing the value of the datatype argument to known, global handles, it is possible to overload the use of a single user-defined function for several, different data types. (End of rationale.)

General datatypes may be passed to the user function. However, use of datatypes that are not contiguous is likely to lead to inefficiencies.

No MPI communication function may be called inside the user function. MPI_ABORT may be called inside the function in case of an error.

Advice to users. Suppose one defines a library of user-defined reduce functions that are overloaded: the datatype argument is used to select the right execution path at each invocation, according to the types of the operands. The user-defined reduce function cannot "decode" the datatype argument that it is passed, and cannot identify, by itself, the correspondence between the datatype handles and the datatype they represent. This correspondence was established when the datatypes were created. Before the library is used, a library initialization preamble must be executed. This preamble code will define the datatypes that are used by the library, and store handles to these datatypes in global, static variables that are shared by the user code and the library code.

The Fortran version of MPI_REDUCE will invoke a user-defined reduce function using the Fortran calling conventions and will pass a Fortran-type datatype argument; the C version will use C calling convention and the C representation of a datatype handle.

215 5.9. GLOBAL REDUCTION OPERATIONS 185 1 Users who plan to mix languages should define their reduction functions accordingly. 2 End of advice to users. ( ) 3 4 We outline below a naive and inefficient implementation of Advice to implementors. 5 REDUCE not supporting the “in place” option. MPI _ 6 7 MPI_Comm_size(comm, &groupsize); 8 MPI_Comm_rank(comm, &rank); 9 if (rank > 0) { 10 MPI_Recv(tempbuf, count, datatype, rank-1,...); 11 User_reduce(tempbuf, sendbuf, count, datatype); 12 } 13 if (rank < groupsize-1) { 14 MPI_Send(sendbuf, count, datatype, rank+1, ...); 15 } 16 /* answer now resides in process groupsize-1 ... now send to root 17 */ 18 if (rank == root) { 19 MPI_Irecv(recvbuf, count, datatype, groupsize-1,..., &req); 20 } 21 if (rank == groupsize-1) { 22 MPI_Send(sendbuf, count, datatype, root, ...); 23 } 24 if (rank == root) { 25 MPI_Wait(&req, &status); 26 } 27 28 0 to process The reduction computation proceeds, sequentially, from process 29 . This order is chosen so as to respect the order of a possibly non- groupsize-1 30 commutative operator defined by the function User_reduce() . A more efficient im- 31 plementation is achieved by taking advantage of associativity and using a logarithmic 32 tree reduction. Commutativity can be used to advantage, for those cases in which 33 CREATE MPI argument to commute the is true. Also, the amount of temporary _ OP _ 34 buffer required can be reduced, and communication can be pipelined with computa- 35 < count len . tion, by transferring and reducing the elements in chunks of size 36 The predefined reduce operations can be implemented as a library of user-defined 37 _ REDUCE handles operations. However, better performance might be achieved if MPI 38 these functions as a special case. ( End of advice to implementors. ) 39 40 41 42 _ MPI OP FREE(op) _ 43 INOUT op operation (handle) 44 45 int MPI_Op_free(MPI_Op *op) 46 47 MPI_Op_free(op, ierror) BIND(C) 48 TYPE(MPI_Op), INTENT(INOUT) :: op

216 CHAPTER 5. COLLECTIVE COMMUNICATION 186 1 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 2 MPI_OP_FREE(OP, IERROR) 3 INTEGER OP, IERROR 4 5 _ NULL . to _ MPI Marks a user-defined reduction operation for deallocation and sets OP op 6 7 Example of User-defined Reduce 8 It is time for an example of user-defined reduction. The example in this section uses an 9 intracommunicator. 10 11 Example 5.20 Compute the product of an array of complex numbers, in C. 12 13 typedef struct { 14 double real,imag; 15 } Complex; 16 17 /* the user-defined function 18 */ 19 void myProd(void *inP, void *inoutP, int *len, MPI_Datatype *dptr) 20 { 21 int i; 22 Complex c; 23 Complex *in = (Complex *)inP, *inout = (Complex *)inoutP; 24 25 for (i=0; i< *len; ++i) { 26 c.real = inout->real*in->real - 27 inout->imag*in->imag; 28 c.imag = inout->real*in->imag + 29 inout->imag*in->real; 30 *inout = c; 31 in++; inout++; 32 } 33 } 34 35 /* and, to call it... 36 */ 37 ... 38 39 /* each process has an array of 100 Complexes 40 */ 41 Complex a[100], answer[100]; 42 MPI_Op myOp; 43 MPI_Datatype ctype; 44 45 /* explain to MPI how type Complex is defined 46 */ 47 MPI_Type_contiguous(2, MPI_DOUBLE, &ctype); 48 MPI_Type_commit(&ctype);

217 5.9. GLOBAL REDUCTION OPERATIONS 187 1 /* create the complex-product user-op 2 */ 3 MPI_Op_create( myProd, 1, &myOp ); 4 5 MPI_Reduce(a, answer, 100, ctype, myOp, root, comm); 6 7 /* At this point, the answer, which consists of 100 Complexes, 8 * resides on process root 9 */ 10 11 Example 5.21 MPI_User_function interface of the Fortran . How to use the mpi_f08 12 13 subroutine my_user_function( invec, inoutvec, len, type ) bind(c) 14 use, intrinsic :: iso_c_binding, only : c_ptr, c_f_pointer 15 use mpi_f08 16 type(c_ptr), value :: invec, inoutvec 17 integer :: len 18 type(MPI_Datatype) :: type 19 real, pointer :: invec_r(:), inoutvec_r(:) 20 if (type%MPI_VAL == MPI_REAL%MPI_VAL) then 21 call c_f_pointer(invec, invec_r, (/ len /) ) 22 call c_f_pointer(inoutvec, inoutvec_r, (/ len /) ) 23 inoutvec_r = invec_r + inoutvec_r 24 end if 25 end subroutine 26 27 28 5.9.6 All-Reduce 29 includes a variant of the reduce operations where the result is returned to all processes MPI 30 requires that all processes from the same group participating in these MPI in a group. 31 operations receive identical results. 32 33 34 MPI ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm) _ 35 IN starting address of send buffer (choice) sendbuf 36 37 OUT recvbuf starting address of receive buffer (choice) 38 IN count number of elements in send buffer (non-negative inte- 39 ger) 40 datatype data type of elements of send buffer (handle) IN 41 42 IN op operation (handle) 43 communicator (handle) comm IN 44 45 int MPI_Allreduce(const void* sendbuf, void* recvbuf, int count, 46 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) 47 48 MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm, ierror) BIND(C)

    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

If comm is an intracommunicator, MPI_ALLREDUCE behaves the same as MPI_REDUCE except that the result appears in the receive buffer of all the group members.

Advice to implementors. The all-reduce operations can be implemented as a reduce, followed by a broadcast. However, a direct implementation can lead to better performance. (End of advice to implementors.)

The "in place" option for intracommunicators is specified by passing the value MPI_IN_PLACE to the argument sendbuf at all processes. In this case, the input data is taken at each process from the receive buffer, where it will be replaced by the output data.

If comm is an intercommunicator, then the result of the reduction of the data provided by processes in group A is stored at each process in group B, and vice versa. Both groups should provide count and datatype arguments that specify the same type signature.

The following example uses an intracommunicator.

Example 5.22
A routine that computes the product of a vector and an array that are distributed across a group of processes and returns the answer at all nodes (see also Example 5.16).

SUBROUTINE PAR_BLAS2(m, n, a, b, c, comm)
REAL a(m), b(m,n)     ! local slice of array
REAL c(n)             ! result
REAL sum(n)
INTEGER n, comm, i, j, ierr

! local sum
DO j= 1, n
   sum(j) = 0.0
   DO i = 1, m
      sum(j) = sum(j) + a(i)*b(i,j)
   END DO
END DO

! global sum
CALL MPI_ALLREDUCE(sum, c, n, MPI_REAL, MPI_SUM, comm, ierr)

! return result at all nodes

RETURN
END

5.9.7 Process-Local Reduction

The functions in this section are of importance to library implementors who may want to implement special reduction patterns that are otherwise not easily covered by the standard MPI operations.

The following function applies a reduction operator to local arguments.

MPI_REDUCE_LOCAL(inbuf, inoutbuf, count, datatype, op)
IN     inbuf     input buffer (choice)
INOUT  inoutbuf  combined input and output buffer (choice)
IN     count     number of elements in inbuf and inoutbuf buffers (non-negative integer)
IN     datatype  data type of elements of inbuf and inoutbuf buffers (handle)
IN     op        operation (handle)

int MPI_Reduce_local(const void* inbuf, void* inoutbuf, int count,
                     MPI_Datatype datatype, MPI_Op op)

MPI_Reduce_local(inbuf, inoutbuf, count, datatype, op, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: inbuf
    TYPE(*), DIMENSION(..) :: inoutbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_REDUCE_LOCAL(INBUF, INOUTBUF, COUNT, DATATYPE, OP, IERROR)
    <type> INBUF(*), INOUTBUF(*)
    INTEGER COUNT, DATATYPE, OP, IERROR

The function applies the operation given by op element-wise to the elements of inbuf and inoutbuf with the result stored element-wise in inoutbuf, as explained for user-defined operations in Section 5.9.5. Both inbuf and inoutbuf (input as well as result) have the same number of elements given by count and the same datatype given by datatype. The MPI_IN_PLACE option is not allowed.

Reduction operations can be queried for their commutativity.

MPI_OP_COMMUTATIVE(op, commute)
IN   op       operation (handle)
OUT  commute  true if op is commutative, false otherwise (logical)

220 190 CHAPTER 5. COLLECTIVE COMMUNICATION 1 int MPI_Op_commutative(MPI_Op op, int *commute) 2 MPI_Op_commutative(op, commute, ierror) BIND(C) 3 TYPE(MPI_Op), INTENT(IN) :: op 4 LOGICAL, INTENT(OUT) :: commute 5 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 6 7 MPI_OP_COMMUTATIVE(OP, COMMUTE, IERROR) 8 LOGICAL COMMUTE 9 INTEGER OP, IERROR 10 11 12 Reduce-Scatter 5.10 13 14 includes variants of the reduce operations where the result is scattered to all processes MPI 15 in a group on return. One variant scatters equal-sized blocks to all processes, while another 16 variant scatters blocks that may vary in size for each process. 17 18 _ REDUCE SCATTER _ 5.10.1 _ BLOCK MPI 19 20 21 REDUCE _ SCATTER _ _ MPI BLOCK( sendbuf, recvbuf, recvcount, datatype, op, comm) 22 IN sendbuf starting address of send buffer (choice) 23 24 OUT recvbuf starting address of receive buffer (choice) 25 IN recvcount element count per block (non-negative integer) 26 datatype IN data type of elements of send and receive buffers (han- 27 dle) 28 29 op operation (handle) IN 30 comm communicator (handle) IN 31 32 int MPI_Reduce_scatter_block(const void* sendbuf, void* recvbuf, 33 int recvcount, MPI_Datatype datatype, MPI_Op op, 34 MPI_Comm comm) 35 36 MPI_Reduce_scatter_block(sendbuf, recvbuf, recvcount, datatype, op, comm, 37 ierror) BIND(C) 38 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 39 TYPE(*), DIMENSION(..) :: recvbuf 40 INTEGER, INTENT(IN) :: recvcount 41 TYPE(MPI_Datatype), INTENT(IN) :: datatype 42 TYPE(MPI_Op), INTENT(IN) :: op 43 TYPE(MPI_Comm), INTENT(IN) :: comm 44 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 45 MPI_REDUCE_SCATTER_BLOCK(SENDBUF, RECVBUF, RECVCOUNT, DATATYPE, OP, COMM, 46 IERROR) 47 SENDBUF(*), RECVBUF(*) 48

    INTEGER RECVCOUNT, DATATYPE, OP, COMM, IERROR

If comm is an intracommunicator, MPI_REDUCE_SCATTER_BLOCK first performs a global, element-wise reduction on vectors of count = n*recvcount elements in the send buffers defined by sendbuf, count and datatype, using the operation op, where n is the number of processes in the group of comm. The routine is called by all group members using the same arguments for recvcount, datatype, op and comm. The resulting vector is treated as n consecutive blocks of recvcount elements that are scattered to the processes of the group. The i-th block is sent to process i and stored in the receive buffer defined by recvbuf, recvcount, and datatype.

Advice to implementors. The MPI_REDUCE_SCATTER_BLOCK routine is functionally equivalent to: an MPI_REDUCE collective operation with count equal to recvcount*n, followed by an MPI_SCATTER with sendcount equal to recvcount. However, a direct implementation may run faster. (End of advice to implementors.)

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE in the sendbuf argument on all processes. In this case, the input data is taken from the receive buffer.

If comm is an intercommunicator, then the result of the reduction of the data provided by processes in one group (group A) is scattered among processes in the other group (group B) and vice versa. Within each group, all processes provide the same value for the recvcount argument, and provide input vectors of count = n*recvcount elements stored in the send buffers, where n is the size of the group. The number of elements count must be the same for the two groups. The resulting vector from the other group is scattered in blocks of recvcount elements among the processes in the group.

Rationale. The last restriction is needed so that the length of the send buffer of one group can be determined by the local recvcount argument of the other group. Otherwise, a communication is needed to figure out how many elements are reduced. (End of rationale.)

5.10.2 MPI_REDUCE_SCATTER

MPI_REDUCE_SCATTER extends the functionality of MPI_REDUCE_SCATTER_BLOCK such that the scattered blocks can vary in size. Block sizes are determined by the recvcounts array, such that the i-th block contains recvcounts[i] elements.
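Before turning to the variable-block variant defined below, the fixed-block form described in Section 5.10.1 can be illustrated with a minimal sketch. This is not one of the standard's numbered examples; the block length of 4 and the variable names are illustrative. Each process contributes a full vector of n*recvcount elements and receives only the recvcount reduced elements of its own block.

#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, n, i;
    const int recvcount = 4;        /* elements per block */
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &n);

    /* each process contributes n*recvcount elements */
    int *sendbuf = malloc(n * recvcount * sizeof(int));
    int *recvbuf = malloc(recvcount * sizeof(int));
    for (i = 0; i < n * recvcount; ++i)
        sendbuf[i] = rank + i;

    /* element-wise sum over all processes; block i of the reduced
       vector (recvcount elements) ends up on process i */
    MPI_Reduce_scatter_block(sendbuf, recvbuf, recvcount, MPI_INT,
                             MPI_SUM, comm);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}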

MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcounts, datatype, op, comm)
IN   sendbuf     starting address of send buffer (choice)
OUT  recvbuf     starting address of receive buffer (choice)
IN   recvcounts  non-negative integer array (of length group size) specifying the number of elements of the result distributed to each process.
IN   datatype    data type of elements of send and receive buffers (handle)
IN   op          operation (handle)
IN   comm        communicator (handle)

int MPI_Reduce_scatter(const void* sendbuf, void* recvbuf, const
                       int recvcounts[], MPI_Datatype datatype, MPI_Op op,
                       MPI_Comm comm)

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm,
                   ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: recvcounts(*)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_REDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM,
                   IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, IERROR

If comm is an intracommunicator, MPI_REDUCE_SCATTER first performs a global, element-wise reduction on vectors of count = recvcounts[0] + ... + recvcounts[n-1] elements in the send buffers defined by sendbuf, count and datatype, using the operation op, where n is the number of processes in the group of comm. The routine is called by all group members using the same arguments for recvcounts, datatype, op and comm. The resulting vector is treated as n consecutive blocks where the number of elements of the i-th block is recvcounts[i]. The blocks are scattered to the processes of the group. The i-th block is sent to process i and stored in the receive buffer defined by recvbuf, recvcounts[i] and datatype.

Advice to implementors. The MPI_REDUCE_SCATTER routine is functionally equivalent to: an MPI_REDUCE collective operation with count equal to the sum of recvcounts[i] followed by MPI_SCATTERV with sendcounts equal to recvcounts. However, a direct implementation may run faster. (End of advice to implementors.)

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE in the sendbuf argument. In this case, the input data is taken from the receive buffer. It is not required to specify the "in place" option on all processes, since the processes for which recvcounts[i]==0 may not have allocated a receive buffer.

If comm is an intercommunicator, then the result of the reduction of the data provided by processes in one group (group A) is scattered among processes in the other group (group B), and vice versa. Within each group, all processes provide the same recvcounts argument, and provide input vectors of count = recvcounts[0] + ... + recvcounts[n-1] elements stored in the send buffers, where n is the size of the group. The resulting vector from the other group is scattered in blocks of recvcounts[i] elements among the processes in the group. The number of elements count must be the same for the two groups.

Rationale. The last restriction is needed so that the length of the send buffer can be determined by the sum of the local recvcounts entries. Otherwise, a communication is needed to figure out how many elements are reduced. (End of rationale.)

5.11 Scan

5.11.1 Inclusive Scan

MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm)
IN   sendbuf   starting address of send buffer (choice)
OUT  recvbuf   starting address of receive buffer (choice)
IN   count     number of elements in input buffer (non-negative integer)
IN   datatype  data type of elements of input buffer (handle)
IN   op        operation (handle)
IN   comm      communicator (handle)

int MPI_Scan(const void* sendbuf, void* recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

If comm is an intracommunicator, MPI_SCAN is used to perform a prefix reduction on data distributed across the group. The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i (inclusive). The routine is called by all group members using the same arguments for count, datatype, op and comm, except that for user-defined operations, the same rules apply as

for MPI_REDUCE. The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_REDUCE.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE in the sendbuf argument. In this case, the input data is taken from the receive buffer, and replaced by the output data.

This operation is invalid for intercommunicators.

5.11.2 Exclusive Scan

MPI_EXSCAN(sendbuf, recvbuf, count, datatype, op, comm)
IN   sendbuf   starting address of send buffer (choice)
OUT  recvbuf   starting address of receive buffer (choice)
IN   count     number of elements in input buffer (non-negative integer)
IN   datatype  data type of elements of input buffer (handle)
IN   op        operation (handle)
IN   comm      intracommunicator (handle)

int MPI_Exscan(const void* sendbuf, void* recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

MPI_Exscan(sendbuf, recvbuf, count, datatype, op, comm, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf
    TYPE(*), DIMENSION(..) :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_EXSCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

If comm is an intracommunicator, MPI_EXSCAN is used to perform a prefix reduction on data distributed across the group. The value in recvbuf on the process with rank 0 is undefined, and recvbuf is not significant on process 0. The value in recvbuf on the process with rank 1 is defined as the value in sendbuf on the process with rank 0. For processes with rank i > 1, the operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i-1 (inclusive). The routine is called by all group members using the same arguments for count, datatype, op and comm, except that for user-defined operations, the same rules apply as for MPI_REDUCE. The type of operations supported, their semantics, and the constraints on send and receive buffers, are as for MPI_REDUCE.

The "in place" option for intracommunicators is specified by passing MPI_IN_PLACE in the sendbuf argument. In this case, the input data is taken from the receive buffer, and replaced by the output data. The receive buffer on rank 0 is not changed by this operation.

This operation is invalid for intercommunicators.

Rationale. The exclusive scan is more general than the inclusive scan. Any inclusive scan operation can be achieved by using the exclusive scan and then locally combining the local contribution. Note that for non-invertable operations such as MPI_MAX, the exclusive scan cannot be computed with the inclusive scan. (End of rationale.)

5.11.3 Example using MPI_SCAN

The example in this section uses an intracommunicator.

Example 5.23
This example uses a user-defined operation to produce a segmented scan. A segmented scan takes, as input, a set of values and a set of logicals, and the logicals delineate the various segments of the scan. For example:

values    v1    v2       v3    v4       v5            v6    v7       v8
logicals  0     0        1     1        1             0     0        1
result    v1    v1+v2    v3    v3+v4    v3+v4+v5      v6    v6+v7    v8

The operator that produces this effect is,

    (u, i) ∘ (v, j) = (w, j),

where,

    w = u + v   if i = j
      = v       if i ≠ j.

Note that this is a non-commutative operator. C code that implements it is given below.

typedef struct {
    double val;
    int log;
} SegScanPair;

/* the user-defined function
*/
void segScan(SegScanPair *in, SegScanPair *inout, int *len,
             MPI_Datatype *dptr)
{
    int i;
    SegScanPair c;

    for (i=0; i< *len; ++i) {

226 196 CHAPTER 5. COLLECTIVE COMMUNICATION 1 if (in->log == inout->log) 2 c.val = in->val + inout->val; 3 else 4 c.val = inout->val; 5 c.log = inout->log; 6 *inout = c; 7 in++; inout++; 8 } 9 } 10 argument to the user-defined function corresponds to the right- inout Note that the 11 hand operand of the operator. When using this operator, we must be careful to specify that 12 it is non-commutative, as in the following. 13 14 int i,base; 15 SegScanPair a, answer; 16 MPI_Op myOp; 17 MPI_Datatype type[2] = {MPI_DOUBLE, MPI_INT}; 18 MPI_Aint disp[2]; 19 int blocklen[2] = { 1, 1}; 20 MPI_Datatype sspair; 21 22 /* explain to MPI how type SegScanPair is defined 23 */ 24 MPI_Get_address( &a, disp); 25 MPI_Get_address( &a.log, disp+1); 26 base = disp[0]; 27 for (i=0; i<2; ++i) disp[i] -= base; 28 MPI_Type_create_struct( 2, blocklen, disp, type, &sspair ); 29 MPI_Type_commit( &sspair ); 30 /* create the segmented-scan user-op 31 */ 32 MPI_Op_create(segScan, 0, &myOp); 33 ... 34 MPI_Scan( &a, &answer, 1, sspair, myOp, comm ); 35 36 37 Nonblocking Collective Operations 5.12 38 39 As described in Section 3.7, performance of many applications can be improved by over- 40 lapping communication and computation, and many systems enable this. Nonblocking 41 collective operations combine the potential benefits of nonblocking point-to-point opera- 42 tions, to exploit overlap and to avoid synchronization, with the optimized implementation 43 and message scheduling provided by collective operations [30, 34]. One way of doing this 44 would be to perform a blocking collective operation in a separate thread. An alternative 45 mechanism that often leads to better performance (e.g., avoids context switching, scheduler 46 overheads, and thread management) is to use nonblocking collective communication [32]. 47 The nonblocking collective communication model is similar to the model used for non- 48 blocking point-to-point communication. A nonblocking call initiates a collective operation,

which must be completed in a separate completion call. Once initiated, the operation may progress independently of any computation or other communication at participating processes. In this manner, nonblocking collective operations can mitigate possible synchronizing effects of collective operations by running them in the "background." In addition to enabling communication-computation overlap, nonblocking collective operations can perform collective operations on overlapping communicators, which would lead to deadlocks with blocking operations. Their semantic advantages can also be useful in combination with point-to-point communication.

As in the nonblocking point-to-point case, all calls are local and return immediately, irrespective of the status of other processes. The call initiates the operation, which indicates that the system may start to copy data out of the send buffer and into the receive buffer. Once initiated, all associated send buffers and buffers associated with input arguments (such as arrays of counts, displacements, or datatypes in the vector versions of the collectives) should not be modified, and all associated receive buffers should not be accessed, until the collective operation completes. The call returns a request handle, which must be passed to a completion call.

All completion calls (e.g., MPI_WAIT) described in Section 3.7.3 are supported for nonblocking collective operations. Similarly to the blocking case, nonblocking collective operations are considered to be complete when the local part of the operation is finished, i.e., for the caller, the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not indicate that other processes have completed or even started the operation (unless otherwise implied by the description of the operation). Completion of a particular nonblocking collective operation also does not indicate completion of any other posted nonblocking collective (or send-receive) operations, whether they are posted before or after the completed operation.

Advice to users. Users should be aware that implementations are allowed, but not required (with exception of MPI_IBARRIER), to synchronize processes during the completion of a nonblocking collective operation. (End of advice to users.)

Upon returning from a completion call in which a nonblocking collective operation completes, the MPI_ERROR field in the associated status object is set appropriately, see Section 3.2.5 on page 30. The values of the MPI_SOURCE and MPI_TAG fields are undefined. It is valid to mix different request types (i.e., any combination of collective requests, I/O requests, generalized requests, or point-to-point requests) in functions that enable multiple completions (e.g., MPI_WAITALL). It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation. Nonblocking collective requests are not persistent.

Rationale. Freeing an active nonblocking collective request could cause similar problems as discussed for point-to-point requests (see Section 3.7.3). Cancelling a request is not supported because the semantics of this operation are not well-defined. (End of rationale.)

Multiple nonblocking collective operations can be outstanding on a single communicator. If the nonblocking call causes some system resource to be exhausted, then it will fail and generate an MPI exception. Quality implementations of MPI should ensure that this happens only in pathological cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations.
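The following non-normative sketch (not one of the standard's numbered examples) illustrates that completion may also be detected by polling with MPI_Test rather than blocking in MPI_WAIT; the helper routine do_independent_work() is a hypothetical placeholder for computation that does not touch the broadcast buffer.

#include <mpi.h>

/* Hypothetical helper; stands for computation that does not
   access buf until the broadcast has completed. */
void do_independent_work(void);

void poll_bcast(int *buf, int count, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibcast(buf, count, MPI_INT, 0, comm, &req);
    while (!done) {
        do_independent_work();                 /* must not read or write buf */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    /* buf may now be accessed on all ranks */
}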

Unlike point-to-point operations, nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a tag argument. All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.

Rationale. Matching blocking and nonblocking collective operations is not allowed because the implementation might use different communication algorithms for the two cases. Blocking collective operations may be optimized for minimal time to completion, while nonblocking collective operations may balance time to completion with CPU overhead and asynchronous progression.
The use of tags for collective operations can prevent certain hardware optimizations. (End of rationale.)

Advice to users. If program semantics require matching blocking and nonblocking collective operations, then a nonblocking collective operation can be initiated and immediately completed with a blocking wait to emulate blocking behavior. (End of advice to users.)

In terms of data movements, each nonblocking collective operation has the same effect as its blocking counterpart for intracommunicators and intercommunicators after completion. Likewise, upon completion, nonblocking collective reduction operations have the same effect as their blocking counterparts, and the same restrictions and recommendations on reduction orders apply.

The use of the "in place" option is allowed exactly as described for the corresponding blocking collective operations. When using the "in place" option, message buffers function as both send and receive buffers. Such buffers should not be modified or accessed until the operation completes.

Progression rules for nonblocking collective operations are similar to progression of nonblocking point-to-point operations; refer to Section 3.7.4.

Advice to implementors. Nonblocking collective operations can be implemented with local execution schedules [33] using nonblocking point-to-point communication and a reserved tag-space. (End of advice to implementors.)

5.12.1 Nonblocking Barrier Synchronization

MPI_IBARRIER(comm, request)
  IN   comm      communicator (handle)
  OUT  request   communication request (handle)

int MPI_Ibarrier(MPI_Comm comm, MPI_Request *request)

MPI_Ibarrier(comm, request, ierror) BIND(C)

    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IBARRIER(COMM, REQUEST, IERROR)
    INTEGER COMM, REQUEST, IERROR

MPI_IBARRIER is a nonblocking version of MPI_BARRIER. By calling MPI_IBARRIER, a process notifies that it has reached the barrier. The call returns immediately, independent of whether other processes have called MPI_IBARRIER. The usual barrier semantics are enforced at the corresponding completion operation (test or wait), which in the intracommunicator case will complete only after all other processes in the communicator have called MPI_IBARRIER. In the intercommunicator case, it will complete when all processes in the remote group have called MPI_IBARRIER.

Advice to users. A nonblocking barrier can be used to hide latency. Moving independent computations between the MPI_IBARRIER and the subsequent completion call can overlap the barrier latency and therefore shorten possible waiting times. The semantic properties are also useful when mixing collective operations and point-to-point messages. (End of advice to users.)

5.12.2 Nonblocking Broadcast

MPI_IBCAST(buffer, count, datatype, root, comm, request)
  INOUT  buffer     starting address of buffer (choice)
  IN     count      number of entries in buffer (non-negative integer)
  IN     datatype   data type of buffer (handle)
  IN     root       rank of broadcast root (integer)
  IN     comm       communicator (handle)
  OUT    request    communication request (handle)

int MPI_Ibcast(void* buffer, int count, MPI_Datatype datatype, int root,
               MPI_Comm comm, MPI_Request *request)

MPI_Ibcast(buffer, count, datatype, root, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: buffer
    INTEGER, INTENT(IN) :: count, root
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IBCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, REQUEST, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_BCAST (see Section 5.4).
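Referring back to the advice to users in Section 5.12.1, the following non-normative sketch shows how barrier latency can be hidden; the routine compute_locally() is a hypothetical placeholder for computation that is independent of the barrier.

#include <mpi.h>

void compute_locally(void);   /* hypothetical independent computation */

void overlapped_barrier(MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ibarrier(comm, &req);              /* notify arrival at the barrier */
    compute_locally();                     /* overlap with the barrier latency */
    MPI_Wait(&req, MPI_STATUS_IGNORE);     /* barrier semantics enforced here */
}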

Example using MPI_IBCAST

The example in this section uses an intracommunicator.

Example 5.24
Start a broadcast of 100 ints from process 0 to every process in the group, perform some computation on independent data, and then complete the outstanding broadcast operation.

MPI_Comm comm;
int array1[100], array2[100];
int root=0;
MPI_Request req;
...
MPI_Ibcast(array1, 100, MPI_INT, root, comm, &req);
compute(array2, 100);
MPI_Wait(&req, MPI_STATUS_IGNORE);

5.12.3 Nonblocking Gather

MPI_IGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, request)
  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements in send buffer (non-negative integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice, significant only at root)
  IN   recvcount   number of elements for any single receive (non-negative integer, significant only at root)
  IN   recvtype    data type of recv buffer elements (significant only at root) (handle)
  IN   root        rank of receiving process (integer)
  IN   comm        communicator (handle)
  OUT  request     communication request (handle)

int MPI_Igather(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
                MPI_Comm comm, MPI_Request *request)

MPI_Igather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
            root, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount, root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype

    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
            ROOT, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, REQUEST,
            IERROR

This call starts a nonblocking variant of MPI_GATHER (see Section 5.5).

MPI_IGATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm, request)
  IN   sendbuf      starting address of send buffer (choice)
  IN   sendcount    number of elements in send buffer (non-negative integer)
  IN   sendtype     data type of send buffer elements (handle)
  OUT  recvbuf      address of receive buffer (choice, significant only at root)
  IN   recvcounts   non-negative integer array (of length group size) containing the number of elements that are received from each process (significant only at root)
  IN   displs       integer array (of length group size). Entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i (significant only at root)
  IN   recvtype     data type of recv buffer elements (significant only at root) (handle)
  IN   root         rank of receiving process (integer)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Igatherv(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                 void* recvbuf, const int recvcounts[], const int displs[],
                 MPI_Datatype recvtype, int root, MPI_Comm comm,
                 MPI_Request *request)

MPI_Igatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs,
             recvtype, root, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, root
    INTEGER, INTENT(IN), ASYNCHRONOUS :: recvcounts(*), displs(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype

    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
             RECVTYPE, ROOT, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT,
            COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_GATHERV (see Section 5.5).

5.12.4 Nonblocking Scatter

MPI_ISCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, request)
  IN   sendbuf     address of send buffer (choice, significant only at root)
  IN   sendcount   number of elements sent to each process (non-negative integer, significant only at root)
  IN   sendtype    data type of send buffer elements (significant only at root) (handle)
  OUT  recvbuf     address of receive buffer (choice)
  IN   recvcount   number of elements in receive buffer (non-negative integer)
  IN   recvtype    data type of receive buffer elements (handle)
  IN   root        rank of sending process (integer)
  IN   comm        communicator (handle)
  OUT  request     communication request (handle)

int MPI_Iscatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                 void* recvbuf, int recvcount, MPI_Datatype recvtype, int root,
                 MPI_Comm comm, MPI_Request *request)

MPI_Iscatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
             root, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount, root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ISCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
             ROOT, COMM, REQUEST, IERROR)

    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, REQUEST,
            IERROR

This call starts a nonblocking variant of MPI_SCATTER (see Section 5.6).

MPI_ISCATTERV(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm, request)
  IN   sendbuf      address of send buffer (choice, significant only at root)
  IN   sendcounts   non-negative integer array (of length group size) specifying the number of elements to send to each rank
  IN   displs       integer array (of length group size). Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i
  IN   sendtype     data type of send buffer elements (handle)
  OUT  recvbuf      address of receive buffer (choice)
  IN   recvcount    number of elements in receive buffer (non-negative integer)
  IN   recvtype     data type of receive buffer elements (handle)
  IN   root         rank of sending process (integer)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Iscatterv(const void* sendbuf, const int sendcounts[], const
                  int displs[], MPI_Datatype sendtype, void* recvbuf,
                  int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm,
                  MPI_Request *request)

MPI_Iscatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount,
              recvtype, root, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN), ASYNCHRONOUS :: sendcounts(*), displs(*)
    INTEGER, INTENT(IN) :: recvcount, root
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ISCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT,
              RECVTYPE, ROOT, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT,
            COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_SCATTERV (see Section 5.6).
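The following non-normative sketch illustrates the call just described; the chunk sizes (rank i receives i+1 elements), the routine unrelated_work(), and the assumption that sendbuf and recvbuf are large enough are illustrative only.

#include <mpi.h>
#include <stdlib.h>

void unrelated_work(void);    /* hypothetical; does not touch the buffers */

void scatter_varying(const double *sendbuf, double *recvbuf, MPI_Comm comm)
{
    int rank, size, i;
    int *sendcounts = NULL, *displs = NULL;
    MPI_Request req;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* counts and displacements are significant only at the root */
        sendcounts = (int *)malloc(size * sizeof(int));
        displs     = (int *)malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            sendcounts[i] = i + 1;            /* rank i receives i+1 elements */
            displs[i] = (i == 0) ? 0 : displs[i-1] + sendcounts[i-1];
        }
    }
    MPI_Iscatterv(sendbuf, sendcounts, displs, MPI_DOUBLE,
                  recvbuf, rank + 1, MPI_DOUBLE, 0, comm, &req);
    unrelated_work();                         /* overlap with the scatter */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* the count and displacement arrays may only be freed after completion */
    if (rank == 0) { free(sendcounts); free(displs); }
}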

5.12.5 Nonblocking Gather-to-all

MPI_IALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, request)
  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements in send buffer (non-negative integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice)
  IN   recvcount   number of elements received from any process (non-negative integer)
  IN   recvtype    data type of receive buffer elements (handle)
  IN   comm        communicator (handle)
  OUT  request     communication request (handle)

int MPI_Iallgather(const void* sendbuf, int sendcount,
                   MPI_Datatype sendtype, void* recvbuf, int recvcount,
                   MPI_Datatype recvtype, MPI_Comm comm, MPI_Request *request)

MPI_Iallgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
               comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
               COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLGATHER (see Section 5.7).
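As a non-normative sketch of this call, each process below contributes a single int and overlaps independent work with the gather; independent_work() is a hypothetical placeholder, and contribs is assumed to have room for one int per process in comm.

#include <mpi.h>

void independent_work(void);  /* hypothetical; does not touch contribs */

/* After completion, contribs[i] holds the contribution of rank i
   on every process in comm. */
void gather_all(int myval, int *contribs, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallgather(&myval, 1, MPI_INT, contribs, 1, MPI_INT, comm, &req);
    independent_work();                    /* overlap with the gather */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}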

MPI_IALLGATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm, request)
  IN   sendbuf      starting address of send buffer (choice)
  IN   sendcount    number of elements in send buffer (non-negative integer)
  IN   sendtype     data type of send buffer elements (handle)
  OUT  recvbuf      address of receive buffer (choice)
  IN   recvcounts   non-negative integer array (of length group size) containing the number of elements that are received from each process
  IN   displs       integer array (of length group size). Entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i
  IN   recvtype     data type of receive buffer elements (handle)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Iallgatherv(const void* sendbuf, int sendcount,
                    MPI_Datatype sendtype, void* recvbuf, const int recvcounts[],
                    const int displs[], MPI_Datatype recvtype, MPI_Comm comm,
                    MPI_Request* request)

MPI_Iallgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs,
                recvtype, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount
    INTEGER, INTENT(IN), ASYNCHRONOUS :: recvcounts(*), displs(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS,
                RECVTYPE, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM,
            REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLGATHERV (see Section 5.7).
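A common pattern with the vector variant is to first exchange the per-process counts and then compute the displacements as prefix sums. The following non-normative sketch shows one way to do so (error handling omitted; the use of a blocking MPI_Allgather for the counts is a design choice made for brevity, not a requirement).

#include <mpi.h>
#include <stdlib.h>

/* Gather a variable-length contribution from every process.  The caller
   supplies mydata[0..mycount-1]; the routine returns a freshly allocated
   buffer holding all contributions in rank order. */
double *gather_varying(const double *mydata, int mycount, MPI_Comm comm)
{
    int size, i;
    int *recvcounts, *displs;
    double *result;
    MPI_Request req;

    MPI_Comm_size(comm, &size);
    recvcounts = (int *)malloc(size * sizeof(int));
    displs     = (int *)malloc(size * sizeof(int));

    /* first learn how much each rank will contribute */
    MPI_Allgather(&mycount, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    displs[0] = 0;
    for (i = 1; i < size; i++)
        displs[i] = displs[i-1] + recvcounts[i-1];

    result = (double *)malloc((displs[size-1] + recvcounts[size-1])
                              * sizeof(double));

    MPI_Iallgatherv(mydata, mycount, MPI_DOUBLE,
                    result, recvcounts, displs, MPI_DOUBLE, comm, &req);
    /* ... independent computation could overlap here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(recvcounts);    /* only after the operation has completed */
    free(displs);
    return result;
}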

5.12.6 Nonblocking All-to-All Scatter/Gather

MPI_IALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, request)
  IN   sendbuf     starting address of send buffer (choice)
  IN   sendcount   number of elements sent to each process (non-negative integer)
  IN   sendtype    data type of send buffer elements (handle)
  OUT  recvbuf     address of receive buffer (choice)
  IN   recvcount   number of elements received from any process (non-negative integer)
  IN   recvtype    data type of receive buffer elements (handle)
  IN   comm        communicator (handle)
  OUT  request     communication request (handle)

int MPI_Ialltoall(const void* sendbuf, int sendcount,
                  MPI_Datatype sendtype, void* recvbuf, int recvcount,
                  MPI_Datatype recvtype, MPI_Comm comm, MPI_Request *request)

MPI_Ialltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype,
              comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: sendcount, recvcount
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE,
              COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLTOALL (see Section 5.8).
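As a non-normative sketch of this call, the routine below exchanges fixed-size blocks between all processes and overlaps an independent local phase; local_phase() is a hypothetical placeholder, and the buffers are assumed to hold one block per process.

#include <mpi.h>

void local_phase(void);       /* hypothetical work independent of the buffers */

/* Each process sends blocklen ints to every process (including itself)
   and receives blocklen ints from every process.  sendbuf and recvbuf
   must each hold comm_size * blocklen ints. */
void exchange_blocks(const int *sendbuf, int *recvbuf, int blocklen,
                     MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ialltoall(sendbuf, blocklen, MPI_INT, recvbuf, blocklen, MPI_INT,
                  comm, &req);
    local_phase();            /* overlap with the exchange */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}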

MPI_IALLTOALLV(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm, request)
  IN   sendbuf      starting address of send buffer (choice)
  IN   sendcounts   non-negative integer array (of length group size) specifying the number of elements to send to each rank
  IN   sdispls      integer array (of length group size). Entry j specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process j
  IN   sendtype     data type of send buffer elements (handle)
  OUT  recvbuf      address of receive buffer (choice)
  IN   recvcounts   non-negative integer array (of length group size) specifying the number of elements that can be received from each rank
  IN   rdispls      integer array (of length group size). Entry i specifies the displacement (relative to recvbuf) at which to place the incoming data from process i
  IN   recvtype     data type of receive buffer elements (handle)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Ialltoallv(const void* sendbuf, const int sendcounts[], const
                   int sdispls[], MPI_Datatype sendtype, void* recvbuf, const
                   int recvcounts[], const int rdispls[], MPI_Datatype recvtype,
                   MPI_Comm comm, MPI_Request *request)

MPI_Ialltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts,
               rdispls, recvtype, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN), ASYNCHRONOUS :: sendcounts(*), sdispls(*),
            recvcounts(*), rdispls(*)
    TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS,
               RDISPLS, RECVTYPE, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*),
            RECVTYPE, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLTOALLV (see Section 5.8).
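The following non-normative sketch constructs the send and receive displacements as prefix sums of the count arrays, which is a common way to pack contiguous per-destination blocks; the count arrays are assumed to be mutually consistent across the communicator, and error handling is omitted.

#include <mpi.h>
#include <stdlib.h>

/* Exchange a varying number of ints with every process: sendcounts[i]
   elements go to rank i and recvcounts[i] elements are expected from
   rank i. */
void exchange_varying(const int *sendbuf, const int sendcounts[],
                      int *recvbuf, const int recvcounts[], MPI_Comm comm)
{
    int size, i;
    int *sdispls, *rdispls;
    MPI_Request req;

    MPI_Comm_size(comm, &size);
    sdispls = (int *)malloc(size * sizeof(int));
    rdispls = (int *)malloc(size * sizeof(int));
    sdispls[0] = rdispls[0] = 0;
    for (i = 1; i < size; i++) {
        sdispls[i] = sdispls[i-1] + sendcounts[i-1];
        rdispls[i] = rdispls[i-1] + recvcounts[i-1];
    }

    MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                   recvbuf, recvcounts, rdispls, MPI_INT, comm, &req);
    /* independent work may overlap here; the count and displacement
       arrays must not be modified until the operation completes */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sdispls);
    free(rdispls);
}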

MPI_IALLTOALLW(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm, request)
  IN   sendbuf      starting address of send buffer (choice)
  IN   sendcounts   integer array (of length group size) specifying the number of elements to send to each rank (array of non-negative integers)
  IN   sdispls      integer array (of length group size). Entry j specifies the displacement in bytes (relative to sendbuf) from which to take the outgoing data destined for process j (array of integers)
  IN   sendtypes    array of datatypes (of length group size). Entry j specifies the type of data to send to process j (array of handles)
  OUT  recvbuf      address of receive buffer (choice)
  IN   recvcounts   integer array (of length group size) specifying the number of elements that can be received from each rank (array of non-negative integers)
  IN   rdispls      integer array (of length group size). Entry i specifies the displacement in bytes (relative to recvbuf) at which to place the incoming data from process i (array of integers)
  IN   recvtypes    array of datatypes (of length group size). Entry i specifies the type of data received from process i (array of handles)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Ialltoallw(const void* sendbuf, const int sendcounts[], const
                   int sdispls[], const MPI_Datatype sendtypes[], void* recvbuf,
                   const int recvcounts[], const int rdispls[], const
                   MPI_Datatype recvtypes[], MPI_Comm comm, MPI_Request *request)

MPI_Ialltoallw(sendbuf, sendcounts, sdispls, sendtypes, recvbuf,
               recvcounts, rdispls, recvtypes, comm, request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN), ASYNCHRONOUS :: sendcounts(*), sdispls(*),
            recvcounts(*), rdispls(*)
    TYPE(MPI_Datatype), INTENT(IN), ASYNCHRONOUS :: sendtypes(*),
            recvtypes(*)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLTOALLW(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPES, RECVBUF,
               RECVCOUNTS, RDISPLS, RECVTYPES, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPES(*), RECVCOUNTS(*),
            RDISPLS(*), RECVTYPES(*), COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLTOALLW (see Section 5.8).

5.12.7 Nonblocking Reduce

MPI_IREDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, request)
  IN   sendbuf    address of send buffer (choice)
  OUT  recvbuf    address of receive buffer (choice, significant only at root)
  IN   count      number of elements in send buffer (non-negative integer)
  IN   datatype   data type of elements of send buffer (handle)
  IN   op         reduce operation (handle)
  IN   root       rank of root process (integer)
  IN   comm       communicator (handle)
  OUT  request    communication request (handle)

int MPI_Ireduce(const void* sendbuf, void* recvbuf, int count,
                MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm,
                MPI_Request *request)

MPI_Ireduce(sendbuf, recvbuf, count, datatype, op, root, comm, request,
            ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: count, root
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, REQUEST,
            IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_REDUCE (see Section 5.9.1).

Advice to implementors. The implementation is explicitly allowed to use different algorithms for blocking and nonblocking reduction operations that might change the

order of evaluation of the operations. However, as for MPI_REDUCE, it is strongly recommended that MPI_IREDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processes. (End of advice to implementors.)

Advice to users. For operations which are not truly associative, the result delivered upon completion of the nonblocking reduction may not exactly equal the result delivered by the blocking reduction, even when specifying the same arguments in the same order. (End of advice to users.)

5.12.8 Nonblocking All-Reduce

MPI_IALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, request)
  IN   sendbuf    starting address of send buffer (choice)
  OUT  recvbuf    starting address of receive buffer (choice)
  IN   count      number of elements in send buffer (non-negative integer)
  IN   datatype   data type of elements of send buffer (handle)
  IN   op         operation (handle)
  IN   comm       communicator (handle)
  OUT  request    communication request (handle)

int MPI_Iallreduce(const void* sendbuf, void* recvbuf, int count,
                   MPI_Datatype datatype, MPI_Op op, MPI_Comm comm,
                   MPI_Request *request)

MPI_Iallreduce(sendbuf, recvbuf, count, datatype, op, comm, request,
               ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, REQUEST,
               IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_ALLREDUCE (see Section 5.9.6).
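As a non-normative sketch, a global sum can be started and then overlapped with work that touches neither buffer; update_unrelated_state() is a hypothetical placeholder.

#include <mpi.h>

void update_unrelated_state(void);   /* hypothetical independent work */

/* Start a global sum of a local partial result, overlap independent
   work, then wait for the reduced value. */
double global_sum(double local, MPI_Comm comm)
{
    double global;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
    update_unrelated_state();        /* must not touch local or global */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global;
}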

5.12.9 Nonblocking Reduce-Scatter with Equal Blocks

MPI_IREDUCE_SCATTER_BLOCK(sendbuf, recvbuf, recvcount, datatype, op, comm, request)
  IN   sendbuf     starting address of send buffer (choice)
  OUT  recvbuf     starting address of receive buffer (choice)
  IN   recvcount   element count per block (non-negative integer)
  IN   datatype    data type of elements of send and receive buffers (handle)
  IN   op          operation (handle)
  IN   comm        communicator (handle)
  OUT  request     communication request (handle)

int MPI_Ireduce_scatter_block(const void* sendbuf, void* recvbuf,
                              int recvcount, MPI_Datatype datatype, MPI_Op op,
                              MPI_Comm comm, MPI_Request *request)

MPI_Ireduce_scatter_block(sendbuf, recvbuf, recvcount, datatype, op, comm,
                          request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: recvcount
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IREDUCE_SCATTER_BLOCK(SENDBUF, RECVBUF, RECVCOUNT, DATATYPE, OP, COMM,
                          REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER RECVCOUNT, DATATYPE, OP, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_REDUCE_SCATTER_BLOCK (see Section 5.10.1).
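The following non-normative sketch shows the per-block semantics of this call: each process contributes a vector of comm_size * blocklen elements, and after completion keeps only the element-wise reduced block corresponding to its own rank. The buffer sizes are an assumption of the sketch.

#include <mpi.h>

/* After completion, recvbuf[0..blocklen-1] on rank r holds the
   element-wise sum of block r of every process's sendbuf. */
void reduce_my_block(const double *sendbuf, double *recvbuf, int blocklen,
                     MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ireduce_scatter_block(sendbuf, recvbuf, blocklen, MPI_DOUBLE,
                              MPI_SUM, comm, &req);
    /* ... independent computation may overlap here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}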

5.12.10 Nonblocking Reduce-Scatter

MPI_IREDUCE_SCATTER(sendbuf, recvbuf, recvcounts, datatype, op, comm, request)
  IN   sendbuf      starting address of send buffer (choice)
  OUT  recvbuf      starting address of receive buffer (choice)
  IN   recvcounts   non-negative integer array specifying the number of elements in result distributed to each process. Array must be identical on all calling processes.
  IN   datatype     data type of elements of input buffer (handle)
  IN   op           operation (handle)
  IN   comm         communicator (handle)
  OUT  request      communication request (handle)

int MPI_Ireduce_scatter(const void* sendbuf, void* recvbuf, const
                        int recvcounts[], MPI_Datatype datatype, MPI_Op op,
                        MPI_Comm comm, MPI_Request *request)

MPI_Ireduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm,
                    request, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN), ASYNCHRONOUS :: recvcounts(*)
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IREDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM,
                    REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_REDUCE_SCATTER (see Section 5.10.2).

5.12.11 Nonblocking Inclusive Scan

MPI_ISCAN(sendbuf, recvbuf, count, datatype, op, comm, request)
  IN   sendbuf    starting address of send buffer (choice)
  OUT  recvbuf    starting address of receive buffer (choice)
  IN   count      number of elements in input buffer (non-negative integer)
  IN   datatype   data type of elements of input buffer (handle)
  IN   op         operation (handle)
  IN   comm       communicator (handle)
  OUT  request    communication request (handle)

int MPI_Iscan(const void* sendbuf, void* recvbuf, int count,
              MPI_Datatype datatype, MPI_Op op, MPI_Comm comm,
              MPI_Request *request)

MPI_Iscan(sendbuf, recvbuf, count, datatype, op, comm, request, ierror)
          BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ISCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_SCAN (see Section 5.11).
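As a non-normative sketch, an inclusive prefix sum of per-process counts can be computed in the background, for instance to determine the end offset of each process's portion of a shared result; the surrounding usage is illustrative only.

#include <mpi.h>

/* On rank i the returned value is mycount(0) + ... + mycount(i). */
long prefix_count(long mycount, MPI_Comm comm)
{
    long inclusive;
    MPI_Request req;

    MPI_Iscan(&mycount, &inclusive, 1, MPI_LONG, MPI_SUM, comm, &req);
    /* ... independent work may overlap here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return inclusive;
}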

5.12.12 Nonblocking Exclusive Scan

MPI_IEXSCAN(sendbuf, recvbuf, count, datatype, op, comm, request)
  IN   sendbuf    starting address of send buffer (choice)
  OUT  recvbuf    starting address of receive buffer (choice)
  IN   count      number of elements in input buffer (non-negative integer)
  IN   datatype   data type of elements of input buffer (handle)
  IN   op         operation (handle)
  IN   comm       intracommunicator (handle)
  OUT  request    communication request (handle)

int MPI_Iexscan(const void* sendbuf, void* recvbuf, int count,
                MPI_Datatype datatype, MPI_Op op, MPI_Comm comm,
                MPI_Request *request)

MPI_Iexscan(sendbuf, recvbuf, count, datatype, op, comm, request, ierror)
            BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf
    TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf
    INTEGER, INTENT(IN) :: count
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Op), INTENT(IN) :: op
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_IEXSCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, REQUEST, IERROR

This call starts a nonblocking variant of MPI_EXSCAN (see Section 5.11.2).

5.13 Correctness

A correct, portable program must invoke collective communications so that deadlock will not occur, whether collective communications are synchronizing or not. The following examples illustrate dangerous use of collective routines on intracommunicators.

Example 5.25
The following is erroneous.

switch(rank) {
    case 0:
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Bcast(buf2, count, type, 1, comm);
        break;
    case 1:
        MPI_Bcast(buf2, count, type, 1, comm);
        MPI_Bcast(buf1, count, type, 0, comm);
        break;
}

We assume that the group of comm is {0,1}. Two processes execute two broadcast operations in reverse order. If the operation is synchronizing then a deadlock will occur.
Collective operations must be executed in the same order at all members of the communication group.

Example 5.26
The following is erroneous.

switch(rank) {
    case 0:
        MPI_Bcast(buf1, count, type, 0, comm0);
        MPI_Bcast(buf2, count, type, 2, comm2);
        break;
    case 1:
        MPI_Bcast(buf1, count, type, 1, comm1);
        MPI_Bcast(buf2, count, type, 0, comm0);
        break;
    case 2:
        MPI_Bcast(buf1, count, type, 2, comm2);
        MPI_Bcast(buf2, count, type, 1, comm1);
        break;
}

Assume that the group of comm0 is {0,1}, of comm1 is {1,2}, and of comm2 is {2,0}. If the broadcast is a synchronizing operation, then there is a cyclic dependency: the broadcast in comm2 completes only after the broadcast in comm0; the broadcast in comm0 completes only after the broadcast in comm1; and the broadcast in comm1 completes only after the broadcast in comm2. Thus, the code will deadlock.
Collective operations must be executed in an order so that no cyclic dependencies occur. Nonblocking collective operations can alleviate this issue.

Example 5.27
The following is erroneous.

switch(rank) {
    case 0:
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Send(buf2, count, type, 1, tag, comm);
        break;
    case 1:
        MPI_Recv(buf2, count, type, 0, tag, comm, status);
        MPI_Bcast(buf1, count, type, 0, comm);
        break;
}

Process zero executes a broadcast, followed by a blocking send operation. Process one first executes a blocking receive that matches the send, followed by a broadcast call that matches the broadcast of process zero. This program may deadlock. The broadcast call on process zero may block until process one executes the matching broadcast call, so that the send is not executed. Process one will definitely block on the receive and so, in this case, never executes the broadcast.
The relative order of execution of collective operations and point-to-point operations should be such that, even if the collective operations and the point-to-point operations are synchronizing, no deadlock will occur.

Example 5.28
An unsafe, non-deterministic program.

switch(rank) {
    case 0:
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Send(buf2, count, type, 1, tag, comm);
        break;
    case 1:
        MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, status);
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Recv(buf2, count, type, MPI_ANY_SOURCE, tag, comm, status);
        break;
    case 2:
        MPI_Send(buf2, count, type, 1, tag, comm);
        MPI_Bcast(buf1, count, type, 0, comm);
        break;
}

All three processes participate in a broadcast. Process 0 sends a message to process 1 after the broadcast, and process 2 sends a message to process 1 before the broadcast. Process 1 receives before and after the broadcast, with a wildcard source argument.
Two possible executions of this program, with different matchings of sends and receives, are illustrated in Figure 5.12. Note that the second execution has the peculiar effect that a send executed after the broadcast is received at another node before the broadcast. This example illustrates the fact that one should not rely on collective communication functions to have particular synchronization effects. A program that works correctly only when the first execution occurs (only when broadcast is synchronizing) is erroneous.

[Figure 5.12 shows two possible executions, labelled "First Execution" and "Second Execution", with different matchings of the sends and receives at processes 0, 1, and 2 relative to the broadcast.]

Figure 5.12: A race condition causes non-deterministic matching of sends and receives. One cannot rely on synchronization from a broadcast to make the program deterministic.

Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.

Advice to implementors. Assume that broadcast is implemented using point-to-point MPI communication. Suppose the following two rules are followed.
1. All receives specify their source explicitly (no wildcards).
2. Each process sends all messages that pertain to one collective call before sending any message that pertains to a subsequent collective call.
Then, messages belonging to successive broadcasts cannot be confused, as the order of point-to-point messages is preserved.
It is the implementor's responsibility to ensure that point-to-point messages are not confused with collective messages. One way to accomplish this is, whenever a communicator is created, to also create a "hidden communicator" for collective communication. One could achieve a similar effect more cheaply, for example, by using a hidden tag or context bit to indicate whether the communicator is used for point-to-point or collective communication. (End of advice to implementors.)

Example 5.29
Blocking and nonblocking collective operations can be interleaved, i.e., a blocking collective operation can be posted even if there is a nonblocking collective operation outstanding.

MPI_Request req;

MPI_Ibarrier(comm, &req);
MPI_Bcast(buf1, count, type, 0, comm);
MPI_Wait(&req, MPI_STATUS_IGNORE);

Each process starts a nonblocking barrier operation, participates in a blocking broadcast, and then waits until every other process has started the barrier operation. This effectively turns the broadcast into a synchronizing broadcast with possible communication/communication overlap (MPI_Bcast is allowed, but not required, to synchronize).

Example 5.30
The starting order of collective operations on a particular communicator defines their matching. The following example shows an erroneous matching of different collective operations on the same communicator.

MPI_Request req;
switch(rank) {
    case 0:
        /* erroneous matching */
        MPI_Ibarrier(comm, &req);
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
    case 1:
        /* erroneous matching */
        MPI_Bcast(buf1, count, type, 0, comm);
        MPI_Ibarrier(comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
}

This ordering would match MPI_Ibarrier on rank 0 with MPI_Bcast on rank 1, which is erroneous, and the program behavior is undefined. However, if such an order is required, the user must create different duplicate communicators and perform the operations on them. If started with two processes, the following program would be correct:

MPI_Request req;
MPI_Comm dupcomm;
MPI_Comm_dup(comm, &dupcomm);
switch(rank) {
    case 0:
        MPI_Ibarrier(comm, &req);
        MPI_Bcast(buf1, count, type, 0, dupcomm);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
    case 1:
        MPI_Bcast(buf1, count, type, 0, dupcomm);
        MPI_Ibarrier(comm, &req);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
}

Advice to users. The use of different communicators offers some flexibility regarding the matching of nonblocking collective operations. In this sense, communicators could be used as an equivalent to tags. However, communicator construction might induce overheads, so this should be used carefully. (End of advice to users.)

Example 5.31
Nonblocking collective operations can rely on the same progression rules as nonblocking point-to-point messages. Thus, if started with two processes, the following program is a valid MPI program and is guaranteed to terminate:

MPI_Request req;

switch(rank) {
    case 0:
        MPI_Ibarrier(comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Send(buf, count, dtype, 1, tag, comm);
        break;
    case 1:
        MPI_Ibarrier(comm, &req);
        MPI_Recv(buf, count, dtype, 0, tag, comm, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
}

The MPI library must progress the barrier in the MPI_Recv call. Thus, the MPI_Wait call in rank 0 will eventually complete, which enables the matching MPI_Send, so all calls eventually return.

Example 5.32
Blocking and nonblocking collective operations do not match. The following example is erroneous.

MPI_Request req;

switch(rank) {
    case 0:
        /* erroneous false matching of Alltoall and Ialltoall */
        MPI_Ialltoall(sbuf, scnt, stype, rbuf, rcnt, rtype, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        break;
    case 1:
        /* erroneous false matching of Alltoall and Ialltoall */
        MPI_Alltoall(sbuf, scnt, stype, rbuf, rcnt, rtype, comm);
        break;
}

Example 5.33
Collective and point-to-point requests can be mixed in functions that enable multiple completions. If started with two processes, the following program is valid.

MPI_Request reqs[2];

switch(rank) {
    case 0:
        MPI_Ibarrier(comm, &reqs[0]);
        MPI_Send(buf, count, dtype, 1, tag, comm);
        MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
        break;
    case 1:
        MPI_Irecv(buf, count, dtype, 0, tag, comm, &reqs[0]);
        MPI_Ibarrier(comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        break;
}

The MPI_Waitall call returns only after the barrier and the receive have completed.

Example 5.34
Multiple nonblocking collective operations can be outstanding on a single communicator and match in order.

MPI_Request reqs[3];

compute(buf1);
MPI_Ibcast(buf1, count, type, 0, comm, &reqs[0]);
compute(buf2);
MPI_Ibcast(buf2, count, type, 0, comm, &reqs[1]);
compute(buf3);
MPI_Ibcast(buf3, count, type, 0, comm, &reqs[2]);
MPI_Waitall(3, reqs, MPI_STATUSES_IGNORE);

Advice to users. Pipelining and double-buffering techniques can efficiently be used to overlap computation and communication. However, having too many outstanding requests might have a negative impact on performance. (End of advice to users.)

Advice to implementors. The use of pipelining may generate many outstanding requests. A high-quality hardware-supported implementation with limited resources should be able to fall back to a software implementation if its resources are exhausted. In this way, the implementation could limit the number of outstanding requests only by the available memory. (End of advice to implementors.)

Example 5.35

[Figure 5.13 depicts three processes 0, 1, and 2 connected by three overlapping communicators: comm1 joins ranks 0 and 1, comm2 joins ranks 1 and 2, and comm3 joins ranks 2 and 0.]

Figure 5.13: Example with overlapping communicators.

Nonblocking collective operations can also be used to enable simultaneous collective operations on multiple overlapping communicators (see Figure 5.13). The following example is started with three processes and three communicators. The first communicator comm1 includes ranks 0 and 1, comm2 includes ranks 1 and 2, and comm3 spans ranks 0 and 2. It is not possible to perform a blocking collective operation on all communicators because there exists no deadlock-free order to invoke them. However, nonblocking collective operations can easily be used to achieve this task.

MPI_Request reqs[2];

switch(rank) {
    case 0:
        MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
        MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
        break;
    case 1:
        MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
        MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[1]);
        break;
    case 2:
        MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[0]);
        MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
        break;
}
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

Advice to users. This method can be useful if overlapping neighboring regions (halo or ghost zones) are used in collective operations. The sequence of the two calls in each process is irrelevant because the two nonblocking operations are performed on different communicators. (End of advice to users.)

Example 5.36
The progress of multiple outstanding nonblocking collective operations is completely independent.

MPI_Request reqs[2];

compute(buf1);
MPI_Ibcast(buf1, count, type, 0, comm, &reqs[0]);
compute(buf2);
MPI_Ibcast(buf2, count, type, 0, comm, &reqs[1]);
MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
/* nothing is known about the status of the first bcast here */
MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);

Finishing the second MPI_IBCAST is completely independent of the first one. This means that it is not guaranteed that the first broadcast operation is finished or even started after the second one is completed via reqs[1].

Chapter 6

Groups, Contexts, Communicators, and Caching

6.1 Introduction

This chapter introduces MPI features that support the development of parallel libraries. Parallel libraries are needed to encapsulate the distracting complications inherent in parallel implementations of key algorithms. They help to ensure consistent correctness of such procedures, and provide a "higher level" of portability than MPI itself can provide. As such, libraries prevent each programmer from repeating the work of defining consistent data structures, data layouts, and methods that implement key algorithms (such as matrix operations). Since the best libraries come with several variations on parallel systems (different data layouts, different strategies depending on the size of the system or problem, or type of floating point), this too needs to be hidden from the user.

We refer the reader to [55] and [3] for further information on writing libraries in MPI, using the features described in this chapter.

6.1.1 Features Needed to Support Libraries

The key features needed to support the creation of robust parallel libraries are as follows:

• Safe communication space, that guarantees that libraries can communicate as they need to, without conflicting with communication extraneous to the library,

• Group scope for collective operations, that allow libraries to avoid unnecessarily synchronizing uninvolved processes (potentially running unrelated code),

• Abstract process naming to allow libraries to describe their communication in terms suitable to their own data structures and algorithms,

• The ability to "adorn" a set of communicating processes with additional user-defined attributes, such as extra collective operations. This mechanism should provide a means for the user or library writer effectively to extend a message-passing notation.

In addition, a unified mechanism or object is needed for conveniently denoting communication context, the group of communicating processes, to house abstract process naming, and to store adornments.

6.1.2 MPI's Support for Libraries

The corresponding concepts that MPI provides, specifically to support robust libraries, are as follows:

• Contexts of communication,

• Groups of processes,

• Virtual topologies,

• Attribute caching,

• Communicators.

Communicators (see [21, 53, 57]) encapsulate all of these ideas in order to provide the appropriate scope for all communication operations in MPI. Communicators are divided into two kinds: intra-communicators for operations within a single group of processes and inter-communicators for operations between two groups of processes.

Caching. Communicators (see below) provide a "caching" mechanism that allows one to associate new attributes with communicators, on par with MPI built-in features. This can be used by advanced users to adorn communicators further, and by MPI to implement some communicator functions. For example, the virtual-topology functions described in Chapter 7 are likely to be supported this way.

Groups. Groups define an ordered collection of processes, each with a rank, and it is this group that defines the low-level names for inter-process communication (ranks are used for sending and receiving). Thus, groups define a scope for process names in point-to-point communication. In addition, groups define the scope of collective operations. Groups may be manipulated separately from communicators in MPI, but only communicators can be used in communication operations.

Intra-communicators. The most commonly used means for message passing in MPI is via intra-communicators. Intra-communicators contain an instance of a group, contexts of communication for both point-to-point and collective communication, and the ability to include virtual topology and other attributes. These features work as follows:

• Contexts provide the ability to have separate safe "universes" of message-passing in MPI. A context is akin to an additional tag that differentiates messages. The system manages this differentiation process. The use of separate communication contexts by distinct libraries (or distinct library invocations) insulates communication internal to the library execution from external communication. This allows the invocation of the library even if there are pending communications on "other" communicators, and avoids the need to synchronize entry or exit into library code. Pending point-to-point communications are also guaranteed not to interfere with collective communications within a single communicator.

• Groups define the participants in the communication (see above) of a communicator.

• A virtual topology defines a special mapping of the ranks in a group to and from a topology. Special constructors for communicators are defined in Chapter 7 to provide this feature. Intra-communicators as described in this chapter do not have topologies.

• Attributes define the local information that the user or library has added to a communicator for later reference.

Advice to users. The practice in many communication libraries is that there is a unique, predefined communication universe that includes all processes available when the parallel program is initiated; the processes are assigned consecutive ranks. Participants in a point-to-point communication are identified by their rank; a collective communication (such as broadcast) always involves all processes. This practice can be followed in MPI by using the predefined communicator MPI_COMM_WORLD. Users who are satisfied with this practice can plug in MPI_COMM_WORLD wherever a communicator argument is required, and can consequently disregard the rest of this chapter. (End of advice to users.)

Inter-communicators. The discussion has dealt so far with intra-communication: communication within a group. MPI also supports inter-communication: communication between two non-overlapping groups. When an application is built by composing several parallel modules, it is convenient to allow one module to communicate with another using local ranks for addressing within the second module. This is especially convenient in a client-server computing paradigm, where either client or server are parallel. The support of inter-communication also provides a mechanism for the extension of MPI to a dynamic model where not all processes are preallocated at initialization time. In such a situation, it becomes necessary to support communication across "universes." Inter-communication is supported by objects called inter-communicators. These objects bind two groups together with communication contexts shared by both groups. For inter-communicators, these features work as follows:

• Contexts provide the ability to have a separate safe "universe" of message-passing between the two groups. A send in the local group is always a receive in the remote group, and vice versa. The system manages this differentiation process. The use of separate communication contexts by distinct libraries (or distinct library invocations) insulates communication internal to the library execution from external communication. This allows the invocation of the library even if there are pending communications on "other" communicators, and avoids the need to synchronize entry or exit into library code.

• A local and remote group specify the recipients and destinations for an inter-communicator.

• Virtual topology is undefined for an inter-communicator.

• As before, attribute caching defines the local information that the user or library has added to a communicator for later reference.

MPI provides mechanisms for creating and manipulating inter-communicators. They are used for point-to-point and collective communication in a related manner to intra-communicators. Users who do not need inter-communication in their applications can safely

ignore this extension. Users who require inter-communication between overlapping groups must layer this capability on top of MPI.

6.2 Basic Concepts

In this section, we turn to a more formal definition of the concepts introduced above.

6.2.1 Groups

A group is an ordered set of process identifiers (henceforth processes); processes are implementation-dependent objects. Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero. Groups are represented by opaque group objects, and hence cannot be directly transferred from one process to another. A group is used within a communicator to describe the participants in a communication "universe" and to rank such participants (thus giving them unique names within that "universe" of communication).

There is a special pre-defined group: MPI_GROUP_EMPTY, which is a group with no members. The predefined constant MPI_GROUP_NULL is the value used for invalid group handles.

Advice to users. MPI_GROUP_EMPTY, which is a valid handle to an empty group, should not be confused with MPI_GROUP_NULL, which in turn is an invalid handle. The former may be used as an argument to group operations; the latter, which is returned when a group is freed, is not a valid argument. (End of advice to users.)

Advice to implementors. A group may be represented by a virtual-to-real process-address-translation table. Each communicator object (see below) would have a pointer to such a table.

Simple implementations of MPI will enumerate groups, such as in a table. However, more advanced data structures make sense in order to improve scalability and memory usage with large numbers of processes. Such implementations are possible with MPI. (End of advice to implementors.)

6.2.2 Contexts

A context is a property of communicators (defined next) that allows partitioning of the communication space. A message sent in one context cannot be received in another context. Furthermore, where permitted, collective operations are independent of pending point-to-point operations. Contexts are not explicit MPI objects; they appear only as part of the realization of communicators (below).

Advice to implementors. Distinct communicators in the same process have distinct contexts. A context is essentially a system-managed tag (or tags) needed to make a communicator safe for point-to-point and MPI-defined collective communication. Safety means that collective and point-to-point communication within one communicator do not interfere, and that communication over distinct communicators does not interfere.

A possible implementation for a context is as a supplemental tag attached to messages on send and matched on receive. Each intra-communicator stores the value of its two tags (one for point-to-point and one for collective communication). Communicator-generating functions use a collective communication to agree on a new group-wide unique context.

Analogously, in inter-communication, two context tags are stored per communicator, one used by group A to send and group B to receive, and a second used by group B to send and for group A to receive.

Since contexts are not explicit objects, other implementations are also possible. (End of advice to implementors.)

6.2.3 Intra-Communicators

Intra-communicators bring together the concepts of group and context. To support implementation-specific optimizations, and application topologies (defined in the next chapter, Chapter 7), communicators may also "cache" additional information (see Section 6.7). MPI communication operations reference communicators to determine the scope and the "communication universe" in which a point-to-point or collective operation is to operate.

Each communicator contains a group of valid participants; this group always includes the local process. The source and destination of a message is identified by process rank within that group.

For collective communication, the intra-communicator specifies the set of processes that participate in the collective operation (and their order, when significant). Thus, the communicator restricts the "spatial" scope of communication, and provides machine-independent process addressing through ranks.

Intra-communicators are represented by opaque intra-communicator objects, and hence cannot be directly transferred from one process to another.

6.2.4 Predefined Intra-Communicators

An initial intra-communicator MPI_COMM_WORLD of all processes the local process can communicate with after initialization (itself included) is defined once MPI_INIT or MPI_INIT_THREAD has been called. In addition, the communicator MPI_COMM_SELF is provided, which includes only the process itself.

The predefined constant MPI_COMM_NULL is the value used for invalid communicator handles.

In a static-process-model implementation of MPI, all processes that participate in the computation are available after MPI is initialized. For this case, MPI_COMM_WORLD is a communicator of all processes available for the computation; this communicator has the same value in all processes. In an implementation of MPI where processes can dynamically join an MPI execution, it may be the case that a process starts an MPI computation without having access to all other processes. In such situations, MPI_COMM_WORLD is a communicator incorporating all processes with which the joining process can immediately communicate. Therefore, MPI_COMM_WORLD may simultaneously represent disjoint groups in different processes.

All MPI implementations are required to provide the MPI_COMM_WORLD communicator. It cannot be deallocated during the life of a process. The group corresponding to this communicator does not appear as a pre-defined constant, but it may be accessed using

MPI_COMM_GROUP (see below). MPI does not specify the correspondence between the process rank in MPI_COMM_WORLD and its (machine-dependent) absolute address. Neither does MPI specify the function of the host process, if any. Other implementation-dependent, predefined communicators may also be provided.

6.3 Group Management

This section describes the manipulation of process groups in MPI. These operations are local and their execution does not require interprocess communication.

6.3.1 Group Accessors

MPI_GROUP_SIZE(group, size)
    IN   group    group (handle)
    OUT  size     number of processes in the group (integer)

int MPI_Group_size(MPI_Group group, int *size)

MPI_Group_size(group, size, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_SIZE(GROUP, SIZE, IERROR)
    INTEGER GROUP, SIZE, IERROR

MPI_GROUP_RANK(group, rank)
    IN   group    group (handle)
    OUT  rank     rank of the calling process in group, or MPI_UNDEFINED if the process is not a member (integer)

int MPI_Group_rank(MPI_Group group, int *rank)

MPI_Group_rank(group, rank, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(OUT) :: rank
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_RANK(GROUP, RANK, IERROR)
    INTEGER GROUP, RANK, IERROR
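The following fragment is not part of the standard text; it is a minimal sketch showing the two accessors above used together with MPI_COMM_GROUP (defined in Section 6.3.2): it obtains the group of MPI_COMM_WORLD, queries its size and the calling process's rank in it, and then frees the group handle. All of these calls are local.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Group world_group;
    int grp_size, grp_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);  /* local */
    MPI_Group_size(world_group, &grp_size);        /* local */
    MPI_Group_rank(world_group, &grp_rank);        /* local */
    (void)printf("rank %d in a group of %d processes\n", grp_rank, grp_size);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}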

MPI_GROUP_TRANSLATE_RANKS(group1, n, ranks1, group2, ranks2)
    IN   group1   group1 (handle)
    IN   n        number of ranks in ranks1 and ranks2 arrays (integer)
    IN   ranks1   array of zero or more valid ranks in group1
    IN   group2   group2 (handle)
    OUT  ranks2   array of corresponding ranks in group2, MPI_UNDEFINED when no correspondence exists

int MPI_Group_translate_ranks(MPI_Group group1, int n, const int ranks1[],
              MPI_Group group2, int ranks2[])

MPI_Group_translate_ranks(group1, n, ranks1, group2, ranks2, ierror)
    BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group1, group2
    INTEGER, INTENT(IN) :: n, ranks1(n)
    INTEGER, INTENT(OUT) :: ranks2(n)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)
    INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR

This function is important for determining the relative numbering of the same processes in two different groups. For instance, if one knows the ranks of certain processes in the group of MPI_COMM_WORLD, one might want to know their ranks in a subset of that group. MPI_PROC_NULL is a valid rank for input to MPI_GROUP_TRANSLATE_RANKS, which returns MPI_PROC_NULL as the translated rank.

MPI_GROUP_COMPARE(group1, group2, result)
    IN   group1   first group (handle)
    IN   group2   second group (handle)
    OUT  result   result (integer)

int MPI_Group_compare(MPI_Group group1, MPI_Group group2, int *result)

MPI_Group_compare(group1, group2, result, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group1, group2
    INTEGER, INTENT(OUT) :: result
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERROR)
    INTEGER GROUP1, GROUP2, RESULT, IERROR

MPI_IDENT results if the group members and group order are exactly the same in both groups. This happens for instance if group1 and group2 are the same handle. MPI_SIMILAR results if the group members are the same but the order is different. MPI_UNEQUAL results otherwise.
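As an illustration (not taken from the standard text), the following C fragment builds a four-member group in reverse rank order, compares it with the group of MPI_COMM_WORLD, and translates ranks between the two groups. It uses MPI_GROUP_INCL, which is defined later in this section, and assumes that MPI_COMM_WORLD contains at least four processes.

MPI_Group world_group, rev4_group;
int ranks[4] = {3, 2, 1, 0};
int query[4] = {0, 1, 2, 3};
int where[4];
int result;

MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, 4, ranks, &rev4_group);  /* see Section 6.3.2 */

/* Same members but different order when MPI_COMM_WORLD has exactly four
   processes (MPI_SIMILAR); different membership otherwise (MPI_UNEQUAL). */
MPI_Group_compare(world_group, rev4_group, &result);

/* where[] becomes {3, 2, 1, 0}: world rank i is rank 3-i in rev4_group. */
MPI_Group_translate_ranks(world_group, 4, query, rev4_group, where);

MPI_Group_free(&rev4_group);
MPI_Group_free(&world_group);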

6.3.2 Group Constructors

Group constructors are used to subset and superset existing groups. These constructors construct new groups from existing groups. These are local operations, and distinct groups may be defined on different processes; a process may also define a group that does not include itself. Consistent definitions are required when groups are used as arguments in communicator-building functions. MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups. The base group, upon which all other groups are defined, is the group associated with the initial communicator MPI_COMM_WORLD (accessible through the function MPI_COMM_GROUP).

Rationale. In what follows, there is no group duplication function analogous to MPI_COMM_DUP, defined later in this chapter. There is no need for a group duplicator. A group, once created, can have several references to it by making copies of the handle. The following constructors address the need for subsets and supersets of existing groups. (End of rationale.)

Advice to implementors. Each group constructor behaves as if it returned a new group object. When this new group is a copy of an existing group, then one can avoid creating such new objects, using a reference-count mechanism. (End of advice to implementors.)

MPI_COMM_GROUP(comm, group)
    IN   comm    communicator (handle)
    OUT  group   group corresponding to comm (handle)

int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)

MPI_Comm_group(comm, group, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Group), INTENT(OUT) :: group
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

MPI_COMM_GROUP returns in group a handle to the group of comm.

MPI_GROUP_UNION(group1, group2, newgroup)
    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   union group (handle)

int MPI_Group_union(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

MPI_Group_union(group1, group2, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group1, group2
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_INTERSECTION(group1, group2, newgroup)
    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   intersection group (handle)

int MPI_Group_intersection(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

MPI_Group_intersection(group1, group2, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group1, group2
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

MPI_GROUP_DIFFERENCE(group1, group2, newgroup)
    IN   group1     first group (handle)
    IN   group2     second group (handle)
    OUT  newgroup   difference group (handle)

int MPI_Group_difference(MPI_Group group1, MPI_Group group2,
              MPI_Group *newgroup)

MPI_Group_difference(group1, group2, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group1, group2
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERROR)
    INTEGER GROUP1, GROUP2, NEWGROUP, IERROR

The set-like operations are defined as follows:

union        All elements of the first group (group1), followed by all elements of the second group (group2) not in the first group.

intersect    All elements of the first group that are also in the second group, ordered as in the first group.

difference   All elements of the first group that are not in the second group, ordered as in the first group.

Note that for these operations the order of processes in the output group is determined primarily by order in the first group (if possible) and then, if necessary, by order in the second group. Neither union nor intersection are commutative, but both are associative.

The new group can be empty, that is, equal to MPI_GROUP_EMPTY.

MPI_GROUP_INCL(group, n, ranks, newgroup)
    IN   group      group (handle)
    IN   n          number of elements in array ranks (and size of newgroup) (integer)
    IN   ranks      ranks of processes in group to appear in newgroup (array of integers)
    OUT  newgroup   new group derived from above, in the order defined by ranks (handle)

int MPI_Group_incl(MPI_Group group, int n, const int ranks[],
              MPI_Group *newgroup)

MPI_Group_incl(group, n, ranks, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(IN) :: n, ranks(n)
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_INCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

The function MPI_GROUP_INCL creates a group newgroup that consists of the n processes in group with ranks ranks[0], ..., ranks[n-1]; the process with rank i in newgroup is the process with rank ranks[i] in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct, or else the program is erroneous. If n = 0, then newgroup is MPI_GROUP_EMPTY. This function can, for instance, be used to reorder the elements of a group. See also MPI_GROUP_COMPARE.

MPI_GROUP_EXCL(group, n, ranks, newgroup)
    IN   group      group (handle)
    IN   n          number of elements in array ranks (integer)
    IN   ranks      array of integer ranks in group not to appear in newgroup
    OUT  newgroup   new group derived from above, preserving the order defined by group (handle)

int MPI_Group_excl(MPI_Group group, int n, const int ranks[],
              MPI_Group *newgroup)

MPI_Group_excl(group, n, ranks, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(IN) :: n, ranks(n)
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_EXCL(GROUP, N, RANKS, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR

The function MPI_GROUP_EXCL creates a group of processes newgroup that is obtained by deleting from group those processes with ranks ranks[0], ..., ranks[n-1]. The ordering of processes in newgroup is identical to the ordering in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct; otherwise, the program is erroneous. If n = 0, then newgroup is identical to group.

MPI_GROUP_RANGE_INCL(group, n, ranges, newgroup)
    IN   group      group (handle)
    IN   n          number of triplets in array ranges (integer)
    IN   ranges     a one-dimensional array of integer triplets, of the form (first rank, last rank, stride) indicating ranks in group of processes to be included in newgroup
    OUT  newgroup   new group derived from above, in the order defined by ranges (handle)

int MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3],
              MPI_Group *newgroup)

MPI_Group_range_incl(group, n, ranges, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(IN) :: n, ranges(3,n)
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_RANGE_INCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

If ranges consists of the triplets

    (first_1, last_1, stride_1), ..., (first_n, last_n, stride_n)

then newgroup consists of the sequence of processes in group with ranks

    first_1, first_1 + stride_1, ..., first_1 + floor((last_1 - first_1) / stride_1) * stride_1, ...,
    first_n, first_n + stride_n, ..., first_n + floor((last_n - first_n) / stride_n) * stride_n.

Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the program is erroneous. Note that we may have first_i > last_i, and stride_i may be negative, but cannot be zero.

The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the included ranks and passing the resulting array of ranks and other arguments to MPI_GROUP_INCL. A call to MPI_GROUP_INCL is equivalent to a call to MPI_GROUP_RANGE_INCL with each rank i in ranks replaced by the triplet (i,i,1) in the argument ranges.

MPI_GROUP_RANGE_EXCL(group, n, ranges, newgroup)
    IN   group      group (handle)
    IN   n          number of elements in array ranges (integer)
    IN   ranges     a one-dimensional array of integer triplets of the form (first rank, last rank, stride), indicating the ranks in group of processes to be excluded from the output group newgroup
    OUT  newgroup   new group derived from above, preserving the order in group (handle)

int MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3],
              MPI_Group *newgroup)

MPI_Group_range_excl(group, n, ranges, newgroup, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(IN) :: n, ranges(3,n)
    TYPE(MPI_Group), INTENT(OUT) :: newgroup
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_RANGE_EXCL(GROUP, N, RANGES, NEWGROUP, IERROR)
    INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR

Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the program is erroneous.

The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the excluded ranks and passing the resulting array of ranks and other arguments to MPI_GROUP_EXCL. A call to MPI_GROUP_EXCL is equivalent to a call to MPI_GROUP_RANGE_EXCL with each rank i in ranks replaced by the triplet (i,i,1) in the argument ranges.

Advice to users. The range operations do not explicitly enumerate ranks, and therefore are more scalable if implemented efficiently. Hence, we recommend MPI programmers to use them whenever possible, as high-quality implementations will take advantage of this fact. (End of advice to users.)

Advice to implementors. The range operations should be implemented, if possible, without enumerating the group members, in order to obtain better scalability (time and space). (End of advice to implementors.)
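As a brief illustration (not part of the standard text), the fragment below uses a single (first, last, stride) triplet to build the group of even-ranked processes of MPI_COMM_WORLD; the same triplet passed to MPI_GROUP_RANGE_EXCL would instead select the odd-ranked processes.

MPI_Group world_group, even_group;
int size;
int ranges[1][3];

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

ranges[0][0] = 0;        /* first rank */
ranges[0][1] = size - 1; /* last rank  */
ranges[0][2] = 2;        /* stride     */
MPI_Group_range_incl(world_group, 1, ranges, &even_group);

/* even_group now contains the processes with ranks 0, 2, 4, ... of
   world_group, without an explicit enumeration of those ranks. */
MPI_Group_free(&even_group);
MPI_Group_free(&world_group);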

6.3.3 Group Destructors

MPI_GROUP_FREE(group)
    INOUT  group   group (handle)

int MPI_Group_free(MPI_Group *group)

MPI_Group_free(group, ierror) BIND(C)
    TYPE(MPI_Group), INTENT(INOUT) :: group
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GROUP_FREE(GROUP, IERROR)
    INTEGER GROUP, IERROR

This operation marks a group object for deallocation. The handle group is set to MPI_GROUP_NULL by the call. Any on-going operation using this group will complete normally.

Advice to implementors. One can keep a reference count that is incremented for each call to MPI_COMM_GROUP, MPI_COMM_CREATE, MPI_COMM_DUP, and MPI_COMM_IDUP, and decremented for each call to MPI_GROUP_FREE or MPI_COMM_FREE; the group object is ultimately deallocated when the reference count drops to zero. (End of advice to implementors.)

6.4 Communicator Management

This section describes the manipulation of communicators in MPI. Operations that access communicators are local and their execution does not require interprocess communication. Operations that create communicators are collective and may require interprocess communication.

Advice to implementors. High-quality implementations should amortize the overheads associated with the creation of communicators (for the same group, or subsets thereof) over several calls, by allocating multiple contexts with one collective communication. (End of advice to implementors.)

6.4.1 Communicator Accessors

The following are all local operations.

MPI_COMM_SIZE(comm, size)
    IN   comm   communicator (handle)
    OUT  size   number of processes in the group of comm (integer)

int MPI_Comm_size(MPI_Comm comm, int *size)

MPI_Comm_size(comm, size, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

Rationale. This function is equivalent to accessing the communicator's group with MPI_COMM_GROUP (see above), computing the size using MPI_GROUP_SIZE, and then freeing the temporary group via MPI_GROUP_FREE. However, this function is so commonly used that this shortcut was introduced. (End of rationale.)

Advice to users. This function indicates the number of processes involved in a communicator. For MPI_COMM_WORLD, it indicates the total number of processes available unless the number of processes has been changed by using the functions described in Chapter 10; note that the number of processes in MPI_COMM_WORLD does not change during the life of an MPI program.

This call is often used with the next call to determine the amount of concurrency available for a specific library or program. The following call, MPI_COMM_RANK, indicates the rank of the process that calls it in the range from 0 ... size-1, where size is the return value of MPI_COMM_SIZE. (End of advice to users.)

MPI_COMM_RANK(comm, rank)
    IN   comm   communicator (handle)
    OUT  rank   rank of the calling process in group of comm (integer)

int MPI_Comm_rank(MPI_Comm comm, int *rank)

MPI_Comm_rank(comm, rank, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(OUT) :: rank
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_RANK(COMM, RANK, IERROR)
    INTEGER COMM, RANK, IERROR

Rationale. This function is equivalent to accessing the communicator's group with MPI_COMM_GROUP (see above), computing the rank using MPI_GROUP_RANK, and then freeing the temporary group via MPI_GROUP_FREE. However, this function is so commonly used that this shortcut was introduced. (End of rationale.)

Advice to users. This function gives the rank of the process in the particular communicator's group. It is useful, as noted above, in conjunction with MPI_COMM_SIZE.

Many programs will be written with the master-slave model, where one process (such as the rank-zero process) will play a supervisory role, and the other processes will serve as compute nodes. In this framework, the two preceding calls are useful for

determining the roles of the various processes of a communicator. (End of advice to users.)

MPI_COMM_COMPARE(comm1, comm2, result)
    IN   comm1    first communicator (handle)
    IN   comm2    second communicator (handle)
    OUT  result   result (integer)

int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int *result)

MPI_Comm_compare(comm1, comm2, result, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm1, comm2
    INTEGER, INTENT(OUT) :: result
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERROR)
    INTEGER COMM1, COMM2, RESULT, IERROR

MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (identical groups and same contexts). MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order; these communicators differ only by context. MPI_SIMILAR results if the group members of both communicators are the same but the rank order differs. MPI_UNEQUAL results otherwise.

6.4.2 Communicator Constructors

The following are collective functions that are invoked by all processes in the group or groups associated with comm, with the exception of MPI_COMM_CREATE_GROUP, which is invoked only by the processes in the group of the new communicator being constructed.

Rationale. Note that there is a chicken-and-egg aspect to MPI in that a communicator is needed to create a new communicator. The base communicator for all MPI communicators is predefined outside of MPI, and is MPI_COMM_WORLD. This model was arrived at after considerable debate, and was chosen to increase "safety" of programs written in MPI. (End of rationale.)

This chapter presents the following communicator construction routines: MPI_COMM_CREATE, MPI_COMM_DUP, MPI_COMM_IDUP, MPI_COMM_DUP_WITH_INFO, and MPI_COMM_SPLIT can be used to create both intracommunicators and intercommunicators; MPI_COMM_CREATE_GROUP and MPI_INTERCOMM_MERGE (see Section 6.6.2) can be used to create intracommunicators; and MPI_INTERCOMM_CREATE (see Section 6.6.2) can be used to create intercommunicators.

An intracommunicator involves a single group while an intercommunicator involves two groups. Where the following discussions address intercommunicator semantics, the two groups in an intercommunicator are called the left and right groups. A process in an intercommunicator is a member of either the left or the right group. From the point of view

of that process, the group that the process is a member of is called the local group; the other group (relative to that process) is the remote group. The left and right group labels give us a way to describe the two groups in an intercommunicator that is not relative to any particular process (as the local and remote groups are).

MPI_COMM_DUP(comm, newcomm)
    IN   comm      communicator (handle)
    OUT  newcomm   copy of comm (handle)

int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)

MPI_Comm_dup(comm, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_DUP(COMM, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR

MPI_COMM_DUP duplicates the existing communicator comm with associated key values, topology information, and info hints. For each key value, the respective copy callback function determines the attribute value associated with this key in the new communicator; one particular action that a copy callback may take is to delete the attribute from the new communicator. Returns in newcomm a new communicator with the same group or groups, same topology, same info hints, any copied cached information, but a new context (see Section 6.7.1).

Advice to users. This operation is used to provide a parallel library with a duplicate communication space that has the same properties as the original communicator. This includes any attributes (see below), topologies (see Chapter 7), and associated info hints (see Section 6.4.4). This call is valid even if there are pending point-to-point communications involving the communicator comm. A typical call might involve an MPI_COMM_DUP at the beginning of the parallel call, and an MPI_COMM_FREE of that duplicated communicator at the end of the call. Other models of communicator management are also possible.

This call applies to both intra- and inter-communicators. (End of advice to users.)

Advice to implementors. One need not actually copy the group information, but only add a new reference and increment the reference count. Copy on write can be used for the cached information. (End of advice to implementors.)

MPI_COMM_DUP_WITH_INFO(comm, info, newcomm)
    IN   comm      communicator (handle)
    IN   info      info object (handle)
    OUT  newcomm   copy of comm (handle)

int MPI_Comm_dup_with_info(MPI_Comm comm, MPI_Info info, MPI_Comm *newcomm)

MPI_Comm_dup_with_info(comm, info, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_DUP_WITH_INFO(COMM, INFO, NEWCOMM, IERROR)
    INTEGER COMM, INFO, NEWCOMM, IERROR

MPI_COMM_DUP_WITH_INFO behaves exactly as MPI_COMM_DUP except that the info hints associated with the communicator comm are not duplicated in newcomm. The hints provided by the argument info are associated with the output communicator newcomm instead.

Rationale. It is expected that some hints will only be valid at communicator creation time. However, for legacy reasons, most communicator creation calls do not provide an info argument. One may associate info hints with a duplicate of any communicator at creation time through a call to MPI_COMM_DUP_WITH_INFO. (End of rationale.)

MPI_COMM_IDUP(comm, newcomm, request)
    IN   comm      communicator (handle)
    OUT  newcomm   copy of comm (handle)
    OUT  request   communication request (handle)

int MPI_Comm_idup(MPI_Comm comm, MPI_Comm *newcomm, MPI_Request *request)

MPI_Comm_idup(comm, newcomm, request, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    TYPE(MPI_Request), INTENT(OUT) :: request
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_IDUP(COMM, NEWCOMM, REQUEST, IERROR)
    INTEGER COMM, NEWCOMM, REQUEST, IERROR

MPI_COMM_IDUP is a nonblocking variant of MPI_COMM_DUP. The semantics of MPI_COMM_IDUP are as if MPI_COMM_DUP was executed at the time that MPI_COMM_IDUP is called. For example, attributes changed after MPI_COMM_IDUP will not be copied to the new communicator. All restrictions and assumptions for nonblocking collective operations (see Section 5.12) apply to MPI_COMM_IDUP and the returned request.

It is erroneous to use the communicator newcomm as an input argument to other MPI functions before the MPI_COMM_IDUP operation completes.

Rationale. This functionality is crucial for the development of purely nonblocking libraries (see [36]). (End of rationale.)
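The following sketch (not from the standard text; the library names are hypothetical) shows the usage pattern described in the advice to users for MPI_COMM_DUP above: a library duplicates the user's communicator once at initialization so that its internal traffic proceeds in a private context, and frees the duplicate at finalization.

static MPI_Comm mylib_comm = MPI_COMM_NULL;   /* hypothetical library state */

void mylib_init(MPI_Comm user_comm)
{
    /* All communication internal to the library uses mylib_comm, so it
       cannot match pending point-to-point messages of the caller. */
    MPI_Comm_dup(user_comm, &mylib_comm);
}

void mylib_finalize(void)
{
    MPI_Comm_free(&mylib_comm);   /* sets mylib_comm to MPI_COMM_NULL */
}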

MPI_COMM_CREATE(comm, group, newcomm)
    IN   comm      communicator (handle)
    IN   group     group, which is a subset of the group of comm (handle)
    OUT  newcomm   new communicator (handle)

int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

MPI_Comm_create(comm, group, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Group), INTENT(IN) :: group
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
    INTEGER COMM, GROUP, NEWCOMM, IERROR

If comm is an intracommunicator, this function returns a new communicator newcomm with communication group defined by the group argument. No cached information propagates from comm to newcomm. Each process must call MPI_COMM_CREATE with a group argument that is a subgroup of the group associated with comm; this could be MPI_GROUP_EMPTY. The processes may specify different values for the group argument. If a process calls with a non-empty group then all processes in that group must call the function with the same group as argument, that is the same processes in the same order. Otherwise, the call is erroneous. This implies that the set of groups specified across the processes must be disjoint. If the calling process is a member of the group given as group argument, then newcomm is a communicator with group as its associated group. In the case that a process calls with a group to which it does not belong, e.g., MPI_GROUP_EMPTY, then MPI_COMM_NULL is returned as newcomm. The function is collective and must be called by all processes in the group of comm.

Rationale. The interface supports the original mechanism from MPI-1.1, which required the same group in all processes of comm. It was extended in MPI-2.2 to allow the use of disjoint subgroups in order to allow implementations to eliminate unnecessary communication that MPI_COMM_SPLIT would incur when the user already knows the membership of the disjoint subgroups. (End of rationale.)

Rationale. The requirement that the entire group of comm participate in the call stems from the following considerations:

• It allows the implementation to layer MPI_COMM_CREATE on top of regular collective communications.

• It provides additional safety, in particular in the case where partially overlapping groups are used to create new communicators.

• It permits implementations to sometimes avoid communication related to context creation.

(End of rationale.)
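The intracommunicator semantics above can be illustrated with the following sketch (not part of the standard text), in which every process of MPI_COMM_WORLD passes the group of even-ranked processes; the even ranks obtain a new communicator, while the odd ranks, which are not members of that group, receive MPI_COMM_NULL.

MPI_Group world_group, even_group;
MPI_Comm  even_comm;
int size, ranges[1][3];

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
ranges[0][0] = 0;  ranges[0][1] = size - 1;  ranges[0][2] = 2;
MPI_Group_range_incl(world_group, 1, ranges, &even_group);

/* Collective over MPI_COMM_WORLD; all processes pass the same group. */
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

if (even_comm != MPI_COMM_NULL)
{
    /* ... communication restricted to the even-ranked processes ... */
    MPI_Comm_free(&even_comm);
}
MPI_Group_free(&even_group);
MPI_Group_free(&world_group);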

Advice to users. MPI_COMM_CREATE provides a means to subset a group of processes for the purpose of separate MIMD computation, with separate communication space. newcomm, which emerges from MPI_COMM_CREATE, can be used in subsequent calls to MPI_COMM_CREATE (or other communicator constructors) to further subdivide a computation into parallel sub-computations. A more general service is provided by MPI_COMM_SPLIT, below. (End of advice to users.)

Advice to implementors. When calling MPI_COMM_DUP, all processes call with the same group (the group associated with the communicator). When calling MPI_COMM_CREATE, the processes provide the same group or disjoint subgroups. For both calls, it is theoretically possible to agree on a group-wide unique context with no communication. However, local execution of these functions requires use of a larger context name space and reduces error checking. Implementations may strike various compromises between these conflicting goals, such as bulk allocation of multiple contexts in one collective operation.

Important: If new communicators are created without synchronizing the processes involved then the communication system must be able to cope with messages arriving in a context that has not yet been allocated at the receiving process. (End of advice to implementors.)

If comm is an intercommunicator, then the output communicator is also an intercommunicator where the local group consists only of those processes contained in group (see Figure 6.1). The group argument should only contain those processes in the local group of the input intercommunicator that are to be a part of newcomm. All processes in the same local group of comm must specify the same value for group, i.e., the same members in the same order. If either group does not specify at least one process in the local group of the intercommunicator, or if the calling process is not included in the group, MPI_COMM_NULL is returned.

Rationale. In the case where either the left or right group is empty, a null communicator is returned instead of an intercommunicator with MPI_GROUP_EMPTY because the side with the empty group must return MPI_COMM_NULL. (End of rationale.)

Example 6.1 The following example illustrates how the first node in the left side of an intercommunicator could be joined with all members on the right side of an intercommunicator to form a new intercommunicator.

MPI_Comm inter_comm, new_inter_comm;
MPI_Group local_group, group;
int rank = 0; /* rank on left side to include in
                 new inter-comm */

/* Construct the original intercommunicator: "inter_comm" */
...

/* Construct the group of processes to be in new
   intercommunicator */
if (/* I'm on the left side of the intercommunicator */) {

    MPI_Comm_group ( inter_comm, &local_group );
    MPI_Group_incl ( local_group, 1, &rank, &group );
    MPI_Group_free ( &local_group );
}
else
    MPI_Comm_group ( inter_comm, &group );

MPI_Comm_create ( inter_comm, group, &new_inter_comm );
MPI_Group_free( &group );

[Figure 6.1: Intercommunicator creation using MPI_COMM_CREATE extended to intercommunicators. The input groups are those in the grey circle.]

MPI_COMM_CREATE_GROUP(comm, group, tag, newcomm)
    IN   comm      intracommunicator (handle)
    IN   group     group, which is a subset of the group of comm (handle)
    IN   tag       tag (integer)
    OUT  newcomm   new communicator (handle)

int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag,
              MPI_Comm *newcomm)

MPI_Comm_create_group(comm, group, tag, newcomm, ierror)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Group), INTENT(IN) :: group
    INTEGER, INTENT(IN) :: tag
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_CREATE_GROUP(COMM, GROUP, TAG, NEWCOMM, IERROR)
    INTEGER COMM, GROUP, TAG, NEWCOMM, IERROR

MPI_COMM_CREATE_GROUP is similar to MPI_COMM_CREATE; however, MPI_COMM_CREATE must be called by all processes in the group of comm, whereas MPI_COMM_CREATE_GROUP must be called by all processes in group, which is a subgroup of the group of comm. In addition, MPI_COMM_CREATE_GROUP requires that comm is an intracommunicator. MPI_COMM_CREATE_GROUP returns a new intracommunicator, newcomm, for which the group argument defines the communication group. No cached information propagates from comm to newcomm. Each process must provide a group argument that is a subgroup of the group associated with comm; this could be MPI_GROUP_EMPTY. If a non-empty group is specified, then all processes in that group must call the function, and each of these processes must provide the same arguments, including a group that contains the same members with the same ordering. Otherwise the call is erroneous. If the calling process is a member of the group given as the group argument, then newcomm is a communicator with group as its associated group. If the calling process is not a member of group, e.g., group is MPI_GROUP_EMPTY, then the call is a local operation and MPI_COMM_NULL is returned as newcomm.

Rationale. Functionality similar to MPI_COMM_CREATE_GROUP can be implemented through repeated MPI_INTERCOMM_CREATE and MPI_INTERCOMM_MERGE calls that start with the MPI_COMM_SELF communicators at each process in group and build up an intracommunicator with group group [16]. Such an algorithm requires the creation of many intermediate communicators; MPI_COMM_CREATE_GROUP can provide a more efficient implementation that avoids this overhead. (End of rationale.)

Advice to users. An intercommunicator can be created collectively over processes in the union of the local and remote groups by creating the local communicator using MPI_COMM_CREATE_GROUP and using that communicator as the local communicator argument to MPI_INTERCOMM_CREATE. (End of advice to users.)

The tag argument does not conflict with tags used in point-to-point communication and is not permitted to be a wildcard. If multiple threads at a given process perform concurrent MPI_COMM_CREATE_GROUP operations, the user must distinguish these operations by providing different tag or comm arguments.

Advice to users. MPI_COMM_CREATE may provide lower overhead than MPI_COMM_CREATE_GROUP because it can take advantage of collective communication on comm when constructing newcomm. (End of advice to users.)
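A minimal sketch (not from the standard text) of MPI_COMM_CREATE_GROUP: the lower half of MPI_COMM_WORLD builds a communicator without requiring any call from the upper half. The tag value 0 is an arbitrary application choice, and at least two processes are assumed.

MPI_Group world_group, lower_group;
MPI_Comm  lower_comm = MPI_COMM_NULL;
int size, rank, ranges[1][3];

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

ranges[0][0] = 0;  ranges[0][1] = size/2 - 1;  ranges[0][2] = 1;
MPI_Group_range_incl(world_group, 1, ranges, &lower_group);

/* Only the members of lower_group make this call. */
if (rank < size/2)
    MPI_Comm_create_group(MPI_COMM_WORLD, lower_group, 0, &lower_comm);

if (lower_comm != MPI_COMM_NULL)
    MPI_Comm_free(&lower_comm);
MPI_Group_free(&lower_group);
MPI_Group_free(&world_group);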

MPI_COMM_SPLIT(comm, color, key, newcomm)
    IN   comm      communicator (handle)
    IN   color     control of subset assignment (integer)
    IN   key       control of rank assignment (integer)
    OUT  newcomm   new communicator (handle)

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

MPI_Comm_split(comm, color, key, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: color, key
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
    INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR

This function partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all processes of the same color. Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communicator is created for each subgroup and returned in newcomm. A process may supply the color value MPI_UNDEFINED, in which case newcomm returns MPI_COMM_NULL. This is a collective call, but each process is permitted to provide different values for color and key.

With an intracommunicator comm, a call to MPI_COMM_CREATE(comm, group, newcomm) is equivalent to a call to MPI_COMM_SPLIT(comm, color, key, newcomm), where all processes that are members of their group argument provide color = number of the group (based on a unique numbering of all disjoint groups) and key = rank in group, and all processes that are not members of their group argument provide color = MPI_UNDEFINED. The value of color must be non-negative or MPI_UNDEFINED.

Advice to users. This is an extremely powerful mechanism for dividing a single communicating group of processes into k subgroups, with k chosen implicitly by the user (by the number of colors asserted over all the processes). Each resulting communicator will be non-overlapping. Such a division could be useful for defining a hierarchy of computations, such as for multigrid, or linear algebra. For intracommunicators, MPI_COMM_SPLIT provides similar capability as MPI_COMM_CREATE to split a communicating group into disjoint subgroups. MPI_COMM_SPLIT is useful when some processes do not have complete information of the other members in their group, but all processes know (the color of) the group to which they belong. In this case, the MPI implementation discovers the other group members via communication. MPI_COMM_CREATE is useful when all processes have complete information of the members of their group. In this case, MPI can avoid the extra communication required to discover group membership. MPI_COMM_CREATE_GROUP is useful when all processes in a given group have complete information of the members of their group and synchronization with processes outside the group can be avoided.

Multiple calls to MPI_COMM_SPLIT can be used to overcome the requirement that any call have no overlap of the resulting communicators (each process is of only one

color per call). In this way, multiple overlapping communication structures can be created. Creative use of the color and key in such splitting operations is encouraged.

Note that, for a fixed color, the keys need not be unique. It is MPI_COMM_SPLIT's responsibility to sort processes in ascending order according to this key, and to break ties in a consistent way. If all the keys are specified in the same way, then all the processes in a given color will have the same relative rank order as they did in their parent group.

Essentially, making the key value zero for all processes of a given color means that one does not really care about the rank-order of the processes in the new communicator. (End of advice to users.)

Rationale. color is restricted to be non-negative, so as not to conflict with the value assigned to MPI_UNDEFINED. (End of rationale.)

The result of MPI_COMM_SPLIT on an intercommunicator is that those processes on the left with the same color as those processes on the right combine to create a new intercommunicator. The key argument describes the relative rank of processes on each side of the intercommunicator (see Figure 6.2). For those colors that are specified only on one side of the intercommunicator, MPI_COMM_NULL is returned. MPI_COMM_NULL is also returned to those processes that specify MPI_UNDEFINED as the color.

Advice to users. For intercommunicators, MPI_COMM_SPLIT is more general than MPI_COMM_CREATE. A single call to MPI_COMM_SPLIT can create a set of disjoint intercommunicators, while a call to MPI_COMM_CREATE creates only one. (End of advice to users.)

Example 6.2 (Parallel client-server model). The following client code illustrates how clients on the left side of an intercommunicator could be assigned to a single server from a pool of servers on the right side of an intercommunicator.

/* Client code */
MPI_Comm multiple_server_comm;
MPI_Comm single_server_comm;
int color, rank, num_servers;

/* Create intercommunicator with clients and servers:
   multiple_server_comm */
...

/* Find out the number of servers available */
MPI_Comm_remote_size ( multiple_server_comm, &num_servers );

/* Determine my color */
MPI_Comm_rank ( multiple_server_comm, &rank );
color = rank % num_servers;

/* Split the intercommunicator */
MPI_Comm_split ( multiple_server_comm, color, rank,
                 &single_server_comm );

[Figure 6.2: Intercommunicator construction achieved by splitting an existing intercommunicator with MPI_COMM_SPLIT extended to intercommunicators. The figure shows the input intercommunicator (comm), with each process labeled by its rank in the original group, color, and key, and the disjoint output communicators (newcomm), one per color.]

The following is the corresponding server code:

/* Server code */
MPI_Comm multiple_client_comm;
MPI_Comm single_server_comm;
int rank;

/* Create intercommunicator with clients and servers:
   multiple_client_comm */
...

/* Split the intercommunicator for a single server per group
   of clients */
MPI_Comm_rank ( multiple_client_comm, &rank );
MPI_Comm_split ( multiple_client_comm, rank, 0,
                 &single_server_comm );

MPI_COMM_SPLIT_TYPE(comm, split_type, key, info, newcomm)
    IN   comm         communicator (handle)
    IN   split_type   type of processes to be grouped together (integer)
    IN   key          control of rank assignment (integer)
    IN   info         info argument (handle)
    OUT  newcomm      new communicator (handle)

int MPI_Comm_split_type(MPI_Comm comm, int split_type, int key,
              MPI_Info info, MPI_Comm *newcomm)

MPI_Comm_split_type(comm, split_type, key, info, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: split_type, key
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SPLIT_TYPE(COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR)
    INTEGER COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR

This function partitions the group associated with comm into disjoint subgroups, based on the type specified by split_type. Each subgroup contains all processes of the same type. Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communicator is created for each subgroup and returned in newcomm. This is a collective call; all processes must provide the same split_type, but each process is permitted to provide different values for key. An exception to this rule is that a process may supply the type value MPI_UNDEFINED, in which case newcomm returns MPI_COMM_NULL.

The following type is predefined by MPI:

MPI_COMM_TYPE_SHARED: this type splits the communicator into subcommunicators, each of which can create a shared memory region.

Advice to implementors. Implementations can define their own types, or use the info argument, to assist in creating communicators that help expose platform-specific information to the application. (End of advice to implementors.)

6.4.3 Communicator Destructors

MPI_COMM_FREE(comm)
    INOUT  comm   communicator to be destroyed (handle)

int MPI_Comm_free(MPI_Comm *comm)

MPI_Comm_free(comm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(INOUT) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_FREE(COMM, IERROR)
    INTEGER COMM, IERROR

This collective operation marks the communication object for deallocation. The handle is set to MPI_COMM_NULL. Any pending operations that use this communicator will complete normally; the object is actually deallocated only if there are no other active references to it. This call applies to intra- and inter-communicators. The delete callback functions for all cached attributes (see Section 6.7) are called in arbitrary order.

Advice to implementors. A reference-count mechanism may be used: the reference count is incremented by each call to MPI_COMM_DUP or MPI_COMM_IDUP, and decremented by each call to MPI_COMM_FREE. The object is ultimately deallocated when the count reaches zero.

Though collective, it is anticipated that this operation will normally be implemented to be local, though a debugging version of an MPI library might choose to synchronize. (End of advice to implementors.)

6.4.4 Communicator Info

Hints specified via info (see Chapter 9) allow a user to provide information to direct optimization. Providing hints may enable an implementation to deliver increased performance or minimize use of system resources. However, hints do not change the semantics of any MPI interfaces. In other words, an implementation is free to ignore all hints. Hints are specified on a per communicator basis, in MPI_COMM_DUP_WITH_INFO, MPI_COMM_SET_INFO, MPI_COMM_SPLIT_TYPE, MPI_DIST_GRAPH_CREATE_ADJACENT, and MPI_DIST_GRAPH_CREATE, via the opaque info object. When an info object that specifies a subset of valid hints is passed to MPI_COMM_SET_INFO, there will be no effect on previously set or defaulted hints that the info does not specify.
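As a brief illustration (not from the standard text), a hint can be attached at creation time by duplicating a communicator with MPI_COMM_DUP_WITH_INFO. The key "example_hint" is a made-up placeholder; this chapter defines no standard hint keys, and an implementation is free to ignore keys it does not recognize.

MPI_Info info;
MPI_Comm hinted_comm;

MPI_Info_create(&info);
MPI_Info_set(info, "example_hint", "true");   /* placeholder key/value */

MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &hinted_comm);
MPI_Info_free(&info);   /* the communicator retains its own copy of the hints */

/* ... use hinted_comm ... */
MPI_Comm_free(&hinted_comm);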

Advice to implementors. It may happen that a program is coded with hints for one system, and later executes on another system that does not support these hints. In general, unsupported hints should simply be ignored. Needless to say, no hint can be mandatory. However, for each hint used by a specific implementation, a default value must be provided when the user does not specify a value for this hint. (End of advice to implementors.)

Info hints are not propagated by MPI from one communicator to another except when the communicator is duplicated using MPI_COMM_DUP or MPI_COMM_IDUP. In this case, all hints associated with the original communicator are also applied to the duplicated communicator.

MPI_COMM_SET_INFO(comm, info)
    INOUT  comm   communicator (handle)
    IN     info   info object (handle)

int MPI_Comm_set_info(MPI_Comm comm, MPI_Info info)

MPI_Comm_set_info(comm, info, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(INOUT) :: comm
    TYPE(MPI_Info), INTENT(IN) :: info
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SET_INFO(COMM, INFO, IERROR)
    INTEGER COMM, INFO, IERROR

MPI_COMM_SET_INFO sets new values for the hints of the communicator associated with comm. MPI_COMM_SET_INFO is a collective routine. The info object may be different on each process, but any info entries that an implementation requires to be the same on all processes must appear with the same value in each process's info object.

Advice to users. Some info items that an implementation can use when it creates a communicator cannot easily be changed once the communicator has been created. Thus, an implementation may ignore hints issued in this call that it would have accepted in a creation call. (End of advice to users.)

MPI_COMM_GET_INFO(comm, info_used)
    IN   comm        communicator object (handle)
    OUT  info_used   new info object (handle)

int MPI_Comm_get_info(MPI_Comm comm, MPI_Info *info_used)

MPI_Comm_get_info(comm, info_used, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Info), INTENT(OUT) :: info_used
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GET_INFO(COMM, INFO_USED, IERROR)
    INTEGER COMM, INFO_USED, IERROR

MPI_COMM_GET_INFO returns a new info object containing the hints of the communicator associated with comm. The current setting of all hints actually used by the system related to this communicator is returned in info_used. If no such hints exist, a handle to a newly created info object is returned that contains no key/value pair. The user is responsible for freeing info_used via MPI_INFO_FREE.

Advice to users. The info object returned in info_used will contain all hints currently active for this communicator. This set of hints may be greater or smaller than the set of hints specified when the communicator was created, as the system may not recognize some hints set by the user, and may recognize other hints that the user has not set. (End of advice to users.)

6.5 Motivating Examples

6.5.1 Current Practice #1

Example #1a:

int main(int argc, char *argv[])
{
    int me, size;
    ...
    MPI_Init ( &argc, &argv );
    MPI_Comm_rank (MPI_COMM_WORLD, &me);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    (void)printf ("Process %d size %d\n", me, size);
    ...
    MPI_Finalize();
    return 0;
}

Example #1a is a do-nothing program that initializes itself, refers to the "all" communicator, prints a message, and then terminates. This example does not imply that MPI itself supports printf-like communication.

Example #1b (supposing that size is even):

int main(int argc, char *argv[])
{
    int me, size;
    int SOME_TAG = 0;
    ...
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &me);   /* local */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* local */

281 6.5. MOTIVATING EXAMPLES 251 1 2 if((me % 2) == 0) 3 { 4 /* send unless highest-numbered process */ 5 if((me + 1) < size) 6 MPI_Send(..., me + 1, SOME_TAG, MPI_COMM_WORLD); 7 } 8 else 9 MPI_Recv(..., me - 1, SOME_TAG, MPI_COMM_WORLD, &status); 10 11 ... 12 MPI_Finalize(); 13 return 0; 14 } 15 Example #1b schematically illustrates message exchanges between “even” and “odd” pro- 16 cesses in the “all” communicator. 17 18 19 Current Practice #2 6.5.2 20 int main(int argc, char *argv[]) 21 { 22 int me, count; 23 void *data; 24 ... 25 26 MPI_Init(&argc, &argv); 27 MPI_Comm_rank(MPI_COMM_WORLD, &me); 28 29 if(me == 0) 30 { 31 /* get input, create buffer ‘‘data’’ */ 32 ... 33 } 34 35 MPI_Bcast(data, count, MPI_BYTE, 0, MPI_COMM_WORLD); 36 37 ... 38 MPI_Finalize(); 39 return 0; 40 } 41 42 This example illustrates the use of a collective communication. 43 44 (Approximate) Current Practice #3 6.5.3 45 46 int main(int argc, char *argv[]) 47 { 48 int me, count, count2;

282 252 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 void *send_buf, *recv_buf, *send_buf2, *recv_buf2; 2 MPI_Group MPI_GROUP_WORLD, grprem; 3 MPI_Comm commslave; 4 static int ranks[] = {0}; 5 ... 6 MPI_Init(&argc, &argv); 7 MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD); 8 MPI_Comm_rank(MPI_COMM_WORLD, &me); /* local */ 9 10 MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem); /* local */ 11 MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave); 12 13 if(me != 0) 14 { 15 /* compute on slave */ 16 ... 17 MPI_Reduce(send_buf,recv_buf,count, MPI_INT, MPI_SUM, 1, commslave); 18 ... 19 MPI_Comm_free(&commslave); 20 } 21 /* zero falls through immediately to this reduce, others do later... */ 22 MPI_Reduce(send_buf2, recv_buf2, count2, 23 MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); 24 25 MPI_Group_free(&MPI_GROUP_WORLD); 26 MPI_Group_free(&grprem); 27 MPI_Finalize(); 28 return 0; 29 } 30 31 This example illustrates how a group consisting of all but the zeroth process of the “all” 32 ) for that new group. commslave group is created, and then how a communicator is formed ( 33 The new communicator is used in a collective call, and all processes execute a collective call 34 in the context. This example illustrates how the two communicators WORLD _ COMM _ MPI 35 (that inherently possess distinct contexts) protect communication. That is, communication 36 MPI _ COMM _ WORLD is insulated from communication in commslave , and vice versa. in 37 In summary, “group safety” is achieved via communicators because distinct contexts 38 within communicators are enforced to be unique on any process. 39 40 6.5.4 Example #4 41 The following example is meant to illustrate “safety” between point-to-point and collective 42 communication. MPI guarantees that a single communicator can do safe point-to-point and 43 collective communication. 44 45 #define TAG_ARBITRARY 12345 46 #define SOME_COUNT 50 47 48 int main(int argc, char *argv[])

283 6.5. MOTIVATING EXAMPLES 253 1 { 2 int me; 3 MPI_Request request[2]; 4 MPI_Status status[2]; 5 MPI_Group MPI_GROUP_WORLD, subgroup; 6 int ranks[] = {2, 4, 6, 8}; 7 MPI_Comm the_comm; 8 ... 9 MPI_Init(&argc, &argv); 10 MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD); 11 12 MPI_Group_incl(MPI_GROUP_WORLD, 4, ranks, &subgroup); /* local */ 13 MPI_Group_rank(subgroup, &me); /* local */ 14 15 MPI_Comm_create(MPI_COMM_WORLD, subgroup, &the_comm); 16 17 if(me != MPI_UNDEFINED) 18 { 19 MPI_Irecv(buff1, count, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_ARBITRARY, 20 the_comm, request); 21 MPI_Isend(buff2, count, MPI_DOUBLE, (me+1)%4, TAG_ARBITRARY, 22 the_comm, request+1); 23 for(i = 0; i < SOME_COUNT; i++) 24 MPI_Reduce(..., the_comm); 25 MPI_Waitall(2, request, status); 26 27 MPI_Comm_free(&the_comm); 28 } 29 30 MPI_Group_free(&MPI_GROUP_WORLD); 31 MPI_Group_free(&subgroup); 32 MPI_Finalize(); 33 return 0; 34 } 35 36 Library Example #1 6.5.5 37 The main program: 38 39 int main(int argc, char *argv[]) 40 { 41 int done = 0; 42 user_lib_t *libh_a, *libh_b; 43 void *dataset1, *dataset2; 44 ... 45 MPI_Init(&argc, &argv); 46 ... 47 init_user_lib(MPI_COMM_WORLD, &libh_a); 48

284 254 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 init_user_lib(MPI_COMM_WORLD, &libh_b); 2 ... 3 user_start_op(libh_a, dataset1); 4 user_start_op(libh_b, dataset2); 5 ... 6 while(!done) 7 { 8 /* work */ 9 ... 10 MPI_Reduce(..., MPI_COMM_WORLD); 11 ... 12 /* see if done */ 13 ... 14 } 15 user_end_op(libh_a); 16 user_end_op(libh_b); 17 18 uninit_user_lib(libh_a); 19 uninit_user_lib(libh_b); 20 MPI_Finalize(); 21 return 0; 22 } 23 The user library initialization code: 24 25 void init_user_lib(MPI_Comm comm, user_lib_t **handle) 26 { 27 user_lib_t *save; 28 29 user_lib_initsave(&save); /* local */ 30 MPI_Comm_dup(comm, &(save -> comm)); 31 32 /* other inits */ 33 ... 34 35 *handle = save; 36 } 37 38 User start-up code: 39 void user_start_op(user_lib_t *handle, void *data) 40 { 41 MPI_Irecv( ..., handle->comm, &(handle -> irecv_handle) ); 42 MPI_Isend( ..., handle->comm, &(handle -> isend_handle) ); 43 } 44 45 User communication clean-up code: 46 47 void user_end_op(user_lib_t *handle) 48 {

285 6.5. MOTIVATING EXAMPLES 255 1 MPI_Status status; 2 MPI_Wait(& handle -> isend_handle, &status); 3 MPI_Wait(& handle -> irecv_handle, &status); 4 } 5 User object clean-up code: 6 7 void uninit_user_lib(user_lib_t *handle) 8 { 9 MPI_Comm_free(&(handle -> comm)); 10 free(handle); 11 } 12 13 Library Example #2 6.5.6 14 15 The main program: 16 int main(int argc, char *argv[]) 17 { 18 int ma, mb; 19 MPI_Group MPI_GROUP_WORLD, group_a, group_b; 20 MPI_Comm comm_a, comm_b; 21 22 static int list_a[] = {0, 1}; 23 #if defined(EXAMPLE_2B) || defined(EXAMPLE_2C) 24 static int list_b[] = {0, 2 ,3}; 25 #else/* EXAMPLE_2A */ 26 static int list_b[] = {0, 2}; 27 #endif 28 int size_list_a = sizeof(list_a)/sizeof(int); 29 int size_list_b = sizeof(list_b)/sizeof(int); 30 31 ... 32 MPI_Init(&argc, &argv); 33 MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD); 34 35 MPI_Group_incl(MPI_GROUP_WORLD, size_list_a, list_a, &group_a); 36 MPI_Group_incl(MPI_GROUP_WORLD, size_list_b, list_b, &group_b); 37 38 MPI_Comm_create(MPI_COMM_WORLD, group_a, &comm_a); 39 MPI_Comm_create(MPI_COMM_WORLD, group_b, &comm_b); 40 41 if(comm_a != MPI_COMM_NULL) 42 MPI_Comm_rank(comm_a, &ma); 43 if(comm_b != MPI_COMM_NULL) 44 MPI_Comm_rank(comm_b, &mb); 45 46 if(comm_a != MPI_COMM_NULL) 47 lib_call(comm_a); 48

286 256 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 2 if(comm_b != MPI_COMM_NULL) 3 { 4 lib_call(comm_b); 5 lib_call(comm_b); 6 } 7 8 if(comm_a != MPI_COMM_NULL) 9 MPI_Comm_free(&comm_a); 10 if(comm_b != MPI_COMM_NULL) 11 MPI_Comm_free(&comm_b); 12 MPI_Group_free(&group_a); 13 MPI_Group_free(&group_b); 14 MPI_Group_free(&MPI_GROUP_WORLD); 15 MPI_Finalize(); 16 return 0; 17 } 18 The library: 19 void lib_call(MPI_Comm comm) 20 { 21 int me, done = 0; 22 MPI_Status status; 23 MPI_Comm_rank(comm, &me); 24 if(me == 0) 25 while(!done) 26 { 27 MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status); 28 ... 29 } 30 else 31 { 32 /* work */ 33 MPI_Send(..., 0, ARBITRARY_TAG, comm); 34 ... 35 } 36 #ifdef EXAMPLE_2C 37 /* include (resp, exclude) for safety (resp, no safety): */ 38 MPI_Barrier(comm); 39 #endif 40 } 41 42 The above example is really three examples, depending on whether or not one includes rank 43 _ 3 in list _ b . This example illustrates call , and whether or not a synchronize is included in lib 44 that, despite contexts, subsequent calls to _ call with the same context need not be safe lib 45 from one another (colloquially, “back-masking”). Safety is realized if the MPI _ Barrier is 46 added. What this demonstrates is that libraries have to be written carefully, even with 47 contexts. When rank 3 is excluded, then the synchronize is not needed to get safety from 48 back-masking.

Algorithms like "reduce" and "allreduce" have strong enough source selectivity properties so that they are inherently okay (no back-masking), provided that MPI provides basic guarantees. So are multiple calls to a typical tree-broadcast algorithm with the same root or different roots (see [57]). Here we rely on two guarantees of MPI: pairwise ordering of messages between processes in the same context, and source selectivity. Deleting either feature removes the guarantee that back-masking cannot be required.

Algorithms that try to do non-deterministic broadcasts or other calls that include wildcard operations will not generally have the good properties of the deterministic implementations of "reduce," "allreduce," and "broadcast." Such algorithms would have to utilize the monotonically increasing tags (within a communicator scope) to keep things straight.

All of the foregoing is a supposition of "collective calls" implemented with point-to-point operations. MPI implementations may or may not implement collective calls using point-to-point operations. These algorithms are used to illustrate the issues of correctness and safety, independent of how MPI implements its collective calls. See also Section 6.9.

6.6 Inter-Communication

This section introduces the concept of inter-communication and describes the portions of MPI that support it. It describes support for writing programs that contain user-level servers.

All communication described thus far has involved communication between processes that are members of the same group. This type of communication is called "intra-communication" and the communicator used is called an "intra-communicator," as we have noted earlier in the chapter.

In modular and multi-disciplinary applications, different process groups execute distinct modules and processes within different modules communicate with one another in a pipeline or a more general module graph. In these applications, the most natural way for a process to specify a target process is by the rank of the target process within the target group. In applications that contain internal user-level servers, each server may be a process group that provides services to one or more clients, and each client may be a process group that uses the services of one or more servers. It is again most natural to specify the target process by rank within the target group in these applications. This type of communication is called "inter-communication" and the communicator used is called an "inter-communicator," as introduced earlier.

An inter-communication is a point-to-point communication between processes in different groups. The group containing a process that initiates an inter-communication operation is called the "local group," that is, the sender in a send and the receiver in a receive. The group containing the target process is called the "remote group," that is, the receiver in a send and the sender in a receive. As in intra-communication, the target process is specified using a (communicator, rank) pair. Unlike intra-communication, the rank is relative to a second, remote group.

All inter-communicator constructors are blocking except for MPI_COMM_IDUP and require that the local and remote groups be disjoint.

Advice to users. The groups must be disjoint for several reasons. Primarily, this is the intent of the intercommunicators: to provide a communicator for communication between disjoint groups. This is reflected in the definition of MPI_INTERCOMM_MERGE, which allows the user to control the ranking of the processes in the created intracommunicator; this ranking makes little sense if the groups are not disjoint. In addition, the natural extension of collective operations to inter-communicators makes the most sense when the groups are disjoint. (End of advice to users.)

Here is a summary of the properties of inter-communication and inter-communicators:

• The syntax of point-to-point and collective communication is the same for both inter- and intra-communication. The same communicator can be used both for send and for receive operations.

• A target process is addressed by its rank in the remote group, both for sends and for receives.

• Communications using an inter-communicator are guaranteed not to conflict with any communications that use a different communicator.

• A communicator will provide either intra- or inter-communication, never both.

The routine MPI_COMM_TEST_INTER may be used to determine if a communicator is an inter- or intra-communicator. Inter-communicators can be used as arguments to some of the other communicator access routines. Inter-communicators cannot be used as input to some of the constructor routines for intra-communicators (for instance, MPI_CART_CREATE).

Advice to implementors. For the purpose of point-to-point communication, communicators can be represented in each process by a tuple consisting of:

  group
  send_context
  receive_context
  source

For inter-communicators, group describes the remote group, and source is the rank of the process in the local group. For intra-communicators, group is the communicator group (remote = local), source is the rank of the process in this group, and send_context and receive_context are identical. A group can be represented by a rank-to-absolute-address translation table.

The inter-communicator cannot be discussed sensibly without considering processes in both the local and remote groups. Imagine a process P in group P, which has an inter-communicator C_P, and a process Q in group Q, which has an inter-communicator C_Q. Then

• C_P.group describes the group Q and C_Q.group describes the group P.

• C_P.send_context = C_Q.receive_context and the context is unique in Q; C_P.receive_context = C_Q.send_context and this context is unique in P.

• C_P.source is the rank of P in P and C_Q.source is the rank of Q in Q.

Assume that P sends a message to Q using the inter-communicator. Then P uses the group table to find the absolute address of Q; source and send_context are appended to the message.

Assume that Q posts a receive with an explicit source argument using the inter-communicator. Then Q matches receive_context to the message context and source argument to the message source.

The same algorithm is appropriate for intra-communicators as well.

In order to support inter-communicator accessors and constructors, it is necessary to supplement this model with additional structures, that store information about the local communication group, and additional safe contexts. (End of advice to implementors.)

6.6.1 Inter-communicator Accessors

MPI_COMM_TEST_INTER(comm, flag)
  IN   comm    communicator (handle)
  OUT  flag    (logical)

int MPI_Comm_test_inter(MPI_Comm comm, int *flag)

MPI_Comm_test_inter(comm, flag, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_TEST_INTER(COMM, FLAG, IERROR)
    INTEGER COMM, IERROR
    LOGICAL FLAG

This local routine allows the calling process to determine if a communicator is an inter-communicator or an intra-communicator. It returns true if it is an inter-communicator, otherwise false.

When an inter-communicator is used as an input argument to the communicator accessors described above under intra-communication, the following table describes behavior.

  MPI_COMM_SIZE    returns the size of the local group.
  MPI_COMM_GROUP   returns the local group.
  MPI_COMM_RANK    returns the rank in the local group.

Table 6.1: MPI_COMM_* Function Behavior (in Inter-Communication Mode)

Furthermore, the operation MPI_COMM_COMPARE is valid for inter-communicators. Both communicators must be either intra- or inter-communicators, or else MPI_UNEQUAL results. Both corresponding local and remote groups must compare correctly to get the results MPI_CONGRUENT or MPI_SIMILAR. In particular, it is possible for MPI_SIMILAR to result because either the local or remote groups were similar but not identical.

The following accessors provide consistent access to the remote group of an inter-communicator. The following are all local operations.

MPI_COMM_REMOTE_SIZE(comm, size)
  IN   comm    inter-communicator (handle)
  OUT  size    number of processes in the remote group of comm (integer)

int MPI_Comm_remote_size(MPI_Comm comm, int *size)

MPI_Comm_remote_size(comm, size, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(OUT) :: size
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR

MPI_COMM_REMOTE_GROUP(comm, group)
  IN   comm     inter-communicator (handle)
  OUT  group    remote group corresponding to comm (handle)

int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)

MPI_Comm_remote_group(comm, group, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Group), INTENT(OUT) :: group
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERROR)
    INTEGER COMM, GROUP, IERROR

Rationale. Symmetric access to both the local and remote groups of an inter-communicator is important, so this function, as well as MPI_COMM_REMOTE_SIZE, have been provided. (End of rationale.)

6.6.2 Inter-communicator Operations

This section introduces four blocking inter-communicator operations. MPI_INTERCOMM_CREATE is used to bind two intra-communicators into an inter-communicator; the function MPI_INTERCOMM_MERGE creates an intra-communicator by merging the local and remote groups of an inter-communicator. The functions MPI_COMM_DUP and MPI_COMM_FREE, introduced previously, duplicate and free an inter-communicator, respectively.
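The accessors above are all that is needed to classify a communicator at run time. The following fragment is an illustrative sketch, not one of the standard's numbered examples; the function name report_comm_kind is hypothetical, and comm is assumed to be any valid communicator obtained earlier.

#include <stdio.h>
#include <mpi.h>

/* Illustrative sketch: report the kind of communicator and its group sizes.
   "comm" is assumed to be a valid communicator created elsewhere. */
void report_comm_kind(MPI_Comm comm)
{
    int flag, local_size;

    MPI_Comm_test_inter(comm, &flag);   /* local call */
    MPI_Comm_size(comm, &local_size);   /* size of the local group */

    if (flag) {
        int remote_size;
        MPI_Comm_remote_size(comm, &remote_size);   /* local call */
        printf("inter-communicator: local group %d, remote group %d\n",
               local_size, remote_size);
    } else {
        printf("intra-communicator: group size %d\n", local_size);
    }
}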

Overlap of local and remote groups that are bound into an inter-communicator is prohibited. If there is overlap, then the program is erroneous and is likely to deadlock. (If a process is multithreaded, and MPI calls block only a thread, rather than a process, then "dual membership" can be supported. It is then the user's responsibility to make sure that calls on behalf of the two "roles" of a process are executed by two independent threads.)

The function MPI_INTERCOMM_CREATE can be used to create an inter-communicator from two existing intra-communicators, in the following situation: At least one selected member from each group (the "group leader") has the ability to communicate with the selected member from the other group; that is, a "peer" communicator exists to which both leaders belong, and each leader knows the rank of the other leader in this peer communicator. Furthermore, members of each group know the rank of their leader.

Construction of an inter-communicator from two intra-communicators requires separate collective operations in the local group and in the remote group, as well as a point-to-point communication between a process in the local group and a process in the remote group.

In standard MPI implementations (with static process allocation at initialization), the MPI_COMM_WORLD communicator (or preferably a dedicated duplicate thereof) can be this peer communicator. For applications that have used spawn or join, it may be necessary to first create an intracommunicator to be used as peer.

The application topology functions described in Chapter 7 do not apply to inter-communicators. Users that require this capability should utilize MPI_INTERCOMM_MERGE to build an intra-communicator, then apply the graph or cartesian topology capabilities to that intra-communicator, creating an appropriate topology-oriented intra-communicator. Alternatively, it may be reasonable to devise one's own application topology mechanisms for this case, without loss of generality.

MPI_INTERCOMM_CREATE(local_comm, local_leader, peer_comm, remote_leader, tag, newintercomm)
  IN   local_comm      local intra-communicator (handle)
  IN   local_leader    rank of local group leader in local_comm (integer)
  IN   peer_comm       "peer" communicator; significant only at the local_leader (handle)
  IN   remote_leader   rank of remote group leader in peer_comm; significant only at the local_leader (integer)
  IN   tag             tag (integer)
  OUT  newintercomm    new inter-communicator (handle)

int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
             MPI_Comm peer_comm, int remote_leader, int tag,
             MPI_Comm *newintercomm)

MPI_Intercomm_create(local_comm, local_leader, peer_comm, remote_leader,
             tag, newintercomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: local_comm, peer_comm
    INTEGER, INTENT(IN) :: local_leader, remote_leader, tag
    TYPE(MPI_Comm), INTENT(OUT) :: newintercomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER,
             TAG, NEWINTERCOMM, IERROR)
    INTEGER LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG,
    NEWINTERCOMM, IERROR

This call creates an inter-communicator. It is collective over the union of the local and remote groups. Processes should provide identical local_comm and local_leader arguments within each group. Wildcards are not permitted for remote_leader, local_leader, and tag.

MPI_INTERCOMM_MERGE(intercomm, high, newintracomm)
  IN   intercomm       Inter-Communicator (handle)
  IN   high            (logical)
  OUT  newintracomm    new intra-communicator (handle)

int MPI_Intercomm_merge(MPI_Comm intercomm, int high,
             MPI_Comm *newintracomm)

MPI_Intercomm_merge(intercomm, high, newintracomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: intercomm
    LOGICAL, INTENT(IN) :: high
    TYPE(MPI_Comm), INTENT(OUT) :: newintracomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INTERCOMM_MERGE(INTERCOMM, HIGH, NEWINTRACOMM, IERROR)
    INTEGER INTERCOMM, NEWINTRACOMM, IERROR
    LOGICAL HIGH

This function creates an intra-communicator from the union of the two groups that are associated with intercomm. All processes should provide the same high value within each of the two groups. If processes in one group provided the value high = false and processes in the other group provided the value high = true then the union orders the "low" group before the "high" group. If all processes provided the same high argument then the order of the union is arbitrary. This call is blocking and collective within the union of the two groups.

The error handler on the new intercommunicator in each process is inherited from the communicator that contributes the local group. Note that this can result in different processes in the same communicator having different error handlers.

Advice to implementors. The implementations of MPI_INTERCOMM_MERGE, MPI_COMM_DUP, and MPI_COMM_FREE are similar to the implementation of MPI_INTERCOMM_CREATE, except that contexts private to the input inter-communicator are used for communication between group leaders rather than contexts inside a bridge communicator. (End of advice to implementors.)
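The following fragment is an illustrative sketch of the two constructors just defined, not one of the standard's numbered examples (those follow in Section 6.6.3). It assumes at least two processes, splits MPI_COMM_WORLD into two halves, binds the halves into an inter-communicator with MPI_INTERCOMM_CREATE, and then merges them back with MPI_INTERCOMM_MERGE; the choice of leaders and the tag value 99 are arbitrary choices made for this sketch.

#include <mpi.h>

/* Sketch: two halves of MPI_COMM_WORLD bound into an inter-communicator
   and then merged into one intra-communicator.  Assumes size >= 2. */
int main(int argc, char *argv[])
{
    int rank, size, color;
    MPI_Comm half, inter, merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    color = (rank < size / 2) ? 0 : 1;            /* membership of the two groups */
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half);

    /* Local leader is rank 0 of each half; the remote leader is identified
       by its rank in the peer communicator MPI_COMM_WORLD. */
    MPI_Intercomm_create(half, 0, MPI_COMM_WORLD,
                         (color == 0) ? size / 2 : 0,  /* remote leader */
                         99 /* tag */, &inter);

    /* Order the "low" half before the "high" half in the merged communicator. */
    MPI_Intercomm_merge(inter, /* high = */ color, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}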

293 6.6. INTER-COMMUNICATION 263 1 2 3 Group 1 Group 2 Group 0 4 5 6 7 Figure 6.3: Three-group pipeline 8 9 Inter-Communication Examples 6.6.3 10 11 Example 1: Three-Group “Pipeline” 12 Groups 0 and 1 communicate. Groups 1 and 2 communicate. Therefore, group 0 requires 13 one inter-communicator, group 1 requires two inter-communicators, and group 2 requires 1 14 inter-communicator. 15 16 int main(int argc, char *argv[]) 17 { 18 MPI_Comm myComm; /* intra-communicator of local sub-group */ 19 MPI_Comm myFirstComm; /* inter-communicator */ 20 MPI_Comm mySecondComm; /* second inter-communicator (group 1 only) */ 21 int membershipKey; 22 int rank; 23 24 MPI_Init(&argc, &argv); 25 MPI_Comm_rank(MPI_COMM_WORLD, &rank); 26 27 /* User code must generate membershipKey in the range [0, 1, 2] */ 28 membershipKey = rank % 3; 29 30 /* Build intra-communicator for local sub-group */ 31 MPI_Comm_split(MPI_COMM_WORLD, membershipKey, rank, &myComm); 32 33 /* Build inter-communicators. Tags are hard-coded. */ 34 if (membershipKey == 0) 35 { /* Group 0 communicates with group 1. */ 36 MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1, 37 1, &myFirstComm); 38 } 39 else if (membershipKey == 1) 40 { /* Group 1 communicates with groups 0 and 2. */ 41 MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0, 42 1, &myFirstComm); 43 MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2, 44 12, &mySecondComm); 45 } 46 else if (membershipKey == 2) 47 { /* Group 2 communicates with group 1. */ 48

294 264 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 2 3 4 Group 1 Group 2 Group 0 5 6 7 Figure 6.4: Three-group ring 8 9 10 MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1, 11 12, &myFirstComm); 12 } 13 14 /* Do work ... */ 15 16 switch(membershipKey) /* free communicators appropriately */ 17 { 18 case 1: 19 MPI_Comm_free(&mySecondComm); 20 case 0: 21 case 2: 22 MPI_Comm_free(&myFirstComm); 23 break; 24 } 25 26 MPI_Finalize(); 27 return 0; 28 } 29 30 Example 2: Three-Group “Ring” 31 32 Groups 0 and 1 communicate. Groups 1 and 2 communicate. Groups 0 and 2 communicate. 33 Therefore, each requires two inter-communicators. 34 int main(int argc, char *argv[]) 35 { 36 MPI_Comm myComm; /* intra-communicator of local sub-group */ 37 MPI_Comm myFirstComm; /* inter-communicators */ 38 MPI_Comm mySecondComm; 39 int membershipKey; 40 int rank; 41 42 MPI_Init(&argc, &argv); 43 MPI_Comm_rank(MPI_COMM_WORLD, &rank); 44 ... 45 46 /* User code must generate membershipKey in the range [0, 1, 2] */ 47 membershipKey = rank % 3; 48

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, membershipKey, rank, &myComm);

  /* Build inter-communicators.  Tags are hard-coded. */
  if (membershipKey == 0)
  { /* Group 0 communicates with groups 1 and 2. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          1, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2,
                          2, &mySecondComm);
  }
  else if (membershipKey == 1)
  { /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0,
                          1, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 2,
                          12, &mySecondComm);
  }
  else if (membershipKey == 2)
  { /* Group 2 communicates with groups 0 and 1. */
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 0,
                          2, &myFirstComm);
    MPI_Intercomm_create( myComm, 0, MPI_COMM_WORLD, 1,
                          12, &mySecondComm);
  }

  /* Do some work ... */

  /* Then free communicators before terminating... */
  MPI_Comm_free(&myFirstComm);
  MPI_Comm_free(&mySecondComm);
  MPI_Comm_free(&myComm);
  MPI_Finalize();
  return 0;
}

6.7 Caching

MPI provides a "caching" facility that allows an application to attach arbitrary pieces of information, called attributes, to three kinds of MPI objects: communicators, windows, and datatypes. More precisely, the caching facility allows a portable library to do the following:

• pass information between calls by associating it with an MPI intra- or inter-communicator, window, or datatype,

• quickly retrieve that information, and

• be guaranteed that out-of-date information is never retrieved, even if the object is freed and its handle subsequently reused by MPI.

The caching capabilities, in some form, are required by built-in MPI routines such as collective communication and application topology. Defining an interface to these capabilities as part of the MPI standard is valuable because it permits routines like collective communication and application topologies to be implemented as portable code, and also because it makes MPI more extensible by allowing user-written routines to use standard MPI calling sequences.

Advice to users. The communicator MPI_COMM_SELF is a suitable choice for posting process-local attributes, via this attribute-caching mechanism. (End of advice to users.)

Rationale. In one extreme one can allow caching on all opaque handles. The other extreme is to only allow it on communicators. Caching has a cost associated with it and should only be allowed when it is clearly needed and the increased cost is modest. This is the reason that windows and datatypes were added but not other handles. (End of rationale.)

One difficulty is the potential for size differences between Fortran integers and C pointers. For this reason, the Fortran versions of these routines use integers of kind MPI_ADDRESS_KIND.

Advice to implementors. High-quality implementations should raise an error when a keyval that was created by a call to MPI_XXX_CREATE_KEYVAL is used with an object of the wrong type with a call to MPI_YYY_GET_ATTR, MPI_YYY_SET_ATTR, MPI_YYY_DELETE_ATTR, or MPI_YYY_FREE_KEYVAL. To do so, it is necessary to maintain, with each keyval, information on the type of the associated user function. (End of advice to implementors.)

6.7.1 Functionality

Attributes can be attached to communicators, windows, and datatypes. Attributes are local to the process and specific to the communicator to which they are attached. Attributes are not propagated by MPI from one communicator to another except when the communicator is duplicated using MPI_COMM_DUP or MPI_COMM_IDUP (and even then the application must give specific permission through callback functions for the attribute to be copied).

Advice to users. Attributes in C are of type void *. Typically, such an attribute will be a pointer to a structure that contains further information, or a handle to an MPI object. In Fortran, attributes are of type INTEGER. Such an attribute can be a handle to an MPI object, or just an integer-valued attribute. (End of advice to users.)

Advice to implementors. Attributes are scalar values, equal in size to, or larger than a C-language pointer. Attributes can always hold an MPI handle. (End of advice to implementors.)

The caching interface defined here requires that attributes be stored by MPI opaquely within a communicator, window, and datatype. Accessor functions include the following:

• obtain a key value (used to identify an attribute); the user specifies "callback" functions by which MPI informs the application when the communicator is destroyed or copied.

• store and retrieve the value of an attribute;

Advice to implementors. Caching and callback functions are only called synchronously, in response to explicit application requests. This avoids problems that result from repeated crossings between user and system space. (This synchronous calling rule is a general property of MPI.)

The choice of key values is under the control of MPI. This allows MPI to optimize its implementation of attribute sets. It also avoids conflict between independent modules caching information on the same communicators.

A much smaller interface, consisting of just a callback facility, would allow the entire caching facility to be implemented by portable code. However, with the minimal callback interface, some form of table searching is implied by the need to handle arbitrary communicators. In contrast, the more complete interface defined here permits rapid access to attributes through the use of pointers in communicators (to find the attribute table) and cleverly chosen key values (to retrieve individual attributes). In light of the efficiency "hit" inherent in the minimal interface, the more complete interface defined here is seen to be superior. (End of advice to implementors.)

MPI provides the following services related to caching. They are all process local.

6.7.2 Communicators

Functions for caching on communicators are:

MPI_COMM_CREATE_KEYVAL(comm_copy_attr_fn, comm_delete_attr_fn, comm_keyval, extra_state)
  IN   comm_copy_attr_fn      copy callback function for comm_keyval (function)
  IN   comm_delete_attr_fn    delete callback function for comm_keyval (function)
  OUT  comm_keyval            key value for future access (integer)
  IN   extra_state            extra state for callback functions

int MPI_Comm_create_keyval(MPI_Comm_copy_attr_function *comm_copy_attr_fn,
             MPI_Comm_delete_attr_function *comm_delete_attr_fn,
             int *comm_keyval, void *extra_state)

MPI_Comm_create_keyval(comm_copy_attr_fn, comm_delete_attr_fn, comm_keyval,
             extra_state, ierror) BIND(C)
    PROCEDURE(MPI_Comm_copy_attr_function) :: comm_copy_attr_fn
    PROCEDURE(MPI_Comm_delete_attr_function) :: comm_delete_attr_fn
    INTEGER, INTENT(OUT) :: comm_keyval
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: extra_state
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

298 268 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 MPI_COMM_CREATE_KEYVAL(COMM_COPY_ATTR_FN, COMM_DELETE_ATTR_FN, COMM_KEYVAL, 2 EXTRA_STATE, IERROR) 3 EXTERNAL COMM_COPY_ATTR_FN, COMM_DELETE_ATTR_FN 4 INTEGER COMM_KEYVAL, IERROR 5 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE 6 Generates a new attribute key. Keys are locally unique in a process, and opaque to 7 user, though they are explicitly stored in integers. Once allocated, the key value can be 8 used to associate attributes and access them on any locally defined communicator. 9 10 The C callback functions are: 11 typedef int MPI_Comm_copy_attr_function(MPI_Comm oldcomm, int comm_keyval, 12 void *extra_state, void *attribute_val_in, 13 void *attribute_val_out, int *flag); 14 15 and 16 typedef int MPI_Comm_delete_attr_function(MPI_Comm comm, int comm_keyval, 17 void *attribute_val, void *extra_state); 18 which are the same as the MPI-1.1 calls but with a new name. The old names are deprecated. 19 20 With the mpi_f08 module, the Fortran callback functions are: 21 ABSTRACT INTERFACE 22 SUBROUTINE MPI_Comm_copy_attr_function(oldcomm, comm_keyval, extra_state, 23 attribute_val_in, attribute_val_out, flag, ierror) BIND(C) 24 TYPE(MPI_Comm) :: oldcomm 25 INTEGER :: comm_keyval, ierror 26 INTEGER(KIND=MPI_ADDRESS_KIND) :: extra_state, attribute_val_in, 27 attribute_val_out 28 LOGICAL :: flag 29 30 and 31 ABSTRACT INTERFACE 32 SUBROUTINE MPI_Comm_delete_attr_function(comm, comm_keyval, 33 attribute_val, extra_state, ierror) BIND(C) 34 TYPE(MPI_Comm) :: comm 35 INTEGER :: comm_keyval, ierror 36 INTEGER(KIND=MPI_ADDRESS_KIND) :: attribute_val, extra_state 37 38 mpi , the Fortran callback functions are: mpif.h module and With the 39 SUBROUTINE COMM_COPY_ATTR_FUNCTION(OLDCOMM, COMM_KEYVAL, EXTRA_STATE, 40 ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERROR) 41 INTEGER OLDCOMM, COMM_KEYVAL, IERROR 42 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE, ATTRIBUTE_VAL_IN, 43 ATTRIBUTE_VAL_OUT 44 LOGICAL FLAG 45 46 and 47 SUBROUTINE COMM_DELETE_ATTR_FUNCTION(COMM, COMM_KEYVAL, ATTRIBUTE_VAL, 48 EXTRA_STATE, IERROR)

    INTEGER COMM, COMM_KEYVAL, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL, EXTRA_STATE

The comm_copy_attr_fn function is invoked when a communicator is duplicated by MPI_COMM_DUP or MPI_COMM_IDUP. comm_copy_attr_fn should be of type MPI_Comm_copy_attr_function. The copy callback function is invoked for each key value in oldcomm in arbitrary order. Each call to the copy callback is made with a key value and its corresponding attribute. If it returns flag = 0 or .FALSE., then the attribute is deleted in the duplicated communicator. Otherwise (flag = 1 or .TRUE.), the new attribute value is set to the value returned in attribute_val_out. The function returns MPI_SUCCESS on success and an error code on failure (in which case MPI_COMM_DUP or MPI_COMM_IDUP will fail).

The argument comm_copy_attr_fn may be specified as MPI_COMM_NULL_COPY_FN or MPI_COMM_DUP_FN from either C or Fortran. MPI_COMM_NULL_COPY_FN is a function that does nothing other than returning flag = 0 or .FALSE. (depending on whether the keyval was created with a C or Fortran binding to MPI_COMM_CREATE_KEYVAL) and MPI_SUCCESS. MPI_COMM_DUP_FN is a simple-minded copy function that sets flag = 1 or .TRUE., returns the value of attribute_val_in in attribute_val_out, and returns MPI_SUCCESS. These replace the MPI-1 predefined callbacks MPI_NULL_COPY_FN and MPI_DUP_FN, whose use is deprecated.

Advice to users. Even though both formal arguments attribute_val_in and attribute_val_out are of type void *, their usage differs. The C copy function is passed by MPI in attribute_val_in the value of the attribute, and in attribute_val_out the address of the attribute, so as to allow the function to return the (new) attribute value. The use of type void * for both is to avoid messy type casts.

A valid copy function is one that completely duplicates the information by making a full duplicate copy of the data structures implied by an attribute; another might just make another reference to that data structure, while using a reference-count mechanism. Other types of attributes might not copy at all (they might be specific to oldcomm only). (End of advice to users.)

Advice to implementors. A C interface should be assumed for copy and delete functions associated with key values created in C; a Fortran calling interface should be assumed for key values created in Fortran. (End of advice to implementors.)

Analogous to comm_copy_attr_fn is a callback deletion function, defined as follows. The comm_delete_attr_fn function is invoked when a communicator is deleted by MPI_COMM_FREE or when a call is made explicitly to MPI_COMM_DELETE_ATTR. comm_delete_attr_fn should be of type MPI_Comm_delete_attr_function.

This function is called by MPI_COMM_FREE, MPI_COMM_DELETE_ATTR, and MPI_COMM_SET_ATTR to do whatever is needed to remove an attribute. The function returns MPI_SUCCESS on success and an error code on failure (in which case MPI_COMM_FREE will fail).

The argument comm_delete_attr_fn may be specified as MPI_COMM_NULL_DELETE_FN from either C or Fortran. MPI_COMM_NULL_DELETE_FN is a function that does nothing, other than returning MPI_SUCCESS. MPI_COMM_NULL_DELETE_FN replaces MPI_NULL_DELETE_FN, whose use is deprecated.
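To make the callback semantics concrete, the following fragment is a hedged sketch and is not part of the standard: it caches a heap-allocated structure on a communicator and relies on a delete callback to release the memory when the attribute is removed. The names my_state_t, my_delete_fn, my_keyval, and cache_state_on are hypothetical, and the routine MPI_COMM_SET_ATTR used here is defined later in this section.

#include <stdlib.h>
#include <mpi.h>

typedef struct { int calls; } my_state_t;   /* hypothetical per-communicator state */

/* Delete callback: invoked by MPI_COMM_FREE, MPI_COMM_DELETE_ATTR, or
   MPI_COMM_SET_ATTR when the attribute is removed.  It must return
   MPI_SUCCESS, otherwise the triggering call fails. */
static int my_delete_fn(MPI_Comm comm, int keyval,
                        void *attribute_val, void *extra_state)
{
    free(attribute_val);      /* release the cached state */
    return MPI_SUCCESS;
}

static int my_keyval = MPI_KEYVAL_INVALID;

/* Registration sketch: no copying on MPI_COMM_DUP, custom deletion. */
void cache_state_on(MPI_Comm comm)
{
    my_state_t *state = malloc(sizeof(my_state_t));
    state->calls = 0;
    if (my_keyval == MPI_KEYVAL_INVALID)
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, my_delete_fn,
                               &my_keyval, NULL);
    MPI_Comm_set_attr(comm, my_keyval, state);
}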

If an attribute copy function or attribute delete function returns other than MPI_SUCCESS, then the call that caused it to be invoked (for example, MPI_COMM_FREE) is erroneous.

The special key value MPI_KEYVAL_INVALID is never returned by MPI_COMM_CREATE_KEYVAL. Therefore, it can be used for static initialization of key values.

Advice to implementors. The predefined Fortran functions MPI_COMM_NULL_COPY_FN, MPI_COMM_DUP_FN, and MPI_COMM_NULL_DELETE_FN are defined in the mpi module (and mpif.h) and the mpi_f08 module with the same name, but with different interfaces. Each function can coexist twice with the same name in the same MPI library, one routine as an implicit interface outside of the mpi module, i.e., declared as EXTERNAL, and the other routine within mpi_f08 declared with CONTAINS. These routines have different link names, which are also different to the link names used for the routines used in C. (End of advice to implementors.)

Advice to users. Callbacks, including the predefined Fortran functions MPI_COMM_NULL_COPY_FN, MPI_COMM_DUP_FN, and MPI_COMM_NULL_DELETE_FN, should not be passed from one application routine that uses the mpi_f08 module to another application routine that uses the mpi module or mpif.h, and vice versa; see also the advice to users on page 652. (End of advice to users.)

MPI_COMM_FREE_KEYVAL(comm_keyval)
  INOUT  comm_keyval    key value (integer)

int MPI_Comm_free_keyval(int *comm_keyval)

MPI_Comm_free_keyval(comm_keyval, ierror) BIND(C)
    INTEGER, INTENT(INOUT) :: comm_keyval
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_FREE_KEYVAL(COMM_KEYVAL, IERROR)
    INTEGER COMM_KEYVAL, IERROR

Frees an extant attribute key. This function sets the value of keyval to MPI_KEYVAL_INVALID. Note that it is not erroneous to free an attribute key that is in use, because the actual free does not transpire until after all references (in other communicators on the process) to the key have been freed. These references need to be explicitly freed by the program, either via calls to MPI_COMM_DELETE_ATTR that free one attribute instance, or by calls to MPI_COMM_FREE that free all attribute instances associated with the freed communicator.
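The following fragment is a hedged sketch of the keyval life cycle, again not part of the standard's example set: it creates a key with the predefined MPI_COMM_DUP_FN and MPI_COMM_NULL_DELETE_FN callbacks, attaches an attribute, duplicates the communicator so the copy callback runs, and frees the key while it is still in use, which the preceding paragraph permits. MPI_COMM_SET_ATTR and MPI_COMM_GET_ATTR are defined next; the names version_key and module_version are illustrative only.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int version_key, flag;
    static int module_version = 42;    /* the data we cache */
    void *val;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);

    /* Predefined callbacks: copy the attribute on MPI_COMM_DUP, do nothing on delete. */
    MPI_Comm_create_keyval(MPI_COMM_DUP_FN, MPI_COMM_NULL_DELETE_FN,
                           &version_key, NULL);
    MPI_Comm_set_attr(MPI_COMM_WORLD, version_key, &module_version);

    MPI_Comm_dup(MPI_COMM_WORLD, &dup);          /* attribute copied by MPI_COMM_DUP_FN */
    MPI_Comm_get_attr(dup, version_key, &val, &flag);
    if (flag)
        printf("cached value: %d\n", *(int *)val);

    /* Freeing the key while attributes still reference it is not erroneous;
       the key becomes MPI_KEYVAL_INVALID locally and is reclaimed once the
       remaining references are freed. */
    MPI_Comm_free_keyval(&version_key);

    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}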

MPI_COMM_SET_ATTR(comm, comm_keyval, attribute_val)
  INOUT  comm             communicator to which attribute will be attached (handle)
  IN     comm_keyval      key value (integer)
  IN     attribute_val    attribute value

int MPI_Comm_set_attr(MPI_Comm comm, int comm_keyval, void *attribute_val)

MPI_Comm_set_attr(comm, comm_keyval, attribute_val, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: comm_keyval
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: attribute_val
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SET_ATTR(COMM, COMM_KEYVAL, ATTRIBUTE_VAL, IERROR)
    INTEGER COMM, COMM_KEYVAL, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL

This function stores the stipulated attribute value attribute_val for subsequent retrieval by MPI_COMM_GET_ATTR. If the value is already present, then the outcome is as if MPI_COMM_DELETE_ATTR was first called to delete the previous value (and the callback function comm_delete_attr_fn was executed), and a new value was next stored. The call is erroneous if there is no key with value keyval; in particular MPI_KEYVAL_INVALID is an erroneous key value. The call will fail if the comm_delete_attr_fn function returned an error code other than MPI_SUCCESS.

MPI_COMM_GET_ATTR(comm, comm_keyval, attribute_val, flag)
  IN   comm             communicator to which the attribute is attached (handle)
  IN   comm_keyval      key value (integer)
  OUT  attribute_val    attribute value, unless flag = false
  OUT  flag             false if no attribute is associated with the key (logical)

int MPI_Comm_get_attr(MPI_Comm comm, int comm_keyval, void *attribute_val,
             int *flag)

MPI_Comm_get_attr(comm, comm_keyval, attribute_val, flag, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: comm_keyval
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: attribute_val
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GET_ATTR(COMM, COMM_KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR)
    INTEGER COMM, COMM_KEYVAL, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL
    LOGICAL FLAG

Retrieves attribute value by key. The call is erroneous if there is no key with value keyval. On the other hand, the call is correct if the key value exists, but no attribute is attached on comm for that key; in such a case, the call returns flag = false. In particular MPI_KEYVAL_INVALID is an erroneous key value.

Advice to users. The call to MPI_Comm_set_attr passes in attribute_val the value of the attribute; the call to MPI_Comm_get_attr passes in attribute_val the address of the location where the attribute value is to be returned. Thus, if the attribute value itself is a pointer of type void*, then the actual attribute_val parameter to MPI_Comm_set_attr will be of type void* and the actual attribute_val parameter to MPI_Comm_get_attr will be of type void**. (End of advice to users.)

Rationale. The use of a formal parameter attribute_val of type void* (rather than void**) avoids the messy type casting that would be needed if the attribute value is declared with a type other than void*. (End of rationale.)

MPI_COMM_DELETE_ATTR(comm, comm_keyval)
  INOUT  comm            communicator from which the attribute is deleted (handle)
  IN     comm_keyval     key value (integer)

int MPI_Comm_delete_attr(MPI_Comm comm, int comm_keyval)

MPI_Comm_delete_attr(comm, comm_keyval, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: comm_keyval
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_DELETE_ATTR(COMM, COMM_KEYVAL, IERROR)
    INTEGER COMM, COMM_KEYVAL, IERROR

Delete attribute from cache by key. This function invokes the attribute delete function comm_delete_attr_fn specified when the comm_keyval was created. The call will fail if the comm_delete_attr_fn function returns an error code other than MPI_SUCCESS.

Whenever a communicator is replicated using the function MPI_COMM_DUP or MPI_COMM_IDUP, all callback copy functions for attributes that are currently set are invoked (in arbitrary order). Whenever a communicator is deleted using the function MPI_COMM_FREE all callback delete functions for attributes that are currently set are invoked.

6.7.3 Windows

The functions for caching on windows are:

303 6.7. CACHING 273 1 WIN _ _ KEYVAL(win _ copy _ attr _ fn, win _ delete _ attr _ fn, win _ keyval, extra _ state) _ MPI CREATE 2 3 (function) fn copy callback function for win _ keyval attr _ win copy IN _ _ 4 _ fn delete callback function for win _ keyval (function) IN win _ delete _ attr 5 6 OUT win _ keyval key value for future access (integer) 7 extra state IN _ extra state for callback functions 8 9 int MPI_Win_create_keyval(MPI_Win_copy_attr_function *win_copy_attr_fn, 10 MPI_Win_delete_attr_function *win_delete_attr_fn, 11 int *win_keyval, void *extra_state) 12 13 MPI_Win_create_keyval(win_copy_attr_fn, win_delete_attr_fn, win_keyval, 14 extra_state, ierror) BIND(C) 15 PROCEDURE(MPI_Win_copy_attr_function) :: win_copy_attr_fn 16 PROCEDURE(MPI_Win_delete_attr_function) :: win_delete_attr_fn 17 INTEGER, INTENT(OUT) :: win_keyval 18 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: extra_state 19 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 20 MPI_WIN_CREATE_KEYVAL(WIN_COPY_ATTR_FN, WIN_DELETE_ATTR_FN, WIN_KEYVAL, 21 EXTRA_STATE, IERROR) 22 EXTERNAL WIN_COPY_ATTR_FN, WIN_DELETE_ATTR_FN 23 INTEGER WIN_KEYVAL, IERROR 24 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE 25 26 COPY The argument win _ copy _ attr _ fn may be specified as _ MPI _ FN or WIN _ NULL _ 27 COPY DUP _ FN MPI MPI is a function FN _ _ _ NULL _ WIN _ WIN _ from either C or Fortran. 28 FN is that does nothing other than returning flag = 0 and MPI _ SUCCESS . MPI _ WIN _ DUP _ 29 , returns the value of in a simple-minded copy function that sets flag = 1 _ attribute _ val in 30 SUCCESS . attribute _ val _ out , and returns MPI _ 31 _ _ FN MPI WIN _ NULL _ DELETE The argument win _ delete _ attr _ fn may be specified as 32 NULL _ from either C or Fortran. MPI _ WIN _ DELETE _ FN is a function that does nothing, 33 other than returning MPI _ . SUCCESS 34 The C callback functions are: 35 36 typedef int MPI_Win_copy_attr_function(MPI_Win oldwin, int win_keyval, 37 void *extra_state, void *attribute_val_in, 38 void *attribute_val_out, int *flag); 39 and 40 typedef int MPI_Win_delete_attr_function(MPI_Win win, int win_keyval, 41 void *attribute_val, void *extra_state); 42 43 module, the Fortran callback functions are: With the mpi_f08 44 45 ABSTRACT INTERFACE 46 SUBROUTINE MPI_Win_copy_attr_function(oldwin, win_keyval, extra_state, 47 attribute_val_in, attribute_val_out, flag, ierror) BIND(C) 48 TYPE(MPI_Win) :: oldwin

304 274 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 INTEGER :: win_keyval, ierror 2 INTEGER(KIND=MPI_ADDRESS_KIND) :: extra_state, attribute_val_in, 3 attribute_val_out 4 LOGICAL :: flag 5 and 6 ABSTRACT INTERFACE 7 SUBROUTINE MPI_Win_delete_attr_function(win, win_keyval, attribute_val, 8 extra_state, ierror) BIND(C) 9 TYPE(MPI_Win) :: win 10 INTEGER :: win_keyval, ierror 11 INTEGER(KIND=MPI_ADDRESS_KIND) :: attribute_val, extra_state 12 13 , the Fortran callback functions are: With the module and mpif.h mpi 14 15 SUBROUTINE WIN_COPY_ATTR_FUNCTION(OLDWIN, WIN_KEYVAL, EXTRA_STATE, 16 ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERROR) 17 INTEGER OLDWIN, WIN_KEYVAL, IERROR 18 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE, ATTRIBUTE_VAL_IN, 19 ATTRIBUTE_VAL_OUT 20 LOGICAL FLAG 21 and 22 SUBROUTINE WIN_DELETE_ATTR_FUNCTION(WIN, WIN_KEYVAL, ATTRIBUTE_VAL, 23 EXTRA_STATE, IERROR) 24 INTEGER WIN, WIN_KEYVAL, IERROR 25 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL, EXTRA_STATE 26 27 If an attribute copy function or attribute delete function returns other than 28 _ SUCCESS MPI , then the call that caused it to be invoked (for example, MPI ), is _ _ WIN FREE 29 erroneous. 30 31 32 _ FREE _ WIN _ keyval) _ KEYVAL(win MPI 33 keyval key value (integer) INOUT win _ 34 35 int MPI_Win_free_keyval(int *win_keyval) 36 37 MPI_Win_free_keyval(win_keyval, ierror) BIND(C) 38 INTEGER, INTENT(INOUT) :: win_keyval 39 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 40 41 MPI_WIN_FREE_KEYVAL(WIN_KEYVAL, IERROR) 42 INTEGER WIN_KEYVAL, IERROR 43 44 45 46 47 48

305 6.7. CACHING 275 1 _ SET _ ATTR(win, win _ keyval, attribute _ val) MPI WIN _ 2 win window to which attribute will be attached (handle) INOUT 3 keyval _ key value (integer) win IN 4 5 _ val attribute value IN attribute 6 7 int MPI_Win_set_attr(MPI_Win win, int win_keyval, void *attribute_val) 8 9 MPI_Win_set_attr(win, win_keyval, attribute_val, ierror) BIND(C) 10 TYPE(MPI_Win), INTENT(IN) :: win 11 INTEGER, INTENT(IN) :: win_keyval 12 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: attribute_val 13 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 14 MPI_WIN_SET_ATTR(WIN, WIN_KEYVAL, ATTRIBUTE_VAL, IERROR) 15 INTEGER WIN, WIN_KEYVAL, IERROR 16 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL 17 18 19 _ WIN val, flag) GET _ keyval, attribute ATTR(win, win _ _ MPI _ 20 21 window to which the attribute is attached (handle) win IN 22 keyval key value (integer) IN win _ 23 val OUT attribute _ attribute value, unless flag = false 24 25 false if no attribute is associated with the key (logical) flag OUT 26 27 int MPI_Win_get_attr(MPI_Win win, int win_keyval, void *attribute_val, 28 int *flag) 29 MPI_Win_get_attr(win, win_keyval, attribute_val, flag, ierror) BIND(C) 30 TYPE(MPI_Win), INTENT(IN) :: win 31 INTEGER, INTENT(IN) :: win_keyval 32 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: attribute_val 33 LOGICAL, INTENT(OUT) :: flag 34 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 35 36 MPI_WIN_GET_ATTR(WIN, WIN_KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR) 37 INTEGER WIN, WIN_KEYVAL, IERROR 38 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL 39 LOGICAL FLAG 40 41 42 keyval) _ DELETE _ ATTR(win, win _ WIN _ MPI 43 win INOUT window from which the attribute is deleted (handle) 44 45 keyval key value (integer) _ win IN 46 47 int MPI_Win_delete_attr(MPI_Win win, int win_keyval) 48

306 276 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 MPI_Win_delete_attr(win, win_keyval, ierror) BIND(C) 2 TYPE(MPI_Win), INTENT(IN) :: win 3 INTEGER, INTENT(IN) :: win_keyval 4 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 5 MPI_WIN_DELETE_ATTR(WIN, WIN_KEYVAL, IERROR) 6 INTEGER WIN, WIN_KEYVAL, IERROR 7 8 9 6.7.4 Datatypes 10 The new functions for caching on datatypes are: 11 12 13 _ fn, type _ keyval, CREATE _ KEYVAL(type _ delete fn, type MPI TYPE _ _ copy _ attr _ _ attr _ 14 state) _ extra 15 16 fn copy callback function for type _ IN (function) type _ copy _ attr _ keyval 17 fn delete callback function for type _ keyval (function) type _ delete _ attr IN _ 18 type OUT _ keyval key value for future access (integer) 19 20 _ IN state extra state for callback functions extra 21 22 int MPI_Type_create_keyval(MPI_Type_copy_attr_function *type_copy_attr_fn, 23 MPI_Type_delete_attr_function *type_delete_attr_fn, 24 int *type_keyval, void *extra_state) 25 MPI_Type_create_keyval(type_copy_attr_fn, type_delete_attr_fn, type_keyval, 26 extra_state, ierror) BIND(C) 27 PROCEDURE(MPI_Type_copy_attr_function) :: type_copy_attr_fn 28 PROCEDURE(MPI_Type_delete_attr_function) :: type_delete_attr_fn 29 INTEGER, INTENT(OUT) :: type_keyval 30 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: extra_state 31 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 32 33 MPI_TYPE_CREATE_KEYVAL(TYPE_COPY_ATTR_FN, TYPE_DELETE_ATTR_FN, TYPE_KEYVAL, 34 EXTRA_STATE, IERROR) 35 EXTERNAL TYPE_COPY_ATTR_FN, TYPE_DELETE_ATTR_FN 36 INTEGER TYPE_KEYVAL, IERROR 37 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE 38 fn _ attr _ copy _ type The argument FN or _ NULL COPY _ TYPE _ MPI may be specified as _ 39 _ DUP _ FN from either C or Fortran. MPI _ TYPE _ NULL _ COPY _ FN is a function MPI _ TYPE 40 TYPE _ FN that does nothing other than returning flag = 0 and MPI _ SUCCESS . MPI _ DUP _ 41 _ is a simple-minded copy function that sets in in flag = 1 val _ attribute , returns the value of 42 attribute MPI SUCCESS _ , and returns out _ val _ . 43 DELETE fn _ FN _ may be specified as MPI _ TYPE _ attr _ delete _ type The argument NULL _ 44 _ FN _ DELETE _ is a function that does nothing, TYPE _ MPI from either C or Fortran. NULL 45 other than returning . SUCCESS _ MPI 46 47 The C callback functions are: 48

307 6.7. CACHING 277 1 typedef int MPI_Type_copy_attr_function(MPI_Datatype oldtype, 2 int type_keyval, void *extra_state, void *attribute_val_in, 3 void *attribute_val_out, int *flag); 4 and 5 typedef int MPI_Type_delete_attr_function(MPI_Datatype datatype, 6 int type_keyval, void *attribute_val, void *extra_state); 7 8 module, the Fortran callback functions are: mpi_f08 With the 9 ABSTRACT INTERFACE 10 SUBROUTINE MPI_Type_copy_attr_function(oldtype, type_keyval, extra_state, 11 attribute_val_in, attribute_val_out, flag, ierror) BIND(C) 12 TYPE(MPI_Datatype) :: oldtype 13 INTEGER :: type_keyval, ierror 14 INTEGER(KIND=MPI_ADDRESS_KIND) :: extra_state, attribute_val_in, 15 attribute_val_out 16 LOGICAL :: flag 17 18 and 19 ABSTRACT INTERFACE 20 SUBROUTINE MPI_Type_delete_attr_function(datatype, type_keyval, 21 attribute_val, extra_state, ierror) BIND(C) 22 TYPE(MPI_Datatype) :: datatype 23 INTEGER :: type_keyval, ierror 24 INTEGER(KIND=MPI_ADDRESS_KIND) :: attribute_val, extra_state 25 26 mpi , the Fortran callback functions are: module and mpif.h With the 27 SUBROUTINE TYPE_COPY_ATTR_FUNCTION(OLDTYPE, TYPE_KEYVAL, EXTRA_STATE, 28 ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERROR) 29 INTEGER OLDTYPE, TYPE_KEYVAL, IERROR 30 INTEGER(KIND=MPI_ADDRESS_KIND) EXTRA_STATE, 31 ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT 32 LOGICAL FLAG 33 34 and 35 SUBROUTINE TYPE_DELETE_ATTR_FUNCTION(DATATYPE, TYPE_KEYVAL, ATTRIBUTE_VAL, 36 EXTRA_STATE, IERROR) 37 INTEGER DATATYPE, TYPE_KEYVAL, IERROR 38 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL, EXTRA_STATE 39 40 If an attribute copy function or attribute delete function returns other than 41 MPI _ ), FREE _ _ MPI , then the call that caused it to be invoked (for example, SUCCESS TYPE 42 is erroneous. 43 44 keyval) MPI _ TYPE _ FREE _ KEYVAL(type _ 45 46 key value (integer) keyval _ type INOUT 47 48 int MPI_Type_free_keyval(int *type_keyval)

308 278 CHAPTER 6. GROUPS, CONTEXTS, COMMUNICATORS, AND CACHING 1 MPI_Type_free_keyval(type_keyval, ierror) BIND(C) 2 INTEGER, INTENT(INOUT) :: type_keyval 3 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 4 MPI_TYPE_FREE_KEYVAL(TYPE_KEYVAL, IERROR) 5 INTEGER TYPE_KEYVAL, IERROR 6 7 8 val) _ keyval, attribute _ MPI _ SET _ TYPE _ ATTR(datatype, type 9 10 datatype INOUT datatype to which attribute will be attached (handle) 11 keyval key value (integer) IN type _ 12 13 attribute value val _ attribute IN 14 15 int MPI_Type_set_attr(MPI_Datatype datatype, int type_keyval, 16 void *attribute_val) 17 MPI_Type_set_attr(datatype, type_keyval, attribute_val, ierror) BIND(C) 18 TYPE(MPI_Datatype), INTENT(IN) :: datatype 19 INTEGER, INTENT(IN) :: type_keyval 20 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: attribute_val 21 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 22 23 MPI_TYPE_SET_ATTR(DATATYPE, TYPE_KEYVAL, ATTRIBUTE_VAL, IERROR) 24 INTEGER DATATYPE, TYPE_KEYVAL, IERROR 25 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL 26 27 28 TYPE _ val, flag) _ keyval, attribute MPI _ _ GET _ ATTR(datatype, type 29 IN datatype to which the attribute is attached (handle) datatype 30 31 key value (integer) IN keyval _ type 32 _ val attribute value, unless flag = false OUT attribute 33 if no attribute is associated with the key (logical) false flag OUT 34 35 36 int MPI_Type_get_attr(MPI_Datatype datatype, int type_keyval, void 37 *attribute_val, int *flag) 38 MPI_Type_get_attr(datatype, type_keyval, attribute_val, flag, ierror) 39 BIND(C) 40 TYPE(MPI_Datatype), INTENT(IN) :: datatype 41 INTEGER, INTENT(IN) :: type_keyval 42 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: attribute_val 43 LOGICAL, INTENT(OUT) :: flag 44 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 45 46 MPI_TYPE_GET_ATTR(DATATYPE, TYPE_KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR) 47 INTEGER DATATYPE, TYPE_KEYVAL, IERROR 48 INTEGER(KIND=MPI_ADDRESS_KIND) ATTRIBUTE_VAL

MPI_TYPE_DELETE_ATTR(datatype, type_keyval)

  INOUT  datatype     datatype from which the attribute is deleted (handle)
  IN     type_keyval  key value (integer)

int MPI_Type_delete_attr(MPI_Datatype datatype, int type_keyval)

MPI_Type_delete_attr(datatype, type_keyval, ierror) BIND(C)
  TYPE(MPI_Datatype), INTENT(IN) :: datatype
  INTEGER, INTENT(IN) :: type_keyval
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_DELETE_ATTR(DATATYPE, TYPE_KEYVAL, IERROR)
  INTEGER DATATYPE, TYPE_KEYVAL, IERROR

6.7.5 Error Class for Invalid Keyval

Key values for attributes are system-allocated, by
MPI_{TYPE,COMM,WIN}_CREATE_KEYVAL. Only such values can be passed to the functions
that use key values as input arguments. In order to signal that an erroneous key value has
been passed to one of these functions, there is a new MPI error class: MPI_ERR_KEYVAL.
It can be returned by MPI_ATTR_GET, MPI_ATTR_DELETE, MPI_ATTR_PUT,
MPI_KEYVAL_FREE, MPI_{TYPE,COMM,WIN}_DELETE_ATTR,
MPI_{TYPE,COMM,WIN}_SET_ATTR, MPI_{TYPE,COMM,WIN}_GET_ATTR,
MPI_{TYPE,COMM,WIN}_FREE_KEYVAL, MPI_COMM_DUP, MPI_COMM_IDUP,
MPI_COMM_DISCONNECT, and MPI_COMM_FREE. The last four are included because keyval
is an argument to the copy and delete functions for attributes.

6.7.6 Attributes Example

Advice to users. This example shows how to write a collective communication operation
that uses caching to be more efficient after the first call. (End of advice to users.)

/* key for this module's stuff: */
static int gop_key = MPI_KEYVAL_INVALID;

typedef struct
{
  int ref_count;   /* reference count */
  /* other stuff, whatever else we want */
} gop_stuff_type;

void Efficient_Collective_Op (MPI_Comm comm, ...)
{
  gop_stuff_type *gop_stuff;
  MPI_Group group;
  int foundflag;

  MPI_Comm_group(comm, &group);

  if (gop_key == MPI_KEYVAL_INVALID) /* get a key on first call ever */
  {
    /* get the key while assigning its copy and delete callback
       behavior. */
    if (MPI_Comm_create_keyval(gop_stuff_copier,
                               gop_stuff_destructor,
                               &gop_key, (void *)0) != MPI_SUCCESS)
      MPI_Abort (comm, 99);
  }

  MPI_Comm_get_attr (comm, gop_key, &gop_stuff, &foundflag);
  if (foundflag)
  { /* This module has executed in this group before.
       We will use the cached information */
  }
  else
  { /* This is a group that we have not yet cached anything in.
       We will now do so.
    */

    /* First, allocate storage for the stuff we want,
       and initialize the reference count */

    gop_stuff = (gop_stuff_type *) malloc (sizeof(gop_stuff_type));
    if (gop_stuff == NULL) { /* abort on out-of-memory error */ }

    gop_stuff -> ref_count = 1;

    /* Second, fill in *gop_stuff with whatever we want.
       This part isn't shown here */

    /* Third, store gop_stuff as the attribute value */
    MPI_Comm_set_attr (comm, gop_key, gop_stuff);
  }
  /* Then, in any case, use contents of *gop_stuff
     to do the global op ... */
}

/* The following routine is called by MPI when a group is freed */

int gop_stuff_destructor (MPI_Comm comm, int keyval, void *gop_stuffP,
                          void *extra)
{
  gop_stuff_type *gop_stuff = (gop_stuff_type *)gop_stuffP;
  if (keyval != gop_key) { /* abort -- programming error */ }

  /* The group's being freed removes one reference to gop_stuff */
  gop_stuff -> ref_count -= 1;

  /* If no references remain, then free the storage */
  if (gop_stuff -> ref_count == 0) {
    free((void *)gop_stuff);
  }
  return MPI_SUCCESS;
}

/* The following routine is called by MPI when a group is copied */
int gop_stuff_copier (MPI_Comm comm, int keyval, void *extra,
                      void *gop_stuff_inP, void *gop_stuff_outP, int *flag)
{
  gop_stuff_type *gop_stuff_in = (gop_stuff_type *)gop_stuff_inP;
  gop_stuff_type **gop_stuff_out = (gop_stuff_type **)gop_stuff_outP;
  if (keyval != gop_key) { /* abort -- programming error */ }

  /* The new group adds one reference to this gop_stuff */
  gop_stuff_in -> ref_count += 1;
  *gop_stuff_out = gop_stuff_in;
  return MPI_SUCCESS;
}

6.8 Naming Objects

There are many occasions on which it would be useful to allow a user to associate a
printable identifier with an MPI communicator, window, or datatype, for instance error
reporting, debugging, and profiling. The names attached to opaque objects do not
propagate when the object is duplicated or copied by MPI routines. For communicators
this can be achieved using the following two functions.

MPI_COMM_SET_NAME (comm, comm_name)

  INOUT  comm        communicator whose identifier is to be set (handle)
  IN     comm_name   the character string which is remembered as the name (string)

int MPI_Comm_set_name(MPI_Comm comm, const char *comm_name)

MPI_Comm_set_name(comm, comm_name, ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm
  CHARACTER(LEN=*), INTENT(IN) :: comm_name
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SET_NAME(COMM, COMM_NAME, IERROR)
  INTEGER COMM, IERROR
  CHARACTER*(*) COMM_NAME

MPI_COMM_SET_NAME allows a user to associate a name string with a communicator. The
character string which is passed to MPI_COMM_SET_NAME will be saved inside the MPI
library (so it can be freed by the caller immediately after the call, or allocated on the
stack). Leading spaces in name are significant but trailing ones are not.

MPI_COMM_SET_NAME is a local (non-collective) operation, which only affects the name of
the communicator as seen in the process which made the MPI_COMM_SET_NAME call.
There is no requirement that the same (or any) name be assigned to a communicator in
every process where it exists.

Advice to users. Since MPI_COMM_SET_NAME is provided to help debug code, it is
sensible to give the same name to a communicator in all of the processes where it exists,
to avoid confusion. (End of advice to users.)

The length of the name which can be stored is limited to the value of
MPI_MAX_OBJECT_NAME in Fortran and MPI_MAX_OBJECT_NAME-1 in C to allow for the
null terminator. Attempts to put names longer than this will result in truncation of the
name. MPI_MAX_OBJECT_NAME must have a value of at least 64.

Advice to users. Under circumstances of store exhaustion an attempt to put a name of
any length could fail, therefore the value of MPI_MAX_OBJECT_NAME should be viewed
only as a strict upper bound on the name length, not a guarantee that setting names of
less than this length will always succeed. (End of advice to users.)

Advice to implementors. Implementations which pre-allocate a fixed size space for a name
should use the length of that allocation as the value of MPI_MAX_OBJECT_NAME.
Implementations which allocate space for the name from the heap should still define
MPI_MAX_OBJECT_NAME to be a relatively small value, since the user has to allocate
space for a string of up to this size when calling MPI_COMM_GET_NAME. (End of advice
to implementors.)

MPI_COMM_GET_NAME (comm, comm_name, resultlen)

  IN    comm        communicator whose name is to be returned (handle)
  OUT   comm_name   the name previously stored on the communicator, or an empty
                    string if no such name exists (string)
  OUT   resultlen   length of returned name (integer)

int MPI_Comm_get_name(MPI_Comm comm, char *comm_name, int *resultlen)

MPI_Comm_get_name(comm, comm_name, resultlen, ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm
  CHARACTER(LEN=MPI_MAX_OBJECT_NAME), INTENT(OUT) :: comm_name
  INTEGER, INTENT(OUT) :: resultlen
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GET_NAME(COMM, COMM_NAME, RESULTLEN, IERROR)
  INTEGER COMM, RESULTLEN, IERROR
  CHARACTER*(*) COMM_NAME

MPI_COMM_GET_NAME returns the last name which has previously been associated with
the given communicator. The name may be set and retrieved from any language. The
same name will be returned independent of the language used. name should be allocated
so that it can hold a resulting string of length MPI_MAX_OBJECT_NAME characters.
MPI_COMM_GET_NAME returns a copy of the set name in name.

In C, a null character is additionally stored at name[resultlen]. The value of resultlen
cannot be larger than MPI_MAX_OBJECT_NAME-1. In Fortran, name is padded on the
right with blank characters. The value of resultlen cannot be larger than
MPI_MAX_OBJECT_NAME.

If the user has not associated a name with a communicator, or an error occurs,
MPI_COMM_GET_NAME will return an empty string (all spaces in Fortran, "" in C). The
three predefined communicators will have predefined names associated with them. Thus,
the names of MPI_COMM_WORLD, MPI_COMM_SELF, and the communicator returned by
MPI_COMM_GET_PARENT (if not MPI_COMM_NULL) will have the default of
MPI_COMM_WORLD, MPI_COMM_SELF, and MPI_COMM_PARENT. The fact that the system
may have chosen to give a default name to a communicator does not prevent the user
from setting a name on the same communicator; doing this removes the old name and
assigns the new one.

Rationale. We provide separate functions for setting and getting the name of a
communicator, rather than simply providing a predefined attribute key, for the following
reasons:

• It is not, in general, possible to store a string as an attribute from Fortran.

• It is not easy to set up the delete function for a string attribute unless it is known to
  have been allocated from the heap.

• To make the attribute key useful additional code to call strdup is necessary. If this is
  not standardized then users have to write it. This is extra unneeded work which we
  can easily eliminate.

• The Fortran binding is not trivial to write (it will depend on details of the Fortran
  compilation system), and will not be portable. Therefore it should be in the library
  rather than in user code.

(End of rationale.)

Advice to users. The above definition means that it is safe simply to print the string
returned by MPI_COMM_GET_NAME, as it is always a valid string even if there was no
name.

Note that associating a name with a communicator has no effect on the semantics of an
MPI program, and will (necessarily) increase the store requirement of the program, since
the names must be saved.

Therefore there is no requirement that users use these functions to associate names with
communicators. However debugging and profiling MPI applications may be made easier if
names are associated with communicators, since the debugger or profiler should then be
able to present information in a less cryptic manner. (End of advice to users.)

The following functions are used for setting and getting names of datatypes. The constant
MPI_MAX_OBJECT_NAME also applies to these names.

MPI_TYPE_SET_NAME (datatype, type_name)

  INOUT  datatype    datatype whose identifier is to be set (handle)
  IN     type_name   the character string which is remembered as the name (string)

int MPI_Type_set_name(MPI_Datatype datatype, const char *type_name)

MPI_Type_set_name(datatype, type_name, ierror) BIND(C)
  TYPE(MPI_Datatype), INTENT(IN) :: datatype
  CHARACTER(LEN=*), INTENT(IN) :: type_name
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_SET_NAME(DATATYPE, TYPE_NAME, IERROR)
  INTEGER DATATYPE, IERROR
  CHARACTER*(*) TYPE_NAME

MPI_TYPE_GET_NAME (datatype, type_name, resultlen)

  IN    datatype    datatype whose name is to be returned (handle)
  OUT   type_name   the name previously stored on the datatype, or an empty string
                    if no such name exists (string)
  OUT   resultlen   length of returned name (integer)

int MPI_Type_get_name(MPI_Datatype datatype, char *type_name, int
             *resultlen)

MPI_Type_get_name(datatype, type_name, resultlen, ierror) BIND(C)
  TYPE(MPI_Datatype), INTENT(IN) :: datatype
  CHARACTER(LEN=MPI_MAX_OBJECT_NAME), INTENT(OUT) :: type_name
  INTEGER, INTENT(OUT) :: resultlen
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TYPE_GET_NAME(DATATYPE, TYPE_NAME, RESULTLEN, IERROR)
  INTEGER DATATYPE, RESULTLEN, IERROR
  CHARACTER*(*) TYPE_NAME

Named predefined datatypes have the default names of the datatype name. For example,
MPI_WCHAR has the default name of MPI_WCHAR.
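The following C fragment (an illustrative sketch, not normative text) attaches names to a
duplicated communicator and a derived datatype and reads one of them back. It assumes
MPI has been initialized; the name strings are arbitrary placeholders.

MPI_Comm     libcomm;
MPI_Datatype rowtype;
char         name[MPI_MAX_OBJECT_NAME];
int          resultlen;

MPI_Comm_dup(MPI_COMM_WORLD, &libcomm);
MPI_Comm_set_name(libcomm, "solver_comm");          /* placeholder name */

MPI_Type_contiguous(100, MPI_DOUBLE, &rowtype);
MPI_Type_set_name(rowtype, "row_of_100_doubles");   /* placeholder name */

MPI_Comm_get_name(libcomm, name, &resultlen);       /* name = "solver_comm" */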

The following functions are used for setting and getting names of windows. The constant
MPI_MAX_OBJECT_NAME also applies to these names.

MPI_WIN_SET_NAME (win, win_name)

  INOUT  win        window whose identifier is to be set (handle)
  IN     win_name   the character string which is remembered as the name (string)

int MPI_Win_set_name(MPI_Win win, const char *win_name)

MPI_Win_set_name(win, win_name, ierror) BIND(C)
  TYPE(MPI_Win), INTENT(IN) :: win
  CHARACTER(LEN=*), INTENT(IN) :: win_name
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_SET_NAME(WIN, WIN_NAME, IERROR)
  INTEGER WIN, IERROR
  CHARACTER*(*) WIN_NAME

MPI_WIN_GET_NAME (win, win_name, resultlen)

  IN    win         window whose name is to be returned (handle)
  OUT   win_name    the name previously stored on the window, or an empty string if
                    no such name exists (string)
  OUT   resultlen   length of returned name (integer)

int MPI_Win_get_name(MPI_Win win, char *win_name, int *resultlen)

MPI_Win_get_name(win, win_name, resultlen, ierror) BIND(C)
  TYPE(MPI_Win), INTENT(IN) :: win
  CHARACTER(LEN=MPI_MAX_OBJECT_NAME), INTENT(OUT) :: win_name
  INTEGER, INTENT(OUT) :: resultlen
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_GET_NAME(WIN, WIN_NAME, RESULTLEN, IERROR)
  INTEGER WIN, RESULTLEN, IERROR
  CHARACTER*(*) WIN_NAME

6.9 Formalizing the Loosely Synchronous Model

In this section, we make further statements about the loosely synchronous model, with
particular attention to intra-communication.

6.9.1 Basic Statements

When a caller passes a communicator (that contains a context and group) to a callee, that
communicator must be free of side effects throughout execution of the subprogram: there
should be no active operations on that communicator that might involve the process.
This provides one model in which libraries can be written, and work "safely."

For libraries so designated, the callee has permission to do whatever communication it
likes with the communicator, and under the above guarantee knows that no other
communications will interfere. Since we permit good implementations to create new
communicators without synchronization (such as by preallocated contexts on
communicators), this does not impose a significant overhead.

This form of safety is analogous to other common computer-science usages, such as
passing a descriptor of an array to a library routine. The library routine has every right to
expect such a descriptor to be valid and modifiable.

6.9.2 Models of Execution

In the loosely synchronous model, transfer of control to a parallel procedure is effected by
having each executing process invoke the procedure. The invocation is a collective
operation: it is executed by all processes in the execution group, and invocations are
similarly ordered at all processes. However, the invocation need not be synchronized.

We say that a parallel procedure is active in a process if the process belongs to a group
that may collectively execute the procedure, and some member of that group is currently
executing the procedure code. If a parallel procedure is active in a process, then this
process may be receiving messages pertaining to this procedure, even if it does not
currently execute the code of this procedure.

Static Communicator Allocation

This covers the case where, at any point in time, at most one invocation of a parallel
procedure can be active at any process, and the group of executing processes is fixed. For
example, all invocations of parallel procedures involve all processes, processes are
single-threaded, and there are no recursive invocations.

In such a case, a communicator can be statically allocated to each procedure. The static
allocation can be done in a preamble, as part of initialization code. If the parallel
procedures can be organized into libraries, so that only one procedure of each library can
be concurrently active in each processor, then it is sufficient to allocate one communicator
per library.

Dynamic Communicator Allocation

Calls of parallel procedures are well-nested if a new parallel procedure is always invoked
in a subset of a group executing the same parallel procedure. Thus, processes that execute
the same parallel procedure have the same execution stack.

In such a case, a new communicator needs to be dynamically allocated for each new
invocation of a parallel procedure. The allocation is done by the caller. A new
communicator can be generated by a call to MPI_COMM_DUP, if the callee execution
group is identical to the caller execution group, or by a call to MPI_COMM_SPLIT if the
caller execution group is split into several subgroups executing distinct parallel routines.
The new communicator is passed as an argument to the invoked routine.

The need for generating a new communicator at each invocation can be alleviated or
avoided altogether in some cases: If the execution group is not split, then one can allocate
a stack of communicators in a preamble, and next manage the stack in a way that mimics
the stack of recursive calls.
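A minimal C sketch of the dynamic allocation pattern just described; parallel_solve is a
hypothetical parallel procedure, and the wrapper below is illustrative only.

void parallel_solve(MPI_Comm comm);   /* hypothetical parallel procedure */

void call_parallel_solve(MPI_Comm comm)
{
  MPI_Comm callcomm;

  MPI_Comm_dup(comm, &callcomm);   /* callee execution group == caller group */
  parallel_solve(callcomm);        /* all callee communication uses callcomm */
  MPI_Comm_free(&callcomm);
}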

One can also take advantage of the well-ordering property of communication to avoid
confusing caller and callee communication, even if both use the same communicator. To
do so, one needs to abide by the following two rules:

• messages sent before a procedure call (or before a return from the procedure) are also
  received before the matching call (or return) at the receiving end;

• messages are always selected by source (no use is made of MPI_ANY_SOURCE).

The General Case

In the general case, there may be multiple concurrently active invocations of the same
parallel procedure within the same group; invocations may not be well-nested. A new
communicator needs to be created for each invocation. It is the user's responsibility to
make sure that, should two distinct parallel procedures be invoked concurrently on
overlapping sets of processes, communicator creation is properly coordinated.


Chapter 7

Process Topologies

7.1 Introduction

This chapter discusses the MPI topology mechanism. A topology is an extra, optional
attribute that one can give to an intra-communicator; topologies cannot be added to
inter-communicators. A topology can provide a convenient naming mechanism for the
processes of a group (within a communicator), and additionally, may assist the runtime
system in mapping the processes onto hardware.

As stated in Chapter 6, a process group in MPI is a collection of n processes. Each process
in the group is assigned a rank between 0 and n-1. In many parallel applications a linear
ranking of processes does not adequately reflect the logical communication pattern of the
processes (which is usually determined by the underlying problem geometry and the
numerical algorithm used). Often the processes are arranged in topological patterns such
as two- or three-dimensional grids. More generally, the logical process arrangement is
described by a graph. In this chapter we will refer to this logical process arrangement as
the "virtual topology."

A clear distinction must be made between the virtual process topology and the topology
of the underlying, physical hardware. The virtual topology can be exploited by the system
in the assignment of processes to physical processors, if this helps to improve the
communication performance on a given machine. How this mapping is done, however, is
outside the scope of MPI. The description of the virtual topology, on the other hand,
depends only on the application, and is machine-independent. The functions that are
described in this chapter deal with machine-independent mapping and communication on
virtual process topologies.

Rationale. Though physical mapping is not discussed, the existence of the virtual
topology information may be used as advice by the runtime system. There are well-known
techniques for mapping grid/torus structures to hardware topologies such as hypercubes
or grids. For more complicated graph structures good heuristics often yield nearly optimal
results [44]. On the other hand, if there is no way for the user to specify the logical process
arrangement as a "virtual topology," a random mapping is most likely to result. On some
machines, this will lead to unnecessary contention in the interconnection network. Some
details about predicted and measured performance improvements that result from good
process-to-processor mapping on modern wormhole-routing architectures can be found in
[11, 12].

Besides possible performance benefits, the virtual topology can function as a convenient,
process-naming structure, with significant benefits for program readability and notational
power in message-passing programming. (End of rationale.)

7.2 Virtual Topologies

The communication pattern of a set of processes can be represented by a graph. The
nodes represent processes, and the edges connect processes that communicate with each
other. MPI provides message-passing between any pair of processes in a group. There is
no requirement for opening a channel explicitly. Therefore, a "missing link" in the
user-defined process graph does not prevent the corresponding processes from exchanging
messages. It means rather that this connection is neglected in the virtual topology. This
strategy implies that the topology gives no convenient way of naming this pathway of
communication. Another possible consequence is that an automatic mapping tool (if one
exists for the runtime environment) will not take account of this edge when mapping.

Specifying the virtual topology in terms of a graph is sufficient for all applications.
However, in many applications the graph structure is regular, and the detailed set-up of
the graph would be inconvenient for the user and might be less efficient at run time. A
large fraction of all parallel applications use process topologies like rings, two- or
higher-dimensional grids, or tori. These structures are completely defined by the number
of dimensions and the numbers of processes in each coordinate direction. Also, the
mapping of grids and tori is generally an easier problem than that of general graphs.
Thus, it is desirable to address these cases explicitly.

Process coordinates in a Cartesian structure begin their numbering at 0. Row-major
numbering is always used for the processes in a Cartesian structure. This means that, for
example, the relation between group rank and coordinates for four processes in a (2 x 2)
grid is as follows.

    coord (0,0): rank 0
    coord (0,1): rank 1
    coord (1,0): rank 2
    coord (1,1): rank 3

7.3 Embedding in MPI

The support for virtual topologies as defined in this chapter is consistent with other parts
of MPI, and, whenever possible, makes use of functions that are defined elsewhere.
Topology information is associated with communicators. It is added to communicators
using the caching mechanism described in Chapter 6.

7.4 Overview of the Functions

MPI supports three topology types: Cartesian, graph, and distributed graph. The function
MPI_CART_CREATE is used to create Cartesian topologies, the function
MPI_GRAPH_CREATE is used to create graph topologies, and the functions
MPI_DIST_GRAPH_CREATE_ADJACENT and MPI_DIST_GRAPH_CREATE are used to create
distributed graph topologies.

These topology creation functions are collective. As with other collective calls, the
program must be written to work correctly, whether the call synchronizes or not.

The topology creation functions take as input an existing communicator comm_old, which
defines the set of processes on which the topology is to be mapped. For
MPI_CART_CREATE and MPI_GRAPH_CREATE, all input arguments must have identical
values on all processes of the group of comm_old. When calling MPI_GRAPH_CREATE,
each process specifies all nodes and edges in the graph. In contrast, the functions
MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE are used to specify
the graph in a distributed fashion, whereby each process only specifies a subset of the
edges in the graph such that the entire graph structure is defined collectively across the
set of processes. Therefore the processes provide different values for the arguments
specifying the graph. However, all processes must give the same value for the reorder and
info arguments. In all cases, a new communicator comm_topol is created that carries the
topological structure as cached information (see Chapter 6). In analogy to function
MPI_COMM_CREATE, no cached information propagates from comm_old to comm_topol.

MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension.
For each coordinate direction one specifies whether the process structure is periodic or
not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per
coordinate direction. Thus, special support for hypercube structures is not necessary. The
local auxiliary function MPI_DIMS_CREATE can be used to compute a balanced
distribution of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS [13] and PARMACS. (End of
rationale.)

MPI defines functions to query a communicator for topology information. The function
MPI_TOPO_TEST is used to query for the type of topology associated with a
communicator. Depending on the topology type, different information can be extracted.
For a graph topology, the functions MPI_GRAPHDIMS_GET and MPI_GRAPH_GET return
the values that were specified in the call to MPI_GRAPH_CREATE. Additionally, the
functions MPI_GRAPH_NEIGHBORS_COUNT and MPI_GRAPH_NEIGHBORS can be used to
obtain the neighbors of an arbitrary node in the graph. For a distributed graph topology,
the functions MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS
can be used to obtain the neighbors of the calling process. For a Cartesian topology, the
functions MPI_CARTDIM_GET and MPI_CART_GET return the values that were specified
in the call to MPI_CART_CREATE. Additionally, the functions MPI_CART_RANK and
MPI_CART_COORDS translate Cartesian coordinates into a group rank, and vice-versa.
The function MPI_CART_SHIFT provides the information needed to communicate with
neighbors along a Cartesian dimension. All of these query functions are local.

For Cartesian topologies, the function MPI_CART_SUB can be used to extract a Cartesian
subspace (analogous to MPI_COMM_SPLIT). This function is collective over the input
communicator's group.

The two additional functions, MPI_GRAPH_MAP and MPI_CART_MAP, are, in general, not
called by the user directly. However, together with the communicator manipulation
functions presented in Chapter 6, they are sufficient to implement all other topology
functions. Section 7.5.8 outlines such an implementation.

The neighborhood collective communication routines MPI_NEIGHBOR_ALLGATHER,
MPI_NEIGHBOR_ALLGATHERV, MPI_NEIGHBOR_ALLTOALL, MPI_NEIGHBOR_ALLTOALLV,
and MPI_NEIGHBOR_ALLTOALLW communicate with the nearest neighbors on the topology
associated with the communicator. The nonblocking variants are
MPI_INEIGHBOR_ALLGATHER, MPI_INEIGHBOR_ALLGATHERV,
MPI_INEIGHBOR_ALLTOALL, MPI_INEIGHBOR_ALLTOALLV, and
MPI_INEIGHBOR_ALLTOALLW.

7.5 Topology Constructors

7.5.1 Cartesian Constructor

MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)

  IN    comm_old    input communicator (handle)
  IN    ndims       number of dimensions of Cartesian grid (integer)
  IN    dims        integer array of size ndims specifying the number of processes in
                    each dimension
  IN    periods     logical array of size ndims specifying whether the grid is periodic
                    (true) or not (false) in each dimension
  IN    reorder     ranking may be reordered (true) or not (false) (logical)
  OUT   comm_cart   communicator with new Cartesian topology (handle)

int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[], const
             int periods[], int reorder, MPI_Comm *comm_cart)

MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart, ierror)
             BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm_old
  INTEGER, INTENT(IN) :: ndims, dims(ndims)
  LOGICAL, INTENT(IN) :: periods(ndims), reorder
  TYPE(MPI_Comm), INTENT(OUT) :: comm_cart
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR)
  INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
  LOGICAL PERIODS(*), REORDER

MPI_CART_CREATE returns a handle to a new communicator to which the Cartesian
topology information is attached. If reorder = false then the rank of each process in the
new group is identical to its rank in the old group. Otherwise, the function may reorder
the processes (possibly so as to choose a good embedding of the virtual topology onto the
physical machine). If the total size of the Cartesian grid is smaller than the size of the
group of comm_old, then some processes are returned MPI_COMM_NULL, in analogy to
MPI_COMM_SPLIT. If ndims is zero then a zero-dimensional Cartesian topology is created.
The call is erroneous if it specifies a grid that is larger than the group size or if ndims is
negative.
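For illustration (a sketch, not part of the standard text), the following C fragment creates
a periodic 4 x 3 Cartesian communicator; it assumes the job was started with at least 12
processes, so any ranks beyond the 12 grid slots receive MPI_COMM_NULL.

MPI_Comm comm_cart;
int dims[2]    = {4, 3};
int periods[2] = {1, 1};           /* periodic in both dimensions (a torus) */
int coords[2], myrank;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);
if (comm_cart != MPI_COMM_NULL) {
  MPI_Comm_rank(comm_cart, &myrank);
  MPI_Cart_coords(comm_cart, myrank, 2, coords);   /* my (row, column) */
}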

7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE

For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced
distribution of processes per coordinate direction, depending on the number of processes
in the group to be balanced and optional constraints that can be specified by the user.
One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an
n-dimensional topology.

MPI_DIMS_CREATE(nnodes, ndims, dims)

  IN     nnodes   number of nodes in a grid (integer)
  IN     ndims    number of Cartesian dimensions (integer)
  INOUT  dims     integer array of size ndims specifying the number of nodes in each
                  dimension

int MPI_Dims_create(int nnodes, int ndims, int dims[])

MPI_Dims_create(nnodes, ndims, dims, ierror) BIND(C)
  INTEGER, INTENT(IN) :: nnodes, ndims
  INTEGER, INTENT(INOUT) :: dims(ndims)
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
  INTEGER NNODES, NDIMS, DIMS(*), IERROR

The entries in the array dims are set to describe a Cartesian grid with ndims dimensions
and a total of nnodes nodes. The dimensions are set to be as close to each other as
possible, using an appropriate divisibility algorithm. The caller may further constrain the
operation of this routine by specifying elements of array dims. If dims[i] is set to a positive
number, the routine will not modify the number of nodes in dimension i; only those
entries where dims[i] = 0 are modified by the call.

Negative input values of dims[i] are erroneous. An error will occur if nnodes is not a
multiple of the product of all dims[i] with dims[i] != 0.

For dims[i] set by the call, dims[i] will be ordered in non-increasing order. Array dims is
suitable for use as input to routine MPI_CART_CREATE. MPI_DIMS_CREATE is local.

Example 7.1

   dims before call    function call                  dims on return
   (0,0)               MPI_DIMS_CREATE(6, 2, dims)    (3,2)
   (0,0)               MPI_DIMS_CREATE(7, 2, dims)    (7,1)
   (0,3,0)             MPI_DIMS_CREATE(6, 3, dims)    (2,3,1)
   (0,3,0)             MPI_DIMS_CREATE(7, 3, dims)    erroneous call
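As an illustrative sketch (not normative text), MPI_DIMS_CREATE is typically combined
with MPI_CART_CREATE so that the grid shape adapts to the number of processes
actually started:

MPI_Comm grid_comm;
int nprocs, dims[2] = {0, 0}, periods[2] = {0, 0};

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);    /* e.g. 6 -> (3,2), 7 -> (7,1) */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);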

7.5.3 Graph Constructor

MPI_GRAPH_CREATE(comm_old, nnodes, index, edges, reorder, comm_graph)

  IN    comm_old     input communicator (handle)
  IN    nnodes       number of nodes in graph (integer)
  IN    index        array of integers describing node degrees (see below)
  IN    edges        array of integers describing graph edges (see below)
  IN    reorder      ranking may be reordered (true) or not (false) (logical)
  OUT   comm_graph   communicator with graph topology added (handle)

int MPI_Graph_create(MPI_Comm comm_old, int nnodes, const int index[],
             const int edges[], int reorder, MPI_Comm *comm_graph)

MPI_Graph_create(comm_old, nnodes, index, edges, reorder, comm_graph,
             ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm_old
  INTEGER, INTENT(IN) :: nnodes, index(nnodes), edges(*)
  LOGICAL, INTENT(IN) :: reorder
  TYPE(MPI_Comm), INTENT(OUT) :: comm_graph
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GRAPH_CREATE(COMM_OLD, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH,
             IERROR)
  INTEGER COMM_OLD, NNODES, INDEX(*), EDGES(*), COMM_GRAPH, IERROR
  LOGICAL REORDER

MPI_GRAPH_CREATE returns a handle to a new communicator to which the graph
topology information is attached. If reorder = false then the rank of each process in the
new group is identical to its rank in the old group. Otherwise, the function may reorder
the processes. If the size, nnodes, of the graph is smaller than the size of the group of
comm_old, then some processes are returned MPI_COMM_NULL, in analogy to
MPI_CART_CREATE and MPI_COMM_SPLIT. If the graph is empty, i.e., nnodes == 0, then
MPI_COMM_NULL is returned in all processes. The call is erroneous if it specifies a graph
that is larger than the group size of the input communicator.

The three parameters nnodes, index and edges define the graph structure. nnodes is the
number of nodes of the graph. The nodes are numbered from 0 to nnodes-1. The i-th
entry of array index stores the total number of neighbors of the first i graph nodes. The
lists of neighbors of nodes 0, 1, ..., nnodes-1 are stored in consecutive locations in array
edges. The array edges is a flattened representation of the edge lists. The total number of
entries in index is nnodes and the total number of entries in edges is equal to the number
of graph edges.

The definitions of the arguments nnodes, index, and edges are illustrated with the
following simple example.

Example 7.2

Assume there are four processes 0, 1, 2, 3 with the following adjacency matrix:

   process    neighbors
   0          1, 3
   1          0
   2          3
   3          0, 2

Then, the input arguments are:

   nnodes =   4
   index  =   2, 3, 4, 6
   edges  =   1, 3, 0, 3, 0, 2

Thus, in C, index[0] is the degree of node zero, and index[i] - index[i-1] is the degree of
node i, i=1, ..., nnodes-1; the list of neighbors of node zero is stored in edges[j], for
0 <= j <= index[0] - 1, and the list of neighbors of node i, i > 0, is stored in edges[j],
index[i-1] <= j <= index[i] - 1.

In Fortran, index(1) is the degree of node zero, and index(i+1) - index(i) is the degree of
node i, i=1, ..., nnodes-1; the list of neighbors of node zero is stored in edges(j), for
1 <= j <= index(1), and the list of neighbors of node i, i > 0, is stored in edges(j),
index(i) + 1 <= j <= index(i+1).

A single process is allowed to be defined multiple times in the list of neighbors of a process
(i.e., there may be multiple edges between two processes). A process is also allowed to be
a neighbor to itself (i.e., a self loop in the graph). The adjacency matrix is allowed to be
non-symmetric.

Advice to users. Performance implications of using multiple edges or a non-symmetric
adjacency matrix are not defined. The definition of a node-neighbor edge does not imply a
direction of the communication. (End of advice to users.)

Advice to implementors. The following topology information is likely to be stored with a
communicator:

• Type of topology (Cartesian/graph),

• For a Cartesian topology:
    1. ndims (number of dimensions),
    2. dims (numbers of processes per coordinate direction),
    3. periods (periodicity information),
    4. own_position (own position in grid, could also be computed from rank and dims)

• For a graph topology:
    1. index,
    2. edges,
  which are the vectors defining the graph structure.

For a graph structure the number of nodes is equal to the number of processes in the
group. Therefore, the number of nodes does not have to be stored explicitly. An additional
zero entry at the start of array index simplifies access to the topology information. (End
of advice to implementors.)
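For illustration, the graph of Example 7.2 could be attached to a communicator as follows
(a sketch assuming the group of MPI_COMM_WORLD has at least four processes):

MPI_Comm comm_graph;
int index[4] = {2, 3, 4, 6};
int edges[6] = {1, 3, 0, 3, 0, 2};

MPI_Graph_create(MPI_COMM_WORLD, 4, index, edges, 0, &comm_graph);
/* with more than four processes, the extra ranks receive MPI_COMM_NULL */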

7.5.4 Distributed Graph Constructor

MPI_GRAPH_CREATE requires that each process passes the full (global) communication
graph to the call. This limits the scalability of this constructor. With the distributed
graph interface, the communication graph is specified in a fully distributed fashion. Each
process specifies only the part of the communication graph of which it is aware. Typically,
this could be the set of processes from which the process will eventually receive or get
data, or the set of processes to which the process will send or put data, or some
combination of such edges. Two different interfaces can be used to create a distributed
graph topology. MPI_DIST_GRAPH_CREATE_ADJACENT creates a distributed graph
communicator with each process specifying each of its incoming and outgoing (adjacent)
edges in the logical communication graph and thus requires minimal communication
during creation. MPI_DIST_GRAPH_CREATE provides full flexibility such that any process
can indicate that communication will occur between any pair of processes in the graph.

To provide better possibilities for optimization by the MPI library, the distributed graph
constructors permit weighted communication edges and take an info argument that can
further influence process reordering or other optimizations performed by the MPI library.
For example, hints can be provided on how edge weights are to be interpreted, the quality
of the reordering, and/or the time permitted for the MPI library to process the graph.

MPI_DIST_GRAPH_CREATE_ADJACENT(comm_old, indegree, sources, sourceweights,
outdegree, destinations, destweights, info, reorder, comm_dist_graph)

  IN    comm_old          input communicator (handle)
  IN    indegree          size of sources and sourceweights arrays (non-negative
                          integer)
  IN    sources           ranks of processes for which the calling process is a
                          destination (array of non-negative integers)
  IN    sourceweights     weights of the edges into the calling process (array of
                          non-negative integers)
  IN    outdegree         size of destinations and destweights arrays (non-negative
                          integer)
  IN    destinations      ranks of processes for which the calling process is a source
                          (array of non-negative integers)
  IN    destweights       weights of the edges out of the calling process (array of
                          non-negative integers)
  IN    info              hints on optimization and interpretation of weights (handle)
  IN    reorder           the ranks may be reordered (true) or not (false) (logical)
  OUT   comm_dist_graph   communicator with distributed graph topology (handle)

int MPI_Dist_graph_create_adjacent(MPI_Comm comm_old, int indegree, const
             int sources[], const int sourceweights[], int outdegree, const
             int destinations[], const int destweights[], MPI_Info info,
             int reorder, MPI_Comm *comm_dist_graph)

MPI_Dist_graph_create_adjacent(comm_old, indegree, sources, sourceweights,
             outdegree, destinations, destweights, info, reorder,
             comm_dist_graph, ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm_old
  INTEGER, INTENT(IN) :: indegree, sources(indegree), outdegree,
             destinations(outdegree)
  INTEGER, INTENT(IN) :: sourceweights(*), destweights(*)
  TYPE(MPI_Info), INTENT(IN) :: info
  LOGICAL, INTENT(IN) :: reorder
  TYPE(MPI_Comm), INTENT(OUT) :: comm_dist_graph
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_DIST_GRAPH_CREATE_ADJACENT(COMM_OLD, INDEGREE, SOURCES, SOURCEWEIGHTS,
             OUTDEGREE, DESTINATIONS, DESTWEIGHTS, INFO, REORDER,
             COMM_DIST_GRAPH, IERROR)
  INTEGER COMM_OLD, INDEGREE, SOURCES(*), SOURCEWEIGHTS(*), OUTDEGREE,
             DESTINATIONS(*), DESTWEIGHTS(*), INFO, COMM_DIST_GRAPH, IERROR
  LOGICAL REORDER

MPI_DIST_GRAPH_CREATE_ADJACENT returns a handle to a new communicator to which
the distributed graph topology information is attached. Each process passes all
information about its incoming and outgoing edges in the virtual distributed graph
topology. The calling processes must ensure that each edge of the graph is described in
the source and in the destination process with the same weights. If there are multiple
edges for a given (source,dest) pair, then the sequence of the weights of these edges does
not matter. The complete communication topology is the combination of all edges shown
in the sources arrays of all processes in comm_old, which must be identical to the
combination of all edges shown in the destinations arrays. Source and destination ranks
must be process ranks of comm_old. This allows a fully distributed specification of the
communication graph. Isolated processes (i.e., processes with no outgoing or incoming
edges, that is, processes that have specified indegree and outdegree as zero and thus do
not occur as source or destination rank in the graph specification) are allowed.

The call creates a new communicator comm_dist_graph of distributed graph topology
type to which topology information has been attached. The number of processes in
comm_dist_graph is identical to the number of processes in comm_old. The call to
MPI_DIST_GRAPH_CREATE_ADJACENT is collective.

Weights are specified as non-negative integers and can be used to influence the process
remapping strategy and other internal MPI optimizations. For instance, approximate
count arguments of later communication calls along specific edges could be used as their
edge weights. Multiplicity of edges can likewise indicate more intense communication
between pairs of processes. However, the exact meaning of edge weights is not specified by
the MPI standard and is left to the implementation. In C or Fortran, an application can
supply the special value MPI_UNWEIGHTED for the weight array to indicate that all edges
have the same (effectively no) weight. It is erroneous to supply MPI_UNWEIGHTED for
some but not all processes of comm_old.

If the graph is weighted but indegree or outdegree is zero, then MPI_WEIGHTS_EMPTY or
any arbitrary array may be passed to sourceweights or destweights, respectively. Note
that MPI_UNWEIGHTED and MPI_WEIGHTS_EMPTY are not special weight values; rather
they are special values for the total array argument. In Fortran, MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY are objects like MPI_BOTTOM (not usable for initialization or
assignment). See Section 2.5.4.

Advice to users. In the case of an empty weights array argument passed while
constructing a weighted graph, one should not pass NULL because the value of
MPI_UNWEIGHTED may be equal to NULL. The value of this argument would then be
indistinguishable from MPI_UNWEIGHTED to the implementation. In this case
MPI_WEIGHTS_EMPTY should be used instead. (End of advice to users.)

Advice to implementors. It is recommended that MPI_UNWEIGHTED not be implemented
as NULL. (End of advice to implementors.)

Rationale. To ensure backward compatibility, MPI_UNWEIGHTED may still be
implemented as NULL. See Annex B.1 on page 787. (End of rationale.)
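As an illustrative sketch (not normative text), the graph of Example 7.2 could be built
with MPI_DIST_GRAPH_CREATE_ADJACENT as follows; the hand-written per-rank tables
below assume exactly four processes, and the graph is left unweighted:

const int indeg[4]   = {2, 1, 1, 2};
const int outdeg[4]  = {2, 1, 1, 2};
const int srcs[4][2] = {{1,3}, {0,0}, {3,0}, {0,2}};   /* unused slots are 0 */
const int dsts[4][2] = {{1,3}, {0,0}, {3,0}, {0,2}};
MPI_Comm comm_dist_graph;
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
        indeg[rank],  srcs[rank], MPI_UNWEIGHTED,
        outdeg[rank], dsts[rank], MPI_UNWEIGHTED,
        MPI_INFO_NULL, 0, &comm_dist_graph);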

The meaning of the info and reorder arguments is defined in the description of the
following routine.

MPI_DIST_GRAPH_CREATE(comm_old, n, sources, degrees, destinations, weights, info,
reorder, comm_dist_graph)

  IN    comm_old          input communicator (handle)
  IN    n                 number of source nodes for which this process specifies edges
                          (non-negative integer)
  IN    sources           array containing the n source nodes for which this process
                          specifies edges (array of non-negative integers)
  IN    degrees           array specifying the number of destinations for each source
                          node in the source node array (array of non-negative integers)
  IN    destinations      destination nodes for the source nodes in the source node
                          array (array of non-negative integers)
  IN    weights           weights for source to destination edges (array of non-negative
                          integers)
  IN    info              hints on optimization and interpretation of weights (handle)
  IN    reorder           the process may be reordered (true) or not (false) (logical)
  OUT   comm_dist_graph   communicator with distributed graph topology added
                          (handle)

int MPI_Dist_graph_create(MPI_Comm comm_old, int n, const int sources[],
             const int degrees[], const int destinations[], const
             int weights[], MPI_Info info, int reorder,
             MPI_Comm *comm_dist_graph)

MPI_Dist_graph_create(comm_old, n, sources, degrees, destinations, weights,
             info, reorder, comm_dist_graph, ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm_old
  INTEGER, INTENT(IN) :: n, sources(n), degrees(n), destinations(*)
  INTEGER, INTENT(IN) :: weights(*)
  TYPE(MPI_Info), INTENT(IN) :: info
  LOGICAL, INTENT(IN) :: reorder
  TYPE(MPI_Comm), INTENT(OUT) :: comm_dist_graph
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_DIST_GRAPH_CREATE(COMM_OLD, N, SOURCES, DEGREES, DESTINATIONS, WEIGHTS,
             INFO, REORDER, COMM_DIST_GRAPH, IERROR)
  INTEGER COMM_OLD, N, SOURCES(*), DEGREES(*), DESTINATIONS(*),
             WEIGHTS(*), INFO, COMM_DIST_GRAPH, IERROR
  LOGICAL REORDER

MPI_DIST_GRAPH_CREATE returns a handle to a new communicator to which the
distributed graph topology information is attached. Concretely, each process calls the
constructor with a set of directed (source,destination) communication edges as described
below. Every process passes an array of n source nodes in the sources array. For each
source node, a non-negative number of destination nodes is specified in the degrees array.
The destination nodes are stored in the corresponding consecutive segment of the
destinations array. More precisely, if the i-th node in sources is s, this specifies degrees[i]
edges (s,d) with d of the j-th such edge stored in
destinations[degrees[0]+...+degrees[i-1]+j]. The weight of this edge is stored in
weights[degrees[0]+...+degrees[i-1]+j]. Both the sources and the destinations arrays may
contain the same node more than once, and the order in which nodes are listed as
destinations or sources is not significant. Similarly, different processes may specify edges
with the same source and destination nodes. Source and destination nodes must be
process ranks of comm_old. Different processes may specify different numbers of source
and destination nodes, as well as different source to destination edges. This allows a fully
distributed specification of the communication graph. Isolated processes (i.e., processes
with no outgoing or incoming edges, that is, processes that do not occur as source or
destination node in the graph specification) are allowed.

The call creates a new communicator comm_dist_graph of distributed graph topology
type to which topology information has been attached. The number of processes in
comm_dist_graph is identical to the number of processes in comm_old. The call to
MPI_DIST_GRAPH_CREATE is collective.

If reorder = false, all processes will have the same rank in comm_dist_graph as in
comm_old. If reorder = true then the MPI library is free to remap to other processes (of
comm_old) in order to improve communication on the edges of the communication graph.
The weight associated with each edge is a hint to the MPI library about the amount or
intensity of communication on that edge, and may be used to compute a "best"
reordering.

Weights are specified as non-negative integers and can be used to influence the process
remapping strategy and other internal MPI optimizations. For instance, approximate
count arguments of later communication calls along specific edges could be used as their
edge weights. Multiplicity of edges can likewise indicate more intense communication
between pairs of processes. However, the exact meaning of edge weights is not specified by
the MPI standard and is left to the implementation.

In C or Fortran, an application can supply the special value MPI_UNWEIGHTED for the
weight array to indicate that all edges have the same (effectively no) weight. It is
erroneous to supply MPI_UNWEIGHTED for some but not all processes of comm_old. If the
graph is weighted but n = 0, then MPI_WEIGHTS_EMPTY or any arbitrary array may be
passed to weights. Note that MPI_UNWEIGHTED and MPI_WEIGHTS_EMPTY are not
special weight values; rather they are special values for the total array argument. In
Fortran, MPI_UNWEIGHTED and MPI_WEIGHTS_EMPTY are objects like MPI_BOTTOM
(not usable for initialization or assignment). See Section 2.5.4.

Advice to users. In the case of an empty weights array argument passed while
constructing a weighted graph, one should not pass NULL because the value of
MPI_UNWEIGHTED may be equal to NULL. The value of this argument would then be
indistinguishable from MPI_UNWEIGHTED to the implementation. In this case
MPI_WEIGHTS_EMPTY should be used instead. (End of advice to users.)

Advice to implementors. It is recommended that MPI_UNWEIGHTED not be implemented
as NULL. (End of advice to implementors.)

Rationale. To ensure backward compatibility, MPI_UNWEIGHTED may still be
implemented as NULL. See Annex B.1 on page 787. (End of rationale.)

The meaning of the weights argument can be influenced by the info argument. Info
arguments can be used to guide the mapping; possible options include minimizing the
maximum number of edges between processes on different SMP nodes, or minimizing the
sum of all such edges. An MPI implementation is not obliged to follow specific hints, and
it is valid for an MPI implementation not to do any reordering. An MPI implementation
may specify more info key-value pairs. All processes must specify the same set of
key-value pairs.

Advice to implementors. MPI implementations must document any additionally
supported info key-value pairs. MPI_INFO_NULL is always valid, and may indicate the
default creation of the distributed graph topology to the MPI library.

An implementation does not explicitly need to construct the topology from its distributed
parts. However, all processes can construct the full topology from the distributed
specification and use this in a call to MPI_GRAPH_CREATE to create the topology. This
may serve as a reference implementation of the functionality, and may be acceptable for
small communicators. However, a scalable high-quality implementation would save the
topology graph in a distributed way. (End of advice to implementors.)

Example 7.3  As for Example 7.2, assume there are four processes 0, 1, 2, 3 with the
following adjacency matrix and unit edge weights:

   process    neighbors
   0          1, 3
   1          0
   2          3
   3          0, 2

With MPI_DIST_GRAPH_CREATE, this graph could be constructed in many different
ways. One way would be that each process specifies its outgoing edges. The arguments
per process would be:

   process   n   sources   degrees   destinations   weights
   0         1   0         2         1,3            1,1
   1         1   1         1         0              1
   2         1   2         1         3              1
   3         1   3         2         0,2            1,1

Another way would be to pass the whole graph on process 0, which could be done with
the following arguments per process:

   process   n   sources    degrees    destinations   weights
   0         4   0,1,2,3    2,1,1,2    1,3,0,3,0,2    1,1,1,1,1,1
   1         0   -          -          -              -
   2         0   -          -          -              -
   3         0   -          -          -              -

In both cases above, the application could supply MPI_UNWEIGHTED instead of explicitly
providing identical weights.

MPI_DIST_GRAPH_CREATE_ADJACENT could be used to specify this graph using the
following arguments:

   process   indegree   sources   sourceweights   outdegree   destinations   destweights
   0         2          1,3       1,1             2           1,3            1,1
   1         1          0         1               1           0              1
   2         1          3         1               1           3              1
   3         2          0,2       1,1             2           0,2            1,1

Example 7.4  A two-dimensional PxQ torus where all processes communicate along the
dimensions and along the diagonal edges. This cannot be modeled with Cartesian
topologies, but can easily be captured with MPI_DIST_GRAPH_CREATE as shown in the
following code. In this example, the communication along the dimensions is twice as
heavy as the communication along the diagonals:

/*
  Input:     dimensions P, Q
  Condition: number of processes equal to P*Q; otherwise only
             ranks smaller than P*Q participate
*/

int rank, x, y;
int sources[1], degrees[1];
int destinations[8], weights[8];
MPI_Comm comm_dist_graph;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* get x and y dimension */
y=rank/P; x=rank%P;

/* get my communication partners along x dimension */
destinations[0] = P*y+(x+1)%P; weights[0] = 2;
destinations[1] = P*y+(P+x-1)%P; weights[1] = 2;

/* get my communication partners along y dimension */
destinations[2] = P*((y+1)%Q)+x; weights[2] = 2;
destinations[3] = P*((Q+y-1)%Q)+x; weights[3] = 2;

/* get my communication partners along diagonals */
destinations[4] = P*((y+1)%Q)+(x+1)%P; weights[4] = 1;
destinations[5] = P*((Q+y-1)%Q)+(x+1)%P; weights[5] = 1;
destinations[6] = P*((y+1)%Q)+(P+x-1)%P; weights[6] = 1;
destinations[7] = P*((Q+y-1)%Q)+(P+x-1)%P; weights[7] = 1;

sources[0] = rank;
degrees[0] = 8;
MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations,
                      weights, MPI_INFO_NULL, 1, &comm_dist_graph);

7.5.5 Topology Inquiry Functions

If a topology has been defined with one of the above functions, then the topology
information can be looked up using inquiry functions. They all are local calls.

MPI_TOPO_TEST(comm, status)

  IN    comm     communicator (handle)
  OUT   status   topology type of communicator comm (state)

int MPI_Topo_test(MPI_Comm comm, int *status)

MPI_Topo_test(comm, status, ierror) BIND(C)
  TYPE(MPI_Comm), INTENT(IN) :: comm
  INTEGER, INTENT(OUT) :: status
  INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_TOPO_TEST(COMM, STATUS, IERROR)
  INTEGER COMM, STATUS, IERROR

The function MPI_TOPO_TEST returns the type of topology that is assigned to a
communicator.

The output value status is one of the following:

333 7.5. TOPOLOGY CONSTRUCTORS 303 1 GRAPH _ MPI graph topology 2 MPI _ CART Cartesian topology 3 MPI _ distributed graph topology GRAPH _ DIST 4 MPI _ no topology UNDEFINED 5 6 7 MPI _ GRAPHDIMS GET(comm, nnodes, nedges) _ 8 communicator for group with graph structure (handle) comm IN 9 nnodes OUT number of nodes in graph (integer) (same as number 10 of processes in the group) 11 12 nedges number of edges in graph (integer) OUT 13 14 int MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges) 15 16 MPI_Graphdims_get(comm, nnodes, nedges, ierror) BIND(C) 17 TYPE(MPI_Comm), INTENT(IN) :: comm 18 INTEGER, INTENT(OUT) :: nnodes, nedges 19 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 20 MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERROR) 21 INTEGER COMM, NNODES, NEDGES, IERROR 22 23 retrieve the graph-topology Functions MPI _ GRAPHDIMS _ GET and MPI _ GRAPH _ GET 24 information that was associated with a communicator by _ GRAPH _ CREATE . MPI 25 The information provided by can be used to dimension the GET _ GRAPHDIMS _ MPI 26 and edges . index vectors GET _ GRAPH _ MPI correctly for the following call to 27 28 _ GET(comm, maxindex, maxedges, index, edges) MPI _ GRAPH 29 30 comm IN communicator with graph structure (handle) 31 index length of vector in the calling program IN maxindex 32 (integer) 33 in the calling program IN maxedges length of vector edges 34 (integer) 35 36 OUT array of integers containing the graph structure (for index 37 _ GRAPH details see the definition of MPI ) _ CREATE 38 edges array of integers containing the graph structure OUT 39 40 int MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int index[], 41 int edges[]) 42 43 MPI_Graph_get(comm, maxindex, maxedges, index, edges, ierror) BIND(C) 44 TYPE(MPI_Comm), INTENT(IN) :: comm 45 INTEGER, INTENT(IN) :: maxindex, maxedges 46 INTEGER, INTENT(OUT) :: index(maxindex), edges(maxedges) 47 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 48

334 304 CHAPTER 7. PROCESS TOPOLOGIES 1 MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR) 2 INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR 3 4 5 MPI _ GET(comm, ndims) _ CARTDIM 6 comm communicator with Cartesian structure (handle) IN 7 8 ndims OUT number of dimensions of the Cartesian structure (in- 9 teger) 10 11 int MPI_Cartdim_get(MPI_Comm comm, int *ndims) 12 MPI_Cartdim_get(comm, ndims, ierror) BIND(C) 13 TYPE(MPI_Comm), INTENT(IN) :: comm 14 INTEGER, INTENT(OUT) :: ndims 15 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 16 17 MPI_CARTDIM_GET(COMM, NDIMS, IERROR) 18 INTEGER COMM, NDIMS, IERROR 19 and MPI GET GET _ CARTDIM _ MPI The functions _ CART _ return the Cartesian topol- 20 . If comm ogy information that was associated with a communicator by MPI _ CART _ CREATE 21 returns GET _ CARTDIM _ MPI is associated with a zero-dimensional Cartesian topology, 22 MPI _ CART ndims=0 GET will keep all output arguments unchanged. and _ 23 24 25 CART MPI _ _ GET(comm, maxdims, dims, periods, coords) 26 comm IN communicator with Cartesian structure (handle) 27 28 , and dims, periods length of vectors in the IN coords maxdims 29 calling program (integer) 30 number of processes for each Cartesian dimension (ar- OUT dims 31 ray of integer) 32 ) for each Cartesian dimension true periodicity ( periods OUT / false 33 (array of logical) 34 35 coordinates of calling process in Cartesian structure OUT coords 36 (array of integer) 37 38 int MPI_Cart_get(MPI_Comm comm, int maxdims, int dims[], int periods[], 39 int coords[]) 40 41 MPI_Cart_get(comm, maxdims, dims, periods, coords, ierror) BIND(C) 42 TYPE(MPI_Comm), INTENT(IN) :: comm 43 INTEGER, INTENT(IN) :: maxdims 44 INTEGER, INTENT(OUT) :: dims(maxdims), coords(maxdims) 45 LOGICAL, INTENT(OUT) :: periods(maxdims) 46 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 47 MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR) 48

335 7.5. TOPOLOGY CONSTRUCTORS 305 1 INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR 2 LOGICAL PERIODS(*) 3 4 5 _ _ RANK(comm, coords, rank) CART MPI 6 communicator with Cartesian structure (handle) comm IN 7 8 integer array (of size coords IN ) specifying the Cartesian ndims 9 coordinates of a process 10 rank of specified process (integer) OUT rank 11 12 int MPI_Cart_rank(MPI_Comm comm, const int coords[], int *rank) 13 14 MPI_Cart_rank(comm, coords, rank, ierror) BIND(C) 15 TYPE(MPI_Comm), INTENT(IN) :: comm 16 INTEGER, INTENT(IN) :: coords(*) 17 INTEGER, INTENT(OUT) :: rank 18 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 19 MPI_CART_RANK(COMM, COORDS, RANK, IERROR) 20 INTEGER COMM, COORDS(*), RANK, IERROR 21 22 _ _ CART RANK trans- MPI For a process group with Cartesian structure, the function 23 lates the logical process coordinates to process ranks as they are used by the point-to-point 24 routines. 25 with i , is out of coords(i) For dimension , if the coordinate, periods(i) = true 26 range, that is, < 0 or coords(i) , it is shifted back to the interval coords(i) ≥ dims(i) 27 dims(i) automatically. Out-of-range coordinates are erroneous for 0 ≤ coords(i) < 28 non-periodic dimensions. 29 coords comm If is associated with a zero-dimensional Cartesian topology, is not signif- 30 rank icant and 0 is returned in . 31 32 CART COORDS(comm, rank, maxdims, coords) _ _ MPI 33 34 comm communicator with Cartesian structure (handle) IN 35 rank of a process within group of (integer) IN rank comm 36 IN maxdims length of vector coords in the calling program (inte- 37 ger) 38 39 ndims coords OUT ) containing the Cartesian integer array (of size 40 coordinates of specified process (array of integers) 41 42 int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int coords[]) 43 44 MPI_Cart_coords(comm, rank, maxdims, coords, ierror) BIND(C) 45 TYPE(MPI_Comm), INTENT(IN) :: comm 46 INTEGER, INTENT(IN) :: rank, maxdims 47 INTEGER, INTENT(OUT) :: coords(maxdims) 48 INTEGER, OPTIONAL, INTENT(OUT) :: ierror

336 306 CHAPTER 7. PROCESS TOPOLOGIES 1 MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR) 2 INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR 3 The inverse mapping, rank-to-coordinates translation is provided by 4 CART _ _ MPI . COORDS 5 If comm is associated with a zero-dimensional Cartesian topology, 6 coords will be unchanged. 7 8 9 COUNT(comm, rank, nneighbors) MPI _ GRAPH _ NEIGHBORS _ 10 comm communicator with graph topology (handle) IN 11 12 (integer) IN rank rank of process in group of comm 13 OUT nneighbors number of neighbors of specified process (integer) 14 15 int MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors) 16 17 MPI_Graph_neighbors_count(comm, rank, nneighbors, ierror) BIND(C) 18 TYPE(MPI_Comm), INTENT(IN) :: comm 19 INTEGER, INTENT(IN) :: rank 20 INTEGER, INTENT(OUT) :: nneighbors 21 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 22 MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERROR) 23 INTEGER COMM, RANK, NNEIGHBORS, IERROR 24 25 26 MPI NEIGHBORS(comm, rank, maxneighbors, neighbors) _ GRAPH _ 27 28 communicator with graph topology (handle) comm IN 29 comm (integer) IN rank rank of process in group of 30 (integer) IN size of array neighbors maxneighbors 31 32 neighbors ranks of processes that are neighbors to specified pro- OUT 33 cess (array of integer) 34 35 int MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, 36 int neighbors[]) 37 38 MPI_Graph_neighbors(comm, rank, maxneighbors, neighbors, ierror) BIND(C) 39 TYPE(MPI_Comm), INTENT(IN) :: comm 40 INTEGER, INTENT(IN) :: rank, maxneighbors 41 INTEGER, INTENT(OUT) :: neighbors(maxneighbors) 42 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 43 MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR) 44 INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR 45 46 MPI and COUNT _ _ provide adjacency GRAPH _ MPI NEIGHBORS _ GRAPH _ NEIGHBORS 47 information for a graph topology. The returned count and array of neighbors for the queried 48 all rank will both include neighbors and reflect the same edge ordering as was specified by

the original call to MPI_GRAPH_CREATE. Specifically, MPI_GRAPH_NEIGHBORS_COUNT
and MPI_GRAPH_NEIGHBORS will return values based on the original index and edges
array passed to MPI_GRAPH_CREATE (assuming that index[-1] effectively equals zero):

• The number of neighbors (nneighbors) returned from MPI_GRAPH_NEIGHBORS_COUNT
  will be (index[rank] - index[rank-1]).

• The neighbors array returned from MPI_GRAPH_NEIGHBORS will be
  edges[index[rank-1]] through edges[index[rank]-1].

Example 7.5
Assume there are four processes 0, 1, 2, 3 with the following adjacency matrix (note
that some neighbors are listed multiple times):

process   neighbors
   0      1, 1, 3
   1      0, 0
   2      3
   3      0, 2, 2

Thus, the input arguments to MPI_GRAPH_CREATE are:

nnodes = 4
index  = 3, 5, 6, 9
edges  = 1, 1, 3, 0, 0, 3, 0, 2, 2

Therefore, calling MPI_GRAPH_NEIGHBORS_COUNT and MPI_GRAPH_NEIGHBORS
for each of the 4 processes will return:

Input rank   Count   Neighbors
    0          3     1, 1, 3
    1          2     0, 0
    2          1     3
    3          3     0, 2, 2

Example 7.6
Suppose that comm is a communicator with a shuffle-exchange topology. The group has
2^n members. Each process is labeled by a_1, ..., a_n with a_i ∈ {0, 1}, and has three
neighbors: exchange(a_1, ..., a_n) = a_1, ..., a_(n-1), ā_n (where ā = 1 - a),
shuffle(a_1, ..., a_n) = a_2, ..., a_n, a_1, and
unshuffle(a_1, ..., a_n) = a_n, a_1, ..., a_(n-1).
The graph adjacency list is illustrated below for n = 3.

node        exchange        shuffle         unshuffle
            neighbors(1)    neighbors(2)    neighbors(3)
0 (000)          1               0               0
1 (001)          0               2               4
2 (010)          3               4               1
3 (011)          2               6               5
4 (100)          5               1               2
5 (101)          4               3               6
6 (110)          7               5               3
7 (111)          6               7               7

Suppose that the communicator comm has this topology associated with it. The following
code fragment cycles through the three types of neighbors and performs an appropriate
permutation for each.

C  assume: each process has stored a real number A.
C  extract neighborhood information
      CALL MPI_COMM_RANK(comm, myrank, ierr)
      CALL MPI_GRAPH_NEIGHBORS(comm, myrank, 3, neighbors, ierr)
C  perform exchange permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(1), 0,
     +     neighbors(1), 0, comm, status, ierr)
C  perform shuffle permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(2), 0,
     +     neighbors(3), 0, comm, status, ierr)
C  perform unshuffle permutation
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, neighbors(3), 0,
     +     neighbors(2), 0, comm, status, ierr)

MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS provide
adjacency information for a distributed graph topology.

MPI_DIST_GRAPH_NEIGHBORS_COUNT(comm, indegree, outdegree, weighted)
IN   comm       communicator with distributed graph topology (handle)
OUT  indegree   number of edges into this process (non-negative integer)
OUT  outdegree  number of edges out of this process (non-negative integer)
OUT  weighted   false if MPI_UNWEIGHTED was supplied during creation,
                true otherwise (logical)

int MPI_Dist_graph_neighbors_count(MPI_Comm comm, int *indegree,
              int *outdegree, int *weighted)

MPI_Dist_graph_neighbors_count(comm, indegree, outdegree, weighted, ierror)
    BIND(C)

339 7.5. TOPOLOGY CONSTRUCTORS 309 1 TYPE(MPI_Comm), INTENT(IN) :: comm 2 INTEGER, INTENT(OUT) :: indegree, outdegree 3 LOGICAL, INTENT(OUT) :: weighted 4 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 5 MPI_DIST_GRAPH_NEIGHBORS_COUNT(COMM, INDEGREE, OUTDEGREE, WEIGHTED, IERROR) 6 INTEGER COMM, INDEGREE, OUTDEGREE, IERROR 7 LOGICAL WEIGHTED 8 9 10 NEIGHBORS(comm, maxindegree, sources, sourceweights, maxoutdegree, MPI DIST _ GRAPH _ _ 11 destinations, destweights) 12 13 IN comm communicator with distributed graph topology (han- 14 dle) 15 IN maxindegree size of sources and sourceweights arrays (non-negative 16 integer) 17 18 OUT sources processes for which the calling process is a destination 19 (array of non-negative integers) 20 weights of the edges into the calling process (array of sourceweights OUT 21 non-negative integers) 22 size of destinations and destweights arrays maxoutdegree IN 23 (non-negative integer) 24 25 OUT destinations processes for which the calling process is a source (ar- 26 ray of non-negative integers) 27 OUT weights of the edges out of the calling process (array destweights 28 of non-negative integers) 29 30 int MPI_Dist_graph_neighbors(MPI_Comm comm, int maxindegree, int sources[], 31 int sourceweights[], int maxoutdegree, int destinations[], 32 int destweights[]) 33 34 MPI_Dist_graph_neighbors(comm, maxindegree, sources, sourceweights, 35 maxoutdegree, destinations, destweights, ierror) BIND(C) 36 TYPE(MPI_Comm), INTENT(IN) :: comm 37 INTEGER, INTENT(IN) :: maxindegree, maxoutdegree 38 INTEGER, INTENT(OUT) :: sources(maxindegree), 39 destinations(maxoutdegree) 40 INTEGER :: sourceweights(*), destweights(*) 41 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 42 MPI_DIST_GRAPH_NEIGHBORS(COMM, MAXINDEGREE, SOURCES, SOURCEWEIGHTS, 43 MAXOUTDEGREE, DESTINATIONS, DESTWEIGHTS, IERROR) 44 INTEGER COMM, MAXINDEGREE, SOURCES(*), SOURCEWEIGHTS(*), MAXOUTDEGREE, 45 DESTINATIONS(*), DESTWEIGHTS(*), IERROR 46 47 These calls are local. The number of edges into and out of the process returned by 48 _ GRAPH _ COUNT _ MPI are the total number of such edges given in the _ NEIGHBORS DIST

call to MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE (potentially
by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE).
Multiply defined edges are all counted and returned by MPI_DIST_GRAPH_NEIGHBORS in
some order. If MPI_UNWEIGHTED is supplied for sourceweights or destweights or both, or if
MPI_UNWEIGHTED was supplied during the construction of the graph, then no weight
information is returned in that array or those arrays. If the communicator was created with
MPI_DIST_GRAPH_CREATE_ADJACENT then for each rank in comm, the order of the values
in sources and destinations is identical to the input that was used by the process with the
same rank in comm_old in the creation call. If the communicator was created with
MPI_DIST_GRAPH_CREATE then the only requirement on the order of values in sources and
destinations is that two calls to the routine with the same input argument comm will return
the same sequence of edges. If maxindegree or maxoutdegree is smaller than the numbers
returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT, then only the first part of the full list
is returned.

Advice to implementors. Since the query calls are defined to be local, each process
needs to store the list of its neighbors with incoming and outgoing edges. Communication
is required at the collective MPI_DIST_GRAPH_CREATE call in order to compute the
neighbor lists for each process from the distributed graph specification. (End of advice
to implementors.)

7.5.6 Cartesian Shift Coordinates

If the process topology is a Cartesian structure, an MPI_SENDRECV operation is likely to
be used along a coordinate direction to perform a shift of data. As input, MPI_SENDRECV
takes the rank of a source process for the receive, and the rank of a destination process for
the send. If the function MPI_CART_SHIFT is called for a Cartesian process group, it
provides the calling process with the above identifiers, which then can be passed to
MPI_SENDRECV. The user specifies the coordinate direction and the size of the step
(positive or negative). The function is local.

MPI_CART_SHIFT(comm, direction, disp, rank_source, rank_dest)
IN   comm         communicator with Cartesian structure (handle)
IN   direction    coordinate dimension of shift (integer)
IN   disp         displacement (> 0: upwards shift, < 0: downwards shift) (integer)
OUT  rank_source  rank of source process (integer)
OUT  rank_dest    rank of destination process (integer)

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp,
              int *rank_source, int *rank_dest)

MPI_Cart_shift(comm, direction, disp, rank_source, rank_dest, ierror)
    BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: direction, disp

    INTEGER, INTENT(OUT) :: rank_source, rank_dest
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
    INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR

The direction argument indicates the coordinate dimension to be traversed by the shift.
The dimensions are numbered from 0 to ndims-1, where ndims is the number of dimensions.
Depending on the periodicity of the Cartesian group in the specified coordinate
direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the
case of an end-off shift, the value MPI_PROC_NULL may be returned in rank_source or
rank_dest, indicating that the source or the destination for the shift is out of range.
It is erroneous to call MPI_CART_SHIFT with a direction that is either negative or
greater than or equal to the number of dimensions in the Cartesian communicator. This
implies that it is erroneous to call MPI_CART_SHIFT with a comm that is associated with
a zero-dimensional Cartesian topology.

Example 7.7
The communicator, comm, has a two-dimensional, periodic, Cartesian topology associated
with it. A two-dimensional array of REALs is stored one element per process, in variable A.
One wishes to skew this array, by shifting column i (vertically, i.e., along the column) by
i steps.

      ...
C find process rank
      CALL MPI_COMM_RANK(comm, rank, ierr)
C find Cartesian coordinates
      CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierr)
C compute shift source and destination
      CALL MPI_CART_SHIFT(comm, 0, coords(2), source, dest, ierr)
C skew array
      CALL MPI_SENDRECV_REPLACE(A, 1, MPI_REAL, dest, 0, source, 0, comm,
     +     status, ierr)

Advice to users. In Fortran, the dimension indicated by DIRECTION = i has DIMS(i+1)
nodes, where DIMS is the array that was used to create the grid. In C, the dimension
indicated by direction = i is the dimension specified by dims[i]. (End of advice to users.)
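The same skew can be written in C. The following fragment is an illustrative sketch, not
part of the standard text: it assumes a two-dimensional periodic Cartesian communicator
comm and a single double value A stored per process, and mirrors the Fortran code of
Example 7.7.

/* C sketch of the skew in Example 7.7 (illustrative only): shift the locally
   stored value A along dimension 0 by the process's second coordinate. */
#include <mpi.h>

void skew(MPI_Comm comm, double *A)
{
    int rank, coords[2], source, dest;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);               /* find process rank          */
    MPI_Cart_coords(comm, rank, 2, coords);   /* find Cartesian coordinates */
    MPI_Cart_shift(comm, 0, coords[1], &source, &dest);
    MPI_Sendrecv_replace(A, 1, MPI_DOUBLE, dest, 0, source, 0, comm, &status);
}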

7.5.7 Partitioning of Cartesian Structures

MPI_CART_SUB(comm, remain_dims, newcomm)
IN   comm         communicator with Cartesian structure (handle)
IN   remain_dims  the i-th entry of remain_dims specifies whether the i-th dimension is
                  kept in the subgrid (true) or is dropped (false) (logical vector)
OUT  newcomm      communicator containing the subgrid that includes the calling process
                  (handle)

int MPI_Cart_sub(MPI_Comm comm, const int remain_dims[], MPI_Comm *newcomm)

MPI_Cart_sub(comm, remain_dims, newcomm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    LOGICAL, INTENT(IN) :: remain_dims(*)
    TYPE(MPI_Comm), INTENT(OUT) :: newcomm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_CART_SUB(COMM, REMAIN_DIMS, NEWCOMM, IERROR)
    INTEGER COMM, NEWCOMM, IERROR
    LOGICAL REMAIN_DIMS(*)

If a Cartesian topology has been created with MPI_CART_CREATE, the function
MPI_CART_SUB can be used to partition the communicator group into subgroups that
form lower-dimensional Cartesian subgrids, and to build for each subgroup a communicator
with the associated subgrid Cartesian topology. If all entries in remain_dims are false or
comm is already associated with a zero-dimensional Cartesian topology then newcomm is
associated with a zero-dimensional Cartesian topology. (This function is closely related to
MPI_COMM_SPLIT.)

Example 7.8
Assume that MPI_CART_CREATE(..., comm) has defined a (2 × 3 × 4) grid. Let
remain_dims = (true, false, true). Then a call to

    MPI_CART_SUB(comm, remain_dims, comm_new)

will create three communicators each with eight processes in a 2 × 4 Cartesian topology.
If remain_dims = (false, false, true) then the call to
MPI_CART_SUB(comm, remain_dims, comm_new) will create six non-overlapping
communicators, each with four processes, in a one-dimensional Cartesian topology.

7.5.8 Low-Level Topology Functions

The two additional functions introduced in this section can be used to implement all other
topology functions. In general they will not be called by the user directly, unless he or she
is creating additional virtual topology capability other than that provided by MPI. The two
calls are both local.
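As a C illustration of Example 7.8 above (a sketch with assumed names, not taken from
the standard text), the first partitioning could be written as:

/* Sketch: keep dimensions 0 and 2 of a (2 x 3 x 4) Cartesian communicator,
   producing 2 x 4 subgrid communicators as in Example 7.8. */
#include <mpi.h>

void make_subgrids(MPI_Comm comm, MPI_Comm *comm_new)
{
    int remain_dims[3] = {1, 0, 1};   /* (true, false, true) */
    MPI_Cart_sub(comm, remain_dims, comm_new);
}

Processes that share the same coordinate in the dropped middle dimension end up in the
same subgrid communicator.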

MPI_CART_MAP(comm, ndims, dims, periods, newrank)
IN   comm     input communicator (handle)
IN   ndims    number of dimensions of Cartesian structure (integer)
IN   dims     integer array of size ndims specifying the number of processes in each
              coordinate direction
IN   periods  logical array of size ndims specifying the periodicity specification in each
              coordinate direction
OUT  newrank  reordered rank of the calling process; MPI_UNDEFINED if calling process
              does not belong to grid (integer)

int MPI_Cart_map(MPI_Comm comm, int ndims, const int dims[], const
              int periods[], int *newrank)

MPI_Cart_map(comm, ndims, dims, periods, newrank, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: ndims, dims(ndims)
    LOGICAL, INTENT(IN) :: periods(ndims)
    INTEGER, INTENT(OUT) :: newrank
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_CART_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)
    INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR
    LOGICAL PERIODS(*)

MPI_CART_MAP computes an "optimal" placement for the calling process on the physical
machine. A possible implementation of this function is to always return the rank of the
calling process, that is, not to perform any reordering.

Advice to implementors. The function MPI_CART_CREATE(comm, ndims, dims,
periods, reorder, comm_cart), with reorder = true, can be implemented by calling
MPI_CART_MAP(comm, ndims, dims, periods, newrank), then calling
MPI_COMM_SPLIT(comm, color, key, comm_cart), with color = 0 if newrank ≠
MPI_UNDEFINED, color = MPI_UNDEFINED otherwise, and key = newrank. If ndims
is zero then a zero-dimensional Cartesian topology is created.
The function MPI_CART_SUB(comm, remain_dims, comm_new) can be implemented
by a call to MPI_COMM_SPLIT(comm, color, key, comm_new), using a single number
encoding of the lost dimensions as color and a single number encoding of the preserved
dimensions as key.
All other Cartesian topology functions can be implemented locally, using the topology
information that is cached with the communicator. (End of advice to implementors.)

The corresponding function for graph structures is as follows.

MPI_GRAPH_MAP(comm, nnodes, index, edges, newrank)
IN   comm     input communicator (handle)
IN   nnodes   number of graph nodes (integer)
IN   index    integer array specifying the graph structure, see MPI_GRAPH_CREATE
IN   edges    integer array specifying the graph structure
OUT  newrank  reordered rank of the calling process; MPI_UNDEFINED if the calling
              process does not belong to graph (integer)

int MPI_Graph_map(MPI_Comm comm, int nnodes, const int index[], const
              int edges[], int *newrank)

MPI_Graph_map(comm, nnodes, index, edges, newrank, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: nnodes, index(nnodes), edges(*)
    INTEGER, INTENT(OUT) :: newrank
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)
    INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR

Advice to implementors. The function MPI_GRAPH_CREATE(comm, nnodes, index,
edges, reorder, comm_graph), with reorder = true, can be implemented by calling
MPI_GRAPH_MAP(comm, nnodes, index, edges, newrank), then calling
MPI_COMM_SPLIT(comm, color, key, comm_graph), with color = 0 if newrank ≠
MPI_UNDEFINED, color = MPI_UNDEFINED otherwise, and key = newrank.
All other graph topology functions can be implemented locally, using the topology
information that is cached with the communicator. (End of advice to implementors.)

7.6 Neighborhood Collective Communication on Process Topologies

MPI process topologies specify a communication graph, but they implement no
communication function themselves. Many applications require sparse nearest neighbor
communications that can be expressed as graph topologies. We now describe several collective
operations that perform communication along the edges of a process topology. All of these
functions are collective; i.e., they must be called by all processes in the specified
communicator. See Section 5 on page 141 for an overview of other dense (global) collective
communication operations and the semantics of collective operations.

If the graph was created with MPI_DIST_GRAPH_CREATE_ADJACENT with sources
and destinations containing 0, ..., n-1, where n is the number of processes in the group
of comm_old (i.e., the graph is fully connected and also includes an edge from each node
to itself), then the sparse neighborhood communication routine performs the same data
exchange as the corresponding dense (fully-connected) collective operation. In the case of a
Cartesian communicator, only nearest neighbor communication is provided, corresponding
to rank_source and rank_dest in MPI_CART_SHIFT with input disp=1.
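The fully connected case mentioned in the previous paragraph can be set up as in the
following sketch (illustrative only; the function name is an assumption, not part of the
standard): every process lists all n ranks, including its own, as both sources and
destinations, so that a neighborhood collective on the resulting communicator exchanges
the same data as the corresponding dense collective on comm_old.

/* Sketch: build a fully connected distributed graph topology (with self-edges)
   over all processes of comm_old, without reordering ranks. */
#include <mpi.h>
#include <stdlib.h>

MPI_Comm make_fully_connected(MPI_Comm comm_old)
{
    int n, i;
    MPI_Comm comm_dist;

    MPI_Comm_size(comm_old, &n);
    int *ranks = (int *)malloc(n * sizeof(int));
    for (i = 0; i < n; i++)
        ranks[i] = i;                 /* sources == destinations == 0, ..., n-1 */

    MPI_Dist_graph_create_adjacent(comm_old, n, ranks, MPI_UNWEIGHTED,
                                   n, ranks, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* reorder = false */,
                                   &comm_dist);
    free(ranks);
    return comm_dist;
}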

Rationale. Neighborhood collective communications enable communication on a
process topology. This high-level specification of data exchange among neighboring
processes enables optimizations in the MPI library because the communication pattern
is known statically (the topology). Thus, the implementation can compute optimized
message schedules during creation of the topology [35]. This functionality can
significantly simplify the implementation of neighbor exchanges [31]. (End of rationale.)

For a distributed graph topology, created with MPI_DIST_GRAPH_CREATE, the
sequence of neighbors in the send and receive buffers at each process is defined as the
sequence returned by MPI_DIST_GRAPH_NEIGHBORS for destinations and sources,
respectively. For a general graph topology, created with MPI_GRAPH_CREATE, the order of
neighbors in the send and receive buffers is defined as the sequence of neighbors as
returned by MPI_GRAPH_NEIGHBORS. Note that general graph topologies should generally
be replaced by the distributed graph topologies.

For a Cartesian topology, created with MPI_CART_CREATE, the sequence of neighbors
in the send and receive buffers at each process is defined by order of the dimensions,
first the neighbor in the negative direction and then in the positive direction with
displacement 1. The numbers of sources and destinations in the communication routines are
2*ndims with ndims defined in MPI_CART_CREATE. If a neighbor does not exist, i.e., at
the border of a Cartesian topology in the case of a non-periodic virtual grid dimension (i.e.,
periods[...]==false), then this neighbor is defined to be MPI_PROC_NULL.

If a neighbor in any of the functions is MPI_PROC_NULL, then the neighborhood
collective communication behaves like a point-to-point communication with MPI_PROC_NULL
in this direction. That is, the buffer is still part of the sequence of neighbors but it is neither
communicated nor updated.

7.6.1 Neighborhood Gather

In this function, each process i gathers data items from each process j if an edge (j,i) exists
in the topology graph, and each process i sends the same data items to all processes j where
an edge (i,j) exists. The send buffer is sent to each neighboring process and the l-th block
in the receive buffer is received from the l-th neighbor.

346 316 CHAPTER 7. PROCESS TOPOLOGIES 1 ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, MPI _ _ NEIGHBOR 2 comm) 3 IN sendbuf starting address of send buffer (choice) 4 sendcount number of elements sent to each neighbor (non-negative IN 5 integer) 6 7 sendtype IN data type of send buffer elements (handle) 8 recvbuf starting address of receive buffer (choice) OUT 9 number of elements received from each neighbor (non- IN recvcount 10 negative integer) 11 12 IN recvtype data type of receive buffer elements (handle) 13 communicator with topology structure (handle) IN comm 14 15 int MPI_Neighbor_allgather(const void* sendbuf, int sendcount, MPI_Datatype 16 sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, 17 MPI_Comm comm) 18 19 MPI_Neighbor_allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, 20 recvtype, comm, ierror) BIND(C) 21 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 22 TYPE(*), DIMENSION(..) :: recvbuf 23 INTEGER, INTENT(IN) :: sendcount, recvcount 24 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 25 TYPE(MPI_Comm), INTENT(IN) :: comm 26 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 27 MPI_NEIGHBOR_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, 28 RECVTYPE, COMM, IERROR) 29 > SENDBUF(*), RECVBUF(*) < type 30 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR 31 32 This function supports Cartesian communicators, graph communicators, and distributed 33 graph communicators as described in Section 7.6 on page 314. If comm is a distributed graph 34 communicator, the outcome is as if each process executed sends to each of its outgoing 35 neighbors and receives from each of its incoming neighbors: 36 37 MPI_Dist_graph_neighbors_count(comm,&indegree,&outdegree,&weighted); 38 int *srcs=(int*)malloc(indegree*sizeof(int)); 39 int *dsts=(int*)malloc(outdegree*sizeof(int)); 40 MPI_Dist_graph_neighbors(comm,indegree,srcs,MPI_UNWEIGHTED, 41 outdegree,dsts,MPI_UNWEIGHTED); 42 int k,l; 43 44 /* assume sendbuf and recvbuf are of type (char*) */ 45 for(k=0; k

<outdegree; ++k)
  MPI_Isend(sendbuf,sendcount,sendtype,dsts[k],...);
for(l=0; l<indegree; ++l)
  MPI_Irecv(recvbuf+l*recvcount*extent(recvtype),recvcount,recvtype,
            srcs[l],...);

MPI_Waitall(...);

Figure 7.6.1 shows the neighborhood gather communication of one process with outgoing
neighbors d0...d3 and incoming neighbors s0...s5. The process will send its sendbuf to all
four destinations (outgoing neighbors) and it will receive the contribution from all six
sources (incoming neighbors) into separate locations of its receive buffer.

[Figure: one process with outgoing neighbors d0, d1, d2, d3 and incoming neighbors
s0, ..., s5; the single sendbuf block is sent to every destination, and recvbuf holds the
blocks received from s0 through s5, one per neighbor, in neighbor order.]

All arguments are significant on all processes and the argument comm must have
identical values on all processes.
The type signature associated with sendcount, sendtype at a process must be equal to
the type signature associated with recvcount, recvtype at all other processes. This implies
that the amount of data sent must be equal to the amount of data received, pairwise between
every pair of communicating processes. Distinct type maps between sender and receiver are
still allowed.

Rationale. For optimization reasons, the same type signature is required
independently of whether the topology graph is connected or not. (End of rationale.)

The "in place" option is not meaningful for this operation.
The vector variant of MPI_NEIGHBOR_ALLGATHER allows one to gather different
numbers of elements from each neighbor.
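As a usage illustration (a sketch with assumed buffer names, not part of the standard
text), a process in a two-dimensional Cartesian communicator can collect one value from
each of its four neighbors with a single call; the receive buffer is filled in the neighbor
order defined in Section 7.6 (dimension 0 negative, dimension 0 positive, dimension 1
negative, dimension 1 positive), and entries whose neighbor is MPI_PROC_NULL are left
unchanged.

/* Sketch: gather one double from each of the four Cartesian neighbors.
   comm_cart is assumed to carry a two-dimensional Cartesian topology. */
#include <mpi.h>

void gather_from_neighbors(MPI_Comm comm_cart, double myval, double nbrval[4])
{
    MPI_Neighbor_allgather(&myval, 1, MPI_DOUBLE,
                           nbrval, 1, MPI_DOUBLE, comm_cart);
}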

348 318 CHAPTER 7. PROCESS TOPOLOGIES 1 _ MPI NEIGHBOR _ ALLGATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, 2 recvtype, comm) 3 starting address of send buffer (choice) IN sendbuf 4 number of elements sent to each neighbor (non-negative sendcount IN 5 integer) 6 7 IN data type of send buffer elements (handle) sendtype 8 recvbuf starting address of receive buffer (choice) OUT 9 recvcounts IN non-negative integer array (of length indegree) con- 10 taining the number of elements that are received from 11 each neighbor 12 13 displs integer array (of length indegree). Entry i specifies IN 14 ) at which to place recvbuf the displacement (relative to 15 i the incoming data from neighbor 16 IN recvtype data type of receive buffer elements (handle) 17 IN communicator with topology structure (handle) comm 18 19 20 int MPI_Neighbor_allgatherv(const void* sendbuf, int sendcount, 21 MPI_Datatype sendtype, void* recvbuf, const int recvcounts[], 22 const int displs[], MPI_Datatype recvtype, MPI_Comm comm) 23 MPI_Neighbor_allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, 24 displs, recvtype, comm, ierror) BIND(C) 25 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 26 TYPE(*), DIMENSION(..) :: recvbuf 27 INTEGER, INTENT(IN) :: sendcount, recvcounts(*), displs(*) 28 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 29 TYPE(MPI_Comm), INTENT(IN) :: comm 30 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 31 32 MPI_NEIGHBOR_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, 33 DISPLS, RECVTYPE, COMM, IERROR) 34 SENDBUF(*), RECVBUF(*) > type < 35 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM, 36 IERROR 37 This function supports Cartesian communicators, graph communicators, and distributed 38 is a distributed graph comm graph communicators as described in Section 7.6 on page 314. If 39 communicator, the outcome is as if each process executed sends to each of its outgoing 40 neighbors and receives from each of its incoming neighbors: 41 42 MPI_Dist_graph_neighbors_count(comm,&indegree,&outdegree,&weighted); 43 int *srcs=(int*)malloc(indegree*sizeof(int)); 44 int *dsts=(int*)malloc(outdegree*sizeof(int)); 45 MPI_Dist_graph_neighbors(comm,indegree,srcs,MPI_UNWEIGHTED, 46 outdegree,dsts,MPI_UNWEIGHTED); 47 int k,l; 48

349 7.6. NEIGHBORHOOD COLLECTIVE COMMUNICATION 319 1 2 /* assume sendbuf and recvbuf are of type (char*) */ 3 for(k=0; k

350 320 CHAPTER 7. PROCESS TOPOLOGIES 1 MPI_Neighbor_alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, 2 recvtype, comm, ierror) BIND(C) 3 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 4 TYPE(*), DIMENSION(..) :: recvbuf 5 INTEGER, INTENT(IN) :: sendcount, recvcount 6 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 7 TYPE(MPI_Comm), INTENT(IN) :: comm 8 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 9 MPI_NEIGHBOR_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, 10 RECVTYPE, COMM, IERROR) 11 > SENDBUF(*), RECVBUF(*) type < 12 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR 13 14 This function supports Cartesian communicators, graph communicators, and distributed 15 is a distributed graph comm graph communicators as described in Section 7.6 on page 314. If 16 communicator, the outcome is as if each process executed sends to each of its outgoing 17 neighbors and receives from each of its incoming neighbors: 18 19 MPI_Dist_graph_neighbors_count(comm,&indegree,&outdegree,&weighted); 20 int *srcs=(int*)malloc(indegree*sizeof(int)); 21 int *dsts=(int*)malloc(outdegree*sizeof(int)); 22 MPI_Dist_graph_neighbors(comm,indegree,srcs,MPI_UNWEIGHTED, 23 outdegree,dsts,MPI_UNWEIGHTED); 24 int k,l; 25 26 /* assume sendbuf and recvbuf are of type (char*) */ 27 for(k=0; k

351 7.6. NEIGHBORHOOD COLLECTIVE COMMUNICATION 321 1 _ MPI _ ALLTOALLV(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, NEIGHBOR 2 rdispls, recvtype, comm) 3 starting address of send buffer (choice) sendbuf IN 4 non-negative integer array (of length outdegree) speci- sendcounts IN 5 fying the number of elements to send to each neighbor 6 7 sdispls IN integer array (of length outdegree). Entry j specifies 8 sendbuf the displacement (relative to ) from which to 9 j send the outgoing data to neighbor 10 sendtype data type of send buffer elements (handle) IN 11 starting address of receive buffer (choice) OUT recvbuf 12 13 non-negative integer array (of length indegree) speci- recvcounts IN 14 fying the number of elements that are received from 15 each neighbor 16 integer array (of length indegree). Entry i specifies IN rdispls 17 ) at which to place recvbuf the displacement (relative to 18 the incoming data from neighbor i 19 data type of receive buffer elements (handle) recvtype IN 20 21 IN communicator with topology structure (handle) comm 22 23 int MPI_Neighbor_alltoallv(const void* sendbuf, const int sendcounts[], 24 const int sdispls[], MPI_Datatype sendtype, void* recvbuf, 25 const int recvcounts[], const int rdispls[], MPI_Datatype 26 recvtype, MPI_Comm comm) 27 MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, 28 recvcounts, rdispls, recvtype, comm, ierror) BIND(C) 29 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 30 TYPE(*), DIMENSION(..) :: recvbuf 31 INTEGER, INTENT(IN) :: sendcounts(*), sdispls(*), recvcounts(*), 32 rdispls(*) 33 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 34 TYPE(MPI_Comm), INTENT(IN) :: comm 35 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 36 37 MPI_NEIGHBOR_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, 38 RECVCOUNTS, RDISPLS, RECVTYPE, COMM, IERROR) 39 > type < SENDBUF(*), RECVBUF(*) 40 INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*), 41 RECVTYPE, COMM, IERROR 42 43 This function supports Cartesian communicators, graph communicators, and distributed 44 is a distributed graph graph communicators as described in Section 7.6 on page 314. If comm 45 communicator, the outcome is as if each process executed sends to each of its outgoing 46 neighbors and receives from each of its incoming neighbors: 47 MPI_Dist_graph_neighbors_count(comm,&indegree,&outdegree,&weighted); 48

352 322 CHAPTER 7. PROCESS TOPOLOGIES 1 int *srcs=(int*)malloc(indegree*sizeof(int)); 2 int *dsts=(int*)malloc(outdegree*sizeof(int)); 3 MPI_Dist_graph_neighbors(comm,indegree,srcs,MPI_UNWEIGHTED, 4 outdegree,dsts,MPI_UNWEIGHTED); 5 int k,l; 6 7 /* assume sendbuf and recvbuf are of type (char*) */ 8 for(k=0; k

353 7.6. NEIGHBORHOOD COLLECTIVE COMMUNICATION 323 1 _ ALLTOALLW(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, MPI NEIGHBOR _ 2 rdispls, recvtypes, comm) 3 starting address of send buffer (choice) sendbuf IN 4 sendcounts IN non-negative integer array (of length outdegree) speci- 5 fying the number of elements to send to each neighbor 6 7 specifies IN sdispls integer array (of length outdegree). Entry j 8 sendbuf the displacement in bytes (relative to ) from 9 which to take the outgoing data destined for neighbor 10 (array of integers) j 11 array of datatypes (of length outdegree). Entry IN sendtypes j spec- 12 (array of j ifies the type of data to send to neighbor 13 handles) 14 recvbuf OUT starting address of receive buffer (choice) 15 16 non-negative integer array (of length indegree) speci- recvcounts IN 17 fying the number of elements that are received from 18 each neighbor 19 specifies integer array (of length indegree). Entry rdispls i IN 20 the displacement in bytes (relative to recvbuf ) at which 21 to place the incoming data from neighbor i (array of 22 integers) 23 i spec- recvtypes array of datatypes (of length indegree). Entry IN 24 (array i ifies the type of data received from neighbor 25 of handles) 26 comm communicator with topology structure (handle) IN 27 28 29 int MPI_Neighbor_alltoallw(const void* sendbuf, const int sendcounts[], 30 const MPI_Aint sdispls[], const MPI_Datatype sendtypes[], 31 void* recvbuf, const int recvcounts[], const MPI_Aint 32 rdispls[], const MPI_Datatype recvtypes[], MPI_Comm comm) 33 MPI_Neighbor_alltoallw(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, 34 recvcounts, rdispls, recvtypes, comm, ierror) BIND(C) 35 TYPE(*), DIMENSION(..), INTENT(IN) :: sendbuf 36 TYPE(*), DIMENSION(..) :: recvbuf 37 INTEGER, INTENT(IN) :: sendcounts(*), recvcounts(*) 38 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: sdispls(*), rdispls(*) 39 TYPE(MPI_Datatype), INTENT(IN) :: sendtypes(*), recvtypes(*) 40 TYPE(MPI_Comm), INTENT(IN) :: comm 41 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 42 43 MPI_NEIGHBOR_ALLTOALLW(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPES, RECVBUF, 44 RECVCOUNTS, RDISPLS, RECVTYPES, COMM, IERROR) 45 > < type SENDBUF(*), RECVBUF(*) 46 INTEGER(KIND=MPI_ADDRESS_KIND) SDISPLS(*), RDISPLS(*) 47 INTEGER SENDCOUNTS(*), SENDTYPES(*), RECVCOUNTS(*), RECVTYPES(*), COMM, 48 IERROR

354 324 CHAPTER 7. PROCESS TOPOLOGIES 1 This function supports Cartesian communicators, graph communicators, and distributed 2 comm graph communicators as described in Section 7.6 on page 314. If is a distributed graph 3 communicator, the outcome is as if each process executed sends to each of its outgoing 4 neighbors and receives from each of its incoming neighbors: 5 6 MPI_Dist_graph_neighbors_count(comm,&indegree,&outdegree,&weighted); 7 int *srcs=(int*)malloc(indegree*sizeof(int)); 8 int *dsts=(int*)malloc(outdegree*sizeof(int)); 9 MPI_Dist_graph_neighbors(comm,indegree,srcs,MPI_UNWEIGHTED, 10 outdegree,dsts,MPI_UNWEIGHTED); 11 int k,l; 12 13 /* assume sendbuf and recvbuf are of type (char*) */ 14 for(k=0; k

355 7.7. NONBLOCKING NEIGHBORHOOD COMMUNICATION 325 1 Nonblocking Neighborhood Gather 7.7.1 2 3 4 ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, MPI INEIGHBOR _ _ 5 comm, request) 6 IN sendbuf starting address of send buffer (choice) 7 8 IN number of elements sent to each neighbor (non-negative sendcount 9 integer) 10 IN sendtype data type of send buffer elements (handle) 11 OUT recvbuf starting address of receive buffer (choice) 12 13 IN recvcount number of elements received from each neighbor (non- 14 negative integer) 15 recvtype data type of receive buffer elements (handle) IN 16 communicator with topology structure (handle) comm IN 17 18 request OUT communication request (handle) 19 20 int MPI_Ineighbor_allgather(const void* sendbuf, int sendcount, 21 MPI_Datatype sendtype, void* recvbuf, int recvcount, 22 MPI_Datatype recvtype, MPI_Comm comm, MPI_Request *request) 23 MPI_Ineighbor_allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, 24 recvtype, comm, request, ierror) BIND(C) 25 TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf 26 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf 27 INTEGER, INTENT(IN) :: sendcount, recvcount 28 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 29 TYPE(MPI_Comm), INTENT(IN) :: comm 30 TYPE(MPI_Request), INTENT(OUT) :: request 31 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 32 33 MPI_INEIGHBOR_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, 34 RECVTYPE, COMM, REQUEST, IERROR) 35 type < SENDBUF(*), RECVBUF(*) > 36 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, REQUEST, IERROR 37 _ NEIGHBOR _ ALLGATHER . This call starts a nonblocking variant of MPI 38 39 40 41 42 43 44 45 46 47 48

356 326 CHAPTER 7. PROCESS TOPOLOGIES 1 ALLGATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, MPI _ _ INEIGHBOR 2 recvtype, comm, request) 3 starting address of send buffer (choice) IN sendbuf 4 IN sendcount number of elements sent to each neighbor (non-negative 5 integer) 6 7 data type of send buffer elements (handle) IN sendtype 8 OUT starting address of receive buffer (choice) recvbuf 9 IN recvcounts non-negative integer array (of length indegree) con- 10 taining the number of elements that are received from 11 each neighbor 12 13 IN i integer array (of length indegree). Entry specifies displs 14 the displacement (relative to recvbuf ) at which to place 15 i the incoming data from neighbor 16 IN data type of receive buffer elements (handle) recvtype 17 comm communicator with topology structure (handle) IN 18 19 OUT request communication request (handle) 20 21 int MPI_Ineighbor_allgatherv(const void* sendbuf, int sendcount, 22 MPI_Datatype sendtype, void* recvbuf, const int recvcounts[], 23 const int displs[], MPI_Datatype recvtype, MPI_Comm comm, 24 MPI_Request *request) 25 MPI_Ineighbor_allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, 26 displs, recvtype, comm, request, ierror) BIND(C) 27 TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf 28 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf 29 INTEGER, INTENT(IN) :: sendcount 30 INTEGER, INTENT(IN), ASYNCHRONOUS :: recvcounts(*), displs(*) 31 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 32 TYPE(MPI_Comm), INTENT(IN) :: comm 33 TYPE(MPI_Request), INTENT(OUT) :: request 34 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 35 36 MPI_INEIGHBOR_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, 37 DISPLS, RECVTYPE, COMM, REQUEST, IERROR) 38 > SENDBUF(*), RECVBUF(*) < type 39 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM, 40 REQUEST, IERROR 41 42 This call starts a nonblocking variant of . _ ALLGATHERV NEIGHBOR _ MPI 43 44 45 46 47 48

357 7.7. NONBLOCKING NEIGHBORHOOD COMMUNICATION 327 1 Nonblocking Neighborhood Alltoall 7.7.2 2 3 4 ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, MPI INEIGHBOR _ _ 5 comm, request) 6 IN sendbuf starting address of send buffer (choice) 7 8 IN number of elements sent to each neighbor (non-negative sendcount 9 integer) 10 IN sendtype data type of send buffer elements (handle) 11 OUT recvbuf starting address of receive buffer (choice) 12 13 IN recvcount number of elements received from each neighbor (non- 14 negative integer) 15 recvtype data type of receive buffer elements (handle) IN 16 communicator with topology structure (handle) comm IN 17 18 request OUT communication request (handle) 19 20 int MPI_Ineighbor_alltoall(const void* sendbuf, int sendcount, MPI_Datatype 21 sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, 22 MPI_Comm comm, MPI_Request *request) 23 MPI_Ineighbor_alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, 24 recvtype, comm, request, ierror) BIND(C) 25 TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf 26 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf 27 INTEGER, INTENT(IN) :: sendcount, recvcount 28 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 29 TYPE(MPI_Comm), INTENT(IN) :: comm 30 TYPE(MPI_Request), INTENT(OUT) :: request 31 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 32 33 MPI_INEIGHBOR_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, 34 RECVTYPE, COMM, REQUEST, IERROR) 35 type < SENDBUF(*), RECVBUF(*) > 36 INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, REQUEST, IERROR 37 _ NEIGHBOR _ ALLTOALL . This call starts a nonblocking variant of MPI 38 39 40 41 42 43 44 45 46 47 48
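As an illustration of the intended use (a sketch with assumed names, not taken from the
standard), the nonblocking variant allows the neighborhood exchange to be overlapped
with computation that does not depend on the incoming data; compute_interior and
compute_boundary stand for application code.

/* Sketch: overlap a neighborhood all-to-all with local computation.
   comm_topo is assumed to carry a Cartesian or (distributed) graph topology;
   sendhalo/recvhalo hold one block of `count` doubles per neighbor. */
#include <mpi.h>

void compute_interior(void);           /* placeholder for application code */
void compute_boundary(double *halo);   /* placeholder for application code */

void exchange_and_compute(MPI_Comm comm_topo, double *sendhalo,
                          double *recvhalo, int count)
{
    MPI_Request req;

    MPI_Ineighbor_alltoall(sendhalo, count, MPI_DOUBLE,
                           recvhalo, count, MPI_DOUBLE, comm_topo, &req);
    compute_interior();                    /* work that needs no halo data  */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    compute_boundary(recvhalo);            /* work that uses received halos */
}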

358 328 CHAPTER 7. PROCESS TOPOLOGIES 1 INEIGHBOR _ _ MPI ALLTOALLV(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, 2 rdispls, recvtype, comm, request) 3 IN starting address of send buffer (choice) sendbuf 4 IN sendcounts non-negative integer array (of length outdegree) speci- 5 fying the number of elements to send to each neighbor 6 7 IN specifies j integer array (of length outdegree). Entry sdispls 8 ) from which send the displacement (relative to sendbuf 9 the outgoing data to neighbor j 10 data type of send buffer elements (handle) sendtype IN 11 starting address of receive buffer (choice) OUT recvbuf 12 13 non-negative integer array (of length indegree) speci- recvcounts IN 14 fying the number of elements that are received from 15 each neighbor 16 IN rdispls integer array (of length indegree). Entry specifies i 17 recvbuf ) at which to place the displacement (relative to 18 the incoming data from neighbor i 19 data type of receive buffer elements (handle) recvtype IN 20 21 comm communicator with topology structure (handle) IN 22 OUT communication request (handle) request 23 24 int MPI_Ineighbor_alltoallv(const void* sendbuf, const int sendcounts[], 25 const int sdispls[], MPI_Datatype sendtype, void* recvbuf, 26 const int recvcounts[], const int rdispls[], MPI_Datatype 27 recvtype, MPI_Comm comm, MPI_Request *request) 28 29 MPI_Ineighbor_alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, 30 recvcounts, rdispls, recvtype, comm, request, ierror) BIND(C) 31 TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf 32 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf 33 INTEGER, INTENT(IN), ASYNCHRONOUS :: sendcounts(*), sdispls(*), 34 recvcounts(*), rdispls(*) 35 TYPE(MPI_Datatype), INTENT(IN) :: sendtype, recvtype 36 TYPE(MPI_Comm), INTENT(IN) :: comm 37 TYPE(MPI_Request), INTENT(OUT) :: request 38 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 39 MPI_INEIGHBOR_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, 40 RECVCOUNTS, RDISPLS, RECVTYPE, COMM, REQUEST, IERROR) 41 < type > SENDBUF(*), RECVBUF(*) 42 INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*), 43 RECVTYPE, COMM, REQUEST, IERROR 44 45 This call starts a nonblocking variant of . ALLTOALLV _ NEIGHBOR _ MPI 46 47 48

359 7.7. NONBLOCKING NEIGHBORHOOD COMMUNICATION 329 1 _ ALLTOALLW(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, MPI INEIGHBOR _ 2 rdispls, recvtypes, comm, request) 3 starting address of send buffer (choice) IN sendbuf 4 non-negative integer array (of length outdegree) speci- sendcounts IN 5 fying the number of elements to send to each neighbor 6 7 IN j sdispls specifies integer array (of length outdegree). Entry 8 sendbuf ) from the displacement in bytes (relative to 9 which to take the outgoing data destined for neighbor 10 (array of integers) j 11 sendtypes j spec- array of datatypes (of length outdegree). Entry IN 12 ifies the type of data to send to neighbor (array of j 13 handles) 14 starting address of receive buffer (choice) OUT recvbuf 15 16 IN non-negative integer array (of length indegree) speci- recvcounts 17 fying the number of elements that are received from 18 each neighbor 19 integer array (of length indegree). Entry i specifies IN rdispls 20 the displacement in bytes (relative to recvbuf ) at which 21 i to place the incoming data from neighbor (array of 22 integers) 23 array of datatypes (of length indegree). Entry recvtypes IN spec- i 24 ifies the type of data received from neighbor i (array 25 of handles) 26 27 comm communicator with topology structure (handle) IN 28 request communication request (handle) OUT 29 30 int MPI_Ineighbor_alltoallw(const void* sendbuf, const int sendcounts[], 31 const MPI_Aint sdispls[], const MPI_Datatype sendtypes[], 32 void* recvbuf, const int recvcounts[], const MPI_Aint 33 rdispls[], const MPI_Datatype recvtypes[], MPI_Comm comm, 34 MPI_Request *request) 35 36 MPI_Ineighbor_alltoallw(sendbuf, sendcounts, sdispls, sendtypes, recvbuf, 37 recvcounts, rdispls, recvtypes, comm, request, ierror) BIND(C) 38 TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: sendbuf 39 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: recvbuf 40 INTEGER, INTENT(IN), ASYNCHRONOUS :: sendcounts(*), recvcounts(*) 41 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN), ASYNCHRONOUS :: 42 sdispls(*), rdispls(*) 43 TYPE(MPI_Datatype), INTENT(IN), ASYNCHRONOUS :: sendtypes(*), 44 recvtypes(*) 45 TYPE(MPI_Comm), INTENT(IN) :: comm 46 TYPE(MPI_Request), INTENT(OUT) :: request 47 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 48

MPI_INEIGHBOR_ALLTOALLW(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPES, RECVBUF,
              RECVCOUNTS, RDISPLS, RECVTYPES, COMM, REQUEST, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) SDISPLS(*), RDISPLS(*)
    INTEGER SENDCOUNTS(*), SENDTYPES(*), RECVCOUNTS(*), RECVTYPES(*), COMM,
    REQUEST, IERROR

This call starts a nonblocking variant of MPI_NEIGHBOR_ALLTOALLW.

7.8 An Application Example

Example 7.9 The example in Figures 7.1-7.3 shows how the grid definition and inquiry
functions can be used in an application program. A partial differential equation, for instance
the Poisson equation, is to be solved on a rectangular domain. First, the processes organize
themselves in a two-dimensional structure. Each process then inquires about the ranks of
its neighbors in the four directions (up, down, right, left). The numerical problem is solved
by an iterative method, the details of which are hidden in the subroutine relax.
In each relaxation step each process computes new values for the solution grid function
u(1:100,1:100) owned by the process. Then the values at inter-process boundaries have
to be exchanged with neighboring processes. For example, the newly calculated values in
u(1,1:100) must be sent into the halo cells u(101,1:100) of the left-hand neighbor with
coordinates (own_coord(1)-1,own_coord(2)).

361 7.8. AN APPLICATION EXAMPLE 331 1 2 3 4 5 6 7 8 INTEGER ndims, num_neigh 9 LOGICAL reorder 10 PARAMETER (ndims=2, num_neigh=4, reorder=.true.) 11 INTEGER comm, comm_cart, dims(ndims), ierr 12 INTEGER neigh_rank(num_neigh), own_coords(ndims), i, j, it 13 LOGICAL periods(ndims) 14 REAL u(0:101,0:101), f(0:101,0:101) 15 DATA dims / ndims * 0 / 16 comm = MPI_COMM_WORLD 17 ! Set process grid size and periodicity 18 CALL MPI_DIMS_CREATE(comm, ndims, dims,ierr) 19 periods(1) = .TRUE. 20 periods(2) = .TRUE. 21 ! Create a grid structure in WORLD group and inquire about own position 22 CALL MPI_CART_CREATE (comm, ndims, dims, periods, reorder, & 23 comm_cart,ierr) 24 CALL MPI_CART_GET (comm_cart, ndims, dims, periods, own_coords,ierr) 25 i = own_coords(1) 26 j = own_coords(2) 27 ! Look up the ranks for the neighbors. Own process coordinates are (i,j). 28 ! Neighbors are (i-1,j), (i+1,j), (i,j-1), (i,j+1) modulo (dims(1),dims(2)) 29 CALL MPI_CART_SHIFT (comm_cart, 0,1, neigh_rank(1),neigh_rank(2), ierr) 30 CALL MPI_CART_SHIFT (comm_cart, 1,1, neigh_rank(3),neigh_rank(4), ierr) 31 ! Initialize the grid functions and start the iteration 32 CALL init (u, f) 33 DO it=1,100 34 CALL relax (u, f) 35 ! Exchange data with neighbor processes 36 CALL exchange (u, comm_cart, neigh_rank, num_neigh) 37 END DO 38 CALL output (u) 39 40 Figure 7.1: Set-up of process structure for two-dimensional parallel Poisson solver. 41 42 43 44 45 46 47 48

362 332 CHAPTER 7. PROCESS TOPOLOGIES 1 2 3 4 5 6 7 8 9 10 SUBROUTINE exchange (u, comm_cart, neigh_rank, num_neigh) 11 REAL u(0:101,0:101) 12 INTEGER comm_cart, num_neigh, neigh_rank(num_neigh) 13 REAL sndbuf(100,num_neigh), rcvbuf(100,num_neigh) 14 INTEGER ierr 15 sndbuf(1:100,1) = u( 1,1:100) 16 sndbuf(1:100,2) = u(100,1:100) 17 sndbuf(1:100,3) = u(1:100, 1) 18 sndbuf(1:100,4) = u(1:100,100) 19 CALL MPI_NEIGHBOR_ALLTOALL (sndbuf, 100, MPI_REAL, rcvbuf, 100, MPI_REAL, & 20 comm_cart, ierr) 21 ! instead of 22 ! DO i=1,num_neigh 23 ! CALL MPI_IRECV(rcvbuf(1,i),100,MPI_REAL,neigh_rank(i),...,rq(2*i-1),& 24 ! ierr) 25 ! CALL MPI_ISEND(sndbuf(1,i),100,MPI_REAL,neigh_rank(i),...,rq(2*i ),& 26 ! ierr) 27 ! END DO 28 ! CALL MPI_WAITALL (2*num_neigh, rq, statuses, ierr) 29 30 u( 0,1:100) = rcvbuf(1:100,1) 31 u(101,1:100) = rcvbuf(1:100,2) 32 u(1:100, 0) = rcvbuf(1:100,3) 33 u(1:100,101) = rcvbuf(1:100,4) 34 END 35 36 37 Figure 7.2: Communication routine with local data copying and sparse neighborhood all- 38 to-all. 39 40 41 42 43 44 45 46 47 48

363 7.8. AN APPLICATION EXAMPLE 333 1 2 3 SUBROUTINE exchange (u, comm_cart, neigh_rank, num_neigh) 4 USE MPI 5 REAL u(0:101,0:101) 6 INTEGER comm_cart, num_neigh, neigh_rank(num_neigh) 7 INTEGER sndcounts(num_neigh), sndtypes(num_neigh) 8 INTEGER rcvcounts(num_neigh), rcvtypes(num_neigh) 9 INTEGER (KIND=MPI_ADDRESS_KIND) lb, sizeofreal, sdispls(num_neigh), & 10 rdispls(num_neigh) 11 INTEGER type_vec, i, ierr 12 ! The following initialization need to be done only once 13 ! before the first call of exchange. 14 CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, sizeofreal, ierr) 15 CALL MPI_TYPE_VECTOR (100, 1, 102, MPI_REAL, type_vec, ierr) 16 CALL MPI_TYPE_COMMIT (type_vec, ierr) 17 sndtypes(1) = type_vec 18 sndtypes(2) = type_vec 19 sndtypes(3) = MPI_REAL 20 sndtypes(4) = MPI_REAL 21 DO i=1,num_neigh 22 sndcounts(i) = 100 23 rcvcounts(i) = 100 24 rcvtypes(i) = sndtypes(i) 25 END DO 26 sdispls(1) = ( 1 + 1*102) * sizeofreal ! first element of u( 1,1:100) 27 sdispls(2) = (100 + 1*102) * sizeofreal ! first element of u(100,1:100) 28 sdispls(3) = ( 1 + 1*102) * sizeofreal ! first element of u(1:100, 1) 29 sdispls(4) = ( 1 + 100*102) * sizeofreal ! first element of u(1:100,100) 30 rdispls(1) = ( 0 + 1*102) * sizeofreal ! first element of u( 0,1:100) 31 rdispls(2) = (101 + 1*102) * sizeofreal ! first element of u(101,1:100) 32 rdispls(3) = ( 1 + 0*102) * sizeofreal ! first element of u(1:100, 0) 33 rdispls(4) = ( 1 + 101*102) * sizeofreal ! first element of u(1:100,101) 34 35 ! the following communication has to be done in each call of exchange 36 CALL MPI_NEIGHBOR_ALLTOALLW (u, sndcounts, sdispls, sndtypes, & 37 u, rcvcounts, rdispls, rcvtypes, comm_cart, ierr) 38 39 ! The following finalizing need to be done only once 40 ! after the last call of exchange. 41 CALL MPI_TYPE_FREE (type_vec, ierr) 42 END 43 44 Figure 7.3: Communication routine with sparse neighborhood all-to-all-w and without local 45 data copying. 46 47 48


Chapter 8

MPI Environmental Management

This chapter discusses routines for getting and, where appropriate, setting various parameters that relate to the MPI implementation and the execution environment (such as error handling). The procedures for entering and leaving the MPI execution environment are also described here.

8.1 Implementation Information

8.1.1 Version Inquiries

In order to cope with changes to the MPI Standard, there are both compile-time and run-time ways to determine which version of the standard is in use in the environment one is using.
The "version" will be represented by two separate integers, for the version and subversion: In C,

#define MPI_VERSION    3
#define MPI_SUBVERSION 0

in Fortran,

INTEGER :: MPI_VERSION, MPI_SUBVERSION
PARAMETER (MPI_VERSION    = 3)
PARAMETER (MPI_SUBVERSION = 0)

For runtime determination,

MPI_GET_VERSION( version, subversion )

  OUT   version      version number (integer)
  OUT   subversion   subversion number (integer)

int MPI_Get_version(int *version, int *subversion)

MPI_Get_version(version, subversion, ierror) BIND(C)
    INTEGER, INTENT(OUT) :: version, subversion
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_VERSION(VERSION, SUBVERSION, IERROR)
    INTEGER VERSION, SUBVERSION, IERROR

MPI_GET_VERSION can be called before MPI_INIT and after MPI_FINALIZE. Valid (MPI_VERSION, MPI_SUBVERSION) pairs in this and previous versions of the MPI standard are (3,0), (2,2), (2,1), (2,0), and (1,2).

MPI_GET_LIBRARY_VERSION( version, resultlen )

  OUT   version      version string (string)
  OUT   resultlen    Length (in printable characters) of the result returned in version (integer)

int MPI_Get_library_version(char *version, int *resultlen)

MPI_Get_library_version(version, resultlen, ierror) BIND(C)
    CHARACTER(LEN=MPI_MAX_LIBRARY_VERSION_STRING), INTENT(OUT) :: version
    INTEGER, INTENT(OUT) :: resultlen
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_LIBRARY_VERSION(VERSION, RESULTLEN, IERROR)
    CHARACTER*(*) VERSION
    INTEGER RESULTLEN, IERROR

This routine returns a string representing the version of the MPI library. The version argument is a character string for maximum flexibility.

Advice to implementors. An implementation of MPI should return a different string for every change to its source code or build that could be visible to the user. (End of advice to implementors.)

The argument version must represent storage that is MPI_MAX_LIBRARY_VERSION_STRING characters long. MPI_GET_LIBRARY_VERSION may write up to this many characters into version.
The number of characters actually written is returned in the output argument, resultlen. In C, a null character is additionally stored at version[resultlen]. The value of resultlen cannot be larger than MPI_MAX_LIBRARY_VERSION_STRING - 1. In Fortran, version is padded on the right with blank characters. The value of resultlen cannot be larger than MPI_MAX_LIBRARY_VERSION_STRING.
MPI_GET_LIBRARY_VERSION can be called before MPI_INIT and after MPI_FINALIZE.
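As an illustration (this fragment is not one of the standard's numbered examples), a program might combine the compile-time constants with the runtime inquiry routines as follows; the output format is arbitrary.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int version, subversion, resultlen;
    char lib[MPI_MAX_LIBRARY_VERSION_STRING];

    /* Both inquiry routines may be called before MPI_INIT. */
    MPI_Get_version(&version, &subversion);
    MPI_Get_library_version(lib, &resultlen);

    printf("Compiled against MPI %d.%d, running with MPI %d.%d\n",
           MPI_VERSION, MPI_SUBVERSION, version, subversion);
    printf("Library: %s\n", lib);

    MPI_Init(&argc, &argv);
    /* ... */
    MPI_Finalize();
    return 0;
}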

8.1.2 Environmental Inquiries

A set of attributes that describe the execution environment are attached to the communicator MPI_COMM_WORLD when MPI is initialized. The values of these attributes can be inquired by using the function MPI_COMM_GET_ATTR described in Section 6.7 on page 265 and in Section 17.2.7 on page 653. It is erroneous to delete these attributes, free their keys, or change their values.
The list of predefined attribute keys includes

MPI_TAG_UB  Upper bound for tag value.

MPI_HOST  Host process rank, if such exists, MPI_PROC_NULL, otherwise.

MPI_IO  rank of a node that has regular I/O facilities (possibly myrank). Nodes in the same communicator may return different values for this parameter.

MPI_WTIME_IS_GLOBAL  Boolean variable that indicates whether clocks are synchronized.

Vendors may add implementation-specific parameters (such as node number, real memory size, virtual memory size, etc.)
These predefined attributes do not change value between MPI initialization (MPI_INIT) and MPI completion (MPI_FINALIZE), and cannot be updated or deleted by users.

Advice to users. Note that in the C binding, the value returned by these attributes is a pointer to an int containing the requested value. (End of advice to users.)

The required parameter values are discussed in more detail below:

Tag Values

Tag values range from 0 to the value returned for MPI_TAG_UB, inclusive. These values are guaranteed to be unchanging during the execution of an MPI program. In addition, the tag upper bound value must be at least 32767. An MPI implementation is free to make the value of MPI_TAG_UB larger than this; for example, the value 2^30 - 1 is also a valid value for MPI_TAG_UB.
The attribute MPI_TAG_UB has the same value on all processes of MPI_COMM_WORLD.

Host Rank

The value returned for MPI_HOST gets the rank of the HOST process in the group associated with communicator MPI_COMM_WORLD, if there is such. MPI_PROC_NULL is returned if there is no host. MPI does not specify what it means for a process to be a HOST, nor does it require that a HOST exists.
The attribute MPI_HOST has the same value on all processes of MPI_COMM_WORLD.

IO Rank

The value returned for MPI_IO is the rank of a processor that can provide language-standard I/O facilities. For Fortran, this means that all of the Fortran I/O operations are supported (e.g., OPEN, REWIND, WRITE). For C, this means that all of the ISO C I/O operations are supported (e.g., fopen, fprintf, lseek).
If every process can provide language-standard I/O, then the value MPI_ANY_SOURCE will be returned. Otherwise, if the calling process can provide language-standard I/O, then its rank will be returned. Otherwise, if some process can provide language-standard I/O then the rank of one such process will be returned. The same value need not be returned by all processes. If no process can provide language-standard I/O, then the value MPI_PROC_NULL will be returned.

Advice to users. Note that input is not collective, and this attribute does not indicate which process can or does provide input. (End of advice to users.)
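The following illustrative C fragment (not one of the standard's numbered examples) sketches how these attributes might be inquired with MPI_COMM_GET_ATTR; as noted above, in C the attribute value is returned as a pointer to an int.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int flag;
    int *tag_ub_ptr, *wtime_global_ptr;

    MPI_Init(&argc, &argv);

    /* The attribute value is a pointer to an int holding the value. */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
    if (flag)
        printf("Largest usable tag: %d\n", *tag_ub_ptr);

    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL,
                      &wtime_global_ptr, &flag);
    if (flag && *wtime_global_ptr)
        printf("MPI_WTIME is synchronized across MPI_COMM_WORLD\n");

    MPI_Finalize();
    return 0;
}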

Clock Synchronization

The value returned for MPI_WTIME_IS_GLOBAL is 1 if clocks at all processes in MPI_COMM_WORLD are synchronized, 0 otherwise. A collection of clocks is considered synchronized if explicit effort has been taken to synchronize them. The expectation is that the variation in time, as measured by calls to MPI_WTIME, will be less than one half the round-trip time for an MPI message of length zero. If time is measured at a process just before a send and at another process just after a matching receive, the second time should always be higher than the first one.
The attribute MPI_WTIME_IS_GLOBAL need not be present when the clocks are not synchronized (however, the attribute key MPI_WTIME_IS_GLOBAL is always valid). This attribute may be associated with communicators other than MPI_COMM_WORLD.
The attribute MPI_WTIME_IS_GLOBAL has the same value on all processes of MPI_COMM_WORLD.

Inquire Processor Name

MPI_GET_PROCESSOR_NAME( name, resultlen )

  OUT   name         A unique specifier for the actual (as opposed to virtual) node.
  OUT   resultlen    Length (in printable characters) of the result returned in name

int MPI_Get_processor_name(char *name, int *resultlen)

MPI_Get_processor_name(name, resultlen, ierror) BIND(C)
    CHARACTER(LEN=MPI_MAX_PROCESSOR_NAME), INTENT(OUT) :: name
    INTEGER, INTENT(OUT) :: resultlen
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_GET_PROCESSOR_NAME( NAME, RESULTLEN, IERROR)
    CHARACTER*(*) NAME
    INTEGER RESULTLEN, IERROR

This routine returns the name of the processor on which it was called at the moment of the call. The name is a character string for maximum flexibility. From this value it must be possible to identify a specific piece of hardware; possible values include "processor 9 in rack 4 of mpp.cs.org" and "231" (where 231 is the actual processor number in the running homogeneous system). The argument name must represent storage that is at least MPI_MAX_PROCESSOR_NAME characters long. MPI_GET_PROCESSOR_NAME may write up to this many characters into name.
The number of characters actually written is returned in the output argument, resultlen. In C, a null character is additionally stored at name[resultlen]. The value of resultlen cannot be larger than MPI_MAX_PROCESSOR_NAME - 1. In Fortran, name is padded on the right with blank characters. The value of resultlen cannot be larger than MPI_MAX_PROCESSOR_NAME.

Rationale. This function allows MPI implementations that do process migration to return the current processor. Note that nothing in MPI requires or defines process migration; this definition of MPI_GET_PROCESSOR_NAME simply allows such an implementation. (End of rationale.)

Advice to users. The user must provide at least MPI_MAX_PROCESSOR_NAME space to write the processor name; processor names can be this long. The user should examine the output argument, resultlen, to determine the actual length of the name. (End of advice to users.)
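A minimal C sketch of this inquiry (not one of the standard's numbered examples); the printed text is arbitrary.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int resultlen, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The buffer must be at least MPI_MAX_PROCESSOR_NAME characters long. */
    MPI_Get_processor_name(name, &resultlen);
    printf("Rank %d runs on %.*s\n", rank, resultlen, name);

    MPI_Finalize();
    return 0;
}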

8.2 Memory Allocation

In some systems, message-passing and remote-memory-access (RMA) operations run faster when accessing specially allocated memory (e.g., memory that is shared by the other processes in the communicating group on an SMP). MPI provides a mechanism for allocating and freeing such special memory. The use of such memory for message-passing or RMA is not mandatory, and this memory can be used without restrictions as any other dynamically allocated memory. However, implementations may restrict the use of some RMA functionality as defined in Section 11.5.3.

MPI_ALLOC_MEM(size, info, baseptr)

  IN    size      size of memory segment in bytes (non-negative integer)
  IN    info      info argument (handle)
  OUT   baseptr   pointer to beginning of memory segment allocated

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)

MPI_Alloc_mem(size, info, baseptr, ierror) BIND(C)
    USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
    INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(C_PTR), INTENT(OUT) :: baseptr
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ALLOC_MEM(SIZE, INFO, BASEPTR, IERROR)
    INTEGER INFO, IERROR
    INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR

If the Fortran compiler provides TYPE(C_PTR), then the following interface must be provided in the mpi module and should be provided in mpif.h through overloading, i.e., with the same routine name as the routine with INTEGER(KIND=MPI_ADDRESS_KIND) BASEPTR, but with a different linker name:

INTERFACE MPI_ALLOC_MEM
    SUBROUTINE MPI_ALLOC_MEM_CPTR(SIZE, INFO, BASEPTR, IERROR)
        USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
        INTEGER :: INFO, IERROR
        INTEGER(KIND=MPI_ADDRESS_KIND) :: SIZE
        TYPE(C_PTR) :: BASEPTR
    END SUBROUTINE
END INTERFACE

The linker name base of this overloaded function is MPI_ALLOC_MEM_CPTR. The implied linker names are described in Section 17.1.5 on page 605.
The info argument can be used to provide directives that control the desired location of the allocated memory. Such a directive does not affect the semantics of the call. Valid info values are implementation-dependent; a null directive value of info = MPI_INFO_NULL is always valid.
The function MPI_ALLOC_MEM may return an error code of class MPI_ERR_NO_MEM to indicate it failed because memory is exhausted.

MPI_FREE_MEM(base)

  IN    base    initial address of memory segment allocated by MPI_ALLOC_MEM (choice)

int MPI_Free_mem(void *base)

MPI_Free_mem(base, ierror) BIND(C)
    TYPE(*), DIMENSION(..), INTENT(IN), ASYNCHRONOUS :: base
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FREE_MEM(BASE, IERROR)
    BASE(*)
    INTEGER IERROR

The function MPI_FREE_MEM may return an error code of class MPI_ERR_BASE to indicate an invalid base argument.

Rationale. The C bindings of MPI_ALLOC_MEM and MPI_FREE_MEM are similar to the bindings for the malloc and free C library calls: a call to MPI_Alloc_mem(..., &base) should be paired with a call to MPI_Free_mem(base) (one less level of indirection). Both arguments are declared to be of the same type void* so as to facilitate type casting. The Fortran binding is consistent with the C bindings: the Fortran MPI_ALLOC_MEM call returns in baseptr the TYPE(C_PTR) pointer or the (integer valued) address of the allocated memory. The base argument of MPI_FREE_MEM is a choice argument, which passes (a reference to) the variable stored at that location. (End of rationale.)

Advice to implementors. If MPI_ALLOC_MEM allocates special memory, then a design similar to the design of the C malloc and free functions has to be used, in order to find out the size of a memory segment when the segment is freed. If no special memory is used, MPI_ALLOC_MEM simply invokes malloc, and MPI_FREE_MEM invokes free.
A call to MPI_ALLOC_MEM can be used in shared memory systems to allocate memory in a shared memory segment. (End of advice to implementors.)

Example 8.1 Example of use of MPI_ALLOC_MEM, in Fortran with TYPE(C_PTR) pointers. We assume 4-byte REALs.

USE mpi_f08     ! or USE mpi   (not guaranteed with INCLUDE 'mpif.h')
USE, INTRINSIC :: ISO_C_BINDING
TYPE(C_PTR) :: p
REAL, DIMENSION(:,:), POINTER :: a      ! no memory is allocated
INTEGER, DIMENSION(2) :: shape
INTEGER(KIND=MPI_ADDRESS_KIND) :: size
shape = (/100,100/)
size = 4 * shape(1) * shape(2)          ! assuming 4 bytes per REAL
CALL MPI_Alloc_mem(size,MPI_INFO_NULL,p,ierr)  ! memory is allocated and
CALL C_F_POINTER(p, a, shape)  ! intrinsic     ! now accessible via a(i,j)
...                            ! in ISO_C_BINDING
a(3,5) = 2.71;
...
CALL MPI_Free_mem(a, ierr)     ! memory is freed

Example 8.2 Example of use of MPI_ALLOC_MEM, in Fortran with non-standard Cray-pointers. We assume 4-byte REALs, and assume that these pointers are address-sized.

REAL A
POINTER (P, A(100,100))        ! no memory is allocated
INTEGER(KIND=MPI_ADDRESS_KIND) SIZE
SIZE = 4*100*100
CALL MPI_ALLOC_MEM(SIZE, MPI_INFO_NULL, P, IERR)
                               ! memory is allocated
...
A(3,5) = 2.71;
...
CALL MPI_FREE_MEM(A, IERR)     ! memory is freed

This code is not Fortran 77 or Fortran 90 code. Some compilers may not support this code or need a special option, e.g., the GNU gFortran compiler needs -fcray-pointer.

Advice to implementors. Some compilers map Cray-pointers to address-sized integers, some to TYPE(C_PTR) pointers (e.g., Cray Fortran, version 7.3.3). From the user's viewpoint, this mapping is irrelevant because Example 8.2 should work correctly with an MPI-3.0 (or later) library if Cray-pointers are available. (End of advice to implementors.)

Example 8.3 Same example, in C.

float (* f)[100][100];
/* no memory is allocated */
MPI_Alloc_mem(sizeof(float)*100*100, MPI_INFO_NULL, &f);
/* memory allocated */
...
(*f)[5][3] = 2.71;
...
MPI_Free_mem(f);

8.3 Error Handling

An MPI implementation cannot or may choose not to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations. The set of errors that are handled by MPI is implementation-dependent. Each such error generates an MPI exception.
The above text takes precedence over any text on error handling within this document. Specifically, text that states that errors will be handled should be read as may be handled.
A user can associate error handlers to three types of objects: communicators, windows, and files. The specified error handling routine will be used for any MPI exception that occurs during a call to MPI for the respective object. MPI calls that are not related to any objects are considered to be attached to the communicator MPI_COMM_WORLD. The attachment of error handlers to objects is purely local: different processes may attach different error handlers to corresponding objects.
Several predefined error handlers are available in MPI:

MPI_ERRORS_ARE_FATAL  The handler, when called, causes the program to abort on all executing processes. This has the same effect as if MPI_ABORT was called by the process that invoked the handler.

MPI_ERRORS_RETURN  The handler has no effect other than returning the error code to the user.

Implementations may provide additional predefined error handlers and programmers can code their own error handlers.
The error handler MPI_ERRORS_ARE_FATAL is associated by default with MPI_COMM_WORLD after initialization. Thus, if the user chooses not to control error handling, every error that MPI handles is treated as fatal. Since (almost) all MPI calls return an error code, a user may choose to handle errors in its main code, by testing the return code of MPI calls and executing a suitable recovery code when the call was not successful. In this case, the error handler MPI_ERRORS_RETURN will be used. Usually it is more convenient and more efficient not to test for errors after each MPI call, and have such errors handled by a non-trivial MPI error handler.
After an error is detected, the state of MPI is undefined. That is, using a user-defined error handler, or MPI_ERRORS_RETURN, does not necessarily allow the user to continue to use MPI after an error is detected. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.

Advice to implementors. A high-quality implementation will, to the greatest possible extent, circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked. The implementation documentation will provide information on the possible effect of each class of errors. (End of advice to implementors.)
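The return-code style of error handling described above can be sketched as follows (not one of the standard's numbered examples). The fragment uses MPI_COMM_SET_ERRHANDLER and MPI_ERROR_STRING, which are defined later in this chapter, and assumes that the implementation detects the out-of-range destination rank; recall that the state of MPI is undefined after an error is detected.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rc, size, number = 42, resultlen;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that an
       erroneous call returns an error code instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately erroneous call: rank 'size' is out of range. */
    rc = MPI_Send(&number, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &resultlen);
        printf("MPI_Send failed: %s\n", msg);
        /* application-specific recovery or cleanup */
    }

    MPI_Finalize();
    return 0;
}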

An MPI error handler is an opaque object, which is accessed by a handle. MPI calls are provided to create new error handlers, to associate error handlers with objects, and to test which error handler is associated with an object. C has distinct typedefs for user defined error handling callback functions that accept communicator, file, and window arguments. In Fortran there are three user routines.
An error handler object is created by a call to MPI_XXX_CREATE_ERRHANDLER, where XXX is, respectively, COMM, WIN, or FILE.
An error handler is attached to a communicator, window, or file by a call to MPI_XXX_SET_ERRHANDLER. The error handler must be either a predefined error handler, or an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER, with matching XXX. The predefined error handlers MPI_ERRORS_RETURN and MPI_ERRORS_ARE_FATAL can be attached to communicators, windows, and files.
The error handler currently associated with a communicator, window, or file can be retrieved by a call to MPI_XXX_GET_ERRHANDLER.
The MPI function MPI_ERRHANDLER_FREE can be used to free an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER.
MPI_{COMM,WIN,FILE}_GET_ERRHANDLER behave as if a new error handler object is created. That is, once the error handler is no longer needed, MPI_ERRHANDLER_FREE should be called with the error handler returned from MPI_{COMM,WIN,FILE}_GET_ERRHANDLER to mark the error handler for deallocation. This provides behavior similar to that of MPI_COMM_GROUP and MPI_GROUP_FREE.

Advice to implementors. High-quality implementations should raise an error when an error handler that was created by a call to MPI_XXX_CREATE_ERRHANDLER is attached to an object of the wrong type with a call to MPI_YYY_SET_ERRHANDLER. To do so, it is necessary to maintain, with each error handler, information on the typedef of the associated user function. (End of advice to implementors.)

The syntax for these calls is given below.

8.3.1 Error Handlers for Communicators

MPI_COMM_CREATE_ERRHANDLER(comm_errhandler_fn, errhandler)

  IN    comm_errhandler_fn   user defined error handling procedure (function)
  OUT   errhandler           MPI error handler (handle)

int MPI_Comm_create_errhandler(MPI_Comm_errhandler_function
              *comm_errhandler_fn, MPI_Errhandler *errhandler)

MPI_Comm_create_errhandler(comm_errhandler_fn, errhandler, ierror) BIND(C)
    PROCEDURE(MPI_Comm_errhandler_function) :: comm_errhandler_fn
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_CREATE_ERRHANDLER(COMM_ERRHANDLER_FN, ERRHANDLER, IERROR)
    EXTERNAL COMM_ERRHANDLER_FN
    INTEGER ERRHANDLER, IERROR

Creates an error handler that can be attached to communicators.
The user routine should be, in C, a function of type MPI_Comm_errhandler_function, which is defined as

typedef void MPI_Comm_errhandler_function(MPI_Comm *, int *, ...);

The first argument is the communicator in use. The second is the error code to be returned by the MPI routine that raised the error. If the routine would have returned MPI_ERR_IN_STATUS, it is the error code returned in the status for the request that caused the error handler to be invoked. The remaining arguments are "varargs" arguments whose number and meaning is implementation-dependent. An implementation should clearly document these arguments. Addresses are used so that the handler may be written in Fortran.

With the Fortran mpi_f08 module, the user routine comm_errhandler_fn should be of the form:

ABSTRACT INTERFACE
    SUBROUTINE MPI_Comm_errhandler_function(comm, error_code) BIND(C)
    TYPE(MPI_Comm) :: comm
    INTEGER :: error_code

With the Fortran mpi module and mpif.h, the user routine COMM_ERRHANDLER_FN should be of the form:

SUBROUTINE COMM_ERRHANDLER_FUNCTION(COMM, ERROR_CODE)
    INTEGER COMM, ERROR_CODE

Rationale. The variable argument list is provided because it provides an ISO-standard hook for providing additional information to the error handler; without this hook, ISO C prohibits additional arguments. (End of rationale.)

Advice to users. A newly created communicator inherits the error handler that is associated with the "parent" communicator. In particular, the user can specify a "global" error handler for all communicators by associating this handler with the communicator MPI_COMM_WORLD immediately after initialization. (End of advice to users.)

MPI_COMM_SET_ERRHANDLER(comm, errhandler)

  INOUT comm         communicator (handle)
  IN    errhandler   new error handler for communicator (handle)

int MPI_Comm_set_errhandler(MPI_Comm comm, MPI_Errhandler errhandler)

MPI_Comm_set_errhandler(comm, errhandler, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Errhandler), INTENT(IN) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SET_ERRHANDLER(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

Attaches a new error handler to a communicator. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_COMM_CREATE_ERRHANDLER.

MPI_COMM_GET_ERRHANDLER(comm, errhandler)

  IN    comm         communicator (handle)
  OUT   errhandler   error handler currently associated with communicator (handle)

int MPI_Comm_get_errhandler(MPI_Comm comm, MPI_Errhandler *errhandler)

MPI_Comm_get_errhandler(comm, errhandler, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GET_ERRHANDLER(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR

Retrieves the error handler currently associated with a communicator.
For example, a library function may register at its entry point the current error handler for a communicator, set its own private error handler for this communicator, and restore the previous error handler before exiting.
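The save/set/restore pattern just described can be sketched in C as follows (not one of the standard's numbered examples; the handler and routine names are hypothetical). The sketch also uses MPI_ERROR_STRING and MPI_ERRHANDLER_FREE, which are defined in Section 8.3.4.

#include <stdio.h>
#include <mpi.h>

/* User-defined handler of type MPI_Comm_errhandler_function.
   After it returns, the failing MPI call returns the error code. */
static void my_comm_errhandler(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "library error on communicator: %s\n", msg);
}

/* A library routine that installs its own handler and restores the
   caller's handler before returning. */
void library_routine(MPI_Comm comm)
{
    MPI_Errhandler saved, mine;

    MPI_Comm_get_errhandler(comm, &saved);      /* save caller's handler    */
    MPI_Comm_create_errhandler(my_comm_errhandler, &mine);
    MPI_Comm_set_errhandler(comm, mine);        /* install private handler  */

    /* ... library communication on comm ... */

    MPI_Comm_set_errhandler(comm, saved);       /* restore caller's handler */
    MPI_Errhandler_free(&mine);
    MPI_Errhandler_free(&saved);   /* GET behaves as if a new handle was created */
}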

8.3.2 Error Handlers for Windows

MPI_WIN_CREATE_ERRHANDLER(win_errhandler_fn, errhandler)

  IN    win_errhandler_fn    user defined error handling procedure (function)
  OUT   errhandler           MPI error handler (handle)

int MPI_Win_create_errhandler(MPI_Win_errhandler_function
              *win_errhandler_fn, MPI_Errhandler *errhandler)

MPI_Win_create_errhandler(win_errhandler_fn, errhandler, ierror) BIND(C)
    PROCEDURE(MPI_Win_errhandler_function) :: win_errhandler_fn
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_CREATE_ERRHANDLER(WIN_ERRHANDLER_FN, ERRHANDLER, IERROR)
    EXTERNAL WIN_ERRHANDLER_FN
    INTEGER ERRHANDLER, IERROR

Creates an error handler that can be attached to a window object. The user routine should be, in C, a function of type MPI_Win_errhandler_function which is defined as

typedef void MPI_Win_errhandler_function(MPI_Win *, int *, ...);

The first argument is the window in use, the second is the error code to be returned.

With the Fortran mpi_f08 module, the user routine win_errhandler_fn should be of the form:

ABSTRACT INTERFACE
    SUBROUTINE MPI_Win_errhandler_function(win, error_code) BIND(C)
    TYPE(MPI_Win) :: win
    INTEGER :: error_code

With the Fortran mpi module and mpif.h, the user routine WIN_ERRHANDLER_FN should be of the form:

SUBROUTINE WIN_ERRHANDLER_FUNCTION(WIN, ERROR_CODE)
    INTEGER WIN, ERROR_CODE

MPI_WIN_SET_ERRHANDLER(win, errhandler)

  INOUT win          window (handle)
  IN    errhandler   new error handler for window (handle)

int MPI_Win_set_errhandler(MPI_Win win, MPI_Errhandler errhandler)

MPI_Win_set_errhandler(win, errhandler, ierror) BIND(C)
    TYPE(MPI_Win), INTENT(IN) :: win
    TYPE(MPI_Errhandler), INTENT(IN) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_SET_ERRHANDLER(WIN, ERRHANDLER, IERROR)
    INTEGER WIN, ERRHANDLER, IERROR

Attaches a new error handler to a window. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_WIN_CREATE_ERRHANDLER.

MPI_WIN_GET_ERRHANDLER(win, errhandler)

  IN    win          window (handle)
  OUT   errhandler   error handler currently associated with window (handle)

int MPI_Win_get_errhandler(MPI_Win win, MPI_Errhandler *errhandler)

MPI_Win_get_errhandler(win, errhandler, ierror) BIND(C)
    TYPE(MPI_Win), INTENT(IN) :: win
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_GET_ERRHANDLER(WIN, ERRHANDLER, IERROR)
    INTEGER WIN, ERRHANDLER, IERROR

Retrieves the error handler currently associated with a window.

8.3.3 Error Handlers for Files

MPI_FILE_CREATE_ERRHANDLER(file_errhandler_fn, errhandler)

  IN    file_errhandler_fn   user defined error handling procedure (function)
  OUT   errhandler           MPI error handler (handle)

int MPI_File_create_errhandler(MPI_File_errhandler_function
              *file_errhandler_fn, MPI_Errhandler *errhandler)

MPI_File_create_errhandler(file_errhandler_fn, errhandler, ierror) BIND(C)
    PROCEDURE(MPI_File_errhandler_function) :: file_errhandler_fn
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FILE_CREATE_ERRHANDLER(FILE_ERRHANDLER_FN, ERRHANDLER, IERROR)
    EXTERNAL FILE_ERRHANDLER_FN
    INTEGER ERRHANDLER, IERROR

Creates an error handler that can be attached to a file object. The user routine should be, in C, a function of type MPI_File_errhandler_function, which is defined as

typedef void MPI_File_errhandler_function(MPI_File *, int *, ...);

The first argument is the file in use, the second is the error code to be returned.

With the Fortran mpi_f08 module, the user routine file_errhandler_fn should be of the form:

ABSTRACT INTERFACE
    SUBROUTINE MPI_File_errhandler_function(file, error_code) BIND(C)
    TYPE(MPI_File) :: file
    INTEGER :: error_code

With the Fortran mpi module and mpif.h, the user routine FILE_ERRHANDLER_FN should be of the form:

SUBROUTINE FILE_ERRHANDLER_FUNCTION(FILE, ERROR_CODE)
    INTEGER FILE, ERROR_CODE

MPI_FILE_SET_ERRHANDLER(file, errhandler)

  INOUT file         file (handle)
  IN    errhandler   new error handler for file (handle)

int MPI_File_set_errhandler(MPI_File file, MPI_Errhandler errhandler)

MPI_File_set_errhandler(file, errhandler, ierror) BIND(C)
    TYPE(MPI_File), INTENT(IN) :: file
    TYPE(MPI_Errhandler), INTENT(IN) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FILE_SET_ERRHANDLER(FILE, ERRHANDLER, IERROR)
    INTEGER FILE, ERRHANDLER, IERROR

Attaches a new error handler to a file. The error handler must be either a predefined error handler, or an error handler created by a call to MPI_FILE_CREATE_ERRHANDLER.

MPI_FILE_GET_ERRHANDLER(file, errhandler)

  IN    file         file (handle)
  OUT   errhandler   error handler currently associated with file (handle)

int MPI_File_get_errhandler(MPI_File file, MPI_Errhandler *errhandler)

MPI_File_get_errhandler(file, errhandler, ierror) BIND(C)
    TYPE(MPI_File), INTENT(IN) :: file
    TYPE(MPI_Errhandler), INTENT(OUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FILE_GET_ERRHANDLER(FILE, ERRHANDLER, IERROR)
    INTEGER FILE, ERRHANDLER, IERROR

Retrieves the error handler currently associated with a file.

8.3.4 Freeing Errorhandlers and Retrieving Error Strings

MPI_ERRHANDLER_FREE( errhandler )

  INOUT errhandler   MPI error handler (handle)

int MPI_Errhandler_free(MPI_Errhandler *errhandler)

MPI_Errhandler_free(errhandler, ierror) BIND(C)
    TYPE(MPI_Errhandler), INTENT(INOUT) :: errhandler
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ERRHANDLER_FREE(ERRHANDLER, IERROR)
    INTEGER ERRHANDLER, IERROR

Marks the error handler associated with errhandler for deallocation and sets errhandler to MPI_ERRHANDLER_NULL. The error handler will be deallocated after all the objects associated with it (communicator, window, or file) have been deallocated.

MPI_ERROR_STRING( errorcode, string, resultlen )

  IN    errorcode    Error code returned by an MPI routine
  OUT   string       Text that corresponds to the errorcode
  OUT   resultlen    Length (in printable characters) of the result returned in string

int MPI_Error_string(int errorcode, char *string, int *resultlen)

MPI_Error_string(errorcode, string, resultlen, ierror) BIND(C)
    INTEGER, INTENT(IN) :: errorcode
    CHARACTER(LEN=MPI_MAX_ERROR_STRING), INTENT(OUT) :: string
    INTEGER, INTENT(OUT) :: resultlen
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ERROR_STRING(ERRORCODE, STRING, RESULTLEN, IERROR)
    INTEGER ERRORCODE, RESULTLEN, IERROR
    CHARACTER*(*) STRING

Returns the error string associated with an error code or class. The argument string must represent storage that is at least MPI_MAX_ERROR_STRING characters long.
The number of characters actually written is returned in the output argument, resultlen.

Rationale. The form of this function was chosen to make the Fortran and C bindings similar. A version that returns a pointer to a string has two difficulties. First, the return string must be statically allocated and different for each error message (allowing the pointers returned by successive calls to MPI_ERROR_STRING to point to the correct message). Second, in Fortran, a function declared as returning CHARACTER*(*) can not be referenced in, for example, a PRINT statement. (End of rationale.)

8.4 Error Codes and Classes

The error codes returned by MPI are left entirely to the implementation (with the exception of MPI_SUCCESS). This is done to allow an implementation to provide as much information as possible in the error code (for use with MPI_ERROR_STRING).
To make it possible for an application to interpret an error code, the routine MPI_ERROR_CLASS converts any error code into one of a small set of standard error codes, called error classes. Valid error classes are shown in Table 8.1 and Table 8.2.
The error classes are a subset of the error codes: an MPI function may return an error class number; and the function MPI_ERROR_STRING can be used to compute the error string associated with an error class. The values defined for MPI error classes are valid MPI error codes.
The error codes satisfy,

    0 = MPI_SUCCESS < MPI_ERR_... ≤ MPI_ERR_LASTCODE.

MPI_SUCCESS              No error
MPI_ERR_BUFFER           Invalid buffer pointer
MPI_ERR_COUNT            Invalid count argument
MPI_ERR_TYPE             Invalid datatype argument
MPI_ERR_TAG              Invalid tag argument
MPI_ERR_COMM             Invalid communicator
MPI_ERR_RANK             Invalid rank
MPI_ERR_REQUEST          Invalid request (handle)
MPI_ERR_ROOT             Invalid root
MPI_ERR_GROUP            Invalid group
MPI_ERR_OP               Invalid operation
MPI_ERR_TOPOLOGY         Invalid topology
MPI_ERR_DIMS             Invalid dimension argument
MPI_ERR_ARG              Invalid argument of some other kind
MPI_ERR_UNKNOWN          Unknown error
MPI_ERR_TRUNCATE         Message truncated on receive
MPI_ERR_OTHER            Known error not in this list
MPI_ERR_INTERN           Internal MPI (implementation) error
MPI_ERR_IN_STATUS        Error code is in status
MPI_ERR_PENDING          Pending request
MPI_ERR_KEYVAL           Invalid keyval has been passed
MPI_ERR_NO_MEM           MPI_ALLOC_MEM failed because memory is exhausted
MPI_ERR_BASE             Invalid base passed to MPI_FREE_MEM
MPI_ERR_INFO_KEY         Key longer than MPI_MAX_INFO_KEY
MPI_ERR_INFO_VALUE       Value longer than MPI_MAX_INFO_VAL
MPI_ERR_INFO_NOKEY       Invalid key passed to MPI_INFO_DELETE
MPI_ERR_SPAWN            Error in spawning processes
MPI_ERR_PORT             Invalid port name passed to MPI_COMM_CONNECT
MPI_ERR_SERVICE          Invalid service name passed to MPI_UNPUBLISH_NAME
MPI_ERR_NAME             Invalid service name passed to MPI_LOOKUP_NAME
MPI_ERR_WIN              Invalid win argument
MPI_ERR_SIZE             Invalid size argument
MPI_ERR_DISP             Invalid disp argument
MPI_ERR_INFO             Invalid info argument
MPI_ERR_LOCKTYPE         Invalid locktype argument
MPI_ERR_ASSERT           Invalid assert argument
MPI_ERR_RMA_CONFLICT     Conflicting accesses to window
MPI_ERR_RMA_SYNC         Wrong synchronization of RMA calls

Table 8.1: Error classes (Part 1)

MPI_ERR_RMA_RANGE               Target memory is not part of the window (in the case of a
                                window created with MPI_WIN_CREATE_DYNAMIC, target
                                memory is not attached)
MPI_ERR_RMA_ATTACH              Memory cannot be attached (e.g., because of resource
                                exhaustion)
MPI_ERR_RMA_SHARED              Memory cannot be shared (e.g., some process in the group
                                of the specified communicator cannot expose shared memory)
MPI_ERR_RMA_FLAVOR              Passed window has the wrong flavor for the called function
MPI_ERR_FILE                    Invalid file handle
MPI_ERR_NOT_SAME                Collective argument not identical on all processes, or
                                collective routines called in a different order by different
                                processes
MPI_ERR_AMODE                   Error related to the amode passed to MPI_FILE_OPEN
MPI_ERR_UNSUPPORTED_DATAREP     Unsupported datarep passed to MPI_FILE_SET_VIEW
MPI_ERR_UNSUPPORTED_OPERATION   Unsupported operation, such as seeking on a file which
                                supports sequential access only
MPI_ERR_NO_SUCH_FILE            File does not exist
MPI_ERR_FILE_EXISTS             File exists
MPI_ERR_BAD_FILE                Invalid file name (e.g., path name too long)
MPI_ERR_ACCESS                  Permission denied
MPI_ERR_NO_SPACE                Not enough space
MPI_ERR_QUOTA                   Quota exceeded
MPI_ERR_READ_ONLY               Read-only file or file system
MPI_ERR_FILE_IN_USE             File operation could not be completed, as the file is
                                currently open by some process
MPI_ERR_DUP_DATAREP             Conversion functions could not be registered because a data
                                representation identifier that was already defined was passed
                                to MPI_REGISTER_DATAREP
MPI_ERR_CONVERSION              An error occurred in a user supplied data conversion
                                function.
MPI_ERR_IO                      Other I/O error
MPI_ERR_LASTCODE                Last error code

Table 8.2: Error classes (Part 2)

Rationale. The difference between MPI_ERR_UNKNOWN and MPI_ERR_OTHER is that MPI_ERROR_STRING can return useful information about MPI_ERR_OTHER.
Note that MPI_SUCCESS = 0 is necessary to be consistent with C practice; the separation of error classes and error codes allows us to define the error classes this way. Having a known LASTCODE is often a nice sanity check as well. (End of rationale.)

MPI_ERROR_CLASS( errorcode, errorclass )

  IN    errorcode    Error code returned by an MPI routine
  OUT   errorclass   Error class associated with errorcode

int MPI_Error_class(int errorcode, int *errorclass)

MPI_Error_class(errorcode, errorclass, ierror) BIND(C)
    INTEGER, INTENT(IN) :: errorcode
    INTEGER, INTENT(OUT) :: errorclass
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ERROR_CLASS(ERRORCODE, ERRORCLASS, IERROR)
    INTEGER ERRORCODE, ERRORCLASS, IERROR

The function MPI_ERROR_CLASS maps each standard error code (error class) onto itself.
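An illustrative C fragment (not one of the standard's numbered examples) that decodes an error code returned by some MPI call into its class and string:

#include <stdio.h>
#include <mpi.h>

/* Prints both the implementation-specific error string and the
   standard error class for a code returned by an MPI call. */
void report_error(int errorcode)
{
    char msg[MPI_MAX_ERROR_STRING];
    int errorclass, resultlen;

    MPI_Error_class(errorcode, &errorclass);
    MPI_Error_string(errorcode, msg, &resultlen);
    fprintf(stderr, "error class %d: %s\n", errorclass, msg);

    if (errorclass == MPI_ERR_TRUNCATE)
        fprintf(stderr, "a message was truncated on receive\n");
}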

8.5 Error Classes, Error Codes, and Error Handlers

Users may want to write a layered library on top of an existing MPI implementation, and this library may have its own set of error codes and classes. An example of such a library is an I/O library based on MPI, see Chapter 13 on page 489. For this purpose, functions are needed to:

1. add a new error class to the ones an MPI implementation already knows.

2. associate error codes with this error class, so that MPI_ERROR_CLASS works.

3. associate strings with these error codes, so that MPI_ERROR_STRING works.

4. invoke the error handler associated with a communicator, window, or object.

Several functions are provided to do this. They are all local. No functions are provided to free error classes or codes: it is not expected that an application will generate them in significant numbers.

MPI_ADD_ERROR_CLASS(errorclass)

  OUT   errorclass   value for the new error class (integer)

int MPI_Add_error_class(int *errorclass)

MPI_Add_error_class(errorclass, ierror) BIND(C)
    INTEGER, INTENT(OUT) :: errorclass
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ADD_ERROR_CLASS(ERRORCLASS, IERROR)
    INTEGER ERRORCLASS, IERROR

Creates a new error class and returns the value for it.

Rationale. To avoid conflicts with existing error codes and classes, the value is set by the implementation and not by the user. (End of rationale.)

Advice to implementors. A high-quality implementation will return the value for a new errorclass in the same deterministic way on all processes. (End of advice to implementors.)

Advice to users. Since a call to MPI_ADD_ERROR_CLASS is local, the same errorclass may not be returned on all processes that make this call. Thus, it is not safe to assume that registering a new error on a set of processes at the same time will yield the same errorclass on all of the processes. However, if an implementation returns the new errorclass in a deterministic way, and they are always generated in the same order on the same set of processes (for example, all processes), then the value will be the same. However, even if a deterministic algorithm is used, the value can vary across processes. This can happen, for example, if different but overlapping groups of processes make a series of calls. As a result of these issues, getting the "same" error on multiple processes may not cause the same value of error code to be generated. (End of advice to users.)

The value of MPI_ERR_LASTCODE is a constant value and is not affected by new user-defined error codes and classes. Instead, a predefined attribute key MPI_LASTUSEDCODE is associated with MPI_COMM_WORLD. The attribute value corresponding to this key is the current maximum error class including the user-defined ones. This is a local value and may be different on different processes. The value returned by this key is always greater than or equal to MPI_ERR_LASTCODE.

Advice to users. The value returned by the key MPI_LASTUSEDCODE will not change unless the user calls a function to explicitly add an error class/code. In a multi-threaded environment, the user must take extra care in assuming this value has not changed. Note that error codes and error classes are not necessarily dense. A user may not assume that each error class below MPI_LASTUSEDCODE is valid. (End of advice to users.)

MPI_ADD_ERROR_CODE(errorclass, errorcode)

  IN    errorclass   error class (integer)
  OUT   errorcode    new error code to associate with errorclass (integer)

int MPI_Add_error_code(int errorclass, int *errorcode)

MPI_Add_error_code(errorclass, errorcode, ierror) BIND(C)
    INTEGER, INTENT(IN) :: errorclass
    INTEGER, INTENT(OUT) :: errorcode
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ADD_ERROR_CODE(ERRORCLASS, ERRORCODE, IERROR)
    INTEGER ERRORCLASS, ERRORCODE, IERROR

Creates a new error code associated with errorclass and returns its value in errorcode.

Rationale. To avoid conflicts with existing error codes and classes, the value of the new error code is set by the implementation and not by the user. (End of rationale.)

Advice to implementors. A high-quality implementation will return the value for a new errorcode in the same deterministic way on all processes. (End of advice to implementors.)

MPI_ADD_ERROR_STRING(errorcode, string)

  IN    errorcode    error code or class (integer)
  IN    string       text corresponding to errorcode (string)

int MPI_Add_error_string(int errorcode, const char *string)

MPI_Add_error_string(errorcode, string, ierror) BIND(C)
    INTEGER, INTENT(IN) :: errorcode
    CHARACTER(LEN=*), INTENT(IN) :: string
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ADD_ERROR_STRING(ERRORCODE, STRING, IERROR)
    INTEGER ERRORCODE, IERROR
    CHARACTER*(*) STRING

Associates an error string with an error code or class. The string must be no more than MPI_MAX_ERROR_STRING characters long. The length of the string is as defined in the calling language. The length of the string does not include the null terminator in C. Trailing blanks will be stripped in Fortran. Calling MPI_ADD_ERROR_STRING for an errorcode that already has a string will replace the old string with the new string. It is erroneous to call MPI_ADD_ERROR_STRING for an error code or class with a value ≤ MPI_ERR_LASTCODE.
If MPI_ERROR_STRING is called when no string has been set, it will return an empty string (all spaces in Fortran, "" in C).
Section 8.3 on page 342 describes the methods for creating and associating error handlers with communicators, files, and windows.
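A layered library might use these routines as sketched below (not one of the standard's numbered examples; the libio names are hypothetical). The sketch also uses MPI_COMM_CALL_ERRHANDLER, which is defined next.

#include <mpi.h>

/* Hypothetical library-wide error class and code, registered once. */
static int LIBIO_ERR_CLASS, LIBIO_ERR_NOFILE;

void libio_init(void)
{
    MPI_Add_error_class(&LIBIO_ERR_CLASS);
    MPI_Add_error_code(LIBIO_ERR_CLASS, &LIBIO_ERR_NOFILE);
    MPI_Add_error_string(LIBIO_ERR_CLASS,  "libio: generic error");
    MPI_Add_error_string(LIBIO_ERR_NOFILE, "libio: data file not found");
}

/* On failure the library raises its own error code through the error
   handler attached to the communicator it was given. */
void libio_open(MPI_Comm comm, const char *path)
{
    int ok = 0;   /* ... try to open path ... */
    if (!ok)
        MPI_Comm_call_errhandler(comm, LIBIO_ERR_NOFILE);
}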

MPI_COMM_CALL_ERRHANDLER (comm, errorcode)

  IN    comm         communicator with error handler (handle)
  IN    errorcode    error code (integer)

int MPI_Comm_call_errhandler(MPI_Comm comm, int errorcode)

MPI_Comm_call_errhandler(comm, errorcode, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: errorcode
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_CALL_ERRHANDLER(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR

This function invokes the error handler assigned to the communicator with the error code supplied. This function returns MPI_SUCCESS in C and the same value in IERROR if the error handler was successfully called (assuming the process is not aborted and the error handler returns).

Advice to users. Users should note that the default error handler is MPI_ERRORS_ARE_FATAL. Thus, calling MPI_COMM_CALL_ERRHANDLER will abort the comm processes if the default error handler has not been changed for this communicator or on the parent before the communicator was created. (End of advice to users.)

MPI_WIN_CALL_ERRHANDLER (win, errorcode)

  IN    win          window with error handler (handle)
  IN    errorcode    error code (integer)

int MPI_Win_call_errhandler(MPI_Win win, int errorcode)

MPI_Win_call_errhandler(win, errorcode, ierror) BIND(C)
    TYPE(MPI_Win), INTENT(IN) :: win
    INTEGER, INTENT(IN) :: errorcode
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_WIN_CALL_ERRHANDLER(WIN, ERRORCODE, IERROR)
    INTEGER WIN, ERRORCODE, IERROR

This function invokes the error handler assigned to the window with the error code supplied. This function returns MPI_SUCCESS in C and the same value in IERROR if the error handler was successfully called (assuming the process is not aborted and the error handler returns).

Advice to users. As with communicators, the default error handler for windows is MPI_ERRORS_ARE_FATAL. (End of advice to users.)

MPI_FILE_CALL_ERRHANDLER (fh, errorcode)

  IN    fh           file with error handler (handle)
  IN    errorcode    error code (integer)

int MPI_File_call_errhandler(MPI_File fh, int errorcode)

MPI_File_call_errhandler(fh, errorcode, ierror) BIND(C)
    TYPE(MPI_File), INTENT(IN) :: fh
    INTEGER, INTENT(IN) :: errorcode
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FILE_CALL_ERRHANDLER(FH, ERRORCODE, IERROR)
    INTEGER FH, ERRORCODE, IERROR

This function invokes the error handler assigned to the file with the error code supplied. This function returns MPI_SUCCESS in C and the same value in IERROR if the error handler was successfully called (assuming the process is not aborted and the error handler returns).

Advice to users. Unlike errors on communicators and windows, the default behavior for files is to have MPI_ERRORS_RETURN. (End of advice to users.)

Advice to users. Users are warned that handlers should not be called recursively with MPI_COMM_CALL_ERRHANDLER, MPI_FILE_CALL_ERRHANDLER, or MPI_WIN_CALL_ERRHANDLER. Doing this can create a situation where an infinite recursion is created. This can occur if MPI_COMM_CALL_ERRHANDLER, MPI_FILE_CALL_ERRHANDLER, or MPI_WIN_CALL_ERRHANDLER is called inside an error handler.
Error codes and classes are associated with a process. As a result, they may be used in any error handler. Error handlers should be prepared to deal with any error code they are given. Furthermore, it is good practice to only call an error handler with the appropriate error codes. For example, file errors would normally be sent to the file error handler. (End of advice to users.)

8.6 Timers and Synchronization

MPI defines a timer. A timer is specified even though it is not "message-passing," because timing parallel programs is important in "performance debugging" and because existing timers (both in POSIX 1003.1-1988 and 1003.4D 14.1 and in Fortran 90) are either inconvenient or do not provide adequate access to high-resolution timers. See also Section 2.6.4 on page 19.

MPI_WTIME()

double MPI_Wtime(void)

DOUBLE PRECISION MPI_Wtime() BIND(C)

DOUBLE PRECISION MPI_WTIME()

MPI_WTIME returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past.
The "time in the past" is guaranteed not to change during the life of the process. The user is responsible for converting large numbers of seconds to other units if they are preferred.
This function is portable (it returns seconds, not "ticks"), it allows high resolution, and carries no unnecessary baggage. One would use it like this:

{
   double starttime, endtime;
   starttime = MPI_Wtime();
    ....  stuff to be timed  ...
   endtime   = MPI_Wtime();
   printf("That took %f seconds\n",endtime-starttime);
}

The times returned are local to the node that called them. There is no requirement that different nodes return "the same time." (But see also the discussion of MPI_WTIME_IS_GLOBAL in Section 8.1.2).

MPI_WTICK()

double MPI_Wtick(void)

DOUBLE PRECISION MPI_Wtick() BIND(C)

DOUBLE PRECISION MPI_WTICK()

MPI_WTICK returns the resolution of MPI_WTIME in seconds. That is, it returns, as a double precision value, the number of seconds between successive clock ticks. For example, if the clock is implemented by the hardware as a counter that is incremented every millisecond, the value returned by MPI_WTICK should be 10^-3.
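An illustrative use of MPI_WTICK together with MPI_WTIME (not one of the standard's numbered examples); the 10-tick threshold is an arbitrary choice for this sketch.

#include <stdio.h>
#include <mpi.h>

/* Report how many clock ticks a measured interval spans; if it is only
   a few ticks, the measurement is dominated by timer granularity. */
void report_interval(double start, double end)
{
    double elapsed = end - start;
    double tick = MPI_Wtick();

    printf("elapsed %g s, timer resolution %g s (%.0f ticks)\n",
           elapsed, tick, elapsed / tick);
    if (elapsed < 10.0 * tick)
        printf("warning: interval is close to the clock resolution\n");
}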

8.7 Startup

One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.

MPI_INIT()

int MPI_Init(int *argc, char ***argv)

MPI_Init(ierror) BIND(C)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INIT(IERROR)
    INTEGER IERROR

All MPI programs must contain exactly one call to an MPI initialization routine: MPI_INIT or MPI_INIT_THREAD. Subsequent calls to any MPI initialization routines are erroneous. The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_ (within the constraints for functions with this prefix listed in Section 14.3.4). The version for ISO C accepts the argc and argv that are provided by the arguments to main or NULL:

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* parse arguments */
    /* main program    */

    MPI_Finalize();     /* see below */
    return 0;
}

The Fortran version takes only IERROR.
Conforming implementations of MPI are required to allow applications to pass NULL for both the argc and argv arguments of main in C.
After MPI is initialized, the application can access information about the execution environment by querying the predefined info object MPI_INFO_ENV. The following keys are predefined for this object, corresponding to the arguments of MPI_COMM_SPAWN or of mpiexec:

command  Name of program executed.

argv  Space separated arguments to command.

maxprocs  Maximum number of MPI processes to start.

soft  Allowed values for number of processors.

host  Hostname.

arch  Architecture name.

wdir  Working directory of the MPI process.

file  Value is the name of a file in which additional information is specified.

thread_level  Requested level of thread support, if requested before the program started execution.

Note that all values are strings. Thus, the maximum number of processes is represented by a string such as "1024" and the requested level is represented by a string such as "MPI_THREAD_SINGLE".
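An illustrative query of MPI_INFO_ENV (not one of the standard's numbered examples); since the set of (key,value) pairs provided is implementation-dependent, the flag returned by MPI_INFO_GET must be checked before the value is used.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;

    MPI_Init(&argc, &argv);

    /* A key may be absent; check flag before using value. */
    MPI_Info_get(MPI_INFO_ENV, "command", MPI_MAX_INFO_VAL, value, &flag);
    if (flag)
        printf("started command: %s\n", value);

    MPI_Info_get(MPI_INFO_ENV, "maxprocs", MPI_MAX_INFO_VAL, value, &flag);
    if (flag)
        printf("maxprocs requested: %s (a string)\n", value);

    MPI_Finalize();
    return 0;
}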

The info object MPI_INFO_ENV need not contain a (key,value) pair for each of these predefined keys; the set of (key,value) pairs provided is implementation-dependent. Implementations may provide additional, implementation specific, (key,value) pairs.
In the case where the MPI processes were started with MPI_COMM_SPAWN_MULTIPLE or, equivalently, with a startup mechanism that supports multiple process specifications, then the values stored in the info object MPI_INFO_ENV at a process are those values that affect the local MPI process.

Example 8.4 If MPI is started with a call to

mpiexec -n 5 -arch sun ocean : -n 10 -arch rs6000 atmos

Then the first 5 processes will have in their MPI_INFO_ENV object the pairs (command, ocean), (maxprocs, 5), and (arch, sun). The next 10 processes will have in MPI_INFO_ENV the pairs (command, atmos), (maxprocs, 10), and (arch, rs6000).

Advice to users. The values passed in MPI_INFO_ENV are the values of the arguments passed to the mechanism that started the MPI execution, not the actual value provided. Thus, the value associated with maxprocs is the number of MPI processes requested; it can be larger than the actual number of processes obtained, if the soft option was used. (End of advice to users.)

Advice to implementors. High-quality implementations will provide a (key,value) pair for each parameter that can be passed to the command that starts an MPI program. (End of advice to implementors.)

MPI_FINALIZE()

int MPI_Finalize(void)

MPI_Finalize(ierror) BIND(C)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FINALIZE(IERROR)
    INTEGER IERROR

This routine cleans up all MPI state. If an MPI program terminates normally (i.e., not due to a call to MPI_ABORT or an unrecoverable error) then each process must call MPI_FINALIZE before it exits.
Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications: It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes. For example, if the process executed a nonblocking send, it must eventually call MPI_WAIT, MPI_TEST, MPI_REQUEST_FREE, or any derived function; if the process is the target of a send, then it must post the matching receive; if it is part of a group executing a collective operation, then it must have completed its participation in the operation.

The call to MPI_FINALIZE does not free objects created by MPI calls; these objects are freed using MPI_xxx_FREE calls.
MPI_FINALIZE is collective over all connected processes. If no processes were spawned, accepted, or connected, then this means it is collective over MPI_COMM_WORLD; otherwise it is collective over the union of all processes that have been and continue to be connected, as explained in Section 10.5.4 on page 399.
The following examples illustrate these rules.

Example 8.5  The following code is correct:

    Process 0                Process 1
    ---------                ---------
    MPI_Init();              MPI_Init();
    MPI_Send(dest=1);        MPI_Recv(src=0);
    MPI_Finalize();          MPI_Finalize();

Example 8.6  Without a matching receive, the program is erroneous:

    Process 0                Process 1
    -----------              -----------
    MPI_Init();              MPI_Init();
    MPI_Send(dest=1);
    MPI_Finalize();          MPI_Finalize();

Example 8.7  This program is correct: Process 0 calls MPI_Finalize after it has executed the MPI calls that complete the send operation. Likewise, process 1 executes the MPI call that completes the matching receive operation before it calls MPI_Finalize.

    Process 0                Process 1
    --------                 --------
    MPI_Init();              MPI_Init();
    MPI_Isend(dest=1);       MPI_Recv(src=0);
    MPI_Request_free();      MPI_Finalize();
    MPI_Finalize();          exit();
    exit();

Example 8.8  This program is correct. The attached buffer is a resource allocated by the user, not by MPI; it is available to the user after MPI is finalized.

    Process 0                    Process 1
    ---------                    ---------
    MPI_Init();                  MPI_Init();
    buffer = malloc(1000000);    MPI_Recv(src=0);
    MPI_Buffer_attach();         MPI_Finalize();
    MPI_Send(dest=1);            exit();
    MPI_Finalize();
    free(buffer);
    exit();

Example 8.9  This program is correct. The cancel operation must succeed, since the send cannot complete normally. The wait operation, after the call to MPI_Cancel, is local; no matching MPI call is required on process 1.

    Process 0                Process 1
    ---------                ---------
    MPI_Issend(dest=1);      MPI_Finalize();
    MPI_Cancel();
    MPI_Wait();
    MPI_Finalize();

Advice to implementors. Even though a process has executed all MPI calls needed to complete the communications it is involved with, such communication may not yet be completed from the viewpoint of the underlying MPI system. For example, a blocking send may have returned, even though the data is still buffered at the sender in an MPI buffer; an MPI process may receive a cancel request for a message it has completed receiving. The MPI implementation must ensure that a process has completed any involvement in MPI communication before MPI_FINALIZE returns. Thus, if a process exits after the call to MPI_FINALIZE, this will not cause an ongoing communication to fail. The MPI implementation should also complete freeing all objects marked for deletion by MPI calls that freed them. (End of advice to implementors.)

Once MPI_FINALIZE returns, no MPI routine (not even MPI_INIT) may be called, except for MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_ (within the constraints for functions with this prefix listed in Section 14.3.4).
Although it is not required that all processes return from MPI_FINALIZE, it is required that at least process 0 in MPI_COMM_WORLD return, so that users can know that the MPI portion of the computation is over. In addition, in a POSIX environment, users may desire to supply an exit code for each process that returns from MPI_FINALIZE.

Example 8.10  The following illustrates the use of requiring that at least one process return and that it be known that process 0 is one of the processes that return. One wants code like the following to work no matter how many processes return.

    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    ...
    MPI_Finalize();
    if (myrank == 0) {
        resultfile = fopen("outfile","w");
        dump_results(resultfile);
        fclose(resultfile);
    }
    exit(0);
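The fragment in Example 8.10 can be embedded in a complete program along the following lines. This is an illustrative sketch only, not standard text; the reduction operation and the file name outfile are arbitrary choices, and only process 0 touches the result file after MPI_Finalize returns.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int myrank, nprocs, sum;
        FILE *resultfile;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* all communication is completed before MPI_Finalize is called */
        MPI_Reduce(&myrank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();

        /* only process 0 is guaranteed to return from MPI_Finalize */
        if (myrank == 0) {
            resultfile = fopen("outfile", "w");
            fprintf(resultfile, "sum of ranks = %d over %d processes\n", sum, nprocs);
            fclose(resultfile);
        }
        return 0;
    }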

MPI_INITIALIZED(flag)
  OUT   flag    Flag is true if MPI_INIT has been called and false otherwise.

int MPI_Initialized(int *flag)

MPI_Initialized(flag, ierror) BIND(C)
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INITIALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR

This routine may be used to determine whether MPI_INIT has been called. MPI_INITIALIZED returns true if the calling process has called MPI_INIT. Whether MPI_FINALIZE has been called does not affect the behavior of MPI_INITIALIZED. It is one of the few routines that may be called before MPI_INIT is called.

MPI_ABORT(comm, errorcode)
  IN    comm         communicator of tasks to abort
  IN    errorcode    error code to return to invoking environment

int MPI_Abort(MPI_Comm comm, int errorcode)

MPI_Abort(comm, errorcode, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, INTENT(IN) :: errorcode
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_ABORT(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR

This routine makes a "best attempt" to abort all tasks in the group of comm. This function does not require that the invoking environment take any action with the error code. However, a Unix or POSIX environment should handle this as a return errorcode from the main program.
It may not be possible for an MPI implementation to abort only the processes represented by comm if this is a subset of the processes. In this case, the MPI implementation should attempt to abort all the connected processes but should not abort any unconnected processes. If no processes were spawned, accepted, or connected then this has the effect of aborting all the processes associated with MPI_COMM_WORLD.

Rationale. The communicator argument is provided to allow for future extensions of MPI to environments with, for example, dynamic process management. In particular, it allows but does not require an MPI implementation to abort a subset of MPI_COMM_WORLD. (End of rationale.)
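As an illustration only (not part of the standard text), a typical use of MPI_ABORT is to terminate the whole job when a required resource is unavailable; the file name below is hypothetical, and whether the error code 1 reaches the invoking environment is a quality-of-implementation matter.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        FILE *cfg;

        MPI_Init(&argc, &argv);

        cfg = fopen("job.cfg", "r");          /* hypothetical input file */
        if (cfg == NULL) {
            fprintf(stderr, "cannot open job.cfg, aborting\n");
            MPI_Abort(MPI_COMM_WORLD, 1);     /* best attempt to abort all tasks */
        }

        /* ... normal computation ... */

        fclose(cfg);
        MPI_Finalize();
        return 0;
    }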

Advice to users. Whether the errorcode is returned from the executable or from the MPI process startup mechanism (e.g., mpiexec) is an aspect of quality of the MPI library but not mandatory. (End of advice to users.)

Advice to implementors. Where possible, a high-quality implementation will try to return the errorcode from the MPI process startup mechanism (e.g., mpiexec or singleton init). (End of advice to implementors.)

8.7.1 Allowing User Functions at Process Termination

There are times in which it would be convenient to have actions happen when an MPI process finishes. For example, a routine may do initializations that are useful until the MPI job (or that part of the job that is being terminated, in the case of dynamically created processes) is finished. This can be accomplished in MPI by attaching an attribute to MPI_COMM_SELF with a callback function. When MPI_FINALIZE is called, it will first execute the equivalent of an MPI_COMM_FREE on MPI_COMM_SELF. This will cause the delete callback function to be executed on all keys associated with MPI_COMM_SELF, in the reverse order that they were set on MPI_COMM_SELF. If no key has been attached to MPI_COMM_SELF, then no callback is invoked. The "freeing" of MPI_COMM_SELF occurs before any other parts of MPI are affected. Thus, for example, calling MPI_FINALIZED will return false in any of these callback functions. Once done with MPI_COMM_SELF, the order and rest of the actions taken by MPI_FINALIZE are not specified.

Advice to implementors. Since attributes can be added from any supported language, the MPI implementation needs to remember the creating language so the correct callback is made. Implementations that use the attribute delete callback on MPI_COMM_SELF internally should register their internal callbacks before returning from MPI_INIT / MPI_INIT_THREAD, so that libraries or applications will not have portions of the MPI implementation shut down before the application-level callbacks are made. (End of advice to implementors.)
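The following sketch, which is illustrative and not part of the standard text, shows one way a library could arrange for cleanup to run during MPI_FINALIZE by caching an attribute on MPI_COMM_SELF; the function names and the attribute value are hypothetical.

    #include <stdio.h>
    #include "mpi.h"

    /* hypothetical delete callback, invoked while MPI_FINALIZE frees MPI_COMM_SELF */
    static int cleanup_cb(MPI_Comm comm, int keyval, void *attr_val, void *extra_state)
    {
        printf("library cleanup: releasing %s\n", (char *)attr_val);
        return MPI_SUCCESS;
    }

    void library_init(void)
    {
        int keyval;

        /* only the delete callback matters here; no copy callback is needed */
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, cleanup_cb, &keyval, NULL);
        MPI_Comm_set_attr(MPI_COMM_SELF, keyval, (void *)"library resources");
    }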

8.7.2 Determining Whether MPI Has Finished

One of the goals of MPI was to allow for layered libraries. In order for a library to do this cleanly, it needs to know if MPI is active. In MPI the function MPI_INITIALIZED was provided to tell if MPI had been initialized. The problem arises in knowing if MPI has been finalized. Once MPI has been finalized it is no longer active and cannot be restarted. A library needs to be able to determine this to act accordingly. To achieve this the following function is needed:

MPI_FINALIZED(flag)
  OUT   flag    true if MPI was finalized (logical)

int MPI_Finalized(int *flag)

MPI_Finalized(flag, ierror) BIND(C)
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_FINALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR

This routine returns true if MPI_FINALIZE has completed. It is valid to call MPI_FINALIZED before MPI_INIT and after MPI_FINALIZE.

Advice to users. MPI is "active" and it is thus safe to call MPI functions if MPI_INIT has completed and MPI_FINALIZE has not completed. If a library has no other way of knowing whether MPI is active or not, then it can use MPI_INITIALIZED and MPI_FINALIZED to determine this. For example, MPI is "active" in callback functions that are invoked during MPI_FINALIZE. (End of advice to users.)
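As a non-normative illustration, a layered library might combine MPI_INITIALIZED and MPI_FINALIZED to decide whether MPI calls are currently permitted; the function name is hypothetical.

    #include "mpi.h"

    /* returns 1 while MPI is "active": MPI_INIT has completed and
       MPI_FINALIZE has not completed; returns 0 otherwise */
    int library_mpi_is_active(void)
    {
        int initialized, finalized;

        MPI_Initialized(&initialized);
        MPI_Finalized(&finalized);
        return initialized && !finalized;
    }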

8.8 Portable MPI Process Startup

A number of implementations of MPI provide a startup command for MPI programs that is of the form

    mpirun <mpirun arguments> <program> <program arguments>

Separating the command to start the program from the program itself provides flexibility, particularly for network and heterogeneous implementations. For example, the startup script need not run on one of the machines that will be executing the MPI program itself.
Having a standard startup mechanism also extends the portability of MPI programs one step further, to the command lines and scripts that manage them. For example, a validation suite script that runs hundreds of programs can be a portable script if it is written using such a standard startup mechanism. In order that the "standard" command not be confused with existing practice, which is not standard and not portable among implementations, MPI specifies mpiexec instead of mpirun.
While a standardized startup mechanism improves the usability of MPI, the range of environments is so diverse (e.g., there may not even be a command line interface) that MPI cannot mandate such a mechanism. Instead, MPI specifies an mpiexec startup command and recommends but does not require it, as advice to implementors. However, if an implementation does provide a command called mpiexec, it must be of the form described below.
It is suggested that

    mpiexec -n <numprocs> <program>

be at least one way to start <program> with an initial MPI_COMM_WORLD whose group contains <numprocs> processes. Other arguments to mpiexec may be implementation-dependent.

Advice to implementors. Implementors, if they do provide a special startup command for MPI programs, are advised to give it the following form. The syntax is chosen in order that mpiexec be able to be viewed as a command-line version of MPI_COMM_SPAWN (see Section 10.3.4).

Analogous to MPI_COMM_SPAWN, we have

    mpiexec -n <maxprocs>
            -soft < >
            -host < >
            -arch < >
            -wdir < >
            -path < >
            -file < >
            ...
            <command line>

for the case where a single command line for the application program and its arguments will suffice. See Section 10.3.4 for the meanings of these arguments. For the case corresponding to MPI_COMM_SPAWN_MULTIPLE there are two possible formats:

Form A:

    mpiexec { <above arguments> } : { ... } : { ... } : ... : { ... }

As with MPI_COMM_SPAWN, all the arguments are optional. (Even the -n x argument is optional; the default is implementation dependent. It might be 1, it might be taken from an environment variable, or it might be specified at compile time.) The names and meanings of the arguments are taken from the keys in the info argument to MPI_COMM_SPAWN. There may be other, implementation-dependent arguments as well.
Note that Form A, though convenient to type, prevents colons from being program arguments. Therefore an alternate, file-based form is allowed:

Form B:

    mpiexec -configfile <filename>

where the lines of <filename> are of the form separated by the colons in Form A. Lines beginning with '#' are comments, and lines may be continued by terminating the partial line with '\'.

Example 8.11  Start 16 instances of myprog on the current or default machine:

    mpiexec -n 16 myprog

Example 8.12  Start 10 processes on the machine called ferrari:

    mpiexec -n 10 -host ferrari myprog

Example 8.13  Start three copies of the same program with different command-line arguments:

    mpiexec myprog infile1 : myprog infile2 : myprog infile3

Example 8.14  Start the ocean program on five Suns and the atmos program on 10 RS/6000's:

    mpiexec -n 5 -arch sun ocean : -n 10 -arch rs6000 atmos

It is assumed that the implementation in this case has a method for choosing hosts of the appropriate type. Their ranks are in the order specified.

Example 8.15  Start the ocean program on five Suns and the atmos program on 10 RS/6000's (Form B):

    mpiexec -configfile myfile

where myfile contains

    -n 5 -arch sun ocean
    -n 10 -arch rs6000 atmos

(End of advice to implementors.)

Chapter 9

The Info Object

Many of the routines in MPI take an info argument. info is an opaque object with a handle of type MPI_Info in C and in Fortran with the mpi_f08 module, and INTEGER in Fortran with the mpi module or the include file mpif.h. It stores an unordered set of (key, value) pairs (both key and value are strings). A key can have only one value. MPI reserves several keys and requires that if an implementation uses a reserved key, it must provide the specified functionality. An implementation is not required to support these keys and may support any others not reserved by MPI.
An implementation must support info objects as caches for arbitrary (key, value) pairs, regardless of whether it recognizes the key. Each function that takes hints in the form of an MPI_Info must be prepared to ignore any key it does not recognize. This description of info objects does not attempt to define how a particular function should react if it recognizes a key but not the associated value. MPI_INFO_GET_NKEYS, MPI_INFO_GET_NTHKEY, MPI_INFO_GET_VALUELEN, and MPI_INFO_GET must retain all (key, value) pairs so that layered functionality can also use the Info object.
Keys have an implementation-defined maximum length of MPI_MAX_INFO_KEY, which is at least 32 and at most 255. Values have an implementation-defined maximum length of MPI_MAX_INFO_VAL. In Fortran, leading and trailing spaces are stripped from both. Returned values will never be larger than these maximum lengths. Both key and value are case sensitive.

Rationale. Keys have a maximum length because the set of known keys will always be finite and known to the implementation and because there is no reason for keys to be complex. The small maximum size allows applications to declare keys of size MPI_MAX_INFO_KEY. The limitation on value sizes is so that an implementation is not forced to deal with arbitrarily long strings. (End of rationale.)

Advice to users. MPI_MAX_INFO_VAL might be very large, so it might not be wise to declare a string of that size. (End of advice to users.)

When info is used as an argument to a nonblocking routine, it is parsed before that routine returns, so that it may be modified or freed immediately after return.
When the descriptions refer to a key or value as being a boolean, an integer, or a list, they mean the string representation of these types. An implementation may define its own rules for how info value strings are converted to other types, but to ensure portability, every implementation must support the following representations.

Valid values for a boolean must include the strings "true" and "false" (all lowercase). For integers, valid values must include string representations of decimal values of integers that are within the range of a standard integer type in the program. (However it is possible that not every integer is a valid value for a given key.) On positive numbers, + signs are optional. No space may appear between a + or - sign and the leading digit of a number. For comma separated lists, the string must contain valid elements separated by commas. Leading and trailing spaces are stripped automatically from the types of info values described above and for each element of a comma separated list. These rules apply to all info values of these types. Implementations are free to specify a different interpretation for values of other info keys.

MPI_INFO_CREATE(info)
  OUT   info    info object created (handle)

int MPI_Info_create(MPI_Info *info)

MPI_Info_create(info, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(OUT) :: info
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_CREATE(INFO, IERROR)
    INTEGER INFO, IERROR

MPI_INFO_CREATE creates a new info object. The newly created object contains no key/value pairs.

MPI_INFO_SET(info, key, value)
  INOUT  info     info object (handle)
  IN     key      key (string)
  IN     value    value (string)

int MPI_Info_set(MPI_Info info, const char *key, const char *value)

MPI_Info_set(info, key, value, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    CHARACTER(LEN=*), INTENT(IN) :: key, value
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_SET(INFO, KEY, VALUE, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) KEY, VALUE

MPI_INFO_SET adds the (key, value) pair to info, and overrides the value if a value for the same key was previously set. key and value are null-terminated strings in C. In Fortran, leading and trailing spaces in key and value are stripped. If either key or value is larger than the allowed maximum, the error MPI_ERR_INFO_KEY or MPI_ERR_INFO_VALUE is raised, respectively.
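A brief illustrative fragment (not part of the standard text): creating an info object and caching two hints. The key names are among the reserved spawn keys of Section 10.3.4; the values shown are hypothetical.

    #include "mpi.h"

    void set_spawn_hints(MPI_Info *info)
    {
        MPI_Info_create(info);
        /* setting the same key again would override the earlier value */
        MPI_Info_set(*info, "host", "node17");    /* hypothetical hostname */
        MPI_Info_set(*info, "wdir", "/tmp/run");  /* hypothetical working directory */
    }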

MPI_INFO_DELETE(info, key)
  INOUT  info    info object (handle)
  IN     key     key (string)

int MPI_Info_delete(MPI_Info info, const char *key)

MPI_Info_delete(info, key, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    CHARACTER(LEN=*), INTENT(IN) :: key
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_DELETE(INFO, KEY, IERROR)
    INTEGER INFO, IERROR
    CHARACTER*(*) KEY

MPI_INFO_DELETE deletes a (key, value) pair from info. If key is not defined in info, the call raises an error of class MPI_ERR_INFO_NOKEY.

MPI_INFO_GET(info, key, valuelen, value, flag)
  IN     info        info object (handle)
  IN     key         key (string)
  IN     valuelen    length of value arg (integer)
  OUT    value       value (string)
  OUT    flag        true if key defined, false if not (boolean)

int MPI_Info_get(MPI_Info info, const char *key, int valuelen, char *value,
              int *flag)

MPI_Info_get(info, key, valuelen, value, flag, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    CHARACTER(LEN=*), INTENT(IN) :: key
    INTEGER, INTENT(IN) :: valuelen
    CHARACTER(LEN=valuelen), INTENT(OUT) :: value
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_GET(INFO, KEY, VALUELEN, VALUE, FLAG, IERROR)
    INTEGER INFO, VALUELEN, IERROR
    CHARACTER*(*) KEY, VALUE
    LOGICAL FLAG

This function retrieves the value associated with key in a previous call to MPI_INFO_SET. If such a key exists, it sets flag to true and returns the value in value, otherwise it sets flag to false and leaves value unchanged. valuelen is the number of characters available in value. If it is less than the actual size of the value, the value is truncated. In C, valuelen should be one less than the amount of allocated space to allow for the null terminator.
If key is larger than MPI_MAX_INFO_KEY, the call is erroneous.
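An illustrative fragment (not standard text) showing the valuelen convention in C: the caller passes one character less than the buffer size so that the null terminator always fits, and a longer stored value is silently truncated.

    #include <stdio.h>
    #include "mpi.h"

    /* look up one key with a fixed-size buffer (sketch only) */
    void print_hint(MPI_Info info, const char *key)
    {
        char value[64];
        int  flag;

        MPI_Info_get(info, key, (int)sizeof(value) - 1, value, &flag);
        if (flag)
            printf("%s = %s\n", key, value);
        else
            printf("%s is not defined\n", key);
    }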

MPI_INFO_GET_VALUELEN(info, key, valuelen, flag)
  IN     info        info object (handle)
  IN     key         key (string)
  OUT    valuelen    length of value arg (integer)
  OUT    flag        true if key defined, false if not (boolean)

int MPI_Info_get_valuelen(MPI_Info info, const char *key, int *valuelen,
              int *flag)

MPI_Info_get_valuelen(info, key, valuelen, flag, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    CHARACTER(LEN=*), INTENT(IN) :: key
    INTEGER, INTENT(OUT) :: valuelen
    LOGICAL, INTENT(OUT) :: flag
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_GET_VALUELEN(INFO, KEY, VALUELEN, FLAG, IERROR)
    INTEGER INFO, VALUELEN, IERROR
    LOGICAL FLAG
    CHARACTER*(*) KEY

Retrieves the length of the value associated with key. If key is defined, valuelen is set to the length of its associated value and flag is set to true. If key is not defined, valuelen is not touched and flag is set to false. The length returned in C does not include the end-of-string character.
If key is larger than MPI_MAX_INFO_KEY, the call is erroneous.

MPI_INFO_GET_NKEYS(info, nkeys)
  IN     info     info object (handle)
  OUT    nkeys    number of defined keys (integer)

int MPI_Info_get_nkeys(MPI_Info info, int *nkeys)

MPI_Info_get_nkeys(info, nkeys, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    INTEGER, INTENT(OUT) :: nkeys
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_GET_NKEYS(INFO, NKEYS, IERROR)
    INTEGER INFO, NKEYS, IERROR

MPI_INFO_GET_NKEYS returns the number of currently defined keys in info.

MPI_INFO_GET_NTHKEY(info, n, key)
  IN     info    info object (handle)
  IN     n       key number (integer)
  OUT    key     key (string)

int MPI_Info_get_nthkey(MPI_Info info, int n, char *key)

MPI_Info_get_nthkey(info, n, key, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    INTEGER, INTENT(IN) :: n
    CHARACTER(LEN=*), INTENT(OUT) :: key
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_GET_NTHKEY(INFO, N, KEY, IERROR)
    INTEGER INFO, N, IERROR
    CHARACTER*(*) KEY

This function returns the n-th defined key in info. Keys are numbered 0 ... N - 1 where N is the value returned by MPI_INFO_GET_NKEYS. All keys between 0 and N - 1 are guaranteed to be defined. The number of a given key does not change as long as info is not modified with MPI_INFO_SET or MPI_INFO_DELETE.

MPI_INFO_DUP(info, newinfo)
  IN     info       info object (handle)
  OUT    newinfo    info object (handle)

int MPI_Info_dup(MPI_Info info, MPI_Info *newinfo)

MPI_Info_dup(info, newinfo, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Info), INTENT(OUT) :: newinfo
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_DUP(INFO, NEWINFO, IERROR)
    INTEGER INFO, NEWINFO, IERROR

MPI_INFO_DUP duplicates an existing info object, creating a new object, with the same (key, value) pairs and the same ordering of keys.
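The query routines above can be combined to enumerate an info object. The following sketch is illustrative only and not part of the standard; it sizes each value buffer exactly with MPI_INFO_GET_VALUELEN before retrieving it.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    /* print every (key, value) pair stored in info (sketch only) */
    void dump_info(MPI_Info info)
    {
        int  nkeys, i, valuelen, flag;
        char key[MPI_MAX_INFO_KEY + 1];   /* +1 leaves room for the terminator */
        char *value;

        MPI_Info_get_nkeys(info, &nkeys);
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get_valuelen(info, key, &valuelen, &flag);
            value = (char *)malloc(valuelen + 1);
            MPI_Info_get(info, key, valuelen, value, &flag);
            printf("%s = %s\n", key, value);
            free(value);
        }
    }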

MPI_INFO_FREE(info)
  INOUT  info    info object (handle)

int MPI_Info_free(MPI_Info *info)

MPI_Info_free(info, ierror) BIND(C)
    TYPE(MPI_Info), INTENT(INOUT) :: info
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_INFO_FREE(INFO, IERROR)
    INTEGER INFO, IERROR

This function frees info and sets it to MPI_INFO_NULL. The value of an info argument is interpreted each time the info is passed to a routine. Changes to an info after return from a routine do not affect that interpretation.

Chapter 10

Process Creation and Management

10.1 Introduction

MPI is primarily concerned with communication rather than process or resource management. However, it is necessary to address these issues to some degree in order to define a useful framework for communication. This chapter presents a set of MPI interfaces that allows for a variety of approaches to process management while placing minimal restrictions on the execution environment.
The MPI model for process creation allows both the creation of an initial set of processes related by their membership in a common MPI_COMM_WORLD and the creation and management of processes after an MPI application has been started. A major impetus for the latter form of process creation comes from the PVM [24] research effort. This work has provided a wealth of experience with process management and resource control that illustrates their benefits and potential pitfalls.
The MPI Forum decided not to address resource control because it was not able to design a portable interface that would be appropriate for the broad spectrum of existing and potential resource and process controllers. Resource control can encompass a wide range of abilities, including adding and deleting nodes from a virtual parallel machine, reserving and scheduling resources, managing compute partitions of an MPP, and returning information about available resources. MPI assumes that resource control is provided externally, probably by computer vendors in the case of tightly coupled systems, or by a third party software package when the environment is a cluster of workstations.
The reasons for including process management in MPI are both technical and practical. Important classes of message-passing applications require process control. These include task farms, serial applications with parallel modules, and problems that require a run-time assessment of the number and type of processes that should be started. On the practical side, users of workstation clusters who are migrating from PVM to MPI may be accustomed to using PVM's capabilities for process and resource management. The lack of these features would be a practical stumbling block to migration.
The following goals are central to the design of MPI process management:

• The MPI process model must apply to the vast majority of current parallel environments. These include everything from tightly integrated MPPs to heterogeneous networks of workstations.

• MPI must not take over operating system responsibilities. It should instead provide a clean interface between an application and system software.

• MPI must guarantee communication determinism in the presence of dynamic processes, i.e., dynamic process management must not introduce unavoidable race conditions.

• MPI must not contain features that compromise performance.

The MPI process management model addresses these issues in two ways. First, MPI remains primarily a communication library. It does not manage the parallel environment in which a parallel program executes, though it provides a minimal interface between an application and external resource and process managers.
Second, MPI maintains a consistent concept of a communicator, regardless of how its members came into existence. A communicator is never changed once created, and it is always created using deterministic collective operations.

10.2 The Dynamic Process Model

The dynamic process model allows for the creation and cooperative termination of processes after an MPI application has started. It provides a mechanism to establish communication between the newly created processes and the existing MPI application. It also provides a mechanism to establish communication between two existing MPI applications, even when one did not "start" the other.

10.2.1 Starting Processes

MPI applications may start new processes through an interface to an external process manager.
MPI_COMM_SPAWN starts MPI processes and establishes communication with them, returning an intercommunicator. MPI_COMM_SPAWN_MULTIPLE starts several different binaries (or the same binary with different arguments), placing them in the same MPI_COMM_WORLD and returning an intercommunicator.
MPI uses the group abstraction to represent processes. A process is identified by a (group, rank) pair.

10.2.2 The Runtime Environment

The MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE routines provide an interface between MPI and the runtime environment of an MPI application. The difficulty is that there is an enormous range of runtime environments and application requirements, and MPI must not be tailored to any particular one. Examples of such environments are:

• MPP managed by a batch queueing system. Batch queueing systems generally allocate resources before an application begins, enforce limits on resource use (CPU time, memory use, etc.), and do not allow a change in resource allocation after a job begins. Moreover, many MPPs have special limitations or extensions, such as a limit on the number of processes that may run on one processor, or the ability to gang-schedule processes of a parallel application.

• Network of workstations with PVM. PVM (Parallel Virtual Machine) allows a user to create a "virtual machine" out of a network of workstations. An application may extend the virtual machine or manage processes (create, kill, redirect output, etc.) through the PVM library. Requests to manage the machine or processes may be intercepted and handled by an external resource manager.

• Network of workstations managed by a load balancing system. A load balancing system may choose the location of spawned processes based on dynamic quantities, such as load average. It may transparently migrate processes from one machine to another when a resource becomes unavailable.

• Large SMP with Unix. Applications are run directly by the user. They are scheduled at a low level by the operating system. Processes may have special scheduling characteristics (gang-scheduling, processor affinity, deadline scheduling, processor locking, etc.) and be subject to OS resource limits (number of processes, amount of memory, etc.).

MPI assumes, implicitly, the existence of an environment in which an application runs. It does not provide "operating system" services, such as a general ability to query what processes are running, to kill arbitrary processes, to find out properties of the runtime environment (how many processors, how much memory, etc.).
Complex interaction of an MPI application with its runtime environment should be done through an environment-specific API. An example of such an API would be the PVM task and machine management routines (pvm_addhosts, pvm_config, pvm_tasks, etc.), possibly modified to return an MPI (group, rank) when possible. A Condor or PBS API would be another possibility.
At some low level, obviously, MPI must be able to interact with the runtime system, but the interaction is not visible at the application level and the details of the interaction are not specified by the MPI standard.
In many cases, it is impossible to keep environment-specific information out of the MPI interface without seriously compromising MPI functionality. To permit applications to take advantage of environment-specific functionality, many MPI routines take an info argument that allows an application to specify environment-specific information. There is a tradeoff between functionality and portability: applications that make use of info are not portable.
MPI does not require the existence of an underlying "virtual machine" model, in which there is a consistent global view of an MPI application and an implicit "operating system" managing resources and processes. For instance, processes spawned by one task may not be visible to another; additional hosts added to the runtime environment by one process may not be visible in another process; tasks spawned by different processes may not be automatically distributed over available resources.
Interaction between MPI and the runtime environment is limited to the following areas:

• A process may start new processes with MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE.

• When a process spawns a child process, it may optionally use an info argument to tell the runtime environment where or how to start the process. This extra information may be opaque to MPI.

• An attribute MPI_UNIVERSE_SIZE (see Section 10.5.1 on page 397) on MPI_COMM_WORLD tells a program how "large" the initial runtime environment is, namely how many processes can usefully be started in all. One can subtract the size of MPI_COMM_WORLD from this value to find out how many processes might usefully be started in addition to those already running.

10.3 Process Manager Interface

10.3.1 Processes in MPI

A process is represented in MPI by a (group, rank) pair. A (group, rank) pair specifies a unique process but a process does not determine a unique (group, rank) pair, since a process may belong to several groups.

10.3.2 Starting Processes and Establishing Communication

The following routine starts a number of MPI processes and establishes communication with them, returning an intercommunicator.

Advice to users. It is possible in MPI to start a static SPMD or MPMD application by first starting one process and having that process start its siblings with MPI_COMM_SPAWN. This practice is discouraged primarily for reasons of performance. If possible, it is preferable to start all processes at once, as a single MPI application. (End of advice to users.)

MPI_COMM_SPAWN(command, argv, maxprocs, info, root, comm, intercomm, array_of_errcodes)
  IN    command             name of program to be spawned (string, significant only at root)
  IN    argv                arguments to command (array of strings, significant only at root)
  IN    maxprocs            maximum number of processes to start (integer, significant only at root)
  IN    info                a set of key-value pairs telling the runtime system where and how to start the processes (handle, significant only at root)
  IN    root                rank of process in which previous arguments are examined (integer)
  IN    comm                intracommunicator containing group of spawning processes (handle)
  OUT   intercomm           intercommunicator between original group and the newly spawned group (handle)
  OUT   array_of_errcodes   one code per process (array of integer)

int MPI_Comm_spawn(const char *command, char *argv[], int maxprocs,
              MPI_Info info, int root, MPI_Comm comm, MPI_Comm *intercomm,
              int array_of_errcodes[])

MPI_Comm_spawn(command, argv, maxprocs, info, root, comm, intercomm,
              array_of_errcodes, ierror) BIND(C)
    CHARACTER(LEN=*), INTENT(IN) :: command, argv(*)
    INTEGER, INTENT(IN) :: maxprocs, root
    TYPE(MPI_Info), INTENT(IN) :: info
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Comm), INTENT(OUT) :: intercomm
    INTEGER :: array_of_errcodes(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SPAWN(COMMAND, ARGV, MAXPROCS, INFO, ROOT, COMM, INTERCOMM,
              ARRAY_OF_ERRCODES, IERROR)
    CHARACTER*(*) COMMAND, ARGV(*)
    INTEGER INFO, MAXPROCS, ROOT, COMM, INTERCOMM, ARRAY_OF_ERRCODES(*),
    IERROR

MPI_COMM_SPAWN tries to start maxprocs identical copies of the MPI program specified by command, establishing communication with them and returning an intercommunicator. The spawned processes are referred to as children. The children have their own MPI_COMM_WORLD, which is separate from that of the parents. MPI_COMM_SPAWN is collective over comm, and also may not return until MPI_INIT has been called in the children. Similarly, MPI_INIT in the children may not return until all parents have called MPI_COMM_SPAWN. In this sense, MPI_COMM_SPAWN in the parents and MPI_INIT in the children form a collective operation over the union of parent and child processes. The intercommunicator returned by MPI_COMM_SPAWN contains the parent processes in the local group and the child processes in the remote group. The ordering of processes in the local and remote groups is the same as the ordering of the group of the comm in the parents and of MPI_COMM_WORLD of the children, respectively. This intercommunicator can be obtained in the children through the function MPI_COMM_GET_PARENT.

Advice to users. An implementation may automatically establish communication before MPI_INIT is called by the children. Thus, completion of MPI_COMM_SPAWN in the parent does not necessarily mean that MPI_INIT has been called in the children (although the returned intercommunicator can be used immediately). (End of advice to users.)

The command argument. The command argument is a string containing the name of a program to be spawned. The string is null-terminated in C. In Fortran, leading and trailing spaces are stripped. MPI does not specify how to find the executable or how the working directory is determined. These rules are implementation-dependent and should be appropriate for the runtime environment.

Advice to implementors. The implementation should use a natural rule for finding executables and determining working directories. For instance, a homogeneous system with a global file system might look first in the working directory of the spawning process, or might search the directories in a PATH environment variable as do Unix shells. An implementation on top of PVM would use PVM's rules for finding executables (usually in $HOME/pvm3/bin/$PVM_ARCH). An MPI implementation running under POE on an IBM SP would use POE's method of finding executables. An implementation should document its rules for finding executables and determining working directories, and a high-quality implementation should give the user some control over these rules. (End of advice to implementors.)

If the program named in command does not call MPI_INIT, but instead forks a process that calls MPI_INIT, the results are undefined. Implementations may allow this case to work but are not required to.

Advice to users. MPI does not say what happens if the program you start is a shell script and that shell script starts a program that calls MPI_INIT. Though some implementations may allow you to do this, they may also have restrictions, such as requiring that arguments supplied to the shell script be supplied to the program, or requiring that certain parts of the environment not be changed. (End of advice to users.)

The argv argument. argv is an array of strings containing arguments that are passed to the program. The first element of argv is the first argument passed to command, not, as is conventional in some contexts, the command itself. The argument list is terminated by NULL in C and an empty string in Fortran. In Fortran, leading and trailing spaces are always stripped, so that a string consisting of all spaces is considered an empty string. The constant MPI_ARGV_NULL may be used in C and Fortran to indicate an empty argument list. In C this constant is the same as NULL.

Example 10.1  Examples of argv in C and Fortran
To run the program "ocean" with arguments "-gridfile" and "ocean1.grd" in C:

    char command[] = "ocean";
    char *argv[] = {"-gridfile", "ocean1.grd", NULL};
    MPI_Comm_spawn(command, argv, ...);

or, if not everything is known at compile time:

    char *command;
    char **argv;
    command = "ocean";
    argv=(char **)malloc(3 * sizeof(char *));
    argv[0] = "-gridfile";
    argv[1] = "ocean1.grd";
    argv[2] = NULL;
    MPI_Comm_spawn(command, argv, ...);

In Fortran:

    CHARACTER*25 command, argv(3)
    command = ' ocean '
    argv(1) = ' -gridfile '
    argv(2) = ' ocean1.grd'
    argv(3) = ' '
    call MPI_COMM_SPAWN(command, argv, ...)

Arguments are supplied to the program if this is allowed by the operating system. In C, the argv argument of MPI_COMM_SPAWN differs from the argv argument of main in two respects. First, it is shifted by one element. Specifically, argv[0] of main is provided by the implementation and conventionally contains the name of the program (given by command). argv[1] of main corresponds to argv[0] in MPI_COMM_SPAWN, argv[2] of main to argv[1] of MPI_COMM_SPAWN, etc. Second, argv of MPI_COMM_SPAWN must be null-terminated, so that its length can be determined. Passing an argv of MPI_ARGV_NULL to MPI_COMM_SPAWN results in main receiving argc of 1 and an argv whose element 0 is (conventionally) the name of the program.
If a Fortran implementation supplies routines that allow a program to obtain its arguments, the arguments may be available through that mechanism. In C, if the operating system does not support arguments appearing in argv of main(), the MPI implementation may add the arguments to the argv that is passed to MPI_INIT.

The maxprocs argument. MPI tries to spawn maxprocs processes. If it is unable to spawn maxprocs processes, it raises an error of class MPI_ERR_SPAWN.
An implementation may allow the info argument to change the default behavior, such that if the implementation is unable to spawn all maxprocs processes, it may spawn a smaller number of processes instead of raising an error. In principle, the info argument may specify an arbitrary set {m_i : 0 ≤ m_i ≤ maxprocs} of allowed values for the number of processes spawned. The set {m_i} does not necessarily include the value maxprocs. If an implementation is able to spawn one of these allowed numbers of processes, MPI_COMM_SPAWN returns successfully and the number of spawned processes, m, is given by the size of the remote group of intercomm. If m is less than maxprocs, reasons why the other processes were not spawned are given in array_of_errcodes as described below. If it is not possible to spawn one of the allowed numbers of processes, MPI_COMM_SPAWN raises an error of class MPI_ERR_SPAWN.
A spawn call with the default behavior is called hard. A spawn call for which fewer than maxprocs processes may be returned is called soft. See Section 10.3.4 on page 384 for more information on the soft key for info.

Advice to users. By default, requests are hard and MPI errors are fatal. This means that by default there will be a fatal error if MPI cannot spawn all the requested processes. If you want the behavior "spawn as many processes as possible, up to N," you should do a soft spawn, where the set of allowed values {m_i} is {0 ... N}. However, this is not completely portable, as implementations are not required to support soft spawning. (End of advice to users.)
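The following non-normative sketch shows a soft spawn: the reserved key soft (Section 10.3.4) tells the implementation that anywhere from one to ten workers is acceptable, and the number actually created is obtained from the size of the remote group of the intercommunicator. The worker executable name is hypothetical, and implementations are not required to support soft spawning.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        MPI_Comm workers;
        MPI_Info info;
        int      errcodes[10], nspawned;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "soft", "1:10");   /* accept any number from 1 to 10 */

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 10, info, 0,
                       MPI_COMM_SELF, &workers, errcodes);

        MPI_Comm_remote_size(workers, &nspawned);
        printf("obtained %d of up to 10 workers\n", nspawned);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }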

The info argument. The info argument to all of the routines in this chapter is an opaque handle of type MPI_Info in C and in Fortran with the mpi_f08 module, and INTEGER in Fortran with the mpi module or the include file mpif.h. It is a container for a number of user-specified (key, value) pairs. key and value are strings (null-terminated char* in C, character*(*) in Fortran). Routines to create and manipulate the info argument are described in Chapter 9 on page 367.
For the SPAWN calls, info provides additional (and possibly implementation-dependent) instructions to MPI and the runtime system on how to start processes. An application may pass MPI_INFO_NULL in C or Fortran. Portable programs not requiring detailed control over process locations should use MPI_INFO_NULL.
MPI does not specify the content of the info argument, except to reserve a number of special key values (see Section 10.3.4 on page 384). The info argument is quite flexible and could even be used, for example, to specify the executable and its command-line arguments. In this case the command argument to MPI_COMM_SPAWN could be empty. The ability to do this follows from the fact that MPI does not specify how an executable is found, and the info argument can tell the runtime system where to "find" the executable "" (empty string). Of course a program that does this will not be portable across MPI implementations.

The root argument. All arguments before the root argument are examined only on the process whose rank in comm is equal to root. The value of these arguments on other processes is ignored.

The array_of_errcodes argument. The array_of_errcodes is an array of length maxprocs in which MPI reports the status of each process that MPI was requested to start. If all maxprocs processes were spawned, array_of_errcodes is filled in with the value MPI_SUCCESS. If only m (0 ≤ m < maxprocs) processes are spawned, m of the entries will contain MPI_SUCCESS and the rest will contain an implementation-specific error code indicating the reason MPI could not start the process. MPI does not specify which entries correspond to failed processes. An implementation may, for instance, fill in error codes in one-to-one correspondence with a detailed specification in the info argument. These error codes all belong to the error class MPI_ERR_SPAWN if there was no error in the argument list. In C or Fortran, an application may pass MPI_ERRCODES_IGNORE if it is not interested in the error codes.

Advice to implementors. MPI_ERRCODES_IGNORE in Fortran is a special type of constant, like MPI_BOTTOM. See the discussion in Section 2.5.4 on page 15. (End of advice to implementors.)

MPI_COMM_GET_PARENT(parent)
  OUT   parent    the parent communicator (handle)

int MPI_Comm_get_parent(MPI_Comm *parent)

MPI_Comm_get_parent(parent, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(OUT) :: parent
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_GET_PARENT(PARENT, IERROR)
    INTEGER PARENT, IERROR

If a process was started with MPI_COMM_SPAWN or MPI_COMM_SPAWN_MULTIPLE, MPI_COMM_GET_PARENT returns the "parent" intercommunicator of the current process.

This parent intercommunicator is created implicitly inside of MPI_INIT and is the same intercommunicator returned by SPAWN in the parents.
If the process was not spawned, MPI_COMM_GET_PARENT returns MPI_COMM_NULL.
After the parent communicator is freed or disconnected, MPI_COMM_GET_PARENT returns MPI_COMM_NULL.

Advice to users. MPI_COMM_GET_PARENT returns a handle to a single intercommunicator. Calling MPI_COMM_GET_PARENT a second time returns a handle to the same intercommunicator. Freeing the handle with MPI_COMM_DISCONNECT or MPI_COMM_FREE will cause other references to the intercommunicator to become invalid (dangling). Note that calling MPI_COMM_FREE on the parent communicator is not useful. (End of advice to users.)

Rationale. The desire of the Forum was to create a constant MPI_COMM_PARENT similar to MPI_COMM_WORLD. Unfortunately such a constant cannot be used (syntactically) as an argument to MPI_COMM_DISCONNECT, which is explicitly allowed. (End of rationale.)

10.3.3 Starting Multiple Executables and Establishing Communication

While MPI_COMM_SPAWN is sufficient for most cases, it does not allow the spawning of multiple binaries, or of the same binary with multiple sets of arguments. The following routine spawns multiple binaries or the same binary with multiple sets of arguments, establishing communication with them and placing them in the same MPI_COMM_WORLD.

MPI_COMM_SPAWN_MULTIPLE(count, array_of_commands, array_of_argv, array_of_maxprocs, array_of_info, root, comm, intercomm, array_of_errcodes)
  IN    count               number of commands (positive integer, significant to MPI only at root; see advice to users)
  IN    array_of_commands   programs to be executed (array of strings, significant only at root)
  IN    array_of_argv       arguments for commands (array of array of strings, significant only at root)
  IN    array_of_maxprocs   maximum number of processes to start for each command (array of integer, significant only at root)
  IN    array_of_info       info objects telling the runtime system where and how to start processes (array of handles, significant only at root)
  IN    root                rank of process in which previous arguments are examined (integer)
  IN    comm                intracommunicator containing group of spawning processes (handle)
  OUT   intercomm           intercommunicator between original group and newly spawned group (handle)
  OUT   array_of_errcodes   one error code per process (array of integer)

int MPI_Comm_spawn_multiple(int count, char *array_of_commands[],
              char **array_of_argv[], const int array_of_maxprocs[], const
              MPI_Info array_of_info[], int root, MPI_Comm comm,
              MPI_Comm *intercomm, int array_of_errcodes[])

MPI_Comm_spawn_multiple(count, array_of_commands, array_of_argv,
              array_of_maxprocs, array_of_info, root, comm, intercomm,
              array_of_errcodes, ierror) BIND(C)
    INTEGER, INTENT(IN) :: count, array_of_maxprocs(*), root
    CHARACTER(LEN=*), INTENT(IN) :: array_of_commands(*)
    CHARACTER(LEN=*), INTENT(IN) :: array_of_argv(count, *)
    TYPE(MPI_Info), INTENT(IN) :: array_of_info(*)
    TYPE(MPI_Comm), INTENT(IN) :: comm
    TYPE(MPI_Comm), INTENT(OUT) :: intercomm
    INTEGER :: array_of_errcodes(*)
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SPAWN_MULTIPLE(COUNT, ARRAY_OF_COMMANDS, ARRAY_OF_ARGV,
              ARRAY_OF_MAXPROCS, ARRAY_OF_INFO, ROOT, COMM, INTERCOMM,
              ARRAY_OF_ERRCODES, IERROR)
    INTEGER COUNT, ARRAY_OF_INFO(*), ARRAY_OF_MAXPROCS(*), ROOT, COMM,
    INTERCOMM, ARRAY_OF_ERRCODES(*), IERROR
    CHARACTER*(*) ARRAY_OF_COMMANDS(*), ARRAY_OF_ARGV(COUNT, *)

MPI_COMM_SPAWN_MULTIPLE is identical to MPI_COMM_SPAWN except that there are multiple executable specifications. The first argument, count, gives the number of specifications. Each of the next four arguments are simply arrays of the corresponding arguments in MPI_COMM_SPAWN. For the Fortran version of array_of_argv, the element array_of_argv(i,j) is the j-th argument to command number i.

Rationale. This may seem backwards to Fortran programmers who are familiar with Fortran's column-major ordering. However, it is necessary to do it this way to allow MPI_COMM_SPAWN to sort out arguments. Note that the leading dimension of array_of_argv must be the same as count. Also note that Fortran rules for sequence association allow a different value in the first dimension; in this case, the sequence of array elements is interpreted by MPI_COMM_SPAWN_MULTIPLE as if the sequence is stored in an array defined with the first dimension set to count. This Fortran feature allows an implementor to define MPI_ARGVS_NULL (see below) with fixed dimensions, e.g., (1,1), or only with one dimension, e.g., (1). (End of rationale.)

Advice to users. The argument count is interpreted by MPI only at the root, as is array_of_argv. Since the leading dimension of array_of_argv is count, a non-positive value of count at a non-root node could theoretically cause a runtime bounds check error, even though array_of_argv should be ignored by the subroutine. If this happens, you should explicitly supply a reasonable value of count on the non-root nodes. (End of advice to users.)

In any language, an application may use the constant MPI_ARGVS_NULL (which is likely to be (char ***)0 in C) to specify that no arguments should be passed to any commands. The effect of setting individual elements of array_of_argv to MPI_ARGV_NULL is not defined. To specify arguments for some commands but not others, the commands without arguments should have a corresponding argv whose first element is null ((char *)0 in C and empty string in Fortran). In Fortran at non-root processes, the count argument must be set to a value that is consistent with the provided array_of_argv, although the content of these arguments has no meaning for this operation.
All of the spawned processes have the same MPI_COMM_WORLD. Their ranks in MPI_COMM_WORLD correspond directly to the order in which the commands are specified in MPI_COMM_SPAWN_MULTIPLE. Assume that m_1 processes are generated by the first command, m_2 by the second, etc. The processes corresponding to the first command have ranks 0, 1, ..., m_1 - 1. The processes in the second command have ranks m_1, m_1 + 1, ..., m_1 + m_2 - 1. The processes in the third have ranks m_1 + m_2, m_1 + m_2 + 1, ..., m_1 + m_2 + m_3 - 1, etc.

Advice to users. Calling MPI_COMM_SPAWN multiple times would create many sets of children with different MPI_COMM_WORLDs whereas MPI_COMM_SPAWN_MULTIPLE creates children with a single MPI_COMM_WORLD, so the two methods are not completely equivalent. There are also two performance-related reasons why, if you need to spawn multiple executables, you may want to use MPI_COMM_SPAWN_MULTIPLE instead of calling MPI_COMM_SPAWN several times.
First, spawning several things at once may be faster than spawning them sequentially. Second, in some implementations, communication between processes spawned at the same time may be faster than communication between processes spawned separately. (End of advice to users.)

The array_of_errcodes argument is a 1-dimensional array of size n_1 + n_2 + ... + n_count, where n_i is the i-th element of array_of_maxprocs. Command number i corresponds to the n_i contiguous slots in this array from element n_1 + ... + n_(i-1) to (n_1 + ... + n_i) - 1. Error codes are treated as for MPI_COMM_SPAWN.

Example 10.2  Examples of array_of_argv in C and Fortran
To run the program "ocean" with arguments "-gridfile" and "ocean1.grd" and the program "atmos" with argument "atmos.grd" in C:

    char *array_of_commands[2] = {"ocean", "atmos"};
    char **array_of_argv[2];
    char *argv0[] = {"-gridfile", "ocean1.grd", (char *)0};
    char *argv1[] = {"atmos.grd", (char *)0};
    array_of_argv[0] = argv0;
    array_of_argv[1] = argv1;
    MPI_Comm_spawn_multiple(2, array_of_commands, array_of_argv, ...);

Here is how you do it in Fortran:

    CHARACTER*25 commands(2), array_of_argv(2, 3)
    commands(1) = ' ocean '
    array_of_argv(1, 1) = ' -gridfile '
    array_of_argv(1, 2) = ' ocean1.grd'
    array_of_argv(1, 3) = ' '

    commands(2) = ' atmos '
    array_of_argv(2, 1) = ' atmos.grd '
    array_of_argv(2, 2) = ' '

    call MPI_COMM_SPAWN_MULTIPLE(2, commands, array_of_argv, ...)

10.3.4 Reserved Keys

The following keys are reserved. An implementation is not required to interpret these keys, but if it does interpret the key, it must provide the functionality described.

host  Value is a hostname. The format of the hostname is determined by the implementation.

arch  Value is an architecture name. Valid architecture names and what they mean are determined by the implementation.

wdir  Value is the name of a directory on a machine on which the spawned process(es) execute(s). This directory is made the working directory of the executing process(es). The format of the directory name is determined by the implementation.

path  Value is a directory or set of directories where the implementation should look for the executable. The format of path is determined by the implementation.

file  Value is the name of a file in which additional information is specified. The format of the filename and internal format of the file are determined by the implementation.

soft   Value specifies a set of numbers which are allowed values for the number of processes that MPI_COMM_SPAWN (et al.) may create. The format of the value is a comma-separated list of Fortran-90 triplets, each of which specifies a set of integers and which together specify the set formed by the union of these sets. Negative values in this set and values greater than maxprocs are ignored. MPI will spawn the largest number of processes it can, consistent with some number in the set. The order in which triplets are given is not significant.

By Fortran-90 triplets, we mean:

1. a means a
2. a:b means a, a+1, a+2, ..., b
3. a:b:c means a, a+c, a+2c, ..., a+ck, where for c > 0, k is the largest integer for which a+ck ≤ b, and for c < 0, k is the largest integer for which a+ck ≥ b. If b > a then c must be positive. If b < a then c must be negative.

Examples:

1. a:b gives a range between a and b
2. 0:N gives full "soft" functionality
3. 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 allows a power-of-two number of processes.
4. 2:10000:2 allows an even number of processes.
5. 2:10:2,7 allows 2, 4, 6, 7, 8, or 10 processes.

10.3.5 Spawn Example

Manager-worker Example Using MPI_COMM_SPAWN

/* manager */
#include "mpi.h"
int main(int argc, char *argv[])
{
   int world_size, universe_size, *universe_sizep, flag;
   MPI_Comm everyone;           /* intercommunicator */
   char worker_program[100];

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &world_size);

   if (world_size != 1)    error("Top heavy with management");

   MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                     &universe_sizep, &flag);
   if (!flag) {
        printf("This MPI does not support UNIVERSE_SIZE. How many\n\
processes total?");
        scanf("%d", &universe_size);
   } else universe_size = *universe_sizep;

416 386 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 if (universe_size == 1) error("No room to start workers"); 2 3 /* 4 * Now spawn the workers. Note that there is a run-time determination 5 * of what type of worker to spawn, and presumably this calculation must 6 * be done at run time and cannot be calculated before starting 7 * the program. If everything is known when the application is 8 * first started, it is generally better to start them all at once 9 * in a single MPI_COMM_WORLD. 10 */ 11 12 choose_worker_program(worker_program); 13 MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, universe_size-1, 14 MPI_INFO_NULL, 0, MPI_COMM_SELF, &everyone, 15 MPI_ERRCODES_IGNORE); 16 /* 17 * Parallel code here. The communicator "everyone" can be used 18 * to communicate with the spawned processes, which have ranks 0,.. 19 * MPI_UNIVERSE_SIZE-1 in the remote group of the intercommunicator 20 * "everyone". 21 */ 22 23 MPI_Finalize(); 24 return 0; 25 } 26 27 /* worker */ 28 29 #include "mpi.h" 30 int main(int argc, char *argv[]) 31 { 32 int size; 33 MPI_Comm parent; 34 MPI_Init(&argc, &argv); 35 MPI_Comm_get_parent(&parent); 36 if (parent == MPI_COMM_NULL) error("No parent!"); 37 MPI_Comm_remote_size(parent, &size); 38 if (size != 1) error("Something’s wrong with the parent"); 39 40 /* 41 * Parallel code here. 42 * The manager is represented as the process with rank 0 in (the remote 43 * group of) the parent communicator. If the workers need to communicate 44 * among themselves, they can use MPI_COMM_WORLD. 45 */ 46 47 MPI_Finalize(); 48 return 0;

417 10.4. ESTABLISHING COMMUNICATION 387 1 } 2 3 4 5 10.4 Establishing Communication 6 7 MPI This section provides functions that establish communication between two sets of 8 processes that do not share a communicator. 9 Some situations in which these functions are useful are: 10 11 1. Two parts of an application that are started independently need to communicate. 12 2. A visualization tool wants to attach to a running process. 13 14 3. A server wants to accept connections from multiple clients. Both clients and server 15 may be parallel programs. 16 MPI must establish communication channels where none existed In each of these situations, 17 before, and there is no parent/child relationship. The routines described in this section 18 MPI intercom- establish communication between the two sets of processes by creating an 19 municator, where the two groups of the intercommunicator are the original sets of processes. 20 Establishing contact between two groups of processes that do not share an existing 21 communicator is a collective but asymmetric process. One group of processes indicates its 22 willingness to accept connections from other groups of processes. We will call this group 23 , even if this is not a client/server type of application. The other group the (parallel) server 24 . client connects to the server; we will call it the 25 26 are used throughout this section, Advice to users. While the names and client server 27 MPI does not guarantee the traditional robustness of client/server systems. The func- 28 tionality described in this section is intended to allow two cooperating parts of the 29 same application to communicate with one another. For instance, a client that gets a 30 segmentation fault and dies, or one that does not participate in a collective operation 31 End of advice to users. ) may cause a server to crash or hang. ( 32 33 10.4.1 Names, Addresses, Ports, and All That 34 35 Almost all of the complexity in MPI client/server routines addresses the question “how 36 does the client find out how to contact the server?” The difficulty, of course, is that there 37 is no existing communication channel between them, yet they must somehow agree on a 38 rendezvous point where they will establish communication. 39 Agreeing on a rendezvous point always involves a third party. The third party may 40 itself provide the rendezvous point or may communicate rendezvous information from server 41 to client. Complicating matters might be the fact that a client does not really care what 42 server it contacts, only that it be able to get in touch with one that can handle its request. 43 MPI can accommodate a wide variety of run-time systems while retaining the Ideally, 44 MPI : ability to write simple, portable code. The following should be compatible with 45 The server resides at a well-known internet address host:port. • 46 47 • The server prints out an address to the terminal; the user gives this address to the 48 client program.

• The server places the address information on a nameserver, where it can be retrieved with an agreed-upon name.

• The server to which the client connects is actually a broker, acting as a middleman between the client and the real server.

MPI does not require a nameserver, so not all implementations will be able to support all of the above scenarios. However, MPI provides an optional nameserver interface, and is compatible with external name servers.

A port_name is a system-supplied string that encodes a low-level network address at which a server can be contacted. Typically this is an IP address and a port number, but an implementation is free to use any protocol. The server establishes a port_name with the MPI_OPEN_PORT routine. It accepts a connection to a given port with MPI_COMM_ACCEPT. A client uses port_name to connect to the server.

By itself, the port_name mechanism is completely portable, but it may be clumsy to use because of the necessity to communicate port_name to the client. It would be more convenient if a server could specify that it be known by an application-supplied service_name so that the client could connect to that service_name without knowing the port_name.

An MPI implementation may allow the server to publish a (port_name, service_name) pair with MPI_PUBLISH_NAME and the client to retrieve the port name from the service name with MPI_LOOKUP_NAME. This allows three levels of portability, with increasing levels of functionality.

1. Applications that do not rely on the ability to publish names are the most portable. Typically the port_name must be transferred "by hand" from server to client.

2. Applications that use the MPI_PUBLISH_NAME mechanism are completely portable among implementations that provide this service. To be portable among all implementations, these applications should have a fall-back mechanism that can be used when names are not published.

3. Applications may ignore MPI's name publishing functionality and use their own mechanism (possibly system-supplied) to publish names. This allows arbitrary flexibility but is not portable.

10.4.2 Server Routines

A server makes itself available with two routines. First it must call MPI_OPEN_PORT to establish a port at which it may be contacted. Secondly it must call MPI_COMM_ACCEPT to accept connections from clients.

MPI_OPEN_PORT(info, port_name)

  IN   info        implementation-specific information on how to establish an address (handle)
  OUT  port_name   newly established port (string)

int MPI_Open_port(MPI_Info info, char *port_name)

MPI_Open_port(info, port_name, ierror) BIND(C)

419 10.4. ESTABLISHING COMMUNICATION 389 1 TYPE(MPI_Info), INTENT(IN) :: info 2 CHARACTER(LEN=MPI_MAX_PORT_NAME), INTENT(OUT) :: port_name 3 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 4 MPI_OPEN_PORT(INFO, PORT_NAME, IERROR) 5 CHARACTER*(*) PORT_NAME 6 INTEGER INFO, IERROR 7 8 name _ port This function establishes a network address, encoded in the string, at which 9 name the server will be able to accept connections from clients. port _ is supplied by the 10 system, possibly using information in the info argument. 11 name . _ port copies a system-supplied port name into port _ name identifies the newly MPI 12 opened port and can be used by a client to contact the server. The maximum size string 13 NAME . that may be supplied by the system is MPI _ MAX _ PORT _ 14 . The application name port The system copies the port name into Advice to users. _ 15 ) must pass a buffer of sufficient size to hold this value. ( End of advice to users. 16 17 port name _ is essentially a network address. It is unique within the communication 18 universe to which it belongs (determined by the implementation), and may be used by any 19 client within that communication universe. For instance, if it is an internet (host:port) 20 address, it will be unique on the internet. If it is a low level switch address on an IBM SP, 21 it will be unique to that SP. 22 23 Advice to implementors. These examples are not meant to constrain implementa- 24 _ port tions. A name could, for instance, contain a user name or the name of a batch 25 job, as long as it is unique within some well-defined communication domain. The 26 larger the communication domain, the more useful MPI ’s client/server functionality 27 will be. ( End of advice to implementors. ) 28 The precise form of the address is implementation-defined. For instance, an internet address 29 may be a host name or IP address, or anything that the implementation can decode into 30 CLOSE _ MPI an IP address. A port name may be reused after it is freed with and PORT _ 31 released by the system. 32 33 _ name by hand, it is useful Advice to implementors. Since the user may type in port 34 End of to choose a form that is easily readable and does not have embedded spaces. ( 35 ) advice to implementors. 36 37 info may be used to tell the implementation how to establish the address. It may, and 38 INFO _ MPI usually will, be in order to get the implementation defaults. NULL _ 39 40 41 PORT(port _ MPI _ name) _ CLOSE 42 _ port IN a port (string) name 43 44 int MPI_Close_port(const char *port_name) 45 46 MPI_Close_port(port_name, ierror) BIND(C) 47 CHARACTER(LEN=*), INTENT(IN) :: port_name 48 INTEGER, OPTIONAL, INTENT(OUT) :: ierror

420 390 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 MPI_CLOSE_PORT(PORT_NAME, IERROR) 2 CHARACTER*(*) PORT_NAME 3 INTEGER IERROR 4 . name _ port This function releases the network address represented by 5 6 7 name, info, root, comm, newcomm) _ ACCEPT(port _ COMM _ MPI 8 port _ IN root name ) port name (string, used only on 9 10 info implementation-dependent information (handle, used IN 11 root ) only on 12 IN root rank in comm of root node (integer) 13 intracommunicator over which call is collective (han- comm IN 14 dle) 15 16 OUT newcomm intercommunicator with client as remote group (han- 17 dle) 18 19 int MPI_Comm_accept(const char *port_name, MPI_Info info, int root, 20 MPI_Comm comm, MPI_Comm *newcomm) 21 MPI_Comm_accept(port_name, info, root, comm, newcomm, ierror) BIND(C) 22 CHARACTER(LEN=*), INTENT(IN) :: port_name 23 TYPE(MPI_Info), INTENT(IN) :: info 24 INTEGER, INTENT(IN) :: root 25 TYPE(MPI_Comm), INTENT(IN) :: comm 26 TYPE(MPI_Comm), INTENT(OUT) :: newcomm 27 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 28 29 MPI_COMM_ACCEPT(PORT_NAME, INFO, ROOT, COMM, NEWCOMM, IERROR) 30 CHARACTER*(*) PORT_NAME 31 INTEGER INFO, ROOT, COMM, NEWCOMM, IERROR 32 COMM _ establishes communication with a client. It is collective over the _ MPI ACCEPT 33 calling communicator. It returns an intercommunicator that allows communication with the 34 client. 35 _ name must have been established through a call to MPI _ OPEN _ PORT . The port 36 ACCEPT can be used to provide directives that may influence the behavior of the info 37 call. 38 39 40 10.4.3 Client Routines 41 There is only one routine on the client side. 42 43 44 45 46 47 48

421 10.4. ESTABLISHING COMMUNICATION 391 1 COMM name, info, root, comm, newcomm) _ CONNECT(port _ MPI _ 2 name root port IN _ ) network address (string, used only on 3 IN info implementation-dependent information (handle, used 4 ) only on root 5 6 root rank in comm of root node (integer) IN 7 IN intracommunicator over which call is collective (han- comm 8 dle) 9 newcomm OUT intercommunicator with server as remote group (han- 10 dle) 11 12 13 int MPI_Comm_connect(const char *port_name, MPI_Info info, int root, 14 MPI_Comm comm, MPI_Comm *newcomm) 15 MPI_Comm_connect(port_name, info, root, comm, newcomm, ierror) BIND(C) 16 CHARACTER(LEN=*), INTENT(IN) :: port_name 17 TYPE(MPI_Info), INTENT(IN) :: info 18 INTEGER, INTENT(IN) :: root 19 TYPE(MPI_Comm), INTENT(IN) :: comm 20 TYPE(MPI_Comm), INTENT(OUT) :: newcomm 21 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 22 23 MPI_COMM_CONNECT(PORT_NAME, INFO, ROOT, COMM, NEWCOMM, IERROR) 24 CHARACTER*(*) PORT_NAME 25 INTEGER INFO, ROOT, COMM, NEWCOMM, IERROR 26 name . It is This routine establishes communication with a server specified by port _ 27 collective over the calling communicator and returns an intercommunicator in which the 28 MPI _ ACCEPT remote group participated in an . _ COMM 29 If the named port does not exist (or has been closed), MPI _ COMM _ CONNECT raises 30 . an error of class MPI _ ERR _ PORT 31 If the port exists, but does not have a pending MPI _ COMM _ , the connection ACCEPT 32 attempt will eventually time out after an implementation-defined time, or succeed when 33 MPI MPI COMM _ ACCEPT . In the case of a time out, _ _ COMM _ CONNECT the server calls 34 . raises an error of class MPI _ ERR _ PORT 35 36 Advice to implementors. The time out period may be arbitrarily short or long. 37 However, a high-quality implementation will try to queue connection attempts so 38 that a server can handle simultaneous requests from several clients. A high-quality 39 arguments to implementation may also provide a mechanism, through the info 40 _ MPI _ , for the CONNECT _ COMM _ MPI , and/or OPEN _ COMM MPI , PORT _ ACCEPT 41 ) user to control timeout and queuing behavior. ( End of advice to implementors. 42 43 provides no guarantee of fairness in servicing connection attempts. That is, connec- MPI 44 tion attempts are not necessarily satisfied in the order they were initiated and competition 45 from other connection attempts may prevent a particular connection attempt from being 46 satisfied. 47 is the address of the server. It must be the same as the name returned _ port name 48 on the server. Some freedom is allowed here. If there are equivalent PORT _ OPEN _ MPI by

422 392 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 _ port port _ name name forms of , an implementation may accept them as well. For instance, if 2 ip ) as well. is ( hostname:port _ ), an implementation may accept ( address:port 3 4 10.4.4 Name Publishing 5 _ name The routines in this section provide a mechanism for publishing names. A ( , service 6 port ) pair is published by the server, and may be retrieved by a client using the name _ 7 name , that is, service _ name only. An MPI implementation defines the scope of the service _ 8 the domain over which the _ can be retrieved. If the domain is the empty name service 9 set, that is, if no client can retrieve the information, then we say that name publishing 10 is not supported. Implementations should document how the scope is determined. High- 11 arguments to name info quality implementations will give some control to users through the 12 publishing functions. Examples are given in the descriptions of individual functions. 13 14 15 name, info, port _ MPI _ PUBLISH NAME(service _ _ name) 16 IN _ name a service name to associate with the port (string) service 17 18 IN info implementation-specific information (handle) 19 IN _ name a port name (string) port 20 21 int MPI_Publish_name(const char *service_name, MPI_Info info, const 22 char *port_name) 23 24 MPI_Publish_name(service_name, info, port_name, ierror) BIND(C) 25 TYPE(MPI_Info), INTENT(IN) :: info 26 CHARACTER(LEN=*), INTENT(IN) :: service_name, port_name 27 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 28 MPI_PUBLISH_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR) 29 INTEGER INFO, IERROR 30 CHARACTER*(*) SERVICE_NAME, PORT_NAME 31 32 , name This routine publishes the pair ( port _ ) so that an application may name _ service 33 . port retrieve a system-supplied _ service using a well-known name name _ 34 of a published service name, that is, the scope The implementation must define the 35 domain over which the service name is unique, and conversely, the domain over which the 36 (port name, service name) pair may be retrieved. For instance, a service name may be 37 unique to a job (where job is defined by a distributed operating system or batch scheduler), 38 unique to a machine, or unique to a Kerberos realm. The scope may depend on the info 39 argument to MPI _ PUBLISH _ NAME . 40 service name for a single MPI permits publishing more than one . On the _ name port _ 41 , info has already been published within the scope determined by name _ service other hand, if 42 is undefined. An NAME _ PUBLISH _ MPI implementation may, through the behavior of MPI 43 _ MPI argument to info a mechanism in the _ , provide a way to allow multiple PUBLISH NAME 44 servers with the same service in the same scope. In this case, an implementation-defined 45 LOOKUP _ policy will determine which of several port names is returned by . NAME MPI _ 46 has a limited scope, determined by the implementation, Note that while service _ name 47 port always has global scope within the communication universe used by the imple- name _ 48

423 10.4. ESTABLISHING COMMUNICATION 393 1 mentation (i.e., it is globally unique). 2 port PORT and not yet OPEN _ _ MPI _ name should be the name of a port established by 3 released by MPI _ CLOSE _ PORT . If it is not, the result is undefined. 4 5 Advice to implementors. In some cases, an MPI implementation may use a name 6 service that a user can also access directly. In this case, a name published by MPI 7 could easily conflict with a name published by a user. In order to avoid such conflicts, 8 MPI implementations should mangle service names so that they are unlikely to conflict 9 with user code that makes use of the same service. Such name mangling will of course 10 be completely transparent to the user. 11 The following situation is problematic but unavoidable, if we want to allow implemen- 12 tations to use nameservers. Suppose there are multiple instances of “ocean” running 13 on a machine. If the scope of a service name is confined to a job, then multiple 14 oceans can coexist. If an implementation provides site-wide scope, however, multiple 15 PUBLISH NAME _ after the first may fail. instances are not possible as all calls to MPI _ 16 There is no universal solution to this. 17 To handle these situations, a high-quality implementation should make it possible to 18 limit the domain over which names are published. ( ) End of advice to implementors. 19 20 21 22 _ name) MPI _ _ NAME(service UNPUBLISH _ name, info, port 23 a service name (string) name _ service IN 24 25 implementation-specific information (handle) IN info 26 _ name port IN a port name (string) 27 28 int MPI_Unpublish_name(const char *service_name, MPI_Info info, const 29 char *port_name) 30 31 MPI_Unpublish_name(service_name, info, port_name, ierror) BIND(C) 32 CHARACTER(LEN=*), INTENT(IN) :: service_name, port_name 33 TYPE(MPI_Info), INTENT(IN) :: info 34 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 35 MPI_UNPUBLISH_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR) 36 INTEGER INFO, IERROR 37 CHARACTER*(*) SERVICE_NAME, PORT_NAME 38 39 This routine unpublishes a service name that has been previously published. Attempt- 40 ing to unpublish a name that has not been published or has already been unpublished is 41 . erroneous and is indicated by the error class MPI _ ERR _ SERVICE 42 All published names must be unpublished before the corresponding port is closed and 43 _ before the publishing process exits. The behavior of MPI is implemen- UNPUBLISH NAME _ 44 tation dependent when a process tries to unpublish a name that it did not publish. 45 info argument was used with to tell the implementation If the NAME _ PUBLISH _ MPI 46 how to publish names, the implementation may require that info passed to 47 NAME MPI _ UNPUBLISH _ contain information to tell the implementation how to unpublish 48 a name.
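A minimal server-side sketch of the publish/unpublish lifecycle follows; it is an illustration rather than an example from the standard, the service name "myservice" is a placeholder, and error handling is omitted. Note that the name is unpublished before the port is closed, as required above.

    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Open_port(MPI_INFO_NULL, port_name);
    MPI_Publish_name("myservice", MPI_INFO_NULL, port_name);

    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... communicate with the client via the intercommunicator,
       then free or disconnect it ... */

    /* Unpublish before closing the port and before exiting. */
    MPI_Unpublish_name("myservice", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);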

424 394 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 LOOKUP _ _ name, info, port _ name) _ MPI NAME(service 2 a service name (string) name IN service _ 3 implementation-specific information (handle) info IN 4 5 _ name OUT port a port name (string) 6 7 int MPI_Lookup_name(const char *service_name, MPI_Info info, 8 char *port_name) 9 10 MPI_Lookup_name(service_name, info, port_name, ierror) BIND(C) 11 CHARACTER(LEN=*), INTENT(IN) :: service_name 12 TYPE(MPI_Info), INTENT(IN) :: info 13 CHARACTER(LEN=MPI_MAX_PORT_NAME), INTENT(OUT) :: port_name 14 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 15 MPI_LOOKUP_NAME(SERVICE_NAME, INFO, PORT_NAME, IERROR) 16 CHARACTER*(*) SERVICE_NAME, PORT_NAME 17 INTEGER INFO, IERROR 18 19 with This function retrieves a port _ name published by MPI _ PUBLISH _ NAME 20 has not been published, it raises an error in the error class name _ service . If name _ service 21 NAME . The application must supply a port MPI name buffer large enough to hold the _ ERR _ _ 22 MPI _ OPEN _ PORT ). largest possible port name (see discussion above under 23 If an implementation allows multiple entries with the same service _ within the name 24 _ port same scope, a particular is chosen in a way determined by the implementation. name 25 PUBLISH info argument was used with _ If the MPI to tell the implementation NAME _ 26 . how to publish names, a similar info argument may be required for MPI _ LOOKUP _ NAME 27 28 10.4.5 Reserved Key Values 29 The following key values are reserved. An implementation is not required to interpret these 30 key values, but if it does interpret the key value, it must provide the functionality described. 31 32 port _ port Value contains IP port number at which to establish a ip . (Reserved for 33 MPI _ OPEN _ PORT only). 34 35 ip _ address Value contains IP address at which to establish a port . If the address is not a 36 MPI PORT valid IP address of the host on which the call is made, the results _ OPEN _ 37 _ only). PORT _ OPEN MPI are undefined. (Reserved for 38 39 Client/Server Examples 10.4.6 40 Simplest Example — Completely Portable. 41 42 The following example shows the simplest way to use the client/server interface. It does 43 not use service names at all. 44 On the server side: 45 46 47 char myport[MPI_MAX_PORT_NAME]; 48 MPI_Comm intercomm;

425 10.4. ESTABLISHING COMMUNICATION 395 1 /* ... */ 2 MPI_Open_port(MPI_INFO_NULL, myport); 3 printf("port name is: %s\n", myport); 4 5 MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm); 6 /* do something with intercomm */ 7 The server prints out the port name to the terminal and the user must type it in when 8 starting up the client (assuming the MPI implementation supports stdin such that this 9 works). On the client side: 10 11 MPI_Comm intercomm; 12 char name[MPI_MAX_PORT_NAME]; 13 printf("enter port name: "); 14 gets(name); 15 MPI_Comm_connect(name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm); 16 17 Ocean/Atmosphere — Relies on Name Publishing 18 19 In this example, the “ocean” application is the “server” side of a coupled ocean-atmosphere 20 implementation publishes names. climate model. It assumes that the MPI 21 22 23 MPI_Open_port(MPI_INFO_NULL, port_name); 24 MPI_Publish_name("ocean", MPI_INFO_NULL, port_name); 25 26 MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm); 27 /* do something with intercomm */ 28 MPI_Unpublish_name("ocean", MPI_INFO_NULL, port_name); 29 30 On the client side: 31 32 MPI_Lookup_name("ocean", MPI_INFO_NULL, port_name); 33 MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, 34 &intercomm); 35 36 Simple Client-Server Example 37 38 This is a simple example; the server accepts only a single connection at a time and serves 39 that connection until the client requests to be disconnected. The server is a single process. 40 Here is the server. It accepts a single connection and then processes data until it 41 receives a message with tag . A message with tag 0 tells the server to exit. 1 42 #include "mpi.h" 43 int main(int argc, char *argv[]) 44 { 45 MPI_Comm client; 46 MPI_Status status; 47 char port_name[MPI_MAX_PORT_NAME]; 48

426 396 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 double buf[MAX_DATA]; 2 int size, again; 3 4 MPI_Init(&argc, &argv); 5 MPI_Comm_size(MPI_COMM_WORLD, &size); 6 if (size != 1) error(FATAL, "Server too big"); 7 MPI_Open_port(MPI_INFO_NULL, port_name); 8 printf("server available at %s\n", port_name); 9 while (1) { 10 MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, 11 &client); 12 again = 1; 13 while (again) { 14 MPI_Recv(buf, MAX_DATA, MPI_DOUBLE, 15 MPI_ANY_SOURCE, MPI_ANY_TAG, client, &status); 16 switch (status.MPI_TAG) { 17 case 0: MPI_Comm_free(&client); 18 MPI_Close_port(port_name); 19 MPI_Finalize(); 20 return 0; 21 case 1: MPI_Comm_disconnect(&client); 22 again = 0; 23 break; 24 case 2: /* do something */ 25 ... 26 default: 27 /* Unexpected message type */ 28 MPI_Abort(MPI_COMM_WORLD, 1); 29 } 30 } 31 } 32 } 33 Here is the client. 34 35 #include "mpi.h" 36 int main( int argc, char **argv ) 37 { 38 MPI_Comm server; 39 double buf[MAX_DATA]; 40 char port_name[MPI_MAX_PORT_NAME]; 41 42 MPI_Init( &argc, &argv ); 43 strcpy( port_name, argv[1] );/* assume server’s name is cmd-line arg */ 44 45 MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, 46 &server ); 47 48

    while (!done) {
        tag = 2;    /* Action to perform */
        MPI_Send( buf, n, MPI_DOUBLE, 0, tag, server );
        /* etc */
    }
    MPI_Send( buf, 0, MPI_DOUBLE, 0, 1, server );
    MPI_Comm_disconnect( &server );
    MPI_Finalize();
    return 0;
}

10.5 Other Functionality

10.5.1 Universe Size

Many "dynamic" MPI applications are expected to exist in a static runtime environment, in which resources have been allocated before the application is run. When a user (or possibly a batch system) runs one of these quasi-static applications, she will usually specify a number of processes to start and a total number of processes that are expected. An application simply needs to know how many slots there are, i.e., how many processes it should spawn.

MPI provides an attribute on MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, that allows the application to obtain this information in a portable manner. This attribute indicates the total number of processes that are expected. In Fortran, the attribute is the integer value. In C, the attribute is a pointer to the integer value. An application typically subtracts the size of MPI_COMM_WORLD from MPI_UNIVERSE_SIZE to find out how many processes it should spawn. MPI_UNIVERSE_SIZE is initialized in MPI_INIT and is not changed by MPI. If defined, it has the same value on all processes of MPI_COMM_WORLD. MPI_UNIVERSE_SIZE is determined by the application startup mechanism in a way not specified by MPI. (The size of MPI_COMM_WORLD is another example of such a parameter.)

Possibilities for how MPI_UNIVERSE_SIZE might be set include

• A -universe_size argument to a program that starts MPI processes.

• Automatic interaction with a batch scheduler to figure out how many processors have been allocated to an application.

• An environment variable set by the user.

• Extra information passed to MPI_COMM_SPAWN through the info argument.

An implementation must document how MPI_UNIVERSE_SIZE is set. An implementation may not support the ability to set MPI_UNIVERSE_SIZE, in which case the attribute MPI_UNIVERSE_SIZE is not set.

MPI_UNIVERSE_SIZE is a recommendation, not necessarily a hard limit. For instance, some implementations may allow an application to spawn 50 processes per processor, if they are requested. However, it is likely that the user only wants to spawn one process per processor.
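A hedged sketch of the usage pattern just described: query the attribute, check the flag, and spawn the remaining slots. The worker program name is a placeholder and error-code checking is omitted.

    int *universe_sizep, flag, world_size, nworkers;
    MPI_Comm workers;

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                      &universe_sizep, &flag);

    if (flag && *universe_sizep > world_size) {
        nworkers = *universe_sizep - world_size;   /* remaining slots */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);
    }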

MPI_UNIVERSE_SIZE is assumed to have been specified when an application was started, and is in essence a portable mechanism to allow the user to pass to the application (through the MPI process startup mechanism, such as mpiexec) a piece of critical runtime information. Note that no interaction with the runtime environment is required. If the runtime environment changes size while an application is running, MPI_UNIVERSE_SIZE is not updated, and the application must find out about the change through direct communication with the runtime system.

10.5.2 Singleton MPI_INIT

A high-quality implementation will allow any process (including those not started with a "parallel application" mechanism) to become an MPI process by calling MPI_INIT. Such a process can then connect to other MPI processes using the MPI_COMM_ACCEPT and MPI_COMM_CONNECT routines, or spawn other MPI processes. MPI does not mandate this behavior, but strongly encourages it where technically feasible.

Advice to implementors. To start MPI processes belonging to the same MPI_COMM_WORLD requires some special coordination. The processes must be started at the "same" time, they must have a mechanism to establish communication, etc. Either the user or the operating system must take special steps beyond simply starting processes.

When an application enters MPI_INIT, clearly it must be able to determine if these special steps were taken. If a process enters MPI_INIT and determines that no special steps were taken (i.e., it has not been given the information to form an MPI_COMM_WORLD with other processes) it succeeds and forms a singleton MPI program, that is, one in which MPI_COMM_WORLD has size 1.

In some implementations, MPI may not be able to function without an "MPI environment." For example, MPI may require that daemons be running, or MPI may not be able to work at all on the front-end of an MPP. In this case, an MPI implementation may either

1. Create the environment (e.g., start a daemon) or

2. Raise an error if it cannot create the environment and the environment has not been started independently.

A high-quality implementation will try to create a singleton MPI process and not raise an error.

(End of advice to implementors.)

10.5.3 MPI_APPNUM

There is a predefined attribute MPI_APPNUM of MPI_COMM_WORLD. In Fortran, the attribute is an integer value. In C, the attribute is a pointer to an integer value. If a process was spawned with MPI_COMM_SPAWN_MULTIPLE, MPI_APPNUM is the command number that generated the current process. Numbering starts from zero. If a process was spawned with MPI_COMM_SPAWN, it will have MPI_APPNUM equal to zero.

Additionally, if the process was not started by a spawn call, but by an implementation-specific startup mechanism that can handle multiple process specifications, MPI_APPNUM should be set to the number of the corresponding process specification. In particular, if it is started with

mpiexec spec0 [: spec1 : spec2 : ...]

MPI_APPNUM should be set to the number of the corresponding specification.

If an application was not spawned with MPI_COMM_SPAWN or MPI_COMM_SPAWN_MULTIPLE, and MPI_APPNUM does not make sense in the context of the implementation-specific startup mechanism, MPI_APPNUM is not set.

MPI implementations may optionally provide a mechanism to override the value of MPI_APPNUM through the info argument. MPI reserves the following key for all SPAWN calls.

appnum   Value contains an integer that overrides the default value for MPI_APPNUM in the child.

Rationale. When a single application is started, it is able to figure out how many processes there are by looking at the size of MPI_COMM_WORLD. An application consisting of multiple SPMD sub-applications has no way to find out how many sub-applications there are and to which sub-application the process belongs. While there are ways to figure it out in special cases, there is no general mechanism. MPI_APPNUM provides such a general mechanism. (End of rationale.)

10.5.4 Releasing Connections

Before a client and server connect, they are independent MPI applications. An error in one does not affect the other. After establishing a connection with MPI_COMM_CONNECT and MPI_COMM_ACCEPT, an error in one may affect the other. It is desirable for a client and server to be able to disconnect, so that an error in one will not affect the other. Similarly, it might be desirable for a parent and child to disconnect, so that errors in the child do not affect the parent, or vice-versa.

• Two processes are connected if there is a communication path (direct or indirect) between them. More precisely:

  1. Two processes are connected if

     (a) they both belong to the same communicator (inter- or intra-, including MPI_COMM_WORLD) or

     (b) they have previously belonged to a communicator that was freed with MPI_COMM_FREE instead of MPI_COMM_DISCONNECT or

     (c) they both belong to the group of the same window or filehandle.

  2. If A is connected to B and B to C, then A is connected to C.

• Two processes are disconnected (also independent) if they are not connected.

• By the above definitions, connectivity is a transitive property, and divides the universe of MPI processes into disconnected (independent) sets (equivalence classes) of processes.

• Processes which are connected, but do not share the same MPI_COMM_WORLD, may become disconnected (independent) if the communication path between them is broken by using MPI_COMM_DISCONNECT.

The following additional rules apply to MPI routines in other chapters:

• MPI_FINALIZE is collective over a set of connected processes.

• MPI_ABORT does not abort independent processes. It may abort all processes in the caller's MPI_COMM_WORLD (ignoring its comm argument). Additionally, it may abort connected processes as well, though it makes a "best attempt" to abort only the processes in comm.

• If a process terminates without calling MPI_FINALIZE, independent processes are not affected but the effect on connected processes is not defined.

MPI_COMM_DISCONNECT(comm)

  INOUT  comm   communicator (handle)

int MPI_Comm_disconnect(MPI_Comm *comm)

MPI_Comm_disconnect(comm, ierror) BIND(C)
    TYPE(MPI_Comm), INTENT(INOUT) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_DISCONNECT(COMM, IERROR)
    INTEGER COMM, IERROR

This function waits for all pending communication on comm to complete internally, deallocates the communicator object, and sets the handle to MPI_COMM_NULL. It is a collective operation.

It may not be called with the communicator MPI_COMM_WORLD or MPI_COMM_SELF.

MPI_COMM_DISCONNECT may be called only if all communication is complete and matched, so that buffered data can be delivered to its destination. This requirement is the same as for MPI_FINALIZE.

MPI_COMM_DISCONNECT has the same action as MPI_COMM_FREE, except that it waits for pending communication to finish internally and enables the guarantee about the behavior of disconnected processes.

Advice to users. To disconnect two processes you may need to call MPI_COMM_DISCONNECT, MPI_WIN_FREE, and MPI_FILE_CLOSE to remove all communication paths between the two processes. Note that it may be necessary to disconnect several communicators (or to free several windows or files) before two processes are completely independent. (End of advice to users.)

Rationale. It would be nice to be able to use MPI_COMM_FREE instead, but that function explicitly does not wait for pending communication to complete. (End of rationale.)
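The following sketch, which is an illustration rather than an example from the standard, shows a parent and its spawned children severing their connection so that a later failure in one side cannot affect the other. The function names are placeholders.

#include "mpi.h"

/* Parent side: "children" is the intercommunicator returned earlier by
   MPI_Comm_spawn; all communication on it must already be complete and
   matched before this call. */
void release_children(MPI_Comm *children)
{
    MPI_Comm_disconnect(children);   /* collective; sets *children to MPI_COMM_NULL */
}

/* Child side: disconnect from the parent intercommunicator. */
void release_parent(void)
{
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);
}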

431 10.5. OTHER FUNCTIONALITY 401 1 Another Way to Establish MPI Communication 10.5.5 2 3 4 MPI _ COMM _ JOIN(fd, intercomm) 5 fd IN socket file descriptor 6 7 OUT intercomm new intercommunicator (handle) 8 9 int MPI_Comm_join(int fd, MPI_Comm *intercomm) 10 MPI_Comm_join(fd, intercomm, ierror) BIND(C) 11 INTEGER, INTENT(IN) :: fd 12 TYPE(MPI_Comm), INTENT(OUT) :: intercomm 13 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 14 15 MPI_COMM_JOIN(FD, INTERCOMM, IERROR) 16 INTEGER FD, INTERCOMM, IERROR 17 _ JOIN MPI MPI _ COMM is intended for implementations that exist in an environment 18 supporting the Berkeley Socket interface [45, 49]. Implementations that exist in an environ- 19 _ _ ment not supporting Berkeley Sockets should provide the entry point for COMM JOIN MPI 20 . NULL _ COMM _ MPI and should return 21 processes which are This call creates an intercommunicator from the union of two MPI 22 _ should normally succeed if the local and remote JOIN _ COMM MPI connected by a socket. 23 processes have access to the same implementation-defined MPI communication universe. 24 25 MPI implementation may require a specific communication Advice to users. An 26 medium for MPI communication, such as a shared memory segment or a special switch. 27 In this case, it may not be possible for two processes to successfully join even if there 28 is a socket connecting them and they are using the same End implementation. ( MPI 29 of advice to users. ) 30 31 Advice to implementors. A high-quality implementation will attempt to establish 32 communication over a slow medium if its preferred one is not available. If implemen- 33 MPI communication tations do not do this, they must document why they cannot do 34 over the medium used by the socket (especially if the socket is a TCP connection). 35 End of advice to implementors. ( ) 36 37 is a file descriptor representing a socket of type SOCK _ STREAM (a two-way reliable fd 38 must byte-stream connection). Nonblocking I/O and asynchronous notification via SIGIO 39 not be enabled for the socket. The socket must be in a connected state. The socket must 40 JOIN COMM _ _ is called (see below). It is the responsibility of the be quiescent when MPI 41 application to create the socket using standard socket API calls. 42 must be called by the process at each end of the socket. It does not JOIN _ COMM _ MPI 43 JOIN _ COMM _ MPI return until both processes have called . The two processes are referred 44 to as the local and remote processes. 45 MPI uses the socket to bootstrap creation of the intercommunicator, and for nothing 46 _ MPI else. Upon return from , the file descriptor will be open and quiescent _ COMM JOIN 47 (see below). 48

432 402 CHAPTER 10. PROCESS CREATION AND MANAGEMENT 1 is unable to create an intercommunicator, but is able to leave the socket in its MPI If 2 intercomm to original state, with no pending communication, it succeeds and sets 3 _ . NULL _ COMM MPI 4 is called and after _ COMM _ MPI The socket must be quiescent before JOIN 5 _ COMM _ JOIN , a read on the _ COMM _ JOIN MPI MPI returns. More specifically, on entry to 6 socket will not read any data that was written to the socket before the remote process called 7 COMM _ JOIN , a MPI will not read any data that was _ COMM _ JOIN . On exit from MPI _ read 8 written to the socket before the remote process returned from MPI _ COMM _ JOIN . It is the 9 responsibility of the application to ensure the first condition, and the responsibility of the 10 implementation to ensure the second. In a multithreaded application, the application MPI 11 must ensure that one thread does not access the socket while another is calling 12 _ JOIN MPI MPI _ COMM COMM concurrently. _ , or call JOIN _ 13 14 is free to use any available communication path(s) Advice to implementors. MPI 15 for MPI messages in the new communicator; the socket is only used for the initial 16 ) End of advice to implementors. handshaking. ( 17 18 uses non- MPI _ COMM communication to do its work. The interaction of non- MPI _ JOIN 19 communication is not defined. Therefore, the result communication with pending MPI MPI 20 on two connected processes (see Section 10.5.4 on page 399 for of calling _ MPI COMM _ JOIN 21 the definition of connected) is undefined. 22 communication with addi- MPI The returned communicator may be used to establish 23 communicator creation mechanisms. tional processes, through the usual MPI 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
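As a hedged illustration of MPI_COMM_JOIN (not an example from the standard), assume each of the two processes has already created a connected, quiescent TCP socket with ordinary socket API calls and holds its descriptor in sockfd; the helper function name is a placeholder.

#include "mpi.h"

/* Called at both ends of the socket; MPI_Comm_join returns only after
   both processes have called it. */
void join_over_socket(int sockfd)
{
    MPI_Comm intercomm;
    int rsize;

    MPI_Comm_join(sockfd, &intercomm);
    if (intercomm != MPI_COMM_NULL) {
        /* The socket was used only to bootstrap the intercommunicator;
           the remote group holds exactly the one process at the other end. */
        MPI_Comm_remote_size(intercomm, &rsize);   /* rsize == 1 */
        /* ... MPI communication over intercomm ... */
        MPI_Comm_disconnect(&intercomm);
    }
}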

433 1 2 3 4 5 6 Chapter 11 7 8 9 10 One-Sided Communications 11 12 13 14 11.1 Introduction 15 16 Remote Memory Access ( RMA ) extends the communication mechanisms of MPI by allowing 17 one process to specify all communication parameters, both for the sending side and for the 18 receiving side. This mode of communication facilitates the coding of some applications with 19 dynamically changing data access patterns where the data distribution is fixed or slowly 20 changing. In such a case, each process can compute what data it needs to access or to update 21 at other processes. However, the programmer may not be able to easily determine which 22 data in a process may need to be accessed or to be updated by operations executed by a 23 different process, and may not even know which processes may perform such updates. Thus, 24 the transfer parameters are all available only on one side. Regular send/receive communi- 25 cation requires matching operations by sender and receiver. In order to issue the matching 26 operations, an application needs to distribute the transfer parameters. This distribution 27 may require all processes to participate in a time-consuming global computation, or to poll 28 for potential communication requests to receive and upon which to act periodically. The 29 RMA communication mechanisms avoids the need for global computations or explicit use of 30 polling. A generic example of this nature is the execution of an assignment of the form A = 31 map is a permutation vector, and A , B , and are distributed in the same map B(map) , where 32 manner. 33 communication Message-passing communication achieves two effects: of data from 34 sender to receiver and design separates RMA of sender with receiver. The synchronization 35 these two functions. The following communication calls are provided: 36 37 RPUT Remote write: • _ MPI , PUT _ MPI 38 _ GET , MPI _ RGET • Remote read: MPI 39 40 ACCUMULATE , _ MPI • RACCUMULATE _ MPI Remote update: 41 42 • Remote read and update: MPI _ GET _ ACCUMULATE , MPI _ RGET _ ACCUMULATE , 43 _ OP AND _ FETCH _ MPI and 44 _ AND _ COMPARE _ MPI Remote atomic swap operations: • SWAP 45 46 This chapter refers to an operations set that includes all remote update, remote read and 47 update, and remote atomic swap operations as “accumulate” operations. 48 403

434 404 CHAPTER 11. ONE-SIDED COMMUNICATIONS 1 MPI supports two fundamentally different memory models: separate and unified. The 2 separate model makes no assumption about memory consistency and is highly portable. 3 This model is similar to that of weakly coherent memory systems: the user must impose 4 correct ordering of memory accesses through synchronization calls. The unified model can 5 exploit cache-coherent hardware and hardware-accelerated, one-sided operations that are 6 commonly available in high-performance systems. The two different models are discussed 7 in detail in Section 11.4. Both models support several synchronization calls to support 8 different synchronization styles. 9 functions allows implementors to take advantage of fast or RMA The design of the 10 asynchronous communication mechanisms provided by various platforms, such as coherent 11 or noncoherent shared memory, DMA engines, hardware-supported put/get operations, and 12 communication coprocessors. The most frequently used RMA communication mechanisms 13 functions might need RMA can be layered on top of message-passing. However, certain 14 support for asynchronous communication agents in software (handlers, threads, etc.) in a 15 distributed memory environment. 16 the process that performs the call, and by origin the We shall denote by target 17 process in which the memory is accessed. Thus, in a put operation, source=origin and 18 destination=target; in a get operation, source=target and destination=origin. 19 20 Initialization 11.2 21 22 _ _ WIN MPI provides the following window initialization functions: , CREATE MPI 23 _ SHARED , and MPI _ WIN _ ALLOCATE , MPI _ WIN _ ALLOCATE 24 , which are collective on an intracommunicator. DYNAMIC _ CREATE _ WIN _ MPI 25 MPI WIN _ CREATE allows each process to specify a “window” in its memory that is made _ 26 accessible to accesses by remote processes. The call returns an opaque object that represents 27 the group of processes that own and access the set of windows, and the attributes of each 28 window, as specified by the initialization call. MPI _ WIN _ ALLOCATE differs from 29 in that the user does not pass allocated memory; CREATE _ WIN _ MPI 30 MPI _ WIN _ ALLOCATE returns a pointer to memory allocated by the MPI implementation. 31 _ ALLOCATE _ SHARED differs from MPI _ WIN _ ALLOCATE in that the allocated MPI _ WIN 32 memory can be accessed from all processes in the window’s group with direct load/store 33 instructions. Some restrictions may apply to the specified communicator. 34 creates a window that allows the user to dynamically control _ DYNAMIC CREATE MPI _ WIN _ 35 which memory is exposed by the window. 36 37 11.2.1 Window Creation 38 39 40 _ CREATE(base, size, disp WIN _ unit, info, comm, win) _ MPI 41 42 IN base initial address of window (choice) 43 IN size size of window in bytes (non-negative integer) 44 unit _ disp IN local unit size for displacements, in bytes (positive in- 45 teger) 46 47 IN info argument (handle) info 48

435 11.2. INITIALIZATION 405 1 IN intra-communicator (handle) comm 2 window object returned by the call (handle) OUT win 3 4 int MPI_Win_create(void *base, MPI_Aint size, int disp_unit, MPI_Info info, 5 MPI_Comm comm, MPI_Win *win) 6 7 MPI_Win_create(base, size, disp_unit, info, comm, win, ierror) BIND(C) 8 TYPE(*), DIMENSION(..), ASYNCHRONOUS :: base 9 INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size 10 INTEGER, INTENT(IN) :: disp_unit 11 TYPE(MPI_Info), INTENT(IN) :: info 12 TYPE(MPI_Comm), INTENT(IN) :: comm 13 TYPE(MPI_Win), INTENT(OUT) :: win 14 INTEGER, OPTIONAL, INTENT(OUT) :: ierror 15 MPI_WIN_CREATE(BASE, SIZE, DISP_UNIT, INFO, COMM, WIN, IERROR) 16 BASE(*) 17 INTEGER(KIND=MPI_ADDRESS_KIND) SIZE 18 INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR 19 20 This is a collective call executed by all processes in the group of comm . It returns 21 operations. Each RMA a window object that can be used by these processes to perform 22 RMA accesses by the process specifies a window of existing memory that it exposes to 23 comm . The window consists of processes in the group of bytes, starting at address size 24 base is the starting address of a memory region. In Fortran, one can pass the . In C, base 25 first element of a memory region or a whole array, which must be ‘simply contiguous’ (for 26 ‘simply contiguous’, see also Section 17.1.12 on page 626). A process may elect to expose 27 size = 0 . no memory by specifying 28 The displacement unit argument is provided to facilitate address arithmetic in RMA 29 RMA operations: the target displacement argument of an operation is scaled by the factor 30 specified by the target process, at window creation. disp _ unit 31 32 Rationale. The window size is specified using an address-sized integer, to allow 33 windows that span more than 4 GB of address space. (Even if the physical memory 34 size is less than 4 GB, the address range may be larger than 4 GB, if addresses are 35 not contiguous.) ( End of rationale. ) 36 Advice to users. Common choices for disp _ unit are 1 (no scaling), and (in C syntax) 37 sizeof(type) . The type , for a window that consists of an array of elements of type 38 later choice will allow one to use array indices in RMA calls, and have those scaled 39 End of advice correctly to byte displacements, even in a heterogeneous environment. ( 40 to users. ) 41 42 The info argument provides optimization hints to the runtime about the expected usage 43 pattern of the window. The following info keys are predefined: 44 45 , then the implementation may assume that passive target syn- true no _ locks — if set to 46 , chronization (i.e., MPI _ WIN _ LOCK ) will not be used on the given MPI _ LOCK _ ALL 47 window. This implies that this window is not used for 3-party communication, and 48 RMA can be implemented with no (less) asynchronous agent activity at this process.

436 406 CHAPTER 11. ONE-SIDED COMMUNICATIONS 1 — controls the ordering of accumulate operations at the target. See accumulate ordering _ 2 Section 11.7.2 for details. 3 , the implementation will assume that all concurrent op — if set to ops _ accumulate _ same 4 accumulate calls to the same target address will use the same operation. If set to 5 _ same no _ _ op op , then the implementation will assume that all concurrent accumulate 6 _ NO calls to the same target address will use the same operation or _ MPI . This can OP 7 eliminate the need to protect access for certain operation types where the hardware 8 can guarantee atomicity. The default is _ op . same _ op _ no 9 10 11 Advice to users. The info query mechanism described in Section 11.2.7 can be used 12 to query the specified info arguments windows that have been passed to a library. It 13 End is recommended that libraries check attached info keys for each passed window. ( 14 of advice to users. ) 15 16 comm The various processes in the group of may specify completely different target 17 windows, in location, size, displacement units, and info arguments. As long as all the get, 18 put and accumulate accesses to a particular process fit their specific target window this 19 should pose no problem. The same area in memory may appear in multiple windows, each 20 associated with a different window object. However, concurrent communications to distinct, 21 overlapping windows may lead to undefined results. 22 The reason for specifying the memory that may be accessed from another Rationale. 23 operation is to permit the programmer to specify what memory RMA process in an 24 RMA operations and for the implementation to enforce that spec- can be a target of 25 ification. For example, with this definition, a server process can safely allow a client 26 process to use RMA operations, knowing that (under the assumption that the MPI 27 implementation does enforce the specified limits on the exposed memory) an error in 28 End of the client cannot affect any memory other than what was explicitly exposed. ( 29 ) rationale. 30 31 32 Advice to users. A window can be created in any part of the process memory. 33 However, on some systems, the performance of windows in memory allocated by 34 _ (Section 8.2, page 339) will be better. Also, on some systems, ALLOC _ MEM MPI 35 performance is improved when window boundaries are aligned at “natural” boundaries 36 End of advice to users. ) (word, double-word, cache line, page frame, etc.). ( 37 In cases where RMA operations use different mechanisms Advice to implementors. 38 in different memory areas (e.g., load/store in a shared memory segment, and an asyn- 39 MPI call needs to figure out CREATE chronous handler in private memory), the _ _ WIN 40 MPI which type of memory is used for the window. To do so, maintains, internally, the 41 _ MPI list of memory segments allocated by , or by other, implementa- MEM _ ALLOC 42 tion-specific, m