As described in Subsection 2.1, we use (1) as the general model for CM-5 communications. As stated above, a common communications pattern on parallel machines is a block permutation, where given blocks of contiguous data , and a permutation , and , block has to be moved from to . If each block has length , we claim that the CM-5 time complexity of this operation is the following:

Each node in the CM-5 connects to the fat-tree via a 20 Mb/sec network interface link [34]. The fat-tree network provides sufficient bandwidth to allow every node to perform sustained data transfers at a rate of 20 Mb/sec if all communications is contained in groups of four nodes (i.e. the nodes differ only in the last two bits of their logical address), 10 Mb/sec for transfers within groups of 16 nodes (i.e. the nodes differ in only the last four bits), and 5 Mb/sec for all other communication paths in the system [34].

A regular grid shift is an example of a block permutation pattern of data movement as shown in Subsection 2.2.2. For the corresponding regular block communications, the CM-5 can achieve bandwidths of 15 megabytes/sec per processor to put messages into and take messages out of the data network [29]. The runtime system (RTS) of the CM-5 will choose a data section for a message of between 1 and 5 words in length. If we assume that the data section of a message, , is four 4-byte words, and the header and trailer are an additional 4 bytes, then .

To support these claims, we give the following empirical results. Using the message passing library of the CM-5, CMMD version 3.0 [41], each processing node swaps a packet with another node of fixed distance away. Figure 3 confirms that there is a linear relationship between message length and total transmission time on the CM-5. We find that when the regular block permutation consists of only local communications, i.e. messages are contained in each cluster of four leaf nodes, , the lower bound on time is obtained. A least squares analysis of this data finds and . This is equivalent to a sustained transfer rate of 5.03 Mb/sec.

Other regular block permutations are shown in Figure 3 such as

Both and use two levels of the fat-tree, while needs three levels. Our results show that all non-local regular permutations routed through the fat-tree have similar time complexities, with and . This corresponds to a sustained transfer rate of 4.55 Mb/sec.

**Figure:** Sending Time of a Regular Block Permutation

David A. Bader

dbader@umiacs.umd.edu