
Experimental Results for Matrix Transpose

Performance graphs of matrix transpose execution times, using SPLIT-C on a 32-processor CM-5, SP-2, and CS-2, and an 8-processor Paragon, are given in Figures 6, 7, 8, and 9, respectively, in Appendix A.1. These figures also show the data bandwidth attained per processor by the transpose algorithm. For large enough data sets on the CM-5, we achieve an average bandwidth of 7.62 MB/s per processor, more than three-fourths of the maximum user-payload bandwidth of 12 MB/s per processor [28]. This is consistent with results from other research teams, which achieved 6.4 MB/s per processor (Culler at UC Berkeley, [11]) and 7.72 MB/s per processor (Ranka at Syracuse University, [46]) for similar data movements on the CM-5. Note that some of these cited results are for low-level implementations using message passing algorithms.

For large enough data sets, the SP-2 achieves more than 24.8 MB/s per processor for the matrix transpose algorithm, using high-performance switch hardware rated by the vendor at a peak node-to-node bandwidth of 40 MB/s [27]. The Meiko CS-2 achieves more than 10.7 MB/s per processor; this is well below the maximum attainable bandwidth of 50 MB/s per processor [33] because our SPLIT-C version has not been fully optimized to use the architecture's communications coprocessor. The 8-processor Paragon achieves more than 88.6 MB/s per processor; Intel rates the maximum hardware bandwidth at 175 MB/s per processor and the application peak bandwidth at 135 MB/s per processor [30].






David A. Bader
dbader@umiacs.umd.edu