Next: A New Sample Sort Algorithm Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: Introduction

The Parallel Computation Model

We use a simple model to analyze the performance of our parallel algorithms. Our model is based on the fact that current hardware platforms can be viewed as a collection of powerful processors connected by a communication network that can be modeled as a complete graph on which communication is subject to the restrictions imposed by the latency and the bandwidth properties of the network. We view a parallel algorithm as a sequence of local computations interleaved with communication steps, where we allow computation and communication to overlap. We account for communication costs as follows.

Assuming no congestion, the transfer of a block consisting of m contiguous words between two processors takes time, where is the latency of the network and is the time per word at which a processor can inject or receive data from the network. Note that the bandwidth per processor is inversely proportional to . We assume that the bisection bandwidth is sufficiently high to support block permutation routing amongst the p processors at the rate of . In particular, for any subset of q processors, a block permutation amongst the q processors takes time, where m is the size of the largest block.

Using this cost model, we can evaluate the communication time of an algorithm as a function of the input size n, the number of processors p , and the parameters and . The coefficient of gives the total number of times collective communication primitives are used, and the coefficient of gives the maximum total amount of data exchanged between a processor and the remaining processors.

This communication model is close to a number of similar models (e.g. [16, 33, 1]) that have recently appeared in the literature and seems to be well-suited for designing parallel algorithms on current high performance platforms.

We define the computation time as the maximum time it takes a processor to perform all the local computation steps. In general, the overall performance involves a tradeoff between and . In many cases, it is possible to minimize both and simultaneously, and sorting is such a case.

Next: A New Sample Sort Algorithm Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: Introduction

David R. Helman
helman@umiacs.umd.edu