next up previous
Next: A New Sample Sort Algorithm Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: Introduction

The Parallel Computation Model

 

We use a simple model to analyze the performance of our parallel algorithms. Our model is based on the fact that current hardware platforms can be viewed as a collection of powerful processors connected by a communication network that can be modeled as a complete graph on which communication is subject to the restrictions imposed by the latency and the bandwidth properties of the network. We view a parallel algorithm as a sequence of local computations interleaved with communication steps, where we allow computation and communication to overlap. We account for communication costs as follows.

Assuming no congestion, the transfer of a block consisting of m contiguous words between two processors takes tex2html_wrap_inline1796 time, where tex2html_wrap_inline1798 is the latency of the network and tex2html_wrap_inline1800 is the time per word at which a processor can inject or receive data from the network. Note that the bandwidth per processor is inversely proportional to tex2html_wrap_inline1800 . We assume that the bisection bandwidth is sufficiently high to support block permutation routing amongst the p processors at the rate of tex2html_wrap_inline1806 . In particular, for any subset of q processors, a block permutation amongst the q processors takes tex2html_wrap_inline1796 time, where m is the size of the largest block.

Using this cost model, we can evaluate the communication time tex2html_wrap_inline1816 of an algorithm as a function of the input size n, the number of processors p , and the parameters tex2html_wrap_inline1798 and tex2html_wrap_inline1800 . The coefficient of tex2html_wrap_inline1798 gives the total number of times collective communication primitives are used, and the coefficient of tex2html_wrap_inline1800 gives the maximum total amount of data exchanged between a processor and the remaining processors.

This communication model is close to a number of similar models (e.g. [16, 33, 1]) that have recently appeared in the literature and seems to be well-suited for designing parallel algorithms on current high performance platforms.

We define the computation time tex2html_wrap_inline1830 as the maximum time it takes a processor to perform all the local computation steps. In general, the overall performance tex2html_wrap_inline1832 involves a tradeoff between tex2html_wrap_inline1830 and tex2html_wrap_inline1836 . In many cases, it is possible to minimize both tex2html_wrap_inline1830 and tex2html_wrap_inline1836 simultaneously, and sorting is such a case.


next up previous
Next: A New Sample Sort Algorithm Up: A Randomized Parallel Sorting Algorithm With an Experimental Study Previous: Introduction

David R. Helman
helman@umiacs.umd.edu