
Introduction

Sorting is arguably the most studied problem in computer science, both because of its intrinsic theoretical importance and its use in so many applications. Its significant requirements for interprocessor communication bandwidth and the irregular communication patterns it typically generates have earned it inclusion in several parallel benchmarks, such as NAS [7] and SPLASH [34]. Moreover, its practical importance has motivated the publication of a number of empirical studies seeking to identify the most efficient sorting routines. Yet parallel sorting strategies have generally fallen into one of two groups, each with its respective disadvantages. The first group, using the classification of Li and Sevcik [23], is the single-step algorithms, so named because the data is moved between processors exactly once. Examples include sample sort [21, 10], parallel sorting by regular sampling [31, 24], and parallel sorting by overpartitioning [23]. The price paid by these single-step algorithms is an irregular communication scheme and difficulty with load balancing. The other group is the multi-step algorithms, which include bitonic sort [9], column sort [22], rotate sort [25], hyperquicksort [28], flashsort [29], B-flashsort [20], smoothsort [27], and Tridgell and Brent's sort [32]. Generally speaking, these algorithms accept multiple rounds of communication in return for better load balancing and, in some cases, regular communication.

In this paper, we present a novel variation on the sample sort algorithm [19] which addresses the limitations of previous implementations. We exchange the single step of irregular communication for two steps of regular communication. In return, we gain three advantages. First, we reduce the problem of poor load balancing, because we are able to sustain a very high oversampling ratio at virtually no cost. Second, we efficiently accommodate the presence of duplicates without the overhead of tagging each element. Third, we obtain predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message passing standards, such as MPI [26], which seek to guarantee the availability of very efficient (often machine-specific) implementations of certain basic collective communication routines.
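To make the role of the oversampling ratio concrete, the following minimal C sketch shows how splitters are typically chosen in a generic sample sort with oversampling. It is an illustration only: the function name, the sampling-with-replacement scheme, and the parameter s are assumptions for exposition and do not describe the specific algorithm of Section 3.

    /* Illustrative sketch of oversampling-based splitter selection
       (generic sample sort); not the algorithm of Section 3. */
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Choose p-1 splitters from n keys by drawing s candidate samples
       per processor, sorting the p*s samples, and keeping every s-th one. */
    void choose_splitters(const int *keys, int n, int p, int s, int *splitters)
    {
        int m = p * s;                      /* total number of samples   */
        int *sample = malloc(m * sizeof(int));
        for (int i = 0; i < m; i++)         /* sample with replacement   */
            sample[i] = keys[rand() % n];
        qsort(sample, m, sizeof(int), cmp_int);
        for (int i = 1; i < p; i++)         /* every s-th sample splits  */
            splitters[i - 1] = sample[i * s];
        free(sample);
    }

Increasing s makes the buckets induced by the splitters more nearly equal in size, at the cost of collecting and sorting a larger sample; the point of our variation is that a very large s can be sustained at virtually no cost.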

Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code on a variety of benchmarks chosen to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the scalability and efficiency of our algorithm across different platforms. In fact, it appears to outperform all similar algorithms known to the authors on these platforms, and, unlike previous efficient algorithms, its performance is essentially invariant across the set of input distributions.

The high-level language used in our studies is SPLIT-C [14], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented. The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided by the corresponding bulk_read and bulk_write primitives. Our collective communication primitives, described in detail in [6], are similar to those of MPI [26], the IBM POWERparallel [8], and the Cray MPP systems [13] and include, for example, transpose, bcast, gather, and scatter. Brief descriptions are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives: scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 6, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives.
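As an illustration only, the following C sketch shows one plausible mapping of these four primitives onto standard MPI collectives; it is not the SPLIT-C implementation used in our study, and the block size is an arbitrary assumption.

    /* Hypothetical mapping of transpose, bcast, scatter, and gather onto
       MPI collectives; block size and buffer contents are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK 4                     /* elements per destination processor */

    int main(int argc, char **argv)
    {
        int p, rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int *sendbuf = malloc(p * BLOCK * sizeof(int));
        int *recvbuf = malloc(p * BLOCK * sizeof(int));
        for (int i = 0; i < p * BLOCK; i++)
            sendbuf[i] = rank * p * BLOCK + i;

        /* transpose: each processor sends a distinct, equal-sized block
           to every processor (all-to-all personalized communication). */
        MPI_Alltoall(sendbuf, BLOCK, MPI_INT, recvbuf, BLOCK, MPI_INT,
                     MPI_COMM_WORLD);

        /* bcast: copy one block from processor 0 to all other processors. */
        MPI_Bcast(sendbuf, BLOCK, MPI_INT, 0, MPI_COMM_WORLD);

        /* scatter: split an array on processor 0 into equal-sized blocks,
           one per processor; gather coalesces them back on processor 0. */
        MPI_Scatter(sendbuf, BLOCK, MPI_INT, recvbuf, BLOCK, MPI_INT, 0,
                    MPI_COMM_WORLD);
        MPI_Gather(recvbuf, BLOCK, MPI_INT, sendbuf, BLOCK, MPI_INT, 0,
                   MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

The relevant property for our analysis is that each of these calls moves blocks whose sizes are fixed in advance, independent of the key distribution, which is what makes the communication requirements regular and predictable.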

The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.


