Next: An Efficient Radix Sort Up: Sorting Previous: A New Sample Sort Algorithm

A New Sorting Algorithm by Regular Sampling

A disadvantage of our random sample sort algorithm is that its performance bounds and memory requirements can be guaranteed only with high probability. The alternative is to choose the samples by regular sampling. A previous version of regular sample sort [23, 19], known as Parallel Sorting by Regular Sampling (PSRS), first sorts the n/p elements at each processor and then selects every (n/p^2)-th element as a sample. These samples are routed to a single processor, where they are sorted and every p-th sample is selected as a splitter. Each processor then uses these splitters to partition its sorted input values and routes the resulting subsequences to the appropriate destinations, after which local merging completes the sorting process. The first difficulty with this approach is load balance: there exist inputs for which at least one processor is left at the completion of sorting with as many as (2n/p - n/p^2) elements. This could be reduced by choosing more splitters, but doing so would also increase the overhead. Moreover, no matter what is done, previous workers have observed that the load balance still deteriorates linearly with the number of duplicates [19]. The other difficulty is that, no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, which in turn results in inefficient use of the communication bandwidth. Such an irregular communication scheme also cannot take advantage of the regular communication primitives proposed under the MPI standard [20].
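For concreteness, the PSRS sample and splitter selection just described can be sketched as a single-machine simulation. The helper name `psrs_splitters` and its structure are illustrative assumptions of ours, not the implementation of [23, 19]:

```python
# Illustrative single-machine simulation of PSRS sample and splitter
# selection: each "processor" is one list in `blocks`.

def psrs_splitters(blocks):
    """blocks: p lists, one per processor, each holding n/p elements."""
    p = len(blocks)
    samples = []
    for block in blocks:
        block.sort()                        # local sort of n/p elements
        step = max(len(block) // p, 1)      # every (n/p^2)-th element
        samples.extend(block[step - 1::step][:p])
    samples.sort()                          # samples gathered on one processor
    return samples[p - 1::p][:p - 1]        # every p-th sample -> p-1 splitters
```

For example, with p = 4 blocks formed by dealing the integers 0..63 round-robin into four lists, the call returns the three evenly spaced splitters 15, 31, and 47.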

In our algorithm, which is parameterized by a sampling ratio s (p <= s <= n/p^2), we guarantee that at the completion of sorting each processor holds at most (n/p + n/s - p) elements, while incurring no overhead in gathering the samples to identify the splitters. This bound holds regardless of the number of duplicates present in the input. Moreover, we are able to replace the irregular routing with exactly two calls to our transpose primitive.

The pseudo code for our algorithm is as follows:
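As a stand-in illustration, the flow described above — local sort, regular sampling with ratio s, splitter selection, and a regular transpose-style exchange followed by local merging — can be simulated on a single machine as follows. The step decomposition, helper names, and the explicit gathering of samples in step (c) are simplifying assumptions of ours, not the authors' exact pseudo code (in particular, the actual algorithm selects splitters without the sample-gathering overhead this sketch incurs):

```python
import bisect
import heapq

def transpose(pieces):
    """Regular all-to-all exchange: pieces[i][j], the piece processor i
    holds for processor j, ends up at processor j."""
    p = len(pieces)
    return [[pieces[i][j] for i in range(p)] for j in range(p)]

def regular_sample_sort(data, p, s):
    """Simulate the p processors as p lists; s is the sampling ratio."""
    n = len(data)
    # (a) Local sort of each processor's n/p elements.
    local = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # (b) Each processor selects s regularly spaced samples.
    samples = [blk[::max(len(blk) // s, 1)][:s] for blk in local]
    # (c) Select every s-th of the p*s samples as a splitter (sketch
    # simplification: the real algorithm avoids this explicit gather).
    merged_samples = sorted(x for smp in samples for x in smp)
    splitters = merged_samples[s - 1::s][:p - 1]
    # (d) Partition each sorted local list by the splitters and route
    # the pieces with one transpose -- the exchange the text calls
    # Step (7).
    pieces = []
    for blk in local:
        cuts = ([0] + [bisect.bisect_right(blk, sp) for sp in splitters]
                + [len(blk)])
        pieces.append([blk[cuts[j]:cuts[j + 1]] for j in range(p)])
    received = transpose(pieces)
    # (e) Each processor merges its p sorted runs.
    return [list(heapq.merge(*r)) for r in received]
```

Running this on 32 reverse-ordered integers with p = 4 and s = 4 leaves each simulated processor with a sorted slice whose concatenation is the fully sorted input.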

Before establishing the complexity of this algorithm, we need the following theorem, whose proof is omitted here for brevity; see [15].

Theorem 2: The number of elements exchanged by any two processors in Step (7) is at most (n/p^2 + n/(ps) - 1). Consequently, at the completion of Step (7), no processor receives more than (n/p + n/s - p) elements, for n >= p^3.
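The two claims are arithmetically consistent: a per-pair exchange of at most n/p^2 + n/(ps) - 1 elements, summed over the p senders, yields exactly the overall bound of n/p + n/s - p elements at any one destination. A minimal check of this arithmetic (helper names are ours):

```python
# Consistency check of the load-balance arithmetic: if each of the p
# processors sends at most n/p**2 + n/(p*s) - 1 elements to any one
# destination, no destination receives more than n/p + n/s - p in total.

def per_pair_bound(n, p, s):
    return n / p**2 + n / (p * s) - 1

def total_bound(n, p, s):
    return n / p + n / s - p

for n, p, s in [(1 << 20, 8, 32), (1 << 23, 64, 64)]:
    assert abs(p * per_pair_bound(n, p, s) - total_bound(n, p, s)) < 1e-6
```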

Hence, the analysis of our regular sample sort algorithm is similar to that of our sample sort algorithm and is given (for floating point numbers) by

T(n, p) = O((n/p) log n + tau + (n/p) sigma)

for n >= p^3 and p <= s <= n/p^2.

Figure 4: Scalability with respect to problem size of regular sorting integers from the [U] benchmark, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN.

Like our random sample sort algorithm, our regular sample sort algorithm is asymptotically optimal with very small coefficients. Once again, our algorithm was implemented and tested on the nine benchmarks. Table iv displays the performance of our regular sort as a function of input distribution for a variety of input sizes. It shows that the performance is essentially independent of the input distribution.

Table iv: Total execution time (in seconds) for regular sorting a variety of benchmarks on a 64 node Cray T3D.

Table v examines the scalability of our regular sort as a function of machine size. It shows that for a given input size n the execution time scales almost inversely with the number of processors p.

Table v: Total execution time (in seconds) for regular sorting 8M integers from the [WR] benchmark on a variety of machines and processors. A hyphen indicates that the platform in question was unavailable to us.

Finally, Figure 4 examines the scalability of our regular sample sort as a function of problem size, for differing numbers of processors. It shows that, for a fixed number of processors, the execution time scales almost linearly with the total number of elements n. See [15] for additional performance data and comparisons with other published results.



David R. Helman
helman@umiacs.umd.edu