Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sample sort is to find a set of p - 1 splitters to partition the n input elements into p groups indexed from 0 up to p - 1 such that every element in the group is less than or equal to each of the elements in the group, for . Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how well we divide the input, and this in turn depends on how well we choose the splitters. One way to choose the splitters is by randomly sampling the input elements at each processor - hence the name sample sort.
Previous versions of sample sort [16, 8, 12] have randomly chosen s samples from the elements at each processor, routed them to a single processor, sorted them at that processor, and then selected every element as a splitter. Each processor then performs a binary search on these splitters for each of its input values and then uses the results to route the values to the appropriate destination, after which local sorting is done to complete the sorting process. The first difficulty with this approach is the work involved in gathering and sorting the splitters. A larger value of s results in better load balancing, but it also increases the overhead. The second difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard . The final difficulty with the original approach is that duplicate values are accommodated by tagging each item with some unique value . This, of course, doubles the cost of both memory access and interprocessor communication.
In our version of sample sort, we incur no overhead in obtaining samples from each processor and in sorting these samples to identify the splitters. Because of this very high oversampling, we are able to replace the irregular routing with exactly two calls to our transpose primitive, and we are able to efficiently accommodate the presence of duplicates without resorting to tagging. The pseudo code for our algorithm is as follows:
We can establish the complexity of this algorithm with high probability - that is with probability for some positive constant . But before doing this, we need the results of the following theorem, whose proof has been omitted for brevity .
Theorem 1: The number of elements in each bucket at the completion of Step (1) is at most , the number of elements received by each processor at the completion of Step (7) is at most , and the number of elements exchanged by any two processors in Step (7) is at most , all with high probability for any , ( for duplicates), ( for duplicates), and .
With these bounds on the values of , , and , the analysis of our sample sort algorithm is as follows. Steps (1), (3), (4), (6), and (8) involve no communication and are dominated by the cost of the sequential sorting in Step (3) and the merging in Step (8). Sorting integers using radix sort requires O time, whereas sorting floating point numbers using merge sort requires O WY time. Step (8) requires O time if we merge the sorted subsequences in a binary tree fashion. Steps (2), (5), and (7) call the communication primitives transpose, bcast, and transpose, respectively. The analysis of these primitives in  shows that with high probability these three steps require , , and , respectively. Hence, with high probability, the overall complexity of our sample sort algorithm is given (for floating point numbers) by
Clearly, our algorithm is asymptotically optimal with very small coefficients. But it is also important to perform an empirical evaluation of our algorithm using a wide variety of benchmarks. Our algorithm was implemented and tested on nine different benchmarks, each of which had both a 32-bit integer version (64-bit on the Cray T3D) and a 64-bit double precision floating point number (double) version. The details and the rationale for these benchmarks are described in Appendix A. Table i displays the performance of our sample sort as a function of input distribution for a variety of input sizes. The results show that the performance is essentially independent of the input distribution. There is a slight preference for those benchmarks which contain high numbers of duplicates, but this is attributed to the reduced time required for sequential sorting in Step (3).
Table i: Total execution time (in seconds) for sample sorting a variety of benchmarks on a 64 node Cray T3D.
Figure 3: Scalability with respect to problem size of sample sorting integers from the [U] benchmark, for differing numbers of processors, on the Cray T3D and the IBM SP-2-WN.
Table ii examines the scalability of our sample sort as a function of machine size. It shows that for a given input size n the execution time scales almost inversely with the number of processors p.
Table ii: Total execution time (in seconds) for sorting 8M integers from the [WR] benchmark on a variety of machines and processors. A hyphen indicates that that particular platform was unavailable to us.
Figure 3 examines the scalability of our sample sort as a function of problem size, for differing numbers of processors. It shows that for a fixed number of processors there is an almost linear dependence between the execution time and the total number of elements n. Finally, Table iii compares our results on the Class A NAS Benchmark for integer sorting (IS) with the best times reported for the TMC CM-5 and the Cray T3D. Note that the name of this benchmark is somewhat misleading. Instead of requiring that the integers be placed in sorted order as we do, the benchmark only requires that they be ranked without any reordering, which is a significantly simpler task. We believe that our results, which were obtained using high-level, portable code, compare favorably with the other reported times, which were obtained by the vendors using machine-specific implementations and perhaps system modifications.
Table iii: Comparison of our execution time (in seconds) with the best reported times for the Class A NAS Parallel Benchmark for integer sorting. Note that while we actually place the integers in sorted order, the benchmark only requires that they be ranked without actually reordering.
See  for additional performance data and comparisons with other published results.