next up previous
Next: Performance Evaluation Up: A Parallel Sorting Algorithm Previous: The Parallel Computation Model

A New Sample Sort Algorithm

Consider the problem of sorting n elements equally distributed amongst p processors, where we assume without loss of generality that p divides n evenly. The idea behind sample sort is to find a set of p - 1 splitters to partition the n input elements into p groups indexed from 0 up to p - 1 such that every element in the group is less than or equal to each of the elements in the group, for . Then the task of sorting each of the p groups can be turned over to the correspondingly indexed processor, after which the n elements will be arranged in sorted order. The efficiency of this algorithm obviously depends on how well we divide the input, and this in turn depends on how well we choose the splitters. One way to choose the splitters is by randomly sampling the input elements at each processor - hence the name sample sort.

Previous versions of sample sort [20,10,17,15] have randomly chosen s samples from the elements at each processor, routed them to a single processor, sorted them at that processor, and then selected every element as a splitter. Each processor then performs a binary search on these splitters for each of its input values and then uses the results to route the values to the appropriate destination, after which local sorting is done to complete the sorting process. The first difficulty with this approach is the work involved in gathering and sorting the splitters. A larger value of s results in better load balancing, but it also increases the overhead. The other difficulty is that no matter how the routing is scheduled, there exist inputs that give rise to large variations in the number of elements destined for different processors, and this in turn results in an inefficient use of the communication bandwidth. Moreover, such an irregular communication scheme cannot take advantage of the regular communication primitives proposed under the MPI standard [27].

In our solution, we incur no overhead in obtaining samples from each processor and in sorting these samples to identify the splitters. Because of this very high oversampling, we are able to replace the irregular routing with exactly two calls to our transpose primitive.

The pseudo code for our algorithm is as follows:

We can establish the complexity of this algorithm with high probability - that is with probability for some positive constant . But before doing this, we need to establish the results of the following four lemmas.

Lemma 1: At the completion of Step (1), the number of elements in each bucket is at most with high probability, for any and .

Proof: The probability that exactly elements are placed in a particular bucket in Step (1) is given by the binomial distribution

where , , and . Using the following Chernoff bound [12] for estimating the tail of a binomial distribution

the probability that a particular bucket will contain at least elements can be bounded by . Hence, the probability that any of the buckets contains at least elements can be bounded by , and Lemma 1 follows.

Lemma 2: At the completion of Step (2), the total number of elements received by processor , which comprise the set of samples from which the splitters are chosen, is at most with high probability, for any and .

Proof: The probability that processor receives exactly elements is given by the binomial distribution . Using the Chernoff bound for estimating the tail of a binomial distribution, the probability that processor receives at least elements can be bounded by and Lemma 2 follows.

Lemma 3: At the completion of Step (7), the number of elements received by each processor is at most with high probability, for any and .

Proof: Establishing a bound on the number of elements received by any processor in Step (7) is equivalent to establishing a bound on the number of elements which fall between any two consecutive splitters in the sorted order. But as Blelloch et al. [10] observed, the number of elements which fall between any two consecutive splitters in the sorted order can only be greater than if in the sorted order there are less than samples drawn from the elements which follow the first splitter. Since every element has an equal and independent probability of being a sample, the probability that exactly samples will be found amongst the next elements is given by the binomial distribution . Using the following ``Chernoff'' type bound [18] for estimating the head of a binomial distribution

where , , and , the probability that or less samples will be found amongst the next elements following any of the p splitters can be bounded by and Lemma 3 follows.

Lemma 4: The number of elements exchanged by any two processors in Step (7) is at most with high probability, for any and .

Proof: Since with high probability no processor can receive more than elements in Step (7), and since the randomization in Step (1) means that each of these elements can originate with equal probability from any of the p processors, the probability that exactly elements are exchanged by any two particular processors is given by the binomial distribution . Using the Chernoff bound for estimating the tail of the binomial distribution, the probability that any of the p processors exchange at least elements can be bounded by and Lemma 4 follows.

With these bounds on the values of , , and , the analysis of our sample sort algorithm is as follows. Steps (1), (3), (4), (6), and (8) involve no communication and are dominated by the cost of the sequential sorting in Step (3) and the merging in Step (8). Sorting integers using radix sort requires time, whereas sorting floating point numbers using merge sort requires time. Step (8) requires time if we merge the sorted subsequences in a binary tree fashion. Steps (2), (5), and (7) call the communication primitives transpose, bcast, and transpose, respectively. The analysis of these primitives in [6] shows that with high probability these three steps require , , and , respectively. Hence, with high probability, the overall complexity of our sample sort algorithm is given (for floating point numbers) by

for .

Clearly, our algorithm is asymptotically optimal with very small coefficients. But a theoretical comparison of our running time with previous sorting algorithms is difficult, since there is no consensus on how to model the cost of the irregular communication used by the most efficient algorithms. Hence, it is very important to perform an empirical evaluation of an algorithm using a wide variety of benchmarks, as we will do next.



next up previous
Next: Performance Evaluation Up: A Parallel Sorting Algorithm Previous: The Parallel Computation Model

helman@umiacs.umd.edu