In this section we state the main results achieved for solving the problems of personalized communication and sorting.

**Problem 1 ( h-Relation Personalized Communication)**: A set of

**Result**: A new deterministic algorithm using two rounds of the
**transpose** primitive, with optimal complexity
and very small constant coefficients, is shown to
be very efficient and scalable across different platforms and over
different input distributions.

**Significance**: The importance of this problem has been stated in
many papers (e.g. [24, 22]). Our algorithm seems to
achieve the best known experimental results for this problem. The
general approach was independently described by Kaufmann et al.
[18] and Ranka et al. [21] around the same time as our
earlier draft [3] appeared, but our algorithm is simpler, has
less overhead, and has a tighter bound size for the intermediate
collective communication.

**Problem 2 (Sorting)**: Rearrange *n* equally distributed elements
amongst *p* processors such that they appear in non-decreasing order
starting from the smallest indexed processor.

**Results**: (1) A novel variation of sample sort that uses only two
rounds of regular all-to-all personalized communication and that
maintains very good load balancing with virtually no overhead. (2) A
new deterministic algorithm that achieves almost the same performance
as (1) by using regular sampling. (3) An efficient implementation of
radix sort that makes use of our personalized communication scheme.

**Significance**: We have developed a suite of benchmarks and
conducted extensive experimentations related to Problem 2. Our
algorithms have consistently performed very well across the different
benchmarks and the different platforms. Algorithm (1) has outperformed
all similar results known to the authors on our platforms. It even
compares favorably to the machine-specific implementations reported by
some of the machine vendors for the Class A NAS IS Benchmark,
which only requires the easier task of ranking.
Algorithm (2) has almost the same performance as (1) while
guaranteeing memory and communication bounds, and Algorithm (3)
achieves the best known execution times for a stable integer sorting
algorithm. Additionally, all of our algorithms efficiently handle the presence of
duplicates without the overhead of tagging each element.

helman@umiacs.umd.edu