A fundamental problem in parallel computing is to design high-level,
architecture-independent algorithms that execute efficiently on
general-purpose parallel machines. The aim is to achieve portability
*and* high performance simultaneously. Note that it is considerably
easier to achieve portability alone (say, by using PVM) or high
performance alone (say, by having sophisticated programmers fine-tune
the algorithm to a specific machine). There are currently two
factors that make this fundamental problem more tractable. The first
is the emergence of a dominant parallel architecture consisting of a
number of powerful microprocessors interconnected by either a
proprietary interconnect or a standard off-the-shelf interconnect.
The second factor is the emergence of standards, such as the message
passing standard MPI [20], for which machine builders and
software developers will try to provide efficient support.
Our work builds on these two developments by presenting a theoretical
and an experimental framework for designing parallel algorithms. In
this abstract, we sketch our contributions to two important problems:
personalized communication and sorting. We start with a brief
outline of the computation model.

We view a parallel algorithm as a sequence of local computations
interleaved with communication steps, and we allow computation and
communication to overlap. We account for communication costs as
follows. Assuming no congestion, the transfer of a block consisting of
*m* contiguous words between two processors takes O(τ + σ *m*) time,
where τ is a bound on the latency of the network and σ is the time per
word at which a processor can inject or receive data from the network.
The cost of each of the collective communication primitives (see
below) will be modeled by O(τ + σ *m*), where *m* is the maximum
amount of data transmitted or received by a processor. Such a cost can
be justified
by using our earlier work [17, 6, 5]. Using
this cost model, we can evaluate the communication time of an
algorithm as a function of the input size *n*, the number of
processors *p*, and the parameters τ and σ. The coefficient of τ gives
the total number of times collective communication primitives are
used, and the coefficient of σ gives the maximum total amount of data
exchanged between a processor
and the remaining processors. This communication model is close to a
number of similar models (e.g. the LogP [13], BSP
[24], and LogGP [1] models) that have recently
appeared in the literature and seems to be well-suited for designing
parallel algorithms on current high performance platforms. We define
the computation time as the maximum time taken by any processor
to perform all of its local computation steps.
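This accounting can be sketched in a few lines. The following is an illustrative model only, not part of the implementation: `comm_time` reads off the communication time from the two coefficients described above, and the numeric values in the example are arbitrary placeholders, not measurements of any real machine.

```python
# Sketch of the communication-cost accounting described above:
#   (number of collective calls) * tau + (max words per processor) * sigma.
# tau (network latency bound) and sigma (time per word) are machine
# parameters; the values used below are arbitrary.

def comm_time(num_collectives, max_words, tau, sigma):
    """Communication time under the O(tau + sigma*m) cost model."""
    return num_collectives * tau + max_words * sigma

# Example: an algorithm using 3 collective primitives in which each
# processor exchanges at most 10_000 words, on a hypothetical machine
# with tau = 50.0 and sigma = 0.5 (in the same time unit):
t = comm_time(3, 10_000, tau=50.0, sigma=0.5)
```

Here the coefficient of τ is 3 and the coefficient of σ is 10 000, matching the two quantities the model asks an algorithm designer to count.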

Our algorithms are implemented in SPLIT-C [11], an
extension of *C* for distributed memory machines. The algorithms make
use of MPI-like communication primitives but do not make any
assumptions as to how these primitives are actually implemented. Our
collective communication primitives, described in detail in
[6], are similar to those of MPI [20], the IBM
POWERparallel [7], and the Cray MPP systems [10]
and, for example, include the following: **transpose**, **bcast**,
**gather**, and **scatter**. Brief descriptions of these are as
follows. The **transpose** primitive is an all-to-all personalized
communication in which each processor has to send a unique block of
data to every processor, and all the blocks are of the same size. The
**bcast** primitive is called to broadcast a block of data from a
single source to all the remaining processors. The primitives
**gather** and **scatter** are companion primitives whereby
**scatter** divides a single array residing on one processor into
equal-sized blocks, which are then distributed to the remaining
processors, and **gather** coalesces these blocks residing on the
different processors into a single array on one processor.
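To make the semantics of these primitives concrete, here is a minimal sequential sketch (in Python rather than SPLIT-C). Processor-local memories are modeled as nested lists and no actual communication takes place; the function names mirror the primitives above, but the code itself is our own illustration, not the implementation from [6].

```python
# Sequential models of the collective communication primitives.
# A "machine" with p processors is modeled as a length-p list of
# per-processor data; only the data movement pattern is illustrated.

def transpose(blocks):
    """All-to-all personalized communication: blocks[i][j] is the
    equal-sized block processor i sends to processor j; the result
    holds, for each processor j, the blocks it receives."""
    p = len(blocks)
    return [[blocks[i][j] for i in range(p)] for j in range(p)]

def bcast(block, p):
    """Broadcast one block from a single source to all p processors."""
    return [block for _ in range(p)]

def scatter(array, p):
    """Divide an array on one processor into p equal-sized blocks,
    one per processor (len(array) must be divisible by p)."""
    m = len(array) // p
    return [array[i * m:(i + 1) * m] for i in range(p)]

def gather(blocks):
    """Coalesce the per-processor blocks into a single array."""
    return [x for block in blocks for x in block]
```

As expected of companion primitives, `gather(scatter(a, p))` reproduces `a`, and applying `transpose` twice restores the original layout.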

helman@umiacs.umd.edu