 
    
    
         
The design of communication efficient parallel algorithms depends 
on the existence of efficient schemes for handling frequently 
occurring transformations on data layouts. In this section, 
we consider data layouts that 
can be specified by a two-dimensional array A, say of size 
$q \times p$, where column $i$ of A contains a subarray 
stored in the local memory of processor $P_i$, where 
$0 \le i \le p-1$. A transformation $T$ on the layout A will 
map the elements of A into the layout $A'$, not necessarily 
of the same size. We present optimal or near-optimal 
algorithms to handle several such transformations, including 
broadcasting operations, matrix transposition, and 
data permutation. All the algorithms described are deterministic except 
for the algorithm to perform a general permutation.
We start by addressing several broadcasting operations. The simplest 
case is to broadcast a single item to a number of remote locations. Hence 
the layout A can be described as a one-dimensional array, and we assume 
that the element $A[0]$ has to be copied into the remaining entries of 
A. This can be viewed as a concurrent read operation from 
location $A[0]$ executed 
by processors $P_1, P_2, \ldots, P_{p-1}$. 
The next lemma provides a simple algorithm to solve this problem; we 
later use this algorithm to derive an optimal broadcasting algorithm.

Proof: A simple algorithm consists of $p-1$ rounds that can be pipelined. 
During the $r$th round, each processor $P_j$ reads $A[j-r]$, for 
$r \le j \le p-1$; however, 
only $A[0]$ is copied into $A[j]$. 
Since these rounds can be realized with $p-1$ pipelined prefetch read operations, the 
resulting communication complexity is $\tau + (p-1)m$.
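To make the pipelining concrete, the following small sketch (ours, not part of the original development) simulates the round structure just described; the read pattern $A[j-r]$ and the charge $\tau + (p-1)m$ are carried over from the reconstruction above and should be read as assumptions.

\begin{verbatim}
# Sketch: simulate the p-1 pipelined rounds of the simple broadcast and
# return the assumed BDM-style charge tau + (p-1)m for the whole sequence.
def simple_broadcast(p, tau, m):
    A = [None] * p
    A[0] = "item"                      # the element to be copied everywhere
    for r in range(1, p):              # round r = 1, ..., p-1
        for j in range(r, p):          # processor P_j prefetches A[j - r]
            if j - r == 0:             # only the copy of A[0] is kept
                A[j] = A[0]
    assert all(x == "item" for x in A)
    return tau + (p - 1) * m           # assumed cost of the pipelined reads

print(simple_broadcast(p=8, tau=16, m=4))   # -> 44
\end{verbatim}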
 
We are now ready for the following theorem 
that essentially establishes the fact that a k-ary balanced tree 
broadcasting algorithm is the best possible for this problem (recall that we 
earlier made the assumption that $\tau$ is an integral multiple of $m$).

Proof: We start by describing the algorithm. 
Let k be an integer to be determined later. The algorithm 
can be viewed as a k-ary tree rooted at location $A[0]$; there 
are $\lceil \log p / \log k \rceil$ rounds. During the first round, 
$A[0]$ is broadcast to locations $A[1], \ldots, A[k-1]$, using the 
algorithm described in Lemma 3.1, 
followed by a synchronization barrier. Then during the second round, 
each element in locations $A[0], \ldots, A[k-1]$ is broadcast 
to a distinct set of k-1 locations, and so on. The communication 
cost incurred during each round is given by $\tau + (k-1)m$ 
(Lemma 3.1). Therefore the total communication cost is 
$(\tau + (k-1)m)\left\lceil \frac{\log p}{\log k} \right\rceil$. If we set 
$k = \frac{\tau}{m} + 1$, then the total communication cost is at most 
$2\tau \left\lceil \frac{\log p}{\log(\tau/m + 1)} \right\rceil$.
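The round structure of this algorithm can be checked with the short sketch below (ours); the per-round charge $\tau + (k-1)m$ and the choice $k = \tau/m + 1$ follow the reconstruction above and are assumptions, not quotations from the original.

\begin{verbatim}
# Sketch: count the rounds of the k-ary tree broadcast and the total cost
# under the assumed per-round charge tau + (k-1)m from Lemma 3.1.
def kary_broadcast_cost(p, tau, m, k):
    informed, rounds = 1, 0
    while informed < p:                   # one synchronization barrier per round
        informed = min(p, informed * k)   # every informed location adds k-1 copies
        rounds += 1                       # rounds = ceil(log p / log k)
    return rounds * (tau + (k - 1) * m)

p, tau, m = 1024, 64, 4
k = tau // m + 1                          # the choice k = tau/m + 1 made in the proof
print(kary_broadcast_cost(p, tau, m, k))  # 3 rounds of cost 2*tau each -> 384
\end{verbatim}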
We next establish the lower bound stated in the theorem. 
Any broadcasting algorithm using only read, write, and 
synchronization barrier instructions can be viewed as operating in phases, where
each phase ends with a synchronization barrier (whenever there is more 
than a single phase). Suppose there are s phases. 
The amount of communication to execute 
phase i is at least $\tau + c_i m$, where $c_i$ is the maximum number of 
copies read from any processor during phase i. Hence the total amount of 
communication required is at least $s\tau + m\sum_{i=1}^{s} c_i$. 
Note that by the end of phase i, the desired item has reached at most 
$\prod_{j=1}^{i}(c_j + 1)$ remote locations. 
It follows that, if by the end of phase s, the desired item has reached all 
the processors, we must have $\prod_{j=1}^{s}(c_j + 1) \ge p$. 
The communication time $s\tau + m\sum_{i=1}^{s} c_i$ is minimized when 
$c_1 = c_2 = \cdots = c_s = c$, say, and hence $(c+1)^{s} \ge p$. 
Therefore $s \ge \frac{\log p}{\log(c+1)}$ and the communication time is 
at least $(\tau + cm)\frac{\log p}{\log(c+1)}$. 
We complete the proof of this theorem by proving the following claim.
Claim: $\frac{\tau + (k-1)m}{\log k} \ge \frac{\tau}{\log\left(\frac{\tau}{m} + 1\right)}$, for any 
integer $k \ge 2$.

Proof of the Claim: Let $r = \frac{\tau}{m}$, 
$f(k) = \frac{r + k - 1}{\log k}$, $g(k) = \frac{r}{\log k}$, 
and $h(k) = \frac{k - 1}{\log k}$. Then, 

$$ f(k) = g(k) + h(k). $$

(Case 1) ($2 \le k \le r+1$): Since $g$ is decreasing and 
$h$ is increasing in this range, the claim follows easily by noting that 
$g(k) \ge g(r+1) = \frac{r}{\log(r+1)}$ and $h(k) \ge h(2) > 0$. 
(Case 2) ($k > r+1$): We show that $f$ is increasing when $k > r+1$ 
by showing that $f(k+1) - f(k) \ge 0$ for all integers $k > r+1$. Note that since 
$r \le k - 1$ in this range, the difference $f(k+1) - f(k)$ can be bounded from 
below by a quantity that is positive for all nonzero 
integer values of k. Hence $f(k+1) \ge f(k)$ and the claim follows. 
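As a numerical sanity check on the claim in the reconstructed form used above (the definitions of $r$ and $f$ are ours), the following sketch verifies that $f(k) = (r+k-1)/\log k$ never falls below $r/\log(r+1)$ over a range of integer values of $k$ and $r$; it is only an illustration, not part of the proof.

\begin{verbatim}
import math

# Numeric illustration of the (reconstructed) claim:
#   f(k) = (r + k - 1) / log2(k)  >=  r / log2(r + 1)   for every integer k >= 2.
def f(k, r):
    return (r + k - 1) / math.log2(k)

for r in (1, 2, 4, 8, 16, 64):
    bound = r / math.log2(r + 1)
    worst = min(f(k, r) for k in range(2, 5000))
    print(f"r={r:3d}  min_k f(k)={worst:7.3f}  r/log2(r+1)={bound:7.3f}")
    assert worst >= bound
\end{verbatim}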
The sum of p elements on a p-processor BDM can be computed in at most 
$2\tau \left\lceil \frac{\log p}{\log(\tau/m + 1)} \right\rceil$ communication time by 
using a similar strategy. 
Based on this observation, it is easy to show the following theorem.
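The "similar strategy" is, in our reading, the broadcast tree run in reverse with partial sums formed at each level; the sketch below (ours) only illustrates that combining pattern, and the grouping by k is an assumption.

\begin{verbatim}
# Sketch: sum of p values, one per processor, by combining groups of k
# partial sums per round (the k-ary broadcast tree run in reverse).
def tree_sum(values, k):
    partial = list(values)
    while len(partial) > 1:
        # one representative processor per group reads and adds k partial sums
        partial = [sum(partial[i:i + k]) for i in range(0, len(partial), k)]
    return partial[0]

print(tree_sum(range(16), k=4))   # 120, after ceil(log_k p) = 2 rounds
\end{verbatim}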

Another simple broadcasting operation is when each processor has to broadcast 
an item to all the remaining processors. This operation can be executed 
within the communication time bound given 
in the next lemma.

Proof: 
The bound of $\tau + (p-1)m$ follows from the simple 
algorithm described in Lemma 3.1. 
If p is significantly larger than m, 
then we can use the following strategy. 
We use the previous algorithm until each processor has m elements. 
Next, each block of m elements is broadcast in a circular fashion 
to the appropriate processors. 
One can verify that the resulting communication complexity matches 
the bound claimed in the lemma.
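A sketch (ours) of the circular pattern used by the simple algorithm: in round $r$, processor $P_j$ reads the item held by the processor $r$ positions ahead of it. The blocked variant for large $p$ would circulate blocks of $m$ items instead of single items.

\begin{verbatim}
# Sketch: every processor broadcasts its item to all others; in round r,
# processor j reads the item of processor (j + r) mod p.
def all_to_all_single(items):
    p = len(items)
    gathered = [[items[j]] for j in range(p)]        # each starts with its own item
    for r in range(1, p):                            # p - 1 pipelined rounds
        for j in range(p):
            gathered[j].append(items[(j + r) % p])
    return gathered

out = all_to_all_single(list("abcdefgh"))
assert all(sorted(row) == sorted("abcdefgh") for row in out)
\end{verbatim}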
 
Our next data movement operation is the 
matrix transposition, which can be defined as follows. Let $q \ge p$ and 
let p divide q evenly, without loss of generality. The 
data layout described by A is supposed to be rearranged into 
the layout $A'$ so 
that the first column of $A'$ contains the first $\frac{q}{p}$ consecutive rows of A 
laid out in row-major order form, 
the second column of $A'$ contains the second set of $\frac{q}{p}$ 
consecutive rows of A, and so on. Clearly, if q=p, this corresponds 
to the usual notion of matrix transpose.
An efficient algorithm 
to perform matrix transposition on the BDM model is similar 
to the algorithm reported in [8]. There are p-1 rounds that can be 
fully pipelined by using prefetch read 
operations. During the first round, the appropriate 
block of $\frac{q}{p}$ elements in the ith column of A is read by processor 
$P_{(i+1) \bmod p}$ into 
the appropriate locations, 
for $0 \le i \le p-1$. During the second round, the appropriate block of data 
in column i is read by processor $P_{(i+2) \bmod p}$, and so on. 
The resulting total communication time is given by 
$\tau + (p-1)\left\lceil \frac{q}{pm} \right\rceil m$ 
and the amount of local computation is $O(q)$. 
Clearly this algorithm is optimal whenever pm divides q. Hence we have 
the following lemma.
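The data movement itself can be pictured with the following sketch (ours); the placement of each block matches the definition of $A'$ given above, while the particular circular order of the reads is an assumption.

\begin{verbatim}
# Sketch: blocked transposition of a q x p layout A (one column per processor)
# into A', where column j of A' holds rows j*q/p .. (j+1)*q/p - 1 of A in
# row-major order.  In round r, processor P_j reads a block from column (j+r) mod p.
def bdm_transpose(A):
    p, q = len(A), len(A[0])
    b = q // p                              # block size q/p (assume p divides q)
    Ap = [[None] * q for _ in range(p)]
    for r in range(p):                      # r = 0 is the local block
        for j in range(p):
            i = (j + r) % p                 # source column for processor P_j this round
            for t in range(b):
                Ap[j][t * p + i] = A[i][j * b + t]
    return Ap

A = [[10 * i + x for x in range(8)] for i in range(4)]      # p = 4, q = 8
Ap = bdm_transpose(A)
assert Ap[1] == [A[i][2] for i in range(4)] + [A[i][3] for i in range(4)]
\end{verbatim}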

We next discuss the broadcasting operation of a block of n elements residing on a single processor to p processors. We describe two algorithms: the first is suitable when the number n of elements is relatively small, and the second is more suitable for large values of n. Both algorithms are based on circular data movement, as used in the matrix transposition algorithm. The details are given in the proof of the next theorem.

Proof: For the first algorithm, 
we use a k-ary tree as in the single-item broadcasting algorithm 
described in Theorem 3.1, where $k = \frac{\tau}{m} + 1$. 
Using the matrix transposition strategy, distribute the n elements to be broadcast 
among k processors, where each processor receives a contiguous block of 
size $\frac{n}{k}$. We now view the p processors as partitioned into k groups, 
where each group includes exactly one of the processors that contains a block of 
the items to be broadcast. The procedure is repeated within each group, and so on. 
A similar reverse process can gradually read all the n items into each processor. 
Each forward or backward phase is carried out by using the cyclic data movement of 
the matrix transposition algorithm. One can check that the communication time can 
be bounded as follows.

If n > pm, we can broadcast the n elements in 
$O(\tau + n)$ communication time 
using the matrix transposition algorithm of Lemma 3.3 twice, once to 
distribute the n elements among the p processors, where each processor 
receives a block of size $\frac{n}{p}$, and the second time to circulate these 
blocks to all the processors.
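A sketch (ours) of this two-phase scheme: scatter the n elements into blocks of size $n/p$, then circulate the blocks so that every processor assembles all of them; the block size and circular order are taken from the reconstruction above.

\begin{verbatim}
# Sketch: broadcast n elements held by processor 0 when n > pm.
# Phase 1 scatters block j to processor j; phase 2 circulates the blocks.
def two_phase_broadcast(data, p):
    n = len(data)
    b = n // p                                              # block size n/p
    blocks = [data[j * b:(j + 1) * b] for j in range(p)]    # phase 1 (scatter)
    result = [[None] * n for _ in range(p)]
    for j in range(p):
        for r in range(p):                                  # phase 2 (circulate)
            i = (j + r) % p
            result[j][i * b:(i + 1) * b] = blocks[i]
    return result

out = two_phase_broadcast(list(range(12)), p=3)
assert all(row == list(range(12)) for row in out)
\end{verbatim}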
 
The problem of distributing n elements from a single processor can be solved by using the first half of either of the above two broadcasting algorithms. Hence we have the following corollary.

We finally address the following general routing problem.
Let A be an $\frac{n}{p} \times p$ array of n elements initially stored 
one column per processor in a p-processor BDM machine. Each element of A 
consists of a pair (data, i), where i is the index of the processor 
to which the data has to be relocated. We assume that at most 
$\lambda \frac{n}{p}$ elements have to be routed to any single processor, 
for some constant $\lambda$. We describe in what follows a randomized 
algorithm that completes the routing in $O\!\left(\tau + \frac{cn}{p}\right)$ communication 
time and $O\!\left(\frac{cn}{p}\right)$ computation time, where c is any constant larger 
than $\lambda$. 
The complexity bounds are guaranteed to hold with high 
probability, that is, with probability at least $1 - p^{-\gamma}$, for some 
positive constant $\gamma$, as long as $n \ge p^{2} \ln p$, 
where $\ln$ is the logarithm to the base e.
The overall idea of the algorithm has been used in various randomized routing algorithms on the mesh. Here we follow more closely the scheme described in [20] for randomized routing on the mesh with bounded queue size.
Before describing our algorithm, we introduce some terminology.
We use an auxiliary array $B$ of size $\frac{cn}{p} \times p$ for 
manipulating the data during the intermediate stages and for holding 
the final output, where $c > \lambda$. 
Each column of $B$ will be held in a processor. The array $B$ can be 
divided into p equal-size slices, each slice 
consisting of $\frac{cn}{p^{2}}$ consecutive rows of $B$. Hence a slice contains 
a set of $\frac{cn}{p^{2}}$ consecutive elements from each column, and such 
a set is referred to as a slot. We are ready to describe our algorithm.
Algorithm Randomized_Routing
Input: An input array A of size $\frac{n}{p} \times p$ such that each element 
of A consists of a pair (data, i), where i is the processor index 
to which the data has to be routed. No processor is the destination of 
more than $\lambda \frac{n}{p}$ elements, for some constant $\lambda$.

Output: An output array $B$ of size $\frac{cn}{p} \times p$ holding the 
routed data, where c is any constant larger than $\lambda$.
begin

Step 1. Each processor $P_j$ distributes randomly its $\frac{n}{p}$ 
   elements into the p slots of the jth column of $B$.

Step 2. Perform a matrix transposition of $B$ so that the jth slice will be stored in 
   the jth processor, for $0 \le j \le p-1$.

Step 3. Each processor $P_j$ distributes locally its 
   elements such that every element of the form (*, i) 
   resides in slot i, for $0 \le i \le p-1$.

Step 4. Perform a matrix transposition of $B$ (hence the jth 
   slice of the layout generated at the end of Step 3 now resides in $P_j$).

end
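The following sketch (ours) mirrors Steps 1 through 4 on ordinary Python lists; slot capacities, overflow handling, and the exact bookkeeping of the array $B$ are simplified, so it illustrates only the data flow, not the stated complexity bounds.

\begin{verbatim}
import random

# Sketch of Algorithm Randomized_Routing: columns[j] holds the (data, dest)
# pairs of processor P_j; the two transpositions become explicit slice moves.
def randomized_routing(columns, p, seed=0):
    rng = random.Random(seed)
    # Step 1: scatter each column's elements at random over its p slots.
    slots = [[[] for _ in range(p)] for _ in range(p)]      # slots[j][s]
    for j in range(p):
        for elem in columns[j]:
            slots[j][rng.randrange(p)].append(elem)
    # Step 2: transposition -- slice s (slot s of every column) moves to processor s.
    slices = [[slots[j][s] for j in range(p)] for s in range(p)]
    # Step 3: each processor buckets its slice locally by destination index.
    by_dest = [[[] for _ in range(p)] for _ in range(p)]
    for s in range(p):
        for part in slices[s]:
            for data, dest in part:
                by_dest[s][dest].append((data, dest))
    # Step 4: transposition back -- processor j collects bucket j of every slice.
    return [[e for s in range(p) for e in by_dest[s][j]] for j in range(p)]

p = 4
cols = [[("x%d%d" % (j, t), (j + t) % p) for t in range(8)] for j in range(p)]
out = randomized_routing(cols, p)
assert all(dest == j for j in range(p) for _, dest in out[j])
\end{verbatim}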
The next two facts will allow us to derive the complexity bounds for our
randomized routing algorithm. For the analysis, we assume that 
$n \ge p^{2} \ln p$.

Proof: The procedure performed by each processor is similar to the 
experiment of throwing $\frac{n}{p}$ balls into p bins. Hence the 
probability that exactly $k$ balls are placed in any 
particular bin is given by the binomial distribution 

$$ b(k; N, q) = \binom{N}{k} q^{k} (1-q)^{N-k}, $$

where $N = \frac{n}{p}$, $q = \frac{1}{p}$, and $0 \le k \le N$. 
Using the following Chernoff bound for estimating the tail 
of the binomial distribution 

$$ \sum_{k \ge \lceil a N q \rceil} b(k; N, q) \le \left( \frac{e^{a-1}}{a^{a}} \right)^{Nq}, \qquad a \ge 1, $$

we obtain that the probability that a particular bin has more than 
$\frac{cn}{p^{2}}$ balls is upper bounded by 

$$ \left( \frac{e^{c-1}}{c^{c}} \right)^{n/p^{2}}. $$

Therefore the probability that any of the bins has more than $\frac{cn}{p^{2}}$ 
balls is bounded by $p \left( \frac{e^{c-1}}{c^{c}} \right)^{n/p^{2}}$, and the lemma follows. 
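A quick Monte Carlo illustration (ours) of this balls-into-bins step: the fullest of the p bins stays close to the mean $n/p^{2}$, which is what the tail bound above, reconstructed in a standard Chernoff form, quantifies.

\begin{verbatim}
import random

# Monte Carlo check: throw n/p balls into p bins and report the fullest bin,
# to be compared with c * n / p^2 for a small constant c > 1.
def max_load(balls, p, trials=200, seed=1):
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        bins = [0] * p
        for _ in range(balls):
            bins[rng.randrange(p)] += 1
        worst = max(worst, max(bins))
    return worst

p, balls = 16, 4096                     # mean load per bin is balls / p = 256
print(max_load(balls, p), "vs 2 * mean =", 2 * balls // p)
\end{verbatim}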
  

Proof: The probability that an element is assigned to the jth slice 
by the end of Step 1 is $\frac{1}{p}$. Hence the probability that 
more than $\frac{cn}{p^{2}}$ elements destined for a single processor fall in the jth 
slice is bounded by $\left( \frac{e^{c/\lambda - 1}}{(c/\lambda)^{c/\lambda}} \right)^{\lambda n/p^{2}}$, 
since no processor is the destination of more than $\lambda \frac{n}{p}$ elements. 
Since there are p slices, the probability that more than 
$\frac{cn}{p^{2}}$ elements in any processor are destined for the same processor 
is bounded by 

$$ p^{2} \left( \frac{e^{c/\lambda - 1}}{(c/\lambda)^{c/\lambda}} \right)^{\lambda n/p^{2}}, $$

and hence the lemma follows. 
From the previous two lemmas, it is easy to show the following theorem.

Remark: Since we are assuming that $n \ge p^{2} \ln p$, the effect of the 
parameter m is dominated by the bound $O\!\left(\frac{n}{p}\right)$ (as 
$pm \le \frac{n}{p}$, 
assuming $m \le \ln p$). 

 
 
    
   