Scalable fast multipole methods on distributed heterogeneous architectures

TitleScalable fast multipole methods on distributed heterogeneous architectures
Publication TypeConference Papers
Year of Publication2011
AuthorsHu Q, Gumerov NA, Duraiswami R
Conference NameHigh Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
Date Published2011/11//
Keywordsaccelerators;OpenMP;analysis, algorithm;iterative, and, architecture;CUDA;FMM, architectures;, architectures;divide-and-conquer, based, conquer, CPU-GPU, CPU;scalable, data, fast, heterogeneous, loop;data, loop;multicore, methods;graphics, methods;multiprocessing, methods;time, multipole, parts;distributed, PROCESSING, stepping, structures;divide, structures;GPU, systems;parallel, translation, units;iterative

We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide- and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time- stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.