
Dates and Venues | Abstract | Target Audience | What to Expect | Where is this tutorial coming from | Organizer | Publications

How to Think Algorithmically in Parallel?

Or, Parallel Programming through Parallel Algorithms

Full-day Tutorial

Date and Venue:

    June 17, 2007, Seattle, WA, in conjunction with the 21st ACM International Conference on Supercomputing (ICS).

Abstract

The abstract model used in standard undergraduate courses on algorithms and data structures for serial computing is known as the RAM, for random-access machine (or model). What should be the proper counterpart for parallel computing?

Following a fierce "battle of ideas" during most of the 1980s regarding this very question, a clear winner emerged: the parallel random-access machine, or model, widely known as the PRAM (pronounced pee-RAM). The main assumption of the PRAM is that the latency for an arbitrary number of memory accesses is the same as for a single access.

The PRAM has been the model of choice for the major theory-related communities. During 1988-90, a substantial PRAM chapter was included in at least three then-popular standard algorithms textbooks: Baase (1988), Cormen-Leiserson-Rivest (1990, 1st edition), and Manber (1989). In fact, in terms of magnitude and breadth the PRAM has no serious contender: it emerged as the clear winner of a natural selection process that took place during the 1980s.

The PRAM did not fare as well with architecture communities in the early 1990s, when chip multiprocessing was not yet possible. The most relevant criticism argued correctly that there was no way to get sufficient bandwidth from multi-chip multiprocessors to allow the programmer to view the machine as a PRAM.

A very small fraction of the tutorial's instruction time is devoted to programming, as opposed to reasoning about algorithms; programming is to be picked up from the accompanying documents (e.g., a programming tutorial and manual). This is a strength: (i) algorithm design is the idea part of a program, where the real thinking takes place, while (ii) programming (including parallel programming) should be a skill that can be acquired. Indeed, in serial computing, programming is typically not taught beyond the freshman year, whereas algorithms and data structures are taught all the way to graduate school. The short time devoted to programming attests to the relative ease of parallel programming in this framework. It should be clear that easy parallel programming is the overriding objective of the overall approach.

The (relatively short) architecture part of the tutorial will make the case that an on-chip parallel processor can provide sufficient bandwidth between processors and memories, and discuss the recent (FPGA) commitment to silicon of an explicit multi-threaded (XMT) architecture guided by a "PRAM-On-Chip" vision.



Target audience

To function in the new world of multi-core computer systems, programmers can no longer avoid programming for parallelism. Any conference participant who recognizes this should find it helpful to understand the thinking process behind such programming. In fact, the short history of parallel computing provides many examples of parallel architectures whose designers expected the thought process behind programming them to emerge only after the machines were deployed. Unfortunately, build-first, figure-out-how-to-think-and-program-later has its limits.

The opportunity that the tutorial will provide for hands-on experience could have a special appeal for the general audience of the conference.



What to expect

The key knowledge and ideas that participants can take away are:

  • Salient examples from the PRAM theory of parallel algorithms.
  • Identification of parallelism in existing "serial" algorithms, to the extent possible; the main issue then becomes fitting new parallel data structures.
  • Reasoning about performance metrics, such as the number of processors and analytic run time, within the formal PRAM model.
  • Thinking algorithmically in parallel. A PRAM algorithm description requires almost a computer-program-like level of detail. Having to deal upfront with lower-level details is not only tedious; it often becomes an obstacle to concentrating on the big picture. To overcome that, a description methodology that allows a drastic reduction in the level of detail is presented. The methodology guides thinking about a parallel algorithm in terms of two basic features: its total number of operations (called work), and the shortest time in which these operations could be completed, hypothetically assuming an unlimited amount of hardware (called depth). Introduced by Shiloach and Vishkin in 1982, what makes the methodology work is that it is a matter of skill to later fill in the details that the Work-Depth description teaches to initially suppress. Furthermore, limited training (less than a one-semester course) is sufficient to provide the needed skill. This Work-Depth methodology provides the description platform for several parallel algorithms books that have appeared since the early 1990s.
  • Reasoning about the practical implications of PRAM performance analysis.
  • The prototype XMT parallel processor and its PRAM-like programming: ways of filling in the details that the Work-Depth description teaches to initially suppress, towards an efficient, extremely fine-grained multi-threaded program, and the potential for delegating an increasing part of that work to compilers.
  • Advancing a non-trivial parallel algorithm into a program, and running that program on real PRAM silicon: an FPGA-based, 64-processor, 75MHz PRAM-On-Chip computer at the University of Maryland.


Where is this tutorial coming from?

Material for this tutorial is taken from a graduate course on Parallel Algorithms at the University of Maryland that is cross-listed between computer science and electrical and computer engineering. The course is unusual in that it has its own computer, its own compiler, and its own programming language. The tutorial provides coherent snapshots from this "UMD course planet".

The core material is an updated, shorter version of the course class notes. Documents featuring a tutorial and a manual for a modest extension of C (called XMT-C), used for programming, will be provided.



Organizer and presenter

Uzi Vishkin.
E-Mail: last name at umiacs.umd.edu.
Home page: http://www.umiacs.umd.edu/~vishkin/

Bio

Uzi Vishkin has been a permanent member of the University of Maryland Institute for Advanced Computer Studies (UMIACS) and Professor of Electrical and Computer Engineering since 1988. He received his DSc in Computer Science from the Technion-Israel Institute of Technology and his MSc and BSc in Mathematics from the Hebrew University, Jerusalem, Israel. He was Professor of Computer Science at Tel Aviv University and the Technion, research faculty at the Courant Institute, NYU, and a post-doc at IBM T.J. Watson. He is a Fellow of the ACM and an ISI-Thomson Highly-Cited Researcher.



Relevant Publications

  • J. JaJa, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
  • J. Keller, C.W. Kessler and J.L. Traeff. Practical PRAM Programming. Wiley-Interscience, 2001.
  • Y. Shiloach and U. Vishkin. An O(n^2 log n) parallel max-flow algorithm. J. of Algorithms 3 (1982), 128-146.

The more recent publications below are either already posted, or will be posted, at http://www.umiacs.umd.edu/~vishkin/XMT

  • U. Vishkin, S. Dascal, E. Berkovich and J. Nuzman. Explicit Multi-Threading (XMT) bridging models for instruction parallelism. ACM-SPAA'98, 140-151.
  • D. Naishlos, J. Nuzman, C-W. Tseng, and U. Vishkin. Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach. TOCS 36,5 (2003), 521-552, Springer-Verlag. Special issue for ACM-SPAA'01.
  • X. Wen and U. Vishkin. PRAM-On-Chip: 1st Commitment to Silicon. ACM-SPAA'07, June 9-11, 2007, to appear.
  • U. Vishkin, G. Caragea and B. Lee. Models for advancing PRAM and other algorithms into programs for a PRAM-On-Chip platform. In Handbook of Parallel Computing: Models, Algorithms and Applications. Editors: S. Rajasekaran and J. Reif. CRC Press, Sept. 2007, to appear. Includes a metric that relates PRAM algorithms to their expected run time on the UMD PRAM silicon.
  • A. Balkan, G. Qu and U. Vishkin. Mesh-of-Trees and Alternative Interconnection Networks for Single-Chip Parallel Processing. In Proc. 17th IEEE Int. Conf. on Application-specific Systems, Architectures and Processors (ASAP 2006), Steamboat Springs, CO, Sept. 2006, 73-80. Best Paper Award.
  • A. Balkan, M. Horak, G. Qu, and U. Vishkin. Layout-Accurate Design and Implementation of a High-Throughput Interconnection Network for Single-Chip Parallel Processing. Hot Interconnects 2007, Stanford University, CA, August 22-24, 2007.
