From Algorithm Parallelism to Instruction-Level Parallelism: An Encode-Decode Chain Using Prefix-Sum (extended abstract)

Uzi Vishkin

ABSTRACT

A novel comprehensive and coherent approach for the purpose of increasing instruction-level parallelism (ILP) is devised. The key new tool in our envisioned system update is the addition of a parallel prefix-sum (PS) instruction, which will have efficient implementation in hardware, to the instruction-set architecture. This addition gives for the first time a concrete way for recruiting the whole knowledge base of parallel algorithms for that purpose. The potential increase in ILP is demonstrated by experimental results for a test application. The main technical contribution is in the form of a ``completeness theorem''. Perhaps surprisingly, the current abstract proves that in an envisioned system which employs parallel PS functional units, a proper use of a serial programming language suffices for the following. With a moderate effort, one can program a parallel algorithm (in a serial language), so that a parallelizing compiler (even without run-time methods!) will be able to extract the same (i.e., ``complete'') ILP from such serial code as from code written in a parallel language. Alternatively, rather than have the programmer produce the serial code, a precompiler could derive it from a parallel language. The most interesting idea in the proof is the reliance on the new parallel PS for circumventing collision-ambiguity in references to memory. Other new ideas in the paper include hardware-design of a prefix-sum unit and an on-line algorithm for high-bandwidth register-files. An informal upshot of this paper is the following general insight: to accommodate parallelism in uniprocessor systems (from algorithms to ILP), it is sufficient to only add (and, of course, incorporate) parallel prefix-sum functional units to standard serial system designs.