From Algorithm Parallelism to Instruction-Level
Parallelism: An Encode-Decode Chain Using Prefix-Sum
(extended abstract)
Uzi Vishkin
ABSTRACT
A novel comprehensive and coherent approach for the purpose of
increasing instruction-level parallelism (ILP) is devised.
The key new tool in our envisioned system update is the addition of a parallel
prefix-sum (PS) instruction, which will have efficient implementation in
hardware, to the instruction-set architecture.
This addition
gives for the first time a concrete way for recruiting the whole
knowledge base of parallel algorithms for that purpose.
The potential increase in ILP
is demonstrated by experimental results for a test application.
The main technical contribution is in the form of a ``completeness theorem''.
Perhaps surprisingly, the current abstract proves that in an envisioned system
which employs parallel PS functional units, a proper
use of a serial programming language suffices for the following.
With a moderate effort, one can program a parallel algorithm
(in a serial language), so
that a parallelizing compiler
(even without run-time methods!) will be able to extract the same (i.e.,
``complete'') ILP from such serial code as from code written in a
parallel language.
Alternatively, rather than have the programmer produce the serial code,
a precompiler could derive it from a parallel language.
The most interesting idea in the proof is the reliance on the new parallel PS
for circumventing collision-ambiguity in references to memory.
Other new ideas in the paper include hardware-design of a prefix-sum
unit and an on-line algorithm for high-bandwidth register-files.
An informal upshot of this paper is the following general insight:
to accommodate
parallelism in uniprocessor systems (from algorithms to ILP), it is sufficient
to only add (and, of course, incorporate) parallel prefix-sum functional
units to standard serial system designs.