Explicit Multi-Threading (XMT): A PRAM-On-Chip Vision
Do you find the following attention-catching question fascinating?
Will a 2cmX2cm piece of silicon provide the fastest computer?
The XMT vision was introduced in 1997.
The following recent highlights suggest
that the world is getting much closer to that vision:
- Fueled by six decades of unrelenting exponential progress in CPU clock
speed, the mainstream computer architecture industry focused exclusively on
optimizing performance using standard serial software. However, rather than
having 10 GHz CPUs today, as would have been expected based on past progress,
high-end CPU speeds are at around 4GHz, more or less where they were in
mid-2003. While the popular press has mostly
missed this rather dramatic lack of progress,
it is widely recognized among experts
that this is not a coincidence and the old paradigm is indeed collapsing:
serial computing is finally running out of steam. The recognition that
business survival for that industry depends upon new parallel computer
languages and parallel architectures is just starting to sink in.
- SUN announced its new Niagara architecture with 32 processors on chip.
- Intel has announced that all their future processor chips will have
multiple processors on the chip.
See also a recently proposed framework for 2005-2015 under:
Platform 2015:
Intel Processor and Platform
Evolution for the Next Decade
and note the trend towards parallel applications in the presentation by
Justin Rattner, Intel.
- IBM, Sony and Toshiba announced their joint Cell chip multiprocessor.
(Sony announced its deployment in the Sony Play Station 3.)
- In a 2005 article entitled
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software,
Herb Sutter, Microsoft, reviews how hardware
vendors have run out of room with most of their traditional approaches to
boosting CPU performance (driving clock speeds and straight-line instruction
throughput). He suggests that this puts us at a fundamental turning point in
software development, where increased reliance on concurrency is expected.
- In a November 2004 interview with HPCWIRE, Dr. Burton Smith, Cray's Chief
Scientist said:
"...the uniprocessor has pretty well run out of steam.
Parallelism to date has been a nice strategy for HPC users and an
afterthought
for microprocessor vendors. Now, it is becoming a matter of business
survival
for all processor vendors. Parallelism is going to be taken more
seriously,
starting with the idea of exploiting multi-threading and multiple cores
on a single problem. This is a major change. Imagine if Microsoft wanted to
write Office in a parallel language. What would that language be, and what
would be the architecture to support it?
We don't have good answers to these questions yet."
XMT provides answers to these questions about language and architecture.
A more direct introduction to XMT follows.
Research over the last two decades by academic algorithm designers has
produced a huge knowledge-base of parallel algorithmic methods, which
is arguably second in its magnitude only to serial algorithms.
The model of parallel computation used for developing this knowledge-base
is called PRAM (for parallel-random-access machine, or model).
Non-expert obeservers will probably be most impressed with the fact that
superiority of the PRAM model, relative to other parallel computing
models, was established through a fierce competition among quite a few schools
of thought. This unofficial competition has been very well funded by
government and industry and involved participation by thousands of the most
talented computer science and engineering PhDs over several decades.
In other words, arbitration through Darwinistic-type natural selection among
different parallel algorithmic models is already behind us: the PRAM
model won by a wide margin.
However, this left open a fundamental question:
Is PRAM algorithmics implementable?
The XMT proposal finally resolves this very question!
The non-expert observer may want to take note that our positive answer came as
a big surprise to many, including parallel computing pundits, who have long
argued that PRAM algorithmics is "unrealistic".
The surprise may even be bigger since XMT relies on many of the great ideas that
have been developed by parallel computing researchers.
For this reason, we feel that,
first and foremost, XMT should be viewed as a vision of triumph for parallel
computing and for those who have labored for many years towards the goal of
reducing single task completion time by way of parallelism.
To name a few:
Schwartz's Ultracomputer vision of the PRAM as a theoretical yardstick,
the NYU-Ultracomputer with its Fetch-and-Add primitive, the
Shiloach-Vishkin PRAM work-time algorithmic framework (harnessing Brent's
appearingly too-abstract theorem), NESL's
system-supported nesting of parallelism, single-program multiple-data (SPMD)
programming languages, Data-Flow's Fork-Join, the SB-PRAM multi-processors,
the HEP-Tera thread switching, the IBM-360 out-of-order execution, Cray's
compiler-based vector parallelism, the U. Wisconsin-Multiscalar
multiple program counter (PC) architecture, and the U. Washington Simultaneous
Multi-Threading (SMT) technique for mixing instructions from several PCs
for a more effective use of functional units.
The broad XMT framework seeks to build on these ideas,
and attempts to make use of the ``Billion-transistor-chip''
technological opportunity towards an eventual resolution of the two questions
above.
The above names and efforts are meant to demonstrate that
XMT should be viewed as an enhancement of past great ideas
in parallel computing, and to build on and promote this past work.
XMT aims to prove that parallel computing, as a field, was right!
It should be noted that
perhaps the most common misunderstanding about XMT is that ``it is only an
architecture''.
In fact,
XMT is a fine-grained parallel computational framework
covering the spectrum from algorithms through architecture
to implementation.
Architecture is a crucial, though relatively small, component:
the main implementation related innovation in XMT
was through the incorporation of low-overhead hardware and software
mechanisms (for more effective fine-grained parallelism).
DISCLAIMER: The books by Almasi and Gottlieb, by Culler, Singh and Gupta, and
by JaJa provide a proper scholarly treatment of parallel computing and
describe the evolution of ideas in the field.
The above names are only for demonstration purpose.
Expanding on the first question.
1-Billion transistor chips, up by a factor of more than 30,000
(from 30,000 transistors) sometimes in the 1980s, are becoming a reality.
What should we do with all this hardware?
Many will agree that
the standard computing framework (e.g., the one which is based on Intel's X86
instruction set architecture platform) does not provide a good answer
and there is need to considerably update it.
The XMT proposal aspires to provide an answer.
(For a pointer to other schools-of-thought which address the same motivating
question see: Computer, Volume 30, Number 9, September 1997.)
A first priority in the design of XMT has been to harness the computing
power of such chips for the purpose of reducing the completion time of single
computation tasks.
Expanding on the second question.
Explicit Multi-Threading (XMT) is a new conceptual framework for
general-purpose computing.
The (virtual) thread structure of PRAM-like algorithms
is very dynamic: the number of threads that need to be generated changes
frequently, new threads are generated and terminated frequently, and
often threads are relatively short.
Perhaps the most distinguishing feature about the XMT framework is that it
envisions an extension to a standard instruction set which aspires to
efficiently implement PRAM-like algorithms; XMT does so using explicit
multi-threaded instruction-level parallelism (ILP).
In XMT, we tried to minimize changes to current practice:
new elements are added to computer languages, architecture
and implementation, only where needed.
What to read in order to quickly get a technical sense of what XMT is about?
1.
Chapter 2 in this 26-page paper
summarizes the XMT framework.
Note, in particular:
- The apparatus consisting of a stored program coupled with a program counter
that has come to dominate the control of general-purpose computers since the
1940s ("von-Neumann's architecture").
An extension of that apparatus to accommodate parallel programming
and low-overhead parallel execution is proposed.
- Hardware implementation of Fetch-And-Add as a functional unit is proposed.
- Single-program multiple-data (SPMD) programming of PRAM-like
algorithms along with "independence of order semantics" are proposed.
2.
A recent poster encapsulating salient features.
3. The latest general talk that you can find under the title Talks below.
But, how good parallel algorithms really are?
The SPAA'98, WAE'99, MTEAC'2000, HIPS'01 and TOCS'03
papers below provide examples where the XMT performance
of algorithms is quantitatively measured through a combination of experiments
and (non asymptotic!) analytic insights. The third question points
to a great deal of future research on experimental algorithmics.
Newsletter story
The Spring 2004 edition of the Electrical and Computer Engineering Department
Newsletter runs a story about 3 recent NSF (National Science Foundations) grants
awarded to its faculty from the Medium ITR (Information Technology Research)
program. To read about the vision behind the PRAM-On-Chip NSF-ITR 5-year
project click on the "2004 Spring" link from the
Newsletter page
.
Talks
(General talk:) Explicit Multi-Threading (XMT):
A Theorist's Proposal for a Big Little Computer System.
Download talk (27 transparencies).
Versions of the SPAA'98 talk were (or are planed to be) given at
IBM Haifa Research Laboratory (8/98),
University of Maryland Computer Engineering Colloquium (9/98),
The Computer Architecture and Parallel System Laboratory,
University of Delaware (10/98),
The Center for Research on Parallel Computation,
Rice University (10/98), CASES98:
Workshop on Compiler and Architecture Support for Embedded Systems, Washington, DC (12/98), New York University (12/98), IBM T.J. Watson (12/98), the
Workshop on Parallel Algorithms (WOPA'99), Atlanta (5/99),
Technion - Israel Institute of Technology (5/99), and Georgia Institute of
Technology (3/00).
(Specialized talk:) Experiments with list ranking for Explicit Multi-Threaded
(XMT) instruction parallelism (by S. Dascal and U. Vishkin).
3rd Workshop on Algorithms Engineering (WAE'99), London, U.K., July 1999.
Download talk (13 transparencies).
(General talk) What to do with all this hardware? Could the PRAM-On-Chip architecture
lead to upgrading the WINTEL performance-to-productivity platform?
Download abstract.
Download talk (postscript, slides in 6 quads)
or
(pdf, 6 quads)
Invited talks were given at the Intel Microprocessor Research Lab (MRL), Santa
Clara, CA, and at Microsoft Research, Redmond, WA, in January 2002, and at
IBM T.J. Watson Research Center, Yorktown Heights, NY, in June 2002.
Link to the Multi-University Research Lab Seminar Series
for a recording of the Microsoft Research talk (based on the 6 quads of slides
above).
(Specialized talk)
Bending Light for Multi-Chip Virtual PRAMs?
(12 slides),
3rd Workshop on Non-Slicon Computation,
held in conjunction with
the 31st International Symposium on Computer Architecture (ISCA 2004),
June 19-23, 2004, Munich, Germany.
(Specialized talk)
Pilot development time producivity study on the Spring 2004 course Parallel
Algorithmics (12 slides),
DARPA's High Productivity Computing Systems Program and Productivity Summit,
Fairfax, VA
June 29-30, 2004.
For context, see http://www.highproductivity.org/
(General talk)
PRAM-On-Chip: A Quest for Not-So-Obvious Non-Obviousness.
Download abstract
.
Download talk (pdf, 33 slides)
.
Invited talks were given at Intel Research Lab, Petach Tiqwa (Israel), 8/2004,
and at MFCS'04, Prague, 8/2004.
(General talk)
Can Parallel Computing Finally Impact Mainstream Computing?
Download abstract
.
Download talk (pdf, 40 slides)
.
Invited talks were given at Microsoft Research and Intel Research Pittsbrugh,
3/2005 and 4/2005.
Publications
U. Vishkin, S. Dascal, E. Berkovich and J. Nuzman.
Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism
(Extended Abstract).
In Proc. 10th ACM Symposium on Parallel Algorithms and
Architectures (SPAA), 1998.
Note: This paper introduces XMT to readers whose background includes
reasonable understanding of algorithms and theory.
The presentation is by way of a bridging model.
Among other things, a bridging model provides a design space for
algorithm designers and programmers, as well as a design space for computer
architects. It is convenient to describe our wider vision regarding
``parallel-computing-on-a-chip'' as a two-stage development and therefore
two bridging models are presented:
Spawn-based multi-threading (Spawn-MT) and Elastic multi-threading (EMT).
Download TR (12 pages).
U. Vishkin, S. Dascal, E. Berkovich and J. Nuzman.
Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism
(Extended Summary and Working Document).
Current version of UMIACS TR-98-05 . First version: January 1998.
Note: This long paper is really the MAIN publication for the XMT framework.
The SPAA'98 paper gives a principled presentation of XMT. It also introduces
XMT to readers with algorithms and theory background.
The next central presentation for the project is the journal version
of the SPAA'01 paper.
In contrast, the September 1999 TR UMIACS99-55 by Berkovich, Nuzman, Franklin,
Jacob and Vishkin is not a good single representative of the XMT project,
as a whole; while it
describes possible architecture choices for the XMT framework,
and studies their microarchitecture performance implications,
some part of it are misleading when it comes to understanding
choices actually made for XMT.
The paper
emphasizes an important ``decentralization'' property of the architecture.
Its performance hardly
degrades even when the number of cycles for traversing wires (interconnects)
increases.
The introduction that the latter paper provides to XMT is less comprehensive
and less direct
than the SPAA98 paper, as some aspects of XMT are not addressed, and
some architecture choices were made for the sake of completeness of the
architecture and not because they are required by the XMT paradigm.
Download TR (47 pages).
S. Dascal and U. Vishkin.
Experiments with list ranking for Explicit Multi-Threaded
(XMT) instruction parallelism (extended abstract). In Proc.
3rd Workshop on Algorithms Engineering (WAE'99), London, U.K., July 1999.
Download extended abstract (17 pages).
Download talk (13 transparencies).
An extended summary of this work is available as UMIACS-TR-99-33.
Download extended summary (27 pages).
Invited to the
Special Issue for WAE99:
Journal version, from the ACM Journal of Experimental Algorithmics,
Volume 5, 2000.
E. Berkovich,
J. Nuzman,
M. Franklin,
B. Jacob,
U. Vishkin.
XMT-M: A scalable decentralized processor.
UMIACS TR 99-55,
September 1999.
Download TR (postscript, 18 pages).
Note: as the above comment following UMIACS TR-98-05 states,
UMIACS TR 99-55 could be misleading in understanding architecture choices
made for XMT and is not a good single representative publication of the
XMT project.
U. Vishkin.
A No-Busy-Wait balanced tree parallel algorithmic paradigm.
In Proc. 12th ACM Symposium on Parallel Algorithms and
Architectures (SPAA), 2000.
Download TR (postscript, 9 pages).
pdf.
D. Naishlos, J. Nuzman,
C-W. Tseng,
and U. Vishkin.
Evaluating multi-threading in the prototype XMT environment.
In Proc. 4th Workshop on Multi-Threaded Execution, Architecture and
Compilation (MTEAC2000), December 2000 (held in conjunction with the 33rd
Int. Symp. on Microarchitecture MICRO-33).
Best Paper
Award for MTEAC2000.
Download TR (pdf, 9 pages).
D. Naishlos, J. Nuzman, C-W. Tseng,
and U. Vishkin.
Evaluating the XMT Parallel Programming Model.
In Proc. of the 6th Workshop on High-Level Parallel Programming
Models and Supportive Environments (HIPS-6), April 2001.
Download TR (pdf, 6 pages).
U. Vishkin.
Two techniques for reconciling algorithm parallelism
with memory constraints (extended abstract).
In Proc. 14th ACM Symposium on Parallel Algorithms and
Architectures (SPAA), 2002.
Download TR (postscript, 4 pages).
Download talk (postscript, 13 transparencies).
D. Naishlos, J. Nuzman, C-W. Tseng,
and U. Vishkin.
Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel
Programming Approach.
The conference version appeared
in Proc. 13th ACM Symposium on Parallel Algorithms and Architectures
(SPAA-01)), July 2001.
Download TR (pdf, 10 pages).
The journal version appeared in the
invited Special Issue for SPAA01:
TOCS 36,5 pages 521-552, Springer-Verlag, 2003.
Download the TR that led to the journal version (pdf, 26 pages).
A. Balkan,
G. Qu,
and U. Vishkin.
Arbitrate-and-Move primitives for high throughput
on-chip interconnection networks.
In Proc. IEEE International Symposium on Circuits and
Systems (ISCAS), SoC Design Technology lecture session , Vancouver, May 23-26,
2004.
Download TR (pdf, 4 pages).
U. Vishkin.
Can Optical Interconnection Networks Lead to Cheaper High-Performance
Multiprocessors?
Proc. Optics East, S.P.I.E. - The International Society for Optical
Engineering, Active and Passive Optical Components for WDM Communications IV,
Volume 5595, Philadelphia, October 26-28, 2004, 162-166.
Download TR (pdf, 5 pages).
The computer science and engineering side of this work was presented in the
paper:
U. Vishkin and I. Smolyaninov.
Bending Light for Multi-Chip Virtual PRAMs?
Proc. 3rd Workshop on Non-Slicon Computation,
held in conjunction with
the 31st International Symposium on Computer Architecture (ISCA 2004),
June 19-23, 2004, Munich, Germany, 32-38.
Download TR (pdf, 7 pages).
P. Gu and U. Vishkin.
Case Study of Gate-level Logic Simulation on an Extremely Fine-Grained Chip
Multiprocessor
Journal of Embedded Computing, Special Issue on
Embedded Single-Chip Multicore Architectures and related research -
from System Design to Application Support, to appear.
The test environment including the XMT machine, parallel logic simulation
program and benchmark circuits
.
Course on Parallel Algorithms Including XMT Programming Assignments, Tutorial and Software Tools
Spring 2005: ENEE759K/CMSC751 Parallel Algorithmics.
Below please find older software tools as well as older supplemental information to XMT
publications. More recent material is available throught the Parallel Algorithms course web site above.
Code and Analysis for Selected Problems and XMT tools
Brief Introduction.
We have been guided by the premise that reproducibility is a defining
attribute of empirical research.
So far, there have been two generations reporting the XMT empirical work.
The first generation includes the SPAA'98, and WAE'99 papers.
The second generation
includes the MTEAC'2000, HIPS'01 and TOCS'03 papers.
The first generation of XMT empirical work.
A very high-level description of several algorithms appears in the SPAA98
paper.
The following four documents provide raw data that, with the addition
of minimal analysis of algorithm execution, should enable
reproduction of the numerical results reported in the SPAA98 paper.
The WAE99 publication addressed list ranking--the most involved
problem considered in these four documents--and demonstrated how to incorporate
this added analysis.
It improved the readability
of this raw data, providing a detailed high level description of the
algorithm, and reviewed the programming techniques used.
The WAE99 paper presented algorithms and code design decisions
and optimizations made, comparing them with other alternatives.
Where needed, further elaboration on how to derive the
numerical results from the raw data were given.
It also explained why compilation is feasible.
Note:
All documents below are in PostScript. The gzip compressed files can be viewed
directly by using ghostview.
-
Directed Acyclic Graph Code and Analysis by Efraim Berkovich. 63 pages.
-
QuickSort Code and Analysis by Joseph Nuzman. 15 pages.
-
Integer Radix Sort Code and Analysis by Joseph Nuzman. 22 pages.
-
List Ranking Code and Analysis by Shlomit Dascal. 100 pages.
The second generation of XMT empirical work.
The results reported in the second generation papers were obtained using
the following XMT tools.
Supplemental Documentation
The following document is Shlomit Dascal's record of specifications
and assumptions for the bridging models and their executions.
It is a useful complement to the SPAA98 paper.
Note: Documentation is in PostScript. The gzip compressed files can be viewed directly by
using ghostview.
-
Specifications and Assumptions. 33 pages.
Dearly Departed Students
-
Efraim Berkovich, MS, Electrical and Computer Engineering,
University of Maryland.
-
Shlomit Dascal, MS, Computer Science, Tel Aviv University;
was visiting graduate student at UMIACS.
-
Dorit Naishlos,
MS, Computer Science,
University of Maryland.
-
Joseph Nuzman,
MS, Electrical and Computer Engineering,
University of Maryland.
Joe started working on XMT while still an Honors ECE undergraduate student.
-
Pei Gu,
MS, Electrical and Computer Engineering,
University of Maryland.
Students in the Research Team
-
Roni Kupershtok, post-doc, Electrical and Computer Engineering,
University of Maryland.
-
Aydin Balkan, graduate student, Electrical and Computer Engineering,
University of Maryland.
-
Xingzhi Wen, graduate student, Electrical and Computer Engineering,
University of Maryland.
-
Fuat Keceli, graduate student, Electrical and Computer Engineering,
University of Maryland.
-
Adam Beytin, undergraduate student, Computer Engineering,
University of Maryland.
-
Nghi Nguen Ba, undergraduate student, Computer Engineering,
University of Maryland.
-
Paul Mazzucco, undergraduate student, Computer Engineering,
University of Maryland.
-
George Caragea, graduate student, Computer Science,
University of Maryland.
This research project is directed by
Uzi Vishkin.
Funding by the
National Science Foundation
is graciously acknowledged.
Some of the material on this page is based upon work supported by the
National Science Foundation under Grant No. 0325393.
Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views
of the National Science Foundation.