Explicit Multi-Threading (XMT): A PRAM-On-Chip Vision

Do you find the following attention-catching question fascinating?

Will a 2cmX2cm piece of silicon provide the fastest computer?

The XMT vision was introduced in 1997. The following recent highlights suggest that the world is getting much closer to that vision:
- Fueled by six decades of unrelenting exponential progress in CPU clock speed, the mainstream computer architecture industry focused exclusively on optimizing performance using standard serial software. However, rather than having 10 GHz CPUs today, as would have been expected based on past progress, high-end CPU speeds are at around 4GHz, more or less where they were in mid-2003. While the popular press has mostly missed this rather dramatic lack of progress, it is widely recognized among experts that this is not a coincidence and the old paradigm is indeed collapsing: serial computing is finally running out of steam. The recognition that business survival for that industry depends upon new parallel computer languages and parallel architectures is just starting to sink in.
- SUN announced its new Niagara architecture with 32 processors on chip.
- Intel has announced that all their future processor chips will have multiple processors on the chip. See also a recently proposed framework for 2005-2015 under: Platform 2015: Intel Processor and Platform Evolution for the Next Decade and note the trend towards parallel applications in the presentation by Justin Rattner, Intel.
- IBM, Sony and Toshiba announced their joint Cell chip multiprocessor. (Sony announced its deployment in the Sony Play Station 3.)
- In a 2005 article entitled The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Herb Sutter, Microsoft, reviews how hardware vendors have run out of room with most of their traditional approaches to boosting CPU performance (driving clock speeds and straight-line instruction throughput). He suggests that this puts us at a fundamental turning point in software development, where increased reliance on concurrency is expected.
- In a November 2004 interview with HPCWIRE, Dr. Burton Smith, Cray's Chief Scientist said: "...the uniprocessor has pretty well run out of steam. Parallelism to date has been a nice strategy for HPC users and an afterthought for microprocessor vendors. Now, it is becoming a matter of business survival for all processor vendors. Parallelism is going to be taken more seriously, starting with the idea of exploiting multi-threading and multiple cores on a single problem. This is a major change. Imagine if Microsoft wanted to write Office in a parallel language. What would that language be, and what would be the architecture to support it? We don't have good answers to these questions yet."
XMT provides answers to these questions about language and architecture.

A more direct introduction to XMT follows. Research over the last two decades by academic algorithm designers has produced a huge knowledge-base of parallel algorithmic methods, which is arguably second in its magnitude only to serial algorithms. The model of parallel computation used for developing this knowledge-base is called PRAM (for parallel-random-access machine, or model). Non-expert obeservers will probably be most impressed with the fact that superiority of the PRAM model, relative to other parallel computing models, was established through a fierce competition among quite a few schools of thought. This unofficial competition has been very well funded by government and industry and involved participation by thousands of the most talented computer science and engineering PhDs over several decades. In other words, arbitration through Darwinistic-type natural selection among different parallel algorithmic models is already behind us: the PRAM model won by a wide margin. However, this left open a fundamental question:

Is PRAM algorithmics implementable?
The XMT proposal finally resolves this very question!

The non-expert observer may want to take note that our positive answer came as a big surprise to many, including parallel computing pundits, who have long argued that PRAM algorithmics is "unrealistic". The surprise may even be bigger since XMT relies on many of the great ideas that have been developed by parallel computing researchers. For this reason, we feel that, first and foremost, XMT should be viewed as a vision of triumph for parallel computing and for those who have labored for many years towards the goal of reducing single task completion time by way of parallelism.

To name a few: Schwartz's Ultracomputer vision of the PRAM as a theoretical yardstick, the NYU-Ultracomputer with its Fetch-and-Add primitive, the Shiloach-Vishkin PRAM work-time algorithmic framework (harnessing Brent's appearingly too-abstract theorem), NESL's system-supported nesting of parallelism, single-program multiple-data (SPMD) programming languages, Data-Flow's Fork-Join, the SB-PRAM multi-processors, the HEP-Tera thread switching, the IBM-360 out-of-order execution, Cray's compiler-based vector parallelism, the U. Wisconsin-Multiscalar multiple program counter (PC) architecture, and the U. Washington Simultaneous Multi-Threading (SMT) technique for mixing instructions from several PCs for a more effective use of functional units. The broad XMT framework seeks to build on these ideas, and attempts to make use of the ``Billion-transistor-chip'' technological opportunity towards an eventual resolution of the two questions above. The above names and efforts are meant to demonstrate that XMT should be viewed as an enhancement of past great ideas in parallel computing, and to build on and promote this past work. XMT aims to prove that parallel computing, as a field, was right! It should be noted that perhaps the most common misunderstanding about XMT is that ``it is only an architecture''. In fact, XMT is a fine-grained parallel computational framework covering the spectrum from algorithms through architecture to implementation. Architecture is a crucial, though relatively small, component: the main implementation related innovation in XMT was through the incorporation of low-overhead hardware and software mechanisms (for more effective fine-grained parallelism).
DISCLAIMER: The books by Almasi and Gottlieb, by Culler, Singh and Gupta, and by JaJa provide a proper scholarly treatment of parallel computing and describe the evolution of ideas in the field. The above names are only for demonstration purpose.

Expanding on the first question. 1-Billion transistor chips, up by a factor of more than 30,000 (from 30,000 transistors) sometimes in the 1980s, are becoming a reality. What should we do with all this hardware? Many will agree that the standard computing framework (e.g., the one which is based on Intel's X86 instruction set architecture platform) does not provide a good answer and there is need to considerably update it. The XMT proposal aspires to provide an answer. (For a pointer to other schools-of-thought which address the same motivating question see: Computer, Volume 30, Number 9, September 1997.) A first priority in the design of XMT has been to harness the computing power of such chips for the purpose of reducing the completion time of single computation tasks.

Expanding on the second question. Explicit Multi-Threading (XMT) is a new conceptual framework for general-purpose computing. The (virtual) thread structure of PRAM-like algorithms is very dynamic: the number of threads that need to be generated changes frequently, new threads are generated and terminated frequently, and often threads are relatively short. Perhaps the most distinguishing feature about the XMT framework is that it envisions an extension to a standard instruction set which aspires to efficiently implement PRAM-like algorithms; XMT does so using explicit multi-threaded instruction-level parallelism (ILP). In XMT, we tried to minimize changes to current practice: new elements are added to computer languages, architecture and implementation, only where needed.


What to read in order to quickly get a technical sense of what XMT is about?

1. Chapter 2 in this 26-page paper summarizes the XMT framework. Note, in particular:
- The apparatus consisting of a stored program coupled with a program counter that has come to dominate the control of general-purpose computers since the 1940s ("von-Neumann's architecture"). An extension of that apparatus to accommodate parallel programming and low-overhead parallel execution is proposed.
- Hardware implementation of Fetch-And-Add as a functional unit is proposed.
- Single-program multiple-data (SPMD) programming of PRAM-like algorithms along with "independence of order semantics" are proposed.
2. A recent poster encapsulating salient features.
3. The latest general talk that you can find under the title Talks below.


But, how good parallel algorithms really are?
The SPAA'98, WAE'99, MTEAC'2000, HIPS'01 and TOCS'03 papers below provide examples where the XMT performance of algorithms is quantitatively measured through a combination of experiments and (non asymptotic!) analytic insights. The third question points to a great deal of future research on experimental algorithmics.

Newsletter story

  • The Spring 2004 edition of the Electrical and Computer Engineering Department Newsletter runs a story about 3 recent NSF (National Science Foundations) grants awarded to its faculty from the Medium ITR (Information Technology Research) program. To read about the vision behind the PRAM-On-Chip NSF-ITR 5-year project click on the "2004 Spring" link from the Newsletter page .

    Talks

  • (General talk:) Explicit Multi-Threading (XMT): A Theorist's Proposal for a Big Little Computer System. Download talk (27 transparencies). Versions of the SPAA'98 talk were (or are planed to be) given at IBM Haifa Research Laboratory (8/98), University of Maryland Computer Engineering Colloquium (9/98), The Computer Architecture and Parallel System Laboratory, University of Delaware (10/98), The Center for Research on Parallel Computation, Rice University (10/98), CASES98: Workshop on Compiler and Architecture Support for Embedded Systems, Washington, DC (12/98), New York University (12/98), IBM T.J. Watson (12/98), the Workshop on Parallel Algorithms (WOPA'99), Atlanta (5/99), Technion - Israel Institute of Technology (5/99), and Georgia Institute of Technology (3/00).
  • (Specialized talk:) Experiments with list ranking for Explicit Multi-Threaded (XMT) instruction parallelism (by S. Dascal and U. Vishkin). 3rd Workshop on Algorithms Engineering (WAE'99), London, U.K., July 1999.
    Download talk (13 transparencies).
  • (General talk) What to do with all this hardware? Could the PRAM-On-Chip architecture lead to upgrading the WINTEL performance-to-productivity platform? Download abstract. Download talk (postscript, slides in 6 quads) or (pdf, 6 quads) Invited talks were given at the Intel Microprocessor Research Lab (MRL), Santa Clara, CA, and at Microsoft Research, Redmond, WA, in January 2002, and at IBM T.J. Watson Research Center, Yorktown Heights, NY, in June 2002.
    Link to the Multi-University Research Lab Seminar Series for a recording of the Microsoft Research talk (based on the 6 quads of slides above).
  • (Specialized talk) Bending Light for Multi-Chip Virtual PRAMs? (12 slides), 3rd Workshop on Non-Slicon Computation, held in conjunction with the 31st International Symposium on Computer Architecture (ISCA 2004), June 19-23, 2004, Munich, Germany.
  • (Specialized talk) Pilot development time producivity study on the Spring 2004 course Parallel Algorithmics (12 slides), DARPA's High Productivity Computing Systems Program and Productivity Summit, Fairfax, VA June 29-30, 2004. For context, see http://www.highproductivity.org/
  • (General talk) PRAM-On-Chip: A Quest for Not-So-Obvious Non-Obviousness. Download abstract . Download talk (pdf, 33 slides) . Invited talks were given at Intel Research Lab, Petach Tiqwa (Israel), 8/2004, and at MFCS'04, Prague, 8/2004.
  • (General talk) Can Parallel Computing Finally Impact Mainstream Computing? Download abstract . Download talk (pdf, 40 slides) . Invited talks were given at Microsoft Research and Intel Research Pittsbrugh, 3/2005 and 4/2005.

    Publications

  • U. Vishkin, S. Dascal, E. Berkovich and J. Nuzman. Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism (Extended Abstract). In Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1998.
    Note: This paper introduces XMT to readers whose background includes reasonable understanding of algorithms and theory. The presentation is by way of a bridging model. Among other things, a bridging model provides a design space for algorithm designers and programmers, as well as a design space for computer architects. It is convenient to describe our wider vision regarding ``parallel-computing-on-a-chip'' as a two-stage development and therefore two bridging models are presented: Spawn-based multi-threading (Spawn-MT) and Elastic multi-threading (EMT).
    Download TR (12 pages).
  • U. Vishkin, S. Dascal, E. Berkovich and J. Nuzman. Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism (Extended Summary and Working Document). Current version of UMIACS TR-98-05 . First version: January 1998.
    Note: This long paper is really the MAIN publication for the XMT framework. The SPAA'98 paper gives a principled presentation of XMT. It also introduces XMT to readers with algorithms and theory background. The next central presentation for the project is the journal version of the SPAA'01 paper. In contrast, the September 1999 TR UMIACS99-55 by Berkovich, Nuzman, Franklin, Jacob and Vishkin is not a good single representative of the XMT project, as a whole; while it describes possible architecture choices for the XMT framework, and studies their microarchitecture performance implications, some part of it are misleading when it comes to understanding choices actually made for XMT. The paper emphasizes an important ``decentralization'' property of the architecture. Its performance hardly degrades even when the number of cycles for traversing wires (interconnects) increases. The introduction that the latter paper provides to XMT is less comprehensive and less direct than the SPAA98 paper, as some aspects of XMT are not addressed, and some architecture choices were made for the sake of completeness of the architecture and not because they are required by the XMT paradigm.
    Download TR (47 pages).
  • S. Dascal and U. Vishkin. Experiments with list ranking for Explicit Multi-Threaded (XMT) instruction parallelism (extended abstract). In Proc. 3rd Workshop on Algorithms Engineering (WAE'99), London, U.K., July 1999.
    Download extended abstract (17 pages).
    Download talk (13 transparencies).
    An extended summary of this work is available as UMIACS-TR-99-33. Download extended summary (27 pages).
    Invited to the Special Issue for WAE99: Journal version, from the ACM Journal of Experimental Algorithmics, Volume 5, 2000.
  • E. Berkovich, J. Nuzman, M. Franklin, B. Jacob, U. Vishkin. XMT-M: A scalable decentralized processor. UMIACS TR 99-55, September 1999.
    Download TR (postscript, 18 pages). Note: as the above comment following UMIACS TR-98-05 states, UMIACS TR 99-55 could be misleading in understanding architecture choices made for XMT and is not a good single representative publication of the XMT project.
  • U. Vishkin. A No-Busy-Wait balanced tree parallel algorithmic paradigm. In Proc. 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2000.
    Download TR (postscript, 9 pages). pdf.
  • D. Naishlos, J. Nuzman, C-W. Tseng, and U. Vishkin. Evaluating multi-threading in the prototype XMT environment. In Proc. 4th Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC2000), December 2000 (held in conjunction with the 33rd Int. Symp. on Microarchitecture MICRO-33). Best Paper Award for MTEAC2000.
    Download TR (pdf, 9 pages).
  • D. Naishlos, J. Nuzman, C-W. Tseng, and U. Vishkin. Evaluating the XMT Parallel Programming Model. In Proc. of the 6th Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS-6), April 2001.
    Download TR (pdf, 6 pages).
  • U. Vishkin. Two techniques for reconciling algorithm parallelism with memory constraints (extended abstract). In Proc. 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002.
    Download TR (postscript, 4 pages). Download talk (postscript, 13 transparencies).
  • D. Naishlos, J. Nuzman, C-W. Tseng, and U. Vishkin. Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach. The conference version appeared in Proc. 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA-01)), July 2001.
    Download TR (pdf, 10 pages).
    The journal version appeared in the invited Special Issue for SPAA01: TOCS 36,5 pages 521-552, Springer-Verlag, 2003.
    Download the TR that led to the journal version (pdf, 26 pages).
  • A. Balkan, G. Qu, and U. Vishkin. Arbitrate-and-Move primitives for high throughput on-chip interconnection networks. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), SoC Design Technology lecture session , Vancouver, May 23-26, 2004.
    Download TR (pdf, 4 pages).
  • U. Vishkin. Can Optical Interconnection Networks Lead to Cheaper High-Performance Multiprocessors? Proc. Optics East, S.P.I.E. - The International Society for Optical Engineering, Active and Passive Optical Components for WDM Communications IV, Volume 5595, Philadelphia, October 26-28, 2004, 162-166. Download TR (pdf, 5 pages).
    The computer science and engineering side of this work was presented in the paper:
  • U. Vishkin and I. Smolyaninov. Bending Light for Multi-Chip Virtual PRAMs? Proc. 3rd Workshop on Non-Slicon Computation, held in conjunction with the 31st International Symposium on Computer Architecture (ISCA 2004), June 19-23, 2004, Munich, Germany, 32-38.
    Download TR (pdf, 7 pages).
  • P. Gu and U. Vishkin. Case Study of Gate-level Logic Simulation on an Extremely Fine-Grained Chip Multiprocessor Journal of Embedded Computing, Special Issue on Embedded Single-Chip Multicore Architectures and related research - from System Design to Application Support, to appear.
    The test environment including the XMT machine, parallel logic simulation program and benchmark circuits .

    Course on Parallel Algorithms Including XMT Programming Assignments, Tutorial and Software Tools

  • Spring 2005: ENEE759K/CMSC751 Parallel Algorithmics.

    Below please find older software tools as well as older supplemental information to XMT publications. More recent material is available throught the Parallel Algorithms course web site above.

    Code and Analysis for Selected Problems and XMT tools

    Brief Introduction.
    We have been guided by the premise that reproducibility is a defining attribute of empirical research. So far, there have been two generations reporting the XMT empirical work. The first generation includes the SPAA'98, and WAE'99 papers. The second generation includes the MTEAC'2000, HIPS'01 and TOCS'03 papers.

    The first generation of XMT empirical work. A very high-level description of several algorithms appears in the SPAA98 paper. The following four documents provide raw data that, with the addition of minimal analysis of algorithm execution, should enable reproduction of the numerical results reported in the SPAA98 paper. The WAE99 publication addressed list ranking--the most involved problem considered in these four documents--and demonstrated how to incorporate this added analysis. It improved the readability of this raw data, providing a detailed high level description of the algorithm, and reviewed the programming techniques used. The WAE99 paper presented algorithms and code design decisions and optimizations made, comparing them with other alternatives. Where needed, further elaboration on how to derive the numerical results from the raw data were given. It also explained why compilation is feasible.
    Note: All documents below are in PostScript. The gzip compressed files can be viewed directly by using ghostview.


    The second generation of XMT empirical work. The results reported in the second generation papers were obtained using the following XMT tools.
  • Supplemental Documentation
    The following document is Shlomit Dascal's record of specifications and assumptions for the bridging models and their executions. It is a useful complement to the SPAA98 paper.
    Note: Documentation is in PostScript. The gzip compressed files can be viewed directly by using ghostview.




    Dearly Departed Students

    Students in the Research Team

    This research project is directed by Uzi Vishkin.

    Funding by the National Science Foundation is graciously acknowledged. Some of the material on this page is based upon work supported by the National Science Foundation under Grant No. 0325393. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.