The RS/6000 POWER systems are implementations of a Reduced Instruction Set Computer (RISC) architecture. As is characteristic of many RISC architectures, loads and stores provide the only storage access; arithmetic instructions use only register operands. Several instructions, often considered more complex than a traditional RISC definition allows, enhance performance. POWER2 supports a superset of the POWER Architecture. New instructions provide opportunities for higher performance: quad-word floating-point storage references, square root, and convert to integer. Changes to virtual address translation improve performance and add capability. The architecture also adds hardware performance monitoring. Support of all POWER instructions maintains upward compatibility for programs.
In addition to the newly added instructions, users obtain a performance gain from the modifications to the virtual address translation process. As in most virtual memory systems, the operating system manages a set of architecturally defined page tables, which maintain a mapping of virtual pages to real pages. To increase the efficiency of this process, translation lookaside buffers (TLBs) cache translation information for recently accessed pages. When a storage reference requires translation information that is not available in the TLBs, a TLB miss occurs: hardware searches the page tables for the translation information and either updates the TLBs or signals a page fault.
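The lookup sequence described above (TLB hit, TLB miss with a page table search, page fault) can be sketched in a few lines of Python. This is only an illustrative model, not the POWER2 hardware algorithm: the class name TinyMMU, the flat dictionary used as a page table, the least-recently-used replacement policy, and the default TLB and page sizes are all assumptions made for the example.

```python
from collections import OrderedDict


class PageFault(Exception):
    """Raised when the page table holds no mapping for a virtual page."""


class TinyMMU:
    """Toy address translator: a small TLB in front of a page table.

    Assumptions (not from the source): the page table is a flat dict,
    the TLB uses LRU replacement, and the page size defaults to 4096.
    """

    def __init__(self, page_table, tlb_entries=4, page_size=4096):
        self.page_table = page_table      # virtual page -> real page
        self.tlb = OrderedDict()          # cached translations, LRU order
        self.tlb_entries = tlb_entries
        self.page_size = page_size
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, self.page_size)
        if vpage in self.tlb:
            # TLB hit: the translation is already cached.
            self.hits += 1
            self.tlb.move_to_end(vpage)
        else:
            # TLB miss: search the page table.
            self.misses += 1
            if vpage not in self.page_table:
                raise PageFault(vpage)    # no mapping: signal a page fault
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)  # evict least recently used
            self.tlb[vpage] = self.page_table[vpage]
        return self.tlb[vpage] * self.page_size + offset
```

For example, with TinyMMU({0: 7, 1: 3}, tlb_entries=1), the first reference to page 0 misses and fills the TLB, a second reference to page 0 hits, and a reference to page 1 misses again and evicts the entry for page 0.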
The POWER2 processor complex consists of eight semi-custom chips partitioned in a fashion similar to POWER: an Instruction Cache Unit (ICU), the Fixed Point Unit (FXU), the Floating Point Unit (FPU), four Data Cache Units (DCU), and a Storage Control Unit (SCU). The ICU prefetches instructions from the I-cache and places them in instruction buffers. ICU control logic decodes or analyzes the instructions in the buffers. The ICU executes ICU instructions (primarily branches), sometimes affecting the prefetch path. The ICU dispatches non-ICU instructions to the FXU and FPU over a 4-instruction wide "instruction dispatch" bus (IBUS). The FXU and FPU process their respective arithmetic instructions. The FXU also processes storage reference instructions by generating and translating the addresses before placing them on the cache address bus.
The POWER2 I/O unit is the same as the one in the RS/6000 Models 580 and 980. The I/O unit implements the 64-bit Streaming Data protocol on the Micro Channel at 10 MHz. It provides dual 64-byte buffers per DMA channel so that operations over the SIO bus and Micro Channel can fully overlap. The I/O unit, along with some logic on the I/O planar, reduces the arbitration time on the Micro Channel from 400 ns to 100 ns, which improves bandwidth and bus utilization. In addition, enhancements to the protocol for the SIO bus include prefetch data commands from the I/O unit so that the DMA data from memory is available to the I/O unit with minimum delay.
The switch provides the internal message passing fabric that connects all of the SP2 nodes (processors) together in a way that potentially allows all processors to send messages simultaneously. The hardware to support this connectivity consists of two basic elements: the switch board and the communications adapter (High Performance Switch Adapter versions 1 and 2). There is one adapter per node and one switch board unit per rack. The switch board unit contains 8 logical switch chips, implemented as 16 physical chips for reliability reasons that are discussed later. It provides the connectivity of each of the nodes to the switch fabric as well as the rack-to-rack connectivity.
To begin with, the switch fabric needs to be scalable from tens up to thousands of nodes. To meet that objective, a multistage network was chosen. The multistage network increases the amount of switch capability in a granular fashion as the number of processors grows. With this switch topology, switch stages are added as the system grows to keep the amount of bandwidth available to each of the processors constant.
One measure of bandwidth that is useful for comparing machine designs is bisectional bandwidth. This is the most common measure of aggregate bandwidth for parallel machines and is loosely defined as follows: Define a plane which separates the parallel system into two parts containing an equal number of nodes. This plane intersects some number of network links. Bisectional bandwidth is the total possible bandwidth crossing this plane through these links. This term is often used to assess the scalability of a topology. For hypercubes, the SP2 Switch, and most multistage networks, bisectional bandwidth scales linearly with the number of nodes in the system. For a 2-dimensional mesh, bisectional bandwidth scales with the square root of the number of nodes. For a ring, bisectional bandwidth remains constant as nodes are added. Since the effective bandwidth per node for this measurement is the aggregate divided by the number of nodes, the mesh and ring provide reduced capability per node as the system grows. The SP2 system maintains constant bisectional bandwidth per processor independent of the size of the machine.
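The scaling behavior described above can be made concrete with a short Python sketch. It counts the links crossing the bisection for three textbook topologies, using the standard results for a binary hypercube (N/2 links), a square 2-dimensional mesh without wraparound (sqrt(N) links), and a ring (2 links). The function names are hypothetical, the results are link counts rather than measured bandwidth, and the SP2 switch itself is not modeled; the source states only that its bisectional bandwidth scales linearly.

```python
import math


def bisection_links(topology, n):
    """Number of links crossing the bisection of an n-node network.

    Textbook formulas for idealized topologies (an assumption of this
    sketch, not SP2 measurements):
      hypercube: n / 2     (n must be a power of two)
      mesh2d:    sqrt(n)   (square mesh, no wraparound links)
      ring:      2
    """
    if n < 4:
        raise ValueError("need at least 4 nodes")
    if topology == "hypercube":
        if n & (n - 1):
            raise ValueError("hypercube size must be a power of two")
        return n // 2
    if topology == "mesh2d":
        side = math.isqrt(n)
        if side * side != n:
            raise ValueError("mesh size must be a perfect square")
        return side
    if topology == "ring":
        return 2
    raise ValueError(f"unknown topology: {topology}")


def bisection_links_per_node(topology, n):
    """Bisection link count divided by the number of nodes."""
    return bisection_links(topology, n) / n


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        row = ", ".join(
            f"{t}={bisection_links_per_node(t, n):.4f}"
            for t in ("hypercube", "mesh2d", "ring")
        )
        print(f"n={n:5d}: {row}")
```

Under these formulas the per-node figure stays at 0.5 for the hypercube at every size, while it shrinks for the mesh and the ring as nodes are added, which is the contrast the paragraph draws.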
In summary, the SP2 communications network is a multistage omega packet switch with buffered wormhole routing. If a message is traveling from node A to node B and another message needs to cross its path to go from node C to node D, the messages can cut through each other and share the fabric. No message blocks another for long periods of time. If there is congestion on an outgoing path, each switch chip buffers and queues the affected messages and delivers them fairly in round-robin fashion, packet by packet.
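The packet-by-packet round-robin delivery described above can be illustrated with a small Python model. It is a sketch of the general technique only, not the arbitration logic of the actual switch chip: the function name round_robin_drain and the representation of each contending input as a simple queue of packet labels are assumptions made for the example.

```python
from collections import deque


def round_robin_drain(input_queues):
    """Return the order in which queued packets leave one output port.

    Each element of input_queues holds the packets waiting on one
    input, oldest first. The arbiter visits the inputs in a fixed
    cyclic order and forwards one packet from each non-empty queue
    per visit, so no single message can monopolize the output.
    """
    queues = [deque(q) for q in input_queues]  # copy; leave caller's data alone
    delivered = []
    while any(queues):
        for q in queues:
            if q:
                delivered.append(q.popleft())
    return delivered


if __name__ == "__main__":
    # Three messages contend for the same outgoing path.
    print(round_robin_drain([["A1", "A2", "A3"], ["C1"], ["D1", "D2"]]))
```

With the three queues shown, the delivery order is A1, C1, D1, A2, D2, A3: the short message from C is not held up behind the longer message from A.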
Ease of job scheduling placed another requirement on the switch fabric: a uniform topology. The SP2 switch topology is uniform; that is, because the fabric is an omega network, every point in the fabric is equidistant from every other point. Algorithms do not place requirements on node selection or topology, since all nodes are equidistant. This means that the scheduler does not need to consider the physical location of the specific nodes selected for a job. I/O has the same flexibility: I/O server nodes can be located anywhere on the fabric.