Project 1: Netfinity Cluster Project
UMD PI: Jeff Hollingsworth, with participating faculty: Keleher, Saltz,
Sussman, and Tseng;
IBM Collaborators: Marc Snir, Judy Engles (IBM Power Parallel)
UMIACS and the Department of Computer Science are in the process of setting
up a cluster of multiple types of commodity processors, disks, and networking
technologies, managed as an integrated shared resource. The cluster will
support research in systems software for information intensive and wide-area
computing research, including applications in earth system science, digital
pathology, computer vision, and computational neuroscience. The planned
cluster will consist of three distinct types of nodes: server, compute,
and specialized. The server nodes will each be equipped with multiple
processors and multiple fast-wide SCSI disks, used either for performing
computationally intensive computations or as I/O server nodes for simulated
wide-area tests. The compute nodes will each consist of single processor
nodes, a high speed interconnect, and a modest amount of disk space. The
specialized nodes will be configured with inexpensive processors, and
multiple relatively high capacity, low performance (but inexpensive) disks).
The software infrastructure will be based on Linux. For distributed memory
message passing programming, we will use MPI (MPICH or LAM). In addition,
we will likely run HPSS to manage migration of files from server nodes
to the specialized nodes. We will be developing resource management policies
based on Active Harmony, a software architecture that manages distributed
execution of computational objects in dynamic environments. In addition,
we will use this cluster to support our effort to port IBM's DPCL tools
Linux clusters.
We propose to acquire Netfinity SMPs as the high-end server nodes for
this cluster. This configuration will provide a significant compute platform,
yet retain the cost-effectiveness of small SMP nodes. We will connect
these nodes to each other and via the rest of the cluster via gigabit
Ethernet.
More details about the research projects to be supported by this cluster
are given next.
Active Harmony (Hollingsworth and Keleher)
The Active Harmony project is seeking to expand the interface between
applications and system software. We have developed an API to permit
applications to be written to be resource-aware. In particular, our
model is based on having export options to the runtime system which
then selects particular values for the application based on the available
resources and other workload on the nodes. In addition, the applications
indicate the expected resource utilization for each option and resulting
performance level (e.g. completion time or speedup). For example, a
parallel application might expose an option that would select the number
of nodes the program would use. The expected resource information might
indicate the application can effectively use 16 nodes. but that performance
improves marginally after that. The harmony system could use this information
to run the job on more than 16 nodes if nodes were available, but only
on 16 nodes if other jobs could more effectively use those nodes.
To date, we have converted several simple applications to use the harmony
interface. In addition, we have "harmonized" a client-server database
system to automatically select between running queries on the client
or the server node based on sever load and predicted resource utilization
for the query. We are currently working with the Computer Vision group
of the University of Maryland to develop a "harmonized" version of their
real-time computer vision and object tracking system.
Additional information about the harmony project is available from
Active Harmony
Active Data Repository (Sussman and Saltz)
We have been developing an infrastructure, called the Active Data Repository
(ADR), that integrates storage, retrieval and processing of large multi-dimensional
datasets on distributed memory parallel architectures with multiple
disks attached to each node. ADR targets applications with a particular
form of processing. The processing steps consist of retrieving input
and output data items that fall in the desired portion of the multi-dimensional
space, mapping the coordinates of the retrieved input items to the corresponding
output items, and aggregating, in some way, all the retrieved input
items mapped to the same output data items. Correctness of the output
usually does not depend on the order input data items are aggregated.
ADR is designed as a set of modular services implemented in C++. Through
use of these services, ADR allows customization for application-specific
processing, while providing support for common operations such as memory
management, data retrieval, and scheduling of processing across a parallel
machine. The system architecture of ADR consists of a front-end and
a parallel back-end. The front-end interacts with clients, and forwards
queries with references to user-defined processing functions to the
parallel back-end. During query execution, back-end nodes retrieve input
data and perform user-defined operations over the data items retrieved
to generate the output products. Output products can be returned from
the back-end nodes to the requesting client, or stored in ADR.
We have implemented several applications with very large storage and
computational requirements with ADR. The Titan satellite data processing
application stores four kilometer resolution global coverage data, and
ten years of that data consists of over 1.4TB. For the Virtual Microscope
ADR application, one focal plane of a single slide requires over 7GB
(uncompressed) at high power, and a hospital such as Johns Hopkins produces
hundreds of thousands of slides per year. Similarly, the computation
for one ten day composite Titan query for the entire world takes about
100 seconds per processor on the Maryland sixteen node IBM SP2.
More information is available at Active
Data Repository Project.
Multi-level Memory Hierarchies (Tseng)
Modern microprocessors provide high performance by exploiting data
locality with carefully designed multilevel caches. However, advanced
scientific computations have features which make it difficult to utilize
caches effectively. To exploit locality for these complex applications
will require sophisticated compile-time analyses to guide both compiler
and run-time data and computation transformations. We propose to develop
and evaluate software support for improving locality for advanced scientific
applications. We will investigate techniques needed on both uniprocessors
and shared-memory multiprocessors.
We are targeting three areas. First, iterative solvers for 3D partial
differential equations have poor locality because accesses to nearby
elements in higher-level dimensions are spread far apart in memory.
Careful tiling and padding can frequently recapture such reuse. Second,
computations on adaptive meshes and sparse matrices experience many
cache misses because they access data in an irregular manner. Data layout
and access order can be rearranged according to mesh connections or
geometric location to improve locality, with cost models used to guide
frequency of transformations for adaptive computations. Third, applications
using pointer-based data structures (e.g., linked lists, trees) to organize
data also access data irregularly in memory. Compiler and run-time analysis
can improve locality by using simple analyses to change memory layout
and allocation. Preliminary results show significant performance improvements
are possible in all three cases, indicating the importance of providing
software support for locality in advanced scientific computations.
More information is available at COSMIC
Software DSM (Keleher)
The CVM group study scalable software distributed shared memory (SDSM)
protocols. SDSM is a technique that uses a software abstraction to emulate
a shared address space across workstation clusters connected by general-purpose
networks. Research in SDSM is enjoying a resurgence because of the new-found
popularity of hardware shared-memory (SMP) machines. With more and more
desktop platforms becoming multiprocessors, shared-memory becomes the
dominant programming model.
Although SDSM was originally targeted at uniprocessors, it can also
be used to combine multiprocessor platforms. These SMP-based systems
are attractive both because of the ubiquity of small-scale SMP's, but
also because they provide support for fine-grain sharing. Conventional
wisdom maintains that SDSM's are incapable of supporting fine-grain
applications due to the lack of efficient mechanisms for communication.
However, unlike SDSM's that use uniprocessor nodes, SMP-based systems
provide hardware cache-coherence between processors within each multiprocessor.
The result is that an SMP-based SDSM can efficiently support fine-grain
sharing for a large variety of application sharing patterns.
The proposed cluster of SMP's will allow us to develop scalable protocols,
and to test them in a variety of configurations.
Coherent Virtual Machine
(CVM)
Systems Software for Linux Clusters of PCs
To top of page
|