Cloud Computing Speaker Series (Spring 2008)

Storing and Processing Multi-dimensional Scientific Datasets

Alan Sussman
University of Maryland

11am, March 5, 2008
A.V. Williams 3174

[Slides]

Abstract

Large datasets are playing an increasingly important role in many areas of scientific research. Such datasets can be obtained from various sources, including sensors on scientific instruments and simulations of physical phenomena. The datasets often consist of a very large number of records, and have an underlying multi-dimensional attribute space. Because of such characteristics, traditional relational database techniques are not adequate to efficiently support ad hoc queries into the data. We have therefore developed algorithms and designed systems to efficiently store and process these datasets in both tightly coupled parallel computer systems and more loosely coupled distributed computing environments.

I will mainly discuss the design of two systems, the Active Data Repository (ADR) and DataCutter, for managing large datasets in parallel and distributed environments, respectively. Each of these systems provides both a programming model and a runtime framework for implementing high-performance data servers. These data servers provide efficient ad hoc query capabilities into very large multi-dimensional datasets. ADR is an object-oriented framework that can be customized to provide optimized storage and processing of disk-based datasets on a parallel machine or network of workstations. DataCutter is a component-based programming model and runtime system for building data-intensive applications that can execute efficiently in a Grid distributed computing environment. I will present optimization techniques that enable both systems to achieve high performance in a wide range of application areas, along with performance results from real applications on various computing platforms to support that claim.

About the Speaker

Alan Sussman is an Associate Professor in the Computer Science Department and Institute for Advanced Computer Studies at the University of Maryland, College Park. Working with students and other researchers at Maryland and other institutions, he has published over 80 conference and journal papers on various topics related to software tools for high-performance parallel and distributed (Grid) computing, and has contributed chapters to six books. Software tools he has built have been widely distributed and used in many computational science applications, in areas such as earth science, space science, and medical informatics. He received his Ph.D. in computer science from Carnegie Mellon University and his B.S.E. in Electrical Engineering and Computer Science from Princeton University.



This page first created: 17 Jan 2008