Cloud Computing Speaker Series (Spring 2008)

Cloud Computing Course: Final Project Presentations

Part I
11am, May 7, 2008
A.V. Williams 2120

Part II
11am, May 14, 2008
A.V. Williams 2120

Part III
11am, May 23, 2008
A.V. Williams 3174

Abstract

In these presentations, students in the cloud computing course will describe what they've accomplished during the semester.

Large-Data Statistical Machine Translation (May 7)
Chris Dyer, Aaron Cordova, Alex Mont

In the last decade, statistical techniques have revolutionized the field of machine translation. Using collections of documents that exist in pairs of languages (so-called bitexts), it is possible to build models of translation that can be used to translate effectively between many language pairs with virtually no language-specific knowledge. However, the amount of data that is available has grown more rapidly than the performance of computers, and the research community now faces a situation where estimating a new model from a training bitext takes several days with the existing tools. In this talk, we show that a large class of models can be naturally expressed using the MapReduce programming paradigm. This enables the models to be parallelized on clusters of commodity hardware and estimated exactly in a fraction of the time that would be required on a single-core machine. We present results showing the scaling characteristics of estimation code for two word alignment models (IBM Model 1 and the HMM alignment model) and a phrase-based translation model.
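As a rough illustration of how model estimation can map onto MapReduce, the sketch below runs EM for IBM Model 1 in plain Python. It is not the speakers' code and does not use Hadoop: the function names (`map_sentence_pair`, `reduce_counts`, `train`) are hypothetical, the NULL source word is omitted for brevity, and the bitext is a toy example. The map step emits expected alignment counts for one sentence pair; the reduce step sums and normalizes them.

```python
from collections import defaultdict


def map_sentence_pair(src, tgt, t):
    """E-step for one sentence pair: emit ((src_word, tgt_word), expected_count)."""
    for f in tgt:
        # Normalize over all source words that could have generated f.
        total = sum(t[(e, f)] for e in src)
        for e in src:
            yield (e, f), t[(e, f)] / total


def reduce_counts(pairs):
    """M-step: sum expected counts and renormalize per source word."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for (e, f), c in pairs:
        counts[(e, f)] += c
        totals[e] += c
    return {(e, f): c / totals[e] for (e, f), c in counts.items()}


def train(bitext, iterations=10):
    """Run EM; each iteration corresponds to one map/reduce pass over the bitext."""
    # Uniform start over every word pair that co-occurs in some sentence pair.
    t = {(e, f): 1.0 for src, tgt in bitext for e in src for f in tgt}
    for _ in range(iterations):
        emitted = (
            pair
            for src, tgt in bitext
            for pair in map_sentence_pair(src, tgt, t)
        )
        t = reduce_counts(emitted)
    return t


# Toy bitext, invented for illustration only.
bitext = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
table = train(bitext)
```

Because each sentence pair is processed independently in the map step, the pairs can be spread over many machines, and only the summed counts need to be brought together in the reduce step.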

Parallel Automatic Text-Background Separation in Picture Books (May 7)
Chang Hu and Punit Mehta

We present an image processing system based on the Hadoop MapReduce framework. The system leverages existing serial image processing code and the MapReduce framework to perform text-background separation for picture books in the International Children's Digital Library. We evaluated the performance of the system. The system also provides an overall programming paradigm for integrating image processing systems using the MapReduce framework.

Language Modeling with MapReduce (May 14)
Denis Filimonov and Hua Wei

Parallelization of computationally intensive problems is challenging but often necessary. Approaches to parallelization fall roughly into two categories: single-system-image machines (UMA or NUMA), and clusters of independent nodes communicating over some protocol, e.g., MPI. Clusters are typically built from commodity hardware and are thus much cheaper to scale up, provided the task in question is suitable for implementation on such a cluster. This paper describes an implementation of a recursive partitioning algorithm used for language model (LM) training in Hadoop, an open-source implementation of the MapReduce framework.

High-throughput Sequence Alignment with Hadoop (May 14)
Michael Schatz

Next-generation DNA sequencing machines generate sequence data at an unprecedented rate, but traditional single-processor sequence alignment algorithms are struggling to keep pace with them. BlastReduce is a new parallel read mapping algorithm optimized for aligning sequence data from those machines to reference genomes, for use in a variety of biological analyses, including SNP discovery, genotyping, and personal genomics. It is modeled after the widely used BLAST sequence alignment algorithm, but uses the open-source Hadoop implementation of MapReduce to parallelize execution across multiple compute nodes. To evaluate its performance, BlastReduce was used to map next-generation sequence data to a reference bacterial genome in a variety of configurations. The results show that BlastReduce scales linearly with the number of sequences processed and achieves high speedup as the number of processors increases. In a modest 24-processor configuration, BlastReduce is up to 250x faster than BLAST executing on a single processor, reducing the execution time from several days to a few minutes at the same level of sensitivity. Furthermore, BlastReduce is fully compatible with cloud computing, and can be easily executed on massively parallel remote resources to meet peak demand. BlastReduce is available open-source at: http://www.cbcb.umd.edu/software/blastreduce/.

Joint Resolution of Identity in Email Archives
Tamer Elsayed, Greg Jablonski, and Alan Jackoway

In this project, we explored the design of a MapReduce solution for jointly resolving all the named mentions in large email archives. Since the resolution of a single mention depends on resolving other mentions in emails that appear in its context, we proposed and implemented an iterative solution based on a graph-structure mapping of the problem. In this talk, we will present our MapReduce-motivated resolution approach, solutions to challenges (e.g., pairwise email similarity) that arise due to the large scale of the problem, and some preliminary results on the Enron collection.

Large-Scale Network Analysis to Improve Text Retrieval in the Biomedical Domain
George Caragea, Christiam Camacho

We introduce PageRankHD, which provides a scalable, easily extensible implementation of the PageRank algorithm using the Hadoop framework. We present results related to the scalability of PageRankHD as well as the Hadoop framework, and report on applicability to related document retrieval in the biomedical domain. We used an implementation of PageRankHD to expand on previous work by Lin and enhance the retrieval of related articles in the MEDLINE database at the National Library of Medicine.
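For readers unfamiliar with how PageRank decomposes into map and reduce steps, here is a minimal single-machine sketch in plain Python. It is not PageRankHD and does not use Hadoop; the function names, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from nodes without out-links are all assumptions made for illustration. The graph is a dictionary that must list every node as a key.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Each node splits its current rank evenly over its out-links."""
    for node, out_links in graph.items():
        if out_links:
            share = ranks[node] / len(out_links)
            for target in out_links:
                yield target, share


def reduce_phase(pairs, nodes, dangling_mass, damping):
    """Sum the contributions arriving at each node and apply damping."""
    sums = defaultdict(float)
    for target, share in pairs:
        sums[target] += share
    n = len(nodes)
    return {
        node: (1 - damping) / n + damping * (sums[node] + dangling_mass / n)
        for node in nodes
    }


def pagerank(graph, damping=0.85, iterations=50):
    """Iterate map and reduce passes; one pass corresponds to one MapReduce job."""
    nodes = list(graph)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread over all nodes.
        dangling = sum(ranks[node] for node in nodes if not graph[node])
        ranks = reduce_phase(map_phase(graph, ranks), nodes, dangling, damping)
    return ranks


# Toy link graph, invented for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
toy_ranks = pagerank(toy_graph)
```

Each iteration needs only the previous ranks and the link structure, which is what makes the computation a natural fit for a sequence of MapReduce jobs.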


About the Series

This page, first created: 19 Jan 2008