Abstract
In these presentations, students in the cloud computing course will describe what they've accomplished during the semester.
Large-Data Statistical Machine Translation (May 7)
Chris Dyer, Aaron Cordova, Alex Mont
In the last decade, statistical techniques have revolutionized the field of machine translation. Using collections of documents that exist in pairs of languages (so-called bitexts), it is possible to build models of translation that can be used to translate effectively between many language pairs with virtually no language-specific knowledge. However, the amount of data that is available has grown more rapidly than the performance of computers, and the research community now faces a situation where estimating a new model from a training bitext takes several days with the existing tools. In this talk, we show that a large class of models can be naturally expressed using the MapReduce programming paradigm. This enables the models to be parallelized on clusters of commodity hardware and estimated exactly in a fraction of the time that would be required on a single-core machine. We present results showing the scaling characteristics of estimation code for two word alignment models (IBM Model 1 and the HMM alignment model) and a phrase-based translation model.
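As an illustration of how such a model fits the paradigm, here is a minimal sketch of one EM iteration of IBM Model 1 written as a map step and a reduce step. It is not the speakers' implementation: the function names (`map_pair`, `reduce_word`, `em_iteration`, `train`) are hypothetical, the shuffle phase is simulated with an in-memory dictionary rather than a cluster, and NULL alignment is omitted for brevity. The mapper emits expected alignment counts for one sentence pair, keyed by the source word; the reducer sums the counts for one source word and renormalizes them into translation probabilities t(f|e).

```python
from collections import defaultdict


def map_pair(pair, t):
    """Mapper (E-step): emit (e_word, (f_word, expected_count)) for one sentence pair."""
    e_sent, f_sent = pair
    for f in f_sent:
        # Normalizer: total probability of f under every e word in this sentence.
        z = sum(t[(f, e)] for e in e_sent)
        for e in e_sent:
            yield e, (f, t[(f, e)] / z)


def reduce_word(e, values):
    """Reducer (M-step): sum expected counts for one e word and renormalize."""
    totals = defaultdict(float)
    for f, count in values:
        totals[f] += count
    norm = sum(totals.values())
    return {(f, e): count / norm for f, count in totals.items()}


def em_iteration(bitext, t):
    """One EM iteration: map over sentence pairs, group by key, then reduce."""
    grouped = defaultdict(list)  # stands in for the framework's shuffle phase
    for pair in bitext:
        for e, value in map_pair(pair, t):
            grouped[e].append(value)
    new_t = {}
    for e, values in grouped.items():
        new_t.update(reduce_word(e, values))
    return new_t


def train(bitext, iterations=5):
    """Run several EM iterations starting from a uniform t(f|e)."""
    f_vocab = {f for _, f_sent in bitext for f in f_sent}
    # Only word pairs that co-occur in some sentence pair are ever looked up.
    t = {(f, e): 1.0 / len(f_vocab)
         for e_sent, f_sent in bitext for e in e_sent for f in f_sent}
    for _ in range(iterations):
        t = em_iteration(bitext, t)
    return t


if __name__ == "__main__":
    toy_bitext = [
        (["the", "house"], ["das", "haus"]),
        (["the", "book"], ["das", "buch"]),
        (["a", "book"], ["ein", "buch"]),
    ]
    table = train(toy_bitext)
    for (f, e), p in sorted(table.items()):
        print(f"t({f}|{e}) = {p:.3f}")
```

Because each mapper call touches only one sentence pair and each reducer call touches only one source word, both steps can in principle be distributed across machines without changing the result, which is what makes exact estimation possible.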
Parallel Automatic Text-Background Separation in Picture Books (May 7)
Chang Hu and Punit Mehta
We present an image processing system built on the Hadoop MapReduce framework. The system combines existing serial image processing code with MapReduce to perform text-background separation for picture books in the International Children's Digital Library. We evaluate the system's performance. The design also offers a general programming pattern for integrating image processing systems with the MapReduce framework.
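The idea of wrapping serial code in a map step can be sketched as follows. This is only an illustration under stated assumptions: `separate_text` is a hypothetical stand-in for the existing serial routine (here a simple intensity threshold, which is not necessarily the method used in the talk), pages are represented as small grids of grayscale values, and the built-in `map` stands in for the framework distributing records across workers. Since each page is processed independently, no reduce step is needed in this sketch.

```python
def separate_text(page, threshold=128):
    """Hypothetical serial routine: label dark pixels as text (1), the rest as background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in page]


def mapper(record):
    """Map step: apply the unchanged serial routine to one (page_id, page) record."""
    page_id, page = record
    return page_id, separate_text(page)


def run_map_only_job(records):
    """Stand-in for the framework: apply the mapper to every record independently."""
    return dict(map(mapper, records))


if __name__ == "__main__":
    pages = [
        ("book1-page1", [[0, 255], [200, 10]]),
        ("book1-page2", [[130, 90], [20, 240]]),
    ]
    for page_id, mask in run_map_only_job(pages).items():
        print(page_id, mask)
```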