The CLIP Colloquium Series presents...


Improving Automatic Machine Translation Using Comparable Corpora

(Internal talk)
Matt Snover (University of Maryland)
September 5, 2007, 11:00am, AVW 2120

Traditionally, automatic statistical machine translation systems have relied on parallel bilingual text to train a translation model and have used monolingual text in the target language only to train the language model. While bilingual parallel text is expensive to generate, monolingual data is relatively common. This monolingual text has been under-utilized, being used only for language modeling. This research proposal describes a novel method for utilizing monolingual target text to improve the performance of a statistical machine translation system on news stories. The method exploits the existence of multiple texts across languages that discuss the same or similar stories. For every source document that is to be translated, a large monolingual data set in the target language is searched for documents that might be comparable to the source document. These comparable documents are then used to bias the translation system, increasing the probability of generating text that resembles them. The machine translation system is biased by modifications to both the language model and the translation model of the statistical system. A preliminary examination of this approach has proven promising.

About the Speaker

Matt Snover is a Ph.D. student in Computer Science and a CLIP member.


This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.