Acquisition of Bilingual MT Lexicons from OCRed Dictionaries

Burcu Karagol-Ayan
 
University
of Maryland, College Park


Computational Linguistics Colloquium

September 10, 2003, 11:00am, AVW Room 2120


 

This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited.  A rule-based, an HMM-based, and a post-processed HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictionary. We evaluate the effectiveness of our techniques, concluding that: (1) the rule-based method performs better with dictionaries where the font is not an important distinguishing feature for determining information types; (2) the post-processed stochastic method improves the results of the stochastic method for phrasal entries; and (3) Our resulting bilingual lexicons are comprehensive enough to provide the basis for reasonable translation results when compared to human translations.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Doug Oard (oard@umiacs.umd.edu).