Objective:
This project is developing a new generation of techniques for creating multilingual natural language processing (NLP) technology and for performing machine translation (MT) using statistical methods. Previous approaches to human language technology development and machine translation have emphasized either knowledge-intensive resources that contain linguistic information, which are costly to build, or linguistically uninformed, shallow statistical models, which result in poorer translation quality. In contrast to prior approaches, this effort aims to create new statistical models that are linguistically informed, leading to higher quality output for a wide range of languages while still being practical to train and use.
Approach:
In order to make serious improvements in translation quality, statistical translation models need to be able to condition on linguistic features not currently taken advantage of in traditional statistical models, such as the valence (number of arguments) of verbs and lexical co-occurrence probabilities mediated by syntactic relationships rather than just surface adjacency. However, linguistic information of that kind is not available for most of the world's languages and it is difficult and expensive to produce.
The solution to this problem lies in leveraging the existence of lexicons, morphology tools, parsers, ontologies, and other resources for English. The approach is to obtain parallel corpora involving languages of interest and to transform them into annotated parallel corpora, where the second-language corpus annotation is created not manually but via a projection from the English side using automatic alignments as the bridge. The automatically created second-language corpus forms the training data for noise-robust machine learning algorithms, which induce part-of-speech taggers, morphological analyzers, noun-phrase bracketers, dependency parsers, and word-sense taggers for these languages. It also provides training material for new statistical translation models and techniques that exploit parallel linguistically annotated training data.
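As a rough illustration of the projection step, the sketch below copies part-of-speech tags from an English sentence onto a second-language sentence through word alignment links. It is a minimal sketch under stated assumptions, not the project's actual method: the function name project_pos_tags, the (english_index, target_index) alignment format, the majority-vote rule, and the "UNK" placeholder for unaligned words are all hypothetical choices made for this example.

```python
from collections import Counter


def project_pos_tags(english_tags, alignment, target_length, unaligned_tag="UNK"):
    """Project POS tags from English tokens onto target-language tokens.

    english_tags:  list of tags, one per English token.
    alignment:     iterable of (english_index, target_index) pairs,
                   e.g. as produced by an automatic word aligner (assumed format).
    target_length: number of tokens in the target-language sentence.
    unaligned_tag: hypothetical placeholder for target tokens with no link.
    """
    # Collect one vote per alignment link for each target token.
    votes = [Counter() for _ in range(target_length)]
    for en_idx, tgt_idx in alignment:
        votes[tgt_idx][english_tags[en_idx]] += 1

    # Each target token takes its most frequently projected tag.
    projected = []
    for counter in votes:
        if counter:
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(unaligned_tag)
    return projected


# English: "the black cat sleeps"   Spanish: "el gato negro duerme"
english_tags = ["DET", "ADJ", "NOUN", "VERB"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
print(project_pos_tags(english_tags, alignment, target_length=4))
```

Projected tags of this kind are noisy, since alignments contain errors and some words stay unaligned, which is why the induced taggers and parsers described above rely on noise-robust learning rather than treating the projected annotation as gold-standard data.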
The work in this project will lead to three main outcomes. First, the word-aligned, dependency-parsed, POS-tagged, and richly annotated parallel corpora to be produced for multiple data-limited languages will serve as a broadly useful training resource for nearly all NLP applications. Second, the induced set of multilingual POS taggers, NP bracketers, morphological analyzers, dependency parsers, and sense taggers for multiple underserved languages will also be widely useful as stand-alone tools for applications spanning multilingual speech, information retrieval (IR), and information extraction tasks. Third, the state of the art in statistical MT will be advanced by new models that take advantage of more sophisticated features of language than the traditional word-based models do.
Researchers:
Philip Resnik, Bonnie Dorr, Rebecca Hwa, Amy Weinberg.
Partners and Sponsors:
Johns Hopkins: David Yarowsky, Bill Byrne, Jason Eisner, Sanjeev Khudanpur.
Sponsor:
Department of Defense