From syntactic encodings to thematic roles: Building lexical entries for interlingual MT

TitleFrom syntactic encodings to thematic roles: Building lexical entries for interlingual MT
Publication TypeJournal Articles
Year of Publication1994
AuthorsDorr BJ, Garman J, Weinberg A
JournalMachine Translation
Pagination221 - 250
Date Published1994///
ISBN Number0922-6567
Keywordshumanities, Social Sciences and Law

Our goal is to construct large-scale lexicons for interlingual MT of English, Arabic, Korean, and Spanish. We describe techniques that predict salient linguistic features of a non-English word using the features of its English gloss (i.e., translation) in a bilingual dictionary. While not exact, owing to inexact glosses and language-to-language variations, these techniques can augment an existing dictionary with reasonable accuracy, thus saving significant time. We have conducted two experiments that demonstrate the value of these techniques. The first tested the feasibility of building a database of thematic grids for over 6500 Arabic verbs based on a mapping between English glosses and the syntactic codes in Longman's Dictionary of Contemporary English (LDOCE) (Procter, 1978). We show that it is more efficient and less error-prone to hand-verify the automatically constructed grids than it would be to build the thematic grids by hand from scratch. The second experiment tested the automatic classification of verbs into a richer semantic typology based on (Levin, 1993), from which we can derive a more refined set of thematic grids. In this second experiment, we show that a brute-force, non-robust technique provides 72% accuracy for semantic classification of LDOCE verbs; we then show that it is possible to approach this yield with a more robust technique based on fine-tuned statistical correlations. We further suggest the possibility of raising this yield by taking into account linguistic factors such as polysemy and positive and negative constraints on the syntax-semantics relation. We conclude that, while human intervention will always be necessary for the construction of a semantic classification from LDOCE, such intervention is significantly minimized as more knowledge about the syntax-semantics relation is introduced.