The CLIP Colloquium Series presents...


Bootstrapping Monolingual Parsers from Multilingual Data

David Smith (Johns Hopkins University)
February 7, 2008, 11:00am, AVW 2460

The creation of the Penn Treebank and similar datasets ca. 1990 produced a flowering of research on empirically trained syntactic parsers, which is now bearing fruit in information extraction and machine translation. This revolution has bypassed most languages and domains, however, due to the expense of creating treebanks. Semi-supervised learning methods such as bootstrapping and co-training have the potential to leverage diverse sources of knowledge for robust statistical parsing in these new settings.

Drawing on Abney's (2004) analysis of the Yarowsky algorithm, I present a view of bootstrapping as optimization. This optimization is performed with standard dynamic programming for projective syntax or with a new model of graph spanning trees for non-projective syntax, which allows trees with crossing dependency links in languages such as Czech, Danish, and Dutch. Finally, I show how to draw features for a parser in one language from parse trees in another language. These quasi-synchronous grammars extend prior bootstrapping work with synchronous grammars and also have applications in translation modeling.

About the Speaker

David Smith received his A.B. in classics from Harvard University. An NSF graduate fellow, he is currently a Ph.D. student in Johns Hopkins University's Computer Science Department and Center for Language and Speech Processing. His interests are in machine translation, natural language parsing, and semi-supervised machine learning methods. David was formerly head programmer for the Perseus Digital Library Project at Tufts University, where he strayed from the path of classical philology toward text mining, geocoding, and information extraction.


This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.