A parser used to analyze natural language text typically requires three types of information: a set of elementary structures, a lexicon, and information for pruning unlikely analyses. This information can be collected from two types of resources: hand-crafted grammars and Treebanks. A hand-crafted grammar includes a set of elementary structures and a lexicon, but it lacks statistical information, and building such a grammar is very labor-intensive. A Treebank, on the other hand, provides rich statistical information, but the elementary structures that form the parse structures in the Treebank are implicit.
In this talk, I introduce two systems that address these issues. The first system (LexTract) extracts Lexicalized Tree Adjoining Grammars (LTAGs) from Treebanks and builds derivation trees that can be used directly to train statistical LTAG parsers. In addition to creating Treebank grammars and producing training material for parsers, the system is also used to detect annotation errors in the Treebanks, to evaluate the coverage of existing hand-crafted grammars, to compare syntactic structures for different languages, and to test certain linguistic hypotheses. The second system (LexOrg) addresses a major problem in hand-crafted grammars, namely the redundancy caused by the reuse of substructures in elementary structures and the lack of explicit generalizations over the structures. The system takes several types of specifications as input and combines them to automatically generate a grammar. The system can be further extended to include language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process.
These two systems provide fundamental advances in our understanding of the relationships between Treebanks and grammars. LexTract makes explicit the elementary structures that form the phrase structures in a Treebank, while LexOrg makes explicit the components of these elementary structures and generalizations over these structures. The systems provide a rich set of tools for language description and comparison that greatly enhances our ability to build and maintain grammars and Treebanks effectively.
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).