UMIACS Computational Linguistics Colloquium, September 5, 2001

Relating Grammars and Treebanks for Natural Language Processing


Fei Xia


Department of Computer and Information Science, University of Pennsylvania


September 5, 2001
10:15am, AVW Room 2120


A parser used to analyze natural language text typically requires three types of information: a set of elementary structures, a lexicon, and information for pruning out unlikely analyses. This information can be collected from two types of resources: hand-crafted grammars and Treebanks. A hand-crafted grammar includes a set of elementary structures and a lexicon, but it lacks statistical information, and building such a grammar is very labor-intensive. A Treebank, on the other hand, provides rich statistical information, but the elementary structures that form the parse structures in the Treebank are implicit.

In this talk, I introduce two systems that address these issues. The first system (LexTract) extracts Lexicalized Tree Adjoining Grammars (LTAGs) from Treebanks and builds derivation trees to train statistical LTAG parsers directly. In addition to creating Treebank grammars and producing training material for parsers, the system is also used to detect annotation errors in the Treebanks, to evaluate the coverage of existing hand-crafted grammars, to compare syntactic structures for different languages, and to test certain linguistic hypotheses. The second system (LexOrg) addresses a major problem in hand-crafted grammars: the redundancy caused by the reuse of substructures in elementary structures and the lack of explicit generalizations over the structures. The system takes several types of specifications as input and combines them to automatically generate a grammar. The system can be further extended to include language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process.

These two systems provide fundamental advances in our understanding of the relationships between Treebanks and grammars. LexTract makes explicit the elementary structures that form the phrase structures in a Treebank, while LexOrg makes explicit the components of these elementary structures and generalizations over these structures. The systems provide a rich set of tools for language description and comparison that greatly enhances our ability to build and maintain grammars and Treebanks effectively.

About the speaker:

Fei Xia is graduating from the University of Pennsylvania this semester. Her primary research areas are grammar development and Treebank development. She has built a system that automatically generates Lexicalized Tree Adjoining Grammars (LTAGs) from abstract language specifications and another system that extracts LTAGs from Treebanks. She was also the manager of the Chinese Penn Treebank project and an organizer of the First and Second Chinese Language Processing Workshops. A portion of the Treebank, which includes 100,000 words of Xinhua News text, has been released to the public by LDC. In addition, she is a member of the XTAG project, a long-term project to build LTAG grammars, parsers, and related NLP tools. For more information, please see her homepage at http://www.cis.upenn.edu/~fxia.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).