Putting Formal Grammars to Work


David Chiang

University of Pennsylvania


UMIACS Computational Linguistics Colloquium

April 26, 2004, 3:30pm, AVW Room 2120


What makes one grammar better than another? Formal language theory has traditionally answered this question in terms of how a grammar classifies strings as grammatical or ungrammatical (weak generative capacity), whereas applications have generally been interested in more complex functions on strings -- for example, probability distributions, translations, or grammatical relations. This gap makes it difficult to apply formal-language-theoretic results directly.

 

In this talk I will outline a program for bridging this gap. Combining two views of strong generative capacity (SGC) -- Miller's view of SGC as "the semantics of linguistic formalism" and Joshi's view of SGC as having to do with derivations of tree-adjoining grammars and related formalisms -- I will sketch a basic framework in which the formal power of grammars can be tested in ways that are more directly relevant to their applications. I will then discuss two application areas.

 

First, in the area of statistical parsing, I will talk about how grammars are used as the basis for both generative models and maximum-entropy models, and argue that extra power for statistical modeling, in general, comes with a computational cost. But the proof of this result reveals a connection between lexicalized PCFG models like those of Charniak and Collins and lexicalized probabilistic TAG models. This suggests that lexicalized PCFG models should be thought of as defined not over phrase-structure trees but over richer structural descriptions, and a central problem becomes that of training these models on the incomplete structural descriptions of the Treebank. I will describe the implementation of a generative TAG model using two different training methods, with results on both English and Chinese.

 

The second application area is that of natural language translation. A number of synchronous grammar formalisms have been proposed for syntax-based translation -- for example, inversion transduction grammar and synchronous tree substitution grammar -- and their relative formal power varies depending on how we measure it. One, synchronous regular-form TAG, is more powerful than synchronous CFG in the strictest sense, yet has the same parsing complexity. With Mark Dras and William Schuler, I have explored this formalism and shown how it can be used for a tricky case of Portuguese-English translation. I will conclude by discussing some possible ways of incorporating insights from both of these areas into full statistical machine-translation systems.

 


About the Speaker:

 

David Chiang is a PhD candidate in Computer and Information Science at the University of Pennsylvania, under the supervision of Aravind Joshi. His research interests are in applying formal grammars to natural language processing and biological sequence analysis.

For the colloquium series schedule, see http://www.umiacs.umd.edu/research/CLIP/colloq/. If you are interested in meeting with the speaker, please contact Doug Oard (oard@umiacs.umd.edu).