Computationally efficient models for natural language understanding have a wide variety of applications, from text mining and question answering to natural language interfaces to databases. Constraint-based grammar formalisms have been widely used for deep language understanding. Yet one serious obstacle to their use in real-world applications is that these formalisms have overlooked a crucial requirement: learnability. Currently, there is a poor match between these grammar formalisms and existing learning methods.
In this talk, I will introduce a new type of constraint-based grammar, Lexicalized Well-Founded Grammars, which are learnable and for which semantic composition and semantic interpretation are encoded as constraints at the grammar rule level. I use an ontology-based interpretation, proposing a semantic representation that is sufficiently expressive to represent many aspects of language and yet sufficiently restrictive to support learning and tractable inference. I will present a computationally efficient model for inducing these grammars, called Grammar Approximation by Representative Sublanguage. The learning paradigm is a new logic-based relational learning method that benefits from the soundness and completeness of the inverse entailment method while avoiding the drawback of its undecidability. I have proved that the search space for grammar induction is a complete grammar lattice, which guarantees the uniqueness of the solution. My research has delivered a practical grammar induction system for deep language analysis and has enabled robust methods for the acquisition of terminological knowledge from text in the medical domain.
Smaranda Muresan is a PhD candidate in the Natural Language Processing Group at Columbia University, working with Dr. Owen Rambow and Dr. Judith Klavans. Her research interests range from grammar induction, natural language understanding, and knowledge representation to relational learning. Her work unifies two separate but central themes in human language and information technologies: computational formalisms for expressing natural language phenomena and the induction of knowledge from data.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.