Indirect Supervised Learning of Content Selection Rules

Pablo Duboue

Department of Computer Science
Columbia University


UMIACS Computational Linguistics Colloquium

August 13, 2004, 11 am, CSIC Room 1121


Abstract:

 

As online data becomes more and more abundant, there is a growing need to filter information for users and thereby reduce information overload.  This situation is akin to the Content Selection (CS) problem faced by a Natural Language Generation (NLG) system when it starts to build a new text.  In a generation system, the CS module decides which pieces of information to include in the final generated text.  It has been argued that CS is central to user acceptance of a generation system (as users may tolerate other types of errors as long as the information is readily available in the output).  Moreover, the CS problem is quite domain dependent; major changes in CS knowledge are needed when moving a system to a different domain.

 

 

In this talk, I will present my work on the automatic acquisition of CS rules, as a way to provide a domain-independent solution to the CS problem.  As training material, I employ an aligned Text-Data corpus, a resource that is increasingly popular for learning in NLG (such corpora are readily available and do not require expensive hand labelling).  However, aligned Text-Data corpora provide only indirect information about whether a piece of information was selected by the human writer for inclusion in the text.  Indirect Supervised Learning is my proposed solution to this problem.  It has two steps.  In the first step, the Text-Data corpus is transformed into a dataset with classification labels.  In the second step, supervised learning machinery acquires the CS rules from this dataset.  I evaluate the approach by comparing the output of my system with the information selected by human authors in unseen texts, obtaining an F* of 0.67 with high recall.

 

About the Speaker:

 

Pablo Duboue (http://www.cs.columbia.edu/~pablo/) is a PhD candidate in the Computer Science Department at Columbia University and expects to graduate in October 2004. He works in the Natural Language Processing Group with Prof. Kathleen R. McKeown, and his main research interests include natural language generation and machine learning.


For the colloquium series schedule, see http://www.umiacs.umd.edu/research/CLIP/colloq/.  If you are interested in meeting with the speaker, please contact Doug Oard (http://www.glue.umd.edu/~oard/, oard@umiacs.umd.edu).