UMIACS Computational Linguistics Colloquium, April 20, 1999

Methods for Probabilistic Classification in Natural Language Processing Applied to Identifying Subjective Sentences


Janyce Wiebe


New Mexico State University


April 20, 1999, Special time/location: 11:30pm, AVW 4406


The first part of this talk will give an overview of a framework for developing probabilistic classifiers in Natural Language Processing (NLP) (Bruce and Wiebe 1999). A probabilistic classifier assigns the most probable class to an object, based on a probability model of the interdependencies among the class and a set of input features. The framework focuses on formulating a model that captures the most important interdependencies, so that it characterizes the data well without over-fitting. The class of probability models and the associated inference techniques were developed in mathematical statistics and are widely used in artificial intelligence and applied statistics, but they have not been widely used in NLP. This class, the decomposable models, is large and expressive, yet computationally feasible model search procedures are defined for it. The formality of the method supports evaluation: the talk will briefly describe how the three determinants of classifier performance (the features, the form of the model, and the parameter estimates) can be evaluated separately.
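To make "assigns the most probable class" concrete, here is a minimal sketch in Python. It is not the model search procedure from the talk: it assumes the simplest decomposable model, naive Bayes, in which the features are conditionally independent given the class. The two binary features, the toy data, and the add-one smoothing are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
import math

def train(examples, alpha=1.0):
    """Estimate counts from (features, label) pairs; alpha is a smoothing constant (assumed)."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)   # (label, feature position) -> value counts
    values = defaultdict(set)            # feature position -> values seen in training
    for feats, label in examples:
        for i, v in enumerate(feats):
            feat_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values, alpha

def log_joint(model, feats, label):
    """log P(label) + sum_i log P(feature_i | label) under the naive Bayes assumption."""
    class_counts, feat_counts, values, alpha = model
    total = sum(class_counts.values())
    lp = math.log(class_counts[label] / total)
    for i, v in enumerate(feats):
        num = feat_counts[(label, i)][v] + alpha
        den = class_counts[label] + alpha * len(values[i])
        lp += math.log(num / den)
    return lp

def classify(model, feats):
    """Return the class with the highest joint probability with the features."""
    return max(model[0], key=lambda c: log_joint(model, feats, c))

# Hypothetical toy data: (has_adjective, has_modal) -> class
examples = [
    ((1, 1), "subjective"), ((1, 0), "subjective"), ((1, 1), "subjective"),
    ((0, 0), "objective"), ((0, 1), "objective"), ((0, 0), "objective"),
]
model = train(examples)
print(classify(model, (1, 1)))
print(classify(model, (0, 0)))
```

A richer decomposable model would add selected dependencies among the features themselves. The decision rule, picking the class that maximizes the modeled probability, stays the same.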

The second part of the talk will describe an empirical investigation of a natural language disambiguation task. In many text processing applications, such as information extraction, summarization, text categorization, and information retrieval, it is important to distinguish objective sentences, which are used to present factual information, from subjective sentences, which are used to present beliefs and evaluations (Wiebe 1994; Wiebe, Bruce, and O'Hara 1999). Whether a sentence is subjective or objective depends not only on semantics, but also on the context in which the sentence appears. Using the model search procedure described above, we developed a probabilistic classifier for identifying subjective sentences. Using only shallow features, the classifier achieves an average accuracy 21 percentage points higher than the baseline in 10-fold cross-validation experiments. Developing the classifier required a gold-standard data set for training and testing. In a two-phase study, a data set was annotated by multiple annotators, resulting in high intercoder agreement for sentences that could be tagged with certainty. The classifier also performs better on sentences the judges tagged with certainty, showing consistency between the classifier and the human judges.
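The evaluation described above, average accuracy under 10-fold cross-validation compared with a baseline, can be sketched as follows. This is an illustrative outline only: the abstract does not say how the baseline was defined, so a majority-class baseline is assumed here, and the fold assignment, the toy lookup learner, and the synthetic data are hypothetical and unrelated to the actual experiments.

```python
from collections import Counter

def k_folds(n, k):
    """Yield (train_idx, test_idx); fold f tests the items whose index is f modulo k."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test

def cross_validate(data, fit, k=10):
    """data: list of (features, label); fit(train) returns a predict(features) function.
    Returns the mean of the per-fold accuracies."""
    accs = []
    for train_idx, test_idx in k_folds(len(data), k):
        predict = fit([data[i] for i in train_idx])
        test = [data[i] for i in test_idx]
        correct = sum(predict(x) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

def fit_majority(train):
    """Assumed baseline: always predict the most frequent training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def fit_lookup(train):
    """Toy learner: most frequent label per feature tuple, majority label as fallback."""
    by_feats = {}
    for x, y in train:
        by_feats.setdefault(x, Counter())[y] += 1
    fallback = fit_majority(train)
    return lambda x: by_feats[x].most_common(1)[0][0] if x in by_feats else fallback(x)

# Synthetic data: one binary feature that fully determines the label.
data = [((i % 3 == 0,), "subjective" if i % 3 == 0 else "objective") for i in range(20)]

baseline = cross_validate(data, fit_majority)
learned = cross_validate(data, fit_lookup)
print(f"gain over baseline: {100 * (learned - baseline):.0f} percentage points")
```

Reporting the difference in percentage points, rather than raw accuracy, shows how much the features and model contribute beyond the class distribution alone.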


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Mari Broman Olsen (molsen@umiacs.umd.edu) or Philip Resnik (resnik@umiacs.umd.edu).