Information extraction -- the task of locating key pieces of information in text -- is central to many language and speech processing applications. Viewed abstractly, the extraction task can be reduced to binary or ternary sequence labeling, where the substrings of text to be extracted are identified by certain labels. The labeling task can be solved with well-known probabilistic models. Conditional models yield, for a given input string, a probability distribution over possible labelings. This in turn gives rise to the following inference problem: find an optimal solution to the original extraction task from a solution to the labeling task. Defining the "optimal" solution to mean the minimum risk (or maximum expected utility) hypothesis under a given loss (utility) function puts this problem on solid theoretical foundations.
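As a rough illustration of the reduction to binary sequence labeling, the sketch below recovers extracted substrings from a label sequence. The token list, the label values, and the helper name labels_to_spans are assumptions made up for this example; they are not taken from the talk. Note that a purely binary scheme cannot separate two adjacent extracted substrings, which is one motivation for the ternary labeling.

```python
def labels_to_spans(labels):
    """Return (start, end) pairs, end exclusive, for maximal runs of 1-labels."""
    spans = []
    start = None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i                      # a new extracted span begins here
        elif lab == 0 and start is not None:
            spans.append((start, i))       # the current span ended at i - 1
            start = None
    if start is not None:                  # a span runs to the end of the input
        spans.append((start, len(labels)))
    return spans


# Hypothetical example: 1 marks tokens to extract, 0 marks everything else.
tokens = "Martin Jansche joined Google in 2007".split()
labels = [1, 1, 0, 1, 0, 0]
extracted = [" ".join(tokens[s:e]) for s, e in labels_to_spans(labels)]
print(extracted)
```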
In this talk I will discuss recent work in optimal inference for information extraction where the utility function is the weighted F-score, which is widely used in language processing and information retrieval. For the binary sequence labeling task, I will present polynomial-time algorithms for optimal decoding under F-score [J07]. These make use of a general framework for evaluating the expectations of certain loss/utility functions, which is easily extended to the ternary labeling case [J06].
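To make the notion of maximum expected utility decoding under F-score concrete, here is a minimal brute-force sketch. It is not the polynomial-time algorithm of [J07]: it enumerates all labelings and is exponential in the sequence length. It also assumes, purely for illustration, that labels are independent with given per-position probabilities, that the weighted F-score takes the common F-beta form, and that F is 1 when both hypothesis and reference contain no positive labels. All function names are hypothetical.

```python
from itertools import product


def f_beta(hyp, ref, beta=1.0):
    """Weighted F-score of a hypothesis labeling against a reference labeling."""
    tp = sum(1 for h, r in zip(hyp, ref) if h == 1 and r == 1)
    fp = sum(1 for h, r in zip(hyp, ref) if h == 1 and r == 0)
    fn = sum(1 for h, r in zip(hyp, ref) if h == 0 and r == 1)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    # Assumed convention: nothing to find and nothing predicted counts as perfect.
    return 1.0 if denom == 0 else (1 + beta**2) * tp / denom


def prob(ref, marginals):
    """Probability of a labeling under the (assumed) independent-label model."""
    p = 1.0
    for r, m in zip(ref, marginals):
        p *= m if r == 1 else 1.0 - m
    return p


def expected_f(hyp, marginals, beta=1.0):
    """Expected F-score of hyp, summing over every possible true labeling."""
    return sum(
        prob(ref, marginals) * f_beta(hyp, ref, beta)
        for ref in product((0, 1), repeat=len(marginals))
    )


def meu_decode(marginals, beta=1.0):
    """Labeling with maximum expected F-score, found by exhaustive search."""
    return max(
        product((0, 1), repeat=len(marginals)),
        key=lambda hyp: expected_f(hyp, marginals, beta),
    )


print(meu_decode([0.4, 0.4, 0.4]))
```

With every position at probability 0.4, the single most probable labeling under this toy model is all zeros, yet the labeling with the highest expected F-score marks every position as positive. This gap between most-probable and utility-optimal hypotheses is what motivates decoding directly for the utility function.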
[J07] Martin Jansche. 2007. A maximum expected utility framework for binary sequence labeling. Annual Meeting of the Association for Computational Linguistics.
[J06] Martin Jansche. 2006. Algorithms for minimum risk chunking. Lecture Notes in Computer Science 4002.
Martin Jansche's main research interest is in empirical methods in automatic natural language and speech processing. He received his PhD in 2003 from Ohio State University and joined the newly founded Center for Computational Learning Systems at Columbia University. As of February 2007 he is with Google, Inc. in New York, working on web search and speech applications.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.