Search engines are extremely useful tools for answering questions. However, many questions that users might pose -- for example, "who has won a best actor Oscar for playing a villain?" -- can't be answered using existing search engines, because the answers do not lie on a single page. To answer these kinds of queries, users must extract and synthesize information from multiple documents. With existing search engines, this is a tedious and error-prone manual process.
In this talk, I will describe my research aimed at automating this extraction of information from the Web. I begin by describing a formal probabilistic model of the redundancy inherent in large text collections, and demonstrate that, when used for autonomous textual extraction, the model enables large improvements in accuracy over previous work. However, the model suffers from inaccuracies when applied to sparse data; my second investigation shows how unsupervised language models can be leveraged in concert with redundancy to handle sparsity. Lastly, I will describe recent work on extending this approach to a general machine learning setting, and demonstrate its efficacy on a different task, document classification.
Doug Downey is a graduate student at the University of Washington, where he is advised by Oren Etzioni. His research interests are in the areas of natural language processing, machine learning, and artificial intelligence. Doug was part of the KnowItAll project, a system that uses the Web to autonomously extract large knowledge bases. Doug's primary research results concern probabilistic models of the redundancy inherent in large corpora, along with associated techniques that allow systems like KnowItAll to extract data autonomously at high precision. Doug is a recipient of an NSF Fellowship and a Microsoft Research Graduate Fellowship, and a winner of the Distinguished Paper Award at IJCAI'05.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.