UMIACS Computational Linguistics Colloquium Series,
March 5, 1998
Glean: Using Syntactic Information in Document Filtering
Raman Chandrasekar
Institute for Research in Cognitive Science and
Center for the Advanced Study of India,
University of Pennsylvania
In this talk, I will describe a system called `Glean', which is
predicated on the idea that any coherent text contains significant
latent information, such as syntactic structure and patterns of
language use, which can be used to enhance the performance of
Information Retrieval systems. We show that
- syntactic information considerably improves the effectiveness
of filtering irrelevant documents, and
- a syntactic labeling technique known as supertagging is
more effective than part of speech tagging in filtering
documents.
Glean can be used to refine documents retrieved with a standard Web
search engine or an IR system by selecting relevant information and
filtering out irrelevant items. The system has been tested on a large
collection of newswire sentences, and achieves recall and precision
figures of 88% and 97%, respectively, for filtering out irrelevant
documents. Its performance and modularity make it a promising postprocessing addition
to any Information Retrieval system. A version of the system is
available on the Web.
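For readers unfamiliar with how recall and precision apply to a filtering task, the following is a minimal sketch, not Glean's own evaluation code. It assumes the "positive" class is the set of irrelevant documents that should be discarded; the function name, the document ids, and the toy data are all hypothetical.

```python
def filtering_scores(gold_irrelevant, filtered_out):
    """Recall and precision for the task of filtering out irrelevant documents.

    gold_irrelevant: set of ids judged irrelevant (hypothetical gold labels)
    filtered_out:    set of ids that the filter actually discarded
    """
    # Documents that were both truly irrelevant and discarded by the filter.
    correct = len(gold_irrelevant & filtered_out)
    # Recall: share of the truly irrelevant documents that were discarded.
    recall = correct / len(gold_irrelevant) if gold_irrelevant else 0.0
    # Precision: share of the discarded documents that were truly irrelevant.
    precision = correct / len(filtered_out) if filtered_out else 0.0
    return recall, precision


# Toy data (assumed for illustration only).
gold = {"d1", "d2", "d3", "d4"}
discarded = {"d1", "d2", "d3", "d9"}
recall, precision = filtering_scores(gold, discarded)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Under this reading, high precision means that few relevant documents are discarded by mistake, and high recall means that few irrelevant documents slip through.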
This is joint work with Dr. B. Srinivas, AT&T Labs Research.