UMIACS Computational Linguistics Colloquium, April 23, 2003

Empirical Studies on the Impact of Lexical Resources on CLIR Performance


Ralph Weischedel


BBN Technologies
Cambridge, MA


April 23, 2003,
3:30pm, AVW Room 2120


In this work, we compile and review several experiments measuring Cross-lingual Information Retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:

Although stemming is normally useful, it hurt performance in our empirical studies with Arabic, a highly inflected language, when a very large Arabic-English parallel corpus was available.

Joint work with Jinxi Xu.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).