UMIACS Computational Linguistics Colloquium,
April 23, 2003
Empirical Studies on the
Impact of Lexical Resources on CLIR Performance
Ralph Weischedel
BBN Technologies
Cambridge, MA
April 23, 2003,
3:30pm, AVW Room 2120
In this work, we compile and review several experiments measuring
Cross-lingual Information Retrieval (CLIR) performance as a function of the
following resources: bilingual term lists, parallel corpora, machine
translation (MT), and stemmers. Our CLIR system uses a simple
probabilistic language model; the studies used TREC test corpora over
Chinese, Spanish and Arabic. Our findings include:
- One can achieve acceptable CLIR performance using only a bilingual term
list (70-80% on Chinese and Arabic corpora).
- However, if a bilingual term list and parallel corpora are available, CLIR
performance can rival monolingual performance.
- If no parallel corpus is available, pseudo-parallel texts produced by an MT
system can partially overcome the lack of parallel text.
- While stemming is normally useful, it hurt performance in our empirical
studies with Arabic, a highly inflected language, when a very large
Arabic-English parallel corpus was available.
Joint work with Jinxi Xu.
For the colloquium series schedule, see the UMD
Computational Linguistics Colloquium Series web page at
http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested
in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).