Although character n-gram tokenization is generally accepted for languages such as Chinese and Japanese, it has not been widely adopted for information retrieval in alphabetic languages. Yet n-grams are a simple text representation that is surprisingly effective across diverse languages. In this talk I present empirical results for twelve European languages that have been studied in the Cross Language Evaluation Forum (CLEF) competitions. These results demonstrate that:
I will describe issues particular to n-gram indexing and retrieval such as increased disk space consumption and query times, make a case that n-grams are a synthetic form of morphological normalization, and argue that when linguistic and translation resources are scarce (as is the case with less-commonly studied languages), n-grams are an extremely attractive option for multilingual retrieval.
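To illustrate the idea of n-grams as a synthetic form of morphological normalization, here is a minimal sketch in Python. It is not taken from the talk or from the HAIRCUT system. The function names, the choice of n = 4, the lowercasing, and the padding of the text with a single space on each side are all assumptions made for the example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of text, in order.

    Assumptions for illustration only: the text is lowercased, runs of
    whitespace are collapsed, and one space is added on each side so that
    word boundaries show up inside the n-grams.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_ngrams(a, b, n=4):
    """Return the set of n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


if __name__ == "__main__":
    # Two inflected forms of one stem overlap without any stemmer or lexicon.
    print(sorted(shared_ngrams("juggling", "juggler")))
    # A short phrase already yields more n-grams than words, which hints at
    # why n-gram indexes tend to be larger than word indexes.
    phrase = "multilingual retrieval"
    print(len(phrase.split()), len(char_ngrams(phrase)))
```

In this sketch, "juggling" and "juggler" share the 4-grams covering their common stem, so a query containing one form can match a document containing the other with no language-specific resources. The same sketch shows the cost side: each word contributes many indexing terms instead of one.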
Paul McNamee holds a research appointment at the Johns Hopkins University Applied Physics Laboratory, where he conducts research in human language technologies. He is currently a part-time PhD student at the University of Maryland, Baltimore County. His current research interests are in the areas of cross-language information retrieval and multilingual information extraction. With colleagues at JHU/APL he has developed the HAIRCUT text retrieval system, which frequently attains top marks in international evaluations.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.