Term selection for searching printed Arabic

Title: Term selection for searching printed Arabic
Publication Type: Conference Papers
Year of Publication: 2002
Authors: Darwish K, Oard D
Conference Name: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Date Published: 2002
Conference Location: New York, NY, USA
ISBN Number: 1-58113-561-0
Keywords: Arabic, Information retrieval, OCR, term selection

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
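
As a rough illustration of the two kinds of indexing terms the abstract mentions, the sketch below generates overlapping character n-grams for each word and combines them with word-level terms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `char_ngrams` and `index_terms`, the default n-gram length of 3, the `#` prefix that keeps n-grams distinct from word terms, and the optional `stem` callback (a stand-in for whatever light stemmer is used) are all hypothetical choices made for this example.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n characters are returned whole (an assumption
    of this sketch, so that short words still produce a term).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=3, stem=None):
    """Combine word-level terms with character n-grams.

    `stem` is an optional, caller-supplied function standing in for a
    light stemmer (hypothetical here); when omitted, the surface word
    is used as the word-level term.
    """
    terms = []
    for token in text.split():
        # Word-level term: the stemmed word, or the token itself.
        terms.append(stem(token) if stem else token)
        # Character n-gram terms, marked with '#' to keep them
        # separate from word terms in a shared index.
        terms.extend("#" + gram for gram in char_ngrams(token, n))
    return terms


print(char_ngrams("كتاب", 3))   # ['كتا', 'تاب']
print(index_terms("abcd ef"))   # ['abcd', '#abc', '#bcd', 'ef', '#ef']
```

Because n-grams overlap, a single misrecognized character from OCR only affects the n-grams that contain it, leaving the others available for matching; this is one plausible reading of why such terms can be robust on scanned text.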