TY  - CONF
T1  - Term selection for searching printed Arabic
T2  - Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Y1  - 2002
A1  - Darwish, Kareem
A1  - Oard, Douglas
KW  - arabic
KW  - Information retrieval
KW  - OCR
KW  - term selection
AB  - Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
JA  - Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
T3  - SIGIR '02
PB  - ACM
CY  - New York, NY, USA
SN  - 1-58113-561-0
UR  - http://doi.acm.org/10.1145/564376.564423
M3  - 10.1145/564376.564423
ER  - 