%0 Conference Paper
%B The 10th European Conference on Computer Vision (ECCV 2008)
%D 2008
%T Learning Visual Shape Lexicon for Document Image Content Recognition
%A Zhu,Guangyu
%A Yu,Xiaodong
%A Li,Yi
%A Doermann,David
%X Developing effective content recognition methods for diverse imagery continues to challenge computer vision researchers. We present a new approach for document image content categorization using a lexicon of shape features. Each lexical word corresponds to a scale- and rotation-invariant shape feature that is generic enough to be detected repeatably and segmentation-free. We learn a concise, structurally indexed shape lexicon from training data by clustering and partitioning feature types through graph cuts. We demonstrate our approach on two challenging document image content recognition problems: 1) the classification of 4,500 Web images crawled from Google Image Search into three content categories (pure image, image with text, and document image), and 2) language identification of 8 languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) on a database of 1,512 complex document images composed of mixed machine-printed text and handwriting. Our approach can handle high intra-class variability and shows results that exceed other state-of-the-art approaches, allowing it to be used as a content recognizer in image indexing and retrieval systems.
%C Marseille, France
%P 745 - 758
%8 2008///
%G eng