Language Identification for Handwritten Document Images Using a Shape Codebook

Title: Language Identification for Handwritten Document Images Using a Shape Codebook
Publication Type: Journal Articles
Year of Publication: 2009
Authors: Zhu G, Yu X, Li Y, Doermann D
Journal: Pattern Recognition
Volume: 42
Pagination: 3184 - 3191
Date Published: 2009/12
Abstract

Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine-printed text, using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering and partitioning similar feature types through graph cuts. Our approach is easily extensible and does not require skew correction, scale normalization, or segmentation. We quantitatively evaluate our approach using a large real-world document image collection, which is composed of 1,512 documents in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) and contains a complex mixture of handwritten and machine-printed content. Experiments demonstrate the robustness and flexibility of our approach, and show exceptional language identification performance that exceeds the state of the art.
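
The codebook idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: plain k-means stands in for the graph-cut clustering, the feature vectors are hypothetical 2-D points rather than scale- and rotation-invariant shape features, and the function names (`build_codebook`, `describe`, `classify`) are invented for illustration. It shows only the general pattern of clustering local features into codewords, describing each document as a histogram of codeword occurrences, and assigning the label of the nearest training descriptor.

```python
import math
import random


def nearest(vec, centers):
    """Index of the center closest to vec (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(vec, centers[j]))


def build_codebook(features, k, iters=20, seed=0):
    """Cluster feature vectors into k codewords with plain k-means.

    Assumption: k-means is a stand-in for the graph-cut clustering
    described in the abstract.
    """
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest(f, centers)].append(f)
        for j, members in enumerate(groups):
            if members:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def describe(features, codebook):
    """Normalized histogram of codeword occurrences for one document.

    Assumes the document has at least one feature vector.
    """
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def classify(descriptor, train_descriptors, train_labels):
    """Label of the closest training descriptor (1-nearest-neighbor)."""
    best = min(range(len(train_descriptors)),
               key=lambda i: math.dist(descriptor, train_descriptors[i]))
    return train_labels[best]


# Toy data: each "document" is a list of hypothetical 2-D feature vectors.
doc_a = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1)]
doc_b = [(5.1, 5.0), (4.9, 5.2), (5.0, 4.8), (0.1, 0.0)]
codebook = build_codebook(doc_a + doc_b, k=2)
train = [describe(doc_a, codebook), describe(doc_b, codebook)]
print(classify(describe(doc_a, codebook), train, ["A", "B"]))
```

Because each descriptor is a histogram over codewords, it does not depend on where in the page a feature occurs, which is one reason a codebook representation can avoid explicit segmentation. Any invariance to scale or rotation would have to come from the features themselves, which this sketch does not model.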