Language Identification for Handwritten Document Images Using a Shape Codebook

Title	Language Identification for Handwritten Document Images Using a Shape Codebook
Publication Type	Journal Articles
Year of Publication	2009
Authors	Zhu G, Yu X, Li Y, Doermann D
Journal	Pattern Recognition
Volume	42
Pagination	3184 - 3191
Date Published	2009/12
Abstract	Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale and rotation invariant codewords, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering and partitioning similar feature types through graph cuts. Our approach is easily extensible and does not require skew correction, scale normalization, or segmentation. We quantitatively evaluate our approach using a large real-world document image collection, which is composed of 1,512 documents in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) and contains a complex mixture of handwritten and machine printed content. Experiments demonstrate the robustness and flexibility of our approach, and show exceptional language identification performance that exceeds the state of the art.
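
The abstract's idea of an image descriptor constructed from a codebook can be illustrated with a generic bag-of-codewords sketch. This is not the authors' implementation. It assumes shape features have already been extracted as plain numeric vectors, that each feature is assigned to its nearest codeword by Euclidean distance, and that the language is chosen by a 1-nearest-neighbor rule. The abstract states none of these details, and the graph-cut codebook learning step is not reproduced. The function names `nearest_codeword`, `describe`, and `identify` are hypothetical.

```python
import math
from collections import Counter


def nearest_codeword(feature, codebook):
    """Index of the codeword closest to `feature` (Euclidean distance; an assumption)."""
    return min(range(len(codebook)), key=lambda i: math.dist(feature, codebook[i]))


def describe(features, codebook):
    """Normalized histogram of codeword occurrences for one document image."""
    counts = Counter(nearest_codeword(f, codebook) for f in features)
    total = sum(counts.values()) or 1  # avoid division by zero for empty input
    return [counts.get(i, 0) / total for i in range(len(codebook))]


def identify(descriptor, labeled_descriptors):
    """Label of the closest training descriptor (1-nearest-neighbor; an assumption)."""
    return min(labeled_descriptors, key=lambda pair: math.dist(descriptor, pair[0]))[1]


# Toy data only: a three-word codebook and four features from one document.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (5.2, 4.8)]
descriptor = describe(features, codebook)

# Placeholder training descriptors; the labels are illustrative, not real results.
training = [([0.2, 0.6, 0.2], "English"), ([0.7, 0.1, 0.2], "Arabic")]
predicted = identify(descriptor, training)
```

Because the histogram only counts which codewords occur, the descriptor itself needs no page segmentation. Any scale or rotation invariance would have to come from the underlying features, as the abstract describes.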
