Learning Document Structure for Retrieval and Classification

Title: Learning Document Structure for Retrieval and Classification
Publication Type: Conference Papers
Year of Publication: 2012
Authors: Kumar J, Ye P, Doermann D
Conference Name: International Conference on Pattern Recognition (ICPR 2012)

In this paper, we present a method for the retrieval of document images with chosen layout characteristics. The proposed method is based on statistics of patch codewords over different regions of the image. We begin with a set of wanted images and a random set of unwanted images, representative of a large heterogeneous collection. We then use raw-image patches extracted from the unlabeled images to learn a codebook. To model the spatial relationships between patches, the image is recursively partitioned horizontally and vertically, and a histogram of patch codewords is computed in each partition. The resulting set of features, when used to train a random forest classifier, gives high precision and recall for the retrieval of hand-drawn and machine-print table documents and of unconstrained mixed form-type documents. We compare our method to the spatial-pyramid method and show that the proposed approach for learning layout characteristics is competitive for document images.
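
The partition-and-histogram step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each patch has already been assigned a codeword index (stored in a 2D array, here called `codeword_map`), and it reads "recursively partitioned horizontally and vertically" as splitting the image into 2, 4, 8, ... equal horizontal strips and, separately, the same number of vertical strips. The function name `partition_histograms` and the parameters `num_codewords` and `depth` are hypothetical names chosen for this example.

```python
import numpy as np


def partition_histograms(codeword_map, num_codewords, depth):
    """Concatenate codeword histograms over the whole image and over
    horizontal and vertical strips at each partition level.

    codeword_map: 2D integer array, one codeword index per patch location.
    num_codewords: size of the codebook (assumed known).
    depth: number of partition levels (assumed; level l uses 2**l strips).
    """
    def hist(region):
        # Count how often each codeword occurs in the region.
        return np.bincount(region.ravel(), minlength=num_codewords).astype(float)

    features = [hist(codeword_map)]  # level 0: the whole image
    for level in range(1, depth + 1):
        parts = 2 ** level
        # axis=0 yields horizontal strips (split along rows),
        # axis=1 yields vertical strips (split along columns).
        for axis in (0, 1):
            for strip in np.array_split(codeword_map, parts, axis=axis):
                features.append(hist(strip))
    return np.concatenate(features)


# Toy 4x4 codeword map over a 4-word codebook.
toy_map = np.arange(16).reshape(4, 4) % 4
toy_features = partition_histograms(toy_map, num_codewords=4, depth=2)
```

With `depth=2` the toy example covers 13 regions (1 whole image, 4 strips at level 1, 8 strips at level 2), so the feature vector has 13 × 4 = 52 entries. In a pipeline like the one the abstract outlines, vectors of this kind would be the input to a classifier such as a random forest; whether the histograms are normalized first is a design choice this sketch leaves open.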