%0 Conference Paper
%B International Conference on Pattern Recognition (ICPR 2012)
%D 2012
%T Learning Document Structure for Retrieval and Classification
%A Kumar, Jayant
%A Ye, Peng
%A Doermann, David
%X In this paper, we present a method for the retrieval of document images with chosen layout characteristics. The proposed method is based on statistics of patch codewords over different regions of the image. We begin with a set of wanted images and a random set of unwanted images representative of a large heterogeneous collection. We then use raw-image patches extracted from the unlabeled images to learn a codebook. To model the spatial relationships between patches, the image is recursively partitioned horizontally and vertically, and a histogram of patch codewords is computed in each partition. The resulting set of features gives high precision and recall for the retrieval of hand-drawn and machine-print table-documents, and unconstrained mixed form-type documents, when trained using a random forest classifier. We compare our method to the spatial-pyramid method, and show that the proposed approach for learning layout characteristics is competitive for document images.
%P 1558-1561
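
The feature construction described in the abstract (a histogram of patch codewords per partition, with the image recursively partitioned horizontally and vertically) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each patch has already been assigned a codeword index and stored in a 2D grid, that each region is cut in half once horizontally and once vertically at every level, and that histograms are normalized to sum to 1. The names codeword_histogram, partition_features, and depth are hypothetical.

```python
from collections import Counter


def codeword_histogram(grid, top, bottom, left, right, codebook_size):
    """Normalized histogram of codeword indices in rows top..bottom-1, cols left..right-1."""
    counts = Counter(
        grid[r][c] for r in range(top, bottom) for c in range(left, right)
    )
    total = sum(counts.values())
    if total == 0:
        # Empty region: return an all-zero histogram of the right length.
        return [0.0] * codebook_size
    return [counts.get(k, 0) / total for k in range(codebook_size)]


def partition_features(grid, codebook_size, depth, region=None):
    """Concatenate histograms over recursive horizontal and vertical halvings.

    Assumption (not taken from the paper): every region is split into an
    upper/lower pair and a left/right pair, and each half is split again
    until `depth` reaches 0.
    """
    if region is None:
        region = (0, len(grid), 0, len(grid[0]))
    top, bottom, left, right = region
    features = codeword_histogram(grid, top, bottom, left, right, codebook_size)
    if depth == 0:
        return features
    mid_row = (top + bottom) // 2
    mid_col = (left + right) // 2
    halves = [
        (top, mid_row, left, right),     # upper half (horizontal cut)
        (mid_row, bottom, left, right),  # lower half (horizontal cut)
        (top, bottom, left, mid_col),    # left half (vertical cut)
        (top, bottom, mid_col, right),   # right half (vertical cut)
    ]
    for half in halves:
        features.extend(partition_features(grid, codebook_size, depth - 1, half))
    return features


# Toy 4x4 grid of codeword indices drawn from a codebook of size 4.
grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
features = partition_features(grid, codebook_size=4, depth=1)
```

With depth=1 the vector concatenates five histograms (whole image, upper, lower, left, right), so it has 5 x 4 = 20 entries. In this sketch some sub-regions are visited more than once at greater depths, which a real implementation might deduplicate. The resulting vectors would then be passed to a classifier such as the random forest mentioned in the abstract.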