Yefeng Zheng


Model Based Line Detection and Its Application to Known Form Processing

Introduction

Millions of form documents, such as health insurance forms, checks, and bank slips, are processed every day. Some form processing systems are designed to handle a pre-defined set of forms, for which a priori information can be stored as templates in a database to guide later processing. For an input form, the system first selects the template that matches it best (form identification). It then detects anchors (such as specific marks or form frame lines) for registration, so that the variations introduced by scanning (e.g. rotation, translation, and scaling) can be compensated (form registration). Specially designed forms may carry special anchors that facilitate identification and registration. More general approaches instead rely, explicitly or implicitly, on features related to frame lines, such as the frame lines themselves, form cells, and the cross points of frame lines. Robust detection of frame lines is therefore crucial in these approaches.
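To make the registration step concrete, here is a minimal sketch of how rotation, translation, and scaling could be estimated from matched anchor points (for example, cross points of frame lines) and then undone. It is a generic least-squares similarity fit and is not taken from the method described on this page; the function names estimate_similarity and to_template and all sample coordinates are hypothetical.

```python
import cmath
import math


def estimate_similarity(template_pts, scanned_pts):
    """Least-squares fit of scanned ~= a * template + b over complex points.

    a = scale * exp(i * angle) carries rotation and scaling, b is translation.
    Hypothetical helper for illustration, not the page's own algorithm.
    """
    if len(template_pts) != len(scanned_pts) or len(template_pts) < 2:
        raise ValueError("need at least two matched anchor pairs")
    z = [complex(x, y) for x, y in template_pts]
    w = [complex(x, y) for x, y in scanned_pts]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    # Closed-form minimizer of sum |w_i - a * z_i - b|^2.
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    if den == 0:
        raise ValueError("template anchors must not all coincide")
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def to_template(point, a, b):
    """Map a point in the scanned image back into template coordinates."""
    w = (complex(point[0], point[1]) - b) / a
    return (w.real, w.imag)


# Made-up anchors: corners of a frame in the template, in pixels.
template = [(0, 0), (100, 0), (100, 50), (0, 50)]
# Simulate a scan rotated by 2 degrees, scaled by 1.02, shifted by (5, -3).
true_a = 1.02 * cmath.exp(1j * math.radians(2.0))
true_b = complex(5, -3)
scanned = []
for x, y in template:
    p = true_a * complex(x, y) + true_b
    scanned.append((p.real, p.imag))

a, b = estimate_similarity(template, scanned)
print("scale:", abs(a))
print("rotation (degrees):", math.degrees(cmath.phase(a)))
print("translation:", (b.real, b.imag))
```

Representing 2-D points as complex numbers keeps the fit to a single complex multiplication and addition. With noisy detections the same closed form gives the least-squares estimate, so using more anchors generally makes the result more stable.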

We proposed a hidden Markov model (HMM) based form processing scheme for both form identification and registration. The algorithm has been tested on the NIST Structured Forms Reference Set (NIST Special Database 2). The database consists of 5,590 pages of binary (black-and-white) images of synthesized documents. The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988: Forms 1040, 2106, 2441, 4562, and 6251, together with Schedules A, B, C, D, E, F, and SE. Eight of these forms contain two pages, or form faces, so 20 different form faces are represented in the database.

Twenty samples of each form face are used to train the HMMs, and the remaining samples are used for testing. The form identification accuracy is 100%. The line detection results look good on visual inspection, but they have not been quantitatively evaluated. Several examples follow (the lines detected for form identification and registration are shown in green).

[Example images: 1040_1 form face, 1040_2 form face, 2106_1 form face, 2106_2 form face]

More samples are available:

1040_1, 1040_2, 2106_1, 2106_2, 2441, 4562_1, 4562_2, 6251, sch_a, sch_b

sch_c_1, sch_c_2, sch_d_1, sch_d_2, sch_e_1, sch_e_2, sch_f_1, sch_f_2, sch_se_1, sch_se_2

Demo

Download the demos here (for Windows operating systems).

Source Code

Download source code here. The code should be used for non-commercial purposes only!

Test Samples

Here are two test sample sets: bank deposit slips and NIST forms.

