Using the structure of Web sites for automatic segmentation of tables

Title: Using the structure of Web sites for automatic segmentation of tables
Publication Type: Conference Papers
Year of Publication: 2004
Authors: Lerman K, Getoor L, Minton S, Knoblock C
Conference Name: Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Date Published: 2004
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 1-58113-859-8
Abstract

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
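
The constraint-based idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a minimal backtracking search under simplifying assumptions that are this example's own. The function name `segment_table`, the representation of detail pages as sets of strings, and the three constraints listed in the docstring are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Set


def segment_table(extracts: List[str],
                  detail_pages: List[Set[str]]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None.

    Assumed constraints (illustrative, not taken from the paper):
      1. An extract may belong to record r only if it also occurs
         on detail page r.
      2. Records are contiguous and appear in detail-page order, so
         each extract stays in the current record or starts the next.
      3. Every detail page receives at least one extract.
    """
    n, m = len(extracts), len(detail_pages)
    if n == 0 or m == 0:
        return None
    assignment: List[int] = []

    def backtrack(pos: int, current: int) -> bool:
        if pos == n:
            # Constraint 3: the last record must have been reached.
            return current == m - 1
        # Constraint 2: the first extract opens record 0; later
        # extracts either continue the record or open the next one.
        candidates = [0] if pos == 0 else [current, current + 1]
        for r in candidates:
            # Constraint 1: the extract must occur on detail page r.
            if r < m and extracts[pos] in detail_pages[r]:
                assignment.append(r)
                if backtrack(pos + 1, r):
                    return True
                assignment.pop()  # undo and try the other candidate
        return False

    return list(assignment) if backtrack(0, 0) else None


# Made-up data: "New York" occurs on both detail pages, so content
# alone is ambiguous; the ordering constraint resolves it.
table = ["John Smith", "New York", "555-1234",
         "Jane Doe", "New York", "555-9876"]
pages = [{"John Smith", "New York", "555-1234"},
         {"Jane Doe", "New York", "555-9876"}]
print(segment_table(table, pages))  # [0, 0, 0, 1, 1, 1]
```

The returned list gives, for each extract in table order, the index of the record (and detail page) it was assigned to. A value of `None` means the assumed constraints cannot all be satisfied, which is the kind of situation where a softer, probabilistic formulation such as the abstract's second algorithm might be preferable.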

URL: http://doi.acm.org/10.1145/1007568.1007584
DOI: 10.1145/1007568.1007584