Materializing multi-relational databases from the web using taxonomic queries

Title: Materializing multi-relational databases from the web using taxonomic queries
Publication Type: Conference Papers
Year of Publication: 2011
Authors: Michelson M, Macskassy SA, Minton SN, Getoor L
Conference Name: Proceedings of the fourth ACM international conference on Web search and data mining
Date Published: 2011
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-0493-1
Keywords: discovering multi-relational data, multirelational data

Recently, much attention has been given to extracting tables from Web data. In this problem, the column definitions and tuples (such as which "company" is headquartered in which "city") are extracted from Web text, structured Web data such as lists, or the results of querying the deep Web, creating the table of interest. In this paper, we examine the problem of extracting and discovering multiple tables in a given domain, generating a truly multi-relational database as output. Beyond discovering the relations that define single tables, our approach discovers and leverages "within column" set membership relations, and discovers relations across the extracted tables (e.g., joins). By leveraging within-column relations, our method can extract table instances that are ambiguous or rare, and by discovering joins, it generates truly multi-relational output. Further, our approach uses taxonomic queries to bootstrap the extraction, rather than the more traditional "seed instances." Creating seeds often requires more domain knowledge than writing taxonomic queries, and previous work has shown that extraction methods may be sensitive to which input seeds they are given. We test our approach on two real-world domains: NBA basketball and cancer information. Our results demonstrate that our approach generates databases of relevant tables from disparate Web information and discovers the relations between them. Further, we show that by leveraging the "within column" relation, our approach can identify a significant number of relevant tuples that would otherwise be difficult to extract.