Abstract | We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at
regular time intervals and archive the contents after each crawl
regardless of whether or not the contents have changed between
consecutive crawls. Our goal is to store newly crawled web contents
only when they differ from those of the previous crawl, while ensuring
accurate and quick retrieval of archived contents based on arbitrary
temporal queries over the archived time period. In this paper, we
develop a scheme that stores unique temporal web contents in
containers following the widely used ARC/WARC format, and that
provides quick access to the archived contents for arbitrary temporal
queries. A novel component of our scheme is the use of a new indexing
structure based on the concept of persistent or multi-version data
structures. Our scheme can be shown to be asymptotically optimal both
in storage utilization and insert/retrieval time. We illustrate the
performance of our method on two very different data sets from the
Stanford WebBase project, the first reflecting very dynamic web
contents and the second relatively static web contents. The
experimental results clearly illustrate the substantial storage
savings achieved by eliminating duplicate contents detected between
consecutive crawls, as well as the speed at which our method can find
the archived contents specified through arbitrary temporal queries.
|
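
The two requirements stated in the abstract, storing a crawled page only when it differs from the previous crawl and answering a query for the version valid at an arbitrary time, can be illustrated with a minimal sketch. This is not the indexing structure developed in the paper: it swaps in a plain per-URL sorted list with binary search in place of a persistent (multi-version) structure, and it keeps payloads in memory rather than in ARC/WARC containers. The class name `DedupArchive`, its methods, and the use of SHA-1 digests for change detection are assumptions made for illustration only. Crawls are assumed to arrive in increasing time order.

```python
import bisect
import hashlib


class DedupArchive:
    """Toy archive that keeps a new version of a URL only when its content changed.

    Hypothetical illustration; not the scheme described in the paper.
    """

    def __init__(self):
        self._times = {}        # url -> crawl times of stored versions (ascending)
        self._bodies = {}       # url -> payloads, parallel to self._times[url]
        self._last_digest = {}  # url -> digest of the most recently stored payload

    def insert(self, url, crawl_time, body):
        """Record one crawl result; return True if a new version was stored."""
        digest = hashlib.sha1(body).hexdigest()
        if self._last_digest.get(url) == digest:
            return False  # unchanged since the previous crawl: store nothing
        self._times.setdefault(url, []).append(crawl_time)
        self._bodies.setdefault(url, []).append(body)
        self._last_digest[url] = digest
        return True

    def query(self, url, when):
        """Return the payload that was current at time `when`, or None."""
        times = self._times.get(url, [])
        # Index of the first stored version strictly newer than `when`.
        i = bisect.bisect_right(times, when)
        if i == 0:
            return None  # the URL had not been archived yet at that time
        return self._bodies[url][i - 1]


archive = DedupArchive()
archive.insert("http://example.org/", 1, b"v1")
archive.insert("http://example.org/", 2, b"v1")  # duplicate crawl, not stored
archive.insert("http://example.org/", 3, b"v2")
print(archive.query("http://example.org/", 2))
```

In this sketch, storage grows with the number of distinct versions rather than the number of crawls, and each temporal lookup costs one binary search over the versions of a single URL.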