Introduction
This page discusses the problem of random access to ClueWeb09 WARC records, i.e., how to fetch individual web pages from the collection quickly. See my guide on working with the ClueWeb09 collection for a general introduction. The collection is distributed as a number of gzipped WARC files, each containing approximately 40k pages (about a TB uncompressed). The well-known problem with gzip files is that the format lacks an efficient way of seeking to some point in the middle of a compressed stream—you have to read through the entire archive each time. Therefore, if the web page being retrieved resides at the end of the gzipped WARC file, you'll have to basically read through the entire file.
For information retrieval experiments, there are several workarounds. Often, the documents of interest are known in advance, in which case, a not unreasonable solution is to perform a sequential scan through the entire collection and pull out documents of interest. Obviously, this isn't going to work for interactive retrieval.
Another solution is to simply uncompress the collection, which works well if you have lots of disk space—in fact, that's how we've previously worked with the collection. However, this seems like an awful waste of space, especially since the WARC files achieve about a five to one compression ratio.
Compression Schemes
The solution is fairly obvious: let's repack the gzipped WARC files in a block-compressed format. The tradeoff is between space and latency: with smaller block sizes, we achieve lower latencies, at the cost of more space (lower compression ratio). In the limit, we arrive at record-level compression, where each web page is separately compressed—the lowest possible latency (essentially, just a seek), but space-inefficient.
As it turns out, SequenceFiles in Hadoop already support block- and
record-level compression. We simply need to run some experiments to
empirically determine what the tradeoffs are. The following
experiments specifically focus on the first WARC file in the first
English segment of ClueWeb09
(i.e., ClueWeb09_English_1/en0000/00.war.gz). The
SequenceFile contains IntWritable keys and ClueWarcRecord values.
Here is the disk usage based on different compression schemes:
| compression scheme | bytes | difference |
| original gzipped WARC file | 168967171 | |
| uncompressed SequenceFile | 1047281773 | +520% |
| record-compressed SequenceFile | 244575860 | +44.7% |
| block-compressed SequenceFile (block=1000000) | 171185069 | +1.3% |
| block-compressed SequenceFile (block=500000) | 172885152 | +2.3% |
| block-compressed SequenceFile (block=100000) | 185094067 | +9.5% |
We see that web pages compress very well; here, slightly better than five to one. As expected, record-level compression isn't very space-efficient. It took a while to figure out, but the compression block size is controlled by the obscure Hadoop parameter "io.seqfile.compress.blocksize" (measured in bytes), with a default of 1000000. The default value seems to work well—barely a loss in compression efficiency (and this includes SequenceFile space overhead).
What about latency? These experiments were done on a local machine, so latency measurements aren't going to be particularly meaningful, since in the end we're going to be fetching from SequenceFiles stored in HDFS (actual end-to-end latency results will be presented later). For now, computing the block size (in number of records) will at least give us a sense of how much sequential reading can be expected under different conditions. This particular WARC file contains 35582 web pages. Results are shown below:
| compression scheme | # blocks | avg. pages/block |
| block-compressed SequenceFile (block=1000000) | 1024 | 35 |
| block-compressed SequenceFile (block=500000) | 2004 | 18 |
| block-compressed SequenceFile (block=100000) | 8673 | 4 |
These experiments were performed with ScanBlockCompressedSequenceFile in edu.umd.cloud9.collection.clue, which scans a block-compressed SequenceFile and outputs the block boundaries.
Note that the Hadoop SequenceFile block-compression scheme is parameterized by block size (which is the correct design decision), not number of records, so number of pages per block will depend on sizes of individual pages. These results show that, with the default block size, accessing a random page will on average require sequentially reading through 17 other pages. There is an additional tradeoff to consider: for random access, it is necessary to hold all block pointers (offsets) in memory. The smaller the block size, the more the blocks, and hence the larger memory footprint of structures necessary to support random access. All things considered, the default block size appears to be a good choice.
Repacking the WARC Files
Cloud9 comes with a program for repacking the original ClueWeb09 gzipped WARC files into block-compressed SequenceFiles. Sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.RepackClueWarcRecords \ /shared/ClueWeb09/collection.raw /shared/ClueWeb09/collection.compressed.block/en.01 1 \ /shared/ClueWeb09/docno-mapping.dat block
The first argument is the path of the collection, the second is the output directory, the third is the segment number, the fourth is the docno mapping data file (which is required since the keys in the SequenceFiles contain docnos), and the final argument is the string "block" (for block-level compression), "record" (for record-level compression), and "none" (for no compression). Since we have the luxury of a large cluster (a few hundred machines), repacking the first English segment of ClueWeb09 takes about twenty minutes (with Java built-in compression; with native libraries this should be even faster).
A sample result: the first English segment of ClueWeb09 weighs in at 247,363,677,391 bytes in its original distribution. After repacking, the size expands ever so slightly to 250,062,048,064 byte. For a tiny cost in space, we get random access...
Supporting Random Access
A forward index that will support random access to ClueWeb09 web pages is as simple as noting where all the block boundaries are. Cloud9 has a indexer to do exactly that. Sample invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.BuildClueWarcForwardIndex \ /shared/ClueWeb09/collection.compressed.block/en.01 /tmp/findex/ \ /shared/ClueWeb09/collection.compressed.block/findex.en.01.dat
The first argument is the location of the repacked block-compressed SequenceFiles, the second is a temporary path for MapReduce output, and the third is the location of the index file to be written.
The index file contains all the block locations in a binary-encoded format. Each block location is a triple of docno, block byte offset, and file number. To support random access, all block locations are loaded into memory in an array. Given the docno of a page, fetching it involves performing binary search over the block locations to find the correct block, opening up the proper file, seeking to the block byte offset, and reading through records until the desired docno is encountered. This is handled by the class ClueWarcForwardIndex in edu.umd.cloud9.collection.clue, which implements exactly the algorithm sketched above and provides an abstraction for fetching ClueWeb09 web pages.
Finally, some empirical results: on our cluster running Hadoop 0.20.1, with block-compressed SequenceFiles stored in HDFS, we get random access latencies in the 100-150ms range. The configuration of the cluster at the time of the experiment was 99 nodes, 198 cores, two 400 GB disks per node. In a bandwidth saturated scenario (e.g., while the cluster is running a distcp) latency drops to the 150-200ms range, which is still acceptable. Note that in these experiments we were using built-in Java compression; once again, with native libraries this should be substantially faster). This appears to be acceptable for interactive retrieval, especially considering that end-to-end latency is dominated by other things like fetching page images remotely.
Random Access Webapp
Finally, as a "cute" hack, we've developed a webapp for accessing ClueWeb09 pages within Hadoop itself. The lightweight HTTP server Jetty is already included in the Hadoop distribution (it's what runs the jobtracker, namenode, and other webapps). What we've done is folded a Jetty server into a mapper, so you can fire up the webapp in the same way you start a Hadoop job. In this case, the Hadoop job has only one mapper, and the mapper starts up a Jetty server. To find out where the server is running, access the tasktracker logs, as it's in the logging output. You'll find the hostname and port, and then you should be able to directly connect to the server. The webapp support fetching pages by both docno and docid. Here's a sample invocation of this:
hadoop jar cloud9.jar edu.umd.cloud9.collection.DocumentForwardIndexHttpServer \ /shared/ClueWeb09/collection.compressed.block/findex.en.01.dat \ /shared/ClueWeb09/docno-mapping.dat
The first argument is the location of the collection and the second is the index file. This webapp has been tested on Hadoop 0.20.1 and is known to issues with pre-20 Hadoop versions (since they used a different version of Jetty with an incompatible API). The interface will look something like this:
And that's it! Have fun! Please give us feedback if you find any issues, find any bugs, etc.