Introduction
The ClueWeb09 collection consists of one billion web pages (5 TB compressed, 25 TB uncompressed), in ten languages, crawled in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The entire collection is available for research purposes. This guide provides instructions on processing the ClueWeb09 collection using Hadoop with Cloud9, building on code developed by Mark Hoy at CMU. For now, this page deals exclusively with the English portion of the collection.
See the "official" dataset information page for the definitive description of the collection. Some content on this page simply mirrors information on that page for convenience.
In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection (2.08 TB compressed, 13.4 TB uncompressed). The English data is distributed in ten parts (called segments), each corresponding to a directory. Here are the page counts for each segment:
| segment | # of pages |
| ClueWeb09_English_1 | 50,220,423 |
| ClueWeb09_English_2 | 51,577,077 |
| ClueWeb09_English_3 | 50,547,493 |
| ClueWeb09_English_4 | 52,311,060 |
| ClueWeb09_English_5 | 50,756,858 |
| ClueWeb09_English_6 | 50,559,093 |
| ClueWeb09_English_7 | 52,472,358 |
| ClueWeb09_English_8 | 49,545,346 |
| ClueWeb09_English_9 | 50,738,874 |
| ClueWeb09_English_10 | 45,175,228 |
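As a quick check on the figures above, the per-segment counts can be summed and compared with the stated total of 503,903,810 pages. This is only an illustrative sketch: the variable names are made up, and the numbers are copied by hand from the listing above.

```python
# Page counts per English segment, copied from the listing above.
SEGMENT_PAGES = {
    1: 50_220_423,
    2: 51_577_077,
    3: 50_547_493,
    4: 52_311_060,
    5: 50_756_858,
    6: 50_559_093,
    7: 52_472_358,
    8: 49_545_346,
    9: 50_738_874,
    10: 45_175_228,
}

# Total stated in the text for the English portion.
STATED_TOTAL = 503_903_810

total = sum(SEGMENT_PAGES.values())
print(f"sum of segments: {total:,}")
print("matches stated total:", total == STATED_TOTAL)
```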
Each segment contains a number of sub-directories, each of which
contains a number of compressed WARC files. There is no official name
for these sub-directories, so to be precise I'll call them sections.
They are numbered from en0000 all the way
through en0133. For example, en0000
to en0011 belong to segment 1. That is, the first
segment contains the following sub-directories:
ClueWeb09_English_1/en0000
ClueWeb09_English_1/en0001
...
ClueWeb09_English_1/en0011
In addition, the first segment contains four special
sections: enwp00, enwp01, enwp02, and
enwp03. These sections hold a version of English
Wikipedia. Also, sections en0083, en0084,
and en0097 do not exist (or are empty, depending on the
distribution).
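The naming scheme is regular enough to enumerate in a few lines. The following is a minimal sketch, not part of Cloud9; the function and variable names are hypothetical, and it encodes only what is stated above (numbered sections en0000 through en0133, three of them missing, and the Wikipedia sections enwp00 through enwp03 in segment 1).

```python
# Sections that the text says do not exist (or are empty).
MISSING = {"en0083", "en0084", "en0097"}


def numbered_sections(first=0, last=133):
    """Return the existing section names en<first> .. en<last>, inclusive."""
    names = (f"en{i:04d}" for i in range(first, last + 1))
    return [name for name in names if name not in MISSING]


# Wikipedia sections listed in the text for segment 1.
wikipedia_sections = [f"enwp{i:02d}" for i in range(4)]

# Segment 1 holds en0000..en0011 plus the Wikipedia sections.
segment1 = [
    f"ClueWeb09_English_1/{name}"
    for name in numbered_sections(0, 11) + wikipedia_sections
]

for path in segment1:
    print(path)
```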
Counting the Records
Once you've gotten your hands on the collection, the first thing you might want to do is run some sanity checks. The simplest sanity check is to read through all the records and count them. This functionality is provided by a demo program in Cloud9. Here's the command-line invocation:
hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.DemoCountClueWarcRecords \
  original /shared/ClueWeb09/collection.raw 1 /shared/ClueWeb09/docno-mapping.dat
The first command-line argument indicates whether you're counting records in the original distribution ("original") or in repacked SequenceFiles ("repacked"); for the latter, see details below. The second argument is the base path of your ClueWeb09 distribution. The third command-line argument is the segment number (1 through 10). The final argument is the location of the docno mapping (see details below). If you run the demo program on all ten segments, you should get the following results:
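To run the check over all ten segments, the invocation can be generated rather than typed by hand. The sketch below only builds and prints the command lines; it does not call Hadoop. The helper name `count_command` is hypothetical, the argument order follows the description above, and the default paths are the ones from the example invocation, so adjust them to your own installation (the base path for repacked data will likely differ).

```python
import shlex

JAR = "cloud9.jar"
MAIN_CLASS = "edu.umd.cloud9.collection.clue.DemoCountClueWarcRecords"


def count_command(segment,
                  mode="original",
                  base_path="/shared/ClueWeb09/collection.raw",
                  mapping="/shared/ClueWeb09/docno-mapping.dat"):
    """Build the argument list for counting records in one segment."""
    if mode not in ("original", "repacked"):
        raise ValueError("mode must be 'original' or 'repacked'")
    if not 1 <= segment <= 10:
        raise ValueError("segment must be between 1 and 10")
    return ["hadoop", "jar", JAR, MAIN_CLASS,
            mode, base_path, str(segment), mapping]


# Print one command line per segment; pass the list to subprocess.run
# instead if you want to launch the jobs from Python.
for seg in range(1, 11):
    print(shlex.join(count_command(seg)))
```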
| segment | # files | bytes off disk | # of records | # of pages | total size |
| 1 | 1492 | 246,838,508,311 | 50,221,915 | 50,220,423 | 1,527,155,667,036 |
| 2 | 1416 | 224,505,694,289 | 51,578,493 | 51,577,077 | 1,435,415,062,235 |
| 3 | 1375 | 217,428,760,570 | 50,548,868 | 50,547,493 | 1,392,234,129,944 |
| 4 | 1363 | 213,615,715,952 | 52,312,423 | 52,311,060 | 1,379,063,022,766 |
| 5 | 1322 | 205,092,621,204 | 50,758,180 | 50,756,858 | 1,333,142,147,191 |
| 6 | 1302 | 203,616,661,324 | 50,560,395 | 50,559,093 | 1,314,228,067,242 |
| 7 | 1358 | 213,335,482,896 | 52,473,716 | 52,472,358 | 1,366,774,469,429 |
| 8 | 1295 | 199,607,688,405 | 49,546,641 | 49,545,346 | 1,308,432,844,339 |
| 9 | 1306 | 204,295,706,812 | 50,740,180 | 50,738,874 | 1,331,922,112,879 |
| 10 | 988 | 150,042,155,900 | 45,176,216 | 45,175,228 | 983,120,555,934 |
Description of the columns:
- segment: the segment in question.
- # files: number of files (compressed WARC files) in the segment.
- bytes off disk: total size of the compressed WARC files on disk.
- # of records: number of WARC records.
- # of pages: number of actual HTML pages.
- total size: uncompressed size of all records.
As a note, each compressed WARC file begins with a single header record followed by the actual HTML pages, so the number of records should equal the number of files plus the number of pages.
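The relationship between records, files, and pages can be checked directly against the table. This is a small illustrative sketch; the values are copied by hand from the table above and the names are made up.

```python
# (segment, # files, # of records, # of pages), copied from the table above.
ROWS = [
    (1, 1492, 50_221_915, 50_220_423),
    (2, 1416, 51_578_493, 51_577_077),
    (3, 1375, 50_548_868, 50_547_493),
    (4, 1363, 52_312_423, 52_311_060),
    (5, 1322, 50_758_180, 50_756_858),
    (6, 1302, 50_560_395, 50_559_093),
    (7, 1358, 52_473_716, 52_472_358),
    (8, 1295, 49_546_641, 49_545_346),
    (9, 1306, 50_740_180, 50_738_874),
    (10, 988, 45_176_216, 45_175_228),
]

# Collect any segment where records != files + pages.
mismatches = [
    segment
    for segment, files, records, pages in ROWS
    if records != files + pages
]

print("segments that violate records = files + pages:", mismatches)
```

If your own counts differ from the table, this kind of check helps tell a missing file from a miscounted record.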
Repacking the Records
In some ways, the original WARC files are awkward to work with. There is, for example, no simple way to quickly access an individual record that lies in the middle of a gzipped file. A good solution is to repack the collection into block-compressed SequenceFiles. This is described in a separate page on providing random access to the WARC records.
Sequentially-numbered docnos
Many information retrieval and other text processing tasks require that all documents in the collection be sequentially numbered, from 1 to n. Typically, you'll want to start with document 1 as opposed to 0 because it is not possible to represent 0 with many standard compression schemes used in information retrieval (e.g., Golomb codes). For clarity, I call these sequentially-numbered document ids docnos, whereas I call the original ids docids. (This is a bit confusing, since in previous TREC collections the alphanumeric document ids are tagged as DOCNOs.)
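To make the point about 0 concrete, here is a sketch of Elias gamma coding, another code commonly used for compressing postings. It is used here in place of Golomb codes only because it needs no parameter and fits in a few lines. Its codewords are defined for positive integers only, so a docno of 0 has no representation. The function name is illustrative.

```python
def elias_gamma(x: int) -> str:
    """Return the Elias gamma codeword for x as a string of bits.

    The codeword is (len - 1) zeros followed by the binary form of x,
    where len is the number of bits in that binary form.
    """
    if x < 1:
        raise ValueError("Elias gamma is defined only for integers >= 1")
    binary = bin(x)[2:]  # binary digits of x without the '0b' prefix
    return "0" * (len(binary) - 1) + binary


for n in (1, 2, 5):
    print(n, elias_gamma(n))

try:
    elias_gamma(0)
except ValueError as err:
    print("0 cannot be encoded:", err)
```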
The format of a docid (WARC-TREC-ID) in the collection is
clueweb09-enXXXX-YY-ZZZZZ. Due to this regular format,
it is very easy to algorithmically map between docnos and docids. In
Cloud9, the
ClueWarcDocnoMapping
class in the edu.umd.cloud9.collection.clue package provides an API
for you.
Even if you don't want to use the Cloud9 API, this mapping data file should be useful. Here are the first few lines:
en0000,0,35582,1
en0000,1,28413,35583
en0000,2,36053,63996
en0000,3,36260,100049
en0000,4,34786,136309
en0000,5,33015,171095
...
It's a CSV data file, where the first column represents
the enXXXX portion of the docid, the second column
represents the YY portion, and the third column lists the
number of pages with the enXXXX-YY prefix.
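Since ZZZZZ starts at zero, the last ZZZZZ value for a given prefix
is the third column minus one. The fourth column is the docno of the
first page with that prefix, that is, one plus the cumulative number
of pages under all preceding prefixes (for example, 1 + 35582 = 35583
in the second row). With this information, mapping between
docnos and docids is a simple matter of arithmetic.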
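The arithmetic can be sketched as follows. This is not the ClueWarcDocnoMapping class and makes no claim about how Cloud9 implements the mapping; it is a minimal illustration under stated assumptions: the fourth column is the docno of the first page with each prefix, the rows are sorted by that column, YY is zero-padded to two digits, and ZZZZZ to five. All function names are hypothetical, and only the first three rows of the file are embedded.

```python
import bisect
import csv
import io

# First rows of the mapping file, as shown above (the real file is longer).
SAMPLE = """\
en0000,0,35582,1
en0000,1,28413,35583
en0000,2,36053,63996
"""


def load_mapping(text):
    """Parse CSV rows into (section, file_no, page_count, first_docno)."""
    return [
        (section, int(file_no), int(count), int(first))
        for section, file_no, count, first in csv.reader(io.StringIO(text))
    ]


def docid_to_docno(docid, rows):
    """Map clueweb09-enXXXX-YY-ZZZZZ to its sequential docno."""
    _, section, yy, zzzzz = docid.split("-")
    for sec, file_no, count, first in rows:
        if sec == section and file_no == int(yy):
            offset = int(zzzzz)
            if not 0 <= offset < count:
                raise ValueError(f"offset out of range in {docid}")
            return first + offset
    raise KeyError(docid)


def docno_to_docid(docno, rows):
    """Map a sequential docno back to its docid."""
    starts = [first for _, _, _, first in rows]
    i = bisect.bisect_right(starts, docno) - 1  # last row starting <= docno
    if i < 0:
        raise ValueError(f"docno {docno} is below the first docno")
    sec, file_no, count, first = rows[i]
    offset = docno - first
    if offset >= count:
        raise ValueError(f"docno {docno} is beyond the loaded rows")
    return f"clueweb09-{sec}-{file_no:02d}-{offset:05d}"


rows = load_mapping(SAMPLE)
print(docid_to_docno("clueweb09-en0000-01-00000", rows))
print(docno_to_docid(35582, rows))
```

The linear scan in `docid_to_docno` keeps the sketch short; for the full file, a dictionary keyed on (section, YY) would be the obvious replacement.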
Malformed records
There are a number of malformed WARC records in the English portion
of the collection (there may be malformed records in the other
languages also, but I haven't analyzed them yet). The most prevalent
problem is an extra newline in the WARC header. There are a few cases
of other malformed headers also. See
this list of docids: each
docid refers to a WARC record that immediately precedes a malformed
WARC record. For example, the first docid in the list
is clueweb09-en0001-41-14941, which means
that clueweb09-en0001-41-14942 is malformed.
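Because the list names the record before each malformed one, the malformed docid is obtained by incrementing the ZZZZZ field. A minimal sketch, assuming docids within a file are consecutive as in the example above; the function name is hypothetical, and the sketch does not check whether the result exceeds the page count of the file.

```python
def following_docid(docid: str) -> str:
    """Return the docid that immediately follows the given one in its file."""
    prefix, section, yy, zzzzz = docid.split("-")
    # Keep the original zero-padded width of the ZZZZZ field.
    next_offset = f"{int(zzzzz) + 1:0{len(zzzzz)}d}"
    return f"{prefix}-{section}-{yy}-{next_offset}"


print(following_docid("clueweb09-en0001-41-14941"))
```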
Almost all of the malformed WARC records referenced in the list above have an extra newline in the WARC header. The exceptions are the following docids, which are malformed in other ways (all cases of a garbled URL):
clueweb09-en0044-01-04501
clueweb09-en0059-46-06368
clueweb09-en0117-48-12547
clueweb09-en0112-59-06118
clueweb09-en0126-33-37391
clueweb09-en0126-88-10049
These errata are provided primarily for reference. The API in Cloud9 transparently handles these malformed WARC records (thanks to code originally written by Mark Hoy from CMU).