Cloud9 was designed both to serve as a teaching tool and to support research in text processing. It was used in "cloud computing" courses at the University of Maryland in Spring 2008 and Fall 2008. The library is available as a tarball or via anonymous Subversion checkout; the download is relatively large because it bundles distributions of Hadoop, so you don't have to download Hadoop separately. Like Hadoop itself, Cloud9 is distributed under the Apache License.
Getting It
- Subversion access: https://subversion.umiacs.umd.edu/umd-hadoop/core
- Cloud9, release 0.2 (2009/11/18): cloud9-r0.2.tar.gz (47.3 MB)
- Cloud9, release 0.1 (2009/07/16): cloud9-r0.1.tar.gz (131 MB)
Starting Points
- Hadoop homepage
- Hadoop API [0.20.1]
- Cloud9 API javadoc
- Downloading and setting up Cloud9
- Getting started with Cloud9 in standalone mode
- Getting started with Cloud9 on EC2
- Getting started with S3
- Getting started with the Google/IBM CLuE cluster
- MapReduce and related bibliography
- Sample text collection, consisting of the Bible and the complete works of Shakespeare (~2.8 MB compressed, ~9 MB uncompressed)
Next Steps
- Staging records and working with SequenceFiles
- Working with complex data types
- Primer on MapReduce algorithm design
- Working with counters
- Efficient feature vectors and maps
- Debugging MapReduce programs with log4j
- Working with standard document collections
- Frequently asked questions on the Hadoop user mailing list
- Guide to working on the ClueWeb09 collection
- Random access to ClueWeb09 WARC records
Exercises
This work is or has been supported by the following sources: NSF under awards IIS-0836560 and IIS-0705832; Google and IBM under the Academic Cloud Computing Initiative (ACCI); the Intramural Research Program of the NIH, National Library of Medicine; DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program; and Amazon Web Services. Any opinions, findings, conclusions, or recommendations expressed here do not necessarily reflect those of the sponsors.