Introduction
This tutorial will get you started with Cloud9 in standalone mode. In standalone mode, you run Hadoop directly on your local machine. Of course, you don't get the benefit of distributing your code across multiple machines, but it's a good start for learning about Hadoop. This tutorial assumes you've already downloaded Cloud9 and set it up; if not, see my tutorial on that. Also see the companion tutorial on getting started with Cloud9 on EC2.
For Windows users: If you are using Windows, use Cygwin. That's what I mean when I say, "open up a shell".
Step 1: Configure Hadoop for standalone mode
This tutorial assumes Hadoop 0.20.1. Make sure you've downloaded and unpacked the Hadoop distribution in umd-hadoop-core/hadoop/. See this guide for more details.
Open up a shell and go to umd-hadoop-core/hadoop/hadoop-0.20.1/conf/. Make sure the file core-site.xml doesn't actually specify configuration parameters:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
</configuration>
Verify the same for hdfs-site.xml and mapred-site.xml. This should be the case for a clean distribution. This configuration ensures that your Hadoop now runs in standalone mode.
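If you'd rather not eyeball three files, the check can be scripted. What follows is only a rough sketch: the function name check_standalone_conf is made up for illustration, and it simply greps for a <property> tag, so a property that is merely commented out would still be flagged.

```shell
# check_standalone_conf DIR (hypothetical helper, not part of Hadoop or Cloud9):
# report whether each site file in DIR contains a <property> element.
# Returns 0 only if all three files exist and contain none.
check_standalone_conf() {
  conf_dir="$1"
  rc=0
  for f in core-site.xml hdfs-site.xml mapred-site.xml; do
    if [ ! -f "$conf_dir/$f" ]; then
      echo "$f: missing"
      rc=1
    elif grep -q '<property>' "$conf_dir/$f"; then
      # Plain text match: also fires on a commented-out <property>.
      echo "$f: has overrides"
      rc=1
    else
      echo "$f: clean"
    fi
  done
  return "$rc"
}
```

From the conf/ directory you would call it as check_standalone_conf . and expect three lines ending in "clean" for a fresh distribution.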
Later on you will actually specify configuration parameters here to connect to a cluster. In that case, you can override those parameters and force standalone mode from the command line, like this:
hadoop fs -D mapred.job.tracker=local -D fs.default.name=file:/// -ls .
The above example performs a directory listing in standalone mode (which corresponds to a directory listing of the local disk).
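Typing those two -D overrides every time gets old quickly, so you may want to wrap them in a shell function. This is just a sketch: hadoop_fs_local is a name I made up, it assumes hadoop is on your PATH (as in the command above), and it only covers the fs subcommand.

```shell
# hadoop_fs_local ARGS... (hypothetical wrapper): run "hadoop fs" with the
# two standalone-mode overrides placed before the fs arguments.
hadoop_fs_local() {
  hadoop fs -D mapred.job.tracker=local -D fs.default.name=file:/// "$@"
}
```

With this defined, hadoop_fs_local -ls . runs the same command as the directory listing above.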
Step 2: Run pi
Open a shell and go to umd-hadoop-core/hadoop/hadoop-0.20.1/. Now run the pi demo:
$ bin/hadoop jar hadoop-0.20.1-examples.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
[...]
09/11/18 19:51:57 INFO mapred.JobClient: Job complete: job_local_0001
09/11/18 19:51:57 INFO mapred.JobClient: Counters: 13
09/11/18 19:51:57 INFO mapred.JobClient: FileSystemCounters
09/11/18 19:51:57 INFO mapred.JobClient: FILE_BYTES_READ=1725357
09/11/18 19:51:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1926195
09/11/18 19:51:57 INFO mapred.JobClient: Map-Reduce Framework
09/11/18 19:51:57 INFO mapred.JobClient: Reduce input groups=20
09/11/18 19:51:57 INFO mapred.JobClient: Combine output records=0
09/11/18 19:51:57 INFO mapred.JobClient: Map input records=10
09/11/18 19:51:57 INFO mapred.JobClient: Reduce shuffle bytes=0
09/11/18 19:51:57 INFO mapred.JobClient: Reduce output records=0
09/11/18 19:51:57 INFO mapred.JobClient: Spilled Records=40
09/11/18 19:51:57 INFO mapred.JobClient: Map output bytes=180
09/11/18 19:51:57 INFO mapred.JobClient: Map input bytes=240
09/11/18 19:51:57 INFO mapred.JobClient: Combine input records=0
09/11/18 19:51:57 INFO mapred.JobClient: Map output records=20
09/11/18 19:51:57 INFO mapred.JobClient: Reduce input records=20
Job Finished in 2.625 seconds
Estimated value of Pi is 3.14800000000000000000
Okay, so the value of pi is a bit off... but at least Hadoop works!
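Why is it off? The job only looked at 10 maps x 100 samples = 1000 points. Assuming the demo uses the usual estimate of 4 x (points inside the circle) / (total points), a result of 3.148 corresponds to 787 of the 1000 points landing inside. That count is my inference from the output above, not something the job printed. A small sketch of the arithmetic (pi_estimate is a made-up helper):

```shell
# pi_estimate INSIDE TOTAL (hypothetical helper): print 4 * INSIDE / TOTAL
# rounded to three decimal places.
pi_estimate() {
  awk -v inside="$1" -v total="$2" 'BEGIN { printf "%.3f\n", 4 * inside / total }'
}

# 787 is inferred from the 3.148 in the job output, not taken from the logs.
pi_estimate 787 1000
```

With only 1000 points the estimate can only move in steps of 0.004, so raising the number of samples per map is the way to get closer to pi.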
Step 3: Unpack some data and build the job jar
Now we're getting ready to run the word count demo. Open a shell and go to umd-hadoop-core/data/. Uncompress the sample text collection (Bible and the complete works of Shakespeare):
$ gunzip bible+shakes.nopunc.gz
Now let's build a job jar for running the word count demo. Open a shell and go to umd-hadoop-core/build/ (which is where Eclipse automatically puts compiled class files). Jar up the class files:
$ jar cvf cloud9.jar *
If there's nothing in build/, go back to Eclipse and make sure the code compiles without error.
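A quick way to tell whether Eclipse actually produced anything is to count the compiled class files. This is a minimal sketch; count_classes is a name I picked for illustration.

```shell
# count_classes DIR (hypothetical helper): print how many .class files
# exist anywhere under DIR.
count_classes() {
  find "$1" -type f -name '*.class' | wc -l | tr -d ' '
}
```

Running count_classes . from build/ and getting 0 means there is nothing worth putting in the jar yet.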
Step 4: Build and run the word count demo
Once you have created the jar, go back to umd-hadoop-core/hadoop/hadoop-0.20.1/ and submit the job in standalone mode. Run the class without arguments to find out its command-line arguments:
$ bin/hadoop jar ../../build/cloud9.jar edu.umd.cloud9.demo.DemoWordCount
usage: [input-path] [output-path] [num-mappers] [num-reducers]
Now run the code on the sample text collection:
$ bin/hadoop jar ../../build/cloud9.jar edu.umd.cloud9.demo.DemoWordCount ../../data/bible+shakes.nopunc demo 5 1
09/11/18 19:53:18 INFO demo.DemoWordCount: Tool: DemoWordCount
09/11/18 19:53:18 INFO demo.DemoWordCount: - input path: ../../data/bible+shakes.nopunc
09/11/18 19:53:18 INFO demo.DemoWordCount: - output path: demo
09/11/18 19:53:18 INFO demo.DemoWordCount: - number of mappers: 5
09/11/18 19:53:18 INFO demo.DemoWordCount: - number of reducers: 1
[...]
09/11/18 19:53:25 INFO mapred.JobClient: Job complete: job_local_0001
09/11/18 19:53:25 INFO mapred.JobClient: Counters: 13
09/11/18 19:53:25 INFO mapred.JobClient: FileSystemCounters
09/11/18 19:53:25 INFO mapred.JobClient: FILE_BYTES_READ=22537334
09/11/18 19:53:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5495288
09/11/18 19:53:25 INFO mapred.JobClient: Map-Reduce Framework
09/11/18 19:53:25 INFO mapred.JobClient: Reduce input groups=41788
09/11/18 19:53:25 INFO mapred.JobClient: Combine output records=128253
09/11/18 19:53:25 INFO mapred.JobClient: Map input records=156215
09/11/18 19:53:25 INFO mapred.JobClient: Reduce shuffle bytes=0
09/11/18 19:53:25 INFO mapred.JobClient: Reduce output records=41788
09/11/18 19:53:25 INFO mapred.JobClient: Spilled Records=170041
09/11/18 19:53:25 INFO mapred.JobClient: Map output bytes=15919397
09/11/18 19:53:25 INFO mapred.JobClient: Map input bytes=9068074
09/11/18 19:53:25 INFO mapred.JobClient: Combine input records=1820763
09/11/18 19:53:25 INFO mapred.JobClient: Map output records=1734298
09/11/18 19:53:25 INFO mapred.JobClient: Reduce input records=41788
09/11/18 19:53:25 INFO demo.DemoWordCount: Job Finished in 7.406 seconds
There should now be a new sub-directory in your current directory called demo/ that contains the output of the word count demo:
$ head demo/part-00000
&c 70
&c' 1
''all 1
''among 1
''and 1
''but 1
''how 1
''lo 2
''look 1
''my 1
$ tail demo/part-00000
zorites 1
zorobabel 3
zounds 20
zuar 5
zuph 3
zur 5
zuriel 1
zurishaddai 5
zuzims 1
zwaggered 1
$ wc demo/part-00000
41788 83576 447180 demo/part-00000
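Notice that wc reports 41788 lines, which matches the "Reduce output records=41788" counter: one line per distinct word. If you want an independent sanity check, you can count words with standard Unix tools and compare. The sketch below assumes the demo splits its input on whitespace, which I haven't stated anywhere above, so treat any mismatch as a hint about tokenization rather than a bug. The name local_word_count is made up.

```shell
# local_word_count FILE (hypothetical helper): print "word<TAB>count" for
# every whitespace-separated token in FILE, sorted by byte order.
local_word_count() {
  tr -s '[:space:]' '\n' < "$1" |   # one token per line
    grep -v '^$' |                  # drop any empty line left by leading whitespace
    LC_ALL=C sort |                 # byte-order sort, independent of locale
    uniq -c |                       # prefix each distinct token with its count
    awk '{ print $2 "\t" $1 }'      # reorder to word, tab, count
}
```

For example, local_word_count ../../data/bible+shakes.nopunc | wc -l should give a line count you can compare against the 41788 above.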
And that's it! Now you're ready to run a real MapReduce cluster.