This brief guide will get you started with Cloud9 in Eclipse. A few notes to begin with:
- Instructions are written for the IBM cluster, but can be easily adapted to any other Hadoop cluster. See the postscript at the very end if you want to play with a local one-node Hadoop cluster.
- The Cloud9 demos have been verified to work in Windows (Java 1.5.0_02, Java 1.6.0_03) and Mac OS X 10.5 (Java 1.5.0_13).
- If you're using Java 1.6, you must change your compiler compliance level to 5.0 or else your code is not going to run on the IBM cluster. To do so, open Eclipse preferences (under Windows: Window > Preferences; under Mac: Eclipse > Preferences); select option Java > Compiler > Compiler compliance level.
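To confirm that the compliance setting took effect, you can inspect the version stamp in a compiled class file. The following is only a minimal sketch, not part of Cloud9: the class name ClassFileVersion and its methods are hypothetical. It relies on the standard class-file layout, in which the magic number 0xCAFEBABE is followed by a minor and a major version, with major version 49 corresponding to Java 5 and 50 to Java 6.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/** Hypothetical helper (not part of Cloud9): reports which Java release a .class file targets. */
class ClassFileVersion {

    /** Returns the major version stored in the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        // Bytes 0-3: magic number, always 0xCAFEBABE for a valid class file.
        int magic = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                  | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a Java class file");
        }
        // Bytes 4-5: minor version (ignored). Bytes 6-7: major version.
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    /** Labels the two major versions relevant to this guide. */
    static String describe(int major) {
        switch (major) {
            case 49: return "Java 5 (compliance level 5.0)";
            case 50: return "Java 6 (compliance level 6.0)";
            default: return "class-file major version " + major;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassFileVersion <path-to-.class-file>");
            return;
        }
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header);  // read exactly the 8 header bytes
        } finally {
            in.close();
        }
        System.out.println(args[0] + ": " + describe(majorVersion(header)));
    }
}
```

Run it against any .class file in your project's output directory. If it reports Java 6 after you have set the compliance level to 5.0, the project probably needs a clean rebuild.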
Step 0: Download Various Software Packages
Download and install the following software packages:
- Java: version 1.6 is fine, but see notes above.
- Eclipse: an IDE for Java.
- Subclipse: a Subversion client plug-in for Eclipse.
Step 1: Check Out Subversion Repositories
Start Eclipse. You'll have to tweak the Subclipse options. Open Eclipse preferences (under Windows: Window > Preferences; under Mac: Eclipse > Preferences). Select option Team > SVN. Change SVN interface to "SVNKit".
Switch to repository exploring mode. To do so, select menu option: Window > Open Perspective > Other > SVN Repository Exploring.
Add a repository by right clicking in the left panel > New > Repository Location. The two repositories you want to check out are:
umd-hadoop-dist: https://subversion.umiacs.umd.edu/umd-hadoop/dist
umd-hadoop-core: https://subversion.umiacs.umd.edu/umd-hadoop/core
For each repository, expand the tree. You should see "branches", "tags", and "trunk". Right click on trunk > Checkout... Follow the dialog to check out the repository.
When the checkout process is complete, switch back to the Java
perspective, and you should have two new projects:
umd-hadoop-core and umd-hadoop-dist. If
Eclipse reports build errors, you might need to recompile the projects.
Select menu option: Project > Clean...
You'll note that Hadoop itself is checked into the
repository umd-hadoop-dist. This is intentional, to
ensure everyone is using the same version of Hadoop (makes debugging
easier). In the future, newer versions can be seamlessly rolled out
by checking in a newer version in Subversion and having everyone
update their sandboxes.
Step 2: Install the Hadoop Eclipse Plug-In
The next step is to install the Hadoop Eclipse plug-in. Go to
directory
umd-hadoop-dist/hadoop-0.15.3/contrib and find the file
hadoop-0.15.3-eclipse-plugin.jar. Copy that file to
the plugins directory in your Eclipse directory. Restart
Eclipse.
The Hadoop Eclipse plug-in should now be installed.
- To use the MapReduce perspective go to: Window > Open Perspective > Other... > MapReduce.
- To enable the MapReduce servers window go to: Window > Show View > Other... > MapReduce Tools > MapReduce Servers
Step 3: Connect to the IBM Cluster
Switch to the MapReduce perspective. At the bottom of your window, you should have a "MapReduce Servers" tab. If not, see second bullet above. Switch to that tab.
At the top right edge of the tab, you should see two little blue elephant icons. The one on the right allows you to add a new MapReduce server location. The hostname should be the IP address of the controller. You want to enable "Tunnel Connections" and put in the IP address of the gateway.
At this point, you should now have access to DFS. It should show
up under a little elephant icon in the Project Explorer (on the left
side of Eclipse). You can now browse the directory tree. Your home
directory should be /user/your_username. A sample
collection consisting of the Bible and Shakespeare's works has been
preloaded on the cluster, stored
at /shared/sample-input.
If you don't have access to the IBM cluster, you can download an image of a single-node Hadoop cluster to play around with; see instructions at the bottom of the page.
Step 4: Run the Demos
Find edu.umd.cloud9.demo.DemoWordCount in the Project
Explorer (panel on left in Eclipse). This is your standard "Hello
World" Hadoop program that does word count. Right click on the class,
select Run As > Run on Hadoop. This should pull up a dialog box;
select the Hadoop Server and watch it go!
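To make the logic of a word-count job concrete, here is a minimal sketch in plain Java that mimics the map and reduce steps in memory. It is not the source of DemoWordCount and does not use the Hadoop API. The class WordCountSketch and its methods are hypothetical, and splitting on whitespace is an assumption about how words are tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical in-memory illustration of word count; not the Cloud9 demo itself. */
class WordCountSketch {

    /** "Map" step: emit every whitespace-separated token in a line (each stands for a count of 1). */
    static List<String> map(String line) {
        List<String> words = new ArrayList<String>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    /** "Reduce" step: group the emitted words and sum the occurrences of each one. */
    static Map<String, Integer> reduce(List<String> emitted) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String word : emitted) {
            Integer previous = counts.get(word);
            counts.put(word, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    /** Runs the map step over every input line, then reduces all emitted words together. */
    static Map<String, Integer> countWords(List<String> lines) {
        List<String> emitted = new ArrayList<String>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        lines.add("to be or not to be");
        lines.add("that is the question");
        // Print one "word<TAB>count" line per unique word, in sorted order.
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

On a real cluster the map calls run in parallel over splits of the input, and the framework groups the emitted words before handing them to the reducers. The sketch collapses all of that into a single process.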
When the job finishes, you should have something like the following:
08/01/26 14:42:56 INFO mapred.JobClient: Map-Reduce Framework
08/01/26 14:42:56 INFO mapred.JobClient: Map input records=156215
08/01/26 14:42:56 INFO mapred.JobClient: Map output records=1734298
08/01/26 14:42:56 INFO mapred.JobClient: Map input bytes=9068074
08/01/26 14:42:56 INFO mapred.JobClient: Map output bytes=15919397
08/01/26 14:42:56 INFO mapred.JobClient: Combine input records=1734298
08/01/26 14:42:56 INFO mapred.JobClient: Combine output records=135372
08/01/26 14:42:56 INFO mapred.JobClient: Reduce input groups=41788
08/01/26 14:42:56 INFO mapred.JobClient: Reduce input records=135372
08/01/26 14:42:56 INFO mapred.JobClient: Reduce output records=41788
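If you want to sanity-check these counters programmatically, the sketch below pulls the name=value pairs out of log lines like the ones above. It is only an illustration: the class JobCounters is hypothetical, and the parsing assumes each counter line has the form "... mapred.JobClient: name=value", as in the sample output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: extracts counters from JobClient log lines shaped like the sample output. */
class JobCounters {

    static final String MARKER = "mapred.JobClient:";

    /** Returns counter name -> value, in the order the counters appear in the log. */
    static Map<String, Long> parse(String[] logLines) {
        Map<String, Long> counters = new LinkedHashMap<String, Long>();
        for (String line : logLines) {
            int marker = line.indexOf(MARKER);
            int equals = line.lastIndexOf('=');
            if (marker < 0 || equals < marker) {
                continue;  // skip headings such as "Map-Reduce Framework" that carry no value
            }
            String name = line.substring(marker + MARKER.length(), equals).trim();
            long value = Long.parseLong(line.substring(equals + 1).trim());
            counters.put(name, value);
        }
        return counters;
    }

    public static void main(String[] args) {
        String[] log = {
            "08/01/26 14:42:56 INFO mapred.JobClient: Map-Reduce Framework",
            "08/01/26 14:42:56 INFO mapred.JobClient: Map output records=1734298",
            "08/01/26 14:42:56 INFO mapred.JobClient: Combine input records=1734298",
            "08/01/26 14:42:56 INFO mapred.JobClient: Combine output records=135372",
            "08/01/26 14:42:56 INFO mapred.JobClient: Reduce input records=135372",
            "08/01/26 14:42:56 INFO mapred.JobClient: Reduce output records=41788"
        };
        Map<String, Long> counters = parse(log);
        // Every map output record passes through the combiner, and every
        // combiner output record is what the reducers receive.
        System.out.println("map output == combine input: "
                + counters.get("Map output records").equals(counters.get("Combine input records")));
        System.out.println("combine output == reduce input: "
                + counters.get("Combine output records").equals(counters.get("Reduce input records")));
        System.out.println("unique words: " + counters.get("Reduce output records"));
    }
}
```

In the sample run, the combiner shrinks 1,734,298 map output records to 135,372 before they reach the reducers. The 41,788 reduce output records match the number of reduce input groups, one per unique word.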
Go into your DFS home directory, and you should find a new
directory called sample-counts: in it you'll find files
containing counts of each unique word. Congratulations, you're up and
running!
Postscript
Don't want to depend on the IBM cluster, especially while you are tinkering around with small programs? You can run a personal one-node Hadoop cluster on your local machine; instructions are provided by Google.
To better understand the directory layout of the project, consult the page on the layout of the project directory tree.