Cloud9: Getting started with the Google/IBM CLuE cluster

by Jimmy Lin

(Page first created: 04 Sep 2008; last updated: )

Introduction

These instructions are intended help you get onto the Google/IBM Hadoop cluster, available via the NSF CLuE program and the Google/IBM Academic Cloud Computing Initiative (ACCI) to University of Maryland students who are working in the context of those two projects.

In these instructions, I will refer to the following machines:

  • GATEWAY: the gateway machine (IP address)
  • JOBTRACKER: the node that runs the jobtracker (IP address)
  • NAMENODE: the HDFS name node (actual host name)

You should have received this information separately. I'm not publishing this information on the public web as a security measure.

1. Download and unpack Hadoop

If you're using Cloud9, go through this guide to getting started in standalone mode to make sure you're comfortable with the basics of Hadoop.

The cluster is currently running a Cloudera distribution of Hadoop 0.20.1; the actual tarball are available here. The exact release (i.e., patch level) isn't too important, and even if I list it here, it'll probably get out of date quickly (if it is important to find out exactly what's running on the server, the jobtracker will tell you). For the most part, a slight mistmatch in the release on the client and server won't cause any problems, so you could just pick the latest release. You can even use the stock Hadoop distribution.

2. Site up configuration files

We have to configure Hadoop. Configuration parameters are stored in the conf directory under your base distribution path, in three separate files: core-site.xml, hdfs-site.xml, mapred-site.xml.

The following configuration information goes in core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>NAMENODE</value>
</property>

<property>
  <name>hadoop.job.ugi</name>
  <value>UGI</value>
</property>

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>

<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6789</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
</property>

For NAMENODE, substitute the fully-qualified hostname of the namenode. This should begin with "hdfs://" and end with a port number. The UGI property specifies file permissions you have in HDFS and will vary by user. Ask me what this should be, or otherwise just put in your username twice separated by a comma. So if your username is "foo", put "foo,foo" here.

The following configuration information goes in hdfs-site.xml:

<property>
  <name>fs.default.name</name>
  <value>NAMENODE</value>
</property>

Finally, the following configuration information goes in mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>JOBTRACKER:8021</value>
</property>

Here, substitute JOBTRACKER for its actual IP address. Port 8021 is a commonly-used port for the jobtracker.

3. Open up a SOCKS proxy

On your local machine, open up a shell and invoke the following command:

ssh -D 6789 username@GATEWAY

Make sure you substitute your real username and the actual IP address for GATEWAY. Enter your password and log in when prompted.

Here, I'm assuming that you have ssh installed. In Windows, your best bet is to install Cygwin; ssh isn't installed by default, so you'll have to manually specify the package.

4. Test drive the cluster

With the above steps and bit of luck, you should be able to access the Hadoop cluster from your local machine! Basically, you've set up a direct connection to the Hadoop cluster via a SOCKS proxy.

Try things out with a simple command (from your local machine):

hadoop fs -ls /

You should be able to see the contents of HDFS.

Note: Hadoop only works with SOCKS v5, so if you have an older version of ssh that only supports SOCKS v4, then you should upgrade (or get your admin to). This is a known problem with some CLIP machines. This issue manifests in really unhelpful error messages.

Note: Windows users under Cygwin might get the following error:

javax.security.auth.login.LoginException: Login failed: Expect one token as the result of whoami: Jimmy Lin

The reason for this is that Cygwin uses your Windows username when trying to connect to the cluster, which may have a space in it, e.g., "Jimmy Lin" (which causes the above login error). To fix this, edit /etc/password, which in actuality would be c:\cygwin\etc\password if your Cygwin install path was c:\cygwin\. Find the entry for your username and remove the space (first field); following standard conventions I also like to downcase my entire username, e.g., "jimmylin". While you're at it you might want to rename your home directory also (second to last field).

5. Connect to the webapps

To access the jobtracker Webapp, you'll have to get your browser to use the SOCKS proxy. The preferred solution is FoxyProxy for Firefox. First, download and install FoxyProxy from this link. Then, take the following steps:

  1. Select menu item "Tools" > "FoxyProxy" > "Options".
  2. Click "Add New Proxy".
  3. In "General" tab, make sure the proxy is enabled. Give it a name, e.g., "Google/IBM cluster"
  4. In "Proxy Details" tab, select "Manual Proxy Configuration", enter "127.0.0.1" for "Host Name", 6789 for "Port", check "SOCKS Proxy?", and make sure "SOCKS v5" is selected.
  5. In "Patterns" tab, click "Add New Pattern". Given it a name (e.g., "default"), and enter the pattern "http://JOBTRACKER:*/*". Substitute the real IP address.
  6. You might want to add a similar pattern for accessing the HDFS Webapp on "http://NAMENODE:*/*", or generalize the two into a single pattern. Say the IP address is aaa.bbb.ccc.ddd, you might want to enter a pattern like "http://aaa.bbb.*.*:*/*". Also, to access the task logs, you'll need to add a pattern along the lines of "http://xen*:*/*".
  7. Click "OK" to finish setting up the proxy.
  8. Back in FoxyProxy Options, change "Mode" to "Use proxies based on their pre-defined patterns and priorities". Click "Close" to finish up.

If you navigate to http://JOBTRACKER:50030/ in your browser, you should be able to access the jobtracker!

6. Accessing Task Logs

You'll want to access the task logs on the individual cluster workers for debugging, but the problem is that the Webapp gives you URLs that contain references to Xen virtual machines, something like http://xenhost-XXXXXXXXXXXX-X:50060/tasklog..., which isn't very helpful because you need the actual IP addresses of the individual workers. The easiest solution is to hack your hosts file.

Email me or the admins for the Xen host to IP mappings (I'm not posting publicly for obvious security reasons). Add them to your hosts file, the location of which varies by operating system: usually %SystemRoot%\system32\drivers\etc\ for Windows NT/2000/XP/2003/Vista/7, and /etc/ for Linux and other Unix-like operating systems. For more information, consult this helpful Wikipedia article.

And that's it! You should now be able to access the task logs. If you still can't, go back to Step 6 in the previous section: one common problem is that your FoxyProxy pattern doesn't capture the URL and direct connections through the proxy.

There is, however, one minor issue with hacking the hosts file... which is that you'll need root privileges. If for whatever reason you don't have root access, there is another solution: a Firefox plugin called FoxReplace nicely solves the problem. It automatically replaces text in URLs, so we just need to give it a Xen host to IP mapping. Go ahead and download the plugin. Email me for the mappings XML file.

Once you have FoxReplace installed, go through the following steps:

  1. Select menu item "Tools" > "FoxReplace" > "FoxReplace options...".
  2. Click "Import" and load the XML mappings file you get from me.
  3. Make sure "Auto-replace on page load" is checked.
  4. Click "OK" to finish.

And that should do it!

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!