Cloud9: Frequently asked questions about Hadoop

by Jimmy Lin

(Page first created: 23 Dec 2008)

If you're serious about getting into Hadoop, you should subscribe to the user mailing list, core-user@hadoop.apache.org; instructions for signing up can be found here. The list gets a fair amount of traffic and the signal-to-noise ratio can be low at times, but I would nevertheless recommend subscribing.

Have a question about Hadoop? The user mailing list would be the place to ask. However, before posting, you might want to search the archives with MarkMail (a nice interface) or browse posts with this interface (the "official" archives).

From time to time, I find posts worth saving, usually about basic questions that a Hadoop developer might ask. I've compiled them here, with links back to the original post (and subsequent responses). Think of this as a curated FAQ with answers from the mailing list.

MapReduce

  • What's the issue with processing gzipped files? [Aug 30, 2007]
  • If a file is one byte in size, does it still take up a full HDFS block? (short answer: no; this is a common misconception) [Mar 27, 2008]
  • What are the precise semantics of combiner execution? [JIRA 3326, resolved May 07, 2008]
  • How do I get access to the name of the file being processed inside a mapper? [Dec 07, 2008]
  • How do I get access to pre-defined counters (e.g., "Reduce input records")? [Dec 21, 2008]
  • How do I find out what partition number a reducer is processing? [Dec 09, 2008]
  • Is there a way for the mapper to know when it's processed all records? [Dec 09, 2008] [Aug 20, 2008]
  • How do I process different types of files simultaneously (i.e., mapping over differently formatted inputs)? [Feb 06, 2009]
  • When using Hadoop streaming, how do I update counters? [Feb 06, 2009]
  • How do I access counter values from the command line? [Feb 06, 2009]
  • How do you programmatically change the default task memory allocation (for mappers and reducers)? [Feb 07, 2009]
  • How can you map over lzo-compressed files? [Mar 03, 2009]
  • How do I emit multiple types of key-value pairs? [Apr 03, 2009]
  • How do I use counters to keep track of doubles? [Jun 02, 2009]
  • What job schedulers are currently available? [Jun 05, 2009]
  • What's the effect of increasing block size or minimum split size? [Jun 12, 2009]

HDFS

  • How do I know which replica is being accessed when reading something from HDFS? [Dec 01, 2008]
  • How do I "mount" HDFS as a "standard" file system? [Dec 09, 2008]
  • How do I directly "cat" a file into HDFS from the local disk? [Dec 09, 2008]
  • How do I get metadata of a file in HDFS (blocks, block locations, etc.)? [Jan 24, 2009]
  • How do I restart an individual datanode? [Dec 10, 2009]
  • How do I "cat" the contents of a SequenceFile? [Jan 26, 2009]
  • How do I programmatically get the block locations of a file in HDFS? [Jan 27, 2009]
  • What's wrong with a lot of small files with HDFS? [Feb 10, 2009] [Cloudera blog post on Feb 2, 2009]
  • How many people are using Hadoop streaming? What's the overhead of using Hadoop streaming? [Apr 03, 2009]
  • How easy is it to bypass user authentication in HDFS? [Apr 05, 2009]
  • What's the random access performance of HDFS? [Jun 27, 2009]
  • How do I configure HDFS so that it does not use up all available disk space? [Sep 23, 2009] Follow-up: What about the intermediate output of MapReduce jobs? [Sep 23, 2009]

Other

  • How do I integrate Hadoop and MySQL? [JIRA 2536]
  • What is the largest documented Hadoop cluster? [~10k cores, as of Feb 19, 2008] [~30k cores, as of Sep 30, 2008]
  • How do I secure a Hadoop cluster through a gateway? [Dec 03, 2009]
  • What options are there for a scalable, distributed key-value store? [Jan 20, 2009]
  • Is there a recommended JSON library to use with Hadoop? [Mar 05, 2009]
  • How do I build Lucene indexes in Hadoop? [Mar 13, 2009]

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States