Cloud9: Using the tuple library

by Jimmy Lin

This page provides a brief overview of the tuple library in Cloud9, which resides in the Java package edu.umd.cloud9.tuple.

Motivation

What's the motivation for having a tuple library? The entire MapReduce framework is built on top of key-value pairs (KV pairs for short):

  • WritableComparable is the base interface for keys, and
  • Writable is the base interface for values.

Classes implementing the above two interfaces provide your basic primitives:

  • IntWritable (ints),
  • Text (strings),
  • BytesWritable (raw bytes),
  • etc.

However, there is no support for more complex data types. The Cloud9 tuple library fills this gap by providing support for arbitrary tuples. The Tuple class implements WritableComparable, and can therefore be used directly as either keys or values.

Usage

Refer to edu.umd.cloud9.demo.DemoWordCountTuple for a well-commented basic demo of the Tuple class.

The structure of each tuple is dictated by a schema. Schemas are defined by the Schema class. Here is a sample code fragment showing how a schema is defined:

public static final Schema MYSCHEMA = new Schema();
static {
    MYSCHEMA.addField("token", String.class, "");
    MYSCHEMA.addField("int", Integer.class, Integer.valueOf(1));
}

The addField method allows you to insert a field and specify default values. The following are valid field types:

  • Basic Java types (primitive wrappers and String): Boolean, Integer, Long, Float, Double, String
  • Classes that implement Writable

Once a schema has been defined, tuples can be instantiated in one of two ways:

// method 1: new Tuple with default values
Tuple tuple1 = MYSCHEMA.instantiate();

// method 2: new Tuple with specified values
Tuple tuple2 = MYSCHEMA.instantiate("test", 2);

Calling the instantiate() method without any parameters creates a new Tuple with default values. Alternatively, you can directly specify the values of each field using instantiate(Object...), the overloaded method that takes a variable number of Objects as parameters.

Once a tuple is created, fields can be modified using the set method; field values can be retrieved using the get method. You can refer to a field by its integer index position, or by its field name: the first is faster, but the second makes code more readable.
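The two ways of addressing a field can be illustrated with a toy stand-in. The sketch below does not use the Cloud9 classes: ToySchema and ToyTuple are hypothetical names invented for this illustration, and the real Tuple may well be implemented differently. It only shows why access by index can be a direct array lookup, while access by name pays for an extra name-to-position lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a schema: ordered field names plus default values. */
final class ToySchema {
    private final List<Object> defaults = new ArrayList<>();
    private final Map<String, Integer> positions = new HashMap<>();

    void addField(String name, Object defaultValue) {
        positions.put(name, defaults.size()); // remember where this field lives
        defaults.add(defaultValue);
    }

    int indexOf(String name) {
        Integer position = positions.get(name);
        if (position == null) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        return position;
    }

    /** Creates a tuple whose fields start out at the schema defaults. */
    ToyTuple instantiate() {
        return new ToyTuple(this, defaults.toArray());
    }
}

/** Hypothetical stand-in for a tuple: one slot per schema field. */
final class ToyTuple {
    private final ToySchema schema;
    private final Object[] values;

    ToyTuple(ToySchema schema, Object[] values) {
        this.schema = schema;
        this.values = values;
    }

    // Access by index: a plain array lookup.
    Object get(int index) { return values[index]; }
    void set(int index, Object value) { values[index] = value; }

    // Access by name: resolve the name to an index first, then delegate.
    Object get(String name) { return get(schema.indexOf(name)); }
    void set(String name, Object value) { set(schema.indexOf(name), value); }
}
```

With a schema holding the fields "token" and "int", calling set(0, "hello") and then get("token") on a ToyTuple reaches the same slot either way; the name-based call is simply easier to read at the call site.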

Since a Tuple implements WritableComparable, it can be used directly in Hadoop without any effort. The class automatically takes care of serializing and deserializing the object.
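For comparison, here is roughly what a hand-written two-field type has to do to satisfy the same contract. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the sketch below mirrors those two method signatures using only java.io, so that it runs without Hadoop on the classpath. TokenCount and its toBytes/fromBytes helpers are hypothetical, and nothing here reflects the wire format that Tuple actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record with the same method shapes as a WritableComparable. */
final class TokenCount implements Comparable<TokenCount> {
    String token = "";
    int count = 0;

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(token);
        out.writeInt(count);
    }

    // Deserialize the fields in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        token = in.readUTF();
        count = in.readInt();
    }

    // Keys must define an ordering: by token first, then by count.
    @Override
    public int compareTo(TokenCount other) {
        int byToken = token.compareTo(other.token);
        return byToken != 0 ? byToken : Integer.compare(count, other.count);
    }

    /** Helper for this sketch only: serialize a record to a byte array. */
    static byte[] toBytes(TokenCount record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Helper for this sketch only: rebuild a record from a byte array. */
    static TokenCount fromBytes(byte[] bytes) {
        TokenCount record = new TokenCount();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }
}
```

Every new record type written this way needs its own write, readFields, and compareTo; this is the boilerplate that a schema-driven Tuple is meant to spare you.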

Another feature of the Tuple class is its ability to store special symbols. Each field in the Tuple can either hold an Object of the type defined by its Schema, or a special symbol String. The method containsSymbol can be used to check if a field contains a special symbol. If the field contains a special symbol, get will return null. If the field does not contain a special symbol, getSymbol will return null.

What's the use of this feature? Say you had tuples that represented count(a,b), where a and b are tokens you observe. There is often a need to compute count(a,*), for example, to derive conditional probabilities. In this case, you can use a special symbol to represent the *, and distinguish it from the lexical token '*'. Refer to edu.umd.cloud9.demo.DemoWordCondProb for a well-commented basic demo that uses this special symbol feature.
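The distinction between a special symbol and an ordinary value can be sketched with a toy field holder. ToyField and its factory methods are hypothetical names made up for this illustration; only the method names containsSymbol, get, and getSymbol, and their null-returning behavior, are taken from the description above. The real Tuple exposes these per field rather than on a separate object.

```java
import java.util.Objects;

/** Hypothetical field that holds either a regular value or a special symbol. */
final class ToyField {
    private final Object value;  // null when the field holds a symbol
    private final String symbol; // null when the field holds a regular value

    private ToyField(Object value, String symbol) {
        this.value = value;
        this.symbol = symbol;
    }

    static ToyField ofValue(Object value) {
        return new ToyField(Objects.requireNonNull(value), null);
    }

    static ToyField ofSymbol(String symbol) {
        return new ToyField(null, Objects.requireNonNull(symbol));
    }

    boolean containsSymbol() { return symbol != null; }

    /** Returns the regular value, or null if the field holds a symbol. */
    Object get() { return value; }

    /** Returns the symbol, or null if the field holds a regular value. */
    String getSymbol() { return symbol; }
}
```

Under this sketch, ToyField.ofSymbol("*") stands for the wildcard in count(a,*), whereas ToyField.ofValue("*") is an asterisk token that actually occurred in the text. The two never collide, because the wildcard reports containsSymbol() as true and the token does not.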

Lists

Additional functionality in the tuple library is provided by the ListWritable class, which provides a Hadoop data type for storing a list of homogeneous Writable elements. This class, combined with Tuple, allows the user to define arbitrarily complex data structures.
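One common way to serialize a homogeneous list is to write the element count followed by each element in turn. The sketch below shows that layout for a list of ints using only java.io. It is an assumed, simplified format for illustration: ToyIntList is a hypothetical name, and this page does not specify how ListWritable lays out its bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical length-prefixed encoding of a list of ints. */
final class ToyIntList {
    private ToyIntList() {}

    static byte[] serialize(List<Integer> items) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(items.size());   // 4-byte element count
            for (int item : items) {
                out.writeInt(item);       // 4 bytes per element
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static List<Integer> deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int size = in.readInt();      // read the count first
            List<Integer> items = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                items.add(in.readInt());  // then exactly that many elements
            }
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every element has the same type, the reader needs no per-element type tag; the count alone tells it when to stop.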

Design Rationale

In adopting the factory pattern for the creation of tuples from the Schema class, verbosity was traded for transparency and readability. Using factory methods for instantiation may be cumbersome at times, but one is forced to develop explicit schemas that appropriately model the data. Furthermore, the current design allows fields to be referenced by descriptive names, which improves program readability, at the expense of larger serialized objects (since field names need to be stored with the tuples).

Donald Knuth's famous quote, "premature optimization is the root of all evil" (Knuth, 1974), should be kept in mind when considering the current implementation of the Tuple class. For example, the class uses a reasonable, but not particularly clever or optimized algorithm for serialization and deserialization. But that's not the point. We need to develop experience with a broad range of usage scenarios before optimization should be undertaken.

References

Knuth, Donald E. Structured Programming with go to Statements. ACM Computing Surveys, 6(4):261-301, 1974.

This page, first created: 30 Oct 2007. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.