This page provides a brief overview of the tuple library in
Cloud9, which resides in the Java package
edu.umd.cloud9.tuple.
Motivation
What's the motivation for having a tuple library? The entire MapReduce framework is built on top of key-value pairs (KV pairs for short):
WritableComparable is the base interface for keys, and Writable is the base interface for values.
Classes implementing the above two interfaces provide your basic primitives:
- IntWritable (ints)
- Text (strings)
- BytesWritable (raw bytes)
- etc.
However, there is no support for more complex data types. The Cloud9 tuple library fills this gap by providing support for arbitrary tuples. The Tuple class directly implements WritableComparable, and can therefore be used as either keys or values.
Usage
Refer to edu.umd.cloud9.demo.DemoWordCountTuple for a
well-commented basic demo of the Tuple class.
The structure of each tuple is dictated by a schema. Schemas are
defined by the Schema
class. Here is a sample code fragment showing how a schema is defined:
public static final Schema MYSCHEMA = new Schema();
static {
    MYSCHEMA.addField("token", String.class, "");
    MYSCHEMA.addField("int", Integer.class, Integer.valueOf(1));
}
The addField method allows you to insert a field and
specify default values. The following are valid field types:
- Basic Java types: Boolean, Integer, Long, Float, Double, String
- Classes that implement Writable
Once a schema has been defined, tuples can be instantiated in one of two ways:
// method 1: new Tuple with default values
Tuple tuple1 = MYSCHEMA.instantiate();
// method 2: new Tuple with specified values
Tuple tuple2 = MYSCHEMA.instantiate("test", 2);
Calling the instantiate() method without any
parameters creates a new Tuple with
default values. Alternatively, you can directly specify the values of
each field using instantiate(Object...), the overloaded
method that takes a variable number of Objects as parameters.
Once a tuple is created, fields can be modified using the
set method; field values can be retrieved using the
get method. You can refer to a field by its integer
index position, or by its field name: the first is faster, but the
second makes code more readable.
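The trade-off between index and name access can be illustrated with a small stand-in. The sketch below is not the Cloud9 Tuple class: NamedFieldsSketch and its internals are hypothetical, and it only assumes that name-based access is resolved through a name-to-index lookup, which is one plausible reason the integer form is faster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is NOT the Cloud9 Tuple class.
class NamedFieldsSketch {
    // Assumed layout: values live in a list, names map to list positions.
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final List<Object> values = new ArrayList<>();

    // Registers a field with a default value; returns this so calls can be chained.
    public NamedFieldsSketch addField(String name, Object defaultValue) {
        if (indexByName.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate field: " + name);
        }
        indexByName.put(name, values.size());
        values.add(defaultValue);
        return this;
    }

    // Index access: a single list lookup.
    public Object get(int index) { return values.get(index); }
    public void set(int index, Object value) { values.set(index, value); }

    // Name access: a map lookup first, then the same list lookup.
    public Object get(String name) { return values.get(indexOf(name)); }
    public void set(String name, Object value) { values.set(indexOf(name), value); }

    private int indexOf(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return index;
    }

    public static void main(String[] args) {
        NamedFieldsSketch t = new NamedFieldsSketch()
                .addField("token", "")
                .addField("int", 1);
        t.set("token", "hello"); // readable, costs one extra map lookup
        t.set(1, 42);            // terse, goes straight to the slot
        System.out.println(t.get(0) + " " + t.get("int"));
    }
}
```

In code that runs once per record, a common compromise is to look up the index once, store it in a constant, and use the constant in the inner loop.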
Since a Tuple implements WritableComparable, it can be used directly in Hadoop without any additional effort. The class automatically takes care of serializing and deserializing the object.
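To make concrete what the class handles automatically, here is a minimal sketch of the write/readFields/compareTo contract that Hadoop's WritableComparable defines. It avoids a Hadoop dependency by using only java.io; TokenCountSketch and its two fields are hypothetical, and the byte layout shown is an assumption, not the format Tuple actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical two-field record mirroring the method shapes of Hadoop's
// WritableComparable (write, readFields, compareTo). Not the Cloud9 Tuple.
class TokenCountSketch implements Comparable<TokenCountSketch> {
    String token = "";
    int count = 1;

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(token);
        out.writeInt(count);
    }

    // Deserialize in exactly the same order, overwriting this object's state.
    public void readFields(DataInput in) throws IOException {
        token = in.readUTF();
        count = in.readInt();
    }

    // Keys must be sortable: order by token, then by count.
    @Override
    public int compareTo(TokenCountSketch other) {
        int byToken = token.compareTo(other.token);
        return byToken != 0 ? byToken : Integer.compare(count, other.count);
    }

    // Writes the object to a byte array and reads it back into a fresh instance.
    static TokenCountSketch roundTrip(TokenCountSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            TokenCountSketch copy = new TokenCountSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TokenCountSketch original = new TokenCountSketch();
        original.token = "hello";
        original.count = 7;
        TokenCountSketch copy = roundTrip(original);
        System.out.println(copy.token + " " + copy.count);
    }
}
```

Every hand-written key or value type needs this kind of boilerplate; with a Tuple, the schema carries enough information for the library to do it for you.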
Another feature of the Tuple class is its ability to store special
symbols. Each field in the Tuple can either hold an Object of the
type defined by its Schema, or a special symbol String. The method
containsSymbol can be used to check if a field contains a
special symbol. If the field contains a special symbol,
get will return null. If the field does not
contain a special symbol, getSymbol will return
null.
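The contract between get, getSymbol, and containsSymbol can be sketched with a small stand-in. SymbolFieldsSketch below is hypothetical and is not the Cloud9 Tuple class; in particular, the setSymbol method name and the two parallel arrays are assumptions made for illustration.

```java
// Hypothetical stand-in for illustration only; this is NOT the Cloud9 Tuple class.
class SymbolFieldsSketch {
    // Assumed layout: each slot holds either a value or a symbol, never both.
    private final Object[] values;
    private final String[] symbols;

    public SymbolFieldsSketch(int numFields) {
        values = new Object[numFields];
        symbols = new String[numFields];
    }

    // Storing an ordinary value clears any symbol in that slot.
    public void set(int i, Object value) {
        values[i] = value;
        symbols[i] = null;
    }

    // Storing a symbol clears any ordinary value (setSymbol is an assumed name).
    public void setSymbol(int i, String symbol) {
        symbols[i] = symbol;
        values[i] = null;
    }

    public boolean containsSymbol(int i) { return symbols[i] != null; }

    // Returns null when the slot holds a symbol.
    public Object get(int i) { return values[i]; }

    // Returns null when the slot holds an ordinary value.
    public String getSymbol(int i) { return symbols[i]; }

    public static void main(String[] args) {
        SymbolFieldsSketch lexical = new SymbolFieldsSketch(2);
        lexical.set(0, "a");
        lexical.set(1, "*");        // the literal token '*'

        SymbolFieldsSketch marginal = new SymbolFieldsSketch(2);
        marginal.set(0, "a");
        marginal.setSymbol(1, "*"); // the special symbol

        System.out.println(lexical.containsSymbol(1));
        System.out.println(marginal.containsSymbol(1));
    }
}
```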
What's the use of this feature? Say you had tuples that
represented count(a,b), where a and
b are tokens you observe. There is often a need to
compute count(a,*), for example, to derive conditional
probabilities. In this case, you can use a special symbol to represent
the *, and distinguish it from the lexical token
'*'. Refer to
edu.umd.cloud9.demo.DemoWordCondProb for a well-commented
basic demo that uses this special symbol feature.
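As a worked illustration of the count(a,*) idea, the sketch below computes P(b|a) = count(a,b) / count(a,*) in plain Java, outside of Hadoop. It is not taken from DemoWordCondProb; CondProbSketch, PairKey, and the sample data are hypothetical, and a boolean flag stands in for the special symbol so that the marginal key never collides with the lexical token '*'.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory sketch; not the DemoWordCondProb implementation.
class CondProbSketch {
    // Key for count(a, b). When wildcard is true, the second slot means "any b".
    static final class PairKey {
        final String a;
        final String b;
        final boolean wildcard;

        private PairKey(String a, String b, boolean wildcard) {
            this.a = a;
            this.b = b;
            this.wildcard = wildcard;
        }

        static PairKey of(String a, String b) { return new PairKey(a, b, false); }
        static PairKey marginal(String a) { return new PairKey(a, null, true); }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof PairKey)) return false;
            PairKey k = (PairKey) o;
            return wildcard == k.wildcard && a.equals(k.a) && Objects.equals(b, k.b);
        }

        @Override
        public int hashCode() { return Objects.hash(a, b, wildcard); }
    }

    // Counts every observed (a, b) pair and, alongside it, the marginal (a, *).
    static Map<PairKey, Integer> count(String[][] pairs) {
        Map<PairKey, Integer> counts = new HashMap<>();
        for (String[] p : pairs) {
            counts.merge(PairKey.of(p[0], p[1]), 1, Integer::sum);
            counts.merge(PairKey.marginal(p[0]), 1, Integer::sum);
        }
        return counts;
    }

    // P(b | a) = count(a, b) / count(a, *); 0.0 if either count is missing.
    static double condProb(Map<PairKey, Integer> counts, String a, String b) {
        Integer joint = counts.get(PairKey.of(a, b));
        Integer marginal = counts.get(PairKey.marginal(a));
        if (joint == null || marginal == null) return 0.0;
        return joint / marginal.doubleValue();
    }

    public static void main(String[] args) {
        // Note the second pair: its b is the literal token "*".
        String[][] pairs = { {"the", "cat"}, {"the", "*"}, {"the", "cat"}, {"a", "dog"} };
        Map<PairKey, Integer> counts = count(pairs);
        System.out.println(condProb(counts, "the", "cat"));
        System.out.println(condProb(counts, "the", "*"));
    }
}
```

With the sample data, count(the,cat) is 2, the lexical count(the,'*') is 1, and the marginal count(the,*) is 3; the two '*' entries stay separate because only one of them is flagged as a symbol.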
Lists
Additional functionality in the tuple library is provided by the
ListWritable
class, which provides a Hadoop data type for storing a list of
homogeneous Writable elements. This class, combined
with Tuple, allows the user to define arbitrarily complex
data structures.
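One common way to serialize a homogeneous list is to write the element count followed by each element in turn. The sketch below shows that idea using only java.io. It is not the ListWritable implementation: WritableSketch, IntBox, and ListSketch are hypothetical names, and passing an element factory to the constructor is an assumption made so the example needs no reflection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical demo driver; none of these classes belong to Cloud9 or Hadoop.
class ListSketchDemo {
    // Serializes a list to bytes and reads it back into a fresh list.
    static ListSketch<IntBox> roundTrip(ListSketch<IntBox> original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            ListSketch<IntBox> copy = new ListSketch<>(IntBox::new);
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ListSketch<IntBox> list = new ListSketch<>(IntBox::new);
        list.add(new IntBox(3));
        list.add(new IntBox(5));
        ListSketch<IntBox> copy = roundTrip(list);
        System.out.println(copy.size() + " elements, first = " + copy.get(0).value);
    }
}

// Stand-in with the same method shapes as Hadoop's Writable interface.
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A trivial element type wrapping one int.
class IntBox implements WritableSketch {
    int value;
    IntBox() { }
    IntBox(int value) { this.value = value; }
    public void write(DataOutput out) throws IOException { out.writeInt(value); }
    public void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

// A list of same-typed elements that is itself serializable, so lists can nest.
class ListSketch<E extends WritableSketch> implements WritableSketch {
    private final List<E> items = new ArrayList<>();
    private final Supplier<E> factory; // creates empty elements during deserialization

    ListSketch(Supplier<E> factory) { this.factory = factory; }

    public void add(E element) { items.add(element); }
    public E get(int index) { return items.get(index); }
    public int size() { return items.size(); }

    // Layout: element count, then each element's own serialized form.
    public void write(DataOutput out) throws IOException {
        out.writeInt(items.size());
        for (E element : items) {
            element.write(out);
        }
    }

    public void readFields(DataInput in) throws IOException {
        items.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            E element = factory.get();
            element.readFields(in);
            items.add(element);
        }
    }
}
```

Because the list implements the same interface as its elements, a list can be stored inside another list or inside a record, which is the property that makes nested structures possible.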
Design Rationale
In adopting the factory pattern for the creation of tuples from the Schema class, verbosity was traded for transparency and readability. Using factory methods for instantiation may be cumbersome at times, but one is forced to develop explicit schemas that appropriately model the data. Furthermore, the current design allows fields to be referenced by descriptive names, which improves program readability, at the expense of larger serialized objects (since field names need to be stored with the tuples).
Donald Knuth's famous quote, "premature optimization is the root of all evil" (Knuth, 1974), should be kept in mind when considering the current implementation of the Tuple class. For example, the class uses a reasonable, but not particularly clever or optimized algorithm for serialization and deserialization. But that's not the point. We need to develop experience with a broad range of usage scenarios before optimization should be undertaken.
References
Knuth, Donald. Structured Programming with go to Statements. ACM Computing Surveys, 6(4):261-301, 1974.