Feature vectors are ubiquitous in text processing and many other applications. They represent a special case of associative arrays where the value is a numeric weight (say, int or float). It would be nice to have efficient implementations of these data structures.
A first stab at implementing a feature vector might be
as HashMap<String,Integer>. This would, for
example, be useful in storing term frequencies in a document. The
biggest issue with this implementation is that the value
is Integer, rather than int, since only
objects can be stuffed inside Java collections. Autoboxing
is slow, and
storing Integer objects increases the memory
footprint.
To get around this, Cloud9 implements
specialized versions of the standard Java HashMap class
where the value is hard-coded as either ints or floats (and the keys
can be any type). Feature vectors with int or float values are a
sufficiently common that this optimization is worth the effort. These
two classes are:
HMapKI<K>, an efficient alternative toHashMap<K,Integer>, implements theMapKI<K>interface.HMapKF<K>, an efficient alternative toHashMap<K,Float>, implements theMapKF<K>interface.
A further optimization is also to hard-code the key. Ints are frequently used as keys, so we have the following two common cases:
HMapII, an efficient alternative toHashMap<Integer,Integer>, implements theMapIIinterface.HMapIF, an efficient alternative toHashMap<Integer,Float>, implements theMapIFinterface.
How much more efficient are these implementation? As a benchmark, I experimented with feature vectors where both the features and values are ints. The benchmark first inserts 5m random ints, and then access all 5m ints. The following shows running times and the memory footprint:
HashMap<Integer,Integer> |
HMapKI<Integer> |
HMapII | |
| Insertion | 3796 ms | 2953 ms (-22%) | 1703 ms (-55%) |
| Access | 350 ms | 281 ms (-20%) | 78 ms (-78%) |
| Memory footprint (per entry) | 68 bytes | 52 bytes (-24%) | 31 bytes (-55%) |
Figures represent averages over 5 trials, on a 2.6 GHz Core Duo 2, 2 GB RAM, running Windows XP, Java 1.6u10 (conducted on 1/21/2009). As can be seen, these optimizations represent a significant improvement in both speed and memory footprint.
Insertion is still relatively expensive because the default
implementation of HashMap backs every key-value pair with
an Entry object. This goes naturally with the use of
chaining for hash collisions and supports iteration by keys, values,
and entries. The downside, however, is object creation overhead and
lots of time wasted garbage collecting. One solution around this is
to implement some type of object pooling, but a better solution is to
write a completely different implementation that uses open addressing
to deal with hash collisions, thereby obviating the need for
actual Entry objects.
Beyond the above classes, I've written a whole bunch of subclasses:
OHMapIFextendsHMapIFOHMapIIextendsHMapIIOHMapKFextendsHMapKF(key must be comparable)OHMapKIextendsHMapKI(key must be comparable)
They all share in providing methods to retrieve entries sorted by
values, along with methods like plus and dot
(for adding vectors and taking their dot products). These are the
classes you'd most likely want to use in practice.
Finally, for use in Hadoop, there are the final classes that
implement Writable:
OHMapIFWextendsHMapIF(well suited to holding accumulators in an IR application)OHMapIIWextendsHMapII(well suited to holding accumulators with impact-based integer scores in an IR application)OHMapSFWextendsHMapKF(specialized for aStringkey to eliminate need to wrap the string inside a HadoopTextobject; well suited for storing weighted feature vectors in "bag-of-words" representation of documents)OHMapSIWextendsHMapKI(specialized for aStringkey to eliminate need to wrap the string inside a HadoopTextobject; well suited for storing term frequencies in "bag-of-words" representation of documents)OHMapKFWextendsHMapKF(support for arbitrary keys that implementWritable)OHMapKIWextendsHMapKI(support for arbitrary keys that implementWritable)
The upside of all these classes is efficiency. The downside is the proliferation of a whole bunch of specialized classes... in my opinion, this is a worthwhile tradeoff. Use of a consistent naming scheme helps.