Cloud9: Efficient feature vectors and maps

by Jimmy Lin

(Page first created: 03 Jan 2009; last updated: )

Feature vectors are ubiquitous in text processing and many other applications. They represent a special case of associative arrays where the value is a numeric weight (say, int or float). It would be nice to have efficient implementations of these data structures.

A first stab at implementing a feature vector might be as HashMap<String,Integer>. This would, for example, be useful in storing term frequencies in a document. The biggest issue with this implementation is that the value is Integer, rather than int, since only objects can be stuffed inside Java collections. Autoboxing is slow, and storing Integer objects increases the memory footprint.

To get around this, Cloud9 implements specialized versions of the standard Java HashMap class where the value is hard-coded as either ints or floats (and the keys can be any type). Feature vectors with int or float values are a sufficiently common that this optimization is worth the effort. These two classes are:

  • HMapKI<K>, an efficient alternative to HashMap<K,Integer>, implements the MapKI<K> interface.
  • HMapKF<K>, an efficient alternative to HashMap<K,Float>, implements the MapKF<K> interface.

A further optimization is also to hard-code the key. Ints are frequently used as keys, so we have the following two common cases:

  • HMapII, an efficient alternative to HashMap<Integer,Integer>, implements the MapII interface.
  • HMapIF, an efficient alternative to HashMap<Integer,Float>, implements the MapIF interface.

How much more efficient are these implementation? As a benchmark, I experimented with feature vectors where both the features and values are ints. The benchmark first inserts 5m random ints, and then access all 5m ints. The following shows running times and the memory footprint:

HashMap<Integer,Integer> HMapKI<Integer> HMapII
Insertion3796 ms2953 ms (-22%)1703 ms (-55%)
Access350 ms281 ms (-20%)78 ms (-78%)
Memory footprint (per entry)68 bytes52 bytes (-24%)31 bytes (-55%)

Figures represent averages over 5 trials, on a 2.6 GHz Core Duo 2, 2 GB RAM, running Windows XP, Java 1.6u10 (conducted on 1/21/2009). As can be seen, these optimizations represent a significant improvement in both speed and memory footprint.

Insertion is still relatively expensive because the default implementation of HashMap backs every key-value pair with an Entry object. This goes naturally with the use of chaining for hash collisions and supports iteration by keys, values, and entries. The downside, however, is object creation overhead and lots of time wasted garbage collecting. One solution around this is to implement some type of object pooling, but a better solution is to write a completely different implementation that uses open addressing to deal with hash collisions, thereby obviating the need for actual Entry objects.

Beyond the above classes, I've written a whole bunch of subclasses:

  • OHMapIF extends HMapIF
  • OHMapII extends HMapII
  • OHMapKF extends HMapKF (key must be comparable)
  • OHMapKI extends HMapKI (key must be comparable)

They all share in providing methods to retrieve entries sorted by values, along with methods like plus and dot (for adding vectors and taking their dot products). These are the classes you'd most likely want to use in practice.

Finally, for use in Hadoop, there are the final classes that implement Writable:

  • OHMapIFW extends HMapIF (well suited to holding accumulators in an IR application)
  • OHMapIIW extends HMapII (well suited to holding accumulators with impact-based integer scores in an IR application)
  • OHMapSFW extends HMapKF (specialized for a String key to eliminate need to wrap the string inside a Hadoop Text object; well suited for storing weighted feature vectors in "bag-of-words" representation of documents)
  • OHMapSIW extends HMapKI (specialized for a String key to eliminate need to wrap the string inside a Hadoop Text object; well suited for storing term frequencies in "bag-of-words" representation of documents)
  • OHMapKFW extends HMapKF (support for arbitrary keys that implement Writable)
  • OHMapKIW extends HMapKI (support for arbitrary keys that implement Writable)

The upside of all these classes is efficiency. The downside is the proliferation of a whole bunch of specialized classes... in my opinion, this is a worthwhile tradeoff. Use of a consistent naming scheme helps.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!