Cloud9: Working with standard document collections

by Jimmy Lin

(Page first created: 11 May 2009)

The Cloud9 library contains APIs for working with many document collections that are frequently used in information retrieval experiments. This article provides an overview of these APIs.

The interface edu.umd.cloud9.collection.Indexable represents a generic document. An Indexable object has a docid, a globally unique String identifier for the document in the collection. Many information retrieval algorithms require the documents in a collection to be sequentially numbered, so each document must also be assigned a unique integer identifier, its docno. The interface edu.umd.cloud9.collection.DocnoMapping provides the generic API for managing the mappings between docids and docnos. The mappings are managed on a collection-specific basis; typically, however, they are stored in a mappings file, which is loaded into memory by concrete objects implementing this interface.
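To make the docid/docno distinction concrete, the following is a minimal, self-contained sketch of a two-way mapping held in memory. It is not the Cloud9 DocnoMapping interface: the class name DocnoMappingSketch, its method names, the sample docids, and the choice to number documents from 1 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Cloud9 DocnoMapping API.
class DocnoMappingSketch {
    // docid (String) -> docno (sequential integer)
    private final Map<String, Integer> docidToDocno = new HashMap<>();
    // position i holds the docid whose docno is i + 1
    private final List<String> docnoToDocid = new ArrayList<>();

    // Assigns docnos in the order the docids are given.
    // Numbering from 1 is an assumption of this sketch.
    DocnoMappingSketch(List<String> docids) {
        for (String docid : docids) {
            if (docidToDocno.containsKey(docid)) {
                throw new IllegalArgumentException("Duplicate docid: " + docid);
            }
            docnoToDocid.add(docid);
            docidToDocno.put(docid, docnoToDocid.size());
        }
    }

    // Returns the docno for a docid, or throws if the docid is unknown.
    int getDocno(String docid) {
        Integer docno = docidToDocno.get(docid);
        if (docno == null) {
            throw new IllegalArgumentException("Unknown docid: " + docid);
        }
        return docno;
    }

    // Returns the docid for a docno, or throws if the docno is out of range.
    String getDocid(int docno) {
        if (docno < 1 || docno > docnoToDocid.size()) {
            throw new IllegalArgumentException("Unknown docno: " + docno);
        }
        return docnoToDocid.get(docno - 1);
    }

    // Number of documents in the mapping.
    int size() {
        return docnoToDocid.size();
    }

    public static void main(String[] args) {
        // Made-up docids standing in for the contents of a mappings file.
        DocnoMappingSketch mapping =
            new DocnoMappingSketch(Arrays.asList("doc-a", "doc-b", "doc-c"));
        System.out.println(mapping.getDocno("doc-b"));
        System.out.println(mapping.getDocid(3));
    }
}
```

In this sketch the list of docids plays the role of the mappings file; a collection-specific implementation would instead read that list from storage and might use a more compact layout, such as a sorted array of docids.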

APIs for working with specific collections are located in the package hierarchy under edu.umd.cloud9.collection.*; see documentation for each individual collection below:


Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States