The Ivory Toolkit with the SMRF Retrieval Engine

Ivory is a Hadoop toolkit for web-scale information retrieval research that features a retrieval engine based on Markov Random Fields, appropriately named SMRF (Searching with Markov Random Fields). This open-source project began in Spring 2009 and represents a collaboration between the University of Maryland and Yahoo! Research. Ivory takes full advantage of the Hadoop distributed environment (the MapReduce programming model and the underlying distributed file system) for both indexing and retrieval. The current release of Ivory (release 0.2) works with Hadoop 0.20.1 (and requires certain features only found in that release). Ivory also uses Cloud9, a MapReduce library for Hadoop developed at the University of Maryland (currently also at release 0.2).
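As a rough illustration of how indexing fits the MapReduce programming model mentioned above, the following is a minimal in-memory sketch in Python. It is not Ivory's code and does not use Hadoop. The function names (`map_document`, `reduce_postings`, `build_index`) and the whitespace tokenizer are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_document(docno, text):
    """Map phase (hypothetical): emit (term, (docno, tf)) for each distinct term."""
    counts = defaultdict(int)
    for term in text.lower().split():  # naive whitespace tokenizer, an assumption
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (docno, tf)

def reduce_postings(term, values):
    """Reduce phase (hypothetical): collect a term's postings, sorted by docno."""
    return term, sorted(values)

def build_index(docs):
    """Simulate the shuffle in memory: group mapper output by term, then reduce."""
    grouped = defaultdict(list)
    for docno, text in docs.items():
        for term, posting in map_document(docno, text):
            grouped[term].append(posting)
    return dict(reduce_postings(term, values) for term, values in grouped.items())

# Toy collection: docno -> text
docs = {1: "markov random fields", 2: "random walks on random graphs"}
index = build_index(docs)
```

Here `index` maps each term to a postings list of `(docno, term frequency)` pairs. On a real cluster the grouping step would be performed by the framework's shuffle rather than by a dictionary in memory.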

To temper expectations, please note that Ivory is not meant to serve as a full-featured search engine, but is rather aimed at information retrieval researchers who need access to low-level data structures and who generally know their way around retrieval algorithms. As a result, a lot of "niceties" are simply missing, such as fancy interfaces or ingestion support for different file types. It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often. In short, Ivory is experimental! If you just want search capabilities as a "black box", Lucene is likely a better choice. Katta is a framework for serving distributed Lucene indexes that plays well with Hadoop clusters.

Ivory was specifically designed to work with Hadoop "out of the box" on the ClueWeb09 collection, a 1 billion page (25 TB) Web crawl distributed by Carnegie Mellon University. The initial release of Ivory is meant to serve as a reference implementation of indexing and retrieval algorithms that can operate at the multi-terabyte scale. Another interesting experimental aspect of Ivory is its retrieval architecture: we've been playing with retrieval engines that directly read postings from HDFS. The getting started guide with TREC disks 4-5 provides more details.

Download

Documentation

This work is or has been supported by the following sources: NSF under awards IIS-0836560 and IIS-0705832; Google and IBM under the Academic Cloud Computing Initiative (ACCI); the Intramural Research Program of the NIH, National Library of Medicine; DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program; and Amazon Web Services. Any opinions, findings, conclusions, or recommendations expressed here do not necessarily reflect those of the sponsors.

This page, first created: 16 Jun 2009. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.