INST 734
Information Retrieval Systems
Fall 2015
Information Retrieval Software
Available Text Retrieval Software
The following software is available for use in this course. Those
with links can be downloaded freely and used anywhere. The three you
are most likely to want to use are listed first; the others are listed in
alphabetical order for completeness. Some of these search engines are
compared in an October 2007
Technical Report from Universitat Pompeu Fabra in Spain.
The Big Three
- Lucene
- A freely available Java IR system, probably the easiest system
to get up and running, and the most easily modified. The Solr Web front end
for Lucene and the Elasticsearch
distributed (sharded) version of Lucene are both also well
worth considering.
- Indri
- Indri is optimized for efficiency, and thus is a good choice if
you have a large collection and a single processor. It is
built on top of the Lemur toolkit for
building language modeling systems for information retrieval.
There is also a simple variant of Indri written in Java called
Galago
that is designed for use with our textbook.
- Ivory
- An information retrieval system for the Hadoop MapReduce
framework. This is a good choice if you have a very large
collection and at least a modest-sized server cluster. You can
buy time from Amazon Web Services if you don't have your own
cluster.
The Others
- Cheshire 3
- Freely available research software implementing a logistic
regression model from the University of California at Berkeley.
Getting it working may require some facility with Z39.50.
- Glimpse
- Freely available software from the University of Arizona that
is designed for efficient indexing (at some cost in retrieval
efficiency). Glimpse is not configured for TREC-style
evaluations, so that would take some extra work.
- InQuery
- Commercial software based on inference networks that has a very
flexible query language. We have a research and teaching
license for this system from the University of Massachusetts,
but Indri does as well, so these days we usually run Indri.
- IRF
- A Java toolkit for building IR systems for small applications.
The strength of IRF is that the object-oriented framework greatly
simplifies tasks that require working with the source code. However,
because Java is designed for platform independence rather than
efficiency, the size of the collections that can be handled is quite
limited.
- MG
- Research software from RMIT University that is designed to
maximize storage efficiency on very large collections. It is
available under the GNU General Public License. We installed this once
several years ago and it wasn't too difficult.
- PRISE
- Public domain vector space research software developed at NIST.
We used this system for the TDT evaluations. PRISE includes a
Z39.50 interface, but it takes some facility with that standard
to get the interactive part running. PRISE is configured to
run TREC-style evaluations and the source code is available.
- SMART
- A vector space research system that was developed at Cornell
University. We have experience using SMART, but we have not
used it in many years now. SMART includes only a VT-100
interface, but it is configured to run TREC-style evaluations
and the source code is available.
- Terrier
- An information retrieval system from the University of Glasgow
that is optimized for efficiency. Terrier implements the
divergence from randomness framework for ranked retrieval.
- Xapian
- An open source IR system that is designed to run under Linux.
Xapian is a descendant of Omseek, which itself is a descendant
of Open Muscat. Xapian is designed to handle several Western
European languages, and thus might be a good choice if you want to
work with languages other than English.
- Zettair
- Zettair is optimized for both efficiency and modifiability. It
therefore occupies a part of the design space between Lucene and
Indri.
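Several entries above mention TREC-style evaluations. As a rough illustration of what such an evaluation computes, the Python sketch below scores ranked lists against relevance judgments using mean average precision. It is not taken from any of the systems listed here; the function names, the in-memory data layout, and the topic and document identifiers are all assumptions made for the example. For real experiments you would normally use an established evaluation tool rather than code like this.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list for one topic.

    `ranking` is the list of document ids the system returned, best first.
    `relevant` is the set of document ids judged relevant for the topic.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over every topic that has judgments."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    )
    return total / len(qrels)


if __name__ == "__main__":
    # Hypothetical topics and document ids, for illustration only.
    qrels = {"q1": {"d1", "d3"}, "q2": {"d9"}}
    run = {"q1": ["d1", "d2", "d3"], "q2": ["d8", "d9"]}
    print(mean_average_precision(run, qrels))
```

A system that is "configured to run TREC-style evaluations" can typically read the topics and write its ranked lists in the formats that standard scoring tools expect, which saves you from writing this kind of glue yourself.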
See also the listing at searchtools.com.
Doug Oard
Last modified: Sat Aug 22 11:11:18 2015