Projects

Current Projects

Cloud Computing

We are engaged in an ongoing effort to explore cloud computing, particularly as it relates to massive data analytics with platforms such as MapReduce. Highlights of our effort include:

Cascades

Ranking cascades provide a new approach to text retrieval, where document ranking is broken into a finite number of distinct stages. Each stage considers successively richer and more complex features, but over successively smaller candidate document sets. In other words, retrieval is viewed as a multi-stage progressive refinement problem. See this project page for more details.
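
As a rough illustration of the multi-stage idea, the sketch below runs a cheap scoring function over all documents, keeps the top few, and then applies a costlier function to the survivors. It is a toy example and not the project's implementation: the names cascade_rank, term_overlap, and phrase_bonus, the scoring functions themselves, and the cutoffs are all assumptions made up for this illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A stage pairs a scoring function with the number of candidates to keep.
# Both the type alias and the functions below are hypothetical.
Stage = Tuple[Callable[[str, str], float], int]


def cascade_rank(query: str, docs: Sequence[str], stages: Sequence[Stage]) -> List[str]:
    """Rank docs by passing them through each stage in order.

    Each stage re-scores only the candidates that survived the previous
    stage, then truncates the list to its own cutoff.
    """
    candidates = list(docs)
    for score, keep in stages:
        candidates.sort(key=lambda d: score(query, d), reverse=True)
        candidates = candidates[:keep]
    return candidates


def term_overlap(query: str, doc: str) -> float:
    """Cheap feature: number of distinct query terms that appear in the doc."""
    q_terms = set(query.lower().split())
    return float(len(q_terms & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    """Richer feature: term overlap plus a bonus for an exact phrase match."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return term_overlap(query, doc) + bonus


docs = [
    "retrieval of text documents",
    "distributed text retrieval toolkit",
    "short read alignment",
    "text mining",
]
# Stage 1 keeps three candidates using the cheap feature;
# stage 2 keeps one using the richer feature.
result = cascade_rank("text retrieval", docs, [(term_overlap, 3), (phrase_bonus, 1)])
print(result)
```

The cutoffs shrink from stage to stage, so the more expensive scoring function only ever sees a small candidate set.
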
Mr. LDA

Recently, we have begun a project to explore highly scalable MapReduce algorithms for linguistic modeling within a Bayesian framework, making use of variational inference to achieve a high degree of parallelization on web-scale datasets. See this NSF project page for more details.

Ivory

Ivory is a Hadoop toolkit for distributed text retrieval that features a retrieval engine based on Markov Random Fields. The project is focused on the challenges of indexing and retrieval algorithms at web scale. See this NSF project page for more details.

Large-Scale MT

In an ongoing effort, we have been exploring the intersection of large-scale text retrieval and statistical machine translation. One thread has been scaling up iterative machine learning algorithms to larger and larger datasets. Another thread has been the application of IR techniques to automatically extract bilingual training data. See this project page for a completed project from the Google/IBM Academic Cloud Computing Initiative and NSF's CLuE program.

Past Projects (Partial List)

iOpener

The iOpener project brings together bibliometrics, computational linguistics, and information visualization to generate readily consumable surveys of different scientific domains and topics, targeted to different audiences and levels, e.g., expert specialists, scientists from related disciplines, educators, students, government decision makers, and citizens.

Crossbow

Crossbow is a Hadoop-based version of Bowtie, a short read aligner based on the Burrows-Wheeler Transform. It started off as a course project in my Spring 2009 cloud computing course. Representative publication:

CloudBurst

CloudBurst is a short read mapping algorithm for DNA sequence analysis implemented in MapReduce. It began as a course project in my cloud computing course in Spring 2008. Representative publication:

Clinical QA

The Clinical QA project leveraged a combination of knowledge-based and statistical text processing techniques to build systems that satisfied the information needs of physicians practicing evidence-based medicine. Representative publications:

CLiMB

The Computational Linguistics for Metadata Building (CLiMB) project produced a Cataloger’s Toolkit for enhancing subject access to digital image collections. The toolkit uses computational linguistic techniques to mine scholarly texts for metadata terms.