Web-Scale Information Processing Applications (Spring 2008)

Project Descriptions

Below is a listing of all projects in this course. Project work is divided into two major phases:

  • The focus in Phase I is on how existing solutions to the research problem can be recast into the MapReduce framework, i.e., how can we "hadoopify" existing algorithms?
  • The focus in Phase II is on interesting extensions, i.e., does the MapReduce framework allow us to do something that wasn't possible before?

Naturally, the relative emphasis on each phase will vary with the specific project.

Large-data Statistical Machine Translation

Short Project Description (October, 2007)

Team: Chris Dyer (Linguistics, Ph.D. student)
Alexander Mont (CS, undergrad)
Aaron Cordova (CS, undergrad)

Teaser blurb: This is one of the hottest areas of research in language processing today: how do you build a system that automatically translates from a foreign language like Arabic or Chinese into English? Modern statistical MT systems work by inducing "translation models" from millions of pairs of training examples. MapReduce provides a nice infrastructure for crunching all this data.

Language Modeling in the Clouds

Short Project Description (October, 2007)

Team: Denis Filimonov (CS, Ph.D. student)
Hua Wei (Geography, Ph.D. student)

Teaser blurb: How can a computer know that "dog bites man" is a grammatical sentence, but "bites man dog" isn't? Language models provide a way of statistically characterizing sentences. Naturally, the more data you have to train these models, the better these models become... that's where Hadoop comes in.
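
To make the idea concrete, here is a minimal sketch of a bigram language model whose counts are gathered in a map/reduce style. It is an illustration only, not the project's actual method: the function names, the three-sentence toy corpus, and the add-one smoothing are all assumptions, and everything runs in a single Python process rather than on Hadoop.

```python
from collections import defaultdict


def map_bigrams(sentence):
    """Map step: emit ((previous_word, word), 1) for each adjacent word pair."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        yield (prev, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(corpus):
    """Count bigrams, then count how often each word appears as a context."""
    emitted = [pair for sentence in corpus for pair in map_bigrams(sentence)]
    bigram_counts = reduce_counts(emitted)
    context_counts = reduce_counts(
        (prev, n) for (prev, _word), n in bigram_counts.items()
    )
    return bigram_counts, context_counts


def sentence_probability(sentence, bigram_counts, context_counts, vocab_size):
    """Product of add-one smoothed bigram probabilities (an assumed smoothing)."""
    prob = 1.0
    for (prev, word), _one in map_bigrams(sentence):
        numerator = bigram_counts.get((prev, word), 0) + 1
        denominator = context_counts.get(prev, 0) + vocab_size
        prob *= numerator / denominator
    return prob


# Hypothetical toy corpus; a real model would be trained on far more text.
corpus = ["dog bites man", "man bites dog", "dog chases cat"]
bigram_counts, context_counts = train(corpus)
vocab_size = len({w for pair in bigram_counts for w in pair})

p_good = sentence_probability("dog bites man", bigram_counts, context_counts, vocab_size)
p_bad = sentence_probability("bites man dog", bigram_counts, context_counts, vocab_size)
print(p_good > p_bad)
```

On this toy corpus the model assigns a higher probability to "dog bites man" than to "bites man dog". Because the counting is expressed as independent map and reduce steps, the same structure could in principle be distributed across many machines.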

Collective Resolution of Identity in Email Archives

Short Project Description (October, 2007)

Team: Tamer Elsayed (CS, Ph.D. student)
Greg Jablonski (iSchool, MLS student)
Alan Jackoway (CS, undergrad)

Teaser blurb: Due to the informal nature of email, simple questions such as "who is talking" and "what are they talking about" can often be difficult to answer, given the existence of shared contexts that are not explicitly recorded. One possible solution is to simultaneously build models of document content and individual participants, thereby leveraging mutual constraints between senders and recipients. Cloud computing provides the processing power necessary to support these computationally intensive models.

Large-Scale Network Analysis to Improve Retrieval in the Biomedical Domain

Short Project Description (October, 2007)

Team: George Caragea (CS, Ph.D. student)
Christiam Camacho (iSchool, special student)

Teaser blurb: One of the keys to Google's success in Web search is the recognition of links between documents. Explore this idea in the biomedical domain!
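
The blurb does not name a specific algorithm, so the following is only an assumed illustration of link analysis: a small PageRank-style iteration written as a map step (each document shares its score along its outgoing links) and a reduce step (each document sums what it receives). The graph, document names, damping factor, and iteration count are all hypothetical.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score documents by the links pointing at them."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        contributions = defaultdict(float)
        dangling = 0.0
        # Map: every document sends an equal share of its score to each target.
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    contributions[target] += share
            else:
                # Documents with no outgoing links spread their score evenly.
                dangling += rank[node]
        # Reduce: every document sums its incoming shares, with damping.
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * (contributions[node] + dangling / len(nodes))
            for node in nodes
        }
    return rank


# Hypothetical citation graph: A, B, and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```

In this toy graph, document C ends up with the highest score because three other documents link to it.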

Parallel Automatic Text-Background Separation in Picture Books

Short Project Description (October, 2007)

Team: Chang Hu (CS, Ph.D. student)
Punit Mehta (iSchool, MIM student)

Teaser blurb: Finally, a project that doesn't involve text (well, unfortunately it does indirectly...). This project aims to explore MapReduce for image processing. If you're interested in digital libraries, this is the project for you!


This page, first created: 17 Oct 2007; last updated:
Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States