Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce

Title: Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce
Publication Type: Conference Papers
Year of Publication: 2008
Authors: Jimmy Lin
Conference Name: Proceedings of the Conference on Empirical Methods in Natural Language Processing
Date Published: 2008
Publisher: Association for Computational Linguistics
Conference Location: Stroudsburg, PA, USA
Abstract

This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in commercial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. I discuss two barriers contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google's MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within the reach of many academic research groups. This paper illustrates these points with a case study in building word co-occurrence matrices from large corpora. I conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.

URL: http://dl.acm.org/citation.cfm?id=1613715.1613769