Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Title: Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce
Publication Type: Conference Papers
Year of Publication: 2009
Authors: Jimmy Lin
Conference Name: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Date Published: 2009
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-60558-483-6
Keywords: distributed algorithms, hadoop
Abstract

This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.
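As a rough illustration of the third approach named in the abstract (the Cartesian product of postings lists), the sketch below simulates the idea in memory with plain Python. It is not the paper's implementation and does not use Hadoop: the function names `build_postings` and `pairwise_similarity`, the toy documents, and the use of raw term frequencies as term weights are all assumptions made for this example. The first function plays the role of an indexing pass, and the second pairs up every two entries of each postings list and sums the partial products per document pair, as a reducer would.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing pass: map each term to a list of (doc_id, weight).

    Assumption: the weight is the raw term frequency; a real system
    would likely use a tuned weighting scheme instead.
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    return postings


def pairwise_similarity(postings):
    """Pairing pass: for every postings list, take all pairs of entries,
    emit the product of their weights, and sum the products per pair."""
    scores = defaultdict(int)
    for plist in postings.values():
        # Sorting keeps each pair key in a canonical (smaller, larger) order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            scores[(d1, d2)] += w1 * w2
    return dict(scores)


# Hypothetical toy collection, used only to exercise the sketch.
docs = {
    "a": "gene protein gene",
    "b": "protein cell",
    "c": "gene cell cell",
}
scores = pairwise_similarity(build_postings(docs))
print(scores)
```

Each score here is an unnormalized inner product of term-frequency vectors. Because the work per term grows quadratically with the length of its postings list, very common terms dominate the cost in a sketch like this, which is one reason approximations that trade effectiveness for efficiency are of interest.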

URL: http://doi.acm.org/10.1145/1571941.1571970
DOI: 10.1145/1571941.1571970