TY - JOUR T1 - The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce JF - Proceedings of LSDS-IR workshop Y1 - 2009 A1 - Jimmy Lin AB - This paper explores the problem of “stragglers” in Map-Reduce: a common phenomenon where a small number of mappers or reducers takes significantly longer than the oth- ers to complete. The effects of these stragglers include un- necessarily long wall-clock running times and sub-optimal cluster utilization. In many cases, this problem cannot sim- ply be attributed to hardware idiosyncrasies, but is rather caused by the Zipfian distribution of input or intermedi- ate data. I present a simple theoretical model that shows how such distributions impose a fundamental limit on the amount of parallelism that can be extracted from a large class of algorithms where all occurrences of the same ele- ment must be processed together. A case study in parallel ad hoc query evaluation highlights the severity of the strag- glers problem. Fortunately, a simple modification of the in- put data cuts end-to-end running time in half. This example illustrates some of the issues associated with designing effi- cient MapReduce algorithms for real-world datasets. ER -