%0 Conference Paper %B Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming %D 2010 %T Lazy binary-splitting: a run-time adaptive work-stealing scheduler %A Tzannes, Alexandros %A Caragea, George C. %A Barua, Rajeev %A Vishkin, Uzi %K Dynamic scheduling %K load balancing %K nested parallelism %K thread scheduling %K work stealing %X We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS), implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease of programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to a particular dataset, platform, and calling context of the do-all loop, resulting in poor performance portability for the code. LBS overcomes both the performance-portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which does not require manual tuning either but lacks context portability, and outperform it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst using a training dataset and then running both on a different execution dataset. LBS outperforms SP by 19.5% on average, while allowing for improved performance portability without requiring tedious manual tuning.
LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer, by 54.7%. Finally, compared to serializing inner parallelism (SI), which has been used by OpenMP, LBS is 54.2% faster. %S PPoPP '10 %I ACM %C New York, NY, USA %P 179-190 %8 2010/// %@ 978-1-60558-877-3 %G eng %U http://doi.acm.org/10.1145/1693453.1693479 %R 10.1145/1693453.1693479