Resource-Aware Compiler Prefetching for Many-Cores

Title: Resource-Aware Compiler Prefetching for Many-Cores
Publication Type: Conference Papers
Year of Publication: 2010
Authors: Caragea GC, Tzannes A, Keceli F, Barua R, Vishkin U
Conference Name: Parallel and Distributed Computing (ISPDC), 2010 Ninth International Symposium on
Date Published: 2010/07
Keywords: algorithm;memory, architecture;hardware-software, architectures;parallel, architectures;resource, aware, caches;super-scalar, compiler, compiler;Multicore, compilers;parallel, GCC-derived, level, management;, many-core, memories;storage, out-of-order, Parallel, parallelism;parallel, prefetch;loop, prefetching, prefetching;shared, processor;fine-grained, processors;multiprocessing, systems;optimising

Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor lighter cores with fewer resources. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported and optimizes for that limit. We show that in situations where not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible, rather than use a fixed prefetch distance and skip prefetching for some references, as in current approaches. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show improvements of up to 24.61%.
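
The trade-off described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the RAP algorithm from the paper: it assumes that each prefetched reference keeps `distance` prefetches in flight in steady state, so the total in flight is roughly `covered_refs * distance`. The names `PrefetchPlan`, `fixed_distance_plan`, `resource_aware_plan`, `ideal_distance`, and `max_in_flight` are hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class PrefetchPlan(NamedTuple):
    distance: int      # how many iterations ahead each prefetch is issued
    covered_refs: int  # how many loop references receive a prefetch


def _check(num_refs: int, ideal_distance: int, max_in_flight: int) -> None:
    if num_refs < 0 or max_in_flight < 0 or ideal_distance < 1:
        raise ValueError("need num_refs >= 0, max_in_flight >= 0, ideal_distance >= 1")


def fixed_distance_plan(num_refs: int, ideal_distance: int, max_in_flight: int) -> PrefetchPlan:
    """Baseline: keep the ideal distance and skip references that do not fit.

    Toy assumption: each covered reference occupies `ideal_distance`
    in-flight prefetch slots.
    """
    _check(num_refs, ideal_distance, max_in_flight)
    covered = min(num_refs, max_in_flight // ideal_distance)
    return PrefetchPlan(ideal_distance, covered)


def resource_aware_plan(num_refs: int, ideal_distance: int, max_in_flight: int) -> PrefetchPlan:
    """Shrink the distance so that as many references as possible are covered."""
    _check(num_refs, ideal_distance, max_in_flight)
    if num_refs == 0:
        return PrefetchPlan(ideal_distance, 0)
    # Largest distance (capped at the ideal one) at which every reference fits.
    distance = min(ideal_distance, max_in_flight // num_refs)
    if distance >= 1:
        return PrefetchPlan(distance, num_refs)
    # Even at distance 1 not every reference fits: cover as many as possible.
    return PrefetchPlan(1, min(num_refs, max_in_flight))
```

Under this model, a loop with 6 references, an ideal distance of 4, and room for 8 in-flight prefetches gets 2 references covered at distance 4 by the fixed-distance plan, while the resource-aware plan covers all 6 at distance 1. Whether the shorter distance still hides enough latency depends on the target architecture, which this sketch does not model.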