WWW 2008 / Poster Paper, April 21-25, 2008 · Beijing, China

Using Graphics Processors for High-Performance IR Query Processing

Shuai Ding, Jinru He, Hao Yan, and Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201, USA
sding@cis.poly.edu, jhe@cis.poly.edu, hyan@cis.poly.edu, suel@poly.edu

ABSTRACT
Web search engines are facing formidable performance challenges as they need to process thousands of queries per second over billions of documents. To deal with this heavy workload, current engines use massively parallel architectures of thousands of machines that require large hardware investments. We investigate new ways to build such high-performance IR systems based on Graphical Processing Units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains including databases and scientific computing [2, 3, 5], but we are not aware of previous attempts to use GPUs for large-scale web search. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, and to describe how to perform highly efficient query processing within such an architecture. Preliminary experimental results based on a prototype implementation suggest that significant gains in query processing performance might be obtainable with such an approach.

Categories and Subject Descriptors: H.3 [INFORMATION STORAGE AND RETRIEVAL]
General Terms: Performance.
Keywords: Web search, query processing, GPU.

1. TECHNICAL PRELIMINARIES
For a good overview of IR query processing, see [7]. We assume that each document (e.g., each page covered by the engine) is identified by a unique document ID (docID). Our approach is based on an inverted index structure, used in essentially all current web search engines, which allows efficient retrieval of documents containing a particular set of words (or terms). An inverted index consists of many inverted lists, where each inverted list Iw contains the docIDs of all documents in the collection that contain the word w, sorted by document ID or some other measure, plus possibly the number of occurrences in the document and their positions. Inverted indexes are usually stored in highly compressed form on disk, or sometimes in main memory if space is available.

Given an inverted index, the basic structure of query processing is as follows: The inverted lists of the query terms are first fetched from disk or main memory and decompressed, and then an intersection or other Boolean filter between the lists is applied to determine those docIDs that contain all or most of the query terms. For these docIDs, the additional information associated with the docID in the index (such as number of occurrences and their positions) is used to compute a score for the document, and the k top-scoring documents are returned. These operations are usually pipelined so that the score of a document is computed immediately after the Boolean filter (usually intersection) is applied to it, which is itself done right after decompressing the relevant index entries, with no need to temporarily write the decompressed data to main memory. Thus, the main operations required are index decompression, intersection, and score computation. Query processing is responsible for a large fraction of the total hardware cost of a large state-of-the-art engine.

Index Compression: Compression of inverted indexes greatly reduces disk space use as well as disk and main memory accesses, resulting in faster query evaluation. There are many index compression methods [7]; the basic idea in most of them is to first compute the differences (gaps) between the sorted docIDs in an inverted list, and then apply a suitable integer compression scheme to the gaps. During decompression, the gaps are decoded and then summed up again in a prefix sum operation. In our prototype, we focus on two methods that we believe are particularly suitable for GPUs: Rice coding, and a recent method in [8] called PForDelta. To compress a sequence of gaps with Rice coding, we first choose a b such that 2^b is close to the average of the gaps to be coded. Then each gap n is encoded in two parts: a quotient q = floor(n / 2^b) stored in unary code, and a remainder r = n mod 2^b stored in binary using b bits. While Rice coding is considered somewhat slow in decompression speed, we consider here a new implementation proposed in [6] that is much faster than the standard one. The second compression method we consider is the PForDelta method proposed in [8], which was shown to decompress up to a billion integers per second on current CPUs. This method first determines a b such that most of the values in the list (say, 90%) are less than 2^b and thus fit into a fixed bit field of b bits each. The remaining integers, called exceptions, are coded separately. Both methods were recently evaluated for CPUs in [6].
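To make the encoding concrete, the following is a minimal CPU-side sketch of Rice coding for docID gaps; it is our own illustration, not the authors' implementation. The parameter b is chosen so that 2^b is close to the average gap, each gap is split into a unary quotient and a b-bit binary remainder, and decoding rebuilds docIDs with a running (prefix) sum over the decoded gaps. Unlike the layout described in Section 2 (and in [6]), which keeps the unary and binary parts in separate streams, this sketch interleaves them in a single bit stream for brevity; all names are illustrative.

    // Minimal sketch of Rice coding for docID gaps (illustrative only).
    #include <cstdint>
    #include <cmath>
    #include <vector>

    struct BitBuffer {
        std::vector<uint8_t> bytes;
        size_t bitPos = 0;                         // next bit position to write
        void putBit(int bit) {
            if (bitPos % 8 == 0) bytes.push_back(0);
            if (bit) bytes.back() |= 1u << (bitPos % 8);
            ++bitPos;
        }
        int getBit(size_t i) const { return (bytes[i / 8] >> (i % 8)) & 1; }
    };

    // Choose b so that 2^b is close to the average of the gaps.
    static uint32_t chooseB(const std::vector<uint32_t>& gaps) {
        uint64_t sum = 0;
        for (uint32_t g : gaps) sum += g;
        double avg = double(sum) / gaps.size();
        return avg < 1.0 ? 0 : uint32_t(std::floor(std::log2(avg)));
    }

    // Encode each gap n as the quotient n >> b in unary (q '1' bits, then a
    // terminating '0'), followed by the remainder n & (2^b - 1) in b bits.
    static BitBuffer riceEncode(const std::vector<uint32_t>& gaps, uint32_t b) {
        BitBuffer out;
        for (uint32_t n : gaps) {
            uint32_t q = n >> b;
            for (uint32_t i = 0; i < q; ++i) out.putBit(1);
            out.putBit(0);
            for (uint32_t i = 0; i < b; ++i) out.putBit((n >> i) & 1);
        }
        return out;
    }

    // Decode 'count' gaps and rebuild docIDs via a prefix sum over the gaps.
    static std::vector<uint32_t> riceDecodeDocIDs(const BitBuffer& in, uint32_t b,
                                                  size_t count) {
        std::vector<uint32_t> docIDs(count);
        size_t pos = 0;
        uint32_t docID = 0;
        for (size_t k = 0; k < count; ++k) {
            uint32_t q = 0;
            while (in.getBit(pos++)) ++q;          // unary quotient
            uint32_t r = 0;
            for (uint32_t i = 0; i < b; ++i) r |= uint32_t(in.getBit(pos++)) << i;
            docID += (q << b) | r;                 // gap = q * 2^b + r
            docIDs[k] = docID;
        }
        return docIDs;
    }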
Graphical Processing Units (GPUs): The current generation of GPUs arose due to the increasing demand for processing power by graphics-oriented applications such as computer games. Because of this, GPUs are highly optimized towards the types of operations needed in graphics, but researchers have recently studied how to exploit the computing power of these processors for other types of applications [2, 3, 5]. Modern GPUs offer multiple computing cores that can perform many operations in parallel, plus a very high memory bandwidth that allows processing of large amounts of data. However, to be efficient, computations need to be carefully structured to conform to the programming model offered by the GPU, which is a data-parallel model reminiscent of the massively parallel SIMD models studied in the 1980s. Recently, GPU vendors have started to offer better support for general-purpose computation on GPUs, thus removing some of the hassle of programming them. However, the requirements of the basic data-parallel programming model remain; in particular, it is important to structure computation in a very regular (oblivious) manner, such that each concurrently executed thread performs essentially the same sequence of basic steps. This is especially challenging for operations such as decompression and intersection that tend to be more adaptive in nature. One major vendor of GPUs, NVIDIA, recently presented the Compute Unified Device Architecture (CUDA), a new hardware and software architecture that simplifies GPU programming [1]. Our prototype is based on CUDA, and was developed and tested on an NVIDIA GeForce 8800 GTS graphics card. However, other cards supporting CUDA could also be used, and our general approach can be ported to other GPU programming environments.
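To illustrate the regular, data-parallel structure that the CUDA model rewards, the following sketch (our own, not code from the prototype) launches a kernel in which every thread performs the identical sequence of steps: extracting one fixed-width b-bit field from a packed array of 32-bit words. This is the kind of oblivious per-posting work that compressed-index decompression can be mapped onto; kernel and parameter names are assumptions made here for illustration.

    // Illustrative CUDA sketch: each thread does the same fixed work,
    // extracting one b-bit field (b < 32) from a packed bit array.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>

    __global__ void extractFields(const uint32_t* packed, uint32_t* out,
                                  int b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        long long bit = (long long)i * b;          // field i starts at bit i*b
        int word = (int)(bit >> 5);
        int off  = (int)(bit & 31);
        // Read a 64-bit window so fields straddling a word boundary still work.
        uint64_t window = ((uint64_t)packed[word + 1] << 32) | packed[word];
        out[i] = (uint32_t)((window >> off) & ((1u << b) - 1));
    }

    int main() {
        const int n = 1 << 20, b = 7;
        const int words = (n * b + 31) / 32 + 1;   // +1 padding word for the window read
        uint32_t *dPacked, *dOut;
        cudaMalloc(&dPacked, words * sizeof(uint32_t));
        cudaMemset(dPacked, 0, words * sizeof(uint32_t));
        cudaMalloc(&dOut, n * sizeof(uint32_t));
        extractFields<<<(n + 255) / 256, 256>>>(dPacked, dOut, b, n);
        cudaDeviceSynchronize();
        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(dPacked); cudaFree(dOut);
        return 0;
    }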
2. A GPU-BASED IR QUERY PROCESSOR
We now first outline the basic structure of our GPU-based IR query processor, based on Figure 1.

[Figure 1: Architecture of a GPU-Based System.]

The compressed inverted index is either completely in main memory, or stored on disk but partially cached in main memory for better performance. The GPU itself can access main memory via the CPU, but also has its own global memory (640 MB in our case) plus several other specialized caches, including shared memory that can be shared by many threads. Data transfer between CPU and GPU, and thus between main memory and GPU, is reasonably fast (at least 3 GB/s in our tests), while memory bandwidth between the GPU and its own global memory is in the tens of GB/s and thus much higher than typical system memory bandwidths. (But accesses need to be scheduled carefully to achieve this performance.)

A query is processed as follows: First, the CPU receives and preprocesses the query. The corresponding inverted lists are retrieved from disk if not already cached in main memory. The data is then transferred to the GPU global memory, which also maintains its own cache of index data in order to decrease the time for main memory-to-GPU data transfers. The GPU then decompresses and intersects the lists and returns the top results to the CPU. For best performance, the CPU may also process some of the queries itself such that both processors share the overall load. Thus, the CPU controls the overall process, while the GPU serves as an auxiliary compute device.

Algorithms for GPU Query Processing: We now give more details on our implementation; due to space constraints, we can only give a sketch of our techniques. We first describe how to decompress an inverted list, for the case of Rice coding. As in [6], the unary and binary parts of the Rice code are kept separately. We can then decompress the code by running two separate prefix sums, one on the binary data and one on the unary data. The binary prefix sum operates on b-bit elements, while the unary one sums up the total number of '1' values in the unary data. Then each docID can be retrieved by summing up its binary prefix sum and 2^b times the corresponding unary prefix sum.
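A small sketch of this final combination step, under the assumption that inclusive prefix sums over the b-bit remainders (binPrefix) and over the '1' bits of the unary stream (unaryPrefix, i.e., the running sum of quotients) have already been produced by a scan such as the one in [4]: each docID is then binPrefix[i] + 2^b * unaryPrefix[i]. This is our own illustration rather than the authors' code, and the tiny host driver simply checks the formula on a hand-worked example.

    // Illustrative CUDA sketch: combine the two prefix sums into docIDs.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>

    __global__ void combinePrefixSums(const uint32_t* binPrefix,
                                      const uint32_t* unaryPrefix,
                                      uint32_t* docIDs, int b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            docIDs[i] = binPrefix[i] + (unaryPrefix[i] << b);  // sum(r) + 2^b * sum(q)
    }

    int main() {
        // Gaps 5, 9, 3 with b = 2 give quotients 1,2,0 and remainders 1,1,3;
        // their inclusive prefix sums are {1,3,3} and {1,2,5}, so the decoded
        // docIDs should be 5, 14, 17.
        const int n = 3, b = 2;
        uint32_t binPrefix[n]   = {1, 2, 5};
        uint32_t unaryPrefix[n] = {1, 3, 3};
        uint32_t *dBin, *dUnary, *dDoc, docIDs[n];
        cudaMalloc(&dBin, sizeof(binPrefix));
        cudaMalloc(&dUnary, sizeof(unaryPrefix));
        cudaMalloc(&dDoc, sizeof(docIDs));
        cudaMemcpy(dBin, binPrefix, sizeof(binPrefix), cudaMemcpyHostToDevice);
        cudaMemcpy(dUnary, unaryPrefix, sizeof(unaryPrefix), cudaMemcpyHostToDevice);
        combinePrefixSums<<<1, 32>>>(dBin, dUnary, dDoc, b, n);
        cudaMemcpy(docIDs, dDoc, sizeof(docIDs), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%u ", docIDs[i]);   // expect: 5 14 17
        printf("\n");
        cudaFree(dBin); cudaFree(dUnary); cudaFree(dDoc);
        return 0;
    }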
Multi-level tree-based algorithms for prefix sums on GPUs were discussed in [4] based on earlier SIMD algorithms; we adapted and fine-tuned these for unary and b-bit binary data. The other challenge is the intersection of the different inverted lists. To do so, we again adopt older techniques from the parallel computing literature for the related problem of merging lists of integers. In particular, we use multi-level algorithms that first select a small subset of splitter elements that are inserted into the other list, thus partitioning this other list into segments, each of which only has to be compared to a small subset of elements in the first list; this process can then be applied recursively and in the other direction.

CPU-based algorithms typically pipeline the various operations in query processing. This is difficult on GPUs, but also less necessary due to the high global memory bandwidth. CPU-based algorithms also typically skip over some parts of an inverted list without decompressing it, if the list is much longer than the one it has to be intersected with. This is achieved by partitioning each list into blocks of some size, say 128 docIDs, and storing the first docID in each block in uncompressed form such that decompression can be resumed at any block boundary. This can also be implemented on a GPU, though slightly differently, by performing a merge-type operation between the uncompressed elements of a longer list and the elements of an already decompressed shorter list (similar to the intersection above); this determines the list of blocks in the longer list that need to be decompressed. Keeping one element in each block uncompressed also limits the prefix computation to within each block, resulting in a faster single-level computation than the multi-level approach in [4]. Finally, computation of rank scores is fairly straightforward.

3. EXPERIMENTS AND DISCUSSION
We now present preliminary experimental results that show the potential of the new architecture. Note that we have not yet implemented all of the features in the proposed query processor. All our runs use Rice coding, and we do not yet support blocked access to the inverted lists. For the experiments, we used the TREC GOV2 data set of 25.6 million web pages, and selected 1000 random queries from the supplied query logs. On average, there were about 3.74 million postings in the inverted lists associated with a query. Our CPU implementation is based on the optimized code in [6].

                   CPU Times   GPU Times   Speedup
    Decompression      25.89       11.39      2.27
    Intersection        5.73        1.95      2.93
    Total              31.62       13.35      2.37

Table 1: Times in ms per query for CPU and GPU.

From Table 1, we see that we get a speedup of more than 2 even for our very preliminary implementation. There are various caveats, of course. Again, we have not implemented skipping and block-wise access. Also, results may be different for PForDelta compression, which runs very fast on modern CPUs. Finally, all results assume that the data is already in memory, or at least that disk is not the main bottleneck of the system. (Of course, if the CPU is not the bottleneck, then a GPU cannot help either.) We are working on overcoming the current limitations and on completing a full prototype.

4. REFERENCES
[1] NVIDIA CUDA Compute Unified Device Architecture - Programming Guide, June 2007. http://www.nvidia.com/object/cuda_develop.html.
[2] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proc. of the 26th SIGMOD, 2006.
[3] N. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In Proc. of the 24th SIGMOD, 2004.
[4] M. Harris. Parallel prefix sum (scan) with CUDA, 2007. http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf.
[5] J. Owens, D. Luebke, et al. A survey of general-purpose computation on graphics hardware. In Eurographics, 2005.
[6] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. of the 17th Int. World Wide Web Conference, April 2008.
[7] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006.
[8] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. of the 22nd ICDE, 2006.