WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Efficiently Querying RDF Data in Triple Stores Ying Yan , Chen Wang , Aoying Zhou § , Weining Qian § , Li Ma , Yue Pan Department of Computer Science and Engineering, Fudan University {yingyan, ayzhou}@fudan.edu.cn § {chwang, malli, panyue}@cn.ibm.com IBM China Research Laboratory Institute of Massive Computing, East China Normal University {ayzhou, wnqian}@sei.ecnu.edu.cn ABSTRACT Efficiently querying RDF [1] data is b eing an imp ortant factor in applying Semantic Web technologies to real-world applications. In this context, many efforts have b een made to store and query RDF data in relational database using particular schemas. In this pap er, we prop ose a new scheme to store, index, and query RDF data in triple stores. Graph feature of RDF data is taken into considerations which might help reduce the join costs on the vertical database structure. We would partition RDF triples into overlapp ed groups, store them in a triple table with one more column of group identity, and build up a signature tree to index them. Based on this infrastructure, a complex RDF query is decomp osed into multiple pieces of sub-queries which could b e easily filtered into some RDF groups using signature tree index, and finally is evaluated with a comp osed and optimized SQL with sp ecific constraints. We compare the p erformance of our method with prior art on typical queries over a large scaled LUBM and UOBM b enchmark data (more than 10 million triples)in [3]. For some extreme cases, they can promote 3 to 4 orders of magnitude. Categories and Subject Descriptors H.2.4 [Systems]: Query processing; C.4 [Performance of Systems]: Design studies General Terms Exp erimentation, Performance Keywords RDF, Signature, Graph Partitioning, Indexing 1. INTRODUCTION The Semantic Web is an effort by the World Wide Web Consortium (W3C) to enable data integration and sharing Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. across different applications. It is designed to evolve the general web which mostly consists of markup or other formats p erceived by p eople into the machine-readable data web. The Semantic Web data model of Resource Description Framework [1] (RDF), recommended by W3C, represents data as a lab elled graph connecting resources and their property values with lab elled edges representing prop erties. The graph can b e structurally parsed into a set of triples or statements in the form of < subj ect, pr edicate, obj ect >. RDF data model is very general and is easy to express any typ e of data. Though RDF data representation is flexible, it p otentially results in serious p erformance issues since RDF queries involve intensive self-joins over the triple table. When the numb er of triples is so large that they can not b e cached in memory for each filter, the joins will b e very exp ensive b ecause disk scan and index lookup are required. As a general data model, RDF triple table stores any typ e of data in a column. For example, values of age, weight, or even given name of all p ersons co-exist in the ob ject column. Accordingly, the statistics collected in these columns complicates selectivity estimation, which will further disable relational database query optimizer to deliver a nice query plan. At present, creating indices, splitting data into prop erty tables (2-column schema), and materializing join views (e.g., sub ject-sub ject and sub ject-ob ject) are common methods for improving RDF query p erformance on the vertical database structure. However, it is still urgently needed to figure out new storage and indexing schemes that make relational database much efficient to supp ort RDF query. In this pap er, we prop ose a novel idea to optimize executable SQL for general-purp osed triple stores by considering graph feature of RDF data. An intuitive example is given as follows to illustrate the idea and our motivation. Given a SPARQL [2] query of Figure 1(a) which logically equals to the query graph shown in Figure 1(b), we issue it over the RDF data in Figure 1(c). Supp ose that the triples stored in the table are not well sorted intentionally in advance and no index is created as well. After the SPARQL query is translated and submitted to relational database, query engine might determine a nested-loop self-join on the column of sub ject accordingly. Roughly estimated, the join will cost 4 × 4 = 16 times of string comparison and 4 + 4 = 8 I/O times. Once the query and data are much more complex, the cost will increase dramatically. Observing the same example from p ersp ective of graph model shown in Figure 2(a), even though the triples (or edges bridging two vertices) like <:D :Type :Professor> 1053 WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Select ?X Where { ?X :type :Professor, ?X :memberOf :Department0, } Join X :m pe em :ty be rO f :Professor :Department0 (a) SPARQL Query (b) Query Graph Figure 1: A Motivated Example (c) Triple Table ... 1 e nt rtm :me epa :m mberOf :D em ber :D Of ... GID=1 pe :ty :C GID=2 be em :m GID=3 Join :type :E :F ... ... :B (a) Partitioned RDF Data Graph Figure 2: Our Proposed Idea are totally disconnected with those including <:A :MemberOf :Department0>, <:B :MemberOf :Department0> and <:G :MemberOf :Department0>, they are still selected to compare with each other by string. Here, we are motivated to decrease the join cost by considering the fact that the query results must b e connected subgraphs emb edded in the RDF data graph and able to match query graph. Therefore, we first partition the data graph into three groups as shown by the dotted lines in Figure 2(a) and add an additional column for the triple table (Figure 1(c)) to store the group (subgraph) identities illustrated in Figure 2(b). Thus, we only p erform the join op erations within each group, and finally merge the results together and eliminate redundancy. In this case, only two triples of <:C :Type :Professor> and <:C :MemberOf :Department0> are filtered as candidates in group 2. The join cost is reduced to 1 × 1 = 1 time of string comparison and 1 + 1 = 2 I/O times. Furthermore, to fast locate the candidate groups probably containing the query graph, we build up signature for all partitions and filter them using the query graph signature. :ty p :type :type :Professor :G :ty pe ... f berO mem :Student :Department0 : :m em be r ... Of :A rO f :ty p e e r be em :m :m O f em be rO f (b) Partitioned Triple Table and UOBM b enchmark data (more than 10 million triples). The exp erimental results in [3] show that our method can help dramatically improve query efficacy of triple stores. For some extreme cases, it can promote 3 to 4 orders of magnitude. We highlight our contributions as b elow: Graph partitioning for RDF triples Filtering and joining of each RDF query always happ en within the smaller scop e than the whole RDF graph. It would dramatically reduce the self-join cost. Signature indexing Bit vector matching instead of string matching b etween query graph and data graph is used to fast filter out candidate partitions, which greatly save the cost on locating partitions. SQL rewriting We slightly change the schema of the triple table and only rewrite the translated SQLs by common RDF triple stores with our techniques, which can b e easily integrated as an optimization comp onent into varying typ es of triple stores. Early stop strategy Null result can b e returned immediately without the need of consulting RDBMS, once while no candidate partition is found to p otentially contain query graph by signature matching. The query evaluation is finished outside relational database. 2. GRAPH PARTITIONING AND INDEXING We prop ose a new scheme to store, index, and query RDF data in this pap er. Graph feature of RDF data is taken into consideration which might help reduce the join costs on the vertical database structure. We would partition RDF triples into overlapp ed groups, store them in a triple table with one more column of group identity, and build up a signature tree to index them. Based on this infrastructure, a complex RDF query is decomp osed into multiple pieces of sub-queries which could b e easily filtered into some RDF groups using signature tree index, and finally is evaluated with a comp osed and optimized SQL having sp ecific constraints. In [3], we compare the p erformance of our method with prior art on typical queries over a large-scale of LUBM 3. REFERENCES [1] F. Manola and E. Miller. RDF primer. W3C recommendation, Feb 2004. [2] E. Prud'hommeaux and A. Seab orne. SPARQL query language for RDF. W3C candidate recommendation, April 2006. [3] Y. Yan, C. Wang, A. Zhou, W. Qian, L. Ma, and Y. Pan. Efficiently querying rdf data in triple stores. Technique rep ort, 2008. 1054