WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Efficient Vectorial Operators for Processing XML Twig Queries Guoliang Li, Jianhua Feng, Jianyong Wang, Feng Lin, and Lizhu Zhou Depar tment of Computer Science and Technology, Tsinghua University, Beijing 100084, P. R. China {liguoliang,fengjh,jianyong,dcszlz}@tsinghua.edu.cn; lin-f@mails.tsinghua.edu.cn ABSTRACT This pap er prop oses several vectorial op erators for processing XML twig queries, which are easy to b e p erformed and inherently efficient for b oth Ancestor-Descendant (A-D) and Parent-Child (P-C) relationships. We develop optimizations on the vectorial op erators to improve the efficiency of answering twig queries in holistic. We prop ose an algorithm to answer GTP queries based on our vectorial op erators. T wig 2 S tack[1] is prop osed to process GTP queries, it is constrained by the fan-out of the XML document and thus leads to inefficiency. We discuss how to efficiently answer GTP queries according to our vectorial op erators. To the b est of our knowledge, this is the first pap er that employs vectorial op erators to process twig queries and GTP queries. 2. TJOPERATOR We employ a sequence to represent an XML document and answer XML queries based on the sequence. For any node u, we construct list Iu , which keeps the elements w.r.t. u and the p ositions of their occurrences in the corresp onding sequence. To check whether element ea Ia and element ed Id satisfy the A-D or P-C relationships, we transform Ia and Id into bit-vectors and evaluate the A-D or P-C relationships on the bit-vectors. Categories and Subject Descriptors H.2.8 [Database Applications ]: Miscellaneous General Terms Algorithms, Performance, Languages Keywords Holistic Twig Join,Vectorial Op erators,Subsequence Match 2.1 Vectorial Operators We b egin by introducing three vectorial op erators, , and , to transform any input list Ia into bit-vectors, Va ,Va and V a , resp ectively as illustrated in Table 1. We prop ose three corresp onding op erators, , , , which are similar to , , and op erate on any bit-vector (e.g., an intermediate bit-vector) as shown in Table 1. We prop ose two vectorial op erators a[d] and a/d to evaluate the P-C relationship as shown in Table 1. The two op erators can evaluate whether the elements in Ia and the elements in Id satisfy a/d. In addition, we introduce three vectorial op erators a , a[//d] and a//d , to evaluate the A-D relationship, as shown in Table 1. Equations (2.1-1)-(2.1-10) formally describ e how to evaluate the P-C or A-D relationships, where P [i] denotes the element at the i-th p osition in the sequence and S denotes the corresp onding result set. Ra[d] = a[d] ( (Ia ), (Id )) Ra/d = a/d ( (Ia ), (Id )) Sa[d] = {P [i]|Ra[d] [i] = 1} Sa/d = {P [i]|Ra/d [i] = 1} S(a,/d) = {(P [i + 1], P [i])|Ra/d [i] = 1} Ra//d = a//d ( (Ia ), (Id )) 1. INTRODUCTION Approaches based on structural index, numb ering scheme and subsequence matching have b een studied for processing XML twig queries, among which the most efficient one is holistic twig join based on numb ering scheme. However, holistic twig join is sub optimal for those twig queries in which b oth Ancestor-Descendant (A-D) and Parent-Child (P-C) relationships are included, b ecause it may involve huge even unb ounded intermediate results. In addition, most of existing prop osals are not efficient for the GTP queries [3, 4], b ecause to answer GTP queries they have to compute the result sets of all the nodes although a part of them will not contribute to the answer, and hence they have to eliminate those redundancies involved in the result. To address these problems, we present some vectorial operators for the P-C and A-D relationships to avoid redundant intermediate results and demonstrate how to answer twig queries using these vectorial op erators efficiently. To accelerate the processing of twig queries, we prop ose several techniques to optimize these vectorial op erators. Although (2.1-1) (2.1-2) (2.1-3) (2.1-4) (2.1-5) (2.1-6) (2.1-7) (2.1-8) (2.1-9) Ra[//d] = a[//d] ( (Ia ), (Id )) Sa//d = {P [i]|Ra//d [i] = 1} Sa[//d] = {P [i]|Ra[//d] [i] = 1} Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. S(a,//d) = {(P [i], P [j ])|i, j, if Ra[//d] [i]=1, P [i].leftmost j < i, Ra//d [j ] = 1}. (2.1-10) 1037 WWW 2008 / Poster Paper Operator (Ia ) (Ia ) (I a ) (V a ) (V a ) (Va ) a[d] (Va ,Vd ) a/d (Va ,Vd ) a (Va , Vd ) a//d (Va ,Vd ) a[//d] (Va ,Vd ) Operand Ia Ia Ia Va Va Va Va , Vd Va , Vd Va , Vd Va , Vd Va , Vd Result Va V a Va Va V a Va Va[d] Va/d Va Va//d Va[//d] April 21-25, 2008 · Beijing, China Definition Va [i]=1 if j, k,1j |Ia |,1k|Ia [j ].occur | and i=Ia [j ].occur [k]; otherwise, Va [i]=0. Va [i]=1 if j, k,1j |Ia |, k=|Ia [j ].occur | and i=Ia [j ].occur [k]; otherwise, Va [i]=0. V a [i]=1 if j, k, t,1j |Ia |,k=|Ia [j ].occur |,t=Ia [j ].occur [k], P [t].leftmost it; else V Va [i]=1 if j ,1j |Va |, Va [j ]=1 and P [i]=P [j ]; otherwise, Va [i]=0. Va [i]=1 if j i, Va [j ]=1 and P [i]=P [j ], but k>i, P [i]=P [k]; otherwise, Va [i]=0. V a [i]=1 if j ,1j |Va |, (Va )[j ]=1 and P [j ].leftmost i j ; otherwise, V a [i]=0. Va[d] = ( (Va ) ( (Vd )>>1) ) Va/d = ( (Va )<<1) (Vd ) Va [i]=1 if (Va )[i] =1, j ,P [i].leftmost j > 1))) n (2.2-1) 10 Q1 Q2 0 Q7 Q8 Q9 Q10 Q11 Q12 Twig and GTP Queries on XMark Twig and GTP Queries on TreeBank Figure 1: Experimental results 3. EXPERIMENTAL STUDY We compared T J Oper ator with state-of-the-art methods, PRIX [5], iTwigJoin [2] and Twig2 Stack [1]. All the algorithms were coded in C++ and all the exp eriments were conducted on a 2.4 GHz Pentium IV processor with 1GB RAM, running Microsoft Windows XP. We used the realworld dataset DBLP and the dataset TreeBank with deep recursive structures, and employed twelve queries for our exp eriments. Figure 1 shows the exp erimental results. We can observe that T J Oper ator outp erforms PRIX, iTwigJoin and Twig2 Stack on various queries and datasets significantly. Rd = (dd-childr en(n) (n ( (In ), (In ) Rd ))) (2.2-2) n (In ) if n is a leaf node (Rc ) if n only has c -child n (2.2-3) Rn = (Rd ) if n only has d -child n (Rc Rd ) otherwise n n if n is the root node Rn ( (Fp )<<1) Rn if n is c-child of its parent p Fn = (Fp ) Rn if n is d-child of its parent p (2.2-4) Sn = {P [i]|Fn [i] = 1} (2.2-5) 4. CONCLUSION We prop ose several vectorial op erators to evaluate the AD and P-C relationships and present how to process twig queries in holistic by employing these op erators. We demonstrate several effective techniques to optimize these op erators for processing XML twig queries and GTP queries. 2.3 Optimizations for Vectorial Operators We introduce several techniques to optimize our vectorial op erators. Let c1 ,c2 ,...,cl b e the c-children (P-C) of node n and d1 ,d2 ,...,dk b e the d-children (A-D), to optimize the computation of Rd , we extend a from a binary op erator to n a multiple op erator n : Vn =n (Vn ,Vd1 ,Vd2 ,...,Vdk )|=Vn [i]=1, if (Vn )[i]=1,1j k, bj , P [i].leftmost bj > 1))(Rc2 >> 1))...)(Rcl >> 1)) n (2.3-1) Rd (n ( (In ), (In ) Rd1 , ..., (In ) Rdk )) (2.3-2) n (Rc Rd ) (Rc ) (Rd ) n n n n (Rd ) n ( (In ), n (In ) 5. ACKNOWLEDGEMENT This work is partly supp orted by the National Natural Science Foundation of China under Grant No.60573094, the National High Technology Development 863 Program of China under Grant No.2007AA01Z152 and 2006AA01A101, the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103. 6. REFERENCES [1] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K. S. Candan. Twig2stack: Bottom-up processing of generalized-tree-pattern queries over xml documents. In VLDB, 2006. [2] T. Chen, J. Lu, and T. Ling. On boosting holism in xml twig pattern matching using structural indexing techniques. In SIGMOD, pages 455­466, 2005. [3] Z. Chen, H. V. Jagadish, L. V. S. Lakshmanan, and S. Paparizos. From tree patterns to generalized tree patterns: On efficient evaluation of xquery. In VLDB, pages 237­248, 2003. [4] G. Li, J. Feng, and L. Zhou. Efficient Holistic Twig Joins in Leaf-to-Root Combining with Root-to-Leaf Way. In DASFAA, 2007. [5] P. Rao and B. Moon. Prix: Indexing and querying xml using prufer sequences. In ICDE, pages 288­299, 2004. (2.3-3) Rdk ) (2.3-4) Rd1 , ..., (In ) To answer twig queries in holistic and process GTP queries effectively, we prop ose an effective algorithm T J Oper ator by employing the vectorial op erators. T J Oper ator directly computes the results and does not require a p ost-processing to eliminate the irrelevant elements. T J Oper ator only maintains several bit-vectors and the numb er of these bit-vectors is no more than the numb er of result nodes, thus T J Oper ator will not involve large intermediate results. occur keeps the p ositions of e's o ccurrences in the sequence and leftmost is the p osition of the o ccurrence of e's leftmost descendant. 1038