WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China RACE: Finding and Ranking Compact Connected Trees for Keyword Proximity Search over XML Documents 1 Guoliang Li1 , Jianhua Feng1 , Jianyong Wang1 , Bei Yu2 , and Yukai He1 Depar tment of Computer Science and Technology Tsinghua University, Beijing, China 2 School of Computing National University of Singapore, Singapore {liguoliang,fengjh,jianyong}@tsinghua.edu.cn; yubei@comp.nus.edu.sg; heyk05@mails.tsinghua.edu.cn ABSTRACT In this pap er, we study the problem of keyword proximity search over XML documents and leverage the efficiency and effectiveness. We take the disjunctive semantics among input keywords into consideration and identify meaningful compact connected trees as the answers of keyword proximity queries. We introduce the notions of Compact Lowest Common Ancestor (CLCA) and Maximal CLCA (MCLCA) and prop ose Compact Connected Trees (CCTrees) and Maximal CCTrees (MCCTrees) to efficiently and effectively answer keyword queries. We prop ose a novel ranking mechanism, RACE, to Rank compAct Connected trEes, by taking into consideration b oth the structural similarity and the textual similarity. Our extensive exp erimental study shows that our method achieves b oth high search efficiency and effectiveness, and outp erforms existing approaches significantly. n1 n2 n0 n 19 n 22 n1 n0 n 19 n 22 n4 k1 n 18 n 20 n 21 n 23 n 24 n 18 n 20 n 21 n 23 n 24 n2 k4 k1 k3 k2 k4 k4 k1 k3 k2 k4 n3 n 13 n3 n 13 n6 n6 n 4 n 5 n 7 n 10 n 14 n 15 n 5 n 7 n 10 n 14 n 15 k1 k2 n n n n k4 n n k 2 n 8 n 9 n 12k 4 n n 17 16 8 9 11 12 17 16 k1 k3 k2 k4 k1 k2 k1 k3k2 k4 k1 k2 Figure 1: Maximal Compact Connected Trees similarity from the DB p oint of view and the textual similarity from the IR viewp oint. We introduce the notions of Compact LCA (CLCA) and Maximal CLCA (MCLCA) to capture the focuses of keyword queries, and prop ose Compact Connected Trees (CCTrees) and Maximal CCTrees (MCCTrees) to efficiently and effectively answer keyword proximity queries. Moreover, we devise a novel ranking mechanism, RACE, to Rank compAct Connected trEes. RACE not only considers the textual similarity like document relevancy in IR literature, but also incorp orates the structural similarity into the ranking function from the DB p oint of view. Categories and Subject Descriptors H.2.8 [Database Applications ]: Miscellaneous General Terms Algorithms, Performance, Languages Keywords Lowest Common Ancestor (LCA), Compact LCA (CLCA), Maximal CLCA(MCLCA) 2. COMPACT CONNECTED TREES Traditional methods usually compute the LCAs of content nodes to answer keyword queries. However, it is inefficient to compute all the LCAs as given a keyword query m {k1 , k2 , · · · , km }, there are i=1 |Ii | combinations of LCA candidates, where Ii denotes the set of content nodes that directly contain keyword ki . To address this problem, we introduce the concepts of Compact LCA (CLCA) and Compact Connected Trees (CCTrees). Definition 2.1. (CLCA and CCTree) Given q content nodes, v1 I1 , v2 I2 ,· · · ,vq Iq , and w=LCA(v1 ,v2 ,· · · ,vq ). w is said to dominate vi w.r.t. {k1 ,k2 ,· · · ,kq }, if w LCA(v1 ,· · · , vi-1 ,vi ,vi+1 ,· · · ,vq ), v1 I1 ,v2 I2 ,· · · ,vi-1 Ii-1 , vi+1 Ii+1 ,· · · , vq Iq . w is a CLCA w.r.t. {k1 ,k2 ,· · · ,kq }, if w dominates each vi for 1iq . The tree rooted at a CLCA and containing the paths from the root to the nodes dominated by the root, is cal led a CCTree. A CLCA is the LCA of some relevant nodes and the irrelevant nodes cannot share a CLCA. For example, in Figure 1, n3 is the CLCA of n4 and n5 w.r.t. {k1 ,k2 }, however, n2 is not the CLCA of n4 and n17 , although n2 is their LCA. Because n3 dominates n4 , and n15 dominates n17 , but there is no node which dominates b oth n4 and n17 . We observe that n3 and n15 are more relevant to {k1 ,k2 } than n2 . 1. INTRODUCTION Keyword search is a proven and widely accepted mechanism for querying in textual document systems and World Wide Web. The research community has recently recognized the b enefits of keyword search and has b een introducing keyword search capability into XML documents [2, 4, 5, 6, 7]. In this pap er, we study the problem of keyword proximity search over XML documents by considering the disjunctive semantics (i.e., the OR predicate) among the input keywords, and provide a novel ranking mechanism for effective keyword search, by taking into account b oth the structural Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. 1045 WWW 2008 / Poster Paper The subtree rooted at n3 is a CCTree. CLCA is orthogonal to SLCA [7] and avoids false negatives introduced by SLCA. For example, in Figure 1, n0 and n6 are b oth CLCAs w.r.t. {k1 ,k2 ,k3 ,k4 }, and they dominate {n20 ,n21 ,n23 ,n24 } and {n8 ,n9 ,n11 ,n12 }, resp ectively. n0 is a false negative for SLCA as n0 has a LCA descendant n6 . CLCA can avoid those false negatives and thus is a more meaningful methodology to answer keyword queries. We give the least upp er b ound of the numb er of CLCAs as stated in Lemma 2.1, which is much smaller than the numb er of LCAs. m Lemma 2.1. There are at most 2 i=1 |Ii |-1 CLCAs w.r.t. a query K=(k1 ,k2 ,· · · ,km ) and an XML document D in terms of the disjunctive semantics (i.e., the OR predicate). 100000 1e+007 1e+006 April 21-25, 2008 · Beijing, China 1e+007 1e+006 Elapsed Time(ms) Elapsed Time(ms) 100000 10000 1000 100 10 XSEarch XRank MSLCA GDMCT CCTree G1 G2 G3 G4 G5 G6 Elapsed Time(ms) 10000 1000 100 10 1 XSEarch XRank MSLCA GDMCT MCCTree G1 G2 G3 G4 G5 G6 100000 10000 1000 100 10 1 G1 G2 XSEarch XRank MSLCA GDMCT RACE G3 G4 G5 G6 1 (a) Conjunctive Semantics (b) Disjunctive Semantics (c) Disjunctive Semantics+Rank Figure 2: Efficiency of various algorithms 100 100 Precision 80 70 60 50 40 G1 G2 G3 G4 G5 G6 Top-k Relevancy (%) 90 90 80 70 60 50 40 1 5 10 50 100 XSEarch XRank MSLCA GDMCT RACE Precision Top-k Definition 2.2. (MCLCA and MCCTree) Given a keyword query K={k1 ,k2 ,· · · ,km } and Ki ={ki1 ,ki2 ,· · · ,kiq }K. Suppose w= CLCA(vi1 ,vi2 ,· · · ,viq ), where vi1 Ii1 ,· · · ,viq Iiq . w is a Maximal CLCA (MCLCA), if k (K-Ki ), vk Ik , w , which dominates both vk and every vij for 1j q . The CCTree rooted at an MCLCA is cal led an MCCTree. To effectively answer keyword search, we prop ose the concepts of Maximal CLCA and Maximal CCTree. An MCLCA is also a CLCA, which has no ancestors that still dominate some other content nodes b esides the content nodes dominated by the MCLCA. Therefore, an MCLCA dominates a maximal set of content nodes and is more meaningful than a CLCA. An MCCTree is the CCTree rooted at an MCLCA and contains more keywords than CCTrees. For example, in Figure 1, the four circled trees are MCCTrees. Figure 3: Top-k answer relevancy [7]. We selected six groups of queries, G1 ,· · · ,G6 . Each group has ten queries and the queries in the same group have the same numb er of keywords. For example, each query in G3 has 3 keywords. Figure 2 illustrates the exp erimental results on search efficiency and Figure 3 gives the exp erimental results on search quality. 5. CONCLUSION In this pap er, we have investigated the problem of keyword proximity search over XML documents. We prop osed the notions of CLCA and MCLCA to capture the focuses of keyword queries and adopted CCTrees and MCCTrees to effectively and efficiently answer keyword proximity queries. We demonstrated a novel ranking mechanism, RACE, to Rank the compAct Connected trEes, by taking into account b oth structural similarity from the DB viewp oint and textual similarity from the IR p oint of view. The exp erimental results show that our approach achieves high search efficiency and quality, and outp erforms existing methods significantly. 3. RACE T F ·I DF based methods for ranking relevant documents have b een proved to b e effective for keyword proximity search in text documents. However, traditional ranking techniques in IR literature may not b e effective to rank MCCTrees, as b esides the term frequency (tf ) and inverse document frequency (idf ), MCCTrees also contain rather rich structural information. We take into account b oth the structural similarity and traditional IR metrics to rank MCCTrees. There are three parameters - the numb er of content nodes in T , nc , the numb er of distinct input keywords contained in T , nk , and the numb er of all nodes in T , ns , which will affect the score assigned to an MCCTree, and we will employ these three parameters to rank MCCTrees. Intuitively, the larger nc , the higher the score of the MCCTree should b e; on the other hand, the larger nk , the more likely the MCCTree is relevant to K. On the contrary, ns should b e inverse with the score of the MCCTree. In addition, the succinctness of the MCCTree should b e reflected in the structural similarity function, and the more succinct of the MCCTree, the higher score of the structural similarity should b e. Based on ab ove observations, we can compute the structural similarity. Accordingly, we combine the textual similarity and structural similarity to effectively rank the MCCTrees. 6. ACKNOWLEDGEMENT This work is partly supp orted by the National Natural Science Foundation of China under Grant No.60573094, the National High Technology Development 863 Program of China under Grant No.2007AA01Z152 and 2006AA01A101, the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103. 7. REFERENCES [1] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. Xsearch: A semantic search engine for xml. In VLDB, 2003. [2] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. Xrank: Ranked keyword search over xml documents. In SIGMOD, 2003. [3] V. Hristidis, N. Koudas, Y. Papakonstantinou, and D. Srivastava. Keyword proximity search in xml trees. In IEEE TKDE 18(4), 2006. [4] G. Li, J. Feng, J. Wang, and L. Zhou. Efficient keyword search for valuable lcas over xml documents. In CIKM, 2007. [5] G. Li, J. Feng, J. Wang, and L. Zhou. SAILER: An Effective Search Engine for Unified Retrieval of Heterogeneous XML and Web Documents. In WWW, 2008. [6] G. Li, B.C. Ooi, J. Feng, J. Wang, and L. Zhou. EASE: Efficient and Adaptive Keyword Search on Unstructured, Semi-structured and Structured Data. In SIGMOD, 2008. [7] C. Sun, C. Y. Chan, and A. K. Goenka. Multiway slca-based keyword search in xml data. In WWW, 2007. 4. EXPERIMENTAL STUDY We have conducted a set of exp eriments to evaluate the p erformance of our approach. We used real dataset DBLP in our exp eriments. The raw file was ab out 420MB. The exp eriments were conducted on an Intel(R) Pentium(R) 2.4GHz computer with 1GB of RAM. The algorithms were implemented in Java. We compared RACE with state-of-the-art methods, XSEarch[1], XRank[2], GDMCT[3] and MSLCA 1046