WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Dissemination of Heterogeneous XML Data Yuan Ni Chee-Yong Chan National University of Singapore IBM China Research Lab niyuan@cn.ibm.com chancy@comp.nus.edu.sg ABSTRACT A lot of recent research has focused on the content-based dissemination of XML data. However, due to the heterogeneous data schemas used by different data publishers even for data in the same domain, an imp ortant challenge is how to efficiently and effectively disseminate relevant data to subscrib ers whose subscriptions might b e sp ecified based on schemas that are different from those used by the data publishers. This pap er examines the options to resolve this schema heterogeneity problem in XML data dissemination, and prop oses a novel paradigm that is based on data rewriting. Our exp erimental results demonstrate the effectiveness of the data rewriting paradigm and identifies the tradeoffs of the various approaches. Categories and Subject Descriptors H.4.m [Information Systems]: Systems-Query processing General Terms Algorithms, Design, Performance Keywords data rewriting, dissemination, heterogeneous, XML 1. INTRODUCTION The ubiquity of XML data and the effectiveness of the content-based pub/sub-based paradigm of delivering relevant information has led to a lot of interest in content-based dissemination of XML data (e.g., [1]). Existing work on XML data dissemination, however, are all implicitly based on a homogeneous schema assumption where b oth the data published by different publishers as well as the users' subscriptions share the same schema. However, since the data publishers in a pub/sub system are autonomous and indep endent, they generally do not use the same schemas even when their published data are related and b elong to the same domain (e.g., product catalogues). Consequently, if a user's subscription is based on the schema of a sp ecific publisher (say P ), then while the user can receive relevant documents The work was done while the author was at the National University of Singap ore Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. from P that match his subscription, it is very likely that his subscription will not match relevant data from another publisher P if the data schemas used by P and P are different. Thus, the effectiveness of the pub/sub systems in pushing relevant data to consumers b ecomes diminished in the presence of heterogeneous data schemas. In this pap er, we address the problem of how to improve the effectiveness of XML data dissemination in the presence of heterogenous data schemas. Our problem, referred to as heterogeneous data dissemination problem, can b e stated as follows. We consider a pub/sub system where data published by different publishers are based on different schemas. The problem is how to effectively disseminate a document (based on some publisher's schema S ) to relevant subscrib ers whose subscriptions might b e based on schemas different from S . For simplicity and without loss of generality, we assume that all the published data are of the same domain such that it is p ossible to use a single global schema to resolve the structural conflicts among different publishers' schemas (of the same domain). Our problem and prop osed techniques can b e easily extended to the general case by first partitioning the collection of publishers' schemas into groups of schemas with similar domains, and then generating a global schema for each group of related schemas. The well-known data integration problem [3] handles the problem ab out how to query multiple data sources with different schemas by adopting a query rewritten based approach (QRA). However, a fundamental difference exists b etween the data integration problem and our heterogeneous data dissemination problem, which is that the integration problem b elongs to a single-query-multiple-data scenario while the dissemination problem b elongs to a singledata-multiple-queries scenario. To adopt the QRA in the heterogeneous data dissemination problem would incur a scalability problem as each input subscription needs to b e rewritten into one subscription for each local schema. This increases the space overhead for storing and indexing the expanded set of local subscriptions at each router since the numb er of subscriptions on each router is large. In this pap er, we present a novel paradigm to solve the heterogeneous data dissemination problem that is based on the principle of data rewriting. We refer to our new approach as DRA for data rewriting approach. The conceptual idea of DRA is as follows. Firstly the collection of local schemas from the publishers is integrated to form a global schema Sg which is then made available to users to sp ecify their subscriptions. Unlike QRA, our DRA does not require query rewriting which means that only the input global subscrip- 1059 WWW 2008 / Poster Paper are indexed at each router. For each incoming data D conforming to some local schema S ) to a router, our DRA rewrites D to Dg (Dg may not b e materialized here) such that the evaluation of subscriptions is actually conducted against Dg . In contrast to QRA, our prop osed DRA is more effective for the heterogeneous data dissemination problem b ecause pub/sub systems are typically characterized by two prop erties : (1) the numb er of subscriptions at each router is large (which limits the scalability of QRA); and (2) the data b eing disseminated is relatively small (which incurs only a small processing overhead for data rewriting). tions ( April 21-25, 2008 · Beijing, China D dynamical ly. We refer to this approach as dynamic data rewriting (DDR) approach. Note that DDR does not modify D and also does not physically materialize Dg . Instead, the rewriting of D to Dg is p erformed dynamically as D i s b eing parsed. Sp ecifically, the parsed events from D are used to generate parsed events corresp onding to Dg which are matched against the subscriptions, and D is then forwarded to any matching routers/subscrib ers. We have prop osed two dynamic data rewriting approaches based on where the data rewriting is p erformed. NDDR. The first option is to p erform the rewriting outside of the matching engine by installing a new software comp onent, called the data rewriter, b etween the document parser and matching engine. The data rewriter essentially rewrites D to Dg by intercepting the sequence of events E that is generated by the event-based XML parser (as it parses the input document D ) and generating a modified sequence of events Eg to the matching engine such that Eg is equivalent to the sequence of events generated by parsing Dg . We refer to this approach as non-intrusive dynamic data rewriting (NDDR) approach since it does not require making any changes to the existing XML parser and matching engine comp onents. The implementation of NDDR requires additional memory to cache some parsed events. IDDR. The second option is to rewrite the data within the matching engine itself. We refer to this approach as intrusive dynamic data rewriting (IDDR) approach as it entails making modifications to the matching engine. The IDDR approach avoids the using of additional memory, however it makes the matching more complicated. 2. DATA REWRITING FRAMEWORK This section presents our framework to solve the heterogeneous data dissemination problem by using data rewriting. 2.1 System Architecture We use S to denote some publisher's local schema, and Sg to denote a global schema integrated from a collection of local schemas of the same domain. We use D (resp., Dg ) to denote a document conforming to schema S (resp. Sg ). Similar to existing pub/sub systems, we have a mediator agent (MA) that serves as a coordinator b etween the data publishers and routers [2]. Besides collecting schemas from publishers and registering queries for users, the MA is also resp onsible for resolving the structural conflicts among various schemas to generate a global schema. The MA creates a schema mapping for each local schema S that is integrated to a global schema Sg . The schema mapping is essentially a data transformation sp ecification that enables an input document D to b e mapp ed into an output document Dg that preserves the appropriate information content of D . 3. DISCUSSIONS Based on our exp erimental results, we have the following observations on the efficiency of various approaches. First, SDR does not p erform well due to the transmission of additional data, esp ecially when the bandwidth is small or the numb er of hops to subscrib ers is large. IDDR does not scale well as the numb er of subscriptions or the size of documents increases due to its more complicated matching engine. Finally, our exp eriments show that NDDR overall achieves the b est p erformance. Moreover, the memory space that NDDR incurs for the dynamic data rewriting is small: among the set of documents we exp erimented, the maximum memory overhead of NDDR is ab out 32% of the document size, and only three of the documents actually require memory space overhead of over 20% of the document size. For the ma jority of the documents, the memory space overhead is only around 5% of the document size. Since the size of the documents in data dissemination is usually small, the memory space overhead of NDDR is small which makes NDDR an attractive approach. 2.2 Data Rewriting Approaches 2.2.1 Static Data Rewriting (SDR) In the static data rewriting (SDR) approach, each published data D is rewritten to Dg statically (but only once) by the mediator agent (denoted as MA). The advantage of employing the MA to rewrite the data is that the publishers are shielded from the details of the schema mappings and rewriting processing; this requires each publisher to first forward D to the MA for the rewriting b efore the MA forwards the transformed data to the routers for dissemination. Once D has b een rewritten to Dg , b oth D and Dg are forwarded together to the network of routers for dissemination. Since the subscriptions stored in each router are based on the global schema Sg , Dg is used for matching against the subscriptions to detect matching subscriptions and decide to which router(s) the data needs to b e forwarded next; D (p ossibly with an attached digital signature for verification of data integrity) is forwarded to any matching local subscrib ers at a router. One advantage of SDR is that it is a non-intrusive approach that can b e easily implemented. However, the tradeoff is that the amount of data that is b eing forwarded is roughly doubled compared to the conventional approach. 4. REFERENCES [1] C.-Y. Chan, P. Felb er, M. Garofalakis, and R. Rastogi. Efficient filtering of XML documents with XPath expressions. VLDB, 11(4), 2002. [2] Y. Diao, S. Rizvi, and M. J. Franklin. Towards an internet-scale XML dissemination service. In VLDB, 2004. [3] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries over heterogeneouse data sources. In VLDB, 2001. 2.2.2 Dynamic Data Rewriting (DDR) To avoid the transmission overhead of SDR, an alternative strategy is for each router to forward only D but the tradeoff is that each router now needs to rewrite the data 1060