WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II April 21-25, 2008 · Beijing, China Scaling RDF with Time University of Calabria Rende, Italy Andrea Pugliese apugliese@deis.unical.it udrea@umiacs.umd.edu University Of Maryland College Park, MD 20742 Octavian Udrea vs@umiacs.umd.edu University Of Maryland College Park, MD 20742 V.S. Subrahmanian ABSTRACT The World Wide Web Consortium's RDF standard primarily consists of (sub ject,prop erty,ob ject) triples that sp ecify the value that a given sub ject has for a given prop erty. However, it is frequently the case that even for a fixed sub ject and prop erty, the value varies with time. As a consequence, efforts have b een made to annotate RDF triples with "valid time" intervals. However, to date, no prop osals exist for efficient indexing of such temp oral RDF databases. It is clearly b eneficial to store RDF data in a relational DB ­ however, standard relational indexes are inadequately equipp ed to handle RDF's graph structure. In this pap er, we prop ose the tGRIN index structure that builds a sp ecialized index for temp oral RDF that is physically stored in an RDBMS. Past efforts to store RDF in relational stores include Jena2 from HP, Sesame from Op enRDF.org, and 3store from the University of Southampton. We show that even when these efforts are augmented with well known temp oral indexes like R+ trees, SR-trees, ST-index, and MAP21, the tGRIN index exhibits sup erior p erformance. In terms of index build time, tGRIN takes two thirds or less of the time used by any other system, and it uses a comparable amount of memory and less disk space than Jena, Sesame and 3store. More imp ortantly, tGRIN can answer queries three to six times faster for average query graph patterns and five to ten times faster for complex queries than these systems. Categories and Subject Descriptors I.2.4 [Computing Methodologies]: Artificial Intelligence--know ledge representation formalisms and methods ; E.2 [Data]: Data Storage Representations General Terms Algorithms, Performance Keywords Resource Description Framework, temp oral RDF, RDF indexing 1. INTRODUCTION RDF ("Resource Description Framework") is a growing semantic web standard from the World Wide Web Consortium Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. that has supp ort from many companies. Large databases of RDF data that we are aware of include a 20.5 million triple database on activities by the US congress1 and a 5 million plus database of triples ab out violent events. RDF primarily sp ecifies the creation of triples (s, p, o) where s is a sub ject, p is a prop erty, and o is an ob ject (or a value) that the prop erty p has for a given sub ject. Some triples have no temp oral extent (e.g., Mary is always the mother of John). However, some triples can have a temp oral extent attached to them. For instance, the USA (subject) may have Bill Clinton (ob ject) as the value of its President prop erty throughout the entire time frame 1992-2000. We call triples that are valid continuously during a time interval determinate triples. Likewise, some triples may only b e known to occur at certain time p oints during an interval, rather than b e true continuously throughout the interval. For example, company A (sub ject) may have a supplier (prop erty) relationship with company B (ob ject) at some p oints of time during the time interval 10 - 20. We call such triples indeterminate triples. Past work on temp oral RDF is relatively sparse. In early related work, Buraga et al. [3] present an RDF-based model for representing spatio-temp oral relations b etween websites. They define the TRSL language that uses XML to express a numb er of different op erators for the Interval Temp oral Logic. Guti´rrez et al. [5, 6] are the first to provide a e syntax and semantics for temp oral RDF with determinate triples, as well as a query language; however, they do not provide query processing algorithms or an evaluation of the language. Note that all past work in temp oral RDF has the following limitations. (i) No work on indexing temp oral RDF exists to date. (ii) No work on temp oral RDF to date handles indeterminate triples. (iii) Though some algorithms have b een given for query processing in temp oral RDF [5, 6], they are all main memory based and do not include any implementation or exp erimental details. Past work on RDF indexing is also relatively sparse. Liu and Hu [10] and Matono et al. [12] prop ose two different indexing schemes for RDF path queries on cycle-free RDF databases. Since modern RDF query languages such as SPARQL rely on graph pattern queries, we focus on SPARQL-like queries in this pap er. While we are not aware of any RDF index designed esp ecially for such queries, most RDF storage systems to date such as Jena, Sesame, RDFBroker, 3store make use of a relational database backend to store RDF and employ relational indexes to sp eed up query processing. For instance, Jena2 uses a normalized 1 http://www.govtrack.org. 605 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II triple store approach, in which triples are stored in a statement table with columns for the triple's sub ject, prop erty and ob ject. Lexical-order relational indexes can b e defined on any of the columns (or combinations of them) to make data access more efficient. Sesame can also use a relational back-end to store triples, but they use native indexing in the form of a B-tree on the (sub ject, prop erty, ob ject) of a triple ­ the order of these can b e changed by the user. The system also allows additional indexes to b e defined on parts of the triple. The common issue with all of the ab ove approaches is that the translation of RDF graph queries into SQL tends to produce queries too complex for the relational indexing techniques employed. On the other hand, tGRIN is esp ecially designed to reduce the complexity of graph pattern queries. Graph indexing also relates to RDF, since the latter can naturally b e represented as a graph in which RDF resources are vertices and RDF triples are edges. Graph indexing techniques generally fall into the following two categories. Graph database indexing ­ an example of which is the work by Yan et al. [16] ­, is concerned with efficiently retrieving sup ergraphs of a given query from a set of graphs (the database). Indexing large graphs for reachability and navigation is relevant to web and telecommunication traffic analysis. Ab ello and Kotidis [2] prop ose efficient storage methods for large graph navigation based on hierarchical decomp ositions of the edge set. Trißl and Leser [14] provide fast indexing methods for large graphs in the context of reachability queries. None of these works on RDF and graph indexing addresses the problem of answering general graph pattern queries. In [15] we prop osed the first RDF indexing technique sp ecifically targeted to this kind of queries. In this pap er, we start by extending the past work of [5, 6] to the case of indeterminate triples. This is not claimed as a particularly novel contribution. We then prop ose the novel concept of a normalized tRDF database and provide a normalization algorithm. We present a formal definition of tRDF queries and answers. We then present the tGRIN2 index structure suitable for graph pattern queries. The tGRIN index structure is a tree whose root implicitly represents the entire space of vertices in a tRDF database. As in the case of RDF, every tRDF database can b e thought of as a graph -- Figure 1 shows an example. Each node in a tGRIN index implicitly represents a set of vertices in this graph. Each node m has a "center" vertex Cm and a "radius" Rm lab eling it. Node m implicitly represents all vertices in the tRDF graph that are within distance Rm of the vertex Cm . The RDF indexing technique prop osed in this pap er differs from our work in [15] in several critical ways: (i) the presence of temp oral annotations on the triples changes the semantics of queries and distance measures in the tRDF graph; (ii) the index structure is a n-ary tree instead of a binary tree, thus affecting the way the index is built and (iii) the tGRIN index is implemented on disk, whereas our previous GRIN index was in-memory. We have also built the first exp erimental temp oral RDF prototyp e DBMS that we are aware of. We compared our tGRIN based implementation with three commercial (nontemp oral) RDF datastores. Jena2 from HP is one of the b est known industry systems. Sesame from Op enRDF.org and 3Store from the University of Southampton are also well known RDF DBMSs. All these RDF stores, including 2 April 21-25, 2008 · Beijing, China ours, use a standard relational DBMS for storage so that years of advances in concurrency control, crash and error recovery, etc. can b e easily leveraged. We extended Jena2, Sesame, and 3Store with some of the b est known methods to index temp oral data. In particular, we extended them with R+ -trees, SR-trees, ST-index and MAP-21 ­ compared in detail in the survey by Salzb erg and Tsotras [13]. For each of Jena2, Sesame and 3Store, we chose the index that worked b est (of these 4 indexes) and compared the results with tGRIN. We found that tGRIN took only 66% of the time required by them (or less) to build the index and uses a comparable amount of memory to store the index. Most imp ortantly, however, tGRIN can answer queries three to six times faster for relatively simple query patterns, and six to ten times faster for complex queries. 2. TEMPORAL RDF: SYNTAX & SEMANTICS OVERVIEW In this section, we overview the syntax and semantics of determinate temp oral RDF provided by Guti´rrez et al. [5, e 6]. Moreover, we provide a straightforward extension to the case of indeterminate triples. A Temp oral RDF (or tRDF for short) database consists of a set of temp orally annotated RDF triples3 of the form (subject, property: annotation, object) where: · The subject is an entity denoted by an URI reference from a set U . · The property is an entity denoted by an URI reference from a set P . · The object is either an entity from U or a constant from the set of literals L. URI references and literals form the set of resources R = U L. The structure of the "annotation" will b e describ ed shortly. In addition to these, a tRDF database may also contain triples of the form (p1 , r df s : subP r oper ty Of , p2 ) which denote the fact that for any triple with prop erty p1 we can infer an identical triple with prop erty p2 .4 Throughout this pap er, we use Tp to denote the set of all time p oints and T to denote the set of all p ossible time intervals. The annotation of a tRDF triple can take one of the following forms (n is a natural numb er and T T ): 1. (s, p : {T }, v ). This typ e of triple represents a relationship p b etween s and v that holds at every time p oint in T (e.g., "Senate Joint Resolution 37 (sj37) was under discussed in the Senate Finance Committee throughout May 2002)". 2. (s, p : n : T , v ). This triple represents a relationship p b etween s and v that holds at least at n distinct time p oints in T (e.g., "The p olitician with identifier p eople/B000711 campaigned for at least 8 months in 2004"). 3. (s, p : [n : T ], v ). This triple represents a relationship p b etween s and v that holds at most at n distinct Although technically the temp oral annotation makes these quadruples, the term "RDF triple" is so wide-spread in the literature that we will continue using triple. 4 Due to space restrictions, we do not discuss certain features of RDF (such as blank nodes, reification or collections) and RDFS. 3 tGRIN stands for temp oral Graph-based RDF INdex. 606 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II State and Local Government subject inCommittee:{1/1999, 2 /1999} congress/committees/ SenateEnvironmentand PublicWorks congress/ 106/bills/ s1990 congress/ committees/ SenateFinance inCommittee: s u b j e c t {05/2002} congress/ 107/bills/ sj37 supported:{05/2002} State and Local Government campaign/ 2004/ S2CA00286 ?v3 subject April 21-25, 2008 · Beijing, China congress/committees/ SenateCommerceScienceAndTransportation member {1996-?v2} supported: <1, 2000-2005> supported <1: 1998-2005> ?v1 role {1996-2002} congress/ senate/ca congress/ 107/bills/ sj37 inCommittee: <1, 2000-2005> congress/ committees/ SenateFinance member:{1995-2006} s p o n s o r : { 0 1 / 1 9 9 9 } chairperson:{2007} congress/committees/ people/ SenateCommerceSciB000711 enceAndTransportation member:{1995-2007} role:{1987-1988} role:{1989-1990} congress/ house/100/ ca congress/ house/101/ ca (b) congress/ 106/bills/ s1990 supported {01/1999} campaign: <8: 2004> people/ B000711 role:{1993-1997} role: sponsor role:{1998-2010} {1993-2010} {01/1999} congress/ senate/ca campaign/ 2004/ S2CA00286 contributed: [2: 2004] contributed: [ 1 : 20 0 4 ] people/ B000711 campaign: {01/200410/2004} campaign/ 2004/ S2CA00286 campaign: <8: 2004> role:{1993-1997} role:{1998-2010} congress/ senate/ca office contributed:[2: 2004] B., C arol people/ B000711 B., C arol (supported, rdfs:subPropertyOf, sponsor} (chairperson, rdfs:subPropertyOf, member) (supported, rdfs:subPropertyOf, sponsor} (a) (c) (d) Figure 1: (a) Example tRDF database; (b) Example tRDF query; (c) Normalization ­ generation example; (d) Normalization ­ pruning example. time p oints in T (e.g., "Carol B. contributed to the campaign of p olitician p eople/B000711 at most two times in 2004"). Example 2.1. Figure 1(a) contains a graphical depiction of a tRDF database where (i) the time granularity is one month, (ii) triples without temporal annotations are considered true at al l time points, and (iii) triples annotated only with year-based intervals are assumed to start in January of the starting year and end in December of the ending year. The example is based on a subset of the GovTrack dataset5 . The ful l names for the identifiers (e.g., congress/106/bil ls/s1990) used in this example are available in the dataset. To define the semantics of tRDF, [5, 6] define an interpretation I as a function I : Tp U × P × R that associates a set of RDF triples with every timep oint. Clearly, not all interpretations make sense for a given tRDF database. For instance, an interpretation I for the database in Example 2.1 that has (people/B 000711, r ole, cong r ess/house/101/ca) I (1989) does not match the database. We therefore define the concept of a satisfying interpretation. Definition 2.1 (tRDF satisfaction). Let I be a tRDF interpretation. I satisfies a tRDF triple e (denoted by I |= e) under the fol lowing conditions: 1. I |= (s, p : {T }, v ) iff t T ,(s, p, v ) I (t). 2. I |= (s, p : n : T , v ) iff |{t T |(s, p, v ) I (t)}| n. 3. I |= (s, p : [n : T ], v ) iff |{t T |(s, p, v ) I (t)}| n. 4. I |= (sp, r df s : subP r oper ty Of , p) iff t Tp , (s , s p , v ) I (t ), (s , p , v ) I (t ). I satisfies a tRDF database D (denoted by I |= D) iff e D, I |= e. Database D is consistent iff I s.t. I |= D. 5 Publicly available at http://www.govtrack.us/source.xp d -- consists of over 20.5 million triples. Definition 2.2 (Entailment). A tRDF database D entails a tRDF triple e, denoted D |= e, iff for each tRDF interpretation I such that I |= D, it is also the case that I |= e. Example 2.2. The database in Figure 1(a) entails the triple (people/B 000711, suppor ted : {01/1999}, cong r ess/106/bills/s1990) since sponsor ed is a However, it does subproperty of suppor ted. not entail (people/B 0007111, r ole : {1985 - 1988}, cong r ess/house/100/ca) because we can construct a satisfying interpretation for D that does not satisfy this triple. 3. NORMALIZED TRDF DATABASES In this section, we prop ose the concept of a normalized tRDF database. In a normalized tRDF database D, any tRDF-triple t that is entailed by D must b e entailed by a single tRDF-triple in D. In general, tRDF databases may not b e normalized. For instance, consider the query: what was the role of people/B000711 between 1996 and 2003? on the database in Figure 1(a). To answer this query, we must analyze the subset of D containing the two triples on the property role b etween people/B000711 and congress/senate/ca. Even though there is a single continuous p eriod 1993­2010, it is represented in two different triples that b oth intersect the interval in the query ([1997, 2003]). In general, in the worst case we would need to look at all p ossible subsets of triples (an exp onential search space) even for the simplest queries. In this section, we show how to normalize a tRDF database -- later, in Section 6, we will show exp erimentally that normalization plays a big part in evaluating queries efficiently at the exp ense of a small increase in the storage space. Definition 3.1 (Normalization). Given a tRDF s database D, a normalization of D is a tRDF database D uch that D |= e iff e D such that {e } |= e. Note that a normalization of D can actually b e smaller in size than D. For example, if D = {(s, p : {10 - 607 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II 20}, o), (s, p : {20 - 30}, o)}, then these two triples say that the triple (s, p, o) is valid throughout b oth the intervals [10, 20] and [20, 30]: hence, they can b e merged into one triple (s, p : {10 - 30}, o). The normalization avoids the requirement that we compute all triples entailed by D as this can lead to an exp onential blow up. For instance, the triple (people/B 000711, r ole{1993 - 1997}, cong r ess/senate/ca) taken at a time granularity of one day, would entail 5 · 365 = 1865 triples, one for each day in the interval 1993­1997. However, under Definition 3.1 we only need space for one triple in D . We use two op erations to normalize a tRDF database. The first is the generation of new triples ­ for instance, by coalescing identical triples which are annotated with connecting time intervals. Proposition 3.1. Suppose s U , p, sp P , v R, n and m are natural numbers, and T , T T . The fol lowing relationships hold: 1. {(s, sp : {T }, v ), (sp, r df s : subP r oper ty Of , p)} |= (s, p : {T }, v ). 2. {(s, sp : n : T , v ), (sp, r df s : subP r oper ty Of , p)} |= (s, p : n : T , v ). 3. {(s, p : {T }, v ), (s, p : {T }, v )} |= (s, p : {T T }, v ). Figure 1(c) shows example applications of Cases 1 and 3 of Prop osition 3.1. Generated triples are depicted with dashed lines. The second op eration consists in pruning redundant triples. Proposition 3.2. Let s U , p P , v R be resources, let n and m be natural numbers, and let T , T T . The fol lowing relationships hold: 1. If T T , then {(s, p : {T }, v )} |= (s, p : {T }, v ). 2. If m n and T T , then {(s, p : n : T , v )} |= (s , p : m : T , v ). 3. If m n and T T , then {(s, p : [m : T ], v )} |= (s, p : [n : T ], v ). 4. If |T T | n, then {(s, p : {T }, v )} |= (s, p : n : T , v ). Figure 1(d) shows example applications of Cases 3 and 4 of Prop osition 3.2. The triples that can b e deleted are represented by dashed lines. Finally, Figure 2 shows the tc algorithm that normalizes a tRDF database D. In the algorithm, we denote by subpr op(D) the set of r df s : subP r oper ty Of triples in D. We assume there are no cycles in subpr op(D) (otherwise all prop erties involved in a cycle are equivalent) and subpr op(D) is sorted in top ological order. Furthermore, g r oup(D) is an ordered set containing the same triples as D and such that the triples having the same sub ject, property and value are consecutive. Proposition 3.3. Given a tRDF database D, the fol lowing statements hold: 1. tc(D) is a normalization of D. 2. Algorithm tc runs in time O(|D|2 · log |D|). 3. |tc(D)| is O(|D|2 ). April 21-25, 2008 · Beijing, China 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 2 23 4 25 26 27 28 29 Algorithm tc(D) Input: tRDF database D Output: Normalization of D RD for all e R \ subpr op(R) for all e R \ subpr op(R) \ {e} if {e} |= e according to Prop osition 3.2 remove e from R e n d if end for end for for all e R \ subpr op(R) for all e subpr op(R) if {e, e } |= e according to cases 1 or 2 of Prop. 3.1 add e to R e n d if end for end for B gr oup(R) RR\B r e p ea t pick the first e B for all e B \ {e} if {e, e } |= e according to case 3 of Prop. 3.1 remove e and e from B ee e n d if end for add e to R remove e from B until B = return R Figure 2: Normalization of a tRDF database. 3.1 tRDF consistency Unlike RDF, which is essentially free of inconsistencies with the exception of data typ e mismatches, under our tRDF model we could represent inconsistent databases. Proposition 3.4. A tRDF database D is inconsistent iff at least one of the fol lowing conditions holds: 1. D |= (s, p n : T , v ), D |= (s, p[m : T ], v ), T T , and n > m. 2. D |= (s, p{T }, v ), D |= (s, p[n : T ], v ), and |T T | > n. Figure 3 contains examples for the two cases in which inconsistencies can occur. campaign/ 2004/ S2CA00286 contributed:[ 2: 2004] contributed:< 3: 2004> campaign/ 2004/ S2CA00286 contributed:[ 2: 2004] contributed:{ 01/2004-05/2004} B., Carol B., Carol Figure 3: Examples of tRDF inconsistencies. The consistency of a tRDF database D can b e checked by first computing its normalization D (whose size is p olynomial in the size of D), then checking the conditions of Prop osition 3.4 on D . Proposition 3.5. The problem of checking the consistency of a tRDF database D has a worst-case time complexity of O(|D|4 ). 608 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II April 21-25, 2008 · Beijing, China 4. TRDF QUERIES 6 A tRDF query is essentially a conjunctive SPARQL query that is augmented with temp oral annotations on the edges (either variable or constant). Most RDF databases have interfaces that supp ort the SPARQL language, which allows us to b etter evaluate our implementation w.r.t. existing systems. Definition 4.1 (tRDF query). A tRDF query is a 5tuple (N , E , V , n , t ) where: · · · · · N is a set of vertices. V is a set of variables. E N × N × (V P ) is a set of edges. n : N R V is a vertex labeling function. t is an edge labeling function that associates with every edge in E , an expression of the form {T }, n : T o r [n : T ], where T is either a constant time interval or a variable and n is a natural number. A naive algorithm for answering tRDF query q on a database D can b e given as follows: 1. For each query atom qj q , compute the set j of substitutions where D entails qj j . 2. Consider all p ossible elements of 1 × · · · × n and select those elements (1 , . . . , n ) for which all substitutions i with i [1, n] are compatible (i.e., do not assign different values to the same variable). The clear disadvantage of this algorithm is that it has to compute a Cartesian product (essentially a join of n relations), which is prohibitively exp ensive for complex queries. In fact, we show exp erimentally in Section 6 that some of the leading RDF database systems we considered cannot handle queries with graph patterns that go b eyond 15 vertices, half of which are variables. Instead, let us look at Example 4.2 again. The entire GovTrack dataset this example is extracted from contains over 20 million triples, and yet the answer to our query can b e found in a very small p ortion of the entire database, which we have seen in practice to b e true of the large ma jority of queries. Therefore, a b etter strategy is to (i) identify the smallest p ortion of the database that is guaranteed to contain the answer and (ii) p erform subgraph matching on that p ortion. To accomplish this, we define the tGRIN index structure for temp oral RDF. We refer to each edge in the query graph pattern as a query atom. Example 4.1. Figure 1(b) shows a graphical depiction of a tRDF query. The query can be easily translated into a SPARQL graph pattern, if we removed the temporal annotation. To provide an answer to a tRDF query over a database D, we are looking for all p ossible substitutions for the query variables in V such as the query graph after the prop er substitutions is entailed by D. Definition 4.2 (tRDF query answer). The answer to a tRDF query q = (N , E , V , n , t ) w.r.t. a database D, denoted ansq (D), is a set of variable substitutions {1 , . . . , k }, with i : V R Tp such that the fol lowing conditions hold: 1. (Soundness). For al l i [1, k] and for al l query atoms qj q , D |= qj i , where qj i denotes the application of the substitution i to query atom qj . 2. (Completeness). For al l substitutions such that D |= qj for al l query atoms qj , there is a substitution j ansq (D) that is more general than .7 Note that the query op erations we sp ecified are akin to relational selection. We have not defined anything that is equivalent to pro jection over tRDF databases (i.e., we do not select a subset of variables we are interested in). Exp erimentally, we have determined that unlike the relational case, pro jection does not help much with the query running time (which is dominated by searching for subgraphs matching the query). Pro jection can b e therefore applied after finding ansq (D) in linear time in the size of the answer. Example 4.2. The query in Figure 1(b) has two possible answers in the dataset in Figure 1(a). In both cases, ?v 1 people/B 000711 and ?v 2 2007. The first answer has ?v 3 cong r ess/106/bills/s1990, whereas the second answer has ?v 3 cong r ess/107/bills/sj 37. http://www.w3.org/TR/rdf-sparql-query/ We assume the reader is familiar with the concepts of a substitution , an application of a substitution qj , and what it means for one substitution to b e more general than another[11]. 7 6 5. THE TGRIN INDEX STRUCTURE tGRIN is based on the idea that vertices that are "close" together in the tRDF graph are more likely to app ear together in a query answer, and therefore should b e stored on the same page (in the same index node). Unfortunately, in tRDF, "close together" can have two meanings: temp orally close together, or close together in terms of distance in the tRDF graph. For instance, cong r ess/house/100/ca and campaig n/2004/S 2C A00286 in our example are close together in our example tRDF graph, but are temp orally far apart.The tGRIN index structure must take b oth notions of closeness into account. We first define distance metrics on temp oral intervals alone, and then show how they can b e used to induce combined graphical-temp oral distance measures.8 5.1 Distance metrics Graph distance metric. We can use either the shortest or the longest path in the undirected RDF graph as the graph distance metric dG (·, ·). We observed empirically that the shortest path gives b etter p erformance for queries and therefore omit the longest path from the rest of the discussion. Temporal distance metric. The temp oral distance metrics combine the distance b etween consecutive temp oral intervals on a path b etween two resources. First, we need to 8 One may wonder whether it is p ossible to build an Rtree like structure by thinking of graph based features and temp oral features as two dimensions. Unfortunately, there seems to b e no ordering on the resources in a tRDF graph that correlates well with graph query patterns. We could for instance give a lexical ordering based on resource names, but we determined exp erimentally that such an ordering was very inefficient. However, since we can define meaningful distance measures in b oth these dimensions, "circles" are the ideal way to represent inner index nodes in tGRIN. 609 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II A B Q:{2, 4} G April 21-25, 2008 · Beijing, China ?v 1 B ALL ?p: {1,3} ( D,5) ( N,4) ?v 2 (B,2) ( F,4) ( M,1) (K,3) P: {1,3} Q: {1,3} Q: {1,3}P: {1,3}Q:{1, 5} P:{3, 5} D P:{1, 7} E Q:{2, 4} I P:{3, 5} L P:{4, 8} M P:{3, 5} F C P:{2, 8} Q:{5, 7} H (A,1) (B,1) ( C,1) (E,1) (I,1) (K,3) (L,4) ?v 3 (c) ?v 1 Q: {5,7} ?v2 P{4,6} P:{1, 9} Q:{2, 6} J P:{3, 7} Q:{2, 6} N K A B C D E F G H I J K L M N ?p: {3,5} E ?p: {4,6} Q: {2,6} P: {1,9} J ?v3 (a) (b ) (d) Figure 4: (a) Synthetic tRDF database; (b) Example tGRIN index; (c), (d) Example tRDF query patterns. characterize the distance, (Ti , Tj ), b etween two temp oral intervals Ti and Tj . We require that satisfies the following axioms (where T .s and T .e denote the start and end p oints of T , and we assume T .s T .e): 1. (Ti , Ti ) = 0. 2. (Ti , Tj ) = (Tj , Ti ). 3. (Ti , Tj ) (Ti , Tj ) if Ti Ti , Ti where T T iff T .e T .s. · Each leaf node contains a set N R of resources s.t. for all leaf nodes = , N N = , and L N = R. · Each non-leaf node t contains a pair (c, r ), with c R and r I . Intuitively, this is a very succinct represenN tation of the set of resources in the graph at distance at most r of the resource c according to the metric d. We write this set as Nt = {c R|d(c, c ) r }. · For any nodes x, y in the tree such that x is a parent of y , Nx Ny . One of the imp ortant parameters of the index determined by the physical storage format is the maximum numb er of children of each index node, denoted by M . Intuitively, the higher M is, the smaller the "circles" represented by index nodes, which means we have a higher probability of identifying smaller parts of the tRDF database to match against the query. However, large values of M also imply the index, and hence the search space is larger. We will evaluate the impact of M on the index p erformance in Section 6. Building the tGRIN index is a three-step process based on a modified Hierarchical Agglomerative Clustering [9] algorithm (HAC for short). In the first step, HAC starts by having each vertex of the tRDF database in a separate cluster. At each iteration, the standard version of the algorithm merges the two clusters with the smallest inter-cluster distance9 until all vertices are in the same cluster. We modified this algorithm such that at each step, it will merge the M clusters that are closest in terms of the inter-cluster distance. At the end of this first step, HAC will produce a dendrogram ­ a tree in which each leaf node is a vertex in the tRDF database and each inner node is a cluster. Furthermore, each node in the dendrogram has at most M children. The second step of the index building process consists of traversing the dendrogram tree and finding centroids for all the inner nodes in the tree. A centroid for a set of vertices S is the vertex c that has the minimum average distance to all the other vertices in S : av gy S (d(c, y )) = minz (av gy S (d(z , y ))). For each cluster, we also determine the maximum distance from c to any other node: r = maxxS (d(c, x)). At the end of this step, we will have a "circle" representation (c, r ) for any cluster in the tree. Note that this step inflates the dendrogram clusters to circles, which may cause overlap b etween sibling index nodes. The inter-cluster distance generally used is either the minimum or the maximum distance b etween a vertex in the first and second cluster resp ectively. 9 Tj , and Tj Tj , If Ti and Tj are temp oral intervals, any of the following are acceptable functions: 1. 2. 3. 4. (Ti , Tj ) = ¬ Ti .e-Ti .s - j 2 j ¬ (interval centers). 2 (Ti , Tj ) = |Ti .s - Tj .s| (start p oints). (Ti , Tj ) = |Ti .e - Tj .e| (end p oints). (Ti , Tj ) = |Ti .s - Tj .e| if Ti Tj , otherwise (Ti , Tj ) = |Tj .s - Ti .e| (leftmost and rightmost p oint). ¬ ¬ ¬ T .e-T .s ¬ Given a , we can then define a temp oral distance metric as follows. Definition 5.1 (Temporal distance metric). Let D be a tRDF database, x, y R, p = (e1 , . . . , en ) be a path between x and y in the undirected tRDF graph, and Tj be the time interval labeling the edge ej . If n = 1, then we define dp (x, y ) = 0. Otherwise, we define T È Final ly, the temporal d p (x , y ) = T j [2,n] (Tj , Tj -1 ). distance between x and y is the minimum over al l the possible paths dT (x, y ) = minp (dp (x, y )). T tGRIN distance metric. Since b oth dG (·, ·) and dT (·, ·) are metrics, we can use a norm function to produce a single metric d(·, ·). For tGRIN , we use the k-norm d(x, y ) = 1 [(dG (x, y ))k + (dT (x, y ))k ] k . Example 5.1. Consider the graph in Figure 4(a). Let be defined as the distance between interval center points, dG be the shortest path distance and k = 1. Clearly, dG (B , F ) = 3. There are two different paths between B and F : {B , D, E , F } and {B , C, H , F }. On the first path, dT (B , F ) = 2 and on the second dT (B , F ) = 3. We take the minimum and obtain d(B , F ) = 4. 5.2 Building the tGRIN index A tGRIN index is a (balanced) tree such that: 610 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II A final step (optional) consists of balancing the tree while maintaining the requirements of the tGRIN structure. Note that unless the numb er of vertices in the tRDF database is a p ower of M , we are unlikely to obtain a balanced tree from the HAC step. We found this step helpful in practice, although it is optional for the correctness of the algorithms. Example 5.2. The structure in Figure 4(b) is a tGRIN index for the database in Figure 4(a). Here, we used the interval center distance for , k = 1 and M = 2. The following theorem characterizes the worst-case time complexity of building the tGRIN index. Theorem 1. The tGRIN build algorithm runs in time O(|R|3 logm |R|), where |R| is the number of resources in the database. Proof. (Sketch ) The HAC algorithm runs in time O(|R|3 ) since: (i) computing the minimal distance b etween any two resources in the database is O(|R|3 ) and (ii) the numb er of op erations in computing the dendrogram is O(|R|), each op eration taking O(|R|2 ) to compute the intercluster distance. The tGRIN index has a height of at most log m |R|, therefore at most |R| · logm |R| nodes. Computing the center of each node is O(|R|2 ), therefore the overall complexity is O(|R|3 logm |R|). 1 2 3 4 5 6 7 8 9 10 11 April 21-25, 2008 · Beijing, China is a constant, if d(c, x) + l z , we are sure that v is inside the circle (c, r ). In order to find answers to a query q in the subgraph represented by the current tGRIN index node (c, r ), then all node variables v in the query must b e inside the circle. Therefore, rule (R2) states that we reject (c, r ) unless all variable nodes in the query are provably inside (c, r ). Algorithm query tGRIN(D, G, q, nI ) Input: tRDF database D, tGRIN index G and query q = (N , V , E , n , t ), tGRIN no de nI (initially the ro ot of the index). subgr aphM atch is a subgraph matching metho d that finds an isomorphism b etween the query graph q and a graph H and returns a set of substitutions for the variables in q. Output: The set of answers . 0 if nI is a leaf no de H the subgraph of D containing the resources in NnI return subgr aphM atch(q,H) else if nI is not rejected by checking rules (R1), (R2) against cons(q) nchildren(nI ) quer y tGRI N (D, G, q, n) if = H the subgraph of D containing the resources in NnI return subgr aphM atch(q,H) else return else return Figure 5: Answering queries over tGRIN. The query answering algorithm (Figure 5) uses the subgraph matching algorithm by Cordella et al. [4]. We chose this algorithm empirically b ecause it generally yielded the b est query answer times. If the invocation of query tGRIN is currently at a leaf node, we simply match the query graph with a subgraph of D containing the resources represented in nI (lines 2­4). Otherwise, if nI is a p otential candidate (line 5 checks (R1), (R2)), we attempt a recursive call on the children of nI (line 6). If one of the recursive calls returns a non-empty answer, we return it. Otherwise, we return the result of the subgraph matching on nI itself (line 8­9). The following theorem states the correctness of Algorithm query tGRIN and characterizes its worst-case time complexity. Theorem 2. The fol lowing statements hold: 1. Algorithm query tGRIN terminates and returns the correct answer. 2. The worst-case time complexity of Algorithm query tGRIN is O(|R|!). Proof. (Sketch ) We prove the statements in turn: 1. When invoking query tGRIN on the children of an inner node nI , if one of the recursive calls returns a nonempty answer, then there exists a descendant (c, r ) of nI that satisfies b oth rules (R1) and (R2), and all the answers to the query are guaranteed to b e inside the circle identified by (c, r ). Thus, the answer can b e found by just matching against (c, r ). Only if none of the descendants contains an answer we must look at nI itself. 2. The complexity of checking the constraints for rules (R1) and (R2) only dep ends on the size of the query q , since there can b e at most |q |2 constraints extracted from the query. The complexity of depth-first search is O(M logm |R| ), therefore the worst time complexity of locating the smallest circle containing the answer is O(M logm |R| · |q |2 ). The worst-case time complexity 5.3 Answering queries with tGRIN In this section, we show how to evaluate a tRDF query q = (N , V , E , n , t ) against the tGRIN structure. We start by showing how to derive a set of inequality constraints cons(q ) from the query. The constraints will b e evaluated against the nodes of the tGRIN index. This is done to identify the smallest subgraph that contains answers to q . For any path connecting a resource c and a variable v in the undirected graph (i.e. ignoring directionality of edges) corresp onding to q , we compute dq (c, v ) using the same method as the tGRIN distance metric d. We then add the constraint d(c, v ) dq (c, v ) to cons(q ). Note that we can only do this for paths with non-variable temp oral annotations to ensure the soundness of the constraints. Example 5.3. Consider the example query in Figure 4(c). The query leads to the fol lowing set of constraints: d(?v 1, B ) 2), d(?v 2, B ) 1, d(?v 3, B ) 1. For the query in Figure 4(d), we can deduce the fol lowing (not a complete list): d(?v 1, E ) 1, d(?v 2, J ) 1 and d(?v 3, J ) 3. In the next step, we use the constraints generated from the query to identify nodes in the tGRIN structure that could contain answers to the query. On any tGRIN node, we have the option of accepting the node (which means it may contain answers to the query) or rejecting the node (which means it is guaranteed not to contain answers to the query). Consider a tGRIN node corresp onding to the circle (c, r ). We will define two rules to decide whether (c, r ) should b e rejected. (R1). The first rule is straightforward: for any constant (resource) x in q , reject (c, r ) if d(c, x) > z . Intuitively, we are rejecting the circle represented by the tGRIN node if any constant factors in the query are outside it. (R2). Let us consider the case of a constraint d(x, v ) l involving variable v and constant resource x. Since d is a metric, d(c, v ) d(c, x) + d(x, v ) d(c, x) + l. Since d(c, x) 611 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II of the subg r aphM atch algorithm is O(N !), where N is the total numb er of vertices in the graphs to b e matched [4]. This op eration thus dominates the index traversal, since it can have a worst-time complexity of O(|R|!). April 21-25, 2008 · Beijing, China As the exp erimental results will show, the worst-case time complexity stated ab ove is a rather conservative measure, as the tGRIN index is able to identify very small subgraphs containing the answer to the query. Example 5.4. We wil l now show how query tGRIN works on the synthetic example database and queries from Figure 4. For the query in Figure 4(c), we start at the root node and recursively cal l q uer y tGRI N on the child index nodes until we reach (B , 2). From cons(q ) we already know that d(?v 1, B ) 2), d(?v 2, B ) 1 and d(?v 3, B ) 1, hence al l variables are in the circle centered in B of radius 2. By running the subgraph matching algorithm on this portion of the graph (which contains vertices {A, B , C, D, G}, we obtain the answer to the query: ?v 1 A, ?v 2 D, ?v 3 C and ?p P . The processing of the query in Figure 4(d) is similar ­ once we recursively reach the analysis for the node (F, 4), we see that none of the children satisfies the constraints for al l the variables. From d(?v 1, E ) 1 and d(E , F ) = 1 we have d(?v 1, F ) 2, from d(?v 2, J ) 1 and d(F, J ) = 1 we deduce that d(?v 2, F ) 2 and from d(?v 3, J ) 3 and d(J, F ) = 1 we obtain that d(?v 3, F ) 4. Since ?v 1, ?v 2, ?v 3 are al l in (F, 4), we can apply the subgraph matching step of the algorithm on (F, 4) and obtain ?v 1 F , ?v 2 H , ?v 3 K and ?p P . We p erformed the evaluation on two datasets: the GovTrack dataset consists over 20.5M triples of public record information ab out the US Congress, including campaigns and contributions, votes, the actions taken on each bill, etc. We manually created 20 query graph patterns of increasing sizes (from 5 to 35 nodes). For each pattern, we varied the ratio of variables to constants in the queries from 0.2 to 1.5. We also randomly generated synthetic datasets ranging in size from 2 to 26 million triples in increments of 3 million. The triples were generated using a uniform random distribution. The temp oral intervals were randomly generated as follows: (i) first, a center for the interval was chosen uniformly at random in the range 1 ­ 1000; (ii) then, we generated the size of the interval from a Gaussian distribution. We varied the mean and the standard deviation b etween different versions for the same dataset size. We selected the same numb er of query patterns as for the GovTrack dataset with the same characteristics. The tRDF data was stored in a single relation (sub ject, prop erty, ob ject, annotation) in a PostgreSQL 8.0 database and the tGRIN index was stored separately on disk. It should b e noted that tGRIN is indep endent of the relational storage model, thus more efficient storage models such as the one in [1] can b e easily coupled with the index structure. Exp eriments were p erformed on a Pentium IV 3 GHz machine with 2 GB of RAM running op enSuse 10.2. The results rep orted include I/O access times are averaged over five indep endent runs. In our exp erimental evaluation, we first looked at determining the optimal tGRIN parameters, such as k, and M and their correlation to the dataset characteristics. To measure index p erformance, we looked at index building time, memory consumption, disk space and query running time. Third, we compared the index p erformance indicators with Jena2, Sesame and 3store. 6. EXPERIMENTAL EVALUATION tGRIN was implemented in approximately 3650 lines of Java code and evaluated against three well known RDF DBMSs: Jena2, Sesame and 3store. All RDF dDBMSs to date store RDF in a relational DBMS and rely on the translation of RDF queries into SQL queries. However, the underlying relational schemas vary in order to optimize sp ecific characteristics of RDF queries, p ossibly in the presence of a given RDF schema. We chose these three systems since their results are the b est of their resp ective schema typ e groups10 . Our systems app ears to b e the first implementation of Temp oral RDF -- none of these three systems is tailored for temp oral queries. However, all three systems allow various user-defined indexing schemes. We used PostgreSQL 8.0 as the underlying DBMS and wrote approximately 500 lines of Java code to allow these systems to use existing temp oral indexes such as R+ -trees, SR-trees, the ST-index and MAP21. We chose these as the most promising representatives of the different classes of "valid-time" temp oral indexing methods describ ed in the survey by Salzb erg and Tsotras [13] - we also tested the case where temp oral annotations of RDF triples were expressed using reification. We denoted each variant by the corresp onding index. For instance, Sesame­R+ -tree stands for the Sesame system using R+ -trees. 10 6.1 tGRIN parameters and performance First, we evaluated the tGRIN version with and without normalization on b oth the GovTrack and synthetic datasets. The memory consumption and disk space for the normalized version were approximately 10% higher for the GovTrack dataset and 13.5% higher on average for the synthetic datasets. On average, answering a query was approximately 19.2% faster for the normalized version. The difference in running time b etween the normalized and default version increases from .2% for 5 node queries graphs with a .2 ratio of variables to constants to 39.5% for 35 node queries with a 1.5 ratio of variables to constants. The database was normalized in 78.3 seconds for the GovTrack dataset. In addition, we measured that computing the normalized database incrementally after insertions takes 5 ms p er triple on average. For the rest of the exp erimental results, we used only normalized version of all the databases. Second, we looked at the other tGRIN parameters ­ , k and N . We found no statistically significant difference (all statistical significance tests were at the 1% level) b etween any of the variations for either the GovTrack dataset of the synthetic datasets, even when we generated temp oral interval sizes uniformly and not from a Gaussian. We continued our exp erimental evaluation by using the distance b etween interval centers as our function. We also varied k b etween 1 and 5 for a fixed value of M = 5 (we obtained similar dep endencies for higher values of M ). The results for the GovTrack dataset are summarized For more details ab out the underlying relational schemas, see http://jena.sourceforge.net/DB/layout.html for Jena2, http://www.op enrdf.org/doc/sesame/users/ch04.html for Sesame and [7] for 3store. 612 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II Table 1: Index performance for values of k. k 1 2 3 4 5 Build 154.51s 144.23s 157.47s 149.61s 161.55s Mem. 221MB 241MB 215MB 216MB 220MB Disk 1.57MB 1.69MB 1.62MB 1.64MB 1.66MB Query 26.41s 14.65s 17.03s 28.71s 35.34s Overl. 29.71% 13.64% 14.58% 22.51% 34.65% Cover. 83.42% 91.65% 89.92% 81.64% 80.16% April 21-25, 2008 · Beijing, China Table 2: Index performance for values of M . M 2 4 6 8 10 Build 121.51s 134.26s 159.22s 198.61s 245.55s Mem. 194MB 225MB 247MB 269MB 298MB Disk 1.45MB 1.62MB 1.75MB 1.91MB 2.25MB Query 31.71s 14.76s 13.78s 28.53s 37.94s Overl. 17.43% 14.76% 12.31% 11.91% 11.99% Cover. 95.67% 93.47% 88.99% 89.54% 87.13% in Table 1. The query times are for 15-node query patterns with a ratio of variable to constants of .5. Note that we only measure disk space for the tGRIN index nodes and not the leaf nodes which contain the actual data. The index build time, memory and disk space does not vary greatly with k, but the query time is clearly much smaller for k = 2 and k = 3. In fact, we found that these values of k also produce small overlap and large coverage values. In our case, we measure overlap as the p ercentage of tRDF resources that are covered by more than one index node at mid-level in the tGRIN tree. The coverage in this case is the p ercentage of triples that are completely covered by any index node at mid-level in the tGRIN tree (rememb er that tGRIN index nodes are designed to cover resources, not triples ­ we can have triples with one end in one index node and the other end in a different index node). We obtained similar results for the synthetic datasets, the only difference b eing that k = 3 did slightly b etter (ab out a 1% improvement) in terms of query time than k = 2. We computed the same indicators for different query pattern sizes and variable to instance ratios and obtained the same results ­ k = 2 and k = 3 provide the b est query running times b ecause they produce high coverage and low overlap. Finally, we looked at the maximum numb er, M , of children p er index node when k = 2. As M increases, overlap and coverage b oth decrease. We exp ected that the query running times would b e large for very low values of M as having larger index nodes means that the probability of doing subgraph matching on the smallest p ossible subset of the database decreases. On the other hand, high values of M means an increase in the depth-first traversal time. The data for the GovTrack dataset shown in Table 2 confirms this hyp othesis. We used the same query pattern as in Table 1. Note that the index build time, memory and disk space correlate p ositively with M (if we take into account all values M from 2 to 10, the correlation factors are .973, .994 and .979 resp ectively). We chose M = 5 as the b est compromise; note that M = 4 to M = 6 yields the lowest query running times, whereas the index build time starts a sharp er increase for M > 6 ­ we determined that this is primarily due to the clustering algorithm. The rest of the exp erimental evaluation was p erformed with k = 2 and M = 5. system and each indicator, we chose the index that p erformed the b est w.r.t. that indicator and compared it against tGRIN. For instance, Jena­R+ -tree did b est in terms of disk space on the synthetic dataset, but Jena­reified did b etter in terms of memory usage. The results are shown in Figure 6. First, we observed that the MAP21 index had relatively bad p erformance on all three system. On average, MAP21 traversed 35-40% of the index was traversed during each query; we therefore ommitted the MAP21 results from the presentation. We can easily see that tGRIN takes much less to build than any of the other indexes ­ approximately 231s for a 26 million triples dataset. Also, tGRIN takes less disk space than Sesame and 3store, at comparable memory usage. However, the greatest improvements are in terms of query running time. tGRIN p erforms 3 ­ 6 times faster on the average typ es of queries in Figure 6(d). Jena2 runs out of memory for datasets larger than 13 million tuples. Sesame and 3store are able to answer queries with a sharp time p enalty for large datasets. The translation of RDF queries into SQL typically contains many joins, which are increasingly difficult to compute as the dataset grows. On the other hand, tGRIN does not compare the query pattern against the actual triples of the database until the smallest index nodes that contain the answer to the query have b een identified. The very small increase in running time for tGRIN is due to the increase in the size of the tGRIN tree. These results led us to susp ect that the p erformance gains in tGRIN would increase as the queries grow more and more complex. For relational representations, a larger query translates into more SQL joins, whereas for tGRIN it translates primarily into larger sets of constraints. For the GovTrack dataset, we varied the size of the query pattern from 5 to 35 nodes with a fixed variable/constant ratio of .5 and measured the query running time. In a second exp eriment, we kept the query pattern size at 15, but varied the variable/constant ratio from .2 to 1.5. In this latter case, we normalized the query time by the numb er of answers returned. We could not use Jena2 in this exp eriment due to out-of-memory errors. The results are shown in Figure 6(e) and (f ) resp ectively. On complex queries, tGRIN takes only 1 1 'th to 10 'th the time taken by 3store or Sesame. Again, 6 this is due to the fact that higher query complexity does not immediately translate into higher SQL query complexity. The small query time increase is due to the subgraph matching op eration, which is in most cases localized to a very small p ortion of the database. 7. CONCLUSIONS RDF is a growing web standard supp orted by the W3C. It is clear that temp orally annotated RDF datasets will grow as well b ecause prop erties are relationships represented in RDF will vary with time in many applications. Examples such as the GovTrack data set show that large scale tRDF databases are b eing built and will need efficient querying and indexing mechanisms. Imp ortant contributions toward defining temp oral RDF have b een made by researchers such as [3, 5, 6, 8]. However, three imp ortant asp ects of a DBMS are not covered by these past works. (i) No indexing mechanism for temp oral RDF exists to date, (ii) No scalable algorithms for processing complex queries on large disk-resident temp oral RDF databases has b een prop osed to date, and (iii) no implementation or exp erimental results have b een provided to date. In this pa- 6.2 Comparison with other systems We compared tGRIN with Jena2, Sesame and 3store enhanced with the corresp onding temp oral indexes. For each 613 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web II 60 0 50 0 April 21-25, 2008 · Beijing, China 4 .5 4 3 .5 Index Building Time t-GRIN Jena-SR-tree Sesame-reified 3store-ST-index 35 0 30 0 Peak memory usage t-GRIN Jena-reified Sesame-R+-tree 3store-ST-index Disk space t-GRIN Jena-R+-tree Sesame-R+-tree 3store-ST-index 30 0 20 0 10 0 0 2 5 8 11 14 17 20 Dataset size [milion triples] 23 26 20 0 15 0 10 0 50 0 2 5 8 11 14 17 20 Dataset size [milion triples] 23 26 Disk space [MB] Build time [s] 40 0 Memory used [MB] 25 0 3 2 .5 2 1 .5 1 0 .5 0 2 5 8 11 14 17 20 Dataset size [milion triples] 23 26 (a) Index building time 18 0 16 0 14 0 (b) Memory usage 30 0 25 0 (c) Disk space 4 .5 Query time for 15 node-pattern, variables/constants=.5 t-GRIN Jena-R+-tree Sesame-R+-tree 3store-SR-tree Query running time variables/constants=.5 Running time per answer returned [s] t-GRIN Sesame-R+-tree 3store-SR-tree Query time graph pattern size = 15 nodes t-GRIN Sesame-R+-tree 3store-SR-tree 4 3 .5 3 2 .5 2 1 .5 1 0 .5 0 Running time [s] 10 0 80 60 40 20 0 2 5 8 11 14 17 20 Dataset size [milion triples] 23 26 Running time [s] 12 0 20 0 15 0 10 0 50 0 5 10 15 20 Query pattern size [no nodes] 25 35 0 .2 0 .4 0 .6 0 .8 1 Ratio of variables/constants 1 .2 1 .5 (d) Query time (e) Increasing query pattern size (f ) Increasing variable/constant ratio Figure 6: Comparison between tGRIN, Jena2, Sesame and 3store. (a)-(d) are over the synthetic dataset; (e), (f ) are on the GovTrack dataset. p er, we provide, for the first time, a solution that addresses all these three p oints and that is demonstrated to show efficient p erformance on a real data set containing over 20 million triples, as well as on synthetic data sets. Our index structure is shown to b e a b etter match for tRDF queries than the use of standard indexes for temp oral data - the reason is that the latter do not account for the graph based nature of tRDF data and tRDF queries, while our tGRIN index structure accounts neatly for b oth. A small additional p oint is that we also handle indeterminate time in tRDF which is not done in past work. [9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264­323, 1999. [10] B. Liu and B. Hu. Path queries based rdf index. SKG, 0:91, 2005. [11] A. Martelli and U. Montanari. An efficient unification algorithm. ACM Trans. Program. Lang. Syst., 4(2):258­282, 1982. [12] A. Matono, T. Amagasa, M. Yoshikawa, and S. Uemura. An efficient pathway search using an indexing scheme for RDF. Genome Informatics, 14:374­375, 2003. [13] B. Salzb erg and V. J. Tsotras. Comparison of access methods for time-evolving data. ACM Comput. Surv., 31(2):158­221, 1999. [14] S. Trißl and U. Leser. Fast and practical indexing and querying of very large graphs. In SIGMOD, pages 845­856, 2007. [15] O. Udrea, A. Pugliese, and V. S. Subrahmanian. Grin: A graph based rdf index. In AAAI, pages 1465­1470, 2007. [16] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In SIGMOD, pages 335­346, 2004. 8. REFERENCES [1] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB, pages 411­422, 2007. [2] J. Ab ello and Y. Kotidis. Hierarchical graph indexing. In CIKM, pages 453­460, 2003. [3] S. C. Buraga and G. Ciobanu. A rdf-based model for expressing spatio-temp oral relations b etween web sites. In WISE, pages 355­361, 2002. [4] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Trans. PAMI, 26(10):1367­1372, 2004. [5] C. Guti´rrez, C. A. Hurtado, and A. A. Vaisman. e Temp oral rdf. In ESWC, pages 93­107, 2005. [6] C. Gutierrez, C. A. Hurtado, and A. A. Vaisman. Introducing time into rdf. IEEE Trans. Know l. Data Eng., 19(2):207­218, 2007. [7] S. Harris and N. Gibbins. 3store: Efficient bulk rdf storage. In PSSS, pages 1­15, 2003. [8] J. Heggland. Ontolog: Temp oral annotation using ad hoc ontologies and application profiles. In ECDL, pages 118­128, London, UK, 2002. Springer-Verlag. 614