WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I April 21-25, 2008 · Beijing, China Computing Minimum Cost Diagnoses to Repair Populated DL-based Ontologies Jianfeng Du State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences Graduate University of the Chinese Academy of Sciences Beijing 100080, China Yi-Dong Shen State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences Beijing 100080, China ydshen@ios.ac.cn jfdu@ios.ac.cn ABSTRACT Ontology p opulation is prone to cause inconsistency b ecause the p opulating process is imprecise or the p opulated data may conflict with the original data. By assuming that the intensional part of the p opulated DL-based ontology is fixed and each removable ABox assertion is given a removal cost, we repair the ontology by deleting a subset of removable ABox assertions in which the sum of removal costs is minimum. We call such subset a minimum cost diagnosis. We show that, unless P=NP, the problem of finding a minimum cost diagnosis for a DL-Lite ontology is insolvable in PTIME w.r.t. data complexity. In spite of that, we present a feasible computational method for more general (i.e. S HI Q) ontologies. It transforms a S HI Q ontology to a set of disjoint prop ositional programs, thus reducing the original problem into a set of indep endent subproblems. Each such subproblem computes an optimal model and is solvable in logarithmic calls to a SAT solver. Exp erimental results show that the method can handle moderately complex ontologies with over thousands of ABox assertions, where all ABox assertions can b e assumed removable. Full, where the former two coincide semantically with certain description logics (DLs) [1]. A DL-based ontology consists of an intensional part and an extensional part. The intensional part consists of a TBox and an RBox, and contains knowledge ab out concepts and relations (called roles ) b etween the elements of the domain. The extensional part consists of an ABox, and contains knowledge ab out individuals and how they relate to the concepts and roles from the intensional part. In this pap er, the knowledge in intensional parts is called axioms, whilst the knowledge in extensional parts is called ABox assertions or simply assertions. A crucial question in the vision of Semantic Web is how to supp ort and ease the process of creating and maintaining DL-based ontologies. An imp ortant task within this process is ontology population, which adds instances of concepts and relations to the ontology. In recent years, there has b een a great surge of interest in methods for p opulating ontologies from textual resources. To name a few, Text2Onto [3] and KITE [30] are frameworks that integrate algorithms for p opulating ontologies from textual data. The algorithms include information extraction algorithms that assign annotations carrying some semantics to regions of the data, and co-reference algorithms that identify annotated individuals in multiple places. As for p opulating DL-based ontologies, the information extraction process b ehaves as adding concept/role assertions, whilst the co-reference process b ehaves as adding equality/inequality assertions. The p opulated ontology, however, may b ecome inconsistent b ecause the information extraction/co-reference process is imprecise or the p opulated data p ossibly conflict with the original data. In order to repair the p opulated ontology, we prop ose to delete a subset of assertions in which the sum of removal costs is minimum, based on the following considerations. First, the intensional part should not b e changed, b ecause in general it is well prepared and coherent (i.e., having no unsatisfiable concepts) b efore the p opulation. Second, for changing the extensional part, only the deletion of assertions is considered b ecause there is generally no criteria for revising assertions. Third, for deleting assertions, some minimal criteria on removable assertions (e.g., the cost of losing information) should b e considered. Fourth, the certainty information on an assertion, given by the information extraction/coreference process, can b e used as the cost of losing information (called the removal cost ), b ecause deleting a more certain assertion generally loses more information. Fifth, the Categories and Subject Descriptors I.2.3 [Artificial Intelligence]: Deduction and Theorem Proving; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods General Terms Algorithms Keywords Ontologies, Description Logics, Disjunctive Datalog, Diagnosis 1. INTRODUCTION Nowadays OWL [22] has b een established as a core standard in the Semantic Web. It comes in three layers in ascending expressivity, i.e., OWL Lite, OWL DL and OWL Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. 565 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I collective removal cost of a set of assertions can b e approximated by the sum of removal costs in the set. Therefore, we in this pap er address computing a minimum cost diagnosis for an inconsistent ontology K B , i.e. a subset of removable assertions whose removal turns K B consistent and in which the sum of removal costs is minimum. We show that, unless P=NP, the problem of finding a minimum cost diagnosis for a DL-Lite ontology is insolvable in p olynomial time (PTIME) w.r.t. data complexity, i.e. the complexity measured in the size of the ABox only. Note that DL-Lite is a fairly inexpressive DL language such that the consistency checking problem for DL-Lite ontologies is in PTIME in the size of the ontology [2]. This complexity result implies that the problem of finding minimum cost diagnoses is in general intractable. In spite of that, we develop a feasible computational method for more general (i.e. S HI Q) ontologies. It transforms a S HI Q ontology to a set of disjoint prop ositional programs by applying an existing transformation method (from S HI Q to disjunctive datalog) [12, 21] and new grounding and partitioning techniques. Thus our method reduces the problem of finding a minimum cost diagnosis into a set of indep endent subproblems. Each such subproblem computes an optimal model and is solvable in O(log 2 n) calls to a satisfiability (SAT) solver, by assuming that removal costs have b een scaled to p ositive integers p olynomial in n the numb er of removable assertions. We implement our method and exp eriment on several originally consistent, real/b enchmark ontologies. Each test ontology has over thousands of assertions. We implement a tool to inject conflicts into a consistent ontology, where a conflict, caused by several inserted assertions, violates a functional restriction or a disjointness constraint. Exp erimental results show that, even when all assertions are assumed removable, our method can handle all the test ontologies with injected conflicts. Esp ecially, our method scales well on the extended b enchmark ontologies with increasing numb er (from 1000) of conflicts. April 21-25, 2008 · Beijing, China Example 1. Let the intensional part consist of two axioms A P.¬A Q.¬A and ¬A P.A Q.A, and the ABox b e {A(a1 ), P (an , a1 ), Q(an , a1 )} {P (ai , ai+1 ), Q(ai , ai+1 ) | 1 i n - 1}, where n is an odd numb er. Then the ontology is inconsistent, b ecause ¬A(a1 ) is one of its consequences but conflicts with A(a1 ). The minimal conflict sets over the ABox are of the form {A(a1 ), U1 (a1 , a2 ), . . . , Un-1 (an-1 , an ), Un (an , a1 )}, where Ui is either P or Q. So the numb er of minimal conflict sets is 2n . Hence, a method that computes minimum cost diagnoses from minimal conflict sets may work in exp onential time and exp onential space w.r.t. data complexity (e.g., when it handles an ontology that is the union of the ontology in the ab ove example and the ontology given in the proof of Theorem 2). In contrast, our method works in exp onential time (more precisely, in logarithmic calls to an NP oracle) and p olynomial space w.r.t. data complexity. There exist some heuristics-based methods for repairing DL-based ontologies. Schlobach [25] prop osed an approximate approach to computing a subset of axioms whose removal corrects an unsatisfiable concept or an incoherent TBox. Dolby et al. [4] exploited summarization and refinement techniques to compute a subset of assertions whose removal turns an inconsistent ontology consistent. Their prop osed methods, however, cannot guarantee minimality for the set of removed axioms/assertions. There also exist some methods for revising problematic axioms (e.g., [19, 13, 16, 23]). But they cannot b e adapted to revising assertions, b ecause assertions are assumed atomic in our work. We only consider the deletion of assertions. As for dealing with inconsistency in DL-based ontologies, there is another approach that simply avoids/tolerates the inconsistency and applies a non-standard reasoning method to obtain meaningful answers (e.g., [11, 17]). We can also adapt our method to this approach, by defining a consistent consequence of an inconsistent ontology as a consequence invariant under all minimum cost diagnoses. This is out of the scop e of this pap er and is not discussed here. 2. RELATED WORK There are some works that address repairing DL-based ontologies. Kalyanpur et al. [13] extended Reiter's Hitting Set Tree (HST) algorithm [24] to compute a minimum-rank hitting set, which is a subset of axioms that need to b e removed/fixed to correct an unsatisfiable concept, such that the sum of axiom ranks in the subset is minimum. The notion of minimum-rank hitting set is similar to that of minimum cost diagnosis, except that the former is on axioms while the latter is on assertions. Schlobach [26] applied Reiter's HST algorithm to compute a minimal subset of axioms that need to b e removed/fixed to correct an unsatisfiable concept or an incoherent TBox. The ab ove methods require all minimal conflict sets b e computed b eforehand, where a minimal conflict set is a minimal subset of axioms resp onsible for the unwanted consequence. Though the ab ove methods can work with ABoxes as well (by viewing assertions as axioms), it is impractical to adapt them to computing minimum cost diagnoses. First, the problem of finding a minimum hitting set from minimal conflict sets is intrinsically intractable [15]. Second, though there exist efficient methods for computing a minimal conflict set (e.g., [27, 14, 20]), computing all minimal conflict sets is still hard b ecause the numb er of minimal conflict sets can b e exp onential in the numb er of assertions, as shown in the following example. 3. PRELIMINARIES 3.1 SHIQ and DL-Lite The S HI Q description logic [10] is a syntactic variant of OWL DL [22] without nominals and concrete domain sp ecifications, but allowing qualified numb er restrictions. A S HI Q RBox K BR is a finite set of transitivity axioms T r ans(R) and role inclusion axioms R S, where R and S Sre roles. Let b e the reflexive transitive closure of {R a , I nv (R) Inv(S) | R S K BR }, where I nv (R) = R- and I nv (R- ) = R for a role R. A role S is simple if there is no role R such that R S and either T r ans(R) K BR or T r ans(I nv (R)) K BR . The set of S HI Q concepts is the smallest set containing , , A, ¬C , C D, C D, R.C , R.C , n S.C and n S.C , where A is a concept name (i.e. an atomic concept ), C and D S HI Q concepts, R a role, S a simple role, and n a p ositive integer. A S HI Q TBox K BT is a finite set of concept inclusion axioms C D, where C and D are S HI Q concepts. A S HI Q ABox K BA is a set of concept assertions C (a), role assertions R(a, b), equality assertions a b and inequality assertions a b, where C is a S HI Q concept, R a role, and a and b individuals. A S HI Q ontology K B is a triple (K BR , K BT , K BA ), where 566 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I KBR is an RBox, K BT a TBox, and K BA an ABox. In this pap er, by K B we simply denote (K BR , K BT , K BA ) if there is no confusion. DL-Lite [2] is a sub-language of S HI Q. The set of DLLite concepts is the smallest set containing A, R, R- , ¬B and C1 C2 , where A is a concept name, R a role name, B a basic concept (i.e. a concept of the form A, R or R- ), and C1 and C2 DL-Lite concepts. R is actually an unqualified existential restriction R. . A DL-Lite ontology K B = (K BT , K BA ) consists of a TBox K BT and an ABox C and K BA . K BT is a finite set of inclusion axioms B functionality axioms (func R) and (func R- ), where B is a basic concept, C a DL-Lite concept and R a role. K BA is a set of concept assertions B (a) and role assertions R(a, b), where B is a basic concept, R a role, and a and b individuals. The semantics of a S HI Q ontology K B is given by a mapping that translates K B into first-order logic. Due to space limitation, we refer readers to [12] for the definition of . K B is said to b e consistent/satisfiable if there exists a first-order model of (K B ). The semantics of a DL-Lite ontology K B can still b e given by the same mapping , b ecause a functionality axiom (func R) is a syntactic variant of 1 R. . Note that the unique name assumption (UNA) [1] on individuals is applied in DL-Lite but not in S HI Q. UNA can b e explicitly axiomatized by app ending to the ABox all inequality assertions a b for any two individuals a and b that have different URIs. In our work we assume that all concept assertions are attached to atomic concepts only, due to the following reasons. First, the representation of non-atomic concept assertions is not supp orted by the RDF/XML syntax of OWL (cf. http://www.w3.org/TR/owl- ref), which is the main syntax for representing DL-based ontologies nowadays. Second, a non-atomic concept assertion C (a) can b e reduced to an atomic one by replacing C (a) with Q(a) and app ending C Q to the TBox, where Q is a new atomic concept. April 21-25, 2008 · Beijing, China 3.3 Reducing SHIQ to Disjunctive Datalog Since S HI Q is a subset of first-order logic, S HI Q axioms can first b e translated into logical formulas, then into clausal form. The resulting clauses can b e represented as disjunctive rules without negation-as-failure. However, due to p ossible skolemization steps in the clausal form translation, the resulting rules may contain function symb ols. Standard logic program engines, however, may not terminate in the presence of function symb ols. To cop e with this problem, Hustadt et al. [12, 21] develop ed the KAON2 transformation method to get rid of function symb ols without losing ABox consequences. The method reduces a S HI Q ontology K B to a p ositive disjunctive datalog program DD(K B ) = (K BR , K BT ) (K BA ) (K B ). (K BR , K BT ) is a set of disjunctive datalog rules computed from the intensional part of K B by translating S HI Q axioms into clauses, adding logical consequences, and translating clauses into disjunctive datalog rules. (K BA ) is a set of facts translated from K BA , where each inequality assertion (of the form a b) is translated into a ground constraint (of the from a b), and other assertions are directly translated into ground facts. (K B ) is a set of facts of the form H U (a), H U (af ) and Sf (a, af ), which are introduced to remove function symb ols and instantiated for each individual a occurring in K B and each function symb ol f . Theorem 1 ([21]). For K B a S HI Q ontology, K B is unsatisfiable if and only if DD(K B ) is unsatisfiable. 4. MINIMUM COST DIAGNOSIS Given a p ossibly inconsistent S HI Q ontology K B in which some assertions are removable and assigned removal costs, our goal is to find a subset of removable assertions whose removal turns K B consistent and in which the sum of removal costs is minimum. Such subset is called a minimum cost diagnosis, formally defined b elow. Definition 1. Let K B b e a p ossibly inconsistent S HI Q ontology and RK B K BA a set of removable assertions such that each assertion RK B is given a removal cost c() > 0. Then, a subset of assertions R RK B is called a diagnosis for K B w.r.t. RK B if (K BR , K BT , K BA \ R) is consistent. A diagnosis R is called a minimum cost diagnosis for K B w.r.t. RK if there is nodiagnosis R for K B w.r.t. B RK B such that R c() < R c(). R is simply called a diagnosis/minimum cost diagnosis if K B and RK B are clear from the context. We consider the time complexity for finding a minimum cost diagnosis. Complexity results in this pap er refer to data complexity, i.e. the complexity measured as a function of the numb er of assertions in the ontology. Theorem 2 shows that, unless P=NP, there is no p olynomial time algorithm for finding a minimum cost diagnosis for a DL-Lite ontology K B w.r.t. K BA . It implies that the problem of finding minimum cost diagnoses for S HI Q ontologies is in general intractable. Theorem 2. Given a positive integer k and a possibly inconsistent DL-Lite ontology K B = (K BT , K BA ) where each assertion K BA is given a removal cost c() = 1, deciding if there is a diagnosis R for K B w.r.t. K BA such that R c() k is NP-hard w.r.t. data complexity. 3.2 Disjunctive Datalog A disjunctive datalog program with equality [6] P is a finite set of rules without function symb ols of the form A1 . . . An B1 , . . . , Bm (where Ai and Bi are atoms). Each rule must b e safe, i.e., each variable occurring in a head atom Ai must occur in some b ody atom Bj . For a rule r , the set of head atoms is denoted by head(r ), whereas the set of b ody atoms is denoted by body (r ). A rule r is called a constraint if |head(r )| = 0; a fact if |body (r )| = 0. An atom is called negated if it leads with negation-as-failure. Typical definitions of a disjunctive datalog program, such as [6], allow negated atoms in the b ody. In our work, negated atoms cannot occur in a transformed program that we consider, so we omit negation-as-failure from the definitions. Disjunctive datalog programs without negation-as-failure are often called positive programs. The set of all ground instances of rules in P is denoted by g r ound(P ). An interpretation M of P is a subset of ground atoms in the Herbrand base of P . An interpretation M is called a model of P if (i) body (r ) M implies head(r ) M = for each rule r g r ound(P ), and (ii) all atoms from M with the equality predicate yield a congruence relation, i.e. a relation that is reflexive, symmetric, transitive, and T (a1 , . . . , ai , . . . , an ) M and ai bi M imply T (a1 , . . . , bi , . . . , an ) M for each predicate symb ol T in P . P is said to b e satisfiable if it has a model. 567 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I Proof. Given an arbitrary instance I of the SAT problem, we transform it into an instance I of the given decision problem. Let I b e the formula f = C1 . . . Cm with m a clauses and n b oolean variables x1 , . . . , xn . We construct I s follows. (1) K BT consists of the following axioms: T ¬S, T - ¬S - , (func T - ), (func S - ). (2) For each b oolean variable xi in f , K BA contains a corresp onding constant ai . (3) For each clause Cj containing nj literals lj,1 , . . . , lj,nj whose corresp onding constants in K BA , introduced in (2), are aj,1 , . . . , aj,nj resp ectively, K BA contains a corresp onding constant cj for Cj and nj assertions U (aj,1 , cj ), . . . , U (aj,nj , cj ), where U (aj,k , cj ) is T (aj,k ,mj ) if lj,k is p ositive, or S (aj,k , cj ) otherwise. (4) Let c k = j =1 (nj - 1). Now we prove that f is satisfiable if and only if there is a diagnosis R for K B = (K BT , K BA ) w.r.t. K BA such that a R c(a) k , i.e., |R| k . () Since f is satisfiable, for each clause Cj there is a literal lj,k assigned true. We app end to R all assertions in K BA of the form U (aj,p , cj ) (p = k), where U (aj,p , cj ) is T (aj,p , cj ) if lj,p is p ositive, or S (aj,p , cj ) otherwise. Clearly, R is a diagnosis for K B w.r.t. K BA such that |R| k. () Supp ose R is a diagnosis for K B w.r.t. K BA such that |R| k. It is not hard to see that any set of assertions of the form U (aj,k , cj ) (U is either T or S ) must have exactly nj - 1 assertions in R and one in K BA \ R. To see that f is satisfiable, for each U (aj,k , cj ) K BA \ R (j = 1, . . . , m), if U is T , we assign xj,k (i.e. the corresp onding variable of aj,k in f ) true; otherwise we assign xj,k false. The ab ove (partial) assignment on {x1 , . . . , xn } is consistent and ensures lj,k = true for all j = 1, . . . , m. Thus f is satisfiable. Since the construction of K B is accomplished in PTIME and the SAT problem is NP-complete, and since K BT has a fixed size, this theorem follows. April 21-25, 2008 · Beijing, China i optimization function of the form minimize ci xi with constants ci Z and variables xi {0, 1}, where Z denotes the integer domain. The SAT problems with PB-constraints and linear optimization functions are well studied in the SAT community (cf. http://www.cril.univ- artois.fr/PB07/). A SAT problem with linear optimization functions can b e translated into a set of SAT problems with PB-constraints. A SAT problem with PB-constraints can b e either solved by standard SAT solvers after translating PB-constraints to SAT clauses [5], or solved by extended SAT solvers that supp ort PB-constraints natively (e.g., PUEBLO [28]). Now, the remaining problems are how to transform a S HI Q ontology to the intended p ositive program and how to efficiently compute minimum cost diagnoses. We address these problems in the following subsections. 5.1 Constructing a Repair Program Given a p ossibly inconsistent S HI Q ontology K B , we first employ the KAON2 transformation method [12, 21], describ ed in Preliminaries, to reduce K B to a disjunctive datalog program DD(K B ) = (K BR , K BT ) (K BA ) (K B ), but introduce a sp ecial treatment. The original KAON2 transformation method allows equality atoms (of the form X Y or a b, where X, Y denote variables and a, b denote constants) to occur in rule b odies in DD(K B ) while disallows inequality atoms (of the form X Y or a b) to occur in DD(K B ). To handle inequality assertions in a similar way as other assertions, we first move equality atoms (X Y or a b) in any rule b ody in DD(K B ) to the corresp onding rule head and replace them with inequality atoms (X Y or a b), then app end to DD(K B ) a constraint X Y , X Y (written R ), so that (K BA ) is simplified to a direct translation from assertions in K BA to ground facts in DD(K B ). Having such treatment we simply denote (K BA ) as K BA . The modified rules in DD(K B ) are still safe due to the restricted form of the original rules that have equality atoms in the b ody. In essence, our treatment views an inequality atom as an ordinary one and does not impact the satisfiability of DD(K B ). Then, we convert DD(K B ) to a repair program R(K B ) defined b elow. Intuitively, the decision atom - is introduced to weaken K B , so that - = true (resp. - = false) implies that is removed from (resp. kept in) K B . Note that decision atoms in R(K B ) are treated as nullary ground atoms. Definition 2. For K B a p ossibly inconsistent S HI Q ontology and RK B K BA a set of removable assertions such that each assertion RK B is given a removal cost c() > 0, a repair program of K B w.r.t. RK B , written R(K B ), is a disjunctive datalog program converted from DD(K B ) as follows: for each assertion RK B , we introduce a corresp onding decision atom - and give it a cost c(- ) = c(), then replace the ground fact in DD(K B ) with - . We simply call R(K B ) a repair program if RK B is clear from the context. Example 2. Let A, E , H , P , S , T , me and pa abbreviate Ar tif icer, E ng ineer , H uman, P r of essor, S tudent, T eacher , mentor and par ent resp ectively. Given a S HI Q ontology K B = (, K BT , K BA ), where K BT = {S 1 me me.P H, H pa.H, P E, me.E ¬A, E ¬T } and K BA = {S (s1 ), S (s2 ), me(s1 , t1 ), me(s1 , t2 ), A (s 2 ), T (t 1 ), T (t 2 ), T (p 1 ), E (p 2 ), p a (s 1 , p 1 ), p a (s 2 , p 1 ), t 1 t2 , p1 p2 }, and a set of removable assertions RK B = 5. COMPUTING MINIMUM COST DIAGNOSES As analyzed in related work, a method that computes minimum cost diagnoses based on minimal conflict sets is impractical, b ecause it may require b oth exp onential time and exp onential space. We thus consider methods that need not compute minimal conflict sets. A na¨ve method is the i black-b ox method, which searches minimum cost diagnoses over all subsets of removal assertions by applying a DL reasoner to check diagnoses. However, the black-b ox method cannot compute a minimum cost diagnosis for a DL-Lite ontology in p olynomial calls to a DL reasoner, otherwise a minimum cost diagnosis can b e computed in PTIME w.r.t. data complexity, contradicting Theorem 2. In order to find a practical computational method, we consider transforming the input ontology K B into a p ositive program such that for any subset S of RK B the set of removable assertions, K B \ S is consistent if and only if {assign (- ) = 1 | S } {assign (- ) = 0 | RK B \ S } is satisfiable, where - is a fresh ground atom corresp onding to ground atom in , and assign( ) denotes the 0-1 truth value of . Then, a minimumcost diagnosis corresp onds to a valuation - of X such that - X c() · assign ( ) is minimum and - is satisfiable, where X = { | RK B }. To find such a valuation of X , we need to handle pseudoi ci xi d boolean constraints (PB-constraints) of the form with constants ci , d Z and variables xi {0, 1}, or a linear 568 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I {S(s2 ), A(s2 ), T (t1 ), T (t2 ), E (p2 ), t1 t2 , p1 p2 } such that c() = 1 for each assertion RK B , we construct the repair program R(K B ) as follows. First, by applying the KAON2 transformation method with our sp ecial treatment, we reduce K B to DD(K B ) = (K BR , K BT ) {R } K BA (K B ), where (K B ) = {Sf (s1 , s1f ), Sf (s2 , s2f ), Sf (t1 , t1f ), Sf (t2 , t2f ), Sf (p1 , p1f )} and (K BR , K BT ) = {R1 , . . . , R9 } as given b elow. Y 1 Y 2 S (X ), m e (X , Y 1 ), m e (X , Y 2 ). R1 : P (Xf ) S (X ), Sf (X, Xf ). R2 : R3 : me(X, Xf ) S (X ), Sf (X, Xf ). R4 : H (X ) S (X ). R5 : H (Y ) H (X ), p a (X , Y ). E (X ) P (X ). R6 : R7 : A (X ), m e (X , Y ), E (Y ). T (X ), E (X ). R8 : R9 : A (X ), S (X ). Then, by introducing decision atoms and converting ground facts in DD(K B ), we obtain R(K B ) = {R1 , . . . , R9 , R } (K B ) {S (s1 ), me(s1 , t1 ), me(s1 , t2 ), T (p1 ), pa(s1 , p1 ), p a (s 2 , p 1 ), S (s 2 ) S (s 2 )- , A (s 2 ) A (s 2 )- , T (t 1 ) T (t 1 )- , T (t 2 ) T (t 2 )- , E (p 2 ) E (p 2 )- , ( t 1 t 2 ) (t 1 t 2 )- , (p1 p2 ) (p1 p2 )- }. There exists a corresp ondence b etween minimum cost diagnoses for K B w.r.t. RK B and X -MC models of R(K B ), where X = {- | RK B } (see Theorem 3). A model M of a p ositive program P is called an X -MC model of P if there is no model M of P such that M X c( ) < M X c( ), where X is a set of ground atoms and c is a predefined cost function over X . Theorem 3. Let K B be a S HI Q ontology, RK B K BA a set of removable assertions such that each assertion RK B is given a removal cost c() > 0, R(K B ) a repair program of K B w.r.t. RK B , and X = {- | RK B }. (Soundness) For each X -MC model M of R(K B ), { | - M } is a minimum cost diagnosis for K B w.r.t. RK B ; (Completeness) For each minimum cost diagnosis R for K B w.r.t. RK B , there exists an X -MC model M of R(K B ) such that R = { | - M }. Proof sketch. (Soundness) Let M b e an X -MC model of R(K B ), R = { | - M } and M = M \ {- | R}. It can b e shown that M is a model of DD(K B ) \ R. By Theorem 1, R is diagnosis of K B w.r.t. RK B . Further, R must b e a minimum cost diagnoses, otherwise itcan b e shown that - there exists a model M of R(K B ) s.t. - M X c( ) < - - M X c( ). (Completeness) Let R b e a minimum cost diagnosis for K B w.r.t. RK B . By Theorem 1, DD(K B ) \ R is satisfiable and thus has a model, say M . It can b e shown that M = m M {- | R} is a model of R(K B ). Further, M ust b e an X -MC model of R(K B ), otherwise it can b e shown that there exits a diagnosis R for K B w.r.t. RK B s s.t. R c() < R c(). April 21-25, 2008 · Beijing, China equality atoms from other atoms. To treat the equality predicate , which is interpreted as a congruence relation, as an ordinary predicate, we use a well-known transformation from [8]. For a disjunctive datalog program P , let P denote the program consisting of the rules stating that the equality predicate is reflexive, symmetric and transitive, and the replacement rules given b elow, instantiated for each predicate T in P (excluding ) and each p osition i. Note that the reflexive rule is not safe and is instead represented as a set of ground facts of the form a a, instantiated for each constant a in P . Then, app ending P to P allows to treat as an ordinary predicate. T (X1 , . . . , Yi , . . . , Xn ) Xi Yi , T (X1 , . . . , Xi , . . . , Xn ). For the input issue, we need to ground R(K B ) b efore applying SAT solvers. A well-known grounding technique is intel ligent grounding (IG) [7], which only applies to equalityfree disjunctive datalog programs. That is, if the equality predicate is present in a disjunctive datalog program P , we must app end P to P b efore grounding P using the IG technique. The IG technique has b een implemented in a disjunctive datalog engine DLV [18], but the current implementation cannot handle large disjunctive datalog programs due to memory limitation1 , esp ecially when the equality predicate is present. On the other hand, current implementations of SAT solvers lack scalability for large prop ositional programs. To address these problems, we develop two diskbased algorithms for grounding R(K B ) to (K B ) and for partitioning (K B ) to disjoint subprograms resp ectively, so that the computation of minimum cost diagnoses can b e separately p erformed over each subprogram. Algorithm 1 is our algorithm for grounding a repair program P . By Mdef we denote the unique minimal model of the definite fragment of P , i.e. {R P | |head(R)| = 1}. C is actually the set of congruence classes {C1 , . . . , Cm } occurring in P , where Ci = {b | a b Mdef } for an arbitrary constant a occurring in some equality atom in Mdef that is not of the form a a. fc (a, C ) denotes the congruence class in C that contains constant a. minc (C ) denotes the constant a C having the smallest value in {occ(a) | a C }, where occ(a) is the occurrence order of a in P . Ddef is actually a set of non-equality atoms in Mdef such that for each non-equality atom T (a1 , . . . , ak ) Mdef , there exists a unique ground atom T (b1 , . . . , bk ) Ddef such that for each i = 1, . . . , k, ai and bi are either the same or together in some C C . D is the set of ground atoms occurring in the grounded program . Let S and S b e two sets of ground atoms. S denotes the subset of S consisting of all equality atoms in S ; S\ denotes S \ S . For a rule R, the function GetSubstitutes(R, S , S ) returns the set of all ground substitutes such that body (R ) S , head(R )\ S = and head(R ) does not contain equality atoms of the form a a. The function Rewrite(S , C ) rewrites all constants a in S such that fc (a, C ) exists to minc (fc (a, C )), and returns the modified S . The algorithm consists of three steps. Step 1 (lines 1­13) computes Mdef in a standard iterative manner, but represents Mdef as Ddef and C . Step 2 (lines 14­16) rewrites the constants occurring in disjunctive ground facts (of the form - ) in P , b ecause some constants occurring in The current implementation with DB supp ort, DLVD B (http://www.mat.unical.it/terracina/dlvdb/), does not work with DBs if the input program has disjunctions. 1 5.2 Computing X-MC Models By Theorem 3, the problem of finding minimum cost diagnoses for K B w.r.t. RK B is reduced to the problem of computing X -MC models of R(K B ), which can b e realized by applying SAT solvers. However, SAT solvers take a p ositive prop ositional program as input and do not distinguish 569 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I are represented by other constants in step 1. At a word, in step 1 and step 2, all constants in a congruence class are replaced with a same constant, so as to reduce the numb er of instantiated rules in step 3. Step 3 (lines 17­26) grounds P + , i.e. P P excluding the definite ground facts, in a standard iterative manner based on Mdef . Each instantiated rule r such that head(r ) Mdef = is ignored (line 22), b ecause r has no impact on computing models of P . If a b Ddef , the equality atom a b in an instantiated rule head is not app ended to D (line 24), b ecause it cannot occur in any model of P . The ground atoms in Mdef are removed from the b ody of any instantiated rule (lines 25­ 26), b ecause they are in every model of P . Note that the function GetSubstitutes can b e realized by applying a SQL engine and its results can b e stored in disk, so the algorithm is disk-based. In what follows, by (K B ) we denote the grounded repair program returned by Ground(R(K B )). Algorithm 1. Ground(P ) Input: A repair program P . Output: A set C of sets of constants and a positive propositional program . 1. C := ; Ddef := ; Ddef := {}; // to enforce Ddef = Ddef 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. Ddef := Ddef ; for each rule R P s.t. |head(R)| = 1 do := GetSubstitutes(R, Ddef , Ddef ); for each sequentially retrieved from do Ddef := Ddef head(R)\ ; if head(R) = {a b} for some constants a and b that are not together in some C C then if fc (a, C ) does not exist then Set fc (a, C ) as {a}; if fc (b, C ) does not exist then Set fc (b, C ) as {b}; C := fc (a, C ) fc (b, C ); C := (C \ {fc (a, C ), fc (b, C )}) {C }; Ddef := Rewrite(Ddef , C ); // executed once for for each disjunctive ground fact - in P do for each constant a in (or - ) s.t. fc (a, C ) exists do Rewrite a in (or - ) to minc (fc (a, C )); P + := {R P P | |head(R)| > 1 or R is non-ground}; := ; D := Ddef {a a | a occurs in P }; D := ; while D = D do D := D ; for each rule R P + do := GetSubstitutes(R, D , Ddef ); for each sequentially retrieved from do D := D head(R)\ {a b head(R) | a b Ddef }; B := body (R) \ (Ddef {a a | a occurs in R}); h B := { ead(R) }; return (C , ); while Ddef = Ddef do April 21-25, 2008 · Beijing, China facts) in an iterative manner, obtaining a prop ositional program (K B ) = {r1 , . . . , r22 }, where r14 is instantiated from X Y , X Y (i.e. R ), r15 , . . . , r20 are instantiated from R(K B ) . Note that for instantiating a rule that contains X Y in the b ody, we only consider all ground substitutes such that occ(X ) occ(Y ), where occ(a) is the occurrence order of constant a in R(K B ). r 2 : A (s 2 ) A (s 2 )- . r 1 : S (s 2 ) S (s 2 )- . - r 4 : T (t 1 ) T (t 1 )- . r 3 : ( t 1 t 1 ) (t 1 t 1 ) . r5 : (p1 p2 ) (P1 p2 )- . r6 : E (p2 ) E (p2 )- . r7 : P (s2f ) S (s2 ). r8 : me(s2 , s2f ) S (s2 ). r 9 : H (s 2 ) S (s 2 ). r10 : E (s2f ) P (s2f ). r11 : A(s2 ), me(s2 , s2f ), E (s2f ). r12 : T (t1 ). r13 : A(s2 ), S (s2 ). r14 : t1 t1 . r16 : T (p2 ) p1 p2 . r15 : p2 p1 p1 p2 . r17 : E (p2 ) p1 p2 , E (p1 ). r18 : pa(s1 , p2 ) p1 p2 . r19 : E (p1 ) p1 p2 , E (p2 ). r20 : pa(s2 , p2 ) p1 p2 . r22 : T (p2 ), E (p2 ). r21 : E (p1 ). Algorithm 2. Partition(, X ) Input: A positive propositional program and a set X of ground atoms occurring in . Output: A set of disjoint subprograms of . 1. Set map() as 0 for all ground atoms occurring in ; 2. Move constraints in in front of other rules in ; k := 0; 3. repeat 4. mer g ed := false; 5. for each rule r sequentially retrieved from s.t. head(r ) = or map() > 0 for all head(r ) \ X do 6. for each head(r ) body (r ) s.t. map() = 0 do 7. k := k + 1; map() := k ; 8. if |map(r )| > 1 then 9. mer g ed := true; minid := min(map(r )); 10. for each head(r ) body (r ) do map() := minid ; 11. until not mer g ed; 12. for i = 1, . . . , k do 13. i := {r | head(r ) body (r ) : map() = i}; 14. return {i = | 1 i k }; Example 3. Continue with Example 2. We now demonstrate how Ground(R(K B )) works. In step 1, we compute the unique minimal model Mdef of R(K B )def in an iterative manner, obtaining C = {{t1 , t2 , s1f }} and Ddef = {S (s1 ), m e ( s 1 , t 1 ) , T ( p 1 ) , p a ( s 1 , p 1 ) , p a ( s 2 , p 1 ) , Sf ( s 1 , t 1 ) , P ( t 1 ) , H (s1 ), H (p1 ), E (t1 ), Sf (s2 , s2f ), Sf (t1 , t1f ), Sf (t1 , t2f ), Sf (p1 , p1f )}. In step 2, according to C , we replace the set of disjunctive ground facts in R(K B ) with {S (s2 ) S (s2 )- , A (s 2 ) A (s 2 )- , T (t 1 ) T (t 1 )- , E (p 2 ) E (p 2 )- , ( t 1 t1 ) (t1 t1 )- , (p1 p2 ) (p1 p2 )- }. In step 3, we ground R(K B ) R(K B ) (excluding the definite ground Algorithm 2 is our algorithm for partitioning a p ositive prop ositional program based on a set X of ground atoms occurring in . The basic idea is to filter out rules that have no impact on M X when constructing an X -MC model M of and put together remaining rules that have common ground atoms to form disjoint subprograms. In the algorithm, each ground atom occurring in is mapp ed to a partition identifier map(). For a rule r , we use map(r ) to denote {map() | head(r ) body (r )}. To simplify explanation, we call a rule r ready if head(r ) = or map() > 0 for all head(r ) \ X . Before a ground atom is detected in some ready rule, it is mapp ed to 0 (line 1). To process ready rules as early as p ossible, constraints (which are ready rules) are moved in front of other rules in (line 2). Then, {map() | occurs in } is adjusted in an iterative manner until {map(r ) | r } reaches a fixp oint (lines 3­11). Each ground atom first detected in ready rules is initially mapp ed to a unique partition identifer (lines 6­7). All ground atoms in a ready rule r are mapp ed to the same partition identifier (lines 8­10). After the loop is finished, all ready rules in mapp ed to the same partition identifier are put together, yielding a set of nonempty subprograms {i }1in (lines 12­13). 570 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I It can b e seen that the numb er of iterations (lines 3­11) is at most ||, b ecause the mapping adjustment (lines 9­10) ensures that in each iteration, a ready rule rm having the smallest value of min(map(r )) among {r | r is ready and |map(r )| > 1} must reach a state that |map(rm )| = 1 c and that map(rm ) is unn hanged in subsequent iterations. Furthermore, 0 = \ i=1 i is the intended set of filtered rules (see Lemma 1); i and j contain no common ground atoms for all 1 i < j n. Since is sequentially accessed in each iteration, the algorithm is also disk-based. Lemma 1. Let P be the set of subprograms returned by P \ . For any X -MC Partition(, X ) and 0 = r P = M { head(r ) | model M of ,M 0 X, map() = 0} is an X -MC model of such that M X = M X. r Proof. Let M0 = 0 { head(r ) | X , map() = 0}. Since M0 head(r ) = for every r 0 , M = M M0 satisfies every rule in 0 . Moreover, since map() > 0 for P every ground atom occurring in P , M0 M = and s thus M till satisfies every rule in as M . It follows that M is a model of . Since M0 X = , M is an X -MC model of such that M X = M X . Example 4. Continue with Example 3. Let X b e the set of ground atoms of the form - in (K B ). We now demonstrate how Partition((K B ), X ) works. We first move r11 , . . . , r14 , r21 , r22 in front of other rules in (K B ), then map each ground atom in to a partition identifier. In the first iteration for processing rules r , map(r ) is set as follows (for every ground atom , j :k denotes map() = j b efore processing r at line 8 in Algorithm 2, and map() = k after the first iteration). r11 : A(s2 )1:1 , me(s2 , s2f )2:1 , E (s2f )3:1 . r12 : T (t1 )4:4 . r13 : A(s2 )1:1 , S (s2 )5:1 . r14 : (t1 t1 )6:6 . r22 : T (p2 )8:7 , E (p2 )9:7 . r21 : E (p1 )7:7 . r2 : A(s2 )1:1 A(s2 )- :1 . r1 : S (s2 )1:1 S (s2 )- :1 . 10 11 r3 : (t1 t1 )6:6 (t1 t1 )- :6 . r4 : T (t1 )4:4 T (t1 )- :4 . 12 13 r5 : (p1 p2 )14:7 (p1 p2 )- :7 . r6 : E (p2 )8:7 E (p2 )- :7 . 15 16 r7 : P (s2f )0:1 S (s2 )1:1 . r8 : me(s2 , s2f )1:1 S (s2 )1:1 . r9 : H (s2 )0:0 S (s2 )1:1 . r10 : E (s2f )1:1 P (s2f )17:1 . r15 : (p2 p1 )0:0 (p1 p2 )14:7 . r16 : T (p2 )8:7 (p1 p2 )14:7 . r17 : E (p2 )8:7 (p1 p2 )8:7 , E (p1 )7:7 . r18 : pa(s1 , p2 )0:0 (p1 p2 )7:7 . r19 : E (p1 )7:7 (p1 p2 )7:7 , E (p2 )7:7 . r20 : pa(s2 , p2 )0:0 (p1 p2 )7:7 . In the second iteration, it is detected that |map(r )| = 1 for all ready rules r , so the loop is finished. Finally we obtain four disjoint subprograms from the resulting mapping: 1 = {r11 , r13 , r1 , r2 , r7 , r8 , r10 }, 2 = {r12 , r4 }, 3 = {r14 , r3 } and 4 = {r21 , r22 , r5 , r6 , r16 , r17 , r19 }. In what follows, we call a ground atom of the form - a translated decision atom. Let C b e the set of congruence classes returned by Ground(R(K B )), and {i }1in the set of subprograms returned by Partition((K B ), X ), where X is the set of translated decision atoms occurring in (K B ). We intend to compute X -MC models of R(K B ) over each of {i }1in , where X is the set of decision atoms occurring in R(K B ). However, some decision atoms are ren placed with other atoms in i=1 i , b ecause all constants in a congruence class in C are replaced with a same constant. April 21-25, 2008 · Beijing, China Let X b e the set of translated decision atoms occurring in n i i=1 i . X s said to b e sound ly converted from X w.r.t. C , if each ground atom T (a1 , . . .T ak )- X has b een given a . . cost c (T (a1 , . . . , ak )- ) = (b ,...,bk )- X,b1 =C a1 ,...,bk =C ak .1 - c(T (b1 , . . . , bk ) ), where bi =C ai means that constants bi and ai are the same or together in some C C . Such conversion is reasonable b ecause all decision atoms T (b1 , . . . , bk )- . . X such that b1 =C a1 , . . . , bk =C ak for some T (a1 , . . . , ak )- m X ust b e present or absent together in every model of R(K B ). Moreover, given a subset S of X , we define a decoding of S w.r.t. X and C , written d(S, X, C ), as {T (b1 , . . . , bk )- . . X | T (a1 , . . . , ak )- S, b1 =C a1 , . . . , bk =C ak }. Then, a minimum cost diagnosis of K B corresp onds to a disjoint union of models in each subprogram (see Theorem 4). Example 5. Continue with Example 4. Let X b e the set 4 of decision atoms in R(K B ). In i=1 i , the set of ground atoms soundly converted from X w.r.t. C = {{t1 , t2 , s1f }} is X = {S (s2 )- , A(s2 )- , (t1 t1 )- , T (t1 )- , (p1 p2 )- , E (p2 )- }, where each ground atom - X is given a cost c (- ) = 1 except that c (T (t1 )- ) = 2. Let Xi (i = 1, . . . , 4) b e the subsets of X that occur in i . It is easy to see that 1 has two X1 -MC models M1,1 = {S (s2 )- , A(s2 )} and M1,2 = {S (s2 ), A(s2 )- , P (s2f ), me(s2 , s2f ), E (s2f )}; 2 has a unique X2 -MC model M2 = {T (t1 )- }; 3 has a unique X3 -MC model M3 = {(t1 t1 )- }; 4 has two X4 -MC models M4,1 = {(p1 p2 )- , E (p2 )} and M4,2 = {E (p2 )- , p1 p2 , T (p2 )}. Hence, d(M1,1 X1 , X, C ) = {S (s2 )- }; d(M1,2 X1 , X, C ) = {A(s2 )- }; d(M2 X2 , X, C ) = {T (t1 )- T (t2 )- }; d(M3 X3 , X, C ) = {(t1 t2 )- }; d(M4,1 X4 , X, C ) = {(p1 p2 )- }; d(M4,2 X4 , X, C ) = {E (p2 )- }. By Theorem 4, we obtain four minimum cost diagnoses for K B w.r.t. RK B : {S (s2 ), T (t1 ), T (t2 ), t1 t2 , p1 p2 }, {A(s2 ), T (t1 ), T (t2 ), t1 t2 , p1 p2 }, {S (s2 ), T (t1 ), T (t2 ), t1 t2 , E (p2 )} and {A(s2 ), T (t1 ), T (t2 ), t1 t2 , E (p2 )}. Theorem 4. For K B a S HI Q ontology and RK B K BA a set of removable assertions such that each assertion RK B is given a removal cost c() > 0, suppose Ground(R(K B )) returns (C , (K B )) and Partition((K B ), X ) returns {i }1in , where X is the set of translated decision atoms occurring in (K B ). Let X be the set of transn lated decision atoms occurring in i=1 i which is sound ly - converted from X = { | RK B }, and X1 , . . . , Xn be the subsets of X that occur in 1 , . . . , n respectively. (Soundness) For each Xi -MC model Mi of i (i = 1, . . . , n), n { | - i=1 d(Mi Xi , X, C )} is a minimum cost diagnosis for K B w.r.t. RK B ; (Completeness) For each minimum cost diagnosis R for K B m w.r.t. RK B , there exists an Xi -MC n odel Mi of i (i = 1, . . . , n) such that R = { | - i=1 d(Mi Xi , X, C )}. Proof sketch. (Soundness) For i = 1, . . . , n, let Bi b e the set of ground atoms occurring in i nand Mi an Xi MC model of i . Let 0 = (K B ) \ i=1 i and M = n n r 0 { head(r ) | X i=1 Bi } i=1 (Mi Bi ). Let M b e the set of ground atoms derived by applying all rules in R(K B ) over M {a b | a and b are together innsome C C }, and M + = M M . It can b e seen that + X . It can further b e shown i=1 d(Mi Xi , X , C ) = M + that M is an X -MC model of R(K B ). By Theorem 3, n {|- i=1 d(Mi Xi , X, C )} = {|- M + } is a minimum cost diagnosis for K B w.r.t. RK B . 571 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I (Completeness) Let R b e a minimum cost diagnosis for K B w.r.t. RK B . By Theorem 3, there exists an X -MC model M of R(K B ) s.t. R = {|- M }. Let M b e a set of ground atoms converted from M by rewriting each constant a in M s.t. fc (a, C ) exists to minc (fc (a, C )). It s can b e shown that Mi = M Bi in an Xi -MC model of i (i = 1, . . . , n) s.t. R = { | - i=1 d(Mi Xi , X, C )}. By Theorem 4, the problem of finding minimum cost diagnoses for K B w.r.t. RK B is reduced to n subproblems, each of which computes Xi -MC models of i (i = 1, . . . , n). We consider computing Xi -MC models of i by applying SAT solvers that supp ort PB-constraints. We assume that the cost of each atom in Xi has b een scaled to a p ositive integer p olynomial in |X | the total numb er of removable assertions. Then, the first Xi -MC model ofi can b e computed by a binary search (within range [0, c ( )]) for the minimum Xi ( value vmin such that i { c ) · assign ( ) vmin } Xi ( is satisfiable, taking O(log 2 c )) = O(log2 |X |) calls Xi to a SAT solver. Let M = {M Xi | M is a previously computed Xi -MC model of i }. Then a next Xi -MC model M of i , such that M Xi = S for every S M, can b e ( computed as a model of i { Xi c ) · assign( ) vmin } { S | S M}, by calling a SAT solver once. Consider the time complexity for computing minimum cost diagnoses. Under the data complexity assumption, the numb er of non-ground rules in R(K B ) and the numb er of different variables in each rule in R(K B ) are b ounded by constants. Thus the numb er of rules in (K B ) is p olynomial in |K BA |. It follows that Ground(R(K B )) is executed in PTIME. Let X b e the set of translated decision atoms occurring in (K B ). Partition((K B ), X ) commits at most |(K B )| iterations over (K B ), so it is executed in PTIME too. Let n b e the numb er of removable assertions in K B and {i }1im b e the set of prop ositional programs returned by Partition((K B ), X ). Note that the SAT problem with a PB-constraint is NP-complete. Since i and j have no common ground atoms for all 1 i < j m, m calls to a SAT solver over 1 , . . . ,m m resp ectively amount to one call to an NP oracle over i=1 i . Under the assumption that each removal cost has b een scaled to a p ositive integer p olynomial in n, it follows from Theorem 4 that, the first minimum cost diagnosis is computed in O(log2 n) calls to an NP oracle, and a next one, in one more call. April 21-25, 2008 · Beijing, China Table 1: The complexity of test ontologies NC NR NI NA Nr Features S 59 16 17,941 65,291 221 EQ13 ,DS0 H 27 49 82,095 154,110 159 EQ7 ,DS7 L1 86 34 50,253 120,274 168 EQ2 ,DS0 L10 86 34 629,568 1,293,286 168 EQ2 ,DS0 Note: S stands for Semintec. H stands for HumanCyc- . L1 stands for LUBM1+ . L10 stands for LUBM10+ . NC is the numb er of concept names. NR is the numb er of role names. NI is the numb er of individuals. NA is the numb er of ABox assertions stored in MySQL databases. Nr is the numb er of rules transformed from the intensional part. The features show how many sp ecial transformed rules: EQn means there are n equality rules (i.e. rules containing equality atoms); DSn means there are n disjunctive rules. 6. EXPERIMENTAL EVALUATION We implemented the prop osed method for computing minimum cost diagnoses in GNU C++. In the implementation, MySQL is used as the back-end SQL engine; ABox assertions and new ground atoms derived in the grounding process are maintained in a SQL database; All ground substitutes of rules, retrieved via SQL statements, are maintained in disk files; The SAT solver MiniSat+ [5], which supp orts PB-constraints and linear optimization functions by internally translating them into SAT clauses, is applied to compute X -MC models. All the exp eriments were conducted on a 3.2GHz Pentium 4 CPU 2GB RAM PC running Windows XP and Cygwin. in the SEMINTEC pro ject at the University of Poznan. Its intensional part contains functional roles and disjointness constraints. HumanCyc3 is an ontology on human metab olic pathways and human genome, created by the SRI International corp oration. Since its intensional part contains nomimals and concrete domain sp ecifications (e.g., a role range is of some datatyp e, a concrete role is functional, etc.) that cannot b e handled by our method, we converted nominals to new atomic concepts and deleted concrete domain sp ecifications. The weakened intensional part still contains disjunctions, functional roles/restrictions and disjointness constraints. LUBM4 is a b enchmark ontology develop ed at the Lehigh University [9]. Since its intensional part has no functional roles, numb er restrictions or disjointness constraints, which implies that it cannot b e inconsistent, we extended it by adding a functional role headOf and an inverse functional role isTaughtBy, where headOf (resp. isTaughtBy) is also defined as an inverse role of isHeadOf (resp. teacherOf), and adding disjointness constraints X ¬N onX for each existing concept name X , where N onX is a new concept name. LUBM comes with an ABox generator. Let LUBMn denote the ontology obtained from the generator by setting the numb er of universities to n. Before testing the prop osed method, the intensional parts of the ab ove ontologies were offline transformed to datalog programs using the KAON2 DL reasoner5 . Each transformation was p erformed in less than one second. Moreover, ABox assertions of the ab ove ontologies were stored into MySQL databases, where duplicated ABox assertions were removed. Table 1 summarizes the complexity of the test ontologies and the datalog programs transformed from their intensional parts, where HumanCyc- denotes the weakened HumanCyc, and LUBMn+ denotes the extended LUBMn. We develop ed a tool, called Injector, to inject a given numb er of conflicts into an ontology. Given a consistent ontology K B and a numb er ncnf of conflicts to b e injected, Injector first deduces into K B all atomic concept assertions that are consequences of K B , then injects ncnf conflicts one by one. Let SF R denote the set of functional/inverse funcsemintec.htm 3 http://humancyc.org/ 4 http://swat.cse.lehigh.edu/projects/lubm/ 5 http://kaon2.semanticweb.org/ 6.1 Test Ontologies and Preparations Semintec is an ontology ab out financial services, created 2 2 http://www.cs.put.poznan.pl/alawrynowicz/ 572 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I tional roles, SD C denote the set of atomic concepts that have disjoint concepts. To inject a conflict, Injector randomly selects an entity in SF R SD C . In case an functional role R is selected, if there exist role assertions over R in K B , Injector randomly selects one, say R(a, b), and app ends R(a, c) and b c to K B , where c is a new individual; otherwise, Injector app ends R(a, b), R(a, c) and b c to K B , where a, b, c are new individuals. In case an inverse functional role R is selected, Injector treats it as R- . In case an atomic concept C is selected, if there exist concept assertions over C in K B , Injector randomly selects one, say C (a), and app ends D(a) to K B for a randomly selected disjoint concept D of C ; otherwise, Injector app ends C (a) and D(a) to K B , where a is a new individual and D a randomly selected disjoint concept of C . Injector was implemented in JAVA, using the Pellet [29] API to deduce all atomic concept assertions that are consequences of a consistent ontology. April 21-25, 2008 · Beijing, China 6.2 Experimental Results We injected different numb er of conflicts to the four test ontologies using Injector, obtaining a set of inconsistent ontologies. We consider the hardest case where all assertions in an obtained ontology are assumed removable. We assume that each assertion is given a removal cost 1. In order to make the implemented system terminate in an acceptable time, we set a time limit of 20 minutes for one call to MiniSat+ . The test results are rep orted in Table 2. In each block, the first row lists ncnf , i.e. the numb er of injected conflicts; the second row lists the total execution time for computing the first minimum cost diagnosis, starting from transforming the input ontology. For Semintec and HumanCyc- , the implemented system cannot handle more injected conflicts in our test environment, b ecause when ncnf = 140 for Semintec (or ncnf = 60 for HumanCyc- ), some call to MiniSat+ exceeds the time limit. In contrast, the implemented system scales well on LUBM1+ /LUBM10+ ontologies with increasing numb er (from 1000) of conflicts. We also collected runtime statistics on the partitioning process. Let K B b e an inconsistent ontology rep orted in Table 2. As can b e seen, the numb er of rules in each grounded repair program (K B ) is up to millions. In addition, the numb er of decision atoms in (K B ) is at least in thousands. We exp erimentally verified that any (K B ), without b eing partitioned, cannot b e handled by MiniSat+ b ecause the execution exceeds the time or memory limit. This shows the imp ortance of the partitioning process. Other statistics show the effectiveness of the partition| ing process. The p ercentage of filtered rules, ||(0B)| , is at K least 11% for all rep orted ontologies (esp., at least 57% for LUBM10+ ontologies). The numb er of disjoint subprograms of (K B ) (i.e. the numb er of partitions), #{i }, increases when the numb er of conflicts increases. This shows that the partitioning process improves the scalability. Table 2 also rep orts the maximum numb er of ground rules in each partition, max(|{i }|), and the maximum numb er of translated decision atoms in each partition, max(|{Xi }|). It can b e seen that the total execution time is roughly dominated by max(|{i }|) and max(|{Xi }|). For Semintec and HumanCyc- , since max(|{Xi }|) is up to tens of thousands, the execution of MiniSat+ over the largest partition quickly exceeds the time limit when the numb er of conflicts increases (as max(|{i }|) increases too). In contrast, for LUBM1+ Table 2: Test results against different number of conflicts ncnf Semintec ncnf 40 60 80 100 120 Time (sec) 53 191 155 298 1144 |(K B )| 159K 241K 230K 371K 636K | 0 | 73.8% 45.1% 53.1% 35.8% 24.9% | ( K B )| #{i } 25 31 40 49 57 max({|i |}) 11K 66K 76K 152K 393K max({|Xi |}) 7054 16114 14687 20241 18794 - HumanCyc ncnf 10 20 30 40 50 Time (sec) 428 326 531 412 867 |(K B )| 598K 607K 620K 604K 639K | 0 | 26.0% 25.6% 25.1% 25.7% 24.4% | ( K B )| #{i } 5 7 10 19 21 max({|i |}) 443K 451K 465K 449K 483K max({|Xi |}) 68155 68159 68175 68181 68197 + LUBM1 ncnf 1000 2000 3000 4000 5000 Time (sec) 202 387 554 814 1010 |(K B )| 537K 978K 1490K 2306K 2877K | 0 | 41.9% 25.9% 18.7% 13.3% 11.4% | ( K B )| #{i } 886 1766 2696 3645 4606 max({|i |}) 43K 146K 120K 244K 540K max({|Xi |}) 808 859 895 901 898 LUBM10+ ncnf 1000 2000 3000 4000 5000 Time (sec) 736 851 1070 1250 1618 |(K B )| 3893K 4111K 4165K 4338K 4753K | 0 | 82.3% 70.8% 65.8% 62.8% 57.3% | ( K B )| #{i } 963 1860 2713 3605 4423 max({|i |}) 26K 24K 31K 43K 34K max({|Xi |}) 810 1571 1961 2610 1393 Note: |(K B )| is the numb er of rules in the grounded repair | program (K B ). ||(0B)| is the p ercentage of rules filtered K out in our partitioning algorithm. #{i } is the numb er of partitions. max({|i |}) is the maximum numb er of ground rules in each partition. max({|Xi |}) is the maximum numb er of translated decision atoms in each partition. and LUBM10+ , since max(|{i }|) or max(|{Xi }|) is stable around a relatively small value, the total execution time increases smoothly when the numb er of conflicts increases. We can conclude that the p erformance of our method dep ends on the effectiveness of the partitioning process. As for what influences such effectiveness when the numb er of assertions and the numb er of conflicts are fixed, we can see from Table 1 and Table 2 that, equality rules have a stronger impact than normal rules; further, disjunctive rules have a stronger impact than equality rules. Hence, we b elieve that our method can handle any real (p opulated) ontology that has up to millions of assertions together with a moderately complex intensional part, which can b e transformed to up to hundreds of datalog rules with a few disjunctive rules and equality rules. 573 WWW 2008 / Refereed Track: Semantic / Data Web - Semantic Web I April 21-25, 2008 · Beijing, China 7. CONCLUSION AND FUTURE WORK A DL-based ontology may b ecome inconsistent after it is p opulated. In this pap er, we prop osed a solution to repair the p opulated ontology by deleting assertions in a minimum cost diagnosis. We first showed the intractability of finding a minimum cost diagnosis, then presented an exact method for computing minimum cost diagnoses for S HI Q ontologies. The method transforms a S HI Q ontology to a set of disjoint prop ositional programs in PTIME w.r.t. data complexity, thus reducing the original problem into a set of indep endent subproblems. Each such subproblem computes an X -MC model and is solvable by applying SAT solvers. We exp erimentally showed that the method can handle moderately complex ontologies with over thousands of assertions, where all assertions can b e assumed removable. Esp ecially, the method scales well on the extended LUBM ontologies with increasing numb er (from 1000) of conflicts. For future work, we plan to enhance our method to cop e with concrete domain sp ecifications, seek feasible approaches to handling nomimals, and work on tractable approximate methods for computing minimum cost diagnoses. 8. ACKNOWLEDGEMENTS This work is supp orted in part by NSFC grants 60673103, 60721061 and 60373052. 9. REFERENCES [1] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003. [2] D. Calvanese, G. D. Giacomo, D. Lemb o, M. Lenzerini, and R. Rosati. Dl-lite: Tractable description logics for ontologies. In Proc. of AAAI'05, pages 602­607, 2005. [3] P. Cimiano and J. V¨lker. Text2onto - a framework o for ontology learning and data-driven change discovery. In Proc. of NLDB'05, pages 227­238, 2005. [4] J. Dolby, J. Fan, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, J. W. Murdock, K. Srinivas, and C. A. Welty. Scalable cleanup of information extraction data using ontologies. In Proc. of ISWC'07, pages 100­113, 2007. [5] N. E´n and N. S¨rensson. Translating pseudo-b oolean e o constraints into sat. JSAT, 2:1­26, 2006. [6] T. Eiter, G. Gottlob, and H. Mannila. Disjunctive datalog. ACM Trans. Database Systems, 22(3):364­418, 1997. [7] T. Eiter, N. Leone, C. Mateis, G. Pfeifer, and F. Scarcello. A deductive system for non-monotonic reasoning. In Proc. of LPNMR'97, pages 364­375, 1997. [8] M. Fitting. First-order Logic and Automated Theorem Proving (2nd ed.). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996. [9] Y. Guo, Z. Pan, and J. Heflin. Lubm: A b enchmark for owl knowledge base systems. J. Web Sem., 3(2­3):158­182, 2005. [10] I. Horrocks, U. Sattler, and S. Tobies. Practical reasoning for very expressive description logics. Logic Journal of the IGPL, 8(3), 2000. [11] Z. Huang, F. van Harmelen, and A. ten Teije. Reasoning with inconsistent ontologies. In Proc. of IJCAI'05, pages 454­459, 2005. [12] U. Hustadt, B. Motik, and U. Sattler. Reducing S HI Q- description logic to disjunctive datalog programs. In Proc. of KR'04, pages 152­162, 2004. [13] A. Kalyanpur, B. Parsia, E. Sirin, and B. C. Grau. Repairing unsatisfiable concepts in owl ontologies. In Proc. of ESWC'06, pages 170­184, 2006. [14] A. Kalyanpur, B. Parsia, E. Sirin, and J. Hendler. Debugging unsatisfiable classes in owl ontologies. J. Web Sem., 3(4):268­293, 2005. [15] R. M. Karp. Reducibility among combinatorial problems. In Proc. of a Symposium on the Complexity of Computer Computations, pages 85­103, 1972. [16] S. C. Lam, J. Z. Pan, D. H. Sleeman, and W. W. Vasconcelos. A fine-grained approach to resolving unsatisfiable ontologies. In Proc. of WI'06, pages 428­434, 2006. [17] D. Lemb o and M. Ruzzi. Consistent query answering over description logic ontologies. In Proc. of RR'07, pages 194­208, 2007. [18] N. Leone, G. Pfeifer, W. Fab er, T. Eiter, G. Gottlob, S. Perri, and F. Scarcello. The dlv system for knowledge representation and reasoning. ACM Trans. Comput. Log., 7(3):499­562, 2006. [19] T. Meyer, K. Lee, and R. Booth. Knowledge integration for description logics. In Proc. of AAAI'05, pages 645­650, 2005. [20] T. Meyer, K. Lee, R. Booth, and J. Z. Pan. Finding maximally satisfiable terminologies for the description logic ALC . In Proc. of AAAI'06, pages 269­274, 2006. [21] B. Motik. Reasoning in Description Logics using Resolution and Deductive Databases. PhD thesis, Univesit¨t karlsruhe, Germany, Jan. 2006. a [22] P. F. Patel-Schneider, P. Hayes, and I. Horrocks. OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation, 2004. http://www.w3.org/TR/owl- semantics/. [23] G. Qi, W. Liu, and D. A. Bell. Knowledge base revision in description logics. In Proc. of JELIA'06, pages 386­398, 2006. [24] R. Reiter. A theory of diagnosis from first principles. Artif. Intel l., 32(1):57­95, 1987. [25] S. Schlobach. Debugging and semantic clarification by pinp ointing. In Proc. of ESWC'05, pages 226­240, 2005. [26] S. Schlobach. Diagnosing terminologies. In Proc. of AAAI'05, pages 670­675, 2005. [27] S. Schlobach and R. Cornet. Non-standard reasoning services for the debugging of description logic terminologies. In Proc. of IJCAI'03, pages 355­362, 2003. [28] H. M. Sheini and K. A. Sakallah. Pueblo: A hybrid pseudo-b oolean sat solver. JSAT, 2:157­181, 2006. [29] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical owl-dl reasoner. J. Web Sem., 5(2):51­53, 2007. [30] C. Welty and J. W. Murdock. Towards knowledge acquisition from information extraction. In Proc. of ISWC'06, pages 152­162, 2006. 574