WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations April 21-25, 2008 · Beijing, China Towards a Global Schema for Web Entities Conglei Yao, Yongjian Yu, Sicong Shou, Xiaoming Li Depar tment of Computer Science and Technology Peking University Beijing, 100871, P. R. China {ycl, yyj, ssc}@net.pku.edu.cn, lxm@pku.edu.cn ABSTRACT Popular entities often have thousands of instances on the Web. In this pap er, we focus on the case where they are presented in table-like format, namely app earing with their attribute names. It is observed that, on one hand, for the same entity, different web pages often incorp orate different attributes; on the other, for the same attribute, different web pages often use different attribute names (lab els). Therefore, it is imaginably difficult to produce a global attribute schema for all the web entities of a given entity typ e based on their web instances, although the global attribute schema is usually highly desired in web entity instances integration and web ob ject extraction. To this end, we prop ose a novel framework of automatically learning a global attribute schema for all web entities of one sp ecific entity typ e. Under this framework, an iterative instances extraction procedure is first employed to extract sufficient web entity instances to discover enough attribute lab els. Next, based on the lab els, entity instances, and related web pages, a maximum entropy-based schema discovery approach is adopted to learn the global attribute schema for the target entity typ e. Exp erimental results on the Chinese Web achieve weighted average Fscores of 0.7122 and 0.7123 on two global attribute schemas for p erson-typ e and movie-typ e web entities, resp ectively. These results show that our framework is general, efficient and effective. Figure 1: Two person instances Categories and Subject Descriptors I.2.6 [Artificial Intelligence]: Learning--Know ledge acquisition General Terms Algorithms, Exp erimentation Keywords Web Entities, Attribute Lab els, Entity Instances, Global Attribute Schema 1. INTRODUCTION With the rapid expansion of the Web, more and more entity instances, such as p ersons, b ooks and movies, are app earing on the Web. Most of them exist in structured web Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. pages where their contents are organized in table-like format. Each page has a set of data regions, and each region stands for an entity instance. It's a main research line to identify data regions and extract corresp onding entity instances from structured pages [15, 22, 20, 24] in Web mining. The entity instances, defined as data records [15, 22] or web objects [20, 24], often include names, attribute values, and lab els which describ e the attribute typ es. For example, b oth instances in Figure 1 contain several attributes regarding the same p erson. For the attribute type "Birthday" , the first instance uses Born as its lab el, which is different from Birthday as used in the second. Therefore, we should make sure they are of the same attribute typ e if we want to integrate these two instances. In fact, it's common for entities of one entity typ e to have many typ es of attributes with each of them describ ed by different lab els. Consequently, to integrate the web entity instances of the same entity typ e, we have to establish a map b etween lab els and their typ es. To establish this map, a global attribute schema, which reflects the main attribute typ es and their descriptive lab els, must b e constructed in advance. We focus on automatically learning a global entity attribute schema for all the web entities of a sp ecific entity typ e. For example, for p erson-typ e entities, a basic description that a p erson entity should contain attributes of name, gender, birthday, birthplace and weight is provided by the users. Then, based on this description, we can locate related web pages to extract p erson entity instances, mine other latent attributes, and then use the new attributes to locate more pages for more instances. This process runs iteratively until necessary attribute lab els for the main attribute types have b een discovered. Based on these attribute lab els, entity instances and related pages, we discover the relations b etween different attribute lab els, and, finally, construct the global entity attribute schema for p erson-typ e web entities. The global attribute schema for web entities of a sp ecific entity typ e is essential to integrate web entity instances. Moreover, from the global schema, the typ es of the imp or- 999 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations tant attributes owned by the web entities of the target entity typ e can b e automatically identified, and then used as the input to web ob ject extraction systems to extract web objects with these imp ortant attributes. And this makes the results of these systems more valuable and reasonable. However, for this aim, there are obvious difficulties. On the one hand, enough frequent attribute lab els must b e extracted from the Web, and they should also b e able to represent the main attribute typ es which constitute the global schema. On the other hand, the numb er of frequent lab els in learning a global attribute schema is usually large and their qualities are not uniformly high, due to the variety and complexity of web entity instances. In this pap er, we prop ose the problem of learning a global attribute schema for all the web entities of a sp ecific entity typ e for the first time, and present a novel framework to address this problem. Under this framework, first, we prop ose an iterative web entity instances extraction approach to extract enough web entity instances as well as frequent attribute lab els to learn a global attribute schema. Next, we present a maximum entropy-based method to automatically learn a global attribute schema based on the frequent lab els, entity instances and related pages. Finally, we demonstrate our technique on p erson-typ e and movie-typ e entities on the Chinese Web, and exp erimental results show that our technique is general, efficient and effective. This article is further structured as follows. Section 2 discusses related works. Section 3 formulates the problem of learning a global attribute schema. Section 4 presents the iterative method of entity instances extraction. Section 5 prop oses the maximum entropy-based approach of schema discovery. Section 6 illustrates the exp eriments and analyzes the exp erimental results. Section 7 summarizes this pap er. April 21-25, 2008 · Beijing, China constructed at different levels: giving a schema for a set of logically related sites by viewing them as a whole [2]; examining a single site [3]; or, structuring single page. Similar to these studies, our research aims at learning a schema for web entities. However, our study is different from them in three asp ects. First, we target learning a schema from large numb ers of entity instances and related web pages, instead of from a single page, a single site or a set of logically related sites. Second, our study mines the attribute lab els, entity instances and related pages, instead of analyzing the structure of HTML documents. And lastly, our study seeks to learn a global schema for all the web entities of a sp ecific entity typ e, while previous related studies aimed to extract the schema of web data app earing in some sites or pages. 2.2 Schema Matching Schema matching aims at discovering semantic corresp ondences of attributes of schemas across heterogeneous sources. Previous studies can b e classified into schema-level matching [17, 5, 9, 16] and instance-level matching [14, 6, 12, 4]. A recent research [21] prop oses a unified solution to address the two corresp onding schema matching problems of intra-site and inter-site. [8] assumes that a hidden generative distribution exists in related schemas, and presents a statistical schema matching method to reconstruct the hidden generative distribution based on the input schemas. [10] addresses the problem of complex schema matching, which matches a set of m attributes to another set of n attributes from two individual schemas. In our research, we first look for frequent attribute lab els in entity instances from different pages, and then combine lab els which b elong to the same attribute typ e. Naturally, the attribute lab els corresp ond to elements in schemas. However, our research is unlike schema matching. First, schema matching integrates different but related schemas, which are derived from different data resources. Meanwhile, our research combines attribute lab els extracted from web entity instances to learn a global attribute schema. Second, the schemas used in schema matching are extracted from regular data sources, such as deep web databases, whose quality is high and their vocabularies are relatively small, while our attribute lab els are collected from surface web pages, whose quality is not uniformly high and their quantity is relatively large. 2. RELATED WORK To the b est of our knowledge, no research has yet adequately addressed the problem of learning a global attribute schema from the Web for entities of a given entity typ e. The previous study in [8] seeks to discover hidden schema model for query interfaces on deep Web. This study focuses on finding a hidden schema model to match schemas of query interfaces in the same domain. Its goal is the same as ours in finding a global schema. However, its method is effective and efficient only when the numb er of attribute lab els in query interfaces is relatively small, b ecause it creates the global schema based on all p ossible lab el partitions, and its time complexity is exp onential to the numb er of lab els. Moreover, it assumes that attribute lab els in each query interface's schema are of different typ es, which is natural in deep Web. However, in our research, the attribute lab els and the p ossible lab el partitions are b oth far more numerous than the aforementioned study. Further, the attribute lab els in one entity instance are not always different typ es. Besides, the other related researches can b e summarized in the following asp ects: 3. PROBLEM FORMULATION The general problem of learning a global attribute schema can b e formulated as follows: Formally, assume we are given a basic description for entities of a sp ecific entity typ e in Minit , which is provided by users under the framework of a predefined entity model M . Minit presents users' knowledge ab out the entities of the target entity typ e. Then, our goal is to get a global attribute schema for all the web entities of the target entity typ e, which is denoted as S and is learned based on Minit and related web pages. To describ e the problem clearly, we first define a concept "entity" as follows: Definition 1 An entity is something that has a distinct, separate existence, and owns an entity typ e and all the attributes which are owned by al l the entites with the same entity typ e. A web entity is an entity whose instances are 2.1 Schema Extraction Schema extraction is the task of automatically discovering implicit structural information from semi-structured data. Web data b elongs to semi-structured data class, i.e., data with self-contained schema [1]. Previous studies [19, 11, 3] discover schema for web data by utilizing the structure of web pages. In these studies, schema for web data can b e 1000 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations Initial entity model Web page set Entity instances extraction Attribute label set Entity instance set Search engines Global schema discovery A global attribute schema April 21-25, 2008 · Beijing, China Figure 2: The framework of learning a global attribute schema presented in web pages. Formally, an entity can b e represented as e = {et, (t1 , v1 ), . . . , (ti , vi ), . . ., (tn , vn )}, where et is the entity typ e, ti and vi is the typ e and value of ith attribute of e, resp ectively. Definition 2 An entity instance is a partial container of a sp ecific entity. An entity instance contains some attribute lab els and their values of the entity. Formally, an entity instance can b e represented as inst = {(l1 , v1 ), . . . , (li , vi ), . . ., (ln , vn )}, where li and vi is the lab el and value of ith attribute of inst. For example, the two p erson instances in Figure 1 are b oth entity instances of the same entity. Definition 3 An entity model is a model which characterizes the features of all the entities of a sp ecific entity typ e, which include their entity typ e, the typ es of their attributes, and the lab els describing their attributes. Formally, an entity model is represented as M = {et, L1 , . . . , Li , . . . , Ln }, where et is the entity typ e, and Li is a set of lab els describing one attribute typ e. Definition 4 An initial entity model is a sp ecial entity model provided by users to present the basic description of all the entities of the target entity typ e. Due to the limitation of users' knowledge, its content only covers a small fraction of the overall knowledge of all the target entities, but it is enough to characterize them. As a result, an initial entity model must contain the entity typ e and some sets of attribute lab els. An initial entity model can b e formally represented as Minit = {et, L1 , . . . , Lj , . . . , Ln }, where for Minit , M , |M | |Minit |, and for Lj Minit , Li M , Lj Li . To learn a global attribute schema, an attribute schema for all the entities of a target entity typ e is defined as follows: Definition 5 An attribute schema for all the entities of a target entity typ e is a schema ab out attribute typ es owned by these entities. It reflects the typ es of attributes owned by these entities and the lab els describing each attribute type. Similarly, a global attribute schema for all the entities of a sp ecific entity typ e is the schema which defines all the main attribute typ es and lab els describing each attribute typ e. An attribute schema can b e represented as S = {(t1 , L1 ), . . ., (ti , Li ), . . . , (tn , Ln )}, where ti is the ith attribute typ e and can b e initialized by one attribute lab el in Li , and Li is the lab el set of the ith attribute. Based on the ab ove definitions, our problem can b e formulated as follows: Given an initial entity model Minit provided by users under an entity model framework M , automatically extract enough entity instances {insti }n 1 from the Web, mine attributes lab els i= Algorithm 1 The iterative entity instances extraction algorithm Input: an initial entity model Minit . Output: an attribute lab el set L, an entity instance set I , and a related page set P . 1: initialize three empty sets L, I and P ; 2: create a query set Q = {qi }n 1 based on Minit ; i= 3: for each query qi in Q do 4: issue qi to a search engine S E , get the retrieved page set Ri ; 5: for each page rij in Ri do 6: if rij is a structured page then 7: extract entity instances from rij , put them into I , and put rij into P ; 8: end if 9: end for 10: end for 11: if iteration termination criteria is fulfilled then 12: put all attribute lab els in I into L, and terminate the algorithm; 13: else 14: create one-element lab el set for each attribute lab el in I , put all the lab el sets into Minit , and go to 2 to continue; 15: end if from {insti }n 1 and learn an global attribute schema S usi= ing attributes lab els through the mining of {insti }n 1 and i= related web pages. 4. ENTITY INSTANCES EXTRACTION The framework of learning a global attribute schema is illustrated in Figure 2, in which the first step is to iteratively extract enough entity instances that contain sufficient frequent lab els to discover the global schema. To achieve this goal, first, we automatically locate related structured pages which contain entity instances. Then, for each page, we extract entity instances. In this step, the emphasis is to determine whether the extracted entity instances are enough to obtain sufficient frequent lab els to terminate the extraction process. However, we could not know how many frequent lab els are enough in advance. To determine that numb er, we need to observe whether the numb er of frequent lab els converges after the creation of a certain numb er of new instances or new iterations. We prop ose an iterative algorithm as in Algorithm 1 to automatically locate related structured pages and extract enough entity instances inside. There are three imp ortant issues in this algorithm as follows: 1. Query construction Each query consists of several attribute lab els in Minit , and the query set created in each iteration does not include the queries created in previous iterations. Moreover, in order to locate structured pages containing entity instances of the target entity typ e, each query should contain enough lab els in Minit . If the lab els in a query is not enough, unrelated pages will b e retrieved. For example, the query {`Name' AND `Age' AND `Weight'} will locate many target structured pages which contain p erson instances, while the query {`Name' AND `Weight'} will locate many other unrelated pages. Furthermore, to retrieve 1001 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations enough pages to extract enough instances, the lab els in one query should not b e excessive b ecause the query with excessive lab els will match only a few related pages. For instance, the query {`Name' AND `Gender' AND `Age' AND `Birthday' AND `Birthplace' AND `Height' AND `Weight' AND `Race'} will match fewer pages than the query {`Name' AND `Gender' AND `Age' AND `Birthday'}. Our exp erimental result indicates that the prop er numb er of lab els in one query is 4. 2. Entity instances extraction After retrieving related pages, we attempt to extract their containing entity instances. For each page, we first identify its data regions with the help of related mining techniques such as [15]. If one or more regions are discovered, which implies that this page might b e a structured page with entity instances­then we scan each region to count the lab el numb er corresp onding to the query in that region. If the ratio b etween this numb er and the numb er of lab els in the query reaches a certain threshold (0.7 in our exp eriments), this region is regarded as a candidate data region containing one entity instance. We then extract this region, mine the pattern of the lab els in the query app earing in this region, and utilize the pattern to locate each lab el with its value. 3. Iteration termination criteria To get enough frequent lab els, the termination criteria of the iterative extraction process is a key p oint. We assume that the frequency of a frequent lab el is ab ove a certain threshold, and such lab els are sufficient if no new frequent lab el is discovered after a predefined numb er ( ) of new instances have b een extracted, or a predefined numb er ( ) of continuous iterations have b een p erformed. Then, after each iteration, if at least (300 in our exp eriments) new instances are extracted without new lab els, we can conclude that the lab els are sufficient and iterations can b e terminated. And if (7 in our exp eriments) continuous iterations have b een p erformed while the numb er of new instances are b elow , we consider that the lab els are sufficient and iterations should also b e terminated. Exp erimental results in section 6 affirm the effectiveness of our iteration termination criteria. April 21-25, 2008 · Beijing, China lab els emb edded in I . The constraints can b e represented as two factors: labels concurrence factor and same entity-value factor. 5.1.1 Two factors Labels concurrence factor: This factor is to measure the probability that two lab els are of different attribute typ es. Its main idea is that, it is p ossible for two lab els, which often app ear together in entity instances, to b e of different attribute typ es. This is based on the observation that two attribute lab els app earing in one entity instances are often used to describ e different attribute typ es. Strictly sp eaking, when two lab els app ear in one instance, they should not describ e the same attribute typ e. However, due to the diversity of web entity instances, there are some instances in which two lab els are actually of the same attribute typ e. To measure the probability that two lab els have different attribute typ es, we define the labels concurrence factor as follows : C ( li , l j ) = |Ii Ij | min(|Ii |, |Ij |) (1) 5. GLOBAL SCHEMA DISCOVERY Assume the set of frequent lab els is L = {li }n 1 , the set of i= web entity instances is I = {instj }m 1 , and the set of related j= pages is P = {pk }t =1 , our goal is to mine the content of L, I k and P to combine lab els of the same attribute typ e together, and, finally, to create the attribute schema S . To achieve this goal, we first preprocess L using the constraints emb edded in I to generate candidate attribute typ es, and get two sets of lab el sets C and L . Each element of C represents a unique candidate attribute typ e and consists of attribute lab els b elonging to this attribute typ e. Each element of L also consists of attribute lab els b elonging to the same attribute typ e, but it does not represent a unique attribute typ e. In other words, the attribute typ e represented by each L 's element is indep endent of the attribute typ es represented by another elements. Then, based on C and L , we prop ose a maximum entropy-based method to discover the attribute typ es hidden in L, and learn the global attribute schema. where li and lj are two lab els in L, Ii and Ij are two sets of instances which contain li and lj , resp ectively. Same entity-value factor: This factor measures the probability of two lab els b eing of the same attribute typ e. The rationale b ehind this factor is that, when two lab els often app ear with the same value in different instances of the same entity, it is p ossible for them to b e of the same attribute typ e. This inference is derived from the observation that different web pages often describ e the same attribute typ e in the same entity's instances by using different lab els. Then, if two instances are found to b elong to the same entity (which will b e discussed later), each pair of their lab els app earing in them with the same value could b e considered to b e of the same attribute typ e. However, this inference may not stand for all situations, b ecause some lab els from different instances of the same entity may often contain some meaningless values such as "unknown", "none", and "secret", when these lab els are actually not of the same attribute typ e. Consequently, we define the same entity-value factor as follows to measure the probability of two lab els li and lj b eing of the same attribute typ e: S ( li , l j ) = |I Pij | |I Pij | (2) where I Pij = {(instmi , instmj )|S ameE ntity (instmi , instmj ) = tr ue, vij , (li , vij ) instmi , (lj , vij ) instmj } and I Pij = {(instmi , instmj )|S ameE ntity (instmi , instmj ) = tr ue, vi , (li , vi ) instmi , vj , (lj , vj ) instmj } S ameE ntity (instmi , instmj ) = tr ue means instmi and instmj are actually of the same entity, which is determined by the names and the overlap of attribute values of two instances. The overlap of attribute values can b e calculated as O(instmi , instmj ) = |Vmi Vmj | min(|Vmi |, |Vmj |) 5.1 Candidate attribute types generation The aim of candidate attribute types generation is to create two sets C and L based on L by utilizing the constraints on where Vmi and Vmj are the attribute value sets of instmi and instmj , resp ectively. Two instances instmi and instmj are considered as the same entity's instances if their names are 1002 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations Algorithm 2 The algorithm of candidate attribute typ es generation Input: one lab el set L. Output: two set of lab el sets C and L . 1: initialize two empty sets C and L ; 2: group all lab els in L which are of the same attribute typ e into one group, and put each group into L ; 3: repeat 4: for each element Li in L do 5: if Li represents an attribute typ e different from typ es in C then 6: put Li into C , and remove it from L ; 7: terminate the f or loop; 8: end if 9: end for 10: until no new element is put into C April 21-25, 2008 · Beijing, China Algorithm 3 The algorithm of global schema discovery Input: two sets of lab el set C and L . Output: an attribute schema S . 1: initialize an empty schema S ; 2: for each element Ci in C do 3: select lab el lij with maximum frequency from Ci ; 4: ti lij , and put (ti , Ci ) into S ; 5: end for 6: for each element Lk in L do 7: if Lk represents a new attribute typ e then 8: select lab el lkm with maximum frequency from Lk ; 9: tk lkm , and put (tk , Lk ) into S ; 10: e l se 11: discover the attribute typ e tl in S which Lk b elongs to, and put all lab els in Lk into tl 's lab el set Ll ; 12: end if 13: end for the same and the overlap of attribute values O(instmi , instmj ) is ab ove a certain threshold. This assessment is simple and effective, as illustrated in section 6. 5.2.1 Maximum entropy-based method Maximum entropy model is a general-purp ose statistical model that can freely incorp orate various problem-sp ecific knowledge in terms of features which are not required to b e strongly indep endent. As a result, one can choose arbitrary features to reflect the characteristics of the problem domain as faithfully as p ossible. As in our problem, given an attribute schema S , and a lab el set Lk whose elements are of the same attribute typ e, our goal is to determine whether lab els in Lk are of a new attribute typ e, or of an existing attribute typ e in S . And this determination is related to various dep endent features, and is prop er to b e solved using maximum entropy model. |Lk | Supp ose Lk = {lh }h=1 , for each lab el lh Lk , there are some entity instances which contain lh in the instances set I . For each instance instj which app ears in web page wj and contains lh , we can calculate a probability of lh b eing of an attribute typ e ti in S as p(ti |chj ), where chj is a context of lh and is comp osed of some features created based on lh , instj and wj . Then, for Lk , we can calculate its average probability b eing of the attribute typ e ti as P P|Il | j =1 p(ti |chj ) 5.1.2 Generating candidate attribute types Based on the two factors defined ab ove, we can generate candidate attribute typ es from the lab el set L using Algorithm 2. In this algorithm, two issues need to b e explained: · Two lab els li and lj have the same attribute typ e if their same entity-value factor S (li , lj ) is ab ove a certain threshold . And this relation can b e transferred to more lab els. As a result, we can group lab els which are of the same attribute typ e into one set, as illustrated in 2nd line in Algorithm 2. · Two lab els li and lj have different attribute typ es if their labels concurrence factor C (li , lj ) is ab ove a certain threshold . Furthermore, for two attribute lab el sets L1 and L2 where lab els in each of them have the same attribute typ e, if one lab el in L1 and another in L2 have different attribute typ es, all the lab els in L1 have a different attribute typ e from those of L2 . By this means, we can determine whether Li represents a new attribute typ e different from typ es in C , as in 5th line in Algorithm 2. Av er P (Lk , ti ) = h |Il | |Lk | (3) 5.2 Discovering global schema After generating candidate attribute lab els, we get C and L based on the original lab el set L. To learn a global attribute schema, first, we initialize the schema S using C , since each element of C represents a unique attribute typ e. Next, by utilizing our prop osed maximum entropy-based method, we find new attribute typ es represented by some elements of L and group them with their lab els into S , or find the attribute typ es in S to which some elements of L b elong and group lab els in these elements into the corresp onding elements of S . The algorithm of global schema discovery is illustrated in Algorithm 3. The frequency of a lab el in the ab ove algorithm is the p ercentage of the extracted entity instances owning the label in instance set I in all the extracted instances. Moreover, the key p oint of this algorithm is to use the maximum entropybased method to determine whether a set of lab els represents a new attribute typ e, which will b e explained in detail. where Il is a subset of I and all instances in Il contain the lab el lh . If Av er P (Lk , ti ) P , where P is a certain threshold, and Lk can not b e determined by labels concurrence factor to b e of different attribute typ e from ti , then lab els in Lk are of the attribute typ e of ti ; Else, lab els in Lk are not of the typ e ti . Moreover, if lab els in Lk are determined to b e not of all the typ es in S , they should b e of a new attribute typ e. Given the instance instj and the web page wj , the probability of lh b eing of the attribute typ e ti is p(ti , chj ) p(ti |chj ) = P i p(ti , chj ) (4) Under the maximum entropy framework, let ti b e t, chj b e c, p(ti , chj ) can b e replaced as p(t, c), which is the joint probability of t and c, and maximizes the entropy H (p). X H (p ) = - p(t, c) log p(t, c) 1003 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations Table 1: Feature "templates" on the current "scene" sch Features "templates" lh+1 = X lh-1 = X vh = X th+1 = X th+1 = X w-1 = X w1 = X April 21-25, 2008 · Beijing, China Table 2: Four initial entity models for person-type and movie-type entities Model Minit0 Minit1 Minit2 Minit3 Details of model {per son, {N ame}, {Ag e}, {Gender, S ex}, {B ir thplace}, {B ir thday , B ir thtime}, {O ccupation}} {per son, {N ame}, {Ag e}, {Gender }, {W eig ht}} {movie, {T itle}, {D ir ector }, {Lang uag e}, {Reg ion}} {movie, {T itle}, {D ir ector }, {Actor }, {Genr e, T y pe}, {Running time}} &th &th &th &th &th &th &th =T =T =T =T =T =T =T where p(t, c) is the observed distribution of attribute typ es ~ and the contexts of attribute lab els in training data, and fj (t, c) is a feature created on the basis of attribute typ e t and attribute lab el context c. The distribution p(t, c) in ab ove constaints is given by p(t, c) = k Y under the following constraints X X p(t, c)fj (p, c) = p(t, c)fj (t, c), 1 j k ~ For this feature, each entity instance in training set is automatically scanned to initialize it. If lh is found in one instance with attribute typ e "Birthday", and "Gender" is the lab el b ehind it, fj (sch , lh ) is to 1 in this instance; else, it's set to 0. 6. EXPERIMENTS The prop osed iterative entity instances extraction method and global attribute schema learning approach are fully implemented and evaluated on p erson-typ e and movie-typ e entities. The goal of the exp erimental study are:(i) to check the p erformance of the iterative entity instances extraction algorithm; (ii) to discover the effectiveness of candidate attribute typ es generation method, (iii) to evaluate the p erformance of our global attribute schema learning method. j j f (t , c ) j =1 where is a normalization factor, the j s are the unknown parameters of the model, and each j corresp onds to a fj (t, c) and can b e seen as the weight of fj (t, c). There are several algorithms designed to estimate these unknown parameters, and we select Limited-Memory Variable Metric b ecause it has b een proved to b e esp ecially effective[18]. In our problem, the context of an attribute lab el is based on the set of lab els, values, typ es, and words surrounding the lab el in the web page, which can b e seen as the "scene" of the lab el. The "scene" of lab el lh in an instance instj which app ears in a web page wj is sch = {lh , lh+1 , lh-1 , vh , th+1 , th-1 , w-1 , w1 } where lh-1 and lh+1 are the lab els b efore and b ehind lh in instj , vh is the value of lh , th-1 and th+1 are the typ e of lh-1 and lh+1 , and w-1 and w1 are the words b efore and b ehind lh in wj . If lh is the first lab el in instj , lh-1 and th-1 are defined to b e sp ecial characters " H" and " TH " ; if lh is the last lab el in instj , lh+1 and th+1 are defined to b e characters " E " and " TE ". Similary, if lh is the first word in wj , w-1 is set to " FW ", and if it's the last word, w1 is set t o " LW " . Based on the "scene" of lh , the features can b e created by scanning each pair (sch , th ) in the training data with the feature "templates" given in Table 1, where th is the attribute typ e of the lab el with the "scene" sch , and the training data is created using the attribute typ es in the current schema S . Given sch as the current "scene", a feature always asks some yes/no question ab out sch , and constrains th to a certain attribute typ e. The instantiations for the variables X and T in Table 1 are obtained by automatically scanning training data. For example, for a lab el "Birth time" named lh , one feature with scene sch might b e 1 if lh+1 ="Gender"& th = "Birthday" ; fj (sch , lh ) = 0 otherwise. 6.1 Experiment Design We select Google as the search engine to extract entity instances on the Chinese Web. Due to the limitation of Google's Web interface, we expand each query for several times using Recursive Query Expansion (RQE) [7], based on a set of high-frequency words in a standard Chinese Web collection named CWT100G (http://www.cwirf.org). We select p erson-typ e and movie-typ e entities to evaluate the p erformance of our technique. For entities of each entity typ e, we sp ecify two different initial models and guarantee each model is able to characterize them. We list these four initial models in Table 2, where Minit0 and Minit1 are models for p erson-typ e entities, and the others are for movie-typ e entities. To the b est of our knowledge, our research is the first study aimed at learning a global attribute schema for all the web entities of a sp ecific entity typ e. The only relevant study is [8], which also aims at mining a global schema from an attribute lab el set. While, this study requires high-quality schemas as input and is sensitive to errors in schemas, and its efficiency decreases sharply with the increase of the unique lab els' numb er in schemas. We select the algorithm M GSsd in [8] as the baseline, and implement it to compare the result's quality and the algorithm's efficiency with our method. 6.2 Evaluation Metrics To evaluate the p erformance of the iterative entity instances extraction algorithm, we utilize three metrics: precision, entity-type error ratio, and attribute-label error ratio. Precision is defined as the p ercentage of the correct instances, which are truly of the sp ecified entity typ e and all their elements (pairs of lab el and value) b elong to the corresp onding entities, in all the extracted instances. Entity-type error ratio is the ratio of instances which are not of the 1004 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations April 21-25, 2008 · Beijing, China Table 3: Extraction results of the four initial entity models Model Minit0 Minit1 Minit2 Minit3 # iterations 6 8 10 9 # time (hour) 90 110 86 79 # instances 111879 129954 107798 93853 # pages 556211 666742 257665 232348 # frequent labels 839 863 406 383 # unfrequent labels 9012 9264 1849 1713 900 450 Number of frequent labels 800 700 600 500 400 300 200 100 0 0 2 4 6 8 Number of frequent labels Minit0 Minit1 400 350 300 250 200 150 100 50 0 0 2 4 6 8 10 12 Minit2 Minit3 10 4 12 14 Number of instances (*10 ) Number of instances (*104) Figure 3: Number of frequent labels for different person-type entity instances Figure 4: Number of frequent labels for different movie-type entity instances Table 4: Evaluation of the extracted entity instances Entity type person-type movie-type Precision 93% 87.5% Entity-type error ratio 1% 0.5% Attribute-label error ratio 6% 12% sp ecified entity typ e, to all the extracted instances. And attribute-label error ratio is the ratio of the extracted instances in which some elements actually do not b elong to the corresp onding entities, to all the extracted instances. The reason for the entity-type error is that the iterative instances extraction process might locate some pages containing other information which is wrongly extracted as the desired entity instances. And the reason for the attribute-label error is that the structures of some pages are too flexible to accurately mine the data regions from them. Actually, the learning of an attribute schema can b e seen as the clustering of lab els with the aim of grouping lab els of the same attribute typ e into the same cluster. Assume the gold schema is Sgold = {(t1 , L1 ), . . ., (ti , Li ), . . . , (tn , Ln )}, and the learned schema is Slearned = {(t1 , L1 ), . . . , (tj , Lj ), . . . , (tm , Lm )}. Then, the lab els clustering result corresp onding to Sgold is Cgold = {L1 , . . . , Li , . . . , Ln }, and lab els clustering result corresp onding to Slearned is Clearned = {L1 , . . . , Li , . . . , Lm }. We utilize weighted average Fscore (waFscore ), which is often used in clustering evaluation[13, 23], as the evaluation metric to measure the quality of the learned schema Slearned , using the waFscore of Clearned compared with Cgold . 6.3 Experiment Results 6.3.1 Iterative entity instances extraction The results using different initial models are illustrated in Table 3, in which frequent lab els are those whose frequencies are ab ove 0.005. For p erson-typ e entities, b oth models (Minit0 and Minit1 ) are capable of extracting enough instances. Figure 3 shows the change of the frequent lab els' numb er with the increase of the numb er of extracted p erson instances. It shows that for b oth models, the numb er of frequent lab els increases sharply at the b eginning of extraction process, and b egin to converge with the progress of extraction process. Moreover, in the 839 frequent lab els created by Minit0 , 826 lab els also app ear in the 863 lab els created by Minit1 , which indicates that most of the frequent lab els have b een discovered. In the following exp erimental results, we select Minit0 's result as the working basis for p erson-typ e entities. For movie-typ e instances, the two models are also capable of extracting enough instances, and the change of the frequent lab els' numb er with the increase of the extracted movie instances' numb er is illustrated in Figure 4. The same conclusion can b e drawn from this figure. Furthermore, it is observed that 376 frequent lab els created by Minit3 also app ear in the 406 lab els created by Minit2 , and we select Minit3 's result as the working basis for movie-typ e entities. To evaluate the quality of the extracted instances, we randomly select 200 instances from the results of Minit0 and Minit3 , resp ectively, and do the evaluation by manually checking them. The result is illustrated in Table 4. As we can see our algorithm could extract instances of sp ecified entity typ e precisely, at the same time some errors exist for the errors in the mining of data regions. Moreover, in order to find out why the attribute-label error ratio of movie-typ e entities (12%) is higher than that of p erson-typ e entities (6%), we manually check the pages including these instances, and find that movie entity instances often app ear in some arbitrary created pages with other movie instances, and the b orders b etween them are not clearly identified by HTML tags, which makes the data region mining technique invalid when processing these pages. While p erson entity instances often app ear in well-structured pages which include the resumes of p ersons, and data regions can b e accurately mined from these pages. 6.3.2 Candidate attribute types generation In this section, we first evaluate the p erformance of labels concurrence factor and same entity-value factor, and then rep ort the generated candidate attribute typ es. 1005 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations 1 0.99 0.98 Movie-type Person-type 1 0.99 0.98 April 21-25, 2008 · Beijing, China Precision Precision 0.97 0.96 0.95 0.94 0.93 0.92 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Recall 1 0.97 0.96 0.95 0.94 0.93 0 0.2 0.4 0.6 Threshold 0.8 1 Movie-type Person-type Figure 5: Precision and recall of labels concurrence factor for different thresholds Figure 6: Precision of judgment of same entity for different thresholds 1 Movie-type Person-type Labels concurrence factor: For p erson-typ e entities, we randomly select 110 frequent attribute lab els, compute labels concurrence factor for each pair of them, and get 2672 pairs with nonzero labels concurrence factors. Then, we manually evaluate whether lab els in these pairs are actually of different typ es and find 2606 pairs of lab els which are of different typ es. Finally, we utilize our algorithm to automatically judge whether two lab els are of different typ es, by determining whether their labels concurrence factors is ab ove a certain threshold. Similarly, we randomly select 55 frequent lab els for movie-typ e entities, and do the same exp eriment. We get 1286 pairs with nonzero labels concurrence factors and find 1201 pairs of lab els which are of different typ es. Figure 5 shows the relation b etween the precision and recall of the judgment when using different thresholds. We can see that the precision of judgment for p erson-typ e entities is always higher than that of movie-typ e entities. It is b ecause the attribute-label error ratios of p erson-typ e entities are lower than those of movie-typ e entities as illustrated in section 6.3.1, and more pairs of movie-typ e entities' lab els which are not of different typ es app ear together in some instances due to higher attribute-label error ratios. To get precise judgment, we select 0.15 as the b est threshold for p erson-typ e entities, and 0.18 as the b est threshold for movie-typ e entities. Same entity-value factor: The basis of this factor is the judgment of whether two instances are of the same entity. For p erson-typ e instances, we randomly select 50 p erson names from extracted instances and make sure each p erson name has at least 3 different instances, and totally get 176 instances. Then, we calculate overlaps of attribute values for all pairs of instances which own the same name. After that, for each pair of instances which own the same name, we manually make the judgment on whether they are of the same entity. We also select 50 movies from 215 movie instances and run the same exp eriment. Figure 6 shows the precisions of the prop osed method in section 5.1.1 for different thresholds. As we can see that this simple method is effective to make right judgments, and we select 0.75 as the b est threshold to b e used in exp eriments. Based on the judgment ab out two instances b eing of the same entity, we can calculate same entity-value factor for each lab el pair. If the factor is ab ove a certain threshold, our algorithm can judge that they are of the same typ e. Figure 7 shows the relation b etween the precision and recall of the judgment with different thresholds for 110 frequent lab els of p erson instances and 55 frequent lab els of movie instances. 0.9 0.8 Precision 0.7 0.6 0.5 0.4 0.3 0.65 0.7 0.75 0.8 0.85 Recall 0.9 0.95 1 Figure 7: Precision and recall of same entity-value factor for different thresholds As we can see that the precision of judgment for movie-typ e entities is always higher than that of p erson-typ e entities. It is b ecause movie-typ e entity instances include less meaningless attribute values than p erson-typ e entity instances. In order to get precise result, we select 0.84 as the b est threshold for p erson-typ e entities, and 0.82 as the b est threshold for movie-typ e entities. Candidate attribute types: For p erson-typ e entities, we select the top 400 frequent attribute lab els to generate candidate attribute typ es, b ecause these lab els account for 96.9 p ercent of all the lab els' occurrences in the extracted instances. We get 3941 pairs of lab els which are of different attribute typ es and 55 pairs of lab els which are of the same attribute typ es, and generate 19 candidate attribute typ es. For movie-typ e entities, we select the top 150 frequent lab els since they account for 97.1 p ercent of all the lab els' occurrences, and get 1947 pairs of lab els which are of different typ es and 21 pairs of lab els which are of the same typ es, and generate 7 candidate attribute typ es. 6.3.3 Global Attribute schema learning method We select top 400 frequent lab els of p erson-typ e entities and top 150 frequent lab els of movie-typ e entities as the basic lab el sets (Lperson and Lmovie ) to evaluate the p erformance of our global attribute schema learning method. First, for each lab el set, three annotators indep endently create two schemas by discovering the meaning of attribute lab els with the help of related entity instances and pages. Then, a final schema for each lab el set is created by another annotator who is familiar with the meaning of attributes lab els based on three schemas. Finally, we get two gold schemas, SglodP erson and SgoldM ovie , and create the gold attribute lab els clustering results, CgoldP erson and CgoldM ovie . Comparison with the baseline As mentioned in section 6.1, we select M GSsd in [8] as the baseline, in which 1006 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations Table 5: Details of four label sets created on Lperson Label set Top10 Top20 Top30 Top40 Occurrence ratio 0.5679 0.6824 0.7235 0.7527 # Created different schemas 809 3388 4594 5202 April 21-25, 2008 · Beijing, China Table 7: Details of four label sets created on Lmovie Label set Top10 Top20 Top30 Top40 Occurrence ratio 0.7361 0.8213 0.8598 0.8821 # Created different schemas 356 1631 2286 2887 Table 6: Comparison between our method and baseline for person-type entities Label set Top10 Top20 Top30 Top40 Method Our method Baseline Our method Baseline Our method Baseline Our method Baseline WaFscore 1.0 1.0 1.0 0.7917 1.0 0.7527 0.9837 -- Running time(s) 60.67 0.068 96.65 0.248 395.03 3.875 631.52 about 1553586 Table 8: Comparison between our method and baseline for movie-type entities Label set Top10 Top20 Top30 Top40 Method Our method Baseline Our method Baseline Our method Baseline Our method Baseline WaFscore 1.0 1.0 1.0 0.8333 0.9877 0.7444 0.9694 -- Running time(s) 46.24 0.038 74.06 0.413 286.65 0.609 547.36 about 51372504 the input is a set of schemas consisting of attribute lab els. Moreover, due to the time complexity of the baseline, for Lperson and Lmovie , we only select the top 10, 20, 30 and 40 lab els from them, and get four lab el sets. Then, for each created lab el set Lc , we create schemas by automatically scanning each extracted instance insti to create one set of lab els which are b oth in Lc and insti , and regarding this lab el set as the schema. Finally, we run M GSsd using these schemas to create the desired attribute schemas, and compare the result's quality and the algorithm's efficiency with our method. For the baseline, since the hyp othesis space is too big (360600 for p erson-typ e entities, and 2826021 for movie-typ e entities) when the numb er of lab els is 40, we only estimate its running time. From the analysis of the baseline's running log, we find that running time is mainly sp ent on 2 estimation. From the theoretical analysis of the algorithm, we discover that the time sp ent on 2 estimation of each hyp othesis is the same. As a result, by recording the running time sp ent on 2 estimation for a small numb er of hyp othesizes, we estimate the running time of all the hyp othesizes for the baseline. For p erson-typ e entities, the details of the four lab el sets created on Lperson are illustrated in Table 5, in which the second column, occurrence ratio, is the ratio of the occurrences of lab els of the corresp onding lab el set to the occurrences of all the lab els in extracted instances, and the third column is the numb er of different schemas created by this lab el set. Table 6 shows the comparison of result quality and efficiency b etween our method and the baseline. As we can see the result's quality of our method is b etter than the baseline, which is b ecause the baseline is too sensitive to the errors of schemas, while our method utilizes probabilistic method to eliminate errors. Furthermore, when the lab el set is small, the efficiency of the baseline is higher than our method, while with the increase of the lab el set's size, the efficiency b egins to decline sharply for its exp onential time complexity. However, the decline of our method's efficiency is far more smoother than the baseline. Moreover, the baseline is only effective when the numb er of lab els is less than 40, and b ecome impractical even when processing the Top40 lab el set, which only accounts for 75.72% of the lab els' occurrences and is inadequate to create the global attribute schema. Therefore, the baseline can not create the global attribute schema effectively and efficiently. On the contrary, as the evaluation result illustrated, our method can learn the global attribute schema effectively and efficiently. We run the same exp eriment on movie-typ e entities, and the details of the lab el sets created on Lmovie are illustrated in Table 7. The comparison result is illustrated in Table 8, and we can see that our method also outp erforms the baseline. 1 Movie-type Person-type Weighted average Fscore 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0 50 100 150 200 250 300 350 400 Value of N Figure 8: WaFscores on TopN label sets Performances on different label sets We create top 50, 100, 150, 200, 250, 300, 350 and 400 lab el sets from Lperson , and top 25, 50, 75, 100, 125 and 150 lab el sets from Lmovie , and compare the results' quality on these different lab el sets. The comparison is illustrated in Figure 8. As we can see that our method p erform well when the size of lab el set is relatively small and the p erformance declines with the growth of the lab el set. This is b ecause that the lab els with lower frequency own less features to characterize their attribute typ es, while the lab els with higher frequency p ossess more features. However, even for the Top300 lab el set of p erson-typ e entities, which accounts for 95.35% of the lab els' occurrences, the wavFscore of the created global attribute schema is still acceptable (0.7122), and the time sp ent on 1007 WWW 2008 / Alternate Track: WWW in China - Chinese Web Innovations schema creation is 2427.692 seconds (ab out 40 minutes); for the Top100 lab el set for movie-typ e entities, which account for 94.90% of the lab els' occurrences, the waFscore of the created global attribute schema is also acceptable (0.7123), and the time used to create the schema is 996.862 seconds (ab out 17 minutes). April 21-25, 2008 · Beijing, China 7. CONCLUSION This pap er explores the problem of learning a global attribute schema for web entities of a given entity typ e. This problem is essential to facilitate the integration of entity instances and p erform valuable and reasonable web ob ject extraction, and is difficult due to the complexity and variety of web entity instances. We prop ose a general framework to automatically learn a global attribute schema, and further sp ecialize it to develop an iterative instances extraction algorithm to extract sufficient instances and attribute lab els, as well as a maximum entropy-based approach to construct the global attribute schema. We demonstrate our technique on p erson-typ e and movie-typ e entities on the Chinese Web, and create two global attribute schemas for the entities of these two typ es, and the weighted average Fscores for the two created schemas are 0.7122 and 0.7123, resp ectively. These results validate the efficiency and effectiveness of our technique in learning the global attribute schema for web entities of the sp ecific typ e. 8. ACKNOWLEDGMENTS We thank Dr. Tao Meng, Lijiang Chen, Jing He, Shengliang Gao, Mengcheng Duan, Nan Di, Yuan Liu and Bo Peng for their comments. The work in this pap er was supp orted by NSFC Grant 60773162 and 863 Pro ject Grant 2006AA01Z196. 9. REFERENCES [1] S. Abiteb oul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann San Francisco, 2000. [2] V. Carchiolo, A. Longheu, and M. Malgeri. Extracting Logical Schema from the Web. Applied Intel ligence, 18(3):341­355, 2003. [3] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS pro ject: Integration of heterogeneous information sources. Proc. of IPSJ Conference, pages 7­18, 1994. [4] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: discovering complex semantic matches b etween database schemas. Proc. of SIGMOD, pages 383­394, 2004. [5] H. Do and E. Rahm. COMA-A System for Flexible Combination of Schema Matching Approaches. Proc. of VLDB, 2002. [6] A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: a machine-learning approach. ACM SIGMOD Record, 30(2):509­520, 2001. [7] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Pop escu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-scale information extraction in knowitall:(preliminary results). Proc. of WWW, pages 100­110, 2004. [8] B. He and K. Chang. Statistical schema matching across web query interfaces. Proc. of SIGMOD, pages 217­228, 2003. [9] B. He and K. Chang. A holistic paradigm for large scale schema matching. ACM SIGMOD Record, 33(4):20­25, 2004. [10] B. He, K. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. Proc. of SIGKDD, pages 148­157, 2004. [11] G. Huck, P. Fankhauser, K. Ab erer, and E. Neuhold. Jedi: Extracting and Synthesizing Information from the Web. Proc. of the 3rd IFCIS International Conference on Cooperative Information Systems, pages 32­43, 1998. [12] J. Kang and J. Naughton. On schema matching with opaque column names and data values. Proc. of SIGMOD, 2003. [13] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. Proc. of SIGKDD, pages 16­22, 1999. [14] W. Li, C. Clifton, and S. Liu. Database Integration Using Neural Networks: Implementation and Exp eriences. Know ledge and Information Systems, 2(1):73­96, 2000. [15] B. Liu, R. Grossman, and Y. Zhai. Mining data records in Web pages. Proc. of SIGMOD, pages 601­606, 2003. [16] J. Madhavan, P. Bernstein, and A. Doan. Corpus-Based Schema Matching. Proc. of ICDE'05, pages 57­68. [17] J. Madhavan, P. Bernstein, and E. Rahm. Generic Schema Matching with Cupid. The VLDB Journal, pages 49­58, 2001. [18] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. Proc. of CoNLL, pages 1­7, 2002. [19] S. Nestorov, S. Abiteb oul, and R. Motwani. Extracting schema from semistructured data. Proc. of SIGMOD, pages 295­306, 1998. [20] Z. Nie, F. Wu, J. Wen, and W. Ma. Extracting Ob jects from the Web. Proc. of ICDE'06, 2006. [21] J. Wang, J. Wen, F. Lochovsky, and W. Ma. Instance-based schema matching for web databases by domain-sp ecific query probing. Very Large Data Bases (VLDB), 2004. [22] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. Proc. of WWW, pages 76­85, 2005. [23] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. ACM Press New York, NY, USA, 2002. [24] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous record detection and attribute lab eling in web data extraction. Proc. of SIGKDD, pages 494­503, 2006. 1008