WWW 2008 / Poster Paper · April 21-25, 2008 · Beijing, China

A Unified Framework for Name Disambiguation

Jie Tang, Jing Zhang
Department of Computer Science, Tsinghua University
FIT 1-308, Tsinghua University, Beijing, 100084, China
jietang@tsinghua.edu.cn

Duo Zhang
Department of Computer Science, University of Illinois at Urbana-Champaign
dolphin.zduo@gmail.com

Juanzi Li
Department of Computer Science, Tsinghua University
FIT 1-308, Tsinghua University, Beijing, 100084, China
ljz@keg.cs.tsinghua.edu.cn

ABSTRACT
The name ambiguity problem has been a challenging issue for a long time. In this paper, we make a thorough investigation of the problem. Specifically, we formalize name disambiguation in a unified framework that can incorporate both attributes and relationships into a probabilistic model. We explore a dynamic approach for automatically estimating the person number K, and employ an adaptive distance measure to estimate the distance between objects. Experimental results show that our proposed framework significantly outperforms the baseline method.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval, Digital Libraries; I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database Applications.

General Terms
Algorithms, Experimentation

Keywords
Name Disambiguation, Probabilistic Model, Digital Library

1. INTRODUCTION
Name disambiguation is a critical problem in many knowledge management applications, such as Digital Libraries and Semantic Web applications. We examined one hundred person names and found that the problem is very serious. For example, there are 54 papers authored by 25 different persons named "Jing Zhang"; there are even three researchers named "Yi Li" who graduated from the authors' lab.

In this paper, we propose a unified probabilistic framework to address these challenges. We explore a dynamic approach for estimating the person number K, and we propose a unified probabilistic model that achieves better performance in name disambiguation than existing methods.

2. NAME DISAMBIGUATION

2.1 Notations
The problem can be described as follows: given a person name a, let P = {p1, p2, ..., pn} be all publications containing an author named a. Suppose there exist k actual researchers {y1, y2, ..., yk} having the name a; our task is to assign the n publications to the real researchers they belong to.

We define five types of undirected relationships between papers, listed in Table 1. Relationship r1 indicates that two papers are published at the same conference/journal, r2 indicates that two papers have a secondary author with the same name, and r3 indicates that one paper cites the other.

Table 1. Relationships between papers

  R    W    Name            Description
  r1   w1   Co-Conference   pi.pubvenue = pj.pubvenue
  r2   w2   Co-Author       ∃ r, s > 0 such that ai(r) = aj(s)
  r3   w3   Citation        pi cites pj, or pj cites pi
  r4   w4   Constraints     Feedback supplied by users
  r5   w5   τ-CoAuthor      Extended co-authorship (τ > 1)

We use an example to explain relationship r5 (here with τ = 2). Suppose pi has authors "David Mitchell" and "Andrew Mark", and pj has authors "David Mitchell" and "Fernando Mulford", where "David Mitchell" is the name we are disambiguating. If "Andrew Mark" and "Fernando Mulford" also co-author a paper, then we say pi and pj have a 2-CoAuthor relationship, as the sketch below illustrates.
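To make the relationship definitions concrete, here is a minimal sketch of the 2-CoAuthor check. The paper records and the function name are our own illustration (the poster gives no code), and papers are simplified to plain author-name lists.

```python
from itertools import product

# Toy corpus: each paper is just a list of author names (illustrative only).
papers = {
    "p_i": ["David Mitchell", "Andrew Mark"],
    "p_j": ["David Mitchell", "Fernando Mulford"],
    "p_k": ["Andrew Mark", "Fernando Mulford"],  # links the two co-authors
}

def two_coauthor(pi, pj, target, corpus):
    """True if pi and pj have a 2-CoAuthor relationship (r5 with tau = 2)
    with respect to the ambiguous name `target`."""
    others_i = [a for a in corpus[pi] if a != target]
    others_j = [a for a in corpus[pj] if a != target]
    # Some co-author of pi and some co-author of pj must themselves
    # co-author at least one paper in the corpus.
    for a, b in product(others_i, others_j):
        if any(a in authors and b in authors for authors in corpus.values()):
            return True
    return False

print(two_coauthor("p_i", "p_j", "David Mitchell", papers))  # True
```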
2.2 Our Approach
The proposed framework is based on Hidden Markov Random Fields (HMRFs) [1], which can model dependencies between observations (here, each paper is viewed as an observation). For each disambiguation task, we propose using the Bayesian Information Criterion (BIC) [2] to estimate the person number K. We define an objective function for the disambiguation task; our goal is to find a parameter setting that maximizes this objective function for a given K.

Figure 1 shows the graphical structure of the HMRF model applied to the name disambiguation problem. The edges between the hidden variables correspond to the relationships between papers, and the value of each hidden variable indicates the assignment result. By the fundamental theorem of random fields [3], the probability distribution of the label configuration Y has the form:

  P(Y) = \frac{1}{Z_1} \exp\Big( -\sum_i \sum_{(y_i, y_j) \in E_i} f(y_i, y_j) \Big)    (1)

where the potential function f(y_i, y_j) is a non-negative function defined on the edge (y_i, y_j), E_i represents all neighborhoods related to y_i, and Z_1 is a normalization factor.

[Figure 1. The graphical structure of an HMRF. Hidden variables y1, ..., y11 (cluster assignments, here taking values 1-3) are connected by co-conference, co-author, citation, and 2-CoAuthor edges, and each hidden variable emits an observed paper x1, ..., x11.]

Our goal is to find the maximum a posteriori (MAP) configuration of the HMRF, i.e., to maximize P(Y|X). Assuming P(X) is a constant, the Bayes rule gives P(Y|X) ∝ P(Y)P(X|Y). Therefore, our objective function is defined as:

  L_{max} = \log P(Y|X) = \log\big( P(Y) \, P(X|Y) \big)    (2)

By integrating (1) into (2), we obtain:

  L_{max} = \log \Big[ \frac{1}{Z_1} \exp\Big( -\sum_i \sum_{(y_i, y_j) \in E_i} f(y_i, y_j) \Big) \prod_{x_i \in X} P(x_i \mid y_i) \Big]    (3)

The problem then becomes how to define the potential function f and how to estimate the generative probability P(x_i|y_i).

We define the potential function f(y_i, y_j) by the distance D(x_i, x_j) between papers p_i and p_j. As for the probability P(x_i|y_i), we assume that publications are generated under a spherical Gaussian distribution, and thus we have:

  P(x_i \mid y_i) = \frac{1}{Z_2} \exp\big( -D(x_i, \mu_{(i)}) \big)    (4)

where \mu_{(i)} is the centroid of the cluster to which paper x_i is assigned, D(x_i, \mu_{(i)}) is the distance between paper p_i and its assigned cluster center, and Z_2 is a normalization factor. Putting equation (4) into (3), we obtain the objective function in minimizing form:

  L_{min} = \sum_i \sum_{(y_i, y_j) \in E_i} f(y_i, y_j) + \sum_{x_i \in X} D(x_i, \mu_{(i)}) + \log Z    (5)

where Z = Z_1 Z_2.

2.3 EM Algorithm
Three tasks are executed by Expectation Maximization (EM): learning the parameters of the distance measure, re-assigning each paper to a cluster, and updating the cluster centroids \mu_{(i)}. We define the distance function D(x_i, x_j) as follows [1]:

  D(x_i, x_j) = 1 - \frac{x_i^T A x_j}{\|x_i\|_A \, \|x_j\|_A}, \quad \|x_i\|_A = \sqrt{x_i^T A x_i}    (6)

Here A is a parameter matrix. The EM process can be summarized as follows: in the E-step, given the current cluster centroids, every paper x_i is re-assigned to a cluster by maximizing P(y_i|x_i); in the M-step, the cluster centers are re-estimated based on the assignments so as to minimize the objective function L, and the parameter matrix A is updated to increase the objective function.

2.4 Estimation of K
Our proposed strategy (see Algorithm 1) starts by setting K to 1 and uses the BIC score to decide whether to split the current cluster. The algorithm runs iteratively: in each iteration, we try to split every cluster C into two sub-clusters and calculate a local BIC score for the new sub-model M2. If BIC(M2) > BIC(M1), we split the cluster. We then calculate a global BIC score for the resulting model, and the process continues as long as some cluster is split. Finally, we use the global BIC score as the criterion to choose the model with the highest score as output. The BIC score is defined as:

  BIC_v(M_h) = \log P(M_h \mid P) - \frac{|\theta|}{2} \log(n)    (7)

For the parameter |\theta|, we simply define it as the sum of the K cluster probabilities, the weights of the relations, and the cluster centroids. Sketches of the EM steps and of the BIC-guided splitting follow.
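As a concrete illustration of Eq. (6) and the E-/M-steps of Sec. 2.3, here is a minimal sketch assuming papers are already encoded as numeric feature vectors (the poster does not specify the features). The function names are ours, and the relationship potentials f and the update of A are omitted for brevity, so this is not the full objective of Eq. (5).

```python
import numpy as np

def distance(xi, xj, A):
    """Parameterized cosine-style distance of Eq. (6):
    D(xi, xj) = 1 - (xi^T A xj) / (||xi||_A * ||xj||_A)."""
    norm_i = np.sqrt(xi @ A @ xi)
    norm_j = np.sqrt(xj @ A @ xj)
    return 1.0 - (xi @ A @ xj) / (norm_i * norm_j)

def e_step(X, centroids, A):
    """Re-assign each paper to the cluster whose centroid minimizes
    D(x_i, mu) -- i.e. maximizes P(y_i | x_i) under Eq. (4), ignoring
    the relationship potentials f for brevity."""
    assignments = []
    for x in X:
        dists = [distance(x, mu, A) for mu in centroids]
        assignments.append(int(np.argmin(dists)))
    return np.array(assignments)

def m_step(X, assignments, k):
    """Re-estimate each cluster centroid as the mean of its members."""
    return [X[assignments == c].mean(axis=0) if np.any(assignments == c)
            else X[np.random.randint(len(X))]  # re-seed an empty cluster
            for c in range(k)]
```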
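Continuing the same sketch, the following roughly mirrors the K-estimation strategy of Sec. 2.4 (the poster references an Algorithm 1 that is not reproduced here). The log-likelihood, the two-way split, and the parameter counts are simplified stand-ins for the paper's model score; treat this as the shape of the algorithm, not the authors' implementation.

```python
def loglik(C, A):
    """Spherical-Gaussian-style cluster score per Eq. (4):
    sum of -D(x, mu) over members (normalizer dropped)."""
    if len(C) == 0:
        return 0.0
    mu = C.mean(axis=0)
    return -sum(distance(x, mu, A) for x in C)

def bic(log_likelihood, n_params, n):
    """BIC score of Eq. (7): log P(M|P) - (|theta|/2) * log(n)."""
    return log_likelihood - 0.5 * n_params * np.log(n)

def split_in_two(C, A):
    """Tentatively split a cluster with a tiny 2-means pass."""
    mus = [C[0], C[-1]]                      # crude initial sub-centroids
    for _ in range(10):
        assign = e_step(C, mus, A)
        mus = m_step(C, assign, 2)
    assign = e_step(C, mus, A)
    return C[assign == 0], C[assign == 1]

def estimate_k(X, A, max_iters=20):
    """Start with K = 1; keep splitting clusters while BIC improves."""
    clusters = [X]                           # K = 1
    for _ in range(max_iters):
        new_clusters, split_happened = [], False
        for C in clusters:
            if len(C) < 2:
                new_clusters.append(C)
                continue
            m1 = bic(loglik(C, A), n_params=1, n=len(C))
            left, right = split_in_two(C, A)
            m2 = bic(loglik(left, A) + loglik(right, A),
                     n_params=2, n=len(C))   # simplified |theta| counts
            if len(left) > 0 and len(right) > 0 and m2 > m1:
                new_clusters += [left, right]
                split_happened = True
            else:
                new_clusters.append(C)
        clusters = new_clusters
        if not split_happened:               # no split this pass: stop
            break
    return len(clusters), clusters

# Toy usage: two well-separated groups of "papers" in feature space.
X = np.vstack([np.random.randn(10, 5) + 5.0, np.random.randn(10, 5) - 5.0])
A = np.eye(5)                                # start from plain cosine distance
k, _ = estimate_k(X, A)
print("estimated K:", k)                     # typically 2 for this toy data
```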
2.5 Experimental Results
To evaluate our method, we created two datasets, namely Abbreviated Name and Real Name. The first dataset contains 10 abbreviated names (e.g., "C. Chang") and the second contains two real person names (e.g., "Jing Zhang"). We evaluated the performance of our method and of the baseline method (K-means) on the two datasets. The results show that our method significantly outperforms the baseline for name disambiguation (+46.54% on the Abbreviated Name dataset and +41.35% on the Real Name dataset in terms of average F1-score).

We also evaluated the effectiveness of the estimation of the person number K, and found that the numbers estimated by our approach are close to the human-labeled results. We further applied X-means to find the person number K and found that it fails to find the actual number: it always outputs only one cluster, except for "Yi Li", where it outputs two. See [4][5] for more evaluation results.

To further evaluate the effectiveness of our method, we applied it to expert finding and people association finding. For expert finding, an improvement of 2% in terms of mean average precision (MAP) can be obtained. For people association finding, we selected five pairs of person names and searched in the ArnetMiner system; an average improvement of 20% can be obtained.

3. CONCLUSION
In this paper, we have investigated the problem of name disambiguation. We have proposed a generalized probabilistic model for the problem and explored a dynamic approach for estimating the person number K. Experiments show that the proposed method significantly outperforms the baseline methods.

4. REFERENCES
[1] S. Basu, M. Bilenko, and R. J. Mooney. A Probabilistic Framework for Semi-Supervised Clustering. In Proc. of SIGKDD'2004, pp. 59-68, Seattle, USA, August 2004.
[2] M. Ester, R. Ge, B. J. Gao, Z. Hu, and B. Ben-Moshe. Joint Cluster Analysis of Attribute Data and Relationship Data: the Connected K-center Problem. In Proc. of SDM'2006.
[3] J. Hammersley and P. Clifford. Markov Fields on Finite Graphs and Lattices. Unpublished manuscript, 1971.
[4] J. Tang, D. Zhang, and L. Yao. Social Network Extraction of Academic Researchers. In Proc. of ICDM'2007, pp. 292-301.
[5] D. Zhang, J. Tang, J. Li, and K. Wang. A Constraint-based Probabilistic Framework for Name Disambiguation. In Proc. of CIKM'2007, pp. 1019-1022.