WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Extraction and Mining of an Academic Social Network Jie Tang Department of Computer Science Tsinghua University FIT 1-308, Tsinghua University, Beijing, 100084, China Jing Zhang Department of Computer Science Tsinghua University FIT 1-308, Tsinghua University, Beijing, 100084, China Limin Yao, Juanzi Li Department of Computer Science Tsinghua University FIT 1-308, Tsinghua University, Beijing, 100084, China jietang@tsinghua.edu.cn ABSTRACT zhangjing@keg.cs.tsinghua.edu.cn {ylm, ljz}@keg.cs.tsinghua.edu.cn crawled publications. 3. 4. 5. Storage and Access: it provides storage and indexing for the extracted/integrated data in the RNKB. Search: it provides three types of searches: person search, publication search, and category based search. Mining: it provides mining services, e.g., expertise search on a given topic and people association finding. This paper addresses several key issues in extraction and mining of an academic social network: 1) extraction of a researcher social network from the existing Web; 2) integration of the publications from existing digital libraries; 3) expertise search on a given topic; and 4) association search between researchers. We developed a social network system, called ArnetMiner, based on proposed methods to the above problems. In total, 448,470 researcher profiles and 981,599 publications were extracted/integrated after the system having been in operation for two years. The paper describes the architecture and main features of the system. It also briefly presents the experimental results of the proposed methods. 5. Mining 4. Search Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval, Digital Libraries, I.2.6 [Artificial Intelligence]: Learning, H.2.8 [Database Management]: Database Applications. Expert finding Association Finding Hot-topic Finding Sub-topic Finding Survey paper finding 3. Storage and Access Person search Publication search Conference search Access Interface Indexing Storage RNKB Ontology 2. Integration General Terms Algorithms, Experimentation Name disambiguation 1. Extraction Keywords Social Network, Information Extraction, Expertise Search Profile extraction Publication collection 1. INTRODUCTION The quickly growing up Web based social network applications provides abundant data for mining, at the same time bring big challenges to the field. In this paper, we present a system called ArnetMiner (http://www.arnetminer.org). Our objective in this system is to provide services for managing academic social networks, specifically including: 1) how to extract researcher profiles from the Web, 2) how to integrate the researcher profiles and publications, 3) how to simultaneously find expertise objects (of different types) on a topic, and 4) how to find associations between researchers. Web Papers DBLP Figure 1. Architecture of ArnetMiner For several features in the system, e.g., extraction of researcher profiles, name disambiguation in the integration, and expertise search, we propose new approaches trying to overcome the drawbacks that exist in the conventional methods. For some other features, e.g., storage and access and searching, we utilize the state-of-the-art methods. 2.1 Researcher Profiling We define the schema of the researcher profile, by extending the FOAF ontology [2]. In the profile, 24 properties and two relations are defined. It is non-trivial to perform the profile extraction, as the layout and content of the researcher homepages/introducing pages may vary largely depending on the authors. We propose a unified approach to the problem [5]. The approach consists of three steps: relevant page identification, preprocessing and tagging. In relevant page identification, given a researcher name, we first get a list of web pages by a search engine (we used Google API) and then identify the homepage/introducing page using a classifier. The performance of the classifier is 92.39% in terms of F1-measure. In preprocessing, (a) we separate the text into tokens and (b) we assign possible tags to each token. The 2. ARNETMINER Figure 1 shows the architecture of the ArnetMiner system. The system mainly consists of five main components: 1. 2. Extraction: it focuses on automatically extracting the researcher profile from the Web. Integration: it integrates the extracted researcher profiles and Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. 1193 WWW 2008 / Poster Paper tokens form the basic units and the pages form the sequences of units in the tagging problem. In tagging, given a sequence of units, we determine the most likely corresponding sequence of tags by using a trained tagging model. (The type of the tags corresponds to the property defined in the profile.) As the tagging model, we use Conditional Random Fields (CRFs) [3]. CRF is a conditional probability of a sequence of labels given a sequence of observations tokens. The CRF is used to find the sequence of tags having the highest likelihood using a trained model. Features were defined for different types of tokens in the CRF model. For evaluating our unified profiling method, we randomly chose 1,000 researcher names from ArnetMiner and conducted human annotation. Experimental results show that our proposed approach can achieve a performance of 83.37% on average in terms of F1measure, against Support Vector Machine based method (73.57%) and Amilcare (53.44%). April 21-25, 2008 · Beijing, China between the query and the object (e.g., paper and author) or random walk model to estimate importance of each object. We propose a Latent Dirichlet Allocation-style model [1], called Author-Conference-Topic (ACT) model to model the dependencies between different types of objects in the researcher network. In the ACT model, for each paper, an author is first drawn from a uniform distribution; a topic z is then drawn from a mixture weight of the chosen author and a distribution from a symmetric Dirichlet prior; next a word is generated from the topic z and a conference stamp is generated from the topic z. In this way, the dependencies between different types of objects are modeled using the topic z. Another advantage of the model is that we can use this model to capture the `semantic'/hidden relevance between the query and the target objects. After applying the ACT model to the research network, we again employ a random walk model on the heterogeneous network and finally output a combined score for each object to the query. We conducted experiments on Arnetminer with seven queries and compared the results with two baselines of using language model and PageRank, as well as results of two existing systems (Libra and Rexa). Experimental results show that the proposed method outperforms them from 4.26% to 29.2% in terms of MAP [7]. 2.2 Name Disambiguation We integrate the publication data from existing online data source. We chose DBLP bibliography (dblp.uni-trier.de/). For integrating the researcher profiles and the publications, we inevitably have the ambiguous problem. The problem can be described as: Given a person name a, let all publications containing the author named a as P={p1, p2, ..., pn}. Suppose there existing k actual researchers {y1, y2, ..., yk} having the name a, our task is to assign these n publications to their real researcher yi. Our method is based on a unified probabilistic model based on Hidden Markov Random Fields (HMRF) [6]. This model incorporates constraints and a parameterized-distance measure. The disambiguation problem is cast as assigning a tag to each paper with each tag representing an actual researcher yi. Specifically, we define the a-posteriori probability as the objective function. We aims at finding the maximum of the objective function. The objective function is defined as the conditional probability of researcher labels y given the papers x: 1 p ( y | x) = exp D( xi , yk ) + {D( xi , x j ) [ wk ck ( xi , x j )]} i,k Z ( x) i, j i ck C 2.4 Association Search Finally, we investigate the problem of association search: finding connections between researchers. We formalize the association search as that of near-shortest paths and use a two stage approach to deal with it. First, we employed a shortest path search to find shortest path from all persons in the network to the target person and then we use a depth-first search method to find top K ranked results. Our method can find the top K results in 2-5 seconds for a general query on the social network with about half million researchers and 1 million publications. 3. CONCLUSION In this paper, we have presented a system called ArnetMiner for extracting and mining a researcher social network. We introduced the architecture and the main features of the system. We have described the four issues that we are focusing on and proposed our approaches to them. Experimental results indicate that the proposed methods can achieve high performances. where D(xi, yk) is the distance between paper xi and researcher yk and D(xi, xj) is the distance between paper xi and xj; ck(xi, xj) denotes a constraint of xi and xj; wj is the parameter; and Z(x) is the normalization factor. A Expectation Maximization (EM) based method is used to learn the parameters for the distance function D(.) in the model. We define six types of constraints based on the characteristic of the publications, e.g., a constraint means two publications have a co-author with the same name. See [6] for the other constraints. To evaluate our method, we created two datasets, namely Abbreviated Name and Real Name. The first dataset contains 10 abbreviated names (e.g. `C. Chang') and the second data set has two real person names (e.g. `Jing Zhang'). The proposed method can obtain an overall performance of 83.0% in terms of pairwiseF1-measure [6], outperforming the baseline [4] by 8.0%. REFERENCES [1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003, 3, 993-1022. [2] D. Brickley and L. Miller, FOAF vocabulary specification, namespace document, 2004. http://xmlns.com/foaf/0.1/. [3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proc. of ICML'2001. [4] Y.F. Tan, M. Kan, and D. Lee. Search engine driven author disambiguation. Proc. of JCDL'2006. pp. 314-315. 2.3 Expertise Search The goal of expertise search is aimed at answering: "Who are experts or which are expertise conferences/papers on topic X?". Traditional, the problem is usually viewed as a ranking problem using either language model to directly calculate the relevance [5] J. Tang, D. Zhang, and L. Yao. Social network extraction of academic researchers. Proc. of ICDM'2007. pp. 292-301 [6] D. Zhang, J. Tang, J. Li, and K. Wang. A constraint-based probabilistic framework for name disambiguation. Proc. of CIKM'2007. pp. 1019-1022 [7] J. Zhang, J. Tang, L. Liu, and J. Li. A mixture model for expert finding. Proc. of PAKDD'2008. (to appear) 1194