WWW 2008 / Poster Paper

April 21-25, 2008 · Beijing, China

Mining for Personal Name Aliases on the Web
Danushka Bollegala, Taiki Honma, Yutaka Matsuo and Mitsuru Ishizuka
The University of Tokyo, Hongo 7-3-1, Tokyo, 113-8656, Japan

{danushka, honma}@mi.ci.i.u-tokyo.ac.jp, matsuo@biz-model.t.u-tokyo.ac.jp, ishizuka@i.u-tokyo.ac.jp

ABSTRACT
We prop ose a novel approach to find aliases of a given name from the web. We exploit a set of known names and their aliases as training data and extract lexical patterns that convey information related to aliases of names from text snipp ets returned by a web search engine. The patterns are then used to find candidate aliases of a given name. We use anchor texts and hyp erlinks to design a word co-occurrence model and define numerous ranking scores to evaluate the association b etween a name and its candidate aliases. The prop osed method outp erforms numerous baselines and previous work on alias extraction on a dataset of p ersonal names, achieving a statistically significant mean reciprocal rank of 0.6718. Moreover, the aliases extracted using the prop osed method improve recall by 20% in a relation-detection task.
seed queries: "NAME *ALIAS" (NAME, ALIAS)
..........

web search engine snippets Pattern extraction algorithm "NAME PATTERN "

*

Alias extraction & ranking algorithm

NAME ALIASES

Training Data

Lexical patterns

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms

Keywords
Name alias extraction, Semantic Web, Web Mining

Figure 1: Outline of the alias extraction method . On the other hand, referential ambiguity occurs b ecause p eople use different names to refer to the same entity on the web. For example, the American movie star Wil l Smith is often called the fresh prince in web contents. Although lexical ambiguity, particularly ambiguity related to p ersonal names, has b een explored extensively in the previous studies of name disambiguation [1], the problem of referential ambiguity of entities on the web has received much less attention. In this pap er, we sp ecifically examine on the problem of automatically extracting the various references on the web to a particular entity. In contrast to the real name of an entity, we use the term alias to describ e words or multi-word expressions that are used to refer to the entity (e.g., fresh prince is an alias for Wil l Smith ).

1.

INTRODUCTION

Precisely identifying entities in web documents is necessary for various tasks in the Semantic Web such as relation extraction, metadata extraction, search and integration of data. Nevertheless, identification of entities on the web is difficult for two fundamental reasons: first, different entities can share the same name (lexical ambiguity); secondly, a single entity can b e designated by multiple names (referential ambiguity). As an example of lexical ambiguity the name Jim Clark is illustrative. Aside from the two most p opular namesakes, the formula-one racing champion and the founder of Netscap e, at least 10 different p eople are listed among the top 100 results returned by Google for the name. Research Fellow of the Japan Society for the Promotion of Science (JSPS)
Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.


2. METHOD
The prop osed method is outlined in Fig.1 and comprises two main comp onents: pattern extraction, and alias extraction and ranking. To identify the various ways in which information related to aliases are represented on the web, we use a seed list of (name,alias) pairs as queries to a web search engine and download text-snipp ets that contain b oth a name and its alias. Here, we use the wildcard op erator * to p erform a NEAR query. The op erator * matches with one or more words in a snipp et. Figure 2 shows a snipp et retrieved for the query "Will Smith * The Fresh Prince" In Fig.2 the . ...Rock the House, the duo's debut album of 1987, demonstrated that Wil l Smith, aka the Fresh Prince, was an entertaining and amusing storytel ler... Figure 2: A snippet returned for the query "Will Smith * The Fresh Prince" by Google

1107


WWW 2008 / Poster Paper
Table 1: Lexical patterns with the highest F -scores
* [NAME] [NAME] [ N A M E] [NAME] * [NAME] * pattern aka aka better known as alias also known as nee nickname whose real name is [NAME] * * * * [NAME] * [NAME] F -score 0.335 0.322 0.310 0.286 0.281 0.225 0.224 0.205

April 21-25, 2008 · Beijing, China
Table 2: Overall performance
Dataset English Personal Names English Place Names Japanese Personal Names Real Name David Hasselhoff Courteney Cox Al Pacino Teri Hatcher Texas Vermont Wyoming Hideki Matsui MRR 0.6150 0.8159 0.6718 Average Precision 0.6865 0.7819 0.6646

Table 3: Aliases extracted by the proposed method
Extracted Aliases hoff, michael knight, michael dirt lucy, lucy, monica michael corleone susan mayer, susan, mayer lone star state, lone star, lone green mountain state, green, equality state, cowboy state Go dzilla, nishikori, matsui

snipp et contains aka (i.e. also known as ), which indicates the fact that fresh prince is an alias for Wil l Smith. In addition to a.k.a., numerous clues exist such as nicknamed, alias, real name is, nee, which are used on the web to represent aliases of a name. We create lexical patterns from snipp ets by replacing the name and alias resp ectively by two variables [NAME] and [ALIAS], and extracting the phrases that app ear in b etween. We rep eat this procedure with the reversed query, "alias * name" to extract patterns in which alias precedes the name (e.g., [ALIAS] is an alias for [NAME]). The created patterns can then b e used to query a search engine to find candidate aliases of a given name. Sp ecifically, we substitute the given name for the variable [NAME] and * for the variable [ALIAS] and download snipp ets. The words that match the wildcard op erator in snipp ets are selected as candidate aliases for the given name. Using 50 pairs of names and aliases as seeds, we extracted over 8000 lexical patterns. Patterns are ranked according to their F -scores on training data. Top ranking patterns are shown in Table 1. Considering the noise in web-snipp ets, candidates extracted using a set of shallow lexical patterns might include some invalid aliases. Therefore, it is imp erative that we identify the candidates which are most likely to b e correct aliases of a given name. We model this problem of alias recognition as one of ranking candidate aliases with resp ect to a given name such that the candidates which are most likely to b e correct aliases are assigned a higher rank. We define various ranking scores to measure the association b etween a name and a candidate alias using two approaches: hyp erlink structure on the web and page-counts retrieved from a search engine. Inb ound anchor texts of a url provide useful semantic clues related to the resource represented by the url. We define two words a and b as co-occurring, if they app ear in at least two different inb ound anchor texts A and B , resp ectively, in a url u. Using this definition, we compute 18 p opular co-occurrence measures. Hubs in the hyp erlink graph are automatically detected and co-occurrences in hubs are appropriately adjusted. Moreover, four page-count-based association measures are computed [2]. All ranking scores are integrated using raking supp ort vector machines to leverage a robust ranking function. Because of the limited availability of space, we omit the details of the ranking algorithm.

reciprocal rank (MRR) and average precision on all datasets. Moreover, the prop osed method outp erforms a previously prop osed alias extraction algorithm by Hokama and Kitagawa [3] (MRR=0.6314 on Japanese names dataset) which uses manually created patterns sp ecific to Japanese names. Top ranking aliases as extracted by the prop osed method for some names are shown in Table 3. Overall, in Table 3 the prop osed method extracts most aliases assigned in the manually created gold standard (shown in b old). Table 4: Effect of aliases on relation detection
Real name only Precision Recall F .4812 .7185 .4792 Real name and top alias Precision Recall F .4833 .9083 .5918

We evaluate the extracted aliases on a relation detection task. First, we manually classify 50 p eople in the English p ersonal names dataset into four categories: music, politics, movies, and sports. Then we measure the association b etween two p eople using the WebPMI [2] and p erform group average agglomerative clustering to form four clusters. Clustering accuracies with and without using aliases are shown in Table 4. The use of aliases significantly improves recall (ca. 20%) and consequently the F score. By considering not only real names but also aliases, it is p ossible to discover relations that are unidentifiable solely using real names.

4. CONCLUSION
We prop osed a lexical-pattern-based approach to extract aliases for a given name. The extracted candidates were ranked using various ranking scores computed using the hyp erlink structure on the web and page-counts retrieved from a search engine. The prop osed method rep orted high MRR scores on three different datasets and significantly improved recall in a relation detection task.

5. REFERENCES
[1] R. Bekkerman and A. McCallum. Disambiguating web app earances of p eople in a social network. In Proc. of WWW'05, pages 463­470, 2005. [2] D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring semantic similarity b etween words using web search engines. In Proc. of WWW'07, pages 757­766, 2007. [3] T. Hokama and H. Kitagawa. Extracting mnemonic names of p eople from the web. In Proc. of 9th Intl. Conf. on Asian Digital Libraries (ICADL'06), pages 121­130, 2006.

3.

EVALUATION

We evaluate the prop osed alias extraction algorithm on three datasets 1 : English p ersonal names (50 names), Japanese p ersonal names (100 names) and English place names (50 U.S. states). Personal names datasets include p eople from various fields of cinema, sp orts, p olitics and science. As shown in Table 2, the prop osed method rep orts high mean
1

www.miv.t.u-toyko.ac.jp/danushka/aliasdata.zip

1108