WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Which "Apple" Are You Talking About ? Mandar A. Rahurkar University of Illinois Dan Roth University of Illinois Thomas S. Huang University of Illinois rahurkar@uiuc.edu danr@cs.uiuc.edu huang@ifp.uiuc.edu ABSTRACT In a higher level task such as clustering of web results or word sense disambiguation, knowledge of all p ossible distinct concepts in which an ambiguous word can b e expressed would b e advantageous, for instance in determining the numb er of clusters in case of clustering web search results. We prop ose an algorithm to generate such a ranked list of distinct concepts associated with an ambiguous word. Concepts which are p opular in terms of usage are ranked higher. We evaluate the coverage of the concepts inferred from our algorithm on the results retrieved by querying the ambiguous word using a ma jor search engine and show a coverage of 85% for top 30 documents averaged over all keywords. Categories and Sub ject Descriptors: H.3.3 [Information Systems]: Clustering General Terms: Algorithms, Exp erimentation. ambiguous word. In case of web queries which are typically very short such an approach may not b e an option. 2. DATA Wikip edia is a collection of articles where each article defines and describ es an entity or an event. Each article may have several hyp erlinks to the other pages within or outside Wikip edia and is uniquely referenced by a title. Title is comp osed of one or more words and occasionally an explanation in parenthesis clarifying the context of the article. For example, the article for mercury with the meaning of "automobile" has the unique identifier mercury (automobile). Ambiguous surface form is hyp er linked to the appropriate article using a pip e, e.g., link from the word orange to an article on Orange Color (if applicable), as in [[Orange (Color)|Orange]]. This can generally b e represented as [[Concept Identifier| Surface form]] pair. The phrase "concept identifier" and the word concept is used interchangeably. We use this structural information within the Wikip edia to identify p ossible concepts for a given ambiguous word. Since these links have b een manually created and reviewed by a large diverse audience, they are accurate in referencing the article clarifying the context in which the surface form has b een used. Keywords Wikip edia, Concepts, Clustering 1. INTRODUCTION AND PRIOR WORK Consider an ambiguous query [3] "apple" queried on a search engine. The search relevance can b e improved if we know which "apple" user is interested in. One of the b etter ways to present the relevant search results would b e to cluster the results wherein we have a cluster for each distinct concept in which the word apple can b e used, e.g, Apple (corporation) or Apple (fruit). However most clustering algorithms require the numb er of clusters as an input. This requires multiple iterations seeking an optimal numb er based on some statistical criteria which may or may not guarantee consistency within and across different clusters. Thus, in this example as well as other applications including word sense disambiguation, knowing numb er of unique concepts in which a word can b e used would b e an invaluable asset. In this p oster we prop ose an algorithm to determine all p ossible unique concepts for a given ambiguous word. We do so by using the Wikip edia. Wikip edia has b een recently used for named entity disambiguation [1] and word sense disambiguation(WSD) [2] tasks. In [2] contextual text surrounding the ambiguous word in addition to the word itself is used without explicitly enumerating the distinct concepts of the Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. 3. ALGORITHM We parse the Wikip edia1 to extract all the hyp erlinked occurrences of a word. While parsing, the disambiguation pages, pages associated with dates and pages enumerating the lists are excluded. Wikip edia, b eing up dated regularly by a diverse group of p eople reflects the realistic use of these concepts. Therefore, we use the occurrence count in the first pass to obtain the ranked list R of concepts associated with a keyword such that R = {C1 , C2 , . . . , Cn } and rank of Ci > rank of Cj for i < j . Top 20 such concepts for ambiguous keyword "Bush" are enumerated in table 1. Most surface forms are found to associate with a large numb er of concepts, e.g., surface form Orange is associated with ab out 150 concepts, Mercury with 110 concepts etc. However large numb er of these concepts might not b e unique as seen in the table 1 where concepts "george w. bush", "george walker bush" and "george w bush" represent the same concept. We re-rank R filtering it for the duplicate concepts. 3.1 Re-Ranking the concept list Consider a directed graph G = {V , E } where each vertex V , represents a Wikip edia article or a concept. There exists 1 XML file dump as on Septemb er 07 1197 WWW 2008 / Poster Paper Table 1: Top 20 concepts for ambiguous query "bush" before the list is filtered and re-ranked. Rank 1-10 Rank 11-20 george w. bush bush family bush (band) bush alaska george h. w. bush george walker bush the bush alaskan bush shrub outback george h.w. bush uss bush (dd-529) bush (canadian band) bush, alabama forest bush plane george herb ert walker bush woody plant george w bush bush, illinois an outgoing edge from two vertices Vi and Vj , Vi Vj , if there is an outgoing hyp erlink from article i to j in the Wikip edia. Example graph is shown in figure 1, where C1 and C2 might represent "george w. bush" and "george walker bush". Given a ranked list R containing concepts C1 , C2 , . . . , Ck we seek filtering measures to re-rank the list and prune the duplicate concepts. If concept Ci Cj under some measure M , then the concept Ci is subsumed by the concept Cj . If a concept Cj subsumes concepts Ci , Ck , . . . , Cl , the counts of these concepts are added to Cj thereby increasing its score and p ossibly the rank. Ranked list R is processed sequentially starting with the concept node at the top of R . Graph b eginning from this node is parsed in a depth first search fashion to identify concepts Ci , where i > j such that Ci Cj under the measure M defined later. Since there are cycles in the graph we halt processing of a node if we encounter its ancestor. We now compute three measures and combine them to re-rank R . The first measure checks for the existence of a bi-directional link b etween the two concepts. Mb (Ci , Cj ) = I((Cj Ci ) (Ci Cj )) (1) April 21-25, 2008 · Beijing, China Figure 1: Concept Graph Table 2: Concepts after filtering the list R for keywords bush and Mercury Bush Mercury George W. Bush Mercury (element) Bush (band) Mercury (planet) Bush alaska Mercury (records) Forest Mercury (mythology) Bush LA Mercury (automobile) 4. RESULT AND CONCLUSION We evaluate the ranked list of concepts R by examining its coverage over top 30 results returned by a ma jor search engine on querying the ambiguous words from [2, 3]. Human annotators were shown the retrieved pages and asked to assign the concept to a page from the list R which b est describ es the content on that page, e.g., Orange telecom for a page related to "Orange mobile company". In addition to concepts in the list R associated with the ambiguous word, annotator could also assign the lab els "Can't Say", "Other: Not defined here" and "Tech Error" in cases when they could not reliably identify a concept, or if the concept on webpage was not mentioned in the list or if an error occurred in loading a webpage. We obtained a p ercentage coverage of over 85% for top 30 documents and 89% for top 5 documents averaged across all the keywords. Coverage is low for the concept jordan b ecause jordan in Wikip edia has not b een used to refer to "michael jordan, professor" or several other firms by that name which were part of the retrieved results. Coverage should improve to an extent on using the latest version of the Wikip edia dump. where, I is an Indicator variable. Thus, a concept Cj may subsume Ci if Mb = 1. Mb measure however is a greedy measure and concept Ci may subsume weakly related concept Cj due to a presence of a bi-directional link, e.g., Orange Color and Syracuse Orange. If two concepts share a subset of outb ound and inb ound links they are likely to b e similar. The second and third measure (not shown) count the overlap b etween the inb ound and outb ound links. Op erators I n(C ) and Out(C ) return a set of inb ound and outb ound links for concept C and |x| returns the cardinality of the set x. MO B is similar to MI B but defined instead on the outb ound links. ,, MI B (Cm , Cn ) = max |I n(Cm ) I n(Cn )| |I n(Cm ) I n(Cn )| , |I n(Cm )| |I n(Cn )| « [1] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007, pages 708­716, 2007. [2] R. Mihalcea. Using wikipedia for automatic word sense disambiguation. NAACL, 2007. [3] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to cluster web search results. SIGIR, July 2004. 5. REFERENCES The fourth measure combines these three measures, ,, « M (Cm , Cn ) = Mb (Cm , Cn )× 1 MI B (Cm , Cn )+2 MO B (Cm , Cn ) Measure M yields values in [0, 1], and a higher value indicates more similarity. A concept in list R , Cj , subsumes Ci , where i > j if M (Cm , Cn ) > P V al. P V al is determined empirically to b e 0.24. Concepts which occur less than 5 times are discarded. The pruned concept lists for the ambiguous words bush and mercury are shown in table 2. where, 1 + 2 = 1. Table 3: Percentage coverage for top 30 results obtained on issuing an ambiguous query. Query Avg. Coverage Query Avg. Coverage apple 100.00 lincoln 73.33 bar 83.33 matrix 66.67 bush 90.00 orange 80.00 clinton 73.33 quotes 90.00 ford 100.00 saturn 96.67 jaguar 86.67 tiger 86.67 jobs 96.67 ups 93.33 jordan 66.67 1198