The CLIP Colloquium Series presents...


Measuring Semantic Distance using Distributional Profiles of Concepts

Saif Mohammad (University of Maryland)
April 16, 2008, 4:00pm, AVW 3174

Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. I argue that semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Rather than relying on the co-occurrence (distributional) profiles of words, as the distributional hypothesis suggests, one can use distributional profiles of concepts (DPCs) to infer the semantic properties of concepts and to estimate semantic distance more accurately.

I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus). The algorithm estimates the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though using only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state of the art in a number of natural language tasks. I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization.
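
As a rough illustration of the idea only (not the algorithm presented in the talk, whose details the abstract does not give), the sketch below builds concept profiles from unannotated text by crediting every thesaurus category that lists a word with that word's context. The toy thesaurus, the toy corpus, the window size, the use of raw counts, and the cosine-based distance are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy thesaurus: category (coarse concept) -> words listed under it.
# "bank" is deliberately ambiguous and appears under two categories.
THESAURUS = {
    "FINANCE": {"bank", "loan"},
    "COMMERCE": {"credit", "cash"},
    "RIVER": {"bank", "stream"},
}

# Hypothetical toy corpus, already tokenized; no sense annotation is used.
CORPUS = [
    "bank approved loan interest".split(),
    "credit approved cash interest".split(),
    "stream water bank mud".split(),
]


def build_dpcs(sentences, thesaurus, window=2):
    """Count context words per category.

    Each occurrence of a word contributes its context to every category
    that lists the word, so ambiguous words are spread over their senses.
    """
    word_to_cats = defaultdict(set)
    for cat, words in thesaurus.items():
        for word in words:
            word_to_cats[word].add(cat)

    profiles = {cat: Counter() for cat in thesaurus}
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            for cat in word_to_cats.get(target, ()):
                profiles[cat].update(context)
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def concept_distance(profiles, a, b):
    """Distance between two concepts as one minus profile cosine."""
    return 1.0 - cosine(profiles[a], profiles[b])


profiles = build_dpcs(CORPUS, THESAURUS)
print(concept_distance(profiles, "FINANCE", "COMMERCE"))
print(concept_distance(profiles, "RIVER", "COMMERCE"))
```

On this toy data FINANCE comes out closer to COMMERCE than RIVER does, because the two share contexts such as "approved" and "interest". A realistic version would presumably replace raw counts with a strength-of-association measure and use a full thesaurus and a large corpus.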

The proposed approach is computationally inexpensive, can estimate both semantic relatedness and semantic similarity, and can be applied to all parts of speech. Extensive experiments on ranking word pairs by semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.

I will conclude with recent work at UMIACS on developing a computational model for antonymy, a unique lexical-semantic relation that simultaneously conveys a sense of both distance and closeness.

About the Speaker

Saif Mohammad is a Research Associate in the Institute for Advanced Computer Studies at the University of Maryland. In 2008, he received his Ph.D. in Computer Science from the University of Toronto, under the supervision of Dr. Graeme Hirst. Saif's interests are in Natural Language Processing, especially Lexical Semantics. His work focuses on developing monolingual and cross-lingual computational models of semantic distance and of lexical-semantic relations such as antonymy, for the benefit of natural language applications.


This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.