Saif Mohammad and Graeme Hirst
Submitted. 2009.
ABSTRACT: Automatic measures of semantic distance can be classified into two kinds: (1) those that rely on the structure of manually created lexical resources such as WordNet, and (2) those that rely only on co-occurrence statistics from large corpora. Each kind has inherent strengths and limitations. Here we present a hybrid approach that combines corpus statistics with the structure of a Roget-like thesaurus to gain the strengths of each while avoiding many of their limitations. We create distributional profiles (co-occurrence vectors) of coarse thesaurus concepts rather than of words, which allows us to estimate the distributional similarity between concepts. We show that this approach can be ported to a cross-lingual framework, so as to estimate semantic distance in a resource-poor language by combining its text with a thesaurus in a resource-rich language. Extensive monolingual and cross-lingual experiments on ranking word pairs in order of semantic distance, correcting real-word spelling errors, and solving word-choice problems show that these distributional measures of concept distance markedly outperform traditional distributional word-distance measures and are competitive with the best WordNet-based measures.
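As a rough illustration of the idea of concept-level distributional profiles, the sketch below builds co-occurrence vectors for thesaurus concepts and compares them. It is not the authors' implementation: the toy thesaurus, the toy corpus, the stopword list, the window size, the use of raw counts, and cosine as the similarity measure are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy thesaurus: coarse concept -> words listed under it.
# An ambiguous word ("bank") appears under more than one concept.
THESAURUS = {
    "FINANCE": {"bank", "money", "loan"},
    "RIVER": {"bank", "water", "shore"},
    "COMMERCE": {"money", "market", "trade"},
}

# Hypothetical toy corpus, already tokenized into sentences.
CORPUS = [
    "the bank approved the loan with money from the market".split(),
    "trade in the market needs money and a loan".split(),
    "water reached the shore of the river bank".split(),
]

# Assumed stopword list; function words are left out of the profiles.
STOPWORDS = {"the", "a", "of", "in", "with", "from", "and"}


def concept_profiles(corpus, thesaurus, window=3):
    """Count the words that co-occur (within +/- window tokens) with any
    word listed under each concept, giving one vector per concept."""
    profiles = {concept: Counter() for concept in thesaurus}
    for sentence in corpus:
        for i, target in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            context = [w for w in sentence[lo:i] + sentence[i + 1:hi]
                       if w not in STOPWORDS]
            for concept, words in thesaurus.items():
                if target in words:
                    profiles[concept].update(context)
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = (sqrt(sum(v * v for v in p.values()))
            * sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


profiles = concept_profiles(CORPUS, THESAURUS)
finance_commerce = cosine(profiles["FINANCE"], profiles["COMMERCE"])
finance_river = cosine(profiles["FINANCE"], profiles["RIVER"])
```

On this toy data the FINANCE profile comes out closer to COMMERCE than to RIVER, even though "bank" is listed under both FINANCE and RIVER. Because there are far fewer coarse concepts than words, the concept-by-word matrix is much smaller than a word-by-word one.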