Morphology Induction from Limited Noisy Data Using Approximate String Matching

Title	Morphology Induction from Limited Noisy Data Using Approximate String Matching
Publication Type	Conference Papers
Year of Publication	2006
Authors	Karagol-Ayan B, Doermann D, Weinberg A
Conference Name	Proceedings of the Eighth Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON 2006)
Date Published	2006/06
Conference Location	New York City, NY
Abstract	For a language with limited resources, a dictionary may be one of the few available electronic resources. To make effective use of the dictionary for translation, however, users must be able to access it using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods, however, are not suitable when data is sparse. We present algorithms for discovering morphemes from limited, noisy data obtained by scanning a hard-copy dictionary. Our approach is based on the novel application of the longest common substring and string edit distance metrics. Results show that these algorithms can in fact segment words into roots and affixes from the limited data contained in a dictionary, and extract affixes. This in turn allows non-native speakers to perform multilingual tasks for applications where response must be rapid and their knowledge is limited. In addition, this analysis can feed other NLP tools requiring lexicons.
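
The abstract names two string metrics, longest common substring and string edit distance, without showing them. The following is a minimal illustrative sketch of both in Python, not the authors' algorithm: the function names, the helper `split_on_shared_root`, and the English example words are assumptions made here for illustration, and the paper's actual segmentation procedure may differ substantially.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    On ties, the substring that ends earliest in `a` is returned.
    """
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in `a`
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def split_on_shared_root(word: str, related: str) -> tuple[str, str, str]:
    """Hypothetical helper: treat the longest common substring of two
    related word forms as a candidate root and split `word` around it."""
    root = longest_common_substring(word, related)
    start = word.find(root)
    return word[:start], root, word[start + len(root):]


if __name__ == "__main__":
    print(split_on_shared_root("walking", "walked"))
    print(edit_distance("walking", "walked"))
```

In a setting like the one the abstract describes, an edit distance threshold could tolerate scanning noise when comparing candidate forms, while the shared substring suggests where a root ends and an affix begins. Both uses are assumptions of this sketch rather than details taken from the paper.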
