Morphology Induction from Limited Noisy Data Using Approximate String Matching

Title: Morphology Induction from Limited Noisy Data Using Approximate String Matching
Publication Type: Conference Papers
Year of Publication: 2006
Authors: Karagol-Ayan B, Doermann D, Weinberg A
Conference Name: Proceedings of the Eighth Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON 2006)
Date Published: 2006/06
Conference Location: New York City, NY

For a language with limited resources, a dictionary may be one of the few available electronic resources. To make effective use of the dictionary for translation, however, users must be able to access it using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods are not suitable when data is sparse. We present algorithms for discovering morphemes from limited, noisy data obtained by scanning a hard-copy dictionary. Our approach is based on a novel application of the longest common substring and string edit distance metrics. Results show that these algorithms can segment words into roots and affixes, and extract affixes, from the limited data contained in a dictionary. This in turn allows non-native speakers to perform multilingual tasks in applications where response must be rapid and their knowledge is limited. In addition, this analysis can feed other NLP tools that require lexicons.
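
To make the two string metrics named in the abstract concrete, the following is a minimal sketch of longest common substring and Levenshtein edit distance, with a toy splitter that treats the shared substring as a candidate root. It is not the authors' algorithm: the function names, the `min_root` threshold, and the English example words are all assumptions chosen for illustration.

```python
from typing import Optional, Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Ties are broken in favour of the earliest match in `a`.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def split_on_root(word: str, headword: str,
                  min_root: int = 3) -> Optional[Tuple[str, str, str]]:
    """Hypothetical helper: split `word` into (prefix, root, suffix).

    The root is taken to be the longest substring shared with a dictionary
    headword. `min_root` is an illustrative threshold, not a value from the
    paper. Returns None when the shared substring is too short.
    """
    root = longest_common_substring(word, headword)
    if len(root) < min_root:
        return None
    start = word.find(root)
    return word[:start], root, word[start + len(root):]


if __name__ == "__main__":
    print(split_on_root("walking", "walk"))
    print(split_on_root("unhappy", "happy"))
    # A single misread character, as might come from scanning, costs one edit.
    print(edit_distance("walking", "wa1king"))
```

Edit distance is useful here because scanned text may contain misrecognized characters, so an exact substring match against a headword can fail where an approximate match would still succeed.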