Cross-Lingual Utilization of NLP Resources for New Languages
Okan Kolak
University of Maryland,
ABSTRACT
Until recently, the focus of the NLP community has been on a handful of mostly European languages. However, as the economic and political climate of the world changes, the relative importance given to various languages is also changing rapidly. The importance of being able to rapidly acquire NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal, as they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations, and creating these resources can be as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using OCR. Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by utilizing an existing OCR system for a language with a similar script. We present a generative OCR model that allows us to post-process the output of a non-native OCR system and achieve accuracy close to, or better than, that of a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method.
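To make the post-processing idea concrete, here is a minimal sketch of noisy-channel correction for a single OCR token. It is an illustration under strong simplifying assumptions, not the generative model presented in the thesis: it only handles one-for-one character substitutions, and the names `channel_log_prob`, `correct_token`, `lexicon`, `confusion`, and `match_prob` are all hypothetical.

```python
import math


def channel_log_prob(observed, candidate, confusion, match_prob=0.9):
    """Log-probability that the OCR system printed `observed` for `candidate`.

    Assumption: only equal-length, character-for-character substitutions.
    `confusion` maps (true_char, observed_char) to a substitution probability.
    """
    if len(observed) != len(candidate):
        return float("-inf")
    total = 0.0
    for obs_char, true_char in zip(observed, candidate):
        if obs_char == true_char:
            total += math.log(match_prob)
        else:
            prob = confusion.get((true_char, obs_char), 0.0)
            if prob == 0.0:
                return float("-inf")  # substitution never seen: rule it out
            total += math.log(prob)
    return total


def correct_token(observed, lexicon, confusion):
    """Pick the lexicon word maximizing P(word) * P(observed | word).

    `lexicon` maps words to (positive) prior probabilities. If no word can
    explain the observed string, the token is returned unchanged.
    """
    best, best_score = observed, float("-inf")
    for word, prior in lexicon.items():
        score = math.log(prior) + channel_log_prob(observed, word, confusion)
        if score > best_score:
            best, best_score = word, score
    return best


# Toy data: the OCR system sometimes prints the digit 1 for the letter l.
lexicon = {"clean": 0.5, "clear": 0.5}
confusion = {("l", "1"): 0.05}
corrected = correct_token("c1ean", lexicon, confusion)
```

In this toy setup the lexicon prior plays the role of a language model for the new language, and the confusion table stands in for the error behavior of the borrowed OCR system. A realistic model would also need to cover character insertions, deletions, and segmentation errors.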
Next, we demonstrate cross-lingual utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora, and show that a treebank of reasonable quality can be generated by combining projection with minimal language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
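The core idea of projecting dependencies through word alignments can be sketched as follows. This is a simplified illustration that assumes a one-to-one word alignment and is not the algorithm from the thesis, which must also handle unaligned and many-to-many cases; the function name `project_dependencies` and the data layout are hypothetical.

```python
def project_dependencies(source_heads, alignment):
    """Project a source-side dependency tree onto the target sentence.

    source_heads: dict mapping each source token index to its head index,
                  with -1 marking the root.
    alignment:    dict mapping source token indices to target token indices
                  (assumed one-to-one in this sketch).
    Returns a dict mapping target token indices to target head indices.
    """
    target_heads = {}
    for dependent, head in source_heads.items():
        if dependent not in alignment:
            continue  # unaligned dependent: nothing to project
        target_dependent = alignment[dependent]
        if head == -1:
            target_heads[target_dependent] = -1
        elif head in alignment:
            target_heads[target_dependent] = alignment[head]
        # If the head is unaligned, the target token is left unattached.
    return target_heads


# Hypothetical three-word sentence pair: the source order is
# subject-verb-object and the target order is subject-object-verb.
source_heads = {0: 1, 1: -1, 2: 1}   # the verb (index 1) is the root
alignment = {0: 0, 1: 2, 2: 1}       # the verb moves to the final position
projected = project_dependencies(source_heads, alignment)
```

Tokens left unattached by this direct projection are the kind of gap that language-specific post-processing would need to fill.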
For the colloquium series schedule, see http://www.umiacs.umd.edu/research/CLIP/colloq/. If you are interested in meeting with the speaker, please contact Jimmy Lin <http://www.glue.umd.edu/~jimmylin/>.