Annotation errors in large corpora harm both the training and the evaluation of natural language processing technologies. How do we systematically locate such errors, and how can we correct them? Using part-of-speech (POS) annotation as our test case, we show how to detect errors and automatically correct them, and then demonstrate how the correction methodology can be adapted for POS tagging.
Starting from the idea that variation in annotation is likely erroneous, we first discuss the so-called variation n-gram error detection method. This method searches for identical strings that vary in their annotation, treating such variation as an indicator of erroneous mark-up. The method is quite successful in detecting errors in the Wall Street Journal corpus.
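To make the detection idea concrete, here is a minimal sketch in Python. It is a simplification for illustration, not the implementation presented in the talk: it assumes the corpus is a list of sentences, each a list of (word, tag) pairs, and it flags any word n-gram that occurs with more than one tag sequence. The function name `variation_ngrams`, the fixed window size `n`, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n=3):
    """Return the word n-grams that occur with more than one tag sequence.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    The result maps each varying word n-gram to the set of tag sequences
    observed for it.
    """
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        # Slide a window of n tokens over the sentence.
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[words].add(tags)
    # Keep only the strings whose annotation varies across occurrences.
    return {words: tag_seqs for words, tag_seqs in seen.items() if len(tag_seqs) > 1}


# Toy corpus (invented for illustration): "run" is tagged NN in one
# occurrence of "the run ended" and VB in the other.
toy_corpus = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("the", "DT"), ("run", "VB"), ("ended", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("ended", "VBD")],
]

flagged = variation_ngrams(toy_corpus, n=3)
for words, tag_seqs in flagged.items():
    print(" ".join(words), sorted(tag_seqs))
```

A flagged n-gram is only a candidate error, since some variation reflects genuine ambiguity. The identical surrounding words are what make an annotation error the more likely explanation.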
Having detected errors, we turn to a method for automatically correcting them. Building on the variation n-gram method, we first try correcting a corpus using two off-the-shelf POS taggers, based on the idea that they enforce consistency; this yields some improvement. After some discussion of the tagging process, we alter the tagging model to better account for problematic tagging distinctions. This modification results in significantly improved performance, reducing the error rate of the corpus.
Based on these modifications, we then explore a method for improving POS tagging in a way that is potentially more robust to errors. Automatic correction has different aims from POS tagging, so we have to make some adjustments in order to obtain broader coverage. Preliminary results indicate that this methodology does increase POS tagging precision.
After obtaining his PhD at the Ohio State University, Markus Dickinson is now a visiting assistant professor at Georgetown University. His interests include investigating the interaction of natural language processing methods with corpora, namely detecting and correcting corpus annotation errors for varying levels of annotation complexity and determining the effect of the errors on POS tagging and parsing. Recent work involves improving NLP technologies through a better use of the annotation scheme.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.