Automatic text summarization evaluation metrics such as BLEU, ROUGE, and Basic Elements are claimed to correlate highly with the results of human task-based evaluations. This research investigates these claims and introduces Relevance Prediction, a new human task-based summarization evaluation measure. Relevance Prediction is a more intuitive measure of individual task performance and is shown to produce more reliable results than LDC Agreement, an external gold-standard-based measure currently used in the summarization evaluation community.
Six experimental studies are conducted to examine whether human task-based evaluations of text summarization correlate with the output of current intrinsic automatic evaluation metrics. The experimental results indicate that moderate yet consistent correlations exist between the Relevance Prediction method and the ROUGE metric for single-document summarization. This work also formally establishes the usefulness of text summarization in reducing task time while maintaining a level of task judgment accuracy similar to that achieved with the full-text documents.
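As a rough illustration of what a correlation between an automatic metric and a human task-based measure involves, the sketch below computes a Pearson correlation coefficient over paired scores. The abstract does not say which correlation statistic the studies use, so Pearson's r is an assumption here, and the function name `pearson` and the score lists are hypothetical placeholders, not data or results from the studies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT results from the studies):
# one automatic-metric score and one human task-based score per system.
automatic_scores = [0.10, 0.20, 0.30, 0.40]
human_scores = [0.50, 0.60, 0.55, 0.80]

r = pearson(automatic_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 would mean the automatic metric ranks systems much as the human measure does; the "moderate" correlations reported above would fall well short of that.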
Stacy Hobson is a PhD candidate in the Neuroscience and Cognitive Science program, and a member of the CLIP group.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.