Will pyramids built of nuggets topple over?

Title: Will pyramids built of nuggets topple over?
Publication Type: Conference Papers
Year of Publication: 2006
Authors: Jimmy Lin, Dina Demner-Fushman
Conference Name: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Date Published: 2006
Publisher: Association for Computational Linguistics
Conference Location: Stroudsburg, PA, USA

The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called "nuggets". The official F-score metric represents the harmonic mean of recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented in a binary split between "vital" and "okay" nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our "nugget pyramids" address many shortcomings of the present methodology, while introducing only minimal additional overhead on the evaluation flow.
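
The contrast the abstract draws between the binary vital/okay split and assessor-weighted nuggets can be illustrated with a small sketch. It is not the official TREC scorer or the authors' code. The `Nugget` class, the function names, the vote-proportional weighting, and the `beta` parameter are all assumptions made for illustration, and precision is taken as a given input because the abstract does not describe how it is computed.

```python
from dataclasses import dataclass


@dataclass
class Nugget:
    """Hypothetical record for one nugget (not a TREC data format)."""
    text: str
    vital: bool       # a single assessor's binary vital/okay label
    vital_votes: int  # how many of several assessors marked it vital


def binary_recall(nuggets, found):
    """Recall under the binary split: only vital nuggets earn credit."""
    vital_ids = [i for i, n in enumerate(nuggets) if n.vital]
    if not vital_ids:
        return 0.0
    return sum(1 for i in vital_ids if i in found) / len(vital_ids)


def pyramid_recall(nuggets, found):
    """Recall with each nugget weighted by its share of assessor votes.

    Assumption: weight is proportional to vital_votes. Dividing every
    weight by the maximum vote count would cancel in this ratio.
    """
    total = sum(n.vital_votes for n in nuggets)
    if total == 0:
        return 0.0
    return sum(nuggets[i].vital_votes for i in found) / total


def f_score(recall, precision, beta=1.0):
    """Weighted harmonic mean; beta=1 gives the plain harmonic mean."""
    if recall == 0.0 or precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Made-up example: three assessors, four nuggets, system finds B and C.
nuggets = [
    Nugget("A", vital=True, vital_votes=3),
    Nugget("B", vital=False, vital_votes=2),
    Nugget("C", vital=False, vital_votes=1),
    Nugget("D", vital=False, vital_votes=0),
]
found = {1, 2}

print(f_score(binary_recall(nuggets, found), precision=1.0))
print(f_score(pyramid_recall(nuggets, found), precision=1.0))
```

In this made-up example the system retrieves two nuggets that some assessors considered vital, yet the binary split gives it zero recall and therefore a zero F-score, while the weighted variant gives partial credit. That is the kind of zero-score instability the abstract attributes to the vital/okay distinction.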