Will pyramids built of nuggets topple over?

Title	Will pyramids built of nuggets topple over?
Publication Type	Conference Papers
Year of Publication	2006
Authors	Lin J, Demner-Fushman D
Conference Name	Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics
Date Published	2006
Publisher	Association for Computational Linguistics
Conference Location	Stroudsburg, PA, USA
Abstract	The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called "nuggets". The official F-score metric represents the harmonic mean between recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented in a binary split between "vital" and "okay" nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our "nugget pyramids" address many shortcomings of the present methodology, while introducing only minimal additional overhead on the evaluation flow.
URL	http://dx.doi.org/10.3115/1220835.1220884
DOI	10.3115/1220835.1220884
