Deconstructing nuggets: the stability and reliability of complex question answering evaluation

Title: Deconstructing nuggets: the stability and reliability of complex question answering evaluation
Publication Type: Conference Paper
Year of Publication: 2007
Authors: Jimmy Lin, P. Zhang
Conference Name: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Date Published: 2007
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-59593-597-7
Keywords: complex information needs, human judgments, TREC
Abstract

A methodology based on "information nuggets" has recently emerged as the de facto standard by which answers to complex questions are evaluated. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that provide a method to quantify scoring variations. Understanding these variations in turn lets researchers be more confident in their comparisons of systems.
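The F-score the abstract refers to is the standard nugget-based measure used in the TREC QA tracks: recall over vital nuggets, a length-based approximation of precision, and a beta weighting that favors recall. A minimal sketch of that computation, assuming the usual TREC definition (100-character allowance per matched nugget, beta = 3); function and parameter names here are illustrative, not from the paper:

```python
def nugget_f_score(vital_returned, vital_total, okay_returned,
                   answer_length, beta=3.0, allowance_per_nugget=100):
    """Nugget-based F-score for one question (TREC-style definition).

    vital_returned  -- vital nuggets the assessor matched in the answer
    vital_total     -- vital nuggets in the answer key
    okay_returned   -- non-vital ("okay") nuggets matched
    answer_length   -- length of the answer strings, in characters
    """
    if vital_total == 0:
        return 0.0
    # Recall counts only vital nuggets.
    recall = vital_returned / vital_total
    # Precision is approximated via a length allowance: each matched
    # nugget (vital or okay) "earns" a budget of characters.
    allowance = allowance_per_nugget * (vital_returned + okay_returned)
    if answer_length < allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision == 0.0 and recall == 0.0:
        return 0.0
    # F-beta with beta = 3 weights recall much more heavily than precision.
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
```

Because the nugget-to-answer assignment in the numerator is a human judgment, two assessors can produce different `vital_returned` counts for identical answer strings, which is exactly the scoring variation the paper quantifies.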

URL: http://doi.acm.org/10.1145/1277741.1277799
DOI: 10.1145/1277741.1277799