Introduction
Question Answering for the Spoken Web (QASW) is an information
retrieval evaluation in which the goal is to match questions spoken in
Gujarati to answers spoken in Gujarati. QASW is a task in the 2013 Forum for Information Retrieval
Evaluation (FIRE). MediaEval
2013 participants are welcome to participate in the FIRE QASW
task, but evaluation results will not be available in time for the
MediaEval meeting.
The source of the questions and the collection of possible answers is the IBM Spoken Web Gujarati collection. The questions are actual questions asked by users of an operational system. The collection of possible answers comprises both answers that were actually given to specific questions (an answer may apply to more than one question, since some topics were asked about more than once) and announcements on topics of general interest; both come from the same operational system. The collection will be distributed under a license that permits free use for research purposes, and the resulting test collection (including relevance judgments) will be deposited in a community repository (e.g., LDC or ELDA). The questions, the collection, and the relevance judgments will be identical in the two evaluations.
We expect QASW to be of interest to researchers interested in speech
recognition, information retrieval (including question answering), and
information and communications technology for development (ICTD).
Test Collection
We have transcribed 196 questions (from the 2,285 that were asked
in the operational system on the date we captured them). From these,
we have selected 50 for training (numbered 1-50) and 99 for
evaluation (numbered 101-201, excluding 174-175). Our
intent in selecting 99 evaluation questions is to make it likely that
we will get a yield of at least 5 relevant documents for at least 50
of the questions.
The collection to be searched will consist of about 4,000 speech segments, selected from 3,557 answers that were given in response to specific questions and 834 "announcements" (general answers provided to address topics of general interest). The selection was performed by removing segments that were too short to be useful or that contain no recognizable speech. The speech segments are available in two forms: (1) .wav audio files and (2) manual transcripts. A Gujarati stemmer and stopword list are also available.
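For teams working from the manual transcripts, a typical first step is stopword removal followed by stemming. The sketch below shows the general shape of that pipeline; the placeholder stopword set and stemmer are illustrative assumptions, not the actual Gujarati resources distributed with the collection.

```python
def preprocess(tokens, stopwords, stem):
    """Drop stopwords, then stem the remaining tokens."""
    return [stem(t) for t in tokens if t not in stopwords]

# Toy usage with placeholder data (NOT the real Gujarati stemmer or stopword list):
stopwords = {"che", "ane"}          # placeholder stopword set
stem = lambda t: t.rstrip("o")      # placeholder suffix-stripping stemmer
print(preprocess(["pani", "ane", "khetaro"], stopwords, stem))
# -> ['pani', 'khetar']
```

The same function would be applied to both the transcribed questions and the transcribed answer segments before indexing.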
Relevance will be judged on a graded scale, with pools formed to
depth 30 (or deeper, if resources allow).
Participating systems will be asked to submit depth-1000 results using
the full question, and also using truncated versions of the question
(truncated at 5 seconds, 10 seconds, 15 seconds, etc.). The principal
evaluation measure will be mean NDCG for the full questions.
Participating systems will also be asked to predict which truncation
point maximizes a reward function that rewards DCG@1 and that
penalizes duration (i.e., later truncation points) -- the goal of this
measure is to encourage the design of systems that can determine when
to "barge in" for the first time with a plausible answer to the
question (in a real system, subsequent interaction would be possible,
but that will not be modeled in 2013).
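For reference, NDCG and a duration-penalized reward of the kind described above can be sketched as follows. The exact form of the official reward function and its penalty weight will be defined by the organizers; the version here is an illustrative assumption.

```python
import math

def dcg(gains, k=None):
    """Discounted cumulative gain over a ranked list of graded relevance gains."""
    if k is not None:
        gains = gains[:k]
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains, k=None):
    """DCG normalized by the DCG of an ideal (descending-gain) ranking."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def truncation_reward(dcg_at_1, duration_seconds, penalty_per_second=0.01):
    """Hypothetical reward: credit DCG@1 and penalize later truncation points.
    The official function will be specified by the task organizers."""
    return dcg_at_1 - penalty_per_second * duration_seconds
```

Under such a function, a system deciding when to "barge in" would compare, e.g., `truncation_reward(dcg5, 5)` against `truncation_reward(dcg15, 15)` and answer at the truncation point with the higher expected reward.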
Schedule