Metrics to Compare Question-Answering Datasets

Open Access
Eckman, Leah
Area of Honors:
Computer Science
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
  • Rebecca Passonneau, Thesis Supervisor
  • John Joseph Hannan, Honors Advisor
Keywords:
  • Natural Language Processing
  • Artificial Intelligence
  • QA systems
  • QA datasets
  • language modeling
Abstract:
The field of natural language processing now includes many question-answering (QA) datasets, which are used to train and test QA systems. These systems have traditionally been evaluated by accuracy, which measures whether a system returns the correct answer to each question. Little to no work has examined the characteristics of the question datasets themselves or ways to compare those characteristics. Because accuracy is always relative to a given question dataset, understanding how QA datasets differ along various characteristics could provide a foundation for interpreting differences in accuracy across datasets. This thesis investigates concrete ways to evaluate and compare question datasets. Specifically, it applies automated methods to measure syntactic and semantic characteristics of QA datasets and evaluates which of these measures distinguish the datasets from one another. The overall findings are promising, particularly for measures of dataset ambiguity, which can reveal differences in how QA systems handle questions as well as differences among the QA datasets themselves.