Ditch the Gold Standard: Re-evaluating Conversational Question Answering

This is the repository for our paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering.

Overview

In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems. In our evaluation, human annotators chat with conversational QA models about passages from the QuAC development set and then judge the correctness of the model answers. We release the human-annotated dataset in the following section.

We also identify a critical issue with the current automatic evaluation, which pre-collects human-human conversations and uses ground-truth answers as the conversational history (the figure below illustrates the differences between evaluation protocols). By comparison, we find that the automatic evaluation does not always agree with the human evaluation. We propose a new evaluation protocol based on predicted history and question rewriting. Our experiments show that the new protocol better reflects real-world performance than the original automatic evaluation. We also provide code for the new evaluation protocol below.

Different evaluation protocols

Human Evaluation Dataset

You can download the human-annotated dataset from data/human_annotation_data.json. The JSON file contains a single field, data, which is a list of conversations. Each conversation contains the following fields (a minimal loading sketch follows the field list):

  • model_name: The model evaluated. One of bert4quac, graphflow, ham, excord.
  • context: The passage used in this conversation.
  • dialog_id: The ID from the original QuAC dataset.
  • qas: The conversation, which contains a list of QA pairs. Each QA pair has the following fields:
    • turn_id: The turn number.
    • question: The question from the human annotator.
    • answer: The answer from the model.
    • valid: Whether the question is valid (annotated by our human annotator).
    • answerable: Whether the question is answerable (annotated by our human annotator).
    • correct: Whether the model's answer is correct (annotated by our human annotator).
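
As a quick reference, the following is a minimal sketch of how the dataset could be loaded and summarized with standard-library Python, assuming the annotation flags (valid, answerable, correct) are stored as booleans.

import json
from collections import defaultdict

# Load the list of annotated conversations.
with open("data/human_annotation_data.json") as f:
    conversations = json.load(f)["data"]

# Example: per-model accuracy over valid, answerable turns.
correct = defaultdict(int)
total = defaultdict(int)
for conv in conversations:
    model = conv["model_name"]  # one of bert4quac, graphflow, ham, excord
    for qa in conv["qas"]:
        # Only count turns marked as valid and answerable by the annotator.
        if qa["valid"] and qa["answerable"]:
            total[model] += 1
            correct[model] += int(qa["correct"])

for model in sorted(total):
    print(f"{model}: {correct[model] / total[model]:.3f}")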

Automatic model evaluation interface

We provide a convenient interface for testing model performance under several of the evaluation protocols compared in our paper, including Auto-Pred, Auto-Replace, and our proposed protocol, Auto-Rewrite, which better reflects model performance in human-model conversations. Please refer to our paper for more details. The figure below illustrates how Auto-Rewrite works, and a conceptual sketch follows the figure.

Auto-rewrite
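
To make the protocol concrete, here is a minimal conceptual sketch of the Auto-Rewrite loop. The names model.predict and rewrite_question are placeholders for illustration; the actual implementation lives in run_quac_eval.py.

# Conceptual sketch of the Auto-Rewrite loop only; the actual implementation
# lives in run_quac_eval.py and uses an AllenNLP coreference model together
# with ncr.replace_corefs for the rewriting step.
def auto_rewrite_eval(model, passage, gold_questions, rewrite_question):
    predicted_history = []
    predictions = []
    for question in gold_questions:
        # Rewrite the gold question so that it stays coherent with the
        # model's own predicted history rather than the gold history.
        rewritten = rewrite_question(question, predicted_history)
        answer = model.predict(passage, rewritten, predicted_history)
        predicted_history.append((rewritten, answer))
        predictions.append(answer)
    return predictions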

To use our evaluation interface on your own model, follow these steps:

  • Step 1: Download the QuAC dataset.

  • Step 2: Install allennlp, allennlp_models, ncr.replace_corefs through pip if you would like to use Auto-Rewrite.

  • Step 3: Download the CANARD dataset and set --canard_path if you would like to use Auto-Replace.

  • Step 4: Write a model interface following the template interface.py. Explanations of each function are provided as in-line comments. Make sure to import all your model dependencies at the top. A hypothetical sketch of such a wrapper is shown after this list.

  • Step 5: Add the model to the evaluation script run_quac_eval.py. Changes that need to be made are marked with #TODO.

  • Step 6: Run the evaluation script. See run.sh for reference. Explanations of all arguments are provided in run_quac_eval.py. Make sure to turn on only one of --pred, --rewrite, or --replace.
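
As a rough illustration of Step 4, below is a hypothetical wrapper sketch. interface.py is the authoritative template, and its actual function names and signatures should be followed rather than the placeholder names used here.

# Hypothetical wrapper sketch only; follow the function names and signatures
# defined in the template interface.py rather than the placeholders here.
class MyQuACModel:
    def __init__(self, checkpoint_path):
        # Step 4: import all model dependencies at the top of your interface
        # file and load the trained model once here.
        self.model = self._load_model(checkpoint_path)

    def _load_model(self, checkpoint_path):
        raise NotImplementedError("load your trained conversational QA model")

    def predict(self, passage, question, history):
        # `history` holds the conversation so far; under Auto-Pred and
        # Auto-Rewrite it contains the model's own previous answers.
        # Return the predicted answer span as a string.
        raise NotImplementedError("run your model's inference")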

Citation

@article{li2021ditch,
   title={Ditch the Gold Standard: Re-evaluating Conversational Question Answering},
   author={Li, Huihan and Gao, Tianyu and Goenka, Manan and Chen, Danqi},
   journal={arXiv preprint arXiv:2112.08812},
   year={2021}
}