Entity-Based Knowledge Conflicts in Question Answering

Overview

Run Instructions | Paper | Citation | License

This repository provides the Substitution Framework described in Section 2 of our paper Entity-Based Knowledge Conflicts in Question Answering. Given a question answering dataset, we derive a new dataset in which the context passages have been modified to have new answers to their questions. By training on the original examples and evaluating on the derived examples, we simulate a parametric-contextual knowledge conflict, which is useful for understanding how models employ sources of knowledge to arrive at a decision.

Our dataset derivation follows two steps: (1) identifying named entity answers, and (2) replacing all occurrences of the answer in the context with a substituted entity, effectively changing the answer. The answer substitutions depend on the chosen substitution policy.
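As a rough illustration of step (2), the sketch below replaces every occurrence of an answer string in a context passage. The function name substitute_answer and the whole-word regex matching are assumptions made for this example; the repository may locate answer spans differently (for instance, by character offsets).

```python
import re


def substitute_answer(context: str, answer: str, substitute: str) -> str:
    """Replace every whole-word occurrence of `answer` in `context`."""
    # Escape the answer so punctuation inside entity names is matched literally,
    # and require non-word boundaries so "Paris" does not match "Parisian".
    pattern = re.compile(r"(?<!\w)" + re.escape(answer) + r"(?!\w)")
    # A function replacement avoids backslash interpretation in `substitute`.
    return pattern.sub(lambda _match: substitute, context)


context = "Paris is the capital of France. Paris lies on the Seine."
new_context = substitute_answer(context, "Paris", "Lyon")
print(new_context)
```

After the substitution, the context supports "Lyon" as the answer to "What is the capital of France?", which conflicts with what a model trained on the original examples is likely to have memorized.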

Run Instructions

1. Setup

Set up the requirements and download the SpaCy and Wikidata dependencies.

bash setup.sh

2. (Optional) Download and Process Wikidata

This optional stage reproduces wikidata/entity_info.json.gz, downloaded during Setup.

Download the Wikidata dump from October 2020 here and the Wikipedia pageviews from June 2, 2021 here.

NOTE: We don't use the newest Wikidata dump because Wikidata doesn't keep old dumps so reproducibility is an issue. If you'd like to use the newest dump, it is available here. Wikipedia pageviews, on the other hand, are kept around and can be found here. Be sure to download the *-user.bz2 file and not the *-automatic.bz2 or the *-spider.bz2 files.

To extract the Wikidata information, run the following (takes ~8 hours):

python extract_wikidata_info.py --wikidata_dump wikidata-20201026-all.json.bz2 --popularity_dump pageviews-20210602-user.bz2 --output_file entity_info.json.gz

The output file of this step is available here.

3. Load and Preprocess Dataset

PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsTrain -w wikidata/entity_info.json.gz
PYTHONPATH=. python src/load_dataset.py -d MRQANaturalQuestionsDev -w wikidata/entity_info.json.gz

4. Generate Substitutions

PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsTrain.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsTrain.jsonl <substitution-command> -n 1 ...
PYTHONPATH=. python src/generate_substitutions.py --inpath datasets/normalized/MRQANaturalQuestionsDev.jsonl --outpath datasets/substitution-sets/MRQANaturalQuestionsDev.jsonl <substitution-command> -n 1 ...

See descriptions of the substitution policies (substitution-commands) we provide here. Inspect the argparse and substitution-specific subparsers in generate_substitutions.py to see additional arguments.

Our Substitution Functions

Here we define the substitution functions we provide. Each function ingests a QADataset and modifies the context passage, according to defined rules, so that the context now supports a new answer to the question. Greater detail is provided in our paper.

  • Alias Substitution (sub-command: alias-substitution) --- Here we replace an answer with one of its Wikidata aliases. Since the substituted answer is always semantically equivalent, answer type preservation is naturally maintained.
  • Popularity Substitution (sub-command: popularity-substitution) --- Here we replace answers with a Wikidata answer of the same type, within a specified popularity bracket (according to monthly page views).
  • Corpus Substitution (sub-command: corpus-substitution) --- Here we replace answers with other answers of the same type, sampled from the same corpus.
  • Type Swap Substitution (sub-command: type-swap-substitution) --- Here we replace answers with other answers of a different type, sampled from the same corpus.
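
To make the difference between corpus substitution and type swap substitution concrete, here is a minimal sketch of the sampling step only. The (text, type) tuple representation, the type labels, and the sample_substitute helper are all hypothetical and are not taken from the repository.

```python
import random
from typing import List, Tuple

# Hypothetical representation of an answer: (answer text, answer type).
Answer = Tuple[str, str]


def sample_substitute(original: Answer, corpus: List[Answer],
                      same_type: bool, rng: random.Random) -> str:
    """Pick a replacement answer from the corpus.

    same_type=True mimics corpus substitution (type is preserved);
    same_type=False mimics type swap substitution (type is changed).
    """
    text, answer_type = original
    # Sort the deduplicated candidates so sampling is reproducible for a given seed.
    candidates = sorted({
        cand_text for cand_text, cand_type in corpus
        if cand_text != text and (cand_type == answer_type) == same_type
    })
    if not candidates:
        raise ValueError("no candidate answers satisfy the type constraint")
    return rng.choice(candidates)


corpus = [("Paris", "LOC"), ("Lyon", "LOC"), ("Einstein", "PER"), ("1999", "DATE")]
rng = random.Random(0)
same = sample_substitute(("Paris", "LOC"), corpus, same_type=True, rng=rng)
swapped = sample_substitute(("Paris", "LOC"), corpus, same_type=False, rng=rng)
print(same, swapped)
```

Popularity substitution could be sketched the same way by filtering candidates on a page-view range instead of sampling from the corpus.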

How to Add Your own Dataset / Substitution Fn / NER Models

Use your own Dataset

To add your own dataset, create your own subclass of QADataset (in src/classes/qadataset.py).

  1. Override the read_original_dataset function to read your dataset, creating a List of QAExample objects.
  2. Add your class and the url/filepath to the DATASETS variable in src/load_dataset.py.

See MRQANaturalQuetsionsDataset in src/classes/qadataset.py as an example.
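
The shape of step 1 can be sketched as follows. The QAExample and QADataset classes below are simplified stand-ins written for this example, not the repository's classes: their fields and constructor arguments are assumptions, as are the JSON Lines layout and the MyJsonlDataset name. Check src/classes/qadataset.py for the real signatures before adapting this.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class QAExample:
    """Stand-in for the repository's QAExample; the real fields may differ."""
    uid: str
    query: str
    context: str
    answers: List[str]


class QADataset:
    """Stand-in base class; the real one lives in src/classes/qadataset.py."""

    def __init__(self, original_path: str):
        self.examples = self.read_original_dataset(original_path)

    def read_original_dataset(self, file_path: str) -> List[QAExample]:
        raise NotImplementedError


class MyJsonlDataset(QADataset):
    """Hypothetical dataset stored as one JSON object per line."""

    def read_original_dataset(self, file_path: str) -> List[QAExample]:
        examples = []
        with open(file_path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                examples.append(
                    QAExample(row["id"], row["question"], row["context"], row["answers"])
                )
        return examples


# Usage: write a one-line toy file and load it through the subclass.
row = {"id": "q1", "question": "What is the capital of France?",
       "context": "Paris is the capital of France.", "answers": ["Paris"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(row) + "\n")
    dataset = MyJsonlDataset(path)
print(len(dataset.examples), dataset.examples[0].answers)
```

Step 2, registering the class in the DATASETS variable, is not shown because its exact structure is defined in src/load_dataset.py.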

Use your own Substitution Function

We define 5 different substitution functions in src/generate_substitutions.py. These are described here. Inspect their docstrings and feel free to add your own, leveraging any of the wikidata, derived answer type, or other info we populate for examples and answers. Here are the steps to create your own:

  1. Add a subparser in src/generate_substitutions.py for your new function, with any relevant parameters. See alias_sub_parser as an example.
  2. Add your own substitution function to src/substitution_fns.py, ensuring the signature arguments match those specified in the subparser. See alias_substitution_fn as an example.
  3. Add a reference to your new function to SUBSTITUTION_FNS in src/generate_substitutions.py. Ensure the dictionary key matches the subparser name.
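
The three steps above follow a common argparse pattern: one subparser per sub-command, plus a dictionary that maps the sub-command name to a function. The sketch below shows that pattern in isolation. The shout-substitution command, its --suffix flag, and the toy function are invented for illustration, and the real parser in src/generate_substitutions.py may be wired differently.

```python
import argparse


def shout_substitution_fn(answers, suffix):
    """Hypothetical substitution function: append a suffix to each answer."""
    return [answer + suffix for answer in answers]


# The dictionary key must match the subparser name used below.
SUBSTITUTION_FNS = {"shout-substitution": shout_substitution_fn}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--inpath")
    parser.add_argument("--outpath")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # Step 1: a subparser carrying the parameters specific to the new function.
    shout_parser = subparsers.add_parser("shout-substitution")
    shout_parser.add_argument("--suffix", default="!")
    return parser


args = build_parser().parse_args(
    ["--inpath", "in.jsonl", "--outpath", "out.jsonl",
     "shout-substitution", "--suffix", "?"]
)
# Step 3: look up the function by the sub-command name and call it with
# the arguments declared on its subparser.
substitution_fn = SUBSTITUTION_FNS[args.command]
result = substitution_fn(["Paris", "Lyon"], suffix=args.suffix)
print(result)
```

Keeping the subparser name and the dictionary key identical is what lets the dispatch line stay a plain dictionary lookup.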

Use your own Named Entity Recognition and/or Entity Linking Model

Our SpaCy NER model is trained and used mainly to categorize answer text into answer types. Only substitutions that preserve answer type are likely to be coherent.

The functions which need to be changed are:

  1. run_ner_linking in utils.py, which loads the NER model and populates info for each answer (see function docstring).
  2. Answer._select_answer_type() in src/classes/answer.py, which uses the NER answer type label and Wikidata type labels to categorize the answer into a type category.

Citation

Please cite the following if you found this resource or our paper useful.

@misc{longpre2021entitybased,
      title={Entity-Based Knowledge Conflicts in Question Answering}, 
      author={Shayne Longpre and Kartik Perisetla and Anthony Chen and Nikhil Ramesh and Chris DuBois and Sameer Singh},
      year={2021},
      eprint={2109.05052},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The Knowledge Conflicts repository and the entity-based substitution framework are licensed according to the LICENSE file.

Contact Us

To contact us, feel free to email the authors listed in the paper or create an issue in this repository.
