Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Overview

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

This repository contains the source code for an end-to-end open-domain question answering system. The system is made up of two components: a retriever model and a reading comprehension (question answering) model. We provide the code for these two models in addition to demo code based on Streamlit. A video of the demo can be viewed here.

Installation

Our system uses PubMedBERT, a neural language model that is pretrained on PubMed abstracts for the retriever. Download the PyTorch version of PubMedBert here. For reading comprehension, we utilize BioBERT fine-tuned on SQuAD V2 . The model can be found here.

Datasets

We provide the COVID-QA dataset under the data directory. This is used for both the retriever and reading models. The train/dev/test files for the retriever are named dense_*.txt and those for reading comprehension are named qa_*.json.

The CORD-19 dataset is available for download here. Our system requires download of both the document_parses and metadata files for complete article information. For our system we use the 2021-02-15 download but any other download can also work. This must be combined into a jsonl file where each line contains a json object with:

  • id: article PMC id
  • title: article title
  • text: article text
  • index: text's index in the corpus (also the same as line number in the jsonl file)
  • date: article date
  • journal: journal published
  • authors: author list

We split each article into multiple json entries based on paragraph text cutoff in the document_parses file. Paragraphs that are longer than 200 tokens are split futher. This can be done with splitCORD.py where

* metdata-file: the metadata downloaded for CORD
* pmc-path: path to the PMC articles downloaded for CORD
* out-path: output jsonl file

Dense Retrieval Model

Once we have our model (PubMedBERT), we can start training. More specifically during training, we use positive and negative paragraphs, positive being paragraphs that contain the answer to a question, and negative ones not. We train on the COVID-QA dataset (see the Datasets section for more information on COVID-QA). We have a unified encoder for both questions and text paragraphs that learns to encode questions and associated texts into similar vectors. Afterwards, we use the model to encode the CORD-19 corpus.

Training

scripts/train.sh can be used to train our dense retrieval model.

CUDA_VISIBLE_DEVICES=0 python ../train_retrieval.py \
    --do_train \
    --prefix strong_dpr_baseline_b150 \
    --predict_batch_size 2000 \
    --model_name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --train_batch_size 75 \
    --learning_rate 2e-5 \
    --fp16 \
    --train_file ../data/dense_train.txt \
    --predict_file ../data/dense_dev.txt \
    --seed 16 \
    --eval_period 300 \
    --max_c_len 300 \
    --max_q_len 30 \
    --warmup_ratio 0.1 \
    --num_train_epochs 20 \
    --dense_only \
    --output_dir /path/to/model/output \

Here are things to keep in mind:

1. The output_dir flag is where the model will be saved.
2. You can define the init_checkpoint flag to continue fine-tuning on another dataset.

The Dense retrieval model is then combined with BM25 for reranking (see paper for details).

Corpus

Next, go to scripts/encode_covid_corpus.sh for the command to encode our corpus.

CUDA_VISIBLE_DEVICES=0 python ../encode_corpus.py \
    --do_predict \
    --predict_batch_size 1000 \
    --model_name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
    --fp16 \
    --predict_file /path/to/corpus \
    --max_c_len 300 \
    --init_checkpoint /path/to/saved/model/checkpoint_best.pt \
    --save_path /path/to/encoded/corpus

We pass the corpus (CORD-19) to our trained encoder in our dense retrieval model. Corpus embeddings are indexed.

Here are things to keep in mind:

1. The predict_file flag should take in your CORD-19 dataset path. It should be a .jsonl file.
2. Look at your output_dir path when you ran train.sh. After training our model, we should now have a checkpoint in that folder. Copy the exact path onto
the init_checkpoint flag here.
3. As previously mentioned, the result of these commands is the corpus (CORD-19) embeddings become indexed. The embeddings are saved in the save_path flag argument. Create that directory path as you wish.

Evaluation

You can run scripts/eval.sh to evaluate the document retrieval model.

CUDA_VISIBLE_DEVICES=0 python ../eval_retrieval.py \
    ../data/dense_test.txt \
    /path/to/encoded/corpus \
    /path/to/saved/model/checkpoint_best.pt \
    --batch-size 1000 --model-name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext  --topk 100 --dimension 768

We evaluate retrieval on a test set from COVID-QA. We determine the percentage of questions that have retrieved paragraphs with the correct answer across different top-k settings.

We do that in the following 3 ways:

  1. exact answer matches in top-k retrievals
  2. matching articles in top-k retrievals
  3. F1 and Siamese BERT fuzzy matching

Here are things to think about:

1. The first, second, and third arguments are our COVID-QA test set, corpus indexed embeddings, and retrieval model respectively.
2. The other flag that is important is the topk one. This flag determines the quantity of retrieved CORD19 paragraphs.

Reading Comprehension

We utilize the HuggingFace's question answering scripts to train and evaluate our reading comprehension model. This can be done with scripts/qa.sh. The scripts are modified to allow for the extraction of multiple answer spans per document. We use a BioBERT model fine-tuned on SQuAD V2 as our pre-trained model.

CUDA_VISIBLE_DEVICES=0 python ../qa/run_qa.py \
  --model_name_or_path ktrapeznikov/biobert_v1.1_pubmed_squad_v2 \
  --train_file ../data/qa_train.json \
  --validation_file ../data/qa_dev.json \
  --test_file ../data/qa_test.json \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /path/to/model/output \

Demo

We combine the retrieval model and reading model for an end-to-end open-domain question answering demo with Streamlit. This can be run with scripts/demo.sh.

CUDA_VISIBLE_DEVICES=0 streamlit run ../covid_qa_demo.py -- \
  --retriever-model-name microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
  --retriever-model path/to/saved/retriever_model/checkpoint_best.pt \
  --qa-model-name ktrapeznikov/biobert_v1.1_pubmed_squad_v2 \
  --qa-model /path/to/saved/qa_model \
  --index-path /path/to/encoded/corpus

Here are things to keep in mind:

1. retriever-model is the checkpoint file of your trained retriever model.
2. qa-model is the trained reading comprehension model.
3. index-path is the path to the encoded corpus embeddings.

Requirements

See requirements.txt

[NeurIPS-2021] Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data

MosaicKD Code for NeurIPS-21 paper "Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data" 1. Motivation Natural images share common l

ZJU-VIPA 37 Nov 10, 2022
Reference PyTorch implementation of "End-to-end optimized image compression with competition of prior distributions"

PyTorch reference implementation of "End-to-end optimized image compression with competition of prior distributions" by Benoit Brummer and Christophe

Benoit Brummer 6 Jun 16, 2022
Like Dirt-Samples, but cleaned up

Clean-Samples Like Dirt-Samples, but cleaned up, with clear provenance and license info (generally a permissive creative commons licence but check the

TidalCycles 39 Nov 30, 2022
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 02, 2023
Code for ACL2021 long paper: Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases

LANKA This is the source code for paper: Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases (ACL 2021, long paper) Referen

Boxi Cao 30 Oct 24, 2022
TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al.

FunMatch-Distillation TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A g

Sayak Paul 67 Dec 20, 2022
Python wrapper of LSODA (solving ODEs) which can be called from within numba functions.

numbalsoda numbalsoda is a python wrapper to the LSODA method in ODEPACK, which is for solving ordinary differential equation initial value problems.

Nick Wogan 52 Jan 09, 2023
Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.

WSDEC This is the official repo for our NeurIPS paper Weakly Supervised Dense Event Captioning in Videos. Description Repo directories ./: global conf

Melon(Xuguang Duan) 96 Nov 01, 2022
Random-Afg - Afghanistan Random Old Idz Cloner Tools

AFGHANISTAN RANDOM OLD IDZ CLONER TOOLS Install $ apt update $ apt upgrade $ apt

MAHADI HASAN AFRIDI 5 Jan 26, 2022
Pairwise model for commonlit competition

Pairwise model for commonlit competition To run: - install requirements - create input directory with train_folds.csv and other competition data - cd

abhishek thakur 45 Aug 31, 2022
A large-scale video dataset for the training and evaluation of 3D human pose estimation models

ASPset-510 (Australian Sports Pose Dataset) is a large-scale video dataset for the training and evaluation of 3D human pose estimation models. It contains 17 different amateur subjects performing 30

Aiden Nibali 25 Jun 20, 2021
Template repository for managing machine learning research projects built with PyTorch-Lightning

Tutorial Repository with a minimal example for showing how to deploy training across various compute infrastructure.

Sidd Karamcheti 3 Feb 11, 2022
Deep Learning with PyTorch made easy 🚀 !

Deep Learning with PyTorch made easy 🚀 ! Carefree? carefree-learn aims to provide CAREFREE usages for both users and developers. It also provides a c

381 Dec 22, 2022
BBScan py3 - BBScan py3 With Python

BBScan_py3 This repository is forked from lijiejie/BBScan 1.5. I migrated the fo

baiyunfei 12 Dec 30, 2022
Faster Convex Lipschitz Regression

Faster Convex Lipschitz Regression This reepository provides a python implementation of our Faster Convex Lipschitz Regression algorithm with GPU and

Ali Siahkamari 0 Nov 19, 2021
Motion planning environment for Sampling-based Planners

Sampling-Based Motion Planners' Testing Environment Sampling-based motion planners' testing environment (sbp-env) is a full feature framework to quick

Soraxas 23 Aug 23, 2022
Time-Optimal Planning for Quadrotor Waypoint Flight

Time-Optimal Planning for Quadrotor Waypoint Flight This is an example implementation of the paper "Time-Optimal Planning for Quadrotor Waypoint Fligh

Robotics and Perception Group 38 Dec 02, 2022
Probabilistic Gradient Boosting Machines

PGBM Probabilistic Gradient Boosting Machines (PGBM) is a probabilistic gradient boosting framework in Python based on PyTorch/Numba, developed by Air

Olivier Sprangers 112 Dec 28, 2022
Object detection evaluation metrics using Python.

Object detection evaluation metrics using Python.

Louis Facun 2 Sep 06, 2022
验证码识别 深度学习 tensorflow 神经网络

captcha_tf2 验证码识别 深度学习 tensorflow 神经网络 使用卷积神经网络,对字符,数字类型验证码进行识别,tensorflow使用2.0以上 目前项目还在更新中,诸多bug,欢迎提出issue和PR, 希望和你一起共同完善项目。 实例demo 训练过程 优化器选择: Adam

5 Apr 28, 2022