This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Overview

Splinter

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to appear at ACL 2021.

Our pretraining code is based on TensorFlow (checked on 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note each has its own requirement file: pretraining/requirements.txt and finetuning/requirements.txt.

Data

Downloading Few-Shot MRQA Splits

curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot

Pretrained Model

Command for downloading Splinter
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter 

Pretraining

Create a virtual environment and execute

cd pretraining
pip install -r requirements.txt  # or requirements-gpu.txt for a GPU version

Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:

  • For wiki, a tag starts a new article and a ends it.
  • For BookCorpus, we process an already-tokenized file where tokens are separated by whitespaces. Newlines stands for a new book.
Command for creating the pretraining data

This command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining. It supports the following masking schemes:

Command for creating the data for Splinter (recurring span selection)
cd pretraining
python create_pretraining_data.py \
    --input_file=$INPUT_PATTERN \
    --output_dir=$OUTPUT_DIR \
    --vocab_file=vocabs/bert-cased-vocab.txt \
    --do_lower_case=False \
    --do_whole_word_mask=False \
    --max_seq_length=512 \
    --num_processes=63 \
    --dupe_factor=5 \
    --max_span_length=10 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30

n-gram statistics are written to ngrams.txt in the output directory.

Command for pretraining Splinter
cd pretraining
python run_pretraining.py \
    --bert_config_file=configs/bert-base-cased-config.json \
    --input_file=$INPUT_FILE \
    --output_dir=$OUTPUT_DIR \
    --max_seq_length=512 \
    --recurring_span_selection=True \
    --only_recurring_span_selection=True \
    --max_questions_per_seq=30 \
    --do_train \
    --train_batch_size=256 \
    --learning_rate=1e-4 \
    --num_train_steps=2400000 \
    --num_warmup_steps=10000 \
    --save_checkpoints_steps=10000 \
    --keep_checkpoint_max=240 \
    --use_tpu \
    --num_tpu_cores=8 \
    --tpu_name=$TPU_NAME

This can be trained using GPUs by dropping the use_tpu flag (although it was tested mainly on TPUs).

Convert TensorFlow Model to PyTorch

In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch. You can do so by

cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH

Fine-tuning

Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute

cd finetuning
pip install -r requirements.txt

Please Note: If you want to reproduce results from the paper or run with a QASS head in genral, questions need to be augmented with a [QUESTION] token. In order to do so, please run

cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"

This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.

Then fine-tune Splinter by

cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
    --model_type=bert \
    --model_name_or_path=$MODEL \
    --qass_head=True \
    --tokenizer_name=$MODEL \
    --output_dir=$OUTPUT_DIR \
    --train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
    --predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
    --do_train \
    --do_eval \
    --max_seq_length=384 \
    --doc_stride=128 \
    --threads=4 \
    --save_steps=50000 \
    --per_gpu_train_batch_size=12 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=3e-5 \
    --max_answer_length=10 \
    --warmup_ratio=0.1 \
    --min_steps=200 \
    --num_train_epochs=10 \
    --seed=42 \
    --use_cache=False \
    --evaluate_every_epoch=False 

In order to train with automatic mixed precision, install apex and add the --fp16 flag.

See an example script for fine-tuning SpanBERT (rather than Splinter) here.

Citation

If you find this work helpful, please cite us

@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    pages = "3066--3079",
}

Acknowledgements

We would like to thank the European Research Council (ERC) for funding the project, and to Google’s TPU Research Cloud (TRC) for their support in providing TPUs.

Owner
Ori Ram
PhD Candidate at Tel Aviv University, focusing on NLP and Machine Learning
Ori Ram
Unsupervised Language Model Pre-training for French

FlauBERT and FLUE FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the n

GETALP 212 Dec 10, 2022
NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

NumPy String-Indexed NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels, rather than conventio

Aitan Grossman 1 Jan 08, 2022
An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

GPT-NeoX An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hun

EleutherAI 3.1k Jan 08, 2023
Task-based datasets, preprocessing, and evaluation for sequence models.

SeqIO: Task-based datasets, preprocessing, and evaluation for sequence models. SeqIO is a library for processing sequential data to be fed into downst

Google 290 Dec 26, 2022
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Google Research 4.6k Jan 01, 2023
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Pretrained Language Model This repository provides the latest pretrained language models and its related optimization techniques developed by Huawei N

HUAWEI Noah's Ark Lab 2.6k Jan 08, 2023
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
中文問句產生器;使用台達電閱讀理解資料集(DRCD)

Transformer QG on DRCD The inputs of the model refers to we integrate C and A into a new C' in the following form. C' = [c1, c2, ..., [HL], a1, ..., a

Philip 1 Oct 22, 2021
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

20.5k Jan 08, 2023
GPT-2 Model for Leetcode Questions in python

Leetcode using AI 🤖 GPT-2 Model for Leetcode Questions in python New demo here: https://huggingface.co/spaces/gagan3012/project-code-py Note: the Ans

Gagan Bhatia 100 Dec 12, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Hans Alemão 4 Jul 20, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
Python library for processing Chinese text

SnowNLP: Simplified Chinese Text Processing SnowNLP是一个python写的类库,可以方便的处理中文文本内容,是受到了TextBlob的启发而写的,由于现在大部分的自然语言处理库基本都是针对英文的,于是写了一个方便处理中文的类库,并且和TextBlob

Rui Wang 6k Jan 02, 2023