Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration πŸšƒ

Overview


This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers more flexibility when using our training scripts, while also making it easier to adapt our code contributions into other projects.

Why DinkyTrain?

The Dinky runs between Princeton Junction and Princeton and is the shortest scheduled commuter rail line in the United States. We also aim to make pre-training short and accessible to everyone.

Our Contributions

  • DeepSpeed transformer kernel integration
  • A training recipe for efficient MLM pre-training
  • An easy-to-follow guideline of using fairseq for MLM pre-training.

Other fairseq features:

See the fairseq repo and its documentation for more details on how to use and extend fairseq.

DinkyTrain for Efficient MLM Pre-training

Quick Links

Overview

You can reproduce the pre-training experiments of our recent paper Should You Mask 15% in Masked Language Modeling?, where we find that higher masking rates can lead to more efficient pre-training.

Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • To install fairseq and develop locally:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
  • For faster training (FP16) install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For faster training (DeepSpeed cuda kernel) install DeepSpeed library and compile the DeepSpeed kernel
DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 pip install deepspeed
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run .

Trouble-shooting:

  • If using lower version of Python, you might encounter import problems with importlib.metadata. Try pip install importlib-metadata.
  • To install apex and deepspeed, you will need nvcc (CUDA compiler).
  • When installing apex, if you encounter the error Cuda extensions are bing compiled with a version of Cuda that does not match ..., go to setup.py and comment out the line that raised the error (at your own risk).
  • Both apex and deepspeed installation require a high gcc version to support c++14. If you encounter relevant errors, update your gcc.

Data Pre-processing

Tokenization: First, download the GPT2 BPE vocabulary:

wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe

Then, tokenize your raw data:

python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json gpt2_bpe/encoder.json \
    --vocab-bpe gpt2_bpe/vocab.bpe \
    --inputs ${SPLIT}.raw \
    --outputs ${SPLIT}.bpe \
    --keep-empty \
    --workers 8

Finally, index and binarize your data:

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref ${TRAIN_SPLIT}.bpe \
    --validpref ${VALID_SPLIT}.bpe \
    --testpref ${TEST_SPLIT}.bpe \
    --destdir output-bin \
    --workers 8

Alternatively: Use our pre-processed data: We preprocessed Wikipedia+BookCorpus and shared it on Huggingface dataset. It is ~22GB and contains two epochs of data, each epoch being sliced into 8 shards. You can download it using git:

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/wikibook_fairseq_format

Pre-training

Use our script for efficient pre-training

GPU={number of GPUs} DATA_DIR={data path} [DEEPSPEED=1] bash run_efficient_mlm_recipe.sh

Flags explained

  • GPU: number of GPUs.
  • DATA_DIR: directory to the processed pre-training data. If you are using our preprocessed dataset, DATA_DIR should be:
DATA_DIR=$(seq 0 15 | sed -e 's/^/wikibook_fairseq_format\/bin-shard/' | sed -e 's/$/-8/' | paste -sd ':')
  • DEEPSPEED (optional): if set to 1, the DeepSpeed CUDA kernel will be used.

Please refer to the script for more hyperparameter choices.

Fine-tuning on GLUE and SQuAD

All our checkpoints can be converted to HuggingFace transformers models (see next nextion) and use the transformers package for fine-tuning. Fairseq also supports fine-tuning on GLUE.

First, download the preprocessed GLUE data (you can also process by yourself following the preprocess section above):

git lfs install # Git lfs is needed for downloading
git clone https://huggingface.co/datasets/princeton-nlp/glue_fairseq_format

Then use the following script for fine-tuning

DATA_DIR={path to the data directory} \
TASK={glue task name (mnli qnli qqp rte sst2 mrpc cola stsb)} \
LR={learning rate} \
BSZ={batch size} \
EPOCHS={number of epochs} \
SEED={random seed} \
CKPT_DIR={checkpoint's directory} \
CKPT_NAME={checkpoint's name} \
[DEEPSPEED=1] bash finetune_glue.sh

For fine-tuning on SQuAD, please convert the models to HuggingFace checkpoints following the next section and use HuggingFace's examples.

Convert to HuggingFace

We also provide conversion codes so that you can easily turn Fairseq checkpoints into HuggingFace checkpoints. Usage:

cd scripts
[PRELAYERNORM=1] [FROM_DS=1] python convert_fs_ckpt_to_hf_ckpt.py --fr {fairseq checkpoint} --to {huggingface checkpoint path} --hf_model_config {roberta-base/roberta-large}

Flags explained:

  • PRELAYERNORM=1: Using pre layer-norm (default is post layer-norm).
  • FROM_DS=1: The Fairseq checkpoint uses DeepSpeed's cuda kernel.
  • --fr: The path to the Fairseq checkpoint.
  • --to: The path you want to save the HuggingFace checkpoint to.
  • --hf_model_config: roberta-base or roberta-large.

IMPORTANT: all our models use pre layer norm, which is not supported by HuggingFace yet. To use it, import the model class from huggingface/modeling_roberta_prelayernorm.py. For example:

from huggingface.modeling_roberta_prelayernorm import RobertaForSequenceClassification

For more configuration, please refer to convert_fs_ckpt_to_hf_ckpt.py.

Model List

Here are the HuggingFace checkpoints of our models in the paper Should You Mask 15% in Masked Language Modeling. Results are development set performance.

Model MNLI QNLI QQP SST-2
princeton-nlp/efficient_mlm_m0.15 84.2 90.9 87.8 93.3
princeton-nlp/efficient_mlm_m0.20 84.1 91.3 87.9 92.7
princeton-nlp/efficient_mlm_m0.30 84.2 91.6 88.0 93.0
princeton-nlp/efficient_mlm_m0.40 84.5 91.6 88.1 92.8
princeton-nlp/efficient_mlm_m0.50 84.1 91.1 88.1 92.7
princeton-nlp/efficient_mlm_m0.60 83.2 90.7 87.8 92.6
princeton-nlp/efficient_mlm_m0.70 82.3 89.4 87.5 91.9
princeton-nlp/efficient_mlm_m0.80 80.8 87.9 87.1 90.5
princeton-nlp/efficient_mlm_m0.15-801010 83.7 90.4 87.8 93.2
princeton-nlp/efficient_mlm_m0.40-801010 84.3 91.2 87.9 93.0

We also offer the original (deepspeed) fairseq checkpoints here.

Bugs or Questions?

If you hav an questions, or encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

@article{wettig2022should,
   title={Should You Mask 15% in Masked Language Modeling?},
   author={Wettig, Alexander and Gao, Tianyu and Zhong, Zexuan and Chen, Danqi},
   boo={arXiv preprint arXiv:2202.08005},
   year={2022}
}

Acknowledgment

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

  • Our efficient training recipe is based on the following paper:

Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to train BERT with an academic budget. In Empirical Methods in Natural Language Processing (EMNLP), pages 10644–10652.

Owner
Princeton Natural Language Processing
Princeton Natural Language Processing
Meta learning algorithms to train cross-lingual NLI (multi-task) models

Meta learning algorithms to train cross-lingual NLI (multi-task) models

M.Hassan Mojab 4 Nov 20, 2022
kochat

Kochat 챗봇 λΉŒλ”λŠ” 성에 μ•ˆμ°¨κ³ , μžμ‹ λ§Œμ˜ λ”₯λŸ¬λ‹ 챗봇 μ• ν”Œλ¦¬μΌ€μ΄μ…˜μ„ λ§Œλ“œμ‹œκ³  μ‹ΆμœΌμ‹ κ°€μš”? Kochat을 μ΄μš©ν•˜λ©΄ μ†μ‰½κ²Œ μžμ‹ λ§Œμ˜ λ”₯λŸ¬λ‹ 챗봇 μ• ν”Œλ¦¬μΌ€μ΄μ…˜μ„ λΉŒλ“œν•  수 μžˆμŠ΅λ‹ˆλ‹€. # 1. 데이터셋 객체 생성 dataset = Dataset(ood=True) #

1 Oct 25, 2021
Conversational text Analysis using various NLP techniques

Conversational text Analysis using various NLP techniques

Rita Anjana 159 Jan 06, 2023
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.

Ivan Didur 106 Jan 01, 2023
A tool helps build a talk preview image by combining the given background image and talk event description

talk-preview-img-builder A tool helps build a talk preview image by combining the given background image and talk event description Installation and U

PyCon Taiwan 4 Aug 20, 2022
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Improving Transformer Models by Reordering their Sublayers This repository contains the code for running the character-level Sandwich Transformers fro

Ofir Press 53 Sep 26, 2022
Sequence model architectures from scratch in PyTorch

This repository implements a variety of sequence model architectures from scratch in PyTorch. Effort has been put to make the code well structured so that it can serve as learning material. The train

Brando Koch 11 Mar 28, 2022
πŸ›Έ Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

Explosion 1.2k Jan 08, 2023
Semi-automated vocabulary generation from semantic vector models

vec2word Semi-automated vocabulary generation from semantic vector models This script generates a list of potential conlang word forms along with asso

9 Nov 25, 2022
[Preprint] Escaping the Big Data Paradigm with Compact Transformers, 2021

Compact Transformers Preprint Link: Escaping the Big Data Paradigm with Compact Transformers By Ali Hassani[1]*, Steven Walton[1]*, Nikhil Shah[1], Ab

SHI Lab 367 Dec 31, 2022
Google's Meena transformer chatbot implementation

Here's my attempt at recreating Meena, a state of the art chatbot developed by Google Research and described in the paper Towards a Human-like Open-Domain Chatbot.

Francesco Pham 94 Dec 25, 2022
A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code or write code yourself

Scriptfab - What is it? A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code

DevNugget 3 Jul 28, 2021
This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

Word-Level Coreference Resolution This is a repository with the code to reproduce the experiments described in the paper of the same name, which was a

79 Dec 27, 2022
Code to reprudece NeurIPS paper: Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks

Accelerated Sparse Neural Training: A Provable and Efficient Method to FindN:M Transposable Masks Recently, researchers proposed pruning deep neural n

itay hubara 4 Feb 23, 2022
Trex is a tool to match semantically similar functions based on transfer learning.

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language βš–οΈ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022