Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Related tags

Text Data & NLPbort
Overview

Bort

Companion code for the paper "Optimal Subarchitecture Extraction for BERT."

Bort is an optimal subset of architectural parameters for the BERT architecture, extracted by applying a fully polynomial-time approximation scheme (FPTAS) for neural architecture search. Bort has an effective (that is, not counting the embedding layer) size of 5.5% the original BERT-large architecture, and 16% of the net size. It is also able to be pretrained in 288 GPU hours, which is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large. It is also 7.9x faster than BERT-base (20x faster than BERT/RoBERTa-large) on a CPU, and performs better than other compressed variants of the architecture, and some of the non-compressed variants; it obtains an average performance improvement of between 0.3% and 31%, relative, with respect to BERT-large on multiple public natural language understanding (NLU) benchmarks.

Here are the corresponding GLUE scores on the test set:

Model Score CoLA SST-2 MRPC STS-B QQP MNLI-m MNLI-mm QNLI(v2) RTE WNLI AX
Bort 83.6 63.9 96.2 94.1/92.3 89.2/88.3 66.0/85.9 88.1 87.8 92.3 82.7 71.2 51.9
BERT-Large 80.5 60.5 94.9 89.3/85.4 87.6/86.5 72.1/89.3 86.7 85.9 92.7 70.1 65.1 39.6

And SuperGLUE scores on the test set:

Model Score BoolQ CB COPA MultiRC ReCoRD RTE WiC WSC AX-b AX-g
Bort 74.1 83.7 81.9/86.5 89.6 83.7/54.1 49.8/49.0 81.2 70.1 65.8 48.0 96.1/61.5
BERT-Large 69.0 77.4 75.7/83.6 70.6 70.0/24.1 72.0/71.3 71.7 69.6 64.4 23.0 97.8/51.7

And here are the architectural parameters:

Model Parameters (M) Layers Attention heads Hidden size Intermediate size Embedding size (M) Encoder proportion (%)
Bort 56 4 8 1024 768 39 30.3
BERT-Large 340 24 16 1024 4096 31.8 90.6

Setup:

  1. You need to install the requirements from the requirements.txt file:
pip install -r requirements.txt

This code has been tested with Python 3.6.5+. To save yourself some headache we recommend you install Horovod from source, after you install MxNet. This is only needed if you are pre-training the architecture. For this, run the following commands (you'll need a C++ compiler which supports c++11 standards, like gcc > 4.8):

    pip uninstall horovod
    HOROVOD_CUDA_HOME=/usr/local/cuda-10.1 \
    HOROVOD_WITH_MXNET=1 \
    HOROVOD_GPU_ALLREDUCE=NCCL \
    pip install horovod==0.16.2 --no-cache-dir
  1. You also need to download the model from here. If you have the AWS CLI, all you need to do is run:
aws s3 cp s3://alexa-saif-bort/bort.params model/
  1. To run the tests, you also need to download the sample text from Gluon and put it in test_data/:
wget https://github.com/dmlc/gluon-nlp/blob/v0.9.x/scripts/bert/sample_text.txt
mv sample_text.txt test_data/

Pre-training:

Bort is already pre-trained, but if you want to try out other datasets, you can follow the steps here. Note that this does not run the FPTAS described in the paper, and works for a fixed architecture (Bort).

  1. First, you will need to tokenize the pre-training text:
python create_pretraining_data.py \
            --input_file <input text> \
            --output_dir <output directory> \
            --dataset_name <dataset name> \
            --dupe_factor <duplication factor> \
            --num_outputs <number of output files>

We recommend using --dataset_name openwebtext_ccnews_stories_books_cased for the vocabulary. If your data file is too large, the script will throw out-of-memory errors. We recommend splitting it into smaller chunks and then calling the script one-by-one.

  1. Then run the pre-training distillation script:
./run_pretraining_distillation.sh <num gpus> <training data> <testing data> [optional teacher checkpoint]

Please see the contents of run_pretraining_distillation.sh for example usages and additional optional configuration. If you have installed Horovod, we highly recommend you use run_pretraining_distillation_hvd.py instead.

Fine-tuning:

  1. To fine-tune Bort, run:
./run_finetune.sh <your task here>

We recommend you play with the hyperparameters from run_finetune.sh. This code supports all the tasks outlined in the paper, but for the case of the RACE dataset, you need to download the data and extract it. The default location for extraction is ~/.mxnet/datasets/race. Same goes for SuperGLUE's MultiRC, since the Gluon implementation is the old version. You can download the data and extract it to ~/.mxnet/datasets/superglue_multirc/.

It is normal to get very odd results for the fine-tuning step, since this repository only contains the training part of Agora. However, you can easily implement your own version of that algorithm. We recommend you use the following initial set of hyperparameters, and follow the requirements described in the papers at the end of this file:

seeds={0,1,2,3,4}
learning_rates={1e-4, 1e-5, 9e-6}
weight_decays={0, 10, 100, 350}
warmup_rates={0.35, 0.40, 0.45, 0.50}
batch_sizes={8, 16}

Troubleshooting:

Dependency errors

Bort requires a rather unusual environment to run. For this reason, most of the problems regarding runtime can be fixed by installing the requirements from the requirements.txt file. Also make sure to have reinstalled Horovod as outlined above.

Script failing when downloading the data

This is inherent to the way Bort is fine-tuned, since it expects the data to be preexisting for some arbitrary implementation of Agora. You can get around that error by downloading the data before running the script, e.g.:

from data.classification import BoolQTask
task = BoolQTask()
task.dataset_train()[1]; task.dataset_val()[1]; task.dataset_test()[1]
Out-of-memory errors

While Bort is designed to be efficient in terms of the space it occupies in memory, a very large batch size or sequence length will still cause you to run out of memory. More often than ever, reducing the sequence length from 512 to 256 will solve out-of-memory issues. 80% of the time, it works every time.

Slow fine-tuning/pre-training

We strongly recommend using distributed training for both fine-tuning and pre-training. If your Horovod acts weird, remember that it needs to be built after the installation of MXNet (or any framework for that matter).

Low task-specific performance

If you observe near-random task-specific performance, that is to be expected. Bort is a rather small architecture and the optimizer/scheduler/learning rate combination is quite aggressive. We highly recommend you fine-tune Bort using an implementation of Agora. More details on how to do that are in the references below, specifically the second paper. Note that we needed to implement "replay" (i.e., re-doing some iterations of Agora) to get it to converge better.

References

If you use Bort or the other algorithms in your work, we'd love to hear from it! Also, please cite the so-called "Bort trilogy" papers:

@article{deWynterApproximation,
    title={An Approximation Algorithm for Optimal Subarchitecture Extraction},
    author={Adrian de Wynter},
    year={2020},
    eprint={2010.08512},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    journal={CoRR},
    volume={abs/2010.08512},
    url={http://arxiv.org/abs/2010.08512}
}
@article{deWynterAlgorithm,
      title={An Algorithm for Learning Smaller Representations of Models With Scarce Data},
      author={Adrian de Wynter},
      year={2020},
      eprint={2010.07990},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={CoRR},
      volume={abs/2010.07990},
      url={http://arxiv.org/abs/2010.07990}
}
@article{deWynterPerryOptimal,
      title={Optimal Subarchitecture Extraction for BERT},
      author={Adrian de Wynter and Daniel J. Perry},
      year={2020},
      eprint={2010.10499},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={CoRR},
      volume={abs/2010.10499},
      url={http://arxiv.org/abs/2010.10499}
}

Lastly, if you use the GLUE/SuperGLUE/RACE tasks, don't forget to give proper attribution to the original authors.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner
Alexa
Alexa
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP prod

VinAI Research 109 Dec 02, 2022
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
Japanese synonym library

chikkarpy chikkarpyはchikkarのPython版です。 chikkarpy is a Python version of chikkar. chikkarpy は Sudachi 同義語辞書を利用し、SudachiPyの出力に同義語展開を追加するために開発されたライブラリです。

Works Applications 48 Dec 14, 2022
The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

speech-recognition-py Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to huma

Deepangshi 1 Apr 03, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

OpenBMB 377 Jan 02, 2023
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
To classify the News into Real/Fake using Features from the Text Content of the article

Hoax-Detector Authenticity of news has now become a major problem. The Idea is to classify the News into Real/Fake using Features from the Text Conten

Aravindhan 1 Feb 09, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 07, 2023
A combination of autoregressors and autoencoders using XLNet for sentiment analysis

A combination of autoregressors and autoencoders using XLNet for sentiment analysis Abstract In this paper sentiment analysis has been performed in or

James Zaridis 2 Nov 20, 2021
ADCS - Automatic Defect Classification System (ADCS) for SSMC

Table of Contents Table of Contents ADCS Overview Summary Operator's Guide Demo System Design System Logic Training Mode Production System Flow Folder

Tam Zher Min 2 Jun 24, 2022
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
Bnagla hand written document digiiztion

Bnagla hand written document digiiztion This repo addresses the problem of digiizing hand written documents in Bangla. Documents have definite fields

Mushfiqur Rahman 1 Dec 10, 2021
Code for the paper "BERT Loses Patience: Fast and Robust Inference with Early Exit".

Patience-based Early Exit Code for the paper "BERT Loses Patience: Fast and Robust Inference with Early Exit". NEWS: We now have a better and tidier i

Kevin Canwen Xu 54 Jan 04, 2023
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
Active learning for text classification in Python

Active Learning allows you to efficiently label training data in a small-data scenario.

Webis 375 Dec 28, 2022
TPlinker for NER 中文/英文命名实体识别

本项目是参考 TPLinker 中HandshakingTagging思想,将TPLinker由原来的关系抽取(RE)模型修改为命名实体识别(NER)模型。

GodK 113 Dec 28, 2022
TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

Yixuan Su 26 Oct 17, 2022
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023