The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

Overview

Subformer

This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to retain performance while reducing parameters in generative Transformers from 25% ~ 70%. The Subformer consists of the following two techniques:

  1. Sandwich-style parameter sharing, in which we share all the layers in a block except the first and last. This allows us the use the central shared layers --"sandwich module" -- as a large representation learner (similar to BERT vs ALBERT) while the input and output model layers are able to focus on more specific representations for token prediction/generation while maintaining performance.
  2. For our sequence to sequence tasks, we also introduce SAFE (self-attentive factorized embeddings), which help us reduce embedding parameters significantly, while still retaining performance.

If you used this code or found our work useful, please cite:

@inproceedings{reid2021subformer,
    title = {{S}ubformer: {E}xploring {W}eight {S}haring for {P}arameter {E}fficiency in {G}enerative {T}ransformers},
    author = {Machel Reid and Edison Marrese-Taylor and Yutaka Matsuo},
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}

Requirements and Installation

(As this code is based on fairseq, some installation instructions are taken straight from their README)

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install and develop locally:
git clone https://github.com/machelreid/subformer
cd subformer
pip install --e ./

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For large datasets install PyArrow: pip install pyarrow
  • If you use Docker make sure to increase the shared memory size either with --ipc=host or --shm-size as command line options to nvidia-docker run .

Training

Machine Translation

python train.py $DATA_BIN --arch transformer_wmt_en_de \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr 5e-4 \
    --warmup-init-lr 1e-7 --stop-min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation \
    --max-tokens 8192 --weight-decay 0.01 --dropout 0.2 --encoder-layers 6 --encoder-embed-dim 512 \
    --decoder-layers 6 --decoder-embed-dim 512 --fp16 --max-source-positions 10000 \
    --max-target-positions 10000 --max-update 200000 --seed 1 \
    --save-dir $CHECKPOINT_DIR --share-all-embeddings \
    --share-encoder-parameters-sandwich --share-decoder-parameters-sandwich \ #for sandwich-style parameter sharing
    --reduction-dim 320 #for SAFE embeddings

Generation

python generate.py --path $CHECKPOINT --gen-subset $SPLIT --beam 5 --lenpen $LENPEN --batch-size 400 --remove-bpe

CNN-DM Summarization

fairseq-train $DATA_BIN \
   --share-decoder-input-output-embed \
   --max-update 30000 \
   --optimizer adam --adam-betas '(0.9, 0.98)' --skip-invalid-size-inputs-valid-test \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 10000 --lr 0.0005 \
   --stop-min-lr 1e-09 --clip-norm 0.1 --dropout 0.3 --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --update-freq 7 --attention-dropout 0.2 \
   --max-tokens 8192 --arch transformer_wmt_en_de --seed 1 --warmup-init-lr 1e-7 \
   --source-lang source_bpe --target-lang target_bpe --save-dir $CHECKPOINT_DIR --no-epoch-checkpoints --keep-best-checkpoints 10 --truncate-source --max-source-positions 512 --share-encoder-parameters-sandwich --share-decoder-parameters-sandwich --sandwich-embed-dim 1024 --sandwich-ffn-embed-dim 3072 --reduction-dim 256

Generation

fairseq-generate $DATA_BIN --task translation --gen-subset $SPLIT --batch-size 32 --path $CHECKPOINT --remove-bpe  --min-len 55 --beam 5 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen $LENPEN -s source_bpe -t target_bpe --truncate-source --max-source-positions 512

Note that the min,max len parameters can be tuned for better performance

For post processing and ROUGE calculation feel free to take a look at this.

Citation

Please cite as:

@inproceedings{reid2021subformer,
    title = {{S}ubformer: {E}xploring {W}eight {S}haring for {P}arameter {E}fficiency in {G}enerative {T}ransformers},
    author = {Machel Reid and Edison Marrese-Taylor and Yutaka Matsuo},
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
}
Owner
Machel Reid
Researcher at University of Tokyo. Research Intern at CMU. Masason Foundation Scholar. Won the Rakuten Hackathon 2018.
Machel Reid
Implementation of "Adversarial purification with Score-based generative models", ICML 2021

Adversarial Purification with Score-based Generative Models by Jongmin Yoon, Sung Ju Hwang, Juho Lee This repository includes the official PyTorch imp

15 Dec 15, 2022
Repository for Graph2Pix: A Graph-Based Image to Image Translation Framework

Graph2Pix: A Graph-Based Image to Image Translation Framework Installation Install the dependencies in env.yml $ conda env create -f env.yml $ conda a

18 Nov 17, 2022
voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

Command-line tools for speech and intent recognition on Linux

Michael Hansen 988 Jan 04, 2023
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

NumPy String-Indexed NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels, rather than conventio

Aitan Grossman 1 Jan 08, 2022
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
Multilingual word vectors in 78 languages

Aligning the fastText vectors of 78 languages Facebook recently open-sourced word vectors in 89 languages. However these vectors are monolingual; mean

Babylon Health 1.2k Dec 17, 2022
An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

GPT-NeoX An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hun

EleutherAI 3.1k Jan 08, 2023
nlp基础任务

NLP算法 说明 此算法仓库包括文本分类、序列标注、关系抽取、文本匹配、文本相似度匹配这五个主流NLP任务,涉及到22个相关的模型算法。 框架结构 文件结构 all_models ├── Base_line │   ├── __init__.py │   ├── base_data_process.

zuxinqi 23 Sep 22, 2022
An Explainable Leaderboard for NLP

ExplainaBoard: An Explainable Leaderboard for NLP Introduction | Website | Download | Backend | Paper | Video | Bib Introduction ExplainaBoard is an i

NeuLab 319 Dec 20, 2022
Linking data between GBIF, Biodiverse, and Open Tree of Life

GBIF-biodiverse-OpenTree Linking data between GBIF, Biodiverse, and Open Tree of Life The python scripts will rely on opentree and Dendropy. To set up

2 Oct 03, 2022
A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
Code for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021

This repo provides the code of the following papers: (GAR) "Generation-Augmented Retrieval for Open-domain Question Answering", ACL 2021 (RIDER) "Read

morning 49 Dec 26, 2022
Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration 🚃

This repository provides a library for efficient training of masked language models (MLM), built with fairseq. We fork fairseq to give researchers mor

Princeton Natural Language Processing 92 Dec 27, 2022
SDL: Synthetic Document Layout dataset

SDL is the project that synthesizes document images. It facilitates multiple-level labeling on document images and can generate in multiple languages.

Sơn Nguyễn 0 Oct 07, 2021
YACLC - Yet Another Chinese Learner Corpus

汉语学习者文本多维标注数据集YACLC V1.0 中文 | English 汉语学习者文本多维标注数据集(Yet Another Chinese Learner

BLCU-ICALL 47 Dec 15, 2022