Implementation of paper "Towards a Unified View of Parameter-Efficient Transfer Learning"

Last update: Dec 29, 2022

Overview

A Unified Framework for Parameter-Efficient Transfer Learning

This is the official implementation of the paper:

Towards a Unified View of Parameter-Efficient Transfer Learning
Junxian He*, Chunting Zhou*, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig
Preprint 2021

Parameter-efficient transfer learning (PETL) methods only tune a small number of (extra) parameters to adapt large pretrained models into downstream tasks. This paper reveals the connection among existing PETL methods such as adapters, prefix tuning, and LoRA, and proposes a unified framework to interpret their designs. This unified framework is able to instantiate existing approaches by varying values along several defined design dimensions, which also provides principled guidance to design new PETL methods. In this repo as well as in the paper, we include examples of how we easily derive new state-of-the-art PETL methods from the unified framework.

Dependencies

This repo is a fork of the huggingface transformers repo (forked on June 23, 2021), and the code is tested on PyTorch 1.9.0. Please follow the instructions below to install dependencies after you set up PyTorch:

git clone [email protected]:jxhe/MAM-adapter.git
cd MAM-adapter

# install transformers from this repo
pip install -e .

# install other requirements
pip install datasets==1.11.0

# used to compute BLEU score for en-ro translation
git clone [email protected]:moses-smt/mosesdecoder.git

Usage

MAM-Adapter

Run the following command to reproduce the MAM-Adapter results in the paper on the XSum, en-ro translation, MNLI, or SST2 datasets:

bash exps/run_{xsum|en_ro|glue}.sh

We ran all the experiments with one A6000 or A100 GPU that has >=40GB GPU memory -- if your GPU does not have a large memory, you may need to reduce the bsz (max_tokens_per_batch for en-ro) and increase the gradient_steps values in the scripts to match our effective batch size. You may train with multiple GPUs easily with python -m torch.distributed.launch --nproc_per_node {num_gpus} to enable data parallelism.

Training time: in our experiments that use one GPU, XSum takes 24 hours w/ A100 or 50 hours w/ A6000, en-ro takes 20 hours w/ A6000, SST2 takes 2 hours, and MNLI takes 10 hours.

Advanced Usage for Other PETL Variants

As the paper shows, our unified framework instantiates different PETL variants easily by varying along the design dimensions. You can modify the script to train other PETL variants as we studied in the paper, we include some examples in run_xsum.sh, which can be directly applied to the other scripts as well:

# ----- MAM adapter -----
attn_mode="prefix"
attn_option="concat"
attn_composition="add"
attn_bn=30  # attn bottleneck dim

ffn_mode="adapter"
ffn_option="parallel"
ffn_adapter_layernorm_option="none"
ffn_adapter_init_option="lora"
ffn_adapter_scalar="4"
ffn_bn=512 # ffn bottleneck dim

# ----- prefix tuning baseline ----- 
# attn_mode="prefix"
# attn_option="concat"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="none"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

# ----- Houlsby Adapter ----- 
# attn_mode="adapter"
# attn_option="sequential"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="sequential"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="bert"
# ffn_adapter_scalar="1"
# ffn_bn=200 # ffn bottleneck dim

# ----- FFN Scaled Parallel Adapter ----- 
# attn_mode="None"
# attn_option="parallel"
# attn_composition="add"
# attn_bn=200  # attn bottleneck dim

# ffn_mode="adapter"
# ffn_option="parallel"
# ffn_adapter_layernorm_option="none"
# ffn_adapter_init_option="lora"
# ffn_adapter_scalar="4"
# ffn_bn=512 # ffn bottleneck dim

There are more variations than what is shown above. Please see a complete explanation of these arguments here in petl/options.py. The results of all the variants reported in the paper could be reproduced by changing these values in the scripts.

Citation

@article{he2021towards,
  title={Towards a Unified View of Parameter-Efficient Transfer Learning},
  author={He, Junxian and Zhou, Chunting and Ma, Xuezhe and Berg-Kirkpatrick, Taylor and Neubig, Graham},
  journal={arXiv preprint arXiv:2110.04366},
  year={2021}
}

Implementation of paper "Towards a Unified View of Parameter-Efficient Transfer Learning"

Related tags

Overview

A Unified Framework for Parameter-Efficient Transfer Learning

Dependencies

Usage

MAM-Adapter

Advanced Usage for Other PETL Variants

Citation

Owner

Junxian He

Supporting code for the paper "Dangers of Bayesian Model Averaging under Covariate Shift"

Key information extraction from invoice document with Graph Convolution Network

Simple data balancing baselines for worst-group-accuracy benchmarks.

Code for the paper "Unsupervised Contrastive Learning of Sound Event Representations", ICASSP 2021.

Real time Human Detection Counting

mbrl-lib is a toolbox for facilitating development of Model-Based Reinforcement Learning algorithms.

VISNOTATE: An Opensource tool for Gaze-based Annotation of WSI Data

Open source implementation of "A Self-Supervised Descriptor for Image Copy Detection" (SSCD).

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Semantic Segmentation.

An unofficial personal implementation of UM-Adapt, specifically to tackle joint estimation of panoptic segmentation and depth prediction for autonomous driving datasets.

A PyTorch implementation of Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks

An algorithm study of the 6th iOS 10 set of Boost Camp Web Mobile

Example repository for custom C++/CUDA operators for TorchScript

[CVPR 2022] PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision (Oral)

Pytorch code for "State-only Imitation with Transition Dynamics Mismatch" (ICLR 2020)

Spearmint Bayesian optimization codebase

An essential implementation of BYOL in PyTorch + PyTorch Lightning

Mosaic of Object-centric Images as Scene-centric Images (MosaicOS) for long-tailed object detection and instance segmentation.

PyTorch Code for NeurIPS 2021 paper Anti-Backdoor Learning: Training Clean Models on Poisoned Data.