DeLighT: Very Deep and Light-Weight Transformers

Last update: Dec 18, 2022


This repository contains the source code of our work on building efficient sequence models: DeFINE (ICLR'20) and DeLighT (preprint).

Table of contents

Overview
Requirements and installation
Training, evaluation, and results
Multiplication-addition operations
Citation
Acknowledgements
Issues

Overview

In this repository, we share the source code of our paper DeLighT, which delivers similar or better performance than transformer-based models with significantly fewer parameters. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using DExTra, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. For details, see our papers: DeFINE and DeLighT.
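To make the idea of block-wise scaling concrete, here is a minimal sketch that assigns a depth to each block, growing from the input side to the output side. It assumes a linear interpolation between a minimum and a maximum depth with round-half-up; the function name block_depths and its parameters are hypothetical and are not taken from this repository, so consult the paper and source code for the exact schedule.

```python
from typing import List


def block_depths(n_min: int, n_max: int, num_blocks: int) -> List[int]:
    """Return one depth per block, increasing from input to output.

    Assumption: depths are linearly interpolated between n_min and n_max
    and rounded half up. This is an illustrative schedule only.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if n_max < n_min:
        raise ValueError("n_max must be greater than or equal to n_min")
    if num_blocks == 1:
        return [n_min]
    # Depth added per block when moving one block closer to the output.
    step = (n_max - n_min) / (num_blocks - 1)
    return [int(n_min + step * b + 0.5) for b in range(num_blocks)]


if __name__ == "__main__":
    # Blocks near the input are shallower, blocks near the output are deeper.
    print(block_depths(4, 8, 5))  # prints [4, 5, 6, 7, 8]
```

Under this assumed schedule, the first block always gets n_min layers and the last block always gets n_max layers, which matches the "shallower near the input, deeper near the output" description above.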

Requirements and Installation

PyTorch version >= 1.4.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
To use DeLighT, you need to install fairseq and develop locally:

git clone https://github.com/sacmehta/delight
cd delight
pip install --editable ./
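Before training, it can help to confirm that the environment meets the version requirements listed above. The following is a small sketch, not part of this repository: the helper names parse_version, meets_minimum, and check_environment are hypothetical, and the version parsing is deliberately simple (it keeps only the leading digits of the first three components).

```python
import re
import sys
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    core = version.split("+")[0]  # drop local build tags such as '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits only
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '1.4' to (1, 4, 0)
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(str(torch.__version__), "1.4.0"):
            problems.append("PyTorch >= 1.4.0 is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

This check covers only the Python and PyTorch versions; it does not verify the GPU or NCCL setup.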

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

Training, Evaluation, and Results

For training, evaluation, and results, see the links below. To ease reproduction of our results, we also provide links to training logs.

Neural machine translation

Language Modeling

WikiText-103

Multiplication-Addition Operations

We have added module profiling for both Transformer and DeLighT networks. It can be enabled with the --print-stats argument, which prints a model summary (by default for 20 tokens). To profile with larger source and target sequence lengths, use the --src-len-ps and --tgt-len-ps flags.
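For intuition about what a multiplication-addition count measures, the sketch below counts multiply-add operations for a linear layer and for a standard two-layer feed-forward network. It is not the profiler shipped in this repository: the function names linear_macs and ffn_macs are hypothetical, biases and activations are ignored, and the default of 20 tokens simply mirrors the default sequence length mentioned above.

```python
def linear_macs(in_features: int, out_features: int, num_tokens: int = 20) -> int:
    """Multiply-adds for a linear layer applied to every token.

    Each output feature needs in_features multiply-adds per token.
    Bias additions are ignored in this sketch.
    """
    return num_tokens * in_features * out_features


def ffn_macs(d_model: int, d_ff: int, num_tokens: int = 20) -> int:
    """Multiply-adds for a feed-forward network: d_model -> d_ff -> d_model."""
    expand = linear_macs(d_model, d_ff, num_tokens)
    reduce = linear_macs(d_ff, d_model, num_tokens)
    return expand + reduce


if __name__ == "__main__":
    # Example sizes chosen for illustration only.
    print(ffn_macs(512, 2048))  # prints 41943040
```

Because the count grows linearly with the number of tokens, profiling with longer source and target lengths scales these per-layer numbers accordingly (attention layers, which this sketch does not cover, scale differently).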

Citation

If you find our work useful, please consider citing the following works:

@misc{mehta2020delight,
    title={DeLighT: Very Deep and Light-weight Transformer},
    author={Sachin Mehta and Marjan Ghazvininejad and Srinivasan Iyer and Luke Zettlemoyer and Hannaneh Hajishirzi},
    year={2020},
    eprint={2008.00623},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@inproceedings{mehta2019define,
  title={DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling},
  author={Mehta, Sachin and Koncel-Kedziorski, Rik and Rastegari, Mohammad and Hajishirzi, Hannaneh},
  booktitle={International Conference on Learning Representations},
  year={2019}
}

Acknowledgements

We would like to thank the Fairseq team for building an easy-to-use sequence library.

Issues

Thanks for your interest in our work. For any issues, please raise a request.

Owner

Sachin Mehta
