FewBit — a library for memory efficient training of large neural networks

Last update: Oct 22, 2022

Overview

FewBit

FewBit — a library for memory efficient training of large neural networks. Its efficiency originates from storage optimizations applied to backward pass and memory footprint reduction for saved tensors between forward and backward passes. Namely, the library provides its own implementation of common activation functions and linear layer since they contribute the most to memory usage in training time. Optimized linear layer saves up to 15-20% memory and optimized activation functions save up to 15-30% of memory usage with negligible loss in performance (see [1][2] for details).

In the table below, one can see comparison of different optimizations applied to RoBERTa model. Compression rate of randomized linear layer is 20% (it uses only 20% of input) and GELU approximation uses only 3 bits.

	Task	Batch Size	GELU	Linear Layer	Peak Memory, GiB	Saving, %
1	MRPC	128	Vanilla	Vanilla	11.30	0.0
2	MRPC	128	3-bit	Vanilla	9.75	13.8
3	MRPC	128	Vanilla	Randomized	9.20	18.6
4	MRPC	128	3-bit	Randomized	7.60	32.7

Usage

The library fewbit implements basic activation functions with backward pass optimizations for reducing memory footprint during model training. All activation functions exported by the library can be used as a drop-in replacement for most of standard activation functions implemented in PyTorch. The common pattern is to replace torch.nn with fewbit package qualifier.

import fewbit
import torch as T

model = T.nn.Sequential(
    ...,
    fewbit.GELU(bits=3),  # Use 3-bits GELU approximation.
    ...,
)

In the case of pre-trained models, one can rebuild model with map_module routine which walks through model tree recursively and allows to replace some modules or activation functions. So, user should only use suitable constructor for a new module. As an example the code below replaces all default linear layers with randomized ones.

from fewbit import RandomizedLinear
from fewbit.util import convert_linear, map_module

converter = lambda x: convert_linear(x, RandomizedLinear, proj_dim_ratio=0.1)
new_model = map_module(old_model, converter)  # In-place model construction.

Installation

The simplest and preferred installation way is installation from PyPI.

pip install -U fewbit

FewBit is written in Python, but it implements some opertions in C++/CUDA to archive better performance. So, building from source requires CUDA Toolkit and CMake as a build system. The latest release can be installed with the following command.

pip install -U https://github.com/SkoltechAI/fewbit.git

List of Activation Functions

The library supports the following activation functions.

Piece-wise Activation Functions

In this section, all activation functions has 1-bit derivative. The only difference is band. The band requires two comparison to determine gradient domain. The complete list of activation functions is leaky_relu, relu, threshold, hardsigmoid, hardtanh, relu6, hardshrink, and softshrink.

Continous Activation Functions

All continous activation function could be divided into three classes according to its parity property: odd, even, and neither even nor odd. The parity property allows to use a small optimization to increase precision of approximation. The complete list of reimplemented activation functions in this category is celu, elu, hardswish, logsigmoid, mish, selu, sigmoid, silu, softplus, softsign, tanh, and tanhshrink.

List of Modules

Module RandomizedLinear is a replacement for default Linear module. It is used power of approximate matrix multiplication for memory saving.

Assembly

Preliminary step depends on one's PyTorch distribution and availiable tooling. Building of native components requires CMake and a build system like Make or Ninja. Next, if PyTorch is installed system-wide the the following step is not neccessary. Otherwise, one likely should add search path for CMake modules to environment variables as follows.

export CMAKE_PREFIX_PATH="$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')"

The next step is useful in development environment. It just builds PyTorch operator library in source tree (option --inplace) with forced CUDA support (option --cuda). By default no CUDA support are forced.

python setup.py build_ext --inplace --cuda

With options similar to the previous step, one can build wheel binary distribution of the package.

python setup.py bdist_wheel --inplace --cuda

Development Environment with Docker

In order to develop on different platforms we uses custom docker image for non-priviledge user based on Nvidia CUDA image. Image contains pre-built native extention and it is parametrized by user name and user ID in a host system. The latter is crucial thing in binding host volumes.

docker build -t fewbit --build-arg UID=$(id -u) .
docker run --rm -ti -e TERM=$TERM fewbit

Citation

Please cite the following papers if the library is used in an academic paper (export BibTeX).

@misc{bershatsky2022memoryefficient,
    title={{M}emory-{E}fficient {B}ackpropagation through {L}arge {L}inear {L}ayers},
    author={Daniel Bershatsky and Aleksandr Mikhalev and Alexandr Katrutsa and Julia Gusak and Daniil Merkulov and Ivan Oseledets},
    year={2022},
    eprint={2201.13195},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

@misc{novikov2022fewbit,
    title={{F}ew-{B}it {B}ackward: {Q}uantized {G}radients of {A}ctivation {F}unctions for {M}emory {F}ootprint {R}eduction},
    author={Georgii Novikov and Daniel Bershatsky and Julia Gusak and Alex Shonenkov and Denis Dimitrov and Ivan Oseledets},
    year={2022},
    eprint={2202.00441},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
}

License

The work was supported by Sber AI and the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). ↩

FewBit — a library for memory efficient training of large neural networks

Related tags

Overview

FewBit

Usage

Installation

List of Activation Functions

Piece-wise Activation Functions

Continous Activation Functions

List of Modules

Assembly

Development Environment with Docker

Citation

License

Owner

ProMP: Proximal Meta-Policy Search

Code for ICMI2020 and ICMI2021 papers: "Studying Person-Specific Pointing and Gaze Behavior for Multimodal Referencing of Outside Objects from a Moving Vehicle" and "ML-PersRef: A Machine Learning-based Personalized Multimodal Fusion Approach for Referencing Outside Objects From a Moving Vehicle"

Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"

Official code for the paper "Self-Supervised Prototypical Transfer Learning for Few-Shot Classification"

Code for "FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection", ICRA 2021

Visualization toolkit for neural networks in PyTorch! Demo -->

A PyTorch implementation of the baseline method in Panoptic Narrative Grounding (ICCV 2021 Oral)

The Fundamental Clustering Problems Suite (FCPS) summaries 54 state-of-the-art clustering algorithms, common cluster challenges and estimations of the number of clusters as well as the testing for cluster tendency.

Time Dependent DFT in Tamm-Dancoff Approximation

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

PyTorch implementation of deep GRAph Contrastive rEpresentation learning (GRACE).

Match SafeGraph POIs with Data collected through a cultural resource survey in Washington DC.

Code for the paper: Fighting Fake News: Image Splice Detection via Learned Self-Consistency

Kalman Filter book using Jupyter Notebook. Focuses on building intuition and experience, not formal proofs. Includes Kalman filters,extended Kalman filters, unscented Kalman filters, particle filters, and more. All exercises include solutions.

Codes accompanying the paper "Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning" (NeurIPS 2021 Spotlight

Deploy a ML inference service on a budget in less than 10 lines of code.

Rank 1st in the public leaderboard of ScanRefer (2021-03-18)

TEDSummary is a speech summary corpus. It includes TED talks subtitle (Document), Title-Detail (Summary), speaker name (Meta info), MP4 URL, and utterance id

Dynamical movement primitives (DMPs), probabilistic movement primitives (ProMPs), spatially coupled bimanual DMPs.

Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms

FewBit — a library for memory efficient training of large neural networks

Related tags

Overview

FewBit

Usage

Installation

List of Activation Functions

Piece-wise Activation Functions

Continous Activation Functions

List of Modules

Assembly

Development Environment with Docker

Citation

License

Footnotes

Owner

ProMP: Proximal Meta-Policy Search

Code for ICMI2020 and ICMI2021 papers: "Studying Person-Specific Pointing and Gaze Behavior for Multimodal Referencing of Outside Objects from a Moving Vehicle" and "ML-PersRef: A Machine Learning-based Personalized Multimodal Fusion Approach for Referencing Outside Objects From a Moving Vehicle"

Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"

Official code for the paper "Self-Supervised Prototypical Transfer Learning for Few-Shot Classification"

Code for "FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection", ICRA 2021

Visualization toolkit for neural networks in PyTorch! Demo -->

A PyTorch implementation of the baseline method in Panoptic Narrative Grounding (ICCV 2021 Oral)

The Fundamental Clustering Problems Suite (FCPS) summaries 54 state-of-the-art clustering algorithms, common cluster challenges and estimations of the number of clusters as well as the testing for cluster tendency.

Time Dependent DFT in Tamm-Dancoff Approximation

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

PyTorch implementation of deep GRAph Contrastive rEpresentation learning (GRACE).

Match SafeGraph POIs with Data collected through a cultural resource survey in Washington DC.

Code for the paper: Fighting Fake News: Image Splice Detection via Learned Self-Consistency

Kalman Filter book using Jupyter Notebook. Focuses on building intuition and experience, not formal proofs. Includes Kalman filters,extended Kalman filters, unscented Kalman filters, particle filters, and more. All exercises include solutions.

Codes accompanying the paper "Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning" (NeurIPS 2021 Spotlight

Deploy a ML inference service on a budget in less than 10 lines of code.

Rank 1st in the public leaderboard of ScanRefer (2021-03-18)

TEDSummary is a speech summary corpus. It includes TED talks subtitle (Document), Title-Detail (Summary), speaker name (Meta info), MP4 URL, and utterance id

Dynamical movement primitives (DMPs), probabilistic movement primitives (ProMPs), spatially coupled bimanual DMPs.

Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms