Post-Training Quantization for Vision transformers.

Last update: Dec 28, 2022

Related tags

Overview

PTQ4ViT

Post-Training Quantization Framework for Vision Transformers. We use the twin uniform quantization method to reduce the quantization error on these activation values. And we use a Hessian guided metric to evaluate different scaling factors, which improves the accuracy of calibration with a small cost. The quantized vision transformers (ViT, DeiT, and Swin) achieve near-lossless prediction accuracy (less than 0.5% drop at 8-bit quantization) on the ImageNet classification task. Please read the paper for details.

Install

Requirement

python>=3.5
pytorch>=1.5
matplotlib
pandas
timm

Datasets

To run example testing, you should put your ImageNet2012 dataset in path /datasets/imagenet.

We use ViTImageNetLoaderGenerator in utils/datasets.py to initialize our DataLoader. If your Imagenet datasets are stored elsewhere, you'll need to manually pass its root as an argument when instantiating a ViTImageNetLoaderGenerator.

Usage

1. Run example quantization

To test on all models with BasePTQ/PTQ4ViT, run

python example/test_all.py

To run ablation testing, run

python example/test_ablation.py

You can run the testing scripts with multiple GPUs. For example, calling

python example/test_all.py --multigpu --n_gpu 6

will use 6 gpus to run the test.

2. Download quantized model checkpoints

(Coming soon)

Results

Results of BasePTQ

model	original	w8a8	w6a6
ViT-S/224/32	75.99	73.61	60.144
ViT-S/224	81.39	80.468	70.244
ViT-B/224	84.54	83.896	75.668
ViT-B/384	86.00	85.352	46.886
DeiT-S/224	79.80	77.654	72.268
DeiT-B/224	81.80	80.946	78.786
DeiT-B/384	83.11	82.33	68.442
Swin-T/224	81.39	80.962	78.456
Swin-S/224	83.23	82.758	81.742
Swin-B/224	85.27	84.792	83.354
Swin-B/384	86.44	86.168	85.226

Results of PTQ4ViT

model	original	w8a8	w6a6
ViT-S/224/32	75.99	75.582	71.908
ViT-S/224	81.39	81.002	78.63
ViT-B/224	84.54	84.25	81.65
ViT-B/384	86.00	85.828	83.348
DeiT-S/224	79.80	79.474	76.282
DeiT-B/224	81.80	81.482	80.25
DeiT-B/384	83.11	82.974	81.55
Swin-T/224	81.39	81.246	80.47
Swin-S/224	83.23	83.106	82.38
Swin-B/224	85.27	85.146	84.012
Swin-B/384	86.44	86.394	85.388

Results of Ablation

ViT-S/224 (original top-1 accuracy 81.39%)

Hessian Guided	Softmax Twin	GELU Twin	W8A8	W6A6
			80.47	70.24
✓			80.93	77.20
✓	✓		81.11	78.57
✓		✓	80.84	76.93
	✓	✓	79.25	74.07
✓	✓	✓	81.00	78.63

ViT-B/224 (original top-1 accuracy 84.54%)

Hessian Guided	Softmax Twin	GELU Twin	W8A8	W6A6
			83.90	75.67
✓			83.97	79.90
✓	✓		84.07	80.76
✓		✓	84.10	80.82
	✓	✓	83.40	78.86
✓	✓	✓	84.25	81.65

ViT-B/384 (original top-1 accuracy 86.00%)

Hessian Guided	Softmax Twin	GELU Twin	W8A8	W6A6
			85.35	46.89
✓			85.42	79.99
✓	✓		85.67	82.01
✓		✓	85.60	82.21
	✓	✓	84.35	80.86
✓	✓	✓	85.89	83.19

Citation

@article{PTQ4ViT_cvpr2022,
    title={PTQ4ViT: Post-Training Quantization Framework for Vision Transformers},
    author={Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, Guangyu Sun},
    journal={arXiv preprint arXiv:2111.12293},
    year={2022},
}

Post-Training Quantization for Vision transformers.

Related tags

Overview

PTQ4ViT

Install

Requirement

Datasets

Usage

1. Run example quantization

2. Download quantized model checkpoints

Results

Results of BasePTQ

Results of Ablation

Citation

Owner

Zhihang Yuan

CTRL-C: Camera calibration TRansformer with Line-Classification

Code for the CIKM 2019 paper "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting".

code for the ICLR'22 paper: On Robust Prefix-Tuning for Text Classification

GeneralOCR is open source Optical Character Recognition based on PyTorch.

FMA: A Dataset For Music Analysis

The end-to-end platform for building voice products at scale

PyTorch implementation of Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy

Autolfads-tf2 - A TensorFlow 2.0 implementation of Latent Factor Analysis via Dynamical Systems (LFADS) and AutoLFADS

AVD Quickstart Containerlab

[CVPR 2022 Oral] Balanced MSE for Imbalanced Visual Regression https://arxiv.org/abs/2203.16427

Official Implementation for "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery" (ICCV 2021 Oral)

ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

High-performance moving least squares material point method (MLS-MPM) solver.

Object detection on multiple datasets with an automatically learned unified label space.

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Language Models for the legal domain in Spanish done @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Code for "Learning Graph Cellular Automata"

BARF: Bundle-Adjusting Neural Radiance Fields 🤮 (ICCV 2021 oral)

TensorFlow ROCm port

[ICCV 2021] Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation