The official implementation for "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Overview

FQ-ViT [arXiv]

This repo contains the official implementation of "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Table of Contents

Introduction

Transformer-based architectures have achieved competitive performance in various CV tasks. Compared to the CNNs, Transformers usually have more parameters and higher computational costs, presenting a challenge when deployed to resource-constrained hardware devices.

Most existing quantization approaches are designed and tested on CNNs and lack proper handling of Transformer-specific modules. Previous work found there would be significant accuracy degradation when quantizing LayerNorm and Softmax of Transformer-based architectures. As a result, they left LayerNorm and Softmax unquantized with floating-point numbers. We revisit these two exclusive modules of the Vision Transformers and discover the reasons for degradation. In this work, we propose the FQ-ViT, the first fully quantized Vision Transformer, which contains two specific modules: Powers-of-Two Scale (PTS) and Log-Int-Softmax (LIS).

Layernorm quantized with Powers-of-Two Scale (PTS)

These two figures below show that there exists serious inter-channel variation in Vision Transformers than CNNs, which leads to unacceptable quantization errors with layer-wise quantization.

Taking the advantages of both layer-wise and channel-wise quantization, we propose PTS for LayerNorm's quantization. The core idea of PTS is to equip different channels with different Powers-of-Two Scale factors, rather than different quantization scales.

Softmax quantized with Log-Int-Softmax (LIS)

The storage and computation of attention map is known as a bottleneck for transformer structures, so we want to quantize it to extreme lower bit-width (e.g. 4-bit). However, if directly implementing 4-bit uniform quantization, there will be severe accuracy degeneration. We observe a distribution centering at a fairly small value of the output of Softmax, while only few outliers have larger values close to 1. Based on the following visualization, Log2 preserves more quantization bins than uniform for the small value interval with dense distribution.

Combining Log2 quantization with i-exp, which is a polynomial approximation of exponential function presented by I-BERT, we propose LIS, an integer-only, faster, low consuming Softmax.

The whole process is visualized as follow.

Getting Started

Install

  • Clone this repo.
git clone https://github.com/linyang-zhh/FQ-ViT.git
cd FQ-ViT
  • Create a conda virtual environment and activate it.
conda create -n fq-vit python=3.7 -y
conda activate fq-vit
  • Install PyTorch and torchvision. e.g.,
conda install pytorch=1.7.1 torchvision cudatoolkit=10.1 -c pytorch

Data preparation

You should download the standard ImageNet Dataset.

├── imagenet
│   ├── train
|
│   ├── val

Run

Example: Evaluate quantized DeiT-S with MinMax quantizer and our proposed PTS and LIS

python test_quant.py deit_small <YOUR_DATA_DIR> --quant --pts --lis --quant-method minmax
  • deit_small: model architecture, which can be replaced by deit_tiny, deit_base, vit_base, vit_large, swin_tiny, swin_small and swin_base.

  • --quant: whether to quantize the model.

  • --pts: whether to use Power-of-Two Scale Integer Layernorm.

  • --lis: whether to use Log-Integer-Softmax.

  • --quant-method: quantization methods of activations, which can be chosen from minmax, ema, percentile and omse.

Results on ImageNet

This paper employs several current post-training quantization strategies together with our methods, including MinMax, EMA , Percentile and OMSE.

  • MinMax uses the minimum and maximum values of the total data as the clipping values;

  • EMA is based on MinMax and uses an average moving mechanism to smooth the minimum and maximum values of different mini-batch;

  • Percentile assumes that the distribution of values conforms to a normal distribution and uses the percentile to clip. In this paper, we use the 1e-5 percentile because the 1e-4 commonly used in CNNs has poor performance in Vision Transformers.

  • OMSE determines the clipping values by minimizing the quantization error.

The following results are evaluated on ImageNet.

Method W/A/Attn Bits ViT-B ViT-L DeiT-T DeiT-S DeiT-B Swin-T Swin-S Swin-B
Full Precision 32/32/32 84.53 85.81 72.21 79.85 81.85 81.35 83.20 83.60
MinMax 8/8/8 23.64 3.37 70.94 75.05 78.02 64.38 74.37 25.58
MinMax w/ PTS 8/8/8 83.31 85.03 71.61 79.17 81.20 80.51 82.71 82.97
MinMax w/ PTS, LIS 8/8/4 82.68 84.89 71.07 78.40 80.85 80.04 82.47 82.38
EMA 8/8/8 30.30 3.53 71.17 75.71 78.82 70.81 75.05 28.00
EMA w/ PTS 8/8/8 83.49 85.10 71.66 79.09 81.43 80.52 82.81 83.01
EMA w/ PTS, LIS 8/8/4 82.57 85.08 70.91 78.53 80.90 80.02 82.56 82.43
Percentile 8/8/8 46.69 5.85 71.47 76.57 78.37 78.78 78.12 40.93
Percentile w/ PTS 8/8/8 80.86 85.24 71.74 78.99 80.30 80.80 82.85 83.10
Percentile w/ PTS, LIS 8/8/4 80.22 85.17 71.23 78.30 80.02 80.46 82.67 82.79
OMSE 8/8/8 73.39 11.32 71.30 75.03 79.57 79.30 78.96 48.55
OMSE w/ PTS 8/8/8 82.73 85.27 71.64 78.96 81.25 80.64 82.87 83.07
OMSE w/ PTS, LIS 8/8/4 82.37 85.16 70.87 78.42 80.90 80.41 82.57 82.45

Citation

If you find this repo useful in your research, please consider citing the following paper:

@misc{
    lin2021fqvit,
    title={FQ-ViT: Fully Quantized Vision Transformer without Retraining}, 
    author={Yang Lin and Tianyu Zhang and Peiqin Sun and Zheng Li and Shuchang Zhou},
    year={2021},
    eprint={2111.13824},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Code to reproduce the results for Compositional Attention

Compositional-Attention This repository contains the official implementation for the paper Compositional Attention: Disentangling Search and Retrieval

Sarthak Mittal 58 Nov 30, 2022
Contains a bunch of different python programm tasks

py_tasks Contains a bunch of different python programm tasks Armstrong.py - calculate Armsrong numbers in range from 0 to n with / without cache and c

Dmitry Chmerenko 1 Dec 17, 2021
Contrastive Learning for Metagenomic Binning

CLMB A simple framework for CLMB - a novel deep Contrastive Learningfor Metagenomic Binning Created by Pengfei Zhang, senior of Department of Computer

1 Sep 14, 2022
PSANet: Point-wise Spatial Attention Network for Scene Parsing, ECCV2018.

PSANet: Point-wise Spatial Attention Network for Scene Parsing (in construction) by Hengshuang Zhao*, Yi Zhang*, Shu Liu, Jianping Shi, Chen Change Lo

Hengshuang Zhao 217 Oct 30, 2022
TensorFlow-based implementation of "Pyramid Scene Parsing Network".

PSPNet_tensorflow Important Code is fine for inference. However, the training code is just for reference and might be only used for fine-tuning. If yo

HsuanKung Yang 323 Dec 20, 2022
Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

OCR Ground Truth for Historical Commentaries The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public dom

Ajax Multi-Commentary 3 Sep 08, 2022
YKKDetector For Python

YKKDetector OpenCVを利用した機械学習データをもとに、VRChatのスクリーンショットなどからYKKさん(もとい「幽狐族のお姉様」)を検出できるソフトウェアです。 マニュアル こちらから実行環境のセットアップから解説する詳細なマニュアルをご覧いただけます。 ライセンス 本ソフトウェア

あんふぃとらいと 5 Dec 07, 2021
JittorVis - Visual understanding of deep learning models

JittorVis: Visual understanding of deep learning model JittorVis is an open-source library for understanding the inner workings of Jittor models by vi

thu-vis 182 Jan 06, 2023
Official implementation of Rethinking Graph Neural Architecture Search from Message-passing (CVPR2021)

Rethinking Graph Neural Architecture Search from Message-passing Intro The GNAS can automatically learn better architecture with the optimal depth of

Shaofei Cai 48 Sep 30, 2022
Attendance Monitoring with Face Recognition using Python

Attendance Monitoring with Face Recognition using Python A python GUI integrated attendance system using face recognition to take attendance. In this

Vaibhav Rajput 2 Jun 21, 2022
OcclusionFusion: realtime dynamic 3D reconstruction based on single-view RGB-D

OcclusionFusion (CVPR'2022) Project Page | Paper | Video Overview This repository contains the code for the CVPR 2022 paper OcclusionFusion, where we

Wenbin Lin 193 Dec 15, 2022
Code base of object detection

rmdet code base of object detection. 环境安装: 1. 安装conda python环境 - `conda create -n xxx python=3.7/3.8` - `conda activate xxx` 2. 运行脚本,自动安装pytorch1

3 Mar 08, 2022
这是一个yolo3-tf2的源码,可以用于训练自己的模型。

YOLOV3:You Only Look Once目标检测模型在Tensorflow2当中的实现 目录 性能情况 Performance 所需环境 Environment 文件下载 Download 训练步骤 How2train 预测步骤 How2predict 评估步骤 How2eval 参考资料

Bubbliiiing 68 Dec 21, 2022
ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

This is the project page for the paper: ISTR: End-to-End Instance Segmentation via Transformers, Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wa

Jie Hu 182 Dec 19, 2022
Heart Arrhythmia Classification

This program takes and input of an ECG in European Data Format (EDF) and outputs the classification for heartbeats into normal vs different types of arrhythmia . It uses a deep learning model for cla

4 Nov 02, 2022
Self-Supervised Learning with Kernel Dependence Maximization

Self-Supervised Learning with Kernel Dependence Maximization This is the code for SSL-HSIC, a self-supervised learning loss proposed in the paper Self

DeepMind 29 Dec 29, 2022
Discovering Interpretable GAN Controls [NeurIPS 2020]

GANSpace: Discovering Interpretable GAN Controls Figure 1: Sequences of image edits performed using control discovered with our method, applied to thr

Erik Härkönen 1.7k Jan 03, 2023
Revisiting, benchmarking, and refining Heterogeneous Graph Neural Networks.

Heterogeneous Graph Benchmark Revisiting, benchmarking, and refining Heterogeneous Graph Neural Networks. Roadmap We organize our repo by task, and on

THUDM 176 Dec 17, 2022
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
Homepage of paper: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, ICCV 2021.

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [Official Paddle Implementation] [Huggingface Gradio Demo] [Unofficial

442 Dec 16, 2022