A curated list and survey of awesome Vision Transformers.

Overview
awesome-vit

English | 简体中文

A curated list and survey of awesome Vision Transformers.

You can use mind mapping software to open the mind mapping source file. You can also download the mind mapping HD pictures if you just want to browse them.

Contents

Survey

Only typical algorithms are listed in each category.

Image Classification

Chinese Blogs

Attention-based

image

Training Strategy

image

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
Model Improvements
Tokenization Module

image

Image to Token:

  • Non-overlapping Patch Embedding

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Overlapping Patch Embedding

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

Token to Token:

  • Fixed sampling window tokenization
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Dynamic sampling tokenization
    • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
    • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Position Encoding Module

image

Explicit position encoding:

  • Absolute position encoding
    • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]
    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
  • Relative position encoding
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

Implicit position encoding:

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Attention Module

image

Include only global attention:

  • Multi-Head attention module

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
  • Reduce global attention computation

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

    • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

    • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Generalized linear attention

    • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

Introduce extra local attention:

  • Local window mode

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
    • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
    • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
  • Introduce convolutional local inductive bias

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
    • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
  • Sparse attention

    • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]
FFN Module

image

Improve performance with Conv's local information extraction capability:

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
Normalization Module Location

image

  • Pre Normalization

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
  • Post Normalization

    • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
Classification Prediction Head Module

image

  • Class Tokens

    • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]
    • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
  • Avgerage Pooling

    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
Others

image

(1) How to output multi-scale feature map

  • Patch merging

    • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
    • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
    • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
    • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
    • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]
    • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • Pooling attention

    • [MViT] Multiscale Vision Transformers (2021.4) [Paper][Imporved MViT]

    • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • Dilation convolution

    • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

(2) How to train a deeper Transformer

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

MLP-based

image

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

ConvMixer-based

  • [ConvMixer] Patches Are All You Need [Paper]

General Architecture Analysis

image

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Others

Object Detection

Semantic Segmentation

back to top

Papers

Transformer Original Paper

  • [Transformer] Attention is All You Need] (NIPS 2017-2017.06) [Paper]

ViT Original Paper

  • [ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021-2020.10) [Paper]

Image Classification

2020

  • [DeiT] Training data-efficient image transformers & distillation through attention (ICML 2021-2020.12) [Paper]
  • [Sparse Transformer] Sparse Transformer: Concentrated Attention Through Explicit Selection [Paper]

2021

  • [T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]

  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]

  • [CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]

  • [TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]

  • [Cait] Going deeper with Image Transformers (2021.3) [Paper]

  • [DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]

  • [Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]

  • [CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]

  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]

  • [MViT] Multiscale Vision Transformers (2021.4) [Paper]

  • [Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]

  • [Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]

  • [ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]

  • [MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]

  • [ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]

  • [gMLP] Pay Attention to MLPs (2021.5) [Paper]

  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]

  • [PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]

  • [TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]

  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]

  • [GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]

  • [Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]

  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]

  • [CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]

  • [CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.07) [Paper]

  • [PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]

  • A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]

  • [Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]

  • [MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]

  • [Imporved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]

  • [ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]

  • [ConvMixer] Patches Are All You Need [Paper]

2022

  • [ConvNeXt] A ConvNet for the 2020s (2022.01) [Paper]

Object Detection

Semantic Segmentation

back to top

Stay tuned and PRs are welcomed!

Owner
OpenMMLab
OpenMMLab
Open source code for Paper "A Co-Interactive Transformer for Joint Slot Filling and Intent Detection"

A Co-Interactive Transformer for Joint Slot Filling and Intent Detection This repository contains the PyTorch implementation of the paper: A Co-Intera

67 Dec 05, 2022
Python package to add text to images, textures and different backgrounds

nider Python package for text images generation and watermarking Free software: MIT license Documentation: https://nider.readthedocs.io. nider is an a

Vladyslav Ovchynnykov 131 Dec 30, 2022
A very impractical 3D rendering engine that runs in the python terminal.

Terminal-3D-Render A very impractical 3D rendering engine that runs in the python terminal. do NOT try to run this program using the standard python I

23 Dec 31, 2022
CVPR2021 Workshop - HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization.

HDRUNet [Paper Link] HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization By Xiangyu Chen, Yihao Liu, Zhengwen Zhang, Yu Qiao an

XyChen 105 Dec 20, 2022
Tensors and Dynamic neural networks in Python with strong GPU acceleration

PyTorch is a Python package that provides two high-level features: Tensor computation (like NumPy) with strong GPU acceleration Deep neural networks b

61.4k Jan 04, 2023
Gym environments used in the paper: "Developmental Reinforcement Learning of Control Policy of a Quadcopter UAV with Thrust Vectoring Rotors"

gym_multirotor Gym to train reinforcement learning agents on UAV platforms Quadrotor Tiltrotor Requirements This package has been tested on Ubuntu 18.

Aditya M. Deshpande 19 Dec 29, 2022
Utility code for use with PyXLL

pyxll-utils There is no need to use this package as of PyXLL 5. All features from this package are now provided by PyXLL. If you were using this packa

PyXLL 10 Dec 18, 2021
This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".

SimMIM By Zhenda Xie*, Zheng Zhang*, Yue Cao*, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai and Han Hu*. This repo is the official implementation of

Microsoft 674 Dec 26, 2022
FedMM: Saddle Point Optimization for Federated Adversarial Domain Adaptation

This repository contains the code accompanying the paper " FedMM: Saddle Point Optimization for Federated Adversarial Domain Adaptation" Paper link: R

20 Jun 29, 2022
FPGA: Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification

FPGA & FreeNet Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification by Zhuo Zheng, Yanfei Zhong, Ailong M

Zhuo Zheng 92 Jan 03, 2023
Code for "Learning Graph Cellular Automata"

Learning Graph Cellular Automata This code implements the experiments from the NeurIPS 2021 paper: "Learning Graph Cellular Automata" Daniele Grattaro

Daniele Grattarola 37 Oct 26, 2022
BBB streaming without Xorg and Pulseaudio and Chromium and other nonsense (heavily WIP)

BBB Streamer NG? Makes a conference like this... ...streamable like this! I also recorded a small video showing the basic features: https://www.youtub

Lukas Schauer 60 Oct 21, 2022
WarpRNNT loss ported in Numba CPU/CUDA for Pytorch

RNNT loss in Pytorch - Numba JIT compiled (warprnnt_numba) Warp RNN Transducer Loss for ASR in Pytorch, ported from HawkAaron/warp-transducer and a re

Somshubra Majumdar 15 Oct 22, 2022
[ICCV 2021] Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation

ADDS-DepthNet This is the official implementation of the paper Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation I

LIU_LINA 52 Nov 24, 2022
Progressive Growing of GANs for Improved Quality, Stability, and Variation

Progressive Growing of GANs for Improved Quality, Stability, and Variation — Official TensorFlow implementation of the ICLR 2018 paper Tero Karras (NV

Tero Karras 5.9k Jan 05, 2023
House_prices_kaggle - Predict sales prices and practice feature engineering, RFs, and gradient boosting

House Prices - Advanced Regression Techniques Predicting House Prices with Machine Learning This project is build to enhance my knowledge about machin

Gurpreet Singh 1 Jan 01, 2022
The MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset This is the repository for Measuring Mathematical Problem Solving With the MATH Dataset b

Dan Hendrycks 267 Dec 26, 2022
Tensorflow implementation for "Improved Transformer for High-Resolution GANs" (NeurIPS 2021).

HiT-GAN Official TensorFlow Implementation HiT-GAN presents a Transformer-based generator that is trained based on Generative Adversarial Networks (GA

Google Research 78 Oct 31, 2022
Jupyter notebooks for the code samples of the book "Deep Learning with Python"

Jupyter notebooks for the code samples of the book "Deep Learning with Python"

François Chollet 16.2k Dec 30, 2022
PyTorch implementation of the paper The Lottery Ticket Hypothesis for Object Recognition

LTH-ObjectRecognition The Lottery Ticket Hypothesis for Object Recognition Sharath Girish*, Shishira R Maiya*, Kamal Gupta, Hao Chen, Larry Davis, Abh

16 Feb 06, 2022