Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Last update: Jan 01, 2023

Overview

Less is More: Pay Less Attention in Vision Transformers

Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

By Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu and Jianfei Cai.

In our paper, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. LIT uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner.

If you use this code for a paper please cite:

@article{pan2021less,
  title={Less is More: Pay Less Attention in Vision Transformers},
  author={Pan, Zizheng and Zhuang, Bohan and He, Haoyu and Liu, Jing and Cai, Jianfei},
  journal={arXiv preprint arXiv:2105.14217},
  year={2021}
}

Usage

First, clone this repository.

git clone https://github.com/MonashAI/LIT

Next, create a conda virtual environment.

# Make sure you have a NVIDIA GPU.
cd LIT/
bash setup_env.sh [conda_install_path] [env_name]

# For example
bash setup_env.sh /home/anaconda3 lit

Note: We use PyTorch 1.7.1 with CUDA 10.1 for all experiments. The setup_env.sh has illustrated all dependencies we used in our experiments. You may want to edit this file to install a different version of PyTorch or any other packages.

Data Preparation

Download the ImageNet 2012 dataset from here, and prepare the dataset based on this script. The file structure should look like:

imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...

Model Zoo

We provide baseline LIT models pretrained on ImageNet 2012.

Name	Params (M)	FLOPs (G)	Top-1 Acc. (%)	Model	Log
LIT-Ti	19	3.6	81.1	google drive/github	log
LIT-S	27	4.1	81.5	google drive/github	log
LIT-M	48	8.6	83.0	google drive/github	log
LIT-B	86	15.0	83.4	google drive/github	log

Training and Evaluation

In our implementation, we have different training strategies for LIT-Ti and other LIT models. Therefore, we provide two codebases.

For LIT-Ti, please refer to code_for_lit_ti.

For LIT-S, LIT-M, LIT-B, please refer to code_for_lit_s_m_b.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

This repository has adopted codes from DeiT, PVT and Swin, we thank the authors for their open-sourced code.

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

175 Jan 8, 2023

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

cosFormer Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention Update log 2022/2/28 Add core code License This

120 Dec 15, 2022

Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

CrossViT This repository is the official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ArXiv If

168 Dec 29, 2022

The official implementation of ELSA: Enhanced Local Self-Attention for Vision Transformer

ELSA: Enhanced Local Self-Attention for Vision Transformer By Jingkai Zhou, Pich

87 Dec 19, 2022

Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch.

SE3 Transformer - Pytorch Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch. May be needed for replicating Alphafold2 resu

207 Dec 23, 2022

Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

TRAnsformer Routing Networks (TRAR) This is an official implementation for ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visu

49 Nov 10, 2022

PyTorch Implementation of CvT: Introducing Convolutions to Vision Transformers

CvT: Introducing Convolutions to Vision Transformers Pytorch implementation of CvT: Introducing Convolutions to Vision Transformers Usage: img = torch

193 Jan 3, 2023

A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

ViTGAN: Training GANs with Vision Transformers A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers. Refer

127 Dec 23, 2022

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

272 Dec 23, 2022

Comments

Problem about DCN

I have some problems about compiling DCN, thus it is hard for me to use the DCN.deform_conv2d_forward and DCN.deform_conv2d_backward functions. Can 'deform_conv2d_naive' be used instead of this part? Or is there other methods for me to accomplish this DCN.deform_conv2d part?

opened by jarygrace 3
How to use LITNet as a beckbone of object detection

Could you please release the code of using LITNet as a beckbone of RetinaNet as mentioned in your paper? I wonder how to use it as a beckbone for object detection...

opened by sunhuisunhui 2

Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Related tags

Overview

Less is More: Pay Less Attention in Vision Transformers

Usage

Data Preparation

Model Zoo

Training and Evaluation

License

Acknowledgement

You might also like...

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The official implementation of ELSA: Enhanced Local Self-Attention for Vision Transformer

Implementation of SE3-Transformers for Equivariant Self-Attention, in Pytorch.

Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

PyTorch Implementation of CvT: Introducing Convolutions to Vision Transformers

A PyTorch implementation of ViTGAN based on paper ViTGAN: Training GANs with Vision Transformers.

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Comments

Problem about DCN

How to use LITNet as a beckbone of object detection

Releases(v2.1)

v2.1(Mar 10, 2022)

v2.0(Jun 26, 2021)

v1.0(Jun 8, 2021)

Owner

This is the source code for our ICLR2021 paper: Adaptive Universal Generalized PageRank Graph Neural Network.

[ACM MM 2019 Oral] Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

Python module providing a framework to trace individual edges in an image using Gaussian process regression.

Türkiye Canlı Mobese Görüntülerinde Profesyonel Nesne Takip Sistemi

Notspot robot simulation - Python version

Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

A developer interface for creating Chat AIs for the Chai app.

Official Implementation of "LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks"

Code I use to automatically update my videos' metadata on YouTube

A repo for Causal Imitation Learning under Temporally Correlated Noise

Group Activity Recognition with Clustered Spatial Temporal Transformer

AITom is an open-source platform for AI driven cellular electron cryo-tomography analysis.

A PaddlePaddle version image model zoo.

UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss

YOLOX Win10 Project

This is the winning solution of the Endocv-2021 grand challange.

A Python library created to assist programmers with complex mathematical functions

P-Tuning v2: Prompt Tuning Can Be Comparable to Finetuning Universally Across Scales and Tasks

[CVPR 2021] 'Searching by Generating: Flexible and Efficient One-Shot NAS with Architecture Generator'

This is the repo for the paper "Improving the Accuracy-Memory Trade-Off of Random Forests Via Leaf-Refinement".