Contains code for the paper "Vision Transformers are Robust Learners".

Last update: Jan 05, 2023

Overview

Vision Transformers are Robust Learners

This repository contains the code for the paper "Vision Transformers are Robust Learners" by Sayak Paul* and Pin-Yu Chen*.

*Equal contribution.

Abstract

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align the different components present in the input data, it is worth investigating its performance on model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and a family of SOTA convolutional neural networks (CNNs), Big Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness.

Structure and Navigation

All results on the ImageNet datasets (ImageNet-C, ImageNet-P, ImageNet-R, ImageNet-A, ImageNet-O, and ImageNet-9) can be reproduced with the notebooks in the imagenet_results/ directory. Many of these notebooks can be run on Google Colab; where a notebook cannot, we provide explicit execution instructions. The other directories in this repository follow the same convention.
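ImageNet-C results are commonly summarized as a mean Corruption Error (mCE): for each corruption type, the model's top-1 errors summed over the five severity levels are divided by AlexNet's summed errors on the same corruption, and these ratios are then averaged over corruption types. The snippet below is a minimal NumPy sketch of that standard definition, assuming you already have per-severity error rates. The function names and the dictionary layout are hypothetical and are not taken from the notebooks in imagenet_results/.

```python
import numpy as np


def corruption_error(model_err, baseline_err):
    """Corruption Error (CE) for one corruption type (hypothetical helper).

    model_err, baseline_err: top-1 error rates in [0, 1], one per severity level.
    The baseline is conventionally AlexNet.
    """
    model_err = np.asarray(model_err, dtype=float)
    baseline_err = np.asarray(baseline_err, dtype=float)
    # Sum over severities first, then normalize by the baseline's summed error.
    return model_err.sum() / baseline_err.sum()


def mean_corruption_error(model_errs, baseline_errs):
    """mCE as a percentage: the average CE over all corruption types.

    Both arguments map a corruption name to its per-severity error rates
    (an assumed layout, not the repository's).
    """
    ces = [corruption_error(model_errs[c], baseline_errs[c]) for c in model_errs]
    return 100.0 * float(np.mean(ces))
```

Lower is better: an mCE below 100 means the model degrades less under corruption than the baseline does.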

The analysis/ directory contains the code used to generate the results in Section 4 of the paper.
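The abstract refers to an image-masking analysis. One common way to run such an experiment with a ViT is to blank out a random subset of the non-overlapping patches the model sees and then measure how accuracy degrades as the masked fraction grows. The following is a minimal NumPy sketch under that assumption (16x16 patches, masked pixels set to zero); mask_random_patches and its parameters are hypothetical, and the masking scheme used in analysis/ may differ.

```python
import numpy as np


def mask_random_patches(image, patch_size=16, mask_ratio=0.5, seed=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper for illustration only.
    image: array of shape (H, W, C) with H and W divisible by patch_size.
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Pick distinct patch indices to mask (row-major order over the patch grid).
    rng = np.random.default_rng(seed)
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)

    out = image.copy()  # leave the caller's image untouched
    for idx in masked_ids:
        r, c = divmod(int(idx), grid_w)
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = 0
    return out
```

With a 224x224 input and 16x16 patches there are 196 patches, so mask_ratio=0.5 blanks 98 of them.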

The misc/ directory contains code for visualizing frequency artifacts in images.
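Frequency artifacts are usually inspected through an image's 2D Fourier spectrum. The sketch below, assuming NumPy and an image given as an array, computes the centered log-magnitude spectrum that such visualizations typically display. log_magnitude_spectrum is a hypothetical helper, not the code in misc/.

```python
import numpy as np


def log_magnitude_spectrum(image):
    """Centered log-magnitude 2D Fourier spectrum (hypothetical helper).

    image: (H, W) grayscale or (H, W, C) array; color channels are averaged.
    """
    img = np.asarray(image, dtype=np.float64)
    if img.ndim == 3:
        img = img.mean(axis=-1)
    # fftshift moves the zero-frequency (DC) term to the center of the array.
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    # log1p compresses the large dynamic range between DC and high frequencies.
    return np.log1p(np.abs(spectrum))
```

The result can be plotted directly, for example with Matplotlib's imshow. Low frequencies sit near the center and high frequencies toward the edges and corners, which is where periodic artifacts such as 2-pixel checkerboard patterns tend to appear.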

About our dev environment

We use Python 3.8. When not using Colab, our hardware setup is a GCP AI Platform Notebook (n1-standard-16 machine type) with 4 NVIDIA V100 GPUs, 60 GB of RAM, and 16 vCPUs.

Citation

@misc{paul2021vision,
      title={Vision Transformers are Robust Learners}, 
      author={Sayak Paul and Pin-Yu Chen},
      year={2021},
      eprint={2105.07581},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

We are thankful to the Google Developers Experts program (specifically Soonson Kwon and Karl Weinmeister) for providing Google Cloud Platform credits to support the experiments. We also thank Justin Gilmer (of Google), Guillermo Ortiz-Jimenez (of EPFL, Switzerland), and Dan Hendrycks (of UC Berkeley) for fruitful discussions.
