UniFormer - Official Implementation

Overview

UniFormer

This repo is the official implementation of "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning". It currently includes code and models for the following tasks: image classification and video classification.

Updates

01/13/2022

[Initial commits]:

  1. Pretrained models on ImageNet-1K, Kinetics-400, Kinetics-600, and Something-Something V1&V2 are provided.

  2. Supported code and models for image classification and video classification are provided.

Introduction

UniFormer (Unified transFormer) is introduced in arXiv; it effectively unifies 3D convolution and spatiotemporal self-attention in a concise transformer format. We adopt local MHRA (Multi-Head Relation Aggregator) in shallow layers to largely reduce the computation burden, and global MHRA in deep layers to learn global token relations.

UniFormer achieves strong performance on video classification. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600 while requiring 10x fewer GFLOPs than comparable methods (e.g., 16.7x fewer GFLOPs than ViViT with JFT-300M pretraining). On Something-Something V1 and V2, our UniFormer achieves 60.9% and 71.2% top-1 accuracy respectively, setting new state-of-the-art results.
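
To make the local/global design concrete, here is a minimal PyTorch sketch (illustrative only, not the repository's actual module): local MHRA is approximated by a depthwise 3D convolution that mixes each token with its spatiotemporal neighbors, while global MHRA is standard self-attention over all tokens.

```python
import torch
import torch.nn as nn

class LocalMHRA(nn.Module):
    """Illustrative local relation aggregator: a depthwise 3D convolution
    mixes each token with nearby tokens only (cheap, used in shallow layers)."""
    def __init__(self, dim, kernel_size=(3, 5, 5)):
        super().__init__()
        padding = tuple(k // 2 for k in kernel_size)
        self.dwconv = nn.Conv3d(dim, dim, kernel_size, padding=padding, groups=dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.dwconv(x)

class GlobalMHRA(nn.Module):
    """Illustrative global relation aggregator: multi-head self-attention
    over all T*H*W tokens (expensive, used in deep layers)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(B, C, T, H, W)

video = torch.randn(2, 64, 8, 14, 14)                # (B, C, T, H, W)
print(LocalMHRA(64)(video).shape, GlobalMHRA(64)(video).shape)
```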

Main results on ImageNet-1K

Please see image_classification for more details.

More models with higher resolution and token labeling will be released soon.

| Model | Pretrain | Resolution | Top-1 (%) | #Param. | FLOPs |
| ----- | -------- | ---------- | --------- | ------- | ----- |
| UniFormer-S | ImageNet-1K | 224x224 | 82.9 | 22M | 3.6G |
| UniFormer-S† | ImageNet-1K | 224x224 | 83.4 | 24M | 4.2G |
| UniFormer-B | ImageNet-1K | 224x224 | 83.9 | 50M | 8.3G |
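
For a quick sanity check, a hypothetical loading sketch is shown below; the `uniformer_small` constructor, checkpoint file name, and state-dict key are placeholders, and the actual entry points live in the image_classification folder.

```python
import torch
# Hypothetical usage sketch for UniFormer-S on ImageNet-1K (224x224).
# `uniformer_small` and the checkpoint name are placeholders; see the
# image_classification folder for the actual model definitions and weights.
from uniformer import uniformer_small

model = uniformer_small(num_classes=1000)                  # assumed signature
ckpt = torch.load("uniformer_small_in1k.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt))             # state-dict key may differ
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))            # standard 224x224 input
    print(logits.argmax(dim=1))
```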

Main results on Kinetics-400

Please see video_classification for more details.

| Model | Pretrain | #Frame | Sampling Method | FLOPs | K400 Top-1 (%) | K600 Top-1 (%) |
| ----- | -------- | ------ | --------------- | ----- | -------------- | -------------- |
| UniFormer-S | ImageNet-1K | 16x1x4 | 16x4 | 167G | 80.8 | 82.8 |
| UniFormer-S | ImageNet-1K | 16x1x4 | 16x8 | 167G | 80.8 | 82.7 |
| UniFormer-S | ImageNet-1K | 32x1x4 | 32x4 | 438G | 82.0 | - |
| UniFormer-B | ImageNet-1K | 16x1x4 | 16x4 | 387G | 82.0 | 84.0 |
| UniFormer-B | ImageNet-1K | 16x1x4 | 16x8 | 387G | 81.7 | 83.4 |
| UniFormer-B | ImageNet-1K | 32x1x4 | 32x4 | 1036G | 82.9 | 84.5* |

* Since training on Kinetics-600 is very time-consuming (>1 month on a single node with 8 A100 GPUs), we provide a model trained on multiple nodes (around 2 weeks with 32 V100 GPUs); the result is slightly lower because the hyperparameters were not tuned.
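
Reading the video tables: #Frame appears to follow a frames x crops x clips convention (e.g., 16x1x4 = 16 frames per clip, 1 spatial crop, 4 temporal clips at test time), and Sampling Method TxS denotes T frames sampled with a frame stride of S. This is our interpretation of the notation; the sketch below illustrates such strided sampling and is not the repository's dataloader.

```python
import numpy as np

def sample_clip(num_video_frames, num_frames=16, stride=4, rng=None):
    """Illustrative 'T x S' sampling: pick `num_frames` indices spaced by
    `stride`, starting at a random offset when the video is long enough.
    This mirrors the common strided-sampling convention, not the
    repository's exact implementation."""
    rng = rng or np.random.default_rng()
    span = num_frames * stride
    start = rng.integers(0, num_video_frames - span + 1) if num_video_frames >= span else 0
    idx = start + stride * np.arange(num_frames)
    return np.clip(idx, 0, num_video_frames - 1)   # short videos repeat the last frame

print(sample_clip(300, num_frames=16, stride=4))   # 16 frame indices with stride 4
```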

Main results on Something-Something

Please see video_classification for more details.

| Model | Pretrain | #Frame | FLOPs | SSV1 Top-1 (%) | SSV2 Top-1 (%) |
| ----- | -------- | ------ | ----- | -------------- | -------------- |
| UniFormer-S | K400 | 16x3x1 | 125G | 57.2 | 67.7 |
| UniFormer-S | K600 | 16x3x1 | 125G | 57.6 | 69.4 |
| UniFormer-S | K400 | 32x3x1 | 329G | 58.8 | 69.0 |
| UniFormer-S | K600 | 32x3x1 | 329G | 59.9 | 70.4 |
| UniFormer-B | K400 | 16x3x1 | 290G | 59.1 | 70.4 |
| UniFormer-B | K600 | 16x3x1 | 290G | 58.8 | 70.2 |
| UniFormer-B | K400 | 32x3x1 | 777G | 60.9 | 71.1 |
| UniFormer-B | K600 | 32x3x1 | 777G | 61.0 | 71.2 |

Main results on downstream tasks

We have conducted extensive experiments on downstream tasks and achieved results comparable with SOTA models.

Code and models will be released in two weeks.

Cite UniFormer

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{li2022uniformer,
      title={Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning}, 
      author={Kunchang Li and Yali Wang and Peng Gao and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
      year={2022},
      eprint={2201.04676},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Contributors and Contact Information

UniFormer is maintained by Kunchang Li.

For help or issues using UniFormer, please submit a GitHub issue.

For other communications related to UniFormer, please contact Kunchang Li ([email protected]).
