Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang1,3, Xinggang Wang1 πŸ“§ , Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.

1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.

( πŸ“§ ) corresponding author.


  • This repo provides code, models and training/inference recipes for TeViT(Temporally Efficient Vision Transformer for Video Instance Segmentation).
  • TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon the query-based instance segmentation methods, i.e., QueryInst.
  • We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. These two designs fully utlizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost.

Overall Arch

Models and Main Results

  • We provide both checkpoints and codalab server submissions on YouTube-VIS-2019 dataset.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_MsgShifT 46.3 70.6 50.9 45.2 54.3 link link
TeViT_MsgShifT_MST 46.9 70.1 52.9 45.0 53.4 link link
  • We have conducted multiple runs due to the training instability and checkpoints above are all the best one among multiple runs. The average performances are reported in our paper.
  • Besides basic models, we also provide TeViT with ResNet-50 and Swin-L backbone, models are also trained on YouTube-VIS-2019 dataset.
  • MST denotes multi-scale traning.
Name AP [email protected] [email protected] [email protected] [email protected] model submission
TeViT_R50 42.1 67.8 44.8 41.3 49.9 link link
TeViT_Swin-L_MST 56.8 80.6 63.1 52.0 63.3 link link
  • Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbone are conducted with STQI Head only (i.e., without our proposed messenger shift mechanism).
  • With Swin-L as backbone network, we apply more instance queries (i.e., from 100 to 300) and stronger data augmentation strategies. Both of them can further boost the final performance.

Installation

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Prepare

  • Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
  • Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI
  • Install Python requirements
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
  • Please follow Docs to install MMDetection
python setup.py develop
  • Download YouTube-VIS 2019 dataset from here, and organize dataset as follows:
TeViT
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ youtubevis
β”‚   β”‚   β”œβ”€β”€ train
β”‚   β”‚   β”‚   β”œβ”€β”€ 003234408d
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ val
β”‚   β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”‚   β”œβ”€β”€ train.json
β”‚   β”‚   β”‚   β”œβ”€β”€ valid.json

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference process, the predicted results is stored in results.json, submit it to the evaluation server to get the final performance.

Training

  • Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
  • Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • The whole training process will cost about three hours with 8 TESLA V100 GPUs.
  • To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❀️

This code is mainly based on mmdetection and QueryInst, thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation πŸ“ :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation,
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC in HUST, Lead by @xinggangw
Hust Visual Learning Team
Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On, CVPR 2021 [WIP] The code for CVPR 2021 paper 'Disentangled Cycle Consistency for H

ChongjianGE 94 Dec 11, 2022
Classification of EEG data using Deep Learning

Graduation-Project Classification of EEG data using Deep Learning Epilepsy is the most common neurological disease in the world. Epilepsy occurs as a

Osman AlpaydΔ±n 5 Jun 24, 2022
Exploration & Research into cross-domain MEV. Initial focus on ETH/POLYGON.

xMEV, an apt exploration This is a small exploration on the xMEV opportunities between Polygon and Ethereum. It's a data analysis exercise on a few pa

odyslam.eth 7 Oct 18, 2022
code for paper"A High-precision Semantic Segmentation Method Combining Adversarial Learning and Attention Mechanism"

PyTorch implementation of UAGAN(U-net Attention Generative Adversarial Networks) This repository contains the source code for the paper "A High-precis

Tong 8 Apr 25, 2022
WaveFake: A Data Set to Facilitate Audio DeepFake Detection

WaveFake: A Data Set to Facilitate Audio DeepFake Detection This is the code repository for our NeurIPS 2021 (Track on Datasets and Benchmarks) paper

Chair for SysΒ­tems SeΒ­cuΒ­riΒ­ty 27 Dec 22, 2022
This is a repo of basic Machine Learning!

Basic Machine Learning This repository contains a topic-wise curated list of Machine Learning and Deep Learning tutorials, articles and other resource

Ekram Asif 53 Dec 31, 2022
This repo provides the source code & data of our paper "GreaseLM: Graph REASoning Enhanced Language Models"

GreaseLM: Graph REASoning Enhanced Language Models This repo provides the source code & data of our paper "GreaseLM: Graph REASoning Enhanced Language

137 Jan 02, 2023
A project which aims to protect your privacy using inexpensive hardware and easily modifiable software

Protecting your privacy using an ESP32, an IR sensor and a python script This project, which I personally call the "never-gonna-catch-me-in-the-act-ev

8 Oct 10, 2022
Simulation-based inference for the Galactic Center Excess

Simulation-based inference for the Galactic Center Excess Siddharth Mishra-Sharma and Kyle Cranmer Abstract The nature of the Fermi gamma-ray Galactic

Siddharth Mishra-Sharma 3 Jan 21, 2022
Single Image Random Dot Stereogram for Tensorflow

TensorFlow-SIRDS Single Image Random Dot Stereogram for Tensorflow SIRDS is a means to present 3D data in a 2D image. It allows for scientific data di

Greg Peatfield 5 Aug 10, 2022
Capsule endoscopy detection DACON challenge

capsule_endoscopy_detection (DACON Challenge) Overview Yolov5, Yolor, mmdetection기반의 λͺ¨λΈμ„ μ‚¬μš© (총 11개 λͺ¨λΈ 앙상블) λͺ¨λ“  λͺ¨λΈμ€ ν•™μŠ΅ μ‹œ Pretrained Weight을 yolov5, yolo

MAILAB 11 Nov 25, 2022
Reinforcement Learning for Portfolio Management

qtrader Reinforcement Learning for Portfolio Management Why Reinforcement Learning? Learns the optimal action, rather than models the market. Adaptive

Angelos Filos 406 Jan 01, 2023
Contrastive Language-Image Pretraining

CLIP [Blog] [Paper] [Model Card] [Colab] CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pair

OpenAI 11.5k Jan 08, 2023
Code for paper Novel View Synthesis via Depth-guided Skip Connections

Novel View Synthesis via Depth-guided Skip Connections Code for paper Novel View Synthesis via Depth-guided Skip Connections @InProceedings{Hou_2021_W

8 Mar 14, 2022
sssegmentation is a general framework for our research on strongly supervised semantic segmentation.

sssegmentation is a general framework for our research on strongly supervised semantic segmentation.

445 Jan 02, 2023
Robotics with GPU computing

Robotics with GPU computing Cupoch is a library that implements rapid 3D data processing for robotics using CUDA. The goal of this library is to imple

Shirokuma 625 Jan 07, 2023
BASH - Biomechanical Animated Skinned Human

We developed a method animating a statistical 3D human model for biomechanical analysis to increase accessibility for non-experts, like patients, athletes, or designers.

Machine Learning and Data Analytics Lab FAU 66 Nov 19, 2022
Awesome Monocular 3D detection

Awesome Monocular 3D detection Paper list of 3D detetction, keep updating! Contents Paper List 2022 2021 2020 2019 2018 2017 2016 KITTI Results Paper

Zhikang Zou 184 Jan 04, 2023
Generalized Random Forests

generalized random forests A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods fo

GRF Labs 781 Dec 25, 2022
Bald-to-Hairy Translation Using CycleGAN

GANiry: Bald-to-Hairy Translation Using CycleGAN Official PyTorch implementation of GANiry. GANiry: Bald-to-Hairy Translation Using CycleGAN, Fidan Sa

Fidan Samet 10 Oct 27, 2022