This is the official implementation of the paper "Token Shift Transformer for Video Classification".

Overview

TokShift-Transformer

This is the official implementation of the paper "Token Shift Transformer for Video Classification". We achieve state-of-the-art top-1 accuracy of 80.40% on the Kinetics-400 validation set. Paper link

Updates

July 11, 2021

  • Release the V1 version (the version used in the paper) to the public.
  • We are preparing a V2 version, to be released within one week, with the following modifications:
  1. Decode video mp4 files directly during training/evaluation.
  2. Adopt the standardized timm code base.
  3. Performance is further improved over the version reported in the paper (+0.5 on average).

April 22, 2021

  • Add Train/Test guideline and data preparation

April 16, 2021

  • Publish TokShift Transformer for video content understanding

Model Zoo and Baselines

architecture        | backbone | pretrain  | Res & Frames | GFLOPs x views | top1         | config
--------------------|----------|-----------|--------------|----------------|--------------|-------
ViT (Video)         | Base16   | ImgNet21k | 224 & 8      | 134.7 x 30     | 76.02 (link) | k400_vit_8x32_224.yml
TokShift            | Base16   | ImgNet21k | 224 & 8      | 134.7 x 30     | 77.28 (link) | k400_tokshift_div4_8x32_base_224.yml
TokShift (MR)       | Base16   | ImgNet21k | 256 & 8      | 175.8 x 30     | 77.68 (link) | k400_tokshift_div4_8x32_base_256.yml
TokShift (HR)       | Base16   | ImgNet21k | 384 & 8      | 394.7 x 30     | 78.14 (link) | k400_tokshift_div4_8x32_base_384.yml
TokShift            | Base16   | ImgNet21k | 224 & 16     | 268.5 x 30     | 78.18 (link) | k400_tokshift_div4_16x32_base_224.yml
TokShift-Large (HR) | Large16  | ImgNet21k | 384 & 8      | 1397.6 x 30    | 79.83 (link) | k400_tokshift_div4_8x32_large_384.yml
TokShift-Large (HR) | Large16  | ImgNet21k | 384 & 12     | 2096.4 x 30    | 80.40 (link) | k400_tokshift_div4_12x32_large_384.yml

Below is the training log. To save time, validation during training uses 3-view evaluation instead of 30 views.
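
Assuming the "GFLOPs x views" column follows the usual convention (cost of one view times the number of views), the per-video inference cost scales linearly with the view count. The sketch below works through that arithmetic with figures taken from the table; the function name `total_gflops` is illustrative and not part of this repository.

```python
def total_gflops(gflops_per_view: float, views: int) -> float:
    """Per-video inference cost: cost of one view times the number of views."""
    return gflops_per_view * views


# TokShift Base16 at 224 & 8 frames: 134.7 GFLOPs per view (from the table).
full_eval = total_gflops(134.7, 30)   # 30-view evaluation reported in the table
quick_eval = total_gflops(134.7, 3)   # 3-view evaluation used during training

print(f"30 views: {full_eval:.1f} GFLOPs, 3 views: {quick_eval:.1f} GFLOPs")
print(f"3-view validation is {full_eval / quick_eval:.0f}x cheaper per video")
```

Under this assumption, 3-view validation costs one tenth of the 30-view protocol, which is why the accuracy in the training log can differ from the top1 column.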

Installation

  • PyTorch >= 1.7, torchvision
  • tensorboardx

Quick Start

Train

  1. Download ImageNet-22k pretrained weights from Base16 and Large16.
  2. Prepare the Kinetics-400 dataset, including the trainValTest list files, organized in the following structure:
k400
|_ frames331_train
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |   |_ img_00001.jpg
|  |  |   |_ img_00002.jpg
|  |  |   |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |   |_ img_00001.jpg
|  |  |   |_ img_00002.jpg
|  |  |   |_ ...
|  |  |_ ...
|  |_ ...
|
|_ frames331_val
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |   |_ img_00001.jpg
|  |  |   |_ img_00002.jpg
|  |  |   |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |   |_ img_00001.jpg
|  |  |   |_ img_00002.jpg
|  |  |   |_ ...
|  |  |_ ...
|  |_ ...
|
|_ trainValTest
   |_ train.txt
   |_ val.txt
  3. Use the training script (train.sh) to train on Kinetics-400:
#!/usr/bin/env python
# Launch single-node distributed training on Kinetics-400.
import os

# Fine-tune from the pretrained ViT-L/16 weights using the selected config.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp://127.0.0.1:23677 \
        --tune_from pretrain/ViT-L_16_Img21.npz \
        --cfg config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml"
os.system(cmd)
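
Before launching training, it can help to confirm that the frame folders match the layout from step 2. The following is a minimal sketch of such a check, not a tool shipped with this repository: the function names `summarize_split` and `check_layout` are hypothetical, and the only facts it relies on are the folder names and the `img_XXXXX.jpg` naming shown in the tree above. The demo at the end runs on a throwaway directory rather than on real data.

```python
import tempfile
from pathlib import Path


def summarize_split(split_dir: Path) -> dict:
    """Count frames per video under <split>/<category>/<video>/img_*.jpg."""
    counts = {}
    for category in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for video in sorted(p for p in category.iterdir() if p.is_dir()):
            key = f"{category.name}/{video.name}"
            counts[key] = len(list(video.glob("img_*.jpg")))
    return counts


def check_layout(root: Path) -> list:
    """Return a list of problems; an empty list means the layout looks right."""
    problems = []
    for split in ("frames331_train", "frames331_val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}")
            continue
        for video, n_frames in summarize_split(split_dir).items():
            if n_frames == 0:
                problems.append(f"no frames in {split}/{video}")
    for name in ("train.txt", "val.txt"):
        if not (root / "trainValTest" / name).is_file():
            problems.append(f"missing list file: trainValTest/{name}")
    return problems


# Demo: mimic the layout with one video per split and omit val.txt on purpose.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "k400"
    for split in ("frames331_train", "frames331_val"):
        video_dir = root / split / "category_0" / "video_0"
        video_dir.mkdir(parents=True)
        (video_dir / "img_00001.jpg").touch()
    (root / "trainValTest").mkdir()
    (root / "trainValTest" / "train.txt").touch()
    demo_problems = check_layout(root)

print(demo_problems)
```

On a real dataset, calling `check_layout(Path("k400"))` from the directory that contains `k400` should return an empty list when everything is in place.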

Test

Use the test script (test.sh) to evaluate on Kinetics-400:

#!/usr/bin/env python
# Evaluate a trained checkpoint on the Kinetics-400 validation set.
import os

# --evaluate skips training; --resume points at the checkpoint to load.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp://127.0.0.1:23677 \
        --evaluate \
        --resume model_zoo/ViT-B_16_k400_dense_cls400_segs8x32_e18_lr0.1_B21_VAL224/best_vit_B8x32x224_k400.pth \
        --cfg config/custom/kinetics400/k400_vit_8x32_224.yml"
os.system(cmd)
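
The train and test launchers differ only in a few flags, so one possible refactoring is a shared helper that builds the command as an argument list and runs it with `subprocess.run`, which avoids shell quoting issues and raises on a non-zero exit code. This is a sketch, not part of the repository: `build_command` and `launch` are hypothetical names, and every flag is copied verbatim from the scripts above.

```python
import subprocess


def build_command(cfg, tune_from=None, resume=None, evaluate=False,
                  dist_url="tcp://127.0.0.1:23677"):
    """Assemble the main_ddp_shift_v3.py command line as a list of arguments."""
    cmd = [
        "python", "-u", "main_ddp_shift_v3.py",
        "--multiprocessing-distributed", "--world-size", "1", "--rank", "0",
        "--dist-ur", dist_url,  # flag spelled exactly as in the scripts above
    ]
    if tune_from:
        cmd += ["--tune_from", tune_from]   # pretrained weights for fine-tuning
    if evaluate:
        cmd.append("--evaluate")            # evaluation only, no training
    if resume:
        cmd += ["--resume", resume]         # checkpoint to load
    cmd += ["--cfg", cfg]
    return cmd


def launch(cmd):
    """Run the command from the repository root; raises if it exits non-zero."""
    subprocess.run(cmd, check=True)


train_cmd = build_command(
    cfg="config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml",
    tune_from="pretrain/ViT-L_16_Img21.npz",
)
print(" ".join(train_cmd))
```

Calling `launch(train_cmd)` would start training; passing `evaluate=True` and `resume=...` instead of `tune_from` reproduces the test command.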

Contributors

VideoNet is written and maintained by Dr. Hao Zhang and Dr. Yanbin Hao.

Citing

If you find TokShift-xfmr useful in your research, please cite it using the following BibTeX entry.

@article{tokshift2021,
  title={Token Shift Transformer for Video Classification},
  author={Hao Zhang and Yanbin Hao and Chong-Wah Ngo},
  journal={ACM Multimedia},
  year={2021}
}

Acknowledgement

Thanks to the following GitHub projects:
