Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

Related tags

Deep LearningSTAM
Overview

An Image is Worth 16x16 Words, What is a Video Worth?

paper

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method

Update 2/5/2021: Improved results

Due to improved training hyperparameters, and using KD training, we were able to improve STAM results on Kinetics400 (+ ~1.5%). We are releasing the pretrained weights of the improved models (see Pretrained Models below).

Main Article Results

STAM models accuracy and GPU throughput on Kinetics400, compared to X3D. All measurements were done on Nvidia V100 GPU, with mixed precision. All models are trained on input resolution of 224.

Models Top-1 Accuracy
(%)
Flops × views
(10^9)
# Input Frames Runtime
(Videos/sec)
X3D-M 76.0 6.2 × 30 480 1.3
X3D-L 77.5 24.8 × 30 480 0.46
X3D-XL 79.1 48.4 × 30 480 N/A
X3D-XXL 80.4 194 × 30 480 N/A
TimeSformer-L 80.7 2380 × 3 288 N/A
ViViT-L 81.3 3992 × 12 384 N/A
STAM-16 79.3 270 × 1 16 20.0
STAM-64 80.5 1080 × 1 64 4.8

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

Model name checkpoint
STAM_16 link
STAM_64 link

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16
--input_size=224

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.

ANEA: Distant Supervision for Low-Resource Named Entity Recognition

ANEA: Distant Supervision for Low-Resource Named Entity Recognition ANEA is a tool to automatically annotate named entities in unlabeled text based on

Saarland University Spoken Language Systems Group 15 Mar 30, 2022
Feedback is important: response-aware feedback mechanism for background based conversation

RFM The code for the paper: "Feedback is important: response-aware feedback mechanism for background based conversation." Requirements python 3.7 pyto

Jiatao Chen 2 Sep 29, 2022
Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Ehab AlBadawy 93 Jan 05, 2023
For storing the complete exploration of Visual Question Answering for our B.Tech Project

Multi-Image vqa @authors: Akhilesh, Janhavi, Harsh Paper summary, Ideas tried and their corresponding results: on wiki Other discussions: on discussio

Harsh Raj 3 Jun 16, 2022
Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

1 Jan 23, 2022
Llvlir - Low Level Variable Length Intermediate Representation

Low Level Variable Length Intermediate Representation Low Level Variable Length

Michael Clark 2 Jan 24, 2022
Code for 'Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning', ICCV 2021

CMIC-Retrieval Code for Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning. ICCV 2021. Introduction In this wo

42 Nov 17, 2022
Cmsc11 arcade - Final Project for CMSC11

cmsc11_arcade Final Project for CMSC11 Developers: Limson, Mark Vincent Peñafiel

Gregory 1 Jan 18, 2022
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 01, 2022
Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021

DIFFNet This repo is for Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021 A new backbone for self-supervised de

Hang 94 Dec 25, 2022
🚩🚩🚩

My CTF Challenges 2021 AIS3 Pre-exam / MyFirstCTF Name Category Keywords Difficulty ⒸⓄⓋⒾⒹ-①⑨ (MyFirstCTF Only) Reverse Baby ★ Piano Reverse C#, .NET ★

6 Oct 28, 2021
Implementing yolov4 target detection and tracking based on nao robot

Implementing yolov4 target detection and tracking based on nao robot

6 Apr 19, 2022
Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph".

multilingual-mrc-isdg Code for the AAAI 2022 paper "Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-Sentence Dependency Graph". This r

Liyan 5 Dec 07, 2022
Code for "Neural 3D Scene Reconstruction with the Manhattan-world Assumption" CVPR 2022 Oral

News 05/10/2022 To make the comparison on ScanNet easier, we provide all quantitative and qualitative results of baselines here, including COLMAP, COL

ZJU3DV 365 Dec 30, 2022
Randstad Artificial Intelligence Challenge (powered by VGEN). Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato

Randstad Artificial Intelligence Challenge (powered by VGEN) Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato Struttura director

Stefano Fiorucci 1 Nov 13, 2021
A geometric deep learning pipeline for predicting protein interface contacts.

A geometric deep learning pipeline for predicting protein interface contacts.

44 Dec 30, 2022
Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers 1 Using Colab Please notic

Hila Chefer 489 Jan 07, 2023
ArcaneGAN by Alex Spirin

ArcaneGAN by Alex Spirin

Alex 617 Dec 28, 2022
Gray Zone Assessment

Gray Zone Assessment Get started Clone github repository git clone https://github.com/andreanne-lemay/gray_zone_assessment.git Build docker image dock

1 Jan 08, 2022
The codes and related files to reproduce the results for Image Similarity Challenge Track 2.

The codes and related files to reproduce the results for Image Similarity Challenge Track 2.

Wenhao Wang 89 Jan 02, 2023