Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

Related tags

Computer VisionSTAM
Overview

An Image is Worth 16x16 Words, What is a Video Worth?

PWC

paper

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method

Main Article Results

STAM models accuracy and GPU throughput on Kinetics400, compared to X3D. All measurements were done on Nvidia V100 GPU, with mixed precision. All models are trained on input resolution of 224.

Models Top-1 Accuracy
(%)
Flops × views
(10^9)
# Input Frames Runtime
(Videos/sec)
X3D-M 76.0 6.2 × 30 480 1.3
X3D-L 77.5 24.8 × 30 480 0.46
X3D-XL 79.1 48.4 × 30 480 N/A
STAM-16 77.8 270 × 1 16 20.0
STAM-64 79.2 1080 × 1 64 4.8

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

Model name checkpoint
STAM_16 link
STAM_32 link
STAM_64 link

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16
--input_size=224

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.

Generates a message from the infamous Jerma Impostor image

Generate your very own jerma sus imposter message. Modes: Default Mode: Only supports the characters " ", !, a, b, c, d, e, h, i, m, n, o, p, q, r, s,

Giorno420 1 Oct 27, 2022
Text language identification using Wikipedia data

Text language identification using Wikipedia data The aim of this project is to provide high-quality language detection over all the web's languages.

Vsevolod Dyomkin 28 Jul 09, 2022
Generating .npy dataset and labels out of given image, containing numbers from 0 to 9, using opencv

basic-dataset-generator-from-image-of-numbers generating .npy dataset and labels out of given image, containing numbers from 0 to 9, using opencv inpu

1 Jan 01, 2022
An Implementation of the alogrithm in paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

InceptText-Tensorflow An Implementation of the alogrithm in paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Orien

GeorgeJoe 115 Dec 12, 2022
OpenCVを用いたカメラキャリブレーションのサンプルです。2021/06/21時点でPython実装のある3種類(通常カメラ向け、魚眼レンズ向け(fisheyeモジュール)、全方位カメラ向け(omnidirモジュール))について用意しています。

OpenCV-CameraCalibration-Example FishEyeCameraCalibration.mp4 OpenCVを用いたカメラキャリブレーションのサンプルです 2021/06/21時点でPython実装のある以下3種類について用意しています。 通常カメラ向け 魚眼レンズ向け(

KazuhitoTakahashi 34 Nov 17, 2022
ocroseg - This is a deep learning model for page layout analysis / segmentation.

ocroseg This is a deep learning model for page layout analysis / segmentation. There are many different ways in which you can train and run it, but by

NVIDIA Research Projects 71 Dec 06, 2022
CRAFT-Pyotorch:Character Region Awareness for Text Detection Reimplementation for Pytorch

CRAFT-Reimplementation Note:If you have any problems, please comment. Or you can join us weChat group. The QR code will update in issues #49 . Reimple

453 Dec 28, 2022
A Screen Translator/OCR Translator made by using Python and Tesseract, the user interface are made using Tkinter. All code written in python.

About An OCR translator tool. Made by me by utilizing Tesseract, compiled to .exe using pyinstaller. I made this program to learn more about python. I

Fauzan F A 41 Dec 30, 2022
RRD: Rotation-Sensitive Regression for Oriented Scene Text Detection

RRD: Rotation-Sensitive Regression for Oriented Scene Text Detection For more details, please refer to our paper. Citing Please cite the related works

Minghui Liao 102 Jun 29, 2022
Computer vision applications project (Flask and OpenCV)

Computer Vision Applications Project This project is at it's initial phase. This is all about the implementation of different computer vision techniqu

Suryam Thapa 1 Jan 26, 2022
A python screen recorder for low-end computers, provides high quality video output.

RecorderX - v1.0 A screen recorder made in Python with the help of OpenCv, it has ability to record your screen in high quality. No matter what your P

Priyanshu Jindal 4 Nov 10, 2021
Some bits of javascript to transcribe scanned pages using PageXML

nashi (nasḫī) Some bits of javascript to transcribe scanned pages using PageXML. Both ltr and rtl languages are supported. Try it! But wait, there's m

Andreas Büttner 15 Nov 09, 2022
Introduction to image processing, most used and popular functions of OpenCV

👀 OpenCV 101 Introduction to image processing, most used and popular functions of OpenCV go here.

Vusal Ismayilov 3 Jul 02, 2022
Using python libraries to track hands

Python-HandTracking Using python libraries to track hands on a camera Uses cv2 and mediapipe libraries custom hand tracking module PyCharm IDE Final E

Martin Matsudaira 1 Dec 17, 2021
OCR, Scene-Text-Understanding, Text Recognition

Scene-Text-Understanding Survey [2015-PAMI] Text Detection and Recognition in Imagery: A Survey paper [2014-Front.Comput.Sci] Scene Text Detection and

Alan Tang 354 Dec 12, 2022
Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words.

Handwritten-Text-Recognition Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. T

27 Jan 08, 2023
The official code for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates".

SpeechDrivesTemplates The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates". [arxiv

Qian Shenhan 53 Dec 23, 2022
Code for CVPR2021 paper "Learning Salient Boundary Feature for Anchor-free Temporal Action Localization"

AFSD: Learning Salient Boundary Feature for Anchor-free Temporal Action Localization This is an official implementation in PyTorch of AFSD. Our paper

Tencent YouTu Research 146 Dec 24, 2022
The code for CVPR2022 paper "Likert Scoring with Grade Decoupling for Long-term Action Assessment".

Likert Scoring with Grade Decoupling for Long-term Action Assessment This is the code for CVPR2022 paper "Likert Scoring with Grade Decoupling for Lon

10 Oct 21, 2022
Random maze generator and solver

Maze Generator and Solver I wrote a maze generator that works with two commonly known algorithms: Depth First Search and Randomized Prims. Both of the

Daniel Pérez 10 Sep 23, 2022