Implementation of TimeSformer, a pure attention-based solution for video classification

Last update: Jan 03, 2023

Overview

TimeSformer - Pytorch

Implementation of TimeSformer, a pure attention-based approach to video classification that the paper (see Citations) reports as reaching SOTA. This repository houses only the best-performing variant, 'Divided Space-Time Attention', which applies attention along the time axis first and then along the spatial axes.
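To make the "time first, then space" ordering concrete, here is a minimal sketch of the idea. It is not this repository's code: the helper names are hypothetical, there are no learned projections, heads, or classification token, and the tensor sizes are chosen only for illustration.

```python
import torch

def attend(x):
    # Plain single-head self-attention with q = k = v = x (no learned weights).
    scale = x.shape[-1] ** -0.5
    weights = torch.softmax(x @ x.transpose(-1, -2) * scale, dim=-1)
    return weights @ x

def divided_space_time_attention(tokens):
    # tokens: (batch, frames, patches, dim)
    b, f, n, d = tokens.shape

    # Time attention: each patch location attends across the f frames.
    t = tokens.permute(0, 2, 1, 3).reshape(b * n, f, d)
    t = attend(t).reshape(b, n, f, d).permute(0, 2, 1, 3)
    tokens = tokens + t  # residual connection

    # Space attention: each frame attends across its n patches.
    s = tokens.reshape(b * f, n, d)
    s = attend(s).reshape(b, f, n, d)
    return tokens + s    # residual connection

tokens = torch.randn(2, 8, 196, 64)
out = divided_space_time_attention(tokens)
print(out.shape)
```

Splitting attention this way means each token attends over f frames and then n patches, rather than over all f * n tokens at once.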

Install

$ pip install timesformer-pytorch

Usage

import torch
from timesformer_pytorch import TimeSformer

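model = TimeSformer(
    dim = 512,           # token embedding dimension
    image_size = 224,    # frame height and width in pixels
    patch_size = 16,     # side length of each square patch
    num_frames = 8,      # frames per input clip
    num_classes = 10,    # size of the output logits
    depth = 12,          # number of transformer layers
    heads = 8,           # attention heads per layer
    dim_head = 64,       # dimension of each attention head
    attn_dropout = 0.1,  # dropout in attention
    ff_dropout = 0.1     # dropout in the feedforward layers
)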

video = torch.randn(2, 8, 3, 224, 224) # (batch x frames x channels x height x width)
pred = model(video) # (2, 10)
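
For orientation, the arithmetic below works out how many patch tokens the settings above imply. It is a back-of-the-envelope sketch with hypothetical variable names; whether the model also prepends a classification token is an implementation detail not shown here.

```python
image_size, patch_size, num_frames, channels = 224, 16, 8, 3

patches_per_side = image_size // patch_size      # patches along one edge of a frame
patches_per_frame = patches_per_side ** 2        # patches in one frame
num_tokens = num_frames * patches_per_frame      # patch tokens in the whole clip
patch_dim = channels * patch_size ** 2           # values flattened into one patch

print(patches_per_frame, num_tokens, patch_dim)  # 196 1568 768
```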

Citations

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?}, 
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

How to deal with varying length video? Thanks

Dear all, I am wondering whether TimeSformer can handle videos of different lengths. Is it possible to use a mask, as in the original Transformer? Any ideas are welcome, thanks a lot.
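One common workaround, independent of this library, is to resample every clip to a fixed number of frames before batching. The sketch below picks evenly spaced frame indices and repeats the last frame for clips that are too short; the function name and the padding policy are assumptions for illustration, not something the repository provides.

```python
def sample_frame_indices(video_length, num_frames=8):
    """Return num_frames indices into a clip of video_length frames."""
    if video_length <= 0:
        raise ValueError("video_length must be positive")
    if video_length >= num_frames:
        # Take the centre frame of each of num_frames equal segments.
        return [int((i + 0.5) * video_length / num_frames) for i in range(num_frames)]
    # Clip is too short: use every frame, then repeat the last one as padding.
    return list(range(video_length)) + [video_length - 1] * (num_frames - video_length)

print(sample_frame_indices(16))  # [1, 3, 5, 7, 9, 11, 13, 15]
print(sample_frame_indices(3))   # [0, 1, 2, 2, 2, 2, 2, 2]
```

Repeated padding frames still take part in attention, so a mask over the padded positions would be the more faithful option where the model supports one.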

opened by junyongyou 2

fix runtime error in SpaceTime Attention

There is a shape mismatch error in Attention. When we splice out the classification token from the first token of each sequence in q, k and v, the shape becomes (batch_size * num_heads, num_frames * num_patches - 1, head_dim). Then we try to reshape the tensor by taking out a factor of num_frames or num_patches (depending on whether it is space or time attention) from dimension 1. That doesn't work because we subtracted out the classification token.

I found that performing the rearrange operation before splicing the token fixes the issue.

I recreate the problem and illustrate the solution in this notebook: https://colab.research.google.com/drive/1lHFcn_vgSDJNSqxHy7rtqhMVxe0nUCMS?usp=sharing.

By the way, thank you to @lucidrains; all of your implementations on attention-based models are helping me more than you know.

opened by adam-mehdi 1

Update timesformer_pytorch.py

fixing issue for scaling

File "/home/aarti9/.local/lib/python3.6/site-packages/timesformer_pytorch/timesformer_pytorch.py", line 82, in forward
    q *= self.scale

RuntimeError: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.
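
The fix the error message asks for can be sketched as follows. This is an illustrative reconstruction, not the library's source: `to_qkv` and the sizes are hypothetical stand-ins for an attention layer that splits one projection into q, k and v.

```python
import torch
from torch import nn

dim, heads, dim_head = 32, 2, 16               # hypothetical sizes
to_qkv = nn.Linear(dim, heads * dim_head * 3, bias=False)
scale = dim_head ** -0.5

x = torch.randn(2, 5, dim)
q, k, v = to_qkv(x).chunk(3, dim=-1)           # three views of one tensor

# q *= scale would write into a view in place, which autograd can reject
q = q * scale                                  # out-of-place: allocates a new tensor

sim = q @ k.transpose(-1, -2)                  # attention logits
sim.sum().backward()                           # gradients flow without error
print(sim.shape)
```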

opened by aarti9 0

Fine-tune with new datasets

Thank you so much for your great effort. I can run predictions using the given .py files, but I couldn't find a train.py file. How can I fine-tune the network on a new dataset? Where should I define the image samples of the new dataset?
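
A generic PyTorch training loop is one way to approach this. The sketch below is an assumption-laden illustration rather than a script from this repository: `StandInVideoClassifier` is a hypothetical placeholder with the same input and output shapes as the Usage example, and the random tensors stand in for a real dataset. In practice the TimeSformer instance and a DataLoader over labelled clips would take their place.

```python
import torch
from torch import nn

class StandInVideoClassifier(nn.Module):
    """Hypothetical placeholder: (batch, frames, channels, h, w) -> (batch, classes)."""
    def __init__(self, num_classes):
        super().__init__()
        self.head = nn.Linear(3, num_classes)

    def forward(self, video):
        # Average over frames, height and width, leaving one value per channel.
        return self.head(video.mean(dim=(1, 3, 4)))

model = StandInVideoClassifier(num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Random stand-in data; a real setup would iterate over a DataLoader.
videos = torch.randn(4, 8, 3, 32, 32)
labels = torch.randint(0, 10, (4,))

model.train()
for step in range(3):
    logits = model(videos)             # (4, 10)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```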

opened by Jeba-create 0

problem in timesformer_pytorch.py

Starting from line 182:

    video = rearrange(video, 'b f c (h p1) (w p2) -> b (f h w) (p1 p2 c)', p1 = p, p2 = p)

I think this should be:

    video = rearrange(video, 'b f c (hp p1) (wp p2) -> b (f hp wp) (p1 p2 c)', p1 = p, p2 = p)

opened by Weizhongjin 2

Imagenet Pretrained Weights

Thanks for the work! In their paper they say: "For all our experiments, we adopt the 'Base' ViT model architecture (Dosovitskiy et al., 2020) pretrained on ImageNet."

I know that you said the official weights trained on Kinetics and similar datasets have not been released yet. However, I am not interested in those; I need the initial weights of the network based only on ViT ImageNet pretraining, so that I can train this implementation starting from them. From what it looks like, you don't have ImageNet-pretrained weights for this implementation, do you?

opened by RaivoKoot 5

Releases(0.4.1)

0.4.1 (Aug 25, 2021)
0.4.0 (Aug 16, 2021)
0.3.3 (Jul 4, 2021)
0.3.2 (Apr 26, 2021)
0.3.1 (Apr 25, 2021)
0.2.1 (Apr 21, 2021)
0.1.1 (Mar 23, 2021)
0.1.0 (Mar 21, 2021)
0.0.5 (Mar 18, 2021)
0.0.4 (Feb 11, 2021)
0.0.3 (Feb 11, 2021)
0.0.2 (Feb 11, 2021)
0.0.1a (Feb 11, 2021)

Owner

Phil Wang

Working with Attention. It's all we need.

