TCPNet - Temporal-attentive-Covariance-Pooling-Networks-for-Video-Recognition

Overview

Temporal-attentive-Covariance-Pooling-Networks-for-Video-Recognition

This is an implementation of TCPNet.

arch

Introduction

For video recognition task, a global representation summarizing the whole contents of the video snippets plays an important role for the final performance. However, existing video architectures usually generate it by using a simple, global average pooling (GAP) method, which has limited ability to capture complex dynamics of videos. For image recognition task, there exist evidences showing that covariance pooling has stronger representation ability than GAP. Unfortunately, such plain covariance pooling used in image recognition is an orderless representative, which cannot model spatio-temporal structure inherent in videos. Therefore, this paper proposes a Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifi- cally, our TCP first develops a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximatively producing attentive covariance representations. Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to char- acterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit geometry of covariance representations. Note that our TCP is model-agnostic and can be flexibly integrated into any video architectures, resulting in TCPNet for effective video recognition. The extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show our TCPNet is clearly superior to its counterparts, while having strong generalization ability.

Citation

@InProceedings{Gao_2021_TCP,
                author = {Zilin, Gao and Qilong, Wang and Bingbing, Zhang and Qinghua, Hu and Peihua, Li},
                title = {Temporal-attentive Covariance Pooling Networks for Video Recognition},
                booktitle = {arxiv preprint axXiv:2021.06xxx},
                year = {2021}
  }

Model Zoo

Kinetics-400

Method Backbone frames 1 crop Acc (%) 30 views Acc (%) Model Pretrained Model test log
TCPNet TSN R50 8f 72.4/90.4 75.3/91.8 K400_TCP_TSN_R50_8f Img1K_R50_GCP log
TCPNet TEA R50 8f 73.9/91.6 76.8/92.9 K400_TCP_TEA_R50_8f Img1K_Res2Net50_GCP log
TCPNet TSN R152 8f 75.7/92.2 78.3/93.7 K400_TCP_TSN_R152_8f Img11K_1K_R152_GCP log
TCPNet TSN R50 16f 73.9/91.2 75.8/92.1 K400_TCP_TSN_R50_16f Img1K_R50_GCP log
TCPNet TEA R50 16f 75.3/92.2 77.2/93.1 K400_TCP_TEA_R50_16f Img1K_Res2Net50_GCP log
TCPNet TSN R152 16f 77.2/93.1 79.3/94.0 K400_TCP_TSN_R152_16f Img11K_1K_R152_GCP TODO

Mini-Kinetics-200

Method Backbone frames 1 crop Acc (%) 30 views Acc (%) Model Pretrained Model
TCPNet TSN R50 8f 78.7 80.7 K200_TCP_TSN_8f K400_TCP_TSN_R50_8f

Environments

pytorch v1.0+(for TCP_TSN); v1.0~1.4(for TCP+TEA)

ffmpeg

graphviz pip install graphviz

tensorboard pip install tensorboardX

tqdm pip install tqdm

scikit-learn conda install scikit-learn

matplotlib conda install -c conda-forge matplotlib

fvcore pip install 'git+https://github.com/facebookresearch/fvcore'

Dataset Preparation

We provide a detailed dataset preparation guideline for Kinetics-400 and Mini-Kinetics-200. See Dataset preparation.

StartUp

  1. download the pretrained model and put it in pretrained_models/
  2. execute the training script file e.g.: sh script/K400/train_TCP_TSN_8f_R50.sh
  3. execute the inference script file e.g.: sh script/K400/test_TCP_TSN_R50_8f.sh

TCP Code


├── ops
|    ├── TCP
|    |   ├── TCP_module.py
|    |   ├── TCP_att_module.py
|    |   ├── TSA.py
|    |   └── TCA.py
|    ├ ...
├ ...

Acknowledgement

  • We thank TSM for providing well-designed 2D action recognition toolbox.
  • We also refer to some functions from iSQRT, TEA and Non-local.
  • Mini-K200 dataset samplling strategy follows Mini_K200.
  • We would like to thank Facebook for developing pytorch toolbox.

Thanks for their work!

Owner
Zilin Gao
Zilin Gao
Over-the-Air Ensemble Inference with Model Privacy

Over-the-Air Ensemble Inference with Model Privacy This repository contains simulations for our private ensemble inference method. Installation Instal

Selim Firat Yilmaz 1 Jun 29, 2022
A Human-in-the-Loop workflow for creating HD images from text

A Human-in-the-Loop? workflow for creating HD images from text DALL·E Flow is an interactive workflow for generating high-definition images from text

Jina AI 2.5k Jan 02, 2023
Learning What and Where to Draw

###Learning What and Where to Draw Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, Honglak Lee This is the code for our NIPS 201

Scott Ellison Reed 337 Nov 18, 2022
This is a work in progress reimplementation of Instant Neural Graphics Primitives

Neural Hash Encoding This is a work in progress reimplementation of Instant Neural Graphics Primitives Currently this can train an implicit representa

Penn 79 Sep 01, 2022
Learning Multiresolution Matrix Factorization and its Wavelet Networks on Graphs

Project Learning Multiresolution Matrix Factorization and its Wavelet Networks on Graphs, https://arxiv.org/pdf/2111.01940.pdf. Authors Truong Son Hy

5 Jun 28, 2022
This implementation contains the application of GPlearn's symbolic transformer on a commodity futures sector of the financial market.

GPlearn_finiance_stock_futures_extension This implementation contains the application of GPlearn's symbolic transformer on a commodity futures sector

Chengwei <a href=[email protected]"> 189 Dec 25, 2022
Trading and Backtesting environment for training reinforcement learning agent or simple rule base algo.

TradingGym TradingGym is a toolkit for training and backtesting the reinforcement learning algorithms. This was inspired by OpenAI Gym and imitated th

Yvictor 1.1k Jan 02, 2023
BASH - Biomechanical Animated Skinned Human

We developed a method animating a statistical 3D human model for biomechanical analysis to increase accessibility for non-experts, like patients, athletes, or designers.

Machine Learning and Data Analytics Lab FAU 66 Nov 19, 2022
The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

PRIMER The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. PRIMER is a pre-trained model for mu

AI2 114 Jan 06, 2023
Contains source code for the winning solution of the xView3 challenge

Winning Solution for xView3 Challenge This repository contains source code and pretrained models for my (Eugene Khvedchenya) solution to xView 3 Chall

Eugene Khvedchenya 51 Dec 30, 2022
Histocartography is a framework bringing together AI and Digital Pathology

Documentation | Paper Welcome to the histocartography repository! histocartography is a python-based library designed to facilitate the development of

155 Nov 23, 2022
Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving

SalsaNext: Fast, Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving Abstract In this paper, we introduce SalsaNext f

308 Jan 04, 2023
R3Det based on mmdet 2.19.0

R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object Installation # install mmdetection first if you haven't installed it

SJTU-Thinklab-Det 38 Dec 15, 2022
Long Expressive Memory (LEM)

Long Expressive Memory for Sequence Modeling This repository contains the implementation to reproduce the numerical experiments of the paper Long Expr

Konstantin Rusch 47 Dec 17, 2022
Chainer Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

fcn - Fully Convolutional Networks Chainer implementation of Fully Convolutional Networks. Installation pip install fcn Inference Inference is done as

Kentaro Wada 218 Oct 27, 2022
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

Vision Longformer This project provides the source code for the vision longformer paper. Multi-Scale Vision Longformer: A New Vision Transformer for H

Microsoft 209 Dec 30, 2022
RCDNet: A Model-driven Deep Neural Network for Single Image Rain Removal (CVPR2020)

RCDNet: A Model-driven Deep Neural Network for Single Image Rain Removal (CVPR2020) Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng [PDF] [Supplementary M

Hong Wang 6 Sep 27, 2022
An implementation for `Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction`

Text2Event An implementation for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction Please contact Yaojie Lu (@

Roger 153 Jan 07, 2023
Code for "Diffusion is All You Need for Learning on Surfaces"

Source code for "Diffusion is All You Need for Learning on Surfaces", by Nicholas Sharp Souhaib Attaiki Keenan Crane Maks Ovsjanikov NOTE: the linked

Nick Sharp 247 Dec 28, 2022
Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit

STORM Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit [Install Instructions] [Paper] [Website] This package contains code

NVIDIA Research Projects 101 Dec 12, 2022