UMT is a unified and flexible framework that handles different combinations of input modalities and outputs video moment retrieval and/or highlight detection results.

Last update: Jan 04, 2023

Overview

Unified Multi-modal Transformers

This repository maintains the official implementation of the paper UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection by Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie, which has been accepted by CVPR 2022.

Installation

The environment we use is listed below. If automatic installation fails, you may install these packages manually.

CUDA 11.5.0
CUDNN 8.3.2.44
Python 3.10.0
PyTorch 1.11.0
NNCore 0.3.6
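
To compare an existing environment against the versions listed above, something like the following may help. It is a minimal sketch and not part of the repository: `EXPECTED` and `check_versions` are hypothetical names, and the distribution names `torch` and `nncore` are assumptions. CUDA and cuDNN are not pip packages, so they are not checked here.

```python
from importlib import metadata

# Versions from the list above; the distribution names are assumptions.
EXPECTED = {"torch": "1.11.0", "nncore": "0.3.6"}


def check_versions(expected, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu115" are ignored for the comparison.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones the authors report using.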

Install from source

Clone the repository from GitHub.

git clone https://github.com/TencentARC/UMT.git
cd UMT

Install dependencies.

pip install -r requirements.txt

Getting Started

Download and prepare the datasets

Download and extract the datasets.

Prepare the files in the following structure.

UMT
├── configs
├── datasets
├── models
├── tools
├── data
│   ├── qvhighlights
│   │   ├── *features
│   │   ├── highlight_{train,val,test}_release.jsonl
│   │   └── subs_train.jsonl
│   ├── charades
│   │   ├── *features
│   │   └── charades_sta_{train,test}.txt
│   ├── youtube
│   │   ├── *features
│   │   └── youtube_anno.json
│   └── tvsum
│       ├── *features
│       └── tvsum_anno.json
├── README.md
├── setup.cfg
└── ···
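
Before training, it can be useful to confirm that the `data` directory matches the structure above. The sketch below is not part of the repository; `REQUIRED_FILES`, `DATASETS`, and `find_missing` are hypothetical names, and it assumes that `*features` means at least one directory whose name ends in `features`.

```python
from pathlib import Path

# Annotation files named in the tree above, relative to the data/ directory.
REQUIRED_FILES = [
    "qvhighlights/highlight_train_release.jsonl",
    "qvhighlights/highlight_val_release.jsonl",
    "qvhighlights/highlight_test_release.jsonl",
    "qvhighlights/subs_train.jsonl",
    "charades/charades_sta_train.txt",
    "charades/charades_sta_test.txt",
    "youtube/youtube_anno.json",
    "tvsum/tvsum_anno.json",
]
DATASETS = ["qvhighlights", "charades", "youtube", "tvsum"]


def find_missing(data_root):
    """List the expected entries that are absent under data_root."""
    root = Path(data_root)
    missing = [f for f in REQUIRED_FILES if not (root / f).is_file()]
    for name in DATASETS:
        # Assumption: "*features" stands for one or more feature directories.
        has_features = any(p.is_dir() for p in (root / name).glob("*features"))
        if not has_features:
            missing.append(f"{name}/*features")
    return missing


if __name__ == "__main__":
    for entry in find_missing("data"):
        print("missing:", entry)
```

Run it from the repository root. If you only work with one dataset, entries reported for the other datasets can be ignored.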

Train a model

Run the following command to train a model using a specified config.

# Single GPU
python tools/launch.py ${path-to-config}

# Multiple GPUs
torchrun --nproc_per_node=${num-gpus} tools/launch.py ${path-to-config}

Test a model and evaluate results

Run the following command to test a model and evaluate results.

python tools/launch.py ${path-to-config} --checkpoint ${path-to-checkpoint} --eval

Pre-train with ASR captions on QVHighlights

Run the following command to pre-train a model using ASR captions on QVHighlights.

torchrun --nproc_per_node=4 tools/launch.py configs/qvhighlights/umt_base_pretrain_100e_asr.py
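
The commands above differ only in the launcher and in a few flags, so they can be assembled programmatically when scripting several runs. The helper below is a sketch under that assumption; `build_launch_command` is a hypothetical name, and it uses only the flags shown above (`--nproc_per_node`, `--checkpoint`, `--eval`). Combining `torchrun` with `--eval` is not shown in this document and is untested here.

```python
import shlex


def build_launch_command(config, num_gpus=1, checkpoint=None, evaluate=False):
    """Assemble the argument list for tools/launch.py as used above."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        cmd = ["python", "tools/launch.py", config]
    else:
        cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "tools/launch.py", config]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    if evaluate:
        cmd.append("--eval")
    return cmd


if __name__ == "__main__":
    cmd = build_launch_command(
        "configs/qvhighlights/umt_base_pretrain_100e_asr.py", num_gpus=4
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.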

Model Zoo

We provide multiple pre-trained models and training logs here. All the models are trained with a single NVIDIA Tesla V100-FHHL-16GB GPU and are evaluated using the default metrics of the datasets.

Dataset	Model	Type	MR mAP		HD mAP		Download
			R1@0.5	R1@0.7	R5@0.5	R5@0.7	
QVHighlights	UMT-B	—	38.59		39.85		model | metrics
QVHighlights	UMT-B	w/ PT	39.26		40.10		model | metrics
Charades-STA	UMT-B	V + A	48.31	29.25	88.79	56.08	model | metrics
Charades-STA	UMT-B	V + O	49.35	26.16	89.41	54.95	model | metrics
YouTube Highlights	UMT-S	Dog	—		65.93		model | metrics
YouTube Highlights	UMT-S	Gymnastics	—		75.20		model | metrics
YouTube Highlights	UMT-S	Parkour	—		81.64		model | metrics
YouTube Highlights	UMT-S	Skating	—		71.81		model | metrics
YouTube Highlights	UMT-S	Skiing	—		72.27		model | metrics
YouTube Highlights	UMT-S	Surfing	—		82.71		model | metrics
TVSum	UMT-S	VT	—		87.54		model | metrics
TVSum	UMT-S	VU	—		81.51		model | metrics
TVSum	UMT-S	GA	—		88.22		model | metrics
TVSum	UMT-S	MS	—		78.81		model | metrics
TVSum	UMT-S	PK	—		81.42		model | metrics
TVSum	UMT-S	PR	—		86.96		model | metrics
TVSum	UMT-S	FM	—		75.96		model | metrics
TVSum	UMT-S	BK	—		86.89		model | metrics
TVSum	UMT-S	BT	—		84.42		model | metrics
TVSum	UMT-S	DS	—		79.63		model | metrics

Here, w/ PT means initializing the model using pre-trained weights on ASR captions. V, A, and O indicate video, audio, and optical flow, respectively.

Citation

If you find this project useful for your research, please kindly cite our paper.

@inproceedings{liu2022umt,
  title={UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection},
  author={Liu, Ye and Li, Siyuan and Wu, Yang and Chen, Chang Wen and Shan, Ying and Qie, Xiaohu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Owner

Applied Research Center (ARC), Tencent PCG
