Code for the paper "Controllable Video Captioning with an Exemplar Sentence"

Last update: Dec 04, 2022

SMCG

Introduction

We investigate a novel and challenging task: controllable video captioning with an exemplar sentence. Given a video and a syntactically valid exemplar sentence, the goal is to generate a caption that both describes the semantic content of the video and follows the syntactic form of the exemplar. To tackle this exemplar-based video captioning task, we propose a Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.

Dependency

python 2.7.2
torch 1.1.0
java openjdk version "10.0.2" 2018-07-17
StanfordCoreNLP

Download Features and Preprocess Data

For the MSRVTT dataset, please download the following files into the './msrvtt/msrvtt_data/' folder:

MSRVTT caption info: videodatainfo_2016.json,
MSRVTT captions and their sentence parse trees: msrvtt_all_sentence_parse_dict.pkl,
Collected exemplar sentences and their parse trees: coco_filter_parse_dict.pkl,
Video features: msrvtt_incepRes_rgb_feats.hdf5,
Glove word embeddings: glove.840B.300d.zip.

For the ActivityNet Captions dataset, please download the following files into the './activitynet/activitynet_data/' folder:

ActivityNet caption info: CAP.pkl,
ActivityNet captions and their sentence parse trees: anet_parse_dict.pkl,
Collected exemplar sentences and their parse trees: coco_filter_parse_dict.pkl,
Video features: anet_new_inception_resnet_feats.hdf5,
Glove word embeddings: glove.840B.300d.zip.
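
Before running the preprocessing scripts, it can help to confirm that every file listed above is in place. The following is a minimal sketch, not part of the released code: the file and folder names are copied from the two lists above, while the helper name `find_missing_files` and its `root` parameter are hypothetical. It only checks that the files exist; whether the GloVe archive must be unzipped first is not stated here, so the sketch looks for the .zip file as listed.

```python
import os

# File names copied from the download lists above, grouped by target folder.
EXPECTED_FILES = {
    "./msrvtt/msrvtt_data/": [
        "videodatainfo_2016.json",
        "msrvtt_all_sentence_parse_dict.pkl",
        "coco_filter_parse_dict.pkl",
        "msrvtt_incepRes_rgb_feats.hdf5",
        "glove.840B.300d.zip",
    ],
    "./activitynet/activitynet_data/": [
        "CAP.pkl",
        "anet_parse_dict.pkl",
        "coco_filter_parse_dict.pkl",
        "anet_new_inception_resnet_feats.hdf5",
        "glove.840B.300d.zip",
    ],
}


def find_missing_files(expected=EXPECTED_FILES, root="."):
    """Return the expected data files that are not present under `root`.

    `root` is a hypothetical parameter: the repository checkout directory.
    """
    missing = []
    for folder, names in sorted(expected.items()):
        for name in names:
            path = os.path.join(root, folder, name)
            if not os.path.isfile(path):
                missing.append(os.path.normpath(path))
    return missing


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected data files are present.")
```

Run it from the repository root. The sketch uses only the standard library and avoids syntax newer than Python 2.7, so it should work with the interpreter listed under Dependency.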

Data Preprocessing

Go to the './msrvtt/process_msrvtt_data/' folder, and run:

python prepro_vocab_parse_pos.py
python fill_template.py

Go to the './activitynet/process_activitynet_data/' folder, and run:

python prepro_anetcoco_vocab_pos_parse.py
python fill_template.py

Model Training and Testing

For the MSRVTT dataset, please go to the './msrvtt/src/' folder, and train the model by:

python train.py --gpu xx

For model inference and evaluation, run:

bash eval.sh 
bash control.sh

Note: 'eval.sh' evaluates the generated exemplar-based captions with conventional captioning metrics. 'control.sh' compares the generated captions with the provided exemplar sentences syntactically, i.e., it computes the edit distance between their parse trees.

For the ActivityNet Captions dataset, please go to the './activitynet/src/' folder, and train/test the model in the same way as for the MSRVTT dataset.
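
To give an intuition for the syntactic comparison that 'control.sh' is described as performing, here is a simplified sketch. It is an assumption-laden stand-in, not the repository's implementation: how 'control.sh' actually computes the distance is not shown here, and it may use a true tree edit distance. The sketch instead strips the words from a bracketed parse string, keeps only the brackets and constituent/POS labels, and computes a token-level Levenshtein distance between the two resulting sequences. All function names are hypothetical.

```python
def tokenize_parse(parse_string):
    """Split a bracketed parse such as '(NP (DT a) (NN man))' into tokens."""
    return parse_string.replace("(", " ( ").replace(")", " ) ").split()


def syntactic_template(parse_string):
    """Keep brackets and labels only; drop the words at the leaves."""
    tokens = tokenize_parse(parse_string)
    kept = []
    for i, tok in enumerate(tokens):
        is_bracket = tok in ("(", ")")
        # A label is the token that directly follows an opening bracket.
        is_label = i > 0 and tokens[i - 1] == "("
        if is_bracket or is_label:
            kept.append(tok)
    return kept


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def syntax_distance(parse_a, parse_b):
    """Simplified syntactic distance: Levenshtein over parse templates."""
    return edit_distance(syntactic_template(parse_a),
                         syntactic_template(parse_b))


if __name__ == "__main__":
    exemplar = "(NP (DT a) (JJ old) (NN man))"
    caption = "(NP (DT the) (NN dog))"
    print(syntax_distance(exemplar, caption))
```

Under this simplification, two sentences with different words but identical parse structure have distance 0, and a lower distance means the generated caption follows the exemplar's syntax more closely.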

Citation

@inproceedings{yuan2020Control,
  title={Controllable Video Captioning with an Exemplar Sentence},
  author={Yuan, Yitian and Ma, Lin and Wang, Jingwen and Zhu, Wenwu},
  booktitle={the 28th ACM International Conference on Multimedia (MM ’20)},
  year={2020}
}
