Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)

Last update: Nov 21, 2022

Overview

This repository is the PyTorch implementation of the paper:

Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)

We additionally include evaluation code from Luo et al. in the folder GoogleConceptualCaptioning, which has been patched for compatibility.

Requirements

The code was developed with Python 3.6.10 and CUDA 9.0.

Requirements:

torch 1.1.0
torchvision 0.3.0
nltk 3.5
inflect 4.1.0
tqdm 4.46.0
sklearn 0.0
h5py 2.10.0

To install requirements:

conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>

Preprocessed data

The dataset used in this project for assessing accuracy and diversity is COCO 2014 (m-RNN split). The full dataset is available here.

We use Faster R-CNN features for images, similar to Anderson et al. We additionally require the "classes"/"scores" fields detected for image regions. The classes correspond to Visual Genome.

Download instructions

Preprocessed training data is available here as hdf5 files. The provided hdf5 files contain the following fields:

image_id: ID of the COCO image
num_boxes: Number of proposal regions detected by Faster R-CNN
features: ResNet-101 features of the extracted regions
classes: Visual Genome classes of the extracted regions
scores: Scores of the Visual Genome classes of the extracted regions

Note that the ["image_id","num_boxes","features"] fields are identical to Anderson et al.
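As a quick sanity check after download, the field list above can be compared against what a file actually contains. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the five fields are stored as top-level keys of the hdf5 file, which the description above does not confirm.

```python
# Field names are copied from the list above.
# ASSUMPTION: they are top-level keys of the hdf5 file.
EXPECTED_FIELDS = ("image_id", "num_boxes", "features", "classes", "scores")


def missing_fields(container, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent from `container`.

    `container` can be an open h5py.File or any mapping that supports `in`.
    """
    return [name for name in expected if name not in container]


def check_h5_file(path):
    """Open an hdf5 file read-only and list the expected fields it lacks."""
    import h5py  # imported lazily so missing_fields() works without h5py

    with h5py.File(path, "r") as h5_file:
        return missing_fields(h5_file)


# Example call (path is illustrative; adjust to your download location):
# check_h5_file("coco/coco_val_2014_adaptive_withclasses.h5")
```

An empty list means all five fields were found; any returned names point to a file with a different layout or an incomplete download.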

Create a folder named coco and download the preprocessed training and test datasets listed below from the coco folder in the drive link above (alternatively, download the entire coco folder directly from the drive link):

Download the following files for training on COCO 2014 (m-RNN split):

coco/coco_train_2014_adaptive_withclasses.h5
coco/coco_val_2014_adaptive_withclasses.h5
coco/coco_val_mRNN.txt
coco/coco_test_mRNN.txt

Download the following files for training on held-out COCO (novel object captioning):

coco/coco_train_2014_noc_adaptive_withclasses.h5
coco/coco_train_extra_2014_noc_adaptive_withclasses.h5

Download the following files for testing on held-out COCO (novel object captioning):

coco/coco_test_2014_noc_adaptive_withclasses.h5

Download the (caption) annotation files and place them in a subdirectory coco/annotations (mirroring the Google Drive folder structure):

coco/annotations/captions_train2014.json
coco/annotations/captions_val2014.json

Download the following files from the drive link into a separate folder data (outside coco). These files contain the contextual neighbours for pseudo supervision:

data/nn_final.pkl
data/nn_noc.pkl

To run the train/test scripts (described below), "pathToData"/"nn_dict_path" in params.json and params_noc.json need to be set to the coco/data folders created above.
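The two path entries can be edited by hand or set programmatically. Below is a minimal sketch, not part of the repository: the helper name is hypothetical, and it assumes that "pathToData" and "nn_dict_path" are top-level keys of the JSON file, that "pathToData" refers to the coco folder, and that "nn_dict_path" refers to the data folder. Check the shipped params.json before relying on any of these assumptions.

```python
import json
from pathlib import Path


def set_data_paths(params_file, path_to_data, nn_dict_path):
    """Update the two path entries in a params file and write it back.

    ASSUMPTION: both keys live at the top level of the JSON object.
    All other entries are left untouched.
    """
    params_file = Path(params_file)
    params = json.loads(params_file.read_text())
    params["pathToData"] = str(path_to_data)
    params["nn_dict_path"] = str(nn_dict_path)
    params_file.write_text(json.dumps(params, indent=2))
    return params


# Example calls (paths are placeholders):
# set_data_paths("params.json", "/path/to/coco", "/path/to/data")
# set_data_paths("params_noc.json", "/path/to/coco", "/path/to/data")
```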

Verify Folder Structure after Download

The folder structure of coco and data after data download should be as follows:

coco
 - annotations
   - captions_train2014.json
   - captions_val2014.json
 - coco_val_mRNN.txt
 - coco_test_mRNN.txt
 - coco_train_2014_adaptive_withclasses.h5
 - coco_val_2014_adaptive_withclasses.h5
 - coco_train_2014_noc_adaptive_withclasses.h5
 - coco_train_extra_2014_noc_adaptive_withclasses.h5
 - coco_test_2014_noc_adaptive_withclasses.h5
data
 - coco_classname.txt
 - visual_genome_classes.txt
 - vocab_coco_full.pkl
 - nn_final.pkl
 - nn_noc.pkl
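The listing above can also be checked automatically. The following is a minimal sketch, not part of the repository: the helper name is hypothetical, and it assumes that coco and data sit side by side under a common root directory.

```python
from pathlib import Path

# Relative paths copied from the folder listing above.
EXPECTED_FILES = [
    "coco/annotations/captions_train2014.json",
    "coco/annotations/captions_val2014.json",
    "coco/coco_val_mRNN.txt",
    "coco/coco_test_mRNN.txt",
    "coco/coco_train_2014_adaptive_withclasses.h5",
    "coco/coco_val_2014_adaptive_withclasses.h5",
    "coco/coco_train_2014_noc_adaptive_withclasses.h5",
    "coco/coco_train_extra_2014_noc_adaptive_withclasses.h5",
    "coco/coco_test_2014_noc_adaptive_withclasses.h5",
    "data/coco_classname.txt",
    "data/visual_genome_classes.txt",
    "data/vocab_coco_full.pkl",
    "data/nn_final.pkl",
    "data/nn_noc.pkl",
]


def find_missing(root="."):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Example call from the directory that contains coco and data:
# print(find_missing("."))
```

An empty result means every file in the listing is present; the check does not validate file contents.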

Training

Follow these instructions for training:

Set hyperparameters for training in params.json and params_noc.json.
Train a model on COCO 2014 for captioning:

   	python ./scripts/train.py

Train a model for diverse novel object captioning:

   	python ./scripts/train_noc.py

Please note that the data folder provides the required vocabulary.

Memory requirements

The models were trained on a single NVIDIA V100 GPU with 32 GB memory. 16 GB is sufficient for training a single run.

Pre-trained models and evaluation

We provide pre-trained models for both captioning on COCO 2014 (m-RNN split) and novel object captioning. Follow these steps:

Download the pre-trained models from here to the ckpts folder.
For evaluation of oracle scores and diversity, we follow Luo et al. In the folder GoogleConceptualCaptioning, download cider, and in the cococaption folder run the download scripts:

   	./GoogleConceptualCaptioning/cococaption/get_google_word2vec_model.sh
   	./GoogleConceptualCaptioning/cococaption/get_stanford_models.sh
   	python ./scripts/eval.py

For diversity evaluation, create the required numpy file for consensus re-ranking using:

   	python ./scripts/eval_diversity.py

For consensus re-ranking, follow the steps here. To obtain the final diversity scores, follow the instructions of DiversityMetrics: convert the numpy file to the required json format and run the script evalscripts.py.

To evaluate the F1 score for novel object captioning:

   	python ./scripts/eval_noc.py

Results

Oracle evaluation on the COCO dataset

	B4	B3	B2	B1	CIDEr	METEOR	ROUGE	SPICE
COS-CVAE	0.633	0.739	0.842	0.942	1.893	0.450	0.770	0.339

Diversity evaluation on the COCO dataset

	Unique	Novel	mBLEU	Div-1	Div-2
COS-CVAE	96.3	4404	0.53	0.39	0.57
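For readers unfamiliar with the Div-1 and Div-2 columns: Div-n is commonly defined as the number of distinct n-grams in the captions generated for an image divided by the total number of words in those captions. The sketch below illustrates that definition only. It is not the DiversityMetrics script used for the reported numbers, and the function name and whitespace tokenization are assumptions.

```python
def div_n(captions, n):
    """Ratio of distinct n-grams to the total word count of a caption set.

    ASSUMPTION: captions are lower-cased and split on whitespace; the
    official evaluation scripts may tokenize differently.
    """
    distinct = set()
    total_words = 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        # Collect every contiguous n-gram of this caption.
        distinct.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(distinct) / total_words if total_words else 0.0


# Toy example with two captions for one image.
captions = ["a dog runs", "a dog sits"]
div_1 = div_n(captions, 1)  # 4 distinct unigrams over 6 words
div_2 = div_n(captions, 2)  # 3 distinct bigrams over 6 words
```

Higher values indicate less repetition across the sampled captions.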

F1-score evaluation on the held-out COCO dataset

	bottle	bus	couch	microwave	pizza	racket	suitcase	zebra	average
COS-CVAE	35.4	83.6	53.8	63.2	86.7	69.5	46.1	81.7	65.0
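The average column matches the unweighted mean of the eight per-class F1 scores in the table, which can be verified with a few lines (the variable names are illustrative):

```python
# Per-class F1 scores copied from the table above.
f1_scores = {
    "bottle": 35.4, "bus": 83.6, "couch": 53.8, "microwave": 63.2,
    "pizza": 86.7, "racket": 69.5, "suitcase": 46.1, "zebra": 81.7,
}

# Unweighted mean over the eight held-out classes.
average_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(average_f1, 1))  # prints 65.0
```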

Bibtex

@inproceedings{coscvae20neurips,
  title     = {Diverse Image Captioning with Context-Object Split Latent Spaces},
  author    = {Mahajan, Shweta and Roth, Stefan},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2020}
}


Owner

Visual Inference Lab @TU Darmstadt
