Code and data for ImageCoDe, a contextual vison-and-language benchmark

Last update: Dec 02, 2022

Related tags

Overview

ImageCoDe

This repository contains code and data for ImageCoDe: Image Retrieval from Contextual Descriptions.

Data

All collected descriptions for the training and validation set are under data/train_data.json and data/valid_data.json.

Image sets can be downloaded on Zenodo or GoogleDrive and should be unzipped in data/.

You can download from the commandline via:

wget https://zenodo.org/record/6518944/files/image-sets.zip

For ViLBERT experiments, you need to download a pretrained ViLBERT checkpoint from volta here, simply by clicking on ViLBERT in the table. Save the downloaded file as baselines/vilbert/vilbert-pretrained.bin. Since ViLBERT uses image features from Faster R-CNN, you also have to downloaded these for all ImageCoDe images here: Google Drive link. Save the file as data/rcnn-features36-36.lmdb. The same procedure applies for UNITER.

The format for data/train_data.json looks like this:

{
  "MSR-VTT-videoTrainValVideo_video2044-shot1_0": {
    "6": "a mom holding her babies in the middle of the picture, no other image intervenes with the image.",
    "7": "The image is fading between a woman holding a baby and a woman sitting with a red background. The hands of the woman sitting aren't visible."
  },
  "video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2": {
  "..."
  }
}

And the images under data/ have the following structure. Each folder contains 10 images. If the images are video frames, the number X in imgX.jpg indicates the frame number:

  .
  ├── MSR-VTT-videoTrainValVideo_video2044-shot1_0
      │   ├── img0.jpg
      │   ├── img7.jpg
      │   ├── ...
  ├── video-storytelling-videochristmas_56Nm66j-i5Q-shot14_2
      │   ├── ...

Leaderboard

Based on this you can train your model and test on the unlabeled test set:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    "The team name on shirt is visible without a number, but all letters can be seen for team name.",
    "the player can be seen with him on the left close to the logo on the pitch on the right and can be clearly seen"
  ],
  "...":
  ["..."]
}

In order to appear on the leaderboard, please format your results in the following format:

{
  "MSR-VTT-videoTestVideo_video7763-shot2_1": [
    1,
    2
  ],
  "...":
  ["..."]
}

Where the example here with "1" and "2" represent image indices ranging from 0 to 9. You can submit to the leaderboard by sending your test set file (or a download link) to [email protected] and we will update the leaderboard quickly (max. 1-2 days). The leaderboard is maintained on the project website and might change its submission procedure at some point.

Installations

Run install.sh for running CLIP experiments. For VilBERT follow the instructions for volta.

Code

Code for CLIP is under baselines/clip and and code for ViLBERT/UNITER is under baselines/crossencoders.

For details commands to run each model variant shown in the paper, have a look at the README in baselines.

For example to train the best performing model CLIP+TemporalEmbeddings, run:

python3 contextual.py --lr 2e-6 --lr_head 1e-4 -b 36 -m ViT-B/16 --fusion mult -a gelu --logit_scale 1000 --finetuned_checkpoint_path checkpoints/CONTRA_clip_best__36_4e-06_30_1395526.pt --add_input --frozen_clip --positional

Data Analysis

Our manual annotation of various phenomena (negation, nuances, ...) in our validation set can be found under data/manual_annotation_valid.yaml

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you want to cite our paper, please use:

@inproceedings{krojer_contextual_2022,
  address = {Online},
  title = {Image Retrieval from Contextual Descriptions},
  booktitle = {Proceedings of the 60th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics},
  publisher = {Association for Computational Linguistics},
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  month = may,
  year = {2022},
}

Acknowledgement

Our data (specifically the image sets) are built upon 3 video dataset and Open Images:

We also the volta repository for ViLBERT and UNITER baseline variants

For questions or feedback, don't hesitate to contact the author: [email protected]

Code and data for ImageCoDe, a contextual vison-and-language benchmark

Related tags

Overview

ImageCoDe

Data

Leaderboard

Installations

Code

Data Analysis

License

Acknowledgement

Owner

McGill NLP

A PyTorch Implementation of "Neural Arithmetic Logic Units"

rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learnings benchmarks.

wgan, wgan2(improved, gp), infogan, and dcgan implementation in lasagne, keras, pytorch

Notspot robot simulation - Python version

Jupyter notebooks showing best practices for using cx_Oracle, the Python DB API for Oracle Database

Code for CVPR 2021 paper: Anchor-Free Person Search

A curated list of automated deep learning (including neural architecture search and hyper-parameter optimization) resources.

PyTorch-Multi-Style-Transfer - Neural Style and MSG-Net

Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

A Strong Baseline for Image Semantic Segmentation

An exploration of log domain "alternative floating point" for hardware ML/AI accelerators.

This code provides a PyTorch implementation for OTTER (Optimal Transport distillation for Efficient zero-shot Recognition), as described in the paper.

Code for the paper "SmoothMix: Training Confidence-calibrated Smoothed Classifiers for Certified Robustness" (NeurIPS 2021)

Semi-supevised Semantic Segmentation with High- and Low-level Consistency

Code for "Adversarial Training for a Hybrid Approach to Aspect-Based Sentiment Analysis

Tooling for the Common Objects In 3D dataset.

[ICCV 2021 Oral] NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

Planner_backend - Academic planner application designed for students and counselors.

code for the ICLR'22 paper: On Robust Prefix-Tuning for Text Classification

Implementation of "Deep Implicit Templates for 3D Shape Representation"