A unified framework to jointly model images, text, and human attention traces.

Overview

connect-caption-and-trace

This repository contains the reference code for our paper "Connecting What to Say With Where to Look by Modeling Human Attention Traces" (CVPR 2021).


Requirements

  • Python 3
  • PyTorch 1.5+ (along with torchvision)
  • coco-caption (Remember to follow initialization steps in coco-caption/README.md)

Prepare data

Our experiments cover all four datasets included in Localized Narratives: COCO2017, Flickr30k, Open Images and ADE20k. For each dataset, we need five things:

  (1) A json file containing image info and word tokens (DATASET_LN.json)
  (2) An h5 file containing caption labels (DATASET_LN_label.h5)
  (3) The trace labels extracted from Localized Narratives (DATASET_LN_trace_box/)
  (4) A json file for coco-caption evaluation (captions_DATASET_LN_test.json)
  (5) Image features (with bounding boxes) extracted by a Mask-RCNN pretrained on Visual Genome

You can download (1)-(4) from here. Make a folder named data and put (1)-(3) in it, then put (4) under coco-caption/annotations/.
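Before training, it can help to confirm that items (1)-(4) are where the commands expect them. The following is a minimal sketch, not part of this repository: the helper names expected_paths and missing_paths are hypothetical, the dataset prefixes (coco, flk30k, openimg, ade20k) are taken from the training commands in this README, and the annotation folder is assumed to be coco-caption/annotations.

```python
from pathlib import Path


def expected_paths(dataset, data_dir="data", ann_dir="coco-caption/annotations"):
    """Return the expected locations of items (1)-(4) for one dataset prefix.

    `dataset` is a prefix such as "coco"; the directory defaults are assumptions
    based on the layout described in this README.
    """
    data = Path(data_dir)
    return {
        "info_json": data / f"{dataset}_LN.json",               # item (1)
        "label_h5": data / f"{dataset}_LN_label.h5",            # item (2)
        "trace_dir": data / f"{dataset}_LN_trace_box",          # item (3)
        "eval_json": Path(ann_dir) / f"captions_{dataset}_LN_test.json",  # item (4)
    }


def missing_paths(dataset, **kwargs):
    """List the names of the items whose path does not exist on disk."""
    return [name for name, path in expected_paths(dataset, **kwargs).items()
            if not path.exists()]


if __name__ == "__main__":
    for prefix in ("coco", "flk30k", "openimg", "ade20k"):
        print(prefix, missing_paths(prefix) or "ok")
```

Run it from the repository root; any names it prints for a dataset are the items still to be downloaded or moved.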

To get (5), you can use Detectron2. First, install Detectron2, then follow Prepare COCO-style annotations for Visual Genome (we use the pre-trained ResNet101-C4 model provided there). After that, you can use tools/extract_feats.py in Detectron2 to extract features. Finally, run scripts/prepare_feats_boxes_from_npz.py in this repo to place the features and bounding boxes in separate folders for training.

For the COCO dataset, you can also directly use the features provided by Peter Anderson here. The performance is almost the same (a difference of around 0.2%).

Training

The --dataset_choice can be any of the four datasets (coco, flk30k, openimg or ade20k). The --task can be trace, caption, c_joint_t or pred_both. The --eval_task can be trace, caption or pred_both.

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 0 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

Open Images: training to generate caption and trace at the same time (N=1 layer, evaluated on predicting both)

python tools/train.py --language_eval 0 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg

Flickr30k: training of controlled caption generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_flk30k  --caption_model transformer --input_json data/flk30k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/flk30k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/flk30k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task caption --eval_task caption --dataset_choice=flk30k

ADE20k: training of controlled trace generation alone (N=1 layer)

python tools/train.py --language_eval 0 --id transformer_LN_ade20k  --caption_model transformer --input_json data/ade20k_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/ade20k_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/ade20k_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 1 --task trace --eval_task trace --dataset_choice=ade20k
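The four training commands above differ only in the dataset prefix, --num_layers, --task and --eval_task. The sketch below assembles the same argument list from those values. It is an illustration, not part of this repository: build_train_command and to_shell are hypothetical helper names, and the two feature directories are the same placeholders used in the commands above.

```python
import shlex

DATASETS = ("coco", "flk30k", "openimg", "ade20k")
TASKS = ("trace", "caption", "c_joint_t", "pred_both")
EVAL_TASKS = ("trace", "caption", "pred_both")


def build_train_command(dataset, task, eval_task, num_layers=1,
                        att_dir="Dir_to_image_features_vg",
                        box_dir="Dir_to_bounding_boxes_vg"):
    """Assemble the tools/train.py argument list for one dataset/task pair."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if eval_task not in EVAL_TASKS:
        raise ValueError(f"unknown eval_task: {eval_task}")
    return [
        "python", "tools/train.py",
        "--language_eval", "0",
        "--id", f"transformer_LN_{dataset}",
        "--caption_model", "transformer",
        "--input_json", f"data/{dataset}_LN.json",
        "--input_att_dir", att_dir,          # placeholder: image feature folder
        "--input_box_dir", box_dir,          # placeholder: bounding box folder
        "--input_label_h5", f"data/{dataset}_LN_label.h5",
        "--batch_size", "30",
        "--learning_rate", "5e-4",
        "--learning_rate_decay_start", "0",
        "--scheduled_sampling_start", "100",
        "--learning_rate_decay_every", "3",
        "--save_checkpoint_every", "1000",
        "--max_epochs", "30",
        "--max_length", "225",
        "--seq_per_img", "1",
        "--use_box", "1",
        "--use_trace", "1",
        "--input_trace_dir", f"data/{dataset}_LN_trace_box",
        "--use_trace_feat", "0",
        "--beam_size", "1",
        "--val_images_use", "-1",
        "--num_layers", str(num_layers),
        "--task", task,
        "--eval_task", eval_task,
        f"--dataset_choice={dataset}",
    ]


def to_shell(args):
    """Join an argument list into a single shell-safe command line."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(to_shell(build_train_command("coco", "c_joint_t", "caption", num_layers=2)))
```

The returned list can be passed to subprocess.run directly, or printed with to_shell and pasted into a terminal.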

Evaluating

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on caption generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task caption --dataset_choice=coco

COCO: joint training of controlled caption generation and trace generation (N=2 layers, evaluated on trace generation)

python tools/train.py --language_eval 1 --id transformer_LN_coco  --caption_model transformer --input_json data/coco_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/coco_LN_label.h5 --batch_size 30 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/coco_LN_trace_box --use_trace_feat 0 --beam_size 1 --val_images_use -1 --num_layers 2 --task c_joint_t --eval_task trace --dataset_choice=coco

Open Images: training to generate caption and trace at the same time (N=1 layer, evaluated on predicting both)

python tools/train.py --language_eval 1 --id transformer_LN_openimg  --caption_model transformer --input_json data/openimg_LN.json --input_att_dir Dir_to_image_features_vg --input_box_dir Dir_to_bounding_boxes_vg --input_label_h5 data/openimg_LN_label.h5 --batch_size 2 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 100 --learning_rate_decay_every 3  --save_checkpoint_every 1000 --max_epochs 30 --max_length 225 --seq_per_img 1 --use_box 1   --use_trace 1  --input_trace_dir data/openimg_LN_trace_box --use_trace_feat 0 --beam_size 5 --val_images_use -1 --num_layers 1 --task pred_both --eval_task pred_both --dataset_choice=openimg
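Compared with the training commands, the evaluation commands above set --language_eval 1. When evaluating caption or pred_both they also use --batch_size 2 and --beam_size 5, while trace evaluation keeps --batch_size 30 and --beam_size 1. The sketch below applies those differences to an existing argument list. The helper name to_eval_args is hypothetical, and the rule is only inferred from the three commands shown here.

```python
def to_eval_args(train_args, eval_task):
    """Return a copy of train_args with the evaluation-time flag values applied."""
    overrides = {"--language_eval": "1", "--eval_task": eval_task}
    if eval_task in ("caption", "pred_both"):
        # The caption and pred_both evaluation commands use a small batch
        # with beam search.
        overrides["--batch_size"] = "2"
        overrides["--beam_size"] = "5"
    args = list(train_args)  # leave the caller's list untouched
    for flag, value in overrides.items():
        if flag in args:
            args[args.index(flag) + 1] = value  # replace the existing value
        else:
            args += [flag, value]               # flag was absent: append it
    return args


if __name__ == "__main__":
    train = ["python", "tools/train.py", "--language_eval", "0",
             "--batch_size", "30", "--beam_size", "1", "--eval_task", "caption"]
    print(" ".join(to_eval_args(train, "caption")))
```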

Acknowledgements

Some components of this repo were built from Ruotian Luo's ImageCaptioning.pytorch.

Owner
Meta Research