Visual dialog agents with pre-trained vision-and-language encoders.

Overview

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

Also known as READ-UP: Referring Expression Agent Dialog with Unified Pretraining.

This repo includes the training/testing code for our paper Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation, accepted to CVPR 2021.

Please cite the following paper if you use the code in this repository:

@inproceedings{tu2021learning,
  title={Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation},
  author={Tu, Tao and Ping, Qing and Thattai, Govindarajan and Tur, Gokhan and Natarajan, Prem},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5622--5631},
  year={2021}
}

Repository Setup

Environment

The following environment is recommended:

Instance storage: > 800 GB
PyTorch 1.4.0
CUDA 10.0

Set up virtual environment and install pytorch:

$ conda create -n read_up python=3.6
$ conda activate read_up
$ git clone https://github.com/amazon-research/read-up.git

# [IMPORTANT] PyTorch 1.4.0 has no issues with parallel training
$ conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.0 -c pytorch
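
To verify the install matches the recommended versions before proceeding, a quick check from inside the read_up environment:

# sanity check: versions should match the recommendation above
import torch
print(torch.__version__)          # expect 1.4.0
print(torch.version.cuda)         # expect 10.0
print(torch.cuda.is_available())  # expect True on a GPU instance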

Install dependencies:

# Install general dependencies
$ sudo apt-get install build-essential libcap-dev
$ pip install -r requirement.txt

# Install vqa-maskrcnn-benchmark (for feature extraction only)
$ git clone https://gitlab.com/vedanuj/vqa-maskrcnn-benchmark.git
$ cd vqa-maskrcnn-benchmark
$ python setup.py build develop

Install Apex for distributed training:

# Apex is used for both `faster-rcnn feature extraction` & `distributed training`
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
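
For reference, this is roughly how Apex is wired into a PyTorch 1.4-era training loop (mixed precision plus multi-GPU wrapping). The model and optimizer below are stand-ins, not this repo's actual code:

# minimal Apex sketch on a single GPU (placeholder model/optimizer, not this repo's code)
import os
import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# single-process group so the sketch runs standalone
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = torch.nn.Linear(10, 2).cuda()  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # mixed precision
model = DDP(model)  # gradient all-reduce across processes

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backprop through the loss-scaled graph
optimizer.step()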

Dataset

Meta-data

Download the GuessWhat?! dataset:

$ wget https://florian-strub.com/guesswhat.train.jsonl.gz -P data/
$ wget https://florian-strub.com/guesswhat.valid.jsonl.gz -P data/
$ wget https://florian-strub.com/guesswhat.test.jsonl.gz -P data/

Prepare dict.json:

  1. Set up the repo as instructed in https://github.com/GuessWhatGame/guesswhat
  2. Generate the dict.json file:
$ python src/guesswhat/preprocess_data/create_dictionary.py -data_dir data -dict_file dict.json -min_occ 3
  3. Copy the dict.json file into the read-up repo:
$ mkdir read-up/tf-pretrained-model
$ cp guesswhat/data/dict.json read-up/tf-pretrained-model/
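
A quick way to sanity-check the copied vocabulary file (a sketch; this assumes dict.json stores the word-to-index mapping built above, under the key name used by the GuessWhatGame preprocessing script):

# inspect the GuessWhat?! vocabulary file (key layout is an assumption)
import json

with open("read-up/tf-pretrained-model/dict.json") as f:
    vocab = json.load(f)

print(list(vocab.keys()))
if "word2i" in vocab:  # key name used by the GuessWhatGame repo
    print("vocab size:", len(vocab["word2i"]))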

Dataset for Oracle models

1. Dataset for baseline Oracle + Faster-RCNN visual features.

Under vqa-maskrcnn-benchmark/data/, download the RCNN model and the COCO images:

# download RCNN model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_model.pth
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/detectron_config.yaml

# download COCO data  
$ wget http://images.cocodataset.org/zips/train2014.zip
$ wget http://images.cocodataset.org/zips/val2014.zip
$ wget http://images.cocodataset.org/zips/test2014.zip
$ unzip -j train2014.zip 
$ unzip -j val2014.zip
$ unzip -j test2014.zip 

Copy guesswhat.train/valid/test.jsonl to vqa-maskrcnn-benchmark/data/. Unzip the COCO images into a folder image_dir/COCO_2014/images/, then prepare an npy file for the feature-extraction step below:


$ python bin/prepare_extract_gt_features_gw.py \
    --src vqa-maskrcnn-benchmark/data/guesswhat.train.jsonl \
    --img-dir vqa-maskrcnn-benchmark/image_dir/COCO_2014/images/ \
    --out vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy

Repeat the same process for val and test. The generated file looks like the following:

[
    {
        'file_name': 'name_of_image_file',
        'file_path': '<path_to_image_file_on_your_disk>',
        'bbox': array([
                        [ x1, y1, width1, height1],
                        [ x2, y2, width2, height2],
                        ...
                    ]),
        'num_box': 2
    },
    ...
]
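
The generated file can be loaded back with numpy to verify it (a sketch; the records are stored as pickled Python dicts, so if np.load returns a 0-d object array, unwrap it with .item() first):

# inspect the ground-truth bbox file produced above
import numpy as np

entries = np.load(
    "vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy",
    allow_pickle=True,  # the file stores Python dicts, not a plain numeric array
)

first = entries[0]  # one record per image, as shown above
print(first["file_name"], first["num_box"])
print(first["bbox"].shape)  # (num_box, 4): x, y, width, height per box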

Extract features from the ground-truth bounding boxes generated before:

$ python bin/extract_features_from_gt.py \
    --model_file vqa-maskrcnn-benchmark/data/detectron_model.pth \
    --config_file vqa-maskrcnn-benchmark/data/detectron_config.yaml \
    --imdb_gt_file vqa-maskrcnn-benchmark/image_dir/COCO_2014/npy_files/guesswhat.train.npy \
    --output_folder data/rcnn/from_gt_gw_xyxy_scale/train

Repeat this process for val and test data.

2. Dataset for our Oracle model.

Download the pretrained ViLBERT model (the vanilla and 12-in-1 variants give similar performance in our experiments).

# download vanilla pretrained model
$ cd vilbert-pretrained-model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/pretrained_model.bin

# download 12-in-1 pretrained model
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/multi_task_model.bin

Download the Faster-RCNN features for COCO (create the LMDB directories first so data.mdb can be moved into them):

$ mkdir -p COCO_trainval_resnext152_faster_rcnn_genome.lmdb COCO_test_resnext152_faster_rcnn_genome.lmdb
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets/coco/features_100/COCO_trainval_resnext152_faster_rcnn_genome.lmdb/data.mdb && mv data.mdb COCO_trainval_resnext152_faster_rcnn_genome.lmdb/
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets/coco/features_100/COCO_test_resnext152_faster_rcnn_genome.lmdb/data.mdb && mv data.mdb COCO_test_resnext152_faster_rcnn_genome.lmdb/
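
The downloaded LMDBs can be opened with the lmdb Python package to confirm they are intact (a sketch; the key/value encoding follows the vilbert-multi-task feature readers and is an assumption here):

# peek inside a downloaded feature LMDB
import lmdb

env = lmdb.open(
    "COCO_trainval_resnext152_faster_rcnn_genome.lmdb",
    readonly=True, lock=False, readahead=False,
)
with env.begin(write=False) as txn:
    print("entries:", env.stat()["entries"])
    key, value = next(iter(txn.cursor()))  # first stored record
    print("first key:", key)  # typically an image id, stored as bytes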

Dataset for Q-Gen models

1. Dataset for baseline Q-Gen model [1]

$ wget www.florian-strub.com/github/ft_vgg_img.zip
$ unzip ft_vgg_img.zip -d img/

2. Dataset for VDST Q-Gen model [2]

$ python bin/extract_features.py \
    --model_file vqa-maskrcnn-benchmark/data/detectron_model.pth \
    --config_file vqa-maskrcnn-benchmark/data/detectron_config.yaml \
    --image_dir vqa-maskrcnn-benchmark/image_dir/COCO_2014/images/ \
    --output_folder data/rcnn/from_rcnn/ \
    --batch_size 8

Dataset for Guesser models

1. Dataset for baseline Guesser model [1]

$ cd data/vilbert-multi-task
$ wget https://dl.fbaipublicfiles.com/vilbert-multi-task/datasets.tar.gz
# requires pigz for parallel decompression (sudo apt-get install pigz)
$ tar -I pigz -xvf datasets.tar.gz datasets/guesswhat/

Model Training & Evaluation

Oracle

To train our Oracle model:

$ python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    main.py \
    --command train-oracle-vilbert \
    --config config_files/oracle_vilbert.yaml \
    --n-jobs 8
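
For context, torch.distributed.launch (PyTorch 1.4) spawns one process per GPU and passes each a --local_rank argument; a launched script follows roughly this convention (a skeleton, not this repo's main.py):

# skeleton of a script driven by torch.distributed.launch (not this repo's main.py)
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args, _ = parser.parse_known_args()

torch.cuda.set_device(args.local_rank)   # bind this process to one GPU
dist.init_process_group(backend="nccl")  # launcher provides MASTER_ADDR/PORT etc.
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")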

To evaluate our Oracle model:

$ python main.py \
    --command test-oracle-vilbert \
    --config config_files/oracle_vilbert.yaml \
    --load ckpt/oracle_vilbert-sd0/epoch-3.pth

This repo also implements other Oracle models:

  • Baseline Oracle model [1]
  • Baseline Oracle model + Faster-RCNN visual features (our ablation model)

To train and evaluate these models, run main.py with the corresponding config file and command.

Guesser

To train our Guesser model:

$ python main.py \
    --command train-guesser-vilbert \
    --config config_files/guesser_vilbert.yaml \
    --n-jobs 8

To evaluate our Guesser model:

$ python main.py \
    --command test-guesser-vilbert \
    --config config_files/guesser_vilbert.yaml \
    --n-jobs 8 \
    --load ckpt/guesser_vilbert-sd0/best.pth

This repo also implements other Guesser models:

  • Baseline Guesser model [1]

To train and evaluate this model, run main.py with the corresponding config file and command.

Q-Gen

To train our Q-Gen model:

# Distributed training
$ python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    main.py \
    --command train-qgen-vilbert \
    --config config_files/qgen_vilbert.yaml \
    --n-jobs 8 

# Non-distributed training
$ python main.py \
    --command train-qgen-vilbert \
    --config config_files/qgen_vilbert.yaml \
    --n-jobs 8 

To evaluate our Q-Gen model:

$ python main.py \
    --command test-self-play-all-vilbert \
    --config config_files/self_play_all_vilbert.yaml \
    --n-jobs 8
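
Self-play evaluation plays full games: the Q-Gen asks a fixed number of questions, the Oracle answers each one, and the Guesser then picks the target object, with game success rate as the metric. A schematic of that loop (all components below are stand-ins, not this repo's API):

# schematic of one GuessWhat?! self-play game (stub components, not this repo's API)
class StubQGen:
    def ask(self, image, dialog):
        return "is it a person?"

class StubOracle:
    def answer(self, image, target, question):
        return "No"

class StubGuesser:
    def guess(self, image, objects, dialog):
        return objects[0]

def play_game(image, objects, target, qgen, oracle, guesser, max_turns=5):
    dialog = []
    for _ in range(max_turns):
        question = qgen.ask(image, dialog)               # Q-Gen proposes a question
        answer = oracle.answer(image, target, question)  # Oracle replies Yes/No/N-A
        dialog.append((question, answer))
    return guesser.guess(image, objects, dialog) == target  # success if target picked

print(play_game(None, ["cat", "dog"], "cat", StubQGen(), StubOracle(), StubGuesser()))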

This repo also implements other Q-Gen models:

  • Baseline Q-Gen model [1]
  • VDST Q-Gen model [2]

To train and evaluate these models, run main.py with the corresponding config file and command.

References

[1] Strub, F., De Vries, H., Mary, J., Piot, B., Courville, A., & Pietquin, O. (2017, August). End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 2765-2771).

[2] Pang, W., & Wang, X. (2020, April). Visual dialogue state tracking for question generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 11831-11838).
