Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Last update: Dec 14, 2022

Related tags

Deep Learning vln-bert

Overview

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra

Paper: https://arxiv.org/abs/2004.14973

Model Zoo

A variety of pre-trained VLN-BERT weights can accessed through the following links:

	Pre-training Stages	Job ID	Val Unseen SR	URL
0	no pre-training	174631	30.52%	TBD
1	1	175134	45.17%	TBD
3	1 and 2	221943	49.64%	download
2	1 and 3	220929	50.02%	download
4	1, 2, and 3 (Full Model)	220825	59.26%	download

Usage Instructions

Follow the instructions in INSTALL.md to setup this codebase. The instructions walk you through several steps including preprocessing the Matterport3D panoramas by extracting regions with a pretrained object detector.

Training

To preform stage 3 of pre-training, first download ViLBERT weights from here. Then, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/vilbert_pytorch_model_9.bin> \
--save_name [pre_train_run_id] \
--num_epochs 50 \
--warmup_proportion 0.08 \
--cooldown_factor 8 \
--masked_language \
--masked_vision \
--no_ranking

To fine-tune VLN-BERT for the path selection task, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/pytorch_model_50.bin> \
--save_name [fine_tune_run_id]

Evaluation

To evaluate a pre-trained model, run:

python test.py \
--split [val_seen|val_unseen] \
--from_pretrained <path/to/run_[run_id]_pytorch_model.bin> \
--save_name [run_id]

followed by:

python scripts/calculate-metrics.py <path/to/results_[val_seen|val_unseen].json>

Citation

If you find this code useful, please consider citing:

@inproceedings{majumdar2020improving,
  title={Improving Vision-and-Language Navigation with Image-Text Pairs from the Web},
  author={Arjun Majumdar and Ayush Shrivastava and Stefan Lee and Peter Anderson and Devi Parikh and Dhruv Batra},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}

Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Related tags

Overview

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Model Zoo

Usage Instructions

Training

Evaluation

Citation

Owner

Arjun Majumdar

MinkLoc3D-SI: 3D LiDAR place recognition with sparse convolutions,spherical coordinates, and intensity

From Canonical Correlation Analysis to Self-supervised Graph Neural Networks

Repository for the AugmentedPCA Python package.

A basic duplicate image detection service using perceptual image hash functions and nearest neighbor search, implemented using faiss, fastapi, and imagehash

This is a official repository of SimViT.

StyleSwin: Transformer-based GAN for High-resolution Image Generation

TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

This repository collects 100 papers related to negative sampling methods.

The first machine learning framework that encourages learning ML concepts instead of memorizing class functions.

nanodet_plus,yolov5_v6.0

Implementation of light baking system for ray tracing based on Activision's UberBake

Instance-wise Feature Importance in Time (FIT)

bespoke tooling for offensive security's Windows Usermode Exploit Dev course (OSED)

Source code related to the article submitted to the International Conference on Computational Science ICCS 2022 in London

This is the solution for 2nd rank in Kaggle competition: Feedback Prize - Evaluating Student Writing.

Gesture Volume Control Using OpenCV and MediaPipe

Official implementation of "Learning Not to Reconstruct" (BMVC 2021)

Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training

Segmentation-Aware Convolutional Networks Using Local Attention Masks

SmartSim Infrastructure Library.