Vision-and-Language Navigation in Continuous Environments using Habitat

Overview

Vision-and-Language Navigation in Continuous Environments (VLN-CE)

Project Website | VLN-CE Challenge | RxR-Habitat Challenge

Official implementations:

  • Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments (paper)
  • Waypoint Models for Instruction-guided Navigation in Continuous Environments (paper, README)

Vision-and-Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained agent navigation. This repo is a launching point for interacting with the VLN-CE task and provides both baseline agents and training methods. Both the Room-to-Room (R2R) and the Room-Across-Room (RxR) datasets are supported. VLN-CE is implemented using the Habitat platform.

[Figure: VLN-CE comparison to VLN]

Setup

This project is developed with Python 3.6. If you are using miniconda or anaconda, you can create an environment:

conda create -n vlnce python=3.6
conda activate vlnce

VLN-CE uses Habitat-Sim 0.1.7 which can be built from source or installed from conda:

conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless

Then install Habitat-Lab:

git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all

Now you can install VLN-CE:

git clone git@github.com:jacobkrantz/VLN-CE.git
cd VLN-CE
python -m pip install -r requirements.txt

Data

Scenes: Matterport3D

Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (download_mp.py) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:

# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/

Extract such that it has the form data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes.
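
The layout described above can be checked with a short script. This is a minimal sketch, not part of the repo: the function name find_scene_problems is hypothetical, and it only tests for the {scene}/{scene}.glb pattern and the scene count of 90 stated above.

```python
from pathlib import Path


def find_scene_problems(root="data/scene_datasets/mp3d", expected=90):
    """Return a list of human-readable problems with the MP3D scene layout."""
    root = Path(root)
    if not root.is_dir():
        return [f"missing directory: {root}"]

    problems = []
    scene_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    for scene_dir in scene_dirs:
        # Each scene folder should contain a .glb file named after the folder.
        glb = scene_dir / f"{scene_dir.name}.glb"
        if not glb.is_file():
            problems.append(f"missing file: {glb}")

    if len(scene_dirs) != expected:
        problems.append(f"expected {expected} scenes, found {len(scene_dirs)}")
    return problems
```

An empty list from find_scene_problems() means every scene folder holds a matching .glb file and the count is 90.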

Episodes: Room-to-Room (R2R)

The R2R_VLNCE dataset is a port of the Room-to-Room (R2R) dataset created by Anderson et al. for use with the Matterport3DSimulator (MP3D-Sim). For details on the porting process from MP3D-Sim to the continuous reconstructions used in Habitat, please see our paper. We provide two versions of the dataset: R2R_VLNCE_v1-2 and R2R_VLNCE_v1-2_preprocessed. R2R_VLNCE_v1-2 contains the train, val_seen, val_unseen, and test splits. R2R_VLNCE_v1-2_preprocessed runs with our models out of the box: it additionally includes instruction tokens mapped to GloVe embeddings, ground truth trajectories, and a data augmentation split (envdrop) ported from R2R-EnvDrop. The test split does not contain episode goals or ground truth paths. For more details on the dataset contents and format, see our project page.

Dataset                           Extract path                                 Size
R2R_VLNCE_v1-2.zip                data/datasets/R2R_VLNCE_v1-2                 3 MB
R2R_VLNCE_v1-2_preprocessed.zip   data/datasets/R2R_VLNCE_v1-2_preprocessed    345 MB

Downloading the dataset:

# R2R_VLNCE_v1-2
gdown https://drive.google.com/uc?id=1YDNWsauKel0ht7cx15_d9QnM6rS4dKUV
# R2R_VLNCE_v1-2_preprocessed
gdown https://drive.google.com/uc?id=18sS9c2aRu2EAL4c7FyG29LDAm2pHzeqQ

Encoder Weights

Baseline models encode depth observations using a ResNet pre-trained on PointGoal navigation. Those weights can be downloaded from here (672M). Extract the contents to data/ddppo-models/{model}.pth.

Episodes: Room-Across-Room (RxR)

Download: RxR_VLNCE_v0.zip

The Room-Across-Room dataset was ported to continuous environments for the RxR-Habitat Challenge hosted at the CVPR 2021 Embodied AI Workshop. The dataset has train, val_seen, val_unseen, and test_challenge splits with both Guide and Follower trajectories ported. The starter code expects files in this structure:

data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ train_guide_gt.json.gz
|   |    ├─ train_follower.json.gz
|   |    ├─ train_follower_gt.json.gz
|   ├─ val_seen
|   |    ├─ val_seen_guide.json.gz
|   |    ├─ val_seen_guide_gt.json.gz
|   |    ├─ val_seen_follower.json.gz
|   |    ├─ val_seen_follower_gt.json.gz
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ val_unseen_guide_gt.json.gz
|   |    ├─ val_unseen_follower.json.gz
|   |    ├─ val_unseen_follower_gt.json.gz
|   ├─ test_challenge
|   |    ├─ test_challenge_guide.json.gz
|   ├─ text_features
|   |    ├─ ...

The baseline models for RxR-Habitat use precomputed BERT instruction features which can be downloaded from here and saved to data/datasets/RxR_VLNCE_v0/text_features/rxr_{split}/{instruction_id}_{language}_text_features.npz.
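
To make the expected layout concrete, the sketch below enumerates the episode files from the tree above and fills in the text-feature path template. The helper names are hypothetical and not part of the starter code. The exact form of instruction_id and language in real file names is not specified here, so the values passed in are the caller's responsibility.

```python
from pathlib import Path

RXR_ROOT = "data/datasets/RxR_VLNCE_v0"

# Path template copied from the description above.
TEXT_FEATURES_TEMPLATE = (
    RXR_ROOT + "/text_features/rxr_{split}/"
    "{instruction_id}_{language}_text_features.npz"
)


def expected_episode_files(root=RXR_ROOT):
    """List the episode files shown in the directory tree above."""
    root = Path(root)
    files = []
    for split in ("train", "val_seen", "val_unseen"):
        for role in ("guide", "follower"):
            files.append(root / split / f"{split}_{role}.json.gz")
            files.append(root / split / f"{split}_{role}_gt.json.gz")
    # The test_challenge split ships guide episodes only, without ground truth.
    files.append(root / "test_challenge" / "test_challenge_guide.json.gz")
    return files


def text_features_path(split, instruction_id, language):
    """Fill in the text-feature path template for one instruction."""
    return Path(
        TEXT_FEATURES_TEMPLATE.format(
            split=split, instruction_id=instruction_id, language=language
        )
    )


def missing_episode_files(root=RXR_ROOT):
    """Return the expected episode files that are not on disk."""
    return [p for p in expected_episode_files(root) if not p.is_file()]
```

Calling missing_episode_files() after extraction returns an empty list when all 13 episode files from the tree are in place.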

RxR-Habitat Challenge (RxR Data)

[Figure: RxR Challenge teaser GIF]

The RxR-Habitat Challenge uses the new Room-Across-Room (RxR) dataset which:

  • contains multilingual instructions (English, Hindi, Telugu),
  • is an order of magnitude larger than existing datasets, and
  • uses varied paths to break a shortest-path-to-goal assumption.

The challenge was hosted at the CVPR 2021 Embodied AI Workshop. While the official challenge is over, the leaderboard remains open and we encourage submissions on this difficult task! For guidelines and access, please visit: ai.google.com/research/rxr/habitat.

Generating Submissions

Submissions are made by running an agent locally and submitting a jsonlines file (.jsonl) containing the agent's trajectories. Starter code for generating this file is provided in the function BaseVLNCETrainer.inference(). Here is an example of generating predictions for English using the Cross-Modal Attention baseline:

python run.py \
  --exp-config vlnce_baselines/config/rxr_baselines/rxr_cma_en.yaml \
  --run-type inference

If you use different models for different languages, you can merge their predictions with scripts/merge_inference_predictions.py. A submission is accepted only if it contains all episodes from all three languages in the test-challenge split. Starter code for this challenge was originally hosted in the rxr-habitat-challenge branch but is now under continual development in master.

VLN-CE Challenge (R2R Data)

The VLN-CE Challenge is live and taking submissions for public test set evaluation. This challenge uses the R2R data ported in the original VLN-CE paper.

To submit to the leaderboard, you must run your agent locally and submit a JSON file containing the generated agent trajectories. Starter code for generating this JSON file is provided in the function BaseVLNCETrainer.inference(). Here is an example of generating this file using the pretrained Cross-Modal Attention baseline:

python run.py \
  --exp-config vlnce_baselines/config/r2r_baselines/test_set_inference.yaml \
  --run-type inference

Predictions must be in a specific format. Please visit the challenge webpage for guidelines.

Baseline Performance

The baseline model for the VLN-CE task is the cross-modal attention model trained with progress monitoring, DAgger, and augmented data (CMA_PM_DA_Aug). As evaluated on the leaderboard, this model achieves:

Split        TL     NE     OS     SR     SPL
Test         8.85   7.91   0.36   0.28   0.25
Val Unseen   8.27   7.60   0.36   0.29   0.27
Val Seen     9.06   7.21   0.44   0.34   0.32

This model was originally presented with a val_unseen performance of 0.30 SPL; however, the leaderboard evaluates this same model at 0.27 SPL. The model was trained and evaluated on a hardware and Habitat build combination that gave slightly different results, as is the case for the other paper experiments. Going forward, the leaderboard contains the performance metrics that should be used for official comparison. In our tests, the installation procedure for this repo gives nearly identical evaluation results to the leaderboard, but we recognize that compute hardware, along with the version and build of Habitat, affects reproducibility.
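
For readers unfamiliar with the SPL column, the sketch below computes it under the standard definition of success weighted by path length: each episode contributes its success flag times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. This is an illustrative function written for this README, not the metric code used by Habitat or the leaderboard.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 1/0 (or True/False) per episode.
    shortest_lengths: shortest-path distance from start to goal per episode.
    path_lengths: distance the agent actually travelled per episode.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, travelled in zip(
        successes, shortest_lengths, path_lengths
    ):
        denom = max(travelled, shortest)
        # A zero-length episode contributes nothing rather than dividing by zero.
        ratio = shortest / denom if denom > 0 else 0.0
        total += float(success) * ratio
    return total / len(successes)
```

A successful episode that follows the shortest path scores 1, a successful episode that travels twice the shortest distance scores 0.5, and a failed episode scores 0 regardless of path length.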

For push-button replication of all VLN-CE experiments, see here.

Starter Code

The run.py script controls training and evaluation for all models and datasets:

python run.py \
  --exp-config path/to/experiment_config.yaml \
  --run-type {train | eval | inference}
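
When launching many runs from Python, the same command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repo. It uses only the two flags shown above and rejects run types other than the three listed.

```python
import subprocess
import sys

RUN_TYPES = ("train", "eval", "inference")


def build_run_command(exp_config, run_type):
    """Build the argument list for one run.py invocation."""
    if run_type not in RUN_TYPES:
        raise ValueError(f"run_type must be one of {RUN_TYPES}, got {run_type!r}")
    return [
        sys.executable,
        "run.py",
        "--exp-config",
        exp_config,
        "--run-type",
        run_type,
    ]


def launch(exp_config, run_type):
    """Run the command from the VLN-CE repo root and fail on a non-zero exit."""
    subprocess.run(build_run_command(exp_config, run_type), check=True)
```

Passing a list of arguments rather than a shell string avoids quoting problems when config paths contain spaces.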

For example, a random agent can be evaluated on 10 val-seen episodes of R2R using this command:

python run.py --exp-config vlnce_baselines/config/r2r_baselines/nonlearning.yaml --run-type eval

For lists of modifiable configuration options, see the default task config and experiment config files.

Training Agents

The DaggerTrainer class is the standard trainer and supports teacher forcing or dataset aggregation (DAgger). This trainer saves trajectories consisting of RGB, depth, ground-truth actions, and instructions to disk to avoid time spent in simulation.

The RecollectTrainer class performs teacher forcing using the ground truth trajectories provided in the dataset rather than a shortest path expert. Also, this trainer does not save episodes to disk, instead opting to recollect them in simulation.

Both trainers inherit from BaseVLNCETrainer.
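
The difference between teacher forcing and DAgger can be illustrated with a toy rollout. The sketch below is not the DaggerTrainer code, and it assumes the usual DAgger formulation: at each step the expert's action is executed with probability beta and the agent's own action otherwise, while the training label is always the expert's action. All names are hypothetical.

```python
import random


def choose_action(expert_action, agent_action, beta, rng=random):
    """Execute the expert action with probability beta, else the agent's."""
    return expert_action if rng.random() < beta else agent_action


def collect_labels(observations, expert, agent, beta, rng=random):
    """Roll through observations and return (executed_actions, labels)."""
    executed, labels = [], []
    for obs in observations:
        expert_action = expert(obs)
        # Supervision always comes from the expert, whoever acted.
        labels.append(expert_action)
        executed.append(choose_action(expert_action, agent(obs), beta, rng))
    return executed, labels
```

With beta = 1.0 the expert always acts, which corresponds to teacher forcing. Lower values let the agent visit states produced by its own mistakes while still being labeled by the expert.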

Evaluating Agents

Evaluation on validation splits can be done by running python run.py --exp-config path/to/experiment_config.yaml --run-type eval. If EVAL.EPISODE_COUNT == -1, all episodes will be evaluated. If EVAL_CKPT_PATH_DIR is a directory, each checkpoint will be evaluated one at a time.
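
The two rules above can be sketched as follows. This is an illustration of the described behavior, not the trainer's code. Taking the first N episodes when the count is not -1 and sorting checkpoints by file name are assumptions, and the function names are hypothetical.

```python
from pathlib import Path


def episodes_to_evaluate(episodes, episode_count):
    """Select episodes: -1 means all, otherwise the first episode_count."""
    episodes = list(episodes)
    if episode_count == -1:
        return episodes
    return episodes[:episode_count]


def checkpoints_to_evaluate(ckpt_path):
    """Return one checkpoint, or every file in a checkpoint directory."""
    path = Path(ckpt_path)
    if path.is_dir():
        # Sorted by name (lexicographic), evaluated one at a time by the caller.
        return sorted(p for p in path.iterdir() if p.is_file())
    return [path]
```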

CUDA

CUDA will be used by default if it is available. We find that using one GPU for the model and several GPUs for simulation works well.

SIMULATOR_GPU_IDS: [0]  # list of GPU IDs to run simulations
TORCH_GPU_ID: 0  # GPU for pytorch-related code (the model)
NUM_ENVIRONMENTS: 1  # Each GPU runs NUM_ENVIRONMENTS environments

The simulator and torch code do not need to run on the same device. For faster training and evaluation, we recommend setting NUM_ENVIRONMENTS as high as will fit on your GPU, assuming 1 CPU core per environment.
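
The CPU side of that rule of thumb is easy to compute. The sketch below assumes, per the config comments above, that each simulator GPU runs NUM_ENVIRONMENTS environments and that each environment needs one CPU core. It does not estimate GPU memory, which still has to be checked empirically. The function names are hypothetical.

```python
import os


def total_environments(simulator_gpu_ids, num_environments):
    """Total simulator environments across all simulator GPUs."""
    return len(simulator_gpu_ids) * num_environments


def max_envs_per_gpu_by_cpu(simulator_gpu_ids, cpu_cores=None):
    """Upper bound on NUM_ENVIRONMENTS from CPU cores alone (1 core per env)."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    n_gpus = max(1, len(simulator_gpu_ids))
    return max(1, cores // n_gpus)
```

For example, with SIMULATOR_GPU_IDS: [0, 1] on a 16-core machine, the CPU bound is 8 environments per GPU, for 16 environments in total.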

License

The VLN-CE codebase is MIT licensed. Trained models and task datasets are considered data derived from the mp3d scene dataset. Matterport3D based task datasets and trained models are distributed with Matterport3D Terms of Use and under CC BY-NC-SA 3.0 US license.

Citing

If you use VLN-CE in your research, please cite the following paper:

@inproceedings{krantz_vlnce_2020,
  title={Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments},
  author={Jacob Krantz and Erik Wijmans and Arjun Majumdar and Dhruv Batra and Stefan Lee},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}

If you use the RxR-Habitat data, please additionally cite the following paper:

@inproceedings{ku2020room,
  title={Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding},
  author={Ku, Alexander and Anderson, Peter and Patel, Roma and Ie, Eugene and Baldridge, Jason},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={4392--4412},
  year={2020}
}