Vision-and-Language Navigation in Continuous Environments using Habitat

Overview

Vision-and-Language Navigation in Continuous Environments (VLN-CE)

Project WebsiteVLN-CE ChallengeRxR-Habitat Challenge

Official implementations:

  • Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments (paper)
  • Waypoint Models for Instruction-guided Navigation in Continuous Environments (paper, README)

Vision and Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained agent navigation. This repo is a launching point for interacting with the VLN-CE task and provides both baseline agents and training methods. Both the Room-to-Room (R2R) and the Room-Across-Room (RxR) datasets are supported. VLN-CE is implemented using the Habitat platform.

VLN-CE comparison to VLN

Setup

This project is developed with Python 3.6. If you are using miniconda or anaconda, you can create an environment:

conda create -n vlnce python3.6
conda activate vlnce

VLN-CE uses Habitat-Sim 0.1.7 which can be built from source or installed from conda:

conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless

Then install Habitat-Lab:

git clone --branch v0.1.7 [email protected]:facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all

Now you can install VLN-CE:

git clone [email protected]:jacobkrantz/VLN-CE.git
cd VLN-CE
python -m pip install -r requirements.txt

Data

Scenes: Matterport3D

Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (download_mp.py) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:

# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/

Extract such that it has the form data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes.

Episodes: Room-to-Room (R2R)

The R2R_VLNCE dataset is a port of the Room-to-Room (R2R) dataset created by Anderson et al for use with the Matterport3DSimulator (MP3D-Sim). For details on the porting process from MP3D-Sim to the continuous reconstructions used in Habitat, please see our paper. We provide two versions of the dataset, R2R_VLNCE_v1-2 and R2R_VLNCE_v1-2_preprocessed. R2R_VLNCE_v1-2 contains the train, val_seen, val_unseen, and test splits. R2R_VLNCE_v1-2_preprocessed runs with our models out of the box. It additionally includes instruction tokens mapped to GloVe embeddings, ground truth trajectories, and a data augmentation split (envdrop) that is ported from R2R-EnvDrop. The test split does not contain episode goals or ground truth paths. For more details on the dataset contents and format, see our project page.

Dataset Extract path Size
R2R_VLNCE_v1-2.zip data/datasets/R2R_VLNCE_v1-2 3 MB
R2R_VLNCE_v1-2_preprocessed.zip data/datasets/R2R_VLNCE_v1-2_preprocessed 345 MB

Downloading the dataset:

# R2R_VLNCE_v1-2
gdown https://drive.google.com/uc?id=1YDNWsauKel0ht7cx15_d9QnM6rS4dKUV
# R2R_VLNCE_v1-2_preprocessed
gdown https://drive.google.com/uc?id=18sS9c2aRu2EAL4c7FyG29LDAm2pHzeqQ
Encoder Weights

Baseline models encode depth observations using a ResNet pre-trained on PointGoal navigation. Those weights can be downloaded from here (672M). Extract the contents to data/ddppo-models/{model}.pth.

Episodes: Room-Across-Room (RxR)

Download: RxR_VLNCE_v0.zip

The Room-Across-Room dataset was ported to continuous environments for the RxR-Habitat Challenge hosted at the CVPR 2021 Embodied AI Workshop. The dataset has train, val_seen, val_unseen, and test_challenge splits with both Guide and Follower trajectories ported. The starter code expects files in this structure:

data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ train_guide_gt.json.gz
|   |    ├─ train_follower.json.gz
|   |    ├─ train_follower_gt.json.gz
|   ├─ val_seen
|   |    ├─ val_seen_guide.json.gz
|   |    ├─ val_seen_guide_gt.json.gz
|   |    ├─ val_seen_follower.json.gz
|   |    ├─ val_seen_follower_gt.json.gz
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ val_unseen_guide_gt.json.gz
|   |    ├─ val_unseen_follower.json.gz
|   |    ├─ val_unseen_follower_gt.json.gz
|   ├─ test_challenge
|   |    ├─ test_challenge_guide.json.gz
|   ├─ text_features
|   |    ├─ ...

The baseline models for RxR-Habitat use precomputed BERT instruction features which can be downloaded from here and saved to data/datasets/RxR_VLNCE_v0/text_features/rxr_{split}/{instruction_id}_{language}_text_features.npz.

RxR-Habitat Challenge (RxR Data)

RxR Challenge Teaser GIF

The RxR-Habitat Challenge uses the new Room-Across-Room (RxR) dataset which:

  • contains multilingual instructions (English, Hindi, Telugu),
  • is an order of magnitude larger than existing datasets, and
  • uses varied paths to break a shortest-path-to-goal assumption.

The challenge was hosted at the CVPR 2021 Embodied AI Workshop. While the official challenge is over, the leaderboard remains open and we encourage submissions on this difficult task! For guidelines and access, please visit: ai.google.com/research/rxr/habitat.

Generating Submissions

Submissions are made by running an agent locally and submitting a jsonlines file (.jsonl) containing the agent's trajectories. Starter code for generating this file is provided in the function BaseVLNCETrainer.inference(). Here is an example of generating predictions for English using the Cross-Modal Attention baseline:

python run.py \
  --exp-config vlnce_baselines/config/rxr_baselines/rxr_cma_en.yaml \
  --run-type inference

If you use different models for different languages, you can merge their predictions with scripts/merge_inference_predictions.py. Submissions are only accepted that contain all episodes from all three languages in the test-challenge split. Starter code for this challenge was originally hosted in the rxr-habitat-challenge branch but is now under continual development in master.

VLN-CE Challenge (R2R Data)

The VLN-CE Challenge is live and taking submissions for public test set evaluation. This challenge uses the R2R data ported in the original VLN-CE paper.

To submit to the leaderboard, you must run your agent locally and submit a JSON file containing the generated agent trajectories. Starter code for generating this JSON file is provided in the function BaseVLNCETrainer.inference(). Here is an example of generating this file using the pretrained Cross-Modal Attention baseline:

python run.py \
  --exp-config vlnce_baselines/config/r2r_baselines/test_set_inference.yaml \
  --run-type inference

Predictions must be in a specific format. Please visit the challenge webpage for guidelines.

Baseline Performance

The baseline model for the VLN-CE task is the cross-modal attention model trained with progress monitoring, DAgger, and augmented data (CMA_PM_DA_Aug). As evaluated on the leaderboard, this model achieves:

Split TL NE OS SR SPL
Test 8.85 7.91 0.36 0.28 0.25
Val Unseen 8.27 7.60 0.36 0.29 0.27
Val Seen 9.06 7.21 0.44 0.34 0.32

This model was originally presented with a val_unseen performance of 0.30 SPL, however the leaderboard evaluates this same model at 0.27 SPL. The model was trained and evaluated on a hardware + Habitat build that gave slightly different results, as is the case for the other paper experiments. Going forward, the leaderboard contains the performance metrics that should be used for official comparison. In our tests, the installation procedure for this repo gives nearly identical evaluation to the leaderboard, but we recognize that compute hardware along with the version and build of Habitat are factors to reproducibility.

For push-button replication of all VLN-CE experiments, see here.

Starter Code

The run.py script controls training and evaluation for all models and datasets:

python run.py \
  --exp-config path/to/experiment_config.yaml \
  --run-type {train | eval | inference}

For example, a random agent can be evaluated on 10 val-seen episodes of R2R using this command:

python run.py --exp-config vlnce_baselines/config/r2r_baselines/nonlearning.yaml --run-type eval

For lists of modifiable configuration options, see the default task config and experiment config files.

Training Agents

The DaggerTrainer class is the standard trainer and supports teacher forcing or dataset aggregation (DAgger). This trainer saves trajectories consisting of RGB, depth, ground-truth actions, and instructions to disk to avoid time spent in simulation.

The RecollectTrainer class performs teacher forcing using the ground truth trajectories provided in the dataset rather than a shortest path expert. Also, this trainer does not save episodes to disk, instead opting to recollect them in simulation.

Both trainers inherit from BaseVLNCETrainer.

Evaluating Agents

Evaluation on validation splits can be done by running python run.py --exp-config path/to/experiment_config.yaml --run-type eval. If EVAL.EPISODE_COUNT == -1, all episodes will be evaluated. If EVAL_CKPT_PATH_DIR is a directory, each checkpoint will be evaluated one at a time.

Cuda

Cuda will be used by default if it is available. We find that one GPU for the model and several GPUs for simulation is favorable.

SIMULATOR_GPU_IDS: [0]  # list of GPU IDs to run simulations
TORCH_GPU_ID: 0  # GPU for pytorch-related code (the model)
NUM_ENVIRONMENTS: 1  # Each GPU runs NUM_ENVIRONMENTS environments

The simulator and torch code do not need to run on the same device. For faster training and evaluation, we recommend running with as many NUM_ENVIRONMENTS as will fit on your GPU while assuming 1 CPU core per env.

License

The VLN-CE codebase is MIT licensed. Trained models and task datasets are considered data derived from the mp3d scene dataset. Matterport3D based task datasets and trained models are distributed with Matterport3D Terms of Use and under CC BY-NC-SA 3.0 US license.

Citing

If you use VLN-CE in your research, please cite the following paper:

@inproceedings{krantz_vlnce_2020,
  title={Beyond the Nav-Graph: Vision and Language Navigation in Continuous Environments},
  author={Jacob Krantz and Erik Wijmans and Arjun Majundar and Dhruv Batra and Stefan Lee},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
 }

If you use the RxR-Habitat data, please additionally cite the following paper:

@inproceedings{ku2020room,
  title={Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding},
  author={Ku, Alexander and Anderson, Peter and Patel, Roma and Ie, Eugene and Baldridge, Jason},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={4392--4412},
  year={2020}
}
Owner
Jacob Krantz
PhD student at Oregon State University
Jacob Krantz
A python toolbox for predictive uncertainty quantification, calibration, metrics, and visualization

Website, Tutorials, and Docs    Uncertainty Toolbox A python toolbox for predictive uncertainty quantification, calibration, metrics, and visualizatio

Uncertainty Toolbox 1.4k Dec 28, 2022
[CVPR 2022] Structured Sparse R-CNN for Direct Scene Graph Generation

Structured Sparse R-CNN for Direct Scene Graph Generation Our paper Structured Sparse R-CNN for Direct Scene Graph Generation has been accepted by CVP

Multimedia Computing Group, Nanjing University 44 Dec 23, 2022
Minimal PyTorch implementation of Generative Latent Optimization from the paper "Optimizing the Latent Space of Generative Networks"

Minimal PyTorch implementation of Generative Latent Optimization This is a reimplementation of the paper Piotr Bojanowski, Armand Joulin, David Lopez-

Thomas Neumann 117 Nov 27, 2022
Implementation of Uformer, Attention-based Unet, in Pytorch

Uformer - Pytorch Implementation of Uformer, Attention-based Unet, in Pytorch. It will only offer the concat-cross-skip connection. This repository wi

Phil Wang 72 Dec 19, 2022
Light-Head R-CNN

Light-head R-CNN Introduction We release code for Light-Head R-CNN. This is my best practice for my research. This repo is organized as follows: light

jemmy li 835 Dec 06, 2022
Per-Pixel Classification is Not All You Need for Semantic Segmentation

MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation Bowen Cheng, Alexander G. Schwing, Alexander Kirillov [arXiv] [Proj

Facebook Research 1k Jan 08, 2023
Real-Time and Accurate Full-Body Multi-Person Pose Estimation&Tracking System

News! Aug 2020: v0.4.0 version of AlphaPose is released! Stronger tracking! Include whole body(face,hand,foot) keypoints! Colab now available. Dec 201

Machine Vision and Intelligence Group @ SJTU 6.7k Dec 28, 2022
use machine learning to recognize gesture on raspberrypi

Raspberrypi_Gesture-Recognition use machine learning to recognize gesture on raspberrypi 說明 利用 tensorflow lite 訓練手部辨識模型 分辨 "剪刀"、"石頭"、"布" 之手勢 再將訓練模型匯入

1 Dec 10, 2021
BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training By Likun Cai, Zhi Zhang, Yi Zhu, Li Zhang, Mu Li, Xiangyang Xue. This

290 Dec 29, 2022
This is the official implementation for "Do Transformers Really Perform Bad for Graph Representation?".

Graphormer By Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng*, Guolin Ke, Di He*, Yanming Shen and Tie-Yan Liu. This repo is the official impl

Microsoft 1.3k Dec 29, 2022
The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

PRIMER The official code for PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. PRIMER is a pre-trained model for mu

AI2 111 Dec 18, 2022
ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection This repository contains implementation of the

Visual Understanding Lab @ Samsung AI Center Moscow 190 Dec 30, 2022
Cross-media Structured Common Space for Multimedia Event Extraction (ACL2020)

Cross-media Structured Common Space for Multimedia Event Extraction Table of Contents Overview Requirements Data Quickstart Citation Overview The code

Manling Li 49 Nov 21, 2022
code release for USENIX'22 paper `On the Security Risks of AutoML`

This project is a minimized runnable project cut from trojanzoo, which contains more datasets, models, attacks and defenses. This repo will not be mai

Ren Pang 5 Apr 19, 2022
This project aims to explore the deployment of Swin-Transformer based on TensorRT, including the test results of FP16 and INT8.

Swin Transformer This project aims to explore the deployment of SwinTransformer based on TensorRT, including the test results of FP16 and INT8. Introd

maggiez 87 Dec 21, 2022
Deep Halftoning with Reversible Binary Pattern

Deep Halftoning with Reversible Binary Pattern ICCV Paper | Project Website | BibTex Overview Existing halftoning algorithms usually drop colors and f

Menghan Xia 17 Nov 22, 2022
PyTorch implementation of Memory-based semantic segmentation for off-road unstructured natural environments.

MemSeg: Memory-based semantic segmentation for off-road unstructured natural environments Introduction This repository is a PyTorch implementation of

11 Nov 28, 2022
SphereFace: Deep Hypersphere Embedding for Face Recognition

SphereFace: Deep Hypersphere Embedding for Face Recognition By Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj and Le Song License SphereFa

Weiyang Liu 1.5k Dec 29, 2022
Ray tracing of a Schwarzschild black hole written entirely in TensorFlow.

TensorGeodesic Ray tracing of a Schwarzschild black hole written entirely in TensorFlow. Dependencies: Python 3 TensorFlow 2.x numpy matplotlib About

5 Jan 15, 2022
Code for ICLR 2021 Paper, "Anytime Sampling for Autoregressive Models via Ordered Autoencoding"

Anytime Autoregressive Model Anytime Sampling for Autoregressive Models via Ordered Autoencoding , ICLR 21 Yilun Xu, Yang Song, Sahaj Gara, Linyuan Go

Yilun Xu 22 Sep 08, 2022