Episodic Transformers (E.T.)

Episodic Transformer for Vision-and-Language Navigation
Alexander Pashevich, Cordelia Schmid, Chen Sun

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. This code reproduces the results obtained with E.T. on the ALFRED benchmark. To learn more about the benchmark and the original code, please refer to the ALFRED repository.
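As a rough illustration of that design (a minimal PyTorch sketch, not the repository's actual model code; all names and dimensions below are assumptions), the encoder concatenates language, visual, and action tokens and attends over them jointly:

import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Illustrative only: one multimodal transformer attends jointly over
    language tokens, visual-frame features, and past-action embeddings,
    then predicts the next action for every timestep."""

    def __init__(self, vocab_size, num_actions, visual_dim=2048,
                 d_model=768, nhead=12, num_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)  # pooled CNN features (dim is an assumption)
        self.action_emb = nn.Embedding(num_actions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, words, frames, actions):
        # words: (B, L) token ids; frames: (B, T, visual_dim); actions: (B, T) action ids
        tokens = torch.cat(
            [self.word_emb(words), self.visual_proj(frames), self.action_emb(actions)], dim=1
        )
        h = self.encoder(tokens)  # full joint attention over language + episode history
        L, T = words.size(1), frames.size(1)
        return self.action_head(h[:, L:L + T])  # (B, T, num_actions): next-action logits

The real model also needs positional and modality embeddings and masking of future observations, which this sketch omits for brevity.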

Quickstart

Clone repo:

$ git clone https://github.com/alexpashevich/E.T..git ET
$ export ET_ROOT=$(pwd)/ET
$ export ET_LOGS=$ET_ROOT/logs
$ export ET_DATA=$ET_ROOT/data
$ export PYTHONPATH=$PYTHONPATH:$ET_ROOT

Install requirements:

$ virtualenv -p $(which python3.7) et_env
$ source et_env/bin/activate

$ cd $ET_ROOT
$ pip install --upgrade pip
$ pip install -r requirements.txt

Downloading data and checkpoints

Download ALFRED dataset:

$ cd $ET_DATA
$ sh download_data.sh json_feat

Copy pretrained checkpoints:

$ wget http://pascal.inrialpes.fr/data2/apashevi/et_checkpoints.zip
$ unzip et_checkpoints.zip
$ mv pretrained $ET_LOGS/

Render PNG images and create an LMDB dataset with natural language annotations:

$ python -m alfred.gen.render_trajs
$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_human args.vocab_path=$ET_ROOT/files/human.vocab

Note #1: For rendering, you may need to configure args.x_display to match the number of an X server running on your machine.
Note #2: We do not use the JPG images from the full dataset, because JPG compression would make them differ from the images rendered during evaluation.
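For example, assuming the rendering script accepts the same with-style overrides as alfred.data.create_lmdb above (the display number 1 is a placeholder for your setup):

$ python -m alfred.gen.render_trajs with args.x_display=1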

Pretrained models evaluation

Evaluate an E.T. agent trained on human data only:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: Make sure that your LMDB database is named exactly lmdb_human; otherwise the word embeddings will not be loaded.

Evaluate an E.T. agent trained on human and synthetic data:

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_synth_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human

Note: For evaluation, you may need to configure eval.x_display to match the number of an X server running on your machine.
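For example, the display can be passed as one more override on the evaluation command (1 is again a placeholder):

$ python -m alfred.eval.eval_agent with eval.exp=pretrained eval.checkpoint=et_human_pretrained.pth eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5 eval.eval_range=None exp.data.valid=lmdb_human eval.x_display=1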

E.T. with human data only

Train an E.T. agent:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_s1 exp.data.train=lmdb_human train.seed=1

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: You may need to train up to 5 agents using different random seeds to reproduce the results of the paper; see the loop sketch below.
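A minimal way to launch the five seeds sequentially (the exp.name scheme et_s$seed simply extends the naming above; any unique experiment names work):

$ for seed in 1 2 3 4 5; do
>   python -m alfred.model.train with exp.model=transformer exp.name=et_s$seed exp.data.train=lmdb_human train.seed=$seed
> done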

E.T. with language pretraining

Language encoder pretraining with the translation objective:

$ python -m alfred.model.train with exp.model=speaker exp.name=translator exp.data.train=lmdb_human

Train an E.T. agent with language pretraining:

$ python -m alfred.model.train with exp.model=transformer exp.name=et_synth_s1 exp.data.train=lmdb_human train.seed=1 exp.pretrained_path=translator

Evaluate the trained E.T. agent:

$ python -m alfred.eval.eval_agent with eval.exp=et_synth_s1 eval.object_predictor=$ET_LOGS/pretrained/maskrcnn_model.pth exp.num_workers=5

Note: As above, you may need to train up to 5 agents using different random seeds to reproduce the results of the paper (the loop sketch from the previous section applies here as well).

E.T. with joint training

You can also generate more synthetic trajectories using generate_trajs.py, create an LMDB dataset from them, and jointly train a model on both datasets. Please refer to the original ALFRED code to learn more about data generation. The steps to reproduce the results are the following (a command sketch follows the list):

  1. Generate 45K trajectories with alfred.gen.generate_trajs.
  2. Create a synthetic LMDB dataset called lmdb_synth_45K using args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth and args.vocab_path=$ET_ROOT/files/synth.vocab.
  3. Train an E.T. agent using exp.data.train=lmdb_human,lmdb_synth_45K.
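Steps 2 and 3 could look as follows, assuming the flags behave exactly as in the commands above (et_joint_s1 is a hypothetical experiment name):

$ python -m alfred.data.create_lmdb with args.visual_checkpoint=$ET_LOGS/pretrained/fasterrcnn_model.pth args.data_output=lmdb_synth_45K args.vocab_path=$ET_ROOT/files/synth.vocab
$ python -m alfred.model.train with exp.model=transformer exp.name=et_joint_s1 exp.data.train=lmdb_human,lmdb_synth_45K train.seed=1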

Citation

If you find this repository useful, please cite our work:

@misc{pashevich2021episodic,
  title={{Episodic Transformer for Vision-and-Language Navigation}},
  author={Alexander Pashevich and Cordelia Schmid and Chen Sun},
  year={2021},
  eprint={2105.06453},
  archivePrefix={arXiv},
}