End-to-End Referring Video Object Segmentation with Multimodal Transformers

Related tags

Deep LearningMTTR
Overview

End-to-End Referring Video Object Segmentation with Multimodal Transformers

License Framework

This repo contains the official implementation of the paper:


End-to-End Referring Video Object Segmentation with Multimodal Transformers

MTTR_preview.mp4

How to Run the Code

First, clone this repo to your local machine using:

git clone https://github.com/mttr2021/MTTR.git

Dataset Requirements

A2D-Sentences

Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── a2d_sentences/ 
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/ 
            └── */ (video folders)
                └── *.h5 (annotations files) 

###JHMDB-Sentences Follow the instructions here to download the dataset. Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── jhmdb_sentences/ 
    ├── Rename_Images/  (frame images)
    │   └── */ (action dirs)
    ├── puppet_mask/  (mask annotations)
    │   └── */ (action dirs)
    └── jhmdb_annotation.txt  (text annotations)

Refer-YouTube-VOS

Download the dataset from the competition's website here.

Note that you may be required to sign up to the competition in order to get access to the dataset. This registration process is free and short.

Then, extract and organize the files inside your cloned repo directory as follows (note that only the necessary files are shown):

MTTR/
└── refer_youtube_vos/ 
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files) 
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files) 
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files) 
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)

Environment Installation

The code was tested on a Conda environment installed on Ubuntu 18.04. Install Conda and then create an environment as follows:

conda create -n mttr python=3.9.7 pip -y

conda activate mttr

  • Pytorch 1.10:

conda install pytorch==1.10.0 torchvision==0.11.1 -c pytorch -c conda-forge

Note that you might have to change the cudatoolkit version above according to your system's CUDA version.

  • Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

  • COCO API (for mAP calculations):

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

  • Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy

Running Configuration

The following table lists the parameters which can be configured directly from the command line.

The rest of the running/model parameters for each dataset can be configured in configs/DATASET_NAME.yaml.

Note that in order to run the code the path of the relevant .yaml config file needs to be supplied using the -c parameter.

Command Description
-c path to dataset configuration file
-rm running mode (train/eval)
-ws window size
-bs training batch size per GPU
-ebs eval batch size per GPU (if not provided, training batch size is used)
-ng number of GPUs to run on

Evaluation

The following commands can be used to reproduce the main results of our paper using the supplied checkpoint files.

The commands were tested on RTX 3090 24GB GPUs, but it may be possible to run some of them using GPUs with less memory by decreasing the batch-size -bs parameter.

A2D-Sentences

Window Size Command Checkpoint File mAP Result
10 python main.py -rm eval -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 46.1
8 python main.py -rm eval -c configs/a2d_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 44.7

JHMDB-Sentences

The following commands evaluate our A2D-Sentences-pretrained model on JHMDB-Sentences without additional training.

For this purpose, as explained in our paper, we uniformly sample three frames from each video. To ensure proper reproduction of our results on other machines we include the metadata of the sampled frames under datasets/jhmdb_sentences/jhmdb_sentences_samples_metadata.json. This file is automatically loaded during the evaluation process with the commands below.

To avoid using this file and force sampling different frames, change the seed and generate_new_samples_metadata parameters under MTTR/configs/jhmdb_sentences.yaml.

Window Size Command Checkpoint File mAP Result
10 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 10 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 39.2
8 python main.py -rm eval -c configs/jhmdb_sentences.yaml -ws 8 -bs 3 -ckpt CHECKPOINT_PATH -ng 2 Link 36.6

Refer-YouTube-VOS

The following command evaluates our model on the public validation subset of Refer-YouTube-VOS dataset. Since annotations are not publicly available for this subset, our code generates a zip file with the predicted masks under MTTR/runs/[RUN_DATE_TIME]/validation_outputs/submission_epoch_0.zip. This zip needs to be uploaded to the competition server for evaluation. For your convenience we supply this zip file here as well.

Window Size Command Checkpoint File Output Zip J&F Result
12 python main.py -rm eval -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ckpt CHECKPOINT_PATH -ng 8 Link Link 55.32

Training

First, download the Kinetics-400 pretrained weights of Video Swin Transformer from this link. Note that these weights were originally published in video swin's original repo here.

Place the downloaded file inside your cloned repo directory as MTTR/pretrained_swin_transformer/swin_tiny_patch244_window877_kinetics400_1k.pth.

Next, the following commands can be used to train MTTR as described in our paper.

Note that it may be possible to run some of these commands on GPUs with less memory than the ones mentioned below by decreasing the batch-size -bs or window-size -ws parameters. However, changing these parameters may also affect the final performance of the model.

A2D-Sentences

  • The command for the following configuration was tested on 2 A6000 48GB GPUs:
Window Size Command
10 python main.py -rm train -c configs/a2d_sentences.yaml -ws 10 -bs 3 -ng 2
  • The command for the following configuration was tested on 3 RTX 3090 24GB GPUs:
Window Size Command
8 python main.py -rm train -c configs/a2d_sentences.yaml -ws 8 -bs 2 -ng 3

Refer-YouTube-VOS

  • The command for the following configuration was tested on 4 A6000 48GB GPUs:
Window Size Command
12 python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 12 -bs 1 -ng 4
  • The command for the following configuration was tested on 8 RTX 3090 24GB GPUs.
Window Size Command
8 python main.py -rm train -c configs/refer_youtube_vos.yaml -ws 8 -bs 1 -ng 8

Note that this last configuration was not mentioned in our paper. However, it is more memory efficient than the original configuration (window size 12) while producing a model which is almost as good (J&F of 54.56 in our experiments).

JHMDB-Sentences

As explained in our paper JHMDB-Sentences is used exclusively for evaluation, so training is not supported at this time for this dataset.

An SE(3)-invariant autoencoder for generating the periodic structure of materials

Crystal Diffusion Variational AutoEncoder This software implementes Crystal Diffusion Variational AutoEncoder (CDVAE), which generates the periodic st

Tian Xie 94 Dec 10, 2022
This reporistory contains the test-dev data of the paper "xGQA: Cross-lingual Visual Question Answering".

This reporistory contains the test-dev data of the paper "xGQA: Cross-lingual Visual Question Answering".

AdapterHub 18 Dec 09, 2022
Official implementation of the paper Vision Transformer with Progressive Sampling, ICCV 2021.

Vision Transformer with Progressive Sampling This is the official implementation of the paper Vision Transformer with Progressive Sampling, ICCV 2021.

yuexy 123 Jan 01, 2023
MINERVA: An out-of-the-box GUI tool for offline deep reinforcement learning

MINERVA is an out-of-the-box GUI tool for offline deep reinforcement learning, designed for everyone including non-programmers to do reinforcement learning as a tool.

Takuma Seno 80 Nov 06, 2022
MAUS: A Dataset for Mental Workload Assessment Using Wearable Sensor - Baseline system

MAUS: A Dataset for Mental Workload Assessment Using Wearable Sensor - Baseline system Getting started To start working on this assignment, you should

2 Aug 06, 2022
Predicting Price of house by considering ,house age, Distance from public transport

House-Price-Prediction Predicting Price of house by considering ,house age, Distance from public transport, No of convenient stores around house etc..

Musab Jaleel 1 Jan 08, 2022
Code for Transformer Hawkes Process, ICML 2020.

Transformer Hawkes Process Source code for Transformer Hawkes Process (ICML 2020). Run the code Dependencies Python 3.7. Anaconda contains all the req

Simiao Zuo 111 Dec 26, 2022
Stereo Hybrid Event-Frame (SHEF) Cameras for 3D Perception, IROS 2021

For academic use only. Stereo Hybrid Event-Frame (SHEF) Cameras for 3D Perception Ziwei Wang, Liyuan Pan, Yonhon Ng, Zheyu Zhuang and Robert Mahony Th

Ziwei Wang 11 Jan 04, 2023
Complex-Valued Neural Networks (CVNN)Complex-Valued Neural Networks (CVNN)

Complex-Valued Neural Networks (CVNN) Done by @NEGU93 - J. Agustin Barrachina Using this library, the only difference with a Tensorflow code is that y

youceF 1 Nov 12, 2021
Unified tracking framework with a single appearance model

Paper: Do different tracking tasks require different appearance model? [ArXiv] (comming soon) [Project Page] (comming soon) UniTrack is a simple and U

ZhongdaoWang 300 Dec 24, 2022
KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

KIND (Kessler Italian Named-entities Dataset) KIND is an Italian dataset for Named-Entity Recognition. It contains more than one million tokens with t

Digital Humanities 5 Jun 21, 2022
This repository contains an overview of important follow-up works based on the original Vision Transformer (ViT) by Google.

This repository contains an overview of important follow-up works based on the original Vision Transformer (ViT) by Google.

75 Dec 02, 2022
pytorch implementation of trDesign

trdesign-pytorch This repository is a PyTorch implementation of the trDesign paper based on the official TensorFlow implementation. The initial port o

Learn Ventures Inc. 41 Dec 29, 2022
Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks.

FDRL-PC-Dyspan Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks. This repository contains the entire code

Peyman Tehrani 17 Nov 18, 2022
TorchXRayVision: A library of chest X-ray datasets and models.

torchxrayvision A library for chest X-ray datasets and models. Including pre-trained models. ( 🎬 promo video about the project) Motivation: While the

Machine Learning and Medicine Lab 575 Jan 08, 2023
This repository contains FEDOT - an open-source framework for automated modeling and machine learning (AutoML)

package tests docs license stats support This repository contains FEDOT - an open-source framework for automated modeling and machine learning (AutoML

National Center for Cognitive Research of ITMO University 482 Dec 26, 2022
[PyTorch] Official implementation of CVPR2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency". https://arxiv.org/abs/2103.05465

PointDSC repository PyTorch implementation of PointDSC for CVPR'2021 paper "PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency",

153 Dec 14, 2022
Project page for our ICCV 2021 paper "The Way to my Heart is through Contrastive Learning"

The Way to my Heart is through Contrastive Learning: Remote Photoplethysmography from Unlabelled Video This is the official project page of our ICCV 2

36 Jan 06, 2023
Making self-supervised learning work on molecules by using their 3D geometry to pre-train GNNs. Implemented in DGL and Pytorch Geometric.

3D Infomax improves GNNs for Molecular Property Prediction Video | Paper We pre-train GNNs to understand the geometry of molecules given only their 2D

Hannes Stärk 95 Dec 30, 2022
TensorFlow implementation of Barlow Twins (Barlow Twins: Self-Supervised Learning via Redundancy Reduction)

Barlow-Twins-TF This repository implements Barlow Twins (Barlow Twins: Self-Supervised Learning via Redundancy Reduction) in TensorFlow and demonstrat

Sayak Paul 36 Sep 14, 2022