MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Related tags

Deep Learningmdetr
Overview

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

WebsiteColabPaper

This repository contains code and links to pre-trained models for MDETR (Modulated DETR) for pre-training on data having aligned text and images with box annotations, as well as fine-tuning on tasks requiring fine grained understanding of image and text.

We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

MDETR

TL;DR. We depart from the fixed frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free form text, making it possible to detect and reason over novel combination of object classes and attributes.

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Usage

The requirements file has all the dependencies that are needed by MDETR.

We provide instructions how to install dependencies via conda. First, clone the repository locally:

git clone https://github.com/ashkamath/mdetr.git

Make a new conda env and activate it:

conda create -n mdetr_env python=3.8
conda activate mdetr_env

Install the the packages in the requirements.txt:

pip install -r requirements.txt

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Pre-training

The links to data, steps for data preparation and script for running finetuning can be found in Pretraining Instructions We also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.

The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box [email protected], which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.

Backbone GQA Flickr Refcoco Url
Size
AP AP [email protected] AP Refcoco [email protected] Refcoco+ [email protected] Refcocog [email protected]
1 R101 58.9 75.6 82.5 60.3 72.1 58.0 55.7 model 3GB
2 ENB3 59.5 76.6 82.9 57.6 70.2 56.7 53.8 model 2.4GB
3 ENB5 59.9 76.4 83.7 61.8 73.4 58.8 57.1 model 2.7GB

Downstream tasks

Phrase grounding on Flickr30k

Instructions for data preparation and script to run evaluation can be found at Flickr30k Instructions

AnyBox protocol

Backbone Pre-training Image Data Val [email protected] Val [email protected] Val [email protected] Test [email protected] Test [email protected] Test [email protected] url size
Resnet-101 COCO+VG+Flickr 82.5 92.9 94.9 83.4 93.5 95.3 model 3GB
EfficientNet-B3 COCO+VG+Flickr 82.9 93.2 95.2 84.0 93.8 95.6 model 2.4GB
EfficientNet-B5 COCO+VG+Flickr 83.6 93.4 95.1 84.3 93.9 95.8 model 2.7GB

MergedBox protocol

Backbone Pre-training Image Data Val [email protected] Val [email protected] Val [email protected] Test [email protected] Test [email protected] Test [email protected] url size
Resnet-101 COCO+VG+Flickr 82.3 91.8 93.7 83.8 92.7 94.4 model 3GB

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg

Instructions for data preparation and script to run finetuning and evaluation can be found at Referring Expression Instructions

RefCOCO

Backbone Pre-training Image Data Val TestA TestB url size
Resnet-101 COCO+VG+Flickr 86.75 89.58 81.41 model 3GB
EfficientNet-B3 COCO+VG+Flickr 87.51 90.40 82.67 model 2.4GB

RefCOCO+

Backbone Pre-training Image Data Val TestA TestB url size
Resnet-101 COCO+VG+Flickr 79.52 84.09 70.62 model 3GB
EfficientNet-B3 COCO+VG+Flickr 81.13 85.52 72.96 model 2.4GB

RefCOCOg

Backbone Pre-training Image Data Val Test url size
Resnet-101 COCO+VG+Flickr 81.64 80.89 model 3GB
EfficientNet-B3 COCO+VG+Flickr 83.35 83.31 model 2.4GB

Referring expression segmentation on PhraseCut

Instructions for data preparation and script to run finetuning and evaluation can be found at PhraseCut Instructions

Backbone M-IoU Precision @0.5 Precision @0.7 Precision @0.9 url size
Resnet-101 53.1 56.1 38.9 11.9 model 1.5GB
EfficientNet-B3 53.7 57.5 39.9 11.9 model 1.2GB

Visual question answering on GQA

Instructions for data preparation and scripts to run finetuning and evaluation can be found at GQA Instructions

Backbone Test-dev Test-std url size
Resnet-101 62.48 61.99 model 3GB
EfficientNet-B5 62.95 62.45 model 2.7GB

Long-tailed few-shot object detection

Instructions for data preparation and scripts to run finetuning and evaluation can be found at LVIS Instructions

Data AP AP 50 AP r APc AP f url size
1% 16.7 25.8 11.2 14.6 19.5 model 3GB
10% 24.2 38.0 20.9 24.9 24.3 model 3GB
100% 22.5 35.2 7.4 22.7 25.0 model 3GB

Synthetic datasets

Instructions to reproduce our results on CLEVR-based datasets are available at CLEVR instructions

Overall Accuracy Count Exist
Compare Number Query Attribute Compare Attribute Url Size
99.7 99.3 99.9 99.4 99.9 99.9 model 446MB

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful please give it a star and cite as follows! :) :

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }
Owner
Aishwarya Kamath
Find me @ ashkamath.github.io
Aishwarya Kamath
Learned Token Pruning for Transformers

LTP: Learned Token Pruning for Transformers Check our paper for more details. Installation We follow the same installation procedure as the original H

Sehoon Kim 52 Dec 29, 2022
Automatic voice-synthetised summaries of latest research papers on arXiv

PaperWhisperer PaperWhisperer is a Python application that keeps you up-to-date with research papers. How? It retrieves the latest articles from arXiv

Valerio Velardo 124 Dec 20, 2022
List of papers, code and experiments using deep learning for time series forecasting

Deep Learning Time Series Forecasting List of state of the art papers focus on deep learning and resources, code and experiments using deep learning f

Alexander Robles 2k Jan 06, 2023
A system used to detect whether a person is wearing a medical mask or not.

Mask_Detection_System A system used to detect whether a person is wearing a medical mask or not. To open the program, please follow these steps: Make

Mohamed Emad 0 Nov 17, 2022
CFNet: Cascade and Fused Cost Volume for Robust Stereo Matching(CVPR2021)

CFNet(CVPR 2021) This is the implementation of the paper CFNet: Cascade and Fused Cost Volume for Robust Stereo Matching, CVPR 2021, Zhelun Shen, Yuch

106 Dec 28, 2022
RNN Predict Street Commercial Vitality

RNN-for-Predicting-Street-Vitality Code and dataset for Predicting the Vitality of Stores along the Street based on Business Type Sequence via Recurre

Zidong LIU 1 Dec 15, 2021
mbrl-lib is a toolbox for facilitating development of Model-Based Reinforcement Learning algorithms.

mbrl-lib is a toolbox for facilitating development of Model-Based Reinforcement Learning algorithms. It provides easily interchangeable modeling and planning components, and a set of utility function

Facebook Research 724 Jan 04, 2023
Self-Supervised Learning of Event-based Optical Flow with Spiking Neural Networks

Self-Supervised Learning of Event-based Optical Flow with Spiking Neural Networks Work accepted at NeurIPS'21 [paper, video]. If you use this code in

TU Delft 43 Dec 07, 2022
Source code for From Stars to Subgraphs

GNNAsKernel Official code for From Stars to Subgraphs: Uplifting Any GNN with Local Structure Awareness Visualizations GNN-AK(+) GNN-AK(+) with Subgra

44 Dec 19, 2022
A PyTorch Lightning Callback for pushing models to the Hugging Face Hub 🤗⚡️

hf-hub-lightning A callback for pushing lightning models to the Hugging Face Hub. Note: I made this package for myself, mostly...if folks seem to be i

Nathan Raw 27 Dec 14, 2022
Pmapper is a super-resolution and deconvolution toolkit for python 3.6+

pmapper pmapper is a super-resolution and deconvolution toolkit for python 3.6+. PMAP stands for Poisson Maximum A-Posteriori, a highly flexible and a

NASA Jet Propulsion Laboratory 8 Nov 06, 2022
Spectral Tensor Train Parameterization of Deep Learning Layers

Spectral Tensor Train Parameterization of Deep Learning Layers This repository is the official implementation of our AISTATS 2021 paper titled "Spectr

Anton Obukhov 12 Oct 23, 2022
Semantic Segmentation of images using PixelLib with help of Pascalvoc dataset trained with Deeplabv3+ framework.

CARscan- Approach 1 - Segmentation of images by detecting contours. It failed because in images with elements along with cars were also getting detect

Padmanabha Banerjee 5 Jul 29, 2021
DROPO: Sim-to-Real Transfer with Offline Domain Randomization

DROPO: Sim-to-Real Transfer with Offline Domain Randomization Gabriele Tiboni, Karol Arndt, Ville Kyrki. This repository contains the code for the pap

Gabriele Tiboni 8 Dec 19, 2022
DP-CL(Continual Learning with Differential Privacy)

DP-CL(Continual Learning with Differential Privacy) This is the official implementation of the Continual Learning with Differential Privacy. If you us

Phung Lai 3 Nov 04, 2022
Neural Scene Flow Prior (NeurIPS 2021 spotlight)

Neural Scene Flow Prior Xueqian Li, Jhony Kaesemodel Pontes, Simon Lucey Will appear on Thirty-fifth Conference on Neural Information Processing Syste

Lilac Lee 85 Jan 03, 2023
[ArXiv 2021] One-Shot Generative Domain Adaptation

GenDA - One-Shot Generative Domain Adaptation One-Shot Generative Domain Adaptation Ceyuan Yang*, Yujun Shen*, Zhiyi Zhang, Yinghao Xu, Jiapeng Zhu, Z

GenForce: May Generative Force Be with You 46 Dec 19, 2022
Generic U-Net Tensorflow implementation for image segmentation

Tensorflow Unet Warning This project is discontinued in favour of a Tensorflow 2 compatible reimplementation of this project found under https://githu

Joel Akeret 1.8k Dec 10, 2022
This repo provides the base code for pytorch-lightning and weight and biases simultaneous integration.

Write your model faster with pytorch-lightning-wadb-code-backbone This repository provides the base code for pytorch-lightning and weight and biases s

9 Mar 29, 2022
Teaching end to end workflow of deep learning

Deep-Education This repository is now available for public use for teaching end to end workflow of deep learning. This implies that learners/researche

Data Lab at College of William and Mary 2 Sep 26, 2022