MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Last update: Dec 28, 2022

This repository contains code and links to pre-trained models for MDETR (Modulated DETR). It supports pre-training on data with aligned text and images with box annotations, as well as fine-tuning on tasks that require fine-grained understanding of image and text.

We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

TL;DR. We depart from the fixed, frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
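To make the idea of "class labels are just the relevant words in the text query" concrete, here is a minimal sketch of how a box annotation can be turned into a target over caption tokens. It assumes a plain whitespace tokenizer and a uniform distribution over the tokens that the phrase covers; the function names and the example caption are hypothetical and are not taken from this repository, which uses a learned text encoder and its own tokenizer.

```python
from typing import List, Tuple


def whitespace_token_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-separated tokens."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def soft_token_target(text: str, phrase_spans: List[Tuple[int, int]]) -> List[float]:
    """Uniform distribution over the tokens that overlap any phrase character span."""
    tok_spans = whitespace_token_spans(text)
    hits = [
        any(ts < pe and ps < te for ps, pe in phrase_spans)
        for ts, te in tok_spans
    ]
    n_hits = sum(hits)
    if n_hits == 0:
        raise ValueError("phrase spans do not overlap any token")
    return [1.0 / n_hits if h else 0.0 for h in hits]


# Hypothetical example: one box is annotated with the phrase "pink elephant".
caption = "a pink elephant next to a small dog"
phrase = "pink elephant"
start = caption.index(phrase)
target = soft_token_target(caption, [(start, start + len(phrase))])
print(target)  # mass is split evenly between "pink" and "elephant"
```

Because the target is defined over token positions rather than a fixed class list, any phrase that appears in the caption can serve as a label.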

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Usage

The requirements file has all the dependencies that are needed by MDETR.

We provide instructions on how to install the dependencies via conda. First, clone the repository locally:

git clone https://github.com/ashkamath/mdetr.git

Make a new conda env and activate it:

conda create -n mdetr_env python=3.8
conda activate mdetr_env

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Pre-training

The links to data, steps for data preparation and script for running finetuning can be found in Pretraining Instructions. We also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.

The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@0.5, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.

	Backbone	GQA AP	Flickr AP	Flickr R@1	Refcoco AP	Refcoco R@1	Refcoco+ R@1	Refcocog R@1	Url	Size
1	R101	58.9	75.6	82.5	60.3	72.1	58.0	55.7	model	3GB
2	ENB3	59.5	76.6	82.9	57.6	70.2	56.7	53.8	model	2.4GB
3	ENB5	59.9	76.4	83.7	61.8	73.4	58.8	57.1	model	2.7GB
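For readers unfamiliar with the recall columns, the following sketch shows one common way to compute Recall@k for grounding: a phrase counts as a hit if any of the top-k ranked predicted boxes overlaps the ground-truth box with IoU of at least 0.5. The `>=` comparison, the `(x0, y0, x1, y1)` box layout, and all function names are assumptions made for illustration; the evaluation scripts in the repository are the reference.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked_preds: List[List[Box]], gts: List[Box],
                k: int, iou_thresh: float = 0.5) -> float:
    """Fraction of phrases whose top-k predictions contain a matching box."""
    hits = sum(
        any(box_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)


# Two hypothetical phrases, each with predictions sorted by confidence.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [
    [(0, 0, 10, 9), (50, 50, 60, 60)],    # top-1 already matches
    [(20, 20, 30, 30), (1, 1, 10, 10)],   # only the second box matches
]
print(recall_at_k(preds, gts, k=1), recall_at_k(preds, gts, k=2))
```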

Downstream tasks

Phrase grounding on Flickr30k

Instructions for data preparation and script to run evaluation can be found at Flickr30k Instructions

AnyBox protocol

Backbone	Pre-training Image Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10	url	size
Resnet-101	COCO+VG+Flickr	82.5	92.9	94.9	83.4	93.5	95.3	model	3GB
EfficientNet-B3	COCO+VG+Flickr	82.9	93.2	95.2	84.0	93.8	95.6	model	2.4GB
EfficientNet-B5	COCO+VG+Flickr	83.6	93.4	95.1	84.3	93.9	95.8	model	2.7GB

MergedBox protocol

Backbone	Pre-training Image Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10	url	size
Resnet-101	COCO+VG+Flickr	82.3	91.8	93.7	83.8	92.7	94.4	model	3GB
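The two protocols differ in how a phrase with several ground-truth boxes is scored. As commonly described, AnyBox accepts a prediction that matches any one of the ground-truth boxes, while MergedBox first merges them into the smallest enclosing box and compares against that. The sketch below illustrates this reading; the helper names, the box layout, and the 0.5 threshold with `>=` are assumptions, not code from this repository.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # assumed layout: (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes: List[Box]) -> Tuple[float, float, float, float]:
    """Smallest box that encloses every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def hit_anybox(pred: Box, gt_boxes: List[Box], thresh: float = 0.5) -> bool:
    """Correct if the prediction matches at least one ground-truth box."""
    return any(box_iou(pred, g) >= thresh for g in gt_boxes)


def hit_mergedbox(pred: Box, gt_boxes: List[Box], thresh: float = 0.5) -> bool:
    """Correct if the prediction matches the union box of all ground-truth boxes."""
    return box_iou(pred, merge_boxes(gt_boxes)) >= thresh


# Hypothetical phrase "two dogs" annotated with two separate boxes.
gt = [(0, 0, 10, 10), (20, 0, 30, 10)]
one_dog = (0, 0, 10, 10)      # covers a single instance
both_dogs = (0, 0, 30, 10)    # covers the whole group
print(hit_anybox(one_dog, gt), hit_mergedbox(one_dog, gt))
print(hit_anybox(both_dogs, gt), hit_mergedbox(both_dogs, gt))
```

In this toy case the same prediction is right under one protocol and wrong under the other, which is why results are reported separately for each.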

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg

Instructions for data preparation and script to run finetuning and evaluation can be found at Referring Expression Instructions

RefCOCO

Backbone	Pre-training Image Data	Val	TestA	TestB	url	size
Resnet-101	COCO+VG+Flickr	86.75	89.58	81.41	model	3GB
EfficientNet-B3	COCO+VG+Flickr	87.51	90.40	82.67	model	2.4GB

RefCOCO+

Backbone	Pre-training Image Data	Val	TestA	TestB	url	size
Resnet-101	COCO+VG+Flickr	79.52	84.09	70.62	model	3GB
EfficientNet-B3	COCO+VG+Flickr	81.13	85.52	72.96	model	2.4GB

RefCOCOg

Backbone	Pre-training Image Data	Val	Test	url	size
Resnet-101	COCO+VG+Flickr	81.64	80.89	model	3GB
EfficientNet-B3	COCO+VG+Flickr	83.35	83.31	model	2.4GB

Referring expression segmentation on PhraseCut

Instructions for data preparation and script to run finetuning and evaluation can be found at PhraseCut Instructions

Backbone	M-IoU	Precision @0.5	Precision @0.7	Precision @0.9	url	size
Resnet-101	53.1	56.1	38.9	11.9	model	1.5GB
EfficientNet-B3	53.7	57.5	39.9	11.9	model	1.2GB
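As a rough guide to reading the columns above, the sketch below computes a per-example mask IoU, its mean over examples, and the fraction of examples whose IoU exceeds each threshold. It assumes binary masks given as nested lists and a strict `>` comparison for the precision columns; the function names are hypothetical, and the PhraseCut evaluation code linked in the instructions is the reference.

```python
from typing import Dict, List, Sequence

Mask = List[List[int]]  # assumed: rows of 0/1 pixel labels


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def summarize(ious: Sequence[float],
              thresholds: Sequence[float] = (0.5, 0.7, 0.9)) -> Dict[str, float]:
    """Mean IoU plus the fraction of examples above each IoU threshold."""
    n = len(ious)
    stats = {"mean_iou": sum(ious) / n}
    for t in thresholds:
        stats[f"precision@{t}"] = sum(i > t for i in ious) / n
    return stats


# Tiny 2x2 example: one shared pixel, three pixels in the union.
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))

# Hypothetical per-example IoU values.
stats = summarize([0.95, 0.75, 0.6, 0.2])
print(stats)
```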

Visual question answering on GQA

Instructions for data preparation and scripts to run finetuning and evaluation can be found at GQA Instructions

Backbone	Test-dev	Test-std	url	size
Resnet-101	62.48	61.99	model	3GB
EfficientNet-B5	62.95	62.45	model	2.7GB

Long-tailed few-shot object detection

Instructions for data preparation and scripts to run finetuning and evaluation can be found at LVIS Instructions

Data	AP	AP50	APr	APc	APf	url	size
1%	16.7	25.8	11.2	14.6	19.5	model	3GB
10%	24.2	38.0	20.9	24.9	24.3	model	3GB
100%	22.5	35.2	7.4	22.7	25.0	model	3GB
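The APr, APc and APf columns follow the LVIS convention of averaging AP over rare, common and frequent categories. The sketch below assumes the usual LVIS bucket boundaries (at most 10 training images is rare, 11 to 100 is common, more than 100 is frequent) and uses made-up category names, counts and AP values purely for illustration; it is not the evaluation code used for the table.

```python
from collections import defaultdict
from typing import Dict


def frequency_bucket(num_images: int) -> str:
    """Map a category's training-image count to an LVIS frequency bucket."""
    if num_images <= 10:
        return "r"   # rare
    if num_images <= 100:
        return "c"   # common
    return "f"       # frequent


def bucketed_ap(per_category_ap: Dict[str, float],
                image_counts: Dict[str, int]) -> Dict[str, float]:
    """Average per-category AP overall and within each frequency bucket."""
    buckets = defaultdict(list)
    for name, ap in per_category_ap.items():
        buckets[frequency_bucket(image_counts[name])].append(ap)
    result = {"AP": sum(per_category_ap.values()) / len(per_category_ap)}
    for key, values in buckets.items():
        result[f"AP{key}"] = sum(values) / len(values)
    return result


# Hypothetical categories and numbers, not results from the paper.
ap = {"unicycle": 10.0, "kettle": 30.0, "teapot": 20.0, "person": 40.0}
counts = {"unicycle": 4, "kettle": 50, "teapot": 80, "person": 5000}
summary = bucketed_ap(ap, counts)
print(summary)
```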

Synthetic datasets

Instructions to reproduce our results on CLEVR-based datasets are available at CLEVR instructions

Overall Accuracy	Count	Exist	Compare Number	Query Attribute	Compare Attribute	Url	Size
99.7	99.3	99.9	99.4	99.9	99.9	model	446MB

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful, please give it a star and cite as follows:

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }

Owner

Aishwarya Kamath
