Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Last update: Jan 04, 2023

Overview

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

Detects any class given class names (using CLIP).
Trained on the ImageNet-21K dataset with 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
Works for DETR-style detectors.
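
The first feature, detecting any class from its name, can be illustrated with a small sketch. It assumes (this README does not spell out the mechanism) that each candidate region is scored by cosine similarity between a region feature and one text embedding per class name. The 3-d vectors below are toy values chosen by hand, not CLIP outputs, and `l2_normalize` and `classify_region` are hypothetical helper names rather than functions from this repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_region(region_feature, class_embeddings):
    """Score one region feature against every class-name embedding.

    Returns the best-matching class name and the full score table.
    """
    region = l2_normalize(region_feature)
    scores = {}
    for name, embedding in class_embeddings.items():
        unit = l2_normalize(embedding)
        scores[name] = sum(r * e for r, e in zip(region, unit))
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for text embeddings (assumption: a real system would obtain
# these from a text encoder such as CLIP, one vector per class name).
class_embeddings = {
    "headphone": [1.0, 0.1, 0.0],
    "webcam": [0.0, 1.0, 0.1],
    "paper": [0.1, 0.0, 1.0],
}

# Toy stand-in for the feature of one detected region.
region_feature = [0.2, 0.9, 0.1]

best, scores = classify_region(region_feature, class_embeddings)
print(best)
```

Under this assumption, extending the vocabulary only means adding another entry to `class_embeddings`; the scoring code itself does not change.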

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo:

Run our demo using Colab (no GPU needed):

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If set up correctly, the output should look like:

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.
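
When trying several vocabularies, it can help to assemble the demo command programmatically. The sketch below only uses the `demo.py` flags that appear in the commands above; `build_demo_command` is a hypothetical helper, not part of this repository, and it assumes the config and weights files share the same base name, as they do for the one model shown here.

```python
import shlex

# Base name shared by the config file and the downloaded weights in this README.
CONFIG_NAME = "Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size"


def build_demo_command(input_path, output_path, vocabulary="lvis",
                       custom_vocabulary=None, confidence_threshold=None,
                       config_name=CONFIG_NAME):
    """Return the demo.py invocation as an argument list (hypothetical helper)."""
    cmd = [
        "python", "demo.py",
        "--config-file", f"configs/{config_name}.yaml",
        "--input", input_path,
        "--output", output_path,
        "--vocabulary", vocabulary,
    ]
    if vocabulary == "custom":
        if not custom_vocabulary:
            raise ValueError("a custom vocabulary needs at least one class name")
        # Class names are passed as a single comma-separated argument.
        cmd += ["--custom_vocabulary", ",".join(custom_vocabulary)]
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    # --opts comes last, followed by the config override for the weights path.
    cmd += ["--opts", "MODEL.WEIGHTS", f"models/{config_name}.pth"]
    return cmd


cmd = build_demo_command(
    "desk.jpg", "out2.jpg",
    vocabulary="custom",
    custom_vocabulary=["headphone", "webcam", "paper", "coffe"],
    confidence_threshold=0.3,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root; printing it with `shlex.join` first makes it easy to compare against the command above.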

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

Open-vocabulary LVIS

	mask mAP	mask mAP_novel
Box-Supervised	30.2	16.4
Detic	32.4	24.9
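
To make the size of the improvement explicit, the short sketch below subtracts the Box-Supervised row from the Detic row. The numbers are copied from the open-vocabulary LVIS table; `gains` is a hypothetical helper name.

```python
# mask mAP numbers copied from the open-vocabulary LVIS table.
results = {
    "Box-Supervised": {"mask mAP": 30.2, "mask mAP_novel": 16.4},
    "Detic": {"mask mAP": 32.4, "mask mAP_novel": 24.9},
}


def gains(baseline, method):
    """Absolute improvement of `method` over `baseline`, per metric, in mAP points."""
    return {metric: round(method[metric] - baseline[metric], 1) for metric in baseline}


delta = gains(results["Box-Supervised"], results["Detic"])
print(delta)  # {'mask mAP': 2.2, 'mask mAP_novel': 8.5}
```

Most of the gain is on novel classes (+8.5 mask mAP_novel versus +2.2 overall mask mAP).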

Standard LVIS

	Detector/ Backbone	mask mAP	mask mAP_rare
Box-Supervised	CenterNet2-ResNet50	31.5	25.6
Detic	CenterNet2-ResNet50	33.2	29.7
Box-Supervised	CenterNet2-SwinB	40.7	35.9
Detic	CenterNet2-SwinB	41.7	41.7

	Detector/ Backbone	box mAP	box mAP_rare
Box-Supervised	DeformableDETR-ResNet50	31.7	21.4
Detic	DeformableDETR-ResNet50	32.5	26.2

Cross-dataset generalization

	Backbone	Objects365 box mAP	OpenImages box mAP50
Box-Supervised	SwinB	19.1	46.2
Detic	SwinB	21.4	55.2

License

The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE). If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}

Owner

Meta Research
