Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Related tags

Deep LearningHOTR
Overview


Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation)

HOTR: End-to-End Human-Object Interaction Detection with Transformers

HOTR is a novel framework which directly predicts a set of {human, object, interaction} triplets from an image using a transformer-based encoder-decoder. Through the set-level prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing which is the main bottleneck of existing methods. Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.

HOTR is composed of three main components: a shared encoder with a CNN backbone, a parallel decoder, and the recomposition layer to generate final HOI triplets. The overview of our pipeline is presented below.

1. Environmental Setup

$ conda create -n kakaobrain python=3.7
$ conda install -c pytorch pytorch torchvision # PyTorch 1.7.1, torchvision 0.8.2, CUDA=11.0
$ conda install cython scipy
$ pip install pycocotools
$ pip install opencv-python
$ pip install wandb

2. HOI dataset setup

Our current version of HOTR supports the experiments for V-COCO dataset. Download the v-coco dataset under the pulled directory.

# V-COCO setup
$ git clone https://github.com/s-gupta/v-coco.git
$ cd v-coco
$ ln -s [:COCO_DIR] coco/images # COCO_DIR contains images of train2014 & val2014
$ python script_pick_annotations.py [:COCO_DIR]/annotations

If you wish to download the v-coco on our own directory, simply change the 'data_path' argument to the directory you have downloaded the v-coco dataset.

--data_path [:your_own_directory]/v-coco

3. How to Train/Test HOTR on V-COCO dataset

For testing, you can either use your own trained weights and pass the directory to the 'resume' argument, or use our provided weights. Below is the example of how you should edit the Makefile.

# [Makefile]
# Testing your own trained weights
multi_test:
  python -m torch.distributed.launch \
		--nproc_per_node=8 \
    ...
    --resume checkpoints/vcoco/KakaoBrain/multi_run_000001/best.pth # the best performing checkpoint is saved in this format

# Testing our provided trained weights
multi_test:
  python -m torch.distributed.launch \
		--nproc_per_node=8 \
    ...
    --resume checkpoints/vcoco/q16.pth # download the q16.pth as described below.

In order to use our provided weights, you can download the weights from this link. Then, pass the directory of the downloaded file (for example, we put the weights under the directory checkpoints/vcoco/q16.pth) to the 'resume' argument as well.

# multi-gpu training / testing (8 GPUs)
$ make multi_[train/test]

# single-gpu training / testing
$ make single_[train/test]

4. Results

Here, we provide improved results of V-COCO Scenario 1 (58.9 mAP, 0.5ms) from the version of our initial submission (55.2 mAP, 0.9ms). This is obtained "without" applying any priors on the scores (see iCAN).

Epoch # queries Scenario 1 Scenario 2 Checkpoint
100 16 58.9 63.8 download

If you want to use pretrained weights for inference, download the pretrained weights (from the above link) under checkpoints/vcoco/ and match the interaction query argument as described in the weight file (others are already set in the Makefile). Our evaluation code follows the exact implementations of the official python v-coco evaluation. You can test the weights by the command below (e.g., the weight file is named as q16.pth, which denotes that the model uses 16 interaction queries).

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env vcoco_main.py \
    --batch_size 2 \
    --HOIDet \
    --share_enc \
    --pretrained_dec \
    --num_hoi_queries [:query_num] \
    --temperature 0.05 \ # use the exact same temperature value that you used during training!
    --object_threshold 0 \
    --no_aux_loss \
    --eval \
    --dataset_file vcoco \
    --data_path v-coco \
    --resume checkpoints/vcoco/[:query_num].pth

The results will appear as the following:

[Logger] Number of params:  51181950
Evaluation Inference (V-COCO)  [308/308]  eta: 0:00:00    time: 0.2063  data: 0.0127  max mem: 1578
[stats] Total Time (test) : 0:01:05 (0.2114 s / it)
[stats] HOI Recognition Time (avg) : 0.5221 ms
[stats] Distributed Gathering Time : 0:00:49
[stats] Score Matrix Generation completed

============= AP (Role scenario_1) ==============
               hold_obj: AP = 48.99 (#pos = 3608)
              sit_instr: AP = 47.81 (#pos = 1916)
             ride_instr: AP = 67.04 (#pos = 556)
               look_obj: AP = 40.57 (#pos = 3347)
              hit_instr: AP = 76.42 (#pos = 349)
                hit_obj: AP = 71.27 (#pos = 349)
                eat_obj: AP = 55.75 (#pos = 521)
              eat_instr: AP = 67.57 (#pos = 521)
             jump_instr: AP = 71.44 (#pos = 635)
              lay_instr: AP = 57.09 (#pos = 387)
    talk_on_phone_instr: AP = 49.07 (#pos = 285)
              carry_obj: AP = 34.75 (#pos = 472)
              throw_obj: AP = 52.37 (#pos = 244)
              catch_obj: AP = 48.80 (#pos = 246)
              cut_instr: AP = 49.58 (#pos = 269)
                cut_obj: AP = 57.02 (#pos = 269)
 work_on_computer_instr: AP = 67.44 (#pos = 410)
              ski_instr: AP = 49.35 (#pos = 424)
             surf_instr: AP = 77.07 (#pos = 486)
       skateboard_instr: AP = 86.44 (#pos = 417)
            drink_instr: AP = 38.67 (#pos = 82)
               kick_obj: AP = 73.92 (#pos = 180)
               read_obj: AP = 44.81 (#pos = 111)
        snowboard_instr: AP = 81.25 (#pos = 277)
| mAP(role scenario_1): 58.94
----------------------------------------------------

The HOI recognition time is calculated by the end-to-end inference time excluding the object detection time.

5. Auxiliary Loss

HOTR follows the auxiliary loss of DETR, where the loss between the ground truth and each output of the decoder layer is also computed. The ground-truth for the auxiliary outputs are matched with the ground-truth HOI triplets with our proposed Hungarian Matcher.

6. Temperature Hyperparameter, tau

Based on our experimental results, the temperature hyperparameter is sensitive to the number of interaction queries and the coefficient for the index loss and index cost, and the number of decoder layers. Empirically, a larger number of queries require a larger tau, and a smaller coefficient for the loss and cost for HO Pointers requires a smaller tau (e.g., for 16 interaction queries, tau=0.05 for the default set_cost_idx=1, hoi_idx_loss_coef=1, hoi_act_loss_coef=10 shows the best result). The initial version of HOTR (with 55.2 mAP) has been trained with 100 queries, which required a larger tau (tau=0.1). There might be better results than the tau we used in our paper according to these three factors. Feel free to explore yourself!

7. Citation

If you find this code helpful for your research, please cite our paper.

@inproceedings{kim2021hotr,
  title={HOTR: End-to-End Human-Object Interaction Detection with Transformers},
  author    = {Bumsoo Kim and
               Junhyun Lee and
               Jaewoo Kang and
               Eun-Sol Kim and
               Hyunwoo J. Kim},
  booktitle = {CVPR},
  publisher = {IEEE},
  year      = {2021}
}

8. Contact for Issues

Bumsoo Kim, [email protected]

9. License

This project is licensed under the terms of the Apache License 2.0. Copyright 2021 Kakao Brain Corp. https://www.kakaobrain.com All Rights Reserved.

Owner
Kakao Brain
Kakao Brain Corp.
Kakao Brain
Bridging Vision and Language Model

BriVL BriVL (Bridging Vision and Language Model) 是首个中文通用图文多模态大规模预训练模型。BriVL模型在图文检索任务上有着优异的效果,超过了同期其他常见的多模态预训练模型(例如UNITER、CLIP)。 BriVL论文:WenLan: Bridgi

235 Dec 27, 2022
Tensorflow implementation of MIRNet for Low-light image enhancement

MIRNet Tensorflow implementation of the MIRNet architecture as proposed by Learning Enriched Features for Real Image Restoration and Enhancement. Lanu

Soumik Rakshit 91 Jan 06, 2023
A script depending on VASP output for calculating Fermi-Softness.

Fermi softness calculation for Vienna Ab initio Simulation Package (VASP) Update 1.1.0: Big update: Rewrote the code. Use Bader atomic division instea

qslin 11 Nov 08, 2022
Reproducing Results from A Hybrid Approach to Targeting Social Assistance

title author date output Reproducing Results from A Hybrid Approach to Targeting Social Assistance Lendie Follett and Heath Henderson 12/28/2021 html_

Lendie Follett 0 Jan 06, 2022
On Out-of-distribution Detection with Energy-based Models

On Out-of-distribution Detection with Energy-based Models This repository contains the code for the experiments conducted in the paper On Out-of-distr

Sven 19 Aug 07, 2022
Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification.

Easy Few-Shot Learning Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification. This repository is made for you

Sicara 399 Jan 08, 2023
Pixray is an image generation system

Pixray is an image generation system

pixray 883 Jan 07, 2023
This is the official repository for our paper: ''Pruning Self-attentions into Convolutional Layers in Single Path''.

Pruning Self-attentions into Convolutional Layers in Single Path This is the official repository for our paper: Pruning Self-attentions into Convoluti

Zhuang AI Group 77 Dec 26, 2022
Pytorch implementation of "Forward Thinking: Building and Training Neural Networks One Layer at a Time"

forward-thinking-pytorch Pytorch implementation of Forward Thinking: Building and Training Neural Networks One Layer at a Time Requirements Python 2.7

Kim Heecheol 65 Oct 06, 2022
git git《Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking》(CVPR 2021) GitHub:git2] 《Masksembles for Uncertainty Estimation》(CVPR 2021) GitHub:git3]

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li Accepted by CVPR

NingWang 236 Dec 22, 2022
Bayesian dessert for Lasagne

Gelato Bayesian dessert for Lasagne Recent results in Bayesian statistics for constructing robust neural networks have proved that it is one of the be

Maxim Kochurov 84 May 11, 2020
SFD implement with pytorch

S³FD: Single Shot Scale-invariant Face Detector A PyTorch Implementation of Single Shot Scale-invariant Face Detector Description Meanwhile train hand

Jun Li 251 Dec 22, 2022
Automatic labeling, conversion of different data set formats, sample size statistics, model cascade

Simple Gadget Collection for Object Detection Tasks Automatic image annotation Conversion between different annotation formats Obtain statistical info

llt 4 Aug 24, 2022
[UNMAINTAINED] Automated machine learning for analytics & production

auto_ml Automated machine learning for production and analytics Installation pip install auto_ml Getting started from auto_ml import Predictor from au

Preston Parry 1.6k Jan 02, 2023
A small library for creating and manipulating custom JAX Pytree classes

Treeo A small library for creating and manipulating custom JAX Pytree classes Light-weight: has no dependencies other than jax. Compatible: Treeo Tree

Cristian Garcia 58 Nov 23, 2022
Wav2Vec for speech recognition, classification, and audio classification

Soxan در زبان پارسی به نام سخن This repository consists of models, scripts, and notebooks that help you to use all the benefits of Wav2Vec 2.0 in your

Mehrdad Farahani 140 Dec 15, 2022
You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling Transformer-based models are widely used in natural language processi

Zhanpeng Zeng 12 Jan 01, 2023
This is my codes that can visualize the psnr image in testing videos.

CVPR2018-Baseline-PSNRplot This is my codes that can visualize the psnr image in testing videos. Future Frame Prediction for Anomaly Detection – A New

Wenhao Yang 12 May 29, 2021
The self-supervised goal reaching benchmark introduced in Discovering and Achieving Goals via World Models

Lexa-Benchmark Codebase for the self-supervised goal reaching benchmark introduced in 'Discovering and Achieving Goals via World Models'. Setup Create

1 Oct 14, 2021
Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Ponder(ing) Transformer Implementation of a Transformer that learns to adapt the number of computational steps it takes depending on the difficulty of

Phil Wang 65 Oct 04, 2022