MlTr: Multi-label Classification with Transformer

This is official implement of "MlTr: Multi-label Classification with Transformer".

Abstract

The task of multi-label image classification is to recognize all the object labels presented in an image. Though advancing for years, small objects, similar objects and objects with high conditional probability are still the main bottlenecks of previous convolutional neural network(CNN) based models, limited by convolutional kernels' representational capacity. Recent vision transformer networks utilize the self-attention mechanism to extract the feature of pixel granularity, which expresses richer local semantic information, while is insufficient for mining global spatial dependence. In this paper, we point out the three crucial problems that CNN-based methods encounter and explore the possibility of conducting specific transformer modules to settle them. We put forward a Multi-label Transformer architecture(MlTr) constructed with windows partitioning, in-window pixel attention, cross-window attention, particularly improving the performance of multi-label image classification tasks. The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE with 88.5%, 95.8%, and 65.5% respectively.

Pretrained model (Results on MS-COCO2014)

name	resolution	map	params(M)	model	log
mltr-s	224x224	81.9	33	coming soon	coming soon
mltr-m	384x384	86.8	62	coming soon	coming soon
mltr-l	384x384	88.5	108	coming soon	coming soon

Citing artical

Pleadse cite this article as:

@misc{cheng2021mltr,
      title={MlTr: Multi-label Classification with Transformer}, 
      author={Xing Cheng and Hezheng Lin and Xiangyu Wu and Fan Yang and Dong Shen and Zhongyuan Wang and Nian Shi and Honglin Liu},
      year={2021},
      eprint={2106.06195},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Started

Please refer to get_started.

MlTr: Multi-label Classification with Transformer

Related tags

Overview

MlTr: Multi-label Classification with Transformer

Abstract

Pretrained model (Results on MS-COCO2014)

Citing artical

Started

Owner

程星

abess: Fast Best-Subset Selection in Python and R

Code To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment.

CARL provides highly configurable contextual extensions to several well-known RL environments.

Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

Learning Dense Representations of Phrases at Scale (Lee et al., 2020)

An implementation of "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" (ICML 2019).

Retinal vessel segmentation based on GT-UNet

This repository contains implementations and illustrative code to accompany DeepMind publications

A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

Part-Aware Data Augmentation for 3D Object Detection in Point Cloud

Generate indoor scenes with Transformers

A pytorch implementation of the CVPR2021 paper "VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild"

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021)

Python project to take sound as input and output as RGB + Brightness values suitable for DMX

Code for CVPR 2018 paper --- Texture Mapping for 3D Reconstruction with RGB-D Sensor

Official implementation of the paper Do pedestrians pay attention? Eye contact detection for autonomous driving

gACSON software for visualization, processing and analysis of three-dimensional electron microscopy images

GitHub repository for the ICLR Computational Geometry & Topology Challenge 2021

Kroomsa: A search engine for the curious

Spatiotemporal resampling methods for mlr3