ConvMAE: Masked Convolution Meets Masked Autoencoders

Overview

ConvMAE

ConvMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1,

1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 Sensetime Research.

This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently concludes codes and models for the following tasks:

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.

Updates

16/May/2022

The supported codes and models for COCO object detection and instance segmentation are available.

11/May/2022

  1. Pretrained models on ImageNet-1K for ConvMAE.
  2. The supported codes and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint version is public at arxiv.

Introduction

ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.

  • We present the strong and efficient self-supervised framework ConvMAE, which is easy to implement but show outstanding performances on downstream tasks.
  • ConvMAE naturally generates hierarchical representations and exhibit promising performances on object detection and segmentation.
  • ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask-RCNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (48.1 vs. 51.7).

tenser

Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

ConvMAE-Base
pretrained checkpoints download
logs download

Main Results on ImageNet-1K

Models #Params(M) Supervision Encoder Ratio Pretrain Epochs FT [email protected](%) LIN [email protected](%) FT logs/weights LIN logs/weights
BEiT 88 DALLE 100% 300 83.0 37.6 - -
MAE 88 RGB 25% 1600 83.6 67.8 - -
SimMIM 88 RGB 100% 800 84.0 56.7 - -
MaskFeat 88 HOG 100% 300 83.6 N/A - -
data2vec 88 RGB 100% 800 84.2 N/A - -
ConvMAE-B 88 RGB 25% 1600 85.0 70.9 log/weight

Main Results on COCO

Mask R-CNN

Models Pretrain Pretrain Epochs Finetune Epochs #Params(M) FLOPs(T) box AP mask AP logs/weights
Swin-B IN21K w/ labels 300 36 109 0.7 51.4 45.4 -
Swin-L IN21K w/ labels 300 36 218 1.1 52.4 46.2 -
MViTv2-B IN21K w/ labels 300 36 73 0.6 53.1 47.4 -
MViTv2-L IN21K w/ labels 300 36 239 1.3 53.6 47.5 -
Benchmarking-ViT-B IN1K w/o labels 1600 100 118 0.9 50.4 44.9 -
Benchmarking-ViT-L IN1K w/o labels 1600 100 340 1.9 53.3 47.2 -
ViTDet IN1K w/o labels 1600 100 111 0.8 51.2 45.5 -
MIMDet-ViT-B IN1K w/o labels 1600 36 127 1.1 51.5 46.0 -
MIMDet-ViT-L IN1K w/o labels 1600 36 345 2.6 53.3 47.5 -
ConvMAE-B IN1K w/o lables 1600 25 104 0.9 53.2 47.1 log/weight

Main Results on ADE20K

UperNet

Models Pretrain Pretrain Epochs Finetune Iters #Params(M) FLOPs(T) mIoU logs/weights
DeiT-B IN1K w/ labels 300 16K 163 0.6 45.6 -
Swin-B IN1K w/ labels 300 16K 121 0.3 48.1 -
MoCo V3 IN1K 300 16K 163 0.6 47.3 -
DINO IN1K 400 16K 163 0.6 47.2 -
BEiT IN1K+DALLE 1600 16K 163 0.6 47.1 -
PeCo IN1K 300 16K 163 0.6 46.7 -
CAE IN1K+DALLE 800 16K 163 0.6 48.8 -
MAE IN1K 1600 16K 163 0.6 48.1 -
ConvMAE-B IN1K 1600 16K 153 0.6 51.7 soon

Main Results on Kinetics-400

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 100 87 77.8
VideoMAE-B 800 100 87 79.4
VideoMAE-B 1600 100 87 79.8
VideoMAE-B 1600 100 (w/ Repeated Aug) 87 80.7 94.7
SpatioTemporalLearner-B 800 150 (w/ Repeated Aug) 87 81.3 94.9
VideoConvMAE-B 200 100 86 80.1 94.3 Soon
VideoConvMAE-B 800 100 86 81.7 95.1 Soon
VideoConvMAE-B-MSD 800 100 86 82.7 95.5 Soon

Main Results on Something-Something V2

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 40 87 66.1
VideoMAE-B 800 40 87 69.3
VideoMAE-B 2400 40 87 70.3
VideoConvMAE-B 200 40 86 67.7 91.2 Soon
VideoConvMAE-B 800 40 86 69.9 92.4 Soon
VideoConvMAE-B-MSD 800 40 86 70.7 93.0 Soon

Getting Started

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Training and evaluation

Acknowledgement

The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}
Owner
Alpha VL Team of Shanghai AI Lab
Alpha VL Team of Shanghai AI Lab
Semi-supervised Video Deraining with Dynamical Rain Generator (CVPR, 2021, Pytorch)

S2VD Semi-supervised Video Deraining with Dynamical Rain Generator (CVPR, 2021) Requirements and Dependencies Ubuntu 16.04, cuda 10.0 Python 3.6.10, P

Zongsheng Yue 53 Nov 23, 2022
PyBullet CartPole and Quadrotor environments—with CasADi symbolic a priori dynamics—for learning-based control and reinforcement learning

safe-control-gym Physics-based CartPole and Quadrotor Gym environments (using PyBullet) with symbolic a priori dynamics (using CasADi) for learning-ba

Dynamic Systems Lab 300 Dec 28, 2022
The trained model and denoising example for paper : Cardiopulmonary Auscultation Enhancement with a Two-Stage Noise Cancellation Approach

The trained model and denoising example for paper : Cardiopulmonary Auscultation Enhancement with a Two-Stage Noise Cancellation Approach

ycj_project 1 Jan 18, 2022
Pytorch implementation for the Temporal and Object Quantification Networks (TOQ-Nets).

TOQ-Nets-PyTorch-Release Pytorch implementation for the Temporal and Object Quantification Networks (TOQ-Nets). Temporal and Object Quantification Net

Zhezheng Luo 9 Jun 30, 2022
CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY

M-BERT-Study CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY Motivation Multilingual BERT (M-BERT) has shown surprising cross lingual a

CogComp 1 Feb 28, 2022
Official Repo for ICCV2021 Paper: Learning to Regress Bodies from Images using Differentiable Semantic Rendering

[ICCV2021] Learning to Regress Bodies from Images using Differentiable Semantic Rendering Getting Started DSR has been implemented and tested on Ubunt

Sai Kumar Dwivedi 83 Nov 27, 2022
JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces

JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces JAXMAPP is a JAX-based library for multi-agent path planning (MAPP) in c

OMRON SINIC X 24 Dec 28, 2022
This is the source code of the solver used to compete in the International Timetabling Competition 2019.

ITC2019 Solver This is the source code of the solver used to compete in the International Timetabling Competition 2019. Building .NET Core (2.1 or hig

Edon Gashi 8 Jan 22, 2022
Repository for RNNs using TensorFlow and Keras - LSTM and GRU Implementation from Scratch - Simple Classification and Regression Problem using RNNs

RNN 01- RNN_Classification Simple RNN training for classification task of 3 signal: Sine, Square, Triangle. 02- RNN_Regression Simple RNN training for

Nahid Ebrahimian 13 Dec 13, 2022
Codes and models of NeurIPS2021 paper - DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks

DominoSearch This is repository for codes and models of NeurIPS2021 paper - DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense n

11 Sep 10, 2022
Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch

Enformer - Pytorch (wip) Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch. The original tensorflow

Phil Wang 235 Dec 27, 2022
Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CVPR 2021)

Semi-supervised Semantic Segmentation with Directional Context-aware Consistency (CAC) Xin Lai*, Zhuotao Tian*, Li Jiang, Shu Liu, Hengshuang Zhao, Li

Jia Research Lab 137 Dec 14, 2022
This repository is for our paper Exploiting Scene Graphs for Human-Object Interaction Detection accepted by ICCV 2021.

SG2HOI This repository is for our paper Exploiting Scene Graphs for Human-Object Interaction Detection accepted by ICCV 2021. Installation Pytorch 1.7

HT 10 Dec 20, 2022
Corgis are the cutest creatures; have 30K of them!

corgi-net This is a dataset of corgi images scraped from the corgi subreddit. After filtering using an ImageNet classifier, the training set consists

Alex Nichol 6 Dec 24, 2022
Solution of Kaggle competition: Sartorius - Cell Instance Segmentation

Sartorius - Cell Instance Segmentation https://www.kaggle.com/c/sartorius-cell-instance-segmentation Environment setup Build docker image bash .dev_sc

68 Dec 09, 2022
This repository contains the official MATLAB implementation of the TDA method for reverse image filtering

ReverseFilter TDA This repository contains the official MATLAB implementation of the TDA method for reverse image filtering proposed in the paper: "Re

Fergaletto 2 Dec 13, 2021
Physics-informed Neural Operator for Learning Partial Differential Equation

PINO Physics-informed Neural Operator for Learning Partial Differential Equation Abstract: Machine learning methods have recently shown promise in sol

107 Jan 02, 2023
Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"

CSA: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking PyTorch training code for CSA (Contextual Similarity Aggregation). We

Hui Wu 19 Oct 21, 2022
Concept drift monitoring for HA model servers.

{Fast, Correct, Simple} - pick three Easily compare training and production ML data & model distributions Goals Boxkite is an instrumentation library

98 Dec 15, 2022
AI Based Smart Exam Proctoring Package

AI Based Smart Exam Proctoring Package It takes image (base64) as input: Provide Output as: Detection of Mobile phone. Detection of More than 1 person

NARENDER KESWANI 3 Sep 09, 2022