iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT

Official PyTorch implementation and pre-trained models for the paper iBOT: Image BERT Pre-Training with Online Tokenizer.

[arXiv] [BibTex]

[Figure: iBOT framework]

iBOT is a novel self-supervised pre-training framework that performs masked image modeling with self-distillation. Models pre-trained with iBOT exhibit local semantic features, which help them transfer well to downstream tasks at both global and local scales. For example, iBOT achieves strong performance on COCO object detection (51.4 box AP and 44.2 mask AP) and ADE20K semantic segmentation (50.0 mIoU) with a vanilla ViT-B/16. iBOT can also extract semantically meaningful local parts, such as a dog's ear 🐶.

Update 🎉

  • December 2021 - Release the code and pre-trained models.
  • November 2021 - Release the pre-print on arXiv.

Installation

See installation instructions for details.

Training

For a glimpse at the full documentation of iBOT pre-training, please run:

python main_ibot.py --help

iBOT Pre-Training with ViTs

To start iBOT pre-training with a Vision Transformer (ViT), run the following command. JOB_NAME is a user-chosen name that distinguishes experiments; checkpoints for each job are automatically saved into a separate folder.

./run.sh imagenet_pretrain $JOB_NAME vit_{small,base,large} teacher {16,24,64}

The exact arguments to reproduce the models presented in our paper can be found in the args column of the pre-trained models. We also provide the logs for pre-training to help reproducibility.

For example, run iBOT with a ViT-S/16 network on two nodes with 8 GPUs each for 800 epochs using the following command. The resulting checkpoint should reach 75.2% k-NN accuracy, 77.9% linear probing accuracy, and 82.3% fine-tuning accuracy.

./run.sh imagenet_pretrain $JOB_NAME vit_small teacher 16 \
  --teacher_temp 0.07 \
  --warmup_teacher_temp_epochs 30 \
  --norm_last_layer false \
  --epochs 800 \
  --batch_size_per_gpu 64 \
  --shared_head true \
  --out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.25 1 \
  --local_crops_scale 0.05 0.25 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2

iBOT Pre-Training with Swins

This code also supports training iBOT with Swin Transformer (Swin). In the paper, we only conduct experiments on Swin-T with different window sizes:

./run.sh imagenet_pretrain $JOB_NAME swin_tiny teacher {16,40} \
  --patch_size 4 \
  --window_size {7,14}

For example, run iBOT with a Swin-T/14 network on five nodes with 8 GPUs each for 300 epochs using the following command. The resulting checkpoint should reach 76.2% k-NN accuracy and 79.3% linear probing accuracy.

./run.sh imagenet_pretrain $JOB_NAME swin_tiny teacher 40 \
  --teacher_temp 0.07 \
  --warmup_teacher_temp_epochs 30 \
  --norm_last_layer false \
  --epochs 300 \
  --batch_size_per_gpu 26 \
  --shared_head true \
  --out_dim 8192 \
  --local_crops_number 10 \
  --global_crops_scale 0.25 1 \
  --local_crops_scale 0.05 0.25 \
  --pred_ratio 0 0.3 \
  --pred_ratio_var 0 0.2 \
  --pred_start_epoch 50 \
  --patch_size 4 \
  --window_size 14 
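
The two example commands above fix the batch size per GPU rather than the global batch size. The short sketch below works out the implied totals. It assumes that "8 GPUs" means 8 GPUs per node and that the last positional argument of run.sh (16 and 40 in the examples) is the total GPU count; both are inferences from the numbers in this README, not documented behaviour. The RunConfig class is a hypothetical helper, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical helper describing one pre-training launch."""
    name: str
    nodes: int
    gpus_per_node: int          # assumption: "8 GPUs" is per node
    batch_size_per_gpu: int     # value of --batch_size_per_gpu

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def global_batch_size(self) -> int:
        # Each GPU processes batch_size_per_gpu images per step.
        return self.total_gpus * self.batch_size_per_gpu


vit_s16 = RunConfig("ViT-S/16", nodes=2, gpus_per_node=8, batch_size_per_gpu=64)
swin_t14 = RunConfig("Swin-T/14", nodes=5, gpus_per_node=8, batch_size_per_gpu=26)

for cfg in (vit_s16, swin_t14):
    print(f"{cfg.name}: {cfg.total_gpus} GPUs, global batch size {cfg.global_batch_size}")
```

Under these assumptions the two runs use 16 and 40 GPUs, which matches the last positional argument in each command, and global batch sizes of 1024 and 1040 images.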

Pre-Trained Models

You can download either only the weights of the pre-trained backbone, which is used for downstream tasks, or the full checkpoint (full ckpt), which contains the backbone and projection head weights of both the student and teacher networks. For the backbone, s denotes that the student network is selected, while t denotes that the teacher network is selected.

Arch.      Params  k-NN   Linear  Fine-tune  Download
ViT-S/16   21M     74.5%  77.0%   82.3%      backbone (t)  full ckpt  args  logs
Swin-T/7   28M     75.3%  78.6%   \          backbone (t)  full ckpt  args  logs
Swin-T/14  28M     76.2%  79.3%   \          backbone (t)  full ckpt  args  logs
ViT-B/16   85M     77.1%  79.5%   83.8%      backbone (t)  full ckpt  args  logs

We also provide ViT-{B,L}/16 models pre-trained on the ImageNet-22K dataset.

Arch.      Params  k-NN   Linear  Fine-tune  Download
ViT-B/16   85M     71.1%  79.0%   84.4%      backbone (s)  full ckpt  args  logs
ViT-L/16   307M    70.6%  81.7%   86.3%      backbone (s)  full ckpt  args  logs

To extract the backbone from the full checkpoint yourself, run the following command, where KEY is either student or teacher.

WEIGHT_FILE=$OUTPUT_DIR/checkpoint_$KEY.pth

python extract_backbone_weights.py \
  --checkpoint_key $KEY \
  $PRETRAINED \
  $WEIGHT_FILE
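
To illustrate what extracting a backbone involves, here is a minimal sketch; it is not the code of extract_backbone_weights.py. It assumes the full checkpoint is a dictionary with "student" and "teacher" entries, each a state dict whose backbone parameters carry a "backbone." prefix; that layout and the extract_backbone function are assumptions made for illustration. Stripping a leading "module." handles the prefix that PyTorch's DistributedDataParallel wrapper adds to parameter names. Plain strings stand in for tensors so the example runs without PyTorch; in practice the checkpoint would be read with torch.load and the result written with torch.save.

```python
from typing import Dict, Mapping


def extract_backbone(checkpoint: Mapping[str, Mapping[str, object]],
                     key: str) -> Dict[str, object]:
    """Return the backbone entries of checkpoint[key] with prefixes removed."""
    if key not in ("student", "teacher"):
        raise ValueError(f"key must be 'student' or 'teacher', got {key!r}")

    backbone: Dict[str, object] = {}
    for name, value in checkpoint[key].items():
        # Drop the prefix added by DistributedDataParallel, if present.
        if name.startswith("module."):
            name = name[len("module."):]
        # Keep backbone parameters only (assumed prefix); skip the projection head.
        if name.startswith("backbone."):
            backbone[name[len("backbone."):]] = value
    return backbone


# Toy checkpoint: strings stand in for tensors.
full_ckpt = {
    "student": {"module.backbone.cls_token": "s0", "module.head.mlp.0.weight": "s1"},
    "teacher": {"backbone.cls_token": "t0", "head.mlp.0.weight": "t1"},
}

teacher_backbone = extract_backbone(full_ckpt, "teacher")
student_backbone = extract_backbone(full_ckpt, "student")
print(teacher_backbone)
print(student_backbone)
```

The resulting dictionaries contain only backbone parameter names, which is the form a bare backbone model would expect when loading weights.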

Downstream Evaluation

See Evaluating iBOT on Downstream Tasks for details.

Property Analysis

See Analyzing iBOT's Properties for robustness tests, for visualizing self-attention maps, and for extracting sparse correspondence pairs between two images.

Extracting Semantic Patterns

We extract the top-k local classes based on patch tokens, together with their corresponding patches and contexts, by running the following command. We identify very diverse behaviour, such as shared low-level textures and high-level semantics.

python3 -m torch.distributed.launch --nproc_per_node=8 \
    --master_port=${MASTER_PORT:-29500} \
    analysis/extract_pattern/extract_topk_cluster.py \
    --pretrained_path $PRETRAINED \
    --checkpoint {student,teacher} \
    --type patch \
    --topk 36 \
    --patch_window 5 \
    --show_pics 20 \
    --arch vit_small \
    --save_path memory_bank_patch.pth \
    --data_path data/imagenet/val

[Figure: iBOT local part-level pattern layout]

The script also supports extracting the pattern layout on the [CLS] token, which amounts to clustering or unsupervised classification. This property is not induced by the MIM objective, since we also observe it in DINO.

python3 -m torch.distributed.launch --nproc_per_node=8 \
    --master_port=${MASTER_PORT:-29500} \
    analysis/extract_pattern/extract_topk_cluster.py \
    --pretrained_path $PRETRAINED \
    --checkpoint {student,teacher} \
    --type cls \
    --topk 36 \
    --show_pics 20 \
    --arch vit_small \
    --save_path memory_bank_cls.pth \
    --data_path data/imagenet/val

[Figure: iBOT global pattern layout]
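
The idea of treating the [CLS] output as an unsupervised classifier can be sketched as follows. This is an illustration only, not the logic of extract_topk_cluster.py: it assumes each image comes with a score vector over the output dimensions, assigns the image to its highest-scoring dimension, keeps the topk largest clusters, and lists the show_pics most confident images in each. The function name, the ranking of clusters by size, and the toy scores are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def topk_clusters(scores: Sequence[Sequence[float]],
                  topk: int,
                  show_pics: int) -> Dict[int, List[int]]:
    """Map each of the `topk` largest clusters to its most confident image indices."""
    members: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for image_idx, row in enumerate(scores):
        # Assign the image to the dimension with the highest score (arg-max).
        cluster = max(range(len(row)), key=row.__getitem__)
        members[cluster].append((row[cluster], image_idx))

    # Rank clusters by size, largest first; break ties by cluster id.
    ranked = sorted(members, key=lambda c: (-len(members[c]), c))[:topk]

    result: Dict[int, List[int]] = {}
    for cluster in ranked:
        # Within a cluster, most confident images first.
        best = sorted(members[cluster], key=lambda t: (-t[0], t[1]))[:show_pics]
        result[cluster] = [image_idx for _, image_idx in best]
    return result


# Toy scores: 4 images, 3 output dimensions.
scores = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.80, 0.10],
    [0.90, 0.05, 0.05],
]
result = topk_clusters(scores, topk=1, show_pics=2)
print(result)
```

In this toy input, three of the four images fall into cluster 0, so with topk=1 and show_pics=2 only that cluster is kept, represented by its two highest-scoring images.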

Acknowledgement

This repository is built using the DINO repository and the BEiT repository.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Citing iBOT

If you find this repository useful, please consider giving it a star and citing it:

@article{zhou2021ibot,
  title={iBOT: Image BERT Pre-Training with Online Tokenizer},
  author={Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
  journal={arXiv preprint arXiv:2111.07832},
  year={2021}
}