I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection

This is the repository for the paper "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection". It includes I3CL with ViTAEv2, ResNet50, and ResNet50 w/ RegionCL backbones.

Updates

[2022/04/13] Publish links of training datasets.

[2022/04/11] Add SSL training code for this implementation.

[2022/04/09] Uploaded the training code for the ICDAR2019 ArT dataset. The GitHub repo is temporarily private.

Other applications of ViTAE Transformer: Image Classification | Object Detection | Semantic Segmentation | Animal Pose Estimation | Matting | Remote Sensing

Introduction

Existing methods for arbitrary-shaped text detection in natural scenes face two critical issues, i.e., 1) fractured detections at the gaps in a text instance; and 2) inaccurate detections of arbitrary-shaped text instances with diverse background context. To address these issues, we propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields, which is able to collaboratively learn better character and gap feature representations at local and long ranges inside a text instance. To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances and a global context module to exploit the semantic context from the shared background, which are able to collaboratively learn more discriminative text feature representations. In this way, I3CL can effectively exploit the intra- and inter-instance dependencies together in a unified end-to-end trainable framework. Besides, to make full use of the unlabeled data, we design an effective semi-supervised learning method to leverage the pseudo labels via an ensemble strategy. Without bells and whistles, experimental results show that the proposed I3CL sets new state-of-the-art results on three challenging public benchmarks, i.e., an F-measure of 77.5% on ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. Notably, our I3CL with the ResNeSt-101 backbone ranked first on the ArT leaderboard.

Results

Example results from the paper.

Evaluation results of I3CL with different backbones on ArT. Note that: (1) In this repo, I3CL with ViTAE uses a single training stage on the LSVT+MLT19+ArT training datasets. The ResNet series uses three training stages, i.e., pre-training on SynthText, mixed training on ReCTS+RCTW+LSVT+MLT19+ArT, and finally finetuning on LSVT+MLT19+ArT. (2) The original implementation of the ResNet series is based on Detectron2. The results and model links for ResNet-50 in this implementation will be updated soon.

Backbone	Model Link	Training Data	Recall	Precision	F-measure
ViTAEv2-S [this repo]	OneDrive / Baidu Wangpan (pw:w754)	LSVT,MLT19,ArT	75.4	82.8	78.9
ResNet-50 [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	71.3	82.7	76.6
ResNet-50 w/ RegionCL(finetuning) [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	72.6	81.9	77.0
ResNet-50 w/ RegionCL(w/o finetuning) [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	73.5	81.6	77.3
ResNeXt-101 [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	74.1	85.5	79.4
ResNeSt-101 [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	75.1	86.3	80.3
ResNeXt-151 [paper]	-	SynthText,ReCTS,RCTW,LSVT,MLT19,ArT	74.9	86.0	80.1
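
As a quick sanity check on the table, the F-measure column can be recomputed from the Recall and Precision columns. The sketch below assumes the F-measure is the usual harmonic mean of the two (F1); the function name `f_measure` is only illustrative and is not part of this repo.

```python
def f_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F-measure) copied from rows of the table above.
rows = {
    "ViTAEv2-S": (75.4, 82.8, 78.9),
    "ResNet-50": (71.3, 82.7, 76.6),
    "ResNeSt-101": (75.1, 86.3, 80.3),
}

for name, (recall, precision, reported) in rows.items():
    computed = f_measure(recall, precision)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

Under that assumption, the recomputed values agree with the reported column to one decimal place.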

Usage

Install

Prerequisites:

Linux (macOS and Windows are not tested)

Python >= 3.6

PyTorch >= 1.8.1 (for the ViTAE implementation). Make sure the CUDA version used for compilation matches the runtime CUDA version.

GCC >= 5

MMCV (We use mmcv-full==1.4.3)

Create a conda virtual environment and activate it. Note that this implementation is based on mmdetection 2.20.0.
Install PyTorch and torchvision following the official instructions.

Install mmcv-full and timm. Refer to the mmcv installation instructions to pick the build that matches your CUDA and PyTorch versions. For example:

pip install mmcv-full==1.4.3 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm

Clone this repository and then install it:

git clone https://github.com/ViTAE-Transformer/ViTAE-Transformer-Scene-Text-Detection.git
cd ViTAE-Transformer-Scene-Text-Detection
pip install -r requirements/build.txt
pip install -r requirements/runtime.txt
pip install -v -e .
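
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report` are hypothetical, and `importlib.metadata` requires Python 3.8 or newer, which is stricter than the Python >= 3.6 prerequisite listed above.

```python
import sys
from importlib import metadata  # available from Python 3.8


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("torch", "torchvision", "mmcv-full", "timm")):
    """Build one human-readable line per package, plus the Python version."""
    lines = [f"python {sys.version_info.major}.{sys.version_info.minor}"]
    for name in packages:
        lines.append(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

If mmcv-full does not report 1.4.3, the wheel index URL in the pip command above probably did not match the installed CUDA and PyTorch versions.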

Preparation

Model:

To train the I3CL model yourself, download the pretrained backbones used in this implementation. ViTAEv2: OneDrive | Baidu Wangpan (pw:petb). ResNet-50 w/ RegionCL (finetuning): OneDrive | Baidu Wangpan (pw:y598). ResNet-50 w/ RegionCL (w/o finetuning): OneDrive | Baidu Wangpan (pw:cybh). For backbone initialization, put them in pretrained_model/ViTAE or pretrained_model/RegionCL.
The full I3CL model with the ViTAE backbone trained on ArT can be downloaded and placed in pretrained_model/I3CL.

Data

COCO-format training datasets are used, along with some offline-augmented ArT training sets. lsvt_test is only used to train the SSL (Semi-Supervised Learning) model in the paper. Files named train_lossweight.json contain the provided pseudo-labels for SSL training. Download the datasets referenced in the config files from the links below and put them in data/:

Dataset	Link (OneDrive)	Link (Baidu Wangpan)
art	Link	Link (pw:etif)
art_light	Link	Link (pw:mzrk)
art_noise	Link	Link (pw:scxi)
art_sig	Link	Link (pw:cdk8)
lsvt	Link	Link (pw:wly0)
lsvt_test	Link	Link (pw:8ha3)
icdar2019_mlt	Link	Link (pw:hmnj)
rctw	Link	Link (pw:ngge)
rects	Link	Link (pw:y00o)
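
Once a dataset is downloaded, a quick look at its annotation file can catch truncated or mismatched downloads. This sketch assumes the files follow the standard COCO layout with top-level `images`, `annotations`, and `categories` lists; whether the `train_lossweight.json` files carry additional fields is not specified here. The function name `summarize_coco` is hypothetical.

```python
import json
from pathlib import Path


def summarize_coco(path):
    """Count the entries under the standard top-level COCO keys."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Missing keys are reported as 0 rather than raising an error.
    return {
        key: len(data.get(key, []))
        for key in ("images", "annotations", "categories")
    }


if __name__ == "__main__":
    annotation_file = Path("data/art/train.json")
    if annotation_file.exists():
        print(summarize_coco(annotation_file))
    else:
        print(f"{annotation_file} not found")
```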

The file structure should look like:

|- data
    |- art
    |   |- train_images
    |   |    |- *.jpg
    |   |- test_images
    |   |    |- *.jpg
    |   |- train.json
    |   |- train_lossweight.json
    |- art_light
    |   |- train_images
    |   |    |- *.jpg
    |   |- train.json
    |   |- train_lossweight.json
    ......
    |- lsvt
    |   |- train_images1
    |   |    |- *.jpg
    |   |- train_images2
    |   |    |- *.jpg
    |   |- train1.json
    |   |- train1_lossweight.json
    |   |- train2.json
    |   |- train2_lossweight.json
    |- lsvt_test
    |   |- train_images
    |   |    |- *.jpg
    |   |- train_lossweight.json
    ......
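
The layout above can be checked before launching a long training run. The sketch below lists only the entries shown explicitly in the tree (the elided "......" datasets are left out), and the helper name `missing_entries` is hypothetical.

```python
from pathlib import Path

# Entries copied from the tree above; elided datasets are intentionally omitted.
EXPECTED = [
    "art/train_images",
    "art/test_images",
    "art/train.json",
    "art/train_lossweight.json",
    "art_light/train_images",
    "art_light/train.json",
    "art_light/train_lossweight.json",
    "lsvt/train_images1",
    "lsvt/train_images2",
    "lsvt/train1.json",
    "lsvt/train1_lossweight.json",
    "lsvt/train2.json",
    "lsvt/train2_lossweight.json",
    "lsvt_test/train_images",
    "lsvt_test/train_lossweight.json",
]


def missing_entries(root="data", expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries():
        print(f"missing: data/{rel}")
```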

Training

Distributed training with 4 GPUs for ViTAE backbone:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_vitae_fpn/i3cl_vitae_fpn_ms_train.py --launcher pytorch --work-dir ./out_dir/${your_dir}

Distributed training with 4 GPUs for ResNet50 backbone:

stage1:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_pretrain.py --launcher pytorch --work-dir ./out_dir/art_r50_pretrain/

stage2:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_mixtrain.py --launcher pytorch --work-dir ./out_dir/art_r50_mixtrain/

stage3:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_finetune.py --launcher pytorch --work-dir ./out_dir/art_r50_finetune/

Distributed training with 4 GPUs for ResNet50 w/ RegionCL backbone:

stage1:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_pretrain.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_pretrain/

stage2:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_mixtrain.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_mixtrain/

stage3:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_finetune.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_finetune/
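
The three ResNet50 stages differ only in the config name and work directory, so the commands above can be generated rather than typed. This is a sketch, not part of the repo: `stage_commands` is a hypothetical helper that only rebuilds the command lines shown above. How each stage picks up the previous stage's checkpoint is determined by the config files and is not handled here.

```python
import shlex

STAGES = ("pretrain", "mixtrain", "finetune")


def stage_commands(config_dir, work_prefix, gpus=4, port=29500):
    """Build one launch command (as an argv list) per training stage."""
    commands = []
    for stage in STAGES:
        commands.append([
            "python", "-m", "torch.distributed.launch",
            f"--nproc_per_node={gpus}", f"--master_port={port}",
            "tools/train.py",
            f"configs/{config_dir}/i3cl_r50_fpn_ms_{stage}.py",
            "--launcher", "pytorch",
            "--work-dir", f"./out_dir/{work_prefix}_{stage}/",
        ])
    return commands


if __name__ == "__main__":
    for argv in stage_commands("i3cl_r50_regioncl_fpn", "art_r50_regioncl"):
        # Printing only; to run the stages in order, one could call
        # subprocess.run(argv, check=True) so a failed stage stops the sequence.
        print(shlex.join(argv))
```

Passing `"i3cl_r50_fpn"` and `"art_r50"` instead reproduces the plain ResNet50 commands.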

Note:

If GPU memory is limited when training I3CL with the ViTAE backbone, adjust img_scale in the configuration file. Setting the maximum scale to (800, 1333) fits a V100 (16G) and has little effect on performance. Change the training scale according to your hardware.
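
To get a feel for how img_scale bounds the input size, and therefore memory, the sketch below computes the resized shape of an image under keep-ratio resizing. It assumes the convention commonly used in mmdetection pipelines, where the larger value of the scale caps the long edge and the smaller value caps the short edge; whether the configs in this repo resize with keep-ratio enabled is not stated here. `keep_ratio_size` is a hypothetical helper, not an API of this repo.

```python
def keep_ratio_size(width, height, img_scale):
    """Resized (width, height) that fits within img_scale, keeping aspect ratio."""
    long_limit, short_limit = max(img_scale), min(img_scale)
    # The tighter of the two edge constraints determines the scale factor.
    factor = min(long_limit / max(width, height),
                 short_limit / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


if __name__ == "__main__":
    for scale in [(800, 1333), (640, 1066)]:
        w, h = keep_ratio_size(2000, 1500, scale)
        print(f"img_scale={scale}: {w}x{h} ({w * h} pixels)")
```

Lowering the scale shrinks the pixel count roughly quadratically, which is why a modest reduction can be enough to fit a smaller GPU.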

Inference

For example, to use our trained I3CL model to produce inference results on the ICDAR2019 ArT test set, including visualization images, txt-format records, and the json file for test submission, run:

python demo/art_demo.py --checkpoint pretrained_model/I3CL/vitae_epoch_12.pth --score-thr 0.45 --json_file art_submission.json

Note:

Upload the saved json file to the ICDAR2019-ArT evaluation website to obtain Recall, Precision and F1 results. Change the paths for saving visualizations and txt files if needed.
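
The `--score-thr 0.45` option in the command above controls which detections are kept. The sketch below illustrates the idea on plain Python data. The detection structure (a dict with `points` and `score`) and the strict `>` comparison are assumptions for illustration and may differ from what demo/art_demo.py does internally.

```python
def filter_by_score(detections, score_thr=0.45):
    """Keep only detections whose confidence exceeds score_thr."""
    return [det for det in detections if det["score"] > score_thr]


if __name__ == "__main__":
    # Hypothetical detections: polygon vertices plus a confidence score.
    detections = [
        {"points": [[10, 10], [90, 12], [88, 40], [9, 38]], "score": 0.93},
        {"points": [[5, 60], [50, 60], [50, 80], [5, 80]], "score": 0.30},
    ]
    kept = filter_by_score(detections)
    print(f"kept {len(kept)} of {len(detections)} detections")
```

A higher threshold generally trades recall for precision, so the value may be worth tuning if the submission results are skewed toward one of the two.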

Citation

This project is for research purposes only.

If you are interested in our work, please consider citing our paper. arXiv

Please post issues to let us know if you encounter any problems.

Acknowledgement

Thanks to mmdetection.
