Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Last update: Dec 31, 2022

The official code of ABINet (CVPR 2021, Oral).

ABINet uses a vision model and an explicit language model, trained end-to-end, to recognize text in the wild. The language model (BCN) learns a bidirectional language representation by simulating a cloze test, and additionally applies an iterative correction strategy.

Runtime Environment

We provide a pre-built Docker image, built from docker/Dockerfile.

Running in Docker

$ git clone git@github.com:FangShancheng/ABINet.git
$ docker run --gpus all --rm -ti --ipc=host -v $(pwd)/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash

(Untested) Or install the dependencies directly:
```
pip install -r requirements.txt
```

Datasets

Training datasets
1. MJSynth (MJ):
  - Use tools/create_lmdb_dataset.py to convert images into LMDB dataset
  - LMDB dataset BaiduNetdisk(passwd:n23k)
2. SynthText (ST):
  - Use tools/crop_by_word_bb.py to crop images from original SynthText dataset, and convert images into LMDB dataset by tools/create_lmdb_dataset.py
  - LMDB dataset BaiduNetdisk(passwd:n23k)
3. WikiText103, which is only used for pre-training language models:
  - Use notebooks/prepare_wikitext103.ipynb to convert text into CSV format.
  - CSV dataset BaiduNetdisk(passwd:dk01)

Evaluation datasets (LMDB datasets can be downloaded from BaiduNetdisk(passwd:1dbv) or GoogleDrive)
1. ICDAR 2013 (IC13)
2. ICDAR 2015 (IC15)
3. IIIT5K Words (IIIT)
4. Street View Text (SVT)
5. Street View Text-Perspective (SVTP)
6. CUTE80 (CUTE)

The data directory is structured as follows:

data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
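
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library; the function name `missing_entries` and the default root `data` are illustrative assumptions and are not part of the ABINet code base.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the data root.
EXPECTED = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def missing_entries(root="data"):
    """Return the expected entries that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Report anything that is absent from the assumed default location.
for rel in missing_entries("data"):
    print("missing:", rel)
```

An empty result means every file and directory listed in the tree was found; it does not validate the contents of the LMDB or CSV files.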

Pretrained Models

Get the pretrained models from BaiduNetdisk(passwd:kwck) or GoogleDrive. The performance of the pretrained models is summarized as follows:

Model	IC13	SVT	IIIT	IC15	SVTP	CUTE	AVG
ABINet-SV	97.1	92.7	95.2	84.0	86.7	88.5	91.4
ABINet-LV	97.0	93.4	96.4	85.9	89.5	89.2	92.7
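
The AVG column does not match a plain mean of the six accuracies, but it appears consistent with a mean weighted by test-set size. The sketch below checks this under stated assumptions: the sizes for IC13, IC15 and IIIT are read from the directory names above (857, 1811, 3000), while the sizes for SVT (647), SVTP (645) and CUTE (288) are the commonly reported benchmark sizes and are not given in this README.

```python
# Test-set sizes. IC13, IC15 and IIIT come from the directory names in this README;
# SVT, SVTP and CUTE are assumed values (commonly reported sizes).
SIZES = {"IC13": 857, "SVT": 647, "IIIT": 3000, "IC15": 1811, "SVTP": 645, "CUTE": 288}

# Accuracies copied from the table above.
ACCURACY = {
    "ABINet-SV": {"IC13": 97.1, "SVT": 92.7, "IIIT": 95.2, "IC15": 84.0, "SVTP": 86.7, "CUTE": 88.5},
    "ABINet-LV": {"IC13": 97.0, "SVT": 93.4, "IIIT": 96.4, "IC15": 85.9, "SVTP": 89.5, "CUTE": 89.2},
}


def weighted_average(acc, sizes):
    """Mean accuracy weighted by the number of samples in each dataset."""
    total = sum(sizes.values())
    return sum(acc[name] * sizes[name] for name in sizes) / total


for model, acc in ACCURACY.items():
    print(model, round(weighted_average(acc, SIZES), 1))
# ABINet-SV 91.4
# ABINet-LV 92.7
```

Under these assumed sizes the weighted means reproduce the reported AVG values, whereas the unweighted means would be 90.7 and 91.9.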

Training

Pre-train vision model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml

Pre-train language model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml

Train ABINet

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml

Note:

You can set the checkpoint paths for the vision and language models separately to load specific pretrained models, or set them to None to train from scratch.
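
The three training commands above can be chained from a small driver script. This is a sketch only: the helper names `build_stage` and `run_stages` are hypothetical, and running the two pre-training stages before the full model is an assumption based on the order in this README. It prints the commands by default and launches them only when `execute=True`.

```python
import os
import shlex
import subprocess

# Config paths as listed in this README, in the order they are presented.
STAGES = [
    "configs/pretrain_vision_model.yaml",
    "configs/pretrain_language_model.yaml",
    "configs/train_abinet.yaml",
]


def build_stage(config, gpus="0,1,2,3"):
    """Return (env, argv) for one training stage (hypothetical helper)."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    argv = ["python", "main.py", f"--config={config}"]
    return env, argv


def run_stages(execute=False, gpus="0,1,2,3"):
    """Print each stage's command; launch it only when execute=True."""
    for config in STAGES:
        env, argv = build_stage(config, gpus)
        print(f"CUDA_VISIBLE_DEVICES={gpus} {shlex.join(argv)}")
        if execute:
            # check=True stops the sequence if a stage exits with an error.
            subprocess.run(argv, env=env, check=True)


run_stages(execute=False)  # dry run: prints the three commands only
```

Calling `run_stages(execute=True)` from the repository root would run the stages one after another; the checkpoint paths in the configs still need to be set by hand as described in the note above.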

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only

Additional flags:

--checkpoint /path/to/checkpoint: set the path of the model to evaluate
--test_root /path/to/dataset: set the path of the evaluation dataset
--model_eval [alignment|vision]: choose which sub-model to evaluate
--image_only: disable dumping visualizations of attention masks
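
To evaluate on each benchmark in turn, the flags above can be combined per dataset. The sketch below only builds and prints the command lines. The helper name `eval_command` and the default root `data/evaluation` are illustrative assumptions, and passing one dataset directory per run to `--test_root` is inferred from the flag description rather than confirmed by the code.

```python
import shlex

# Evaluation set directory names as they appear in the data directory tree.
EVAL_SETS = ["CUTE80", "IC13_857", "IC15_1811", "IIIT5k_3000", "SVT", "SVTP"]


def eval_command(dataset, checkpoint, model_eval="alignment", eval_root="data/evaluation"):
    """Build the argv for evaluating one dataset (hypothetical helper)."""
    if model_eval not in ("alignment", "vision"):
        raise ValueError("model_eval must be 'alignment' or 'vision'")
    return [
        "python", "main.py",
        "--config=configs/train_abinet.yaml",
        "--phase", "test",
        "--image_only",
        "--checkpoint", checkpoint,
        "--test_root", f"{eval_root}/{dataset}",
        "--model_eval", model_eval,
    ]


# Placeholder checkpoint path, as in the flag description above.
for name in EVAL_SETS:
    print("CUDA_VISIBLE_DEVICES=0", shlex.join(eval_command(name, "/path/to/checkpoint")))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` as a list once the checkpoint path points at a real file.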

Visualization

Successful and failure cases on low-quality images:

Citation

If you find our method useful for your research, please cite:

@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}

License

This project is free for academic research purposes only and is licensed under the 2-clause BSD License; see the LICENSE file for details.

Feel free to contact [email protected] if you have any questions.
