Visual Attention based OCR

Overview

Attention-OCR

Authours: Qi Guo and Yuntian Deng

Visual Attention based OCR. The model first runs a sliding CNN on the image (images are resized to height 32 while preserving aspect ratio). Then an LSTM is stacked on top of the CNN. Finally, an attention model is used as a decoder for producing the final outputs.

example image 0

Prerequsites

Most of our code is written based on Tensorflow, but we also use Keras for the convolution part of our model. Besides, we use python package distance to calculate edit distance for evaluation. (However, that is not mandatory, if distance is not installed, we will do exact match).

Tensorflow: Installation Instructions (tested on 0.12.1)

Distance (Optional):

wget http://www.cs.cmu.edu/~yuntiand/Distance-0.1.3.tar.gz
tar zxf Distance-0.1.3.tar.gz
cd distance; sudo python setup.py install

Usage:

Note: We assume that the working directory is Attention-OCR.

Train

Data Preparation

We need a file (specified by parameter data-path) containing the path of images and the corresponding characters, e.g.:

path/to/image1 abc
path/to/image2 def

And we also need to specify a data-base-dir parameter such that we read the images from path data-base-dir/path/to/image. If data-path contains absolute path of images, then data-base-dir needs to be set to /.

A Toy Example

For a toy example, we have prepared a training dataset of the specified format, which is a subset of Synth 90k

wget http://www.cs.cmu.edu/~yuntiand/sample.tgz
tar zxf sample.tgz
python src/launcher.py --phase=train --data-path=sample/sample.txt --data-base-dir=sample --log-path=log.txt --no-load-model

After a while, you will see something like the following output in log.txt:

...
2016-06-08 20:47:22,335 root  INFO     Created model with fresh parameters.
2016-06-08 20:47:52,852 root  INFO     current_step: 0
2016-06-08 20:48:01,253 root  INFO     step_time: 8.400597, step perplexity: 38.998714
2016-06-08 20:48:01,385 root  INFO     current_step: 1
2016-06-08 20:48:07,166 root  INFO     step_time: 5.781749, step perplexity: 38.998445
2016-06-08 20:48:07,337 root  INFO     current_step: 2
2016-06-08 20:48:12,322 root  INFO     step_time: 4.984972, step perplexity: 39.006730
2016-06-08 20:48:12,347 root  INFO     current_step: 3
2016-06-08 20:48:16,821 root  INFO     step_time: 4.473902, step perplexity: 39.000267
2016-06-08 20:48:16,859 root  INFO     current_step: 4
2016-06-08 20:48:21,452 root  INFO     step_time: 4.593249, step perplexity: 39.009864
2016-06-08 20:48:21,530 root  INFO     current_step: 5
2016-06-08 20:48:25,878 root  INFO     step_time: 4.348195, step perplexity: 38.987707
2016-06-08 20:48:26,016 root  INFO     current_step: 6
2016-06-08 20:48:30,851 root  INFO     step_time: 4.835423, step perplexity: 39.022887

Note that it takes quite a long time to reach convergence, since we are training the CNN and attention model simultaneously.

Test and visualize attention results

The test data format shall be the same as training data format. We have also prepared a test dataset of the specified format, which includes ICDAR03, ICDAR13, IIIT5k and SVT.

wget http://www.cs.cmu.edu/~yuntiand/evaluation_data.tgz
tar zxf evaluation_data.tgz

We also provide a trained model on Synth 90K:

wget http://www.cs.cmu.edu/~yuntiand/model.tgz
tar zxf model.tgz
python src/launcher.py --phase=test --visualize --data-path=evaluation_data/svt/test.txt --data-base-dir=evaluation_data/svt --log-path=log.txt --load-model --model-dir=model --output-dir=results

After a while, you will see something like the following output in log.txt:

2016-06-08 22:36:31,638 root  INFO     Reading model parameters from model/translate.ckpt-47200
2016-06-08 22:36:40,529 root  INFO     Compare word based on edit distance.
2016-06-08 22:36:41,652 root  INFO     step_time: 1.119277, step perplexity: 1.056626
2016-06-08 22:36:41,660 root  INFO     1.000000 out of 1 correct
2016-06-08 22:36:42,358 root  INFO     step_time: 0.696687, step perplexity: 2.003350
2016-06-08 22:36:42,363 root  INFO     1.666667 out of 2 correct
2016-06-08 22:36:42,831 root  INFO     step_time: 0.466550, step perplexity: 1.501963
2016-06-08 22:36:42,835 root  INFO     2.466667 out of 3 correct
2016-06-08 22:36:43,402 root  INFO     step_time: 0.562091, step perplexity: 1.269991
2016-06-08 22:36:43,418 root  INFO     3.366667 out of 4 correct
2016-06-08 22:36:43,897 root  INFO     step_time: 0.477545, step perplexity: 1.072437
2016-06-08 22:36:43,905 root  INFO     4.366667 out of 5 correct
2016-06-08 22:36:44,107 root  INFO     step_time: 0.195361, step perplexity: 2.071796
2016-06-08 22:36:44,127 root  INFO     5.144444 out of 6 correct

Example output images in results/correct (the output directory is set via parameter output-dir and the default is results): (Look closer to see it clearly.)

Format: Image index (predicted/ground truth) Image file

Image 0 (j/j): example image 0

Image 1 (u/u): example image 1

Image 2 (n/n): example image 2

Image 3 (g/g): example image 3

Image 4 (l/l): example image 4

Image 5 (e/e): example image 5

Parameters:

  • Control

    • phase: Determine whether to train or test.
    • visualize: Valid if phase is set to test. Output the attention maps on the original image.
    • load-model: Load model from model-dir or not.
  • Input and output

    • data-base-dir: The base directory of the image path in data-path. If the image path in data-path is absolute path, set it to /.
    • data-path: The path containing data file names and labels. Format per line: image_path characters.
    • model-dir: The directory for saving and loading model parameters (structure is not stored).
    • log-path: The path to put log.
    • output-dir: The path to put visualization results if visualize is set to True.
    • steps-per-checkpoint: Checkpointing (print perplexity, save model) per how many steps
  • Optimization

    • num-epoch: The number of whole data passes.
    • batch-size: Batch size. Only valid if phase is set to train.
    • initial-learning-rate: Initial learning rate, note the we use AdaDelta, so the initial value doe not matter much.
  • Network

    • target-embedding-size: Embedding dimension for each target.
    • attn-use-lstm: Whether or not use LSTM attention decoder cell.
    • attn-num-hidden: Number of hidden units in attention decoder cell.
    • attn-num-layers: Number of layers in attention decoder cell. (Encoder number of hidden units will be attn-num-hidden*attn-num-layers).
    • target-vocab-size: Target vocabulary size. Default is = 26+10+3 # 0: PADDING, 1: GO, 2: EOS, >2: 0-9, a-z

References

Convert a formula to its LaTex source

What You Get Is What You See: A Visual Markup Decompiler

Torch attention OCR

Owner
Yuntian Deng
Yuntian Deng
Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition

STN-OCR: A single Neural Network for Text Detection and Text Recognition This repository contains the code for the paper: STN-OCR: A single Neural Net

Christian Bartz 496 Jan 05, 2023
A novel region proposal network for more general object detection ( including scene text detection ).

DeRPN: Taking a further step toward more general object detection DeRPN is a novel region proposal network which concentrates on improving the adaptiv

Deep Learning and Vision Computing Lab, SCUT 151 Dec 12, 2022
Image processing is one of the most common term in computer vision

Image processing is one of the most common term in computer vision. Computer vision is the process by which computers can understand images and videos, and how they are stored, manipulated, and retri

Happy N. Monday 3 Feb 15, 2022
Repository relating to the CVPR21 paper TimeLens: Event-based Video Frame Interpolation

TimeLens: Event-based Video Frame Interpolation This repository is about the High Speed Event and RGB (HS-ERGB) dataset, used in the 2021 CVPR paper T

Robotics and Perception Group 544 Dec 19, 2022
Slice a single image into multiple pieces and create a dataset from them

OpenCV Image to Dataset Converter Slice a single image of Persian digits into mu

Meysam Parvizi 14 Dec 29, 2022
A dataset handling library for computer vision datasets in LOST-fromat

A dataset handling library for computer vision datasets in LOST-fromat

8 Dec 15, 2022
A set of workflows for corpus building through OCR, post-correction and normalisation

PICCL: Philosophical Integrator of Computational and Corpus Libraries PICCL offers a workflow for corpus building and builds on a variety of tools. Th

Language Machines 41 Dec 27, 2022
Detect the mathematical formula from the given picture and the same formula is extracted and converted into the latex code

Mathematical formulae extractor The goal of this project is to create a learning based system that takes an image of a math formula and returns corres

6 May 22, 2022
CRAFT-Pyotorch:Character Region Awareness for Text Detection Reimplementation for Pytorch

CRAFT-Reimplementation Note:If you have any problems, please comment. Or you can join us weChat group. The QR code will update in issues #49 . Reimple

453 Dec 28, 2022
The Open Source Framework for Machine Vision

SimpleCV Quick Links: About Installation [Docker] (#docker) Ubuntu Virtual Environment Arch Linux Fedora MacOS Windows Raspberry Pi SimpleCV Shell Vid

Sight Machine 2.6k Dec 31, 2022
OCR, Scene-Text-Understanding, Text Recognition

Scene-Text-Understanding Survey [2015-PAMI] Text Detection and Recognition in Imagery: A Survey paper [2014-Front.Comput.Sci] Scene Text Detection and

Alan Tang 354 Dec 12, 2022
Use Youdao OCR API to covert your clipboard image to text.

Alfred Clipboard OCR 注:本仓库基于 oott123/alfred-clipboard-ocr 的逻辑用 Python 重写,换用了有道 AI 的 API,准确率更高,有效防止百度导致隐私泄露等问题,并且有道 AI 初始提供的 50 元体验金对于其资费而言个人用户基本可以永久使用

Junlin Liu 6 Sep 19, 2022
This is a tensorflow re-implementation of PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network.My blog:

PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network Introduction This is a tensorflow re-implementation of PSENet: Shape Robu

Michael liu 498 Dec 30, 2022
Rubik's Cube in pygame with OpenGL

Rubik Rubik's Cube in pygame with OpenGL The script show on the screen a Rubik Cube buit with OpenGL. Then I have also implemented all the possible mo

Gabro 2 Apr 15, 2022
A facial recognition device is a device that takes an image or a video of a human face and compares it to another image faces in a database.

A facial recognition device is a device that takes an image or a video of a human face and compares it to another image faces in a database. The structure, shape and proportions of the faces are comp

Pavankumar Khot 4 Mar 19, 2022
Hiiii this is the Spanish for Linux and win 10 and in the near future the english version of PortScan my new tool on which you can see what ports are Open only with the IP adress.

PortScanner-by-IIT PortScanner es una herramienta programada en Python3. Como su nombre indica esta herramienta escanea los primeros 150 puertos de re

5 Sep 19, 2022
Face Anonymizer - FaceAnonApp v1.0

Face Anonymizer - FaceAnonApp v1.0 Blur faces from image and video files in /data/files folder. Contents Repo of the source files for the FaceAnonApp.

6 Apr 18, 2022
Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

OCRopus 3.2k Dec 31, 2022
Ackermann Line Follower Robot Simulation.

Ackermann Line Follower Robot This is a simulation of a line follower robot that works with steering control based on Stanley: The Robot That Won the

Lucas Mazzetto 2 Apr 16, 2022
Automatically remove the mosaics in images and videos, or add mosaics to them.

Automatically remove the mosaics in images and videos, or add mosaics to them.

Hypo 1.4k Dec 30, 2022