Binarize document images

Last update: Jan 02, 2023

Overview

Binarization

Binarization for document images

Examples

Introduction

This tool performs document image binarization (i.e., it transforms colour/grayscale pixels into black-and-white pixels) for OCR using multiple trained models.

The method is based on Calvo-Zaragoza and Gallego (2018), "A selectional auto-encoder approach for document image binarization".

Installation

Clone the repository, enter it, and run:

pip install .

Models

Pre-trained models can be downloaded from here:

https://qurator-data.de/sbb_binarization/

Usage

sbb_binarize \
  --patches \
  -m <directory with models> \
  <input image> \
  <output image>

Note: In virtually all cases, the --patches flag will improve results.

To use the OCR-D interface:

ocrd-sbb-binarize --overwrite -I INPUT_FILE_GRP -O OCR-D-IMG-BIN -P model "/var/lib/sbb_binarization"

Comments

Handle input errors in exceptions

Hello, I am trying to use an image as input: "sbb_binarize --patches -m ./models/model_bin_sbb_ens.h5 179681.png img.png"

I get the following error:

  File "/home/lin/anaconda3/bin/sbb_binarize", line 8, in <module>
    sys.exit(main())
  File "/home/lin/anaconda3/lib/python3.5/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/lin/anaconda3/lib/python3.5/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/lin/anaconda3/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lin/anaconda3/lib/python3.5/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/lin/anaconda3/lib/python3.5/site-packages/sbb_binarize/cli.py", line 16, in main
    SbbBinarizer(model_dir).run(image_path=input_image, use_patches=patches, save=output_image)
  File "/home/lin/anaconda3/lib/python3.5/site-packages/sbb_binarize/sbb_binarize.py", line 265, in run
    img_last[:, :][img_last[:, :] > 0] = 255
TypeError: 'int' object is not subscriptable

documentation

opened by hiyashi-CianDuo 21

how to use sbb_binarization within a script?

Hi. After searching for numerous hours without success, I am wondering if someone might offer insight on how to run this from within a Python script.

(Using Windows 10 OS, Visual Studio Code.) For example, I can run the following successfully from the terminal:

sbb_binarize --patches -m 'C:/Users/Scott/Desktop/Python2/sbb_binarization/models' 'C:/Users/Scott/Desktop/Python2/Kpics/Pages_cropped/061r.png' 'C:/Users/Scott/Desktop/Python2/Kpics/new_test8.png'

However, if I try the following script (using the CodeRunner extension):

import subprocess
def sbb_def():
    args = ['sbb_binarize', '--patches', '-m', 'C:/Users/Scott/Desktop/Python2/sbb_binarization/models', 'C:/Users/Scott/Desktop/Python2/Kpics/Pages_cropped/061r.png', 'C:/Users/Scott/Desktop/Python2/Kpics/new_test8.png']
    subprocess.Popen(args)
sbb_def()

I get the following:

[Running] C:\ProgramData\Anaconda3\Scripts\activate.bat C:\ProgramData\Anaconda3 & python "c:\Users\Scott\Desktop\Python2\my_sbb_binarization_example.py"
Traceback (most recent call last):
  File "c:\Users\Scott\Desktop\Python2\my_sbb_binarization_example.py", line 8, in <module>
    sbb_def()
  File "c:\Users\Scott\Desktop\Python2\my_sbb_binarization_example.py", line 6, in sbb_def
    subprocess.Popen(args)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

[Done] exited with code=1 in 0.687 seconds

I don't suggest that this is a bug or anything. I'm rather sure the "issue" is mine. I'm very green at python/coding in general. Any help would be greatly appreciated.

opened by SB2020-eye 11
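
A minimal sketch of one way to call the CLI from a Python script follows. It assumes, without confirmation from the thread, that the FileNotFoundError arises because the interpreter started by the editor cannot find the sbb_binarize executable on its PATH. The helper names build_command and run_sbb_binarize are hypothetical; only the --patches and -m options shown in the Usage section are used.

```python
import shutil
import subprocess
from typing import List


def build_command(executable: str, model_dir: str, input_image: str,
                  output_image: str, patches: bool = True) -> List[str]:
    """Assemble the argument list in the order shown in the Usage section."""
    command = [executable]
    if patches:
        command.append("--patches")
    command += ["-m", model_dir, input_image, output_image]
    return command


def run_sbb_binarize(model_dir: str, input_image: str, output_image: str,
                     patches: bool = True) -> int:
    """Run the CLI and return its exit code (hypothetical helper)."""
    # shutil.which searches PATH (and PATHEXT on Windows) for the executable.
    executable = shutil.which("sbb_binarize")
    if executable is None:
        raise FileNotFoundError(
            "sbb_binarize was not found on PATH; activate the environment "
            "it was installed into before running this script"
        )
    command = build_command(executable, model_dir, input_image, output_image, patches)
    # check=True raises CalledProcessError if the tool exits with an error.
    completed = subprocess.run(command, check=True)
    return completed.returncode
```

If the lookup fails, the explicit error message points at the environment rather than at the script, which may be easier to act on than the bare WinError 2.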

Cannot install sbb_binarization (on Windows) - TensorFlow not found (even, if available)

Hi, I am trying to set up your nice tool in a Windows environment. I am using Python 3.8. After running "pip install sbb_binarization" I get the following error:

Collecting ocrd>=2.18.0
  Using cached ocrd-2.20.1-py3-none-any.whl (51 kB)
ERROR: Could not find a version that satisfies the requirement tensorflow<1.16,>=1.15 (from sbb_binarization) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.2.1, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0, 2.3.1, 2.4.0rc0, 2.4.0rc1)
ERROR: No matching distribution found for tensorflow<1.16,>=1.15 (from sbb_binarization)

If I call "pip list", I can see that TensorFlow is installed:

...
setuptools             41.2.0
six                    1.14.0
stomp.py               6.0.0
tensorboard            2.4.0
tensorboard-plugin-wit 1.7.0
tensorflow             2.3.1
tensorflow-estimator   2.3.0
termcolor              1.1.0
urllib3                1.26.2
...

Do you have any idea what to do?

wontfix

opened by stefanCCS 11

Model won't load on Python 3.9

Hey,

After using this model for a while and getting quite remarkable results compared to standard binarization techniques, I would like to move to a newer version of Python: 3.9.

Unfortunately, the model then fails to load with ValueError: bad marshal data (unknown type code). To fix this, I need the raw SBB model so that I can load the weights into it and save it again under the newer Python version.

Is anyone aware of what the exact model is or where I can find it?

Thanks! LudovA

opened by LudovA 7

output is inverted in certain input formats

I sometimes get output which looks like this:

The input image for this was a PNG (which someone seems to have converted somehow from an original JPEG):

(That's from this GT BTW.)
bug

opened by bertsky 6

can't install

Hi. Running on Windows 10 OS. Using Visual Studio Code.

Running (myenvironmentname) PS C:\users\scott\desktop\python2\sbb_binarization> pip install . I keep getting the following:

Processing c:\users\scott\desktop\python2\sbb_binarization
    ERROR: Command errored out with exit status 1:
     command: 'C:\ProgramData\Anaconda3\envs\myenvironmentname\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Scott\\AppData\\Local\\Temp\\pip-req-build-l7egxsl1\\setup.py'"'"'; __file__='"'"'C:\\Users\\Scott\\AppData\\Local\\Temp\\pip-req-build-l7egxsl1\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Scott\AppData\Local\Temp\pip-pip-egg-info-notlemz5'
         cwd: C:\Users\Scott\AppData\Local\Temp\pip-req-build-l7egxsl1\
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Scott\AppData\Local\Temp\pip-req-build-l7egxsl1\setup.py", line 6, in <module>
        with open('./ocrd-tool.json', 'r') as f:
    FileNotFoundError: [Errno 2] No such file or directory: './ocrd-tool.json'
    ----------------------------------------
WARNING: Discarding file:///C:/users/scott/desktop/python2/sbb_binarization. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I'm very new at all this. And my (beginner) language is Python. I don't understand the json stuff. Any help getting this installed would be greatly appreciated. :)

opened by SB2020-eye 6

[transformer_model_integration] "Normal" CLI does not produce useful output

On the transformer_model_integration branch, the normal CLI does not produce useful output.

I'm using the image from https://qurator-data.de/examples/actevedef_718448162.first-page.zip

This fails with a (transparent?) empty output TIFF: sbb_binarize --patches --model-dir ~/devel/qurator-data/sbb_binarization/2022-08-16/ OCR-D-IMG_00000024.tif OCR-D-IMG_00000024-bin.tif

This - the OCR-D CLI - works(!): ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN -P model /home/mike/devel/qurator-data/sbb_binarization/2022-08-16

bug
opened by mikegerber 4

strange border artifacts in patch mode

I sometimes get output which looks like this:

| input | output |
| --- | --- |
| | |

Could this be a problem with the patch size or patching in general? Should I try to crop first?

opened by bertsky 3
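
To make the patch-size question concrete, here is a minimal sketch of one common way to tile an image into fixed-size patches, where the last patch in each direction is shifted back so that it ends exactly at the image border. This is not the tool's actual patching code; the function names patch_origins and patch_boxes are hypothetical and serve only to illustrate where border patches overlap their neighbours.

```python
from typing import List, Tuple


def patch_origins(length: int, patch: int) -> List[int]:
    """Start offsets of patches of size `patch` covering `length` pixels."""
    if length <= patch:
        # The image is smaller than one patch: a single patch at offset 0.
        return [0]
    origins = list(range(0, length - patch + 1, patch))
    if origins[-1] + patch < length:
        # Shift a final patch back so it ends exactly at the border;
        # it overlaps the previous patch instead of running off the image.
        origins.append(length - patch)
    return origins


def patch_boxes(width: int, height: int, patch_w: int,
                patch_h: int) -> List[Tuple[int, int, int, int]]:
    """Boxes (left, top, right, bottom) that together cover the image."""
    boxes = []
    for top in patch_origins(height, patch_h):
        for left in patch_origins(width, patch_w):
            right = min(left + patch_w, width)
            bottom = min(top + patch_h, height)
            boxes.append((left, top, right, bottom))
    return boxes
```

With this layout, any artifacts confined to the right or bottom edge would fall inside the shifted, overlapping patches, which is one place to look when comparing cropped and uncropped inputs.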

v0.0.7 is not on PyPI

This just came up in our team meeting: Version 0.0.7 is not on PyPI.

(Commit history also looks like a bug fix is not in the most recent GitHub release yet but I cannot say if that bug fix warrants a new release or not.)

opened by mikegerber 3

Why is --patches not the default?

The README says:

Note In virtually all cases, applying the --patches flag will improve the quality of results.

Why is it not the default? Why no --no-patches option instead?
documentation

opened by mikegerber 2

Cannot load models in qurator-data git-annex

$ ocrd-sbb-binarize --overwrite -I OCR-D-IMG -O OCR-D-IMG-BIN -P model /var/lib/sbb_binarization
18:35:13.783 INFO processor.SbbBinarize - INPUT FILE 0 / PHYS_0024
18:35:13.787 INFO processor.SbbBinarize - Binarizing on 'page' level in page 'PHYS_0024'
/var/lib/sbb_binarization/.gitkeep
Traceback (most recent call last):
  File "/usr/local/bin/ocrd-sbb-binarize", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sbb_binarize/ocrd_cli.py", line 115, in cli
    return ocrd_cli_wrap_processor(SbbBinarizeProcessor, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ocrd/decorators/__init__.py", line 81, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/usr/local/lib/python3.6/dist-packages/sbb_binarize/ocrd_cli.py", line 66, in process
    bin_image = cv2pil(binarizer.run(image=pil2cv(page_image), use_patches=True))
  File "/usr/local/lib/python3.6/dist-packages/sbb_binarize/sbb_binarize.py", line 199, in run
    res = self.predict(model_in, image, use_patches)
  File "/usr/local/lib/python3.6/dist-packages/sbb_binarize/sbb_binarize.py", line 47, in predict
    model, model_height, model_width, n_classes = self.load_model(model_name)
  File "/usr/local/lib/python3.6/dist-packages/sbb_binarize/sbb_binarize.py", line 40, in load_model
    model = load_model(join(self.model_dir, model_name), compile=False)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 583, in load_model
    with H5Dict(filepath, mode='r') as h5dict:
  File "/usr/local/lib/python3.6/dist-packages/keras/utils/io_utils.py", line 191, in __init__
    self.data = h5py.File(path, mode=mode)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

The directory /var/lib/sbb_binarization is a copy of sbb_binarization/ in our private qurator-data git-annex, which happens to include a file .gitkeep - which the current code tries to load as an HDF5 file.

opened by mikegerber 2
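
The failure above comes from treating every directory entry as a model. A minimal sketch of the kind of filter that avoids it, in the spirit of the "Only try to load *.h5 model files" fix listed for v0.0.2, is shown below. The helper name list_model_files is hypothetical and this is not the project's own code.

```python
from pathlib import Path
from typing import List


def list_model_files(model_dir: str, suffix: str = ".h5") -> List[Path]:
    """Return only regular files with the model suffix, sorted by name."""
    return sorted(
        entry
        for entry in Path(model_dir).iterdir()
        # Dotfiles such as .gitkeep have an empty suffix and are skipped.
        if entry.is_file() and entry.suffix == suffix
    )
```

Filtering by suffix keeps housekeeping files such as .gitkeep away from the HDF5 loader without requiring the directory to be cleaned first.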

packaging inconsistency

If I run sbb_binarize --version, I get:

Traceback (most recent call last):
  File "/bin/sbb_binarize", line 8, in <module>
    sys.exit(main())
  File "/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/lib/python3.6/site-packages/click/core.py", line 1052, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/lib/python3.6/site-packages/click/core.py", line 914, in make_context
    self.parse_args(ctx, args)
  File "/lib/python3.6/site-packages/click/core.py", line 1370, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/lib/python3.6/site-packages/click/core.py", line 2347, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/lib/python3.6/site-packages/click/core.py", line 2309, in process_value
    value = self.callback(ctx, self, value)
  File "/lib/python3.6/site-packages/click/decorators.py", line 383, in callback
    ) from None
RuntimeError: 'sbb_binarize' is not installed. Try passing 'package_name' instead.

Looks like the name=sbb_binarization kwarg is not consistent with the top-level module sbb_binarize IINM.

Maybe you want to restructure your package using qurator as namespace package on that occasion?

opened by bertsky 1
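
The mismatch described above can be illustrated with the standard library: installed versions are looked up by distribution name, which need not match the importable module name. The sketch below assumes Python 3.8 or later (for importlib.metadata); the helper name distribution_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def distribution_version(distribution_name: str, fallback: str = "unknown") -> str:
    """Look up an installed distribution's version by its distribution name."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        # Raised when no installed distribution carries this name, for example
        # when the module name is passed instead of the distribution name.
        return fallback
```

Under the issue's reading, a lookup for "sbb_binarize" would hit the fallback while "sbb_binarization" would succeed, which is consistent with the hint in the error message to pass 'package_name' explicitly.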

Batch-prediction across multiple GPUs and more efficient patch-prediction

In order to batch-binarize thousands of images, I've rewritten the prediction script to allow us to predict around 1500-2000 images per hour on a decent machine with two GPUs.

The proposed changes include:

An efficient way to compute the image patches instead of a very inefficient loop

Complete removal of the prediction on the down-scaled image as the results are pretty much always worse

Batch-prediction code that can binarize an entire directory into a given output directory while preserving the folder structure and skipping images that have already been binarized, to allow stopping and continuing the conversion

Multiprocessing batch-prediction across multiple GPUs using the mpire library

A fix for the memory leak that caused mass-binarization to crash very quickly because we were running out of memory on the GPU. With this fix, the conversion has already been running for 16 hours without any crash.

Simplified loading of the model removing obsolete session-handling code

Please note: I know that the code looks completely different now (hopefully more readable) and is probably not 1:1 compatible with the remaining code in your repository, but I tried to put all the relevant changes into this PR and make the code as self-contained as possible to allow you to update the solution as you see fit.

Thanks for sharing the code-base with us. I hope that this PR is of some help to you.
opened by apacha 3
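
The resumable directory conversion described in the PR can be sketched as a planning step that is independent of the prediction code. The following is an illustration only, not the PR's implementation; the function name plan_batch and the default list of image suffixes are assumptions.

```python
from pathlib import Path
from typing import Iterator, Tuple


def plan_batch(input_dir: str, output_dir: str,
               suffixes: Tuple[str, ...] = (".png", ".tif", ".jpg")
               ) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, target) pairs for images that still need binarizing."""
    input_root = Path(input_dir)
    output_root = Path(output_dir)
    for source in sorted(input_root.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        # Mirror the relative path so the folder structure is preserved.
        target = output_root / source.relative_to(input_root)
        if target.exists():
            # Already binarized in an earlier run: skip, so that an
            # interrupted conversion can be continued.
            continue
        yield source, target
```

A caller would create target.parent before writing each result; keeping the plan separate from prediction also makes it straightforward to split the pairs across worker processes.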

Saving to TIFF does not work

E.g.

% sbb_binarize --patches --model-dir /home/mike/devel/qurator-data/sbb_binarization/2022-08-16/ OCR-D-IMG_00000024.tif OCR-D-IMG_00000024-bin.tif

produces a transparent(?) TIFF with no content. No warning, no error.

Document supported Python versions

sbb_binarization currently needs TensorFlow 2.4, which is not available* for Python 3.10, the default on my Linux installation. Which versions are supported?

* as in: available on PyPI:

ERROR: Could not find a version that satisfies the requirement tensorflow==2.4.* (from sbb-binarization) (from versions: 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1)
ERROR: No matching distribution found for tensorflow==2.4.*

documentation

opened by mikegerber 6

Releases (v0.0.11)

v0.0.11 (Oct 24, 2022)
Added:

Trained models listed in ocrd-tool.json for download with OCR-D resource manager, #53

v0.0.10 (Jul 21, 2022)
Fixed:

Use correct import, s/click/types, #40

v0.0.9 (Apr 26, 2022)
Changed:

Factor setup by @bertsky in https://github.com/qurator-spk/sbb_binarization/pull/31

actually use TF2 (with TF1.compat mode) by @cneud in https://github.com/qurator-spk/sbb_binarization/pull/35

v0.0.8 (May 7, 2021)
Fixed:

handle image smaller than patch size

fix unbound variable error, #27

v0.0.7 (Feb 2, 2021)
Changed:

Use OCR-D/core resource resolving, #25

v0.0.6 (Nov 23, 2020)
Fixed:

Require h5py < 3, qurator-spk/sbb_textline_detection#50, #18

Require tensorflow-gpu (CPU+GPU), not tensorflow (CPU only), #20

v0.0.5 (Nov 2, 2020)
Fixed:

Memory leak, start tf session only once, #17 ht @sulzbals

v0.0.4 (Oct 27, 2020)
Changed:

Env var SBB_BINARIZE_DATA is combined with model param now, #9

v0.0.3 (Oct 27, 2020)
Fixed:

typo broke sbb_binarize CLI, #13

v0.0.2 (Oct 27, 2020)
Changed:

SBB_BINARIZE_DATA can replace model parameter, #6

Fixed:

AlternativeImage/comments now set on page level, #8, #11

Only try to load *.h5 model files, #7, #10


Owner

QURATOR-SPK

Curation Technologies
