~1000 book pages + OpenCV + Python = page regions identified as paragraphs, lines, images, captions, etc.

Last update: Dec 06, 2022

Overview

cosc428-structor

I had an open-ended Computer Vision assignment to complete, and an out-of-copyright book that I wanted to turn into an ebook. Conventional OCR engines like Tesseract weren't able to accurately recognise the page structure, which led to many transcription errors. If I could tell Tesseract to ignore certain regions (like images or repeated headers), then I could greatly reduce the number of errors in the resulting ebook. Thus: for my assignment, I wrote a program that takes an image and uses computer vision magick to determine the page's structure. So far, my program can detect and locate:

lines of text,
paragraphs,
section titles,
images and their associated captions,
boilerplate like page numbers, and
chapter titles.

Ain't it grand?

Dependencies

The project is written in Python 2.7.3 and uses the cv2 library to interact with OpenCV. It also uses NumPy for some of the mathematical operations. On Windows, the best way to get these dependencies is to install the Python(x,y) suite (https://code.google.com/p/pythonxy/), which combines Python with a customisable set of scientific computing libraries.

Program Structure

The program's root is main.py, but this simply iterates through images in a folder and constructs a Page instance from each image. Thus, the real work happens in page.py.
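As a rough illustration of that driver loop, here is a minimal sketch. The names `iter_image_paths`, `process_folder`, `make_page`, and the list of file extensions are assumptions made for this example and are not taken from main.py. The sketch is written for Python 3 but avoids syntax that would break on Python 2.7.

```python
import os

# Assumed set of image extensions; main.py may filter files differently.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def iter_image_paths(folder):
    """Yield the paths of image files in `folder`, sorted by file name."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            yield os.path.join(folder, name)


def process_folder(folder, make_page):
    """Build one page object per image by calling `make_page(path)`."""
    return [make_page(path) for path in iter_image_paths(folder)]
```

In the real program, `make_page` would load the image and hand it to the Page constructor. The exact constructor arguments are not documented here, so the sketch leaves them to the caller.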

page.py contains a few utility methods and the Page class. The constructor calls the appropriate methods in order to determine the logical structure of the page. This structure is stored in three objects: self.margin, self.content, and self.boilerplate (which contains such non-content text objects as the page number and header).

The getBuildingBlocks method is responsible for finding words, grouping words into textual lines, discarding marginal noise, and fitting a Margin instance around the remaining lines. Most of these tasks are performed by calling other functions.
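To make the grouping and margin-fitting steps concrete, here is a minimal sketch that works on plain bounding boxes. It is not the code from page.py: the `Box` record, the greedy "vertical centre inside the current line" rule, and the function names are all assumptions for illustration. Word detection and noise removal are left out.

```python
from collections import namedtuple

# Hypothetical record: top-left corner plus width and height, in pixels.
Box = namedtuple("Box", "x y w h")


def group_into_lines(words):
    """Greedily group word boxes into text lines.

    A word joins the current line when its vertical centre falls inside
    that line's vertical extent; otherwise it starts a new line.
    """
    lines = []
    for word in sorted(words, key=lambda b: b.y + b.h / 2.0):
        centre = word.y + word.h / 2.0
        if lines:
            top = min(b.y for b in lines[-1])
            bottom = max(b.y + b.h for b in lines[-1])
            if top <= centre <= bottom:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Order the words in each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def bounding_box(boxes):
    """Return the smallest Box that encloses every box in `boxes`."""
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(left, top, right - left, bottom - top)
```

Calling `bounding_box` on all words that survive noise removal gives a rectangle that plays the role of the margin in this simplified picture.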

The self.content object is found by passing the set of lines to the Content() constructor. This uses a state machine to group lines into figures, paragraphs, section titles, etc. The Content class, along with a class for each content type, is found in content.py.
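The states and transition rules used in content.py are not spelled out here, so the following is only a sketch of the general idea: walk the lines top to bottom, keep track of the kind of block being built, and start a new block when the kind changes or a large vertical gap appears. The `TextLine` record, the two kinds, and both thresholds are illustrative guesses.

```python
from collections import namedtuple

# Hypothetical, simplified line record: top edge and height in pixels.
TextLine = namedtuple("TextLine", "top height")


def group_lines(lines, body_height=12, max_gap=8):
    """Group lines into (kind, [lines]) blocks with a small state machine.

    `body_height` and `max_gap` are made-up thresholds for this example.
    """
    blocks = []
    state = None      # kind of the block currently being built
    previous = None   # last line seen, used to measure the vertical gap
    for line in lines:
        # Classify the line: unusually tall text is treated as a title.
        kind = "title" if line.height > 1.5 * body_height else "paragraph"
        gap = None if previous is None else line.top - (previous.top + previous.height)
        # Transition: open a new block on a kind change or a large gap.
        if state != kind or gap is None or gap > max_gap:
            blocks.append((kind, []))
            state = kind
        blocks[-1][1].append(line)
        previous = line
    return blocks
```

A fuller version would need extra states for figures and captions, which depend on image regions as well as text lines.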

The other files can generally be ignored when trying to understand the program; they are largely just convenience classes which represent page elements (such as points, geometric lines, words, text lines, and boxes), as well as supporting tools such as the Stopwatch.

How to Run the Code

Run main.py using the Python interpreter. This will process each page in ./images, and for each page a series of 'snapshot' images will be displayed to illustrate the algorithm. To show only the final result for each image, set showSteps in main.py to False.

Comments

The getBuildingBlocks

Hello. I have recently been working on a document layout analysis task, and the description in README.md matches it closely. However, when I run the code as described under "How to Run the Code", I only see some red lines on each double word, and there is no result for detecting and locating "lines of text", "paragraphs", "section titles", etc. I would like to know what has happened to the code. Thank you very much.

opened by lvbohui

Releases(v1.0)

v1.0(Nov 7, 2013)

This is the version that I used to write the first draft of my conference paper.

Owner

Chad Oliver
