Extract tables from scanned image PDFs using Optical Character Recognition.

Last update: Dec 06, 2022

Overview

ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

Tesseract OCR
```
sudo apt-get install tesseract-ocr
```
ImageMagick
```
sudo apt-get install imagemagick
```
PDF Utilities
```
sudo apt-get install poppler-utils
```
Python packages
```
python3 -m pip install -r requirements.txt
```
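Before running the scripts, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of this repository. The executable names are assumptions: the packages above usually provide `tesseract`, `convert` (ImageMagick 6; newer releases may ship `magick` instead) and `pdftoppm`.

```python
import shutil

# Assumed executable names; adjust them if your distribution differs
# (for example, ImageMagick 7 installs `magick` rather than `convert`).
REQUIRED_TOOLS = ["tesseract", "convert", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

Calling `missing_tools()` returns an empty list when every tool was found, and otherwise the names that still need to be installed.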

Usage

Clear the pdf/ folder, then copy the PDF files you want to scan into it.
Run the OCR:
```
python3 shellocr.py
```
The extracted text files will be available in the txt/ folder once the process completes.
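For orientation, a pdf/ to txt/ workflow of this kind can be sketched with the tools installed above. This is an illustration under stated assumptions, not the contents of shellocr.py, whose internals are not shown here. The function names, the intermediate png/ folder and the 300 DPI default are all hypothetical.

```python
import glob
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, work_dir, dpi=300):
    """Build the pdftoppm command that renders each page to a PNG.

    pdftoppm appends the page number itself, e.g. <work_dir>/<stem>-1.png.
    """
    prefix = Path(work_dir) / Path(pdf_path).stem
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_cmd(image_path, txt_dir):
    """Build the tesseract command for one page image.

    tesseract adds the .txt extension to the output base name on its own.
    """
    out_base = Path(txt_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(out_base)]


def ocr_folder(pdf_dir="pdf", work_dir="png", txt_dir="txt"):
    """Rasterize every PDF in pdf_dir and OCR each resulting page image."""
    Path(work_dir).mkdir(exist_ok=True)
    Path(txt_dir).mkdir(exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(rasterize_cmd(pdf, work_dir), check=True)
        # Escape the stem so characters such as "[" are matched literally.
        pattern = glob.escape(pdf.stem) + "-*.png"
        for image in sorted(Path(work_dir).glob(pattern)):
            subprocess.run(ocr_cmd(image, txt_dir), check=True)
```

Calling `ocr_folder()` from the project root would write one text file per page into txt/. The two command builders are kept separate from the loop so they can be inspected or adjusted without running anything.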

Alternate

If the method above doesn't work for you, try this alternate one.
Save your file as input.pdf in the root directory.
Run:
```
python3 pdf_miner.py
```
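The script name suggests a pdfminer-based approach, although that is an assumption and the script's internals are not shown here. As a rough sketch of what reading a PDF's embedded text layer looks like with the pdfminer.six package, consider the hypothetical helper below. Note that this only returns text already stored in the PDF; a purely scanned file without a text layer will yield little or nothing.

```python
from pathlib import Path


def extract_text_layer(pdf_path="input.pdf"):
    """Return the embedded text of a PDF via pdfminer.six (no OCR involved)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    # Imported lazily so the path check works even without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))
```

With pdfminer.six installed, `print(extract_text_layer())` would print the text of input.pdf from the current directory.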

Owner

Abhijeet Singh

Mozilla Rep | Software Engineer

