Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that does a first-pass OCR of these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with either library.
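As a rough illustration of the kind of pre-OCR cleanup mentioned above, here is a minimal sketch using Pillow. It is not the code in pdf_to_image.py: the function name clean_for_ocr, the threshold of 160, and the 2x upscale are all assumptions chosen for illustration, and the right values will likely differ from document to document.

```python
from PIL import Image, ImageFilter, ImageOps


def clean_for_ocr(img, threshold=160, upscale=2):
    """Return a black-and-white copy of img intended for OCR.

    Hypothetical helper: the defaults are illustrative guesses,
    not values tuned on the Facebook Papers images.
    """
    # Drop color information; only luminance matters for text.
    gray = ImageOps.grayscale(img)

    # Enlarge the image, since small text in screen photos can be hard to read.
    if upscale > 1:
        gray = gray.resize((gray.width * upscale, gray.height * upscale))

    # Stretch the histogram so dark text and light background separate.
    gray = ImageOps.autocontrast(gray)

    # A small median filter removes isolated speckle noise.
    gray = gray.filter(ImageFilter.MedianFilter(size=3))

    # Binarize: pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)
```

A page image loaded with Image.open can be passed through clean_for_ocr before it is handed to whatever OCR engine is in use. Comparing OCR output with and without the cleanup on a few sample pages is the only reliable way to tell whether it helps.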

These Facebook Papers are especially challenging from an OCR perspective because many of them are photos of a screen, so the base image quality is poor. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
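A keyword pass over the OCR output could look something like the sketch below. It uses only the Python standard library. The function name find_keywords, the directory name, and the assumption that the output is stored as .txt files in a single folder are all hypothetical, so adjust them to the actual layout.

```python
from pathlib import Path


def find_keywords(text_dir, keywords):
    """Yield (file name, line number, line) for each line containing a keyword.

    Matching is case-insensitive substring matching, which tolerates
    some OCR cruft but will also produce false positives.
    """
    lowered = [k.lower() for k in keywords]
    for path in sorted(Path(text_dir).glob("*.txt")):
        # errors="replace" keeps going if the OCR output has bad bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(k in line.lower() for k in lowered):
                yield path.name, number, line.strip()
```

For example, looping over find_keywords("ocr_output", ["ranking", "integrity"]) and printing each tuple gives a quick list of files and lines worth opening in the original PDF.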

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

That said, other, better options exist. For a comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald

Face_mosaic - Mosaic blur processing is applied to multiple faces appearing in the video

This project modifies the TensorFlow Object Detection API code to predict oriented bounding boxes. It can be used for scene text detection.

Python-based tools for document analysis and OCR

How to detect objects in real time using Jupyter Notebook and neural networks, with Yolo3

Code for the paper "Controllable Video Captioning with an Exemplar Sentence"

A bot that plays TFT using OCR. Keeps track of bench, board, items, and plays the user defined team comp.

Health regions (RS), health sectors (SS), and basic health areas (ABS) of Catalunya

Text recognition (optical character recognition) with deep learning methods.

Python library to extract tabular data from images and scanned PDFs

Captcha Recognition

An application of high resolution GANs to dewarp images of perturbed documents

Course material for the Multi-agents and computer graphics course

Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.

A facial recognition program that plays an alarm (mp3 file) when a person is seen in the room. A basic thief detector using Python and OpenCV

A simple Digits Recogniser made in Python

Detects a mathematical formula in a given picture, then extracts it and converts it into LaTeX code

Sort By Face

STEFANN: Scene Text Editor using Font Adaptive Neural Network

Automatically fishes for you while you are afk :)

In this application, we perform face detection from the computer's camera using Python and OpenCV.