pdf_sprinkles: sprinkles text in your PDFs

Overview

pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searchable text.

It runs on the command-line or as a web server. The server version can be deployed to App Engine easily.

pdf_sprinkles has only been tested with English-language text, but it should work for most European languages that the Document AI API supports today. It is currently known not to work with RTL languages or CJK scripts.

Installation

pdf_sprinkles is experimental, so it's not packaged yet. To install:

  • Set up Google Cloud Document AI, following the quickstart.

  • Clone this repository and cd to it.

  • Create and activate a virtualenv, pdf_sprinkles$ virtualenv env && . env/bin/activate.

  • Install requirements into the virtualenv, (env) pdf_sprinkles$ pip install -r requirements.txt.

  • Save your location, processor_id and project_id in a flagfile:

    pdf_sprinkles$ cat >flagfile
    --location='your-location' # 'us' or 'eu'
    --processor_id='your-processor-id'
    --project_id='your-project-id'
    pdf_sprinkles$
    

Quickstart

Activate the virtualenv:

  • pdf_sprinkles$ . env/bin/activate

and invoke pdf_sprinkles_cli.py with your input and output:

  • (env) pdf_sprinkles$ ./pdf_sprinkles_cli.py --flagfile=flagfile --input=scan.pdf --output=scan-ocr.pdf

or invoke pdf_sprinkles_web.py and visit it at http://localhost:8888/ :

  • (env) pdf_sprinkles$ ./pdf_sprinkles_web.py --flagfile=flagfile
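
If you have a folder of scans, the CLI invocation above can be wrapped in a short script. This is a minimal sketch, not part of pdf_sprinkles: the helper names build_cli_command and ocr_directory and the NAME-ocr.pdf output naming are assumptions for illustration, and only the --flagfile, --input and --output flags shown above are used.

```python
import subprocess
from pathlib import Path


def build_cli_command(input_pdf, output_pdf, flagfile="flagfile"):
    """Return the argv list for one pdf_sprinkles_cli.py run (hypothetical helper)."""
    return [
        "./pdf_sprinkles_cli.py",
        f"--flagfile={flagfile}",
        f"--input={input_pdf}",
        f"--output={output_pdf}",
    ]


def ocr_directory(scan_dir, out_dir, flagfile="flagfile", run=subprocess.run):
    """OCR every *.pdf in scan_dir, writing NAME-ocr.pdf into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf in sorted(Path(scan_dir).glob("*.pdf")):
        target = out_dir / f"{pdf.stem}-ocr.pdf"
        # check=True raises CalledProcessError if the CLI exits non-zero.
        run(build_cli_command(pdf, target, flagfile), check=True)
        outputs.append(target)
    return outputs
```

Run it from the repository root with the virtualenv active, for example ocr_directory("scans", "scans-ocr"). Each file is one Document AI request, so large batches cost accordingly.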

Usage

pdf_sprinkles_web.py

USAGE: ./pdf_sprinkles_web.py [flags]

./pdf_sprinkles_web.py:

  • --address: Address to bind to. (default: '127.0.0.1')
  • --[no]cloud_logging: Use cloud logging. (default: 'false')
  • --cookie_secret_id: ID of a cookie secret in Secrets Manager
  • --[no]debug: Starts Tornado in debugging mode. (default: 'false')
  • --port: Port to bind to (default: '8888') (an integer)
  • --self_link: If set, displays a self link in the header.

uimodules:

  • --faq_link: If set, displays an FAQ link in the footer.
  • --mailing_list_link: If set, displays a mailing list link in the footer.

pdf_sprinkles_cli.py

USAGE: ./pdf_sprinkles_cli.py [flags]

./pdf_sprinkles_cli.py:

  • --input: Path to input file
  • --output: Path to output file

Shared Flags

These flags can be set for both the CLI and Web frontends.

document_ai_ocr:

  • --location: Location of document processor (default: 'us')
  • --processor_id: ID of document processor
  • --project_id: Google Cloud project ID

third_party.hocr_tools.hocr_pdf:

  • --min_confidence: Minimum confidence of lines to include in output. (default: '0.9') (a number)
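
To illustrate what a confidence threshold like --min_confidence does, here is a minimal sketch. It is not the hocr_pdf implementation: the OcrLine type and filter_lines function are hypothetical, confidence is assumed to be a score between 0 and 1, and whether the real comparison is inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """A recognized line of text with its OCR confidence (hypothetical type)."""
    text: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def filter_lines(lines, min_confidence=0.9):
    """Keep only lines whose confidence is at least min_confidence."""
    return [line for line in lines if line.confidence >= min_confidence]


lines = [
    OcrLine("Invoice #1234", 0.98),
    OcrLine("l1l|l", 0.41),  # likely noise from a smudge or a table rule
    OcrLine("Total due", 0.90),
]
kept = filter_lines(lines)
```

Lowering the threshold makes more of the page searchable at the price of more misrecognized text; raising it does the opposite.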

pdf_sprinkles uses Abseil Flags, so you can define rarely changing flags in a file and import it with --flagfile=FILENAME.

Running on App Engine

IMPORTANT: this is only meant to be used in a trusted environment; Document AI requests are much costlier than normal web requests, and this can rapidly turn into a denial-of-wallet attack if running on the public Internet.

pdf_sprinkles ships with configs to run on a Python 3 Standard Environment runtime. It uses supervisord, with listening port and number of workers controlled by environment variables.
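
As an illustration of environment-driven configuration, the sketch below reads a port and a worker count in Python. It is not how the shipped configs work (they do this in supervisord.conf): the function name is hypothetical, PORT is assumed to be the variable App Engine sets for the listening port, WORKERS is the variable from app.yaml, and the fallback of 1 worker is an assumption.

```python
import os


def read_worker_config(environ=os.environ, default_port=8888, default_workers=1):
    """Return (port, workers) from the environment, falling back to defaults."""
    port = int(environ.get("PORT", default_port))
    workers = int(environ.get("WORKERS", default_workers))
    if workers < 1:
        raise ValueError("WORKERS must be at least 1")
    return port, workers
```

For example, read_worker_config({"PORT": "8080", "WORKERS": "4"}) yields port 8080 and 4 workers.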

Set up config files

  1. Copy app.yaml.example to app.yaml.

  2. Adjust instance size, workers, and scaling to taste. For instance, if you have a busy environment and don't mind a few hundred dollars a month in costs, set:

     env_variables:
       WORKERS: 4
     instance_class: F4_1G

     automatic_scaling:
       min_idle_instances: 1

  3. Copy supervisord.conf.example to supervisord.conf.

  4. Update the flags in supervisord.conf to match the flagfile.

Cookie Secret

The app uses a cookie secret for XSRF protection. Since checking secrets into Git is a bad idea, the secret is stored in Secret Manager instead.

You'll need to set this up on first use.

  1. Generate a 32-byte symmetric key:

    $ head -c 32 /dev/urandom | base64
    BNUV6qSX0YOjatf4kfYBHUKVlD3kw+89hLia5M1Pduw=
    $
    

    and store it in Secret Manager.

  2. Grant the app service account access to the secret and its versions (see IAM Roles, below).

  3. Set --cookie_secret_id in supervisord.conf to match.
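
If /dev/urandom or base64 is not available on your platform, the key from step 1 can be generated with Python's standard library instead. This is a minimal sketch; the function name is hypothetical, and it produces base64 text to match the shell command above.

```python
import base64
import secrets


def generate_cookie_secret(num_bytes=32):
    """Return num_bytes of cryptographically strong randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Prints a fresh key each run; store the output in Secret Manager.
    print(generate_cookie_secret())
```

As with the shell version, never commit the generated value to the repository.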

IAM Roles

The service account for the app needs project-level IAM roles:

  • roles/documentai.apiUser, Document AI > Cloud DocumentAI API User
  • roles/logging.logWriter, Logging > Logs Writer

and needs access to its cookie secret, granted with:

  • roles/secretmanager.secretAccessor, Secret Manager Secret Accessor
  • roles/secretmanager.viewer, Secret Manager Viewer

Deploy

Run pdf_sprinkles$ gcloud app deploy.

License

pdf_sprinkles is licensed under the Apache License, Version 2.0.

Owner
Will Angley