pdf_sprinkles: sprinkles text in your PDFs

Overview

pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searchable text.

It runs on the command line or as a web server. The server version can be deployed to App Engine easily.

pdf_sprinkles has only been tested with English-language text, but it should work for most European languages that the Document AI API supports today. It is currently known not to work with RTL languages or CJK scripts.

Installation

pdf_sprinkles is experimental, so it's not packaged yet. To install:

  • Set up Google Cloud Document AI, following the quickstart.

  • Clone this repository and cd to it.

  • Create a virtualenv: pdf_sprinkles$ virtualenv env

  • Install requirements: pdf_sprinkles$ pip install -r requirements.txt

  • Save your location, processor_id and project_id in a flagfile:

    pdf_sprinkles$ cat >flagfile
    --location='your-location' # 'us' or 'eu'
    --processor_id='your-processor-id'
    --project_id='your-project-id'
    pdf_sprinkles$
    

Quickstart

Activate the virtualenv:

  • pdf_sprinkles$ . env/bin/activate

and invoke pdf_sprinkles_cli.py with your input and output:

  • (env) pdf_sprinkles$ ./pdf_sprinkles_cli.py --flagfile=flagfile --input=scan.pdf --output=scan-ocr.pdf

or invoke pdf_sprinkles_web.py and visit it at http://localhost:8888/ :

  • (env) pdf_sprinkles$ ./pdf_sprinkles_web.py --flagfile=flagfile
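
To OCR a whole directory of scans, the CLI invocation above can be wrapped in a short script. The following is a minimal sketch, not part of pdf_sprinkles: the helper names build_command and ocr_directory and the "-ocr.pdf" output naming are assumptions made for illustration, and only the --flagfile, --input and --output flags shown above are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, out_dir: Path, flagfile: str = "flagfile") -> list[str]:
    """Build the CLI invocation for one PDF (hypothetical helper)."""
    # Assumption: outputs are named <input stem>-ocr.pdf, as in the example above.
    output = out_dir / f"{pdf.stem}-ocr.pdf"
    return [
        "./pdf_sprinkles_cli.py",
        f"--flagfile={flagfile}",
        f"--input={pdf}",
        f"--output={output}",
    ]


def ocr_directory(in_dir: Path, out_dir: Path) -> None:
    """Run the CLI once per PDF in in_dir, stopping at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(in_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(pdf, out_dir), check=True)
```

With the virtualenv active and the repository as the working directory, calling ocr_directory(Path("scans"), Path("scans-ocr")) processes every PDF in scans/. Each file is a separate Document AI request, so the cost grows with the number of files.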

Usage

pdf_sprinkles_web.py

USAGE: ./pdf_sprinkles_web.py [flags]

./pdf_sprinkles_web.py:

  • --address: Address to bind to. (default: '127.0.0.1')
  • --[no]cloud_logging: Use cloud logging. (default: 'false')
  • --cookie_secret_id: ID of a cookie secret in Secrets Manager
  • --[no]debug: Starts Tornado in debugging mode. (default: 'false')
  • --port: Port to bind to (default: '8888') (an integer)
  • --self_link: If set, displays a self link in the header.

uimodules:

  • --faq_link: If set, displays an FAQ link in the footer.
  • --mailing_list_link: If set, displays a mailing list link in the footer.

pdf_sprinkles_cli.py

USAGE: ./pdf_sprinkles_cli.py [flags]

./pdf_sprinkles_cli.py:

  • --input: Path to input file
  • --output: Path to output file

Shared Flags

These flags can be set for both the CLI and Web frontends.

document_ai_ocr:

  • --location: Location of document processor (default: 'us')
  • --processor_id: ID of document processor
  • --project_id: Google Cloud project ID
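
These three flags together identify one Document AI processor. The sketch below shows how such values are typically combined when calling Document AI through the google-cloud-documentai client library. It is an illustration under that assumption, not pdf_sprinkles' own code, and the function names api_endpoint, processor_name and ocr_pdf are hypothetical.

```python
def api_endpoint(location: str) -> str:
    """Regional Document AI endpoint for a location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full resource name of the processor that the three flags identify."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def ocr_pdf(pdf_bytes: bytes, project_id: str, location: str, processor_id: str) -> str:
    """Send one PDF to Document AI and return the recognized text."""
    # Imported lazily so the helpers above work without the client library.
    # Assumption: google-cloud-documentai is installed and credentials are set up.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    request = documentai.ProcessRequest(
        name=processor_name(project_id, location, processor_id),
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    return client.process_document(request=request).document.text
```

The location appears twice, once in the endpoint host name and once in the resource name, which is why a processor created in 'eu' cannot be reached with --location=us.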

third_party.hocr_tools.hocr_pdf:

  • --min_confidence: Minimum confidence of lines to include in output. (default: '0.9') (a number)
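
To illustrate what a confidence threshold does, here is a minimal sketch. The OcrLine type and keep_confident function are hypothetical stand-ins, not the hocr_pdf implementation, and whether a line at exactly the threshold is kept is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical stand-in for one recognized line of text."""
    text: str
    confidence: float  # between 0.0 and 1.0


def keep_confident(lines, min_confidence=0.9):
    """Return only the lines whose confidence meets the threshold."""
    # Assumption: a line at exactly the threshold is kept (>=).
    return [line for line in lines if line.confidence >= min_confidence]


lines = [
    OcrLine("Invoice 2021", 0.98),
    OcrLine("~~smudge~~", 0.41),
    OcrLine("Total due", 0.90),
]
for line in keep_confident(lines):
    print(line.text)
```

Lowering the threshold makes more of the page searchable at the risk of embedding misread text; raising it does the opposite.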

pdf_sprinkles uses Abseil Flags, so you can define rarely changing flags in a file and import it with --flagfile=FILENAME.

Running on App Engine

IMPORTANT: this is only meant to be used in a trusted environment; Document AI requests are much costlier than normal web requests, and this can rapidly turn into a denial-of-wallet attack if running on the public Internet.

pdf_sprinkles ships with configs to run on a Python 3 Standard Environment runtime. It uses supervisord, with listening port and number of workers controlled by environment variables.

Set up config files

  1. Copy app.yaml.example to app.yaml.

  2. Adjust instance size / workers / scaling to taste. For instance, if you have a busy environment and don't mind a few hundred dollars a month in costs, set:

     env_variables:
       WORKERS: 4
     instance_class: F4_1G

     automatic_scaling:
       min_idle_instances: 1

  3. Copy supervisord.conf.example to supervisord.conf.

  4. Update flags in supervisord.conf to match the flagfile.

Cookie Secret

The app can use a cookie secret for XSRF protection. Since checking secrets into Git is a bad idea, we use Secret Manager instead.

You'll need to set this up on first use.

  1. Generate a 32-byte symmetric key:

    $ head -c 32 /dev/urandom | base64
    BNUV6qSX0YOjatf4kfYBHUKVlD3kw+89hLia5M1Pduw=
    $
    

    and store it in Secret Manager.

  2. Grant the app service account access to the secret and its versions (see IAM Roles, below).

  3. Set --cookie_secret_id in supervisord.conf to match.
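
On systems without head or base64, the key from step 1 can be generated with the Python standard library instead. This is a sketch of an equivalent, and the function name generate_cookie_secret is an assumption made for illustration.

```python
import base64
import secrets


def generate_cookie_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of OS randomness, base64-encoded like the shell example."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


# Prints a fresh 44-character key on each run; store it in Secret Manager.
print(generate_cookie_secret())
```

secrets.token_bytes draws from the operating system's cryptographic random source, the same kind of source as /dev/urandom in the shell version.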

IAM Roles

The service account for the app needs project-level IAM roles:

  • roles/documentai.apiUser, Document AI > Cloud DocumentAI API User
  • roles/logging.logWriter, Logging > Logs Writer

and needs access to its cookie secret, granted with:

  • roles/secretmanager.secretAccessor, Secret Manager Secret Accessor
  • roles/secretmanager.viewer, Secret Manager Viewer

Deploy

Run pdf_sprinkles$ gcloud app deploy.

License

pdf_sprinkles is licensed under the Apache License, Version 2.0.

Owner
Will Angley