A modern pure-Python library for reading PDF files

Last update: Apr 06, 2022

pdf

A modern pure-Python library for reading PDF files.

The goal is a modern interface for handling PDF files that is internally consistent and follows typical Python idioms.

The library should be Python-only (hence no C extensions) but allow the backend to be swapped, similar in concept to matplotlib backends and Keras backends.

The default backend could be PyPDF2.

Other possible backends are PyMuPDF (using MuPDF) and pikepdf (using QPDF).
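
The swappable-backend idea can be illustrated with a small registry. This is only a sketch of the general pattern, not the library's actual design: the names `Backend`, `register_backend`, `get_backend`, and `InMemoryBackend` are hypothetical and do not come from this project.

```python
from typing import Callable, Dict, Protocol


class Backend(Protocol):
    """Hypothetical interface that every backend would implement."""

    def page_count(self, path: str) -> int: ...

    def page_text(self, path: str, index: int) -> str: ...


# Registry mapping a backend name to a factory. Using a factory means an
# optional dependency is only needed once that backend is actually selected.
_BACKENDS: Dict[str, Callable[[], Backend]] = {}


def register_backend(name: str, factory: Callable[[], Backend]) -> None:
    """Make a backend available under the given name."""
    _BACKENDS[name] = factory


def get_backend(name: str) -> Backend:
    """Build the backend registered under `name`, or raise ValueError."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        known = sorted(_BACKENDS)
        raise ValueError(f"unknown backend {name!r}; known: {known}") from None
    return factory()


class InMemoryBackend:
    """Stand-in backend for illustration; it ignores the path entirely."""

    def __init__(self) -> None:
        self._pages = ["first page", "second page"]

    def page_count(self, path: str) -> int:
        return len(self._pages)

    def page_text(self, path: str, index: int) -> str:
        return self._pages[index]


register_backend("in-memory", InMemoryBackend)
backend = get_backend("in-memory")
print(backend.page_count("ignored.pdf"))
```

A real backend would wrap a library such as PyPDF2 behind the same two methods, so the user-facing interface would not need to change when the backend does.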

WARNING: This library is UNSTABLE at the moment! Expect many changes!

Installation

pip install pdffile

Usage

Retrieve Metadata

>>> import pdf

>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1

>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
         '/CreationDate': "D:20220403180542+02'00'",
         '/ModDate': "D:20220403180542+02'00'",
         '/Trapped': '/False',
         '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
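
The raw '/CreationDate' and '/ModDate' values above use the PDF date string format (D:YYYYMMDDHHmmSS followed by an optional UTC offset). The following is a minimal sketch of how such a string could be turned into a datetime; `parse_pdf_date` is a hypothetical helper and not necessarily how this library does the conversion. Unlike the naive datetimes shown in the output above, the sketch keeps the offset when one is present.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?:(?P<tz_minute>\d{2})'?)?)?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Parse a PDF date string such as "D:20220403180542+02'00'"."""
    match = _PDF_DATE.match(raw)
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    parts = match.groupdict()
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),   # missing month or day defaults to 1
        int(parts["day"] or 1),
        int(parts["hour"] or 0),    # missing time fields default to 0
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return naive  # no offset given: leave the datetime naive
    offset = timedelta(
        hours=int(parts["tz_hour"] or 0),
        minutes=int(parts["tz_minute"] or 0),
    )
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


parsed = parse_pdf_date("D:20220403180542+02'00'")
print(parsed.isoformat())
```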

Encrypted PDFs

If the PDF is encrypted, pass the password when opening it:

doc = pdf.PdfFile(pdf_path, password=password)

All subsequent operations then work as described.

Get Outline

>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
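
The outline is a flat list, so one way to display it as a table of contents is to guess the nesting depth from the section numbering. The sketch below assumes entries with the two fields shown above; the `Links` class here is a local stand-in, and `depth_of` and `format_outline` are hypothetical helpers, not part of the library.

```python
import re
from typing import List, NamedTuple


class Links(NamedTuple):
    """Local stand-in with the same two fields as the entries shown above."""

    page: int
    text: str


# A leading section number such as "1" or "1.1" followed by whitespace.
_NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\s")


def depth_of(entry_text: str) -> int:
    """Guess the nesting depth from the numbering: '1' is 0, '1.1' is 1."""
    match = _NUMBERING.match(entry_text)
    if match is None:
        return 0  # unnumbered entries such as 'References' stay top-level
    return match.group(1).count(".")


def format_outline(entries: List[Links]) -> str:
    """Render the outline as an indented, human-readable table of contents."""
    lines = []
    for entry in entries:
        indent = "  " * depth_of(entry.text)
        lines.append(f"{indent}{entry.text} (page {entry.page})")
    return "\n".join(lines)


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
print(format_outline(outline))
```

The depth guess is a heuristic: it only works when the outline titles carry section numbers.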

Extract Text

>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'

Alternatively, you can use doc.text to get the text of all pages.
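
The extracted text above keeps the line breaks of the PDF, including words hyphenated across lines ("noseataki-\nmatasanctus"). A small post-processing step can rejoin them. This is a sketch of one possible heuristic, and `join_hyphenated_lines` is a hypothetical helper that is not part of the library.

```python
import re

# A hyphen directly after a letter, then a newline, then a lowercase letter.
_HYPHEN_BREAK = re.compile(r"(?<=[A-Za-z])-\n(?=[a-z])")


def join_hyphenated_lines(text: str) -> str:
    """Remove end-of-line hyphens so that split words are joined again."""
    return _HYPHEN_BREAK.sub("", text)


sample = "Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsum"
print(join_hyphenated_lines(sample))
```

The heuristic also merges genuine compounds that happen to break at a hyphen, and it does not restore the missing spaces between words.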
