PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

Overview

PDFSanitizer

Renders possibly malicious PDF files and outputs harmless PDF files

To do this, the PDF files are rendered and converted to images using PyMuPDF. The images are then saved to a new PDF file using img2pdf. This ensures no visual data is lost, but any scripts/external references/flash files are removed.

Instalation:

git clone https://github.com/lacioffi/PDFSanitizer
cd PDFSanitizer
pip install -r requirements.txt 

Usage:

pdfsanitizer.py 
    
    

   

This project uses the following libraries:

PyMuPDF - By Jorj X. McKie (@JorjMcKie)

img2pdf - By Johannes Schauer Marin Rodrigues ([email protected])

Special thanks to @CoolerVoid for helping with the sandboxing part <3

Security Considerations

This script will render possibly malicious PDFs to generate the sanitized file using MuPDF, which has had exploits in the past: https://www.cvedetails.com/vulnerability-list/vendor_id-10846/product_id-20840/Artifex-Mupdf.html Therefore, assume the machine this runs on will be pwned eventually.

To mitigate this risk, I recommend two techinques:

Running the script on a Sandbox

Firejail

This sandbox will restrict all network access, unnecessary syscalls and reading/writing from unexpected files/folders. I recommend this method.

Install it using:

sudo apt-get install firejail

The profile is already included in this repo, but it assumes you're running from ~/PDFSanitizer. Setup and use it by running the following commands:

cd ~
git clone https://github.com/lacioffi/PDFSanitizer
cd ~/PDFSanitizer/ 
firejail --profile=pdfsanitizer.profile python3 ~/PDFSanitizer/pdfsanitizer.py file.pdf ~/PDFSanitizer/out

Plase note that the input file must be located in ~/PDFSanitizer. The output folder can have any name, but it must already exist and also be inside ~/PDFSanitizer.

If you want to run PDFSanitizer from another folder, change the following line in "pdfsanitizer.profile":

whitelist ${HOME}/PDFSanitizer/

To wherever you're running the program from.

If you want the output folder or the input file to be outside of PDFSanitizer's folder, simply add a line in "pdfsanitizer.profile":

whitelist /full/path/to/output/folder
whitelist /full/path/to/file.pdf

Again, please note that the output folder must already exist.

CloudFlare Sandbox

This sandbox will restrict all network access and unnecessary syscalls, but will not restrict reading/writing to arbitrary files/folders.

Download and build it from here: https://github.com/cloudflare/sandbox Take the generated "libsandbox.so" file, put it in this PDFSanitizer's folder and run the following command:

LD_PRELOAD=./libsandbox.so SECCOMP_SYSCALL_ALLOW="read:write:lseek:close:openat:brk:stat:munmap:fstat:getdents64:ioctl:rt_sigaction:mmap:mprotect:pread64:lstat:dup:mremap:futex:getegid:getuid:getgid:geteuid:sigaltstack:rt_sigprocmask:access:uname:fcntl:getcwd:readlink:sysinfo:arch_prctl:gettid:set_tid_address:set_robust_list:prlimit64:getrandom:exit_group" python3 pdfsanitizer.py file.pdf ./output

Running the script on an isolated environment

For maximum security, run this script on an isolated, ephemeral instance or even a serverless environment. Block all network communications, maybe kill the instance after the job is done, and only allow reading from an input folder/bucket and writing to an output folder/bucket.

I didn't try this method, but I believe you can do it with some Cloud Majyks.

To-do

This method removes EVERYTHING from the PDF. It would be nice to at least keep the text copy-pasteable.

Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 02, 2022
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

Nemo 387 Dec 10, 2022
A bot for PDF for doing Many Things....

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

Mr. Developer 60 Dec 27, 2022
borb is a library for reading, creating and manipulating PDF files in python.

borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 01, 2023
Table automatically extraction from PDF Document

PDF Table Extractor Table automatically extraction from PDF Document Our Icon 📌 Name : PDF Table Extractor 📌 Authors : Minku Koo Jiyong Park 📌 Deve

1 Jan 10, 2022
Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

Hemachandran P 1 Nov 09, 2021
Performing the following operations using python on PDF.

Python PDF Handling Tutorial Python is a highly versatile language with a huge set of libraries. It is a high level language with simple syntax. Pytho

Prajwol Lamichhane 131 Dec 16, 2022
Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022
A backend for mdbook in Python for generating PDF based on Chrome DevTools Protocol.

mdbook-pdf A backend for mdbook written in Python for generating PDF based on Chrome DevTools Protocol. Python library dependency Usage Put mdbook-pdf

Hollow Man 49 Dec 27, 2022
A bulk pdf generator. This application can generate PDFs in bulk by using just one click.

A bulk html pdf generator. This application can generate PDFs in bulk by using just one click. Screenshots Requirements 🧱 Your system must have the f

Aman Nirala 3 Apr 23, 2022
Trata PDF para torná-lo compatível com PDF/X e com impressoras em escala de cinza.

tratapdf Trata PDF para torná-lo compatível com PDF/X e com impressoras em escala de cinza. dependências icc-profiles ghostscript visualizador de PDF

1 Nov 30, 2021
PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files

Matthew Stamy 5k Jan 04, 2023
WeasyPrint is a smart solution helping web developers to create PDF documents.

WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous statistical reports, invoices, tickets…

Kozea 5.4k Jan 08, 2023
Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

Yusuke Shinyama 4.9k Jan 04, 2023
x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

Free Law Project 73 Dec 19, 2022
Pdfencrypt is a tool to encrypt/lock PDFs

Pdfencrypt Pdfencrypt is a tool to encrypt/lock PDFs Installation $ apt update $ apt upgrade $ apt install git $ apt install python $ git clone https:

Anontemitayo 5 Nov 28, 2021
Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

Jonas Lejon 1.9k Jan 01, 2023
Camelot is a Python library that can help you extract tables from PDFs!

A Python library to extract tabular data from PDFs

1.8k Jan 03, 2023
An application which enables the users to perform simple yet intriguing PDF operations

AstutePDF A repository containing the GUI for an application which enables the users to perform simple yet intriguing PDF operations. These include, M

Raghav S 5 Jan 22, 2022
this is simple program, that converts pdf file to png

author: a5892731 last update:2021-11-01 version: 1.1 resources: -https://pypi.org/project/pdf2image/ -https://github.com/oschwartz10612/poppler-window

1 Nov 01, 2021