A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
This a simple tool to query the awesome ippsec.rocks website from your terminal

ippsec-cli This a simple tool to query the awesome ippsec.rocks website from your terminal Installation and usage cd /opt git clone https://github.com

stark0de 5 Nov 26, 2022
a GUI app base on warp-cli for linux

warp cloudflare gui a GUI app base on warp-cli for linux Installation read warp-cli install doc. install warp-cli and register with $ warp-cli registe

Moein Aghamirzaei 58 Jan 01, 2023
LSD (Linux Spotify Downloader) is a command line tool for downloading or rather recording content on Spotify.

LSD (Linux Spotify Downloader) is a command line tool for downloading or rather recording content on Spotify.

Jannis Zahn 7 Jun 21, 2022
Tools crack instagram + fb ayok dicoba keburu premium 😁

FITUR INSTALLASI [1] pkg update && pkg upgrade [2] pkg install git [3] pkg install python [4] pkg install python2 [5] pkg install nano [6]

Jeeck 1 Dec 11, 2021
frogtrade9000 - a command-line Rich client for the freqtrade REST API

frogtrade9000 - a command-line Rich client for the freqtrade REST API I found FreqUI too cumbersome and slow on my Raspberry Pi 400 when running multi

Robert Davey 79 Dec 02, 2022
CLI tool to computes CO2 emissions of HPC computations following green-algorithms.org methodology

gqueue gqueue is a CLI (command line interface) tool that computes carbon footprint of HPC computations on clusters running slurm. It follows the meth

4 Dec 10, 2021
This is a CLI program which can help you generate your own QR Code.

Python-QR-code-generator This is a CLI program which can help you generate your own QR Code. Single.py This will allow you only to input a single mess

1 Dec 24, 2021
Convert shellcode generated using pe_2_shellcode to cdb format.

pe2shc-to-cdb This tool will convert shellcode generated using pe_to_shellcode to cdb format. Cdb.exe is a LOLBIN which can help evade detection & app

mrd0x 75 Jan 05, 2023
Example of a CLI with python - know the extension of your files.

extensionCLI Example of a CLI with python - know the extension of your files. Usage: Install the CLI: pip3 install -e . Run the command with "ext" + t

ItanuRomero 5 Dec 29, 2022
Python script to tabulate data formats like json, csv, html, etc

pyT PyT is a a command line tool and as well a library for visualising various data formats like: JSON HTML Table CSV XML, etc. Features Print table o

Mobolaji Abdulsalam 1 Dec 30, 2021
A CLI based task manager tool which helps you track your daily task and activity.

CLI based task manager tool This is the simple CLI tool can be helpful in increasing your productivity. More like your todolist. It uses Postgresql as

ritik 1 Jan 19, 2022
🔖 Lemnos: A simple, light-weight command-line to-do list manager.

🔖 Lemnos: CLI To-do List Manager This is a simple program that allows one to manage a to-do list via the command-line. Example $ python3 todo.py add

Rohan Sikand 1 Dec 07, 2022
pls is a better ls for developers, pronounced /pliːz/ as in 'please'

pls is a better ls for developers. The "p" stands for ("pro" as in "professional"/"programmer") or "prettier". It works in a manner similar to ls, in

Dhruv Bhanushali 572 Dec 28, 2022
Get Air Quality Index for your city/country 😷

Air Quality Index CLI Get Air Quality index for your City. Installation $ pip install air-quality-cli Contents Air Quality Index CLI Installation Cont

Yankee 40 Oct 21, 2022
Tool for HackMyVM platform

HMV-cli It is a tool for the HackMyVM platform. With this tool you will be able to see the machines you have pending, filter by difficulty, download d

bitc0de 11 Sep 19, 2022
💻 Physics2Calculator - A simple and powerful calculator for Physics 2

💻 Physics2Calculator A simple and powerful calculator for Physics 2 🔌 Predefined constants pi = 3.14159... k = 8988000000 (coulomb constant) e0 = 8.

Dylan Tintenfich 4 Dec 01, 2021
Interactive Redis: A Terminal Client for Redis with AutoCompletion and Syntax Highlighting.

Interactive Redis: A Cli for Redis with AutoCompletion and Syntax Highlighting. IRedis is a terminal client for redis with auto-completion and syntax

2.2k Dec 29, 2022
adds flavor of interactive filtering to the traditional pipe concept of UNIX shell

percol __ ____ ___ ______________ / / / __ \/ _ \/ ___/ ___/ __ \/ / / /_/ / __/ / / /__/ /_/ / / / .__

Masafumi Oyamada 3.2k Jan 07, 2023
GoogleFormSpammer - A simple CLI script to spam Google Forms used by Crypto Wallet scammers to collect stolen data

GoogleFormSpammer - A simple CLI script to spam Google Forms used by Crypto Wallet scammers to collect stolen data

14 Dec 17, 2022
Redial is a simple shell application that manages your SSH sessions on Unix terminal.

redial redial is a simple shell application that manages your SSH sessions on Unix terminal. What's New 0.7 (19.12.2019) Basic support for adding ssh

Bahadır Yağan 186 Oct 28, 2022