A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
A Terminal Client for MySQL with AutoCompletion and Syntax Highlighting.

mycli A command line client for MySQL that can do auto-completion and syntax highlighting. HomePage: http://mycli.net Documentation: http://mycli.net/

dbcli 10.7k Jan 07, 2023
Shellcode runner to execute malicious payload and bypass AV

buffshark-shellcode-runner Python Shellcode Runner to execute malicious payload and bypass AV This script utilizes mmap(for linux) and win api wrapper

Momo Lenard 9 Dec 29, 2022
Detect secret in source code, scan your repo for leaks. Find secrets with GitGuardian and prevent leaked credentials. GitGuardian is an automated secrets detection & remediation service.

GitGuardian Shield: protect your secrets with GitGuardian GitGuardian shield (ggshield) is a CLI application that runs in your local environment or in

GitGuardian 1.2k Jan 06, 2023
Magnificent app which corrects your previous console command.

The Fuck The Fuck is a magnificent app, inspired by a @liamosaur tweet, that corrects errors in previous console commands. Is The Fuck too slow? Try t

Vladimir Iakovlev 75k Jan 02, 2023
A CLI/Shell supporting OpenRobot API and more!

A CLI/Shell supporting JeyyAPI, OpenRobot API and RePI API.

OpenRobot Packages 1 Jan 06, 2022
Create argparse subcommands with decorators.

python-argparse-subdec This is a very simple Python package that allows one to create argparse's subcommands via function decorators. Usage Create a S

Gustavo José de Sousa 7 Oct 21, 2022
You'll never want to use cd again.

Jmp Description Have you ever used the cd command? You'll never touch that outdated thing again when you try jmp. Navigate your filesystem with unprec

Grant Holmes 21 Nov 03, 2022
This tool is a free and unlimited python CLI for google translate. based on google_trans_new.

GoTransPy A free and unlimited python CLI for google translate based on google_trans_new. It's very easy to use and solve the problem that the old api

Youssef Mohamed 2 Jan 10, 2022
Convert shellcode into :sparkles: different :sparkles: formats!

Bluffy Convert shellcode into ✨ different ✨ formats! Bluffy is a utility which was used in experiments to bypass Anti-Virus products (statically) by f

AD995 305 Dec 17, 2022
Analyzing the most strategic words to guess on Wordle, based on letter frequency distributions

wordle-analysis Evaluating different heuristics to determine the most effective solving strategy and building an AI-powered assistant tool to help you

Sejal Dua 9 Feb 27, 2022
GetRepo-py is a command line client that queries GitHub API and searches repositories by given arguments

GetRepo-py is a command line client that queries GitHub API and searches repositories by given arguments

Davidcin 3 Feb 14, 2022
A tool to manage the study of courses at the university.

todo-cli A tool to manage the study of courses at the university

Quentin 6 Aug 01, 2022
An question and answer shell environment based on xonsh using ansible for setup

An question and answer shell environment based on xonsh using ansible for setup

Steven Hollingsworth 2 Jan 11, 2022
Random scripts and other bits for interacting with the SpaceX Starlink user terminal hardware

starlink-grpc-tools This repository has a handful of tools for interacting with the gRPC service implemented on the Starlink user terminal (AKA "the d

270 Dec 29, 2022
A Python script for finding a food-truck based on latitude and longitude coordinates that you can run in your shell

Food Truck Finder Project Description This repository contains a Python script for finding a food-truck based on latitude and longitude coordinates th

1 Jan 22, 2022
Get COVID-19 vaccination schedules from booking.moh.gov.ge in the CLI

vaccination.py Get COVID-19 vaccination schedules from booking.moh.gov.ge in the CLI. Installation $ pip install vaccination Usage Make sure the Pytho

Temuri Takalandze 11 Dec 08, 2021
WebApp Maker make web apps (Duh). It is open source and make with python and shell.

WebApp Maker make web apps (Duh). It is open source and make with python and shell. This app can take any website and turn it into an app. I highly recommend turning these few websites into webapps:

2 Jan 09, 2022
💻VIEN is a command-line tool for managing Python Virtual Environments.

vien VIEN is a command-line tool for managing Python Virtual Environments. It provides one-line shortcuts for: creating and deleting environments runn

Artёm IG 5 Mar 19, 2022
spid-sp-test is a SAML2 SPID/CIE Service Provider validation tool that can be executed from the command line.

spid-sp-test spid-sp-test is a SAML2 SPID/CIE Service Provider validation tool that can be executed from the command line. This tool was born by separ

Developers Italia 30 Nov 08, 2022
A lightweight terminal-based password manager coded with Python using SQLCipher for SQLite database encryption.

password-manager A lightweight terminal-based password manager coded with Python using SQLCipher for SQLite database encryption. Screenshot Pre-requis

Leonardo de Araujo 15 Oct 15, 2022