Simple archive format designed for quickly reading some files without extracting the entire archive

Overview

hop

Simple archive format designed for quickly reading some files without extracting the entire archive. Possibly will be used in Bun.

25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)

Format Random access Fast extraction Fast archiving Compression Encryption Append
hop
tar
zip (when small)

Features:

  • Faster at printing individual files than tar & zip (compression disabled)
  • Faster extraction than zip, comparable to tar (compression disabled)
  • Faster archiving than zip, comparable to tar (compression disabled)

Anti-features:

  • Single-threaded (but doesn't need to be)
  • I wrote it in about 3 hours and there are no tests
  • No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
  • Ignores symlinks
  • Can't be larger than 4 GB
  • Archives are read-only and file names are not normalized across platforms

Usage

Download the binary from /releases

To create an archive:

hop ./path-to-folder

To extract an archive:

hop archive.hop

To print one file from the archive:

hop archive.hop package.json

Why?

Why can't software read many tiny files with similar performance characteristics as individual files?

  • Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow to read from like a directory. tar files extract quickly, but are slow at non-sequential access.
  • Reading directory entries (ls) in large directory trees is slow

Some benchmarks

On macOS 12 with an M1X

Using tigerbeetle github repo as an example

Extracting:

image

Archiving:

image

On an Ubuntu AMD64 server

Extracting a node_modules folder

image

Why faster?

  • It stores an array of hashes for each file path and the list of files are sorted lexigraphically. This makes non-sequential access faster than tar, but can make creating new archives slower.
  • Does not store directories, only files
  • .hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
  • copy_file_range
  • packed struct makes serialization & deserialization very fast because there is very little encoding/decoding step.

How does it work?

  1. File contents go at the top, file metadata goes at the bottom
  2. This is the metadata it currently stores:
package Hop;

struct StringPointer {
    uint32 off;
    uint32 len;
}

struct File {
    StringPointer name;
    uint32 name_hash;
    uint32 chmod;
    uint32 mtime;
    uint32 ctime;
    StringPointer data;
}

message Archive {
    uint32 version = 1;
    uint32 content_offset = 2;
    File[] files = 3;
    uint32[] name_hashes = 4;
    byte[] metadata = 5;
}
You might also like...
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.
Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series. The Fourier series can be animated and visualized, the function can be output as a two dimensional vector for Desmos and there is a method to output the coefficients as LaTeX code.

Extract an archive file (zip file or tar file) stored on AWS S3

S3 Extract Extract an archive file (zip file or tar file) stored on AWS S3. Details Downloads archive from S3 into memory, then extract and re-upload

csv2ir is a script to convert ir .csv files to .ir files for the flipper.

csv2ir csv2ir is a script to convert ir .csv files to .ir files for the flipper. For a repo of .ir files, please see https://github.com/logickworkshop

Various technical documentation, in electronically parseable format

a-pile-of-documentation Various technical documentation, in electronically parseable format. You will need Python 3 to run the scripts and programs in

A simple file module for creating, editing and saving files.

A simple file module for creating, editing and saving files.

A simple library for temporary storage of small files

TemporaryStorage An simple library for temporary storage of small files. Navigation Install Usage In Python console As a standalone application List o

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it
This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it. In the current state, it outputs a PrettyTable to txt file as well as the raw data from that table into a csv.

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem. These files are exposed either in their original format, or as PDF files that contain your annotations. This lets you manage files in the reMarkable Cloud using the same tools you use on your local system.

Comments
  • just curious: why use zig?

    just curious: why use zig?

    I'm not familiar with zig and honestly heard it today first in my life. At first glance, s cool language I think!

    Do you have any specific reasons to use zig?

    opened by roeniss 0
  • front page results about

    front page results about "25x faster" are incorrect

    The tests on the front page have extremely small total times, a single milliseconds range. This means you are measuring mostly startup overhead and not actually decompression (the benchmarking harness tells you the same thing as well: "Command took less than 5 milliseconds, results may be inaccurate")

    In order for performance tests to be accurate, you need to re-measure it with larger archives and more data, so that overall time is no longer dominated by program startup.

    opened by theamk 1
Owner
Jarred Sumner
Jarred Sumner
Nmap XML output to CSV and HTTP/HTTPS URLS.

xml-to-csv-url Convert NMAP's XML output to CSV file and print URL addresses for HTTP/HTTPS ports. NOTE: OS Version Parsing is not working properly ye

1 Dec 21, 2021
gitfs is a FUSE file system that fully integrates with git - Version controlled file system

gitfs is a FUSE file system that fully integrates with git. You can mount a remote repository's branch locally, and any subsequent changes made to the files will be automatically committed to the rem

Presslabs 2.3k Jan 08, 2023
Kartothek - a Python library to manage large amounts of tabular data in a blob store

Kartothek - a Python library to manage (create, read, update, delete) large amounts of tabular data in a blob store

15 Dec 25, 2022
CredSweeper is a tool to detect credentials in any directories or files.

CredSweeper is a tool to detect credentials in any directories or files. CredSweeper could help users to detect unwanted exposure of credentials (such as personal information, token, passwords, api k

Samsung 54 Dec 13, 2022
Pti-file-format - Reverse engineering the Polyend Tracker instrument file format

pti-file-format Reverse engineering the Polyend Tracker instrument file format.

Jaap Roes 14 Dec 30, 2022
Add Ranges and page numbers to IIIF Manifest from a CSV.

Add Ranges and page numbers to IIIF Manifest from CSV specific to a workflow of the Bibliotheca Hertziana.

Raffaele Viglianti 3 Apr 28, 2022
Python script for converting figma produced SVG files into C++ JUCE framework source code

AutoJucer Python script for converting figma produced SVG files into C++ JUCE framework source code Watch the tutorial here! Getting Started Make some

SuperConductor 1 Nov 26, 2021
Test app for importing contact information in CSV files.

Contact Import TestApp Test app for importing contact information in CSV files. Explore the docs » · Report Bug · Request Feature Table of Contents Ab

1 Feb 06, 2022
QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions.

QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions. It aims at facilitating code deobfuscation. The algorithm is greybox approach combining both a blackbox I/

Quarkslab 103 Dec 30, 2022
Media file renamer and organizion tool

mnamer mnamer (media renamer) is an intelligent and highly configurable media organization utility. It parses media filenames for metadata, searches t

Jessy Williams 533 Dec 29, 2022
An universal file format tool kit. At present will handle the ico format problem.

An universal file format tool kit. At present will handle the ico format problem.

Sadam·Sadik 1 Dec 26, 2021
Convert All TXT Files To One File.

AllToOne Convert All TXT Files To One File. Hi 👋 , I'm Alireza A Python Developer Boy 🔭 I’m currently working on my C# projects 🌱 I’m currently Lea

4 Jun 07, 2022
A simple library for temporary storage of small files

TemporaryStorage An simple library for temporary storage of small files. Navigation Install Usage In Python console As a standalone application List o

2 Apr 17, 2022
Listreqs is a simple requirements.txt generator. It's an alternative to pipreqs

⚡ Listreqs Listreqs is a simple requirements.txt generator. It's an alternative to pipreqs. Where in Pipreqs, it helps you to Generate requirements.tx

Soumyadip Sarkar 4 Oct 15, 2021
fast change directory with python and ruby

fcdir fast change directory with python and ruby run run python script , chose drirectoy and change your directory need you need python and ruby deskt

XCO 2 Jun 20, 2022
Uproot is a library for reading and writing ROOT files in pure Python and NumPy.

Uproot is a library for reading and writing ROOT files in pure Python and NumPy. Unlike the standard C++ ROOT implementation, Uproot is only an I/O li

Scikit-HEP Project 164 Dec 31, 2022
Maltego transforms to pivot between PE files based on their VirusTotal codeblocks

VirusTotal Codeblocks Maltego Transforms Introduction These Maltego transforms allow you to pivot between different PE files based on codeblocks they

Ariel Jungheit 18 Feb 03, 2022
A simple file sharing tool written in python

Share it A simple file sharing tool written in python Installation If you are using Windows os you can directly Run .exe file -- download If you are

Sachit Yadav 7 Dec 16, 2022
Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib

scandir, a better directory iterator and faster os.walk() scandir() is a directory iteration function like os.listdir(), except that instead of return

Ben Hoyt 506 Dec 29, 2022
Singer is an open source standard for moving data between databases, web APIs, files, queues, and just about anything else you can think of.

Singer is an open source standard for moving data between databases, web APIs, files, queues, and just about anything else you can think of. Th

Singer 1.1k Jan 05, 2023