Simple archive format designed for quickly reading some files without extracting the entire archive

Overview

hop

Simple archive format designed for quickly reading some files without extracting the entire archive. Possibly will be used in Bun.

25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)

Format Random access Fast extraction Fast archiving Compression Encryption Append
hop
tar
zip (when small)

Features:

  • Faster at printing individual files than tar & zip (compression disabled)
  • Faster extraction than zip, comparable to tar (compression disabled)
  • Faster archiving than zip, comparable to tar (compression disabled)

Anti-features:

  • Single-threaded (but doesn't need to be)
  • I wrote it in about 3 hours and there are no tests
  • No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
  • Ignores symlinks
  • Can't be larger than 4 GB
  • Archives are read-only and file names are not normalized across platforms

Usage

Download the binary from /releases

To create an archive:

hop ./path-to-folder

To extract an archive:

hop archive.hop

To print one file from the archive:

hop archive.hop package.json

Why?

Why can't software read many tiny files with similar performance characteristics as individual files?

  • Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow to read from like a directory. tar files extract quickly, but are slow at non-sequential access.
  • Reading directory entries (ls) in large directory trees is slow

Some benchmarks

On macOS 12 with an M1X

Using tigerbeetle github repo as an example

Extracting:

image

Archiving:

image

On an Ubuntu AMD64 server

Extracting a node_modules folder

image

Why faster?

  • It stores an array of hashes for each file path and the list of files are sorted lexigraphically. This makes non-sequential access faster than tar, but can make creating new archives slower.
  • Does not store directories, only files
  • .hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
  • copy_file_range
  • packed struct makes serialization & deserialization very fast because there is very little encoding/decoding step.

How does it work?

  1. File contents go at the top, file metadata goes at the bottom
  2. This is the metadata it currently stores:
package Hop;

struct StringPointer {
    uint32 off;
    uint32 len;
}

struct File {
    StringPointer name;
    uint32 name_hash;
    uint32 chmod;
    uint32 mtime;
    uint32 ctime;
    StringPointer data;
}

message Archive {
    uint32 version = 1;
    uint32 content_offset = 2;
    File[] files = 3;
    uint32[] name_hashes = 4;
    byte[] metadata = 5;
}
You might also like...
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.
🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

🧹 Create symlinks for .m2ts files and classify them into directories in yyyy-mm format.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.
Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series.

Here is some Python code that allows you to read in SVG files and approximate their paths using a Fourier series. The Fourier series can be animated and visualized, the function can be output as a two dimensional vector for Desmos and there is a method to output the coefficients as LaTeX code.

Extract an archive file (zip file or tar file) stored on AWS S3

S3 Extract Extract an archive file (zip file or tar file) stored on AWS S3. Details Downloads archive from S3 into memory, then extract and re-upload

csv2ir is a script to convert ir .csv files to .ir files for the flipper.

csv2ir csv2ir is a script to convert ir .csv files to .ir files for the flipper. For a repo of .ir files, please see https://github.com/logickworkshop

Various technical documentation, in electronically parseable format

a-pile-of-documentation Various technical documentation, in electronically parseable format. You will need Python 3 to run the scripts and programs in

A simple file module for creating, editing and saving files.

A simple file module for creating, editing and saving files.

A simple library for temporary storage of small files

TemporaryStorage An simple library for temporary storage of small files. Navigation Install Usage In Python console As a standalone application List o

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it
This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it

This python project contains a class FileProcessor which allows one to grab a file and get some meta data and header information from it. In the current state, it outputs a PrettyTable to txt file as well as the raw data from that table into a csv.

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem. These files are exposed either in their original format, or as PDF files that contain your annotations. This lets you manage files in the reMarkable Cloud using the same tools you use on your local system.

Comments
  • just curious: why use zig?

    just curious: why use zig?

    I'm not familiar with zig and honestly heard it today first in my life. At first glance, s cool language I think!

    Do you have any specific reasons to use zig?

    opened by roeniss 0
  • front page results about

    front page results about "25x faster" are incorrect

    The tests on the front page have extremely small total times, a single milliseconds range. This means you are measuring mostly startup overhead and not actually decompression (the benchmarking harness tells you the same thing as well: "Command took less than 5 milliseconds, results may be inaccurate")

    In order for performance tests to be accurate, you need to re-measure it with larger archives and more data, so that overall time is no longer dominated by program startup.

    opened by theamk 1
Owner
Jarred Sumner
Jarred Sumner
csv2ir is a script to convert ir .csv files to .ir files for the flipper.

csv2ir csv2ir is a script to convert ir .csv files to .ir files for the flipper. For a repo of .ir files, please see https://github.com/logickworkshop

Alex 38 Dec 31, 2022
Remove [x]_ from StudIP zip Archives and archive_filelist.csv completely

This tool removes the "[x]_" at the beginning of StudIP zip Archives. It also deletes the "archive_filelist.csv" file

Kelke vl 1 Jan 19, 2022
Test app for importing contact information in CSV files.

Contact Import TestApp Test app for importing contact information in CSV files. Explore the docs » · Report Bug · Request Feature Table of Contents Ab

1 Feb 06, 2022
Read and write TIFF files

Read and write TIFF files Tifffile is a Python library to store numpy arrays in TIFF (Tagged Image File Format) files, and read image and metadata fro

Christoph Gohlke 346 Dec 18, 2022
Python's Filesystem abstraction layer

PyFilesystem2 Python's Filesystem abstraction layer. Documentation Wiki API Documentation GitHub Repository Blog Introduction Think of PyFilesystem's

pyFilesystem 1.8k Jan 02, 2023
ZipFly is a zip archive generator based on zipfile.py

ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without m

Buzon 506 Jan 04, 2023
Vericopy - This Python script provides various usage modes for secure local file copying and hashing.

Vericopy This Python script provides various usage modes for secure local file copying and hashing. Hash data is captured and logged for paths before

15 Nov 05, 2022
File support for asyncio

aiofiles: file support for asyncio aiofiles is an Apache2 licensed library, written in Python, for handling local disk files in asyncio applications.

Tin Tvrtković 2.1k Jan 01, 2023
Yadl - it is a simple library for working with both dotenv files and environment variables.

Yadl Yadl - it is a simple library for working with both dotenv files and environment variables. Features Validation of whitespaces. Validation of num

Ivan Kapranov 3 Oct 19, 2021
Search for files under the specified directory. Extract the file name and file path and import them as data.

Search for files under the specified directory. Extract the file name and file path and import them as data. Based on that, search for the file, select it and open it.

G-jon FujiYama 2 Jan 10, 2022
Simple addon to create folder structures in blender.

BlenderCreateFolderStructure Simple Add-on to create a folder structure in Blender. Installation Download BlenderCreateFolderStructure.py Open Blender

Dominik Strasser 2 Feb 21, 2022
Nintendo Game Boy music assembly files parser into musicxml format

GBMusicParser Nintendo Game Boy music assembly files parser into musicxml format This python code will get an file.asm from the disassembly of a Game

1 Dec 11, 2021
Quick and dirty FAT12 filesystem to ZIP file converter

Quick and Dirty FAT12 Filesystem Converter This is a really crappy Python script I wrote to convert a semi-compatible FAT12 filesystem from my HP150's

Tube Time 2 Feb 12, 2022
A wrapper for DVD file structure and ISO files.

vs-parsedvd DVDs were an error. A wrapper for DVD file structure and ISO files. You can find me in the IEW Discord server

7 Nov 17, 2022
A small Python module for determining appropriate platform-specific dirs, e.g. a "user data dir".

the problem What directory should your app use for storing user data? If running on macOS, you should use: ~/Library/Application Support/AppName If

ActiveState Software 948 Dec 31, 2022
A tool written in python to generate basic repo files from github

A tool written in python to generate basic repo files from github

Riley 7 Dec 02, 2021
An universal file format tool kit. At present will handle the ico format problem.

An universal file format tool kit. At present will handle the ico format problem.

Sadam·Sadik 1 Dec 26, 2021
File storage with API access. Used as a part of the Swipio project

API File storage File storage with API access. Used as a part of the Swipio project 📝 About The Project File storage allows you to upload and downloa

25 Sep 17, 2022
Object-oriented file system path manipulation

path (aka path pie, formerly path.py) implements path objects as first-class entities, allowing common operations on files to be invoked on those path

Jason R. Coombs 1k Dec 28, 2022
A python module to parse text files with contains secret variables.

A python module to parse text files with contains secret variables.

0 Dec 05, 2022