CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python.

Overview

CaskDB - Disk based Log Structured Hash Table Store

made-with-python build codecov MIT license

architecture

CaskDB is a disk-based, embedded, persistent, key-value store based on the Riak's bitcask paper, written in Python. It is more focused on the educational capabilities than using it in production. The file format is platform, machine, and programming language independent. Say, the database file created from Python on macOS should be compatible with Rust on Windows.

This project aims to help anyone, even a beginner in databases, build a persistent database in a few hours. There are no external dependencies; only the Python standard library is enough.

If you are interested in writing the database yourself, head to the workshop section.

Features

  • Low latency for reads and writes
  • High throughput
  • Easy to back up / restore
  • Simple and easy to understand
  • Store data much larger than the RAM

Limitations

Most of the following limitations are of CaskDB. However, there are some due to design constraints by the Bitcask paper.

  • Single file stores all data, and deleted keys still take up the space
  • CaskDB does not offer range scans
  • CaskDB requires keeping all the keys in the internal memory. With a lot of keys, RAM usage will be high
  • Slow startup time since it needs to load all the keys in memory

Dependencies

CaskDB does not require any external libraries to run. For local development, install the packages from requirements_dev.txt:

pip install -r requirements_dev.txt

Installation

PyPi is not used for CaskDB yet (issue #5), and you'd have to install it directly from the repository by cloning.

Usage

disk: DiskStorage = DiskStore(file_name="books.db")
disk.set(key="othello", value="shakespeare")
author: str = disk.get("othello")
# it also supports dictionary style API too:
disk["hamlet"] = "shakespeare"

Prerequisites

The workshop is for intermediate-advanced programmers. Knowing Python is not a requirement, and you can build the database in any language you wish.

Not sure where you stand? You are ready if you have done the following in any language:

  • If you have used a dictionary or hash table data structure
  • Converting an object (class, struct, or dict) to JSON and converting JSON back to the things
  • Open a file to write or read anything. A common task is dumping a dictionary contents to disk and reading back

Workshop

NOTE: I don't have any workshops scheduled shortly. Follow me on Twitter for updates. Drop me an email if you wish to arrange a workshop for your team/company.

CaskDB comes with a full test suite and a wide range of tools to help you write a database quickly. A Github action is present with an automated tests runner, code formatter, linter, type checker and static analyser. Fork the repo, push the code, and pass the tests!

Throughout the workshop, you will implement the following:

  • Serialiser methods take a bunch of objects and serialise them into bytes. Also, the procedures take a bunch of bytes and deserialise them back to the things.
  • Come up with a data format with a header and data to store the bytes on the disk. The header would contain metadata like timestamp, key size, and value.
  • Store and retrieve data from the disk
  • Read an existing CaskDB file to load all keys

Tasks

  1. Read the paper. Fork this repo and checkout the start-here branch
  2. Implement the fixed-sized header, which can encode timestamp (uint, 4 bytes), key size (uint, 4 bytes), value size (uint, 4 bytes) together
  3. Implement the key, value serialisers, and pass the tests from test_format.py
  4. Figure out how to store the data on disk and the row pointer in the memory. Implement the get/set operations. Tests for the same are in test_disk_store.py
  5. Code from the task #2 and #3 should be enough to read an existing CaskDB file and load the keys into memory

Use make lint to run mypy, black, and pytype static analyser. Run make test to run the tests locally. Push the code to Github, and tests will run on different OS: ubuntu, mac, and windows.

Not sure how to proceed? Then check the hints file which contains more details on the tasks and hints.

Hints

  • Check out the documentation of struck.pack for serialisation methods in Python
  • Not sure how to come up with a file format? Read the comment in the format module

What next?

I often get questions about what is next after the basic implementation. Here are some challenges (with different levels of difficulties)

Level 1:

  • Crash safety: the bitcask paper stores CRC in the row, and while fetching the row back, it verifies the data
  • Key deletion: CaskDB does not have a delete API. Read the paper and implement it
  • Instead of using a hash table, use a data structure like the red-black tree to support range scans
  • CaskDB accepts only strings as keys and values. Make it generic and take other data structures like int or bytes.

Level 2:

  • Hint file to improve the startup time. The paper has more details on it
  • Implement an internal cache which stores some of the key-value pairs. You may explore and experiment with different cache eviction strategies like LRU, LFU, FIFO etc.
  • Split the data into multiple files when the files hit a specific capacity

Level 3:

  • Support for multiple processes
  • Garbage collector: keys which got updated and deleted remain in the file and take up space. Write a garbage collector to remove such stale data
  • Add SQL query engine layer
  • Store JSON in values and explore making CaskDB as a document database like Mongo
  • Make CaskDB distributed by exploring algorithms like raft, paxos, or consistent hashing

Name

This project was named cdb earlier and now renamed to CaskDB.

Line Count

$ tokei -f format.py disk_store.py
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Python                  2          391          261          103           27
-------------------------------------------------------------------------------
 disk_store.py                      204          120           70           14
 format.py                          187          141           33           13
===============================================================================
 Total                   2          391          261          103           27
===============================================================================

License

The MIT license. Please check LICENSE for more details.

Owner
I git stuff done
HomeAssistant Linux Companion

Application to run on linux desktop computer to provide sensors data to homeasssistant, and get notifications as if it was a mobile device.

Javier Lopez 10 Dec 27, 2022
UFDR2DIR - A script to convert a Cellebrite UFDR to the original file structure

UFDR2DIR A script to convert a Cellebrite UFDR to it's original file and directo

DFIRScience 25 Oct 24, 2022
Set of scripts that schedules employees for shifts throughout the week based on availability, shift times, and shift necessities

Automatic-Scheduler Set of scripts that schedules employees for shifts throughout the week based on availability, shift times, and shift necessities *

Matthew 1 May 01, 2022
Unofficial package for fetching users information based on National ID Number (Tanzania)

Nida Unofficial package for fetching users information based on National ID Number made by kalebu Installation You can install it directly or using pi

Jordan Kalebu 57 Dec 28, 2022
Simple dependency injection framework for Python

A simple, strictly typed dependency injection library.

BentoML 14 Jun 29, 2022
WordPress-style shortcodes for Python

Python Shortcodes WordPress-style shortcodes for Python Create and use WordPress-style shortcodes in your Python based app. Example # static output de

Bob 1 Dec 22, 2021
A simple watcher for the XTZ/kUSD pool on Quipuswap

Kolibri Quipuswap Watcher This repo holds the source code for the QuipuBot bot deployed to the #quipuswap-updates channel in the Kolibri Discord Setup

Hover Labs 1 Nov 18, 2021
An improved version of the common ˙pacman -S˙

BetterPacmanLook An improved version of the common pacman -S. Installation I know that this is probably one of the worst solutions and i will be worki

1 Nov 06, 2021
propuestas electorales de los candidatos a constituyentes, Chile 2021

textos-constituyentes propuestas electorales de los candidatos a constituyentes, Chile 2021 Programas descargados desde https://elecciones2021.servel.

Sergio Lucero 6 Nov 19, 2021
People tracker on the Internet: OSINT analysis and research tool by Jose Pino

trape (stable) v2.0 People tracker on the Internet: Learn to track the world, to avoid being traced. Trape is an OSINT analysis and research tool, whi

Jose Pino 7.3k Dec 30, 2022
My Solutions to 120 commonly asked data science interview questions.

Data_Science_Interview_Questions Introduction 👋 Here are the answers to 120 Data Science Interview Questions The above answer some is modified based

Milaan Parmar / Милан пармар / _米兰 帕尔马 181 Dec 31, 2022
Med to csv - A simple way to parse MedAssociate output file in tidy data

MedAssociates to CSV file A simple way to parse MedAssociate output file in tidy

Jean-Emmanuel Longueville 5 Sep 09, 2022
Hitchhikers-guide - The Hitchhiker's Guide to Data Science for Social Good

Welcome to the Hitchhiker's Guide to Data Science for Social Good. What is the Data Science for Social Good Fellowship? The Data Science for Social Go

Data Science for Social Good 907 Jan 01, 2023
Scripts for BGC analysis in large MAGs and results of their application to soil metagenomes within Chernevaya Taiga RSF-funded project

Scripts for BGC analysis in large MAGs and results of their application to soil metagenomes within Chernevaya Taiga RSF-funded project

1 Dec 06, 2021
Package to provide translation methods for pyramid, and means to reload translations without stopping the application

Package to provide translation methods for pyramid, and means to reload translations without stopping the application

Grzegorz Śliwiński 4 Nov 20, 2022
Python calculator made with tkinter package

Python-Calculator Python calculator made with tkinter package. works both on Visual Studio Code Or Any Other Ide Or You Just Copy paste The Same Thing

Pro_Gamer_711 1 Nov 11, 2021
Turn a raspberry pi into a Bluetooth Midi device

PiBluetoothMidSetup This will change serveral system wide packages/configurations Do not run this on your primary machine or anything you don't know h

MyLab6 40 Sep 19, 2022
Write Streamlit apps using Notion! (Prototype)

Streamlit + Notion test app Write Streamlit apps using Notion! ☠️ IMPORTANT: This is just a little prototype I made to play with some ideas. Not meant

Thiago Teixeira 22 Sep 08, 2022
An alternative app for core Armoury Crate functions.

NoROG DISCLAIMER: Use at your own risk. This is alpha-quality software. It has not been extensively tested, though I personally run it daily on my lap

12 Nov 29, 2022
A proof-of-concept package manager for Cairo contracts/libraries

glyph A proof-of-concept package manager for Cairo contracts/libraries. Distribution through pypi. Installation through existing package managers -- p

Sam Barnes 11 Jun 06, 2022