SIMD-accelerated bitwise Hamming distance Python module for hexadecimal strings

Overview

hexhamming

What does it do?

This module computes the bitwise Hamming distance between two hexadecimal strings, and does so quickly.

This looks like:

DEADBEEF = 11011110101011011011111011101111
00000000 = 00000000000000000000000000000000
XOR      = 11011110101011011011111011101111
Hamming  = number of ones in DEADBEEF ^ 00000000 = 24

This essentially amounts to

>>> import gmpy
>>> gmpy.popcount(0xdeadbeef ^ 0x00000000)
24

except with Python strings, so

>>> import gmpy
>>> gmpy.popcount(int("deadbeef", 16) ^ int("00000000", 16))
24
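For readers without gmpy installed, the same calculation can be sketched with the standard library alone. This is only a reference for checking results, not how hexhamming is implemented, and the function name `hamming_distance_hex` is made up for illustration. It does not check the input lengths.

```python
def hamming_distance_hex(a: str, b: str) -> int:
    """Count the differing bits between two hex strings (illustrative helper)."""
    # XOR leaves a 1 bit wherever the two inputs differ.
    diff = int(a, 16) ^ int(b, 16)
    # bin() renders the integer in base 2; counting "1" characters is the popcount.
    return bin(diff).count("1")


print(hamming_distance_hex("deadbeef", "00000000"))  # 24
```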

A few assumptions are made and enforced:

  • each input is a valid hexadecimal string (i.e., it matches [a-fA-F0-9]+)
  • the strings are the same length
  • the strings do not begin with "0x"
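The three assumptions above can be expressed as a small pre-check in plain Python. This is a sketch only: the helper name `validate_hex_pair` and the use of `ValueError` are assumptions made for illustration, and the exceptions hexhamming itself raises may differ.

```python
import re

# One or more hex digits, nothing else. A leading "0x" fails this check
# because "x" is not a hexadecimal digit.
_HEX_RE = re.compile(r"[a-fA-F0-9]+")


def validate_hex_pair(a: str, b: str) -> None:
    """Raise ValueError if the pair breaks any of the stated assumptions."""
    for s in (a, b):
        if _HEX_RE.fullmatch(s) is None:
            raise ValueError(f"not a valid hexadecimal string: {s!r}")
    if len(a) != len(b):
        raise ValueError("strings must be the same length")


validate_hex_pair("deadbeef", "00000000")  # passes silently
```

Because the pattern is applied with `fullmatch`, the "no 0x prefix" rule needs no separate test.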

Why yet another Hamming distance library?

There are a lot of fantastic Python libraries that offer methods to calculate various edit distances, including Hamming distances: Distance, textdistance, scipy, jellyfish, etc.

In this case, I needed a Hamming distance library that worked on hexadecimal strings (i.e., a Python str) and was as fast as possible. Furthermore, I often did not care about hex strings longer than 256 bits. That length constraint sets this use case apart from the other libraries and enabled me to explore vectorization techniques via numba, numpy, and SSE/AVX intrinsics.

Lastly, I wanted to minimize dependencies, meaning you do not need to install numpy, gmpy, cython, pypy, pythran, etc.

Eventually, after playing around with gmpy.popcount, numba.jit, pythran.run, and numpy, I decided to write what I wanted in essentially raw C. At this point, I'm using raw char* and int*, so exploring a rewrite in Fortran makes little sense.

Installation

To install, ensure you have Python 2.7 or 3.4+. Run

pip install hexhamming

or to install from source

git clone https://github.com/mrecachinas/hexhamming
cd hexhamming
python setup.py install # or pip install .

If you want to contribute to hexhamming, you should install the dev dependencies

pip install -r requirements-dev.txt

and make sure the tests pass with

python -m pytest -vls .

Example

Using hexhamming is as simple as

>>> from hexhamming import hamming_distance_string
>>> hamming_distance_string("deadbeef", "00000000")
24

New in v2.0.0: hexhamming now supports bytes via hamming_distance_bytes. You use it in the exact same way as before, except you pass in a byte string.

>>> from hexhamming import hamming_distance_bytes
>>> hamming_distance_bytes(b"\xde\xad\xbe\xef", b"\x00\x00\x00\x00")
24
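To see how the string and bytes inputs relate, the hex string can be converted with `bytes.fromhex` and the distance recomputed byte by byte in plain Python. This is a rough reference for cross-checking, not the library's code. The name `hamming_distance_bytes_py` is hypothetical, and the equal-length check is an assumption carried over from the string case.

```python
def hamming_distance_bytes_py(a: bytes, b: bytes) -> int:
    """Pure-Python reference: count differing bits between two byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    # Iterating over bytes yields ints in 0..255; XOR each pair and count the 1 bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


raw = bytes.fromhex("deadbeef")      # two hex digits become one byte
assert raw == b"\xde\xad\xbe\xef"
print(hamming_distance_bytes_py(raw, bytes(4)))  # 24; bytes(4) is four zero bytes
```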

Benchmark

Below is a benchmark using pytest-benchmark with hexhamming==v1.3.2 on my 2020 machine (2.0 GHz quad-core Intel Core i5, 16 GB 3733 MHz LPDDR4) running macOS Catalina (10.15.5) with Python 3.7.3 and Apple clang version 11.0.3 (clang-1103.0.32.62).

Name                                     Mean (ns)  Std (ns)  Median (ns)  Rounds  Iterations
test_hamming_distance_bench_3                 93.8      10.5         94.3   53268         200
test_hamming_distance_bench_3_same            94.2      15.2         94.9  102146         100
test_check_hexstrings_within_dist_bench      231.9     104.2        216.5  195122          22
test_hamming_distance_bench_256               97.5      34.1         94.0  195122          22
test_hamming_distance_bench_1000             489.8     159.4        477.5   94411          20
test_hamming_distance_bench_1000_same        497.8      87.8        496.6   18971          20
test_hamming_distance_bench_1024             509.9     299.5        506.7   18652          10
test_hamming_distance_bench_1024_same        467.4     205.9        450.4  181819          10
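
For a rough point of comparison on your own machine, a pure-Python baseline can be timed with the standard timeit module. This sketch is not part of the benchmark suite above: it uses timeit rather than pytest-benchmark, the helper `hamming_distance_hex` is an illustrative stand-in, and the 64-character inputs are simply an assumed 256-bit test case.

```python
import timeit


def hamming_distance_hex(a: str, b: str) -> int:
    """Pure-Python baseline: popcount of the XOR of two hex strings."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


a = "f" * 64  # 64 hex digits = 256 bits
b = "0" * 64
n = 10_000    # number of timed calls

total_seconds = timeit.timeit(lambda: hamming_distance_hex(a, b), number=n)
ns_per_call = total_seconds / n * 1e9
print(f"pure-Python baseline: {ns_per_call:.1f} ns per call")
```

The figure it prints includes the overhead of the lambda call, so treat it as an upper bound on the baseline rather than a precise measurement.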
Owner
Michael Recachinas