Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with a counting sort, creating a dictionary with each unique number as the key and the number of occurrences of that number in the list as the value:

d = {1: 1, 2: 3, 3: 1}
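The counting step could be sketched roughly like this. The function name count_occurrences is made up for illustration and is not taken from the actual implementation; it also sorts the keys up front, which is an assumption about how the later steps expect them.

```python
from collections import Counter

def count_occurrences(values):
    """Return {number: count} with the keys in ascending order (hypothetical helper)."""
    counts = Counter(values)
    # Dicts keep insertion order, so inserting sorted keys gives a sorted dict.
    return {key: counts[key] for key in sorted(counts)}

l = [1, 2, 2, 2, 3]
d = count_occurrences(l)
print(d)  # {1: 1, 2: 3, 3: 1}
```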

To decrease the space needed to store the numbers in memory, we take the keys in ascending order and only store the first number as-is; for every following number we store the difference between it and the previous one, paired with its count:

d2 = [(1, 1), (1, 3), (1, 1)]
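Here's a small sketch of that difference step, assuming non-negative integers and assuming the first key is simply stored as its distance from 0. The name delta_encode is hypothetical.

```python
def delta_encode(counts):
    """Turn {number: count} into [(difference_from_previous_key, count), ...]."""
    pairs = []
    previous = 0  # assumption: the first key is kept as-is, i.e. its difference from 0
    for key in sorted(counts):
        pairs.append((key - previous, counts[key]))
        previous = key
    return pairs

d = {1: 1, 2: 3, 3: 1}
d2 = delta_encode(d)
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```

Note that a large first key (say, a list that starts at 1000) would make every key slot wide, so this sketch only pays off when the first number is small too.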

Now, the minimum amount of memory we need to store any key in d2 is 1 bit, since 1 is the maximum difference between any two consecutive elements. The same applies to the values, except that to store any value here we need 2 bits of memory, since the maximum value is 3 (11 in binary). So we know that we can store this list as a sequence of 3-bit elements (1 bit for the key, 2 bits for the value), like this:

d2_bin = ["101", "111", "101"]
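One way to work out the two widths and build those fixed-width strings, as a sketch only. The name to_bit_strings is hypothetical, and the code assumes a non-empty list of non-negative pairs.

```python
def to_bit_strings(pairs):
    """Return (chunks, key_bits, value_bits) for a list of (key_delta, count) pairs."""
    # Width of the widest key delta and the widest count, at least 1 bit each.
    key_bits = max(max(k for k, _ in pairs).bit_length(), 1)
    value_bits = max(max(v for _, v in pairs).bit_length(), 1)
    chunks = [
        format(k, f"0{key_bits}b") + format(v, f"0{value_bits}b")
        for k, v in pairs
    ]
    return chunks, key_bits, value_bits

d2 = [(1, 1), (1, 3), (1, 1)]
chunks, key_bits, value_bits = to_bit_strings(d2)
print(chunks, key_bits, value_bits)  # ['101', '111', '101'] 1 2
```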

We can now concatenate these elements and return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, which is what allows the number to be decompressed back into the sorted list.
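A rough sketch of packing and unpacking, not necessarily how the real code does it. The names pack and unpack are hypothetical. The unpacking here relies on every count being at least 1, so every element is non-zero and we can read from the low end until nothing is left.

```python
def pack(pairs, key_bits, value_bits):
    """Concatenate fixed-width (key_delta, count) pairs into one integer."""
    packed = 0
    for key_delta, count in pairs:
        packed = (packed << key_bits) | key_delta
        packed = (packed << value_bits) | count
    return packed

def unpack(packed, key_bits, value_bits):
    """Rebuild the sorted list from the packed integer and the two widths."""
    pairs = []
    while packed > 0:
        # Elements come out last-to-first because we read from the low bits.
        count = packed & ((1 << value_bits) - 1)
        packed >>= value_bits
        key_delta = packed & ((1 << key_bits) - 1)
        packed >>= key_bits
        pairs.append((key_delta, count))
    pairs.reverse()
    result = []
    current = 0
    for key_delta, count in pairs:
        current += key_delta  # undo the difference encoding
        result.extend([current] * count)
    return result

n = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(n))           # 0b101111101
print(unpack(n, 1, 2))  # [1, 2, 2, 2, 3]
```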

Memory efficiency

Here's a comparison between the total number of bits of all the numbers in a list of 100 elements (random values in the range 0 to 50, generated 20 times) and the number of bits in the resulting compressed integer. This takes as a premise that every number in the original list, duplicates included, is actually stored in contiguous memory:

And for 1000 numbers from 0 to 50, also generated 20 times (original bits => compressed bits):

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
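A comparison like this could be reproduced with something along these lines. This is only a sketch: the helper names are made up, the compressed size is estimated as the number of unique values times the element width, and the real implementation may count bits a little differently, so don't expect the exact same figures.

```python
import random
from collections import Counter

def raw_bit_length(values):
    """Sum of the bit lengths of every number, duplicates included."""
    return sum(v.bit_length() for v in values)

def compressed_bit_length(values):
    """Estimated size: one fixed-width (key delta, count) element per unique value."""
    counts = Counter(values)
    keys = sorted(counts)
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    key_bits = max(max(deltas).bit_length(), 1)
    value_bits = max(max(counts.values()).bit_length(), 1)
    return len(keys) * (key_bits + value_bits)

for _ in range(20):
    sample = [random.randint(0, 50) for _ in range(1000)]
    print(raw_bit_length(sample), "=>", compressed_bit_length(sample))
```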

Owner

Douglas
