Weird Sort-and-Compress Thing

Overview

Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with counting sort, creating a dictionary with each unique number as key and the number of occurences of it in the list as the value:

d = {1: 1, 2: 3, 3: 1}

To decrease the space needed to store the numbers in memory, we'll only store the first number and then the difference between each of the next numbers and the previous one:

d2 = [(1, 1), (1, 3), (1, 1))

Now, the minimum amount of memory we need to store every key that's in d2 is 1 bit, since 1 is the maximum difference between any subsequent elements. The same applies to the values, except that to store any value here we need 2 bits of memory, since the maximum value is 3(11 in binary). So we know that we can store this list as a sequence of 3 bits elements, like this:

d2_bin = ["101", "111", 101"]

We can now return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, allowing the value to be decompressed.

Memory efficiency

Here's a list with the sum of the number of bits of all numbers in a list with 100 elements, generated with random values in the range 0 to 50 and generated 20 times, vs. the number of bits in the resulting compressed integer(taking as a premise that all numbers in the array are all actually stored in continuous memory, including duplicates):

467 => 208
486 => 230
490 => 221
491 => 216
493 => 222
491 => 221
494 => 230
485 => 235
494 => 217
490 => 252
461 => 265
476 => 241
492 => 247
487 => 230
474 => 246
460 => 222
484 => 216
486 => 203
484 => 222
485 => 231

And 1000 numbers from 0 to 50, also 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
Owner
Douglas
Douglas
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Khalid Saifullah 37 Sep 05, 2022
The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Speech Separation The simple project to separate mixed voice (2 clean voices) to 2 separate voices. Result Example (Clisk to hear the voices): mix ||

vuthede 31 Oct 30, 2022
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
Finetune gpt-2 in google colab

gpt-2-colab finetune gpt-2 in google colab sample result (117M) from retraining on A Tale of Two Cities by Charles Di

212 Jan 02, 2023
This project converts your human voice input to its text transcript and to an automated voice too.

Human Voice to Automated Voice & Text Introduction: In this project, whenever you'll speak, it will turn your voice into a robot voice and furthermore

Hassan Shahzad 3 Oct 15, 2021
华为商城抢购手机的Python脚本 Python script of Huawei Store snapping up mobile phones

HUAWEI STORE GO 2021 说明 基于Python3+Selenium的华为商城抢购爬虫脚本,修改自近两年没更新的项目BUY-HW,为女神抢Nova 8(什么时候华为开始学小米玩饥饿营销了?) 原项目的登陆以及抢购部分已经不可用,本项目对原项目进行了改正以适应新华为商城,并增加一些功能

ZhangLiang 111 Dec 22, 2022
[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction This repo provides the source code & data of our paper: LM-Critic: Language M

Michihiro Yasunaga 98 Nov 24, 2022
HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

HiFi DeepVariant + WhatsHap workflow Workflow steps align HiFi reads to reference with pbmm2 call small variants with DeepVariant, using two-pass meth

William Rowell 2 May 14, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
This is a project of data parallel that running on NLP tasks.

This is a project of data parallel that running on NLP tasks.

2 Dec 12, 2021
SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

SentimentArcs - Emotion in Text An end-to-end pipeline based on Jupyter notebooks to detect, extract, process and anlayze emotion over time in text. E

jon_chun 14 Dec 19, 2022
ConvBERT-Prod

ConvBERT 目录 0. 仓库结构 1. 简介 2. 数据集和复现精度 3. 准备数据与环境 3.1 准备环境 3.2 准备数据 3.3 准备模型 4. 开始使用 4.1 模型训练 4.2 模型评估 4.3 模型预测 5. 模型推理部署 5.1 基于Inference的推理 5.2 基于Serv

yujun 7 Apr 08, 2022
A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

Reddit text to speech generator A basic reddit tts video generator Current functionality Generate videos for subs based on comments,(askreddit) so rea

Aadvik 17 Dec 19, 2022
Code Generation using a large neural network called GPT-J

CodeGenX is a Code Generation system powered by Artificial Intelligence! It is delivered to you in the form of a Visual Studio Code Extension and is Free and Open-source!

DeepGenX 389 Dec 31, 2022
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-generated Hate Speech Evaluation Datasets

APEACH - Korean Hate Speech Evaluation Datasets APEACH is the first crowd-generated Korean evaluation dataset for hate speech detection. Sentences of

Kevin-Yang 70 Dec 06, 2022
Train 🤗-transformers model with Poutyne.

poutyne-transformers Train 🤗 -transformers models with Poutyne. Installation pip install poutyne-transformers Example import torch from transformers

Lennart Keller 2 Dec 18, 2022
C.J. Hutto 3.8k Dec 30, 2022