A lightweight, fast, and robust columnar dataframe for data analytics with online updates

Last update: May 19, 2022

streamdf

Streamdf is a lightweight dataframe library built on top of a dictionary of NumPy arrays, developed for Kaggle's time-series code competitions.

Key Features

Fast and robust insertion
- Rows can be inserted in amortized constant time (much faster than np.append)
- Automatically falls back to a default value when an abnormal value is inserted
Time-travel
- Get a past state of the data as a slice of the original dataframe, without copying
Null/empty-safe aggregations
- Provides a set of aggregation methods that can be called safely when the data contains NaN or is empty
Columnar layout
- Internal data is stored in a simple columnar format, which is easier to use for analysis than NumPy's structured arrays
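
To make the insertion claims concrete, here is a minimal sketch of one common way to get amortized constant-time appends with a default-value fallback: a preallocated NumPy buffer whose capacity doubles when it fills up. The `GrowableColumn` class and all of its names are hypothetical and written only for illustration; they are not streamdf's API, and streamdf's internals may differ.

```python
import numpy as np


class GrowableColumn:
    """Hypothetical single-column buffer (illustration only, not streamdf's API)."""

    def __init__(self, dtype, default, capacity=4):
        self._data = np.empty(capacity, dtype=dtype)
        self._default = default
        self._size = 0

    def append(self, value):
        # Double the capacity when the buffer is full, so the expensive copy
        # happens only O(log n) times over n appends.
        if self._size == len(self._data):
            grown = np.empty(len(self._data) * 2, dtype=self._data.dtype)
            grown[:self._size] = self._data[:self._size]
            self._data = grown
        try:
            self._data[self._size] = value
        except (TypeError, ValueError, OverflowError):
            # Assumed fallback policy: store the default when the value
            # cannot be converted to the column dtype.
            self._data[self._size] = self._default
        self._size += 1

    @property
    def values(self):
        # A view of the filled prefix; no copy is made.
        return self._data[:self._size]


col = GrowableColumn(np.int32, default=-1)
for v in [1, 2, "oops", None, 5]:
    col.append(v)
print(col.values)
```

By contrast, np.append allocates a new array and copies every existing element on each call, so building a column of n rows that way costs O(n^2) in total.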

Example

import pandas as pd
from streamdf import StreamDf

df = pd.read_csv('test.csv')
sdf = StreamDf.from_pandas(df)

# extend
sdf.extend({
    'x': 1,
    'y': 2
})

assert len(sdf) == len(df) + 1

# access
print(sdf['x'])

# aggregate
sdf.last_value('x')
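
The aggregation call above is meant to be safe on empty or NaN-containing data. As a rough sketch of what such a null/empty-safe aggregation can look like in plain NumPy, the helpers below return a caller-supplied default instead of raising or warning. `safe_last_value` and `safe_mean` are hypothetical names, and skipping NaN is an assumption made for this example; the exact semantics of streamdf's own methods are not described here.

```python
import numpy as np


def safe_last_value(values, default=np.nan):
    """Hypothetical helper: last non-NaN element, or `default` if there is none."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]
    return valid[-1] if valid.size else default


def safe_mean(values, default=np.nan):
    """Hypothetical helper: mean of the non-NaN elements, or `default` if there is none."""
    arr = np.asarray(values, dtype=float)
    valid = arr[~np.isnan(arr)]
    return valid.mean() if valid.size else default


print(safe_last_value([1.0, 2.0, np.nan]))
print(safe_mean([], default=0.0))
```
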

import numpy as np
from streamdf import StreamDf

sdf = StreamDf.empty({'x': np.int32, 'time': 'datetime64[D]'}, 'time')

sdf.extend({'x': 1, 'time': np.datetime64('2018-01-01')})
sdf.extend({'x': 5, 'time': np.datetime64('2018-02-01')})
sdf.extend({'x': 3, 'time': np.datetime64('2018-02-03')})

assert len(sdf) == 3

# Time travel (zero copy)
sliced = sdf.slice_until(np.datetime64('2018-02-02'))

assert len(sliced) == 2
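
The zero-copy behaviour of the time-travel example can be illustrated with plain NumPy: if the time column is sorted, a binary search gives the cut position, and basic slicing then returns a view rather than a copy. This is only a sketch of the general technique, not streamdf's implementation; in particular, `side='right'` assumes that a row stamped exactly at the cutoff should be included, which is not stated above.

```python
import numpy as np

# Columns mirroring the example above, stored as separate arrays.
times = np.array(['2018-01-01', '2018-02-01', '2018-02-03'], dtype='datetime64[D]')
x = np.array([1, 5, 3], dtype=np.int32)

cutoff = np.datetime64('2018-02-02')

# Binary search on the sorted time column: O(log n).
# side='right' is an assumption: rows stamped exactly at the cutoff are kept.
end = int(np.searchsorted(times, cutoff, side='right'))

# Basic slicing returns a view onto the same memory, so nothing is copied.
x_until = x[:end]
times_until = times[:end]

assert len(x_until) == 2
assert np.shares_memory(x_until, x)
```

Because the result is a view, writing into `x_until` would also modify `x`; a caller that needs an independent snapshot would have to call `.copy()` explicitly.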
