Conduits - A Declarative Pipelining Tool For Pandas

Related tags

Data Analysisconduits
Overview

Conduits - A Declarative Pipelining Tool For Pandas

Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can sometimes requires that you adhere to strong contracts in order to use them (looking at you Scikit Learn pipelines ��). It is also usually done completely differently to the way the pipelines where developed during the ideation phase, requiring significate rewrite to get them to work in the new paradigm.

Modelled off the declarative pipeline of Flask, Conduits aims to give you a nicer, simpler, and more flexible way of declaring your data processing pipelines.

Installation

pip install conduits

Quickstart

False! assert output.X.sum() == 17 # Square before addition => True! ">
import pandas as pd
from conduits import Pipeline

##########################
## Pipeline Declaration ##
##########################

pipeline = Pipeline()


@pipeline.step(dependencies=["first_step"])
def second_step(data):
    return data + 1


@pipeline.step()
def first_step(data):
    return data ** 2


###############
## Execution ##
###############

df = pd.DataFrame({"X": [1, 2, 3], "Y": [10, 20, 30]})

output = pipeline.fit_transform(df)
assert output.X.sum() != 29  # Addition before square => False!
assert output.X.sum() == 17  # Square before addition => True!

Usage Guide

Declarations

Your pipeline is defined using a standard decorator syntax. You can wrap your pipeline steps using the decorator:

@pipeline.step()
def transformer(df):
    return df + 1

The decoratored function should accept a pandas dataframe or pandas series and return a pandas dataframe or pandas series. Arbitrary inputs and outputs are currently unsupported.

If your transformer is stateful, you can optionally supply the function with fit and transform boolean arguments. They will be set as True when the appropriate method is called.

@pipeline.step()
def stateful(data: pd.DataFrame, fit: bool, transform: bool):
    if fit:
        scaler = StandardScaler()
        scaler.fit(data)
        joblib.dump(scaler, "scaler.joblib")
        return data
    
    if transform:
        scaler = joblib.load(scaler, "scaler.joblib")
        return scaler.transform(data)

You should not serialise the pipeline object itself. The pipeline is simply a declaration and shouldn't maintain any state. You should manage your pipeline DAG definition versions using a tool like Git. You will receive an error if you try to serialise the pipeline.

If there are any dependencies between your pipeline steps, you may specify these in your decorator and they will be run prior to this step being run in the pipeline. If a step has no dependencies specified it will be assumed that it can be run at any point.

@pipeline.step(dependencies=["add_feature_X", "add_feature_Y"])
def combine_X_with_Y(df):
    return df.X + df.Y

API

Conduits attempts to mock the Scikit Learn API as best as possible. Your defined piplines have the standard methods of:

pipeline.fit(df)
out = pipeline.transform(df)
out = pipeline.fit_transform(df)

Note that for the current release you can only supply pandas dataframes or series objects. It will not accept numpy arrays.

Tests

In order to run the testing suite you should install the dev.requirements.txt file. It comes with all the core dependencies used in testing and packaging. Once you have your dependencies installed, you can run the tests via the target:

make tests

The tests rely on pytest-regressions to test some functionality. If you make a change you can refresh the regression targets with:

make regressions
Owner
Kale Miller
Founder @ Prometheus AI
Kale Miller
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistic

150 Dec 30, 2022
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
Data-sets from the survey and analysis

bachelor-thesis "Umfragewerte.xlsx" contains the orginal survey results. "umfrage_alle.csv" contains the survey results but one participant is cancele

1 Jan 26, 2022
An interactive grid for sorting, filtering, and editing DataFrames in Jupyter notebooks

qgrid Qgrid is a Jupyter notebook widget which uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your D

Quantopian, Inc. 2.9k Jan 08, 2023
Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains This repository contains the source code for an end-to-end open-domain question

7 Sep 27, 2022
A stock analysis app with streamlit

StockAnalysisApp A stock analysis app with streamlit. You select the ticker of the stock and the app makes a series of analysis by using the price cha

Antonio Catalano 50 Nov 27, 2022
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.

Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i

Ahmed Fasih 1 Dec 22, 2021
Pyspark project that able to do joins on the spark data frames.

SPARK JOINS This project is to perform inner, all outer joins and semi joins. create_df.py: load_data.py : helps to put data into Spark data frames. d

Joshua 1 Dec 14, 2021
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Emmanuel Boateng Sifah 1 Jan 19, 2022
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

Cristian Garcia 1.4k Dec 31, 2022
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
A program that uses an API and a AI model to get info of sotcks

Stock-Market-AI-Analysis I dont mind anyone using this code but please give me credit A program that uses an API and a AI model to get info of stocks

1 Dec 17, 2021
Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Thailand COVID-19 Cluster Data Extraction About Extract Clusters from Thailand Daily COVID-19 briefing PDF Download latest data Here. Data will be upd

Noppakorn Jiravaranun 5 Sep 27, 2021
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G

13 Nov 01, 2022
Data collection, enhancement, and metrics calculation.

l3_data_collection Data collection, enhancement, and metrics calculation. Summary Repository containing code for QuantDAO's JDT data collection task.

Ruiwyn 3 Dec 23, 2022
Gaussian processes in TensorFlow

Website | Documentation (release) | Documentation (develop) | Glossary Table of Contents What does GPflow do? Installation Getting Started with GPflow

GPflow 1.7k Jan 06, 2023
Pipeline to convert a haploid assembly into diploid

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes

Mikhail Kolmogorov 50 Jan 05, 2023
A probabilistic programming language in TensorFlow. Deep generative models, variational inference.

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilis

Blei Lab 4.7k Jan 09, 2023