icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Last update: Dec 09, 2022

Related tags

Overview

icepickle

It's a cooler way to store simple linear models.

The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models. Not only is this much safer, but it also allows for an interesting finetuning pattern that does not require a GPU.

Installation

You can install everything with pip:

python -m pip install icepickle

Usage

Let's say that you've gotten a linear model from scikit-learn trained on a dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

clf = LogisticRegression()
clf.fit(X, y)

Then you could use a pickle to save the model.

from joblib import dump, load

# You can save the classifier.
dump(clf, 'classifier.joblib')

# You can load it too.
clf_reloaded = load('classifier.joblib')

But this is unsafe. The scikit-learn documentations even warns about the security concerns and compatibility issues. The goal of this package is to offer a safe alternative to pickling for simple linear models. The coefficients will be saved in a .h5 file and can be loaded into a new regression model later.

from icepickle.linear_model import save_coefficients, load_coefficients

# You can save the classifier.
save_coefficients(clf, 'classifier.h5')

# You can create a new model, with new hyperparams.
clf_reloaded = LogisticRegression()

# Load the previously trained weights in.
load_coefficients(clf_reloaded, 'classifier.h5')

This is a lot safer and there's plenty of use-cases that could be handled this way.

There's a cool finetuning-trick we can do now too!

Finetuning

Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or language models from whatlies, you choose to pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the .partial_fit()-api. If you're unfamiliar with this api, you might appreciate this course on calmcode.

This library also comes with utilities that makes it easier to finetune systems via the .partial_fit() API. In particular we offer partial pipeline components via the icepickle.pipeline submodule.

import pandas as pd
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.feature_extraction.text import HashingVectorizer

from icepickle.linear_model import save_coefficients, load_coefficients
from icepickle.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Train a pre-trained model.
pretrained = LogisticRegression()
pipe = make_partial_pipeline(HashingVectorizer(), pretrained)
pipe.fit(X, y)

# Save the coefficients, safely.
save_coefficients(pretrained, 'pretrained.h5')

# Create a new model using pre-trained weights.
finetuned = SGDClassifier()
load_coefficients(finetuned, 'pretrained.h5')
new_pipe = make_partial_pipeline(HashingVectorizer(), finetuned)

# This new model can be used for fine-tuning.
for i in range(10):
    # Inside this for-loop you could consider doing data-augmentation.
    new_pipe.partial_fit(X, y)

Supported Pipeline Parts

The following pipeline components are added.

from icepickle.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

These tools allow you to declare pipelines that support .partial_fit. Note that components used in these pipelines all need to have .partial_fit() implemented.

Supported Scikit-Learn Models

We unit test against the following models in our save_coefficients and load_coefficients functions.

from sklearn.linear_model import (
    SGDClassifier,
    SGDRegressor,
    LinearRegression,
    LogisticRegression,
    PassiveAggressiveClassifier,
    PassiveAggressiveRegressor,
)

icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Related tags

Overview

icepickle

Installation

Usage

Finetuning

Owner

vincent d warmerdam

A unified framework for machine learning with time series

OptaPy is an AI constraint solver for Python to optimize planning and scheduling problems.

MLReef is an open source ML-Ops platform that helps you collaborate, reproduce and share your Machine Learning work with thousands of other users.

TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models.

Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

A Python implementation of GRAIL, a generic framework to learn compact time series representations.

Time Series Prediction with tf.contrib.timeseries

Simple Machine Learning Tool Kit

A library to generate synthetic time series data by easy-to-use factors and generator

Summer: compartmental disease modelling in Python

A collection of Scikit-Learn compatible time series transformers and tools.

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

Built various Machine Learning algorithms (Logistic Regression, Random Forest, KNN, Gradient Boosting and XGBoost. etc)

Bodywork deploys machine learning projects developed in Python, to Kubernetes.

Tutorial for Decision Threshold In Machine Learning.

Automatic extraction of relevant features from time series:

A model to predict steering torque fully end-to-end

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Python package for concise, transparent, and accurate predictive modeling

Programming assignments and quizzes from all courses within the Machine Learning Engineering for Production (MLOps) specialization offered by deeplearning.ai