To SMOTE, or not to SMOTE?

Last update: Jan 03, 2022

This package includes the code required to repeat the experiments in the paper and to analyze the results.

To SMOTE, or not to SMOTE?

Yotam Elor and Hadar Averbuch-Elor

Installation

# Create a new conda environment and activate it
conda create --name to-SMOTE-or-not -y python=3.7
conda activate to-SMOTE-or-not
# Install dependencies
pip install -r requirements.txt

Running experiments

The data is not included with this package. The following example runs a single experiment with a dataset from imbalanced-learn:

# Load the data
import pandas as pd
import numpy as np
from imblearn.datasets import fetch_datasets
data = fetch_datasets()["mammography"]
x = pd.DataFrame(data["data"])
y = np.array(data["target"]).reshape((-1, 1))

# Run the experiment
from experiment import experiment
from classifiers import CLASSIFIER_HPS
from oversamplers import OVERSAMPLER_HPS
results = experiment(
    x=x,
    y=y,
    oversampler={
        "type": "smote",
        "ratio": 0.4,
        "params": OVERSAMPLER_HPS["smote"][0],
    },
    classifier={
        "type": "cat",  # Catboost
        "params": CLASSIFIER_HPS["cat"][0]
    },
    seed=0,
    normalize=False,
    clean_early_stopping=False,
    consistent=True,
    repeats=1
)

# Print the results nicely
import json
print(json.dumps(results, indent=4))

To run all the experiments in our study, wrap the above in loops, for example:

for dataset in datasets:
    x, y = load_dataset(dataset)  # this functionality is not provided
    for seed in range(7):
        for classifier, classifier_hp_configs in CLASSIFIER_HPS.items():
            for classifier_hp in classifier_hp_configs:
                for oversampler, oversampler_hp_configs in OVERSAMPLER_HPS.items():
                    for oversampler_hp in oversampler_hp_configs:
                        for ratio in [0.1, 0.2, 0.3, 0.4, 0.5]:
                            results = experiment(
                                x=x,
                                y=y,
                                oversampler={
                                    "type": oversampler,
                                    "ratio": ratio,
                                    "params": oversampler_hp,
                                },
                                classifier={
                                    "type": classifier,
                                    "params": classifier_hp
                                },
                                seed=seed,
                                normalize=...,
                                clean_early_stopping=...,
                                consistent=...,
                                repeats=...
                            )
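The nested loops above can also be flattened into a single stream of configurations, which makes it easier to count the runs or split them across workers. The following is only a sketch: `iter_configs`, `classifier_hps`, and `oversampler_hps` are hypothetical names, and the two dictionaries are placeholder stand-ins that merely mirror the shape of `CLASSIFIER_HPS` and `OVERSAMPLER_HPS` (a type name mapped to a list of hyperparameter dicts).

```python
from itertools import product

# Hypothetical stand-ins shaped like CLASSIFIER_HPS / OVERSAMPLER_HPS:
# each maps a type name to a list of hyperparameter dicts.
classifier_hps = {"clf_a": [{"hp": 1}, {"hp": 2}], "clf_b": [{"hp": 1}]}
oversampler_hps = {"ovs_a": [{"hp": 1}, {"hp": 2}], "ovs_b": [{"hp": 1}]}


def iter_configs(classifier_hps, oversampler_hps, seeds, ratios):
    """Yield one keyword-argument dict per experiment configuration."""
    # Flatten each {type: [hp, ...]} mapping into (type, hp) pairs.
    classifier_choices = [
        (name, hp) for name, hps in classifier_hps.items() for hp in hps
    ]
    oversampler_choices = [
        (name, hp) for name, hps in oversampler_hps.items() for hp in hps
    ]
    # product() varies the last iterable fastest, matching the loop order above.
    for seed, (clf, clf_hp), (ovs, ovs_hp), ratio in product(
        seeds, classifier_choices, oversampler_choices, ratios
    ):
        yield {
            "oversampler": {"type": ovs, "ratio": ratio, "params": ovs_hp},
            "classifier": {"type": clf, "params": clf_hp},
            "seed": seed,
        }


configs = list(
    iter_configs(classifier_hps, oversampler_hps, range(7), [0.1, 0.2, 0.3, 0.4, 0.5])
)
print(len(configs))  # number of experiment calls per dataset for these toy dicts
```

Each yielded dict could then be unpacked into `experiment(x=x, y=y, **config, ...)`, with the remaining arguments supplied as in the loop above.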

Analyze

Read the results from the compressed CSV file. As the results file is large, it is tracked using git-lfs. You might need to install git-lfs or download the file manually.

import os
import pandas as pd
data_path = os.path.join(os.path.dirname(__file__), "../data/results.gz")
df = pd.read_csv(data_path)
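If the repository was cloned without git-lfs, the results file on disk may be a small text pointer rather than the compressed data, and reading it is then likely to fail with a decompression error. The sketch below shows one way to detect that case before calling `pd.read_csv`. The helper name `is_lfs_pointer` is hypothetical and not part of this package; the check relies on git-lfs pointer files starting with a fixed `version` line. The temporary files exist only to demonstrate the helper.

```python
import gzip
import os
import tempfile

# First line of a git-lfs pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a git-lfs pointer stub."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Demonstration on two throwaway files: a fake pointer and a real gzip file.
with tempfile.TemporaryDirectory() as tmp:
    pointer_path = os.path.join(tmp, "pointer.gz")
    with open(pointer_path, "wb") as f:
        f.write(LFS_POINTER_PREFIX + b"\n")

    real_path = os.path.join(tmp, "real.gz")
    with gzip.open(real_path, "wb") as f:
        f.write(b"a,b\n1,2\n")

    pointer_detected = is_lfs_pointer(pointer_path)
    real_detected = is_lfs_pointer(real_path)

print(pointer_detected, real_detected)
```

When the helper returns True for the results file, fetching the real content with git-lfs (or downloading it manually) should resolve the problem.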

Drop NaNs and keep only the experiments with consistent classifiers, no normalization, no clean early stopping, and a single validation fold:

df = df.dropna()
df = df[
    (df["consistent"] == True)
    & (df["normalize"] == False)
    & (df["clean_early_stopping"] == False)
    & (df["repeats"] == 1)
]

Select the best HP configurations according to validation AUC scores. opt_metric is the key used to select the best configuration: for example, use opt_metric="test.roc_auc" for a-priori HPs and opt_metric="validation.roc_auc" for validation HPs. Additionally, calculate the average score and rank:

from analyze import filter_optimal_hps
df = filter_optimal_hps(
    df, opt_metric="validation.roc_auc", output_metrics=["test.roc_auc"]
)
print(df)
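To illustrate how the choice of opt_metric changes which configuration is kept, here is a minimal pandas sketch on a toy table. It is not the implementation of `filter_optimal_hps`: the helper `select_best`, the grouping columns, and the `hp_id` column are hypothetical, and the scores are made-up numbers used only for demonstration. Only the two metric keys come from the text above.

```python
import pandas as pd

# Toy results table with made-up scores. All column names other than the
# two metric keys are hypothetical.
toy = pd.DataFrame(
    {
        "dataset": ["a", "a", "a", "b", "b", "b"],
        "oversampler": ["smote"] * 6,
        "hp_id": [0, 1, 2, 0, 1, 2],
        "validation.roc_auc": [0.80, 0.85, 0.82, 0.70, 0.65, 0.72],
        "test.roc_auc": [0.79, 0.81, 0.84, 0.71, 0.69, 0.70],
    }
)


def select_best(df, opt_metric, group_cols):
    """Keep, per group, the row with the highest value of opt_metric."""
    best_idx = df.groupby(group_cols)[opt_metric].idxmax()
    return df.loc[best_idx].reset_index(drop=True)


group_cols = ["dataset", "oversampler"]
# Validation HPs: pick the configuration by validation score, report test score.
validation_hps = select_best(toy, "validation.roc_auc", group_cols)
# A-priori HPs: pick the configuration directly by test score.
a_priori_hps = select_best(toy, "test.roc_auc", group_cols)

print(validation_hps[["dataset", "hp_id", "test.roc_auc"]])
print(a_priori_hps[["dataset", "hp_id", "test.roc_auc"]])
```

In this toy table the two selection rules pick different configurations for both datasets, so the reported test scores differ depending on opt_metric.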

Plot the results

from analyze import avg_plots
avg_plots(df, "test.roc_auc")

Citation

@misc{elor2022smote,
    title={To SMOTE, or not to SMOTE?}, 
    author={Yotam Elor and Hadar Averbuch-Elor},
    year={2022},
    eprint={2201.08528},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
