To SMOTE, or not to SMOTE?

This package includes the code required to repeat the experiments in the paper and to analyze the results.

To SMOTE, or not to SMOTE?

Yotam Elor and Hadar Averbuch-Elor

Installation

# Create a new conda environment and activate it
conda create --name to-SMOTE-or-not -y python=3.7
conda activate to-SMOTE-or-not
# Install dependencies
pip install -r requirements.txt

Running experiments

The data is not included with this package. The example below runs a single experiment on a dataset from imbalanced-learn:

# Load the data
import pandas as pd
import numpy as np
from imblearn.datasets import fetch_datasets
data = fetch_datasets()["mammography"]
x = pd.DataFrame(data["data"])
y = np.array(data["target"]).reshape((-1, 1))

# Run the experiment
from experiment import experiment
from classifiers import CLASSIFIER_HPS
from oversamplers import OVERSAMPLER_HPS
results = experiment(
    x=x,
    y=y,
    oversampler={
        "type": "smote",
        "ratio": 0.4,
        "params": OVERSAMPLER_HPS["smote"][0],
    },
    classifier={
        "type": "cat",  # Catboost
        "params": CLASSIFIER_HPS["cat"][0]
    },
    seed=0,
    normalize=False,
    clean_early_stopping=False,
    consistent=True,
    repeats=1
)

# Print the results nicely
import json
print(json.dumps(results, indent=4))
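
The analysis steps later in this README refer to metrics by dotted names such as "test.roc_auc". As a minimal sketch, assuming the returned results are a nested dict keyed by split and then by metric (the dict below is a hypothetical stand-in, not real output of experiment()), pandas can flatten such a structure into dot-separated columns:

```python
import pandas as pd

# Hypothetical results dict; the real output of experiment() may be shaped differently.
example_results = {
    "validation": {"roc_auc": 0.91},
    "test": {"roc_auc": 0.89},
}

# json_normalize flattens nested keys into dot-separated column names
# and returns a one-row DataFrame for a single dict.
flat = pd.json_normalize(example_results)
print(flat.columns.tolist())
```

If the real results have a different shape, the same call still applies, but the resulting column names will differ.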

To run all the experiments in our study, wrap the call above in nested loops, for example:

for dataset in datasets:
    x, y = load_dataset(dataset)  # this functionality is not provided
    for seed in range(7):
        for classifier, classifier_hp_configs in CLASSIFIER_HPS.items():
            for classifier_hp in classifier_hp_configs:
                for oversampler, oversampler_hp_configs in OVERSAMPLER_HPS.items():
                    for oversampler_hp in oversampler_hp_configs:
                        for ratio in [0.1, 0.2, 0.3, 0.4, 0.5]:
                            results = experiment(
                                x=x,
                                y=y,
                                oversampler={
                                    "type": oversampler,
                                    "ratio": ratio,
                                    "params": oversampler_hp,
                                },
                                classifier={
                                    "type": classifier,
                                    "params": classifier_hp
                                },
                                seed=seed,
                                normalize=...,
                                clean_early_stopping=...,
                                consistent=...,
                                repeats=...
                            )
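
The nested loops above can also be flattened with itertools.product. The following is only a sketch: iter_configs is a hypothetical helper that is not part of this package, and the toy dictionaries stand in for CLASSIFIER_HPS and OVERSAMPLER_HPS with made-up hyperparameter values. It yields the oversampler, classifier and seed arguments in the same order as the loops above.

```python
from itertools import product


def iter_configs(classifier_hps, oversampler_hps, ratios, seeds):
    """Yield one dict of experiment arguments per combination (hypothetical helper)."""
    # Expand each {name: [hp_config, ...]} mapping into (name, hp_config) pairs.
    classifiers = [(name, hp) for name, hps in classifier_hps.items() for hp in hps]
    oversamplers = [(name, hp) for name, hps in oversampler_hps.items() for hp in hps]
    # product varies the last argument fastest, matching the nesting order of the loops.
    for seed, (clf, clf_hp), (ovs, ovs_hp), ratio in product(
        seeds, classifiers, oversamplers, ratios
    ):
        yield {
            "oversampler": {"type": ovs, "ratio": ratio, "params": ovs_hp},
            "classifier": {"type": clf, "params": clf_hp},
            "seed": seed,
        }


# Toy stand-ins for CLASSIFIER_HPS and OVERSAMPLER_HPS (made-up values).
toy_classifier_hps = {"cat": [{"depth": 4}, {"depth": 6}]}
toy_oversampler_hps = {"smote": [{"k_neighbors": 3}, {"k_neighbors": 5}]}

configs = list(iter_configs(toy_classifier_hps, toy_oversampler_hps, [0.1, 0.2], range(2)))
print(len(configs))
```

Each yielded dict could then be unpacked into experiment() together with x, y and the remaining keyword arguments.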

Analyze

Read the results from the compressed CSV file. Because the results file is large, it is tracked with git-lfs; you may need to install git-lfs or download the file manually.

import os
import pandas as pd
data_path = os.path.join(os.path.dirname(__file__), "../data/results.gz")
df = pd.read_csv(data_path)

Drop NaNs and keep only the experiments with consistent classifiers, no normalization, clean_early_stopping disabled, and a single validation fold:

df = df.dropna()
df = df[
    (df["consistent"] == True)
    & (df["normalize"] == False)
    & (df["clean_early_stopping"] == False)
    & (df["repeats"] == 1)
]

Select the best HP configurations according to AUC validation scores. opt_metric is the key used to select the best configuration. For example, for a-priori HPs use opt_metric="test.roc_auc", and for validation-HPs use opt_metric="validation.roc_auc". Additionally, calculate the average score and rank:

from analyze import filter_optimal_hps
df = filter_optimal_hps(
    df, opt_metric="validation.roc_auc", output_metrics=["test.roc_auc"]
)
print(df)
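
To illustrate the idea behind this selection step, here is a minimal sketch on a toy table. It is not the implementation of filter_optimal_hps: the grouping columns (dataset, oversampler, hp_id) and all values are assumptions made for the example. For each group it keeps the HP configuration with the highest validation score, then ranks the oversamplers by test score.

```python
import pandas as pd

# Toy results table; column names other than the two metrics are assumptions.
toy = pd.DataFrame({
    "dataset": ["a", "a", "a", "a"],
    "oversampler": ["smote", "smote", "none", "none"],
    "hp_id": [0, 1, 0, 1],
    "validation.roc_auc": [0.80, 0.85, 0.82, 0.81],
    "test.roc_auc": [0.78, 0.83, 0.84, 0.79],
})

# For each (dataset, oversampler), keep the row with the highest validation score.
best_idx = toy.groupby(["dataset", "oversampler"])["validation.roc_auc"].idxmax()
best = toy.loc[best_idx].reset_index(drop=True)

# Rank oversamplers within each dataset by test score (1 = best).
best["rank"] = best.groupby("dataset")["test.roc_auc"].rank(ascending=False)
print(best[["dataset", "oversampler", "test.roc_auc", "rank"]])
```

Selecting on the validation metric and reporting the test metric keeps the reported score independent of the selection.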

Plot the results

from analyze import avg_plots
avg_plots(df, "test.roc_auc")

Citation

@misc{elor2022smote,
    title={To SMOTE, or not to SMOTE?}, 
    author={Yotam Elor and Hadar Averbuch-Elor},
    year={2022},
    eprint={2201.08528},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
