PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

Overview


PyImpetus

PyImpetus is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group. This allows the algorithm to select not just the best set of features, but the best set of features that play well with each other. For example, the best performing feature might not play well with others, while the remaining features, when taken together, could out-perform the best feature. PyImpetus takes this into account and produces the best possible combination. Thus, the algorithm provides a minimal feature subset, so you do not have to decide how many features to take. PyImpetus selects the optimal set for you.
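The point about features that only work as a group can be seen on a toy dataset. The sketch below does not use PyImpetus; the four-row dataset and the helper `best_accuracy` are made up for illustration. Feature `c` is the best single feature, yet `a` and `b` together beat it.

```python
# Toy dataset (hypothetical): each row is (a, b, c, y) with y = a XOR b.
# c copies y on three of the four rows, so it is the strongest single feature.
rows = [
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 1, 0),
]

def best_accuracy(feature_idx):
    """Best accuracy any lookup-table rule can reach using only the given features."""
    groups = {}
    for row in rows:
        key = tuple(row[i] for i in feature_idx)
        groups.setdefault(key, []).append(row[3])
    # Within each group of identical feature values, predict the majority label.
    correct = sum(max(labels.count(0), labels.count(1)) for labels in groups.values())
    return correct / len(rows)

print(best_accuracy((0,)))    # a alone: 0.5
print(best_accuracy((1,)))    # b alone: 0.5
print(best_accuracy((2,)))    # c alone: 0.75
print(best_accuracy((0, 1)))  # a and b together: 1.0
```

A selector that ranks features one at a time would keep `c` and discard `a` and `b`, which is the situation the paragraph above describes.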

PyImpetus has been completely revamped and now supports binary classification, multi-class classification and regression tasks. It has been tested on 14 datasets and, on all of them, outperformed state-of-the-art Markov Blanket learning algorithms as well as traditional feature selection algorithms such as Forward Feature Selection, Backward Feature Elimination and Recursive Feature Elimination.

How to install?

pip install PyImpetus

Functions and parameters

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBC is for classification
model = PPIMBC(model, p_val_thresh, num_simul, simul_size, simul_type, sig_test_type, cv, verbose, random_state, n_jobs)
  • model - estimator object, default=DecisionTreeClassifier() The model which is used to perform classification in order to find feature importance via significance-test.
  • p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
  • num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, the value should be reduced considerably, but do not go below 5.
  • simul_size - float, default=0.2 The size of the test set in each train-test split
  • simul_type - boolean, default=0 To apply stratification or not
    • 0 means train-test splits are not stratified.
    • 1 means the train-test splits will be stratified.
  • sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
    • "parametric" means a parametric significance test will be used (Note: This test selects very few features)
    • "non-parametric" means a non-parametric significance test will be used
  • cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
  • verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
  • random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
  • n_jobs - int, default=-1 The number of CPUs to use to do the computation.
    • None means 1 unless in a joblib.parallel_backend context.
    • -1 means using all processors.
# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBR is for regression
model = PPIMBR(model, p_val_thresh, num_simul, simul_size, sig_test_type, cv, verbose, random_state, n_jobs)
  • model - estimator object, default=DecisionTreeRegressor() The model which is used to perform regression in order to find feature importance via significance-test.
  • p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
  • num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, the value should be reduced considerably, but do not go below 5.
  • simul_size - float, default=0.2 The size of the test set in each train-test split
  • sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
    • "parametric" means a parametric significance test will be used (Note: This test selects very few features)
    • "non-parametric" means a non-parametric significance test will be used
  • cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
  • verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
  • random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
  • n_jobs - int, default=-1 The number of CPUs to use to do the computation.
    • None means 1 unless in a joblib.parallel_backend context.
    • -1 means using all processors.
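How num_simul, simul_size and p_val_thresh interact can be pictured with a rough sketch. This is not PyImpetus's implementation. It assumes a pure-Python 1-nearest-neighbour regressor as the model and a one-sided sign test as a stand-in non-parametric significance test, and every helper name (`knn_predict`, `split_mse`, `feature_p_value`) is hypothetical. The idea shown: over repeated train-test splits, compare the error with a feature intact against the error with that feature shuffled, then turn the comparison into a p-value.

```python
import math
import random

def knn_predict(train_X, train_y, x):
    """Return the target of the single nearest training row (1-NN)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

def split_mse(X, y, test_idx):
    """Mean squared error on the rows in test_idx, training on all other rows."""
    test = set(test_idx)
    train_X = [row for i, row in enumerate(X) if i not in test]
    train_y = [t for i, t in enumerate(y) if i not in test]
    errors = [(knn_predict(train_X, train_y, X[i]) - y[i]) ** 2 for i in test_idx]
    return sum(errors) / len(errors)

def feature_p_value(X, y, col, num_simul=30, simul_size=0.2, seed=0):
    """One-sided sign test: does shuffling column `col` make the error worse?"""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, int(n * simul_size))
    wins = 0
    for _ in range(num_simul):
        test_idx = rng.sample(range(n), n_test)
        # Break the link between the feature and the target by shuffling the column.
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_shuffled = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        if split_mse(X_shuffled, y, test_idx) > split_mse(X, y, test_idx):
            wins += 1
    # Probability of at least `wins` successes in num_simul fair coin flips.
    return sum(math.comb(num_simul, k) for k in range(wins, num_simul + 1)) / 2 ** num_simul

# Synthetic data (an assumption for this sketch): column 0 drives the target, column 1 is noise.
data_rng = random.Random(27)
X = [[data_rng.random(), data_rng.random()] for _ in range(60)]
y = [2.0 * row[0] for row in X]

p_val_thresh = 0.05
p_informative = feature_p_value(X, y, col=0)
p_noise = feature_p_value(X, y, col=1)
print("column 0 is a candidate:", p_informative < p_val_thresh)
print("column 1 p-value:", p_noise)
```

In this sketch, a larger num_simul gives the test more evidence but costs proportionally more model fits, which is consistent with the note above that the parameter has a large effect on speed.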
# To fit PyImpetus on provided dataset and find recommended features
fit(data, target)
  • data - A pandas dataframe upon which feature selection is to be applied
  • target - A numpy array, denoting the target variable
# To prune the provided dataset down to the columns that form the MB (These are the recommended features)
transform(data)
  • data - A pandas dataframe which needs to be pruned
# To fit PyImpetus on provided dataset and return pruned data
fit_transform(data, target)
  • data - A pandas dataframe upon which feature selection is to be applied
  • target - A numpy array, denoting the target variable
# To plot XGBoost style feature importance
feature_importance()

How to import?

from PyImpetus import PPIMBC, PPIMBR

Usage

# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
# SVC is the estimator passed to PPIMBC below
from sklearn.svm import SVC
# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, simul_size=0.2, simul_type=0, sig_test_type="non-parametric", cv=5, random_state=27, n_jobs=-1, verbose=2)
# The fit_transform function is a wrapper that calls fit and then transform.
# The fit function finds the MB for the given data while the transform function provides the pruned form of the dataset
df_train = model.fit_transform(df_train.drop("Response", axis=1), df_train["Response"].values)
df_test = model.transform(df_test)
# Check out the MB
print(model.MB)
# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)
# Get a plot of the feature importance scores
model.feature_importance()

For better accuracy

Note: Experiment with the values of num_simul, simul_size, simul_type and p_val_thresh, because sometimes a specific combination of these values gives the best results.

  • Increasing the cv value is not recommended in general: in all experiments, cv did not help in getting better accuracy. Use it only when you have an extremely small dataset
  • Increase the num_simul value
  • Try one of these values for simul_size = {0.1, 0.2, 0.3, 0.4}
  • Use non-linear models for feature selection. Apply hyper-parameter tuning on models
  • Increase the value of p_val_thresh in order to increase the number of features included in the Markov Blanket
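One way to try combinations systematically is to enumerate them up front. The sketch below only builds the candidate settings. The simul_size values come from the list above, while the num_simul and p_val_thresh values and the helper name `candidate_configs` are illustrative assumptions.

```python
from itertools import product

# Candidate values per parameter; only simul_size is taken from the tips above.
search_space = {
    "num_simul": [10, 30, 50],
    "simul_size": [0.1, 0.2, 0.3, 0.4],
    "simul_type": [0, 1],
    "p_val_thresh": [0.01, 0.05, 0.1],
}

def candidate_configs(space):
    """Yield one keyword-argument dict per combination of parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

configs = list(candidate_configs(search_space))
print(len(configs))  # 72 combinations (3 * 4 * 2 * 3)
print(configs[0])
```

Each dict could then be unpacked into the constructor, for example PPIMBC(model=..., **config), and the resulting feature subset scored on a held-out validation set. Since every combination triggers a full feature selection run, a small grid is advisable.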

For better speeds

  • Decrease the cv value. For large datasets cv might not be required. Therefore, set cv=0 to disable the aggregation step. This will result in less robust feature subset selection but at much faster speeds
  • Decrease the num_simul value but don't decrease it below 5
  • Set n_jobs to -1
  • Use linear models

For selection of fewer features

  • Try reducing the p_val_thresh value
  • Try out sig_test_type = "parametric"

Performance in terms of Accuracy (classification) and MSE (regression)

| Dataset | # of samples | # of features | Task Type | Score using all features | Score using featurewiz | Score using PyImpetus | # of features selected | % of features selected | Tutorial |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ionosphere | 351 | 34 | Classification | 88.01% | | 92.86% | 14 | 42.42% | tutorial here |
| Arcene | 100 | 10000 | Classification | 82% | | 84.72% | 304 | 3.04% | |
| AlonDS2000 | 62 | 2000 | Classification | 80.55% | 86.98% | 88.49% | 75 | 3.75% | |
| slice_localization_data | 53500 | 384 | Regression | 6.54 | | 5.69 | 259 | 67.45% | tutorial here |

Note: Here, for the first, second and third tasks, a higher accuracy score is better while for the fourth task, a lower MSE (Mean Squared Error) is better.

Performance in terms of Time (in seconds)

| Dataset | # of samples | # of features | Time (with PyImpetus) |
| --- | --- | --- | --- |
| Ionosphere | 351 | 34 | 35.37 |
| Arcene | 100 | 10000 | 1570 |
| AlonDS2000 | 62 | 2000 | 125.511 |
| slice_localization_data | 53500 | 384 | 1296.13 |

Future Ideas

  • Let me know

Feature Request

Drop me an email at [email protected] if you want any particular feature

Please cite this work as

Reference to the upcoming paper will be added here

Owner
Atif Hassan
PhD student at the Center of Excellence for AI, IIT Kharagpur.