A fast, flexible, and performant feature selection package for python.

Overview

linselect

A fast, flexible, and performant feature selection package for python.

Package in a nutshell

It's built on stepwise linear regression

When passed data, the underlying algorithm seeks minimal variable subsets that produce good linear fits to the targets. This approach to feature selection strikes a competitive balance between performance, speed, and memory efficiency.

It has a simple API

A simple API makes it easy to quickly rank a data set's features in terms of their added value to a given fit. This is demoed below, where we learn that we can drop column 1 of X and still obtain a fit to y that captures 97.37% of its variance.

from linselect import FwdSelect
import numpy as np

X = np.array([[1,2,4], [1,1,2], [3,2,1], [10,2,2]])
y = np.array([[1], [-1], [-1], [1]])

selector = FwdSelect()
selector.fit(X, y)

print selector.ordered_features
print selector.ordered_cods
# [2, 0, 1]
# [0.47368422, 0.97368419, 1.0]

X_compressed = X[:, selector.ordered_features[:2]]

It's fast

A full sweep on a 1000 feature count data set runs in 10s on my laptop -- about one million times faster (seriously) than standard stepwise algorithms, which are effectively too slow to run at this scale. A 100 count feature set runs in 0.07s.

from linselect import FwdSelect
import numpy as np
import time

X = np.random.randn(5000, 1000)
y = np.random.randn(5000, 1)

selector = FwdSelect()

t1 = time.time()
selector.fit(X, y)
t2 = time.time()
print t2 - t1
# 9.87492

Its scores reveal your effective feature count

By plotting fitted CODs against ranked feature count, one often learns that seemingly high-dimensional problems can actually be understood using only a minority of the available features. The plot below demonstrates this: A fit to one year of AAPL's stock fluctuations -- using just 3 selected stocks as predictors -- nearly matches the performance of a 49-feature fit. The 3-feature fit arguably provides more insight and is certainly easier to reason about (cf. tutorials for details).

apple stock plot

It's flexible

linselect exposes multiple applications of the underlying algorithm. These allow for:

  • Forward, reverse, and general forward-reverse stepwise regression strategies.
  • Supervised applications aimed at a single target variable or simultaneous prediction of multiple target variables.
  • Unsupervised applications. The algorithm can be applied to identify minimal, representative subsets of an available column set. This provides a feature selection analog of PCA -- importantly, one that retains interpretability.

Under the hood

Feature selection algorithms are used to seek minimal column / feature subsets that capture the majority of the useful information contained within a data set. Removal of a selected subset's complement -- the relatively uninformative or redundant features -- can often result in a significant data compression and improved interpretability.

Stepwise selection algorithms work by iteratively updating a model feature set, one at a time [1]. For example, in a given step of a forward process, one considers all of the features that have not yet been added to the model, and then identifies that which would improve the model the most. This is added, and the process is then repeated until all features have been selected. The features that are added first in this way tend to be those that are predictive and also not redundant with those already included in the predictor set. Retaining only these first selected features therefore provides a convenient method for identifying minimal, informative feature subsets.

In general, identifying the optimal feature to add to a model in a given step requires building and scoring each possible updated model variant. This results in a slow process: If there are n features, O(n^2) models must be built to carry out a full ranking. However, the process can be dramatically sped up in the case of linear regression -- thanks to some linear algebra identities that allow one to efficiently update these models as features are either added or removed from their predictor sets [2,3]. Using these update rules, a full feature ranking can be carried out in roughly the same amount of time that is needed to fit only a single model. For n=1000, this means we get an O(n^2) = O(10^6) speed up! linselect makes use of these update rules -- first identified in [2] -- allowing for fast feature selection sweeps.

[1] Introduction to Statistical Learning by G. James, et al -- cf. chapter 6.

[2] M. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 1:191–203, 1960.

[3] J. Landy. Stepwise regression for unsupervised learning, 2017. arxiv.1706.03265.

Classes, documentation, tests, license

linselect contains three classes: FwdSelect, RevSelect, and GenSelect. As the names imply, these support efficient forward, reverse, and general forward-reverse search protocols, respectively. Each can be used for both supervised and unsupervised analyses.

Docstrings and basic call examples are illustrated for each class in the ./docs folder.

An FAQ and a running list of tutorials are available at efavdb.com/linselect.

Tests: From the root directory,

python setup.py test

This project is licensed under the terms of the MIT license.

Installation

The package can be installed using pip, from pypi

pip install linselect

or from github

pip install git+git://github.com/efavdb/linselect.git

Author

Jonathan Landy - EFavDB

Acknowledgments: Special thanks to P. Callier, P. Spanoudes, and R. Zhou for providing helpful feedback.

Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

MIR Cheatsheet - Survival Guidebook for MIR Researchers in the Lab

SeungHeonDoh 3 Jul 02, 2022
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Matthew Johnson 527 Dec 04, 2022
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.

2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou

Souradeep Banerjee 5 Oct 10, 2022
Provide a market analysis (R)

market-study Provide a market analysis (R) - FRENCH Produisez une étude de marché Prérequis Pour effectuer ce projet, vous devrez maîtriser la manipul

1 Feb 13, 2022
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
peptides.py is a pure-Python package to compute common descriptors for protein sequences

peptides.py Physicochemical properties and indices for amino-acid sequences. 🗺️ Overview peptides.py is a pure-Python package to compute common descr

Martin Larralde 32 Dec 31, 2022
BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings.

BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings. it also can assist the binary code analysis rese

BinTuner 42 Dec 16, 2022
A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

GBiStat package A python package to assist programmers with data analysis. This package could be used to plot : Binomial Distribution of the dataset p

Rishikesh S 4 Oct 17, 2022
A data structure that extends pyspark.sql.DataFrame with metadata information.

MetaFrame A data structure that extends pyspark.sql.DataFrame with metadata info

Invent Analytics 8 Feb 15, 2022
This module is used to create Convolutional AutoEncoders for Variational Data Assimilation

VarDACAE This module is used to create Convolutional AutoEncoders for Variational Data Assimilation. A user can define, create and train an AE for Dat

Julian Mack 23 Dec 16, 2022
Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

Art Tucker 1 Sep 05, 2021
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 03, 2023
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 03, 2023
A script to "SHUA" H1-2 map of Mercenaries mode of Hearthstone

lushi_script Introduction This script is to "SHUA" H1-2 map of Mercenaries mode of Hearthstone Installation Make sure you installed python=3.6. To in

210 Jan 02, 2023
Python ELT Studio, an application for building ELT (and ETL) data flows.

The Python Extract, Load, Transform Studio is an application for performing ELT (and ETL) tasks. Under the hood the application consists of a two parts.

Schlerp 55 Nov 18, 2022
📊 Python Flask game that consolidates data from Nasdaq, allowing the user to practice buying and selling stocks.

Web Trader Web Trader is a trading website that consolidates data from Nasdaq, allowing the user to search up the ticker symbol and price of any stock

Paulina Khew 21 Aug 30, 2022
Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data

Statistical_Modelling Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data Statistical Methods for Decision Ma

Avnika Mehta 1 Jan 27, 2022