SparseLasso: Sparse Solutions for the Lasso

Overview

SparseLasso: Sparse Solutions for the Lasso

Introduction

SparseLasso provides a Scikit-Learn based estimation of the Lasso with cross-validation tuning for the penalty choice using the 'one standard error' rule to yield sparse solutions. The 'one standard error' rule recognizes the fact that the cross-validation path is estimated with error and selects the more parsimonious model (see Hastie, Tibshirani and Friedman, 2009). This rule thus chooses the largest possible penalty which is still within the one standard error of the cross-validation optimal value. Given that the Lasso often selects too many variables in practice, the one standard error rule provides a practical solution to yield sparser models. The software implementation of this rule is readily available in the R-package 'glmnet' (Friedman, Hastie and Tibshirani, 2010), however, it is absent from the Scikit-Learn module (Pedregosa et al., 2011). SparseLasso provides estimation of the penalized linear and logistic model based on Scikit-Learn's LassoCV and LogisticRegressionCV, respectively and thus accepts the standard Scikit-Learn arguments.

Installation

SparseLasso module relies on Python 3 and is based on the scikit-learn module. The required modules can be installed by navigating to the root of this project and executing the following command: pip install -r requirements.txt.

Example

The example below demonstrates the basic usage of the SparseLasso module.

# import modules
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# import SparseLasso
from sparse_lasso import SparseLassoCV

# simulate some example data for the linear model
X, y, coef = make_regression(n_samples=1000,
                             n_features=100, 
                             n_informative=10,
                             noise=10,
                             coef=True,
                             random_state=0)

# estimate standard LassoCV with optimal lambda minimizing error
lasso_min = LassoCV(n_alphas=100, cv=10).fit(X=X, y=y)

# estimate SparseLassoCV with lambda using 1 standard error rule
lasso_1se = SparseLassoCV(n_alphas=100, cv=10).fit(X=X, y=y)

# compare the penalty values
print('Lasso Min Penalty: ', round(lasso_min.alpha_, 2), '\n',
      'Lasso 1se Penalty: ', round(lasso_1se.alpha, 2), '\n')

# compare the number of selected features
print('Lasso Min Number of Selected Variables:     ',
      np.sum((lasso_min.coef_ != 0) * 1), '\n',
      'Lasso 1se Number of Selected Variables:     ',
      np.sum((lasso_1se.coef_ != 0) * 1), '\n')

For a more detailed example see the sparse_lasso_example.py as well as the sparse_lasso_simulation.py for a simulation exercise comparing the optimal cross-validation penalty choice with the one standard error rule for variable selection.

References

  • Hastie, Trevor, Robert Tibshirani, and J H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. , 2009. Print.
  • Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. "Regularization paths for generalized linear models via coordinate descent." Journal of statistical software 33.1 (2010): 1.
  • Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." the Journal of machine Learning research 12 (2011): 2825-2830.
Owner
Gabriel Okasa
PhD Candidate in Econometrics at the University of St.Gallen, Switzerland
Gabriel Okasa
Transform-Invariant Non-Negative Matrix Factorization

Transform-Invariant Non-Negative Matrix Factorization A comprehensive Python package for Non-Negative Matrix Factorization (NMF) with a focus on learn

EMD Group 6 Jul 01, 2022
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.

AptaMAT Purpose AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures. The method is based on the compa

GEC UTC 3 Nov 03, 2022
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

📈 Statistical Quality Control 📉 This repo contains a simple but effective tool made using python which can be used for quality control in statistica

SasiVatsal 8 Oct 18, 2022
a tool that compiles a csv of all h1 program stats

h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati

Evan 40 Oct 27, 2022
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021
NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

Blaze 3.1k Jan 05, 2023
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

HoloViz 2.9k Jan 06, 2023
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here i present my solution to kaggle 2019 data science bowl and how i improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging

Larch: Data Analysis Tools for X-ray Spectroscopy and More Documentation: http://xraypy.github.io/xraylarch Code: http://github.com/xraypy/xraylarch L

xraypy 95 Dec 13, 2022
This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark a

Denny Imanuel 1 Dec 29, 2021
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 7.2k Dec 30, 2022
Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation

Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation Overview Consider the scenario in which advertisement

Manuel Bressan 2 Nov 18, 2021
Aggregating gridded data (xarray) to polygons

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons. Check out the binder link above for a sample c

Kevin Schwarzwald 42 Nov 09, 2022
Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Cedric Zhuang 1.1k Dec 28, 2022
Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

WeRateDogs Twitter Data from 2015 to 2017 Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data Table of Contents Introduction Proj

Keenan Cooper 1 Jan 12, 2022
Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

PyUpBit CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont

Hyeong Kyun (Daniel) Park 1 Jun 28, 2022
Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

Mathis HAMMEL 29 Oct 18, 2022
Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Python dataset creator to construct datasets composed of OpenFace extracted features and Shimmer3 GSR+ Sensor datas

Gabriele 3 Jul 05, 2022