LightGBM + Optuna: no brainer

Overview

AutoLGBM

LightGBM + Optuna: no brainer

  • auto train lightgbm directly from CSV files
  • auto tune lightgbm using optuna
  • auto serve best lightgbm model using fastapi

NOTE: PRs are currently

  • not accepted. If there are issues/problems, please create an issue.
  • accepted. If there are issues/problems, please solve with a PR.

Inspired by Abhishek Thakur's AutoXGB.

Installation

Install using pip

pip install autolgbm

Usage

Training a model using AutoLGBM is a piece of cake. All you need is some tabular data.

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# target columns are list of strings
# if not specified, the target column be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# features columns are list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features are list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False

Python API

To train a new model, you can run:

from autolgbm import AutoLGBM


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now its time to train the model!
algbm = AutoLGBM(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
algbm.train()

CLI

Train the model using the autolgbm train command. The parameters are same as above.

autolgbm train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autolgbm serve command.

autolgbm serve --model_path outputs/mll --host 0.0.0.0 --debug

To know more about a command, run:

`autolgbm  --help` 
autolgbm train --help


usage: autolgbm  [
   
    ] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization

   
Owner
Rishiraj Acharya
Machine Learning Engineer at Dynopii | Teacher (CS106A) at Stanford | Microsoft Student Ambassador, DeepLearning.AI Ambassador | ML Team Lead at Google DSC NSEC
Rishiraj Acharya
使用数学和计算机知识投机倒把

偷鸡不成项目集锦 坦率地讲,涉及金融市场的好策略如果公开,必然导致使用的人多,最后策略变差。所以这个仓库只收集我目前失败了的案例。 加密货币组合套利 中国体育彩票预测 我赚不上钱的项目,也许可以帮助更有能力的人去赚钱。

Roy 28 Dec 29, 2022
(3D): LeGO-LOAM, LIO-SAM, and LVI-SAM installation and application

SLAM-application: installation and test (3D): LeGO-LOAM, LIO-SAM, and LVI-SAM Tested on Quadruped robot in Gazebo ● Results: video, video2 Requirement

EungChang-Mason-Lee 203 Dec 26, 2022
YouTube Spam Detection with python

YouTube Spam Detection This code deletes spam comment on youtube videos based on two characteristics (currently) If the author of the comment has a se

MohamadReza Taalebi 5 Sep 27, 2022
Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models

Highly interpretable, sklearn-compatible classifier based on decision rules This is a scikit-learn compatible wrapper for the Bayesian Rule List class

Tamas Madl 482 Nov 19, 2022
Predict the output which should give a fair idea about the chances of admission for a student for a particular university

Predict the output which should give a fair idea about the chances of admission for a student for a particular university.

ArvindSandhu 1 Jan 11, 2022
Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill

Python implementation of Weng-Lin Bayesian ranking, a better, license-free alternative to TrueSkill This is a port of the amazing openskill.js package

Open Debates Project 156 Dec 14, 2022
李航《统计学习方法》复现

本项目复现李航《统计学习方法》每一章节的算法 特点: 笔记摘要:在每个文件开头都会有一些核心的摘要 pythonic:这里会用尽可能规范的方式来实现,包括编程风格几乎严格按照PEP8 循序渐进:前期的算法会更list的方式来做计算,可读性比较强,后期几乎完全为numpy.array的计算,并且辅助详

58 Oct 22, 2021
The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it inside a loop of Design, Model Development and Operations.

MLOps The MLOps is the process of continuous integration and continuous delivery of Machine Learning artifacts as a software product, keeping it insid

Maykon Schots 25 Nov 27, 2022
ClearML - Auto-Magical Suite of tools to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

ClearML - Auto-Magical Suite of tools to streamline your ML workflow Experiment Manager, MLOps and Data-Management ClearML Formerly known as Allegro T

ClearML 4k Jan 09, 2023
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

Aaron Zuspan 76 Dec 15, 2022
machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service

This is a machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service. We initially made th

Krishna Priyatham Potluri 73 Dec 01, 2022
A collection of Machine Learning Models To Web Api which are built on open source technologies/frameworks like Django, Flask.

Author Ibrahim Koné From-Machine-Learning-Models-To-WebAPI A collection of Machine Learning Models To Web Api which are built on open source technolog

Ibrahim Koné 2 May 24, 2022
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
Banpei is a Python package of the anomaly detection.

Banpei Banpei is a Python package of the anomaly detection. Anomaly detection is a technique used to identify unusual patterns that do not conform to

Hirofumi Tsuruta 282 Jan 03, 2023
Simple and flexible ML workflow engine.

This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable flow to handle requests. Engine is designed to be configurable wit

Katana ML 295 Jan 06, 2023
Data Efficient Decision Making

Data Efficient Decision Making

Microsoft 197 Jan 06, 2023
flexible time-series processing & feature extraction

A corona statistics and information telegram bot.

PreDiCT.IDLab 206 Dec 28, 2022
Open source time series library for Python

PyFlux PyFlux is an open source time series library for Python. The library has a good array of modern time series models, as well as a flexible array

Ross Taylor 2k Jan 02, 2023
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

92 Dec 14, 2022
PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

PyNNDescent PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbo

Leland McInnes 699 Jan 09, 2023