Bonsai: Gradient Boosted Trees + Bayesian Optimization

Overview

Bonsai: Gradient Boosted Trees + Bayesian Optimization

Bonsai is a wrapper for the XGBoost and Catboost model training pipelines that leverages Bayesian optimization for computationally efficient hyperparameter tuning.

Despite being a very small package, it has access to nearly all of the configurable parameters in XGBoost and CatBoost as well as the BayesianOptimization package allowing users to specify unique objectives, metrics, parameter search ranges, and search policies. This is made possible thanks to the strong similarities between both libraries.

$ pip install bonsai-tree

References/Dependencies:

Why use Bonsai?

Grid search and random search are the most commonly used algorithms for exploring the hyperparameter space for a wide range of machine learning models. While effective for optimizing over low dimensional hyperparameter spaces (ex: few regularization terms), these methods do not scale well to models with a large number of hyperparameters such as gradient boosted trees.

Bayesian optimization on the other hand dynamically samples from the hyperparameter space with the goal of minimizing uncertaintly about the underlying objective function. For the case of model optimization, this consists of iteratively building a prior distribution of functions over the hyperparameter space and sampling with the goal of minimizing the posterior variance of the loss surface (via Gaussian Processes).

Model Configuration

Since Bonsai is simply a wrapper for both XGBoost and CatBoost, the model_params dict is synonymous with the params argument for both catboost.fit() and xgboost.fit(). Additionally, you must encode your categorical features as usual depending on which library you are using (XGB: One-Hot, CB: Label).

Below is a simple example of binary classification using CatBoost:

# label encoded training data
X = train.drop(target, axis = 1)
y = train[target]

# same args as catboost.train(...)
model_params = dict(objective = 'Logloss', verbose = False)

# same args as catboost.cv(...)
cv_params = dict(nfold = 5)

The pbounds dict as seen below specifies the hyperparameter bounds over which the optimizer will search. Additionally, the opt_config dictionary is for configuring the optimizer itself. Refer to the BayesianOptimization documentation to learn more.

# defining parameter search ranges
pbounds = dict(
  eta = (0.15, 0.4), 
  n_estimators = (200,2000), 
  max_depth = (4, 8)
)

# 10 warm up samples + 10 optimizing steps
n_iter, init_points= 10, 10

# to learn more about customizing your search policy:
# BayesianOptimization/examples/exploitation_vs_exploration.ipynb
opt_config = dict(acq = 'ei', xi = 1e-2)

Tuning and Prediction

All that is left is to initialize and optimize.

from bonsai.tune import CB_Tuner

# note that 'cats' is a list of categorical feature names
tuner = CB_Tuner(X, y, cats, model_params, cv_params, pbounds)
tuner.optimize(n_iter, init_points, opt_config, bounds_transformer)

After the optimal parameters are found, the model is trained and stored internally giving full access to the CatBoost model.

test_pool = catboost.Pool(test, cat_features = cats)
preds = tuner.model.predict(test_pool, prediction_type = 'Probability')

Bonsai also comes with a parallel coordinates plotting functionality allowing users to further narrow down their parameter search ranges as needed.

from bonsai.utils import parallel_coordinates

# DataFrame with hyperparams and observed loss
results = tuner.opt_results
parallel_coordinates(results)

Owner
Landon Buechner
Library of Stan Models for Survival Analysis

survivalstan: Survival Models in Stan author: Jacki Novik Overview Library of Stan Models for Survival Analysis Features: Variety of standard survival

Hammer Lab 122 Jan 06, 2023
An implementation of Relaxed Linear Adversarial Concept Erasure (RLACE)

Background This repository contains an implementation of Relaxed Linear Adversarial Concept Erasure (RLACE). Given a dataset X of dense representation

Shauli Ravfogel 4 Apr 13, 2022
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 07, 2022
Regularization and Feature Selection in Least Squares Temporal Difference Learning

Regularization and Feature Selection in Least Squares Temporal Difference Learning Description This is Python implementations of Least Angle Regressio

Mina Parham 0 Jan 18, 2022
My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

kNN-vs-RFR My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data In many areas, rental bikes have been launched to

1 Oct 28, 2021
An easier way to build neural search on the cloud

Jina is geared towards building search systems for any kind of data, including text, images, audio, video and many more. With the modular design & multi-layer abstraction, you can leverage the effici

Jina AI 17k Jan 01, 2023
2021 Machine Learning Security Evasion Competition

2021 Machine Learning Security Evasion Competition This repository contains code samples for the 2021 Machine Learning Security Evasion Competition. P

Fabrício Ceschin 8 May 01, 2022
SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

hexhamming What does it do? This module performs a fast bitwise hamming distance of two hexadecimal strings. This looks like: DEADBEEF = 1101111010101

Michael Recachinas 12 Oct 14, 2022
ETNA is an easy-to-use time series forecasting framework.

ETNA is an easy-to-use time series forecasting framework. It includes built in toolkits for time series preprocessing, feature generation, a variety of predictive models with unified interface - from

Tinkoff.AI 674 Jan 07, 2023
MBTR is a python package for multivariate boosted tree regressors trained in parameter space.

MBTR is a python package for multivariate boosted tree regressors trained in parameter space.

SUPSI-DACD-ISAAC 61 Dec 19, 2022
Machine Learning Course with Python:

A Machine Learning Course with Python Table of Contents Download Free Deep Learning Resource Guide Slack Group Introduction Motivation Machine Learnin

Instill AI 6.9k Jan 03, 2023
Xeasy-ml is a packaged machine learning framework.

xeasy-ml 1. What is xeasy-ml Xeasy-ml is a packaged machine learning framework. It allows a beginner to quickly build a machine learning model and use

9 Mar 14, 2022
决策树分类与回归模型的实现和可视化

DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

Welt Xing 10 Oct 22, 2022
Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

implementation of machine learning Algorithms such as decision tree and random forest and xgboost on darasets then compare results for each and implement ant colony and genetic algorithms on tsp map,

Mohamadreza Rezaei 1 Jan 19, 2022
Timeseries analysis for neuroscience data

=================================================== Nitime: timeseries analysis for neuroscience data ===============================================

NIPY developers 212 Dec 09, 2022
A simple example of ML classification, cross validation, and visualization of feature importances

Simple-Classifier This is a basic example of how to use several different libraries for classification and ensembling, mostly with sklearn. Example as

Rob 2 Aug 25, 2022
ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions

ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions, in particular, the posterior distributions of Bayesian models in

Computational Data Science Lab 182 Dec 31, 2022
🎛 Distributed machine learning made simple.

🎛 lazycluster Distributed machine learning made simple. Use your preferred distributed ML framework like a lazy engineer. Getting Started • Highlight

Machine Learning Tooling 44 Nov 27, 2022
Machine Learning for Time-Series with Python.Published by Packt

Machine-Learning-for-Time-Series-with-Python Become proficient in deriving insights from time-series data and analyzing a model’s performance Links Am

Packt 124 Dec 28, 2022
Repository for DCA0305, an undergraduate course about Machine Learning Workflows and Pipelines

Federal University of Rio Grande do Norte Technology Center Department of Computer Engineering and Automation Machine Learning Based Systems Design Re

Ivanovitch Silva 81 Oct 18, 2022