Python Automated Machine Learning library for tabular data.

Overview

Read the Docs Lines of code GitHub issues GitHub Repo stars GitHub contributors


Logo

Simple but powerful Automated Machine Learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate routine Data Science tasks.
πŸ“š Explore the docs Β»

🐞 Report Bug Β· πŸ†• Request Feature

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About the project

Disclaimer

This library is an open-source research project and is not part of any official SAP products.

What's this?

This is a simple but accurate Automated Machine Learning library. Based on SAP HANA powerful in-memory algorithms, it provides high accuracy in multiple machine learning tasks. Our library also uses numerous data preprocessing functions to automate routine data cleaning tasks. So, hana_automl goes through all AutoML steps and makes Data Science work easier.

What is SAP HANA?

From www.sap.com: SAP HANA is a high-performance in-memory database that speeds data-driven, real-time decisions and actions.

Web app

https://share.streamlit.io/dan0nchik/sap-hana-automl/main/web.py

Documentation

https://sap-hana-automl.readthedocs.io/en/latest/index.html

Benchmarks

https://github.com/dan0nchik/SAP-HANA-AutoML/blob/main/comparison_openml.ipynb

ML tasks:

  • Binary classification
  • Regression
  • Multiclass classification
  • Forecasting

Steps automated:

  • Data exploration
  • Data preparation
  • Feature engineering
  • Model selection
  • Model training
  • Hyperparameter tuning

πŸ‘‡ By the end of summer 2021, blue part will be fully automated by our library Logo

Clients

Streamlit client Streamlit client

Built With

Getting Started

To get a package up and running, follow these simple steps.

Prerequisites

Make sure you have the following:

  1. βœ… Setup SAP HANA (skip this step if you have an instance with PAL enabled). There are 2 ways to do that.
    In HANA Cloud:

    • Create a free trial account
    • Setup an instance
    • Enable PAL - Predictive Analysis Library. It is vital to enable it because we use their algorithms.

    In Virtual Machine:

    • Rent a virtual machine in Azure, AWS, Google Cloud, etc.
    • Install HANA instance there or on your PC (if you have >32 Gb RAM).
    • Enable PAL - Predictive Analysis Library. It is vital to enable it because we use their algorithms.
  2. βœ… Installed software

  • Python > 3.6
    Skip this step if python --version returns > 3.6
  • Cython
    pip3 install Cython

Installation

There are 2 ways to install the library

  • Stable: from pypi
    pip3 install hana_automl
  • Latest: from the repository
    pip3 install https://github.com/dan0nchik/SAP-HANA-AutoML/archive/dev.zip
    Note: latest version may contain bugs, be careful!

After installation

Check that PAL (Predictive Analysis Library) is installed and roles are granted

  • Read docs section about that.
  • If you don't want to read docs, run this code
    from hana_automl.utils.scripts import setup_user
    from hana_ml.dataframe import ConnectionContext
    
    cc = ConnectionContext(address='address', user='user', password='password', port=39015)
    
    # replace with credentials of user that will be created or granted a role to run PAL.
    setup_user(connection_context=cc, username='user', password="password")

Usage

From code

Our library in a few lines of code

Connect to database.

from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext(address='address',
                     user='username',
                     password='password',
                     port=1234)

Create AutoML model and fit it.

from hana_automl.automl import AutoML

model = AutoML(cc)
model.fit(
  file_path='path to training dataset', # it may be HANA table/view, or pandas DataFrame
  steps=10, # number of iterations
  target='target', # column to predict
  time_limit=120 # time limit in seconds
)

Predict.

model.predict(
file_path='path to test dataset',
id_column='ID',
verbose=1
)

For more examples, please refer to the Documentation

How to run Streamlit client

  1. Clone repository: git clone https://github.com/dan0nchik/SAP-HANA-AutoML.git
  2. Install dependencies: pip3 install -r requirements.txt
  3. Run GUI: streamlit run ./web.py

Roadmap

See the open issues for a list of proposed features (and known issues). Feel free to report any bugs :)

Contributing

Any contributions you make are greatly appreciated πŸ‘ !

  1. Fork the Project

  2. Create your Feature Branch (git checkout -b feature/NewFeature)

  3. Install dependencies

    pip3 install Cython
    pip3 install -r requirements.txt
  4. Create credentials.py file in tests directory Your files should look like this:

    SAP-HANA-AutoML
    β”‚   README.md
    β”‚   all other files   
    β”‚   .....
    |
    └───tests
        β”‚   test files...
        β”‚   credentials.py
    

    Copy and paste this piece of code there and replace it with your credentials:

    host = "host"
    user = "username"
    password = "password"
    port = 39015 # or any port you need
    schema = "your schema"

    Don't worry, this file is in .gitignore, so your credentials won't be seen by anyone.

  5. Make some changes

  6. Write tests that cover your code in tests directory

  7. Run tests (under SAP-HANA-AutoML directory)

    pytest
  8. Commit your changes (git commit -m 'Add some amazing features')

  9. Push to the branch (git push origin feature/AmazingFeature)

  10. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
Don't really understand license? Check out the MIT license summary.

Contact

Authors: @While-true-codeanything, @DbusAI, @dan0nchik

Project Link: https://github.com/dan0nchik/SAP-HANA-AutoML

Owner
Daniel Khromov
Learning Swift, C#, and Data Science
Daniel Khromov
Reproducibility and Replicability of Web Measurement Studies

Reproducibility and Replicability of Web Measurement Studies This repository holds additional material to the paper "Reproducibility and Replicability

6 Dec 31, 2022
A simple example of ML classification, cross validation, and visualization of feature importances

Simple-Classifier This is a basic example of how to use several different libraries for classification and ensembling, mostly with sklearn. Example as

Rob 2 Aug 25, 2022
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 519 Jan 03, 2023
Implementation of deep learning models for time series in PyTorch.

List of Implementations: Currently, the reimplementation of the DeepAR paper(DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks

Yunkai Zhang 275 Dec 28, 2022
Transpile trained scikit-learn estimators to C, Java, JavaScript and others.

sklearn-porter Transpile trained scikit-learn estimators to C, Java, JavaScript and others. It's recommended for limited embedded systems and critical

Darius Morawiec 1.2k Jan 05, 2023
Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets

Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets Datasets Used: Iris dataset,

Samrat Mitra 2 Nov 18, 2021
Extended Isolation Forest for Anomaly Detection

Table of contents Extended Isolation Forest Summary Motivation Isolation Forest Extension The Code Installation Requirements Use Citation Releases Ext

Sahand Hariri 377 Dec 18, 2022
Pandas-method-chaining is a plugin for flake8 that provides method chaining linting for pandas code

pandas-method-chaining pandas-method-chaining is a plugin for flake8 that provides method chaining linting for pandas code. It is a fork from pandas-v

Francis 5 May 14, 2022
Nixtla is an open-source time series forecasting library.

Nixtla Nixtla is an open-source time series forecasting library. We are helping data scientists and developers to have access to open source state-of-

Nixtla 401 Jan 08, 2023
Python/Sage Tool for deriving Scattering Matrices for WDF R-Adaptors

R-Solver A Python tools for deriving R-Type adaptors for Wave Digital Filters. This code is not quite production-ready. If you are interested in contr

8 Sep 19, 2022
Microsoft 5.6k Jan 07, 2023
Crypto-trading - ML techiques are used to forecast short term returns in 14 popular cryptocurrencies

Crypto-trading - ML techiques are used to forecast short term returns in 14 popular cryptocurrencies. We have amassed a dataset of millions of rows of high-frequency market data dating back to 2018 w

Panagiotis (Panos) Mavritsakis 4 Sep 22, 2022
Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Criteo 419 Jan 01, 2023
Scikit-Learn useful pre-defined Pipelines Hub

Scikit-Pipes Scikit-Learn useful pre-defined Pipelines Hub Usage: Install scikit-pipes It's advised to install sklearn-genetic using a virtual env, in

Rodrigo Arenas 1 Apr 26, 2022
Class-imbalanced / Long-tailed ensemble learning in Python. Modular, flexible, and extensible

IMBENS: Class-imbalanced Ensemble Learning in Python Language: English | Chinese/δΈ­ζ–‡ Links: Documentation | Gallery | PyPI | Changelog | Source | Downl

Zhining Liu 176 Jan 04, 2023
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Telekom Open Source Software 17 Nov 20, 2022
Library of Stan Models for Survival Analysis

survivalstan: Survival Models in Stan author: Jacki Novik Overview Library of Stan Models for Survival Analysis Features: Variety of standard survival

Hammer Lab 122 Jan 06, 2023
LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is imp

432 Jan 05, 2023
A simple guide to MLOps through ZenML and its various integrations.

ZenBytes Join our Slack Community and become part of the ZenML family Give the main ZenML repo a GitHub star to show your love ZenBytes is a series of

ZenML 127 Dec 27, 2022
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Jeong-Yoon Lee 720 Dec 25, 2022