A multi-tenant multi-client scalable product categorising demo stack

Overview

Better Categories 4All: A multi-tenant multi-client product categorising stack

The steps to reproduce training and inference are at the end of this file; sorry for the long explanation.

Problem scope

We want to create a full product categorization stack for multiple clients. For each client and each product, we want to find the 5 most suitable categories.

Project structure

The project is split into two layers:

  • ML layer: the Python package for training and serving the model. It's a pipenv-based project. The Pipfile includes all required dependencies. The Python environment generated by pipenv is used to run training and inference, and also to run the unit tests. The code is generic for all clients.
  • Orchestration layer: the Airflow DAGs for training and prediction. Each client has its own training DAG and its own prediction DAG. These DAGs use the Airflow BashOperator to execute training and prediction inside the pipenv environment.

Why one DAG per client instead of a single DAG for all clients?

We could have a single DAG that trains all clients, with each client having its own training task inside that DAG. I chose instead to build a separate DAG for each client. Several reasons motivated this decision:

  • In my past experience, some individual clients may have problems with their data, and it's more practical to have a DAG per client when it comes to day-to-day monitoring.
  • New clients may come and others may leave, so we may end up with a DAG that constantly gains new tasks and loses others, which goes against Airflow best practices.
  • It makes sense to have one failed DAG and 99 other successful DAGs rather than a single DAG failing all the time because one random client's training fails each day.

Training

In this part we will train a classification model for each client.

Training package

The package categories_classification includes a training function train_model. It takes the following inputs:

  • client_id: the id of the client in the training dataset
  • features: a list of feature names to use in training
  • model_params: a dict of params to be passed to the model's Python class
  • training_date: the execution date of training, used to track the training run

The chosen model is the scikit-learn implementation of random forest, sklearn.ensemble.RandomForestClassifier. For the sake of simplicity, we didn't fine-tune the model parameters, but optimal params can be set in the config.

In addition to the train_model function, a CLI is provided so that training can be run directly from the command line. The trainer command runs the training:

pipenv run python categories_classification_cli.py trainer --help

Usage: categories_classification_cli.py trainer [OPTIONS]

Options:
  --client_id TEXT     The id of the client.  [required]
  --features TEXT      The list of input features.  [required]
  --model_params TEXT  Params to be passed to model.  [required]
  --training_date TEXT  The training date.  [required]
  --help               Show this message and exit.
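
As a rough illustration of how an Airflow BashOperator could be pointed at this command, the sketch below builds the shell string from the options listed in the help output. The helper name trainer_command is hypothetical, and passing the features list and the model params as JSON strings is an assumption: the text does not say how these TEXT options are encoded.

```python
import json
import shlex


def trainer_command(client_id: str, features: list, model_params: dict,
                    training_date: str) -> str:
    """Build the shell command a BashOperator could run for one client."""
    args = [
        "pipenv", "run", "python", "categories_classification_cli.py", "trainer",
        "--client_id", client_id,
        # Assumption: list and dict options are passed as JSON strings.
        "--features", json.dumps(features),
        "--model_params", json.dumps(model_params),
        "--training_date", training_date,
    ]
    # Quote every argument so the JSON survives the shell unchanged.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(trainer_command("client_1", ["f0", "f1"], {"n_estimators": 10}, "2021-01-01"))
```

The resulting string could be used as the bash_command of a BashOperator; quoting each argument keeps the JSON values intact when the shell parses the command.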

Data and model paths

All data is stored under a common base path retrieved from the environment variable DATA_PREFIX; the default is ./data. Given a client id, training data is loaded from $DATA_PREFIX/train/client_id= /data_train.csv.gz .
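
A minimal sketch of how such a path could be resolved. The helper training_data_path is hypothetical, and the per-client directory name is an assumption: the segment after "client_id=" is missing in the text above, so the `client_id=<id>` layout used here is only a guess at the convention.

```python
import os
from pathlib import Path


def training_data_path(client_id: str) -> Path:
    """Build the training file path for one client (hypothetical helper)."""
    # DATA_PREFIX falls back to ./data when the variable is not set.
    base = Path(os.environ.get("DATA_PREFIX", "./data"))
    # Assumption: one "client_id=<id>" directory per client under train/.
    return base / "train" / f"client_id={client_id}" / "data_train.csv.gz"


if __name__ == "__main__":
    print(training_data_path("client_1"))
```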

Splitting data

Before training, the data is split into a training set and a test set. The training set is used to fit the model, while the test set is used to evaluate the model after training. The evaluation score is logged.
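
The split, fit and evaluate steps could look roughly like the sketch below. It is not the package's train_model: the function name fit_and_score, the 80/20 split ratio, the fixed random seed and the use of mean accuracy as the evaluation score are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_and_score(data: pd.DataFrame, features: list, label: str,
                  model_params: dict, test_size: float = 0.2):
    """Split the data, fit a random forest and return (model, test accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        data[features], data[label], test_size=test_size, random_state=42
    )
    # model_params is forwarded as-is to the scikit-learn class.
    model = RandomForestClassifier(**model_params)
    model.fit(x_train, y_train)
    # score() reports mean accuracy on the held-out test set.
    return model, model.score(x_test, y_test)
```

A call such as fit_and_score(df, ["f0", "f1"], "category_id", {"n_estimators": 100}) would return the fitted model and its test accuracy; the label column name category_id is hypothetical.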

Model tracking and versioning

The whole training event is tracked in Mlflow as a training run. Each client has its own experiment and its own model name following the convention " _model". The tracking process also saves metrics and model parameters in the same run metadata.

Finally, the model is saved in the Mlflow Registry under the name " _model". Saving the model creates a new model version in Mlflow, as the same model may have multiple versions.

Prediction

In this part, we will predict product categories using the previously trained model.

Prediction package

The package categories_classification includes a prediction function predict_categories. It takes the following inputs:

  • client_id: the id of the client in the training dataset
  • inference_date: the inference execution date, used to version the output categories

Prediction runs on Spark so that it scales to large datasets. The prediction dataset is loaded into a Spark DataFrame. We use Mlflow to look up the latest model version and load that model. The model is then broadcast in Spark so that it is available on the Spark workers. To apply the model to the prediction dataset, I use mapInPandas, a feature introduced in Spark 3.0 and marked experimental there. This DataFrame method maps an iterator of batches (pandas DataFrames) through a user-defined prediction function that also outputs pandas DataFrames. PyArrow makes this efficient by handling the data transfer between the Spark JVM and the Python pandas runtime.

Prediction function

The advantage of mapInPandas compared to a classic pandas_udf is that the output can have more rows than the input. Thus, for each product, we can output 5 predicted categories with their probabilities, ranked from 0 to 4. The predicted labels are then persisted to the filesystem as a parquet dataset.
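
To make the row expansion concrete, here is a sketch of a batch function with the shape mapInPandas expects: it receives an iterator of pandas DataFrames and yields pandas DataFrames with up to 5 rows per product. The names make_predict_fn and product_id, and the output column names, are assumptions; the model only needs predict_proba and classes_, as scikit-learn classifiers provide.

```python
from typing import Callable, Iterator

import numpy as np
import pandas as pd

TOP_K = 5


def make_predict_fn(model, features: list, id_col: str = "product_id") -> Callable:
    """Return a function usable as the mapInPandas callback."""

    def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for batch in batches:
            proba = model.predict_proba(batch[features])
            k = min(TOP_K, proba.shape[1])
            # Column indices of the k highest probabilities, best first.
            top = np.argsort(-proba, axis=1)[:, :k]
            # Each input row becomes k output rows, ranked 0..k-1.
            yield pd.DataFrame({
                id_col: np.repeat(batch[id_col].to_numpy(), k),
                "rank": np.tile(np.arange(k), len(batch)),
                "category": np.asarray(model.classes_)[top].ravel(),
                "probability": np.take_along_axis(proba, top, axis=1).ravel(),
            })

    return predict_batches
```

On the Spark side this would be applied with something like df.mapInPandas(make_predict_fn(model, features), schema="product_id string, rank long, category string, probability double"), where the schema string is again an assumption. With a broadcast model, the function would read the model from the broadcast variable's value instead of capturing it directly.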

Model version retrieval

Before loading the model, we use Mlflow to get its latest version. In a production system we would probably want to push the model to staging, then verify its metrics or validate it before promoting it to production. Here we assume that all versions live in the same stage: we use MlflowClient to connect to the Mlflow Registry and get the latest model version. The version is then used to build the latest model URI.
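
One possible way to do this lookup is sketched below. The helper latest_model_uri is hypothetical, and it uses MlflowClient.search_model_versions to list the versions, which may differ from the call the package actually makes. The client is passed in as an argument so that the function works with any object exposing that method.

```python
def latest_model_uri(client, model_name: str) -> str:
    """Return the 'models:/<name>/<version>' URI of the newest registered version.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    versions = client.search_model_versions(f"name='{model_name}'")
    if not versions:
        raise LookupError(f"no registered version for model {model_name!r}")
    # Registry versions are strings, so compare them as integers.
    newest = max(versions, key=lambda v: int(v.version))
    return f"models:/{model_name}/{newest.version}"
```

With a real registry this would be called as latest_model_uri(MlflowClient(), model_name), and the returned URI can then be passed to a loader such as mlflow.sklearn.load_model.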

Reproducing training and inference

Pipenv initialization

First, check that pipenv is installed locally; if not, install it with pip install pipenv.

Then you need to initialize the pipenv environment with the following command:

make init-pipenv

This may take some time as it will install all required dependencies. Once done, you can run the linter (pylint) and the unit tests:

make lint
make unit-tests

Airflow/Mlflow initialization

You also need to initialize the local Airflow stack. This builds a custom Airflow docker image that includes the pipenv environment, builds the Mlflow image, and initializes the Airflow database.

make init-airflow

Generate DAGs

The Airflow DAGs need to be generated from the config file conf/clients_config.yaml. It already contains the 10 example client datasets, but you can add new clients or change the existing configuration. For each client you must include the list of features and, optionally, the model params.

Then, you can generate DAGs using the following command:

make generate-dags

This will call the script scripts/generate_dags.py, which will:

  • load the training and inference DAG templates (jinja2 templates) from dags_templates
  • load conf from conf/clients_config.yaml
  • render DAG for each client and each template
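
The rendering step can be sketched as follows. This is not the content of scripts/generate_dags.py: the inline template is a heavily shortened, hypothetical stand-in for a file in dags_templates, the config is a plain dict standing in for the parsed YAML, and the output file naming is an assumption.

```python
from jinja2 import Template

# Hypothetical, heavily shortened stand-in for a file in dags_templates.
TRAINING_TEMPLATE = Template(
    'dag_id = "{{ client_id }}_training"\n'
    'features = {{ features }}\n'
    'model_params = {{ model_params }}\n'
)


def render_dags(clients: dict) -> dict:
    """Render one DAG source per client; returns {file name: source code}."""
    rendered = {}
    for client_id, conf in clients.items():
        rendered[f"{client_id}_training_dag.py"] = TRAINING_TEMPLATE.render(
            client_id=client_id,
            features=conf["features"],
            # Model params are optional in the config.
            model_params=conf.get("model_params", {}),
        )
    return rendered


if __name__ == "__main__":
    for name, source in render_dags({"client_1": {"features": ["f0", "f1"]}}).items():
        print(f"# {name}\n{source}\n")
```

Generating one file per client keeps each DAG independent, which matches the one-DAG-per-client choice described earlier: adding or removing a client only adds or removes a file.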

Start local Airflow

You can start the local Airflow stack with the following command:

make start-airflow

Once all services have started, open your browser and visit:

  • Airflow UI at http://localhost:8080
  • Mlflow UI at http://localhost:5000

Run training and inference

In Airflow, all DAGs are disabled by default. To run training for a client, enable its training DAG; this immediately triggers the training.

Once the model is in Mlflow, you can enable the inference DAG; this immediately triggers a prediction.

Inspect result

To inspect the results, start a local Jupyter server with:

make run-jupyter

Then open the notebook inspect_inference_result.ipynb and run it to check the prediction output.
