Pipeline for fast building text classification TF-IDF + LogReg baselines.

Overview

tests linter codecov

python 3.6 release (latest by date) license

pre-commit code style: black

pypi version pypi downloads

Text Classification Baseline

Pipeline for fast building text classification TF-IDF + LogReg baselines.

Usage

Instead of writing custom code for specific text classification task, you just need:

  1. install pipeline:
pip install text-classification-baseline
  1. run pipeline:
  • either in terminal:
text-clf-train
  • or in python:
import text_clf

text_clf.train()

No data preparation is needed, only a csv file with two raw columns (with arbitrary names):

  • text
  • target

NOTE: the target can be presented in any format, including text - not necessarily integers from 0 to n_classes-1.

Config

The user interface consists of only one file config.yaml.

Change config.yaml to create the desired configuration and train text classification model with the following command:

  • terminal:
text-clf-train --path_to_config config.yaml
  • python:
import text_clf

text_clf.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
verbose: true
path_to_save_folder: models

# data
data:
  train_data_path: data/train.csv
  valid_data_path: data/valid.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 0.0

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  multi_class: auto
  n_jobs: -1

NOTE: tf-idf and logreg are sklearn TfidfVectorizer and LogisticRegression parameters correspondingly, so you can parameterize instances of these classes however you want.

Output

After training the model, the pipeline will return the following files:

  • model.joblib - sklearn pipeline with TF-IDF and LogReg steps
  • target_names.json - mapping from encoded target labels from 0 to n_classes-1 to it names
  • config.yaml - config that was used to train the model
  • logging.txt - logging file

Requirements

Python >= 3.6

Citation

If you use text-classification-baseline in a scientific publication, we would appreciate references to the following BibTex entry:

@misc{dayyass2021textclf,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training text classification baselines},
    howpublished = {\url{https://github.com/dayyass/text-classification-baseline}},
    year         = {2021}
}
You might also like...
Code for EMNLP 2021 main conference paper
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

Pipeline for chemical image-to-text competition
Pipeline for chemical image-to-text competition

BMS-Molecular-Translation Introduction This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got

Text-Summarization-using-NLP - Text Summarization using NLP  to fetch BBC News Article and summarize its text and also it includes custom article Summarization A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:
A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

A Python package implementing a new model for text classification with visualization tools for Explainable AI 🍣 Online live demos: http://tworld.io/s

Text vectorization tool to outperform TFIDF for classification tasks
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

Text vectorization tool to outperform TFIDF for classification tasks
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

Comments
  • release v0.1.4

    release v0.1.4

    • fixed load_20newsgroups.py (#65 #71)
    • added Makefile (#71)
    • added logging confusion matrix (#72)
    • replaced all "valid" occurrences with "test" (#74)
    • updated docstrings (#77)
    • changed python interface - train function returns model and target_names_mapping (#78)
    enhancement 
    opened by dayyass 1
  • release v0.1.6

    release v0.1.6

    fixed token frequency support (add token frequency support #85) fixed threshold selection for binary classification (add threshold selection for binary classification #86)

    bug enhancement 
    opened by dayyass 0
  • release v0.1.5

    release v0.1.5

    • added lemmatization (#66)
    • added token frequency support (#84)
    • added threshold selection for binary classification (#79)
    • added arbitrary save folder name (#80)
    enhancement 
    opened by dayyass 0
  • release v0.1.5

    release v0.1.5

    • added lemmatization (#81)
    • added token frequency support (#85)
    • added threshold selection for binary classification (#86)
    • added arbitrary save folder name (#83)
    enhancement 
    opened by dayyass 0
Releases(v0.1.6)
  • v0.1.6(Nov 6, 2021)

    Release v0.1.6

    • fixed token frequency support (add token frequency support #85)
    • fixed threshold selection for binary classification (add threshold selection for binary classification #86)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.5(Oct 21, 2021)

    Release v0.1.5 🥳🎉🍾

    • added pymorphy2 lemmatization (#81)
    • added token frequency support (#85)
    • added threshold selection for binary classification (#86)
    • added arbitrary save folder name (#83)

    pymorphy2 lemmatization (config.yaml)

    # preprocessing
    # (included in resulting model pipeline, so preserved for inference)
    preprocessing:
      lemmatization: pymorphy2
    

    token frequency support

    • text_clf.token_frequency.get_token_frequency(path_to_config) -
      get token frequency of train dataset according to the config file parameters

    threshold selection for binary classification

    • text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder) -
      get precision and recall metrics for precision-recall curve
    • text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder) -
      get false positive rate (fpr) and true positive rate (tpr) metrics for roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall) -
      plot precision-recall curve
    • text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr) -
      plot roc curve
    • text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds) -
      plot precision, recall, f1-score curves for probability thresholds

    arbitrary save folder name (config.yaml)

    experiment_name: model
    
    Source code(tar.gz)
    Source code(zip)
  • v0.1.4(Oct 10, 2021)

    • fixed load_20newsgroups.py (#65 #71)
    • added Makefile (#71)
    • added logging confusion matrix (#72)
    • replaced all "valid" occurrences with "test" (#74)
    • updated docstrings (#77)
    • changed python interface - train function returns model and target_names_mapping (#78)
    Source code(tar.gz)
    Source code(zip)
  • v0.1.3(Sep 2, 2021)

  • v0.1.2(Aug 19, 2021)

  • v0.1.1(Aug 11, 2021)

  • v0.1.0(Aug 7, 2021)

Owner
Dani El-Ayyass
NLP Tech Lead @ Sber AI, Master Student in Applied Mathematics and Computer Science @ CMC MSU
Dani El-Ayyass
CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT

Microsoft 1k Jan 03, 2023
Simple Text-To-Speech Bot For Discord

Simple Text-To-Speech Bot For Discord This is a very simple TTS bot for discord made with python. For this bot you need FFMPEG, see installation to se

1 Sep 26, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation 한국어 | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Yeoun Yi 3 May 23, 2022
一个基于Nonebot2和go-cqhttp的娱乐性qq机器人

Takker - 一个普通的QQ机器人 此项目为基于 Nonebot2 和 go-cqhttp 开发,以 Sqlite 作为数据库的QQ群娱乐机器人 关于 纯兴趣开发,部分功能借鉴了大佬们的代码,作为Q群的娱乐+功能性Bot 声明 此项目仅用于学习交流,请勿用于非法用途 这是开发者的第一个Pytho

风屿 79 Dec 29, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
jiant is an NLP toolkit

jiant is an NLP toolkit The multitask and transfer learning toolkit for natural language processing research Why should I use jiant? jiant supports mu

ML² AT CILVR 1.5k Jan 04, 2023
Translate U is capable of translating the text present in an image from one language to the other.

Translate U is capable of translating the text present in an image from one language to the other. The app uses OCR and Google translate to identify and translate across 80+ languages.

Neelanjan Manna 1 Dec 22, 2021
An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode.

WordleSolver An algorithm that can solve the word puzzle Wordle with an optimal number of guesses on HARD mode. How to use the program Copy this proje

Akil Selvan Rajendra Janarthanan 3 Mar 02, 2022
NLP command-line assistant powered by OpenAI

NLP command-line assistant powered by OpenAI

Axel 16 Dec 09, 2022
Quick insights from Zoom meeting transcripts using Graph + NLP

Transcript Analysis - Graph + NLP This program extracts insights from Zoom Meeting Transcripts (.vtt) using TigerGraph and NLTK. In order to run this

Advit Deepak 7 Sep 17, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

22 Dec 14, 2022
Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Implementation of some unbalanced loss for NLP task like focal_loss, dice_loss, DSC Loss, GHM Loss et.al Summary Here is a loss implementation reposit

121 Jan 01, 2023
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021