LazyText is inspired b the idea of lazypredict, a library which helps build a lot of basic models without much code.

Overview

LazyText

lazy

lazytext Documentation Code Coverage Downloads

LazyText is inspired b the idea of lazypredict, a library which helps build a lot of basic mpdels without much code. LazyText is for text what lazypredict is for numeric data.

  • Free Software: MIT licence

Installation

To install LazyText

pip install lazytext

Usage

To use lazytext import in your project as

from lazytext.supervised import LazyTextPredict

Text Classification

Text classification on BBC News article classification.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from lazytext.supervised import LazyTextPredict
import re
import nltk

# Load the dataset
df = pd.read_csv("tests/assets/bbc-text.csv")
df.dropna(inplace=True)

# Download models required for text cleaning
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# split the data into train set and test set
df_train, df_test = train_test_split(df, test_size=0.3, random_state=13)

# Tokenize the words
df_train['clean_text'] = df_train['text'].apply(nltk.word_tokenize)
df_test['clean_text'] = df_test['text'].apply(nltk.word_tokenize)

# Remove stop words
stop_words=set(nltk.corpus.stopwords.words("english"))
df_train['text_clean'] = df_train['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])
df_test['text_clean'] = df_test['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])

# Remove numbers, punctuation and special characters (only keep words)
regex = '[a-z]+'
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

# Lemmatization
lem = nltk.stem.wordnet.WordNetLemmatizer()
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

# Join the words again to form sentences
df_train["clean_text"] = df_train.text_clean.apply(lambda x: " ".join(x))
df_test["clean_text"] = df_test.text_clean.apply(lambda x: " ".join(x))

# Tfidf vectorization
vectorizer = TfidfVectorizer()

x_train = vectorizer.fit_transform(df_train.clean_text)
x_test = vectorizer.transform(df_test.clean_text)
y_train = df_train.category.tolist()
y_test = df_test.category.tolist()

lazy_text = LazyTextPredict(
    classification_type="multiclass",
    )
models = lazy_text.fit(x_train, x_test, y_train, y_test)


Label Analysis
| Classes             | Weights              |
|--------------------:|---------------------:|
| tech                | 0.8725490196078431   |
| politics            | 1.1528497409326426   |
| sport               | 1.0671462829736211   |
| entertainment       | 0.8708414872798435   |
| business            | 1.1097256857855362   |

 Result Analysis
| Model                         | Accuracy            | Balanced Accuracy   | F1 Score            | Custom Metric Score | Time Taken          |
| ----------------------------: | -------------------:| -------------------:| -------------------:| -------------------:| -------------------:|
| AdaBoostClassifier            | 0.7260479041916168  | 0.717737172132769   | 0.7248335989941609  | NA                  | 1.829047679901123   |
| BaggingClassifier             | 0.8817365269461078  | 0.8796633962363677  | 0.8814695332332374  | NA                  | 3.5215072631835938  |
| BernoulliNB                   | 0.9535928143712575  | 0.9505929193425733  | 0.9533647387436917  | NA                  | 0.020041465759277344|
| CalibratedClassifierCV        | 0.9760479041916168  | 0.9760018220340847  | 0.9755904096436046  | NA                  | 0.4990670680999756  |
| ComplementNB                  | 0.9760479041916168  | 0.9752329192546583  | 0.9754237510855159  | NA                  | 0.013598203659057617|
| DecisionTreeClassifier        | 0.8532934131736527  | 0.8473956671194278  | 0.8496464898940103  | NA                  | 0.478792667388916   |
| DummyClassifier               | 0.2155688622754491  | 0.2                 | 0.07093596059113301 | NA                  | 0.008046865463256836|
| ExtraTreeClassifier           | 0.7275449101796407  | 0.7253518459908658  | 0.7255575847020816  | NA                  | 0.026398658752441406|
| ExtraTreesClassifier          | 0.9655688622754491  | 0.9635363285903302  | 0.9649837485086689  | NA                  | 1.6907336711883545  |
| GradientBoostingClassifier    | 0.9565868263473054  | 0.9543725191544354  | 0.9554606292723953  | NA                  | 39.16400766372681   |
| KNeighborsClassifier          | 0.938622754491018   | 0.9370053693959814  | 0.9367294513157219  | NA                  | 0.14803171157836914 |
| LinearSVC                     | 0.9745508982035929  | 0.974262691599302   | 0.9740343976103922  | NA                  | 0.10053229331970215 |
| LogisticRegression            | 0.968562874251497   | 0.9668995859213251  | 0.9678778814908909  | NA                  | 2.9565982818603516  |
| LogisticRegressionCV          | 0.9715568862275449  | 0.9708896757262861  | 0.971147482393915   | NA                  | 109.64091444015503  |
| MLPClassifier                 | 0.9760479041916168  | 0.9753381642512078  | 0.9752912960666735  | NA                  | 35.64296746253967   |
| MultinomialNB                 | 0.9700598802395209  | 0.9678795721187026  | 0.9689200656860745  | NA                  | 0.024427413940429688|
| NearestCentroid               | 0.9520958083832335  | 0.9499045135454718  | 0.9515097876015481  | NA                  | 0.024636268615722656|
| NuSVC                         | 0.9670658682634731  | 0.9656159420289855  | 0.9669719954040374  | NA                  | 8.287142515182495   |
| PassiveAggressiveClassifier   | 0.9775449101796407  | 0.9772388820754925  | 0.9770812340935414  | NA                  | 0.10332632064819336 |
| Perceptron                    | 0.9775449101796407  | 0.9769254658385094  | 0.9768161404324825  | NA                  | 0.07216000556945801 |
| RandomForestClassifier        | 0.9625748502994012  | 0.9605135542632081  | 0.9624462948504477  | NA                  | 1.2427525520324707  |
| RidgeClassifier               | 0.9775449101796407  | 0.9769254658385093  | 0.9769176825464448  | NA                  | 0.17272400856018066 |
| SGDClassifier                 | 0.9700598802395209  | 0.9695007868373973  | 0.969787370271274   | NA                  | 0.13134551048278809 |
| SVC                           | 0.9715568862275449  | 0.9703778467908902  | 0.9713021262026043  | NA                  | 8.388679027557373   |

Result of each estimator is stored in models which is a list and each trained estimator is also returned which can be used further for analysis.

confusion matrix and classification reports are also part of the models if they are needed.

print(models[0])
{
    'name': 'AdaBoostClassifier',
    'accuracy': 0.7260479041916168,
    'balanced_accuracy': 0.717737172132769,
    'f1_score': 0.7248335989941609,
    'custom_metric_score': 'NA',
    'time': 1.829047679901123,
    'model': AdaBoostClassifier(),
    'confusion_matrix': array([
        [ 89,   5,  12,  35,   3],
        [  8,  58,   5,  44,   0],
        [  5,   2, 108,  10,   1],
        [  5,   7,   5, 138,   2],
        [ 25,   5,   1,   3,  92]]),
 'classification_report':
 """
            precision    recall  f1-score   support
        0       0.67      0.62      0.64       144
        1       0.75      0.50      0.60       115
        2       0.82      0.86      0.84       126
        3       0.60      0.88      0.71       157
        4       0.94      0.73      0.82       126
 accuracy                           0.73       668
 macro avg       0.76      0.72     0.72       668
 weighted avg    0.75      0.73     0.72       668'}

Custom metrics

LazyText also support custom metric for evaluation, this metric can be set up like following

from lazytext.supervised import LazyTextPredict
# Custom metric
def my_custom_metric(y_true, y_pred):

    ...do your stuff

    return score


lazy_text = LazyTextPredict(custom_metric=my_custom_metric)
lazy_text.fit(X_train, X_test, y_train, y_test)

If the signature of the custom metric function does not match with what is given above, then even though the custom metric is provided, it will be ignored.

Custom model parameters

LazyText also support providing parameters to the esitmators. For this just provide a dictornary of the parameters as shown below and those following arguments will be applied to the desired estimator.

In the following example I want to apply/change the default parameters of SVC classifier.

LazyText will fit all the models but only change the default parameters for SVC in the following case.

from lazytext.supervisd
custom_parameters = [
    {
        "name": "SVC",
        "parameters": {
            "C": 0.5,
            "kernel": 'poly',
            "degree": 5
        }
    }
]


l = LazyTextPredict(
    classification_type="multiclass",
    custom_parameters=custom_parameters
    )
l.fit(x_train, x_test, y_train, y_test)
You might also like...
Repository containing the code for An-Gocair text normaliser

Scottish Gaelic Text Normaliser The following project contains the code and resources for the Scottish Gaelic text normalisation project. The repo can

Code Jam for creating a text-based adventure game engine and custom worlds

Text Based Adventure Jam Author: Devin McIntyre Our goal is two-fold: Create a text based adventure game engine that can parse a standard file format

Microsoft's Cascadia Code font customized to my liking.

Microsoft's Cascadia Code font customized to my liking. Also includes some simple batch patch and bake scripts to batch patch glyphs and bake font features into fonts!

Hamming code generation, error detection & correction.

Hamming code generation, error detection & correction.

Simple python program to auto credit your code, text, book, whatever!

Credit Simple python program to auto credit your code, text, book, whatever! Setup First change credit_text to whatever text you would like to credit

A minimal code sceleton for a textadveture parser written in python.

Textadventure sceleton written in python Use with a map file generated on https://www.trizbort.io Use the following Sockets for walking directions: n

Idea is to build a model which will take keywords as inputs and generate sentences as outputs.
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

The sequel to SquidNet. It has many of the previous features that were in the original script, however a lot of the functions that do not serve much functionality have been removed.

SquidNet2 The sequel to SquidNet. It has many of the previous features that were in the original script, however a lot of the functions that do not se

A python script providing an idea of how a MindSphere application, e.g., a dashboard, can be displayed around the clock without the need of manual re-authentication on enforced session expiration

A python script providing an idea of how a MindSphere application, e.g., a dashboard, can be displayed around the clock without the need of manual re-authentication on enforced session expiration

A concept I came up which ditches the idea of
A concept I came up which ditches the idea of "layers" in a neural network.

Dynet A concept I came up which ditches the idea of "layers" in a neural network. Install Copy Dynet.py to your project. Run the example Install matpl

Ubuntu env build; Nginx build; DB build;

Deploy 介绍 Deploy related scripts bitnami Dependencies Ubuntu openssl envsubst docker v18.06.3 docker-compose init base env upload https://gitlab-runn

Aggrokatz is an aggressor plugin extension for Cobalt Strike which enables pypykatz to interface with the beacons remotely and allows it to parse LSASS dump files and registry hive files to extract credentials and other secrets stored without downloading the file and without uploading any suspicious code to the beacon. A :baby: buddy to help caregivers track sleep, feedings, diaper changes, and tummy time to learn about and predict baby's needs without (as much) guess work.
A :baby: buddy to help caregivers track sleep, feedings, diaper changes, and tummy time to learn about and predict baby's needs without (as much) guess work.

Baby Buddy A buddy for babies! Helps caregivers track sleep, feedings, diaper changes, tummy time and more to learn about and predict baby's needs wit

Lazymux is a tool installer that is specially made for termux user which provides a lot of tool mainly used tools in termux and its easy to use
Lazymux is a tool installer that is specially made for termux user which provides a lot of tool mainly used tools in termux and its easy to use

Lazymux is a tool installer that is specially made for termux user which provides a lot of tool mainly used tools in termux and its easy to use, Lazymux install any of the given tools provided by it from itself with just one click, and its often get updated.

When doing audio and video sentiment recognition, I found that a lot of code is duplicated, often a function in different time debugging for a long time, based on this problem, I want to manage all the previous work, organized into an open source library can be iterative. For their own use and others. PathPicker accepts a wide range of input -- output from git commands, grep results, searches -- pretty much anything.After parsing the input, PathPicker presents you with a nice UI to select which files you're interested in. After that you can open them in your favorite editor or execute arbitrary commands.
A simple script which allows you to see how much GEXP you earned for playing in the last Minecraft Hypixel server session

Project Landscape A simple script which allows you to see how much GEXP you earned for playing in the Minecraft Server Hypixel Usage Install python 3.

Ross Virtual Assistant is a programme which can play Music, search Wikipedia, open Websites and much more.

Ross-Virtual-Assistant Ross Virtual Assistant is a programme which can play Music, search Wikipedia, open Websites and much more. Installation Downloa

Releases(0.0.2)
Owner
Jay Vala
Data Scientist at scoutbee
Jay Vala
A simple Python module for parsing human names into their individual components

Name Parser A simple Python (3.2+ & 2.6+) module for parsing human names into their individual components. hn.title hn.first hn.middle hn.last hn.suff

Derek Gulbranson 574 Dec 20, 2022
Convert ebooks with few clicks on Telegram!

E-Book Converter Bot A bot that converts e-books to various formats, powered by calibre! It currently supports 34 input formats and 19 output formats.

Youssif Shaaban Alsager 45 Jan 05, 2023
Python character encoding detector

Chardet: The Universal Character Encoding Detector Detects ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants) Big5, GB2312, EUC-TW, HZ-GB-2312, IS

Character Encoding Detector 1.8k Jan 08, 2023
An experimental Fang Song style Chinese font generated with skeleton-tracing and pix2pix

An experimental Fang Song style Chinese font generated with skeleton-tracing and pix2pix, with glyphs based on cwTeXFangSong. The font is optimised fo

Lingdong Huang 98 Jan 07, 2023
text-to-speach bot - You really do NOT have time for read a newsletter? Now you can listen to it

NewsletterReader You really do NOT have time for read a newsletter? Now you can listen to it The Newsletter of Filipe Deschamps is a great place to re

ItanuRomero 8 Sep 18, 2021
Map Reduce Wordcount in Python using gRPC

This project is implemented in Python using gRPC. The input files are given in .txt format and the word count operation is performed.

Divija 4 Dec 05, 2022
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 02, 2023
Converts a Bangla numeric string to literal words.

Bangla Number in Words Converts a Bangla numeric string to literal words. Install $ pip install banglanum2words Usage

Syed Mostofa Monsur 3 Aug 29, 2022
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。

一款高性能非法词(敏感词)检测组件,附带繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母,拼音模糊搜索等功能。

ToolGood 3.6k Jan 07, 2023
Production First and Production Ready End-to-End Keyword Spotting Toolkit

WeKws Production First and Production Ready End-to-End Keyword Spotting Toolkit. The goal of this toolkit it to... Small footprint keyword spotting (K

222 Dec 30, 2022
Simple python program to auto credit your code, text, book, whatever!

Credit Simple python program to auto credit your code, text, book, whatever! Setup First change credit_text to whatever text you would like to credit

Hashm 1 Jan 29, 2022
Wikipedia Reader for the GNOME Desktop

Wike Wike is a Wikipedia reader for the GNOME Desktop. Provides access to all the content of this online encyclopedia in a native application, with a

Hugo Olabera 126 Dec 24, 2022
Search for terms(word / table / field name or any) under Snowflake schema names

snowflake-search-terms-in-ddl-views Search for terms(word / table / field name or any) under Snowflake schema names Version : 1.0v How to use ? Run th

Igal Emona 1 Dec 15, 2021
A simple text editor for linux

wolf-editor A simple text editor for linux Installing using Deb Package Download newest package from releases CD into folder where the downloaded acka

Focal Fossa 5 Nov 30, 2021
A python tool one can extract the "hash" from a WINDOWS HELLO PIN

WINHELLO2hashcat About With this tool one can extract the "hash" from a WINDOWS HELLO PIN. This hash can be cracked with Hashcat, more precisely with

33 Dec 05, 2022
Username reconnaisance tool that checks the availability of a specified username on over 200 websites.

Username reconnaisance tool that checks the availability of a specified username on over 200 websites. Installation & Usage Clone from Github: $ git c

Richard Mwewa 20 Oct 30, 2022
PyMultiDictionary is a Dictionary Module for Python 3+ to get meanings, translations, synonyms and antonyms of words in 20 different languages

PyMultiDictionary PyMultiDictionary is a Dictionary Module for Python 3+ to get meanings, translations, synonyms and antonyms of words in 20 different

Pablo Pizarro R. 19 Dec 26, 2022
The app gets your sutitle.srt and proccess it to extract sentences

DubbingAssistants This app gets your sutitle.srt and proccess it to extract sentences, and also find Start time and End time of them. Step 1: install

Ali Booresh 1 Jan 04, 2022
This is a text summarizing tool written in Python

Summarize Written by: Ling Li Ya This is a text summarizing tool written in Python. User Guide Some things to note: The application is accessible here

Marcus Lee 2 Feb 18, 2022
pydantic-i18n is an extension to support an i18n for the pydantic error messages.

pydantic-i18n is an extension to support an i18n for the pydantic error messages

Boardpack 48 Dec 21, 2022