Text vectorization tool to outperform TFIDF for classification tasks

Overview

textvec logo

WHAT: Supervised text vectorization tool

Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP methods in Python. The main idea of this project is to show alternatives for an excellent TFIDF method which is highly overused for supervised tasks. All interfaces are similar to scikit-learn so you should be able to test the performance of this supervised methods just with a few changes.

Textvec is compatible with: Python 2.7-3.7.


WHY: Comparison with TFIDF

As you can read in the different articles1,2 almost on every dataset supervised methods outperform unsupervised. But most text classification examples on the internet ignores that fact.

IMDB_bin RT_bin Airlines Sentiment_bin Airlines Sentiment_multiclass 20news_multiclass
TF 0.8984 0.7571 0.9194 0.8084 0.8206
TFIDF 0.9052 0.7717 0.9259 0.8118 0.8575
TFPF 0.8813 0.7403 0.9212 NA NA
TFRF 0.8797 0.7412 0.9194 NA NA
TFICF 0.8984 0.7642 0.9199 0.8125 0.8292
TFBINICF 0.8984 0.7571 0.9194 NA NA
TFCHI2 0.8898 0.7398 0.9108 NA NA
TFGR 0.8850 0.7065 0.8956 NA NA
TFRRF 0.8879 0.7506 0.9194 NA NA
TFOR 0.9092 0.7806 0.9207 NA NA

Here is a comparison for binary classification on imdb sentiment data set. Labels sorted by accuracy score and the heatmap shows the correlation between different approaches. As you can see some methods are good for to ensemble models or perform features selection.

Binary comparison

For more dataset benchmarks (rotten tomatoes, airline sentiment) see Binary classification quality comparison


Install:

Usage:

pip install textvec

Source code:

git clone https://github.com/textvec/textvec
cd textvec
pip install .

HOW: Examples

The usage is similar to scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from textvec.vectorizers import TfBinIcfVectorizer

cvec = CountVectorizer().fit(train_data.text)

tficf_vec = TfBinIcfVectorizer(sublinear_tf=True)
tficf_vec.fit(cvec.transform(text), y)

For more detailed examples see Basic example and other notebooks in Examples

Currently implemented methods:

  • TfIcfVectorizer
  • TforVectorizer
  • TfgrVectorizer
  • TfigVectorizer
  • Tfchi2Vectorizer
  • TfrfVectorizer
  • TfrrfVectorizer
  • TfBinIcfVectorizer
  • TfpfVectorizer
  • SifVectorizer
  • TfbnsVectorizer

Most of the vectorization techniques you can find in articles1,2,3. If you see any method with wrong name or reference please commit!


TODO

  • Docs

REFERENCE

Comments
  • the bug existed in the vectorizer.py

    the bug existed in the vectorizer.py

    the 35th line sp.spdiag(y == val ......) might be wrong. i assume that u maybe wanna write sp.spdiag([i for i in y if i == val] ......)

    expect your reply,thx!

    opened by dongrixinyu 3
  • AttributeError: 'NoneType' object has no attribute 'transform'

    AttributeError: 'NoneType' object has no attribute 'transform'

    For some reason, I can't use your vectorizers in pipeline. Here is my code:

    pipeline = Pipeline([ 
        ('vect', CountVectorizer(stop_words='en', ngram_range=(1,2), analyzer='word')),
        ('transform', TfIcfVectorizer(sublinear_tf=True)),
        ('clf', LinearSVC(class_weight='balanced')),
    ])
    pipeline.fit(X, y)
    

    And I got the following error:

    D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params) 263 This estimator 264 """ --> 265 Xt, fit_params = self._fit(X, y, **fit_params) 266 if self._final_estimator is not None: 267 self._final_estimator.fit(Xt, y, **fit_params)

    D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params) 228 Xt, fitted_transformer = fit_transform_one_cached( 229 cloned_transformer, Xt, y, None, --> 230 **fit_params_steps[name]) 231 # Replace the transformer of the step with the fitted 232 # transformer. This is necessary when loading the transformer

    D:\programs\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py in call(self, *args, **kwargs) 340 341 def call(self, *args, **kwargs): --> 342 return self.func(*args, **kwargs) 343 344 def call_and_shelve(self, *args, **kwargs):

    D:\programs\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params) 614 res = transformer.fit_transform(X, y, **fit_params) 615 else: --> 616 res = transformer.fit(X, y, **fit_params).transform(X) 617 # if we have a weight for this transformer, multiply output 618 if weight is None:

    AttributeError: 'NoneType' object has no attribute 'transform'

    My code works fine with TfIdfTransformer.

    Python 3.7.3 sklearn 0.20.3

    opened by Tamplier 2
  • Vectorizers now working with GridSearchCV and Pipelines with parameters

    Vectorizers now working with GridSearchCV and Pipelines with parameters

    Fixes #20 Fixes #21

    By adding sklearn.base.BaseEstimator to all vectorizers they are now capable of handling parameters from GridSearchCV.

    Also, when testing I verified that they already got a "toString" for them, so two problems solved.

    >>> from textvec.vectorizers import TfIcfVectorizer
    >>> TfIcfVectorizer()
    TfIcfVectorizer(norm=None, sublinear_tf=False)
    >>> from textvec.vectorizers import TfBinIcfVectorizer
    >>> TfBinIcfVectorizer()
    TfBinIcfVectorizer(norm='l2', smooth_df=True, sublinear_tf=False)
    
    opened by bernardoduarte 1
  • Using tficf without target Y

    Using tficf without target Y

    I want to get text vectorization using TfIcfVectorizer() without the need to use Y label vector, Is it possible ?

    tficf = vectorizers.TfIcfVectorizer() tficf_train = tficf.fit_transform(cv_train, newsgroups_train.target) # here i should give the vectorizer one argument instead of two tficf_test = tficf.transform(cv_test)

    opened by banyous 1
  • Bump numpy from 1.14.2 to 1.22.0

    Bump numpy from 1.14.2 to 1.22.0

    Bumps numpy from 1.14.2 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Textvec vectorizers doesn't work on GridSearchCV and Pipelines with set_params

    Textvec vectorizers doesn't work on GridSearchCV and Pipelines with set_params

    The code below throws this error:

    AttributeError: 'TfBinIcfVectorizer' object has no attribute 'set_params'

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.pipeline import Pipeline
    from textvec.vectorizers import TfBinIcfVectorizer
    
    pipeline = Pipeline([
        ('count', CountVectorizer()),
        ('transformer', TfBinIcfVectorizer()),
        ('model', SVC()),
    ])
    
    param_grid = {
        'transformer__sublinear_tf': (True, False),
        'count__analyzer': ('word', 'char'),
        'count__max_df': (1.0,),
    }
    
    grid_search = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, refit=True)
    grid_search.fit(train_x, train_y)
    

    If I replace TfBinIcfVectorizer with TfidfTransformer it does work. So it seems that there is something missing on textvec.

    opened by bernardoduarte 0
  • Missing

    Missing "toString" from Vectorizers

    As shown below by the code snippet, scikit-learn's TfidfTransformer for example have a pretty representation when printed, while textvec's TfIcfVectorizer doesn't. It seems that this happens to all of then.

    I'll search for way to do it like scikit-learn does.

    >>> from sklearn.feature_extraction.text import TfidfTransformer
    >>> TfidfTransformer()
    TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
    >>> from textvec.vectorizers import TfIcfVectorizer
    >>> TfIcfVectorizer()
    <textvec.vectorizers.TfIcfVectorizer object at 0x7f88de881490>
    
    opened by bernardoduarte 0
  • Mark TfBNS as solved in README and add it to the Currently implemented methods

    Mark TfBNS as solved in README and add it to the Currently implemented methods

    As from this pull request it seems that TfBNS is now solved from the TODO list on README.md.

    That means that it should be added to the Currently implemented methods and removed from the TODO.

    opened by bernardoduarte 0
  • ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

    Code:

    ...
    train['title'].isnull().sum() 
    # Out: 0
    title_countvec = CountVectorizer(ngram_range=(1,3), max_features=300000, lowercase=True)
    title_countvec.fit(train['title'], y_train)
    train_title_countvec = title_countvec.transform(train['title'])
    title_vectorizer = TfIcfVectorizer(norm='l2', sublinear_tf=True)
    title_vectorizer.fit(train_title_countvec, y_train)
    train_title_countvec = title_countvec.transform(train['title'])
    np.isfinite(train_title_countvec.data).all(), np.isinf(train_title_countvec.data).any()
    # Out: (True, False)
    train_transformed['title'] = title_vectorizer.transform(train_title_countvec) 
    # Error
    

    Traceback:

    ValueError                                Traceback (most recent call last)
    <ipython-input-72-5fdf9718ecba> in <module>()
    ----> 1 train_transformed['title'] = title_vectorizer.transform(train_title_countvec)
    
    /usr/local/lib/python3.5/dist-packages/textvec/vectorizers.py in transform(self, X, min_freq)
         45         X = X * sp.spdiags(self.k, 0, f, f)
         46         if self.norm:
    ---> 47             X = normalize(X, self.norm)
         48         return X
         49 
    
    ~/.local/lib/python3.5/site-packages/sklearn/preprocessing/data.py in normalize(X, norm, axis, copy, return_norm)
       1410 
       1411     X = check_array(X, sparse_format, copy=copy,
    -> 1412                     estimator='the normalize function', dtype=FLOAT_DTYPES)
       1413     if axis == 0:
       1414         X = X.T
    
    ~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
        429     if sp.issparse(array):
        430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
    --> 431                                       force_all_finite)
        432     else:
        433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    
    ~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, force_all_finite)
        304                           % spmatrix.format)
        305         else:
    --> 306             _assert_all_finite(spmatrix.data)
        307     return spmatrix
        308 
    
    ~/.local/lib/python3.5/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
         42             and not np.isfinite(X).all()):
         43         raise ValueError("Input contains NaN, infinity"
    ---> 44                          " or a value too large for %r." % X.dtype)
         45 
         46 
    
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
    
    bug 
    opened by sharthZ23 4
Releases(v2.0)
  • v2.0(Sep 12, 2019)

    Textvec 1.0.1 -> 2.0

    Features

    • SifVectorizer #5
    • Scikit-learn compability #7

    Improvements

    • Better examples #9
    • Unit tests #6

    Bug Fixes

    • Sparse data processing fix #8
    Source code(tar.gz)
    Source code(zip)
Spam filtering made easy for you

spammy Author: Tasdik Rahman Latest version: 1.0.3 Contents 1 Overview 2 Features 3 Example 3.1 Accuracy of the classifier 4 Installation 4.1 Upgradin

Tasdik Rahman 137 Dec 18, 2022
Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023
Open Source Neural Machine Translation in PyTorch

OpenNMT-py: Open-Source Neural Machine Translation OpenNMT-py is the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine trans

OpenNMT 5.8k Jan 04, 2023
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Synthetic data for the people.

zpy: Synthetic data in Blender. Website • Install • Docs • Examples • CLI • Contribute • Licence Abstract Collecting, labeling, and cleaning data for

Zumo Labs 253 Dec 21, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
List of GSoC organisations with number of times they have been selected.

Welcome to GSoC Organisation Frequency And Details 👋 List of GSoC organisations with number of times they have been selected, techonologies, topics,

Shivam Kumar Jha 41 Oct 01, 2022
PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

StyleSpeech - PyTorch Implementation PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Status (2021.06.09

Keon Lee 142 Jan 06, 2023
Repositório da disciplina no semestre 2021-2

Avisos! Nenhum aviso! Compiladores 1 Este é o Git da disciplina Compiladores 1. Aqui ficará o material produzido em sala de aula assim como tarefas, w

6 May 13, 2022
EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

Pre-train or Annotate? Domain Adaptation with a Constrained Budget This repo contains code and data associated with EMNLP 2021 paper "Pre-train or Ann

Fan Bai 8 Dec 17, 2021
Weakly-supervised Text Classification Based on Keyword Graph

Weakly-supervised Text Classification Based on Keyword Graph How to run? Download data Our dataset follows previous works. For long texts, we follow C

Hello_World 20 Dec 29, 2022
小布助手对话短文本语义匹配的一个baseline

oppo-text-match 小布助手对话短文本语义匹配的一个baseline 模型 参考:https://kexue.fm/archives/8213 base版本线下大概0.952,线上0.866(单模型,没做K-flod融合)。 训练 测试环境:tensorflow 1.15 + keras

苏剑林(Jianlin Su) 132 Dec 14, 2022
Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

Shreya Shankar 190 Dec 21, 2022
Text-to-Speech for Belarusian language

title emoji colorFrom colorTo sdk app_file pinned Belarusian TTS 🐸 green green gradio app.py false Belarusian TTS 📢 🤖 Belarusian TTS (text-to-speec

Yurii Paniv 1 Nov 27, 2021
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 03, 2023
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
🏖 Easy training and deployment of seq2seq models.

Headliner Headliner is a sequence modeling library that eases the training and in particular, the deployment of custom sequence models for both resear

Axel Springer Ideas Engineering GmbH 231 Nov 18, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022