Various Algorithms for Short Text Mining

Overview

Short Text Mining in Python

Introduction

shorttext is a Python package that facilitates supervised and unsupervised learning for short text categorization. Because short texts are sparse in words and carry little information by themselves, an intermediate representation of the texts and documents is needed before they are put into any classification algorithm. The package provides various types of these representations, including topic modeling and word-embedding algorithms.
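
To illustrate the sparseness problem mentioned above, here is a minimal sketch in plain Python. It does not use shorttext itself; the helper names `bow_vector` and `cosine_similarity` are made up for this illustration. It shows that two related short texts with no words in common have zero similarity under a plain bag-of-words representation, which is why a denser intermediate representation is useful.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words counts for a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Related in meaning, but no shared words: similarity is zero.
related = cosine_similarity(bow_vector("stem cell research"),
                            bow_vector("regenerative medicine study"))

# Two of three words shared: similarity is 2/3.
overlap = cosine_similarity(bow_vector("stem cell research"),
                            bow_vector("stem cell therapy"))

print(related, overlap)
```

Topic vectors or word embeddings map such texts into a space where related texts can be close even without shared words.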

Python version support by release:

  • since 1.5.2: runs on Python 3.9;
  • since 1.5.0: Python 3.6 is no longer supported;
  • since 1.2.4: runs on Python 3.8;
  • since 1.2.3: Python 3.5 is no longer supported;
  • since 1.1.7: Python 2.7 is no longer supported;
  • since 1.0.8: runs on Python 3.7 with TensorFlow as the backend for Keras;
  • since 1.0.7: runs on Python 3.7 as well, but the backend for Keras cannot be TensorFlow;
  • since 1.0.0: runs on Python 2.7, 3.5, and 3.6.

Characteristics:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification;
  • neural network classification (including ConvNet, and C-LSTM);
  • maximum entropy classification;
  • metrics of phrase differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD);
  • character-level sequence-to-sequence (seq2seq) learning;
  • spell correction;
  • API for word-embedding algorithm for one-time loading; and
  • sentence encodings and similarities based on BERT.
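
As an illustration of the soft Jaccard score listed above, the following is a minimal sketch in plain Python. It is not the package's implementation. It assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance and a greedy one-to-one matching of tokens. The function names `osa_distance`, `word_similarity`, and `soft_jaccard` are hypothetical.

```python
def osa_distance(s, t):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def word_similarity(a, b):
    """Similarity in [0, 1]: one minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, 1.0 - osa_distance(a, b) / longest)


def soft_jaccard(tokens1, tokens2):
    """Soft Jaccard score: near-matching tokens count as partial overlaps."""
    if not tokens1 and not tokens2:
        return 1.0
    pairs = sorted(
        ((word_similarity(a, b), i, j)
         for i, a in enumerate(tokens1)
         for j, b in enumerate(tokens2)),
        key=lambda p: p[0], reverse=True)
    used1, used2 = set(), set()
    intersection = 0.0
    for sim, i, j in pairs:
        # Greedy matching: each token is paired at most once.
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            intersection += sim
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union


print(soft_jaccard(["teh", "cat"], ["the", "cat"]))  # about 0.714
```

With an exact Jaccard score, the misspelled "teh" would not match "the" at all. Here it contributes a partial overlap of 2/3.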

Documentation

Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.

See the tutorial for how to use the package, and the FAQ.

Installation

To install it, in a console, use pip.

>>> pip install -U shorttext

or, if you want the most recent development version on GitHub, type

>>> pip install -U git+https://github.com/stephenhky/PyShortTextCategorization@master

Developers are advised to make sure that Keras >= 2 is installed. Users are advised to install the backend, TensorFlow (preferred) or Theano, in advance. It is desirable to have Cython installed beforehand as well.
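
A quick way to check which of these dependencies are visible to the current Python interpreter is sketched below. It uses only the standard library and does not import the packages themselves. The helper name `check_environment` is made up for this example, and the module names (`shorttext`, `keras`, `tensorflow`, `Cython`) are the usual import names for these projects.

```python
import importlib.util


def check_environment(modules=("shorttext", "keras", "tensorflow", "Cython")):
    """Map each module name to True if the interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None
            for name in modules}


# Print a short report for the default set of modules.
for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A module reported as missing can then be installed with pip before importing shorttext.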

See installation guide for more details.

Issues

To report any issues, go to the Issues tab of the GitHub page and start a thread. Developers are welcome to submit pull requests of their own to fix any errors.

Contributors

If you would like to contribute, feel free to submit pull requests. You can talk to me in advance by e-mail or through the Issues page.

News

  • 07/11/2021: shorttext 1.5.3 released.
  • 07/06/2021: shorttext 1.5.2 released.
  • 04/10/2021: shorttext 1.5.1 released.
  • 04/09/2021: shorttext 1.5.0 released.
  • 02/11/2021: shorttext 1.4.8 released.
  • 01/11/2021: shorttext 1.4.7 released.
  • 01/03/2021: shorttext 1.4.6 released.
  • 12/28/2020: shorttext 1.4.5 released.
  • 12/24/2020: shorttext 1.4.4 released.
  • 11/10/2020: shorttext 1.4.3 released.
  • 10/18/2020: shorttext 1.4.2 released.
  • 09/23/2020: shorttext 1.4.1 released.
  • 09/02/2020: shorttext 1.4.0 released.
  • 07/23/2020: shorttext 1.3.0 released.
  • 06/05/2020: shorttext 1.2.6 released.
  • 05/20/2020: shorttext 1.2.5 released.
  • 05/13/2020: shorttext 1.2.4 released.
  • 04/28/2020: shorttext 1.2.3 released.
  • 04/07/2020: shorttext 1.2.2 released.
  • 03/23/2020: shorttext 1.2.1 released.
  • 03/21/2020: shorttext 1.2.0 released.
  • 12/01/2019: shorttext 1.1.6 released.
  • 09/24/2019: shorttext 1.1.5 released.
  • 07/20/2019: shorttext 1.1.4 released.
  • 07/07/2019: shorttext 1.1.3 released.
  • 06/05/2019: shorttext 1.1.2 released.
  • 04/23/2019: shorttext 1.1.1 released.
  • 03/03/2019: shorttext 1.1.0 released.
  • 02/14/2019: shorttext 1.0.8 released.
  • 01/30/2019: shorttext 1.0.7 released.
  • 01/29/2019: shorttext 1.0.6 released.
  • 01/13/2019: shorttext 1.0.5 released.
  • 10/03/2018: shorttext 1.0.4 released.
  • 08/06/2018: shorttext 1.0.3 released.
  • 07/24/2018: shorttext 1.0.2 released.
  • 07/17/2018: shorttext 1.0.1 released.
  • 07/14/2018: shorttext 1.0.0 released.
  • 06/18/2018: shorttext 0.7.2 released.
  • 05/30/2018: shorttext 0.7.1 released.
  • 05/17/2018: shorttext 0.7.0 released.
  • 02/27/2018: shorttext 0.6.0 released.
  • 01/19/2018: shorttext 0.5.11 released.
  • 01/15/2018: shorttext 0.5.10 released.
  • 12/14/2017: shorttext 0.5.9 released.
  • 11/08/2017: shorttext 0.5.8 released.
  • 10/27/2017: shorttext 0.5.7 released.
  • 10/17/2017: shorttext 0.5.6 released.
  • 09/28/2017: shorttext 0.5.5 released.
  • 09/08/2017: shorttext 0.5.4 released.
  • 09/02/2017: end of GSoC project. (Report)
  • 08/22/2017: shorttext 0.5.1 released.
  • 07/28/2017: shorttext 0.4.1 released.
  • 07/26/2017: shorttext 0.4.0 released.
  • 06/16/2017: shorttext 0.3.8 released.
  • 06/12/2017: shorttext 0.3.7 released.
  • 06/02/2017: shorttext 0.3.6 released.
  • 05/30/2017: GSoC project (Chinmaya Pancholi, with gensim)
  • 05/16/2017: shorttext 0.3.5 released.
  • 04/27/2017: shorttext 0.3.4 released.
  • 04/19/2017: shorttext 0.3.3 released.
  • 03/28/2017: shorttext 0.3.2 released.
  • 03/14/2017: shorttext 0.3.1 released.
  • 02/23/2017: shorttext 0.2.1 released.
  • 12/21/2016: shorttext 0.2.0 released.
  • 11/25/2016: shorttext 0.1.2 released.
  • 11/21/2016: shorttext 0.1.1 released.

Possible Future Updates

  • Dividing components to other packages;
  • More available corpora.

Comments

  • standalone ?

    Hi. I have many questions.... :-)

    I'm a Python beginner. Is there any way to run the code standalone?

    For example, after training a classifier on my data, I'd like to print the scores in the terminal with classifier.score('apple'), where the word 'apple' can be changed.

    Thank you regards,

    opened by chocosando 20
  • ImportError: No module named classification_exceptions

    import shorttext

    
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-5-cb09b3381050> in <module>()
    ----> 1 import shorttext
    
    /usr/local/lib/python2.7/dist-packages/shorttext/__init__.py in <module>()
          5 sys.path.append(thisdir)
          6 
    ----> 7 from . import utils
          8 from . import data
          9 from . import classifiers
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/__init__.py in <module>()
          4 from . import textpreprocessing
          5 from .wordembed import load_word2vec_model
    ----> 6 from . import compactmodel_io
          7 
          8 from .textpreprocessing import spacy_tokenize as tokenize
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/compactmodel_io.py in <module>()
         13 from functools import partial
         14 
    ---> 15 import utils.classification_exceptions as e
         16 
         17 def removedir(dir):
    
    ImportError: No module named classification_exceptions
    
    
    opened by spate141 11
  • ImportError: dlopen: cannot load any more object with static TLS

    Hi, I got the following error when I import shorttext. How can I resolve it?

    Using TensorFlow backend.

    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so.7.5 locally
    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/shorttext/__init__.py", line 7, in <module>
        from . import utils
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/__init__.py", line 3, in <module>
        from . import gensim_corpora
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/gensim_corpora.py", line 2, in <module>
        from .textpreprocessing import spacy_tokenize as tokenize
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/textpreprocessing.py", line 5, in <module>
        import spacy
      File "/usr/local/lib/python2.7/dist-packages/spacy/__init__.py", line 8, in <module>
        from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
      File "/usr/local/lib/python2.7/dist-packages/spacy/en/__init__.py", line 4, in <module>
        from ..language import Language
      File "/usr/local/lib/python2.7/dist-packages/spacy/language.py", line 12, in <module>
        from .syntax.parser import get_templates
    ImportError: dlopen: cannot load any more object with static TLS

    opened by kenyeung128 8
  • extend score to take an array of shorttext

    Currently, score takes only a single input, so the method is very slow when classifying thousands of examples. Is there a way to generate scores for 10K+ samples at the same time?

    opened by rja172 6
  • Importing problem (not installation) over google colab

    I am experimenting with the library for the first time. The installation was successful and didn't need any extra steps. However, when I started importing the library, I got the following error related to keras:

    /usr/local/lib/python3.7/dist-packages/shorttext/generators/bow/AutoEncodingTopicModeling.py in <module>()
          8 from gensim.corpora import Dictionary
          9 from keras import Input
    ---> 10 from keras.engine import Model
         11 from keras.layers import Dense
         12 from scipy.spatial.distance import cosine

    ImportError: cannot import name 'Model' from 'keras.engine' (/usr/local/lib/python3.7/dist-packages/keras/engine/__init__.py)

    I tried to install keras separately, but there was no improvement. Any suggestions would be appreciated.

    opened by yomnamahmoud 6
  • RuntimeWarning: overflow encountered in exp2 topicmodeler.train

    Code:

    trainclassdict = shorttext.data.nihreports(sample_size=None)
    topicmodeler = shorttext.generators.LDAModeler()
    topicmodeler.train(trainclassdict, 128)

    Error message:

    /lib/python2.7/site-packages/gensim/models/ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
      perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

    Then the results are variable for topicmodeler.retrieve_topicvec('stem cell research')

    opened by dbonner 6
  • Remove negation terms from stopwords.txt

    I noticed that stopwords.txt includes negation terms such as "no" and "not". These terms reverse the meaning of a word or a sentence, so they should be preserved in the text data. For example, "not a good idea" would become "good idea" after stopword removal. Therefore, I recommend removing negation terms from the stopword list. Thanks!

    opened by star1327p 5
  • Input to shorttext.generators.LDAModeler()

    I was wondering what the format of the input data should be for:

    shorttext.generators.LDAModeler()
    topicmodeler.train(data, 100)

    Can I feed it a pandas column, or should it be in a dictionary format? If a dictionary, what should the keys be? I have a large set of tweets.

    opened by malizad 5
  • from shorttext.classifiers import MaxEntClassifier is it regression?

    It seems that MaxEnt is just a fancy word for regression, or do you have something special in your MaxEnt? See https://www.quora.com/What-is-the-relationship-between-Log-Linear-model-MaxEnt-model-and-Logistic-Regression or https://en.wikipedia.org/wiki/Multinomial_logistic_regression

    Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit, the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]
    
    opened by Sandy4321 5
  • No Python 3.6 support with SciPy 1.6

    opened by Dobatymo 4
  • Data nihreports not available anymore

    Some datasets are not available anymore.

    For example, the following:

    nihtraindata = shorttext.data.nihreports(sample_size=None)

    Error message:

    Downloading...
    Source:  http://storage.googleapis.com/pyshorttext/nih_grant_public/nih_full.csv.zip
    Failure to download file!
    (<class 'urllib.error.HTTPError'>, <HTTPError 404: 'Not Found'>, <traceback object at 0x7f09063ed788>)
    

    Python error:

    HTTPError: HTTP Error 404: Not Found
    
    During handling of the above exception, another exception occurred:
    

    When opening the link, the same error appears (screenshot not preserved).

    opened by AlessandroVol23 4
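
The stopword issue raised in the comments above (negation terms such as "no" and "not" being removed) can be illustrated with a minimal sketch in plain Python. It does not use the package's own preprocessing or its stopwords.txt. The stopword set, the `NEGATIONS` set, and the `remove_stopwords` helper are all hypothetical and only serve to show the effect of keeping negation terms.

```python
# Illustrative sets only; these are not the contents of the package's stopwords.txt.
STOPWORDS = {"a", "an", "the", "is", "no", "not"}
NEGATIONS = {"no", "not", "nor", "never"}


def remove_stopwords(text, stopwords, keep=frozenset()):
    """Drop stopwords from a whitespace-tokenized text, except those in `keep`."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in stopwords or t in keep)


# Default behaviour: the negation is lost and the meaning is reversed.
plain = remove_stopwords("not a good idea", STOPWORDS)

# Keeping negation terms preserves the meaning.
preserved = remove_stopwords("not a good idea", STOPWORDS, keep=NEGATIONS)

print(plain)      # good idea
print(preserved)  # not good idea
```

Passing an explicit `keep` set leaves the stopword list itself untouched, so the choice can be made per application.
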
Releases(1.5.8)
Owner
Kwan-Yuet "Stephen" Ho
quantitative research, machine learning, data science, text mining, physics