Biterm Topic Model (BTM): modeling topics in short texts

Last update: Dec 30, 2022

Overview

Biterm Topic Model

Bitermplus implements the Biterm Topic Model (BTM) for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. It is a cythonized version of BTM. The package can also compute perplexity and semantic coherence metrics.
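
For readers new to the model, a biterm is an unordered pair of words that co-occur in the same short text. The sketch below is an illustration in plain Python, not the package's own code: the function name extract_biterms is hypothetical, and the window rule (pairs closer than `window` positions) is an assumption about how such a limit is usually applied.

```python
from itertools import combinations


def extract_biterms(doc_tokens, window=15):
    """Return unordered word pairs (biterms) whose positions differ by less than `window`."""
    biterms = []
    for i, j in combinations(range(len(doc_tokens)), 2):
        if j - i < window:
            # Sort each pair so that (a, b) and (b, a) count as the same biterm.
            biterms.append(tuple(sorted((doc_tokens[i], doc_tokens[j]))))
    return biterms


doc = ["apple", "iphone", "launch"]
print(extract_biterms(doc))
# [('apple', 'iphone'), ('apple', 'launch'), ('iphone', 'launch')]
```

In the package itself, the equivalent step is performed by btm.get_biterms on vectorized documents, as shown in the example further down.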

Development

Please note that bitermplus is actively improved. Refer to the documentation to stay up to date.

Requirements

cython
numpy
pandas
scipy
scikit-learn
tqdm

Setup

Linux and Windows

There should be no issues installing bitermplus on these operating systems. You can install the package directly from PyPI.

pip install bitermplus

Or from this repo:

pip install git+https://github.com/maximtrp/bitermplus.git

Mac OS

First, you need to install the Xcode Command Line Tools and Homebrew. Then, install libomp using brew:

xcode-select --install
brew install libomp
pip3 install bitermplus

Example

Model fitting

import bitermplus as btm
import numpy as np
import pandas as pd

# IMPORTING DATA
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# PREPROCESSING
# Obtaining terms frequency in a sparse matrix and corpus vocabulary
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
tf = np.array(X.sum(axis=0)).ravel()
# Vectorizing documents
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
docs_lens = list(map(len, docs_vec))
# Generating biterms
biterms = btm.get_biterms(docs_vec)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# METRICS
perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, 8)
coherence = btm.coherence(model.matrix_topics_words_, X, M=20)
# or
perplexity = model.perplexity_
coherence = model.coherence_

Results visualization

You need to install tmplot first.

import tmplot as tmp
tmp.report(model=model, docs=texts)

Tutorial

There is a tutorial in the documentation that covers the important steps of topic modeling (including stability measures and results visualization).

Comments

the topic distribution for all doc is similar

topic

[9.99998750e-01 3.12592152e-07 3.12592152e-07 3.12592152e-07 3.12592152e-07]
[9.99999903e-01 2.43742411e-08 2.43742411e-08 2.43742411e-08 2.43742411e-08]
[9.99999264e-01 1.83996702e-07 1.83996702e-07 1.83996702e-07 1.83996702e-07]
[9.99998890e-01 2.77376339e-07 2.77376339e-07 2.77376339e-07 2.77376339e-07]
[9.99999998e-01 3.94318712e-10 3.94318712e-10 3.94318712e-10 3.94318712e-10]
[9.99998428e-01 3.92884503e-07 3.92884503e-07 3.92884503e-07 3.92884503e-07]
bug help wanted good first issue

opened by JennieGerhardt 11
ERROR: Failed building wheel for bitermplus

creating build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-10.9-universal2-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
#include <omp.h>
         ^~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug documentation

opened by QinrenK 9
Got an unexpected result in marked sample

Hi @maximtrp, I am trying to use bitermplus for topic modeling. However, when I train the model on a marked (labeled) sample, I get an unexpected result. First, the marked samples contain 5 types, but the trained model gives a huge perplexity when the number of topics is 5. Second, when I vary the number of topics from 1 to 20, the perplexity decreases as the number of topics increases. My code is as follows:

df = pd.read_csv('dataPretreatment/data/corpus.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
print(df)
stop_words = segmentWord.stopwordslist()
perplexitys = []
coherences = []

for T in range(1,21,1):
    print(T)
    X, vocabulary, vocab_dict = btm.get_words_freqs(texts, stop_words=stop_words)
    # Vectorizing documents
    docs_vec = btm.get_vectorized_docs(texts, vocabulary)
    # Generating biterms
    biterms = btm.get_biterms(docs_vec)
    # INITIALIZING AND RUNNING MODEL
    model = btm.BTM(X, vocabulary, seed=12321, T=T, M=50, alpha=50/T, beta=0.01)
    model.fit(biterms, iterations=2000)
    p_zd = model.transform(docs_vec)
    perplexity = btm.perplexity(model.matrix_topics_words_, p_zd, X, T)
    coherence = model.coherence_
    perplexitys.append(perplexity)
    coherences.append(coherence)

opened by Chen-X666 7
Getting the error 'CountVectorizer' object has no attribute 'get_feature_names_out'

Hi @maximtrp, I am trying to use bitermplus for topic modeling. Running the code raises the error mentioned in the title. It seems that something goes wrong in the get_words_freqs function. I would appreciate your advice on how to fix this.

opened by Sajad7010 4
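
For context on this error: get_feature_names_out was added to scikit-learn vectorizers in version 1.0, so an older scikit-learn installation does not have it, and the v0.6.11 release note further down mentions a fix for an incompatibility between bitermplus and scikit-learn. Upgrading scikit-learn is the direct remedy. The sketch below only illustrates the usual compatibility pattern; the helper feature_names and the two stand-in classes are hypothetical and are not part of bitermplus or scikit-learn.

```python
def feature_names(vectorizer):
    """Return vocabulary terms from a fitted vectorizer, whichever method it exposes."""
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older scikit-learn releases only provide get_feature_names().
        getter = vectorizer.get_feature_names
    return list(getter())


class NewStyleVectorizer:
    """Hypothetical stand-in for a vectorizer from scikit-learn >= 1.0."""

    def get_feature_names_out(self):
        return ["topic", "model"]


class OldStyleVectorizer:
    """Hypothetical stand-in for a vectorizer from an older scikit-learn."""

    def get_feature_names(self):
        return ["topic", "model"]


print(feature_names(NewStyleVectorizer()))
print(feature_names(OldStyleVectorizer()))
```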

Cannot find Closest topics and Stable topics

Hello there, I am able to generate the model and visualize it. But when I tried to find the closest topics and stable topics, I get the error for code line:

closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=139, verbose=True)

The error is:

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

This happens even though I checked the array separately and it is 2-D. I am pasting the code below. Could you please check whether I am doing anything wrong?

Thank you.

X, vocabulary, vocab_dict = btm.get_words_freqs(clean_text, max_df=.85, min_df=15,ngram_range=(1,2))

# Vectorizing documents
docs_vec = btm.get_vectorized_docs(clean_text, vocabulary)

# Generating biterms
Y = X.todense()
biterms = btm.get_biterms(docs_vec, 15)

# INITIALIZING AND RUNNING MODEL
model = btm.BTM(X, vocabulary, T=8, M=10, alpha=500/1000, beta=0.01, win=15, has_background= True)
model.fit(biterms, iterations=500, verbose=True)
p_zd = model.transform(docs_vec,verbose=True)  
print(p_zd) 

# matrix of document-topics; topics vs. documents, topics vs. words probabilities 
matrix_docs_topics = model.matrix_docs_topics_    #Documents vs topics probabilities matrix.
topic_doc_matrix = model.matrix_topics_docs_      #Topics vs documents probabilities matrix.
matrix_topic_words = model.matrix_topics_words_   #Topics vs words probabilities matrix.

# Getting stable topics
print("Array Dimension = ",len(matrix_topic_words.shape))
closest_topics, dist = btm.get_closest_topics(*matrix_topic_words, top_words=100, verbose=True)
stable_topics, stable_kl = btm.get_stable_topics(closest_topics, thres=0.7)

# Stable topics indices list
print(stable_topics)

help wanted question

opened by RashmiBatra 4
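
A likely explanation for the IndexError above, stated as an assumption rather than a confirmed diagnosis: the star in `*matrix_topic_words` unpacks a single 2-D matrix into its rows, so the function receives many 1-D arguments instead of several 2-D matrices (one per fitted model). The plain-Python sketch below shows the unpacking effect with nested lists; describe_args is a hypothetical helper, not a bitermplus function.

```python
def describe_args(*matrices):
    """Report the nesting depth of each positional argument (inputs assumed non-empty)."""
    def ndim(obj):
        depth = 0
        while isinstance(obj, (list, tuple)):
            depth += 1
            obj = obj[0]
        return depth

    return [ndim(m) for m in matrices]


# One model: 2 topics x 3 words.
topics_words = [[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]]

# Unpacking a single matrix: each row arrives as a separate 1-D argument.
print(describe_args(*topics_words))               # [1, 1]
# Passing whole matrices: each argument is 2-D.
print(describe_args(topics_words, topics_words))  # [2, 2]
```

Note also that, according to the v0.6.10 release note further down, the methods for selecting stable topics were moved to the tmplot package.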

Questions regarding Perplexity and Model Comparison with C++

I have two questions regarding this model.

First, I noticed that the perplexity evaluation metric is implemented. Traditionally, however, perplexity is computed on a held-out dataset. Does that mean that, when using this model, we should leave out a certain proportion of the data and compute the perplexity on samples that were not used for training?

Second, I tried to compare this implementation with the C++ version from the original paper. The results (the top words in each topic) are quite different when the same parameters are used on the same corpus. Do you know what might be causing that, and which part was implemented differently?
help wanted question

opened by orpheus92 3
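
To make the held-out question concrete, here is a minimal sketch of the textbook perplexity definition, exp(-sum n_dw log p(w|d) / sum n_dw) with p(w|d) = sum_z p(w|z) p(z|d). It is written in plain Python under the assumption that this is the quantity of interest; it is not taken from the bitermplus source, and the argument names are illustrative.

```python
import math


def perplexity(p_wz, p_zd, counts):
    """Perplexity of documents given topic-word (p_wz) and doc-topic (p_zd) probabilities.

    p_wz[z][w]   -- probability of word w in topic z
    p_zd[d][z]   -- probability of topic z in document d
    counts[d][w] -- number of occurrences of word w in document d
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc_counts in enumerate(counts):
        for w, n_dw in enumerate(doc_counts):
            if n_dw == 0:
                continue
            # Mixture over topics: p(w|d) = sum_z p(w|z) * p(z|d)
            p_wd = sum(p_wz[z][w] * p_zd[d][z] for z in range(len(p_wz)))
            log_lik += n_dw * math.log(p_wd)
            n_tokens += n_dw
    return math.exp(-log_lik / n_tokens)


# A model that is uniform over 4 words should score close to 4.
p_wz = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
p_zd = [[0.5, 0.5]]
counts = [[1, 2, 0, 1]]
print(perplexity(p_wz, p_zd, counts))
```

For a held-out evaluation, counts and p_zd would come from documents that were excluded from fitting, with p_zd inferred for them by the trained model.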
How do I get the topic words?

Hi,

Firstly, thanks for sharing your code.

Not an issue, just a question. I'm able to see the relevant words for a topic in the tmplot report. How do I get those words? I need at least the three most relevant terms.

Thanks in advance.
question

opened by aguinaldoabbj 3
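
One way to approach this, sketched in plain Python: rank the words of each topic by probability and keep the first few. The sketch assumes a topics-by-words matrix whose columns follow the vocabulary order, which is what the names model.matrix_topics_words_ and vocabulary in the example above suggest; the function top_words and the toy data are hypothetical.

```python
def top_words(topics_words, vocabulary, n=3):
    """For each topic (row), return the n words with the highest probability."""
    result = []
    for row in topics_words:
        # Word indices sorted from most to least probable within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


vocabulary = ["game", "team", "stock", "market", "season"]
topics_words = [
    [0.40, 0.30, 0.02, 0.03, 0.25],
    [0.04, 0.05, 0.50, 0.35, 0.06],
]
print(top_words(topics_words, vocabulary))
# [['game', 'team', 'season'], ['stock', 'market', 'season']]
```

The v0.6.6 release note further down also mentions a df_words_topics_ attribute (words vs topics probabilities in a DataFrame), which may be a more convenient starting point.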

failed building wheels

Hi!

I've got an error when running pip3 install bitermplus on MacOS (intel-based, Ventura), using python 3.10.8 in a separate venv (not anaconda):

Building wheels for collected packages: bitermplus
  Building wheel for bitermplus (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for bitermplus (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [34 lines of output]
      Error in sitecustomize; set PYTHONVERBOSE for traceback:
      AssertionError:
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.macosx-12-x86_64-cpython-310
      creating build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/__init__.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_util.py -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running egg_info
      writing src/bitermplus.egg-info/PKG-INFO
      writing dependency_links to src/bitermplus.egg-info/dependency_links.txt
      writing requirements to src/bitermplus.egg-info/requires.txt
      writing top-level names to src/bitermplus.egg-info/top_level.txt
      reading manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file 'src/bitermplus.egg-info/SOURCES.txt'
      copying src/bitermplus/_btm.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_btm.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.c -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      copying src/bitermplus/_metrics.pyx -> build/lib.macosx-12-x86_64-cpython-310/bitermplus
      running build_ext
      building 'bitermplus._btm' extension
      creating build/temp.macosx-12-x86_64-cpython-310
      creating build/temp.macosx-12-x86_64-cpython-310/src
      creating build/temp.macosx-12-x86_64-cpython-310/src/bitermplus
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.sdk -I/usr/local/opt/python@3.10/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c src/bitermplus/_btm.c -o build/temp.macosx-12-x86_64-cpython-310/src/bitermplus/_btm.o -Xpreprocessor -fopenmp
      src/bitermplus/_btm.c:772:10: fatal error: 'omp.h' file not found
      #include <omp.h>
               ^~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bitermplus
Failed to build bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

Could this error be related to #29? I've tested on a PC and it worked though.

bug documentation

opened by alanmaehara 2

Failed building wheel for bitermplus

Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects

When I try to install bitermplus with pip install bitermplus, there is an error message like this:

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bitermplus
ERROR: Could not build wheels for bitermplus, which is required to install pyproject.toml-based projects
bug

opened by novra 2
Calculation of nmi,ami,ri

I'm trying to test the model and see if it matches the data labels, but I can't get the topic for each document. I'm trying to get the list of labels to apply nmi, ami and ri so I'm wondering how to get the labels from the model. @maximtrp

opened by gitassia 2
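
A common way to turn a topic model into hard labels is to pick the most probable topic for each document. The v0.6.12 release note further down says a labels_ property was added for this purpose; the sketch below shows the same idea in plain Python on a documents-by-topics matrix such as the one returned by model.transform. The function name most_probable_topics is hypothetical.

```python
def most_probable_topics(p_zd):
    """Return, for each document (row), the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in p_zd]


p_zd = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
print(most_probable_topics(p_zd))  # [1, 0]
```

The resulting list can be compared with the true labels using clustering metrics such as normalized_mutual_info_score, adjusted_mutual_info_score, and rand_score from sklearn.metrics.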
Implementation Guide

I was wondering whether there is any way to print the topics generated by the BTM model, just as I can with Gensim. In addition, I am getting all negative coherence values in the range of -500 or -600. I am not sure if I am doing something wrong. The issue is that I am not able to interpret the results; even plotting gives some strange output.

The following image shows what is held by the variable adobe. Again, I am not sure whether it needs to be in this form or whether each row here needs to be a list.

opened by neel6762 2
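
On the negative coherence values: if the metric follows the UMass-style definition (an assumption here, not something the text above states), each topic's score is a sum of logarithms of ratios that are usually below 1, so negative totals are expected and grow in magnitude with the number of top words M. A minimal plain-Python sketch of that definition, with a hypothetical function name and toy data:

```python
import math


def umass_coherence(top_words, docs):
    """Sum over word pairs of log((D(w_m, w_l) + 1) / D(w_l)), with D = document frequency.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


docs = [["cat", "dog"], ["cat", "fish"], ["dog", "fish"], ["cat"]]
print(umass_coherence(["cat", "dog", "fish"], docs))  # negative, about -0.81
```

Under this definition, values closer to zero indicate that the top words of a topic co-occur more often.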

Releases(v0.6.12)

v0.6.12(Mar 29, 2022)

This release contains some minor fixes and adds labels_ property to BTM model class (labels for the most probable topics for each of the documents). It also adds get_docs_top_topic method for creating DataFrames with documents and their labels.
v0.6.11(Jan 8, 2022)

This release fixes the incompatibility error between bitermplus and scikit-learn.
v0.6.10(Dec 16, 2021)

This release includes a number of minor fixes. Methods to select stable topics have been moved to tmplot package. Please see the updated tutorial in the documentation.
v0.6.9(Aug 19, 2021)

This release introduces a function for Renyi entropy calculation (bitermplus.entropy) that can be used to estimate the optimal number of topics. For more details, read this paper.
v0.6.8(Jul 23, 2021)

This release is an attempt to fix the issue with perplexity calculation yielding infinity values (#7).
v0.6.7(Jul 1, 2021)
This release drops support for pyLDAvis in favor of tmplot that can be installed with pip (optional):

pip install tmplot
v0.6.6(Jun 16, 2021)

This release exposes new model attributes: matrix_topics_docs_, matrix_words_topics_, and df_words_topics_ (words vs topics probabilities in a DataFrame).
v0.6.5(Jun 11, 2021)

This release fixes a critical bug in the closest topics selection (get_closest_topics method).
v0.6.4(Apr 18, 2021)

This release includes memory optimizations and new metrics for topics distance measuring (see get_closest_topics method).
v0.6.3(Apr 7, 2021)

This release fixes a bug in transform method that occurred when empty documents were passed as inputs.
v0.6.2(Apr 6, 2021)

This release fixes a bug in document vs topics matrix shape (reported in this issue).
v0.6.1(Apr 5, 2021)

This is a minor release that fixes buffer types mismatch on creating biterms (critical bug that appeared under Windows).
v0.6.0(Apr 4, 2021)
This is a major release that fixes critical bugs in arrays initialization. The previous versions of bitermplus are not recommended for use.

Changelog:

Arrays (n_bz, n_wz) are now properly initialized. This procedure was broken in the previous versions, which led to biased results.

Data normalization (via _normalize hidden method) improved.

New NumPy random generators are used to initially assign topics to biterms.

Biterms (biterms_ model attribute) and topics probabilities (theta_ model attribute) are now available.

Biterms are now serialized as well when model is saved.

v0.5.10(Mar 23, 2021)

This release improves model pickling and adds seed argument to fit() method of BTM class.
v0.5.9(Mar 22, 2021)

In this release public extension attributes were converted to properties with comprehensible names and docstrings.
v0.5.8(Mar 21, 2021)

This release fixes numerous bugs in the inference methods, optimizes memory usage, and covers most of the model fitting and inference code with tests.

Owner

Maksim Terpilowski

Research scientist

GitHub Repository https://bitermplus.readthedocs.io/en/stable/

