An easy-to-use Python module that helps you to extract the BERT embeddings for a large text dataset (Bengali/English) efficiently.

Overview

BERTify

This is an easy-to-use python module that helps you to extract the BERT embeddings for a large text dataset efficiently. It is intended to be used for Bengali and English texts.

Specially, optimized for usability in limited computational setups (i.e. free colab/kaggle GPUs). Extracting embeddings for IMDB dataset (a list of 25000 texts) took less than ~28 mins. on Colab's GPU. (Haven't perform any hardcore benchmark, so take these numbers with a grain of salt).

Requirements

  • numpy
  • torch
  • tqdm
  • transformers

Quick Installation

$ pip install git+https://github.com/khalidsaifullaah/BERTify

Usage

num. of texts, 4096 -> embedding dim.) # Example 2: English Embedding Extraction en_bertify = BERTify( lang="en", last_four_layers_embedding=True ) # bn_bertify.batch_size = 96 texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."] en_embeddings = en_bertify.embedding(texts) # shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.) ">
from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",  # language of your text.
    last_four_layers_embedding=True  # to get richer embeddings.
)

# By default, `batch_size` is set to 64. Set `batch_size` higher for making things even faster but higher value than 96 may throw `CUDA out of memory` on Colab's GPU, so try at your own risk.

# bn_bertify.batch_size = 96

# A list of texts that we want the embedding for, can be one or many. (You can turn your whole dataset into a list of texts and pass it into the method for faster embedding extraction)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]

bn_embeddings = bn_bertify.embedding(texts)   # returns numpy matrix 
# shape of the returned matrix in this example 3x4096 (3 -> num. of texts, 4096 -> embedding dim.)




# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)

# bn_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts) 
# shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.)

Tips

  • Try passing all your text data through the .embedding() function at once by turning it into a list of texts.
  • For faster inference, make sure you're using your colab/kaggle GPU while making the .embedding() call
  • Try increasing the batch_size to make it even faster, by default we're using 64 (to be on the safe side) which doesn't throw any CUDA out of memory but I believe we can go even further. Thanks to Alex, from his empirical findings, it seems like it can be pushed until 96. So, before making the .embedding() call, you can do bertify.batch_zie=96 to set a larger batch_zie

Definitions


class BERTify(lang: str = "en", last_four_layers_embedding: bool = False)


A module for extracting embedding from BERT model for Bengali or English text datasets. For 'en' -> English data, it uses bert-base-uncased model embeddings, for 'bn' -> Bengali data, it uses sahajBERT model embeddings.

Parameters:

lang (str, optional): language of your data. Currently supports only 'en' and 'bn'. Defaults to 'en'. last_four_layers_embedding (bool, optional): BERT paper discusses they've reached the best results by concatenating the output of the last four layers, so if this argument is set to True, your embedding vector would be (for bert-base model for example) 4*768=3072 dimensional, otherwise it'd be 768 dimensional. Defaults to False.


def BERTify.embedding(texts: List[str])


The embedding function, that takes a list of texts, feed them through the model and returns a list of embeddings.

Parameters:

texts (List[str]): A list of texts, that you want to extract embedding for (e.g. ["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"])

Returns:

np.ndarray: A numpy matrix of shape num_of_texts x embedding_dimension

License

MIT License.

Owner
Khalid Saifullah
love to learn new things.
Khalid Saifullah
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Yoon Kim 43 Dec 23, 2022
Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

Wordle_Bot Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time. It will log onto the wordle website and en

Lucas Polidori 15 Dec 11, 2022
【原神】自动演奏风物之诗琴的程序

疯物之诗琴 读取midi并自动演奏原神风物之诗琴。 可以自定义配置文件自动调整音符来适配风物之诗琴。 (原神1.4直播那天就开始做了!到现在才能放出来。。) 如何使用 在Release页面中下载打包好的程序和midi压缩包并解压。 双击运行“疯物之诗琴.exe”。 在原神中打开风物之诗琴,软件内输入

435 Jan 04, 2023
This project deals with a simplified version of a more general problem of Aspect Based Sentiment Analysis.

Aspect_Based_Sentiment_Extraction Created on: 5th Jan, 2022. This project deals with an important field of Natural Lnaguage Processing - Aspect Based

Naman Rastogi 4 Jan 01, 2023
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2

Google Research Datasets 52 Jun 21, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels wi

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
Watson Natural Language Understanding and Knowledge Studio

Material de demonstração dos serviços: Watson Natural Language Understanding e Knowledge Studio Visão Geral: https://www.ibm.com/br-pt/cloud/watson-na

Vanderlei Munhoz 4 Oct 24, 2021
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

Google Research 2.1k Dec 28, 2022
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
2021搜狐校园文本匹配算法大赛baseline

sohu2021-baseline 2021搜狐校园文本匹配算法大赛baseline 简介 分享了一个搜狐文本匹配的baseline,主要是通过条件LayerNorm来增加模型的多样性,以实现同一模型处理不同类型的数据、形成不同输出的目的。 线下验证集F1约0.74,线上测试集F1约0.73。

苏剑林(Jianlin Su) 45 Sep 06, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022