Fast, DB-backed pretrained word embeddings for natural language processing.

Last update: Nov 21, 2022

Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to query for embeddings, embeddings is backed by a database, so it is fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop
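
To illustrate why a database-backed lookup avoids the cost of loading a whole embedding file, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, JSON encoding, and the helper functions build_demo_db and lookup are assumptions made for illustration; they are not taken from the package and may not match its actual schema.

```python
import json
import sqlite3


def build_demo_db(vectors):
    """Create an in-memory SQLite table mapping each word to a JSON-encoded vector.

    Hypothetical schema for illustration only.
    """
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE embeddings (word TEXT PRIMARY KEY, emb TEXT)')
    db.executemany(
        'INSERT INTO embeddings VALUES (?, ?)',
        [(word, json.dumps(vec)) for word, vec in vectors.items()],
    )
    db.commit()
    return db


def lookup(db, word):
    """Fetch one word's vector via the primary-key index; return None if absent."""
    row = db.execute(
        'SELECT emb FROM embeddings WHERE word = ?', (word,)
    ).fetchone()
    return json.loads(row[0]) if row else None


demo_db = build_demo_db({
    'canada': [0.1, 0.2, 0.3],
    'toronto': [0.4, 0.5, 0.6],
})
print(lookup(demo_db, 'canada'))
print(lookup(demo_db, 'atlantis'))
```

Opening the connection is cheap, and each query touches only the row for the requested word, which is consistent with the short load and query times shown above.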

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

On first use, the embeddings are downloaded to disk as a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups query the database directly. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as that would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

# Each embedding is backed by its own database, downloaded on first use.
g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
# Combine the three embeddings into one.
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
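
If your home directory is on NFS, one option is to point $EMBEDDINGS_ROOT at a local disk before constructing any embedding. The sketch below mirrors the lookup rule described above (use $EMBEDDINGS_ROOT if set, otherwise ~/.embeddings). The helper resolve_embeddings_root and the path /tmp/embeddings are hypothetical and only illustrate the rule; the package's own resolution logic may differ in detail.

```python
import os


def resolve_embeddings_root(env=None):
    """Hypothetical helper: return the directory the databases would live in.

    Follows the rule stated in this README: $EMBEDDINGS_ROOT if set,
    otherwise ~/.embeddings.
    """
    env = os.environ if env is None else env
    default_root = os.path.join(os.path.expanduser('~'), '.embeddings')
    root = env.get('EMBEDDINGS_ROOT') or default_root
    return os.path.abspath(root)


# Example only: choose a directory on local disk (this path is an assumption).
os.environ['EMBEDDINGS_ROOT'] = '/tmp/embeddings'
print(resolve_embeddings_root())
```

Setting the variable in the shell (for example, export EMBEDDINGS_ROOT=/tmp/embeddings) before starting Python has the same effect.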

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character n-gram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!


Owner

Victor Zhong
