A library for end-to-end learning of embedding index and retrieval model

Related tags

Text Data & NLPpoeem
Overview

Poeem

Poeem is a library for efficient approximate nearest neighbor (ANN) search, which has been widely adopted in industrial recommendation, advertising and search systems. Apart from other libraries, such as Faiss and ScaNN, which build embedding indexes with already learned embeddings, Poeem jointly learn the embedding index together with retrieval model in order to avoid the quantization distortion. Consequentially, Poeem is proved to outperform the previous methods significantly, as shown in our SIGIR paper. Poeem is written based on Tensorflow GPU version 1.15, and some of the core functionalities are written in C++, as custom TensorFlow ops. It is developed by JD.com Search.

For more details, check out our SIGIR 2021 paper here.

Content

System Requirements

  • We only support Linux systems for now, e.g., CentOS and Ubuntu. Windows users might need to build the library from source.
  • Python 3.6 installation.
  • TensorFlow GPU version 1.15 (pip install tensorflow-gpu==1.15.0). Other TensorFlow versions are not tested.
  • CUDA toolkit 10.1, required by TensorFlow GPU 1.15.

Quick Start

Poeem aims at an almost drop-in utility for training and serving large scale embedding retrieval models. We try to make it easy to use as much as we can.

Install

Install poeem for most Linux system can be done easily with pip.

$ pip install poeem

Quick usage

As an extreme simple example, you can use Poeem simply by the following commands

>>> import tensorflow as tf, poeem
>>> hparams = poeem.embedding.PoeemHparam()
>>> poeem_indexing_layer = poeem.embedding.PoeemEmbed(64, hparams)
>>> emb = tf.random.normal([100, 64])  # original embedding before indexing layer
>>> emb_quantized, coarse_code, code, regularizer = poeem_indexing_layer.forward(emb)
>>> emb = emb - tf.stop_gradient(emb - emb_quantized)   # use this embedding for downstream computation
>>> with tf.Session() as sess:
>>>   sess.run(tf.global_variables_initializer())
>>>   sess.run(emb)

Tutorial

The above simple example, as a quick start, does not show how to build embedding index and how to serve it online. Experienced or advanced users who are interested in applying it in real-world or industrial system, can further read the tutorials.

Authors

The main authors of Poeem are:

  • Han Zhang wrote most Python models and conducted most of experiments.
  • Hongwei Shen wrote most of the C++ TensorFlow ops and managed the pip released package.
  • Yunjiang Jiang developed the rotation algorithm and wrote the related code.
  • Wen-Yun Yang initiated the Poeem project, wrote some of TensorFlow ops, integrated different parts and wrote the tutorials.

How to Cite

Reference to cite if you use Poeem in a research paper or in a real-world system

  @inproceeding{poeem_sigir21,
    title={Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index},
    author={Han Zhang, Hongwei Shen, Yiming Qiu, Yunjiang Jiang, Songlin Wang, Sulong Xu, Yun Xiao, Bo Long and Wen-Yun Yang},
    booktitle={The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages={},
    year={2021}
}

License

MIT licensed

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.

CTC Decoding Algorithms Update 2021: installable Python package Python implementation of some common Connectionist Temporal Classification (CTC) decod

Harald Scheidl 736 Jan 03, 2023
Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus A gold standard corpus manually annotated with verb-preverb connections for Hungarian. corpus The corpus consist of the follo

RIL Lexical Knowledge Representation Research Group 3 Jan 27, 2022
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 04, 2022
This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

NLP Classifier Introduction This project trains a bert model on any NLP classifcation model. And uses the model in make predictions on new data using

Abdullah Tarek 3 Mar 11, 2022
[AAAI 21] Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

◥ Curriculum Labeling ◣ Revisiting Pseudo-Labeling for Semi-Supervised Learning Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez. In the

UVA Computer Vision 113 Dec 15, 2022
A multi-voice TTS system trained with an emphasis on quality

TorToiSe Tortoise is a text-to-speech program built with the following priorities: Strong multi-voice capabilities. Highly realistic prosody and inton

James Betker 2.1k Jan 01, 2023
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
DeBERTa: Decoding-enhanced BERT with Disentangled Attention

DeBERTa: Decoding-enhanced BERT with Disentangled Attention This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Dis

Microsoft 1.2k Jan 03, 2023
Nested Named Entity Recognition for Chinese Biomedical Text

CBio-NAMER CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understand

8 Dec 25, 2022
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation In this repo you can find the code of the Supervised Hybrid Audio Segmentatio

Machine Translation @ UPC 21 Dec 20, 2022
Code Generation using a large neural network called GPT-J

CodeGenX is a Code Generation system powered by Artificial Intelligence! It is delivered to you in the form of a Visual Studio Code Extension and is Free and Open-source!

DeepGenX 389 Dec 31, 2022
Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Code for "Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments".

Yu Zhang 50 Nov 08, 2022
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
Constituency Tree Labeling Tool

Constituency Tree Labeling Tool The purpose of this package is to solve the constituency tree labeling problem. Look from the dataset labeled by NLTK,

张宇 6 Dec 20, 2022
A telegram bot to translate 100+ Languages

🔥 GOOGLE TRANSLATER 🔥 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🔰 Deploy To Railway 🔰 • • ✅ OFF

Aɴᴋɪᴛ Kᴜᴍᴀʀ 5 Dec 20, 2021
Demo programs for the Talking Head Anime from a Single Image 2: More Expressive project.

Demo Code for "Talking Head Anime from a Single Image 2: More Expressive" This repository contains demo programs for the Talking Head Anime

Pramook Khungurn 901 Jan 06, 2023
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms

FNet: Mixing Tokens with Fourier Transforms Pytorch implementation of Fnet : Mixing Tokens with Fourier Transforms. Citation: @misc{leethorp2021fnet,

Rishikesh (ऋषिकेश) 217 Dec 05, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

A list of NLP(Natural Language Processing) tutorials built on Tensorflow 2.0.

Won Joon Yoo 335 Jan 04, 2023