A library for end-to-end learning of embedding index and retrieval model

Last update: Dec 21, 2022

Poeem

Poeem is a library for efficient approximate nearest neighbor (ANN) search, a technique widely adopted in industrial recommendation, advertising and search systems. Unlike other libraries such as Faiss and ScaNN, which build embedding indexes from already learned embeddings, Poeem learns the embedding index jointly with the retrieval model in order to avoid quantization distortion. As a result, Poeem significantly outperforms the previous methods, as shown in our SIGIR paper. Poeem is built on TensorFlow GPU version 1.15, and some of the core functionality is written in C++ as custom TensorFlow ops. It is developed by JD.com Search.

For more details, check out our SIGIR 2021 paper here.
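
To make "quantization distortion" concrete, the following is a minimal pure-Python sketch of product quantization with hand-picked codebooks. It is an illustration only, not Poeem's implementation: the helper names (`pq_encode`, `pq_decode`, `distortion`) and the toy `CODEBOOKS` are assumptions made for this example. An embedding is split into subvectors, each subvector is replaced by its nearest centroid, and the distortion is the squared error between the original and the reconstruction.

```python
# Toy product quantization (PQ). Hypothetical helper names and hand-picked
# codebooks; this is NOT Poeem's code, only an illustration of distortion.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vector, codebooks):
    """Split `vector` into len(codebooks) equal subvectors and return, for
    each subvector, the index of its nearest centroid in that codebook."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(sub, codebook[k]))
        codes.append(nearest)
    return codes

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    reconstructed = []
    for code, codebook in zip(codes, codebooks):
        reconstructed.extend(codebook[code])
    return reconstructed

def distortion(vector, codebooks):
    """Squared error introduced by encoding and then decoding `vector`."""
    codes = pq_encode(vector, codebooks)
    return squared_distance(vector, pq_decode(codes, codebooks))

# Two codebooks with two 2-d centroids each (assumed values for the example).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for dimensions 2-3
]

embedding = [0.9, 1.2, 0.1, 0.8]
codes = pq_encode(embedding, CODEBOOKS)
decoded = pq_decode(codes, CODEBOOKS)
error = distortion(embedding, CODEBOOKS)
print(codes, decoded, error)
```

When the codebooks are fitted after the retrieval model has been trained, the model never sees this error during training. Learning both together, as Poeem does, lets the training objective account for it.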

System Requirements

We only support Linux systems for now, e.g., CentOS and Ubuntu. Windows users might need to build the library from source.
Python 3.6 installation.
TensorFlow GPU version 1.15 (pip install tensorflow-gpu==1.15.0). Other TensorFlow versions are not tested.
CUDA toolkit 10.1, required by TensorFlow GPU 1.15.
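
As a convenience, the first two requirements can be checked with a few lines of Python before installing. This is a rough sketch, not part of Poeem: the function name `check_requirements` is hypothetical, and it does not verify the TensorFlow or CUDA versions.

```python
import sys

def check_requirements(platform_name, python_version):
    """Return a list of human-readable problems; an empty list means the
    operating system and Python version match the requirements above.

    Hypothetical helper, not part of the Poeem package.
    """
    problems = []
    if not platform_name.startswith("linux"):
        problems.append("only Linux systems are supported for now")
    if tuple(python_version[:2]) != (3, 6):
        problems.append("the listed Python version is 3.6")
    return problems

if __name__ == "__main__":
    # Check the interpreter that is running this script.
    for problem in check_requirements(sys.platform, sys.version_info):
        print("warning:", problem)
```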

Quick Start

Poeem aims to be an almost drop-in utility for training and serving large-scale embedding retrieval models. We try to make it as easy to use as we can.

Install

On most Linux systems, Poeem can be installed with pip.

$ pip install poeem

Quick usage

As an extremely simple example, you can use Poeem with the following commands:

>>> import tensorflow as tf, poeem
>>> hparams = poeem.embedding.PoeemHparam()
>>> poeem_indexing_layer = poeem.embedding.PoeemEmbed(64, hparams)
>>> emb = tf.random.normal([100, 64])  # original embedding before indexing layer
>>> emb_quantized, coarse_code, code, regularizer = poeem_indexing_layer.forward(emb)
>>> emb = emb - tf.stop_gradient(emb - emb_quantized)   # use this embedding for downstream computation
>>> with tf.Session() as sess:
...     sess.run(tf.global_variables_initializer())
...     sess.run(emb)
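
The expression `emb - tf.stop_gradient(emb - emb_quantized)` in the example above is commonly known as a straight-through estimator: the forward value equals the quantized embedding, while the gradient flows to the original embedding as if no quantization had happened. The toy sketch below illustrates this with a tiny forward-mode autodiff class in plain Python. The `Dual` class, `quantize`, and this `stop_gradient` are assumptions written for this example, not TensorFlow or Poeem code, and rounding stands in for the real quantizer. In Poeem the quantized embedding also depends on learned index parameters, which the toy ignores.

```python
# Toy forward-mode autodiff: each Dual carries a value and its derivative
# with respect to one scalar input. Hypothetical names for illustration.

class Dual:
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad

    def __sub__(self, other):
        # The derivative of a difference is the difference of derivatives.
        return Dual(self.value - other.value, self.grad - other.grad)

def stop_gradient(x):
    """Keep the value but drop the derivative (the role tf.stop_gradient plays)."""
    return Dual(x.value, 0.0)

def quantize(x):
    """Stand-in quantizer: round to the nearest integer. It is piecewise
    constant, so its derivative with respect to the input is 0."""
    return Dual(float(round(x.value)), 0.0)

emb = Dual(2.7, 1.0)            # the input itself: d(emb)/d(emb) = 1
emb_quantized = quantize(emb)   # quantized value, but no gradient signal

# Straight-through estimator: quantized value, gradient of the original input.
ste = emb - stop_gradient(emb - emb_quantized)

print("plain quantization:", emb_quantized.value, emb_quantized.grad)
print("straight-through:  ", ste.value, ste.grad)
```

Without this trick, layers below the indexing layer would receive a zero gradient through the quantizer and could not be trained end to end.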

Tutorial

The simple example above is only a quick start. It does not show how to build an embedding index or how to serve it online. Experienced or advanced users who are interested in applying Poeem in real-world or industrial systems can read the tutorials for further details.

Authors

The main authors of Poeem are:

Han Zhang wrote most of the Python models and conducted most of the experiments.
Hongwei Shen wrote most of the C++ TensorFlow ops and managed the pip released package.
Yunjiang Jiang developed the rotation algorithm and wrote the related code.
Wen-Yun Yang initiated the Poeem project, wrote some of the TensorFlow ops, integrated the different parts and wrote the tutorials.

How to Cite

Please cite the following reference if you use Poeem in a research paper or in a real-world system:

  @inproceedings{poeem_sigir21,
    title={Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index},
    author={Han Zhang and Hongwei Shen and Yiming Qiu and Yunjiang Jiang and Songlin Wang and Sulong Xu and Yun Xiao and Bo Long and Wen-Yun Yang},
    booktitle={The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages={},
    year={2021}
}

License

MIT licensed
