A library for end-to-end learning of embedding index and retrieval model

Related tags

Text Data & NLPpoeem
Overview

Poeem

Poeem is a library for efficient approximate nearest neighbor (ANN) search, which has been widely adopted in industrial recommendation, advertising and search systems. Apart from other libraries, such as Faiss and ScaNN, which build embedding indexes with already learned embeddings, Poeem jointly learn the embedding index together with retrieval model in order to avoid the quantization distortion. Consequentially, Poeem is proved to outperform the previous methods significantly, as shown in our SIGIR paper. Poeem is written based on Tensorflow GPU version 1.15, and some of the core functionalities are written in C++, as custom TensorFlow ops. It is developed by JD.com Search.

For more details, check out our SIGIR 2021 paper here.

Content

System Requirements

  • We only support Linux systems for now, e.g., CentOS and Ubuntu. Windows users might need to build the library from source.
  • Python 3.6 installation.
  • TensorFlow GPU version 1.15 (pip install tensorflow-gpu==1.15.0). Other TensorFlow versions are not tested.
  • CUDA toolkit 10.1, required by TensorFlow GPU 1.15.

Quick Start

Poeem aims at an almost drop-in utility for training and serving large scale embedding retrieval models. We try to make it easy to use as much as we can.

Install

Install poeem for most Linux system can be done easily with pip.

$ pip install poeem

Quick usage

As an extreme simple example, you can use Poeem simply by the following commands

>>> import tensorflow as tf, poeem
>>> hparams = poeem.embedding.PoeemHparam()
>>> poeem_indexing_layer = poeem.embedding.PoeemEmbed(64, hparams)
>>> emb = tf.random.normal([100, 64])  # original embedding before indexing layer
>>> emb_quantized, coarse_code, code, regularizer = poeem_indexing_layer.forward(emb)
>>> emb = emb - tf.stop_gradient(emb - emb_quantized)   # use this embedding for downstream computation
>>> with tf.Session() as sess:
>>>   sess.run(tf.global_variables_initializer())
>>>   sess.run(emb)

Tutorial

The above simple example, as a quick start, does not show how to build embedding index and how to serve it online. Experienced or advanced users who are interested in applying it in real-world or industrial system, can further read the tutorials.

Authors

The main authors of Poeem are:

  • Han Zhang wrote most Python models and conducted most of experiments.
  • Hongwei Shen wrote most of the C++ TensorFlow ops and managed the pip released package.
  • Yunjiang Jiang developed the rotation algorithm and wrote the related code.
  • Wen-Yun Yang initiated the Poeem project, wrote some of TensorFlow ops, integrated different parts and wrote the tutorials.

How to Cite

Reference to cite if you use Poeem in a research paper or in a real-world system

  @inproceeding{poeem_sigir21,
    title={Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index},
    author={Han Zhang, Hongwei Shen, Yiming Qiu, Yunjiang Jiang, Songlin Wang, Sulong Xu, Yun Xiao, Bo Long and Wen-Yun Yang},
    booktitle={The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    pages={},
    year={2021}
}

License

MIT licensed

Python library for processing Chinese text

SnowNLP: Simplified Chinese Text Processing SnowNLP是一个python写的类库,可以方便的处理中文文本内容,是受到了TextBlob的启发而写的,由于现在大部分的自然语言处理库基本都是针对英文的,于是写了一个方便处理中文的类库,并且和TextBlob

Rui Wang 6k Jan 02, 2023
NAACL 2022: MCSE: Multimodal Contrastive Learning of Sentence Embeddings

MCSE: Multimodal Contrastive Learning of Sentence Embeddings This repository contains code and pre-trained models for our NAACL-2022 paper MCSE: Multi

Saarland University Spoken Language Systems Group 39 Nov 15, 2022
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Phil Wang 5k Jan 02, 2023
CYGNUS, the Cynical AI, combines snarky responses with uncanny aggression.

New & (hopefully) Improved CYGNUS with several API updates, user updates, and online/offline operations added!!!

Simran Farrukh 0 Mar 28, 2022
Binary LSTM model for text classification

Text Classification The purpose of this repository is to create a neural network model of NLP with deep learning for binary classification of texts re

Nikita Elenberger 1 Mar 11, 2022
Big Bird: Transformers for Longer Sequences

BigBird, is a sparse-attention based transformer which extends Transformer based models, such as BERT to much longer sequences. Moreover, BigBird comes along with a theoretical understanding of the c

Google Research 457 Dec 23, 2022
Official PyTorch implementation of SegFormer

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Figure 1: Performance of SegFormer-B0 to SegFormer-B5. Project page

NVIDIA Research Projects 1.4k Dec 29, 2022
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
A script that automatically creates a branch name using google translation api and jira api

About google translation api와 jira api을 사용하여 자동으로 브랜치 이름을 만들어주는 스크립트 Setup 환경변수에 다음 3가지를 등록해야 한다. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

NLP codes implemented with Pytorch (w/o library such as huggingface)

NLP_scratch NLP codes implemented with Pytorch (w/o library such as huggingface) scripts ├── models: Neural Network models ├── data: codes for dataloa

3 Dec 28, 2021
Topic Modelling for Humans

gensim – Topic Modelling in Python Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Targ

RARE Technologies 13.8k Jan 02, 2023
KR-FinBert And KR-FinBert-SC

KR-FinBert & KR-FinBert-SC Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adapt

5 Jul 29, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code

simple_diarizer Simplified diarization pipeline using some pretrained models. Made to be a simple as possible to go from an input audio file to diariz

Chau 65 Dec 30, 2022
Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

Dat Quoc Nguyen 152 Sep 02, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
This is a MD5 password/passphrase brute force tool

CROWES-PASS-CRACK-TOOl This is a MD5 password/passphrase brute force tool How to install: Do 'git clone https://github.com/CROW31/CROWES-PASS-CRACK-TO

9 Mar 02, 2022
Simple Annotated implementation of GPT-NeoX in PyTorch

Simple Annotated implementation of GPT-NeoX in PyTorch This is a simpler implementation of GPT-NeoX in PyTorch. We have taken out several optimization

labml.ai 101 Dec 03, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
New Modeling The Background CodeBase

Modeling the Background for Incremental Learning in Semantic Segmentation This is the updated official PyTorch implementation of our work: "Modeling t

Fabio Cermelli 9 Dec 28, 2022