Build Text Rerankers with Deep Language Models

Overview

Reranker

Reranker is a lightweight, effective and efficient package for training and deploying deep language model rerankers in information retrieval (IR), question answering (QA) and many other natural language processing (NLP) pipelines. The training procedure follows our ECIR paper Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline using a localized contrastive estimation (LCE) loss.

Reranker speaks Huggingface 🤗 language! This means that you instantly get all state-of-the-art pre-trained models as soon as they are ported to HF transformers. You also get the familiar model and trainer interfaces.

State-of-the-Art Performance.

Reranker has two submissions on the MS MARCO document leaderboard, each of which took 1st place and advanced the SOTA!

Date         Submission Name                              Dev MRR@100   Eval MRR@100
2021/01/20   LCE loss + HDCT (ensemble)                   0.464         0.405
2020/09/09   HDCT top100 + BERT-base FirstP (single)      0.434         0.382

Features

  • Training rerankers from state-of-the-art pre-trained language models like BERT, RoBERTa and ELECTRA.
  • State-of-the-art reranking performance with our LCE-loss-based training pipeline.
  • GPU memory optimizations: Loss Parallelism and Gradient Cache, which allow training of larger models.
  • Faster training (see the launch sketch after this list)
    • Distributed Data Parallel (DDP) for multiple GPUs.
    • Automatic Mixed Precision (AMP) training and inference with up to 2x speedup!
  • Break the CPU RAM limitation by memory-mapping datasets with pyarrow through the datasets package interface.
  • Checkpoint interoperability with Hugging Face transformers.
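
For example, a multi-GPU DDP run with AMP can be launched with the standard PyTorch launcher. This is only a sketch: train.py stands in for your actual training entry point, and --fp16 is the usual Hugging Face TrainingArguments flag for AMP,

python -m torch.distributed.launch --nproc_per_node 4 train.py --fp16 {model, data and training arguments}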

Design Philosophy

The library is designed to be dedicated to text reranking modeling, training and testing. This helps us keep the code concise and focused on a specific task.

Under the hood, Reranker provides a thin wrapper over the Huggingface libraries. Our model wraps PreTrainedModel and our trainer subclasses the Huggingface Trainer. You can then work with the familiar interfaces.

Installation and Dependencies

Reranker uses PyTorch, Huggingface Transformers and Datasets. Install with the following commands,

git clone https://github.com/luyug/Reranker.git
cd Reranker
pip install .

Reranker has been tested with torch==1.6.0, transformers==4.2.0, datasets==1.1.3.

For development, install as editable,

pip install -e .

Workflow

Inference (Reranking)

The easiest way to do inference is to use one of our uploaded trained checkpoints with RerankerForInference.

from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint

inputs = rk.tokenize('weather in new york', 'it is cold today in new york', return_tensors='pt')
score = rk(inputs).logits
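
To rerank several candidates for one query, the same interface can be used on a batch. This is a sketch only: it assumes rk.tokenize forwards padding/truncation options to the underlying Hugging Face tokenizer and that the model returns one logit per query-passage pair,

import torch

query = 'weather in new york'
passages = [
    'it is cold today in new york',
    'new york city is on the east coast',
]

# Tokenize all (query, passage) pairs as one padded batch.
inputs = rk.tokenize([query] * len(passages), passages,
                     padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    scores = rk(inputs).logits.squeeze(-1)  # one relevance score per pair

# Higher score means more relevant; sort candidates accordingly.
ranking = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)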

Training

For training, you will need a model, a dataset and a trainer. Say we have parsed arguments into model_args, data_args and training_args with reranker.arguments, as sketched below.
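
The following sketch assumes reranker.arguments provides ModelArguments and DataArguments dataclasses (the exact names are assumptions; the package may also ship its own training-arguments class) and uses the standard transformers.HfArgumentParser,

from transformers import HfArgumentParser, TrainingArguments
from reranker.arguments import ModelArguments, DataArguments  # names assumed

parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

First, initialize the reranker and tokenizer from one of the pre-trained language models on Hugging Face. For example, let's use RoBERTa by loading roberta-base.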

from reranker import Reranker 
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = Reranker.from_pretrained(model_args, data_args, training_args, 'roberta-base')

Then create the dataset,

from reranker.data import GroupedTrainDataset
train_dataset = GroupedTrainDataset(
    data_args, data_args.train_path, 
    tokenizer=tokenizer, train_args=training_args
)

Create a trainer and train,

from reranker import RerankerTrainer
from reranker.data import GroupCollator  # collates grouped examples into model inputs

trainer = RerankerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=GroupCollator(tokenizer),
)
trainer.train()
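
Since RerankerTrainer subclasses the Hugging Face Trainer, the usual saving interface should apply (a sketch),

trainer.save_model(training_args.output_dir)  # writes a transformers-compatible checkpoint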

See the full examples in our examples directory.

Examples

MS MARCO Document Ranking with Reranker

More to come

Large Models

Loss Parallelism

We support computing a query's LCE loss across multiple GPUs with the flag --collaborative. Note that a group size (pos + neg) not divisible by the number of GPUs may incur undefined behaviour. You will typically want to use this together with gradient accumulation steps greater than one.
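
For example, a 4-GPU run with loss parallelism and gradient accumulation could be launched as follows. This is a sketch: train.py is a placeholder for your training entry point, and --gradient_accumulation_steps is the standard Hugging Face flag,

python -m torch.distributed.launch --nproc_per_node 4 train.py --collaborative --gradient_accumulation_steps 2 {model, data and training arguments}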

Detailed instructions to be added.

Gradient Cache

Experimental: We provide the subclasses RerankerDC and RerankerDCTrainer. In the MS MARCO example, you can use them with the --distance_cache argument to activate gradient caching with respect to the computed unnormalized distances. This potentially allows training with an unlimited number of negatives beyond the GPU memory limitation, up to numerical precision. The method is described in our preprint Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage.
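
As a rough sketch of how these subclasses could slot into the training snippet above (the import path and constructor signatures are assumptions; they are taken to mirror Reranker and RerankerTrainer),

from reranker import RerankerDC, RerankerDCTrainer  # import path assumed

model = RerankerDC.from_pretrained(model_args, data_args, training_args, 'roberta-base')
trainer = RerankerDCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=GroupCollator(tokenizer),
)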

Detailed instructions to be added.

Helpers

We provide a few helpers in the helper directory for data formatting,

Score Formatting

  • score_to_marco.py turns a raw score txt file into MS MARCO format.
  • score_to_tein.py turns a raw score txt file into trec eval format.

For example,

python score_to_tein.py --score_file {path to raw score txt}

This generates a trec eval format file in the same directory as the raw score file.

Data Format

Reranker core utilities (batch training, batch inference) expect processed and tokenized text in token id format. This means pre-processing should be done beforehand, e.g. with a BERT tokenizer.

Training Data

Training data is grouped by query into a JSON-lines file, where each line contains a query, its corresponding positives and sampled negatives.

{
    "qry": {
        "qid": str,
        "query": List[int],
    },
    "pos": List[
        {
            "pid": str,
            "passage": List[int],
        }
    ],
    "neg": List[
        {
            "pid": str,
            "passage": List[int]
        }
    ]
}

Training data is handled by class reranker.data.GroupedTrainDataset.
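
For instance, one training line could be produced from raw text roughly as follows. This sketch simply follows the schema above and uses a standard Hugging Face tokenizer to obtain token ids,

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def encode(text):
    # Token ids only; whether special tokens are added later by the dataset class is an assumption.
    return tokenizer.encode(text, add_special_tokens=False)

entry = {
    "qry": {"qid": "q1", "query": encode("weather in new york")},
    "pos": [{"pid": "d1", "passage": encode("it is cold today in new york")}],
    "neg": [{"pid": "d2", "passage": encode("new york hosts many museums")}],
}

with open("train.json", "a") as f:
    f.write(json.dumps(entry) + "\n")  # one query group per line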

Inference (Reranking) Data

Inference data is organized as query-document (passage) pairs. Each line is a JSON entry to be reranked (scored).

{
    "qid": str,
    "pid": str,
    "qry": List[int],
    "psg": List[int]
}

To speed up postprocessing, we currently take an additional tsv specifying text ids,

qid0     pid0
qid0     pid1
...

The ordering in the two files is expected to be the same.

Inference data is handled by class reranker.data.PredictionDataset.
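
A matching sketch for writing inference entries and the accompanying id tsv, following the schema and tsv layout above,

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

qid, pid = "q1", "d1"
entry = {
    "qid": qid,
    "pid": pid,
    "qry": tokenizer.encode("weather in new york", add_special_tokens=False),
    "psg": tokenizer.encode("it is cold today in new york", add_special_tokens=False),
}

with open("rerank.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
# The id tsv must list pairs in the same order as the json file.
with open("rerank_ids.tsv", "a") as f:
    f.write(f"{qid}\t{pid}\n")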

Result Scores

Scores are stored in a tsv file with columns corresponding to qid, pid and score.

qid0     pid0     s0
qid0     pid1     s1
...

You can post-process it with our helper scripts into MS MARCO format or TREC eval format.

Contribution

We welcome contributions to the package, such as new dataset interfaces or new models.

Contact

You can reach me by email at [email protected]. As a second-year master's student, I have busy days from time to time and may not reply promptly. Feel free to ping me if you don't get a reply.

Citation

If you use Reranker in your research, please consider citing our ECIR paper,

@inproceedings{gao2021lce,
      title={Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline},
      author={Luyu Gao and Zhuyun Dai and Jamie Callan},
      year={2021},
      booktitle={The 43rd European Conference On Information Retrieval (ECIR)}
}

For the gradient cache utility, consider citing our preprint,

@misc{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage}, 
      author={Luyu Gao and Yunyi Zhang},
      year={2021},
      eprint={2101.06983},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Reranker is currently licensed under CC-BY-NC 4.0.
