Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Overview

Tevatron

Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized design for easy research; a set of command line tools are also provided for fast development and testing. A set of easy-to-use interfaces to Huggingfac's state-of-the-art pre-trained transformers ensures Tevatron's superior performance.

Tevatron is currently under initial development stage. We will be actively adding new features and API changes may happen. Suggestions, feature requests and PRs are welcomed.

Features

  • Command line interface for dense retriever training/encoding and dense index search.
  • Flexible and extendable Pytorch retriever models.
  • Highly efficient Trainer, a subclass of Huggingface Trainer, that naively support training performance features like mixed precision and distributed data parallel.
  • Fast and memory-efficient train/inference data access based on memory mapping with Apache Arrow through Huggingface datasets.

Installation

First install neural network and similarity search backends, namely Pytorch and FAISS. Check out the official installation guides for Pytorch and for FAISS.

Then install Tevatron with pip,

pip install tevatron

Or typically for develoment/research, clone this repo and install as editable,

git https://github.com/texttron/tevatron
cd tevatron
pip install --editable .

Note: The current code base has been tested with, torch==1.8.2, faiss-cpu==1.7.1, transformers==4.9.2, datasets==1.11.0

Data Format

Training: Each line of the the Train file is a training instance,

{'query': TEXT_TYPE, 'positives': List[TEXT_TYPE], 'negatives': List[TEXT_TYPE]}
...

Inference/Encoding: Each line of the the encoding file is a piece of text to be encoded,

{text_id: "xxx", 'text': TEXT_TYPE}
...

Here TEXT_TYPE can be either raw string or pre-tokenized ids, i.e. List[int]. Using the latter can help lower data processing latency during training to reduce/eliminate GPU wait. Note: the current code requires text_id of passages/contexts to be convertible to integer, e.g. integers or string of integers.

Training (Simple)

To train a simple dense retriever, call the tevatron.driver.train module,

python -m tevatron.driver.train \  
  --output_dir $OUTDIR \  
  --model_name_or_path bert-base-uncased \  
  --do_train \  
  --save_steps 20000 \  
  --train_dir $TRAIN_DIR \
  --fp16 \  
  --per_device_train_batch_size 8 \  
  --learning_rate 5e-6 \  
  --num_train_epochs 2 \  
  --dataloader_num_workers 2

Here we picked bert-base-uncased BERT weight from Huggingface Hub and turned on AMP with --fp16 to speed up training. Several command flags are provided in addition to configure the learned model, e.g. --add_pooler which adds an linear projection. A full list command line arguments can be found in tevatron.arguments.

Training (Research)

Check out the run.py in examples directory for a fully configurable train/test loop. Typically you will do,

from tevatron.modeling import DenseModel
from tevatron.trainer import DenseTrainer as Trainer

...
model = DenseModel.build(
        model_args,
        data_args,
        training_args,
        config=config,
        cache_dir=model_args.cache_dir,
    )
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
...
trainer.train()

Encoding

To encode, call the tevatron.driver.encode module. For large corpus, split the corpus into shards to parallelize.

for s in shard1 shar2 shard3
do
python -m tevatron.driver.encode \  
  --output_dir=$OUTDIR \  
  --tokenizer_name $TOK \  
  --config_name $CONFIG \  
  --model_name_or_path $MODEL_DIR \  
  --fp16 \  
  --per_device_eval_batch_size 128 \  
  --encode_in_path $CORPUS_DIR/$s.json \  
  --encoded_save_path $ENCODE_DIR/$s.pt
done

Index Search

Call the tevatron.faiss_retriever module,

python -m tevatron.faiss_retriever \  
--query_reps $ENCODE_QRY_DIR/qry.pt \  
--passage_reps $ENCODE_DIR/'*.pt' \  
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to rank.tsv

Encoded corpus or corpus shards are loaded based on glob pattern matching of argument --passage_reps. Argument --batch_size controls number of queries passed to the FAISS index each search call and -1 will pass all queries in one call. Larger batches typically run faster (due to better memory access patterns and hardware utilization.) Setting flag --save_text will save the ranking to a tsv file with each line being qid pid score.

Alternatively paralleize search over the shards,

for s in shard1 shar2 shard3
do
python -m tevatron.faiss_retriever \  
--query_reps $ENCODE_QRY_DIR/qry.pt \  
--passage_reps $ENCODE_DIR/$s.pt \  
--depth $DEPTH \  
--save_ranking_to $INTERMEDIATE_DIR/$s
done

Then combine the results using the reducer module,

python -m tevatron.faiss_retriever.reducer \  
--score_dir $INTERMEDIATE_DIR \  
--query $ENCODE_QRY_DIR/qry.pt \  
--save_ranking_to rank.txt  

Contacts

If you have a toolkit specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.

Owner
texttron
texttron
Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Dense Passage Retrieval Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the

Meta Research 1.1k Jan 07, 2023
Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | 简体中文 mT5: A Massively

2 Oct 17, 2021
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

SpeechMix Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together. Introduction For the same input: from datas

Eric Lam 31 Nov 07, 2022
fastai ulmfit - Pretraining the Language Model, Fine-Tuning and training a Classifier

fast.ai ULMFiT with SentencePiece from pretraining to deployment Motivation: Why even bother with a non-BERT / Transformer language model? Short answe

Florian Leuerer 26 May 27, 2022
Khandakar Muhtasim Ferdous Ruhan 1 Dec 30, 2021
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022
小布助手对话短文本语义匹配的一个baseline

oppo-text-match 小布助手对话短文本语义匹配的一个baseline 模型 参考:https://kexue.fm/archives/8213 base版本线下大概0.952,线上0.866(单模型,没做K-flod融合)。 训练 测试环境:tensorflow 1.15 + keras

苏剑林(Jianlin Su) 132 Dec 14, 2022
A simple Speech Emotion Recognition (SER) API created using Flask and running in a Docker container.

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

2 Nov 11, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
Trex is a tool to match semantically similar functions based on transfer learning.

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022
End-2-end speech synthesis with recurrent neural networks

Introduction New: Interactive demo using Google Colaboratory can be found here TTS-Cube is an end-2-end speech synthesis system that provides a full p

Tiberiu Boros 214 Dec 07, 2022
Active learning for text classification in Python

Active Learning allows you to efficiently label training data in a small-data scenario.

Webis 375 Dec 28, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 09, 2021