πŸš€ RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

Overview

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutting edge technologies, this repository provides an easy-to-use toolkit for running and fine-tuning the state-of-the-art dense retrievers, namely πŸš€ RocketQA. This toolkit has the following advantages:

  • State-of-the-art: πŸš€ RocketQA provides our well-trained models, which achieve SOTA performance on many dense retrieval datasets. And it will continue to update the latest models.
  • First-Chinese-model: πŸš€ RocketQA provides the first open source Chinese dense retrieval model, which is trained on millions of manual annotation data from DuReader.
  • Easy-to-use: By integrating this toolkit with JINA, πŸš€ RocketQA can help developers build an end-to-end retrieval system and question answering system with several lines of code.

News

  • April 29, 2022: Training function is added to RocketQA toolkit. And the baseline models of DuReaderretrieval (both cross encoder and dual encoder) are available in RocketQA models.
  • March 30, 2022: The baseline of DuReaderretrieval leaderboard was released. [code/model]
  • March 30, 2022: We released DuReaderretrieval, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. [paper] [data]
  • December 3, 2021: The toolkit of dense retriever RocketQA was released, including the first chinese dense retrieval model trained on DuReader.
  • August 26, 2021: RocketQA v2 was accepted by EMNLP 2021. [code/model]
  • May 5, 2021: PAIR was accepted by ACL 2021. [code/model]
  • March 11, 2021: RocketQA v1 was accepted by NAACL 2021. [code/model]

Installation

We provide two installation methods: Python Installation Package and Docker Environment

Install with Python Package

First, install PaddlePaddle.

# GPU version:
$ pip install paddlepaddle-gpu

# CPU version:
$ pip install paddlepaddle

Second, install rocketqa package (latest version: 1.1.0):

$ pip install rocketqa

NOTE: this toolkit MUST be running on Python3.6+ with PaddlePaddle 2.0+.

Install with Docker

docker pull rocketqa/rocketqa

docker run -it docker.io/rocketqa/rocketqa bash

Getting Started

Refer to the examples below, you can build and run your own Search Engine with several lines of code. We also provide a Playground with JupyterNotebook. Try πŸš€ RocketQA straight away in your browser!

Running with JINA

JINA is a cloud-native neural search framework to build SOTA and scalable deep learning search applications in minutes. Here is a simple example to build a Search Engine based on JINA and RocketQA.

cd examples/jina_example
pip3 install -r requirements.txt

# Generate vector representations and build a libray for your Documents
# JINA will automaticlly start a web service for you
python3 app.py index toy_data/test.tsv

# Try some questions related to the indexed Documents
python3 app.py query_cli

Please view JINA example to know more.

Running with FAISS

We also provide a simple example built on Faiss.

cd examples/faiss_example/
pip3 install -r requirements.txt

# Generate vector representations and build a libray for your Documents
python3 index.py zh ../data/dureader.para test_index

# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py zh ../data/dureader.para test_index

# Try some questions related to the indexed Documents
python3 query.py

API

You can also easily integrate πŸš€ RocketQA into your own task. We provide two types of models, ERNIE-based dual encoder for answer retrieval and ERNIE-based cross encoder for answer re-ranking. For running our models, you can use the following functions.

Load model

rocketqa.available_models()

Returns the names of the available RocketQA models. To know more about the available models, please see the code comment.

rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)

Returns the model specified by the input parameter. It can initialize both dual encoder and cross encoder. By setting input parameter, you can load either RocketQA models returned by "available_models()" or your own checkpoints.

Dual encoder

Dual-encoder returned by "load_model()" supports the following functions:

model.encode_query(query: List[str])

Given a list of queries, returns their representation vectors encoded by model.

model.encode_para(para: List[str], title: List[str])

Given a list of paragraphs and their corresponding titles (optional), returns their representations vectors encoded by model.

model.matching(query: List[str], para: List[str], title: List[str])

Given a list of queries and paragraphs (and titles), returns their matching scores (dot product between two representation vectors).

model.train(train_set: str, epoch: int, save_model_path: str, args)

Given the hyperparameters train_set, epoch and save_model_path, you can train your own dual encoder model or finetune our models. Other settings like save_steps and learning_rate can also be set in args. Please refer to examples/example.py for detail.

Cross encoder

Cross-encoder returned by "load_model()" supports the following function:

model.matching(query: List[str], para: List[str], title: List[str])

Given a list of queries and paragraphs (and titles), returns their matching scores (probability that the paragraph is the query's right answer).

model.train(train_set: str, epoch: int, save_model_path: str, args)

Given the hyperparameters train_set, epoch and save_model_path, you can train your own cross encoder model or finetune our models. Other settings like save_steps and learning_rate can also be set in args. Please refer to examples/example.py for detail.

Examples

Following the examples below, you can retrieve the vector representations of your documents and connect πŸš€ RocketQA to your own tasks.

Run RocketQA Model

To run RocketQA models, you should set the parameter model in 'load_model()' with RocketQA model name returned by 'available_models()'.

import rocketqa

query_list = ["trigeminal definition"]
para_list = [
    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]

# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)

# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute dot product of query representation and para representation
dot_products = dual_encoder.matching(query=query_list, para=para_list)

Train Your Own Model

To train your own models, you can use train() function with your dataset and parameters. Training data contains 4 columns: query, title, para, label (0 or 1), separated by "\t". For detail about parameters and dataset, please refer to './examples/example.py'

import rocketqa

# init cross encoder, and set device and batch_size
cross_encoder = rocketqa.load_model(model="zh_dureader_ce", use_cuda=True, device_id=0, batch_size=32)

# finetune cross encoder based on "zh_dureader_ce_v2"
cross_encoder.train('./examples/data/cross.train.tsv', 2, 'ce_models', save_steps=1000, learning_rate=1e-5, log_folder='log_ce')

Run Your Own Model

To run your own models, you should set parameter model in 'load_model()' with a JSON config file.

import rocketqa

# init cross encoder
cross_encoder = rocketqa.load_model(model="./examples/ce_models/config.json", use_cuda=True, device_id=0, batch_size=16)

# compute relevance of query and para
relevance = cross_encoder.matching(query=query_list, para=para_list)

config is a JSON file like this

{
    "model_type": "cross_encoder",
    "max_seq_len": 384,
    "model_conf_path": "zh_config.json",
    "model_vocab_path": "zh_vocab.txt",
    "model_checkpoint_path": ${YOUR_MODEL},
    "for_cn": true,
    "share_parameter": 0
}

Folder examples provides more details.

Citations

If you find RocketQA v1 models helpful, feel free to cite our publication RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

@inproceedings{rocketqa_v1,
    title="RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering",
    author="Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang",
    year="2021",
    booktitle = "In Proceedings of NAACL"
}

If you find PAIR models helpful, feel free to cite our publication PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

@inproceedings{rocketqa_pair,
    title="PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval",
    author="Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of ACL Findings"
}

If you find RocketQA v2 models helpful, feel free to cite our publication RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

@inproceedings{rocketqa_v2,
    title="RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking",
    author="Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of EMNLP"
}

If you find DuReaderretrieval dataset helpful, feel free to cite our publication DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

@inproceedings{DuReader_retrieval,
    title="DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine",
    author="Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu and Haifeng Wang",
    year="2022"
}

License

This repository is provided under the Apache-2.0 license.

Contact Information

For help or issues using RocketQA, please submit a Github issue.

For other communication or cooperation, please contact Jing Liu ([email protected]) or scan the following QR Code.

A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural languag

Benjamin Heinzerling 1.1k Jan 03, 2023
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Question and answer retrieval in Turkish with BERT

trfaq Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! πŸŽ‰ What is this? At this repo, I'm

M. Yusuf SarΔ±gΓΆz 13 Oct 10, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Material for GW4SHM workshop, 16/03/2022.

GW4SHM Workshop Wednesday, 16th March 2022 (13:00 – 15:15 GMT): Presented by: Dr. Rhodri Nelson, Imperial College London Project website: https://www.

Devito Codes 1 Mar 16, 2022
The Sudachi synonym dictionary in Solar format.

solr-sudachi-synonyms The Sudachi synonym dictionary in Solar format. Summary Run a script that checks for updates to the Sudachi dictionary every hou

Karibash 3 Aug 19, 2022
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 04, 2023
βš–οΈ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
Exploring dimension-reduced embeddings

sleepwalk Exploring dimension-reduced embeddings This is the code repository. See here for the Sleepwalk web page. License and disclaimer This program

S. Anders's research group at ZMBH 91 Nov 29, 2022
Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

David Drysdale 3.1k Dec 29, 2022
Contains the code and data for our #ICSE2022 paper titled as "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences"

CodeFill This repository contains the code for our paper titled as "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Namin

Software Analytics Lab 11 Oct 31, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT Β· Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022
Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

Machel Reid 82 Dec 19, 2022
A telegram bot to translate 100+ Languages

πŸ”₯ GOOGLE TRANSLATER πŸ”₯ The owner would not be responsible for any kind of bans due to the bot. β€’ ⚑ INSTALLING ⚑ β€’ β€’ πŸ”° Deploy To Railway πŸ”° β€’ β€’ βœ… OFF

AΙ΄α΄‹Ιͺα΄› Kα΄œα΄α΄€Κ€ 5 Dec 20, 2021
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

texttron 193 Jan 04, 2023
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las TecnologΓ­as del Lenguaje" (Plan-TL).

Spanish Language Models πŸ’ƒπŸ» Corpora πŸ“ƒ Corpora Number of documents Size (GB) BNE 201,080,084 570GB Models πŸ€– RoBERTa-base BNE: https://huggingface.co

PlanTL-SANIDAD 203 Dec 20, 2022
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022