The Easy-to-use Dialogue Response Selection Toolkit for Researchers

Last update: Nov 13, 2022

Related tags

Text Data & NLP SimpleReDial-v1

Overview

An easy-to-use toolkit for retrieval-based chatbots.

Recent Activity

Our released RRS corpus can be found here.
Our released BERT-FP post-training checkpoint for the RRS corpus can be found here.
Our related work (Exploring Dense Retrieval for Dialogue Response Selection) can be found here.

How to Use

Init the repo

Before using the repo, run the following commands to initialize it:

# create the necessary folders
python init.py

# prepare the environment
# if a package cannot be installed, search for another way to install it
pip install -r requirements.txt

train the model

./scripts/train.sh <dataset_name> <model_name> <cuda_ids>

test the model [rerank]

./scripts/test_rerank.sh <dataset_name> <model_name> <cuda_id>

test the model [recall]

# different recall_modes are available: q-q, q-r
./scripts/test_recall.sh <dataset_name> <model_name> <cuda_id>
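
The two recall modes differ in what the query is matched against. The sketch below is only an illustration and not the repo's implementation: the toy corpus, the `overlap` word-overlap score (standing in for the model's similarity), and the `recall` helper are all hypothetical names introduced here. It assumes q-q compares the query with stored contexts and returns their paired responses, while q-r compares the query with the responses directly.

```python
# Toy (context, response) pairs; hypothetical data, not taken from any released corpus.
corpus = [
    ("how do i install faiss", "try pip install faiss-gpu"),
    ("what is the weather today", "it is sunny outside"),
    ("how do i install pytorch", "use the selector on the pytorch site"),
]


def overlap(a, b):
    """Jaccard word overlap, a stand-in for a learned similarity score."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def recall(query, mode, topk=1):
    """Return the top-k responses for `query` under the given recall mode."""
    if mode == "q-q":
        # match the query against stored contexts, return their paired responses
        key = lambda pair: overlap(query, pair[0])
    elif mode == "q-r":
        # match the query directly against the stored responses
        key = lambda pair: overlap(query, pair[1])
    else:
        raise ValueError(f"unknown recall mode: {mode}")
    ranked = sorted(corpus, key=key, reverse=True)
    return [response for _, response in ranked[:topk]]


print(recall("how can i install faiss", "q-q"))
print(recall("how can i install faiss", "q-r"))
```

In the toolkit the similarity comes from the trained model and a faiss or Elasticsearch index rather than word overlap; only the choice of what gets indexed changes between the two modes.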

run inference on the responses and save them into the faiss index

Sometimes inference will miss data samples; if that happens, use a single GPU (faiss-gpu search is fast on 1 GPU).

It should be noted that:
1. For the writer dataset, use the extract_inference.py script to generate inference.txt.
2. For the other datasets (douban, ecommerce, ubuntu), just copy train.txt to inference.txt (cp train.txt inference.txt). The dataloader will automatically read test.txt to supplement the corpus.
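
For the non-writer datasets, the copy step can be scripted. This is a minimal sketch, assuming train.txt and inference.txt live in the same dataset folder; the `prepare_inference_file` helper and the placeholder line written in the demo are hypothetical and do not reflect the repo's actual file layout or data format.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_inference_file(data_dir):
    """Copy train.txt to inference.txt inside data_dir and return the new path."""
    data_dir = Path(data_dir)
    src = data_dir / "train.txt"
    dst = data_dir / "inference.txt"
    if not src.is_file():
        raise FileNotFoundError(f"missing {src}")
    shutil.copyfile(src, dst)
    return dst


# Demo in a throwaway directory so no real dataset file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    # Placeholder content only; the real train.txt format is defined by the repo.
    (Path(tmp) / "train.txt").write_text("placeholder line\n", encoding="utf-8")
    out = prepare_inference_file(tmp)
    print(out.name, out.stat().st_size)
```

Checking for train.txt first gives a clearer error than a failed copy when the script is pointed at the wrong folder.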

# work_mode=response: run inference on the responses and save them into faiss (for q-r matching) [dual-bert/dual-bert-fusion]
# work_mode=context: run inference on the contexts for q-q matching
# work_mode=gray: run inference on the contexts, read the faiss index (work_mode=response must already have been run), and search the top-k hard negative samples; remember to set BERTDualInferenceContextDataloader in config/base.yaml
./scripts/inference.sh <dataset_name> <model_name> <cuda_ids>

To generate the gray (hard negative) dataset for a dataset:

# 1. set the mode to **response** to generate the response faiss index; corresponding dataset name: BERTDualInferenceDataset
./scripts/inference.sh <dataset_name> response <cuda_ids>

# 2. set the mode to **gray** to run inference on the contexts in train.txt and search the top-k candidates as the gray (hard negative) samples; corresponding dataset name: BERTDualInferenceContextDataset
./scripts/inference.sh <dataset_name> gray <cuda_ids>

# 3. set the mode to **gray-one2many** if you want to generate extra positive samples for each context in the train set; the requirements of this mode are the same as for the **gray** work mode
./scripts/inference.sh <dataset_name> gray-one2many <cuda_ids>
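
The idea behind the gray work mode (encode the contexts, search the response index, keep the top-k hits as hard negatives) can be sketched without faiss. The example below is an illustration under stated assumptions, not the repo's code: it uses brute-force cosine similarity over toy 2-d vectors in place of a faiss index, and it assumes the ground-truth response is excluded from the mined negatives. `cosine`, `mine_gray`, and the `gold` list are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_gray(context_vecs, response_vecs, gold, topk=2):
    """For each context i, return the indices of the top-k most similar
    responses, skipping the ground-truth response gold[i]."""
    gray = []
    for i, ctx in enumerate(context_vecs):
        ranked = sorted(
            range(len(response_vecs)),
            key=lambda j: cosine(ctx, response_vecs[j]),
            reverse=True,
        )
        gray.append([j for j in ranked if j != gold[i]][:topk])
    return gray


# Toy embeddings; in practice these would come from the trained dual encoder.
contexts = [[1.0, 0.0], [0.0, 1.0]]
responses = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [-1.0, 0.0]]
gold = [0, 2]  # index of the ground-truth response for each context

print(mine_gray(contexts, responses, gold))
```

Responses that score high but are not the ground truth are the ones a model is most likely to confuse with the correct answer, which is why they are useful as negatives during training.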

If you want to generate the pseudo positive pairs, run the following commands:

# make sure the dual-bert inference dataset name is BERTDualInferenceDataset
./scripts/inference.sh <dataset_name> unparallel <cuda_ids>

deploy the rerank and recall model

# load the model on cuda:0 (can be changed in the deploy.sh script)
./scripts/deploy.sh <cuda_id>

While the model is deployed, you can test it with:

# test_mode: recall, rerank, pipeline
./scripts/test_api.sh <test_mode> <dataset>
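
The pipeline test mode chains the two stages: recall narrows the candidate pool, then rerank orders what is left. The following sketch shows that control flow only. It does not call the deployed service, and every name in it (`recall_stage`, `rerank_stage`, `pipeline`, `toy_scorer`, the candidate pool) is hypothetical; the scorers are simple word-overlap stand-ins for the trained recall and rerank models.

```python
def recall_stage(query, candidates, topk):
    """Cheap first pass: keep the top-k candidates by shared-word count."""
    q = set(query.split())
    ranked = sorted(candidates, key=lambda c: len(q & set(c.split())), reverse=True)
    return ranked[:topk]


def rerank_stage(query, candidates, scorer):
    """Second pass: reorder the recalled candidates with a finer scorer."""
    return sorted(candidates, key=lambda c: scorer(query, c), reverse=True)


def pipeline(query, candidates, scorer, topk=2):
    """Recall first, then rerank only the recalled candidates."""
    return rerank_stage(query, recall_stage(query, candidates, topk), scorer)


def toy_scorer(query, candidate):
    """Stand-in for a rerank model: overlap normalized by candidate length."""
    q, c = set(query.split()), set(candidate.split())
    return len(q & c) / len(c)


pool = [
    "you can install faiss with conda",
    "install it",
    "the weather is nice",
]
print(pipeline("how to install faiss", pool, toy_scorer))
```

Running the expensive scorer only on the recalled subset is the usual reason for splitting retrieval into two stages.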

test the recall performance of Elasticsearch

Before testing the Elasticsearch (es) recall, make sure the es index has been built:

# recall_mode: q-q/q-r
./scripts/build_es_index.sh <dataset_name> <recall_mode>

# recall_mode: q-q/q-r
./scripts/test_es_recall.sh <dataset_name> <recall_mode> 0

generate the gray responses with simcse

# train the simcse model
./scripts/train.sh <dataset_name> simcse <cuda_ids>

# generate the faiss index, dataset name: BERTSimCSEInferenceDataset
./scripts/inference_response.sh <dataset_name> simcse <cuda_ids>

# generate the context index
./scripts/inference_simcse_response.sh <dataset_name> simcse <cuda_ids>
# generate the test set for unlikelyhood-gen dataset
./scripts/inference_simcse_unlikelyhood_response.sh <dataset_name> simcse <cuda_ids>

# generate the gray response
./scripts/inference_gray_simcse.sh <dataset_name> simcse <cuda_ids>
# generate the test set for unlikelyhood-gen dataset
./scripts/inference_gray_simcse_unlikelyhood.sh <dataset_name> simcse <cuda_ids>

Owner

GMFTBY