Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

Last update: Mar 09, 2022

Related tags

Overview

nli2paraphrases

Source code repository accompanying the preprint Extracting and filtering paraphrases by bridging natural language inference and paraphrasing. The idea presented in the paper is to re-use NLI datasets for paraphrasing, by finding paraphrases through bidirectional entailment.

Setup

# Make sure to run this from the root of the project (top-level directory)
$ pip3 install -r requirements.txt
$ python3 setup.py install

Project Organization

├── README.md          
├── experiments        <- Experiment scripts, through which training and extraction is done
├── models             <- Intended for storing fine-tuned models and configs
├── requirements.txt   
├── setup.py           
├── src                <- Core source code for this project
│   ├── __init__.py    
│   ├── data           <- data loading scripts
│   ├── models         <- general scripts for training/using a NLI model
│   └── visualization  <- visualization scripts for obtaining a nicer view of extracted paraphrases

Getting started

As an example, let us extract paraphrases from SNLI.

The training and extraction process largely follows the same track for other datasets (with some new or removed flags, run scripts with --help flag to see the specifics).

In the example, we first fine-tune a roberta-base NLI model on SNLI sequences (s1, s2).
Then, we use the fine-tuned model to predict the reverse relation for entailment examples, and select only those examples for which entailment holds in both directions. The extracted paraphrases are stored into extract-argmax.

This example assumes that you have access to a GPU. If not, you can force the scripts to use CPU by setting --use_cpu, although the whole process will be much slower.

# Assuming the current position is in the root directory of the project
$ cd experiments/SNLI_NLI

# Training takes ~1hr30mins on Colab GPU (K80)
$ python3 train_model.py \
--experiment_dir="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--pretrained_name_or_path="roberta-base" \
--model_type="roberta" \
--num_epochs=10 \
--max_seq_len=42 \
--batch_size=256 \
--learning_rate=2e-5 \
--early_stopping_rounds=5 \
--validate_every_n_examples=5000

# Extraction takes ~15mins on Colab GPU (K80)
$ python3 extract_paraphrases.py \
--experiment_dir="extract-argmax" \
--pretrained_name_or_path="../models/SNLI_NLI/snli-roberta-base-maxlen42-2e-5" \
--model_type="roberta" \
--max_seq_len=42 \
--batch_size=1024 \
--l2r_strategy="ground_truth" \
--r2l_strategy="argmax"

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

Related tags

Overview

nli2paraphrases

Setup

Project Organization

Getting started

Owner

Matej Klemen

Framework for abstracting Amiga debuggers and access to AmigaOS libraries and devices.

This is the source code for generating the ASL-Skeleton3D and ASL-Phono datasets. Check out the README.md for more details.

[CVPR 2021] Generative Hierarchical Features from Synthesizing Images

A programming language written with python

ExCon: Explanation-driven Supervised Contrastive Learning

A task-agnostic vision-language architecture as a step towards General Purpose Vision

🔥 Real-time Super Resolution enhancement (4x) with content loss and relativistic adversarial optimization 🔥

Config files for my GitHub profile.

🛠️ Tools for Transformers compression using Lightning ⚡

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

A GPT, made only of MLPs, in Jax

This is the dataset and code release of the OpenRooms Dataset.

Pytorch implementation of paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

Product-based-recommendation-system - A product based recommendation system which uses Machine learning algorithm such as KNN and cosine similarity

A fast model to compute optical flow between two input images.

Code for "Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation" ICCV'21

Code for "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR 2021

Research code for the paper "Variational Gibbs inference for statistical estimation from incomplete data".

Normalizing Flows with a resampled base distribution

Author's PyTorch implementation of TD3+BC, a simple variant of TD3 for offline RL