Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Related tags

Text Data & NLPesacl
Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

This repo is for our paper "Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization". Our program is building on top of the Huggingface transformers framework. You can refer to their repo at: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

Local Setup

Tested with Python 3.7 via virtual environment. Clone the repo, go to the repo folder, setup the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Install apex

Based on the recommendation from HuggingFace, both finetuning and eval are 30% faster with --fp16. For that you need to install apex.

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data

Create a directory for data used in this work named data:

$ mkdir data

CNN/DM

$ wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
$ tar -xzvf cnn_dm_v2.tgz
$ mv cnn_cln data/cnndm

XSUM

$ wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
$ tar -xzvf xsum.tar.gz
$ mv xsum data/xsum

Generate Augmented Dataset

$ python generate_augmentation.py \
    --dataset xsum \
    --n 5 \
    --augmentation1 randomdelete \
    --augmentation2 randomswap

Training

CNN/DM

Our model is warmed up using sshleifer/distilbart-cnn-12-6:

$ DATA_DIR=./data/cnndm-augmented/RandominsertionRandominsertion-NumSent-3
$ OUTPUT_DIR=./log/cnndm

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-cnn-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder_layer 6 \
  --prediction_loss_only \
  --fp16

XSUM

$ DATA_DIR=./data/xsum-augmented/RandomdeleteRandomswap-NumSent-3
$ OUTPUT_DIR=./log/xsum

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-xsum-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder \
  --prediction_loss_only \
  --fp16

Evaluation

We have released the following checkpoints for pre-trained models as described in the paper:

CNN/DM

CNN/DM requires an extra postprocessing step.

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name sshleifer/distilbart-cnn-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
    
$ python postprocess_cnndm.py \
    --src_file $OUTPUT_DIR/test_generations.txt \
    --tgt_file $DATA_DIR/test.target

XSUM

$ export DATA=xsum
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=3  run_distributed_eval.py \
    --model_name sshleifer/distilbart-xsum-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
Owner
Rachel Zheng
Rachel Zheng
Rachel Zheng
NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор СоАвтор – платформа и открытый набор инструментов для редакций и журналистов-фрилансеров, который призван сделать процесс создания контента ма

38 Jan 04, 2023
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)

Language Models are Few-shot Multilingual Learners Paper This is the source code of the paper [Arxiv] [ACL Anthology]: This code has been written usin

Genta Indra Winata 45 Nov 21, 2022
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.

Good news! Our new work exhibits state-of-the-art performances on DocUNet benchmark dataset: DocScanner: Robust Document Image Rectification with Prog

Hao Feng 231 Dec 26, 2022
Problem: Given a nepali news find the category of the news

Classification of category of nepali news catorgory using different algorithms Problem: Multiclass Classification Approaches: TFIDF for vectorization

pudasainishushant 2 Jan 09, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Dec 26, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

Rhasspy 8 Dec 25, 2022
Code of paper: A Recurrent Vision-and-Language BERT for Navigation

Recurrent VLN-BERT Code of the Recurrent-VLN-BERT paper: A Recurrent Vision-and-Language BERT for Navigation Yicong Hong, Qi Wu, Yuankai Qi, Cristian

YicongHong 109 Dec 21, 2022
Code for the paper "Are Sixteen Heads Really Better than One?"

Are Sixteen Heads Really Better than One? This repository contains code to reproduce the experiments in our paper Are Sixteen Heads Really Better than

Paul Michel 143 Dec 14, 2022
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 02, 2022