Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Last update: Nov 01, 2022

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

This repo is for our paper "Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization". Our program is building on top of the Huggingface transformers framework. You can refer to their repo at: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

Local Setup

Tested with Python 3.7 via virtual environment. Clone the repo, go to the repo folder, setup the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Install `apex`

Based on the recommendation from HuggingFace, both finetuning and eval are 30% faster with --fp16. For that you need to install apex.

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data

Create a directory for data used in this work named data:

$ mkdir data

CNN/DM

$ wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
$ tar -xzvf cnn_dm_v2.tgz
$ mv cnn_cln data/cnndm

XSUM

$ wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
$ tar -xzvf xsum.tar.gz
$ mv xsum data/xsum

Generate Augmented Dataset

$ python generate_augmentation.py \
    --dataset xsum \
    --n 5 \
    --augmentation1 randomdelete \
    --augmentation2 randomswap

Training

CNN/DM

Our model is warmed up using sshleifer/distilbart-cnn-12-6:

$ DATA_DIR=./data/cnndm-augmented/RandominsertionRandominsertion-NumSent-3
$ OUTPUT_DIR=./log/cnndm

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-cnn-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder_layer 6 \
  --prediction_loss_only \
  --fp16

XSUM

$ DATA_DIR=./data/xsum-augmented/RandomdeleteRandomswap-NumSent-3
$ OUTPUT_DIR=./log/xsum

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-xsum-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder \
  --prediction_loss_only \
  --fp16

Evaluation

We have released the following checkpoints for pre-trained models as described in the paper:

CNN/DM:
XSUM:

CNN/DM

CNN/DM requires an extra postprocessing step.

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name sshleifer/distilbart-cnn-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
    
$ python postprocess_cnndm.py \
    --src_file $OUTPUT_DIR/test_generations.txt \
    --tgt_file $DATA_DIR/test.target

XSUM

$ export DATA=xsum
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=3  run_distributed_eval.py \
    --model_name sshleifer/distilbart-xsum-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR

Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

Local Setup

Install `apex`

Data

CNN/DM

XSUM

Generate Augmented Dataset

Training

CNN/DM

XSUM

Evaluation

CNN/DM

XSUM

Owner

Rachel Zheng

Mednlp - Medical natural language parsing and utility library

News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Python implementation of TextRank for phrase extraction and summarization of text documents

Write Alphabet, Words and Sentences with your eyes.

Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

结巴中文分词

CoSENT 比Sentence-BERT更有效的句向量方案

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

DVC-NLP-Simple-usecase

An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks

Korean Sentence Embedding Repository

NLPShala , the best IDE for all Natural language processing tasks.

EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

A python script that will use hydra to get user and password to login to ssh, ftp, and telnet

Global Rhythm Style Transfer Without Text Transcriptions

Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

Local Setup

Install apex

Data

CNN/DM

XSUM

Generate Augmented Dataset

Training

CNN/DM

XSUM

Evaluation

CNN/DM

XSUM

Owner

Rachel Zheng

Mednlp - Medical natural language parsing and utility library

News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

EdiTTS: Score-based Editing for Controllable Text-to-Speech

Python implementation of TextRank for phrase extraction and summarization of text documents

Write Alphabet, Words and Sentences with your eyes.

Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

结巴中文分词

CoSENT 比Sentence-BERT更有效的句向量方案

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

DVC-NLP-Simple-usecase

An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks

Korean Sentence Embedding Repository

NLPShala , the best IDE for all Natural language processing tasks.

EasyTransfer is designed to make the development of transfer learning in NLP applications easier.

Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

A python script that will use hydra to get user and password to login to ssh, ftp, and telnet

Global Rhythm Style Transfer Without Text Transcriptions

Install `apex`