Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

Overview

PEGASUS library

Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses the self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on arXiv and was accepted at ICML 2020.

If you use this code or these models, please cite the following paper:

@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Results update

We train a PEGASUS model with sampled gap sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated results are reported in this table.

dataset        C4                 HugeNews           Mixed & Stochastic
xsum           45.20/22.06/36.99  47.21/24.56/39.25  47.60/24.83/39.64
cnn_dailymail  43.90/21.20/40.76  44.17/21.47/41.11  44.16/21.56/41.30
newsroom       45.07/33.39/41.28  45.15/33.51/41.33  45.98/34.20/42.18
multi_news     46.74/17.95/24.26  47.52/18.72/24.91  47.65/18.75/24.95
gigaword       38.75/19.96/36.14  39.12/19.86/36.24  39.65/20.47/36.76
wikihow        43.07/19.70/34.79  41.35/18.51/33.42  46.39/22.12/38.41 *
reddit_tifu    26.54/8.94/21.64   26.63/9.01/21.60   27.99/9.81/22.94
big_patent     53.63/33.16/42.25  53.41/32.89/42.07  52.29/33.08/41.66 *
arxiv          44.70/17.27/25.80  44.67/17.18/25.73  44.21/16.95/25.67
pubmed         45.49/19.90/27.69  45.09/19.56/27.42  45.97/20.15/28.25
aeslc          37.69/21.85/36.84  37.40/21.22/36.45  37.68/21.25/36.51
billsum        57.20/39.56/45.80  57.31/40.19/45.82  59.67/41.58/47.59

The "Mixed & Stochastic" model has the following changes:

  • it is trained on both C4 and HugeNews (the dataset mixture is weighted by the number of examples in each).
  • it is trained for 1.5M steps instead of 500k (we observe slower convergence on pretraining perplexity).
  • the model uniformly samples a gap sentence ratio between 15% and 45%.
  • important sentences are sampled by applying 20% uniform noise to the importance scores.
  • the sentencepiece tokenizer is updated so that it can encode the newline character.
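
The two stochastic steps in the list above (sampling a gap sentence ratio and perturbing importance scores) can be sketched roughly as follows. This is a minimal illustration and not the repository's implementation: the function name `select_gap_sentences` and its parameters are hypothetical, the importance scores are assumed to be precomputed, and the "20% uniform noise" is interpreted here as a multiplicative factor drawn uniformly from [0.8, 1.2], which is an assumption.

```python
import random


def select_gap_sentences(scores, rng, min_ratio=0.15, max_ratio=0.45, noise=0.2):
    """Return sorted indices of the sentences chosen as gap sentences.

    Hypothetical helper: `scores` holds one precomputed importance score
    per sentence, `rng` is a random.Random instance.
    """
    if not scores:
        return []
    # Draw one gap sentence ratio per document, uniformly in [min_ratio, max_ratio].
    ratio = rng.uniform(min_ratio, max_ratio)
    num_gaps = max(1, round(ratio * len(scores)))
    # Assumption: the noise is multiplicative, uniform in [1 - noise, 1 + noise].
    noisy = [s * (1.0 + rng.uniform(-noise, noise)) for s in scores]
    # Keep the top-scoring sentences after perturbation.
    ranked = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:num_gaps])


rng = random.Random(0)
scores = [0.1 * (i % 7) for i in range(20)]  # made-up importance scores
print(select_gap_sentences(scores, rng))
```

With 20 sentences, the sketch selects between 3 and 9 of them, and sentences with similar scores can swap rank from one call to the next.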

(*) the numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data:

  • the wikihow dataset contains newline characters, which are useful for paragraph segmentation. The sentencepiece tokenizer of the C4 and HugeNews models doesn't encode newlines and loses this information.
  • we updated the BigPatent dataset to preserve casing, and some format cleaning steps also changed. Please refer to the change in TFDS.

Setup

create an instance on Google Cloud with GPU (optional)

Please create a project first, then create an instance:

gcloud compute instances create \
  ${VM_NAME} \
  --zone=${ZONE} \
  --machine-type=n1-highmem-8 \
  --accelerator type=nvidia-tesla-v100,count=1 \
  --boot-disk-size=500GB \
  --image-project=ml-images \
  --image-family=tf-1-15 \
  --maintenance-policy TERMINATE --restart-on-failure

install library and dependencies

Clone the library from GitHub and install the requirements.

git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt

Download the vocab and the pretrained and fine-tuned checkpoints of all experiments from Google Cloud.

Alternatively, in a terminal, follow the instructions to install gsutil, then run:

mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/

Finetuning on downstream datasets

on existing dataset

Finetune on an existing dataset, aeslc.

python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc

If you would like to finetune on a subset of the dataset, please refer to the example of input pattern.

Evaluate on the finetuned dataset.

python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc

Note that the above example uses a single GPU, so the batch_size is much smaller than the one used for the results reported in the paper.

add new finetuning dataset

Two types of dataset format are supported: TensorFlow Datasets (TFDS) or TFRecords.

This tutorial shows how to add a new dataset in TFDS. (The fine-tuning dataset is expected to be supervised, so please provide supervised_keys in the dataset info.)

The TFRecords format requires each record to be a tf example of {"inputs":tf.string, "targets":tf.string}.
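
As a rough illustration of that record layout, the sketch below builds a tf example with the two string features and writes a list of (inputs, targets) pairs to a TFRecord file. The helper names `make_example` and `write_tfrecord` and the output file name are hypothetical and not part of the PEGASUS library; only standard TensorFlow calls are used, and the text is assumed to be stored as UTF-8 bytes.

```python
import tensorflow as tf


def make_example(inputs, targets):
    """Build one tf example with string features "inputs" and "targets"."""
    def _bytes_feature(text):
        # Assumption: strings are stored as UTF-8 encoded bytes.
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")]))

    return tf.train.Example(features=tf.train.Features(feature={
        "inputs": _bytes_feature(inputs),
        "targets": _bytes_feature(targets),
    }))


def write_tfrecord(pairs, path):
    """Write (inputs, targets) string pairs to a TFRecord file at `path`."""
    with tf.io.TFRecordWriter(path) as writer:
        for inputs, targets in pairs:
            writer.write(make_example(inputs, targets).SerializeToString())


# Usage with a made-up document/summary pair and a hypothetical file name:
# write_tfrecord([("a long document ...", "a short summary")],
#                "new_dataset_files.tfrecord-00000")
```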

For example, if you registered a TFDS dataset called new_tfds_dataset for training and evaluation, and have some files in TFRecord format called new_dataset_files.tfrecord* for testing, they can be registered in /pegasus/params/public_params.py.

@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 128,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)

Evaluation metrics.

Evaluation results can be found in model_dir. Summarization metrics are automatically calculated for each evaluation point.

  • ROUGE is the main metric for summarization quality.

  • BLEU is an alternative quality metric for language generation.

  • Extractive Fragments Coverage & Density are metrics that measure the abstractiveness of the summary.

  • Repetition Rates measure generation repetition failure modes.

  • Length statistics measure the length distribution of decodes compared to the gold summaries.
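
To make the abstractiveness metrics above more concrete, here is a minimal sketch of extractive fragment coverage and density as they are commonly defined for the Newsroom dataset: fragments are found by greedily matching the longest shared token sequences, coverage is the fraction of summary tokens that fall inside a fragment, and density is the average squared fragment length per summary token. The function names are hypothetical, whitespace tokenization is assumed, and the evaluation code in this repository may differ in detail.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token spans of `summary` found in `article`."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched fragment, or past one unmatched token.
        i += max(best, 1)
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for tokenized article and summary."""
    if not summary:
        return 0.0, 0.0
    fragments = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in fragments) / len(summary)
    density = sum(len(f) ** 2 for f in fragments) / len(summary)
    return coverage, density


article = "the cat sat on the mat".split()
summary = "the cat ran on the mat".split()
print(coverage_and_density(article, summary))
```

In this toy pair the shared fragments are "the cat" and "on the mat", so five of the six summary tokens are covered. A fully copied summary has coverage 1.0 and a density equal to its length, while a summary that shares no tokens with the article scores 0.0 on both.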

Several types of output files can be found in model_dir:

  • text_metrics-*.txt: the above metrics in text format. Each row contains the metric name, the 95% lower bound value, the mean value, and the 95% upper bound value.
  • inputs-*.txt, targets-*.txt, predictions-*.txt: raw text files of model inputs/outputs.
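
A small reader for the text_metrics rows described above might look like the following sketch. It assumes that the four fields of each row are comma-separated, which the description above does not state, so adjust the delimiter if your files differ. The function name `parse_text_metrics`, the metric names, and the values in the usage example are made up for illustration.

```python
def parse_text_metrics(lines, delimiter=","):
    """Parse rows of: metric name, 95% lower bound, mean, 95% upper bound.

    Assumption: fields are separated by `delimiter` (a comma by default).
    """
    metrics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so the metric name may itself contain the delimiter.
        name, low, mean, high = line.rsplit(delimiter, 3)
        metrics[name.strip()] = {
            "low": float(low),
            "mean": float(mean),
            "high": float(high),
        }
    return metrics


# Made-up rows for illustration only; real files are read with open(path).
rows = ["metric_a,0.40,0.42,0.44", "", "metric_b,0.18,0.20,0.22"]
for name, stats in parse_text_metrics(rows).items():
    print(name, stats["mean"])
```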

Pre-training

Pretraining (on C4 or any other corpus) requires a custom-built TensorFlow that includes ops for on-the-fly parsing, which process raw text documents into model input and target ids. Please refer to pegasus/ops/pretrain_parsing_ops.cc and pegasus/data/parsers.py for details.

Acknowledgements

Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich.

Owner
Google Research