Does Pretraining for Summarization Reuqire Knowledge Transfer?

Overview

Does Pretraining for Summarization Reuqire Knowledge Transfer?

This repository is the official implementation of the work in the paper Does Pretraining for Summarization Reuqire Knowledge Transfer? to appear in Findings of EMNLP 2021.
You can find the paper on arXiv here: https://arxiv.org/abs/2109.04953

Requirements

This code requires Python 3 (tested using version 3.6)

To install requirements, run:

pip install -r requirements.txt

Preparing finetuning datasets

To prepare a summarization dataset for finetuning, run the corresponding script in the finetuning_datasetgen folder. For example, to prepare the cnn-dailymail dataset run:

cd finetuning_datasetgen
python cnndm.py

Running finetuning experiment

We show here how to run training, prediction and evaluation steps for a finetuning experiment. We assume that you have downloaded the pretrained models in the pretrained_models folder from the provided Google Drive link (see pretrained_models/README.md) If you want to pretrain models yourself, see latter part of this readme for the instructions.

All models in our work are trained using allennlp config files which are in .jsonnet format. To run a finetuning experiment, simply run

# for t5-like models
./pipeline_t5.sh 
   
    

# for pointer-generator models
./pipeline_pg.sh 
    

    
   

For example, for finetuning a T5 model on cnndailymail dataset, starting from a model pretrained with ourtasks-nonsense pretraining dataset, run

./pipeline_t5.sh finetuning_experiments/cnndm/t5-ourtasks-nonsense

Similarly, for finetuning a randomly-initialized pointer-generator model, run

./pipeline_pg.sh finetuning_experiments/cnndm/pg-randominit

The trained model and output files would be available in the folder that would be created by the script.

model.tar.gz contains the trained (finetuned) model

test_outputs.jsonl contains the outputs of the model on the test split.

test_genmetrics.json contains the ROUGE scores of the output

Creating pretraining datasets

We have provided the nonsense pretraining datasets used in our work via Google Drive (see dataset_root/pretraining_datasets/README.md for instructions)

However, if you want to generate your own pretraining corpus, you can run

cd pretraining_datasetgen
# for generating dataset using pretraining tasks
python ourtasks.py
# for generating dataset using STEP pretraining tasks
python steptasks.py

These commands would create pretraining datasets using nonsense. If you want to create datasets starting from wikipedia documents please look into the two scripts which guide you how to do that by commenting/uncommenting two blocks of code.

Pretraining models

Although we provide you the pretrained model checkpoints via GoogleDrive, if you want to pretrain your own models, you can do that by using the corresponding pretraining config file. As an example, we have provided a config file which pretrains on ourtasks-nonsense dataset. Make sure that the pretraining dataset files exist (either created by you or downloaded from GoogleDrive) before running the pretraining command. The pretraining is also done using the same shell scripts used for the finetuning experiments. For example, to pretrain a model on the ourtasks-nonsense dataset, simply run :

./pipeline_t5.sh pretraining_experiments/pretraining_t5_ourtasks_nonsense
Owner
Approximately Correct Machine Intelligence (ACMI) Lab
Research on machine learning, its social impacts, and applications to healthcare. PI—@zackchase
Approximately Correct Machine Intelligence (ACMI) Lab
A PyTorch implementation of "Signed Graph Convolutional Network" (ICDM 2018).

SGCN ⠀ A PyTorch implementation of Signed Graph Convolutional Network (ICDM 2018). Abstract Due to the fact much of today's data can be represented as

Benedek Rozemberczki 251 Nov 30, 2022
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion Yinghao Aaron Li, Ali Zare, Nima Mesgarani We pres

Aaron (Yinghao) Li 282 Jan 01, 2023
AI-based, context-driven network device ranking

Batea A batea is a large shallow pan of wood or iron traditionally used by gold prospectors for washing sand and gravel to recover gold nuggets. Batea

Secureworks Taegis VDR 269 Nov 26, 2022
OpenGAN: Open-Set Recognition via Open Data Generation

OpenGAN: Open-Set Recognition via Open Data Generation ICCV 2021 (oral) Real-world machine learning systems need to analyze novel testing data that di

Shu Kong 90 Jan 06, 2023
A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection

Confluence: A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection 1. 介绍 用以替代 NMS,在所有 bbox 中挑选出最优的集合。 NMS 仅考虑了 bbox 的得分,然后根据 IOU 来

44 Sep 15, 2022
[Official] Exploring Temporal Coherence for More General Video Face Forgery Detection(ICCV 2021)

Exploring Temporal Coherence for More General Video Face Forgery Detection(FTCN) Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen Accepted b

57 Dec 28, 2022
CARL provides highly configurable contextual extensions to several well-known RL environments.

CARL (context adaptive RL) provides highly configurable contextual extensions to several well-known RL environments.

AutoML-Freiburg-Hannover 51 Dec 28, 2022
rliable is an open-source Python library for reliable evaluation, even with a handful of runs, on reinforcement learning and machine learnings benchmarks.

Open-source library for reliable evaluation on reinforcement learning and machine learning benchmarks. See NeurIPS 2021 oral for details.

Google Research 529 Jan 01, 2023
Tools to create pixel-wise object masks, bounding box labels (2D and 3D) and 3D object model (PLY triangle mesh) for object sequences filmed with an RGB-D camera.

Tools to create pixel-wise object masks, bounding box labels (2D and 3D) and 3D object model (PLY triangle mesh) for object sequences filmed with an RGB-D camera. This project prepares training and t

305 Dec 16, 2022
Pytorch tutorials for Neural Style transfert

PyTorch Tutorials This tutorial is no longer maintained. Please use the official version: https://pytorch.org/tutorials/advanced/neural_style_tutorial

Alexis David Jacq 135 Jun 26, 2022
Face recognition system using MTCNN, FACENET, SVM and FAST API to track participants of Big Brother Brasil in real time.

BBB Face Recognizer Face recognition system using MTCNN, FACENET, SVM and FAST API to track participants of Big Brother Brasil in real time. Instalati

Rafael Azevedo 232 Dec 24, 2022
Yolo Traffic Light Detection With Python

Yolo-Traffic-Light-Detection This project is based on detecting the Traffic light. Pretained data is used. This application entertained both real time

Ananta Raj Pant 2 Aug 08, 2022
GANsformer: Generative Adversarial Transformers Drew A

GANformer: Generative Adversarial Transformers Drew A. Hudson* & C. Lawrence Zitnick Update: We released the new GANformer2 paper! *I wish to thank Ch

Drew Arad Hudson 1.2k Jan 02, 2023
The world's largest toxicity dataset.

The Toxicity Dataset by Surge AI Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's wh

Surge AI 134 Dec 19, 2022
AFLFast (extends AFL with Power Schedules)

AFLFast Power schedules implemented by Marcel Böhme [email protected]

Marcel Böhme 380 Jan 03, 2023
Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers 1 Using Colab Please notic

Hila Chefer 489 Jan 07, 2023
Collection of generative models in Tensorflow

tensorflow-generative-model-collections Tensorflow implementation of various GANs and VAEs. Related Repositories Pytorch version Pytorch version of th

3.8k Dec 30, 2022
某学校选课系统GIF验证码数据集 + Baseline模型 + 上下游相关工具

elective-dataset-2021spring 某学校2021春季选课系统GIF验证码数据集(29338张) + 准确率98.4%的Baseline模型 + 上下游相关工具。 数据集采用 知识共享署名-非商业性使用 4.0 国际许可协议 进行许可。 Baseline模型和上下游相关工具采用

xmcp 27 Sep 17, 2021
Learned Token Pruning for Transformers

LTP: Learned Token Pruning for Transformers Check our paper for more details. Installation We follow the same installation procedure as the original H

Sehoon Kim 52 Dec 29, 2022
Meli Data Challenge 2021 - First Place Solution

My solution for the Meli Data Challenge 2021

Matias Moreyra 23 Mar 09, 2022