On Generating Extended Summaries of Long Documents

Overview

ExtendedSumm

This repository contains the implementation details and datasets used in the paper On Generating Extended Summaries of Long Documents, presented at the AAAI-21 Workshop on Scientific Document Understanding (SDU 2021).

Conda environment: preliminary setup

To install the required packages, create the conda environment from the environment.yml file in the root directory using the following command:

conda env create -f environment.yml
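
Once the environment is created, activate it; replace <env_name> with the name field defined at the top of environment.yml (the name is set in that file, not here):

conda activate <env_name>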

How to run...

IMPORTANT: The following commands should be run from the src/ directory.

Dataset

To start, you first need to download the datasets intended for use with the code base. You can download them from the following links:

  • arXiv-Long: Download
  • PubMed-Long: Download

After downloading the dataset, you will need to uncompress it using the following command:

tar -xvf pubmedL.tar.gz 

This will uncompress the pubmedL tar file into the current directory. The resulting directory contains the JSON files for the different sets: training, validation, and test.
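
If you want to check the archive contents before (or after) extraction, you can list them without unpacking:

tar -tf pubmedL.tar.gz | head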

FORMAT: Each paper file is structured as a JSON object with the following keys (a minimal loading sketch follows this list):

  • "id" (String): the paper ID
  • "abstract" (String): the abstract text of the paper. This field is different from "gold" field for the datasets that have different ground-truth than the abstract.
  • "gold" (List >): the ground-truth summary of the paper, where the inner list is the tokens associated with each gold summary sentence.
  • "sentences" (List >): the source sentences of the full-text. The inner list contains 5 indices, each of which represents different fields of the source sentence:
    • Index [0]: tokens of the sentence (i.e., a list of tokens).
    • Index [1]: textual representation of the section that the sentence belongs to.
    • Index [2]: ROUGE-L score of the sentence against the gold summary.
    • Index [3]: textual representation of the sentence.
    • Index [4]: oracle label associated with the sentence (0 or 1).
    • Index [5]: the section ID assigned by the sequential sentence classification package. For more information, please refer to this repository.
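
As a minimal illustration of this format, the sketch below loads one paper file and prints a few of the fields listed above. The file path is a placeholder (point it at one of the extracted JSON files), and only indices [0]–[4] are accessed, matching the field list:

import json

# Placeholder path: point this at one of the extracted paper JSON files.
with open("pubmedL/val/some_paper.json") as f:
    paper = json.load(f)

print("id:", paper["id"])
print("abstract (first 100 chars):", paper["abstract"][:100])
print("number of gold sentences:", len(paper["gold"]))

# Inspect the first few source sentences.
for sent in paper["sentences"][:3]:
    tokens  = sent[0]  # list of tokens
    section = sent[1]  # section the sentence belongs to
    rouge_l = sent[2]  # ROUGE-L score against the gold summary
    text    = sent[3]  # raw sentence text
    oracle  = sent[4]  # oracle label (0 or 1)
    print(f"[{section}] oracle={oracle} rouge_l={rouge_l} tokens={len(tokens)} :: {text[:80]}")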

Preparing Data

Simply run the prep.sh bash script, providing the dataset directory. This script uses two functions: it first creates aggregated JSON files and then prepares them for use with pretrained language models.
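
For example, assuming the dataset directory is passed as the first argument (if prep.sh instead reads the path from a variable defined at the top of the script, edit that variable rather than passing an argument), the call would look like:

bash prep.sh /path/to/pubmedL/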

Please note that if you want to use a custom dataset and create the torch files, you will need to convert it to the format described in the Dataset section.

Training

The full training scripts are inside the train.sh bash file. To run it on your machine, you will need to change the directories to fit your needs:

...

DATA_PATH=/path/to/dataset/torch-files/
MODEL_PATH=/path/to/saved/model/

# Specify GPUs (single or multi-GPU)
export CUDA_VISIBLE_DEVICES=0,1


# You don't need to modify the lines below.
# LOG_DIR and RESULT_PATH_TEST are derived from the 6th '/'-separated field of
# MODEL_PATH (the model directory name in the original path layout); adjust the
# "-f 6" field index if your MODEL_PATH has a different depth.
LOG_DIR=../logs/$(echo $MODEL_PATH | cut -d \/ -f 6).log
mkdir -p ../results/$(echo $MODEL_PATH | cut -d \/ -f 6)
RESULT_PATH_TEST=../results/$(echo $MODEL_PATH | cut -d \/ -f 6)/

# Maximum input length in tokens (i.e., the number of position embeddings)
MAX_POS=2500

...
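
Once the paths above are set, training is launched by running the script from the src/ directory:

bash train.sh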

Inference

The inference scripts are inside the test.sh bash file. To run it on your machine, you will need to modify the file directories:

...
# path to the data directory
BERT_DIR=/path/to/dataset/torch-files/

# path to the trained model directory
MODEL_PATH=/disk1/sajad/sci-trained-models/presum/LSUM-2500-segmented-sectioned-multi50-classi-v1/

# path to the best trained model (or the checkpoint that you want to run inference on)
CHECKPOINT=$MODEL_PATH/Recall_BEST_model_s63000_0.4910.pt

# GPU devices, either single or multi-GPU
export CUDA_VISIBLE_DEVICES=0,1

MAX_POS=2500

...
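
As with training, after setting the paths and checkpoint, inference is started by running the script from the src/ directory:

bash test.sh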

Citation

If you plan to use this work, please cite the following papers:

@inproceedings{Sotudeh2021ExtendedSumm,
  title={On Generating Extended Summaries of Long Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)},
  year={2021}
}
@inproceedings{Sotudeh2020LongSumm,
  title={GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={First Workshop on Scholarly Document Processing (SDP 2020)},
  year={2020}
}