SummerTime - Text Summarization Toolkit for Non-experts

Overview

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

The library architecture is as follows:

[Figure: SummerTime library architecture]

NOTE: SummerTime is in active development. Helpful comments are highly encouraged; please open an issue or reach out to any of the team members.

Installation and setup

Create and activate a new conda environment:

!conda create -n summertime python=3.7
!conda activate summertime

Install the pip dependencies for the local demo:

!pip install -r requirements.txt

Set up ROUGE:

!export ROUGE_HOME=/usr/local/lib/python3.7/dist-packages/summ_eval/ROUGE-1.5.5/
!pip install -U git+https://github.com/bheinzerling/pyrouge.git

Quick Start

Import the model module, initialize the default model, and summarize sample documents:

import model as st_model

model = st_model.summarizer()
documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]
model.summarize(documents)

# ["California's largest electricity provider has turned off power to hundreds of thousands of customers."]

For a more hands-on demo with additional examples, please run our Colab notebook.

Models

Supported Models

SummerTime supports different models (e.g., TextRank, BART, Longformer) as well as model wrappers for more complex summarization tasks (e.g., JointModel for multi-doc summarization, BM25 retrieval for query-based summarization).

Models                 Single-doc  Multi-doc  Dialogue-based  Query-based
BartModel              ✔️          -          -               -
BM25SummModel          -           -          -               ✔️
HMNetModel             -           -          ✔️              -
LexRankModel           ✔️          -          -               -
LongformerModel        ✔️          -          -               -
MultiDocJointModel     -           ✔️         -               -
MultiDocSeparateModel  -           ✔️         -               -
PegasusModel           ✔️          -          -               -
TextRankModel          ✔️          -          -               -
TFIDFSummModel         -           -          -               ✔️

To see all supported models, run:

from model import SUPPORTED_SUMM_MODELS
print(SUPPORTED_SUMM_MODELS)

Import and initialization:

import model as st_model

# To use a default model
default_model = st_model.summarizer()    

# Or a specific model
bart_model = st_model.BartModel()
pegasus_model = st_model.PegasusModel()
lexrank_model = st_model.LexRankModel()
textrank_model = st_model.TextRankModel()

Users can easily access documentation to assist with model selection:

default_model.show_capability()
pegasus_model.show_capability()
textrank_model.show_capability()

To use a model for summarization, simply run:

documents = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. 
    The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected 
    by the shutoffs which were expected to last through at least midday tomorrow."""
]

default_model.summarize(documents)
# or 
pegasus_model.summarize(documents)

All models can be initialized with the following optional arguments:

def __init__(self,
         trained_domain: str=None,
         max_input_length: int=None,
         max_output_length: int=None,
         ):

All models will implement the following methods:

def summarize(self,
  corpus: Union[List[str], List[List[str]]],
  queries: List[str]=None) -> List[str]:

def show_capability(cls) -> None:
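
As a rough illustration of this interface, the sketch below defines a toy class with the same constructor arguments and the same two methods. `LeadSentenceModel` and its first-sentence heuristic are hypothetical stand-ins, not part of SummerTime, and treating the length limits as character counts is an assumption made only for this example.

```python
from typing import List, Optional, Union


class LeadSentenceModel:
    """Hypothetical stand-in that mimics the documented interface.

    NOT a SummerTime class: it "summarizes" each document by
    returning its first sentence.
    """

    model_name = "LeadSentence (illustrative only)"

    def __init__(
        self,
        trained_domain: Optional[str] = None,
        max_input_length: Optional[int] = None,
        max_output_length: Optional[int] = None,
    ):
        self.trained_domain = trained_domain
        # Assumption: lengths are measured in characters in this sketch.
        self.max_input_length = max_input_length
        self.max_output_length = max_output_length

    def summarize(
        self,
        corpus: Union[List[str], List[List[str]]],
        queries: Optional[List[str]] = None,
    ) -> List[str]:
        summaries = []
        for document in corpus:
            # List-style inputs (e.g. multi-doc) are joined into one string.
            text = document if isinstance(document, str) else " ".join(document)
            text = " ".join(text.split())  # collapse whitespace and newlines
            if self.max_input_length is not None:
                text = text[: self.max_input_length]
            first_sentence = text.split(". ")[0].rstrip(".") + "."
            if self.max_output_length is not None:
                first_sentence = first_sentence[: self.max_output_length]
            summaries.append(first_sentence)
        return summaries

    @classmethod
    def show_capability(cls) -> None:
        print(f"{cls.model_name}: returns the first sentence of each document.")


toy_model = LeadSentenceModel()
toy_model.show_capability()
toy_summaries = toy_model.summarize(
    ["The aim is to reduce the risk of wildfires. Many customers are affected."]
)
print(toy_summaries)
```

Because every model exposes the same `summarize` and `show_capability` methods, code written against one model should carry over to another with only the constructor call changed.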

Datasets

Datasets supported

SummerTime supports different summarization datasets across different domains (e.g., CNNDM dataset - news article corpus, SAMSum - dialogue corpus, QMSum - query-based dialogue corpus, Multi-News - multi-document corpus, MLSum - multilingual corpus, PubMedQA - medical domain, arXiv - scientific papers domain).

Dataset         Domain               # Examples  Src. length  Tgt. length  Query  Multi-doc  Dialogue  Multi-lingual
ArXiv           Scientific articles  215k        4.9k         220          -      -          -         -
CNN/DM (3.0.0)  News                 300k        781          56           -      -          -         -
MlsumDataset    Multi-lingual News   1.5M+       632          34           -      -          -         ✔️ German, Spanish, French, Russian, Turkish
Multi-News      News                 56k         2.1k         263.8        -      ✔️         -         -
SAMSum          Open-domain          16k         94           20           -      -          ✔️        -
Pubmedqa        Medical              272k        244          32           ✔️     -          -         -
QMSum           Meetings             1k          9.0k         69.6         ✔️     -          ✔️        -
ScisummNet      Scientific articles  1k          4.7k         150          -      -          -         -
SummScreen      TV shows             26.9k       6.6k         337.4        -      -          ✔️        -
XSum            News                 226k        431          23.3         -      -          -         -

To see all supported datasets, run:

import dataset

print(dataset.list_all_dataset())

Dataset Initialization

import dataset

cnn_dataset = dataset.CnndmDataset()
# or 
xsum_dataset = dataset.XsumDataset()
# ...etc.

Dataset Object

All datasets are implementations of the SummDataset class. Their data splits can be accessed as follows:

# Use a separate name so that the `dataset` module is not shadowed
cnn_dataset = dataset.CnndmDataset()

train_data = cnn_dataset.train_set
dev_data = cnn_dataset.dev_set
test_data = cnn_dataset.test_set

To see the details of the datasets, run:

cnn_dataset = dataset.CnndmDataset()

cnn_dataset.show_description()

Data instance

The data in all datasets is contained in a SummInstance class object, which has the following properties:

data_instance.source = source    # either `List[str]` or `str`, depending on the dataset; join the strings if a model expects a single string
data_instance.summary = summary  # a string summary that serves as ground truth
data_instance.query = query      # Optional, applies when a string query is present

print(data_instance)             # to print the data instance in its entirety
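
Since `source` may be either a string or a list of strings, a small normalization step can be useful before passing instances to a model that expects plain strings. The sketch below shows one way to do this; `source_to_text` is a hypothetical helper and not a SummerTime function, and the sample sources are made up for illustration.

```python
from typing import List, Union


def source_to_text(source: Union[str, List[str]], separator: str = " ") -> str:
    """Return `source` as one string, joining list-style sources.

    Hypothetical helper for illustration; not part of SummerTime.
    """
    if isinstance(source, str):
        return source
    return separator.join(part.strip() for part in source)


# A dialogue-style source (list of turns) and an article-style source (one string).
dialogue_source = ["Amanda: I baked cookies.", "Jerry: Sure, I will try them!"]
article_source = "PG&E scheduled the blackouts in response to forecasts."

print(source_to_text(dialogue_source))
print(source_to_text(article_source))
```

A newline separator (`separator="\n"`) may be a better choice when turn boundaries matter, for example for dialogue inputs.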

Loading and using data instances

Data is loaded using a generator to save space and time.

To get a single instance:

data_instance = next(cnn_dataset.train_set)
print(data_instance)

To get a slice of the dataset:

import itertools

# Get a slice from the train set generator - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance.source for instance in train_set]
print(corpus)
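
One consequence of generator-based loading is that instances are consumed as they are read, so slicing the same split twice yields successive instances rather than the same ones. The sketch below demonstrates this with a plain Python generator; `toy_split` is a hypothetical stand-in for a dataset split, under the assumption that the splits behave like ordinary generators.

```python
import itertools


def toy_split():
    """Hypothetical stand-in for a dataset split that yields instances lazily."""
    for index in range(10):
        yield f"document {index}"


train_set = toy_split()

# Each islice call continues from where the previous one stopped.
first_batch = list(itertools.islice(train_set, 3))
second_batch = list(itertools.islice(train_set, 3))

print(first_batch)   # the first three items
print(second_batch)  # the next three items, not the first three again
```

If the same instances are needed more than once, materializing the slice with `list(...)`, as above, keeps them available after the generator has moved on.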

Using the datasets with the models - Examples

import itertools
import dataset
import model

cnn_dataset = dataset.CnndmDataset()


# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance.source for instance in train_set]


# Example 1 - traditional non-neural model
# LexRank model
lexrank = model.LexRankModel(corpus)
lexrank.show_capability()  # prints the description itself and returns None

lexrank_summary = lexrank.summarize(corpus)
print(lexrank_summary)


# Example 2 - A spaCy pipeline for TextRank (another non-neural extractive summarization model)
# TextRank model
textrank = model.TextRankModel()
textrank.show_capability()

textrank_summary = textrank.summarize(corpus)
print(textrank_summary)


# Example 3 - A neural model to handle large texts
# Longformer model
longformer = model.LongformerModel()
longformer.show_capability()

longformer_summary = longformer.summarize(corpus)
print(longformer_summary)

Evaluation

SummerTime supports different evaluation metrics, including BertScore, Bleu, Meteor, Rouge, and RougeWe.

To print all supported metrics:

from evaluation import SUPPORTED_EVALUATION_METRICS

print(SUPPORTED_EVALUATION_METRICS)

Import and initialization:

import evaluation as st_eval

# Class names follow the metric names listed above
bert_eval = st_eval.BertScore()
bleu_eval = st_eval.Bleu()
meteor_eval = st_eval.Meteor()
rouge_eval = st_eval.Rouge()
rougewe_eval = st_eval.RougeWe()

Evaluation Class

All evaluation metrics can be initialized with the following optional arguments:

def __init__(self, metric_name):

All evaluation metric objects implement the following methods:

def evaluate(self, summaries, targets):

def get_dict(self, keys):
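
To make the shape of an `evaluate` call more concrete, the sketch below implements a toy metric that takes generated summaries and reference summaries, mirroring the `evaluate(summaries, targets)` calls used in this README, and returns a dictionary of scores. `UnigramOverlapMetric` is hypothetical and not part of SummerTime; it computes a simple unigram precision/recall/F1 and is far cruder than ROUGE or BERTScore. The dictionary return type is also an assumption.

```python
from collections import Counter
from typing import Dict, List


class UnigramOverlapMetric:
    """Hypothetical metric for illustration; NOT a SummerTime class."""

    metric_name = "unigram_overlap"

    def evaluate(self, summaries: List[str], targets: List[str]) -> Dict[str, float]:
        if not summaries or len(summaries) != len(targets):
            raise ValueError("summaries and targets must be non-empty and equal in length")

        precisions, recalls, f1_scores = [], [], []
        for summary, target in zip(summaries, targets):
            summary_counts = Counter(summary.lower().split())
            target_counts = Counter(target.lower().split())
            # Multiset intersection counts each shared token at most as often
            # as it appears in both texts.
            overlap = sum((summary_counts & target_counts).values())
            precision = overlap / max(sum(summary_counts.values()), 1)
            recall = overlap / max(sum(target_counts.values()), 1)
            f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
            precisions.append(precision)
            recalls.append(recall)
            f1_scores.append(f1)

        count = len(summaries)
        return {
            "precision": sum(precisions) / count,
            "recall": sum(recalls) / count,
            "f1": sum(f1_scores) / count,
        }


toy_metric = UnigramOverlapMetric()
toy_scores = toy_metric.evaluate(["the cat sat"], ["the cat sat on the mat"])
print(toy_scores)
```

Scores are averaged over all summary/target pairs, so a single call evaluates a whole batch.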

Using evaluation metrics

Get sample summary data

from evaluation.base_metric import SummMetric
from evaluation import Rouge, RougeWe, BertScore

import itertools

# Evaluates model on subset of cnn_dailymail
# Get a slice of the train set - first 5 instances
train_set = itertools.islice(cnn_dataset.train_set, 5)

corpus = [instance for instance in train_set]
print(corpus)

articles = [instance.source for instance in corpus]

summaries = default_model.summarize(articles)
targets = [instance.summary for instance in corpus]

Evaluate the data on different metrics

from evaluation import BertScore, Rouge, RougeWe

# Calculate BertScore
bert_metric = BertScore()
bert_score = bert_metric.evaluate(summaries, targets)
print(bert_score)

# Calculate Rouge
rouge_metric = Rouge()
rouge_score = rouge_metric.evaluate(summaries, targets)
print(rouge_score)

# Calculate RougeWe
rougewe_metric = RougeWe()
rougewe_score = rougewe_metric.evaluate(summaries, targets)
print(rougewe_score)

To contribute

Pull requests

Create a pull request and name it [your_gh_username]/[your_branch_name]. If needed, resolve your own branch's merge conflicts with main. Do not push directly to main.

Code formatting

If you haven't already, install black and flake8:

pip install black
pip install flake8

Before pushing commits or merging branches, run the following commands from the project root. Note that black will write to files, and that you should add and commit changes made by black before pushing:

black .
flake8 .

Or if you would like to lint specific files:

black path/to/specific/file.py
flake8 path/to/specific/file.py

Ensure that black does not reformat any files and that flake8 does not print any errors. If you would like to override or ignore any of the preferences or practices enforced by black or flake8, please leave a comment in your PR for any lines of code that generate warning or error logs. Do not directly edit config files such as setup.cfg.

See the black docs and flake8 docs for documentation on installation, ignoring files/lines, and advanced usage. In addition, the following may be useful:

  • black [file.py] --diff to preview changes as diffs instead of directly making changes
  • black [file.py] --check to preview changes with status codes instead of directly making changes
  • git diff -u | flake8 --diff to only run flake8 on working branch changes

Note that our CI test suite will include invoking black --check . and flake8 --count . on all non-unittest and non-setup Python files, and zero error-level output is required for all tests to pass.

Tests

Our continuous integration system is provided through GitHub Actions. Whenever a pull request is created or updated, or main is updated, the repository's unit tests are run as build jobs on tangra for that pull request. Build jobs will either pass or fail within a few minutes, and build statuses and logs are visible under Actions. Please ensure that the most recent commit in a pull request passes all checks (i.e., all steps in all jobs run to completion) before merging, or request a review. To skip a build on any particular commit, append [skip ci] to the commit message. Note that PRs with the substring /no-ci/ anywhere in the branch name will not be included in CI.

Citation

This repository is built by the LILY Lab at Yale University, led by Prof. Dragomir Radev. The main contributors are Ansong Ni, Zhangir Azerbayev, Troy Feng, Murori Mutuma and Yusen Zhang (Penn State).

If you use SummerTime in your work, consider citing:

@article{ni2021summertime,
     title={SummerTime: Text Summarization Toolkit for Non-experts}, 
     author={Ansong Ni and Zhangir Azerbayev and Mutethia Mutuma and Troy Feng and Yusen Zhang and Tao Yu and Ahmed Hassan Awadallah and Dragomir Radev},
     journal={arXiv preprint arXiv:2108.12738},
     year={2021}
}

For comments and questions, please open an issue.
