Extracting Summary Knowledge Graphs from Long Documents

Overview

GraphSum

This repo contains the data and code for the G2G model in the paper Extracting Summary Knowledge Graphs from Long Documents. The other baseline, TTG, is based on BertSumExt.

Environment Setup

This code is tested with Python 3.6.9, transformers 3.0.2, and PyTorch 1.7.0. You will also need the numpy and scipy packages.

Data

Download and unzip the data from this link. Put the unzipped folder, named ./data, parallel to ./src. You should see four subfolders under ./data/json, corresponding to the four data splits described in the paper.

Under each subfolder, the json file contains all document full texts, abstracts, and the summarized graphs obtained from the abstracts, organized by document keys. Each full text consists of a list of sections, and each summarized graph contains a list of entity and relation mentions. For all splits except the test split, the summarized graphs were obtained by running DyGIE++ on the abstract; the test set has manually annotated summarized graphs from the SciERC dataset. The graph format follows the output of DyGIE++: each entity mention in a section is represented by (start token id, end token id, entity type), and each relation mention is represented by (start token id of entity 1, end token id of entity 1, start token id of entity 2, end token id of entity 2, relation type). The graph also contains a list of coreferential entity mentions.
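To make the format concrete, here is a minimal sketch of reading one split and iterating over a summarized graph. The file path and field names (summarized_graph, entity_mentions, relation_mentions) are illustrative assumptions; check the keys in the downloaded json before relying on them.

```python
import json

# A sketch of reading one split. The path and the field names below are
# assumptions for illustration; verify them against the downloaded data.
with open("./data/json/train/train.json") as f:
    docs = json.load(f)  # organized by document keys

doc = next(iter(docs.values()))
graph = doc["summarized_graph"]

# Entity mentions: (start token id, end token id, entity type)
for start, end, ent_type in graph["entity_mentions"]:
    print(f"tokens [{start}, {end}] -> {ent_type}")

# Relation mentions: (start1, end1, start2, end2, relation type)
for s1, e1, s2, e2, rel_type in graph["relation_mentions"]:
    print(f"({s1},{e1}) --{rel_type}--> ({s2},{e2})")
```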

You should also see two subfolders under the processed folder of each data split: merged_entities and aligned_entities.

merged_entities contains the full and summarized graphs for each document, where the graph vertices are clusters of entity mentions. Entity clusters in each summarized graph are coreferential entity mentions predicted by DyGIE++ or, in the test set, annotated. Entity clusters in each full graph contain entity mentions that are coreferential or share the same non-generic string names (as described in our paper). Under merged_entities, we provide entity clusters and relations between entity clusters, as well as the corresponding entity and relation mentions in the full paper or abstract. Each relation is represented by "[entity cluster id 1]_[entity cluster id 2]_[relation type]", as illustrated below. The original full graphs with all entity and relation mentions were obtained by running DyGIE++ on the document full text. You don't need them to run the code, but you can find them here. For some entity names, you may see a trailing string "<GENERIC_ID> [number]": these entity names were classified by DyGIE++ as "generic", and the trailing string differentiates identical entity name strings that fall in different clusters.
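As a small illustration, a relation key in this format can be unpacked as follows. This is a hypothetical helper, not part of the repo's code; it assumes the cluster ids themselves contain no underscores.

```python
# Hypothetical helper for unpacking a relation key of the form
# "[entity cluster id 1]_[entity cluster id 2]_[relation type]".
# Assumes the cluster ids themselves contain no underscores.
def parse_relation_key(key: str):
    cluster_1, cluster_2, rel_type = key.split("_", 2)
    return cluster_1, cluster_2, rel_type

print(parse_relation_key("3_17_USED-FOR"))  # -> ('3', '17', 'USED-FOR')
```

Using maxsplit=2 leaves any underscores inside the relation type intact.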

aligned_entities contains the pre-calculated alignments between entity clusters in the summarized and full graphs for each document (see Section 5.1 in the paper). In each entity alignment file, each entity cluster of the summarized graph maps to a (possibly empty) list of aligned entity clusters from the full graph. These alignments are used to facilitate data preprocessing for G2G and for evaluation.
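A sketch of reading one alignment file is below. The directory layout and the mapping structure (summarized-graph cluster to a list of full-graph clusters) are assumptions based on the description above.

```python
import json
import os

# Hypothetical path; locate the processed folder of a split and its
# aligned_entities subfolder as described above.
aligned_dir = "./data/json/test/processed/aligned_entities"
fname = sorted(os.listdir(aligned_dir))[0]
with open(os.path.join(aligned_dir, fname)) as f:
    alignment = json.load(f)

# Each summarized-graph entity cluster maps to a (possibly empty) list
# of aligned full-graph entity clusters.
for summ_cluster, full_clusters in alignment.items():
    if full_clusters:
        print(summ_cluster, "->", full_clusters)
```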

Training and Evaluation

The model is based on GAT. Go to ./src and run bash run.sh. You can also find the pretrained model here; put it under ./src/output and run the inference and evaluation parts of ./src/run.sh.

Owner
Zeqiu (Ellen) Wu
PhD Student at UW NLP Research Group