DirectQuote - A Dataset for Direct Quotation Extraction and Attribution in News Articles

DirectQuote is a corpus containing 19,760 paragraphs and 10,353 direct quotations manually annotated from online news media.

A quotation is a general notion that covers different kinds of speech, thought, and writing in text (Semino and Short, 2004). It is a prominent linguistic device for expressing opinions, statements, and assessments attributed to the speaker (Cappelen and Lepore, 2012). Among all kinds of quotations, a direct quotation (O’Keefe et al., 2013) has its entire content enclosed in quotation marks, which means that what the speaker said is transcribed verbatim.

Task Definition

Quotation extraction is defined as extracting reported speech from a third party in the text; it is also known as reported speech extraction. Quotation attribution refers to determining the speaker of a quotation. When annotating speakers, we ensure that every valid speaker can be linked to a person entity in a named entity library. Simple patterns are removed from the annotated data to increase the diversity of the corpus.

Data

Region	Name	Numbers
U.S.	Associated Press	438
U.S.	Cable News Network	627
U.S.	American Broadcasting Company	240
U.S.	New York Times	5,642
U.S.	CBS Broadcasting	4,890
UK	British Broadcasting Corporation	926
UK	Reuters	5,836
UK	The Guardian	4,302
Canada	The Globe and Mail	1,955
Canada	The Star	13,769
New Zealand	NZ Herald	115
Australia	Australian Broadcasting Corporation	312
Australia	Sydney Morning Herald	93

We select multiple representative news sources across the political spectrum: 13 well-known online news media from five major English-speaking countries. The corpus follows the CoNLL 2003 format and uses IOB1 tagging. Raw text is tokenized with a whitespace tokenizer. Every word is assigned one of the following labels:

Label	Description
LeftSpeaker	Quotation whose speaker appears in the preceding text
RightSpeaker	Quotation whose speaker appears in the following text
Unknown	Quotation with no corresponding speaker
Speaker	Speaker
Out	Neither a quotation nor a speaker
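
The labeling scheme above can be read back into spans with a few lines of Python. The following is a minimal sketch, not the authors' loader: it assumes one token per line with the tag in the last whitespace-separated column, blank lines between paragraphs, tag strings of the form `I-Speaker` or `B-LeftSpeaker`, and any tag without a hyphen (such as `O`) meaning "Out". The function names and the nearest-speaker pairing rule are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start, end) with end exclusive


def parse_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into blocks of (token, tag) pairs.

    Assumption: the token is the first column and the tag is the last.
    """
    blocks, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # a blank line ends the current block
            if current:
                blocks.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))
    if current:
        blocks.append(current)
    return blocks


def iob1_spans(tags: List[str]) -> List[Span]:
    """Decode IOB1 tags into spans.

    In IOB1 a span normally starts with I-; B- appears only when a span
    directly follows another span of the same type.
    """
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, sep, typ = tag.partition("-")
        if not sep:  # no hyphen: treated as an outside tag (assumption)
            typ = None
        # Close the open span on a type change, an outside tag, or a B- tag.
        if label is not None and (typ != label or prefix == "B"):
            spans.append((label, start, i))
            start, label = None, None
        if typ is not None and label is None:
            start, label = i, typ
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def attribute(spans: List[Span]) -> List[Tuple[Span, Optional[Span]]]:
    """Pair each quotation with the nearest Speaker span on the labeled side.

    Assumption: "nearest" is a heuristic for illustration only.
    """
    speakers = [s for s in spans if s[0] == "Speaker"]
    pairs = []
    for span in spans:
        label, start, end = span
        if label == "LeftSpeaker":
            before = [s for s in speakers if s[2] <= start]
            pairs.append((span, before[-1] if before else None))
        elif label == "RightSpeaker":
            after = [s for s in speakers if s[1] >= end]
            pairs.append((span, after[0] if after else None))
        elif label == "Unknown":
            pairs.append((span, None))
    return pairs


# Made-up sample in the assumed layout (not taken from the corpus).
SAMPLE = 'Smith I-Speaker\nsaid O\n"We I-LeftSpeaker\nwill I-LeftSpeaker\nwin." I-LeftSpeaker\n'

block = parse_conll(SAMPLE)[0]
tokens = [tok for tok, _ in block]
for quote, speaker in attribute(iob1_spans([tag for _, tag in block])):
    who = " ".join(tokens[speaker[1]:speaker[2]]) if speaker else "(unknown)"
    print(who, "->", " ".join(tokens[quote[1]:quote[2]]))
```

If the released files use different tag strings or column order, only `parse_conll` and the label names in `attribute` would need to change.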

Statistics

	Numbers
News Article	39,153
Paragraph	19,760
Quotation	10,353
Time	2020.09-2021.03

Reference

DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles, Yuanchi Zhang, Yang Liu

Owner

THUNLP-MT