BookNLP, a natural language processing pipeline for books

Overview

BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

BookNLP ships with two models, both with identical architectures but different underlying BERT sizes. The larger and more accurate big model is suited to GPUs and multi-core computers; the faster small model is more appropriate for personal computers. The table below compares the two models in terms of both overall speed and accuracy on the tasks that BookNLP performs.

Metric | Small | Big
Entity tagging (F1) | 88.2 | 90.0
Supersense tagging (F1) | 73.2 | 76.2
Event tagging (F1) | 70.6 | 74.1
Coreference resolution (Avg. F1) | 76.4 | 79.0
Speaker attribution (B3) | 86.4 | 89.9
CPU time, 2019 MacBook Pro (mins.)* | 3.6 | 15.4
CPU time, 10-core server (mins.)* | 2.4 | 5.2
GPU time, Titan RTX (mins.)* | 2.1 | 2.2

*Timings measure how long BookNLP takes to process a sample book, The Secret Garden (99K tokens). To explore running BookNLP in Google Colab on a GPU, see this notebook.

Installation

  • Create and activate a fresh conda environment, if desired.

conda create --name booknlp python=3.7
conda activate booknlp
  • If using a GPU, install pytorch for your system and CUDA version by following installation instructions on https://pytorch.org.

  • Install booknlp and download the spaCy model.

pip install booknlp
python -m spacy download en_core_web_sm

Usage

from booknlp.booknlp import BookNLP

# Pipeline components to run and the model size to use
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file = "input_dir/bartleby_the_scrivener.txt"

# Output directory to store resulting files in
output_directory = "output_dir/bartleby/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id = "bartleby"

booknlp.process(input_file, output_directory, book_id)

This runs the full BookNLP pipeline. To reduce computation time, you can run only some components of the pipeline by listing them in the "pipeline" parameter (e.g., to run only entity tagging and event tagging, change model_params above to include "pipeline":"entity,event").
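The component selection described above can be sketched as a small helper. This is only an illustration: build_model_params is a hypothetical function, not part of BookNLP, and it assumes that the five step names in the usage example are the complete set and that the small model is selected with "model": "small".

```python
# Step names taken from the usage example above; treating them as the
# complete set of valid names is an assumption.
KNOWN_STEPS = ("entity", "quote", "supersense", "event", "coref")


def build_model_params(steps, model="small"):
    """Build a model_params dict for BookNLP (hypothetical helper)."""
    unknown = [s for s in steps if s not in KNOWN_STEPS]
    if unknown:
        raise ValueError(f"unknown pipeline step(s): {unknown}")
    # Keep the steps in a canonical order and drop duplicates.
    ordered = [s for s in KNOWN_STEPS if s in steps]
    return {"pipeline": ",".join(ordered), "model": model}


params = build_model_params(["event", "entity"])
print(params)  # {'pipeline': 'entity,event', 'model': 'small'}

# To run it (requires booknlp to be installed):
# from booknlp.booknlp import BookNLP
# BookNLP("en", params).process(input_file, output_directory, book_id)
```

Validating the names up front surfaces a typo in a step name before any model is loaded.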

This process creates the directory output_dir/bartleby and generates the following files:

  • bartleby/bartleby.tokens -- This encodes core word-level information. Each row corresponds to one token and includes the following information:

    • paragraph ID
    • sentence ID
    • token ID within sentence
    • token ID within document
    • word
    • lemma
    • byte onset within original document
    • byte offset within original document
    • POS tag
    • dependency relation
    • token ID within document of syntactic head
    • event
  • bartleby/bartleby.entities -- This represents the typed entities within the document (e.g., people and places), along with their coreference.

    • coreference ID (unique entity ID)
    • start token ID within document
    • end token ID within document
    • NOM (nominal), PROP (proper), or PRON (pronoun)
    • PER (person), LOC (location), FAC (facility), GPE (geo-political entity), VEH (vehicle), ORG (organization)
    • text of entity
  • bartleby/bartleby.supersense -- This stores information from supersense tagging.

    • start token ID within document
    • end token ID within document
    • supersense category (verb.cognition, verb.communication, noun.artifact, etc.)
  • bartleby/bartleby.quotes -- This stores information about the quotations in the document, along with the speaker. In a sentence like "'Yes', she said", where she -> ELIZABETH_BENNETT, "she" is the attributed mention of the quotation 'Yes', and is coreferent with the unique entity ELIZABETH_BENNETT.

    • start token ID within document of quotation
    • end token ID within document of quotation
    • start token ID within document of attributed mention
    • end token ID within document of attributed mention
    • attributed mention text
    • coreference ID (unique entity ID) of attributed mention
    • quotation text
  • bartleby/bartleby.book -- JSON file providing information about all characters mentioned more than once in the book, including their proper/common/pronominal references, referential gender, actions for which they are the agent and patient, objects they possess, and modifiers.

  • bartleby/bartleby.book.html -- HTML file containing (a) the full text of the book along with annotations for entities, coreference, and speaker attribution and (b) a list of the named characters and major entity categories (FAC, GPE, LOC, etc.).
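As a rough sketch of how the .entities output might be consumed, the example below groups mentions by coreference ID using only the standard library. It assumes the file is tab-separated with a header row; the header names used here (COREF, start_token, end_token, prop, cat, text) and the sample rows are assumptions for illustration, so check the first line of your own output before relying on them.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical excerpt of a .entities file. The tab-separated layout and the
# header names are assumptions; the rows are invented for illustration.
SAMPLE_ENTITIES = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "0\t3\t3\tPRON\tPER\tI\n"
    "1\t10\t10\tPROP\tPER\tBartleby\n"
    "1\t42\t42\tPRON\tPER\the\n"
    "1\t57\t58\tNOM\tPER\tthe scrivener\n"
    "2\t60\t61\tPROP\tLOC\tWall Street\n"
)


def read_entities(handle):
    """Parse entity rows, converting token IDs to integers."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        row["start_token"] = int(row["start_token"])
        row["end_token"] = int(row["end_token"])
        rows.append(row)
    return rows


def mentions_by_entity(rows):
    """Group mention texts by coreference ID, then by mention type."""
    grouped = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        grouped[row["COREF"]][row["prop"]][row["text"]] += 1
    return grouped


rows = read_entities(io.StringIO(SAMPLE_ENTITIES))
grouped = mentions_by_entity(rows)
for coref_id, by_type in grouped.items():
    # Use the most frequent proper name as a readable label, if there is one.
    proper = by_type.get("PROP", Counter()).most_common(1)
    label = proper[0][0] if proper else "(no proper name)"
    total = sum(sum(counter.values()) for counter in by_type.values())
    print(coref_id, label, total)
```

To read a real file, pass open("output_dir/bartleby/bartleby.entities", encoding="utf-8") to read_entities in place of the in-memory sample.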

Annotations

Entity annotations

The entity annotation layer covers six of the ACE 2005 categories in text:

  • People (PER): Tom Sawyer, her daughter
  • Facilities (FAC): the house, the kitchen
  • Geo-political entities (GPE): London, the village
  • Locations (LOC): the forest, the river
  • Vehicles (VEH): the ship, the car
  • Organizations (ORG): the army, the Church

The targets of annotation here include named entities (e.g., Tom Sawyer), common entities (the boy), and pronouns (he). These entities can be nested, as in the following:

[Figure: example of nested entity annotations]
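Nesting can also be detected from token offsets alone. The sketch below is a minimal illustration, assuming that end token IDs are inclusive; the Mention type, the nested_pairs helper, and the offsets are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    start: int  # start token ID within document
    end: int    # end token ID within document (assumed inclusive)
    cat: str    # entity category, e.g. PER or FAC
    text: str


def nested_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Return (outer, inner) pairs where inner lies within a longer outer span."""
    pairs = []
    for outer in mentions:
        for inner in mentions:
            contains = outer.start <= inner.start and inner.end <= outer.end
            longer = (outer.end - outer.start) > (inner.end - inner.start)
            if contains and longer:
                pairs.append((outer, inner))
    return pairs


# Hypothetical token offsets for the phrase "her daughter".
mentions = [
    Mention(5, 6, "PER", "her daughter"),
    Mention(5, 5, "PER", "her"),
    Mention(9, 10, "FAC", "the kitchen"),
]
for outer, inner in nested_pairs(mentions):
    print(f"{inner.text!r} is nested inside {outer.text!r}")
# prints: 'her' is nested inside 'her daughter'
```

The pairwise comparison is quadratic in the number of mentions, which is acceptable for a short passage; for a whole book, sorting the spans by start offset first would be a better choice.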

For more, see: David Bamman, Sejal Popat and Sheng Shen, "An Annotated Dataset of Literary Entities," NAACL 2019.

The entity tagging model within BookNLP is trained on an annotated dataset of 968K tokens, including the public domain materials in LitBank and a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction (article forthcoming).

Event annotations

The event layer identifies events with asserted realis (depicted as actually taking place, with specific participants at a specific time) -- as opposed to events with other epistemic modalities (hypotheticals, future events, extradiegetic summaries by the narrator).

Text | Events | Source
My father’s eyes had closed upon the light of this world six months, when mine opened on it. | {closed, opened} | Dickens, David Copperfield
Call me Ishmael. | {} | Melville, Moby Dick
His sister was a tall, strong girl, and she walked rapidly and resolutely, as if she knew exactly where she was going and what she was going to do next. | {walked} | Cather, O Pioneers

For more, see: Matt Sims, Jong Ho Park and David Bamman, "Literary Event Detection," ACL 2019.

The event tagging model is trained on event annotations within LitBank. The small model is distilled from the big model: it is trained on the predictions that the big model makes for a collection of contemporary texts.
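Event tags are reported per token in the event column of the .tokens file. The sketch below shows one way to pull them out, assuming a tab-separated file with a header row, hypothetical column names, and the labels EVENT and O for event and non-event tokens; verify all three against your own output.

```python
import csv
import io

# Hypothetical excerpt of a .tokens file, reduced to three columns.
# The column names and the EVENT/O labels are assumptions.
SAMPLE_TOKENS = (
    "token_ID_within_document\tword\tevent\n"
    "0\tshe\tO\n"
    "1\twalked\tEVENT\n"
    "2\trapidly\tO\n"
    "3\tas\tO\n"
    "4\tif\tO\n"
    "5\tshe\tO\n"
    "6\tknew\tO\n"
)


def event_tokens(handle, event_label="EVENT"):
    """Return (token ID, word) for every token tagged as an event."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (int(row["token_ID_within_document"]), row["word"])
        for row in reader
        if row["event"] == event_label
    ]


events = event_tokens(io.StringIO(SAMPLE_TOKENS))
print(events)  # [(1, 'walked')]
```

In this sample "knew" is left untagged because it sits inside a hypothetical ("as if she knew"), mirroring the realis criterion described above.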

Supersense tagging

Supersense tagging provides coarse semantic information for a sentence by tagging spans with 41 lexical semantic categories drawn from WordNet, spanning both nouns (including plant, animal, food, feeling, and artifact) and verbs (including cognition, communication, and motion).

Example | Source
The [station wagons]artifact [arrived]motion at [noon]time, a long shining [line]group that [coursed]motion through the [west campus]location. | DeLillo, White Noise

The BookNLP tagger is trained on SemCor.


Character name clustering and coreference

The coreference layer covers the six ACE entity categories outlined above (people, facilities, locations, geo-political entities, organizations and vehicles) and is trained on LitBank and PreCo.

Example | Source
One may as well begin with [Helen]x's letters to [[her]x sister]y | Forster, Howards End

Accurate coreference at the scale of a book-length document is still an open research problem, and attempting full coreference -- where any named entity (Elizabeth), common entity (her sister, his daughter) and pronoun (she) can corefer -- tends to erroneously conflate multiple distinct entities into one. By default, BookNLP addresses this by first carrying out character name clustering (grouping "Tom", "Tom Sawyer" and "Mr. Sawyer" into a single entity), and then allowing pronouns to corefer with either named entities (Tom) or common entities (the boy), but disallowing common entities from coreferring with named entities. To turn off this mode and carry out full coreference, add pronominalCorefOnly=False to the model_params dictionary above (but be sure to inspect the output!).
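As a configuration sketch, the two modes could be written as follows. Expressing the option as a dictionary key with a boolean value is an assumption about the expected form, based on the pronominalCorefOnly=False wording above.

```python
# Default behaviour: character name clustering plus pronominal coreference.
default_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}

# Full coreference. Passing the option as a boolean dictionary entry is an
# assumption about how BookNLP expects it to be supplied.
full_coref_params = dict(default_params, pronominalCorefOnly=False)

print(full_coref_params["pronominalCorefOnly"])  # False
```

Building the second dictionary from the first keeps the two runs identical apart from the coreference mode, which makes it easier to compare their outputs.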

For more on the coreference criteria used in this work, see David Bamman, Olivia Lewke and Anya Mansoor (2020), "An Annotated Dataset of Coreference in English Literature", LREC.

Referential gender inference

BookNLP infers the referential gender of characters by associating them with the pronouns (he/him/his, she/her, they/them, xe/xem/xyr/xir, etc.) used to refer to them in the context of the story. This method encodes several assumptions:

  • BookNLP describes the referential gender of characters, and not their gender identity. Characters are described by the pronouns used to refer to them (e.g., he/him, she/her) rather than labels like "M/F".

  • Prior information on the alignment of names with referential gender (e.g., from government records or larger background datasets) can be used to inform this process if desired (e.g., "Tom" is often associated with he/him in pre-1923 English texts). Name information, however, should not be uniquely determinative, but rather should be sensitive to the context in which it is used (e.g., "Tom" in the book "Tom and Some Other Girls", where Tom is aligned with she/her). By default, BookNLP uses prior information on the alignment of proper names and honorifics with pronouns drawn from ~15K works from Project Gutenberg; this prior information can be ignored by setting referential_gender_hyperparameterFile:None in the model_params dictionary. Alternative priors can be used by passing the pathname to a prior file (in the same format as english/data/gutenberg_prop_gender_terms.txt) to this parameter.

  • Users should be free to define the referential gender categories used here. The default set of categories is {he, him, his}, {she, her}, {they, them, their}, {xe, xem, xyr, xir}, and {ze, zem, zir, hir}. To specify a different set of categories, update the model_params setting to define them: referential_gender_cats: [ ["he", "him", "his"], ["she", "her"], ["they", "them", "their"], ["xe", "xem", "xyr", "xir"], ["ze", "zem", "zir", "hir"] ]
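The two settings named above can be sketched together as a model_params dictionary, followed by a toy count of pronoun categories for one character. The counting function is a deliberate simplification for illustration, not BookNLP's inference method (which can also use name priors), and pronoun_category_counts is a hypothetical helper.

```python
from collections import Counter

# Default categories as listed above.
REFERENTIAL_GENDER_CATS = [
    ["he", "him", "his"],
    ["she", "her"],
    ["they", "them", "their"],
    ["xe", "xem", "xyr", "xir"],
    ["ze", "zem", "zir", "hir"],
]

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
    "referential_gender_cats": REFERENTIAL_GENDER_CATS,
    # None ignores the Project Gutenberg name prior.
    "referential_gender_hyperparameterFile": None,
}


def pronoun_category_counts(pronouns, cats=REFERENTIAL_GENDER_CATS):
    """Count how often each pronoun category is used (toy illustration)."""
    lookup = {p: "/".join(cat) for cat in cats for p in cat}
    return Counter(lookup[p.lower()] for p in pronouns if p.lower() in lookup)


# Pronoun mentions coreferent with a single hypothetical character.
counts = pronoun_category_counts(["She", "her", "her", "they"])
print(counts.most_common(1)[0][0])  # she/her
```

Keeping the categories in one list means the same definition drives both the configuration and any downstream analysis.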

Speaker attribution

The speaker attribution model identifies all instances of direct speech in the text and attributes each one to its speaker.

Quote | Speaker | Source
— Come up , Kinch ! Come up , you fearful jesuit ! | Buck_Mulligan-0 | Joyce, Ulysses
‘ Oh dear ! Oh dear ! I shall be late ! ’ | The_White_Rabbit-4 | Carroll, Alice in Wonderland
“ Do n't put your feet up there , Huckleberry ; ” | Miss_Watson-26 | Twain, Huckleberry Finn

This model is trained on speaker attribution data in LitBank. For more on the quotation annotations, see this paper.
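A minimal sketch of grouping quotations by speaker from the .quotes output follows. It assumes a tab-separated file with a header row; the column order matches the field list given earlier, but the header names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a .quotes file; header names are assumptions.
SAMPLE_QUOTES = (
    "quote_start\tquote_end\tmention_start\tmention_end\t"
    "mention_phrase\tchar_id\tquote\n"
    '10\t14\t15\t15\tshe\t4\t" Yes , "\n'
    '30\t35\t36\t37\tMr. Darcy\t7\t" Indeed ? "\n'
    '50\t55\t56\t56\tElizabeth\t4\t" I will . "\n'
)


def quotes_by_speaker(handle):
    """Map each speaker's coreference ID to the quotations attributed to them."""
    grouped = defaultdict(list)
    # QUOTE_NONE keeps the quotation marks inside the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        grouped[row["char_id"]].append(row["quote"])
    return dict(grouped)


grouped = quotes_by_speaker(io.StringIO(SAMPLE_QUOTES))
for char_id, quotes in grouped.items():
    print(char_id, len(quotes), quotes)
```

Disabling CSV quote handling matters here: with the default settings, a field that begins with a double quotation mark would be parsed as a quoted field and its marks stripped.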

Part-of-speech tagging and dependency parsing

BookNLP uses spaCy for part-of-speech tagging and dependency parsing.

Acknowledgments

BookNLP is supported by the National Endowment for the Humanities (HAA-271654-20) and the National Science Foundation (IIS-1942591).