PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

PASTRIE

CC BY-SA 4.0

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

  • English
  • French
  • German
  • Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.


Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

  • .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
  • .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
  • .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
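
As a rough sketch of working with the JSON format, the snippet below tallies construals over single-word expressions. It assumes the layout suggested by the converter output quoted elsewhere on this page: a list of sentences, each with an swes dictionary whose entries carry lexcat, ss, and ss2. The inline sample and the helper names are made up for illustration; in practice the list would come from json.load on one of the released files.

```python
from collections import Counter

# Hypothetical sample mimicking the assumed JSON layout.
SAMPLE = [
    {
        "sent_id": "example-01",
        "swes": {
            "1": {"lexlemma": "over", "lexcat": "P",
                  "ss": "p.Locus", "ss2": "p.Locus", "toknums": [1]},
            "2": {"lexlemma": "to", "lexcat": "P",
                  "ss": "p.Recipient", "ss2": "p.Goal", "toknums": [2]},
            "3": {"lexlemma": "mat", "lexcat": "N",
                  "ss": None, "ss2": None, "toknums": [3]},
        },
    }
]


def construal(ss, ss2):
    """Render a scene role and function in SceneRole↝Function notation."""
    return f"{ss}↝{ss2}"


def count_construals(sentences):
    """Count construals over all single-word expressions that have a supersense."""
    counts = Counter()
    for sent in sentences:
        for swe in sent.get("swes", {}).values():
            if swe.get("ss"):
                counts[construal(swe["ss"], swe["ss2"])] += 1
    return counts


print(count_construals(SAMPLE))
```

Expressions without a supersense (such as the noun in the sample) are skipped, so the tally covers only SNACS-annotated targets.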

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

  • Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
    • PASTRIE does not group base+clitic combinations into multiword tokens (single orthographic words that contain multiple syntactic words), whereas STREUSLE does.
    • PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
    • In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
    • PASTRIE lacks enhanced dependencies.
  • Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
    • Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
  • Noun and verb expressions in PASTRIE do not have supersense labels.
Comments
  • Misc. annotation errors and/or conversion script bugs

    There are some annotations which I'm fairly sure are incorrect and are choking the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

    1. "vs" is mistagged as a noun; it should be a preposition.

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    2. Ditto:

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    3. The script complains about "to" in this snippet at ID=23. It is not immediately clear to me what the issue is; perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, but I am not sure about its upos. Snippet:
    13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
    14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
    15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
    16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
    17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
    18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
    19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
    20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
    21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
    22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
    23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
    24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

    Relevant span of code:

                if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                    ('ADP','P'),('ADV','P'),('SCONJ','P'),
                    ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                    ('PART','POSS')}:
                    # most often, the single-word lexcat should match its upos
                    # check a list of exceptions
                    mismatchOK = False
                    if xpos=='TO' and lc.startswith('INF'):
                        mismatchOK = True
                    elif (xpos=='TO')!=lc.startswith('INF'):
                        assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                        mismatchOK = True
    
    4. Originator as function:

    (in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

    5. lexcat DISC with ADJ:

    AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

    6. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
    1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
    2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
    3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
    4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
    5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
    6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
    7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
    8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
    9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
    10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
    11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
    12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
    13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
    14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

    7. "NA" is misannotated: this is NA as in North America, i.e. a PROPN/NNP, but it's lemmatized as "no", and its tags are weird.

    AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})
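
Several of the failures above come from a single-word expression whose lexcat disagrees with its upos. The sketch below reproduces that test outside the converter so such tokens can be listed in one pass. The allowed pairs are copied from the quoted span of conllulex2json.py; the function name and the reduced token records are hypothetical, and the real script applies further exceptions that are not modeled here.

```python
# Pairs of (upos, lexcat) that the quoted check accepts without further tests.
ALLOWED = {
    ("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
    ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
    ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
    ("PART", "POSS"),
}


def is_suspicious(upos, xpos, lexcat):
    """Return True if a upos/lexcat pair would need a further exception to pass."""
    if upos == lexcat or (upos, lexcat) in ALLOWED:
        return False
    # Infinitival "to" is accepted when xpos is TO and the lexcat starts with INF.
    if xpos == "TO" and lexcat.startswith("INF"):
        return False
    return True


# Reduced records (word, upos, xpos, lexcat) for tokens reported above.
tokens = [
    ("to", "ADP", "IN", "INF"),
    ("her", "PRON", "PRP", "P"),
    ("over", "ADP", "IN", "P"),
]
flagged = [word for word, upos, xpos, lc in tokens if is_suspicious(upos, xpos, lc)]
print(flagged)
```

Running a pass like this over the whole corpus would separate genuine tagging errors from cases where the converter's exception list is simply too narrow.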

    opened by lgessler 6
  • Prepositional supersense annotations on non-preposition targets

    Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

    21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
    22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
    23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
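
One way to list such cases before running the validator is to scan the strong MWEs for a verbal lexcat paired with a p.-prefixed supersense. This is a sketch over an assumed JSON-style smwes dictionary. The function name is hypothetical, and the sample entries are built by hand: the lexcat V on the first one is an assumption, since the rows above leave that column blank.

```python
def verbal_smwes_with_prep_ss(smwes):
    """Yield (lexlemma, ss) for verbal strong MWEs carrying a p.* supersense."""
    for smwe in smwes.values():
        lexcat = smwe.get("lexcat") or ""
        ss = smwe.get("ss") or ""
        if lexcat.startswith("V") and ss.startswith("p."):
            yield smwe["lexlemma"], ss


# Hand-built sample; the second entry is purely hypothetical.
smwes = {
    "2": {"lexlemma": "give up on", "lexcat": "V",
          "ss": "p.Theme", "ss2": "p.Theme", "toknums": [21, 22, 23]},
    "3": {"lexlemma": "out of", "lexcat": "P",
          "ss": "p.Source", "ss2": "p.Source", "toknums": [30, 31]},
}
print(list(verbal_smwes_with_prep_ss(smwes)))
```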
    
    opened by lgessler 5
  • Prepositions unannotated for supersense

    Token 6:

    # sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
    # text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
    1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
    2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
    3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
    4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
    5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
    6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
    7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
    8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
    9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
    10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
    11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
    12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
    13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
    14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
    15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
    16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
    17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
    18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
    19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    

    I assumed that all preps were supposed to be annotated, but perhaps not?
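
A quick way to gauge how widespread this is would be to scan the .conllulex rows for ADP tokens with nothing in the supersense column. The sketch below assumes the 19-column layout with strong-MWE membership in column 11 and the supersense in column 14, as in the rows above. The function names are hypothetical. Tokens inside an MWE are skipped, on the assumption that only one token of an MWE carries the label.

```python
def unannotated_adps(conllulex_lines):
    """Return (id, word) for ADP tokens that have no supersense and no SMWE."""
    hits = []
    for line in conllulex_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 19:
            continue  # not a well-formed token row
        tok_id, word, upos = cols[0], cols[1], cols[3]
        smwe, ss = cols[10], cols[13]
        if upos == "ADP" and smwe == "_" and ss == "_":
            hits.append((tok_id, word))
    return hits


def row(*head, ss="_"):
    """Build a 19-column row from the first eight fields (sample helper)."""
    return "\t".join(list(head) + ["_"] * 5 + [ss, ss] + ["_"] * 4)


rows = [
    "# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02",
    row("6", "of", "of", "ADP", "IN", "_", "7", "case"),
    row("17", "your", "you", "PRON", "PRP$", "_", "18", "nmod:poss", ss="Possessor"),
]
print(unannotated_adps(rows))
```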

    opened by lgessler 3
  • Apostrophes removed in preprocessing?

    Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?
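
To get a rough count, one could flag tokens that look like clitics stripped of their apostrophe. The sketch below is only a heuristic: the set of bare forms and the restriction to AUX and PART are assumptions for illustration, not something the corpus defines, and legitimate words could be caught by it.

```python
# Bare forms that would normally be written 's, 'm, 're, 've, 'll, 'd (assumed list).
BARE_CLITICS = {"s", "m", "re", "ve", "ll", "d"}


def bare_clitics(tokens):
    """Return words that look like clitics missing an apostrophe.

    tokens is an iterable of (word, upos) pairs.
    """
    return [word for word, upos in tokens
            if word.lower() in BARE_CLITICS and upos in {"AUX", "PART"}]


sample = [("it", "PRON"), ("s", "AUX"), ("pretty", "ADV"), ("n't", "PART")]
print(bare_clitics(sample))
```

Comparing the flagged tokens against the original Reddit text would show whether the apostrophes were missing at the source or dropped along the way.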

    opened by nschneid 2
  • Dataset requested

    Hi all,

    I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

    Thanks for your reply.

    opened by fj-morales 2
  • SNACS supersense tags should start with "p."

    For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

    Special labels like `i `d `c `$ ?? should not start with "p.". In fact, the backtick labels from annotation are not represented as such in STREUSLE; they are reflected in the LEXCAT column of the data.
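
The renaming could be done mechanically. In the sketch below, ordinary supersenses gain the p. prefix while backtick labels and ?? pass through unchanged. The function name is hypothetical, and moving backtick labels into the LEXCAT column is deliberately left out.

```python
def add_p_prefix(label):
    """Prefix a SNACS supersense with "p." unless it is a special label."""
    if label is None or label in {"", "_", "??"}:
        return label
    if label.startswith("`") or label.startswith("p."):
        return label  # backtick labels and already-prefixed tags are left alone
    return "p." + label


print([add_p_prefix(x) for x in ["Locus", "p.Theme", "`i", "??", "_"]])
```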

    opened by nschneid 0
  • Questionable adpositional MWEs

    • in_male_term — from "in male terms"; should be in_term (at most)
    • in_the_first_place
    • in_my_hand — from "in my hands"; should be in_hand (at most)
    • for_quite_some_time — just Duration for, weak MWE?
    • at_all_time — from what should have been "at all times". OK?
    • on_a_smaller_scale — omit adjective?
    • withouth — typo
    • see_as — "seeing as" (deverbal MWE acting like a preposition)
    opened by nschneid 0
  • Some undersegmentation of sentences

    Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

    It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
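
The first step, finding candidates, is easy to sketch: search each sentence's raw text for runs of two or more spaces. The input layout (pairs of sentence ID and text), the sample sentences, and the function name are assumptions. Updating token offsets after a split is not attempted here.

```python
import re

# Two or more consecutive space characters.
MULTISPACE = re.compile(r" {2,}")


def split_candidates(sentences):
    """Return (sent_id, character offsets) for texts containing runs of spaces."""
    found = []
    for sent_id, text in sentences:
        offsets = [m.start() for m in MULTISPACE.finditer(text)]
        if offsets:
            found.append((sent_id, offsets))
    return found


# Hypothetical sample: a quoted claim followed by a response after a double space.
sample = [
    ("ex-01", "> quoted claim" + " " * 2 + "I disagree with that ."),
    ("ex-02", "No extra spaces here ."),
]
print(split_candidates(sample))
```

Each reported offset marks where a run of spaces starts, which is a candidate split point to review by hand.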

    opened by nschneid 0
Releases (v2.0.1)
  • v2.0.1 (Nov 21, 2021)

  • v2.0 (Oct 22, 2021)

    • Switch to full .conllulex format following STREUSLE
      • add lexcats (#3), morphological features, newdoc directives
    • Scripts for validation and format conversion
    • Clean up various annotation issues, including:
      • restore apostrophes and fix other conversion problems (#6, #9)
      • include pretokenized raw text (#12)
  • v1.0.1 (Dec 14, 2020)

    • Added .json file format
    • Switched lemmatization and POS tagging from StanfordNLP 0.2.0 to Stanza 1.1.1
    • Corrected rare encoding issue from v1.0
Owner
NERT @ Georgetown