PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Last update: Dec 02, 2021

PASTRIE

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.

Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

English
French
German
Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
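
As a minimal sketch of the notation, assuming the scene role and function are stored as two separate labels (as with the `ss` and `ss2` fields seen in the data), a construal string could be rendered as shown here. The function name `format_construal` is hypothetical, and collapsing an identical role and function into a single label is an assumed display convention rather than something this README specifies.

```python
def format_construal(scene_role: str, function: str) -> str:
    """Render a construal in SceneRole↝Function notation.

    Assumption: when both labels are identical, only one is shown.
    """
    if scene_role == function:
        return scene_role
    return f"{scene_role}↝{function}"


print(format_construal("SocialRel", "Gestalt"))
print(format_construal("Locus", "Locus"))
```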

Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

- .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
- .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
- .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
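
For readers starting from the JSON format, the following is a rough sketch of how supersense-bearing expressions might be collected. It assumes the file is a list of sentence objects with `sent_id`, `toks`, `swes`, and `smwes` keys, which is how STREUSLE's JSON is organized. The inline sample is hypothetical and only mirrors the field names visible in the error messages quoted further down this page. The helper name `supersense_expressions` is also an assumption.

```python
import json

# Hypothetical excerpt; the layout (a list of sentences, each with "toks",
# "swes" and "smwes") is assumed from the STREUSLE JSON convention.
SAMPLE = """
[
  {
    "sent_id": "example-01",
    "toks": [
      {"#": 1, "word": "over", "lemma": "over", "upos": "ADP"},
      {"#": 2, "word": "there", "lemma": "there", "upos": "ADV"}
    ],
    "swes": {
      "1": {"lexlemma": "over", "lexcat": "P", "ss": "p.Locus", "ss2": "p.Locus", "toknums": [1]},
      "2": {"lexlemma": "there", "lexcat": "ADV", "ss": null, "ss2": null, "toknums": [2]}
    },
    "smwes": {}
  }
]
"""


def supersense_expressions(sentences):
    """Yield (sent_id, lexlemma, ss, ss2) for expressions that carry a supersense."""
    for sent in sentences:
        for group in ("swes", "smwes"):
            for expr in sent.get(group, {}).values():
                if expr.get("ss") is not None:
                    yield sent["sent_id"], expr["lexlemma"], expr["ss"], expr["ss2"]


for row in supersense_expressions(json.loads(SAMPLE)):
    print(row)
```

With a real release file, the `json.loads(SAMPLE)` call would be replaced by `json.load` on the opened .json file.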

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

- Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
  - PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (as multiword tokens, where a single orthographic word contains multiple syntactic words).
  - PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
  - In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
  - PASTRIE lacks enhanced dependencies.
- Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
  - Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
- Noun and verb expressions in PASTRIE do not have supersense labels.

Comments

Misc. annotation errors and/or conversion script bugs

There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

vs mistagged as a noun--should be prep

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

ditto

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:

13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

Relevant span of code:

            if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                ('ADP','P'),('ADV','P'),('SCONJ','P'),
                ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                ('PART','POSS')}:
                # most often, the single-word lexcat should match its upos
                # check a list of exceptions
                mismatchOK = False
                if xpos=='TO' and lc.startswith('INF'):
                    mismatchOK = True
                elif (xpos=='TO')!=lc.startswith('INF'):
                    assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                    mismatchOK = True
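
To make the failure easier to reason about, here is a standalone restatement of only the checks quoted above, returning a status string instead of asserting. It is a sketch, not the script's actual code: the function name `check_quoted_span` is hypothetical, the `validate_pos` flag is ignored, and any exceptions the real script handles after this span are reported as "undetermined".

```python
# Pairs of (upos, lexcat) that the quoted condition accepts outright.
ALLOWED_PAIRS = {
    ("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
    ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
    ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
    ("PART", "POSS"),
}


def check_quoted_span(upos, xpos, lexcat, lexlemma):
    """Return 'ok', 'fail', or 'undetermined' for the quoted checks only."""
    if upos == lexcat or (upos, lexcat) in ALLOWED_PAIRS:
        return "ok"
    is_to = xpos == "TO"
    is_inf = lexcat.startswith("INF")
    if is_to and is_inf:
        return "ok"
    if is_to != is_inf:
        # The script asserts here: only 'for' tagged SCONJ or ADP passes.
        if upos in ("SCONJ", "ADP") and lexlemma == "for":
            return "ok"
        return "fail"
    return "undetermined"  # left to later checks that are not quoted above


# Token 23 from the snippet: 'to' tagged ADP/IN with lexcat INF.
print(check_quoted_span("ADP", "IN", "INF", "to"))
```

Under this reading, the token fails because its lexcat is INF while its xpos is not TO, and the only exception the assertion allows is the lemma "for".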

Originator as function:

(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

lexcat DISC with ADJ:

AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

"her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:

1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

"NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

opened by lgessler 6

Prepositional supersense annotations on non-preposition targets

Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_

opened by lgessler 5

Prepositions unannotated for supersense

Token 6:

# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
# text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_

I assumed that all preps were supposed to be annotated, but perhaps not?
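
A quick way to survey such cases is to scan the .conllulex rows for ADP tokens whose supersense column is empty. The sketch below assumes the 19 columns follow STREUSLE's order (with SS and SS2 in the 14th and 15th positions, which matches the snippets on this page). The helper names and the `make_row` convenience function are hypothetical. Tokens inside multiword expressions or idioms may legitimately lack a label, so the output is only a list of candidates.

```python
# Column names assumed from the STREUSLE CoNLL-U-Lex convention.
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
    "SMWE", "LEXCAT", "LEXLEMMA", "SS", "SS2", "WMWE", "WCAT", "WLEMMA", "LEXTAG",
]


def parse_row(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def unannotated_adpositions(lines):
    """Yield rows whose UPOS is ADP but whose SS column is empty ('_')."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        row = parse_row(line)
        if row["UPOS"] == "ADP" and row["SS"] == "_":
            yield row


def make_row(tok_id, form, lemma, upos, xpos, head, deprel, ss="_", ss2="_"):
    """Build a 19-column row for the demo, leaving other columns as '_'."""
    cols = [str(tok_id), form, lemma, upos, xpos, "_", str(head), deprel, "_", "_",
            "_", "_", "_", ss, ss2, "_", "_", "_", "_"]
    return "\t".join(cols)


rows = [
    "# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02",
    make_row(6, "of", "of", "ADP", "IN", 7, "case"),
    make_row(17, "your", "you", "PRON", "PRP$", 18, "nmod:poss", "Possessor", "Possessor"),
]
for row in unannotated_adpositions(rows):
    print(row["ID"], row["FORM"])
```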

opened by lgessler 3

Apostrophes removed in preprocessing?

Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

opened by nschneid 2

Dataset requested

Hi all,

I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

Thanks for your reply.

opened by fj-morales 2

SNACS supersense tags should start with "p."

For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.
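
A conversion step along these lines might look like the sketch below. The function name `add_snacs_prefix` is hypothetical, and the set of special labels is assumed to be exactly the five listed above. Empty values ("_") are left untouched.

```python
# Special labels named in this issue; assumed here to be the complete set.
SPECIAL_LABELS = {"`i", "`d", "`c", "`$", "??"}


def add_snacs_prefix(label: str) -> str:
    """Prefix a supersense label with 'p.' unless it is empty, special, or already prefixed."""
    if label == "_" or label in SPECIAL_LABELS or label.startswith("p."):
        return label
    return "p." + label


for label in ["Locus", "p.Theme", "`i", "??", "_"]:
    print(label, "->", add_snacs_prefix(label))
```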

opened by nschneid 0

Questionable adpositional MWEs
in_male_term — from "in male terms"; should be in_term (at most)

in_the_first_place

in_my_hand — from "in my hands"; should be in_hand (at most)

for_quite_some_time — just Duration for, weak MWE?

at_all_time — from what should have been "at all times". OK?

on_a_smaller_scale — omit adjective?

withouth — typo

see_as — "seeing as" (deverbal MWE acting like a preposition)
opened by nschneid 0

Some undersegmentation of sentences

Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
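
As a starting point for the detection half only, the following sketch lists the character offsets of runs of two or more spaces in a sentence's raw text. The function name and the example string are hypothetical, and the harder part, updating token offsets in each parse after a split, is not attempted here.

```python
import re

# A run of two or more literal spaces marks a candidate split point.
MULTI_SPACE = re.compile(r" {2,}")


def candidate_split_offsets(raw_text):
    """Return the character offsets at which a run of 2+ spaces begins."""
    return [m.start() for m in MULTI_SPACE.finditer(raw_text)]


# Hypothetical raw text: a blockquoted sentence followed by a response.
text = "> I never said that .  Yes you did ."
print(candidate_split_offsets(text))
```

Each reported offset still needs manual review, since some of these gaps precede emoticons or other appendages that should stay in the same sentence.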

opened by nschneid 0

Releases (v2.0.1)

v2.0.1 (Nov 21, 2021)
Fixes 3 erroneous sentence IDs (along with beefed up sentence ID validation in scripts). (#16)

v2.0 (Oct 22, 2021)
Switch to full .conllulex format following STREUSLE

add lexcats (#3), morphological features, newdoc directives

Scripts for validation and format conversion

Clean up various annotation issues, including:

restoring apostrophes and fixing other conversion problems (#6, #9)

include pretokenized raw text (#12)

v1.0.1 (Dec 14, 2020)
Added .json file format

Switched lemmatization and POS tagging from StanfordNLP 0.2.0 to Stanza 1.1.1

Corrected rare encoding issue from v1.0

v1.0 (Dec 12, 2020)


Owner

NERT @ Georgetown
