UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Overview

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

This repository contains UA-GEC data and an accompanying Python library.

Data

All corpus data and metadata stay under the ./data. It has two subfolders for train and test splits

Each split (train and test) has further subfolders for different data representations:

./data/{train,test}/annotated stores documents in the annotated format

./data/{train,test}/source and ./data/{train,test}/target store the original and the corrected versions of documents. Text files in these directories are plain text with no annotation markup. These files were produced from the annotated data and are, in some way, redundant. We keep them because this format is convenient in some use cases.

Metadata

./data/metadata.csv stores per-document metadata. It's a CSV file with the following fields:

  • id (str): document identifier.
  • author_id (str): document author identifier.
  • is_native (int): 1 if the author is native-speaker, 0 otherwise
  • region (str): the author's region of birth. A special value "Інше" is used both for authors who were born outside Ukraine and authors who preferred not to specify their region.
  • gender (str): could be "Жіноча" (female), "Чоловіча" (male), or "Інша" (other).
  • occupation (str): one of "Технічна", "Гуманітарна", "Природнича", "Інша"
  • submission_type (str): one of "essay", "translation", or "text_donation"
  • source_language (str): for submissions of the "translation" type, this field indicates the source language of the translated text. Possible values are "de", "en", "fr", "ru", and "pl".
  • annotator_id (int): ID of the annotator who corrected the document.
  • partition (str): one of "test" or "train"
  • is_sensitive (int): 1 if the document contains profanity or offensive language

Annotation format

Annotated files are text files that use the following in-text annotation format: {error=>edit:::error_type=Tag}, where error and edit stand for the text item before and after correction respectively, and Tag denotes an error category (Grammar, Spelling, Punctuation, or Fluency).

Example of an annotated sentence:

    I {likes=>like:::error_type=Grammar} turtles.

An accompanying Python package, ua_gec, provides many tools for working with annotated texts. See its documentation for details.

Train-test split

We expect users of the corpus to train and tune their models on the train split only. Feel free to further split it into train-dev (or use cross-validation).

Please use the test split only for reporting scores of your final model. In particular, never optimize on the test set. Do not tune hyperparameters on it. Do not use it for model selection in any way.

Next section lists the per-split statistics.

Statistics

UA-GEC contains:

Split Documents Sentences Tokens Authors
train 851 18,225 285,247 416
test 160 2,490 43,432 76
TOTAL 1,011 20,715 328,779 492

See stats.txt for detailed statistics generated by the following command (ua-gec must be installed first):

$ make stats

Python library

Alternatively to operating on data files directly, you may use a Python package called ua_gec. This package includes the data and has classes to iterate over documents, read metadata, work with annotations, etc.

Getting started

The package can be easily installed by pip:

    $ pip install ua_gec==1.1

Alternatively, you can install it from the source code:

    $ cd python
    $ python setup.py develop

Iterating through corpus

Once installed, you may get annotated documents from the Python code:

    
    >>> from ua_gec import Corpus
    >>> corpus = Corpus(partition="train")
    >>> for doc in corpus:
    ...     print(doc.source)         # "I likes it."
    ...     print(doc.target)         # "I like it."
    ...     print(doc.annotated)      # like} it.")
    ...     print(doc.meta.region)    # "Київська"

Note that the doc.annotated property is of type AnnotatedText. This class is described in the next section

Working with annotations

ua_gec.AnnotatedText is a class that provides tools for processing annotated texts. It can iterate over annotations, get annotation error type, remove some of the annotations, and more.

While we're working on a detailed documentation, here is an example to get you started. It will remove all Fluency annotations from a text:

    >>> from ua_gec import AnnotatedText
    >>> text = AnnotatedText("I {likes=>like:::error_type=Grammar} it.")
    >>> for ann in text.iter_annotations():
    ...     print(ann.source_text)       # likes
    ...     print(ann.top_suggestion)    # like
    ...     print(ann.meta)              # {'error_type': 'Grammar'}
    ...     if ann.meta["error_type"] == "Fluency":
    ...         text.remove(ann)         # or `text.apply(ann)`

Contributing

  • The data collection is an ongoing activity. You can always contribute your Ukrainian writings or complete one of the writing tasks at https://ua-gec-dataset.grammarly.ai/

  • Code improvements and document are welcomed. Please submit a pull request.

Contacts

Owner
Grammarly
Millions of users rely on Grammarly's AI-powered products to make their messages, documents, and social media posts clear, mistake-free, and impactful.
Grammarly
2021 2학기 데이터크롤링 기말프로젝트

공지 주제 웹 크롤링을 이용한 취업 공고 스케줄러 스케줄 주제 정하기 코딩하기 핵심 코드 설명 + 피피티 구조 구상 // 12/4 토 피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토 영상 편집 // ~12/11 토 웹크롤러 사람인_평균

Choi Eun Jeong 2 Aug 16, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Texar-PyTorch is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar

ASYML 726 Dec 30, 2022
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
Malware-Related Sentence Classification

Malware-Related Sentence Classification This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Clas

Chau Nguyen 1 Mar 26, 2022
Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

🌳 Fingerprinting Fine-tuned Language Models in the wild This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned La

LCS2-IIITDelhi 5 Sep 13, 2022
Problem: Given a nepali news find the category of the news

Classification of category of nepali news catorgory using different algorithms Problem: Multiclass Classification Approaches: TFIDF for vectorization

pudasainishushant 2 Jan 09, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

fake-news-detector-1.0 Lists, lists and more lists... Spam filter list, quality keyword list, stoplist list, top-domains urls list, news agencies webs

Memo Sim 1 Jan 04, 2022
Knowledge Oriented Programming Language

KoPL: 面向知识的推理问答编程语言 安装 | 快速开始 | 文档 KoPL全称 Knowledge oriented Programing Language, 是一个为复杂推理问答而设计的编程语言。我们可以将自然语言问题表示为由基本函数组合而成的KoPL程序,程序运行的结果就是问题的答案。目前,

THU-KEG 62 Dec 12, 2022
This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

private-transformers This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers. What is this? Why

Xuechen Li 73 Dec 28, 2022
PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop

molten A minimal, extensible, fast and productive API framework for Python 3. Changelog: https://moltenframework.com/changelog.html Community: https:/

3.2k Dec 28, 2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

Clova AI Research 94 Dec 30, 2022
This repository structures data in title, summary, tags, sentiment given a fragment of a conversation

Understand-conversation-AI This repository structures data in title, summary, tags, sentiment given a fragment of a conversation How to install: pip i

Juan Camilo López Montes 1 Jan 11, 2022
Persian Bert For Long-Range Sequences

ParsBigBird: Persian Bert For Long-Range Sequences The Bert and ParsBert algorithms can handle texts with token lengths of up to 512, however, many ta

Sajjad Ayoubi 63 Dec 14, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022