KaziText

KaziText is a tool for modelling common human errors. It estimates the probabilities of individual error types (so-called aspects) from grammatical error correction (GEC) corpora in M2 format.

The tool was introduced in the paper "Understanding Model Robustness to User-generated Noisy Texts".

Requirements

The requirements are listed in requirements.txt. In addition, a UDPipe model must be downloaded for each language used (see http://hdl.handle.net/11234/1-3131) and linked in udpipe_tokenizer.py.

Overview

KaziText defines a set of aspects, located in the aspects directory. They model the following phenomena:

  • Casing Errors
  • Common Other Errors (for most common phrases)
  • Errors in Diacritics
  • Punctuation Errors
  • Spelling Errors
  • Suffix/Prefix Errors (wrongly used suffixes or prefixes)
  • Whitespace Errors
  • Word-Order Errors

Each aspect has a set of internal probabilities (e.g. the probability of a user typing the first letter of a starting word in lower case instead of upper case) that are estimated from M2 GEC corpora.
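To make the idea of an internal probability concrete, here is a minimal sketch of how one such value could be counted from M2 annotations. It is not the estimator used by KaziText: the function names parse_m2 and first_letter_lowercase_rate and the sample data are hypothetical, and the sketch ignores multiple annotators and edits spanning more than the first token.

```python
def parse_m2(text):
    """Yield (tokens, edits) per sentence; edits are (start, end, correction)."""
    for block in text.strip().split("\n\n"):
        tokens, edits = [], []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                # M2 edit line: "A start end|||type|||correction|||..."
                span, _type, correction = line[2:].split("|||")[:3]
                start, end = map(int, span.split())
                edits.append((start, end, correction))
        if tokens:
            yield tokens, edits


def first_letter_lowercase_rate(m2_text):
    """Share of sentences that should start upper-cased but were typed lower-cased."""
    errors = opportunities = 0
    for tokens, edits in parse_m2(m2_text):
        corrected_first = tokens[0]
        for start, end, correction in edits:
            if start == 0 and end == 1 and correction:
                corrected_first = correction.split()[0]
        if not corrected_first[:1].isupper():
            continue  # not an opportunity for this kind of error
        opportunities += 1
        written = tokens[0]
        if written[:1].islower() and written.lower() == corrected_first.lower():
            errors += 1
    return errors / opportunities if opportunities else 0.0


# Hypothetical toy corpus, for illustration only.
SAMPLE_M2 = """\
S the cat sat .
A 0 1|||R:ORTH|||The|||REQUIRED|||-NONE-|||0

S She go home .
A 1 2|||R:VERB:SVA|||goes|||REQUIRED|||-NONE-|||0

S it rains .
A 0 1|||R:ORTH|||It|||REQUIRED|||-NONE-|||0

S We agree .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""

print(first_letter_lowercase_rate(SAMPLE_M2))  # 0.5
```

Two of the four sample sentences were typed with a lower-case first letter, so the estimated probability is 0.5.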

A complete set of aspects with their internal probabilities is called a profile. We provide precomputed profiles for Czech, English, Russian and German as JSON files in the profiles directory. The profiles are additionally split into dev and test. There are also 4 profiles for Czech and 2 profiles for English that differ in the underlying user domain (e.g. native speakers vs. second-language learners).

To noise a text using a profile, use:

python introduce_errors.py $infile $outfile $profile $lang 

The introduce_errors.py script offers a variety of switches (run python introduce_errors.py --help to display them). A noteworthy one is --alpha, which regulates the final text error rate: set it to a value lower than 1 to reduce the number of errors, or to a value greater than 1 to produce noisier texts. Apart from the profiles themselves, we also provide precomputed sets of alphas, stored as .csv files in the respective profiles folders. They contain the alpha values needed to reach final word error rates from 5 to 30%, as well as the so-called reference-alpha, which yields the same word error rate as the original M2 files the profile was estimated from. For example, to obtain noisy text at roughly 5% word error rate using the Romani profile, use --profile dev/cs_romi.json --alpha 0.2.
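As an intuition for what --alpha does, the sketch below treats alpha as a plain multiplier on an error probability, clamped to the range [0, 1]. This is an assumption made for illustration, not a description of how introduce_errors.py applies alpha internally; the helper names scaled and swap_case_noise are hypothetical.

```python
import random


def scaled(p, alpha):
    """Scale a base error probability by alpha and clamp it to [0, 1]."""
    return min(1.0, max(0.0, p * alpha))


def swap_case_noise(words, base_p, alpha, seed=0):
    """Toy noiser: flip the case of each word's first letter
    with probability scaled(base_p, alpha)."""
    rng = random.Random(seed)
    p = scaled(base_p, alpha)
    return [w[:1].swapcase() + w[1:] if rng.random() < p else w for w in words]


words = "the quick brown fox jumps over the lazy dog".split()
print(swap_case_noise(words, base_p=0.3, alpha=0.0))  # alpha 0: text unchanged
print(swap_case_noise(words, base_p=0.3, alpha=1.0))  # profile's own error rate
print(swap_case_noise(words, base_p=0.3, alpha=3.0))  # noisier text
```

Under this reading, alpha below 1 shrinks every error probability and alpha above 1 inflates it, which matches the behaviour described above.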

Moreover, we provide several scripts (noise*.py) for noising specific data formats.

To estimate a profile for a given M2 file, run:

python estimate_all_ratios.py $m2_pattern $outfile

To estimate a file of normalization alphas, see estimate_alpha.sh. It describes the iterative process of noising clean texts with a given alpha, measuring the noisiness of the result, and adjusting alpha accordingly.
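The iterative process can be pictured as a search over alpha. The sketch below uses bisection with a toy noiser and a simplified word error rate. It is only an illustration under stated assumptions, not the procedure in estimate_alpha.sh: toy_noise, word_error_rate, estimate_alpha and all parameter values are hypothetical, and the error rate is assumed to grow monotonically with alpha.

```python
import random


def toy_noise(words, alpha, base_p=0.1, seed=0):
    """Toy stand-in for a real noiser: upper-case a word with probability alpha * base_p."""
    rng = random.Random(seed)
    p = min(1.0, alpha * base_p)
    return [w.upper() if rng.random() < p else w for w in words]


def word_error_rate(clean, noisy):
    """Fraction of positions that differ (valid here because the toy noiser
    never inserts or deletes words)."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


def estimate_alpha(clean, target_wer, lo=0.0, hi=10.0, tol=0.005, max_iter=50):
    """Bisect on alpha until the measured error rate is within tol of the target."""
    alpha, wer = lo, 0.0
    for _ in range(max_iter):
        alpha = (lo + hi) / 2
        wer = word_error_rate(clean, toy_noise(clean, alpha))
        if abs(wer - target_wer) <= tol:
            break
        if wer < target_wer:
            lo = alpha  # too clean: raise alpha
        else:
            hi = alpha  # too noisy: lower alpha
    return alpha, wer


clean = ["word"] * 2000
alpha, wer = estimate_alpha(clean, target_wer=0.05)
print(f"alpha={alpha:.3f} gives WER={wer:.3f}")
```

A fixed seed keeps the toy noiser deterministic, so the measured error rate is a non-decreasing function of alpha and bisection is safe to use. With a real, stochastic noiser one would likely average several runs per alpha.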

Other notes

  • Russian RULEC-GEC was normalized using normalize_russian_m2.py

Owner

ÚFAL, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University