Transformer Based Korean Sentence Spacing Corrector

Overview

TKOrrector

Transformer Based Korean Sentence Spacing Corrector

Architecture

License Summary

This solution is made available under Apache 2 license. See the LICENSE file.

Minimum Requirements

It is recommended that you run the Trainig on a machine with Nvidia GPU with drivers and CUDA installed.

Prerequisites

  1. Clone this repo and cd into it.

  2. Install dependencies. Preferrably in a virtual env.

    a. Optional: Create new virtual env. Conda example below.
    conda create --name TKOrrector python=3.9 -y
    conda activate TKOrrector

    b. Install PyTorch with CUDA conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

    or

    b. Install PyTorch without GPU conda install pytorch torchvision torchaudio cpuonly -c pytorch

    c. Install dependencies
    pip install -r requirements.txt

Run

You can run the pretrained model without the need to Train.

Download the pretrained model and extract into the current directory (tar zxvf TKOrrector.tar.gz)

sh demo.sh

Example demo run screen and results.
Example Demo Run

Train

Download the Corpus

  1. Go to NIKL Corpus Download Site and apply for a new license.

    The cost is free but you need to sign an agreement. It is recommended that you upload the corpus file on an object storage such as GCS to quickly download on additional machines such as GCP GCE to use a VM with GPU for training as needed without huge upfront cost. Edit src/download_corpus.sh to download the Corpus file and expand it into the designated directory.

    cd src
    sh download_corpus.sh

Run the data prep stage

Change lines 51, 53 in prepare_corpus_with_tokenizer.sh to increase the training dataset size.  
The second argument is the number of files to include into the training set + 1.  
`get_corpus "../data/$CORPUS1/*" 10`  
Above command would include 9 files (manual pdf file is skipped) from the Newspaper corpus.
  1. Run the data prep command.

    sh prepare_corpus_with_tokenizer.sh

Run the training stage

  1. Run the training command.

    sh train.sh

Run the Evaluation

  1. After the training is done, evaluation of the model with test dataset can be performed with batch translations by running the command below.

    sh calculate_metrics.sh

Detailed Dataflow Diagram

Detailed Architecture

Owner
Paul Hyung Yuel Kim
Paul Hyung Yuel Kim
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022
My implementation of Safaricom Machine Learning Codility test. The code has bugs, logical I guess I made errors and any correction will be appreciated.

Safaricom_Codility Machine Learning 2022 The test entails two questions. Question 1 was on Machine Learning. Question 2 was on SQL I ran out of time.

Lawrence M. 1 Mar 03, 2022
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022
An assignment on creating a minimalist neural network toolkit for CS11-747

minnn by Graham Neubig, Zhisong Zhang, and Divyansh Kaushik This is an exercise in developing a minimalist neural network toolkit for NLP, part of Car

Graham Neubig 63 Dec 29, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

Creating a python chatbot that Starbucks users can text to place an order + help cut wait time of a normal coffee.

2 Jan 20, 2022
The tool to make NLP datasets ready to use

chazutsu photo from Kaikado, traditional Japanese chazutsu maker chazutsu is the dataset downloader for NLP. import chazutsu r = chazutsu.data

chakki 243 Dec 29, 2022
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
小布助手对话短文本语义匹配的一个baseline

oppo-text-match 小布助手对话短文本语义匹配的一个baseline 模型 参考:https://kexue.fm/archives/8213 base版本线下大概0.952,线上0.866(单模型,没做K-flod融合)。 训练 测试环境:tensorflow 1.15 + keras

苏剑林(Jianlin Su) 132 Dec 14, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
Translate - a PyTorch Language Library

NOTE PyTorch Translate is now deprecated, please use fairseq instead. Translate - a PyTorch Language Library Translate is a library for machine transl

775 Dec 24, 2022
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

gpt2-poetry The following code is for my senior honor's thesis project, under the guidance of Dr. Keith Holyoak at the University of California, Los A

Ashley Kim 2 Jan 09, 2022
Persian Bert For Long-Range Sequences

ParsBigBird: Persian Bert For Long-Range Sequences The Bert and ParsBert algorithms can handle texts with token lengths of up to 512, however, many ta

Sajjad Ayoubi 63 Dec 14, 2022
A programming language with logic of Python, and syntax of all languages.

Pytov The idea was to take all well known syntaxes, and combine them into one programming language with many posabilities. Installation Install using

Yuval Rosen 14 Dec 07, 2022
RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

Jash Mota 20 Jul 14, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
NLP Overview

NLP-Overview Introduction The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human langu

PeterPham 1 Jan 13, 2022
Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for it's user.

J.A.R.V.I.S Kindly consider starring this repository if you like the program :-) What/Who is J.A.R.V.I.S? J.A.R.V.I.S is an chatbot written that is bu

Epicalable 50 Dec 31, 2022