This repository provides implementations of data augmentation techniques for Japanese NLP.

Last update: Nov 11, 2022

Overview

daaja

This repository implements the following data augmentation techniques for Japanese NLP:

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
An Analysis of Simple Data Augmentation for Named Entity Recognition

Install

pip install daaja

How to use

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Command

python -m daaja.eda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

1	この映画はとてもおもしろい
0	つまらない映画だった
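
As a minimal sketch, assuming each line holds a label and a sentence separated by a single tab (as in the sample above), a file in this format could be written and read back with plain Python. The helper names `write_tsv` and `read_tsv` are hypothetical and are not part of daaja.

```python
import os
import tempfile


def write_tsv(path, rows):
    """Write (label, text) pairs as one 'label<TAB>text' line each."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in rows:
            f.write(f"{label}\t{text}\n")


def read_tsv(path):
    """Read 'label<TAB>text' lines back into (label, text) pairs."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so the text may itself contain tabs.
            label, text = line.split("\t", 1)
            rows.append((label, text))
    return rows


rows = [("1", "この映画はとてもおもしろい"), ("0", "つまらない映画だった")]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.tsv")
    write_tsv(path, rows)
    loaded = read_tsv(path)
```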

In Python

from daaja.eda import EasyDataAugmentor
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']
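
To illustrate what the `alpha_rs` and `p_rd` parameters control, here is a rough sketch of two of the four EDA operations (random swap and random deletion), following the convention of the EDA paper where the number of edits is a fraction of the sentence length. This is not daaja's implementation: the functions `random_swap` and `random_deletion` are hypothetical, the sentence is tokenized by hand, and synonym replacement and random insertion are left out because they need a thesaurus.

```python
import random


def random_swap(tokens, alpha_rs, rng):
    """Swap two randomly chosen positions, n = max(1, int(alpha_rs * len)) times."""
    tokens = list(tokens)  # work on a copy
    if len(tokens) < 2:
        return tokens
    n = max(1, int(alpha_rs * len(tokens)))
    for _ in range(n):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p_rd, rng):
    """Drop each token with probability p_rd, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p_rd]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # seeded only to make the sketch repeatable
# Hand-tokenized version of the sentence used above (an assumption).
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]
swapped = random_swap(tokens, alpha_rs=0.1, rng=rng)
deleted = random_deletion(tokens, p_rd=0.1, rng=rng)
print("".join(swapped))
print("".join(deleted))
```

With a six-token sentence and `alpha_rs=0.1`, exactly one pair of tokens is swapped, which matches the kind of change seen in the first augmented sentence above.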

An Analysis of Simple Data Augmentation for Named Entity Recognition

Command

python -m daaja.ner_sda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

私	O
は	O
田中	B-PER
と	O
いい	O
ます	O
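
A file in this layout can be turned into the token and label lists used by the Python API with a few lines of code. The sketch below assumes one `token<TAB>label` pair per line and a blank line between sentences (the usual CoNLL convention, which the sample above does not confirm); `parse_conll` is a hypothetical helper, not a daaja function.

```python
def parse_conll(text):
    """Split 'token<TAB>label' lines into per-sentence token and label lists."""
    tokens_list, labels_list = [], []
    tokens, labels = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (assumed convention).
            if tokens:
                tokens_list.append(tokens)
                labels_list.append(labels)
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence if the text has no trailing blank line
        tokens_list.append(tokens)
        labels_list.append(labels)
    return tokens_list, labels_list


sample = "私\tO\nは\tO\n田中\tB-PER\nと\tO\nいい\tO\nます\tO\n"
tokens_list, labels_list = parse_conll(sample)
print(tokens_list)
print(labels_list)
```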

In Python

from daaja.ner_sda import SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(tokens_list=tokens_list, labels_list=labels_list,
                                            p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
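
The fourth augmented sentence above, where 株式 会社 A becomes 筑波 大学 and the labels shrink accordingly, looks like the mention replacement operation described in the paper. The following is a rough sketch of that idea only, under the assumption that labels use the BIO scheme; it is not daaja's implementation, and `extract_mentions`, `build_mention_dict` and `mention_replacement` are hypothetical names.

```python
import random
from collections import defaultdict


def extract_mentions(labels):
    """Return (start, end, entity_type) spans from BIO labels; end is exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not label.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def build_mention_dict(tokens_list, labels_list):
    """Collect every mention in the training data, grouped by entity type."""
    mentions = defaultdict(list)
    for tokens, labels in zip(tokens_list, labels_list):
        for start, end, etype in extract_mentions(labels):
            mentions[etype].append(tokens[start:end])
    return mentions


def mention_replacement(tokens, labels, mentions, p_mr, rng):
    """Replace each mention, with probability p_mr, by one of the same type."""
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, etype in extract_mentions(labels):
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        span = tokens[start:end]
        if mentions.get(etype) and rng.random() < p_mr:
            span = rng.choice(mentions[etype])
        new_tokens += span
        # Re-label the (possibly shorter or longer) span in BIO format.
        new_labels += [f"B-{etype}"] + [f"I-{etype}"] * (len(span) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


train_tokens = [["私", "は", "田中", "と", "いい", "ます"], ["茨城", "大学"]]
train_labels = [["O", "O", "B-PER", "O", "O", "O"], ["B-ORG", "I-ORG"]]
mentions = build_mention_dict(train_tokens, train_labels)

tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
new_tokens, new_labels = mention_replacement(
    tokens, labels, mentions, p_mr=1.0, rng=random.Random(0)
)
print(new_tokens)
print(new_labels)
```

Because the replacement mention may have a different length from the original one, the labels are rebuilt for the new span rather than copied, which keeps tokens and labels aligned.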

Comments

Too many progress bars

When I use EasyDataAugmentor during training, the console fills up with progress bars.

Could you make the tqdm call on line 19 of sequential_flow.py optional, so that it can be switched on or off when constructing EasyDataAugmentor? https://github.com/kajyuuen/daaja/blob/12835943868d43f5c248cf1ea87ab60f67a6e03d/daaja/flows/sequential_flow.py#L19

opened by Yongtae723 6
Error on from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor

After installing daaja with pip, running from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor raises the following error: ConnectionError: HTTPConnectionPool(host='compling.hss.ntu.edu.sg', port=80): Max retries exceeded with url: /wnja/data/1.1/wnjpn.db.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3b6a6cced0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

opened by naoki1213mj 5
Is it possible to run this on a GPU?

Hi!

Thank you for the great library. When I train with this augmentation, it takes much more time than the forward and backward passes.

Would it be possible to run the augmentation on a GPU to save time?

Thank you

opened by Yongtae723 3
Bump joblib from 1.1.0 to 1.2.0
Bumps joblib from 1.1.0 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

Vendor loky 3.3.0 which fixes several bugs including:

robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

avoiding leaking worker processes in case of nested loky parallel calls;

reliably spawning the correct number of reusable workers.

Release 1.1.1

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Commits

5991350 Release 1.2.0

3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)

cea26ff CI test the future loky-3.3.0 branch (#1338)

8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)

067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)

ac4ebd5 MAINT add back pytest warnings plugin (#1337)

a23427d Test child raises parent exits cleanly more reliable on macos (#1335)

ac09691 [MAINT] various test updates (#1334)

4a314b1 Vendor loky 3.2.0 (#1333)

bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...

dependencies
opened by dependabot[bot] 0
Implement Data Augmentation using Pre-trained Transformer Models
paper

Data Augmentation using Pre-trained Transformer Models

code

https://github.com/varunkumar-dev/TransformersDataAugmentation

ref

https://www.ai-shift.co.jp/techblog/1939

add-new-technique
opened by kajyuuen 0
Implement Contextual Augmentation
Paper

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

Code

https://github.com/pfnet-research/contextual_augmentation

add-new-technique
opened by kajyuuen 0
Implement MixText
Paper

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Code

https://github.com/GT-SALT/MixText

add-new-technique
opened by kajyuuen 0

Releases(v0.0.7)

v0.0.7(Oct 24, 2022)
Changes

Change pytest @kajyuuen (#35 #37 #38)

Change WORDNER_URL @kajyuuen (#34)

daaja-0.0.7-py3-none-any.whl(18.19 KB)
v0.0.6(Mar 3, 2022)
Changes

Update version @kajyuuen (#27)

Add verbose option @kajyuuen (#25)

📖 Documentation

Add README_ja.md and Update README.md @kajyuuen (#26)

v0.0.5(Feb 27, 2022)
Changes

💪 Enhancement

Add ContextualAugmentor @kajyuuen (#23)

Add BackTranslationAugmentor @kajyuuen (#21 , #22)

📖 Documentation

Add quick_example @kajyuuen (#17)

v0.0.4(Feb 21, 2022)
Changes

Release v0.0.4 @kajyuuen (#16)

Chore add release drafter @kajyuuen (#6)

💪 Enhancement

Add tqdm @kajyuuen (#8)

📖 Documentation

Refactoring @kajyuuen (#15)

Add SDA example @kajyuuen (#9)

Add EDA example @kajyuuen (#7)

v0.0.3(Feb 13, 2022)

daaja-0.0.3-py3-none-any.whl(14.80 KB)
v0.0.2(Feb 13, 2022)

daaja-0.0.2-py3-none-any.whl(14.97 KB)

Owner

Koga Kobayashi
