Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Overview

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets


What is LASSL

LASSL stands for LAnguage Semi-Supervised Learning. It lets anyone with data pretrain their own language model easily, building on Huggingface's Transformers and Datasets libraries.

Environment setting

Install the required packages with the command below,

pip3 install -r requirements.txt

or set up the environment with poetry.

# install poetry
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
# set up dependencies with poetry
poetry install

How to Use

1. Train Tokenizer

python3 train_tokenizer.py \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --sampling_ratio $SAMPLING_RATIO \
    --model_type $MODEL_TYPE \
    --vocab_size $VOCAB_SIZE \
    --min_frequency $MIN_FREQUENCY
# with poetry
poetry run python3 train_tokenizer.py \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --sampling_ratio $SAMPLING_RATIO \
    --model_type $MODEL_TYPE \
    --vocab_size $VOCAB_SIZE \
    --min_frequency $MIN_FREQUENCY
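
For reference, here is a hedged sketch of the kind of training this step performs, assuming a byte-level BPE tokenizer; the corpus path and output directory are illustrative stand-ins for the CLI flags above, not the repo's actual defaults.

# Minimal sketch (assumptions noted above), using Huggingface tokenizers
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpora/corpus.txt"],  # text files gathered from --corpora_dir
    vocab_size=51200,              # --vocab_size
    min_frequency=2,               # --min_frequency
)
tokenizer.save_model("tokenizers/roberta")  # later passed as --tokenizer_dir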

2. Serialize Corpora

python3 serialize_corpora.py \
    --model_type $MODEL_TYPE \
    --tokenizer_dir $TOKENIZER_DIR \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --max_length $MAX_LENGTH \
    --num_proc $NUM_PROC
# with poetry
poetry run python3 serialize_corpora.py \
    --model_type $MODEL_TYPE \
    --tokenizer_dir $TOKENIZER_DIR \
    --corpora_dir $CORPORA_DIR \
    --corpus_type $CORPUS_TYPE \
    --max_length $MAX_LENGTH \
    --num_proc $NUM_PROC
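
Conceptually, serialization tokenizes the raw corpora with the trained tokenizer and writes the encoded dataset to disk via Huggingface Datasets. A hedged sketch, with illustrative paths and values:

# Sketch only: the real script's internals may differ
from datasets import load_dataset
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizers/roberta")  # --tokenizer_dir
raw = load_dataset("text", data_files={"train": "corpora/corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)  # --max_length

encoded = raw.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])  # --num_proc
encoded.save_to_disk("datasets/roberta")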

3. Pretrain Language Model

python3 pretrain_language_model.py --config_path $CONFIG_PATH
# with poetry
poetry run python3 pretrain_language_model.py --config_path $CONFIG_PATH
# Use the command below for TPU training (the poetry environment does not provide PyTorch XLA by default).
python3 xla_spawn.py --num_cores $NUM_CORES pretrain_language_model.py --config_path $CONFIG_PATH
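
What pretraining boils down to, sketched with a masked-LM objective; the actual script assembles model, collator, and trainer from the YAML config, so every name and hyperparameter below is an assumption:

# Sketch only: config-driven in the real script
from datasets import load_from_disk
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizers/roberta")
train_dataset = load_from_disk("datasets/roberta")["train"]
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints", per_device_train_batch_size=8),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()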

Contributors

김보섭, 류민호, 류인제, 박장원, 김형석

Acknowledgements

LASSL was built with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program.

Comments
  • Ready to release v0.1.0

    Summary

    The overall structure is basically in place. Before releasing v0.1.0, the following points need discussion:

    • The model_type values supported by serialize_corpora.py and train_tokenizer.py diverge:
      • serialize_corpora.py: roberta, gpt2, albert
      • train_tokenizer.py: bert-uncased, bert-cased, gpt2, roberta, albert, electra
    • README.md
    help wanted 
    opened by seopbo 10
  • Refactor codes relevant to pretrain

    • Add a DataCollatorFor{MODEL} for each PLM to be trained.
    • Define model_type_to_collator in pretrain_language_model.py to pick the collator per model_type (see the sketch below).
      • The args for each collator (e.g. mlm_probability) come from the collator section of the config file.
    • Add code to pretrain_language_model.py for using an eval_dataset.
      • The test_size arg for building the eval_dataset comes from the data section of the config file.
    • Besides that, ran isort and black.
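
    A hypothetical sketch of that dispatch; the class and key names are illustrative, not the repo's actual identifiers:

    # Illustrative only
    from transformers import DataCollatorForLanguageModeling

    def build_collator(model_type, tokenizer, collator_cfg):
        # args such as mlm_probability come from the config's collator section
        model_type_to_collator = {
            "roberta": lambda: DataCollatorForLanguageModeling(
                tokenizer, mlm_probability=collator_cfg.get("mlm_probability", 0.15)
            ),
            "gpt2": lambda: DataCollatorForLanguageModeling(tokenizer, mlm=False),
        }
        return model_type_to_collator[model_type]()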

    Refs: #30

    opened by seopbo 6
  • Add UL2 Language Modeling

    I mentioned this on Slack as well: the objective based on the Mixture of Denoisers introduced in the Universal Language Learning Paradigm (UL2) paper reportedly beats plain span corruption, MLM, and CLM across the board. I also happen to have a use for it at work, so I'd like to implement a collator and processor for it in lassl. What do you think? A rough sketch of the idea follows.
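
    An illustrative sketch of the mixture idea, with made-up denoiser settings (the R/S/X split follows the UL2 paper; none of this is lassl code):

    import random

    # R: regular span corruption, S: sequential/prefix-LM denoising,
    # X: extreme denoising (long spans or high corruption rates)
    DENOISERS = [
        ("R", {"mean_span": 3, "corrupt_rate": 0.15}),
        ("S", {"corrupt_rate": 0.25}),
        ("X", {"mean_span": 32, "corrupt_rate": 0.50}),
    ]

    def sample_denoiser(rng=random):
        # each batch draws one denoising objective at random
        return rng.choice(DENOISERS)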

    opened by DaehanKim 4
  • Support training BART

    Is your feature request related to a problem? Please describe. Add a BART processor and collator.

    Describe the solution you'd like Add the text_infilling method as a collator (sketched below).
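
    A toy sketch of text infilling as described in the BART paper (a sampled span is replaced by a single mask token, with span lengths drawn from Poisson(3)); purely illustrative, not the proposed collator:

    import numpy as np

    def text_infilling(token_ids, mask_token_id, mask_ratio=0.3, poisson_lambda=3.0):
        token_ids = list(token_ids)
        budget = int(len(token_ids) * mask_ratio)  # number of tokens to corrupt
        while budget > 0 and len(token_ids) > 1:
            span = min(max(1, np.random.poisson(poisson_lambda)), budget, len(token_ids))
            start = np.random.randint(0, len(token_ids) - span + 1)
            token_ids[start:start + span] = [mask_token_id]  # whole span -> one mask
            budget -= span
        return token_ids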

    enhancement 
    opened by bzantium 4
  • Add keep_in_memory option in load_dataset

    Is your feature request related to a problem? Please describe.

    • While training on a TPU VM, the cache fills up the disk even though memory is plentiful.

    Describe the solution you'd like

    • Fix this by adding the keep_in_memory option at the load_dataset stage (see the sketch below).
    • Since serialized data is saved to disk anyway, the option is unnecessary at the train stage; add it only to the tokenizer and serialization steps.
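
    The proposed fix, sketched (the file path is illustrative):

    from datasets import load_dataset

    # keep_in_memory=True avoids writing Arrow cache files to disk
    dataset = load_dataset("text", data_files="corpora/corpus.txt", keep_in_memory=True)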
    opened by iron-ij 2
  • KoRobertaSmall training

    TODO

    Training tokenizer

    poetry run python3 train_tokenizer.py --corpora_dir corpora \
    --corpus_type sent_text \
    --model_type roberta \
    --vocab_size 51200 \
    --min_frequency 2
    

    Serializing corpora

    poetry run python3 serialize_corpora.py --model_type roberta \
    --tokenizer_dir tokenizers/roberta \
    --corpora_dir corpora \
    --corpus_type sent_text \
    --max_length 512 \
    --num_proc 96 \
    --batch_size 1000 \
    --writer_batch_size 1000
    

    ref:

    • https://github.com/huggingface/blog/blob/master/notebooks/13_pytorch_xla.ipynb
    help wanted 
    opened by seopbo 2
  • Support corpus_type

    • "docu_text", "docu_json", "sent_text", "sent_json"으로 corpus_type을 정의함.
      • 위에 대응하여 load_corpora 함수를 수정함.
      • "sent_text"에 대응되는 loading scripts의 이름과 class 명을 수정함
      • serialize_corpora.py에서 corpus_type에 대응되게 argument parser를 수정함.
      • train_tokenizer.py에서 corpus_type에 대응되게 refactoring을 수행함.
      • model_name -> model_type으로 수정함.

    Refs: #23

    enhancement 
    opened by seopbo 2
  • Support setting arguments of pretraining by a config file

    • Pass all arguments needed to run pretrain_language_model.py through a single config file.
    • Add the OmegaConf library to handle nested dicts.
    • Instantiate the model config class via CONFIG_MAPPING (sketched below).
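
    Roughly, assuming a nested YAML config (the path and keys here are guesses):

    from omegaconf import OmegaConf
    from transformers import CONFIG_MAPPING

    cfg = OmegaConf.load("conf/roberta-small.yaml")       # --config_path
    model_config = CONFIG_MAPPING[cfg.model.model_type](  # e.g. "roberta"
        **OmegaConf.to_container(cfg.model.params)
    )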

    Refs: #16

    opened by seopbo 2
  • argument setting

    To Do

    • https://github.com/lassl/lassl/blob/c507a547e5e22a3bc89bf65e448712783e688211/pretrain_language_model.py#L47
    • set ModelArguments from config.json file
    • set TrainingArguments from config.json file
    enhancement 
    opened by alxiom 2
  • Single-stage Electra collator refactored

    src/lassl/collators.py

    1. Simplified the main operation (all-in-tensor)
    2. Renamed the function pad_for_token_type_ids -> _token_type_ids_with_pad for clarity
    documentation 
    opened by Doohae 1
  • Add config files #82

    Add config files for the following:

    • bert-small.yaml
    • albert-small.yaml
    • gpt2-small.yaml
    • roberta-small.yaml

    Also add a README briefly explaining the config files in general. For issue #82 @seopbo
    opened by Doohae 1
  • Can you give some examples or benchmarks showing that pretraining with this framework improves downstream tasks?

    I think that showing evidence that pretraining on a self-built corpus with this framework improves performance on some downstream task would make this project more attractive.

    opened by svjack 1
  • Change default save format to parquet

    TODO

    • Currently, serialize_corpora.py saves the encoded dataset with save_to_disk.
    • In this issue, we replace the save_to_disk call with to_parquet (sketched below).
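
    Sketched, assuming the encoded data loads back as a single Dataset:

    from datasets import load_from_disk

    encoded = load_from_disk("datasets/roberta")          # current save_to_disk output
    encoded.to_parquet("datasets/roberta/train.parquet")  # proposed replacement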

    cc: @Doohae @DaehanKim

    enhancement 
    opened by seopbo 0
Releases(v1.0.0)
  • v1.0.0(Nov 2, 2022)

    What's Changed

    • [mixed] refactor: Refactor for v1.0.0 by @seopbo in https://github.com/lassl/lassl/pull/102
    • Currently, lassl supports training bert, albert, roberta, gpt2, bart, t5, and ul2.
    • Next, lassl will support training electra. Moreover, train_universal_tokenizer.py will be added to lassl.
      • train_universal_tokenizer.py will train a tokenizer usable for all model types supported by lassl.

    Full Changelog: https://github.com/lassl/lassl/compare/v0.2.0...v1.0.0

  • v0.2.0(Sep 22, 2022)

    What's Changed

    • Support training BART by @seopbo in https://github.com/lassl/lassl/pull/81
    • Support training T5 model by @DaehanKim in https://github.com/lassl/lassl/pull/87
    • Add config files #82 by @Doohae in https://github.com/lassl/lassl/pull/88
    • Support Electra pretrain by @Doohae in https://github.com/lassl/lassl/pull/91
    • Add UL2 Language Modeling by @DaehanKim in https://github.com/lassl/lassl/pull/98

    New Contributors

    • @DaehanKim made their first contribution in https://github.com/lassl/lassl/pull/87

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.4...v0.2.0

  • v0.1.3(Mar 18, 2022)

    Summary

    • Refactor lassl to package its modules as a library
    • Add a dataset-blending function (see the sketch below)
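
    A hedged guess at what dataset blending enables, shown with Datasets' interleave_datasets; lassl's own blender API may differ:

    from datasets import interleave_datasets, load_from_disk

    blended = interleave_datasets(
        [load_from_disk("datasets/corpus_a"), load_from_disk("datasets/corpus_b")],
        probabilities=[0.7, 0.3],  # per-corpus sampling weights
        seed=42,
    )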

    What's Changed

    • Add dataset blender by @hyunwoongko in https://github.com/lassl/lassl/pull/73
    • Remove poetry dependencies by @seopbo in https://github.com/lassl/lassl/pull/76

    New Contributors

    • @hyunwoongko made their first contribution in https://github.com/lassl/lassl/pull/73

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.2...v0.1.3

  • v0.1.2(Dec 30, 2021)

    Summary

    • Fix bugs in src/collators.py

    What's Changed

    • [python] fix: Fix importing an invalid module by @seopbo in https://github.com/lassl/lassl/pull/72

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.1...v0.1.2

  • v0.1.1(Dec 20, 2021)

    Summary

    • Update README.md
      • Support README.md in English.
      • Support README_ko.md in Korean.
    • Fix bugs in GPT2 training
    • Add example configs for GPU and TPU environments.

    What's Changed

    • [docs] fix: Change a license by @seopbo in https://github.com/lassl/lassl/pull/64
    • [etc] docs: Add English version of README by @bzantium in https://github.com/lassl/lassl/pull/66
    • Add example configs for gpu, tpu by @seopbo in https://github.com/lassl/lassl/pull/65
    • [python] fix: debug GPT2 processor and collator by @bzantium in https://github.com/lassl/lassl/pull/69
    • Update README.md by @bzantium in https://github.com/lassl/lassl/pull/70

    Full Changelog: https://github.com/lassl/lassl/compare/v0.1.0...v0.1.1

  • v0.1.0(Dec 15, 2021)

    Summary

    • First release

    What's Changed

    • Feature/#2 by @seopbo in https://github.com/lassl/lassl/pull/4
    • feat: TPU compatibility by @monologg in https://github.com/lassl/lassl/pull/8
    • Feature/#3 Add GPT2Preprocessor by @iron-ij in https://github.com/lassl/lassl/pull/10
    • [docs] chore: Add authors by @seopbo in https://github.com/lassl/lassl/pull/13
    • Feature/#9 Add Processor and Collator for ALBERT by @bzantium in https://github.com/lassl/lassl/pull/14
    • [python] feat: Save tokenizer by @seopbo in https://github.com/lassl/lassl/pull/19
    • [python] mixed: Support sentence per line type doc by @seopbo in https://github.com/lassl/lassl/pull/20
    • Support setting arguments of pretraining by a config file by @seopbo in https://github.com/lassl/lassl/pull/22
    • Support corpus_type by @seopbo in https://github.com/lassl/lassl/pull/25
    • Support adding additional special tokens by @seopbo in https://github.com/lassl/lassl/pull/26
    • [python] feat: Add bert processor by @bzantium in https://github.com/lassl/lassl/pull/29
    • Refactor codes relevant to pretrain by @seopbo in https://github.com/lassl/lassl/pull/31
    • Update issue templates by @seopbo in https://github.com/lassl/lassl/pull/34
    • [python] fix: Add a sampling_ratio condition by @bzantium in https://github.com/lassl/lassl/pull/36
    • [python] chore: Update dependencies by @seopbo in https://github.com/lassl/lassl/pull/38
    • [python] fix: Fix a buffer in processing.py by @seopbo in https://github.com/lassl/lassl/pull/41
    • [mixed] fix: Change xla_spawn, add config and comments by @bzantium in https://github.com/lassl/lassl/pull/44
    • [python] feat: add keep_in_memory option in serialize_corpora by @iron-ij in https://github.com/lassl/lassl/pull/43
    • [chore] fix: Fix a requirements.txt by @seopbo in https://github.com/lassl/lassl/pull/46
    • [python] fix: Remove the duplicate-sampling option when sampling by @bzantium in https://github.com/lassl/lassl/pull/48
    • [etc] docs: Add README by @bzantium in https://github.com/lassl/lassl/pull/39
    • [etc] docs: Add the LASSL acronym explanation to the README by @bzantium in https://github.com/lassl/lassl/pull/52
    • [python] chore: Update dependencies by @seopbo in https://github.com/lassl/lassl/pull/54
    • [python] fix: Make the GPT2 Collator inherit from CollatorForLM by @bzantium in https://github.com/lassl/lassl/pull/57
    • [etc] docs: Add additional information to doc by @seopbo in https://github.com/lassl/lassl/pull/59

    New Contributors

    • @seopbo made their first contribution in https://github.com/lassl/lassl/pull/4
    • @monologg made their first contribution in https://github.com/lassl/lassl/pull/8
    • @iron-ij made their first contribution in https://github.com/lassl/lassl/pull/10
    • @bzantium made their first contribution in https://github.com/lassl/lassl/pull/14

    Full Changelog: https://github.com/lassl/lassl/commits/v0.1.0

Owner

LASSL: LAnguage Self-Supervised Learning