🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview


License: Apache 2.0

What is BigBird?

BigBird is a sparse-attention-based model introduced in BigBird: Transformers for Longer Sequences. It can handle longer sequences than a standard BERT.

🦅 Longer Sequence - Handles up to 4096 tokens, 8 times the 512-token maximum of BERT.

⏱️ Computational Efficiency - Uses sparse attention instead of full attention, reducing the cost from O(n^2) to O(n).

How to Use

  • The model uploaded to the 🤗 Huggingface Hub can be used right away :)
  • We recommend transformers>=4.11.0, in which some issues have been resolved. (PR related to the MRC issue)
  • Use BertTokenizer instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer.)
  • For detailed usage, see the BigBird Transformers documentation.

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
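
As a follow-up to the loading snippet above, the sketch below shows one way to run a long text through the model. It assumes PyTorch is installed alongside transformers; the helper name `embed_long_text` and its parameters are hypothetical and chosen only for illustration.

```python
def embed_long_text(text, model_name="monologg/kobigbird-bert-base", max_length=4096):
    """Return token-level hidden states for one (possibly long) Korean text."""
    # Imported lazily so that defining this helper does not load the
    # heavy dependencies until it is actually called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # a BERT tokenizer, per the note above
    model = AutoModel.from_pretrained(model_name)  # BigBirdModel
    model.eval()

    # Truncate anything beyond the 4096-token limit stated in this README.
    inputs = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)
```

Calling `embed_long_text("...")` downloads the checkpoint from the Hub on first use, so it needs network access.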

Pretraining

For details, see [Pretraining BigBird].

| Model | Hardware | Max len | LR | Batch | Train Step | Warmup Step |
|---|---|---|---|---|---|---|
| KoBigBird-BERT-Base | TPU v3-8 | 4096 | 1e-4 | 32 | 2M | 20k |

  • Trained on a variety of data, including Modu Corpus (모두의 말뭉치), Korean Wikipedia, Common Crawl, and news data.
  • Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC).
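
For a sense of scale, the pretraining settings above can be turned into a rough upper bound on the number of tokens processed. This is only back-of-the-envelope arithmetic: it assumes that the batch size of 32 is the global batch and that every sequence is filled to the full 4096 tokens, neither of which is stated here.

```python
# Values from the pretraining table.
MAX_LEN = 4096
BATCH_SIZE = 32          # assumed to be the global batch size
TRAIN_STEPS = 2_000_000  # "2M"
WARMUP_STEPS = 20_000    # "20k"


def max_tokens_seen(steps: int, batch_size: int = BATCH_SIZE, max_len: int = MAX_LEN) -> int:
    """Upper bound on tokens processed, assuming every sequence is full length."""
    return steps * batch_size * max_len


def warmup_fraction(warmup_steps: int = WARMUP_STEPS, train_steps: int = TRAIN_STEPS) -> float:
    """Share of training spent in learning-rate warmup."""
    return warmup_steps / train_steps


print(f"tokens per step : {max_tokens_seen(1):,}")
print(f"tokens in total : {max_tokens_seen(TRAIN_STEPS):,}")
print(f"warmup fraction : {warmup_fraction():.2%}")
```

Under these assumptions the bound is 131,072 tokens per step and about 2.6e11 tokens over the full run, with warmup covering 1% of the steps.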

Evaluation Result

1. Short Sequence (<=512)

For details, see [Finetune on Short Sequence Dataset].

| Model | NSMC (acc) | KLUE-NLI (acc) | KLUE-STS (pearsonr) | Korquad 1.0 (em/f1) | KLUE MRC (em/rouge-w) |
|---|---|---|---|---|---|
| KoELECTRA-Base-v3 | 91.13 | 86.87 | 93.14 | 85.66 / 93.94 | 59.54 / 65.64 |
| KLUE-RoBERTa-Base | 91.16 | 86.30 | 92.91 | 85.35 / 94.53 | 69.56 / 74.64 |
| KoBigBird-BERT-Base | 91.18 | 87.17 | 92.61 | 87.08 / 94.71 | 70.33 / 75.34 |

2. Long Sequence (>=1024)

For details, see [Finetune on Long Sequence Dataset].

| Model | TyDi QA (em/f1) | Korquad 2.1 (em/f1) | Fake News (f1) | Modu Sentiment (f1-macro) |
|---|---|---|---|---|
| KLUE-RoBERTa-Base | 76.80 / 78.58 | 55.44 / 73.02 | 95.20 | 42.61 |
| KoBigBird-BERT-Base | 79.13 / 81.30 | 67.77 / 82.03 | 98.85 | 45.42 |

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

We also thank Seyun Ahn for providing the wonderful logo.
