🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)

Last update: Dec 14, 2022

Overview

Pretrained BigBird Model for Korean

What is BigBird • How to Use • Pretraining • Evaluation Result • Docs • Citation

What is BigBird?

BigBird is a sparse-attention based model introduced in "BigBird: Transformers for Longer Sequences". It can handle longer sequences than a standard BERT.

🦅 Longer Sequence - Handles up to 4096 tokens, 8 times the 512-token maximum of BERT

⏱️ Computational Efficiency - Uses sparse attention instead of full attention, reducing the cost from O(n²) to O(n)
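
To make the O(n²) versus O(n) claim concrete, the sketch below counts query-key score computations for full attention and for a simplified block-sparse pattern. The block size of 64, the three sliding-window blocks, the two global blocks, and the three random blocks per query block are assumptions chosen for illustration, not values stated in this README. The count also ignores the extra work done by the global blocks themselves.

```python
def full_attention_scores(seq_len: int) -> int:
    """Every token attends to every token: n * n scores."""
    return seq_len * seq_len


def block_sparse_scores(
    seq_len: int,
    block_size: int = 64,    # assumed block size
    window_blocks: int = 3,  # assumed sliding-window width, in blocks
    global_blocks: int = 2,  # assumed number of global blocks
    random_blocks: int = 3,  # assumed number of random blocks
) -> int:
    """Each query block attends to a fixed number of key blocks,
    so the count grows linearly with seq_len."""
    num_query_blocks = seq_len // block_size
    key_blocks_per_query = window_blocks + global_blocks + random_blocks
    return num_query_blocks * key_blocks_per_query * block_size * block_size


for n in (512, 4096):
    print(n, full_attention_scores(n), block_sparse_scores(n))
```

Under these assumptions, going from 512 to 4096 tokens multiplies the full-attention count by 64 but the sparse count by only 8.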

How to Use

🤗 The model uploaded to the Huggingface Hub can be used right away :)
We recommend transformers>=4.11.0, in which some issues have been resolved. (PR related to the MRC issue)
Use BertTokenizer instead of BigBirdTokenizer. (AutoTokenizer loads BertTokenizer.)
For detailed usage, see the BigBird Transformers documentation.

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer
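
As a minimal sketch of feeding a long document to the model, the helper below tokenizes text with truncation at 4096 tokens and returns the final hidden states. The function name `encode_long_text` and the constants `MODEL_NAME` and `MAX_TOKENS` are hypothetical. The sketch assumes `torch` and `transformers` are installed and that the model can be downloaded from the Hub.

```python
MODEL_NAME = "monologg/kobigbird-bert-base"
MAX_TOKENS = 4096  # maximum sequence length stated in this README


def encode_long_text(text: str, max_length: int = MAX_TOKENS):
    """Return the last hidden states for `text` (hypothetical helper)."""
    if not 0 < max_length <= MAX_TOKENS:
        raise ValueError(f"max_length must be in 1..{MAX_TOKENS}")

    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)  # BertTokenizer
    model = AutoModel.from_pretrained(MODEL_NAME)  # BigBirdModel
    model.eval()

    # Truncate anything beyond max_length tokens and return PyTorch tensors.
    inputs = tokenizer(
        text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)
```

Calling `encode_long_text(document)` downloads the weights on first use. For repeated calls it would be better to load the tokenizer and model once and reuse them.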

Pretraining

For details, see [Pretraining BigBird].

Model	Hardware	Max len	LR	Batch	Train Step	Warmup Step
KoBigBird-BERT-Base	TPU v3-8	4096	1e-4	32	2M	20k

Trained on a variety of data, including Modu Corpus (모두의 말뭉치), Korean Wikipedia, Common Crawl, and news data
Trained as an ITC (Internal Transformer Construction) model (ITC vs ETC)
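
The hyperparameters in the table above give a rough upper bound on how many tokens the model processed during pretraining. This is a back-of-the-envelope sketch. It assumes that "Batch" is the global batch size and that every sequence is filled to the full 4096 tokens, neither of which the README states.

```python
max_len = 4096            # Max len
batch_size = 32           # Batch (assumed to be the global batch size)
train_steps = 2_000_000   # Train Step (2M)
warmup_steps = 20_000     # Warmup Step (20k)

tokens_per_step = batch_size * max_len
# Upper bound: holds only if every sequence is filled to max_len.
total_tokens = tokens_per_step * train_steps
warmup_fraction = warmup_steps / train_steps

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens (upper bound): {total_tokens:,}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```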

Evaluation Result

1. Short Sequence (<=512)

For details, see [Finetune on Short Sequence Dataset].

Model	NSMC (acc)	KLUE-NLI (acc)	KLUE-STS (pearsonr)	KorQuAD 1.0 (em/f1)	KLUE MRC (em/rouge-w)
KoELECTRA-Base-v3	91.13	86.87	93.14	85.66 / 93.94	59.54 / 65.64
KLUE-RoBERTa-Base	91.16	86.30	92.91	85.35 / 94.53	69.56 / 74.64
KoBigBird-BERT-Base	91.18	87.17	92.61	87.08 / 94.71	70.33 / 75.34

2. Long Sequence (>=1024)

For details, see [Finetune on Long Sequence Dataset].

Model	TyDi QA (em/f1)	KorQuAD 2.1 (em/f1)	Fake News (f1)	Modu Sentiment (f1-macro)
KLUE-RoBERTa-Base	76.80 / 78.58	55.44 / 73.02	95.20	42.61
KoBigBird-BERT-Base	79.13 / 81.30	67.77 / 82.03	98.85	45.42
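
To read the long-sequence table at a glance, the snippet below computes the margin of KoBigBird-BERT-Base over KLUE-RoBERTa-Base for each metric, using only the numbers in the table above. The dictionary layout and variable names are just for illustration.

```python
# Scores copied from the long-sequence table:
# (KLUE-RoBERTa-Base, KoBigBird-BERT-Base)
scores = {
    "TyDi QA em": (76.80, 79.13),
    "TyDi QA f1": (78.58, 81.30),
    "KorQuAD 2.1 em": (55.44, 67.77),
    "KorQuAD 2.1 f1": (73.02, 82.03),
    "Fake News f1": (95.20, 98.85),
    "Modu Sentiment f1-macro": (42.61, 45.42),
}

# Margin in points, rounded to two decimals to hide float noise.
margins = {
    task: round(kobigbird - roberta, 2)
    for task, (roberta, kobigbird) in scores.items()
}

for task, margin in margins.items():
    print(f"{task}: +{margin:.2f}")
```

The largest gaps appear on KorQuAD 2.1, where inputs are long documents.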

Docs

Citation

If you use KoBigBird, please cite it as follows.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird was built with Cloud TPU support from the Tensorflow Research Cloud (TFRC) program.

We also thank Seyun Ahn for providing the wonderful logo.

Comments

Question about pretraining epochs
Checklist

[x] I've searched the project's issues

❓ Question

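Hello, I am currently working with friends on a model that takes 4096 tokens as input and performs summarization. At first we planned to try a BigBird + BERT combination, but monologg had already built that :) So we are going to train a Longformer + BART + PEGASUS combination instead. Starting from the pretrained KoBart, we replace the attention with Longformer attention and then train on the PEGASUS task.

So far we have collected 13GB of data and finished the preprocessing, the data loader, and the model code. We plan to start training within this week.

With our GPUs, one epoch should take roughly two days. I would like to ask how many epochs monologg ran when developing the KoBigBird model.

Since we are starting from a pretrained model, we probably do not need many epochs, but I would like to use your number as a reference point!

That got long, so to summarize: I am curious how many epochs you used when training KoBigBird, and what criterion you based that choice on.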
question
opened by KimJaehee0725 2
Specific information about this model.
Checklist

[ x ] I've searched the project's issues

❓ Question

You mentioned that the model was "trained on a variety of data, including Modu Corpus (모두의 말뭉치), Korean Wikipedia, Common Crawl, and news data", and I want to know the total size of the pretraining corpus.

I also want to know the vocab size of this model.

📎 Additional context
question
opened by midannii 2
Fix some minors

Description

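While reading the code and comments, I fixed a few small typos and similar issues that I noticed.

Many thanks to @monologg and @donggyukimc for generously sharing their know-how.

Next, I plan to test finetuning in a GPU environment. Thank you.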

Related Issue
chore

opened by sackoh 0

Releases(v1.0.0)

v1.0.0(Nov 8, 2021)

Initial release for KoBigBird - Pretrained BigBird Model for Korean

Owner

Jangwon Park

GitHub Repository https://huggingface.co/monologg/kobigbird-bert-base
