Google AI 2018 BERT pytorch implementation

Last update: Jan 07, 2023

Overview

BERT-pytorch

Pytorch implementation of Google AI's 2018 BERT, with simple annotation

BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Paper URL : https://arxiv.org/abs/1810.04805

Introduction

Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA), including outperform the human F1 score on SQuAD v1.1 QA task. This paper proved that Transformer(self-attention) based encoder can be powerfully used as alternative of previous language model with proper language model training method. And more importantly, they showed us that this pre-trained language model can be transfer into any NLP task without making task specific model architecture.

This amazing result would be record in NLP history, and I expect many further papers about BERT will be published very soon.

This repo is implementation of BERT. Code is very simple and easy to understand fastly. Some of these codes are based on The Annotated Transformer

Currently this project is working on progress. And the code is not verified yet.

Installation

pip install bert-pytorch

Quickstart

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

0. Prepare your corpus

Welcome to the \t the jungle\n
I can stay \t here all night\n

or tokenized corpus (tokenization is not in package)

Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n

1. Building vocab based on your corpus

bert-vocab -c data/corpus.small -o data/vocab.small

2. Train your own BERT model

bert -c data/corpus.small -v data/vocab.small -o output/bert.model

Language Model Pre-training

In the paper, authors shows the new language model training methods, which are "masked language model" and "predict next sentence".

Masked Language Model

Original Paper : 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                  the                his

Rules:

Randomly 15% of input token will be changed into something, based on under sub-rules

Randomly 80% of tokens, gonna be a [MASK] token
Randomly 10% of tokens, gonna be a [RANDOM] token(another word)
Randomly 10% of tokens, will be remain as same. But need to be predicted.

Predict Next Sentence

Original Paper : 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : Is Next

Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is not directly captured by language modeling

Rules:

Randomly 50% of next sentence, gonna be continuous sentence.
Randomly 50% of next sentence, gonna be unrelated sentence.

Author

Junseong Kim, Scatter Lab ([email protected] / [email protected])

License

This project following Apache 2.0 License as written in LICENSE file

Comments

Very low GPU usage when training on 8 GPU in a single machine

Hi, I am currently pretaining the BERT on my own data. I use the alpha0.0.1a5 branch (newest version).
I found only 20% of the GPU is in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with pytorch. Any one konws why?

help wanted

opened by mapingshuo 13

Example of Input Data

Could you give a concrete example of the input data? You gave an example of the corpus data, but not the dataset.small file found in this line:

bert -c data/dataset.small -v data/vocab.small -o output/bert.model

If you could show perhaps a couple of examples, that would be very helpful! I am new to pytorch, so the dataloader function is a little confusing.
question

opened by nateraw 8
Why doesn't the counter in data_iter increase?

I am currently playing around with training and testing the model. However, as I implemented the test section, I'm noticing that during the LM training, your counter doesn't increase when looping over data_iter found in pretrain.py. This would cause problems when calculating the average loss/accuracy, wouldn't it?

invalid question

opened by nateraw 6
model/embedding/position.py

div_term = (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp() should be: div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

In [51]: (torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)).float().exp() ...: Out[51]: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Additional question: I don't quite understand how "bidirectional" transformer in the raw paper implemented. Maybe like BiLSTM: concat two direction's transformer output together? Didn't find the similar structure in your code.

opened by zhupengjia 5
The LayerNorm implementation

I am wondering why don't you use the standard nn version of LayerNorm? I notice the difference is the denomenator: nn.LayerNorm use the {sqrt of (variance + epsilon)} rather than {standard deviation + epsilon}

Could you clarify these 2 approaches?
invalid question

opened by egg-west 4
How to embedding segment lable

Thanks for you code ,which let me leran more details for this papper .But i cant't understand segment.py. You haven't writeen how to embedding segment lable .
question

opened by waallf 3
The question about the implement of learning_rate

Nice implements! However, I have a question about learning rate. The learning_rate schedule which from the origin Transformers is warm-up restart, but your implement just simple decay. Could you implement it in your BERT code?
enhancement

opened by wenhaozheng-nju 3
Question about random sampling.

https://github.com/codertimo/BERT-pytorch/blob/7efd2b5a631f18ebc83cd16886b8c6ee77a40750/bert_pytorch/dataset/dataset.py#L50-L64

Well, seems random.random() always returns a positive number, so prob >= prob * 0.9 will always be true?

opened by SongRb 3
Mask language model loss

Hi, Thank you for your clean code on Bert. I have a question about Mask LM loss after I read your code. Your program computes a mask language model loss on both positive sentence pairs and negative pairs.

Does it make sense to compute Mask LM loss on negative sentence pairs? I am not sure how Google computes this loss.

opened by MarkWuNLP 3
imbalance GPU memory usage

Hi,

Nice try for BERT implementation.

I try to run your code in 4V100 and I find the memory usage is imbalance: the first GPU consume 2x memory than the others. Any idea about the reason?

Btw, I think the parameter order in train.py line 64 is incorrect.
help wanted question

opened by WencongXiao 3

[BERT] Cannot import bert

I have problems importing bert when following http://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

(mxnet_p36) [[email protected] ~]$ ipython
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import warnings
   ...: warnings.filterwarnings('ignore')
   ...:
   ...: import random
   ...: import numpy as np
   ...: import mxnet as mx
   ...: from mxnet import gluon
   ...: import gluonnlp as nlp
   ...:
   ...:


In [2]:

In [2]: np.random.seed(100)
   ...: random.seed(100)
   ...: mx.random.seed(10000)
   ...: ctx = mx.gpu(0)
   ...:
   ...:

In [3]: from bert import *
   ...:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-40b999f3ea6a> in <module>()
----> 1 from bert import *

ModuleNotFoundError: No module named 'bert'

Looks gluonnlp are successfully installed. Any idea?

(mxnet_p36) [[email protected] site-packages]$ ll /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg
-rw-rw-r-- 1 ec2-user ec2-user 499320 Dec 28 23:15 /ec2-user-anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluonnlp-0.5.0.post0-py3.6.egg

opened by logicmd 2

why specify `ignore_index=0` in the NLLLoss function in BERTTrainer?
trainer/pretrain.py

class BERTTrainer: def __init__(self, ...): ... # Using Negative Log Likelihood Loss function for predicting the masked_token self.criterion = nn.NLLLoss(ignore_index=0) ...

I cannot understand why ignore index=0 is specified when calculating NLLLoss. If the ground truth of is_next is False (label = 0) in terms of the NSP task but BERT predicts True, then NLLLoss will be 0 (or nan)... so what's the aim of ignore_index = 0 ???

====================

Well, I've found that ignore_index = 0 is useful to the MLM task, but I still can't agree the NSP task should share the same NLLLoss with MLM.
opened by Jasmine969 0
Added a Google Colab Notebook that contains all the code in this project.

For learning purposes, I added example.ipynb, which is a Google Colab Notebook that works right out of the box. I have also included an example data file that addresses #59 .

opened by ginward 0
It keeps trying to use CUDA despite --with_cuda False option

Hello,

I have tried to run bert with --with_cuda False, but the model keeps running "forward" function on cuda. These are my command line and the error message I got.

bert -c corpus.small -v vocab.small -o bert.model --with_cuda False -e 5

Loading Vocab vocab.small Vocab Size: 262 Loading Train Dataset corpus.small Loading Dataset: 113it [00:00, 560232.09it/s] Loading Test Dataset None Creating Dataloader Building BERT model Creating BERT Trainer Total Parameters: 6453768 Training Start EP_train:0: 0%|| 0/2 [00:00<?, ?it/s] Traceback (most recent call last): File "/home/yuni/anaconda3/envs/py3/bin/bert", line 8, in sys.exit(train()) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/main.py", line 67, in train trainer.train(epoch) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 69, in train self.iteration(epoch, self.train_data) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/trainer/pretrain.py", line 102, in iteration next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"]) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/language_model.py", line 24, in forward x = self.bert(x, segment_label) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/bert.py", line 46, in forward x = transformer.forward(x, mask) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/transformer.py", line 29, in forward x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask)) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/utils/sublayer.py", line 18, in forward return x + self.dropout(sublayer(self.norm(x))) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/transformer.py", line 29, in x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, _x, _x, mask=mask)) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/attention/multi_head.py", line 32, in forward x, attn = self.attention(query, key, value, mask=mask, dropout=self.dropout) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/yuni/anaconda3/envs/py3/lib/python3.6/site-packages/bert_pytorch/model/attention/single.py", line 25, in forward return torch.matmul(p_attn, value), p_attn RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 1.95 GiB total capacity; 309.18 MiB already allocated; 125.62 MiB free; 312.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

opened by hanyangii 0
dataset / dataset.py have one erro?

"
def get_random_line(self): if self.on_memory: self.lines[random.randrange(len(self.lines))][1] " This code is to get the incorrect next sentence（isNotNext : 0)， maybe it random get a lines it is (isnext:1)。

opened by ndn-love 0

Releases(0.0.1a4)

0.0.1a4(Oct 23, 2018)

Source code(tar.gz)
Source code(zip)
0.0.1a3(Oct 20, 2018)

Source code(tar.gz)
Source code(zip)
0.0.1a2(Oct 19, 2018)

Source code(tar.gz)
Source code(zip)
0.0.1a1(Oct 18, 2018)
Features

command line based train system

Source code(tar.gz)
Source code(zip)
0.0.1a0(Oct 18, 2018)

Source code(tar.gz)
Source code(zip)

Owner

Junseong Kim

Scatter Lab, Machine Learning Research Scientist, NLP

GitHub Repository

使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征，提升下游任务的表现。

Pretrain_Bert_with_MaskLM Info 使用Mask LM预训练任务来预训练Bert模型。基于pytorch框架，训练关于垂直领域语料的预训练语言模型，目的是提升下游任务的表现。 Pretraining Task Mask Language Model，简称Mask LM，即

24 Dec 10, 2022

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters - where the final result looks like waves in the ocean.

2 Sep 09, 2022

Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

A brief explanation This script provides a quick way to setup a Time-of-day (Tod

2 Feb 03, 2022

Python port of Google's libphonenumber

phonenumbers Python Library This is a Python port of Google's libphonenumber library It supports Python 2.5-2.7 and Python 3.x (in the same codebase,

3.1k Dec 29, 2022

Fidibo.com comments Sentiment Analyser

Fidibo.com comments Sentiment Analyser Introduction This project first asynchronously grab Fidibo.com books comment data using grabber.py and then sav

3 Apr 15, 2022

American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

40 Nov 30, 2022

A collection of GNN-based fake news detection models.

This repo includes the Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models. All GNN models are implemented and evaluated under the User Prefere

251 Jan 01, 2023

Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022

Toward a Visual Concept Vocabulary for GAN Latent Space, ICCV 2021

Toward a Visual Concept Vocabulary for GAN Latent Space Code and data from the ICCV 2021 paper Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Kl

13 Dec 23, 2022

The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

2.9k Dec 31, 2022

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

8.4k Dec 30, 2022

Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

13 Dec 21, 2022

Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

298 Nov 21, 2022

PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023

Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Visualize, analyze, and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BER

1.6k Dec 25, 2022

Google AI 2018 BERT pytorch implementation

Related tags

Overview

BERT-pytorch

Introduction

Installation

Quickstart

0. Prepare your corpus

1. Building vocab based on your corpus

2. Train your own BERT model

Language Model Pre-training

Masked Language Model

Rules:

Predict Next Sentence

Rules:

Author

License

Comments

trainer/pretrain.py

Releases(0.0.1a4)

0.0.1a4(Oct 23, 2018)

0.0.1a3(Oct 20, 2018)

0.0.1a2(Oct 19, 2018)

0.0.1a1(Oct 18, 2018)

Features

0.0.1a0(Oct 18, 2018)

Owner

Junseong Kim

使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征，提升下游任务的表现。

OceanScript is an Esoteric language used to encode and decode text into a formulation of characters

Hostapd-mac-tod-acl - Setup a hostapd AP with MAC ToD ACL

Python port of Google's libphonenumber

Fidibo.com comments Sentiment Analyser

American Sign Language (ASL) to Text Converter

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

A collection of GNN-based fake news detection models.

Code for hyperboloid embeddings for knowledge graph entities

Toward a Visual Concept Vocabulary for GAN Latent Space, ICCV 2021

The aim of this task is to predict someone's English proficiency based on a text input.

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Document processing using transformers

Textpipe: clean and extract metadata from text

PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe

PyJPBoatRace: Python-based Japanese boatrace tools 🚤

Speach Recognitions