Pre-training BERT Masked Language Models (MLM)

This repository contains code to pre-train a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models are listed at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and can be downloaded upon request.

Instructions

To pre-train a new BERT model, you need the path to a dataset of raw text. You can optionally specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
  --files /path/to/text \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints \
  --epochs 30 \
  --hidden_layers 2 \
  --hidden_size 128 \
  --attention_heads 2 \
  --save_steps 10 \
  --save_limit 0 \
  --min_freq 0
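
For readers new to masked language modeling, the objective behind the command above can be sketched in plain Python. This follows the standard BERT masking recipe (select about 15% of tokens; replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged). It is an illustrative sketch, not the code in pretrain.py: the function name `mask_tokens`, the whitespace tokenization, and the toy sentence are assumptions made for the example.

```python
import random

MASK_TOKEN = "<mask>"  # same mask token as in the fill-mask example of this README


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the standard BERT masking recipe.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            masked.append(token)
            labels.append(None)  # position ignored by the loss
    return masked, labels


# Hypothetical toy input; a real run would use subword tokens from the tokenizer.
tokens = "le tribunal a rendu sa décision".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Only the positions with a non-None label contribute to the training loss; the rest of the sentence serves as context.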

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You must specify the columns that contain the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
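
As a rough illustration of the expected input, the sketch below reads a CSV with one text column and one category column and maps each category to an integer label, which is the usual preparation step for a classifier. It uses only the Python standard library and is not taken from classification.py: the helper `load_classification_rows` and the in-memory sample rows are hypothetical, and the column names simply reuse the placeholders from the command above.

```python
import csv
import io


def load_classification_rows(csv_file, text_column, category_column):
    """Read texts, integer labels, and the category-to-id mapping from a CSV."""
    reader = csv.DictReader(csv_file)
    texts, categories = [], []
    for row in reader:
        texts.append(row[text_column])
        categories.append(row[category_column])
    # Sort the category names so the mapping is stable across runs.
    label2id = {name: idx for idx, name in enumerate(sorted(set(categories)))}
    labels = [label2id[name] for name in categories]
    return texts, labels, label2id


# Toy in-memory CSV, invented for illustration only.
sample = io.StringIO(
    "text-column,category-column\n"
    "Premier texte,civil\n"
    "Second texte,commercial\n"
    "Troisième texte,civil\n"
)
texts, labels, label2id = load_classification_rows(
    sample, "text-column", "category-column"
)
print(label2id)
print(labels)
```

With a file on disk, the same helper can be called on `open("/path/to/data.csv", newline="", encoding="utf-8")`.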

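Run either script with --help to see all the available options.

To test the masked language model, use:

from transformers import pipeline

# `tokenizer` is the tokenizer the model was pre-trained with, loaded beforehand
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

fill_mask("Paris est la capitale de la <mask>.")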

Owner

Stella Douka
