
VoCapXLM

Code for the EMNLP 2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training".

Environment

Docker image: dancingsoul/pytorch:VoCapXLM

Manually build the bundled sentencepiece with the following commands:

cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Data Preparation

  1. Create a folder with mkdir -p monolingual_text in the root of this project.
  2. Sample monolingual corpus for each language individually, move them to the monolingual_text directory, named after their language codes (e.g., en.txt).
  3. Sample the multilingual corpus from monolingual corpora with the following command:
python sample_multilingual_corpus.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_text/ \
    --output_path ./multilingual_corpus.text \
    --n_sample <n_sample> --beta <beta> --rescale

where the options are described as follows:

  • --lang_prob_path: the probability of sampling training instances from each language during pre-training; lang_prob_wiki.json is computed on the Wikipedia corpus, with probabilities rescaled using alpha=0.7 as in Equation (3) of our paper.
  • --n_sample: the number of sentences sampled for the multilingual corpus on which the final multilingual sentencepiece model is trained; the default is 20000000.
  • --rescale: further rescale the probabilities with an additional factor beta from Equation (2) in our paper.
  • --beta: the rescaling factor in Equation (2); the default is 0.7 (see the sketch after this list).
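
For intuition, if Equation (2) is the usual exponential smoothing q_i ∝ p_i^beta (an assumption here; see the paper for the exact form), the effect of --beta can be sketched as follows:

import json

def rescale(lang_prob, beta=0.7):
    # Exponential smoothing: q_i = p_i**beta / sum_j p_j**beta.
    # beta < 1 flattens the distribution, up-weighting low-resource
    # languages relative to high-resource ones.
    powered = {lang: p ** beta for lang, p in lang_prob.items()}
    total = sum(powered.values())
    return {lang: v / total for lang, v in powered.items()}

# Illustrative usage, assuming lang_prob_wiki.json maps
# language codes to sampling probabilities.
with open("lang_prob_wiki.json") as f:
    print(rescale(json.load(f), beta=0.7))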

Training Monolingual SentencePiece Models

Train monolingual sentencepiece models of different sizes to obtain vocabularies with different ALP (average log probability), i.e., language-specific vocabulary capacity.

python train_mono_spm.py \
    --input_dir ./monolingual_text/ \
    --output_dir ~/monolingual_spm/ \
    --languages <all_languages> \
    --min_vocab_size <min_vocab_size> \
    --max_vocab_size <max_vocab_size> \
    --delta_vocab_size <delta_vocab_size> \
    --n_sample <n_sample>

where the options are described as follows:

  • --languages: all languages under the monolingual_text directory, separated by commas, e.g. en,fr,zh.
  • --min_vocab_size: the minimum vocabulary size allocated for each language; the default is 1000.
  • --max_vocab_size: the maximum vocabulary size allocated for each language; the default is 50000.
  • --delta_vocab_size: the step size between consecutive vocabulary sizes; the default is 1000.
  • --n_sample: the number of sentences used to calculate ALP for each language; the default is 1000000 (a sketch of the ALP computation follows below).

Alternatively, you can download our pre-trained monolingual sentencepiece models and vocabularies from [here][2].
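
For intuition, ALP can be estimated with the sentencepiece Python bindings by summing the unigram log probabilities (piece scores) of each sampled sentence's tokenization and averaging over sentences. The following is a minimal sketch of that idea, not the exact computation in train_mono_spm.py:

import sentencepiece as spm

def average_log_prob(model_path, sentences):
    # Tokenize each sentence with the unigram model and sum the
    # log-probability scores of its pieces; average over sentences.
    sp = spm.SentencePieceProcessor(model_file=model_path)
    total = 0.0
    for sent in sentences:
        total += sum(sp.get_score(i) for i in sp.encode(sent))
    return total / len(sentences)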

Allocating Multilingual Vocabulary

Allocate the multilingual vocabulary from monolingual vocabularies:

python train_vocap.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_spm/ \
    --output_path ./multilingual.vocab \
    --beta <beta> --rescale --target_vocab_size <target_vocab_size>

where the options are described as follows:

  • --lang_prob_path: same as above.
  • --rescale: same as above.
  • --beta: same as above.
  • --target_vocab_size: the desired size of the multilingual vocabulary; the default is 500000 (a sketch of the allocation idea follows this list).
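
Conceptually, the allocation can be viewed as a greedy procedure over the monolingual vocabularies trained above: repeatedly grant the next vocabulary increment to the language with the largest probability-weighted ALP gain, until the target size is reached. The sketch below only illustrates this idea; it assumes the per-language ALP values at each candidate size are precomputed, and it ignores subword overlap across languages when counting sizes:

def allocate_vocab(alp, lang_prob, target_vocab_size,
                   delta=1000, min_size=1000, max_size=50000):
    # alp: dict mapping (lang, vocab_size) -> ALP at that size.
    # lang_prob: dict mapping lang -> sampling probability.
    # Start every language at min_size, then repeatedly grow the
    # language whose next `delta` increment yields the largest
    # probability-weighted ALP gain.
    size = {lang: min_size for lang in lang_prob}
    while sum(size.values()) < target_vocab_size:
        best_lang, best_gain = None, float("-inf")
        for lang, s in size.items():
            if s + delta > max_size:
                continue
            gain = lang_prob[lang] * (alp[(lang, s + delta)] - alp[(lang, s)])
            if gain > best_gain:
                best_lang, best_gain = lang, gain
        if best_lang is None:  # every language is already at max_size
            break
        size[best_lang] += delta
    return size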

Then use sentencepiece to train the tokenizer given the multilingual vocabulary:

spm_train --input=./multilingual_corpus.text --model_prefix=<model_name> --vocab_size=<target_vocab_size> \
--character_coverage=0.9995 --model_type=unigram --shuffle_input_sentence=true \
--input_sentence_size=<input_sentence_size> --vocab_path=./multilingual.vocab

where the options are described as follows:

  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --character_coverage: amount of characters covered by the model.
  • --vocab_size: same as --target_vocab_size.
  • --vocab_path: the required subwords in the final learned tokenizer.
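
After training, the tokenizer can be sanity-checked with the sentencepiece Python bindings, for example (assuming <model_name> was set to multilingual):

import sentencepiece as spm

# The model file name follows --model_prefix from spm_train.
sp = spm.SentencePieceProcessor(model_file="multilingual.model")
print(sp.encode("Hello world", out_type=str))  # subword pieces
print(sp.encode("Hello world"))                # piece ids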

Paper

Please cite our paper if you find the resources in this repository useful:

@inproceedings{bo2021vocapxlm,
    author = {Bo Zheng and Li Dong and Shaohan Huang and Saksham Singhal and Wanxiang Che and Ting Liu and Xia Song and Furu Wei},
    booktitle = {Proceedings of EMNLP 2021},
    title = {{Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training}},
    year = {2021}
}

Reference

  1. https://github.com/google/sentencepiece
  2. https://drive.google.com/file/d/1VttgE30xo-i1ig5xsMF_7R4AB2sA5J9F/view?usp=sharing