Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

Last update: Sep 02, 2022

Overview

smaller-LaBSE

LaBSE (Language-agnostic BERT Sentence Embedding) is a strong method for obtaining sentence embeddings across languages, but the model is hard to fine-tune because of its size (about 471M parameters). For instance, fine-tuning it with the Adam optimizer in 32-bit precision needs a GPU with at least 7.5 GB of VRAM for the model state alone: 471M * (4 bytes for parameters + 4 bytes for gradients + 4 bytes for momentums + 4 bytes for variances). Most of these parameters sit in the word embedding table (about 385M), so I applied the method from "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE to reduce the parameter count.
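The memory estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the paragraph; the function name is made up for illustration, and the estimate ignores activations and any framework workspace.

```python
def adam_training_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Bytes held by the model state when training with Adam in fp32."""
    # Four same-sized buffers: parameters, gradients,
    # Adam first moments (momentums), Adam second moments (variances).
    copies = 4
    return num_params * copies * bytes_per_value


full_model = adam_training_bytes(471_000_000)
print(f"{full_model / 1e9:.2f} GB")  # 7.54 GB
```

Because the cost is linear in the parameter count, halving the parameters roughly halves this part of the memory footprint.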

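The smaller version of LaBSE was evaluated on 14 languages (each paired with English) using the Tatoeba dataset. The results show that LaBSE can be reduced to about 47% of its original parameter count with little loss in accuracy: the average drops by roughly 0.1 points (see Results).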

This repository targets TensorFlow and follows most of the steps in the paper. If you need a PyTorch version of the method, see https://github.com/Geotrend-research/smaller-transformers.

Model	#param(transformer)	#param(word embedding)	#param(model)	vocab size
tfhub_LaBSE	85.1M	384.9M	470.9M	501,153
15lang_LaBSE	85.1M	133.1M	219.2M	173,347
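The word embedding column in the table can be cross-checked from the vocabulary sizes. The sketch below assumes a hidden size of 768 (the BERT-base width), which this README does not state explicitly; with that assumption the products match the table.

```python
HIDDEN_SIZE = 768  # assumption: BERT-base hidden width, not stated in this README


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


for name, vocab_size in [("tfhub_LaBSE", 501_153), ("15lang_LaBSE", 173_347)]:
    print(name, f"{embedding_params(vocab_size) / 1e6:.1f}M")
# tfhub_LaBSE 384.9M
# 15lang_LaBSE 133.1M
```

The transformer column stays at 85.1M for both models, so all of the saving comes from dropping rows of the embedding table.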

Used Languages

English (en or eng)
French (fr or fra)
Spanish (es or spa)
German (de or deu)
Chinese (zh, zh_classical or cmn)
Arabic (ar or ara)
Italian (it or ita)
Japanese (ja or jpn)
Korean (ko or kor)
Dutch (nl or nld)
Polish (pl or pol)
Portuguese (pt or por)
Thai (th or tha)
Turkish (tr or tur)
Russian (ru or rus)

I selected the languages that multilingual USE (Universal Sentence Encoder) supports.

Scripts

A smaller vocabulary was constructed from token frequencies in Wikipedia dump data. I followed most of the paper's algorithm for extracting a suitable vocabulary for each language, and rewrote it for TensorFlow.
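As a rough illustration of frequency-based selection, the sketch below keeps every token whose share of a language's token occurrences reaches a threshold, then takes the union over languages. It is not the code in select_vocab.py: the function names, the special-token list, and the threshold are all assumptions made for this example, and the toy threshold is far larger than anything useful on real Wikipedia data.

```python
from collections import Counter

# Assumption: special tokens that must be kept regardless of frequency.
SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def select_tokens(tokenized_sentences, min_ratio):
    """Return tokens whose share of all token occurrences is >= min_ratio."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    return {tok for tok, n in counts.items() if n / total >= min_ratio}


def build_vocab(corpora_by_language, min_ratio=0.3):
    """Union the per-language selections and always keep the special tokens."""
    kept = set(SPECIAL_TOKENS)
    for sentences in corpora_by_language.values():
        kept |= select_tokens(sentences, min_ratio)
    return kept


# Toy corpora standing in for tokenized Wikipedia sentences.
corpora = {
    "en": [["hello", "world"], ["hello", "there"]],
    "fr": [["bon", "##jour"], ["bon", "soir"]],
}
print(sorted(build_vocab(corpora)))
```

Selecting per language before taking the union keeps a low-resource language's frequent tokens from being drowned out by a much larger corpus.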

Convert weight

mkdir -p downloads/labse-2
curl -L "https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed" -o downloads/labse-2.tar.gz
tar -xf downloads/labse-2.tar.gz -C downloads/labse-2/
python save_as_weight_from_saved_model.py

Select vocabs

./download_dataset.sh
python select_vocab.py

Make smaller LaBSE

./make_smaller_labse.py
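Conceptually, shrinking the model comes down to keeping only the embedding rows of the selected tokens and renumbering the vocabulary. The sketch below shows that idea with plain Python lists; it is not the contents of make_smaller_labse.py, the function name is hypothetical, and a real implementation would operate on the model's weight tensors.

```python
def shrink_embedding(old_vocab, embedding, kept_tokens):
    """Keep only the rows of `embedding` that belong to `kept_tokens`.

    old_vocab: list of tokens, where the list index is the token id.
    embedding: list of rows, one row per token id.
    Returns the new vocab (original order preserved) and the matching rows.
    """
    kept = set(kept_tokens)
    new_vocab, new_embedding = [], []
    for token_id, token in enumerate(old_vocab):
        if token in kept:
            new_vocab.append(token)
            new_embedding.append(embedding[token_id])
    return new_vocab, new_embedding


old_vocab = ["[PAD]", "hello", "bonjour", "hola"]
embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new_vocab, new_embedding = shrink_embedding(
    old_vocab, embedding, {"[PAD]", "hola", "hello"}
)
print(new_vocab)      # ['[PAD]', 'hello', 'hola']
print(new_embedding)  # [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
```

Token ids change after this step, so the preprocessing model has to use the same reduced vocabulary for the ids to line up with the new embedding rows.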

Evaluate tatoeba

./download_tatoeba_dataset.sh
# evaluate TFHub LaBSE
./evaluate_tatoeba.sh
# evaluate the smaller LaBSE
./evaluate_tatoeba.sh \
    --model models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru/1/ \
    --preprocess models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru_preprocess/1/

Results

Tatoeba

Model	fr	es	de	zh	ar	it	ja	ko	nl	pl	pt	th	tr	ru	avg
tfHub_LaBSE(en→xx)	95.90	98.10	99.30	96.10	90.70	95.30	96.40	94.10	97.50	97.90	95.70	82.85	98.30	95.30	95.25
tfHub_LaBSE(xx→en)	96.00	98.80	99.40	96.30	91.20	94.00	96.50	92.90	97.00	97.80	95.40	83.58	98.50	95.30	95.19
15lang_LaBSE(en→xx)	95.20	98.00	99.20	96.10	90.50	95.20	96.30	93.50	97.50	97.90	95.80	82.85	98.30	95.40	95.13
15lang_LaBSE(xx→en)	95.40	98.70	99.40	96.30	91.10	94.00	96.30	92.70	96.70	97.80	95.40	83.58	98.50	95.20	95.08

Accuracy (%) on the Tatoeba dataset.
Changing the vocabulary selection strategy, or selecting the vocabulary on a corpus closer to the evaluation dataset, would be expected to reduce the performance drop further.
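For readers unfamiliar with this kind of evaluation, here is a minimal sketch of retrieval accuracy. It assumes the common Tatoeba setup in which each source sentence is matched to its nearest target sentence by cosine similarity and counts as correct when that neighbour is its aligned translation. The evaluate_tatoeba.sh script may differ in detail, and the function names here are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_embs, tgt_embs):
    """Percent of source rows whose most similar target row has the same index."""
    hits = 0
    for i, src in enumerate(src_embs):
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        hits += int(best == i)
    return 100.0 * hits / len(src_embs)


# Toy embeddings: row i of src is the translation of row i of tgt.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8]]
print(retrieval_accuracy(src, tgt))  # 100.0
```

Swapping the two arguments gives the opposite retrieval direction, which is why the table reports en→xx and xx→en separately.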

References

Load What You Need: Smaller Versions of Multilingual BERT (Paper: https://arxiv.org/abs/2010.05609, GitHub: https://github.com/Geotrend-research/smaller-transformers)
Language-agnostic BERT Sentence Embedding: https://arxiv.org/abs/2007.01852
TFHub - LaBSE: https://tfhub.dev/google/LaBSE/2
LaBSE blog post: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond: https://arxiv.org/abs/1812.10464

Comments

Training time and Machine configuration

Hi, thanks for sharing your model. I want to make a smaller model that contains just two languages (en, zh). What kind of GPU does that need, and how long does it take?

opened by QzzIsCoding 2

Publish model to HuggingFace Model Hub?

I migrated the full LaBSE model from TF to PyTorch and uploaded it to the HuggingFace model hub. I then saw this model on the TF hub and started migrating it so I could upload it to the HF Hub as well. I realized then that this wasn't published by Google but by @jeongukjae, so I wanted to check with you before uploading it.

I have exported the model locally. I'm happy to check the changes in and upload the exported model if that's fine for you :).

opened by setu4993 2

Releases(15lang-1)

15lang-1(Sep 19, 2021)

Owner

Jeong Ukjae
