Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"


Introduction

This repository contains research code for the ACL 2021 paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models". Feel free to use this code to re-run our experiments or run new experiments on your own data.

Setup

General  
  1. Clone this repo
git clone git@github.com:Adapter-Hub/hgiyt.git
  2. Install PyTorch (we used v1.7.1 - code may not work as expected for older or newer versions) in a new Python (>=3.6) virtual environment
pip install torch===1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
  3. Initialize the submodules
git submodule update --init --recursive
  4. Install the adapter-transformers library and dependencies
pip install lib/adapter-transformers
pip install -r requirements.txt
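
As a quick sanity check after the installation, a minimal snippet like the following (assuming the packages were installed as above; adapter-transformers is imported under the standard transformers package name) should run without errors:

# Optional sanity check for the environment
import torch
import transformers  # adapter-transformers is imported under the standard `transformers` name

print("PyTorch version:", torch.__version__)          # expected: 1.7.1+cu110
print("CUDA available:", torch.cuda.is_available())
print("Transformers version:", transformers.__version__)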
Pretraining  
  1. Install Nvidia Apex for automatic mixed-precision (amp / fp16) training
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Install wiki-bert-pipeline dependencies
pip install -r lib/wiki-bert-pipeline/requirements.txt
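
To verify the Apex build, a minimal check is to import the amp module, which fails if the CUDA/C++ extensions were not compiled:

# Optional check: apex.amp must be importable for fp16 training
from apex import amp  # raises ImportError if Apex was not built with the CUDA/C++ extensions
print("apex.amp is available")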
Language-specific prerequisites  

To use the Japanese monolingual model, install the morphological parser MeCab with the mecab-ipadic-20070801 dictionary:

  1. Install gdown for easy downloads from Google Drive
pip install gdown
  2. Download and install MeCab
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7cENtOXlicTFaRUE
tar -xvzf mecab-0.996.tar.gz
cd mecab-0.996
./configure 
make
make check
sudo make install
  3. Download and install the mecab-ipadic-20070801 dictionary
gdown https://drive.google.com/uc?id=0B4y35FiV1wh7MWVlSDBCSXZMTXM
tar -xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8
make
sudo make install
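
To confirm that MeCab and the dictionary are working, a minimal check (the sample sentence is arbitrary) is to run the mecab binary on a short Japanese string, e.g. from Python:

# Optional check: run the installed mecab binary on a sample sentence
import subprocess

sample = "これはテストです。"  # "This is a test."
result = subprocess.run(
    ["mecab"], input=sample, stdout=subprocess.PIPE, universal_newlines=True, check=True
)
print(result.stdout)  # one morpheme per line, terminated by EOS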

Data

We unfortunately cannot host the datasets used in our paper in this repo. However, we provide download links (wherever possible) and instructions or scripts to preprocess the data for finetuning and for pretraining.

Experiments

Our scripts are largely borrowed from the transformers and adapter-transformers libraries. For pretrained models and adapters we rely on the ModelHub and AdapterHub. However, even if you haven't used them before, running our scripts should be pretty straightforward :).

We provide instructions on how to execute our finetuning scripts and our pretraining script in the corresponding documentation in this repository.
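
If you have not used adapter-transformers before, the following minimal sketch (the adapter identifier is only a placeholder) illustrates how an adapter from the AdapterHub is typically loaded and activated:

# Minimal adapter-transformers usage sketch
from transformers import AutoModelWithHeads

model = AutoModelWithHeads.from_pretrained("bert-base-multilingual-cased")
adapter_name = model.load_adapter("task/dataset@org")  # placeholder AdapterHub identifier -- pick a real one from https://adapterhub.ml
model.set_active_adapters(adapter_name)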

Models

Our pretrained models are also available in the ModelHub: https://huggingface.co/hgiyt. Feel free to finetune them with our scripts or use them in your own code.
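
For example, loading one of the models is straightforward with the standard auto classes (the model identifier below is a placeholder; substitute one of the checkpoints listed on the ModelHub page):

# Loading one of the pretrained models from the ModelHub
from transformers import AutoModel, AutoTokenizer

model_id = "hgiyt/<model-name>"  # placeholder -- substitute a real identifier from https://huggingface.co/hgiyt
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

print(tokenizer.tokenize("How good is your tokenizer?"))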

Citation & Authors

@inproceedings{rust-etal-2021-good,
      title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models}, 
      author    = {Phillip Rust and Jonas Pfeiffer and Ivan Vuli{\'c} and Sebastian Ruder and Iryna Gurevych},
      year      = {2021},
      booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational
                  Linguistics, {ACL} 2021, Online, August 1-6, 2021},
      url       = {https://arxiv.org/abs/2012.15613},
      pages     = {3118--3135}
}

Contact Person: Phillip Rust, [email protected]

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
