Code for the paper "Adapting Monolingual Models: Data can be Scarce when Language Similarity is High"

Last update: Aug 02, 2021

Overview

Wietse de Vries • Martijn Bartelds • Malvina Nissim • Martijn Wieling

Adapting Monolingual Models: Data can be Scarce when Language Similarity is High

This repository contains everything that is needed to replicate the results in the paper:

📝 Adapting Monolingual Models: Data can be Scarce when Language Similarity is High

Models

The best fine-tuned models for Gronings and West Frisian are available on the Hugging Face model hub:

Lexical layers

These models are identical to BERTje, but with different lexical layers (bert.embeddings.word_embeddings).

🤗 GroNLP/bert-base-dutch-cased (Dutch; source language)
🤗 GroNLP/bert-base-dutch-cased-gronings (Gronings)
🤗 GroNLP/bert-base-dutch-cased-frisian (West Frisian)
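
As a rough illustration of what "identical except for the lexical layer" means in practice, the sketch below loads two checkpoints and lists the parameter tensors that differ between them. The helper names (`load_encoder`, `differing_parameters`) are hypothetical and not part of this repository. It assumes the `transformers` and `torch` packages are installed and that the model hub is reachable; only the hub IDs come from the list above.

```python
# Hub IDs copied from the list above.
LEXICAL_MODELS = {
    "dutch": "GroNLP/bert-base-dutch-cased",
    "gronings": "GroNLP/bert-base-dutch-cased-gronings",
    "frisian": "GroNLP/bert-base-dutch-cased-frisian",
}


def load_encoder(language):
    """Download the tokenizer and encoder for one language (needs network access)."""
    # Imported lazily so the comparison helper below works without transformers.
    from transformers import AutoModel, AutoTokenizer

    model_id = LEXICAL_MODELS[language]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def differing_parameters(state_a, state_b):
    """Return the names of tensors whose shape or values differ between two state dicts."""
    differing = []
    for name in sorted(set(state_a) | set(state_b)):
        a, b = state_a.get(name), state_b.get(name)
        if a is None or b is None:
            # Present in only one of the two models.
            differing.append(name)
        elif tuple(a.shape) != tuple(b.shape) or not bool((a == b).all()):
            differing.append(name)
    return differing
```

Calling `differing_parameters(dutch.state_dict(), gronings.state_dict())` on two encoders returned by `load_encoder` should, if the description above holds for the published checkpoints, report little beyond the word embedding weights. This is a way to check that claim, not a statement of what the output will be.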

POS tagging

These models share the same fine-tuned Transformer layers and classification head; each one uses the retrained lexical layer of the corresponding model above.

🤗 GroNLP/bert-base-dutch-cased-upos-alpino (Dutch)
🤗 GroNLP/bert-base-dutch-cased-upos-alpino-gronings (Gronings)
🤗 GroNLP/bert-base-dutch-cased-upos-alpino-frisian (West Frisian)
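
A minimal sketch of tagging a sentence with one of these checkpoints through the `transformers` token classification API. The function names `tag` and `first_subword_labels` are hypothetical, and taking the prediction at each word's first sub-token is a common convention assumed here, not necessarily the evaluation setup used in the paper. Running `tag` requires `transformers`, `torch`, and network access.

```python
# Hub IDs copied from the list above.
POS_MODELS = {
    "dutch": "GroNLP/bert-base-dutch-cased-upos-alpino",
    "gronings": "GroNLP/bert-base-dutch-cased-upos-alpino-gronings",
    "frisian": "GroNLP/bert-base-dutch-cased-upos-alpino-frisian",
}


def first_subword_labels(word_ids, label_ids, id2label):
    """Keep one label per word: the prediction at the word's first sub-token."""
    labels = []
    previous = None
    for word_id, label_id in zip(word_ids, label_ids):
        # Special tokens such as [CLS] and [SEP] have word_id None and are skipped.
        if word_id is not None and word_id != previous:
            labels.append(id2label[label_id])
        previous = word_id
    return labels


def tag(words, language="gronings"):
    """Tag a pre-tokenized sentence; returns a list of (word, label) pairs."""
    # Imported lazily so first_subword_labels can be used without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    model_id = POS_MODELS[language]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    label_ids = logits[0].argmax(dim=-1).tolist()
    # word_ids() is only available for fast tokenizers (the AutoTokenizer default).
    labels = first_subword_labels(encoding.word_ids(0), label_ids, model.config.id2label)
    return list(zip(words, labels))
```

`tag` expects a list of words rather than a raw string, so the caller is responsible for splitting the sentence. A word that produces no sub-tokens at all (for example an empty string) would misalign the returned pairs, so filter such inputs first.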

Development

Conda/mamba dependencies are listed in environment.yml. The scripts and configs needed to replicate the results in the paper are all included in this repository; a more extensive usage guide will be provided later.

BibTeX entry

The paper is to appear in Findings of ACL 2021. The preprint can be cited as:

@misc{devries2021adapting,
      title={{Adapting Monolingual Models: Data can be Scarce when Language Similarity is High}}, 
      author={Wietse de Vries and Martijn Bartelds and Malvina Nissim and Martijn Wieling},
      year={2021},
      eprint={2105.02855},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
