Training open neural machine translation models

Last update: Jan 03, 2023

Overview

Train Opus-MT models

This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Makefile but documentation needs to be improved. Also, the targets require a specific environment and right now only work well on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distribted with a CC-BY 4.0 license license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge also under a CC-BY 4.0 license license.

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please, cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conferenec of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

GNU/Linux tools
Marian-NMT
eflomal

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support by the University of Helsinki, the IT Center of Science CSC, the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG) and the contributors to the open collection of parallel corpora OPUS.

Training open neural machine translation models

Related tags

Overview

Train Opus-MT models

Pre-trained models

Quickstart

Documentation

Tutorials

References

Acknowledgements

Owner

Language Technology at the University of Helsinki

Tools to download and cleanup Common Crawl data

Prompt tuning toolkit for GPT-2 and GPT-Neo

Levenshtein and Hamming distance computation

Malware-Related Sentence Classification

Beyond the Imitation Game collaborative benchmark for enormous language models

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

An assignment on creating a minimalist neural network toolkit for CS11-747

This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Code for using and evaluating SpanBERT.

Perform sentiment analysis and keyword extraction on Craigslist listings

构建一个多源（公众号、RSS）、干净、个性化的阅读环境

A PyTorch implementation of VIOLET

Few-shot Natural Language Generation for Task-Oriented Dialog

An open-source NLP library: fast text cleaning and preprocessing.

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

Natural Language Processing

Continuously update some NLP practice based on different tasks.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.