Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Last update: Jan 07, 2023

Overview

DataSelection-NMT

Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Quick update: The paper got accepted on Dec 6, 2021! I will link the repository to the paper as soon as it got published.

Our Pre-trained models on Hugging Face

Systems	Link	Systems	Link
Top1	Download	Top1	Download
Top2+Top1	Download	Top2	Download
Top3+Top2+...	Download	Top3	Donwload
Top4+Top3+...	Download	Top4	Donwload
Top5+Top4+...	Download	Top5	Donwload
Top6+Top5+...	Download	Top6	Donwload

How to use

Note: we ported the best checkpoints of trained models to the Hugging Face (HF). Since our models were trained by OpenNMT-py, it was not possible to employ them directly for inference on HF. To bypass this issue, we use CTranslate2– an inference engine for transformer models.

Follow steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

2. Download models from our HF repository: You can do this manually or use the following python script:

import requests

url = "Download Link"
model_path = "Model Path"
r = requests.get(url, allow_redirects=True)
open(model_path, 'wb').write(r.content)

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory

3. Translate tokenized inputs:

Note: the inputs should be tokenized by SentencePiece. You can also use tokenized version of IWSLT test sets.

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
translator.translate_file(input_file, output_file, batch_type= "tokens/examples")

To customize the CTranslate2 functions, read this API document.

4. Detokenize the outputs:

Note: you need to detokenize the output with the same sentencepiece model as used in step 3.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok

5. Remove the @@ tokens:

cat output_file.detok | sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
> output._file.detok.postprocessd

Use grep to check if @@ tokens removed successfully:

grep @@ output._file.detok.postprocessd

Authors

Javad Pourmostafa - Email, Website
Dimitar Shterionov - Email, Website
Pieter Spronck - Email, Website

You might also like...

Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Parallel Tacotron2 Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

170 Dec 27, 2022

Code for the paper "JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design"

JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design This repository contains code for the paper: JA

55 Nov 29, 2022

PyKale is a PyTorch library for multimodal learning and transfer learning as well as deep learning and dimensionality reduction on graphs, images, texts, and videos

PyKale is a PyTorch library for multimodal learning and transfer learning as well as deep learning and dimensionality reduction on graphs, images, texts, and videos. By adopting a unified pipeline-based API design, PyKale enforces standardization and minimalism, via reusing existing resources, reducing repetitions and redundancy, and recycling learning models across areas.

370 Dec 27, 2022

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

ood-text-emnlp Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them" Files fine_tune.py is used to finetune the GPT-2 mo

19 Oct 28, 2022

Generate images from texts. In Russian. In PaddlePaddle

ruDALL-E PaddlePaddle ruDALL-E in PaddlePaddle. Install: pip install rudalle_paddle==0.0.1rc1 Run with free v100 on AI Studio. Original Pytorch versi

20 Oct 18, 2022

Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts

t5-japanese Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts. The following is a list of models that

1 Dec 13, 2021

Build an Amazon SageMaker Pipeline to Transform Raw Texts to A Knowledge Graph

Build an Amazon SageMaker Pipeline to Transform Raw Texts to A Knowledge Graph This repository provides a pipeline to create a knowledge graph from ra

3 Jan 1, 2022

Code for CVPR2021 "Visualizing Adapted Knowledge in Domain Transfer". Visualization for domain adaptation. #explainable-ai

Visualizing Adapted Knowledge in Domain Transfer @inproceedings{hou2021visualizing, title={Visualizing Adapted Knowledge in Domain Transfer}, auth

80 Dec 25, 2022

[CVPR2021] Domain Consensus Clustering for Universal Domain Adaptation

[CVPR2021] Domain Consensus Clustering for Universal Domain Adaptation [Paper] Prerequisites To install requirements: pip install -r requirements.txt

84 Dec 26, 2022

Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Related tags

Overview

DataSelection-NMT

Quick update: The paper got accepted on Dec 6, 2021! I will link the repository to the paper as soon as it got published.

Our Pre-trained models on Hugging Face

How to use

Authors

You might also like...

Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Code for the paper "JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design"

PyKale is a PyTorch library for multimodal learning and transfer learning as well as deep learning and dimensionality reduction on graphs, images, texts, and videos

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Generate images from texts. In Russian. In PaddlePaddle

Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts

Build an Amazon SageMaker Pipeline to Transform Raw Texts to A Knowledge Graph

Code for CVPR2021 "Visualizing Adapted Knowledge in Domain Transfer". Visualization for domain adaptation. #explainable-ai

[CVPR2021] Domain Consensus Clustering for Universal Domain Adaptation

Releases(1.1)

1.1(Oct 25, 2021)

Owner

Javad Pourmostafa

Code for HLA-Face: Joint High-Low Adaptation for Low Light Face Detection (CVPR21)

FaceAnon - Anonymize people in images and videos using yolov5-crowdhuman

Multi-Task Learning as a Bargaining Game

PConv-Keras - Unofficial implementation of "Image Inpainting for Irregular Holes Using Partial Convolutions". Try at: www.fixmyphoto.ai

Tutorials and implementations for "Self-normalizing networks"

PyTorch-lightning implementation of the ESFW module proposed in our paper Edge-Selective Feature Weaving for Point Cloud Matching

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

This is the code for the paper "Motion-Focused Contrastive Learning of Video Representations" (ICCV'21).

Keras Implementation of The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation by (Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio)

A PyTorch implementation of EfficientNet and EfficientNetV2 (coming soon!)

A scikit-learn-compatible module for estimating prediction intervals.

PiRank: Learning to Rank via Differentiable Sorting

Tensorflow Implementation of Pixel Transposed Convolutional Networks (PixelTCN and PixelTCL)

Image Restoration Toolbox (PyTorch). Training and testing codes for DPIR, USRNet, DnCNN, FFDNet, SRMD, DPSR, BSRGAN, SwinIR

This is implementation of AlexNet(2012) with 3D Convolution on TensorFlow (AlexNet 3D).

IDM: An Intermediate Domain Module for Domain Adaptive Person Re-ID,

The codes of paper 'Active-LATHE: An Active Learning Algorithm for Boosting the Error exponent for Learning Homogeneous Ising Trees'

Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

A 3D Dense mapping backend library of SLAM based on taichi-Lang designed for the aerial swarm.

Simple sinc interpolation in PyTorch.