A benchmark for the task of translation suggestion

Overview

WeTS: A Benchmark for Translation Suggestion

Translation Suggestion (TS), which provides alternatives for specific words or phrases given the entire documents translated by machine translation (MT) has been proven to play a significant role in post editing (PE). WeTS is a benchmark data set for TS, which is annotated by expert translators. WeTS contains corpus(train/dev/test) for four different translation directions, i.e., English2German, German2English, Chinese2English and English2Chinese.


Contents

Data


WeTS is a benchmark dataset for TS, where all the examples are annotated by expert translators. As far as we know, this is the first golden corpus for TS. The statistics about WeTS are listed in the following table:

Translation Direction Train Valid Test
English2German 14,957 1000 1000
German2English 11,777 1000 1000
English2Chinese 15,769 1000 1000
Chinese2English 21,213 1000 1000

For corpus in each direction, the data is organized as:
direction.split.src: the source-side sentences
direction.split.mask: the masked translation sentences, the placeholder is "<MASK>"
direction.split.tgt: the predicted suggestions, the test set for English2Chinese has three references for each example

direction: En2De, De2En, Zh2En, En2Zh
split: train, dev, test

Models


We release the pre-trained NMT models which are used to generate the MT sentences. Additionally, the released NMT models can be used to generate synthetic corpus for TS, which can improve the final performance dramatically.Detailed description about the way of generating synthetic corpus can be found in our paper.

The released models can be downloaded at:

Download the models

and the password is "2iyk"

For inference with the released model, we can:

sh inference_*direction*.sh 

direction can be: en2de, de2en, en2zh, zh2en

Get Started


data preprocessing

sh process.sh 

pre-training

Codes for the first-phase pre-training are not included in this repo, as we directly utilized the codes of XLM (https://github.com/facebookresearch/XLM) with little modiafication. And we did not achieve much gains with the first-phase pretraining.

The second-phase pre-training:

sh preptraining.sh

fine-tuning

sh finetuning.sh

Codes in this repo is mainly forked from fairseq (https://github.com/pytorch/fairseq.git)

Citation


Please cite the following paper if you found the resources in this repository useful.

@article{yang2021wets,
  title={WeTS: A Benchmark for Translation Suggestion},
  author={Yang, Zhen and Zhang, Yingxue and Li, Ernan and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2110.05151},
  year={2021}
}

LICENCE


See LICENCE

Owner
zhyang
zhyang
Implementation of Pooling by Sliced-Wasserstein Embedding (NeurIPS 2021)

PSWE: Pooling by Sliced-Wasserstein Embedding (NeurIPS 2021) PSWE is a permutation-invariant feature aggregation/pooling method based on sliced-Wasser

Navid Naderializadeh 3 May 06, 2022
NHS AI Lab Skunkworks project: Long Stayer Risk Stratification

NHS AI Lab Skunkworks project: Long Stayer Risk Stratification A pilot project for the NHS AI Lab Skunkworks team, Long Stayer Risk Stratification use

NHSX 21 Nov 14, 2022
Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically

NB Workflows Description If SQL is a lingua franca for querying data, Jupyter sh

Xavier Petit 6 Aug 18, 2022
PyTorch implementation of Algorithm 1 of "On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models"

Code for On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models This repository will reproduce the main results from our pape

Mitch Hill 32 Nov 25, 2022
Cycle Consistent Adversarial Domain Adaptation (CyCADA)

Cycle Consistent Adversarial Domain Adaptation (CyCADA) A pytorch implementation of CyCADA. If you use this code in your research please consider citi

Hyunwoo Ko 2 Jan 10, 2022
Train a state-of-the-art yolov3 object detector from scratch!

TrainYourOwnYOLO: Building a Custom Object Detector from Scratch This repo let's you train a custom image detector using the state-of-the-art YOLOv3 c

AntonMu 616 Jan 08, 2023
Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts

t5-japanese Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts. The following is a list of models that

Kimio Kuramitsu 1 Dec 13, 2021
git《Learning Pairwise Inter-Plane Relations for Piecewise Planar Reconstruction》(ECCV 2020) GitHub:

Learning Pairwise Inter-Plane Relations for Piecewise Planar Reconstruction Code for the ECCV 2020 paper by Yiming Qian and Yasutaka Furukawa Getting

37 Dec 04, 2022
Official implementation of NeurIPS 2021 paper "Contextual Similarity Aggregation with Self-attention for Visual Re-ranking"

CSA: Contextual Similarity Aggregation with Self-attention for Visual Re-ranking PyTorch training code for CSA (Contextual Similarity Aggregation). We

Hui Wu 19 Oct 21, 2022
[CVPR 2022 Oral] Balanced MSE for Imbalanced Visual Regression https://arxiv.org/abs/2203.16427

Balanced MSE Code for the paper: Balanced MSE for Imbalanced Visual Regression Jiawei Ren, Mingyuan Zhang, Cunjun Yu, Ziwei Liu CVPR 2022 (Oral) News

Jiawei Ren 267 Jan 01, 2023
Code Release for Learning to Adapt to Evolving Domains

EAML Code release for "Learning to Adapt to Evolving Domains" (NeurIPS 2020) Prerequisites PyTorch = 0.4.0 (with suitable CUDA and CuDNN version) tor

23 Dec 07, 2022
a general-purpose Transformer based vision backbone

Swin Transformer By Ze Liu*, Yutong Lin*, Yue Cao*, Han Hu*, Yixuan Wei, Zheng Zhang, Stephen Lin and Baining Guo. This repo is the official implement

Microsoft 9.9k Jan 08, 2023
Differentiable architecture search for convolutional and recurrent networks

Differentiable Architecture Search Code accompanying the paper DARTS: Differentiable Architecture Search Hanxiao Liu, Karen Simonyan, Yiming Yang. arX

Hanxiao Liu 3.7k Jan 09, 2023
[CVPR2021] UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles

UAV-Human Official repository for CVPR2021: UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicle Paper arXiv Res

129 Jan 04, 2023
DECA: Detailed Expression Capture and Animation (SIGGRAPH 2021)

DECA: Detailed Expression Capture and Animation (SIGGRAPH2021) input image, aligned reconstruction, animation with various poses & expressions This is

Yao Feng 1.5k Jan 02, 2023
Radar-to-Lidar: Heterogeneous Place Recognition via Joint Learning

radar-to-lidar-place-recognition This page is the coder of a pre-print, implemented by PyTorch. If you have some questions on this project, please fee

Huan Yin 37 Oct 09, 2022
The best solution of the Weather Prediction track in the Yandex Shifts challenge

yandex-shifts-weather The repository contains information about my solution for the Weather Prediction track in the Yandex Shifts challenge https://re

Ivan Yu. Bondarenko 15 Dec 18, 2022
Using deep learning model to detect breast cancer.

Breast-Cancer-Detection Breast cancer is the most frequent cancer among women, with around one in every 19 women at risk. The number of cases of breas

1 Feb 13, 2022
(CVPR2021) DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation

DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation CVPR2021(oral) [arxiv] Requirements python3.7 pytorch==

W-zx-Y 85 Dec 07, 2022
Rotation Robust Descriptors

RoRD Rotation-Robust Descriptors and Orthographic Views for Local Feature Matching Project Page | Paper link Evaluation and Datasets MMA : Training on

Udit Singh Parihar 25 Nov 15, 2022