Code repository for the paper "Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation" with instructions to reproduce the results.

Overview

Doubly Trained Neural Machine Translation System for Adversarial Attack and Data Augmentation

Languages Experimented:

  • Data Overview:

    Source Target Training Data Valid1 Valid2 Test data
    ZH EN WMT17 without UN corpus WMT2017 newstest WMT2018 newstest WMT2020 newstest
    DE EN WMT17 WMT2017 newstest WMT2018 newstest WMT2014 newstest
    FR EN WMT14 without UN corpus WMT2015 newsdiscussdev WMT2015 newsdiscusstest WMT2014 newstest
  • Corpus Statistics:

    Lang-pair Data Type #Sentences #tokens (English side)
    zh-en Train 9355978 161393634
    Valid1 2001 47636
    Valid2 3981 98308
    test 2000 65561
    de-en Train 4001246 113777884
    Valid1 2941 74288
    Valid2 2970 78358
    test 3003 78182
    fr-en Train 23899064 73523616
    Valid1 1442 30888
    Valid2 1435 30215
    test 3003 81967

Scripts (as shown in paper's appendix)

  • Set-up:

    • To execute the scripts shown below, it's required that fairseq version 0.9 is installed along with COMET. The way to easily install them after cloning this repo is executing following commands (under root of this repo):
      cd fairseq-0.9.0
      pip install --editable ./
      cd ../COMET
      pip install .
    • It's also possible to directly install COMET through pip: pip install unbabel-comet, but the recent version might have different dependency on other packages like fairseq. Please check COMET's official website for the updated information.
    • To make use of script that relies on COMET model (in case of dual-comet), a model from COMET should be downloaded. It can be easily done by running following script:
      from comet.models import download_model
      download_model("wmt-large-da-estimator-1719")
  • Pretrain the model:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 2048 --update-freq 16 \
        --seed 2 
  • Adversarial Attack:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --train-subset valid \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_bleu --mrt-k 16 \
        --batch-size 2 --update-freq 64 \
        --seed 2 \
        --restore-file $PREETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader 
  • Data Augmentation:

    fairseq-train $DATADIR \
        -s $src -t $tgt \
        --train-subset valid \
        --valid-subset valid1 \
        --left-pad-source False \
        --share-decoder-input-output-embed \
        --encoder-embed-dim 512 \
        --arch transformer_wmt_en_de \
        --dual-training \
        --auxillary-model-path $AUX_MODEL \
        --auxillary-model-save-dir $AUX_MODEL_SAVE \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 0.000001 --warmup-updates 1000 \
        --lr 0.00001 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_comet/dual_mrt --mrt-k 8 \
        --comet-route $COMET_PATH \
        --batch-size 4 \
        --skip-invalid-size-inputs-valid-test \
        --update-freq 1 \
        --on-the-fly-train --adv-percent 30 \
        --seed 2 \
        --restore-file $PRETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader \
        --save-dir $CHECKPOINT_FOLDER 

Generation and Test:

  • For Chinese-English, we use sentencepiece to perform the BPE so it's required to be removed in generation step. For all test we use beam size = 5. Noitce that we modified the code in fairseq-gen to use sacrebleu.tokenizers.TokenizerZh() to tokenize Chinese when the direction is en-zh.

    fairseq-generate $DATA-FOLDER \
        -s zh -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe sentencepiece \
        --sacrebleu \
        --beam 5
  • For French-Enlish, German-English, we modified the script to detokenize the moses tokenizer (which we used to preprocess the data). To reproduce the result, use following script:

    fairseq-generate $DATA-FOLDER \
        -s de/fr -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe \
        ---detokenize-moses \
        --sacrebleu \
        --beam 5

    Here --detokenize-moses would call detokenizer during the generation step and detokenize predictions before evaluating it. It would slow the generation step. Another way to manually do this is to retrieve prediction and target sentences from output file of fairseq and manually apply detokenizer from detokenizer.perl.

BibTex

@misc{tan2021doublytrained,
      title={Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation}, 
      author={Weiting Tan and Shuoyang Ding and Huda Khayrallah and Philipp Koehn},
      year={2021},
      eprint={2110.05691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
Steven Tan
Johns Hopkins 21' Computer Science & Applied Mathematics and Statistics Major
Steven Tan
Indices Matter: Learning to Index for Deep Image Matting

IndexNet Matting This repository includes the official implementation of IndexNet Matting for deep image matting, presented in our paper: Indices Matt

Hao Lu 357 Nov 26, 2022
Official implementation of "A Shared Representation for Photorealistic Driving Simulators" in PyTorch.

A Shared Representation for Photorealistic Driving Simulators The official code for the paper: "A Shared Representation for Photorealistic Driving Sim

VITA lab at EPFL 7 Oct 13, 2022
Repo for EMNLP 2021 paper "Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression"

beyond-preserved-accuracy Repo for EMNLP 2021 paper "Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression" How to implemen

Kevin Canwen Xu 10 Dec 23, 2022
A PyTorch implementation of NeRF (Neural Radiance Fields) that reproduces the results.

NeRF-pytorch NeRF (Neural Radiance Fields) is a method that achieves state-of-the-art results for synthesizing novel views of complex scenes. Here are

Yen-Chen Lin 3.2k Jan 08, 2023
Official Implementation of 'UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers' ICLR 2021(spotlight)

UPDeT Official Implementation of UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers (ICLR 2021 spotlight) The

hhhusiyi 96 Dec 22, 2022
The personal repository of the work: *DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer*.

DanceNet3D The personal repository of the work: DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer. Dataset and Results Pleas

南嘉Nanga 36 Dec 21, 2022
Implementation of momentum^2 teacher

Momentum^2 Teacher: Momentum Teacher with Momentum Statistics for Self-Supervised Learning Requirements All experiments are done with python3.6, torch

jemmy li 121 Sep 26, 2022
Code and results accompanying our paper titled Mixture Proportion Estimation and PU Learning: A Modern Approach at Neurips 2021 (Spotlight)

Mixture Proportion Estimation and PU Learning: A Modern Approach This repository is the official implementation of Mixture Proportion Estimation and P

Approximately Correct Machine Intelligence (ACMI) Lab 23 Dec 28, 2022
JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces

JAXMAPP: JAX-based Library for Multi-Agent Path Planning in Continuous Spaces JAXMAPP is a JAX-based library for multi-agent path planning (MAPP) in c

OMRON SINIC X 24 Dec 28, 2022
Clustergram - Visualization and diagnostics for cluster analysis in Python

Clustergram Visualization and diagnostics for cluster analysis Clustergram is a diagram proposed by Matthias Schonlau in his paper The clustergram: A

Martin Fleischmann 96 Dec 26, 2022
AntiFuzz: Impeding Fuzzing Audits of Binary Executables

AntiFuzz: Impeding Fuzzing Audits of Binary Executables Get the paper here: https://www.usenix.org/system/files/sec19-guler.pdf Usage: The python scri

Chair for Sys­tems Se­cu­ri­ty 88 Dec 21, 2022
Locationinfo - A script helps the user to show network information such as ip address

Description This script helps the user to show network information such as ip ad

Roxcoder 1 Dec 30, 2021
The description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts.

FMFCC-A This project is the description of FMFCC-A (audio track of FMFCC) dataset and Challenge resluts. The FMFCC-A dataset is shared through BaiduCl

18 Dec 24, 2022
PyTorch implementation of paper “Unbiased Scene Graph Generation from Biased Training”

A new codebase for popular Scene Graph Generation methods (2020). Visualization & Scene Graph Extraction on custom images/datasets are provided. It's also a PyTorch implementation of paper “Unbiased

Kaihua Tang 824 Jan 03, 2023
[WACV 2020] Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints

Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints Official implementation for Reducing Footskate in Human Motion Recon

Virginia Tech Vision and Learning Lab 38 Nov 01, 2022
An abstraction layer for mathematical optimization solvers.

MathOptInterface Documentation Build Status Social An abstraction layer for mathematical optimization solvers. Replaces MathProgBase. Citing MathOptIn

JuMP-dev 284 Jan 04, 2023
TensorFlow Implementation of "Show, Attend and Tell"

Show, Attend and Tell Update (December 2, 2016) TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attent

Yunjey Choi 902 Nov 29, 2022
Manipulation OpenAI Gym environments to simulate robots at the STARS lab

Manipulator Learning This repository contains a set of manipulation environments that are compatible with OpenAI Gym and simulated in pybullet. In par

STARS Laboratory 5 Dec 08, 2022
Unofficial Implementation of MLP-Mixer in TensorFlow

mlp-mixer-tf Unofficial Implementation of MLP-Mixer [abs, pdf] in TensorFlow. Note: This project may have some bugs in it. I'm still learning how to i

Rishabh Anand 24 Mar 23, 2022
Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Less is More: Pay Less Attention in Vision Transformers Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers. By

73 Jan 01, 2023