Code repository for the paper "Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation" with instructions to reproduce the results.

Overview

Doubly Trained Neural Machine Translation System for Adversarial Attack and Data Augmentation

Languages Experimented:

  • Data Overview:

    Source Target Training Data Valid1 Valid2 Test data
    ZH EN WMT17 without UN corpus WMT2017 newstest WMT2018 newstest WMT2020 newstest
    DE EN WMT17 WMT2017 newstest WMT2018 newstest WMT2014 newstest
    FR EN WMT14 without UN corpus WMT2015 newsdiscussdev WMT2015 newsdiscusstest WMT2014 newstest
  • Corpus Statistics:

    Lang-pair Data Type #Sentences #tokens (English side)
    zh-en Train 9355978 161393634
    Valid1 2001 47636
    Valid2 3981 98308
    test 2000 65561
    de-en Train 4001246 113777884
    Valid1 2941 74288
    Valid2 2970 78358
    test 3003 78182
    fr-en Train 23899064 73523616
    Valid1 1442 30888
    Valid2 1435 30215
    test 3003 81967

Scripts (as shown in paper's appendix)

  • Set-up:

    • To execute the scripts shown below, it's required that fairseq version 0.9 is installed along with COMET. The way to easily install them after cloning this repo is executing following commands (under root of this repo):
      cd fairseq-0.9.0
      pip install --editable ./
      cd ../COMET
      pip install .
    • It's also possible to directly install COMET through pip: pip install unbabel-comet, but the recent version might have different dependency on other packages like fairseq. Please check COMET's official website for the updated information.
    • To make use of script that relies on COMET model (in case of dual-comet), a model from COMET should be downloaded. It can be easily done by running following script:
      from comet.models import download_model
      download_model("wmt-large-da-estimator-1719")
  • Pretrain the model:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 2048 --update-freq 16 \
        --seed 2 
  • Adversarial Attack:

    fairseq-train $DATADIR \
        --source-lang $src \
        --target-lang $tgt \
        --save-dir $SAVEDIR \
        --share-decoder-input-output-embed \
        --train-subset valid \
        --arch transformer_wmt_en_de \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_bleu --mrt-k 16 \
        --batch-size 2 --update-freq 64 \
        --seed 2 \
        --restore-file $PREETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader 
  • Data Augmentation:

    fairseq-train $DATADIR \
        -s $src -t $tgt \
        --train-subset valid \
        --valid-subset valid1 \
        --left-pad-source False \
        --share-decoder-input-output-embed \
        --encoder-embed-dim 512 \
        --arch transformer_wmt_en_de \
        --dual-training \
        --auxillary-model-path $AUX_MODEL \
        --auxillary-model-save-dir $AUX_MODEL_SAVE \
        --optimizer adam --adam-betas ’(0.9, 0.98)’ --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 0.000001 --warmup-updates 1000 \
        --lr 0.00001 --min-lr 1e-09 \
        --dropout 0.3 --weight-decay 0.0001 \
        --criterion dual_comet/dual_mrt --mrt-k 8 \
        --comet-route $COMET_PATH \
        --batch-size 4 \
        --skip-invalid-size-inputs-valid-test \
        --update-freq 1 \
        --on-the-fly-train --adv-percent 30 \
        --seed 2 \
        --restore-file $PRETRAIN_MODEL \
        --reset-optimizer \
        --reset-dataloader \
        --save-dir $CHECKPOINT_FOLDER 

Generation and Test:

  • For Chinese-English, we use sentencepiece to perform the BPE so it's required to be removed in generation step. For all test we use beam size = 5. Noitce that we modified the code in fairseq-gen to use sacrebleu.tokenizers.TokenizerZh() to tokenize Chinese when the direction is en-zh.

    fairseq-generate $DATA-FOLDER \
        -s zh -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe sentencepiece \
        --sacrebleu \
        --beam 5
  • For French-Enlish, German-English, we modified the script to detokenize the moses tokenizer (which we used to preprocess the data). To reproduce the result, use following script:

    fairseq-generate $DATA-FOLDER \
        -s de/fr -t en \
        --task translation \
        --gen-subset $file \
        --path $CHECKPOINT \
        --batch-size 64 --quiet \
        --lenpen 1.0 \
        --remove-bpe \
        ---detokenize-moses \
        --sacrebleu \
        --beam 5

    Here --detokenize-moses would call detokenizer during the generation step and detokenize predictions before evaluating it. It would slow the generation step. Another way to manually do this is to retrieve prediction and target sentences from output file of fairseq and manually apply detokenizer from detokenizer.perl.

BibTex

@misc{tan2021doublytrained,
      title={Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation}, 
      author={Weiting Tan and Shuoyang Ding and Huda Khayrallah and Philipp Koehn},
      year={2021},
      eprint={2110.05691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
Steven Tan
Johns Hopkins 21' Computer Science & Applied Mathematics and Statistics Major
Steven Tan
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 01, 2022
Code for the paper “The Peril of Popular Deep Learning Uncertainty Estimation Methods”

Uncertainty Estimation Methods Code for the paper “The Peril of Popular Deep Learning Uncertainty Estimation Methods” Reference If you use this code,

EPFL Machine Learning and Optimization Laboratory 4 Apr 05, 2022
Deep learning based hand gesture recognition using LSTM and MediaPipie.

Hand Gesture Recognition Deep learning based hand gesture recognition using LSTM and MediaPipie. Demo video using PingPong Robot Files Pretrained mode

Brad 24 Nov 11, 2022
GAN encoders in PyTorch that could match PGGAN, StyleGAN v1/v2, and BigGAN. Code also integrates the implementation of these GANs.

MTV-TSA: Adaptable GAN Encoders for Image Reconstruction via Multi-type Latent Vectors with Two-scale Attentions. This is the official code release fo

owl 37 Dec 24, 2022
This program automatically runs Python code copied in clipboard

CopyRun This program runs Python code which is copied in clipboard WARNING!! USE AT YOUR OWN RISK! NO GUARANTIES IF ANYTHING GETS BROKEN. DO NOT COPY

vertinski 4 Sep 10, 2021
Kinetics-Data-Preprocessing

Kinetics-Data-Preprocessing Kinetics-400 and Kinetics-600 are common video recognition datasets used by popular video understanding projects like Slow

Kaihua Tang 7 Oct 27, 2022
Adjust Decision Boundary for Class Imbalanced Learning

Adjusting Decision Boundary for Class Imbalanced Learning This repository is the official PyTorch implementation of WVN-RS, introduced in Adjusting De

Peyton Byungju Kim 16 Jan 04, 2023
Underwater industrial application yolov5m6

This project wins the intelligent algorithm contest finalist award and stands out from over 2000teams in China Underwater Robot Professional Contest, entering the final of China Underwater Robot Prof

8 Nov 09, 2022
Bunch of different tools which helps visualizing and annotating images for semantic/instance segmentation tasks

Data Framework for Semantic/Instance Segmentation Bunch of different tools which helps visualizing, transforming and annotating images for semantic/in

Bruno Fernandes Carvalho 5 Dec 21, 2022
mlpack: a scalable C++ machine learning library --

a fast, flexible machine learning library Home | Documentation | Doxygen | Community | Help | IRC Chat Download: current stable version (3.4.2) mlpack

mlpack 4.2k Jan 09, 2023
Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling

TGraM Multi-Object Tracking in Satellite Videos with Graph-Based Multi-Task Modeling, Qibin He, Xian Sun, Zhiyuan Yan, Beibei Li, Kun Fu Abstract Rece

Qibin He 6 Nov 25, 2022
Code for paper " AdderNet: Do We Really Need Multiplications in Deep Learning?"

AdderNet: Do We Really Need Multiplications in Deep Learning? This code is a demo of CVPR 2020 paper AdderNet: Do We Really Need Multiplications in De

HUAWEI Noah's Ark Lab 915 Jan 01, 2023
Doods2 - API for detecting objects in images and video streams using Tensorflow

DOODS2 - Return of DOODS Dedicated Open Object Detection Service - Yes, it's a b

Zach 101 Jan 04, 2023
Re-implementation of the vector capsule with dynamic routing

VectorCapsule Re-implementation of the vector capsule with dynamic routing We implement the vector capsule and dynamic routing via graph neural networ

ZhenchaoTang 10 Feb 10, 2022
The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks

The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks This folder contains the code to reproduce the data in "The Implicit Bias o

Samuel Lippl 0 Feb 05, 2022
A pytorch reproduction of { Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation }.

A PyTorch Reproduction of HCN Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation. Ch

Guyue Hu 210 Dec 31, 2022
TensorFlow (Python API) implementation of Neural Style

neural-style-tf This is a TensorFlow implementation of several techniques described in the papers: Image Style Transfer Using Convolutional Neural Net

Cameron 3.1k Jan 02, 2023
Simple transformer model for CIFAR10

CIFAR-Transformer Simple transformer model for CIFAR10. Reference: https://www.tensorflow.org/text/tutorials/transformer https://github.com/huggingfac

9 Nov 07, 2022
Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

SwinTextSpotter This is the pytorch implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text R

mxin262 183 Jan 03, 2023
Bottleneck Transformers for Visual Recognition

Bottleneck Transformers for Visual Recognition Experiments Model Params (M) Acc (%) ResNet50 baseline (ref) 23.5M 93.62 BoTNet-50 18.8M 95.11% BoTNet-

Myeongjun Kim 236 Jan 03, 2023