A2T: Towards Improving Adversarial Training of NLP Models (EMNLP 2021 Findings)

Overview

A2T: Towards Improving Adversarial Training of NLP Models

This is the source code for the EMNLP 2021 (Findings) paper "Towards Improving Adversarial Training of NLP Models".

If you use the code, please cite the paper:

@misc{yoo2021improving,
      title={Towards Improving Adversarial Training of NLP Models}, 
      author={Jin Yong Yoo and Yanjun Qi},
      year={2021},
      eprint={2109.00544},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Prerequisites

The work heavily relies on the TextAttack package. In fact, the main training code is implemented in the TextAttack package.

Required packages are listed in the requirements.txt file.

pip install -r requirements.txt

Data

All of the data used for the paper are available from HuggingFace's Datasets.

For IMDB and Yelp datasets, because there are no official validation splits, we randomly sampled 5k and 10k, respectively, from the training set and used them as valid splits. We provide the splits in this Google Drive folder. To use them with the provided code, place each folder (e.g. imdb, yelp, augmented_data) inside ./data (run mkdir data).

Also, augmented training data generated using SSMBA and back-translation are available in the same folder.

Training

To train BERT model on IMDB dataset with A2T attack for 4 epochs and 1 clean epoch with gamma of 0.2:

python train.py \
    --train imdb \
    --eval imdb \
    --model-type bert \
    --model-save-path ./example \
    --num-epochs 4 \
    --num-clean-epochs 1 \
    --num-adv-examples 0.2 \
    --attack-epoch-interval 1 \
    --attack a2t \
    --learning-rate 5e-5 \
    --num-warmup-steps 100 \
    --grad-accumu-steps 1 \
    --checkpoint-interval-epochs 1 \
    --seed 42

You can also pass roberta to train RoBERTa model instead of BERT model. To select other datasets from the paper, pass rt (MR), yelp, or snli for --train and --eval.

This script is actually just to run the Trainer class from the TextAttack package. To checkout how training is performed, please checkout the Trainer class.

Evaluation

To evalute the accuracy, robustness, and interpretability of our trained model from above, run

python evaluate.py \
    --dataset imdb \
    --model-type bert \
    --checkpoint-paths ./example_run \
    --epoch 4 \
    --save-log \
    --accuracy \
    --robustness \
    --attacks a2t a2t_mlm textfooler bae pwws pso \
    --interpretability 

This takes the last checkpoint model (--epoch 4) and evaluates its accuracy on both IMDB and Yelp dataset (for cross-domain accuracy). It also evalutes the model's robustness against A2T, A2T-MLM, TextFooler, BAE, PWWS, and PSO attacks. Lastly, with the --interpretability flag, AOPC scores are calculated.

Note that you will have to run --robustness and --interpretability with --accuracy (or after you separately evaluate accuracy) since both robustness and intepretability evaluations rely on the accuracy evaluation to know which samples the model was able to predict correctly. By default 1000 samples are attacked to evaluate robustness. Likewise, 1000 samples are used to calculate AOPC score for interpretability.

If you're evaluating multiple models for comparison, it's also advised that you provide all the checkpoint paths together to --checkpoint-paths. This is because the samples that are correctly by each model will be different, so we first need to identify the intersection of the all correct predictions before using them to evaluate robustness for all the models. This will allow fairer comparison of models' robustness rather than using attack different samples for each model.

Data Augmentation

Lastly, we also provide augment.py which we used to perform data augmentation methods such as SSMBA and back-translation.

Following is an example command for augmenting imdb dataset with SSMBA method.

python augment.py \
    --dataset imdb \
    --augmentation ssmba \
    --output-path ./augmented_data \
    --seed 42 

You can also pass backtranslation to --augmentation.

Owner
QData
http://www.cs.virginia.edu/yanjun/
QData
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost LOVE is accpeted by ACL22 main conference as a long pape

Lihu Chen 32 Jan 03, 2023
This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Joseph Imperial 1 Oct 05, 2021
BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network)

BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network) BERTAC is a framework that combines a

6 Jan 24, 2022
NLPShala , the best IDE for all Natural language processing tasks.

The revolutionary IDE for all NLP (Natural language processing) stuffs on the internet.

Abhi 3 Aug 08, 2021
Autoregressive Entity Retrieval

The GENRE (Generative ENtity REtrieval) system as presented in Autoregressive Entity Retrieval implemented in pytorch. @inproceedings{decao2020autoreg

Meta Research 611 Dec 16, 2022
Code for paper: An Effective, Robust and Fairness-awareHate Speech Detection Framework

BiQQLSTM_HS Code and data for paper: Title: An Effective, Robust and Fairness-awareHate Speech Detection Framework. Authors: Guanyi Mou and Kyumin Lee

Guanyi Mou 2 Dec 27, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 2.2k Jan 03, 2023
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

SEW (Squeezed and Efficient Wav2vec) The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speec

ASAPP Research 67 Dec 01, 2022
Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 04, 2022
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
🐍 A hyper-fast Python module for reading/writing JSON data using Rust's serde-json.

A hyper-fast, safe Python module to read and write JSON data. Works as a drop-in replacement for Python's built-in json module. This is alpha software

Matthias 479 Jan 01, 2023
Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Hans Alemão 4 Jul 20, 2022
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 01, 2022
An open source framework for seq2seq models in PyTorch.

pytorch-seq2seq Documentation This is a framework for sequence-to-sequence (seq2seq) models implemented in PyTorch. The framework has modularized and

International Business Machines 1.4k Jan 02, 2023
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
NLP applications using deep learning.

NLP-Natural-Language-Processing NLP applications using deep learning like text generation etc. 1- Poetry Generation: Using a collection of Irish Poem

KASHISH 1 Jan 27, 2022
Knowledge Management for Humans using Machine Learning & Tags

HyperTag helps humans intuitively express how they think about their files using tags and machine learning. Represent how you think using tags. Find what you look for using semantic search for your t

Ravn Tech, Inc. 166 Jan 07, 2023