Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models

The code implements the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

Datasets

In the paper, we use three datasets: SNIPS, STSA-2, and TREC.

Low-data regime experiment setup

Run the src/utils/download_and_prepare_datasets.sh script to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:

Download the data from GitHub
Replace numeric labels with text for the STSA-2 and TREC datasets
For each dataset, create 15 random splits of train and dev data
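
The label replacement and split creation steps above can be sketched in Python as follows. This is a minimal illustration, not the logic of download_and_prepare_datasets.sh itself: the label mapping, the split sizes, the seeding scheme, and the helper names relabel and make_splits are all assumptions made for this example.

```python
import random

# Hypothetical mapping from numeric labels to text labels;
# the mapping used by the actual script may differ.
LABEL_NAMES = {"0": "negative", "1": "positive"}


def relabel(rows, label_names):
    """Replace numeric labels with text labels in (label, text) pairs."""
    return [(label_names.get(label, label), text) for label, text in rows]


def make_splits(rows, num_splits=15, train_size=10, dev_size=5, base_seed=0):
    """Create num_splits random (train, dev) splits, one seed per split.

    The sizes and seeds are assumptions for illustration only.
    """
    splits = []
    for i in range(num_splits):
        rng = random.Random(base_seed + i)  # a different seed for each split
        shuffled = rows[:]
        rng.shuffle(shuffled)
        train = shuffled[:train_size]
        dev = shuffled[train_size:train_size + dev_size]
        splits.append((train, dev))
    return splits


# Toy data standing in for a downloaded dataset.
rows = [(str(i % 2), f"example sentence {i}") for i in range(40)]
splits = make_splits(relabel(rows, LABEL_NAMES))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # prints: 15 10 5
```

Seeding each split separately keeps the splits reproducible across runs. This sketch samples uniformly over all rows; a low-data setup may instead need to sample per class, which is not shown here.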

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
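
As a convenience, the installed versions can be compared against the list above before running any experiment. The sketch below is not part of the repository. It assumes the packages are installed under the PyPI distribution names torch, fairseq, and transformers, and the helper names version_matches and check_dependencies are made up for this example.

```python
from importlib import metadata

# Versions listed in this README, keyed by (assumed) PyPI distribution name.
REQUIRED = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def version_matches(installed, required):
    """Return True if the installed version has the required major.minor prefix."""
    wanted = required.split(".")
    return installed.split(".")[:len(wanted)] == wanted


def check_dependencies(required=REQUIRED):
    """Return a status string for each required package."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if version_matches(installed, wanted):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, README lists {wanted}"
    return report


if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(f"{name}: {status}")
```

The comparison is done on whole version components, so 1.5.0 matches 1.5 while 1.50.0 does not.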

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

run scripts/bart_snips_lower.sh for the BART experiment
run scripts/bert_snips_lower.sh for the rest of the data augmentation methods

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
