Code for text augmentation method leveraging large-scale language models

Overview

HyperMix

Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation.

Getting Started

Installing Packages

The main depedencies can be installed via pip install -r requirements.txt.

Usage

The main code is run through main.py. Check out --help for full list of commands.

python main.py --help

The code will automatically use the first GPU device, if detected.

A typical command to run BERT-base 10 times on the 1% subsample set of the SST-2 dataset and computing the average of all run is as follows.

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter none \
    --save-dir out

The script will create a directory named out in the current working directory and save the script log as out/run.log. It will also save any augmentations created during the experiments (if any augmentation is enabled).

To test GPT3Mix, prepare an OpenAI API key as described at the bottom of this README file, then use the following command:

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter gpt3-mix \
    --save-dir out

Managing Seeds

In the command above, the script will automatically generate seeds for sampling data and optimizing models. The seed used to generate each individual seed is called "master seed" and can be set using --master-data-seed and --master-exp-seed options. As evident from the option names, they are responsible for sampling data and optimizing a freshly initialized models respectively.

Sometimes, we need to manually set the seeds and not rely on automatically generated seeds from the master seeds. Manually seeding can be achieved via --data-seeds option. If this option is given, the master data seed will be ignored. We only support manualy data seeding for now.

OpenAI Key

Store OpenAI API Key under the current working directory as a file named openai-key. When running the main script, it will automatically detect the api key.

API keys can be provided to the script by --api-key option (not recommended) or from a file named openai-key in the current working directory.

Other Notes

At the moment we only support data augmentation leveraging OpenAI GPT-3 (GPT3Mix), but we will release an update that supports HyperCLOVA as soon as it becomes available to the public (HyperMix).

Citation

To cite our code or work, please use the following bibtex:

@inproceedings{yoo2021gpt3mix,
	title = "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation",
	author = "Yoo, Kang Min  and
	  Park, Dongju  and
	  Kang, Jaewook  and
	  Lee, Sang-Woo  and
	  Park, Woomyoung",
	booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
	month = nov,
	year = "2021",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.findings-emnlp.192",
	pages = "2225--2239",
}
Owner
NAVER AI
Official account of NAVER AI, Korea No.1 Industrial AI Research Group
NAVER AI
Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

1.1k Dec 27, 2022
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
a CTF web challenge about making screenshots

screenshotter (web) A CTF web challenge about making screenshots. It is inspired by a bug found in real life. The challenge was created by @LiveOverfl

219 Jan 02, 2023
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Seq2Seq Speech in JAX A JAX/Flax repository for combining a pre-trained speech encoder model (e.g. Wav2Vec2, HuBERT, WavLM) with a pre-trained text de

Sanchit Gandhi 21 Dec 14, 2022
Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

Implementation of some unbalanced loss for NLP task like focal_loss, dice_loss, DSC Loss, GHM Loss et.al Summary Here is a loss implementation reposit

121 Jan 01, 2023
Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification"

PTR Code and datasets for our paper "PTR: Prompt Tuning with Rules for Text Classification" If you use the code, please cite the following paper: @art

THUNLP 118 Dec 30, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
NL. The natural language programming language.

NL A Natural-Language programming language. Built using Codex. A few examples are inside the nl_projects directory. How it works Write any code in pur

2 Jan 17, 2022
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Low-resource-Machine-Translation This repository contains the code for the project relative to the course Deep Natural Language Processing. The goal o

Andrea Cavallo 3 Jun 22, 2022
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

9 Dec 28, 2021
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 03, 2023
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles

Which Apple Keeps Which Doctor Away? Colorful Word Representations with Visual Oracles (TASLP 2022)

Zhuosheng Zhang 3 Apr 14, 2022
nlp-tutorial is a tutorial for who is studying NLP(Natural Language Processing) using Pytorch

nlp-tutorial is a tutorial for who is studying NLP(Natural Language Processing) using Pytorch. Most of the models in NLP were implemented with less than 100 lines of code.(except comments or blank li

Tae-Hwan Jung 11.9k Jan 08, 2023
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

pySBD: Python Sentence Boundary Disambiguation (SBD) pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detecti

Nipun Sadvilkar 549 Jan 06, 2023
Deep Learning for Natural Language Processing - Lectures 2021

This repository contains slides for the course "20-00-0947: Deep Learning for Natural Language Processing" (Technical University of Darmstadt, Summer term 2021).

0 Feb 21, 2022
Blazing fast language detection using fastText model

Luga A blazing fast language detection using fastText's language models Luga is a Swahili word for language. fastText provides a blazing fast language

Prayson Wilfred Daniel 18 Dec 20, 2022