ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Overview

ELECTRA

Introduction

ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.

For a detailed description and experimental results, please refer to our ICLR 2020 paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g,. GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).

This repository also contains code for Electric, a version of ELECTRA inspired by energy-based models. Electric provides a more principled view of ELECTRA as a "negative sampling" cloze model. It can also efficiently produce pseudo-likelihood scores for text, which can be used to re-rank the outputs of speech recognition or machine translation systems. For details on Electric, please refer to out EMNLP 2020 paper Pre-Training Transformers as Energy-Based Cloze Models.

Released Models

We are initially releasing three pre-trained models:

Model Layers Hidden Size Params GLUE score (test set) Download
ELECTRA-Small 12 256 14M 77.4 link
ELECTRA-Base 12 768 110M 82.7 link
ELECTRA-Large 24 1024 335M 85.2 link

The models were trained on uncased English text. They correspond to ELECTRA-Small++, ELECTRA-Base++, ELECTRA-1.75M in our paper. We hope to release other models, such as multilingual models, in the future.

On GLUE, ELECTRA-Large scores slightly better than ALBERT/XLNET, ELECTRA-Base scores better than BERT-Large, and ELECTRA-Small scores slightly worst than TinyBERT (but uses no distillation). See the expected results section below for detailed performance numbers.

Requirements

Pre-training

Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text. It has the following arguments:

  • --corpus-dir: A directory containing raw text files to turn into ELECTRA examples. A text file can contain multiple documents with empty lines separating them.
  • --vocab-file: File defining the wordpiece vocabulary.
  • --output-dir: Where to write out ELECTRA examples.
  • --max-seq-length: The number of tokens per example (128 by default).
  • --num-processes: If >1 parallelize across multiple processes (1 by default).
  • --blanks-separate-docs: Whether blank lines indicate document boundaries (True by default).
  • --do-lower-case/--no-lower-case: Whether to lower case the input text (True by default).

Use run_pretraining.py to pre-train an ELECTRA model. It has the following arguments:

  • --data-dir: a directory where pre-training data, model weights, etc. are stored. By default, the training loads examples from <data-dir>/pretrain_tfrecords and a vocabulary from <data-dir>/vocab.txt.
  • --model-name: a name for the model being trained. Model weights will be saved in <data-dir>/models/<model-name> by default.
  • --hparams (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See configure_pretraining.py for the supported hyperparameters.

If training is halted, re-running the run_pretraining.py with the same arguments will continue the training where it left off.

You can continue pre-training from the released ELECTRA checkpoints by

  1. Setting the model-name to point to a downloaded model (e.g., --model-name electra_small if you downloaded weights to $DATA_DIR/electra_small).
  2. Setting num_train_steps by (for example) adding "num_train_steps": 4010000 to the --hparams. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).
  3. Increase the learning rate to account for the linear learning rate decay. For example, to start with a learning rate of 2e-4 you should set the learning_rate hparam to 2e-4 * (4e6 + 10000) / 10000.
  4. For ELECTRA-Small, you also need to specifiy "generator_hidden_size": 1.0 in the hparams because we did not use a small generator for that model.

Quickstart: Pre-train a small ELECTRA model.

These instructions pre-train a small ELECTRA model (12 layers, 256 hidden size). Unfortunately, the data we used in the paper is not publicly available, so we will use the OpenWebTextCorpus released by Aaron Gokaslan and Vanya Cohen instead. The fully-trained model (~4 days on a v100 GPU) should perform roughly in between GPT and BERT-Base in terms of GLUE performance. By default the model is trained on length-128 sequences, so it is not suitable for running on question answering. See the "expected results" section below for more details on model performance.

Setup

  1. Place a vocabulary file in $DATA_DIR/vocab.txt. Our ELECTRA models all used the exact same vocabulary as English uncased BERT, which you can download here.
  2. Download the OpenWebText corpus (12G) and extract it (i.e., run tar xf openwebtext.tar.xz). Place it in $DATA_DIR/openwebtext.
  3. Run python3 build_openwebtext_pretraining_dataset.py --data-dir $DATA_DIR --num-processes 5. It pre-processes/tokenizes the data and outputs examples as tfrecord files under $DATA_DIR/pretrain_tfrecords. The tfrecords require roughly 30G of disk space.

Pre-training the model.

Run python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt to train a small ELECTRA model for 1 million steps on the data. This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the v100 GPU).

To customize the training, add --hparams '{"hparam1": value1, "hparam2": value2, ...}' to the run command. --hparams can also be a path to a .json file containing the hyperparameters. Some particularly useful options:

  • "debug": true trains a tiny ELECTRA model for a few steps.
  • "model_size": one of "small", "base", or "large": determines the size of the model
  • "electra_objective": false trains a model with masked language modeling instead of replaced token detection (essentially BERT with dynamic masking and no next-sentence prediction).
  • "num_train_steps": n controls how long the model is pre-trained for.
  • "pretrain_tfrecords": <paths> determines where the pre-training data is located. Note you need to specify the specific files not just the directory (e.g., <data-dir>/pretrain_tf_records/pretrain_data.tfrecord*)
  • "vocab_file": <path> and "vocab_size": n can be used to set a custom wordpiece vocabulary.
  • "learning_rate": lr, "train_batch_size": n, etc. can be used to change training hyperparameters
  • "model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}, etc. can be used to changed the hyperparameters for the underlying transformer (the "model_size" flag sets the default values).

See configure_pretraining.py for the full set of supported hyperparameters.

Evaluating the pre-trained model.

To evaluate the model on a downstream task, see the below finetuning instructions. To evaluate the generator/discriminator on the openwebtext data run python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'. This will print out eval metrics such as the accuracy of the generator and discriminator, and also writing the metrics out to data-dir/model-name/results.

Fine-tuning

Use run_finetuning.py to fine-tune and evaluate an ELECTRA model on a downstream NLP task. It expects three arguments:

  • --data-dir: a directory where data, model weights, etc. are stored. By default, the script loads finetuning data from <data-dir>/finetuning_data/<task-name> and a vocabulary from <data-dir>/vocab.txt.
  • --model-name: a name of the pre-trained model: the pre-trained weights should exist in data-dir/models/model-name.
  • --hparams: a JSON dict containing model hyperparameters, data paths, etc. (e.g., --hparams '{"task_names": ["rte"], "model_size": "base", "learning_rate": 1e-4, ...}'). See configure_pretraining.py for the supported hyperparameters. Instead of a dict, this can also be a path to a .json file containing the hyperparameters. You must specify the "task_names" and "model_size" (see examples below).

Eval metrics will be saved in data-dir/model-name/results and model weights will be saved in data-dir/model-name/finetuning_models by default. Evaluation is done on the dev set by default. To customize the training, add --hparams '{"hparam1": value1, "hparam2": value2, ...}' to the run command. Some particularly useful options:

  • "debug": true fine-tunes a tiny ELECTRA model for a few steps.
  • "task_names": ["task_name"]: specifies the tasks to train on. A list because the codebase nominally supports multi-task learning, (although be warned this has not been thoroughly tested).
  • "model_size": one of "small", "base", or "large": determines the size of the model; you must set this to the same size as the pre-trained model.
  • "do_train" and "do_eval": train and/or evaluate a model (both are set to true by default). For using "do_eval": true with "do_train": false, you need to specify the init_checkpoint, e.g., python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"], "do_train": false, "do_eval": true, "init_checkpoint": "<data-dir>/models/electra_base/finetuning_models/mnli_model_1"}'
  • "num_trials": n: If >1, does multiple fine-tuning/evaluation runs with different random seeds.
  • "learning_rate": lr, "train_batch_size": n, etc. can be used to change training hyperparameters.
  • "model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}, etc. can be used to changed the hyperparameters for the underlying transformer (the "model_size" flag sets the default values).

Setup

Get a pre-trained ELECTRA model either by training your own (see pre-training instructions above), or downloading the release ELECTRA weights and unziping them under $DATA_DIR/models (e.g., you should have a directory$DATA_DIR/models/electra_large if you are using the large model).

Finetune ELECTRA on a GLUE task

Download the GLUE data by running this script. Set up the data by running mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data.

Then run run_finetuning.py. For example, to fine-tune ELECTRA-Base on MNLI

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"]}'

Or fine-tune a small model pre-trained using the above instructions on CoLA.

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"model_size": "small", "task_names": ["cola"]}'

Finetune ELECTRA on question answering

The code supports SQuAD 1.1 and 2.0, as well as datasets in the 2019 MRQA shared task

  • Squad 1.1: Download the train and dev datasets and move them under $DATA_DIR/finetuning_data/squadv1/(train|dev).json
  • Squad 2.0: Download the datasets from the SQuAD Website and move them under $DATA_DIR/finetuning_data/squad/(train|dev).json
  • MRQA tasks: Download the data from here. Move the data to $DATA_DIR/finetuning_data/(newsqa|naturalqs|triviaqa|searchqa)/(train|dev).jsonl.

Then run (for example)

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"]}'

This repository uses the official evaluation code released by the SQuAD authors and the MRQA shared task to compute metrics

Finetune ELECTRA on sequence tagging

Download the CoNLL-2000 text chunking dataset from here and put it under $DATA_DIR/finetuning_data/chunk/(train|dev).txt. Then run

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["chunk"]}'

Adding a new task

The easiest way to run on a new task is to implement a new finetune.task.Task, add it to finetune.task_builder.py, and then use run_finetuning.py as normal. For classification/qa/sequence tagging, you can inherit from a finetune.classification.classification_tasks.ClassificationTask, finetune.qa.qa_tasks.QATask, or finetune.tagging.tagging_tasks.TaggingTask. For preprocessing data, we use the same tokenizer as BERT.

Expected Results

Here are expected results for ELECTRA on various tasks (test set for chunking, dev set for the other tasks). Note that variance in fine-tuning can be quite large, so for some tasks you may see big fluctuations in scores when fine-tuning from the same checkpoint multiple times. The below scores show median performance over a large number of random seeds. ELECTRA-Small/Base/Large are our released models. ELECTRA-Small-OWT is the OpenWebText-trained model from above (it performs a bit worse than ELECTRA-Small due to being trained for less time and on a smaller dataset).

CoLA SST MRPC STS QQP MNLI QNLI RTE SQuAD 1.1 SQuAD 2.0 Chunking
Metrics MCC Acc Acc Spearman Acc Acc Acc Acc EM EM F1
ELECTRA-Large 69.1 96.9 90.8 92.6 92.4 90.9 95.0 88.0 89.7 88.1 97.2
ELECTRA-Base 67.7 95.1 89.5 91.2 91.5 88.8 93.2 82.7 86.8 80.5 97.1
ELECTRA-Small 57.0 91.2 88.0 87.5 89.0 81.3 88.4 66.7 75.8 70.1 96.5
ELECTRA-Small-OWT 56.8 88.3 87.4 86.8 88.3 78.9 87.9 68.5 -- -- --

See here for losses / training curves of the models during pre-training.

Electric

To train Electric, use the same pre-training script and command as ELECTRA. Pass "electra_objective": false and "electric_objective": true to the hyperparameters. We plan to release pre-trained Electric models soon!

Citation

If you use this code for your publication, please cite the original paper:

@inproceedings{clark2020electra,
  title = {{ELECTRA}: Pre-training Text Encoders as Discriminators Rather Than Generators},
  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {ICLR},
  year = {2020},
  url = {https://openreview.net/pdf?id=r1xMH1BtvB}
}

If you use the code for Electric, please cite the Electric paper:

@inproceedings{clark2020electric,
  title = {Pre-Training Transformers as Energy-Based Cloze Models},
  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {EMNLP},
  year = {2020},
  url = {https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf}
}

Contact Info

For help or issues using ELECTRA, please submit a GitHub issue.

For personal communication related to ELECTRA, please contact Kevin Clark ([email protected]).

Owner
Google Research
Google Research
Label Mask for Multi-label Classification

LM-MLC 一种基于完型填空的多标签分类算法 1 前言 本文主要介绍本人在全球人工智能技术创新大赛【赛道一】设计的一种基于完型填空(模板)的多标签分类算法:LM-MLC,该算法拟合能力很强能感知标签关联性,在多个数据集上测试表明该算法与主流算法无显著性差异,在该比赛数据集上的dev效果很好,但是由

52 Nov 20, 2022
Bridging Composite and Real: Towards End-to-end Deep Image Matting

Bridging Composite and Real: Towards End-to-end Deep Image Matting Please note that the official repository of the paper Bridging Composite and Real:

Jizhizi_Li 30 Oct 31, 2022
Mapping Conditional Distributions for Domain Adaptation Under Generalized Target Shift

This repository contains the official code of OSTAR in "Mapping Conditional Distributions for Domain Adaptation Under Generalized Target Shift" (ICLR 2022).

Matthieu Kirchmeyer 5 Dec 06, 2022
Official repository of ICCV21 paper "Viewpoint Invariant Dense Matching for Visual Geolocalization"

Viewpoint Invariant Dense Matching for Visual Geolocalization: PyTorch implementation This is the implementation of the ICCV21 paper: G Berton, C. Mas

Gabriele Berton 44 Jan 03, 2023
Implementation of E(n)-Transformer, which extends the ideas of Welling's E(n)-Equivariant Graph Neural Network to attention

E(n)-Equivariant Transformer (wip) Implementation of E(n)-Equivariant Transformer, which extends the ideas from Welling's E(n)-Equivariant G

Phil Wang 132 Jan 02, 2023
TJU Deep Learning & Neural Network

Deep_Learning & Neural_Network_Lab 实验环境 Python 3.9 Anaconda3(官网下载或清华镜像都行) PyTorch 1.10.1(安装代码如下) conda install pytorch torchvision torchaudio cudatool

St3ve Lee 1 Jan 19, 2022
Pytorch Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Pytorch Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [Paper] [Colab is coming soon] Approach Example Usage To r

170 Jan 03, 2023
Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Transformers for variable misuse, function naming and code completion tasks The official PyTorch implementation of: Empirical Study of Transformers fo

Bayesian Methods Research Group 56 Nov 15, 2022
NeROIC: Neural Object Capture and Rendering from Online Image Collections

NeROIC: Neural Object Capture and Rendering from Online Image Collections This repository is for the source code for the paper NeROIC: Neural Object C

Snap Research 647 Dec 27, 2022
Weakly Supervised 3D Object Detection from Point Cloud with Only Image Level Annotation

SCCKTIM Weakly Supervised 3D Object Detection from Point Cloud with Only Image-Level Annotation Our code will be available soon. The class knowledge t

1 Nov 12, 2021
Unofficial pytorch implementation of 'Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization'

pytorch-AdaIN This is an unofficial pytorch implementation of a paper, Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization [Hua

Naoto Inoue 873 Jan 06, 2023
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators [Project Website] [Replicate.ai Project] StyleGAN-NADA: CLIP-Guided Domain Adaptation

992 Dec 30, 2022
FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data

FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data. Flexible EM-Inspired Discriminant Analysis is a robust supervised classification algorithm that performs well i

0 Sep 06, 2022
TensorFlow implementation of Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently.

Adversarial Chess TensorFlow implementation of Style Transfer Generative Adversarial Networks: Learning to Play Chess Differently. Requirements To run

Muthu Chidambaram 30 Sep 07, 2021
Reference PyTorch implementation of "End-to-end optimized image compression with competition of prior distributions"

PyTorch reference implementation of "End-to-end optimized image compression with competition of prior distributions" by Benoit Brummer and Christophe

Benoit Brummer 6 Jun 16, 2022
People Interaction Graph

Gihan Jayatilaka*, Jameel Hassan*, Suren Sritharan*, Janith Senananayaka, Harshana Weligampola, et. al., 2021. Holistic Interpretation of Public Scenes Using Computer Vision and Temporal Graphs to Id

University of Peradeniya : COVID Research Group 1 Aug 24, 2022
Transferable Unrestricted Attacks, which won 1st place in CVPR’21 Security AI Challenger: Unrestricted Adversarial Attacks on ImageNet.

Transferable Unrestricted Adversarial Examples This is the PyTorch implementation of the Arxiv paper: Towards Transferable Unrestricted Adversarial Ex

equation 16 Dec 29, 2022
Unified file system operation experience for different backend

megfile - Megvii FILE library Docs: http://megvii-research.github.io/megfile megfile provides a silky operation experience with different backends (cu

MEGVII Research 76 Dec 14, 2022
a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LSTM layers

RNN-Playwrite a reccurrent neural netowrk that when trained on a peice of text and fed a starting prompt will write its on 250 character text using LS

Arno Barton 1 Oct 29, 2021
This is the repo for the paper `SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization'. (published in Bioinformatics'21)

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization This is the code for our paper ``SumGNN: Multi-typed Drug

Yue Yu 58 Dec 21, 2022