Learned Token Pruning for Transformers

Overview

LTP: Learned Token Pruning for Transformers

Screenshot from 2021-07-08 13-39-02

Screenshot from 2021-07-08 13-39-28

Check our paper for more details.

Installation

We follow the same installation procedure as the original Huggingface transformer repo.

pip install sklearn scipy datasets torch
pip install -e .  # in the top directory

Prepare Checkpoints

LTP is implemented on top of Huggingface transformer's I-BERT implementation. Therefore, we first need to generate a checkpoint file of ibert finetuned on the target downstream task. While you can do this on the original Huggingface repository, we also support our base branch ltp/base where you can run the following code to finetune ibert on the GLUE tasks.

git checkout ltp/base
cd examples/text-classification
python run_glue.py --model_name_or_path kssteven/ibert-roberta-base --output_dir {CKPT} --task {TASK} --do_train --do_eval {--some_more_arguments}
  • {TASK}: RTE, MRPC, STSB, SST2, QNLI, QQP, MNLI
  • Please refer to the Huggingface tutorial and the official documentation for more details in arguments and hyperparameters.
  • Note that as default ibert behaves the same as roberta (see this tutorial), hence the resulting model will be the same as roberta-base finetuned on the target GLUE task.

The final model will be checkpointed in {CKPT}.

  • Remove {CKPT}/trainer_state.json.
  • In the configuration file {CKPT}/config.json, change (1) "architectures" to ["LTPForSequenceClassification"] and (2) "model_type" to "ltp".

Run Learned Token Pruning

Add the following lines in the configuration file {CKPT}/config.json.

"prune_mode": "absolute_threshold",
"final_token_threshold": 0.01, 
"scoring_mode": "mean",

final_token_threshold determines the token threshold of the last layer, and the thresholds of the remaining layers will be linearly scaled. For instance, the thresholds for the 3rd, 6th, and 9th layers will be 0.0025, 0.005, and 0.0075, respectively, when setting the final_token_threshold , i.e., the threshold for the last (12th) layer, to 0.01. This number is a hyperparameter, and we found that 0.01 works well in many cases.

The learnable mode consists of 2 stages: soft threshold and hard threshold. Please refer to our paper for more details.

1. Soft Threshold

We first train the model using the soft threshold mode. This trains the thresholds as well as the model parameters to search for the best threshold configuration.

Run the following command:

python run.py --arch ltp-base --task {TASK} --restore {CKPT} --lr 2e-5 --temperature {T}\
  --lambda 0.1 --weight_decay 0 --bs 64 --masking_mode soft --epoch {epoch} --save_step 100 --no_load
  • {TASK}: RTE, MRPC, STSB, SST2, QNLI, QQP, MNLI
  • You can assign different learning rate for lr, but 2e-5 worked fine.
  • We set {epoch} to be 10 for smaller datasets (e.g., RTE, MRPC) and 1 for larger datasets (e.g., SST2, QNLI, MRPC).
  • --no_load flag will not load the best model at the end of the training (i.e., the final checkpoint will be the one at the end of training).
  • lambda is an important hyperparameter than controls the pruning level: the higher the value, the more we prune tokens. 0.01 ~ 0.2 worked well in many cases, but we recommend the user to empirically search for the best number for it.
  • temperature is another hyperparameter, and 1e-3 ~ 1e-5 worked well. In the paper, we searched over {1e−4, 2e−4, 5e−4, 1e−3, 2e−3}.

The final model will be checkpointed in {CKPT_soft} = checkpoints/base/{TASK}/absolute_threshold/rate_{final_token_threshold}/temperature_{T}/lambda_{lambda}/lr_{lr}. Remove trainer_state.json from the checkpoint file in {CKPT_soft}.

2. Hard Threshold

Once we learn the thresholds, we fix those values, turn back to the hard threshold mode, and finetune the model parameters only.

Run the following command:

python run.py --arch ltp-base --task {TASK} --restore {CKPT_soft} --lr {LR} --bs 64 --masking_mode hard --epoch 5 
  • We used {LR} {0.5, 1, 2}e-5 in the paper.
  • You can additionally set --save_step 500 for more frequent evaluation/logging. The default setting will evaluate for every 1 epoch.

The final model will be checkpointed in {CKPT_soft}/hard/lr_{LR}.

Run Baseline Methods

We additionally provide code to reproduce the baseline methods used in our paper (i.e., top-k and manual threshold).

Top-k Token Pruning

Add the following lines in {CKPT}/config.json.

"prune_mode": "topk",
"token_keep_rate": 0.2,

The token keep rates of the first three layers and the last layer are 1 and token_keep_rate, respectively. The keep rates of the remaining layers are scaled linearly. The smaller token_keep_rate is, the more aggressive we prune tokens. You can also assign negative number for token_keep_rate and, in that case, the keep rate of each layer will be assigned as max(0, keep_rate).

Run the following command:

python run.py --arch ltp-base --task {TASK} --restore {CKPT} --lr {LR} --bs 64 --masking_mode hard --epoch 5
  • We used {LR} {0.5, 1, 2}e-5 in the paper.
  • You can additionally set --save_step 500 for more frequent evaluation/logging. The default setting will evaluate for every 1 epoch.

The final model will be checkpointed in {CKPT}/topk/lr_{LR}.

Manual(Non-leanrable) Threshold Pruning

Add the following lines in {CKPT}/config.json.

"prune_mode": "absolute_threshold",
"final_token_threshold": 0.01, 
"scoring_mode": "mean",

Run the following command:

python run.py --arch ltp-base --task {TASK} --restore {CKPT} --lr {LR} --bs 64 --masking_mode hard --epoch 5 --save_step 500
  • We used {LR} {0.5, 1, 2}e-5 in the paper.
  • You can additionally set --save_step 500 for more frequent evaluation/logging. The default setting will evaluate for every 1 epoch.
  • Note that the only difference from the learned token pruning mode is that we run the hard threshold mode from the beginning.

The final model will be checkpointed in {CKPT}/hard/lr_{LR}.

This repository includes different versions of the prescribed-time controller as Simulink blocks and MATLAB script codes for engineering applications.

Prescribed-time Control Prescribed-time control (PTC) blocks in Simulink environment, MATLAB R2020b. For more theoretical details, refer to the papers

Amir Shakouri 1 Mar 11, 2022
🥇Samsung AI Challenge 2021 1등 솔루션입니다🥇

MoT - Molecular Transformer Large-scale Pretraining for Molecular Property Prediction Samsung AI Challenge for Scientific Discovery This repository is

Jungwoo Park 44 Dec 03, 2022
The code for our paper "AutoSF: Searching Scoring Functions for Knowledge Graph Embedding"

AutoSF The code for our paper "AutoSF: Searching Scoring Functions for Knowledge Graph Embedding" and this paper has been accepted by ICDE2020. News:

AutoML Research 64 Dec 17, 2022
A deep learning based semantic search platform that computes similarity scores between provided query and documents

semanticsearch This is a deep learning based semantic search platform that computes similarity scores between provided query and documents. Documents

1 Nov 30, 2021
MAUS: A Dataset for Mental Workload Assessment Using Wearable Sensor - Baseline system

MAUS: A Dataset for Mental Workload Assessment Using Wearable Sensor - Baseline system Getting started To start working on this assignment, you should

2 Aug 06, 2022
Implementation for Shape from Polarization for Complex Scenes in the Wild

sfp-wild Implementation for Shape from Polarization for Complex Scenes in the Wild project website | paper Code and dataset will be released soon. Int

Chenyang LEI 41 Dec 23, 2022
This computer program provides a reference implementation of Lagrangian Monte Carlo in metric induced by the Monge patch

This computer program provides a reference implementation of Lagrangian Monte Carlo in metric induced by the Monge patch. The code was prepared to the final version of the accepted manuscript in AIST

Marcelo Hartmann 2 May 06, 2022
JumpDiff: Non-parametric estimator for Jump-diffusion processes for Python

jumpdiff jumpdiff is a python library with non-parametric Nadaraya─Watson estimators to extract the parameters of jump-diffusion processes. With jumpd

Rydin 28 Dec 10, 2022
Flower - A Friendly Federated Learning Framework

Flower - A Friendly Federated Learning Framework Flower (flwr) is a framework for building federated learning systems. The design of Flower is based o

Adap 1.8k Jan 01, 2023
Minimisation of a negative log likelihood fit to extract the lifetime of the D^0 meson (MNLL2ELDM)

Minimisation of a negative log likelihood fit to extract the lifetime of the D^0 meson (MNLL2ELDM) Introduction The average lifetime of the $D^{0}$ me

Son Gyo Jung 1 Dec 17, 2021
Enhancing Knowledge Tracing via Adversarial Training

Enhancing Knowledge Tracing via Adversarial Training This repository contains source code for the paper "Enhancing Knowledge Tracing via Adversarial T

Xiaopeng Guo 14 Oct 24, 2022
DilatedNet in Keras for image segmentation

Keras implementation of DilatedNet for semantic segmentation A native Keras implementation of semantic segmentation according to Multi-Scale Context A

303 Mar 15, 2022
A graph-to-sequence model for one-step retrosynthesis and reaction outcome prediction.

Graph2SMILES A graph-to-sequence model for one-step retrosynthesis and reaction outcome prediction. 1. Environmental setup System requirements Ubuntu:

29 Nov 18, 2022
Distributed Asynchronous Hyperparameter Optimization in Python

Hyperopt: Distributed Hyperparameter Optimization Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which

6.5k Jan 01, 2023
Code repository for the paper Computer Vision User Entity Behavior Analytics

Computer Vision User Entity Behavior Analytics Code repository for "Computer Vision User Entity Behavior Analytics" Code Description dataset.csv As di

Sameer Khanna 2 Aug 20, 2022
ML-based medical imaging using Azure

Disclaimer This code is provided for research and development use only. This code is not intended for use in clinical decision-making or for any other

Microsoft Azure 68 Dec 23, 2022
Official code for paper "Optimization for Oriented Object Detection via Representation Invariance Loss".

Optimization for Oriented Object Detection via Representation Invariance Loss By Qi Ming, Zhiqiang Zhou, Lingjuan Miao, Xue Yang, and Yunpeng Dong. Th

ming71 56 Nov 28, 2022
Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)

Awesome Visual-Transformer Collect some Transformer with Computer-Vision (CV) papers. If you find some overlooked papers, please open issues or pull r

dkliang 2.8k Jan 08, 2023
Implementation of Change-Based Exploration Transfer (C-BET)

Implementation of Change-Based Exploration Transfer (C-BET), as presented in Interesting Object, Curious Agent: Learning Task-Agnostic Exploration.

Simone Parisi 29 Dec 04, 2022
Exploiting a Zoo of Checkpoints for Unseen Tasks

Exploiting a Zoo of Checkpoints for Unseen Tasks This repo includes code to reproduce all results in the above Neurips paper, authored by Jiaji Huang,

Baidu Research 8 Sep 06, 2022