[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".

Overview

SMYRF: Efficient attention using asymmetric clustering

Get started:

Colab

Abstract

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from O(N^2) to O(N log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g. tight queries and keys) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT slightly outperforms BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention at high resolutions. Using a single TPU, we train BigGAN on Celeba-HQ, with attention at resolutions 128x128 and 256x256, capable of generating realistic human faces.

Authors: Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
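
To give a feel for the core idea, here is a minimal, single-hash sketch of SMYRF-style attention in PyTorch. It is a simplification under assumptions (one hash round, a basic norm-completing asymmetric transform, sort-and-chunk balancing), not the repository's implementation:

```python
import torch
import torch.nn.functional as F

def smyrf_like_attention(q, k, v, cluster_size=32):
    """Single-hash sketch: O(N * C) attention scores instead of O(N^2)."""
    b, n, d = q.shape
    assert n % cluster_size == 0
    # Asymmetric transformations (simplified): pad queries with a zero and keys
    # with a norm-completing coordinate, so all extended keys share one norm
    # while inner products <q, k> are preserved in the extended space.
    q_ext = torch.cat([q, torch.zeros_like(q[..., :1])], dim=-1)
    k_norm = k.norm(dim=-1, keepdim=True)
    k_ext = torch.cat([k, (k_norm.max() ** 2 - k_norm ** 2).sqrt()], dim=-1)
    # One LSH round: project on a random direction and sort; chunking the
    # sorted order into equal parts yields perfectly balanced clusters.
    proj = torch.randn(d + 1, 1, device=q.device)
    q_order = (q_ext @ proj).squeeze(-1).argsort(dim=-1)  # (b, n)
    k_order = (k_ext @ proj).squeeze(-1).argsort(dim=-1)  # (b, n)
    out = torch.zeros_like(v)
    for start in range(0, n, cluster_size):
        q_idx = q_order[:, start:start + cluster_size]
        k_idx = k_order[:, start:start + cluster_size]
        gather = lambda t, idx: torch.gather(t, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        qc, kc, vc = gather(q, q_idx), gather(k, k_idx), gather(v, k_idx)
        # Dense attention restricted to the cluster.
        attn = F.softmax(qc @ kc.transpose(-2, -1) / d ** 0.5, dim=-1)
        out.scatter_(1, q_idx.unsqueeze(-1).expand(-1, -1, d), attn @ vc)
    return out
```

Each query attends only to the C keys that land in its cluster, so the score matrices cost O(N·C) memory per hash instead of O(N^2); the paper's version adds multiple hash rounds and the adaptive balancing scheme.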

Results

Memory-quality trade-off

GLUE benchmark

| Model | Avg. | # | C | CoLA | MNLI-m/mm | MRPC | QNLI | QQP | RTE | SST-2 | STS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT128 | 82.69 | 1 | 128 | 57.83 | 84.43/84.68 | 88.41 | 91.31 | 89.70 | 65.70 | 93.46 | 88.73 |
| SMYRF-BERT2x32 | 82.98 | 2 | 32 | 58.79 | 83.76/84.27 | 87.69 | 91.14 | 89.72 | 68.59 | 93.23 | 89.65 |
| SMYRF-BERT2x16 | 81.74 | 2 | 16 | 58.90 | 82.86/83.49 | 85.72 | 89.53 | 89.33 | 64.98 | 93.12 | 87.75 |
| BERT64 | 81.57 | 1 | 64 | 58.80 | 82.34/82.47 | 87.02 | 90.48 | 89.69 | 61.73 | 93.00 | 88.64 |
| BERT32 | 73.56 | 1 | 32 | 56.40 | 64.51/63.41 | 77.89 | 79.81 | 88.59 | 55.23 | 92.66 | 83.53 |

Here # denotes the number of hashes and C the number of queries per cluster.

Interchangeability of SMYRF and dense attention

Results on the IMDB dataset. Using dense attention at inference consistently improves results, nearly matching full dense-attention performance.

| Model | Memory | Inference | Accuracy |
| --- | --- | --- | --- |
| RoBERTa | 100% | dense | 94.96% |
| SMYRF-RoBERTa | 50% | SMYRF | 93.72% |
| SMYRF-RoBERTa | 50% | dense | 94.62% |
| BERT | 100% | dense | 94.12% |
| SMYRF-BERT | 50% | SMYRF | 92.64% |
| SMYRF-BERT | 50% | dense | 93.54% |
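
Because SMYRF approximates the same softmax attention, swapping between the two at inference time amounts to replacing modules in the trained model. A hypothetical PyTorch pattern (names illustrative, not the repo's API):

```python
import torch.nn as nn

def swap_attention(model: nn.Module, old_type: type, make_new):
    # Walk the module tree and replace every layer of type `old_type`
    # with a module built by `make_new(old_layer)`, which can reuse the
    # trained query/key/value projection weights.
    for name, child in model.named_children():
        if isinstance(child, old_type):
            setattr(model, name, make_new(child))
        else:
            swap_attention(child, old_type, make_new)
```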

Smyrf-BigGAN training on Celeba-HQ-128

Generated faces by a Smyrf-BigGAN trained at 128x128 resolution with attention at 128x128, using 50% of the memory of dense attention.

Results after 120k iterations:

| Model | Resolution | Attention | # | C | FID |
| --- | --- | --- | --- | --- | --- |
| BigGAN | 128x128 | 64x64 | 1 | 4096 | 26.06 |
| Smyrf-BigGAN | 128x128 | 128x128 | 4 | 2048 | 25.03 |

where # denotes the number of hashes and C the number of queries per cluster.
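
As a rough sanity check on the "50% of dense memory" figure (our back-of-the-envelope arithmetic, counting attention-score entries, not a number from the paper):

```python
N = 128 * 128        # queries at 128x128 attention resolution
hashes, C = 4, 2048  # Smyrf-BigGAN settings from the table above
dense = N ** 2                      # dense attention scores
smyrf = hashes * (N // C) * C * C   # per hash: N/C clusters, each C x C
print(smyrf / dense)                # 0.5, i.e. 50% of dense attention
```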

What's here

The code hosted in this repository is the one we used to run all the experiments in the paper. Get started:

Colab

For a deeper dive, look at the examples/ folder, where we have code for pre-training SMYRF-BigGAN, sampling from a pre-trained BigGAN with SMYRF, fine-tuning state-of-the-art NLP models with SMYRF, and a lot more.
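
As a quick taste of drop-in usage, a sketch along these lines should be close to what the repo exposes (the exact module path and argument names are assumptions; see the Colab and examples/ for the authoritative API):

```python
import torch
from smyrf.torch.attn import SmyrfAttention  # module path assumed

# Hypothetical settings: 2 hash rounds, 32 queries/keys per cluster,
# mirroring the SMYRF-BERT (2x32) configuration reported above.
attn = SmyrfAttention(n_hashes=2, q_cluster_size=32, k_cluster_size=32)
q, k, v = (torch.randn(1, 128, 64) for _ in range(3))
out = attn(q, k, v)  # same output shape as dense attention: (1, 128, 64)
```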

Acknowledgments

We would like to wholeheartedly thank the TensorFlow Research Cloud (TFRC) program that gave us access to Cloud TPUs and GCP credits to train our models.

The code for the NLP experiments is exclusively based on the HuggingFace transformers library. We are very grateful to the authors of the library for their work.

The code for the CV experiments is based on the PyTorch implementation of BigGAN at https://github.com/ajbrock/BigGAN-PyTorch. The code has been expanded to support training on TPUs. Again, we want to thank the author for open-sourcing this implementation.


Comments
  • Auto-regressive

    Hi Giannis!

Thanks for the great paper! I am interested in your asymmetric LSH, as I think having separate query/key spaces (as opposed to the shared QK space in Reformer) will bring performance improvements in LSH-based attention.

I saw that you recommended this form of clustering to a previous user for the auto-regressive case, and I just wanted to ask whether you had considered the scenario where a bucket of queries does not get matched with any keys from the past at all. This was an issue I had when trying to make a separate QK space work with the Routing Transformer; I'm wondering whether you identified this problem and found a solution.

    Phil

    opened by lucidrains 2
  • Logging and scoring

Currently, logging and scoring are disabled for the TPU BigGAN for maximum efficiency. We can probably re-write the logger and scorer to reduce their performance overhead by converting most CPU materializations to XLA ops.

    bug example 
    opened by giannisdaras 0
  • EMA not working on TPU

Exponential moving averaging (EMA) of G's weights is not working on TPUs. The problem is related to the loading of the state dict: https://github.com/ajbrock/BigGAN-PyTorch/blob/master/utils.py#L614

For now, we disable EMA.

    bug example 
    opened by giannisdaras 0