Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Last update: Jan 05, 2023

Overview

Memory Efficient Attention Pytorch

Implementation of memory-efficient multi-head attention, as proposed in the paper "Self-attention Does Not Need O(n²) Memory". The module also handles key masking, causal masking, and cross attention.
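To make the idea concrete, here is a minimal, dependency-free sketch of the trick described in the paper: keys and values are processed in buckets, and a running maximum, a running sum of exponentials, and a running weighted sum of values are rescaled as each bucket arrives. It is an illustration only, not this package's implementation. The names `naive_attention` and `chunked_attention` are made up for the example, and it assumes plain Python lists instead of tensors, a single head, and no query bucketing or masking.

```python
import math
import random

def naive_attention(q, k, v):
    """Reference attention: builds a full row of scores for every query."""
    scale = len(q[0]) ** -0.5
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def chunked_attention(q, k, v, k_bucket_size):
    """Same result, but only k_bucket_size scores per query exist at a time."""
    scale = len(q[0]) ** -0.5
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf      # largest score seen so far
        running_sum = 0.0            # sum of exp(score - running_max)
        running_val = [0.0] * dim_v  # sum of exp(score - running_max) * value
        for start in range(0, len(k), k_bucket_size):
            k_chunk = k[start:start + k_bucket_size]
            v_chunk = v[start:start + k_bucket_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_chunk]
            new_max = max(running_max, max(scores))
            # Rescale the old accumulators to the new reference maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            running_val = [
                acc * correction + sum(w * vj[d] for w, vj in zip(weights, v_chunk))
                for d, acc in enumerate(running_val)
            ]
            running_max = new_max
        out.append([x / running_sum for x in running_val])
    return out

random.seed(0)
n, dim = 10, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

expected = naive_attention(q, k, v)
actual = chunked_attention(q, k, v, k_bucket_size=3)
max_err = max(abs(a - b) for ra, rb in zip(actual, expected) for a, b in zip(ra, rb))
```

`max_err` should be at the level of floating-point noise: bucketing changes how many scores are alive at once, not the attention output.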

Install

$ pip install memory-efficient-attention-pytorch

Usage

For autoregressive language model

import torch
from memory_efficient_attention_pytorch import Attention

attn = Attention(
    dim = 512,
    dim_head = 64,                # dimension per head
    heads = 8,                    # number of attention heads
    causal = True,                # autoregressive or not
    memory_efficient = True,      # whether to use memory efficient attention (can be turned off to test against normal attention)
    q_bucket_size = 1024,         # bucket size along queries dimension
    k_bucket_size = 2048          # bucket size along key / values dimension
).cuda()

x = torch.randn(1, 65536, 512).cuda()
out = attn(x) # (1, 65536, 512)
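As a rough back-of-the-envelope check on the bucket sizes above, the following sketch counts how many attention scores would be held at once per head and batch element, with and without bucketing. `score_entries` is a hypothetical helper written for this note. It only counts the score block itself and ignores accumulators, projections, and anything saved for the backward pass, so treat the numbers as an order-of-magnitude guide rather than a measurement.

```python
def score_entries(seq_len, q_bucket_size=None, k_bucket_size=None):
    """Number of attention scores alive at once, per head and batch element."""
    q = seq_len if q_bucket_size is None else min(q_bucket_size, seq_len)
    k = seq_len if k_bucket_size is None else min(k_bucket_size, seq_len)
    return q * k

full = score_entries(65536)                  # 65536 * 65536 scores
bucketed = score_entries(65536, 1024, 2048)  # 1024 * 2048 scores
ratio = full // bucketed

# Size in bytes if each score were stored as a 4-byte float32.
full_bytes = full * 4
bucketed_bytes = bucketed * 4
```

Under these assumptions, the full score matrix for a sequence of 65536 would need 16 GiB per head in float32, while one 1024 x 2048 bucket needs 8 MiB, a factor of 2048 fewer entries.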

Cross attention

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
mask = torch.ones(1, 65536).bool().cuda()

out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512)

Todo

benchmark and see how much torch.jit helps
look at Triton and KeOps and see if either can be a fit

Citations

@misc{rabe2021selfattention,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory}, 
    author  = {Markus N. Rabe and Charles Staats},
    year    = {2021},
    eprint  = {2112.05682},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

[feature request] Combining with flash attention?

There is a new algorithm that optimizes the QKV attention computation: https://github.com/HazyResearch/flash-attention (paper: https://arxiv.org/abs/2205.14135). Maybe you could look into integrating it with this repository.

opened by Vbansal21 15

i did this, we could build on top

Hi there!

It seems I have already written some of this code: https://github.com/CHARM-Tx/linear_mem_attention_pytorch. Could we build on top of it? I talked to https://github.com/Chillee about an experimental feature in functorch (https://github.com/pytorch/functorch) that would allow for increased speed. Mainly I want to match JAX performance, but that is difficult with PyTorch's imperative style.

I would love to collaborate on this if you want!

opened by hypnopump 5

Added dropout support to memory efficient variant

Hey Phil,

I have been using this repository for a project and wanted to add dropout for completeness. I checked consistency with the perceiver-ar implementation. I hope this is helpful.

-Matt

opened by usryokousha 2

Making this work with relative position bias from XTransformers

Is there a way to make this work with RelativePositionBias? Currently it produces an attention bias of size $BHN^2$, where B is the batch size, H is the number of heads, and N is the input length. Can the bias be chunked and computed per chunk?
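One way to picture a per-chunk computation: a bias that depends only on the relative distance between a key position and a query position can be rebuilt for any (query bucket, key bucket) block from the block's start offsets, without ever forming the full N x N matrix. The sketch below is a hedged illustration of that observation. `toy_bias` is a hypothetical stand-in for a learned bias, not the RelativePositionBias module from x-transformers, and `bias_chunk` is a made-up helper, not part of this package.

```python
def toy_bias(rel_pos):
    """Hypothetical stand-in for a learned relative position bias."""
    return -0.1 * abs(rel_pos)

def bias_chunk(q_start, q_len, k_start, k_len, bias_fn=toy_bias):
    """Bias block for queries [q_start, q_start + q_len) and keys [k_start, k_start + k_len)."""
    return [[bias_fn(j - i) for j in range(k_start, k_start + k_len)]
            for i in range(q_start, q_start + q_len)]

n = 8
full = bias_chunk(0, n, 0, n)  # full N x N bias, built here only for comparison
block = bias_chunk(q_start=4, q_len=2, k_start=2, k_len=3)

# The block equals the corresponding slice of the full matrix.
matches = all(block[a][b] == full[4 + a][2 + b]
              for a in range(2) for b in range(3))
```

Each block costs memory proportional to the two bucket sizes, so the bias follows the same bucketing as the attention scores. Whether this maps cleanly onto a particular learned bias module depends on how that module derives its values from relative positions.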

opened by pfeatherstone 5

save_for_backward can only save variables, but argument 5 is of type bool

Hi,

Thank you for your indescribable work. I was trying to test your method specifically for cross-attention, but I get the error "save_for_backward can only save variables, but argument 5 is of type bool". I am not sure what I am doing wrong. I tried your own examples too, but get the same error.

Can you please help me out?

Code:

import torch
from memory_efficient_attention_pytorch import Attention

cross_attn = Attention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    memory_efficient = True,
    q_bucket_size = 1024,
    k_bucket_size = 2048
).cuda()

# out = sm_mod(inp1)
x = torch.randn(1, 65536, 512).cuda()
context = torch.randn(1, 65536, 512).cuda()
# mask = torch.ones(1, 65536).bool().cuda()
out = cross_attn(x

ERROR:

  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/main.py", line 45, in <module>
    cli.main()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/abali/.vscode-server/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("main"))
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/stars/user/abali/Phd_work/ISBI2023/X3D-Multigrid/CrossAttn_X3d_v2.py", line 872, in <module>
    out = cross_attn(x, context = context, mask = mask) # (1, 65536, 512) print(out)
  File "/home/abali/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 215, in forward
    out = attn_fn(q, k, v, mask = mask, attn_bias = attn_bias, causal = self.causal, q_bucket_size = q_bucket_size, k_bucket_size = k_bucket_size)
  File "/home/abali/.conda/envs/py38_ydp5/lib/python3.8/site-packages/memory_efficient_attention_pytorch/memory_efficient_attention.py", line 127, in memory_efficient_attention
    exp_weight_chunk, weighted_value_chunk, weight_max_chunk = summarize_qkv_fn(
  File "/home/abali/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
TypeError: save_for_backward can only save variables, but argument 5 is of type bool

opened by aliabid2243 1

Checkpointing is not compatible with .grad() or when an `inputs` parameter is passed to .backward()

https://github.com/lucidrains/memory-efficient-attention-pytorch/blob/35559a05572f9d4eb982a8e2e399b40a2d61b85c/memory_efficient_attention_pytorch/memory_efficient_attention.py#L95

Should this be:

summarize_qkv_fn = summarize_qkv_chunk if needs_backwards else checkpointed_summarize_qkv_chunk

instead of:

summarize_qkv_fn = checkpointed_summarize_qkv_chunk if needs_backwards else summarize_qkv_chunk

opened by vrobot 0

Releases (0.1.1)

0.1.1 (Dec 30, 2022)
0.1.0 (Dec 30, 2022)
0.0.27 (Nov 1, 2022)
0.0.26 (Jul 23, 2022)
0.0.25 (Jul 23, 2022)
0.0.24 (Jul 23, 2022)
0.0.23 (Jul 23, 2022)
0.0.22 (Jul 23, 2022)
0.0.21 (Jul 23, 2022)
0.0.20 (Jul 23, 2022)
0.0.19 (Jul 23, 2022)
0.0.18 (Jul 23, 2022)
0.0.17 (Mar 22, 2022)
0.0.16 (Mar 21, 2022)
0.0.15 (Mar 13, 2022)
0.0.14 (Mar 4, 2022)
0.0.12 (Mar 4, 2022)
0.0.11 (Mar 4, 2022)
0.0.10 (Mar 4, 2022)
0.0.9 (Mar 4, 2022)
0.0.8 (Mar 4, 2022)
0.0.7 (Mar 4, 2022)
0.0.6 (Mar 4, 2022)
0.0.5 (Mar 4, 2022)
0.0.4 (Mar 4, 2022)
0.0.2 (Mar 4, 2022)
0.0.1 (Mar 3, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

