Implementation of Feedback Transformer in Pytorch

Overview

Feedback Transformer - Pytorch

Simple implementation of the Feedback Transformer in Pytorch. It improves on Transformer-XL by giving each token access to the representations of all previous layers through time. This is achieved by aggregating the outputs of all layers into a shared memory, which each token, at every layer, can attend to at each time step.
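
To make the idea concrete, here is a rough sketch of the aggregation step (illustrative code with made-up names, not the library's actual internals): at each time step, the outputs of all layers are collapsed into a single feedback vector by a learned softmax-weighted sum, and that vector is appended to the shared memory that every layer attends to.

import torch

def aggregate_layer_outputs(layer_outputs, layer_weights):
    # layer_outputs: list of (batch, dim) tensors, one per layer (plus the embedding)
    # layer_weights: (num_layers,) learnable parameter
    stacked = torch.stack(layer_outputs, dim = 0)            # (layers, batch, dim)
    weights = layer_weights.softmax(dim = 0)                 # normalize across layers
    return (weights[:, None, None] * stacked).sum(dim = 0)   # (batch, dim) feedback vector

# the feedback vector is then appended to the memory, keeping only the last mem_len entries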

The main drawback is longer training time, due to its non-parallel nature. But I thought I'd build it to enable further exploration and research into this line of work.

Yannic Kilcher video

I also took the liberty of adding a few enhancements, including pre-normalization, GLU-gated feedforwards, and simplified T5 relative positional embeddings.
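
As an illustration of one of these enhancements, below is a rough sketch of a pre-normalized, GLU-gated feedforward block. The module name and details (for instance the gating activation) are assumptions; the library's actual implementation may differ.

import torch
from torch import nn

class GLUFeedForward(nn.Module):
    def __init__(self, dim, mult = 4, dropout = 0.):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                   # pre-normalization
        self.proj_in = nn.Linear(dim, dim * mult * 2)   # values and gates in one projection
        self.proj_out = nn.Linear(dim * mult, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        value, gate = self.proj_in(self.norm(x)).chunk(2, dim = -1)
        x = value * torch.sigmoid(gate)                 # GLU: values gated by sigmoid(gate)
        return self.proj_out(self.dropout(x))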

Install

$ pip install feedback-transformer-pytorch

Usage

import torch
from feedback_transformer_pytorch import FeedbackTransformer

model = FeedbackTransformer(
    num_tokens = 20000,           # number of tokens
    dim = 512,                    # dimension
    depth = 6,                    # depth
    seq_len = 2,                  # the sequence length of each segment or window
    mem_len = 256,                # length of the memory buffer
    dim_head = 64,                # dimension of each head
    heads = 8,                    # number of heads
    attn_dropout = 0.1,           # attention dropout
    ff_dropout = 0.1              # feedforward dropout
).cuda()

x = torch.randint(0, 20000, (2, 64)).cuda()
model(x)  # (2, 64, 20000)
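
Since the output is just next-token logits, you can train the model as an ordinary autoregressive language model. The snippet below is a minimal, assumed training step (not part of the library), using a shifted cross-entropy loss.

import torch.nn.functional as F

seq = torch.randint(0, 20000, (2, 65)).cuda()
logits = model(seq[:, :-1])                    # (2, 64, 20000)
loss = F.cross_entropy(
    logits.reshape(-1, 20000),                 # (batch * seq, num_tokens)
    seq[:, 1:].reshape(-1)                     # targets shifted by one position
)
loss.backward()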

If you would like fine control over the memory (when to detach it, etc.), you can do so with a couple of extra keyword arguments on .forward.

import torch
from feedback_transformer_pytorch import FeedbackTransformer

model = FeedbackTransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 6,
    seq_len = 32,
    mem_len = 256
).cuda()

x1 = torch.randint(0, 20000, (2, 32)).cuda()
x2 = torch.randint(0, 20000, (2, 32)).cuda()
x3 = torch.randint(0, 20000, (2, 32)).cuda()

out1, mem1 = model(x1, return_memory = True)
out2, mem2 = model(x2, memory = mem1, return_memory = True)
out3, mem3 = model(x3, memory = mem2, return_memory = True)  # (2, 32, 20000)
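
For example, to truncate backpropagation across segment boundaries you could detach the returned memory before passing it back in. This is only a sketch and assumes the memory unpacks into a tuple-like container of tensors; adjust it to whatever structure your version of the library actually returns.

mem1 = tuple(t.detach() for t in mem1)                       # stop gradients at the segment boundary
out2, mem2 = model(x2, memory = mem1, return_memory = True)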

Citations

@misc{fan2021addressing,
    title   = {Addressing Some Limitations of Transformers with Feedback Memory}, 
    author  = {Angela Fan and Thibaut Lavril and Edouard Grave and Armand Joulin and Sainbayar Sukhbaatar},
    year    = {2021},
    eprint  = {2002.09402},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
Comments
  • Should it really be using lower layers output for keys and values?

    Could you explain the logic of how the key-value pairs are formed at these lines and whether it is necessary?

    https://github.com/lucidrains/feedback-transformer-pytorch/blob/d7d8939910d1491f01a3d93ce81d4663925fb389/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L146-L151

    It looks to me like line 146 transforms the output of the layer below (x) into keys and values, and the following lines combine these keys and values with the memory. I thought that x should only be used for forming the query here, and that only the existing memory should be used for keys and values.

    opened by tarvaina 6
  • In place operation with gradient

    https://github.com/lucidrains/feedback-transformer-pytorch/blob/main/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L173 I think this is an error.

    opened by hadaev8 4
  • Bug in weighted sum

    Bug in https://github.com/lucidrains/feedback-transformer-pytorch/blob/main/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L264

    Should be layer_weight = rearrange(layer_weight, 'd -> d () () ()')

    opened by Victor0118 1
  • Input/Output dimensions

    Hey @lucidrains

    Can I check the dimensions of the input and output? Is it (seq_len, dim) -> (?, dim, tokens)?

    model = FeedbackTransformer(
        num_tokens = 20000,           # number of tokens
        dim = 512,                    # dimension
        depth = 6,                    # depth
        seq_len = 2,                  # the sequence length of each segment or window
        mem_len = 256,                # length of the memory buffer
        dim_head = 64,                # dimension of each head
        heads = 8,                    # number of heads
        attn_dropout = 0.1,           # attention dropout
        ff_dropout = 0.1              # feedforward dropout
    ).cuda()
    
    x = torch.randint(0, 256, (2, 512)).cuda()
    model(x)  # (1, 512, 20000)
    
    opened by iiSeymour 1
  • Non intuitive memory usage with cross attention

    Given a simple 256-dim, 512-length tensor and a memory length of 16, the feedback transformer uses 3.6 GB of memory after the forward pass. With cross attention over a 100-length tensor, usage grows to 14 GB.

    The parallel version uses 3.1 GB and 3.5 GB respectively.

    Notebooks for testing https://colab.research.google.com/drive/1dRImydFn3WthOXdLYIvdf5bsqjXcmhC5?usp=sharing https://colab.research.google.com/drive/1n653j4Pz9_U7OukhTlUbomAHMvpPXwx0?usp=sharing

    opened by hadaev8 0
  • I think mask padding value should be False

    Here https://github.com/lucidrains/feedback-transformer-pytorch/blob/with-cross-attention/feedback_transformer_pytorch/feedback_transformer_pytorch.py#L181

    opened by hadaev8 0
  • ETA for the enwiki8 example

    Hey @lucidrains,

    Any ETA on the auto-regressive enwiki8 example? I and others would really appreciate it, as always :)

    Also, if you can provide an example for training on custom line-by-line TXT datasets, it would be absolutely fantastic.

    Thank you.

    opened by asigalov61 0
Owner
Phil Wang
Working with Attention. It's all we need.