Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Last update: Oct 04, 2022

Overview

Ponder(ing) Transformer

An implementation of a Transformer that learns to adapt the number of computational steps it takes to the difficulty of the input sequence, using the scheme from the PonderNet paper. The repository also aims to abstract out a pondering module that can be used with any block that returns an output together with a halting probability.
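To make the halting scheme concrete, here is a minimal pure-Python sketch of how per-step conditional halting probabilities (written λ_n in the PonderNet paper) turn into a distribution over halting steps, p_n = λ_n · ∏_{j<n} (1 − λ_j). The name `halting_distribution` is illustrative and is not part of this package's API. Folding the leftover mass into the last step is an assumption made here so that the result sums to 1.

```python
from typing import List


def halting_distribution(lambdas: List[float]) -> List[float]:
    """Turn conditional halting probabilities into p_n = lambda_n * prod_{j<n} (1 - lambda_j).

    Any probability mass left after the final step is added to that step so
    the result sums to 1 (an assumption made for this sketch).
    """
    if not lambdas:
        raise ValueError("need at least one pondering step")
    probs = []
    not_halted = 1.0  # probability of not having halted before the current step
    for lam in lambdas:
        probs.append(not_halted * lam)
        not_halted *= 1.0 - lam
    probs[-1] += not_halted  # fold the remainder into the last step
    return probs


p = halting_distribution([0.1, 0.5, 0.9])
print(p)  # approximately [0.1, 0.45, 0.45]
```

With these inputs the second step halts with conditional probability 0.5 but is only reached 90% of the time, which is why its unconditional probability is 0.45.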

This repository would not have been possible without repeated viewings of Yannic's educational video.

Install

$ pip install ponder-transformer

Usage

import torch
from ponder_transformer import PonderTransformer

model = PonderTransformer(
    num_tokens = 20000,
    dim = 512,
    max_seq_len = 512
)

mask = torch.ones(1, 512).bool()

x = torch.randint(0, 20000, (1, 512))
y = torch.randint(0, 20000, (1, 512))

loss = model(x, labels = y, mask = mask)
loss.backward()

Now you can set the model to .eval() mode, and it will terminate early once every sample in the batch has emitted a halting signal:

import torch
from ponder_transformer import PonderTransformer

model = PonderTransformer(
    num_tokens = 20000,
    dim = 512,
    max_seq_len = 512,
    causal = True
)

x = torch.randint(0, 20000, (2, 512))
mask = torch.ones(2, 512).bool()

model.eval() # setting to eval makes it return the logits as well as the halting indices

logits, layer_indices = model(x, mask = mask) # (2, 512, 20000), (2,)

# layer indices will contain, for each batch element, which layer they exited
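
The early-termination behaviour described above can be sketched in plain Python. This is not the package's internal code: `ponder_until_halted` and `step_fn` are hypothetical names, and sampling a Bernoulli halting decision at each step (with a forced halt on the last step) is an assumption about how a halting signal is produced.

```python
import random
from typing import Callable, List, Optional


def ponder_until_halted(
    step_fn: Callable[[int], List[float]],
    batch_size: int,
    max_steps: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return, for each batch element, the index of the step at which it halted.

    step_fn(step) is a hypothetical stand-in for one pondering block; it
    returns one halting probability per batch element.
    """
    rng = rng or random.Random(0)
    exit_step = [-1] * batch_size  # -1 marks "still pondering"
    for step in range(max_steps):
        halt_probs = step_fn(step)
        is_last = step == max_steps - 1
        for i, prob in enumerate(halt_probs):
            if exit_step[i] != -1:
                continue  # already halted at an earlier step
            # Sample a Bernoulli halting decision; force a halt on the last step.
            if is_last or rng.random() < prob:
                exit_step[i] = step
        if all(s != -1 for s in exit_step):
            break  # every sample has halted, so the remaining steps are skipped
    return exit_step


# The first element always halts immediately, the second never halts voluntarily.
print(ponder_until_halted(lambda step: [1.0, 0.0], batch_size=2, max_steps=4))  # [0, 3]
```

The returned list plays the same role as `layer_indices` in the example above: one exit step per batch element.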

Citations

@misc{banino2021pondernet,
    title   = {PonderNet: Learning to Ponder}, 
    author  = {Andrea Banino and Jan Balaguer and Charles Blundell},
    year    = {2021},
    eprint  = {2107.05407},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}


Comments

Evaluating ponder-net on more pondering-steps than trained on.

As the paper says,

In evaluation, and under known temporal or computational limitations, N can be set naively as a constant (or not set any limit, i.e. N → ∞). For training, we found that a more effective (and interpretable) way of parameterizing N is by defining a minimum cumulative probability of halting. N is then the smallest value of n such that $\sum_{j=1}^{n} p_j > 1 - \varepsilon$, with the hyper-parameter ε positive near 0 (in our experiments 0.05).

From that I infer that the model can ponder for more steps at evaluation than it was trained on. How can this be done with this implementation?

edit: I was going through the paper again, and I think what it means is that the maximum number of pondering steps, N, should be re-evaluated at every training step. The model should run until the condition is met or a predefined maximum number of steps is reached, and the step at which the cumulative-probability condition is met becomes N, with the halting probabilities normalised by one of the methods. That value of N is then used to compute the geometric prior for the KL divergence (without normalising the prior term).

That is, if the number of pondering steps is initially set to M, the model recurs for k steps, i.e. until the condition is met or M steps have been taken. N is then found by computing the probabilities p_0 to p_k, normalising them with one of the methods, taking their cumulative sum, and picking the first step at which the sum exceeds the threshold. After that, the geometric prior is computed with the chosen hyper-parameter for a sequence length of N and used in the KL-divergence term against the halting probabilities truncated to N steps.

$\lambda_p$ is a hyper-parameter that defines a geometric prior distribution $p_G(\lambda_p)$ on the halting policy (truncated at N)
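
The procedure described in this comment can be sketched as follows. It follows the commenter's reading of the paper rather than this repository's code, and the helper names `smallest_n`, `truncated_geometric_prior` and `kl_divergence` are hypothetical. The halting probabilities are assumed to be already normalised, and the prior is left unnormalised as the comment suggests.

```python
import math
from typing import List


def smallest_n(halting_probs: List[float], epsilon: float = 0.05) -> int:
    """Smallest n (1-indexed) whose cumulative halting probability exceeds 1 - epsilon.

    Falls back to the number of available steps if the threshold is never met.
    """
    cumulative = 0.0
    for n, p in enumerate(halting_probs, start=1):
        cumulative += p
        if cumulative > 1.0 - epsilon:
            return n
    return len(halting_probs)


def truncated_geometric_prior(lambda_p: float, n: int) -> List[float]:
    """Geometric prior lambda_p * (1 - lambda_p)^k for k = 0..n-1, left unnormalised."""
    return [lambda_p * (1.0 - lambda_p) ** k for k in range(n)]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) over matching steps, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


halting_probs = [0.5, 0.3, 0.17, 0.02, 0.01]  # made-up, already normalised values
N = smallest_n(halting_probs, epsilon=0.05)   # cumulative sum first exceeds 0.95 at step 3
prior = truncated_geometric_prior(lambda_p=0.2, n=N)
kl = kl_divergence(halting_probs[:N], prior)
print(N, prior, kl)
```

Because the truncated prior is not renormalised here, the KL term is not guaranteed to be non-negative; renormalising both distributions over the first N steps is an alternative design choice.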

opened by Vbansal21 0

Can PonderNet be used for ImageNet?

I plan to do a project on task complexity for image datasets such as ImageNet and CIFAR-100. If I use a vision transformer, can I implement my project with PonderNet?

opened by fryegg 2

Releases (0.0.8)

0.0.8 (Oct 30, 2021)
0.0.7 (Aug 30, 2021)
0.0.5 (Aug 26, 2021)
0.0.4 (Aug 26, 2021)
0.0.3 (Aug 26, 2021)
0.0.2 (Aug 26, 2021)
0.0.1 (Aug 26, 2021)

Owner

Phil Wang

Working with Attention. It's all we need
