Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in PyTorch

Last update: Dec 23, 2022

Overview

Transformer in Transformer

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in PyTorch.

Install

$ pip install transformer-in-transformer

Usage

import torch
from transformer_in_transformer import TNT

tnt = TNT(
    image_size = 256,       # size of image
    patch_dim = 512,        # dimension of patch token
    pixel_dim = 24,         # dimension of pixel token
    patch_size = 16,        # patch size
    pixel_size = 4,         # pixel size
    depth = 6,              # depth
    num_classes = 1000,     # output number of classes
    attn_dropout = 0.1,     # attention dropout
    ff_dropout = 0.1        # feedforward dropout
)

img = torch.randn(2, 3, 256, 256)
logits = tnt(img) # (2, 1000)

Citations

@misc{han2021transformer,
    title   = {Transformer in Transformer}, 
    author  = {Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
    year    = {2021},
    eprint  = {2103.00112},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Only works if pixel_size**2 == patch_size?

Hi, is this only supposed to work if pixel_size**2 == patch_size? When patch_size is set to any value that doesn't satisfy this equation, the following error occurs:

--> 146         pixels += rearrange(self.pixel_pos_emb, 'n d -> () n d')
    147 
    148         for pixel_attn, pixel_ff, pixel_to_patch_residual, patch_attn, patch_ff in self.layers:

RuntimeError: The size of tensor a (4) must match the size of tensor b (64) at non-singleton dimension 1

The error occurred when running:

import torch
from transformer_in_transformer import TNT

tnt = TNT(
    image_size = 128,       # size of image
    patch_dim = 256,        # dimension of patch token
    pixel_dim = 24,         # dimension of pixel token
    patch_size = 16,        # patch size
    pixel_size = 2,         # pixel size (2**2 != 16)
    depth = 6,              # depth
    heads = 1,              # number of attention heads
    num_classes = 2,        # output number of classes
    attn_dropout = 0.1,     # attention dropout
    ff_dropout = 0.1        # feedforward dropout
)
img = torch.randn(2, 3, 128, 128)
logits = tnt(img)           # raises the RuntimeError shown above

Since I am completely new to einops, it's quite hard for me to debug :D Thanks

opened by PhilippMarquardt 1
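
A hedged note on the arithmetic behind the shape mismatch in this issue, offered as a sketch rather than a confirmed diagnosis. If each patch is cut into non-overlapping pixel_size x pixel_size tiles, the number of pixel tokens per patch is (patch_size // pixel_size) ** 2. That equals pixel_size ** 2 only when patch_size == pixel_size ** 2, which would be consistent with the sizes 4 and 64 reported for patch_size = 16 and pixel_size = 2. The helper name pixel_tokens_per_patch is hypothetical and is not part of the library.

```python
def pixel_tokens_per_patch(patch_size: int, pixel_size: int) -> int:
    """Count non-overlapping pixel_size x pixel_size tiles inside one patch.

    Hypothetical helper for illustration; not part of transformer_in_transformer.
    """
    if patch_size % pixel_size != 0:
        raise ValueError("patch_size must be divisible by pixel_size")
    return (patch_size // pixel_size) ** 2


# Compare the tile count with pixel_size ** 2 for the two configurations
# that appear on this page (README usage and the failing issue report).
for patch_size, pixel_size in [(16, 4), (16, 2)]:
    tokens = pixel_tokens_per_patch(patch_size, pixel_size)
    matches = tokens == pixel_size ** 2
    print(f"patch_size={patch_size} pixel_size={pixel_size} "
          f"tokens={tokens} pixel_size**2={pixel_size ** 2} equal={matches}")
```

With patch_size = 16 and pixel_size = 4 the two quantities coincide (both 16), which may be why the README configuration runs while the pixel_size = 2 configuration does not.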

Not sure what is wrong!

RuntimeError                              Traceback (most recent call last)
in
     14
     15 img = torch.randn(1, 3, 256, 256)
---> 16 logits = tnt(img) # (2, 1000)

~/opt/anaconda3/envs/ml/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

~/opt/anaconda3/envs/ml/lib/python3.8/site-packages/transformer_in_transformer/tnt.py in forward(self, x)
    159         patches = repeat(self.patch_tokens[:(n + 1)], 'n d -> b n d', b = b)
    160
--> 161         patches += rearrange(self.patch_pos_emb[:(n + 1)], 'n d -> () n d')
    162         pixels += rearrange(self.pixel_pos_emb, 'n d -> () n d')
    163

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

opened by RisabBiswas 0
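
A minimal sketch of the general PyTorch behavior behind this error message, assuming (not confirmed by the source) that `patches` ends up as a view of the `patch_tokens` parameter in the reporter's environment. PyTorch refuses in-place updates on a view of a leaf tensor that requires grad, while the equivalent out-of-place addition is allowed. The names `tokens` and `pos_emb` are stand-ins, not the library's attributes.

```python
import torch
from torch import nn

# Stand-ins for learned parameters; shapes are arbitrary for illustration.
tokens = nn.Parameter(torch.randn(5, 8))
pos_emb = nn.Parameter(torch.randn(5, 8))

# Slicing and unsqueezing return views that share storage with the leaf.
batched = tokens[:3].unsqueeze(0)  # shape (1, 3, 8)

inplace_failed = False
message = ""
try:
    # In-place add on a view of a leaf that requires grad is rejected.
    batched += pos_emb[:3].unsqueeze(0)
except RuntimeError as err:
    inplace_failed = True
    message = str(err)

# The out-of-place form builds a new tensor and keeps autograd intact.
out = tokens[:3].unsqueeze(0) + pos_emb[:3].unsqueeze(0)
out.sum().backward()

print(inplace_failed, out.shape, tokens.grad.shape)
```

Under that assumption, writing the addition as `patches = patches + ...` instead of `patches += ...` is one possible workaround; whether it matches the maintainer's intended fix is not stated here.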

patch_tokens vs patch_pos_emb

Hi!

I'm trying to understand your TNT implementation, and one thing that confused me is why there are two parameters, patch_tokens and patch_pos_emb, which seem to have the same purpose: to encode patch position. Isn't one of them redundant?

self.patch_tokens = nn.Parameter(torch.randn(num_patch_tokens + 1, patch_dim))
self.patch_pos_emb = nn.Parameter(torch.randn(num_patch_tokens + 1, patch_dim))
...
patches = repeat(self.patch_tokens[:(n + 1)], 'n d -> b n d', b = b)
patches += rearrange(self.patch_pos_emb[:(n + 1)], 'n d -> () n d')

opened by stas-sl 0

Inconsistent model params with MindSpore src code

There's no function or README description of the TNT-S/TNT-B models in this codebase. Something like:

def tnt_b(num_class):
    return TNT(
        img_size=384,
        patch_size=16,
        num_channels=3,
        embedding_dim=640,
        num_heads=10,
        num_layers=12,
        hidden_dim=640*4,
        stride=4,
        num_class=num_class,
    )

And the number of heads in the inner block should be 4: https://github.com/lucidrains/transformer-in-transformer/blob/main/transformer_in_transformer/tnt.py#L135

Has anyone reproduced the paper's reported results with this codebase?

opened by WongChen 0

Why the loss become NaN?

It is a great project, and I am very interested in the Transformer in Transformer model. I used your model to train on the Vehicle-1M dataset, a fine-grained visual classification dataset. With this model, the loss becomes NaN after some batch iterations. I decreased the learning rate of the Adam optimizer and clipped the gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2), but the loss still sometimes becomes NaN. It seems that the gradients are not large, but they point in the same direction for many iterations. How can I solve this?

opened by yt7589 3
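
The steps described in this issue can be sketched as a single training step: clip the gradient norm as the reporter did, and additionally skip the update when the loss is not finite so the failing batch can be inspected. This is only an illustrative sketch, not a fix for the reported NaN. The function name training_step, the nn.Linear stand-in model, and the learning rate are all assumptions; only the clip_grad_norm_ arguments come from the issue.

```python
import torch
from torch import nn


def training_step(model, inputs, targets, optimizer, max_norm=2.0):
    """One optimization step with a finite-loss guard and gradient clipping.

    Returns (loss, pre-clip gradient norm), or None if the loss is NaN/inf.
    Hypothetical helper for illustration only.
    """
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    if not torch.isfinite(loss):
        return None  # skip the update so the bad batch can be examined
    loss.backward()
    # Same clipping call as quoted in the issue; returns the total norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=max_norm, norm_type=2
    )
    optimizer.step()
    return loss.item(), grad_norm.item()


torch.manual_seed(0)
model = nn.Linear(8, 3)  # stand-in for a TNT model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

good = training_step(model, torch.randn(4, 8), torch.randint(0, 3, (4,)), optimizer)
bad = training_step(model, torch.full((4, 8), float("nan")),
                    torch.randint(0, 3, (4,)), optimizer)
print(good is not None, bad is None)
```

Logging the returned gradient norm per step is one way to check the reporter's observation that the gradients stay small before the loss diverges.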

Releases(0.1.2)

0.1.2 (Dec 27, 2021)
0.1.1 (Mar 23, 2021)
0.1.0 (Mar 21, 2021)
0.0.9 (Mar 18, 2021)
0.0.8 (Mar 10, 2021)
0.0.7 (Mar 9, 2021)
0.0.6 (Mar 4, 2021)
0.0.5 (Mar 4, 2021)
0.0.4 (Mar 4, 2021)
0.0.3 (Mar 3, 2021)
0.0.2 (Mar 2, 2021)
0.0.1 (Mar 2, 2021)

Owner

Phil Wang

Working with Attention. It's all we need.

