Charformer - Pytorch

Implementation of the GBST (gradient-based subword tokenization) module from the Charformer paper, in Pytorch. The paper proposes a module that automatically learns subword representations, obviating the need for tokenizers in the encoder setting.

AI Coffee Break with Letitia video

Install

$ pip install charformer-pytorch

Usage

import torch
from charformer_pytorch import GBST

tokenizer = GBST(
    num_tokens = 257,             # number of tokens, should be 256 for byte encoding (+ 1 special token for padding in this example)
    dim = 512,                    # dimension of token and intra-block positional embedding
    max_block_size = 4,           # maximum block size
    downsample_factor = 4,        # the final downsample factor by which the sequence length will decrease
    score_consensus_attn = True   # whether to do the cheap score consensus (aka attention) as in eq. 5 in the paper
)

tokens = torch.randint(0, 257, (1, 1023)) # uneven number of tokens (1023)
mask   = torch.ones(1, 1023).bool()

# both tokens and mask will be appropriately downsampled

tokens, mask = tokenizer(tokens, mask = mask) # (1, 256, 512), (1, 256)

# now pass this on to your transformer
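
# for example, with a vanilla nn.TransformerEncoder (an illustrative choice, not
# part of this library - any model that accepts (batch, seq, dim) embeddings works)

from torch import nn

encoder_layer = nn.TransformerEncoderLayer(d_model = 512, nhead = 8, batch_first = True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers = 6)

# True in `mask` marks valid positions, so invert it for src_key_padding_mask
out = encoder(tokens, src_key_padding_mask = ~mask) # (1, 256, 512)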

Citations

@misc{tay2021charformer,
    title   = {Charformer: Fast Character Transformers via Gradient-based Subword Tokenization}, 
    author  = {Yi Tay and Vinh Q. Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
    year    = {2021},
    eprint  = {2106.12672},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
Comments
  • positional embedding

    In section 2.1.1 of the paper, the authors claim that adding intra-block positional embeddings https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96 makes the block representations aware of the position of each character. However, if one is doing mean pooling as the authors propose, wouldn't this amount to just adding the mean of the positional embeddings to every block? If anyone has any insights, please leave a comment.
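
    For what it's worth, a quick numerical check of that observation (a standalone sketch, not code from this repo; the tensors are random and purely illustrative):

    import torch

    # with plain mean pooling, pooling (char embedding + intra-block positional
    # embedding) over a block gives exactly the mean char embedding plus the
    # mean positional embedding
    block_size, dim = 4, 512
    char_embs = torch.randn(block_size, dim)   # character embeddings within one block
    pos_embs  = torch.randn(block_size, dim)   # intra-block positional embeddings

    pooled_a = (char_embs + pos_embs).mean(dim = 0)
    pooled_b = char_embs.mean(dim = 0) + pos_embs.mean(dim = 0)

    assert torch.allclose(pooled_a, pooled_b, atol = 1e-6)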

    help wanted 
    opened by lucidrains 3
  • Cannot tokenize on GPU

    Hi,

    I'm using Charformer to do some error correction on Colab, but I found that after I move the tokens to CUDA and start tokenizing, the error in the attached screenshot shows up.

    Am I doing something wrong?
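
    A likely cause (guessing, since the screenshot is not visible here) is that only the tokens were moved to CUDA while the GBST module stayed on the CPU. A minimal sketch of moving both, under that assumption:

    import torch
    from charformer_pytorch import GBST

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # move the module's parameters to the same device as the inputs
    tokenizer = GBST(num_tokens = 257, dim = 512, max_block_size = 4, downsample_factor = 4).to(device)

    tokens = torch.randint(0, 257, (1, 1024), device = device)
    mask   = torch.ones(1, 1024, device = device).bool()

    tokens, mask = tokenizer(tokens, mask = mask)  # both outputs now live on `device`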

    opened by Shamepoo 2
  • example of how to read in/tokenize a text file, for use with HuggingFace Transformers?

    Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.

    I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.

    I tried looking at the source code, and the other issues here, but haven't yet found the details.

    Some specific questions:

    • how do I "train" this tokenizer on a .txt file?
    • is it compatible with this section of the HF notebook, aka can it be passed into LineByLineTextDataset?
    from transformers import LineByLineTextDataset
    
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="./oscar.eo.txt",
        block_size=128,
    )
    

    When I tried doing that line, I got the following error:

    /usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
      FutureWarning,
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-38-1688c68b48be> in <module>()
          5     tokenizer=tokenizer,
          6     file_path="./oscar.eo.txt",
    ----> 7     block_size=128,
          8 )
    
    1 frames
    
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    TypeError: forward() got an unexpected keyword argument 'add_special_tokens'
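
    The error above happens because LineByLineTextDataset calls the tokenizer like a Hugging Face tokenizer (with add_special_tokens and other keyword arguments), while GBST is a plain nn.Module that expects byte ids and is learned end-to-end with the downstream model rather than trained separately on a corpus. A rough sketch of reading the text file and encoding it to byte ids directly, bypassing LineByLineTextDataset (the padding id 256 and the max length of 1024 are illustrative choices, not part of this library):

    import torch
    from charformer_pytorch import GBST

    PAD_ID  = 256    # illustrative: the one extra id beyond the 256 byte values
    MAX_LEN = 1024   # illustrative fixed sequence length

    def encode_line(line, max_len = MAX_LEN):
        # encode a string to raw UTF-8 byte ids, then truncate / pad to a fixed length
        ids  = list(line.encode('utf-8'))[:max_len]
        mask = [True] * len(ids) + [False] * (max_len - len(ids))
        ids  = ids + [PAD_ID] * (max_len - len(ids))
        return torch.tensor(ids), torch.tensor(mask)

    with open('./oscar.eo.txt', encoding = 'utf-8') as f:
        lines = [l.strip() for l in f if l.strip()]

    tokens, masks = map(torch.stack, zip(*[encode_line(l) for l in lines[:32]]))

    tokenizer = GBST(num_tokens = 257, dim = 512, max_block_size = 4, downsample_factor = 4)
    embeds, masks = tokenizer(tokens, mask = masks)  # (32, 256, 512), (32, 256)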
    
    opened by cdleong 0
  • Sequence Length Problem in NMT

    After downsampling, the sequence length is shortened. But how can I return the sequence to its original length, since I may need to do sentence generation for error correction?

    Thank you!
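
    One simple option, purely as an illustration, is to upsample the downsampled sequence back by the same factor before the generation head, e.g. by repeating each position; a transposed convolution or cross-attention decoder would be the learned alternatives:

    import torch

    downsample_factor = 4
    embeds = torch.randn(1, 256, 512)  # e.g. GBST output after 4x downsampling

    # naive upsampling: repeat each downsampled position 4 times to recover
    # the original sequence length of 1024
    upsampled = torch.repeat_interleave(embeds, downsample_factor, dim = 1)  # (1, 1024, 512)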

    opened by Shamepoo 1
  • Bytes vs. Characters

    The authors address the difference between bytes and characters in footnote 2; it seems like a byte is just a character embedding with a vocabulary size of 256. However, the footnote's last sentences say: "For other languages, each character corresponds to 2–3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise." So for the example 子词分词, it becomes 子子子词词词分分分词词词, with 3 bytes for every character.

    What I want to know is: does 3 bytes mean we replicate every single character three times and then feed that into the embedding? If so, how do we decide the number of bytes?

    Thank you.
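
    To make the footnote concrete (a standalone check, not code from this repo): UTF-8 itself fixes how many bytes each character takes, so there is nothing to decide; each of these CJK characters encodes to 3 byte ids in the 0–255 range, and those ids are what get fed to the embedding:

    text = "子词分词"
    byte_ids = list(text.encode('utf-8'))

    print(len(text))       # 4 characters
    print(len(byte_ids))   # 12 byte ids - each of these characters takes 3 UTF-8 bytes
    print(byte_ids[:3])    # [229, 173, 144] - the three bytes encoding '子'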

    opened by jamfly 0
Releases(0.0.4)
Owner
Phil Wang
Working with Attention