The official implementation of Autoregressive Image Generation using Residual Quantization (CVPR '22)

Overview

Autoregressive Image Generation using Residual Quantization (CVPR 2022)

Doyup Lee*, Chiheon Kim*, Saehoon Kim, Minsu Cho, Wook-Shin Han (* Equal contribution)
CVPR 2022

Examples of images generated by RQ-Transformer under class conditions and text conditions.
Note that the text conditions in these examples were not used during training.

TL;DR For autoregressive (AR) modeling of high-resolution images, we propose a two-stage framework that consists of RQ-VAE and RQ-Transformer. Our framework precisely approximates the feature map of an image and represents the image as a stack of discrete codes, which makes it possible to generate high-quality images effectively.
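
To make the "stack of discrete codes" idea concrete, here is a minimal sketch of residual quantization for a single feature vector, written in plain Python. It is an illustration only and is not taken from this repository: the function names, the toy codebook, and the choice to reuse one codebook at every depth are assumptions of the sketch. Each step picks the nearest codebook entry for what is still unexplained (the residual), so deeper stacks approximate the vector more closely.

```python
# Illustrative sketch only: names and codebook values are hypothetical.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def residual_quantize(vector, codebook, depth):
    """Quantize `vector` into `depth` codes by repeatedly quantizing the residual."""
    residual = list(vector)
    approx = [0.0] * len(vector)
    codes = []
    for _ in range(depth):
        k = nearest_code(residual, codebook)
        codes.append(k)
        # The running sum of selected entries is the current approximation.
        approx = [a + e for a, e in zip(approx, codebook[k])]
        # What remains to be explained by the next code.
        residual = [r - e for r, e in zip(residual, codebook[k])]
    return codes, approx


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.25, 0.25]]
codes, approx = residual_quantize([1.3, 0.2], codebook, depth=3)
print(codes, approx)  # [1, 4, 0] [1.25, 0.25]
```

With depth 1 the vector would be approximated by [1.0, 0.0] alone; adding a second code moves the approximation to [1.25, 0.25], which is closer to the input.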

Requirements

We have tested our code in the environment below:

  • Python 3.7.10 / PyTorch 1.9.0 / torchvision 0.10.0 / CUDA 11.1 / Ubuntu 18.04

Please run the following command to install the necessary dependencies:

pip install -r requirements.txt

Coverage of Released Code

  • Implementation of RQ-VAE and RQ-Transformer
  • Pretrained checkpoints of RQ-VAEs and RQ-Transformers
  • Training and evaluation pipelines of RQ-VAE
  • Image generation and its evaluation pipeline of RQ-Transformer
  • Jupyter notebook for text-to-image generation of RQ-Transformer

Pretrained Checkpoints

Checkpoints Used in the Original Paper

We provide pretrained checkpoints of RQ-VAEs and RQ-Transformers to reproduce the results in the paper. Please use the links below to download the tar.gz files and extract the pretrained checkpoints. Each link contains the pretrained checkpoints of an RQ-VAE and an RQ-Transformer together with their model configurations.

Dataset           RQ-VAE & RQ-Transformer   # params of RQ-Transformer   FID
FFHQ              link                      355M                         10.38
LSUN-Church       link                      370M                         7.45
LSUN-Cat          link                      612M                         8.64
LSUN-Bedroom      link                      612M                         3.04
ImageNet (cIN)    link                      480M                         15.72
ImageNet (cIN)    link                      821M                         13.11
ImageNet (cIN)    link                      1.4B                         11.56 (4.45)
ImageNet (cIN)    link                      1.4B                         8.71 (3.89)
ImageNet (cIN)    link                      3.8B                         7.55 (3.80)
CC-3M             link                      654M                         12.33

The FID scores above are measured between original samples and generated images. The scores in brackets are measured using 5% rejection sampling with a pretrained ResNet-101. This repository does not include the rejection sampling pipeline.

(NOTE) Large-Scale RQ-Transformer for Text-to-Image Generation

We also provide the pretrained checkpoint of a large-scale RQ-Transformer for text-to-image (T2I) generation. Our paper does not include results for this model, since we trained the RQ-Transformer with 3.9B parameters on about 30 million text-image pairs from CC-3M, CC-12M, and YFCC-subset after the paper submission. Please use the link below to download the checkpoints of the large-scale T2I model. We emphasize that any commercial use of our checkpoints is strictly prohibited.

Download of Pretrained RQ-Transformer on 30M text-image pairs

Dataset                        RQ-VAE & RQ-Transformer   # params
CC-3M + CC-12M + YFCC-subset   link                      3.9B

Evaluation of Large-Scale RQ-Transformer on MS-COCO

In this repository, we evaluate the pretrained RQ-Transformer with 3.9B parameters on MS-COCO. Following the evaluation protocol of DALL-Eval, we randomly select 30K text captions from the val2014 split of MS-COCO and generate 256x256 images from the selected captions. We use (1024, 0.95) for top-(k, p) sampling. The FID scores of the other models are taken from Table 2 of the DALL-Eval paper.

Model                   # params   # data   Image / Grid Size   FID on 2014val
X-LXMERT                228M       180K     256x256 / 8x8       37.4
DALL-E small            120M       15M      256x256 / 16x16     45.8
ruDALL-E-XL             1.3B       120M     256x256 / 32x32     18.6
minDALL-E               1.3B       15M      256x256 / 16x16     24.6
RQ-Transformer (ours)   3.9B       30M      256x256 / 8x8x4     16.9

Note that some text captions in MS-COCO are also included in the YFCC-subset, but the FID changes little whether or not the duplicated captions are removed from the evaluation. See this paper for more details.

Examples of Text-to-Image (T2I) Generation using RQ-Transformer

We provide a Jupyter notebook so that you can easily try text-to-image (T2I) generation with the pretrained RQ-Transformers and view the results. After you download the pretrained checkpoints for T2I generation, open notebooks/T2I_sampling.ipynb and follow the instructions in the notebook. Considering the model size, we recommend a GPU with 32GB or more of memory, such as an NVIDIA V100 or A100.

We attach some examples of T2I generation from the provided Jupyter notebook.

Examples of Generated Images from Text Conditions

a painting by Vincent Van Gogh
a painting by RENÉ MAGRITTE
Eiffel tower on a desert.
Eiffel tower on a mountain.
a painting of a cat with sunglasses in the frame.
a painting of a dog with sunglasses in the frame.

Training and Evaluation of RQ-VAE

Training of RQ-VAEs

Our implementation uses DistributedDataParallel in PyTorch for efficient training in multi-node and multi-GPU environments. Four NVIDIA A100 GPUs are used to train all RQ-VAEs in our paper. You can adjust --nnodes, --nproc_per_node, and --node_rank according to your GPU setup.

  • Training 8x8x4 RQ-VAE on ImageNet 256x256 with a single node having four GPUs

    python -m torch.distributed.launch \
        --master_addr=$MASTER_ADDR \
        --master_port=$PORT \
        --nnodes=1 --nproc_per_node=4 --node_rank=0 \
        main_stage1.py \
        -m=configs/imagenet256/stage1/in256-rqvae-8x8x4.yaml -r=$SAVE_DIR
  • To train an 8x8x4 RQ-VAE on ImageNet using four nodes with one GPU each, run the following command on every node, with $RANK set to that node's rank (0, 1, 2, or 3). Here, we assume that the master node is the node with rank 0.

    python -m torch.distributed.launch \
        --master_addr=$MASTER_ADDR \
        --master_port=$PORT \
        --nnodes=4 --nproc_per_node=1 --node_rank=$RANK \
        main_stage1.py \
        -m=configs/imagenet256/stage1/in256-rqvae-8x8x4.yaml -r=$SAVE_DIR
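
The two launch configurations above start the same number of processes. As a rough sketch of how the launcher flags map to process ranks, the snippet below computes the world size and the global rank of each process from --nnodes, --nproc_per_node, and --node_rank. The helper names are hypothetical and are not part of this repository or of PyTorch; only the arithmetic reflects how the launcher numbers processes.

```python
# Hypothetical helpers illustrating the rank arithmetic of the launcher flags.

def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a process, given its node and its index on that node."""
    return node_rank * nproc_per_node + local_rank


def layout(nnodes, nproc_per_node):
    """List (node_rank, local_rank, global_rank) for every process."""
    return [
        (node, local, global_rank(node, nproc_per_node, local))
        for node in range(nnodes)
        for local in range(nproc_per_node)
    ]


# One node with four GPUs: --nnodes=1 --nproc_per_node=4 --node_rank=0
print(world_size(1, 4), layout(1, 4))
# Four nodes with one GPU each: --nnodes=4 --nproc_per_node=1 --node_rank=$RANK
print(world_size(4, 1), layout(4, 1))
```

In both cases the world size is 4 and the global ranks run from 0 to 3. Only the mapping of ranks to machines differs.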

Finetuning of Pretrained RQ-VAE

  • To finetune a pretrained RQ-VAE on another dataset such as the LSUN datasets, load the pretrained checkpoint by passing the -l=$RQVAE_CKPT argument.
  • For example, to finetune a pretrained RQ-VAE on LSUN-Church, run the command below:
    python -m torch.distributed.launch \
        --master_addr=$MASTER_ADDR \
        --master_port=$PORT \
        --nnodes=1 --nproc_per_node=4 --node_rank=0 \
        main_stage1.py \
        -m=configs/lsun-church/stage1/church256-rqvae-8x8x4.yaml -r=$SAVE_DIR -l=$RQVAE_CKPT

Evaluation of RQ-VAEs

Run compute_rfid.py to evaluate the reconstruction FID (rFID) of learned RQ-VAEs.

python compute_rfid.py --split=val --vqvae=$RQVAE_CKPT
  • The model checkpoint of the RQ-VAE and its configuration yaml file must be located in the same directory.
  • compute_rfid.py evaluates the rFID of the RQ-VAE on the dataset named in the configuration file.
  • Adjust --batch-size to fit the memory of your GPU environment.

Evaluation of RQ-Transformer

The quantitative results in the paper can be reproduced with the evaluation code for RQ-Transformer in this repository. Before evaluating RQ-Transformer on a dataset, the dataset has to be prepared so that the feature vectors of its samples can be computed. Because extracting feature vectors is computationally expensive and takes a long time, we provide precomputed feature statistics for each dataset. You can also prepare the datasets used in our paper yourself by following the instructions in data/README.md.

  • Download the feature statistics of datasets as follows:
    cd assets
    wget https://arena.kakaocdn.net/brainrepo/etc/RQVAE/8b325b628f49bf60a3094fcf9419398c/fid_stats.tar.gz
    tar -zxvf fid_stats.tar.gz
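
For readers unfamiliar with how feature statistics are used, FID compares the mean and covariance of feature vectors from real images with those from generated images. The sketch below is a deliberately simplified version that assumes diagonal covariances (per-dimension variances), so that no matrix square root is needed. It is not the computation used in this repository, it does not read the downloaded files, and the function name and toy numbers are hypothetical.

```python
import math

# Simplified illustration: Frechet distance between two Gaussians with
# DIAGONAL covariances. The standard FID uses full covariance matrices.

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared mean difference plus the diagonal covariance term."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal covariances, Tr(S1 + S2 - 2 * sqrt(S1 * S2)) reduces to
    # a sum over dimensions of v1 + v2 - 2 * sqrt(v1 * v2).
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
fake_mu, fake_var = [1.0, 0.0], [4.0, 1.0]
print(frechet_distance_diag(real_mu, real_var, real_mu, real_var))  # 0.0
print(frechet_distance_diag(real_mu, real_var, fake_mu, fake_var))  # 2.0
```

Identical statistics give a distance of zero, and the distance grows as either the means or the variances drift apart, which is why lower FID is better.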

FFHQ, LSUN-{Church, Bedroom, Cat}, (conditional) ImageNet

  • After the pretrained RQ-Transformer generates 50K images, FID (and IS) is computed between the generated images and the training samples.
  • Use --save-dir to specify the directory where the generated images are saved. If --save-dir is not given, the generated images are saved in the directory of the checkpoint.
  • When four GPUs in a single node are used, run the command below:
    python -m torch.distributed.launch \
      --master_addr=$MASTER_ADDR \
      --master_port=$PORT \
      --nnodes=1 --nproc_per_node=4 --node_rank=0 \
      main_sampling_fid.py \
      -v=$RQVAE_CKPT -a=$RQTRANSFORMER_CKPT --save-dir=$SAVE_IMG_DIR

CC-3M

  • After the pretrained RQ-Transformer generates images from the text captions of the CC-3M validation set, the FID between the validation images and the generated images is computed, together with the CLIP score between the generated images and their text conditions.
  • Evaluation of RQ-Transformer requires the text prompts of CC-3M. Please refer to data/README.md and prepare the dataset first.
  • When four GPUs in a single node are used, run the command below:
    python -m torch.distributed.launch \
      --master_addr=$MASTER_ADDR \
      --master_port=$PORT \
      --nnodes=1 --nproc_per_node=4 --node_rank=0 \
      main_sampling_txt2img.py \
      -v=$RQVAE_CKPT -a=$RQTRANSFORMER_CKPT --dataset="cc3m" --save-dir=$SAVE_IMG_DIR

MS-COCO

  • We follow the protocol of DALL-Eval to evaluate RQ-Transformer on MS-COCO. We use 30K samples randomly selected from the MS-COCO 2014val split and provide the selected samples as a JSON file.
  • Evaluation of RQ-Transformer requires the text prompts of MS-COCO. Please refer to data/README.md and prepare the dataset first.
  • When four GPUs in a single node are used, run the command below:
    python -m torch.distributed.launch \
      --master_addr=$MASTER_ADDR \
      --master_port=$PORT \
      --nnodes=1 --nproc_per_node=4 --node_rank=0 \
      main_sampling_txt2img.py \
      -v=$RQVAE_CKPT -a=$RQTRANSFORMER_CKPT --dataset="coco_2014val" --save-dir=$SAVE_IMG_DIR

NOTE

  • Unfortunately, we do not provide the training code of RQ-Transformer, in order to avoid unexpected misuse through finetuning of our checkpoints. We note that any commercial use of our checkpoints is strictly prohibited.
  • To accurately reproduce the reported results, the checkpoints of RQ-VAE and RQ-Transformer must be correctly matched as described above.
  • The generated images are saved as .pkl files in the directory $DIR_SAVED_IMG.
  • For top-k and top-p sampling, the settings saved in the configuration file of the pretrained checkpoint are used. To use different top-(k, p) settings, pass --top-k and --top-p when running the sampling scripts.
  • Once generated images are saved, compute_metrics.py can be used to evaluate the images again as follows:
python compute_metrics.py fake_path=$DIR_SAVED_IMG ref_dataset=$DATASET_NAME
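
As background for the --top-k and --top-p options mentioned above, the following is a minimal sketch of top-k followed by top-p (nucleus) filtering over a categorical distribution. It is a generic illustration in plain Python and is not the implementation used by the sampling scripts; the function names are hypothetical, and conventions differ between implementations (here the token that crosses the top-p threshold is kept).

```python
import random

# Generic illustration of top-k / top-p filtering; not this repository's code.

def top_k_top_p_filter(probs, top_k, top_p):
    """Return renormalized probabilities after top-k, then top-p filtering."""
    # Rank token indices by probability, highest first, and keep the top k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        # Stop once the kept tokens cover at least top_p of the mass.
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample(probs, top_k, top_p, rng=random):
    """Draw one token index from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


probs = [0.5, 0.25, 0.125, 0.125]
print(top_k_top_p_filter(probs, top_k=3, top_p=0.7))
print(sample(probs, top_k=3, top_p=0.7))
```

In this toy case only the first two tokens survive the filter, so sampling always returns index 0 or 1. A larger top_k or top_p keeps more of the tail and increases diversity.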

Sampling speed benchmark

We provide code to measure the sampling speed of RQ-Transformer for different code shapes of RQ-VAE, such as 8x8x4 or 16x16x1, as shown in Figure 4 of the paper. To reproduce the figure, run the following commands on an NVIDIA A100 GPU:

# RQ-Transformer (1.4B) on 16x16x1 RQ-VAE (corresponds to VQ-GAN 1.4B model)
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=100
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=200
python -m measure_throughput f=16 d=1 c=16384 model=huge batch_size=500  # this will result in OOM.

# RQ-Transformer (1.4B) on 8x8x4 RQ-VAE
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=100
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=200
python -m measure_throughput f=32 d=4 c=16384 model=huge batch_size=500
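
To relate the f and d arguments above to the code shapes named in the text, the sketch below assumes that f is the spatial downsampling factor of a 256x256 image and d is the quantization depth. That reading is an assumption inferred from the comments on the commands (f=16, d=1 for 16x16x1 and f=32, d=4 for 8x8x4), and the helper names are hypothetical.

```python
# Hypothetical helpers; the meaning of f and d is inferred, not documented here.

def code_shape(image_size, f, d):
    """Code map shape (height, width, depth) for a square image."""
    if image_size % f != 0:
        raise ValueError("image_size must be divisible by f")
    side = image_size // f
    return (side, side, d)


def num_positions(shape):
    """Number of spatial positions in the code map."""
    height, width, _ = shape
    return height * width


def num_codes(shape):
    """Total number of discrete codes per image."""
    height, width, depth = shape
    return height * width * depth


for f, d in [(16, 1), (32, 4)]:
    shape = code_shape(256, f, d)
    print(shape, num_positions(shape), num_codes(shape))
```

Under this reading both settings use 256 codes per image, but the 8x8x4 shape has 64 spatial positions instead of 256.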

BibTeX

@article{lee2022autoregressive,
  title={Autoregressive Image Generation using Residual Quantization},
  author={Lee, Doyup and Kim, Chiheon and Kim, Saehoon and Cho, Minsu and Han, Wook-Shin},
  journal={arXiv preprint arXiv:2203.01941},
  year={2022}
}

Licenses

Contact

If you would like to collaborate with us or give us feedback, please contact us: [email protected]

Acknowledgement

Our transformer-related implementation is inspired by minGPT and minDALL-E. We thank the authors of VQGAN for making their code publicly available.

Limitations

Since RQ-Transformer is trained on publicly available datasets, some generated images may include socially unacceptable content depending on the text conditions. If this happens, please let us know the pair of "text condition" and "generated images".
