Bi-directional Image and Text Generation

UMT-BITG (image & text generator)

Unifying Multimodal Transformer for Bi-directional Image and Text Generation,
Yupan Huang, Bei Liu, Yutong Lu, in ACM MM 2021 (Industrial Track).

UMT-DBITG (diverse image & text generator)

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation,
Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu, in ACM MM 2021 (Video and Demo Track).

Posters and slides are available in the assets folder on OneDrive.

Data & Pre-trained Models

Download the preprocessed data and our pre-trained models from OneDrive. We suggest keeping our directory structure, which is consistent with the paths in config.py. You may need to modify root_path in config.py. In addition, please follow the instructions below to prepare the remaining data:

Download the grid features provided by X-LXMERT into data/grid_features, or follow X-LXMERT's feature extraction instructions to extract these features yourself.

wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_train_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_valid_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_test_grid8.h5 -P data/grid_features

For text-to-image evaluation on the MSCOCO dataset, the real images are needed to calculate the FID metric. For UMT-DBITG, we use the MSCOCO Karpathy split, which is included in the OneDrive folder (images/imgs_karpathy). For UMT-BITG, please download the MSCOCO validation set into images/coco_val2014.
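Before training or evaluation, it can help to confirm that the files and folders listed above are in place. The following is a minimal sketch, not part of this repository: the helper name find_missing and the EXPECTED_PATHS list are hypothetical, and the sketch assumes that the relative paths mentioned above sit directly under the directory you set as root_path in config.py.

```python
from pathlib import Path

# Relative paths taken from the data preparation instructions above.
# Assumption: they are located directly under the root_path set in config.py.
EXPECTED_PATHS = [
    "data/grid_features/maskrcnn_train_grid8.h5",
    "data/grid_features/maskrcnn_valid_grid8.h5",
    "data/grid_features/maskrcnn_test_grid8.h5",
    "images/imgs_karpathy",   # real images for UMT-DBITG evaluation
    "images/coco_val2014",    # real images for UMT-BITG evaluation
]


def find_missing(root_path, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root_path.

    Hypothetical helper for illustration; it only checks for existence and
    does not validate file contents.
    """
    root = Path(root_path)
    return [rel for rel in expected if not (root / rel).exists()]
```

For example, calling find_missing("/path/to/your/root") returns an empty list when everything is in place. If you only evaluate one of the two models, pass a shorter list as the expected argument so that the image folder you do not need is not reported.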

Citation

If you find our paper or code useful, please cite us:

@inproceedings{huang2021unifying,
  author    = {Yupan Huang and Bei Liu and Yutong Lu},
  title     = {Unifying Multimodal Transformer for Bi-directional Image and Text Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

@inproceedings{huang2021diverse,
  author    = {Yupan Huang and Bei Liu and Jianlong Fu and Yutong Lu},
  title     = {A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

Acknowledgement

Our code is based on LaBERT and X-LXMERT. Our evaluation code is from pytorch-fid and inception_score. We sincerely thank them for their contributions!

Feel free to open an issue or email me if you need help using this code. Any feedback is welcome!

A collection of models for image<->text generation in ACM MM 2021.


Owner

Multimedia Research
