A collection of models for image-text generation in ACM MM 2021.

Overview

Bi-directional Image and Text Generation

UMT-BITG (image & text generator)

Unifying Multimodal Transformer for Bi-directional Image and Text Generation,
Yupan Huang, Bei Liu, Yutong Lu, in ACM MM 2021 (Industrial Track).

UMT-DBITG (diverse image & text generator)

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation,
Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu, in ACM MM 2021 (Video and Demo Track).

Posters and slides are available in the assets folder on OneDrive.

Data & Pre-trained Models

Download the preprocessed data and our pre-trained models by visiting OneDrive. We suggest following our data structure, which is consistent with the paths in config.py. You may need to modify the root_path in config.py (see the sketch after the list below). In addition, please follow the instructions below to prepare some other data:

  • Download the grid features provided by X-LXMERT into data/grid_features, or follow feature extraction to extract them yourself (a snippet for inspecting the downloaded files follows this list):
    wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_train_grid8.h5 -P data/grid_features
    wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_valid_grid8.h5 -P data/grid_features
    wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_test_grid8.h5 -P data/grid_features
    
  • For text-to-image evaluation on the MSCOCO dataset, the real images are needed to compute the FID metric. For UMT-DBITG, we use the MSCOCO Karpathy split, which is included in the OneDrive folder (images/imgs_karpathy). For UMT-BITG, please download the MSCOCO validation set to images/coco_val2014. A sketch of the FID computation follows this list.
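
As a minimal sketch of the root_path change mentioned above (every name other than root_path is an illustrative assumption, not the actual contents of config.py):

    # config.py (illustrative sketch; the real file's layout may differ)
    import os

    # Point root_path at wherever you placed the OneDrive download.
    root_path = "/path/to/UMT"
    # Hypothetical derived paths, mirroring the suggested data structure.
    data_path = os.path.join(root_path, "data")
    grid_feature_path = os.path.join(data_path, "grid_features")
    image_path = os.path.join(root_path, "images")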
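
To sanity-check the downloaded grid features, you can list the datasets stored in each HDF5 file. This snippet only inspects the file rather than assuming its layout, since the exact dataset names come from X-LXMERT's extraction pipeline:

    import h5py

    # Print every dataset in the grid-feature file with its shape and dtype.
    with h5py.File("data/grid_features/maskrcnn_valid_grid8.h5", "r") as f:
        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, obj.shape, obj.dtype)
        f.visititems(show)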
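
For the FID computation, here is a minimal sketch using the pytorch-fid package that our evaluation code is based on; outputs/generated_images is a placeholder for wherever your sampled images are written:

    from pytorch_fid.fid_score import calculate_fid_given_paths

    # Compare generated images against the real images
    # (images/imgs_karpathy for UMT-DBITG, images/coco_val2014 for UMT-BITG).
    fid = calculate_fid_given_paths(
        ["outputs/generated_images", "images/imgs_karpathy"],
        batch_size=50,
        device="cuda",  # or "cpu"
        dims=2048,      # InceptionV3 pool3 features, the standard FID setting
    )
    print(f"FID: {fid:.2f}")

The same result can be obtained from the command line with python -m pytorch_fid outputs/generated_images images/imgs_karpathy.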

Citation

If you find our paper or code useful, please cite us:

@inproceedings{huang2021unifying,
  author    = {Yupan Huang and Bei Liu and Yutong Lu},
  title     = {Unifying Multimodal Transformer for Bi-directional Image and Text Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

@inproceedings{huang2021diverse,
  author    = {Yupan Huang and Bei Liu and Jianlong Fu and Yutong Lu},
  title     = {A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

Acknowledgement

Our code is based on LaBERT and X-LXMERT. Our evaluation code is from pytorch-fid and inception_score. We sincerely thank them for their contributions!

Feel free to open an issue or email me for help with using this code. Any feedback is welcome!
