Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Overview

OFA

[Paper] [Blog] [Colab] [Spaces]

Overview

OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) to a simple sequence-to-sequence learning framework. For more information, please refer to our paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.

News

  • 2022.2.13: Released the demo of image captioning. Have fun! Hugging Face Spaces
  • 2022.2.11: Released the Colab notebook for image captioning . Enjoy!
  • 2022.2.11: Released the pretrained checkpoint of OFA-Large and the complete (2-staged) finetuning code for image captioning.
  • 2022.2.10: Released the inference code & finetuned checkpoint for image captioning, which can reproduce the results on COCO Karparthy test split (149.6 CIDEr)

TODO

  • To release finetuning and inference codes for multimodal downstream tasks soon, including image captioning, VQA, text-to-image generation, SNLI-VE, Referring expression, comprehension, etc.
  • To release codes for pretraining soon.

Approach

approach

Requirements

  • python 3.7.4
  • pytorch 1.8.1
  • torchvision 0.9.1
  • JAVA 1.8 (for COCO evaluation)

Installation

git clone https://github.com/OFA-Sys/OFA
pip install -r requirements.txt

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Pretraining

To release soon:)

Finetuning & Inference

Below we provide methods for fintuning and inference on different downstream tasks.

Caption

  1. Download data and files and put them in the correct directory
  2. Train
cd run_scripts/caption
nohup sh train_caption_stage1.sh &  # stage1, train with cross-entropy loss
nohup sh train_caption_stage2.sh &  # stage2, load the best ckpt of stage1 and train with CIDEr optimization 
  1. Inference
cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate

Gallery

Below we provide examples of OFA in text-to-image generation and open-ended VQA. Also, we demonstrate its performance in unseen task (Grounded QA) as well as unseen domain (Visual Grounding on images from unseen domains).

Text-to-Image Generation (normal query)

t2i_normal

Text-to-Image Generation (counterfactual query)

t2i_counterfactual

Open-Ended VQA

open_vqa

Grounded QA (unseen task)

grounded_qa

Viusal Grounding (unseen domain)

vg

Citation

Please cite our paper if you find it helpful :)

@article{wang2022OFA,
  title={Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework},
  author={Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia},
  journal={arXiv e-prints},
  pages={arXiv--2202},
  year={2022}
}

Related Codebase

License

Apache-2.0

Owner
OFA Sys
OFA Sys
DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort

DatasetGAN This is the official code and data release for: DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort Yuxuan Zhang*, Huan Li

302 Jan 05, 2023
Training, generation, and analysis code for Learning Particle Physics by Example: Location-Aware Generative Adversarial Networks for Physics

Location-Aware Generative Adversarial Networks (LAGAN) for Physics Synthesis This repository contains all the code used in L. de Oliveira (@lukedeo),

Deep Learning for HEP 57 Oct 22, 2022
Tgbox-bench - Simple TGBOX upload speed benchmark

TGBOX Benchmark This script will benchmark upload speed to TGBOX storage. Build

Non 1 Jan 09, 2022
Educational API for 3D Vision using pose to control carton.

Educational API for 3D Vision using pose to control carton.

41 Jul 10, 2022
[CVPR 2022] CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation

CoTTA Code for our CVPR 2022 paper Continual Test-Time Domain Adaptation Prerequisite Please create and activate the following conda envrionment. To r

Qin Wang 87 Jan 08, 2023
Code for "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR 2021

LoFTR: Detector-Free Local Feature Matching with Transformers Project Page | Paper LoFTR: Detector-Free Local Feature Matching with Transformers Jiami

ZJU3DV 1.4k Jan 04, 2023
Official pytorch implementation of Active Learning for deep object detection via probabilistic modeling (ICCV 2021)

Active Learning for Deep Object Detection via Probabilistic Modeling This repository is the official PyTorch implementation of Active Learning for Dee

NVIDIA Research Projects 130 Jan 06, 2023
An implementation of Geoffrey Hinton's paper "How to represent part-whole hierarchies in a neural network" in Pytorch.

GLOM An implementation of Geoffrey Hinton's paper "How to represent part-whole hierarchies in a neural network" for MNIST Dataset. To understand this

50 Oct 19, 2022
Unified Instance and Knowledge Alignment Pretraining for Aspect-based Sentiment Analysis

Unified Instance and Knowledge Alignment Pretraining for Aspect-based Sentiment Analysis Requirements python 3.7 pytorch-gpu 1.7 numpy 1.19.4 pytorch_

12 Oct 29, 2022
GPU Accelerated Non-rigid ICP for surface registration

GPU Accelerated Non-rigid ICP for surface registration Introduction Preivous Non-rigid ICP algorithm is usually implemented on CPU, and needs to solve

Haozhe Wu 144 Jan 04, 2023
SuMa++: Efficient LiDAR-based Semantic SLAM (Chen et al IROS 2019)

SuMa++: Efficient LiDAR-based Semantic SLAM This repository contains the implementation of SuMa++, which generates semantic maps only using three-dime

Photogrammetry & Robotics Bonn 701 Dec 30, 2022
Double pendulum simulator using a symplectic Euler's method and Hamiltonian mechanics

Symplectic Double Pendulum Simulator Double pendulum simulator using a symplectic Euler's method. The program calculates the momentum and position of

Scott Marino 1 Jan 12, 2022
Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models Jonathan Ho, Ajay Jain, Pieter Abbeel Paper: https://arxiv.org/abs/2006.11239 Website: https://hojonathanho.g

Jonathan Ho 1.5k Jan 08, 2023
A Haskell kernel for IPython.

IHaskell You can now try IHaskell directly in your browser at CoCalc or mybinder.org. Alternatively, watch a talk and demo showing off IHaskell featur

Andrew Gibiansky 2.4k Dec 29, 2022
arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

Andrej 671 Dec 31, 2022
Code for the ECCV2020 paper "A Differentiable Recurrent Surface for Asynchronous Event-Based Data"

A Differentiable Recurrent Surface for Asynchronous Event-Based Data Code for the ECCV2020 paper "A Differentiable Recurrent Surface for Asynchronous

Marco Cannici 21 Oct 05, 2022
Complete system for facial identity system. Include one-shot model, database operation, features visualization, monitoring

Complete system for facial identity system. Include one-shot model, database operation, features visualization, monitoring

2 Dec 28, 2021
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP The implementation of paper CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. CLIP2

168 Dec 29, 2022
Official NumPy Implementation of Deep Networks from the Principle of Rate Reduction (2021)

Deep Networks from the Principle of Rate Reduction This repository is the official NumPy implementation of the paper Deep Networks from the Principle

Ryan Chan 49 Dec 16, 2022