ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Last update: Dec 23, 2022

Overview

ROSITA

News & Updates

(24/08/2021)

Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model.

(15/08/2021)

Release the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run the fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains the source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. ROSITA encodes the cROSs- and InTrA-modal prior knowledge in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. For a fair comparison, all methods use the base-size Transformer model. The trained checkpoints to reproduce these results are provided in finetune.md.

Tasks	VQA	REC			ITR			
Datasets	VQAv2 dev | std	RefCOCO val | testA | testB	RefCOCO+ val | testA | testB	RefCOCOg val | test	IR-COCO R@1 | R@5 | R@10	TR-COCO R@1 | R@5 | R@10	IR-Flickr R@1 | R@5 | R@10	TR-Flickr R@1 | R@5 | R@10
ROSITA	73.91 | 73.97	84.79 | 87.99 | 78.28	76.06 | 82.01 | 67.40	78.23 | 78.25	54.40 | 80.92 | 88.60	71.26 | 91.62 | 95.58	74.08 | 92.44 | 96.08	88.90 | 98.10 | 99.30
SoTA-base	73.59 | 73.67	81.56 | 87.40 | 74.48	76.05 | 81.65 | 65.70	75.90 | 75.93	54.00 | 80.80 | 88.50	70.00 | 91.10 | 95.50	74.74 | 92.86 | 95.82	86.60 | 97.90 | 99.20
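To read the comparison at a glance, the per-column margin between the two rows can be computed directly. The following is a minimal sketch: the scores are copied from the table above, while the dictionary layout and the `margins` helper are illustrative names that are not part of the ROSITA codebase.

```python
# Scores copied from the performance table above. Each list follows the
# column order of the corresponding table cell (e.g. val | testA | testB).
ROSITA = {
    "VQAv2": [73.91, 73.97],
    "RefCOCO": [84.79, 87.99, 78.28],
    "RefCOCO+": [76.06, 82.01, 67.40],
    "RefCOCOg": [78.23, 78.25],
    "IR-COCO": [54.40, 80.92, 88.60],
    "TR-COCO": [71.26, 91.62, 95.58],
    "IR-Flickr": [74.08, 92.44, 96.08],
    "TR-Flickr": [88.90, 98.10, 99.30],
}
SOTA_BASE = {
    "VQAv2": [73.59, 73.67],
    "RefCOCO": [81.56, 87.40, 74.48],
    "RefCOCO+": [76.05, 81.65, 65.70],
    "RefCOCOg": [75.90, 75.93],
    "IR-COCO": [54.00, 80.80, 88.50],
    "TR-COCO": [70.00, 91.10, 95.50],
    "IR-Flickr": [74.74, 92.86, 95.82],
    "TR-Flickr": [86.60, 97.90, 99.20],
}


def margins(ours, baseline):
    """Return per-dataset differences (ours - baseline), rounded to 2 decimals."""
    return {
        name: [round(a - b, 2) for a, b in zip(ours[name], baseline[name])]
        for name in ours
    }


if __name__ == "__main__":
    for name, diffs in margins(ROSITA, SOTA_BASE).items():
        print(f"{name:10s}", " | ".join(f"{d:+.2f}" for d in diffs))
```

With these numbers, ROSITA is ahead in every column except the first two IR-Flickr entries, where SoTA-base is higher by 0.66 and 0.42 points.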

Installation

Software and Hardware Requirements

We recommend a workstation with 4 GPUs (>= 24GB, e.g., RTX 3090 or V100), 120GB of memory, and 50GB of free disk space. We strongly recommend using an SSD to guarantee high-speed I/O. You should also first install the following packages:

Python >= 3.6
PyTorch >= 1.4 with CUDA >= 10.2
torchvision >= 0.5.0
Cython
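Before building, it can help to confirm that the environment meets the versions listed above. The following is a minimal sketch, not a script shipped with this repository: `parse_version` and `check_environment` are hypothetical helper names, and the check relies only on each package's `__version__` attribute and on `torch.version.cuda`.

```python
import importlib
import sys

# Minimum versions taken from the requirement list above.
MIN_VERSIONS = {"torch": (1, 4), "torchvision": (0, 5)}


def parse_version(text):
    """Turn a string such as '1.10.2+cu113' into (1, 10, 2)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment():
    """Return a dict mapping each requirement to True (met) or False."""
    report = {"python": sys.version_info >= (3, 6)}
    for name, minimum in MIN_VERSIONS.items():
        try:
            module = importlib.import_module(name)
        except Exception:  # any import failure counts as "not installed"
            report[name] = False
            continue
        report[name] = parse_version(module.__version__) >= minimum
    try:
        importlib.import_module("Cython")
        report["Cython"] = True
    except Exception:
        report["Cython"] = False
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    torch = sys.modules.get("torch")
    cuda = getattr(getattr(torch, "version", None), "cuda", None)
    report["cuda"] = cuda is not None and parse_version(cuda) >= (10, 2)
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement:12s} {'OK' if ok else 'MISSING or too old'}")
```

Note that this only inspects installed package versions; it does not verify that the GPUs themselves are visible or have enough memory.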

# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .
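After the copy step, a quick way to confirm that the compiled extension landed in the right place is to look for it by the same filename pattern used in the `cp` command. This is a small sketch under that assumption; `find_bbox_extension` is a hypothetical helper, not part of the repository.

```python
import glob
import os


def find_bbox_extension(rec_dir):
    """Return the compiled bbox extension files found directly in rec_dir."""
    pattern = os.path.join(rec_dir, "bbox.cpython*.so")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Path relative to the directory where `git clone` was run.
    found = find_bbox_extension(os.path.join("rosita", "rosita", "utils", "rec"))
    print(found if found else "bbox extension not found; rerun the build step")
```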

Dataset Setup

To download the required datasets to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for details on ROSITA pretraining. We currently provide only the pretrained model for finetuning on downstream tasks. The code to run pretraining will be released later.

Finetuning

Please check finetune.md for details on finetuning on downstream tasks. It covers the provided finetuning scripts as well as the trained models, which can be evaluated directly to reproduce the results.

Demo

We provide the Jupyter notebook scripts for reproducing the visualization results shown in our paper.

Acknowledgment

We thank the authors of well-known open-source projects such as LXMERT, UNITER, OSCAR, and Huggingface, which helped us a lot when writing our code.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please contact them if you find any issues.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}



Owner

Vision and Language Group@ MIL
