
VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL and contains the instructions necessary to reproduce the results presented in the paper. We present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (code), use an improved approach to pre-train the VL model, and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, establishing new state-of-the-art results on seven public benchmarks.

Performance

| Model | t2i R@1 | t2i R@5 | i2t R@1 | i2t R@5 | IC B@4 | IC M | IC C | IC S | NoCaps C | NoCaps S | VQA test-std | NLVR2 test-P | GQA test-std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
| SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | 61.62 |
| SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
| gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO. R@K: Recall@K; B@4: BLEU-4; M: METEOR; C: CIDEr; S: SPICE.

Leaderboard results

VinVL has achieved the top position on several VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from the bottom-up and top-down model (code).

We observe uniform improvements on seven VL tasks when the visual features from the bottom-up and top-down model are replaced with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for the remaining tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are BERT-Base size. As analyzed in Section 5.2 of the VinVL paper, the new visual features contribute 95% of the improvement.

| Model | t2i R@1 | t2i R@5 | i2t R@1 | i2t R@5 | IC B@4 | IC M | IC C | IC S | NoCaps C | NoCaps S | VQA test-std | NLVR2 test-P | GQA test-std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bottom-up and top-down | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.16 | 78.07 | 61.62 |
| VinVL (ours) | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 75.95 | 83.08 | 64.65 |
| gain | 4.1 | 2.4 | 4.6 | 1.5 | 0.4 | 1.2 | 3.0 | 2.3 | 5.9 | 0.7 | 2.79 | 4.71 | 3.03 |

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection model can be downloaded here. With code from our Scene Graph Benchmark repo (to be released soon), one can extract features with the following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output feature will be encoded as base64.
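For illustration, a base64-encoded feature string of this kind can be decoded back into a NumPy array roughly as follows. This is a minimal sketch, not part of the official tooling, and the feature dimension of 2054 (a 2048-d region feature plus 6 box-geometry values) is an assumption based on the common VinVL/OSCAR configuration; adjust it to your extraction settings.

```python
import base64
import numpy as np

def decode_region_features(b64_string, feat_dim=2054):
    """Decode a base64-encoded float32 blob into a (num_boxes, feat_dim) array.

    feat_dim=2054 is an assumption (2048-d region feature + 6 box-geometry
    values); change it to match the actual extraction configuration.
    """
    raw = base64.b64decode(b64_string)
    feats = np.frombuffer(raw, dtype=np.float32)
    return feats.reshape(-1, feat_dim)
```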

Find more pretrained models in DOWNLOAD.

Pre-extracted Image Features

For ease of use, we make pre-extracted image features and object detection predictions available for all pre-training datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.
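As a rough sketch of how such files can be consumed (the exact column layout is an assumption, so check the files referenced in DOWNLOAD), a tab-separated feature file can be streamed line by line, pairing each image id with its base64 feature blob:

```python
def iter_feature_rows(tsv_path):
    """Yield (image_id, base64_blob) pairs from a tab-separated feature file.

    Assumes the image id is in the first column and the base64-encoded
    features are in the last column; adjust the indices to the actual layout.
    """
    with open(tsv_path) as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            yield cols[0], cols[-1]
```

Each blob can then be turned into an array with a decoder like the one sketched above.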

Pretrained Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing our papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}