code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Last update: Jan 02, 2023

Overview

Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

News

(03/16/2022) upload retrieval checkpoints finetuned on COCO and Flickr

This is the official PyTorch implementation of TCL

Requirements:

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
pip install transformers==4.8.1
pip install timm==0.4.9
conda install ruamel_yaml
pip install opencv-python
pip install --upgrade Pillow
pip install einops

Pre-training Datasets:

MSCOCO (2014)
Visual Genome (VG)
- Download images of part1 and part2 and combine them together
Conceptual Captions (CC3M)
- Download Train_GCC-training.tsv and Validation_GCC-1.1.0-Validation.tsv from kaggle
- Then use img2dataset to download images from downloaed tsv files
- More details
SBU Captions
- Download url from subcaptions
- Then use img2dataset to download images from url
CC12M
- Download cc12m.tsv
- Then use img2dataset to download images from the downloaed tsv file

Downstream-task Datasets:

Flickr30k
VQA v2
NLVR2
- recommend to use direct-image-download

Json Files from Pre-training and Downstream Tasks:

refer to Download in ALBEF
you need to change the image path in json files according to your downloaded images

Pre-trained checkpoint:

Pre-training:

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain.py \
--config ./configs/Pretrain.yaml \
--output_dir output/pretrain

Downstream Tasks:

Image-Text Retrieval

# zero-shot coco 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco_zeroshot \
--checkpoint output/pretrain/checkpoint_29.pth \
--evaluate

# fine-tune flickr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune coco
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_coco.yaml \
--output_dir output/pretrain_e30_Retrieval_coco \
--checkpoint output/pretrain/checkpoint_29.pth

# zero-shot flickr 
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint output/pretrain_e30_Retrieval_coco/checkpoint_best.pth \
--evaluate

VQA

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VQA.py \
--config ./configs/VQA.yaml \
--output_dir output/pretrain_e30_vqa \
--checkpoint output/pretrain/checkpoint_29.pth

how to evaluate and interpret the results(salesforce/ALBEF#19)

Visual Entailment

python -m torch.distributed.launch --nproc_per_node=8 \
--use_env VE.py \
--config ./configs/VE.yaml \
--output_dir output/pretrain_e30_VE \
--checkpoint output/pretrain/checkpoint_29.pth

how to evaluate and interpret the results(salesforce/ALBEF#19)

NLVR2

# pre-train nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env Pretrain_nlvr.py \
--config ./configs/NLVR_pretrain.yaml \
--output_dir output/pretrain_e30_NLVR_pretrain \
--checkpoint output/pretrain/checkpoint_29.pth

# fine-tune nlvr
python -m torch.distributed.launch --nproc_per_node=8 \
--use_env NLVR.py \
--config ./configs/NLVR.yaml \
--output_dir output/pretrain_e30_NLVR \
--checkpoint output/pretrain_e30_NLVR_pretrain/checkpoint_00.pth

how to evaluate and interpret the results(salesforce/ALBEF#19)

Citation:

@article{yang2022vision,
  title={Vision-Language Pre-Training with Triple Contrastive Learning},
  author={Yang, Jinyu and Duan, Jiali and Tran, Son and Xu, Yi and Chanda, Sampath and Chen, Liqun and Zeng, Belinda and Chilimbi, Trishul and Huang, Junzhou},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  year={2022}
}

Our code is largely borrowed from ALBEF

code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Related tags

Overview

Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

News

Requirements:

Pre-training Datasets:

Downstream-task Datasets:

Json Files from Pre-training and Downstream Tasks:

Pre-trained checkpoint:

Pre-training:

Downstream Tasks:

Image-Text Retrieval

VQA

Visual Entailment

NLVR2

Citation:

Owner

Code for DeepCurrents: Learning Implicit Representations of Shapes with Boundaries

Categorizing comments on YouTube into different categories.

A faster pytorch implementation of faster r-cnn

Recreate CenternetV2 based on MMDET.

[ICML 2020] DrRepair: Learning to Repair Programs from Error Messages

City Surfaces: City-scale Semantic Segmentation of Sidewalk Surfaces

Code for the paper "Benchmarking and Analyzing Point Cloud Classification under Corruptions"

Snscrape-jsonl-urls-extractor - Extracts urls from jsonl produced by snscrape

NeuroFind - A solution to the to the Task given by the Oberseminar of Messtechnik Institute of TU Dresden in 2021

Create animations for the optimization trajectory of neural nets

Neural Point-Based Graphics

An Inverse Kinematics library aiming performance and modularity

Official repo of the paper "Surface Form Competition: Why the Highest Probability Answer Isn't Always Right"

Python package for downloading ECMWF reanalysis data and converting it into a time series format.

Mercury: easily convert Python notebook to web app and share with others

Image Data Augmentation in Keras

CTF Challenge for CSAW Finals 2021

Google AI Open Images - Object Detection Track: Open Solution

Bayesian Meta-Learning Through Variational Gaussian Processes

Sum-Product Probabilistic Language