Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch

Last update: Dec 30, 2022

Overview

CoCa - Pytorch

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch. They were able to elegantly fit in contrastive learning to a conventional encoder / decoder (image to text) transformer, achieving SOTA 91.0% top-1 accuracy on ImageNet with a finetuned encoder.

This repository also chooses to adopt the specific transformer architecture from PaLM, for both the unimodal and multimodal transformers as well as the cross attention blocks (parallel SwiGLU feedforwards)

Yannic Kilcher presentation

Install

$ pip install coca-pytorch

Usage

First install the vit-pytorch for the image encoder, which needs to be pretrained

$ pip install vit-pytorch

Then

import torch

# import vision transformer

from vit_pytorch import ViT
from vit_pytorch.extractor import Extractor

vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048
)

# do your vision transformer training

vit = Extractor(vit, return_embeddings_only = True)

# extractor will enable it so the vision transformer returns its embeddings

# import CoCa and instantiate it

from coca_pytorch.coca_pytorch import CoCa

coca = CoCa(
    dim = 512,                     # model dimension
    img_encoder = vit,             # vision transformer - image encoder, returning image embeddings as (batch, seq, dim)
    image_dim = 1024,              # image embedding dimension, if not the same as model dimensions
    num_tokens = 20000,            # number of text tokens
    unimodal_depth = 6,            # depth of the unimodal transformer
    multimodal_depth = 6,          # depth of the multimodal transformer
    dim_head = 64,                 # dimension per attention head
    heads = 8,                     # number of attention heads
    caption_loss_weight = 1.,      # weight on the autoregressive caption loss
    contrastive_loss_weight = 1.,  # weight on the contrastive loss between image and text CLS embeddings
).cuda()

# mock text and images

text = torch.randint(0, 20000, (4, 512)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# train by giving CoCa your text and images with `return_loss = True`

loss = coca(
    text = text,
    images = images,
    return_loss = True  # set this to True to get the full caption + contrastive loss
)

loss.backward()

# do the above for as much text and images...
# then you can get the caption logits as so

logits = coca(
    text = text,
    images = images
) # (4, 512, 20000)

# and the CLIP-like text and image embeddings as

text_embeds, image_embeds = coca(
    text = text,
    images = images,
    return_embeddings = True
) # (4, 512), (4, 512)

Citations

@inproceedings{Yu2022CoCaCC,
  title   = {CoCa: Contrastive Captioners are Image-Text Foundation Models},
  author  = {Jiahui Yu and Zirui Wang and Vijay Vasudevan and Legg Yeung and Mojtaba Seyedhosseini and Yonghui Wu},
  year    = {2022}
}

@inproceedings{Chowdhery2022PaLMSL,
    title   = {PaLM: Scaling Language Modeling with Pathways},
    author  = {Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Oliveira Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},
    year    = {2022}
}

Comments

Contrastive loss should be applied to L2-normed embeddings instead of layer normed?

Hi @lucidrains, thank you for the implementation. Just wanted to confirm this with you, based on your code we're normalizing the img embedding and text embedding respectively using a learnable Layer Norm transformation before applying the contrastive loss. But based on my understanding, for contrastive loss we typically maximize the relative cosine similarity so the embeddings should be L2-normed instead of layernormed? Thank you.

opened by fedshyvana 2
Extractor in vit_pytorch will detach the tensor.

Thanks for your code! I think I may find a little bug. The cloned tensor in Extractor will be detached (https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/extractor.py#L39). So gradient may not propagate back to image encoder.

opened by techkang 2
Maybe don't need this rearrange

I think the logits before this line in shape (bsz, length, num_tokens) -> so I don't think here need one more rearrange https://github.com/lucidrains/CoCa-pytorch/blob/25de0b04326d8dc4c6f969e90b4466fc4894835e/coca_pytorch/coca_pytorch.py#L461

opened by CiaoHe 2

How to train the model using my own dataset?

Can someone tell me how to train the model using my own dataset? is it like below？But I have many images and texts...

# train by giving CoCa your text and images with `return_loss = True`
loss = coca(
    text = text,
    images = images,
    return_loss = True  # set this to True to get the full caption + contrastive loss
)

opened by keepcodeandsmile 1

why train VIT visual encoder first?

Hi, thanks for sharing this repo. In the CoCA paper, both the visual encoder and text encoder are end-to end trained. But in this repo, the vit is first pretrained then fixed to train CoCa.

opened by Flowerfan 1

attn_mask

cls_mask = rearrange(text!=self.pad_id, 'b j -> b 1 j')  
attn_mask = F.pad(cls_mask, (0, 1, seq, 0), value=True)

attn_mask = rearrange(attn_mask, 'b i j -> b 1 i j')  
sim = sim.masked_fill(~attn_mask, -torch.finfo(sim.dtype).max)

Hello, I am confused of the implement of "attn_mask". I think this padding function only can mask the last row of "sim". Could you please explain it? Perhaps it's a very fool question. Thank you so much.

opened by pldlgb 0

Reproducing the results in the paper

Thanks for this repo. Curious, is this an independent implementation of the CoCa paper? If yes, did you reproduce any result in the paper to ensure correctness of implementation?

opened by GKIBMNY 0
Generating the caption of a given image

Hello,

Thank you for having implemented this model. Have you already implemented some code to generate the caption of a given image? If not, do you have an idea about how you would do it in this particular architecture?

Thank you in advance.

opened by claudiogreco 0

Releases(0.0.7)

0.0.7(Dec 5, 2022)

null
Source code(tar.gz)
Source code(zip)
v0.0.6(Jun 8, 2022)

null
Source code(tar.gz)
Source code(zip)
0.0.5(May 6, 2022)

Source code(tar.gz)
Source code(zip)
0.0.4(May 6, 2022)

Source code(tar.gz)
Source code(zip)
0.0.2(May 5, 2022)

Source code(tar.gz)
Source code(zip)
0.0.1(May 5, 2022)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need

GitHub Repository

A framework to train language models to learn invariant representations.

Invariant Language Modeling Implementation of the training for invariant language models. Motivation Modern pretrained language models are critical co

6 Nov 16, 2022

PyTorch Kafka Dataset: A definition of a dataset to get training data from Kafka.

7 Aug 01, 2022

Model of an AI powered sign language interpreter.

TEXT AND SPEECH TO SIGN LANGUAGE. A web application which takes in text or live audio speech recording as input, converts and displays the relevant Si

4 Mar 30, 2022

PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

1.7k Dec 28, 2022

Reproduce results and replicate training fo T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)

T-Zero This repository serves primarily as codebase and instructions for training, evaluation and inference of T0. T0 is the model developed in Multit

253 Dec 27, 2022

Numenta Platform for Intelligent Computing is an implementation of Hierarchical Temporal Memory (HTM), a theory of intelligence based strictly on the neuroscience of the neocortex.

NuPIC Numenta Platform for Intelligent Computing The Numenta Platform for Intelligent Computing (NuPIC) is a machine intelligence platform that implem

6.3k Dec 30, 2022

This repo is to present various code demos on how to use our Graph4NLP library.

Deep Learning on Graphs for Natural Language Processing Demo The repository contains code examples for DLG4NLP tutorials at NAACL 2021, SIGIR 2021, KD

143 Dec 23, 2022

Godot RL Agents is a fully Open Source packages that allows video game creators

Godot RL Agents The Godot RL Agents is a fully Open Source packages that allows video game creators, AI researchers and hobbiest the opportunity to le

326 Dec 30, 2022

Embodied Intelligence via Learning and Evolution

Embodied Intelligence via Learning and Evolution This is the code for the paper Embodied Intelligence via Learning and Evolution Agrim Gupta, Silvio S

111 Dec 13, 2022

MetaDrive: Composing Diverse Scenarios for Generalizable Reinforcement Learning

MetaDrive: Composing Diverse Driving Scenarios for Generalizable RL [ Documentation | Demo Video ] MetaDrive is a driving simulator with the following

276 Jan 04, 2023

Face Recognition and Emotion Detector Device

Face Recognition and Emotion Detector Device Orange PI 1 Python 3.10.0 + Django 3.2.9 Project's file explanation Django manage.py Django commands hand

2 Dec 21, 2021

An efficient PyTorch library for Global Wheat Detection using YOLOv5. The project is based on this Kaggle competition Global Wheat Detection (2021).

Global-Wheat-Detection An efficient PyTorch library for Global Wheat Detection using YOLOv5. The project is based on this Kaggle competition Global Wh

11 Sep 25, 2022

Code for "R-GCN: The R Could Stand for Random"

RR-GCN: Random Relational Graph Convolutional Networks PyTorch Geometric code for the paper "R-GCN: The R Could Stand for Random" RR-GCN is an extensi

31 Sep 07, 2022

A Vision Transformer approach that uses concatenated query and reference images to learn the relationship between query and reference images directly.

24 Dec 13, 2022

Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch

Related tags

Overview

CoCa - Pytorch

Install

Usage

Citations

Comments

Releases(0.0.7)

0.0.7(Dec 5, 2022)

v0.0.6(Jun 8, 2022)

0.0.5(May 6, 2022)

0.0.4(May 6, 2022)

0.0.2(May 5, 2022)

0.0.1(May 5, 2022)