An easy-to-use and efficient codebase for extracting OpenAI CLIP (global/grid) features from images and text.

Last update: Jan 06, 2023

Overview

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text

This repo aims to provide easy-to-use and efficient code for extracting image and text features with the official OpenAI CLIP models. It is also optimized for multi-process feature extraction on GPUs.

The official OpenAI CLIP repo only supports extracting global visual features, while the local grid features from CLIP visual models may contain more detailed semantic information that can benefit multiple vision-and-language downstream tasks [1][2]. As an alternative, this repo wraps lightly modified CLIP code so that it can extract both global visual features and local grid visual features from different CLIP visual models. In addition, the repo follows an object-oriented design, so users can add their own visual_extractor classes to customize the input and output grid resolution.

To verify the semantic meaning of the extracted visual grid features, we also used the grid features extracted from MSCOCO images with different official CLIP models for the standard image captioning task. By simply replacing BUTD features with the extracted CLIP grid features, we obtained comparable or superior results with a transformer baseline, without extensive hyperparameter tuning. Surprisingly, with the ViT-B/32 CLIP model we got a CIDEr score of 116.9 in the teacher-forcing setting and 129.6 in the reinforcement learning setting. This conflicts with the experimental results in the CLIP-ViL paper [1], whose authors observed that CLIP-ViT-B with grid features suffers a large performance degradation compared with other models (58.0 CIDEr score in the CLIP-ViT-B_Transformer setting on COCO Captioning).

We provide supported CLIP models, results on MSCOCO image captioning, and other information below. We believe this repo can facilitate the usage of powerful CLIP models.

1. Supported CLIP Models

Currently this repo supports five visual extractor settings: three standard pipelines used in the official OpenAI CLIP repo and two additional customized pipelines that support a larger input resolution. You can refer to this file for more details about customizing your own visual backbones for different input and output resolutions. To improve training efficiency in the image captioning task, in some settings we apply AvgPool2d to the output feature map, which reduces the grid feature size without a large performance degradation. We will support more CLIP models in the future.

Setting	Visual Backbone	CLIP Model	Input Resolution	Output Resolution	Feature Map Downsample	Grid Feature Shape	Global Feature Shape
Standard	RN101	RN101	224 x 224	7 x 7	None	49 x 2048	1 x 512
Standard	ViT-B/32	ViT-B/32	224 x 224	7 x 7	None	49 x 768	1 x 512
Standard	ViT-B/16	ViT-B/16	224 x 224	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512
Customized	RN101_448	RN101	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 2048	1 x 512
Customized	ViT-B/32_448	ViT-B/32	448 x 448	14 x 14	AvgPool2d(kernel_size=(2,2), stride=2)	49 x 768	1 x 512
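
The shapes in the table follow from simple arithmetic, which the minimal sketch below reproduces. It is not code from this repo: the function name grid_feature_shape and the SETTINGS dictionary are hypothetical, and the total stride of 32 for RN101 is an assumption inferred from the 224 x 224 input and 7 x 7 output listed above (the ViT strides are the patch sizes in the model names).

```python
from typing import Dict, Optional, Tuple


def grid_feature_shape(
    input_res: int,
    stride: int,
    channels: int,
    pool_kernel: Optional[int] = None,
) -> Tuple[int, Tuple[int, int]]:
    """Return (output_side, (num_grids, channels)) for a square input image."""
    # The backbone shrinks each spatial side by its total stride / patch size.
    if input_res % stride != 0:
        raise ValueError("input resolution must be divisible by the stride")
    output_side = input_res // stride
    pooled_side = output_side
    # AvgPool2d(kernel_size=k, stride=k) shrinks each side by a further factor k.
    if pool_kernel is not None:
        if output_side % pool_kernel != 0:
            raise ValueError("output side must be divisible by the pool kernel")
        pooled_side = output_side // pool_kernel
    return output_side, (pooled_side * pooled_side, channels)


# (input resolution, assumed stride, channels, pool kernel) per table row.
SETTINGS: Dict[str, Tuple[int, int, int, Optional[int]]] = {
    "RN101": (224, 32, 2048, None),
    "ViT-B/32": (224, 32, 768, None),
    "ViT-B/16": (224, 16, 768, 2),
    "RN101_448": (448, 32, 2048, 2),
    "ViT-B/32_448": (448, 32, 768, 2),
}

for name, args in SETTINGS.items():
    side, shape = grid_feature_shape(*args)
    print(f"{name}: output {side} x {side}, grid features {shape[0]} x {shape[1]}")
```

Under these assumptions every setting ends up with 49 grid positions, which is why the grid feature shapes in the table differ only in the channel dimension.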

2. Results on MSCOCO Image Captioning (Karpathy's Splits)

We ran image captioning experiments on X-modaler with the extracted CLIP grid features. We obtained comparable or superior results with the transformer baseline using X-modaler's default hyperparameters for that baseline, except for SOLVER.BASE_LR=2e-4 in the ViT-B/16 and ViT-B/32_448 teacher-forcing settings. The performance of the transformer baseline with BUTD features is taken from the X-modaler paper.

2.1 Teacher-forcing

Name	BLEU@1	BLEU@2	BLEU@3	BLEU@4	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	76.4	60.3	46.5	35.8	28.2	56.7	116.6	21.3
RN101	77.3	61.3	47.7	36.9	28.7	57.5	120.6	21.8
ViT-B/32	76.4	60.3	46.5	35.6	28.1	56.7	116.9	21.2
ViT-B/16	78.0	62.1	48.2	37.2	28.8	57.6	122.3	22.1
RN101_448	78.1	62.3	48.4	37.5	29.0	58.0	122.9	22.2
ViT-B/32_448	75.8	59.6	45.9	35.1	27.8	56.3	114.2	21.0
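
To make the comparison with the BUTD baseline concrete, the short sketch below computes the CIDEr-D difference of each CLIP setting relative to BUTD. The scores are copied from the teacher-forcing table above; the helper name delta_vs_baseline is hypothetical and only for illustration.

```python
from typing import Dict

# CIDEr-D scores copied from the teacher-forcing table.
CIDER_D: Dict[str, float] = {
    "BUTD": 116.6,
    "RN101": 120.6,
    "ViT-B/32": 116.9,
    "ViT-B/16": 122.3,
    "RN101_448": 122.9,
    "ViT-B/32_448": 114.2,
}


def delta_vs_baseline(scores: Dict[str, float], baseline: str = "BUTD") -> Dict[str, float]:
    """Return each setting's score minus the baseline score, rounded to 0.1."""
    base = scores[baseline]
    return {
        name: round(score - base, 1)
        for name, score in scores.items()
        if name != baseline
    }


for name, delta in delta_vs_baseline(CIDER_D).items():
    print(f"{name}: {delta:+.1f} CIDEr-D vs. BUTD")
```

By this measure, four of the five settings score above BUTD in teacher-forcing training, while ViT-B/32_448 falls 2.4 points below it.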

2.2 Self-critical Reinforcement Learning

Name	BLEU@1	BLEU@2	BLEU@3	BLEU@4	METEOR	ROUGE-L	CIDEr-D	SPICE
BUTD	80.5	65.4	51.1	39.2	29.1	58.7	130.0	23.0
RN101	-	-	-	-	-	-	-	-
ViT-B/32	79.9	64.6	50.4	38.5	29.0	58.6	129.6	22.8
ViT-B/16	82.0	67.3	53.1	41.1	29.9	59.8	136.6	23.8
RN101_448	81.7	66.9	52.6	40.5	29.9	59.7	136.1	23.9
ViT-B/32_448	-	-	-	-	-	-	-	-

3. Get Started

Note: The extracted feature files are compatible with X-modaler, where you can conveniently set up your cross-modal analytics experiments.

3.1 Requirements

PyTorch ≥ 1.9 and a torchvision version that matches the PyTorch installation. Installing them together from pytorch.org ensures that they match.
timm ≥ 0.4.5

3.2 Examples

Use CLIP ViT-B/32 model to extract global textual features of MSCOCO sentences from dataset_coco.json in Karpathy's released annotations.

CUDA_VISIBLE_DEVICES=0 python3 clip_textual_feats.py \
    --anno dataset_coco.json \
    --output_dir ${TXT_OUTPUT_DIR} \
    --model_type_or_path 'ViT-B/32'

Use CLIP ViT-B/16 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'ViT-B/16' \
    --model_type_or_path 'ViT-B/16'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101' \
    --model_type_or_path 'RN101'

Use CLIP RN101 model to extract global and grid visual features of MSCOCO images with 448 x 448 resolution.

CUDA_VISIBLE_DEVICES=0 python3 clip_visual_feats.py \
    --image_list 'example/MSCOCO/image_list_2017.txt' \
    --image_dir ${IMG_DIR} \
    --output_dir ${IMG_OUTPUT_DIR} \
    --ve_name 'RN101_448' \
    --model_type_or_path 'RN101'
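
The visual-feature commands above differ only in --ve_name and --model_type_or_path, so they can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repo: build_visual_cmd and VE_TO_MODEL are illustrative names, the mapping is taken from the table of supported settings, and passing 'ViT-B/32' or 'ViT-B/32_448' as --ve_name is an assumption based on that table rather than on an example shown here.

```python
import os
from typing import Dict, List, Tuple

# Visual extractor name -> CLIP model, following the supported-settings table.
VE_TO_MODEL: Dict[str, str] = {
    "RN101": "RN101",
    "ViT-B/32": "ViT-B/32",
    "ViT-B/16": "ViT-B/16",
    "RN101_448": "RN101",
    "ViT-B/32_448": "ViT-B/32",
}


def build_visual_cmd(
    ve_name: str,
    image_list: str,
    image_dir: str,
    output_dir: str,
    gpu_id: int = 0,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment, argv) for one clip_visual_feats.py run."""
    if ve_name not in VE_TO_MODEL:
        raise ValueError(f"unknown visual extractor: {ve_name}")
    # Restrict the process to a single GPU, as in the shell examples.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    argv = [
        "python3", "clip_visual_feats.py",
        "--image_list", image_list,
        "--image_dir", image_dir,
        "--output_dir", output_dir,
        "--ve_name", ve_name,
        "--model_type_or_path", VE_TO_MODEL[ve_name],
    ]
    return env, argv


env, argv = build_visual_cmd(
    "RN101_448", "example/MSCOCO/image_list_2017.txt", "images", "feats"
)
print(" ".join(argv))
```

Passing the result to subprocess.run(argv, env=env, check=True) from the repo directory would launch the extraction, assuming the script and the image paths exist.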

3.3 Speeding up feature extraction with Multiple GPUs

You can run the same script with the same input list (i.e. --image_list or --anno) on another GPU. That GPU can be on a different machine, provided that the disk the features are written to is shared between the machines. The script starts a new feature extraction process that only handles the items that have not been processed yet, without overlapping with the extraction process that is already running.
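
One common way for independent processes to split work through a shared output directory is to claim each item by creating a marker file exclusively. The sketch below illustrates that general technique only; it is an assumption about how such coordination can be done, not a description of how this repo implements it, and the names try_claim and unclaimed as well as the ".claim" suffix are hypothetical.

```python
import os
import tempfile
from typing import Iterable, Iterator


def try_claim(output_dir: str, item_id: str) -> bool:
    """Claim an item by exclusively creating its marker file.

    Returns False if another process has already claimed the item.
    """
    path = os.path.join(output_dir, item_id + ".claim")
    try:
        # O_CREAT | O_EXCL fails when the file already exists,
        # so at most one process can create it successfully.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def unclaimed(output_dir: str, item_ids: Iterable[str]) -> Iterator[str]:
    """Yield only the items this process managed to claim."""
    for item_id in item_ids:
        if try_claim(output_dir, item_id):
            yield item_id


# Simulate two workers sharing one output directory.
with tempfile.TemporaryDirectory() as shared_dir:
    worker_a = list(unclaimed(shared_dir, ["img1", "img2"]))
    worker_b = list(unclaimed(shared_dir, ["img1", "img2", "img3"]))
    print(worker_a, worker_b)
```

In this simulation the second worker skips the items already claimed by the first. Whether exclusive creation behaves atomically on a disk shared between machines depends on the file system, so that should be checked before relying on this pattern.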

4. License

MIT

5. Acknowledgement

This repo uses resources from OpenAI CLIP, timm, CLIP-ViL, and X-modaler, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.

6. References

[1] How Much Can CLIP Benefit Vision-and-Language Tasks? Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer. arXiv, 2021.

[2] In Defense of Grid Features for Visual Question Answering. Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen. In CVPR, 2020.

Owner

Jianjie(JJ) Luo
