Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Last update: Dec 30, 2022

Related tags

Deep Learning DeCLIP

Overview

DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available in arxiv

Updates

** Our code, dataset and models will be relased soon**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radfordet al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× fewer data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, Scaling up the model and computing also works well in our framework.

Model

Our pretrain visual backbone model (w/o text encoder)

DeCLIP_r50 GoogleDriver.
DeCLIP_vitb32 GoogleDriver

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Related tags

Overview

DeCLIP

Updates

Introduction

Model

Our pretrain visual backbone model (w/o text encoder)

Citing DeCLIP

Owner

Sense-GVT

Dynamic hair modeling from monocular videos using deep neural networks

Official PyTorch implementation of "BlendGAN: Implicitly GAN Blending for Arbitrary Stylized Face Generation" (NeurIPS 2021)

Semantically Contrastive Learning for Low-light Image Enhancement

Python implementation of "Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation"

An implementation of quantum convolutional neural network with MindQuantum. Huawei, classifying MNIST dataset

Developed an optimized algorithm which finds the most optimal path between 2 points in a 3D Maze using various AI search techniques like BFS, DFS, UCS, Greedy BFS and A*

Best practices for segmentation of the corporate network of any company

Official implementation of the article "Unsupervised JPEG Domain Adaptation For Practical Digital Forensics"

An Unsupervised Graph-based Toolbox for Fraud Detection

[NeurIPS-2021] Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data

a pytorch implementation of auto-punctuation learned character by character

This project contains an implemented version of Face Detection using OpenCV and Mediapipe. This is a code snippet and can be used in projects.

Code release for Local Light Field Fusion at SIGGRAPH 2019

In this tutorial, you will perform inference across 10 well-known pre-trained object detectors and fine-tune on a custom dataset. Design and train your own object detector.

[Preprint] "Chasing Sparsity in Vision Transformers: An End-to-End Exploration" by Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang

Official Pytorch implementation of "CLIPstyler:Image Style Transfer with a Single Text Condition"

Camera-caps - Examine the camera capabilities for V4l2 cameras

DeepLabv3+：Encoder-Decoder with Atrous Separable Convolution语义分割模型在tensorflow2当中的实现

[ArXiv 2021] One-Shot Generative Domain Adaptation

Qlib is an AI-oriented quantitative investment platform