GLIP: Grounded Language-Image Pre-training

Last update: Jan 01, 2023


Updates

12/06/2021: GLIP paper is on arXiv: https://arxiv.org/abs/2112.03857. The code and models are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP and contains the instructions needed to reproduce the results presented in the paper. The paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, which improves both tasks and bootstraps a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.

When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection: Faster-RCNN w/ ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead w/ Swin-Tiny (49.7).
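
The unification of detection and phrase grounding described above can be illustrated with a small sketch. It is not the released GLIP code: the function names, the whitespace tokenization, the toy feature vectors, and the mean-over-tokens aggregation are all assumptions made for illustration. The idea shown is that the category names are joined into one text prompt, each region is scored against every prompt token by a dot product, and the token scores are then pooled back into one score per category.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def build_prompt(class_names: List[str]) -> Tuple[str, List[str], Dict[str, Tuple[int, int]]]:
    """Turn a list of category names into a text prompt.

    Returns the prompt string, a whitespace-level token list (a stand-in for a
    real tokenizer), and the half-open token span [start, end) of each name.
    """
    tokens: List[str] = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name in class_names:
        start = len(tokens)
        tokens.extend(name.split())
        spans[name] = (start, len(tokens))
        tokens.append(".")  # separator token between categories
    prompt = ". ".join(class_names) + "."
    return prompt, tokens, spans


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def alignment_scores(region_feats: List[Vector], token_feats: List[Vector]) -> List[List[float]]:
    """Region-to-token alignment: one dot product per (region, token) pair."""
    return [[dot(r, t) for t in token_feats] for r in region_feats]


def class_scores(align_row: List[float], spans: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Pool token scores into one score per category (assumed: mean over the span)."""
    return {name: sum(align_row[s:e]) / (e - s) for name, (s, e) in spans.items()}


# Toy example with hand-made 3-d features (hypothetical values, not model outputs).
prompt, tokens, spans = build_prompt(["person", "hair drier"])
token_feats = [
    [1.0, 0.0, 0.0],  # "person"
    [0.0, 0.0, 0.0],  # "."
    [0.0, 1.0, 0.0],  # "hair"
    [0.0, 1.0, 0.0],  # "drier"
    [0.0, 0.0, 0.0],  # "."
]
region_feats = [
    [0.9, 0.1, 0.0],  # a region that resembles "person"
    [0.2, 0.8, 0.0],  # a region that resembles "hair drier"
]

scores = alignment_scores(region_feats, token_feats)
predictions = []
for row in scores:
    per_class = class_scores(row, spans)
    predictions.append(max(per_class, key=per_class.get))

print(prompt)
print(predictions)
```

Because the category list is only text, the same scoring path handles a detection label set and a free-form grounding caption; swapping the prompt changes the vocabulary without changing the scoring code.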

Citations

Please consider citing this paper if you use the code:

@inproceedings{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2021},
      booktitle={arXiv preprint arXiv:2112.03857},
}

Owner

Microsoft
