Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

Overview

UNITER: UNiversal Image-TExt Representation Learning

This is the official repository of UNITER (ECCV 2020). This repository currently supports finetuning UNITER on NLVR2, VQA, VCR, SNLI-VE, Image-Text Retrieval for COCO and Flickr30k, and Referring Expression Comprehensions (RefCOCO, RefCOCO+, and RefCOCO-g). Both UNITER-base and UNITER-large pre-trained checkpoints are released. UNITER-base pre-training with in-domain data is also available.

Overview of UNITER

Some code in this repo are copied/modified from opensource implementations made available by PyTorch, HuggingFace, OpenNMT, and Nvidia. The image features are extracted using BUTD.

Requirements

We provide Docker image for easier reproduction. Please install the following:

Our scripts require the user to have the docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training hence GPUs with Tensor Cores are recommended.

Quick Start

NOTE: Please run bash scripts/download_pretrained.sh $PATH_TO_STORAGE to get our latest pretrained checkpoints. This will download both the base and large models.

We use NLVR2 as an end-to-end example for using this code base.

  1. Download processed data and pretrained models with the following command.

    bash scripts/download_nlvr2.sh $PATH_TO_STORAGE

    After downloading you should see the following folder structure:

    ├── ann
    │   ├── dev.json
    │   └── test1.json
    ├── finetune
    │   ├── nlvr-base
    │   └── nlvr-base.tar
    ├── img_db
    │   ├── nlvr2_dev
    │   ├── nlvr2_dev.tar
    │   ├── nlvr2_test
    │   ├── nlvr2_test.tar
    │   ├── nlvr2_train
    │   └── nlvr2_train.tar
    ├── pretrained
    │   └── uniter-base.pt
    └── txt_db
        ├── nlvr2_dev.db
        ├── nlvr2_dev.db.tar
        ├── nlvr2_test1.db
        ├── nlvr2_test1.db.tar
        ├── nlvr2_train.db
        └── nlvr2_train.db.tar
    
  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of built into the image so that user modification will be reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

  3. Run finetuning for the NLVR2 task.

    # inside the container
    python train_nlvr2.py --config config/train-nlvr2-base-1gpu.json
    
    # for more customization
    horovodrun -np $N_GPU python train_nlvr2.py --config $YOUR_CONFIG_JSON
  4. Run inference for the NLVR2 task and then evaluate.

    # inference
    python inf_nlvr2.py --txt_db /txt/nlvr2_test1.db/ --img_db /img/nlvr2_test/ \
        --train_dir /storage/nlvr-base/ --ckpt 6500 --output_dir . --fp16
    
    # evaluation
    # run this command outside docker (tested with python 3.6)
    # or copy the annotation json into mounted folder
    python scripts/eval_nlvr2.py ./results.csv $PATH_TO_STORAGE/ann/test1.json

    The above command runs inference on the model we trained. Feel free to replace --train_dir and --ckpt with your own model trained in step 3. Currently we only support single GPU inference.

  5. Customization

    # training options
    python train_nlvr2.py --help
    • command-line argument overwrites JSON config files
    • JSON config overwrites argparse default value.
    • use horovodrun to run multi-GPU training
    • --gradient_accumulation_steps emulates multi-gpu training
  6. Misc.

    # text annotation preprocessing
    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann
    
    # image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
    bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY
    
    # image preprocessing
    bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

    In case you would like to reproduce the whole preprocessing pipeline.

Downstream Tasks Finetuning

VQA

NOTE: train and inference should be ran inside the docker container

  1. download data
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 4 python train_vqa.py --config config/train-vqa-base-4gpu.json \
        --output_dir $VQA_EXP
    
  3. inference
    python inf_vqa.py --txt_db /txt/vqa_test.db --img_db /img/coco_test2015 \
        --output_dir $VQA_EXP --checkpoint 6000 --pin_mem --fp16
    
    The result file will be written at $VQA_EXP/results_test/results_6000_all.json, which can be submitted to the evaluation server

VCR

NOTE: train and inference should be ran inside the docker container

  1. download data
    bash scripts/download_vcr.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 4 python train_vcr.py --config config/train-vcr-base-4gpu.json \
        --output_dir $VCR_EXP
    
  3. inference
    horovodrun -np 4 python inf_vcr.py --txt_db /txt/vcr_test.db \
        --img_db "/img/vcr_gt_test/;/img/vcr_test/" \
        --split test --output_dir $VCR_EXP --checkpoint 8000 \
        --pin_mem --fp16
    
    The result file will be written at $VCR_EXP/results_test/results_8000_all.csv, which can be submitted to VCR leaderboard for evluation.

VCR 2nd Stage Pre-training

NOTE: pretrain should be ran inside the docker container

  1. download VCR data if you haven't
    bash scripts/download_vcr.sh $PATH_TO_STORAGE
    
  2. 2nd stage pre-train
    horovodrun -np 4 python pretrain_vcr.py --config config/pretrain-vcr-base-4gpu.json \
        --output_dir $PRETRAIN_VCR_EXP
    

Visual Entailment (SNLI-VE)

NOTE: train should be ran inside the docker container

  1. download data
    bash scripts/download_ve.sh $PATH_TO_STORAGE
    
  2. train
    horovodrun -np 2 python train_ve.py --config config/train-ve-base-2gpu.json \
        --output_dir $VE_EXP
    

Image-Text Retrieval

download data

bash scripts/download_itm.sh $PATH_TO_STORAGE

NOTE: Image-Text Retrieval is computationally heavy, especially on COCO.

Zero-shot Image-Text Retrieval (Flickr30k)

# every image-text pair has to be ranked; please use as many GPUs as possible
horovodrun -np $NGPU python inf_itm.py \
    --txt_db /txt/itm_flickr30k_test.db --img_db /img/flickr30k \
    --checkpoint /pretrain/uniter-base.pt --model_config /src/config/uniter-base.json \
    --output_dir $ZS_ITM_RESULT --fp16 --pin_mem

Image-Text Retrieval (Flickr30k)

  • normal finetune
    horovodrun -np 8 python train_itm.py --config config/train-itm-flickr-base-8gpu.json
    
  • finetune with hard negatives
    horovodrun -np 16 python train_itm_hard_negatives.py \
        --config config/train-itm-flickr-base-16gpu-hn.jgon
    

Image-Text Retrieval (COCO)

  • finetune with hard negatives
    horovodrun -np 16 python train_itm_hard_negatives.py \
        --config config/train-itm-coco-base-16gpu-hn.json
    

Referring Expressions

  1. download data
    bash scripts/download_re.sh $PATH_TO_STORAGE
    
  2. train
    python train_re.py --config config/train-refcoco-base-1gpu.json \
        --output_dir $RE_EXP
    
  3. inference and evaluation
    source scripts/eval_refcoco.sh $RE_EXP
    
    The result files will be written under $RE_EXP/results_test/

Similarly, change corresponding configs/scripts for running RefCOCO+/RefCOCOg.

Pre-tranining

download

bash scripts/download_indomain.sh $PATH_TO_STORAGE

pre-train

horovodrun -np 8 python pretrain.py --config config/pretrain-indomain-base-8gpu.json \
    --output_dir $PRETRAIN_EXP

Unfortunately, we cannot host CC/SBU features due to their large size. Users will need to process them on their own. We will provide a smaller sample for easier reference to the expected format soon.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{chen2020uniter,
  title={Uniter: Universal image-text representation learning},
  author={Chen, Yen-Chun and Li, Linjie and Yu, Licheng and Kholy, Ahmed El and Ahmed, Faisal and Gan, Zhe and Cheng, Yu and Liu, Jingjing},
  booktitle={ECCV},
  year={2020}
}

License

MIT

Owner
Yen-Chun Chen
Researcher @ Microsoft Cloud+AI. previously Machine Learning Scientist @ Stackline; M.S. student @ UNC Chapel Hill NLP group
Yen-Chun Chen
Facilitating the design, comparison and sharing of deep text matching models.

MatchZoo Facilitating the design, comparison and sharing of deep text matching models. MatchZoo 是一个通用的文本匹配工具包,它旨在方便大家快速的实现、比较、以及分享最新的深度文本匹配模型。 🔥 News

Neural Text Matching Community 3.7k Jan 02, 2023
BiNE: Bipartite Network Embedding

BiNE: Bipartite Network Embedding This repository contains the demo code of the paper: BiNE: Bipartite Network Embedding. Ming Gao, Leihui Chen, Xiang

leihuichen 214 Nov 24, 2022
Google AI 2018 BERT pytorch implementation

BERT-pytorch Pytorch implementation of Google AI's 2018 BERT, with simple annotation BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers f

Junseong Kim 5.3k Jan 07, 2023
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 08, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
Transformers and related deep network architectures are summarized and implemented here.

Transformers: from NLP to CV This is a practical introduction to Transformers from Natural Language Processing (NLP) to Computer Vision (CV) Introduct

Ibrahim Sobh 138 Dec 27, 2022
基于Transformer的单模型、多尺度的VAE模型

UniVAE 基于Transformer的单模型、多尺度的VAE模型 介绍 https://kexue.fm/archives/8475 依赖 需要大于0.10.6版本的bert4keras(当前还没有推到pypi上,可以直接从GitHub上clone最新版)。 引用 @misc{univae,

苏剑林(Jianlin Su) 49 Aug 24, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 07, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
An implementation of WaveNet with fast generation

pytorch-wavenet This is an implementation of the WaveNet architecture, as described in the original paper. Features Automatic creation of a dataset (t

Vincent Herrmann 858 Dec 27, 2022
It analyze the sentiment of the user, whether it is postive or negative.

Sentiment-Analyzer-Tool It analyze the sentiment of the user, whether it is postive or negative. It uses streamlit library for creating this sentiment

Paras Patidar 18 Dec 17, 2022
PyTorch code for EMNLP 2019 paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".

LXMERT: Learning Cross-Modality Encoder Representations from Transformers Our servers break again :(. I have updated the links so that they should wor

Hao Tan 838 Dec 19, 2022
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

Themos Stafylakis 10 Apr 30, 2022
ACL'22: Structured Pruning Learns Compact and Accurate Models

☕ CoFiPruning: Structured Pruning Learns Compact and Accurate Models This repository contains the code and pruned models for our ACL'22 paper Structur

Princeton Natural Language Processing 130 Jan 04, 2023
Code for paper "Which Training Methods for GANs do actually Converge? (ICML 2018)"

GAN stability This repository contains the experiments in the supplementary material for the paper Which Training Methods for GANs do actually Converg

Lars Mescheder 884 Nov 11, 2022
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022