Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Overview

ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei*, Linjie Li*, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning. In this repository, we support end-to-end pretraining and finetuning for the following tasks:

  • Image-text pretraining on COCO and VG captions.
  • Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
  • Video-QA finetuning on TGIF-QA and MSRVTT-QA.
  • Image-QA finetuning on VQA 2.0.

It is also feasible and easy to add other image-text or video-text tasks for pretraining and finetuning.

Requirements

We provide a Docker image for easier reproduction. Please install the following:

Our scripts require the user to have the docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training hence GPUs with Tensor Cores are recommended.

Getting Started

General

  1. Create a folder that stores pretrained models, all the data, and results.

    PATH_TO_STORAGE=/path/to/your/data/
    mkdir -p $PATH_TO_STORAGE/txt_db  # annotations
    mkdir -p $PATH_TO_STORAGE/vis_db  # image and video 
    mkdir -p $PATH_TO_STORAGE/finetune  # finetuning results
    mkdir -p $PATH_TO_STORAGE/pretrained  # pretrained models
  2. Download pretrained models.

    Our e2e pretrained ClipBERT model (849MB), can be downloaded with the following command.

    bash scripts/download_pretrained.sh $PATH_TO_STORAGE

    This pretrained model can be used for finetuning on video-text tasks and image-text tasks. For your convenience, this script will also download bert-base-uncased and grid-feat-vqa model weights, which are used as initialization for pretraining.

  3. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /clipbert instead of built into the image so that user modification will be reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

Downstream Task Finetuning

Text-to-Video Retrieval

Tasks: MSRVTT retrieval, DiDeMo and ActivityNet Captions paragprah-to-video retrieval, MSRVTT MC Test.

  1. Download data.

    # outside the container  
    # download videos + annotations for $DSET
    bash scripts/download_$DSET.sh $PATH_TO_STORAGE

    $DSET can be one of msrvtt, didemo, anet.

  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR
    
    # for single GPU
    python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs prefixed with _ret. For example, you can use src/configs/msrvtt_ret_base_resnet50.json for MSRVTT retrieval.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_retrieval.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, it tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_retrieval_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT retrieval val split. The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

    After MSRVTT retrieval model is trained, you can use it for inference for the MSRVTT MC Test task as well, which is essentially a retrieval task in a multiple-choice task setup.

    # inside the container
    horovodrun -np 4 python src/tasks/run_msrvtt_mc.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db /txt/downstream/msrvtt_retrieval_mc/msrvtt_retrieval_mc_test.jsonl \
      --inference_img_db /img/msrvtt --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

Video Question Answering

Tasks: TGIF-QA action, transition, and frameQA tasks; MSRVTT-QA.

  1. Download data.

    # outside the container  
    # download MSRVTT videos, and QA + retrieval annotations
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE  
    # download TGIF-QA videos and annotations
    bash scripts/download_tgif_qa.sh $PATH_TO_STORAGE  
  2. Finetuning.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
        --config $CONFIG_PATH \
        --output_dir $OUTPUT_DIR

    $CONFIG_PATH should be set to one of the .json config files available at src/configs contains the substring _qa. For example, you can use src/configs/msrvtt_qa_base_resnet50.json for MSRVTT-QA.

  3. Run inference.

    # inside the container
    horovodrun -np 4 python src/tasks/run_video_qa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB --inference_batch_size 64 \
      --inference_n_clips $INFERENCE_N_CLIPS

    $STEP is an integer, which tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference. $TXT_DB and $IMG_DB are path to annotation file and video data. You can use TXT_DB=/txt/downstream/msrvtt_retrieval/msrvtt_qa_val.jsonl and IMG_DB=/img/msrvtt for inference on MSRVTT QA val split.

    The results will be written under $OUTPUT_DIR. You can use different $INFERENCE_N_CLIPS for inference, such as 1 or 16. Using more clips will have a large impact on inference speed and memory usage. You may want to use smaller batch sizes if larger values are set.

Image Question Answering (VQA)

  1. Download data

    # outside the container
    # download COCO and VG data
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
    # download VQA annotations
    bash scripts/download_vqa.sh $PATH_TO_STORAGE
  2. Finetuning

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
        --config src/configs/vqa_base_resnet50.json \
        --output_dir $OUTPUT_DIR
  3. Inference

    # inside the container
    horovodrun -np 4 python src/tasks/run_vqa.py \
      --do_inference 1 --output_dir $OUTPUT_DIR \
      --inference_split val --inference_model_step $STEP \
      --inference_txt_db $TXT_DB \
      --inference_img_db $IMG_DB \
      --inference_batch_size 64

Pretraining

  1. Download data

    # outside the container
    bash scripts/download_coco_vg.sh $PATH_TO_STORAGE
  2. Pretraining

    #inside the container
    horovodrun -np 8 python src/pretrain/run_pretrain.py \
        --config src/configs/pretrain_indomain_base_resnet50_mlm_itm.json \
        --output_dir $OUTPUT_DIR 

Data Preprocessing

ClipBERT takes raw video and text as inputs, there is no need to do feature extraction. However, to improve data loading speed, we use LMDB to store the raw image and video files. You can use the following script to convert a list of videos with file extensions mp4 and avi into LMDB:

# outside the container
python src/preprocessing/file2lmdb.py \
    --data_root /path/to/videos \
    --lmdb_save_dir /path/to/save/lmdb \
    --ext avi mp4 \
    --file_type video 

For images, use appropriate file extensions for --ext and --file_type image. Text annotation files are reorganized into jsonl files, see example preprocessed files downloaded by the scripts in scripts/.

Citation

If you find this code useful for your research, please consider citing:

@article{lei2021less,
  title={Less is More: ClipBERT for Video-and-Language Learningvia Sparse Sampling},
  author={Lei, Jie and Li, Linjie and Zhou, Luowei and Gan, Zhe and Berg, Tamara L. and Bansal, Mohit and Liu, Jingjing},
  journal={arXiv},
  year={2021}
}

Acknowledgement

We thank Yen-Chun Chen and Ruotian Luo for suggestions on the implementation. We also thank other members and interns at Microsoft Multimodal AI for their helpful discussions.

This code used resources from transformers, UNITER, HERO, grid-feats-vqa, SlowFast, Detectron2. The code is implemented using PyTorch, with multi-GPU support from Horovod and mixed precision support from apex. We thank the authors for open-sourcing their awesome projects.

License

MIT

Owner
Jie Lei 雷杰
UNC CS PhD student, vision+language.
Jie Lei 雷杰
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)

Neural Network Models for Joint POS Tagging and Dependency Parsing Implementations of joint models for POS tagging and dependency parsing, as describe

Dat Quoc Nguyen 152 Sep 02, 2022
Write Alphabet, Words and Sentences with your eyes.

The-Next-Gen-AI-Eye-Writer The Eye tracking Technique has become one of the most popular techniques within the human and computer interaction era, thi

Rohan Kasabe 2 Apr 05, 2022
A natural language processing model for sequential sentence classification in medical abstracts.

NLP PubMed Medical Research Paper Abstract (Randomized Controlled Trial) A natural language processing model for sequential sentence classification in

Hemanth Chandran 1 Jan 17, 2022
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0

NLP-Models-Tensorflow, Gathers machine learning and tensorflow deep learning models for NLP problems, code simplify inside Jupyter Notebooks 100%. Tab

HUSEIN ZOLKEPLI 1.7k Dec 30, 2022
DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task。涵盖68个领域、共计916万词的专业词典知识库,可用于文本分类、知识增强、领域词汇库扩充等自然语言处理应用。

liuhuanyong 357 Dec 24, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
Global Rhythm Style Transfer Without Text Transcriptions

Global Prosody Style Transfer Without Text Transcriptions This repository provides a PyTorch implementation of AutoPST, which enables unsupervised glo

Kaizhi Qian 193 Dec 30, 2022
Conditional probing: measuring usable information beyond a baseline

Conditional probing: measuring usable information beyond a baseline

John Hewitt 20 Dec 15, 2022
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.

Good news! Our new work exhibits state-of-the-art performances on DocUNet benchmark dataset: DocScanner: Robust Document Image Rectification with Prog

Hao Feng 231 Dec 26, 2022
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

NLP T5 Project proposal Topic Modeling and Clustering of News-Articles-and-Essays Students: Nasser Alshehri Abdullah Bushnag Abdulrhman Alqurashi OVER

2 Jan 18, 2022
Yet another Python binding for fastText

pyfasttext Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresea

Vincent Rasneur 230 Nov 16, 2022
CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

- 基于标题的大规模商品实体检索top1 一、任务介绍 CCKS 2020:基于标题的大规模商品实体检索,任务为对于给定的一个商品标题,参赛系统需要匹配到该标题在给定商品库中的对应商品实体。 输入:输入文件包括若干行商品标题。 输出:输出文本每一行包括此标题对应的商品实体,即给定知识库中商品 ID,

43 Nov 11, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

Welcome to Healthsea ✨ Create better access to health with spaCy. Healthsea is a pipeline for analyzing user reviews to supplement products by extract

Explosion 75 Dec 19, 2022
EMNLP 2021 paper "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".

Pre-train or Annotate? Domain Adaptation with a Constrained Budget This repo contains code and data associated with EMNLP 2021 paper "Pre-train or Ann

Fan Bai 8 Dec 17, 2021
YACLC - Yet Another Chinese Learner Corpus

汉语学习者文本多维标注数据集YACLC V1.0 中文 | English 汉语学习者文本多维标注数据集(Yet Another Chinese Learner

BLCU-ICALL 47 Dec 15, 2022
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

AI2 1.3k Jan 03, 2023