A PyTorch implementation of VIOLET

Overview

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

A PyTorch implementation of VIOLET

Overview

VIOLET is an implementation of
"VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling"
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MVM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignments between video and text modality.

Requirements

This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Usage

Data preprocessing

As using outer datasets (cannot be shared by us), we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.

cd _tools

# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use DALL-E to extract VQ tokens for MVM pretraining
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

There are parital examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.

Pretraining

Put pretrained VT in ./_snapshot. This script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-gpu distributed training.

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py

Here is our best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).

Downstream

CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json

We also provide all trained downstream checkpoints.

Citation

@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.1268}, 
  year = {2021} 
}
Owner
Tsu-Jui Fu
NLP & CV
Tsu-Jui Fu
Random-Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Random Word Generator Generates meaningful words from dictionary with given no. of letters and words. This might be useful for generating short links

Mohammed Rabil 1 Jan 01, 2022
Implementation of ProteinBERT in Pytorch

ProteinBERT - Pytorch (wip) Implementation of ProteinBERT in Pytorch. Original Repository Install $ pip install protein-bert-pytorch Usage import torc

Phil Wang 92 Dec 25, 2022
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022
Using BERT-based models for toxic span detection

SemEval 2021 Task 5: Toxic Spans Detection: Task: Link to SemEval-2021: Task 5 Toxic Span Detection is https://competitions.codalab.org/competitions/2

Ravika Nagpal 1 Jan 04, 2022
2021语言与智能技术竞赛:机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍 本项目基于官方给定的baseline(DuReader-Checklist-BASELINE)进行二次改造,对整个代码框架做了简单的重构,对核心网络结构添加了注释,解耦了数据读取的模块,并添加了阈值确认的功能,一些小的细节也做了改进。 本次任务为202

roar 29 Dec 05, 2022
AI and Machine Learning workflows on Anthos Bare Metal.

Hybrid and Sovereign AI on Anthos Bare Metal Table of Contents Overview Terraform as IaC Substrate ABM Cluster on GCE using Terraform TensorFlow ResNe

Google Cloud Platform 8 Nov 26, 2022
BERT Attention Analysis

BERT Attention Analysis This repository contains code for What Does BERT Look At? An Analysis of BERT's Attention. It includes code for getting attent

Kevin Clark 401 Dec 11, 2022
Python SDK for working with Voicegain Speech-to-Text

Voicegain Speech-to-Text Python SDK Python SDK for the Voicegain Speech-to-Text API. This API allows for large vocabulary speech-to-text transcription

Voicegain 3 Dec 14, 2022
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022
自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器

ja-timex 自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器 概要 ja-timex は、現代日本語で書かれた自然文に含まれる時間情報表現を抽出しTIMEX3と呼ばれるアノテーション仕様に変換することで、プログラムが利用できるような形に規格化するルールベースの解析器です。

Yuki Okuda 116 Nov 09, 2022
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

hezw.tkcw 20 Dec 12, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
Treemap visualisation of Maya scene files

Ever wondered which nodes are responsible for that 600 mb+ Maya scene file? Features Fast, resizable UI Parsing at 50 mb/sec Dependency-free, single-f

Marcus Ottosson 76 Nov 12, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
Text Analysis & Topic Extraction on Android App user reviews

AndroidApp_TextAnalysis Hi, there! This is code archive for Text Analysis and Topic Extraction from user_reviews of Android App. Dataset Source : http

Fitrie Ratnasari 1 Feb 14, 2022
Transformers Wav2Vec2 + Parlance's CTCDecodeTransformers Wav2Vec2 + Parlance's CTCDecode

🤗 Transformers Wav2Vec2 + Parlance's CTCDecode Introduction This repo shows how 🤗 Transformers can be used in combination with Parlance's ctcdecode

Patrick von Platen 9 Jul 21, 2022
Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Yoon Kim 43 Dec 23, 2022
HAIS_2GNN: 3D Visual Grounding with Graph and Attention

HAIS_2GNN: 3D Visual Grounding with Graph and Attention This repository is for the HAIS_2GNN research project. Tao Gu, Yue Chen Introduction The motiv

Yue Chen 1 Nov 26, 2022