Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

Overview

ResDAVEnet-VQ

Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

What is in this repo?

  • Multi-GPU training of ResDAVEnet-VQ
  • Quantitative evaluation
    • Image-to-speech and speech-to-image retrieval
    • ZeroSpeech 2019 ABX phone-discriminability test
    • Word detection
  • Qualitative evaluation
    • Visualize time-aligned word/phone/code transcripts
    • F1/recall/precision scatter plots for model/layer comparison

alt text

If you find the code useful, please cite

@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}

Pre-trained models

Model [email protected] Link MD5 sum
{} 0.735 gDrive e3f94990c72ce9742c252b2e04f134e4
{}->{2} 0.760 gDrive d8ebaabaf882632f49f6aea0a69516eb
{}->{3} 0.794 gDrive 2c3a269c70005cbbaaa15fc545da93fa
{}->{2,3} 0.787 gDrive d0764d8e97187c8201f205e32b5f7fee
{2} 0.753 gDrive d68c942069fcdfc3944e556f6af79c60
{2}->{2,3} 0.764 gDrive 09e704f8fcd9f85be8c4d5bdf779bd3b
{2}->{2,3}->{2,3,4} 0.793 gDrive 6e403e7f771aad0c95f087318bf8447e
{3} 0.734 gDrive a0a3d5adbbd069a2739219346c8a8f70
{3}->{2,3} 0.760 gDrive 6c92bcc4445895876a7840bc6e88892b
{2,3} 0.667 gDrive 7a98a661302939817a1450d033bc2fcc

Data preparation

Download the MIT Places Image/Audio Data

We use MIT Places scene recognition database (Places Image) and a paired MIT Places Audio Caption Corpus (Places Audio) as visually-grounded speech, which contains roughly 400K image/spoken caption pairs, to train ResDAVEnet-VQ.

  • Places Image can be downloaded here
  • Places Audio can be downloaded here

Optional data preprocessing

Data specifcation files can be found at metadata/{train,val}.json inside the Places Audio directory; however, they do not include the time-aligned word transcripts for analysis. Those with alignments can be downloaded here:

Open the *.json files and update the values of image_base_path and audio_base_path to reflect the path where the image and the audio datasets are stored.

To speed up data loading, we save images and audio data into the HDF5 binary files, and use the h5py Python interface to access the data. The corresponding PyTorch Dataset class is ImageCaptionDatasetHDF5 in ./dataloaders/image_caption_dataset_hdf5.py. To prepare HDF5 datasets, run

./scripts/preprocess.sh

(We do support on-the-fly feature processing with the ImageCaptionDataset class in ./dataloaders/image_caption_dataset.py, which takes a data specification file as input (e.g., metadata/train.json). However, this can be very slow)

ImageCaptionDataset and ImageCaptionDatasetHDF5 are interchangeable, but most scripts in this repo assume the preprocessed HDF5 dataset is available. Users would have to modify the code correspondingly to use ImageCaptionDataset.

Interactive Qualtitative Evaluation

See run_evaluations.ipynb

Quantitative Evaluation

ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here.

To extract ResDAVEnet-VQ features, see ./scripts/dump_zs19_abx.sh.

Word detection

See ./run_unit_analysis.py. It needs both HDF5 dataset and the original JSON dataset to get the time-aligned word transcripts.

Example:

python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir

Cross-modal retrieval

See ./run_ResDavenetVQ.py. Set --mode=eval for retrieval evaluation.

Example:

python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"

Training

See ./scripts/train.sh.

To train a model from scratch with the 2nd and 3rd layers quantized, run

./scripts/train.sh 01100 RDVQ_01100 ""

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., ./exps/RDVQ_00000), run

./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
Owner
Wei-Ning Hsu
Research Scientist @ Facebook AI Research (FAIR). Former PhD Student @ MIT Spoken Language Systems Group
Wei-Ning Hsu
Solving reinforcement learning tasks which require language and vision

Multimodal Reinforcement Learning JAX implementations of the following multimodal reinforcement learning approaches. Dual-coding Episodic Memory from

Henry Prior 31 Feb 26, 2022
Official implementation of "Generating 3D Molecules for Target Protein Binding"

Generating 3D Molecules for Target Protein Binding This is the official implementation of the GraphBP method proposed in the following paper. Meng Liu

DIVE Lab, Texas A&M University 74 Dec 07, 2022
A light weight data augmentation tool for training CNNs and Viola Jones detectors

hey-daug A light weight data augmentation tool for training CNNs and Viola Jones detectors (Haar Cascades). This tool inflates your data by up to six

Jaiyam Sharma 2 Nov 23, 2019
Python script for performing depth completion from sparse depth and rgb images using the msg_chn_wacv20. model in ONNX

ONNX msg_chn_wacv20 depth completion Python script for performing depth completion from sparse depth and rgb images using the msg_chn_wacv20 model in

Ibai Gorordo 19 Oct 22, 2022
A parallel framework for population-based multi-agent reinforcement learning.

MALib: A parallel framework for population-based multi-agent reinforcement learning MALib is a parallel framework of population-based learning nested

MARL @ SJTU 348 Jan 08, 2023
Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

DocFormer - PyTorch Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for t

171 Jan 06, 2023
MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Resolution (CVPR2021)

MASA-SR Official PyTorch implementation of our CVPR2021 paper MASA-SR: Matching Acceleration and Spatial Adaptation for Reference-Based Image Super-Re

DV Lab 126 Dec 20, 2022
UIUCTF 2021 Public Challenge Repository

UIUCTF-2021-Public UIUCTF 2021 Public Challenge Repository Notes: every challenge folder contains a challenge.yml file in the format for ctfcli, CTFd'

SIGPwny 15 Nov 03, 2022
Pytorch Implementation of LNSNet for Superpixel Segmentation

LNSNet Overview Official implementation of Learning the Superpixel in a Non-iterative and Lifelong Manner (CVPR'21) Learning Strategy The proposed LNS

42 Oct 11, 2022
Code for STFT Transformer used in BirdCLEF 2021 competition.

STFT_Transformer Code for STFT Transformer used in BirdCLEF 2021 competition. The STFT Transformer is a new way to use Transformers similar to Vision

Jean-François Puget 69 Sep 29, 2022
SelfRemaster: SSL Speech Restoration

SelfRemaster: Self-Supervised Speech Restoration Official implementation of SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesi

Takaaki Saeki 46 Jan 07, 2023
A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Tensorpack is a neural network training interface based on TensorFlow. Features: It's Yet Another TF high-level API, with speed, and flexibility built

Tensorpack 6.2k Jan 09, 2023
Face and Pose detector that emits MQTT events when a face or human body is detected and not detected.

Face Detect MQTT Face or Pose detector that emits MQTT events when a face or human body is detected and not detected. I built this as an alternative t

Jacob Morris 38 Oct 21, 2022
Code for paper ECCV 2020 paper: Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop.

Who Left the Dogs Out? Evaluation and demo code for our ECCV 2020 paper: Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization

Benjamin Biggs 29 Dec 28, 2022
Denoising Normalizing Flow

Denoising Normalizing Flow Christian Horvat and Jean-Pascal Pfister 2021 We combine Normalizing Flows (NFs) and Denoising Auto Encoder (DAE) by introd

CHrvt 17 Oct 15, 2022
Graduation Project

Gesture-Detection-and-Depth-Estimation This is my graduation project. (1) In this project, I use the YOLOv3 object detection model to detect gesture i

ChaosAT 1 Nov 23, 2021
TabNet for fastai

TabNet for fastai This is an adaptation of TabNet (Attention-based network for tabular data) for fastai (=2.0) library. The original paper https://ar

Mikhail Grankin 116 Oct 21, 2022
[NeurIPS 2021 Spotlight] Aligning Pretraining for Detection via Object-Level Contrastive Learning

SoCo [NeurIPS 2021 Spotlight] Aligning Pretraining for Detection via Object-Level Contrastive Learning By Fangyun Wei*, Yue Gao*, Zhirong Wu, Han Hu,

Yue Gao 139 Dec 14, 2022
Learned Initializations for Optimizing Coordinate-Based Neural Representations

Learned Initializations for Optimizing Coordinate-Based Neural Representations Project Page | Paper Matthew Tancik*1, Ben Mildenhall*1, Terrance Wang1

Matthew Tancik 127 Jan 03, 2023
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022