Code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition"

Last update: Dec 01, 2022

Overview

SEW (Squeezed and Efficient Wav2vec)

The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition" by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q Weinberger, and Yoav Artzi.

Model Checkpoints

Unsupervisedly Pre-trained on LibriSpeech 960h

Model	Pre-training updates	Dataset	Checkpoint
W2V2-tiny	100K	Librispeech 960h	download
W2V2-small	100K	Librispeech 960h	download
W2V2-mid	100K	Librispeech 960h	download
W2V2-base	100K	Librispeech 960h	download
SEW-tiny	100K	Librispeech 960h	download
SEW-small	100K	Librispeech 960h	download
SEW-mid	100K	Librispeech 960h	download
SEW-D-tiny	100K	Librispeech 960h	download
SEW-D-small	100K	Librispeech 960h	download
SEW-D-mid	100K	Librispeech 960h	download
SEW-D-mid (k127)	100K	Librispeech 960h	download
SEW-D-base	100K	Librispeech 960h	download
SEW-D-base+	100K	Librispeech 960h	download
SEW-D-mid	400K	Librispeech 960h	download
SEW-D-mid (k127)	400K	Librispeech 960h	download
SEW-D-base+	400K	Librispeech 960h	download

Usage

Dependencies

The code is tested with fairseq commit 05255f9, deberta commit bf17ca4 and the following packages.

torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1
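
As a quick sanity check before training, the pins above can be compared against the current environment. The following is a minimal sketch, not part of the repository: the helper names `parse_pins` and `check_pins` are hypothetical, and it only uses the standard-library `importlib.metadata` (Python 3.8+). Note that a build with a local version suffix (for example a CUDA-specific torch wheel) would be reported as a mismatch by this strict comparison.

```python
from importlib import metadata

# Pins copied verbatim from the dependency list above.
PINS = """
torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed_or_None)} for every package that differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(parse_pins(PINS)).items():
        print(f"{name}: wanted {wanted}, found {installed or 'not installed'}")
```

An empty output means every listed package matches its pinned version.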

Apex

Please install NVIDIA's apex with

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

wav2letter decoder

Currently, we decode with the wav2letter v0.2 Python bindings at commit 96f5f9d. The newest commit in the v0.2 branch, d5a93f0, leads to worse WER for the wav2vec 2.0 baselines. Please install the Python bindings from https://github.com/flashlight/wav2letter/tree/96f5f9d3b41e01af0a031ee0d2604acd9ef3b1b0/bindings/python

Installation

git clone https://github.com/asappresearch/sew.git
cd sew 
pip install -e .

Pre-training

Pre-training SEW models

Run the following command, where $model_size can be tiny, small, or mid, and $ngpu is the number of GPUs you want to use.

bash scripts/pt-sew.sh $model_size $ngpu

Pre-training SEW-D models

bash scripts/pt-sew-d.sh $model_size $ngpu

where $model_size can be tiny, small, mid, mid-k127, base, or base+.
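
When launching several pre-training runs from a driver script, it can help to validate the model size before invoking the shell scripts. The sketch below is an illustration only: the `SCRIPTS` table and the `pretrain_command` helper are hypothetical names, and the allowed sizes are simply the ones listed in this section.

```python
# Script path and allowed $model_size values per model family, as listed above.
SCRIPTS = {
    "sew": ("scripts/pt-sew.sh", {"tiny", "small", "mid"}),
    "sew-d": (
        "scripts/pt-sew-d.sh",
        {"tiny", "small", "mid", "mid-k127", "base", "base+"},
    ),
}


def pretrain_command(family, model_size, ngpu):
    """Build the argv list for one pre-training run, rejecting unknown sizes."""
    if family not in SCRIPTS:
        raise ValueError(f"unknown model family: {family!r}")
    script, sizes = SCRIPTS[family]
    if model_size not in sizes:
        raise ValueError(
            f"{model_size!r} is not a listed size for {family}; "
            f"expected one of {sorted(sizes)}"
        )
    if ngpu < 1:
        raise ValueError("ngpu must be at least 1")
    return ["bash", script, model_size, str(ngpu)]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) from the repository root to actually launch it.
    print(" ".join(pretrain_command("sew-d", "mid-k127", 8)))
```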

Fine-tuning

Run the following script to fine-tune a model with the hyperparameters from wav2vec 2.0.

bash scripts/ft-model.sh $pre_trained_model $split $ngpu

where $pre_trained_model can be either a W2V2, SEW, or a SEW-D model checkpoint and $split can be 10m, 1h, 10h, or 100h.

We also provide a set of hyperparameters that keeps all dropout rates the same as in the pre-training stage; we found this configuration to be more stable.

bash scripts/ft-model-stable.sh $pre_trained_model $split $ngpu

If you see a GPU out-of-memory error, scale down dataset.max_tokens and scale up optimization.update_freq in scripts/ft-model.sh. For example, change these lines

  dataset.max_tokens=3200000 \
  optimization.update_freq="[$((8 / $ngpu))]" \

to

  dataset.max_tokens=1600000 \
  optimization.update_freq="[$((16 / $ngpu))]" \

which reduces the batch size and increases the number of gradient accumulation steps in order to use less GPU memory.
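
The arithmetic behind this change can be checked with a few lines of Python. This is a sketch under the assumption that dataset.max_tokens is a per-GPU limit and that the amount of data per optimizer step is max_tokens × update_freq × ngpu; the function name `tokens_per_update` is hypothetical.

```python
def tokens_per_update(max_tokens, total_accum, ngpu):
    """Data consumed per optimizer step across all GPUs.

    total_accum is the numerator used in the script (8 or 16); the integer
    division mirrors the shell expression $((total_accum / $ngpu)).
    """
    update_freq = total_accum // ngpu
    return max_tokens * update_freq * ngpu


if __name__ == "__main__":
    for ngpu in (1, 2, 4, 8):
        original = tokens_per_update(3_200_000, 8, ngpu)
        reduced = tokens_per_update(1_600_000, 16, ngpu)
        # Halving max_tokens while doubling update_freq keeps the product fixed.
        print(ngpu, original, reduced, original == reduced)
```

Under these assumptions the two settings process the same amount of data per update whenever $ngpu divides 8, so only the per-step memory footprint changes. For GPU counts that do not divide 8 evenly, the integer division truncates and the two settings are no longer equivalent.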

Evaluation

Please run this script to prepare the official LibriSpeech 4-gram language model.

bash scripts/prepare_librispeech_lm.sh $kenlm_build_bin

where $kenlm_build_bin is the folder that contains the KenLM build_binary executable file (e.g. /home/user/kenlm/build/bin).

Then run this script to evaluate a pre-trained ASR model

python tools/eval_w2v.py tunelm --subsets '["dev-clean", "dev-other", "test-clean", "test-other"]' --model $asr_checkpoint
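
The --subsets argument is a quoted list literal, which is easy to get wrong when the command is assembled by another script. The sketch below builds the same command line from Python; the helper name `build_eval_command` and the checkpoint path are hypothetical placeholders, and only the flags shown above are used.

```python
import json
import shlex


def build_eval_command(asr_checkpoint, subsets):
    """Build the argv list for the evaluation command shown above."""
    return [
        "python",
        "tools/eval_w2v.py",
        "tunelm",
        "--subsets",
        json.dumps(subsets),  # e.g. '["dev-clean", "dev-other"]'
        "--model",
        asr_checkpoint,
    ]


if __name__ == "__main__":
    # Hypothetical checkpoint path, for illustration only.
    cmd = build_eval_command(
        "/path/to/checkpoint_best.pt",
        ["dev-clean", "dev-other", "test-clean", "test-other"],
    )
    # shlex.join adds the shell quoting needed around the list literal.
    print(shlex.join(cmd))
```

Passing the list returned by `build_eval_command` directly to `subprocess.run` avoids shell quoting altogether.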


Comments

8000 sample rate audio

Hello there,

I'm trying to train on an audio dataset with an 8000 Hz sample rate. Is it enough to simply add task.sample_rate=8000 to the fairseq command, or are there additional config changes that I should make?

I would much appreciate any advice

Thank you

opened by Mega4alik

How to train on non-English languages

Hi! Thank you for the awesome model!

We are very interested in your project and are trying to use SEW for Japanese. When we train the model, should we use these scripts? Thanks! https://github.com/asappresearch/sew/tree/master/scripts

opened by jigenji

Fix padding mask calculation

This PR updates the padding mask calculation to be the same as the one in the reference Wav2Vec2 implementation (same commit as listed in SEW's README): https://github.com/pytorch/fairseq/blob/05255f96410e5b1eaf3bf59b767d5b4b7e2c3a35/fairseq/models/wav2vec/wav2vec2.py#L477

For more details on how and why it was fixed in fairseq, check out this PR by @patrickvonplaten https://github.com/pytorch/fairseq/pull/3228

opened by anton-l 0

Releases(v0.0.1)

v0.0.1(Sep 15, 2021)

First release.

Owner

ASAPP Research
