WaveGrad

PyTorch implementation of Google Brain's high-fidelity WaveGrad vocoder (paper). The first implementation on GitHub with high-quality generation in 6 iterations.

Status

Real-time factor (RTF)

Number of parameters: 15,810,401

Model	Stable	RTX 2080 Ti	Tesla K80	Intel Xeon 2.3GHz*
1000 iterations	+	9.59	-	-
100 iterations	+	0.94	5.85	-
50 iterations	+	0.45	2.92	-
25 iterations	+	0.22	1.45	-
12 iterations	+	0.10	0.69	4.55
6 iterations	+	0.04	0.33	2.09

*Note: an older-generation Intel Xeon CPU was used.
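
To make the table easier to read, here is a minimal sketch of how an RTF value can be computed. It assumes the usual definition (synthesis time divided by the duration of the generated audio, so values below 1 are faster than real time). The function names and the 22050 Hz default sample rate are assumptions for illustration, not part of this repository.

```python
import time


def rtf_from_timing(elapsed_seconds, n_samples, sample_rate=22050):
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = n_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(generate, n_samples, sample_rate=22050):
    """Time one call to `generate` and convert it to an RTF value.

    `generate` is a hypothetical zero-argument callable that runs inference.
    On a GPU, synchronize the device inside `generate` before it returns,
    otherwise the timer may stop before the work has finished.
    """
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return rtf_from_timing(elapsed, n_samples, sample_rate)


# 2 seconds of compute for 10 seconds of audio gives an RTF of 0.2.
print(rtf_from_timing(2.0, 10 * 22050))
```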

About

WaveGrad is a conditional model for waveform generation that estimates gradients of the data density and reaches sampling quality comparable to WaveNet. The vocoder is neither a GAN, nor a normalizing flow, nor a classical autoregressive model. Its main concept is based on Denoising Diffusion Probabilistic Models (DDPM), which build on Langevin dynamics and score matching frameworks. Furthermore, compared to classic DDPM, WaveGrad converges very quickly (in 6 iterations, and probably fewer) with respect to the Langevin-dynamics iterative sampling scheme.

Installation

Clone this repo:

git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad

Install requirements:

pip install -r requirements.txt

Training

1 Preparing data

Make train and test filelists of your audio data, like the ones included in the filelists folder.
Make a configuration file* in the configs folder.

*Note: if you change hop_length for the STFT, make sure that the product of the upsampling factors in your config equals the new hop_length.
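
The constraint in the note can be checked before training starts. The sketch below is a standalone helper, not code from this repository; the function name and the example factors are assumptions chosen so that their product matches a hop length of 300.

```python
from math import prod


def check_upsampling(factors, hop_length):
    """Raise if the product of the upsampling factors differs from hop_length."""
    total = prod(factors)
    if total != hop_length:
        raise ValueError(
            f"product of upsampling factors is {total}, "
            f"but hop_length is {hop_length}"
        )
    return total


# Illustrative factors: 5 * 5 * 3 * 2 * 2 = 300.
check_upsampling([5, 5, 3, 2, 2], hop_length=300)
```

Calling it with values read from your own config file catches a mismatch early, instead of failing later with a shape error during training.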

2 Single and Distributed GPU training

Open the runs/train.sh script and specify the visible GPU devices and the path to your configuration file. If you specify more than one GPU, training runs in distributed mode.
Run sh runs/train.sh

3 Tensorboard and logging

To track training, run TensorBoard with tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER. All logging information and checkpoints are stored in logs/YOUR_LOGDIR_FOLDER. The logdir is specified in the config file.

4 Noise schedule grid search

Once the model is trained, run a grid search for the best schedule* for the desired number of iterations in notebooks/inference.ipynb. The code supports parallelism, so you can set the number of jobs above one to accelerate the search.

*Note: grid search is only necessary for a small number of iterations (such as 6 or 7). For larger numbers, just try the Fibonacci sequence initialization benchmark.fibonacci(...): I used it for 25 iterations and it works well. From a good 25-iteration schedule, for example, you can build a longer (higher-order) schedule by copying elements.
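
The general shape of such a search can be sketched as follows. This is not the code from notebooks/inference.ipynb: the function name, the candidate values, and the scoring callable are hypothetical, and the sketch runs serially for brevity. In practice the score would be some distance between generated and reference audio (lower is better).

```python
from itertools import product


def grid_search_schedule(candidates_per_step, score_fn):
    """Try every combination of candidate betas and keep the lowest score.

    candidates_per_step: one list of candidate beta values per iteration.
    score_fn: hypothetical callable mapping a schedule (tuple of betas)
              to a number, where lower means better quality.
    """
    best_schedule, best_score = None, float("inf")
    for schedule in product(*candidates_per_step):
        score = score_fn(schedule)
        if score < best_score:
            best_schedule, best_score = schedule, score
    return best_schedule, best_score


# Toy example: the "ideal" schedule is known, so the score is its distance.
target = (1e-4, 1e-2)
candidates = [[1e-5, 1e-4, 1e-3], [1e-3, 1e-2, 1e-1]]
best, score = grid_search_schedule(
    candidates,
    lambda s: sum(abs(a - b) for a, b in zip(s, target)),
)
print(best, score)
```

The number of combinations grows exponentially with the number of iterations, which is why an exhaustive search is only practical for very short schedules.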

Noise schedules for pretrained model

6-iteration schedule was obtained using grid search. Then, starting from that schedule, I found a slightly better approximation by hand.
7-iteration schedule was obtained in the same way.
12-iteration schedule was obtained in the same way.
25-iteration schedule was obtained using Fibonacci sequence benchmark.fibonacci(...).
50-iteration schedule was obtained by repeating elements from the 25-iteration schedule.
100-iteration schedule was obtained in the same way.
1000-iteration schedule was obtained in the same way.
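
A minimal sketch of the "repeating elements" idea, assuming each value of the shorter schedule is simply copied a fixed number of times. The helper name is hypothetical, and whether the values are rescaled afterwards is not stated here, so treat this as an illustration rather than the exact procedure used for the pretrained model.

```python
def repeat_schedule(betas, factor):
    """Build a longer schedule by copying each element `factor` times in place."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [beta for beta in betas for _ in range(factor)]


# A 25-element schedule becomes a 50-element one with factor=2.
short = [0.001 * (i + 1) for i in range(25)]  # placeholder values
longer = repeat_schedule(short, 2)
print(len(longer))
```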

Inference

CLI

Put your mel-spectrograms in some folder. Make a filelist. Then run this command with your own arguments:

sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"

Jupyter Notebook

More inference details are provided in notebooks/inference.ipynb. There you can also find out how to set a noise schedule for the model and run a grid search for the best schedule.

Other

Generated audios

Examples of generated audio are provided in the generated_samples folder. The quality degradation between 1000-iteration and 6-iteration inference is not noticeable, provided the best schedule for the latter has been found.

Pretrained checkpoints

A checkpoint file* pretrained on LJSpeech (22 kHz) is available via this Google Drive link.

*Note: the uploaded checkpoint is a dict with a single key 'model'.
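
Because of that layout, the weights have to be taken from under the 'model' key before they are passed to the network. The sketch below assumes the value under 'model' is a regular PyTorch state dict; the helper names are hypothetical and `model` stands for an already constructed WaveGrad module.

```python
def extract_state_dict(checkpoint):
    """Return the weights stored under the 'model' key of the checkpoint dict."""
    if "model" not in checkpoint:
        raise KeyError("expected a checkpoint dict with a 'model' key")
    return checkpoint["model"]


def load_weights(model, checkpoint_path):
    """Load a checkpoint from disk into `model` (assumed to be a torch.nn.Module)."""
    # Imported here so that extract_state_dict stays usable without PyTorch.
    import torch

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(extract_state_dict(checkpoint))
    return model
```

Loading onto the CPU first and moving the model to a GPU afterwards avoids errors when the checkpoint was saved on a device that is not available locally.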

Important details, issues and comments

During training WaveGrad uses a default noise schedule with 1000 iterations and linearly spaced betas in the range (1e-6, 0.01). For inference you can set another schedule with fewer iterations. Tune the betas carefully: the output quality depends heavily on them.
By default the model runs in mixed precision. The batch size is reduced compared to the paper (256 -> 96), since the authors trained their model on TPUs.
After ~10k training iterations (1-2 hours) on a single GPU the model generates good audio with 50-iteration inference. Total training time is about 1-2 days (for full convergence).
At some point training may become unstable (the loss explodes), so I introduced learning rate (LR) scheduling and gradient clipping. If the loss explodes on your data, try decreasing the LR scheduler gamma a bit. It should help.
By default the STFT hop length is 300, which is therefore also the total upsampling factor. Other values are untested, but you can try them. Remember that the total upsampling factor must still equal your new hop length.
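
The default training schedule described in the first item above can be sketched in plain Python. This is an illustration and not the repository's code: the function names are hypothetical, and including both endpoints of the range is an assumption. The second helper computes the square root of the cumulative product of (1 - beta), the noise-level quantity used in DDPM-style models.

```python
import math


def linear_betas(n_iter=1000, beta_start=1e-6, beta_end=0.01):
    """Linearly spaced betas from beta_start to beta_end (endpoints included)."""
    if n_iter < 2:
        raise ValueError("n_iter must be at least 2")
    step = (beta_end - beta_start) / (n_iter - 1)
    return [beta_start + i * step for i in range(n_iter)]


def noise_levels(betas):
    """Square root of the running product of (1 - beta) for each step."""
    levels, alpha_cumprod = [], 1.0
    for beta in betas:
        alpha_cumprod *= 1.0 - beta
        levels.append(math.sqrt(alpha_cumprod))
    return levels


betas = linear_betas()
levels = noise_levels(betas)
print(len(betas), betas[0], betas[-1], levels[-1])
```

The noise level starts close to 1 (almost clean signal) and decays towards 0 (almost pure noise). A shorter inference schedule has to cover a similar range in far fewer steps, which is one way to see why its betas need careful tuning.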

History of updates

(NEW: 10/24/2020) Huge update. Distributed training and mixed-precision support. More correct positional encoding. CLI support for inference. Parallel grid search. Model size significantly decreased.
New RTF info for NVIDIA Tesla K80 GPU card (popular in Google Colab service) and CPU Intel Xeon 2.3GHz.
Huge update. New well-generated 6-iteration sample. New noise schedule setting API. Added grid search code for the best schedule.
Improved training by introducing smarter learning rate scheduler. Obtained high-fidelity synthesis.
Stable training and multi-iteration inference. 6-iteration noise scheduling is supported.
Stable training and fixed-iteration inference with significant background static noise left. All positional encoding issues are solved.
Stable training of 25-, 50- and 1000-fixed-iteration models. Found that the linear scaling of the positional encoding (C=5000 from the paper) was missing (bug).
Stable training of 25-, 50- and 1000-fixed-iteration models. Fixed positional encoding downscaling. Parallel segment sampling is replaced by full-mel sampling.
(RELEASE, first on GitHub). Parallel segment sampling and broken positional encoding downscaling. Bad quality with clicks from concatenation from parallel-segment generation.

References

Nanxin Chen et al., WaveGrad: Estimating Gradients for Waveform Generation
Jonathan Ho et al., Denoising Diffusion Probabilistic Models
Denoising Diffusion Probabilistic Models repository (TensorFlow implementation), from which diffusion calculations have been adopted

Owner

Ivan Vovk
