BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Related tags

Deep Learningbddm
Overview

Bilateral Denoising Diffusion Models (BDDMs)

GitHub Stars visitors arXiv demo

This is the official PyTorch implementation of the following paper:

BDDM: BILATERAL DENOISING DIFFUSION MODELS FOR FAST AND HIGH-QUALITY SPEECH SYNTHESIS
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surrogate objective can achieve a lower bound of the log marginal likelihood tighter than a conventional surrogate. We also find that BDDM allows inheriting pre-trained score network parameters from any DPMs and consequently enables speedy and stable learning of the schedule network and optimization of a noise schedule for sampling. Our experiments demonstrate that BDDMs can generate high-fidelity audio samples with as few as three sampling steps. Moreover, compared to other state-of-the-art diffusion-based neural vocoders, BDDMs produce comparable or higher quality samples indistinguishable from human speech, notably with only seven sampling steps (143x faster than WaveGrad and 28.6x faster than DiffWave).

Paper: Published at ICLR 2022 on OpenReview

BDDM

This implementation supports model training and audio generation, and also provides the pre-trained models for the benchmark LJSpeech and VCTK dataset.

Visit our demo page for audio samples.

Updates:

  • May 20, 2021: Released our follow-up work FastDiff on GitHub, where we futher optimized the speed-and-quality trade-off.
  • May 10, 2021: Added the experiment configurations and model checkpoints for the VCTK dataset.
  • May 9, 2021: Added the searched noise schedules for the LJSpeech and VCTK datasets.
  • March 20, 2021: Released the PyTorch implementation of BDDM with pre-trained models for the LJSpeech dataset.

Recipes:

  • (Option 1) To train the BDDM scheduling network yourself, you can download the pre-trained score network from philsyn/DiffWave-Vocoder (provided at egs/lj/DiffWave.pkl), and follow the training steps below. (Start from Step I.)
  • (Option 2) To search for noise schedules using BDDM, we provide a pre-trained BDDM for LJSpeech at egs/lj/DiffWave-GALR.pkl and for VCTK at egs/vctk/DiffWave-GALR.pkl . (Start from Step III.)
  • (Option 3) To directly generate samples using BDDM, we provide the searched schedules for LJSpeech at egs/lj/noise_schedules and for VCTK at egs/vctk/noise_schedules (check conf.yml for the respective configurations). (Start from Step IV.)

Getting Started

We provide an example of how you can generate high-fidelity samples using BDDMs.

To try BDDM on your own dataset, simply clone this repo in your local machine provided with NVIDIA GPU + CUDA cuDNN and follow the below intructions.

Dependencies

Step I. Data Preparation and Configuraion

Download the LJSpeech dataset.

For training, we first need to setup a file conf.yml for configuring the data loader, the score and the schedule networks, the training procedure, the noise scheduling and sampling parameters.

Note: Appropriately modify the paths in "train_data_dir" and "valid_data_dir" for training; and the path in "gen_data_dir" for sampling. All dir paths should be link to a directory that store the waveform audios (in .wav) or the Mel-spectrogram files (in .mel).

Step II. Training a Schedule Network

Suppose that a well-trained score network (theta) is stored at $theta_path, we start by modifying "load": $theta_path in conf.yml.

After modifying the relevant hyperparameters for a schedule network (especially "tau"), we can train the schedule network (f_phi in paper) using:

# Training on device 0
sh train.sh 0 conf.yml

Note: In practice, we found that 10K training steps would be enough to obtain a promising scheduling network. This normally takes no more than half an hour for training with one GPU.

Step III. Searching for Noise Schedules

Given a well-trained BDDM (theta, phi), we can now run the noise scheduling algorithm to find the best schedule (optimizing the trade-off between quality and speed).

First, we set "load" in conf.yml to the path of the trained BDDM.

After setting the maximum number of sampling steps in scheduling ("N"), we run:

# Scheduling on device 0
sh schedule.sh 0 conf.yml

Step IV. Evaluation or Generation

For evaluation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the test set of audios (in .wav).

For generation, we set "gen_data_dir" in conf.yml to the path of a directory that stores the Mel-spectrogram (by default in .mel generated by TacotronSTFT or by our dataset loader bddm/loader/dataset.py).

Then, we run:

# Generation/evaluation on device 0 (only support single-GPU scheduling)
sh generate.sh 0 conf.yml

Acknowledgements

This implementation uses parts of the code from the following Github repos:
Tacotron2
DiffWave-Vocoder
as described in our code.

Citations

@inproceedings{lam2022bddm,
  title={BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis},
  author={Lam, Max WY and Wang, Jun and Su, Dan and Yu, Dong},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

License

Copyright 2022 Tencent

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Disclaimer

This is not an officially supported Tencent product.

Owner
Research repositories.
Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image [Project Page] [Paper] [Supp. Mat.] Table of Contents License Description Fittin

Vassilis Choutas 1.3k Jan 07, 2023
Reinforcement learning models in ViZDoom environment

DoomNet DoomNet is a ViZDoom agent trained by reinforcement learning. The agent is a neural network that outputs a probability of actions given only p

Andrey Kolishchak 126 Dec 09, 2022
Parameterized Explainer for Graph Neural Network

PGExplainer This is a Tensorflow implementation of the paper: Parameterized Explainer for Graph Neural Network https://arxiv.org/abs/2011.04573 NeurIP

Dongsheng Luo 89 Dec 12, 2022
Implementation for "Exploiting Aliasing for Manga Restoration" (CVPR 2021)

[CVPR Paper](To appear) | [Project Website](To appear) | BibTex Introduction As a popular entertainment art form, manga enriches the line drawings det

133 Dec 15, 2022
This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

About Repository This repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.' About Code

Arun Verma 1 Nov 09, 2021
[TNNLS 2021] The official code for the paper "Learning Deep Context-Sensitive Decomposition for Low-Light Image Enhancement"

CSDNet-CSDGAN this is the code for the paper "Learning Deep Context-Sensitive Decomposition for Low-Light Image Enhancement" Environment Preparing pyt

Jiaao Zhang 17 Nov 05, 2022
Veri Setinizi Yolov5 Formatına Dönüştürün

Veri Setinizi Yolov5 Formatına Dönüştürün! Bu Repo da Neler Var? Xml Formatındaki Veri Setini .Txt Formatına Çevirme Xml Formatındaki Dosyaları Silme

Kadir Nar 4 Aug 22, 2022
Dilated Convolution with Learnable Spacings PyTorch

Dilated-Convolution-with-Learnable-Spacings-PyTorch Ismail Khalfaoui Hassani Dilated Convolution with Learnable Spacings (abbreviated to DCLS) is a no

15 Dec 09, 2022
This library contains a Tensorflow implementation of the paper Stability Analysis of Unfolded WMMSE for Power Allocation

UWMMSE-stability Tensorflow implementation of Stability Analysis of UWMMSE Overview This library contains a Tensorflow implementation of the paper Sta

Arindam Chowdhury 1 Nov 16, 2022
Implementation of association rules mining algorithms (Apriori|FPGrowth) using python.

Association Rules Mining Using Python Implementation of association rules mining algorithms (Apriori|FPGrowth) using python. As a part of hw1 code in

Pre 2 Nov 10, 2021
Code for Dual Contrastive Learning for Unsupervised Image-to-Image Translation, NTIRE, CVPRW 2021.

arXiv Dual Contrastive Learning Adversarial Generative Networks (DCLGAN) We provide our PyTorch implementation of DCLGAN, which is a simple yet powerf

119 Dec 04, 2022
TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.

TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.

Microsoft 1.3k Dec 30, 2022
FairMOT for Multi-Class MOT using YOLOX as Detector

FairMOT-X Project Overview FairMOT-X is a multi-class multi object tracker, which has been tailored for training on the BDD100K MOT Dataset. It makes

Jonathan Tan 33 Dec 28, 2022
NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

The source code is temporariy removed, as we are solving potential copyright and license issues with GRANSO (http://www.timmitchell.com/software/GRANS

SUN Group @ UMN 28 Aug 03, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

318 Dec 31, 2022
PyTorch implementation of the end-to-end coreference resolution model with different higher-order inference methods.

End-to-End Coreference Resolution with Different Higher-Order Inference Methods This repository contains the implementation of the paper: Revealing th

Liyan 52 Jan 04, 2023
(ICCV'21) Official PyTorch implementation of Relational Embedding for Few-Shot Classification

Relational Embedding for Few-Shot Classification (ICCV 2021) Dahyun Kang, Heeseung Kwon, Juhong Min, Minsu Cho [paper], [project hompage] We propose t

Dahyun Kang 82 Dec 24, 2022
(NeurIPS '21 Spotlight) IQ-Learn: Inverse Q-Learning for Imitation

Inverse Q-Learning (IQ-Learn) Official code base for IQ-Learn: Inverse soft-Q Learning for Imitation, NeurIPS '21 Spotlight IQ-Learn is an easy-to-use

Divyansh Garg 102 Dec 20, 2022
Image Segmentation Animation using Quadtree concepts.

QuadTree Image Segmentation Animation using QuadTree concepts. Usage usage: quad.py [-h] [-fps FPS] [-i ITERATIONS] [-ws WRITESTART] [-b] [-img] [-s S

Alex Eidt 29 Dec 25, 2022
Implementation of ETSformer, state of the art time-series Transformer, in Pytorch

ETSformer - Pytorch Implementation of ETSformer, state of the art time-series Transformer, in Pytorch Install $ pip install etsformer-pytorch Usage im

Phil Wang 121 Dec 30, 2022