A TensorFlow implementation of DeepMind's WaveNet paper

Overview

A TensorFlow implementation of DeepMind's WaveNet paper

Build Status

This is a TensorFlow implementation of the WaveNet generative neural network architecture for audio generation.

The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation (see the DeepMind blog post and paper for details).

The network models the conditional probability to generate the next sample in the audio waveform, given all previous samples and possibly additional parameters.

After an audio preprocessing step, the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).

A convolutional layer that only accesses the current and previous inputs then reduces the channel dimension.

The core of the network is constructed as a stack of causal dilated layers, each of which is a dilated convolution (convolution with holes), which only accesses the current and past audio samples.

The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.

The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.

In this repository, the network implementation can be found in model.py.

Requirements

TensorFlow needs to be installed before running the training script. Code is tested on TensorFlow version 1.0.1 for Python 2.7 and Python 3.5.

In addition, librosa must be installed for reading and writing audio.

To install the required python packages, run

pip install -r requirements.txt

For GPU support, use

pip install -r requirements_gpu.txt

Training the network

You can use any corpus containing .wav files. We've mainly used the VCTK corpus (around 10.4GB, Alternative host) so far.

In order to train the network, execute

python train.py --data_dir=corpus

to train the network, where corpus is a directory containing .wav files. The script will recursively collect all .wav files in the directory.

You can see documentation on each of the training settings by running

python train.py --help

You can find the configuration of the model parameters in wavenet_params.json. These need to stay the same between training and generation.

Global Conditioning

Global conditioning refers to modifying the model such that the id of a set of mutually-exclusive categories is specified during training and generation of .wav file. In the case of the VCTK, this id is the integer id of the speaker, of which there are over a hundred. This allows (indeed requires) that a speaker id be specified at time of generation to select which of the speakers it should mimic. For more details see the paper or source code.

Training with Global Conditioning

The instructions above for training refer to training without global conditioning. To train with global conditioning, specify command-line arguments as follows:

python train.py --data_dir=corpus --gc_channels=32

The --gc_channels argument does two things:

  • It tells the train.py script that it should build a model that includes global conditioning.
  • It specifies the size of the embedding vector that is looked up based on the id of the speaker.

The global conditioning logic in train.py and audio_reader.py is "hard-wired" to the VCTK corpus at the moment in that it expects to be able to determine the speaker id from the pattern of file naming used in VCTK, but can be easily be modified.

Generating audio

Example output generated by @jyegerlehner based on speaker 280 from the VCTK corpus.

You can use the generate.py script to generate audio using a previously trained model.

Generating without Global Conditioning

Run

python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

where logdir/train/2017-02-13T16-45-34/model.ckpt-80000 needs to be a path to previously saved model (without extension). The --samples parameter specifies how many audio samples you would like to generate (16000 corresponds to 1 second by default).

The generated waveform can be played back using TensorBoard, or stored as a .wav file by using the --wav_out_path parameter:

python generate.py --wav_out_path=generated.wav --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Passing --save_every in addition to --wav_out_path will save the in-progress wav file every n samples.

python generate.py --wav_out_path=generated.wav --save_every 2000 --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Fast generation is enabled by default. It uses the implementation from the Fast Wavenet repository. You can follow the link for an explanation of how it works. This reduces the time needed to generate samples to a few minutes.

To disable fast generation:

python generate.py --samples 16000 logdir/train/2017-02-13T16-45-34/model.ckpt-80000 --fast_generation=false

Generating with Global Conditioning

Generate from a model incorporating global conditioning as follows:

python generate.py --samples 16000  --wav_out_path speaker311.wav --gc_channels=32 --gc_cardinality=377 --gc_id=311 logdir/train/2017-02-13T16-45-34/model.ckpt-80000

Where:

--gc_channels=32 specifies 32 is the size of the embedding vector, and must match what was specified when training.

--gc_cardinality=377 is required as 376 is the largest id of a speaker in the VCTK corpus. If some other corpus is used, then this number should match what is automatically determined and printed out by the train.py script at training time.

--gc_id=311 specifies the id of speaker, speaker 311, for which a sample is to be generated.

Running tests

Install the test requirements

pip install -r requirements_test.txt

Run the test suite

./ci/test.sh

Missing features

Currently there is no local conditioning on extra information which would allow context stacks or controlling what speech is generated.

Related projects

Owner
Igor Babuschkin
Igor Babuschkin
Data Preparation, Processing, and Visualization for MoVi Data

MoVi-Toolbox Data Preparation, Processing, and Visualization for MoVi Data, https://www.biomotionlab.ca/movi/ MoVi is a large multipurpose dataset of

Saeed Ghorbani 51 Nov 27, 2022
FOSS Digital Asset Distribution Platform built on Frappe.

Digistore FOSS Digital Assets Marketplace. Distribute digital assets, like a pro. Video Demo Here Features Create, attach and list digital assets (PDF

Mohammad Hussain Nagaria 30 Dec 08, 2022
Code for IntraQ, PyTorch implementation of our paper under review

IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization paper Requirements Python = 3.7.10 Pytorch == 1.7

1 Nov 19, 2021
Advancing mathematics by guiding human intuition with AI

Advancing mathematics by guiding human intuition with AI This repo contains two colab notebooks which accompany the paper, available online at https:/

DeepMind 315 Dec 26, 2022
Learning to Simulate Dynamic Environments with GameGAN (CVPR 2020)

Learning to Simulate Dynamic Environments with GameGAN PyTorch code for GameGAN Learning to Simulate Dynamic Environments with GameGAN Seung Wook Kim,

199 Dec 26, 2022
Fully Convolutional Networks for Semantic Segmentation by Jonathan Long*, Evan Shelhamer*, and Trevor Darrell. CVPR 2015 and PAMI 2016.

Fully Convolutional Networks for Semantic Segmentation This is the reference implementation of the models and code for the fully convolutional network

Evan Shelhamer 3.2k Jan 08, 2023
Volumetric Correspondence Networks for Optical Flow, NeurIPS 2019.

VCN: Volumetric correspondence networks for optical flow [project website] Requirements python 3.6 pytorch 1.1.0-1.3.0 pytorch correlation module (opt

Gengshan Yang 144 Dec 06, 2022
An official source code for "Augmentation-Free Self-Supervised Learning on Graphs"

Augmentation-Free Self-Supervised Learning on Graphs An official source code for Augmentation-Free Self-Supervised Learning on Graphs paper, accepted

Namkyeong Lee 59 Dec 01, 2022
Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition

Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition | paper | dataset | pretrained detection model | Authors: Yi-Chang Che

Yi-Chang Chen 1 Aug 23, 2022
Wordle-solver - Wordle answer generation program in python

🟨 Wordle Solver 🟩 Wordle answer generation program in python ✔️ Requirements U

Dahyun Kang 4 May 28, 2022
Implementation of experiments in the paper Clockwork Variational Autoencoders (project website) using JAX and Flax

Clockwork VAEs in JAX/Flax Implementation of experiments in the paper Clockwork Variational Autoencoders (project website) using JAX and Flax, ported

Julius Kunze 26 Oct 05, 2022
Fuwa-http - The http client implementation for the fuwa eco-system

Fuwa HTTP The HTTP client implementation for the fuwa eco-system Example import

Fuwa 2 Feb 16, 2022
A knowledge base construction engine for richly formatted data

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data. Note that Fonduer is

HazyResearch 386 Dec 05, 2022
A curated list of resources for Image and Video Deblurring

A curated list of resources for Image and Video Deblurring

Subeesh Vasu 1.7k Jan 01, 2023
Codebase for Time-series Generative Adversarial Networks (TimeGAN)

Codebase for Time-series Generative Adversarial Networks (TimeGAN)

Jinsung Yoon 532 Dec 31, 2022
Linear algebra python - Number of operations and problems in Linear Algebra and Numerical Linear Algebra

Linear algebra in python Number of operations and problems in Linear Algebra and

Alireza 5 Oct 09, 2022
The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data This repository provides the implementation details for

124 Dec 27, 2022
Send text to girlfriend in the morning

Girlfriend Text Send text to girlfriend (or really anyone with a phone number) in the morning 1. Configure your settings in utils.py. phone_number = "

Paras Adhikary 199 Oct 25, 2022
This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

Bin Xiao 175 Jan 08, 2023
Heterogeneous Temporal Graph Neural Network

Heterogeneous Temporal Graph Neural Network This repository contains the datasets and source code of HTGNN. run_mag.ipynb is the training and testing

15 Dec 22, 2022