PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Last update: Jan 02, 2023

Overview

FastPitchFormant - PyTorch Implementation

PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis.

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/LJSpeech/.

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 1000000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The generated utterances will be put in output/result/.

Batch Inference

Batch inference is also supported, try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 1000000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt

Controllability

The pitch/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20 % and decrease the pitch by 20 % by

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 1000000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --pitch_control 0.8

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consists of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

Preprocessing

First, run

python3 prepare_align.py config/LJSpeech/preprocess.yaml

for some preparations.

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech datasets are provided here. You have to unzip the files in preprocessed_data/LJSpeech/TextGrid/.

After that, run the preprocessing script by

python3 preprocess.py config/LJSpeech/preprocess.yaml

Alternately, you can align the corpus by yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

to align the corpus and then run the preprocessing script.

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost.

Implementation Issues

Use HiFi-GAN instead of VocGAN for vocoding.

Citation

@misc{lee2021fastpitchformant,
  author = {Lee, Keon},
  title = {FastPitchFormant},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/FastPitchFormant}}
}

References

PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

279 Dec 9, 2022

PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Lip to Speech Synthesis with Visual Context Attentional GAN This repository contains the PyTorch implementation of the following paper: Lip to Speech

6 Nov 2, 2022

Implementation of Diverse Semantic Image Synthesis via Probability Distribution Modeling

Diverse Semantic Image Synthesis via Probability Distribution Modeling (CVPR 2021) Paper Zhentao Tan, Menglei Chai, Dongdong Chen, Jing Liao, Qi Chu,

45 Nov 17, 2022

A Flow-based Generative Network for Speech Synthesis

WaveGlow: a Flow-based Generative Network for Speech Synthesis Ryan Prenger, Rafael Valle, and Bryan Catanzaro In our recent paper, we propose WaveGlo

2k Dec 26, 2022

Official implementation of the paper: "LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech"

LDNet Author: Wen-Chin Huang (Nagoya University) Email: [email protected] This is the official implementation of the paper "LDNet

40 Nov 20, 2022

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction.

TalkNet 2 [WIP] TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Predictio

69 Dec 17, 2022

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

31 Dec 8, 2022

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Bilateral Denoising Diffusion Models (BDDMs) This is the official PyTorch implementation of the following paper: BDDM: BILATERAL DENOISING DIFFUSION M

172 Dec 23, 2022

This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you.

🗣️ aspeak A simple text-to-speech client using azure TTS API(trial). 😆 TL;DR: This program uses trial auth token of Azure Cognitive Services to do s

359 Jan 5, 2023

Comments

Error when duration_control is <1

I can set any value above 1 (ie. '--duration_control 1.9') to slow down the speaking rate, but can't do the opposite, anything below 1 (ie. 0.9) will throw this error message:

C:\FastPitchFormant>python synthesize.py --text "testing" --restore_step 600000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.9 --pitch_control 1
Removing weight norm...
Raw Text Sequence: testing
Phoneme Sequence: {T EH1 S T IH0 NG}
Traceback (most recent call last):
  File "synthesize.py", line 207, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "synthesize.py", line 95, in synthesize
    output = model(
  File "C:\ProgramData\Anaconda3\envs\pttf2cu111py38\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\FastPitchFormant\model\FastPitchFormant.py", line 89, in forward
    formant_hidden = self.formant_generator(h, mel_masks)
  File "C:\ProgramData\Anaconda3\envs\pttf2cu111py38\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\FastPitchFormant\model\modules.py", line 329, in forward
    output, enc_slf_attn = enc_layer(
  File "C:\ProgramData\Anaconda3\envs\pttf2cu111py38\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\FastPitchFormant\model\blocks.py", line 109, in forward
    enc_output, enc_slf_attn = self.slf_attn(
  File "C:\ProgramData\Anaconda3\envs\pttf2cu111py38\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\FastPitchFormant\model\blocks.py", line 162, in forward
    output, attn = self.attention(q, k, v, mask=mask)
  File "C:\ProgramData\Anaconda3\envs\pttf2cu111py38\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\FastPitchFormant\model\blocks.py", line 189, in forward
    attn = attn.masked_fill(mask, -np.inf)
RuntimeError: The size of tensor a (32) must match the size of tensor b (34) at non-singleton dimension 2

opened by MaxGodTier 1

Releases(v1.0.0)

v1.0.0(Aug 3, 2021)

First Official Release
Source code(tar.gz)
Source code(zip)
v0.1.0(Jul 17, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Keon Lee

GitHub Repository

The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".

Codebase for learning control flow in transformers The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformer

24 Oct 15, 2022

wmctrl ported to Python Ctypes

work in progress wmctrl is a command that can be used to interact with an X Window manager that is compatible with the EWMH/NetWM specification. wmctr

22 Dec 31, 2022

A nutritional label for food for thought.

Lexiscore As a first effort in tackling the theme of information overload in content consumption, I've been working on the lexiscore: a nutritional la

34 Nov 08, 2022

A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

27 Aug 08, 2022

YOLOPのPythonでのONNX推論サンプル

YOLOP-ONNX-Video-Inference-Sample YOLOPのPythonでのONNX推論サンプルです。 ONNXモデルは、hustvl/YOLOP/weights を使用しています。 Requirement OpenCV 3.4.2 or later onnxruntime 1.

8 Sep 05, 2022

CharacterGAN: Few-Shot Keypoint Character Animation and Reposing

CharacterGAN Implementation of the paper "CharacterGAN: Few-Shot Keypoint Character Animation and Reposing" by Tobias Hinz, Matthew Fisher, Oliver Wan

181 Dec 27, 2022

LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping

LVI-SAM This repository contains code for a lidar-visual-inertial odometry and mapping system, which combines the advantages of LIO-SAM and Vins-Mono

1.1k Dec 27, 2022

Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

3.5k Jan 01, 2023

Yolov5 + Deep Sort with PyTorch

딥소트 수정중 Yolov5 + Deep Sort with PyTorch Introduction This repository contains a two-stage-tracker. The detections generated by YOLOv5, a family of obj

1 Nov 26, 2021

Code for "NeRS: Neural Reflectance Surfaces for Sparse-View 3D Reconstruction in the Wild," in NeurIPS 2021

Code for Neural Reflectance Surfaces (NeRS) [arXiv] [Project Page] [Colab Demo] [Bibtex] This repo contains the code for NeRS: Neural Reflectance Surf

234 Dec 30, 2022

This repository contains all data used for writing a research paper Multiple Object Trackers in OpenCV: A Benchmark, presented in ISIE 2021 conference in Kyoto, Japan.

OpenCV-Multiple-Object-Tracking Python is version 3.6.7 to install opencv: pip uninstall opecv-python pip uninstall opencv-contrib-python pip install

6 Dec 19, 2021

PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Related tags

Overview

FastPitchFormant - PyTorch Implementation

Quickstart

Dependencies

Inference

Batch Inference

Controllability

Training

Datasets

Preprocessing

Training

TensorBoard

Implementation Issues

Citation

References

You might also like...

PyTorch implementation of Tacotron speech synthesis model.

PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Implementation of Diverse Semantic Image Synthesis via Probability Distribution Modeling

A Flow-based Generative Network for Speech Synthesis

Official implementation of the paper: "LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech"

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you.

Comments

Error when duration_control is <1

Releases(v1.0.0)

v1.0.0(Aug 3, 2021)

v0.1.0(Jul 17, 2021)

Owner

Keon Lee

The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".

wmctrl ported to Python Ctypes

A nutritional label for food for thought.

A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

YOLOPのPythonでのONNX推論サンプル

CharacterGAN: Few-Shot Keypoint Character Animation and Reposing

LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping

Sequence modeling benchmarks and temporal convolutional networks

Yolov5 + Deep Sort with PyTorch

Code for "NeRS: Neural Reflectance Surfaces for Sparse-View 3D Reconstruction in the Wild," in NeurIPS 2021

This repository contains all data used for writing a research paper Multiple Object Trackers in OpenCV: A Benchmark, presented in ISIE 2021 conference in Kyoto, Japan.

Model-based 3D Hand Reconstruction via Self-Supervised Learning, CVPR2021

Official repository for Hierarchical Opacity Propagation for Image Matting

The AWS Certified SysOps Administrator

LUKE -- Language Understanding with Knowledge-based Embeddings

MakeItTalk: Speaker-Aware Talking-Head Animation

Offline Multi-Agent Reinforcement Learning Implementations: Solving Overcooked Game with Data-Driven Method

Implementing SYNTHESIZER: Rethinking Self-Attention in Transformer Models using Pytorch

Prososdy Morph: A python library for manipulating pitch and duration in an algorithmic way, for resynthesizing speech.

AOT-GAN for High-Resolution Image Inpainting (codebase for image inpainting)