Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

arXiv | GitHub Stars | downloads | Hugging Face | 中文文档

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.



Our SyntaSpeech is built on the basis of PortaSpeech (NeurIPS 2021) with three new features:

  1. We propose Syntactic Graph Builder (Sec. 3.1) and Syntactic Graph Encoder (Sec. 3.2), which is proved to be an effective unit to extract syntactic features to improve the prosody modeling and duration accuracy of TTS model.
  2. We introduce Multi-Length Adversarial Training (Sec. 3.3), which could replace the flow-based post-net in PortaSpeech, speeding up the inference time and improving the audio quality naturalness.
  3. We support three datasets: LJSpeech (single-speaker English dataset), Biaobei (single-speaker Chinese dataset) , and LibriTTS (multi-speaker English dataset).

Environments

conda create -n synta python=3.7
condac activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural network, dgl-cu102 supports rtx2080, dgl-cu113 support rtx3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install force alignment tools

Run SyntaSpeech!

Please follow the following steps to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

As for LibriTTS, you can download the raw datasets and process them with our data_gen modules. Detailed instructions can be found in dosc/prepare_data.

Vocoder Preparation

We provide the pre-trained model of vocoders for three datasets. Specifically, Hifi-GAN for LJSpeech and Biaobei, ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

Then you can train SyntaSpeech in the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name libritts_synta --reset ---infer # inference in LibriTTS

Audio Demos

Audio samples in the paper can be found in our demo page.

We also provide HuggingFace Demo Page for LJSpeech. Try your interesting sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our codes are based on the following repos:

Comments
  • pinyin preprocess problem

    pinyin preprocess problem

    005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4? ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

    txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

    ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

    what is 'a_?_n_ao3'

    in the mfa_dict it appears ch_a1_d_ou1 ,a_?_n_ao3 and so on

    opened by windowxiaoming 2
  • discriminator output['y_c'] never used

    discriminator output['y_c'] never used

    Discriminator's output['y_c'] never used, and never calculated in discriminator forward func. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

    opened by mayfool 2
  • A question of KL divergence calculation

    A question of KL divergence calculation

    In modules/tts/portaspeech/fvae.py, SyntaFVAE compute loss_kl (line 121) , Can someone help explain why loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1],I think loss_kl should be compute by loss_kl = logqx.exp()*(logqx - logpx) I would be very grateful if you could reply to me!

    opened by JiaYK 2
  • mfa for multi speaker.

    mfa for multi speaker.

    In the code, group MFA inputs for better parallelism. For multi speaker, it maybe go wrong. For input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1. The TexGrid is

    	item [1]:
    		class = "IntervalTier"
    		name = "words"
    		xmin = 0.0
    		xmax = 9.4444
    		intervals: size = 56
    			intervals [1]:
    				xmin = 0
    				xmax = 0.5700000000000001
    				text = ""
    			intervals [2]:
    				xmin = 0.5700000000000001
    				xmax = 0.61
    				text = "eng"
    			intervals [3]:
    				xmin = 0.61
    				xmax = 0.79
    				text = "s_an1"
    			intervals [4]:
    				xmin = 0.79
    				xmax = 0.89
    				text = "eng"
    			intervals [5]:
    				xmin = 0.89
    				xmax = 1.06
    				text = "i1"
    			intervals [6]:
    				xmin = 1.06
    				xmax = 1.24
    				text = "eng"
    			intervals [7]:
    				xmin = 1.24
    				xmax = 1.3
    				text = ""
    			intervals [8]:
    				xmin = 1.3
    				xmax = 1.36
    				text = "s_an1"
    			intervals [9]:
    				xmin = 1.36
    				xmax = 1.42
    				text = ""
    			intervals [10]:
    				xmin = 1.42
    				xmax = 1.49
    				text = "eng"
    			intervals [11]:
    				xmin = 1.49
    				xmax = 1.67
    				text = "s_i4"
    			intervals [12]:
    				xmin = 1.67
    				xmax = 1.78
    				text = "eng"
    			intervals [13]:
    				xmin = 1.78
    				xmax = 1.91
    				text = ""
    			intervals [14]:
    				xmin = 1.91
    				xmax = 1.96
    				text = "er4"
    			intervals [15]:
    				xmin = 1.96
    				xmax = 2.06
    				text = "eng"
    			intervals [16]:
    				xmin = 2.06
    				xmax = 2.19
    				text = ""
    			intervals [17]:
    				xmin = 2.19
    				xmax = 2.35
    				text = "i1"
    			intervals [18]:
    				xmin = 2.35
    				xmax = 2.53
    				text = "eng"
    			intervals [19]:
    				xmin = 2.53
    				xmax = 3.03
    				text = "i1"
    			intervals [20]:
    				xmin = 3.03
    				xmax = 3.42
    				text = "eng"
    			intervals [21]:
    				xmin = 3.42
    				xmax = 3.48
    				text = "i1"
    			intervals [22]:
    				xmin = 3.48
    				xmax = 3.6
    				text = ""
    			intervals [23]:
    				xmin = 3.6
    				xmax = 3.64
    				text = "eng"
    			intervals [24]:
    				xmin = 3.64
    				xmax = 3.86
    				text = "i1"
    			intervals [25]:
    				xmin = 3.86
    				xmax = 3.99
    				text = "eng"
    			intervals [26]:
    				xmin = 3.99
    				xmax = 4.59
    				text = ""
    			intervals [27]:
    				xmin = 4.59
    				xmax = 4.869999999999999
    				text = "er4"
    			intervals [28]:
    				xmin = 4.869999999999999
    				xmax = 4.9799999999999995
    				text = "eng"
    			intervals [29]:
    				xmin = 4.9799999999999995
    				xmax = 5.1899999999999995
    				text = "s_i4"
    			intervals [30]:
    				xmin = 5.1899999999999995
    				xmax = 5.34
    				text = ""
    			intervals [31]:
    				xmin = 5.34
    				xmax = 5.43
    				text = "eng"
    			intervals [32]:
    				xmin = 5.43
    				xmax = 5.6
    				text = ""
    			intervals [33]:
    				xmin = 5.6
    				xmax = 5.76
    				text = "i1"
    			intervals [34]:
    				xmin = 5.76
    				xmax = 6.279999999999999
    				text = "eng"
    			intervals [35]:
    				xmin = 6.279999999999999
    				xmax = 6.359999999999999
    				text = "s_an1"
    			intervals [36]:
    				xmin = 6.359999999999999
    				xmax = 6.47
    				text = ""
    			intervals [37]:
    				xmin = 6.47
    				xmax = 6.6
    				text = "eng"
    			intervals [38]:
    				xmin = 6.6
    				xmax = 6.9399999999999995
    				text = "i1"
    			intervals [39]:
    				xmin = 6.9399999999999995
    				xmax = 7.039999999999999
    				text = "eng"
    			intervals [40]:
    				xmin = 7.039999999999999
    				xmax = 7.289999999999999
    				text = "s_an1"
    			intervals [41]:
    				xmin = 7.289999999999999
    				xmax = 7.369999999999999
    				text = "eng"
    			intervals [42]:
    				xmin = 7.369999999999999
    				xmax = 7.6
    				text = "s_i4"
    			intervals [43]:
    				xmin = 7.6
    				xmax = 7.699999999999999
    				text = "eng"
    			intervals [44]:
    				xmin = 7.699999999999999
    				xmax = 7.869999999999999
    				text = ""
    			intervals [45]:
    				xmin = 7.869999999999999
    				xmax = 8.049999999999999
    				text = "er4"
    			intervals [46]:
    				xmin = 8.049999999999999
    				xmax = 8.26
    				text = ""
    			intervals [47]:
    				xmin = 8.26
    				xmax = 8.299999999999999
    				text = "eng"
    			intervals [48]:
    				xmin = 8.299999999999999
    				xmax = 8.36
    				text = "s_i4"
    			intervals [49]:
    				xmin = 8.36
    				xmax = 8.389999999999999
    				text = ""
    			intervals [50]:
    				xmin = 8.389999999999999
    				xmax = 8.42
    				text = "eng"
    			intervals [51]:
    				xmin = 8.42
    				xmax = 8.45
    				text = ""
    			intervals [52]:
    				xmin = 8.45
    				xmax = 8.59
    				text = "s_an1"
    			intervals [53]:
    				xmin = 8.59
    				xmax = 8.83
    				text = ""
    			intervals [54]:
    				xmin = 8.83
    				xmax = 9.1
    				text = "eng"
    			intervals [55]:
    				xmin = 9.1
    				xmax = 9.44
    				text = "i1"
    			intervals [56]:
    				xmin = 9.44
    				xmax = 9.4444
    				text = ""
    
    opened by leon2milan 2
  • Problem with DDP

    Problem with DDP

    Hello, I have experimented on your excellent job with this repo. But I found the ddp is not effective. I wonder if the way I used is wrong?

    CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

    opened by zhazl 0
Releases(v1.0.0)
Owner
Zhenhui YE
I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.
Zhenhui YE
TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios

TPH-YOLOv5 This repo is the implementation of "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured

cv516Buaa 439 Dec 22, 2022
Official implementation of "Watermarking Images in Self-Supervised Latent-Spaces"

🔍 Watermarking Images in Self-Supervised Latent-Spaces PyTorch implementation and pretrained models for the paper. For details, see Watermarking Imag

Meta Research 32 Dec 13, 2022
PyTorch implementation of "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks"

DiscoGAN in PyTorch PyTorch implementation of Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. * All samples in READM

Taehoon Kim 1k Jan 04, 2023
Audio Visual Emotion Recognition using TDA

Audio Visual Emotion Recognition using TDA RAVDESS database with two datasets analyzed: Video and Audio dataset: Audio-Dataset: https://www.kaggle.com

Combinatorial Image Analysis research group 3 May 11, 2022
On Generating Extended Summaries of Long Documents

ExtendedSumm This repository contains the implementation details and datasets used in On Generating Extended Summaries of Long Documents paper at the

Georgetown Information Retrieval Lab 76 Sep 05, 2022
ECLARE: Extreme Classification with Label Graph Correlations

ECLARE ECLARE: Extreme Classification with Label Graph Correlations @InProceedings{Mittal21b, author = "Mittal, A. and Sachdeva, N. and Agrawal

Extreme Classification 35 Nov 06, 2022
Official code for paper Exemplar Based 3D Portrait Stylization.

3D-Portrait-Stylization This is the official code for the paper "Exemplar Based 3D Portrait Stylization". You can check the paper on our project websi

60 Dec 07, 2022
PyTorch implementation of "PatchGame: Learning to Signal Mid-level Patches in Referential Games" to appear in NeurIPS 2021

PatchGame: Learning to Signal Mid-level Patches in Referential Games This repository is the official implementation of the paper - "PatchGame: Learnin

Kamal Gupta 22 Mar 16, 2022
Benchmark for the generalization of 3D machine learning models across different remeshing/samplings of a surface.

Discretization Robust Correspondence Benchmark One challenge of machine learning on 3D surfaces is that there are many different representations/sampl

Nicholas Sharp 10 Sep 30, 2022
Benchmarks for semi-supervised domain generalization.

Semi-Supervised Domain Generalization This code is the official implementation of the following paper: Semi-Supervised Domain Generalization with Stoc

Kaiyang 49 Dec 10, 2022
Source code for paper: Knowledge Inheritance for Pre-trained Language Models

Knowledge-Inheritance Source code paper: Knowledge Inheritance for Pre-trained Language Models (preprint). The trained model parameters (in Fairseq fo

THUNLP 31 Nov 19, 2022
Fine-tuning StyleGAN2 for Cartoon Face Generation

Cartoon-StyleGAN 🙃 : Fine-tuning StyleGAN2 for Cartoon Face Generation Abstract Recent studies have shown remarkable success in the unsupervised imag

Jihye Back 520 Jan 04, 2023
Pytorch implementation of few-shot semantic image synthesis

Few-shot Semantic Image Synthesis Using StyleGAN Prior Our method can synthesize photorealistic images from dense or sparse semantic annotations using

40 Sep 26, 2022
CSKG is a commonsense knowledge graph that combines seven popular sources into a consolidated representation

CSKG: The CommonSense Knowledge Graph CSKG is a commonsense knowledge graph that combines seven popular sources into a consolidated representation: AT

USC ISI I2 85 Dec 12, 2022
Code for MarioNette: Self-Supervised Sprite Learning, in NeurIPS 2021

MarioNette | Webpage | Paper | Video MarioNette: Self-Supervised Sprite Learning Dmitriy Smirnov, Michaël Gharbi, Matthew Fisher, Vitor Guizilini, Ale

Dima Smirnov 28 Nov 18, 2022
PyTorch implementation of the Quasi-Recurrent Neural Network - up to 16 times faster than NVIDIA's cuDNN LSTM

Quasi-Recurrent Neural Network (QRNN) for PyTorch Updated to support multi-GPU environments via DataParallel - see the the multigpu_dataparallel.py ex

Salesforce 1.3k Dec 28, 2022
BlockUnexpectedPackets - Preventing BungeeCord CPU overload due to Layer 7 DDoS attacks by scanning BungeeCord's logs

BlockUnexpectedPackets This script automatically blocks DDoS attacks that are sp

SparklyPower 3 Mar 31, 2022
[CVPR2021 Oral] FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation.

FFB6D This is the official source code for the CVPR2021 Oral work, FFB6D: A Full Flow Biderectional Fusion Network for 6D Pose Estimation. (Arxiv) Tab

Yisheng (Ethan) He 201 Dec 28, 2022
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

TorchMultimodal (Alpha Release) Introduction TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

Meta Research 663 Jan 06, 2023
Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, Leyffer, Kirches, and Manns.

Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, L

3 Dec 02, 2022