StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Last update: Jan 01, 2023

Overview

Yinghao Aaron Li, Ali Zare, Nima Mesgarani

We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained only with 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods without the need for text labels. Moreover, our model is completely convolutional and with a faster-than-real-time vocoder such as Parallel WaveGAN can perform real-time voice conversion.

Paper: https://arxiv.org/abs/2107.10394

Audio samples: https://starganv2-vc.github.io/

Pre-requisites

Python >= 3.7
Clone this repository:

git clone https://github.com/yl4579/StarGANv2-VC.git
cd StarGANv2-VC

Install python requirements:

pip install SoundFile torchaudio munch parallel_wavegan torch pydub

Download and extract the VCTK dataset and use VCTK.ipynb to prepare the data (downsample to 24 kHz etc.). Alternatively, download the dataset we have prepared, unzip it to the Data folder, and use the provided config.yml to reproduce our models.
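Before training, it can help to confirm that every prepared file really is at 24 kHz. The following is a minimal sketch and not part of the repository: the function name find_wrong_sample_rate is hypothetical, and it assumes the prepared files are PCM WAV files that Python's standard wave module can open (float-encoded WAV files would raise wave.Error).

```python
import wave
from pathlib import Path

TARGET_SAMPLE_RATE = 24000  # 24 kHz, the rate the data is downsampled to


def find_wrong_sample_rate(data_dir, target_rate=TARGET_SAMPLE_RATE):
    """Return (path, rate) for every .wav under data_dir not at target_rate.

    Hypothetical helper: assumes PCM WAV files readable by the wave module.
    """
    mismatched = []
    for wav_path in sorted(Path(data_dir).rglob("*.wav")):
        # Only the header is needed to read the sample rate.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target_rate:
            mismatched.append((wav_path, rate))
    return mismatched
```

Calling find_wrong_sample_rate("Data") returns an empty list when every file under the Data folder is already at 24 kHz; any entries it returns still need to be resampled.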

Training

python train.py --config_path ./Configs/config.yml

Specify the training and validation data in config.yml, and change num_domains to the number of speakers in the dataset. Each line of the data list must have the format filename.wav|speaker_number; see train_list.txt for an example.
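The data list format can be checked before a long training run. The sketch below is illustrative and not taken from the repository: the helper names are hypothetical, and it assumes speaker numbers are zero-based integers that must be smaller than num_domains, which you should confirm against train_list.txt.

```python
from collections import Counter


def format_entry(wav_path, speaker_number):
    """Return one data-list line in the form filename.wav|speaker_number."""
    return f"{wav_path}|{speaker_number}"


def parse_data_list(lines):
    """Parse data-list lines into (path, speaker_number) pairs, skipping blanks."""
    entries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        # Split on the last "|" so only the final field is read as the speaker.
        path, sep, speaker = line.rpartition("|")
        if not sep or not path:
            raise ValueError(
                f"line {line_number}: expected 'filename.wav|speaker_number'"
            )
        entries.append((path, int(speaker)))
    return entries


def check_num_domains(entries, num_domains):
    """Count utterances per speaker and reject speaker numbers outside the range.

    Assumption (not confirmed by the README): speaker numbers are zero-based.
    """
    counts = Counter(speaker for _, speaker in entries)
    out_of_range = sorted(s for s in counts if not 0 <= s < num_domains)
    if out_of_range:
        raise ValueError(
            f"speaker numbers {out_of_range} do not fit num_domains={num_domains}"
        )
    return counts
```

For example, passing the lines of a list file to parse_data_list and then calling check_num_domains(entries, num_domains) with the value from config.yml gives a per-speaker utterance count, or raises ValueError if a speaker number does not fit.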

Checkpoints and TensorBoard logs will be saved at log_dir. To speed up training, you may want to make batch_size as large as your GPU RAM can take. Note, however, that batch_size = 5 already takes around 10 GB of GPU RAM.

Inference

Please refer to inference.ipynb for details.

The pretrained StarGANv2 and ParallelWaveGAN models trained on the VCTK corpus can be downloaded at StarGANv2 Link and ParallelWaveGAN Link. Please unzip them to Models and Vocoder, respectively, and run each cell in the notebook.

ASR & F0 Models

The pretrained F0 and ASR models are provided under the Utils folder. Both the F0 and ASR models are trained with melspectrograms preprocessed using meldataset.py, and both models are trained on speech data only.

The ASR model is trained on an English corpus, but it appears to work when training StarGANv2 models in other languages such as Japanese. The F0 model also appears to work with singing data. For the best performance, however, training your own ASR and F0 models is encouraged for non-English and non-speech data.

You can edit meldataset.py to use your own melspectrogram preprocessing, but the provided pretrained models will then no longer work, and you will need to train your own ASR and F0 models with the new preprocessing. For example, you may refer to the repositories Diamondfan/CTC_pytorch and keums/melodyExtraction_JDC to train your own ASR and F0 models.

References

Acknowledgement

The author would like to thank @tosaka-m for his great repository and valuable discussions.
