YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Overview

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

In our recent paper we propose the YourTTS model. YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.

Audios samples

Visit our website for audio samples.

Implementation

All of our experiments were implemented on the Coqui TTS repo. (Still a PR).

Colab Demos

Demo URL
Zero-Shot TTS link
Zero-Shot VC link

Checkpoints

All the released checkpoints are licensed under CC BY-NC-ND 4.0

Model URL
Speaker Encoder link
Exp 1. YourTTS-EN(VCTK) link
Exp 1. YourTTS-EN(VCTK) + SCL link
Exp 2. YourTTS-EN(VCTK)-PT link
Exp 2. YourTTS-EN(VCTK)-PT + SCL link
Exp 3. YourTTS-EN(VCTK)-PT-FR link
Exp 3. YourTTS-EN(VCTK)-PT-FR SCL link
Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL link

Results replicability

To insure replicability, we make the audios used to generate the MOS available here. In addition, we provide the MOS for each audio here.

To re-generate our MOS results, follow the instructions here. To predict the test sentences and generate the SECS, please use the Jupyter Notebooks available here.

Comments
  • Languages other than PT, FR, EN

    Languages other than PT, FR, EN

    As YourTTS is multilingual TTS, I think that by training datasets, it seems that other languages might be available. However, YourTTS's checkpoint structure seems distinctive. Is there any training procedures that I can refer?

    opened by papercore-dev 7
  • Issue with Input type and weight type should be the same

    Issue with Input type and weight type should be the same

    Hi,

    I am trying to train YourTTS on my own dataset. So I followed your helpful guide with the latest stable version of Coqui TTS (0.8.0).

    After computing the embeddings (on GPU) without issue, I run into this RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same.

    I have already trained a VITS model with this dataset so everything is already set up. I understood that input Tensor resides on GPU whereas weight Tensor resides on CPU but how can I solve this ? Should I downgrade to CoquiTTS 0.6.2 ?

    Here is the full traceback :

    File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1533, in fit
        self._fit()
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1517, in _fit
        self.train_epoch()
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1282, in train_epoch
        _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 1135, in train_step
        outputs, loss_dict_new, step_time = self._optimize(
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 996, in _optimize
        outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/trainer/trainer.py", line 954, in _model_train_step
        return model.train_step(*input_args)
      File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1250, in train_step
        outputs = self.forward(
      File "/home/caraduf/YourTTS/TTS/TTS/tts/models/vits.py", line 1049, in forward
        pred_embs = self.speaker_manager.encoder.forward(wavs_batch, l2_norm=True)
      File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/resnet.py", line 169, in forward
        x = self.torch_spec(x)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
        input = module(input)
      File "/home/caraduf/YourTTS/yourtts_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/caraduf/YourTTS/TTS/TTS/encoder/models/base_encoder.py", line 22, in forward
        return torch.nn.functional.conv1d(x, self.filter).squeeze(1)
    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
    

    Thanks for helping me out!

    opened by Ca-ressemble-a-du-fake 6
  •  Speaker Encoder train on new language

    Speaker Encoder train on new language

    Hi, Can you elaborate about the source of where you get Speaker Encoder, and how do you train it with additional languages? How do you use model Wav2Vec which trained from fairseq? on config_se.json "run_description": "resnet speaker encoder trained with commonvoice all languages dev and train, Voxceleb 1 dev and Voxceleb 2 dev". Which languages include in this CV? which version of CV in this training? Thanks.

    opened by ikcla 5
  • YourTTS_zeroshot_VC_demo.ipynb

    YourTTS_zeroshot_VC_demo.ipynb

    Hi! I am trying to run YourTTS_zeroshot_VC_demo.ipynb and there seems to be access changes to the file best_model.pth.tar I am downloading it right now and I will manually upload it, so that I can run the notebook, but could you kindly fix the access rights so that others could easily run it like it was before. Thank you in advance! image

    opened by stalevna 5
  • train our own voice model

    train our own voice model

    Hi ,

    I have found your repo very interesting. So, I am trying out this. I am curious to know about training our voice files to creating checkpoint without involvement of text(As i have seen in previous issues to take reference of coqui model training) and without altering config.json. Can you please guide us how to proceed on this further.

    opened by chandrakanthlns 4
  • Train YourTTS on another language

    Train YourTTS on another language

    Good day!

    I have several questions, could you please help?

    Do I understand correctly that if I want to train the model on another language it is better to fine tune this model (YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL): https://drive.google.com/drive/folders/15G-QS5tYQPkqiXfAdialJjmuqZV0azQV Or it is better to use other checkpoints.

    How many hours of audio is needed to have appropriate quality?

    I planned to use Common Voice Corpus to fine-tune the model on a new language, however, the audio format is mp3 not wav. Do I need to convert all the audio files or I can use mp3 format. If yes, how?

    Thank you for your time in advance!

    opened by annaklyueva 4
  • Select Speakers for Zero Shot TTS

    Select Speakers for Zero Shot TTS

    Hi ,

    Firstly great work on the project with time trying to understand the repo with more clarity. Wanted to know how can I select different speakers for different sections of text .

    Thanks in advance.

    opened by dipanjannC 4
  • From which version does coqui TTS starts supporting voice conversions and cloning?

    From which version does coqui TTS starts supporting voice conversions and cloning?

    Hi @Edresson, I am fairly new into the feild so please forgive for naive question. I am trying to use voice cloning feature. I trained a model on coqui-ai version 0.6 and in that installed environment. And I am using the command below to get the cloning done but it gives error that tts command does not expect "reference_wav" tts --model_path trained_model/best_model.pth.tar --config_path trained_model/config.json --speaker_idx "icici" --out_path output.wav --reference_wav target_content/asura_10secs.wav which might be because it did not support voice conversion then. Can you please confirm? Also, the model trained on version 0.6 doesn't run with latest version and ends up in dimension mismatch error which I am assuming due to model structure change probably. Please shed some light on this, It'll be really helpful.

    opened by tieincred 3
  • finetune VC on my voice

    finetune VC on my voice

    I would like to finetune yourTTS voice conversion on my own voice, and compare it to the zero-shot model. Could you provide the finetuning procedure for VC?

    opened by odeliazavlianovSC 3
  • Exp 1. YourTTS-EN(VCTK) + SCL(speaker encoder layers are not initialized )

    Exp 1. YourTTS-EN(VCTK) + SCL(speaker encoder layers are not initialized )

    I tried to run an experiment similar to Exp 1. YourTTS-EN(VCTK) + SCL initializing use_speaker_encoder_as_loss=true, speaker_encoder_loss_alpha=9.0, speaker_encoder_config_path and speaker_encoder_model_path(downloaded them from your google disk

    So my config file is almost identical to the one you have for the experiment(I don't have fine_tuning_mode=0, but I checked and 0 means disabled, so it shouldn't affect anything. Also use_speaker_embedding=false, otherwise it complains that vectors are initialized)

    My problem is when I print out model weights keys of your model and mine I have speaker encoder layers missing. They are not initialized for some reason. Unfortunately, I don't have any ideas why this could be happening :( Could you maybe point out a direction and what could I check?

      "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 0.8,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": true,
        "num_speakers": 97,
        "speakers_file": null,
        "d_vector_file": "../speaker_embeddings/new-SE/VCTK+TTS-PT+MAILABS-FR/speakers.json",
        "speaker_embedding_channels": 512,
        "use_d_vector_file": true,
        "d_vector_dim": 512,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "use_speaker_encoder_as_loss": true,
        "speaker_encoder_config_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/config.json",
        "speaker_encoder_model_path": "../checkpoints/Speaker_Encoder/Resnet-original-paper/converted_checkpoint.pth.tar",
        "fine_tuning_mode": 0,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false
    
    opened by stalevna 3
  • Zeroshot TTS notebook no longer working

    Zeroshot TTS notebook no longer working

    Hi @Edresson @WeberJulian

    the demo notebook is no longer working with the current TTS master repo.

    I'm having hard time to execute things.

    Do you intend to adjust ? thanks

    opened by vince62s 3
Owner
Edresson Casanova
Computer Science PhD Student
Edresson Casanova
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
Official PyTorch implementation of "Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics".

Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics This repository is the official PyTorch implementation of "Physics-aware Differ

USC-Melady 46 Nov 20, 2022
[NeurIPS 2021] “Improving Contrastive Learning on Imbalanced Data via Open-World Sampling”,

Improving Contrastive Learning on Imbalanced Data via Open-World Sampling Introduction Contrastive learning approaches have achieved great success in

VITA 24 Dec 17, 2022
PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

Wenwen Yu 498 Dec 24, 2022
Differentiable Neural Computers, Sparse Access Memory and Sparse Differentiable Neural Computers, for Pytorch

Differentiable Neural Computers and family, for Pytorch Includes: Differentiable Neural Computers (DNC) Sparse Access Memory (SAM) Sparse Differentiab

ixaxaar 302 Dec 14, 2022
Source code for paper "Deep Superpixel-based Network for Blind Image Quality Assessment"

DSN-IQA Source code for paper "Deep Superpixel-based Network for Blind Image Quality Assessment" Requirements Python =3.8.0 Pytorch =1.7.1 Usage wit

7 Oct 13, 2022
End-To-End Crowdsourcing

End-To-End Crowdsourcing Comparison of traditional crowdsourcing approaches to a state-of-the-art end-to-end crowdsourcing approach LTNet on sentiment

Andreas Koch 1 Mar 06, 2022
The personal repository of the work: *DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer*.

DanceNet3D The personal repository of the work: DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer. Dataset and Results Pleas

南嘉Nanga 36 Dec 21, 2022
HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

HiFiGAN Denoiser This is a Unofficial Pytorch implementation of the paper HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep F

Rishikesh (ऋषिकेश) 134 Dec 27, 2022
MarcoPolo is a clustering-free approach to the exploration of bimodally expressed genes along with group information in single-cell RNA-seq data

MarcoPolo is a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering Overview MarcoPolo

Chanwoo Kim 13 Dec 18, 2022
COD-Rank-Localize-and-Segment (CVPR2021)

COD-Rank-Localize-and-Segment (CVPR2021) Simultaneously Localize, Segment and Rank the Camouflaged Objects Full camouflage fixation training dataset i

JingZhang 52 Dec 20, 2022
The 3rd place solution for competition

The 3rd place solution for competition "Lyft Motion Prediction for Autonomous Vehicles" at Kaggle Team behind this solution: Artsiom Sanakoyeu [Homepa

Artsiom 104 Nov 22, 2022
On Effective Scheduling of Model-based Reinforcement Learning

On Effective Scheduling of Model-based Reinforcement Learning Code to reproduce the experiments in On Effective Scheduling of Model-based Reinforcemen

laihang 8 Oct 07, 2022
Omniscient Video Super-Resolution

Omniscient Video Super-Resolution This is the official code of OVSR (Omniscient Video Super-Resolution, ICCV 2021). This work is based on PFNL. Datase

36 Oct 27, 2022
Demonstrational Session git repo for H SAF User Workshop (28/1)

5th H SAF User Workshop The 5th H SAF User Workshop supported by EUMeTrain will be held in online in January 24-28 2022. This repository contains inst

H SAF 4 Aug 04, 2022
Minimal fastai code needed for working with pytorch

fastai_minima A mimal version of fastai with the barebones needed to work with Pytorch #all_slow Install pip install fastai_minima How to use This lib

Zachary Mueller 14 Oct 21, 2022
Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification.

Easy Few-Shot Learning Ready-to-use code and tutorial notebooks to boost your way into few-shot image classification. This repository is made for you

Sicara 399 Jan 08, 2023
Deep Learning Pipelines for Apache Spark

Deep Learning Pipelines for Apache Spark The repo only contains HorovodRunner code for local CI and API docs. To use HorovodRunner for distributed tra

Databricks 2k Jan 08, 2023
PyTorch implementation of Algorithm 1 of "On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models"

Code for On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models This repository will reproduce the main results from our pape

Mitch Hill 32 Nov 25, 2022
Memory efficient transducer loss computation

Introduction This project implements the optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition to re

Fangjun Kuang 51 Nov 25, 2022