Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Overview

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples are a result of feeding audio reconstructions of this VAE-GAN to an implementation of WaveNet.

1. Data Preperation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# The directory under each speaker cannot be nested.

Here is an example script for setting up data preparation from the Flickr8k Audio Corpus. The speakers of interest are the same as in the paper, but may be modified to other speakers if desirable.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable it. Alternatively you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. With smaller datasets than Flickr8k it may be more appropriate to save less frequently by adding --checkpoint_interval 20 for 20 epochs.

3.3. Epochs

The max number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this to more than the default 100. To load a pretrained model you can use --epoch and set it to the epoch number of the saved model.

3.4. Pretrained Model

You can access pretrained model files here. By downloading and storing them in a directory src/saved_models/pretrained, you may call it for training or inference with:

--model_name pretrained --epoch 99

Note that for inference the discriminator files D1 and D2 are not required (meanwhile for training further they are). Also here, G1 refers to the decoding generator for speaker 1 (female) and G2 for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by; sliding a window over a full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlap. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]

For achieving high quality results like the paper you can feed the reconstructed audio to trained vocoders such as WaveNet. An example pipeline of using this model with wavenet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.

4.2. Visualization

By default, plotting input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable set --plot -1

4.3. Reconstructive Evaluation

Alongside the process of generating, components for reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This is an extra feature to help with comparing the reconstructive capabilities of different models. The higher the SSIM, the higher quality the reconstruction.

References

Citation

If you find this code useful please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

  • Rewrite preprocess.py to handle:
    • multi-process feature extraction
    • display error messages for failed cases
  • Create:
    • Notebook for data visualisation
  • Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.
Owner
Ehab AlBadawy
Ehab AlBadawy
Python scripts for performing road segemtnation and car detection using the HybridNets multitask model in ONNX.

ONNX-HybridNets-Multitask-Road-Detection Python scripts for performing road segemtnation and car detection using the HybridNets multitask model in ONN

Ibai Gorordo 45 Jan 01, 2023
Code for Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

🍐 quince Code for Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding 🍐 Installation $ git clone

Andrew Jesson 19 Jun 23, 2022
PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition, CVPR 2018

PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place

Mikaela Uy 294 Dec 12, 2022
AttentionGAN for Unpaired Image-to-Image Translation & Multi-Domain Image-to-Image Translation

AttentionGAN-v2 for Unpaired Image-to-Image Translation AttentionGAN-v2 Framework The proposed generator learns both foreground and background attenti

Hao Tang 530 Dec 27, 2022
Official implementation of "Motif-based Graph Self-Supervised Learning forMolecular Property Prediction"

Motif-based Graph Self-Supervised Learning for Molecular Property Prediction Official Pytorch implementation of NeurIPS'21 paper "Motif-based Graph Se

zaixi 71 Dec 20, 2022
Simple Python application to transform Serial data into OSC messages

SerialToOSC-Bridge Simple Python application to transform Serial data into OSC messages. The current purpose is to be a compatibility layer between ha

Division of Applied Acoustics at Chalmers University of Technology 3 Jun 03, 2021
x-transformers-paddle 2.x version

x-transformers-paddle x-transformers-paddle 2.x version paddle 2.x版本 https://github.com/lucidrains/x-transformers 。 requirements paddlepaddle-gpu==2.2

yujun 7 Dec 08, 2022
Unsupervised phone and word segmentation using dynamic programming on self-supervised VQ features.

Unsupervised Phone and Word Segmentation using Vector-Quantized Neural Networks Overview Unsupervised phone and word segmentation on speech data is pe

Herman Kamper 13 Dec 11, 2022
An investigation project for SISR.

SISR-Survey An investigation project for SISR. This repository is an official project of the paper "From Beginner to Master: A Survey for Deep Learnin

Juncheng Li 79 Oct 20, 2022
Explainable Medical ImageSegmentation via GenerativeAdversarial Networks andLayer-wise Relevance Propagation

MedAI: Transparency in Medical Image Segmentation What is this repo This repo contains the code and experiments that are implemented to contribute in

Awadelrahman M. A. Ahmed 1 Nov 22, 2021
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch

CoCa - Pytorch Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in Pytorch. They were able to elegantly fit in contras

Phil Wang 565 Dec 30, 2022
中文语音识别系列,读者可以借助它快速训练属于自己的中文语音识别模型,或直接使用预训练模型测试效果。

MASR中文语音识别(pytorch版) 开箱即用 自行训练 使用与训练分离(增量训练) 识别率高 说明:因为每个人电脑机器不同,而且有些安装包安装起来比较麻烦,强烈建议直接用我编译好的docker环境跑 目前docker基础环境为ubuntu-cuda10.1-cudnn7-pytorch1.6.

发送小信号 180 Dec 17, 2022
SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020, Oral)

SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020 Oral) Figure: Face image editing controlled via style images and segmenta

Peihao Zhu 579 Dec 30, 2022
Our VMAgent is a platform for exploiting Reinforcement Learning (RL) on Virtual Machine (VM) scheduling tasks.

VMAgent is a platform for exploiting Reinforcement Learning (RL) on Virtual Machine (VM) scheduling tasks. VMAgent is constructed based on one month r

56 Dec 12, 2022
PyTorch implementation of ICLR 2022 paper PiCO: Contrastive Label Disambiguation for Partial Label Learning

PiCO: Contrastive Label Disambiguation for Partial Label Learning This is a PyTorch implementation of ICLR 2022 Oral paper PiCO; also see our Project

王皓波 147 Jan 07, 2023
Part-aware Measurement for Robust Multi-View Multi-Human 3D Pose Estimation and Tracking

Part-aware Measurement for Robust Multi-View Multi-Human 3D Pose Estimation and Tracking Part-Aware Measurement for Robust Multi-View Multi-Human 3D P

19 Oct 27, 2022
Real-time Joint Semantic Reasoning for Autonomous Driving

MultiNet MultiNet is able to jointly perform road segmentation, car detection and street classification. The model achieves real-time speed and state-

Marvin Teichmann 518 Dec 12, 2022
ImageNet Adversarial Image Evaluation

ImageNet Adversarial Image Evaluation This repository contains the code and some materials used in the experimental work presented in the following pa

Utku Ozbulak 11 Dec 26, 2022
Low-code/No-code approach for deep learning inference on devices

EzEdgeAI A concept project that uses a low-code/no-code approach to implement deep learning inference on devices. It provides a componentized framewor

On-Device AI Co., Ltd. 7 Apr 05, 2022
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning By Zhenda Xie*, Yutong Lin*, Zheng Zhang, Yue Ca

Zhenda Xie 293 Dec 20, 2022