Includes MelGAN, HifiGAN and Multiband-HifiGAN, and possibly NHV in the future.

Last update: Dec 16, 2022

Related tags: Text Data & NLP

FastVocoder

Overview

Fast (GAN-Based Neural) Vocoder

Chinese README

Todo

- Submit demo
- Support NHV

Description

This repository includes MelGAN, HifiGAN and Multiband-HifiGAN, and may include NHV in the future. It was developed on the BiaoBei dataset; to fit your own dataset and model, modify the files in conf and hparams.py.

Usage

Prepare data
- Write the paths of the wav files into a text file, for example: cd dataset && python3 biaobei.py
- Run: bash preprocess.sh <wav path file> <path to save processed data> dataset/audio dataset/mel
- For example: bash preprocess.sh dataset/BZNSYP.txt processed dataset/audio dataset/mel
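If you are not using BiaoBei, you need your own equivalent of biaobei.py for the first step. The sketch below assumes that the wav path file simply lists one wav path per line; that format is an assumption, so compare it with the output of dataset/biaobei.py before relying on it. The function name write_wav_list is hypothetical.

```python
from pathlib import Path


def write_wav_list(wav_dir, list_file):
    """Write the path of every .wav file under wav_dir to list_file, one per line.

    Hypothetical helper: assumes preprocess.sh reads a plain list of paths.
    Returns the number of paths written.
    """
    wav_paths = sorted(str(p) for p in Path(wav_dir).rglob("*.wav"))
    Path(list_file).write_text("".join(path + "\n" for path in wav_paths), encoding="utf-8")
    return len(wav_paths)
```

For example, write_wav_list("/path/to/your/wavs", "dataset/my_dataset.txt") followed by bash preprocess.sh dataset/my_dataset.txt processed dataset/audio dataset/mel (both paths are placeholders).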

Train

command:

bash train.sh \
    <GPU ids> \
    /path/to/audio/train \
    /path/to/audio/valid \
    /path/to/mel/train \
    /path/to/mel/valid \
    <model name> \
    <if multi band> \
    <if use scheduler> \
    <path to configuration file>

for example:

bash train.sh \
    0 \
    dataset/audio/train \
    dataset/audio/valid \
    dataset/mel/train \
    dataset/mel/valid \
    hifigan \
    0 0 0 \
    conf/hifigan/light.yaml

Train from checkpoint

command:

bash train.sh \
    <GPU ids> \
    /path/to/audio/train \
    /path/to/audio/valid \
    /path/to/mel/train \
    /path/to/mel/valid \
    <model name> \
    <if multi band> \
    <if use scheduler> \
    <path to configuration file> \
    /path/to/checkpoint \
    <step of checkpoint>

Synthesize

command:

bash synthesize.sh \
    /path/to/checkpoint \
    /path/to/mel \
    /path/for/saving/wav \
    <model name> \
    /path/to/configuration/file

Acknowledgments

Comments

Why set L=30?

Hello, I have a question. In the paper, the shape of the basis matrix is [32, 256], but in the code the shape is [30, 256]. According to the function "overlap_and_add", output_size = (frames - 1) * frame_step + frame_length. If L=30, I don't think the output can match the real wave length. For example, with hop_len=256 and mel.shape=[80, 140], the theoretical output wave length is 140*256=35840, but according to the code the output wave length is 33600.

Thanks in advance.

opened by yingfenging 3
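The length arithmetic in this comment can be checked with a plain overlap-and-add. The following is a minimal pure-Python sketch of the quoted formula, not FastVocoder's own overlap_and_add; the frame sizes in the example are illustrative rather than taken from the repository.

```python
def overlap_and_add(frames, frame_step):
    """Sum equal-length frames, each shifted frame_step samples after the previous one.

    Output length follows the formula from the comment:
    output_size = (n_frames - 1) * frame_step + frame_length
    """
    n_frames = len(frames)
    frame_length = len(frames[0])
    output_size = (n_frames - 1) * frame_step + frame_length
    out = [0.0] * output_size
    for i, frame in enumerate(frames):
        start = i * frame_step
        for j, sample in enumerate(frame):
            out[start + j] += sample
    return out


# 140 frames with a step of 256 give exactly 140 * 256 samples only
# when frame_length == frame_step; otherwise the length differs by
# (frame_length - frame_step).
print(len(overlap_and_add([[1.0] * 256 for _ in range(140)], 256)))  # 35840
print(len(overlap_and_add([[1.0] * 512 for _ in range(140)], 256)))  # 36096
```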
Link to Basis-MelGAN paper?

Hi Zhengxi, congrats on your paper's acceptance to Interspeech 2021!

I became quite interested in your paper after reading the abstract of Basis-MelGAN in the README, but I could not find a link to the paper. Since Interspeech is still two months away, do you have any plans to publish the paper on arXiv in the near future?

opened by seungwonpark 2
Random start index in WeightDataset

At this line: https://github.com/xcmyz/FastVocoder/blob/a9af370be896b1096e746ce6489fb16fef8ca585/data/dataset.py#L97

If the input mel is shorter than fix-length, the random call raises an error. I have used try/except to skip these short audio clips, but I wonder whether this case is handled in collate.

In addition, the segment size I found in hifigan is 32, but in basis-melgan it (fix-length) is set to 140. Is there any difference between the value of 140 for Biaobei and the one for LJSpeech?

opened by v-nhandt21 0
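One way to avoid the error for short inputs is to pad instead of sampling a start index. The sketch below is hypothetical (random_segment is not a FastVocoder function, and the repository may handle this differently); it assumes mel is a list of frames and wav holds exactly len(mel) * hop_length samples, and it keeps the audio crop aligned with the mel crop.

```python
import random


def random_segment(mel, wav, fix_length, hop_length, rng=random):
    """Crop an aligned (mel, wav) window of fix_length frames.

    Inputs shorter than fix_length are zero-padded instead of cropped,
    because rng.randint(0, len(mel) - fix_length) raises ValueError when
    the upper bound is negative.
    """
    target_samples = fix_length * hop_length
    if len(mel) < fix_length:
        n_mels = len(mel[0])
        # Assumed padding value: 0.0 for both mel and audio.
        mel = mel + [[0.0] * n_mels for _ in range(fix_length - len(mel))]
        wav = (wav + [0.0] * target_samples)[:target_samples]
        return mel, wav
    start = rng.randint(0, len(mel) - fix_length)  # both bounds inclusive
    return (mel[start:start + fix_length],
            wav[start * hop_length:start * hop_length + target_samples])
```

Padding with 0.0 is a simplification: for log-mel features, silence is usually a large negative value, so the padding value should match your feature normalization. Skipping short clips, as described in the comment, is the other common choice.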
Can basis-melgan be used as a universal vocoder?

I tried it on a single-speaker dataset, and the RTF (real-time factor) surprised me. Have you ever used basis-melgan on a multi-speaker dataset, or is it suitable for TTS synthesis with unseen speakers?

opened by mayfool 0
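RTF is commonly defined as the time spent synthesizing divided by the duration of the generated audio, so values below 1 mean faster than real time. A minimal sketch of that measurement follows; synthesize stands for any hypothetical callable that maps a mel spectrogram to a list of samples, not a FastVocoder API.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """RTF = synthesis time / duration of the generated audio."""
    return elapsed_seconds / (num_samples / sample_rate)


def measure_rtf(synthesize, mel, sample_rate):
    """Time one call of a (hypothetical) synthesize(mel) -> samples function."""
    start = time.perf_counter()
    wav = synthesize(mel)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(wav), sample_rate)


# 0.5 s to generate 10 s of 24 kHz audio gives RTF 0.05 (20x faster than real time).
print(real_time_factor(0.5, 240_000, 24_000))  # 0.05
```

For GPU inference, the device should be synchronized before reading the clock, otherwise the measured time can be too small.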
Shape mismatch error on new dataset

Hi, thanks for your work!

The sampling rate of my dataset is 22050 Hz, and the hop size of my text2mel model is 256. I have changed hparams.py accordingly, but training raises an exception (preprocessing worked fine):

    File "/home/user/speechlab/FastVocoder-main/model/loss/loss.py", line 23, in forward
        assert est_source_sub_band.size(1) == wav_sub_band.size(1)

I found that model inference still uses a hop size of 240. How can I make the code fully compatible with other datasets? It seems that parts of the code are hardcoded for the Biaobei dataset.
opened by tekinek 1
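A mismatch like this usually means the generator's total upsampling no longer equals the hop size of the data. The sketch below assumes, as in multi-band MelGAN-style generators, that the product of the upsampling rates times the number of sub-bands must equal hop_size; the names upsample_rates and n_bands are hypothetical and may not match the keys in FastVocoder's config files.

```python
from math import prod


def check_hop_consistency(upsample_rates, n_bands, hop_size):
    """Raise if the generator's samples-per-frame differs from the data hop size.

    Assumption: total upsampling = prod(upsample_rates) * n_bands
    (n_bands = 1 for a full-band model).
    """
    generated_hop = prod(upsample_rates) * n_bands
    if generated_hop != hop_size:
        raise ValueError(f"generator produces {generated_hop} samples per frame, "
                         f"but the data uses hop_size={hop_size}")
    return generated_hop


print(check_hop_consistency([5, 4, 3], 4, 240))  # 240
print(check_hop_consistency([4, 4, 4], 4, 256))  # 256
```

With 100 mel frames and 4 sub-bands, a generator built for hop 240 would emit 100 * 240 / 4 = 6000 samples per sub-band, while 256-hop audio gives 100 * 256 / 4 = 6400, which is the kind of length difference that would trip the assert in loss.py.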
Multiband Architecture

Hi, I found the note "the generated audio has interference at a specific frequency" in this repo. While developing a similar multiband architecture, I encountered a straight line at a specific frequency. Is this the phenomenon you mentioned? Do you have any advice or solutions? Thanks.
help wanted

opened by Rongjiehuang 6
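A persistent narrowband artifact appears as a horizontal line in a spectrogram, that is, a frequency bin that stays strong in every frame. The NumPy sketch below is a generic diagnostic for locating such a frequency in generated audio, not a fix and not part of FastVocoder; find_tonal_peaks, its 20 dB threshold and its STFT settings are illustrative choices.

```python
import numpy as np


def find_tonal_peaks(wav, sample_rate, n_fft=1024, hop=256, threshold_db=20.0):
    """Return frequencies (Hz) whose time-averaged magnitude is far above the median.

    Requires len(wav) >= n_fft.
    """
    wav = np.asarray(wav, dtype=float)
    window = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * window for i in range(0, len(wav) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (n_frames, n_fft // 2 + 1)
    avg_db = 20 * np.log10(mag.mean(axis=0) + 1e-10)      # average over time
    peaks = np.where(avg_db > np.median(avg_db) + threshold_db)[0]
    return peaks * sample_rate / n_fft                    # bin index -> Hz
```

Speech naturally has strong harmonics, so running this on silence or low-energy segments of the generated audio makes a constant tone easier to separate from real content. If the detected frequency lines up with a sub-band boundary, the filter bank used to split and merge the bands is a plausible place to look.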

Releases (v1.0)

v1.0 (Jun 24, 2021)

Source code (tar.gz)
Source code (zip)
basis.melgan.pt (53.36 MB)

Owner

Zhengxi Liu (刘正曦)

Interested in high-performance neural vocoders and expressive TTS acoustic models. Member of DeepMist; developed MistGPU.


Paddle2.x version AI-Writer

Paddle2.x version of AI-Writer: uses a heavily modified GPT to generate web novels. Tuned GPT for novel generation.

74 Jan 04, 2023

Quick insights from Zoom meeting transcripts using Graph + NLP

Transcript Analysis - Graph + NLP This program extracts insights from Zoom Meeting Transcripts (.vtt) using TigerGraph and NLTK. In order to run this

7 Sep 17, 2022

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack 🐙 Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

2.2k Jan 03, 2023

RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

20 Jul 14, 2022

UniSpeech - Large Scale Self-Supervised Learning for Speech

UniSpeech The family of UniSpeech: WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing UniSpeech (ICML 202

281 Dec 15, 2022

The guide to tackle with the Text Summarization

1.2k Dec 30, 2022

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

148 Dec 26, 2022

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

Pattern-Exploiting Training (PET) This repository contains the code for Exploiting Cloze Questions for Few-Shot Text Classification and Natural Langua

1.4k Dec 30, 2022

SciBERT is a BERT model trained on scientific text.

1.2k Dec 24, 2022

Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

Seq2Seq Speech in JAX A JAX/Flax repository for combining a pre-trained speech encoder model (e.g. Wav2Vec2, HuBERT, WavLM) with a pre-trained text de

21 Dec 14, 2022

Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Dense Passage Retrieval Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the

1.1k Jan 07, 2023

In this project, we aim to achieve the task of predicting emojis from tweets. We aim to investigate the relationship between words and emojis.

Making Emojis More Predictable by Karan Abrol, Karanjot Singh and Pritish Wadhwa, Natural Language Processing (CSE546) under the guidance of Dr. Shad

2 Jan 17, 2022
