wav2vec_finetune

Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks

Initial test: gender recognition on the CaFE dataset (downloaded below).
Finetune for autism detection
[] Clean up directory
[] Make training and evaluation scripts runnable from the command line / shell scripts
[] Add random noise to training samples
[] Make baseline models
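For the "random noise" item, one common approach is to add white Gaussian noise at a randomly drawn signal-to-noise ratio. Below is a minimal sketch. It assumes mono float waveforms stored as NumPy arrays. The function names, the 10 to 30 dB SNR range and the probability p=0.5 are illustrative choices and do not come from this repo's code.

```python
import numpy as np


def add_noise(wave: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a mono waveform at a target SNR (in dB).

    Hypothetical helper, not part of this repo.
    """
    signal_power = np.mean(wave.astype(np.float64) ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10**(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    # Keep the input dtype (e.g. float32) so downstream code is unaffected
    return (wave + noise).astype(wave.dtype)


def random_snr_augment(wave, rng, snr_range=(10.0, 30.0), p=0.5):
    """With probability p, add noise at an SNR drawn uniformly from snr_range.

    snr_range and p are assumed values; tune them on the dev set.
    """
    if rng.random() >= p:
        return wave
    return add_noise(wave, rng.uniform(*snr_range), rng)
```

Applying this only to training samples, inside the dataset's loading step, means each epoch sees a differently corrupted copy of every file.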

# make a virtual environment and install the requirements
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

mkdir data
mkdir preproc_data
mkdir model
cd data
wget -O CaFE_48k.zip "https://zenodo.org/record/1219621/files/CaFE_48k.zip?download=1"
unzip CaFE_48k.zip
cd ..

python preproc.py
python train.py
python evaluate.py
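The archive is named CaFE_48k, which means the audio is sampled at 48 kHz. XLSR was pretrained on 16 kHz speech, so the audio has to be downsampled by a factor of 3 at some point in preprocessing. Below is a minimal sketch using SciPy's polyphase resampler. It also includes the zero-mean/unit-variance normalization that Hugging Face's Wav2Vec2FeatureExtractor applies when do_normalize=True. The sketch assumes mono float arrays that are already loaded (e.g. with soundfile). It is not necessarily what preproc.py does.

```python
import numpy as np
from scipy.signal import resample_poly

SOURCE_SR = 48_000  # CaFE_48k sample rate
TARGET_SR = 16_000  # sample rate XLSR / wav2vec 2.0 was pretrained on


def to_16k(wave: np.ndarray, source_sr: int = SOURCE_SR) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (polyphase filter includes anti-aliasing)."""
    g = int(np.gcd(source_sr, TARGET_SR))
    return resample_poly(wave, up=TARGET_SR // g, down=source_sr // g)


def normalize(wave: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of a single utterance.

    Sketch of the per-utterance normalization used by Wav2Vec2FeatureExtractor
    with do_normalize=True; the 1e-7 epsilon avoids division by zero on silence.
    """
    return (wave - wave.mean()) / np.sqrt(wave.var() + 1e-7)
```

If the repo already uses Wav2Vec2FeatureExtractor, only the resampling step is needed, because the extractor then handles normalization itself.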

Updates

11/9: Success! Trained a sex classifier on a small dataset; it performs so-so, but the whole pipeline seems to work.

TODO

Chunk audio files: make predictions on chunks of e.g. 5 seconds each
Set up benchmark models
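The chunking item could look roughly like this. Each file is split into non-overlapping 5-second chunks, the model predicts on every chunk (for example as a single batch), and the chunk-level predictions are combined by averaging softmax probabilities. The function names, the 1-second minimum for the last chunk, and the choice of probability averaging over majority voting are all assumptions made for this sketch.

```python
import numpy as np


def chunk_audio(wave, sr=16_000, chunk_s=5.0, min_last_s=1.0):
    """Split a 1-D waveform into consecutive non-overlapping chunks (hypothetical helper).

    The final partial chunk is zero-padded to full length, but dropped if it is
    shorter than min_last_s (unless it is the only chunk). Returns
    (chunks, mask) with shapes (n_chunks, chunk_len); mask is True on real samples
    and can serve as an attention mask.
    """
    if len(wave) == 0:
        raise ValueError("empty waveform")
    chunk_len = int(round(chunk_s * sr))
    min_len = int(round(min_last_s * sr))
    chunks, masks = [], []
    for start in range(0, len(wave), chunk_len):
        piece = wave[start:start + chunk_len]
        if len(piece) < min_len and chunks:
            break  # drop a short tail, but always keep at least one chunk
        chunks.append(np.pad(piece, (0, chunk_len - len(piece))))
        masks.append(np.arange(chunk_len) < len(piece))
    return np.stack(chunks), np.stack(masks)


def aggregate_logits(chunk_logits):
    """Average per-chunk softmax probabilities (n_chunks, n_classes) into one file-level label."""
    z = chunk_logits - chunk_logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs
```

For example, a 12-second file at 16 kHz gives three chunks, and the last one has 2 seconds of real audio. Averaging probabilities lets confident chunks outweigh uncertain ones. A majority vote over chunk labels is a simpler alternative.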

Resources:

https://github.com/pytorch/fairseq/blob/master/examples/xlmr/README.md
https://arxiv.org/abs/2006.13979
https://huggingface.co/transformers/training.html
https://huggingface.co/blog/fine-tune-xlsr-wav2vec2
https://discuss.huggingface.co/t/german-asr-fine-tuning-wav2vec2/4558/5
https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files
https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md
https://github.com/m3hrdadfi/soxan
https://www.zhaw.ch/storage/engineering/institute-zentren/cai/BA21_Speech_Classification_Reiser_Fivian.pdf
https://github.com/DReiser7/w2v_did
https://github.com/ARBML/klaam
https://github.com/talhanai/speech-nlp-datasets

Notes:

Look into SpecAugment for finetuning (https://arxiv.org/abs/1904.08779); it is on by default.
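During finetuning, wav2vec 2.0 applies a SpecAugment-style time mask to the CNN encoder's latent frames, not to a spectrogram. In Hugging Face's Wav2Vec2Config this is controlled by apply_spec_augment, mask_time_prob and mask_time_length. The defaults are believed to be 0.05 and 10; check the installed transformers version. Below is a rough NumPy sketch of the span sampling. The real model replaces masked frames with a learned embedding, while this sketch uses zeros or a fixed vector that you pass in. The function name and the exact sampling rule are approximations and do not reproduce the library code.

```python
import numpy as np


def time_mask(features, mask_prob=0.05, mask_length=10, rng=None, mask_value=None):
    """SpecAugment-style time masking on a (n_frames, dim) matrix (approximate sketch).

    About mask_prob * n_frames / mask_length spans of mask_length frames are
    masked (overlaps allowed), so mask_prob roughly sets the masked fraction.
    Masked frames get mask_value (zeros if None; wav2vec 2.0 uses a learned vector).
    Returns the masked copy and the boolean frame mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = features.shape[0]
    n_starts = max(n_frames - mask_length + 1, 0)
    # Stochastic rounding of the expected number of spans
    n_spans = min(int(mask_prob * n_frames / mask_length + rng.random()), n_starts)
    starts = rng.choice(max(n_starts, 1), size=n_spans, replace=False)
    mask = np.zeros(n_frames, dtype=bool)
    for s in starts:
        mask[s:s + mask_length] = True
    out = features.copy()
    out[mask] = 0.0 if mask_value is None else mask_value
    return out, mask
```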
How should the prediction be made?
- Is there a better classification head than a small feed-forward projection (e.g. an LSTM)?
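To compare the two options, here is a minimal PyTorch sketch of two heads that could sit on top of the transformer's hidden states. The first applies masked mean pooling over time and then a small MLP. The second runs a bidirectional LSTM and classifies from its final states. The sketch assumes hidden states of shape (batch, frames, 1024), which matches the hidden size of the large XLSR-53 checkpoint; wav2vec 2.0 produces roughly one frame per 20 ms. The class names and layer sizes are hypothetical and are not this repo's code.

```python
import torch
from torch import nn


class MeanPoolHead(nn.Module):
    """Hypothetical head: masked mean-pool over frames, then a small MLP."""

    def __init__(self, hidden_size=1024, n_classes=2, dropout=0.1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, n_classes),
        )

    def forward(self, hidden, frame_mask=None):
        # hidden: (batch, frames, hidden_size); frame_mask: (batch, frames), True = real frame
        if frame_mask is None:
            pooled = hidden.mean(dim=1)
        else:
            m = frame_mask.unsqueeze(-1).to(hidden.dtype)
            pooled = (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        return self.proj(pooled)


class LSTMHead(nn.Module):
    """Hypothetical alternative: bidirectional LSTM over frames, classify from final states."""

    def __init__(self, hidden_size=1024, lstm_size=256, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, lstm_size, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_size, n_classes)

    def forward(self, hidden, lengths=None):
        if lengths is not None:
            # Packing makes the LSTM stop at each sequence's true length (ignores padding)
            hidden = nn.utils.rnn.pack_padded_sequence(
                hidden, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(hidden)
        # h_n: (2, batch, lstm_size); concatenate forward and backward final states
        return self.out(torch.cat([h_n[0], h_n[1]], dim=-1))
```

At these sizes the pooled MLP has about 1.05M parameters and the LSTM head about 2.6M. Both are small next to the roughly 300M-parameter backbone. With a small dataset, the simpler pooled head may be the safer starting point.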
