Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis

Last update: Oct 17, 2022

Overview

TDY-CNN for Text-Independent Speaker Verification

Official implementation of

Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis
by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST

This paper was accepted at ICASSP 2022.

This code is mainly based on VoxCeleb_trainer, the implementation accompanying the paper 'In defence of metric learning for speaker recognition'.

Temporal Dynamic Convolutional Neural Network (TDY-CNN)

TDY-CNN efficiently applies adaptive convolution depending on time bins by changing the computation order as follows:

$y(f, t) = \sigma (\sum_{k=1}^{K} \pi_{k}(t)y_k(f,t))$

where x and y are the input and output of the TDY-CNN module, both indexed by frequency bin f and time bin t of the time-frequency input. The k-th basis kernel is convolved with the input x and the k-th bias is added, giving the intermediate output $y_k(f,t)$. These K outputs are aggregated using the attention weights $\pi_k(t)$, which depend on the time bin. K is the number of basis kernels, and σ is the ReLU activation function. Each attention weight lies between 0 and 1, and the weights of all basis kernels at a single time bin sum to 1 because they are produced by a softmax.
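The aggregation step in the equation above can be illustrated with a minimal pure-Python sketch. This is not the repository's PyTorch module: the function names `softmax` and `tdy_aggregate` are hypothetical, the basis outputs $y_k(f,t)$ are assumed to be already computed (the convolution itself is not shown), and nested lists stand in for tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tdy_aggregate(basis_outputs, attention_logits):
    """Combine K basis-kernel outputs with time-dependent attention.

    basis_outputs:    K x F x T nested lists holding y_k(f, t)
    attention_logits: K x T nested lists of pre-softmax scores per time bin
    Returns (y, pi): y is F x T, pi is T x K (attention weights per time bin).
    """
    K = len(basis_outputs)
    F = len(basis_outputs[0])
    T = len(basis_outputs[0][0])

    # Softmax over the K kernels, separately for every time bin t.
    pi = [softmax([attention_logits[k][t] for k in range(K)]) for t in range(T)]

    y = []
    for f in range(F):
        row = []
        for t in range(T):
            mixed = sum(pi[t][k] * basis_outputs[k][f][t] for k in range(K))
            row.append(max(0.0, mixed))  # ReLU activation
        y.append(row)
    return y, pi


# Toy example (assumed values): K = 2 kernels, F = 1 frequency bin, T = 2 time bins.
basis_outputs = [[[1.0, -4.0]], [[3.0, 2.0]]]
attention_logits = [[0.0, 0.0], [0.0, 0.0]]  # equal logits give weights of 0.5 each
y, pi = tdy_aggregate(basis_outputs, attention_logits)
print(y)
```

Because the attention weights are shared across frequency and vary only with t, the same mixture of kernels is applied to every frequency bin within a time bin.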

Requirements and versions used

Python 3.7.10 was used with the following libraries:

pytorch == 1.8.1
torchaudio == 0.8.1
numpy == 1.19.2
scipy == 1.5.3
scikit-learn == 0.23.2

Dataset

We used the VoxCeleb1 and VoxCeleb2 datasets in this paper. You can download them by referring to VoxCeleb1 and VoxCeleb2.

Training

You can train a model and save it in the exps folder by running:

python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400

This implementation also supports distributed training and mixed precision training to accelerate training.

Use the --distributed flag to enable distributed training and the --mixedprec flag to enable mixed precision training.
- GPU indices should be set before training: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' in trainSpeakerNet.py.
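As a small illustration of the GPU selection above, the sketch below sets the environment variable from a list of indices. The helper name `select_gpus` is hypothetical and not part of this repository. The variable generally has to be set before CUDA is initialized, so the safest place is before `torch` is imported.

```python
import os


def select_gpus(indices):
    """Expose only the given GPU indices to CUDA-based libraries."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing torch (or any library that initializes CUDA).
visible = select_gpus([0, 1, 2, 3])
print(visible)
```

Inside the process, the visible devices are renumbered starting from 0, so selecting physical GPUs 2 and 3 makes them appear as devices 0 and 1.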

Results:

Network	#Params	EER (%)	C_det (%)
TDY-VGG-M	71.2M	3.04	0.237
TDY-ResNet-34(×0.25)	13.3M	1.58	0.116
TDY-ResNet-34(×0.5)	51.9M	1.48	0.118
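For readers unfamiliar with the EER column, the following is a rough sketch of how an equal error rate can be estimated from trial scores. It is not the evaluation code used in this repository: the function name `equal_error_rate` and the toy scores are assumptions, and the threshold sweep picks the closest crossing point without interpolation.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from verification scores.

    scores: similarity scores, where higher means "same speaker"
    labels: 1 for target (same-speaker) trials, 0 for impostor trials
    Returns the EER as a fraction in [0, 1].
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    impostors = [s for s, l in zip(scores, labels) if l == 0]
    if not targets or not impostors:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = None, None
    for thr in sorted(set(scores)):
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= thr for s in impostors) / len(impostors)
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < thr for s in targets) / len(targets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up scores, for illustration only).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(f"EER = {100 * eer:.2f}%")
```

A lower EER means the false acceptance and false rejection rates meet at a lower error level, which is why smaller values in the table are better.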

This result is a low-dimensional t-SNE projection of frame-level speaker embeddings of MHRM0 and FDAS1 using (a) the baseline model ResNet-34(×0.25) and (b) TDY-ResNet-34(×0.25). The left column shows embeddings for different speakers, and the right column shows embeddings for different phoneme classes.
Embeddings from TDY-ResNet-34(×0.25) cluster tightly regardless of phoneme group. This indicates that the temporal dynamic model extracts consistent speaker information regardless of phoneme.

Pretrained models

Pretrained models are provided in the pretrained_model folder.

For example, you can reproduce an EER of 1.4786% with TDY-ResNet-34(×0.5) by running the following command:

python trainSpeakerNet.py --eval --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/test --eval_frames 400 --initial_model pretrained_model/pretrained_TDy_ResNet34_half.model

Citation

@article{kim2021tdycnn,
  title={Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2110.03213},
  year={2021}
}

Please contact Seong-Hu Kim at [email protected] for any query.
