PyTorch implementation of our paper LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION.

LiMuSE

Overview

LiMuSE explores group communication in a multi-modal speaker extraction model and further reduces the model size with a quantization strategy.

Model

Our proposed model is a multi-stream architecture that takes a multichannel mixture, the target speaker's enrolled utterance, and visual sequences of detected faces as inputs, and outputs the target speaker's mask in the time domain. The encoded audio representation of the mixture is then multiplied by the generated mask to obtain the target speech. See the figure below for the detailed model structure.

[Figure: flowchart_limuse, the LiMuSE model structure]
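The masking step described above can be illustrated with a minimal sketch. It is not the repository's code: the function name `apply_mask`, the nested-list layout (feature channels by time frames), and the toy values are assumptions made for illustration. Any step that turns the masked representation back into a waveform is left out.

```python
def apply_mask(encoded_mixture, mask):
    """Multiply an encoded mixture by a mask of the same shape, element by element.

    Both arguments are nested lists shaped [channels][frames]; this layout is
    an assumption for the sketch, not the layout used in the repository.
    """
    if len(encoded_mixture) != len(mask):
        raise ValueError("mixture and mask must have the same number of channels")
    masked = []
    for feature_row, mask_row in zip(encoded_mixture, mask):
        if len(feature_row) != len(mask_row):
            raise ValueError("mixture and mask must have the same number of frames")
        masked.append([x * m for x, m in zip(feature_row, mask_row)])
    return masked


# Toy values: 2 feature channels, 3 time frames (hypothetical numbers).
encoded = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
mask = [[1.0, 0.5, 0.0], [0.5, 1.0, 1.0]]
target = apply_mask(encoded, mask)
print(target)
```

A mask value near 1 keeps the corresponding entry of the mixture representation, and a value near 0 suppresses it.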

Datasets

We evaluate our system on two-speaker speech separation and speaker extraction problems using the GRID dataset. The pretrained face embedding extraction network is trained on the LRW and MS-Celeb-1M datasets. We use the SMS-WSJ toolkit to obtain simulated anechoic dual-channel audio mixtures. The 2 microphones are placed at the center of the room, 7 cm apart.
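To give a feel for the microphone geometry, the sketch below places two microphones 7 cm apart around the center of a room and computes the largest possible arrival-time difference between them. Only the 7 cm spacing and the centered placement come from the text. The room dimensions, the orientation of the pair along the x axis, the 16 kHz sample rate, and the speed of sound of 343 m/s are assumptions, and the helper names are hypothetical. This does not use the SMS-WSJ toolkit.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C
MIC_SPACING_M = 0.07        # from the text: the microphones are 7 cm apart


def mic_positions(room_dims_m, spacing_m=MIC_SPACING_M):
    """Return two microphone positions placed symmetrically around the room center.

    The pair is aligned with the x axis, which is an assumption of this sketch.
    """
    cx, cy, cz = (d / 2.0 for d in room_dims_m)
    half = spacing_m / 2.0
    return (cx - half, cy, cz), (cx + half, cy, cz)


def max_delay_samples(sample_rate_hz, spacing_m=MIC_SPACING_M, c=SPEED_OF_SOUND_M_S):
    """Largest inter-microphone delay, in samples, for a source on the array axis."""
    return spacing_m / c * sample_rate_hz


room = (6.0, 5.0, 3.0)  # hypothetical room size in metres
mic_a, mic_b = mic_positions(room)
print(mic_a, mic_b)
print(max_delay_samples(16000))  # assumed 16 kHz sample rate
```

Under these assumptions the delay between the two channels never exceeds a few samples, so the spatial cue carried by the dual-channel mixture is a small phase difference rather than a large time offset.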

Getting Started

Preparation

To adjust the framework configuration or the dataset path, modify the option/train/train.yml file.

Training

Specify the path to the train.yml file and run the training command:

python train.py -opt ./option/train/train.yml

This project supports both full-precision and quantization training. To switch between the full-precision and quantization stages, you need to modify two QA_flag values in the train.yml file: the QA_flag in the training settings controls weight quantization, while the one in net_conf controls activation quantization.
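Because the two QA_flag values have to be changed together, a small helper can keep them in sync. The sketch below works on a plain dictionary that stands in for the parsed train.yml. Only the names QA_flag and net_conf come from the text. The section key `train` for the training settings, the boolean flag values, and the function `set_quantization_stage` are assumptions, so check them against the actual file before relying on this.

```python
import copy


def set_quantization_stage(config, enabled, train_key="train", net_key="net_conf"):
    """Return a copy of `config` with both QA_flag entries set to `enabled`.

    `train_key` is a hypothetical name for the training-settings section;
    `net_key` defaults to the `net_conf` section named in the README.
    """
    updated = copy.deepcopy(config)
    for section in (train_key, net_key):
        if section not in updated or "QA_flag" not in updated[section]:
            raise KeyError(f"QA_flag not found under section '{section}'")
        updated[section]["QA_flag"] = enabled
    return updated


# Stand-in for the parsed YAML file (assumed layout and value types).
config = {"train": {"QA_flag": False}, "net_conf": {"QA_flag": False}}
quantized = set_quantization_stage(config, True)
print(quantized)
```

Returning a modified copy leaves the full-precision configuration untouched, which makes it easy to keep both stages side by side.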

View tensorboardX

tensorboard --logdir ./tensorboard

Result

  • Hyperparameters of LiMuSE

    Symbol  Description                                      Value
    N       Number of filters in auto-encoder                128
    L       Length of the filters (in audio samples)         16
    T       Temperature                                      5
    X       Number of GC-equipped TCN blocks in each repeat  6
    Ra      Number of repeats in audio block                 2
    Rb      Number of repeats in fusion block                1
    K       Number of groups                                 -
  • Performance of LiMuSE and TasNet under various configurations. Q stands for quantization, VIS stands for visual cue and VP stands for voiceprint cue. Model size and compression ratio are also reported.

    Method                    K    SI-SDR (dB)  #Params  Model Size  Compression Ratio
    LiMuSE                    32   16.72        0.36M    0.16MB      223.75
    LiMuSE                    16   18.08        0.96M    0.40MB      89.50
    LiMuSE (w/o Q)            32   23.77        0.36M    1.44MB      24.86
    LiMuSE (w/o Q)            16   24.90        0.96M    3.84MB      9.32
    LiMuSE (w/o Q and VP)     32   18.60        0.19M    0.76MB      47.11
    LiMuSE (w/o Q and VP)     16   24.20        0.52M    2.08MB      17.21
    LiMuSE (w/o Q and VIS)    32   15.68        0.22M    0.88MB      40.68
    LiMuSE (w/o Q and VIS)    16   21.91        0.55M    2.20MB      16.27
    LiMuSE (w/o Q and GC)     -    23.67        8.95M    35.8MB      1
    TasNet (dual-channel)     -    19.94        2.48M    9.92MB      -
    TasNet (single-channel)   -    13.15        2.48M    9.92MB      -
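The size and ratio columns can be cross-checked with a few lines of arithmetic. The sketch below assumes that unquantized weights take 4 bytes each and that 1 MB means 10^6 bytes. The README does not state either point, but both match the full-precision rows of the table. The compression ratio is taken relative to LiMuSE (w/o Q and GC), whose ratio is listed as 1. The function names are hypothetical.

```python
BYTES_PER_FULL_PRECISION_PARAM = 4  # assumption: 32-bit weights


def full_precision_size_mb(params_millions):
    """Model size in MB (10^6 bytes) for unquantized weights."""
    return params_millions * BYTES_PER_FULL_PRECISION_PARAM


def compression_ratio(baseline_mb, model_mb):
    """Size of the baseline model divided by the size of the compared model."""
    return baseline_mb / model_mb


# Baseline: LiMuSE (w/o Q and GC) with 8.95M parameters.
baseline_mb = full_precision_size_mb(8.95)
print(round(baseline_mb, 2))

# LiMuSE (w/o Q), K = 32: 0.36M parameters.
size_mb = full_precision_size_mb(0.36)
print(round(size_mb, 2), round(compression_ratio(baseline_mb, size_mb), 2))

# Quantized LiMuSE, K = 32: the 0.16MB size is taken from the table as given.
print(round(compression_ratio(baseline_mb, 0.16), 2))
```

The quantized sizes are not derived here because the README does not state the bit width used after quantization.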

Citations

If you find this repo helpful, please consider citing:

@inproceedings{liu2021limuse,
  title={LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION},
  author={Liu, Qinghua and Huang, Yating and Hao, Yunzhe and Xu, Jiaming and Xu, Bo},
  booktitle={arXiv:2111.04063},
  year={2021},
}
Owner
Auditory Model and Cognitive Computing Lab
Auditory Model and Cognitive Computing Laboratory @ Institute of Automation, Chinese Academy of Sciences