Pseudo-Visual Speech Denoising

This code accompanies our paper "Visual Speech Enhancement Without A Real Visual Stream", published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar

πŸ“ Paper πŸ“‘ Project Page πŸ›  Demo Video πŸ—ƒ Real-World Test Set
Paper Website Video Real-World Test Set (coming soon)


Features

  • Denoise any real-world audio/video and obtain the clean speech.
  • Works in unconstrained settings for any speaker in any language.
  • Takes only audio as input, yet benefits from lip movements by generating a synthetic visual stream.
  • Complete training and inference code is available.

Prerequisites

  • Python 3.7.4 (Code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth
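
The two non-Python prerequisites above (ffmpeg on the PATH and the face detection weights at the listed location) can be checked before running anything. The following is a minimal sketch and not part of the repository; the function name `missing_prerequisites` is hypothetical, and it assumes the weights path is relative to the repository root.

```python
import shutil
from pathlib import Path

# Path taken from the prerequisites list; assumed relative to the repository root.
S3FD_WEIGHTS = Path("face_detection/detection/sfd/s3fd.pth")


def missing_prerequisites(repo_root="."):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # ffmpeg must be callable from the shell.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # The pre-trained face detector must sit at the expected location.
    weights = Path(repo_root) / S3FD_WEIGHTS
    if not weights.is_file():
        problems.append(f"face detection weights missing: {weights}")
    return problems
```

Calling `missing_prerequisites()` from the repository root and printing the returned list gives a quick report of what still needs to be installed or downloaded.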

Getting the weights

Model | Description | Link to the model
Denoising model | Weights of the denoising model (needed for inference) | Link
Lipsync student | Weights of the student lipsync model to generate the visual stream for noisy audio inputs (needed for inference) | Link
Wav2Lip teacher | Weights of the teacher lipsync model (only needed if you want to train the network from scratch) | Link

Denoising any audio/video using the pre-trained model (Inference)

You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:

python inference.py --lipsync_student_model_path= --checkpoint_path= --input=

The result is saved (by default) in results/result.mp4. The result directory can be specified in arguments, similar to several other available options. The input file can be any audio file: *.wav, *.mp3 or even a video file, from which the code will automatically extract the audio and generate the clean speech. Note that the noise should not be human speech, as this work only tackles the denoising task, not speaker separation.
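
To denoise several files in one go, the command above can be assembled programmatically. This is a sketch under stated assumptions, not repository code: the helper names `build_inference_command` and `denoise_folder` are hypothetical, the checkpoint paths are placeholders supplied by the caller, and only the three flags shown in the command above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(student_ckpt, denoiser_ckpt, input_file):
    """Assemble the inference.py invocation as an argument list."""
    return [
        sys.executable, "inference.py",
        f"--lipsync_student_model_path={student_ckpt}",
        f"--checkpoint_path={denoiser_ckpt}",
        f"--input={input_file}",
    ]


def denoise_folder(student_ckpt, denoiser_ckpt, folder, dry_run=True):
    """Build one command per audio/video file in folder; run them unless dry_run."""
    # Extensions chosen for illustration: the README mentions *.wav, *.mp3 and video files.
    extensions = {".wav", ".mp3", ".mp4"}
    commands = [
        build_inference_command(student_ckpt, denoiser_ckpt, path)
        for path in sorted(Path(folder).iterdir())
        if path.suffix.lower() in extensions
    ]
    if not dry_run:
        for command in commands:
            # Must be executed from the repository root, where inference.py lives.
            subprocess.run(command, check=True)
    return commands
```

Since the default output is results/result.mp4, successive runs would overwrite each other unless the output location is changed through the script's own options, so a real batch run would need to set that per file.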

Generating only the lip-movements for any given noisy audio/video

The synthetic visual stream (lip-movements) can be generated for any noisy audio/video using:

cd lipsync
python inference.py --checkpoint_path= --audio=

The result is saved (by default) in results/result_voice.mp4. The result directory can be specified in arguments, similar to several other available options. The input file can be any audio file: *.wav, *.mp3 or even a video file, from which the code will automatically extract the audio and generate the visual stream.

Training

We illustrate the training process using the LRS3 and VGGSound datasets. Adapting the code to other datasets would require small modifications.

Preprocess the dataset

LRS3 train-val/pre-train dataset folder structure

data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
├── list of folders
│   ├── five-digit numbered video IDs ending with (.mp4)

Preprocess the dataset

python preprocess.py --data_root= --preprocessed_root=

Additional options, such as batch_size and the number of GPUs to use in parallel, can also be set.

Preprocessed LRS3 folder structure

preprocessed_root (lrs3_preprocessed)
├── list of folders
│   ├── Folders with five-digit numbered video IDs
│   │   ├── *.jpg (extracted face crops from each frame)
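
A quick sanity check of the preprocessed output can be sketched as follows. It assumes exactly the two-level layout shown above (group folders containing one folder per video ID, each holding *.jpg face crops); the function names `count_face_crops` and `empty_videos` are hypothetical and not part of the repository.

```python
from pathlib import Path


def count_face_crops(preprocessed_root):
    """Map 'group_folder/video_id' to the number of *.jpg face crops it contains."""
    counts = {}
    root = Path(preprocessed_root)
    for group in sorted(p for p in root.iterdir() if p.is_dir()):
        for video in sorted(p for p in group.iterdir() if p.is_dir()):
            counts[f"{group.name}/{video.name}"] = len(list(video.glob("*.jpg")))
    return counts


def empty_videos(preprocessed_root):
    """List video folders that contain no face crops at all."""
    return [name for name, n in count_face_crops(preprocessed_root).items() if n == 0]
```

Video folders reported by `empty_videos` are worth inspecting before training, since they contribute no frames.
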

VGGSound folder structure

We use the VGGSound dataset as the noise source, which is mixed with the clean speech from the LRS3 dataset. We download the audio files (*.wav files) from here.

data_root (vgg_sound)
├── *.wav (audio files)
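
Mixing a noise clip into clean speech is commonly done by scaling the noise to reach a chosen signal-to-noise ratio (SNR). The sketch below illustrates that generic technique only; this README does not state the mixing procedure or SNR range used in the training code, so the function `mix_at_snr`, the looping of short noise clips, and the SNR parameter are all assumptions. It uses plain Python lists to stay dependency-free, whereas a real pipeline would typically operate on arrays.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        repeats = math.ceil(len(speech) / len(noise))
        noise = list(noise) * repeats
    noise = list(noise[:len(speech)])
    noise_rms = rms(noise)
    if noise_rms == 0:
        # Silent noise clip: nothing to add.
        return list(speech)
    # SNR(dB) = 20 * log10(rms_speech / rms_scaled_noise), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values yield noisier mixtures; at 0 dB the scaled noise has the same RMS level as the speech.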

Train!

Training has two major steps: (i) train the student lipsync model, and (ii) train the denoising model.

Train the Student-Lipsync model

Navigate to the lipsync folder: cd lipsync

The lipsync model can be trained using:

python train_student.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --wav2lip_checkpoint_path= --checkpoint_dir=

Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.

Train the Denoising model!

Navigate to the main directory: cd ..

The denoising model can be trained using:

python train.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --lipsync_student_model_path= --checkpoint_dir=

Training can also be resumed from a saved checkpoint; see python train.py --help for details. Additional, less commonly used hyper-parameters can be set at the bottom of the audio/hparams.py file.


Evaluation

To be updated soon!


Licence and Citation

The software is licensed under the MIT License. Please cite the following paper if you have used this code:

@InProceedings{Hegde_2021_WACV,
    author    = {Hegde, Sindhu B. and Prajwal, K.R. and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title     = {Visual Speech Enhancement Without a Real Visual Stream},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1926-1935}
}

Acknowledgements

Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository. We thank the authors for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.
