Pseudo-Visual Speech Denoising

Overview

This repository contains the code for our paper "Visual Speech Enhancement Without A Real Visual Stream", published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar

📝 Paper | 📑 Project Page | 🛠 Demo Video | 🗃 Real-World Test Set
Paper | Website | Video | Real-World Test Set (coming soon)


Features

  • Denoise any real-world audio/video and obtain the clean speech.
  • Works in unconstrained settings for any speaker in any language.
  • Takes only audio as input, but still benefits from lip movements by generating a synthetic visual stream.
  • Complete training and inference code is available.

Prerequisites

  • Python 3.7.4 (Code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth
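
The prerequisites above can be checked before running anything. The following is a minimal sketch and not part of the repository: the helper name `check_prerequisites` is hypothetical, and it only verifies the interpreter version, that `ffmpeg` is on the PATH, and that the face detection weights sit at the path listed above.

```python
import shutil
import sys
from pathlib import Path

# Path taken from the prerequisites list, relative to the repository root.
S3FD_WEIGHTS = Path("face_detection/detection/sfd/s3fd.pth")


def check_prerequisites(repo_root="."):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration, not part of the repository.
    """
    problems = []
    # The code was tested with Python 3.7.4; this only flags older interpreters.
    if sys.version_info < (3, 7):
        problems.append("Python 3.7 or newer is expected")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not (Path(repo_root) / S3FD_WEIGHTS).is_file():
        problems.append(f"missing face detection weights: {S3FD_WEIGHTS}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script from the repository root prints one warning per failed check and nothing when everything is in place.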

Getting the weights

Model | Description | Link to the model
Denoising model | Weights of the denoising model (needed for inference) | Link
Lipsync student | Weights of the student lipsync model to generate the visual stream for noisy audio inputs (needed for inference) | Link
Wav2Lip teacher | Weights of the teacher lipsync model (only needed if you want to train the network from scratch) | Link

Denoising any audio/video using the pre-trained model (Inference)

You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:

python inference.py --lipsync_student_model_path= --checkpoint_path= --input=

By default, the result is saved to results/result.mp4. The result directory, like several other options, can be set through command-line arguments. The input can be any audio file (*.wav, *.mp3) or a video file, from which the code automatically extracts the audio before generating the clean speech. Note that the noise should not be human speech: this work tackles denoising only, not speaker separation.
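
For batch use, the inference call can be wrapped in a small script. This is a minimal sketch under stated assumptions: the three flags are the ones shown in the command above, while the function names and all file paths are hypothetical placeholders to be replaced with your own.

```python
import subprocess
import sys


def build_inference_command(student_ckpt, denoiser_ckpt, input_file):
    """Assemble the inference.py call as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "inference.py",
        f"--lipsync_student_model_path={student_ckpt}",
        f"--checkpoint_path={denoiser_ckpt}",
        f"--input={input_file}",
    ]


def denoise(student_ckpt, denoiser_ckpt, input_file, dry_run=True):
    """Return the command; execute it only when dry_run is False.

    Must be run from the repository root so that inference.py is found.
    """
    cmd = build_inference_command(student_ckpt, denoiser_ckpt, input_file)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder paths (assumptions): point these at the downloaded weights.
    print(" ".join(denoise("checkpoints/lipsync_student.pth",
                           "checkpoints/denoising.pth",
                           "noisy_input.wav")))
```

Passing the arguments as a list avoids shell-quoting problems with paths that contain spaces.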

Generating only the lip-movements for any given noisy audio/video

The synthetic visual stream (lip-movements) can be generated for any noisy audio/video using:

cd lipsync
python inference.py --checkpoint_path= --audio=

By default, the result is saved to results/result_voice.mp4. The result directory, like several other options, can be set through command-line arguments. The input can be any audio file (*.wav, *.mp3) or a video file, from which the code automatically extracts the audio before generating the visual stream.
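
When the input is a video, the audio track has to be separated before it can be processed. How the repository does this internally is not shown here; the sketch below is one common way to do it with the `ffmpeg` CLI from the prerequisites. The function names, the mono downmix, and the 16 kHz sampling rate are assumptions for illustration, not values taken from the code.

```python
import subprocess
from pathlib import Path


def build_extract_audio_command(input_file, output_wav, sample_rate=16000):
    """Build an ffmpeg call that writes the audio track of input_file to a WAV file."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(input_file),
        "-vn",                     # drop any video stream
        "-ac", "1",                # downmix to mono (assumption)
        "-ar", str(sample_rate),   # resample to the assumed rate
        str(output_wav),
    ]


def extract_audio(input_file, output_dir="results"):
    """Extract the audio next to the other results and return the WAV path."""
    out = Path(output_dir) / (Path(input_file).stem + ".wav")
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_audio_command(input_file, out), check=True)
    return out
```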

Training

We illustrate the training process using the LRS3 and VGGSound datasets. Adapting the code to other datasets would require small modifications.

Preprocess the dataset

LRS3 train-val/pre-train dataset folder structure

data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
├── list of folders
│   ├── five-digit numbered video IDs ending with (.mp4)

Preprocess the dataset

python preprocess.py --data_root= --preprocessed_root=

Additional options, such as batch_size and the number of GPUs to use in parallel, can also be set.
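
Before launching preprocessing, it can help to confirm that `data_root` matches the layout described above. The following sketch is not part of the repository: `find_lrs3_videos` is a hypothetical helper that assumes a single level of folders, each holding `.mp4` files whose names are exactly five digits.

```python
import re
from pathlib import Path

# File names such as 00001.mp4, as described in the folder structure above.
FIVE_DIGIT_MP4 = re.compile(r"^\d{5}\.mp4$")


def find_lrs3_videos(data_root):
    """Map each folder under data_root to its sorted five-digit .mp4 file names.

    Folders without any matching video are left out of the result.
    """
    videos = {}
    for folder in sorted(Path(data_root).iterdir()):
        if not folder.is_dir():
            continue
        matches = sorted(
            p.name for p in folder.iterdir() if FIVE_DIGIT_MP4.match(p.name)
        )
        if matches:
            videos[folder.name] = matches
    return videos
```

An empty result usually means `data_root` points one level too high or too low in the dataset tree.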

Preprocessed LRS3 folder structure

preprocessed_root (lrs3_preprocessed)
├── list of folders
│   ├── Folders with five-digit numbered video IDs
│   │   ├── *.jpg (extracted face crops from each frame)

VGGSound folder structure

We use the VGGSound dataset as the noise source, which is mixed with the clean speech from the LRS3 dataset. The audio files (*.wav) can be downloaded from here.

data_root (vgg_sound)
├── *.wav (audio files)
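
The text above says that VGGSound audio is mixed with clean LRS3 speech, but does not describe how. A common approach is to scale the noise so that the mixture has a chosen signal-to-noise ratio (SNR); the sketch below illustrates that idea and is an assumption, not the repository's actual mixing code. It uses only the standard library so that it runs without dependencies; real pipelines would typically operate on NumPy arrays.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    if not speech:
        return []
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        # Silent noise clip: nothing to add.
        return list(speech)
    # Pick the gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values give noisier mixtures; at 0 dB the scaled noise has the same average power as the speech.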

Train!

There are two major steps: (i) Train the student-lipsync model, (ii) Train the Denoising model.

Train the Student-Lipsync model

Navigate to the lipsync folder: cd lipsync

The lipsync model can be trained using:

python train_student.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --wav2lip_checkpoint_path= --checkpoint_dir=

Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.

Train the Denoising model!

Navigate to the main directory: cd ..

The denoising model can be trained using:

python train.py --data_root_lrs3_pretrain= --data_root_lrs3_train= --noise_data_root= --lipsync_student_model_path= --checkpoint_dir=

Training can also be resumed; see python train.py --help for details. Additional, less commonly used hyper-parameters can be set at the bottom of the audio/hparams.py file.
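
The two training stages described above can be chained from a single script. This is a minimal sketch under stated assumptions: the script names and flags are the ones shown in the commands above, while the function names and every path passed in are hypothetical. Absolute paths are recommended, because the first stage runs inside the lipsync folder and relative paths would resolve against it.

```python
import subprocess
import sys


def build_training_commands(lrs3_pretrain, lrs3_train, noise_root,
                            wav2lip_ckpt, student_ckpt_dir,
                            student_ckpt, denoiser_ckpt_dir):
    """Return [(working_dir, argv), ...] for the two training stages, in order."""
    shared = [
        f"--data_root_lrs3_pretrain={lrs3_pretrain}",
        f"--data_root_lrs3_train={lrs3_train}",
        f"--noise_data_root={noise_root}",
    ]
    # Stage (i): student lipsync model, run from the lipsync folder.
    student = [sys.executable, "train_student.py", *shared,
               f"--wav2lip_checkpoint_path={wav2lip_ckpt}",
               f"--checkpoint_dir={student_ckpt_dir}"]
    # Stage (ii): denoising model, run from the main directory.
    denoiser = [sys.executable, "train.py", *shared,
                f"--lipsync_student_model_path={student_ckpt}",
                f"--checkpoint_dir={denoiser_ckpt_dir}"]
    return [("lipsync", student), (".", denoiser)]


def run_training(stages):
    """Run the stages sequentially from the repository root; stop on the first failure."""
    for workdir, argv in stages:
        subprocess.run(argv, cwd=workdir, check=True)
```

Here `student_ckpt` is whichever student checkpoint file you pick from `student_ckpt_dir` once the first stage has finished.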


Evaluation

To be updated soon!


Licence and Citation

The software is licensed under the MIT License. Please cite the following paper if you have used this code:

@InProceedings{Hegde_2021_WACV,
    author    = {Hegde, Sindhu B. and Prajwal, K.R. and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title     = {Visual Speech Enhancement Without a Real Visual Stream},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {1926-1935}
}

Acknowledgements

Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository. We thank the authors for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.
