Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Last update: Dec 13, 2022

Overview

[Paper] [Project page]

This repository contains code for the paper:

Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018

This release includes code and models for:

On/off-screen source separation: separating the speech of an on-screen speaker from background sounds.
Blind source separation: audio-only source separation using a u-net trained with permutation invariant training (PIT).
Sound source localization: visualizing the parts of a video that correspond to sound-making actions.
Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).

Setup

Install Python 2.7
Install ffmpeg
Install TensorFlow, e.g. through pip:

pip install tensorflow     # for CPU evaluation only
pip install tensorflow-gpu # for GPU support

We used TensorFlow version 1.8, which can be installed with:

pip install tensorflow-gpu==1.8

Install other Python dependencies

pip install numpy matplotlib pillow scipy

Download the pretrained models and sample data

./download_models.sh
./download_sample_data.sh
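
Before running the examples below, it can help to confirm that the prerequisites listed above are actually available. The following is a minimal sketch, not part of this repository: the helper names find_on_path, missing_modules and report are hypothetical, and the PATH lookup assumes a Unix-like system where the executable is named exactly ffmpeg. It uses only the standard library so that it runs under Python 2.7 as well as Python 3.

```python
import importlib
import os


def find_on_path(program):
    """Return the full path of an executable found on PATH, or None."""
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def report():
    """Print a short summary of what was found (hypothetical helper)."""
    # Import names for the pip packages above; pillow is imported as PIL.
    required = ["tensorflow", "numpy", "matplotlib", "PIL", "scipy"]
    print("ffmpeg: %s" % (find_on_path("ffmpeg") or "NOT FOUND"))
    print("missing modules: %s" % (", ".join(missing_modules(required)) or "none"))


if __name__ == "__main__":
    report()
```

If ffmpeg is reported as not found, or any module is listed as missing, revisit the corresponding install step before running sep_video.py.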

Pretrained audio-visual features

We have provided the features for our fused audio-visual network. These features were learned through self-supervised learning. Please see shift_example.py for a simple example that uses these pretrained features.

Audio-visual source separation

To try the on/off-screen source separation model, run:

python sep_video.py ../data/translator.mp4 --model full --duration_mult 4 --out ../results/

This separates the on-screen speaker's voice from that of an off-screen speaker. It writes the separated video files to ../results/ and also displays them in a local webpage for easier viewing. It produces three videos: Input, On-screen, and Off-screen.

We can visually mask out one of the two on-screen speakers, thereby removing their voice:

python sep_video.py ../data/crossfire.mp4 --model full --mask l --out ../results/
python sep_video.py ../data/crossfire.mp4 --model full --mask r --out ../results/

This produces the following videos: Source, Left, and Right.
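
When running the separation script over several videos or masks, it may be convenient to assemble the command lines programmatically. The sketch below is an illustration only: build_sep_command is a hypothetical helper, not part of this repository, and it emits only the flags that appear in the commands above (--model, --mask, --duration_mult, --out). Restricting --mask to l and r reflects the two values shown here and is an assumption about what the script accepts.

```python
def build_sep_command(video, model="full", out_dir="../results/",
                      mask=None, duration_mult=None):
    """Build an argv list for sep_video.py (hypothetical helper)."""
    cmd = ["python", "sep_video.py", video, "--model", model]
    if mask is not None:
        # Only the two mask values shown in this README are allowed here.
        if mask not in ("l", "r"):
            raise ValueError("mask must be 'l' or 'r'")
        cmd += ["--mask", mask]
    if duration_mult is not None:
        cmd += ["--duration_mult", str(duration_mult)]
    cmd += ["--out", out_dir]
    return cmd


if __name__ == "__main__":
    # Print the two masked commands from the example above.
    for side in ("l", "r"):
        print(" ".join(build_sep_command("../data/crossfire.mp4", mask=side)))
```

The returned list can be passed to subprocess.check_call from the directory that contains sep_video.py.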

Blind (audio-only) source separation

This baseline trains a u-net model to minimize a permutation invariant loss.

python sep_video.py ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/

The model will write the two separated streams in an arbitrary order.
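
The arbitrary output order follows from how a permutation invariant loss works: the loss is computed for every assignment of predicted streams to ground-truth streams, and only the smallest value is kept, so the model is never told which output slot belongs to which source. The NumPy sketch below illustrates the idea. It is not the training code from this repository; the function name pit_loss is hypothetical, and the use of a mean absolute (L1) error is an assumption made for illustration.

```python
import itertools

import numpy as np


def pit_loss(pred, target):
    """Permutation invariant L1 loss (illustrative sketch).

    pred, target: arrays of shape (num_sources, ...).
    Returns (loss, perm), where perm[i] is the index of the target
    stream matched to predicted stream i.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    best_loss, best_perm = None, None
    # Try every assignment of targets to predictions and keep the cheapest.
    for perm in itertools.permutations(range(len(target))):
        loss = np.mean(np.abs(pred - target[list(perm)]))
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    target = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    pred = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]  # correct streams, swapped order
    print(pit_loss(pred, target))
```

In this example the prediction contains both sources in swapped order, and the loss is zero under the swapped assignment. Enumerating all permutations is only practical for a small number of sources, such as the two streams produced here.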

Visualizing the locations of sound sources

To view the self-supervised network's class activation map (CAM), use the --cam flag:

python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/

This produces a video in which the CAM is overlaid as a heat map.
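
For readers unfamiliar with class activation maps, the general recipe is to weight the final convolutional feature maps by the classifier weights and sum over channels, then blend the resulting map onto the image. The NumPy sketch below shows this for a single 2D frame. It is a simplified illustration and not the code behind --cam: the function names, the array shapes, and the red-tint overlay are all assumptions, and the network in this repository is a 3D CNN whose activations also extend over time.

```python
import numpy as np


def class_activation_map(features, weights):
    """Weighted sum of feature maps, rescaled to [0, 1].

    features: (H, W, K) activations of the last conv layer (assumed shape).
    weights:  (K,) classifier weights for the class of interest.
    """
    cam = np.tensordot(features, weights, axes=([2], [0]))
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def overlay_heatmap(frame, cam, alpha=0.5):
    """Blend a [0, 1] map onto an RGB frame as a red tint.

    frame: (H, W, 3) float array with values in [0, 1].
    cam:   (H, W) map already resized to the frame size.
    """
    heat = np.zeros_like(frame)
    heat[..., 0] = cam  # put the map in the red channel
    return (1.0 - alpha) * frame + alpha * heat


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    features = rng.rand(4, 4, 8)
    weights = rng.rand(8)
    frame = rng.rand(4, 4, 3)
    blended = overlay_heatmap(frame, class_activation_map(features, weights))
    print(blended.shape)
```

In practice the map is much coarser than the video frame and has to be upsampled to the frame size before blending; that step is omitted here.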

Action recognition and fine-tuning

We have provided example code for training an action recognition model (e.g. on the UCF-101 dataset) in videocls.py. This involves fine-tuning our pretrained audio-visual network. It is also possible to train this network with only visual data (no audio).

Citation

If you use this code in your research, please consider citing our paper:

@article{multisensory2018,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},
  year={2018}
}

Updates

11/08/18: Fixed a bug in the class activation map example code. Added TensorFlow 1.9 compatibility.

Acknowledgements

Our u-net code draws from this implementation of pix2pix.

Owner

Andrew Owens