Code for the paper "Unsupervised Contrastive Learning of Sound Event Representations", ICASSP 2021.

Related tags

Deep Learninguclser20
Overview

Unsupervised Contrastive Learning of
Sound Event Representations

This repository contains the code for the following paper. If you use this code or part of it, please cite:

Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, Xavier Serra, "Unsupervised Contrastive Learning of Sound Event Representations", ICASSP 2021.

arXiv slides poster blog post video

We propose to learn sound event representations using the proxy task of contrasting differently augmented views of sound events, inspired by SimCLR [1]. The different views are computed by:

  • sampling TF patches at random within every input clip,
  • mixing resulting patches with unrelated background clips (mix-back), and
  • other data augmentations (DAs) (RRC, compression, noise addition, SpecAugment [2]).

Our proposed system is illustrated in the figure.

system

Our results suggest that unsupervised contrastive pre-training can mitigate the impact of data scarcity and increase robustness against noisy labels. Please check our paper for more details, or have a quicker look at our slide deck, poster, blog post, or video presentation (see links above).

This repository contains the framework that we used for our paper. It comprises the basic stages to learn an audio representation via unsupervised contrastive learning, and then evaluate the representation via supervised sound event classifcation. The system is implemented in PyTorch.

Dependencies

This framework is tested on Ubuntu 18.04 using a conda environment. To duplicate the conda environment:

conda create --name <envname> --file spec-file.txt

Directories and files

FSDnoisy18k/ includes folders to locate the FSDnoisy18k dataset and a FSDnoisy18k.py to load the dataset (train, val, test), including the data loader for contrastive and supervised training, applying transforms or mix-back when appropriate
config/ includes *.yaml files defining parameters for the different training modes
da/ contains data augmentation code, including augmentations mentioned in our paper and more
extract/ contains feature extraction code. Computes an .hdf5 file containing log-mel spectrograms and associated labels for a given subset of data
logs/ folder for output logs
models/ contains definitions for the architectures used (ResNet-18, VGG-like and CRNN)
pth/ contains provided pre-trained models for ResNet-18, VGG-like and CRNN
src/ contains functions for training and evaluation in both supervised and unsupervised fashion
main_train.py is the main script
spec-file.txt contains conda environment specs

Usage

(0) Download the dataset

Download FSDnoisy18k [3] from Zenodo through the dataset companion site, unzip it and locate it in a given directory. Fix paths to dataset in ctrl section of *.yaml. It can be useful to have a look at the different training sets of FSDnoisy18k: a larger set of noisy labels and a small set of clean data [3]. We use them for training/validation in different ways.

(1) Prepare the dataset

Create an .hdf5 file containing log-mel spectrograms and associated labels for each subset of data:

python extract/wav2spec.py -m test -s config/params_unsupervised_cl.yaml

Use -m with train, val or test to extract features from each subset. All the extraction parameters are listed in params_unsupervised_cl.yaml. Fix path to .hdf5 files in ctrl section of *.yaml.

(2) Run experiment

Our paper comprises three training modes. For convenience, we provide yaml files defining the setup for each of them.

  1. Unsupervised contrastive representation learning by comparing differently augmented views of sound events. The outcome of this stage is a trained encoder to produce low-dimensional representations. Trained encoders are saved under results_models/ using a folder name based on the string experiment_name in the corresponding yaml (make sure to change it).
CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_unsupervised_cl.yaml &> logs/output_unsup_cl.out
  1. Evaluation of the representation using a previously trained encoder. Here, we do supervised learning by minimizing cross entropy loss without data agumentation. Currently, we load the provided pre-trained models sitting in pth/ (you can change this in main_train.py, search for select model). We follow two evaluation methods:

    • Linear Evaluation: train an additional linear classifier on top of the pre-trained unsupervised embeddings.

      CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_supervised_lineval.yaml &> logs/output_lineval.out
      
    • End-to-end Fine Tuning: fine-tune entire model on two relevant downstream tasks after initializing with pre-trained weights. The two downstream tasks are:

      • training on the larger set of noisy labels and validate on train_clean. This is chosen by selecting train_on_clean: 0 in the yaml.
      • training on the small set of clean data (allowing 15% for validation). This is chosen by selecting train_on_clean: 1 in the yaml.

      After choosing the training set for the downstream task, run:

      CUDA_VISIBLE_DEVICES=0 python main_train.py -p config/params_supervised_finetune.yaml &> logs/output_finetune.out
      

The setup in the yaml files should provide the best results reported in our paper. JFYI, the main flags that determine the training mode are downstream, lin_eval and method in the corresponding yaml (they are already adequately set in each yaml).

(3) See results:

Check the logs/*.out for printed results at the end. Main evaluation metric is balanced (macro) top-1 accuracy. Trained models are saved under results_models/models* and some metrics are saved under results_models/metrics*.

Model Zoo

We provide pre-trained encoders as described in our paper, for ResNet-18, VGG-like and CRNN architectures. See pth/ folder. Note that better encoders could likely be obtained through a more exhaustive exploration of the data augmentation compositions, thus defining a more challenging proxy task. Also, we trained on FSDnoisy18k due to our limited compute resources at the time, yet this framework can be directly applied to other larger datasets such as FSD50K or AudioSet.

Citation

@inproceedings{fonseca2021unsupervised,
  title={Unsupervised Contrastive Learning of Sound Event Representations},
  author={Fonseca, Eduardo and Ortego, Diego and McGuinness, Kevin and O'Connor, Noel E. and Serra, Xavier},
  booktitle={2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  organization={IEEE}
}

Contact

You are welcome to contact [email protected] should you have any question/suggestion. You can also create an issue.

Acknowledgment

This work is a collaboration between the MTG-UPF and Dublin City University's Insight Centre. This work is partially supported by Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283 and by the Young European Research University Network under a 2020 mobility award. Eduardo Fonseca is partially supported by a Google Faculty Research Award 2018. The authors are grateful for the GPUs donated by NVIDIA.

References

[1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A Simple Framework for Contrastive Learning of Visual Representations,” in Int. Conf. on Mach. Learn. (ICML), 2020

[2] Park et al., SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. InterSpeech 2019

[3] E. Fonseca, M. Plakal, D. P. W. Ellis, F. Font, X. Favory, X. Serra, "Learning Sound Event Classifiers from Web Audio with Noisy Labels", In proceedings of ICASSP 2019, Brighton, UK

Owner
Eduardo Fonseca
Returning research intern at Google Research | PhD candidate at Music Technology Group, Universitat Pompeu Fabra
Eduardo Fonseca
Deep Q-Learning Network in pytorch (not actively maintained)

pytoch-dqn This project is pytorch implementation of Human-level control through deep reinforcement learning and I also plan to implement the followin

Hung-Tu Chen 342 Jan 01, 2023
Unofficial implementation of the Involution operation from CVPR 2021

involution_pytorch Unofficial PyTorch implementation of "Involution: Inverting the Inherence of Convolution for Visual Recognition" by Li et al. prese

Rishabh Anand 46 Dec 07, 2022
MLOps will help you to understand how to build a Continuous Integration and Continuous Delivery pipeline for an ML/AI project.

page_type languages products description sample python azure azure-machine-learning-service azure-devops Code which demonstrates how to set up and ope

1 Nov 01, 2021
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21)

Baleen Baleen is a state-of-the-art model for multi-hop reasoning, enabling scalable multi-hop search over massive collections for knowledge-intensive

Stanford Future Data Systems 22 Dec 05, 2022
Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression", TIP 2020

Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multil

Xuefeng 5 Jan 15, 2022
PixelPyramids: Exact Inference Models from Lossless Image Pyramids (ICCV 2021)

PixelPyramids: Exact Inference Models from Lossless Image Pyramids This repository contains the PyTorch implementation of the paper PixelPyramids: Exa

Visual Inference Lab @TU Darmstadt 8 Dec 11, 2022
Joint Learning of 3D Shape Retrieval and Deformation, CVPR 2021

Joint Learning of 3D Shape Retrieval and Deformation Joint Learning of 3D Shape Retrieval and Deformation Mikaela Angelina Uy, Vladimir G. Kim, Minhyu

Mikaela Uy 38 Oct 18, 2022
Code for C2-Matching (CVPR2021). Paper: Robust Reference-based Super-Resolution via C2-Matching.

C2-Matching (CVPR2021) This repository contains the implementation of the following paper: Robust Reference-based Super-Resolution via C2-Matching Yum

Yuming Jiang 151 Dec 26, 2022
Establishing Strong Baselines for TripClick Health Retrieval; ECIR 2022

TripClick Baselines with Improved Training Data Welcome 🙌 to the hub-repo of our paper: Establishing Strong Baselines for TripClick Health Retrieval

Sebastian Hofstätter 3 Nov 03, 2022
PyTorch source code for Distilling Knowledge by Mimicking Features

LSHFM.detection This is the PyTorch source code for Distilling Knowledge by Mimicking Features. And this project contains code for object detection wi

Guo-Hua Wang 4 Dec 17, 2022
Yolact-keras实例分割模型在keras当中的实现

Yolact-keras实例分割模型在keras当中的实现 目录 性能情况 Performance 所需环境 Environment 文件下载 Download 训练步骤 How2train 预测步骤 How2predict 评估步骤 How2eval 参考资料 Reference 性能情况 训练数

Bubbliiiing 11 Dec 26, 2022
Get the partition that a file belongs and the percentage of space that consumes

tinos_eisai_sy Get the partition that a file belongs and the percentage of space that consumes (works only with OSes that use the df command) tinos_ei

Konstantinos Patronas 6 Jan 24, 2022
Narya API allows you track soccer player from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent

Narya The Narya API allows you track soccer player from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent. This repository

Paul Garnier 121 Dec 30, 2022
Code for paper Adaptively Aligned Image Captioning via Adaptive Attention Time

Adaptively Aligned Image Captioning via Adaptive Attention Time This repository includes the implementation for Adaptively Aligned Image Captioning vi

Lun Huang 45 Aug 27, 2022
Robust & Reliable Route Recommendation on Road Networks

NeuroMLR: Robust & Reliable Route Recommendation on Road Networks This repository is the official implementation of NeuroMLR: Robust & Reliable Route

4 Dec 20, 2022
CLUES: Few-Shot Learning Evaluation in Natural Language Understanding

CLUES: Few-Shot Learning Evaluation in Natural Language Understanding This repo contains the data and source code for baseline models in the NeurIPS 2

Microsoft 29 Dec 29, 2022
BT-Unet: A-Self-supervised-learning-framework-for-biomedical-image-segmentation-using-Barlow-Twins

BT-Unet: A-Self-supervised-learning-framework-for-biomedical-image-segmentation-using-Barlow-Twins Deep learning has brought most profound contributio

Narinder Singh Punn 12 Dec 04, 2022
MoveNet Single Pose on OpenVINO

MoveNet Single Pose tracking on OpenVINO Running Google MoveNet Single Pose models on OpenVINO. A convolutional neural network model that runs on RGB

35 Nov 11, 2022
PoseViz – Multi-person, multi-camera 3D human pose visualization tool built using Mayavi.

PoseViz – 3D Human Pose Visualizer Multi-person, multi-camera 3D human pose visualization tool built using Mayavi. As used in MeTRAbs visualizations.

István Sárándi 79 Dec 30, 2022
🥈78th place in Riiid Answer Correctness Prediction competition

Riiid Answer Correctness Prediction Introduction This repository is the code that placed 78th in Riiid Answer Correctness Prediction competition. Requ

Jungwoo Park 10 Jul 14, 2022