Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Last update: Jan 02, 2023

Related tags

Deep Learning AudioCLIP

Overview

AudioCLIP

Extending CLIP to Image, Text and Audio

This repository contains implementation of the models described in the paper arXiv:2106.13043. This work based on our previous works:

ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio (2021).
ESResNet: Environmental Sound Classification Based on Visual Domain Models (2020).

Abstract

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.

In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion.

AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, out-performing other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further it sets new baselines in the zero-shot ESC-task on the same datasets (68.78% and 69.40%, respectively).

Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.

Downloading Pre-Trained Weights

The pre-trained model can be downloaded from the releases.

# AudioCLIP trained on AudioSet (text-, image- and audio-head simultaneously)
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt

How to Run the Model

The required Python version is >= 3.7.

AudioCLIP

On the ESC-50 dataset

python main.py --config protocols/audioclip-esc50.json --Dataset.args.root /path/to/ESC50

On the UrbanSound8K dataset

python main.py --config protocols/audioclip-us8k.json --Dataset.args.root /path/to/UrbanSound8K

Cite Us

@misc{guzhov2021audioclip,
      title={AudioCLIP: Extending CLIP to Image, Text and Audio}, 
      author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
      year={2021},
      eprint={2106.13043},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

You might also like...

This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

Predicting Patient Outcomes with Graph Representation Learning This repository contains the code used for Predicting Patient Outcomes with Graph Repre

76 Dec 22, 2022

Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

[TCSVT] Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization LPN [Paper] NEWs Prerequisites Python 3.6 GPU Memory = 8G Numpy 1.

46 Dec 14, 2022

https://arxiv.org/abs/2102.11005

LogME LogME: Practical Assessment of Pre-trained Models for Transfer Learning How to use Just feed the features f and labels y to the function, and yo

149 Dec 19, 2022

Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" https://arxiv.org/abs/2104.02699

ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement Recently, the power of unconditional image synthesis has significantly advanced th

967 Jan 4, 2023

ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

This is the project page for the paper: ISTR: End-to-End Instance Segmentation via Transformers, Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wa

182 Dec 19, 2022

Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Face Identity Disentanglement via Latent Space Mapping - Implement in pytorch with StyleGAN 2 Description Pytorch implementation of the paper Face Ide

58 Dec 24, 2022

Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.

Comments

Make project usable by other python projects: remove git lfs and move files into an audioclip folder

Git lfs was giving problems, so I removed all assets files from it - the files can be found in the "Release" anyways.

Also it was a bit problematic to use this project in other projects because the folder structure was lacking. I moved all files into an "audioclip" folder to fix python pathing for external projects.

I renamed master to main, but I doubt that this change is going to stay once this pull request is merged.

opened by NotNANtoN 0

Releases(v0.1)

v0.1(Jun 29, 2021)
Text embeddings' vocabulary and PyTorch' state_dicts containing weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)

CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)

ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)

AudioCLIP trained on AudioSet (+ video frames)

AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)

AudioCLIP-Partial-Training.pt – training of the audio-head only

Source code(tar.gz)
Source code(zip)
AudioCLIP-Full-Training.pt(512.41 MB)
AudioCLIP-Partial-Training.pt(512.41 MB)
bpe_simple_vocab_16e6.txt.gz(1.29 MB)
CLIP.pt(389.49 MB)
ESRNXFBSP.pt(119.01 MB)

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Related tags

Overview

AudioCLIP

Extending CLIP to Image, Text and Audio

Abstract

Downloading Pre-Trained Weights

How to Run the Model

AudioCLIP

On the ESC-50 dataset

On the UrbanSound8K dataset

Cite Us

You might also like...

This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

Pytorch implementation of Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization https://arxiv.org/abs/2008.11646

https://arxiv.org/abs/2102.11005

Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" https://arxiv.org/abs/2104.02699

ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.

YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

A PyTorch implementation of EventProp [https://arxiv.org/abs/2009.08378], a method to train Spiking Neural Networks

Comments

Make project usable by other python projects: remove git lfs and move files into an audioclip folder

Releases(v0.1)

v0.1(Jun 29, 2021)

Owner

The official codes of "Semi-supervised Models are Strong Unsupervised Domain Adaptation Learners".

Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

Here I will explain the flow to deploy your custom deep learning models on Ultra96V2.

Training DiffWave using variational method from Variational Diffusion Models.

[ACM MM 2021] Joint Implicit Image Function for Guided Depth Super-Resolution

An AI Assistant More Than a Toolkit

git《Investigating Loss Functions for Extreme Super-Resolution》(CVPR 2020) GitHub:

Aydin is a user-friendly, feature-rich, and fast image denoising tool

SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images

QuakeLabeler is a Python package to create and manage your seismic training data, processes, and visualization in a single place — so you can focus on building the next big thing.

NHS AI Lab Skunkworks project: Long Stayer Risk Stratification

This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Code for this paper The Lottery Ticket Hypothesis for Pre-trained BERT Networks.

arxiv-sanity, but very lite, simply providing the core value proposition of the ability to tag arxiv papers of interest and have the program recommend similar papers.

RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering

Efficient-GlobalPointer - Pytorch Efficient GlobalPointer

This repository contains the code for the paper in EMNLP 2021: "HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression".

The Generic Manipulation Driver Package - Implements a ROS Interface over the robotics toolbox for Python

A small demonstration of using WebDataset with ImageNet and PyTorch Lightning