Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

Overview

MusCaps: Generating Captions for Music Audio

Ilaria Manco1 2, Emmanouil Benetos1, Elio Quinton2, Gyorgy Fazekas1
1 Queen Mary University of London, 2 Universal Music Group

This repository is the official implementation of "MusCaps: Generating Captions for Music Audio" (IJCNN 2021). In this work, we propose an encoder-decoder model to generate natural language descriptions of music audio. We provide code to train our model on any dataset of (audio, caption) pairs, together with code to evaluate the generated descriptions on a set of automatic metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE, SPIDEr).

Setup

The code was developed in Python 3.7 on Linux CentOS 7 and training was carried out on an RTX 2080 Ti GPU. Other GPUs and platforms have not been fully tested.

Clone the repo

git clone https://github.com/ilaria-manco/muscaps
cd muscaps

You'll need to have the libsndfile library installed. All other requirements, including the code package, can be installed with

pip install -r requirements.txt
pip install -e .

Project structure

root
├─ configs                      # Config files
│   ├─ datasets
│   ├─ models  
│   └─ default.yaml              
├─ data                         # Folder to save data (input data, pretrained model weights, etc.)
│   ├─ audio_encoders   
│   ├─ datasets            
│   │   └─ dataset_name     
|   └── ...             
├─ muscaps
|   ├─ caption_evaluation_tools # Translation metrics eval on audio captioning 
│   ├─ datasets                 # Dataset classes
│   ├─ models                   # Model code
│   ├─ modules                  # Model components
│   ├─ scripts                  # Python scripts for training, evaluation etc.
│   ├─ trainers                 # Trainer classes
│   └─ utils                    # Utils
└─ save                         # Saved model checkpoints, logs, configs, predictions    
    └─ experiments
        ├── experiment_id1
        └── ...                  

Dataset

The datasets used in our experiments is private and cannot be shared, but details on how to prepare an equivalent music captioning dataset are provided in the data README.

Pre-trained audio feature extractors

For the audio feature extraction component, MusCaps uses CNN-based audio tagging models like musicnn. In our experiments, we use @minzwon's implementation and pre-trained models, which you can download from the official repo. For example, to obtain the weights for the HCNN model trained on the MagnaTagATune dataset, run the following commands

mkdir data/audio_encoders
cd data/audio_encoders/
wget https://github.com/minzwon/sota-music-tagging-models/raw/master/models/mtat/hcnn/best_model.pth
mv best_model.pth mtt_hcnn.pth

Training

Dataset, model and training configurations are set in the respective yaml files in configs. Some of the fields can be overridden by arguments in the CLI (for more details on this, refer to the training script).

To train the model with the default configs, simply run

cd muscaps/scripts/
python train.py <baseline/attention> --feature_extractor <musicnn/hcnn> --pretrained_model <msd/mtt>  --device_num <gpu_number>

This will generate an experiment_id and create a new folder in save/experiments where the output will be saved.

If you wish to resume training from a saved checkpoint, run

python train.py <baseline/attention> --experiment_id <experiment_id>  --device_num <gpu_number>

Evaluation

To evaluate a model saved under <experiment_id> on the captioning task, run

cd muscaps/scripts/
python caption.py <experiment_id> --metrics True

Cite

@misc{manco2021muscaps,
      title={MusCaps: Generating Captions for Music Audio}, 
      author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gyorgy Fazekas},
      year={2021},
      eprint={2104.11984},
      archivePrefix={arXiv}
}

Acknowledgements

This repo reuses some code from the following repos:

Contact

If you have any questions, please get in touch: [email protected].

Owner
Ilaria Manco
AI & Music PhD Researcher at the Centre for Digital Music (QMUL)
Ilaria Manco
Preprocessed Datasets for our Multimodal NER paper

Unified Multimodal Transformer (UMT) for Multimodal Named Entity Recognition (MNER) Two MNER Datasets and Codes for our ACL'2020 paper: Improving Mult

76 Dec 21, 2022
Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and Beyond

CRF - Conditional Random Fields A library for dense conditional random fields (CRFs). This is the official accompanying code for the paper Regularized

Đ.Khuê Lê-Huu 21 Nov 26, 2022
This is the pytorch implementation of the paper - Axiomatic Attribution for Deep Networks.

Integrated Gradients This is the pytorch implementation of "Axiomatic Attribution for Deep Networks". The original tensorflow version could be found h

Tianhong Dai 150 Dec 23, 2022
Recurrent Conditional Query Learning

Recurrent Conditional Query Learning (RCQL) This repository contains the Pytorch implementation of One Model Packs Thousands of Items with Recurrent C

Dongda 4 Nov 28, 2022
[NeurIPS 2021] A weak-shot object detection approach by transferring semantic similarity and mask prior.

[NeurIPS 2021] A weak-shot object detection approach by transferring semantic similarity and mask prior.

BCMI 49 Jul 27, 2022
Codes for [NeurIPS'21] You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership.

You are caught stealing my winning lottery ticket! Making a lottery ticket claim its ownership Codes for [NeurIPS'21] You are caught stealing my winni

VITA 8 Nov 01, 2022
2D Time independent Schrodinger equation solver for arbitrary shape of well

Schrodinger Well Python Python solver for timeless Schrodinger equation for well with arbitrary shape https://imgur.com/a/jlhK7OZ Pictures of circular

WeightAn 24 Nov 18, 2022
Everything about being a TA for ITP/AP course!

تی‌ای بودن! تی‌ای یا دستیار استاد از نقش‌های رایج بین دانشجویان مهندسی است، این ریپوزیتوری قرار است نکات مهم درمورد تی‌ای بودن و تی ای شدن را به ما نش

<a href=[email protected]"> 14 Sep 10, 2022
Class-Attentive Diffusion Network for Semi-Supervised Classification [AAAI'21] (official implementation)

Class-Attentive Diffusion Network for Semi-Supervised Classification Official Implementation of AAAI 2021 paper Class-Attentive Diffusion Network for

Jongin Lim 7 Sep 20, 2022
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models (ICCV 2021 Oral)

ILVR + ADM This is the implementation of ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models (ICCV 2021 Oral). This repository is h

Jooyoung Choi 225 Dec 28, 2022
GPU Programming with Julia - course at the Swiss National Supercomputing Centre (CSCS), ETH Zurich

Course Description The programming language Julia is being more and more adopted in High Performance Computing (HPC) due to its unique way to combine

Samuel Omlin 192 Jan 03, 2023
PyTorch implementation of "Representing Shape Collections with Alignment-Aware Linear Models" paper.

deep-linear-shapes PyTorch implementation of "Representing Shape Collections with Alignment-Aware Linear Models" paper. If you find this code useful i

Romain Loiseau 27 Sep 24, 2022
Evaluating AlexNet features at various depths

Linear Separability Evaluation This repo provides the scripts to test a learned AlexNet's feature representation performance at the five different con

Yuki M. Asano 32 Dec 30, 2022
A Pytorch Implementation for Compact Bilinear Pooling.

CompactBilinearPooling-Pytorch A Pytorch Implementation for Compact Bilinear Pooling. Adapted from tensorflow_compact_bilinear_pooling Prerequisites I

169 Dec 23, 2022
The source code and dataset for the RecGURU paper (WSDM 2022)

RecGURU About The Project Source code and baselines for the RecGURU paper "RecGURU: Adversarial Learning of Generalized User Representations for Cross

Chenglin Li 17 Jan 07, 2023
Python Implementation of the CoronaWarnApp (CWA) Event Registration

Python implementation of the Corona-Warn-App (CWA) Event Registration This is an implementation of the Protocol used to generate event and location QR

MaZderMind 17 Oct 05, 2022
Toward Multimodal Image-to-Image Translation

BicycleGAN Project Page | Paper | Video Pytorch implementation for multimodal image-to-image translation. For example, given the same night image, our

Jun-Yan Zhu 1.4k Dec 22, 2022
PlaidML is a framework for making deep learning work everywhere.

A platform for making deep learning work everywhere. Documentation | Installation Instructions | Building PlaidML | Contributing | Troubleshooting | R

PlaidML 4.5k Jan 02, 2023
Robustness between the worst and average case

Robustness between the worst and average case A repository that implements intermediate robustness training and evaluation from the NeurIPS 2021 paper

CMU Locus Lab 16 Dec 02, 2022
Bayes-Newton—A Gaussian process library in JAX, with a unifying view of approximate Bayesian inference as variants of Newton's algorithm.

Bayes-Newton Bayes-Newton is a library for approximate inference in Gaussian processes (GPs) in JAX (with objax), built and actively maintained by Wil

AaltoML 165 Nov 27, 2022