The training code for the 4th place model at MDX 2021 leaderboard A.

Last update: Dec 18, 2022

Overview

This repository contains the training code of our winning model at Music Demixing Challenge 2021, which got the 4th place on leaderboard A (6th in overall), and help us (Kazane Ryo no Danna) winned the bronze prize.

Model Summary

Our final winning approach blends the outputs from three models, which are:

model 1: A X-UMX model [1] which is initialized with the weights of the official baseline, and is fine-tuned with a modified Combinational Multi-Domain Loss from [1]. In particular, we implement and apply a differentiable Multichannel Wiener Filter (MWF) [2] before the loss calculation, and compute the frequency-domain L2 loss with raw complex values.
model 2: A U-Net which is similar to Spleeter [3], where all convolution layers are replaced by D3 Blocks from [4], and two layers of 2D local attention are applied at the bottleneck similar to [5].
model 3: A modified version of Demucs [6], where the original decoding module is replaced by four independent decoders, each of which corresponds to one source.

We didn't encounter overfitting in our pilot experiments, so we used the full musdb training set for all the models above, and stopped training upon convergence of the loss function.

The weights of the three outputs are determined empirically:

	Drums	Bass	Other	Vocals
model 1	0.2	0.1	0	0.2
model 2	0.2	0.17	0.5	0.4
model 3	0.6	0.73	0.5	0.4

For the spectrogram-based models (model 1 and 2), we apply MWF to the outputs with one iteration before the fusion.

[1] Sawata, Ryosuke, et al. "All for One and One for All: Improving Music Separation by Bridging Networks." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

[2] Antoine Liutkus, & Fabian-Robert Stöter. (2019). sigsep/norbert: First official Norbert release (v0.2.0). Zenodo. https://doi.org/10.5281/zenodo.3269749

[3] Hennequin, Romain, et al. "Spleeter: a fast and efficient music source separation tool with pre-trained models." Journal of Open Source Software 5.50 (2020): 2154.

[4] Takahashi, Naoya, and Yuki Mitsufuji. "D3net: Densely connected multidilated densenet for music source separation." arXiv preprint arXiv:2010.01733 (2020).

[5] Wu, Yu-Te, Berlin Chen, and Li Su. "Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2796-2809.

[6] Défossez, Alexandre, et al. "Music source separation in the waveform domain." arXiv preprint arXiv:1911.13254 (2019).

How to reproduce the training

Install Requirements / Build Virtual Environment

We recommend using conda.

conda env create -f environment.yml
conda activate demixing

Prepare Data

Please download musdb, and edit the "root" parameters in all the json files listed under configs/ to the path where you have the dataset.

Training Model 1

First download the pre-trained model:

wget https://zenodo.org/record/4740378/files/pretrained_xumx_musdb18HQ.pth

Copy the weights for initializing our model:

python xumx_weights_convert.py pretrained_xumx_musdb18HQ.pth xumx_weights.pth

Start training!

python train.py configs/x_umx_mwf.json --weights xumx_weights.pth

Checkpoints will be located under saved/. The config was set to run on a single RTX 3070.

Training Model 2

python train.py configs/unet_attn.json --device_ids 0 1 2 3

Checkpoints will be located under saved/. The config was set to run on four Tesla V100.

Training Model 3

python train.py configs/demucs_split.json

Checkpoints will be located under saved/. The config was set to run on a single RTX 3070, using gradient accumulation and mixed precision training.

Tensorboard Logging

You can monitor the training process using tensorboard:

tesnorboard --logdir runs/

Inference

First make sure you installed danna-sep. Then convert your checkpoints into jit scripts and replace the files under DANNA_CHECKPOINTS:

python jit_convert.py configs/x_umx_mwf.json saved/CrossNet\ Open-Unmix_checkpoint_XXX.pt $DANNA_CHECKPOINTS/xumx_mwf.pth

python jit_convert.py configs/unet_attn.json saved/UNet\ Attention_checkpoint_XXX.pt $DANNA_CHECKPOINTS/unet_attention.pth

python jit_convert.py configs/demucs_split.json saved/DemucsSplit_checkpoint_XXX.pt $DANNA_CHECKPOINTS/demucs_4_decoders.pth

Now you can use danna-sep to separate you favorite music and see how it works!

The training code for the 4th place model at MDX 2021 leaderboard A.

Related tags

Overview

Model Summary

How to reproduce the training

Install Requirements / Build Virtual Environment

Prepare Data

Training Model 1

Training Model 2

Training Model 3

Tensorboard Logging

Inference

Additional Resources

Owner

Chin-Yun Yu

FedNLP: A Benchmarking Framework for Federated Learning in Natural Language Processing

Training code for Korean multi-class sentiment analysis

NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)

glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

NL. The natural language programming language.

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

Test finetuning of XLSR (multilingual wav2vec 2.0) for other speech classification tasks

A simple word search made in python

Python package for performing Entity and Text Matching using Deep Learning.

Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

Pytorch implementation of winner from VQA Chllange Workshop in CVPR'17

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

Automatically search Stack Overflow for the command you want to run

A Practitioner's Guide to Natural Language Processing

customer care chatbot made with Rasa Open Source.

Official implementation of Meta-StyleSpeech and StyleSpeech

Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)