Video Background Music Generation with Controllable Music Transformer (ACM MM 2021 Oral)

Last update: Dec 27, 2022

Overview

CMT

Code for the paper "Video Background Music Generation with Controllable Music Transformer" (ACM MM 2021 Best Paper Award)

[Paper] [Site]

Directory Structure

src/: code for the whole pipeline
- train.py: training script; takes an npz file of music data as input and trains the model
- model.py: model definition
- gen_midi_conditional.py: inference script; takes an npz file (representing a video) as input and generates several songs
- src/video2npz/: converts a video into npz by extracting motion saliency and motion speed
dataset/: processed training dataset, in npz format
logs/: logs generated automatically during training; can be used to track training progress
exp/: checkpoints, named after validation loss (e.g. loss_13_params.pt)
inference/: processed videos for inference (.npz) and generated music (.mid)
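
Because checkpoints are named after their validation loss, the one to load can be picked programmatically. The following is a minimal sketch, assuming every checkpoint follows the loss_<N>_params.pt pattern from the example above with an integer N. The helper name best_checkpoint is hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming pattern, inferred from the example "loss_13_params.pt".
CKPT_PATTERN = re.compile(r"^loss_(\d+)_params\.pt$")


def best_checkpoint(exp_dir):
    """Return the checkpoint path with the lowest loss in its name, or None."""
    best = None
    # Search recursively, since checkpoints may sit in per-experiment subdirectories.
    for path in Path(exp_dir).rglob("*.pt"):
        match = CKPT_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        loss = int(match.group(1))
        if best is None or loss < best[0]:
            best = (loss, path)
    return None if best is None else best[1]
```

Calling best_checkpoint("exp") returns a pathlib.Path that can be passed to the -c option of the inference script, or None if no matching file exists.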

Preparation

clone this repo
download lpd_5_prcem_mix_v8_10000.npz from HERE and put it under dataset/
download pretrained model loss_8_params.pt from HERE and put it under exp/
install ffmpeg=3.2.4
prepare a Python3 conda environment
```
pip install -r py3_requirements.txt
```
prepare a Python2 conda environment (for extracting visbeat)
```
pip install -r py2_requirements.txt
```
- open the visbeat package directory (e.g. anaconda3/envs/XXXX/lib/python2.7/site-packages/visbeat) and replace the original Video_CV.py with src/video2npz/Video_CV.py
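
Before training, it can help to confirm that the downloaded dataset file loads correctly. The sketch below assumes only that the file is a standard NumPy .npz archive. The array names stored inside are not documented here, so the helper (hypothetical name describe_npz) simply lists whatever it finds.

```python
import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    summary = {}
    # allow_pickle=True is an assumption: some archives store object arrays.
    # Only enable it for files from a source you trust.
    with np.load(path, allow_pickle=True) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary
```

For example, describe_npz("dataset/lpd_5_prcem_mix_v8_10000.npz") would report the shape and dtype of each stored array once the file has been placed under dataset/.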

Training

To use another training set, convert the training data from MIDI into npz under dataset/:
```
python midi2numpy_mix.py --midi_dir /PATH/TO/MIDIS/ --out_name data.npz 
```

train the model

python train.py -n XXX -g 0 1 2 3

# -n XXX: name of the experiment; used as the name of the log file and the checkpoint directory. If XXX is 'debug', checkpoints are not saved
# -l (--lr): initial learning rate
# -b (--batch_size): batch size
# -p (--path): if given, load the model checkpoint from this path
# -e (--epochs): number of training epochs
# -t (--train_data): path to the training data (.npz file)
# -g (--gpus): GPU ids
# other model hyperparameters: modify the source .py files
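
The options above can be pictured as an argparse parser. This is only a sketch of how such a command line might be declared, not the actual contents of train.py: the long form --name, the value types, and the absence of defaults are all assumptions, and should_save_checkpoints is a hypothetical helper illustrating the 'debug' rule.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented train.py flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the train.py options")
    # "--name" is an assumed long form; only "-n" is documented.
    parser.add_argument("-n", "--name", required=True, help="experiment name")
    parser.add_argument("-l", "--lr", type=float, help="initial learning rate")
    parser.add_argument("-b", "--batch_size", type=int, help="batch size")
    parser.add_argument("-p", "--path", help="checkpoint to load, if any")
    parser.add_argument("-e", "--epochs", type=int, help="number of epochs")
    parser.add_argument("-t", "--train_data", help="path to the training .npz")
    # nargs="+" accepts one or more ids, as in "-g 0 1 2 3".
    parser.add_argument("-g", "--gpus", type=int, nargs="+", help="GPU ids")
    return parser


def should_save_checkpoints(name):
    """Checkpoints are skipped when the experiment is named 'debug'."""
    return name != "debug"
```

With this sketch, build_parser().parse_args(["-n", "debug", "-g", "0", "1"]) yields gpus equal to [0, 1], and should_save_checkpoints("debug") is False.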

Inference

convert input video (MP4 format) into npz (use the Python2 environment)
```
cd src/video2npz
sh video2npz.sh ../../videos/xxx.mp4
```
- try resizing the video if this takes a long time
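
One way to resize the video is ffmpeg's scale filter. The sketch below builds such a command from Python 3. The target height of 360 pixels and the helper names are assumptions chosen for illustration, not values recommended by the authors.

```python
import subprocess


def build_resize_cmd(src, dst, height=360):
    """Build an ffmpeg command that scales a video to the given height."""
    # scale=-2:H keeps the aspect ratio and rounds the width to an even number,
    # which most video encoders require.
    return ["ffmpeg", "-i", src, "-vf", f"scale=-2:{height}", "-c:a", "copy", dst]


def resize_video(src, dst, height=360):
    """Run the resize command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_resize_cmd(src, dst, height), check=True)
```

The resized file can then be passed to video2npz.sh in place of the original.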

run the model to generate .mid:

python gen_midi_conditional.py -f "../inference/xxx.npz" -c "../exp/loss_8_params.pt"

# -c: checkpoints to be loaded
# -f: input npz file
# -g: id of gpu (only one gpu is needed for inference)

if using another training set, change decoder_n_class in gen_midi_conditional.py to match decoder_n_class in train.py

convert midi into audio: use GarageBand (recommended) or midi2audio
- set tempo to the value of tempo in video2npz/metadata.json
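
The tempo lookup and the midi2audio route can be sketched in Python. This assumes metadata.json is a JSON object with a top-level "tempo" key, as the step above suggests, and that midi2audio and its FluidSynth backend are installed. The helper names are hypothetical. Note that this rendering call plays the MIDI file as written and does not itself change the tempo.

```python
import json


def read_tempo(metadata_path):
    """Return the tempo stored in metadata.json (assumes a top-level "tempo" key)."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    return metadata["tempo"]


def render_midi(midi_path, audio_path, sound_font=None):
    """Render a MIDI file to audio with midi2audio's FluidSynth wrapper."""
    # Imported lazily so read_tempo works even without midi2audio installed.
    from midi2audio import FluidSynth

    synth = FluidSynth(sound_font) if sound_font else FluidSynth()
    synth.midi_to_audio(midi_path, audio_path)
```

read_tempo("video2npz/metadata.json") gives the value to enter in GarageBand, while render_midi offers a scriptable alternative when a sound font is available.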

combine the original video and the generated audio into a video with BGM

ffmpeg -i 'xxx.mp4' -i 'yyy.mp3' -c:v copy -c:a aac -strict experimental -map 0:v:0 -map 1:a:0 'zzz.mp4'

# xxx.mp4: input video
# yyy.mp3: audio file generated in the previous step
# zzz.mp4: output video
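
The same ffmpeg call can be scripted, for instance when processing several videos. A minimal sketch, with hypothetical helper names, that reproduces the command above argument by argument:

```python
import subprocess


def build_mux_cmd(video, audio, output):
    """Build the ffmpeg command that pairs a video stream with a new audio track."""
    return [
        "ffmpeg",
        "-i", video,               # input 0: original video
        "-i", audio,               # input 1: generated audio
        "-c:v", "copy",            # copy the video stream without re-encoding
        "-c:a", "aac",             # encode the audio as AAC
        "-strict", "experimental",
        "-map", "0:v:0",           # take video from input 0
        "-map", "1:a:0",           # take audio from input 1
        output,
    ]


def add_bgm(video, audio, output):
    """Run the mux command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_cmd(video, audio, output), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.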

Owner

Zhaokai Wang
