MUSIC-AVQA, CVPR2022 (ORAL)

Related tags

AudioMUSIC-AVQA
Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanies our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/


What's Audio-Visual Question Answering Task?

We focus on audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performance, which contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

  • QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. An overview of the proposed framework is illustrated in below figure.

Requirements

python3.6 +
pytorch1.6.0
tensorboardX
ffmpeg
numpy

Usage

  1. Clone this repo

    https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git
  2. Download data

    Annotations (QA pairs, etc.)

    • Available for download at here
    • The annotation files are stored in JSON format. Each annotation file contains seven different keyword. And more detail see in Project Homepage

    Features

    • We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.

    • The audio and visual features of videos in the MUSIC-AVQA dataset can be download from Baidu Drive (password: cvpr):

      • VGGish feature shape: [T, 128]  Download (112.7M)
      • ResNet18 feature shape: [T, 512]  Download (972.6M)
      • R(2+1)D feature shape: [T, 512]  Download (973.9M)
    • The features are in the ./data/feats folder.

    • 14x14 features, too large to share ... but we can extract from raw video frames.

    Download videos frames

    • Raw videos: Availabel at Baidu Drive (password: cvpr):.

      Note: Please move all downloaded videos to a folder, for example, create a new folder named MUSIC-AVQA-Videos, which contains 9,288 real videos and synthetic videos.

    • Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).

    • Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.

    • Pandas and ffmpeg libraries are required.

  3. Data pre-processing

    Extract audio waveforms from videos. The extracted audios will be in the ./data/audio folder. moviepy library is used to read videos and extract audios.

    python feat_script/extract_audio_cues/extract_audio.py	

    Extract video frames from videos. The extracted frames will be in the data/frames folder.

    python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
  4. Feature extraction

    Audio feature. TensorFlow1.4 and VGGish pretrained on AudioSet is required. Feature file also can be found from here (password: cvpr).

    python feat_script/extract_audio_feat/audio_feature_extractor.py

    2D visual feature. Pretrained models library is required.

    python feat_script/eatract_visual_feat/extract_rgb_feat.py

    3D visual feature.

    python feat_script/eatract_visual_feat/extract_3d_feat.py

    14x14 visual feature.

    python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py
  5. Baseline Model

    Training

    python net_grd_baseline/main_qa_grd_baseline.py --mode train

    Testing

    python net_grd_baseline/main_qa_grd_baseline.py --mode test
  6. Our Audio-Visual Spatial-Temporal Model

    We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Audio-Visual grounding generation

    python grounding_gen/main_grd_gen.py

    Training

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Testing

    python net_grd_avst/main_avst.py --mode test \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

Results

  1. Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.

  2. Visualized spatio-temporal grounding results

    We provide several visualized spatial grounding results. The heatmap indicates the location of sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate the spatial reasoning.

    Firstly, ./grounding_gen/models_grd_vis/ should be created.

    python grounding_gen/main_grd_gen_vis.py

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title	= {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author	= {Guangyao li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
  journal	= {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year	= {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.

?️ Open Source Audio Matching and Mastering

Matching + Mastering = ❤️ Matchering 2.0 is a novel Containerized Web Application and Python Library for audio matching and mastering. It follows a si

Sergey Grishakov 781 Jan 05, 2023
Users can transcribe their favorite piano recordings to MIDI files after installation

Users can transcribe their favorite piano recordings to MIDI files after installation

190 Dec 17, 2022
Pyrogram bot to automate streaming music in voice chats

Pyrogram bot to automate streaming music in voice chats Help If you face an error, want to discuss this project or get support for it, join it's group

Roj 124 Oct 21, 2022
Use android as mic/speaker for ubuntu

Pulse Audio Control Panel Platforms Requirements sudo apt install ffmpeg pactl (already installed) Download Download the AppImage from release page ch

19 Dec 01, 2022
ianZiPu is a way to write notation for Guqin (古琴) music.

PyBetween Wrapper for Between - 비트윈을 위한 파이썬 라이브러리 Legal Disclaimer 오직 교육적 목적으로만 사용할수 있으며, 비트윈은 VCNC의 자산입니다. 악의적 공격에 이용할시 처벌 받을수 있습니다. 사용에 따른 책임은 사용자가

Nancy Yi Liang 8 Nov 25, 2022
Manipulate audio with a simple and easy high level interface

Pydub Pydub lets you do stuff to audio in a way that isn't stupid. Stuff you might be looking for: Installing Pydub API Documentation Dependencies Pla

James Robert 6.6k Jan 01, 2023
Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files according to their common names

Batch Sorting Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files accord

David Mainoo 1 Oct 29, 2021
This Bot can extract audios and subtitles from video files

Send any valid video file and the bot shows you available streams in it that can be extracted!!

TroJanzHEX 56 Nov 22, 2022
Audio processor to map oracle notes in the VoG raid in Destiny 2 to call outs.

vog_oracles Audio processor to map oracle notes in the VoG raid in Destiny 2 to call outs. Huge thanks to mzucker on GitHub for the note detection cod

19 Sep 29, 2022
Algorithmic and AI MIDI Drums Generator Implementation

Algorithmic and AI MIDI Drums Generator Implementation

Tegridy Code 8 Dec 30, 2022
Audio2midi - Automatic Audio-to-symbolic Arrangement

Automatic Audio-to-symbolic Arrangement This is the repository of the project "Audio-to-symbolic Arrangement via Cross-modal Music Representation Lear

Ziyu Wang 24 Dec 05, 2022
Desktop music recognition application for windows

MusicRecognizer Music recognition application for windows You can choose from which of the devices the recording will be made. If you choose speakers,

Nikita Merzlyakov 28 Dec 13, 2022
The official repository for Audio ALBERT

AALBERT Here is also the official repository of AALBERT, which is Pytorch lightning reimplementation of the paper, Audio ALBERT: A Lite Bert for Self-

pohan 55 Dec 11, 2022
Audio fingerprinting and recognition in Python

dejavu Audio fingerprinting and recognition algorithm implemented in Python, see the explanation here: How it works Dejavu can memorize audio by liste

Will Drevo 6k Jan 06, 2023
This is an AI that runs in the terminal. It is a voice assistant that can do common activities and can also help in your coding doubts like

This is an AI that runs in the terminal. It is a voice assistant that can do common activities and can also help in your coding doubts like

OneBit 1 Nov 05, 2021
A python package for calculating the PESQ.

PyPESQ (WIP) Pypesq is a python wrapper for the PESQ score calculation C routine. It only can be used in evaluation purpose. INSTALL pip install https

Jingdong Li 269 Dec 18, 2022
A Youtube audio player for your terminal

AudioLine A lightweight Youtube audio player for your terminal Explore the docs » View Demo · Report Bug · Request Feature · Send a Pull Request About

Haseeb Khalid 26 Jan 04, 2023
🎵 A music bot for discord servers!

music bot A music bot for Discord Servers Features Play songs in your discord server Get the lyrics without going on a web explorer Commands Command P

1 Jul 25, 2022
TwitterMusicBot - A Twitter bot with Spotify integration.

A Twitter Music Bot 🤖 🎵 🎶 I created this project to learn more about APIs, so it only works for student purposes. Initially, delving into the Spoti

Gustavo Oliveira 2 Jan 02, 2022
Play any song directly into your group voice chat.

Telegram VCPlayer Bot Play any song directly into your group voice chat. Official Bot : VCPlayerBot | Discussion Group : VoiceChat Music Player Suppor

Shubham Kumar 50 Nov 21, 2022