A music comments dataset, containing 39,051 comments for 27,384 songs.

Overview

Music Comments Dataset

License: AGPL v3

For academic research use only.

Introduction

This dataset is part of a recent multimodal deep learning project on music and natural language that I have been working on. The complete dataset contains a 30-second audio clip, metadata, lyrics, and comments for each song. This release contains only the lyrics and comments.

At the current stage, dataset_summarization_positive.pkl contains 39,051 comments for 27,384 songs. The other files are less strictly filtered and can be used if a larger dataset is needed.

Because there is much less audio data than review data, I kept only the reviews for songs that have audio, so that music and reviews always appear in pairs.

Here is a data sample:

Lyrics: Come up to meet you, tell you I'm sorry; You don't know how lovely you are; I had to find you, tell you I need you; ; Tell you I set you apart; Tell me your secrets and ask me your questions; Oh, let's go back to the start; ; Running in circles, coming up tails; Heads on a science apart; Nobody said it was easy; ; It's such a shame for us to part; Nobody said it was easy; No one ever said it would be this hard; ; Oh, take me back to the start; I was just guessing at numbers and figures; Pulling the puzzles apart; Questions of science, science and progress; ; Do not speak as loud as my heart; ; But tell me you love me, come back and haunt me; Oh and I rush to the start; Running in circles, chasing our tails; ; Coming back as we are; Nobody said it was easy; Oh, it's such a shame for us to part; Nobody said it was easy; No one ever said it would be so hard; I'm going back to the start; Oh ooh, ooh ooh ooh ooh; Ah ooh, ooh ooh ooh ooh; Oh ooh, ooh ooh ooh ooh; Oh ooh, ooh ooh ooh ooh

Ground Truth: The song is like poetry with many meanings to be sifted out applicable to many people in many different relationship situations. I find the lyrics touch me as if specifically written regarding my own situations at times. The following meaning I describe in no way reflects any situation I have ever had to face.

Data Source and Data Preprocessing

The audio and metadata files are from the Music4All Dataset, which I cannot redistribute due to agreement restrictions. Anyone who would like to use that dataset can contact its authors directly.

The review data is mainly from songmeanings.com. I have done some data pre-processing to make the comment data more concise.

The first is the summarization method. I use a generative summarization method to remove useless information from the comments (see Figure 1).

The second is the positive method. Each original comment carries a rating, which reflects how strongly the community agrees with the comment. The positive token in a file name means that I kept only comments with ratings > 0. The not_negative token means that I kept comments with ratings >= 0.
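The two rating filters can be sketched as follows. This is a minimal illustration and not the author's preprocessing code: the `rating` and `comment` field names, the helper function names, and the sample records are all assumptions made for the example.

```python
# Hypothetical comment records; the field names are assumed for illustration
# and may not match the raw data collected from songmeanings.com.
comments = [
    {"comment": "Beautiful song about regret.", "rating": 3},
    {"comment": "I think it is about science.", "rating": 0},
    {"comment": "Overrated.", "rating": -2},
]


def filter_positive(records):
    """Keep comments with rating > 0 (the `positive` variant)."""
    return [r for r in records if r["rating"] > 0]


def filter_not_negative(records):
    """Keep comments with rating >= 0 (the `not_negative` variant)."""
    return [r for r in records if r["rating"] >= 0]


positive = filter_positive(comments)
not_negative = filter_not_negative(comments)
print(len(positive), len(not_negative))
```

By construction, every comment kept by the `positive` filter is also kept by the `not_negative` filter, so the `not_negative` files should be at least as large as the corresponding `positive` files.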

Folder Structure

.
├── README.md
├── codes
│   └── data.py
└── dataset
    ├── dataset_summarization_positive.pkl
    ├── dataset_summarization_not_negative.pkl
    ├── dataset_summarization.pkl
    ├── dataset_positive.pkl
    ├── dataset_not_negative.pkl
    └── dataset.pkl

In the data.py file, I have provided a PyTorch Dataset class to use.

Data Format

Each .pkl file is a pickled list of objects. It can be loaded and read using the LyricsCommentsDatasetPsuedo class in data.py.
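For a quick look at a file without PyTorch, something like the following sketch may work, assuming the files are standard Python pickles. The helper name `load_pickled_list` is hypothetical, and the example writes a tiny stand-in file (holding dicts rather than the real objects) so that it runs without the dataset. If the real files store instances of a custom class, that class must be importable when unpickling. Only unpickle files from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled_list(path):
    """Load a pickled list from `path` (hypothetical helper, not part of data.py)."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if not isinstance(data, list):
        raise TypeError(f"expected a list, got {type(data).__name__}")
    return data


# Stand-in file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump([{"lyrics": "la la", "comment": "nice"}], f)
    items = load_pickled_list(demo_path)

print(len(items), items[0]["comment"])
```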

Each item has two attributes: lyrics and comment. One set of lyrics may correspond to more than one comment, so I repeat (broadcast) the lyrics so that every comment is paired with its lyrics.
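The pairing described above can be illustrated with a short sketch. The `LyricsCommentPair` class, the `broadcast_lyrics` function, and the sample data are hypothetical; they only mirror the two attributes named here and are not taken from data.py.

```python
from dataclasses import dataclass


@dataclass
class LyricsCommentPair:
    """One dataset item: a comment together with the lyrics it refers to."""
    lyrics: str
    comment: str


def broadcast_lyrics(songs):
    """Turn (lyrics, [comments]) entries into one item per comment.

    The lyrics string is repeated for every comment on the same song.
    """
    pairs = []
    for lyrics, comments in songs:
        for comment in comments:
            pairs.append(LyricsCommentPair(lyrics=lyrics, comment=comment))
    return pairs


# Made-up sample: the first song has two comments, the second has one.
songs = [
    ("Nobody said it was easy", ["Sad but hopeful.", "About a breakup."]),
    ("Oh, take me back to the start", ["Nostalgic."]),
]
pairs = broadcast_lyrics(songs)
print(len(pairs))
```

With this layout the number of items equals the number of comments rather than the number of songs, which matches the 39,051 comments for 27,384 songs reported for dataset_summarization_positive.pkl.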

Citation

@article{zhanggenerating,
  title={Generating Comments from Music and Lyrics},
  author={Zhang, Yixiao and Dixon, Simon},
  year={2021}
}
Owner
Zhang Yixiao
AI and Music PhD Student @c4dm