An open collection of annotated voices in Japanese language

Last update: Dec 14, 2022

Related tags

Text Data & NLP koniwa

Overview

Koniwa (声庭): An open collection of annotated voices in Japanese language

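Summary

Koniwa (声庭) is an open collection of speech audio and annotations that anyone may freely use, modify, and redistribute.
(Commercial use is also permitted.)

Annotation work has only just begun. Contributions are welcome.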

File links

sound: audio data (Google Drive)
source: reference data (Google Drive): source texts and other materials useful as references when annotating
data: bibliographic information and annotation data

Series

This collection currently uses the following open audio data. We are deeply grateful to everyone involved in making it publicly available.

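amagasaki: CC BY 4.0
- April 2011 to November 2015
- Radio programs from Amagasaki City, Hyogo Prefecture (FM Amagasaki / FMあまがさき)
  - Mayor Inamura's 「ひと咲きまち咲きあまがさき」
  - Mayor Inamura's 「い～なこの街あまがさき」 (retitled as of November 2014)
free_culture_2012: CC BY 3.0
- August 2012
- J-WAVE radio program "J-WAVE 360° Forum 〜Seek and Find〜"
higashiyodogawa: CC BY 4.0
- November 2017 to July 2021
- Audio edition of 「広報ひがしよどがわ」, the public bulletin of Higashiyodogawa Ward, Osaka City
librivox: public domain
- Works recorded at LibriVox.org
- Some items, such as songs, are excluded
minato: CC BY 4.0
- May 2019 to December 2020
- Audio edition of 「広報みなと」, the public bulletin of Minato Ward, Osaka City
nishiyodogawa: CC BY 4.0
- August 2018 to July 2021
- Audio edition of the public bulletin 「きらり☆にしよど」 of Nishiyodogawa Ward, Osaka City
roudoku_toshokan: CC BY 2.1 JP (source texts are public domain)
- Readings by 池田英生, distributed via 朗読図書館 (Roudoku Toshokan)
tnc: CC BY 3.0 (source texts are public domain)
- Readings by announcers of Television Nishinippon (テレビ西日本)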

Licence

Licenses for source texts and audio

This collection includes only audio licensed under one of the following.

Public domain
- PDM
- CC0
Creative Commons
- CC BY

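Licenses for annotations and documents

All of the following are licensed under CC0 1.0.

For annotations that qualify as derivative works, the derivative portion
Original works in this repository, such as annotation comments and the annotation manual (excluding programs)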

Program license

Programs are licensed under the Apache License 2.0.

Maintainer

shirayu

Owner

Koniwa project