A 10000+ hours dataset for Chinese speech recognition

Last update: Dec 16, 2022

Overview

WenetSpeech

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

Download

Please visit the official website, read the license, and follow the instructions to download the data.

Benchmark

Toolkit	Model	test_net	test_meeting
Kaldi	Chain Model
ESPnet	Joint CTC/Conformer
WeNet	Joint CTC/Conformer

Description

Creation

First, we collect all the data from YouTube and Podcast. Then, OCR is used to label the YouTube data, and automatic transcription is used to label the Podcast data. Finally, a novel end-to-end label error detection method is used to further validate and filter the data.

Set	Hours	Confidence	Usage
High Label	10005	>=0.95	Supervised Training
Weak Label	2478	[0.6, 0.95]	Semi-supervised or noise training
Unlabel	9952	/	Unsupervised training or Pre-training
In Total	22435	/	All above
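
The confidence ranges in the table can be read as a simple routing rule. The sketch below is illustrative only and is not the project's own code: the function name `assign_set` and the returned set names are hypothetical, a confidence of exactly 0.95 is assumed to belong to High Label (the two ranges in the table overlap at that value), and the handling of confidences below 0.6 is an assumption because the table does not say what happens to such segments.

```python
from typing import Optional

HIGH_THRESHOLD = 0.95  # from the table: High Label is >= 0.95
WEAK_THRESHOLD = 0.6   # from the table: Weak Label starts at 0.6


def assign_set(confidence: Optional[float]) -> str:
    """Map a segment's label confidence to a set name (hypothetical names)."""
    if confidence is None:
        # No transcript confidence available: treat as unlabeled audio.
        return "unlabel"
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= HIGH_THRESHOLD:
        # Assumption: the shared boundary value 0.95 goes to High Label.
        return "high"
    if confidence >= WEAK_THRESHOLD:
        return "weak"
    # Assumption: the source does not describe segments below 0.6.
    return "below_threshold"
```

For example, `assign_set(0.97)` returns `"high"` and `assign_set(0.7)` returns `"weak"`.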

High Label Data

All of the data is from YouTube and Podcast, and we tag each item with its source and domain. We classify the data into 10 groups according to domain, speaking style, or scenario.

Domain	YouTube (hours)	Podcast (hours)	Total (hours)
audiobook	0	250.9	250.9
commentary	112.6	135.7	248.3
documentary	386.7	90.5	477.2
drama	4338.2	0	4338.2
interview	324.2	614	938.2
news	0	868	868
reading	0	1110.2	1110.2
talk	204	90.7	294.7
variety	603.3	224.5	827.8
others	144	507.5	651.5
Total	6113	3892	10005
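
As a quick consistency check, the per-domain hours above can be summed and turned into shares of the high-label data. This is a minimal sketch: the figures are copied from the table, while the names `HOURS`, `domain_totals`, and `domain_shares` are hypothetical helpers introduced for illustration.

```python
from typing import Dict, Tuple

# (YouTube hours, Podcast hours) per domain, copied from the table above.
HOURS: Dict[str, Tuple[float, float]] = {
    "audiobook": (0.0, 250.9),
    "commentary": (112.6, 135.7),
    "documentary": (386.7, 90.5),
    "drama": (4338.2, 0.0),
    "interview": (324.2, 614.0),
    "news": (0.0, 868.0),
    "reading": (0.0, 1110.2),
    "talk": (204.0, 90.7),
    "variety": (603.3, 224.5),
    "others": (144.0, 507.5),
}


def domain_totals(hours: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Total hours per domain (YouTube plus Podcast)."""
    return {domain: yt + pod for domain, (yt, pod) in hours.items()}


def domain_shares(hours: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Fraction of all high-label hours contributed by each domain."""
    totals = domain_totals(hours)
    grand_total = sum(totals.values())
    return {domain: t / grand_total for domain, t in totals.items()}


if __name__ == "__main__":
    # Print domains from largest to smallest share.
    for domain, share in sorted(domain_shares(HOURS).items(), key=lambda kv: -kv[1]):
        print(f"{domain:12s} {share:6.1%}")
```

The column sums reproduce the Total row (6113, 3892, and 10005 hours), and drama alone accounts for roughly 43% of the high-label hours.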

We provide three training subsets: S, M, and L. Subsets S and M are sampled from the high-label data with an oracle confidence of 1.0.

Training Subsets	Confidence	Hours
L	[0.95, 1.0]	10005
M	1.0	1000
S	1.0	100
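
The source does not describe how S and M are sampled, so the following is only one plausible sketch: keep segments whose confidence is exactly 1.0, shuffle them with a fixed seed, and accumulate segments until a target number of hours is reached. The `Segment` tuple layout, the function name `sample_subset`, and the seeded shuffle are all assumptions made for illustration.

```python
import random
from typing import Iterable, List, Tuple

# Hypothetical record layout: (segment_id, duration_seconds, confidence).
Segment = Tuple[str, float, float]


def sample_subset(
    segments: Iterable[Segment], target_hours: float, seed: int = 0
) -> List[str]:
    """Pick confidence-1.0 segments at random until target_hours is reached."""
    # Only segments with oracle confidence 1.0 are eligible.
    eligible = [s for s in segments if s[2] == 1.0]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    rng.shuffle(eligible)

    budget_seconds = target_hours * 3600.0
    chosen: List[str] = []
    total_seconds = 0.0
    for segment_id, duration, _ in eligible:
        # Stop once the budget is met; the last segment added may overshoot it.
        if total_seconds >= budget_seconds:
            break
        chosen.append(segment_id)
        total_seconds += duration
    return chosen
```

Under these assumptions, `sample_subset(segments, target_hours=100)` would yield an S-sized subset and `target_hours=1000` an M-sized one.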

Evaluation Sets

Evaluation Sets	Hours	Source	Description
DEV	20	Internet	Designed for speech tools that require a cross-validation set during training
TEST_NET	23	Internet	Matched test
TEST_MEETING	15	Real meeting	Mismatched test consisting of far-field, conversational, spontaneous meeting speech

Contributors

ACKNOWLEDGEMENTS

WenetSpeech draws heavily on the work of GigaSpeech, including its metadata design, license design, data encryption, and downloading pipeline. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work.
The authors would also like to thank their colleagues Lianhui Zhang and Yu Mao for collecting some of the YouTube data.
