A 10000+ hours dataset for Chinese speech recognition

Overview

WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

WenetSpeech

Download

Please visit the official website, read the license, and follow the instruction to download the data.

Benchmark

Toolkit Dev Test_Net Test_Meeting AIShell-1
Kaldi 9.07 12.83 24.72 5.41
ESPNet 9.70 8.90 15.90 3.90
WeNet 8.88 9.70 15.59 4.61

Description

Creation

All the data are collected from YouTube and Podcast. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are adopted to label each YouTube and Podcast recording, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Categories

In summary, WenetSpeech groups all data into 3 categories, as the following table shows:

Set Hours Confidence Usage
High Label 10005 >=0.95 Supervised Training
Weak Label 2478 [0.6, 0.95] Semi-supervised or noise training
Unlabel 9952 / Unsupervised training or Pre-training
In Total 22435 / All above

High Label Data

We classify the high label into 10 groups according to its domain, speaking style, and scenarios.

Domain Youtube Podcast Total
audiobook 0 250.9 250.9
commentary 112.6 135.7 248.3
documentary 386.7 90.5 477.2
drama 4338.2 0 4338.2
interview 324.2 614 938.2
news 0 868 868
reading 0 1110.2 1110.2
talk 204 90.7 294.7
variety 603.3 224.5 827.8
others 144 507.5 651.5
Total 6113 3892 10005

As shown in the following table, we provide 3 training subsets, namely S, M and L for building ASR systems on different data scales.

Training Subsets Confidence Hours
L [0.95, 1.0] 10005
M 1.0 1000
S 1.0 100

Evaluation Sets

Evaluation Sets Hours Source Description
DEV 20 Internet Specially designed for some speech tools which require cross-validation set in training
TEST_NET 23 Internet Match test
TEST_MEETING 15 Real meeting Mismatch test which is a far-field, conversational, spontaneous, and meeting dataset

Contributors

ACKNOWLEDGEMENTS

  • WenetSpeech refers a lot of work of GigaSpeech, and we thank Jiayu Du and Guoguo Chen for their suggestions on this work.
  • We thank Xi'an Future AI Innovation Center for providing hosting service for WenetSpeech. We also thank MindSpore for the support of this work, which is a new deep learning computing framework.
  • Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.
Owner
Production First and Production Ready End-to-End Speech Toolkit
Localized representation learning from Vision and Text (LoVT)

Localized Vision-Text Pre-Training Contrastive learning has proven effective for pre- training image models on unlabeled data and achieved great resul

Philip Müller 10 Dec 07, 2022
Rewrite ultralytics/yolov5 v6.0 opencv inference code based on numpy, no need to rely on pytorch

Rewrite ultralytics/yolov5 v6.0 opencv inference code based on numpy, no need to rely on pytorch; pre-processing and post-processing using numpy instead of pytroch.

炼丹去了 21 Dec 12, 2022
Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks.

Self Supervised Learning with Fastai Implementation of popular SOTA self-supervised learning algorithms as Fastai Callbacks. Install pip install self-

Kerem Turgutlu 276 Dec 23, 2022
We envision models that are pre-trained on a vast range of domain-relevant tasks to become key for molecule property prediction

We envision models that are pre-trained on a vast range of domain-relevant tasks to become key for molecule property prediction. This repository aims to give easy access to state-of-the-art pre-train

GMUM 90 Jan 08, 2023
[SIGGRAPH Asia 2021] Pose with Style: Detail-Preserving Pose-Guided Image Synthesis with Conditional StyleGAN

Pose with Style: Detail-Preserving Pose-Guided Image Synthesis with Conditional StyleGAN [Paper] [Project Website] [Output resutls] Official Pytorch i

Badour AlBahar 215 Dec 17, 2022
[NeurIPS 2021] Introspective Distillation for Robust Question Answering

Introspective Distillation (IntroD) This repository is the Pytorch implementation of our paper "Introspective Distillation for Robust Question Answeri

Yulei Niu 13 Jul 26, 2022
NeRD: Neural Reflectance Decomposition from Image Collections

NeRD: Neural Reflectance Decomposition from Image Collections Project Page | Video | Paper | Dataset Implementation for NeRD. A novel method which dec

Computergraphics (University of Tübingen) 195 Dec 29, 2022
[arXiv'22] Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation

Panoptic NeRF Project Page | Paper | Dataset Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation Xiao Fu*, Shangzhan zhang*,

Xiao Fu 111 Dec 16, 2022
An NLP library with Awesome pre-trained Transformer models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications.

简体中文 | English News [2021-10-12] PaddleNLP 2.1版本已发布!新增开箱即用的NLP任务能力、Prompt Tuning应用示例与生成任务的高性能推理! 🎉 更多详细升级信息请查看Release Note。 [2021-08-22]《千言:面向事实一致性的生

6.9k Jan 01, 2023
A pytorch reprelication of the model-based reinforcement learning algorithm MBPO

Overview This is a re-implementation of the model-based RL algorithm MBPO in pytorch as described in the following paper: When to Trust Your Model: Mo

Xingyu Lin 93 Jan 05, 2023
Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research

Welcome to AirSim AirSim is a simulator for drones, cars and more, built on Unreal Engine (we now also have an experimental Unity release). It is open

Microsoft 13.8k Jan 05, 2023
Efficient neural networks for analog audio effect modeling

micro-TCN Efficient neural networks for audio effect modeling

Christian Steinmetz 94 Dec 29, 2022
Differentiable Wavetable Synthesis

Differentiable Wavetable Synthesis

4 Feb 11, 2022
PyTorch implementation of MSBG hearing loss model and MBSTOI intelligibility metric

PyTorch implementation of MSBG hearing loss model and MBSTOI intelligibility metric This repository contains the implementation of MSBG hearing loss m

BUT <a href=[email protected]"> 9 Nov 08, 2022
DeepGNN is a framework for training machine learning models on large scale graph data.

DeepGNN Overview DeepGNN is a framework for training machine learning models on large scale graph data. DeepGNN contains all the necessary features in

Microsoft 45 Jan 01, 2023
Joint Versus Independent Multiview Hashing for Cross-View Retrieval[J] (IEEE TCYB 2021, PyTorch Code)

Thanks to the low storage cost and high query speed, cross-view hashing (CVH) has been successfully used for similarity search in multimedia retrieval. However, most existing CVH methods use all view

4 Nov 19, 2022
Code for the paper "Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks"

ON-LSTM This repository contains the code used for word-level language model and unsupervised parsing experiments in Ordered Neurons: Integrating Tree

Yikang Shen 572 Nov 21, 2022
Code for "Hierarchical Skills for Efficient Exploration" HSD-3 Algorithm and Baselines

Hierarchical Skills for Efficient Exploration This is the source code release for the paper Hierarchical Skills for Efficient Exploration. It contains

Facebook Research 38 Dec 06, 2022
This is code of book "Learn Deep Learning with PyTorch"

深度学习入门之PyTorch Learn Deep Learning with PyTorch 非常感谢您能够购买此书,这个github repository包含有深度学习入门之PyTorch的实例代码。由于本人水平有限,在写此书的时候参考了一些网上的资料,在这里对他们表示敬意。由于深度学习的技术在

Xingyu Liao 2.5k Jan 04, 2023
A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery

A Semantic Segmentation Network for Urban-Scale Building Footprint Extraction Using RGB Satellite Imagery This repository is the official implementati

Aatif Jiwani 42 Dec 08, 2022