A 10000+ hours dataset for Chinese speech recognition

Overview

WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

WenetSpeech

Download

Please visit the official website, read the license, and follow the instruction to download the data.

Benchmark

Toolkit Dev Test_Net Test_Meeting AIShell-1
Kaldi 9.07 12.83 24.72 5.41
ESPNet 9.70 8.90 15.90 3.90
WeNet 8.88 9.70 15.59 4.61

Description

Creation

All the data are collected from YouTube and Podcast. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are adopted to label each YouTube and Podcast recording, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Categories

In summary, WenetSpeech groups all data into 3 categories, as the following table shows:

Set Hours Confidence Usage
High Label 10005 >=0.95 Supervised Training
Weak Label 2478 [0.6, 0.95] Semi-supervised or noise training
Unlabel 9952 / Unsupervised training or Pre-training
In Total 22435 / All above

High Label Data

We classify the high label into 10 groups according to its domain, speaking style, and scenarios.

Domain Youtube Podcast Total
audiobook 0 250.9 250.9
commentary 112.6 135.7 248.3
documentary 386.7 90.5 477.2
drama 4338.2 0 4338.2
interview 324.2 614 938.2
news 0 868 868
reading 0 1110.2 1110.2
talk 204 90.7 294.7
variety 603.3 224.5 827.8
others 144 507.5 651.5
Total 6113 3892 10005

As shown in the following table, we provide 3 training subsets, namely S, M and L for building ASR systems on different data scales.

Training Subsets Confidence Hours
L [0.95, 1.0] 10005
M 1.0 1000
S 1.0 100

Evaluation Sets

Evaluation Sets Hours Source Description
DEV 20 Internet Specially designed for some speech tools which require cross-validation set in training
TEST_NET 23 Internet Match test
TEST_MEETING 15 Real meeting Mismatch test which is a far-field, conversational, spontaneous, and meeting dataset

Contributors

ACKNOWLEDGEMENTS

  • WenetSpeech refers a lot of work of GigaSpeech, and we thank Jiayu Du and Guoguo Chen for their suggestions on this work.
  • We thank Xi'an Future AI Innovation Center for providing hosting service for WenetSpeech. We also thank MindSpore for the support of this work, which is a new deep learning computing framework.
  • Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.
Owner
Production First and Production Ready End-to-End Speech Toolkit
Generalized Random Forests

generalized random forests A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods fo

GRF Labs 781 Dec 25, 2022
Spatial Transformer Nets in TensorFlow/ TensorLayer

MOVED TO HERE Spatial Transformer Networks Spatial Transformer Networks (STN) is a dynamic mechanism that produces transformations of input images (or

Hao 36 Nov 23, 2022
A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization

University1652-Baseline [Paper] [Slide] [Explore Drone-view Data] [Explore Satellite-view Data] [Explore Street-view Data] [Video Sample] [中文介绍] This

Zhedong Zheng 335 Jan 06, 2023
PyTorch Implementation of Unsupervised Depth Completion with Calibrated Backprojection Layers (ORAL, ICCV 2021)

Unsupervised Depth Completion with Calibrated Backprojection Layers PyTorch implementation of Unsupervised Depth Completion with Calibrated Backprojec

80 Dec 13, 2022
TDN: Temporal Difference Networks for Efficient Action Recognition

TDN: Temporal Difference Networks for Efficient Action Recognition Overview We release the PyTorch code of the TDN(Temporal Difference Networks).

Multimedia Computing Group, Nanjing University 326 Dec 13, 2022
An all-in-one application to visualize multiple different local path planning algorithms

Table of Contents Table of Contents Local Planner Visualization Project (LPVP) Features Installation/Usage Local Planners Probabilistic Roadmap (PRM)

Abdur Javaid 47 Dec 30, 2022
Vehicle direction identification consists of three module detection , tracking and direction recognization.

Vehicle-direction-identification Vehicle direction identification consists of three module detection , tracking and direction recognization. Algorithm

5 Nov 15, 2022
KIDA: Knowledge Inheritance in Data Aggregation

KIDA: Knowledge Inheritance in Data Aggregation This project releases our 1st place solution on NeurIPS2021 ML4CO Dual Task. Slide and model weights a

24 Sep 08, 2022
Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment (ICCV2021)

Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment This is a pytorch project for the paper Seeing Dynamic Scene i

DV Lab 21 Nov 28, 2022
Awesome Long-Tailed Learning

Awesome Long-Tailed Learning This repo pays specially attention to the long-tailed distribution, where labels follow a long-tailed or power-law distri

Stomach_ache 284 Jan 06, 2023
Malmo Collaborative AI Challenge - Team Pig Catcher

The Malmo Collaborative AI Challenge - Team Pig Catcher Approach The challenge involves 2 agents who can either cooperate or defect. The optimal polic

Kai Arulkumaran 66 Jun 29, 2022
Molecular AutoEncoder in PyTorch

MolEncoder Molecular AutoEncoder in PyTorch Install $ git clone https://github.com/cxhernandez/molencoder.git && cd molencoder $ python setup.py insta

Carlos Hernández 80 Dec 05, 2022
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In CVPR 2022.

Nonuniform-to-Uniform Quantization This repository contains the training code of N2UQ introduced in our CVPR 2022 paper: "Nonuniform-to-Uniform Quanti

Zechun Liu 60 Dec 28, 2022
Many Class Activation Map methods implemented in Pytorch for CNNs and Vision Transformers. Including Grad-CAM, Grad-CAM++, Score-CAM, Ablation-CAM and XGrad-CAM

Class Activation Map methods implemented in Pytorch pip install grad-cam ⭐ Tested on many Common CNN Networks and Vision Transformers. ⭐ Includes smoo

Jacob Gildenblat 6.6k Jan 06, 2023
Plugin adapted from Ultralytics to bring YOLOv5 into Napari

napari-yolov5 Plugin adapted from Ultralytics to bring YOLOv5 into Napari. Training and detection can be done using the GUI. Training dataset must be

2 May 05, 2022
A benchmark for the task of translation suggestion

WeTS: A Benchmark for Translation Suggestion Translation Suggestion (TS), which provides alternatives for specific words or phrases given the entire d

zhyang 55 Dec 24, 2022
(ICCV 2021) ProHMR - Probabilistic Modeling for Human Mesh Recovery

ProHMR - Probabilistic Modeling for Human Mesh Recovery Code repository for the paper: Probabilistic Modeling for Human Mesh Recovery Nikos Kolotouros

Nikos Kolotouros 209 Dec 13, 2022
Matplotlib Image labeller for classifying images

mpl-image-labeller Use Matplotlib to label images for classification. Works anywhere Matplotlib does - from the notebook to a standalone gui! For more

Ian Hunt-Isaak 5 Sep 24, 2022
Official Implementation for the "An Empirical Investigation of 3D Anomaly Detection and Segmentation" paper.

An Empirical Investigation of 3D Anomaly Detection and Segmentation Project | Paper Official PyTorch Implementation for the "An Empirical Investigatio

Eliahu Horwitz 55 Dec 14, 2022
Generative Handwriting using LSTM Mixture Density Network with TensorFlow

Generative Handwriting Demo using TensorFlow An attempt to implement the random handwriting generation portion of Alex Graves' paper. See my blog post

hardmaru 686 Nov 24, 2022