A 2000-Hour Multi-dialectal Speech Recognition Database

Date: 2015-03-31

A 2000-hour multi-dialectal corpus for speech recognition has been released. The corpus covers two Chinese dialects: Northeast dialect and Henan dialect. All speech samples were collected in the corresponding provinces, with recording sites chosen to capture the subtle variation among neighboring cities and provinces. For example, the Northeast dialect data was collected from cities in all three northeast provinces (Heilongjiang, Jilin, and Liaoning). Detailed information about the corpus is listed below:

Recording Location	Language (Dialect)	Duration	Participants
Northeast Provinces (including Heilongjiang, Jilin and Liaoning Provinces)	Northeast Dialect	1000 hours	2000
Henan Province	Henan Dialect	1000 hours	2000
Total		2000 hours	4000

The corpus has an effective duration of 2000 hours. It was recorded at a 16 kHz sampling rate with 16-bit quantization, in mono PCM format.
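The recording format above can be checked programmatically. The following is a minimal sketch, assuming the audio is delivered as WAV files (the announcement only states "mono PCM", so the container is an assumption); the function names `matches_corpus_format` and `raw_pcm_bytes` are hypothetical and not part of any corpus tooling.

```python
import wave

# Format stated in the corpus description.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
SAMPLE_WIDTH_BYTES = 2    # 16-bit quantization
CHANNELS = 1              # mono


def matches_corpus_format(wav_file) -> bool:
    """Return True if a WAV file (path or binary file object) is 16 kHz, 16-bit, mono.

    Assumes a WAV container; headerless raw PCM cannot be checked this way.
    """
    with wave.open(wav_file, "rb") as w:
        return (
            w.getframerate() == SAMPLE_RATE_HZ
            and w.getsampwidth() == SAMPLE_WIDTH_BYTES
            and w.getnchannels() == CHANNELS
        )


def raw_pcm_bytes(hours: float) -> int:
    """Size in bytes of uncompressed audio of the given duration in this format."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
```

Under these assumptions, `raw_pcm_bytes(2000)` gives 230,400,000,000 bytes, so the audio alone would occupy roughly 230 GB uncompressed, not counting transcripts or file headers.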

The 4000 participants are carefully selected native speakers of the two dialects. Recordings were made on a variety of Android smartphone models, primarily in indoor environments.

During corpus design and collection, we worked as an efficient team and held to high standards. As a result, the sentence-wise error rate of the corpus is below 2%.
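The announcement does not define how the sentence-wise error is measured. One common reading, assumed here, is the fraction of sentences whose delivered transcript differs from a verified reference transcript. A minimal sketch under that assumption (the function name `sentence_error_rate` is hypothetical):

```python
def sentence_error_rate(reference: list[str], delivered: list[str]) -> float:
    """Fraction of sentences whose delivered transcript differs from the reference.

    Assumption: a sentence counts as an error if it is not an exact match.
    This is one possible definition, not necessarily the one used for the corpus.
    """
    if len(reference) != len(delivered):
        raise ValueError("reference and delivered must have the same length")
    if not reference:
        raise ValueError("at least one sentence is required")
    # Count sentences with any mismatch, then normalize by the total.
    errors = sum(1 for ref, got in zip(reference, delivered) if ref != got)
    return errors / len(reference)
```

With this definition, a spot check of 100 sentences containing 1 mismatched transcript yields a rate of 0.01, which would fall under the stated 2% bound.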

The corpus covers the major cities in both dialectal regions and provides a large volume of data. It can be used for training and testing speech recognition systems, as well as for speech analysis and dialect studies.