A 2000-Hour Multi-dialectal Speech Recognition Database


A 2000-hour multi-dialectal corpus for speech recognition has been released. The corpus covers two Chinese dialects: Northeast dialect and Henan dialect. All speech samples were collected in the corresponding provinces, with recording spread across locations to capture the subtle variation among neighboring cities and provinces. For example, the Northeast dialect corpus was collected from cities in all three northeastern provinces (Heilongjiang, Jilin, and Liaoning). Details of the corpus are listed below:

Recording Locations                                   Dialect             Duration
Northeast Provinces (Heilongjiang, Jilin, Liaoning)   Northeast Dialect   1000 hours
Henan Province                                        Henan Dialect       1000 hours
Total                                                                     2000 hours

The corpus has an effective duration of 2000 hours. It was recorded at a 16 kHz sampling rate with 16-bit quantization, in mono PCM format.
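The recording format above can be checked file by file, and it also fixes the raw size of the audio. The following is a minimal sketch, assuming the recordings are delivered as WAV files (the announcement states only "mono PCM"); the helper names `matches_corpus_format`, `raw_pcm_bytes`, and `make_silent_wav` are hypothetical and use only Python's standard `wave` module.

```python
import io
import wave

# Format stated in the announcement.
EXPECTED_RATE_HZ = 16_000    # 16 kHz sampling rate
EXPECTED_SAMPLE_BYTES = 2    # 16-bit quantization
EXPECTED_CHANNELS = 1        # mono


def matches_corpus_format(wav_file) -> bool:
    """Return True if a WAV file is 16 kHz, 16-bit, mono, uncompressed PCM.

    `wav_file` may be a path or a binary file-like object.
    Assumption: the corpus is distributed in a WAV container.
    """
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
        )


def raw_pcm_bytes(hours: float) -> int:
    """Size of the raw samples for `hours` of audio, excluding file headers."""
    seconds = hours * 3600
    return int(seconds * EXPECTED_RATE_HZ * EXPECTED_SAMPLE_BYTES * EXPECTED_CHANNELS)


def make_silent_wav(rate_hz: int, seconds: int = 1) -> io.BytesIO:
    """Build an in-memory 16-bit mono WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(matches_corpus_format(make_silent_wav(16_000)))  # matches the stated format
    print(matches_corpus_format(make_silent_wav(8_000)))   # wrong sampling rate
    print(raw_pcm_bytes(2000))                             # raw bytes for 2000 hours
```

At these settings one hour of audio takes 115,200,000 bytes of samples, so the full 2000 hours come to about 230 GB of raw PCM data before any container overhead or compression.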

The recordings come from 4000 carefully selected native dialect speakers. The corpus was collected on a variety of Android smartphone models, primarily in indoor environments.

The corpus was designed and collected by a dedicated team working to strict quality standards. As a result, the sentence-wise error rate of the corpus is below 2%.
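The announcement does not define how the sentence-wise error is measured. One common reading, assumed here, is the share of sentences whose transcript differs in any way from a verified reference. The sketch below illustrates that reading only; the function name `sentence_error_rate` and the sample strings are hypothetical and are not taken from the corpus.

```python
def sentence_error_rate(references, transcripts):
    """Fraction of sentences whose transcript differs from its reference.

    Assumption: a sentence counts as wrong if it differs from the reference
    at all, after trimming surrounding whitespace.
    """
    if len(references) != len(transcripts):
        raise ValueError("references and transcripts must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    wrong = sum(
        1
        for ref, hyp in zip(references, transcripts)
        if ref.strip() != hyp.strip()
    )
    return wrong / len(references)


if __name__ == "__main__":
    # Illustrative placeholder sentences, not corpus data.
    refs = ["sentence one", "sentence two", "sentence three", "sentence four"]
    hyps = ["sentence one", "sentence too", "sentence three", "sentence four "]
    print(sentence_error_rate(refs, hyps))  # one of four sentences differs
```

Under this definition, a rate below 2% means fewer than 2 in every 100 checked sentences contain any discrepancy.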

The corpus covers the major cities of both dialect regions with a large volume of data. It can be used to train and test speech recognition systems, as well as for speech analysis and dialect studies.

