A 1000-hour Cantonese Speech Recognition Database

Date: 2014-02-15

A 1000-hour Cantonese speech recognition corpus, designed and collected by native Cantonese speakers, has been released. It was recorded in real environments at a 16 kHz sampling rate with 16-bit quantization, in mono PCM format. The corpus can be used for training and testing speech recognition systems, as well as for speech analysis. It has been well acknowledged by industry as a corpus with high speech quality and recognition accuracy.
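As a rough illustration of what the recording format implies in practice, the sketch below checks whether an audio file matches the stated 16 kHz, 16-bit, mono parameters and estimates the uncompressed size of a given number of hours. It assumes the PCM audio is delivered in WAV containers, which the announcement does not state; the function names are hypothetical and not part of any tool shipped with the corpus.

```python
import wave

# Format stated in the announcement: 16 kHz, 16-bit, mono PCM.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1            # mono


def matches_corpus_format(path):
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono.

    Assumption: recordings are stored as WAV files (not stated in the source).
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def raw_pcm_bytes(hours):
    """Uncompressed audio size in bytes for `hours` of audio in this format."""
    seconds = hours * 3600
    return (
        seconds
        * EXPECTED_RATE_HZ
        * EXPECTED_SAMPLE_WIDTH_BYTES
        * EXPECTED_CHANNELS
    )
```

Under these parameters, `raw_pcm_bytes(1000)` gives 115,200,000,000 bytes, so 1000 hours of audio occupy roughly 115 GB before any container overhead or transcripts are counted.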

The corpus collection and recording process has been scientifically validated. 1500 native Cantonese speakers were selected from 110 administrative districts of Guangdong Province, with a focus on the Zhuhai, Foshan, Sanshui and Guangzhou regions so that authentic Cantonese dominates the corpus. The speakers are between 15 and 55 years old, with a 1:1 gender distribution. This yields a multi-purpose corpus whose diversity, comprehensiveness and balance can satisfy different requirements.

During corpus design and collection, we formed an efficient and highly capable team. As a result, our Cantonese corpus has a sentence-wise error rate below 2%, which leads the current market.
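For readers unfamiliar with the metric, a sentence-wise error rate is commonly computed as the fraction of sentences whose transcript differs in any way from a verified reference. The sketch below shows that calculation under this assumption; the announcement does not describe how its own figure was measured, and the function name and sample data are hypothetical.

```python
def sentence_error_rate(transcripts, references):
    """Fraction of sentences whose transcript differs from its reference.

    Assumption: a sentence counts as an error if it is not an exact match
    after stripping surrounding whitespace. The source does not define the
    measurement procedure behind its "below 2%" figure.
    """
    if len(transcripts) != len(references):
        raise ValueError("transcripts and references must have equal length")
    if not references:
        raise ValueError("at least one sentence is required")
    wrong = sum(
        1 for t, r in zip(transcripts, references) if t.strip() != r.strip()
    )
    return wrong / len(references)


# Hypothetical sample: one of four transcripts differs from its reference.
corpus_transcripts = ["你好", "早晨", "多謝", "再見"]
verified_references = ["你好", "早晨", "唔該", "再見"]
rate = sentence_error_rate(corpus_transcripts, verified_references)
```

By this definition, a rate below 2% means fewer than 2 in every 100 sentences contain any transcription mistake.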