A 200-Hour Bilingual (Mandarin and English) Speech Recognition Database

Date: 2014-02-05

A 200-Hour Bilingual (Mandarin and English) Speech Recognition Corpus has been released. It was recorded in real environments at a 16 kHz sampling rate with 16-bit quantization, in mono PCM format.
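As an illustration of the audio format stated above, the following minimal Python sketch checks whether a recording is 16 kHz, 16-bit, mono PCM. It assumes the audio is stored in WAV containers, which this announcement does not state; the names `EXPECTED`, `read_format`, `matches_corpus_format`, and `make_silent_wav` are hypothetical helpers introduced only for this example.

```python
import io
import wave

# Format stated in the announcement: mono, 16-bit (2 bytes), 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_format(source):
    """Return the channel count, sample width, and sample rate of a WAV source.

    `source` may be a file path or a binary file-like object.
    Assumption: the recordings are WAV files (not confirmed by the announcement).
    """
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_corpus_format(source):
    """True if the WAV source is 16 kHz, 16-bit, mono PCM."""
    return read_format(source) == EXPECTED


def make_silent_wav(channels=1, sample_width=2, sample_rate=16000, seconds=0.1):
    """Build a short silent WAV in memory, used here only to exercise the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        n_frames = int(sample_rate * seconds)
        wav.writeframes(b"\x00" * n_frames * channels * sample_width)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(matches_corpus_format(make_silent_wav()))                  # 16 kHz mono
    print(matches_corpus_format(make_silent_wav(sample_rate=8000)))  # 8 kHz mono
```

In practice, `matches_corpus_format` would be called with the path of each recording before it is passed to a training or evaluation pipeline.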

The corpus covers 200 native Mandarin speakers (1:1 gender distribution), carefully selected to represent their respective dialectal characteristics. Mixed Mandarin-English sentences as well as pure English sentences were collected from each speaker. The recordings were made on multiple smartphone models running Android and iOS, in both indoor (quiet and noisy) and outdoor environments.

For the design and collection of the corpus, we formed an efficient and highly capable team. As a result, the corpus has a sentence-level error rate below 2%, which leads the current market.
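The announcement does not define how the sentence-level error figure is computed. One common reading, sketched below in Python as an assumption rather than as our actual quality-control procedure, is the fraction of sentences whose transcript differs from a verified reference. The function name `sentence_error_rate` and the sample sentences are hypothetical.

```python
def sentence_error_rate(references, transcripts):
    """Fraction of sentences whose transcript differs from its reference.

    Assumption: a sentence counts as an error if it differs from the reference
    in any way after whitespace normalization. The announcement does not
    specify the exact definition behind the "below 2%" figure.
    """
    if len(references) != len(transcripts):
        raise ValueError("references and transcripts must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")

    errors = sum(
        1
        for ref, hyp in zip(references, transcripts)
        if ref.split() != hyp.split()  # compare with whitespace normalized
    )
    return errors / len(references)


if __name__ == "__main__":
    # Hypothetical mixed Mandarin-English sample sentences.
    refs = ["我想听 jazz 音乐", "please open the door", "今天天气很好", "call me later"]
    hyps = ["我想听 jazz 音乐", "please open the door", "今天天气很好", "call me letter"]
    print(f"{sentence_error_rate(refs, hyps):.2%}")
```

Under this definition, a rate below 2% means fewer than 2 in every 100 sentences contain any transcription discrepancy.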

This corpus can be used for training and testing speech recognition systems, as well as for speech analysis. It has been well acknowledged by industry as a corpus with high speech quality and recognition accuracy.