• Huiting Data To Exhibit At ICASSP 2017 Conference In New Orleans
  • Huiting Data Awarded One of the Five Creative Products in 2015
  • A far-field command word corpus for recognition has been released. The corpus includes 100 male and 100 female speakers, aged 17 to 32, from 25 different districts. Each speaker contributes 200 different English sentences (5 to 50 words each). The corpus is collected using various Android smartphone models, primarily in indoor (quiet and anechoic) environments. Six channels of speech are recorded simultaneously; detailed information is listed below:
  • A 75-hour Taiwan Mandarin corpus for recognition has been released. The speech samples are collected in Taiwan. A total of 100 native Taiwanese speakers (1:1 gender distribution) from the major areas of Taiwan are carefully selected. The corpus has an effective duration of 75 hours. It is recorded at a 16 kHz sampling frequency with 16-bit quantization in mono PCM format. The corpus is collected using various Android smartphone models, primarily in indoor environments.
  • A 1000-person gait recognition database has been released. It covers 1000 people (1:1 gender distribution) aged 4 to 85, of whom 20% are over 60 years old. The database was collected with 24 Hikvision cameras in an outdoor environment, recording each participant's normal walking posture. Each participant is asked to walk 48 times in 3 sets of outfits (normal clothing, wearing a coat, and carrying a bag). Video is recorded at 1920×1080 resolution and 25 fps in MP4 format.
  • An image database for recognition has been released. A total of 20,000 figure outlines, including body outlines and face outlines, are labelled. The outline region is black on a white background. The 20,000 images include people of all ages and genders in their respective outfits and postures. 12,000 of the images show people in street snaps, photo albums or everyday photos of various sizes. The remaining 8,000 are gait images taken from the gait recognition database. During database design and collection, we formed an efficient, highly capable team. As a result, outline deviation is kept within 3 pixels.
  • A far-field command word corpus for recognition has been released. It covers 200 speakers aged 17 to 32 with a 1:1 gender distribution. The scripts consist of 600 fixed Chinese command words, covering smart home appliances, wake-up phrases, in-vehicle commands and so on.
  • A 50-hour Children Bilingual Corpus for Recognition has been released. It covers 140 children aged 5 to 12 with a 1:1 gender distribution. The corpus contains 25 hours of Mandarin and 25 hours of English (including word pronunciation and alphablocks), covering the texts and vocabulary of elementary school textbooks. It is recorded in real environments at a 16 kHz sampling frequency with 16-bit quantization in mono PCM format. The corpus is collected using various Android smartphone models, primarily in indoor (quiet and anechoic) environments. During corpus design and collection, we formed an efficient, highly capable team. As a result, the corpus has a sentence-level error rate below 2%, which leads the current market. The corpus can be used for training and testing children's speech recognition systems, as well as for speech analysis. It is well regarded in the industry for its high speech quality and recognition accuracy.
  • A 400-hour On-Vehicle Japanese Corpus for Recognition has been released. The speech samples are collected in Japan and China. A total of 1000 speakers from the major areas of Japan are carefully selected, all of them native Japanese speakers. The corpus is collected inside various cars. The recording environments cover practical scenarios including, but not limited to, high and low vehicle speeds, parking, and windows open or closed. During corpus design and collection, we formed an efficient, high-standard team specializing in the Japanese language. As a result, the corpus has a sentence-level error rate below 5%. It contains rich speech samples recorded across various combinations of car model and noise type. It can be used for training and testing speech recognition systems, as well as for speech analysis.
  • A 2000-hour Multi-dialectal Corpus for Recognition has been released. The corpus contains two Chinese dialects: the Northeastern dialect and the Henan dialect. The speech samples are all collected in the corresponding provinces, capturing the subtle variation among neighboring cities and provinces. For example, the Northeastern dialect corpus is collected from cities in all three northeastern provinces (i.e. Heilongjiang Province, Jilin Province, and Liaoning Province).
  • The Huiting Cantonese Recognition Corpus was carefully designed by our research team and is well regarded by our industry customers. It has played an important role in helping customers improve the accuracy of their Cantonese recognition systems, and has been acknowledged as the Cantonese corpus yielding the highest recognition accuracy. Customers are impressed by our high quality standards, efficient progress control and sense of responsibility, and many have shown great interest in long-term collaboration with us.
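  • Main responsibilities: Chinese language processing, including corpus collection, classification, part-of-speech tagging, text proofreading and keyword annotation; speech recording, syllable segmentation, prosody annotation, proofreading of phonetic symbols, and proofreading and dictation of recording scripts; collection and editing of data resources such as documents, music and web pages; testing and evaluation of R&D results. Requirements: a bachelor's degree or above obtained through the national unified enrollment system, with language and literature majors preferred; College English Test Band 4 (CET-4) or above and proficient computer skills; on-site work at least four days a week; standard Mandarin, fluency in pinyin rules and an understanding of Chinese grammar, with the ability to clearly distinguish Chinese initials, finals and erhua (holders of a Level 1 Putonghua proficiency certificate are preferred); meticulous work habits, a strong sense of responsibility and execution, team spirit, and good communication and coordination skills; experience in language processing or speech work, or in managing data resources of any kind, is preferred. Outstanding performers will be offered the opportunity to become regular employees. Please send your résumé to: lily@huitingtech.com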
  • Designed and collected by native Cantonese speakers, a 1000-hour Cantonese Corpus for Recognition has been released. It is recorded in real environments at a 16 kHz sampling frequency with 16-bit quantization in mono PCM format. The corpus can be used for training and testing speech recognition systems, as well as for speech analysis. It is well regarded in the industry for its high speech quality and recognition accuracy. The collection and recording process was designed systematically: 1500 native Cantonese speakers were selected from 110 administrative districts of Guangdong Province, with a focus on the Zhuhai, Foshan, Sanshui and Guangzhou regions so that authentic Cantonese dominates the corpus. The speakers are aged 15 to 55 with a 1:1 gender distribution. This yields a multi-purpose corpus whose diversity, comprehensiveness and balance can satisfy different requirements. During corpus design and collection, we formed an efficient, highly capable team. As a result, the Cantonese corpus has a sentence-level error rate below 2%, which leads the current market.
  • A 2100-hour Accented Mandarin Speech Recognition Database has been released. It is recorded in real environments at a 16 kHz sampling frequency with 16-bit quantization in mono PCM format. 2100 native Mandarin speakers (1:1 gender distribution) with their respective dialectal characteristics are carefully selected. The corpus is collected using multiple smartphone models in both indoor and outdoor environments. During corpus design and collection, we formed an efficient, highly capable team. As a result, the corpus has a sentence-level error rate below 2%, which leads the current market. The corpus can be used for training and testing speech recognition systems, as well as for speech analysis. It is well regarded in the industry for its high speech quality and recognition accuracy.
  • A 200-Hour Bilingual (Mandarin and English) Corpus for Recognition has been released. It is recorded in real environments at a 16 kHz sampling frequency with 16-bit quantization in mono PCM format. 200 native Mandarin speakers (1:1 gender distribution) with their respective dialectal characteristics are carefully selected, and both mixed Mandarin-English sentences and pure English sentences are collected from each speaker. The corpus is collected using multiple smartphone models running Android and iOS, in both indoor (quiet and noisy) and outdoor environments. During corpus design and collection, we formed an efficient, highly capable team. As a result, the corpus has a sentence-level error rate below 2%, which leads the current market. The corpus can be used for training and testing speech recognition systems, as well as for speech analysis. It is well regarded in the industry for its high speech quality and recognition accuracy.
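
Several of the corpora above are described as 16 kHz, 16-bit, mono PCM recordings. As a rough illustration of how a recipient might sanity-check delivered audio against that description, here is a minimal sketch using Python's standard `wave` module. It assumes the recordings are stored as WAV files, which the announcements do not state; the function names `check_wav_format` and `duration_seconds` are hypothetical and not part of any Huiting Data tooling.

```python
import wave

# Target format quoted in the corpus descriptions (assumed to apply to WAV files).
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit quantization
EXPECTED_CHANNELS = 1            # mono


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the expected format.

    An empty list means the file matches 16 kHz / 16-bit / mono PCM.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bits")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
    return problems


def duration_seconds(path):
    """Return the duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing `duration_seconds` over all files would give a figure to compare against the advertised corpus duration, although how "effective duration" is measured (for example, whether silence is excluded) is not specified in the announcements.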