Speech Recognition: Important Open-Source Data

I. Audio Data

Chinese:

CN-Celeb download:

http://www.openslr.org/82/

CN-Celeb project website:

http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/CN-Celeb

CN-Celeb papers:

https://arxiv.org/abs/1911.01799

https://arxiv.org/abs/2012.12468

Kaldi recipe:

https://github.com/kaldi-asr/kaldi/tree/master/egs/cnceleb

3. 10,000-hour Chinese dataset

https://arxiv.org/pdf/2110.03370.pdf (paper)

English:

1. GigaSpeech: a 10,000-hour, multi-domain, open-source English dataset

https://github.com/SpeechColab/GigaSpeech

https://arxiv.org/abs/2106.06909 (paper)

2. https://github.com/coqui-ai/open-speech-corpora

II. Text Data

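1. CLUECorpus2020: possibly the largest open-source Chinese corpus to date, released together with a collection of high-quality Chinese pre-trained models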

2. 40 Chinese NLP lexicons: https://github.com/fighting41love/funNLP

3. A free, public Chinese chat corpus on the scale of tens of millions of entries: https://github.com/codemayq/chinese_chatbot_corpus

4. Tencent AI Lab's open-source release covering 8 million Chinese words: https://ai.tencent.com/ailab/nlp/embedding.html

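5. A large Chinese NLP resource on GitHub whose maintainers say they will release a corpus on the scale of hundreds of millions of entries: https://github.com/brightmart/nlp_chinese_corpus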