Hung-yi Lee 2020 Human Language Processing: P2

Speech Recognition

speech: a sequence of vectors (length T, dimension d)
text: a sequence of tokens (length N, V different kinds of tokens)
T > N

Linguists are required.

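Grapheme: the smallest unit of a writing system, e.g. the 26 letters of the English alphabet; the size of this set gives V, the number of token kinds.
In practice the token set also has to include "_" (the space that separates words) and punctuation marks. Chinese does not need a space token.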
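
As a rough illustration of grapheme tokens for English, the sketch below maps a sentence to a token sequence and builds a small vocabulary. The function name `to_graphemes` and the four-symbol punctuation set are assumptions made for this example only; a real system would choose its own punctuation set.

```python
import string

# Hypothetical, deliberately tiny punctuation set (an assumption for this sketch).
PUNCTUATION = [",", ".", "?", "!"]

# Vocabulary: 26 lowercase letters, the "_" space token, and punctuation.
vocab = list(string.ascii_lowercase) + ["_"] + PUNCTUATION


def to_graphemes(text, space_token="_"):
    """Split text into grapheme tokens, replacing each space with space_token."""
    return [space_token if ch == " " else ch for ch in text.lower()]


tokens = to_graphemes("one punch man")
print(tokens)       # letters with "_" between the words
print(len(tokens))  # N, the length of the token sequence
print(len(vocab))   # V, the number of token kinds in this toy vocabulary
```

Here N is 13 (11 letters plus 2 space tokens) and V is 31 for this toy vocabulary.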

Usage:

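The window is 25 ms long and shifts right by 10 ms each step, so consecutive windows overlap.
At 16 kHz there are 16k sample values per second, so a 25 ms window contains 400 values.
A frame is the speech feature vector for one window; its dimension comes in three variants.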
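
The window arithmetic above can be checked with a short sketch. It slices a raw sample sequence into overlapping windows and takes the raw samples as the frame; the function name `frame_signal` and the choice to drop a trailing partial window (no padding) are assumptions of this example, not something the lecture specifies.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Cut samples into overlapping windows; partial windows at the end are dropped."""
    window_len = sample_rate * window_ms // 1000  # 16000 * 25 / 1000 = 400 samples
    shift_len = sample_rate * shift_ms // 1000    # 16000 * 10 / 1000 = 160 samples
    frames = []
    start = 0
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += shift_len
    return frames


# One second of audio at 16 kHz, using the sample index as a stand-in value.
one_second = list(range(16000))
frames = frame_signal(one_second)

print(len(frames[0]))  # 400 values per 25 ms window
print(len(frames))     # 98 full windows fit in one second without padding
print(frames[1][0])    # 160: the second window starts 10 ms later, so it overlaps the first
```

Without padding, one second yields 98 full windows, which is close to the rule of thumb of about 100 frames per second.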

Processing steps and their popularity (2019)

LibriSpeech: free

The comparison with MNIST is only an analogy.

There are also Google Voice Search (12,000+ hours) and Facebook Video (13,000+ hours).
In practice, commercial systems use more data than the amounts reported in papers.

Model

Trends