CHAPTER 1
These notes are based on Professor Michael Collins' lecture notes and slides, available at http://www.cs.columbia.edu/~mcollins/.
Full course link: http://academictorrents.com/details/8a8f93e18dd6c46c48ee2936ed500b1ff4cc9175
Basic models
Markov Model
First-order model, for fixed-length sequences (each state depends only on the previous one):
P(X1 = x1, ..., Xn = xn) = P(X1 = x1) ∏_{i=2..n} P(Xi = xi | Xi−1 = xi−1)
Second-order model, for fixed-length sequences (each state depends on the previous two), and it follows that the probability of an entire sequence is written as
P(X1 = x1, ..., Xn = xn) = ∏_{i=1..n} P(Xi = xi | Xi−2 = xi−2, Xi−1 = xi−1), with x−1 = x0 = *
Second-order model, for variable-length sequences: the length n is itself random, and we require the last symbol to be Xn = STOP.
Steps:
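The second-order (variable-length) decomposition above can be sketched in Python. The parameter table q below is hypothetical toy values, not estimates from any corpus:

```python
# A minimal sketch of the second-order Markov decomposition for a
# variable-length sentence. The sequence is padded on the left with
# two '*' start symbols, and the final word is the STOP symbol.

def sequence_probability(words, q):
    """P(x1..xn) = product over i of q(x_i | x_{i-2}, x_{i-1})."""
    padded = ["*", "*"] + list(words)
    p = 1.0
    for i in range(2, len(padded)):
        # q maps a (u, v, w) triple to the conditional probability q(w | u, v);
        # unseen triples get probability 0 under this (unsmoothed) model.
        p *= q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return p

# Hypothetical toy parameters for "the dog barks STOP":
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "barks"): 0.3,
    ("dog", "barks", "STOP"): 0.9,
}
print(sequence_probability(["the", "dog", "barks", "STOP"], q))
# 0.5 * 0.4 * 0.3 * 0.9 ≈ 0.054
```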
Trigram language model
The model defines p(x1, ..., xn) = ∏_{i=1..n} q(xi | xi−2, xi−1), where xn = STOP is the end symbol and x−1 = x0 = * is the start symbol.
Example: for the sentence "the dog barks STOP", the model gives
p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
(I didn't fully understand this part; it may be a permutations-and-combinations question.)
Computing q: the maximum-likelihood estimate is q(w | u, v) = c(u, v, w) / c(u, v).
c(the, dog, barks) is the number of times that the sequence of three words the dog barks is seen in the training corpus. Similarly, define c(u, v) to be the number of times that the bigram (u, v) is seen in the corpus.
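A sketch of the maximum-likelihood estimate q(w | u, v) = c(u, v, w) / c(u, v), using a hypothetical two-sentence toy corpus:

```python
from collections import defaultdict

def train_trigram_counts(sentences):
    """Collect trigram counts c(u, v, w) and bigram counts c(u, v)
    from sentences padded with '*' start symbols and a STOP end symbol."""
    c3 = defaultdict(int)  # c(u, v, w)
    c2 = defaultdict(int)  # c(u, v)
    for sent in sentences:
        words = ["*", "*"] + sent + ["STOP"]
        for u, v, w in zip(words, words[1:], words[2:]):
            c3[(u, v, w)] += 1
            c2[(u, v)] += 1
    return c3, c2

def q_ml(w, u, v, c3, c2):
    """Maximum-likelihood estimate q(w | u, v) = c(u, v, w) / c(u, v)."""
    if c2[(u, v)] == 0:
        return 0.0
    return c3[(u, v, w)] / c2[(u, v)]

# Hypothetical toy corpus:
corpus = [["the", "dog", "barks"], ["the", "dog", "runs"]]
c3, c2 = train_trigram_counts(corpus)
print(q_ml("barks", "the", "dog", c3, c2))  # c(the,dog,barks)/c(the,dog) = 1/2 = 0.5
```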
Evaluation metric for language models: perplexity. Perplexity is defined as 2^(−l), where l = (1/M) Σ log2 p(s_i), summing over the test sentences, and M is the total number of words in the test data.
Under a uniform probability model, the perplexity is equal to the vocabulary size. Perplexity can be thought of as the effective vocabulary size under the model: if, for example, the perplexity of the model is 120 (even though the vocabulary size is say 10,000), then this is roughly equivalent to having an effective vocabulary size of 120.
One additional useful fact about perplexity is the following. If for any trigram u, v, w seen in test data, we have the estimate q(w|u, v) = 0, then the perplexity will be ∞.
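The definition can be sketched as follows. The check for q = 0 reflects the fact noted above that a single zero estimate on test data drives perplexity to ∞; the uniform model at the end illustrates the "effective vocabulary size" interpretation:

```python
import math

def perplexity(test_sentences, q_fn):
    """Perplexity = 2^(-l), l = (1/M) * sum of log2 p(sentence),
    with M the total number of trigram predictions (words incl. STOP)."""
    log_prob = 0.0
    M = 0
    for sent in test_sentences:
        words = ["*", "*"] + sent + ["STOP"]
        for u, v, w in zip(words, words[1:], words[2:]):
            p = q_fn(w, u, v)
            if p == 0.0:
                return float("inf")  # any zero estimate gives infinite perplexity
            log_prob += math.log2(p)
            M += 1
    return 2 ** (-log_prob / M)

# Under a uniform model over a vocabulary of size N, perplexity equals N:
N = 120
uniform = lambda w, u, v: 1.0 / N
print(perplexity([["a", "b"], ["c"]], uniform))  # ≈ 120.0
```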
Smoothing trigram models
Linear interpolation: define q(w | u, v) = λ1·q_ML(w | u, v) + λ2·q_ML(w | v) + λ3·q_ML(w), where λ1 + λ2 + λ3 = 1 and each λi ≥ 0.
Tuning the λ parameters: they are typically chosen to maximize the likelihood of a held-out development set.
This avoids the cases where a denominator in the original formula could be zero.
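A sketch of linear interpolation, assuming trigram/bigram/unigram count dictionaries are already available; the counts and λ values below are hypothetical:

```python
# q(w|u,v) = l1*q_ML(w|u,v) + l2*q_ML(w|v) + l3*q_ML(w), with l1+l2+l3 = 1.
# Each ML term falls back to 0.0 when its denominator would be zero,
# so the interpolated estimate is always well defined.

def q_interp(w, u, v, c3, c2, c1, total, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    p3 = c3.get((u, v, w), 0) / c2[(u, v)] if c2.get((u, v), 0) > 0 else 0.0
    p2 = c2.get((v, w), 0) / c1[v] if c1.get(v, 0) > 0 else 0.0
    p1 = c1.get(w, 0) / total if total > 0 else 0.0
    return l1 * p3 + l2 * p2 + l3 * p1

# Hypothetical toy counts (c3: trigrams, c2: bigrams, c1: unigrams):
c3 = {("*", "the", "dog"): 1}
c2 = {("*", "the"): 1, ("the", "dog"): 1}
c1 = {"the": 1, "dog": 1}
total = 2
print(q_interp("dog", "*", "the", c3, c2, c1, total))
# 0.6*1 + 0.3*1 + 0.1*0.5 ≈ 0.95
```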
Discounting methods
Define a new discounted count c*(v, w):
c*(v, w) = c(v, w) − β, i.e. subtract a discount β (for example β = 0.5) from each observed count.
Then define a parameter, the missing probability mass: α(v) = 1 − Σ_{w: c(v,w)>0} c*(v, w) / c(v).
From the table above:
Group words by whether c is zero or nonzero: A(v) = {w : c(v, w) > 0} and B(v) = {w : c(v, w) = 0}. The backed-off bigram estimate is then
q_D(w | v) = c*(v, w) / c(v) for w ∈ A(v), and α(v) · q_ML(w) / Σ_{w'∈B(v)} q_ML(w') for w ∈ B(v).
The trigram version is defined analogously: build A(u, v) and B(u, v) from the trigram counts c(u, v, w), and for w ∈ B(u, v) back off to the bigram estimate q_D(w | v), scaled by the missing mass α(u, v).
This way it is fine even when c(v, w) = 0, because the quantity we back off to is certainly nonzero.
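The bigram-level version of discounting with back-off can be sketched as follows (the trigram version backs off to this estimate in the same way). The counts and β value below are hypothetical toy data:

```python
# Katz-style back-off at the bigram level:
#   c*(v, w) = c(v, w) - beta  for seen bigrams (w in A(v)),
#   alpha(v) = 1 - sum_{w in A(v)} c*(v, w) / c(v)  is the missing mass,
# redistributed over B(v) = {w : c(v, w) = 0} in proportion to the
# unigram counts.

def katz_bigram(w, v, c2, c1, vocab, beta=0.5):
    c_v = sum(c2.get((v, x), 0) for x in vocab)      # c(v)
    A = {x for x in vocab if c2.get((v, x), 0) > 0}  # A(v): seen after v
    if w in A:
        return (c2.get((v, w), 0) - beta) / c_v      # c*(v, w) / c(v)
    alpha = beta * len(A) / c_v                      # missing probability mass
    B = vocab - A                                    # B(v): unseen after v
    denom = sum(c1.get(x, 0) for x in B)
    return alpha * c1.get(w, 0) / denom if denom > 0 else 0.0

# Hypothetical toy counts:
c2 = {("the", "dog"): 2, ("the", "cat"): 2}
c1 = {"dog": 2, "cat": 2, "STOP": 4}
vocab = {"dog", "cat", "STOP"}
print(katz_bigram("dog", "the", c2, c1, vocab))   # seen: (2 - 0.5) / 4 = 0.375
print(katz_bigram("STOP", "the", c2, c1, vocab))  # unseen: alpha(the) = 0.25
```

Note that the estimates still sum to one over the vocabulary: the mass removed from the seen bigrams is exactly the mass handed to the unseen ones.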