Computing symmetric Kullback-Leibler divergence between two documents (answers)

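【Question title】: Computing symmetric Kullback-Leibler divergence between two documents
【Posted】: 2016-05-30 16:56:57
【Question】:

I followed the paper here and the code here (an implementation that uses the symmetric KLD and the back-off model proposed in the paper from the first link) to compute the KLD between two text datasets. I changed the final for loop so that it returns the probability distributions of the two datasets, in order to test whether each of them sums to 1: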

import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    # epsilon: small back-off probability for each word of _s that is missing from _t
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    # gamma: discount applied to the observed probabilities of _t to make room for the epsilons
    gamma = 1 - lenvocabdiff * epsilon

    # Check that the raw (unsmoothed) probabilities of each document sum to 1
    sc = sum(v/ssum for v in _s.values())
    st = sum(v/tsum for v in _t.values())

    ps = []
    pt = []
    # Note: the loop only visits the words of _s
    for t, v in _s.items():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)
    return ps, pt

I tested it with:

d1 = """Many research publications want you to use BibTeX, which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn't have to be an actual 'left delimiter', that is you can use '\left)' if there were some reason to do it."""

With these inputs, sum(ps) = 1 but sum(pt) is much smaller than 1.

Is there something incorrect in the code? Thanks!

Update:

To make both pt and ps sum to 1, I had to change the code to:

    # Union of both vocabularies (collections is already imported at the top)
    vocab = collections.Counter(_s) + collections.Counter(_t)
    ps = []
    pt = []
    for t, v in vocab.items():
        if t in _s:
            pts = gamma * (_s[t] / ssum)
        else:
            pts = epsilon

        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        else:
            ptt = epsilon

        ps.append(pts)
        pt.append(ptt)

    return ps, pt

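【Question comments】:

Unrelated to your question: in your test strings (d1 and d2) you should use two consecutive backslashes. The backslash is the escape character in Python. Example: x="\\left" rather than x="\left".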
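A quick way to see why this matters (a small sketch; the variable names are only illustrative): in a regular string literal, `\r` is read as a carriage return, so `\right` silently loses its backslash. `\left` happens to survive because `\l` is not a recognized escape, although recent Python versions warn about it. A doubled backslash or a raw string keeps the text as written.

```python
# In a regular string literal "\r" is a carriage return, so "\right" is corrupted.
plain = "\right"
escaped = "\\right"   # doubled backslash: a literal backslash followed by "right"
raw = r"\right"       # raw string: backslashes are kept exactly as typed

print(plain.startswith("\r"))   # does the string start with a carriage return?
print(escaped == raw)           # both spellings give the same text
print(len(plain), len(raw))     # the corrupted string is one character shorter
```

For the tokenizer in the question the difference is small, since `\w+` skips the backslash anyway, but the corrupted string yields the token "ight" instead of "right".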

Tags: python nlp similarity information-retrieval

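【Solution 1】:

sum(ps) and sum(pt) are the total probability mass of _s and _t over the support of s (by "the support of s" I mean all the words that appear in _s, regardless of which words appear in _t). This means:

sum(ps) == 1, because the for loop sums over all the words in _s.
sum(pt) <= 1, because the same loop only covers the words of _s, so any probability mass that _t places on words absent from _s is left out. The less the two vocabularies overlap, the smaller sum(pt) becomes.

So I don't think there is anything wrong with the code.

Also, contrary to the title of the question, kldiv() does not compute the symmetric KL divergence; it computes the KL divergence between _s and a smoothed version of _t.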
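To make the point about sum(pt) concrete, here is a small self-contained sketch that mirrors the loop from the question on two toy word counts. The function name `smoothed_over_s` and the toy documents are made up for illustration; only the epsilon and gamma formulas come from the question's code.

```python
from collections import Counter

def smoothed_over_s(s_counts, t_counts):
    """Return (ps, pt) over the words of s only, mirroring the question's loop."""
    s_total = sum(s_counts.values())
    t_total = sum(t_counts.values())
    missing = set(s_counts) - set(t_counts)   # words of s that t never uses
    epsilon = min(min(s_counts.values()) / s_total,
                  min(t_counts.values()) / t_total) * 0.001
    gamma = 1 - len(missing) * epsilon
    ps = [count / s_total for count in s_counts.values()]
    pt = [gamma * t_counts[word] / t_total if word in t_counts else epsilon
          for word in s_counts]
    return ps, pt

# Toy documents that share a single word ("file")
s = Counter("bibtex file bibliography file source".split())
t = Counter("left right delimiter file".split())

ps, pt = smoothed_over_s(s, t)
print(sum(ps))   # all of s's probability mass
print(sum(pt))   # only the part of t's mass that falls on words of s
```

Here sum(ps) is 1 up to floating-point rounding, while sum(pt) comes out at about 0.25: the only shared word carries a quarter of t's mass, and the three epsilons add almost nothing.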

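【Comments】:

Thanks, Tomer! "They are the probabilities over the support of s" makes sense now. The paper does state that this is a symmetric KLD, though. Do you think the implementation here deviates from it a little?
Hi Tomer, I made some small changes to the original code, which can be found here. The modified code computes ps and pt over the union of the two vocabularies instead of relying on the support of s. Do you think this is a better version of KLD than the original implementation for measuring the distance between two text corpora? Many thanks!
The original function computes KL(_s || smoothed(_t)). Note that the smoothing is there to handle the edge case where the support of _t does not match that of _s (consider a word w in _s that does not appear in _t; in that case log(t(w)) = log(0) = -inf, which would make KL = inf). To compute the symmetric KL, just compute (kldiv(_s, _t) + kldiv(_t, _s)) / 2.
Thanks. I think both are valid ways of computing the KL distance.
What exactly are Epsilon and Gamma in the program?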
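As a rough illustration of the comments above, the sketch below computes the directed, smoothed divergence and then averages both directions. Going by the code in the question, epsilon is the tiny back-off probability given to each word of _s that _t does not contain, and gamma = 1 - (number of such words) * epsilon discounts _t's observed probabilities to make room for those epsilons. The names `smoothed_kl` and `symmetric_kl` are hypothetical, and since the final loop of the original kldiv() is not shown in the question, the standard p * log(p / q) term is assumed here. Both inputs are assumed to be non-empty.

```python
import math
from collections import Counter

def smoothed_kl(s_counts, t_counts):
    """KL(s || smoothed(t)), summed over the words of s (assumed standard form)."""
    s_total = sum(s_counts.values())
    t_total = sum(t_counts.values())
    missing = set(s_counts) - set(t_counts)
    # epsilon: tiny back-off probability for a word of s that t never uses
    epsilon = min(min(s_counts.values()) / s_total,
                  min(t_counts.values()) / t_total) * 0.001
    # gamma: discount on t's observed probabilities that pays for the epsilons
    gamma = 1 - len(missing) * epsilon
    div = 0.0
    for word, count in s_counts.items():
        p = count / s_total
        q = gamma * t_counts[word] / t_total if word in t_counts else epsilon
        div += p * math.log(p / q)   # q is never 0 thanks to the smoothing
    return div

def symmetric_kl(s_counts, t_counts):
    """Average of the two directed divergences, as suggested in the comment."""
    return (smoothed_kl(s_counts, t_counts) + smoothed_kl(t_counts, s_counts)) / 2

s = Counter("bibtex file bibliography file source".split())
t = Counter("left right delimiter file".split())

print(smoothed_kl(s, t), smoothed_kl(t, s))   # the two directions generally differ
print(symmetric_kl(s, t))                     # same value if the arguments are swapped
```

Because each direction smooths only the second argument, the two directed values are generally different, and averaging them is what makes the final measure symmetric.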

【Solution 2】:

The sums of each document's probability distribution are stored in the variables sc and st, and both are close to 1.

【Comments】: