【发布时间】:2016-05-30 16:56:57
【问题描述】:
我已经按照论文here 和代码here(使用对称kld 和第一个链接中论文中提出的退避模型实现)计算两个文本数据集之间的KLD。最后我改了for循环,返回两个数据集的概率分布,测试两者总和是否为1:
import re, math, collections
def tokenize(_str):
stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
'are', 'will', 'in', 'it', 'to', 'that']
tokens = collections.defaultdict(lambda: 0.)
for m in re.finditer(r"(\w+)", _str, re.UNICODE):
m = m.group(1).lower()
if len(m) < 2: continue
if m in stopwords: continue
tokens[m] += 1
return tokens
#end of tokenize
def kldiv(_s, _t):
if (len(_s) == 0):
return 1e33
if (len(_t) == 0):
return 1e33
ssum = 0. + sum(_s.values())
slen = len(_s)
tsum = 0. + sum(_t.values())
tlen = len(_t)
vocabdiff = set(_s.keys()).difference(set(_t.keys()))
lenvocabdiff = len(vocabdiff)
""" epsilon """
epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001
""" gamma """
gamma = 1 - lenvocabdiff * epsilon
""" Check if distribution probabilities sum to 1"""
sc = sum([v/ssum for v in _s.itervalues()])
st = sum([v/tsum for v in _t.itervalues()])
ps=[]
pt = []
for t, v in _s.iteritems():
pts = v / ssum
ptt = epsilon
if t in _t:
ptt = gamma * (_t[t] / tsum)
ps.append(pts)
pt.append(ptt)
return ps, pt
我已经测试过
d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""
sum(ps) = 1 但sum(pt) 在以下情况下远小于 1:
代码中是否存在不正确的地方?谢谢!
更新:
为了使 pt 和 ps 总和为 1,我不得不将代码更改为:
vocab = Counter(_s)+Counter(_t)
ps=[]
pt = []
for t, v in vocab.iteritems():
if t in _s:
pts = gamma * (_s[t] / ssum)
else:
pts = epsilon
if t in _t:
ptt = gamma * (_t[t] / tsum)
else:
ptt = epsilon
ps.append(pts)
pt.append(ptt)
return ps, pt
【问题讨论】:
-
与您的问题无关,在您的测试字符串(d1 和 d2)中,您应该使用两个连续的反斜杠。反斜杠字符用于在 python 中转义。示例:
x="\\left"而不是x="\left"。
标签: python nlp similarity information-retrieval