【Posted】:2017-01-12 21:55:53
【Question】:
According to this GitHub tutorial, gensim/docs/notebooks/doc2vec-lee.ipynb, I should get about 96% accuracy.
Here is my code, run with gensim 0.13.4 in a Jupyter 4.3.1 notebook, both installed through Anaconda Navigator.
import gensim
import os
import collections
import smart_open
import random

# Set file names for train data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'

def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(
                    gensim.utils.simple_preprocess(line), [i])

train_corpus = list(read_corpus(lee_train_file))

model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=10)
model.build_vocab(train_corpus)
model.train(train_corpus)

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector],
                                      topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])

collections.Counter(ranks)
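To make clear what the evaluation loop above measures: each document's inferred vector is compared with every stored document vector, and the rank is the position of the document's own tag in the similarity-sorted list (0 means it was its own nearest match). The following is a minimal sketch of that ranking step in plain Python, without gensim. The names `cosine` and `self_rank` and the toy vectors are hypothetical and are not taken from the tutorial.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_rank(query_vec, doc_vecs, doc_id):
    """Position of doc_id when all documents are sorted by similarity to query_vec."""
    sims = sorted(
        ((i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [i for i, _ in sims].index(doc_id)


# Toy stand-ins for trained document vectors (made up for illustration).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A query vector close to document 0 ranks document 0 first.
good_rank = self_rank([0.9, 0.1], doc_vecs, 0)
# A query vector that drifted toward the other documents pushes document 0 down.
bad_rank = self_rank([0.6, 0.8], doc_vecs, 0)
print(good_rank, bad_rank)
```

A rank above 0 therefore means the inferred vector landed closer to some other document's trained vector than to its own.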
In the tutorial's model assessment section, their output is:
Counter({0: 292, 1: 8})
I am getting:
Counter({0: 31,
         1: 24,
         2: 16,
         3: 19,
         4: 16,
         5: 8,
         6: 8,
         7: 10,
         8: 7,
         9: 10,
         10: 12,
         11: 12,
         12: 5,
         13: 9,
         ...
Why am I not getting anything close to their accuracy?
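For reference, the accuracy being compared here is the share of documents that end up at rank 0. A minimal sketch of that calculation, applied to the tutorial's Counter, is shown here. `top1_accuracy` is a hypothetical helper name, not something defined in the tutorial or in gensim.

```python
import collections


def top1_accuracy(rank_counts):
    """Fraction of documents whose own tag came back at rank 0."""
    total = sum(rank_counts.values())
    if total == 0:
        raise ValueError("rank_counts is empty")
    # float() keeps the division exact under Python 2 as well as Python 3.
    return rank_counts.get(0, 0) / float(total)


# The Counter reported in the tutorial.
tutorial_counts = collections.Counter({0: 292, 1: 8})
tutorial_accuracy = top1_accuracy(tutorial_counts)
print("tutorial top-1 accuracy: {:.1%}".format(tutorial_accuracy))  # 97.3%
```

The same function can be called on `collections.Counter(ranks)` from the code in the question to get a directly comparable number.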
【Comments】:
-
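Welcome to SO! Your question lacks basic formatting, and it is not clear what you are asking. Try editing the question to show the steps you have taken to solve the problem. Also, avoid referencing external links unless absolutely necessary. Please read: stackoverflow.com/help/how-to-ask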