【Posted】: 2018-09-24 13:38:42
【Problem description】:
I built a model from the news database in MongoDB and tagged each document with its Mongo collection id:
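from gensim.models.doc2vec import TaggedDocument

docs = []
# Pair each article with its MongoDB id instead of tracking a manual counter.
for article, doc_id in zip(lstcontent, lstids):
    # Note: TaggedDocument expects `words` to be a list of tokens, not a raw string.
    docs.append(TaggedDocument(clean_str(article), [doc_id]))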
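One thing worth checking in the tagging loop above: gensim's TaggedDocument is a namedtuple with the fields words and tags, and words is meant to be a list of tokens. The sketch below uses only the standard library. The stand-in namedtuple and the string-returning clean_str are assumptions made for illustration, since the real clean_str is not shown in the question. It illustrates what a consumer iterating over words would see in each case.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (words, tags),
# used here only so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def clean_str(text):
    """Hypothetical cleaner (assumption): lowercases, strips, returns a STRING."""
    return text.lower().strip()


raw = "  Breaking News Today "
doc_id = "5aa94578094b4051695eeb10"

# Passing the cleaned string directly: iterating over `words` yields characters.
as_string = TaggedDocument(clean_str(raw), [doc_id])
tokens_seen_from_string = list(as_string.words)

# Splitting first: iterating over `words` yields whole word tokens.
as_tokens = TaggedDocument(clean_str(raw).split(), [doc_id])
tokens_seen_from_list = list(as_tokens.words)

print(tokens_seen_from_string[:5])
print(tokens_seen_from_list)
```

If clean_str in the question already returns a list of tokens, this pitfall does not apply and the cause lies elsewhere.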
After that I created the model:
import logging
import gensim.models as g

pretrained_emb = 'tweet_cbow_300/tweets_cbow_300'
saved_path = "documentmodel/doc2vec_model.bin"
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count,
                  sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size,
                  dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
model.save(saved_path)
When I use the model with the following code:
import gensim.models as g
import codecs
model="documentmodel/doc2vec_model.bin"
start_alpha=0.01
infer_epoch=1000
m = g.Doc2Vec.load(model)
sims = m.docvecs.most_similar(['5aa94578094b4051695eeb10'])
sims
the output is:
[('5aa944c1094b4051695eeaef', 0.9255372881889343),
('5aa945c1094b4051695eeb1d', 0.9222575426101685),
('5aa94584094b4051695eeb12', 0.9210859537124634),
('5aa945d2094b4051695eeb20', 0.9083569049835205),
('5aa945c7094b4051695eeb1e', 0.905883252620697),
('5aa9458f094b4051695eeb14', 0.9054019451141357),
('5aa944c7094b4051695eeaf0', 0.9019848108291626),
('5aa94589094b4051695eeb13', 0.9012798070907593),
('5aa945b1094b4051695eeb1a', 0.9000773429870605),
('5aa945bc094b4051695eeb1c', 0.8999895453453064)]
These ids belong to documents that are unrelated to 5aa94578094b4051695eeb10. Where is my mistake?
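For context, the scores in the output above are cosine similarities between document vectors. The following is a minimal pure-Python sketch of that ranking on toy 2-d vectors. The helper names, tags, and vector values are illustrative assumptions, not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_tag, vectors, topn=10):
    """Rank every other tag by cosine similarity to the query tag's vector."""
    query = vectors[query_tag]
    scored = [
        (tag, cosine_similarity(query, vec))
        for tag, vec in vectors.items()
        if tag != query_tag
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (made up): doc_b points almost the same way as doc_a,
# while doc_c is orthogonal to it.
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
ranking = rank_by_similarity("doc_a", toy_vectors)
print(ranking)
```

Scores that all sit in a narrow, high band, as in the output above, can hint that the learned vectors are poorly differentiated, though that is a heuristic rather than a diagnosis.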
【Discussion】:
- You can use the built-in "simple_preprocess" utility instead of the clean_str function.
Tags: word2vec gensim cosine-similarity doc2vec