【发布时间】:2018-06-08 11:35:01
【问题描述】:
我尝试使用 gensim 为 300000 条记录生成主题。在尝试可视化主题时,我收到验证错误。我可以在模型训练后打印主题,但使用 pyLDAvis 失败
# Running and Training LDA model on the document term matrix.
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word = dictionary1, passes=50, workers = 4)
(ldamodel1.print_topics(num_topics=10, num_words = 10))
#pyLDAvis
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')
#error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)
在 pyLDAvis 上运行后尝试出现以下错误
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
2 data
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
110 """
111 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112 return vis_prepare(**opts)
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
372 doc_lengths = _series_with_name(doc_lengths, 'doc_length')
373 vocab = _series_with_name(vocab, 'vocab')
--> 374 _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
375 R = min(R, len(vocab))
376
C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
63 res = _input_check(*args)
64 if res:
---> 65 raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
66
67
ValidationError:
* Not all rows (distributions) in topic_term_dists sum to 1.
【问题讨论】:
-
从训练文档切换到另一组文档时遇到了同样的问题。你确定它是同一个字典。可能是从旧版本加载的。
-
检查你的语料库是否包含 NaNs、Nones、'-'s 等。这通常是因为 LDA、NMF 等不知道如何处理太短或其他的文档无效。
标签: python nlp lda topic-modeling