【Question title】: How to interpret the sklearn LDA perplexity score: why does it always increase as the number of topics increases?
【Posted】: 2017-08-13 07:08:35
【Question】:

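I am trying to find the optimal number of topics with sklearn's LDA model. To do this, I compute the perplexity, following the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2.

But when I increase the number of topics, the perplexity always increases "irrationally". Is my implementation wrong, or are these the correct values?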

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
n_samples = 0.7
n_features = 1000
n_top_words = 20
dataset = kickstarter['short_desc'].tolist()
data_samples = dataset[:int(len(dataset)*n_samples)]
test_samples = dataset[int(len(dataset)*n_samples):]

Use tf (raw term count) features for LDA.

print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
t0 = time()
tf_test = tf_vectorizer.transform(test_samples)
print("done in %0.3fs." % (time() - t0))

Compute the perplexity for 5, 10, 15, ..., 100 topics.

for i in range(5, 101, 5):  # range() works on both Python 2 and Python 3
    n_topics = i

    print("Fitting LDA models with tf features, "
          "n_samples=%d, n_features=%d n_topics=%d "
          % (n_samples, n_features, n_topics))

    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    t0 = time()
    lda.fit(tf)

    train_gamma = lda.transform(tf)
    train_perplexity = lda.perplexity(tf, train_gamma)

    test_gamma = lda.transform(tf_test)
    test_perplexity = lda.perplexity(tf_test, test_gamma)

    print('sklearn preplexity: train=%.3f, test=%.3f' %
          (train_perplexity, test_perplexity))

    print("done in %0.3fs." % (time() - t0))

Perplexity results

Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=5 
sklearn preplexity: train=9500.437, test=12350.525
done in 4.966s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=10 
sklearn preplexity: train=341234.228, test=492591.925
done in 4.628s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=15 
sklearn preplexity: train=11652001.711, test=17886791.159
done in 4.337s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=20 
sklearn preplexity: train=402465954.270, test=609914097.869
done in 4.351s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=25 
sklearn preplexity: train=14132355039.630, test=21945586497.205
done in 4.438s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=30 
sklearn preplexity: train=499209051036.715, test=770208066318.557
done in 4.076s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=35 
sklearn preplexity: train=16539345584599.268, test=24731601176317.836
done in 4.230s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=40 
sklearn preplexity: train=586526357904887.250, test=880809950700756.625
done in 4.596s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=45 
sklearn preplexity: train=20928740385934636.000, test=31065168894315760.000
done in 4.563s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=50 
sklearn preplexity: train=734804198843926784.000, test=1102284263786783616.000
done in 4.790s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=55 
sklearn preplexity: train=24747026375445286912.000, test=36634830286916853760.000
done in 4.839s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=60 
sklearn preplexity: train=879215493067590729728.000, test=1268331920975308783616.000
done in 4.827s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=65 
sklearn preplexity: train=30267393208097070645248.000, test=43678395923698735382528.000
done in 4.705s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=70 
sklearn preplexity: train=1091388615092136975532032.000, test=1564111432914603675222016.000
done in 4.626s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=75 
sklearn preplexity: train=37463573890268863118966784.000, test=51513357456275195169865728.000
done in 5.034s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=80 
sklearn preplexity: train=1281758440147129243608809472.000, test=1736796133443165299937378304.000
done in 5.348s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=85 
sklearn preplexity: train=45100838968058242714191265792.000, test=62725627465378386290422054912.000
done in 4.987s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=90 
sklearn preplexity: train=1555576278144903954081448460288.000, test=2117105172204280105824751190016.000
done in 5.032s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=95 
sklearn preplexity: train=52806759455785055803020813533184.000, test=70510180325555822379548402515968.000
done in 5.284s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=100 
sklearn preplexity: train=1885916623308147578324101753733120.000, test=2505878598724106449894719231098880.000
done in 5.374s.
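
One way to read this log is to look at how much the perplexity grows per step of five topics. The sketch below reuses only the first five train values printed above (the variable names are illustrative) and suggests the growth is close to geometric, by a roughly constant factor each step, which looks more like a systematic effect than noise.

```python
# Train perplexities copied from the first five entries of the log above.
train = [9500.437, 341234.228, 11652001.711, 402465954.270, 14132355039.630]

# Ratio between each value and the one before it (one step = 5 more topics).
ratios = [later / earlier for earlier, later in zip(train, train[1:])]

for n_topics, ratio in zip(range(10, 30, 5), ratios):
    print("n_topics %d -> %d: perplexity grows by a factor of %.1f"
          % (n_topics - 5, n_topics, ratio))
```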

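【Comments】:

May I ask why you rolled back the peer-approved edit? I think this question is interesting, but in its current state it is very hard to interpret. The poor grammar makes it essentially unreadable.
Apart from the grammar, the corrected sentences did not mean what I intended. For example, I think that increasing the number of topics should in general decrease the perplexity. Even if the present results do not fit that expectation, the point is not simply whether the value increases or decreases.
OK, I still think that is essentially what the edit reflected, although with the emphasis on monotonic (always increasing or always decreasing) rather than simply decreasing. Your current problem statement is confusing because your results do not "always increase" with the number of topics; they sometimes increase and sometimes decrease (I believe this is what you mean by "irrational" here, which was probably lost in translation; "irrational" has a different meaning in mathematics and does not make sense in this context, so I would suggest changing it).
Thanks a lot :) I will incorporate your suggestion soon.
Hi! Did you find a solution? I am facing the same issue: the perplexity increases as the number of topics increases.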

Tags: python scikit-learn topic-modeling perplexity

【Solution 1】:

There is a bug in scikit-learn that causes the perplexity to increase:

https://github.com/scikit-learn/scikit-learn/issues/6777
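
A minimal sketch of how the perplexity comparison might look on a scikit-learn release that includes the fix for the linked issue. It makes assumptions the answer does not state: that the installed version names the constructor parameter `n_components` rather than `n_topics`, and that `perplexity` is called with the count matrix only, so the document-topic distribution is computed internally rather than taken from `transform` (this appears to be the part affected by the linked issue). The count matrix is synthetic and purely illustrative; it stands in for the `tf` and `tf_test` matrices from the question.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic term-count matrix (hypothetical stand-in for the question's data).
rng = np.random.RandomState(0)
counts = rng.poisson(1.0, size=(200, 50))
X_train, X_test = counts[:140], counts[140:]

results = {}
for n_components in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    lda.fit(X_train)
    # Pass only the count matrix; do not pass the output of transform().
    results[n_components] = (lda.perplexity(X_train), lda.perplexity(X_test))

for n_components, (train_p, test_p) in sorted(results.items()):
    print("n_components=%d: train=%.3f, test=%.3f"
          % (n_components, train_p, test_p))
```

On random counts like these there is no real topic structure, so the numbers say nothing about the best number of topics; the sketch only illustrates the call pattern.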

【Comments】: