【问题标题】:How to interpret Sklearn LDA perplexity score. Why it always increase as number of topics increase?如何解释 Sklearn LDA 困惑分数。为什么它总是随着主题数量的增加而增加?
【发布时间】:2017-08-13 07:08:35
【问题描述】:

我尝试使用 sklearn 的 LDA 模型找到最佳主题数量。为此,我通过引用https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2 上的代码来计算困惑度。

但是当我增加话题的数量时,困惑总是不合理的增加。我在实现上是错的还是它给出了正确的值?

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
n_samples = 0.7
n_features = 1000
n_top_words = 20
dataset = kickstarter['short_desc'].tolist()
data_samples = dataset[:int(len(dataset)*n_samples)]
test_samples = dataset[int(len(dataset)*n_samples):]

为 LDA 使用 tf(原始术语计数)功能。

print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
t0 = time()
tf_test = tf_vectorizer.transform(test_samples)
print("done in %0.3fs." % (time() - t0))

计算(5、10、15 ... 100 个主题)的困惑度

for i in xrange(5,101,5):
    n_topics = i

    print("Fitting LDA models with tf features, "
          "n_samples=%d, n_features=%d n_topics=%d "
          % (n_samples, n_features, n_topics))

    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    t0 = time()
    lda.fit(tf)

    train_gamma = lda.transform(tf)
    train_perplexity = lda.perplexity(tf, train_gamma)

    test_gamma = lda.transform(tf_test)
    test_perplexity = lda.perplexity(tf_test, test_gamma)

    print('sklearn preplexity: train=%.3f, test=%.3f' %
          (train_perplexity, test_perplexity))

    print("done in %0.3fs." % (time() - t0))

困惑度计算结果

Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=5 
sklearn preplexity: train=9500.437, test=12350.525
done in 4.966s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=10 
sklearn preplexity: train=341234.228, test=492591.925
done in 4.628s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=15 
sklearn preplexity: train=11652001.711, test=17886791.159
done in 4.337s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=20 
sklearn preplexity: train=402465954.270, test=609914097.869
done in 4.351s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=25 
sklearn preplexity: train=14132355039.630, test=21945586497.205
done in 4.438s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=30 
sklearn preplexity: train=499209051036.715, test=770208066318.557
done in 4.076s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=35 
sklearn preplexity: train=16539345584599.268, test=24731601176317.836
done in 4.230s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=40 
sklearn preplexity: train=586526357904887.250, test=880809950700756.625
done in 4.596s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=45 
sklearn preplexity: train=20928740385934636.000, test=31065168894315760.000
done in 4.563s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=50 
sklearn preplexity: train=734804198843926784.000, test=1102284263786783616.000
done in 4.790s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=55 
sklearn preplexity: train=24747026375445286912.000, test=36634830286916853760.000
done in 4.839s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=60 
sklearn preplexity: train=879215493067590729728.000, test=1268331920975308783616.000
done in 4.827s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=65 
sklearn preplexity: train=30267393208097070645248.000, test=43678395923698735382528.000
done in 4.705s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=70 
sklearn preplexity: train=1091388615092136975532032.000, test=1564111432914603675222016.000
done in 4.626s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=75 
sklearn preplexity: train=37463573890268863118966784.000, test=51513357456275195169865728.000
done in 5.034s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=80 
sklearn preplexity: train=1281758440147129243608809472.000, test=1736796133443165299937378304.000
done in 5.348s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=85 
sklearn preplexity: train=45100838968058242714191265792.000, test=62725627465378386290422054912.000
done in 4.987s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=90 
sklearn preplexity: train=1555576278144903954081448460288.000, test=2117105172204280105824751190016.000
done in 5.032s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=95 
sklearn preplexity: train=52806759455785055803020813533184.000, test=70510180325555822379548402515968.000
done in 5.284s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=100 
sklearn preplexity: train=1885916623308147578324101753733120.000, test=2505878598724106449894719231098880.000
done in 5.374s.

【问题讨论】:

  • 我能问一下您为什么撤消同行批准的编辑吗?我认为这个问题很有趣,但在目前的状态下很难解释。糟糕的语法使其基本上不可读。
  • 除了语法问题,改正后的句子意思和我想要的不一样。例如,如果您增加主题的数量,我认为总体上应该会降低困惑度。即使目前的结果不合适,也不是增加或减少的值。
  • 好吧,我仍然认为这基本上是编辑所反映的,尽管强调单调(总是增加或总是减少)而不是简单地减少。您当前的问题陈述令人困惑,因为您的结果不会随着主题数量而“总是增加”,而是有时会增加有时会减少(我相信您在这里指的是“不合理” - 这可能在翻译中丢失了 - 不合理是一个不同的数学词,在这种情况下没有意义,我建议改变它)
  • 非常感谢 :) 我会尽快反映您的建议。
  • 嗨!你找到解决办法了吗?我遇到了同样的问题..困惑正在增加..随着主题数量的增加。

标签: python scikit-learn topic-modeling perplexity


【解决方案1】:

scikit-learn 有一个 bug 导致困惑度增加:

https://github.com/scikit-learn/scikit-learn/issues/6777

【讨论】:

    猜你喜欢
    • 2021-05-31
    • 2021-12-13
    • 1970-01-01
    • 2020-08-10
    • 1970-01-01
    • 2021-02-08
    • 2020-11-03
    • 2018-01-29
    • 1970-01-01
    相关资源
    最近更新 更多