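【Posted】: 2017-08-13 07:08:35
【Question】:
I am trying to find the optimal number of topics for sklearn's LDA model. To do this, I calculate perplexity based on the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2 .
But when I increase the number of topics, the perplexity always increases irrationally. Is my implementation wrong, or is it giving the correct values?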
from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
n_samples = 0.7
n_features = 1000
n_top_words = 20
dataset = kickstarter['short_desc'].tolist()
data_samples = dataset[:int(len(dataset)*n_samples)]
test_samples = dataset[int(len(dataset)*n_samples):]
Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Transform the held-out test samples with the vectorizer fitted on the training data.
print("Extracting tf features for LDA...")
t0 = time()
tf_test = tf_vectorizer.transform(test_samples)
print("done in %0.3fs." % (time() - t0))
Calculate perplexity for 5, 10, 15, ..., 100 topics.
for i in xrange(5, 101, 5):  # Python 2; use range() on Python 3
    n_topics = i
    # n_samples is the fraction 0.7, so %d prints it as 0.
    print("Fitting LDA models with tf features, "
          "n_samples=%d, n_features=%d n_topics=%d "
          % (n_samples, n_features, n_topics))
    # Older scikit-learn API: later releases name this parameter n_components.
    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    t0 = time()
    lda.fit(tf)
    # Perplexity on the training set, using its document-topic distribution.
    train_gamma = lda.transform(tf)
    train_perplexity = lda.perplexity(tf, train_gamma)
    # Perplexity on the held-out test set.
    test_gamma = lda.transform(tf_test)
    test_perplexity = lda.perplexity(tf_test, test_gamma)
    print('sklearn preplexity: train=%.3f, test=%.3f' %
          (train_perplexity, test_perplexity))
    print("done in %0.3fs." % (time() - t0))
Perplexity results
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=5
sklearn preplexity: train=9500.437, test=12350.525
done in 4.966s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=10
sklearn preplexity: train=341234.228, test=492591.925
done in 4.628s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=15
sklearn preplexity: train=11652001.711, test=17886791.159
done in 4.337s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=20
sklearn preplexity: train=402465954.270, test=609914097.869
done in 4.351s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=25
sklearn preplexity: train=14132355039.630, test=21945586497.205
done in 4.438s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=30
sklearn preplexity: train=499209051036.715, test=770208066318.557
done in 4.076s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=35
sklearn preplexity: train=16539345584599.268, test=24731601176317.836
done in 4.230s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=40
sklearn preplexity: train=586526357904887.250, test=880809950700756.625
done in 4.596s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=45
sklearn preplexity: train=20928740385934636.000, test=31065168894315760.000
done in 4.563s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=50
sklearn preplexity: train=734804198843926784.000, test=1102284263786783616.000
done in 4.790s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=55
sklearn preplexity: train=24747026375445286912.000, test=36634830286916853760.000
done in 4.839s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=60
sklearn preplexity: train=879215493067590729728.000, test=1268331920975308783616.000
done in 4.827s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=65
sklearn preplexity: train=30267393208097070645248.000, test=43678395923698735382528.000
done in 4.705s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=70
sklearn preplexity: train=1091388615092136975532032.000, test=1564111432914603675222016.000
done in 4.626s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=75
sklearn preplexity: train=37463573890268863118966784.000, test=51513357456275195169865728.000
done in 5.034s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=80
sklearn preplexity: train=1281758440147129243608809472.000, test=1736796133443165299937378304.000
done in 5.348s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=85
sklearn preplexity: train=45100838968058242714191265792.000, test=62725627465378386290422054912.000
done in 4.987s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=90
sklearn preplexity: train=1555576278144903954081448460288.000, test=2117105172204280105824751190016.000
done in 5.032s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=95
sklearn preplexity: train=52806759455785055803020813533184.000, test=70510180325555822379548402515968.000
done in 5.284s.
Fitting LDA models with tf features, n_samples=0, n_features=1000 n_topics=100
sklearn preplexity: train=1885916623308147578324101753733120.000, test=2505878598724106449894719231098880.000
done in 5.374s.
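As a rough sanity check on the numbers above: per-word perplexity is conventionally defined as exp(-log-likelihood / number of tokens), so a model that spreads probability uniformly over a 1000-term vocabulary scores about 1000. The sketch below only illustrates that definition with the standard library. The function `per_word_perplexity` and the probability lists are hypothetical, made-up inputs, not part of scikit-learn, and scikit-learn derives its value from a variational bound rather than the exact likelihood, so the comparison is approximate.

```python
import math


def per_word_perplexity(word_probs):
    """Return exp(-(1/N) * sum(log p(w_i))) over the N observed tokens."""
    n_tokens = len(word_probs)
    total_log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-total_log_likelihood / n_tokens)


vocab_size = 1000  # same size as n_features in the question

# Hypothetical uniform model: every token gets probability 1/1000.
uniform = [1.0 / vocab_size] * 50
print(per_word_perplexity(uniform))  # close to 1000

# Hypothetical sharper model: every token gets probability 0.01.
sharper = [0.01] * 50
print(per_word_perplexity(sharper))  # close to 100
```

Under this definition, values many orders of magnitude above the vocabulary size, like those in the log, would mean the model does far worse than a uniform guess. That suggests the perplexity computation itself, rather than the choice of topic count, may be what is off.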
【Discussion】:
-
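May I ask why you rolled back the peer-approved edit? I think this question is interesting, but in its current state it is hard to interpret. The poor grammar makes it essentially unreadable.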
-
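Grammar aside, the corrected sentences did not mean what I intended. For example, I think that increasing the number of topics should generally decrease perplexity. Even though the current results do not look right, the issue is not simply that the values increase or decrease.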
-
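Well, I still think that is basically what the edit reflected, although it emphasized monotonic behavior (always increasing or always decreasing) rather than simply decreasing. Your current problem statement is confusing because your results do not "always increase" with the number of topics; they sometimes increase and sometimes decrease. I believe that is what you mean by "irrational" here, which was probably lost in translation. "Irrational" is a different word in mathematics and does not make sense in this context, so I would suggest changing it.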
-
Thank you very much :) I will incorporate your suggestions as soon as possible.
-
Hi! Did you find a solution? I am running into the same problem: perplexity keeps increasing as the number of topics increases.
Tags: python scikit-learn topic-modeling perplexity