[Question title]: pyspark LDA get words in topics
[Posted]: 2019-04-27 23:16:11
[Question description]:

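I am trying to run LDA. Instead of words and documents, I am applying it to error messages and error causes. Each row is an error and each column is an error cause. A cell is 1 if the error cause is active and 0 if it is not. Now I am trying to get the error-cause names (not just the indices) for each topic that is created (a topic here is equivalent to an error pattern). The code I have so far, which seems to work, is the following: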

from pyspark.ml.clustering import LDA
from pyspark.ml.feature import VectorAssembler

# VectorAssembler combines all columns into one vector
assembler = VectorAssembler(
    inputCols=list(set(df.columns) - {'error_ID'}),
    outputCol="features")
lda_input = assembler.transform(df)

# Train LDA model
lda = LDA(k=5, maxIter=10, featuresCol="features")
model = lda.fit(lda_input)

# A model with higher log-likelihood and lower perplexity is considered to be good.
ll = model.logLikelihood(lda_input)
lp = model.logPerplexity(lda_input)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(7)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(lda_input)
# show() prints the table itself and returns None, so it is not wrapped in print()
transformed.show(truncate=False)

My output is:

Based on https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda I added the following part, which does not work:

topics = model.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    for word in range(0, model.vocabSize()):
        print(" " + str(topics[word][topic]))

How can I now get the most common error causes, i.e. find the column that corresponds to a term index?

[Question comments]:

    Tags: apache-spark pyspark lda topic-modeling


    [Solution 1]:

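    To iterate over a DenseMatrix you need to convert it to an array. This should not raise an error. However, I am not sure about the printed result, because it depends on your data.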

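    topn_words = 10

    # topicsMatrix() is a vocabSize x k DenseMatrix; convert it to a NumPy array
    topics = model.topicsMatrix().toArray()
    # Read the sizes from the matrix so the loops cannot run past k or vocabSize
    vocab_size, num_topics = topics.shape
    for topic in range(num_topics):
        print("Topic " + str(topic) + ":")
        # Prints the weights of the first topn_words terms (by index, not sorted by weight)
        for word in range(min(topn_words, vocab_size)):
            print(" " + str(topics[word][topic]))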
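The loop above prints weights only. To turn them into error-cause names, the row index of the matrix has to be mapped back to a column name. As far as I understand, when every assembled column is a plain numeric scalar, feature index i corresponds to the i-th entry of the assembler's `inputCols`, so `assembler.getInputCols()` should give the names in the right order. Note that the question builds that list from a `set`, whose order is arbitrary, so the names should be read back from the assembler rather than rebuilt. Below is a minimal pure-Python sketch of the lookup. The function name `top_terms_per_topic`, the example matrix, and the cause names are all made up for illustration; with Spark you would pass `model.topicsMatrix().toArray().tolist()` and `assembler.getInputCols()` instead.

```python
def top_terms_per_topic(topic_matrix, term_names, top_n=3):
    """Return, for each topic, the top_n (term_name, weight) pairs.

    topic_matrix: list of rows, one row per term and one column per topic
                  (the same layout as the vocabSize x k topics matrix).
    term_names:   names in the same order as the feature vector.
    """
    if len(topic_matrix) != len(term_names):
        raise ValueError("one name is needed per matrix row")
    num_topics = len(topic_matrix[0]) if topic_matrix else 0
    result = []
    for topic in range(num_topics):
        # Pair every term name with its weight in this topic
        weighted = [(term_names[i], row[topic]) for i, row in enumerate(topic_matrix)]
        # Highest weight first
        weighted.sort(key=lambda pair: pair[1], reverse=True)
        result.append(weighted[:top_n])
    return result


# Hypothetical stand-ins for model.topicsMatrix().toArray().tolist()
# and assembler.getInputCols(); the values and names are invented.
example_matrix = [
    [30.0, 2.0],
    [99.0, 1.0],
    [6.5, 40.0],
]
example_names = ["cause_a", "cause_b", "cause_c"]

for topic_id, terms in enumerate(top_terms_per_topic(example_matrix, example_names, top_n=2)):
    print("Topic", topic_id, terms)
```

Sorting by weight is what makes this a "top terms" listing; indexing the first N rows only returns the first N columns of the input, whatever their weight.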
    

    [Comments]:

    • Hi @Tolga, I tried this code, but it still prints numbers instead of the topic words. I am using Spark version 2.4.3.
    • Topic 0: 30.17673729638776 99.14231560215744 6.66435717428376 12.15504287041995 18.982848683531195 100.50830388771101 -68.84233323370782 Topic 1: 29.973596695840133 98.1127093876147 7.307128038415854 12.078623258770016 18.90257467334695 96.68763877584024 -67.47621948864617 Topic 2: 29.40209579979105 97.47622395948163 5.6867501820791695 12.088347117003329 19.433579035205614 97.89893275034943 -67.58479307177592
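The numbers in the comment above are term weights, which is all the model stores; it has no knowledge of column names. One possible way to get names is to start from `model.describeTopics(...)`, which the question already calls and which returns the columns `topic`, `termIndices` and `termWeights`, and translate each index through the list of assembled column names. The sketch below is a hedged illustration rather than a tested Spark recipe: `name_topic_terms` is a hypothetical helper, the rows are hand-written dictionaries shaped like the result of `model.describeTopics(2).collect()`, and the names stand in for `assembler.getInputCols()`.

```python
def name_topic_terms(topic_rows, term_names):
    """Replace term indices with names for each described topic.

    topic_rows: iterable of rows that support row["topic"],
                row["termIndices"] and row["termWeights"]
                (plain dicts here; collected Spark Rows allow the same access).
    term_names: names in feature-vector order.
    """
    named = []
    for row in topic_rows:
        pairs = [
            (term_names[index], weight)
            for index, weight in zip(row["termIndices"], row["termWeights"])
        ]
        named.append((row["topic"], pairs))
    return named


# Hypothetical rows and names, invented for illustration only.
example_rows = [
    {"topic": 0, "termIndices": [1, 0], "termWeights": [0.6, 0.3]},
    {"topic": 1, "termIndices": [2, 0], "termWeights": [0.7, 0.2]},
]
example_names = ["cause_a", "cause_b", "cause_c"]

for topic_id, pairs in name_topic_terms(example_rows, example_names):
    print("Topic", topic_id, pairs)
```

This only works if the name list is in exactly the order that was passed to the assembler, so it is safer to read it from the assembler than to rebuild it from `df.columns`.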