【Question title】: Topic modeling - run LDA in sklearn: how to compute the Wordcloud?
【Posted】: 2020-07-02 13:38:05
【Question】:

I trained an LDA model in sklearn to build a topic model, but I don't know how to compute a keyword Wordcloud for each of the topics obtained.

Here is my LDA model:

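from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word',
                             min_df=3,
                             max_df=6000,
                             stop_words='english',
                             lowercase=False,
                             token_pattern='[a-zA-Z0-9]{3,}',  # tokens of 3+ alphanumeric characters
                             max_features=50000,
                            )
data_vectorized = vectorizer.fit_transform(data_lemmatized) # data_lemmatized is all my processed document text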

best_lda_model = LatentDirichletAllocation(batch_size=128, doc_topic_prior=0.1,
                      evaluate_every=-1, learning_decay=0.7,
                      learning_method='online', learning_offset=10.0,
                      max_doc_update_iter=100, max_iter=10,
                      mean_change_tol=0.001, n_components=10, n_jobs=None,
                      perp_tol=0.1, random_state=None, topic_word_prior=0.1,
                      total_samples=1000000.0, verbose=0)

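# transform() needs an already fitted model; if best_lda_model has not been fitted yet, call fit_transform() instead
lda_output = best_lda_model.transform(data_vectorized)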

I know that best_lda_model.components_ gives the topic-word weights... and that vectorizer.get_feature_names() gives all the words in the vocabulary...

Thank you very much!

【Comments】:

    Tags: python scikit-learn lda topic-modeling word-cloud


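    【Solution 1】:

    You have to iterate over the model's "components_", which has shape [n_components, n_features]: the first dimension indexes the topics and the second holds the score of each word in the vocabulary. So you first find the indices of the words most relevant to each topic, and then look those words up in the vocabulary list returned by get_feature_names().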

    import numpy as np
    
    # vocabulary list, used to map column indices back to words
    # (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
    vocab = vectorizer.get_feature_names()
    
    # dictionary to store the words for each topic, and the number of words per topic to retrieve
    words = {}
    n_top_words = 10
    
    for topic, component in enumerate(best_lda_model.components_):
    
        # argsort is ascending, so reverse with [::-1] to get the highest-scoring words first
        indices = np.argsort(component)[::-1][:n_top_words]
    
        # store the words most relevant to the topic
        words[topic] = [vocab[i] for i in indices]
    
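    The loop above keeps only the words, while a word cloud also needs a weight per word. Below is a minimal sketch of one way to get from "components_" to a word cloud per topic. It assumes the third-party wordcloud and matplotlib packages are installed; the helper names topic_word_frequencies and plot_topic_wordclouds, and the toy arrays, are made up for illustration and are not part of sklearn.

```python
import numpy as np


def topic_word_frequencies(components, vocab, n_top_words=10):
    """Return one {word: weight} dict per topic, keeping the top-weighted words."""
    components = np.asarray(components, dtype=float)
    frequencies = []
    for component in components:
        # indices of the n_top_words largest weights, in descending order
        top = np.argsort(component)[::-1][:n_top_words]
        frequencies.append({vocab[i]: float(component[i]) for i in top})
    return frequencies


def plot_topic_wordclouds(frequencies):
    """Draw one word cloud per topic (needs the wordcloud and matplotlib packages)."""
    # imported lazily so the helper above works without the plotting packages
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    for topic, freqs in enumerate(frequencies):
        cloud = WordCloud(background_color="white").generate_from_frequencies(freqs)
        plt.figure()
        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Topic {topic}")
    plt.show()


# toy stand-ins for best_lda_model.components_ and the vectorizer vocabulary
toy_components = np.array([[5.0, 0.1, 2.0, 0.3],
                           [0.2, 4.0, 0.1, 3.0]])
toy_vocab = ["data", "game", "model", "team"]
toy_frequencies = topic_word_frequencies(toy_components, toy_vocab, n_top_words=2)
print(toy_frequencies)
```

    With the fitted model from the question, the same helpers would presumably be called as plot_topic_wordclouds(topic_word_frequencies(best_lda_model.components_, vocab)). Word sizes then follow the raw topic-word weights; dividing each row of "components_" by its sum first would give per-topic word probabilities without changing the ranking.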
