Pandas: LDA Top n keywords and topics with weights

【Question title】: Pandas: LDA Top n keywords and topics with weights
【Posted】: 2021-06-23 01:07:40
【Question】:

I am using LDA for a topic modeling task, and I get 10 components with 15 top words each:

for index, topic in enumerate(lda.components_):
    print(f'Top 10 words for Topic #{index}')
    print([vectorizer.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

This prints:

Top 10 words for Topic #0
['compile', 'describes', 'info', 'extent', 'changing', 'reader', 'reservation', 'countries', 'printed', 'clear', 'line', 'passwords', 'situation', 'tables', 'downloads']

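Now I want to create a pandas DataFrame that shows each topic (as the index) against all keywords (as the columns), together with their weights. Keywords that are not among a topic's top words should get a weight of 0, but I can't get this to work. So far I have the code below, but it shows every feature name (about 1700). How can I keep only the top 10 for each topic?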

topicnames = ['Topic' + str(i) for i in range(lda.n_components)]
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

【Comments】:

Did my answer work for you, @StivenLancheros? If not, you can add a comment explaining the problem; otherwise, see what to do when someone answers.

Tags: python pandas dataframe lda topic-modeling

【Solution 1】:

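If I understand correctly, you have a DataFrame containing all the values, and you want to keep the 10 largest values in each row and set the remaining values to 0.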

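Here we transform each row by:

getting its 10 highest values
reindexing to the row's original index (that is, the DataFrame's columns) and filling the missing entries with 0: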

>>> df.transform(lambda s: s.nlargest(10).reindex(s.index, fill_value=0), axis='columns')
    a  b   c   d   e  f   g   h   i   j   k  l   m   n   o   p  q   r  s   t   u   v  w  x  y
a   0  0  63  98   0  0  73   0  78   0  94  0   0  63  68  98  0   0  0  67   0  77  0  0  0
z  76  0   0   0  84  0  62  61   0  93   0  0  82  70   0   0  0  91  0   0  48  95  0  0  0
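
Applied to the topic-keyword matrix from the question, the same idea could look like the sketch below. It is only an illustration: the random `components` array and the `feature_names` list are hypothetical stand-ins for `lda.components_` and the vectorizer's feature names, and the sizes (3 topics, 20 words, top 5) are chosen to keep the output small. The last step, dropping keywords that appear in no topic's top list, is an optional addition and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for lda.components_ and the vectorizer's feature names.
rng = np.random.default_rng(0)
n_topics, n_words, top_n = 3, 20, 5
components = rng.random((n_topics, n_words)) + 0.1  # strictly positive weights
feature_names = [f"word{i}" for i in range(n_words)]

# Topic-keyword matrix: one row per topic, one column per keyword.
df_topic_keywords = pd.DataFrame(
    components,
    index=[f"Topic{i}" for i in range(n_topics)],
    columns=feature_names,
)

# Keep the top_n weights in each row (topic) and set every other weight to 0.
df_top = df_topic_keywords.transform(
    lambda s: s.nlargest(top_n).reindex(s.index, fill_value=0),
    axis="columns",
)

# Optional: drop keywords that are not in any topic's top_n.
df_top = df_top.loc[:, (df_top != 0).any(axis=0)]

print(df_top)
```

With real data, `components` would be replaced by the fitted model's `components_` attribute and `top_n` by 10. The resulting frame has at most `n_topics * top_n` columns instead of all ~1700 features.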

【Comments】: