主题建模时为每个文本分配多个主题答案

【问题标题】：Assigning multiple topics to each text when topic modelling主题建模时为每个文本分配多个主题
【发布时间】：2021-07-12 10:02:36
【问题描述】：

使用下面的 quanteda 函数，我想为每个文本分配多个主题，但不确定如何实现这一点。目前它只为每个文本分配一个主题。

corp_news <- data_corpus_inaugural

news_dfm <- dfm(corp_news, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE, remove=stopwords("en"))

tmod_lda <- textmodel_lda(news_dfm, k = 10)

news_dfm$topic <- topics(tmod_lda)

【问题讨论】：

嗨 - 没有功能 textmodel_lda()。这听起来更像是一个主题建模包的问题。

标签： r nlp lda topic-modeling quanteda

【解决方案1】：

这在您正在使用的seededlda 包中没有实现，我认为其他任何 LDA 包也不会这样做。但是，topics 只是选择具有最高 theta 值的主题。这是直接来自source code：

#' @export
#' @method topics textmodel_lda
topics.textmodel_lda <- function(x) {
    result <- factor(max.col(x$theta), labels = colnames(x$theta),
                     levels = seq_len(ncol(x$theta)))
    result[rowSums(x$data) == 0] <- NA
    return(result)
}

我们可以修改这个函数以返回任意数量的主题，从最可能到最不可能排序：

topicsn <- function(x, n = 1) {
  if (n == 1) {
    result <- factor(max.col(x$theta), labels = colnames(x$theta),
                     levels = seq_len(ncol(x$theta)))
  } else if (n > 1) {
    result <- factor(apply(x$theta, 1, order)[, seq_len(n)],
                     labels = colnames(x$theta),
                     levels = seq_len(ncol(x$theta)))
    result <- matrix(result, ncol = n)
    colnames(result) <- paste0("top_topic_", seq_len(n))
  }
  
  return(result)
}

现在我再次运行您的代码（略有更改，因为新 quanteda 中的语法不同：

library(quanteda)
library(seededlda)
corp_news <- data_corpus_inaugural

news_dfm <- corp_news %>% 
  tokens(remove_numbers = TRUE, 
         remove_symbol = TRUE) %>% 
  tokens_remove(stopwords("en")) %>% 
  dfm()

tmod_lda <- textmodel_lda(news_dfm, k = 10)

现在我们可以使用该函数返回两个热门话题：

topicsn(tmod_lda, 2)
#>       top_topic_1 top_topic_2
#>  [1,] "topic3"    "topic1"   
#>  [2,] "topic10"   "topic2"   
#>  [3,] "topic4"    "topic4"   
#>  [4,] "topic1"    "topic5"   
#>  [5,] "topic5"    "topic9"   
#>  [6,] "topic9"    "topic10"  
#>  [7,] "topic8"    "topic3"   
#>  [8,] "topic2"    "topic8"   
#>  [9,] "topic6"    "topic6"   
#> [10,] "topic7"    "topic7"

对于第一个文本，最流行的主题是“topic3”，然后是“topic1”，依此类推。

【讨论】：