tf-idf document-term matrix and LDA: error messages in R (answers)

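[Question title]: tf-idf document term matrix and LDA: Error messages in R
[Posted]: 2018-01-15 20:05:44
[Question]:

Can we feed a tf-idf document-term matrix into Latent Dirichlet Allocation (LDA)? If so, how?

It does not work in my case: the LDA function requires a "term frequency" document-term matrix.

Thanks

(I have tried to ask the question as concisely as possible, so if you need more details, I can add them.)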

##########################################################################
                           TF-IDF Document matrix construction
##########################################################################    

> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+   function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i       : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j       : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v       : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow    : int 64
$ ncol    : int 297
$ dimnames:List of 2
  ..$ Docs : chr [1:64] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document frequency" "tf-idf"

##########################################################################
                           LDA section
##########################################################################

> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,
+                    seed = seed, best = best,
+                    burnin = burnin, iter = iter, thin = thin))

##########################################################################
                           Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,  :
  The DocumentTermMatrix needs to have a term frequency weighting

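[Comments on the question]:

It would be best to provide a minimal reproducible example. You can do that with a dummy dataset or with one of the example datasets that ship with the packages. (See, for example, my last answer.)

Tags: r matrix text-mining lda tidytext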

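[Solution 1]:

If you look through the documentation for LDA topic modeling in the topicmodels package, for example by typing ?LDA in the R console, you will see that this modeling procedure expects a frequency-weighted document-term matrix, not a tf-idf-weighted one.

"Object of class "DocumentTermMatrix" with term-frequency weighting or an object coercible..."

So the answer is no: you cannot pass a tf-idf-weighted DTM directly to this function. If you already have a tf-idf-weighted DTM, you can convert it with tm::weightTf() to get the necessary weighting. If you are building the document-term matrix from scratch, do not weight it by tf-idf.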
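To make the "build it from scratch" route concrete, here is a minimal sketch. It assumes the tm and topicmodels packages are installed; the toy documents, k = 2, and the Gibbs control values are made up for illustration and are not taken from the question. Rebuilding from the corpus also avoids relying on how weightTf() treats values that are already tf-idf weighted.

```r
library(tm)
library(topicmodels)

# Toy corpus (made-up data, for illustration only)
docs <- c("cats chase mice and cats sleep",
          "dogs chase cats and dogs bark",
          "stocks rise when markets rally",
          "markets fall and stocks drop")
corpora <- VCorpus(VectorSource(docs))

# With no weighting given, DocumentTermMatrix() uses term frequency
# (weightTf), which is the weighting LDA() asks for
DTM_tf <- DocumentTermMatrix(corpora)

# tf-idf weighted version, built as in the question; LDA() rejects it
DTM_tfidf <- DocumentTermMatrix(
  corpora,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE))
)

# The weighting is stored as an attribute on the matrix
attr(DTM_tf, "weighting")
attr(DTM_tfidf, "weighting")

# Fit the model on the term-frequency matrix
# (seed, burnin, iter and thin are arbitrary illustrative values)
LDA_results <- LDA(DTM_tf, k = 2, method = "Gibbs",
                   control = list(seed = 1234, burnin = 100,
                                  iter = 500, thin = 10))
terms(LDA_results, 3)
```

Calling LDA(DTM_tfidf, k = 2) in the same session should reproduce the error from the question, which makes it easy to confirm that the weighting attribute is what the function checks.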

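[Discussion]:

Thanks .... I use two accounts because I sometimes have two unrelated questions at the same time. I am not sure whether this breaks the forum's rules.
As far as LDA is concerned, is it possible to weight the document-term matrix and then remove some zero-weight terms from the sparse matrix?
The purpose is to reduce the dimensionality of the matrix. Here TF-IDF plays a role similar to stop-word removal: rather than removing stop words, it removes terms that occur too frequently and carry little information about the corpus. I came across this link: davidmeza1.github.io/2015/07/20/topic-modeling-in-R.html
term_tfidf 0)) summary(term_tfidf) It looks as if we are manipulating the DTM, yet the computer still treats the DTM as having a frequency weighting scheme. @Julia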
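The idea raised in this comment, using tf-idf only to decide which terms to keep while leaving the counts themselves untouched, can be sketched in base R. This is an illustration on a small made-up dense count matrix, not the code from the linked post; the score (mean relative term frequency times log2 inverse document frequency) and the cut-off of 0 are assumptions chosen for the example.

```r
# Toy count matrix (made-up data): rows are documents, columns are terms
counts <- matrix(c(3, 1, 0, 2,
                   2, 0, 1, 3,
                   4, 0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("1", "2", "3"),
                                 Terms = c("the", "cat", "dog", "market")))

# Term frequency relative to document length
tf <- counts / rowSums(counts)

# Mean relative frequency over the documents that contain the term
mean_tf <- colSums(tf) / colSums(counts > 0)

# Inverse document frequency; a term present in every document gets 0
idf <- log2(nrow(counts) / colSums(counts > 0))

term_tfidf <- mean_tf * idf

# Keep the raw counts, but only for terms scoring above the cut-off
# (0 is a hypothetical threshold; tune it for a real corpus)
keep <- term_tfidf > 0
counts_reduced <- counts[, keep, drop = FALSE]

# Drop documents left with no terms, since LDA needs non-empty rows
counts_reduced <- counts_reduced[rowSums(counts_reduced) > 0, , drop = FALSE]
counts_reduced
```

Because the tf-idf score is only used to build the `keep` mask, the entries that remain are still integer counts. That is why a matrix filtered this way is still term-frequency weighted as far as LDA is concerned, even though tf-idf was involved in choosing the columns.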