DocumentTermMatrix in R is computing Idf with respect to base 2

[Question title]: DocumentTermMatrix in R is computing Idf with respect to base 2
[Posted]: 2016-02-06 00:22:28
[Question]:

I am using the following R code to compute tf-idf:

library(tm)
library(SnowballC)
docs <- c(D1 = "The sky is blue", D2 = "The sun is bright", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs)) #Make a corpus object from a text vector
#Clean the text
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

The result I get is:

Terms
Docs     blue   bright       sky       sun
   1 0.7924813 0.0000000 0.2924813 0.0000000
   2 0.0000000 0.2924813 0.0000000 0.2924813
   3 0.0000000 0.1949875 0.1949875 0.1949875

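However, when I do the calculation by hand, the results do not match. What I noticed is that R computes the IDF as log2(total number of documents / number of documents containing term t).

Is there a way in R to override the log base, changing it from 2 to 10? Please advise.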
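
To make the mismatch concrete, here is a minimal base-R sketch of the hand calculation, assuming normalized term frequency and the three documents after stop-word removal. The tokenized `docs` list and the helper `tfidf()` are illustrative names, not part of tm, and the trailing "." in D3 is assumed to have been stripped.

```r
# Documents after lower-casing and stop-word removal (assumed tokenization)
docs <- list(
  D1 = c("sky", "blue"),
  D2 = c("sun", "bright"),
  D3 = c("sun", "sky", "bright")
)
terms <- sort(unique(unlist(docs)))

# Term counts: one row per document, one column per term
counts <- t(vapply(docs,
                   function(d) colSums(outer(d, terms, "==")),
                   numeric(length(terms))))
colnames(counts) <- terms

tf <- counts / rowSums(counts)   # term frequency, normalized by document length
df <- colSums(counts > 0)        # number of documents containing each term
n  <- nrow(counts)               # total number of documents

# Hypothetical helper: tf-idf with a caller-chosen logarithm base
tfidf <- function(base) sweep(tf, 2, log(n / df, base = base), `*`)

round(tfidf(2), 7)    # base 2: D1/blue is 0.7924813, as in the output above
round(tfidf(10), 7)   # base 10: D1/blue is 0.2385606
```

With base 2 this reproduces the matrix shown above, while base 10 gives the smaller values a hand calculation with log10 would produce.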

Tags: r tf-idf

[Answer 1]:

Try writing your own weighting function:

weightTfIdf.log10 <- function (m, normalize = TRUE) 
{
    isDTM <- inherits(m, "DocumentTermMatrix")
    if (isDTM) 
        m <- t(m)
    if (normalize) {
        cs <- col_sums(m)
        if (any(cs == 0)) 
            warning("empty document(s): ", paste(Docs(m)[cs == 
                0], collapse = " "))
        names(cs) <- seq_len(nDocs(m))
        m$v <- m$v/cs[m$j]
    }
    rs <- row_sums(m > 0)
    if (any(rs == 0)) 
        warning("unreferenced term(s): ", paste(Terms(m)[rs == 
            0], collapse = " "))
    lnrs <- log10(nDocs(m)/rs)
    lnrs[!is.finite(lnrs)] <- 0
    m <- m * lnrs
    attr(m, "weighting") <- c(sprintf("%s%s", "term frequency - inverse document frequency", 
        if (normalize) " (normalized)" else ""), "tf-idf")
    if (isDTM) 
        t(m)
    else m
}
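# Evaluate the function inside tm's namespace so that the helpers it calls
# (col_sums, row_sums, nDocs, Docs, Terms) resolve as they do for weightTfIdf
environment(weightTfIdf.log10) <- environment(TermDocumentMatrix)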

dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf.log10))
as.matrix(dtm)
#          Docs
# Terms              1          2          3
#   blue    0.23856063 0.00000000 0.00000000
#   bright  0.00000000 0.23856063 0.00000000
#   bright. 0.00000000 0.00000000 0.15904042
#   sky     0.08804563 0.00000000 0.05869709
#   sun     0.00000000 0.08804563 0.05869709
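
As an aside, changing the logarithm base only rescales every weight by a constant, because log10(x) = log2(x) * log10(2). Here is a minimal sketch of that shortcut, using the base-2 values reported in the question as a hard-coded matrix (`tfidf2` and `tfidf10` are illustrative names):

```r
# Base-2 tf-idf values copied from the question's output (rows = documents)
tfidf2 <- matrix(
  c(0.7924813, 0.0000000, 0.2924813, 0.0000000,
    0.0000000, 0.2924813, 0.0000000, 0.2924813,
    0.0000000, 0.1949875, 0.1949875, 0.1949875),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("blue", "bright", "sky", "sun"))
)

# One constant factor converts every base-2 weight to base 10
tfidf10 <- tfidf2 * log10(2)
round(tfidf10, 7)
```

Under the same assumption, multiplying `as.matrix(dtm)` from the default `weightTfIdf` by `log10(2)` should give the same numbers as the custom weighting function, without copying tm internals. Zero weights stay zero, so the non-finite handling in the function above is unaffected.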
