【Question title】: DocumentTermMatrix in R is computing Idf with respect to base 2
【Posted】: 2016-02-06 00:22:28
【Question】:

I am using the following R code to compute tf-idf:

library(tm)
library(SnowballC)
docs <- c(D1 = "The sky is blue", D2 = "The sun is bright", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs)) #Make a corpus object from a text vector
#Clean the text
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm);

The result I get is:

Terms
Docs     blue   bright       sky       sun
   1 0.7924813 0.0000000 0.2924813 0.0000000
   2 0.0000000 0.2924813 0.0000000 0.2924813
   3 0.0000000 0.1949875 0.1949875 0.1949875

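However, when I do the calculation by hand, the results do not match. What I noticed is that R computes the IDF as log2(total number of documents / number of documents containing term t).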

Is there a way in R to change the logarithm base from 2 to 10? Any suggestions are welcome.
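The manual calculation described above can be sketched in base R, without tm. This is only an illustration: the token lists are hard-coded and assumed to be what remains after lowercasing and stop-word removal (they match the four terms in the output above), and `tfidf()` is a hypothetical helper, not part of any package.

```r
# Tokens per document after lowercasing and stop-word removal.
# Assumption: hard-coded to match the four-term output shown in the question.
docs <- list(
  "1" = c("sky", "blue"),
  "2" = c("sun", "bright"),
  "3" = c("sun", "sky", "bright")
)
terms <- sort(unique(unlist(docs)))

# Normalized term frequency: occurrences of a term / number of terms in the document.
tf <- t(vapply(
  docs,
  function(d) tabulate(match(d, terms), nbins = length(terms)) / length(d),
  numeric(length(terms))
))
colnames(tf) <- terms

# Document frequency: number of documents that contain each term.
df <- colSums(tf > 0)

# Hypothetical helper: tf-idf with an arbitrary logarithm base.
tfidf <- function(base) {
  idf <- log(length(docs) / df, base = base)
  sweep(tf, 2, idf, `*`)  # multiply each term column by its idf
}

round(tfidf(2), 7)   # base 2: "blue" in document 1 is 0.7924813
round(tfidf(10), 7)  # base 10: "blue" in document 1 is 0.2385606
```

With base 2 this reproduces the matrix printed above, which supports the observation that the default weighting uses log2; with base 10 the same entries shrink by a constant factor.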

【Comments】:

    Tags: r tf-idf


    【Solution 1】:

    Try writing your own weighting function:

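    # Same logic as tm's weightTfIdf, but with log10 instead of log2 for the idf.
    weightTfIdf.log10 <- function (m, normalize = TRUE) 
    {
        # Work on a term-document layout; transpose back at the end if needed.
        isDTM <- inherits(m, "DocumentTermMatrix")
        if (isDTM) 
            m <- t(m)
        if (normalize) {
            # Divide each count by the total number of terms in its document.
            cs <- col_sums(m)
            if (any(cs == 0)) 
                warning("empty document(s): ", paste(Docs(m)[cs == 
                    0], collapse = " "))
            names(cs) <- seq_len(nDocs(m))
            m$v <- m$v/cs[m$j]
        }
        # Number of documents in which each term occurs.
        rs <- row_sums(m > 0)
        if (any(rs == 0)) 
            warning("unreferenced term(s): ", paste(Terms(m)[rs == 
                0], collapse = " "))
        # Inverse document frequency, base 10.
        lnrs <- log10(nDocs(m)/rs)
        lnrs[!is.finite(lnrs)] <- 0
        m <- m * lnrs
        attr(m, "weighting") <- c(sprintf("%s%s", "term frequency - inverse document frequency", 
            if (normalize) " (normalized)" else ""), "tf-idf")
        if (isDTM) 
            t(m)
        else m
    }
    # Let the function see tm's internal helpers (col_sums, row_sums, ...).
    environment(weightTfIdf.log10) <- environment(TermDocumentMatrix)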
    
    dtm <- TermDocumentMatrix(dd, control = list(weighting = weightTfIdf.log10))
    as.matrix(dtm)
    #          Docs
    # Terms              1          2          3
    #   blue    0.23856063 0.00000000 0.00000000
    #   bright  0.00000000 0.23856063 0.00000000
    #   bright. 0.00000000 0.00000000 0.15904042
    #   sky     0.08804563 0.00000000 0.05869709
    #   sun     0.00000000 0.08804563 0.05869709
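    As a possible alternative to a custom weighting function: because log10(x) = log2(x) * log10(2), a matrix weighted with the default log2-based tf-idf can simply be rescaled by the constant log10(2). The sketch below is an illustration only; it hard-codes the matrix printed in the question rather than calling tm, and the names `tfidf_log2` and `tfidf_log10` are made up for this example.

```r
# log2-based tf-idf weights as printed in the question
# (documents in rows, terms in columns).
tfidf_log2 <- matrix(
  c(0.7924813, 0.0000000, 0.2924813, 0.0000000,
    0.0000000, 0.2924813, 0.0000000, 0.2924813,
    0.0000000, 0.1949875, 0.1949875, 0.1949875),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("blue", "bright", "sky", "sun"))
)

# Change of base: log10(x) = log2(x) * log10(2), so one scalar factor is enough.
tfidf_log10 <- tfidf_log2 * log10(2)

round(tfidf_log10, 7)
```

    In a tm session the same rescaling would presumably be applied to `as.matrix(dtm)`. The "blue", "sky" and "sun" values obtained this way agree with the output above; the "bright" values differ only because the output above counts "bright." (with the trailing period) as a separate term.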
    
