【Question title】: How to create a document term matrix using native R
【Posted】: 2013-10-25 15:28:07
【Question】:

I want to create a document term matrix using native R (without additional packages such as tm). The data is structured as follows:

Doc1: the test was to test the test
Doc2: we did prepare the exam to test the exam
Doc3: was the test the exam
Doc4: the exam we did prepare was to test the test
Doc5: we were successful so we all passed the exam

What I want to achieve is the following:

         Term Doc1 Doc2 Doc3 Doc4 Doc5 DF
1         all    0    0    0    0    1  1
2         did    0    1    0    1    0  2
3        exam    0    2    1    1    1  4
4      passed    0    0    0    0    1  1

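【Comments】:

  • You could look at the source code in the tm package... and rewrite it... Why don't you want to use a ready-made tool?
  • I would start by looking at the source code of functions that already exist.

Tags: r matrix document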


【Solution 1】:

Here is one approach, but why not use the tm package?

## Your data
dat <- data.frame(
    doc = factor(c("Doc1", "Doc2", "Doc3", "Doc4", "Doc5")),
    text = c("the test was to test the test",
        "we did prepare the exam to test the exam",
        "was the test the exam",
        "the exam we did prepare was to test the test",
        "we were successful so we all passed the exam"),
    stringsAsFactors = FALSE)

## Function to turn a list of vectors into a data frame of counts
## (one row per list element, one column per unique value)
mtabulate <- function(vects) {
    lev <- sort(unique(unlist(vects)))
    dat <- do.call(rbind, lapply(vects, function(x, lev) {
        tabulate(factor(x, levels = lev), nbins = length(lev))
    }, lev = lev))
    colnames(dat) <- lev
    data.frame(dat, check.names = FALSE)
}


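## Split each document's text into lower-case words
out <- lapply(split(dat$text, dat$doc), function(x) {
    unlist(strsplit(tolower(x), " "))
})

## Count the words per document, then transpose so that terms are rows
t(mtabulate(out))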

##            Doc1 Doc2 Doc3 Doc4 Doc5
## all           0    0    0    0    1
## did           0    1    0    1    0
## exam          0    2    1    1    1
## passed        0    0    0    0    1
## prepare       0    1    0    1    0
## so            0    0    0    0    1
## successful    0    0    0    0    1
## test          3    1    1    2    0
## the           2    2    2    2    1
## to            1    1    0    1    0
## was           1    0    1    1    0
## we            0    1    0    1    2
## were          0    0    0    0    1

【Comments】:

  • Use rowSums for the DF column.
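
The rowSums suggestion can be sketched as follows. This is a minimal, self-contained example rather than the answerer's code: the names `docs`, `tokens`, `tdm`, and `result` are illustrative, and the counts are built with base R's `table()` so the block runs on its own. One assumption is worth stating loudly: in the question's target table the DF for "exam" is 4 although the term occurs 5 times in total, so DF appears to be the number of documents containing the term. The sketch therefore uses `rowSums(tdm > 0)`, whereas a plain `rowSums(tdm)` would give total occurrences.

```r
## Illustrative data, copied from the question
docs <- c(Doc1 = "the test was to test the test",
          Doc2 = "we did prepare the exam to test the exam",
          Doc3 = "was the test the exam",
          Doc4 = "the exam we did prepare was to test the test",
          Doc5 = "we were successful so we all passed the exam")

## Split each document into lower-case tokens on single spaces
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

## Cross-tabulate term by document (rows = terms, columns = documents)
tdm <- unclass(table(Term = unlist(tokens),
                     Doc  = rep(names(tokens), lengths(tokens))))

## Assumption: DF = number of documents in which the term occurs at least once
DF <- rowSums(tdm > 0)

## Assemble the layout asked for in the question: Term, Doc1..Doc5, DF
result <- data.frame(Term = rownames(tdm), tdm, DF = DF,
                     check.names = FALSE, stringsAsFactors = FALSE)
rownames(result) <- NULL
result
```

If total term frequency is wanted instead of document frequency, replacing `rowSums(tdm > 0)` with `rowSums(tdm)` should be the only change needed.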