【Question title】: r: character matching with dictionary word position
【Posted】: 2016-08-08 04:20:46
【Question】:

I have two data frames:

word_table <- word_9 word_1 word_3 ...word_random word_2 na na ...word_random word_5 word_3 na ...word_random

dictionary_words <- word_2 word_3 word_4 word_6 word_7 word_8 word_9 . . . word_n

What I am looking for is to match word_table against dictionary_words and replace each word with its position in the dictionary, like this:

result <- 7 na 2 ... 1 na na ... na 2 na ...

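I tried the pmatch, charmatch, and match functions. They return result correctly when dictionary_words is short, but when it is relatively long, e.g. more than 20000 words, result is filled in only for the first column and the remaining columns turn into na, like this: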

result <- 7 na na ... 1 na na ... na na na ...

Is there any other way to do the character matching, for example with one of the apply functions?

Sample

## Name the columns with `=`; `<-` inside data.frame() creates stray global variables and mangled column names
word_table <- data.frame(word_1 = c("conflict", "", "resolved", "", "", ""), word_2 = c("", "one", "tricky", "one", "", "one"),
                 word_3 = c("thanks", "", "", "comments", "par", ""), word_4 = c("thanks", "", "", "comments", "par", ""), word_5 = c("", "one", "tricky", "one", "", "one"), stringsAsFactors = FALSE)
## Targeted Words
dictionary_words <- data.frame(cbind(c("abovementioned","abundant","conflict", "thanks", "tricky", "one", "two", "three","four", "resolved")))

## convert into matrix (if needed)
word_table <- as.matrix(word_table)
dictionary_words <- as.matrix(dictionary_words)

## pmatch for each of the element in the dataframe (dt)
# matched_table <- pmatch(dt, TargetWord)
# dim(matched_table) <- dim(dt)
# print(matched_table) 

result <- `dim<-`(pmatch(word_table, dictionary_words, duplicates.ok=TRUE), dim(word_table))
print(result) # works fine, but when dictionary_words is large it returns results only for the first column of word_table

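【Comments】:

  • Welcome! It is best to post your question together with a reproducible example.
  • Can you show your code? Have you tried "dim<-"(match(as.matrix(word_table), dictionary_words[,1]), dim(word_table))?
  • Thanks vincent. It is actually hard for me to show a reproducible example because, as I mentioned, it works fine when I use relatively small data frames, but with large data frames it returns results only for the first column. Please see my edited sample.
  • You don't need data.frame(cbind(...)); data.frame(V1 = c(...)) is enough. It is also better to use stringsAsFactors = FALSE to avoid converting the columns to factor.
  • Could you post the str of the original dataset?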
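The two suggestions in the comments above (exact matching with match() plus `dim<-`, and building the dictionary with data.frame(V1 = ...)) can be combined into a minimal sketch. It uses only the first two columns of the question's sample; the name `positions` is illustrative and not from the post.

```r
# Two columns of the question's sample data, kept as character
word_table <- data.frame(
  word_1 = c("conflict", "", "resolved", "", "", ""),
  word_2 = c("", "one", "tricky", "one", "", "one"),
  stringsAsFactors = FALSE
)
dictionary_words <- data.frame(
  V1 = c("abovementioned", "abundant", "conflict", "thanks", "tricky",
         "one", "two", "three", "four", "resolved"),
  stringsAsFactors = FALSE
)

# Exact matching on the flattened matrix, then restore the original shape
positions <- `dim<-`(
  match(as.matrix(word_table), dictionary_words[, 1]),
  dim(word_table)
)
print(positions)
```

Empty strings have no dictionary entry, so they come back as NA, which is the behaviour the question asks for.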

Tags: r string character map-matching


【Solution 1】:

Here is a reproducible example:

 word_table <- structure(list(
     V1 = structure(c(3L, 1L, 2L), .Label = c("word_2", "word_5", "word_9"), class = "factor"),
     V2 = structure(c(1L, NA, 2L), .Label = c("word_1", "word_3"), class = "factor"),
     V3 = structure(c(1L, NA, NA), .Label = "word_3", class = "factor"),
     V4 = structure(c(1L, 1L, 1L), .Label = "...word_random", class = "factor")),
     .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -3L))

 dictionary_words <- structure(list(
     V1 = structure(1:7, .Label = c("word_2", "word_3", "word_4", "word_6", "word_7", "word_8", "word_9"), class = "factor")),
     .Names = "V1", class = "data.frame", row.names = c(NA, -7L))

You can use sapply:

> sapply(word_table, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA

Or apply, if you prefer:

> apply(word_table, 2, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA
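As an alternative to the column-wise calls above, match() can be applied once to the whole table and the result reshaped. This is a minimal sketch, assuming the columns are plain character vectors rather than factors; `word_matrix` and `result` are illustrative names.

```r
# The answer's example rebuilt with character columns (an assumption of this sketch)
word_table <- data.frame(
  V1 = c("word_9", "word_2", "word_5"),
  V2 = c("word_1", NA, "word_3"),
  V3 = c("word_3", NA, NA),
  V4 = rep("...word_random", 3),
  stringsAsFactors = FALSE
)
dictionary_words <- data.frame(
  V1 = c("word_2", "word_3", "word_4", "word_6", "word_7", "word_8", "word_9"),
  stringsAsFactors = FALSE
)

# match() drops dimensions, so rebuild the matrix and carry the column names over
word_matrix <- as.matrix(word_table)
result <- matrix(
  match(word_matrix, dictionary_words[, 1]),
  nrow = nrow(word_matrix),
  dimnames = dimnames(word_matrix)
)
print(result)
```

A single call avoids looping over columns, which may matter little for a 79x50 table but keeps the code short.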

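【Discussion】:

  • Thanks again vincent, it works perfectly on the sample data frame I mentioned above. But when word_table is 79x50 and dictionary_words is 20000x1 the same thing happens: results appear only for the first column and the rest all become NA.
  • Could you paste the output of dput(word_table) and dput(dictionary_words) somewhere, for example in a gist?
  • Hello Vincent, sorry... I am new here, so a bit slow. I hope you can find the files on GitHub: gist.github.com/bipul-mohanto/9b6a960955419f8cb689cf2c32edcff1
  • It works fine, but many of the words in word_table do not exist in dictionary_words. Try this: unique(word_table[, 1]) %in% dictionary_words
  • Also, most of the words contain spaces; word_table <- gsub(" *", "", word_table) should help.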
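
The last two comments can be sketched as a small diagnostic: list the words that have no exact dictionary entry, strip the spaces, and match again. The data below is a hypothetical miniature, not the asker's gist, and the pattern " +" is a slight variation on the comment's " *" that only targets actual runs of spaces.

```r
# Hypothetical miniature of the problem: some entries carry stray spaces
word_table <- matrix(
  c("conflict ", " one", "tricky", "unknown"),
  nrow = 2
)
dictionary_words <- c("conflict", "one", "tricky")

# Which distinct words have no exact dictionary entry?
unmatched <- setdiff(unique(as.vector(word_table)), dictionary_words)
print(unmatched)

# Remove the spaces, then match again and restore the table's shape
cleaned <- gsub(" +", "", word_table)
result <- `dim<-`(match(cleaned, dictionary_words), dim(word_table))
print(result)
```

Removing every space would also glue multi-word entries together; if only leading and trailing whitespace is the issue, trimws() is the narrower tool.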